Balancing a malign skin mole dataset with image augmentation

Philippe Fimmers
7 min read · Mar 2, 2021

For a group project during our AI Bootcamp at BeCode, we were asked to build a service where people can upload a picture of a skin mole they worry about and learn whether they should have it checked by a doctor. This article tries to answer a question that came out of that project (we couldn't let it go): would our model's performance have been better if we had known how to handle imbalanced classes?

The key input received from our coaches is described below:

An Excel file holds the diagnosis of every image. File names are held in the 'id' column, and the images sit in three otherwise undefined subfolders, d/e/f. The important thing is that images with a 'kat.Diagnose' value of 1 should carry the 'go check with a doctor' label; the others look 'harmless'.
Let’s see our class distribution!
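As a minimal sketch of how that table could be loaded and the distribution counted, assuming a hypothetical Excel file name, .jpg extensions and an 'id' column without extensions (the column names and the d/e/f subfolders are the ones described above):

```python
import pandas as pd
from pathlib import Path

# Hypothetical file name; one row per image with 'id' and 'kat.Diagnose' columns.
df = pd.read_excel("diagnoses.xlsx")

# Find each image in one of the three undefined subfolders d/e/f.
def find_path(image_id, root=Path("images")):
    for sub in ("d", "e", "f"):
        candidate = root / sub / f"{image_id}.jpg"  # assuming .jpg and no extension in 'id'
        if candidate.exists():
            return str(candidate)
    return None

df["path"] = df["id"].apply(find_path)

# kat.Diagnose == 1 means "go check with a doctor"; everything else is harmless.
df["label"] = df["kat.Diagnose"].eq(1).map({True: "doctor", False: "no_doctor"})

print(df["label"].value_counts())
print(df["label"].value_counts(normalize=True))
```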

There you have it. Just 629 images, or 22%, are worth checking by a doctor. To help our model understand the difference between the two classes, we clearly need to do something: it needs an equal opportunity to learn from both.
I felt the best approach would be to augment the doctor category, generating as many variants as there are no_doctor images and storing them back to disk, rather than just oversampling while training, because repeating identical images adds no variation and leads to overfitting.
Alternatively, and what seems to be more standard, is to fit the imbalanced data using class weights. Interestingly, somebody on Slack referred to this as a mathematical hack, so it is well worth checking how it performs.
Of course, we also need to weigh both against the baseline of not compensating for the imbalance at all.

I boost the doctor class with brightness adjustments and two-way image flipping. I avoid shear or stretching because I'm afraid they might affect symmetry, which is a telling characteristic for diagnosis according to the paper we used as a basis.
Several images have a tiny black edge, as if they were scanned, so the automatic border replication from shifts or rotations creates patches of black. Rather than keeping those patches, I figured it was better to suppress them with a bit of zoom.

Doctor augmentation: flips and brightness changes
Training aug: subtle rotations, shifts, brightness changes
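A sketch of what these two generators could look like with Keras' ImageDataGenerator; the ranges below are illustrative placeholders, not the values tuned for the project:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Multiplies the doctor class on disk: flips in both directions plus brightness changes.
doctor_aug = ImageDataGenerator(
    horizontal_flip=True,
    vertical_flip=True,
    brightness_range=(0.8, 1.2),
)

# Applied on the fly while training: subtle rotations, shifts and brightness changes.
# The slight zoom-in (factors < 1) crops away the black patches that border
# replication would otherwise create from the scanned edges.
train_aug = ImageDataGenerator(
    rotation_range=10,
    width_shift_range=0.05,
    height_shift_range=0.05,
    brightness_range=(0.9, 1.1),
    zoom_range=[0.85, 0.95],
    fill_mode="nearest",
)
```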

On a side note, I skip image pre-processing here because we couldn't quite finish it during our group project. Some stellar research from a teammate underpinned that effort, however. Isolating just the moles themselves can help a model focus its learning; the paper saw a 4% jump in accuracy from it.

We first set aside a part of the dataset for evaluating and comparing the different methods at the end.
Stratification makes sure the class (im)balance is the same in both parts, and the DataFrame should also be shuffled before the split. Why? Because the records are ordered by filename (d/e/f), which could unintentionally group image traits together.
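A sketch of that split with scikit-learn, reusing the dataframe from the loading sketch above; the 20% hold-out fraction is an illustrative choice:

```python
from sklearn.model_selection import train_test_split

# Shuffle to break the d/e/f filename ordering, stratify so both parts
# keep the same ~22% doctor share.
train_df, test_df = train_test_split(
    df,
    test_size=0.2,
    stratify=df["label"],
    shuffle=True,
    random_state=42,
)
```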

Using flow_from_dataframe() seems the logical choice for writing out the augmented doctor images, but since I only had experience with image_dataset_from_directory(), I was preparing to use that first. Luckily my curiosity took over and I discovered new powers!
Perhaps you already noticed that you can't change the composition of the train/validation split with flow_from_directory(), making it a no-go for cross-validation. You'd think you could at least redraw the split by setting the shuffle parameter, but in fact no, the split will remain in exactly the same position! What does the shuffling actually do then? It randomizes the order in which files are read FROM those splits. If I recall correctly, that makes for better learning, since the model then sees differently composed batches every time.
Using flow_from_dataframe() you CAN change the train/validation composition if you pre-shuffle the dataframe. But the real power lies in not using the built-in validation split at all and slicing the dataframe yourself with KFold cross-validation, as sketched below.
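A minimal sketch of that idea, reusing the training dataframe and the train_aug generator from the sketches above; the target size and batch size are illustrative:

```python
from sklearn.model_selection import KFold
from tensorflow.keras.preprocessing.image import ImageDataGenerator

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(kfold.split(train_df)):
    fold_train = train_df.iloc[train_idx]
    fold_val = train_df.iloc[val_idx]

    # Augmented training batches, plain (un-augmented) validation batches.
    train_gen = train_aug.flow_from_dataframe(
        fold_train, x_col="path", y_col="label",
        target_size=(192, 128), class_mode="binary", batch_size=32,
    )
    val_gen = ImageDataGenerator().flow_from_dataframe(
        fold_val, x_col="path", y_col="label",
        target_size=(192, 128), class_mode="binary",
        batch_size=32, shuffle=False,
    )
    # ...build and fit a fresh model per fold (see the model sketches further down)
```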

But first things first. Let's write out the augmented doctor images and insert their paths back into the training dataframe, replacing the original doctor rows.
A copy of the imbalanced dataframe is kept before overwriting, to be fed to the two comparison methods.
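A sketch of that write-out step, reusing doctor_aug from above; the output folder name is hypothetical, and the images are written at the (illustrative) training resolution:

```python
import math
import os
import pandas as pd

doctor_df = train_df[train_df["label"] == "doctor"]
no_doctor_df = train_df[train_df["label"] == "no_doctor"]

out_dir = "augmented_doctor"
os.makedirs(out_dir, exist_ok=True)

# Keep a copy of the untouched, imbalanced dataframe for the two comparison methods.
imbalanced_train_df = train_df.copy()

# Stream augmented doctor variants straight to disk.
batch_size = 32
writer = doctor_aug.flow_from_dataframe(
    doctor_df, x_col="path", class_mode=None,
    target_size=(192, 128), batch_size=batch_size,
    save_to_dir=out_dir, save_prefix="aug", save_format="jpg",
)

# Draw batches until there are roughly as many doctor variants as no_doctor originals.
for _ in range(math.ceil(len(no_doctor_df) / batch_size)):
    next(writer)

# Replace the original doctor rows with the freshly written variants.
aug_df = pd.DataFrame({
    "path": [os.path.join(out_dir, f) for f in os.listdir(out_dir)],
    "label": "doctor",
})
balanced_train_df = pd.concat(
    [no_doctor_df[["path", "label"]], aug_df], ignore_index=True
)
```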

Balance after doctor class augmentation

I could use the test set created earlier for validation during each method's training, but I decide to practice my newly found KFold cross-validation power.
I will take the hit of doing five times more computation, and I figure I should select the model with median performance for inclusion in the comparison. That one should be least influenced by the coincidences at play in the train/validation composition.

When building the model we followed the approach from the paper mentioned above and selected VGG16 for transfer learning. We couldn't crop to squares, however, because not all moles were centered. This meant we couldn't transfer the top fully connected layers (which were trained on square feature maps), so we had to randomly initialize our own.
For our training we didn't pre-calculate the VGG16 outputs and then feed those to our model; the training augmentation required us to keep the VGG16 feature extraction in the loop.
For this article I tried a change compared to the model in our group project: dropping the last convolutional block. Even though the paper kept it, I was somehow expecting its complex features to be unhelpful or barely activated for skin moles. Apparently not! Those features may be complex, but they seem to apply very generally indeed. Without them, the model wasn't able to surpass chance, even after trying several classification node sizes and regularization factors. Lesson learned: AI bootcampers shouldn't ride too much on their own intuition.
Because we fix the ImageNet weights, we should be aware of the image preprocessing used in that training. It turns out there is a function for that. Interestingly, they didn't downscale or z-score the channels; they only shifted each channel's mean, as observed on ImageNet, to 0.
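A sketch of the frozen VGG16 base with that preprocessing function kept in the loop; the input shape is an illustrative choice consistent with the downsizing discussed next:

```python
import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

IMG_SHAPE = (192, 128, 3)  # illustrative, see the note on downsizing below

# Convolutional base with ImageNet weights; the fully connected top is left out
# because our feature maps are not 7x7.
base = VGG16(include_top=False, weights="imagenet", input_shape=IMG_SHAPE)
base.trainable = False  # fix the ImageNet weights

# preprocess_input for VGG16 only zero-centers each channel with the ImageNet
# means (and flips RGB to BGR); it does not rescale or z-score.
inputs = tf.keras.Input(shape=IMG_SHAPE)
x = preprocess_input(inputs)
x = base(x, training=False)
# ...the classification head follows in the next sketch
```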

I observed that without downsizing the input images the validation accuracy didn't improve during training, suggesting strong overfitting. A possible explanation is that individual images were too discernible/memorable and that the model found it easier to learn them by heart than to understand general traits. In other words, the dimensionality of the dataset was perhaps too big relative to its size. I downsize to an input resolution that results in 6-by-4-pixel feature maps at the end, below the 7x7 of the original VGG16 configuration.
Notwithstanding the augmentations, overfitting was still there. Interestingly, we came across hints that augmentation is not a catch-all solution for it: you are adding data that is still mathematically related, and varying the same doctor image more than 3x may have been past the optimal point. An additional way to boost the variance seen by the model is Dropout, so we used that. Finally, we nudged the weights smaller, towards 0, to make the network less capable of memorizing the whole dataset (they call that reducing the entropic capacity!).
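Continuing the sketch above with the head: the layer size, dropout rate and L2 factor are placeholders, not the project's values. Note that VGG16 downsamples by a factor of 32, so a 192x128 input indeed ends in 6x4 feature maps:

```python
from tensorflow.keras import layers, regularizers

# Small, randomly initialized head with Dropout and L2 weight decay
# to limit the entropic capacity.
x = layers.Flatten()(x)
x = layers.Dense(
    64, activation="relu",
    kernel_regularizer=regularizers.l2(1e-3),
)(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.AUC(name="auc")],
)
```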

I collect our prime candidate model from an early stop on the 1st split, which produced the median max AUC score of 0.89.
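A sketch of that per-fold selection with Keras' EarlyStopping, assuming the generators and model from the sketches above; patience and epoch count are illustrative:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop when the validation AUC stops improving and roll back to the best
# weights; the fold whose best val_auc is the median of the five is kept.
early_stop = EarlyStopping(
    monitor="val_auc", mode="max", patience=5, restore_best_weights=True
)
history = model.fit(
    train_gen, validation_data=val_gen, epochs=50, callbacks=[early_stop]
)
best_auc = max(history.history["val_auc"])
```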

Next up is training on the imbalanced dataset without any balance compensation. From a TensorFlow tutorial I learned a way to get the model to converge faster: shifting the initial predictions closer to the expected balance.
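A sketch of that trick, following the tutorial's formula b0 = log(pos/neg) and reusing the imbalanced dataframe kept earlier; only the output layer changes:

```python
import numpy as np
import tensorflow as tf

# Counts from the untouched, imbalanced training dataframe.
counts = imbalanced_train_df["label"].value_counts()
pos, neg = counts["doctor"], counts["no_doctor"]

# Initialize the output bias so the first predictions already reflect
# the ~22/78 balance instead of 50/50.
initial_bias = np.log(pos / neg)
output_layer = tf.keras.layers.Dense(
    1, activation="sigmoid",
    bias_initializer=tf.keras.initializers.Constant(initial_bias),
)
```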
Here too, I save the weights of the median-performing model, which comes from the first split.

Last up is the method that applies class weights to the imbalanced data. I found how to calculate them in the same tutorial linked above. The takeaway for me was that the weights don't need to sum to 1, as long as you are aware that the loss value is then not comparable.
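A sketch using the tutorial's formula, reusing the counts from the previous sketch; the class_indices lookup guards against the generator's alphabetical label ordering:

```python
# Each class contributes about half of the total loss, regardless of its size.
total = pos + neg
idx = train_gen.class_indices  # e.g. {'doctor': 0, 'no_doctor': 1}
class_weight = {
    idx["no_doctor"]: (1 / neg) * (total / 2.0),
    idx["doctor"]: (1 / pos) * (total / 2.0),
}

model.fit(
    train_gen, validation_data=val_gen,
    epochs=50, callbacks=[early_stop],
    class_weight=class_weight,
)
```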
Here too, the first-split model is selected for comparison.

Comparison of results on the test set

We can see that balancing through augmentation was counterproductive for me, turning in lower discriminative performance than doing no correction at all. Should I have used stronger augmentations? Should I have increased the doctor class only 2-fold instead of 3.5x? Should the comparison be drawn again after fine-tuning VGG16? … You tell me!
We can at least see that the standard approach of applying class weights did hold up, albeit with only the smallest of step-ups.

Thanks to my BeCode mates for the encouragement in finishing this article.

Refs:
https://web.stanford.edu/~kalouche/docs/Vision_Based_Classification_of_Skin_Cancer_using_Deep_Learning_(Kalouche).pdf
https://www.tensorflow.org/api_docs/python/tf/keras/applications/vgg16/preprocess_input
https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#optional_set_the_correct_initial_bias
