Despite impressive advances in machine learning, directed human learning continues to be more effective, especially for generalization and real-world active learning settings. One major difference between the two approaches is the sequencing of the subject matter. To teach a machine about the natural world, we typically throw in as many examples as we can obtain, in randomized order and often with multiple labels (e.g. a Dalmatian is also a dog, canine, mammal, animal). We preprocess and normalize our training images to compensate for the myriad differences in camera settings and photo quality (aspect ratio, focus, lighting, pixel density, perspective, range, color balance, contrast…), let alone scene composition (is the object centered? fully visible or occluded? what other objects/patterns are in the background?). Then we let the stacked matrices work their magic, hoping that the sheer data quantity leads us to learn meaningful high-level features rather than noise (an assumption challenged by recent findings that neural nets mostly pick up bags of textures).
To teach a child about the natural world, we first point out simple, broad concepts (dogs, trees, birds) and gradually get more subtle and specific: a tree with needles for leaves is likely to be a pine, and a white dog with black spots a Dalmatian. We don’t worry much about the context of the visual examples: the Dalmatian might be in a picture book, a movie, or playing in the park next to us. Of course, a child’s learning is often unstructured and distributed over many years of experience; as adults trying to learn a new language or ontology, we may well resort to randomized flash cards and memorization. Perhaps a more intuitive example of the importance of ordering the material for humans is learning basic arithmetic before algebra before calculus, although the machine equivalent of even the first lesson would be a lot more involved [1]. Could the human strategy of learning by progressively increasing specificity or difficulty, known as curriculum learning [2], benefit a neural network?
To explore this intuition, I adapted this Keras tutorial on fine-tuning with small datasets to iNaturalist 2017, which contains >675K photos of >5K different species of life (see my previous blog post for details). The 2017 version crucially includes a higher-level category or taxon—the parent kingdom (plants, animals), phylum (just mollusks), or class (mammals, reptiles)—for each species. For example, the parent class of the species Ursus arctos, a type of bear, is mammals.
There are 13 taxa, from Aves/birds, with 964 constituent species and >235K total images, to Protozoa, with only 4 species and 381 images. Given this long-tailed distribution and our priors from evolutionary biology, predicting the correct taxon (insect, mammal, or mushroom?) is a much simpler and more general problem than predicting the exact species. Just try telling apart the six different species of Castilleja below.
One curriculum for species identification would teach the taxa first and then the constituent species. If the order of the material matters (specifically, if machines, like humans, learn more effectively in settings of gradually increasing complexity), then we would expect a model trained on taxa first and then on species to outperform a model trained directly on species (other factors like network architecture, number and distribution of examples, number of epochs, etc. being equal). In this post, I will describe a few ways to concretize this hypothesis and explore the results using Weights & Biases (wandb). The Keras code is here and can be run with python adv_finetune.py. Note that the full iNaturalist dataset is 186GB, so a convenient subset can be downloaded here [3].
To standardize the curriculum, I reduced the dataset to taxa of the same depth to limit the variation of the species within. For example, “plants” range from redwoods to roses to rafflesia: they are a much bigger, more general, and more visually diverse category of living creature than “birds”, which almost always have feathers, wings, a beak, etc. The taxa represented in this subset are thus strict biological classes, and not a mix of kingdoms/phyla/classes as in the original dataset. To further control for biodiversity, I balanced the dataset so that each class has the same number of constituent species with the same total number of images. This left me with 720 images for each of 25 species, five species in each of the 5 classes of amphibians, birds, insects, mammals, and reptiles. This is of course tiny in comparison to the original set of over 675K, but it’s the cleanest way to start this experiment. The problem we’re solving with this much cleaner data under the curriculum design paradigm is now much easier than the original task, and the results may not generalize well to the full dataset. I will likely relax the constraints of balanced image count and the inductively-biased biodiversity in the future and face the bitter, long-tailed truth of the universe as the original authors do [4]. For now, this toy data is split into 620 train/100 val for each species (which becomes 3,100/500 for each class, or 15,500/2,500 total), and the train/val assignment stays fixed across both training paradigms.
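As a rough illustration of the fixed 620/100 split described above, here is a minimal sketch in plain Python. The function names, the seed, and the toy file names are all hypothetical and are not taken from adv_finetune.py or data_symlink.py; the only numbers borrowed from the post are the per-species counts.

```python
import random

TRAIN_PER_SPECIES = 620
VAL_PER_SPECIES = 100

def split_species(image_paths, seed=0):
    """Deterministically split one species' images into (train, val) lists."""
    needed = TRAIN_PER_SPECIES + VAL_PER_SPECIES
    if len(image_paths) < needed:
        raise ValueError(f"need at least {needed} images, got {len(image_paths)}")
    # Sort first so the result does not depend on directory listing order.
    shuffled = sorted(image_paths)
    random.Random(seed).shuffle(shuffled)
    chosen = shuffled[:needed]
    return chosen[:TRAIN_PER_SPECIES], chosen[TRAIN_PER_SPECIES:]

def split_dataset(images_by_species, seed=0):
    """Apply the same fixed split to every species: {species: (train, val)}."""
    return {name: split_species(paths, seed)
            for name, paths in images_by_species.items()}

# Toy data: 5 classes x 5 species x 720 made-up file names.
classes = ["amphibians", "birds", "insects", "mammals", "reptiles"]
toy = {
    f"{cls}/species_{i}": [f"{cls}/species_{i}/img_{j:04d}.jpg" for j in range(720)]
    for cls in classes
    for i in range(5)
}

splits = split_dataset(toy)
n_train = sum(len(train) for train, _ in splits.values())
n_val = sum(len(val) for _, val in splits.values())
print(n_train, n_val)  # 15500 2500
```

Because the shuffle is seeded and the input is sorted, rerunning the split gives the same train/val assignment, which is what lets the baseline and curriculum runs be compared on identical data.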
I used the best-performing 7-layer convnet from the previous post to train a class baseline to predict one of the five class labels and a species baseline to predict one of the 25 species labels for each image. I then set a switch epoch such that once a model training on classes reaches this epoch, I splice off the class prediction layer (size 5) and replace it with a new one for predicting species (size 25). The class baseline in blue is substantially higher since that initial problem is much easier—guessing one in five classes (expected random accuracy is 20%) as opposed to one in 25 species (4%).
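The switch-epoch mechanism can be sketched without any deep learning framework. The snippet below is only an illustration under stated assumptions: the model is a plain dictionary, the base layers are a stand-in, epochs are counted from zero, and names such as swap_head and run_curriculum are hypothetical rather than taken from the actual Keras code. In Keras, the analogous step would be attaching a new 25-way output layer to the penultimate layer's output.

```python
import random

NUM_CLASSES = 5    # amphibians, birds, insects, mammals, reptiles
NUM_SPECIES = 25   # five species per class

def new_head(feature_dim, num_outputs, rng):
    """Freshly initialized weight matrix of shape feature_dim x num_outputs."""
    return [[rng.gauss(0.0, 0.01) for _ in range(num_outputs)]
            for _ in range(feature_dim)]

def swap_head(model, num_outputs, rng):
    """Keep the trained base layers, discard the old head, attach a new one."""
    feature_dim = len(model["head"])
    return {"base": model["base"],
            "head": new_head(feature_dim, num_outputs, rng)}

def run_curriculum(total_epochs, switch_epoch, feature_dim=64, seed=0):
    rng = random.Random(seed)
    model = {
        "base": {"conv_weights": "stand-in for the convnet layers"},
        "head": new_head(feature_dim, NUM_CLASSES, rng),
    }
    log = []
    for epoch in range(total_epochs):
        if epoch == switch_epoch:
            # Splice off the size-5 class head and replace it with a size-25 one.
            model = swap_head(model, NUM_SPECIES, rng)
        targets = "class" if epoch < switch_epoch else "species"
        # A real run would train for one epoch on `targets` here.
        log.append((epoch, targets, len(model["head"][0])))
    return model, log

model, log = run_curriculum(total_epochs=8, switch_epoch=5)
for entry in log:
    print(entry)
```

Setting switch_epoch to 0 in this sketch swaps the head before any class training happens, which corresponds to the species baseline.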
There is a sliver of evidence in favor of curriculum learning. In particular, switching from class to species at epoch 5 yields a slightly higher accuracy than the species baseline, and switches at epochs 3 and 10 also surpass the species baseline in later epochs. The validation accuracy curves are noisier but also show the 3, 5, and 10 switches tying with and sometimes surpassing the species baseline. However, this may be noise and not a significant effect: the network quickly adapts to the species prediction setting. Can we magnify this slight positive influence of class pre-training in certain settings or by increasing retention of the first stage of material?
One way to encourage a relative difference in the retention of the material is to vary the optimizer and learning rate (LR). Here are a few experiments with sgd and adam relative to the species-only baseline in purple. Although no other method surpasses the baseline, switching at epoch 5 using sgd with LR=0.05 for classes and LR=0.005 for species learns surprisingly quickly and converges to the baseline as if it never missed the first five epochs. Further exploration of these hyperparameters (different combinations of optimizers, learning rates, and switch epochs) seems promising.
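One way to write down a stage-dependent optimizer setting like the sgd example above is a small lookup function. This is a hypothetical sketch, not the configuration format used in the actual experiments: the function name and dictionary keys are made up, epochs are assumed to be counted from zero, and only the values (sgd, 0.05, 0.005, switch at epoch 5) come from the text.

```python
def optimizer_config(epoch, switch_epoch=5, class_lr=0.05, species_lr=0.005,
                     name="sgd"):
    """Return the optimizer settings to use at a given (zero-based) epoch."""
    stage = "class" if epoch < switch_epoch else "species"
    learning_rate = class_lr if stage == "class" else species_lr
    return {"stage": stage, "optimizer": name, "learning_rate": learning_rate}

for epoch in (0, 4, 5, 9):
    print(epoch, optimizer_config(epoch))
```

Sweeping over switch_epoch, name, and the two learning rates would then enumerate the combinations of hyperparameters mentioned above.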
Overall, the number of examples per species and the learning capacity of this model may be too small to demonstrate a significant effect from pre-training: the model doesn’t retain the foundational material. Fine-tuning larger, more expressive models pretrained on ImageNet (Inception/ResNet series) may be more representative. In initial tests predicting species from 10% of the total data, Inception V3 is the most promising, as it outperforms Xception and ResNet-50 with half the parameters and memory of the comparably accurate Inception ResNet V2.
The switch from 5 to 25 output labels is disruptive and necessarily undoes some of the initial learning, so merely adjusting the learning rate or freezing different numbers of base layers may not be enough to mitigate this damage. A layer architecture that branches the single parent class prediction to a finer-grained prediction over one of the five constituent species may help. Finally, the effect of curriculum learning may be clearer in the speed of convergence than in the accuracy. In subsequent posts, I will explore these hypotheses and other ways to apply curriculum learning, track per-class accuracy (why are reptiles harder to identify than other taxa?), and handle more realistic (unbalanced) data distributions. You can check out the evergreen report here.
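To make the branching idea above a little more concrete, here is one possible reading of it, sketched in plain Python: keep the 5-way class prediction and multiply it by a 5-way species prediction within each class, so that P(species) = P(class) × P(species | class). This factorization and the logits below are assumptions for illustration only; this is not an architecture that has been tested in this post.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def species_probabilities(class_logits, species_logits_by_class):
    """Combine a class head with one species head per class.

    Returns a flat list of probabilities, ordered class by class.
    """
    class_probs = softmax(class_logits)
    probs = []
    for p_class, logits in zip(class_probs, species_logits_by_class):
        probs.extend(p_class * p for p in softmax(logits))
    return probs

# Made-up logits: class 1 is favored, and within it, species 2.
class_logits = [0.1, 2.0, 0.3, 0.0, -1.0]
species_logits = [[0.0] * 5 for _ in range(5)]
species_logits[1] = [0.0, 0.0, 3.0, 0.0, 0.0]

probs = species_probabilities(class_logits, species_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(len(probs), divmod(best, 5))  # 25 (1, 2)
```

Under this reading, the class head is never thrown away, so the first stage of the curriculum keeps contributing directly to the final species prediction.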
[1] Zaremba, Wojciech and Sutskever, Ilya. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.
[2] Bengio, Yoshua, Louradour, Jerome, Collobert, Ronan, and Weston, Jason. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. ACM, 2009.
[3] In practice, I generate directory trees of symlinks to the full dataset as needed, where the root directory contains a train and a validation folder; in each of these, all examples of class A are placed in a folder entitled A. You can adapt data_symlink.py to generate your own symlink trees.
[4] Van Horn, G., Mac Aodha, O., Song, Y., Cui, Y., Sun, C., Shepard, A., Adam, H., Perona, P., & Belongie, S. The iNaturalist species classification and detection dataset. CVPR, 2018.