Consider these two sentences, which differ by only one word:
Hopefully it's clear that in sentence A, the word "them" must refer to the "humans", while in sentence B, "them" refers to "deep nets". This task is straightforward for humans but challenging for machines. Disambiguating the pronoun "them" requires knowing more about the real world (here, the relationship between humans and deep nets) than the sentence itself states. Until machine learning models can parse these types of sentences as well as humans do, we won't really be able to communicate with them in natural language, no matter what new website, app, or device we are using.
Winograd schemas are named after Terry Winograd, a Stanford professor who did foundational work on natural language understanding and introduced the original example of this kind in 1972. The Winograd Schema Challenge built on them remains one of the main benchmarks in this space. Specifically, Winograd Schema sentences like the example above are part of the General Language Understanding Evaluation (GLUE) benchmark for text-based deep learning models. In GLUE, the task is called Winograd Natural Language Inference, or WNLI. The baseline human performance on this task is 95.9% accuracy, but the best trained models are still at 94.5%, so there is room for improvement. This is the second hardest of the nine subtasks in GLUE; in six other tasks, deep learning models have already surpassed the human baseline.
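To make the inference framing concrete, here is a minimal sketch of how a single Winograd schema can be recast as sentence pairs, which is roughly what WNLI does: the ambiguous pronoun is replaced by each candidate referent, and the model has to decide whether the resulting sentence follows from the original. The sentence used is the well-known trophy/suitcase schema rather than the pair from this report, and `WnliPair`, `schema_to_pairs`, and the 1/0 label convention are illustrative assumptions, not the actual GLUE data format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WnliPair:
    """One premise/hypothesis pair (hypothetical container for illustration)."""
    premise: str      # the original sentence containing the ambiguous pronoun
    hypothesis: str   # the clause with the pronoun replaced by a candidate
    label: int        # assumed convention: 1 = entailed, 0 = not entailed


def schema_to_pairs(premise: str, template: str,
                    candidates: List[str], answer: str) -> List[WnliPair]:
    """Build one pair per candidate referent.

    `template` is the pronoun clause with `{}` where the pronoun was.
    Only the pair built from the correct referent is labeled as entailed.
    """
    return [
        WnliPair(premise, template.format(candidate), int(candidate == answer))
        for candidate in candidates
    ]


pairs = schema_to_pairs(
    premise="The trophy doesn't fit in the suitcase because it is too big.",
    template="{} is too big.",
    candidates=["The trophy", "The suitcase"],
    answer="The trophy",
)

for pair in pairs:
    print(pair.label, "|", pair.hypothesis)
```

Swapping "big" for "small" in the premise flips the correct referent to "The suitcase", which is exactly the one-word change that makes these schemas hard to solve from surface statistics alone.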
The HuggingFace Transformers repository makes it very easy to work with a variety of advanced natural language models and try them on these benchmarks. In this report, I'll show how to use Transformers with Weights & Biases to start tuning models for a natural language task like correctly disambiguating Winograd schemas.