Over the last 8 weeks, I took part in the Applied Deep Learning course run by Weights & Biases. Part of the curriculum was to build a project in about 3 hours/week. I was working alone (from Seattle!), so I picked one that I could complete end-to-end in about 24 hours of work. Here’s the idea in a nutshell: given a business description, can we generate domain name suggestions?
There are 100,000 new domains registered every day (source), yet the overall experience of finding one is still 1990-esque. I was sure we could do better.
My goal was to build a system where users can input natural language text and get suggestions.
I spent most of my time on acquiring data because my hunch was that fine-tuning pre-trained models on good data would likely give good results.
Generating data was a simple process
The resulting dataset looks as follows:
Cupcakes and Cashmere is the premier destination for fashion, food and lifestyle inspiration = @ = cupcakesandcashmere.com
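To make the format concrete, here is a minimal sketch of how a line like the one above could be built and split. The only thing taken from the dataset is the " = @ = " separator. The helper names (format_example, parse_example, make_prompt) are hypothetical, and the idea of prompting the fine-tuned model with the description followed by the separator is my assumption about how such a format would be used at inference time.

```python
from typing import Tuple

# Separator copied from the example line above; everything else is illustrative.
SEPARATOR = " = @ = "


def format_example(description: str, domain: str) -> str:
    """Join a description and a domain into one training line."""
    return f"{description.strip()}{SEPARATOR}{domain.strip().lower()}"


def parse_example(line: str) -> Tuple[str, str]:
    """Split a training line back into (description, domain)."""
    # rpartition splits on the LAST separator, so a description that happens
    # to contain the separator text does not break parsing.
    description, _, domain = line.rpartition(SEPARATOR)
    if not description or not domain:
        raise ValueError(f"not a valid training line: {line!r}")
    return description, domain


def make_prompt(description: str) -> str:
    """Assumed inference prompt: the description plus the separator,
    leaving the model to complete the domain name."""
    return description.strip() + SEPARATOR
```

Lower-casing the domain is a design choice in this sketch, since domain names are case-insensitive and a single casing keeps the targets consistent.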
I was able to package the model in a Docker container and host it on Google Cloud using this repo. This was still more work than I expected given how popular pretrained models have become.
Here are some results. There’s no cherry-picking of queries or results, apart from a one-time setting of the temperature parameter and the exclusion of adult results (these should really have been excluded from the input data).
The original test query’s results already look a lot more promising than current domain name suggestions. Keep in mind that these are results from the very first end-to-end run of my system. There’s been no dataset or hyperparameter tuning, and the code has zero business logic for filtering junk results. I especially like ones like fullofcocunutcake.com because they generalize beyond the input text, showing the power of GPT-2.
Next, I tried generating the results for my own idea:
Again, quite promising. There are some nonsensical ones that I don’t like, but also a couple of others that I would consider buying if they were available.
Next, I moved on to generate domain names for some of the other class projects. I used the description that they had posted in their original pitch docs. Once again, no cherry-picking of queries or results.
Metrics are critical in such projects because they help measure progress. For offline model development, I usually look for some combination of a quality metric (e.g. precision) and a quantity metric (e.g. recall). There are also online metrics and system performance metrics, which come into play later on.
This problem, however, involves generating unseen output, which makes offline metrics quite difficult. Perplexity is one of the most commonly used benchmarks, along with similarity metrics like BLEU and ROUGE. We need to take these metrics with a grain of salt though, because they fail in both directions. E.g. cupcakes.com and sweettreats.com have no similarity but are both good domain names for the same business, whereas cupcakes.com and cupackes.com have high similarity but the latter is a terrible name.
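Here is a small sketch of that failure mode. Note that it does not compute BLEU or ROUGE: as a stand-in it uses the character-level similarity ratio from Python’s standard difflib module, which is enough to show the problem with any surface-similarity score. The function name name_similarity is made up for this illustration.

```python
from difflib import SequenceMatcher


def name_similarity(domain_a: str, domain_b: str) -> float:
    """Character-level similarity in [0, 1] between two domain names.

    The TLD is stripped first so that a shared ".com" does not inflate
    the score. This is a stand-in for BLEU/ROUGE, not an implementation
    of either.
    """
    name_a = domain_a.rsplit(".", 1)[0]
    name_b = domain_b.rsplit(".", 1)[0]
    return SequenceMatcher(None, name_a, name_b).ratio()


reference = "cupcakes.com"

# A perfectly good alternative name scores low...
print("sweettreats.com:", name_similarity(reference, "sweettreats.com"))

# ...while a typo of the reference scores high.
print("cupackes.com:", name_similarity(reference, "cupackes.com"))
```

A metric like this would reward the typo and punish the creative suggestion, which is the opposite of what we want from a domain name generator.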
Asking humans to judge the quality of results is also a good metric. It has its own flaws though, because it’s expensive, slow and difficult to set up correctly.
For the scope of this project, I left metrics as future work.
As I get more time, I definitely want to productionize this idea. From a quality perspective, there are a bunch of simple improvements around cleaning up and augmenting the dataset that should make a big difference.