A provocative paper, Energy and Policy Considerations for Deep Learning in NLP by Emma Strubell, Ananya Ganesh and Andrew McCallum, has been making the rounds recently. While the paper itself is thoughtful and measured, headlines and tweets have been misleading, with titles like “Deep learning models have massive carbon footprints”. One especially irresponsible article summarized the finding as “an average, off-the-shelf deep learning software can emit over 626,000 pounds of carbon dioxide”, which is an egregious misinterpretation.
As someone who cares a lot about both deep learning and the environment, I was happy to see a thoughtful article on the topic written by machine learning practitioners, but sad to see it badly misrepresented in the media.
I started a company, Weights and Biases, to help machine learning practitioners keep track of their models and experiments. I’ve seen the expense of redundant model training firsthand, and I hope that Weights and Biases can play a part in helping machine learning practitioners use their resources more wisely.
Model training is probably not a significant source of carbon emissions today (but it is increasing exponentially)
The example model in the paper, “Transformer with neural architecture search”, is at the far extreme of computational cost compared with what almost anyone actually does in 2019. A more representative task, training a standard neural net such as Inception on ImageNet to 95% accuracy at recognizing objects in pictures, takes around 40 GPU-hours, which would consume around 10 kWh and produce around 10 lbs of CO2. That is equivalent to running a central air conditioner for around 2-3 hours.
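To make the arithmetic behind that estimate explicit, here is a minimal sketch in Python. The two constants are assumptions I am backing out of the numbers above (about 250 W per GPU and about 1 lb of CO2 per kWh), not measured values, and the function name is only for illustration.

```python
# Assumed constants implied by the figures in the text, not measurements.
GPU_POWER_KW = 0.25      # ~250 W per GPU, so 40 GPU-hours is ~10 kWh
LBS_CO2_PER_KWH = 1.0    # ~1 lb of CO2 per kWh, so 10 kWh is ~10 lbs


def training_footprint(gpu_hours):
    """Return (energy in kWh, CO2 in lbs) for a training run."""
    energy_kwh = gpu_hours * GPU_POWER_KW
    co2_lbs = energy_kwh * LBS_CO2_PER_KWH
    return energy_kwh, co2_lbs


energy_kwh, co2_lbs = training_footprint(40)
print(f"{energy_kwh:.0f} kWh, {co2_lbs:.0f} lbs CO2")  # 10 kWh, 10 lbs CO2
```

Real numbers depend heavily on the hardware, its utilization and the local electricity mix, so treat this as an order-of-magnitude estimate.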
A typical machine learning practitioner that we see using Weights & Biases might have eight GPUs at their disposal, and they don’t come close to 100% utilization. Even if they did, the power draw would be around 2 kW. If there are 100,000 machine learning practitioners out there (probably generous), the total power used for training would be 200 MW. That’s not much more power, and probably fewer carbon emissions, than keeping a single 747 airplane in the sky.
Another way of looking at deep learning’s impact is looking at Nvidia’s sales, since Nvidia provides the processors most people use for training. In Q1 2019 their datacenter revenue was $701 million, which means they sold on the order of 100,000 GPUs for data centers. Even if all those GPUs are training models (again, unlikely), we reach a similar conclusion.
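The back-of-the-envelope estimate above can be reproduced in a few lines. This is a rough sketch: the 250 W per GPU is my assumption (it is what makes eight GPUs come out to about 2 kW), and the function name is hypothetical.

```python
GPUS_PER_PRACTITIONER = 8
GPU_POWER_KW = 0.25       # assumed ~250 W per GPU at full utilization
PRACTITIONERS = 100_000   # the "probably generous" head count from the text


def fleet_power_mw(practitioners, gpus_each, gpu_kw):
    """Total power draw in megawatts if every GPU ran flat out."""
    return practitioners * gpus_each * gpu_kw / 1000.0


per_person_kw = GPUS_PER_PRACTITIONER * GPU_POWER_KW
total_mw = fleet_power_mw(PRACTITIONERS, GPUS_PER_PRACTITIONER, GPU_POWER_KW)
print(f"{per_person_kw:.0f} kW per practitioner, {total_mw:.0f} MW in total")
```

Since real utilization is well below 100%, this is an upper bound under these assumptions rather than a measurement.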
Why machine learning could become a significant fraction of carbon emissions in the future
Just because model training is probably not a major carbon producer today doesn’t mean that we shouldn’t be looking at what its impact might be in the future.
Although the typical machine learning practitioner might be only using eight GPUs to train models, at Google, Facebook, OpenAI and other large organizations the usage can be much, much higher.
The US Department of Energy bought 27,648 Volta GPUs for its Oak Ridge supercomputer, which it plans to use for deep learning. Those GPUs alone would draw around 7 megawatts at 100% utilization, assuming roughly 250 W per GPU.
The recent trend in deep learning is clearly toward orders of magnitude more compute. That means orders of magnitude more energy and climate impact. The impact might be small today, but it could change rapidly if trends continue. OpenAI has an excellent blog post, AI and Compute, which shows the rapid increase in compute costs to build state-of-the-art models.
GPU performance per watt is also increasing exponentially, but at more like a factor of 10 every 10 years, whereas the compute needed for state-of-the-art model performance is increasing by a factor of 10 every year.
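To see what those two growth rates imply together, here is a rough sketch. It assumes that the energy for a state-of-the-art training run scales as compute divided by performance per watt, and that both trends simply continue, which is a crude extrapolation rather than a forecast.

```python
COMPUTE_GROWTH_PER_YEAR = 10.0                  # 10x more compute every year
EFFICIENCY_GROWTH_PER_YEAR = 10.0 ** (1 / 10)   # 10x per decade, ~1.26x per year


def energy_multiplier(years):
    """Factor by which energy per state-of-the-art run grows after `years`."""
    return (COMPUTE_GROWTH_PER_YEAR / EFFICIENCY_GROWTH_PER_YEAR) ** years


for years in (1, 5, 10):
    print(f"after {years:2d} years: ~{energy_multiplier(years):.3g}x the energy")
```

Under these assumptions the hardware gains barely dent the trend: energy per state-of-the-art run still grows roughly 8x per year.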
Model inference is a bigger consumer of energy than model training (today and possibly forever)
Models don’t just consume power when they are trained. A bigger source of energy consumption today comes after they are deployed. Nvidia estimated that in 2019, 80-90% of the cost of a model was in inference. It’s unclear how much power a neural net would need to do autonomous driving, but some prototypes take as much as 2500 watts, which if deployed in every car in the world would have quite an impact, although still an order of magnitude smaller than physically moving the cars.
A more immediate energy use issue is that data centers today use over 200 TWh of electricity per year, and this number is growing. Google’s energy use in datacenters was enough to inspire them to design their own processor for inference, called a TPU, which they now also offer as part of Google Cloud.
Economic incentives mostly align with environmental incentives for model training
Model training is becoming extremely expensive. When running models in the cloud, a single GPU costs around $1/hr and produces around 0.25 lbs of CO2 per hour. At $10/ton for reputable carbon offsets, offsetting this CO2 would cost around 0.1 cents, increasing my bill by just 0.1%. That is a small incremental price to pay for carbon neutral model training.
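The offset arithmetic is easy to check. In this small sketch the 2,000 lbs per ton (a US short ton) is my assumption, since the prices above don't say which ton is meant, and the function name is only illustrative.

```python
LBS_PER_TON = 2000  # US short ton; an assumption, the text does not specify


def offset_cost_usd(gpu_hours, lbs_co2_per_gpu_hour=0.25, usd_per_ton=10.0):
    """Cost in dollars of offsetting the CO2 from `gpu_hours` of training."""
    tons_co2 = gpu_hours * lbs_co2_per_gpu_hour / LBS_PER_TON
    return tons_co2 * usd_per_ton


gpu_price_per_hour = 1.0                  # ~$1/hr for a single cloud GPU
cost = offset_cost_usd(1)                 # offset for one GPU-hour
share_of_bill = cost / gpu_price_per_hour
print(f"offset: ~{cost * 100:.2f} cents per GPU-hour, "
      f"~{share_of_bill:.2%} of the GPU bill")
```

With these inputs the offset comes to a bit over a tenth of a cent per GPU-hour, which is where the roughly 0.1% figure comes from.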
Environmental impacts may not even be the worst thing about the rapidly increasing demands for compute in deep learning
The paper mentions this, but it’s worth emphasizing: the high cost of training state-of-the-art models has a lot of worrying impacts. Today, researchers and startups have trouble competing with, or even replicating the work of, companies like Google and Facebook because the training is so expensive.
Until recently models were generally thought to be data bound, and many worried that large companies had an unassailable advantage in having the most data. But researchers were still able to make progress on high quality open datasets like ImageNet, and startups were able to build the best machine learning applications on the data that was available to them.
In a world where researchers and companies are compute bound, it’s hard to imagine how they will compete or even collaborate with large companies. If state-of-the-art models cost millions of dollars to train, will anyone even try to reproduce anyone else’s results?
Even among researchers, the more high profile labs have disproportionate access to funding and resources, which lets them publish more exciting results, which in turn increases their access to compute. This could lead to a very small number of institutions being the only ones able to do fundamental deep learning research.
There is a huge and increasing amount of wasted and redundant computation happening
The article has several excellent conclusions, all of which I agree with. The first is “Authors should report training time and sensitivity to hyperparameters”. Why is this so important? One thing that a non-practitioner probably wouldn’t realize from the article is how the same deep learning models are being trained over and over. Practitioners typically start from an existing state of the art model and try training it.
For example, a popular machine learning repository like Facebook’s Mask R-CNN vision model has been starred over 5,000 times and forked over 1,500 times. It’s very hard to say how many people have trained using this model, but I think a reasonable estimate might be five to ten times the number of stars, meaning 25,000 to 50,000 different people have tried it out. The first thing someone will do with the model is train it to see how it performs. Then they will typically train it more times, trying out different hyperparameters. But all of this information is lost, and most of these training runs are redundant.
This is one of the reasons I started my company, Weights and Biases. We save all the experiments you run so that you don’t have to run them again and someone picking up your work doesn’t have to run them again. I get really excited when I see researchers tracking their experiments in our system.
Another excellent point made in the paper is “An additional avenue through which NLP and machine learning software developers could aid in reducing the energy associated with model tuning is by providing easy-to-use APIs implementing more efficient alternatives to brute-force grid search for hyperparameter tuning, e.g. random or Bayesian hyperparameter search techniques.”
In other words, instead of trying all possible sets of hyperparameters, researchers would save money, time and environmental impact by letting an algorithm intelligently pick promising hyperparameters. We’ve really tried to make it dead simple to do smarter hyperparameter search.
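To illustrate the difference in cost, here is a minimal sketch of random search next to brute-force grid search, in plain Python. It is not the Weights and Biases API, and the search space and the toy loss function are made up for illustration; in practice each evaluation would be a full training run.

```python
import itertools
import random

# Hypothetical search space; names and values are illustrative only.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3, 1e-2],
    "batch_size": [16, 32, 64, 128],
    "dropout": [0.0, 0.1, 0.3, 0.5],
}


def grid_configs(space):
    """Yield every combination of values (brute-force grid search)."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def random_configs(space, n_trials, seed=0):
    """Yield n_trials configurations sampled uniformly at random."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        yield {k: rng.choice(v) for k, v in space.items()}


def toy_validation_loss(config):
    """Stand-in for a real training run; a made-up function, not a model."""
    return (abs(config["learning_rate"] - 1e-3) * 100
            + abs(config["dropout"] - 0.1)
            + abs(config["batch_size"] - 64) / 640)


grid = list(grid_configs(SEARCH_SPACE))            # 5 * 4 * 4 = 80 runs
sampled = list(random_configs(SEARCH_SPACE, 20))   # 20 runs
best = min(sampled, key=toy_validation_loss)
print(f"grid: {len(grid)} runs, random: {len(sampled)} runs, best: {best}")
```

Here random search uses a quarter of the training runs that the grid would. Whether it finds an equally good configuration depends on the problem, and Bayesian methods go a step further by using earlier results to choose the next configuration to try.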
Consider offsetting the environmental impact of your model training
Buying carbon offsets is cheap compared to buying GPU hours on Amazon. So why not be carbon neutral? Some people don’t think carbon offsets are really effective, but that’s outside of my area of expertise. The non-profit Carbonfund has put a lot of thought into this and offers what I think most people would view as high quality carbon offsets that are net positive, even if they might quibble with the exact “offset”. Less direct but possibly more impactful could be donating to an organization like Earthjustice. If you’re torn, maybe hedge your bets by doing both!
A simple formula is that one hour of training on an Nvidia GPU in California in 2019 produces about 0.25 lbs of CO2 equivalent emissions.
If you’d like help calculating your model training carbon footprint, I’d be happy to help you out.
Thanks James Cham, Ed McCullough, Chris Van Dyke, Bruce Biewald, Stacey Svetlachniya and Noga Leviner for helpful feedback.