Biggest challenge in making ML work in the real world with Richard Socher
Richard Socher, ex-Chief Scientist at Salesforce, joins us to talk about The AI Economist, NLP for protein generation, and the biggest challenge in making ML work in the real world.
BIO

The East Germany-born AI expert and researcher discusses some of his most interesting and groundbreaking research, along with his perspective as Chief Scientist at Salesforce.

Richard Socher was the Chief Scientist (EVP) at Salesforce, where he led teams working on fundamental research (einstein.ai), applied research, product incubation, CRM search, customer service automation, and a cross-product AI platform for unstructured and structured data. Previously, he was an adjunct professor in Stanford's Computer Science Department and the founder and CEO/CTO of MetaMind (www.metamind.io), which was acquired by Salesforce in 2016. In 2014, he received his PhD from the CS Department (www.cs.stanford.edu) at Stanford. He likes paramotoring and water adventures, traveling, and photography.

More info:

Research

Google Scholar Link

2020

The AI Economist: Improving Equality and Productivity with AI-Driven Tax Policies, Stephan Zheng, Alexander Trott, Sunil Srinivasa, Nikhil Naik, Melvin Gruesbeck, David C. Parkes, Richard Socher.

[ arxiv link, blog, short video, Q&A, Press: VentureBeat, TechCrunch ]

ProGen: Language Modeling for Protein Generation, Ali Madani, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand, Raphael R Eguchi, Possu Huang and Richard Socher.

[ bioRxiv link, blog ]

Dye-sensitized solar cells under ambient light powering machine learning: towards autonomous smart sensors for the internet of things, Hannes Michaels, Michael Rinderle, Richard Freitag, Iacopo Benesperi, Tomas Edvinsson, Richard Socher, Alessio Gagliardi and Marina Freitag

Issue 11, Chemical Science 2020. [ paper link ]

2019

CTRL: A Conditional Transformer Language Model for Controllable Generation, Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, Richard Socher.

[ arxiv link, code (pre-trained and fine-tuning), blog ]

Genie: a generator of natural language semantic parsers for virtual assistant commands, Giovanni Campagna, Silei Xu, Mehrad Moradshahi, Richard Socher, Monica S. Lam

PLDI 2019 [ pdf link, https://almond.stanford.edu/ ]

Topics Covered:

0:00 Intro

0:42 The AI Economist

7:08 The objective function and the Gini coefficient

12:13 On growing up in East Germany and cultural differences

15:02 Language models for protein generation (ProGen)

27:53 CTRL: a conditional transformer language model for controllable generation

37:52 Businesses vs. academia

40:00 What ML applications are important to Salesforce

44:57 An underrated aspect of machine learning

48:13 The biggest challenge making ML work in the real world

TRANSCRIPT

Lukas: I was curious how you got inspired by this AI Economist paper and project. I mean, I was trying to read it, and I'm not an economist, so I had a whole bunch of basic questions that are probably pretty embarrassing. But when did you learn so much about economics? It's such an interesting idea. Maybe you should summarize the paper first for those who haven't read it, and then talk about how you got interested in it.


Richard: Happy to. Yeah. So, the AI Economist is essentially a framework; it's more than just a single model. It's a whole framework that tries to model an economy, in sort of its most simple form for now, though it'll get more realistic, I think, in the months and years to come. Inside that economic simulation, you have a two-level reinforcement learning setup where you have an AI economist that can set taxes and subsidies and other kinds of financial instruments in order to optimize an overall objective for the economy, namely, in our case, productivity times equality, where equality is measured as one minus the Gini index, a measure of inequality that's used worldwide. And productivity makes sense in terms of how much all the single agents in the simulation make. Each single agent is also a reinforcement learning agent, but their goal is just to maximize their own objective, which is to maximize their own income and wealth. Is that realistic? Of course not. Also in the simulation, there are mostly just three different types of resources: wood, stone, and space, in some ways. The agents walk around in this 2D grid world. They can build houses, they can block other agents by building these houses, and they can trade resources as well. If you need more wood to build a house but you have plenty of stone, you can trade it, and so on. So those are some of the fundamentals. You also have utility curves, which are quite common in economic modeling but that you wouldn't have in a game. What does a utility curve do? It tells you, for instance, that after a certain amount of work you have diminishing returns. You could work seven days a week, but most people at some point want to actually take some time off and not spend all their time just to make another little bit of money. That and a couple of other things make it quite different from just playing a game. We thought about this, too: could we just use Civilization or Age of Empires or some of these games? But we wanted to, one, steer away from these zero-sum war games where you train and just get really, really good at fighting each other, and instead try to think of a system that aims at an overall improvement for the world, so that if the system actually gets deployed, it would have a positive impact as is, versus "oh, we used it to develop interesting technology that eventually maybe will have this positive impact." So that's what the AI Economist is. What's interesting and hard about it technically is that with this two-level reinforcement learning, the AI economist essentially keeps moving the goalposts for all the RL agents. They say, "oh, I found this great strategy. I'm going to sell this, trade this, collect these resources, and build houses in this way to block off some other person." And then all of a sudden the AI economist changes its policy, because it realizes you have a monopoly and equality is suffering, and maybe it's going to tax the person with the monopoly who has blocked all the resources away from the other agents. Now all of a sudden the agents have to adapt too. In almost all RL before, you have a fixed objective function. You know, this is how you win Go, this is how you win chess, this is how you win Atari games, and so on. Those don't change, but here the goalposts keep changing, so it is a really hard, interesting optimization problem. So that's what the AI Economist is. Now, how did we get to that idea?
It actually came from a couple of different strands. The first time I had this idea was during my PhD. All these different cultures in the world have their different energy landscapes for their optimization strategies, and a lot of them are trying to optimize roughly similar things, you would hope. People want to prosper. People want to have certain amenities and freedoms and so on, but they all end up in their different local optima. I thought about this as a non-convex objective function that different cultures try to optimize and end up in all these different local optima. So that was the kernel of the idea, but I didn't quite know how to structure it as an AI problem. I just had some quick little notes, I drew some objective function, and I continued to do NLP research. Then we hired Stephan Zheng, the first author of the AI Economist, and we also had Alex Trott on the team already for a year. With him, we were working on trying to build houses from lots of bricks and raw resources; we had a 3D agent, something like Minecraft, that tries to build a house. House building turned out to be pretty complicated, but the goal of that house-building project was to eventually have multiple agents and a whole island and the whole thing. Then we realized, man, we could spend another two or three years just trying to build the houses properly before we could get to the AI Economist, and thought, Stephan, hey, why don't we just go directly and assume house building is just one action: build the house in this location and that's it, rather than all the different 3D structures and figuring out what a good structure for a house is. Just one thing, and then eventually maybe we can merge the two projects. Stephan then did a phenomenal job deepening our understanding of the economics literature and reached out to other economists, which is how we ended up working with David Parkes from Harvard, and really fleshed out how to make it work in an RL framework and got all the complex optimization going. So I have to give a lot of credit to Stephan. And Alex eventually said, you know, screw this simple house building, this is why I wanted to do this anyway; he's been interested in economics for a long time, too. So the two of them jumped in on that project, and it became really great.
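To make the objective Richard describes concrete, here is a minimal sketch of "productivity times equality," with equality as one minus the Gini index. The income values are invented, and the paper's exact normalization differs slightly; this only shows the shape of the computation.

```python
import numpy as np

def gini(incomes: np.ndarray) -> float:
    """Gini index: 0 = perfect equality, (n-1)/n = maximal inequality."""
    x = np.sort(incomes.astype(float))
    n = len(x)
    # Cumulative-sum formulation of the Gini index over sorted incomes.
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

def social_welfare(incomes: np.ndarray) -> float:
    """Productivity (total income) times equality (1 - Gini)."""
    return incomes.sum() * (1.0 - gini(incomes))

# Hypothetical incomes for four agents in the simulation.
incomes = np.array([10.0, 20.0, 30.0, 100.0])
print(f"gini = {gini(incomes):.3f}, welfare = {social_welfare(incomes):.1f}")
```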


Lukas: That's so cool. I thought there could be a lot of debate on the objective function, right? You did economic growth times one minus the Gini coefficient. I mean, why not economic growth plus one minus Gini, you know? I can think of many other ways to do it. Was that...


Richard: And so could we. I love this project in that it literally covers the whole spectrum, from a hardcore optimization problem that's really technical, sort of min-max with a shifting objective function and its landscapes and so on, all the way to the most philosophical, civilization-level debates and questions of what we need to do in the world: what is economics, what is politics, what are we all optimizing, and should it be equality? In some ways you don't want the equality measure to go all the way to its maximum either, because that means absolutely everybody is forced to be 100% equal, which is questionable in terms of monetary things. Of course, we should all be equal in terms of rights and opportunities and so on. But in terms of financial equality, it's an important thing to point out. In fact, I think this kind of work will help with that kind of equality, because we can push for it and improve it a lot. Does it mean we have to get to the maximum? The maximum would be infinite productivity where nobody has any difference in the world, and I think we should also celebrate some types of differences. But I think economic inequality is, in my eyes, the single biggest issue that we have in the world right now. A lot of issues follow from that. If certain minorities had more economic equality, they would be better off. I think we'd have less racism and less sexism if people of color and women had the same financial equality as men do, statistically speaking. Economic inequality is a big part of it; a lot of wars get started from it. A lot of genocides and so many other issues happen from economic inequality that it's a really tough one. Now, should it only be productivity times equality? Maybe not. Maybe there should be other things like sustainability in there. So in the simulation, you have trees. The trees will eventually regrow, but you can have a tragedy-of-the-commons situation where all the agents just get rid of all the trees, and then there are no more trees; they all optimized their own thing, and everybody's equal, but long term everybody will suffer, and productivity will flatline because they destroyed all their resources. So I think sustainability is a reasonable one, and then there are interesting questions; clearly, utilitarianism isn't, at least philosophically, the only answer to this. So you may need other protections in the objective function and some boundary conditions. We could talk about just that for hours, probably over drinks in the evening. We could really have a lot of fun with the philosophy and ethics of what we should optimize. What I'm realistically hoping for, though, is that when a politician in the future says, "I want to help the middle class, that's one of the things I want to do," then eventually, either right away during their campaign or later on, they propose, "now, this is what I'm going to do to achieve this." And then you run that setup through the simulation, and you can say, "that is really different from any of the potential solutions that the simulation would come up with for helping the middle class. Why does yours differ so much? What's your thinking about how that will actually help more than these other ways?" And so hopefully we can agree more easily on the objective function, and then we can disagree less on how to get there.


Lukas: In your simulation, was there any emergent behavior that surprised you? Is there anything counter-intuitive that you discovered from doing these experiments?


Richard: Yeah, there are definitely some things where at first you're like, wait, this doesn't make sense. So we have taxes and we have subsidies. And when you look at it, the lowest income bracket actually got taxed a lot, and then taxes went down for the sort of middle class of the simulation. And we're like, wait a minute, that seems very counter-intuitive. But it turned out they were also getting more subsidies, so they were actually net positive, because the subsidies were also given to that income bracket. That at first was counter-intuitive, but once you double-clicked on it, you realized, well, effectively they're getting more in subsidies than they have to pay in taxes. So it leveled out and made more sense.
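As a back-of-the-envelope illustration of that effect (all numbers invented): a bracket can face a high nominal tax and still come out net positive once collected taxes are redistributed as subsidies.

```python
# Hypothetical brackets: (income, tax_paid). Collected taxes are
# redistributed as an equal lump-sum subsidy to every agent.
agents = {"low": (10.0, 3.0), "mid": (30.0, 5.0), "high": (100.0, 40.0)}

total_tax = sum(tax for _, tax in agents.values())
subsidy = total_tax / len(agents)  # equal per-agent redistribution

for name, (income, tax) in agents.items():
    net = subsidy - tax  # positive = receives more than it pays
    print(f"{name}: tax={tax:.0f}, subsidy={subsidy:.1f}, net={net:+.1f}")
```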


Lukas: Interesting. And sorry, this is, I guess, maybe a personal question, but you grew up in East Germany, didn't you?


Richard: I did for a couple of years. Ethiopia for three years, and four years or so in East Germany, and then reunified Germany.


Lukas: I see. Do you feel like that gives you a different perspective maybe than Americans on these topics?


Richard: I mean, I think I was still pretty young when the Berlin Wall came down. Culturally, of course, East Germany still had its differences; it wasn't like Germany reunited and then there were no more differences between East and West. In fact, you see some of the issues you see in other countries: the East still has lower income compared to the West, and a lot of women actually left East Germany to go to the West, so there are a couple of counties that have too many men, and so on. So there are still a lot of differences between East and West to this day. But I think growing up in Germany overall, which is where I got most of my education, in reunified Germany, probably did affect me. In general, as I grew up, free health care and free education, all the way up to the Master's or PhD level, were just not ever politicized questions. They were just a given. And being anti big military interventions was something that was still pretty deeply ingrained; Germany had been in two wars within a century, and it was clear that that's something we should all try as best as we can to avoid. So there were just a lot fewer pro-military conversations going on in Germany. The whole political discourse in Germany is shifted: even the most liberal, pro-economy, pro-company types of parties on the political spectrum, none of them would ever question free health care or free education, because statistically speaking, it just helps everyone. It's an interesting question about the definition of freedom, even: is it more free that you always have health care no matter what job you have, or is it less free because you have to pay for it? There are interesting cultural differences. So that definitely had a little bit of impact.


Lukas: Got it. Yeah, that makes sense. I want to make sure we cover some of the other papers; since I have you, I have just so many questions. My company has actually been working with a lot of people doing various aspects of protein generation and folding, and I really feel like there's something going on in ML right now with all these applications. It's something I know very little about, because it didn't feel like a topic when I was in school. It's such an intriguing idea that language modeling techniques could be used for protein generation. Maybe you could just tell us what you did, and what you think about the field in general.


Richard: Sure. Yeah. So at a high level, ProGen, the protein generation model that we published, is a language model. A language model is basically just trying to predict the next word over a large amount of unsupervised text. So you take all of Wikipedia, as much Internet text as you can; some people innovated by taking Reddit data, which is more interesting but also has issues with bias and so on. And you have a very simple objective function for a large neural network architecture: just try to predict what the next word is. People have been doing that for many decades, because it's a very helpful way to disambiguate words that sound the same but are actually written differently and mean something different. If I say "the price of wood" versus "would you tell me something," then "wood" and "would" sound the same in both sentences, but in one it's the wood of trees and in the other it's an auxiliary verb. You basically disambiguate which one is more likely in that context. That was used for speech recognition, and still is in a lot of speech recognition models, and for translation: you can generate a bunch of possible translations for a German-to-English translation, and then try to identify which sentences are the most fluent English. The interesting novelty that came out recently with GPT from OpenAI was to take these existing models, make them even bigger, and not just look at the perplexity numbers going lower and lower. Perplexity is essentially an inverse metric of the probability that you assign to each word: the less perplexed you are, the more probability mass you've correctly assigned to the word that actually comes next. As the perplexity reduced more and more, it crossed a threshold, and OpenAI was clever enough to realize the threshold is so low now that we should really look at what these models are generating and see what comes out. It turned out that they're actually surprisingly good, better than almost anybody in the field thought five or ten years ago, at generating fluent paragraphs that actually make sense, that have some coherence and flow to them. Of course, after one or two paragraphs they will repeat themselves and won't make that much sense anymore, because they don't have, and I think this is actually an interesting question for the future, what's the next objective function? Just predicting and generating the next word doesn't include the fact that usually when you say something, you have a goal in mind: to convince somebody of something, to learn something, to get a message across, to get somebody to do something. All these different goals that you have as you use language, that I think will be the next level of AI research: to identify and understand new objective functions in general, and eventually allow AI to come up with its own objective functions. But anyway, back to ProGen. This is fun; usually I don't have that technical an audience to geek out about these things with, and I have to stay more high level in most other interviews. What's cool about ProGen is we took this idea of predicting the next word, which for language makes sense. Humans can do it, but humans can't actually do it for proteins.
We're not built to look at a bunch of different amino acid sequences of proteins and learn what would come next. And I love the fact that when you develop really novel AI techniques, you can apply them to so many different areas. I still think that one of the most exciting things is when you find a new model family, apply it to all these different things, and eventually have a multitask model that can do multiple different things. So here it made sense to us because, again, it's a language that has a meaning; we just have a much harder way of accessing that meaning, and we have a ton of training data now that sequencing is getting cheaper and cheaper. There's also an interesting story there: the first time somebody got sequenced, it was incredibly expensive, and that was a white man; now, for a hundred bucks or so, anybody can get sequenced. So that's actually a great story for technology. Long story short, we predict a protein one piece at a time, and then also generate new proteins. So what does that mean, and why would that be useful, for people not familiar with proteins? Everything in life is governed by proteins. Every function in your body is governed by proteins. Deep down, at the level below cells and everything, it's all guided by proteins: the digestive system, everything. You could even develop proteins that will fight certain types of cancer or certain types of viruses; that's actually something we're also working on now, to try to do some interesting things for curing certain kinds of viruses, but it's too early to talk about it right now. It will take some time. It's kind of another moonshot. But there's really exciting work you could do, like developing proteins that will eat plastic to help with pollution. The kinds of things you could do with proteins, if you understood that language well, are unlimited. One big important factor for this protein model is that it's also a controllable language model: it has these control codes in the beginning, because you don't want it to just randomly generate random proteins. You have an actual goal in mind, like "this should bind to this binding site in a cell" or "this should be able to attach to a plastic" and all these different kinds of things. We have these control codes; they basically give you the function, which area of the body it's in, what binding sites it should have, and things like that, and then it will actually generate reasonable proteins. And Ali on our team has been doing a phenomenal job pushing that line of research. He's also the first author of ProGen.
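Since next-token prediction and perplexity anchor this whole discussion, here is a minimal sketch of how perplexity falls out of next-token probabilities; the probabilities below are made up, not from any real model.

```python
import math

# Hypothetical probabilities a language model assigned to each
# actually-observed next token in a short sequence.
token_probs = [0.20, 0.05, 0.60, 0.10]

# Average negative log-likelihood per token...
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
# ...and perplexity is its exponential: lower = less "perplexed".
perplexity = math.exp(nll)
print(f"avg NLL = {nll:.3f}, perplexity = {perplexity:.2f}")
```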


Lukas: Of course. How did you get the training data for that? Like, I could see how you could get protein data from DNA, but how did you get the data on what these proteins do and things like that?


Richard: So we took a subset of the data that had some kind of metadata associated with it. What's interesting is that you can actually look at a lot more data once you just say, look, any control code goes; it just goes in here, and then we can use that too. The majority of datasets are still very unstructured. There's no good documentation or coherence between these different datasets. The proteins in different datasets are of different lengths, and some people say, "oh, it has these three functions," and other people say, "well, I just got this from somewhere." The next level is to actually train it even if you have zero metadata associated with the sequences. There are some interesting studies that have a lot of unsupervised sequences from soil and all kinds of things, so if you could learn from unsupervised sequences, you could train on even more. But for now, we just took datasets that had at least some kind of metadata associated with them. Even though there is no nice general ImageNet-like or WordNet-like taxonomy for them, any kind of metadata was enough for us to incorporate the data.
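A minimal sketch of the "any control code goes" idea as described here: whatever metadata tags a record happens to have are prepended to its amino-acid sequence before next-token training. The tag names and the sequence are invented for illustration, not ProGen's actual vocabulary.

```python
def format_training_example(tags: list[str], sequence: str) -> str:
    # Whatever metadata exists (organism, function, binding site, ...)
    # becomes a prefix of conditioning tokens; missing metadata is fine.
    return " ".join(tags) + " " + " ".join(sequence)

# Hypothetical record: two metadata tags plus an amino-acid sequence.
example = format_training_example(
    ["<organism:human>", "<function:kinase>"], "MTEYKLVVVG"
)
print(example)
# <organism:human> <function:kinase> M T E Y K L V V V G
```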


Lukas: And was this the same as GPT, where it's just predicting the next one and you're trying to have the lowest perplexity or the highest probability?


Richard: That's right. It's still a super simple objective function, just trying to predict what the next one is. What's amazing, and we just released this on Twitter today and in the blog post, is that we analyzed it and found some super fascinating stuff. So there's protein folding, which is a computationally expensive, really hard problem. What we found is that even though the model goes through sequences one token at a time, you can visualize the attention mechanisms inside these transformer networks, and the attention actually has a very high correlation with folding patterns in 3D. Areas of the protein that end up close to one another when it folds, and different binding sites and so on, are highly correlated with the transformer's attention. So I think there's a lot more there to find out and explore.
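A minimal sketch of the kind of analysis Richard describes: correlating a transformer attention map with a residue contact map. Both matrices are random stand-ins here; in the real analysis they would come from the trained model and from solved 3D structures.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 50  # protein length in residues

# Stand-ins: attention[i, j] from one transformer head, and a binary
# contact map (1 if residues i and j are close in the folded structure).
attention = rng.random((L, L))
contacts = (rng.random((L, L)) < 0.1).astype(float)

# Flatten the off-diagonal entries and measure their correlation.
mask = ~np.eye(L, dtype=bool)
corr = np.corrcoef(attention[mask], contacts[mask])[0, 1]
print(f"attention/contact correlation: {corr:.3f}")
```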


Lukas: Were the same mechanisms, like attention, that make language models work well also the things that really mattered for protein prediction, or were there differences in the kinds of models that worked?


Richard: So, to be honest, these models are so large that we don't want to burn through a hundred million dollars to train ten of them. We trained tiny versions of a transformer, and then we trained very few, just one or two, of the 1.7 billion parameter models with the 230 billion or 70 billion or so protein sequences.


Lukas: I see.


Richard: Sorry, we don't have a huge ablation table where we spent 100 million dollars on one paper that gives all the different numbers in a big table. These models are so large that you'd really better not have a bug in the beginning and only realize it later.


Lukas: But I guess, do these larger models work significantly better than simpler models?


Richard: For sure. Yeah. This is really where neural networks shine. They have so much more expressive power; they can capture so many more different, non-convex, highly complex functions, and you need that. You couldn't do this with a linear model; the world is not linear. It would be a lot easier to solve all kinds of issues in medicine and so on if everything in biology were some nice convex problem, but it's far from that. So we really need the capacity of these very large models.


Lukas: Do you have any way to... Like, I feel like with GPT-2, one of the coolest things about it was that it produced sentences that were so evocative. You could decide, OK, this thing doesn't capture long-range dependencies, but it's very fluent, you know? Could a scientist look at the proteins you generate and have some sense that they seem fairly realistic, or is there any way to measure that?


Richard: It's super fascinating, right? What is the energy landscape in this discrete protein space that actually makes sense when you look at it? Biologists already do this. Two years ago, I think, the Nobel Prize in chemistry was given to work that essentially randomly modified existing protein sequences and then just tested them out. You can synthesize them and see if they actually have certain properties; for instance, you can try to make them fluorescent, and then see how much of a protein you can randomly change and still keep that property, or get even more of that property, which could be useful for drugs and drug development. You usually don't want to stray too far away from the original. And then there are a couple of different metrics you can look at: as you generate a new type of protein that doesn't yet exist in nature, how likely is it to be structurally sound at all? In the paper, we have experiments where we show, with programs that compute an energy score, whether a protein would have a very low or very high energy, and hence whether it would just disintegrate and fall apart or actually be structurally sound. And it turns out that compared to the random baseline, which is relatively easy to beat, we do much better and create much more stable proteins that are more likely to actually work.


Lukas: That's so cool. I'm going to keep jumping around, because I have so many questions I want to get through, but I'd love to hear about the language model that you came out with last year, CTRL: what inspired you to make a new language model, and what does it do differently than other options out there?


Richard: Yeah, it's a great question. So CTRL is essentially a controllable language model. Instead of just saying, "here's a beginning sentence, now spitball how that could randomly continue," it usually makes more sense to create language technology that we have a little bit more control over. So we created these control codes that essentially say: continue the sequence, but given this genre. If you start with "a knife" and you say the genre is a horror movie, then the knife peeks through the door and a lot of crazy stuff happens; but if you say "a knife" and a review, then it's like, "oh, the knife cuts my vegetables really well, my husband loves using it in the kitchen," and so on. That's the difference: you have more control over what it will actually generate. Control codes can also be used as task codes: you can have a control code for translation, and then it generates the translated sentence instead of just the next random plausible sentence. This has been something I've been trying to work on for a long time, with the Natural Language Decathlon (decaNLP) and a lot of other projects. I think we're at the state now in NLP where we can try to solve a lot of the standard NLP problems with a single, large, multitask model. You have as the substrate a large, complex neural network structure; it almost doesn't matter anymore these days what it is. You could probably use a very deep stacked LSTM; right now it would be a transformer, and we'll probably come up with other versions, but it's some kind of large, general function approximator, some neural substrate. The novelty is that you train it with all these different objective functions and different tasks. It gets better over time, and then you get transfer learning, you get zero-shot abilities, and so on. That's been a dream since our first line of that work with Bryan McCann on contextual vectors (CoVe), which we trained back then with translation. Then ELMo took that idea and replaced translation with language modeling, which is even more clever, because you have even more unsupervised data than you have translation data, and translation was sort of the biggest supervised dataset, the ImageNet of NLP. ELMo, of course, became BERT with even more novelties on top, but still sticking to language models and taking these contextual vectors. And when you have contextual vectors that can easily get fine-tuned on multiple tasks, then you have something like decaNLP, where everything is described as one task. Then you get closer and closer to eventually having a single model for all of NLP. My hope is that eventually the NLP community can work in a collaborative, cumulative way, where we have a CTRL-like language model or a question answering model that you can ask any kind of question, and so on. Or you can even have just a general language model, but you ask it questions, and the next words that come after the question should be the answer, if it really learned something about language and the world. That is an equivalent supertask of NLP.
The long story short is, if we're able to do that, and every piece of research we do actually makes an existing supermodel better and better, then we would all of a sudden have an explosion of progress in natural language processing. We would stop saying, "oh yes, this paper has a baseline and we're making it a little bit better," and then in the next paper we jump back to the baseline and make it a little bit better in a different direction. We improve our baselines from time to time, but all these papers make one-off improvements to those baselines. Versus: every time somebody publishes a good paper, the model overall gets better, and then everybody starts directly from that improved model. So that's been my dream for the NLP field for a while.
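A toy illustration of the control-code idea (not the real CTRL API): the same prompt continues differently under different codes. The continuations here are canned stand-ins for model samples.

```python
# Toy illustration of conditioning on a control code: a real model
# would condition on [control_code] + prompt tokens and sample next
# tokens; here we just look up a canned continuation.
CANNED = {
    ("Horror", "A knife"): "A knife glinted under the door as the floor creaked...",
    ("Reviews", "A knife"): "A knife that cuts vegetables cleanly; five stars.",
}

def generate(control_code: str, prompt: str) -> str:
    return CANNED[(control_code, prompt)]

for code in ("Horror", "Reviews"):
    print(f"[{code}] {generate(code, 'A knife')}")
```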


Lukas: It does kind of seem like NLP is moving in that direction, doesn't it, with the big multitask baselines?


Richard: That's right. And T5 and all these other large models. I'm super excited to see it; I think it's finally happening. It'll still take some time, though. About ten years ago, I had my first deep learning neural network paper at an NLP conference, and the reviewers still wanted to reject it. A lot of people were like, "why are you applying neural networks to NLP? That stuff from the 90s doesn't work here." In the beginning of my PhD, I had a lot of papers rejected. I think part of it is that a lot of people built their careers, their knowledge, and their academic standing on feature engineering. So when you say, "oh, you don't need to do feature engineering anymore, you just have these models and they learn the features," it doesn't sound that great if you've done feature engineering for 10 years. Now we've had the last 10 years or so of people doing architecture engineering, and they don't want to hear that the architecture doesn't quite matter anymore; it's now about the objective functions. So let's ignore all these architecture engineering papers, just assume there's one very large, efficiently trainable neural network architecture, probably a transformer because it's parallelizable on GPUs nowadays, but it could be LSTMs or whatever, and we train this really large one. And now we become clever about the objective functions for improving that neural substrate. Again, it will be a shift, and it usually takes the community a couple of years to make these shifts. Young people jump on it, and then people who are older and have been in the field longer will eventually, through their grad students and so on, adjust and embrace it, and then start doing amazing work in the new area.


Lukas: So how does CTRL fit into that? Was it a new architecture, or was it really just adding control codes?


Richard: It was mostly adding control codes to a large language model; that was the main idea. And it fits into this as a way to unify. Basically, the way I see it, there are three equivalent supertasks of NLP: dialogue systems, language models, and question answering systems. You can cast every other NLP problem into any of those three, and you can map those three onto one another. In dialogue, something happens and then you have to generate the answer to what the previous agent just said. With language modeling, you can cast it as question answering by asking a question, and then the words that should be predicted after that question should be the answer. So question answering and language modeling are equivalent. We tried this with decaNLP, where we used question answering as the default framework, and with CTRL, it's the acknowledgement that if you start with a large substrate that can be trained unsupervised from a large amount of text, that is sort of the best single task to then transfer from and do multitask learning from.
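A minimal sketch of the decaNLP framing Richard mentions, casting different tasks as (question, context, answer) triples; the examples are invented.

```python
from typing import NamedTuple

class QAExample(NamedTuple):
    question: str
    context: str
    answer: str

# Different NLP tasks cast into one question-answering format.
examples = [
    QAExample("What is the translation from English to German?",
              "The house is small.", "Das Haus ist klein."),
    QAExample("What is the sentiment?",
              "I loved this movie.", "positive"),
    QAExample("What is the summary?",
              "Long article text ...", "Short summary ..."),
]

for ex in examples:
    print(f"Q: {ex.question} | C: {ex.context} | A: {ex.answer}")
```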


Lukas: Do you treat the control codes differently than other tokens? Because I feel like I see a lot of examples where people do translation by just showing pairs, and then the language model just generates pairs. Is that the same thing, or is CTRL somehow doing things more systematically?


Richard: In some cases you can actually make those control codes be language themselves, right? So you could say, "here's a question, now here's the text; generate the answer after you've read the whole thing." But you can also have control codes that are just tokens, like task_1. What's surprising is that with these control tokens, the outputs will be very different: with a translation token, all of a sudden the same neural network architecture generates very different output. It's pretty amazing how that works.


Lukas: It is amazing. I mean, I can't quite believe it. I remember when I was studying this stuff, it was linguists who wanted to do it completely explicitly and rule-based, versus people doing machine learning. So I guess you sort of keep going up levels of abstraction.


Richard: You know what's interesting? I used to dismiss those rules so much, but then you try to build a real system for a real company; say you have a chatbot. In the end, for anything you can turn into a chatbot, that company has some API somewhere, right? Whether that API is "you click on these fields" or it already exists as a program, at some point it needs to be structured, disambiguated, programming-language-like output that fulfills actions: "what order do you want?", we go into our order management system, update this field, and send out a new item that goes to some logistics center. And so with these concrete chatbots for a company, I was always thinking it should all just be learned, and then at the very end they generate some code and so on. But the truth is, companies sometimes want to have control. They want to say, "maybe there were bad biases in my past training data," or "we changed a process and now we don't want to do it the way we used to; it's going to be a new process," or "in this country we have some regulations, so we need to first ask this other question that wasn't in the training data from that country," and so on. So I'm surprised how often, when it comes down to real business and products, you still have to have these rules in there.


Lukas: I'm curious, actually. You've gone through this transition from mainly academic, to startup founder, to big-company C-level executive. Have there been any other surprises like that, seeing how businesses think about machine learning versus how academia thinks about it?


Richard: Yeah. One thing I love is that I actually still dabble a little in all the other ones; we're obviously still doing fundamental research now, but also lots of product work. But there are a lot of interesting different mindsets. In many ways, if you have a domain problem, and this is actually something you see even in research, if you work in, say, biology, or you're just trying to solve one particular domain problem in a particular modality like language or vision, it's rare, it's really hard, for the people working on those applications to also come up with new architectures. It's just a different mindset: you're trying to solve the problem. If you're trying to, for instance, help babies in the ICU, or trying to cure COVID or something, you don't care whether you can do it with naive Bayes or an SVM or latent Dirichlet allocation or whatever the popular model at the time is. It doesn't really matter if you solve cancer, or some specific type of cancer. So it's interesting: you start to throw the kitchen sink at applied problems, and that's still true even for applied engineering teams. They say, you know, by the end of this quarter, in this sprint planning and so on, you've got to have a solution that works on some level. Whether it's the absolute latest and greatest that really squeezes out those last 2% depends on the business model. For Google, it makes sense to spend a lot of time on AI, because they clearly have certain AI metrics, like recommender systems for advertisement, where an improvement in an AI metric immediately results in more revenue. That isn't the case for every AI problem and product out there in the B2B world. Sometimes, if it works, you make the same amount of money as if it works 5% better.


Lukas: What are the things that Salesforce cares about? Like what are the ML applications that are really important inside of your company?


Richard: So there are a ton. There are roughly different groups, such as packaged applications that you can sell as is, like a chatbot application or opportunity and lead scoring. Some of these sometimes go into a second category, which is quality-of-life and user-experience features, where you just make the product a little bit better, and you have a lot of those...


Lukas: Wait, sorry, what would that be like? Like, make the product a little better?


Richard: For instance, you type in the name of a company to create a new lead object as a salesperson, and it just finds the company's logo, boom, and now it looks better in a nice table. This is not a feature you could get money for. Or the search functionality: search is one of the most used features in most CRM software, but spending another billion dollars on improving search is questionable, because you'll make the same amount of money; everybody assumes that search should just kind of work, and you don't pay extra for it for the most part. So you have packaged applications where you clearly make a lot more money, like a recommendation engine in commerce platforms. We run one of the largest e-commerce platforms in the United States, which many people don't know, because nobody goes to Salesforce.com to buy their shoes; you go to Adidas.com, which runs on Salesforce. There, recommendation engines are almost an obvious kind of task; everybody knows you should use recommendation engines in e-commerce. Those are packaged applications that you can sell as is. Then you have these quality-of-life features. Then you have things like improving your operational efficiency: making recommendations for your own salespeople, learning how to run your data centers more efficiently, and things like that. And then we also have platforms in the company where we enable our hundreds of thousands of customers to build their own AI applications with their own data, without us interfering. There you also have interesting problems, because you don't just build one app; you have to build an engine such that admins with low or no code can create an AI application: some prediction model, some recommendation model, some OCR model to read a complex form directly into digital form, which is surprisingly still necessary a lot of the time these days. So there are so many different applications. That's why it's so exciting here.


Lukas: How do you even... Like, within your team, how do you decide what work to take on? Is it by research interest?


Richard: It's a complex process. I'm wearing these different hats. On the research side, we go mostly for impact on the AI community as a whole; that's one of our objectives, impact on AI research. Another one is impact down the line, eventually, on products. So we work on things in medicine, where we don't currently have products, but maybe down the line that could be used. We have things like the AI Economist or ProGen, where maybe eventually the world will improve, but it's not really clear yet. So there's pure AI research impact on the world, all our stakeholders, and the community, and then impact on real products. A lot of natural language processing research is surprisingly close to product. You can do fundamental research in semantic parsing, learning how to really disambiguate a sentence into a query that could be used to get the answer from a database. That is fundamental research, but it's also pretty applied, and could be used for Tableau and a lot of other exciting areas inside the CRM where people need to find answers in a database. So those are the two different worlds on the research side. On the product side, in the large engineering groups, it's very customer-driven, and sometimes it's driven by what we think the future will be like. For instance, at Dreamforce last year, we announced a first agent over the phone: an agent that you can just pick up the phone and have a natural conversation with. Marc and I were on stage showing what that would look like. That is something customers maybe aren't even thinking about yet, because they're not sure it's even possible. But we work on those kinds of things because we think it will be possible soon, and we're now making it possible.


Lukas: Cool. All right. Well, we're running out of time. We always end with two questions that I didn't warn you about, but I'm curious what you'll say. The first question is: what's an aspect of machine learning that you think practitioners are not paying enough attention to?


Richard: I think now that AI has reached that level of deep impact on the world, you really need to think about bias in a holistic way: the systems, the people, the structures that are using AI for something. Are we thinking enough about the bias? As AI has a bigger and bigger impact on people's lives, I think the bar needs to rise more and more. For a loan application AI that decides who gets to start a business and so on, you really need to pay a lot of attention to the biases in the datasets, the biases in how those datasets are created by people, the hidden agendas, and what the status quo is. How do you improve the world in the end, versus entrenching the current system and just keeping it the way it is? I think that's something a lot of practitioners still need to work on, and now more researchers too, because even when we play around with, "oh, that's just a cute little artsy research project," e.g. depixelization, it turns out there's another deeply rooted bias in there that gets exposed, and I think we should all work on that.


Lukas: Do you have any suggested reading material for people who want to get more educated on the topic, where you would point them?


Richard: Yeah, for sure. I think Timnit Gebru right now is really one of the leaders in that area, and she has given a great tutorial at CVPR; the slides are online. There are a bunch of papers from a lot of other people. On our team, we also have Kathy Baxter, and she has looked a lot more at the applied side of AI: making sure that AI systems are explainable and transparent, and that you have feedback loops in them. When an important decision about someone was made in an automated fashion and they think it's wrong, they should be able to fix it, escalate it to humans, or improve the data. Making sure systems are explainable means you actually understand how it came about that this decision was made about you, and so on. Even though it sounds kind of crazy, I think we even need to think about human rights when it comes to the applications we work on. Kathy Baxter has a lot of materials online, interviews and articles. We also have some Trailhead modules on the Salesforce learning platform on ethics, and ethics in AI in particular. And Timnit Gebru has a lot of great material on research in AI and the systemic issues, as well as other concrete issues.


Lukas: Cool. Yeah, we'll put those in the notes; totally agree. The final question: you're coming from a research perspective, but you're at a company that does lots of applied machine learning. When you look at the path from a research result to something deployed inside the Salesforce product, what's the biggest challenge that you see in that process? What gets bogged down the most?


Richard: Boy, it's interesting. I feel like we're finally getting into a groove, and we're getting a lot of features out much more quickly than we used to. Part of it is that you have two different sides: the pure researchers, research engineers, data scientists, and data engineers have a certain way they see where the complexity of deploying an actual AI product lies, and then you have the engineers. The truth is that somewhere between 5% and 20% of an AI product is actual AI, and the remaining 80% to 95% is relatively standard, but still very hard, software engineering. Everybody can nowadays quickly hack together a TensorFlow image classifier, right? After 10 minutes you feel like you're an expert, and it's so cool, and you're super smart and you know AI now. But when you actually want to deploy that in a large context, now you have load balancing, security, privacy, and all these issues. Now somebody in Europe invokes GDPR and says, "I want you to delete this picture." Now you need to retrain. If that happens every day, are you retraining a whole huge model every day, because somebody asked to take their data out of the set that fed your classifier? How do you update the classifier continuously? How do you make sure that as you update the classifier, if you've had something like FDA approval or HIPAA compliance, the new classifier is still compliant with all the various regulations? So there's a lot of complexity in the engineering and productionizing of AI, and that is what a lot of people who are super deep AI experts often underestimate.
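A minimal sketch (with invented names) of one engineering concern raised here: honoring deletion requests makes the training set a moving target, so retraining becomes a recurring, auditable job rather than a one-time step.

```python
# Deletion requests are recorded immediately; they take effect on the
# training set at the next scheduled retrain.
deleted_ids: set[str] = set()

def handle_deletion_request(record_id: str) -> None:
    deleted_ids.add(record_id)

def build_training_set(all_records: dict[str, str]) -> dict[str, str]:
    # Filter out every record whose owner asked for deletion.
    return {rid: x for rid, x in all_records.items() if rid not in deleted_ids}

records = {"a": "img_a.png", "b": "img_b.png", "c": "img_c.png"}
handle_deletion_request("b")
print(build_training_set(records))  # {'a': 'img_a.png', 'c': 'img_c.png'}
```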


Lukas: Cool. Well, great. And great talking to you, Richard. Thank you so much for doing this.


Richard: Pleasure. Great questions. It was super fun to geek out a little bit and go deep into some of these papers.


Lukas: Totally. Yeah. Thanks so much.
