DeepChem creator Bharath Ramsundar on using deep learning for molecules and medicine discovery
Lukas and Bharath discuss how ML is being used in the medical and biology research fields

Bharath created the DeepChem open-source project to grow the deep drug discovery open source community, co-created the MoleculeNet benchmark suite to facilitate development of molecular algorithms, and more. Bharath’s graduate education was supported by a Hertz Fellowship, the most selective graduate fellowship in the sciences. Bharath is the lead author of “TensorFlow for Deep Learning: From Linear Regression to Reinforcement Learning”, a developer’s introduction to modern machine learning, with O’Reilly Media. Today, Bharath is focused on designing the decentralized protocols that will unlock data and AI to create the next stage of the internet. He received a BA and BS from UC Berkeley in EECS and Mathematics and was valedictorian of his graduating class in mathematics. He did his PhD in computer science at Stanford University, where he studied the application of deep learning to problems in drug discovery.

Follow Bharath on Twitter and Github



Lukas: It's really exciting to talk to you. We've been seeing a lot of customers come in doing drug discovery and other medical applications, and it's something that I'm not super familiar with but seems incredibly meaningful. We've got a chance to talk to a whole bunch of our customers and ask them what they're doing. And one thing that keeps coming up is actually the DeepChem library that I think you're the original author of. So I really wanted to start off by asking you about that. What inspired you to make it, and what problems were you trying to solve?

Bharath: Yeah, absolutely. First of all, thank you for having me on the show. I'm glad and excited to chat as well. Lots of folks I know have been using Weights and Biases to train models and track experiments, so I think it should be a fun conversation, I hope. A few years ago, during my PhD, I did an internship at Google where I used their many computers to train some deep learning models for molecular design broadly. But as with all good things, the internship came to an end, and I had to head back to Stanford, and then I found out I no longer had access to all that code. I couldn't really reproduce my results. So I think the starting point was I just wanted to reproduce the results of my own paper. To start, it was basically just a few scripts in Theano and Keras at that point. Then I put it up on GitHub, I mean, why not? Then a few more people did start to use it, and it just sort of grew slowly and steadily from there. I think the original aim of DeepChem was really to enable answering questions about so-called small molecules. Most of the drugs that we take, Tylenol, your ibuprofens, things like that, are all small molecules. But over time, I think pharma has actually begun to shift off of them, and so now there are newer classes of medicines. There are of course things like vaccines. So nowadays, I think DeepChem is slowly trying to grow out to enable open-source medicine discovery across a broader swath of modern biotech. So that's just a little bit about the project. I think there is a very active community of users. There are a number of educational materials and tutorials built up around it. I think it's also that a lot of medicine discovery is quite proprietary. With biotech companies, we often see things in their advertising material like "our proprietary algorithm, our proprietary technique", which has worked fine for the industry for a long time. You know, that's the way most medicine we know was discovered.
But, of course, as we know in tech there's just been a shift, in that open source is increasingly a foundational part of the way we build companies, we discover things. So I think part of the goal of DeepChem is to bring some of this open-source energy to the biotech drug discovery community and enable more people to be able to share in these tools.

Lukas: It seems like you've definitely been successful at that. Even before I knew of you, I was talking to folks at Genentech and GSK, and I would say over half of the conversations I've had with pharma companies have mentioned DeepChem. I thought it was pretty cool that they are using the same platform and contributing IP. I didn't know that pharma did that at all. So that seems really wonderful.

Bharath: I think it definitely is kind of a new shift in thinking. But of course, you know, pharma has seen the fact that TensorFlow is open source, PyTorch is open source. So I think it is the beginnings of a shift. At the same time, I think IP considerations definitely do matter a lot. So I think a lot of folks find they can't contribute at some places, which is fine, I think it's just a policy. But there is still a culture of caution around potentially releasing valuable IP. I think what helps things a bit is there's this recognition that oftentimes it's the actual data that's the core IP. It's not necessarily the algorithm; that's just calculus. And so I think there are some favorable shifts in the industry, but it's definitely something that's only beginning to happen.

Lukas: So just taking a step back, because I think not everyone necessarily knows the field at all. I actually didn't, till maybe six months ago when we started to see our users doing this. What's the canonical problem here that Pharma is trying to solve?

Bharath: Yeah, I think it's a great question. At heart, the goal really is to design medicine for diseases you care about, and the reality is this is an extraordinarily complicated process. I'd say even now, machine learning is only useful for 10% of it. The task here is that, say, you identify a disease, then you want to find a hypothesis for what causes the disease. Maybe there is a protein that somehow has become misconfigured or mutated in the body. There can be a whole host of disease-causing factors, but you oftentimes try to take a reductionist view: you narrow that down to one protein target. So you say that if I somehow could change the behavior of this protein, I could potentially cure this disease. It's a hypothesis. It might be right, it might be wrong, but it's a good starting place. Then you go out and you say, now I know this protein. Can I find a molecule that interacts with it in some way? There are a few mental models for this. You can think about it as a lock and key, or you can think of it as an interacting agent (the drug, that is) that comes in and shifts the behavior of the protein in a way that's favorable. So the goal computationally, at a crude level, is to design the molecule: given the description of this problem, print out the ideal molecule for it. Now, the reason this gets challenging is that the ideal molecule is extremely hard to find. I think one of the hardest problems here is this question of toxicity. The silly example for this is if you want to kill cancer cells, you can pour bleach on them. You can't drink that bleach - that's going to kill you too. So a lot of medicine is pretty indistinguishable from poison. It's really targeted poison that goes after one particular part of the body. So when you're designing medicine, you're often just struggling with this challenge: you're on this very razor-thin design edge between poison and medicine.
You also often don't have a precise model of whether the potential drug works or not until you try it in real patients. So you try to make proxy models for this. Traditionally you'd have something like a rat that has some variant of the disease, or sometimes it's things like cats or even dogs. When you think it's safe, you then try it out on real patients. So this is kind of the clinical trial process. There's Phase One, which tests toxicity: is it safe for humans? There's Phase Two, which tests efficacy: is this actually showing an effect in the group of patients I'm trying it on? And then Phase Three is basically "OK. We think there is an effect, let's make sure on a big trial with lots of people." And occasionally there are things like Phase Four, which is, after the drug is being used by real people, let's do more studies and understand the real effects it's having on patients so that we can give better guidance to doctors. So I think the heart of the challenge in applying machine learning here is that we are dealing with a lot of unknowns. We don't know precisely why things become poisonous. We know some of the reasons. But oftentimes you'll get these strange factors that crop up. We don't know if a potential medicine actually treats the disease in question until we try it.

Lukas: Just to slow down for a second. I think it's not even obvious to me necessarily what the machine learning problem is within that. What's the input data and what are we trying to predict?

Bharath: That's definitely another great question. Usually, the challenge here is that you start with a very narrow sliver of this problem. So there are, say, limited models for toxicity: given some amount of data, you create a database of compounds where you say, this molecule induces a negative effect. You can train a machine learning model that, given the structure of a new molecule, will predict an output, which is the toxicity label. The challenge, of course, is generalization. You know it works on your training set, but if I give you a new molecule, does it actually work? That's often the question, and it's very hard to gauge.

Lukas: And then how is it possible? Sorry, there are some questions I have. How would you possibly have enough training data? You're not going to keep poisoning cats to keep finding more and more poisonous molecules, right? How does that work?

Bharath: I think that's another great question, and the real answer is we don't have enough training data. Which is why I think molecular machine learning is a bit of an art right now. Unlike images and speech, where there are these dramatically larger training sets, the datasets are fundamentally limited. There are a few approaches people take to deal with this. I think one common theme is, let's use more of the fact that we know a lot about physics and chemistry. Toxicity, I think, is a very hard problem; it's biology. It's kind of harder. But in many cases, you'd say, "well, okay, I know something about the molecule. I know something about its invariances. I can encode that into the convolutional network." So now you have increasingly sophisticated graph convolutional networks that encode more factors of known molecular structure. It's definitely not a solved field. I think this entire part of machine learning is far from what I call the ImageNet moment, that point at which the thing just crosses over and breaks out. I think right now it's useful, but it isn't that magic bullet in this area.
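
One way to picture what "encoding invariances" can mean: a molecule is the same molecule no matter what order its atoms are listed in, so a graph network should give the same output under any atom ordering. The sketch below is a toy illustration under that assumption, not DeepChem's graph convolution. It has no learned weights, and the names `message_pass`, `readout`, and `embed` are made up for this example. It sums neighbor features for a couple of rounds and then sums over atoms, which is enough to show the ordering invariance.

```python
def message_pass(features, edges):
    """One round of sum-aggregation: each atom adds its neighbors' features to its own."""
    updated = [list(f) for f in features]
    for i, j in edges:
        for k in range(len(features[i])):
            # Read from the old features, write to the new ones (synchronous update).
            updated[i][k] += features[j][k]
            updated[j][k] += features[i][k]
    return updated


def readout(features):
    """Sum over atoms, so the result cannot depend on how atoms are ordered."""
    return [sum(column) for column in zip(*features)]


def embed(features, edges, rounds=2):
    """Toy molecule embedding: a few rounds of message passing, then a sum readout."""
    for _ in range(rounds):
        features = message_pass(features, edges)
    return readout(features)


# Heavy atoms of ethanol, C-C-O, with one-hot features [is_carbon, is_oxygen].
feats = [[1, 0], [1, 0], [0, 1]]
edges = [(0, 1), (1, 2)]

# The same molecule with its atoms listed as O, C, C instead.
feats_perm = [[0, 1], [1, 0], [1, 0]]
edges_perm = [(1, 2), (0, 1)]

print(embed(feats, edges))
print(embed(feats_perm, edges_perm))
```

Both calls print the same vector. A real graph convolution would put learned weights inside the update, but as long as the aggregation over neighbors and the final readout are order-independent operations like sums, the invariance carries over.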

Lukas: I actually really would like to go back to that but I want to make sure I understand the core problem here. So it sounds like you have a molecule and you want to predict some kind of property?

Bharath: I think that is definitely the most common one. There are a number of variants to this. Like, you might have a protein, and then you want to find a molecule that interacts with it. One way you can frame this is as a property: does the molecule interact with the protein? There are also generative models, where you say, okay, given a database of known drugs, use an LSTM or something to just print out a new potential drug. This tends to get a little hairy. It's kind of hot research, but it's not safe to really use in production. I think there are some raging academic debates about that right now.

Lukas: Alright. Sorry, could I ask some more dumb questions? How do you even represent a molecule? Text seems kind of obvious to me but I mean, it seems like molecules have a variable length and they have some structure. Is it a graph?

Bharath: It's actually a great question. Thankfully there's the field of cheminformatics, where a number of years ago they defined a thing called SMILES, S-M-I-L-E-S. SMILES strings are basically a language that allows you to write down molecules. It's most often used for small molecules, but you can write pretty big arbitrary molecules as well. Many architectures take the SMILES string and convert it into a graph. The idea is that the atoms in the molecule turn into nodes in the graph and bonds usually turn into edges. Although sometimes you do something like a distance cutoff, because there are these non-covalent interactions. So you might say all atoms that are close to each other are now bonded and have edges in my graph.
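
The atoms-as-nodes, bonds-as-edges idea can be sketched in a few lines. What follows is a deliberately tiny toy parser, not a real SMILES implementation: it only understands unbracketed atoms, the bond symbols `-`, `=`, `#`, branches, and single-digit ring closures, and it treats aromatic bonds as order 1. The function names `smiles_to_graph` and `cutoff_edges` and the example coordinates are made up for illustration. In practice you would hand this job to a cheminformatics toolkit.

```python
import math


def smiles_to_graph(smiles):
    """Toy SMILES subset -> (atoms, bonds); bonds are (atom_i, atom_j, order)."""
    atoms, bonds = [], []
    prev = None         # index of the atom the next atom will bond to
    branch_stack = []   # atoms to return to when a branch closes
    ring_open = {}      # ring-closure digit -> atom index that opened it
    order = 1           # bond order waiting to be used
    orders = {"-": 1, "=": 2, "#": 3}
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if smiles[i:i + 2] in ("Cl", "Br"):
            symbol, step = smiles[i:i + 2], 2
        elif ch in "BCNOPSFIcnos":
            symbol, step = ch, 1
        else:
            symbol, step = None, 1
        if symbol is not None:
            atoms.append(symbol)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1, order))
            prev, order = len(atoms) - 1, 1
        elif ch in orders:
            order = orders[ch]
        elif ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch.isdigit():
            if ch in ring_open:
                bonds.append((ring_open.pop(ch), prev, order))
                order = 1
            else:
                ring_open[ch] = prev
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
        i += step
    return atoms, bonds


def cutoff_edges(coords, cutoff):
    """Connect every pair of atoms whose 3D distance is within the cutoff."""
    return [
        (a, b)
        for a in range(len(coords))
        for b in range(a + 1, len(coords))
        if math.dist(coords[a], coords[b]) <= cutoff
    ]


print(smiles_to_graph("CC(=O)O"))   # acetic acid
print(smiles_to_graph("C1CCCCC1"))  # cyclohexane
# Made-up coordinates: the first two points are close, the third is far away.
print(cutoff_edges([(0, 0, 0), (1.5, 0, 0), (5, 0, 0)], cutoff=2.0))
```

The two functions give two different edge sets for the same molecule: one from the written bonds, and one from spatial proximity, which can also link atoms that are near each other without being covalently bonded.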

Lukas: And does that completely represent a molecule?

Bharath: Honestly, not at all. Real molecules are these very complex quantum beasts that have orbitals and extremely complicated wave functions. In fact, I'd say that when you get past really teensy molecules like helium (there's probably a few slightly more complicated ones), you actually don't know the quantum structure of these things. Until quantum computers arrive and we can run these simulations, we do not really have the ability to grasp the "true structure" of a molecule in most cases. So it's an approximation. It's useful for many purposes, though. But yeah, in many cases molecules are more complicated than we understand.

Lukas: So when you talk about an LSTM generating a molecule, it's generating, literally generating, a string that gets interpreted as a molecule?

Bharath: Exactly. With the SMILES language I mentioned, precisely what you do is treat it like a sentence generation task, but you're generating in the SMILES language. And oftentimes the challenge there is that if you do this naively, you'll generate grammatical errors, so the output is not an actual molecule. But there's been a lot of research by some groups, at MIT and UToronto in particular, that have worked out ways to constrain the generative models so that they are more likely to generate real molecules.
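
To give a feel for what a "grammatical error" in a generated string looks like, here is a minimal sketch of a syntax screen. It is an assumption-laden toy, not how any of the research groups mentioned constrain their models: the function name `passes_syntax_screen` is made up, it only checks that branches are balanced and ring-closure digits come in pairs, and it ignores two-digit `%` ring closures and all chemistry such as valence.

```python
def passes_syntax_screen(smiles):
    """Cheap check: balanced branches, paired ring digits, closed bracket atoms."""
    depth = 0             # how many branches "(" are currently open
    open_rings = set()    # ring-closure digits seen an odd number of times
    in_bracket = False    # inside a bracket atom like [NH4+], digits are not ring closures
    for ch in smiles:
        if in_bracket:
            in_bracket = ch != "]"
            continue
        if ch == "[":
            in_bracket = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # a branch closed before it was opened
                return False
        elif ch.isdigit():
            open_rings ^= {ch}     # toggle: first sighting opens, second closes
    return bool(smiles) and depth == 0 and not open_rings and not in_bracket


for candidate in ["c1ccccc1", "CC(=O)O", "C1CCCC", "CC(=O", "[NH4+]"]:
    print(candidate, passes_syntax_screen(candidate))
```

A string that passes this screen can still fail to describe a real molecule, so a generation pipeline would normally follow it with a full parse in a cheminformatics toolkit.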

Lukas: So I guess this sounds, you know, as an ML person, this sounds incredibly appealing, right? Like a kind of well-formed tricky ML problem that has the potential of saving lives. And I guess I wonder how much of this is real and how much of it is speculative? Can you point to an example of a drug that was created through this process or helped by this process?

Bharath: So absolutely not, unfortunately. This is kind of where it gets really fuzzy. On average (and I think Covid might actually speed up discovery in some cases), most of the time it's like 15 years from the first discovery, from starting a project, to actually getting to patients. There have been simpler computational techniques in use for decades now, so there is some degree of evidence that they help. But I don't think there's been a smoking gun. There isn't one molecule they can really point to and say that an AI made that. I think it's more like, you know, the process of using this program helped, in some fuzzy, hard to quantify fashion, the design of this compound.

Lukas: But it seems like the programs are kind of suggesting, or at least the framing that I hear from a lot of our customers is that the programs are suggesting, compounds to try. Which makes a ton of sense, right? Because you have to try something. So I assume that people have some non-random approach for this. It seems that there must be evidence now if these deep learning techniques work better for this kind of suggestion than other techniques. That seems pretty quantifiable. Or am I missing something?

Bharath: So, I think part of the challenge here is that it's hard. There are many steps in the process. There is a paper from Google recently where they showed that on one particular task, when they ran the experiment naively, it was like a few percent hit rate, that is, things that actually looked like they might work at that stage. And when they bootstrapped it by training a machine learning model and then making predictions, it was something like 30 percent. And, you know, that sounds like a giant boost, but I think that's one step out of like 20 in the process. So you take the thing that comes out of that and you go to the next stage, where you are like, well, this molecule's good, but it turns out that it gets caught up by the liver. We need to change it somehow so that it avoids that. And right now, the best way to do that is still to hire a seasoned team of medicinal chemists who can guide you through that process. In the later stages, it gets particularly gnarly because you have very small amounts of data. The Google paper was at an early stage, where they could programmatically generate large datasets, like 50 million data points or something. But in the later stages, you might have like a hundred. And then you are in that fuzzy no man's land in which machine learning is kind of witchcraft. So I think that's part of the reason. Maybe you started out with something that was AI-generated, but then 10 medicinal chemists came along, tweaked it here, tweaked it there. Then what do you have at the end? Honestly, we don't know. I think 10 years from now, maybe there will be a molecule we can point to. But for now, I think it's so fuzzy.
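
A quick back-of-the-envelope sketch shows what a jump in hit rate buys at a single screening stage. The 30 percent figure is the rough one quoted above; "a few percent" is taken here to be 3 percent purely as an assumption for the arithmetic, and the target of 10 hits and the function name `compounds_needed` are made up for illustration.

```python
import math


def compounds_needed(target_hits, hit_rate):
    """Expected number of compounds to test to get the target number of hits."""
    return math.ceil(target_hits / hit_rate)


naive_rate = 0.03    # assumed stand-in for "a few percent"
model_rate = 0.30    # "something like 30 percent"
target_hits = 10     # hypothetical goal for this one stage

naive_count = compounds_needed(target_hits, naive_rate)
model_count = compounds_needed(target_hits, model_rate)
enrichment = model_rate / naive_rate

print(f"naive screening: about {naive_count} compounds")
print(f"model-guided:    about {model_count} compounds")
print(f"enrichment:      roughly {enrichment:.0f}x")
```

Under these assumed numbers the model cuts the compounds to test at this stage by about a factor of ten. The hits that come out still have to survive every later stage, which is where the small-data problems described above take over.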

Lukas: It's kind of interesting that you said that. I mean, I totally resonate with the ImageNet moment, because I definitely remember the ImageNet moment for vision. I ran a company that was selling training data, and suddenly, you know, everyone flipped from wanting text training data to images, because suddenly all the image applications were working. But I guess what was kind of interesting was that I actually feel like the ImageNet moment came a few years after ImageNet. Not only did we see vision starting to work, but it took people a while to realize it. And then companies started to staff up. And now, you know, I can go on Pinterest and click on stuff and buy it right away. Or I can find my baby photos on my iPhone. But it seems like with this one, the medical companies have kind of staffed up maybe before it's clear that it's working. Because it does seem like deep learning is now important to basically every pharma company. I mean, it seems like this could be a setup for a real serious disappointment also.

Bharath: I think that's a very insightful observation, and I think you're totally right. If you talk to a pharma veteran, they'll tell you there's this old Fortune magazine from 1980 where they had some pictures of molecules on a computer, and they said it's going to be medicine on the computer, it's going to change everything. And of course, nothing changed. And I think, you know, even for the Human Genome Project, there was a lot of hype. People thought having access to the genome would change everything. But I think the recurring theme of biology is that billions of years of evolution always have more tricks behind them. So I think you're right. I think deep learning is a useful but not magical tool in the space right now. And I think that in some cases that disappointment has already hit people. In other cases, my hope is that people stick with it, because I think these techniques do have a lot to offer. But I don't think it's going to magically cure cancer. I think it'll be one useful tool in the scientist's toolkit to discover medicine.

Lukas: But what do you think caused people to feel this optimism? Machine learning techniques have been around for quite a long time, and I presume people were trying these on the same datasets. Is there something special about deep learning that makes it feel more promising in some way?

Bharath: It's a great question. You know, I think we all saw this amazing wave of deep learning hype. I think that ImageNet moment spread out into these other fields, and people started hoping. I think there are some genuinely new advances that deep learning on molecules has engendered. For example, the predictive models, when you have enough data, actually start working considerably better. This Google paper I mentioned a while back actually gets a considerable boost over a simpler model like a random forest, because it has enough data. The generative models can sometimes do clever things. So I think there is some substance in that sort of paper. But there isn't that breakthrough yet; I think there is the hope that it might lead to one. And just speaking for me personally, when I started working in this field, I didn't really understand any biology or chemistry. I think 9th grade bio class was my last formal training in the subject.

Lukas: You and me both [laughs]

Bharath: I had a good 9th grade bio teacher, but yeah. I think when you come in, you're like, well, tech can solve many hard problems, so why can't it solve this? Why not? And I think the answer is evolution has had billions of years, and that just builds up irreducible complexity sometimes. So I think it's still hopeful. I think there is real potential and value. But I think once you spend some time in it, you get some humility: the scope of the problem is much grander than, at least, I first realized when I was coming into the space. But yeah, I think the hype train just got ahead of the actual technology, and then it's like the Gartner hype cycle. I think now we're in that trough of disappointment, and then there's that slope of enlightenment coming up a few years from now.

Lukas: Interesting. People seem fairly optimistic for a trough of disappointment. It is an interesting perspective. Yeah, maybe we're still coming down, I hope not. One problem that I've always found in health applications is missing data. Are there datasets like ImageNet for these kinds of applications?

Bharath: So, honest answer: not really. I started a project called MoleculeNet a number of years back in grad school, along with one of my coauthors. Our intent was to gather as many datasets as we could to try to make something like ImageNet. And I think the honest answer is we helped a little bit. I think there is a useful collection of data and benchmarks we put together. But the challenge is that molecules are non... So in computer vision, I think object detection and object localization don't cover all vision tasks. I think there is some hard frontier of problems still. But you get a pretty big chunk of them. In molecules, it's more like there's just an entire range of things people want to do with them. You have a little bit of data for each task, and the tasks are often not related. So if you take a quantum mechanical dataset, you'll find that very different featurizations and algorithms actually work better than if you take a biophysical task or a biological task. So I think there is a reasonable amount of data in aggregate. But it's for different applications, and you can't easily blend it into one ImageNet-style mono-dataset yet.

Lukas: Interesting. It kind of reminds you of natural language processing with all of its different applications.

Bharath: I think there is a dream that maybe we can figure out some type of universal pretraining, akin to the GPT-2 models, that actually does get you to that universal molecular model. I think as of now, we haven't achieved it, but maybe it's not so crazy to think that we can. We do know that Schrödinger's equation, at some deep level, leaving aside relativity, is the best known model of these molecules we have. So maybe quantum computers will eventually help solve this. But it's a ways off for now.

Lukas: Interesting. And the experiments presumably are kind of expensive to run now.

Bharath: Yeah, I think there's the rise of mail-order services, things like Enamine or Mcule, where you can pick a molecule out of a catalog, then they'll make it for you and ship it to you. So it's a little easier than it used to be. You don't actually need to be a bench chemist. At the same time, you do still need to run an experiment. So oftentimes people will, say, use Enamine to buy it, and they'll use a second contract research organization to run the experiment, and they'll just keep track of quality control. So it is possible to do it, you know, not quite in your basement, I think. But maybe in a well-stocked garage where you can carefully coordinate many e-mail threads or something like that. But, yeah, it's expensive. It'll cost you somewhere between a few hundred and a few thousand dollars per compound, depending.

Lukas: We have a whole bunch of customers that are startups doing this type of thing, how do they hope to kind of compete with bigger companies when they don't have access to these datasets?

Bharath: That is a great question in many ways. Maybe I'm not the right person to ask, because I didn't found one of these startups. I think there is some advantage to coming at it with new eyes. When you're a very big company and are trying to introduce a shift in thinking, there is of course a lot of cultural inertia. Traditional startup versus Bigco dynamics. I think there is some potential to pick up interesting-looking fruit that people just haven't looked at. I think there is also, eventually, potential for mergers and acquisitions. Building a talented machine learning team can be difficult, and if you have a company that has succeeded and has shown some promise, maybe it's a good acquisition target. So I think there are fruitful paths forward for many of these companies. I think some of them are actually aiming really high. They want to be the next Genentech. And I think it is possible, but I think that might end up coming down more to your biologists than it does to your machine learning people. Perhaps I'm a bit of a pessimist on that front. I think core biology, the really foundational stuff, is still beyond our current machine learning and AI techniques. I think it's beginning to change as you get more genomics data, more biological material that you can feed into machine learning models, and there are a lot of companies at that frontier. But for now, I think it really is that if you have a crack team of scientists, that might take you further than a crack team of machine learning engineers. Ideally you have both, and then you have the best of all worlds.

Lukas: Though it just seems like the data collection process is so hard. It seems you might need to innovate there, too. I mean, I'm coming from my own background of data labeling. It seems so daunting, the idea that you have to order molecules somehow and run a wet lab. I guess, again, I have a whole bunch of different questions. One thought I have is probably one of the dumb things that people think of when they first hear about this stuff. But it seems like if you could model things about molecules, that's so powerful. That's the stuff everything's made out of. There must be applications besides biology that might be simpler. Is that true?

Bharath: I think absolutely. Now, unfortunately, the challenging part is that some of the most interesting applications are in places like batteries. But I think there are other fields. For example, the crop protection industry. If you make pesticides, herbicides, fungicides, it's pretty similar techniques.

Lukas: Really? I guess they deal with the properties of molecules.

Bharath: In fact, this is kind of coming back to that thin line between poison and medicine. If you actually take a look at some pesticides, they kind of look like the same small molecules you have in medicine, which might explain a few things about the world. I think there are also other applications, in industrial applications, probably in petrochemicals even. I think there is a bit. So there are absolutely other cases. But I think we in the software industry are sometimes used to working in our world of bits. Whereas when you get into these industries, at the end, you have to make something, and I think there is that slowdown. I think maybe batteries is actually the hardest. Pharma's a little behind that. I think some of these agricultural applications are a little easier to get to market, but still quite daunting. In general, it just kind of comes down to the fact that for a lot of these things, it's actually really easy to make something poisonous. And as governments and the industry have grown to recognize this fact, you just have this recurring thing where all of a sudden you invent a miracle something-or-other. Oh, plastics. Plastics were thought to be the wave of the future in the 1950s. They're also a type of molecular product. And now we find out that they choke seagulls, they choke baby turtles. There are microplastics everywhere. I think this is a type of generalized toxicity issue: we realize that if you make large quantities of a new substance that the world broadly isn't prepared to digest, what happens is 30 years down the line, you're like, oh, crap, I killed off the trout. I killed off the eagles. So it all comes down to the fact that living systems are extraordinarily complicated, and making something that is tested and safe for a living thing to interact with is actually very challenging.

Lukas: What about other medical applications? I think you wrote a book on this. Right. So, like, what are the other categories of things? And I guess, I'd be curious to your take on like how promising they are, it sounds like it's hard to separate the hype and you've probably thought deeply about this.

Bharath: I definitely think there is a whole host of really promising applications. To name one, I think microscopy is going to be completely changed by ConvNets. This is one of those magical places where ImageNet works: you can actually take an ImageNet model and stick it on top of a microscope and start doing pretty sensible things pretty quickly.

Lukas: What's an example of a thing that you might do with microscopy?

Bharath: One of the interesting things about this field is that you can pick up a lot more out of a microscope than you would have thought. There are some really interesting papers on this. There are, say, some readouts of a cell where traditionally you had to kind of destroy the cell, blow it up, in order to get at them. But people have started to show that you can instead get a dataset where you take the original cell, then you blow it up and get the readout, and then you can train a machine learning model to start to impute that readout from the raw cell, so you can potentially get non-destructive readouts that enable new things. This is kind of more basic science. It's not clear what the downstream effect is. There are a number of companies (I think Recursion Therapeutics is a prominent one) that have been using microscopy and machine learning broadly to do phenotypic screens. Earlier, I mentioned you often pick a protein target.

Lukas: Wait, I need to slow down for my 9th grade biology. A phenotypic screen is what?

Bharath: My apologies.

Lukas: No no, I know that phenotype, it's like the expression of a gene. Is that right?

Bharath: Yes, exactly. I think one way to think about it is maybe bottom-up design versus top-down design. Targeted drug discovery is maybe bottom-up. You say the human body is complicated, I'm going to be a reductionist, I think there is this one magic lever, and if I can switch that lever on and off, I can really change everything. That's you coming from the bottom, and then you hope it makes it all the way to the top. The other one, which is actually the more traditional way of finding medicine, is that some really smart doctor (this is like the penicillin story) notes some effect, and you have no idea what the effect is caused by. You don't really understand the intricate biophysics, the chemistry behind it. But you see it, maybe there's something that you just observe. I think the famous case of penicillin, wasn't it the mold on the bread? But for a phenotypic screen like the ones Recursion does, basically they have these cell-based assays where they grow cells in a petri dish. Essentially, you put a little bit of medicine in there, and then you see how the cell's state changes, and you use the microscope and the deep learning system on that to pick up those changes. So you can do this very rapidly.

Lukas: What would be an example change? Is that the cells are a different shape?

Bharath: That's a really good question. I think it often depends on the disease in question. A common one, say for cancer, the silly one, is: can you kill the tumor cells? The hard part there is whether you can kill them with something that isn't just bleach, something that's actually a medicine. For other readouts, it really depends on the disease. I think the general point is that diseases are complicated, so there are many proxies people use. The hierarchy of proxies is: if you have a pure test tube, which is just molecules, that's the weakest; if you have cells, that's a little better; if you have a rat, that's a little better still; but the gold standard, of course, is the human. So you can think of a cell-based screen as better than the pure test tube, but absolutely not the same as a human. It is a useful kind of proxy.

Lukas: So, okay. So what the method with the machine learning does is kind of find properties based on the images from the microscope?

Bharath: The way I like to think about it is that machine learning is kind of like making a better microscope. In many ways, if you go back to classical signal processing, we have all these, you know, Fourier transforms, high-pass filters, low-pass filters. And these traditional signal processing techniques made things like modern microscopy feasible in the first place. You had purely optical microscopes back in the day, but in the last century, I think there's been a lot of signal processing attached to them. So I think of deep learning in these applications as signal processing turned up to eleven. You can pull things out of the image for which there is no obvious way to write down that function. So I think right now it's more like this really fascinating scientific thing, where you know there's got to be something there.
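
As a rough illustration of the classical filters mentioned here (a sketch, not something shown in the conversation), the snippet below applies a moving-average low-pass filter and its complementary high-pass filter to a synthetic signal. The signal, the window size, and all function names are assumptions made up for the example; real microscopy pipelines use far more sophisticated processing.

```python
import math

def low_pass(signal, window):
    """Moving-average low-pass filter: each sample becomes the mean of its neighbours."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        start = max(0, i - half)
        stop = min(len(signal), i + half + 1)
        smoothed.append(sum(signal[start:stop]) / (stop - start))
    return smoothed

def high_pass(signal, window):
    """High-pass filter: whatever the low-pass filter removed."""
    return [s - l for s, l in zip(signal, low_pass(signal, window))]

def rms_diff(a, b, margin=5):
    """Root-mean-square difference, skipping `margin` samples at each edge."""
    pairs = list(zip(a, b))[margin:-margin]
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))

# Hypothetical measurement: a slow wave (100-sample period) plus a fast wave (5-sample period).
n = 200
slow = [math.sin(2 * math.pi * t / 100) for t in range(n)]
fast = [0.5 * math.sin(2 * math.pi * t / 5) for t in range(n)]
measured = [s + f for s, f in zip(slow, fast)]

# A 5-sample window spans exactly one period of the fast wave, so that wave averages out.
recovered_slow = low_pass(measured, 5)
recovered_fast = high_pass(measured, 5)

print("raw signal vs slow wave:     ", round(rms_diff(measured, slow), 4))
print("low-pass output vs slow wave:", round(rms_diff(recovered_slow, slow), 4))
print("high-pass output vs fast wave:", round(rms_diff(recovered_fast, fast), 4))
```

Here the filter is a function someone wrote down by hand. The point made above is that a ConvNet learns its filters from data, which is what lets it extract features nobody knows how to write down.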

Lukas: But I want to make sure I'm picturing it; I want to have a mental model. So maybe that example was evocative: did I kill that tumor cell? Is the point that the machine learning could tell me if the tumor cells were killed without me having to actually look at them? Or is it that the machine learning sees something deeper that I couldn't figure out if I looked at it?

Bharath: So I'll have to apologize up front because I'm not an expert at cellular biology but I'll try to. So, for example, I might be making this up, so if there are real biologists that eventually listen to this, please bear with me.

Lukas: No, it's a machine-learning audience, you can pontificate. By the way, I think machine learning people will be really familiar with the idea of just looking at results and not worrying about the process behind them. So I feel like this is very appealing to our machine-learning audience.

Bharath: You know, I do have to say I still have no idea what happens deep in layer 37 of my ConvNet. Imagine you have a muscle cell, and you can often measure, say, the stretchiness of the muscle cell. There are often ways to use that as a proxy for healthiness. I think the actual thing you measure depends a lot on the biology of the system. For example, one common thing is these things called fluorescent reporters, where you can engineer the cells so that if you have the drug and it actually hits something in the cell that you know about, it gives off light. Here, you have to know a little bit about what's happening inside the cell. You have to have a guess already. I think the cruder version might be, you know, you have this muscle cell you're looking at, and maybe there's some measure of how stretchy it is. Oftentimes it's just kind of obvious to the eye. It's like that traditional line: you know a dog when you see it. You see the healthy cells, they have some nice geometric shape, they look good. And you see the diseased ones and they're all shriveled up and just look bad. You can't quite write down that function, but you know it when you look at it. So it makes sense for a model to begin to pick this up.

Lukas: Right. And I guess I've seen versions of cancer cells at kind of different levels. What do they call them, biopsies? Where you look at the cells. It's 9th grade biology. I guess I can picture what you're saying, that there are healthy cells. My question is, what is the machine learning helping with? Is it sort of reducing the cost of looking at this stuff, or is it pulling out other signals that are somehow useful?

Bharath: I think it's a bit of both. Traditionally, the labor was that you'd have a grad student whose painful job, if you're unfortunate enough to be stuck in this lab, is to look at cell 1, 2, 3, all the way up to ten thousand. Now, I think there are a number of readouts where you just look and you kind of know there is a difference. So I think you can train yourself to read these things. This is, again, an interesting example you brought up, where you're training the model to basically pick out something, and you do it at a bigger scale. Maybe before I could only test 10,000 views; you know, the grad student union would revolt at that point. But now maybe I can test a billion, or I'm limited more by my supplies. I think the second question you asked is actually the more exciting one: is it possible we can pick out something we didn't know? I think there are glimmers that the answer is yes. I know there are a few papers that are doing things like identifying where the organelles are, and you can begin to do some more complex readouts. But I think there is almost a chicken-and-egg problem here, in that when you're discovering something, it's like unsupervised learning, right? If you know the thing you're looking for, then you can slot it into buckets pretty easily. It's harder if you want to go deeper and find something you don't know. I think yes, there are likely places where ConvNets act as amplified microscopes and pick up biology that we don't know. But if I knew that, I would have gone off and written a Nature paper about it already. I'm sure there are a couple that have already come out of this.

Lukas: Okay. So I have to ask you, one of the Nature papers that blew my mind, and I think a lot of people's, was the dermatology one where they fine-tuned an ImageNet classifier on cancers. That was not under a microscope, that was literally just photos. And that seemed so amazing. I mean, should I be as enamored with that as I felt, or are there some gotchas where it's not actually like that? Should we actually still be using doctors for these diagnoses? It sort of seemed from the paper that it was more accurate than the doctors' diagnoses, wasn't it?

Bharath: You know, I think that entire field, whether it's radiology or pathology or dermatology, where you look at some picture and then you diagnose from it, absolutely is a place where ConvNets will make a big difference. And I do think that these models achieve a striking advance over what you could do previously. My understanding is that the challenge there is that sometimes these models pick up things that are kind of silly. I remember there is this really excellent blog post that discussed models that turned out to have failed. There were scans from different trauma centers and the model was doing an amazing job, 99% accuracy. Any time you see 99% accuracy, you know something is up. It turned out there was some label at the bottom or something that printed the trauma center, so there was, like, light trauma, heavy trauma. Guess what that model learned to do right there. So I think it comes down to, what is the model learning? Is it a fluke? Is it an actual thing? Radiologists are tried and tested; do you really want to fire your world-class radiologist? So I think there is a natural caution there, in part because we don't really understand what happens deep in layer 37 of the ResNet. I think the FDA and some companies are moving forward. I do think that in places where there aren't enough doctors, this could potentially be a revolutionary advance, where you could get world-class scanning centers available in clinics throughout the world, and not just in places where you have excellent hospitals already. But I think it will take some time. I remember a number of years ago, I think maybe in the 80s, there was a whole wave of hype around expert systems for medicine and how they could diagnose patients. And I think it might have been in that same blog, there was a retrospective study that found that in many cases, hospitals that deployed expert systems actually had a fall in patient well-being afterwards, because there were these complex interactions that no one thought of in the first study. And then you find a number of years later that there is this unexpected side effect. So, yeah, that's a long-winded answer. I think it is something to be interested in and excited about. I think it will also take time to really vet it and make sure that this is something that improves patient well-being.
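
The trauma-center anecdote is an instance of what is often called shortcut learning or label leakage. The toy simulation below is a hedged sketch of that failure mode, not a reconstruction of the study in the blog post: the data are synthetic, `site_tag` is a hypothetical stand-in for text printed on a scan, and the "model" is a one-feature threshold rule (a decision stump) rather than a ConvNet.

```python
import random

random.seed(0)

def make_scans(n, tag_leaks_label):
    """Synthetic scans: feature 0 is a weak genuine signal, feature 1 is a site tag."""
    rows = []
    for _ in range(n):
        label = random.randint(0, 1)
        # Genuine but noisy evidence: two overlapping bell curves.
        signal = random.gauss(1.0 if label else 0.0, 1.0)
        # In the leaky dataset the tag simply equals the label; otherwise it is random.
        site_tag = label if tag_leaks_label else random.randint(0, 1)
        rows.append(((signal, float(site_tag)), label))
    return rows

def accuracy(stump, rows):
    feature, threshold = stump
    return sum((x[feature] >= threshold) == bool(y) for x, y in rows) / len(rows)

def fit_stump(rows, features=(0, 1)):
    """Pick the (feature, threshold) pair with the best training accuracy."""
    best_stump, best_acc = None, -1.0
    for feature in features:
        for x, _ in rows:
            stump = (feature, x[feature])
            acc = accuracy(stump, rows)
            if acc > best_acc:
                best_stump, best_acc = stump, acc
    return best_stump

train = make_scans(300, tag_leaks_label=True)    # tag gives the answer away
deploy = make_scans(300, tag_leaks_label=False)  # tag carries no information

shortcut = fit_stump(train)                # free to use the tag
honest = fit_stump(train, features=(0,))   # only allowed the genuine signal

print("shortcut model uses feature", shortcut[0])
print("shortcut: train", accuracy(shortcut, train), "deploy", accuracy(shortcut, deploy))
print("honest:   train", accuracy(honest, train), "deploy", accuracy(honest, deploy))
```

The shortcut rule looks perfect on the data it was trained on and collapses to roughly coin-flip accuracy once the tag stops tracking the label, while the less impressive honest rule holds up. That gap is why a suspiciously high score calls for checking what the model is actually keying on.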

Lukas: Although I guess I do want to know what happened with the melanoma model, because it does seem like, you know, doctors are also not perfect. And I also cannot inspect my doctor's brain to really know their decision-making process. So I wonder, is it unsafe to not change, or was there some real flaw or some simplification that wasn't obvious?

Bharath: I don't think there is a flaw in the paper. My guess is that... this isn't my field, so I'm sort of projecting a little bit out there. I know that deploying something in the clinic, on the health care side, is actually quite a bit more complicated even than on the biotech side. I think you have to work with insurers, work with payers, work with hospitals and doctors. You know, the American healthcare system has many known challenges. My sense is that this has just been very hard to actually get out there. So I think in pharma and biotech, the advantage is that if you get something to work, there is actually a very well-known path to get it to people. For advances like this dermatology thing, there's a fuzzier, more ill-defined path to get it out there in the wild. I think there are some real scientific questions around whether this is actually robust that still need an answer. But I think there are also harder business questions about whether this makes sense as a viable business. And I'm sure there are a dozen startups who are working on this right now, but I just don't know as much about it.

Lukas: Actually, my wife runs a healthcare startup, and she tells me that it's the only industry where you can literally save money and save lives simultaneously and not have a viable business.

Bharath: I've had a few friends who left health care and have formed ostensibly boring but very successful startups and are much happier with their lives. So I sympathize just a little bit. But, you know, you probably know way more about this than I do. It's a little bit outside of my expertise.

Lukas: Sorry to take you out of your expertise. But this is what I was hoping for with the podcast: that I could corner guys like you to ask all of my dumb questions. I really appreciate it. And I think we should kind of wrap up, because this might be getting long for the format. But we always end with two questions, and I always say this, but I really am curious how you're going to answer. What is one really underrated aspect of machine learning that you think people should pay more attention to? What comes to mind?

Bharath: That's a really good question. I think the view that machine learning is amplified signal processing is one that is not as commonly celebrated. But I think there are these really exciting things going on. Machine learning is finding its way into instruments, like into sequencers, into microscopes. It's a type of internet of things, but not the consumer version. I think traditionally, new scientific instruments are the predecessor to fundamental new scientific discovery. So when we find deep learning is making our instruments better and more capable, I think we're actually setting ourselves up to discover and build fundamental science. That's something I'm very excited by. But it's kind of a longer road. We might have the instrument and still need the Einstein or someone to come in and work with that and really get us that magical new understanding about the world. But I'm excited by that.

Lukas: That is a totally cool answer. I guess the instruments may give so many readings that it's hard to even interpret them, but a good algorithm would give you a few high-value, what you might call processed, outputs like that.

Bharath: I think it's still going to be quite a while before we see that. I think we talk a lot about AGI, and I know there are many ways in which you could get a general intelligence. But think about the process of induction, of extrapolating things about reality from very few hunches. This is probably made up, Newton and the apple tree. It probably didn't happen that way; we know it's a just-so story. But you could imagine some machine learning model seeing that. Can you somehow extrapolate from that out to the universal law of gravitation? That, I think, would be amazing. It just seems far beyond our current science.

Lukas: I feel like with all these medical applications, the reason I naively find them exciting is that if you're trying to compete with a human at navigation and driving, our brains are designed for that. Clearly, a huge part of our brain is there just to navigate the world and not crash into stuff. But it doesn't seem like our brains are designed for interpreting molecules that we can't see and what effects they might have. I mean, I'm still trying to visualize it in my head, and I can't even do it. So it sort of seems like maybe the bar is lower for a useful algorithm.

Bharath: I think that's a really interesting point. I do think understanding quantum mechanics, at least, doesn't quite fit in my head. There are lots of complicated things going on in that hidden world. Maybe part of the challenge is that it's hard to validate a discovery. Many times a model says something, but after you spend a while on it, 9 times out of 10 you're like, what bullshit did the system pick up this time? And I think the challenge there is, like you said, we have to make the models robust enough that there are actually high-quality signals coming out, so we can tell, oh, that's a clue, or no, I don't know what hiccup happened there in, you know, step two thousand of gradient descent. So I think that's maybe the challenge, where we just haven't gotten there. I think this is beginning to change. It still feels like discovery, like invention, is the province of the human and not the machine. But, you know, maybe that's the antiquated line, and 10 years from now AI will have discovered everything, and I'll be like, well, that aged poorly.

Lukas: It will be an interesting world that comes to pass. All right. So the final question is, right now in 2020, and I guess it's already June, what do you think is currently the biggest challenge of making machine learning models work in the real world? In your experience, what are the challenges that you've run into? What have been the surprising hurdles?

Bharath: I think the things more specific to me are often about small data. Like, again, you have 30 data points, and oftentimes it's a very well-meaning scientist who comes and says, what can you do for us with 30 data points? And oftentimes I'm like, oooh, I wish I had a better answer. Sometimes you just try seven things: you try transfer learning, you try multitask learning, meta-learning, and all the learning fails. And then at the end, the random forest is like, yeah, it's not great, but it does something. So for things I'm excited by, I think robust transfer learning that actually works on small data, which has occurred in NLP but I think has not occurred in molecules, would be an amazing advance for this field.
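
With only about 30 points, a single train/test split is too noisy to say whether anything "does something". One common way to check is leave-one-out cross-validation against a trivial baseline. The sketch below illustrates that idea on made-up data; it swaps the random forest mentioned above for a small k-nearest-neighbour predictor so that it runs with the standard library alone, and every name and number in it is an assumption for illustration.

```python
import random

random.seed(1)

# Hypothetical dataset: 30 measurements with a noisy linear trend.
xs = [random.uniform(0, 1) for _ in range(30)]
data = [(x, 2 * x + random.gauss(0, 0.3)) for x in xs]

def predict_mean(train, x):
    """Trivial baseline: ignore x and predict the average training label."""
    return sum(y for _, y in train) / len(train)

def predict_knn(train, x, k=3):
    """Simple model: average the labels of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def leave_one_out_mse(data, predict):
    """Hold out each point in turn, train on the other 29, and average the squared error."""
    errors = []
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        errors.append((predict(train, x) - y) ** 2)
    return sum(errors) / len(errors)

baseline_mse = leave_one_out_mse(data, predict_mean)
knn_mse = leave_one_out_mse(data, predict_knn)

print("mean baseline LOO MSE:", round(baseline_mse, 3))
print("3-NN LOO MSE:         ", round(knn_mse, 3))
```

If a fancier model cannot beat the baseline under this kind of check, the extra machinery is probably fitting noise, which matches the experience described above of the simple model being the one that survives.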

Lukas: It's so interesting that it hasn't occurred, because I think it's also totally happened in vision for sure, and in NLP now, definitely. It's so interesting that it doesn't work for molecules.

Bharath: It might just be data. I think if someone just found a gigantic trove of molecular measurements, as long as it is high quality, you could build on it.

Lukas: But collecting it... nobody is going to just find that, right?

Bharath: I think this is one thing where a governmental effort could do amazing work. You know, to be fair, I think governmental agencies have actually put out most of the open-source data out there, so they are actually working hard at this. But, yeah, maybe it's the sort of thing where if you get a $10,000,000 grant or something, I think you could make a serious dent in putting together a high-quality open dataset for this. But it is more expensive than ImageNet, and it will take more resources, because you have to do the actual experiments.

Lukas: Great answer, I love it. Well, thank you so much. Is there like someplace we should tell people to contact you or is there a thing you want to promote, maybe DeepChem. Everyone should try it.

Bharath: Absolutely. I think part of the goal behind DeepChem is to make open source more feasible for drug discovery. So I think we could definitely use more users. In particular, if you are an engineer who knows how to handle build processes well, please get in touch. You know, I am trying to figure out the Windows and other builds and it is such a pain. I am too much of a scientist. We could absolutely use more help. So if you are interested in open science, please do get involved.

Lukas: I love it. Thanks, Bharath.

Bharath: My pleasure. Thank you for inviting me.
