Building Machine Learning Tools with Hamel Husain
Hamel Husain, Staff Machine Learning Engineer at GitHub, talks about GitHub Actions, the CodeSearchNet challenge and the tools they're building to advance progress in AI
TRANSCRIPT

Hamel Husain is a Staff Machine Learning Engineer at GitHub. He has extensive experience building data analytics and predictive modeling solutions for a wide range of industries, including hospitality, telecom, retail, restaurant, entertainment and finance. He has built large data science teams (50+) from the ground up and has extensive experience building solutions as an individual contributor.

Learn more about GitHub Actions and the CodeSearchNet Challenge.

Follow Hamel on Twitter. And on his website.


Lukas: Hamel, thanks so much for taking the time to talk to us today. You are the first guest that we've had that I've actually worked with before. So, we have a lot to talk about, and I thought, the cool thing that we worked on, which I'd just love for you to describe, is the CodeSearchNet project that you spearheaded. Can you describe it for somebody who doesn't know what it is, and what the goals are?


Hamel: So GitHub has a large corpus of code, as you might imagine. You know, there's all these open source repositories in many different languages. And in the machine learning community, natural language processing is a really exciting space, especially with deep learning. There's a few datasets that people really like for that, but one that hasn't received much attention is perhaps a large corpus of GitHub data. And the thing is that the data is open, it's already open source, but the barrier to entry is kind of high, especially when you're talking about how to tokenize code or parse code. It's very complicated, especially when you strip out comments or do something like that. So internally, with this GitHub project, we wanted to explore representation learning of code and see if we could learn a representation of code.


Lukas: Sorry. What exactly do you mean by a representation of the code? Like some abstract representation or how do you think about that?


Hamel: Oh yeah. Sorry. I mean it in the very canonical machine learning sense, like learning an embedding of code. An embedding that aligns with natural language, that was one of the experiments that we wanted to try, to then see if we could use that to boost search results. So someone types in a query... You know, a lot of people don't like GitHub search, understandably, and, you know, if you're trying to search for code, right now it's keyword search, so you have to have a good idea of what the syntax is or what keywords may be in the code that you're trying to find. But what if you don't know that? What if you're trying to search for some kind of concept in code? Is that possible? And so we started exploring that to see if it would be possible, perhaps, to use machine learning to learn an embedding of code to then do some kind of semantic search. So, one of the interesting parts about that is you might wonder how you would go about doing that. How would you go about learning some embedding of code? And so we were stealing a lot of ideas from natural language processing, you know? It is useful in natural language processing if you have a parallel corpus of, let's say, one language to another language, like for language translation. So we thought about that and said, that's interesting. I wonder if we can do that with code, since there is a lot of natural language that happens to be inside code, specifically in comments, where people naturally are sort of labeling what code does. Now this is a tough problem because comments can be everywhere, but not necessarily in the same place, or at the same level of granularity, or in the same format. And so what we did is we scoped down the problem to methods and functions in various languages and looked at the docstrings, or what is equivalent to a docstring in Python, which is some comment that documents what that function is doing.
So, we constructed a large parallel corpus of all of these things, and we did some experimentation, and it was a really exciting project. We had to stop in the middle for various reasons, as sometimes happens with machine learning projects that are ambitious like this. But in doing so, we thought we should open source the data and we should present the results we have and give it to the community so they can take it forward from there. So the CodeSearchNet challenge is this large parallel corpus of code and natural language, plus benchmarks of our attempt at doing information retrieval: given a query of some kind, can you identify the piece of code that goes with it? In this case, if you are given a docstring, can you find the code that was originally paired with it? So, the benchmark is an information retrieval task and we thought, "okay, even though we're pausing on this for a moment, we know that everybody is interested in this". Not everybody, but a lot of people may be interested in this. We saw various research labs exploring similar problems, so we thought we should get in the mix, being GitHub. And so that was the kind of impetus to release this dataset and of course, we had a wonderful partnership with Weights and Biases -


Lukas: I've heard of them [laughs]


Hamel: Who host the benchmarks and the leaderboard of the different submissions that people have and the improvements to the model. And I think what I really like about Weights and Biases is the transparency. So like in Kaggle competitions, you only really get to see what is behind the scenes if the author chooses to release the code, but with Weights and Biases you can see all kinds of detail, it's very transparent and you get very rich logs of what happened during the training process. So that's really helpful and I think the whole community can see that. And I think that helps drive that forward.
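
The parallel corpus Hamel describes pairs each function with its docstring. As a minimal sketch of that idea for Python source only, the snippet below uses the standard library `ast` module to collect (docstring, code) pairs and removes the docstring from the code side. The function name `extract_pairs` and the sample source are illustrative assumptions; this is not the actual CodeSearchNet extraction pipeline, which covers several languages.

```python
import ast


def extract_pairs(source: str) -> list[tuple[str, str]]:
    """Return (docstring, code) pairs for every documented function in `source`."""
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if not doc:
                continue  # undocumented functions have no natural-language side
            # Remove the docstring statement so the code side does not
            # contain the text it is supposed to be matched against.
            node.body = node.body[1:] or [ast.Pass()]
            pairs.append((doc, ast.unparse(node)))  # ast.unparse needs Python 3.9+
    return pairs


# Hypothetical input used only for illustration.
SAMPLE = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

def undocumented(x):
    return x
'''

pairs = extract_pairs(SAMPLE)
for doc, code in pairs:
    print(doc)
    print(code)
```

Only `add` yields a pair here, because `undocumented` has no docstring to serve as the natural-language half.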


Lukas: So just to clarify, the challenge is to find the code that best matches the docstring.


Hamel: Yeah. So there's a couple of different tasks. There's one task, matching a docstring to code, which is sort of a proxy for search, since a search query may not look like a docstring. It probably doesn't, really. So that task, comparing the docstring with the original code, is a proxy for search. We also have some actual search queries that people have run against Bing; since we are part of Microsoft we can pull that. And, you know, we have some ways of finding out things like what page they landed on and inferring whether that was the thing they were looking for. So we have another test set that is actual queries from a search engine. Even that is not perfect because the task that we would want is to simulate a more scoped search of code, not a global search like on Google or Bing of how to do something. I think that's a solved problem. That works really well, at least for me. So I would say the task isn't perfect, but we did whatever we could in the time we had.
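
One common way to score a docstring-to-code retrieval task like this is mean reciprocal rank (MRR): for each docstring, find the rank of its original code snippet among the distractors and average the reciprocals. The sketch below assumes a precomputed similarity matrix in which the correct snippet for docstring `i` sits at column `i`; the scores are made up for illustration.

```python
def mean_reciprocal_rank(scores: list[list[float]]) -> float:
    """scores[i][j] is the similarity of docstring i to code snippet j.

    The correct snippet for docstring i is assumed to be at index i.
    """
    total = 0.0
    for i, row in enumerate(scores):
        # Rank is 1 plus the number of distractors scoring strictly higher
        # than the true pair.
        rank = 1 + sum(1 for j, s in enumerate(row) if j != i and s > row[i])
        total += 1.0 / rank
    return total / len(scores)


# Toy similarity matrix (hypothetical numbers).
SCORES = [
    [0.9, 0.1, 0.2],  # true snippet ranked first
    [0.8, 0.5, 0.1],  # true snippet ranked second
    [0.1, 0.2, 0.7],  # true snippet ranked first
]

mrr = mean_reciprocal_rank(SCORES)
print(round(mrr, 4))
```

An MRR of 1.0 would mean every docstring retrieved its own code first; lower values mean the right snippet tends to sit further down the list.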


Lukas: I think it's awesome that you released the data. Since we worked together, I think I understand it. It's like, I search for some algorithm, maybe like insertion sort, and then I map that to code that actually does insertion sort, in an embedding space where it's close to the embedding of "insertion sort"... Right?


Hamel: Yeah, absolutely. Like, if the code doesn't contain the word "insertion" or "sort", you can still find that code because it does the magic. Yeah, it does. And so the idea is to enhance discoverability of code, to do various things.
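
To make the embedding-space search concrete, here is a toy sketch: the query and each code snippet are represented as vectors, and the search returns the snippets whose vectors have the highest cosine similarity to the query. The three-dimensional vectors and snippet names are hand-written placeholders; in a real system both sides would come from a trained encoder.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(query_vec: list[float], code_index: dict, top_k: int = 1) -> list[str]:
    """Return the top_k snippets whose embeddings are closest to the query."""
    ranked = sorted(
        code_index,
        key=lambda snippet: cosine(query_vec, code_index[snippet]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical embeddings. Note that the first snippet never mentions
# "insertion" or "sort", yet its vector sits close to the query vector.
CODE_INDEX = {
    "def order_items(xs): ...": [0.9, 0.1, 0.0],
    "def read_config(path): ...": [0.0, 0.2, 0.9],
    "def fetch_url(url): ...": [0.1, 0.9, 0.2],
}
QUERY = [0.8, 0.2, 0.1]  # pretend embedding of the query "insertion sort"

best = search(QUERY, CODE_INDEX)
print(best)
```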


Lukas: It's super cool. We'll put a link to materials so folks can find that if they're more interested. So we actually haven't talked about this but I was looking at your background and seeing that, you already worked at DataRobot and you wrote about AutoML. This is another question we get all the time. What do you think is the state of AutoML? Is it something you use in practice? Would you recommend it to people? When is it useful and is it not useful?


Hamel: I think AutoML is really misunderstood a lot of times.


Lukas: Maybe define it first because we may be talking about different things.


Hamel: I think we're talking about the same thing, but I'll define it just for clarity. So I would say AutoML is tools that help you automate your machine learning pipeline as much as possible. And that can mean various different things. Obviously, that definition has a very large scope. I mean, of course, you can automate some parts of your machine learning pipeline. You can try to automate the whole machine learning pipeline. You know, what part are you automating exactly?


Lukas: Totally


Hamel: That kind of gets into the weeds. But I would say something that sufficiently automates a lot of it, you know, you can kind of bucket that into AutoML, but even that is... I mean, I don't know if there's like an official... There's an organization that has a definition. I don't remember it off the top of my head, but that's the way I would define it personally. So, you're right. I did work at DataRobot, which is a company that has a piece of software as a service, and they're one of the first folks to really put AutoML out there. I would say before DataRobot, tools that you may have seen are, I think it was called Weka or Auto-WEKA, and then maybe RapidMiner was one that was popular; things that sort of automate the machine learning pipeline. So drilling in a little bit more, what DataRobot does is you feed it a prepared set of data, so you've already done tons of work before getting to this process, and you have a target variable and all these features. And then DataRobot tries a lot of feature engineering steps, almost like problem-agnostic feature engineering steps, and it tries many different algorithms, all from open source. It benchmarks them against each other and does lots of diagnostics, like an incredible amount of diagnostics on your data, and then gives you a leaderboard of all these different models. Again, they're all from open source, and that's one flavor of that; there's other people who do this. So like H2O has a product where they have, I think they call it AutoML or self-driving ML or something like that. I don't know what it is exactly, but they do something like this. The reason why AutoML is misunderstood is people think of it in a certain way, put it in a box and they say, "this can't replace a data scientist" or the objection is, "why would I? I can beat this thing. I have domain knowledge. 
Why would I want to use an AutoML system when I can build a model with my domain understanding to fit the needs better? Why would someone want to use AutoML?" That's the common misunderstanding. I wrote about this when I was at Airbnb, the blog you're referencing, and I think the way that it's used most effectively is to really augment a data scientist. You may not use any of the models produced from the AutoML system, which kind of sounds ironic, but really an AutoML system gives you a lot of information from the very beginning. So I think it's really important to have a baseline, and the better your baseline, the better off you are. And so you can use an AutoML system to give you a pretty competitive baseline to begin with. The reason that a lot of people use linear regression or some simple model or just the average as a baseline, or whatever it might be, is that it's easy to do, and you need a baseline to compare with what's going on. So, that's helpful, and then also you get a lot of diagnostics, you get something that automatically explores your dataset and that you can read. You're just getting more information about the task at hand. If you do that, you can use that information really effectively to go and build your own custom model and start with some more hypotheses about what might work or what might not work, or invalidate some hypotheses. I mean, it's not uncommon to hear, I hear this all the time, you know, data scientists will say, you know what, random forests, they don't really work on this problem or on this dataset or whatever. Neural nets, they don't work. But, you know, what if the AutoML system produced much better results than you did, using that model? That's really interesting. Like, why did that happen? That happens a fair number of times. 
When I was at Airbnb, which you know has a lot of talented people, it was really interesting, like sometimes the AutoML system would give you some result that is something you didn't expect. And so, I think it's just really interesting. It's a way of using AutoML, and I don't really see AutoML replacing data scientists, but I see it as an incredibly useful tool. I mean, just even in doing lots of exploratory data analysis. I know that sounds trivial or easy, but just to have something that does that really nicely for you and gives you all kinds of statistics and metrics and all kinds of graphs, for free. It is just a head start. That's my spiel on AutoML.
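
The "leaderboard plus baseline" idea can be sketched in a few lines: fit several candidate models on the same training data, score each on held-out data, and sort by error so the trivial baselines sit next to the stronger models. The three candidates, the toy data and all names below are illustrative assumptions; this is not how DataRobot or any other AutoML product is implemented.

```python
import statistics


def fit_mean(xs, ys):
    """Baseline: always predict the training mean."""
    m = statistics.mean(ys)
    return lambda x: m


def fit_median(xs, ys):
    """Baseline: always predict the training median."""
    m = statistics.median(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Ordinary least squares for a single feature."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def mae(model, xs, ys):
    """Mean absolute error of `model` on the given points."""
    return statistics.mean([abs(model(x) - y) for x, y in zip(xs, ys)])


def leaderboard(candidates, train, holdout):
    """Fit every candidate on `train`, rank by holdout error (best first)."""
    rows = [(name, mae(fit(*train), *holdout)) for name, fit in candidates.items()]
    return sorted(rows, key=lambda row: row[1])


# Toy data (hypothetical): roughly y = 2x.
TRAIN = ([1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 8.1, 9.8, 12.1])
HOLDOUT = ([7, 8], [14.0, 16.1])
CANDIDATES = {
    "mean_baseline": fit_mean,
    "median_baseline": fit_median,
    "least_squares_line": fit_line,
}

board = leaderboard(CANDIDATES, TRAIN, HOLDOUT)
for name, error in board:
    print(f"{name}: {error:.3f}")
```

The value of the table is less the winner than the comparison: if a hand-built model cannot beat the top rows, or an unexpected model family ranks first, that is a prompt to ask why.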


Lukas: It's interesting. Do you make a distinction between AutoML and hyperparameter optimization?


Hamel: I think hyperparameter optimization is part of AutoML. Like, if something is really AutoML, it should also be doing hyperparameter optimization, and that's what a lot of these AutoML frameworks do. But I wouldn't say hyperparameter tuning on its own is what I would call AutoML. I mean, this word AutoML is, like I said, very hard to define, but it's definitely something that should be included in there.
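
For contrast with the broader AutoML pipeline, hyperparameter optimization on its own can be as small as the grid search sketched below: one model family (a one-dimensional k-nearest-neighbours regressor), one hyperparameter (`k`), and a validation set to pick the best value. The data, the grid and the function names are assumptions chosen to keep the example self-contained.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def validation_error(train, valid, k):
    """Mean absolute error on the validation set for a given k."""
    return sum(abs(knn_predict(train, x, k) - y) for x, y in valid) / len(valid)


def grid_search(train, valid, grid):
    """Evaluate every k in the grid and return the best one plus all scores."""
    results = {k: validation_error(train, valid, k) for k in grid}
    best_k = min(results, key=results.get)
    return best_k, results


# Toy data (hypothetical): noisy samples of y = x.
TRAIN = [(0, 0.0), (1, 1.2), (2, 1.9), (3, 3.1), (4, 3.9), (5, 5.2), (6, 5.8), (7, 7.1)]
VALID = [(1.5, 1.5), (3.5, 3.5), (5.5, 5.5)]

best_k, results = grid_search(TRAIN, VALID, grid=[1, 2, 4, 8])
print("best k:", best_k)
for k, err in results.items():
    print(k, round(err, 3))
```

An AutoML system in Hamel's sense would wrap a loop like this inside a larger search over feature engineering steps and model families.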


Lukas: But what extra stuff does it do? Like try different algorithms, you mean or what does it do on top of that?


Hamel: So a lot of the magic of, for example, DataRobot (because I'm really familiar with that product) is that it does a lot of feature engineering. It's built by people that are grandmasters at Kaggle. They have about three or four or actually more former grandmasters, and then a lot more people close to that. They've taken all their experience of winning these competitions and put a lot of recipes in there. So, things like, "If you're using a tree-based model, here are some feature engineering steps to try", "If your data contains text, let's try these feature engineering steps", "If your data looks like this, let's try something else". There are a lot of these rules built in, but it also has a lot of different recipes included, things like model stacking and ensembling models, and, you know, it includes hyperparameter tuning. And there is an incredible amount of diagnostics, so you get feature importance type of stuff in many different ways, you get a lot of model explainability stuff. And so all that information is pretty useful, regardless of what model you're going to go with, to understand something really fast.


Lukas: I feel like a lot of people think of AutoML as a way to get the best model.. And yeah, it's funny, I start to tell people that... I also share your view that it's a good way to do exploration or at least a hyperparameter search, I think it's a great way to understand the domain you're in but I guess maybe it's because Google has that automated product where you actually don't get a lot of data, I think you just get the best model out of it, that sometimes I think of AutoML specifically as just a way to find the best model. But, of course, if you get to see all the different runs and how they did, that would be incredibly useful to learn about your dataset.


Hamel: Yeah, I agree, and yes, some products, I think they actually think maybe they've designed it in such a way where they think of it as a black box and just give you the best model. I don't really think that's incredibly useful.


Lukas: Why isn't it useful though? I mean, because I think from a business owner's perspective, they might be like, "Awesome, it's a pain in the ass to make models. Let me just get done with this."


Hamel: Well, I would say it is still useful. I agree with you, but it's not as useful. I mean, I would say being able to see everything and learn from it is a lot of value added and  you can still build it and not look at it, all that stuff if you want to, but, I find it to be useful as a data scientist, I found out that it makes me a lot better, you know, and helps to check some stuff.


Lukas: When you were working at DataRobot, do you think most teams were using DataRobot more for an exploratory use case or for an optimization purpose?


Hamel: It was pretty mixed. So I worked with a lot of customers there, and a lot of people already had a model in production of some kind. And, you know, they just wanted to... and that was really an excellent use case, I mean, plug your data into DataRobot and see what's going on. And that was a really popular use case. Another one was: okay, you don't know how to get started, you're taking a long time to build a model, you want to use this. A lot of people actually used the product to help learn about data science, because it's so transparent. They could see all the steps, like you did, and they would learn about it, like the workflow. But I think there was a mix between taking something you already have, putting it in there and exploring. I would say people that already had data scientists, experienced ones, found it really useful to get new ideas they hadn't considered before.


Lukas: Makes sense. All right. So Hamel, what are you working on now? What's your day like? Can I guess? I've just followed you on Twitter, I haven't talked to you in a while. It seems like you are really into GitHub Actions, which I don't totally understand, but I want to learn.


Hamel: Oh, yeah, that's a great question. Yes, I am. I mean, I didn't try to be, but it just worked out that way. So you're absolutely right. I kind of became the GitHub Actions person for data science by accident, I think. To answer your question, I'm really interested in tools that can help data scientists, and building those. That's why I like talking to you so much, because you're also interested in that.


Lukas: Totally.


Hamel: That's what I was doing, that's why I was interested in working at DataRobot, and I did a lot of that at Airbnb, and then at GitHub I'm doing that also. And so one of the things that I've tried to do is to find ways in the short term (there's some long-term things I'm working on which we're exploring), but in the short term, what can I do right now with the products that GitHub already has to make data science easier? And so it's being creative and trying to figure out what I can provide. So, like, there's a couple of examples of that. One is CI/CD: continuous integration and delivery for machine learning workflows. So GitHub launched Actions, as you alluded to. When I saw that launch, I realized that there is an opportunity to construct CI/CD plugins that will allow people to have CI/CD workflows for machine learning.


Lukas: Just 'cause I think a lot of people might not know, including myself, what is a GitHub Action and why is it useful for this?


Hamel: Yeah, that's a good question. On the surface, GitHub Actions looks like another CI/CD system, like Travis or CircleCI or something like that. You know, compute that you can run, triggered by some GitHub event.


Lukas: And this runs on GitHub?


Hamel: Yeah, this runs on GitHub. But it's differentiated in a couple of ways. One is: you can fire a GitHub Action to run on any event, almost any event. So an event means opening an issue, commenting on the issue, opening a PR, labeling a PR; just think of almost anything that happens on GitHub, you can trigger Actions to run arbitrary code based on that event. And then the reason why Actions is special is for two primary reasons. One is, all the metadata associated with the event that triggered the action is hydrated into the action's environment. So, if you want to know who commented on that PR or whatever, it's super easy to do because it's available inside the environment. Secondly, let's say you create a workflow that is super useful, something like, "I'm going to run a machine learning workflow, log my metrics to Weights and Biases." And then you report the metrics back into the PR in GitHub. That's pretty useful. And so if you want to, you can package it up a little bit and you can say, "Okay, I have this workflow." It expects this input. It expects a run ID, and as an output, you know, it will comment on the issue with this formatted table or something like that. I'm simplifying it. But really, I mean we can talk about that. And then I can just use your workflow. I don't have to know anything about how you did that. I can just say, "Hey, this action and this workflow is pretty cool, I just have to feed in a run ID or something like it and it will do the same thing on my repo." So, I can just reference your packaged workflow from your repo. I don't have to install anything or do anything, and so I can compose many different workflows together that do very modular things.
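
As a rough sketch of what "hydrated into the action's environment" means in practice: a step running inside a workflow can read the `GITHUB_EVENT_NAME` environment variable and the JSON payload at the path given by `GITHUB_EVENT_PATH`. The helper names and the sample payload below are illustrative assumptions; the payload shows only a few of the fields an `issue_comment` event carries.

```python
import json
import os


def load_event(env=os.environ):
    """Read the triggering event's name and JSON payload from the environment."""
    name = env.get("GITHUB_EVENT_NAME", "")
    path = env.get("GITHUB_EVENT_PATH")
    payload = {}
    if path and os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            payload = json.load(fh)
    return name, payload


def describe_comment(name, payload):
    """Pull out who commented, what they said, and on which issue or PR."""
    if name != "issue_comment":
        return None
    return {
        "author": payload["comment"]["user"]["login"],
        "body": payload["comment"]["body"],
        "number": payload["issue"]["number"],
    }


# Trimmed, hypothetical payload so the sketch runs outside of GitHub.
SAMPLE_EVENT = {
    "comment": {"user": {"login": "octocat"}, "body": "Looks good to me"},
    "issue": {"number": 42},
}

info = describe_comment("issue_comment", SAMPLE_EVENT)
print(info)
```

Inside a real workflow run, `describe_comment(*load_event())` would give the same structure without any API calls, which is the convenience Hamel is pointing at.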


Lukas: Sorry. Dumb question. So, how do I pass in the run ID, like where would that happen?


Hamel: So with every step in an Actions workflow, if you are using a packaged action, there's input. That input can come from anywhere; you can hard-code a string or you can say that input can come from another step in the workflow. Like specifically for Weights and Biases, because I love them so much, I made an action that does this. So what happens is I actually log the SHA, the commit SHA that is associated with a machine learning workflow, to Weights and Biases, and then when the model finishes training, it takes that SHA and pulls the matching run from Weights and Biases, so that becomes the input. That's one way of doing it. And then another thing I do is when I deploy a model: a comment goes into my PR with a table of all the different runs and the run IDs from Weights and Biases. And then I just have a chat command. I say, "backslash deploy run ID." And then in Actions, I parse that command and I say, "Okay, give me the run ID," and I pass it into another action that then takes that run ID as an input, and then it goes out to Weights and Biases, downloads the model and pushes it to my serving infrastructure.
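
The chat-command step can be sketched as a small parser that pulls a run ID out of a PR comment and hands it to later steps. The exact command syntax is an assumption here: the sketch uses `/deploy <run-id>` on its own line, and writes the result to the file named by the `GITHUB_OUTPUT` environment variable, which is how a step exposes an output to subsequent steps in current GitHub Actions. The function names and the output name `run_id` are hypothetical.

```python
import os
import re

# Assumed command format: "/deploy <run-id>" alone on a line of the comment.
DEPLOY_PATTERN = re.compile(
    r"^/deploy[ \t]+(?P<run_id>[\w-]+)[ \t]*\r?$", re.MULTILINE
)


def parse_deploy_command(comment_body):
    """Return the run ID from a deploy command, or None if there is none."""
    match = DEPLOY_PATTERN.search(comment_body)
    return match.group("run_id") if match else None


def emit_output(name, value, env=os.environ):
    """Expose a step output by appending name=value to the GITHUB_OUTPUT file."""
    path = env.get("GITHUB_OUTPUT")
    if path:  # only set when running inside a workflow
        with open(path, "a", encoding="utf-8") as fh:
            fh.write(f"{name}={value}\n")


# Hypothetical comment body for illustration.
COMMENT = "Metrics look good.\n/deploy 1a2b3c\n"

run_id = parse_deploy_command(COMMENT)
if run_id is not None:
    emit_output("run_id", run_id)
print(run_id)
```

A downstream action would then receive that value as one of its declared inputs and use it to fetch the model from the experiment tracker.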


Lukas: Well, Hamel, I didn't know you did that. Can we give that to other customers?


Hamel: Yeah, I really do. Yeah, it's super cool. I think it's something I'm really excited about actually.


Lukas: Nice, it's something I would love to play with. So, basically GitHub Actions lets you build developer workflows and you're using it to do essentially CI/CD for ML.


Hamel: Let me paint a clearer picture. Imagine a scenario you may see all the time. You open a PR, you want to make a change to a model of some kind. Happens all the time. And you want a way to be able to see if this model is better than the baseline or whatever you might have in production. Now, how do people do that today? Even if you have something really cool like Weights and Biases, that's still a separate system from GitHub. And you might have to go out to Weights and Biases and copy and paste the link into the PR and say, "hey, I ran this code in there", but that's prone to error. Like that might be a stale run. You might have changed the code since then. You never know. And it's all in different places. You have to go back to the PR, you know, that's like a manual process. It's not a good practice. Machine learning workflows can take a long time to run, from a day to a week, whatever it might be, but sometimes you might want to do a full run before you merge some code. And so with GitHub Actions what you can do is you can say, in your PR, "I'm ready to test a full run of this model" and then you can issue a chat ops command, say "run full test", and then your model runs on the infrastructure of your choice and logs to the experiment tracking system of your choice. And then it dumps metadata into the PR in the form of comments and other things, where you can see all the results and the diagnostics of this model that you want. And then finally, you can decide to deploy this model or do anything else using another chat command. And so now you have this really rich record of everything in the PR that's associated with that PR. And it's getting closer to proper software engineering practices where you have the test that's automatically conducted from the PR itself and you have all the documentation there. I even talked to some Weights and Biases customers. 
What they do is they have Jupyter notebooks on the side and they copy and paste their Jupyter notebooks into the PRs. And, you know, this is to avoid all that stuff. So it just happens magically. I hope that helps.
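
The "dump the results into the PR" step might look roughly like the sketch below: format run metrics as a Markdown table and build a request to the GitHub REST endpoint for issue comments (pull request conversations use the same endpoint). The metric names, repository, PR number and token are placeholders, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def format_results_table(runs):
    """Render run metrics as a Markdown table suitable for a PR comment."""
    lines = ["| run ID | accuracy | loss |", "| --- | --- | --- |"]
    for run in runs:
        lines.append(f"| {run['id']} | {run['accuracy']:.3f} | {run['loss']:.3f} |")
    return "\n".join(lines)


def build_comment_request(repo, pr_number, body, token):
    """Build (but do not send) the request that would post `body` on the PR."""
    url = f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments"
    data = json.dumps({"body": body}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


# Hypothetical runs, repository and token for illustration only.
RUNS = [
    {"id": "abc123", "accuracy": 0.9123, "loss": 0.3012},
    {"id": "def456", "accuracy": 0.8954, "loss": 0.3377},
]

table = format_results_table(RUNS)
request = build_comment_request("octo-org/demo-repo", 7, table, token="<token>")
print(table)
print(request.get_method(), request.full_url)
```

Sending it would be a call to `urllib.request.urlopen(request)` from inside the workflow, using the token the workflow run provides.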


Lukas: That was great. That was educational. And I think we should try to package what you did and offer it to customers, they would like it.


Hamel: Yeah. Yeah.


Lukas: All right. Well, you know, we've been talking for like half an hour. So I should wrap up with some final questions that we've been asking everybody and getting really interesting answers. What is one underrated aspect of machine learning that you think people should pay more attention to?


Hamel: I actually think one of the highest impact areas you can try to work on as a data scientist or in machine learning is to build tools for other data scientists. And sometimes the tools don't have to be sexy. So there's another thing that I built called fastpages, which helps people share information and write blogs, which sounds really unsexy and it really is very unsexy. But I think that kind of stuff is very useful. So thinking of tools that you can build to automate your own workflow and then sharing that and packaging that into tools, I think, is very underrated.


Lukas: Nice, from one tool maker to another, I have a lot of respect for that. That's awesome. We'll put a link to fastpages, I think it is super cool. All right. Next question is, what's the biggest challenge of making machine learning actually work in the real world, in your experience?


Hamel: I think that there is a gap between software engineering and machine learning, and they're kinda different disciplines that need to work together, a lot of times, to pull off a successful deployment of machine learning. And I think part of it is organizational, part of it is tooling. I don't think tooling can get you all the way there. I think machine learning is a team sport and requires people from design, UX of course, to ML folks, infrastructure people, DevOps, all kinds of people to pull something off. And so I think that can be a challenge in a lot of organizations, to get those people together to do what you want.


Lukas: Have there been any particular dysfunctional patterns that you've seen over and over with that or miscommunications?


Hamel: Yeah, yeah. I mean, I think the main pattern that I continue to see over and over again is, "Oh, my business is in trouble. Let's sprinkle some data scientists on it." You don't just hire data scientists to solve a machine learning problem. You need to think about it as a holistic product. And I think that pattern keeps repeating itself over and over again.


Lukas: You're still seeing that? Reminds me of my first job.


Hamel: Yeah. It may not have changed much. I don't know. I mean, hopefully the industry is getting a little bit better.


Lukas: All right. The final question is, if people want to learn more about what you're working on and get in touch with you, what's the best way for people to find you online?


Hamel: Yeah, I mean, there's a lot of ways you could find me: on Twitter, or I have a website that I haven't updated in a while. But you can do that. Just Google me. There's a lot of stuff like that.


Lukas: Thanks so much for chatting. It was really fun.


Hamel: Yeah. Thank you.
