Building industrial-strength NLP pipelines with Ines and Sofie of Explosion AI
Sofie and Ines walk us through how the new spaCy library helps build end-to-end, state-of-the-art natural language processing workflows.
BIO

Ines Montani is the co-founder of Explosion AI, a digital studio specializing in tools for AI technology. She's a core developer of spaCy, one of the leading open-source libraries for Natural Language Processing in Python, and of Prodigy, a new data annotation tool powered by active learning. Before founding Explosion AI, she was a freelance front-end developer and strategist.

https://twitter.com/_inesmontani

Sofie Van Landeghem is a Natural Language Processing and Machine Learning engineer at Explosion.ai. She is a Software Engineer at heart, with an absurd love for quality assurance and testing, introducing proper levels of abstraction, and ensuring code robustness and modularity.

She has more than 12 years of experience in Natural Language Processing and Machine Learning, including in the pharmaceutical industry and the food industry.

https://twitter.com/oxykodit

https://explosion.ai/

https://spacy.io/

https://thinc.ai/

https://prodi.gy/

TRANSCRIPT

Topics covered:

0:00 Sneak peek

0:35 Intro

2:29 How spaCy was started

6:11 Business model and open source

9:55 What was spaCy designed to solve?

12:23 Advances in NLP and modern practices in industry

17:19 What differentiates spaCy from a more research-focused NLP library?

19:28 Multi-lingual and domain-specific support

23:52 spaCy v3 configuration

28:16 Thoughts on Python, Cython, and other programming languages for ML

33:45 Making things clear and reproducible

37:30 Prodigy and getting good training data

44:09 Most underrated aspect of ML

51:00 Hardest part of putting models into production

Ines:

We did have people in the past who were like, "Well, I want my system to be 90% accurate." And then we're like, "On what?" And they're like, "No, no, 90% accurate." And it's like, "What you're doing with your system will decide how successful your system is, and that's what you want to measure. And that's what you want to focus on." And I can see how this sometimes gets lost if you're not thinking about it, and all you follow is research, which has kind of a slightly different objective, because you're comparing algorithms.

Lukas:

You're listening to Gradient Dissent, a show where we learn about making machine learning models work in the real world. I'm your host, Lukas Biewald. Today, I'm talking to Ines and Sofie. Ines is the co-founder of Explosion AI, which is a digital studio specializing in tools for AI technology. She's a core developer of spaCy, which is one of the leading open source libraries for NLP in Python, and Prodigy, which is a data annotation tool powered by active learning. Sofie is an NLP and Machine Learning Engineer at Explosion AI, and she has more than 12 years of experience in NLP and machine learning, including in the pharmaceutical industry and the food industry. I'm super excited to talk to them today.

Ines:

Hi, I'm Ines. Some of you might know me from spaCy, which is an open source library for natural language processing in Python. And our company is called Explosion, and we specialize in developer tools for AI and machine learning, NLP specifically. And so we build tons of open source tools that are quite popular, which is really cool. And we also have a commercial product, which is called Prodigy, which is an annotation tool for creating training data for machine learning models. It's a developer tool, it's fully scriptable in Python, so you can use it to script your own custom data flows, your own custom labeling recipes, put a model in a loop. There's all kinds of cool stuff you can do with it.

Ines:

And that's what we do. And we're currently working on version three of spaCy, which we're very excited about; it really finally takes spaCy to the next level and lets people use all this really new, modern, cutting-edge NLP stuff. And we're also working on Prodigy Teams, which is more of a software-as-a-service extension to Prodigy, and really lets you scale up your annotation projects and manage larger annotation teams, but without compromising on data privacy and scriptability.

Lukas:

Okay. So before we jump into the details of your library, just as a fellow entrepreneur, I have to ask: tell me the story of how you started this, and what was it like?

Ines:

Yeah. So, spaCy started when, well, my co-founder Matt left academia. He was working on NLP, he was researching, publishing stuff, but he always tells the story as, he got to the point where he had to write grant proposals, which he didn't want to do. And at the same time, he was realizing that people were using his research code in live production environments, which was kind of ridiculous because it was still research code. So it's like, well, there's clearly something there. So he left academia and started writing spaCy as an open source library. And then shortly after that, we met here in Berlin and we started working together. And initially, I was working more on the visualizers for the library, which are still an important part of what people like about spaCy.

But I was a bit skeptical at first. My background is not NLP specifically. I've always programmed, I did linguistics, but my first reaction was like, "Sounds kind of boring. I'm not sure if I want to..."

Lukas:

Really?

Ines:

Yeah, quite literally. Yeah, so he was like, "I'd love to have this syntax visualizer." And I'm like, "I know totally what it is because I know linguistics," but I was like, "That sounds... I don't know if I want to do that. I'd rather work with something else." But I did, and it was good.

Sofie:

And it's still very popular, right Ines?

Ines:

Yeah, I think when people think about spaCy, they think about our syntactic visualizer a lot. So I'm glad I did it.

Lukas:

Yeah. For what it's worth, that's my impression. It's quite beautiful.

Ines:

So yeah, that's how it all started. And then at some point we were like, "Well, okay. We want to found a company around this," but we also knew we didn't want to go down this typical startup route because we saw, "Hey, look, there's tons of money to be made. We can just run our company. We don't need to run at a loss. We don't need to have this crazy scale in the beginning. We can just build stuff and make money doing it." So that's how we set up the company and we're still fully independent, which is cool, and it gives us a lot of opportunities. And we're now a small team of, I think, eight people at this point, including Sofie, who was, I think, one of our first full-time NLP people who joined the team.

Lukas:

Nice. And so Sophie, how long have you been working with the team?

Sofie:

So I guess I started working for spaCy at the beginning of 2019. So almost two years now.

Ines:

Wow. Yeah.

Sofie:

I think I met Ines in a pub in Brussels.

Ines:

Yeah.

Sofie:

After an NLP meetup.

Ines:

Yeah, it's not like we met randomly.

Sofie:

So there was definitely a theme to the day. And I think I really loved their vision, not just for how to run a company, but also on how you should iterate over data and your models and just this very pragmatic view on how to apply machine learning in an industry context, really.

Ines:

Because I guess you've also seen this done okay, but also you've seen this done quite badly in maybe some contexts.

Sofie:

Yeah, exactly. So my background is, I do have a PhD in biomedical NLP, so on biomedical texts, but then I worked for Big Pharma for three years, at J&J, and indeed there are many examples of how to apply machine learning models and how not to apply them. So yeah. And I've always thought working on open source is just the best thing there is, so they didn't need to do much convincing, I think, to get me to start working.

Lukas:

Nice. Well, we'll get into the technical details in a second, but I have to ask, so how do you make money? How does your business operate?

Ines:

Yeah, we sell a product. So we do this really crazy thing where we sell something and people give us money and then we spend less money than we make, and then we make a profit. No, but it's just like, for Prodigy, you can buy Prodigy. And we sell Prodigy for a lifetime license, so it's a one-time fee. And the great thing about being in the software business is you can sell a piece of software and then you can sell it again.

Lukas:

You have an open source model, right? So what parts are open source and what parts are for sale?

Ines:

So we've never really liked this idea of having this sort of freemium thing or this kind of open core, because the problem there is it introduces a lot of questions, and it puts you, as a company, in a very weird position, because we want people to have an easy time using our tools. We want our docs to be great. We don't want to sell help with our own software. So if we did the consulting model where we sold stuff on top of spaCy, we'd constantly be in this position where we're like, "Well, if our docs are too good and our library is too good, nobody gives us money. But if it's too shit, then nobody wants our services because they're not going to use our tools." So we never wanted that. And then there's also always a difficult story around an open source library if you suddenly have these components that aren't free.

And in general, algorithms aren't really what I think you should be selling. There's developer experience. That's something that makes companies a lot of money, that's something people will pay money for. That's a big thing; it's why people use and pay for Weights & Biases, for example: developer productivity. Same with data, anything around creating data. That's where the real customization comes in, that's the valuable stuff people are working on, and that's also where products can be. So we have a separate product that you'd probably be interested in if you're a power user of our open-source tools, but it's separate. spaCy is free, the code is open source, and we sell additional products alongside it.

Lukas:

And is the additional product Prodigy?

Ines:

Yeah. That's currently the one product we have. But we also, as I mentioned earlier, are working on Prodigy Teams, and that will have a more SaaS type of model that people have been waiting for.

Lukas:

I see. But Prodigy's a software that I could buy one time.

Ines:

Yes. Yeah. So you can go online right now, go to our online shop, buy it, download it, pip install it, use it. That was also the idea. We really want to make the path for a developer as easy as possible and make it easier to start using our tools. And then there's a lot more you can be doing.

Lukas:

And I guess the typical person that you sell to, and the person that uses a spaCy, is someone working in natural language processing, trying to build models. Do I have that right?

Ines:

Probably developers at all kinds of companies, from the Fortune 500 to startups, to academics and researchers. Also, a lot of people are getting into NLP now who don't have the classic machine learning background, as in, "Oh, do a computer science PhD and then a startup or something." It's a lot of people, like in digital humanities, from the medical field. There are lots of people entering the field now who want to solve problems. And that's what we also find very interesting, because they're bringing the domain knowledge, they have a problem they want to solve, they know how to solve it. And then, okay, they're learning machine learning, which I would say is often a better path to success than coming from the other direction, knowing machine learning and then thinking about, what problem in the world can I apply it to? I think a lot of terrible products have been born out of, especially, a very naive or arrogant view on this sort of thing.

Lukas:

So spaCy, your library, let's talk about that. Before we talk about the new things you're doing, maybe for someone who hasn't heard about your library or looked at it in detail, what are the big components and what was it designed to solve?

Ines:

Well, it was initially developed to really process large volumes of text at scale, efficiently. So, you want to process the whole internet; well, the internet is not going to get smaller and computers are not going to get faster, so you want efficient software to do it. And it started out by just having... And also, of course, it was always designed to be used in production and in industry use cases, which, especially at the time when we started, wasn't such a consideration. Most code is written for research, so spaCy really took the other approach and said, "Okay, look, you want to process texts and do stuff with text. We're giving you one implementation, which we think is the best. We're not giving you like 50 parsers you can all benchmark and play with. We're giving you one that works best."

So that's how it started. We have different components for the different things you can analyze about language, starting with: what's even a word? That sounds very basic, but it's obviously a lot more complex than it sounds. Then: what concepts are mentioned in the text? What's the text about? What's a verb? What's a noun? How is stuff connected? And then various other things you can build on top of that.
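
To make those components concrete, here's a minimal sketch of running a spaCy pipeline over a sentence; it assumes the small English pipeline, en_core_web_sm, has been downloaded, and the example sentence is just illustrative:

```python
import spacy

# Load a pre-trained English pipeline (assumes it was downloaded first,
# e.g. via `python -m spacy download en_core_web_sm`).
nlp = spacy.load("en_core_web_sm")

doc = nlp("Explosion is a software company based in Berlin.")

# Tokenization: what's even a word?
print([token.text for token in doc])

# Part-of-speech tags and syntactic dependencies: verbs, nouns, how they connect.
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Named entities: which concepts are mentioned in the text?
for ent in doc.ents:
    print(ent.text, ent.label_)
```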

Sofie:

I think one of the main features in version two of spaCy, which is currently available, is that there are a lot of pre-trained models for different languages. So people can come and say, "I want this general-purpose parser that just parses my French texts and tells me what the labels are, what the part-of-speech tags are, what the entities are, like which are persons in the text, and these kinds of things."

But I guess what we've seen over the years is that people now want to train their own models more and more. So we don't want just one French model, right? There can be very different texts; biomedical texts will need quite a different parser than your general-domain English or French news. So I think this is also a little bit the shift to version three, where we're making everything much more configurable, so that people can also go and train their own models. You won't just be limited to the pre-trained models that we'll have online, even though we'll still have them for people to quickly start with; they'll also get much more power over training their own models, I think. So I think this also reflects a shift in how people have been dealing with NLP over the last years.

Lukas:

I feel like you've had front row seats to sort of the big shifts in language processing. And I feel like my impression from the outside is that, some people might even skip the parsing step for a lot of applications. Are you seeing that? How true is that?

Sofie:

It's difficult to say what all the different users are doing exactly, but this has definitely also inspired the move to version three of spaCy. So the transformers that are being published by Hugging Face, which is a huge repository of models that are extremely useful for NLP, will basically become available within spaCy 3 through our spacy-transformers library. So you will just be able to sort of plug and play and put a transformer in your pipeline and use that. And if at that point you feel like you don't need a specific parser model anymore, sure, you could definitely go and try that. So you'll be able to just write your own model on top of a transformer output and see how that does, basically.
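
As a rough sketch of what that plug-and-play usage looks like in spaCy 3 (the package name en_core_web_trf refers to the transformer-based English pipeline; treat the exact names and attributes here as illustrative of the spacy-transformers integration rather than a definitive recipe):

```python
import spacy

# Transformer-based pipelines load like any other spaCy pipeline.
# Assumes the package has been downloaded, e.g.
# `python -m spacy download en_core_web_trf`.
nlp = spacy.load("en_core_web_trf")

doc = nlp("Apple is acquiring a startup in the U.K. for $1 billion.")

# Downstream components (tagger, parser, NER) run on top of the shared
# transformer output, so the usual annotations are still there.
print([(ent.text, ent.label_) for ent in doc.ents])

# The raw transformer output is also exposed (extension attribute set by
# spacy-transformers), if you want to build your own model on top of it.
print(doc._.trf_data)
```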

Ines:

Yeah, but also I think another thing, on that topic, that we've seen is: while, of course, there are a lot of interesting end-to-end things you can predict, especially in research, that are now really exciting and actually work. You can model a really complex task and really predict it end to end, and you don't need to go via all these different components and stitch it together, and that's very exciting.

That's one thing people might be doing, but we've also seen that in more real life applications, it's still often very, very useful to have different components that you put together that you can train on very specific data, and also that you can adjust, retrain, and customize. And that's just because a lot of the things people are actually trying to solve in industry and in real life, so to say, are things that you just can't easily predict end to end and throw a huge language model at it, and then that's it. Of course not, because otherwise, those problems wouldn't be valuable, and companies wouldn't be spending a lot of money on it. The most interesting and most valuable things you want to build are the things that are very specific. And so for these, you often want different building blocks you can combine and train, and that's also something we want to make easier going forward.

Lukas:

Could you give me some examples of things like that, that companies care about, but maybe isn't as well studied in academic circles?

Ines:

I wouldn't even necessarily say that it's something that's not studied in academic circles. It's more like... Okay, often one thing that people always want to do is information extraction.

Companies have amassed texts and texts and texts, and now you want to find something out about that text. And one example I sometimes use in my talks is: imagine you have all this financial news that you scraped, and you want to extract company sales, who bought which company for how much money, in which country, for what currency, or something like that. Stuff in this sort of space is endless and many companies want to do that. And sure, it's something you can try to... If you have a huge language model, you can try and just throw that prompt at it and come up with some way to fine-tune the model. And maybe at the end of it, it will output a structured JSON representation that has all your data in it.

But that's often not the most efficient way to go about this problem. Maybe you start off and you say, "Cool, I want to detect company names." Maybe for that, you could train an entity recognizer to predict GitHub as a company, Microsoft as a company, whatever. The same works quite well for money. Actually, in real life you probably also have tons of noise, so you want to classify the text first to filter out whether it's actually even a text you care about. Then, okay, you have these random money values and you want to normalize them. Do you need machine learning for that? Probably not; there's a quite simple algorithm that can do that for you, that you just want to combine in. Then you want to look up a stock ticker. That's not something you want to predict. You don't want a sequence-to-sequence task for that, or whatever.

Ines:

I don't know, you don't want a language modeling thing for that; you look it up in a database or somewhere on the internet, and then you want to put that all together into a structured format that you can feed into your database. And that, I think, is a good example. That's super common. Predicting this end to end is a really interesting research topic, but not a practical approach. It requires all these different components, and maybe some component in between is, yeah, a hacked-together regular expression that does part of the job, great, and works 99% of the time. What more can you want? And so that's how we see a lot of the NLP done in practice.
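
As a rough sketch of what stitching those pieces together can look like (the ticker table, regex, and helper names here are purely illustrative, and what the statistical components actually predict depends on the pipeline you load):

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")

# Illustrative lookup table; in a real system this might be a database query
# or an API call, not something you predict with a model.
TICKERS = {"Microsoft": "MSFT", "GitHub": None}

MONEY_RE = re.compile(r"\$?([\d.]+)\s*(billion|million)", re.IGNORECASE)

def normalize_money(text):
    """Rule-based normalization: no machine learning needed for this step."""
    match = MONEY_RE.search(text)
    if not match:
        return None
    value, unit = float(match.group(1)), match.group(2).lower()
    return value * (1e9 if unit == "billion" else 1e6)

def extract_acquisition(text):
    """Combine a statistical entity recognizer with rules and lookups."""
    doc = nlp(text)
    companies = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
    amounts = [normalize_money(ent.text) for ent in doc.ents if ent.label_ == "MONEY"]
    return {
        "companies": companies,
        "tickers": {c: TICKERS.get(c) for c in companies},
        "amounts_usd": [a for a in amounts if a is not None],
    }

print(extract_acquisition("Microsoft acquired GitHub for $7.5 billion."))
```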

Lukas:

Well, that's consistent with the NLP that I've seen in my life, so that makes total sense to me. I guess, what are the parts of that that spaCy helps with? Or maybe, what are the decisions that spaCy has made that might be different from a more research-focused NLP library?

Ines:

I think I would say the opinionated take on stuff. We're like, well, there's one... We're moving a bit away from that in version three just by keeping people-

Lukas:

Yeah. I was going to ask, because, like, with Hugging Face, there are lots of choices. So I guess, how do you handle the tension between those two goals?

Ines:

Yeah. I mean, in general, we do want to give people the most reasonable default that works best, because we think that's good. For example, with the transformer integrations, we've been running experiments. We've looked at what models actually work particularly well, so we can also provide some guidance there and say, "Oh, if you care about efficiency and you want to use that language, probably use those pre-trained weights." But more generally, we've always started out being quite opinionated and also focused on efficiency. It's not a researcher's fault that, "Oh, your thing is slow." It's like, well, yeah, that's not its job. Its job is to produce these benchmarks so we can compare algorithms. That makes sense; that's the sort of problem research has. It's just that, okay, if you want to actually use a lot of these things, you need to make choices, modeling choices, that actually get your job done efficiently. So that's...

Lukas:

Oh, I see. So even like picking models that will run efficiently.

Ines:

Architectures, how we set up the pipeline, for example. A lot of models will have embeddings for each of the tasks they're doing, and you have the embeddings copied for maybe every component, which can work, but it makes the model quite large and you're always recomputing a lot of stuff. So we're thinking about, well, how could we make it easy to share some of these things across multiple components, so you're only computing the representations once, stuff like that? Those are all decisions that we have to think about that maybe, for a researcher, don't matter so much.

Lukas:

I guess one thing that I've always seen industry caring more about... I mean, academics talk about multi-lingual support a lot, but I feel like in the end, many, many papers are written on English corpora. There are good reasons why, I guess, but it does seem like multi-lingual support is front and center for most big companies, right? Because you have texts in multiple languages. Is that something that you've thought a lot about?

Ines:

Yeah, sorry, I thought Sofie was going to answer for a second. No, no, absolutely. And also there's often more to supporting a language than just training on some random corpus that's available for the language. For example, our tokenization algorithm produces actual words, so it's linguistically motivated tokenization. And that also introduces a lot of considerations, like, okay, how do we deal with that language? What characters does that language have? How does that language normally work? Then, okay, what data can we train on that's actually useful? Because that's also not necessarily the same across research and industry. And it makes sense why everything is in English in research. You can't fault an individual researcher for evaluating and running their tasks on an English corpus, because that's just where the competition happens. But yeah, in a more real-life scenario, sure.
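
A tiny illustration of what linguistically motivated tokenization means in practice, using a blank English pipeline (no trained components needed, just the language's rules):

```python
import spacy

# A blank pipeline is enough for tokenization: the rules come with the language.
nlp = spacy.blank("en")

doc = nlp("Let's go to N.Y.!")

# The tokenizer splits off the clitic "'s" but keeps "N.Y." together,
# because the rules are linguistically motivated, not just split-on-spaces.
print([token.text for token in doc])
# -> ['Let', "'s", 'go', 'to', 'N.Y.', '!']
```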

Lukas:

And I guess bioinformatics is kind of like an in-between, where maybe it's in English, but it's just such a different domain. Do you suggest people use different models if they're working in that domain? How do people think about that?

Sofie:

Yeah, no, you'll definitely need different models. I mean, there's just such a difference in grammar and the kind of entities and words that are being used in biomedical texts. And I think there are plenty of domains like that, like finance and biomedical. These are all very different domains and you really want to train your model on that specific domain. And so, not just for the languages, we've seen a lot of community support. Like for instance, if I remember correctly, Ines, the Japanese model had a lot of-

Ines:

Oh yeah. And Chinese as well. Yeah, yeah.

Sofie:

The Japanese and Chinese support has a lot of help from the community, because obviously it's difficult, with not even 10 people, to be able to support all of these different languages if you want them to be linguistically sound.

Sofie:

But also for the biomedical domain, there's actually even a plugin. We have this list of plugins that we call the spaCy Universe, so people write different packages that plug into spaCy, and they have trained specific models for the biomedical domain. So it's just perfect to go and use that, or at least start from that, if you're processing that kind of text. So I think spaCy is quite nice in that sense, that there's quite a big community around it.

Ines:

So the one that you mentioned is called scispaCy.

Sofie:

scispaCy.

Ines:

That was developed by Mark at Allen AI. And there's also a project called Blackstone, which does the same for the legal domain. And I think both of these actually are great examples because when you look at the components that are implemented, you can see a lot of thought went into what these components should do and what's appropriate for that specific domain.

Like, oh, what do we need to do differently if we want to segment sentences properly in legal texts? If you want to do this well, you need to understand legal texts and how these things are written. What are the problems? Or there's this component for resolving acronyms, I think, which uses a specific algorithm. It's pretty basic, but it can be implemented with spaCy and it just works; it extends what's already there. I think it's very interesting to see these projects where you can really tell, oh, a lot of knowledge and insight about the field went into developing that specific model. And yeah, I guess that's why we're back at the hammer-and-nail type of stuff.

Lukas:

I want to go a little deeper, just because you were talking about this before, when we were chatting about the new spaCy library and you were showing me all the stuff it can do. It seems like you've put a lot of thought into a lot of different components. One that I just recognized, as someone who's also wrestled with this problem, is your configuration system; it looks super cool, with the nested configs and the way that you can put actual logic in there. Can you talk a little bit about... I feel like people might not realize how complicated setting up configurations can be if they haven't wrestled with the problem before.

Sofie:

Yeah, I agree with you. I think, personally, this is one of the biggest strengths of the new release coming up. Version three of spaCy uses these configuration files, because before that we would just have all of these defaults across the library and it would be difficult to really get at them and change them. And now, with the config system, we basically just define all of the different components in an NLP pipeline, so that you know exactly what is in there and what isn't, and then you can basically tune all the different parameters of each component. And I think Ines has worked the most on the backend of this config system, right, and getting it to work, and all of the filling-in of defaults and validation stuff.

Sofie:

And I think we battled with it for a bit, but right now it feels very robust. Like the other day, I was writing a config and made some kind of mistake, I don't know, you write false with a capital or something like that, and the system just automatically tells you this can't be right: did you mean a string? Is this a boolean? What do you want? And so it automatically fails. And I actually think it's fun to work with, because you get stopped very early on in your experiment. You get this feedback of: is this a valid config or not? Can you continue with it?
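
Here's a hedged sketch of what a heavily truncated config excerpt looks like and how it can be inspected from Python using Thinc's Config class; the sections and values shown are illustrative, not a complete spaCy training config:

```python
from thinc.api import Config

# Illustrative excerpt in the spaCy v3 config format (a truncated sketch,
# not a full training config).
CONFIG_STR = """
[nlp]
lang = "en"
pipeline = ["tok2vec", "ner"]

[components]

[components.tok2vec]
factory = "tok2vec"

[components.ner]
factory = "ner"

[training]
dropout = 0.1
seed = 0
"""

config = Config().from_str(CONFIG_STR)

# The parsed config behaves like a nested dict, so every setting is explicit
# and inspectable instead of being a hidden default buried in library code.
print(config["nlp"]["pipeline"])
print(config["training"]["dropout"])
```

When spaCy actually builds a pipeline from a config like this (for example during training), the values are additionally validated against the type annotations of the registered functions, which is where the "did you mean a string? is this a boolean?" feedback Sofie describes comes from.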

Ines:

I think we just have to accept that bugs will happen and that things can go wrong. And especially, machine learning is just hard and it's complex. You're basically passing these super abstract arrays and things with hundreds of dimensions from function to function, and then you're computing shit with it, and then you're passing that all the way back, and hopefully something comes out at the end. That is just complex. So I feel like probably everyone who listens to this can relate to debugging "couldn't broadcast shape X to shape Y". That just comes up and you're like, "Fuck. Somewhere I have a bug." And that happens all the time, where it's like, oh, you have the hidden width set to whatever here. And why does my thing not learn? And why does it all fall apart?

It's like, yeah, because over there you need to set that same width, and you don't, because it uses the default value of that keyword argument that you've set to minus one. That just kept coming up, and we were like, "Well, that sucks." Mistakes will happen and problems will happen, and stuff's not going to get less complex. Also, you're not solving a problem by just abstracting away the complexity. That's another thing. If you see these config files and every parameter is defined, everything is in there, you might be like, "Oh, well, that's super complicated. How is that easy?" And it's like, well, easy doesn't always mean no code out of the box. You need to solve the problem, and that's something you can do by providing a better development experience, not just by hiding it away.

Lukas:

Totally. And I feel like... Oh, go ahead.

Sofie:

Yeah, I just wanted to say, I'm not sure when we started implementing these configs, I think around January, maybe. So we've been working with them now for more than half a year ourselves, and that's also how we saw what all the problems were, like these hidden things: the parameter in that part of the config needs to be the same one as a nested parameter in another part. So we have all these referencing mechanisms and so on. So I think we've sort of battle-tested it by now, so hopefully these common errors don't pop up as much anymore. Yeah. But it feels kind of nice.

Ines:

Yeah. No, it's always satisfying. You built something and then you actually use it and you're like, oh, that actually works. I mean, not that it wouldn't have. When we started playing with Weights & Biases, that actually also came up again, because we were like, yeah, let's just log this whole config, and then we're like, oh wow, now we can actually see how all of these values relate to each other. And it just works and it makes sense. And that was also very satisfying. Yeah.

Lukas:

It's funny. What you're saying about some of the typing and some of the stuff you said earlier about wanting things to run fast, kind of makes me wonder what you think about Python in general, because we've actually had some very strong, different opinions from different ML researchers that we've talked to. So I'm curious to get yours as an author of a famous Python library. What do you think about the language?

Ines:

I mean, the thing is... Python, of course, has won at machine learning, however you want to put it. And surely that's because Python was there at the right time, in the right place. It was fast enough, it had support for C extensions, but it was also a general-purpose language. That's something I always like to point out. The reason Python is popular, and works well for the type of stuff we're doing, is that you can do all kinds of things in Python. Python was really big for web development before the machine learning thing started at scale. And that also means that it's a general-purpose language you can learn and get into from whatever else you're doing. And that's why a dedicated AI language has also never really taken off. It doesn't work.

You want a general-purpose language to write it in. And I think that's why Python is so popular, and that's also why I like Python. And yeah, sure, you have to put some work into it to make it fast. In fact, spaCy is written in Cython, which is kind of this dialect of Python that lets you write C in Python syntax. spaCy's a bit known for, "Ooh, Cython," and for some reason Matt has become kind of known for, "Ooh, writing stuff in Cython," which some people can still find a bit intimidating. But I don't know, how did you feel about it? Because Sofie, you learned Cython.

Sofie:

Yeah, I think you're making it sound more scary than it is, Ines.

Ines:

Oh, really.

Sofie:

Because not the whole of spaCy is implemented in Cython, obviously, it's just the parts that really matter efficiency-wise. So yeah, I think it's a very interesting question, because I myself actually come from a Java background, which is obviously quite different. So personally, I'm really happy with the combination of Python and the typing system, because you get a bit of the best of both worlds. You have Python which, let's face it, you just program in much more easily than Java, and there's just so much less overhead and so on. It's definitely grown on me.

Sofie:

And the typing, I do really like it, especially if you're writing your own machine learning models. Thinc, our open-source library for machine learning, has all of these types integrated as well, so that if you're trying to combine layers that just don't have the right input and output types, it will tell you right away. It won't just try to propagate these meaningless arrays that have the wrong dimensions and then crash somewhere in between; it will tell you upfront. So I think that really helps, and the type system really works there.
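
As a small sketch of what composing layers with Thinc looks like (toy widths and data; a sketch of the layer combinators rather than a full training setup):

```python
import numpy
from thinc.api import chain, Relu, Softmax

# Each Thinc layer declares its input and output types, so mismatched
# combinations can be flagged early (including statically, via Thinc's
# mypy plugin) instead of failing deep inside a forward pass.
model = chain(
    Relu(nO=64),   # hidden layer, output width 64
    Relu(nO=64),
    Softmax(),     # output layer; width inferred during initialization
)

X = numpy.zeros((8, 16), dtype="f")  # 8 examples, 16 input features
Y = numpy.zeros((8, 4), dtype="f")   # 4 output classes

# Shapes that weren't specified above are inferred from sample data here.
model.initialize(X=X, Y=Y)
predictions = model.predict(X)
print(predictions.shape)  # (8, 4)
```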

Ines:

Yeah, and I think there's a lot of exciting stuff happening in the ecosystem. It's still quite young. Also the static type checking with mypy, that's all under very active development, just like Python itself, really. But it's very cool to see some of this stuff actually work. Or using a modern editor and just seeing something underlined, and you look at it and you're like, "Oh yeah, I passed something wrong to this function." That could have easily taken me a long time, because I passed a string and it should've been a list, and the string is also iterable.

Sofie:

Oh, yeah.

Ines:

Those sorts of bugs that everyone can relate to, and that's pretty cool. And also the ability to hook into that system, into the static type checking via mypy, and implement your own plugins for your own libraries and use cases. I think that's something we're going to see a lot more of in the next couple of years as the ecosystem around this matures. Yeah.

Lukas:

Do you see demand for other languages?

Ines:

As in programming languages or using-

Lukas:

Yeah, do people ever ask you, hey could you support Java?

Ines:

Well, I mean, it's still a Python library. But that said, there's a very popular wrapper written in R, and that's still a very popular language. If anything, I would say, in our space, that would be the other main language that people are working in. And sure, it might be biased, because I don't know what people who are working in Java are doing, because they're surely not using Python. So, I mean, I don't know, it's like, "Oh, we never hear from people who work in Java." Yeah, because they don't use our stuff. And that's fine, fair enough.

But yeah, I think it's also because R and Python integrate quite well. So it's basically just this wrapper layer, and it fits in with a lot of people who are working in digital humanities and the social sciences. Those fields are actually quite heavily R-based, but they also have tons of texts to analyze, so they often use spaCy via R. Yeah.

Lukas:

Got it. I guess, are there other things? I mean, one thing that I think of with your library and configuration system is the reproducibility effort. And especially, I imagine, working with this range of people, especially in academia, but I think in industry too, it's so important to have things reproducible. Is there anything else you're doing to move things toward making things clear and reproducible?

Sofie:

Yeah, and I think that's definitely one of the main reasons for having the config like it is: basically everything is defined there, and you can also just set the seed of your random generator, so that you should pretty much be able to reproduce exactly the same weights, even in the machine learning models and so on. So that is definitely something that we care deeply about as well. Yeah.

Ines:

And of course, another part is that, well, it's not just the model you're training. There are always these different components: you have the data you're downloading and loading in, you have some other script that you just need to pre-process something, and so on. So another feature of spaCy 3 will be what we call spaCy projects, or project templates. It's a CLI interface that lets you work with a more end-to-end workflow.

Because often, yeah, you don't just run one command: you run a pre-processing step, you download something, then you train, then you want to evaluate. Sometimes you only want to rerun the training if your data changed, or if something else changed, or if your results changed. So there are all these interdependencies, and that's something we felt was quite difficult to do fast internally, as well. So, that's what motivated this idea. You can kind of think of it a bit like a CI config, if you've ever configured something like Pipelines or Travis. I mean, if you haven't, well, I guess you're lucky you never have to wrangle with CI things. Maybe it is one of these things where I'm like, "Oh, I know way too much about these things, yeah."

Yeah, you basically define a series of steps; you have a file to do that. You can download data or anything else, any weights you need, and then you can upload that to a GitHub repo. We're going to provide lots of templates you can clone, and that also makes it very easy for people to get started. Or even something as basic as a benchmark. We're currently running benchmarks, of course, because we need to test all the stuff we've built, and we don't want to launch without having some numbers. But that makes it very nice, because we have the steps defined, we have the data defined and loaded, we have the processing script defined, we have everything down to the random seed that's set, so anyone running that should be able to reproduce it.

And so, if you say, "Hey, cool, I would love to run these benchmarks," you can do spacy project clone, benchmark, whatever; it downloads it, kind of like git, then you run the assets, you run a named workflow, and it just runs it. And then you can rerun a step, and it will only rerun if things actually changed. That system also makes it very easy to integrate with other libraries. Like, for example, you can have a script that does something very, very specific you want to do with Weights & Biases, that you wrote your custom function for, that integrates with the config.

Or we have one project that shows how you can easily have one step that serves your model using FastAPI, which is probably one of the most popular tools, and the developer also happens to be on our team. People are always using spaCy and FastAPI together, and people have always asked about integrations, and we were like, "Well, it just works." And that's actually something you'll be seeing in that integration: well, is it even an integration? Because it just works. Or a Streamlit visualizer, that's also pretty cool. Imagine you run your steps, train, you have your output, your artifact, and then you just run visualize, and it spins up this app for you. Plus tons of... I don't even know what people are going to build with it. And I'm very excited. So, that's also, yeah.

Lukas:

That's super cool. I want to make sure I have a little bit of time to ask you questions about Prodigy and data, because that is my former life; I also worked in that space. So, I'm also super passionate about people getting good training data. I'm curious if you could tell me a little bit about how Prodigy works. Maybe, does it integrate with spaCy in a special way?

Ines:

So yeah, obviously because we developed spaCy, for the NLP capabilities and NLP workflows we had lots of opinions and ideas, so there are lots of workflows that use spaCy. But basically, part of the idea started when we started working on models and things and we wanted to create our own data. And that was also at a time when we realized, look, you don't need big, big data anymore. You don't need billions of labeled examples. You can do that, but often what you need is something very, very specific to what you're doing. And you want to create that, and you want to create a lot of it. And you don't want to have a meeting, and outsource it, and then get it back, and you're like, "Why is this so shit?" Then it's like, "Well, yeah, because you just asked someone to label all persons and didn't tell them what you wanted. So why are you surprised? And you paid them $2 an hour." Surprise that the core of your application is kind of shit, if that's how you treat the data collection process.

So anyway, that's something we've also seen a lot. Actually, very early on when we started the company, we did a bit of consulting for about six months; we call it raising the "client round." So, to get some money in, and also to see what people are doing in practice. And data collection always came up, in every project. It also showed that iterating on this was very difficult, because of how people did it, in a spreadsheet. And then often we were like, "Oh, maybe you should try that type of label scheme. Maybe you'd want to change this around a bit. Maybe predicting something else is actually more useful."

That's something, actually, to go back to our industry versus research thing. That's another thing people often forget. If you're not in research, you're in industry. You can choose how easy you make your problem. You can't do that if you're researching stuff. You can't be like, "Oh, that task sucks. I'll just do a different task." But if you are solving a problem, you can choose how easy or hard you're making it. And that often needs trial and error and you need to try things out. So that's how Prodigy was born, because we were like, "We want a development tool."

Labeling data needs to be part of the development process. At least initially, before you scale it up, you want to be building these workflows and writing them yourself. You want to load a model. Maybe you have a model, and you want to have the model present you with what it thinks is most likely correct, and then you can say, "Yep, no, yes, no." That's very fast. Or maybe you want to correct what a model does. Maybe you want to do something from scratch. Maybe you want to label entities, text categories, lots of things.

So, that's what really motivated the tool. In practice, it looks like this: it's a Python library you install alongside your other stack. You use it to start up the web server via the CLI, and then you have a modern web app with an interface that really focuses on one task at a time, doing it as efficiently as possible. You can move through and label some data, and if you're in a good flow, you can do, I don't know, a few seconds per example.

So, it's actually really viable to say, "Hey, you spent an hour and created a data set of a few hundred annotations." And nowadays, that gives you enough to at least validate your hypothesis. You'll have some idea, "Hey, how about we predict this as a text classification task." And then you're like, "Is it going to work?" Who knows? You have to try it. I mean, that's machine learning. I don't know. So, yeah.
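
Here's a rough sketch of what such a scriptable workflow looks like, following Prodigy's documented recipe pattern; the recipe name and the hard-coded stream here are purely illustrative, and a real recipe would load data and possibly a model instead:

```python
import prodigy

# A custom recipe: a decorated function that returns the dataset to save to,
# a stream of examples, and the annotation interface to use.
@prodigy.recipe(
    "classify-news",
    dataset=("Dataset to save annotations to", "positional", None, str),
)
def classify_news(dataset):
    # In a real recipe the stream would come from a loader or from a model's
    # suggestions; a hard-coded list keeps the sketch self-contained.
    stream = [
        {"text": "Microsoft acquired GitHub for $7.5 billion.", "label": "ACQUISITION"},
        {"text": "The weather in Berlin was sunny today.", "label": "ACQUISITION"},
    ]
    return {
        "dataset": dataset,           # where accepted/rejected examples are stored
        "stream": stream,             # examples to annotate
        "view_id": "classification",  # built-in binary accept/reject interface
    }

# Started from the command line, roughly:
#   prodigy classify-news my_dataset -F recipe.py
```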

Sofie:

Yeah, and I think the other interesting thing, because we're targeting developers in NLP, is that I've spoken to quite a few people who are using spaCy in industry, and what's interesting is when they go and annotate a little bit of the data themselves in Prodigy, right? They can script their own recipe and annotate a bit of data, because that also helps you understand the domain better, and that's definitely going to help you model the challenge better. So, it's really this fast iteration of: how could I annotate the data? What would make sense in my machine learning problem? And basically knowing a little bit of both worlds, rather than just having some data thrown over the fence and then machine learning experts trying to make sense of it, which just doesn't work. So, yeah.

Lukas:

And I feel like another difference that you see, I guess, in the real world, especially in NLP, right, is this loop of: the model gets trained, a little more data gets collected in particular ways, and the model gets trained again. How do your tools support that kind of process? I'm sure you've thought a lot about that.

Ines:

Yeah. So, definitely there's continuous... First, just making that point to people was very important, to say, "Your model isn't trained and done. Your model's never done." You need to plan for a continuous process of improving it. And ideally, you also want the model in the loop somewhere, at some point at least, because you want to see, "Am I actually producing stuff that's reasonable?" And you can do that in different ways. You can actually have the model present its suggestions, and you can annotate them, give feedback, and evaluate your model that way.

Also, one workflow we thought of is, well, what if you actually just focus on the basic uncertainty sampling? Even something very simple where you say, "Hey, let's just see which text categories have scores closest to 0.5." Because that means that it could go either way, and no matter how you correct it, you always have a gradient to update, and you have the biggest gradient to update with in either direction, because you're in the middle, right? So there's always something to learn from. And that's another approach you could take.
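
A minimal sketch of that idea, independent of any particular tool (the scoring function and scores here are stand-ins for a real model's predictions):

```python
def uncertainty_sort(examples, score_fn):
    """Sort examples so the most uncertain ones (scores closest to 0.5) come
    first. `score_fn` is assumed to return the model's probability that the
    label applies to the example's text."""
    return sorted(examples, key=lambda eg: abs(score_fn(eg["text"]) - 0.5))

# Hypothetical usage with a toy scoring function:
examples = [{"text": "Company A buys Company B"}, {"text": "Nice weather today"}]
scores = {"Company A buys Company B": 0.52, "Nice weather today": 0.97}
ranked = uncertainty_sort(examples, lambda text: scores[text])
print(ranked[0]["text"])  # the example the model is least sure about
```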

And also, just allowing people to quickly spin up these experiments, and not having every update you make to your model be a whole bureaucratic process, because that's often what it ends up being. Developers want to develop. You don't want to have five meetings before you can start on your model. You want to just write code. And that's something people definitely appreciate: great, I can get to work.

Lukas:

Yeah.

Ines:

I don't have to schedule meetings.

Lukas:

It makes sense.

Ines:

Yeah.

Lukas:

So, we always end with two open-ended questions, and I want to make sure I give you both a chance to answer these. So the first question is, when you look at all of machine learning, including production stuff, is there a topic that comes to mind that people should be thinking about more than they are?

Sofie:

I think for me, personally, and I think everybody who knows me in the NLP domain knows this, it's probably normalization or entity linking. And this is also one of the models that I worked on for spaCy. So basically, if you have a text and you've already been able to annotate something as being a person, or a location, or an organization, or whatnot, that's fine and that's interesting, but you also want to know what exactly it is. So, being able to give it some unique identity, preferably from a database or knowledge source somewhere. And for me, this is really a crucial step in NLP, because a lot of the other steps are just based on the text itself.

Sofie:

And it links that to the outside world and an external knowledge base, so you can then integrate your textual knowledge with other information from databases. For instance, if I think back to my BioNLP background, there are a lot of protein interactions known in databases; people record them as structured information. And then there's another set of interactions that are only written down in articles. And you would be amazed at how little the overlap is between the two. So you really want to be able to integrate them and combine both sources of information. And so for me, entity linking or normalization is a very difficult challenge, and we've definitely not solved it in spaCy yet, or it hasn't been solved in general yet, but I think this is an extremely interesting and crucial step in NLP.
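
In spaCy, the building block for this is a knowledge base that maps aliases to candidate entity IDs; here's a hedged sketch of the idea (the IDs, frequencies, and vectors are made up, and exact import paths have shifted between versions):

```python
import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.blank("en")
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)

# Each entity gets a unique ID (Wikidata-style IDs here, purely illustrative)
# plus a frequency and an embedding.
kb.add_entity(entity="Q312", freq=1000, entity_vector=[0.1, 0.2, 0.3])  # Apple Inc.
kb.add_entity(entity="Q89",  freq=500,  entity_vector=[0.4, 0.1, 0.0])  # apple (fruit)

# The same surface form ("Apple") can point to several candidate entities,
# with prior probabilities; the linking model's job is to pick the right one
# from context.
kb.add_alias(alias="Apple", entities=["Q312", "Q89"], probabilities=[0.8, 0.2])

print(kb.get_size_entities(), kb.get_size_aliases())
```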

Ines:

Yeah, and it's definitely also the type of task that we want to make easier and provide workflows for people to do. Okay, is it my turn? It might sound a bit basic, but: the idea of really just sitting down and reasoning about what the fuck you're doing. I'm not saying people aren't thinking, but often it can be quite refreshing... Often I've seen people over-complicating things or feeling very intimidated by all this machine learning stuff. And it's like, okay, what are you trying to solve? What are you trying to do? What makes sense? What can a computer do? What can a computer not do? What's logical? And there's... What's so funny about reasoning?

Lukas:

I don't know, it's just... I love it.

Ines:

Well, it's clearly not talked about enough.

Lukas:

Sitting down and fucking reasoning. We all could stand to do more of that, I think.

Ines:

It's a great slogan.

Sofie:

Not just in NLP.

Ines:

Yeah, actually just in life.

Sofie:

Good life advice, yeah.

Ines:

Just think about stuff. But I do think some of that really also defines how we do things and how we think about running the company. I don't know, often people spend way too much time just looking at data and trying to infer stuff that makes no sense and that could be much better solved if you just sit down and think about it and are like, what makes sense? Should I do this? And not, "I collected some data and it says I should do this." Well, but this could mean all kinds of things. Is it logical? No. Okay. Then don't.

Lukas:

Awesome.

Ines:

And I think then once you're there, it can be refreshing, as I said, because you're like oh, suddenly things make sense. Suddenly things are doable. Suddenly your problems are solvable because you're not stuck in like some weird technical rut, but you're actually thinking about what you should be doing.

Lukas:

Is there a specific story that you're thinking of? Is there like a client interaction that you want to share with us?

Ines:

We did have people in the past who were like, "Well, I want my system to be 90% accurate," and then we're like, "On what?" And they're like, "No, no, 90% accurate." And it's like, what you're doing with your system will decide how successful your system is, and that's what you want to measure and that's what you want to focus on. And I can see how this sometimes gets lost if you're not thinking about it and if all you follow is research, which has a slightly different objective, because you're comparing algorithms.

Lukas:

Yeah. Do you think you want to... I don't know. I always feel like with this stuff, when you're thinking about the big picture, it's easier to think clearly. And then you push it down to a sub-problem like, "Okay, I'm trying to optimize accuracy here." And it's really easy to get lost as an individual human, just trying to optimize accuracy, but then in an organization, where you can't just run thought experiments in your brain and you have to actually talk to people, I think it's even easier for people to go down these micro-optimization paths, and very hard for people to pull themselves out, too.

Ines:

Yeah. And it is also the fun part. There's this pyramid where at the bottom is the data, and then at some point you have the code, and then you have the hyper-parameters, so people are like, "Ooh, I can't wait to tune my hyper-parameters."

Lukas:

Right. Although the new spaCy library does support tuning hyper-parameters very well, doesn't it?

Ines:

Yeah, we do expose them, but we also have to say that... That's another thing about optimizing for more stable industry use cases: hyper-parameters have never mattered that much for the models we implemented, and still don't. I mean, now they do a bit more with the transformers, which are just more sensitive to that. But we've always tried to design things in a way that they're not so dependent on these really brittle random numbers you set there, where it's like, "Oh, 0.01. Yeah, no wonder it's not working. It should have been 0.001." That's not productive. That's something that shouldn't have to exist.

Sofie:

I mean, that is still a lot of fun though, Ines.

Ines:

Yeah, it's just like...

Sofie:

I love playing with my 0.001.

Ines:

Yeah. And also all this common wisdom of, "What should I use for the dropout?" "0.2." "Why?" "I don't know, it works." But I think it's good to also talk about this for people looking at the field from the outside. A lot of it is genuinely complex and abstract and difficult, but there are also a lot of things that are not as deep as they might seem. We don't know all the answers, and sometimes we're just changing a number until something comes out that we like.

Lukas:

When you look at your experience with your clients and your own experience of taking things from "here's the thing I want" to "I'm starting to build a model" to "okay, it's deployed in production and helping someone or helping some process," where are the unexpected places that these things get hard? I think people sort of know, maybe it needs a little hyper-parameter tuning, but you're saying maybe that's not as important as we thought. And I think when you actually look at what ML practitioners do day to day, it's not all training models, so-

Sofie:

Cleaning data.

Lukas:

... what else is there? I mean, what did you all see as the issues?

Ines:

I think cleaning data, OCR. Yeah, it's nice if you have everything in actual plain text, but often you have a PDF that someone scanned 10 years ago. And just keeping the things together, because software is just hard; there are all these moving parts in the ecosystem that all depend on each other. There's all the DevOps and infrastructure stuff; that's never been something that I was particularly into, and I don't think you've ever been into that either, or anyone on our team, strongly. All that shit where you have to wait forever to see something fail. You run something and then you wait, and half an hour later you've seen it fail, and then you try something else and it runs again. And then two days later you're still debugging.

Lukas:

Yeah.

Sofie:

One of the challenges to me is something I wasn't expecting when I started, because we follow up on the issue tracker for spaCy and look at the kind of problems that people run into. What I wasn't expecting is that sometimes people try to solve, for instance, a named entity recognition task with a text categorization pipeline. And often you can actually cast different NLP problems in different ways, so that you can solve them in different fashions. And I think it's sometimes difficult to communicate what the ideal way of going forward is, or to explain why you shouldn't use this, or that maybe you want a rule-based system for some cases and you don't need all of this ML training.

Sofie:

And I think that's also a little bit one of the challenges for spaCy, because you have a lot of possibilities and opportunities. There are rule-based components, there's machine learning training, a lot of it in there as well, but you need to know how to use the right tool for your specific challenge. And we can never know what the exact challenge is for all the users. So this is, I think, very difficult to guide people with as well.

Ines:

Actually, that ties back in with the reasoning about stuff. One example we sometimes show in talks is: imagine you were trying to extract stuff from police reports about the victim, where the crime happened, and, I don't know, some other details around it. There are lots of ways you can model that. One would be to do it end to end: you label a name as victim, and then you label something else as crime location. That's quite the obvious way. Maybe that works. Maybe if you have a big enough language model, you can actually learn that.

But often this doesn't really work, necessarily. So then you have to think about how else you can decompose this problem. Maybe I should just predict whether a text is about a crime. Then I can predict the locations in it. Then I can use other information I have about the text to resolve that, to figure out that's a location, that's the crime location. Maybe that's where a parser could come in handy, because you can look at the syntax; especially in a language like English, there are only so many ways you can express a thing, there are only so many verbs that are used. If you cover the most common verbs, you've covered 95% of all constructions that are likely going to occur.

Maybe it turns out it always mis-recognizes some city, because the model wasn't trained on data that has many mentions of it. Yeah, you could retrain it. Or maybe you would just want to put one rule in that makes sure that this thing is always recognized, because you know the answer. So there are many ways you can go wrong. And I guess also people still like this idea of downloading something off the internet and it just magically working for whatever complex, specific thing they think of. With a lot of these language models, of course, you only want to train them once, download them, and then fine-tune them or reuse them. But you should always want to train a model.
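
A sketch of that "just add a rule" approach using spaCy's entity_ruler (the city name is hypothetical, and a regex via the Matcher would work similarly):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# If the statistical model keeps missing a particular city, a rule-based
# entity_ruler pattern placed before the NER component guarantees it is
# always recognized.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([{"label": "GPE", "pattern": "Ouagadougou"}])

doc = nlp("The incident was reported in Ouagadougou last week.")
print([(ent.text, ent.label_) for ent in doc.ents])
```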

The question is not, do I have to train a model? It's, you can train a model now. It's great. It works. You're going to make your life so much easier. You should want to train a model, if there's something you want to predict; if not, you probably shouldn't. We often tell people, "Look, you probably don't want to be using machine learning for this." And they're like, "What?" Someone actually did ask me once, they wanted to implement NER for digits. And I was like, wait, just sequences of numbers in text? I'm like, "Why do you want to predict that? You can match that with a regular expression." He's like, "Yeah, but my boss wants me to use machine learning."

Lukas:

Wow.

Ines:

I'm like, God, dude, I'm sorry. But yeah, stuff like that definitely happens as well, people trying to model things that don't need to be modeled.

Lukas:

It's funny though, I think 15 years ago when I was starting a labeling company, I felt like people were sort of thinking of machine learning as like the scary science project that they didn't want to do. And now it's like they want to add machine learning to ridiculous, easy rule-based tasks. It's so funny the way things change.

Ines:

Yeah, but I guess it's what people get paid for. I mean, there are some rare cases where I'm like, God... some people who express their unfriendly attitudes on the internet have jobs where they're likely getting paid a ton to do machine learning, and they're hassling us about pretty basic stuff. I can see how everyone wants to work in machine learning, because it's probably a nice job, but it doesn't always mean that what you're doing there is particularly good.

Lukas:

Well, cool. Thanks so much for the time and we'll put a link to the new spaCy library in the show notes and maybe we can put in some tutorials to help people get started if they want to give it a try.

Ines:

Yeah, cool.

Lukas:

Train a digit recognizer with good accuracy.

Sofie:

Don't forget the hyper-parameter tuning.

Lukas:

Yeah, don't forget a wide hyper-parameter search.

Sofie:

Exactly.

Ines:

Yeah. If you're lucky, you can get to 95%, which I would say is pretty good on any machine learning task.

Lukas:

Awesome. Well, thanks so much.

Sofie:

Thanks for having us.

Ines:

Thanks.
