Vicki Boykis is a senior consultant in machine learning and engineering and works with clients to build holistic data products used for decision-making. She's previously spoken at PyData, taught SQL for GirlDevelopIt, and blogs about data pipelines and the open internet.
Follow Vicki on her website: vickiboykis.com
on twitter: https://twitter.com/vboykis
and subscribe to her newsletter: vicki.substack.com
Lukas: It's really nice to talk to you, Vicki. Could you tell me a little bit about your career and how you got into data science and where that began?
Vicki: Yeah, it's been a really interesting career, I think. Not unlike a lot of other people, but a little bit different. So I don't have a computer science background; I did an undergrad in economics at Penn State and then I went into economic consulting, which is pretty unusual, and it was right around the time of the recession in 2007. So I was happy to find a job, actually, and even more so to find one in my industry. But that involves doing a lot of spreadsheets, tracking global trade movements, tracking internal projects, all that kind of stuff. I started working with data analytics there and then for my next couple of jobs, I worked in data analytics, and then for a job that I had in Philadelphia, Comcast, I started working with Big Data, so we had big data available, and there I started working with Hadoop and looking at big data.
Lukas: What year was that?
Vicki: That was 2012.
Vicki: Yeah. So right around the time that it was starting to get really big, I started working with that tool stack and that involved me having to get a lot more technical. So at the time I was primarily doing SQL, and I had some frustrations with just doing SQL with Hadoop. Hive was still relatively new, a lot of growing pains there. So I started working with that stack and then from there I started doing more large scale data science, sampling, programming, all of that, and then went on to my next job - a data science job. And I've been doing data science ever since. But ironically enough, I'm moving away from data science now, and I actually wrote a post about this. I think the entire industry is moving a little bit in that direction. So not every job, but the industry on average; and more towards instrumenting the processes around data science. So creating machine learning pipelines, creating foundations and structures that are really solid and that go end to end. And so I think there's still a ton of jobs that are just pure analysis but as the industry grows, as the amount of data that we work with grows, I think the industry as a whole is trying to get smarter about replicability. And that's where I'm working now, more in the machine learning engineering space.
Lukas: So you think the bigger challenges now are becoming more engineering issues than analysis issues?
Vicki: Yeah, I think I agree with that, at least from my perspective. So, again, I'm a consultant. I come into companies that want to build out data science or data engineering platforms, and usually they're starting from a question about, let's say, "are sales going up or down and why?" And then we work backwards and say, "OK, well, you actually don't have the data to do this yet and you don't have a platform set up where you can reliably look at this stuff on a month-to-month basis." So that's where a lot of the challenges that I see are now.
Lukas: Interesting. Is it maybe because you're going to companies now that are a little bit farther behind, is that possible? Or companies starting from scratch? Or do you think something's changing where people expect to have more built out processes and tools?
Vicki: I think it's the second one. If you look at the data tools landscape map that Matt Turck puts out every year, in 2011 and 2012 it was about a quarter of a page and it was just like Hadoop. And that was it. Now I think poor Matt has to put together about five hundred logos into a single page, and there's an orchestration area, and there's a tools area, and there's an area just for tools around Spark and all that stuff. So, I think people also now have an expectation that if you have something, it should be productionizable, even to the point where we now have notebooks, which are generally seen as an exploration tool, and there's some movement, for example at Netflix recently, around productionizing notebooks. So, whatever workflow you're looking at, I think there's the expectation that in the end it be reproducible to be valuable.
Lukas: I see. It's funny, you know? My last company, we sold into a lot of the Fortune 500, not necessarily Silicon Valley. I always really enjoyed seeing the different perspectives and all the different applications. How has it been for you as a consultant? Does it feel more frustrating or exciting? Or what is it like to go into an organization and then try to teach them how to build up this process?
Vicki: I think it's a little bit of both. So it's interesting because consulting as a data scientist involves both, and I think this is actually true of all data science, but even more so with consulting. It involves both the people piece and the technical piece. So, you have to know what you're doing technically because you're the expert, when you come into the company and you have to say, "OK, this is how we want to do the architecture". You are also going to be talking to people who maybe don't want this process at all. You're going to be talking to people who are disorganized. You're going to be talking to people who are for it but don't necessarily understand it. And so a lot of that work is actually talking to people and building the case for this stuff as well.
Lukas: What does a typical stack end up looking like these days?
Vicki: Oh, it's hard to say. I've dealt with companies both small and large, and a lot of companies are increasingly in the cloud. So it's interesting. I don't think I have any GCP clients that I've dealt with. AWS is, of course, probably the lead. I did a Twitter question about this a couple of months ago, like, who's using what, and AWS came out at something around 60 to 70 percent. Azure, I'm surprised, is really catching up. I think even as little as two to three years ago, they were squarely in third place and no one was even considering them. But now it's really growing, and I think part of that is Microsoft's leadership, plus the fact that a lot of companies in the retail space are not allowed to use AWS because they see them as a competitor, and partially because they are stepping up their game in the tools that they're providing.
Lukas: Is there a particular offering from Azure that you like and you think is driving some of this growth?
Vicki: Ironically, I actually haven't used Azure a lot. Most of my work has been in AWS, but now that I'm seeing that people are more interested in it, I'm definitely going to have to start looking into it.
Lukas: Interesting. Is there any tool that you think is underrated that people probably should be using or you recommend, that people aren't using yet?
Vicki: I want to say, Bash.
Vicki: That's a really glib answer. But it's really true because a lot of the time when you come into these big, huge projects, you have five or six different AWS services spun up. You have GPUs, you have monitoring, you have all this stuff. And then you start thinking, "Okay, well, I have all this stuff. How am I going to use it? Oh, I can't test this locally, I can't do this locally, I can't sample the data, what am I going to do with it?" So I really do think, and I find myself falling into this pattern too, that you use all this big data stuff, but then you don't use the stuff that you have available to you. And it's even easier these days, when a lot of us are working with pretty high-powered machines, to do a lot locally as well.
Lukas: Interesting. So you run a lot of stuff locally?
Vicki: Some, yeah. Especially to test stuff and to prototype, because in cloud environments it's really hard to spin up those local environments. So just even to look at the data, to examine what you're dealing with, all of that stuff you can do locally, and you can use Bash. Bash goes a long way towards that.
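As an illustration of examining data locally before reaching for cloud tooling, the sketch below draws a fixed-size random sample from a stream of lines using reservoir sampling, so a large file never has to fit in memory. The function name `reservoir_sample` and its `seed` parameter are assumptions made for this example; nothing here comes from Vicki's own tooling.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample(lines: Iterable[str], k: int, seed: Optional[int] = None) -> List[str]:
    """Return up to k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample: List[str] = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Keep the new item with probability k / (i + 1), replacing a random slot.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample


if __name__ == "__main__":
    rows = (f"row-{n}" for n in range(100_000))
    print(reservoir_sample(rows, k=5, seed=42))
```

The same function could be pointed at an open file handle to pull a few hundred rows out of a large export for a quick look.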
Lukas: So I know that you can't talk about your individual customers, but can you talk broadly about the questions that are driving more interest in data science right now? Like, what's top of mind? What do you expect going into another Fortune 500 company that the executives want out of their data science platform they're not getting right now?
Vicki: The number one question is always to understand customers, understand what they're doing, and understand how what the customers are doing ties directly to the bottom line. And that manifests itself in a number of different ways. The one that I usually talk about, which I've also written a blog post about, is churn. Everybody always wants to know churn. So how many people are leaving, why are they leaving your platform and how much money is it going to cost you on a month-to-month basis? Everybody always wants to know that, and I can guarantee in any given project they'll see it. And then the second one is better understanding operational metrics. There's sometimes not a lot of insight into that. And the third one would probably be classifying customers into different types of customers.
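The churn questions above reduce to simple arithmetic once the data exists. A minimal sketch, assuming churn is measured as customers lost during a month divided by customers at the start of that month, and assuming every customer pays the same flat monthly fee; the function names and figures are hypothetical.

```python
def monthly_churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Fraction of the starting customer base that left during the month."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


def monthly_revenue_lost(customers_lost: int, monthly_fee: float) -> float:
    """Recurring revenue lost per month, assuming a flat fee per customer."""
    return customers_lost * monthly_fee


if __name__ == "__main__":
    rate = monthly_churn_rate(customers_at_start=2000, customers_lost=50)
    print(f"churn rate: {rate:.1%}")
    print(f"revenue lost per month: {monthly_revenue_lost(50, 20.0):.2f}")
```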
Lukas: Interesting, what's a deliverable you would give a company around churn that they'd be excited about? Can you literally say, "I can predict churn X percent better," or is it, "if you see this signal, then that means churn"? How do you actually present analysis?
Vicki: Yes, a deliverable. It would be literally a platform that has the information to be able to predict what churn is going to be, for example, for next month. Usually what ends up happening is that a lot of the things that I'll deliver are the data engineering piece around getting all the data together in one place, so we can have a data lake and can actually deliver that churn piece.
Lukas: And how sophisticated would a churn prediction model be today? Are people using deep learning for this, or how complicated do these models get?
Vicki: I don't think they are. I think a lot of times, at companies, and even before I was doing consulting, in all my previous jobs, people are just impressed if you can get a model out the door. That's true a lot of the time in the industry overall. So, if you have something that you can benchmark against, it's seen as good, especially because there's so many steps in doing it. So first you have to collect the data, then you have to clean the data. Then you have to go to the customer support team and say, 'does someone calling in mean that the person might churn or not?' And then you have to collect all the manual data that they keep and keep track of that. Then you have to build the model, then you have to do a prediction, then you have to meet with the people who are in charge of this and explain your data to them. And then there's going to be a back and forth there. And then you have to productionalize all of that. So if you can get a model end-to-end going, and I've come into companies where there was zero data science before... And that's why I'm saying that you have to build it from the ground up. Having that is fantastic. And just having metrics where there were no metrics before is a huge step up. And then the next step up is, of course, 'Okay, well, why is this metric different this month? Why is this metric different that month?' So, a lot of the churn models I've built have been with pretty basic stuff like logistic regression and decision trees. I haven't seen any deep learning used for churn yet, but I'm sure that use is just around the corner.
Lukas: Okay, here's a specific question. I mean, decision trees versus logistic regression they do different things. Do you have a particular one you start with or do you try both or some kind of hybrid?
Vicki: So again, it depends on the data available. Usually, it also depends on who's going to be looking at it. If it's people at a higher level, like executives who need to briefly glance and understand something immediately, the decision tree is very intuitive and very easy to explain, and it can offer a number of different pathways for discussion. If you just need some sort of model that spits out, 'Is this person going to churn, yes or no?', logistic regression is a little bit better for that. But again, it depends on the stack that they have. There are different software packages that are better or worse for logistic regression. So, for example, surprisingly, Python, as far as I know, does not have very good decision tree support. You could do XGBoost, which is not quite the same, but aren't exact...
Lukas: It's like multiple decision trees, right?
Vicki: Nested decision trees.
Vicki: Just gradient boosted trees, but it doesn't offer the nice visual interpretation, I guess, as much as the R package. So, yeah, it really depends on what you have available, what you can do, all that kind of stuff. But I would say all or any three of those are my go-to tools for that.
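To make the logistic regression option concrete, here is a minimal sketch of a yes/no churn classifier trained with plain batch gradient descent, using only the Python standard library. The single feature (support calls last month), the toy data, and all function names are invented for illustration; a real project would more likely rely on an established statistics or machine learning package.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(
    X: Sequence[Sequence[float]], y: Sequence[int], lr: float = 0.1, epochs: int = 2000
) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this customer: predicted probability minus label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def churn_probability(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Predicted probability that a customer with features x churns."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy data (hypothetical): one feature, support calls last month; label 1 = churned.
X = [[0], [1], [1], [2], [4], [5], [6], [7]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_logistic(X, y)

if __name__ == "__main__":
    for calls in (0, 7):
        print(calls, round(churn_probability(w, b, [calls]), 3))
```

Thresholding the returned probability at 0.5 gives the yes/no answer described above; the fitted weight also shows the direction in which each feature moves the prediction.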
Lukas: Interesting. So you'll build a stable pipeline that includes R in it?
Vicki: I've done it for stuff where I've had to prototype and throw it out. I actually have not built an R pipeline in production, although I know it's very possible and increasingly becoming more and more possible.
Lukas: Interesting. So do you feel like R is here to stay, or do you feel like it's getting replaced by Python? What do you think?
Vicki: I think they're two different tools for two different things. So I think R is fantastic for statistics, for stuff that you're working on in probably smaller teams and Python is more of a general tool, like if you need to glue stuff together and if you need to do deep learning and if you need to have stuff, you'll use Python. But its basic statistical capabilities are not as good as a lot of the R packages are.
Lukas: How do you think about leaving your work in a state where another person can update it? How does it happen? Do you ever check back in with a client and see if anyone's touched your model and whether it's still useful for them? That seems like it must be really hard.
Vicki: What we usually do is we work side by side with the client. So we'll have a person on the client side who is a data scientist so we can hand it off, or we'll have teams, and we do education throughout the process so we can hand it off and be sure this person knows how to pick it up and knows how it was built.
Lukas: I see. That makes sense and you probably pick the technologies they're familiar with or...
Vicki: For sure. So we try to pick technologies that are not foreign to the client. So it's not like they're completely floundering and gone when we hand over PyTorch or something.
Lukas: So what's the biggest frustration in this whole process? Where do you see the biggest room for potential improvement? I mean, we've both sold into big companies and it's challenging. Like, you don't want to say bad things about your clients, but do you have any patterns there?
Vicki: I think the biggest issue is trying to explain the benefit of machine learning in a way where it's not always exactly clear. So for example, you'll come to someone and they say, 'OK, well, I want to figure out customer churn.' And you're like, 'OK, we'll build this model but I can't guarantee that it's going to be good. I can't guarantee it's going to be accurate in the first pass.' But in the meantime, you have to figure out how long you're gonna be at the client, how much value you're going to add. So it's very, very hazy. And I think that's more of a frustration for me. But it's also an educational issue where you're not going to always get to a right answer in the first sprint or the second sprint. It's going to be an iterative process and sometimes, if you add stuff, the models get worse. If you take stuff away, the models get better. So it's kind of hard because data science is always sold, or rather I see it being sold, as this exact thing, but it's very much like an art process. And so I think that's where some of the frustration is. It's not an exact thing and people expect it to be.
Lukas: And I can imagine it's probably really hard going in, like apparently not knowing the amount of lift. Somebody is like, I for sure want this to get better. So it's like, well without the data, how would you know? How do you articulate that? If someone's like, "Hey, tell me how much you know you're going to improve my churn prediction", what would you say to that?
Vicki: First I don't know, I've actually never had it happen that someone was like you have to improve my model by this much. It's usually like let's create a model to do X, Y or Z. But what we usually do is benchmark against previous metrics that they have. And so the goal there is to say, 'look, we're not sure how much we can improve your model, but we can improve the process around the models so that it can be a little clearer.'
Lukas: When you look at the successful engagements where you feel like you really made a difference versus the ones that are more frustrating, are there patterns your more successful clients are exhibiting around data science that sets them up for success?
Vicki: Usually working in a tight loop with me. So a lot of the time the companies I work at will be bigger. And so the data science team will be on one side. The data engineering team will be on one side. The project management team will be somewhere over there. And so I'll talk to all of them. But they don't talk to each other necessarily. And so what I've seen work best is when I'm embedded with a developer, a data scientist, and a project manager who are all kind of working together towards the same thing, because there's a big tendency to get siloed.
Lukas: So I think companies debate internally about "Should we have a single data science function or should we embed the data scientists and have the different kind of functional teams hire data scientists for the individual products that they're working on?" Do you have an opinion? It sounds like you might prefer data science being embedded in specific products or specific outcomes, or do you think that it's better to keep it all as a single function so you can hire better people or create a better culture?
Vicki: I'm not really sure I have an opinion on that. I've seen it work well, different ways in different companies. I think probably for smaller companies, I would say less than a thousand people or so, you probably want to have a centralized team and for much, much larger companies, you probably want to have embedded data science teams. But then the danger is, if you don't manage them centrally, then you have five or six data science teams working on the same questions. And I've definitely seen this at companies where it's just replicated work and they're just approaching it in different ways.
Lukas: So you really have seen success in both.
Vicki: Yeah. Yeah, for sure.
Lukas: I see. Do you see specific stages where you're prototyping something and then deploy it into production? It sounds like you're really focused on getting things stable and in production, but do you prototype the steps first and then solidify them?
Lukas: You simply say to a client, "we're gonna prototype and then we're gonna deploy it"?
Vicki: Yeah, that's usually what we do. So usually I come into an environment and you're not really clear on what's going on in the environment at all. You're just kind of thrown in and told, okay, go. So the first step is to gather and assess what's going on. What tools are they using? Who are the key people involved in this? Gather all of that and start to create a model from whatever data you have available. See if you can actually create that model, and then, many sprints later, take that model to production. It's usually never that you come in, you create something and then it's already running. There are usually a lot of human steps in the middle to get it to that point.
Lukas: I feel like everyone always underestimates the pain of taking a prototype into production. What are the biggest challenges that people might not expect, don't usually expect, going into that process?
Vicki: Packaging the model is always a big one. How do you package it?
Lukas: How do you typically package it? What are the options?
Vicki: You could pickle it. You could create a REST endpoint from it. You could put the model in a Docker container and expose endpoints from it. I think that's something that I've seen happen more and more frequently, where the resulting output is essentially a Web app or a Web service, and something hits that Web service and you get an inference point. I would say those are the two big ways right now. I think another big thing that people don't think about a lot is metadata management, and a lot of big companies want to do metadata management. In fact, I think almost every company that I've talked to over the last five years has said we need some way to manage all the metadata and the data lake so that we can update the models and so the analysts can do the analysis. But there's no single tool for it, and I think only now you have open source tools that have started coming out for it. Like, was it Uber that came out with Amundsen? I forgot. But there's at least a couple of companies that have a metadata management system. And the metadata is which variables are in the model, when was this model updated, when was this table updated, all that kind of stuff. And surprisingly, people actually clamor for that, more so than even visibility into how to manage the model.
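The two packaging routes mentioned here, pickling the model and putting it behind a web service, can be sketched together. In this hypothetical example the "model" is just a dictionary of weights and a threshold, and `handle_request` stands in for whatever a real endpoint handler would do with a JSON request body; the names and the payload shape are assumptions for illustration, not a description of any client system.

```python
import json
import pickle


def predict(model: dict, features: list) -> int:
    """Predict churn (1) when the weighted score reaches the model's threshold."""
    score = sum(w * x for w, x in zip(model["weights"], features))
    return int(score >= model["threshold"])


def save_model(model: dict) -> bytes:
    """Serialize the model; in practice the bytes would be written to a file or object store."""
    return pickle.dumps(model)


def handle_request(model_bytes: bytes, body: bytes) -> bytes:
    """What an endpoint handler would do: load the model, read JSON, return a JSON prediction."""
    model = pickle.loads(model_bytes)  # only unpickle data from a source you trust
    features = json.loads(body)["features"]
    return json.dumps({"churn": predict(model, features)}).encode("utf-8")


if __name__ == "__main__":
    blob = save_model({"weights": [0.5, 1.0], "threshold": 2.0})
    print(handle_request(blob, b'{"features": [1.0, 2.0]}'))
```

Keeping the handler a plain function of bytes in and bytes out makes it easy to test locally before wrapping it in a web framework or a container.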
Lukas: I was kinda curious about what you're gonna say about what the metadata is but it sounds like you give examples of there is metadata about the actual input data. There's also metadata about what the model is actually doing. Sounds like both are important.
Vicki: Yeah. I was just going to say the biggest one is usually people create data lakes. They throw everything into unstructured environments like S3 and then they need to understand what's actually going into those environments and where it's coming from, which is where the metadata piece comes from.
Lukas: And what kinds of trouble do people run into from not having a standardized metadata? What are the issues that come up?
Vicki: Well, they wouldn't know, for example, which tables they can use for what, or when those tables are being updated. A big thing for big companies is whether that data is proprietary or not, whether they're actually allowed to use it. There's all sorts of controls around PII and all that kind of stuff. And then usually data lake analysts will also want to query it and they won't know what's in there at all. So it's another way to surface it in a way so that it doesn't impact production. So when analysts are hitting it, for example, they don't hit the entire Redshift table or the entire thing in BigQuery. It's just they know what the data is, what's available and what they can take from it and what they can't.
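The kind of metadata described above (which tables exist, when they were updated, whether they hold PII or proprietary data) can be pictured as a small catalog that analysts consult before touching production tables. This is a sketch under assumed field names; the table names, dates, and the `usable_tables` helper are hypothetical and do not correspond to any particular metadata tool.

```python
from datetime import datetime, timedelta

# Hypothetical catalog: one metadata record per table in the data lake.
CATALOG = {
    "orders": {
        "columns": ["order_id", "customer_id", "amount"],
        "last_updated": datetime(2020, 1, 15, 6, 0),
        "contains_pii": False,
        "proprietary": False,
    },
    "customers": {
        "columns": ["customer_id", "name", "email"],
        "last_updated": datetime(2020, 1, 1, 6, 0),
        "contains_pii": True,
        "proprietary": False,
    },
}


def usable_tables(catalog: dict, now: datetime, max_age: timedelta, allow_pii: bool = False) -> list:
    """Names of tables that are fresh enough and that this analyst is permitted to use."""
    names = []
    for name, meta in catalog.items():
        fresh = now - meta["last_updated"] <= max_age
        permitted = allow_pii or not meta["contains_pii"]
        if fresh and permitted and not meta["proprietary"]:
            names.append(name)
    return sorted(names)


if __name__ == "__main__":
    print(usable_tables(CATALOG, now=datetime(2020, 1, 16), max_age=timedelta(days=7)))
```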
Lukas: And so are most of the models that you're building running in online mode or they run offline in batches?
Vicki: A mix, I would say most of the models that we've built for clients are online or I'm sorry, are batches. I'm working on a personal project now that's online.
Lukas: Cool, can you say what it is?
Vicki: Yeah, I'm actually almost done with it. So I'm working on learning GPT-2. It's like a Medium think piece generator.
Lukas: That sounds too dangerous to release.
Vicki: Yes. So the idea is that you put the first few sentences of a VC blog post in there and it generates a Medium think piece for you. So hopefully, that'll be online. But my inference time is five minutes right now so maybe it won't. We'll see.
Lukas: Do you do any monitoring of these models? Is that an issue? Because so many people talk about the input data changes and nobody notices and then the model gets broken and nobody notices. Is that a real issue that you've seen?
Vicki: I think we're just starting this as an industry. I know there's a lot of talk about observability and catching model drift. And some of the larger companies are really ahead in that space. In general, I would say it's very much an unresolved issue and people usually still resort to checking the database and making sure that the data going in is OK, and that's the level of checking where we are. And I think people are just starting to say, "OK, well, this is where the model was yesterday. This is where it is today and this is where it should be tomorrow."
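A very basic version of the "where was it yesterday, where is it today" check can be sketched as a comparison of means. The rule below flags drift when today's mean moves more than a chosen number of standard errors away from a baseline; the `drifted` function, the threshold of 3, and the sample values are assumptions for illustration, and production monitoring would usually need something more robust.

```python
import math
from statistics import mean, stdev


def drifted(baseline: list, current: list, max_shift: float = 3.0) -> bool:
    """Flag drift when the current mean is more than max_shift standard errors from the baseline mean."""
    shift = abs(mean(current) - mean(baseline))
    # Standard error of the current mean, using the baseline's spread.
    std_err = stdev(baseline) / math.sqrt(len(current))
    if std_err == 0:
        return shift > 0
    return shift > max_shift * std_err


if __name__ == "__main__":
    yesterday = [10, 11, 9, 10, 10, 11, 9, 10]
    today = [15, 16, 14, 15]
    print(drifted(yesterday, today))
```

The same check can be run on an input feature or on the model's own output scores.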
Lukas: Gotcha. I have a question I've been dying to ask you, but we're really not taking it, so. I love working with people that know Bash because I'm always embarrassed to use my Bash skills because I feel I am always too lazy to learn. Do you have a favorite Bash command that people might not know? Is that your favorite?
Vicki: No, I didn't say it was my favorite. I said it was an overlooked tool. I don't consider myself a bash guru by any means.
Lukas: You know, you learned like in the last year and you're like ....
Vicki: xargs, yeah. It lets you do parallel processing of a lot of stuff. So you can simulate two processes. Let's see. cat is one that I use a lot. cat and uniq basically let you do a count star from a database type situation. I would say those are my most commonly used ones.
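For readers more at home in Python than in the shell, the two idioms mentioned here have rough standard-library counterparts. In the shell, `uniq -c` counts only adjacent duplicate lines, which is why it is normally fed sorted input, and `xargs -P` runs several commands at once. The sketch below mirrors both; the helper names are made up for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_values(lines):
    """Count occurrences of each distinct non-empty line, like `sort | uniq -c`."""
    return Counter(line.strip() for line in lines if line.strip())


def run_in_parallel(func, items, workers=2):
    """Apply func to each item using a small pool of workers, roughly like `xargs -P`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(func, items))


if __name__ == "__main__":
    print(count_values(["red\n", "blue\n", "red\n"]).most_common())
    print(run_in_parallel(len, ["ab", "c", "def"]))
```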
Lukas: That's cool. Well we always end with a couple questions, I mean we have touched on some of these. But I'd be really curious to hear your perspective on this. What's an underrated aspect of machine learning that you think people don't pay enough attention to?
Vicki: I think I touched on this earlier, but the people part of machine learning. If you are able to get more data or better data from people rather than banging your head against a smaller model, it's always going to go better than trying to figure out an advanced model for it.
Lukas: It's interesting. A lot of people have said that, that seems like a trend. All right. So you've also touched on this a fair amount, but I'm curious about how you would synthesize it. What is the biggest challenge of making machine learning work in the real world right now?
Vicki: Putting stuff in production
Lukas: Putting stuff in production. And what is in that. What's the hardest part of putting stuff in production?
Vicki: Because there's so much that you need to get right in order to make it work. Because it's not just a software system. Well, it's just as complicated, but even more so, because with software, you have a piece of code, you put it in Docker, you put it somewhere and it goes. With this, you have to keep track of data that's flowing in from Kafka or Kinesis or streaming. You have to make sure that all of that data's correct. You have to make sure it's serialized in the right format. You have to make sure that the database that the data is streaming into processes it correctly. You have to check all that data. Then you create your models. Your models might work one day. You might get drift the next day. So you have to plan for that. And like I said, I think we're still in the early stages of planning for that. Then you have to expose your model to some service or some endpoint that's going to consume it. The model piece itself, you have to put somewhere like Docker or whatever. You have to make sure to orchestrate all of that. So this is very similar to software, except, I think, like in modern software development, we have a lot of pieces of the stack that we're now responsible for because of DevOps. So DevOps, in theory, is supposed to make it easier for you. But what it means is that the software developer also now has to be a system admin and understand some of those pieces, and the cloud brought in the fact that you also now have to be a network expert. So actually, a lot of my issues are troubleshooting, like, 'Why can't this service connect to this service over the company firewall?' Basically. So there's all of that. And you have to know the data and you have to know how the model that you're creating works. So putting that all together in production is really hard. And so I would say that's the biggest thing.
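One small piece of that checklist, making sure incoming data is serialized in the expected format, might look like the following. This is a sketch that assumes messages arrive as JSON-encoded bytes; the field names in `REQUIRED_FIELDS` and the `validate_record` helper are hypothetical, and nothing here is specific to any particular streaming system's client API.

```python
import json

# Hypothetical schema: required field name -> expected Python type.
REQUIRED_FIELDS = {"customer_id": int, "event": str, "timestamp": str}


def validate_record(raw: bytes):
    """Return (record, None) if the message parses and matches the schema, else (None, reason)."""
    try:
        record = json.loads(raw)
    except ValueError as exc:  # covers malformed JSON and undecodable bytes
        return None, f"not valid JSON: {exc}"
    if not isinstance(record, dict):
        return None, "expected a JSON object"
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            return None, f"missing field: {field}"
        if not isinstance(record[field], expected_type):
            return None, f"wrong type for field: {field}"
    return record, None


if __name__ == "__main__":
    good = b'{"customer_id": 1, "event": "login", "timestamp": "2020-01-01T00:00:00"}'
    print(validate_record(good))
    print(validate_record(b'{"customer_id": "1"}'))
```

Rejected messages and their reasons could be logged or counted, which gives an early signal before bad data reaches the model.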
Lukas: Well said. I have a feeling a lot of people listening to this podcast are going to want to hire you. So if that's the case, where can people find you? What's the best way to reach out? You say, well, maybe it's Twitter where you're absolutely hilarious.
Vicki: Yeah. Twitter is the best way. I'm just @vboykis. And I also write a newsletter called Normcore Tech about all this kind of stuff, data and a lot more.
Lukas: I can vouch for the newsletter, I'm a six month subscriber since we met and I definitely enjoy it. It's an honor to talk to you today.
Vicki: Thank you. Thank you for having me.