Taking hard lefts and the medical ML landscape with Zack Chase Lipton
Hear how Zack went from musician to professor, where medical applications of machine learning are developing, and the challenges of counteracting bias in real-world applications.
BIO

Zachary Chase Lipton is an assistant professor of Operations Research and Machine Learning at Carnegie Mellon University.

His research spans core machine learning methods and their social impact and addresses diverse application areas, including clinical medicine and natural language processing. Current research focuses include robustness under distribution shift, breast cancer screening, the effective and equitable allocation of organs, and the intersection of causal thinking with messy data.

He is the founder of the Approximately Correct blog and the creator of Dive Into Deep Learning, an interactive open-source book drafted entirely through Jupyter notebooks.

Zack's Blog

Detecting and Correcting for Label Shift with Black Box Predictors

Algorithmic Fairness from a Non-Ideal Perspective

Jonas Peters' lectures on causality

TRANSCRIPT

Topics Covered:

0:00 Sneak peek: Is this a problem worth solving?

0:38 Intro

1:23 Zack’s journey from being a musician to a professor at CMU

4:45 Applying machine learning to medical imaging

10:14 Exploring new frontiers: the most impressive deep learning applications for healthcare

12:45 Evaluating the models – Are they ready to be deployed in hospitals for use by doctors?

19:16 Capturing the signals in evolving representations of healthcare data

27:00 How does the data we capture affect the predictions we make

30:40 Distinguishing between association and causation in data – Horror vs romance movies

34:20 The positive effects of augmenting datasets with counterfactually flipped data

39:25 Algorithmic fairness in the real world

41:03 What does it mean to say your model isn’t biased?

43:40 Real world implications of decisions to counteract model bias

49:10  The pragmatic approach to counteracting bias in a non-ideal world

51:24 An underrated aspect of machine learning

55:11 Why defining the problem is the biggest challenge for machine learning in the real world

Lukas:

I have a couple of your papers that you flagged that I'd love to talk about, but kind of before then, I kind of wanted you to catch me up. I feel like last time I knew you, you were applying to grad school, and now you seem like a successful professor with a lab at a very famous school. What happened, Zack?

Zack:

Yeah, it's been a weird ride. So when we met, it was in San Francisco. And that was like ... I had already made this weird decision to go and do this tech thing and live in California for a year, get into grad school. But before that I was a musician, so it was an even bigger jump. I think it looks more planned or directed now than it was at the time. The guiding thing to get from being a musician to doing a PhD in machine learning was just a recognition that I wanted to do a PhD. I had enough friends who were in the sciences that I sort of knew that maybe the sorting hat got it wrong or something at some point.

Zack:

And I didn't even know what modern machine learning was. It was really guided by a kind of, I knew I wanted to be in a certain kind of scholarship, and I wanted to be in a certain kind of environment. And I knew that meant going to grad school. And then sort of looking at it like, all right, I was an old man for someone first starting on a scientific career. So it was like ... I wasn't going to do a wet lab thing and spend 10 years learning how to like pipette because it was too late for that. And I had had just enough of a connection with computer science earlier that I knew that was something I enjoyed doing. But I don't know. It's kind of weird to look back.

Zack:

I mean, I think in terms of from where we met, which was I kind of knew almost nothing. I just kind of wanted to go to grad school for machine learning. I think the biggest thing is that I entered the field at the moment of a really great leveling event. So the sudden rise of deep learning was an unexpected thing. And I think it would be an exaggeration to say it completely wiped out people's skillsets or whatever from before then. But it certainly opened up a path in research, where at least the next two, three years of steps in that direction, or a good chunk of them, didn't really require that you were like ... If things were just progressing as normal science, and it was like kernel machines were dominating, for me to get to the point where I was a world leader in understanding nonparametrics or something, that wouldn't happen in like three or four years.

Zack:

But entering a field where suddenly everyone is doing deep learning and there was kind of like a wild west type environment made it very easy to sort of pick an area, say ML and healthcare, and very quickly be, at least with the new generation of technologies, one of the leaders applying deep learning in that area.

Zack:

So I think I got lucky that I sort of entered at that moment of transition where it wasn't so disadvantageous that I wasn't an expert in ... I wasn't a great engineer, and I didn't necessarily have all of that mathematical background, but I was able to sort of ... One advantage of it is I didn't have a lot of commitments. So I wasn't committed to a set of methods that I had invested years in reputation and getting them to work. So I could be kind of nonpartisan about it and say like, "This is clearly a thing that's happening, and I have no sunk costs, so get in there."

Lukas:

That's really cool. It's actually kind of inspiring. I like it. What was your initial research on when you got to grad school? What were you looking at?

Zack:

I was working on healthcare problems. I had had some personal health experiences that were pretty devastating earlier in life. And I think that was just sort of always a motivating thing of, "Could we be making a lot of these kinds of inferences better that guide medical decision making?" It still is a kind of overriding, organizing motivation in my work. My research is a little more diverse. I don't just do the, "I want to grab things and get empirical results on, say, a specific medical dataset." Although I do have a bunch of research in my portfolio that is applied medical work, I also work on the underlying theoretical and methodological problems it motivates. But that was how I started the PhD, working on medical stuff. I wrote a statement of purpose that I think caught the attention of some people, like UCSD, which is where I ended up doing my PhD.

Zack:

There's a division that does biomedical informatics, and there's a computer science department. One's in the med school, the other's in the engineering school. And I think they had been talking about maybe getting a joint student at some point, or someone who would be funded on one of the medical informatics training grants but be a student in CS. And they were looking for someone like that. What I was hired to do essentially was to work on healthcare problems, but I kind of just sort of ... I started with that motivation and looking at what people are doing, but I was sitting in a computer science department and watching what's happening with machine learning. So for example, I suppose the first problem I worked on was something in text mining. So it was medical articles. And we were doing massive multi-label classification. So all the medical articles that get indexed by the NIH are tagged with some subset of a large controlled vocabulary. That kind of enables things like systematic reviews of the literature.

Zack:

And so just a simple ... Like back when we were using linear models. And the challenge was that it was 27,000 classes, and we're trying to predict them all and do it in an efficient way. And now it seems kind of quaint because it's like ... You train language models with like billions of parameters and vocabularies that are like 300,000 words and it's not that big a deal. So I started working on that, but I was seeing what was happening in deep learning. And I think the first kind of bigger break that wasn't just a kind of minor paper was, we were watching everything that was happening. Convolutional neural networks were maybe the thing that was catching the most attention in 2013, '14. But I was interested in a lot of these problems that had more sequential structure, so I was getting medical time series data. Like people are admitted, there's a bunch of measurements, they're getting updated over time.

Zack:

And so I started paying attention to natural language processing, what was happening, because that's another problem with a kind of sequential structure. And I was seeing things like these papers in 2012, '13, '14, that like [inaudible 00:07:27] gave, and other people like that were doing with language modeling and seq2seq type things. And you start thinking, "Are these methods sort of limited to these kind of neatly, ordinally sequenced things like language? Or would they also work for things like messy multivariate time series data that you have in clinical settings?" And so Dave Kale, who I mentioned earlier was the guy that they tried to recruit, I had actually met him when I was starting the PhD at UCSD, actually at Machine Learning for Healthcare. One of the first years of that, when it was still ... It wasn't even a conference at the time, it was like a symposium.

Zack:

And so we got together, this is like second year of PhD, and we kind of had this idea of ... It wasn't obvious at the time. Now, anything that looks like a sequence, people throw an LSTM at it. But at the time, it was really only making headway popularly in language. And a little bit maybe on top of RNN-ConvNet type things, like on top of video or stuff like that. And so we were interested, "Can we do much better than kind of the status quo at predicting things like length of stay, mortality, recognizing diagnoses based on ..."

Zack:

And so you have these time series where the added complications are you have a bunch of different variables, some of them are missing, they're not observed at some fixed interval on the wall clock, they're observed at different times. If you try to re-sample the time series so that it reflects a fixed wall clock time delta, then you wind up with missing data that's not truly missing, but it's missing as an artifact of the sampling frequency. It wasn't observed in that window. So then what do you do? How do you impute it? Do you carry it forward?

Lukas:

I guess you have a lot of windows where nothing happened?

Zack:

Yeah, yeah, yeah. Right, exactly. Say your heart rate's measured continuously automatically by the equipment. However, the coma score is recorded once per hour by the doctor when they make the rounds. And then some serological result, maybe it's checked once per day or maybe some days it's never checked, or something like that, you know? Well, if you choose that time interval that's somewhere in the middle, like hourly, and you have this one thing that you're measuring that's happening multiple times inside a window, this other thing that's only happening once every like seven windows.

Zack:

I mean, an alternative way that you could represent it is you could just say every measurement is a ... You don't have the time tick for the RNN correspond to a fixed delta on the clock, but you can make it correspond to the observation and say something like, "Add as a feature. What is the time lapse since the last observation?" That's a little bit like those event based representations that they use for music generation and stuff like that. In our case it didn't work as well.
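To make that concrete, here is a minimal sketch, not from the episode, of the two representations being contrasted: resampling onto a fixed hourly grid (which creates artifact missingness you then have to impute or mask) versus an event-based representation that adds the time since the last observation as a feature. The data, column names, and choices like forward-filling are all hypothetical.

```python
# Two ways to represent irregularly sampled clinical time series (toy example).
import pandas as pd

# Irregularly-timed observations for one hypothetical ICU stay.
events = pd.DataFrame(
    {
        "time": pd.to_datetime(
            ["2020-01-01 00:05", "2020-01-01 00:20", "2020-01-01 01:10", "2020-01-01 07:00"]
        ),
        "variable": ["heart_rate", "heart_rate", "coma_score", "serology"],
        "value": [88.0, 92.0, 14.0, 1.2],
    }
)

# Option 1: resample onto a fixed hourly grid. Windows with no measurement become
# NaN (missing only as an artifact of the grid), so we forward-fill values and keep
# an explicit "was this actually observed in this window?" indicator.
wide = events.pivot_table(index="time", columns="variable", values="value")
hourly = wide.resample("1H").mean()
observed_mask = hourly.notna().astype(int)   # 1 = measured within the window
hourly_filled = hourly.ffill()               # carry the last value forward

# Option 2: event-based representation. One row per observation, with the time
# elapsed since the previous observation added as a feature, so the model's
# "time step" follows the measurements rather than the wall clock.
events = events.sort_values("time")
events["delta_minutes"] = events["time"].diff().dt.total_seconds().div(60).fillna(0)

print(hourly_filled, observed_mask, events, sep="\n\n")
```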

Lukas:

I mean, I'm always curious ... It's funny, we've talked to a whole bunch of people from different angles in the medical field, but can you give me a rundown of the current state of the art in ML and medical stuff? Like what are the most impressive results that you've seen recently?

Zack:

So there's a bunch of slam dunk results, I think. I mean, you have to divide up the categories of problems. I think a lot of people ... You see a lot of, whatever, the public think pieces about ML and healthcare, and they just kind of slop everything together. And it's just like, the AI is making decisions, and you'll have an AI doctor, and is it better than a regular ... It's kind of just the way that collapses doctorness into a single task.

Lukas:

Sure.

Zack:

I think the reality is, you have a whole bunch of different tasks. Some of them are really clearly recognition problems. Like it's a pattern recognition problem, and the environment around that problem is so well understood that if you solve pattern recognition, then you know what to do with the answer. So you don't have a real policy problem or a decision making problem, you just have a ... I put in this category things like ... Now I'm going to get angry letters from, I don't know, some specialist that I'm saying they're automatable or something. But I think the things that are most amenable to this are the results like the diabetic retinopathy, where they take the retinal fundus imaging and they're able to predict whether or not someone has retinopathy, and do it, say, as well or better than a physician can just by looking at these images. This is one of those things where the doctor knows what to do if they're a hundred percent sure about the diagnosis. If you could just do the diagnosis more accurately, it's good. And then you know what to do.

Lukas:

And you do the diagnosis here purely from an image? So it's essentially an image classification task?

Zack:

Right. Exactly. Things that sort of just reduce to, "Hey, it's a pattern recognition problem. That's all we're doing. That's all the doctor's doing." Those things you can ... Pathology, I think, has some of these, like diagnosing things based on microscopy. One of the best papers I saw at Machine Learning for Healthcare in the first year that it was a publishing conference is people said, "Hey ... " It turns out they were attuned to the climate. They were actually writing from Uganda, and were ... The paper's very straightforward, but the problem was ... The A plus part of this paper is how well motivated it was. It said, "Hey, there's ..." Three of the biggest maladies in Africa are tuberculosis, malaria, and intestinal parasites. These things are diagnosed based on basically pattern recognition by human doctors looking at microscopy, like microscope images. Africa, at the time, as the paper argued, didn't have nearly enough technicians to be able to give timely diagnoses to everyone.

Zack:

And I think at the time they said something, it was something to do with like ... Because it's much easier to diagnose ... Or it's much easier to donate a microscope than a microscopist. So there was a situation where there were more microscopes than there were technicians on the continent. And basically, it was like, if you just do pattern recognition really accurately, you can ... And you can even avoid a lot of the pitfalls that normally plague machine learning. Like you could standardize the equipment, just send everyone the same damn microscope, the same phone camera for taking the picture, et cetera. So they train a simple convnet, there was not a lot of like ... You didn't need to do anything super novel methodologically, and you ended up getting like 99% accuracy on doing this four-way classification among these conditions. Done. This is an important problem. You can imagine shipping that tomorrow. Not really tomorrow, but you get the idea.

Lukas:

Does that really work? I see a lot of these kinds of results, and I wonder, do they really work or is it somehow a more toy version of the real problem?

Zack:

Right. I mean, I think that's almost always a concern when you look at machine learning results, right? Because the results that you see in a typical ML paper are almost always on a sort of randomly partitioned holdout set. So you're always worried about basically, "Hey, I've ... " Everything in the paper is sort of conditioned on the faithfulness of that IID assumption. That my training data and data I'm going to see in the future really can be regarded as independent samples from the same underlying distribution. And that's almost never true in practice.

Zack:

And the question is, is this true in a way that just completely bungles up everything you've done? Or is this ... So an example of where there's a huge discrepancy, is you have people saying that we have human level speech recognition. And then if you ever actually use your speech recognition, it's really clear that it's nowhere near human level. So what it means is, on the training corpus, if you randomly partition it, and you only look at maybe the accuracy on catching the-

Zack:

I take it back. They're not looking at phoneme level error. They do look at word error rate at this point. But you get the point. It's like, if you make this really strong assumption that the training data is... And people confuse these because they use the same word. They say "generalization" in both cases.

Zack:

But one is the extrap... or maybe what you might better call "interpolation" rather than "extrapolation": "Do I generalize from the training set to samples from the exact same underlying distribution?"

Lukas:

Yep.

Zack:

The other is like, "Can I tolerate the sort of perturbations in distribution that are assured to happen in practice?" And so I think this is the thing that people deal with in a really clumsy and kind of ad hoc way right now. And a lot of my more theoretical and methodological research is about, what are actually proper, sound principles according to which you can expect to generalize under, perform under, various shocks to the data generating distribution.

Lukas:

So then I want to get to that, but I feel like I took you off on a tangent for no reason there. So, just going back to-

Zack:

You take me on a tangent, and I'll oblige.

Lukas:

I appreciate it. But sorry, the other medical examples that you think are impressive. I think you were laying out like an ontology of it.

Zack:

Right. So I think the retinal fundus imaging, I think there's that long pipeline of productionalizing things in clinical trials, and I'm not actually up to the minute on where those are in that process. But that would be stuff that I'd be really confident would make it to production somewhere, if only as an assistive tool that's like, "Hey, if the doctor disagrees with this, get a second opinion."

Lukas:

Yep.

Zack:

So that stuff I think is really out there. But then you see the other things people are talking about, people started talking about management of conditions, decision-making. And they started training models to do things like predict what would happen based on past decisions or whatever.

Zack:

Now this stuff, it gets way, way, way funkier. Or all this kind of stuff that has a flavor of... There are maybe two things that people do. One is estimating conditional probabilities and pretending that they're estimating treatment effects. And they're just acting as though knowing probability of death given this, and death given that, actually is giving you really deep insight into what would happen if you intervened. Probability that someone dies given that they had a treatment is very different from probability that someone dies given that I intervene and give them that treatment, when in the historical data, this person always would have received a different treatment.
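That gap between the associative quantity and the interventional one is easy to see in a toy simulation. The sketch below is my illustration, not from the episode: a "severity" confounder drives both who gets treated and who dies, so treated patients look worse observationally even though the treatment helps under an intervention. All numbers are made up.

```python
# Toy simulation: P(death | treated) vs. P(death | do(treat)) under confounding.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

severity = rng.uniform(size=n)               # confounder: sicker patients
treated = rng.uniform(size=n) < severity     # sicker patients get treated more often
# Treatment genuinely helps: risk per unit severity is 0.4 if treated, 0.6 if not.
p_death = severity * np.where(treated, 0.4, 0.6)
death = rng.uniform(size=n) < p_death

# Observational (associative) quantities: treated patients look worse.
print("P(death | treated)   =", round(death[treated].mean(), 3))
print("P(death | untreated) =", round(death[~treated].mean(), 3))

# Interventional quantities: simulate do(treat) and do(no treat) for everyone.
death_do_treat = rng.uniform(size=n) < severity * 0.4
death_do_none = rng.uniform(size=n) < severity * 0.6
print("P(death | do(treat)) =", round(death_do_treat.mean(), 3))   # treatment helps
print("P(death | do(none))  =", round(death_do_none.mean(), 3))
```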

Zack:

So I think you have that kind of work, where there's a huge gap between the kinds of things people are trying to say about how... You have two sides: people who really understand causality, and therefore really measure it and are conservative about the kinds of claims they're making. And then other people putting out associative models, and acting, and writing in a way that seems to confuse whether they're associative or actually causal models, in terms of the kinds of decisions they could plausibly guide.

Zack:

Or you have sometimes people doing things like off-policy RL, where you look at things like sepsis management, or whatever, and you try to say, "Well, okay, can I fit some kind of..." It's the same as the RL problem, like I've observed a bunch of trajectories sampled from one policy, and then I fit a model, and I make an estimate of what average reward would I have gotten under this alternative policy.

Zack:

But being able to make that kind of statement is still subject to all kinds of assumptions that you need in causality. Like that there's no confounding: that the past treatment decisions are not actually influenced by any variables that you yourself don't observe that also influence the outcome.
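For readers who haven't seen off-policy evaluation, here is a bare-bones sketch, my illustration rather than anything from the episode, of the kind of estimate being described: importance-weighting the logged rewards to estimate the value of an alternative policy. Its validity hinges on exactly the assumption named above, that the logged decisions depend only on observed state (no unobserved confounding) and overlap with the target policy. All names and numbers are made up.

```python
# Inverse-propensity (importance sampling) estimate of a target policy's value
# from data logged under a different behavior policy (one-step decisions).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

state = rng.normal(size=n)                              # observed patient state
# Behavior policy: probability of choosing action 1 is logistic in the state.
p_behavior = 1 / (1 + np.exp(-state))
action = (rng.uniform(size=n) < p_behavior).astype(int)
# Reward depends on state and action; action 1 helps exactly when state > 0.
reward = (state * (2 * action - 1) > 0).astype(float) + 0.1 * rng.normal(size=n)

# Target policy we want to evaluate: treat (action 1) only when state > 0.
p_target = (state > 0).astype(float)

# Importance weights: pi_target(a | s) / pi_behavior(a | s) for the logged action.
w = np.where(action == 1, p_target / p_behavior, (1 - p_target) / (1 - p_behavior))
ips_estimate = np.mean(w * reward)
print("Estimated value of target policy:", round(ips_estimate, 3))  # close to 1.0
```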

Zack:

So all of these kinds of things, when people start talking about guiding decisions, making better treatment decisions, inferring all these kinds of things from observational data, I think there's a huge gap between the way people are talking and getting things into practice. But maybe those are the very most important things to actually be working on.

Zack:

And then you have the easily cordoned-off ML pattern recognition problems. Like just, "Can I look at an x-ray and say, 'Is it pneumonia or not?' Can I look at a mammogram and say, 'Should they be recalled or not for diagnostics?'"

Lukas:

And so where does this time series analysis stuff that you were talking about in the beginning fit into that? Is that at a point where it's a tool a doctor could use?

Zack:

For example, the first big paper that we did on this is the one we published at ICLR, which is Learning to Diagnose with LSTM Recurrent Neural Networks. And then so we're feeding in the time series and predicting which diagnoses apply to this patient.

Zack:

So I think you could paint a story that's not totally crazy about how this could potentially be useful. And one example would be, "Hey, I have a new patient. There's some kind of emergency, that I have the patient, I have them hooked up. I'm recording data. If I'm not sure what the diagnosis is, it would be nice to be able to have a short list." So that's part of how we evaluate. I could look at what the machine thinks are the 10 most likely diagnoses, and I could say, "Okay, I'm going to make sure that I include these things in the differential," or something. It would be some kind of sanity check, like you're using the machine as a wide pass to just make sure that you're considering the right diagnosis.
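As a concrete, purely hypothetical illustration of that shortlist idea: if the model outputs a probability per diagnosis code, the assistive tool just surfaces the top few codes for the clinician to check against their differential. The codes and numbers below are made up.

```python
# Turn per-diagnosis probabilities into a short list for a clinician to review.
import numpy as np

diagnosis_codes = ["sepsis", "pneumonia", "heart failure", "ARDS", "UTI", "DKA"]
predicted_probs = np.array([0.62, 0.55, 0.31, 0.12, 0.08, 0.03])  # one patient

def top_k_shortlist(probs, codes, k=3):
    """Return the k diagnoses the model considers most likely."""
    order = np.argsort(probs)[::-1][:k]
    return [(codes[i], float(probs[i])) for i in order]

print(top_k_shortlist(predicted_probs, diagnosis_codes, k=3))
```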

Zack:

Is that actually useful directly, like in its form? You know what I mean? Like, there's a question of, "Could that in general, that kind of idea, work, and is this sort of maybe a proof of concept, that it's plausible?" I think we can maybe make that kind of argument. But in terms of, for the specific cohort, like for the patients in the ICU, is this really something where what we did is directly useful? I think you have to really lack humility to go out there and just say like, in an unqualified way, this is actually useful in practice. I think probably not. Like, I think for a lot of those patients, basically, we're able to demonstrate this technology is capable of recognizing these facts about these patients. But in reality, the diagnoses for a lot of these patients was already known. We're just showing that we can figure out what it was from certain trajectories, certain traces, certain measurements. If the doctor already knows the diagnosis, what do we really do to improve care?

Zack:

And I think this is how my research has evolved. I started off maybe asking a lot more of these, which is dangerous thinking with representation learning. And like, "Can we do anything useful with these types of weird looking data?" You know, the standard thing you remember from the early 2000s or whatever was like, "Always find a way to represent whatever you're working with as, like, a fixed length vector. And then feed it into, like, a menu of learning models, or whatever. And see what comes out."

Zack:

It was exciting to say, "Could we actually get signal out of these varying-length time series, with these weird missingness patterns, and whatever." But you know, at some point, okay, like the representation learning thing has happened, and we know that we can do this.

Zack:

And there's less things that are truly exciting there. Because we sort of know how to... We have a good set of tools, between sequence models and graph convolutions, et cetera, for representing various sorts of exotic objects. And that's no longer, maybe to me, the most exciting thing.

Zack:

So the most exciting thing is, "Okay, we can do function fitting. Let's just say we can do function fitting. Let's say we even believe that we've solved function fitting. What's next?" Like, that doesn't get us to the AI doctor. That gets us to maybe we've solved retinal fundus imaging. But for the most part... Here's another problem, to just poop on my own work a little bit more. And one thing that we often do, is we make these statements about, what is human level performance on some task. But we often don't think about the wider scope. We're sort of myopically focused on...

Zack:

Like, in ML, you're really told, I've got my inputs, I've got my outputs, I've got my loss function. And then the room inside there? That's where you dance. Right?

Zack:

But think about the diagnosis problem. This is an example I like to give my students, is, the way we cast the diagnosis problem in ML is, given all this measured data, can you infer more accurately or as accurately as the human, what is the applicable diagnosis? But was that ever the hard part? The extreme example is like, if the doctor gives you the test for Lyme disease and the result is positive, the fact that the machine can more reliably look at the data that contains that fact and say, "You have..." That's an extreme example. But you get the point. It's like, given that you were already routed to the right kind of care and had the right measurements done and whatever, maybe the machine is good at doing inference about what you have, but maybe that was never the interesting part, that was never the hard part, that was never the part that really demanded that you need a human in the loop. The hard part was seeing a patient. You have no data about them. And you have to make these hard decision problems.

Zack:

Decisions are not just about treatments. There's also decisions about information revelations. That's something we focus on a lot in the lab. Now it's these weird problems where the decision is what to observe. Like, I want to estimate... I want to ask them to figure out what is the best drug to treat some patient. I've got a bunch of people coming in. I can run some tests, but I can't run every test for every patient. I could try some treatment, but I can't run every treatment for every patient. So like, if I were to cast this kind of problem... And you can make it really abstract.

Zack:

You could just say, "I've got some kind of set of variables, they're related by some causal graph. In every time step, you get to observe some subset of them, and you have some budget that constrains which ones you can intervene on."

Zack:

But the point being that it's like, the set of data you observe not being taken just, like, by god, given to you as something that you take for granted, but rather, widening the scope of what we consider to be our jurisdiction as people thinking about decision-making and automation.

Lukas:

Well, obviously I'm a big fan of that area of research. Because I do think in practical applications, you do actually have some control over those things. Like what data you want to collect and how you want to collect it. And I do think it's a messier research problem. But probably more directly useful in a lot of cases, just because the function fitting stuff is so well studied, relative to the impact that it can have.

Zack:

Yeah. Sometimes, I think people have a... You've seen this before. You were Stanford math or something? You've seen the kind of weird hierarchies that people form within a discipline, this idea of like, "Okay, the mathematicians are on top of the physicists, are on top of the chemists, are on top of the biologists, are on top of the applied" whatever, whatever. And this thing happens in ML a little bit with theory and application, where people get snooty. And I think one thing that's weird is that there's two axes that get collapsed there, of theory and application, or mathematics and empiricism. Like mode of inquiry versus method versus real world.

Zack:

And I actually think that maybe it's true that theory is somehow more philosophically interesting than just benchmark applications, than just empirical pursuit on methods. But the application is a different axis. And I actually think that the applications are super philosophically interesting. They force you to ask... Because they ask you to ask questions that aren't just mechanical. You have to ask the normative questions, right? Like, the thing that I think is exciting about applications is that nobody told you in the first place what is worth predicting. That by itself, convincing someone that this is actually a problem worth solving, is part of the work.

Lukas:

I was just reading one of your papers that you pointed me to, on, essentially, collecting more data. The way I would describe it is, it's about collecting more data to get the model to learn the things that you want, or the connections you want, versus the sort of spurious connections.

Lukas:

You had a good example of models predicting seagulls because they see the beach. You make this point that's evocative, of like, we assume that that's bad, but it's kind of hard to articulate exactly what's bad about that. Because it hurts you in generalization maybe. But if it doesn't hurt you in your data set, it's probably harder to distill what's bad about that.

Zack:

Right. You have all these papers out there that are just saying the model's biased, or the model depends on superficial patterns or spurious patterns, or whatever, without any kind of clear sense of what technically do they mean? And what we get out of that is trying to say, "Here's something that I think causality has to offer."

Zack:

I think a lot of people talk about causal inference really focused on the wrong thing. Thinking like, "Is it useful, or is it not useful?" Like, "Can I take the Pearl machinery and go apply it on the real data, and estimate, and get the number." And economists are, I think, more focused on that. Like, "Can I get the number? Can I estimate it?"

Zack:

But I think one thing that's nice about all this is Pearl's perspective. And I think that is really important. Causality is not just useful because you can actually estimate the causal [inaudible 00:13:39]. It's important because you can coherently express the kinds of questions that you actually care about. And at least within that, you can have a way of making coherent statements about things.

Zack:

So in this case, it gives us the vocabulary to say, "In what sense is it wrong to depend upon the beach when I'm saying this is a seagull?" It's that it's not what causes it to be a seagull.

Zack:

Or an example that I like a lot of times is like, "Why is it not right to base lending decisions, for whom you give a loan to, on what shoes they're wearing?"

Zack:

And so part of it could be that you know something about how shoes relate to finances. Like, you know something about the structure of the universe. And you're able to think in your head, "What happens if I intervene on your shoes?" You know, if I take someone and I intervene on their shoes. Because you know people can intervene on their shoes, right? If everyone who wears oxfords to the bank gets a loan, and everyone who wears sneakers doesn't, people will intervene and say, "Is this a reasonable procedure?" One reason why I say, "This is why I want to depend on this or not on that," is to say, "What would be-" I can do these counterfactual simulations and say, "What would happen were I to intervene on that? Would this change your ability to pay? Would this change the applicability of the label in the image?"

Zack:

So I think for us, the big insight is just to think of it kind of coherently, as thinking of semantics as actually sort of being causal in a way, like what it is that causes the label to apply. Then it becomes maybe well-defined, right? Because, I mean, the benefit that we have in our paper, the Learning the Difference That Makes a Difference paper, is we actually have humans in the loop. So we're saying, "Hey, this is something that may or may not be actually identifiable from the observational data alone, but it's something that we can get via the annotators." They're revealing to us ...

Zack:

I read this example about genre in movies, right? So if you train a classifier to predict sentiment on IMDB movie reviews, you find that top positive words, or you do something like just train a linear model and look at high magnitude positive coefficients versus negative, the high positive ones would be like fantastic, excellent, whatever. Negative ones would be like terrible, awful. But the positive ones also have romance, and the negative ones also have horror. You're like, "That's wrong." Why is it wrong? It's because then Jordan Peele comes out of nowhere and starts making all these great horror movies, and your model's inferring that they're bad because it's depending upon this signal that's not durable over time.

Lukas:

I was kind of thinking in that example, though, that I think romance movies are generally better than horror movies, and maybe the average human agrees with me. So there is some sort of ...

Zack:

Right, but that's an associative statement, right? You're saying they are generally better, and that actually does seem to be what the general public agrees with, right? The problem isn't are they generally better? It's does it have to be that way, right? Could you imagine a world in which tastes shift and the talented movie makers really shun romance movies and they become bad? I mean, so there's a sort of embedded assumption here. It's something that we're looking into a lot now, and for anyone in the audience who's really interested, there's a lot of great work by a scholar named Jonas Peters, who's maybe more of a theoretician, but approaches these problems.

Zack:

There's questions about you say ... Partly the question, one way of motivating this is you think about robustness out of domain. You say, "When I go out into the rest of the world, is it always going to be true that romance is good and horror is bad? If I go to a different culture, do I expect that that can ... If I move to a different state, do I expect that this is the durable part?"

Zack:

So one kind of assumption here is that the things that cause the label to apply, that this relationship is actually stable. So you can imagine that the things that actually signal positivity versus negativity in a document, that this is relatively stable over years, but there's a complicated relationship in the background that influences, is the perceived sentiment positive? Is the movie quality high? What is the budget of the movie?

Lukas:

Right.

Zack:

What is in vogue? What are the houses spending money on? What are the publishers saying about what's getting distributed, whatever? But these things are all changing. But the causal features are ... You can think of it as, if there's a structural equation that says what is the perceived sentiment given the text, that thing is actually relatively stable over time compared to these other features.

Zack:

That's part of our empirical validation. So we have this model, right? We show that what we essentially get people to do is to rewrite the document. They're told to make a sort of a minimal edit, but it should alter the document such that it accords with the counterfactual label. So it was originally a positive review. We say, "Edit the review without making any gratuitous edits such that it is now a negative review." When they do that, you wind up with a new dataset, where for every original review that had horror in it and was positive, now there's a sort of bizarro counterpart, and it still has horror in it. The reason why it has horror in it is because of the instructions. The instructions said, "Don't make gratuitous edits. Don't change facts that are not material to the sentiment."

Zack:

So this is something that we can argue about whether it's actually statistically possible to have disentangled that horror is or isn't a causal feature without that intervention. But once we have this document, we say all the horror movies still contain horror, but their label has been flipped. All the romance movies still contain romance, but their label has been flipped, because other parts of the document, the ones that actually needed to change in order to flip the applicability of the label, have been changed.

Zack:

So if you train the model on the counterfactually revised data, you find that the coefficients flip. So excellent and fantastic are still positive words. But now horror is also a super positive word, and terrible and awful are still negative words, but romance becomes a really negative word. The cool finding is, if you combine these two datasets together and train on them, they kind of wash each other out. So you find that all of the things that look like they don't belong on these lists of important features actually seem to kind of fall off.
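Here is a schematic sketch of that experiment, my illustration with a few toy reviews standing in for the IMDB and counterfactually revised corpora from the paper: train a bag-of-words linear classifier on the original data, on the revised data, and on their union, then inspect the learned coefficient for a genre word like "horror".

```python
# Toy version of the counterfactual-augmentation experiment: compare the
# coefficient for a genre word across original, revised, and combined data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

original = [("this horror movie was terrible and dull", 0),
            ("a fantastic romance, excellent acting", 1),
            ("awful horror film, boring plot", 0),
            ("lovely romance, excellent and moving", 1)]
# Counterfactually revised: genre words kept, sentiment-carrying words edited,
# labels flipped.
revised = [("this horror movie was fantastic and gripping", 1),
           ("a terrible romance, awful acting", 0),
           ("excellent horror film, thrilling plot", 1),
           ("dull romance, awful and flat", 0)]

def coef_for(word, dataset):
    """Fit a bag-of-words logistic regression and return the word's coefficient."""
    texts, labels = zip(*dataset)
    vec = CountVectorizer()
    X = vec.fit_transform(texts)
    clf = LogisticRegression().fit(X, list(labels))
    idx = vec.vocabulary_.get(word)
    return None if idx is None else float(clf.coef_[0][idx])

for name, data in [("original", original), ("revised", revised),
                   ("combined", original + revised)]:
    print(name, "coefficient for 'horror':", coef_for("horror", data))
# Expectation, mirroring the paper's finding: negative on the original data,
# positive on the revised data, and roughly washed out on the combined data.
```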

Zack:

So we're dealing with causality here in maybe a more gestural way. We're not using the mathematical machinery of graph identifiability or anything like that. But we are getting an interesting kind of really suggestive result on real data. When we look at it, just to that last point that we were talking about with generalizing out of domain and are the causal connections durable, one thing that we looked at in the camera-ready version of that paper is we say, "Okay, we trained it on IMDB. Let's now evaluate it on Yelp, Amazon, et cetera, et cetera."

Zack:

When you go to those other domains, the model that was trained on the counterfactually augmented data, which is the combination of the original and the revised, does much better out of domain. That's just not guaranteed to happen. The supports are not shared. There's a lot of funky things happening statistically here. But what I think is suggestive here is it does sort of agree with the intuition. You say, on movie reviews, horror versus romance is an important part of the pattern. That's a real clue. But once you start looking at Amazon electronics or something, that's no longer actually maybe a durable pattern. Someone's like, "Oh, my Discman was such a horror" or something.

Lukas:

Well, I think what I really liked about that paper was sometimes I feel like, at least for me, some of the highly theoretical papers kind of point out problems, and they're kind of hard for me to even engage with, because I don't sort of see the practical effect. But you have actually such a simple mechanism proposed here that actually worked in your case, which I thought was super cool. I've noticed in my 15 years of working with ML teams, a lot of teams naively intuit to do things like what you're saying, and they usually feel bad about it. They feel like they're kind of doing this weird manipulation of the data to try to get it to generalize better by literally often rewriting the text in structured ways.

Lukas:

So I don't know. I just really enjoyed the ... It's a cool paper with a cool theoretical motivation that I think is really important, right, of kind of eliminating different types of bias and making these generalize better, but then also an interesting practical way of doing it. It's reminiscent of active learning techniques and things, but more interesting.

Zack:

Cool. Yeah, thanks. It was fun to write it. It was scary for a minute, though, because we're asking these workers to do this weird kind of thing and not sure of the results. It was sort of like a little bit of coin relative to-

Lukas:

Sure.

Zack:

... the pot of discretionary funds at the time. So it was sort of like there was this moment of, "Well, what the hell are we doing?" But yeah, it was nice that it worked out. I mean, I think that's just mainly one of the differences between a sort of ... Not to get into academia versus industry culture wars, but I think something that academia done right affords you is it's not like we need to get the product out or something. We have this thing more after, and it's like, okay, you have that intuition that this mechanism might be interesting. But the next step isn't just do it or not do it. It's like the ability to have a PhD student spend a lot of time to have kind of arguments about this for a couple months of, "How do we want to do this?", agonize over the experiments, kind of go back to ...

Zack:

Let's say we drew a toy causal model in our heads. What does this correspond to? So we have a lot of followup work coming from that now, but the fact that you get that, for somebody, it's their full-time job for a year, is thinking really hard about a problem. You can get from, "This is something kind of wacky. Maybe let's try it," and then call it ... versus "Okay, now this is your full-time job for a year, is we're going to think really hard about this one problem."

Lukas:

Yeah, yeah. That's super cool. I was kind of curious. So I was also looking at another recent paper that you pointed me to that was a little bit kind of harder for me to parse, algorithmic fairness from a non-ideal perspective. Could you describe what you're doing there?

Zack:

Yeah. So this is a paper with ... So I actually have a postdoc in the philosophy department now. So he's working with me and David Danks, and this paper is really about ... I guess in some sense, it sort of touches on the high-level theme of identifiability, which is ... There's a lot of well-founded concerns. If you're going to have decisions automated, these are decisions that in general are addressing problems that are sort of ethically consequential when there's bail decisions, lending decisions, hiring decisions, mediating the flow of information, any of these decisions. All the normal questions that you have about and concerns that you have about fairness and equity and justice continue to apply.

Zack:

I think as machine learning has gotten widely deployed, people have sort of become more and more aware of this. I think in 2015 or 2016, I was starting a blog on this. I didn't even know there was this community out there of people working on it. There weren't conferences like the Fairness, Accountability, and Transparency one and whatever. Now it's kind of blown up, and it's blown up for a few reasons. But I think there've been a few pivotal things that caught people's attention. One, there was the hiring screening thing that was filtering out resumes from female candidates. Probably the biggest thing that caught people's attention was the ProPublica article about machine bias. This is talking about recidivism prediction models. It's predicting who will get rearrested if released on bail.

Zack:

So you have these systems, and suddenly, basically, the claim is these systems are being used to guide sentencing decisions or maybe bail release decisions, and they're biased against black people. This is obviously a big problem. Then immediately there sort of arose this crisis of, "Well, how do you quantify that? What is the quantity that says there's bias?" So someone says, "Well, let's compare the false positive rates, or compare the false negative rates." You have this whole kind of literature. "Let's compare just the fraction of people that are released on bail among all defendants." You say, "Well, the distribution of crimes among defendants is maybe not the same." You have these metrics that are based on thresholds, but you're not necessarily considering all aspects of the distributions. People come back with these kinds of criticisms.

Zack:

There sort of emerged this whole community that spans sort of algorithmic fairness, which is looking at these kinds of problems and trying to say, "What are formal ways we could define fairness?" So you might say the model should functionally behave equivalently regardless of what your demographic is, fixing all your other data, and then the criticism against that is you say, "Well, that's meaningless, because if you withhold gender, but you have access to, say, all of my social media data or you have access to some sufficiently rich set of covariates, someone's gender is captured there. So what does it mean to say just that you didn't explicitly show that bit in the representation? If the information's there, you have it. So what does it mean to say it didn't impact your decision?"

Zack:

So there's this whole kind of line of work that's sort of trying to express this problem formally, and they're trying to express it in a world where everything is sort of defined statistically, where basically what we know is there's a set of covariates, which are just some numbers, some distribution. We'll call it X. There's a demographic indicator. It's like, "Are you in Group A or in Group B?" There's the predictions by the model, and there's the ground truth. This is sort of like now trying to say, "What are the kinds of parities that we want to hold?" So maybe I say, "I want an equal fraction of the population classified as positive, whether they're in Group One or in Group Zero. I want a model that doesn't actually look at the demographic. I want them to have the same false positive rates. I want them to have the same false negative rates. I want them to have the same [inaudible 00:13:38]." So people propose-
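For concreteness, here is a small sketch, my illustration rather than anything from the paper, of the parity checks being listed: given predictions, ground truth, and a binary group indicator, compare the positive rate (demographic parity), the false positive rate, and the false negative rate across the two groups. The toy arrays are made up.

```python
# Compare simple group fairness metrics between two demographic groups.
import numpy as np

def group_parity_report(y_true, y_pred, group):
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    report = {}
    for g in (0, 1):
        m = group == g
        positives = y_true[m] == 1
        negatives = y_true[m] == 0
        report[g] = {
            "positive_rate": y_pred[m].mean(),  # compared across groups: demographic parity
            "fpr": y_pred[m][negatives].mean() if negatives.any() else np.nan,
            "fnr": (1 - y_pred[m][positives]).mean() if positives.any() else np.nan,
        }
    return report

# Toy example: group 1 gets flagged positive more often given the same ground truth.
y_true = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0, 1, 1]
group  = [0, 0, 0, 0, 1, 1, 1, 1]
print(group_parity_report(y_true, y_pred, group))
```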

Lukas:

Do you think you could put these in the ... Sorry. I'm trying to connect these to, as you say, false positive, false negative. I'm just imagining the cases. I mean, can you say these in more real-world cases-

Zack:

Sure.

Lukas:

... so people don't have to make that connection?

Zack:

Right. Actually, this is sort of the focus of a lot of our critique, is that you could just describe the world in those terms and zoom out and start talking about various kinds of equations, and you could say a whole lot of things that seem intuitively plausible or reasonable, like, "I want this to be equal to that." But what's missing from this whole thing, so when people talk Word2Vec and say that Word2Vec is biased, Word2Vec is discriminatory, Word2Vec is racist, what does that mean? What actually is even the category of object to which these statements apply? You kind of realize really quickly that we've sort of abstracted so far away from the problems in that description that we don't have the relevant facts to say what is fair.

Zack:

So example would be if a model is being used to predict whether or not you're going to commit a crime and being falsely predicted as going to commit a crime means that you get denied bail or something, being predicted positive is really bad. If the model is trying to predict who, condition on were they to be hired, would be likely to get promoted and it's using this to guide resume screening or something like that, then getting predicted positive is good. So in one case, maybe-

Zack:

You'd be concerned about false negative rates, like someone who really has the skill level being denied the opportunity for the job. In another case, you'd be concerned about false positives, like someone who wouldn't commit a crime being flagged. But lost in all that conversation also is whether or not something is, like, justice promoting, whatever your normative positions are. And you can fix any set of normative concerns that you say define your morality. I would argue that even anywhere within the normal spectrum, there's still a problem that these descriptions of the problems aren't sufficiently rich to say what you should do. Because the facts that are omitted are what actually is the problem I'm addressing.

Zack:

If there's disparities in the distributions initially, what caused that to be? If I'm making a decision, what actually is the impact of the decision? What is the impact? How has it actually helped or hurt people if I change this decision making process? So an example might be, let's say you have a process that is determining admissions to higher education. In this case, intervening in a way that creates more parity in the decisions, or creates more demographic diversity in the ultimate decisions, I'd argue is a good thing. Now that's my normative position. Maybe someone who's not as progressive disagrees, but we can disagree about that. Even fixing my set of positions, if you change the situation and you say the issue is something like you're certifying surgeons or something, does subjecting someone to, say, a different standard across demographics actually help or hurt their careers?

Zack:

In this case, that might be a bad thing, because if you were to alter the decision making process around, say, a safety certification, then maybe the reality, the real world impact, would be to almost legitimize discrimination further down the pipeline, where now patients are going to treat doctors differently because they know they were subjected to different tests. So there's these different decisions that have different kinds of impacts, because of what actually is the decision you're making and what actually is impacted by the decision. Something that looks from a technical perspective like an identical problem could actually have a very different interpretation in terms of what is the justice promoting policy to adopt. So the concern is that by abstracting away from all those relevant details, you kind of lose sight of this.

Zack:

What we ended up finding, and this is really [inaudible 00:47:41] gets credit for this, and I think a big contribution of this paper is really just making this nice connection across a very wide interdisciplinary boundary, is that this is almost exactly in some ways a recapitulation of a lot of arguments that have been had for decades in the moral and political philosophy literature. There you have two approaches to theorizing. You have many approaches, but one of the axes of differentiation in how to theorize about these questions of justice is the ideal versus the non-ideal approach.

Zack:

The ideal approach says, let's just imagine a perfect world and just say that things that hold in the perfect world, we should just fixate on one of them and try to make our world look more like that. It's saying... So you can think of the reason why this can go wrong. For example, this kind of theorizing has been used to oppose policies like affirmative action in a blanket way where you just say, "Well, in the ideal world, we'd all be colorblind. So therefore we don't need affirmative action." That's unjust.

Zack:

The non-ideal approach is in some ways a more pragmatic way of looking at these sorts of problems where you say... Right. So among other things missing from the ideal approach is you don't say anything about what... You say how should someone behave in a world that is already in some ideally just or fair state, where everyone else is completely compliant with what justice demands of them, and your job is to not fuck it up. That's very different from the non-ideal approach where you're saying, "Hey, I live in this world, there are existing disparities. Now, given that I live in this world, given that there are these disparities, given that there are all these people who are bad actors out there, what is the justice promoting act?" And to recognize that that's not necessarily the same thing. Then you have to be concerned with, well, what are the disparities? Who has the right or the power or the legitimate mandate to intervene on what kinds of decisions to try to rectify them? And then what are the policies that are actually effective?

Zack:

So I guess these questions become... If you remove those details, these questions become vacuous. An example would be higher education admissions. So if you just say, "Okay, well we want to have the same fraction admitted among men and women," I think most of the people saying that aren't actually paying attention to the facts. This is among what population, right? So if you were to look at a typical school, there's already a huge gender disparity in the applications. So if you just accept people at the same rate... If you take any one problem and you really start going deep, you see that there's all these other details. What is the right thing to do? What actually counts as the fair decision making process? It hinges really precariously on a bunch of facts that are not represented in the problem descriptions.

Zack:

So I think that's our angle in this kind of critique, is to cast a light on that. There's this common saying in the Fair ML world, like, "Oh, we have 72 definitions of fairness or something like this. Look how many definitions." And the kind of maybe TLDR is, we don't have 72 formal definitions of fairness. We have zero. The reason why is because you have 72 different fairness-inspired parity assertions, but the real actual question of fairness is the question of what are the material facts that you need to make a determination about, and which apply in a given situation.

Lukas:

When you look at the different topics in machine learning, is there one that you feel like people spend way too little time on, like one that you just think has way more impact relative to the amount of attention that people give it?

Zack:

My only reluctance is that there are things that are... At least the trajectory is on the right track, like people are paying more attention. But I think in general, coming up with coherent ways of addressing problems that are beyond the IID setting is really key. I would subsume under this both addressing causality and mechanism, and also robustness under distribution shift. You have one very narrow subset of distribution shift problems, which is the minimax adversary setting, where the adversary basically has the same underlying distribution, but the samples are composed with some asshole who's able to manipulate your data within the L-infinity ball. So you've got like four million people working on that problem.

Zack:

But in the broader set of, what are the kinds of structural assumptions that allow us to generalize under distribution shift? I think we have maybe... This is a problem that plagues every single real world ML setting and that even among papers that say they're working on this problem, I think the vast majority of people don't seem to even understand the basic concepts. For this technology to actually be usable, I think we need to have coherent principles under which we can make confident assessments about when it's going to be reliable and when it's not. I mean, that's obviously a little bit biased maybe towards my research agenda, but I think that's-

Lukas:

No, that's fine. That's why we ask the question.

Zack:

It really is. I mean, I guess that's sort of the common sense wisdom for how you should choose a problem: you should pick something that you think is important and underappreciated, not overappreciated.

Lukas:

Yeah. Fair enough. I think actually you should feel happy about even being in that situation. I think somehow it's not uncommon that people get stuck working on problems they don't think are the most important problem, maybe, at least based on some of the conversations we've had.

Zack:

Part of that is people being lazy. Part of that is the friction, right? If you had a thing that you thought was important once, and then you built your lab around it and you got funding on it, and your whole life revolves around maintaining this research. I guess now that I'm running a big lab and now that I have finances to worry about and all that, I'm a little bit more appreciative of the handful of people out there who really did make these hard left turns at some point. I think Michael Jordan's a nice example of that. You could say he's like Miles Davis or something, but it's like, okay, each decade he had neural networks, Bayesian nonparametrics. Now, I guess it's mechanisms and markets or whatever he's working on.

Lukas:

Well, you've made quite a leap from music to deep learning I think.

Zack:

Yeah. I think it's time for me to retire. I think five, six years, that's the left turn. That's the left turn limit.

Zack:

I have a mortgage now though, so it's a little bit harder.

Lukas:

But you have a fancy computer behind you there. I don't know what.

Zack:

That's actually my Power Mac from '95, '97, maybe.

Lukas:

Does it work?

Zack:

Oh yeah. Did you have one?

Lukas:

Something like that. Yeah. With like HyperCard and yeah.

Zack:

Yeah, I think there's still like Oregon Trail and like Diamond, all those like weird freeware games, like Max Ski.

Lukas:

Oh yeah. Max Ski.

Zack:

Epic.

Lukas:

Final question. When you look at taking machine models from the ideation or the research paper to actually deploy it in production, where do you see the biggest bottlenecks or things falling apart the most?

Zack:

I think the biggest bottleneck is still problem formulation. I think if we were to be really sober about most of the things that people thought they were going to do, and you look at the way they posed the problem, and then the data they could actually collect, and if they could produce it, does this in any way actually address the problem that they thought they were going to address? I don't know how you would collect that statistic, and there are some measurement questions, but I think it would be really depressing. It'd be really sobering, because I think most things people think they're going to do are either kind of goofy and who knows if they work, or just not relevant and will never get used.

Zack:

I think that's figuring out where there's really a well motivated application of machine learning and what it is. There's like that weird nexus, the kind of pieces of information that you're asking people to put together. I think this is why, not to be like data scientists are great or whatever, but why I think people who are really good at this job are really hard to find in some way. And at the same time it's kind of puzzling, right? Because I don't think that the great data scientists are in general great or even rate-able mathematicians. I think for the most part, the people actually touching data are mostly lousy mathematicians. They're usually not world-class engineers. I certainly am not.

Zack:

What is it? I think it's this weird combination where, like, the weakest link kills you. You have to see, I think, to be good at doing this applied work, what is the important problem. You have to also know, what is the current technology like, is it in the ballpark of being able to get you somewhere on this kind of problem? How does it match against the data that's available? Then I think you have to, at least at an intuitive level, do this non-statistical thinking of what's actually the process where you're deploying it. The x-rays, or whatever it was we were talking about, the retinopathy imaging or something. This is sort of a good application of machine learning because what those images look like isn't changing over time.

Zack:

But you look at all these places in industry, people trying to build recommender systems, do all these things where it's basically... It's like totally incoherent. Nobody has any idea what happens when you actually deploy it because you're modeling something that's only in the vaguest or weakest of ways actually related to what you think you would like to be predicting. You're predicting clicks or whatever. You're predicting conditioned on the previous set of exposures. Almost never with any kind of coherent accounting for what happens when you deploy the model. I think this obstacle is people making that connection... I think this is always in some ways the hardest part of intellectual work in this area, is the bindings. The first level of difficulty is technical skills. Are you a good programmer? Are you a good engineer? Do you write proofs that are correct? You do whatever.

Zack:

But I think the conceptual difficulty in working on machine learning is, do you make the connection between this abstraction that you possess and the world that you're actually trying to somehow interact with? That to me, I think, often is where all kinds of things go off the rails. I think where a lot of even good academic work goes off the rails. It's like you can go down some rabbit holes asking really second, third order theoretical questions about these fairness things without ever asking, does this actually map onto any real world decision that I would want to make? Does it actually help someone with a problem that I purport to be motivated by? I would just say that... I mean, I don't know if that's kind of a banal answer, it's like-

Lukas:

No, no. It's great.

Zack:

Asking questions the right way or something, but...

Lukas:

Sure. Well, thank you so much, Zack. Thanks for taking the extra time. That was super fun.

Zack:

Yeah, for sure. Thanks for having me.
