There isn't one definition of fairness, with Joaquin Candela, Tech Lead for Responsible AI at Facebook
Scaling and democratizing AI at Facebook, while understanding fairness and algorithmic bias.
BIO

Joaquin Quiñonero Candela is the Tech Lead for Responsible AI and a Director of Engineering in AI at Facebook, where he built the Applied Machine Learning team, which powers all production applications of AI across Facebook's products. Prior to this, Joaquin taught at the University of Cambridge and worked at Microsoft Research.

Reference papers from Timnit Gebru:    

- Race and Gender

- Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning

- Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification

TRANSCRIPT

0:00 Defining fairness

0:22 Intro + Bio

0:53 Looking back at building and scaling AI at Facebook

10:31 How do you ship a model every week?

15:36 Getting buy in to use a system

19:36 More on ML tools

24:01 Responsible AI at Facebook

38:33 How to engage with those affected by ML decisions

41:54 Approaches to fairness

53:10 How to know things are built right

59:34 Diversity, inclusion, and AI

1:14:21 Underrated aspect of AI

1:16:43 Hardest thing when putting models into production

Joaquin Candela:

There isn't one definition of fairness, right? If you look at philosophy, whether it's moral or political philosophy. Or you look at the law. Or even you look at the vibrant community in the computer science community and machine learning who is thinking about algorithmic bias. One common pattern is that you have multiple definitions of fairness that are mutually incompatible. So you have to pick.

Lukas Biewald:

You're listening to Gradient Dissent, a show where we learn about making machine learning models work in the real world. I'm your host Lukas Biewald. Joaquin Candela is the Tech Lead for Responsible AI at Facebook. Prior to that, he built the Applied Machine Learning team, which powers all production applications of AI across all of Facebook's products. Before that, Joaquin taught at the University of Cambridge and worked at Microsoft Research. Today, I'm going to talk to him about fairness and algorithmic bias, and scaling and democratizing AI at Facebook.

You were running the applied machine learning team at Facebook, right? During a time when there was tons of machine learning innovation going on. I'd love to hear what was happening when you started working on that. And kind of what tooling was necessary and how that kind of changed over the time that you were working on that.

Joaquin Candela:

I think the context is very important here. So I joined Facebook in mid 2012. There wasn't a lot of ML people in the company. And if you think about the two biggest applications, it was News Feed ranking on the one hand. And then ads ranking, right? So two ranking problems. So as far as the models were concerned you mostly had binary classifiers. That were used as inputs into a ranking function, right? So if you think about news feed ranking, you would have my value function is some combination of, I give every click a certain score. I give every comment a score. I give every share a score, et cetera. And then we've got to build myself a value function. And so I have all these binary classifiers. That predict the probability that someone will click share or comment or whatever. Before I show them something.

And then I kind of use that to sort of rank content. And for ads it's a similar thing. Right, in ads back in the prehistoric times, click-based advertising was the big thing. Maybe like... I don't even remember now. Like 15 years ago, 20 years ago, whenever. And then you know that you had conversions. And then just more subtle things where you have brand. And then not all conversions are created equal. And then the only thing that happens of course, is that the complexity of the content evolves. If you think about when I joined Facebook, a lot of the content was mostly text. Images of course were there, fewer videos. And now that sort of becomes more complex and you have more multimodality. So I joined Facebook at a time when the company had just IPO'd and revenue was flat.

And so there was a huge pressure to try and move the needle in ads. I joined the ads team. And one of the big levers to move revenue was like, "Oh, can we get better at predicting clicks and conversions on ads." But at the same time you start to have... We started to move away from only serving ads on web on the right hand column. To actually also serving ads on mobile. And then actually end of 2012 when I joined, if you look at where people were accessing Facebook from. Web was kind of slowly declining or being stable. And then mobile was rocketing. And I think they crossed sort of at around the end of 2012. The types of surfaces you have, the types of things you're predicting starts to increase.

And so the first dilemma that I had, was I looked at what were we doing? And we were using just a Soyuz in a way. Like to go to the space station, nothing fancy, just like the good old Soyuz. We were using gradient boosted decision trees as feature transformers. Mostly you could think about it that way. And then we were using online logistic regression sort of after that. Cascaded with it. So-

Lukas Biewald:

What would you... Sorry to interrupt, but what then would you train the intermediate gradient boosted tree on? What would be the kind of thing that, that would try to predict?

Joaquin Candela:

... You'd still train them on the event. On the binary event that you're trying to predict like clicks or conversions or whatever. But obviously you'd benefit from the robustness that, that gives you. You don't have to worry too much about scaling, and translation, and whatever of your features.

Lukas Biewald:

But then you would feed them into a simpler model?

Joaquin Candela:

What you would then use is the trees themselves. Every tree has a categorical feature as it were. And so then your logistic regression model which would be training online, has a bunch of inputs that are categorical, which are the outputs of the trees. So it's basically kind of relearning the weights associated to its leaves.

Lukas Biewald:

Interesting. Wow! I had not heard of that. So the thing that's changing then is sort of the combination of the... You train a gradient boosted tree, then you pull the trees apart and then you relearn the weights of each tree in the combination?

Joaquin Candela:

Yeah. It's a hack. It's not a fully back propagated model. Because you train your trees every few weeks or whenever. And then you have logistic regression that takes as inputs both like the binary indicators. So every one of the trees you train, like hundreds, maybe a couple of thousand trees. So you have... And each tree has a dozen leaves or whatever. And you treat those as a one-out-of-12 kind of encoding. But then you're learning a weight for each of those. And you're kind of running in real time. And then you have other features that go in side by side, that can actually be sort of continuous features as well.
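
A minimal sketch of the setup Joaquin describes, written here with scikit-learn: boosted trees act as feature transformers, each example is encoded by the leaf it lands in per tree, and an online logistic regression relearns a weight per leaf. The data, model sizes, and batching below are illustrative choices, not Facebook's actual system.

```python
# Sketch: gradient boosted trees as feature transformers feeding an online
# logistic regression. Data, model sizes, and batching are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=10_000) > 0).astype(int)  # stand-in for click events

# 1) Trees are trained offline ("every few weeks").
gbdt = GradientBoostingClassifier(n_estimators=100, max_depth=4).fit(X, y)

# 2) Each example becomes the tuple of leaf indices it falls into, one per tree,
#    treated as categorical features and one-hot encoded.
leaves = gbdt.apply(X)[:, :, 0]                      # shape (n_samples, n_trees)
encoder = OneHotEncoder(handle_unknown="ignore").fit(leaves)

# 3) An online logistic regression relearns a weight per leaf and can keep
#    updating as new events stream in.
online_lr = SGDClassifier(loss="log_loss")
for X_batch, y_batch in zip(np.array_split(X, 100), np.array_split(y, 100)):
    leaf_batch = encoder.transform(gbdt.apply(X_batch)[:, :, 0])
    online_lr.partial_fit(leaf_batch, y_batch, classes=np.array([0, 1]))
```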

Joaquin Candela:

That's the setup. That's what I found when I got there. And so the key decisions, since you wanted to talk about building applied ML and all that. The dilemma was the following. Was like, "Well, this thing is begging for a proper neural net to be thrown at it." It's almost like we've handcrafted a Frankenstein type of neural net by having these trees, with logistic regression concatenated to it. But we're not even training it together. We first train the trees and we kind of chop the output. And then we kind of plug this other thing to it. And then that's the thing that we train. So it was already obvious that doing that would probably give us gains. This was-

Lukas Biewald:

And this is actually... So 2012 it was obvious that a neural net would give you an improvement. I'm trying to remember.

Joaquin Candela:

... Sort of.

Lukas Biewald:

Was that obvious to everyone?

Joaquin Candela:

No. Because if you think about the... I think it was Russ Salakhutdinov and Geoff Hinton. And I might be forgetting some co-author. So I deeply apologize because I've had a long day already. Apologies. But this was the big ImageNet paper. It was with convnets I think, from 2012 if I'm not mistaken.

Lukas Biewald:

Mm-hmm (affirmative). That sounds right.

Joaquin Candela:

Yeah. So I don't think it was clear. I think it was just the beginning of the hockey stick. But I think it wasn't clear. If that had been two years later, then it would have been obvious, right?

Lukas Biewald:

Right.

Joaquin Candela:

But at the time it wasn't clear yet. And you always need a couple of months to realize that something happened. So the thing that really struck me was the time it took from doing an experiment... Which a lot of it was really feature engineering. Maybe there were some experiments with tuning your learning rates. And tuning the architecture of your trees and the architecture of your logistic regression. Although there isn't a lot of architecture to be tuned with logistic regression. The time to go from someone has an idea and starts to do some offline experiments, to actually having your new click prediction or conversion prediction model for mobile or whatever in production, and actually materializing those gains, would be several weeks.

It'd be sometimes six weeks, sometimes two months. And I thought, "Holy crap, that's not great." And so the crossroads in a way, on the one hand you're like, "This thing I have is simple, but we're still getting a lot of gains by tuning it." And on the other hand, I can go and just replace my Soyuz with something sophisticated. So that was the crossroads that I... So do you want to know what I decided to do?

Lukas Biewald:

Tell me, yes. I feel like you're kind of picking on this Soyuz. I didn't know that was the metaphor for the tree thing.

Joaquin Candela:

It's true. The Soyuz... Well, I think the Soyuz is rudimentary in the sense that the computer systems that the Soyuz has in it are probably 50 years old or something like that. But they work. So the reason I use the Soyuz analogy, is more like it gets the job done. It's like a gradient boosted decision tree and logistic regression. As an aside, one thing that triggers me these days a little bit is I see people jump straight. If they have to solve an NLP task, they'll use either some sort of a sequence model. They'll use an LSTM. They'll use... what I mean is, a transformer or whatever. And sometimes you'll go like, "Did you try like a MaxEnt model?"

"Did you try a good old bag of words with logistic regression?" And the surprising thing is that I would say between 20 and 50% of the time you get the same results. And then you're like, "Did you realize how much cheaper this thing is in terms of anything you care about?" Whether it's training time, inference time, whatever. So basically the big bet there was to say, "Well, what do we need to do here is we need to actually allow our teams to ship every week." And that was the big model, was like ship every week. Do whatever it takes so that every week we can ship new models in production. And what that meant was we need to dramatically accelerate that path from, "I have a new model that I could put in production." To like, "It's in production." And that kind of triggered the five years of work.

Lukas Biewald:

And so what were the keys? I mean, tell me the pieces that you needed to build in order to allow that to happen. Because I'm sure a lot of people listening to this are thinking, "I'd like to ship a model every week." What do you need in order to do that safely?

Joaquin Candela:

It was many things at different levels. So at a very low level, it's about fitting in seamlessly with whatever infrastructure you have for inference. And adopting some sort of standards, which seems super easy and trivial. But even that you shouldn't take for granted. The part that I thought was even more interesting is that I think what was slowing people down was probably two or three things. One was, it was extremely difficult to share work between people. Because people would be running experiments in their own dev servers. And even having... As we all know configs back then weren't sort of easily portable. It would just take you a couple of hours or whatever. You'd have an energy barrier before I could actually play with what you had done.

The second thing, which I think is related, is that you started to have a lot of teams reinventing the wheel. So a lot of the work that was being done was actually duplicate. Because the number of surfaces on which we showed ads sort of kept increasing. And the types of modalities kept increasing. You kind of had teams that focused on one of those voxels in your tensor of configurations. And they wouldn't sort of easily talk to each other. Or the work wouldn't be discoverable. So thing number one was automate everything. You have to automate everything. You have to make it ridiculously easy and you have to abstract everything from the engineer trying to deploy something. Especially because we were growing very fast and you get a lot of people who are joining the company fresh from somewhere.

Maybe they are good applied researchers, but they're not infra people necessarily. So abstracting and automating, super important. The second, shareability. Make sure that you abstract and encapsulate things in a way where they're super easy to share. So I can see what input features are working for you. If you're working on conversion prediction models for in-game ads or whatever, I can super easily see that. Obviously you have infra work again like codes. The way we store and represent data is very heterogeneous. So it's a pain in the butt usually to... Even if you're only looking at reproducing training, depending on what your setup is, that's work. But then going... Obviously the way you run your data pipelines when you're training offline versus when you're trying to serve in real time is different almost always.

And obviously, when you're online, you're on a budget. So you want to make as few calls as possible when you're serving. SO you got to sort of figure out how to abstract those things. And again hide all the complexity. And then the third one, which I think is really interesting is really think about collaboration by design. How can you build an environment where I go in and I can see every single experiment anyone has run. And I can go and by clicking, I can see first of all who they are. Who they are is huge, right? Because then I know who to ask. Especially if I'm new to the company and you have a company that's growing fast. So the equivalent of your git blame or whatever is super important. You need to know who people are.

The second one again is so much is wasted in terms of replicating experiments that someone has already done. So bookkeeping is extremely important. And then the ability to just beg, borrow and steal bits and pieces, either of feature computations or models. We were exposing learning curves and things like that as well. So you can actually sort of browse them. And then another component... And I'm not being super organized here. I think I've said it's three things. And I'm at the fifth thing already. But another one is try to be as modular as possible. And if possible as well, language agnostic. And separate out the language or the platform that you're using to kind of specify whatever models you're building from the definition of a workflow and execution of a workflow.

So it's really abstracting that away. And sort of thinking about an ML workflow as an ML workflow. And I don't care if you're specifying your models in MATLAB, Octave, Python, PyTorch, TensorFlow, whatever it is that you're doing. A lot of the bread and butter that you're doing is kind of common. So really layering it, modularizing it, was sort of huge.

Lukas Biewald:

It's interesting. I feel like the things that you're saying are the things that all ML leaders want that I talk to. But I think that the place that they get tripped up. I mean, all the benefits that you're saying totally makes sense. But I think the sort of downside is that it requires getting everyone's buy in into kind of a standard way of doing things. And I'm curious how you got that. Because ML practitioners are so hard to hire. And they're often opinionated in working different parts of the org. How did you get them all to buy into the same system?

Joaquin Candela:

Often opinionated you say. I would like to meet one that is not opinionated. Sometimes not opinionated. It's tricky and this is actually... So you're putting the finger on an amazing point. Which is really almost like change management. Very hard, but several reasons why it's hard. Reason number one, in any fast moving company where you have low hanging fruit... I mean, this is not unique to ML. Who's going to actually pause and clean up the kitchen, and pay back some tech debt or build infra first so you move faster? You're almost like, "Hey, why don't you do it?" Like, "I don't feel like doing it myself." So that's one challenge.

The other one of course is a sense of pride that people have. I mean, and especially... I used to be in academia. And in academia, the thing that determines your worth is almost the praise you get for the work that you do. But you put your name on your papers. So culturally it's tricky to sort of say, "I'm going to surrender some of that for the greater good." So the tactic that we took, one of them at least, was to be ridiculously laser focused. The one thing that I should have clarified is I never dreamt that one day I would build the applied machine learning team at Facebook. That was not the intent. I was in ads, we were focused on ads. But even within ads, we already started to have several teams working on similar aspects of the problem.

So at least we worked on generating alignment and a vision within that. And that was not like a million people. There was just a couple of dozen people. And we were all feeling the pain and the urgency to move fast. So it was semi obvious that this was going to be good. It was a bold bet. So you need to kind of generate alignment, both with the people who are deploying things and doing experiments every day, but also get management to give you air cover. Because things are going to slow down. I can remember talking to my manager at Philz Coffee at the end of 2012, when revenue was still not picking up. And he was asking me, "Hey, you haven't been shipping models that often. What's going on?"

And I'm like, "Well, actually we're going to slow down even more." And it was like, "Explain." And then you explain, you get into the details. You get like buying on the vision at all levels. But you keep it very narrow. And then what happened once we started to have progress. And stuff started to move faster and you saw productivity increase. Then we started to talk to the feed ranking team. And the feed ranking team we decided to join forces for summer 2013. And that was really interesting because there again, you have to just be laser focused. Don't think about the features first. Don't think abstract first. Don't think about... It's not like platform first and then we see what happens. It's like be extremely concrete. Like here's the types of things I want to make work.

And also just accept that one day you'll have to rewrite it and that's okay. But for now you want to prove the hero scenario. You want to prove, "Hey, this can actually be amazing." So that was the approach. It was extremely laser-focused. Start very small, start adding people, build almost a community that supports it. Really go from a core and then start expanding.

Lukas Biewald:

It's interesting, at Weights and Biases we make a tool and I actually didn't really realize how similar our tool's vision is to what you were building. Our hope is to really help with collaboration and reproducibility. And sort of the same idea of we really want people to be able to find the person that made the thing. And not have to redo all the work from scratch. And I think we have maybe even more trouble than you getting buy-in. Because no one owes us anything. Why would someone want to use our tool?

And I feel like for us, a big part of it is showing little wins to the individual practitioner. I feel like there's little details in our product. That we try to just give something helpful right out of the gate to someone new coming in before they do the collaboration. And before they have to really buy into our system. I wonder if there's any things like that for you? Or people like, "I want to be able to see the system metrics of my runs." Or something like that, that got people to use their stuff.

Joaquin Candela:

Yeah. Excellent question.

Lukas Biewald:

I'm just mining you selfishly for features for a product really.

Joaquin Candela:

So shameless. One caveat Lukas that I have to say of course, is that when I was very involved with this stuff in the early days, that was already eight, seven and six years ago. I know that things have changed a lot. I know that you have open source tools today, which if we had had them, we would have just used them directly. Including maybe Weights and Biases products. So your question is, if you set aside the collaboration benefits and all that, just in terms of pure individual contributor productivity, why would I care? I have both good news and I have bad news, I guess. The good news and the bad news. I think the bad news maybe is that some of the problems we solved were actually a bit Facebook specific.

So I think that's not going to be useful to you. But in terms of abstraction and just the ability to almost at the click of a button... The fact that you could actually clone a workflow... I'm going to give you an example. Here's an example. So you're the Instagram ML team. And Instagram's never been ranked before. Instagram's feed has been shown in decreasing chronological order, from most recent to less recent. And your task would be like, "Hey, design me a ranking system for Instagram." That's kind of a tall order. But imagine now that you have an environment where you can actually just look at the production system that ranks News Feed. There's a lot you can borrow there.

And so I think just like the ability to borrow. Whether it's the features that seem to be working the best. The models, the sort of training schedule, the hyper-parameters. All of that is a big thing. In parallel to that you have abstractions. Again, like at Facebook I don't even know how many distinct and mutually incompatible data stores we have. But you can imagine. And the fact that the tool will actually abstract that for you is very useful as well. Then if you have to build a workflow yourself. Building workflows is a pain. If you have to do them from... If you don't have a tool to build workflows, it's just pain.

And then another one, tools for debugging and automation. So I'll give you an example of a couple of things. First, for automation: automatic feature selection. The fact that you have a tool that actually scans for every feature you could possibly use. And then while you're sleeping this is making sure that you have maximum machine utilization. And you're just doing whatever feature selection algorithm you want. I don't care. It doesn't matter. But it's just doing work for you. True story: ads engineers would come in on Monday morning. And they would see proposals for new models. And you're like, "Oh, this looks good." It gets a couple of 0.1 whatever percent points of gain in whatever metric and that's good.

The other one, a very simple reason ML systems fail, is because some data pipeline fails. And again if you have to be checking ad hoc, it's a pain. Imagine that you have these beautiful dashboards with colors and whatever, that just tell you what features are not working. And in which way they are not working anymore. Is it that you have like statistical things, where they still produce valid values, but they're the same all the time? Or is it that you get things that are not a number? What the hell is going on? That's super useful as well. Or tools to look at your learning curves and whatever. So these are a bunch of examples of things which, if you're an ML engineer, you want that stuff.
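
A toy version of the kind of feature-health check described here, flagging features that have gone constant or are producing too many NaNs; the thresholds and feature names are made up for illustration.

```python
# Toy feature-health check: flag features that went constant or are full of NaNs.
# Thresholds and feature names are made up for illustration.
import numpy as np
import pandas as pd

def feature_health_report(df: pd.DataFrame, nan_threshold: float = 0.01) -> pd.DataFrame:
    report = pd.DataFrame({
        "nan_rate": df.isna().mean(),
        "n_unique": df.nunique(dropna=True),
    })
    report["constant"] = report["n_unique"] <= 1            # same value all the time
    report["too_many_nans"] = report["nan_rate"] > nan_threshold
    return report[report["constant"] | report["too_many_nans"]]

todays_features = pd.DataFrame({
    "clicks_7d": np.random.poisson(3, 1000).astype(float),             # healthy
    "stale_score": np.zeros(1000),                                     # pipeline silently broke
    "conv_rate": np.where(np.random.rand(1000) < 0.2, np.nan, 0.05),   # NaNs creeping in
})
print(feature_health_report(todays_features))
```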

Lukas Biewald:

That totally makes sense. I want to make sure we leave plenty of time for the other thread of questions. Which is the new work you're doing as... I think it says on LinkedIn you're the tech lead for Responsible AI. Which sounds like a tall order. I mean, there's so many possible questions here. I was kind of wondering what would be the most interesting. But I think that... I guess the genuine question that's top of mind for me is always walk me through a real decision. Where it wasn't obvious what to do. And by some kind of analysis or thinking about it, bringing your expertise. You were able to kind of guide Facebook to a better decision. Does something come to mind?

Joaquin Candela:

I'm going to start from the India elections. This was the biggest election in human history with almost a billion eligible voters. So what's the challenge and where does AI come in? Well challenge is that there's a lot of concerns of election interference. Through the spread of information, which is either false or misleading. Or voter suppression or whatever it might be. And of course the way you address this, if you're Facebook or a similar company, is you create guidelines. For what things are acceptable and what not. There's of course, legal constraints as well. And then you just hire a bunch of humans. As many as you can. And you would know about that because you've worked on that in the past. So you have humans who are actually processing a queue of work.

And that queue of work is just reviewing posts. But when you have a country the size of India and the volumes of information or content that are created every day on Facebook, it's just impossible. You cannot hire enough humans to review even a decent fraction of everything. So it's impossible. So the way you use AI is you use AI to prioritize human work. And the way you do this, is for example you train a type of classifier that we have used. We call them civic classifiers. And what they do is try to tell whether the piece of content is just a picture of a cat. Which is like, "Whatever, it doesn't matter." Or people like me, I'm a runner. So did I post a new run on Strava? It's like, "Whatever, it doesn't matter for the elections."

Or whether it's actually someone discussing something that's actually relevant. Social or political or civic issues. And then at least make sure that that type of content gets coverage. So what's the challenge? We're talking about resource allocation. We're talking about, you have these set of humans that we're paying to protect the elections from interference. And now the question is... And we're using AI to prioritize our work. Well, what happens if your NLP works only for Hindi?

Lukas Biewald:

Wait, sorry. Could you even back up a second? Because this is probably obvious to you. But it's not totally obvious to me. Assuming you had unlimited human resources to do something, what is the thing you're trying to do? I mean, obviously you're not trying to block everything that's on the topic of an election.

Joaquin Candela:

Yeah. Apologies.

Lukas Biewald:

What's the goal?

Joaquin Candela:

I should have explained that. You will block things if they violate the laws. So if you have... I don't know. Defamation of public figures with lies or just like illegal content. Or reduce the distribution of things that are harmful. So it's both like filtering and reducing distribution of things that violate our community standards or laws. So that's the action that you're taking with a combination of humans and AI. And so the challenge there again, is if you look at this from a fairness point of view. Maybe your definition of fairness is that if we're investing a certain amount of human resources to do this job, that we want to make sure that everyone in India gets protection from this type of harmful content. And then the question there becomes what does that mean? Because if you think about algorithmic fairness and bias.

If you're thinking about using AI to recommend jobs to people, then, if you're in the US, you think about protected categories, you think about gender, age, race or ethnicity and stuff like that, where there's anti-discrimination laws that exist. But if you think about this from India you're like, "Oh, politically, what are the hot areas?" And then immediately when you work with local people, it's things like caste or religion. But obviously we don't have that data. And it's not clear that... it's probably not good that we should have that data. So in the end you do a bunch of work and you figure out, what can I do? And so we ended up using language and region.

Lukas Biewald:

Sorry. Well, again if you had caste and ethnicity... I'm sorry. I feel like I'm showing my ignorance here. But if you had those things, what would you do with that? What would be the fair thing to do that you're trying to do?

Joaquin Candela:

So the challenge with fairness. And that's where we're going to go back all the way to music somehow. Is that there isn't one definition of fairness. If you look at philosophy, whether it's moral or political philosophy. Or you look at the law. Or even you look at the vibrant community in the computer science community and machine learning. Who's thinking about algorithmic bias. One common pattern is that you have multiple definitions of fairness that are mutually incompatible. So you have to pick one.

In this case the one that you could pick, is you could say, "Well, I want to make sure that everyone, irrespective of their caste or religion, is going to see content that has received a comparable amount of protection against harmful or basically misleading content." Imagine there's voter suppression type of content out there that spreads lies about... I mean, there's even stuff like just lying about when the election day is or whatever.

Lukas Biewald:

I see. Actually thanks. That's helpful.

Joaquin Candela:

Then you kind of miss it. Or maybe lying about what a particular politician stands for. Just sort of putting out something that's completely false. So you want to... Go ahead.

Lukas Biewald:

So I guess one way... Just to repeat back what you're saying. One way would be we want to make sure everyone across groups like caste or religion gets the same level of protection?

Joaquin Candela:

Correct.

Lukas Biewald:

By actual humans looking at the content?

Joaquin Candela:

That's exactly right.

Lukas Biewald:

Why might that not be the most fair approach? Would there be an argument for a different one?

Joaquin Candela:

Yeah. You have situations where... So here this would be an equal treatment type of argument. Where you would say, "We want to treat everybody the same." And equal treatment is, I think, in many cultures the first instinct that you have. But you could think about other things. On the one hand you could dial things more towards equity. And in equity you could look at historical disadvantages that some groups might have had. Is there a case where historically some castes and religions are privileged compared to others?

And the pressure or the amount of misinformation. If you think about the US, not every group in the US has historically had equal access to voting. And even today, voter suppression efforts are not uniformly distributed. Some groups are actually more targeted than others. So you could actually say, "I'm actually going to understand whether I should prioritize outcomes for some groups over others." And if you think about... There's many as many sort of public policies in society that actually sort of aim at focusing more on some groups that have been disadvantaged.

Lukas Biewald:

I see. And so what did you do?

Joaquin Candela:

So in this case we went for the equal treatment approach. And then what we did, this triggered a whole amount of work. First of all, we don't have caste and religion. And there's many reasons, there's many risks why a corporation shouldn't have certain types of demographic information. There's a lot of examples in history why it's just dangerous to have repositories of certain demographic characteristics. So what we did is we used reasonable alternatives like language and region. And so we said, "We're going to make sure that all regions in India..." And not all languages, because there's a huge number of languages in India. But I think we went for the top... I don't remember anymore. The top 15, plus or minus, languages were protected.

And then you can get into things. How do I translate that into math and code? So you need to look at many levels. One, the most basic thing is when you look at the data, you look at two things. You look at representation. And then you look at biases in the labels. So representation: make sure that you build yourself your matrix of regions and languages. And make sure that for each of these buckets, you have a sufficient amount of labeled training data. And then once you're in one of these buckets, you get yourself some ground truth data. And that would be a very long conversation to figure out what that is. But expensive, high quality data that you can use as a reference.

And then you kind of measure. You look at the difference in errors that you have in your labeling process across all these buckets. And you want to make sure that you don't have systematic differences. But of course that's not enough. Then you actually look at your models themselves. So you train your model and you look at things like, "Oh, in the prediction errors do I have systematic differences?" And one cool thing to look at if you have binary classifiers and you're using... Here you would be using the probability that something is civic content to prioritize a review. In that context, it's very reasonable to actually use calibration. To look at the whole calibration curve. And make sure that my calibration curve, which maps scores to actual outcome rates.

Make sure that those curves look similar for different groups. That I'm not over predicting for one group and under predicting for another. Because if I were over predicting for a particular language, then I would be allocating more human resources to that language. And if I'm under predicting for another, I'm allocating fewer resources, but it's not justified. Because that doesn't reflect the actual volume of content that actually needs to be reviewed for both.

Lukas Biewald:

But is it... I guess, is it possible that some language has more banned content? And then how can you be sure that... It seems like your model would sort of naturally use that as a feature in the model. And then it would sort of naturally get over index. How do you back that out?

Joaquin Candela:

I think that's how you... That's where you use calibration. So if you think about a calibration curve, you're looking at how your scatterplot of... You group your scores. So the thing is producing a score between zero and one, which we interpret as a probability that something needs to be reviewed. It's true that the distribution of scores is going to be different between languages. Or if one language is being more under attack, then you're going to see more stuff with a higher probability. But what you really want is once you bucket things by score, you want to kind of look within those buckets. What's the actual percentage of content that was violating in that bucket?

And you want to make sure that 0.6 means roughly 60% for any language. And then the pieces of content that fall within that bucket is going to be different between languages. But that's not a problem. I mean, and eventually as a result of the distributions being different of scores, you'll end up investing more or less resources in a language or another. But at least you have apples to apples in terms of your risk scores.
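
A sketch of this per-group calibration check: bucket the model's scores and compare the predicted probability to the observed violation rate within each bucket, per language. The data, language labels, and the way miscalibration is simulated below are all illustrative.

```python
# Per-group calibration check: within each score bucket, does the predicted
# probability match the observed violation rate for every language?
# Data, language labels, and the simulated miscalibration are illustrative.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)

def simulated_group(n, miscalibration=1.0):
    scores = rng.uniform(0, 1, n)                     # model scores in [0, 1]
    true_rate = np.clip(scores * miscalibration, 0, 1)
    labels = (rng.uniform(0, 1, n) < true_rate).astype(int)
    return labels, scores

for language, miscal in [("language_a", 1.0), ("language_b", 0.7), ("language_c", 1.3)]:
    y_true, y_score = simulated_group(20_000, miscal)
    observed, predicted = calibration_curve(y_true, y_score, n_bins=10)
    worst_gap = np.max(np.abs(observed - predicted))
    print(f"{language}: worst per-bucket gap between predicted and observed = {worst_gap:.2f}")
```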

Lukas Biewald:

I see. So you let the model use the language. But then you back it out in sort of a post analysis based on the actual performance. Am I explaining it right?

Joaquin Candela:

I mean, if you're using NLP and you're building different classifiers for different languages. Then inevitably you're using the language in your NLP model. I mean, having said this, of course we have a cross lingual embeddings and all these fancy things obviously. But you'll still need some sort of training data. The question of whether you should use an input signal or not, is a long and fascinating discussion as well. And I think it is in my view, somewhat orthogonal to many of the ways. In which you would make sure that you have procedural fairness in your classifier. So we need another couple of hours to discuss that because that's actually a very active topic.

One of the papers that explains it well is by Cynthia Dwork and coworkers, a paper called "Fairness Through Awareness." I'm probably butchering the title. There's more to it, but this is the bit where, if you're trying to be fair across genders when you're recommending job offers, should you actually not use gender as an input to your algorithm? Or should you use it? And there's examples that illustrate both positions. So I don't think it's as easy as to say, "Oh, if I don't use gender as an input to my algorithm, then I know I'm going to be fine." And the reason is simple. It's that, A, you have a lot of features that correlate with gender anyway.

But then also if you think about it from a causal perspective, you're going to have certain things you can measure. Which have opposing effects depending on whether you're a male or female. For one, females carry babies and get gaps in their CVs. And so is the effect of a gap in your CV the same depending on your circumstances? That's not clear. Causality actually is probably one of the most exciting lenses on fairness in many ways. But its super early days.

Lukas Biewald:

Interesting, I guess to ask you another question, that's probably another long question. And this is one of the ones I always worry about with the fairness and AI stuff. I guess, how do you engage with the people who are actually affected by these decisions? It always makes me a little nervous. This idea that scientists go in and sort of get to decide what's fair. And I can see... I can kind of see why. It's important that someone kind of understands the algorithms. That's one point of view. But I mean how did you engage with the folks in India who are affected by this? To even decide what's the fair thing to do?

Joaquin Candela:

It's essential for AI practitioners to understand that responsible AI is not primarily an AI problem. It's as simple as that. And you pointed to the question of governance. Who should decide? It's not the AI practitioner. It's not me for sure. So how do you... What does that mean in practice? In practice what it means is you build something like a fairness maturity framework. We're building one like this. You work with ethicists, with lawyers on building it. You try to capture the different interpretations of fairness that exists. And what this ends up being is not a tool that tells you what to do. It's a tool that gives you a big menu of questions that you should ask and consider. And then what you build is you build processes of consultation. Where you sort of put the options on the table and then you have a decision framework. Where you sort of weigh pros and cons and risks. And this has been used way before AI.

These kinds of risk assessments and decision processes, consultation processes and so on. And one example of this, I think, that is quite interesting is that Facebook has been building this external advisory board. It's not fully rolled out yet. But it's 40 people if I remember correctly, who represent all kinds of countries and political views and other types of views. And their goal is going to be, in the context of content moderation, to kind of look at all these edge cases that are hard. And then come up with recommendations. Obviously they're going to carry a very heavy burden of representing lots of people. But they don't work for Facebook. They're an external body. And I think that one of the... If you want ideas for what to do next after Weights and Biases, I feel like... Although I'm sure you're going to be busy with this for a while. I think the question of governance in AI... And how to build infrastructure.

And this is people and infrastructure for transparency, for accountability, for risk assessments. You see the recent EU paper on AI scratches the surface by asking some of the big questions that need to be answered for responsibility. I think we're only getting started here. But the thing that I'm most excited about is that AI is going to replace humans in decision making, across the range of decisions that people make in any domain. And I think most of the time it's going to be a huge improvement. But now all of a sudden we need to go through thousands of years of political science, of how societies govern themselves, and kind of bring that into AI. So that's a pretty freaking daunting task, but I think this is what we're talking about. And every investment that I see in this is orders of magnitude smaller than it needs to be.

Lukas Biewald:

When we were last talking, we were talking about an actual kind of case study. I thought that was really interesting. On voting in India and stopping the spread of misinformation. And how there isn't kind of one definition of fairness. And you kind of give people a menu of options, which I think is a really interesting perspective. I guess I'm kind of wondering if you could say a little more about what might be on that menu of fairness? I think it's so interesting when different people have different ideas of what's fair. And actually you say it's not your role to resolve it, but you must have opinions on what feels fair and not.

Joaquin Candela:

Yeah. Of course. That makes a lot of sense. I think the most important thing to realize first of all is that fairness is a bit of a social construct in a way. It depends a lot on context and it depends a lot on how a particular society has decided to govern itself. So fairness ends up being political inevitably. So let's start to ground this with a very concrete example. So here's three possible interpretations of fairness that resonate both with moral philosophy interpretations. But also with legal interpretations and finally with mathematical interpretations. Because the computer science community is also building metrics of algorithmic bias. All right, so here's the three. The first one could be minimum quality of service. This is also known as minimum threshold and philosophy. And the idea there is that you want an AI for example, to work well enough for everybody. And well enough might mean, if you have a computer vision system that detects people.

Or detects faces to be able to put masks on them or whatever. That it works well enough across things like skin tones and skin reflectance, and age, and gender, and other characteristics. That will be sort of a concrete example. It doesn't matter if it works a lot better for a group than another. As long as it works above a certain precision recall for everybody. The second interpretation would be equality. So if we go back to the India misinformation example. One question there could be, if I have some measure of accuracy for my... I think we were talking about the civic classifier that basically identifies among all of posts about cats and dogs on Facebook. What are the ones that are actually discussing political issues? Maybe the political agenda of a particular politician or party. And again to recap, we want to find those because we have limited resources in terms of human reviewers.

To look at content and check if they violate our policies or the law. So you want the AI to basically prioritize those queues essentially. This is something, Lukas, that I know you understand very well. Because you've worked on sort of human computation a lot. So back to equality in India. Obviously languages and regions have a big social significance. Because they align with religion, they're aligned with caste. And they're aligned with other important sort of social dimensions. So imagine that your civic classifier works well enough for everybody. For all languages and regions. But imagine that it's under predicting a little bit for some language. And over predicting a little bit for another language. So what would happen is that we would be allocating more human resources for the language where it over predicts. And a bit too few for the one where it under predicts.

So there, we actually want to have a higher standard in a way of fairness. We're going to say, "Look, minimum quality of services is maybe not good enough. We want to make sure that we're offering equal protection against misinformation to everybody as much as possible." And then the third interpretation of fairness, which is sort of widely accepted. Would be to go from equality to equity. And so when we think about equity, we no longer think about equal treatment. We think about, "Is there any group that deserves special consideration?" So we're living this in the US right now. Obviously with a big awareness and awakening around racial justice. Where we're obviously paying special attention to the black community in the US. And the reason we're doing that is because of historical structural disadvantages.

So if you took this to India, there might be a legitimate question. Some people might ask, "Hey, actually maybe historically there's been some groups, some regions, some languages in India that have suffered more from manipulation or injustice." Therefore we actually are going to allocate extra resources to make sure that, that group is really protected. Because given the same amount of disinformation or misinformation, the harm to that group will be bigger relatively speaking. And so these are questions that an AI engineer like me should be asking, but not answering. It's really important to basically escalate those questions to the local team, to policy experts. Find ways to involve external people to give an opinion.

So that's what I mean with a menu of options. Each of those translates in math and in code to a different choice. But that choice I should not make neither deliberately nor accidentally. By just picking something that looks reasonable mathematically if I don't understand what the implications are.
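
To make "translates in math and in code to a different choice" concrete, here is a sketch of two of the checks from this menu, computed per group: a minimum-quality-of-service floor and an equality gap. The metric, threshold, and group labels are arbitrary illustrative choices, not a recommendation.

```python
# Two checks from the fairness "menu", sketched per group:
# (1) minimum quality of service: every group clears a recall floor;
# (2) equality: the gap between best- and worst-served group stays small.
# Metric, threshold, and group labels are arbitrary illustrative choices.
import numpy as np
from sklearn.metrics import recall_score

def per_group_recall(y_true, y_pred, groups):
    return {g: recall_score(y_true[groups == g], y_pred[groups == g])
            for g in np.unique(groups)}

rng = np.random.default_rng(1)
groups = rng.choice(["group_a", "group_b", "group_c"], size=5_000)
y_true = rng.integers(0, 2, size=5_000)
# Simulate a classifier that misses more positives for group_c.
hit_rate = np.where(groups == "group_c", 0.6, 0.9)
y_pred = y_true * (rng.uniform(size=5_000) < hit_rate)

recalls = per_group_recall(y_true, y_pred, groups)
print("per-group recall:", {g: round(r, 2) for g, r in recalls.items()})
print("minimum quality of service (recall >= 0.7):",
      all(r >= 0.7 for r in recalls.values()))
print("equality gap (max - min recall):",
      round(max(recalls.values()) - min(recalls.values()), 2))
```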

Lukas Biewald:

Do you find that it's easy to articulate those to a nontechnical audience? I feel like you're framing it in a very technical way. Is it clear to people what they're choosing?

Joaquin Candela:

Help me understand your question. So who would be the audience more concretely?

Lukas Biewald:

Well, I guess, I'm imagining you're saying, "Do we want to kind of treat all regions equally?" In the India example or something else. And then that something else might be we prefer to over predict some regions. And I'm trying to picture... I guess that part makes sense, but it seems like actually, there's sort of this other question. If we wanted to sort of do something that I think is kind of affirmative action. In college admissions I'm picturing. So if you want to do that, actually then you have to get someone to tell you kind of exactly the tuning that they want, right?

And I'm not sure I could even come up with what's exactly the fair amount of... The fair distribution to apply. I'm not even sure how I would answer that. Or I'm not even sure how I would ask someone that question in a way that I would get a useful answer out of them. We certainly don't walk around in our heads with exactly a particular distribution that feels the most fair.

Joaquin Candela:

100%. I think that's exactly the reason why equity is the hardest of these three lenses on fairness. So I think in practice you'll find that most teams, most product teams, most AI engineers will be either asking questions of minimum quality of service. And if you want we can talk about how to operationalize that. It's surprisingly easy actually. Or questions of equality. Of equal treatment which is conceptually easy but a bit harder to implement. When it comes to questions of equity, these are not really questions that are directly addressed to AI engineers. These are really questions that the overarching leader of a product needs to be a reasoning about equity. I'll give you a concrete example. Adam Mosseri who leads the Instagram team has started to make public posts that you can Google about Instagram and equity.

And basically what he's starting to do, he's initiating a dialogue. Where he's saying, "Hey, we will put the interests of communities above the interests of Instagram." If we feel that a certain product causes unintended harm to a community. Or that it doesn't serve it as well as we intended, then we will actually stop and rethink it. What does that translate exactly? If I'm running ads and I feel like, "Oh, ads isn't working for everybody." Does that mean I shut it down? Do I have a percent? Do I say, "Oh, I cut my losses at minus 10%?" We don't need to stay within... We don't need to stay within the Facebook, Inc sphere. I have close friends at Spotify and at Netflix. The same questions occur there as well. And like, "Hey do we inject some diversity of content?"

Do we allow some producers, some musicians, some filmmakers that are maybe a little bit in the shade? To kind of pop up. And then what's the hit that we're taking in terms of our engagement metrics. In terms of how many songs people listen per day? Or how many movies or shows people watch per day? And stuff like that. I don't think there's an exact science on that at all. But it's a very real question that many people are sort of reasoning about. And the last thing I'll say is that one of the big challenges is a question of governance. And I think you were alluding to that. It's a question of who decides. And if you think about it, we have democratic processes. I live in Mountain View. The city of Mountain View decides where we put bicycle lanes. And of course they're going to slow down traffic, but they're going to create all their benefits.

They're going to decide on urban density. On things that are all trade offs. There's obviously... In luxury resorts and stuff like that like in Truckee. I know because we recently bought a house there. And the city council will demand that you reserve a certain part of land and building. For sort of less expensive dwellings to sort of give access to housing to everyone. And in those cases, it's a bit easier. Because there's a democratic process by which that city council gets elected. There's public consultations. If I think about one of the challenges that we're facing as technology companies is this idea of how do we bring in public deliberation? And consultation mechanisms into decisions we make.

At Facebook of course we're launching this external oversight board for content moderation. Which is almost ready to go. We have all the members identified. And I think this is only the beginning. I think that we're going to see a lot more of this. It was a very long answer.

Lukas Biewald:

It was a great answer. And I got to ask though. You talked about operationalizing and I do think that... I was kind of thinking as I was asking the last question about kind of getting the details right. I think a lot of the mistakes that you see around algorithmic fairness, they're so glaring. Probably the more important thing for most of the people that are listening to this interview is how do you operationalize the basic stuff? How do you make sure your thing isn't egregiously unfair? Maybe your first definition.

Joaquin Candela:

Yeah. And thank you for asking me that question. Because obviously we should talk about equity as a society. But as engineers, there's a lot of stuff we can do to make sure that our crap is built right. And so obviously I have friends at Twitter. And I know that there was a news cycle on the AI that automatically crops images. And the challenge there is that some people externally actually tested the system. And realized, "Oh, if I put a picture of a white person and a Black person, and they're a little bit apart and I create some blank in between, and it's like a rectangular shaped picture, then obviously there'll be some cropping to kind of render something. And then disproportionately it seemed to pick the picture of the white person."

So how did that happen? How does something like that happen? Well, Twitter says they tested for racial bias. Which is great before they launched this AI as they should. And I commend them for that. But the devil as you say is in the details. Do I have the right test sets? Did I cover skin tones correctly? And there's a bunch of work by people like Joy Buolamwini and Timnit Gebru and many other coworkers. Which is brilliant right, on having reference papers that kind of propose methodologies for doing these things. What we're doing and my advice in general to the community is to invest heavily in transparency. We have started to do this with media manipulation and deep fakes. Where we've published... We've built and created datasets. And this is only the beginning.

I think that for many of these biases, whether it's in computer vision algorithms, in speech recognition algorithms, in ranking algorithms, in whatever you want, having methodologies that you can publish and talk about, with applicable datasets that you can share, I think is going to be the way to go.

Lukas Biewald:

I see. So investing in being transparent about how you're doing things. And is the benefit there that you can get external people to comment on what you're doing? And kind of catch unexpected errors?

Joaquin Candela:

I think there's two benefits. One is accountability. Transparency implies accountability in a way. If I declare to the world that I have methods and processes for testing my stuff, then it'd better be that I have them. And it better be that they don't suck. And then on the latter part on whether they suck or not, I guess it's just like open sourcing or publishing. It's the same philosophy. If I open source my code, then often I actually get quite a bit of constructive feedback. And sort of the community helps improve it.

To be clear I think as a community at large, I mean there's been a couple of really good papers. Google and Microsoft have done excellent work in proposing things like data sheets for data sets. And fact sheets for models and things like that, that sort of invite for transparency. But I don't think that we have... We're not at a stage where there's standards that are sort of widely adopted. But my prediction is that, that's where the field is going.

Lukas Biewald:

And so, I mean, how would this work if... I'm just channeling my audience which is a lot of startups and companies like... Not Facebook or Twitter. How would you go about being transparent if you're a smaller company? What would you do?

Joaquin Candela:

Yeah. Good question. Where possible use public data sets, which do exist. I'm thinking again about, if you look up the work of Joy Buolamwini, and Timnit Gebru, and many other coauthors, there are reference data sets that have been proposed for things like face detection, and gender recognition, and so on. That's just one example. There's a lot more out there. Make sure to use them as reference data sets. And maybe even report how your algorithms perform on those, if you're building a human centric AI application. That would be one. If you're a startup that's using Azure or Google Cloud, if you're using any of the big three cloud and ML providers... All of them are actually offering libraries to detect biases in the data and in the predictions of the model.

Definitely use those. And report on what you found. And then the second thing of course is put pressure on these companies. If they're providing you some pre-learned embeddings, whether it's for computer vision or for NLP or whatever, ask for transparency. Ask, are there any funky things going on in these sort of text embeddings? Because there's a bunch of papers out there that show that if you train word or sentence embeddings off a big existing corpus like Wikipedia or wherever you want, inevitably you're going to see well-documented things. Like gender-related nouns tend to sort of fall closer in an embedding space to stereotypically male jobs. And it's the same for females. And that could have nasty consequences sort of downstream. So at least you should be aware of those.

Lukas Biewald:

Well, I guess that's a good segue into something I wanted to make sure I talked with you about. Which is diversity, inclusion and AI. And I kind of bring this topic up with a little bit of... I don't know. Awareness of the fact that we're both middle aged white guys talking to each other about diversity and inclusion. But you have thought a lot about it. So I'd love to hear what you think can be done, should be done, is being done.

Joaquin Candela:

Yeah. I'd like to share a... First of all, I'd like to share a story. And I don't know if I'm repeating myself, if we talked about that last time. But when Yo-Yo Ma came to NeurIPS in 2018, he came to a workshop where the theme of the workshop was AI for social good. And one of the questions that was asked to Yo-Yo Ma was, "How do you build trustworthy AI?" And it's interesting. He said the most important thing for him was to understand who was behind the AI. Who are the humans behind the AI? What are their intentions? What are their fears? What is their background? What perspective do they bring to the table? What are their values at the end of the day? What are the values of the human building the AI?

And this actually echoes a theme that's very common. When talking about building AI responsibly, one of the main questions we tend to get is, "Well, how diverse and how inclusive is the team behind the AI?" Diversity is a tricky concept. Diversity really means heterogeneity. It means having a team that has people with different characteristics, both innate and chosen. Obviously I was born in Spain. I'm a white dude. That means certain things. But then I have my political, religious, and maybe sexual and other choices that I made myself. And that sort of defines who I am. And kind of reflects both my values and my circumstances.

So bringing that sort of diversity into teams is very important. Because teams that have that sort of diversity tend to make better decisions. Because they tend to look at the problem from multiple angles. And they tend to ask more and harder questions. Inclusion means meaningful representation. It's like if you're at a cocktail party: it's one thing for everybody to get a cocktail. That's one kind of inclusion, we're giving cocktails to everybody. It's a different thing to really sit down and understand what you like drinking. And so making sure that your voice is heard and that your experience actually counts. You're not just a token person ticking a box on the diversity list, but you're really included.

And that means creating a space where everyone's voice is heard. Where people can bring their authentic selves to work, and really express themselves and not be shut down. So a couple of questions. Question number one is, well, how do you do that? How do you build a team like that? And question number two, why do you do it? Do you build a team like that because it's a necessary condition in order to build responsible AI? So I'm going to start with the second question because it's tricky. And this is something that we've debated a lot. One danger, one risk in coupling together the goal of responsible AI and the goal of diversity and inclusion, is that the burden of building responsible AI can inadvertently fall more heavily on the shoulders of underrepresented people.

And this is something that we've seen in many forums. Not just at Facebook. Whenever I'm in a forum that's trying to tackle issues of responsible AI, disproportionately I find myself in the company of women or underrepresented minorities. And it's interesting, because I guess there's kind of a sense of duty, if you are a member of an underrepresented group, in making sure that the AI is built responsibly. So the danger, a little bit, with ascribing the goal of diversity and inclusion as a means to achieving responsible AI is that you might inadvertently put that burden only on a subgroup. So it's very important to actually keep these two goals separate. And to say, "No, building responsible AI is a duty of everybody, period."

Now, building diverse and inclusive teams will certainly help with responsible AI goals. But it will also just create teams that make better decisions overall. So let's just focus on that part. So how do you do this?

Lukas Biewald:

All right. I have one more question on that first point. I feel like few people would argue that diversity and inclusion is bad. But it does seem like it comes up particularly frequently in AI, even within CS practices. I feel like you don't hear a lot about diversity and inclusion in, say, databases. But we touch databases maybe more than we touch AI systems. Why do you think that it's such an important topic for AI in particular?

Joaquin Candela:

Yeah. Thank you. Very good point. I mean, I guess because AI is increasingly replacing humans in making consequential decisions. Algorithms are used in criminal justice to assist judges with assessing risk, and deciding whether someone can go out on bail or whether they have to wait for trial in jail. In the education system, or in employment to automatically pre-select resumes and things like that, with some promise, because humans are actually very biased at selecting resumes. In medicine for automatic diagnosis, et cetera. And so the opportunity to reflect and amplify biases that exist in society in these automated decisions is real, and it's very high. And so the stakes are very high. They're higher... A database is not going to make an automatic decision. It's going to give you some data. AI is actually going to make a prediction or is going to make a decision.

Lukas Biewald:

Mm-hmm (affirmative). That makes sense. So how do you do it?

Joaquin Candela:

So typically, any diversity and inclusion strategy tends to include two big buckets. One is hiring. How do I hire more diverse people? And the other one is retention and growth. If I have an employee base that exists already, how do I make sure not only that I don't lose the underrepresented people, but that everyone has the same opportunities to grow and develop in their career? So that's the first thing. You've got to have a strategy that focuses on finding people, growing people, keeping people. But then how do you do this? Right? You could be tempted... We're all engineers and we love numbers and metrics. And I could say, "Wow! This is easy." Let's just figure out...

Let's just map every individual to some demographics. Let's just figure out whether they're underrepresented people or not. Maybe with gender and other dimensions separately, so that we don't miss anything... Because there are intersectionality issues and things like that. So you want to make sure that... This just assumes that you have the right metrics. And then you can say, "So for every team, I have a target number that I want to hit," and so on. That's very tempting. But it's also fraught with tons and tons of unintended consequences. Because what might happen is that people managers might end up making decisions where they're actually taking into consideration someone's group membership. And that could be problematic for a number of reasons.

Reason number one: you never want anybody to be in a position where they feel like, if they got promoted, they got promoted because they were Black or because they were a woman. And you don't want others on the team to point fingers at them and say, "Oh sure, you got promoted for that reason." Or, "You got hired for that reason." It's extremely important that whether someone is hired or not, or whether they get promoted, is purely based on merit. So then what do you do? What you do is you ensure that everyone has consistent, equal opportunities and the right amount of support to succeed. You make sure that no one is left behind. When it comes to hiring, to be very specific, what you can do is use the so-called Rooney Rule. I don't know if you've heard of that rule.

It comes from the NFL. So the goal of that rule was to increase the racial diversity of football coaches in the US. And there was a coach called... I can't remember right now if it was a coach or a team owner. Anyway, someone in a position of power called Rooney, who basically came up with this idea. Which was to say, "Hey, what we're going to do is we're going to make sure that we have a diverse slate of people that we consider whenever there's a job opening." Because one of the dangers... Actually, one of the biggest risks to diversity on the hiring side is that people tend to hire their friends. And obviously your friends tend to be like you.

And so what a diverse slate approach, or DSA, aims to do is to say, "Look, if you have a position open, you can't just go out and do an opportunistic hire," where you just go and grab your friend or your pal that you went to college with or whatever. You actually have to write an inclusive job description, which means that you really focus on the things that are really relevant to the job and nothing else. And then second, you go out and you make sure that the slate of candidates you consider is diverse. Now, once you have that slate, the bar is the same for everybody. And that worked in the NFL. I mean, there are lots of papers written about this. But by all accounts, it helps.

So that's one thing you can do. Other things you can do: make sure that you have outreach programs to create awareness about the types of opportunities that exist. Because again, people tend to reach out or source from familiar circles. You've got to break that barrier. And then inside the organization, on growing and keeping people, this starts with very simple things like: do you have any kind of training inside the company? And again, if you're a small startup, there are a lot of resources. If you google diversity and inclusion resources, there are a lot of awesome talks and resources out there to just raise awareness. These are small things... There's this training that we have at Facebook called Be The Ally. And Be The Ally really is about just keeping an eye out for people who, maybe in a meeting or at work, tend to be more silent, or aren't noticed, and maybe their voice isn't heard.

Or maybe you even witness a microaggression or a macroaggression. I mean, if it's a macroaggression, then everybody notices, in theory. But keeping an eye out for these things and then checking in with people. Taking it on yourself to go in and say, "Hey, I noticed this thing. It didn't feel like it was okay to me. How did you feel?" These small things make a huge difference in creating an inclusive culture. On career growth, every manager should have a personalized career growth plan for everyone on their team. Everyone is different, has different interests, has different strengths, has different areas that they want to develop. Making sure that you consistently personalize the career growth plan for each of your employees actually makes a very big difference as well.

Because sometimes, if you templatize it and your requirements are the same for everyone, they might not be adequate for certain people. I could say a lot more, but that kind of gives you a couple of ideas.

Lukas Biewald:

I'm curious. Do you have any suggestions for this? So these recordings that we do, we know... We can look at the gender distribution of the people watching. And it actually skews over 90% male. And it's funny because our user base is actually not that lopsided. So it feels like maybe there's something we're doing that's not connecting, at least with women that might be watching it. I guess, how would you approach that?

Joaquin Candela:

Well my initial reaction was please ask the women who are not watching.

Lukas Biewald:

Fair enough.

Joaquin Candela:

But a couple of thoughts come to mind. I mean, data actually is extremely important. Doing some user research and understanding: why is this not resonating? Why is this not useful to you? A couple of other thoughts. Obviously, one of them is on the topics that are covered, which is actually tied to what I just said: understand whether those are not useful. And then the other one, which I don't know if you already do, is to make sure to have at least as many female speakers as you have male speakers. And again, that's an interesting thing. Because you might say, "Well, maybe the potential audience is actually a majority male and a minority female." But that is an interesting point of intervention, where you could say, "Well, that's fine, but we're still going to aim for 50/50."

Lukas Biewald:

Makes sense. We always end with two questions. And I'd love to do that with you too and kind of get your thoughts. These are a little bit open-ended, and I haven't prepped you for them, but I'm curious what you say. So one question we always end with is: what's an underrated aspect of machine learning that you think people should pay more attention to? And maybe for you, I'd tailor it to ethics and responsible AI. What's something that you feel maybe doesn't get the attention it deserves? It might be that the whole topic doesn't. But within that, is there something you'd especially like to call out?

Joaquin Candela:

Yeah. Just beware of averages. I hate averages and aggregated metrics. It's as simple as that. If you want to develop AI more responsibly, try to disaggregate the metrics. You just talked about gender. If you have access to gender... Or if you have ways in which you can get the right sort of explicit, informed consent, where your users give their gender and you're transparent about what you're going to use it for. If you have location... Because obviously, even if you only take the US, the South is very different from the coasts, which are very different from the center, and so on. Disaggregate your metrics. And the test that I like to give is this: presumably, if you're launching a refresh of your AI, whatever it might be, a new model or whatever, presumably you have launch criteria.

You have criteria to launch or not launch. And most of the time, what I observe is that it's just an overall metric that aggregates over all users. Well, don't do that. Instead, disaggregate and look across maybe gender, age, and location. Just simple things. And then ask yourself: if you find a significant bucket where your AI performs the worst, would you still launch? And show care. That would be a very practical thing to do.
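As one way to picture that launch test, here is a minimal sketch, my example rather than anything Facebook-specific, of disaggregating an evaluation metric before deciding to ship. It assumes pandas and a toy DataFrame of per-example predictions joined with consented demographic attributes; the column names and the ten-point threshold are invented placeholders.

```python
# A minimal sketch of a disaggregated launch check (toy data, placeholder policy).
import pandas as pd

# One row per evaluation example: model prediction, true label, and consented attributes.
eval_df = pd.DataFrame({
    "label":  [1, 0, 1, 1, 0, 1, 0, 1],
    "pred":   [1, 0, 1, 0, 0, 0, 0, 1],
    "gender": ["f", "f", "m", "f", "m", "f", "m", "m"],
    "region": ["south", "coast", "coast", "south", "center", "south", "coast", "center"],
})
eval_df["correct"] = eval_df["pred"] == eval_df["label"]

overall = eval_df["correct"].mean()
print(f"overall accuracy: {overall:.2f}")  # the single aggregate number most launch reviews stop at

# The same metric, disaggregated by each attribute you have consent to use.
for col in ["gender", "region"]:
    per_group = eval_df.groupby(col)["correct"].agg(["mean", "size"])
    print(f"\naccuracy by {col}:\n{per_group}")

# A simple launch gate: hold the launch if any sizable bucket trails the average badly.
max_gap = 0.10  # illustrative threshold, not a recommendation
worst = eval_df.groupby("gender")["correct"].mean().min()
if overall - worst > max_gap:
    print("\nworst gender bucket trails the overall metric by >10 points: would you still launch?")
```

The point is not the specific threshold but that the ship/no-ship question gets asked per bucket, not only on the average.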

Lukas Biewald:

I would say it's... My reaction is that's a great suggestion for just competently launching an AI also. So there's no trade off here. I think definitely looking at distributions and things versus averages is a really good idea. Somehow I feel like you get less sort of Gaussian distributions in AI than you do outside of it.

Joaquin Candela:

I mean, I still see rolled-up averages everywhere I look. So I feel like we don't have a discipline of doing it.

Lukas Biewald:

Okay. And now our last question. And maybe this actually goes back to all the work that you've done in your career. But when you look at the productization of machine learning, like taking an ML model from conception to actually being in production doing something useful, where do you see the biggest bottlenecks? Or where do you see the unexpected problems that someone outside the space might not realize come up in that process?

Joaquin Candela:

Yeah. A lot of things come to mind. But I think the biggest one is that your training data is almost never representative of your live data. And that's just life. And therefore... I mean, obviously people working on self-driving cars know this very well. You might train all of your perception, behavior prediction, planning, and whatever on some data, and then you're going to have a situation on the road that you had never anticipated. So there's that. And this would be the problem of the black swans, or the unlikely events and so on. But even outside of that... Then you have the very gross mistakes that are super easy to fix.

Again, if your application of AI is human-centric: did you really have in your training data people of all ages, genders, and skin tones, or not? In a surprising number of cases you didn't. And then you deploy your thing, and now there comes an elderly Black lady, and the thing is just not working for her. Or in ASR, automatic speech recognition, there are many different accents of English. Did you have proper representation in your data or not? And most of the time you didn't.
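A crude version of that representation check, again my own sketch with invented attribute values and reference shares rather than anything from the conversation, is simply to compare how often each group appears in your training data against the population you expect the product to serve:

```python
# A minimal sketch of checking demographic coverage in a training set
# (hypothetical attribute values and reference shares, purely for illustration).
from collections import Counter

# Hypothetical per-example metadata; in practice this would come from
# consented annotations or a labeled audit sample of your training data.
train_skin_tones = ["light"] * 7000 + ["medium"] * 2500 + ["dark"] * 500

# Rough shares you expect among the people the product will actually serve.
expected = {"light": 0.45, "medium": 0.35, "dark": 0.20}

counts = Counter(train_skin_tones)
total = sum(counts.values())

print(f"{'group':<8}{'train share':>12}{'expected':>10}")
for group, target in expected.items():
    share = counts.get(group, 0) / total
    flag = "  <-- underrepresented" if share < 0.5 * target else ""
    print(f"{group:<8}{share:>12.1%}{target:>10.0%}{flag}")
```

It won't catch the black swans, but it does surface the gross, easy-to-fix gaps before an underserved user has to.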

Lukas Biewald:

Hard to do.

Joaquin Candela:

It's hard to do.

Lukas Biewald:

Awesome. Well, thank you so much. It's really great to talk. And thanks for taking the extra time to do two parts here.

Joaquin Candela:

My pleasure and I can't wait to feel embarrassed by watching the recording.
