The unreasonable effectiveness of synthetic data with Daeil Kim
Supercharging computer vision model performance by generating years of training data in minutes
BIO

Daeil Kim is the co-founder and CEO of AI.Reverie, a startup that specializes in creating high quality synthetic training data for computer vision algorithms. Before that, he was a senior data scientist at the New York Times. And before that he got his PhD in computer science from Brown University, focusing on machine learning and Bayesian statistics. He's going to talk about tools that will advance machine learning progress, and he's going to talk about synthetic data.

https://twitter.com/daeil

TRANSCRIPT

Topics covered:

0:00 Diversifying content

0:23 Intro+bio

1:00 From liberal arts to synthetic data

8:48 What is synthetic data?

11:24 Real world examples of synthetic data

16:16 Understanding performance gains using synthetic data

21:32 The future of synthetic data and AI.Reverie

23:21 The composition of people at AI.Reverie and ML

28:28 The evolution of ML tools and systems that Daeil uses

33:16 Most underrated aspect of ML and common misconceptions

34:42 Biggest challenge in making synthetic data work in the real world

Daeil:

The hard part is diversifying the content. So if we just have the same character in an environment doing everything, it's not going to work, right? So how do you actually create hundreds or thousands of variations of that character model with different behavior and things like that? That's been really the core focus of how we're thinking about our technology.

Lukas:

You're listening to Gradient Dissent, a show where we learn about making machine learning models work in the real world. I'm your host, Lukas Biewald.

Lukas:

Daeil Kim is the co-founder and CEO of AI.Reverie, a startup that specializes in creating high quality synthetic training data for computer vision algorithms. Before that he was a senior data scientist at the New York Times. And before that he got his PhD in computer science from Brown University, focusing on machine learning and Bayesian statistics. He's going to talk about tools that will advance machine learning progress, and he's going to talk about synthetic data. I'm super excited for this.

I was looking at your LinkedIn and you have a little bit of an unusual path, right? You did a liberal arts undergrad.

Daeil:

Yeah, that's right.

Lukas:

Can you say a little bit about... I feel like I come across people quite a lot that want to make career transitions into machine learning and related fields. What was that for you? What prompted you to do it?

Daeil:

That's a great question. Wow. Searching back. I studied literature in college, so I actually did not have a lot of computer science background, and I've taken a lot of twists and turns in my life. Sarah Lawrence College is a pretty unique educational system. It's really small class sizes, Socratic system, liberal arts, humanities. I think from that, I garnered just the curiosity about the world. And then afterwards I did a lot of research in schizophrenia, so I studied mental illness. I was taking people inside MRI scanners and then analyzing their brain data. I spent about four years doing that after college.

Again, it was a transition: after college, no skills, working at a wine shop, and then over time volunteering at a lab and getting into a position where I started actually publishing papers and really getting into computational neuroscience. And then I wanted to be a doctor at some point, but then decided at the last minute to study machine learning because I was actually really interested in understanding the underlying fundamental aspects of intelligence. What does that mean? How can you actually model things like that?

So instead of going to medical school, I decided to just do a PhD in computer science. After that, I wanted to try journalism, to see if I could apply and build tools to help it. So I worked at The New York Times for a few years. And then finally, I was like, "Okay, I really want to do this stuff, synthetic data." So it's a lot of twists and turns, I have to say. I would have never been able to tell you this is where I would have ended up 10 years ago. So it's been quite a journey.

Lukas:

That's so cool. That's an impressive skill to be able to completely switch fields like that. I think I'd be too afraid maybe to make a leap.

Daeil:

Yeah. I think it's not easy. Let me just be clear. It's not easy learning the math, for example, with machine learning.

Lukas:

Yeah. When did you learn the math? Because I feel like that's a place where a lot of people feel nervous. Did you take math as an undergrad? Or how did that happen?

Daeil:

Not really. I didn't actually take a single math course in undergrad. So I had to learn. Actually, my PhD was in Bayesian nonparametrics, which really gets into pretty complicated math, variational calculus and things like that. Basically, I suffered [laughs] and I spent a lot of hours just learning. I took some classes when I could during the schizophrenia research, that part of my life. I had to learn some level of statistics and probability there to be able to analyze that data.

But then once you get into the machine learning stuff, and especially in that area I was in, you really needed to up your game. And then that's where I spent a lot of time trying to play catch up. You know, I learned a lot and it was an unbelievably fruitful experience I would say. So it's very rewarding.

Lukas:

Do you have any tips for people trying to learn math outside of an undergrad curriculum?

Daeil:

I think actually one of the best ways for me was actually appreciating the beauty of math. A lot of people are scared of math and thinking, "Oh my God, I have to learn these rules, and first, second derivatives. I have to memorize these things." But once you get into more of the theoretical stuff and you start thinking about... I'm not sure if you've heard of these insane Millennium Prize problems, P equals NP, or prime number stuff like the Riemann hypothesis.

There's so much beauty there and you can actually read about it and understand how challenging some of these problems have been and how profound they are. Being able to appreciate it from an aesthetic level, I think helped me give the patience I need to learn it a little bit more. You need to be patient. Your brain is not going to just pick this stuff up, if you've never been exposed to it, unless you're a lot smarter than I am, which might be the case. Yeah.

Lukas:

Sounds unlikely. I don't know. For a long time, I've had a real interest in synthetic data, which is what your company does. But how did you get interested in synthetic data while working in journalism?

Daeil:

So my adviser at Brown was actually a computer vision person, so I got exposed to a lot of the problems there. You go to these conferences and it's always the same data sets being used. One thing that I actually wanted to do is one day build my own video game. I wanted to be able to actually create worlds. I wanted to see if you can integrate machine learning. That was an early interest of mine as I was learning this stuff. I've always believed simulation was such a powerful tool for a lot of things. I had a really great experience at the New York Times, learning all sorts of things with an amazing community of people. And then from there I really wanted to do this thing I'd been dreaming of doing, and I knew that there was such a huge issue.

The way I sometimes look at how science advances is actually through tools. I mean, you're building a great one with WandB. If you think about the microscope, for example, right? There are entire fields that opened up because of it. So what I'm hoping to do is figure out a way to create a sort of simulation platform that can one day be used by a lot of people, and at some point introduce people to ideas about how you can train computer vision algorithms without the standard process of collecting data in the real world, and where simulation can actually play a really useful role. I think that really excited me, and I actually think that there could be a lot of really important advancements in the acceleration of computer vision with the adoption of synthetic data.

Lukas:

I see. So you'd actually been thinking about synthetic data for quite a long time. I should say, I don't know if you know about my previous company that we were talking about, CrowdFlower, which became Figure Eight, where we did data collection. I think it's funny. I think it was a similar experience to yours of actually looking at conference papers and realizing they're almost always built around the same data sets, based on the data sets that were available, which feels totally backwards. Right? You know?

Daeil:

Yeah, yeah. Yeah.

Lukas:

I think especially as a grad student researcher just starting out, you're the one that ends up spending a lot of time with the data sets. So you realize how massive and idiosyncratic they are.

Daeil:

Absolutely. I would just also add that a lot of my work during my PhD was in Bayesian models. There you have this notion of a prior belief, and you then estimate your posterior from that. But in deep learning, it's not that easy to establish a prior in a way that you can really control. I actually think that with synthetic data, at least for computer vision, the data itself can act as a really interesting prior. So there are connections there that I took from my own work, of wanting to think about how to incorporate that. Simulation is one way of using data to generate that prior.

Lukas:

Well, we always want to make this show for people that work in machine learning, but aren't necessarily domain experts in every [area].

Daeil:

Sure, sure.

Lukas:

Maybe you could explain what synthetic data is and-

Daeil:

Yeah. Absolutely.

Lukas:

... your take in how your system works today and then how you imagine it working in the future.

Daeil:

Yeah. So the way we're talking about synthetic data, it's basically data that is generated from, let's say, a game engine, something that doesn't come from the real world. It's artificially generated. Of course, people talk about synthetic data in NLP as well, generating fake text that's relatively useful there. But for our purposes in our startup, we're primarily focused on computer vision. And so what we try to do is create these very photorealistic virtual worlds, and we extract images from them.

And then the nice thing about doing that in a simulated world is that you can encode some of the things you need for supervised learning in computer vision, like the annotations and all that stuff, directly. So you can bypass that part of it and streamline that process. That's what we're focused on. We've been at it for close to four years now, and essentially we've been trying to see where synthetic data works really well and how to push the boundaries there.
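To make concrete what "encoding the annotations directly" means, here is a toy sketch, not AI.Reverie's actual pipeline: because a simulator knows which object owns every pixel it renders, a pixel-perfect segmentation mask falls out alongside the image, with no human labeler involved.

```python
# Toy illustration of "free" labels from simulation. The "renderer" here
# is just a brightness grid; the point is that the mask is produced in
# the same pass as the image. (Illustrative only, not a real engine.)

def render_scene(width, height, objects):
    """objects: list of (object_id, x0, y0, x1, y1) boxes, id > 0.
    Returns (image, mask): image is a stand-in brightness grid, mask
    holds the ground-truth object id at each pixel (0 = background)."""
    image = [[0.0] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]
    for obj_id, x0, y0, x1, y1 in objects:
        for y in range(y0, y1):
            for x in range(x0, x1):
                image[y][x] = 0.5 + 0.1 * obj_id  # stand-in for shading
                mask[y][x] = obj_id               # annotation, for free
    return image, mask

image, mask = render_scene(8, 8, [(1, 1, 1, 4, 4), (2, 5, 5, 8, 8)])
```

In a real engine the same idea yields segmentation masks, bounding boxes, and depth maps directly from the scene graph, which is the part of the supervised-learning pipeline this bypasses.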

Lukas:

Where does it work well? How real is it?

Daeil:

Yeah. Great question. I like to think of what is a narrow problem and what is not so narrow, right? I say narrow AI all the time. Think things like conveyor belts, right? Let's say you're processing certain types of food items, things like that. You're not going to see a random golden retriever jump on, or things like that. The diversity of that scenario is not that large. That's where I think synthetic data really shines. That's one of those places. Of course, people are using it for really complex things like self-driving cars. But if you want a heuristic: the more narrow the problem, the bigger the role synthetic data will play. But of course, on the other end, there are attempts we'll make to try to create that diversity.

So the way we think about it as a company is how do you create diversity and how do you scale that? So we incorporate a lot of proceduralism in our world. We think about how to procedurally generate meshes, geometry, 3D models, things like that, how to automatically change the terrain, all that stuff. That's really a big focus of our work, and then understanding how you can quantify that gap between synthetic and real data through benchmarking of algorithms. That's where we use a lot of WandB as well to understand that.

Lukas:

So how would it work? Say I'm trying to imagine what I might be doing where I would want to come to you. You can tell me. Say I'm trying to do factory automation, since you said conveyor belt, and I want to classify whether this machine looks like it's in a normal state or a broken state.

Daeil:

Right, right, right. So I'll give you an example that I can talk about a little bit. One problem is with this company we're working with called Blue River. They're trying to solve this problem of being able to identify weeds in a crop field. It turns out that if you were able to target the herbicide you use, you could reduce the amount of herbicides by 95%. Right? Farmers are just spreading that all over. So what we've done on our end is that we've created an environment where we actually procedurally generate weeds with different vegetation stages and all sorts of things like that, and then automatically annotate that via segmentation masks, and then train an algorithm and show that we are getting X amount of improvement.

Daeil:

Another example I can talk about is 7-Eleven. We're working with them where we're actually creating a retail store with all these items, and they're interested in things like activity understanding, grasp and pose detection, things like that, grasp intention. We have our own motion capture studio, so we actually have a lot of really cool animations that we can generate from there. We create all that simulated data, and all of it has perfect ground truth annotations, and we feed it to them so they can basically download it and then use it to train their own algorithms.

Lukas:

What's the point where... I mean, those are great examples that make total sense, but it also strikes me as kind of tricky to set up, to make it really realistic. What's the sort of scale that you need to be at for this type of approach to make sense?

Daeil:

Yeah. It depends ultimately on the data set they're benchmarking against. Actually, when we work with companies, we often ask, can you share, at least for evaluation, a real-world data set that we can benchmark against? So oftentimes the first iteration that we run, creating this environment, might get you a certain percentage, like 60% of the real-data baseline. The real-data baseline being: if you were to train the same algorithm on the real data only, what performance would you get? And then we keep iterating and improving. We have ways of finding out where the gaps are between the synthetic and real data. And then we have a whole team of procedural artists from the game industry that actually work to develop better ways of creating more diversity within those scenes.

So it's not something that happens instantaneously, but it is something that, once you build it, is there forever. You can just keep generating more and more data and iterating on that. The early part of our company was just trying to create that infrastructure and have a streamlined process of iterating on it. The way I like to think about it is as a virtuous cycle. We generate the environment, we collect data, we benchmark it, and then we iterate again and again and again, until we get to a point where we're happy with the synthetic data. But the first time, it's usually never... You know. Unless it's a very simple, narrow problem, you usually don't get up to the same performance.

And then depending on the problem, you'll look for different things in terms of what to improve. You might miss a certain type of orientation of an object, or you might have zoom levels that are off that you didn't account for.
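The benchmark Daeil keeps referring to, "percent of the real-data baseline", reduces to a simple ratio. A minimal sketch, with the function name and example mAP values chosen for illustration:

```python
def percent_of_real_baseline(candidate_map, real_only_map):
    """Score a model trained on synthetic data (or synthetic plus a
    little real data) as a percentage of the same model trained on
    real data only. Both arguments are mean-average-precision values
    in [0, 1], measured on the same held-out real test set."""
    if real_only_map <= 0:
        raise ValueError("real-only baseline mAP must be positive")
    return 100.0 * candidate_map / real_only_map

# e.g. a first synthetic-only iteration reaching 60% of the baseline:
score = percent_of_real_baseline(0.39, 0.65)
```

Each pass of the virtuous cycle (generate environment, collect data, benchmark) aims to push this number closer to 100.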

Lukas:

Do the images that you generate look extremely realistic? Is that really important to making it work well?

Daeil:

I think if I had to choose one, diversity is more important than photorealism. I'm defining photorealism the way people think about it in computer graphics, where you're modeling the light rays bouncing off of every part of an object and calculating that. That's how you get that CGI-level realism. And I do think that technology is coming: the latest version of Unreal Engine is coming out with a global illumination system.

That is just going to happen and GPUs are getting more powerful. So that level of realism is there. But the hard part is diversifying the content. So if we just have the same character in an environment doing everything, it's not going to work, right? So how do you actually create hundreds or thousands of variations of that character model with different behavior and things like that? That's been really the core focus of how we're thinking about our technology.

Lukas:

I see. This must be really hard, but if I came to you and I was like, "Hey, I want my accuracy to go up," how would you even think about that? What kinds of performance gains do you predict?

Daeil:

Well, let me answer that question in two ways. One way is, there are scenarios where the only thing that could really work is synthetic data. Let's say you have a conveyor belt of, I don't know, ceramic mugs, and you also need an annotation of how much they weigh, right? You could potentially estimate that in a synthetic environment, because you might know the materials and can calculate it, while it might be hard for a human annotator to look at a mug and say, "This is 37 grams," right? So there are scenarios where only synthetic data can give you that kind of ground truth. So there's an advantage there.

In terms of performance, I think it really depends. Off the top of my head, for the narrow cases, you're essentially looking at 0.99, 0.98 mean average precision for things like that. When you start talking about much more complex things: we released a paper called RarePlanes, where with CosmiQ Works we actually released a huge annotated satellite imagery data set of airplanes, along with a synthetic version of it.

There, synthetic data alone will give you 65 to 70% of the real-baseline performance. But then there are several things you can do on top of that, which we like to advocate for. One is transfer learning. You can actually take just 10% of the real data, and then you start getting into 95% of the performance of the real-world data, just using 10% of it. And then you also have things like domain... Sorry?

Lukas:

I just want to make sure I understand what you're saying. So you train on the synthetic data first and then you transfer to the non-synthetic data?

Daeil:

Yeah. You take just 10% of the real-world data and you fine-tune off of that. So you can either pre-train it that way... What's that?

Lukas:

It's a final step you fine tune it on?

Daeil:

Yeah.

Lukas:

I see.

Daeil:

Exactly. And then you get much better performance. Of course, that 10% comes from the real-world training set, not the test set. The way I look at why that fine-tuning got us up to 95% is that I think you're feeding the algorithm a sort of prior version of what it thinks the world should be. And then all the noise that comes from the sensors, and any unique variations, can be transferred in that fine-tuning step, taking that fuzzy vision and sharpening it with some real-world data.

Lukas:

You say 10% of the training data. Did you take the other 90% and use it in the initial model?

Daeil:

No. We would just randomly sample 10% of the real-world training data for the fine-tuning, and we'll first train off just the synthetic data. Right? So we train off the synthetic data first. That gets us to something like 60 to 70%, at least in the airplanes scenario, which is a bit complex. And then when we take just 10%, randomly sampled from the training set of real-world images, we get to the same 95% as if you were to train on a hundred percent of the real-world data.
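The two-stage recipe described here, pretrain on synthetic data only and then fine-tune on a randomly sampled tenth of the real training set, can be sketched as follows. Only the sampling logic is concrete; the training stages are shown as comments, since the actual model code is not part of the transcript.

```python
import random

def sample_finetune_subset(real_train_items, fraction=0.10, seed=0):
    """Draw the small slice of the *real training* data used for
    stage-two fine-tuning after synthetic-only pretraining. The real
    test set is never touched, as noted in the discussion above."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    k = max(1, round(len(real_train_items) * fraction))
    return rng.sample(real_train_items, k)

real_train = [f"real_{i:04d}.png" for i in range(1000)]
finetune_set = sample_finetune_subset(real_train)

# Stage 1: model = train(synthetic_images)        # synthetic only
# Stage 2: model = finetune(model, finetune_set)  # ~10% of real data
```

Under this split, 90% of the real training images are simply never used, which is the data saving Lukas summarizes next.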

Lukas:

Oh, I see. So you're saying it's using the labeled data 10 times as efficiently.

Daeil:

Yeah. You're saving about 90% of the real-world data needs. Yeah.

Lukas:

And presumably if you used all the real world data, you'd make an even better model?

Daeil:

We found that it actually tapers off a bit after 10%. Diminishing returns, is what I'm saying.

Lukas:

I see.

Daeil:

But of course, the returns are still there. Again, this is one scenario. Different scenarios have different performance, and it really depends on the data set you have at the end of the day. I think other people point this out all the time: not enough people focus on the data itself. So if your data set is really wonky, who knows what you can train off of it, and who knows if the benchmarks are even useful at that point.

Daeil:

There are a lot of things to consider, but generally, we find that fine-tuning helps. The other thing I wanted to bring up was domain adaptation, which is the set of algorithms in computer vision that try to transfer the statistics from real-world images to synthetic images. These are algorithms like image-to-image translation, where you can think style transfer: take the interesting real-world noise from the real images and try to incorporate it into the synthetic images themselves.
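The image-to-image translation networks Daeil mentions are beyond a short example, but the underlying idea, transferring real-image statistics onto synthetic images, can be illustrated with a much cruder classical trick: rank-based histogram matching. This is a stand-in for intuition, not the GAN-based methods described here.

```python
def match_histogram(synthetic_px, real_px):
    """Remap synthetic pixel intensities so their distribution matches
    a real-world reference: the k-th darkest synthetic pixel takes the
    k-th darkest real value. For simplicity this toy version requires
    equal-length inputs."""
    if len(synthetic_px) != len(real_px):
        raise ValueError("this toy version needs equal-length inputs")
    # Indices of synthetic pixels, ordered from darkest to brightest.
    order = sorted(range(len(synthetic_px)), key=synthetic_px.__getitem__)
    real_sorted = sorted(real_px)
    out = [0] * len(synthetic_px)
    for rank, idx in enumerate(order):
        out[idx] = real_sorted[rank]
    return out

adapted = match_histogram([3, 1, 2], [10, 30, 20])  # -> [30, 10, 20]
```

The output keeps the synthetic image's spatial structure (which pixel is brighter than which) while adopting the real data's intensity distribution, a one-dimensional version of transferring texture statistics while leaving shape alone.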

Lukas:

Oh, interesting. Can you really see that in the image?

Daeil:

Yeah. Yeah, actually. I mean, Nvidia has done some great work on this. Nvidia has definitely done a lot of good work in domain adaptation, and at the computer vision conferences it's been a really active area of research. It's sometimes not that distinguishable. What you find is more texture differences versus shape differences; as you can imagine, shapes are probably the more difficult thing to transfer. But it does help. It definitely does help for certain scenarios. Yeah.

Lukas:

Interesting. I mean, I think the first thing you said when we were talking is you envision this as being a tool for people to use, but it sounds like maybe today you need to involve real artists. Right? I would assume that the interface isn't really a tool that I... You know, it wouldn't be like a TensorFlow.

Daeil:

No, no, no, no, no, no.

Lukas:

What's your plan to bridge that gap?

Daeil:

Yeah. That's a great question. You can almost think of it like downloading a little video game, right? At some point, you build an awesome enough environment, and we're building out a whole UI and productizing that process. So you can imagine these virtual environments living on the cloud somewhere, and you have an API that allows you to tweak certain things like lighting, time of day, how things are spawned, all of that stuff. And then you'll be able to collect your own data that way. Right? So it's not so much that we give people the ability to create their own 3D worlds, as much as we'll create them and give people access to this huge environment that allows them to collect as much data as they want, and we'll see how far we can push that.

We're still building that out, but I'm hoping that once we start doing that and set the paradigm, other people will follow and understand the value of it when it comes to computer vision. There are some other really cool ideas too. If you can create an environment that you can modify in a lot of really interesting ways through API calls or scripts, you can then imagine a reinforcement learning algorithm that can explore and exploit a whole range of parameters to figure out how to get the best synthetic data, right? Where the reward function is tied to, let's say, your mean average precision.
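The search idea Daeil sketches, tuning scene parameters to maximize a downstream metric, can be illustrated with plain random search. The parameter names and the evaluation callback are hypothetical, not AI.Reverie's actual API; a reinforcement learning agent would replace the random sampler with a learned policy.

```python
import random

# Hypothetical scene parameters a synthetic-data API might expose.
PARAM_SPACE = {
    "time_of_day": ["dawn", "noon", "dusk", "night"],
    "lighting_intensity": [0.5, 1.0, 1.5],
    "spawn_density": [10, 50, 100],
}

def search_scene_params(evaluate, n_trials=20, seed=0):
    """Random search over scene configs. `evaluate` stands in for the
    expensive inner loop: render data with a config, train a model,
    and return its mean average precision as the reward."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {k: rng.choice(v) for k, v in PARAM_SPACE.items()}
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

Swapping the random sampler for a bandit or RL policy gives exactly the explore/exploit loop described above, with mAP as the reward signal.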

Lukas:

I'm curious, like in your company, is it mostly artists making this stuff or is it mostly machine learning people or is it graphics people? What's the composition of-

Daeil:

Yeah, it's a very interesting mix. One of the best things about working with all these folks is that they come from a wide range of backgrounds. We have procedural artists. We have your standard technical artists. The procedural artists can do some amazing things with geometry and create all sorts of geometry procedurally. We have people who understand how to create procedural textures and materials on those 3D models. So a lot of game developers. We have animation people. We have the motion capture. We have game engineers to be the glue that puts all this stuff together. And then we have a whole team of deep learning people who actually benchmark that data.

Content is generated from one side of the company, gets fed to the ML people, and then they're like, "Ah, it's great. No, it's good. Like, we need to do... You know, this is not working. This is working." And then it goes back. So there's a constant conversation between these two groups of people where we're always trying to improve the data and understand what's missing.

Lukas:

Do the ML people do any of the image generation now, like with GANs and other techniques? Have you started to do that, or is it mostly more classical procedural generation? I'm not super familiar with the field, so I don't know how-

Daeil:

Oh, yeah. We've tried some of the adversarial stuff. It's not easy to get GANs to optimize. There are a lot of issues with mode collapse and things like that. So adversarial networks, we tend to use more for domain adaptation. You can imagine that in those techniques, where you're trying to make the synthetic and real data indistinguishable, GANs can be very useful. We are still actively doing some R&D on geometry creation. There's a really cool paper out called PolyGen from DeepMind that does some cool work in that space. But yeah.

Right now, what we're really trying to focus on is creating a whole suite of procedural tools that are based off of tools like Houdini. I don't know if you've heard of Houdini before, but it's a way to create procedural geometry, and we've done a lot of good work with that. At some point, we want to move more towards a system where we can train off of our current library of 3D models, which is really large right now, and then generate new models from there. But I think there's still a little bit more R&D necessary to get that stable. That's my guess, but I don't know. Maybe somebody has an amazing algorithm out there that works all the time. So, yeah.

Lukas:

So most of the models that you're actually building are vision models, it sounds like.

Daeil:

Yeah, we're primarily focused on vision. The reason why is because, as much as we'd love to do RL-based things, vision is nice because it doesn't necessarily require the kind of physics you'd need in a game engine. We use the Unreal game engine to build our platform, and there, the physics isn't as accurate as you would need to get the right RL stuff working. So we're waiting until that becomes more mature before jumping into RL. But right now computer vision is our primary focus.

Lukas:

Interesting. You're waiting for a good physics engine to do RL. That sounds like a real opportunity.

Daeil:

Yeah, yeah, exactly. But once we can create these unbelievably rich, realistic worlds, then incorporating the physics will be the next step. And then we'll get a cute little dog running around in the field, jumping over rocks.

Lukas:

I'm curious. I've played a little bit with MuJoCo. What makes that not something that you could use for this kind of thing?

Daeil:

I don't have as much familiarity with that platform, but what we love about the Unreal game engine is that it is such a powerful suite of tools, and it is capable of a lot: huge scaled worlds, really high-performance photorealism. Especially with the new Unreal Engine coming out, you're going to be able to get near photorealism in real time. It really allows you to create rich worlds, and some of the other platforms that I've seen really aren't built for that in the same way, in my opinion.

I was looking for the most robust system to build all of this stuff. One of the cool things we do, for example, is generate huge cities. We'll take things like OpenStreetMap geospatial data, and then we'll generate a big part of Manhattan, for example. It takes us a few days to just put it through the system, and then out pops this fully virtual 3D world that you can walk around. So this is stuff that I think the Unreal Engine is quite well suited for.

Lukas:

Switching gears slightly to the ML team, because I think this is going to really resonate with people listening and watching. You've now been building models for customers for four years, I guess, or at least building proxy models and tweaking ML for enterprise and production. I wonder, how have your processes and tools changed over the years you've been doing it?

Daeil:

Yeah. I just want to caveat this by saying that we don't try to create models that are going to be used by everybody in the world, or in production. We train models for the purpose of understanding how good our synthetic data is. Unfortunately, we're not spending all our time pushing the boundaries on the next version of transformer architectures. We're not as focused on that, for example; we're more focused on trying to understand the data. I actually think it's a different way to think about optimizing your model. Of course, you can optimize it through hyperparameter searches, messing around with learning rates, all sorts of things like that. And we'll try a few things here and there in terms of the hyperparameters, but we're really focused on what the data tells us.

So we can quickly go back, and within a few days make considerable changes to the data that we have. That's almost how we think about tuning and improving our model. So we're taking a data-first approach to optimizing the performance of our vision models, and we do it for the benefit of the customer. We want to be able to show and prove to the customer that this data is valuable and useful.

Lukas:

That will resonate with a lot of the people I've talked to. I think most people in the real world tend to focus a lot on picking and choosing the data to make the models work well. How have your systems for doing that evolved? Obviously, there's Weights & Biases. That's how we got connected. But what other tools do you use?

Daeil:

Yeah. Weights & Biases is awesome. We love it. We've been using it a lot for understanding how our models are performing. But for us, we have our own data centers. We have our own co-location system, and we use something called Polyaxon to orchestrate all of the experiments. So let's say you want to run 20 or 30 experiments. We have a system like Polyaxon that orchestrates all that, but it's also tied to WandB. So we get all sorts of cool metrics to understand how the model is doing, and we can plot out a lot of stuff. We've also created our own customized dashboards to understand the difference between synthetic and real data.

There are some really cool things you can do with some of the new transformer architectures that can generate visual attention maps to understand some of the differences between the synthetic and real data. But at the end of the day, a lot of it is focused on one thing: improve the synthetic data. And once we get it to a good point, then we can start doing crazy things with it like, "Okay, this edge case never happens in the real world? We'll create it." Or the perspective suddenly changes and you need a whole new dataset because the camera angle is now different, because it's in a different place. Well, okay, we'll generate all that. We do a lot of things like that too. So it's a relationship we have.

So those are roughly the tools. We're not a huge startup where we have 50 ML people, but it's a pretty nice pipeline. Also, our data is all API-driven, so it's literally just a few lines of API code and we get the data we need, and then it's streamlined into this whole orchestration of experiments. And once we get the performance from that, we have our Weights & Biases dashboards and all these nice visualizations to understand where the differences are.
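An API-driven data request of the kind described above might amount to assembling a small spec and sending it to a generation service. The field names and parameters below are invented for illustration; they are not AI.Reverie's actual API:

```python
import json

# Hedged sketch: building a request spec for a hypothetical API-driven
# synthetic data service. Every field name here is an illustrative
# assumption, not a real endpoint schema.

def build_generation_request(scene, num_images, camera_angle_deg, variations):
    """Assemble a JSON payload describing the synthetic dataset to generate."""
    spec = {
        "scene": scene,
        "num_images": num_images,
        "camera": {"angle_deg": camera_angle_deg},
        "variations": variations,  # e.g. lighting, character models, behavior
    }
    return json.dumps(spec)

# Example: requesting a new dataset after a camera placement change,
# the scenario Daeil mentions above.
payload = build_generation_request(
    scene="warehouse",
    num_images=10_000,
    camera_angle_deg=45,
    variations=["lighting", "character_model", "time_of_day"],
)
```

The payload would then be POSTed to the service and the resulting dataset fed straight into the experiment orchestration, which is the "few lines of API code" workflow described above.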

Lukas:

Do you use TensorFlow or PyTorch or something else, or all of them?

Daeil:

We're PyTorch fans. Yeah, yeah. I mean, TensorFlow is great for production stuff, but PyTorch is just so nice in terms of debugging. It's also kind of a culture thing. You start off with PyTorch and then making the switch to TensorFlow is a little bit hard. But yeah, most of our stuff is in PyTorch.

Lukas:

You started with PyTorch like four years ago when you started? No?

Daeil:

No, no, no, no, no. Well, keep in mind the first year is a little bit like-

Lukas:

Yeah, sure.

Daeil:

You know, swimming in the open ocean, trying to find the island. There's a little bit of that that happens, right? And of course PyTorch has matured a bit over the years. There was definitely a little bit of just trying to get other stuff working, but yeah. So, in the past year, year and a half, it's been PyTorch primarily.

Lukas:

Cool. Well, thanks so much. This has been super interesting. We actually always end with two questions.

Daeil:

Okay. Sure.

Lukas:

The first one I'll tell you. So what is one underrated aspect of machine learning that you think people should pay more attention to than they do?

Daeil:

I mean, given that I work in synthetic data, I'm a bit biased here in my response, but I really think that as much time as you can spend on the architecture, sometimes it's really just the data that is the issue. So, I think people need to think more about the data, understand what they're working with, and understand the biases inherent in that data. That's a big thing.
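A first pass at "understanding what you're working with" can be as simple as checking the label distribution before touching the architecture. This is a minimal, generic sketch with made-up labels, not anything specific to AI.Reverie:

```python
from collections import Counter

# Minimal sketch of the data-inspection step described above: check the
# class balance of a labeled dataset before blaming the model architecture.
# The labels are illustrative.

def class_distribution(labels):
    """Return each class's share of the dataset, most frequent first."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.most_common()}

dist = class_distribution(["car", "car", "car", "person", "bike"])
# A heavily skewed distribution like this one is a warning sign that
# poor performance on rare classes is a data problem, not a model problem.
```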

Lukas:

When people come to you, do you feel like there are common misconceptions they have about synthetic data?

Daeil:

A lot of people are skeptical, right? Like, can this really work? So a lot of the work that we had to do early on was proving it out. What's really cool is that over time, a lot of the clients we've had have just been coming back to us. They'll be like, "Okay, yeah, we need this thing and we need this iteration. This is cool. Can you make things roll now? Can you make things jump?" Things like that. So it's been good in that sense. Early on, it was a lot of proving this out, and this is why we built this whole pipeline of benchmarking and showing how this works and how well it works. At the end of the day, they really just want to see metrics, right? And see that this dataset actually improves something.
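The metric customers want to see is typically a before/after comparison, for example the gain in a detection metric when synthetic data is added to a real training set. The numbers below are hypothetical, purely to show the shape of that comparison:

```python
# Hedged sketch of the benchmark comparison described above: report the
# metric delta when synthetic data augments a real dataset. The scores
# here are illustrative placeholders, not actual AI.Reverie results.

def improvement(baseline_map, augmented_map):
    """Absolute and relative gain in a detection metric such as mAP."""
    absolute = augmented_map - baseline_map
    relative = absolute / baseline_map
    return absolute, relative

# Hypothetical scores: a model trained on real data only versus the same
# model trained on real + synthetic data.
abs_gain, rel_gain = improvement(baseline_map=0.62, augmented_map=0.71)
```

Reporting both the absolute and the relative gain makes the benchmark readable at a glance, which is the kind of proof-it-out number the pipeline above exists to produce.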

Lukas:

All right. So the final question, what's been the biggest technical challenge that you faced making synthetic data work?

Daeil:

It's an enormous engineering challenge. So if you're trying to create large worlds, it's not just about... You know, it requires a lot of optimization of huge amounts of data, which is not a trivial thing. You have to organize it in a way that is also modular, so that you can swap this and that and create that diversity. The hardest technical challenge was how to scale diversity, which is what you need to do with synthetic data. That is definitely the hardest part. And then there's the point about trying to create an early technology, something that was early to the market. I remember going to investors and they were like, "What's synthetic data? Like, what?" They didn't even understand the concept. There were so many things we had to explain.

But it's changing now, so we're super excited about it. I would say creating a system that allows you to scale diversity in a simulation environment is a significant challenge, and you have to do it in a way that is controllable. You can throw adversarial networks at it and things like that, and that might help you to some degree, but at the end of the day, I do think there is a role for control that is more driven by people in terms of how these simulations work. So there's something there, but that's not to exclude the adversarial stuff. It's really important and will play a role in the future.

Lukas:

I think you have front row seats to this. I mean, I would trust your assessment.

Daeil:

I mean, trust me, I would love to just throw an adversarial algorithm at it and generate everything. I wish it were that easy. But unfortunately it's not.

Lukas:

Awesome. Well, thanks so much for your time. This was super fun.

Daeil:

Absolutely. Thank you. I really appreciate it.
