Databricks Founder Ion Stoica: Turning Academic Open Source into Startup Success
Berkeley professor Ion Stoica, co-founder of Databricks and Anyscale, transformed the open source projects Spark and Ray into successful AI infrastructure companies. He talks about what mattered most for Databricks' success -- the focus on making Spark win and making Databricks the best place to run Spark. He highlights the importance of striking key partnerships -- the Microsoft partnership in particular, which accelerated Databricks' growth and contributed to Spark's dominance among data scientists and AI engineers. He also shares his perspective on finding new problems to work on, which holds lessons for aspiring founders and builders: 1) building systems in new areas that, if widely adopted, put you in the best position to understand the new problem space, and 2) focusing on a problem that is more important tomorrow than today.
Hosted by: Stephanie Zhan and Sonya Huang, Sequoia Capital
Mentioned in this episode:
- Spark: The open source platform for data engineering that Databricks was originally based on.
- Ray: Open source framework to manage, execute and optimize compute needs across AI workloads, now productized through Anyscale.
- MosaicML: Generative AI startup founded by Naveen Rao that Databricks acquired in 2023.
- Unity Catalog: Data and AI governance solution from Databricks.
- CIB Berkeley: Multi-strategy hedge fund at UC Berkeley that commercializes research in the UC system.
- Hadoop: A long-time leading platform for large scale distributed computing.
- vLLM and Chatbot Arena: Two of Ion's students' projects that he wanted to highlight.
- Published Jan 14, 2025
- Uploaded Jun 11, 2026
- File type: Podcast
Full transcript
AI-generated transcript with timestamped sections.
[00:00] In general, this was our approach. We are going to be aggressive also about partnerships, even though the partners could compete and overlap. Because you have to trust yourself that, at least when it comes to Spark, you can build the best [00:16] products. We are, you know, kind of saying internally, well, you know, if someone else is building a better product for Spark, then we deserve to lose, right? So that was always the confidence, that we can build the best product for Spark. And eventually, if Spark wins, we are going to win. [00:35] *music* [00:52] Hi everyone, welcome to Training Data. [00:54] Today we're excited to welcome Ion Stoica, Professor of Computer Science at UC Berkeley and co-founder of both Databricks and Anyscale. [01:01] He has a uniquely exceptional career as a leading professor and founder of companies at truly legendary scale. [01:08] Today, we dig into questions like Databricks' positioning in AI, how research projects like Spark and Ray have led to the founding of Databricks and Anyscale, how he ties his research projects closely to industry from day one, [01:22] new projects out of his lab like vLLM, MemGPT, LMSYS, and Vicuna, and what research fields he's thinking about next. [01:29] Ion, thank you so much for joining us today. We're really excited to have you on the pod. To kick things off, we'd love to hear a little bit about where Databricks aspires to fit into the overall ecosystem, especially with some of the recent launches. What are you personally most excited about?
[01:45] First, thanks for having me here. So I think with Databricks, we always wanted to provide [01:55] the platform, an end-to-end platform, which helps our customers get the most value out of [02:06] their data. [02:07] And one of the best ways today to get value out of the data is using these kinds of new developments in AI, including large language models and everything else. And one thing to note is that this was always our vision from day one. Actually, Spark [02:30] was created, one of the main reasons Matei built it, was to solve, [02:38] to speed up classical machine learning algorithms, right, and to scale them up, right? And so, in some sense, for us, it's full circle. We started with AI, I mean machine learning, it's now [02:53] classic machine learning. And right now we are back and doing more and more AI, to take advantage and to create value out of the data. So you mentioned that, you know, you were founded around, you know, enabling classical machine learning. How do you see the current AI moment as the [03:12] same, and how do you see it as different? And I'm curious, what are the specific things you're doing for this moment in time, like the Mosaic acquisition, things like that? Yeah. So certainly the momentum around AI, it's on a different level today, right? You just look at the investments out there, right? That should tell a big part of the story.
[03:42] So the way we are looking at it is that taking advantage and being successful with AI is not easy, right? It's like the AI ecosystem is growing in complexity. It's not just a simple call to a model, right? You have so many techniques now, like [04:03] RAG and RAFT, right, to improve the accuracy of your application using AI, right? Obviously, everyone is excited about AI because it solves so many [04:18] problems and, you know, generates so many headlines. But still, it's not what you want. You want a product of AI. A lot of the things we are still seeing today are fantastic demos. [04:32] And demos are inspirational. And when people see demos, whatever, ChatGPT solving these kinds of problems, Olympiad math problems, it's very easy to think that, wow, doing that, it is going to do everything. [04:49] But going from demo to production, like I said, it's a big step. A demo means that you need to find an instance, at least one instance, that is really impressive. [05:02] There exists such an instance, right? But when you go from the demo to the product, when you have a product, it has to work for all cases, right? So that's kind of the big gap. So that's why the effort is to improve the accuracy, improve reliability, of course,
[05:20] eliminate hallucinations as much as you can. And really also to find where it provides the best value. [05:28] You know, because, you know, you can apply AI to 1,000 use cases, but which are the use cases which are going to provide you the biggest value? I think that's what we are also trying to help our customers with. So to navigate that, how to successfully apply AI to their product, to improve their services, their business. [05:50] In that vein, one of the most interesting things that I thought came with the Databricks AI launch was the new open general purpose LLM created by Databricks, DBRX. What was the reason behind training your own models and open sourcing them? And what do you think are some of the best use cases for that model? Yeah, so I think the main thing, [06:10] if you look at our main market, it's enterprise and enterprise customers. They do have a lot of concerns about data privacy, confidentiality. And obviously they want control. It's not only that; they want auditability, right? They want to be able to audit what data is used, what results have been used, and what decisions have been made with what data. [06:40] So that's why, in general, the enterprises, everything being equal, gravitate towards the open source models, which they can host, you know, on their machines or in their VPC and so forth. So it's one of the reasons for companies.
[07:00] releasing DBRX is to help our customers. And then the customers, you know, in many of these cases, they start from this model and then fine-tune it with their own data to optimize for their particular use case, right? And again, our enterprise customers, they want, you know, open source models. They want control, as much visibility as they can [07:30] on privacy and confidentiality, especially in light of the recent breaches, which have been widely publicized. The other thing is that the DNA of Databricks has been open source. It's not only Spark, but Delta and MLflow and many others. The other thing I thought was fascinating about DBRX is its excellent programming abilities [08:00] compared to something like CodeLlama 70B? Look, I think it's about, obviously, it's about data and how you train it. And I think, you know, one of the major Databricks advantages is Mosaic, right? We have also the entire infrastructure for training. It's not only about the data, but the entire infrastructure for training and fine-tuning. And that makes it much easier,
[08:30] more cost effective to optimize models for different use cases. And obviously the copilot, programmability, it's one of the very important use cases in these enterprises because software engineers are still very expensive. Yes. Interesting. So making the people you have, and it's not only about that. It's like hiring top software engineers is very difficult, right? [09:00] If you are a large company, like maybe, I don't know, [09:03] 4G or something like that. And so making those people productive, it's extremely important and critical for their business. [09:13] You mentioned Mosaic as a key part of the strategy. I guess, what do you think are your most important chess pieces in this kind of AI battleground? I imagine Mosaic is one of them. And, you know, are most of your enterprise customers looking to train their own models? And how do, you know, Mosaic and your other acquisitions fit in with your customer needs? [09:36] Yeah. So I think there are a few enterprises which, so again, it's pre-training, and then it's fine-tuning and using the model on your own hardware, in your own VPC, on your own machines, you know, [09:58] you rent or you own.
[10:01] So... [10:02] As you may expect, there are a few which are doing pre-training, but there are still a few who want to do pre-training on their data and have enough data. A lot of them, they want to do fine-tuning, right? Because, look, [10:13] if you are an enterprise and a company, [10:18] right, and you want to improve your business, right, what do you have which others don't? The thing you have, which others don't, is the data, the data about your business, about your users. [10:31] Right. [10:32] Right, so, [10:34] therefore, because this is something you have and others do not, you want to take advantage of that. [10:40] Right. So how do you take advantage of that? Again, there are many ways and you try different ways to do it. Right. One is about fine-tuning on your data: you have an open source model, you fine-tune on that. [10:52] Right. The other one is to use RAG, right, and things like that. And so you are going to do any of those, but then you want to do it, like I said, in your own VPC, right, to preserve security, to, you know, enforce a security perimeter. You want to do it in order to comply with [11:22] GDPR, California Consumer Privacy Act and so forth, the role of regulation. [11:28] And the number of regulations is going to increase. So the fact that you can have an open source model, you can fine-tune it, you can use RAG in your own VPC, it's a very compelling value proposition. And are you finding that most of your customers want to go that approach versus, you know, OpenAI and the very powerful closed-source models in the market?
[11:52] I think enterprises, we are still early on, and [11:57] there are many enterprises obviously using OpenAI for different use cases, different applications. And I'm sure OpenAI and [12:09] Microsoft Azure will come with new products to provide better confidentiality, better [12:17] security. But at the end of the day, what I was saying is that everything being equal, [12:24] as an enterprise, I will [12:28] prefer more control, security, and it's strategic, right? It's like, it's less kind of lock-in or something like that, right? So I think that's what we are saying. So as open source models are going to catch [12:47] up [12:48] with the proprietary models, in particular in the use cases which matter for the enterprises. It doesn't need to be, you know, perfect, right, for all the particular use cases. It just needs to be very competitive where it matters. [13:04] Then, [13:05] you know, enterprises will prefer these solutions on which they have, you know, more control and [13:14] more security. [13:15] And how far away do you think we are from that moment? [13:18] Like, do you think we're there today where all else is equal? Or when do you think we cross that? In terms of the open source versus proprietary? Yeah, open source being, you know, on par, all else being equal for the core use cases. Um...
[13:30] So you have a lot of use cases. Again, it's the open source plus the data. And right now, the applications are more complex. It's not just a call to a large language model. You have this kind of, you know, what is it called, you know, [13:44] you compose, you have, [13:47] you know, applications which are [13:49] built [13:50] from many components. And [13:55] that's what we call compound AI. And so actually it turns out that, you know, if you can, you know, build an application for doing, you know, recommendation or something like that or, you know, for people [14:16] programming, a particular copilot for a particular tool, you can actually do better than [14:25] even something like OpenAI with the latest ChatGPT, because you have more data. And the other thing about, [14:36] you know, Databricks, and I think there are also announcements about that, is that with Unity Catalog and things like that, you have access and you know also about the structure of the data, which [14:52] helps you tremendously to improve the accuracy of your applications. [14:57] But I say it's not only the model, right? It's everything else you have around it. And the quality of the data you feed in. So I think that's...
[15:09] It's... [15:10] It's the data, stupid, like you would say, right? Like, [15:13] at the end of the day. [15:15] It sounds like control and security are two primary things that enterprises really care about that you've noticed. And Databricks obviously has a tremendous advantage with having access to data as well to help these companies use models. Control and security for the customers. Exactly. What other factors have you noticed that they also care about? How much does cost matter? How much does diversity of other models matter? [15:42] I think obviously cost is important. [15:47] And... [15:48] It's like this, right? First of all, you know, initially the important thing is about the value, right? Can you provide the value, right? So that's the first thing. And at that stage, cost is not as important, right? And here, actually, in some of these early stages, is where people also try to use the most powerful models, like OpenAI's and things like that. But once you [16:15] cross that and you have a use case which you conclude is good, adds value to your business, now you want to scale it up, right? And now you are talking about having more control and more security and all of these things, protecting the confidentiality of the data, the privacy of your users and so forth. And now basically people consider how they are going to deploy it.
[16:45] And having, like you said, more control, security, it matters. And that's where open source models and platforms like Databricks are [16:55] very valuable. And again, it's also all the other components that you get together in Databricks, like I mentioned, like Unity Catalog and everything else, to increase the value of your applications. Yeah, super interesting. [17:11] I'd love to talk to you about compound AI systems. I think you guys probably coined or popularized the term, and it seems like that's a lot of what the industry is latching onto now. Maybe for our audience, can you explain what is a compound AI system and what are enterprises thinking about when they're building these? Yeah. So a compound AI system basically [17:32] consists of multiple components, multiple calls to large language models or [17:39] agents. And you can very much think about it like when you write a program: you have multiple components, you have different functions, procedures, to do different things. And then you put them together [17:53] to create the program. It's very similar here, right? You can use maybe one model to parse the data, to extract the data. And then you can use, for instance, depending on the prompt, you may use a model, you know, if the prompt is about math problems, right? It's like you can use one model.
[18:23] programming, you may use a different model, right? And then you may use, for instance, about formatting the result, right? That's another one. You may use it, for instance, now more and more, you are talking about agents. And with agents, you have to, you know, call external services or [18:46] functions like search, or you can use a calculator and things like that. So now you have different models, you can do a better job. And there are small models actually to take the prompt and convert it to a function call. [19:00] Right. So that's kind of what it is. But the way to think about it is like this: conceptually, it's like when you write a program, it has different components to make it easier to develop and deploy and manage. The same thing you want to apply to these applications. So like a collection of smaller components that work together where, you know, the [19:25] sum of the parts is greater than the one monolith that you're replacing. [19:30] I'd love to [19:31] dive a little bit into the Databricks story quickly, which I think is an incredible, legendary journey over the last decade, but with a lot of nuances that I think maybe many folks don't yet understand. From today, it looks like you were building the right company at the right place at the right time. But the actual nuance is that Databricks was originally started for data scientists. It happened to cater well to machine learning workloads because
[20:01] right strategic decisions to actually really grow with the AI market. Can you share a little bit of some of the learnings and journey that got you to where Databricks is today? [20:11] Yeah, very happy to. I do think that a lot of it is also about being at the right time and the right place and things like that. It's being lucky. I think all of this is true. And many things need to go right to be successful. Some of the things you control, some of the things you do not control. [20:41] a product, a cloud product, hosted product for data scientists. [20:46] Like we have this kind of notebook and we provided hosted Spark and we targeted data scientists. One of the reasons we targeted data scientists was because, again, like I mentioned, Spark early on was also targeting machine learning applications, machine learning workloads. [21:16] At that time, there were not many data scientists. It was like 2013. However, when you looked around, most of the universities already had data science programs, right, degrees. They were starting to offer data science degrees, and we were saying, okay, it seems a good, you know, path forward,
[21:38] a good market. And I remember we were looking at LinkedIn to see how many data scientists there were, because our users, initially, there were not as many. Thousands. [21:51] Especially when you compare that with database analysts and engineers and so forth. So we started to build. And I think it was a reasonably good product. And then we started, [22:08] you know, customers, initially we had [22:10] small customers. And still, the interactive analysis for data science has been, for a long time, actually one of the biggest workloads, in particular in terms of revenue, [22:28] because the interactive workloads are priced higher than batch workloads. [22:34] But I remember that, you know, we sold to these companies to use, you know, Databricks, and they were aspirationally buying it to do, you know, data science, AI. And, you know, after a few months, we went to them, because we didn't go much, you know, earlier, given [22:58] that they were doing well. Their usage was growing and everything seemed good. So no reason to worry. So we went back to them, you know, to see what they were doing. Maybe we can write a blog post or whatever they are doing, right? It's like marketing and everything. The surprise was that very few were actually doing machine learning at that time. And we asked them what happened, right? Well, it turns out that, obviously,
[23:24] in order to do machine learning, you need data, like we discussed earlier. And they realized that for the particular applications they wanted, they didn't have the data they needed. So now they needed to maybe add new logs, you know, to collect new logs in their products and things like that. And also they needed to clean up the data, you know, curate the data and so forth. [23:54] data engineering, right? For data processing, right? It's a data processing tool at the end of the day. And so they were [24:03] using Spark for data engineering. And then, obviously, we started focusing also on serving data engineers much more than before. [24:17] And, you know, that's how we started. And then, obviously, when, [24:24] later, now, [24:26] you know, [24:27] we got the data engineering right and still had all these data scientists exploring and starting to build models, then it was a very natural extension to start to add more products for our users, for our customers, like I mentioned, to get more value out of their data. And that meant building [24:51] machine learning models and using the open source models.
[24:57] Very interesting. [24:58] I want to talk about the Databricks-Microsoft partnership. I think that was the stuff of legends. And I think still probably one of the only case studies of a successful, like truly transformative partnership. Maybe walk us through what that partnership was. Was it a bet the company moment at the time? Do you think Databricks would have become what it became if you hadn't struck that partnership? Maybe just talk about that. [25:23] Look, obviously, the partnership with Microsoft was... [25:29] was a great partnership for us. We are, the one thing I want to say that in, and this is most visible, we are very, from the day one, we are actually very focused on the partnerships. Our idea was always, you know, like we know we make Spark successful, you know, hopefully de facto standard for data processing. And then we make Spark. [25:57] Databricks, the best place to run Spark. So early on, actually, we started a few months in the life of the company. We had this kind of partnership with CloudEra. [26:07] And then we had with Hortonworks. And this partnership was mainly because you are to advance SPARC. [26:17] Right. Because Spark, it was created in. [26:22] Hadoop ecosystem, right? And these are the Hadoop companies, right? And so we had this partnership of data stacks and so forth. So this was like in the first year or so, we had all this partnership, despite the fact that some of these companies, we knew that they could become our competitors, right? Because
[26:43] you know, we were just helping them to deploy, to manage, and to sell Spark [26:53] based services, right? So in some sense, you know, the Microsoft one, it fits our kind of approach of trying to, you know, strike as many, you know, [27:12] partnerships, being aggressive in doing partnerships with other, you know, organizations in the ecosystem. Even though, again, in some cases, it was not clear-cut whether, you know, they were going to compete with us or not. But I say, just growing the ecosystem and growing Spark, that was [27:36] the priority. We even had at some point a partnership with Snowflake. So it was very important. [27:47] It requires a lot of heavy lifting. So we always look for partnerships which are meaningful. [27:54] Right. And I think great negotiation, you know, Ali and so forth did a fantastic job there. But at the end of the day, we also needed to commit to it. It's like, it took, [28:12] to build Azure Databricks. So we were in AWS before that. It took, you know, tens of engineers one year. And you are a small company at that point.
[28:23] So it was a huge commitment and a huge bet also from our perspective to do that. And, yeah, and we, you know, [28:33] I think, [28:34] engineering and everyone executed very well and it was a successful product. [28:39] Right. And, you know, Microsoft has great partners. [28:44] And, yeah, this is what happened. We were obviously a bit lucky, but in general, this was our approach. We are going to be aggressive also about partnerships, even though the partners could compete and overlap. Because you have to trust yourself that, at least when it comes to Spark, you can build the best [29:09] products. We are, you know, kind of saying internally, well, you know, if someone else is building a better product for Spark, then we deserve to lose, right? So that was always the confidence, that we can build the best products for Spark. And eventually, if Spark wins, we are going to win. [29:27] Do you think that Databricks would have become the company it is today without that Microsoft partnership? [29:32] I think so. It maybe [29:35] would have taken a little bit longer. [29:37] But yeah, I think so. [29:39] We still have a very good offering on Azure, [29:45] like we have with GCP. It would have taken a bit longer. [29:49] But I don't see...
[29:53] any fundamental change in dynamics, right? Because one of the advantages of Databricks, of course, is that once Spark won, and we could provide the best product for Spark, we were in a very strong position. And compared with other clouds, remember that one of our advantages, like everyone else's advantage, like Confluent and so forth, is that you can provide a service on multiple clouds. [30:20] Right. And, you know, multi-cloud has been, you know, more and more strategic, especially for large enterprises. They do not necessarily want to be locked in, or they want to have the choice. [30:37] So, yeah. [30:38] I love the confidence and conviction that you have in Spark and your own execution abilities internally, but that married with the practicality and aggressiveness of winning as a business and pursuing the right partnerships and doing whatever it takes to win. [30:53] Yeah. Yeah. I mean, you try to simplify things. That's what I was saying. You know, like initially we said, look, you know, we have to make Spark win. You know, there are many. I remember we looked at all these combinations. It's like, [31:09] Spark wins, the product fails, right? Right. [31:15] Spark loses, but we have a product which is successful, or both fail. No, that's not very interesting. Or both are successful, right? And we convinced ourselves that,
[31:24] you know, we need to bet on Spark to win because that's the most likely, yeah, [31:31] way also for the product to win. Again, for better or worse, right? But sometimes, you know, there could be many ways to success, right? And in retrospect, you cannot go back and, you know, try other alternatives. Maybe there were better alternatives at that point. But it's important to commit to one thing, which hopefully is a reasonably good solution, a good path forward, right? Again, there are many, many paths to the peak, right? The most important is [32:01] shortest one or the easiest one. It may not be, but it has to be one to get there. And to the highest point. And that's why I said, okay, Spark has to win, right? And we have to build, you know, the best, need to be the best place for Spark. And then we are saying, you know, to be the best [32:18] place for data and AI. We need to eventually renew, and we assumed that if you are to be hugely successful, you are going to, you know, go beyond Spark, right? That's why the name of the company is Databricks, not Spark Labs or something like that. So that's kind of, you try to simplify it. Once you do that, then you start to execute. Okay, so you want to make Spark successful as open source. So you want everyone to use it. So that's how you do Cloudera [32:48] and so forth, to do this partnership. Because at that time, there were other solutions. You know, people knew that Hadoop MapReduce, you know,
[32:59] you know, [33:01] its time had passed, so to speak. So they were talking about new systems. There was, like, Tez, which was actually a Hortonworks project, and so forth. So that was very important. [33:11] And then it was also for data science. It was [33:16] kind of where we bet, like, it's a niche and, you know, we saw that we can build the best product for it. [33:26] So, you know, that's kind of what you need. [33:30] And you need [33:31] to have ultimately confidence in what you bet on, right? You have to bet, right? Right. Because you are a small company, right? If you don't bet, how are you going to win, right? And then you need to have some level of conviction, right? To do it. And, um, [33:47] And, yeah. [33:49] I'd love to kind of [33:50] pull on that thread and switch gears a little bit into tying a lot of the entrepreneurial path that you have with a lot of the academic and research background that were the roots for the beginnings of these companies. You have a very unique career as both a leading professor and a founder of multiple unicorn and decacorn companies. I don't think there's anyone who comes close to pursuing both disciplines at the scale of success that you have. [34:20] Anyscale, Ray to Anyscale, Spark to Databricks. Take us into your head, what is the process in which these research fields start to ruminate in your mind? When do you kind of continue to give them resources to develop? And then when do you know that it's the time to then start a company to pursue that in a better, more open and fast way?
[34:46] That's a good but hard question. [34:50] So I think that, and by the way, I'd like to preface, you know, to say that [35:01] there's obviously also a lot of luck involved. And, you know, being in a place like Berkeley and having fantastic students and colleagues around you, you couldn't do it without that. Right. It's their merit probably more than mine. But one thing, I think, that I've always been trying to focus on, [35:24] it's the problem. Yeah. Right. And actually, [35:30] even to my students, I'm telling them, one of the most important things you need to do is to figure out what problems you are going to work on. Yeah. Right? Because it's like... [35:42] Everyone who comes to Berkeley or these top schools, one thing they have in common, they are good problem solvers. [35:49] Right. [35:50] They have good grades, good scores, you know, write papers. A paper is about solving a problem. So therefore, if all of them are good problem solvers, the differentiator is the problem you are working on. [36:04] Right. Yes. [36:05] Thank you. [36:06] So you start with that. And also... [36:10] I think it's like, especially at Berkeley, you get...
[36:19] You get exposed to not only new ideas, but a willingness to [36:24] take risks and get into new areas. In some sense, and this is what I like about Berkeley, if you look traditionally, Berkeley, [36:34] among top schools, they are [36:37] the first to open new areas. It's like, of course, the RISC processor. It was with Stanford as well. But databases, networking, sensor networks, even devices, [36:51] in open source with BSD Unix, you know, TCP/IP, right, part of the STIB. So they are always kind of, you know, a little bit more trying to experiment. So I think that's kind of, you know, the culture. I really resonated with it. And then the other thing that happened is, basically, we have these labs, which are like five-year labs, where basically each lab has kind of a vision and, you [37:21] know, it's a group of faculty coming together who believe in that vision, and with their students try to make it happen over, you know, five years. And this, you know, has had a lot of great, you know, impact, you know, with like, [37:36] all the way since it started. This tradition started 40, 50 years ago with Dave Patterson, Randy Katz and others. And they built, you know, RISC, RAID, redundant array of inexpensive disks, NOW, Network of Workstations, it's commodity. You know, everyone is now building these huge clusters of commodity machines, servers, and again, and many more. So there are these kind of elements and these
industry. [38:08] connections, they are funded. When I came actually, this was one change happened. Before, these labs were also supported by government, in particular DARPA. But it was a point in which that kind of, at least that particular DARPA funding dried up. So, [38:28] Kind of when I came to Berkeley, this kind of now we need to go and remember, you know, getting more money from industry, from Google. First time we got and it was unheard of back then because you are asking for five hundred thousand per year. Right. For four years. So but, you know, we got and this is we got. So now you have also this kind of very tight connection with the industry. [38:58] who understands their problems. [39:00] Right. And then you can see that. And you try to think about also about. [39:09] Obviously about trends, right? Because trends are important, right? You have to be aligned with the secular trends, right? You need to bet on the right trends because these are things you cannot change, right? Or it's very hard to change. So if you are not aligned, [39:25] It's not good. And these trends, actually, there are multiple trends. And the multiple trends kind of open gaps between them. And these are kind of opportunities for problems.
[39:39] you know, big data. It is clear you have more and more data. [39:44] And the amount of data people were collecting was just growing. [39:49] Right. It was pretty clear. Right. It's like Google has seen that years before. And they built all the systems. But now everyone wants to emulate that. [39:57] That's why Hadoop was created. [39:59] Right. And then you see you start to see again, you look at that and you are. [40:06] working in that area and we are [40:10] you know, we have all these Hadoop people coming to our retreats on this kind of labs, and we are, you know, friends with them. And we started to see problems, and then you try to use them. And then you just, like, for instance, there are two things which happened with Hadoop. One thing is about a group in our lab. Oh, by the way, and the other thing that happens in these labs, they are interdisciplinary, right? [40:40] that it was people from machine learning, systems, databases, networking. [40:46] So there are these groups of Michael Jordan students which wanted to compete in this Netflix challenge. Netflix released some data, basically, and asked for people to provide recommendations, come up with recommendations, to build recommendation algorithm systems, to beat their own recommendation. So they come together.
[41:10] to us and okay, it's a lot of data, what can we do about it, right? We want to, and, you know, tell them to do, yes, Hadoop. But Hadoop was very slow, right? Because, you know, and then, you know, Matei put together something quickly for solving this problem in which the data was kept in memory. The other thing I've seen, it's about, like, I have a previous company, Conviva, and it's about [41:40] very slow and we try to do it, you know, it's like you do ad hoc queries and there is no way to do it. And again, keeping the data in memory was a solution. It's one solution. That's kind of how we started. That's one thing, right? And you look at the trends. Yeah, it's obvious, right? It's like on one hand, you, you know, have more and more data growing faster than Moore's law. So you need to have, it's not going to fit on one machine and therefore you need to use multiple machines. [42:10] then the only other question is that are you going to have data sets that are going to fit, important data sets that are going to fit in memory, right? It's like that's kind of the first question. And they are because people, even when they are doing, for instance, ad hoc queries, we notice that, [42:25] And we look at the data from different clusters, Hadoop clusters from Yahoo and Microsoft and others. And we notice that in a lot of cases, actually, when you do queries and you're doing analytics, you're doing them very rarely on all the data. You do, say, for the most recent data. You want to see what happened yesterday, what happened last week, something like that. So once you get that and, you know, you have a lot of cases in which the data is fitting in memory
[42:55] at that time, you know, that's, you kind of, you're pretty, you know, you connect the dots, right? And then it's about solving the problem. And the other thing that happens, and why they are related, you know, I'm talking about academia and industry, because I'm telling, you know, people, you know, some people, you know, there are people in academia who push back and say, you know, it's a lot of engineering here. This is not what you should do in academia. But one thing is what [43:25] It is very satisfying. If you build a system, [43:28] Right. In a new area. And that system, it's used by other people. [43:35] Then, [43:37] You are in the best position to understand the new problems in that area because people are going to use your system in different ways. Right. And then, you know, you understand that. Right. And going back, if you know the problem, you are also in a good position to solve it. [43:57] Right. So actually, it directly helps you with your research. Right. To be ahead, because otherwise, what is the choice, where are you finding the problems? [44:07] Of course, there are problems, very good problems in theory, which are not solved for decades and so forth. But other things that people do, they go to Google and Microsoft and so forth, spend time and to understand what is the problem they have, right? Because... [44:22] to solve. But that's kind of a little bit unsatisfying, right? Because you go to someone to learn about their problems, but the question is why don't people solve those problems? Maybe they don't solve the problems because maybe they're not as important at the given time and maybe...
[44:40] For the right reasons, they are, you know, too much in the future. [44:45] But this is a thing, right? You have to focus on the problem and you have to focus on the trends. And the way they're connected, you want to solve a problem, ideally, which is going to be more important tomorrow than today. [45:01] Mm. [45:01] What are the problems that you're most excited about right now? Like Spark, Ray, like what's going to be the next Databricks or Anyscale? [45:08] So a few things. So I still think that... [45:11] it is going to be a lot of work. It's, it's, [45:16] Right now what happens is that we need to rethink most of the software stack. Why? Because it's, again, going back to the trends, is that the demands of these applications, in particular AI applications and so forth, are growing much quicker than the capabilities of a single processor, a single node, even if you consider accelerators. [45:39] So on one hand, this happens. On the other hand, [45:43] the infrastructure becomes much more complex. You need to run that application not only on one node, but on many nodes. It's distributed. But it's not only that. It's becoming very heterogeneous, right? Because in order to bridge the gap between the demand and capabilities of hardware, people build accelerators. Like that's why NVIDIA is a trillion-dollar company, right? But now the infrastructure becomes even more complex.
[46:13] Right. It's not only distributed, it's heterogeneous. When we started Spark, it was homogeneous. [46:19] All the nodes are the same, some storage, some CPUs, that's all. [46:25] But right now, look at the heterogeneity. You have NVIDIA and you have many others. You have TPUs from Google. Every [46:33] cloud is having their own chip. [46:39] Like now you have AMD and... [46:41] Intel say that it's everything, it's about AI, right? So that's kind of what happens. So now you have a huge gap between these applications and this very complex infrastructure, just growing in complexity. And then it's not only about CPU, about compute. It's about networking. You have InfiniBand, and you have all of this, you know, RDMA and so forth, right? So huge heterogeneity. [47:06] and the software stack has to abstract away that complexity for the developers. There is no way around. On a single machine, you have this operating system to abstract away the complexity. That's what makes it easy to develop all these applications. Now it's extremely hard. So something is going to happen there. I think the other one is about building these applications. We're talking about compound AI and things like that. [47:36] application, AI application in particular, large language models. Everyone is talking about large language models.
[47:43] You know, the applications are like assistants for humans, right? Humans are in the loop, right? If you are thinking about customer support, if you are thinking about co-pilot, if you are thinking about Q&A, question answering, even summarization, you are helping a human to be much more productive. [48:00] which is a fantastic application. Um, but, uh, they cannot be autonomous. They are not yet autonomous. And to go from having the human in the loop to being autonomous, it's a huge gap, right? Because with autonomy, to have something autonomous, you need to have something which, um, you know, it's running kind of, you know, it's, [48:26] It's more deterministic. It's more reliable, right? It's like far more accurate. [48:32] Right. And, you know, you need to get there because if you don't get there, you know, you are still limited to having the system where the human is in the loop. And the human is a bottleneck, will become the bottleneck. It's, you know, it's just a certain number of people on the planet. [49:02] more like an engineering discipline, right? [49:05] where you can build much easier systems from smaller components. Okay, so the two next Databricks will be distributed compute across heterogeneous hardware and autonomous compound AI systems. Noted.
[49:21] I'd love to pull on another thread that you mentioned just now of kind of how funding constraints drive what you're working on. I'm curious what you think right now. There's been a lot spoken about kind of almost the brain drain in AI right now [49:35] The universities just don't have the funding that you could get if you went to work at one of the big research labs. How do you think about that? What do you think is ideal? Does operating under constraints force creativity for you? What do you make of all that? Yeah, that's a great question. [49:55] And... [49:57] It's true. I mean, it's challenging. It's very challenging. And when I came to the United States, I came to do my PhD and I graduated from Carnegie Mellon University. I am [50:09] originally from Romania. So one thing people admired about the United States is how well this kind of, this kind of, [50:24] you know, the collaboration and the partnership, the three-way partnership between academia and government and industry are working, right? And the prime example at that time was obviously the internet. [50:39] right, which was a DARPA project. And of course, academia had a huge impact. And also industry was [50:49] you know. [50:50] It was whatever third industrial revolution people were saying. And when it comes to AI today, that kind of partnership is broken.
[51:00] It's like... [51:04] industry, every company is doing this research in silos. [51:08] They don't even talk to each other as much. [51:13] academia, like you said, doesn't have resources and the government doesn't invest as much. [51:20] So I think that's something to be very concerned about. [51:26] And that's why I'm also a big proponent of open source models. And U.S. and... [51:33] California kind of will lose in the long term if this is not fixed. [51:39] So what happens now, unfortunately, in academia, one thing that happens is that, of course, there are some, you know, bigger universities and labs which can still afford to maybe train. [51:50] spend one, two million maybe to train some models. Still, they are not in the same league as OpenAI, right? It's like OpenAI and Microsoft, they are talking about building data centers for about $100 billion, right? [52:06] And I think the danger is that you are going to, [52:12] going to, some of the academics are going to give up and try to innovate around the edges. Now, you can still innovate in these applications and things like that. And, you know, I think there's a lot of innovations there. [52:27] But clearly, you know, it will be harder to come up with new model architectures and, you know, innovate projects.
[52:40] have, it will be harder. It's not impossible. Nothing is impossible. [52:46] So yes, it's a challenge right now. I mean, there are people like, you know, which are in a [52:55] more fortunate position, maybe I'm one of them, who have access to resources outside academia, right? [53:04] But, you know, it's – you do want to – [53:09] level the playing field in order to maximize the innovation. The innovation comes from everywhere. Now, this being said, it's true that, you know, scarcity and always in the past, scarcity, you know, spurs innovation, right, as well. [53:27] But the concerns about communities, [53:33] not having access to... [53:36] resources to... [53:38] Thank you. [53:40] To play the same game, like in industry, it's a concern. [53:44] I'd love to switch gears into some rapid fire questions if you're ready for it. Yeah, go ahead. Will anyone take meaningful market share from NVIDIA over the next five years? [53:55] I think they, it will be, they will be at the minimum because they are going to, it's probably that it's, [54:04] and [54:06] Because NVIDIA will not like to be accused of monopolistic behavior. So their market share has to decrease under some percentage, whatever, 70, 80 percent. So that one will be one of the reasons. But I think that if I have to name one company to take,
[54:28] Of course, there are clouds and for strategic reasons, they are going to push their agenda to build their chips. It's like still probably the biggest competitor right now in terms of the market share. It's Google with TPUs. Yeah. [54:42] Yeah. [54:42] Probably that will continue for a while. [54:45] What's one project or student in your lab right now that you'd want to highlight? [54:50] You know, I'm going to cheat here because I think that both VLLM and Chatbot Arena have been tremendous. It's like I'm not talking about SkyPilot because SkyPilot is like they started a company. So I'm not talking about. But I think VLLM has been amazing. It's like it's a one-year-old project. And I've never seen such rapid growth. And of course, it's also part of the AI. AI kind of compresses the time. [55:20] about that. And I think the other one is Chatbot Arena, because it just is fascinating to see the development in the space and to see how these different models, where they are strong, where they are weaker. And I think that having a front seat at that, to see that kind of development of the ecosystem and in the space, it's fascinating. [55:48] Do you think the foundation models will commoditize? [55:50] The foundation model was a rare error. [55:53] Like a GPT-4 or a Claude? Do you think there's a market to be made in providing these models over time, or do you think it commoditizes them?
[56:02] I think people will continue to build larger and larger models. I think when it comes to serving, it looks like, [56:11] model distillation works quite well. By the way, model distillation is where you train a smaller model on the outputs from the bigger model. [56:20] There has been a lot of success with that. By the way, this in some way, um, [56:25] it shows how important the data is, right, for training a model, right, going back early on in our conversation, right, because you have higher quality data from the big model and use that to train the small model and it's working very well. So I think using model distillation to reduce the cost of inference is going to be a way forward. But, yes, I think for advancing and pushing the frontier, so to speak, right. [56:55] pun intended, you are still going to see a lot of, you know, a lot of effort on bigger and bigger models. What are you most excited to see in the world of AI in one, five, and 10 years? [57:06] But I must talk about AI. [57:09] Look, there is no question it's transformational, right? I think that... [57:15] and um [57:17] It will change a lot of things, everything maybe. I think the most excited I am about it, it's about how do you make these AI systems more predictable, accurate, verifiable, how you can debug these systems. All of this is kind of in the realm of software engineering,
[57:47] This is what I think is exciting. [57:50] What advice do you have for founders building in AI? [57:53] Um, it's the same thing, you know, focus on the problem. Don't focus on the hype. Hype is emotional. It's not reliable. Just look at the facts, right? It's like, and look at the problem, try to understand the problem and try to be truthful to yourself. It's about, if you build an application, it's about production. It's not about the demo, right? It's beyond the demo. Of course, the demos are important. Don't get me wrong. [58:23] This is the mindset you have to have. And, yeah, and the dangerous thing is there's so much hype. And you think you can solve everything and you can do everything, probably, in the future, in a [58:35] certain number of years. But now just focus on exactly what problems [58:41] you are going to solve. Convince yourself it's a good problem. Convince yourself that... [58:47] You can solve it, or at least you can have an MVP, right? You can solve a smaller version of that problem, which is still very valuable for your customers. That's what I will say. Nothing. [59:02] Now, Silvia Ravoulat. Amazing. Thank you so much, Ion, for joining us today. I've loved hearing a lot about your own thinking and reasoning behind your own journey. A lot of the thought process behind finding the right problem to solve, building the right systems to actually be in a position to understand the best problems, and then applying that even to many of the bold decisions that you've had to make in founding multiple companies from research into commercialization and the incredible success of Databricks today.
[59:31] Thank you. [59:32] Thank you for having me. [59:34] Music.