ReflectionAI Founder Ioannis Antonoglou: From AlphaGo to AGI
Ioannis Antonoglou, founding engineer at DeepMind and co-founder of ReflectionAI, has seen the triumphs of reinforcement learning firsthand. From AlphaGo to AlphaZero and MuZero, Ioannis has built the most powerful agents in the world. Ioannis breaks down key moments in AlphaGo's game against Lee Sedol (Moves 37 and 78), the importance of self-play, and the impact of scale, reliability, planning and in-context learning as core factors that will unlock the next level of progress in AI.
Hosted by: Stephanie Zhan and Sonya Huang, Sequoia Capital
Mentioned in this episode:
- PPO: Proximal Policy Optimization, an algorithm developed by OpenAI in game environments. Also used by OpenAI for RLHF in ChatGPT.
- MuJoCo: Open source physics engine used to develop PPO
- Monte Carlo Tree Search: Heuristic search algorithm used in AlphaGo and, through MuZero, in video compression for YouTube and a version of the self-driving system at Tesla
- AlphaZero: The DeepMind model that taught itself from scratch how to master the games of chess, shogi and Go
- MuZero: The DeepMind follow-up to AlphaZero that mastered games without knowing the rules and is able to plan winning strategies in unknown environments
- AlphaChem: Chemical Synthesis Planning with Tree Search and Deep Neural Network Policies
- DQN: Deep Q-Network, introduced in the 2013 paper Playing Atari with Deep Reinforcement Learning
- AlphaFold: DeepMind model for predicting protein structures, for which Demis Hassabis and John Jumper shared the 2024 Nobel Prize in Chemistry with David Baker, who was recognized for computational protein design
- Published: Jan 28, 2025
- Uploaded: Jun 11, 2026
- File type: Podcast
Full transcript
AI-generated transcript with timestamped sections.
[00:00] Go is a complex game and there was always a bit of worry about whether AlphaGo was truly as good as we believed. [00:06] We actually had the conviction that deep reinforcement learning is the answer based on everything that we could measure and everything we could see. [00:14] But that's... [00:16] The thing about these systems is that they're not like classic computers where you just know that they always produce the same answer. They're stochastic. [00:23] They are creative. And they have some blind spots. They hallucinate, similarly to how modern LLMs hallucinate. So you need to just really push them and see exactly where they break. And the only way you could actually do that is by having the best humans playing against them. [00:56] Today, we're excited to welcome Ioannis Antonoglou, a researcher and an engineer who has contributed to some of the most significant breakthroughs in AI. [01:05] As a founding engineer at DeepMind, [01:07] Ioannis played a crucial role in developing AlphaGo, which made history by defeating Go world champion Lee Sedol. [01:15] He later co-led the development of MuZero, which pushed the boundaries even further by mastering multiple games autonomously. [01:23] Now, as he embarks on his latest venture with Reflection, he's focused on building the next generation of AI agents.
[01:32] We're excited to talk to Ioannis about the breakthrough moments in AI history that he's witnessed firsthand, [01:38] from AlphaGo's famous Move 37 [01:41] to his perspective today on what's next for the combination of reinforcement learning and large language models on the way to AGI. [01:49] Ioannis, thank you so much for joining us today. Thank you so much for having me. Ioannis, you have an incredible background, having worked at DeepMind as a founding engineer for over a decade, starting with some of the most notable projects that have really defined the industry. DeepMind, quite notably, created this notion of building AI within games to start. Can you share a little bit more about why DeepMind chose to start with games at the time? [02:18] Yeah, so DeepMind was the first company to truly embrace the concept of artificial general intelligence, or AGI. From the outset, they had grand ambitions, aiming to build systems that would match or exceed human intelligence. So the big question was, and still is, how do you build AGI? And more importantly, how do you measure intelligence in a way that allows for meaningful research and performance improvements? [02:40] So the idea of using video games as a testing ground came naturally to DeepMind's founders, [02:45] Demis Hassabis and Shane Legg, because Demis had a background in the gaming industry, and Shane's PhD thesis defined AGI as a system that could learn to complete any task. [02:56] Video games provided a controlled yet complex environment where these ideas could be explored and tested. [03:02] And to what extent, you mentioned games are, they provide a very controlled environment,
[03:06] To what extent are games representative or not of the real world? Like if you have a result in games, do you think that generalizes naturally to the real world or not? [03:14] So, I mean, I guess games have indeed been valuable for developing AI. [03:19] And you actually have, like, a few examples of that. So you can see that PPO, for example, which is currently being used in RLHF, was developed using OpenAI Gym, with MuJoCo and Atari. And similarly, we have MCTS, which stands for Monte Carlo tree search, and was developed through board games like backgammon and Go. But at the same time, games have a number of limitations, [03:49] even the most complex games. So even though it just gives you an interesting testbed to develop new ideas, it's definitely limiting, and it doesn't really capture all the complexity of the real world. [04:01] Okay, interesting though. So a lot of the techniques and algorithms that you've developed in a game environment, PPO, [04:07] etc., these are used in the real world. [04:09] Yeah, so PPO is actually exactly what ChatGPT used for RLHF. And MCTS, it's used in MuZero, and MuZero has been used in the real world in things like video compression for YouTube. [04:28] It was part of the self-driving system at Tesla at some time. [04:34] Uh... [04:35] And it was also used for developing a
[04:39] a pilot that was completely controlled by an AI. So yeah, I mean, [04:44] you can see methods like that being used in the real world to solve real problems. [04:49] So interesting. Ioannis, I remember back in 2017 when AlphaGo, the movie, came out and it featured the incredible game of AlphaGo against Lee Sedol. Can you take us back to that moment in time and maybe the years leading up to it as you're building AlphaGo? How was Go specifically chosen as the game to focus on? [05:10] So, I think that games have always been a benchmark for AI research. So, before Go, you had chess, and chess was a major milestone, with IBM's Deep Blue defeating Garry Kasparov in the late 90s. [05:24] And even though chess and Go are completely different games and Go is definitely a different beast, [05:30] there is a... [05:32] like, games have always acted as test beds, especially board games, for the development of new AI methods. Actually, even going back to the earliest days of AI research, Turing and Shannon, they both worked on their own versions of chess bots. So now the thing about Go is that [05:53] it's a much harder problem than chess. The reason for that is because it's almost impossible to define an evaluation method, a heuristic. In chess, you can just take a look at the board. You can count the number of pawns that each side has. You can see what the ranks of these pawns are. Then you can just make some... You can draw some conclusions on who is winning and why. In Go, there's nothing like that. It's mostly human intuition. If you ask a Go professional player
[06:23] like how they know whether a position is a good one or a bad one, they will say that, you know, after having played the game for so long, they can just feel it in their gut, like this is a better position than the other one. So now... [06:35] it's actually a question of how do you encode the feeling in your gut into an AI system, right? [06:42] So this is exactly the reason why solving Go was considered the holy grail of AI research for a long time. And it was a challenge that seemed almost impossible, but at the same time it was within reach. People felt that they could actually get it [06:55] cracked. And this is exactly what AlphaGo did back in 2016. And it kind of showcased two new methods, which are deep learning and reinforcement learning. Because back in 2015 and 2016, like, now we kind of think of deep learning and reinforcement learning as mature technologies, but back then, they were literally [07:16] taking their first steps and they were kind of the new kid on the block. And most people were really skeptical about them. Everyone thought that deep learning was a... [07:26] it was another AI fad that just wouldn't stand the test of time. [07:30] So yeah, Go was chosen because it was a clear way to show that you actually have [07:38] the most performant agent in the world. You could actually evaluate it. You can have it play with other humans. And at the same time, it was within reach, given the latest developments in deep learning and reinforcement learning. [07:50] I remember reading that there are more configurations of the Go board than atoms in the universe by many orders of magnitude, and that blew me away. I grew up playing Go, and it felt like it's very simple in terms of the rules.
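The comparison between board configurations and atoms can be checked with a few lines of arithmetic. This is a rough sketch, not a figure from the episode: it uses the loose upper bound of three states (empty, black, white) for each of the 361 points, which also counts illegal positions, and it assumes the common estimate of about 10^80 atoms in the observable universe.

```python
import math

BOARD_POINTS = 19 * 19  # 361 intersections on a standard Go board

# Loose upper bound: every point is empty, black, or white.
# This overcounts, because it includes positions that are illegal in Go.
upper_bound = 3 ** BOARD_POINTS

# Assumption: a commonly quoted rough estimate of 10^80 atoms.
ATOMS_IN_UNIVERSE_LOG10 = 80

log10_configs = math.log10(upper_bound)
print(f"3^361 is about 10^{log10_configs:.1f}")
print(f"orders of magnitude above 10^80: {log10_configs - ATOMS_IN_UNIVERSE_LOG10:.0f}")
```

Even allowing for the overcounting, the gap is roughly ninety orders of magnitude, which is why brute-force enumeration was never an option for Go.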
[08:05] I see why it was the holy grail. [08:07] Maybe can you explain how AlphaGo worked technically? Maybe explain to me like I'm a fifth grader because that is effectively my level of sophistication, understanding these things. [08:18] But how did it work? And you mentioned that both reinforcement learning and deep learning were involved. I'd love to peel that back a little bit. [08:24] Yeah, absolutely. So AlphaGo [08:27] has... [08:28] two deep neural networks. So a neural network is a function that takes something as an input and produces something as an output. And it's literally like a black box. We don't really know exactly how it does it. We just know that if you train it on enough data, it will learn the mapping. It will learn the function from the input to the output space. [08:47] So AlphaGo actually had access to two deep neural networks, the policy network and the value network. And the policy network suggested the most promising moves. So it will just take a look at a current board position and just say, okay, you know, based on the current position, this is the list of moves that I would recommend you consider playing. [09:07] And it also had access to the value network, which will just take a look at a board position and give you a winning probability. Like, what are your chances of actually winning the game starting from this position? This is exactly the gut feeling. Like, it had its own gut feeling on whether the position is a good one or a bad one. [09:24] So once you have access to these two networks, then you can actually play in your imagination a number of games. You can consider the most promising moves. Then you can consider your opponent's most promising moves. And then you can just evaluate each move with the value network. And then you can use a method called minimax. What that says is that I want to win the game. 
But I also know that my opponent wants to win the game. So I want to pick a move that will maximize my chances of winning, knowing that my opponent
[09:54] will try to maximize their chances of winning. So if you actually do that and simulate a bunch of moves, then you can just get the optimal action. [10:04] And, you know, the way to do this imagination, this planning, this search in the most efficient way is by using a tree search method called Monte Carlo tree search, so MCTS. So whenever people talk about MCTS, they literally just mean this heuristic of [10:23] how do I choose which futures to consider so that I can make informed decisions? The role for reinforcement learning and deep learning in building AlphaGo was that [10:33] Uh, [10:34] AlphaGo first and foremost was a success of reinforcement learning and deep learning, [10:37] because these are exactly the two methods that powered AlphaGo. And the policy network was initially trained on a large set of human games. So you had many games played by human professionals, and you just consider every position, and you consider the move they took at this position. And then you have a deep neural network that tries to predict [10:59] this move. [11:01] Then once you have the policy network, you need to somehow find a way to obtain a value network. So we did it in two ways. First, we just took the policy network and we had it play against itself. And we used reinforcement learning to improve the playing strength of the model. [11:18] So we used a technique called policy gradient. So what policy gradient does is that it just looks at the game, and then it looks at the outcome. This is the simplest version of policy gradient. It looks at the outcome of the game, and for all the moves that led to a win, it'll just say, great, just increase the probability of choosing this move. And for all the moves that led to a loss, it says, great, now decrease the probability of this move being selected in the future.
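The simplest version of policy gradient described here can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not AlphaGo's training code: the "game" is a single decision among three hypothetical moves with made-up win probabilities, the policy is a softmax over three numbers rather than a deep network, and `train_policy` and `win_prob` are names invented for this example.

```python
import math
import random


def softmax(logits):
    """Turn raw preferences into a probability distribution over moves."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(win_prob, episodes=5000, lr=0.05, seed=0):
    """Outcome-based policy gradient (REINFORCE without a baseline).

    win_prob[i] is the hypothetical chance that move i leads to a win.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(win_prob)  # start with no preference between moves
    for _ in range(episodes):
        probs = softmax(logits)
        move = rng.choices(range(len(probs)), weights=probs)[0]
        # Outcome of the game: +1 for a win, -1 for a loss.
        outcome = 1.0 if rng.random() < win_prob[move] else -1.0
        for j in range(len(logits)):
            # Gradient of log pi(move) with respect to logit j.
            grad_log_prob = (1.0 if j == move else 0.0) - probs[j]
            # A win raises the chosen move's probability, a loss lowers it.
            logits[j] += lr * outcome * grad_log_prob
    return softmax(logits)


probs = train_policy([0.2, 0.5, 0.8])
print([round(p, 3) for p in probs])
```

After training, most of the probability mass should sit on the move with the highest win rate, which is the "improved policy" the rule is meant to produce.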
[11:48] If you do that for long enough, then you just get an improved policy. Now, once you have this improved policy, you can just generate a new data set of games where the policy plays against itself, and then you have a [11:59] huge amount of games where for each position you know who the final winner was. So then you can take another network, a value network, and have it predict the outcome of the game based on the current position. So [12:13] what the network could learn is that if I start at this position and I play under my current policy, [12:19] on average, this is the player who wins. Like this is either the black player or the white player. [12:24] So this is the first version of a value network, and you can use it within AlphaGo by combining it with the policy network. [12:32] And what were some of the biggest challenges in building this and how did you overcome them? [12:38] AlphaGo was not just a research challenge, but was mostly an engineering marvel. The early versions ran on 1200 CPUs and 176 GPUs. The version that played against Lee Sedol used 48 TPUs. TPUs were the first custom accelerators. [12:58] And these accelerators were really primitive back then. Because literally it was the first version, right? Now the later accelerators are much, much better and much more stable. [13:08] So the system had to be highly optimized to minimize latency, maximize throughput. We had to build large scale infrastructure for training these networks. And it was a massive endeavor. It just required a lot of coordinated effort from many individuals working on different aspects of the project. But, you know, I just walked you through a number of steps to obtain the policy network and the value network. And each of these steps had to be implemented
[13:35] at the limits of what was available and what was possible back then in terms of scale. And it had to be implemented in a way where people could just tweak everything. They could just try the research ideas fast and get results fast. [13:50] Lots of people were [13:53] working at scale, at levels that hadn't been implemented before, and it was kind of like working at the forefront of what was possible back then. [14:03] I love your highlight of it being a research marvel and an engineering marvel. And I remember you sharing one time that part of the reason this project came about also was because Google had TPUs that they needed to... [14:14] they needed a test customer for, and that was the spark, this AlphaGo project. So that's pretty incredible. [14:21] How much conviction did the DeepMind team have that this is going to work? You mentioned that, you know, at the time, [14:27] deep learning, reinforcement learning were still relatively novel, but DeepMind was very much founded with that belief. [14:32] But did you guys think that you were going to be able to have [14:34] kind of these superhuman level results, beating the top Go player in the world? Like, was it a crazy idea and maybe it'll work? Or did the team have conviction like this is going to work? [14:44] Yeah, so I'd say the team had a cautious optimism. So one of AlphaGo's lead developers, Aja Huang, he is a strong amateur Go player, and he had been working on Go for like [14:57] a decade before AlphaGo happened. And we also had a leaderboard of computer players, and you could see that AlphaGo was significantly stronger than anything that had come before. But Go is a complex game, and there was always a bit of worry about whether AlphaGo was truly as good as we believed.
[15:15] We actually had the conviction that deep reinforcement learning is the answer based on everything that we could measure and everything we could see. But that's... [15:25] The thing about these systems is that, you know, the... [15:29] they're not like classic computers where you just know that they always produce the same answer. They're stochastic. They're creative. So... [15:38] and they have some blind spots, they hallucinate, similarly to how modern LLMs hallucinate. So you need to just really push them and see exactly where they break. And the only way you could actually do that is by having the best humans playing against them. [15:56] Move 37, can you tell us what that was? It was such a monumental move. And [16:02] I think everyone watching it at the time, Lee Sedol maybe primarily, was confused by that move. [16:08] What was going on in your head when that happened? [16:12] So yeah, I mean, Move 37 in Game 2 against Lee Sedol was literally... [16:19] it was a spectacular moment in the sense that it kind of showcased to the world that AlphaGo has creativity. And it demonstrated that AI could come up with strategies that even top human players hadn't considered. So at first, I still remember that we thought that AlphaGo made an error, that it actually hallucinated. It did something that it didn't mean to do. But then it turned out to be a brilliant unconventional move that underscored that the system had a deep understanding of the game.
[16:49] It could think of things that people hadn't thought of before. [16:53] I want to take us to another key move in the game. I think it was in game four. [16:57] At this point, I was rooting for Lee because I was like, the poor guy needs to win a game. Move 78. [17:04] AlphaGo made a mistake, and Lee Sedol won this game. [17:07] I guess, what was the weakness there that Lee found during the game? [17:11] Yeah, exactly. So, I mean, Lee Sedol's victory in Game 4 was literally a testament to human ingenuity. Move 78 was unexpected and caught AlphaGo off guard. [17:21] Initially, AlphaGo, based on its evaluations, misinterpreted it as a mistake and thought that it was actually winning. So that's why it didn't respond appropriately. [17:31] And, you know, this kind of highlighted the blind spot in the system. So the game showed that while systems like AlphaGo are extremely powerful, at the same time, they still have vulnerabilities and there were still areas where we could further improve it. [17:45] But how do you go about improving something like that? Do you need to show it a lot more [17:49] data of that type of human ingenuity move? Or how do you go about fixing and patching those blind spots? It's actually interesting that by the end of the games with Lee Sedol, we just put together a benchmark. We were just trying to quantify and have a way of measuring the mistakes that AlphaGo makes, [18:12] and, you know, these blind spots, let's say. And then we just tried a number of approaches to improve the algorithm so that we can
[18:19] solve these issues. [18:22] And what happened is that... [18:25] actually, the most effective way of getting rid of them was to just do what we were doing, at a higher scale and better. So we changed the architecture of the model. We switched to a deep ResNet with two output heads. And we also, like... [18:44] we just had a bigger network, trained on more data, then just moved to AlphaZero and spread the algorithms, and that made it so that we didn't have any [18:53] hallucinations anymore. So in a way, we're just like... [18:57] scale, data, things that are always kind of the well-known recipe in the field of AI, are exactly what solved it in our case. [19:07] With scale and data, how much did higher quality data, or maybe specifically data from great professional players, the best professional players, make a meaningful difference? Or was it just any data? [19:18] Now, for us, what mattered was that we kind of solved it using self-play. [19:24] So we actually had access to the most [19:29] competent Go player in the world. [19:31] And we just used it to generate the best quality games, and then we just trained on these games. So I guess we didn't need to have human experts because we had an expert in-house. It wasn't human. [19:45] Huh, interesting. [19:47] Amazing. Well, I'd love to move on to the progression from AlphaGo to AlphaZero. And you talked a little bit about this notion of self-play just now. AlphaZero was powerful because it learned how to play the game from scratch, entirely from self-play without any human intervention. Can you share more about how that worked and why that was important?
[20:08] So AlphaZero was a game changer because it learned entirely from scratch through self-play without any human data. And this was a major leap from AlphaGo because AlphaGo, as I said, relied heavily on human expert games. So two things happened. First of all, AlphaZero managed to simplify the training process and also showed that AI could actually get from zero to superhuman performance [20:31] just purely by playing against itself. And that allowed it to be applicable to [20:36] a whole range of new domains that were out of reach because there wasn't enough human data for them. [20:43] I think the more important thing is that we just saw that AlphaZero also solved all the issues that AlphaGo had in terms of hallucinations, in terms of blind spots and robustness. So AlphaZero was a better method, just... [21:02] full stop. [21:03] Huh. [21:04] And you explained kind of how AlphaGo worked to a fifth grader. What would you tell the fifth grader would be the key difference technically that you've implemented with AlphaZero? [21:13] So AlphaZero, just like AlphaGo, uses a policy network and a value network along with Monte Carlo tree search. So in that respect, it's exactly the same as AlphaGo. So the key difference is in training. AlphaZero starts with random weights and learns by playing games against itself. And by playing games against itself, it iteratively improves its performance. But the main idea behind AlphaZero is that whenever you take
[21:38] a set of weights, a set of policy and value networks, and then you just combine them with search, then you just end up with a [21:45] better [21:47] player. You just increase your performance. You just become a stronger player. [21:52] So, [21:54] what that meant [21:55] is that we can actually use this mechanism to improve the model policy, the raw policy. So this is what we call in reinforcement learning a policy improvement operator. Whenever you can just take an existing policy and then do something, some magic, and then just come up with a better policy, and then you can just take this policy and distill it back to the initial policy, and then just repeat this process, then you have a reinforcement learning algorithm. [22:21] And I feel like this is exactly what people are trying to do today with Q* or synthetic data. This is exactly the idea of, how can I take a policy, [22:33] do something with it, planning, search, compute, whatever it is, and derive a better policy, which I can then imitate and just [22:42] kind of distill back to the original policy. So this is exactly what AlphaZero is doing. It uses MCTS to produce a better policy, then it takes its trajectories, it trains its policy and value network on the new better trajectories, and it repeats this process until it converges [22:59] to an expert level Go player. [23:02] That's fascinating and counterintuitive, that... [23:04] kind of like starting without the weights that you would have from...
[23:08] from professional level players is actually a better starting place. [23:13] The epitome of AI agents and games was achieved, I think, via MuZero, which is the progression even from AlphaZero itself. And it's also where you became one of the co-leads or one of the leads of the project. AlphaZero was obviously impressive because of self-play, but it also needed to be told the environment's dynamics, [23:34] or the rules of the game. [23:35] And MuZero takes us to the next level without needing to be told the rules of the game. And it mastered quite a few different games, Go, chess, and many others. [23:46] Can you share a little bit about how MuZero worked and why was this particularly meaningful? [23:54] Absolutely. So AlphaZero, as you said, was a massive success in games like chess, Go, shogi. So [24:02] in games where we actually had access to the game rules, where we actually had access to a perfect simulator of the world. But this reliance on the perfect simulator made it challenging to apply it to real world problems. And real world problems are often messy, and they lack clear rules, and it's really hard to write a perfect simulator for them. So that's exactly what MuZero tried to solve. So MuZero masters the games, of course, like Go, chess and shogi. But it also masters more visually challenging games like Atari. [24:32] And it does that without being given access to the simulator. It just learns how to build an internal simulator of the world and then uses this internal simulator in a way similar to what AlphaZero was doing. So it does that by using model-based reinforcement learning, where what that means is that you can just take a number of trajectories generated by an agent and then try and learn a model,
[24:55] learn a prediction model of how the world works. So this is actually quite similar to what methods like Sora are trying to do now, where they just take YouTube videos and they try to learn a world model by just trying to predict, [25:10] starting from one frame, what's going to happen in the future frames. [25:13] So MuZero tries to do exactly that, but it does it in a way different from, you know, generative models, in the sense that it tries to only model things that matter for solving the reinforcement learning problem. So it tries to predict what the reward is going to be in the future, what's the value of future states, what's the policy for future states. So only things that you need within your MCTS. But, you know, the fundamentals kind of remain the same. You learn a [25:43] model based on trajectories. And then once you have these models, you can just combine them with search and, you know, [25:49] get superhuman performance. So [25:52] of course, you can always decouple the two problems and have the model being trained separately from data out in the wild and then just combine that with MuZero. [26:02] And we just found that, [26:05] back then, given the limitations of our models and the smaller sizes, it kind of made more sense to just keep those two together and only have the model [26:15] predict things that matter for planning, and not try to model everything, because you were kind of [26:21] hitting the limits of what the capacity of the model could take. So interesting.
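The idea of learning a model from trajectories and then planning inside it can be sketched in a few lines. This is a heavily simplified illustration and not MuZero itself: it uses a lookup table instead of neural networks, exhaustive lookahead instead of MCTS, and it predicts only next state and reward rather than value and policy. The five-state chain world and every name in the code (`real_step`, `collect_trajectories`, `fit_model`, `plan`) are hypothetical.

```python
import random
from collections import defaultdict

N_STATES, GOAL, ACTIONS = 5, 4, (-1, +1)  # hypothetical toy chain world


def real_step(state, action):
    """The true environment. Only data collection calls this, never the planner."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)


def collect_trajectories(episodes=500, horizon=20, seed=0):
    """Gather experience with a random agent, as (state, action, reward, next_state)."""
    rng = random.Random(seed)
    data = []
    for _ in range(episodes):
        state = 0
        for _ in range(horizon):
            action = rng.choice(ACTIONS)
            nxt, reward = real_step(state, action)
            data.append((state, action, reward, nxt))
            if nxt == GOAL:
                break
            state = nxt
    return data


def fit_model(data):
    """Learn next state and average reward per (state, action) from experience."""
    next_state, reward_sum, count = {}, defaultdict(float), defaultdict(int)
    for s, a, r, s2 in data:
        next_state[(s, a)] = s2  # the toy world is deterministic
        reward_sum[(s, a)] += r
        count[(s, a)] += 1

    def model(s, a):
        if (s, a) not in next_state:
            return s, 0.0  # never observed: assume nothing happens
        return next_state[(s, a)], reward_sum[(s, a)] / count[(s, a)]

    return model


def plan(model, state, depth):
    """Imagine futures inside the learned model; return (predicted return, first action)."""
    if depth == 0:
        return 0.0, None
    best = (float("-inf"), None)
    for a in ACTIONS:
        s2, r = model(state, a)
        future, _ = plan(model, s2, depth - 1)
        if r + future > best[0]:
            best = (r + future, a)
    return best


model = fit_model(collect_trajectories())
value, action = plan(model, state=0, depth=4)
print(value, action)
```

The planner never sees the rules of the chain world, only the table fitted from experience, and it still finds that moving right from state 0 is the way to reach the reward.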
[26:26] Is it right to assume then that not only Sora takes the same approach, but maybe other world models or other robotics foundation models? [26:35] Yeah, so anything that tries to build a model of how the world works and then use that [26:40] for planning, [26:42] it's within MuZero-like methods. [26:46] So, yeah, you can train it on YouTube videos. You can train it on, like, [26:51] inputs that come from robots. You can train it on, you know, any environment. You can even think of large language models as a form of models of text. So, they model text. But the thing about text is that the model is a bit trivial. Like, you don't need to just... [27:08] There aren't many artifacts happening when you're trying to predict what the next word is going to be. [27:13] Have you seen the ideas behind MuZero kind of be used [27:17] outside gameplay or in messy real world environments? [27:22] So, yeah, I mean, so... [27:25] as I've said, AlphaZero and MuZero are quite general methods. There's a number of scientific communities, in chemistry, there's AlphaChem, in quantum computing, some people tried to use AlphaZero in optimization, where they just adopted AlphaZero because it was really powerful in [27:44] really doing planning and solving these optimization problems. At the same time, MuZero was incorporated in a version of Tesla's self-driving system. It was reported in their AI Day. And it was also used, and I think it's currently being used, within YouTube as a custom compression algorithm. But I think it's early days and
[28:09] it takes time for this new technology to be fully adopted by the industry. [28:15] We'd love to talk a little bit more about reinforcement learning and agents. You alluded earlier to the fact that reinforcement learning and deep learning back in 2015 were new, nascent ideas. [28:26] They really grew in popularity 2017, 2018, 2019 onwards. And then they were overshadowed by LLMs, largely because of GPT and everything else that came out. But now reinforcement learning is back. Why do you think that is the case? [28:44] Yeah, I mean, first of all, LLMs and multimodal models have indeed brought incredible progress to AI. So these models are exceptionally powerful and can perform some truly impressive tasks. But they have some fundamental [28:57] limitations, and one of them is the availability of human data. People just keep talking about the data wall and what happens once you run out of high quality data. And this is exactly where reinforcement learning shines. So reinforcement learning [29:12] excels because it doesn't rely solely on pre-existing human data. Instead, reinforcement learning uses experience generated by the agent itself to improve its performance. So this self-generated experience allows reinforcement learning to learn [29:25] and to even adapt to scenarios where human data is scarce or non-existent. So, [29:32] if you define the reinforcement learning problem in the right setting, in the right way, you can literally, effectively exchange compute for intelligence. You can just get to a point
[29:43] similar to where we were with AlphaZero, where the moment we threw more compute at it, like we made the network bigger, we used more games, we just literally got a better player. [29:53] And it was deterministic. You always get a better player. So I guess this is exactly where we want to be with these synthetic data pipelines. Currently, we have that with, you know, the scaling laws in LLMs, that if you have more data and bigger models, then you can predict that there's going to be an improvement in performance. But, you know, once you've run out of human data, how do you just keep going? And synthetic data is the answer to that. And the... [30:22] the only way that [30:25] you can actually get high quality data to improve your model is via some form of reinforcement learning. And just like... [30:36] I'm just keeping reinforcement learning as a really kind of blanket term here, where I just define it as anything that learns through trial and error. How do you think reinforcement learning is being brought into the LLM world? And you mentioned Q* earlier. [30:54] In a closed form game, you have a pretty clearly defined policy and value function. [31:00] How does that work in a messy kind of real world environment or the LLM world? [31:05] So, I mean, I guess, like... [31:08] there are two different types of messy real world, right? There is the... [31:12] if you try to build a controller or something, that's a really messy environment. And then there's if you operate in the digital space. So, personally, I believe that digital AGI is happening much earlier than, you know, robotics AGI. And the reason for that is exactly that you have control over the environment. And the environment is computers, the digital world. So even though it's messy and it's noisy,
[31:37] it's still contained. It's not like the [31:39] real, complex world in that sense. [31:44] Now, in terms of how you bring in reinforcement learning, so... [31:48] Reinforcement learning is... [31:50] We used to say at DeepMind that you have the problem and you have the solution. And the problem setting of reinforcement learning is: how do I take a model, how do I take a policy, and generate synthetic data, or, like, [32:03] find a way to improve this policy by interacting with the environment, via trial and error? [32:09] And this is the reinforcement learning problem setting, right? And then there's the solution space, where you have [32:15] value functions and you have reinforcement learning methods. So I think that there's a lot of inspiration to draw from classical reinforcement learning methods that were developed in the past decade. But you have to adjust them to the new world of LLMs. [32:33] So [32:34] methods like Q* try to do that by just taking the idea that if I have a policy and then I do planning, I consider possible future scenarios, and then I have a way to evaluate which one is better, [32:46] then I can just take the best ones and ask the model to imitate these better ones. And this is a way of improving the policy. So in the classic RL framework, you do that by using a policy and a value network. In the new world, you'll just do that by having a reward model or asking your model,
[33:08] your LLM, to just give you feedback on an output it gave you. [33:13] That's so interesting. You also talked a little bit about synthetic data earlier. I think some folks are very bullish on synthetic data and some folks are more skeptical. I also believe that synthetic data is more useful in some domains where outcomes and success are perhaps more deterministic. Can you share a little bit about your perspective on the role of synthetic data and how bullish you are on it? [33:33] Yeah, I mean, I think synthetic data is something that we have to [33:38] solve one way or another. So it's not about whether you're bullish or not. It's kind of [33:43] an obstacle, but you have to just find a way around it. Like, [33:48] we will run out of data. You know, there is only so much data that humans can [33:52] produce, and also it's important that these systems start taking actions, that they start learning from their own mistakes. [33:58] So we need to just find a way to make synthetic data work. Now, what people have done is that [34:06] they've tried the most, [34:08] I guess, [34:10] naive approach, where you just take the models to produce something and you try to train on that. And, of course, you know, [34:18] they've seen that there's mode collapse and it just doesn't work out of the box. But new methods never work out of the box. You just need to invest in it and take your time and really think about what's the best way of doing it. So I'm really [34:36] optimistic that we'll definitely find ways to improve these models and
[34:41] I think that actually there are a number of methods out there, [34:45] Q* and equivalents, that, you know, in the new world where people don't really share their research [34:52] breakthroughs the way they used to, are probably hidden behind some company trade secrets. [35:00] I'm going to ask about reasoning and, you know, novel scientific discoveries. Do you think that can kind of naturally come out of just scaling LLMs if you have enough data? Or do you think that the ability to reason and, you know, come up with net new ideas requires [35:15] kind of doing reinforcement learning and deeper compute at inference time? [35:21] So I think you need reinforcement learning to get better reasoning, because it's also about the distribution of data, right? Like, [35:32] you have a lot of data out in the wild on the internet. [35:35] But at the same time, you don't always have the right type of data. So you don't have the data where someone reasons and explains the reasoning in detail. You have some of it, and it's incredible that the models have actually managed [35:51] to pick it up and just imitate it. But if you want to improve on that [35:56] capability, then you need to do that [35:58] through reinforcement learning. You need to show the model how this kind of emerging capability can be further improved by having it generate the data, interact with the environment, and, you know, just telling it when it's doing something right and when it's not doing something right.
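The improvement loop described in this part of the conversation (sample several candidate outputs from a policy, score them with some evaluator, and train the policy to imitate the best one) can be illustrated with a toy sketch. Nothing below comes from the episode or from any real system: the five-action categorical policy, the hand-written `reward` function standing in for a reward model, and the names `sample`, `improve`, `n_candidates` and `lr` are all assumptions chosen for illustration.

```python
import random

# Hypothetical stand-in for a reward model: one fixed score per action.
REWARDS = [0.1, 0.5, 0.2, 0.9, 0.3]


def reward(action):
    """Score an action; in the LLM setting this role would be played by a reward model."""
    return REWARDS[action]


def sample(policy, rng):
    """Draw one action index from a categorical policy (a list of probabilities)."""
    return rng.choices(range(len(policy)), weights=policy, k=1)[0]


def improve(policy, reward_fn, rng, n_candidates=8, lr=0.05, steps=500):
    """Toy best-of-N imitation loop.

    Each step: sample n_candidates actions from the current policy, keep the
    one with the highest reward, and shift probability mass toward it.
    """
    policy = list(policy)
    for _ in range(steps):
        candidates = [sample(policy, rng) for _ in range(n_candidates)]
        best = max(candidates, key=reward_fn)
        # Imitation step: mix the policy with a one-hot target on the best
        # candidate. The probabilities still sum to 1 after the update.
        policy = [
            (1 - lr) * p + (lr if i == best else 0.0)
            for i, p in enumerate(policy)
        ]
    return policy


if __name__ == "__main__":
    rng = random.Random(0)
    start = [0.2] * 5  # uniform policy over five actions
    final = improve(start, reward, rng)
    print([round(p, 3) for p in final])
```

In this sketch the only learning signal is the evaluator's ranking of the policy's own samples, so no human-written examples are needed. Raising `n_candidates` or `steps`, which means spending more compute, tends to make the policy concentrate on the highest-reward action sooner.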
[36:14] So yeah, I think that reinforcement learning is definitely part of the answer for that. AlphaGo, AlphaZero and MuZero are the most powerful agents we've ever built. Can you share a little bit about how some of the lessons and learnings unlocked from that are relevant to how we're pursuing building AI agents today? [36:31] Yeah, so I think AlphaGo and MuZero, you know... [36:36] They've actually fundamentally transformed our approach to AI agents because they highlight the [36:42] importance of planning and scale, in my opinion. If you actually look at the charts of different models and how they scale, you can see that AlphaGo and AlphaZero were kind of really ahead of their time. Like, they were kind of outliers. You had these curves of how compute scales, and then you have AlphaZero or AlphaGo [37:02] somewhere standing on its own. So it shows that if you can scale and you can really push on that, then you can get incredible, incredible results. [37:10] At the same time, you know, it also showed that you don't have to only train. You can also, you know, have better performance during inference, during test, during evaluation, by just using planning. And I think that this is something that we'll start seeing more and more in the near future, where these methods will start thinking more, planning more, before they make any decisions. [37:33] So I'd say that this is more of the heritage of AlphaGo and AlphaZero and MuZero: [37:40] the basic principles. And the basic principles are that scale matters, planning matters,
[37:47] and these methods can [37:49] really solve problems that we thought [37:53] were insanely complex or, you know, beyond what we can solve on our own. [37:57] Um, [37:58] problems similar to the ones that you actually observe today with these large language models are things that we saw back then. Back in 2016, we actually saw that these models can hallucinate, and that at the same time they're also creative, [38:11] that they will just come up with solutions that we hadn't thought of. But they can also have blind spots or hallucinate or be susceptible to adversarial attacks, which I guess everyone knows now that these neural networks suffer from. [38:27] I think that these are the [38:29] main kind of [38:32] lessons drawn from this line of work. [38:35] Thank you. [38:37] What do you think are the biggest open questions from this line of work for the field to answer going forward? [38:42] So the main question [38:44] is: [38:45] we had AlphaGo and MuZero, and we managed to have these insanely robust [38:49] and reliable systems that will just always play Go at [38:54] the highest possible level, and they'll just [38:59] consistently be top of the leaderboard; they'll just never lose again. So AlphaGo Master actually played against 60 people [39:08] in online matches and just [39:11] won every single one of them. So these methods are incredibly robust and reliable. And I think this is exactly what we're missing now with these LLM-based agents. Sometimes they get it. Sometimes they don't. You cannot trust them.
[39:27] They will just, you know... You have some amazing demos, but, you know, they happen once every two times, even, or once every 10 times you have something amazing, and the remaining nine, they just [39:38] lost their way and didn't do anything. So I feel like what we need to do is find a way to make these LLM-based agents as robust as the ones that we had with AlphaGo and MuZero and AlphaZero. [39:50] This is the new open question: how do you actually do that? [39:54] We'd love to move into some of your thoughts on the broader ecosystem today. You've touched on a few really core problems that people are working on right now. One, the data wall problem that will hit eventually, perhaps by 2028 or so, [40:10] as some folks predict. Another being the idea of planning as an area that AI agents need to get better at. [40:20] And then, you know, a third idea that you just described was around robustness and reliability. [40:25] Can you share a little bit about some of these areas that you think the whole field needs to solve differently, [40:31] that you are [40:34] most excited about, to help us unlock this vision of really getting to the AI agents that we want? [40:40] Yeah, I mean, I'll also add another one to the list. [40:44] So I feel like [40:46] another major challenge is how do we improve the in-context learning capabilities of these models? How do we make sure that these systems can learn
[40:57] on the fly, and how they can adapt to new contexts quickly. So this is another thing that I think is going to be really important. It's going to happen in the next, uh, [41:07] couple of years, actually. So... - Ioannis, what's the term that you used for that? - In-context learning? - In-context learning. - In-context learning, yeah. So it's the idea that [41:17] a system can actually learn how to do a new task with few-shot prompting. It kind of sees a few examples and, on the fly, it learns how to adapt to the new environment. It learns how to use [41:33] the new tools that were provided to it. It kind of learns. It's not just all the knowledge it has stored in its weights, but it's also [41:43] acquiring new knowledge by interacting with the real world, interacting with the environment. [41:48] So I think that this is another [41:52] place where [41:54] there is a lot of work happening at the moment, and [41:57] we're going to have amazing progress in the next couple of years. [42:02] And I'm really excited about that. So, yeah, I mean, to recap, I think planning is important, [42:09] you know, in-context learning [42:10] is important, and reliability. [42:14] The best way to achieve reliability is to ensure that these models somehow know how to recover from their mistakes. So if they made a mistake somewhere, they can see that, and they're like, okay, you know, I made a mistake, I'll just correct for it, the way that humans, you know, make mistakes all the time, but like we don't.
[42:32] You can correct for them. [42:34] So, [42:35] so these are the three areas which I'm really excited to see progress on. [42:41] Now that you've embarked on your own entrepreneurial journey, how do you think about the areas where startups can compete [42:48] against the big research labs, and, like, [42:51] how do you kind of motivate yourself for that journey? [42:55] Yeah, I mean, it's a new world for me, but at the same time, it's not that new, because when I joined DeepMind, it was literally a startup. So... [43:05] And I was literally one of the first few employees. So I actually saw that firsthand. [43:11] But, you know, one of the benefits of working for a startup is, you know, [43:15] the agility and the focus. So everyone really cares. Everyone just moves really fast. [43:20] And there's a clear focus on what we want to build. [43:24] The building is what's the most important kind of motivation for people, just building. [43:30] And I think that this is one of the big advantages that startups have over more established businesses. At the same time, it's easier to just adapt quickly, to adapt to new findings and technologies. [43:43] You're not kind of tied to, [43:45] you know, some [43:46] pre-existing solutions, or some products that [43:50] you don't want to deprecate because they bring a lot of revenue to you. [43:55] Well, if you're a startup, you know, you have no such chains. You could just move fast and, you know, be innovative and, uh,
[44:02] you know, [44:03] break [44:05] conventions. Um, [44:06] and at the same time, it just allows you to leverage open-source resources, things that are out of reach for the big labs. And yeah, you don't have the red tape that big places tend to have. [44:18] I love the term that you use sometimes, Ioannis: main quest versus side quest. Yeah, it's the idea of having a main focus. In big places, in big labs, they have many different projects that people are working on. And it usually happens that they have the main [44:35] quest, the main, [44:36] you know, [44:37] thing that everyone's working on. And there are multiple smaller side quests that... The idea is for them to feed into the bigger quest. But usually they don't get as much... [44:48] They don't get as many resources or as much focus from the leadership. So, yeah, they... [44:55] they tend to... [44:57] In the broader field, what are some of the most defining projects that you admire the most? And maybe who are some of the most influential researchers that you admire the most? [45:08] Yeah, absolutely. [45:10] So I actually started my AI research journey back in 2012, and I've actually seen some milestones. So I'll just give a list of [45:21] what I think are the main milestones in AI in the past 12 years that I've been around. So the first one, I'll say, is AlexNet. This is the first paper that kind of showed that deep learning is the...
[45:35] is the answer. I mean, back then, it didn't feel like it. It just felt like, you know, a [45:39] kind of curiosity. But now I think that most people are convinced that deep learning is part of the answer. Then it was DQN. I had the pleasure to actually work on DQN and see firsthand how it started. It was actually developed by a friend of mine, Vlad Mnih. [45:57] And it was the first system that showed that you can actually combine deep learning with reinforcement learning to achieve superhuman performance in really complex environments. Then there was AlphaGo. Again, I was really [46:13] lucky to work on that. And it showed that scale and planning are really important ingredients, and if you just do that right, then you get [46:24] huge success in an incredibly complex environment. [46:29] AlphaFold, another one. This is again by DeepMind. It showed that these methods are not just [46:35] things that you can use to solve games, but they [46:40] actually will make this world a better place. They will ensure that healthcare is improved, that scientific discoveries are being realized, that we'll make sure this world is a better place by using AI. [46:54] Then [46:55] ChatGPT. It kind of brought AI to [47:00] everyone, just made it accessible to a broad audience. Everyone knows what AI is now. It's made my life of explaining my job much easier.
[47:12] So, um... [47:13] And finally, GPT-4. And I think that, yeah, GPT-4 is probably the latest kind of peak [47:20] advancement in AI, because it kind of showed that, [47:25] you know, artificial general intelligence is a matter of years. It's within reach. [47:32] Yeah, we are getting there. [47:35] I think that most people now believe that [47:38] we're, like, [47:41] a few years away from AGI. And that's because of the incredible breakthrough that GPT-4 was. Now, in terms of some people I really admire, before I forget: I'd say first David Silver. He was my PhD supervisor. He was my mentor at DeepMind. [47:59] He's an incredible researcher. He led AlphaGo and AlphaZero, and, [48:06] you know, he... [48:07] He has an unyielding dedication to the field of reinforcement learning, and he's probably one of the smartest people, or maybe the smartest person I know, and... [48:17] Amazing guy. Um, [48:18] an [48:20] amazing reinforcement learning engineer. And the second one I'd say is Ilya Sutskever. [48:26] He was a co-founder of OpenAI. I had the opportunity to work with him just a little bit in the really early days of AlphaGo. But I think his commitment to scaling AI methods and pushing the boundaries of what these systems can achieve is remarkable. [48:42] And, you know, he got...
[48:44] he made sure that GPT-3 and GPT-4 happened. So I have, [48:49] yeah, immense respect towards him. Thank you for sharing that. Let's close out with some rapid-fire questions. Maybe first, what do you think will be the next big milestone in AI, let's say in the next one, five and ten years? [49:03] So I feel like in the next five to 10 years, the world will be a different place. I actually really believe that. I think that in the next few years we'll see [49:13] models becoming powerful and reliable agents that can actually independently execute tasks. And I think that AI agents will be massively adopted across industries, [49:22] especially in science and healthcare. So in that sense, I'm really excited about what's coming in AI. [49:30] And what I'm most excited about is [49:34] AI agents: systems that can actually do tasks for you. [49:39] And, you know, this is exactly what we're building at Reflection. [49:42] In what year do you think we'll pass the 50% threshold on SWE-bench? [49:47] So I think we are one to three years away from the 50 percent threshold for SWE agents and three to five years from achieving 90 percent. [49:55] The reason is, [49:56] while progress is amazing, I think we still need reliable agents to hit these milestones. [50:03] When it comes to research, it's hard to make precise predictions. [50:07] When do you think we'll hit the data wall for scaling LLMs? And do you think all the research in RL is mature enough to keep up our slope of progress? Or do you think there will be a bit of a lull there?
[50:17] As we try to figure out what happens when we hit the wall. [50:20] So I feel like the wall... [50:24] Based on what I've read, I think we have at least one more year for text [50:29] just before we hit the wall. [50:31] And then we have these extra modalities, which might actually buy us maybe a year extra. [50:37] And [50:38] I think we are in a really good place to just start [50:42] using synthetic data. [50:44] So in the next two years, we'll just figure out this data problem. So I think that [50:49] we won't really hit the wall. It's just like we hit the wall, but no one realized it because we have new methods in place. [50:55] Do you think LLMs will have their AlphaGo moment? [50:58] And if so, when? [51:00] I feel like LLMs had their AlphaGo moment with the initial release of ChatGPT, where they showed their power and the progress made over the past decade. I feel like what they haven't had yet is their AlphaZero moment. And that's the moment where more compute directly translates to increased intelligence without human intervention. And I feel like this breakthrough is still on the horizon. [51:23] When do you think that will happen? [51:25] I think it's going to happen in the next five years. [51:27] Mm-hmm. [51:28] Wow. [51:30] Amazing. Ioannis, thank you so much for joining us and taking us through the awesome history of AlphaGo, AlphaZero, MuZero, your own journey through DeepMind, and then many of the core research problems [51:40] that the whole industry is tackling today around data and building for reliability and robustness and planning and in-context learning.
[51:49] And we're really excited for the future that you're helping us build [51:53] and that you're pushing forward in the field as well. So thank you so much, Ioannis. Thank you so much for having me.