{"id":48,"date":"2025-05-04T14:16:03","date_gmt":"2025-05-04T06:16:03","guid":{"rendered":"http:\/\/120.55.12.129\/?p=48"},"modified":"2025-05-05T16:22:13","modified_gmt":"2025-05-05T08:22:13","slug":"%e4%ba%ba%e5%b7%a5%e6%99%ba%e8%83%bd%e7%9a%84%e4%b8%ad%e5%9c%ba%e9%98%b6%e6%ae%b5%ef%bc%9a%e4%bb%8e%e8%a7%a3%e5%86%b3%e9%97%ae%e9%a2%98%e5%88%b0%e5%ae%9a%e4%b9%89%e9%97%ae%e9%a2%98%ef%bc%88we","status":"publish","type":"post","link":"https:\/\/www.agidt.com\/?p=48","title":{"rendered":"\u4eba\u5de5\u667a\u80fd\u7684\u4e2d\u573a\u9636\u6bb5\uff1a\u4ece\u89e3\u51b3\u95ee\u9898\u5230\u5b9a\u4e49\u95ee\u9898\uff08We\u2019re at AI\u2019s halftime.\uff09"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"569\" src=\"http:\/\/120.55.12.129\/wp-content\/uploads\/2025\/05\/2323213-1024x569.png\" alt=\"\" class=\"wp-image-58\" srcset=\"https:\/\/www.agidt.com\/wp-content\/uploads\/2025\/05\/2323213-1024x569.png 1024w, https:\/\/www.agidt.com\/wp-content\/uploads\/2025\/05\/2323213-300x167.png 300w, https:\/\/www.agidt.com\/wp-content\/uploads\/2025\/05\/2323213-768x427.png 768w, https:\/\/www.agidt.com\/wp-content\/uploads\/2025\/05\/2323213-1536x853.png 1536w, https:\/\/www.agidt.com\/wp-content\/uploads\/2025\/05\/2323213-2048x1138.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>tldr: We\u2019re at AI\u2019s halftime.<\/p>\n\n\n\n<p>For decades, AI has largely been about developing new training methods and models. And it worked: from beating world champions at chess and Go, surpassing most humans on the SAT and bar exams, to earning IMO and IOI gold medals. Behind these milestones in the history book \u2014 DeepBlue, AlphaGo, GPT-4, and the o-series \u2014 are fundamental innovations in AI methods: search, deep RL, scaling, and reasoning. 
Things just get better over time.<\/p>\n\n\n\n<p>So what\u2019s suddenly different now?<\/p>\n\n\n\n<p>In three words: RL finally works. More precisely: RL finally generalizes. After several major detours and an accumulation of milestones, we\u2019ve landed on a working recipe to solve a wide range of RL tasks using language and reasoning. Even a year ago, if you told most AI researchers that a single recipe could tackle software engineering, creative writing, IMO-level math, mouse-and-keyboard manipulation, and long-form question answering \u2014 they\u2019d laugh at your hallucinations. Each of these tasks is incredibly difficult, and many researchers spend their entire PhDs focused on just one narrow slice.<\/p>\n\n\n\n<p>Yet it happened.<\/p>\n\n\n\n<p>So what comes next? The second half of AI \u2014 starting now \u2014 will shift focus from solving problems to defining problems. In this new era, evaluation becomes more important than training. Instead of just asking, \u201cCan we train a model to solve X?\u201d, we\u2019re asking, \u201cWhat should we be training AI to do, and how do we measure real progress?\u201d To thrive in this second half, we\u2019ll need a timely shift in mindset and skill set, one perhaps closer to that of a product manager.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"the-first-half\">The first half<\/h2>\n\n\n\n<p>To make sense of the first half, look at its winners. What do you consider to be the most impactful AI papers so far?<\/p>\n\n\n\n<p>I tried this quiz in Stanford 224N, and the answers were not surprising: Transformer, AlexNet, GPT-3, etc. What\u2019s common about these papers? They propose fundamental breakthroughs to train better models. But they also got published by showing (significant) improvements on some benchmarks.<\/p>\n\n\n\n<p>There is a latent commonality though: these \u201cwinners\u201d are all training methods or models, not benchmarks or tasks. 
Even arguably the most impactful benchmark of all, ImageNet, has fewer than one third the citations of AlexNet. The contrast between methods and benchmarks is even more drastic elsewhere \u2014 for example, the main benchmark of the Transformer is WMT\u201914, whose workshop report has ~1,300 citations, while the Transformer has &gt;160,000.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/ysymyth.github.io\/images\/second_half\/first_half.png\" alt=\"\"\/><\/figure>\n\n\n\n<p>That illustrates the game of the first half: focus on building new models and methods, with evaluation and benchmarks as secondary (although necessary to make the paper system work).<\/p>\n\n\n\n<p>Why? A big reason is that, in the first half of AI, methods were harder and more exciting than tasks. Creating a new algorithm or model architecture from scratch \u2013 think of breakthroughs like the backpropagation algorithm, convolutional networks (AlexNet), or the Transformer used in GPT-3 \u2013 required remarkable insight and engineering. In contrast, defining tasks for AI often felt more straightforward: we simply took tasks humans already do (like translation, image recognition, or chess) and turned them into benchmarks. Not much insight, or even engineering, was required.<\/p>\n\n\n\n<p>Methods also tended to be more general and widely applicable than individual tasks, making them especially valuable. For example, the Transformer architecture ended up powering progress in CV, NLP, RL, and many other domains \u2013 far beyond the single dataset (WMT\u201914 translation) where it first proved itself. A great new method can hillclimb many different benchmarks because it\u2019s simple and general, so its impact tends to go beyond an individual task.<\/p>\n\n\n\n<p>This game has worked for decades and sparked world-changing ideas and breakthroughs, which manifested themselves in ever-increasing benchmark performances across domains. Why would the game change at all? 
Because the accumulation of these ideas and breakthroughs has made a qualitative difference: together they form a working recipe for solving tasks.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"the-recipe\">The recipe<\/h2>\n\n\n\n<p>What\u2019s the recipe? Its ingredients, not surprisingly, include massive language pre-training, scale (in data and compute), and the idea of reasoning and acting. These might sound like buzzwords you hear daily in SF, but why call them a recipe?<\/p>\n\n\n\n<p>We can understand this by looking through the lens of reinforcement learning (RL), which is often thought of as the \u201cend game\u201d of AI \u2014 after all, RL is theoretically guaranteed to win games, and empirically it\u2019s hard to imagine any superhuman system (e.g. AlphaGo) without RL.<\/p>\n\n\n\n<p>In RL, there are three key components:&nbsp;<strong>algorithm, environment, and priors<\/strong>. For a long time, RL researchers focused mostly on the algorithm (e.g. REINFORCE, DQN, TD-learning, actor-critic, PPO, TRPO\u2026) \u2013 the intellectual core of how an agent learns \u2013 while treating the environment and priors as fixed or minimal. For example, Sutton and Barto\u2019s classic textbook is all about algorithms and says almost nothing about environments or priors.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/ysymyth.github.io\/images\/second_half\/rl_book.png\" alt=\"\"\/><\/figure>\n\n\n\n<p>However, in the era of deep RL, it became clear that environments matter a lot empirically: an algorithm\u2019s performance is often highly specific to the environment it was developed and tested in. If you ignore the environment, you risk building an \u201coptimal\u201d algorithm that only excels in toy settings. So why don\u2019t we first figure out the environment we actually want to solve, then find the algorithm best suited for it?<\/p>\n\n\n\n<p>That was exactly OpenAI\u2019s initial plan. 
It built&nbsp;<a rel=\"nofollow\" href=\"https:\/\/openai.com\/index\/openai-gym-beta\/\">Gym<\/a>, a standard interface to RL environments for various games, then the&nbsp;<a rel=\"nofollow\" href=\"https:\/\/openai.com\/index\/universe\/\">World of Bits and Universe projects<\/a>, which tried to turn the Internet or the computer into a game. A good plan, isn\u2019t it? Once we turn every digital world into an environment and solve it with smart RL algorithms, we have digital AGI.<\/p>\n\n\n\n<p>A good plan, but it didn\u2019t entirely work. OpenAI made tremendous progress down the path, using RL to solve&nbsp;<a rel=\"nofollow\" href=\"https:\/\/openai.com\/index\/openai-five-defeats-dota-2-world-champions\/\">Dota<\/a>,&nbsp;<a rel=\"nofollow\" href=\"https:\/\/openai.com\/index\/solving-rubiks-cube\/\">robotic hands<\/a>, etc. But it never came close to solving computer use or web navigation, and RL agents that worked in one domain did not transfer to another. Something was missing.<\/p>\n\n\n\n<p>Only after GPT-2 and GPT-3 did it turn out that the missing piece is priors. You need powerful language pre-training to distill general commonsense and language knowledge into models, which can then be fine-tuned to become web (WebGPT) or chat (ChatGPT) agents (and change the world).&nbsp;<strong>It turned out the most important part of RL might not even be the RL algorithm or environment, but the priors, which can be obtained in a way totally unrelated to RL.<\/strong><\/p>\n\n\n\n<p>Language pre-training created good priors for chatting, but not equally good ones for controlling computers or playing video games. Why? These domains are further from the distribution of Internet text, and naively doing SFT \/ RL on them generalizes poorly. 
I noticed the problem in 2019, when GPT-2 had just come out and I did SFT \/ RL on top of it to solve text-based games &#8211;&nbsp;<a rel=\"nofollow\" href=\"https:\/\/arxiv.org\/abs\/2010.02903\">CALM<\/a>&nbsp;was the first agent in the world built via pre-trained language models. But it took millions of RL steps for the agent to hillclimb a single game, and it didn\u2019t transfer to new games. Though that is exactly the characteristic behavior of RL and nothing strange to RL researchers, I found it weird, because we humans can easily play a new game and be significantly better zero-shot. Then I hit one of the first eureka moments of my life &#8211; we generalize because we can choose to do more than \u201cgo to cabinet 2\u201d or \u201copen chest 3 with key 1\u201d or \u201ckill dungeon with sword\u201d; we can also choose to think about things like \u201cThe dungeon is dangerous and I need a weapon to fight it. There is no visible weapon, so maybe I need to find one in locked boxes or chests. Chest 3 is in cabinet 2; let me first go there and unlock it\u201d.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/ysymyth.github.io\/images\/second_half\/reasoning.png\" alt=\"\"\/><\/figure>\n\n\n\n<p>Thinking, or reasoning, is a&nbsp;<strong>strange<\/strong>&nbsp;kind of action &#8211; it does not directly affect the external world, yet the space of reasoning is open-ended and combinatorially infinite \u2014 you can think about a word, a sentence, a whole passage, or 10,000 random English words, but the world around you doesn\u2019t immediately change. In classical RL theory, this is a terrible deal that makes decision-making impossible. Imagine you need to choose one of two boxes, where one holds $1M and the other is empty: your expected earnings are $500k. Now imagine I add infinitely many empty boxes. Your expected earnings drop to nothing. 
But by adding reasoning into the action space of any RL environment, we make use of the language pre-training priors to generalize, and we gain flexible test-time compute for different decisions. It is a really&nbsp;<strong>magical<\/strong>&nbsp;thing, and I apologize for not fully making sense of it here; I might need to write another blog post just for it. You\u2019re welcome to read&nbsp;<a rel=\"nofollow\" href=\"https:\/\/arxiv.org\/abs\/2210.03629\">ReAct<\/a>&nbsp;for the original story of reasoning for agents and my vibes at the time. For now, my intuitive explanation is: even though you add infinite empty boxes, you have seen them throughout your life in all kinds of games, and choosing these boxes prepares you to better choose the box with money in any given game. My abstract explanation would be:&nbsp;<strong>language generalizes through reasoning in agents<\/strong>.<\/p>\n\n\n\n<p>Once we have the right RL priors (language pre-training) and the right RL environment (adding language reasoning as actions), it turns out the RL algorithm might be the most trivial part. Thus we have the o-series, R1, deep research, computer-using agents, and so much more to come. What an ironic turn of events! For so long RL researchers cared about algorithms far more than environments, and no one paid any attention to priors \u2014 all RL experiments essentially started from scratch. It took us decades of detours to realize that maybe our prioritization should have been completely reversed.<\/p>\n\n\n\n<p>But as Steve Jobs said: you can\u2019t connect the dots looking forward; you can only connect them looking backward.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"the-second-half\">The second half<\/h2>\n\n\n\n<p>This recipe is completely changing the game. 
To recap the game of the first half:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>We develop novel training methods or models that hillclimb benchmarks.<\/li>\n\n\n\n<li>We create harder benchmarks and continue the loop.<\/li>\n<\/ul>\n\n\n\n<p>This game is being ruined because:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The recipe has essentially standardized and industrialized benchmark hillclimbing without requiring many new ideas. As the recipe scales and generalizes well, your novel method for a particular task might improve it by 5%, while the next o-series model improves it by 30% without explicitly targeting it.<\/li>\n\n\n\n<li>Even if we create harder benchmarks, pretty soon (and increasingly soon) they get solved by the recipe. My colleague Jason Wei made a beautiful figure to visualize the trend:<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/ysymyth.github.io\/images\/second_half\/progress.jpeg\" alt=\"\"\/><\/figure>\n\n\n\n<p>Then what\u2019s left to play in the second half? If novel methods are no longer needed and harder benchmarks will just get solved increasingly soon, what should we do?<\/p>\n\n\n\n<p>I think&nbsp;<strong>we should fundamentally re-think evaluation<\/strong>. This means not just creating new and harder benchmarks, but fundamentally questioning existing evaluation&nbsp;<strong>setups<\/strong>&nbsp;and creating new ones, so that we are forced to invent new methods beyond the working recipe. It is hard because humans have inertia and seldom question basic assumptions &#8211; you just take them for granted without realizing they are assumptions, not laws.<\/p>\n\n\n\n<p>To explain inertia, suppose you invented&nbsp;<a rel=\"nofollow\" href=\"https:\/\/arxiv.org\/abs\/2009.03300\">one of the most successful evals in history, based on human exams<\/a>. It was an extremely bold idea in 2021, but 3 years later it\u2019s saturated. What would you do? 
Most likely create&nbsp;<a rel=\"nofollow\" href=\"https:\/\/agi.safe.ai\/\">a much harder exam<\/a>. Or suppose you solved&nbsp;<a rel=\"nofollow\" href=\"https:\/\/arxiv.org\/pdf\/2107.03374\">simple coding tasks<\/a>. What would you do? Most likely find&nbsp;<a rel=\"nofollow\" href=\"https:\/\/arxiv.org\/pdf\/2502.06807v1\">harder coding tasks<\/a>&nbsp;to solve, until you have reached IOI gold level.<\/p>\n\n\n\n<p>Inertia is natural, but here is the problem. AI has beaten world champions at chess and Go, surpassed most humans on the SAT and bar exams, and reached gold-medal level at the IOI and IMO. But the world hasn\u2019t changed much, at least as judged by economics and GDP.<\/p>\n\n\n\n<p>I call this the&nbsp;<strong>utility problem<\/strong>, and deem it the most important problem for AI.<\/p>\n\n\n\n<p>Perhaps we will solve the utility problem pretty soon, perhaps not. Either way, the root cause of this problem might be deceptively simple:&nbsp;<strong>our evaluation setups are different from real-world setups in many basic ways<\/strong>. To name two examples:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Evaluation \u201cshould\u201d run automatically<\/strong>, so typically an agent receives a task input, acts autonomously, then receives a task reward. But in reality, an agent has to engage with a human throughout the task \u2014 you don\u2019t just text customer service a super long message, wait for 10 minutes, then expect a detailed response to settle everything. 
By questioning this setup, new benchmarks have been invented that put either real humans (e.g.&nbsp;<a rel=\"nofollow\" href=\"https:\/\/lmarena.ai\/\">Chatbot Arena<\/a>) or simulated users (e.g.&nbsp;<a rel=\"nofollow\" href=\"https:\/\/arxiv.org\/abs\/2406.12045\">tau-bench<\/a>) in the loop.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/ysymyth.github.io\/images\/second_half\/tau.png\" alt=\"\"\/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Evaluation \u201cshould\u201d run i.i.d.<\/strong>&nbsp;If you have a test set with 500 tasks, you run each task independently, average the task metrics, and get an overall metric. But in reality, you solve tasks sequentially rather than in parallel. A Google SWE solves google3 issues better and better as she gets more familiar with the repo, but a SWE agent solves many issues in the same repo without gaining any such familiarity. We obviously need long-term memory methods (and&nbsp;<a rel=\"nofollow\" href=\"https:\/\/arxiv.org\/pdf\/2409.07429\">there<\/a>&nbsp;<a rel=\"nofollow\" href=\"https:\/\/yitaoliu17.com\/assets\/pdf\/ICLR_2025_CER.pdf\">are<\/a>), but academia has neither the proper benchmarks to justify the need, nor the courage to question the i.i.d. assumption that has been the foundation of machine learning.<\/li>\n<\/ul>\n\n\n\n<p>These assumptions have \u201calways\u201d been like this, and developing benchmarks under these assumptions was fine in the first half of AI, because&nbsp;<strong>when intelligence is low, improving intelligence generally improves utility<\/strong>. But now, the general recipe is guaranteed to work under these assumptions. So the way to play the new game of the second half is:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>We develop novel evaluation setups or tasks for real-world utility.<\/li>\n\n\n\n<li>We solve them with the recipe or augment the recipe with novel components. 
Continue the loop.<\/li>\n<\/ul>\n\n\n\n<p>This game is hard because it is unfamiliar. But it is exciting. While players in the first half solve video games and exams, players in the second half get to build billion or trillion dollar companies by building useful products out of intelligence. While the first half is filled with incremental methods and models, the second half filters them to some degree. The general recipe would just crush your incremental methods, unless you create new assumptions that break the recipe. Then you get to do truly game-changing research.<\/p>\n\n\n\n<p>Welcome to the second half!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"acknowledgements\">Acknowledgements<\/h2>\n\n\n\n<p>This blog post is based on my talk given at Stanford 224N and Columbia. I used OpenAI deep research to read my slides and write a draft.<\/p>\n\n\n\n<p>Written on April 10, 2025<\/p>\n","protected":false},"excerpt":{"rendered":"<p>tldr: We\u2019re at AI\u2019s halftime. For decades, AI has largely been about developing new training methods and models. 
And it &#8230;<\/p>\n","protected":false},"author":1,"featured_media":58,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[],"topic":[],"class_list":["post-48","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-agi-salon"],"_links":{"self":[{"href":"https:\/\/www.agidt.com\/index.php?rest_route=\/wp\/v2\/posts\/48","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.agidt.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.agidt.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.agidt.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.agidt.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=48"}],"version-history":[{"count":3,"href":"https:\/\/www.agidt.com\/index.php?rest_route=\/wp\/v2\/posts\/48\/revisions"}],"predecessor-version":[{"id":59,"href":"https:\/\/www.agidt.com\/index.php?rest_route=\/wp\/v2\/posts\/48\/revisions\/59"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.agidt.com\/index.php?rest_route=\/wp\/v2\/media\/58"}],"wp:attachment":[{"href":"https:\/\/www.agidt.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=48"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.agidt.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=48"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.agidt.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=48"},{"taxonomy":"topic","embeddable":true,"href":"https:\/\/www.agidt.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftopic&post=48"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}