tldr: We’re at AI’s halftime.
For decades, AI has largely been about developing new training methods and models. And it worked: from beating world champions at chess and Go, to surpassing most humans on the SAT and bar exams, to earning IMO and IOI gold medals. Behind these history-book milestones — DeepBlue, AlphaGo, GPT-4, and the o-series — are fundamental innovations in AI methods: search, deep RL, scaling, and reasoning. Things just kept getting better over time.
So what’s suddenly different now?
In three words: RL finally works. More precisely: RL finally generalizes. After several major detours and a culmination of milestones, we've landed on a working recipe that solves a wide range of RL tasks using language and reasoning. Even a year ago, if you had told most AI researchers that a single recipe could tackle software engineering, creative writing, IMO-level math, mouse-and-keyboard manipulation, and long-form question answering — they'd have laughed at your hallucinations. Each of these tasks is incredibly difficult, and many researchers spend their entire PhDs focused on just one narrow slice.
Yet it happened.
So what comes next? The second half of AI — starting now — will shift focus from solving problems to defining problems. In this new era, evaluation becomes more important than training. Instead of just asking, "Can we train a model to solve X?", we're asking, "What should we be training AI to do, and how do we measure real progress?" To thrive in this second half, we'll need a timely shift in mindset and skill set, one perhaps closer to that of a product manager.
The first half
To make sense of the first half, look at its winners. What do you consider to be the most impactful AI papers so far?
I tried the quiz in Stanford 224N, and the answers were not surprising: Transformer, AlexNet, GPT-3, etc. What do these papers have in common? They proposed fundamental breakthroughs for training better models. But they also managed to get published by showing some (significant) improvements on some benchmarks.
There is a latent commonality, though: these "winners" are all training methods or models, not benchmarks or tasks. Even ImageNet, arguably the most impactful benchmark of all, has less than one third of AlexNet's citations. The method-versus-benchmark contrast is even more drastic elsewhere: for example, the main benchmark of Transformer was WMT'14, whose workshop report has ~1,300 citations, while the Transformer paper has over 160,000.