Notes on Harness Engineering: The Third Paradigm Shift in AI Engineering

One-Sentence Summary

Harness Engineering = building an "office" for the agent, rather than continuing to polish the "wording of the email".

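The three generations do not replace one another; each one extends the level of abstraction a layer further outward. The closer a task is to real production, the more the latter two matter.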

This is the real-world summary from Ryan Lopopolo of the OpenAI Codex team, written after building 1 million lines of code and 1,500 PRs with GPT-5 agents.

Experimental data:

Cost difference is irrelevant. Capability difference is everything.

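We spend a great deal of time on prompt templates, few-shot examples, and role-play. But what really decides success or failure is **the design of the agent's runtime environment**.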

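Anyone who has built agents has had the same experience: the demo dazzles, then production falls apart. The reason is that a demo is "a single interaction under perfect conditions", while production is "continuous operation in a real environment".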

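While working on "Self-Driving Codebases", the Cursor team found that once a model is sufficiently capable, it wastes large numbers of tokens on dead ends and nonsense solutions. A well-designed Harness provides clear boundaries and a limited set of high-quality tools, which forces the agent to converge on the correct answer faster.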
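The idea of clear boundaries plus a small set of high-quality tools can be sketched roughly as follows. This is a minimal illustration and not Cursor's implementation: the `Harness` class, the tool names, and the call budget are all hypothetical.

```python
from typing import Callable, Dict


class ToolNotAllowed(Exception):
    """Raised when the agent asks for a tool outside the allowlist."""


class BudgetExceeded(Exception):
    """Raised when the agent has used up its tool-call budget."""


class Harness:
    """Hypothetical harness: a fixed tool allowlist plus a hard call budget."""

    def __init__(self, tools: Dict[str, Callable[[str], str]], max_calls: int) -> None:
        self._tools = dict(tools)
        self._remaining = max_calls

    def call(self, name: str, arg: str) -> str:
        # Boundary 1: only tools on the allowlist can be invoked.
        if name not in self._tools:
            raise ToolNotAllowed(f"unknown tool {name!r}; allowed: {sorted(self._tools)}")
        # Boundary 2: a finite budget stops endless wandering down dead ends.
        if self._remaining <= 0:
            raise BudgetExceeded("tool-call budget exhausted")
        self._remaining -= 1
        return self._tools[name](arg)


# Stand-in tools for illustration only; real tools would touch the repo.
harness = Harness(
    tools={
        "read_file": lambda path: f"<contents of {path}>",
        "run_tests": lambda target: f"ran tests for {target}",
    },
    max_calls=2,
)
```

The design choice here is that the agent never sees anything beyond `harness.call`, so the narrow path is a property of the environment rather than a request in the prompt.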

You are not restricting the agent; you are giving it a narrow road to success.

One of the hard rules the OpenAI team learned: **Architectural constraints are enforced by linters, not prompts.**

Do not tell the agent "please follow the singleton pattern"; build a linter that enforces the singleton pattern. Don't ask. Constrain.
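As one possible reading of "a linter that enforces the singleton pattern", here is a minimal sketch built on Python's standard `ast` module. The class name `Config` and the accessor `Config.instance()` are assumptions made up for illustration; the source does not say how the OpenAI team's linters are written.

```python
import ast
from typing import List

# Hypothetical: classes that must be obtained via <Class>.instance().
SINGLETONS = {"Config"}


def find_violations(source: str, filename: str = "<src>") -> List[str]:
    """Report every direct construction of a singleton class, e.g. `Config()`."""
    tree = ast.parse(source, filename=filename)
    found = []
    for node in ast.walk(tree):
        # A direct call looks like Call(func=Name("Config")).
        # `Config.instance()` is Call(func=Attribute(...)) and is not flagged.
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in SINGLETONS
        ):
            name = node.func.id
            found.append(
                (node.lineno, f"{filename}:{node.lineno}: use {name}.instance() instead of {name}()")
            )
    # ast.walk does not guarantee source order, so sort by line number.
    return [message for _, message in sorted(found)]
```

Run in CI or as a pre-commit step, a non-empty result fails the change, so the agent gets the constraint as a concrete error message instead of a polite request. A production rule would also need to exempt the singleton's own module, which this sketch does not do.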

The OpenAI team has a principle: If a PR requires significant human intervention, the agent is not the problem; the Harness is.

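This is the core shift in mindset. When the agent makes a mistake, we do not criticize it; we improve its runtime environment so that the mistake can never happen again. This is exactly how Mitchell Hashimoto defines Harness Engineering: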

“Every time you discover an agent has made a mistake, you take the time to engineer a solution so that it can never make that mistake again.”
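One way to picture "engineer a solution so that it can never make that mistake again" is a registry of guards that grows by one check per discovered mistake. The sketch below is an assumption-laden illustration: the `guard` decorator, the `Change` shape (path mapped to new file content), and both example mistakes are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Change = Dict[str, str]  # hypothetical shape: file path -> new file content
Check = Callable[[Change], Optional[str]]

GUARDS: List[Check] = []


def guard(check: Check) -> Check:
    """Register a check; each past agent mistake becomes one permanent guard."""
    GUARDS.append(check)
    return check


@guard
def no_edits_to_generated_files(change: Change) -> Optional[str]:
    # Hypothetical past mistake: the agent hand-edited generated code.
    bad = sorted(path for path in change if path.startswith("generated/"))
    return f"do not hand-edit generated files: {bad}" if bad else None


@guard
def no_placeholder_bodies(change: Change) -> Optional[str]:
    # Hypothetical past mistake: the agent shipped an unimplemented stub.
    bad = sorted(path for path, text in change.items() if "NotImplementedError" in text)
    return f"unimplemented stubs left in: {bad}" if bad else None


def review(change: Change) -> List[str]:
    """Run every registered guard and collect the failure messages."""
    return [msg for check in GUARDS if (msg := check(change)) is not None]
```

A change is accepted only when `review(change)` returns an empty list, so a mistake that has been seen once is rejected mechanically from then on.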

This points directly to a GAN-style architecture: separate Generator and Evaluator agents. It is also why a reliable Harness needs an independent verification layer.
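The separation of Generator and Evaluator can be sketched as a simple loop. The GAN comparison is only an analogy about role separation, and nothing here involves adversarial training. All names below (`Verdict`, `run_with_verifier`, and the toy generator and evaluator) are hypothetical stand-ins for real agents.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    ok: bool
    feedback: str = ""


def run_with_verifier(
    generate: Callable[[str, str], str],
    evaluate: Callable[[str, str], Verdict],
    task: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Loop until the independent evaluator accepts a candidate or rounds run out."""
    feedback = ""
    for _ in range(max_rounds):
        candidate = generate(task, feedback)
        verdict = evaluate(task, candidate)  # the generator never grades itself
        if verdict.ok:
            return candidate
        feedback = verdict.feedback
    return None  # no accepted candidate: escalate to a human


# Toy stand-ins for two separate agents.
def toy_generator(task: str, feedback: str) -> str:
    return f"{task} + tests" if "tests" in feedback else task


def toy_evaluator(task: str, candidate: str) -> Verdict:
    if "tests" in candidate:
        return Verdict(ok=True)
    return Verdict(ok=False, feedback="add tests")
```

With these toy functions, `run_with_verifier(toy_generator, toy_evaluator, "patch")` is rejected on the first round and accepted on the second, once the evaluator's feedback has been fed back to the generator.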

Applying this to our iUX 迪元健康 project:

Harness Engineering is not a tool. It is an upgrade in engineering philosophy: from "tuning the model" to "building the system".