How to Make AI Write Code Slower, but More Correctly: Multi-Model PR Review to Minimize Bug Probability

Former Microsoft senior engineer Nolan Lawson has Claude, Codex, and Cursor Bugbot review each of his PRs at the same time, cross-validating their findings to push the false positive rate close to zero.

The familiar advantage of AI coding is that it produces large amounts of code quickly, but the accuracy of that code is questionable. Lawson, a former senior engineer at Microsoft and Salesforce, recently documented a different workflow on his blog: he has several large language models review every pull request (a request to merge new code into a project) at the same time. The goal is to cross-validate the reviews and find real bugs, not to produce more code faster. The workflow did not increase his code output, but code quality improved significantly.

Anthropic's Glasswing program, launched this year (a public update to the Mythos system), provides data supporting this logic. The system lets LLM agents scan real open source code at scale. After scanning more than 1,000 open source projects, it reported an estimated 23,019 vulnerabilities in total (including low severity), of which 6,202 were rated high-severity or critical. Of the 1,752 vulnerabilities that independent security firms verified individually, 90.6% were confirmed as real issues, with 62.4% rated high-severity or critical.

These numbers illustrate a fundamental shift: finding bugs is no longer the bottleneck; validation and patching are. Anthropic wrote in its research report: "Progress in software security was once limited by the speed of finding vulnerabilities; now it is limited by the speed of validation, disclosure, and patching."
In other words, AI has moved the bottleneck from discovery to processing capacity.

Lawson's core approach is to have models from different vendors review each PR at the same time instead of relying on a single model. His toolset is Claude Code, OpenAI's Codex, and Cursor Bugbot. All three review the same pull request independently, and the results are then aggregated and sorted into four severity levels: critical, high, medium, and low.

The key property of this multi-model cross-validation is that a single model is prone to false positives, but when models with different training data and architectures flag the same issue, the false positive rate drops significantly while coverage increases. In Lawson's words: "The false positive rate is close to zero, and the bug coverage found is very high."

His decision process is clear. All critical and high issues must be fixed first. Medium and low issues are evaluated one by one, weighing fix cost against actual impact, and skipped when the fix is not worthwhile. If a PR has too many critical issues, he scraps it and starts over rather than patching on top of fundamental problems.

In practice, the workflow did not increase Lawson's code output in lines. Instead, it often surfaced existing old bugs and pushed him to write unit tests (automated tests that verify each small unit of functionality in isolation). Fixing old problems often took far more time than advancing new features. This was not the result he expected, but it also signals that the health of the codebase is being systematically reinforced. Lawson calls this way of working "vibe coding with more texture": cautious, methodical, and quality-oriented.
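The aggregation and triage rules described above can be sketched in a few lines of Python. This is an illustrative sketch, not Lawson's actual tooling: the `Finding` record, the `aggregate` and `triage` functions, matching findings by a `location` string, and the `max_critical` threshold are all assumptions made for the example (the article says only "too many critical issues" without giving a number).

```python
from collections import defaultdict
from dataclasses import dataclass

# The four severity levels named in the article, most severe first.
SEVERITY_ORDER = ["critical", "high", "medium", "low"]


@dataclass(frozen=True)
class Finding:
    reviewer: str  # hypothetical label, e.g. "claude", "codex", "bugbot"
    location: str  # hypothetical matching key, e.g. "cache.py:88"
    severity: str  # one of SEVERITY_ORDER
    summary: str


def aggregate(findings):
    """Merge findings that point at the same location and sort by severity."""
    grouped = defaultdict(list)
    for finding in findings:
        grouped[finding.location].append(finding)

    merged = []
    for location, group in grouped.items():
        # Keep the most severe rating any reviewer assigned to this location.
        worst = min(group, key=lambda f: SEVERITY_ORDER.index(f.severity))
        merged.append({
            "location": location,
            "severity": worst.severity,
            "summary": worst.summary,
            # Agreement between reviewers is the cross-validation signal.
            "reviewers": sorted({f.reviewer for f in group}),
        })

    merged.sort(key=lambda m: (SEVERITY_ORDER.index(m["severity"]),
                               -len(m["reviewers"]),
                               m["location"]))
    return merged


def triage(merged, max_critical=3):
    """Apply the decision rules; max_critical is an assumed threshold."""
    criticals = [m for m in merged if m["severity"] == "critical"]
    if len(criticals) > max_critical:
        # Too many critical issues: scrap the PR and redo it.
        return {"decision": "redo", "must_fix": [], "evaluate": []}
    return {
        "decision": "fix",
        # Critical and high issues are always fixed first.
        "must_fix": [m for m in merged if m["severity"] in ("critical", "high")],
        # Medium and low issues are weighed case by case (cost vs. impact).
        "evaluate": [m for m in merged if m["severity"] in ("medium", "low")],
    }


# Made-up findings, purely for illustration.
findings = [
    Finding("claude", "cache.py:88", "high", "stale entry returned after invalidation"),
    Finding("codex", "cache.py:88", "critical", "stale entry returned after invalidation"),
    Finding("bugbot", "api.py:12", "low", "unused parameter"),
    Finding("codex", "auth.py:40", "medium", "token expiry not checked"),
]
report = triage(aggregate(findings))
print(report["decision"])
for item in report["must_fix"]:
    print(item["severity"], item["location"], item["reviewers"])
```

Matching on an exact location string is the simplest possible choice; real reviewer output would likely need fuzzier matching to decide that two models are describing the same issue.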
Development tools are usually marketed on speed, but the problem engineers really need to solve has never been speed alone. Every line of code carries a maintenance cost and a probability of going wrong. The lesson here is to use AI to write code more slowly, so that each line survives longer and is less likely to go wrong.

Source: 動區 BlockTempo

Published: 2026-05-26 03:34:37
