Claude going crazy making mistakes and playing dumb when writing code? 12 rules to help you slash error rates from 41% to 3%

Andrej Karpathy complained early in the year about Claude making mistakes when writing code, Forrest Chang distilled 4 rules, and within 6 weeks the author added 8 more, covering new scenarios like multi-step Agents and hook chain triggering. Tested across 30 codebases, error rate plummeted from 41% to 3%. (Background: SEC lowered the broker-dealer stablecoin haircut "from 100% to 2%"—why is this a major bullish signal?) (Background context: Anthropic AI Economic Index lengthy report: automated trading workflow frequency doubled, Claude is transitioning from tool to life assistant) Reveals where the original template silently fails in 4 key scenarios, and supplements with 8 new rules, compressing the error rate from 41% down to 3%. In January 2026, AI guru Andrej Karpathy posted on X complaining about how Claude writes code—silent assumptions, over-engineering, collateral damage to unrelated code. Forrest Chang condensed these pain points into 4 behavioral rules, placed them into a CLAUDE.md file, and it went viral upon release. But a few months later, the Claude Code ecosystem had evolved from single-shot completion to multi-step Agent collaboration, and the old rules began to fail. One developer tested 30 codebases within 6 weeks. In late January 2026, Andrej Karpathy posted a Twitter thread criticizing how Claude writes code. He pointed out three typical categories of issues: making incorrect assumptions without explanation, over-complication, and causing unrelated damage to code that shouldn't have been touched. Forrest Chang saw this thread, distilled the complaints into 4 behavioral rules, wrote them into a standalone CLAUDE.md file, and published it to GitHub. The project received 5,828 Stars on its first day online, was bookmarked 60,000 times within two weeks, and now has 120,000 Stars, becoming the fastest-growing single-file code repository of 2026. Subsequently, over a 6-week period, I tested it against 30 codebases. These 4 rules really do work. Errors that previously occurred about 40% of the time dropped to below 3% on tasks where these rules were applicable. But the problem is that this template was originally designed to solve the errors Claude made when writing code in January. By May 2026, the problems facing the Claude Code ecosystem had become different: Agents conflicting with each other, hook chain triggering, skill loading conflicts, and multi-step workflow interruptions across sessions. So I added 8 more rules. Below is the complete 12-rule version of CLAUDE.md: why each one deserves to be added, and the 4 places where the original Karpathy template silently fails. If you want to skip the explanation and just copy and use it, the complete file is at the end of the article. Claude Code's CLAUDE.md is the most underrated file in the entire AI programming tech stack. Most developers typically make three kinds of mistakes: First, treating it as a preferences dumping ground, cramming all their habits into it, eventually bloating to over 4,000 tokens, with rule adherence dropping to 30%. Second, simply not using it at all and re-prompting every time. This causes 5x token waste and lacks consistency across different sessions. Third, copying a template and then never touching it again. It might work for two weeks, but as the codebase changes, it silently becomes ineffective without you knowing. Anthropic just shipped Dreaming agents review their own sessions between runs, surface recurring mistakes, and rewrite their memory so the next session doesn't repeat them. Karpathy wrote CLAUDE.md for the same reason. > silent wrong assumptions > over-complication >… https://t.co/HqCTygknPK pic.twitter.com/lUkHfhH7hb— Mnimiy (@Mnilax) May 13, 2026 Anthropic's official documentation is very clear: CLAUDE.md is essentially advisory only. Claude follows it about 80% of the time. Once it exceeds 200 lines, adherence drops noticeably, because important rules get drowned out by noise. Karpathy's template solved this problem: one file, 65 lines, 4 rules. This is the minimum baseline. But the ceiling can be higher. After adding the following 8 rules, what it covers is not only the code-writing problems Karpathy complained about in January 2026, but also the Agent orchestration problems that only emerged in May 2026—problems that didn't exist when the original template was written. If you haven't seen Forrest Chang's repository yet, look at this baseline version first: Rule 1: Think before coding. Don't make silent assumptions. State your assumptions, expose the trade-off points. Ask before guessing. When simpler alternatives exist, proactively raise the counterargument. Rule 2: Simplicity first. Use the minimum code that solves the problem. Don't add imagined features. Don't design abstraction layers for one-shot code. If a senior engineer would consider it over-complicated, it should be simplified. Rule 3: Surgical modifications. Only change what must be changed. Don't casually "optimize" adjacent code, comments, or formatting. Don't refactor what isn't broken. Maintain consistency with existing style. Rule 4: Goal-oriented execution. Define success criteria first, then loop and iterate until validation is complete. Don't tell Claude what to do at every step; tell it what the successful outcome should look like, and let it iterate itself. These 4 rules can resolve about 40% of the failure patterns I've seen in unsupervised Claude Code sessions. The remaining 60% of problems hide in the blank spaces below. Every rule comes from a real moment when Karpathy's original 4 rules were no longer enough. Below, I'll first describe the scenario, then provide the corresponding rule. Claude can be used for: classification, drafting, summarizing, extracting information from unstructured text. Don't use Claude for: routing, retrying, status code handling, deterministic transformations. If a status code already answers the question, let regular code answer that question. Karpathy's rules didn't cover this. As a result, the model started making decisions that should have been handled by deterministic code: whether to retry an API call, how to route a message, when to escalate handling. The result is that judgments differ every week. What you get is an unstable if-else billed at $0.003 per token. That moment was like this: there was code that called Claude to "decide whether to retry on a 503." It ran well at first, for two weeks, then suddenly became unstable because the model started treating the request body as judgment context too. The retry strategy became random because the prompt itself was random. Single task budget: 4,000 tokens. Single session budget: 30,000 tokens. If a task approaches the budget cap, summarize the current state and restart. Don't push through. Explicitly exposing budget overruns is better than silently exceeding them. A CLAUDE.md without budget constraints is a blank check. Every loop has the potential to spiral out of control and become a 50,000-token context dump. The model won't stop itself. That moment was like this: a debugging session lasted 90 minutes. The model kept iterating around the same 8KB error message, gradually forgetting which fixes it had already tried. By the end, it was proposing solutions I had already rejected 40 messages earlier. With a token budget, this process should have been terminated by minute 12. If two existing patterns in the codebase contradict each other, don't mix them together. Choose one pattern, prefer the newer or more thoroughly tested one, state your reasoning, and flag the other for follow-up cleanup. "Average code" that tries to satisfy both rule sets is the worst code of all. When two parts of the codebase contradict each other, Claude will try to please both sides, resulting in incoherent code. That moment was like this: a codebase had two error handling patterns, one using async/await with explicit try/catch, the other using a global error boundary. Claude wrote new code using both. As a result, error handling was done twice. It took me 30 minutes to figure out why errors were being swallowed twice. Before adding code to a file, first read the file's exports, direct callers, and any obviously related shared utility functions. If you don't understand why existing code is organized this way, ask first—don't just add things. "It looks unrelated to me" is the most dangerous sentence in this codebase. Karpathy's "surgical modifications" tells Claude not to touch adjacent code. But it doesn't tell Claude to first understand adjacent code. Without this rule, Claude will write new code that conflicts with existing code 30 lines away. That moment was like this: Claude added a new function next to an existing function with completely identical functionality, because it didn't read the original function first. Both functions did the same thing. But due to import order, the new function overrode the old one, and the old function had existed as the de facto single source of truth for 6 months. Every test must encode "why this behavior matters," not just "what it does." A test like `expect(getUserName()).toBe('John')` has no value if the function actually takes a hardcoded ID. If you can't write a test that would fail when business logic changes, then the function itself is wrong. Karpathy's "goal-oriented execution" implies tests can serve as success criteria. But in practice, Claude treats "tests passing" as the sole goal, and as a result writes code that passes shallow tests while breaking other things. That moment was like this: Claude wrote 12 tests for an authentication function, all passing. But the authentication logic in production was broken. Those tests were just verifying that the function "returned something," not that it returned the correct thing. The function passed the tests because it returned a constant. In multi-step tasks, after completing each step, summarize what has been done, what has been verified, and what remains. Don't continue from a state you can't clearly recite back to me. If you find you've lost track, stop and restate the current state. Karpathy's template assumes interactions are one-shot. But real Claude Code work is often multi-step: refactoring across 20 files, building features in one session, debugging across multiple commits. Without checkpoints, one wrong step can lose all prior progress. That moment was like this: a 6-step refactoring task went wrong on step 4. By the time I noticed, Claude had continued completing steps 5 and 6 on top of the incorrect state. Disentangling the fix took longer than redoing the entire task. With checkpoints, the problem would have been caught at step 4. If the codebase uses snake_case and you prefer camelCase: use snake_case. If the codebase uses class-based components and you prefer hooks: use class-based components. Disagreement is a separate discussion. Within a codebase, consistency takes priority over personal preference. If you genuinely believe a convention is harmful, raise it explicitly. Don't silently fork a divergent path. In a codebase with established patterns, Claude likes to introduce its own approach. Even if its approach is "better," introducing a second pattern is itself worse than either single pattern. That moment was like this: Claude introduced hooks into a React codebase based on class components. It did run. But it simultaneously broke the codebase's original test patterns, because those tests depended on componentDidMount. It took half a day to delete and rewrite it. If you can't confirm something has succeeded, state that explicitly. If 30 records were silently skipped, you cannot say "migration complete." If you skipped any tests, you cannot say "tests passed." If you didn't verify the edge cases I requested, you cannot say "feature works." Default to exposing uncertainty, not hiding it. Claude's most expensive failures are often those that look like successes. A function "runs" but returns wrong data; a migration "completes" but skips 30 records; a test "passes" but only because the assertion itself is wrong. That moment was like this: Claude said a database migration "completed successfully." But in reality, it had silently skipped 14% of records that triggered constraint violations. The skip behavior was written to the log, but not explicitly exposed. 11 days later, when report data started looking abnormal, we discovered the problem. Over 6 weeks, I tracked the same set of 50 representative tasks across 30 codebases, testing three configurations. Error rate refers to: tasks that needed to be corrected or rewritten to match the original intent. Counted errors include: silent wrong assumptions, over-engineering, unrelated breakage, silent failures, convention violations, conflicting compromises, missed checkpoints. Adherence rate refers to: when a rule applies, how likely Claude is to explicitly apply that rule. The truly interesting result isn't just that the error rate dropped from 41% to 3%. More importantly, expanding from 4 rules to 12 rules barely increased the adherence burden—the adherence rate only went from 78% to 76%, but the error rate dropped another 8 percentage points. The new rules cover failure patterns that the original 4 rules didn't address; they don't compete for the same attention budget. Even without adding new rules, the original 4-rule template falls short in at least 4 places. First, long-running Agent tasks. Karpathy's rules primarily target the moment when Claude is writing code. But what happens when Claude is executing a multi-step pipeline? The original template has no budget rule, no checkpoint rule, no "fail loudly" rule. So the pipeline slowly drifts. Second, multi-codebase consistency. "Match the existing style" assumes there is only one style. But in a monorepo with 12 services, Claude has to choose which style to match. The original rules don't tell it how to choose. So it either picks randomly or averages multiple styles together. Third, test quality. "Goal-oriented execution" treats "tests passing" as success, without specifying that the tests themselves must be meaningful. The result is that Claude writes tests that verify almost nothing, but these tests make it falsely confident. Fourth, the difference between production and prototyping stages. The same 4 rules can prevent production code from being over-engineered, but they can also slow down prototype development. Because at the prototyping stage, you sometimes do need 100 lines of exploratory scaffolding to figure out the direction. Karpathy's "simplicity first" tends to over-trigger in early code. These 8 added rules aren't meant to replace Karpathy's original 4 rules, but to patch their blank spots: the original template corresponds to the auto-complete-style code writing scenarios of January 2026; by May 2026, Claude Code had entered Agent-driven, multi-step, multi-codebase collaboration environments, and the problems they face are not the same. Before finalizing these 12 rules, I also tried other approaches. Adding rules I saw on Reddit / X. Most of them either just rephrased Karpathy's original 4 rules in different wording, or were domain-specific rules that don't generalize, like "always use Tailwind classes." All were eventually deleted. More than 12 rules. I tested up to 18 rules at most. After exceeding 14, adherence dropped from 76% to 52%. The 200-line ceiling is real. Beyond that length, Claude starts pattern-matching to "there are rules here" rather than actually reading the rules item by item. Rules dependent on the existence of certain tools. For example, "always use eslint"—as soon as a project doesn't have eslint installed, this rule fails, and silently at that. Later I changed it to a tool-agnostic expression, such as changing "use eslint" to "follow the style already enforced in the codebase." Putting examples in CLAUDE.md instead of rules. Examples take up more context than rules. Three examples consume about as much context as 10 rules, and Claude tends to overfit to examples easily. Rules are abstract, examples are concrete. So you should use rules. "Be careful," "think seriously," "stay focused." These are all noise. Adherence to these kinds of instructions dropped to about 30%, because they can't be verified. Later I replaced them with more concrete imperative rules, like "state assumptions explicitly." Telling Claude to be like a "senior engineer." This doesn't work. Claude already thinks it's a senior engineer. The real problem isn't whether it thinks so, but whether it acts so. Imperative rules can narrow this gap; identity prompts cannot. Below is the complete version that can be directly copy-pasted. This content cannot be displayed outside of Feishu docs for now. Save it as CLAUDE.md at the root of your repository. Below these 12 rules, add project-specific rules, such as tech stack, test commands, error patterns, etc. The whole thing shouldn't exceed 200 lines—beyond that, rule adherence drops noticeably. Just two steps: 1. Append Karpathy's 4 baseline rules to your CLAUDE.md curl https://raw.githubusercontent.com/forrestchang/andrej-karpathy-skills/main/CLAUDE.md >> CLAUDE.md 2. Paste rules 5–12 from this article below Save the file at the root of the repository. The >> here is important—its purpose is to append to the existing CLAUDE.md, not overwrite the project-specific rules you've already written. CLAUDE.md is not a wishlist; it is a behavioral contract used to seal off specific failure patterns you've already observed. Every rule should answer one question: what error does it prevent? Karpathy's 4 rules prevent the failure patterns he saw in January 2026: silent assumptions, over-engineering, collateral damage, weak success criteria. They are the foundation—don't skip them. The 8 rules I added prevent new failure patterns that emerged after May 2026: Agent loops without budget constraints, multi-step tasks without checkpoints, tests that appear to be tested but don't actually test the key logic, and the problem of packaging silent failures as silent successes. They are incremental patches. Of course, results will vary from person to person. If you don't run multi-step pipelines, Rule 10 isn't as important for you. If your codebase has only one unified style and it's already enforced by lint, Rule 11 is redundant. After reading these 12 rules, keep those that genuinely correspond to errors you've actually made, and delete the rest. A 6-rule CLAUDE.md customized for real failure patterns is better than a 12-rule version where 6 of the rules are ones you'll never use. Karpathy's tweet in January 2026 was essentially a complaint. Forrest Chang turned it into 4 rules. Ultimately, 120,000 developers gave the result a Star. And most of them are still only using those 4 rules today. The model has progressed, and the ecosystem has changed too. Multi-step Agents, hook chain triggering, skill loading, multi-codebase collaboration—none of these existed when Karpathy wrote that tweet. The original 4 rules don't solve these problems. They're not wrong; they're just incomplete. Add 8 rules. 6 weeks, testing across 30 codebases. Error rate dropped from 41% to 3%. Bookmark this article tonight, paste these 12 rules into your CLAUDE.md. If it saves you a week of Claude detours, feel free to share it.