The naming problem: Agents generate names (functions, variables, classes) that are locally logical but clash with codebase conventions. In a Python project that uses snake_case, agents introduce camelCase because their training data skews toward JS/TS. This isn't a bug; it's a context-attention failure.
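A cheap guardrail is to lint generated Python for camelCase identifiers before accepting it. A minimal sketch using the standard library's `ast` module; the regex and the sample snippet are illustrative, not a complete convention checker:

```python
import ast
import re

# Matches lowerCamelCase identifiers such as getUser or maxRetries.
CAMEL = re.compile(r"^[a-z]+[A-Z]")

def flag_camel_case(source: str) -> list:
    """Return function names and assigned variable names that
    violate snake_case."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and CAMEL.match(node.name):
            flagged.append(node.name)
        elif (isinstance(node, ast.Name)
              and isinstance(node.ctx, ast.Store)
              and CAMEL.match(node.id)):
            flagged.append(node.id)
    return flagged

snippet = "def getUser():\n    maxRetries = 3\n    total = 0\n"
print(flag_camel_case(snippet))  # ['getUser', 'maxRetries']
```

Wiring a check like this into the agent loop turns a silent style drift into a hard failure the agent can correct.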
Impossible context tests: You cannot test whether code respects implicit invariants (e.g., 'this field is always set by the sync job, never by direct access') without understanding the data flow. Agents don't have access to the 3 Slack threads and one architecture decision doc that explain why this invariant exists.
Library version hallucinations: Agents confidently suggest APIs that don't exist in the library version you have installed, or use syntax from a future major version. This is especially dangerous because the code looks plausible and tests can pass in isolation.
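One mitigation is to verify that a generated call actually exists in the installed package before trusting the snippet. A minimal standard-library sketch; `api_exists` is a hypothetical helper name, and `json.dump_pretty` is the kind of plausible-looking API an agent might invent:

```python
import importlib

def api_exists(module_name: str, attr_path: str) -> bool:
    """Check that a dotted attribute path really exists in the
    installed module before trusting a generated snippet."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return False
    for part in attr_path.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True

print(api_exists("json", "dumps"))        # True: real API
print(api_exists("json", "dump_pretty"))  # False: invented API
```

Running this against every unfamiliar call in a draft catches hallucinated APIs before the code ever executes.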
Security by accident: Agents generate code that works but is insecure: they default to 'easiest to implement', not 'most secure'. Parameterized SQL queries are often dropped in complex JOINs, and auth checks are easy to lose in refactors.
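The fix for the SQL case is mechanical: bind values with placeholders instead of interpolating them. A minimal sqlite3 sketch (the `users` table and the payload string are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

# Unsafe pattern agents tend to emit: interpolating values into SQL.
#   conn.execute(f"SELECT name FROM users WHERE id = {user_id}")

# Safe: placeholder binding; the driver treats the value as data.
row = conn.execute("SELECT name FROM users WHERE id = ?", (1,)).fetchone()
print(row[0])  # alice

# The same binding neutralizes an injection attempt outright.
attack = conn.execute(
    "SELECT name FROM users WHERE id = ?", ("1 OR 1=1",)
).fetchone()
print(attack)  # None: the payload is compared as a literal string
```

The pattern survives arbitrarily complex JOINs unchanged, which is exactly where agents tend to fall back to string building.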
Code review is still required: Studies of AI-generated code consistently find that human review catches 1-3 bugs per 100 lines on average. At that rate, a 1000-line feature still needs review. Agents haven't replaced reviewers; they've made more work for them.
RAG for code works: Index the codebase with a tool like Sourcegraph Cody or context7. When generating, retrieve only the relevant 5-10 files along with their imports and type signatures. This reduces the hallucination rate by 60-80% compared to dumping the entire codebase into context.
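The retrieval step can be sketched with a toy bag-of-words index. A real setup would use a code-aware embedding model, but the flow (embed files, embed the query, rank by cosine similarity, keep the top-k) is the same. All file paths and contents below are hypothetical:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real index would use a
    # code-aware embedding model, but the retrieval flow is identical.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical mini-index: file path -> signatures and imports only.
index = {
    "billing/invoice.py": "def create_invoice(customer_id: int) -> Invoice",
    "billing/refund.py": "def refund_invoice(invoice_id: int) -> Refund",
    "auth/session.py": "def validate_session(token: str) -> User",
}

def retrieve(query: str, k: int = 2) -> list:
    """Rank indexed files by similarity to the query, keep top-k."""
    q = embed(query)
    ranked = sorted(index, key=lambda f: cosine(q, embed(index[f])),
                    reverse=True)
    return ranked[:k]

print(retrieve("validate a session token"))
```

Only the retrieved files, not the whole index, go into the generation prompt; that is where the context savings come from.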
Schema-first generation: If you give the agent explicit type schemas, API contracts, and database schemas as structured input, the hallucination rate on those interfaces drops to near zero. The problem is always implicit knowledge: make it explicit.
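For database schemas, 'make it explicit' can be as simple as dumping the live table definitions into the prompt instead of letting the agent guess column names. A minimal sqlite3 sketch; `schema_context` and the `invoices` table are illustrative:

```python
import sqlite3

def schema_context(conn: sqlite3.Connection) -> str:
    """Dump the real table definitions so generated SQL is grounded
    in the actual schema rather than a guessed one."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    return "\n".join(sql for (sql,) in rows if sql)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices ("
    "id INTEGER PRIMARY KEY, amount_cents INTEGER, synced_at TEXT)"
)

# Prepend the real schema to the generation request.
prompt = "Generate SQL against exactly these tables:\n" + schema_context(conn)
print(prompt)
```

The same move works for API contracts (paste the OpenAPI fragment) and types (paste the actual class definitions): the agent can only hallucinate what it has to guess.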
Code quality benchmarks back this up: On standardized benchmarks like LiveCodeBench, Claude 3.7 Sonnet scores above 85% on code generation tasks. The remaining 15% is edge cases, and real developers aren't hitting 100% either.
The hybrid pattern: Use agents for first drafts (they're fast and cover 80% of cases), then route failures to human review. Over time, the review findings get fed back as constraints and the agent improves. This is how most mature dev teams use AI today.
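The routing half of this pattern can be sketched in a few lines: automated checks gate the draft, and any finding routes it to a human instead of auto-merging. `hybrid_review` and the two gates are hypothetical stand-ins for real lint, type, and security checks:

```python
from typing import Callable, List, Optional, Tuple

def hybrid_review(
    draft: str, checks: List[Callable[[str], Optional[str]]]
) -> Tuple[str, List[str]]:
    """Run automated gates over an agent draft; any finding routes
    the draft to human review instead of direct merge."""
    failures = [msg for check in checks if (msg := check(draft))]
    status = "needs_human_review" if failures else "auto_approved"
    return status, failures

# Hypothetical gates standing in for real lint/type/security checks.
def naming_gate(code):
    return "naming: camelCase identifier" if "getUser" in code else None

def sql_gate(code):
    return "security: string-built SQL" if 'f"SELECT' in code else None

status, failures = hybrid_review(
    'rows = conn.execute(f"SELECT * FROM users WHERE id = {uid}")',
    [naming_gate, sql_gate],
)
print(status, failures)  # needs_human_review ['security: string-built SQL']
```

Each recurring human finding becomes a new gate, which is the feedback loop the paragraph above describes.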
Mac Mini setup: Use a local embedding model (Nomic or similar) to build the RAG index on-device. All retrieval stays local. Only the generation call goes to cloud. Privacy-preserving and fast.