The test coverage illusion: Agents maximize line coverage, not correctness coverage. They'll add tests for happy paths and miss the edge cases that actually break in production. A 90% covered codebase can still have its core invariants violated in ways no test catches.
Integration tests are the hard part: Unit tests are easy to generate. The tests that matter — 'does this feature work end-to-end with real database state and real network calls' — require setting up fixtures that agents don't understand. They'll stub things they shouldn't.
Property-based testing is alien: Agents struggle with generative testing (fast-check, hypothesis). They don't think in invariants — 'for all inputs of type X, property Y holds.' This is where most latent bugs hide.
Human oversight becomes a bottleneck: If every agent-generated change needs human review to verify correctness, you haven't replaced developers — you've created a review-on-demand bottleneck that slows everything down.
Adversarial robustness: Agents generate code that passes their own tests. But adversarial inputs, concurrent access patterns, and security exploits require someone to actively think 'what would break this?' — a mindset, not a procedure.
Contract-driven development: If you define preconditions and postconditions (using Pydantic, Zod, or formal contracts like Cursor's @contract), agents can verify generated code against those contracts automatically. This shifts verification from 'human review' to 'does it satisfy the spec.'
Property-based testing catches edge cases: Tools like fast-check or hypothesis let you express invariants ('this function always returns a valid email or throws'). Agents can learn to generate these — and they're far more effective than example-based tests at catching edge cases.
The verification loop: Agent generates → contract check → if fail, regenerate with contract error as context. This converges much faster than human review cycles because the feedback is precise.
Formal methods are accessible now: LLMs can interact with formal verification tools (TLA+, Dafny, Lean). For critical sections, a formal spec + LLM-generated proof is now feasible. Not for everything, but for the critical path.
Human oversight as quality gate, not bottleneck: Humans review architecture and contracts, not every line. A senior developer reviewing a contract/spec for a feature takes 5 minutes. Reviewing the 500-line implementation takes an hour. Agents shift the human work to the high-value part.