Swiftbeard

The Right Way to Use AI for Code Review

AI code review is only useful if you set it up right — what to automate, what to keep manual, and how to avoid false confidence.

code-review, ai-tools, quality, workflow

AI code review is easy to do badly. The naive version — dump a diff into Claude and ask "is this good?" — produces a list of observations that sounds thorough and misses the problems that actually matter.

The good version requires knowing what AI code review is actually good at and setting it up accordingly.

What AI Code Review Is Good At

Pattern matching. Common bugs, antipatterns, missing error handling, SQL injection risk, XSS vectors, hardcoded credentials — things that look the same across many codebases. AI is very good at this class of review.

Completeness checks. "Does this function handle the empty input case?" "Is this endpoint missing input validation?" "Does this migration have a down migration?" These are structural questions where the right answer is consistent and checkable.

Boilerplate review. Test files, migrations, API endpoint structure, configuration changes. Things that should follow a predictable pattern and where deviation is likely a mistake.

Documentation accuracy. Does this docstring match what the function actually does? Are the parameter descriptions accurate? AI catches drift between docs and implementation reliably.

Consistency. Does this new code use the same patterns as the existing codebase? AI can compare new code against the patterns in surrounding files and flag inconsistencies.

What AI Code Review Is Bad At

Architecture and design. Is this the right abstraction? Should this be a separate service? Is this API design going to cause problems in six months? These require understanding context, tradeoffs, and constraints that live outside the diff. AI guesses at these and often guesses confidently.

Product correctness. Does this code do what it's supposed to do? The AI can only judge against the code's visible intent — if the intent itself is wrong, the AI won't catch it.

Team and codebase context. Why does this code work the way it does? There's often history — previous bugs, deliberate tradeoffs, planned refactors — that explains code that looks odd in isolation. AI doesn't have this context.

Performance in context. "This n+1 query is a problem" — maybe. But whether it matters depends on data volume, frequency of the code path, caching, and other factors the AI can't see.

The Setup That Works

Run it on specific categories, not everything. Don't ask the AI for an open-ended review of the whole diff. Ask it specifically to check for security issues. Ask it to check for missing error handling. Ask it to verify test coverage. Specific questions get useful answers; generic "review this" gets generic output.

# In your CI pipeline
git diff main...HEAD | claude -p "Review this diff specifically for:
1. Missing input validation on any new API endpoints
2. Hardcoded values that should be environment variables
3. Missing error handling in async functions
4. Any SQL queries that could be vulnerable to injection

For each issue found, cite the specific line and explain why it's a problem.
If nothing is found in a category, say 'None found.'
Only report issues in these categories — do not provide general feedback."

The specificity is load-bearing. Constrained review is useful. Unconstrained review produces noise.
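One way to wire this into CI is to make the prompt's output contract machine-checkable. A minimal sketch, assuming the prompt above (a clean category prints exactly "None found." and issues are reported one per line); the function and file names are illustrative, not a fixed interface:

```shell
#!/bin/sh
set -eu

# Sketch of a CI gate: fail the build unless every review category
# came back clean. Relies on the prompt contract above — clean output
# contains only "None found." lines and blanks.
review_is_clean() {
  # Succeed only if no line is something other than "None found."
  # or whitespace.
  ! grep -qv -e '^None found\.$' -e '^[[:space:]]*$' "$1"
}

# Usage in CI (the claude invocation from the snippet above):
#   git diff main...HEAD | claude -p "..." > review.txt
#   review_is_clean review.txt || { cat review.txt; exit 1; }
```

Making "clean" a literal string the pipeline can grep for is part of why the constrained prompt is useful: the output becomes a signal, not just prose.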

Treat it as a first pass, not a final pass. AI review should flag things for human attention, not close the loop. The human reviewer looks at AI findings, validates the real issues, dismisses the false positives, and applies judgment the AI doesn't have.

Track false positive rate. If the AI is flagging things that aren't actually problems 50% of the time, it's creating more work than it saves. Tune your prompts to reduce false positives — usually by being more specific about what you care about.
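Tracking this doesn't need tooling beyond a log the human reviewer appends to. A minimal sketch, assuming a hand-labeled CSV (hypothetical format: `pr_number,finding,verdict`, where verdict is `real` or `false-positive` as judged by the reviewer):

```shell
#!/bin/sh
set -eu

# Sketch: compute the AI review false-positive rate from a
# hand-labeled findings log.
false_positive_rate() {
  awk -F, '
    $3 == "false-positive" { fp++ }
    { total++ }
    END { if (total) printf "%.0f%%\n", 100 * fp / total }
  ' "$1"
}
```

If the number stays high after a round of prompt tightening, that category of check is probably not worth automating.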

Never skip human review for complex changes. For architectural changes, refactors across multiple files, or anything involving new patterns — human review is not optional. AI review is additive, not a replacement.
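This rule can also be enforced mechanically rather than by convention. A sketch of a merge gate, assuming the GitHub CLI is available and the AI reviewer posts approvals from a known bot account (the login `ai-reviewer` here is a placeholder, not a real account):

```shell
#!/bin/sh
set -eu

# Sketch: require at least one approval from a non-bot account.
has_human_approval() {
  # $1: newline-separated logins of approving reviewers, e.g. from:
  #   gh pr view "$PR" --json reviews \
  #     --jq '.reviews[] | select(.state == "APPROVED") | .author.login'
  # Succeed if any non-empty login is not the AI reviewer bot.
  printf '%s\n' "$1" | grep -qvx -e 'ai-reviewer' -e ''
}

# Usage in CI:
#   approvals="$(gh pr view "$PR" --json reviews --jq '...')"
#   has_human_approval "$approvals" || { echo "needs human review"; exit 1; }
```

Branch protection rules can accomplish the same thing with less code; the point is that "AI approval is not sufficient" should be enforced by the pipeline, not remembered by the team.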

The False Confidence Problem

The dangerous scenario: a PR gets AI review, the AI finds some issues, the developer fixes them, the AI re-reviews and clears it, and the PR merges without human review. The AI found real issues, so the developer feels like the code has been thoroughly reviewed. But the issues the AI can't catch were never seen by a human eye.

The discipline: never let AI review be the only gate. The value of AI review is catching the easy stuff automatically so human reviewers can focus on the hard stuff — not eliminating human reviewers from the process.