HyperAnalyzer
Strategy

Why a 30-rule analyzer beats a 1,200-check legacy tool

HyperAnalyzer Team
#strategy #static-analysis #signal-to-noise

We scraped the entire PVS-Studio diagnostics catalog as part of our research. 1,234 warnings. 418 of them in the C++ general analysis section alone. It is an impressive corpus. It is also, for our use case, a liability.

Coverage is not the metric that matters

The metric every static analyzer vendor reaches for is breadth: more checks, more rules, more languages, more standards covered. It looks good in a feature comparison table. The trouble is that for an LLM-driven workflow, breadth without precision is actively harmful.

When Claude calls our analyzer on a 200-line file, what comes back gets read by the model as part of its next prompt. Every false positive costs context window tokens, dilutes the signal, and trains the model to ignore the analyzer. Two or three noisy findings and the model starts treating the tool as advisory. After ten, it stops calling it.

The math is simple. A rule that fires 50% of the time on real bugs and 5% of the time as a false positive is a great human rule and a terrible LLM rule. A rule that fires 90% of the time on real bugs and 0.1% of the time as a false positive is the opposite. We optimise relentlessly for the second kind and we throw away anything that does not clear the bar.
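To make that trade-off concrete, here is a minimal sketch of the precision calculation behind it. The base rate is an illustrative assumption of ours (one real bug per hundred analyzed sites), not a number from our production data:

```python
# Compare two rules by the fraction of their findings that are real bugs.
# base_rate is a hypothetical illustration: 1% of analyzed sites actually
# contain the bug pattern the rule looks for.

def precision(recall: float, fp_rate: float, base_rate: float) -> float:
    """Positive predictive value: of everything the rule reports, how much is real."""
    true_pos = recall * base_rate
    false_pos = fp_rate * (1.0 - base_rate)
    return true_pos / (true_pos + false_pos)

base = 0.01

human_rule = precision(recall=0.50, fp_rate=0.05, base_rate=base)
llm_rule = precision(recall=0.90, fp_rate=0.001, base_rate=base)

print(f"50% recall / 5% FP rule:   {human_rule:.0%} of findings are real")  # ~9%
print(f"90% recall / 0.1% FP rule: {llm_rule:.0%} of findings are real")    # ~90%
```

At a 1% base rate, the "great human rule" produces findings that are roughly 9% real bugs; the second rule's findings are roughly 90% real. A human can skim past the noise in the first case; a model paying tokens for every finding cannot.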

What we drop

Style nags. Anything that needs a .clang-format to disagree with. Anything that requires whole-program inter-procedural analysis to be sound (we will get there in version two, but we will not ship a half-broken version one). Checks that depend on a build configuration we cannot reliably reconstruct. Checks for language features that are already deprecated.

What is left is around 30 rules in our initial C/C++ set. Each one is mapped to a CWE or CERT entry, ships with a positive and negative test fixture, has a one-line natural language fix hint, and was selected because we observed it firing on real LLM-generated code in production.
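As a sketch of what a rule entry in that set might carry, here is one possible shape. The field names, rule ID, and fixture paths are our illustration, not HyperAnalyzer's actual schema:

```python
# Hypothetical record for a single rule: CWE mapping, fix hint,
# and the pair of test fixtures that gate it into the rule set.
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    rule_id: str           # internal identifier (illustrative)
    cwe: str               # mapped CWE or CERT entry
    fix_hint: str          # one-line natural-language fix hint for the model
    positive_fixture: str  # code that MUST trigger the rule
    negative_fixture: str  # similar code that must NOT trigger it

USE_AFTER_FREE = Rule(
    rule_id="HA-001",
    cwe="CWE-416",
    fix_hint="Do not dereference a pointer after passing it to free().",
    positive_fixture="fixtures/ha001_fires.c",
    negative_fixture="fixtures/ha001_silent.c",
)
```

The paired fixtures are the important part: a rule only ships if it fires on the positive case and stays silent on the negative one.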

Thirty rules sounds like nothing. In practice it is enough to catch every one of the 22 bugs in our reference codebase, with zero false positives on the same body of code. That ratio is the only number we care about.
