Engineering

June 12, 2026 • 9 min read

Static Analysis Misses Bugs That Reach Production

Agents can stare at code all day and still miss issues, because at best they act as probabilistic unit tests. To get runtime evidence that your application behaves the way your organization expects, you have to actually run it through full integration and end-to-end tests.

Evan Marshall, CTO

The Static Analysis Ceiling: Missing Bugs That Reach Production

Code review analyzes your diff. Behavioral tests check if users will run into bugs. Teams need both.

AI coding has completely changed the risk equation for teams. They are generating code faster than they can confirm it works, yet they can’t slow down, because customer expectations have never been higher and competition has never been fiercer.

Real business risk can sit in a single pull request, and current tools aren’t enough to catch it.

Static code analysis is just one step. Code review tools read the change, whereas behavioral testing runs it. Both look at the same code change from different layers; the difference is that only behavioral testing mirrors how users will actually use your application.

Code review answers the question “does this diff look reasonable?” That’s not what actually matters to your users. They care about your software helping them achieve their outcome.

KEY POINTS

→ Code review answers whether the diff looks reasonable. Behavioral testing answers whether a user can still complete what they came to do once the change ships.
→ Static analysis reads source and cannot observe what the application does at runtime. Auth, data states, race conditions, and live integrations sit in that blind spot.
→ Static analysis code review tools like CodeRabbit and Greptile study the diff. They belong alongside behavioral testing on the same PR, not in competition with it.
→ Behavioral testing is end-to-end testing written around user intent rather than selectors, so an agent navigates the flow and there is no script library to maintain.

The rise of AI code review

AI review tools made the reading step faster and more consistent. CodeRabbit, Greptile, and the tools in their category pull in repository context, surface logic that looks suspect, flag API contract changes, and point a human reviewer at the parts of a diff worth their attention. As coding agents push more code into review than teams used to see, that consistency keeps the review queue moving. These tools also help keep a change from breaking your team’s unit tests.

These tools hit a structural limit, though. They can reason about code, but they don’t observe what a running application does to a user's task. Observing runtime behavior and confirming it with evidence leads to catching over 30% more issues.

What static analysis can't catch

Reading a change and running it surface different information. A change with a runtime bug can look completely reasonable in the diff yet break the moment you run it or a real user touches it.

Consider a database migration at a fintech company that looks good in review but, during the rollout, would create duplicate entries for every active customer. Every file involved looks correct on its own, the unit tests pass, and the issue only occurs during the actual migration.

A returning customer opens the app to finish a purchase, and an expired session drops them into a login loop instead of letting them back in, because the token refresh check runs after the redirect instead of before it. The order is what they came for, and the broken sign-in is where it stalls. Anyone with a fresh session sees nothing wrong, and the test against the refresh helper is green. The loop only shows up when the app runs with a stale token and real routing.
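
The ordering problem in this scenario can be sketched in a few lines of Python. Everything here is a simplified model with hypothetical names (Session, handle_buggy, handle_fixed, and a login page that bounces users who still hold a refresh token); it is not taken from any real codebase. The point is that the refresh helper passes its own test while the routed flow loops.

```python
from dataclasses import dataclass


@dataclass
class Session:
    access_valid: bool
    refresh_valid: bool


def refresh(session):
    # The helper is correct in isolation, so its unit test is green.
    if session.refresh_valid:
        session.access_valid = True
    return session.access_valid


def login_page(session):
    # Assumed behavior: users who still hold a refresh token are sent back.
    return "/checkout" if session.refresh_valid else "LOGIN_FORM"


def handle_buggy(path, session):
    if path == "/login":
        return login_page(session)
    if not session.access_valid:
        return "/login"      # the redirect fires first...
    refresh(session)         # ...so a stale session never reaches the refresh
    return "OK"


def handle_fixed(path, session):
    if path == "/login":
        return login_page(session)
    if not session.access_valid and not refresh(session):
        return "/login"      # redirect only when the refresh also failed
    return "OK"


def follow(handler, session, start="/checkout", max_hops=6):
    # Follow redirects the way a browser would, with a hop limit.
    path = start
    for hop in range(max_hops):
        result = handler(path, session)
        if result in ("OK", "LOGIN_FORM"):
            return result, hop
        path = result
    return "LOOP", max_hops


print(follow(handle_buggy, Session(True, True)))   # fresh session: fine
print(follow(handle_buggy, Session(False, True)))  # stale token: loops
print(follow(handle_fixed, Session(False, True)))  # refresh before redirect
```

Only the second call, a stale token combined with real routing, exposes the loop. That is the combination a diff reader never gets to observe.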

A drag-and-drop reorder looks correct in the UI, then reverts after the user navigates away and comes back, because the reorder updates local state and never calls the API. A reviewer can see that branch in the diff. Confirming the outcome takes leaving the page and returning, which is something only a running app does.
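
A small Python model makes the "looks right, then reverts" behavior concrete. FakeServer, ListPage, and the persist_on_reorder switch are hypothetical stand-ins for a backend and a UI component; the check that matters is the one made after the page is reloaded.

```python
class FakeServer:
    """Stand-in for the backend that owns the saved order."""

    def __init__(self, items):
        self.items = list(items)

    def get_items(self):
        return list(self.items)

    def put_items(self, items):
        self.items = list(items)


class ListPage:
    """Stand-in for the page: loads from the server each time it is opened."""

    def __init__(self, server, persist_on_reorder):
        self.server = server
        self.persist_on_reorder = persist_on_reorder
        self.items = server.get_items()

    def reorder(self, src, dst):
        item = self.items.pop(src)
        self.items.insert(dst, item)
        if self.persist_on_reorder:
            # The buggy variant skips this call and only touches local state.
            self.server.put_items(self.items)


def reorder_then_return(persist_on_reorder):
    server = FakeServer(["a", "b", "c"])
    page = ListPage(server, persist_on_reorder)
    page.reorder(0, 2)
    looked_right = page.items == ["b", "c", "a"]

    # Navigate away and come back: a new page instance reloads from the server.
    page = ListPage(server, persist_on_reorder)
    still_right = page.items == ["b", "c", "a"]
    return looked_right, still_right


print(reorder_then_return(persist_on_reorder=False))  # looks right, then reverts
print(reorder_then_return(persist_on_reorder=True))   # looks right and persists
```

A check that stops at the first assertion passes in both cases. Only the check made after returning to the page separates them.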

None of these bugs comes from carelessness. Each one depends on state, timing, browser behavior, or an integration path that no amount of reading can observe. This is the structural limit of static analysis: it can only inspect the code, and the failure is in what the code does.

The Evolution to Behavioral Testing

Behavioral testing confirms that a person can complete a real task in your app, by running the app and exercising the flow the way a user would. It ties code changes to business outcomes and user flows, not flaky test suites.

An agent opens the affected paths, clicks through them, fills in inputs, waits for the application to respond, and then checks the outcome that matters. Did the migration work? Does the new integration cover all edge cases? Did the order go through? Did the returning user get back to what they were doing? Did the API survive the upgrade? The agent can observe UI state, network behavior, session handling, and the result the user is left with.

The question behavioral testing answers is the one code review cannot reach: can the user still do what they came to do once the change ships. Screenshot diffing compares pixels and tells you the page looks the same, which is a different claim from confirming a user completed the task. Behavioral testing is about the outcome, not the appearance.

The unit being validated is the user's goal, not any single step. A person signs in to reach something on the other side of the login, so a working sign-in matters only because it clears the path to the task waiting behind it. Behavioral testing checks that step as part of completing the task rather than as a box to tick in isolation. This is also where it parts ways with scripted suites, which often assert that login works on its own and call the flow covered.

Testing accelerates, and the landscape is changing

Behavioral testing is a shift-left form of end-to-end testing: it runs the whole application in a real environment, and it does so on the pull request, before the change merges. The approach differs from other end-to-end (e2e) test products in how the test is written and what it asserts against.

Scripted end-to-end testing in CI with Playwright, Cypress, or Selenium works well when the UI is stable and the team has capacity to maintain the suite. The cost is that the assertions are tied to selectors and DOM structure, so the tests break when the markup moves and someone has to keep them current.
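
One way to picture that coupling is the toy Python sketch below. The page model is a hypothetical list of dictionaries, and neither lookup function represents the API of Playwright, Cypress, Selenium, or Ito. It only shows why a check keyed to markup breaks on a rename while a check keyed to what the user sees does not.

```python
def find_by_selector(page, css_class):
    # Scripted style: the test is keyed to a detail of the markup.
    return next((el for el in page if el["class"] == css_class), None)


def find_by_intent(page, role, name):
    # Intent style: the test is keyed to what the user sees and does.
    return next(
        (el for el in page if el["role"] == role and el["name"] == name),
        None,
    )


# The same button before and after a purely cosmetic refactor.
before = [{"class": "btn-submit", "role": "button", "name": "Place order"}]
after = [{"class": "cta-primary", "role": "button", "name": "Place order"}]

print(find_by_selector(before, "btn-submit") is not None)          # found
print(find_by_selector(after, "btn-submit") is not None)           # lost after rename
print(find_by_intent(after, "button", "Place order") is not None)  # still found
```

The user's experience is identical in both versions of the page, but the selector-keyed lookup fails after the rename and someone has to update it.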

Record-and-replay lowers the authoring cost of these e2e tests by capturing a session instead of hand-writing it, but it inherits the same brittleness, since the recording is coupled to the underlying structure.

Manual QA brings judgment that automation does not, which is crucial for edge cases and exploratory work. E2e test suites for this type of work aren’t worth the effort to build, and asking engineers to build them is how the queue backs up even further.

Agentic QA solutions like Ito generate the test from the code change itself. The agent navigates like a user and validates the behavioral outcome, so there is no library of selectors to maintain. The test is written around what the user is trying to accomplish, not around the structure the engineer happened to ship.

Same scope, different question: scripted suites confirm the code does what it was scripted to do, while behavioral testing confirms the user can finish the task they came for.

Where Static Analysis ends and Ito begins

CodeRabbit and Ito answer different questions on the same PR, and they sit together rather than compete.

CodeRabbit reviews the diff and catches code-level problems: shaky logic, a missing edge case, an unclear pattern, an API contract that changed. Ito runs the changed application in a real environment and catches behavior-level problems: the checkout that dies on submit, the returning user locked out by an expired session, the change that saved on screen and never persisted.

Both post their findings to the PR. Before anyone merges, the reviewer sees the code analysis alongside evidence of what the running app did with a real flow. That is a stronger basis for a merge decision than either half alone, and the reviewer still makes the call, now with more of the picture in front of them.

Building a complete QA stack

Each layer of a QA stack catches a different failure class, and the layers are additive rather than interchangeable.

Linting handles style and syntax. Unit tests check logic in isolation, one function at a time. AI code review reasons about the diff and flags code-level risk. None of these sees whether a real user can complete a task in the running app, and that is what behavioral testing covers.

A team that skips a layer is not removing those failures from the world. It is choosing not to see them until a user does. For teams shipping at the speed coding agents now allow, the layer most often missing is the last one, the one that confirms the people using the product can still finish what they started.

The forward-looking version of this is a development loop where behavioral validation is a property of every change, not a phase bolted on at the end. A developer opens a PR, and before the reviewer clicks in, the impacted flows have already run in a real browser and the results are waiting in the thread. The reviewer starts from behavioral evidence instead of from a guess.

The bottleneck in modern software development has moved downstream, from how fast a team can write code to how quickly it can trust the code it wrote. Faros, analyzing 22,000 engineers in 2026, found the strain landing exactly there: as AI adoption deepened, bugs per developer rose 54% and the workflow slowed most at the point of review. Behavioral testing is what turns a diff that looks safe into evidence that a real person can finish the task it touches.

How this looks in practice

Ito connects to your GitHub repo and configures an isolated environment for each pull request, so there is no shared state between runs. Adding testing credentials such as secrets, variables, or seed data is optional. On every PR, Ito deploys the change, runs the user flows it affects, and posts the results before anyone merges. When a flow breaks, the PR comment includes which flow failed, what the agent did, what the app returned, a video and screenshots of the failure, the steps to reproduce it, and the lines responsible.
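
To make the contents of such a failure comment concrete, here is an illustrative Python sketch. The FlowFailureReport class, its field names, the render_comment function, and all sample values are hypothetical and are not Ito's actual schema or output format. They simply mirror the items listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FlowFailureReport:
    # Hypothetical fields mirroring the items a failure comment includes.
    flow: str
    agent_actions: List[str]
    app_response: str
    video: str
    screenshots: List[str]
    repro_steps: List[str]
    responsible_lines: List[str]


def render_comment(report):
    # Build a plain-text PR comment from the report, one section per item.
    lines = [f"Flow failed: {report.flow}", "", "What the agent did:"]
    lines += [f"  {i}. {step}" for i, step in enumerate(report.agent_actions, 1)]
    lines += ["", f"What the app returned: {report.app_response}"]
    lines += ["", f"Video: {report.video}"]
    lines += [f"Screenshot: {shot}" for shot in report.screenshots]
    lines += ["", "Steps to reproduce:"]
    lines += [f"  {i}. {step}" for i, step in enumerate(report.repro_steps, 1)]
    lines += ["", "Lines responsible:"]
    lines += [f"  {ref}" for ref in report.responsible_lines]
    return "\n".join(lines)


# Made-up sample data for illustration only.
example = FlowFailureReport(
    flow="checkout",
    agent_actions=["Opened the cart", "Clicked 'Place order'"],
    app_response="HTTP 500 on order submit",
    video="video-placeholder.mp4",
    screenshots=["screenshot-placeholder.png"],
    repro_steps=["Sign in with a seeded test user", "Submit the order"],
    responsible_lines=["path/to/changed_file.py:10-14"],
)

print(render_comment(example))
```

Grouping the evidence this way lets a reviewer move from which flow broke to where to look without leaving the PR thread.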


By the numbers

Faros tracked 22,000 developers across 4,000 teams over two years. As AI adoption rose, median PR review time increased fivefold and incidents per PR tripled. The volume of code going into review grew, but the ability to trust it before it shipped did not keep pace. Code review did not absorb the difference; it just slowed down under the weight of it.

Your team reviews the code. Who confirms a user can still get through the app?

Connect your repo and add behavioral testing to every pull request. Ito configures the environment, so there is no infrastructure to stand up first.


Frequently asked questions

Static analysis inspects source without running the application, so it cannot observe runtime state, timing, real data, or live integrations that often trigger production-only failures.

Issues that depend on runtime configuration, concurrency and timing, migrations, session or token handling, UI state that never persists, and third-party integration behaviors are commonly missed.

Behavioral testing does not replace code review. Code review evaluates the quality of the diff and the code itself, while behavioral testing validates that a real user can complete their task when the change runs. Both belong on the same pull request.

Behavioral testing and scripted end-to-end testing both run the whole app, but behavioral tests are written around user goals and use agents to navigate flows rather than relying on brittle DOM selectors, which reduces maintenance overhead.

Ito connects to your repository, deploys each PR into an isolated environment, runs the impacted user flows, and posts the results back to the PR before anyone merges.

Ito can run the affected flows and post a full testing report within about an hour of opening the PR.

When a flow fails, Ito includes which flow failed, what the agent did, the app response, a video and screenshots, reproduction steps, and the lines of code responsible.

Sources

Faros (2026): The Acceleration Whiplash
