Building an AI that simulates QA engineers at scale
December 10, 2025 · 12 min read
What makes a good QA engineer?
A senior QA engineer doesn't just click through happy paths. They think adversarially. They know that payment forms break on iOS when the keyboard is open. They know that auth tokens expire at exactly the wrong moment. They know that pagination breaks when filters are applied in a specific order.
This institutional knowledge — accumulated over years of catching bugs — is what we tried to encode into ARI's AI.
The training data problem
The hardest part wasn't the model architecture. It was the data.
We needed examples of *real* bugs that slipped through manual QA and caused production incidents. We partnered with 12 engineering teams who gave us access to their incident post-mortems, bug trackers, and deployment histories — anonymized and stripped of PII.
From this, we built a dataset of 54,000 labeled examples. Each example pairs a crawl snapshot with the bug present in it and a label for whether that bug caused a production incident.
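To make the shape of one labeled example concrete, here is a minimal sketch of such a record. The class name, field names, and sample values are all hypothetical, chosen for illustration; this is not ARI's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledExample:
    """One training record. All field names are hypothetical."""
    snapshot_id: str        # reference to a crawl snapshot
    bug_description: str    # the bug present in that snapshot
    caused_incident: bool   # whether the bug caused a production incident


def incident_rate(examples):
    """Return the fraction of examples whose bug caused an incident."""
    if not examples:
        return 0.0
    return sum(e.caused_incident for e in examples) / len(examples)


# Made-up sample records, for illustration only.
examples = [
    LabeledExample("snap-001", "payment form unusable with iOS keyboard open", True),
    LabeledExample("snap-002", "misaligned footer icon", False),
]
rate = incident_rate(examples)
```

Keeping the incident label as its own field, separate from the bug description, is what lets the same record answer both "was there a bug?" and "did it matter in production?".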
The model
We fine-tuned on top of a base reasoning model, with a custom architecture for multi-modal input (screenshots, DOM trees, network logs, and console errors all feed in simultaneously).
The key insight was treating bug detection as a *ranking problem*, not a binary classification. Rather than asking "is there a bug?", we ask "rank these 200 potential issues by likelihood of causing a production incident." This dramatically reduced false positives.
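The difference between the two framings can be sketched in a few lines. This is an illustration under assumptions, not ARI's implementation: the `Candidate` type, the `incident_score` field, and the scores themselves are hypothetical stand-ins for whatever the model produces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    issue: str
    incident_score: float  # hypothetical model score; higher means more likely


def flag_by_threshold(candidates, threshold):
    """Binary framing: every candidate at or above the threshold is flagged."""
    return [c for c in candidates if c.incident_score >= threshold]


def rank_candidates(candidates, top_k):
    """Ranking framing: sort by score and surface only the top_k candidates."""
    ranked = sorted(candidates, key=lambda c: c.incident_score, reverse=True)
    return ranked[:top_k]


# Made-up scores, for illustration only.
candidates = [
    Candidate("auth token expires mid-checkout", 0.91),
    Candidate("pagination resets after filter change", 0.64),
    Candidate("tooltip overlaps label", 0.52),
    Candidate("footer link color off", 0.51),
]

flagged = flag_by_threshold(candidates, threshold=0.5)
top = rank_candidates(candidates, top_k=2)
```

With these sample scores, the threshold flags all four candidates, while the ranking surfaces only the two most likely to matter. The number of alerts is bounded by `top_k` rather than by how many scores happen to clear a cutoff.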
94% accuracy — what that actually means
Our 94% figure is specifically for critical bugs — the ones that would cause a production incident or measurable revenue loss if shipped.
For lower-severity issues, accuracy is lower (~78%). We think that's acceptable: engineers are less likely to treat a low-severity flag as an alarm, so a wrong call there costs less.
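False alerts among critical-bug verdicts: 2.8%. This means if ARI says NOT SAFE, it's right 97.2% of the time. (Strictly speaking, this is the share of alerts that are wrong, not the share of safe builds that get flagged.)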
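"Right 97.2% of the time when it says NOT SAFE" is a statement about precision: of all the alerts raised, how many were real. The sketch below shows how that relates to the share of wrong alerts, and how both differ from the rate at which safe builds get flagged. The counts are invented to make the arithmetic easy to follow; they are not ARI's evaluation data.

```python
def alert_metrics(tp, fp, tn):
    """Compute alert-quality metrics from confusion-matrix counts.

    tp: NOT SAFE verdicts where a critical bug was really present
    fp: NOT SAFE verdicts where no critical bug was present
    tn: SAFE verdicts where no critical bug was present
    """
    alerts = tp + fp      # everything flagged NOT SAFE
    negatives = fp + tn   # everything that was actually safe
    return {
        "precision": tp / alerts,               # how often an alert is right
        "false_discovery_rate": fp / alerts,    # how often an alert is wrong
        "false_positive_rate": fp / negatives,  # how often a safe build is flagged
    }


# Made-up counts, for illustration only.
metrics = alert_metrics(tp=972, fp=28, tn=9000)
```

With these counts, precision is 972 / 1000 = 0.972 and the share of wrong alerts is 28 / 1000 = 0.028. The two always sum to 1. The rate at which safe builds get flagged is 28 / 9028, a different and much smaller number, because it depends on how many safe builds there are.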
What we still get wrong
Bugs that require specific account state to reproduce (e.g., "only happens if you have more than 500 items in your cart") are hard for ARI to catch. We're working on stateful simulation to handle these.
Bugs that only appear under load are outside our current scope. That's a different problem space — we recommend complementing ARI with load testing for high-traffic events.