Why we publish this
Anyone can build a dashboard. Anyone can call a query a “detector.” The thing that separates a tool you actually trust with a $5,000 leak from one you stop opening after a week is simple: can you check the math yourself?
So we publish four things, per detector, for every Instirio account:
- Brier score — does the probability we attach to each finding match the rate at which findings turn out to be real?
- Precision & recall — of the things we flagged, how many were genuine? Of all the real issues that happened, how many did we catch?
- Realised-vs-predicted savings — when we say a finding is worth $1,840, what does the backfill show 60 days later?
- The list of detectors that failed our own tests — what we tried, why it didn’t work, what we shipped instead.
Any competitor could publish these same numbers. Almost none of them do. That’s why this page exists.
The four measurements we run on every detector.
Brier score
The Brier score measures how well our confidence numbers match reality. It is the average squared gap between the probability we attach to a finding and what actually happened (1 if the finding was real, 0 if it wasn’t), so lower is better and 0 is perfect. If we say something is 80% likely and it turns out true 80% of the time, the Brier score is low. We publish one per detector, and we break it down by confidence bucket so you can spot where we’re too sure of ourselves.
Example: Stuck-order detector is currently at 0.11. Card-testing fraud is at 0.07 (well calibrated). The new COGS-drift detector is at 0.24 (we’re still tuning).
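If you want to check a Brier score yourself, a minimal sketch follows. It assumes you have, for each finding, the probability that was attached to it and whether it turned out to be real. The function names, the five-bucket split, and the sample data are all hypothetical, and this is not a description of how Instirio computes its numbers.

```python
from collections import defaultdict


def brier_score(predictions, outcomes):
    """Mean squared gap between each predicted probability and its 0/1 outcome."""
    if len(predictions) != len(outcomes) or not predictions:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(predictions, outcomes)) / len(predictions)


def calibration_by_bucket(predictions, outcomes, n_buckets=5):
    """Group findings by predicted confidence and compare the mean
    prediction in each bucket with the rate at which findings were real."""
    buckets = defaultdict(list)
    for p, o in zip(predictions, outcomes):
        idx = min(int(p * n_buckets), n_buckets - 1)  # p == 1.0 goes in the top bucket
        buckets[idx].append((p, o))
    rows = []
    for idx in sorted(buckets):
        pairs = buckets[idx]
        rows.append({
            "bucket": idx,
            "count": len(pairs),
            "mean_predicted": sum(p for p, _ in pairs) / len(pairs),
            "observed_rate": sum(o for _, o in pairs) / len(pairs),
        })
    return rows


# Hypothetical findings: probability we attached, and whether each was real.
predicted = [0.9, 0.8, 0.8, 0.3, 0.2, 0.1]
was_real = [1, 1, 0, 0, 0, 0]

score = brier_score(predicted, was_real)
rows = calibration_by_bucket(predicted, was_real)
print(f"Brier score: {score:.3f}")
for row in rows:
    print(row)
```

A bucket where `mean_predicted` sits well above `observed_rate` is a bucket where the detector is too sure of itself.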
Precision & recall
Two numbers, both worth knowing. Precision: of the alerts we send, how many were the real thing? Recall: of the real problems that happened, how many did we catch? We track both. Most tools quietly chase one and let the other slide.
Example: Defective-batch detector currently 0.84 precision, 0.71 recall. We’re willing to miss 29% of cases to keep false alarms to 16% of the alerts we send.
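The arithmetic behind those two numbers is short enough to check by hand. Here is a sketch with hypothetical counts chosen to land near the example above. The counts and names are assumptions for illustration only.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision: share of alerts that were real.
    Recall: share of real issues that got an alert."""
    flagged = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts for one detector over one review period.
tp, fp, fn = 213, 41, 87

precision, recall = precision_recall(tp, fp, fn)
false_alert_share = 1 - precision  # share of alerts that were not real
missed_share = 1 - recall          # share of real issues with no alert

print(f"precision={precision:.2f} recall={recall:.2f}")
print(f"false alerts={false_alert_share:.0%} missed={missed_share:.0%}")
```

Note that `1 - precision` is the share of alerts that were wrong. A false-positive rate in the strict sense would also need the count of healthy events that were correctly left alone, which this sketch does not use.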
Merchant-level A/B
Before a detector ships, we run it as an experiment. A random group of merchants gets the alerts, a control group doesn’t. Sixty days later we compare losses on both sides. If the gap isn’t real, the detector doesn’t ship.
Example: Three of the eleven detectors we tested in Q1 2026 failed this bar. We didn’t ship them. The list of failures is further down this page, under “Detectors that failed our own tests.”
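One common way to decide whether a gap between two groups is “real” is a permutation test: shuffle the group labels many times and see how often chance alone produces a gap as large as the observed one. The sketch below is an assumption about how such a check could look, not a statement of the test we run. The loss figures, the function name, and the 0.05 cut-off are all hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(control, treated, n_permutations=2000, seed=0):
    """One-sided p-value for 'treated merchants lose less than control merchants'."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed_gap = mean(control) - mean(treated)
    pooled = list(control) + list(treated)
    n_control = len(control)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        gap = mean(pooled[:n_control]) - mean(pooled[n_control:])
        if gap >= observed_gap:
            at_least_as_large += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return observed_gap, (at_least_as_large + 1) / (n_permutations + 1)


# Hypothetical 60-day losses per merchant, in dollars.
control_losses = [5200, 4800, 6100, 5500, 4900, 5800]  # no alerts
treated_losses = [3900, 4200, 3600, 4500, 4100, 3800]  # received alerts

gap, p_value = permutation_p_value(control_losses, treated_losses)
ships = p_value < 0.05  # hypothetical shipping bar
print(f"gap=${gap:,.0f} p={p_value:.4f} ships={ships}")
```

With only six merchants per group this is a toy. A real experiment would need far more merchants and a bar agreed before the data comes in.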
Realised vs predicted $
When a finding says “$1,840” on it, we follow up. Sixty days after you take action, we measure what you actually recovered. Sometimes it’s $1,840. Sometimes $400. Sometimes $3,200. Each detector has its own realised-vs-predicted ratio. So does the site as a whole.
Example: Fulfillment-bottleneck detector is currently realising 78% of predicted savings, so merchants recover about 22% less than we estimated. The bias we tune for is under-promising, which would put this ratio at or above 100%.
Every detector, every number.
The numbers below are illustrative and only show the format. Your dashboard shows live numbers for your data, refreshed weekly.
| Detector | Brier | Precision | Recall | $ realised / predicted |
|---|---|---|---|---|
| Stuck order | 0.11 | 0.91 | 0.83 | 82% |
| Card-testing fraud | 0.07 | 0.97 | 0.79 | 94% |
| Serial returner | 0.16 | 0.81 | 0.66 | 71% |
| Defective batch | 0.14 | 0.84 | 0.71 | 88% |
| Carrier-zone drift | 0.19 | 0.74 | 0.68 | 76% |
| COGS drift per SKU | 0.24 | 0.69 | 0.74 | 63% (tuning) |
| Subscription churn | 0.13 | 0.87 | 0.72 | 85% |
| Inventory oversell | 0.09 | 0.93 | 0.81 | 91% |
| Volume spike (viral) | 0.21 | 0.78 | 0.69 | 79% |
Showing 9 of 37 detectors as an illustrative cross-section. The full table appears inside the product, refreshed every Monday after the weekend’s backfill job completes.
What it looks like when we’re wrong about a dollar.
That “$1,840” number rests on a model, and models miss things in interesting ways. Three patterns we watch closely:
We under-predicted
The action you took recovered more than we estimated. This is common on stuck-order findings (downstream effects we don’t model).
We over-predicted
The action recovered less than we estimated. This is common on COGS-drift findings (margin recovery hits other costs we don’t see).
You didn’t act
No action taken in 60 days. We exclude these from realised-savings — but we report the count so you can see what’s being ignored.
The site-wide realised-vs-predicted ratio across all 37 detectors currently sits around 78% (illustrative). A number below 100% means findings recovered less than we predicted. We tune for under-promising, so the target is a ratio at or above 100%.
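The three patterns above roll up into one ratio. Here is a sketch of that roll-up, assuming the ratio is total realised dollars over total predicted dollars for findings that were acted on. The `Finding` type, the pooling choice, and the sample figures are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    predicted_usd: float
    realised_usd: Optional[float]  # None means no action was taken in the 60 days


def realised_vs_predicted(findings):
    """Return (ratio, ignored_count). Findings with no action are left out
    of the ratio but counted, so the ignored volume stays visible."""
    acted = [f for f in findings if f.realised_usd is not None]
    ignored_count = len(findings) - len(acted)
    predicted_total = sum(f.predicted_usd for f in acted)
    if predicted_total == 0:
        return None, ignored_count
    realised_total = sum(f.realised_usd for f in acted)
    return realised_total / predicted_total, ignored_count


# Hypothetical findings: three acted on, one ignored.
findings = [
    Finding(1840, 1840),  # recovered what was predicted
    Finding(1840, 400),   # we over-predicted
    Finding(1840, 3200),  # we under-predicted
    Finding(1200, None),  # no action taken
]

ratio, ignored = realised_vs_predicted(findings)
print(f"realised/predicted = {ratio:.0%}, ignored findings = {ignored}")
```

Pooling dollars before dividing lets large findings dominate. Averaging per-finding ratios instead would weight every finding equally, and the two can give different answers.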
Honest about every gap.
Three numbers people keep asking us to publish. We don’t. Here’s why:
- Cross-merchant accuracy comparisons. Detector accuracy varies wildly by merchant — your AOV, your category, your order velocity all shift the calibration. Publishing a single number across all merchants would mislead.
- ROI calculators. Any ROI number we’d publish is a function of your specific leak surface, which we can’t know before you connect. We’d rather show you your actual realised savings after 30 days than guess.
- Industry benchmarks. We don’t aggregate your data into “Apparel stores like yours lose X%” benchmarks. The data is yours; we don’t monetise it.
Detectors that failed our own tests.
Detectors we proposed, tested, and didn’t ship — with the reason. We publish this so you can audit our judgment, not just our wins.
“First-time customer fraud score” — failed precision bar
Proposed Q4 2025. Tested on a holdout of 14 merchants for 90 days. Precision came in at 0.41 — meaning more than half of flagged orders were legitimate first-time buyers. We pulled it; the false-positive cost on a DTC merchant’s welcome experience was higher than the fraud loss we’d catch.
“Marketing channel ROI attribution” — out of scope
Proposed Q1 2026. Worked technically but the data we need (multi-touch attribution from the merchant’s ad platforms) lives outside our integration surface. We’d be guessing at attribution, and there are better tools for it.
“Sentiment from product reviews” — failed realised-savings bar
Proposed Q1 2026. Detector worked (0.79 precision on flagging product-quality complaints). But the realised-savings backfill came in at 8% — merchants almost never acted on review-sentiment alerts because the data was already in their helpdesk. We didn’t ship.
Common questions.
Where do I see these numbers for my own account?
Inside the product, on the “Accuracy” tab of each detector. Brier score, precision, recall, and realised-vs-predicted are refreshed every Monday to include the findings whose 60-day backfill window closed during the prior week.
What’s a “good” Brier score for a detector like this?
For binary detection of operational events, a detector that always answered “50% likely” would score 0.25, so anything under 0.25 beats that baseline, and we treat anything under 0.15 as well calibrated. Our published numbers range from 0.07 to 0.24 depending on detector maturity. New detectors typically start in the 0.20–0.30 range and tune down over 3–6 months.
Does Instirio use my data to train models that go to other merchants?
No. Models are calibrated per-merchant. Your event volumes, AOV distributions, and category mix shift detector thresholds for your account only. We don’t pool your data into a shared model and we don’t sell it.
What happens if a detector’s accuracy drops over time?
It shows up on the Accuracy tab the next Monday. If the drop is severe — Brier moving above 0.30 or precision below 0.50 — we automatically demote the detector to “monitor only” mode (no alerts fired) and your account gets a notification. We’d rather not alert you than alert you with bad signal.
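The demotion rule is simple enough to write down. This sketch uses the two thresholds stated above. The function name and the mode labels are hypothetical.

```python
def detector_mode(brier, precision, max_brier=0.30, min_precision=0.50):
    """Return 'monitor_only' when either threshold is breached, else 'alerting'."""
    if brier > max_brier or precision < min_precision:
        return "monitor_only"  # no alerts fired; the account gets a notification
    return "alerting"


# Illustrative values taken from the table and the failed-detector list on this page.
print(detector_mode(brier=0.24, precision=0.69))  # COGS drift: still alerting
print(detector_mode(brier=0.31, precision=0.90))  # Brier above 0.30
print(detector_mode(brier=0.10, precision=0.41))  # precision below 0.50
```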
Can I export the methodology + accuracy data?
Yes, on every plan. CSV export of detector accuracy + backfill data is available from the Accuracy tab. Plug it into your own audit workflow or share it with your auditor.
See your detector accuracy in your dashboard.
Connect your store, run for 30 days, and the Accuracy tab populates with your real numbers — calibrated to your data, not an industry average.