Benchmarking the workflow decision: ship, revise, review, or reject.
This page is the public status surface for 2026-05-benchmark-v1. No competitor numbers are published until the corpus, vendor ToS, metrics, and reproducibility artifacts are frozen.
Planned corpus
- 1000 text samples
- 120 image samples (pilot)
- 80 audio samples (pilot)
Text is the P0 commercial benchmark. Image/audio remain clearly labeled pilots until provenance and vendor support are clean.
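For concreteness, a sketch of how this split might appear in the versioned run manifest; the key names ("modality", "n_samples", "status") and structure are illustrative assumptions, not a frozen schema:

```python
# Hypothetical corpus plan for 2026-05-benchmark-v1.
# Field names and structure are assumed, not a frozen schema.
CORPUS_PLAN = {
    "run_id": "2026-05-benchmark-v1",
    "tracks": [
        {"modality": "text",  "n_samples": 1000, "status": "P0 commercial benchmark"},
        {"modality": "image", "n_samples": 120,  "status": "pilot"},
        {"modality": "audio", "n_samples": 80,   "status": "pilot"},
    ],
}
```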
Publication gates
- Vendor ToS/legal review
- Corpus licensing validation
- Budget approval and API credentials
- Frozen metrics artifacts
Required caveat: No benchmark numbers are published until the run is complete, frozen, cited, and legally cleared.
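A minimal sketch of how these gates could be enforced in release tooling, assuming a simple all-gates-must-pass check; the gate names and the mechanism are assumptions, not the project's actual release process:

```python
# Hypothetical publication-gate check mirroring the list above.
# Gate names and enforcement mechanism are illustrative only.
GATES = {
    "vendor_tos_legal_review": False,
    "corpus_licensing_validated": False,
    "budget_and_credentials": False,
    "metrics_artifacts_frozen": False,
}

def may_publish(gates: dict[str, bool]) -> bool:
    # Every gate must pass before any competitor numbers go public.
    return all(gates.values())

assert not may_publish(GATES)  # numbers stay unpublished while any gate is open
```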
Metrics we will publish
- Binary flagging F1 and false-positive rate.
- Routing macro-F1 for allow/revise/human_review/reject.
- Latency and cost per vendor/run.
- Confusion matrices and versioned run manifest.
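A sketch of how these metrics could be computed, assuming scikit-learn and the label encodings shown below; both are assumptions, and the frozen metrics artifacts will define the authoritative versions:

```python
# Illustrative metric computations; scikit-learn and the label
# conventions (1 = flagged) are assumptions, not project decisions.
from sklearn.metrics import confusion_matrix, f1_score

ACTIONS = ["allow", "revise", "human_review", "reject"]

def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Binary flagging F1 and false-positive rate (1 = flagged)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "f1": f1_score(y_true, y_pred),
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }

def routing_macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Macro-F1 over the four routing actions."""
    return f1_score(y_true, y_pred, labels=ACTIONS, average="macro")
```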
Where Veracity loses
- Text calibration is English-first; non-English coverage is weaker.
- Image/audio scoring is workflow triage, not forensic provenance verification.
Weaknesses stay on the page even if the final benchmark is favorable.
Fairness caveat
Competitors were not designed for VeracityAPI's routing-action metric, so the final report must show standard binary detector metrics alongside routing F1. This is workflow-risk scoring, not forensic authorship proof.
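One way to put competitors on common ground is to collapse the four routing actions into the binary flagging task before scoring. A sketch; the mapping below (only allow counts as unflagged) is an assumption, not a frozen rule:

```python
# Hypothetical collapse of routing actions to the binary flagging task.
# Treating everything except "allow" as flagged is an assumed mapping.
FLAGGED_ACTIONS = {"revise", "human_review", "reject"}

def to_binary(action: str) -> int:
    """1 = flagged for any intervention, 0 = allowed to ship."""
    return int(action in FLAGGED_ACTIONS)
```

Standard binary detector metrics can then be reported from the collapsed labels alongside routing macro-F1.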