Benchmarking the workflow decision: ship, revise, review, or reject.
This page is the public status surface for 2026-05-benchmark-v1. No competitor numbers are published until the corpus, vendor ToS, metrics, and reproducibility artifacts are frozen.
Planned corpus
- 1000 text samples
- 120 image samples (pilot)
- 80 audio samples (pilot)
Text is the P0 commercial benchmark. Image/audio remain clearly labeled pilots until provenance and vendor support are clean.
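For concreteness, a sketch of how this split might appear in the versioned run manifest; the key names ("modality", "n_samples", "status") and structure are illustrative assumptions, not a frozen schema:

```python
# Hypothetical corpus plan for 2026-05-benchmark-v1.
# Field names and structure are assumed, not a frozen schema.
CORPUS_PLAN = {
    "run_id": "2026-05-benchmark-v1",
    "tracks": [
        {"modality": "text",  "n_samples": 1000, "status": "P0 commercial benchmark"},
        {"modality": "image", "n_samples": 120,  "status": "pilot"},
        {"modality": "audio", "n_samples": 80,   "status": "pilot"},
    ],
}
```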
Publication gates
- Vendor ToS/legal review
- Corpus licensing validation
- Budget approval and API credentials
- Frozen metrics artifacts
Required caveat: No benchmark numbers are published until the run is complete, frozen, cited, and legally cleared.
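A minimal sketch of how these gates could be enforced in release tooling, assuming a simple all-gates-must-pass check; the gate names and the mechanism are assumptions, not the project's actual release process:

```python
# Hypothetical publication-gate check mirroring the list above.
# Gate names and enforcement mechanism are illustrative only.
GATES = {
    "vendor_tos_legal_review": False,
    "corpus_licensing_validated": False,
    "budget_and_credentials": False,
    "metrics_artifacts_frozen": False,
}

def may_publish(gates: dict[str, bool]) -> bool:
    # Every gate must pass before any competitor numbers go public.
    return all(gates.values())

assert not may_publish(GATES)  # numbers stay unpublished while any gate is open
```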
Metrics we will publish
- Binary flagging F1 and false-positive rate.
- Routing macro-F1 for allow/revise/human_review/reject.
- Latency and cost per vendor/run.
- Confusion matrices and versioned run manifest.
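A sketch of how these metrics could be computed, assuming scikit-learn and the label encodings shown below; both are assumptions, and the frozen metrics artifacts will define the authoritative versions:

```python
# Illustrative metric computations; scikit-learn and the label
# conventions (1 = flagged) are assumptions, not project decisions.
from sklearn.metrics import confusion_matrix, f1_score

ACTIONS = ["allow", "revise", "human_review", "reject"]

def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Binary flagging F1 and false-positive rate (1 = flagged)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "f1": f1_score(y_true, y_pred),
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }

def routing_macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Macro-F1 over the four routing actions."""
    return f1_score(y_true, y_pred, labels=ACTIONS, average="macro")
```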
Where Veracity loses
- Text calibration is English-first; non-English coverage is weaker.
- Image/audio scoring is workflow triage, not forensic provenance verification.
Weaknesses stay on the page even if the final benchmark is favorable.
Fairness caveat
Competitors were not designed for VeracityAPI's routing-action metric, so the final report must show standard binary detector metrics alongside routing F1. This is workflow-risk scoring, not forensic authorship proof.
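One way to put competitors on common ground is to collapse the four routing actions into the binary flagging task before scoring. A sketch; the mapping below (only allow counts as unflagged) is an assumption, not a frozen rule:

```python
# Hypothetical collapse of routing actions to the binary flagging task.
# Treating everything except "allow" as flagged is an assumed mapping.
FLAGGED_ACTIONS = {"revise", "human_review", "reject"}

def to_binary(action: str) -> int:
    """1 = flagged for any intervention, 0 = allowed to ship."""
    return int(action in FLAGGED_ACTIONS)
```

Standard binary detector metrics can then be reported from the collapsed labels alongside routing macro-F1.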