We measure routing-action F1, not authorship certainty.
Benchmark v0.2 reports 0.871 macro F1 and 88.0% routing accuracy on a 500-item seed text corpus. VeracityAPI is judged on whether an agent routes content to the right next action: allow, revise, human_review, or reject.
- Dataset: balanced five-slice text corpus in data/evals/veracityapi_seed_corpus_500.jsonl.
- Labels: seed v0.1 action agreement on allow / revise / human_review labels.
- Metric: macro F1 across supported routing actions; reject has zero support in this seed set.
Confusion matrix
| Expected → Predicted | allow | revise | human_review | reject |
|---|---|---|---|---|
| allow | 190 | 10 | 0 | 0 |
| revise | 20 | 175 | 5 | 0 |
| human_review | 0 | 25 | 75 | 0 |
| reject | 0 | 0 | 0 | 0 |
Artifacts: data/evals/veracityapi_seed_results_v0_1.json, data/evals/veracityapi_seed_metrics_v0_1.csv, and scripts/evals-summary.mjs. This is a transparent seed calibration asset, not a forensic benchmark certification.
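For reproducibility, a minimal Node sketch of how such a confusion matrix could be tallied from the results artifact. The expected_action and predicted_action field names, and the assumption that the file is a JSON array of per-item records, are illustrative and not the published schema:

```js
// Sketch: tally expected vs. predicted routing actions into a confusion matrix.
// Assumes the results file is a JSON array of records with expected_action and
// predicted_action fields; the actual schema of the artifact may differ.
import { readFileSync } from "node:fs";

const results = JSON.parse(
  readFileSync("data/evals/veracityapi_seed_results_v0_1.json", "utf8"),
);

const matrix = {};
for (const { expected_action, predicted_action } of results) {
  matrix[expected_action] ??= {};
  matrix[expected_action][predicted_action] =
    (matrix[expected_action][predicted_action] ?? 0) + 1;
}
console.table(matrix);
```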
Per-action metrics
| Action | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| allow | 0.905 | 0.950 | 0.927 | 200 |
| revise | 0.833 | 0.875 | 0.854 | 200 |
| human_review | 0.938 | 0.750 | 0.833 | 100 |
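As a cross-check, a short Node sketch that re-derives the per-action metrics, the macro F1, and the routing accuracy from the confusion matrix above. Reject is excluded because it has zero support in this seed set:

```js
// Sketch: derive per-action precision/recall/F1, macro F1, and routing accuracy
// from the published confusion matrix. Rows = expected, columns = predicted.
const actions = ["allow", "revise", "human_review"]; // reject omitted: zero support
const matrix = {
  allow:        { allow: 190, revise: 10,  human_review: 0 },
  revise:       { allow: 20,  revise: 175, human_review: 5 },
  human_review: { allow: 0,   revise: 25,  human_review: 75 },
};

let correct = 0;
let total = 0;
const f1s = [];

for (const action of actions) {
  const support = Object.values(matrix[action]).reduce((a, b) => a + b, 0);
  const truePos = matrix[action][action];
  const predicted = actions.reduce((sum, row) => sum + matrix[row][action], 0);
  const precision = truePos / predicted;
  const recall = truePos / support;
  const f1 = (2 * precision * recall) / (precision + recall);
  f1s.push(f1);
  correct += truePos;
  total += support;
  console.log(action, { precision, recall, f1, support });
}

const macroF1 = f1s.reduce((a, b) => a + b, 0) / f1s.length;
console.log({ macroF1, routingAccuracy: correct / total }); // ≈ 0.871 and 0.88
```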
Dataset slices
- 100 human firsthand samples
- 100 dry factual human samples
- 100 generic AI slop samples
- 100 polished AI-with-specifics samples
- 100 edge/mixed/adversarial samples
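A small sketch for verifying that slice balance against the corpus file; the per-line `slice` field name is an assumption about the JSONL schema, not a documented field:

```js
// Sketch: count samples per slice in the seed corpus JSONL.
// Assumes one JSON object per line with a "slice" field; the real field name may differ.
import { readFileSync } from "node:fs";

const lines = readFileSync("data/evals/veracityapi_seed_corpus_500.jsonl", "utf8")
  .split("\n")
  .filter(Boolean);

const counts = {};
for (const line of lines) {
  const { slice } = JSON.parse(line);
  counts[slice] = (counts[slice] ?? 0) + 1;
}
console.log(counts); // expected: 100 samples in each of the five slices
```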
Benchmark v0.2 completed
Headline metric: 0.871 macro F1 on 500 text routing samples. The benchmark is text-first and intentionally limited: it measures routing-action F1, not detector-score accuracy, and not AI-authorship proof.
External comparators
GPTZero and Sapling comparator runs remain pending until API credentials, ToS, and artifact freezing are resolved. Pending means not run, not inferred.
What this proves
The API contract is optimized for agents: inspect evidence, follow recommended_action, and measure reviewer agreement. The next iteration should replace seed expected labels with blind human labels and add competitor outputs where API keys are available.
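A minimal sketch of that contract from the agent side, assuming an illustrative response shape with recommended_action and evidence fields and caller-supplied handlers; this is not the documented VeracityAPI schema:

```js
// Sketch: an agent consumes a verdict and routes on recommended_action.
// The response fields (recommended_action, evidence) and the handler names are
// illustrative assumptions, not the documented VeracityAPI response schema.
async function routeContent(verdict, handlers) {
  switch (verdict.recommended_action) {
    case "allow":
      return handlers.publish(verdict);
    case "revise":
      return handlers.requestRevision(verdict, verdict.evidence);
    case "reject":
      return handlers.block(verdict, verdict.evidence);
    case "human_review":
    default:
      // Unknown actions fall through to a human so routing fails closed.
      return handlers.escalateToHuman(verdict, verdict.evidence);
  }
}
```

Defaulting unrecognized actions to human escalation keeps the agent safe if new routing actions are added before the agent is updated.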
Known limits
- Seed labels are for workflow routing, not truth or authorship adjudication.
- The reject action needs a dedicated abuse/spam corpus before a meaningful reject F1 can be reported.
- Image, audio, and video need separate labeled corpora before publishing comparable metrics.
- External competitor numbers should not be published until credentials, ToS, and artifact freezes are resolved.
- Scores should be paired with local policy and human escalation for high-stakes workflows.
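One way to pair scores with local policy, sketched under the same assumed response shape as above: high-stakes content never auto-publishes on an allow verdict.

```js
// Sketch: a local policy layer on top of the recommended action.
// "stakes" is a caller-side label; the verdict fields remain illustrative assumptions.
function applyLocalPolicy(verdict, { stakes }) {
  if (stakes === "high" && verdict.recommended_action === "allow") {
    // High-stakes workflows route to a human even when the API would allow.
    return { ...verdict, recommended_action: "human_review" };
  }
  return verdict;
}
```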