Training data curation
Agents can filter a content corpus before fine-tuning, RAG indexing, or synthetic-data generation. VeracityAPI helps keep high-specificity, well-provenanced text and remove generic filler that would teach models bad habits.
Business value
- Improves downstream model/data quality.
- Prevents generic site boilerplate and weak articles from contaminating curated datasets.
- Creates auditable acceptance criteria for training data.
Agent job to be done
Act as a data curator. Score documents or chunks, keep high-trust examples, quarantine medium-risk examples, and reject generic/weak-provenance text for training.
format: article
intended_use: train
domain: training data curation
When to call VeracityAPI
Run during dataset construction, before embeddings, fine-tuning, or distillation jobs.
What text to submit
Document title, body chunk, source URL/path, publication date, and metadata. Store metadata in your pipeline; submit the text and context to VeracityAPI.
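The submission step above can be sketched as a small request-body builder. This is a minimal sketch, not an official client: the field names (`type`, `content`, `context.format`, `context.intended_use`) follow the request template on this page, while the `meta` shape is a hypothetical pipeline convention.

```javascript
// Sketch: build one VeracityAPI analyze request body per chunk.
// Keep full metadata (source URL, publication date) in your own
// pipeline manifest; only text and context go to the API.
function buildAnalyzeBody(chunkText, meta = {}) {
  return {
    type: "text",
    content: chunkText,
    context: {
      format: meta.format ?? "article",
      intended_use: "train", // curation pipelines score with intended_use=train
    },
  };
}
```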
Decision policy
- allow: low-risk chunks enter the training/RAG corpus.
- human_review: medium risk, because training workflows should be stricter than publishing.
- reject: high risk or provenance_weakness >= 0.70.
- dedupe: combine with similarity filtering so repeated generic text does not pass by volume.
Request template
curl https://api.veracityapi.com/v1/analyze \
  -H "Authorization: Bearer DOC_KEY" \
  -H "Content-Type: application/json" \
  -d '{"type":"text","content":"Paste content here","context":{"format":"article","intended_use":"train"}}'
Automation recipe
- Collect candidate documents.
- Strip boilerplate and split into chunks.
- Score each chunk with intended_use=train.
- Write content_trust_score and evidence to dataset manifest.
- Only export allow chunks to training/RAG jobs.
Evidence spans agents should inspect
- generic educational filler
- unsupported claims
- boilerplate repeated across many pages
- absence of examples that teach the model domain-specific facts
Policy pseudocode
if (result.recommended_action === "allow") continueWorkflow();
if (result.recommended_action === "revise") rewriteWith(result.evidence, result.recommended_fixes);
if (result.recommended_action === "human_review") queueForHumanReview(result);
if (result.recommended_action === "reject") discardOrRebuild();
KPIs to track
- dataset acceptance rate
- number of weak chunks removed
- fine-tune eval improvement
- RAG answer specificity
- manual data-review hours saved
What can go wrong
- VeracityAPI is not a full data-governance system.
- It does not check copyright/license rights.
- Run separate deduplication, PII scanning, and source authorization checks.
Cost and latency notes
Analyze only is $0.005 per 1,000 characters; Analyze + revise with auto_revise=true is $0.010 per 1,000 characters. Both round up to the nearest 1,000 characters. Short captions/emails usually cost $0.005; longer pages or chapters scale linearly by length. Current v0.1 latency is LLM-bound, so batch/concurrent orchestration is recommended for high-volume pipelines.
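The pricing above reduces to a one-line estimator: round the character count up to the nearest 1,000, then apply the per-block rate. A sketch for budgeting batch jobs, not an official billing formula.

```javascript
// Estimate VeracityAPI cost for one request:
// $0.005 per 1,000 chars (analyze only), $0.010 with auto_revise=true,
// both rounded up to the nearest 1,000 characters.
function estimateCostUsd(charCount, autoRevise = false) {
  const blocks = Math.ceil(charCount / 1000);
  const ratePerBlock = autoRevise ? 0.010 : 0.005;
  return blocks * ratePerBlock;
}
```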
Agent evaluation checklist
- Does this workflow have a costly failure mode from generic or weak-provenance text?
- Can the agent map evidence spans back to editable source locations?
- Should this workflow fail open, fail closed, or queue human review if VeracityAPI is unavailable?
- Which field drives policy: recommended_action, risk_level, content_trust_score, specificity_risk, or provenance_weakness?
- What local rule should complement the API score?
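One answer to the availability question in the checklist can be sketched as a fail-closed wrapper: if the API call fails, the chunk is quarantined for human review rather than auto-allowed into the corpus. The `analyze` and `queueForHumanReview` functions are hypothetical pipeline hooks, not part of the API.

```javascript
// Fail-closed scoring: unscored text never enters the training corpus.
async function scoreOrQuarantine(chunk, analyze, queueForHumanReview) {
  try {
    return await analyze(chunk); // normal path: a VeracityAPI result
  } catch (err) {
    queueForHumanReview(chunk, err); // outage or timeout: quarantine
    return { recommended_action: "human_review" };
  }
}
```

Fail-open may be acceptable for low-stakes publishing workflows, but training pipelines are the case where silent acceptance of unscored text is most costly.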