Garbage-in prevention

Training data curation

Agents can filter a content corpus before fine-tuning, RAG indexing, or synthetic-data generation. VeracityAPI helps keep high-specificity, well-provenanced text and remove generic filler that would teach models bad habits.

Business value

  • Improves downstream model/data quality.
  • Prevents generic site boilerplate and weak articles from contaminating curated datasets.
  • Creates auditable acceptance criteria for training data.

Agent job to be done

Act as a data curator. Score documents or chunks, keep high-trust examples, quarantine medium-risk examples, and reject generic/weak-provenance text for training.

format: article
intended_use: train
domain: training data curation

When to call VeracityAPI

Run during dataset construction, before embeddings, fine-tuning, or distillation jobs.

What text to submit

Submit the document title and body chunk, together with context such as source URL/path and publication date. Keep the full metadata in your pipeline; send only the text and context to VeracityAPI.
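
A minimal sketch of that split in TypeScript; the record and function names are illustrative, and only content and context (with the fields shown on this page) are actual VeracityAPI request fields.

// Illustrative record shape: only "content" and "context" are submitted;
// the rest stays in your own pipeline manifest.
interface PipelineRecord {
  title: string;        // stored in your pipeline
  sourceUrl: string;    // stored in your pipeline
  publishedAt: string;  // stored in your pipeline
  chunk: string;        // the text submitted for analysis
}

function toAnalyzeRequest(rec: PipelineRecord) {
  return {
    type: "text",
    content: rec.chunk,
    context: {
      format: "article",
      intended_use: "train",
      domain: "training data curation",
    },
  };
}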

Decision policy

  • allow: low-risk chunks enter the training/RAG corpus.
  • human_review: medium-risk chunks are quarantined, because training workflows should be stricter than publishing ones.
  • reject: high-risk chunks, or any chunk with provenance_weakness >= 0.70.
  • dedupe: combine with similarity filtering so repeated generic text does not pass by volume (see the sketch below).
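
This policy can be sketched as one function. The response fields (recommended_action, provenance_weakness) are the ones this page references, not a full schema, and the exact-match seen set is a stand-in for real similarity filtering.

// Fields assumed from this page; the full response shape may differ.
interface AnalyzeResult {
  recommended_action: "allow" | "human_review" | "reject" | "revise";
  provenance_weakness: number; // 0..1
}

// Exact-match stand-in for a real similarity filter.
const seen = new Set<string>();

function decide(chunk: string, result: AnalyzeResult): "allow" | "human_review" | "reject" {
  if (seen.has(chunk)) return "reject";                    // dedupe: repeats never pass by volume
  seen.add(chunk);
  if (result.provenance_weakness >= 0.70) return "reject"; // hard provenance gate
  if (result.recommended_action === "allow") return "allow";
  if (result.recommended_action === "reject") return "reject";
  return "human_review";                                   // medium risk defaults to quarantine
}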

Request template

curl https://api.veracityapi.com/v1/analyze \
  -H "Authorization: Bearer DOC_KEY" \
  -H "Content-Type: application/json" \
  -d '{"type":"text","content":"Paste content here","context":{"format":"article","intended_use":"train","domain":"training data curation"}}'
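
The same request from a TypeScript agent via fetch; the endpoint and payload mirror the curl above, while the response fields noted in the comment are assumed from this page rather than a documented schema.

// Same call as the curl template, via fetch (Node 18+ or browsers).
async function analyzeChunk(chunk: string, apiKey: string) {
  const res = await fetch("https://api.veracityapi.com/v1/analyze", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      type: "text",
      content: chunk,
      context: { format: "article", intended_use: "train", domain: "training data curation" },
    }),
  });
  if (!res.ok) throw new Error(`analyze failed: ${res.status}`);
  return res.json(); // expected: recommended_action, content_trust_score, provenance_weakness, evidence
}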

Automation recipe

  • Collect candidate documents.
  • Strip boilerplate and split into chunks.
  • Score each chunk with intended_use=train.
  • Write content_trust_score and evidence to dataset manifest.
  • Only export allow chunks to training/RAG jobs; the sketch below wires these steps together.
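
A sketch of the full recipe, reusing analyzeChunk and decide from the sketches above; the chunking helpers here are naive placeholders for real boilerplate stripping and splitting.

// Placeholders for your own preprocessing tooling.
const stripBoilerplate = (doc: string) => doc.trim();
const splitIntoChunks = (text: string) => text.split(/\n{2,}/);

interface ManifestRow {
  chunk: string;
  content_trust_score: number;
  evidence: unknown;
  action: string;
}

async function curate(documents: string[], apiKey: string) {
  const manifest: ManifestRow[] = [];
  const corpus: string[] = [];
  for (const doc of documents) {
    for (const chunk of splitIntoChunks(stripBoilerplate(doc))) {
      const result = await analyzeChunk(chunk, apiKey); // scored with intended_use=train
      const action = decide(chunk, result);
      manifest.push({
        chunk,
        content_trust_score: result.content_trust_score,
        evidence: result.evidence,
        action,
      });
      if (action === "allow") corpus.push(chunk); // only allow chunks are exported
    }
  }
  return { corpus, manifest }; // corpus -> training/RAG job, manifest -> audit trail
}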

Evidence spans agents should inspect

  • generic educational filler
  • unsupported claims
  • boilerplate repeated across many pages
  • absence of examples that teach the model domain-specific facts
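
A sketch of pulling those spans out of a result. The evidence item shape (label, text) is an assumption, not documented on this page; note that the last item is an absence rather than a span, so it needs a corpus-level check instead of span filtering.

// Assumed evidence item shape.
interface EvidenceSpan {
  label: string;
  text: string;
}

const WEAK_SPAN_LABELS = new Set([
  "generic educational filler",
  "unsupported claims",
  "boilerplate repeated across many pages",
]);

function weakSpans(evidence: EvidenceSpan[]): EvidenceSpan[] {
  return evidence.filter((span) => WEAK_SPAN_LABELS.has(span.label));
}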

Policy pseudocode

// Route each analyzed chunk by the API's recommended_action.
switch (result.recommended_action) {
  case "allow": continueWorkflow(); break;                               // low risk: enters the corpus
  case "revise": rewriteWith(result.evidence, result.recommended_fixes); break;
  case "human_review": queueForHumanReview(result); break;               // medium risk: quarantine
  case "reject": discardOrRebuild(); break;                              // high risk: drop or re-source
}

KPIs to track

  • dataset acceptance rate
  • number of weak chunks removed
  • fine-tune eval improvement
  • RAG answer specificity
  • manual data-review hours saved
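
The first two KPIs fall straight out of the manifest written by the pipeline sketch above; a minimal computation, assuming that row shape.

function datasetKpis(manifest: Array<{ action: string }>) {
  const total = manifest.length;
  const allowed = manifest.filter((row) => row.action === "allow").length;
  const rejected = manifest.filter((row) => row.action === "reject").length;
  return {
    acceptanceRate: total > 0 ? allowed / total : 0, // dataset acceptance rate
    weakChunksRemoved: rejected,                     // number of weak chunks removed
  };
}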

What can go wrong

  • VeracityAPI is not a full data-governance system.
  • It does not check copyright/license rights.
  • Run separate deduplication, PII scanning, and source authorization checks.

Cost and latency notes

Analyze-only costs $0.005 per 1,000 characters; analyze + revise (auto_revise=true) costs $0.010 per 1,000 characters. Both round up to the nearest 1,000 characters, so a 4,500-character chunk bills as 5,000. Short captions and emails usually cost $0.005; longer pages or chapters scale linearly with length. Current v0.1 latency is LLM-bound, so batch/concurrent orchestration is recommended for high-volume pipelines.
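
A sketch of the billing rule as stated: round character count up to the nearest 1,000, then multiply by the per-1,000 rate.

function estimateCostUsd(chars: number, autoRevise = false): number {
  const ratePer1k = autoRevise ? 0.01 : 0.005;     // analyze-only vs analyze + revise
  return Math.ceil(chars / 1000) * ratePer1k;      // round up to nearest 1,000 chars
}
// estimateCostUsd(4500) bills as 5,000 characters: 5 x $0.005 = $0.025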

Agent evaluation checklist