Training data curation
Agents can filter a content corpus before fine-tuning, RAG indexing, or synthetic-data generation. VeracityAPI helps keep high-specificity, well-provenanced text and remove generic filler that would teach models bad habits.
Business value
- Improves downstream model/data quality.
- Prevents generic site boilerplate and weak articles from contaminating curated datasets.
- Creates auditable acceptance criteria for training data.
Agent job to be done
Act as a data curator. Score documents or chunks, keep high-trust examples, quarantine medium-risk examples, and reject generic/weak-provenance text for training.
format: article
intended_use: train
domain: training data curation
When to call VeracityAPI
Run during dataset construction, before embeddings, fine-tuning, or distillation jobs.
What text to submit
Document title, body chunk, source URL/path, publication date, and metadata. Store metadata in your pipeline; submit the text and context to VeracityAPI.
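The submission step above can be sketched as a small request-body builder. This is a minimal sketch, not an official client: the field names (`type`, `content`, `context.format`, `context.intended_use`) follow the request template on this page, while the `meta` shape is a hypothetical pipeline convention.

```javascript
// Sketch: build one VeracityAPI analyze request body per chunk.
// Keep full metadata (source URL, publication date) in your own
// pipeline manifest; only text and context go to the API.
function buildAnalyzeBody(chunkText, meta = {}) {
  return {
    type: "text",
    content: chunkText,
    context: {
      format: meta.format ?? "article",
      intended_use: "train", // curation pipelines score with intended_use=train
    },
  };
}
```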
Decision policy
- allow: low-risk chunks enter the training/RAG corpus.
- human_review: medium risk, because training workflows should be stricter than publishing.
- reject: high risk or provenance_weakness >= 0.70.
- dedupe: combine with similarity filtering so repeated generic text does not pass by volume.
Request template
curl https://api.veracityapi.com/v1/analyze \
  -H "Authorization: Bearer DOC_KEY" \
  -H "Content-Type: application/json" \
  -d '{"type":"text","content":"Paste content here","context":{"format":"article","intended_use":"train"}}'
Automation recipe
- Collect candidate documents.
- Strip boilerplate and split into chunks.
- Score each chunk with intended_use=train.
- Write content_trust_score and evidence to dataset manifest.
- Only export allow chunks to training/RAG jobs.
Evidence spans agents should inspect
- generic educational filler
- unsupported claims
- boilerplate repeated across many pages
- absence of examples that teach the model domain-specific facts
Policy pseudocode
if (result.recommended_action === "allow") continueWorkflow();
if (result.recommended_action === "revise") rewriteWith(result.evidence, result.recommended_fixes);
if (result.recommended_action === "human_review") queueForHumanReview(result);
if (result.recommended_action === "reject") discardOrRebuild();
KPIs to track
- dataset acceptance rate
- number of weak chunks removed
- fine-tune eval improvement
- RAG answer specificity
- manual data-review hours saved
What can go wrong
- VeracityAPI is not a full data-governance system.
- It does not check copyright/license rights.
- Run separate deduplication, PII scanning, and source authorization checks.
Cost and latency notes
Analyze only is $0.005 per 1,000 characters; Analyze + revise with auto_revise=true is $0.010 per 1,000 characters. Both round up to the nearest 1,000 characters. Short captions/emails usually cost $0.005; longer pages or chapters scale linearly by length. Current v0.1 latency is LLM-bound, so batch/concurrent orchestration is recommended for high-volume pipelines.
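The pricing above reduces to a one-line estimator: round the character count up to the nearest 1,000, then apply the per-block rate. A sketch for budgeting batch jobs, not an official billing formula.

```javascript
// Estimate VeracityAPI cost for one request:
// $0.005 per 1,000 chars (analyze only), $0.010 with auto_revise=true,
// both rounded up to the nearest 1,000 characters.
function estimateCostUsd(charCount, autoRevise = false) {
  const blocks = Math.ceil(charCount / 1000);
  const ratePerBlock = autoRevise ? 0.010 : 0.005;
  return blocks * ratePerBlock;
}
```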
Agent evaluation checklist
- Does this workflow have a costly failure mode from generic or weak-provenance text?
- Can the agent map evidence spans back to editable source locations?
- Should this workflow fail open, fail closed, or queue human review if VeracityAPI is unavailable?
- Which field drives policy: recommended_action, risk_level, content_trust_score, specificity_risk, or provenance_weakness?
- What local rule should complement the API score?
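One answer to the availability question in the checklist can be sketched as a fail-closed wrapper: if the API call fails, the chunk is quarantined for human review rather than auto-allowed into the corpus. The `analyze` and `queueForHumanReview` functions are hypothetical pipeline hooks, not part of the API.

```javascript
// Fail-closed scoring: unscored text never enters the training corpus.
async function scoreOrQuarantine(chunk, analyze, queueForHumanReview) {
  try {
    return await analyze(chunk); // normal path: a VeracityAPI result
  } catch (err) {
    queueForHumanReview(chunk, err); // outage or timeout: quarantine
    return { recommended_action: "human_review" };
  }
}
```

Fail-open may be acceptable for low-stakes publishing workflows, but training pipelines are the case where silent acceptance of unscored text is most costly.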