Methodology

How we keep AI honest

Every AI-powered tool on this site is automatically tested against realistic veteran scenarios every 6 hours. Here's what we check for, what we refuse to generate, and how you can verify it yourself.

Why this page exists

Veterans are right to distrust “AI” marketing. Models hallucinate citations. They drift under load. They cite the wrong CFR section. A drafted nexus letter that says “possibly related” instead of “at least as likely as not” gets auto-denied under 38 CFR 3.102. We publish what we test for so the claim is falsifiable.

Three layers of defense

Layer 1 · Prompt

Every tool starts with the same injection guard

What it is: A shared prompt guard is injected into all 13 AI tools. It enumerates what the model is forbidden to produce: no invented statistics, no fabricated CFR citations, no unexpanded acronyms outside a 12-item whitelist, no computed-duration drift, no markdown in plain-text fields. An 8th-grade reading level is a hard constraint.

Why it matters: VA claim outputs get read under stress by people who are not lawyers. A single fabricated citation — "38 CFR 3.655" when the rater is looking at 3.310 — can undo the whole package. Pre-flight rules are cheaper than downstream scrubbing.

Proof: src/lib/ai/prompt-injection-guard.ts · imported by every /api route.
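A minimal sketch of what such a shared guard might look like. The function name, the acronym list, and the rule wording are all illustrative assumptions, not the actual contents of prompt-injection-guard.ts:

```typescript
// Hypothetical sketch: a block of hard rules prepended to every
// tool's system prompt. The acronym whitelist is invented for
// illustration; only its 12-item size comes from the text above.
const ALLOWED_ACRONYMS = [
  "VA", "CFR", "PTSD", "TDIU", "SMC", "BVA",
  "OSA", "ROM", "VSO", "C&P", "PII", "CI",
] as const; // 12-item whitelist

function buildGuardPreamble(): string {
  return [
    "HARD RULES (non-negotiable):",
    "- Never invent statistics, grant rates, or dollar figures.",
    "- Cite only real 38 CFR sections; quote them verbatim.",
    `- Expand every acronym except: ${ALLOWED_ACRONYMS.join(", ")}.`,
    "- Never compute durations from dates; repeat dates verbatim.",
    "- Plain text only: no markdown syntax of any kind.",
    "- Write at an 8th-grade reading level.",
  ].join("\n");
}
```

The point of centralizing the rules is that a fix lands in all 13 tools at once instead of drifting per-prompt.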

Layer 2 · Output

Buffered JSON, validated, scrubbed, then sent

What it is: Instead of streaming raw model output to the browser, every AI response is buffered through a shared runJsonTool() runner. It Zod-validates the shape, runs a per-tool postValidate hook (forces § 3.310 on secondaries, normalizes framework names on presumptive, strips computed durations on all tools), then a universal scrubMarkdownInPlace() pass removes any markdown leakage that slipped through the prompt guard.

Why it matters: Streaming leaves no window to catch a hallucination mid-flight. Buffering and validating first means the model can never emit a value that the UI silently renders as undefined, or that the vet reads as a false-hope number.

Proof: src/lib/ai/run-json-tool.ts · 13 tools migrated from streaming to buffered.
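The buffered-runner pattern can be sketched roughly as below. This is an assumption about the shape, not the real runJsonTool() signature; the real runner uses Zod, which a hand-rolled validator stands in for here so the sketch has no dependencies:

```typescript
// Illustrative buffered runner: buffer → validate → post-validate →
// scrub → return. All names here are assumptions for illustration.
type Validator<T> = (value: unknown) => T;

async function runJsonTool<T extends Record<string, unknown>>(
  callModel: () => Promise<string>, // buffers the full completion
  validate: Validator<T>,           // stands in for a Zod schema.parse()
  postValidate: (parsed: T) => T,   // per-tool invariant hook
): Promise<T> {
  const raw = await callModel();            // 1. buffer, never stream
  const parsed = validate(JSON.parse(raw)); // 2. shape validation
  const checked = postValidate(parsed);     // 3. tool-specific fixes
  scrubMarkdownInPlace(checked);            // 4. strip markdown leakage
  return checked;
}

function scrubMarkdownInPlace(obj: Record<string, unknown>): void {
  for (const [key, value] of Object.entries(obj)) {
    if (typeof value === "string") {
      // Remove bold/emphasis markers, inline code, and leading headings.
      obj[key] = value.replace(/\*\*|__|`|^#+\s+/gm, "");
    }
  }
}
```

Because every tool funnels through one runner, the scrub pass cannot be forgotten on a new endpoint.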

Layer 3 · CI

Persona-drift evals run against production every 6 hours

What it is: A GitHub Action replays a library of 30+ veteran personas against every AI endpoint on the live site, 3 runs per persona, and scores each response against invariants: "if this vet has ROM 0–130°, the rating-gap tool MUST NOT suggest supplemental_new_evidence." 700+ invariants across 13 tools. The action also fires on every push to main that touches AI code, as a pre-deploy gate.

Why it matters: LLMs drift silently. A model provider can change behavior overnight. Without an automated drift check, you learn about regressions from veterans who filed bad claims. With one, you learn from a red check mark in the Actions tab.

Proof: scripts/eval/personas-*.ts + run-*.ts · .github/workflows/eval-ai-tools.yml

The hard rules

No invented CFR citations

Every citation is a real 38 CFR section. Common ones (§ 3.310 secondary, § 4.16 TDIU, § 3.309(f) Camp Lejeune, § 3.102 benefit-of-the-doubt, § 4.130 mental-health schedule) are written directly into the relevant tool prompts so the model quotes them verbatim rather than paraphrasing.

No hallucinated BVA citation numbers

The Precedent tool and decoder both run a BVA-citation allow-list scrubber — any "Citation Nr XXXXX" in the output that is not in the indexed precedent database gets stripped before the response is returned.
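An allow-list scrubber of this kind is simple to sketch. The function name and citation format below are assumptions, and a Set stands in for the indexed precedent database:

```typescript
// Hedged sketch of a BVA-citation allow-list scrubber. The two
// numbers in the Set are made up for illustration only.
const KNOWN_CITATIONS = new Set(["1234567", "7654321"]);

function scrubUnknownBvaCitations(text: string): string {
  // Matches "Citation Nr: 1234567"-style references; anything not
  // in the index is replaced rather than shown to the vet.
  return text.replace(/Citation Nr:?\s*(\d{7})/g, (match, nr) =>
    KNOWN_CITATIONS.has(nr) ? match : "[citation removed: not in index]",
  );
}
```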

No invented statistics

The prompt guard forbids patterns like "80% grant rate," "median $42,000 retro," or "approved 9 out of 10." These numbers vary by condition and era; citing any specific figure at the session level is how vets get misled into filing weak claims.
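A detector for these patterns might look like the following. The regexes mirror the three examples above; the production list is not reproduced here:

```typescript
// Illustrative post-check for invented statistics. Patterns are
// assumptions based on the examples in the text, not the shipped list.
const STATISTIC_PATTERNS: RegExp[] = [
  /\d+\s*%\s*(grant|approval|win)\s*rate/i, // "80% grant rate"
  /median\s*\$[\d,]+/i,                     // "median $42,000 retro"
  /approved\s+\d+\s+out\s+of\s+\d+/i,       // "approved 9 out of 10"
];

function containsInventedStatistic(text: string): boolean {
  return STATISTIC_PATTERNS.some((pattern) => pattern.test(text));
}
```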

Dates are data, not math

The AI is forbidden from computing durations ("for 3 years," "over a decade"). Duration claims are post-processed out. The only dates that appear in output are dates the vet typed in, surfaced verbatim.
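The post-processing step can be sketched as a pattern strip. The regex below is a simplified assumption covering phrases like "for 3 years" and "over a decade", not the production rule set:

```typescript
// Illustrative duration scrubber: removes computed-duration phrases
// so only vet-supplied dates survive verbatim. Pattern is a sketch.
const DURATION_RE =
  /\b(?:for|over|nearly|about)\s+(?:a|an|\d+)\s+(?:decade|year|month|week)s?\b/gi;

function stripComputedDurations(text: string): string {
  return text
    .replace(DURATION_RE, "")
    .replace(/\s{2,}/g, " ") // collapse the gap left behind
    .trim();
}
```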

No "could," "may," "possibly" on nexus output

38 CFR 3.102 sets the benefit-of-the-doubt standard at 50%. "At least as likely as not" clears it; softer phrasing does not. Nexus letter drafts are checked for the magic phrase before they reach the vet.
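A minimal sketch of that pre-delivery check, under the assumption that it reduces to a required-phrase test plus a forbidden-hedge test (function and pattern names are invented):

```typescript
// Hypothetical nexus-phrasing gate: the § 3.102 magic phrase must
// appear, and softer hedges must not. Patterns are illustrative.
const REQUIRED_PHRASE = /at least as likely as not/i;
const FORBIDDEN_HEDGES = /\b(could be|may be|might be|possibly)\s+related\b/i;

function nexusPhrasingOk(draft: string): boolean {
  return REQUIRED_PHRASE.test(draft) && !FORBIDDEN_HEDGES.test(draft);
}
```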

Plain-text only, no markdown

Every AI response is stripped of **bold**, __emphasis__, # headings, and `code` formatting server-side. The vet reads prose, not a markdown document they have to mentally convert before giving it to their doctor.

No account, no data retention

The Claim Workspace is local-first. Your narrative stays in your browser. We do not store veteran-PII on servers — we do not want the liability and you do not want the exposure.

What the eval harness actually measures

A “persona” is a frozen intake — a fictional veteran with a specific condition, rating history, and symptom profile. Each persona has a set of “invariants”: facts that MUST appear in the AI's response (must cite § 3.310 when PTSD secondary to OSA is claimed), facts that MUST NOT appear (must not list depression as secondary to hearing loss on the strong/moderate tier), and stability checks (three identical runs should produce substantively the same output).
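The persona-and-invariant structure described above could be modeled like this. The interface and scoring function are assumptions about shape, not the actual contents of the personas-*.ts files:

```typescript
// Hypothetical persona shape: a frozen intake plus the invariants
// its responses are scored against. Names are illustrative.
interface Persona {
  name: string;
  intake: Record<string, unknown>; // frozen fictional veteran intake
  mustContain: RegExp[];           // facts that MUST appear
  mustNotContain: RegExp[];        // facts that MUST NOT appear
}

function scoreResponse(persona: Persona, response: string): string[] {
  const failures: string[] = [];
  for (const re of persona.mustContain)
    if (!re.test(response)) failures.push(`missing required fact: ${re}`);
  for (const re of persona.mustNotContain)
    if (re.test(response)) failures.push(`forbidden fact present: ${re}`);
  return failures; // empty array ⇒ the persona passed this run
}
```

Running the same scoring over three replays per persona is what turns a one-off fluke into a detectable stability failure.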

When a persona drifts, the CI job fails loudly. We see it before the next veteran does. The fix is almost always a prompt edit — not a model swap — so shipped corrections land within hours of detection.

Current test coverage

  • /presumptive · 5 personas
  • /decoder · 5 personas
  • /smc · 6 personas
  • /secondaries · 5 personas
  • /rating-gap · 4 personas
  • /estimator · 5 personas
  • /tdiu · 5 personas
  • /nexus · 5 personas
  • /cp-prep · 5 personas
  • /cp-debrief · 3 personas
  • /eligibility · 4 personas
  • /claim-status · 3 personas
  • /forms · 5 personas

700+ invariants across 13 tools · 3 runs per persona · every 6 hours against production

Verify it yourself

Every piece of infrastructure described on this page lives in a public GitHub repository. You can read the prompts. You can read the invariants. You can read the CI workflow. If you want to propose a persona — a real case the site should be tested against — email the narrative and expected behavior and we'll add it to the library.

Disclosure: NexusVetClaims is not a VA-accredited representative. Nothing on this site is legal or medical advice. The tools help you prepare — you or your VSO file.