Methodology

How we keep AI honest

Every AI-powered tool on this site is automatically tested against realistic veteran scenarios every 6 hours. Here's what we check for, what we refuse to generate, and how you can verify it yourself.

Why this page exists

Veterans are right to distrust “AI” marketing. Models hallucinate citations. They drift under load. They cite the wrong CFR section. A drafted nexus letter that says “possibly related” instead of “at least as likely as not” gets auto-denied under 38 CFR 3.102. We publish what we test for so the claim is falsifiable.

Three layers of defense

Layer 1 · Prompt

Every tool starts with the same injection guard

What it is: A shared prompt guard is injected into all 13 AI tools. It enumerates what the model is forbidden to produce: no invented statistics, no fabricated CFR citations, no unexpanded acronyms outside a 12-item whitelist, no computed-duration drift, no markdown in plain-text fields. An 8th-grade reading level is a hard constraint.

Why it matters: VA claim outputs get read under stress by people who are not lawyers. A single fabricated citation — "38 CFR 3.655" when the rater is looking at 3.310 — can undo the whole package. Pre-flight rules are cheaper than downstream scrubbing.

Proof: src/lib/ai/prompt-injection-guard.ts · imported by every /api route.
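A minimal sketch of what such a shared guard might look like. The function name, the acronym list, and the rule wording are all illustrative assumptions, not the actual contents of prompt-injection-guard.ts:

```typescript
// Hypothetical sketch: a block of hard rules prepended to every
// tool's system prompt. The acronym whitelist is invented for
// illustration; only its 12-item size comes from the text above.
const ALLOWED_ACRONYMS = [
  "VA", "CFR", "PTSD", "TDIU", "SMC", "BVA",
  "OSA", "ROM", "VSO", "C&P", "PII", "CI",
] as const; // 12-item whitelist

function buildGuardPreamble(): string {
  return [
    "HARD RULES (non-negotiable):",
    "- Never invent statistics, grant rates, or dollar figures.",
    "- Cite only real 38 CFR sections; quote them verbatim.",
    `- Expand every acronym except: ${ALLOWED_ACRONYMS.join(", ")}.`,
    "- Never compute durations from dates; repeat dates verbatim.",
    "- Plain text only: no markdown syntax of any kind.",
    "- Write at an 8th-grade reading level.",
  ].join("\n");
}
```

The point of centralizing the rules is that a fix lands in all 13 tools at once instead of drifting per-prompt.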

Layer 2 · Output

Buffered JSON, validated, scrubbed, then sent

What it is: Instead of streaming raw model output to the browser, every AI response is buffered through a shared runJsonTool() runner. It Zod-validates the shape, runs a per-tool postValidate hook (forces § 3.310 on secondaries, normalizes framework names on presumptive, strips computed durations on all tools), then a universal scrubMarkdownInPlace() pass removes any markdown leakage that slipped through the prompt guard.

Why it matters: Streaming leaves no window to catch a hallucination mid-flight. Buffering and validating first means the model can never emit a value that the UI silently renders as undefined, or that the vet reads as a false-hope number.

Proof: src/lib/ai/run-json-tool.ts · 13 tools migrated from streaming to buffered.
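The buffered-runner pattern can be sketched roughly as below. This is an assumption about the shape, not the real runJsonTool() signature; the real runner uses Zod, which a hand-rolled validator stands in for here so the sketch has no dependencies:

```typescript
// Illustrative buffered runner: buffer → validate → post-validate →
// scrub → return. All names here are assumptions for illustration.
type Validator<T> = (value: unknown) => T;

async function runJsonTool<T extends Record<string, unknown>>(
  callModel: () => Promise<string>, // buffers the full completion
  validate: Validator<T>,           // stands in for a Zod schema.parse()
  postValidate: (parsed: T) => T,   // per-tool invariant hook
): Promise<T> {
  const raw = await callModel();            // 1. buffer, never stream
  const parsed = validate(JSON.parse(raw)); // 2. shape validation
  const checked = postValidate(parsed);     // 3. tool-specific fixes
  scrubMarkdownInPlace(checked);            // 4. strip markdown leakage
  return checked;
}

function scrubMarkdownInPlace(obj: Record<string, unknown>): void {
  for (const [key, value] of Object.entries(obj)) {
    if (typeof value === "string") {
      // Remove bold/emphasis markers, inline code, and leading headings.
      obj[key] = value.replace(/\*\*|__|`|^#+\s+/gm, "");
    }
  }
}
```

Because every tool funnels through one runner, the scrub pass cannot be forgotten on a new endpoint.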

Layer 3 · CI

Persona-drift evals run against production every 6 hours

What it is: A GitHub Action replays a library of 30+ veteran personas against every AI endpoint on the live site, 3 runs per persona, and scores each response against invariants: "if this vet has ROM 0–130°, the rating-gap tool MUST NOT suggest supplemental_new_evidence." 700+ invariants across 13 tools. The action also fires on every push to main that touches AI code, as a pre-deploy gate.

Why it matters: LLMs drift silently. A model provider can change behavior overnight. Without an automated drift check, you learn about regressions from veterans who filed bad claims. With one, you learn from a red check mark in the Actions tab.

Proof: scripts/eval/personas-*.ts + run-*.ts · .github/workflows/eval-ai-tools.yml

The hard rules

No invented CFR citations

Every citation is a real 38 CFR section. Common ones (§ 3.310 secondary, § 4.16 TDIU, § 3.309(f) Camp Lejeune, § 3.102 benefit-of-the-doubt, § 4.130 mental-health schedule) are written directly into the relevant tool prompts so the model quotes them verbatim rather than paraphrasing.

No hallucinated BVA citation numbers

The Precedent tool and decoder both run a BVA-citation allow-list scrubber — any "Citation Nr XXXXX" in the output that is not in the indexed precedent database gets stripped before the response is returned.
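An allow-list scrubber of this kind is simple to sketch. The function name and citation format below are assumptions, and a Set stands in for the indexed precedent database:

```typescript
// Hedged sketch of a BVA-citation allow-list scrubber. The two
// numbers in the Set are made up for illustration only.
const KNOWN_CITATIONS = new Set(["1234567", "7654321"]);

function scrubUnknownBvaCitations(text: string): string {
  // Matches "Citation Nr: 1234567"-style references; anything not
  // in the index is replaced rather than shown to the vet.
  return text.replace(/Citation Nr:?\s*(\d{7})/g, (match, nr) =>
    KNOWN_CITATIONS.has(nr) ? match : "[citation removed: not in index]",
  );
}
```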

No invented statistics

The prompt guard forbids patterns like "80% grant rate," "median $42,000 retro," or "approved 9 out of 10." These numbers vary by condition and era; citing any specific figure at the session level is how vets get misled into filing weak claims.
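A detector for these patterns might look like the following. The regexes mirror the three examples above; the production list is not reproduced here:

```typescript
// Illustrative post-check for invented statistics. Patterns are
// assumptions based on the examples in the text, not the shipped list.
const STATISTIC_PATTERNS: RegExp[] = [
  /\d+\s*%\s*(grant|approval|win)\s*rate/i, // "80% grant rate"
  /median\s*\$[\d,]+/i,                     // "median $42,000 retro"
  /approved\s+\d+\s+out\s+of\s+\d+/i,       // "approved 9 out of 10"
];

function containsInventedStatistic(text: string): boolean {
  return STATISTIC_PATTERNS.some((pattern) => pattern.test(text));
}
```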

Dates are data, not math

The AI is forbidden from computing durations ("for 3 years," "over a decade"). Duration claims are post-processed out. The only dates that appear in output are dates the vet typed in, surfaced verbatim.
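The post-processing step can be sketched as a pattern strip. The regex below is a simplified assumption covering phrases like "for 3 years" and "over a decade", not the production rule set:

```typescript
// Illustrative duration scrubber: removes computed-duration phrases
// so only vet-supplied dates survive verbatim. Pattern is a sketch.
const DURATION_RE =
  /\b(?:for|over|nearly|about)\s+(?:a|an|\d+)\s+(?:decade|year|month|week)s?\b/gi;

function stripComputedDurations(text: string): string {
  return text
    .replace(DURATION_RE, "")
    .replace(/\s{2,}/g, " ") // collapse the gap left behind
    .trim();
}
```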

No "could," "may," "possibly" on nexus output

38 CFR 3.102 sets the benefit-of-the-doubt standard at 50%. "At least as likely as not" clears it; softer phrasing does not. Nexus letter drafts are checked for the magic phrase before they reach the vet.
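A minimal sketch of that pre-delivery check, under the assumption that it reduces to a required-phrase test plus a forbidden-hedge test (function and pattern names are invented):

```typescript
// Hypothetical nexus-phrasing gate: the § 3.102 magic phrase must
// appear, and softer hedges must not. Patterns are illustrative.
const REQUIRED_PHRASE = /at least as likely as not/i;
const FORBIDDEN_HEDGES = /\b(could be|may be|might be|possibly)\s+related\b/i;

function nexusPhrasingOk(draft: string): boolean {
  return REQUIRED_PHRASE.test(draft) && !FORBIDDEN_HEDGES.test(draft);
}
```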

Plain-text only, no markdown

Every AI response is stripped of **bold**, __emphasis__, # headings, and `code` formatting server-side. The vet reads prose, not a markdown document they have to mentally convert before giving it to their doctor.

No account, no data retention

The Claim Workspace is local-first. Your narrative stays in your browser. We do not store veteran-PII on servers — we do not want the liability and you do not want the exposure.

What the eval harness actually measures

A “persona” is a frozen intake — a fictional veteran with a specific condition, rating history, and symptom profile. Each persona has a set of “invariants”: facts that MUST appear in the AI's response (must cite § 3.310 when PTSD secondary to OSA is claimed), facts that MUST NOT appear (must not list depression as secondary to hearing loss on the strong/moderate tier), and stability checks (three identical runs should produce substantively the same output).
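The persona-and-invariant structure described above could be modeled like this. The interface and scoring function are assumptions about shape, not the actual contents of the personas-*.ts files:

```typescript
// Hypothetical persona shape: a frozen intake plus the invariants
// its responses are scored against. Names are illustrative.
interface Persona {
  name: string;
  intake: Record<string, unknown>; // frozen fictional veteran intake
  mustContain: RegExp[];           // facts that MUST appear
  mustNotContain: RegExp[];        // facts that MUST NOT appear
}

function scoreResponse(persona: Persona, response: string): string[] {
  const failures: string[] = [];
  for (const re of persona.mustContain)
    if (!re.test(response)) failures.push(`missing required fact: ${re}`);
  for (const re of persona.mustNotContain)
    if (re.test(response)) failures.push(`forbidden fact present: ${re}`);
  return failures; // empty array ⇒ the persona passed this run
}
```

Running the same scoring over three replays per persona is what turns a one-off fluke into a detectable stability failure.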

When a persona drifts, the CI job fails loudly. We see it before the next veteran does. The fix is almost always a prompt edit — not a model swap — so shipped corrections land within hours of detection.

Current test coverage

  • /presumptive · 5 personas
  • /decoder · 5 personas
  • /smc · 6 personas
  • /secondaries · 5 personas
  • /rating-gap · 4 personas
  • /estimator · 5 personas
  • /tdiu · 5 personas
  • /nexus · 5 personas
  • /cp-prep · 5 personas
  • /cp-debrief · 3 personas
  • /eligibility · 4 personas
  • /claim-status · 3 personas
  • /forms · 5 personas

700+ invariants across 13 tools · 3 runs per persona · every 6 hours against production

Verify it yourself

Every piece of infrastructure described on this page lives in a public GitHub repository. You can read the prompts. You can read the invariants. You can read the CI workflow. If you want to propose a persona — a real case the site should be tested against — email the narrative and expected behavior and we'll add it to the library.

Disclosure: NexusVetClaims is not a VA-accredited representative. Nothing on this site is legal or medical advice. The tools help you prepare — you or your VSO file.