yuragi

Measure how unstable your LLM's confidence really is.
Perturbation-driven hallucination detection — black-box, logprob-friendly, CLI-first.

GitHub → Source, issues, contributing
PyPI → pip install yuragi
Releases → v0.4.1 (Apr 14, 2026)
README → Usage guide
Theory → Mathematical foundation
Paper draft → Confidence inversion on 8B models
pip install yuragi
Latest research (v0.4.1): real-data benchmarks on llama-3.1-8B.

yuragi generates 13 semantics-preserving prompt perturbations (typos, tone, paraphrase, authority framing, counterfactual context) and compares the model's confidence distribution across the responses. When the answer text stays the same but confidence moves, that is fragility: a measurable property of prompt wording rather than model knowledge.
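
A minimal sketch of the perturb-and-compare idea in Python, not yuragi's actual API: query_model, the perturbation list, and spread_threshold are placeholders you would supply, and confidence is assumed to be something like the mean token logprob of the answer.

# Conceptual sketch only; illustrates the technique, not the yuragi package.
from statistics import mean, pstdev
from typing import Callable, Tuple

def fragility_report(
    base_prompt: str,
    perturbations: list[str],
    query_model: Callable[[str], Tuple[str, float]],
    spread_threshold: float = 0.5,
) -> dict:
    """Compare confidence across semantics-preserving rewrites of one prompt.

    query_model is a caller-supplied function returning (answer_text, confidence),
    where confidence might be the mean token logprob of the answer.
    """
    prompts = [base_prompt] + perturbations
    answers, confidences = [], []
    for p in prompts:
        answer, conf = query_model(p)
        answers.append(answer.strip().lower())  # crude normalization before comparing
        confidences.append(conf)

    same_answer = len(set(answers)) == 1         # did wording changes flip the answer?
    spread = pstdev(confidences)                 # dispersion of confidence across rewrites
    return {
        "same_answer": same_answer,
        "mean_confidence": mean(confidences),
        "confidence_spread": spread,
        # Fragile: the answer text is stable but confidence moves with the wording.
        "fragile": same_answer and spread > spread_threshold,
    }

The spread_threshold value and the use of population standard deviation are illustrative choices; yuragi's released metric may aggregate confidence differently, for example over per-token logprob distributions rather than one scalar per response.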