Fidelity Scoring
Fidelity scoring answers one question: if I ran the artifact back through `/ingest`, how much of the original wiki would I recover? The answer is a number between 0.0 and 1.0, compared against a per-artifact-type target.
Three Tiers
```mermaid
flowchart TB
    Input[["Original pages P<br/>Re-derived pages P'"]]:::source
    T1["Tier 1: Coverage<br/>per-page concept overlap"]:::engine
    T2["Tier 2: Jaccard<br/>corpus-level set similarity"]:::engine
    T3["Tier 3: LLM judge<br/>fact-level diff (flag-gated)"]:::engine
    Weight["Weighted combine<br/>0.6 * coverage + 0.4 * jaccard<br/>+ optional LLM adjustment"]:::engine
    Score[["Fidelity score<br/>0.0 → 1.0"]]:::output

    Input --> T1
    Input --> T2
    Input -.-> T3
    T1 --> Weight
    T2 --> Weight
    T3 -.-> Weight
    Weight --> Score

    classDef source fill:#e0af40,stroke:#8a6d1a,color:#1a1a1a
    classDef engine fill:#5bbcd6,stroke:#2e6c7c,color:#0b0f14
    classDef output fill:#7dcea0,stroke:#2d6a4f,color:#0b0f14
```
Tier 1 — Coverage
For each original page, extract its concept set (wikilink targets + frontmatter tags + capitalised multi-word terms). Find the closest re-derived page by title or slug. If the re-derived page retains ≥50% of the original’s concepts, the page survived.
```
coverage = (pages that survived) / (original pages)
```
The per-page threshold is deliberately low (0.5, not 0.8) because lossy artifact types legitimately drop ~half the surface content — slides compress aggressively, flashcards split into atoms. A higher threshold would punish types that are supposed to compress.
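In code, Tier 1 reduces to a per-page survival count. A minimal sketch, assuming pages have already been matched by title or slug and their concept sets extracted — function names here are illustrative, not the shipped implementation:

```python
def page_survived(original: set[str], rederived: set[str],
                  threshold: float = 0.5) -> bool:
    """A page survives if its matched re-derived page retains >=50% of its concepts."""
    if not original:
        return False  # nothing to measure against
    return len(original & rederived) / len(original) >= threshold


def coverage(matched_pages: list[tuple[set[str], set[str]]]) -> float:
    """matched_pages: (original_concepts, rederived_concepts) pairs."""
    if not matched_pages:
        return 0.0
    survived = sum(page_survived(o, r) for o, r in matched_pages)
    return survived / len(matched_pages)
```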
Tier 2 — Jaccard
Corpus-level set similarity over all extracted concepts, pooled across each side:
```
jaccard = |concepts(P) ∩ concepts(P')| / |concepts(P) ∪ concepts(P')|
```
Jaccard complements coverage: it catches cases where every page survived individually but the overall vocabulary shifted (synonyms drifted, or the artifact invented concepts not in the source). It also doesn’t depend on page-level matching, so it’s robust when the re-ingest produces a different page count.
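The same formula as a sketch, assuming each corpus’s concepts have already been pooled into one set per side:

```python
def jaccard(original: set[str], rederived: set[str]) -> float:
    """Jaccard similarity over the pooled concept sets of both corpora."""
    if not original and not rederived:
        return 1.0  # two empty corpora are trivially identical
    return len(original & rederived) / len(original | rederived)
```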
Tier 3 — LLM Judge (optional)
Flag-gated via `--llm-judge`. Claude reads both corpora and produces a fact-level comparison: “these 12 facts from the original survived, these 3 were dropped, these 2 were added.” Slower and paid, but catches semantic equivalence (different wording, same fact) that structural scoring misses.
Not run by default because:
- Most regressions show in tiers 1 + 2.
- Per-run cost adds up in CI.
- The result is harder to unit-test — a model’s verdict shifts over time.
Use it when tiers 1 + 2 flag a failure you want to investigate, or on the golden corpus during a template overhaul.
Weighting
Default:
```
fidelity = 0.6 * coverage + 0.4 * jaccard
```
The 60/40 split favours “every page survived” over “vocabulary overlapped.” A lossy artifact that preserves 3 of 10 pages perfectly should score lower than one that loosely preserves 10 of 10. Coverage is the stricter signal; Jaccard is the broader floor.
When `--llm-judge` is used, the judge produces a multiplier in [0.85, 1.15] applied to the structural score — never dominating, only nudging. This is deliberate: the LLM’s verdict should refine, not replace, the deterministic math.
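Putting the pieces together — a sketch of the combine step, assuming the judge verdict has already been reduced to a multiplier. The final clamp to [0.0, 1.0] is an assumption to keep the score in the documented range:

```python
def fidelity(coverage: float, jaccard: float,
             judge_multiplier: float | None = None) -> float:
    """0.6/0.4 structural blend, optionally nudged by the LLM judge."""
    score = 0.6 * coverage + 0.4 * jaccard
    if judge_multiplier is not None:
        # The judge can only nudge, never dominate.
        score *= max(0.85, min(1.15, judge_multiplier))
    return max(0.0, min(1.0, score))  # assumption: clamp to the documented range
```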
Per-Artifact Targets
Targets come from close-the-loop testing and are encoded in `.claude/skills/verify-artifact/SKILL.md`.
| Artifact type | Target | Reasoning |
|---|---|---|
| book | 0.85 | Pandoc PDF — concatenation + formatting preserves nearly all prose |
| pdf | 0.85 | Same as book — a printable dump of the same source text |
| podcast | 0.75 | Spoken-word rephrasing drops headings/structure but keeps ideas |
| video | 0.60 | Scene-card compression; narration covers ~60% of content |
| mindmap | 0.50 | Headings + bullets → captures skeleton but loses prose |
| flashcards | 0.40 | Card-level chunking loses flow and cross-references |
| quiz | 0.40 | Questions test ideas without stating them directly |
| slides | 0.35 | Heavy rewording; bullet-level compression |
| app | 0.25 | JSON fixture keeps structure but strips prose entirely |
| infographic | 0.25 | SVG slots are single-phrase summaries |
Exceeding the target is fine. The targets represent the minimum faithful reconstruction — falling consistently below is a regression, overshooting isn’t.
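The check itself then reduces to a lookup and comparison. A sketch, with the table above copied into a literal dict for illustration (the real source of truth is SKILL.md, and the `override` parameter stands in for the `--target` flag described under Overriding below):

```python
# Minimum fidelity targets per artifact type (mirrors the table above).
TARGETS = {
    "book": 0.85, "pdf": 0.85, "podcast": 0.75, "video": 0.60,
    "mindmap": 0.50, "flashcards": 0.40, "quiz": 0.40,
    "slides": 0.35, "app": 0.25, "infographic": 0.25,
}


def passes(artifact_type: str, score: float, override: float | None = None) -> bool:
    """Exceeding the target is fine; only falling below it is a regression."""
    target = override if override is not None else TARGETS[artifact_type]
    return score >= target
```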
Why the spread?
The 0.25 → 0.85 range reflects that different artifacts compress content at different rates — and that’s the point. A book should be approximately the same information as the source; an infographic is a one-glance summary. Holding them to the same fidelity bar would make one or the other useless.
Overriding
Per-vault or per-topic overrides live on the command line:
```
/verify-artifact quiz --vault my-research --topic rag --target 0.55
```
Use this when the content shape deviates — a code-heavy topic will score worse on quiz and flashcards because code blocks resist concept extraction.
What Counts as a Concept
Concepts are extracted from three sources, listed from lowest false-positive rate to highest:
- `[[wikilink]]` targets in the page body — always intentional.
- `tags:` list in frontmatter — always intentional.
- Capitalised multi-word phrases in prose — heuristic; can over-match on proper nouns unrelated to the topic.
The union forms the concept set for that page. Concepts are normalised (lowercased, hyphens folded to spaces, whitespace collapsed) before comparison so that “Attention Mechanism,” “attention mechanism,” and “attention-mechanism” collapse to the same string.
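A minimal sketch of that normalisation (the hyphen fold is implied by the example above):

```python
import re


def normalise(concept: str) -> str:
    """Lowercase, fold hyphens to spaces, collapse whitespace."""
    return re.sub(r"\s+", " ", concept.lower().replace("-", " ")).strip()


assert (normalise("Attention Mechanism")
        == normalise("attention mechanism")
        == normalise("attention-mechanism")
        == "attention mechanism")
```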
Known Caveats
Section titled “Known Caveats”- Prose-only pages under-extract. A page with no wikilinks, no tags, and no capitalised terms will produce a tiny concept set, and any artifact will score artificially high (or low) against it.
- Code-heavy pages distort results. Code blocks aren’t prose — capitalised identifiers (
MyClass) become “concepts” that an artifact usually can’t reproduce verbatim. - Synonym drift isn’t caught structurally. “LLM” and “large language model” are different concepts to the extractor. Tier 3 (
--llm-judge) is the only way to catch this. - Target tuning is ongoing. Initial targets come from manual runs across ~5 vaults. They’ll shift as the golden corpus matures.
See Also
- verify-artifact feature — how to run the check
- close-the-loop testing — design doc with derivation of targets
- golden corpus — the frozen CI fixture these targets are tracked against
- `lint --artifacts` — cheap drift counterpart (doesn’t score fidelity, just detects hash drift)