Fidelity Scoring

Fidelity scoring answers one question: if I ran the artifact back through /ingest, how much of the original wiki would I recover? The answer is a number between 0.0 and 1.0, compared against a per-artifact-type target.

flowchart TB
Input[["Original pages P<br/>Re-derived pages P'"]]:::source
T1["Tier 1: Coverage<br/>per-page concept overlap"]:::engine
T2["Tier 2: Jaccard<br/>corpus-level set similarity"]:::engine
T3["Tier 3: LLM judge<br/>fact-level diff (flag-gated)"]:::engine
Weight["Weighted combine<br/>0.6 * coverage + 0.4 * jaccard<br/>+ optional LLM adjustment"]:::engine
Score[["Fidelity score<br/>0.0 &rarr; 1.0"]]:::output
Input --> T1
Input --> T2
Input -.-> T3
T1 --> Weight
T2 --> Weight
T3 -.-> Weight
Weight --> Score
classDef source fill:#e0af40,stroke:#8a6d1a,color:#1a1a1a
classDef engine fill:#5bbcd6,stroke:#2e6c7c,color:#0b0f14
classDef output fill:#7dcea0,stroke:#2d6a4f,color:#0b0f14

Tier 1 measures per-page coverage. For each original page, extract its concept set (wikilink targets + frontmatter tags + capitalised multi-word terms), then find the closest re-derived page by title or slug. If the re-derived page retains ≥50% of the original’s concepts, the page survived.

coverage = (pages that survived) / (original pages)

Per-page threshold is deliberately low (0.5, not 0.8) because lossy artifact types legitimately drop ~half the surface content — slides compress aggressively, flashcards split into atoms. A higher threshold would punish types that are supposed to compress.
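
A minimal sketch of the per-page check in Python, assuming each page has already been reduced to a set of normalised concept strings; the slug-based matcher below is a stand-in for the real “closest page by title or slug” lookup:

```python
import re

def slugify(title: str) -> str:
    """Reduce a page title to a slug so original and re-derived pages can be paired."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def page_survived(original: set[str], rederived: set[str], threshold: float = 0.5) -> bool:
    """A page survives if the matched re-derived page retains >= threshold of its concepts."""
    if not original:
        return True  # nothing to lose; count as survived
    return len(original & rederived) / len(original) >= threshold

def coverage(original_pages: dict[str, set[str]], rederived_pages: dict[str, set[str]]) -> float:
    """coverage = (pages that survived) / (original pages)."""
    by_slug = {slugify(t): c for t, c in rederived_pages.items()}
    survived = sum(
        1 for title, concepts in original_pages.items()
        if page_survived(concepts, by_slug.get(slugify(title), set()))
    )
    return survived / len(original_pages) if original_pages else 1.0
```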

Tier 2 computes corpus-level set similarity over all extracted concepts across both corpora:

jaccard = |concepts(P) ∩ concepts(P')| / |concepts(P) ∪ concepts(P')|

Jaccard complements coverage: it catches cases where every page survived individually but the overall vocabulary shifted (synonyms drifted, or the artifact invented concepts not in the source). It also doesn’t depend on page-level matching, so it’s robust when the re-ingest produces a different page count.
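
Under the same page-to-concept-set representation, the corpus-level Jaccard reduces to a few lines (a sketch reusing the dictionaries from the coverage example above):

```python
def jaccard(original_pages: dict[str, set[str]], rederived_pages: dict[str, set[str]]) -> float:
    """|concepts(P) ∩ concepts(P')| / |concepts(P) ∪ concepts(P')| over the whole corpus."""
    p = set().union(*original_pages.values()) if original_pages else set()
    p_prime = set().union(*rederived_pages.values()) if rederived_pages else set()
    union = p | p_prime
    return len(p & p_prime) / len(union) if union else 1.0
```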

Tier 3 is flag-gated via --llm-judge. Claude reads both corpora and produces a fact-level comparison: “these 12 facts from the original survived, these 3 were dropped, these 2 were added.” Slower and paid, but it catches semantic equivalence (different wording, same fact) that structural scoring misses.

Not run by default because:

  • Most regressions show in tiers 1 + 2.
  • Per-run cost adds up in CI.
  • The result is harder to unit-test — a model’s verdict shifts over time.

Use it when tiers 1 + 2 flag a failure you want to investigate, or on the golden corpus during a template overhaul.
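
For illustration only, here is a hypothetical shape for the judge’s verdict and one possible way to turn it into a bounded multiplier; the actual prompt, output format, and mapping are defined by the skill, not by this sketch:

```python
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    survived: int  # facts from the original found again in the re-derived corpus
    dropped: int   # facts from the original that went missing
    added: int     # facts present only in the re-derived corpus

def judge_multiplier(v: JudgeVerdict) -> float:
    """Map fact-level recall onto the [0.85, 1.15] nudge described below (illustrative mapping only)."""
    total = v.survived + v.dropped
    recall = v.survived / total if total else 1.0
    return 0.85 + 0.30 * recall
```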

By default, the two structural tiers combine into a weighted score:

fidelity = 0.6 * coverage + 0.4 * jaccard

The 60/40 split favours “every page survived” over “vocabulary overlapped.” A lossy artifact that preserves 3 of 10 pages perfectly should score lower than one that loosely preserves 10 of 10. Coverage is the stricter signal; Jaccard is the broader floor.

When --llm-judge is used, the judge produces a multiplier in [0.85, 1.15] applied to the structural score — never dominating, only nudging. This is deliberate: the LLM’s verdict should refine, not replace, the deterministic math.
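
Putting the pieces together, a sketch of the combination; the clamp on the multiplier and the cap at 1.0 are assumptions about how out-of-range values are handled:

```python
def fidelity(coverage_score: float, jaccard_score: float,
             judge_multiplier: float | None = None) -> float:
    """fidelity = 0.6 * coverage + 0.4 * jaccard, optionally nudged by the LLM judge."""
    score = 0.6 * coverage_score + 0.4 * jaccard_score
    if judge_multiplier is not None:
        # the judge refines, never replaces: keep the nudge inside [0.85, 1.15]
        score *= min(max(judge_multiplier, 0.85), 1.15)
    return min(score, 1.0)  # assumption: clip back into the documented 0.0-1.0 range
```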

Targets come from close-the-loop testing and are encoded in .claude/skills/verify-artifact/SKILL.md.

| Artifact type | Target | Reasoning |
| --- | --- | --- |
| book | 0.85 | Pandoc PDF: concatenation + formatting preserves nearly all prose |
| pdf | 0.85 | Same as book: a printable dump of the same source text |
| podcast | 0.75 | Spoken-word rephrasing drops headings/structure but keeps ideas |
| video | 0.60 | Scene-card compression; narration covers ~60% of content |
| mindmap | 0.50 | Headings + bullets → captures skeleton but loses prose |
| flashcards | 0.40 | Card-level chunking loses flow and cross-references |
| quiz | 0.40 | Questions test ideas without stating them directly |
| slides | 0.35 | Heavy rewording; bullet-level compression |
| app | 0.25 | JSON fixture keeps structure but strips prose entirely |
| infographic | 0.25 | SVG slots are single-phrase summaries |

Exceeding the target is fine. The targets represent the minimum faithful reconstruction — falling consistently below is a regression, overshooting isn’t.

The 0.25 → 0.85 range reflects that different artifacts compress content at different rates — and that’s the point. A book should be approximately the same information as the source; an infographic is a one-glance summary. Holding them to the same fidelity bar would make one or the other useless.

Per-vault or per-topic overrides live on the command line:

/verify-artifact quiz --vault my-research --topic rag --target 0.55

Use this when the content shape deviates: a code-heavy topic will score worse on quiz and flashcards because code blocks resist concept extraction.
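
The pass/fail check itself is just a comparison against the per-type target, with the command-line override taking precedence. The dictionary below is a sketch of the table; the canonical values live in SKILL.md:

```python
TARGETS = {
    "book": 0.85, "pdf": 0.85, "podcast": 0.75, "video": 0.60, "mindmap": 0.50,
    "flashcards": 0.40, "quiz": 0.40, "slides": 0.35, "app": 0.25, "infographic": 0.25,
}

def passes(fidelity_score: float, artifact_type: str, override: float | None = None) -> bool:
    """Falling below the target is a regression; exceeding it is fine."""
    target = override if override is not None else TARGETS[artifact_type]
    return fidelity_score >= target
```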

Concepts are extracted from three sources, listed from lowest false-positive rate to highest:

  1. [[wikilink]] targets in the page body — always intentional.
  2. tags: list in frontmatter — always intentional.
  3. Capitalised multi-word phrases in prose — heuristic; can over-match on proper nouns unrelated to the topic.

The union forms the concept set for that page. Concepts are normalised (lowercased, with hyphens and runs of whitespace collapsed) before comparison so that “Attention Mechanism,” “attention mechanism,” and “attention-mechanism” collapse to the same string.
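
A sketch of that extraction, assuming the page body arrives as a Markdown string and the frontmatter tags have already been parsed into a list; the capitalised-phrase regex is one plausible heuristic, not necessarily the exact one the skill uses:

```python
import re

WIKILINK = re.compile(r"\[\[([^\]|#]+)")                         # [[Target]], [[Target|alias]], [[Target#heading]]
CAP_PHRASE = re.compile(r"\b[A-Z][a-z]+(?:[ -][A-Z][a-z]+)+\b")  # capitalised multi-word terms

def normalise(concept: str) -> str:
    """Lowercase and collapse hyphens/whitespace so surface variants map to one string."""
    return re.sub(r"[\s\-]+", " ", concept).strip().lower()

def extract_concepts(body: str, frontmatter_tags: list[str]) -> set[str]:
    """Union of wikilink targets, frontmatter tags, and capitalised multi-word phrases."""
    raw = {m.group(1) for m in WIKILINK.finditer(body)}
    raw |= set(frontmatter_tags)
    raw |= set(CAP_PHRASE.findall(body))
    return {normalise(c) for c in raw}
```

With this in place, extraction runs over both the original vault and the re-ingested artifact output, and the resulting per-page sets feed the coverage and Jaccard calculations above.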

  • Prose-only pages under-extract. A page with no wikilinks, no tags, and no capitalised terms will produce a tiny concept set, and any artifact will score artificially high (or low) against it.
  • Code-heavy pages distort results. Code blocks aren’t prose — capitalised identifiers (MyClass) become “concepts” that an artifact usually can’t reproduce verbatim.
  • Synonym drift isn’t caught structurally. “LLM” and “large language model” are different concepts to the extractor. Tier 3 (--llm-judge) is the only way to catch this.
  • Target tuning is ongoing. Initial targets come from manual runs across ~5 vaults. They’ll shift as the golden corpus matures.