
# Golden Corpus

The golden corpus is a small, hand-curated set of wiki pages checked into the repo at golden-corpus/. CI generates every artifact type against this corpus and verifies each output. It’s the regression fixture that tells us whether a change to the pipeline affects artifact quality.

```mermaid
flowchart LR
Corpus[["golden-corpus/<br/>wiki/**.md"]]:::source -->|input| Gen["generate handler<br/>(one per artifact type)"]:::engine
Gen -->|output| Art["artifact<br/>(book, slides, quiz, ...)"]:::output
Art -->|input| Ver["/verify-artifact<br/>(round-trip)"]:::engine
Ver -->|output| Score[["fidelity score<br/>vs per-type target"]]:::output
Score -->|advisory| CI["GitHub Actions<br/>summary"]:::output
classDef source fill:#e0af40,stroke:#8a6d1a,color:#1a1a1a
classDef engine fill:#5bbcd6,stroke:#2e6c7c,color:#0b0f14
classDef output fill:#7dcea0,stroke:#2d6a4f,color:#0b0f14
```

Real vaults change constantly. If we verified artifact quality against a live vault, drift in the content and drift in the pipeline would look identical — a fidelity regression could be the pages shifting or the renderer breaking, and we’d never know which.

A frozen corpus pins the input so any score change is unambiguously the pipeline’s fault.
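The pinning idea can be illustrated with a stable digest over the corpus files. This is a sketch of the concept, not the project's actual hashing scheme (that lives in the source-hash test suite); the temp directory and file contents below are invented stand-ins:

```shell
# Stand-in corpus in a temp dir (contents are invented for illustration).
corpus=$(mktemp -d)
printf 'attention is all you need\n' > "$corpus/a.md"
printf 'retrieval augmented generation\n' > "$corpus/b.md"

# Hash every page in a stable order. If this digest is unchanged between
# runs, any fidelity delta must come from the pipeline, not the input.
digest() {
  find "$1" -name '*.md' -print0 | sort -z | xargs -0 cat | sha256sum | cut -d' ' -f1
}

first=$(digest "$corpus")
second=$(digest "$corpus")
[ "$first" = "$second" ] && echo "input pinned: $first"
```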

Four concept/entity pages selected to exercise every handler:

| Path | Page type | Why it's there |
| --- | --- | --- |
| wiki/concepts/attention-mechanism.md | concept | Dense technical content — stresses prose-heavy handlers (book, PDF, podcast) |
| wiki/concepts/retrieval-augmented-generation.md | concept | Listy/procedural content — stresses slide/mindmap/infographic handlers |
| wiki/concepts/context-window.md | concept | Bridges other pages — stresses cross-link preservation (mindmap, book TOC) |
| wiki/entities/transformer.md | entity | Entity-type page — stresses frontmatter handling |

All four cross-link into a connected subgraph — a deliberate design choice so that [[wikilink]] resolution is tested on every run.

Pages are 100–300 words each. Bigger pages would exercise handlers more thoroughly but blow up CI runtime (podcast TTS and video rendering are the expensive ones). The intent: small enough to regenerate the full corpus in under five minutes, large enough to produce non-trivial artifacts.

Add pages only when an existing handler has a blind spot the current corpus doesn’t expose. More pages ≠ better coverage.

.github/workflows/golden-corpus.yml runs on every push that touches golden-corpus/, any generate-* handler, or /verify-artifact. The workflow:

  1. Installs Pandoc, Node/pnpm, Python.
  2. Runs the source-hash test suite (.claude/skills/generate/lib/tests/test-source-hash.sh) — proves the hashing foundation is unchanged before anything else runs.
  3. Matrix-generates each artifact type against the corpus.
  4. Matrix-verifies each output via /verify-artifact.
  5. Uploads artifacts and verification reports as CI artifacts for inspection.
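Steps 3–4 can be sketched as a plain loop. The real workflow uses a GitHub Actions matrix, and `/generate` / `/verify-artifact` are skills rather than shell executables, so both are stubbed here (the stub bodies are assumptions) to keep the sketch runnable:

```shell
# Stubs standing in for the real skills; their outputs are invented.
generate()        { printf '%s-artifact' "$1"; }   # stand-in for /generate
verify_artifact() { printf '0.90'; }               # stand-in for /verify-artifact

# Matrix-generate then matrix-verify each artifact type against the corpus.
for type in book pdf slides quiz mindmap flashcards app infographic; do
  artifact=$(generate "$type")
  score=$(verify_artifact "$artifact")
  echo "$type score=$score"   # advisory: reported in the summary, not gating
done
```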

Advisory only in the initial rollout. continue-on-error: true on the generate/verify steps means a fidelity regression shows up in the CI summary but doesn’t block the pipeline. Per close-the-loop-testing’s explicit non-goals, hard-fail mode waits until per-type targets hold across three consecutive green runs.

Podcast and video handlers are excluded from the default matrix — their heavy lazy-installed deps (Piper TTS, Remotion) are better suited to a nightly job.

Targets come from the close-the-loop-testing concept page. The same numbers are encoded in .claude/skills/verify-artifact/SKILL.md:

| Artifact type | Target | Why |
| --- | --- | --- |
| book | 0.85 | Lossless prose reproduction — concatenation + Pandoc preserves nearly everything |
| pdf | 0.85 | Same as book — a formatted dump of the same source text |
| podcast | 0.75 | Spoken rephrasing loses structural signal but keeps ideas |
| video | 0.60 | Scene-card compression; narration covers most content |
| mindmap | 0.50 | Headings + bullets — captures skeleton but not prose |
| flashcards | 0.40 | Card-level chunking loses flow and context |
| quiz | 0.40 | Questions test ideas but don’t surface them directly |
| slides | 0.35 | Heavy rewording, bullet-level compression |
| app | 0.25 | JSON fixture keeps structure but strips prose entirely |
| infographic | 0.25 | SVG slots are highly compressed summaries |

Exceeding the target is fine. Falling below it consistently is the regression signal.
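A minimal sketch of comparing a score against its per-type target. The targets mirror the table above; the `check` helper, its output format, and the example scores are hypothetical, not the verifier's actual interface:

```shell
# check <type> <score>: look up the per-type target and compare.
check() {
  target=$(awk -v t="$1" '$1 == t { print $2 }' <<'EOF'
book 0.85
pdf 0.85
podcast 0.75
video 0.60
mindmap 0.50
flashcards 0.40
quiz 0.40
slides 0.35
app 0.25
infographic 0.25
EOF
  )
  # Numeric comparison via awk, since sh test(1) is integer-only.
  awk -v s="$2" -v t="$target" 'BEGIN { exit !(s >= t) }' \
    && echo "$1: $2 >= $target (ok)" \
    || echo "$1: $2 < $target (regression signal)"
}

check book 0.91
check slides 0.30
```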

To run the same checks locally:

```shell
# Foundation: source-hash test suite
bash .claude/skills/generate/lib/tests/test-source-hash.sh

# Generate a single artifact against the corpus
/generate book --vault golden-corpus

# Verify it
/verify-artifact book --vault golden-corpus
```

For contribution details, see golden-corpus/README.md. The bar for new pages: every existing handler must still render cleanly, the page must link to at least one other corpus page, and the total corpus word count must stay under 1,500.
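Two of those checks can be run mechanically. A sketch, using a temp directory as a stand-in for golden-corpus/wiki/ with invented page contents; the 1,500-word threshold comes from this page:

```shell
# Stand-in corpus pages (contents invented for illustration).
wiki=$(mktemp -d)
printf 'Attention links to [[context-window]].\n' > "$wiki/attention.md"
printf 'Context window links to [[transformer]].\n' > "$wiki/context-window.md"

# Total corpus word count must stay under 1,500.
total=$(cat "$wiki"/*.md | wc -w | tr -d ' ')
[ "$total" -lt 1500 ] && echo "word count ok: $total"

# Every page must link to at least one other corpus page.
for page in "$wiki"/*.md; do
  grep -q '\[\[' "$page" || echo "missing wikilink: $page"
done
echo "link check done"
```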