Evals Hub

This page organizes Dramatica evaluations as benchmark cards so researchers and studio technical teams can quickly inspect signal quality.

IMPORTANT

Several cards below are intentionally marked as Placeholder where public metrics are not yet finalized. Replace these with your current board outputs as they are approved for publication.

Framing

Each benchmark card reports:

  • Task
  • Data
  • Metric
  • Baselines
  • Verifier loop result
  • Qualitative notes
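The six reported fields can be treated as a small record type. A minimal sketch, assuming a Python tooling context; the `BenchmarkCard` class name and field names are illustrative, not part of any published schema:

```python
# Hypothetical record type for the six card fields listed above.
from dataclasses import dataclass


@dataclass
class BenchmarkCard:
    task: str
    data: str
    metric: str
    baselines: list[str]
    verifier_loop_result: str  # "Placeholder" until board outputs are approved
    qualitative_notes: str


# Example: the Storyform Constraint Compliance card as a record.
card = BenchmarkCard(
    task="detect hard Storyform violations in candidate drafts",
    data="prompt-to-draft pairs conditioned on fixed Storyform specs",
    metric="pass rate (constraint-valid outputs)",
    baselines=["vanilla generation", "prompting-only guardrails"],
    verifier_loop_result="Placeholder",
    qualitative_notes="exposes early structural drift before full rewrite cycles",
)
```

Keeping cards as structured records rather than free text makes it easy to swap in approved board outputs later without touching the page layout.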

Constraints Layer

Benchmark Card: Storyform Constraint Compliance

  • Task: detect hard Storyform violations in candidate drafts
  • Data: prompt-to-draft pairs conditioned on fixed Storyform specs
  • Metric: pass rate (constraint-valid outputs)
  • Baselines: vanilla generation, prompting-only guardrails
  • Verifier loop result: Placeholder
  • Qualitative notes: exposes early structural drift before full rewrite cycles
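The pass-rate metric on this card reduces to a fraction of constraint-valid outputs. A minimal sketch, assuming verifier verdicts arrive as booleans; the function name is illustrative:

```python
# Hypothetical pass-rate computation over a batch of verifier verdicts.
def pass_rate(verdicts: list[bool]) -> float:
    """Fraction of candidate drafts judged constraint-valid."""
    if not verdicts:
        return 0.0  # no candidates, no signal
    return sum(verdicts) / len(verdicts)


# Three of four drafts pass the hard Storyform constraints.
print(pass_rate([True, True, False, True]))  # 0.75
```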

Benchmark Card: Throughline Role Integrity

  • Task: validate whether Objective Story, Main Character, Influence Character, and Relationship Story roles remain distinct and consistent
  • Data: controlled scenarios with targeted Throughline swaps
  • Metric: violation detection accuracy
  • Baselines: prompting-only checks
  • Verifier loop result: Placeholder
  • Qualitative notes: catches hidden role collapse that often appears late in drafts

Alignment Layer

Benchmark Card: Throughline Voice Classification

  • Task: classify whether narrative pressure matches intended Throughline perspective
  • Data: candidate passages labeled by intended Throughline pressure
  • Metric: agreement and classification accuracy
  • Baselines: vanilla model, prompting-only
  • Verifier loop result: Placeholder
  • Qualitative notes: strongest gains appear on Influence Character vs Main Character disambiguation
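Classification accuracy on this card is standard label agreement against gold annotations. A minimal sketch, assuming Throughline perspectives are encoded as short labels (`"MC"`, `"IC"`, `"OS"`, `"RS"` are illustrative codes, not an official encoding):

```python
# Hypothetical accuracy computation for Throughline perspective labels.
def accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of passages whose predicted perspective matches the gold label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


gold = ["MC", "IC", "OS", "RS", "IC"]
pred = ["MC", "MC", "OS", "RS", "IC"]  # one IC passage misread as MC
print(accuracy(pred, gold))  # 0.8
```

The single error in the example is exactly the Influence Character vs Main Character confusion the qualitative note calls out.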

Benchmark Card: Dynamic Consistency

  • Task: evaluate whether Dynamics remain coherent through revisions
  • Data: multi-draft sequences under fixed Dynamic settings
  • Metric: consistency score, pairwise agreement
  • Baselines: best-of-N without verifier
  • Verifier loop result: Placeholder
  • Qualitative notes: reduces “dead-on-arrival” rewrites that read fine but violate intended argument
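The pairwise-agreement metric on this card can be computed over the Dynamic label assigned to each draft in a revision sequence. A minimal sketch, assuming one categorical label per draft; the label values are illustrative:

```python
# Hypothetical pairwise agreement of a Dynamic setting across drafts.
from itertools import combinations


def pairwise_agreement(labels: list[str]) -> float:
    """Fraction of draft pairs that carry the same Dynamic label."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 1.0  # a single draft is trivially consistent
    return sum(a == b for a, b in pairs) / len(pairs)


# Three drafts keep the intended Dynamic; one revision flips it,
# so only 3 of the 6 pairs agree.
print(pairwise_agreement(["steadfast", "steadfast", "steadfast", "change"]))  # 0.5
```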

Benchmark Card: Storybeat Advancement

  • Task: verify that Storybeats advance intended narrative pressure rather than repeat surface events
  • Data: beat-level generation traces
  • Metric: advancement pass rate
  • Baselines: vanilla generation
  • Verifier loop result: Placeholder
  • Qualitative notes: strongest effect appears in middle-act drift prevention

Quality and Diversity Layer

Benchmark Card: Diverse Valid Candidates

  • Task: measure whether best-of-N preserves distinct options after verification
  • Data: same prompt and Storyform across N candidates
  • Metric: count of distinct valid candidates per run
  • Baselines: unconstrained best-of-N, single-pass generation
  • Verifier loop result: Placeholder
  • Qualitative notes: confirms the system narrows to meaning-correct candidates without forcing a single style
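The distinct-candidate count on this card is a deduplicated tally of verifier-valid outputs. A minimal sketch, assuming candidates can be fingerprinted for deduplication; the function and parameter names are illustrative:

```python
# Hypothetical count of distinct candidates surviving verification.
from typing import Callable


def distinct_valid(
    candidates: list[str],
    is_valid: Callable[[str], bool],
    fingerprint: Callable[[str], str] = lambda c: c,
) -> int:
    """Number of unique valid candidates, deduplicated by fingerprint."""
    return len({fingerprint(c) for c in candidates if is_valid(c)})


# Four candidates from one prompt/Storyform: one duplicate, one invalid.
drafts = ["opening A", "opening A", "opening B", "opening C"]
print(distinct_valid(drafts, is_valid=lambda d: d != "opening C"))  # 2
```

In practice the fingerprint would be something coarser than exact string equality (e.g. an embedding cluster), so stylistic variants of one option are not over-counted as diversity.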

Benchmark Card: Human Selection Lift

  • Task: compare human preference for verified candidate sets versus baseline sets when intent must remain fixed
  • Data: blind human selection rounds
  • Metric: selection rate lift
  • Baselines: prompting-only and vanilla sets
  • Verifier loop result: Placeholder
  • Qualitative notes: separates intent correctness from taste choice in review workflows
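Selection rate lift on this card can be read as how far the verified set's win rate sits above chance in blind head-to-head rounds. A minimal sketch under that assumption; the exact lift definition used by the board may differ:

```python
# Hypothetical selection lift: verified-set win rate over a 50/50 baseline.
def selection_lift(verified_wins: int, baseline_wins: int) -> float:
    """Verified set's blind-selection rate minus the chance rate of 0.5."""
    total = verified_wins + baseline_wins
    if total == 0:
        return 0.0
    return verified_wins / total - 0.5


# Verified candidates chosen in 33 of 50 blind rounds.
print(round(selection_lift(33, 17), 2))  # 0.16
```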

Pipeline summary

Storyform Spec -> Generator -> Dramatica Verifier (constraints + scores) -> Best-of-N selection -> Human choice

The verifier narrows outputs to meaning-correct candidates. Human reviewers choose voice and taste.
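The pipeline above can be sketched end to end as a best-of-N loop. A minimal sketch with toy stand-ins: `generate`, `verify`, and `score` are hypothetical placeholders for the generator, the Dramatica Verifier's constraint check, and its alignment score, none of which are specified here:

```python
# Hypothetical best-of-N loop: generate N candidates, keep verifier-valid
# ones, rank by score, and hand a short list to the human reviewer.
import random
from typing import Callable


def best_of_n(spec, n: int, generate: Callable, verify: Callable,
              score: Callable, shortlist: int = 3) -> list:
    candidates = [generate(spec) for _ in range(n)]
    valid = [c for c in candidates if verify(spec, c)]  # constraint layer
    return sorted(valid, key=score, reverse=True)[:shortlist]  # score layer


# Toy stand-ins so the sketch runs: candidates are random floats,
# "valid" means above a threshold, and the score is the value itself.
random.seed(0)
spec = {"storyform": "fixed"}
shortlist = best_of_n(
    spec, n=8,
    generate=lambda s: random.random(),
    verify=lambda s, c: c > 0.3,
    score=lambda c: c,
)
print(len(shortlist) <= 3)  # True
```

The division of labor matches the summary: the verifier and scorer narrow to meaning-correct candidates, while the final pick among the short list stays a human voice-and-taste decision.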