Evals Hub

This page organizes Dramatica evaluations as benchmark cards so researchers and studio technical teams can quickly inspect signal quality.

IMPORTANT

Several cards below are intentionally marked as Placeholder where public metrics are not yet finalized. Replace these with your current board outputs as they are approved for publication.

Framing

Each benchmark card reports:

  • Task
  • Data
  • Metric
  • Baselines
  • Verifier loop result
  • Qualitative notes
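The six reported fields can be treated as a small record type. A minimal sketch, assuming a Python tooling context; the `BenchmarkCard` class name and field names are illustrative, not part of any published schema:

```python
# Hypothetical record type for the six card fields listed above.
from dataclasses import dataclass


@dataclass
class BenchmarkCard:
    task: str
    data: str
    metric: str
    baselines: list[str]
    verifier_loop_result: str  # "Placeholder" until board outputs are approved
    qualitative_notes: str


# Example: the Storyform Constraint Compliance card as a record.
card = BenchmarkCard(
    task="detect hard Storyform violations in candidate drafts",
    data="prompt-to-draft pairs conditioned on fixed Storyform specs",
    metric="pass rate (constraint-valid outputs)",
    baselines=["vanilla generation", "prompting-only guardrails"],
    verifier_loop_result="Placeholder",
    qualitative_notes="exposes early structural drift before full rewrite cycles",
)
```

Keeping cards as structured records rather than free text makes it easy to swap in approved board outputs later without touching the page layout.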

Constraints Layer

Benchmark Card: Storyform Constraint Compliance

  • Task: detect hard Storyform violations in candidate drafts
  • Data: prompt-to-draft pairs conditioned on fixed Storyform specs
  • Metric: pass rate (constraint-valid outputs)
  • Baselines: vanilla generation, prompting-only guardrails
  • Verifier loop result: Placeholder
  • Qualitative notes: exposes early structural drift before full rewrite cycles
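The pass-rate metric on this card reduces to a fraction of constraint-valid outputs. A minimal sketch, assuming verifier verdicts arrive as booleans; the function name is illustrative:

```python
# Hypothetical pass-rate computation over a batch of verifier verdicts.
def pass_rate(verdicts: list[bool]) -> float:
    """Fraction of candidate drafts judged constraint-valid."""
    if not verdicts:
        return 0.0  # no candidates, no signal
    return sum(verdicts) / len(verdicts)


# Three of four drafts pass the hard Storyform constraints.
print(pass_rate([True, True, False, True]))  # 0.75
```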

Benchmark Card: Throughline Role Integrity

  • Task: validate whether Objective Story, Main Character, Influence Character, and Relationship Story roles remain distinct and consistent
  • Data: controlled scenarios with targeted Throughline swaps
  • Metric: violation detection accuracy
  • Baselines: prompting-only checks
  • Verifier loop result: Placeholder
  • Qualitative notes: catches hidden role collapse that often appears late in drafts

Alignment Layer

Benchmark Card: Throughline Voice Classification

  • Task: classify whether narrative pressure matches intended Throughline perspective
  • Data: candidate passages labeled by intended Throughline pressure
  • Metric: agreement and classification accuracy
  • Baselines: vanilla model, prompting-only
  • Verifier loop result: Placeholder
  • Qualitative notes: strongest gains appear on Influence Character vs Main Character disambiguation
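Classification accuracy on this card is standard label agreement against gold annotations. A minimal sketch, assuming Throughline perspectives are encoded as short labels (`"MC"`, `"IC"`, `"OS"`, `"RS"` are illustrative codes, not an official encoding):

```python
# Hypothetical accuracy computation for Throughline perspective labels.
def accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of passages whose predicted perspective matches the gold label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


gold = ["MC", "IC", "OS", "RS", "IC"]
pred = ["MC", "MC", "OS", "RS", "IC"]  # one IC passage misread as MC
print(accuracy(pred, gold))  # 0.8
```

The single error in the example is exactly the Influence Character vs Main Character confusion the qualitative note calls out.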

Benchmark Card: Dynamic Consistency

  • Task: evaluate whether Dynamics remain coherent through revisions
  • Data: multi-draft sequences under fixed Dynamic settings
  • Metric: consistency score, pairwise agreement
  • Baselines: best-of-N without verifier
  • Verifier loop result: Placeholder
  • Qualitative notes: reduces “dead-on-arrival” rewrites that read fine but violate intended argument
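The pairwise-agreement metric on this card can be computed over the Dynamic label assigned to each draft in a revision sequence. A minimal sketch, assuming one categorical label per draft; the label values are illustrative:

```python
# Hypothetical pairwise agreement of a Dynamic setting across drafts.
from itertools import combinations


def pairwise_agreement(labels: list[str]) -> float:
    """Fraction of draft pairs that carry the same Dynamic label."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 1.0  # a single draft is trivially consistent
    return sum(a == b for a, b in pairs) / len(pairs)


# Three drafts keep the intended Dynamic; one revision flips it,
# so only 3 of the 6 pairs agree.
print(pairwise_agreement(["steadfast", "steadfast", "steadfast", "change"]))  # 0.5
```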

Benchmark Card: Storybeat Advancement

  • Task: verify that Storybeats advance intended narrative pressure rather than repeat surface events
  • Data: beat-level generation traces
  • Metric: advancement pass rate
  • Baselines: vanilla generation
  • Verifier loop result: Placeholder
  • Qualitative notes: strongest effect appears in middle-act drift prevention

Quality and Diversity Layer

Benchmark Card: Diverse Valid Candidates

  • Task: measure whether best-of-N preserves distinct options after verification
  • Data: same prompt and Storyform across N candidates
  • Metric: count of distinct valid candidates per run
  • Baselines: unconstrained best-of-N, single-pass generation
  • Verifier loop result: Placeholder
  • Qualitative notes: confirms the system narrows to meaning-correct candidates without forcing a single style
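The distinct-candidate count on this card is a deduplicated tally of verifier-valid outputs. A minimal sketch, assuming candidates can be fingerprinted for deduplication; the function and parameter names are illustrative:

```python
# Hypothetical count of distinct candidates surviving verification.
from typing import Callable


def distinct_valid(
    candidates: list[str],
    is_valid: Callable[[str], bool],
    fingerprint: Callable[[str], str] = lambda c: c,
) -> int:
    """Number of unique valid candidates, deduplicated by fingerprint."""
    return len({fingerprint(c) for c in candidates if is_valid(c)})


# Four candidates from one prompt/Storyform: one duplicate, one invalid.
drafts = ["opening A", "opening A", "opening B", "opening C"]
print(distinct_valid(drafts, is_valid=lambda d: d != "opening C"))  # 2
```

In practice the fingerprint would be something coarser than exact string equality (e.g. an embedding cluster), so stylistic variants of one option are not over-counted as diversity.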

Benchmark Card: Human Selection Lift

  • Task: compare human preference for verified candidate sets versus baseline sets when intent must remain fixed
  • Data: blind human selection rounds
  • Metric: selection rate lift
  • Baselines: prompting-only and vanilla sets
  • Verifier loop result: Placeholder
  • Qualitative notes: separates intent correctness from taste choice in review workflows
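Selection rate lift on this card can be read as how far the verified set's win rate sits above chance in blind head-to-head rounds. A minimal sketch under that assumption; the exact lift definition used by the board may differ:

```python
# Hypothetical selection lift: verified-set win rate over a 50/50 baseline.
def selection_lift(verified_wins: int, baseline_wins: int) -> float:
    """Verified set's blind-selection rate minus the chance rate of 0.5."""
    total = verified_wins + baseline_wins
    if total == 0:
        return 0.0
    return verified_wins / total - 0.5


# Verified candidates chosen in 33 of 50 blind rounds.
print(round(selection_lift(33, 17), 2))  # 0.16
```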

Pipeline summary

Storyform Spec -> Generator -> Dramatica Verifier (constraints + scores) -> Best-of-N selection -> Human choice

The verifier narrows outputs to meaning-correct candidates. Human reviewers choose voice and taste.
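The pipeline above can be sketched end to end as a best-of-N loop. A minimal sketch with toy stand-ins: `generate`, `verify`, and `score` are hypothetical placeholders for the generator, the Dramatica Verifier's constraint check, and its alignment score, none of which are specified here:

```python
# Hypothetical best-of-N loop: generate N candidates, keep verifier-valid
# ones, rank by score, and hand a short list to the human reviewer.
import random
from typing import Callable


def best_of_n(spec, n: int, generate: Callable, verify: Callable,
              score: Callable, shortlist: int = 3) -> list:
    candidates = [generate(spec) for _ in range(n)]
    valid = [c for c in candidates if verify(spec, c)]  # constraint layer
    return sorted(valid, key=score, reverse=True)[:shortlist]  # score layer


# Toy stand-ins so the sketch runs: candidates are random floats,
# "valid" means above a threshold, and the score is the value itself.
random.seed(0)
spec = {"storyform": "fixed"}
shortlist = best_of_n(
    spec, n=8,
    generate=lambda s: random.random(),
    verify=lambda s, c: c > 0.3,
    score=lambda c: c,
)
print(len(shortlist) <= 3)  # True
```

The division of labor matches the summary: the verifier and scorer narrow to meaning-correct candidates, while the final pick among the short list stays a human voice-and-taste decision.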