ZS Benchmark Evaluation

3 models × 5 tasks × 3 dimensions
Evaluator: Claude Opus 4.6
Date: 2026-03-14
Artifacts: 15 result files + 3 reflection.zobr scripts
What is ZS (Zobr Script)?

Cognitive scripting language for LLMs

ZS provides formal constructs for describing reasoning processes — not as rigid instructions, but as composable cognitive operations with variables, control flow, and result formatting.

Think of it as SQL for thinking: you define what cognitive steps to take, the LLM decides how to execute them.

Scripts are executed by an LLM as interpreter: the model reads a .zobr file, executes operations step by step, tracks variables, follows control flow, and produces structured output.

12 built-in operations

survey · ground · assert · doubt · contrast · analogy · synthesize · reframe · assess · pivot · scope · conclude

Example: survey (Discovery)
Explore a topic and identify key elements (positions, factors, perspectives).

survey(topic, count?: N) → list
positions = survey("main positions on consciousness", count: 3)

Plus: variables, for/if/loop control flow, user-defined functions (define), yield, imports, @last/@N references.
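A minimal sketch of how these constructs compose (hypothetical: the define parameter syntax and the exact @last semantics, assumed here to reference the most recent result, are not spelled out in the excerpts above):

define quick_check(claim) {
  objection = doubt(claim)
  yield { claim: claim, objection: objection }
}

claims = survey("benefits of spaced repetition", count: 2)
checked = for c in claims { yield quick_check(c) }
verdict = assess(@last)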

Example: how a ZS script looks

task: "Evaluate risks of AI in education"

risks = survey("main risks of AI in education", count: 4)
evidence = for r in risks {
  concrete = ground(r, extract: [examples, studies])
  yield { risk: r, evidence: concrete }
}
overview = synthesize(evidence, method: "rank by severity and interconnection")

result = conclude {
  top_risks: list
  most_critical: string
  recommendation: string
  confidence: low | medium | high
}
Applicability & use cases

What ZS gives you

ZS is a reasoning amplifier, not a capability test. It doesn’t make weak models strong — it makes all models structured. A Haiku execution of a ZS script produces more useful output than a free-form Haiku response to the same question, because the script forces the model to decompose reasoning, show its work, and format conclusions. The benchmark confirms: even the smallest model follows ZS scripts with 92.5% structural fidelity.

When reasoning structure is provided externally by the script, the model’s job shifts from organizing thought to filling containers with content. This is why Sonnet achieves near-parity with Opus (9.3 vs 9.4) — structured scripts compress the capability gap between tiers.

📜 Repeatable analysis patterns

Encode your best analytical workflow once as a .zobr script, then apply it to new inputs. A political news analysis script works on any article. A due diligence script works on any company. The reasoning pattern is reusable — the content changes.

Example: news-analysis.zobr runs the same 6-phase pipeline (ground → stakeholders → motives → narrative gap → cui bono → blind spots) on every article, ensuring nothing is missed.
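Based on the six phases named above, such a script might be sketched like this (the argument names lens: and direction: and the conclude fields are illustrative assumptions, not the actual news-analysis.zobr):

facts = ground(article, extract: [claims, sources, dates])
stakeholders = survey("stakeholders in this story", count: 5)
motives = for s in stakeholders {
  yield { who: s, interest: ground(s, extract: [interests]) }
}
gap = contrast("stated narrative", "observable facts")
beneficiary = reframe(gap, lens: "cui bono")
context = scope(beneficiary, direction: wide)

result = conclude {
  key_facts: list
  likely_beneficiary: string
  blind_spots: list
  confidence: low | medium | high
}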

🧪 Quality assurance for AI reasoning

ZS scripts make reasoning auditable. Instead of a black-box LLM response, you get labeled operations ([doubt], [contrast]) with visible variable flow. You can verify that the model actually considered counterarguments, not just generated a one-sided summary.

Critical for compliance, legal analysis, medical reasoning — anywhere you need to show how a conclusion was reached, not just what it is.

🎯 Cost optimization via model routing

The benchmark shows that different tasks need different models. Use Haiku for structural tasks (surveys, fact extraction) at 2.5× Sonnet's speed, Sonnet for most analytical work, and Opus only for deep dialectical reasoning. ZS scripts make this routing explicit: the same script runs on any model.

Generate scripts with Sonnet (best architecturally), execute with Haiku at scale — valid structured reasoning at a fraction of the cost.

💡 Knowledge capture from AI sessions

When an agent produces exceptional reasoning in a conversation, the reasoning pattern can be distilled into a .zobr script — a reusable artifact. The benchmark proves all three models can generate valid, parameterized scripts (Task 05: 0 errors across all models).

Dual-purpose: humans write scripts as tasks for LLMs, agents export their reasoning as .zobr files for future use.

🎓 Education & critical thinking

ZS externalizes the structure of rigorous thinking: survey before asserting, doubt your own claims, contrast with the strongest counter, synthesize — don’t summarize. Students and analysts can learn these patterns by reading and writing scripts.

A dialectical.zobr template teaches iterative thesis refinement better than a textbook paragraph about dialectics.
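Task 02's description suggests the shape such a template could take (a sketch; the loop syntax and the "stuck" condition are assumptions, not the actual template):

thesis = assert("initial position on the topic")
loop 2 {
  objection = doubt(thesis)
  counter = contrast(thesis, objection)
  status = assess(thesis)
  if status == "stuck" {
    thesis = pivot(thesis)
    thesis = reframe(thesis)
  }
}

result = conclude {
  final_thesis: string
  surviving_objections: list
  confidence: low | medium | high
}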

🌐 Multi-agent cognitive workflows

ZS scripts can serve as shared protocols between agents. One agent runs survey and ground, another runs doubt and contrast, a third synthesizes the results. The script defines the workflow; agents fill the operations.

Part of the Black Zobr federated co-thinking ecosystem.

Benchmark: 5 tasks × 3 models

Tasks

Task 01
Simple pipeline
Linear chain: survey → for loop with ground → synthesize → conclude. Tests basic operation flow, variable tracking, yield.
Task 02
Dialectical reasoning
Iterative thesis refinement: assert → loop 2× {doubt → contrast → assess → if stuck: pivot → reframe}. Tests thesis evolution and conditional branching.
Task 03
Custom functions
User-defined steelman and devils_advocate functions; tests prompt arguments, dot access (attack.damage_level), and if/else branching.
Task 04
News analysis
6-phase pipeline with web search: ground → survey(5 stakeholders) → for loop {assert, doubt, contrast} → reframe(cui bono) → scope(wide). 10+ conclude fields.
Task 05
Reflection & generation
The model analyzes a topic (AI safety regulation), then generates a reusable .zobr script encoding the reasoning pattern, and validates it with zobr-check. Tests both content quality and ZS code generation.
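The custom-function mechanism tested in Task 03 can be illustrated with a sketch (a reconstruction from the task description, not the actual task file; the prompt: argument and damage_level field are taken from that description, the surrounding logic is assumed):

define devils_advocate(position) {
  attack = doubt(position, prompt: "strongest possible objection")
  if attack.damage_level == "fatal" {
    yield reframe(position)
  } else {
    yield position
  }
}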

Models under test

  • Claude Opus 4.6 — most capable, deepest reasoning
  • Claude Sonnet 4.6 — mid-tier, balance of speed and quality
  • Claude Haiku 4.5 — fastest, most cost-effective

Evaluation dimensions

  • Structural compliance (0–10) — does the model follow the ZS script correctly?
  • Content quality (0–10) — how deep, specific, and insightful is the reasoning?
  • Generation quality (0–10) — can the model produce a valid, reusable .zobr script?

Methodology

  • Each model runs each task independently via claude -p (headless mode)
  • Full isolation: no project context, no MCP servers, no cross-task state
  • --effort high for consistent thinking depth
  • Models read the ZS spec + system prompt, then execute the .zobr script
  • Results captured as inference transcripts + model output files
  • Evaluation performed by Opus 4.6 executing evaluate-benchmark.zobr — a ZS script evaluating ZS results (meta-evaluation)

15 runs total (5 tasks × 3 models), 0 failures. Total benchmark time: ~48 minutes.

Model composite scores
Opus 4.6: 9.4 / 10 (expert-level reasoning)
Sonnet 4.6: 9.3 / 10 (near-parity with Opus)
Haiku 4.5: 7.9 / 10 (competent & structured)
Complete scoring matrix
Task                    Dimension    Opus 4.6   Sonnet 4.6   Haiku 4.5
01 — Simple pipeline    Structural   10         10           9
                        Content      9          8            7
                        Composite    9.5        9.0          8.0
02 — Dialectical        Structural   10         10           9
                        Content      9          9            6
                        Composite    9.5        9.5          7.5
03 — Custom functions   Structural   10         10           9
                        Content      9          9            7
                        Composite    9.5        9.5          8.0
04 — News analysis      Structural   10         10           10
                        Content      9          9            7
                        Composite    9.5        9.5          8.5
05 — Reflection         Content      9          9            7
                        Generation   9          9            8
                        Composite    9.0        9.0          7.5
Scores by dimension

Structural Compliance: Opus 10.0 · Sonnet 10.0 · Haiku 9.25
Content Quality: Opus 9.0 · Sonnet 8.8 · Haiku 6.8
Generation Quality: Opus 9.0 · Sonnet 9.0 · Haiku 8.0
Task performance profile

[Radar chart over the five tasks (01 Simple, 02 Dialectical, 03 Functions, 04 News, 05 Reflection): composite score per task, averaged over the structural, content, and generation dimensions. Outer ring = 10, inner ring = 6.]

Opus 4.6: 9.4 · Sonnet 4.6: 9.3 · Haiku 4.5: 7.9
Content quality gap (Opus vs Haiku)
01 — Simple: Δ2
02 — Dialectical: Δ3
03 — Custom functions: Δ2
04 — News analysis: Δ2
05 — Reflection: Δ2
Largest gap on dialectical reasoning (iterative refinement, domain knowledge, emergent synthesis)
Performance & efficiency

Average time per task

Haiku 4.5 110s  (1.8 min)
Opus 4.6 189s  (3.2 min)
Sonnet 4.6 273s  (4.6 min)

Total benchmark time

Haiku 4.5 548s  (9.1 min)
Opus 4.6 946s  (15.8 min)
Sonnet 4.6 1365s  (22.8 min)
Sonnet unexpectedly slowest (2.5× Haiku, 1.4× Opus) despite being mid-tier — may reflect API routing, not model properties.
Key findings
1

ZS is structurally model-agnostic

All three models follow ZS scripts with high fidelity (9.25–10.0). Operations executed in order, variables tracked, control flow followed. The 0.75-point gap is cosmetic, not semantic.

2

Content gap concentrates in dialectical tasks

The Opus–Haiku gap peaks at 3 points on Task 02 (iterative refinement, domain knowledge, emergent synthesis). Structural tasks show smaller gaps. ZS amplifies reasoning where it’s hardest.

3

Sonnet achieves near-parity with Opus (9.3 vs 9.4)

Structured scripts reduce the capability gap between tiers. When reasoning structure is externalized, the model’s job shifts to filling containers with content — and Sonnet fills them nearly as well.

4

All models generate valid ZS scripts

All three reflection.zobr files pass zobr-check with 0 errors. Generation capability scales with interpretation — no “generation penalty.” ZS script generation is a practical workflow.

Model selection guide
Use case                                           Model           Why
Structural tasks (extract, classify, survey)       Haiku           1.7× faster than Opus; near-perfect structural compliance
Dialectical reasoning (doubt/contrast/reframe)     Opus            Content depth gap largest on iterative reasoning
News / political analysis                          Sonnet / Opus   Both expert-level; Sonnet adds source critique
Script generation                                  Sonnet          Most architecturally sophisticated; fully generalizable
High-volume batch processing                       Haiku           2.5× faster than Sonnet; valid reasoning at scale
Philosophy / deep analysis                         Opus            Broadest references; most original framings