ZS Benchmark Evaluation

3 models × 5 tasks × 3 dimensions
Evaluator: Claude Opus 4.6
Date: 2026-03-14
Artifacts: 15 result files + 3 reflection.zobr scripts
What is ZS (Zobr Script)?

Cognitive scripting language for LLMs

ZS provides formal constructs for describing reasoning processes — not as rigid instructions, but as composable cognitive operations with variables, control flow, and result formatting.

Think of it as SQL for thinking: you define what cognitive steps to take, the LLM decides how to execute them.

Scripts are executed by an LLM as interpreter: the model reads a .zobr file, executes operations step by step, tracks variables, follows control flow, and produces structured output.

12 built-in operations

survey · ground · assert · doubt · contrast · analogy · synthesize · reframe · assess · pivot · scope · conclude

Example: survey (Discovery)
Explore a topic and identify key elements (positions, factors, perspectives).

survey(topic, count?: N) → list
positions = survey("main positions on consciousness", count: 3)

Plus: variables, for/if/loop control flow, user-defined functions (define), yield, imports, @last/@N references.
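A minimal sketch of how these constructs compose (hypothetical: the define parameter syntax and the exact @last semantics, assumed here to reference the most recent result, are not spelled out in the excerpts above):

define quick_check(claim) {
  objection = doubt(claim)
  yield { claim: claim, objection: objection }
}

claims = survey("benefits of spaced repetition", count: 2)
checked = for c in claims { yield quick_check(c) }
verdict = assess(@last)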

Example: how a ZS script looks

task: "Evaluate risks of AI in education"

risks = survey("main risks of AI in education", count: 4)
evidence = for r in risks {
  concrete = ground(r, extract: [examples, studies])
  yield { risk: r, evidence: concrete }
}
overview = synthesize(evidence, method: "rank by severity and interconnection")

result = conclude {
  top_risks: list
  most_critical: string
  recommendation: string
  confidence: low | medium | high
}
Applicability & use cases

What ZS gives you

ZS is a reasoning amplifier, not a capability test. It doesn’t make weak models strong — it makes all models structured. A Haiku execution of a ZS script produces more useful output than a free-form Haiku response to the same question, because the script forces the model to decompose reasoning, show its work, and format conclusions. The benchmark confirms: even the smallest model follows ZS scripts with 92.5% structural fidelity.

When reasoning structure is provided externally by the script, the model’s job shifts from organizing thought to filling containers with content. This is why Sonnet achieves near-parity with Opus (9.3 vs 9.4) — structured scripts compress the capability gap between tiers.

📜 Repeatable analysis patterns

Encode your best analytical workflow once as a .zobr script, then apply it to new inputs. A political news analysis script works on any article. A due diligence script works on any company. The reasoning pattern is reusable — the content changes.

Example: news-analysis.zobr runs the same 6-phase pipeline (ground → stakeholders → motives → narrative gap → cui bono → blind spots) on every article, ensuring nothing is missed.
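Based on the six phases named above, such a script might be sketched like this (the argument names lens: and direction: and the conclude fields are illustrative assumptions, not the actual news-analysis.zobr):

facts = ground(article, extract: [claims, sources, dates])
stakeholders = survey("stakeholders in this story", count: 5)
motives = for s in stakeholders {
  yield { who: s, interest: ground(s, extract: [interests]) }
}
gap = contrast("stated narrative", "observable facts")
beneficiary = reframe(gap, lens: "cui bono")
context = scope(beneficiary, direction: wide)

result = conclude {
  key_facts: list
  likely_beneficiary: string
  blind_spots: list
  confidence: low | medium | high
}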

🧪 Quality assurance for AI reasoning

ZS scripts make reasoning auditable. Instead of a black-box LLM response, you get labeled operations ([doubt], [contrast]) with visible variable flow. You can verify that the model actually considered counterarguments, not just generated a one-sided summary.

Critical for compliance, legal analysis, medical reasoning — anywhere you need to show how a conclusion was reached, not just what it is.

🎯 Cost optimization via model routing

The benchmark shows that different tasks need different models. Use Haiku for structural tasks (surveys, fact extraction) at 2.5× Sonnet's speed, Sonnet for most analytical work, and Opus only for deep dialectical reasoning. ZS scripts make this routing explicit: the same script runs on any model.

Generate scripts with Sonnet (best architecturally), execute with Haiku at scale — valid structured reasoning at a fraction of the cost.

💡 Knowledge capture from AI sessions

When an agent produces exceptional reasoning in a conversation, the reasoning pattern can be distilled into a .zobr script — a reusable artifact. The benchmark proves all three models can generate valid, parameterized scripts (Task 05: 0 errors across all models).

Dual-purpose: humans write scripts as tasks for LLMs, agents export their reasoning as .zobr files for future use.

🎓 Education & critical thinking

ZS externalizes the structure of rigorous thinking: survey before asserting, doubt your own claims, contrast with the strongest counter, synthesize — don’t summarize. Students and analysts can learn these patterns by reading and writing scripts.

A dialectical.zobr template teaches iterative thesis refinement better than a textbook paragraph about dialectics.
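Task 02's description suggests the shape such a template could take (a sketch; the loop syntax and the "stuck" condition are assumptions, not the actual template):

thesis = assert("initial position on the topic")
loop 2 {
  objection = doubt(thesis)
  counter = contrast(thesis, objection)
  status = assess(thesis)
  if status == "stuck" {
    thesis = pivot(thesis)
    thesis = reframe(thesis)
  }
}

result = conclude {
  final_thesis: string
  surviving_objections: list
  confidence: low | medium | high
}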

🌐 Multi-agent cognitive workflows

ZS scripts can serve as shared protocols between agents. One agent runs survey and ground, another runs doubt and contrast, a third synthesizes the results. The script defines the workflow; agents fill the operations.

Part of the Black Zobr federated co-thinking ecosystem.

Benchmark: 5 tasks × 3 models

Tasks

Task 01
Simple pipeline
Linear chain: survey → for loop with ground → synthesize → conclude. Tests basic operation flow, variable tracking, yield.
Task 02
Dialectical reasoning
Iterative thesis refinement: assert → loop 2× {doubt → contrast → assess → if stuck: pivot → reframe}. Tests thesis evolution and conditional branching.
Task 03
Custom functions
User-defined steelman and devils_advocate functions; tests prompt arguments, dot access (attack.damage_level), and if/else branching.
Task 04
News analysis
6-phase pipeline with web search: ground → survey(5 stakeholders) → for loop {assert, doubt, contrast} → reframe(cui bono) → scope(wide). 10+ conclude fields.
Task 05
Reflection & generation
The model analyzes a topic (AI safety regulation), then generates a reusable .zobr script encoding the reasoning pattern, and validates it with zobr-check. Tests both content quality and ZS code generation.
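The custom-function mechanism tested in Task 03 can be illustrated with a sketch (a reconstruction from the task description, not the actual task file; the prompt: argument and damage_level field are taken from that description, the surrounding logic is assumed):

define devils_advocate(position) {
  attack = doubt(position, prompt: "strongest possible objection")
  if attack.damage_level == "fatal" {
    yield reframe(position)
  } else {
    yield position
  }
}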

Models under test

  • Claude Opus 4.6 — most capable, deepest reasoning
  • Claude Sonnet 4.6 — mid-tier, balance of speed and quality
  • Claude Haiku 4.5 — fastest, most cost-effective

Evaluation dimensions

  • Structural compliance (0–10) — does the model follow the ZS script correctly?
  • Content quality (0–10) — how deep, specific, and insightful is the reasoning?
  • Generation quality (0–10) — can the model produce a valid, reusable .zobr script?

Methodology

  • Each model runs each task independently via claude -p (headless mode)
  • Full isolation: no project context, no MCP servers, no cross-task state
  • --effort high for consistent thinking depth
  • Models read the ZS spec + system prompt, then execute the .zobr script
  • Results captured as inference transcripts + model output files
  • Evaluation performed by Opus 4.6 executing evaluate-benchmark.zobr — a ZS script evaluating ZS results (meta-evaluation)

15 runs total (5 tasks × 3 models), 0 failures. Total benchmark time: ~48 minutes.

Model composite scores
Opus 4.6: 9.4 / 10 (expert-level reasoning)
Sonnet 4.6: 9.3 / 10 (near-parity with Opus)
Haiku 4.5: 7.9 / 10 (competent & structured)
Complete scoring matrix
Task                    Dimension    Opus 4.6   Sonnet 4.6   Haiku 4.5
01 — Simple pipeline    Structural   10         10           9
                        Content      9          8            7
                        Composite    9.5        9.0          8.0
02 — Dialectical        Structural   10         10           9
                        Content      9          9            6
                        Composite    9.5        9.5          7.5
03 — Custom functions   Structural   10         10           9
                        Content      9          9            7
                        Composite    9.5        9.5          8.0
04 — News analysis      Structural   10         10           10
                        Content      9          9            7
                        Composite    9.5        9.5          8.5
05 — Reflection         Content      9          9            7
                        Generation   9          9            8
                        Composite    9.0        9.0          7.5
Scores by dimension

Structural Compliance: Opus 10.0 · Sonnet 10.0 · Haiku 9.25
Content Quality: Opus 9.0 · Sonnet 8.8 · Haiku 6.8
Generation Quality: Opus 9.0 · Sonnet 9.0 · Haiku 8.0
Task performance profile

[Radar chart over the five tasks (01 Simple, 02 Dialectical, 03 Functions, 04 News, 05 Reflection): composite score per task, averaged over the structural, content, and generation dimensions. Outer ring = 10, inner ring = 6.]

Opus 4.6: 9.4 · Sonnet 4.6: 9.3 · Haiku 4.5: 7.9
Content quality gap (Opus vs Haiku)
01 — Simple: Δ2
02 — Dialectical: Δ3
03 — Custom functions: Δ2
04 — News analysis: Δ2
05 — Reflection: Δ2
Largest gap on dialectical reasoning (iterative refinement, domain knowledge, emergent synthesis)
Performance & efficiency

Average time per task

Haiku 4.5 110s  (1.8 min)
Opus 4.6 189s  (3.2 min)
Sonnet 4.6 273s  (4.6 min)

Total benchmark time

Haiku 4.5 548s  (9.1 min)
Opus 4.6 946s  (15.8 min)
Sonnet 4.6 1365s  (22.8 min)
Sonnet unexpectedly slowest (2.5× Haiku, 1.4× Opus) despite being mid-tier — may reflect API routing, not model properties.
Key findings
1

ZS is structurally model-agnostic

All three models follow ZS scripts with high fidelity (9.25–10.0). Operations executed in order, variables tracked, control flow followed. The 0.75-point gap is cosmetic, not semantic.

2

Content gap concentrates in dialectical tasks

The Opus–Haiku gap peaks at 3 points on Task 02 (iterative refinement, domain knowledge, emergent synthesis). Structural tasks show smaller gaps. ZS amplifies reasoning where it’s hardest.

3

Sonnet achieves near-parity with Opus (9.3 vs 9.4)

Structured scripts reduce the capability gap between tiers. When reasoning structure is externalized, the model’s job shifts to filling containers with content — and Sonnet fills them nearly as well.

4

All models generate valid ZS scripts

All three reflection.zobr files pass zobr-check with 0 errors. Generation capability scales with interpretation — no “generation penalty.” ZS script generation is a practical workflow.

Model selection guide
Use case                                           Model           Why
Structural tasks (extract, classify, survey)       Haiku           1.7× faster than Opus; near-perfect structural compliance
Dialectical reasoning (doubt/contrast/reframe)     Opus            Content depth gap largest on iterative reasoning
News / political analysis                          Sonnet / Opus   Both expert-level; Sonnet adds source critique
Script generation                                  Sonnet          Most architecturally sophisticated; fully generalizable
High-volume batch processing                       Haiku           2.5× faster than Sonnet; valid reasoning at scale
Philosophy / deep analysis                         Opus            Broadest references; most original framings