Eval v3 — C++ to Python Conversion¶
EGAF Governed vs Ungoverned¶
Objective¶
Convert a C++ pi approximation script to the fastest correct Python implementation. NumPy is permitted. The mathematical result must match 3.141592656089 to within a tolerance of 1e-9.
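For reference, the computation is the term-paired Leibniz series, pi = 4 * (1 + sum over i of (1/(4i + 1) - 1/(4i - 1))). The exact baseline code lives in the notebook; the following pure-Python loop is an illustrative sketch of the same calculation, not the measured script:

```python
def calculate(iterations, param1, param2):
    # Pure-Python loop over the paired Leibniz series:
    # pi = 4 * (1 + sum_i (1/(4i + 1) - 1/(4i - 1)))
    result = 1.0
    for i in range(1, iterations + 1):
        j = i * param1               # with param1=4: the 4i in both denominators
        result -= 1.0 / (j - param2)
        result += 1.0 / (j + param2)
    return result

# 1,000,000 iterations keeps this example quick; the eval used 200,000,000
pi_approx = calculate(1_000_000, 4, 1) * 4.0
```

At 200,000,000 iterations this loop is what takes the baseline roughly 43 seconds; the vectorized versions below replace it with NumPy array operations.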
Three runs were executed against the same model (llama-3.3-70b-versatile via Groq API):
| Run | Method | Description |
|---|---|---|
| Baseline | None | Original pure Python loop — no optimization, no AI |
| Run A | Ungoverned | Plain prompt asking for fastest Python. No contract, no governance, no correctness gate. |
| Run B | EGAF | VIAL_SP_PRIMED + custom E_T declaring scope, constraints, and output requirements. SINGLE_PASS mode. |
Results¶
| | Baseline | Run A (Ungoverned) | Run B (EGAF) |
|---|---|---|---|
| Execution time | 43.53s | 6.26s | 4.04s |
| Speedup vs baseline | 1× | 6.9× | 10.8× |
| Correctness | ✓ PASS | ✗ FAIL | ✓ PASS |
| Result value | 3.141592656089 | -0.858407343910 | 3.141592656090 |
| Scope violations | — | 0 / 5 | 0 / 5 |
| Seal compliance | — | — | 8 / 8 |
| Disposition | — | No gate | ACCEPTED |
Key finding¶
EGAF produced a 10.8× speedup over baseline and returned the correct result.
The ungoverned approach was nearly 7× faster than baseline but returned a wrong answer — −0.858 instead of 3.1416. There was no mechanism to detect this. In a production workflow, that output would have silently entered context as if it were valid.
EGAF itself flagged nothing in Run B, because the model got it right under governance. But if Run B had failed, the disposition protocol would have caught it:
> Run A PRUNED — result incorrect. Would resubmit in governed session.
That line appears in the notebook. It is not hypothetical. It is what the eval code does when a run fails the correctness check. The ungoverned run had no equivalent path.
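The gate itself reduces to a tolerance check against the reference value. A minimal sketch of that check (the actual scoring logic is in the notebook; `REFERENCE` and `TOLERANCE` are taken from the objective above):

```python
import math

REFERENCE = 3.141592656089  # expected series value at 200,000,000 iterations
TOLERANCE = 1e-9

def disposition(result):
    """ACCEPT an output only if it matches the reference; otherwise PRUNE it."""
    if math.isfinite(result) and abs(result - REFERENCE) <= TOLERANCE:
        return "ACCEPTED"
    return "PRUNED"  # pruned runs are discarded and resubmitted in a governed session

print(disposition(3.141592656090))   # Run B
print(disposition(-0.858407343910))  # Run A
```

Run B's value differs from the reference by about 1e-12, well inside tolerance; Run A's value is rejected outright.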
Seal compliance — 8 / 8¶
The EGAF run verified all eight OIL_CONTRACT acknowledgment checklist items before execution began:
| Check | Result |
|---|---|
| e1 — Read sequentially | PASS |
| e2 — Understood | PASS |
| e3 — Accepted | PASS |
| e4 — No execution before seal | PASS |
| e5 — Vial positions identified | PASS |
| e6 — Contract sealed | PASS |
| e7 — Heartbeat anchor verified | PASS |
| e8 — SESSION_TIMESTAMP checked | PASS |
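The checklist acts as an all-or-nothing gate: execution may begin only once every item passes. A hypothetical sketch of that pattern (the check names mirror the table above; the real EGAF mechanism is defined by the vial, not by this code):

```python
# Hypothetical representation of the OIL_CONTRACT acknowledgment checklist.
checklist = {
    "e1_read_sequentially": True,
    "e2_understood": True,
    "e3_accepted": True,
    "e4_no_execution_before_seal": True,
    "e5_vial_positions_identified": True,
    "e6_contract_sealed": True,
    "e7_heartbeat_anchor_verified": True,
    "e8_session_timestamp_checked": True,
}

def seal_compliance(checks):
    """Return (passed, total); execution is allowed only when passed == total."""
    passed = sum(1 for ok in checks.values() if ok)
    return passed, len(checks)

passed, total = seal_compliance(checklist)
print(f"{passed} / {total}")
```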
What the model generated¶
Run A (Ungoverned) — wrong result:
```python
import numpy as np
import time

def calculate(iterations, param1, param2):
    i = np.arange(1, iterations + 1)
    j1 = i * param1 - param2
    j2 = i * param1 + param2
    result = np.sum(1 / j2 - 1 / j1)
    return result

result = calculate(200000000, 4, 1) * 4.0
```
The model dropped the `1.0 +` initialization, so the function returns the bare series sum instead of `1.0` plus the sum. Multiplying that by `4.0` yields pi − 4 ≈ −0.858 rather than pi: a numerically plausible value that is mathematically wrong.
Run B (EGAF) — correct result:
```python
import numpy as np
import time

def calculate(iterations, param1, param2):
    j1 = np.arange(1, iterations + 1) * param1 - param2
    j2 = np.arange(1, iterations + 1) * param1 + param2
    result = 1.0 + np.sum(1.0 / j2 - 1.0 / j1)
    return result * 4.0
```
The `1.0 +` initialization is preserved and the `* 4.0` multiplier is applied to the full expression. The result is within tolerance.
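The two outputs differ by exactly the dropped `1.0` term scaled by the final `* 4.0` multiplier. A quick check at a smaller iteration count makes this visible (an illustrative sketch, not the timed eval code; function names are ours):

```python
import numpy as np

def run_a(iterations, param1, param2):
    # Ungoverned version: series sum only, the 1.0 initialization is dropped
    i = np.arange(1, iterations + 1)
    return np.sum(1.0 / (i * param1 + param2) - 1.0 / (i * param1 - param2)) * 4.0

def run_b(iterations, param1, param2):
    # Governed version: the 1.0 initialization is preserved
    i = np.arange(1, iterations + 1)
    return (1.0 + np.sum(1.0 / (i * param1 + param2) - 1.0 / (i * param1 - param2))) * 4.0

a = run_a(1_000_000, 4, 1)  # near -0.858, like Run A
b = run_b(1_000_000, 4, 1)  # near 3.1416, like Run B
print(b - a)  # ~4.0: the missing 1.0, scaled by the final * 4.0
```

This is why Run A's value of −0.858407343910 is exactly 4 less than the correct answer.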
Environment¶
| | Value |
|---|---|
| Model | llama-3.3-70b-versatile |
| Provider | Groq API |
| EGAF vial | VIAL_SP_PRIMED (gaffer variant) |
| Process mode | SINGLE_PASS |
| Platform | Google Colab |
| Notebook | egaf_eval_v3_LAST.ipynb |
Interpretation¶
This eval does not claim that EGAF always produces correct results. It demonstrates that:
- Governed sessions surface correctness failures — the disposition protocol requires explicit acceptance before an output can influence anything. A failed run is pruned and retried cleanly.
- Ungoverned sessions have no equivalent gate — a wrong result passes through silently with no mechanism for the operator to know.
- Governance did not cost speed — the EGAF run was faster, not slower. Scope declaration and constraint enforcement did not degrade model output quality.
The ungoverned approach was not careless. It used a well-crafted prompt with clear requirements. It still returned a wrong answer. The model made a subtle mathematical error that a human reviewer might not catch without running the code. EGAF provides the infrastructure to catch it.
Source¶
Notebook: egaf_evaluation
All code, prompts, vial content, model responses, and scoring logic are in the notebook.
No API keys are present in the file — the key was loaded from Google Colab's secrets vault at runtime.