Eval v3 — C++ to Python Conversion¶
EGAF Governed vs Ungoverned¶
Objective¶
Convert a C++ pi approximation script to the fastest correct Python implementation. NumPy is permitted. The mathematical result must match 3.141592656089 to within a tolerance of 1e-9.
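For reference, the computation is the term-paired Leibniz series, pi = 4 * (1 + sum over i of (1/(4i + 1) - 1/(4i - 1))). The exact baseline code lives in the notebook; the following pure-Python loop is an illustrative sketch of the same calculation, not the measured script:

```python
def calculate(iterations, param1, param2):
    # Pure-Python loop over the paired Leibniz series:
    # pi = 4 * (1 + sum_i (1/(4i + 1) - 1/(4i - 1)))
    result = 1.0
    for i in range(1, iterations + 1):
        j = i * param1               # with param1=4: the 4i in both denominators
        result -= 1.0 / (j - param2)
        result += 1.0 / (j + param2)
    return result

# 1,000,000 iterations keeps this example quick; the eval used 200,000,000
pi_approx = calculate(1_000_000, 4, 1) * 4.0
```

At 200,000,000 iterations this loop is what takes the baseline roughly 43 seconds; the vectorized versions below replace it with NumPy array operations.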
Three runs were executed against the same model (llama-3.3-70b-versatile via Groq API):
| Run | Method | Description |
|---|---|---|
| Baseline | None | Original pure Python loop — no optimization, no AI |
| Run A | Ungoverned | Plain prompt asking for fastest Python. No contract, no governance, no correctness gate. |
| Run B | EGAF | VIAL_SP_PRIMED + custom E_T declaring scope, constraints, and output requirements. SINGLE_PASS mode. |
Results¶
| | Baseline | Run A (Ungoverned) | Run B (EGAF) |
|---|---|---|---|
| Execution time | 43.53s | 6.26s | 4.04s |
| Speedup vs baseline | 1× | 6.9× | 10.8× |
| Correctness | ✓ PASS | ✗ FAIL | ✓ PASS |
| Result value | 3.141592656089 | -0.858407343910 | 3.141592656090 |
| Scope violations | — | 0 / 5 | 0 / 5 |
| Seal compliance | — | — | 8 / 8 |
| Disposition | — | No gate | ACCEPTED |
Key finding¶
EGAF produced a 10.8× speedup over baseline and returned the correct result.
The ungoverned approach was nearly 7× faster than baseline but returned a wrong answer — −0.858 instead of 3.1416. There was no mechanism to detect this. In a production workflow, that output would have silently entered context as if it were valid.
EGAF itself flagged nothing in Run B, because the model got it right under governance. But if Run B had failed, the disposition protocol would have caught it:
> Run A PRUNED — result incorrect. Would resubmit in governed session.
That line appears in the notebook. It is not hypothetical. It is what the eval code does when a run fails the correctness check. The ungoverned run had no equivalent path.
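The gate itself reduces to a tolerance check against the reference value. A minimal sketch of that check (the actual scoring logic is in the notebook; `REFERENCE` and `TOLERANCE` are taken from the objective above):

```python
import math

REFERENCE = 3.141592656089  # expected series value at 200,000,000 iterations
TOLERANCE = 1e-9

def disposition(result):
    """ACCEPT an output only if it matches the reference; otherwise PRUNE it."""
    if math.isfinite(result) and abs(result - REFERENCE) <= TOLERANCE:
        return "ACCEPTED"
    return "PRUNED"  # pruned runs are discarded and resubmitted in a governed session

print(disposition(3.141592656090))   # Run B
print(disposition(-0.858407343910))  # Run A
```

Run B's value differs from the reference by about 1e-12, well inside tolerance; Run A's value is rejected outright.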
Seal compliance — 8 / 8¶
The EGAF run verified all eight OIL_CONTRACT acknowledgment checklist items before execution began:
| Check | Result |
|---|---|
| e1 — Read sequentially | PASS |
| e2 — Understood | PASS |
| e3 — Accepted | PASS |
| e4 — No execution before seal | PASS |
| e5 — Vial positions identified | PASS |
| e6 — Contract sealed | PASS |
| e7 — Heartbeat anchor verified | PASS |
| e8 — SESSION_TIMESTAMP checked | PASS |
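The checklist acts as an all-or-nothing gate: execution may begin only once every item passes. A hypothetical sketch of that pattern (the check names mirror the table above; the real EGAF mechanism is defined by the vial, not by this code):

```python
# Hypothetical representation of the OIL_CONTRACT acknowledgment checklist.
checklist = {
    "e1_read_sequentially": True,
    "e2_understood": True,
    "e3_accepted": True,
    "e4_no_execution_before_seal": True,
    "e5_vial_positions_identified": True,
    "e6_contract_sealed": True,
    "e7_heartbeat_anchor_verified": True,
    "e8_session_timestamp_checked": True,
}

def seal_compliance(checks):
    """Return (passed, total); execution is allowed only when passed == total."""
    passed = sum(1 for ok in checks.values() if ok)
    return passed, len(checks)

passed, total = seal_compliance(checklist)
print(f"{passed} / {total}")
```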
What the model generated¶
Run A (Ungoverned) — wrong result:
```python
import numpy as np
import time

def calculate(iterations, param1, param2):
    i = np.arange(1, iterations + 1)
    j1 = i * param1 - param2
    j2 = i * param1 + param2
    result = np.sum(1 / j2 - 1 / j1)
    return result

result = calculate(200000000, 4, 1) * 4.0
```
The model dropped the `1.0 +` initialization, so the function returns the bare series sum instead of `1.0` plus the sum. Multiplying that by `4.0` yields pi − 4 ≈ −0.858 rather than pi: a numerically plausible value that is mathematically wrong.
Run B (EGAF) — correct result:
```python
import numpy as np
import time

def calculate(iterations, param1, param2):
    j1 = np.arange(1, iterations + 1) * param1 - param2
    j2 = np.arange(1, iterations + 1) * param1 + param2
    result = 1.0 + np.sum(1.0 / j2 - 1.0 / j1)
    return result * 4.0
```
The `1.0 +` initialization is preserved and the `* 4.0` multiplier is applied to the full expression. The result is within tolerance.
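The two outputs differ by exactly the dropped `1.0` term scaled by the final `* 4.0` multiplier. A quick check at a smaller iteration count makes this visible (an illustrative sketch, not the timed eval code; function names are ours):

```python
import numpy as np

def run_a(iterations, param1, param2):
    # Ungoverned version: series sum only, the 1.0 initialization is dropped
    i = np.arange(1, iterations + 1)
    return np.sum(1.0 / (i * param1 + param2) - 1.0 / (i * param1 - param2)) * 4.0

def run_b(iterations, param1, param2):
    # Governed version: the 1.0 initialization is preserved
    i = np.arange(1, iterations + 1)
    return (1.0 + np.sum(1.0 / (i * param1 + param2) - 1.0 / (i * param1 - param2))) * 4.0

a = run_a(1_000_000, 4, 1)  # near -0.858, like Run A
b = run_b(1_000_000, 4, 1)  # near 3.1416, like Run B
print(b - a)  # ~4.0: the missing 1.0, scaled by the final * 4.0
```

This is why Run A's value of −0.858407343910 is exactly 4 less than the correct answer.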
Environment¶
| | Value |
|---|---|
| Model | llama-3.3-70b-versatile |
| Provider | Groq API |
| EGAF vial | VIAL_SP_PRIMED (gaffer variant) |
| Process mode | SINGLE_PASS |
| Platform | Google Colab |
| Notebook | egaf_eval_v3_LAST.ipynb |
Interpretation¶
This eval does not claim that EGAF always produces correct results. It demonstrates that:
- Governed sessions surface correctness failures — the disposition protocol requires explicit acceptance before an output can influence anything. A failed run is pruned and retried cleanly.
- Ungoverned sessions have no equivalent gate — a wrong result passes through silently with no mechanism for the operator to know.
- Governance did not cost speed — the EGAF run was faster, not slower. Scope declaration and constraint enforcement did not degrade model output quality.
The ungoverned approach was not careless. It used a well-crafted prompt with clear requirements. It still returned a wrong answer. The model made a subtle mathematical error that a human reviewer might not catch without running the code. EGAF provides the infrastructure to catch it.
Source¶
Notebook: egaf_evaluation
All code, prompts, vial content, model responses, and scoring logic are in the notebook.
No API keys are present in the file — the key was loaded from Google Colab's secrets vault at runtime.