
Eval v3 — C++ to Python Conversion

EGAF Governed vs Ungoverned


Objective

Convert a C++ pi approximation script to the fastest correct Python implementation. NumPy is permitted. The mathematical result must match the reference value 3.141592656089 to within a tolerance of 1e-9.
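For reference, the series being approximated is a rearranged Leibniz series for π/4. A minimal pure-Python sketch of the baseline loop (an assumption on my part — the original C++ source is not reproduced in this section, but the parameter names match the generated code below):

```python
def calculate(iterations, param1, param2):
    # Leibniz-style series: pi/4 = 1 + sum_{i>=1} (1/(4i+1) - 1/(4i-1))
    result = 1.0
    for i in range(1, iterations + 1):
        j1 = i * param1 - param2   # 4i - 1 when param1=4, param2=1
        j2 = i * param1 + param2   # 4i + 1
        result += 1.0 / j2 - 1.0 / j1
    return result * 4.0

# At 200,000,000 iterations this converges to 3.141592656089...;
# a reduced count is enough to see the convergence.
approx = calculate(1_000_000, 4, 1)
```

The alternating-series remainder bounds the error by roughly 4/(4N+3), which is why a fixed tolerance of 1e-9 is achievable at 2×10⁸ iterations.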

Three runs were executed against the same model (llama-3.3-70b-versatile via Groq API):

| Run | Method | Description |
| --- | --- | --- |
| Baseline | None | Original pure Python loop; no optimization, no AI |
| Run A | Ungoverned | Plain prompt asking for fastest Python. No contract, no governance, no correctness gate. |
| Run B | EGAF | VIAL_SP_PRIMED + custom E_T declaring scope, constraints, and output requirements. SINGLE_PASS mode. |

Results

| | Baseline | Run A (Ungoverned) | Run B (EGAF) |
| --- | --- | --- | --- |
| Execution time | 43.53 s | 6.26 s | 4.04 s |
| Speedup vs baseline | 1× | 6.9× | 10.8× |
| Correctness | ✓ PASS | ✗ FAIL | ✓ PASS |
| Result value | 3.141592656089 | -0.858407343910 | 3.141592656090 |
| Scope violations | n/a | 0 / 5 | 0 / 5 |
| Seal compliance | n/a | n/a | 8 / 8 |
| Disposition | n/a | No gate | ACCEPTED |

Key finding

EGAF produced a 10.8× speedup over baseline and returned the correct result.

The ungoverned approach was nearly 7× faster than baseline but returned a wrong answer: -0.858 instead of 3.1416. There was no mechanism to detect this. In a production workflow, that output would have silently entered context as if it were valid.

EGAF flagged nothing in Run B, because the model got it right under governance. But if Run B had failed, the disposition protocol would have caught it:

Run A PRUNED — result incorrect. Would resubmit in governed session.

That line appears in the notebook. It is not hypothetical. It is what the eval code does when a run fails the correctness check. The ungoverned run had no equivalent path.
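The gate itself is simple. A hypothetical sketch of the correctness check (names such as EXPECTED, TOLERANCE, and the disposition labels are mine, not quoted from the notebook):

```python
EXPECTED = 3.141592656089   # reference value from the eval objective
TOLERANCE = 1e-9

def disposition(result):
    """Accept a run's output only if it passes the correctness gate.

    Anything outside tolerance is pruned: it never enters context,
    and the run is resubmitted under governance.
    """
    if abs(result - EXPECTED) <= TOLERANCE:
        return "ACCEPTED"
    return "PRUNED"

# Run B's value passes; Run A's value would be pruned.
```

The ungoverned run simply has no call site for such a function, which is the entire finding.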


Seal compliance — 8 / 8

The EGAF run verified all eight OIL_CONTRACT acknowledgment checklist items before execution began:

| Check | Result |
| --- | --- |
| e1 — Read sequentially | PASS |
| e2 — Understood | PASS |
| e3 — Accepted | PASS |
| e4 — No execution before seal | PASS |
| e5 — Vial positions identified | PASS |
| e6 — Contract sealed | PASS |
| e7 — Heartbeat anchor verified | PASS |
| e8 — SESSION_TIMESTAMP checked | PASS |
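A seal gate of this shape can be verified mechanically before any execution is allowed. The following is a hypothetical sketch (the function and item strings are mine; the checklist labels follow the table above):

```python
# Hypothetical seal gate: execution proceeds only when all eight
# OIL_CONTRACT acknowledgment items have been confirmed.
SEAL_ITEMS = [
    "e1 Read sequentially",
    "e2 Understood",
    "e3 Accepted",
    "e4 No execution before seal",
    "e5 Vial positions identified",
    "e6 Contract sealed",
    "e7 Heartbeat anchor verified",
    "e8 SESSION_TIMESTAMP checked",
]

def seal_compliance(acknowledged):
    """Return (passed, total); run only when passed == total."""
    passed = sum(1 for item in SEAL_ITEMS if item in acknowledged)
    return passed, len(SEAL_ITEMS)
```

An 8 / 8 result from this check corresponds to the "Seal compliance" row in the results table.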

What the model generated

Run A (Ungoverned) — wrong result:

```python
import numpy as np
import time

def calculate(iterations, param1, param2):
    i = np.arange(1, iterations + 1)
    j1 = i * param1 - param2
    j2 = i * param1 + param2
    result = np.sum(1 / j2 - 1 / j1)
    return result

result = calculate(200000000, 4, 1) * 4.0
```

The model dropped the 1.0 + initialization (moving the * 4.0 multiplier outside the function is harmless on its own), producing a numerically plausible-looking but mathematically incorrect result.
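The wrong value is not noise: without the leading 1.0, the sum converges to π/4 − 1, and the final * 4.0 turns that into π − 4 ≈ −0.8584, which matches the value Run A reported. A quick check at a reduced iteration count (the function name run_a is mine):

```python
import numpy as np

def run_a(iterations, param1, param2):
    # Run A's version of the sum: the leading 1.0 term is missing.
    i = np.arange(1, iterations + 1)
    return np.sum(1 / (i * param1 + param2) - 1 / (i * param1 - param2))

wrong = run_a(1_000_000, 4, 1) * 4.0          # converges to pi - 4
fixed = (1.0 + run_a(1_000_000, 4, 1)) * 4.0  # converges to pi
```

This is why the error is easy to miss in review: the code is structurally almost identical to the correct version and differs by one constant term.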

Run B (EGAF) — correct result:

```python
import numpy as np
import time

def calculate(iterations, param1, param2):
    j1 = np.arange(1, iterations + 1) * param1 - param2
    j2 = np.arange(1, iterations + 1) * param1 + param2
    result = 1.0 + np.sum(1.0 / j2 - 1.0 / j1)
    return result * 4.0
```

The initialization 1.0 + is preserved. The multiplier is applied correctly. The result is within tolerance.
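One practical caveat the eval does not exercise: at 200,000,000 iterations, Run B materializes several ~1.6 GB float64 arrays at once. A memory-friendlier variant (my own sketch, not part of the eval) computes the index range once per fixed-size chunk and accumulates:

```python
import numpy as np

def calculate_chunked(iterations, param1, param2, chunk=10_000_000):
    # Same series as Run B, evaluated in chunks to cap peak memory.
    total = 0.0
    for start in range(1, iterations + 1, chunk):
        i = np.arange(start, min(start + chunk, iterations + 1), dtype=np.float64)
        j = i * param1
        total += np.sum(1.0 / (j + param2) - 1.0 / (j - param2))
    return (1.0 + total) * 4.0
```

Peak memory becomes proportional to the chunk size rather than the full iteration count, at the cost of a short Python-level loop.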


Environment

| Setting | Value |
| --- | --- |
| Model | llama-3.3-70b-versatile |
| Provider | Groq API |
| EGAF vial | VIAL_SP_PRIMED (gaffer variant) |
| Process mode | SINGLE_PASS |
| Platform | Google Colab |
| Notebook | egaf_eval_v3_LAST.ipynb |

Interpretation

This eval does not claim that EGAF always produces correct results. It demonstrates that:

  1. Governed sessions surface correctness failures — the disposition protocol requires explicit acceptance before an output can influence anything. A failed run is pruned and retried cleanly.
  2. Ungoverned sessions have no equivalent gate — a wrong result passes through silently with no mechanism for the operator to know.
  3. Governance did not cost speed — the EGAF run was faster, not slower. Scope declaration and constraint enforcement did not degrade model output quality.

The ungoverned approach was not careless. It used a well-crafted prompt with clear requirements. It still returned a wrong answer. The model made a subtle mathematical error that a human reviewer might not catch without running the code. EGAF provides the infrastructure to catch it.


Source

Notebook: egaf_evaluation
All code, prompts, vial content, model responses, and scoring logic are in the notebook.
No API keys are present in the file — the key was loaded from Google Colab's secrets vault at runtime.