Treni

Leaderboard

Canonical benchmark tables across G5 and Lambda A100/H100, plus historical run context.

Lower time is better.

Canonical Cross-Hardware Snapshot (Paper Package, 2026-02-20)

Sources:

Phase 2 Cold/Hot

| Hardware | Cold Startup | Cold TTFT | Cold Full | Warm Mean | Warm p99 |
|---|---|---|---|---|---|
| g5 | 1002.273 ms | 8.460 ms | 150.035 ms | 80.602 ms | 90.350 ms |
| lambda_a100 | 1002.708 ms | 29.657 ms | 32.008 ms | 10.356 ms | 14.536 ms |
| lambda_h100 | 1004.890 ms | 56.944 ms | 62.064 ms | 18.491 ms | 24.944 ms |

Routing Matrix

| System | Overall Ext/Int | Baseline p00 Ext/Int | Stress p04 Ext/Int | Overall Int Error | Overall Ext Error |
|---|---|---|---|---|---|
| g5 | 1.208x | 1.042x | 1.436x | 0.0000 | 0.0347 |
| lambda_a100 | 2.430x | 1.396x | 3.785x | 0.0000 | 0.0347 |
| lambda_h100 | 2.397x | 1.533x | 3.545x | 0.0000 | 0.0347 |

Phase 3 Loops (Baseline/Stress)

| System | Baseline Internal Success | Baseline External Success | Baseline Ext/Int Latency | Stress Internal Success | Stress External Success | Stress Ext/Int Latency |
|---|---|---|---|---|---|---|
| g5 | 1.0000 | 0.9006 | 16.060x | 1.0000 | 0.8782 | 77.170x |
| lambda_a100 | 1.0000 | 0.9006 | 16.879x | 1.0000 | 0.8782 | 77.861x |
| lambda_h100 | 1.0000 | 0.9006 | 18.693x | 1.0000 | 0.8782 | 72.741x |

C2 Runtime-Native Uncertainty Deltas (Uncertainty On-Off Success Delta)

| System | Baseline Internal Delta | Baseline External Delta | Stress Internal Delta | Stress External Delta |
|---|---|---|---|---|
| g5 | +0.1539 | +0.1058 | +0.1539 | +0.1154 |
| lambda_a100 | +0.2308 | +0.2308 | +0.2308 | +0.2212 |
| lambda_h100 | +0.2308 | +0.2308 | +0.2308 | +0.2212 |

Phase 5 Real-Benchmark (Diagnostic, G5, 2026-03-01)

Source artifacts:

Arm A (Control) Score Means

| Run | GPQA | IFEval | GSM8K | AIME25 |
|---|---|---|---|---|
| r5 tokenizerfix2 (canonical diagnostic) | 0.5000 | 0.5625 | 0.0000 | 0.0000 |
| r6 qwentpl1 (template A/B, non-canonical) | 0.1250 | 0.3125 | 0.0000 | 0.0000 |

Runtime (r5 Arm A) vs HF Reference (Same Sampled Set)

| System | GPQA | IFEval | GSM8K | AIME25 |
|---|---|---|---|---|
| Runtime r5 Arm A | 0.5000 | 0.5625 | 0.0000 | 0.0000 |
| HF reference control | 0.2500 | 0.6250 | 0.0000 | 0.0000 |

Interpretation:

  • r5 is the current Phase 5 diagnostic reference.
  • r6 was an explicit template-path experiment and regressed quality/latency; it is not used as canonical.
  • HF parity result for this sampled set: runtime and HF are tied on GSM8K/AIME (0.0), so current math-task failures are not runtime-only breakage.

Qwen3.5 One-Host Strict Matrix (AWS G5, 2026-03-07)

Source artifacts:

Overall / Per-Task (Arm A, gpqa_diamond+ifeval, seeds 7/17/27, 8/task)

| System | Overall Score | Overall Latency | GPQA | GPQA Latency | IFEval | IFEval Latency |
|---|---|---|---|---|---|---|
| Runtime | 0.3333 | 3809.745 ms | 0.2917 | 2867.493 ms | 0.3750 | 4751.996 ms |
| vLLM | 0.3160 | 1626.068 ms | 0.2917 | 418.173 ms | 0.3403 | 2833.964 ms |

Interpretation:

  • This is the cleanest current one-host strict A/B on Qwen3.5-0.8B.
  • Runtime is no longer behind on aggregate score on this set.
  • Runtime is still far slower overall, so this is not a claim of universal superiority.
  • The result remains task-stratified:
    • gpqa_diamond score is now at parity, but runtime latency is still far worse,
    • ifeval score is higher for runtime, but latency is still worse there too.

Qwen3.5 Nightly vLLM Diagnostic (G5, 2026-03-02)

Source artifacts:

| Run | GPQA A/B/C | IFEval A/B/C | GSM8K A/B/C | AIME25 A/B/C |
|---|---|---|---|---|
| r1 (first nightly run) | 0.500 / 0.250 / 0.625 | 0.5625 / 0.4375 / 0.5000 | 0.000 / 0.000 / 0.000 | 0.000 / 0.000 / 0.000 |
| r2 (conservative policy) | 0.375 / 0.250 / 0.250 | 0.3125 / 0.5000 / 0.3750 | 0.000 / 0.000 / 0.000 | 0.000 / 0.125 / 0.000 |
| r3 (shared-first fairness fix) | 0.375 / 0.375 / 0.375 | 0.3125 / 0.3125 / 0.3125 | 0.000 / 0.000 / 0.000 | 0.000 / 0.000 / 0.000 |

Interpretation:

  • r3 removes arm-to-arm sampling noise (all arms share the same first completion).
  • Post-fix state is no-regression parity (B-A=0, C-A=0) rather than uplift.

Qwen3.5 Strict Runtime vs vLLM Matrix (G5, 2026-03-03)

Source artifacts:

| Run | Matrix mode | Runtime score | vLLM score | Score delta (runtime-vLLM) | Runtime latency | vLLM latency | Latency delta (runtime-vLLM) |
|---|---|---|---|---|---|---|---|
| 20260302T222013Z | strict (all arms validated) | 0.0503 | 0.2170 | -0.1667 | 1881.188 ms | 178.093 ms | +1703.095 ms |
| 20260303T104038Z | strict Arm A-only (--phase5-arms arm_a_control) | 0.1563 | 0.1910 | -0.0347 | 1723.685 ms | 958.757 ms | +764.928 ms |

Latest run (20260303T104038Z) per-task score deltas (runtime-vLLM):

  • gpqa_diamond: -0.0833
  • ifeval: -0.0972
  • gsm8k: +0.0417
  • aime25: 0.0000

Interpretation:

  • Strict matrix is no longer blocked and is now fully reproducible.
  • Decoder-path fixes narrowed the quality gap materially, but runtime is still slower overall and still slightly behind on aggregate score.
  • Top target remains request-path latency recovery on Qwen3.5 while preserving this improved score parity.

Qwen3.5 Parse-Fix AB3 (GPQA+IFEval, G5, 2026-03-04)

Source artifact:

| Scope | Runtime score | vLLM score | Score delta (runtime-vLLM) | Runtime latency | vLLM latency | Latency delta (runtime-vLLM) |
|---|---|---|---|---|---|---|
| Overall (gpqa_diamond+ifeval) | 0.3403 | 0.3229 | +0.0174 | 1772.931 ms | 1553.034 ms | +219.897 ms |
| gpqa_diamond | 0.2708 | 0.2708 | 0.0000 | 2319.086 ms | 437.310 ms | +1881.776 ms |
| ifeval | 0.4097 | 0.3750 | +0.0347 | 1226.775 ms | 2668.758 ms | -1441.983 ms |

Interpretation:

  • This run is task-family stratified, not a universal win.
  • Runtime currently wins on ifeval (quality + latency) but remains far slower on gpqa_diamond.

Qwen3.5 Hybrid Full-Batch AB3 (GPQA+IFEval, AWS G5, 2026-03-08)

Source artifact:

| Scope | Runtime score | vLLM score | Score delta (runtime-vLLM) | Runtime latency | vLLM latency | Latency delta (runtime-vLLM) |
|---|---|---|---|---|---|---|
| Overall (gpqa_diamond+ifeval) | 0.4132 | 0.3472 | +0.0660 | 2940.172 ms | 1686.263 ms | +1253.909 ms |
| gpqa_diamond | 0.4583 | 0.2083 | +0.2500 | 1347.582 ms | 512.075 ms | +835.507 ms |
| ifeval | 0.3681 | 0.4861 | -0.1181 | 4532.763 ms | 2860.452 ms | +1672.311 ms |

Interpretation:

  • This is the first late AWS Qwen3.5 strict AB3 where runtime clearly leads overall on score.
  • The runtime is still slower on latency across both tasks.
  • GPQA is now a strong runtime-quality win; the next blocker is warm decode/request-path latency and the remaining IFEval score deficit.

Qwen3.5 Deterministic Strict Matrix (GPQA+IFEval, AWS G5, 2026-03-08)

Source artifact:

| Scope | Runtime score | vLLM score | Score delta (runtime-vLLM) | Runtime latency | vLLM latency | Latency delta (runtime-vLLM) |
|---|---|---|---|---|---|---|
| Overall (gpqa_diamond+ifeval) | 0.2951 | 0.2674 | +0.0278 | 824.714 ms | 1572.529 ms | -747.815 ms |
| gpqa_diamond | 0.1667 | 0.1667 | 0.0000 | 671.640 ms | 436.583 ms | +235.057 ms |
| ifeval | 0.4236 | 0.3681 | +0.0556 | 977.787 ms | 2708.475 ms | -1730.688 ms |

Interpretation:

  • This is the current claim-safe Qwen3.5 strict lane.
  • Runtime wins overall on both score and latency.
  • gpqa_diamond is now exact parity on score, while ifeval is a runtime win on both score and latency.
  • Sampled-lane reproducibility is now fixed separately; this deterministic lane remains the cleanest low-variance claim-safe slice.

Qwen3.5 Sampled Strict Matrix (GPQA+IFEval, AWS G5, 2026-03-08)

Source artifact:

| Scope | Runtime score | vLLM score | Score delta (runtime-vLLM) | Runtime latency | vLLM latency | Latency delta (runtime-vLLM) |
|---|---|---|---|---|---|---|
| Overall (gpqa_diamond+ifeval) | 0.4097 | 0.3021 | +0.1076 | 1617.187 ms | 2017.206 ms | -400.019 ms |
| gpqa_diamond | 0.3750 | 0.2500 | +0.1250 | 710.693 ms | 435.823 ms | +274.870 ms |
| ifeval | 0.4444 | 0.3542 | +0.0903 | 2523.680 ms | 3598.588 ms | -1074.908 ms |

Interpretation:

  • This is the post-fix sampled strict AB3 lane after the harness seed bug was removed.
  • Runtime now wins overall on both score and latency here as well.
  • gpqa_diamond is still runtime-slower, but runtime is ahead on score in both tasks.

Qwen3.5 Sampled Strict Matrix (Larger-N, GPQA+IFEval, AWS G5, 2026-03-08)

Source artifact:

| Scope | Runtime score | vLLM score | Score delta (runtime-vLLM) | Runtime latency | vLLM latency | Latency delta (runtime-vLLM) |
|---|---|---|---|---|---|---|
| Overall (gpqa_diamond+ifeval, 16/task) | 0.3715 | 0.2969 | +0.0747 | 1255.344 ms | 1585.043 ms | -329.699 ms |
| gpqa_diamond | 0.3750 | 0.3125 | +0.0625 | 801.900 ms | 433.256 ms | +368.644 ms |
| ifeval | 0.3681 | 0.2813 | +0.0868 | 1708.789 ms | 2736.831 ms | -1028.043 ms |

Interpretation:

  • This is the stronger non-thinking sampled strict result.
  • Runtime still wins overall on both score and latency at larger sample count.
  • The weakest remaining slice is still GPQA latency, not overall sampled quality.

Qwen3.5 Thinking Strict Matrix (Finalized Closed-Form Lane, AWS G5, 2026-03-09)

Source artifact:

| Scope | Runtime score | vLLM score | Score delta (runtime-vLLM) | Runtime latency | vLLM latency | Latency delta (runtime-vLLM) |
|---|---|---|---|---|---|---|
| Overall (gpqa_diamond+ifeval) | 0.2500 | 0.1944 | +0.0556 | 6823.816 ms | 7503.000 ms | -679.184 ms |
| gpqa_diamond | 0.1667 | 0.1667 | 0.0000 | 7727.880 ms | 7741.028 ms | -13.148 ms |
| ifeval | 0.3333 | 0.2222 | +0.1111 | 5919.753 ms | 7264.973 ms | -1345.220 ms |

Interpretation:

  • This is the current best thinking tradeoff.
  • The closed-form finalize path stays active, but a smaller GPQA reasoning budget removes the old latency collapse.
  • Runtime now wins this finalized thinking lane overall on both score and latency.

Qwen3.5 Fast-Sampler Tie-Stable AB3 (GPQA+IFEval, AWS G5, 2026-03-08)

Source artifact:

| Scope | Runtime score | vLLM score | Score delta (runtime-vLLM) | Runtime latency | vLLM latency | Latency delta (runtime-vLLM) |
|---|---|---|---|---|---|---|
| Overall (gpqa_diamond+ifeval) | 0.3160 | 0.3472 | -0.0313 | 1422.818 ms | 1659.878 ms | -237.060 ms |
| gpqa_diamond | 0.2917 | 0.2083 | +0.0833 | 886.296 ms | 515.171 ms | +371.125 ms |
| ifeval | 0.3403 | 0.4861 | -0.1458 | 1959.340 ms | 2804.584 ms | -845.244 ms |

Interpretation:

  • This is the first clean late AWS Qwen3.5 strict AB3 where runtime is faster overall than vLLM.
  • The remaining blocker is score recovery, not another large latency rescue.
  • GPQA stays runtime-positive on score; the main open quality deficit is ifeval plus residual GPQA fidelity loss versus the older slower sampler lane.

Qwen3.5 Endpoint Probe Matrix (Warm, Extended, G5, 2026-03-06)

Source artifact:

| Backend | Mode | all_ok | Plain Chat | Tool First Turn | Tool Follow-up | Notes |
|---|---|---|---|---|---|---|
| runtime | non-thinking | true | 387.672 ms | 5885.725 ms | 4406.168 ms | strongest current functional lane |
| runtime | thinking | true | 11113.612 ms | 13404.890 ms | 9349.855 ms | verbose reasoning-heavy outputs |
| vLLM | non-thinking | false | 112.543 ms | 1162.850 ms | 490.202 ms | multimodal blocked by --language-model-only; exact-output thinking case hit length |
| vLLM | thinking | false | 3564.340 ms | 3044.431 ms | 608.328 ms | several exact-output prompts end at finish_reason=length |

Interpretation:

  • Runtime is now functionally solid on Qwen3.5 in non-thinking mode.
  • The main remaining runtime gap is long-prompt/tool latency, not base chat correctness.
  • Current vLLM launch is not a full multimodal parity configuration; failures here are partly launch-mode/config artifacts, not only model behavior.

G5 External Cold Backends (Canonical 3-Run Means)

| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response | TTFT/Runtime | Full/Runtime | Cold Total/Runtime |
|---|---|---|---|---|---|---|---|
| runtime | 1003.836 ms | 5.130 ms | 316.403 ms | 1320.240 ms | - | - | - |
| pytorch_transformers | - | 556.949 ms | 2322.565 ms | 19276.667 ms | 108.567x | 7.341x | 14.601x |
| vllm | 27704.770 ms | 84.837 ms | 1232.660 ms | 28937.430 ms | 16.537x | 3.896x | 21.918x |
| ollama (GGUF) | 1002.567 ms | 2638.945 ms | 2996.765 ms | 3999.332 ms | 514.414x | 9.471x | 3.029x |

G5 Foundation (Canonical)

| Metric | Value |
|---|---|
| Baseline pipeline mean | 2407.974 ms |
| Runtime warm request mean (3-run) | 82.707 ms |
| Runtime warm request p99 (3-run) | 91.738 ms |
| Baseline/runtime ratio (pipeline) | 29.11x |

G5 Cold First-Hit — True TTFT (3-run Means, 2026-02-17)

| Model | Early TTFT | clean4 TTFT | Speedup |
|---|---|---|---|
| qwen | 27574.564 ms | 1100.044 ms | 25.067x |
| donut | 67360.388 ms | 150.322 ms | 448.107x |
| bart | 77520.798 ms | 125.011 ms | 620.112x |
| minilm | 23.342 ms | 22.621 ms | 1.032x |

All values above are runtime-instrumented timing.ttft_ms. Validation rerun (clean7, 2026-02-18) matches these values within run noise.

Qwen Cold Upload Ablation (G5, 2026-02-19, Same Harness, Env Toggle Only)

| Metric | GPU Convert Off | GPU Convert On | Speedup (Off/On) |
|---|---|---|---|
| full_latency_ms | 1116.567 ms | 238.740 ms | 4.677x |
| decoder_tensor_upload_ms | 1007 ms | 129 ms | 7.806x |
| decoder_tensor_convert_ms | 862 ms | 6 ms | 143.667x |
| decoder_tensor_h2d_ms | 143 ms | 121 ms | 1.182x |
| startup + full response | 2119.906 ms | 1242.057 ms | 1.707x |

Interpretation:

  • CPU-side tensor conversion was the dominant cold bottleneck for Qwen.
  • Moving BF16/F16 conversion to GPU materially reduced first-hit latency.

External Cold Runtime-Only Ablation (G5, 2026-02-19, Preload On, max_tokens=48)

| Metric | GPU Convert Off | GPU Convert On | Speedup (Off/On) |
|---|---|---|---|
| startup_to_healthy_ms | 2004.560 ms | 1003.455 ms | 1.997x |
| request_ttft_ms | 5.137 ms | 5.127 ms | 1.002x |
| request_full_ms | 317.989 ms | 317.276 ms | 1.002x |
| cold_total_first_token_ms | 2009.697 ms | 1008.582 ms | 1.993x |
| cold_total_first_response_ms | 2322.549 ms | 1320.731 ms | 1.759x |

AWS G5 TTFT Kernel Pass (2026-02-22, Same Host/Container)

Cold (run_mode=cold_first_hit):

| Config | Startup | TTFT | Full |
|---|---|---|---|
| lt0_sync0 baseline | 1002.203 ms | 16.738 ms | 424.685 ms |
| lt0_sync0 softmax pass | 1002.435 ms | 16.715 ms | 425.390 ms |
| lt0_sync0 norm+softmax | 1002.304 ms | 14.813 ms | 399.781 ms |
| lt1_sync0 norm+softmax | 1002.284 ms | 13.974 ms | 396.814 ms |
| lt1_sync0 seq1 tiny-kernel default | 1002.256 ms | 12.504 ms | 390.099 ms |

Warm (run_mode=warm_steady_state, 16 measured, 2 warmup):

| Config | Mean | p95 | p99 |
|---|---|---|---|
| lt0_sync0 baseline | 174.237 ms | 600.575 ms | 1035.823 ms |
| lt0_sync0 softmax pass | 174.504 ms | 601.855 ms | 1035.638 ms |
| lt0_sync0 norm+softmax | 147.495 ms | 501.527 ms | 936.281 ms |
| lt1_sync0 norm+softmax | 147.269 ms | 501.587 ms | 936.297 ms |
| lt1_sync0 seq1 tiny-kernel default | 143.230 ms | 490.296 ms | 924.276 ms |
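
Warm aggregates here are over 16 measured requests after 2 warmup requests are discarded. As a reference for how tail percentiles behave at this sample size, here is a minimal sketch using the nearest-rank convention; the harness's exact interpolation method is an assumption, and the sample latencies are illustrative only.

```python
import math

def nearest_rank_percentile(samples_ms, q):
    """Nearest-rank percentile: smallest value whose rank covers q percent."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

# 16 illustrative warm-request latencies (ms); not real run data.
measured = [140.0, 141.5, 142.0, 143.0, 143.5, 144.0, 145.0, 146.0,
            148.0, 150.0, 155.0, 160.0, 300.0, 480.0, 500.0, 930.0]
p95 = nearest_rank_percentile(measured, 95)
p99 = nearest_rank_percentile(measured, 99)
# With only 16 samples, p95 and p99 both land on the slowest request,
# which is why warm p99 in these tables tracks the single worst request.
```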

Interpretation:

  • softmax-only was near-parity.
  • norm rewrite was the material gain.
  • seq2seq/Bart follow-up (seq_q=1 tiny-kernel path + direct K/V cache write) produced another measurable lift.
  • best measured profile (lt1_sync0 seq1 tiny-kernel default) vs baseline:
    • cold TTFT: 1.339x faster (16.738 -> 12.504 ms)
    • cold full: 1.089x faster (424.685 -> 390.099 ms)
    • warm mean: 1.216x faster (174.237 -> 143.230 ms)
    • warm p99: 1.121x faster (1035.823 -> 924.276 ms)
  • follow-up profile vs previous lt1_sync0 norm+softmax best:
    • cold TTFT: 1.118x faster
    • warm mean: 1.028x faster
  • Bart cold TTFT moved from 16.573 -> 12.842 ms (1.29x faster).
  • 3-seed repeatability (new default path):
    • cold TTFT 12.563 ± 0.037 ms
    • cold full 390.961 ± 0.270 ms
    • warm mean 143.297 ± 0.222 ms
    • warm p99 925.668 ± 1.070 ms
  • seq1 fused follow-up matrix (seq1_hybrid_fused_20260222T192656Z) further improved request path:
    • warm default mean 54.505 -> 52.535 ms (1.037x)
    • warm default p99 82.134 -> 80.554 ms (1.020x)
    • cold default TTFT 6.447 -> 6.209 ms (1.038x)
    • cold default full 147.756 -> 145.587 ms (1.015x)
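
The per-metric speedups and the +/- repeatability figures above reduce to before/after ratios and per-seed mean/sample-stdev aggregation. A minimal sketch of both conventions (the per-seed values are illustrative, not the real run data):

```python
import statistics

def speedup(before_ms: float, after_ms: float) -> float:
    """Ratio convention used in these tables: before/after, so >1 means faster."""
    return before_ms / after_ms

# Baseline vs best measured profile (lt1_sync0 seq1 tiny-kernel default).
cold_ttft_gain = speedup(16.738, 12.504)    # ~1.339x, matching the table
warm_mean_gain = speedup(174.237, 143.230)  # ~1.216x, matching the table

# Repeatability rows report mean +/- sample stdev across seeds.
seed_cold_ttft = [12.52, 12.56, 12.61]  # illustrative per-seed values
seed_mean = statistics.mean(seed_cold_ttft)
seed_stdev = statistics.stdev(seed_cold_ttft)
```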

AWS G5 True Fused cuDNN Frontend A/B (2026-02-22, fixed qwen, warmed)

Source:

| Metric | custom | cudnn_sdpa_frontend | custom/frontend |
|---|---|---|---|
| warm request mean | 19.324 ms | 21.503 ms | 0.899 |
| warm request p99 | 22.087 ms | 24.875 ms | 0.888 |
| warm infer | 18.803 ms | 20.976 ms | 0.896 |
| warm TTFT | 4.199 ms | 4.498 ms | 0.934 |
| cold TTFT | 4.220 ms | 710.641 ms | 0.006 |
| cold full | 250.929 ms | 6610.148 ms | 0.038 |

Interpretation:

  • Warm steady-state is close but still favors custom.
  • Cold first-hit is still dominated by fused plan-build misses.

AWS G5 Frontend Repeatability Matrix (2026-02-22, repeats=3)

Source:

| Profile | Metric | custom (mean +/- stdev) | fused_frontend (mean +/- stdev) | Ratio custom/frontend | Wins custom/frontend/tie |
|---|---|---|---|---|---|
| warm_fixed | warm_request_mean_ms | 19.271 +/- 0.050 ms | 21.468 +/- 0.018 ms | 0.898 +/- 0.003 | 3/0/0 |
| warm_fixed | warm_infer_ms | 18.812 +/- 0.059 ms | 20.984 +/- 0.026 ms | 0.896 +/- 0.004 | 3/0/0 |
| warm_fixed | warm_ttft_ms | 4.198 +/- 0.001 ms | 4.498 +/- 0.001 ms | 0.933 +/- 0.000 | 3/0/0 |
| mixed_churn | warm_request_mean_ms | 47.864 +/- 0.018 ms | 843.141 +/- 0.735 ms | 0.057 +/- 0.000 | 3/0/0 |
| mixed_churn | warm_infer_ms | 47.331 +/- 0.050 ms | 842.542 +/- 0.747 ms | 0.056 +/- 0.000 | 3/0/0 |
| mixed_churn | warm_ttft_ms | 4.197 +/- 0.002 ms | 179.744 +/- 0.263 ms | 0.023 +/- 0.000 | 3/0/0 |

Interpretation:

  • Current custom path wins all tracked metrics across both profiles in repeated runs.

AWS G5 Frontend Claim-Strength (2026-02-22)

Source:

Delta is frontend - custom (positive means custom is faster):

| Profile | Metric | Delta Mean | Delta CI95 | Ratio Mean (custom/frontend) |
|---|---|---|---|---|
| warm_fixed | warm_request_mean_ms | +2.197 ms | [2.125, 2.238] | 0.898 |
| warm_fixed | warm_ttft_ms | +0.300 ms | [0.299, 0.301] | 0.933 |
| mixed_churn | warm_request_mean_ms | +795.277 ms | [794.408, 795.747] | 0.057 |
| mixed_churn | warm_ttft_ms | +175.546 ms | [175.300, 175.820] | 0.023 |

Interpretation:

  • Current evidence is strongly directional for custom > fused_frontend on this hardware/workload.
  • Repeat count is still low (n=3/profile), so high-N reruns remain desirable for paper-grade confidence.
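
The delta CI95 columns come from small paired repeats (n=3 per profile). One standard way to form such an interval is a two-sided t-interval on the per-repeat deltas; the harness's exact method is not stated here, so this is only a sketch with illustrative deltas.

```python
import statistics

T_CRIT_95_DF2 = 4.303  # two-sided 95% t critical value for n=3 (2 dof)

def paired_ci95_n3(deltas_ms):
    """Mean and two-sided 95% t-interval for exactly three paired deltas."""
    assert len(deltas_ms) == 3
    mean = statistics.mean(deltas_ms)
    half_width = T_CRIT_95_DF2 * statistics.stdev(deltas_ms) / 3 ** 0.5
    return mean, (mean - half_width, mean + half_width)

# Illustrative frontend-minus-custom warm TTFT deltas (ms), one per repeat.
mean, (lo, hi) = paired_ci95_n3([0.299, 0.300, 0.301])
```

With n=3 the interval is wide relative to the spread, which is why the text above flags higher-N reruns as desirable for paper-grade confidence.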

AWS G5 Frontend Miss-Mitigation (Updated Canonical, 2026-02-22)

Sources:

Comparison (no_preload -> startup_preload_benchmark_queries):

| Profile | Metric | Fused Frontend Before | Fused Frontend After | Speedup |
|---|---|---|---|---|
| mixed_churn | warm request mean | 843.242 ms | 22.433 ms | 37.590x |
| mixed_churn | warm infer mean | 842.684 ms | 21.965 ms | 38.365x |
| mixed_churn | warm TTFT | 179.541 ms | 4.497 ms | 39.928x |
| mixed_churn | cold TTFT | 704.521 ms | 4.495 ms | 156.723x |
| mixed_churn | cold full latency | 6593.495 ms | 25.785 ms | 255.707x |

Exact cold-prompt probe:

  • fused first-hit TTFT: 4.499 ms
  • fused first-hit full latency: 26.090 ms

Interpretation:

  • With benchmark-query preload coverage, the prior fused cold/mixed miss spikes are effectively removed in this harness.
  • Custom still leads warmed request-path latency slightly (about 0.90 custom/frontend ratio), but cold TTFT is now near parity on the covered prompt set.

AWS G5 Frontend Miss-Mitigation (Shape Prebuild, No Preload Prompts, 2026-02-22)

Sources:

Comparison (no_preload -> shape_prebuild_nopreload):

| Profile | Metric | Fused Frontend Before | Fused Frontend After | Speedup |
|---|---|---|---|---|
| mixed_churn | cold TTFT | 704.521 ms | 5.805 ms | 121.364x |
| mixed_churn | cold full latency | 6593.495 ms | 255.267 ms | 25.830x |
| mixed_churn | warm request mean | 843.242 ms | 51.482 ms | 16.379x |
| mixed_churn | warm TTFT | 179.541 ms | 4.824 ms | 37.218x |

Cold probe (fused frontend, no preload prompts, startup prebuild on):

  • startup->healthy: 11017.541 ms
  • request TTFT: 5.814 ms
  • request full latency: 255.434 ms

Tuned startup probe (seq_kv_max: 16 -> 10):

  • startup->healthy: 7011.472 ms (1.571x faster startup)
  • request TTFT: 5.826 ms (near-identical)
  • request full latency: 254.936 ms (near-identical)

Lower-range probe (seq_kv_max=8):

  • startup->healthy: 6010.381 ms
  • request TTFT: 703.771 ms (regression)
  • request full latency: 1660.576 ms (regression)

Tuned matrix confirmation (seq_kv_max=16 -> 10):

  • warm-fixed fused request mean: 22.556 -> 22.265 ms
  • mixed fused request mean: 51.482 -> 50.974 ms
  • cold fused TTFT: 5.827 -> 5.819 ms

Interpretation:

  • Shape-level prebuild removes no-preload fused request-path spikes without prompt-list curation.
  • Tradeoff: startup latency is still elevated due to up-front plan builds, but tuned range materially reduces that startup cost.
  • seq_kv_max=10 is currently the minimum safe tuned range for this benchmark prompt profile.

AWS G5 Frontend Miss-Mitigation (Hybrid Shape Gate, No Preload Prompts, 2026-02-23)

Sources:

Policy:

  • TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MIN_KV=10
  • TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MAX_KV=10
  • TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_HEAD_DIM=128
  • TRENI_ATTN_CUDNN_FRONTEND_MIN_SEQ_KV=10
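
These variables pin the prebuilt fused-plan window to seq_kv=10 at head_dim 128 and route smaller shapes to the custom path. A minimal sketch of that gate decision, assuming MIN_SEQ_KV acts as a simple lower bound; the runtime's real dispatch logic is not shown in this document.

```python
# Policy values copied from the run above; the gating function itself is an
# illustrative reconstruction, not the runtime's actual dispatch code.
GATE_ENV = {
    "TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MIN_KV": "10",
    "TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MAX_KV": "10",
    "TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_HEAD_DIM": "128",
    "TRENI_ATTN_CUDNN_FRONTEND_MIN_SEQ_KV": "10",
}

def use_fused_frontend(seq_kv: int, env: dict = GATE_ENV) -> bool:
    """Shapes below the gated window fall back to the custom attention path,
    avoiding a fused plan-build miss on shapes not covered at startup."""
    return seq_kv >= int(env["TRENI_ATTN_CUDNN_FRONTEND_MIN_SEQ_KV"])
```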

Startup probe (3 runs, fused frontend, no preload prompts):

  • startup->healthy: 2004.840 +/- 0.146 ms
  • request TTFT: 4.955 +/- 0.011 ms
  • request full latency: 242.673 +/- 0.352 ms

Delta vs prior tuned no-gate probe (prebuild_startup10_nopreload_probe_20260222T235944Z):

  • startup->healthy: 7011.472 -> 2004.840 ms (3.497x faster)
  • request TTFT: 5.826 -> 4.955 ms (1.176x faster)
  • request full latency: 254.936 -> 242.673 ms (1.051x faster)

Matrix deltas vs prior tuned no-gate matrix (attn_backend_frontend_matrix_20260223T000256Z):

  • warm-fixed fused request mean: 22.265 -> 20.354 ms (1.094x faster)
  • mixed fused request mean: 50.974 -> 47.904 ms (1.064x faster)
  • cold fused TTFT: 5.819 -> 4.959 ms (1.173x faster)
  • cold fused full latency: 254.146 -> 242.569 ms (1.048x faster)

Interpretation:

  • Hybrid shape gating removes most remaining startup cost while keeping low no-preload request-path latency.
  • In this harness, strict fused mode stays inference-valid with low-shape fallback to custom path.
  • An initial broader-shape sanity check exposed out-of-window (seq_kv>10) miss cascades; adding a bounded gate (TRENI_ATTN_CUDNN_FRONTEND_MAX_SEQ_KV=10) removed those cascades and cut the same 5-shape set's mean full latency from 9974.576 ms to 274.072 ms (36.395x) while keeping fixed-profile matrix behavior near-identical.

Internal vs External Routing (G5, 2026-02-17)

| Metric | Internal | External | Ratio |
|---|---|---|---|
| Mean latency | 94.849 ms | 97.927 ms | 1.032x |

| Task | Internal | External |
|---|---|---|
| general_short | 150.767 ms | 152.274 ms |
| receipt_extract | 80.732 ms | 81.270 ms |
| search_grounded | 46.945 ms | 57.237 ms |
| summarize_short | 100.950 ms | 100.928 ms |

Internal routing is faster in aggregate and faster on 3/4 tasks in this set (the remaining task is effectively tied).

Internal vs External Routing (G5, 2026-02-18, Failure-Amplification Stress)

| Metric | Internal | External | Ratio |
|---|---|---|---|
| Mean latency | 76.071 ms | 109.806 ms | 1.443x |
| Error rate | 0.0000 | 0.0833 | inf |

Stress profile:

  • Tool failure injection: every 2nd request.
  • Tool timeout injection: every 3rd request (0.9s sleep) with controller timeout 0.25s.
  • Controller tool retries: 1.
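
The injection schedule above can be sketched as a simple modulus over the request index. The 1-based indexing and the co-occurrence behavior on shared multiples are assumptions; the timeout values come from the profile listed above.

```python
CONTROLLER_TIMEOUT_S = 0.25  # controller-side tool timeout from the profile
TIMEOUT_SLEEP_S = 0.9        # injected tool sleep, guaranteed to exceed it

def injected_faults(request_index: int) -> list:
    """Faults injected for the i-th request (1-based) under this profile."""
    faults = []
    if request_index % 2 == 0:
        faults.append("tool_failure")   # every 2nd request
    if request_index % 3 == 0:
        faults.append("tool_timeout")   # every 3rd: 0.9 s sleep vs 0.25 s timeout
    return faults
```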

Observed amplification:

  • External taxonomy: tool_hop_failed=4.
  • External tool retries mean: 0.182.

Internal vs External Routing Matrix (G5, 2026-02-19)

| Profile | Ext/Int Latency Ratio | Int Error Rate | Ext Error Rate | Ext Tool Retries |
|---|---|---|---|---|
| p00 baseline | 1.042x | 0.0000 | 0.0000 | 0.000 |
| p01 fail mild | 1.048x | 0.0000 | 0.0000 | 0.021 |
| p02 timeout mild | 1.142x | 0.0000 | 0.0000 | 0.021 |
| p03 mixed moderate | 1.164x | 0.0000 | 0.0417 | 0.065 |
| p04 mixed aggressive | 1.436x | 0.0000 | 0.0833 | 0.182 |
| p05 mixed aggressive + retry2 | 1.416x | 0.0000 | 0.0833 | 0.091 |

Matrix mean ratio (all profiles): 1.208x external/internal.

Internal vs External Routing Cross-Host Pilot (2026-02-19)

Topology:

  • local benchmark client
  • SSH tunnel to G5 host
  • runtime + external router on G5

| Profile | Internal Mean | External Mean | Ext/Int Ratio | Int Error | Ext Error |
|---|---|---|---|---|---|
| crosshost-p00-baseline | 1071.477 ms | 1059.478 ms | 0.989x | 0.0000 | 0.0000 |
| crosshost-p02-timeout-mild | 1054.123 ms | 1123.393 ms | 1.066x | 0.0000 | 0.0000 |
| crosshost-p04-stress | 1056.013 ms | 1100.010 ms | 1.042x | 0.0000 | 0.0833 |

Stress notes:

  • injected tool failure every 2nd request
  • injected timeout every 3rd request (900 ms sleep)
  • controller tool retries 1 with 20 ms backoff

Internal vs External Routing Split-Host Matrix (2026-02-19, Canonical Track B)

Topology:

  • GPU host: runtime endpoint
  • CPU host: external controller + tool services
  • controller/tool call runtime over private VPC network

| Profile | Internal Mean | External Mean | Ext/Int Ratio | Int Error | Ext Error | Ext Tool Retries |
|---|---|---|---|---|---|---|
| splithost-p00-baseline | 1052.392 ms | 1046.702 ms | 0.995x | 0.0000 | 0.0000 | 0.000 |
| splithost-p01_fail_mild | 1051.906 ms | 1049.576 ms | 0.998x | 0.0000 | 0.0000 | 0.021 |
| splithost-p02_timeout_mild | 1055.872 ms | 1100.629 ms | 1.042x | 0.0000 | 0.0000 | 0.021 |
| splithost-p03_mixed_moderate | 1071.284 ms | 1072.634 ms | 1.001x | 0.0000 | 0.0417 | 0.065 |
| splithost-p04_mixed_aggressive | 1051.099 ms | 1142.209 ms | 1.087x | 0.0000 | 0.0833 | 0.182 |
| splithost-p05_mixed_aggressive_retry2 | 1064.215 ms | 1112.362 ms | 1.045x | 0.0000 | 0.0833 | 0.091 |

Matrix means (all profiles):

  • External/Internal latency ratio: 1.028x.
  • Internal error rate: 0.0000.
  • External error rate: 0.0347.

Internet Multi-Hop Routing (Fly + Commercial APIs, 2026-02-20)

Topology:

  • internal path: local client -> commercial API
  • external path: local client -> Fly controller/tool -> same commercial API

| Provider/Model | Runs/Profile | Matrix Mean Ext/Int | Baseline p00 | Timeout-Mild p02 | Stress p04 | Internal Error | External Error (p04) |
|---|---|---|---|---|---|---|---|
| OpenAI gpt-5.2 | 3 | 1.112x | 1.110x | 1.082x | 1.145x | 0.0000 | 0.0833 |
| OpenRouter openai/gpt-5.2 | 3 | 0.755x | 0.686x | 0.891x | 0.689x | 0.0000 | 0.1667 |
| OpenRouter anthropic/claude-sonnet-4.6 | 3 | 1.028x | 1.236x | 0.968x | 0.879x | 0.0000 | 0.1667 |

Interpretation:

  • OpenAI matrix shows the expected direction for Track B (external hop overhead under internet routing).
  • OpenRouter rows are mixed/inverted by profile and remain non-canonical for Track B direction claims in this topology.

Local Control Routing (No Fly Scheduler Path, 2026-02-20)

Topology:

  • internal path: local client -> commercial API
  • external path: local client -> local standalone controller/tool -> same commercial API

| Provider/Model | Runs/Profile | Matrix Mean Ext/Int | Baseline p00 | Timeout-Mild p02 | Stress p04 | Internal Error | External Error Mean |
|---|---|---|---|---|---|---|---|
| OpenAI gpt-5.2 | 8 | 0.987x | 0.995x | 0.977x | 0.988x | 0.0000 | 0.0313 |
| OpenRouter anthropic/claude-sonnet-4.6 | 8 | 1.066x | 1.055x | 1.141x | 1.003x | 0.0000 | 0.0313 |

Interpretation:

  • With higher-N local controls, OpenAI is near parity and OpenRouter Sonnet trends external > internal.
  • External errors still appear under stress while internal remained error-free in these runs.

Task-Family Parity Split (Local Control, runs=8, 2026-02-20, Legacy Pre-Fairness)

| Provider/Model | Model-Only Ext/Int | Tool-Only Ext/Int | Model-Only Errors (Int/Ext) | Tool-Only Errors (Int/Ext) |
|---|---|---|---|---|
| OpenAI gpt-5.2 | 0.958x | 1.136x | 0 / 0 | 0 / 0 |
| OpenRouter anthropic/claude-sonnet-4.6 | 1.044x | 1.051x | 0 / 0 | 0 / 0 |

Interpretation:

  • Tool-only tasks support the architecture claim on both providers.
  • Model-only behavior is near parity on OpenAI and still favorable on Sonnet.

Task-Family Parity Split (Fairness-Hardened, Local Control, runs=8, 2026-02-20)

Harness controls:

  • interleaved ordering (pair_order=alternate)
  • deterministic defaults (temperature=0)
  • strict tool parity on tool_only
  • token-normalized reporting (ms/completion_token)

| Provider/Model | Model-Only Ext/Int | Tool-Only Ext/Int | Model-Only Int ms/token | Model-Only Ext ms/token | Tool-Only Int ms/token | Tool-Only Ext ms/token |
|---|---|---|---|---|---|---|
| OpenAI gpt-5.2 | 0.971x | 1.038x | 57.657 | 57.663 | 37.553 | 38.990 |
| OpenRouter anthropic/claude-sonnet-4.6 | 1.102x | 1.063x | 61.606 | 70.054 | 41.212 | 43.791 |

Interpretation:

  • Tool-only tasks remain favorable to internal on both providers after parity hardening.
  • OpenAI model-only remains near parity/slight inversion; Sonnet model-only favors internal.
  • Track B commercial claims remain task-family-stratified.
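
Token normalization as reported above is wall latency divided by completion tokens, which keeps a more verbose arm from reading as slower per unit of output. A minimal sketch (the token counts are illustrative; per-run counts are not in the table):

```python
def ms_per_completion_token(latency_ms: float, completion_tokens: int) -> float:
    """Token-normalized latency: wall latency over completion tokens."""
    if completion_tokens <= 0:
        raise ValueError("completion_tokens must be positive")
    return latency_ms / completion_tokens

# Illustrative: an 11531.4 ms response that produced 200 completion tokens.
normalized = ms_per_completion_token(11531.4, 200)  # 57.657 ms/token
```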

Commercial Root-Cause Grouping (Fairness r4+r8, 2026-02-22)

Source:

Delta is external - internal (positive means internal is faster):

| Group | Paired N | Latency Delta Mean | Latency CI95 | Class | Controller Overhead Mean | Model Hop Mean |
|---|---|---|---|---|---|---|
| OpenAI gpt-5.2 model-only | 36 | -69.311 ms | [-193.985, 61.444] | near_parity_noise_dominated | 2.081 ms | 1406.971 ms |
| OpenAI gpt-5.2 tool-only parity | 12 | +49.601 ms | [-162.047, 274.981] | near_parity_noise_dominated | 12.842 ms | 2456.108 ms |
| OpenRouter Sonnet 4.6 model-only | 24 | +204.883 ms | [-148.517, 683.114] | near_parity_noise_dominated | 2.254 ms | 2220.251 ms |
| OpenRouter Sonnet 4.6 tool-only parity | 8 | +165.092 ms | [-124.650, 423.449] | near_parity_noise_dominated | 14.446 ms | 2788.196 ms |

Interpretation:

  • Current commercial control set is not statistically locked as win/loss by group (all CI95 include zero).
  • Measured controller overhead is small relative to model-hop variance, so higher-N region-pinned reruns are needed before claiming directional commercial performance differences.

Phase 3 Agentic Loops (Canonical G5, 2026-02-19, 3 Seeds)

| Metric | Internal | External | Ratio |
|---|---|---|---|
| Success rate (mean) | 1.0000 | 0.9006 | +0.0994 (internal-ext) |
| Mean latency (mean) | 2.668 ms | 42.855 ms | 16.060x |
| Mean steps to convergence (success, mean) | 2.077 | 3.770 | 1.815x |

Per-scenario snapshot:

| Scenario | Int Success | Ext Success | Ext/Int Latency | Ext/Int Steps |
|---|---|---|---|---|
| retrieval_correction | 1.0000 | 1.0000 | 8.853x | 1.333x |
| tool_state_adaptation | 1.0000 | 0.7417 | 29.793x | 3.000x |
| confidence_gated_branching | 1.0000 | 1.0000 | 17.948x | 1.700x |

Stress variant (tool_fail_every=9, tool_timeout_every=11, controller retries 2):

| Metric | Internal | External | Ratio |
|---|---|---|---|
| Success rate (mean) | 1.0000 | 0.8782 | +0.1218 (internal-ext) |
| Mean latency (mean) | 2.669 ms | 205.942 ms | 77.170x |
| Mean steps to convergence (success, mean) | 2.077 | 3.789 | 1.824x |

Phase 3 Uncertainty Ablation (G5, 2026-02-19, baseline+stress repeatability)

Normalized-logprob source (paper-aligned), 3-seed means:

| Arm | Internal Success | External Success | Ext/Int Latency |
|---|---|---|---|
| baseline int_on_ext_on | 1.0000 | 0.9006 | 16.721x |
| baseline int_off_ext_on | 0.7692 | 0.9006 | 22.654x |
| baseline int_on_ext_off | 1.0000 | 0.6698 | 12.108x |
| baseline int_off_ext_off | 0.7692 | 0.6698 | 15.316x |
| stress int_on_ext_on | 1.0000 | 0.8782 | 68.233x |
| stress int_off_ext_on | 0.7692 | 0.8782 | 90.080x |
| stress int_on_ext_off | 1.0000 | 0.6570 | 48.146x |
| stress int_off_ext_off | 0.7692 | 0.6570 | 62.463x |

Uncertainty-on gains (success delta means):

  • Baseline internal: +0.2308.
  • Baseline external: +0.2308.
  • Stress internal: +0.2308.
  • Stress external: +0.2212.
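
Each uncertainty-on gain above is the success-rate difference between the matching on and off arms in the table. For example, baseline internal pairs int_on_ext_on (1.0000) against int_off_ext_on (0.7692):

```python
def uncertainty_on_gain(success_on: float, success_off: float) -> float:
    """Success-delta convention used above: uncertainty-on minus uncertainty-off."""
    return round(success_on - success_off, 4)

# Values taken from the normalized-logprob arm table above.
baseline_internal = uncertainty_on_gain(1.0000, 0.7692)  # +0.2308
stress_external = uncertainty_on_gain(0.8782, 0.6570)    # +0.2212
```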

Cross-source check:

  • raw_logit_margin and hybrid preserve the same internal/external success deltas as normalized-logprob in both baseline and stress sets.
  • These rows are harness-level synthetic uncertainty; runtime-native canonical corroboration is now published below.

Phase 3 Uncertainty Ablation (G5, 2026-02-19, runtime-native canonical rerun, 3 seeds, superseded)

Runtime-native source (/v1/chat/completions uncertainty payload), after greedy uncertainty kernel fix:

| Arm | Internal Success | External Success | Ext/Int Latency |
|---|---|---|---|
| baseline int_on_ext_on | 0.8718 | 0.7853 | 10.9504x |
| baseline int_off_ext_on | 0.7692 | 0.9006 | 19.9477x |
| baseline int_on_ext_off | 1.0000 | 0.6698 | 10.7348x |
| baseline int_off_ext_off | 0.7692 | 0.6698 | 13.8492x |
| stress int_on_ext_on | 1.0000 | 0.8782 | 74.1471x |
| stress int_off_ext_on | 0.7692 | 0.8782 | 95.3550x |
| stress int_on_ext_off | 1.0000 | 0.6570 | 53.1571x |
| stress int_off_ext_off | 0.7692 | 0.6570 | 68.2719x |

Uncertainty-on gains (runtime-native success delta means):

  • Baseline internal: +0.1026.
  • Baseline external: +0.1155.
  • Stress internal: +0.2308.
  • Stress external: +0.2212.

Note:

  • This set established runtime-native wiring, but later audit found fallback contamination on part of the seed set.
  • Use the 2026-02-20 quality-gated rerun below for current canonical C2 interpretation.

Phase 3 Uncertainty Ablation (G5, 2026-02-20, runtime-native awareness3 rerun, zero-fallback quality gate)

Runtime-native source (awareness.generation first, legacy fallback preserved), seeds 7/11/19, baseline + stress:

| Metric | Baseline | Stress |
|---|---|---|
| Internal uncertainty-on success delta | -0.1538 | -0.1538 |
| External uncertainty-on success delta | -0.1217 | -0.1089 |

Quality gate:

  • all runtime-native arm artifacts have non-zero runtime requests/ok, fallback=0, errors=0.

Interpretation:

  • with clean runtime-native probes, uncertainty-on currently hurts success in this harness.
  • runtime-native awareness plumbing is validated; this negative set is now superseded by calibrated rerun below.

Phase 3 Uncertainty Ablation (G5, 2026-02-20, runtime-native calibrated rerun calib1, zero-fallback quality gate)

Runtime-native source, seeds 7/11/19, baseline + stress, calibration params:

  • prior weight 0.75
  • confidence floor 0.10
  • confidence ceil 0.35
  • route blend 0.10

| Metric | Baseline | Stress |
|---|---|---|
| Internal uncertainty-on success delta | +0.1539 | +0.1539 |
| External uncertainty-on success delta | +0.1058 | +0.1154 |

Quality gate:

  • all runtime-native arm artifacts have non-zero runtime requests/ok, fallback=0, errors=0.

Interpretation:

  • calibrated runtime-native uncertainty restores positive uncertainty-on gains in both baseline and stress profiles.
  • C2 is re-locked for this harness configuration.
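
The calib1 parameters (prior weight 0.75, confidence floor 0.10, ceil 0.35, route blend 0.10) suggest a shrink-toward-prior, clamp, then blend shape. The runtime's actual calibration formula is not published in this table, so the composition below is only a plausible sketch; the default prior value is likewise an assumption.

```python
# Parameter values are the calib1 settings above; the composition itself is
# an assumption, not the runtime's verified calibration code.
PRIOR_WEIGHT = 0.75
CONF_FLOOR = 0.10
CONF_CEIL = 0.35
ROUTE_BLEND = 0.10

def calibrated_route_score(base_route_score: float, raw_confidence: float,
                           prior_confidence: float = 0.2) -> float:
    """Shrink raw confidence toward a prior, clamp it to [floor, ceil], then
    lightly blend it into the routing score."""
    shrunk = PRIOR_WEIGHT * prior_confidence + (1.0 - PRIOR_WEIGHT) * raw_confidence
    clamped = min(max(shrunk, CONF_FLOOR), CONF_CEIL)
    return (1.0 - ROUTE_BLEND) * base_route_score + ROUTE_BLEND * clamped

score = calibrated_route_score(0.5, 0.9)  # high raw confidence hits the ceil
```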

External Cold Comparison (G5, 2026-02-24, Step0 Exp-Reuse Patch, 3-run means, Current Best)

| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
|---|---|---|---|---|
| runtime | 1003.287 ms | 4.018 ms | 238.400 ms | 1241.688 ms |
| pytorch_transformers | - | 509.427 ms | 2234.756 ms | 7847.001 ms |
| vllm | 23366.391 ms | 50.406 ms | 997.514 ms | 24363.905 ms |
| ollama (GGUF) | - | - | - | - |

Interpretation:

  • Runtime keeps a large margin vs PyTorch and vLLM on request path and cold-total in this rerun.
  • Runtime also improved vs immediate seq1mh baseline (2026-02-24T192020Z): TTFT 4.022 -> 4.018 ms, full 239.277 -> 238.400 ms, cold-total 1242.592 -> 1241.688 ms.
  • Runtime remains materially better than prior host-prefetch means (2026-02-19): TTFT 5.130 -> 4.018 ms, full 316.403 -> 238.400 ms, cold-total 1320.240 -> 1241.688 ms.
  • Follow-up shared-probability patch (external_cold_step0shared_repeatability_20260224T194913Z) did not beat this row and was reverted.
  • Ollama is intentionally blank here because it was not installed on the rerun host.

External Cold Decode Profiling + Uncertainty A/B (G5, 2026-02-25, No Preload, 64 Tokens)

Profile source:

  • external_cold_stepn_profile_20260225T001334Z
  • external_cold_uncert_on_20260225T001702Z
  • external_cold_uncert_off_20260225T001704Z

| Runtime Mode | Request TTFT | Request Full | Infer | decoder_stepN_layers_mean | decoder_stepN_logits_sample_mean |
|---|---|---|---|---|---|
| uncertainty on | 4.109 ms | 479.889 ms | 461.771 ms | 1.360 ms | 2.671 ms |
| uncertainty off | 3.991 ms | 473.367 ms | 454.878 ms | 1.360 ms | 2.562 ms |

Interpretation:

  • The dominant decode stage in this profile is decoder_stepN_logits_sample, then decoder_stepN_layers.
  • Disabling uncertainty stats helps, but the uplift is modest relative to total decode time; the main custom-kernel target remains logits+sample compute.

External Cold Runtime vs vLLM (G5, 2026-02-25, Same Profile, Uncertainty-Off Runtime)

Source:

  • external_cold_runtime_vllm_uncertoff_20260225T001929Z

| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
| --- | --- | --- | --- | --- |
| runtime | 1003.392 ms | 3.929 ms | 472.724 ms | 1476.116 ms |
| vllm | 23032.532 ms | 49.577 ms | 1311.481 ms | 24344.013 ms |

Ratios (vLLM/runtime):

  • TTFT: 12.618x
  • Request Full: 2.774x
  • Cold Total First Response: 16.492x
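
These ratios follow directly from the table rows; a minimal sketch recomputing them (values copied from the table above):

```python
# Reproduce the vLLM/runtime ratios from the 2026-02-25 uncertainty-off compare.
RUNTIME_MS = {"ttft": 3.929, "full": 472.724, "cold_total": 1476.116}
VLLM_MS = {"ttft": 49.577, "full": 1311.481, "cold_total": 24344.013}

def ratios(baseline: dict, other: dict) -> dict:
    """Per-metric other/baseline ratio, rounded to three decimals."""
    return {k: round(other[k] / baseline[k], 3) for k in baseline}

r = ratios(RUNTIME_MS, VLLM_MS)
assert r == {"ttft": 12.618, "full": 2.774, "cold_total": 16.492}
```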

Full-Depth FFN Projection Batched2 (G5, 2026-02-27, --layers 36, preload64)

Source:

  • external_cold_layers36_preload64_ab3_ffnprojbatch2_off_s{1,2,3}_20260227T1820*.json
  • external_cold_layers36_preload64_ab3_ffnprojbatch2_on_s{1,2,3}_20260227T1820*.json
  • external_cold_layers36_preload64_ab3_ffnprojbatch2_vllm_off_s{1,2,3}_20260227T1821*/1822*.json
  • external_cold_layers36_preload64_ab3_ffnprojbatch2_vllm_on_s{1,2,3}_20260227T1823*/1824*/182512Z.json
  • external_cold_layers36_stageprofile_ffnprojbatch2_off_20260227T182949Z
  • external_cold_layers36_stageprofile_ffnprojbatch2_on_20260227T182728Z

Runtime-only 3-seed means:

| Mode | Request TTFT | Request Full | Cold Total First Response |
| --- | --- | --- | --- |
| batched2 off | 15.189 ms | 1702.190 ms | 4708.109 ms |
| batched2 on | 15.018 ms | 1689.991 ms | 4696.805 ms |

Runtime-vLLM 3-seed (runtime leg):

| Mode | Runtime TTFT | Runtime Full | Runtime Cold Total |
| --- | --- | --- | --- |
| batched2 off | 15.207 ms | 1704.091 ms | 4710.111 ms |
| batched2 on | 15.032 ms | 1691.116 ms | 4697.207 ms |

Stage-profile corroboration (off -> on):

  • decoder_step_profile_ffn_proj_mean: 0.205 -> 0.196 ms/layer
  • decoder_stepN_layers_mean: 19.140 -> 18.447 ms
  • decoder_stepN_total_mean: 20.761 -> 20.044 ms

Interpretation:

  • Batched2 is a real full-depth uplift and is now default-on.
  • Runtime still trails vLLM on first-request full latency in this profile, but the gap narrowed again.
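
The default-on call above follows the promotion pattern used throughout this cycle: a toggle is promoted only when its A/B improvement on request-full clears a noise floor across seeds. A minimal sketch, with a hypothetical 5 ms noise floor (the harness's actual gate threshold is not stated here):

```python
# Hedged sketch of the default-on promotion decision for a toggle A/B.
# The 5 ms noise floor is a hypothetical illustration, not the harness's gate.
NOISE_FLOOR_MS = 5.0

def promote(full_off_ms: float, full_on_ms: float, floor_ms: float = NOISE_FLOOR_MS) -> bool:
    """Promote only if 'on' improves request-full by more than the noise floor."""
    return (full_off_ms - full_on_ms) > floor_ms

# Batched2 3-seed means above: 1702.190 ms off vs 1689.991 ms on.
assert promote(1702.190, 1689.991)        # 12.199 ms uplift clears the floor
# Fast-compute probe 3-seed means (next section): 1663.669 ms off vs 1662.115 ms on.
assert not promote(1663.669, 1662.115)    # 1.554 ms is within noise
```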

Full-Depth FFN Proj Fast-Compute Probe (G5, 2026-02-27, --layers 36, preload64)

Source:

  • external_cold_layers36_preload64_ab3_ffnprojfast_off_s{1,2,3}_20260227T194728Z.json
  • external_cold_layers36_preload64_ab3_ffnprojfast_on_s{1,2,3}_20260227T194728Z.json
  • external_cold_layers36_preload64_ab8_ffnprojfast_off_s{1..8}_20260227T195024Z.json
  • external_cold_layers36_preload64_ab8_ffnprojfast_on_s{1..8}_20260227T195024Z.json
  • week3_parity_report_ffnprojfast_20260227T194853Z.json

Runtime-only means:

| Set | Mode | Request TTFT | Request Full | Infer | Cold Total First Response |
| --- | --- | --- | --- | --- | --- |
| 3-seed | off | 14.690 ms | 1663.669 ms | 1641.932 ms | 4669.765 ms |
| 3-seed | on | 14.680 ms | 1662.115 ms | 1640.919 ms | 4668.126 ms |
| 8-seed | off | 14.683 ms | 1662.812 ms | 1640.942 ms | 4668.851 ms |
| 8-seed | on | 14.685 ms | 1662.601 ms | 1641.667 ms | 4668.678 ms |

Interpretation (historical for this cycle):

  • This specific cycle was too small/noisy to justify default promotion at that time.
  • Later clean-path reruns on 2026-02-28 (pool16g, no fallback) looked positive, but full foundation gate (foundation_ffnprojfast_gate_ab2_20260228T195240Z) rejected global promotion; canonical parser default remains TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=0.

Full-Depth U16 Tensor Cache Unlock (G5, 2026-02-27, claim-safe A/B, --layers 36, preload64)

Source:

  • external_cold_layers36_preload64_u16cache_claimsafe_summary_20260227T200242Z.json
  • external_cold_layers36_preload64_u16cache_claimsafe_summary_20260227T200242Z.md
  • week3_parity_report_u16cache_toggle_default_20260227T200652Z.json
  • week3_parity_report_u16cachefix_default_20260227T195625Z.json

Runtime-only 3-seed A/B:

| Mode | Request TTFT | Request Full | Infer | Cold Total First Response |
| --- | --- | --- | --- | --- |
| TRENI_TENSOR_CACHE_U16=0 | 14.679 ms | 1661.982 ms | 1640.118 ms | 4667.860 ms |
| TRENI_TENSOR_CACHE_U16=1 | 14.682 ms | 1189.452 ms | 1168.883 ms | 4195.511 ms |

Runtime-vLLM same-window A/B (2-seed):

| Mode | Runtime Full | vLLM Full | Runtime-vLLM Full Delta |
| --- | --- | --- | --- |
| TRENI_TENSOR_CACHE_U16=0 | 1663.314 ms | 1325.189 ms | +338.124 ms |
| TRENI_TENSOR_CACHE_U16=1 | 1192.145 ms | 1290.816 ms | -98.671 ms |

Mechanism check:

  • request-path decoder_tensor_upload dropped from ~476 ms to ~5 ms
  • request-path decoder_tensor_h2d dropped from ~468 ms to 0 ms

Interpretation:

  • This is the primary full-depth request-path unlock in the current cycle.
  • With TRENI_TENSOR_CACHE_U16=1, runtime flips from trailing vLLM on request full to leading in the same-window compare.
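
The mechanism check is consistent with a memoized host-to-device upload: once a weight tensor has been uploaded, later requests reuse the cached device handle and skip the H2D copy. A minimal sketch of that general pattern, assuming a hypothetical `upload_to_device` transfer function (the real TRENI_TENSOR_CACHE_U16 implementation is not shown here):

```python
# Hedged sketch of a device-side tensor cache: pay the H2D upload once,
# then serve later requests from the cached device handle.
# upload_to_device() is a hypothetical stand-in for the real transfer.
from typing import Callable, Dict, Hashable

class TensorCache:
    def __init__(self, upload_to_device: Callable[[bytes], object]):
        self._upload = upload_to_device
        self._device_tensors: Dict[Hashable, object] = {}
        self.uploads = 0  # counts actual H2D transfers

    def get(self, key: Hashable, host_bytes: bytes) -> object:
        """Return the device tensor for key, uploading only on first use."""
        if key not in self._device_tensors:
            self._device_tensors[key] = self._upload(host_bytes)
            self.uploads += 1
        return self._device_tensors[key]

# With the cache, 36 layers upload once; a second request does zero uploads.
cache = TensorCache(upload_to_device=lambda b: memoryview(b))  # stand-in "device"
for _request in range(2):
    for layer in range(36):
        cache.get(("ffn_proj", layer), b"\x00" * 16)
assert cache.uploads == 36
```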

Full-Depth FFN Follow-Up (G5, 2026-02-27 Late Night, --layers 36, preload64)

Source:

  • external_cold_layers36_ffn_followup_summary_20260227T223458Z.json
  • external_cold_layers36_ffn_followup_summary_20260227T223458Z.md

Lane 1: TRENI_LINEAR_BATCHED2_USE_LT (new optional backend)

| Mode | Request TTFT | Request Full | Infer | Cold Total First Response |
| --- | --- | --- | --- | --- |
| off | 14.683 ms | 1190.339 ms | 1169.828 ms | 4196.547 ms |
| on | 14.846 ms | 1202.808 ms | 1182.363 ms | 4209.043 ms |

Lane 2: TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=1 + TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1 (runtime-only AB8)

| Mode | Request TTFT | Request Full | Infer | Cold Total First Response |
| --- | --- | --- | --- | --- |
| off | 14.683 ms | 1190.212 ms | 1169.500 ms | 4196.259 ms |
| on | 14.682 ms | 1190.013 ms | 1169.399 ms | 4196.057 ms |

Lane 3: FFN fused path bias-deferral follow-up (TRENI_DECODER_FFN_PROJ_U16_FUSED=0/1)

| Mode | Request TTFT | Request Full | Infer | Cold Total First Response |
| --- | --- | --- | --- | --- |
| off | 14.688 ms | 1190.539 ms | 1169.862 ms | 4196.613 ms |
| on | 14.684 ms | 1190.156 ms | 1169.701 ms | 4196.147 ms |

Interpretation:

  • No new lane is promoted from this cycle.
  • TRENI_LINEAR_BATCHED2_USE_LT=1 regresses materially in runtime-only full-depth A/B.
  • F32-input/fast-compute and fused-bias-deferral follow-ups are both near-noise and not material enough for canonical change.

Fast-Profile Logits Follow-Up (G5, 2026-02-28, --layers 2, preload64)

Source:

  • external_cold_layers2_logitsfast_ab8_summary_20260228T005529Z.json
  • external_cold_layers2_logitsfast_ab8_summary_20260228T005529Z.md

Runtime-only AB8 (TRENI_DECODER_LOGITS_U16_FAST_COMPUTE=0/1):

| Mode | Request TTFT | Request Full | Infer | Cold Total First Response |
| --- | --- | --- | --- | --- |
| off | 2.575 ms | 205.244 ms | 185.880 ms | 1208.636 ms |
| on | 2.573 ms | 204.945 ms | 185.867 ms | 1208.291 ms |

Interpretation:

  • Fast-profile logits fast-compute remains near-noise (full -0.299 ms), so it is not promoted.

Mixed-Load p99 Repeatability (G5, 2026-02-28, Canonical Lane)

Source:

  • mixed_load_repeatability_summary_20260228T005626Z.json
  • mixed_load_repeatability_summary_20260228T005626Z.md

3-run (run_mode=mixed_load, http_runs=120 each):

| Metric | Mean Across Runs |
| --- | --- |
| Request Mean | 122.247 ms |
| Request p95 | 198.518 ms |
| Request p99 | 199.608 ms |

Interpretation:

  • Current canonical lane remains stable under this mixed-load repeatability set; no configuration change.
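
The p95/p99 figures above are order statistics over the 120 HTTP requests in each run. A minimal sketch, assuming the nearest-rank method (the harness's exact percentile definition is not stated here):

```python
# Hedged sketch: nearest-rank percentiles over per-request latencies.
# The harness's exact interpolation rule is an assumption here.
import math

def percentile_nearest_rank(samples_ms, p: float) -> float:
    """Return the p-th percentile (0 < p <= 100) by the nearest-rank rule."""
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Synthetic illustration with 120 samples (not the real run data):
latencies = [100.0 + i for i in range(120)]             # 100..219 ms
assert percentile_nearest_rank(latencies, 95) == 213.0  # rank 114
assert percentile_nearest_rank(latencies, 99) == 218.0  # rank 119
```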

External Cold Comparison (G5, 2026-02-18)

| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
| --- | --- | --- | --- | --- |
| runtime | 1003.537 ms | 1108.979 ms | 1339.459 ms | 2342.996 ms |
| pytorch_transformers | - | 528.483 ms | 2288.516 ms | 8725.259 ms |
| vllm | 24032.203 ms | 51.763 ms | 1036.815 ms | 25069.018 ms |
| ollama (GGUF) | 1002.695 ms | 2168.902 ms | 2527.411 ms | 3530.106 ms |

Cold total first response ratio over runtime:

  • PyTorch: 3.724x
  • vLLM: 10.700x
  • Ollama: 1.507x

Request-path only note:

  • vLLM is fastest on request-path TTFT/full once healthy, but has high startup in this run.
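
That tradeoff implies a warm-path breakeven: vLLM's per-request advantage only amortizes its startup penalty after many requests. A rough estimate from the rows above, assuming serial, unbatched requests:

```python
# Rough breakeven: how many requests until vLLM's faster request path
# pays back its extra startup cost versus runtime (serial, unbatched).
import math

RUNTIME = {"startup": 1003.537, "full": 1339.459}  # ms, from the table above
VLLM = {"startup": 24032.203, "full": 1036.815}

def breakeven_requests(fast_start: dict, fast_path: dict) -> int:
    """Requests needed for the per-request saving to cover the startup gap."""
    startup_gap = fast_path["startup"] - fast_start["startup"]
    per_request_saving = fast_start["full"] - fast_path["full"]
    return math.ceil(startup_gap / per_request_saving)

# ~23028.7 ms startup gap / ~302.6 ms per-request saving -> 77 requests.
assert breakeven_requests(RUNTIME, VLLM) == 77
```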

External Cold Comparison (G5, 2026-02-18, Runtime Preload + Tokenizer Cache, Non-Parity)

| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
| --- | --- | --- | --- | --- |
| runtime | 2004.735 ms | 91.596 ms | 271.346 ms | 2276.081 ms |
| pytorch_transformers | - | 522.795 ms | 2252.382 ms | 8374.324 ms |
| vllm | 27036.682 ms | 51.725 ms | 1035.826 ms | 28072.508 ms |
| ollama (GGUF) | 1002.508 ms | 2182.542 ms | 2538.609 ms | 3541.117 ms |

Runtime advantage in this variant:

  • Request full latency vs vLLM: 3.817x faster.
  • Cold total first response vs vLLM: 12.334x faster.
  • TTFT still trails vLLM (91.596 ms vs 51.725 ms).
  • Caveat: runtime was still using 4 decode steps in this run while vLLM/PyTorch/Ollama used 48.

External Cold Comparison (G5, 2026-02-18, Token Parity = 48)

| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
| --- | --- | --- | --- | --- |
| runtime | 2004.203 ms | 91.207 ms | 2518.142 ms | 4522.345 ms |
| pytorch_transformers | - | 501.946 ms | 2244.327 ms | 11272.831 ms |
| vllm | 27036.248 ms | 51.310 ms | 1075.404 ms | 28111.652 ms |
| ollama (GGUF) | 1002.560 ms | 2197.797 ms | 2556.652 ms | 3559.212 ms |

Parity interpretation:

  • Runtime still wins cold-total first response vs vLLM (6.216x better).
  • vLLM wins request-path TTFT and full latency at equal 48-token budget.

External Cold Comparison (G5, 2026-02-18, Token Parity = 48, Decoder/Sampling Fix)

| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
| --- | --- | --- | --- | --- |
| runtime | 2004.759 ms | 5.022 ms | 311.289 ms | 2316.048 ms |
| pytorch_transformers | - | 515.597 ms | 2291.854 ms | 8237.561 ms |
| vllm | 24036.762 ms | 52.995 ms | 1094.517 ms | 25131.279 ms |
| ollama (GGUF) | 1002.630 ms | 2184.383 ms | 2543.219 ms | 3545.849 ms |

Post-fix interpretation:

  • Runtime now wins TTFT and full request latency vs vLLM at token parity.
  • Runtime keeps the large cold-total lead vs vLLM.
  • Initial 3-run repeatability (2026-02-18) means: runtime 5.022 ms TTFT, 311.444 ms full, 2316.002 ms cold-total vs vLLM 51.894/1052.767/24752.842 ms.
  • Superseded by 2026-02-19 rerun and all-backend repeatability below.

External Cold Comparison (G5, 2026-02-19, GPU-Convert Fix2, All Backends, 3-run means)

| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
| --- | --- | --- | --- | --- |
| runtime | 2339.131 ms | 5.131 ms | 318.315 ms | 2657.447 ms |
| pytorch_transformers | - | 591.635 ms | 2389.837 ms | 10420.381 ms |
| vllm | 27704.185 ms | 82.560 ms | 1226.201 ms | 28930.385 ms |
| ollama (GGUF) | 1002.622 ms | 10819.259 ms | 11178.525 ms | 12181.148 ms |

Interpretation:

  • Runtime keeps the request-path lead across all backends.
  • Runtime keeps the cold-total lead vs all backends in this set.
  • A runtime preload upload outlier affected one run; stable runs 1-2 sit at ~1004 ms startup and ~1321 ms cold-total.
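
The outlier note is why median (or trimmed) statistics are more robust than a plain 3-run mean when one run hits an upload stall. An illustration with reconstructed run values: the ~1004 ms stable runs come from the note above, while the third value is back-solved from the reported 2339.131 ms mean, so it is an assumption:

```python
# Illustration: mean vs median under one startup outlier.
# Runs 1-2 (~1004 ms) come from the note above; the third value is
# back-solved from the reported 2339.131 ms mean, so it is reconstructed.
from statistics import mean, median

startup_runs_ms = [1004.0, 1004.0, 5009.393]

assert round(mean(startup_runs_ms), 3) == 2339.131
assert median(startup_runs_ms) == 1004.0  # robust to the single outlier
```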

External Cold Comparison (G5, 2026-02-19, GPU-Convert + Host-Prefetch Fix, All Backends, 3-run means)

| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
| --- | --- | --- | --- | --- |
| runtime | 1003.836 ms | 5.130 ms | 316.403 ms | 1320.240 ms |
| pytorch_transformers | - | 556.949 ms | 2322.565 ms | 19276.667 ms |
| vllm | 27704.770 ms | 84.837 ms | 1232.660 ms | 28937.430 ms |
| ollama (GGUF) | 1002.567 ms | 2638.945 ms | 2996.765 ms | 3999.332 ms |

Interpretation:

  • Runtime keeps the request-path lead across all backends.
  • Runtime keeps the cold-total lead vs all backends in this set.
  • Host-prefetch rerun removed the startup/upload outlier observed in the prior repeatability set.

Historical Legacy Mixed-Mode Context

| Set | Runtime HTTP Request Mean | Runtime HTTP Request p99 |
| --- | --- | --- |
| T4 (2026-02-15) | 146279.609 ms | 156769.1 ms |
| G5 (2026-02-15) | 77449.605 ms | 83346.187 ms |
| G5 registry-cached single run (2026-02-16) | 82.913 ms | 91.877 ms |

Parity Health

| Set | Checked | Failed | Strict |
| --- | --- | --- | --- |
| T4 | 3 | 0 | true |
| G5 | 3 | 0 | true |
