Leaderboard
Canonical benchmark tables across G5 and Lambda A100/H100, plus historical run context.
Lower time is better.
Canonical Cross-Hardware Snapshot (Paper Package, 2026-02-20)
Sources:
Phase 2 Cold/Hot
| Hardware | Cold Startup | Cold TTFT | Cold Full | Warm Mean | Warm p99 |
|---|---|---|---|---|---|
| g5 | 1002.273 ms | 8.460 ms | 150.035 ms | 80.602 ms | 90.350 ms |
| lambda_a100 | 1002.708 ms | 29.657 ms | 32.008 ms | 10.356 ms | 14.536 ms |
| lambda_h100 | 1004.890 ms | 56.944 ms | 62.064 ms | 18.491 ms | 24.944 ms |
Routing Matrix
| System | Overall Ext/Int | Baseline p00 Ext/Int | Stress p04 Ext/Int | Overall Int Error | Overall Ext Error |
|---|---|---|---|---|---|
| g5 | 1.208x | 1.042x | 1.436x | 0.0000 | 0.0347 |
| lambda_a100 | 2.430x | 1.396x | 3.785x | 0.0000 | 0.0347 |
| lambda_h100 | 2.397x | 1.533x | 3.545x | 0.0000 | 0.0347 |
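The ratio and error columns above can be reproduced from raw run records. A minimal sketch, assuming per-request records with `latency_ms` and `ok` fields (illustrative names, not the actual harness schema):

```python
# Sketch: deriving an Ext/Int latency ratio and error rates from
# per-request run records. Field names are illustrative only.

def summarize(records):
    """records: list of dicts with 'latency_ms' (float) and 'ok' (bool)."""
    latencies = [r["latency_ms"] for r in records]
    mean_latency = sum(latencies) / len(latencies)
    error_rate = sum(1 for r in records if not r["ok"]) / len(records)
    return mean_latency, error_rate

internal = [{"latency_ms": 80.0, "ok": True}, {"latency_ms": 82.0, "ok": True}]
external = [{"latency_ms": 96.0, "ok": True}, {"latency_ms": 100.0, "ok": False}]

int_mean, int_err = summarize(internal)
ext_mean, ext_err = summarize(external)
ext_int_ratio = ext_mean / int_mean  # reported in the table as e.g. "1.21x"
```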
Phase 3 Loops (Baseline/Stress)
| System | Baseline Internal Success | Baseline External Success | Baseline Ext/Int Latency | Stress Internal Success | Stress External Success | Stress Ext/Int Latency |
|---|---|---|---|---|---|---|
| g5 | 1.0000 | 0.9006 | 16.060x | 1.0000 | 0.8782 | 77.170x |
| lambda_a100 | 1.0000 | 0.9006 | 16.879x | 1.0000 | 0.8782 | 77.861x |
| lambda_h100 | 1.0000 | 0.9006 | 18.693x | 1.0000 | 0.8782 | 72.741x |
C2 Runtime-Native Uncertainty Deltas (Uncertainty On-Off Success Delta)
| System | Baseline Internal Delta | Baseline External Delta | Stress Internal Delta | Stress External Delta |
|---|---|---|---|---|
| g5 | +0.1539 | +0.1058 | +0.1539 | +0.1154 |
| lambda_a100 | +0.2308 | +0.2308 | +0.2308 | +0.2212 |
| lambda_h100 | +0.2308 | +0.2308 | +0.2308 | +0.2212 |
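Each delta above is the success-rate difference between the uncertainty-on and uncertainty-off arms of the same lane. A minimal sketch with illustrative outcome vectors (not data from any listed run):

```python
# Sketch: the on-off delta is the success-rate difference between the
# uncertainty-on and uncertainty-off arms of the same lane.
# Outcome vectors below are illustrative, not from any listed run.

def success_rate(outcomes):
    # outcomes: 1 = agentic loop converged successfully, 0 = failed
    return sum(outcomes) / len(outcomes)

on_arm = [1, 1, 1, 1, 0, 1, 1, 1]
off_arm = [1, 0, 1, 1, 0, 1, 0, 1]

delta = success_rate(on_arm) - success_rate(off_arm)  # +0.25 for these vectors
```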
Phase 5 Real-Benchmark (Diagnostic, G5, 2026-03-01)
Source artifacts:
- `phase5_awareness_realbench_qwen-realbench-r5-tokenizerfix2_20260301T114510Z.json`
- `phase5_awareness_realbench_qwen-realbench-r6-qwentpl1_20260301T120235Z.json`
- `phase5_hf_reference_qwen_r5_20260301T1900Z.json`
Arm A (Control) Score Means
| Run | GPQA | IFEval | GSM8K | AIME25 |
|---|---|---|---|---|
| r5 tokenizerfix2 (canonical diagnostic) | 0.5000 | 0.5625 | 0.0000 | 0.0000 |
| r6 qwentpl1 (template A/B, non-canonical) | 0.1250 | 0.3125 | 0.0000 | 0.0000 |
Runtime (r5 Arm A) vs HF Reference (Same Sampled Set)
| System | GPQA | IFEval | GSM8K | AIME25 |
|---|---|---|---|---|
| Runtime r5 Arm A | 0.5000 | 0.5625 | 0.0000 | 0.0000 |
| HF reference control | 0.2500 | 0.6250 | 0.0000 | 0.0000 |
Interpretation:
- `r5` is the current Phase 5 diagnostic reference.
- `r6` was an explicit template-path experiment and regressed quality/latency; it is not used as canonical.
- HF parity result for this sampled set: runtime and HF are tied on GSM8K/AIME (`0.0`), so current math-task failures are not runtime-only breakage.
Qwen3.5 One-Host Strict Matrix (AWS G5, 2026-03-07)
Source artifacts:
- `phase5_qwen35_remote_strict_matrix_20260307T191653Z.json`
- `phase5_qwen35_remote_strict_matrix_20260307T191653Z.md`
Overall / Per-Task (Arm A, gpqa_diamond+ifeval, seeds 7/17/27, 8/task)
| System | Overall Score | Overall Latency | GPQA | GPQA Latency | IFEval | IFEval Latency |
|---|---|---|---|---|---|---|
| Runtime | 0.3333 | 3809.745 ms | 0.2917 | 2867.493 ms | 0.3750 | 4751.996 ms |
| vLLM | 0.3160 | 1626.068 ms | 0.2917 | 418.173 ms | 0.3403 | 2833.964 ms |
Interpretation:
- This is the cleanest current one-host strict A/B on Qwen3.5-0.8B.
- Runtime is no longer behind on aggregate score on this set.
- Runtime is still far slower overall, so this is not a claim of universal superiority.
- The result remains task-stratified: `gpqa_diamond` score is now at parity but runtime latency is still far worse, and `ifeval` score is higher for runtime but latency is still worse there too.
Qwen3.5 Nightly vLLM Diagnostic (G5, 2026-03-02)
Source artifacts:
- `phase5_awareness_realbench_qwen35-realbench-r1-s8-nonthinking_20260302T184159Z.json`
- `phase5_awareness_realbench_qwen35-realbench-r2-policyfix1-s8-nonthinking_20260302T184624Z.json`
- `phase5_awareness_realbench_qwen35-realbench-r3-sharedfirst-s8-nonthinking_20260302T184947Z.json`
| Run | GPQA A/B/C | IFEval A/B/C | GSM8K A/B/C | AIME25 A/B/C |
|---|---|---|---|---|
| r1 (first nightly run) | 0.500 / 0.250 / 0.625 | 0.5625 / 0.4375 / 0.5000 | 0.000 / 0.000 / 0.000 | 0.000 / 0.000 / 0.000 |
| r2 (conservative policy) | 0.375 / 0.250 / 0.250 | 0.3125 / 0.5000 / 0.3750 | 0.000 / 0.000 / 0.000 | 0.000 / 0.125 / 0.000 |
| r3 (shared-first fairness fix) | 0.375 / 0.375 / 0.375 | 0.3125 / 0.3125 / 0.3125 | 0.000 / 0.000 / 0.000 | 0.000 / 0.000 / 0.000 |
Interpretation:
- `r3` removes arm-to-arm sampling noise (all arms share the same first completion).
- Post-fix state is no-regression parity (`B-A=0`, `C-A=0`) rather than uplift.
Qwen3.5 Strict Runtime vs vLLM Matrix (G5, 2026-03-03)
Source artifacts:
- `phase5_qwen35_runtime_vs_vllm_matrix_20260303T104038Z.json`
- `phase5_qwen35_runtime_vs_vllm_matrix_20260302T221546Z.json`
- `phase5_qwen35_runtime_vs_vllm_matrix_20260302T222013Z.json`
| Run | Matrix mode | Runtime score | vLLM score | Delta (runtime-vLLM) | Runtime latency | vLLM latency | Delta (runtime-vLLM) |
|---|---|---|---|---|---|---|---|
| 20260302T222013Z | strict (all arms validated) | 0.0503 | 0.2170 | -0.1667 | 1881.188 ms | 178.093 ms | +1703.095 ms |
| 20260303T104038Z | strict Arm A-only (`--phase5-arms arm_a_control`) | 0.1563 | 0.1910 | -0.0347 | 1723.685 ms | 958.757 ms | +764.928 ms |
Latest run (20260303T104038Z) per-task score deltas (runtime-vLLM):
- `gpqa_diamond`: -0.0833
- `ifeval`: -0.0972
- `gsm8k`: +0.0417
- `aime25`: 0.0000
Interpretation:
- Strict matrix is no longer blocked and is now fully reproducible.
- Decoder-path fixes narrowed the quality gap materially, but runtime is still slower overall and still slightly behind on aggregate score.
- Top target remains request-path latency recovery on Qwen3.5 while preserving this improved score parity.
Qwen3.5 Parse-Fix AB3 (GPQA+IFEval, G5, 2026-03-04)
Source artifact:
| Scope | Runtime score | vLLM score | Delta (runtime-vLLM) | Runtime latency | vLLM latency | Delta (runtime-vLLM) |
|---|---|---|---|---|---|---|
| Overall (gpqa_diamond+ifeval) | 0.3403 | 0.3229 | +0.0174 | 1772.931 ms | 1553.034 ms | +219.897 ms |
| gpqa_diamond | 0.2708 | 0.2708 | 0.0000 | 2319.086 ms | 437.310 ms | +1881.776 ms |
| ifeval | 0.4097 | 0.3750 | +0.0347 | 1226.775 ms | 2668.758 ms | -1441.983 ms |
Interpretation:
- This run is task-family stratified, not a universal win.
- Runtime currently wins on `ifeval` (quality + latency) but remains far slower on `gpqa_diamond`.
Qwen3.5 Hybrid Full-Batch AB3 (GPQA+IFEval, AWS G5, 2026-03-08)
Source artifact:
| Scope | Runtime score | vLLM score | Delta (runtime-vLLM) | Runtime latency | vLLM latency | Delta (runtime-vLLM) |
|---|---|---|---|---|---|---|
| Overall (gpqa_diamond+ifeval) | 0.4132 | 0.3472 | +0.0660 | 2940.172 ms | 1686.263 ms | +1253.909 ms |
| gpqa_diamond | 0.4583 | 0.2083 | +0.2500 | 1347.582 ms | 512.075 ms | +835.507 ms |
| ifeval | 0.3681 | 0.4861 | -0.1181 | 4532.763 ms | 2860.452 ms | +1672.311 ms |
Interpretation:
- This is the first late AWS Qwen3.5 strict AB3 where runtime clearly leads overall on score.
- The runtime is still slower on latency across both tasks.
- GPQA is now a strong runtime-quality win; the next blocker is warm decode/request-path latency and the remaining IFEval score deficit.
Qwen3.5 Deterministic Strict Matrix (GPQA+IFEval, AWS G5, 2026-03-08)
Source artifact:
| Scope | Runtime score | vLLM score | Delta (runtime-vLLM) | Runtime latency | vLLM latency | Delta (runtime-vLLM) |
|---|---|---|---|---|---|---|
| Overall (gpqa_diamond+ifeval) | 0.2951 | 0.2674 | +0.0278 | 824.714 ms | 1572.529 ms | -747.815 ms |
| gpqa_diamond | 0.1667 | 0.1667 | 0.0000 | 671.640 ms | 436.583 ms | +235.057 ms |
| ifeval | 0.4236 | 0.3681 | +0.0556 | 977.787 ms | 2708.475 ms | -1730.688 ms |
Interpretation:
- This is the current claim-safe Qwen3.5 strict lane.
- Runtime wins overall on both score and latency.
- `gpqa_diamond` is now exact parity on score, while `ifeval` is a runtime win on both score and latency.
- Sampled-lane reproducibility is now fixed separately; this deterministic lane remains the cleanest low-variance claim-safe slice.
Qwen3.5 Sampled Strict Matrix (GPQA+IFEval, AWS G5, 2026-03-08)
Source artifact:
| Scope | Runtime score | vLLM score | Delta (runtime-vLLM) | Runtime latency | vLLM latency | Delta (runtime-vLLM) |
|---|---|---|---|---|---|---|
| Overall (gpqa_diamond+ifeval) | 0.4097 | 0.3021 | +0.1076 | 1617.187 ms | 2017.206 ms | -400.019 ms |
| gpqa_diamond | 0.3750 | 0.2500 | +0.1250 | 710.693 ms | 435.823 ms | +274.870 ms |
| ifeval | 0.4444 | 0.3542 | +0.0903 | 2523.680 ms | 3598.588 ms | -1074.908 ms |
Interpretation:
- This is the post-fix sampled strict AB3 lane after the harness seed bug was removed.
- Runtime now wins overall on both score and latency here as well.
- `gpqa_diamond` is still runtime-slower, but runtime is ahead on score in both tasks.
Qwen3.5 Sampled Strict Matrix (Larger-N, GPQA+IFEval, AWS G5, 2026-03-08)
Source artifact:
- `phase5_qwen35_remote_strict_matrix_20260308T222640Z.json`
- `phase5_qwen35_remote_strict_matrix_20260308T235013Z.json`
| Scope | Runtime score | vLLM score | Delta (runtime-vLLM) | Runtime latency | vLLM latency | Delta (runtime-vLLM) |
|---|---|---|---|---|---|---|
| Overall (gpqa_diamond+ifeval, 16/task) | 0.3715 | 0.2969 | +0.0747 | 1255.344 ms | 1585.043 ms | -329.699 ms |
| gpqa_diamond | 0.3750 | 0.3125 | +0.0625 | 801.900 ms | 433.256 ms | +368.644 ms |
| ifeval | 0.3681 | 0.2813 | +0.0868 | 1708.789 ms | 2736.831 ms | -1028.043 ms |
Interpretation:
- This is the stronger non-thinking sampled strict result.
- Runtime still wins overall on both score and latency at larger sample count.
- The weakest remaining slice is still GPQA latency, not overall sampled quality.
Qwen3.5 Thinking Strict Matrix (Finalized Closed-Form Lane, AWS G5, 2026-03-09)
Source artifact:
| Scope | Runtime score | vLLM score | Delta (runtime-vLLM) | Runtime latency | vLLM latency | Delta (runtime-vLLM) |
|---|---|---|---|---|---|---|
| Overall (gpqa_diamond+ifeval) | 0.2500 | 0.1944 | +0.0556 | 6823.816 ms | 7503.000 ms | -679.184 ms |
| gpqa_diamond | 0.1667 | 0.1667 | 0.0000 | 7727.880 ms | 7741.028 ms | -13.148 ms |
| ifeval | 0.3333 | 0.2222 | +0.1111 | 5919.753 ms | 7264.973 ms | -1345.220 ms |
Interpretation:
- This is the current best thinking tradeoff.
- The closed-form finalize path stays active, but a smaller GPQA reasoning budget removes the old latency collapse.
- Runtime now wins this finalized thinking lane overall on both score and latency.
Qwen3.5 Fast-Sampler Tie-Stable AB3 (GPQA+IFEval, AWS G5, 2026-03-08)
Source artifact:
| Scope | Runtime score | vLLM score | Delta (runtime-vLLM) | Runtime latency | vLLM latency | Delta (runtime-vLLM) |
|---|---|---|---|---|---|---|
| Overall (gpqa_diamond+ifeval) | 0.3160 | 0.3472 | -0.0313 | 1422.818 ms | 1659.878 ms | -237.060 ms |
| gpqa_diamond | 0.2917 | 0.2083 | +0.0833 | 886.296 ms | 515.171 ms | +371.125 ms |
| ifeval | 0.3403 | 0.4861 | -0.1458 | 1959.340 ms | 2804.584 ms | -845.244 ms |
Interpretation:
- This is the first clean late AWS Qwen3.5 strict AB3 where runtime is faster overall than vLLM.
- The remaining blocker is score recovery, not another large latency rescue.
- GPQA stays runtime-positive on score; the main open quality deficit is `ifeval` plus residual GPQA fidelity loss versus the older, slower sampler lane.
Qwen3.5 Endpoint Probe Matrix (Warm, Extended, G5, 2026-03-06)
Source artifact:
| Backend | Mode | all_ok | Plain Chat | Tool First Turn | Tool Follow-up | Notes |
|---|---|---|---|---|---|---|
| runtime | non-thinking | true | 387.672 ms | 5885.725 ms | 4406.168 ms | strongest current functional lane |
| runtime | thinking | true | 11113.612 ms | 13404.890 ms | 9349.855 ms | verbose reasoning-heavy outputs |
| vLLM | non-thinking | false | 112.543 ms | 1162.850 ms | 490.202 ms | multimodal blocked by `--language-model-only`; exact-output thinking case hit length |
| vLLM | thinking | false | 3564.340 ms | 3044.431 ms | 608.328 ms | several exact-output prompts end at `finish_reason=length` |
Interpretation:
- Runtime is now functionally solid on Qwen3.5 in `non-thinking` mode.
- The main remaining runtime gap is long-prompt/tool latency, not base chat correctness.
- Current vLLM launch is not a full multimodal parity configuration; failures here are partly launch-mode/config artifacts, not only model behavior.
G5 External Cold Backends (Canonical 3-Run Means)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response | TTFT/Runtime | Full/Runtime | Cold Total/Runtime |
|---|---|---|---|---|---|---|---|
| runtime | 1003.836 ms | 5.130 ms | 316.403 ms | 1320.240 ms | - | - | - |
| pytorch_transformers | - | 556.949 ms | 2322.565 ms | 19276.667 ms | 108.567x | 7.341x | 14.601x |
| vllm | 27704.770 ms | 84.837 ms | 1232.660 ms | 28937.430 ms | 16.537x | 3.896x | 21.918x |
| ollama (GGUF) | 1002.567 ms | 2638.945 ms | 2996.765 ms | 3999.332 ms | 514.414x | 9.471x | 3.029x |
G5 Foundation (Canonical)
| Metric | Value |
|---|---|
| Baseline pipeline mean | 2407.974 ms |
| Runtime warm request mean (3-run) | 82.707 ms |
| Runtime warm request p99 (3-run) | 91.738 ms |
| Baseline/runtime ratio (pipeline) | 29.11x |
G5 Cold First-Hit — True TTFT (3-run Means, 2026-02-17)
| Model | Early TTFT | clean4 TTFT | Speedup |
|---|---|---|---|
| qwen | 27574.564 ms | 1100.044 ms | 25.067x |
| donut | 67360.388 ms | 150.322 ms | 448.107x |
| bart | 77520.798 ms | 125.011 ms | 620.112x |
| minilm | 23.342 ms | 22.621 ms | 1.032x |
All values above are runtime-instrumented `timing.ttft_ms` values.
Validation rerun (clean7, 2026-02-18) matches these values within run noise.
Qwen Cold Upload Ablation (G5, 2026-02-19, Same Harness, Env Toggle Only)
| Metric | GPU Convert Off | GPU Convert On | Speedup (Off/On) |
|---|---|---|---|
| full_latency_ms | 1116.567 ms | 238.740 ms | 4.677x |
| decoder_tensor_upload_ms | 1007 ms | 129 ms | 7.806x |
| decoder_tensor_convert_ms | 862 ms | 6 ms | 143.667x |
| decoder_tensor_h2d_ms | 143 ms | 121 ms | 1.182x |
| startup + full response | 2119.906 ms | 1242.057 ms | 1.707x |
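The speedup column is simply the off-path value divided by the on-path value, so values above `1.0` mean GPU convert helped. A small sketch reproducing two rows of the table:

```python
# Sketch: "Speedup (Off/On)" is the GPU-convert-off value divided by the
# GPU-convert-on value; > 1.0 means enabling GPU convert helped.
# Numbers mirror two rows of the table above.

rows = {
    "full_latency_ms": (1116.567, 238.740),
    "decoder_tensor_convert_ms": (862.0, 6.0),
}

speedups = {name: off / on for name, (off, on) in rows.items()}
```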
Interpretation:
- CPU-side tensor conversion was the dominant cold bottleneck for Qwen.
- Moving BF16/F16 conversion to GPU materially reduced first-hit latency.
External Cold Runtime-Only Ablation (G5, 2026-02-19, Preload On, max_tokens=48)
| Metric | GPU Convert Off | GPU Convert On | Speedup (Off/On) |
|---|---|---|---|
| startup_to_healthy_ms | 2004.560 ms | 1003.455 ms | 1.997x |
| request_ttft_ms | 5.137 ms | 5.127 ms | 1.002x |
| request_full_ms | 317.989 ms | 317.276 ms | 1.002x |
| cold_total_first_token_ms | 2009.697 ms | 1008.582 ms | 1.993x |
| cold_total_first_response_ms | 2322.549 ms | 1320.731 ms | 1.759x |
AWS G5 TTFT Kernel Pass (2026-02-22, Same Host/Container)
Cold (run_mode=cold_first_hit):
| Config | Startup | TTFT | Full |
|---|---|---|---|
| lt0_sync0 baseline | 1002.203 ms | 16.738 ms | 424.685 ms |
| lt0_sync0 softmax pass | 1002.435 ms | 16.715 ms | 425.390 ms |
| lt0_sync0 norm+softmax | 1002.304 ms | 14.813 ms | 399.781 ms |
| lt1_sync0 norm+softmax | 1002.284 ms | 13.974 ms | 396.814 ms |
| lt1_sync0 seq1 tiny-kernel default | 1002.256 ms | 12.504 ms | 390.099 ms |
Warm (run_mode=warm_steady_state, 16 measured, 2 warmup):
| Config | Mean | p95 | p99 |
|---|---|---|---|
| lt0_sync0 baseline | 174.237 ms | 600.575 ms | 1035.823 ms |
| lt0_sync0 softmax pass | 174.504 ms | 601.855 ms | 1035.638 ms |
| lt0_sync0 norm+softmax | 147.495 ms | 501.527 ms | 936.281 ms |
| lt1_sync0 norm+softmax | 147.269 ms | 501.587 ms | 936.297 ms |
| lt1_sync0 seq1 tiny-kernel default | 143.230 ms | 490.296 ms | 924.276 ms |
Interpretation:
- softmax-only was near-parity.
- norm rewrite was the material gain.
- seq2seq/Bart follow-up (`seq_q=1` tiny-kernel path + direct K/V cache write) produced another measurable lift.
- best measured profile (`lt1_sync0` seq1 tiny-kernel default) vs baseline:
  - cold TTFT: `1.339x` faster (16.738 -> 12.504 ms)
  - cold full: `1.089x` faster (424.685 -> 390.099 ms)
  - warm mean: `1.216x` faster (174.237 -> 143.230 ms)
  - warm p99: `1.121x` faster (1035.823 -> 924.276 ms)
- follow-up profile vs previous `lt1_sync0 norm+softmax` best:
  - cold TTFT: `1.118x` faster
  - warm mean: `1.028x` faster
- Bart cold TTFT moved from `16.573 -> 12.842 ms` (`1.29x` faster).
- 3-seed repeatability (new default path):
  - cold TTFT `12.563 ± 0.037 ms`
  - cold full `390.961 ± 0.270 ms`
  - warm mean `143.297 ± 0.222 ms`
  - warm p99 `925.668 ± 1.070 ms`
- seq1 fused follow-up matrix (`seq1_hybrid_fused_20260222T192656Z`) further improved the request path:
  - warm default mean `54.505 -> 52.535 ms` (`1.037x`)
  - warm default p99 `82.134 -> 80.554 ms` (`1.020x`)
  - cold default TTFT `6.447 -> 6.209 ms` (`1.038x`)
  - cold default full `147.756 -> 145.587 ms` (`1.015x`)
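The `mean ± stdev` repeatability figures above can be sketched as follows. Whether the harness reports sample or population stdev is not stated here, so sample stdev is an assumption, and the seed values are illustrative:

```python
# Sketch: per-seed results reported as "mean ± stdev". Whether the
# harness uses sample or population stdev is not specified, so sample
# stdev is an assumption; the seed values below are illustrative.
from statistics import mean, stdev

seed_cold_ttft_ms = [12.52, 12.57, 12.60]

m = mean(seed_cold_ttft_ms)
s = stdev(seed_cold_ttft_ms)
summary = f"{m:.3f} ± {s:.3f} ms"
```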
AWS G5 True Fused cuDNN Frontend A/B (2026-02-22, fixed qwen, warmed)
Source:
| Metric | custom | cudnn_sdpa_frontend | custom/frontend |
|---|---|---|---|
| warm request mean | 19.324 ms | 21.503 ms | 0.899 |
| warm request p99 | 22.087 ms | 24.875 ms | 0.888 |
| warm infer | 18.803 ms | 20.976 ms | 0.896 |
| warm TTFT | 4.199 ms | 4.498 ms | 0.934 |
| cold TTFT | 4.220 ms | 710.641 ms | 0.006 |
| cold full | 250.929 ms | 6610.148 ms | 0.038 |
Interpretation:
- Warm steady-state is close but still favors custom.
- Cold first-hit is still dominated by fused plan-build misses.
AWS G5 Frontend Repeatability Matrix (2026-02-22, repeats=3)
Source:
| Profile | Metric | custom (mean +/- stdev) | fused_frontend (mean +/- stdev) | ratio custom/frontend | wins custom/frontend/tie |
|---|---|---|---|---|---|
| warm_fixed | warm_request_mean_ms | 19.271 +/- 0.050 ms | 21.468 +/- 0.018 ms | 0.898 +/- 0.003 | 3/0/0 |
| warm_fixed | warm_infer_ms | 18.812 +/- 0.059 ms | 20.984 +/- 0.026 ms | 0.896 +/- 0.004 | 3/0/0 |
| warm_fixed | warm_ttft_ms | 4.198 +/- 0.001 ms | 4.498 +/- 0.001 ms | 0.933 +/- 0.000 | 3/0/0 |
| mixed_churn | warm_request_mean_ms | 47.864 +/- 0.018 ms | 843.141 +/- 0.735 ms | 0.057 +/- 0.000 | 3/0/0 |
| mixed_churn | warm_infer_ms | 47.331 +/- 0.050 ms | 842.542 +/- 0.747 ms | 0.056 +/- 0.000 | 3/0/0 |
| mixed_churn | warm_ttft_ms | 4.197 +/- 0.002 ms | 179.744 +/- 0.263 ms | 0.023 +/- 0.000 | 3/0/0 |
Interpretation:
- Current custom path wins all tracked metrics across both profiles in repeated runs.
AWS G5 Frontend Claim-Strength (2026-02-22)
Source:
Delta is frontend - custom (positive means custom is faster):
| Profile | Metric | Delta Mean | Delta CI95 | Ratio Mean (custom/frontend) |
|---|---|---|---|---|
| warm_fixed | warm_request_mean_ms | +2.197 ms | [2.125, 2.238] | 0.898 |
| warm_fixed | warm_ttft_ms | +0.300 ms | [0.299, 0.301] | 0.933 |
| mixed_churn | warm_request_mean_ms | +795.277 ms | [794.408, 795.747] | 0.057 |
| mixed_churn | warm_ttft_ms | +175.546 ms | [175.300, 175.820] | 0.023 |
Interpretation:
- Current evidence is strongly directional for `custom > fused_frontend` on this hardware/workload.
- Repeat count is still low (`n=3`/profile), so high-N reruns remain desirable for paper-grade confidence.
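One common way to produce a CI95 for paired deltas like these is a percentile bootstrap over per-run means. The harness's actual CI construction is not documented in this table, so both the method and the delta values below are assumptions:

```python
# Sketch: percentile-bootstrap CI95 for paired latency deltas
# (frontend - custom). The harness's real CI method is not documented
# here; the method and the delta values are assumptions.
import random

def bootstrap_ci95(deltas, iters=10_000, seed=0):
    rng = random.Random(seed)
    boot_means = []
    for _ in range(iters):
        sample = [rng.choice(deltas) for _ in deltas]
        boot_means.append(sum(sample) / len(sample))
    boot_means.sort()
    return boot_means[int(0.025 * iters)], boot_means[int(0.975 * iters)]

per_run_deltas = [2.19, 2.21, 2.18]  # illustrative per-repeat deltas, ms
lo, hi = bootstrap_ci95(per_run_deltas)
```

With only three repeats per profile, the interval is wide; this is exactly why the text calls for high-N reruns.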
AWS G5 Frontend Miss-Mitigation (Updated Canonical, 2026-02-22)
Sources:
- `attn_backend_frontend_matrix_20260222T230445Z.md`
- `attn_backend_frontend_matrix_20260222T231139Z.md`
- `attn_backend_frontend_missmit_compare_20260222T231335Z.md`
- `preload_exact_prompt_probe_20260222T231050Z.json`
Comparison (no_preload -> startup_preload_benchmark_queries):
| Profile | Metric | Fused Frontend Before | Fused Frontend After | Speedup |
|---|---|---|---|---|
| mixed_churn | warm request mean | 843.242 ms | 22.433 ms | 37.590x |
| mixed_churn | warm infer mean | 842.684 ms | 21.965 ms | 38.365x |
| mixed_churn | warm TTFT | 179.541 ms | 4.497 ms | 39.928x |
| mixed_churn | cold TTFT | 704.521 ms | 4.495 ms | 156.723x |
| mixed_churn | cold full latency | 6593.495 ms | 25.785 ms | 255.707x |
Exact cold-prompt probe:
- fused first-hit TTFT: `4.499 ms`
- fused first-hit full latency: `26.090 ms`
Interpretation:
- With benchmark-query preload coverage, the prior fused cold/mixed miss spikes are effectively removed in this harness.
- Custom still leads warmed request-path latency slightly (about `0.90` custom/frontend ratio), but cold TTFT is now near parity on the covered prompt set.
AWS G5 Frontend Miss-Mitigation (Shape Prebuild, No Preload Prompts, 2026-02-22)
Sources:
- `attn_backend_frontend_matrix_20260222T230445Z.md`
- `attn_backend_frontend_matrix_20260222T233003Z.md`
- `attn_backend_frontend_missmit_compare_20260222T233116Z.md`
- `prebuild_startup_nopreload_probe_20260222T232932Z.json`
- `prebuild_startup10_nopreload_probe_20260222T235944Z.json`
- `prebuild_startup8_nopreload_probe_20260223T000600Z.json`
- `attn_backend_frontend_matrix_20260223T000256Z.md`
- `attn_backend_frontend_missmit_compare_20260223T000343Z.md`
Comparison (no_preload -> shape_prebuild_nopreload):
| Profile | Metric | Fused Frontend Before | Fused Frontend After | Speedup |
|---|---|---|---|---|
| mixed_churn | cold TTFT | 704.521 ms | 5.805 ms | 121.364x |
| mixed_churn | cold full latency | 6593.495 ms | 255.267 ms | 25.830x |
| mixed_churn | warm request mean | 843.242 ms | 51.482 ms | 16.379x |
| mixed_churn | warm TTFT | 179.541 ms | 4.824 ms | 37.218x |
Cold probe (fused frontend, no preload prompts, startup prebuild on):
- startup->healthy: `11017.541 ms`
- request TTFT: `5.814 ms`
- request full latency: `255.434 ms`

Tuned startup probe (`seq_kv_max: 16 -> 10`):

- startup->healthy: `7011.472 ms` (`1.571x` faster startup)
- request TTFT: `5.826 ms` (near-identical)
- request full latency: `254.936 ms` (near-identical)

Lower-range probe (`seq_kv_max=8`):

- startup->healthy: `6010.381 ms`
- request TTFT: `703.771 ms` (regression)
- request full latency: `1660.576 ms` (regression)

Tuned matrix confirmation (`seq_kv_max=16 -> 10`):

- warm-fixed fused request mean: `22.556 -> 22.265 ms`
- mixed fused request mean: `51.482 -> 50.974 ms`
- cold fused TTFT: `5.827 -> 5.819 ms`

Interpretation:

- Shape-level prebuild removes no-preload fused request-path spikes without prompt-list curation.
- Tradeoff: startup latency is still elevated due to up-front plan builds, but the tuned range materially reduces that startup cost.
- `seq_kv_max=10` is currently the minimum safe tuned range for this benchmark prompt profile.
AWS G5 Frontend Miss-Mitigation (Hybrid Shape Gate, No Preload Prompts, 2026-02-23)
Sources:
- `prebuild_hybrid10_nopreload_probe_r1_20260223T002214Z.json`
- `prebuild_hybrid10_nopreload_probe_r2_20260223T002214Z.json`
- `prebuild_hybrid10_nopreload_probe_r3_20260223T002214Z.json`
- `attn_backend_frontend_matrix_20260223T001959Z.md`
- `attn_backend_frontend_missmit_compare_20260223T002153Z.md`
- `hybrid_shape_sanity_20260223T002857Z.json`
- `hybrid_shape_sanity_maxgate_20260223T003453Z.json`
- `attn_backend_frontend_matrix_20260223T003611Z.md`
- `attn_backend_frontend_missmit_compare_20260223T003734Z.md`
Policy:
- `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MIN_KV=10`
- `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MAX_KV=10`
- `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_HEAD_DIM=128`
- `TRENI_ATTN_CUDNN_FRONTEND_MIN_SEQ_KV=10`
Startup probe (3 runs, fused frontend, no preload prompts):
- startup->healthy: `2004.840 +/- 0.146 ms`
- request TTFT: `4.955 +/- 0.011 ms`
- request full latency: `242.673 +/- 0.352 ms`

Delta vs prior tuned no-gate probe (`prebuild_startup10_nopreload_probe_20260222T235944Z`):

- startup->healthy: `7011.472 -> 2004.840 ms` (`3.497x` faster)
- request TTFT: `5.826 -> 4.955 ms` (`1.176x` faster)
- request full latency: `254.936 -> 242.673 ms` (`1.051x` faster)

Matrix deltas vs prior tuned no-gate matrix (`attn_backend_frontend_matrix_20260223T000256Z`):

- warm-fixed fused request mean: `22.265 -> 20.354 ms` (`1.094x` faster)
- mixed fused request mean: `50.974 -> 47.904 ms` (`1.064x` faster)
- cold fused TTFT: `5.819 -> 4.959 ms` (`1.173x` faster)
- cold fused full latency: `254.146 -> 242.569 ms` (`1.048x` faster)

Interpretation:

- Hybrid shape gating removes most remaining startup cost while keeping low no-preload request-path latency.
- In this harness, strict fused mode stays inference-valid with low-shape fallback to the custom path.
- Initial broader-shape sanity exposed out-of-window (`seq_kv>10`) miss cascades; adding a bounded gate (`TRENI_ATTN_CUDNN_FRONTEND_MAX_SEQ_KV=10`) removed those cascades and cut the same 5-shape set's mean full latency from `9974.576 ms` to `274.072 ms` (`36.395x`) while keeping fixed-profile matrix behavior near-identical.
Internal vs External Routing (G5, 2026-02-17)
| Metric | Internal | External | Ratio |
|---|---|---|---|
| Mean latency | 94.849 ms | 97.927 ms | 1.032x |
| Task | Internal | External |
|---|---|---|
| general_short | 150.767 ms | 152.274 ms |
| receipt_extract | 80.732 ms | 81.270 ms |
| search_grounded | 46.945 ms | 57.237 ms |
| summarize_short | 100.950 ms | 100.928 ms |
Internal routing is faster in aggregate and faster on 3/4 tasks in this set (the remaining task is effectively tied).
Internal vs External Routing (G5, 2026-02-18, Failure-Amplification Stress)
| Metric | Internal | External | Ratio |
|---|---|---|---|
| Mean latency | 76.071 ms | 109.806 ms | 1.443x |
| Error rate | 0.0000 | 0.0833 | inf |
Stress profile:
- Tool failure injection: every 2nd request.
- Tool timeout injection: every 3rd request (`0.9 s` sleep) with controller timeout `0.25 s`.
- Controller tool retries: `1`.

Observed amplification:

- External taxonomy: `tool_hop_failed=4`.
- External tool retries mean: `0.182`.
Internal vs External Routing Matrix (G5, 2026-02-19)
| Profile | Ext/Int Latency Ratio | Int Error Rate | Ext Error Rate | Ext Tool Retries |
|---|---|---|---|---|
| p00 baseline | 1.042x | 0.0000 | 0.0000 | 0.000 |
| p01 fail mild | 1.048x | 0.0000 | 0.0000 | 0.021 |
| p02 timeout mild | 1.142x | 0.0000 | 0.0000 | 0.021 |
| p03 mixed moderate | 1.164x | 0.0000 | 0.0417 | 0.065 |
| p04 mixed aggressive | 1.436x | 0.0000 | 0.0833 | 0.182 |
| p05 mixed aggressive + retry2 | 1.416x | 0.0000 | 0.0833 | 0.091 |
Matrix mean ratio (all profiles): 1.208x external/internal.
Internal vs External Routing Cross-Host Pilot (2026-02-19)
Topology:
- local benchmark client
- SSH tunnel to G5 host
- runtime + external router on G5
| Profile | Internal Mean | External Mean | Ext/Int Ratio | Int Error | Ext Error |
|---|---|---|---|---|---|
| crosshost-p00-baseline | 1071.477 ms | 1059.478 ms | 0.989x | 0.0000 | 0.0000 |
| crosshost-p02-timeout-mild | 1054.123 ms | 1123.393 ms | 1.066x | 0.0000 | 0.0000 |
| crosshost-p04-stress | 1056.013 ms | 1100.010 ms | 1.042x | 0.0000 | 0.0833 |
Stress notes:
- injected tool failure every 2nd request
- injected timeout every 3rd request (`900 ms` sleep)
- controller tool retries `1` with `20 ms` backoff
Internal vs External Routing Split-Host Matrix (2026-02-19, Canonical Track B)
Topology:
- GPU host: runtime endpoint
- CPU host: external controller + tool services
- controller/tool call runtime over private VPC network
| Profile | Internal Mean | External Mean | Ext/Int Ratio | Int Error | Ext Error | Ext Tool Retries |
|---|---|---|---|---|---|---|
| splithost-p00-baseline | 1052.392 ms | 1046.702 ms | 0.995x | 0.0000 | 0.0000 | 0.000 |
| splithost-p01_fail_mild | 1051.906 ms | 1049.576 ms | 0.998x | 0.0000 | 0.0000 | 0.021 |
| splithost-p02_timeout_mild | 1055.872 ms | 1100.629 ms | 1.042x | 0.0000 | 0.0000 | 0.021 |
| splithost-p03_mixed_moderate | 1071.284 ms | 1072.634 ms | 1.001x | 0.0000 | 0.0417 | 0.065 |
| splithost-p04_mixed_aggressive | 1051.099 ms | 1142.209 ms | 1.087x | 0.0000 | 0.0833 | 0.182 |
| splithost-p05_mixed_aggressive_retry2 | 1064.215 ms | 1112.362 ms | 1.045x | 0.0000 | 0.0833 | 0.091 |
Matrix means (all profiles):
- External/Internal latency ratio: `1.028x`.
- Internal error rate: `0.0000`.
- External error rate: `0.0347`.
Internet Multi-Hop Routing (Fly + Commercial APIs, 2026-02-20)
Topology:
- internal path: local client -> commercial API
- external path: local client -> Fly controller/tool -> same commercial API
| Provider/Model | Runs/Profile | Matrix Mean Ext/Int | Baseline p00 | Timeout-Mild p02 | Stress p04 | Internal Error | External Error (p04) |
|---|---|---|---|---|---|---|---|
| OpenAI gpt-5.2 | 3 | 1.112x | 1.110x | 1.082x | 1.145x | 0.0000 | 0.0833 |
| OpenRouter openai/gpt-5.2 | 3 | 0.755x | 0.686x | 0.891x | 0.689x | 0.0000 | 0.1667 |
| OpenRouter anthropic/claude-sonnet-4.6 | 3 | 1.028x | 1.236x | 0.968x | 0.879x | 0.0000 | 0.1667 |
Interpretation:
- OpenAI matrix shows the expected direction for Track B (external hop overhead under internet routing).
- OpenRouter rows are mixed/inverted by profile and remain non-canonical for Track B direction claims in this topology.
Local Control Routing (No Fly Scheduler Path, 2026-02-20)
Topology:
- internal path: local client -> commercial API
- external path: local client -> local standalone controller/tool -> same commercial API
| Provider/Model | Runs/Profile | Matrix Mean Ext/Int | Baseline p00 | Timeout-Mild p02 | Stress p04 | Internal Error | External Error Mean |
|---|---|---|---|---|---|---|---|
| OpenAI gpt-5.2 | 8 | 0.987x | 0.995x | 0.977x | 0.988x | 0.0000 | 0.0313 |
| OpenRouter anthropic/claude-sonnet-4.6 | 8 | 1.066x | 1.055x | 1.141x | 1.003x | 0.0000 | 0.0313 |
Interpretation:
- With higher-N local controls, OpenAI is near parity and OpenRouter Sonnet trends `external > internal`.
- External errors still appear under stress, while internal remained error-free in these runs.
Task-Family Parity Split (Local Control, runs=8, 2026-02-20, Legacy Pre-Fairness)
| Provider/Model | Model-Only Ext/Int | Tool-Only Ext/Int | Model-Only Errors (Int/Ext) | Tool-Only Errors (Int/Ext) |
|---|---|---|---|---|
| OpenAI gpt-5.2 | 0.958x | 1.136x | 0 / 0 | 0 / 0 |
| OpenRouter anthropic/claude-sonnet-4.6 | 1.044x | 1.051x | 0 / 0 | 0 / 0 |
Interpretation:
- Tool-only tasks support the architecture claim on both providers.
- Model-only behavior is near parity on OpenAI and still favorable on Sonnet.
Task-Family Parity Split (Fairness-Hardened, Local Control, runs=8, 2026-02-20)
Harness controls:
- interleaved ordering (`pair_order=alternate`)
- deterministic defaults (`temperature=0`)
- strict tool parity on `tool_only`
- token-normalized reporting (`ms/completion_token`)
| Provider/Model | Model-Only Ext/Int | Tool-Only Ext/Int | Model-Only Int ms/token | Model-Only Ext ms/token | Tool-Only Int ms/token | Tool-Only Ext ms/token |
|---|---|---|---|---|---|---|
| OpenAI gpt-5.2 | 0.971x | 1.038x | 57.657 | 57.663 | 37.553 | 38.990 |
| OpenRouter anthropic/claude-sonnet-4.6 | 1.102x | 1.063x | 61.606 | 70.054 | 41.212 | 43.791 |
Interpretation:
- Tool-only tasks remain favorable to internal on both providers after parity hardening.
- OpenAI model-only remains near parity/slight inversion; Sonnet model-only favors internal.
- Track B commercial claims remain task-family-stratified.
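Token-normalized reporting divides request wall time by completion-token count so that verbose and terse providers compare fairly. A minimal sketch (field names illustrative, not the harness schema):

```python
# Sketch: token-normalized latency (ms per completion token) lets
# verbose and terse completions compare fairly. Field names are
# illustrative, not the harness schema.

def ms_per_completion_token(latency_ms, completion_tokens):
    if completion_tokens <= 0:
        raise ValueError("need at least one completion token")
    return latency_ms / completion_tokens

# e.g. a 5766 ms response that produced 100 completion tokens
normalized = ms_per_completion_token(5766.0, 100)
```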
Commercial Root-Cause Grouping (Fairness r4+r8, 2026-02-22)
Source:
Delta is external - internal (positive means internal is faster):
| Group | Paired N | Latency Delta Mean | Latency CI95 | Class | Controller Overhead Mean | Model Hop Mean |
|---|---|---|---|---|---|---|
| OpenAI gpt-5.2 model-only | 36 | -69.311 ms | [-193.985, 61.444] | near_parity_noise_dominated | 2.081 ms | 1406.971 ms |
| OpenAI gpt-5.2 tool-only parity | 12 | +49.601 ms | [-162.047, 274.981] | near_parity_noise_dominated | 12.842 ms | 2456.108 ms |
| OpenRouter Sonnet 4.6 model-only | 24 | +204.883 ms | [-148.517, 683.114] | near_parity_noise_dominated | 2.254 ms | 2220.251 ms |
| OpenRouter Sonnet 4.6 tool-only parity | 8 | +165.092 ms | [-124.650, 423.449] | near_parity_noise_dominated | 14.446 ms | 2788.196 ms |
Interpretation:
- Current commercial control set is not statistically locked as win/loss by group (all CI95 include zero).
- Measured controller overhead is small relative to model-hop variance, so higher-N region-pinned reruns are needed before claiming directional commercial performance differences.
Phase 3 Agentic Loops (Canonical G5, 2026-02-19, 3 Seeds)
| Metric | Internal | External | Ratio |
|---|---|---|---|
| Success rate (mean) | 1.0000 | 0.9006 | +0.0994 (internal-ext) |
| Mean latency (mean) | 2.668 ms | 42.855 ms | 16.060x |
| Mean steps to convergence (success, mean) | 2.077 | 3.770 | 1.815x |
Per-scenario snapshot:
| Scenario | Int Success | Ext Success | Ext/Int Latency | Ext/Int Steps |
|---|---|---|---|---|
| retrieval_correction | 1.0000 | 1.0000 | 8.853x | 1.333x |
| tool_state_adaptation | 1.0000 | 0.7417 | 29.793x | 3.000x |
| confidence_gated_branching | 1.0000 | 1.0000 | 17.948x | 1.700x |
Stress variant (tool_fail_every=9, tool_timeout_every=11, controller retries 2):
| Metric | Internal | External | Ratio |
|---|---|---|---|
| Success rate (mean) | 1.0000 | 0.8782 | +0.1218 (internal-ext) |
| Mean latency (mean) | 2.669 ms | 205.942 ms | 77.170x |
| Mean steps to convergence (success, mean) | 2.077 | 3.789 | 1.824x |
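The Ratio and success-gap columns in these loop tables are direct arithmetic on the per-arm means. A quick recomputation from the stress-variant row (expect a few hundredths of drift when deriving ratios from rounded table values):

```python
# Values copied from the stress-variant table above.
int_success, ext_success = 1.0000, 0.8782
int_latency_ms, ext_latency_ms = 2.669, 205.942

success_gap = int_success - ext_success          # "internal-ext" column
latency_ratio = ext_latency_ms / int_latency_ms  # "Ratio" column

assert abs(success_gap - 0.1218) < 1e-9
assert abs(latency_ratio - 77.170) < 0.05  # drift from table rounding
```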
Phase 3 Uncertainty Ablation (G5, 2026-02-19, baseline+stress repeatability)
Normalized-logprob source (paper-aligned), 3-seed means:
| Arm | Internal Success | External Success | Ext/Int Latency |
|---|---|---|---|
| baseline int_on_ext_on | 1.0000 | 0.9006 | 16.721x |
| baseline int_off_ext_on | 0.7692 | 0.9006 | 22.654x |
| baseline int_on_ext_off | 1.0000 | 0.6698 | 12.108x |
| baseline int_off_ext_off | 0.7692 | 0.6698 | 15.316x |
| stress int_on_ext_on | 1.0000 | 0.8782 | 68.233x |
| stress int_off_ext_on | 0.7692 | 0.8782 | 90.080x |
| stress int_on_ext_off | 1.0000 | 0.6570 | 48.146x |
| stress int_off_ext_off | 0.7692 | 0.6570 | 62.463x |
Uncertainty-on gains (success delta means):
- Baseline internal: +0.2308
- Baseline external: +0.2308
- Stress internal: +0.2308
- Stress external: +0.2212
Cross-source check:
- `raw_logit_margin` and `hybrid` preserve the same internal/external success deltas as normalized-logprob in both baseline and stress sets.
- These rows are harness-level synthetic uncertainty; runtime-native canonical corroboration is published below.
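The uncertainty-on gains are pairwise on/off success differences. Note in the table that internal success depends only on the internal flag and external success only on the external flag (rows agree column-wise), so each gain reduces to one subtraction:

```python
# Per-flag success means copied from the normalized-logprob table above.
baseline = {"int_on": 1.0000, "int_off": 0.7692, "ext_on": 0.9006, "ext_off": 0.6698}
stress   = {"int_on": 1.0000, "int_off": 0.7692, "ext_on": 0.8782, "ext_off": 0.6570}

for name, arm in (("baseline", baseline), ("stress", stress)):
    int_gain = arm["int_on"] - arm["int_off"]
    ext_gain = arm["ext_on"] - arm["ext_off"]
    print(f"{name}: internal {int_gain:+.4f}, external {ext_gain:+.4f}")
# baseline: internal +0.2308, external +0.2308
# stress: internal +0.2308, external +0.2212
```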
Phase 3 Uncertainty Ablation (G5, 2026-02-19, runtime-native canonical rerun, 3 seeds, superseded)
Runtime-native source (/v1/chat/completions uncertainty payload), after greedy uncertainty kernel fix:
| Arm | Internal Success | External Success | Ext/Int Latency |
|---|---|---|---|
| baseline int_on_ext_on | 0.8718 | 0.7853 | 10.9504x |
| baseline int_off_ext_on | 0.7692 | 0.9006 | 19.9477x |
| baseline int_on_ext_off | 1.0000 | 0.6698 | 10.7348x |
| baseline int_off_ext_off | 0.7692 | 0.6698 | 13.8492x |
| stress int_on_ext_on | 1.0000 | 0.8782 | 74.1471x |
| stress int_off_ext_on | 0.7692 | 0.8782 | 95.3550x |
| stress int_on_ext_off | 1.0000 | 0.6570 | 53.1571x |
| stress int_off_ext_off | 0.7692 | 0.6570 | 68.2719x |
Uncertainty-on gains (runtime-native success delta means):
- Baseline internal: +0.1026
- Baseline external: +0.1155
- Stress internal: +0.2308
- Stress external: +0.2212
Note:
- This set established runtime-native wiring, but later audit found fallback contamination on part of the seed set.
- Use the 2026-02-20 quality-gated rerun below for current canonical C2 interpretation.
Phase 3 Uncertainty Ablation (G5, 2026-02-20, runtime-native awareness3 rerun, zero-fallback quality gate)
Runtime-native source (awareness.generation first, legacy fallback preserved), seeds 7/11/19, baseline + stress:
| Metric | Baseline | Stress |
|---|---|---|
| Internal uncertainty-on success delta | -0.1538 | -0.1538 |
| External uncertainty-on success delta | -0.1217 | -0.1089 |
Quality gate:
- All runtime-native arm artifacts have non-zero runtime requests/ok, `fallback=0`, `errors=0`.
Interpretation:
- With clean runtime-native probes, uncertainty-on currently hurts success in this harness.
- Runtime-native awareness plumbing is validated; this negative set is superseded by the calibrated rerun below.
Phase 3 Uncertainty Ablation (G5, 2026-02-20, runtime-native calibrated rerun calib1, zero-fallback quality gate)
Runtime-native source, seeds 7/11/19, baseline + stress, calibration params:
- prior weight `0.75`
- confidence floor `0.10`
- confidence ceil `0.35`
- route blend `0.10`
| Metric | Baseline | Stress |
|---|---|---|
| Internal uncertainty-on success delta | +0.1539 | +0.1539 |
| External uncertainty-on success delta | +0.1058 | +0.1154 |
Quality gate:
- All runtime-native arm artifacts have non-zero runtime requests/ok, `fallback=0`, `errors=0`.
Interpretation:
- Calibrated runtime-native uncertainty restores positive uncertainty-on gains in both baseline and stress profiles.
- C2 is re-locked for this harness configuration.
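For intuition only, here is one way the four calib1 parameters could compose. The runtime's actual formula is not published in this table, so `calibrate`, `blend_route`, and the neutral `PRIOR_CONF` value are hypothetical illustrations of prior-weight shrinkage, floor/ceil clamping, and route blending:

```python
# calib1 parameters from the section above; everything else is assumed.
PRIOR_WEIGHT, CONF_FLOOR, CONF_CEIL, ROUTE_BLEND = 0.75, 0.10, 0.35, 0.10
PRIOR_CONF = 0.2  # hypothetical neutral prior confidence

def calibrate(raw_conf: float) -> float:
    # Shrink the raw runtime confidence toward the prior, then clamp.
    blended = PRIOR_WEIGHT * PRIOR_CONF + (1.0 - PRIOR_WEIGHT) * raw_conf
    return min(CONF_CEIL, max(CONF_FLOOR, blended))

def blend_route(old_score: float, new_conf: float) -> float:
    # Let the calibrated confidence move the routing score only fractionally.
    return (1.0 - ROUTE_BLEND) * old_score + ROUTE_BLEND * new_conf
```

Under these assumptions a raw confidence of 1.0 calibrates to the 0.35 ceiling, i.e. the floor/ceil pair keeps any single runtime-native probe from dominating routing.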
External Cold Comparison (G5, 2026-02-24, Step0 Exp-Reuse Patch, 3-run means, Current Best)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
|---|---|---|---|---|
| runtime | 1003.287 ms | 4.018 ms | 238.400 ms | 1241.688 ms |
| pytorch_transformers | - | 509.427 ms | 2234.756 ms | 7847.001 ms |
| vllm | 23366.391 ms | 50.406 ms | 997.514 ms | 24363.905 ms |
| ollama (GGUF) | - | - | - | - |
Interpretation:
- Runtime keeps a large margin vs PyTorch and vLLM on request path and cold-total in this rerun.
- Runtime also improved vs the immediate seq1mh baseline (2026-02-24T192020Z): TTFT 4.022 -> 4.018 ms, full 239.277 -> 238.400 ms, cold-total 1242.592 -> 1241.688 ms.
- Runtime remains materially better than the prior host-prefetch means (2026-02-19): TTFT 5.130 -> 4.018 ms, full 316.403 -> 238.400 ms, cold-total 1320.240 -> 1241.688 ms.
- A follow-up shared-probability patch (`external_cold_step0shared_repeatability_20260224T194913Z`) did not beat this row and was reverted.
- Ollama is intentionally blank here because it was not installed on the rerun host.
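Within millisecond rounding, the Cold Total First Response column in these comparisons decomposes as startup-to-healthy plus request full latency. A quick consistency check on the runtime and vLLM rows, assuming that decomposition:

```python
# (startup_ms, request_full_ms, cold_total_ms) from the table above.
rows = {
    "runtime": (1003.287, 238.400, 1241.688),
    "vllm": (23366.391, 997.514, 24363.905),
}
for name, (startup_ms, full_ms, cold_total_ms) in rows.items():
    # Startup plus first-request full should reproduce the cold total.
    assert abs((startup_ms + full_ms) - cold_total_ms) < 0.01, name
```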
External Cold Decode Profiling + Uncertainty A/B (G5, 2026-02-25, No Preload, 64 Tokens)
Profile source:
- external_cold_stepn_profile_20260225T001334Z
- external_cold_uncert_on_20260225T001702Z
- external_cold_uncert_off_20260225T001704Z
| Runtime Mode | Request TTFT | Request Full | Infer | decoder_stepN_layers_mean | decoder_stepN_logits_sample_mean |
|---|---|---|---|---|---|
| uncertainty on | 4.109 ms | 479.889 ms | 461.771 ms | 1.360 ms | 2.671 ms |
| uncertainty off | 3.991 ms | 473.367 ms | 454.878 ms | 1.360 ms | 2.562 ms |
Interpretation:
- The dominant decode stage in this profile is `decoder_stepN_logits_sample`, then `decoder_stepN_layers`.
- Disabling uncertainty stats helps, but the uplift is modest relative to total decode time; the main custom-kernel target remains logits+sample compute.
External Cold Runtime vs vLLM (G5, 2026-02-25, Same Profile, Uncertainty-Off Runtime)
Source:
- external_cold_runtime_vllm_uncertoff_20260225T001929Z
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
|---|---|---|---|---|
| runtime | 1003.392 ms | 3.929 ms | 472.724 ms | 1476.116 ms |
| vllm | 23032.532 ms | 49.577 ms | 1311.481 ms | 24344.013 ms |
Ratios (vLLM/runtime):
- TTFT: 12.618x
- Request Full: 2.774x
- Cold Total First Response: 16.492x
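The ratio bullets follow directly from the table rows:

```python
# Per-backend means from the table above (ms).
runtime = {"ttft": 3.929, "full": 472.724, "cold_total": 1476.116}
vllm = {"ttft": 49.577, "full": 1311.481, "cold_total": 24344.013}

ratios = {k: vllm[k] / runtime[k] for k in runtime}
# ttft ~12.618x, full ~2.774x, cold_total ~16.492x
```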
Full-Depth FFN Projection Batched2 (G5, 2026-02-27, --layers 36, preload64)
Source:
- external_cold_layers36_preload64_ab3_ffnprojbatch2_off_s{1,2,3}_20260227T1820*.json
- external_cold_layers36_preload64_ab3_ffnprojbatch2_on_s{1,2,3}_20260227T1820*.json
- external_cold_layers36_preload64_ab3_ffnprojbatch2_vllm_off_s{1,2,3}_20260227T1821*/1822*.json
- external_cold_layers36_preload64_ab3_ffnprojbatch2_vllm_on_s{1,2,3}_20260227T1823*/1824*/182512Z.json
- external_cold_layers36_stageprofile_ffnprojbatch2_off_20260227T182949Z
- external_cold_layers36_stageprofile_ffnprojbatch2_on_20260227T182728Z
Runtime-only 3-seed means:
| Mode | Request TTFT | Request Full | Cold Total First Response |
|---|---|---|---|
| batched2 off | 15.189 ms | 1702.190 ms | 4708.109 ms |
| batched2 on | 15.018 ms | 1689.991 ms | 4696.805 ms |
Runtime-vLLM 3-seed (runtime leg):
| Mode | Runtime TTFT | Runtime Full | Runtime Cold Total |
|---|---|---|---|
| batched2 off | 15.207 ms | 1704.091 ms | 4710.111 ms |
| batched2 on | 15.032 ms | 1691.116 ms | 4697.207 ms |
Stage-profile corroboration (off -> on):
- `decoder_step_profile_ffn_proj_mean`: 0.205 -> 0.196 ms/layer
- `decoder_stepN_layers_mean`: 19.140 -> 18.447 ms
- `decoder_stepN_total_mean`: 20.761 -> 20.044 ms
Interpretation:
- Batched2 is a real full-depth uplift and is now default-on.
- Runtime still trails vLLM on first-request full latency in this profile, but the gap narrowed again.
Full-Depth FFN Proj Fast-Compute Probe (G5, 2026-02-27, --layers 36, preload64)
Source:
- external_cold_layers36_preload64_ab3_ffnprojfast_off_s{1,2,3}_20260227T194728Z.json
- external_cold_layers36_preload64_ab3_ffnprojfast_on_s{1,2,3}_20260227T194728Z.json
- external_cold_layers36_preload64_ab8_ffnprojfast_off_s{1..8}_20260227T195024Z.json
- external_cold_layers36_preload64_ab8_ffnprojfast_on_s{1..8}_20260227T195024Z.json
- week3_parity_report_ffnprojfast_20260227T194853Z.json
Runtime-only means:
| Set | Mode | Request TTFT | Request Full | Infer | Cold Total First Response |
|---|---|---|---|---|---|
| 3-seed | off | 14.690 ms | 1663.669 ms | 1641.932 ms | 4669.765 ms |
| 3-seed | on | 14.680 ms | 1662.115 ms | 1640.919 ms | 4668.126 ms |
| 8-seed | off | 14.683 ms | 1662.812 ms | 1640.942 ms | 4668.851 ms |
| 8-seed | on | 14.685 ms | 1662.601 ms | 1641.667 ms | 4668.678 ms |
Interpretation (historical for this cycle):
- This specific cycle was too small/noisy to justify default promotion at that time.
- Later clean-path reruns on 2026-02-28 (pool16g, no fallback) looked positive, but the full foundation gate (`foundation_ffnprojfast_gate_ab2_20260228T195240Z`) rejected global promotion; the canonical default remains `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=0`.
Full-Depth U16 Tensor Cache Unlock (G5, 2026-02-27, claim-safe A/B, --layers 36, preload64)
Source:
- external_cold_layers36_preload64_u16cache_claimsafe_summary_20260227T200242Z.json
- external_cold_layers36_preload64_u16cache_claimsafe_summary_20260227T200242Z.md
- week3_parity_report_u16cache_toggle_default_20260227T200652Z.json
- week3_parity_report_u16cachefix_default_20260227T195625Z.json
Runtime-only 3-seed A/B:
| Mode | Request TTFT | Request Full | Infer | Cold Total First Response |
|---|---|---|---|---|
| `TRENI_TENSOR_CACHE_U16=0` | 14.679 ms | 1661.982 ms | 1640.118 ms | 4667.860 ms |
| `TRENI_TENSOR_CACHE_U16=1` | 14.682 ms | 1189.452 ms | 1168.883 ms | 4195.511 ms |
Runtime-vLLM same-window A/B (2-seed):
| Mode | Runtime Full | vLLM Full | Runtime-vLLM Full Delta |
|---|---|---|---|
| `TRENI_TENSOR_CACHE_U16=0` | 1663.314 ms | 1325.189 ms | +338.124 ms |
| `TRENI_TENSOR_CACHE_U16=1` | 1192.145 ms | 1290.816 ms | -98.671 ms |
Mechanism check:
- Request-path `decoder_tensor_upload` dropped from ~476 ms to ~5 ms.
- Request-path `decoder_tensor_h2d` dropped from ~468 ms to 0 ms.
Interpretation:
- This is the primary full-depth request-path unlock in the current cycle.
- With `TRENI_TENSOR_CACHE_U16=1`, runtime flips from trailing vLLM on request full to leading in the same-window compare.
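The mechanism check ties to the headline numbers: the request-full saving in the runtime-only A/B should be explained almost entirely by the `decoder_tensor_upload` drop. Using the approximate stage means quoted above:

```python
# Runtime-only 3-seed request-full means (ms), U16 cache off vs on.
full_off, full_on = 1661.982, 1189.452
# Approximate decoder_tensor_upload stage means (ms) from the bullets above.
upload_off, upload_on = 476.0, 5.0

full_saving = full_off - full_on        # ~472.5 ms end-to-end
upload_saving = upload_off - upload_on  # ~471 ms in the upload stage alone
assert abs(full_saving - upload_saving) < 5.0  # agree within a few ms
```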
Full-Depth FFN Follow-Up (G5, 2026-02-27 Late Night, --layers 36, preload64)
Source:
external_cold_layers36_ffn_followup_summary_20260227T223458Z.jsonexternal_cold_layers36_ffn_followup_summary_20260227T223458Z.md
Lane 1: TRENI_LINEAR_BATCHED2_USE_LT (new optional backend)
| Mode | Request TTFT | Request Full | Infer | Cold Total First Response |
|---|---|---|---|---|
| off | 14.683 ms | 1190.339 ms | 1169.828 ms | 4196.547 ms |
| on | 14.846 ms | 1202.808 ms | 1182.363 ms | 4209.043 ms |
Lane 2: TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=1 + TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1 (runtime-only AB8)
| Mode | Request TTFT | Request Full | Infer | Cold Total First Response |
|---|---|---|---|---|
| off | 14.683 ms | 1190.212 ms | 1169.500 ms | 4196.259 ms |
| on | 14.682 ms | 1190.013 ms | 1169.399 ms | 4196.057 ms |
Lane 3: FFN fused path bias-deferral follow-up (TRENI_DECODER_FFN_PROJ_U16_FUSED=0/1)
| Mode | Request TTFT | Request Full | Infer | Cold Total First Response |
|---|---|---|---|---|
| off | 14.688 ms | 1190.539 ms | 1169.862 ms | 4196.613 ms |
| on | 14.684 ms | 1190.156 ms | 1169.701 ms | 4196.147 ms |
Interpretation:
- No new lane is promoted from this cycle.
- `TRENI_LINEAR_BATCHED2_USE_LT=1` regresses materially in runtime-only full-depth A/B.
- F32-input/fast-compute and fused-bias-deferral follow-ups are both near-noise and not material enough for canonical change.
Fast-Profile Logits Follow-Up (G5, 2026-02-28, --layers 2, preload64)
Source:
- external_cold_layers2_logitsfast_ab8_summary_20260228T005529Z.json
- external_cold_layers2_logitsfast_ab8_summary_20260228T005529Z.md
Runtime-only AB8 (TRENI_DECODER_LOGITS_U16_FAST_COMPUTE=0/1):
| Mode | Request TTFT | Request Full | Infer | Cold Total First Response |
|---|---|---|---|---|
| off | 2.575 ms | 205.244 ms | 185.880 ms | 1208.636 ms |
| on | 2.573 ms | 204.945 ms | 185.867 ms | 1208.291 ms |
Interpretation:
- Fast-profile logits fast-compute remains near-noise (full -0.299 ms), so it is not promoted.
Mixed-Load p99 Repeatability (G5, 2026-02-28, Canonical Lane)
Source:
- mixed_load_repeatability_summary_20260228T005626Z.json
- mixed_load_repeatability_summary_20260228T005626Z.md
3-run (run_mode=mixed_load, http_runs=120 each):
| Metric | Mean Across Runs |
|---|---|
| Request Mean | 122.247 ms |
| Request p95 | 198.518 ms |
| Request p99 | 199.608 ms |
Interpretation:
- Current canonical lane remains stable under this mixed-load repeatability set; no configuration change.
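The p95/p99 columns are order statistics over the 120 HTTP requests in each run. The harness's exact percentile convention is not stated here, but a common nearest-rank definition looks like:

```python
import math

def percentile_nearest_rank(samples, p):
    """Nearest-rank percentile: smallest value with >= p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100.0 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# 120 synthetic latencies (1..120 ms) stand in for one http_runs=120 run.
latencies = [float(i) for i in range(1, 121)]
assert percentile_nearest_rank(latencies, 95) == 114.0
assert percentile_nearest_rank(latencies, 99) == 119.0
```

With only 120 samples, p99 is effectively the second-worst request, which is why p95 and p99 sit close together in the table above.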
External Cold Comparison (G5, 2026-02-18)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
|---|---|---|---|---|
| runtime | 1003.537 ms | 1108.979 ms | 1339.459 ms | 2342.996 ms |
| pytorch_transformers | - | 528.483 ms | 2288.516 ms | 8725.259 ms |
| vllm | 24032.203 ms | 51.763 ms | 1036.815 ms | 25069.018 ms |
| ollama (GGUF) | 1002.695 ms | 2168.902 ms | 2527.411 ms | 3530.106 ms |
Cold total first response ratio over runtime:
- PyTorch: 3.724x
- vLLM: 10.700x
- Ollama: 1.507x
Request-path only note:
- vLLM is fastest on request-path TTFT/full once healthy, but has high startup in this run.
External Cold Comparison (G5, 2026-02-18, Runtime Preload + Tokenizer Cache, Non-Parity)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
|---|---|---|---|---|
| runtime | 2004.735 ms | 91.596 ms | 271.346 ms | 2276.081 ms |
| pytorch_transformers | - | 522.795 ms | 2252.382 ms | 8374.324 ms |
| vllm | 27036.682 ms | 51.725 ms | 1035.826 ms | 28072.508 ms |
| ollama (GGUF) | 1002.508 ms | 2182.542 ms | 2538.609 ms | 3541.117 ms |
Runtime advantage in this variant:
- Request full latency vs vLLM: 3.817x faster.
- Cold total first response vs vLLM: 12.334x faster.
- TTFT still trails vLLM (91.596 ms vs 51.725 ms).
- Caveat: runtime was still using 4 decode steps in this run while vLLM/PyTorch/Ollama used 48.
External Cold Comparison (G5, 2026-02-18, Token Parity = 48)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
|---|---|---|---|---|
| runtime | 2004.203 ms | 91.207 ms | 2518.142 ms | 4522.345 ms |
| pytorch_transformers | - | 501.946 ms | 2244.327 ms | 11272.831 ms |
| vllm | 27036.248 ms | 51.310 ms | 1075.404 ms | 28111.652 ms |
| ollama (GGUF) | 1002.560 ms | 2197.797 ms | 2556.652 ms | 3559.212 ms |
Parity interpretation:
- Runtime still wins cold-total first response vs vLLM (6.216x better).
- vLLM wins request-path TTFT and full latency at the equal 48-token budget.
External Cold Comparison (G5, 2026-02-18, Token Parity = 48, Decoder/Sampling Fix)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
|---|---|---|---|---|
| runtime | 2004.759 ms | 5.022 ms | 311.289 ms | 2316.048 ms |
| pytorch_transformers | - | 515.597 ms | 2291.854 ms | 8237.561 ms |
| vllm | 24036.762 ms | 52.995 ms | 1094.517 ms | 25131.279 ms |
| ollama (GGUF) | 1002.630 ms | 2184.383 ms | 2543.219 ms | 3545.849 ms |
Post-fix interpretation:
- Runtime now wins TTFT and full request latency vs vLLM at token parity.
- Runtime keeps the large cold-total lead vs vLLM.
- Initial 3-run repeatability (2026-02-18) means: runtime 5.022 ms TTFT, 311.444 ms full, 2316.002 ms cold-total vs vLLM 51.894/1052.767/24752.842 ms.
- Superseded by the 2026-02-19 rerun and the all-backend repeatability below.
External Cold Comparison (G5, 2026-02-19, GPU-Convert Fix2, All Backends, 3-run means)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
|---|---|---|---|---|
| runtime | 2339.131 ms | 5.131 ms | 318.315 ms | 2657.447 ms |
| pytorch_transformers | - | 591.635 ms | 2389.837 ms | 10420.381 ms |
| vllm | 27704.185 ms | 82.560 ms | 1226.201 ms | 28930.385 ms |
| ollama (GGUF) | 1002.622 ms | 10819.259 ms | 11178.525 ms | 12181.148 ms |
Interpretation:
- Runtime keeps the request-path lead across all backends.
- Runtime keeps the cold-total lead vs all backends in this set.
- A runtime preload upload outlier affected one run; stable runs 1-2 sit at ~1004 ms startup and ~1321 ms cold-total.
External Cold Comparison (G5, 2026-02-19, GPU-Convert + Host-Prefetch Fix, All Backends, 3-run means)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
|---|---|---|---|---|
| runtime | 1003.836 ms | 5.130 ms | 316.403 ms | 1320.240 ms |
| pytorch_transformers | - | 556.949 ms | 2322.565 ms | 19276.667 ms |
| vllm | 27704.770 ms | 84.837 ms | 1232.660 ms | 28937.430 ms |
| ollama (GGUF) | 1002.567 ms | 2638.945 ms | 2996.765 ms | 3999.332 ms |
Interpretation:
- Runtime keeps the request-path lead across all backends.
- Runtime keeps the cold-total lead vs all backends in this set.
- Host-prefetch rerun removed the startup/upload outlier observed in the prior repeatability set.
Historical Legacy Mixed-Mode Context
| Set | Runtime HTTP request mean | Runtime HTTP request p99 |
|---|---|---|
| T4 (2026-02-15) | 146279.609 ms | 156769.1 ms |
| G5 (2026-02-15) | 77449.605 ms | 83346.187 ms |
| G5 registry-cached single run (2026-02-16) | 82.913 ms | 91.877 ms |
Parity Health
| Set | Checked | Failed | Strict |
|---|---|---|---|
| T4 | 3 | 0 | true |
| G5 | 3 | 0 | true |