Benchmark Status
What was run already, what still remains, and why.
Direct Answers
- "Did we run the start benchmark (Phase 1 baseline)?" Yes.
- "Did we rerun after true TTFT and cold fixes?" Yes (2026-02-17 on G5).
- "Did we run all benchmarks in the full plan?" Core plan: yes (through Phase 4 + paper package).
- "Did we run a full-depth runtime-vLLM cold check?" Yes (2026-02-25, `--layers 36`, `--pool-mb 16384`).
- "Is there anything else to run?" Yes: the next real blocker is thinking-mode answer emission on GPQA-style tasks. The old sampled reproducibility blocker is fixed, and the non-thinking strict lanes are green.
Latest 2026-03-11 Update: Native Hermes 4B Conversation Lane
- Native Hermes same-VM registration is now the real integration path:
  - `/_vendor/hermes-agent/tools/treni_samevm_tools.py`
  - `/_vendor/hermes-agent/model_tools.py`
  - `/_vendor/hermes-agent/toolsets.py`
  - `/_vendor/hermes-agent/hermes_cli/tools_config.py`
- Tool name audit on the AWS Hermes checkout is clean:
  - 73 total tool names, 73 unique
  - no duplicate `execute_code`, `browser_*`, or `samevm_*` registrations
- Multi-turn 4B conversation bugfixes now in place:
  - unique runtime tool-call IDs from `monolith/server/http.c`
  - compact multi-turn carry-over in `scripts/samevm_agent_conversation_suite.py`
  - structured 400 JSON worker errors in `scripts/treni_local_tool_worker.py`
  - `samevm_rag_ingest` HTTP bridge now preserves worker-side error payloads in `scripts/hermes_same_vm_mvp.py`
- Canonical repaired split workflow artifact: `benchmarks/same_vm_mvp/results/hemkesh-v22_20260311T020710Z.json`
- Result:
  - turn 1: local discovery + grounded facts
  - turn 2: exact facts stored in SQLite and queried back
  - turn 3: broader background stored in RAG and retrieval-checked
  - turn 4: memory note saved
  - turn 5: final recall correctly distinguishes SQLite exact facts vs RAG broader background
- Interpretation:
  - the native Hermes 4B lane is now green for the split real-world persistence workflow,
  - the remaining non-canonical case is the single freeform turn that tries to do SQLite + RAG + memory all at once.
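The compact multi-turn carry-over mentioned in the bugfix list can be sketched roughly as below. This is an illustrative sketch only: `compact_history`, `KEEP_LAST_TURNS`, and `TOOL_STUB_CHARS` are hypothetical names, not the actual `scripts/samevm_agent_conversation_suite.py` API.

```python
# Hypothetical sketch of compact multi-turn carry-over: keep the system
# prompt and the last few user turns verbatim, and collapse older tool
# results to short stubs so the prompt stays under the runtime token cap.
KEEP_LAST_TURNS = 2        # most recent user turns kept in full
TOOL_STUB_CHARS = 120      # older tool outputs truncated to this many chars

def compact_history(messages):
    """Return a compacted copy of an OpenAI-style message list."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Cut point: everything before the last KEEP_LAST_TURNS user turns
    # is eligible for compaction.
    user_idxs = [i for i, m in enumerate(rest) if m["role"] == "user"]
    cut = user_idxs[-KEEP_LAST_TURNS] if len(user_idxs) > KEEP_LAST_TURNS else 0
    compacted = []
    for i, m in enumerate(rest):
        if i < cut and m["role"] == "tool":
            body = str(m.get("content", ""))
            if len(body) > TOOL_STUB_CHARS:
                m = {**m, "content": body[:TOOL_STUB_CHARS] + " …[truncated]"}
        compacted.append(m)
    return system + compacted
```

The point of the design is that exact facts the model still needs live in SQLite/RAG anyway, so older tool payloads do not have to ride along verbatim in every turn.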
Latest 2026-03-08 Update: Deterministic Strict Lane
- Runtime-side request override handling is now serialized in `monolith/server/http.c`, so request-scoped decode overrides no longer race through process-global env state.
- Direct runtime reproducibility proof on AWS:
  - `benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_t0_r1.json`
  - `benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_t0_r2.json`
  - repeated `temperature=0` IFEval seed-7 runs are identical (`score_mean=0.5625` both).
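A reproducibility proof like the one above reduces to comparing two result artifacts field by field. A minimal sketch, assuming each artifact is a JSON object with a top-level `score_mean` and a per-sample `outputs` list (field names are assumptions, not the exact phase5 artifact schema):

```python
import json

def artifacts_identical(a, b):
    """a, b: parsed result dicts; identical score and per-sample outputs."""
    return (a.get("score_mean") == b.get("score_mean")
            and a.get("outputs") == b.get("outputs"))

def runs_identical(path_a, path_b):
    # Thin wrapper for comparing two on-disk artifacts.
    with open(path_a) as fa, open(path_b) as fb:
        return artifacts_identical(json.load(fa), json.load(fb))
```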
- New deterministic one-host strict matrix: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T204248Z.json`
  - overall:
    - runtime score 0.295139, vLLM score 0.267361
    - runtime latency 824.714 ms, vLLM latency 1572.529 ms
  - gpqa_diamond:
    - score parity (0.166667 vs 0.166667)
    - runtime slower (671.640 ms vs 436.583 ms)
  - ifeval:
    - runtime higher score (0.423611 vs 0.368055)
    - runtime much faster (977.787 ms vs 2708.475 ms)
- Interpretation:
  - the runtime now has a claim-safe deterministic strict lane where it wins overall on both score and latency,
  - sampled runs are now also fixed separately below.
Latest 2026-03-08 Update: Sampled Lane Fixed
- Root cause:
  - the bug was in `scripts/phase5_awareness_realbench.py`, not in runtime decode math,
  - the shared first-pass used for `arm_a_control` skipped the request seed and task-specific decode payload.
- Post-fix runtime-only sampled reproducibility probes on AWS:
  - `benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_raw_seedfix_r1.json`
  - `benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_raw_seedfix_r2.json`
- Result:
  - repeated sampled IFEval seed-7 runs are identical (`score_mean=0.3125` both),
  - all 8/8 outputs are identical across the two reruns.
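The class of bug fixed here is easy to reintroduce, so it is worth stating concretely: every arm's first pass must carry the request seed and the task-specific decode payload, never a shared unseeded default. A hypothetical sketch (`build_request` and `TASK_DECODE` are illustrative names, not the harness API):

```python
# Illustrative per-task decode overrides; the real values live in the
# harness config, these numbers are placeholders.
TASK_DECODE = {
    "ifeval": {"temperature": 0.7, "top_p": 0.95, "max_tokens": 512},
    "gpqa_diamond": {"temperature": 0.7, "top_p": 0.95, "max_tokens": 768},
}

def build_request(task, prompt, seed):
    """Build one chat request carrying both the task payload and the seed."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    payload.update(TASK_DECODE[task])   # task-specific decode payload
    payload["seed"] = seed              # request seed must not be dropped
    return payload
```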
- New post-fix sampled one-host strict matrix: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T220806Z.json`
  - overall:
    - runtime score 0.409722, vLLM score 0.302083
    - runtime latency 1617.187 ms, vLLM latency 2017.206 ms
  - gpqa_diamond:
    - runtime higher score (0.3750 vs 0.2500)
    - runtime slower (710.693 ms vs 435.823 ms)
  - ifeval:
    - runtime higher score (0.4444 vs 0.3542)
    - runtime faster (2523.680 ms vs 3598.588 ms)
- Immediate repeatability confirmation: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T221823Z.json`
  - overall:
    - runtime score 0.409722, vLLM score 0.281250
    - runtime latency 1607.757 ms, vLLM latency 2008.759 ms
- Interpretation:
  - sampled-lane drift is fixed,
  - the new sampled strict lane is promotable and runtime wins overall on both score and latency,
  - and a second full-matrix rerun stays aligned with that conclusion.
Latest 2026-03-08 Update: Larger-N Sampled Strict Confirmation
- Stronger sampled strict matrix: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T235013Z.json`
- Result (16 samples/task, 3 seeds):
  - overall:
    - runtime score 0.371528, vLLM score 0.296875
    - runtime latency 1255.344 ms, vLLM latency 1585.043 ms
  - gpqa_diamond:
    - runtime higher score (0.3750 vs 0.3125)
    - runtime slower (801.900 ms vs 433.256 ms)
  - ifeval:
    - runtime higher score (0.368056 vs 0.281250)
    - runtime faster (1708.789 ms vs 2736.831 ms)
- Interpretation:
  - the non-thinking sampled win is now stronger than the original 8-sample pass,
  - score and latency stay runtime-positive overall with positive confidence intervals.
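The "positive confidence intervals" claim can be checked with a simple bootstrap over per-sample score deltas (runtime minus vLLM): if the 95% interval stays above zero, the win is claim-safe. A minimal sketch, not the harness's actual statistics code:

```python
import random

def bootstrap_delta_ci(runtime_scores, vllm_scores, iters=2000, seed=0):
    """95% bootstrap CI of the mean per-sample score delta."""
    rng = random.Random(seed)
    deltas = [r - v for r, v in zip(runtime_scores, vllm_scores)]
    means = []
    for _ in range(iters):
        sample = [rng.choice(deltas) for _ in deltas]  # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    return means[int(0.025 * iters)], means[int(0.975 * iters)]
```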
Latest 2026-03-08 Update: Thinking-Mode Parity Exploration
- First explicit thinking-mode strict matrix: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T223442Z.json`
- Budget-fixed follow-up: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T224358Z.json`
- Finalized follow-up: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T235628Z.json`
- Lower-cost finalized follow-up: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260309T010353Z.json`
- Key result:
  - the lane is no longer all-zero on `gpqa_diamond`; the close-form finalize pass makes it measurable
  - lower-cost finalized result (`gpqa_max_tokens=256`):
    - overall:
      - runtime score 0.250000, vLLM score 0.194444
      - runtime latency 6823.816 ms, vLLM latency 7503.000 ms
    - gpqa_diamond:
      - runtime score 0.166667, vLLM score 0.166667
      - runtime near parity on latency (7727.880 ms vs 7741.028 ms)
    - ifeval:
      - runtime score 0.333333, vLLM score 0.222222
      - runtime faster (5919.753 ms vs 7264.973 ms)
- One-example long-budget probes:
  - `benchmarks/phase5_awareness_realbench/results/qwen35_thinking_gpqa_oneoff_runtime_1024_20260308T230352Z.json`
  - `benchmarks/phase5_awareness_realbench/results/qwen35_thinking_gpqa_oneoff_vllm_1024_20260308T230352Z.json`
- Interpretation:
  - the old blockers were real and are now fixed enough to measure the lane:
    - runtime no longer hard-clips at 512,
    - long-decode host-buffer corruption is fixed,
    - close-form finalize converts length-exhausted reasoning into parseable answers,
  - the better current thinking tradeoff is the reduced-budget finalized lane:
    - runtime still leads on score overall,
    - and with `gpqa_max_tokens=256` it now also beats vLLM overall on latency.
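The close-form finalize idea described above can be sketched as a single follow-up request issued only when the first pass exhausts its token budget: instead of discarding an unparseable length-truncated reasoning trace, ask for the final answer alone with thinking off. `chat` here is a hypothetical stand-in for the real client call, not an actual API:

```python
def finalize_if_exhausted(chat, messages, first):
    """first: dict with 'finish_reason' and 'content' from the first pass."""
    if first.get("finish_reason") != "length":
        return first["content"]          # normal stop: nothing to repair
    followup = messages + [
        {"role": "assistant", "content": first["content"]},
        {"role": "user",
         "content": "Based on your reasoning so far, reply with only the "
                    "final answer letter, nothing else."},
    ]
    # Tiny budget, thinking disabled: just emit the answer.
    second = chat(followup, max_tokens=16, thinking=False)
    return second["content"]
```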
- Early GSM8K-only finalized thinking follow-up: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260310T022347Z.json`
  - result (32 samples/task, 3 seeds):
    - runtime score 0.197917, vLLM score 0.177083
    - runtime latency 7174.829 ms, vLLM latency 7643.231 ms
  - interpretation:
    - this extends the closed-form thinking lane beyond `gpqa_diamond`,
    - runtime remains directionally ahead on both score and latency,
    - but the score CI still crosses zero, so this GSM8K thinking lane is exploratory, not claim-safe yet.
- AIME25 isolated follow-up: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260310T021732Z.json`
  - result (8 samples, 1 seed, 512 thinking tokens, patched AIME prompts):
    - runtime score 0.0, vLLM score 0.0
    - runtime latency 19776.254 ms, vLLM latency 16092.718 ms
  - interpretation:
    - AIME25 does not recover even after an AIME-specific prompt/finalize pass adjustment,
    - so this is currently a task-family limitation, not a benchmark-wide thinking win.
- AIME25 second-thinking recovery attempt: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260309T021331Z.json`
  - result (8 samples, 1 seed):
    - runtime score 0.0, vLLM score 0.0
    - runtime latency 21409.322 ms, vLLM latency 22110.402 ms
  - interpretation:
    - giving AIME a second short thinking pass did not recover score,
    - that experiment is non-canonical and should not replace the lower-cost default finalized path.
Late 2026-03-08 Update: Fast Sampler + Tie-Stable AB3
- After the hybrid prefill fix, sampled decode became the dominant remaining hotspot.
- Focused GPQA probe after fast top-k sampling: `q35-gpqa-profile-aws-samplefast1_20260308T003727Z.json`
  - first-call moves:
    - `decoder_step0_logits_sample` 40.701 -> 3.538 ms
    - `decoder_ttft` 1019.079 -> 982.663 ms
  - step-N moves:
    - `decoder_stepN_sample_mean` 37.090 -> 2.366 ms
    - `decoder_stepN_total_mean` 47.748 -> 12.721 ms
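The size of this sampler win is consistent with replacing full-vocabulary work with a partial top-k selection before the softmax/sample step. A pure-Python sketch of the idea (the real fix lives in the CUDA decode path, not in Python):

```python
import heapq
import math
import random

def topk_sample(logits, k, rng):
    """Sample a token id from the top-k logits only.

    Partial selection is O(V log k) instead of a full O(V log V) sort,
    and the softmax then touches k entries instead of the whole vocab.
    """
    top = heapq.nlargest(k, range(len(logits)), key=logits.__getitem__)
    m = logits[top[0]]                                  # max logit, for stability
    weights = [math.exp(logits[i] - m) for i in top]    # softmax numerators over k
    return rng.choices(top, weights=weights, k=1)[0]
```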
- First clean strict AB3 after fast sampling: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T003749Z.json`
  - overall:
    - runtime score 0.305556, vLLM score 0.347222
    - runtime latency 1405.707 ms, vLLM latency 1676.336 ms
- interpretation:
  - runtime flipped to a real overall latency win,
  - but quality regressed enough that this run was not promotable as canonical.
- Tie-stable sampler follow-up:
  - one-seed proof: `phase5_qwen35_remote_strict_matrix_20260308T004511Z.json`
    - runtime wins both score (0.4375 vs 0.375) and latency (1497.984 ms vs 2026.199 ms) on seed 7
  - full AB3 rerun: `phase5_qwen35_remote_strict_matrix_20260308T004758Z.json`
    - overall:
      - runtime score 0.315972, vLLM score 0.347222
      - runtime latency 1422.818 ms, vLLM latency 1659.878 ms
    - task split:
      - gpqa_diamond: runtime better score (0.291667 vs 0.208333) but still slower
      - ifeval: runtime lower score (0.340278 vs 0.486111) but much faster
- Interpretation:
  - the runtime now has a clean strict latency lead on this Qwen3.5 one-host matrix,
  - the remaining work is score recovery, not another large latency rescue.
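The tie-stable part of the sampler can be illustrated with greedy argmax: when several tokens share the maximum logit (common at low temperature), always resolve the tie the same way, here by lowest token id, so repeated runs cannot flip between equally scored tokens. A sketch of the idea, not the runtime's kernel:

```python
def argmax_tie_stable(logits):
    """Return the lowest token id among the maximal logits."""
    best_id, best_val = 0, logits[0]
    for i, v in enumerate(logits):
        if v > best_val:           # strictly greater only: the first
            best_id, best_val = i, v  # (lowest-id) occurrence of the max wins
    return best_id
```

A parallel reduction that compares with `>=` instead of `>` (or reduces in nondeterministic order) can return different winners across runs even with identical logits, which is exactly the instability the tie-stable follow-up removed.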
Late 2026-03-08 Update: Batched Hybrid Qwen3.5 Prefill
- Qwen3.5 hybrid prompt prefill is now materially improved on AWS:
  - new code paths: `monolith/models/decoder.cu`, `monolith/main.c`, `monolith/include/treni_models.h`
  - changes:
    - batched linear-attention hidden forward for sequence prefill,
    - batched full-attention prefill with K/V cache materialization,
    - hybrid layer-major prefill in `main.c` instead of token-by-token fallback for Qwen3.5 prompt runs.
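Why batching helps can be shown with a toy stand-in: token-by-token prefill issues one projection "launch" per prompt token, while batched prefill produces the same K/V cache in a single pass. This is an illustrative model of the launch-count difference, not `decoder.cu`:

```python
LAUNCHES = {"n": 0}  # crude stand-in for kernel-launch count

def kv_project_batch(embs, w):
    """Project a batch of embeddings through weight columns w (one 'launch')."""
    LAUNCHES["n"] += 1
    return [[sum(a * b for a, b in zip(e, col)) for col in w] for e in embs]

def prefill_token_by_token(prompt_embs, w):
    cache = []
    for emb in prompt_embs:                      # one launch per token (slow path)
        cache.extend(kv_project_batch([emb], w))
    return cache

def prefill_batched(prompt_embs, w):
    return kv_project_batch(prompt_embs, w)      # single launch (fast path)
```

Both paths produce an identical cache; the fast path just amortizes launch and upload overhead across the whole prompt, which is where the 3263 ms -> 275 ms prefill move below comes from.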
- Focused GPQA profile progression:
  - old clean profile: `q35-gpqa-profile-aws-clean_20260307T220200Z.json`
    - `decoder_prefill=3263.527 ms`, `decoder_ttft=3317.441 ms`
  - linear-batch profile: `q35-gpqa-profile-aws-linearbatch_20260307T235448Z.json`
    - `decoder_prefill=1341.628 ms`, `decoder_ttft=1405.739 ms`
  - full-batch profile: `q35-gpqa-profile-aws-fullbatch_20260308T000420Z.json`
    - `decoder_prefill=275.372 ms`, `decoder_ttft=1017.876 ms`
- Latest strict AB3 summary: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T000429Z.json`
  - overall:
    - runtime score 0.413195, vLLM score 0.347222
    - runtime latency 2940.172 ms, vLLM latency 1686.263 ms
  - task split:
    - gpqa_diamond: runtime 0.458333 vs vLLM 0.208333, runtime 1347.582 ms vs vLLM 512.075 ms
    - ifeval: runtime 0.368055 vs vLLM 0.486111, runtime 4532.763 ms vs vLLM 2860.452 ms
- Interpretation:
  - this older AB3 still matters because it proved prompt prefill was a real architectural blocker and fixable,
  - but it is no longer the latest strict state after the fast-sampler reruns above.
Late 2026-03-07 Update: ORPO Reload + Cache-Tier A/B
- Same-VM ORPO reload loop is now real on AWS:
  - artifact: `benchmarks/same_vm_mvp/results/samevm-orpo-reload-aws_20260307T222341Z.json`
  - path proved:
    - local ORPO job finishes,
    - adapter output is merged into a full HF model dir,
    - merged model is packed into a new monolith container,
    - a second runtime is restarted against that new container,
    - that runtime answers a real chat request.
- Important scope note:
  - this proof currently uses the Qwen2.5 ORPO demo model path, not the main Qwen3.5 strict benchmark target.
  - the harness/control-plane path is real; the remaining work is promoting the same self-improvement loop onto the main target family.
- Qwen3.5 smarter shared-prefix tiering now has a clean runtime-side A/B:
  - direct sequential GPQA profile: `q35-gpqa-profile-aws-seq2-cap112_20260307T222540Z.json`
    - second related request improved: `decoder_prefill` 2696.101 -> 2544.202 ms, `decoder_ttft` 2747.697 -> 2595.907 ms
  - clean strict seed-7 spot A/B:
    - cap112: `phase5_qwen35_remote_strict_matrix_20260307T223218Z.json`
    - cap64: `phase5_qwen35_remote_strict_matrix_20260307T223555Z.json`
  - runtime-only latency effect (112 - 64):
    - overall: -363.908 ms
    - gpqa_diamond: -420.699 ms
    - ifeval: -307.116 ms
  - quality effect on this one-seed spot:
    - overall score unchanged (0.291667 both),
    - per-task scores moved in opposite directions (gpqa down, ifeval up), so this is not a quality claim yet.
- Non-canonical artifact note:
  - `phase5_qwen35_remote_strict_matrix_20260307T222736Z.json` is contaminated and should not be cited.
  - cause: an ORPO demo runtime was still alive on port 18081 and holding GPU memory during that A/B run.
  - the clean strict comparison is `20260307T223218Z` vs `20260307T223555Z`.
- Qwen3.5 strict launcher/config drift is now corrected across the strict AWS runner and the same-VM worker/runtime launcher:
  - shared env source: `scripts/qwen_runtime_env.py`
  - clean AB3 artifact: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T231500Z.json`
  - effect vs the older clean cap64 AB3 (`20260307T225716Z`):
    - runtime overall score: 0.333334 -> 0.335648
    - runtime overall latency: 3801.258 -> 3690.124 ms
    - runtime ifeval score: 0.333333 -> 0.421296
    - runtime gpqa_diamond latency: 2857.693 -> 2753.838 ms
  - current clean paired result:
    - overall: runtime 0.335648 vs vLLM 0.291667, runtime 3690.124 ms vs vLLM 1646.672 ms
    - gpqa_diamond: runtime 0.25 vs vLLM 0.25, runtime 2753.838 ms vs vLLM 529.098 ms
    - ifeval: runtime 0.421296 vs vLLM 0.333333, runtime 4626.410 ms vs vLLM 2764.246 ms
  - interpretation: score-side evidence improved, but the remaining blocker is still long-prompt prefill latency.
  - code-level explanation:
    - `monolith/models/decoder.cu` rejects `treni_decoder_forward_f32(...)` for `ctx->is_linear_attn`,
    - Qwen3.5 linear-attention is therefore only covered by cached/token decode today,
    - so strict long-prompt Qwen3.5 prefill still runs through the token-by-token cached loop in `monolith/main.c`.
Qwen3.5 One-Host Strict Rerun + Request-Path Fixes (2026-03-07)
- New contract-validation artifacts on the active AWS host:
  - tokenizer audit: `benchmarks/qwen35_tokenizer_audit/results/qwen35-tokenizer-audit-active_20260307T173024Z.json`
  - runtime smoke: `benchmarks/qwen35_smoke/results/qwen35-runtime-smoke-active2_20260307T173132Z.json`
  - isolated semantic A/B: `benchmarks/qwen35_smoke/results/qwen35-isolated-ab-active_20260307T173228Z.json`
- New strict one-host matrix runner: `scripts/phase5_qwen35_remote_strict_matrix.py`
- New late strict one-host matrix summary after request-path fixes:
  - `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T191653Z.json`
  - `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T191653Z.md`
- Contract status:
  - the packed tokenizer/full vocab now matches HF exactly for `Qwen/Qwen3.5-0.8B` (248077 tokens),
  - the runtime extended non-thinking smoke passes 7/7 cases on the active AWS host,
  - the isolated non-thinking probe A/B is mixed but useful:
    - runtime `all_ok=true`,
    - vLLM `all_ok=false` in that probe harness because the current text-only launch rejects multimodal placeholders and the forced-thinking exact-output case still ends at `finish_reason=length`.
- Request-path changes validated before the late rerun:
  - the Qwen3.5 decoder prefix cache is now default-on (`TRENI_DECODER_PREFIX_CACHE=1`, 64 prefix tokens),
  - `timing.ttft_ms` now includes request-path pre-decode time plus decoder first-token timing, not only the decode-loop step-0 proxy,
  - a repeated prompt-family hot probe on AWS dropped from `infer_ms` ~1798.5 -> 842.4 ms and `ttft_ms` ~1531.9 -> 782.5 ms on the second related request with a logged prefix-cache hit.
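The `ttft_ms` definition change amounts to starting the clock at request send rather than at decode step 0, so template build, tokenization, and tensor upload are all counted. A client-side sketch of that measurement:

```python
import time

def timed_first_token(stream):
    """Consume an iterable of streamed tokens.

    Returns (ttft_ms, tokens): TTFT is measured from just before iteration
    starts, so all server-side pre-decode work is included.
    """
    start = time.perf_counter()
    tokens, ttft_ms = [], None
    for tok in stream:
        if ttft_ms is None:
            ttft_ms = (time.perf_counter() - start) * 1000.0
        tokens.append(tok)
    return ttft_ms, tokens
```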
- Strict one-host realbench result (`gpqa_diamond` + `ifeval`, Arm A, seeds 7/17/27, 8/task, `request_logprobs=false`):
  - overall score: runtime 0.333333 vs vLLM 0.315972
  - overall latency: runtime 3809.745 ms vs vLLM 1626.068 ms
  - task split:
    - gpqa_diamond: runtime 0.291667 vs vLLM 0.291667, runtime 2867.493 ms vs vLLM 418.173 ms
    - ifeval: runtime 0.375000 vs vLLM 0.340278, runtime 4751.996 ms vs vLLM 2833.964 ms
- Status impact:
  - Qwen3.5 compatibility is no longer the question; the tokenizer/chat/tool contract is working.
  - Score is no longer behind on this strict set.
  - The remaining blocker is request-path latency, especially benchmark-prompt prefill behavior.
Same-VM Wrapper Recovery (2026-03-07)
- The explicit same-VM AWS wrapper is now recovered and usable: `benchmarks/same_vm_mvp/results/samevm-q35-stack_20260307T172158Z.json`
- Current entrypoints:
  - `scripts/hermes_same_vm_mvp.py`
  - `scripts/run_samevm_qwen35_stack.sh`
- What changed in the harness:
  - the runtime prompt-token cap is now passed explicitly (4096) for Hermes-started Qwen3.5 runs,
  - the system prompt only advertises tools actually loaded in the session,
  - the wrapper no longer loads unrelated builtin tools by default,
  - the final wrapper response is now deterministically rewritten from tool outputs when the model emits malformed JSON-like summaries.
- End-to-end result on the recovered wrapper path:
  - runtime health: ok
  - extended smoke: PASS, 7/7 cases
  - case latencies:
    - plain_chat: 234.641 ms
    - multi_turn_memory: 422.339 ms
    - multimodal_content_items: 647.002 ms
    - tool_call_first_turn: 3080.0 ms
    - tool_followup_after_result: 2596.057 ms
    - thinking_plain_chat: 1194.161 ms
    - tool_followup_after_result_no_tool_call_id: 3339.867 ms
- Current caveat:
  - the runtime log still shows an intermittent first tool-turn retry in one observed wrapper run (`compute/ops.cu:765`, invalid argument during prefill gather). The request recovered and the smoke artifact still passed, but this retry path is not yet closed.
- New multimodal same-VM tool surface is now wired in code:
  - status/embed/rerank/tts/stt live in:
    - `scripts/samevm_multimodal_models.py`
    - `scripts/treni_local_tool_worker.py`
    - `scripts/hermes_same_vm_mvp.py`
    - `scripts/samevm_stack_probe.py`
  - defaults:
    - embedding: `Qwen/Qwen3-VL-Embedding-2B`
    - reranker: `Qwen/Qwen3-VL-Reranker-2B`
    - tts: `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice`
    - stt: `Qwen/Qwen3-ASR-0.6B`
    - whisper fallback: supported when `model` contains `whisper`
- worker-level smoke on `POST /v1/mm/status` passes and reports the AWS machine state accurately.
- runtime-admin proof on AWS is now clean: `benchmarks/same_vm_mvp/results/samevm-q35-runtime-admin-proof-v5_20260307T212852Z.json`
  - Hermes calls `samevm_runtime_status` + `samevm_multimodal_status`, and the wrapper deterministically rewrites the final summary from tool outputs if the model truncates.
- first real same-VM local-tool stack proof is complete on AWS: `benchmarks/same_vm_mvp/results/samevm-stack-probe-aws-v5.json`
  - covered in one pass: runtime status, SQLite exec/query, RAG ingest/search, TTS, Qwen ASR STT, embedding, reranking
  - observed outputs:
    - SQLite rows: 1
    - RAG top hit: `Same VM locality`
    - TTS output path: `/home/ubuntu/treni/benchmarks/same_vm_mvp/results/samevm_probe_tts.wav`
    - Qwen ASR transcript: usable but still imperfect on the synthetic voice (`Treni` was still heard as `Trinity`)
    - embedding dim: 2048
    - rerank top document: the local-inference sentence ranked first
  - current caveat:
    - timestamped STT still depends on the forced-aligner path and enough local disk to materialize that model on the AWS box
- new ORPO control-plane proof is complete: `benchmarks/same_vm_mvp/results/samevm-orpo-probe-aws_20260307T215307Z.json`
  - one local dataset write + background ORPO train + job polling cycle completed with `returncode=0`
  - hot-reload of trained output back into the runtime is still not implemented
- new operational fix:
  - the local multimodal worker was retaining about 13.3 GiB of GPU memory after model loads, which can starve the runtime and invalidate latency experiments
  - mitigation now exists via `POST /v1/mm/clear_cache`
  - the status endpoint now exposes loaded multimodal models and current CUDA allocation/reservation
Canonical Same-VM MVP (2026-03-10)
- The canonical investor-demo same-VM MVP is now green on AWS:
  - `benchmarks/same_vm_mvp/results/samevm-full-mvp-aws-v15.json`
  - `benchmarks/same_vm_mvp/results/samevm-full-mvp-aws-v15.md`
- What the canonical v15 run proves in one flow:
  - local Qwen3.5 runtime health: ok
  - local tool worker health: ok
  - Hermes runtime-status tool call: ok
  - Hermes multimodal-status tool call: ok
  - direct same-VM runtime smoke: `all_ok=True` on the basic non-thinking profile (5 cases, includes first-turn tool calling)
  - direct same-VM runtime thinking smoke: `all_ok=True` on the extended/thinking profile with exact-match checks
  - local stack probe: SQLite + RAG + embedding + reranking + TTS + Qwen ASR STT all pass
  - Qwen3.5 ORPO reload proof is available and reused from the latest successful sidecar artifact: `benchmarks/same_vm_mvp/results/samevm-orpo-reload-q35-fixed.json`
  - sidecar cleanup now stops cleanly on port 18081
  - multimodal cache clear runs at the end and returns GPU memory close to idle
- Additional post-v15 Hermes tool proofs on AWS:
  - SQLite query via Hermes: `benchmarks/same_vm_mvp/results/hermes-demo-sqlite-query-v2.json`
    - returned row count 1 from `demo_notes_v3`
  - RAG search via Hermes: `benchmarks/same_vm_mvp/results/hermes-demo-rag-search-v1.json`
    - returned a valid top result for `same machine`
  - TTS via Hermes: `benchmarks/same_vm_mvp/results/hermes-tts-v2.json`
    - generated `/home/ubuntu/treni/benchmarks/same_vm_mvp/results/hermes_tts_v2.wav`
  - STT via Hermes: `benchmarks/same_vm_mvp/results/hermes-stt-v2.json`
    - transcribed the generated WAV successfully
- Current observed v15 stack outputs:
  - SQLite rows: 1
  - RAG top hit: `Same VM locality`
  - embedding dim: 2048
  - top reranked text: `Treni keeps inference and tools on one local machine.`
  - TTS output path: `/home/ubuntu/treni/benchmarks/same_vm_mvp/results/samevm_probe_tts.wav`
  - Qwen ASR STT transcript is directionally correct but still imperfect on synthetic voice ("Trinity" drift observed in the current probe)
- Live speed snapshot on the current AWS Qwen3.5 runtime (2026-03-10): 3 deterministic runs on a 130-token response
  - mean `infer_ms`: about 1156.9
  - mean `ttft_ms`: about 98.6
  - mean end-to-end throughput: 112.37 tok/s
  - mean decode-only throughput: 121.90 tok/s
- Live current-model speed probe on AWS (2026-03-10):
  - `qwen35` (0.8B): 128 completion tokens in about 1111.4 ms, `ttft_ms` ≈ 103.1, `decode_tps` ≈ 115.37
  - `qwen35_4b` (4B): 128 completion tokens in about 3313.3 ms, `ttft_ms` ≈ 170.7, `decode_tps` ≈ 38.64
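The throughput numbers above follow the usual definitions: end-to-end tok/s divides completion tokens by total inference time, and decode-only tok/s excludes the first token and its TTFT. A small helper that reproduces the 130-token snapshot numbers under those definitions (the exact formulas used by the probe script are an assumption):

```python
def throughput(completion_tokens, infer_ms, ttft_ms):
    """Return (end-to-end tok/s, decode-only tok/s), rounded to 2 places."""
    e2e_tps = completion_tokens / (infer_ms / 1000.0)
    # Decode-only: the first token belongs to TTFT, so exclude both the
    # token and its time from the decode-rate calculation.
    decode_tps = (completion_tokens - 1) / ((infer_ms - ttft_ms) / 1000.0)
    return round(e2e_tps, 2), round(decode_tps, 2)
```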
- Real-world document caveat:
  - current same-VM RAG ingests text payloads, text files, and raw PDF paths directly
  - live worker proof now ingests `/home/ubuntu/treni/benchmarks/same_vm_mvp/data/manual-pncp-api.pdf` through `samevm_rag_ingest(paths=[...])`
- Runtime compatibility note:
  - the live runtime now accepts both `/v1/chat/completions` and `/chat/completions`
  - the live runtime now exposes both `/v1/models` and `/models`
  - Hermes can therefore target the runtime root URL directly on AWS without wrapper-specific path rewriting
- Scope note:
  - the canonical MVP acceptance gate now includes both the basic non-thinking runtime smoke lane and the extended/thinking runtime smoke lane.
  - the extended non-thinking lane now also passes on AWS (`benchmarks/qwen35_smoke/results/postmvp-extended_20260308T185130Z.json`, 7/7 cases).
  - latest passing thinking artifact: `benchmarks/qwen35_smoke/results/postmvp-extended-thinking-r4_20260308T193858Z.json`.
Clean GPQA Runtime Profile (2026-03-07)
- New direct runtime profile artifact: `benchmarks/phase5_awareness_realbench/results/q35-gpqa-profile-aws-clean_20260307T220200Z.json`
- New probe runner: `scripts/q35_gpqa_profile_once.py`
- Method:
  - restart the runtime cleanly on AWS with `TRENI_STEP0_PROFILE=1` and `TRENI_DECODE_STAGE_PROFILE=1`
  - send the same real GPQA prompt twice through the Qwen3.5 runtime API path
  - parse timing lines from the managed runtime log
- Result:
  - call 1:
    - `decoder_tensor_upload`: 218.091 ms
    - `decoder_prefill`: 3263.527 ms
    - `decoder_ttft`: 3317.441 ms
  - call 2:
    - `decoder_tensor_upload`: 11.216 ms
    - `decoder_prefix_cache_copy`: 0.162 ms
    - `decoder_prefill`: 2690.001 ms
    - `decoder_ttft`: 2750.672 ms
  - step-0 decode is not the main limiter:
    - `decoder_step0_layers`: about 8 ms
    - `decoder_step0_logits_sample`: about 33-36 ms
- Interpretation:
  - the current strict GPQA latency gap is still dominated by prefill, not tokenizer cost and not decoder step-0.
  - prefix cache helps, but only partially on this prompt family.
  - the next optimization target remains the long-prompt prefill/kernel path, not sampling logic.
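The "parse timing lines from the managed runtime log" step can be sketched with a regex over `name: X ms` lines; the exact log format is an assumption based on the metric names reported above, not a verified excerpt of the runtime log:

```python
import re

# Matches lines like "decoder_prefill: 3263.527 ms"
TIMING_RE = re.compile(r"(decoder_[a-z0-9_]+):\s*([0-9.]+)\s*ms")

def parse_timings(log_text):
    """Collect every timing metric into {name: [ms, ms, ...]} in log order."""
    out = {}
    for name, ms in TIMING_RE.findall(log_text):
        out.setdefault(name, []).append(float(ms))
    return out
```

Keeping every occurrence (rather than only the last) is what lets the probe compare call 1 against call 2 for the same metric.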
Qwen3.5 Probe Matrix + Same-VM MVP (2026-03-06)
- New tokenizer/full-vocab audit is complete: `benchmarks/qwen35_tokenizer_audit/results/runtime-q35-tokenizer-audit-r4_20260306T190418Z.json`
  - result: the packed runtime tokenizer matches HF exactly at full-vocab level for `Qwen/Qwen3.5-0.8B` (248077 tokens), with control probes like `<think>`, `<|im_start|>`, `<|vision_start|>`, and `<|image_pad|>` all aligned.
- New endpoint smoke/probe work is complete:
  - base smoke: `benchmarks/qwen35_smoke/results/runtime-q35-smoke-r2_20260306T190530Z.json`
  - consolidated matrix: `benchmarks/qwen35_smoke/results/qwen35-probe-matrix-r2_20260306T200035Z.json`
- Probe matrix summary (`profile=extended`, same cases on both backends):
  - runtime non-thinking: `all_ok=true`
  - runtime thinking: `all_ok=true`, but outputs are verbose and the tool path is very slow
  - vLLM non-thinking: `all_ok=false`
  - vLLM thinking: `all_ok=false`
- Important case-level interpretation:
  - runtime non-thinking is the strongest current functional lane for Qwen3.5:
    - plain_chat: 387.672 ms
    - multi_turn_memory: 573.434 ms
    - tool_call_first_turn: 5885.725 ms
    - tool_followup_after_result: 4406.168 ms
  - vLLM non-thinking is much faster on the tool path:
    - plain_chat: 112.543 ms
    - tool_call_first_turn: 1162.850 ms
    - tool_followup_after_result: 490.202 ms
  - vLLM failures in this matrix are concrete and expected from launch/config:
    - the multimodal placeholder case fails because the current launch is `--language-model-only`
    - several thinking/exact-output cases stop at `finish_reason=length`
- Same-VM harness status:
  - Hermes same-VM Qwen3.5 smoke succeeds: `benchmarks/same_vm_mvp/results/hermes-samevm-q35-smoke-r5_20260306T192703Z.json`
  - Hermes same-VM ORPO smoke-train succeeds and launches a real job: `benchmarks/same_vm_mvp/results/hermes-samevm-q35-orpo-smoke-r1_20260306T194152Z.json`
  - local worker ORPO run completed successfully on-host:
    - training output: `benchmarks/same_vm_mvp/trainings/samevm-orpo-qwen25-smoke3/`
- Status impact:
  - Qwen3.5 runtime is now functionally testable and smoke-clean in a real same-VM harness.
  - The main blocker is no longer “does it run?”; it is the long-prompt/tool latency gap and thinking-mode output discipline.
Phase 5 Strict Parse-Fix AB3 (2026-03-04)
- New paired AB3 summary (`gpqa_diamond` + `ifeval`, Arm A, seeds 7/17/27, 16/task, `request_logprobs=false`) is published:
  - `benchmarks/phase5_awareness_realbench/results/phase5_q35_runtime_vs_vllm_ifeval_gpqa_ab3_20260304T034227Z.json`
  - `benchmarks/phase5_awareness_realbench/results/phase5_q35_runtime_vs_vllm_ifeval_gpqa_ab3_20260304T034227Z.md`
- Outcome:
  - Overall score: runtime 0.3403 vs vLLM 0.3229 (small runtime edge, CI includes parity).
  - Overall latency: runtime 1772.931 ms vs vLLM 1553.034 ms (runtime slower on aggregate due to GPQA).
  - Task-family split:
    - gpqa_diamond: score parity, runtime latency deficit remains large.
    - ifeval: runtime is both faster and slightly higher-scoring.
- Status impact:
  - the strict matrix is now better framed as task-family stratified, not universal runtime superiority yet.
Phase 5 Real-Benchmark Update (2026-03-01)
- Canonical diagnostic run is now complete on the active G5 host: `phase5_awareness_realbench_qwen-realbench-r5-tokenizerfix2_20260301T114510Z.json`
- Runtime fixes validated before this run:
  - full message aggregation in the HTTP path (`system` + `user`, not just the last message),
  - prompt cap default increased (32 -> 256),
  - tokenizer BPE merges + `added_tokens` loading + improved pretokenization/decode behavior.
- r5 key outcomes (max-samples-per-task=8):
  - gpqa_diamond: A=0.500, B=0.500, C=0.375
  - ifeval: A=0.5625, B=0.5625, C=0.5625
  - gsm8k: A/B/C=0.0
  - aime25: A/B/C=0.0
- Qwen-template auto mode A/B (r6: `phase5_awareness_realbench_qwen-realbench-r6-qwentpl1_20260301T120235Z.json`) regressed quality and latency vs r5, so this mode is kept opt-in-only (env-controlled) and not canonical.
- HF-reference parity run on the same sampled set is now complete: `phase5_hf_reference_qwen_r5_20260301T1900Z.json`
  - score deltas (HF minus runtime Arm A): gpqa -0.25, ifeval +0.0625, gsm8k 0.0, aime25 0.0
  - key claim-safe interpretation: the GSM8K/AIME 0.0 is not runtime-only breakage in this setup (the HF control is also 0.0).
- Current status:
  - first real-data set is run and documented,
  - claim-safe parity interpretation is now locked for this sampled set,
  - next open work is raising the math-task quality floor (prompt/eval/model-task fit), not proving runtime-vs-HF parity existence.
Phase 5 + Qwen05 Follow-up (2026-03-02)
- qwen05 deterministic empty-completion parity gap is now resolved in the runtime:
  - root cause fixed in the HTTP chat-template build (inject a default Qwen system preamble when no `system` message is present),
  - validation artifact (runtime + vLLM): `benchmarks/phase2_external_cold/results/external_cold_qwen05_templatefix_20260302T154019Z.json`
  - same harness rerun (`--no-vllm-ignore-eos`): `benchmarks/phase2_external_cold/results/external_cold_qwen05_templatefix_nofixeos_20260302T154151Z.json`
  - key signal: the runtime now returns non-empty output (`usage_completion_tokens=3`, `completion_chars=241`) instead of a token-0 stop.
- qwen05 Phase 5 diagnostic rerun completed after the parity fix:
  - `benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen05-realbench-r2-templatefix1_20260302T154443Z.json`
  - quality remains low on this small model/sample (`A/B/C` all `0.0` across tasks in this run), so this is a correctness fix, not a quality win.
- canonical `qwen` rerun with matched depth and sample count is now complete:
  - new run (`layers=36`, `max_samples_per_task=8`): `benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen-realbench-r9-templatefix1-l36s8_20260302T161123Z.json`
  - prior canonical reference: `benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen-realbench-r5-tokenizerfix2_20260301T114510Z.json`
  - `r9` outcomes:
    - `gpqa_diamond`: `A/B/C = 0.125 / 0.125 / 0.125`
    - `ifeval`: `A/B/C = 0.500 / 0.5625 / 0.5625`
    - `gsm8k`: `A/B/C = 0.625 / 0.625 / 0.750`
    - `aime25`: `A/B/C = 0.000 / 0.000 / 0.125`
  - overall awareness deltas vs A: `B +0.015625`, `C +0.078125`
- interpretation:
  - the math floor improved materially (`gsm8k`, `aime25` Arm C) vs `r5`,
  - GPQA dropped vs `r5`, so the current claim stays mixed by task family (not a universal quality uplift yet).
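The qwen05 fix above injects a default system preamble when the request carries none. A minimal sketch of that branch (the `default_system` text here is a placeholder assumption, not the runtime's actual Qwen preamble):

```python
def build_chat_messages(messages, default_system="You are a helpful assistant."):
    """Prepend a default system message when the request has none.

    `default_system` is a placeholder; the runtime's actual Qwen preamble
    text lives in its chat-template build, not here. Without this branch,
    a system-less request can render a template the model treats as
    complete, producing the deterministic token-0 stop described above.
    """
    if any(m.get("role") == "system" for m in messages):
        return list(messages)  # caller already supplied a system message
    return [{"role": "system", "content": default_system}] + list(messages)
```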
Phase 5 + Qwen3.5 Nightly vLLM Follow-up (2026-03-02)
- vLLM main/nightly path is now validated for Qwen3.5 on AWS G5:
  - env: `.venv-vllm-nightly-q35`
  - server: `vllm 0.16.1rc1.dev...`
  - endpoint: `http://127.0.0.1:18081/v1/*`
- Infra issue resolved during setup:
  - the root filesystem hit `100%`, causing a Python/vLLM tempdir failure.
  - cleaned caches/old envs, restored ~21 GB free, and launched with an explicit `TMPDIR`.
- Qwen3.5 diagnostic run set:
  - baseline sampled run: `phase5_awareness_realbench_qwen35-realbench-r1-s8-nonthinking_20260302T184159Z.json`
  - conservative retry/vote policy probe: `phase5_awareness_realbench_qwen35-realbench-r2-policyfix1-s8-nonthinking_20260302T184624Z.json`
  - fairness-fixed canonical probe (shared-first across arms): `phase5_awareness_realbench_qwen35-realbench-r3-sharedfirst-s8-nonthinking_20260302T184947Z.json`
- Strict canonical runtime-vs-vLLM matrix is now complete (2026-03-02) with the strict inference guard enabled:
  - runtime strict mode (`TRENI_HTTP_REQUIRE_INFERENCE=1`) hard-fails invalid inference paths (`502 {"error":"inference_required"}`),
  - matrix runner: `scripts/phase5_qwen35_runtime_vs_vllm_matrix.py`,
  - canonical artifacts:
    - `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_runtime_vs_vllm_matrix_20260302T221546Z.json`
    - `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_runtime_vs_vllm_matrix_20260302T222013Z.json`
  - canonical outcome (`20260302T222013Z`, Arm A):
    - score: runtime `0.0503` vs vLLM `0.2170` (delta `-0.1667`)
    - latency: runtime `1881.188 ms` vs vLLM `178.093 ms` (delta `+1703.095 ms`)
  - interpretation: the Qwen3.5 strict matrix is no longer blocked; it is now a negative-result benchmark that defines the next optimization target.
- Post-fix rerun (`qnorm-check1`, 2026-03-02) after wiring decoder Q/K head RMS-norm:
  - artifact: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_runtime_vs_vllm_matrix_20260302T225529Z.json`
  - result remained negative (`rt_score=0.0000`, `vllm_score=0.0625`; runtime latency `1880.622 ms` vs `187.453 ms`)
- Decoder full-attn `q_proj` gate-layout parity fix landed (2026-03-03), and the strict matrix was rerun in Arm A-only backend mode (`--phase5-arms arm_a_control`) to remove retry/vote-path contamination:
  - artifact: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_runtime_vs_vllm_matrix_20260303T104038Z.json`
  - overall Arm A score: runtime `0.15625` vs vLLM `0.19097` (delta `-0.03472`, CI includes near-parity)
  - overall Arm A latency: runtime `1723.685 ms` vs vLLM `958.757 ms` (delta `+764.928 ms`)
  - the quality gap is materially narrower than `20260302T222013Z`, but runtime is still slower overall and still behind on aggregate score.
- `r3` result snapshot:
  - `gpqa_diamond`: `A/B/C = 0.375 / 0.375 / 0.375`
  - `ifeval`: `A/B/C = 0.3125 / 0.3125 / 0.3125`
  - `gsm8k`: `A/B/C = 0.0 / 0.0 / 0.0`
  - `aime25`: `A/B/C = 0.0 / 0.0 / 0.0`
  - awareness deltas: all `B-A=0.0`, `C-A=0.0` (no down, no up).
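A client-side sketch of honoring the strict guard contract described above (the status/JSON shape follows the documented `502 {"error":"inference_required"}` behavior; the wrapper function itself is hypothetical):

```python
import json

def check_strict_response(status_code, body_text):
    """Hard-fail when the runtime's strict inference guard rejects a call.

    With TRENI_HTTP_REQUIRE_INFERENCE=1 the runtime answers 502 with
    {"error": "inference_required"} instead of silently serving a
    non-inference path; a benchmark client must treat that as a fatal
    error, never as a scored (empty) completion.
    """
    if status_code == 502:
        try:
            payload = json.loads(body_text)
        except ValueError:
            payload = {}
        if payload.get("error") == "inference_required":
            raise RuntimeError("strict guard tripped: inference_required")
    if status_code != 200:
        raise RuntimeError(f"unexpected status {status_code}")
    return json.loads(body_text)
```

This is why the co-located contamination run below is marked invalid: every call there hit the strict 502 path, so no scored output was real inference.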
Phase 5 Paper-Mode Debug (2026-03-03)
- Harness bug fix is now applied:
  - in `paper` mode, retry now commits the refined output directly (paper semantics) instead of confidence-margin replacement filtering.
  - code: `/Users/andrewcorrea/treni/scripts/phase5_awareness_realbench.py`
- Live sanity outcomes after the fix:
  - vLLM sanity (`gpqa_diamond`, `ifeval`, `8` samples/task): `phase5_awareness_realbench_qwen35-paperfix-sanity1_20260303T201156Z.json`
    - outcome: overall `B-A=0.0` (`gpqa +0.125`, `ifeval -0.125`), latency up due to retries.
  - runtime sanity on isolated GPU: `phase5_awareness_realbench_qwen35-paperfix-sanity2-runtime_20260303T201744Z.json`
    - outcome: overall `B-A=-0.125`, retry rate `100%`, large latency penalty.
- Important contamination note:
  - `phase5_awareness_realbench_qwen35-paperfix-sanity1-runtime_20260303T201620Z.json` is invalid for performance interpretation (vLLM and runtime were co-located; runtime OOM; strict `502 inference_required` on all calls).
- Calibration sweep result (runtime):
  - `ppl 1.4/1.8/2.2` produced identical outcomes with full retry (`max_entropy` dominated the trigger).
  - entropy threshold `7.0` reduced retry volume (`16 -> 9`) but still gave no score uplift.
  - artifacts:
    - `phase5_awareness_realbench_qwen35-paperfix-runtime-sweep-p1_4_20260303T202135Z.json`
    - `phase5_awareness_realbench_qwen35-paperfix-runtime-sweep-p1_8_20260303T202255Z.json`
    - `phase5_awareness_realbench_qwen35-paperfix-runtime-sweep-p2_2_20260303T202415Z.json`
    - `phase5_awareness_realbench_qwen35-paperfix-runtime-sweep-ent7_20260303T202617Z.json`
- Summary-mode calibration fix is now implemented in the harness:
  - summary uncertainty detection now uses `uncertainty_source=runtime_summary`,
  - the paper trigger uses a guarded summary vote rule (`paper_summary_max_entropy_threshold`, `paper_summary_confidence_threshold`, `paper_summary_min_votes`).
  - code: `/Users/andrewcorrea/treni/scripts/phase5_awareness_realbench.py`
- Post-fix runtime sanity (`8`/task): `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-sanity2_20260303T204120Z.json`
  - result: retries `9/16` (down from `16/16`) and quality recovered to parity (overall `B-A=0.0`), but latency overhead remains high (`~+1386 ms`).
- Post-fix confidence sweep (`8`/task):
  - `conf=0.40`: `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf0_40_20260303T204257Z.json`
  - `conf=0.45`: `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf0_45_20260303T204357Z.json`
  - `conf=0.50`: `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf0_50_20260303T204500Z.json`
  - `conf=0.55`: `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf0_55_20260303T204602Z.json`
  - all four remained at overall `B-A=0.0` (latency deltas `~+1252` to `+1386 ms`, retry rate `~0.50` to `0.5625`).
- Higher-N check (`32`/task, conf `0.45`): `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf045-s32_20260303T204751Z.json`
  - result: `gpqa +0.03125`, `ifeval -0.0625`, overall `B-A=-0.015626`.
- Task-aware follow-up (summary-mode retries disabled for IFEval) produced the first positive repeatable signal on this track:
  - larger run (`32`/task): `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf045-s32-ifevaloff_20260303T222841Z.json`
    - overall `B-A=+0.015624` (`arm_a 0.273438 -> arm_b 0.289062`)
    - latency delta `+618.068 ms`
    - per-task deltas: `gpqa +0.03125`, `ifeval +0.0`
  - 3-seed repeatability (`16`/task, `s7/s17/s27`):
    - `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-ifevaloff-rpt-s7_20260303T223228Z.json`
    - `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-ifevaloff-rpt-s17_20260303T223410Z.json`
    - `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-ifevaloff-rpt-s27_20260303T223541Z.json`
    - overall `B-A` mean `+0.020833` (range `0.0` to `+0.03125`)
    - mean latency delta `+712.276 ms`
    - retries occurred only on GPQA under this policy (IFEval `retries=0`).
- Late optimization pass (2026-03-03): compact invalid-parse retry prompt + confidence-gated invalid-parse retries (`--invalid-parse-retry-confidence-max`).
  - code: `/Users/andrewcorrea/treni/scripts/phase5_awareness_realbench.py`
  - best tradeoff policy on this host so far: `invalid_parse_retry_confidence_max=0.73` with `paper_summary_disable_ifeval_retry=true`.
  - 3-seed (`16`/task) artifacts:
    - `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-compact-invmax073-s16_20260303T232029Z.json`
    - `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-compact-invmax073-rpt-s17_20260303T232254Z.json`
    - `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-compact-invmax073-rpt-s27_20260303T232516Z.json`
  - result vs the prior `...ifevaloff-rpt-s{7,17,27}` baseline:
    - quality preserved (overall `B-A` mean: `+0.020833 -> +0.020833`),
    - latency overhead reduced (`+712.276 ms -> +404.603 ms`),
    - GPQA retry rate reduced (`0.5833 -> 0.2917`).
  - `32`/task confirmation (`s7`): `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-compact-invmax073-s32_20260303T232755Z.json`
    - same quality delta as the prior `s32` policy (overall `B-A=+0.015624`) with lower latency overhead (`+618.068 ms -> +326.187 ms`).
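The task-aware, confidence-gated retry policy above can be sketched as a single gate function (this is an assumed reading of `--invalid-parse-retry-confidence-max` and `paper_summary_disable_ifeval_retry`, not the harness's exact code):

```python
def should_retry(task, parsed_ok, confidence,
                 confidence_max=0.73, disable_ifeval_retry=True):
    """Decide whether to issue one compact-prompt retry.

    Task-aware policy from above: summary-mode retries stay off for
    IFEval. Invalid-parse retries are confidence-gated (assumption:
    only retry when the first pass was both unparseable and below the
    confidence cap; 0.73 was the best tradeoff on that host).
    """
    if disable_ifeval_retry and task == "ifeval":
        return False  # IFEval retries hurt more than they helped
    return (not parsed_ok) and (confidence < confidence_max)
```

The point of the gate is the latency column in the numbers above: confident-but-unparseable outputs are kept as-is, which is what cut the GPQA retry rate from `0.5833` to `0.2917` without losing the quality delta.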
Phase 5 Paper-Loop Alignment (2026-03-02 Late)
- Reference paper code is now local in this workspace: `third_party/weave-logprobs-reasoning-loop`
- Phase 5 harness now uses paper-aligned uncertainty triggering:
  - `--awareness-trigger-mode paper|confidence|hybrid` (default: `paper`)
  - paper trigger = any of:
    - `perplexity > trigger_perplexity_threshold` (default `1.4`)
    - `max_entropy > trigger_max_entropy_threshold` (default `1.5`)
    - `low_confidence_tokens >= trigger_low_confidence_tokens` (default `3`)
- Retry/refinement prompts now carry first-pass uncertainty summary (top uncertain token positions + alternatives).
- Artifacts now store per-call loop trace with uncertainty metrics/tables for case-level debugging.
- End-to-end smoke run completed on AWS Qwen3.5 nightly with paper mode:
  - `benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen35-paper-smoke_20260302T191420Z.json`
  - validation signal: the paper trigger fired with explicit reason fields (`paper_reasons`) and per-call uncertainty traces in the output.
- Full `r4` run is now complete on the same config/sampling envelope:
  - `benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen35-realbench-r4-paper-s8-nonthinking_20260302T191642Z.json`
  - `r4` snapshot:
    - `gpqa_diamond`: `A/B/C = 0.375 / 0.375 / 0.375` (unchanged vs `r3`)
    - `ifeval`: `A/B/C = 0.625 / 0.4375 / 0.625` (baseline and Arm C up, Arm B down)
    - `gsm8k`: `A/B/C = 0.0 / 0.0 / 0.0` (unchanged)
    - `aime25`: `A/B/C = 0.0 / 0.0 / 0.0` (unchanged)
  - overall deltas vs Arm A: `B -0.046875`, `C 0.0`; both with higher latency from retries.
- Interpretation:
- paper-mode trigger path is functionally integrated and reproducible,
- current default thresholds are too eager for this setup and do not yet produce net quality uplift on Qwen3.5.
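The paper-aligned any-of trigger rule above can be sketched directly from per-token logprob telemetry (the `0.5` low-confidence probability cutoff is an assumption; the three thresholds are the defaults listed above):

```python
import math

def paper_trigger_reasons(token_logprobs, token_entropies,
                          ppl_threshold=1.4,
                          max_entropy_threshold=1.5,
                          low_conf_min_tokens=3,
                          low_conf_prob=0.5):
    """Paper-aligned trigger: retry if ANY of the three signals fires.

    token_logprobs: logprobs of the sampled tokens.
    token_entropies: per-token distribution entropies (nats), assumed to
    come from the runtime's logprob summary. Returns the list of fired
    reasons, mirroring the `paper_reasons` artifact field.
    """
    reasons = []
    # Perplexity of the sampled sequence: exp of mean negative logprob.
    ppl = math.exp(-sum(token_logprobs) / len(token_logprobs))
    if ppl > ppl_threshold:
        reasons.append("perplexity")
    if max(token_entropies) > max_entropy_threshold:
        reasons.append("max_entropy")
    # Count tokens whose sampled probability fell below the cutoff.
    low_conf = sum(1 for lp in token_logprobs if math.exp(lp) < low_conf_prob)
    if low_conf >= low_conf_min_tokens:
        reasons.append("low_confidence_tokens")
    return reasons
```

With the default thresholds, an average per-token logprob below about `-0.336` (ppl > 1.4) already fires the perplexity arm, which is consistent with the "too eager" reading above.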
Phase 5 Adaptive Uncertainty Fix (2026-03-02 Late 2)
- Harness fix landed: adaptive uncertainty mode now uses rolling per-task uncertainty history (`perplexity`, `max_entropy`, `low_conf_ratio`) with robust thresholds.
  - script: `scripts/phase5_awareness_realbench.py`
  - new mode/default: `--awareness-trigger-mode adaptive`
- Full rerun (`r5`, adaptive default): `benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen35-realbench-r5-adaptive-s8-nonthinking_20260302T202105Z.json`
  - vs `r4` paper mode:
    - `B-A`: `-0.046875 -> -0.015625` (improved),
    - `C-A`: `0.0 -> 0.0` (kept parity),
  - latency deltas reduced:
    - Arm B: `+904 ms -> +536 ms`
    - Arm C: `+1427 ms -> +623 ms`
- Stricter adaptive variant (`r6`) was tested: `benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen35-realbench-r6-adaptive-strict-s8-nonthinking_20260302T202314Z.json`
  - result: Arm B reached parity (`B-A=0.0`) but Arm C regressed (`C-A=-0.03125`) and latency worsened vs `r5`.
- Decision: keep the adaptive default settings from `r5` as the current best policy for this setup.
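The adaptive mode above replaces fixed thresholds with rolling per-task history. A sketch of one plausible robust-threshold rule (median + k·MAD is an assumption; the harness's exact statistic is not stated here):

```python
import statistics
from collections import defaultdict, deque

class AdaptiveTrigger:
    """Rolling per-task uncertainty history with robust thresholds.

    Tracks (perplexity, max_entropy, low_conf_ratio) per task and flags
    a call as uncertain when any metric sits well above that task's
    rolling median. The median + k*MAD rule (k=3) is an assumed choice,
    not the harness's exact statistic.
    """

    def __init__(self, window=32, k=3.0):
        self.k = k
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, task, metrics):
        # metrics: (perplexity, max_entropy, low_conf_ratio)
        self.history[task].append(metrics)

    def should_retry(self, task, metrics):
        hist = self.history[task]
        if len(hist) < 4:
            return False  # too little history: stay conservative
        for i, value in enumerate(metrics):
            series = [h[i] for h in hist]
            med = statistics.median(series)
            mad = statistics.median([abs(x - med) for x in series])
            if value > med + self.k * max(mad, 1e-6):
                return True
        return False
```

The design point matches the `r4 -> r5` result above: a per-task baseline retries only genuine outliers for that task, which is how retry volume (and hence the latency deltas) came down without giving back the quality parity.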
Decision Update (2026-02-28 Late)
- `TRENI_LINEAR_U16_FAST_COMPUTE` has now been rerun with higher-confidence repeats and is promoted default-on.
- Validation pack:
  - warm+mixed AB5: `benchmarks/phase2_runtime/results/aws_speedpass/linearfast_ab5_20260228T124736Z/summary_ab5.json`
    - warm `on-off`: request `-0.139 ms`, p95 `-0.128 ms`, p99 `-0.009 ms`
    - mixed `on-off`: request `-0.139 ms`, p95 `-0.156 ms`, p99 `-0.208 ms`
  - cold AB3: `benchmarks/phase2_runtime/results/aws_speedpass/linearfast_cold_ab3_20260228T124510Z/summary_ab3.json`
    - `full +0.302 ms`, `TTFT -0.019 ms`, startup `-4.207 ms` (near-flat on cold full, positive on startup/TTFT)
  - strict parity: `benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_report_linearfast_20260228T124557Z.json` (`checked=3`, `failed=0`)
  - post-default strict parity smoke: `benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_report_post_linearfast_default_20260228T125804Z.json` (`checked=3`, `failed=0`)
- Runtime parser default is now `TRENI_LINEAR_U16_FAST_COMPUTE=1` (override to `0` for a strict fallback A/B).
- Same-window sanity A/B after promotion (`linearfast_default_sanity_20260228T125957Z`) confirms default-on behavior is directionally better than forced-off on the mixed request path:
  - `default - force_off`: mean `-0.603 ms`, p95 `-0.984 ms`, p99 `+0.029 ms`.
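The `on-off` deltas reported throughout these packs reduce to per-mode summary stats subtracted between arms. A minimal sketch (nearest-rank percentile is an assumption; the packs' exact method may differ):

```python
def percentile(values, pct):
    """Nearest-rank percentile over a list of latencies in ms.

    This is one simple convention; the benchmark packs' exact
    percentile method is not specified in the notes above.
    """
    ordered = sorted(values)
    idx = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[idx]

def ab_delta(on_ms, off_ms):
    """on-off deltas for mean/p95/p99, as reported in the AB packs.

    Negative values mean the 'on' arm was faster.
    """
    def stats(xs):
        return {"mean": sum(xs) / len(xs),
                "p95": percentile(xs, 95),
                "p99": percentile(xs, 99)}
    s_on, s_off = stats(on_ms), stats(off_ms)
    return {k: s_on[k] - s_off[k] for k in s_on}
```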
Rerun Update (2026-02-28 Late 2)
- Fresh canonical foundation rerun on the new default is now published:
  - pack root: `benchmarks/phase2_runtime/results/aws_speedpass/foundation_linearfastdefault_pack_20260228T134157Z`
  - summary: `benchmarks/phase2_runtime/results/aws_speedpass/foundation_linearfastdefault_pack_20260228T134157Z/summary_ab3.json`
- Versus the prior parser-default foundation pack (`20260228T114315Z`):
  - warm AB3: near-flat/slightly slower (`request +0.101 ms`, `p95 +0.326 ms`, `p99 +0.208 ms`)
  - cold AB3: near-flat/slightly slower (`full +0.491 ms`, `infer +0.530 ms`, `TTFT +0.002 ms`)
  - mixed AB3: improved (`request -0.629 ms`, `p95 -1.281 ms`, `p99 -0.163 ms`)
- Same-window runtime-vLLM full-depth AB3 was rerun on this updated canonical lane:
  - run set root: `benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_linearfastdefault_ab3_20260228T134630Z`
  - summary: `benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_linearfastdefault_ab3_20260228T134630Z/summary_ab3.json`
  - averages:
    - runtime first-request full: `1185.186 ms`
    - vLLM first-request full: `1305.971 ms`
    - `vLLM/runtime` full ratio: `1.102x` (runtime faster on this run set)
    - `vLLM/runtime` cold-total-first-response ratio: `5.807x`
    - `vLLM/runtime` cold-total-first-token ratio: `7.648x`
- Batched2-Lt fast-fallback short-circuit experiment (skip Lt gate/state/timing work when Lt is disabled) was tested and reverted:
  - isolation A/B (`fastfallback_isolation_ab3_20260228T140122Z/summary_ab3.json`) using `on` (short-circuit) vs `off` (forced old path) showed:
    - warm `on-off`: request `+1.155 ms`, p95 `+2.124 ms`, p99 `+1.504 ms` (regression)
    - cold `on-off`: full `-0.846 ms` (improvement)
    - mixed `on-off`: mean `+0.144 ms`, p95 `+0.569 ms`, p99 `-0.221 ms` (mixed/slightly worse overall)
  - decision: keep reverted (not canonical).
  - post-revert strict parity passed: `week3_parity_report_post_fastfallback_revert_20260228T140626Z.json`.
Decision Update (2026-02-28 Late 3)
- `TRENI_TENSOR_H2D_CHUNK_MB` default is now promoted from `64` to `0` (no chunking) on this canonical profile.
- AB3 evidence:
  - cold AB3 (`h2d_chunk_cold_ab3_20260228T142114Z/summary_ab3.json`), `chunk0 - chunk64`:
    - startup `-4.022 ms`, full `-2.562 ms`, infer `-2.542 ms`, TTFT `-0.060 ms`
    - `decoder_tensor_h2d -3.347 ms`, `decoder_tensor_upload -3.222 ms`
  - warm+mixed AB3 (`h2d_chunk_warm_mixed_ab3_20260228T142258Z/summary_ab3.json`), `chunk0 - chunk64`:
    - warm: mean `-0.442 ms`, p95 `-0.697 ms`, p99 `-0.966 ms`
    - mixed: mean `-0.044 ms`, p95 `-0.368 ms`, p99 `-0.279 ms`
- Post-promotion strict parity passed: `week3_parity_report_h2dchunk0_default_20260228T142805Z.json` (`checked=3`, `failed=0`)
- Single-run sanity (`h2d_chunk_default_vs64_sanity_20260228T142845Z`) showed small mixed sensitivity (`default - force64` mean `+0.340 ms`), so this lane should be kept under repeatability watch in future packs.
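Why promoting `TRENI_TENSOR_H2D_CHUNK_MB` from `64` to `0` can help: chunking multiplies fixed per-call overhead while the bytes moved stay constant. A toy cost model with illustrative (not measured) constants:

```python
def upload_cost_ms(total_mb, chunk_mb, per_call_overhead_ms=0.05,
                   bandwidth_gb_s=12.0):
    """Toy host-to-device upload cost model.

    chunk_mb == 0 means a single unchunked copy (the new default).
    per_call_overhead_ms and bandwidth_gb_s are illustrative values
    only, not measurements from the packs above; the transfer term is
    identical in both modes, so only the per-call term differs.
    """
    calls = 1 if chunk_mb <= 0 else -(-total_mb // chunk_mb)  # ceil div
    transfer_ms = total_mb / (bandwidth_gb_s * 1024) * 1000.0
    return calls * per_call_overhead_ms + transfer_ms
```

Under this model a 1 GB upload in 64 MB chunks pays 16 call overheads instead of 1, consistent in direction (not magnitude) with the `-3.2` to `-3.3 ms` `decoder_tensor_*` deltas above.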
Decision Update (2026-02-28 Late 4)
- Higher-N same-window runtime-vLLM full-depth rerun is now complete on the updated defaults (AB5):
  - run root: `benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z`
  - summary: `benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z/summary_ab5.json`
  - summary markdown: `benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z/summary_ab5.md`
- AB5 means (runtime vs vLLM):
  - first-request full: `1184.812 ms` vs `1318.675 ms` (`vLLM/runtime=1.113x`)
  - TTFT: `14.640 ms` vs `50.309 ms` (`vLLM/runtime=3.436x`)
  - cold-total first response: `4190.848 ms` vs `24350.818 ms` (`vLLM/runtime=5.810x`)
- Comparison vs the prior same-window AB3 (`...linearfastdefault_ab3_20260228T134630Z`) is published:
  - `benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z/compare_vs_prev_linearfastdefault_ab3.json`
  - runtime full mean improved slightly (`1185.186 -> 1184.812 ms`, `-0.375 ms`).
  - the full-latency ratio improved (`1.102x -> 1.113x`), while the TTFT ratio narrowed because vLLM TTFT was lower in this run window.
- Interpretation:
  - the request-path win vs vLLM remains stable at higher N under claim-safe fixed-token settings.
  - the remaining active Track A work is still deeper custom layer-compute reduction (`decoder_stepN_layers` / FFN-heavy path), not re-establishing baseline direction.
Decision Update (2026-02-28 Late 5)
- Full-depth gate sweep on top of current defaults is now complete:
  - gate root: `benchmarks/phase2_runtime/results/aws_speedpass/fulldepth_gate_newdefaults_20260228T150709Z`
  - gate summary: `benchmarks/phase2_runtime/results/aws_speedpass/fulldepth_gate_newdefaults_20260228T150709Z/summary_gate_ab2.json`
- AB2 gate outcomes:
  - delayed-Lt (`TRENI_LINEAR_BATCHED2_USE_LT=1`, `TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS=10000`) was directionally positive in both modes:
    - warm `on-off`: request `-0.384 ms`, infer `-0.343 ms`, p99 `-0.719 ms`
    - mixed `on-off`: request `-0.256 ms`, infer `-0.200 ms`, p99 `-0.279 ms`
  - FFN `proj_fast` (`TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1`) remained mixed/noise:
    - warm `on-off`: request `-0.096 ms`, infer `-0.082 ms`, p99 `+0.129 ms`
    - mixed `on-off`: request `-0.327 ms`, infer `-0.207 ms`, p99 `+0.022 ms`
  - decision at gate stage: only delayed-Lt advanced to AB3 confirmation.
- Delayed-Lt AB3 confirmation is complete:
  - run root: `benchmarks/phase2_runtime/results/aws_speedpass/fulldepth_delayedlt_ab3_20260228T151322Z`
  - summary: `benchmarks/phase2_runtime/results/aws_speedpass/fulldepth_delayedlt_ab3_20260228T151322Z/summary_ab3.json`
  - warm `on-off`: request `-0.330 ms`, infer `-0.270 ms`, p99 `-0.098 ms`
  - mixed `on-off`: request `+0.173 ms`, infer `+0.191 ms`, p99 `+0.291 ms`
- Decision:
  - keep delayed-Lt non-canonical on defaults (`TRENI_LINEAR_BATCHED2_USE_LT=0`, `TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS=0`).
  - at this stage, keep `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=0` as canonical (later temporarily promoted in `Decision Update (2026-02-28 Late 8)`, then rejected in `Decision Update (2026-02-28 Late 9)` and restored to canonical off).
  - the next custom-kernel focus remains structural layer-compute reduction (`decoder_stepN_layers` / FFN-heavy path), not env-toggle promotion.
Decision Update (2026-02-28 Late 6)
- Tuned delayed-Lt slow-gate rescue probe is complete:
  - run root: `benchmarks/phase2_runtime/results/aws_speedpass/delayedlt_tunedslow_ab2_20260228T152358Z`
  - summary: `benchmarks/phase2_runtime/results/aws_speedpass/delayedlt_tunedslow_ab2_20260228T152358Z/summary_gate_ab2.json`
- Tuned `on` config:
  - `TRENI_LINEAR_BATCHED2_USE_LT=1`
  - `TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS=10000`
  - `TRENI_LINEAR_BATCHED2_LT_SLOW_RATIO_PCT=0`
  - `TRENI_LINEAR_BATCHED2_LT_SLOW_STREAK_DISABLE=4`
- AB2 deltas (`on-off`):
  - warm: request `-0.185 ms`, infer `-0.054 ms`, TTFT `+0.016 ms`, p99 `-0.417 ms`
  - mixed: request `-0.004 ms`, infer `-0.032 ms`, TTFT `-0.011 ms`, p99 `+0.221 ms`
- Decision: the tuned policy is still non-promotable (mixed mean is near zero and mixed p99 regresses), so delayed-Lt remains non-canonical on defaults.
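The delayed-Lt policy above combines a time gate with a slow-streak auto-disable. A sketch of that state machine (the exact slow-call criterion wired through `TRENI_LINEAR_BATCHED2_LT_SLOW_*` is an assumed reading, not the runtime's code):

```python
import time

class DelayedLtGate:
    """Time-gated feature enable with slow-streak auto-disable.

    Keeps the Lt path off for the first `enable_after_ms` of process
    lifetime, then permanently disables it again after
    `slow_streak_disable` consecutive slow calls. The caller decides
    what counts as "slow" (assumed to be a ratio vs the fallback path).
    """

    def __init__(self, enable_after_ms=10000, slow_streak_disable=4,
                 now_ms=None):
        self.start_ms = time.monotonic() * 1000 if now_ms is None else now_ms
        self.enable_after_ms = enable_after_ms
        self.slow_streak_disable = slow_streak_disable
        self.slow_streak = 0
        self.disabled = False

    def use_lt(self, now_ms):
        if self.disabled:
            return False  # tripped the slow-streak breaker: stay off
        return (now_ms - self.start_ms) >= self.enable_after_ms

    def record_call(self, was_slow):
        self.slow_streak = self.slow_streak + 1 if was_slow else 0
        if self.slow_streak >= self.slow_streak_disable:
            self.disabled = True
```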
Decision Update (2026-02-28 Late 7)
- FFN proj batched2 `f32_input` fallback-path patch is now validated:
  - code change: `/Users/andrewcorrea/treni/monolith/models/linear.cu`
  - behavior: cache unsupported mixed-input batched2 GEMM combos and short-circuit repeated failing calls.
- Forced-Lt diagnostic (same profile/settings) before vs after the patch:
  - before: `benchmarks/phase2_runtime/results/aws_speedpass/ffnproj_f32input_ltalways_20260228T153113Z.json`
  - after: `benchmarks/phase2_runtime/results/aws_speedpass/ffnproj_f32input_ltalways_patch_20260228T154942Z.json`
  - delta:
    - request mean `175.208 -> 173.124 ms` (`-2.084 ms`)
    - p99 `206.780 -> 204.405 ms` (`-2.375 ms`)
    - `linear_batched2_lt_failures` `26112 -> 1` (repeated failure loop removed)
- Canonical AB2 re-gate for `TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=0/1` after the patch:
  - root: `benchmarks/phase2_runtime/results/aws_speedpass/ffnproj_f32input_gate_patch_ab2_20260228T155033Z`
  - summary: `benchmarks/phase2_runtime/results/aws_speedpass/ffnproj_f32input_gate_patch_ab2_20260228T155033Z/summary_gate_ab2.json`
  - deltas (`on-off`):
    - warm: request `+0.026 ms`, infer `-0.060 ms`, p99 `+0.099 ms`
    - mixed: request `+0.057 ms`, infer `+0.028 ms`, p99 `+0.446 ms`
- Decision:
  - keep `TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=0` canonical (still non-promotable on the default path).
  - keep the fallback-path patch (it removes pathological repeated-failure overhead and improves robustness in forced-Lt/stress configurations).
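The fallback-path patch above amounts to a combo-keyed fail cache. A sketch of the idea (a Python stand-in for the `linear.cu` change; `try_lt` is a hypothetical placeholder for the real cuBLASLt call):

```python
class GemmFallback:
    """Combo-keyed fail cache for an unsupported-GEMM fallback path.

    Mirrors the described linear.cu behavior in spirit: the first time a
    (dtype, shape) combo fails, remember it and route every later call
    straight to the fallback instead of re-attempting and re-failing
    (the 26112 -> 1 failure-count collapse above).
    """

    def __init__(self, try_lt, fallback):
        self.try_lt = try_lt          # hypothetical fast-path call
        self.fallback = fallback      # always-supported slow path
        self.failed_combos = set()
        self.lt_failures = 0

    def gemm(self, key, *args):
        if key not in self.failed_combos:
            try:
                return self.try_lt(*args)
            except RuntimeError:
                self.lt_failures += 1
                self.failed_combos.add(key)  # never retry this combo
        return self.fallback(*args)
```

The key design point is that the cache is scoped per combo rather than disabling the fast path process-wide, matching the shape-scoped Lt fail cache noted in the 2026-02-27 update.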
Decision Update (2026-02-28 Late 8)
- Full-depth FFN projection fast-compute rerun is now complete on a clean inference path (`pool=16384`, classifier disabled, no fallback errors):
  - profiled AB3 (`ffnprojfast_fullstep_ab3_20260228T160255Z/summary_ab3.json`), `on-off`:
    - request `-0.370 ms`, infer `-0.348 ms`, TTFT `-0.045 ms`, p99 `-0.533 ms`.
  - non-profiled warm AB3 (`ffnprojfast_fullwarm_ab3_20260228T160358Z/summary_ab3.json`), `on-off`:
    - request `-0.249 ms`, infer `-0.225 ms`, TTFT `-0.015 ms`, p99 `-0.328 ms`.
- Strict parity passed with the explicit candidate env and then again on a temporarily promoted parser build:
  - candidate env: `week3_parity_report_ffnprojfast_candidate_20260228T160459Z.json` (`checked=3`, `failed=0`)
  - temporarily promoted build: `week3_parity_report_ffnprojfast_default_20260228T160639Z.json` (`checked=3`, `failed=0`)
- Interim interpretation: this looked promotable on qwen-focused clean-path profiling, but needed full foundation validation before a final canonical decision.
- Post-promotion same-window sanity AB3 (`ffnprojfast_default_sanity_ab3_20260228T160557Z/summary_ab3.json`) confirms near-flat but directionally positive default behavior:
  - `default - force_off`: request `-0.094 ms`, infer `-0.093 ms`, TTFT `-0.003 ms`, p99 `+0.057 ms`.
Decision Update (2026-02-28 Late 9)
- Canonical foundation rerun + same-window gate resolved the contradiction and rejected global promotion:
  - foundation pack: `benchmarks/phase2_runtime/results/aws_speedpass/foundation_ffnprojfastdefault_pack_20260228T194204Z/summary_ab3.json`
    - vs the prior canonical (`foundation_newdefaults_pack_20260228T143605Z`), all three modes were slower:
      - warm request `+1.317 ms`, cold full `+3.117 ms`, mixed request `+1.112 ms`.
  - same-window foundation gate AB2 (`default` vs `force_off`):
    - root: `benchmarks/phase2_runtime/results/aws_speedpass/foundation_ffnprojfast_gate_ab2_20260228T195240Z`
    - summary: `benchmarks/phase2_runtime/results/aws_speedpass/foundation_ffnprojfast_gate_ab2_20260228T195240Z/summary_gate_ab2.json`
    - `default - force_off`:
      - warm: request `+0.489 ms`, infer `+0.479 ms`, p99 `+0.841 ms`
      - cold: full `+0.746 ms`, infer `+0.537 ms` (startup improved `-1.323 ms`)
      - mixed: mean near-flat `+0.004 ms`, tails improved (`p95 -0.320 ms`, `p99 -0.823 ms`)
- Decision:
  - keep `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=0` as the canonical parser default.
  - retain the lane as opt-in for qwen-focused profiling where it can still be useful.
Decision Update (2026-02-27)
- Additional full-depth FFN/linear probe cycle is complete and did not produce a new canonical win.
- 3-seed outcomes:
  - `TRENI_DECODER_FFN_PROJ_U16_FUSED=1`: slight regression vs off in runtime-only and runtime-vLLM sets.
  - `TRENI_LINEAR_U16_FAST_COMPUTE=1`: near-neutral/slight regression vs off in the initial runtime-only set (later superseded by the `2026-02-28` AB5 promotion evidence).
  - `TRENI_LINEAR_LT_WORKSPACE_MB=64`: clear regression (`first_request_full_ms +40.546 ms`, `+2.38%`).
  - `TRENI_LINEAR_USE_LT=0`: clear regression (`first_request_full_ms +48.826 ms`, `+2.87%`).
  - shape-scoped Lt fail cache (replacing process-wide disable-on-first-fail) was implemented and validated; perf impact was near-neutral (`~0.05%` full-latency movement) in both runtime-only and runtime-vLLM checks.
- FFN projection batched2 lane (`TRENI_DECODER_FFN_PROJ_U16_BATCHED2`) is now validated and promoted default-on:
  - runtime-only 3-seed delta (on-off): `first_request_full_ms -12.199 ms` (`-0.72%`), `TTFT -0.171 ms`.
  - runtime-vLLM 3-seed runtime-leg delta (on-off): `first_request_full_ms -12.974 ms` (`-0.76%`), `TTFT -0.175 ms`.
  - stage profile corroborates the layer-compute reduction (`decoder_stepN_layers_mean 19.140 -> 18.447 ms`).
- Canonical full-depth linear lane remains:
  - `TRENI_LINEAR_USE_LT=1`
  - `TRENI_LINEAR_LT_WORKSPACE_MB=0`
  - `TRENI_DECODER_FFN_PROJ_U16_FUSED=0`
  - `TRENI_DECODER_FFN_PROJ_U16_BATCHED2=1` (default-on)
- Fresh stage profiles (`external_cold_layers36_stageprofile_ffnprojbatch2_off_20260227T182949Z`, `..._on_20260227T182728Z`) still show `decoder_stepN_layers` as dominant, but improved under batched2 (`19.140 -> 18.447 ms`); FFN projection remains the top layer sub-stage (`0.205 -> 0.196 ms`/layer), so the next optimization remains structural layer-compute work.
Decision Update (2026-02-27 Late, Full-Depth Lane)
- `TRENI_DECODER_DIRECT_OUT_HIDDEN` is now promoted default-on in this full-depth lane after a positive 3-seed runtime-only A/B:
  - off: `TTFT=15.024 ms`, `full=1690.855 ms`, `cold_full=4696.944 ms`, `infer=1668.381 ms`
  - on: `TTFT=14.950 ms`, `full=1684.908 ms`, `cold_full=4691.002 ms`, `infer=1662.753 ms`
  - delta (on-off): `full -5.948 ms`, `infer -5.629 ms`
  - strict parity passed: `week3_parity_report_directouthidden_default_20260227T184738Z.json` (`checked=3`, `failed=0`).
- External-cold harness now captures completion-length signals and supports fixed-token vLLM fairness:
  - new fields: `completion_chars`, `completion_words`, streamed `usage_*` (when available).
  - the vLLM path now uses `ignore_eos=true` for fixed-token comparisons.
  - a fixed-length rerun confirms matched `completion_tokens=64` for runtime and vLLM.
- New fused qkv split+bias path (`TRENI_DECODER_QKV_SPLIT_BIAS_FUSED`) is implemented and promoted default-on in this lane:
  - runtime-only 3-seed A/B:
    - off: `TTFT=14.951 ms`, `full=1684.135 ms`, `cold_full=4690.132 ms`, `infer=1662.833 ms`
    - on: `TTFT=14.687 ms`, `full=1663.776 ms`, `cold_full=4669.847 ms`, `infer=1641.322 ms`
    - delta (on-off): `TTFT -0.265 ms`, `full -20.359 ms`, `cold_full -20.285 ms`, `infer -21.511 ms`
  - strict parity passed: `week3_parity_report_qkvsplitbias_default_20260227T190739Z.json` (`checked=3`, `failed=0`).
- Latest fixed-length runtime-vLLM 3-seed set (both `completion_tokens=64`):
  - runtime: `TTFT=14.685 ms`, `full=1662.478 ms`
  - vLLM: `TTFT=50.272 ms`, `full=1293.215 ms`
  - interpretation: runtime remains clearly ahead on TTFT, but request-full still trails in this profile.
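The fixed-token fairness setting above pins completion length on the vLLM side. A sketch of the request body for an OpenAI-compatible completions endpoint (`ignore_eos` is a vLLM extension field; `temperature=0` is an added assumption for determinism, not stated in the notes):

```python
def fixed_token_request(prompt, completion_tokens=64, model="qwen"):
    """Request body for a claim-safe fixed-length latency comparison.

    ignore_eos is a vLLM extension to the OpenAI completions schema: it
    makes generation run the full max_tokens budget instead of stopping
    at EOS, so both engines emit exactly `completion_tokens` tokens and
    per-request latency is comparable.
    """
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": completion_tokens,
        "temperature": 0,     # deterministic decoding (assumption)
        "ignore_eos": True,   # vLLM extension: force fixed length
    }
```

Without `ignore_eos`, a model that stops early would report a shorter (and unfairly faster) request-full time, which is exactly the claim-safety issue the harness change addresses.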
Decision Update (2026-02-27 Night, Logits Fast-Compute Hook)
- `TRENI_DECODER_LOGITS_U16_FAST_COMPUTE` is now wired into the runtime logits projection path (`*_f32_input_ex(..., use_fast_compute)`).
- Runtime-only 3-seed A/B (`layers=36`, `pool=16384`, preload `64`):
  - off: `TTFT=14.687 ms`, `full=1661.945 ms`, `infer=1640.884 ms`, `cold_full=4667.855 ms`
  - on: `TTFT=14.676 ms`, `full=1662.713 ms`, `infer=1640.797 ms`, `cold_full=4668.751 ms`
  - delta (on-off): `TTFT -0.011 ms`, `full +0.767 ms`, `infer -0.086 ms`, `cold_full +0.896 ms`
- Decision: keep this knob disabled by default in this lane (`TRENI_DECODER_LOGITS_U16_FAST_COMPUTE=0`), because there is no material win and request-full regresses slightly.
- Fixed-token runtime-vLLM sanity rerun (`completion_tokens=64`):
  - runtime: `TTFT=14.700 ms`, `full=1662.793 ms`
  - vLLM: `TTFT=49.778 ms`, `full=1306.676 ms`
  - artifact: `benchmarks/phase2_external_cold/results/external_cold_layers36_preload64_logitsfast_off_vllm_s1_20260227T193632Z.json`
- Strict Week 3 parity after the hook integration passed:
  - artifact: `benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_report_logitsfast_hook_20260227T193756Z.json`
  - summary: `checked=3`, `failed=0`.
Decision Update (2026-02-27 Night, U16 Cache Unlock)
- Implemented structural cache fix:
  - `copy_tensor_to_gpu_u16` now uses tensor-cache lookup/store.
  - new env gate `TRENI_TENSOR_CACHE_U16` (default `1`) for explicit A/B.
  - the logits-u16 request path now goes through the shared cached helper.
- Runtime-only 3-seed A/B (`u16cache` off/on, full-depth preload `64`):
  - off: `TTFT=14.679 ms`, `full=1661.982 ms`, `infer=1640.118 ms`, `cold_full=4667.860 ms`
  - on: `TTFT=14.682 ms`, `full=1189.452 ms`, `infer=1168.883 ms`, `cold_full=4195.511 ms`
  - delta (on-off): `TTFT +0.003 ms`, `full -472.529 ms`, `infer -471.235 ms`, `cold_full -472.349 ms`
- Runtime-vLLM same-window A/B (`u16cache` off/on, 2 seeds each):
  - off means:
    - runtime: `TTFT=14.681 ms`, `full=1663.314 ms`
    - vLLM: `TTFT=50.073 ms`, `full=1325.189 ms`
    - runtime-vLLM full delta: `+338.124 ms` (runtime slower)
  - on means:
    - runtime: `TTFT=14.688 ms`, `full=1192.145 ms`
    - vLLM: `TTFT=50.183 ms`, `full=1290.816 ms`
    - runtime-vLLM full delta: `-98.671 ms` (runtime faster)
- Mechanism check from logs (measured request after preload):
  - off: `decoder_tensor_upload ~476 ms`, `decoder_tensor_h2d ~468 ms`
  - on: `decoder_tensor_upload ~5 ms`, `decoder_tensor_h2d 0 ms`
- Strict parity on the final default-on build passed:
  - artifact: `benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_report_u16cache_toggle_default_20260227T200652Z.json`
  - summary: `checked=3`, `failed=0`.
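The u16 cache unlock above is, at its core, a key-to-device-buffer memo around the upload path. A hypothetical Python sketch of the shape of the fix (`upload` stands in for the real H2D copy in `copy_tensor_to_gpu_u16`; the real cache key is runtime-internal):

```python
class TensorCache:
    """Memoize device uploads by tensor identity.

    Mirrors the mechanism check above: with the cache on, a preloaded
    tensor is uploaded once and later requests reuse the device buffer,
    collapsing the per-request ~470 ms h2d/upload stage to a lookup.
    `enabled` plays the role of the TRENI_TENSOR_CACHE_U16 gate.
    """

    def __init__(self, upload, enabled=True):
        self.upload = upload
        self.enabled = enabled
        self.device = {}
        self.uploads = 0

    def get(self, key, host_tensor):
        if self.enabled and key in self.device:
            return self.device[key]      # cache hit: no H2D copy
        self.uploads += 1
        buf = self.upload(host_tensor)   # real path does the H2D copy
        if self.enabled:
            self.device[key] = buf
        return buf
```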
Decision Update (2026-02-27 Late Night, FFN Follow-Up)
- Consolidated artifacts:
  - `benchmarks/phase2_external_cold/results/external_cold_layers36_ffn_followup_summary_20260227T223458Z.json`
  - `benchmarks/phase2_external_cold/results/external_cold_layers36_ffn_followup_summary_20260227T223458Z.md`
- New optional `TRENI_LINEAR_BATCHED2_USE_LT` lane was implemented and tested:
  - runtime-only `ab3` delta (on-off): `TTFT +0.162 ms`, `full +12.469 ms`, `infer +12.534 ms`.
  - decision: not promoted.
- Higher-N retest of `TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=1` + `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1` (`ab8` runtime-only):
  - delta (on-off): `TTFT -0.001 ms`, `full -0.198 ms`, `infer -0.101 ms`.
  - decision: not promoted (near-noise).
- FFN fused path follow-up:
  - the code path now allows gate/up bias deferral into the fused SiLU*Up activation when `TRENI_DECODER_FFN_PROJ_U16_FUSED=1`.
  - runtime-only `ab3` delta (on-off): `TTFT -0.003 ms`, `full -0.383 ms`, `infer -0.161 ms`.
  - decision: not promoted (near-noise).
- Net status:
  - no canonical change from this cycle.
  - the full-depth hotspot remains layer compute (`decoder_stepN_layers` / FFN-heavy path), so next work stays on deeper structural compute reductions plus mixed-load repeatability.
Decision Update (2026-02-28 Early, Fast-Profile + Mixed-Load Repeatability)
- Fast-profile (--layers 2) higher-N logits fast-compute retest is complete:
  - artifact: benchmarks/phase2_external_cold/results/external_cold_layers2_logitsfast_ab8_summary_20260228T005529Z.json
  - runtime-only AB8 delta (on-off): TTFT -0.002 ms, full -0.299 ms, infer -0.013 ms, cold_full -0.345 ms
  - stage means remained effectively unchanged (decoder_stepN_logits_proj_mean ~1.261 ms in both modes)
  - decision: not promoted (near-noise effect).
- Mixed-load repeatability on the canonical lane is complete (run_mode=mixed_load, http_runs=120, 3 runs):
  - artifact: benchmarks/phase2_runtime/results/aws_speedpass/mixed_load_repeatability_summary_20260228T005626Z.json
  - means across runs: mean=122.247 ms, p95=198.518 ms, p99=199.608 ms
  - decision: stable; no canonical config change from this sweep.
- Strict Week 3 parity follow-up on the latest patched build:
  - artifact: benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_report_followup_20260228T005805Z.json
  - summary: checked=3, failed=0, strict.
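The mixed-load summary above reports mean/p95/p99 over per-request latencies. A minimal sketch of one common way to compute such a summary (nearest-rank percentiles; the harness's exact method may differ, and the function name is illustrative):

```python
def latency_summary(samples_ms):
    """Summarize per-request latencies (ms) into mean/p95/p99.

    Uses nearest-rank percentiles: the smallest sample that covers at
    least p% of all samples.  This is a sketch, not the harness's API.
    """
    xs = sorted(samples_ms)
    n = len(xs)

    def pct(p):
        # ceil(p*n/100) - 1, clamped into [0, n-1]
        k = max(0, min(n - 1, -(-p * n // 100) - 1))
        return xs[k]

    return {"mean": sum(xs) / n, "p95": pct(95), "p99": pct(99)}
```

For 120 HTTP runs per sweep, p99 is effectively the second-slowest request, which is why the log treats p99 shifts of a few hundred microseconds as meaningful only across repeated runs.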
Decision Update (2026-02-28, Parser Fix + Full-Depth FFN Follow-Up)
- phase2_runtime_benchmark.py timing parser was fixed to preserve decimals in timing stage=... ms=... lines.
  - root cause: the regex escaped the decimal point incorrectly, which truncated stage values to their integer prefixes.
  - impact: request-level metrics (ttft, infer, full) were unaffected; stage telemetry was underreported.
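The failure mode is easy to reproduce: a pattern that stops matching at the decimal point reports only the integer prefix of each stage time. An illustrative reconstruction (these are not the harness's exact regexes):

```python
import re

# A capture group that only matches digits stops at the decimal point,
# so "ms=1.261" is parsed as 1 rather than 1.261.
BROKEN = re.compile(r"timing stage=(\S+) ms=(\d+)")
# The fix: optionally capture the fractional part as well.
FIXED = re.compile(r"timing stage=(\S+) ms=(\d+(?:\.\d+)?)")

def parse_stage_ms(line, pattern):
    """Return (stage_name, value_ms) or None if the line does not match."""
    m = pattern.search(line)
    return (m.group(1), float(m.group(2))) if m else None
```

This also explains why request-level metrics were unaffected: they come from a different measurement path, so only the stage telemetry lines were underreported.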
- Rerun artifacts with the fixed parser:
  - benchmarks/phase2_runtime/results/aws_speedpass/cold_profile_qwen_layers36_fixparse_20260228T011037Z.json
  - benchmarks/phase2_runtime/results/aws_speedpass/warm_profile_qwen_layers36_fixparse_20260228T011037Z.json
- Confirmed full-depth hotspot (qwen, layers=36) remains FFN-heavy:
  - decoder_step_profile_ffn_proj_mean ~0.366 ms/layer
  - decoder_step_profile_ffn_down_resid_mean ~0.190 ms/layer
  - decoder_step_profile_total_mean ~0.705 ms/layer
- Full-depth warm AB3 on TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE:
  - artifact: benchmarks/phase2_runtime/results/aws_speedpass/ffn_fast_compute_ab3_20260228T011146Z_summary.json
  - delta (on-off): request +0.317 ms, infer +0.305 ms, stage means flat.
  - decision: not promoted.
- New strided-batched Lt path for the batched2 FFN Lt fallback was implemented and benchmarked:
  - warm AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2lt_strided_ab3_20260228T011651Z_summary.json
  - warm AB3 delta (on-off): request -0.190 ms, infer -0.194 ms, stage means flat.
  - runtime-only external-cold sanity (layers=36, preload64): slight regression (full +0.579 ms, infer +0.609 ms).
  - decision: keep the path opt-in (TRENI_LINEAR_BATCHED2_USE_LT=1) and not canonical.
- FFN gate/up dual-bias fused add path (TRENI_DECODER_FFN_BIAS_PAIR_FUSED) is now implemented and benchmarked:
  - warm AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/ffn_bias_pair_ab3_20260228T020257Z/summary.json
  - warm AB3 delta (on-off): request -0.229 ms, infer -0.090 ms, p99 -0.390 ms, TTFT +0.009 ms.
  - cold follow-up artifact: benchmarks/phase2_runtime/results/aws_speedpass/ffn_bias_pair_cold_ab2_20260228T020723Z/summary.json (3 seeds each after extension).
  - cold delta (on-off): TTFT -0.003 ms, infer +1.875 ms, full +1.928 ms.
  - decision: keep the lane opt-in (non-canonical) until the cold regression is eliminated.
- Batched2 seq1 split-GEMM lane (TRENI_LINEAR_BATCHED2_SPLIT_SEQ1) is now implemented and benchmarked:
  - warm AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_splitseq1_ab3_20260228T025841Z/summary.json
  - warm AB3 delta (on-off): request +0.014 ms, infer +0.105 ms, p99 +0.124 ms, TTFT +0.004 ms (near-noise/slight regression).
  - cold AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_splitseq1_cold_ab3_20260228T025841Z/summary.json
  - cold AB3 delta (on-off): TTFT -0.021 ms, infer -2.002 ms, full -2.070 ms.
  - decision: keep opt-in and non-canonical (no warm-path win).
- Batched2 dup-input strided lane (TRENI_LINEAR_BATCHED2_DUP_INPUT) is now implemented and benchmarked:
  - warm AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_dupinput_ab3_20260228T031816Z/summary.json
  - warm AB3 delta (on-off): request +0.317 ms, infer +0.293 ms, TTFT +0.009 ms, p99 -0.208 ms.
  - cold AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_dupinput_cold_ab3_20260228T031816Z/summary.json
  - cold AB3 delta (on-off): TTFT +0.010 ms, infer +1.388 ms, full +1.307 ms.
  - decision: keep opt-in and non-canonical (regresses the mean request path in both warm and cold).
- Batched2 dup-input v2 probe (duplication-kernel swap for the dup path) was run as a warm AB2 gate set and rejected:
  - gate artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_dupinput_v2warm_ab2_20260228T032741Z/summary_gate_ab2.json
  - gate delta (on-off): request +0.438 ms, infer +0.381 ms, TTFT +0.015 ms, p99 +0.217 ms.
  - decision: probe implementation reverted; no AB3/cold expansion.
- FFN proj u16 fused gate rerun (TRENI_DECODER_FFN_PROJ_U16_FUSED=0/1, warm AB2):
  - gate artifact: benchmarks/phase2_runtime/results/aws_speedpass/ffn_proj_u16_fused_gate_ab2_20260228T033524Z/summary_gate_ab2.json
  - gate delta (on-off): request +0.149 ms, infer +0.173 ms, TTFT +0.002 ms, p99 -0.006 ms.
  - decision: near-flat/slight mean regression; no AB3 expansion.
- FFN proj batched2 f32-input gate rerun (TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=0/1, warm AB2):
  - gate artifact: benchmarks/phase2_runtime/results/aws_speedpass/ffn_proj_batched2_f32input_gate_ab2_20260228T033758Z/summary_gate_ab2.json
  - gate delta (on-off): request +0.236 ms, infer +0.248 ms, TTFT +0.011 ms, p99 +0.512 ms.
  - decision: rejected at the gate stage; no AB3 expansion.
- Linear u16 compute16f gate probe (TRENI_LINEAR_U16_FORCE_COMPUTE_16F=0/1, warm AB2):
  - gate artifact: benchmarks/phase2_runtime/results/aws_speedpass/linear_u16_compute16f_gate_ab2_20260228T034412Z/summary_gate_ab2.json
  - gate delta (on-off): request +0.210 ms, infer +0.240 ms, TTFT -0.001 ms, p99 +0.594 ms.
  - decision: rejected at the gate stage and reverted; no AB3 expansion.
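The recurring gate-then-expand workflow above (cheap AB2 gate first, AB3 only for promising toggles) reduces to a small decision rule. A hedged sketch, with an illustrative noise band that is an assumption, not a documented harness threshold:

```python
def gate_decision(off_ms, on_ms, noise_band_ms=0.25):
    """AB gate: decide whether a toggle earns a full AB3 sweep.

    off_ms / on_ms: per-run request-latency means (ms) with the toggle
    off and on.  Negative delta means 'on' is faster.  A toggle only
    advances past the gate if it improves by more than the noise band;
    near-noise or regressing deltas are rejected early, saving the cost
    of AB3/cold expansion.  Sketch only -- not the harness's actual API.
    """
    delta = sum(on_ms) / len(on_ms) - sum(off_ms) / len(off_ms)
    if delta <= -noise_band_ms:
        return delta, "expand_to_ab3"
    return delta, "reject_at_gate"
```

This matches the pattern in the log: all three gate reruns above showed positive (regressing) request deltas, so none of them was expanded to AB3.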
- Full-depth warm u16-lane re-baseline (explicit u16 decode flags, qwen, layers=36) confirms the active hotspot split:
  - request mean in this lane is ~173 ms (120-request warm profile), with decoder_step_profile_total_mean ~0.402 ms.
  - FFN projection remains dominant (decoder_step_profile_ffn_proj_mean ~0.196 ms, mostly ffn_proj_gate; ffn_proj_up stays 0.0 under batched2).
- FFN gate/up contiguous-pair packing probe (TRENI_DECODER_FFN_PAIR_PACK_U16) is implemented as experimental and benchmarked:
  - AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/ffn_pair_pack_gate_ab2_20260228T040616Z/summary_ab3.json
  - warm AB3 delta (on-off): request -0.423 ms, infer -0.442 ms, p99 -0.673 ms.
  - caveat: both sides already reported the contiguous gate/up pair active, so this delta is not a causal promotion signal.
  - decision: keep the path default-off (TRENI_DECODER_FFN_PAIR_PACK_U16=0) and experimental only.
- Batched2 Lt rerun on the explicit u16 lane (TRENI_LINEAR_BATCHED2_USE_LT) now has fresh warm+cold evidence:
  - warm AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_use_lt_u16lane_gate_ab2_20260228T041041Z/summary_ab3.json
  - warm AB3 delta (on-off): request -0.313 ms, infer -0.468 ms, p99 -0.511 ms, TTFT -0.058 ms.
  - cold AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_use_lt_u16lane_cold_ab2_20260228T041359Z/summary_ab3.json
  - cold AB3 delta (on-off): full +1.165 ms, infer +1.424 ms, TTFT +0.001 ms.
  - fixed-on decision: keep non-canonical (the warm gain did not survive the cold first-hit tradeoff).
- Adaptive delayed batched2 Lt policy (TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS) has warm/cold wins but is not canonical (2026-02-28):
  - 5000 ms AB3 (batched2_lt_enable_after_ms5000_warm_ab3_20260228T104525Z, batched2_lt_enable_after_ms5000_cold_ab3_20260228T104712Z): warm improved but cold full still regressed (+0.422 ms).
  - 10000 ms AB3 (batched2_lt_enable_after_ms10000_warm_ab3_20260228T105028Z, batched2_lt_enable_after_ms10000_cold_ab3_20260228T105213Z): both modes improved.
    - warm delta (on-off): request -0.363 ms, infer -0.326 ms, p99 -0.696 ms.
    - cold delta (on-off): startup -4.307 ms, full -0.635 ms, infer -0.347 ms, TTFT -0.070 ms.
  - strict parity (week3_parity_report_batched2_lt_delay10000_20260228T105329Z.json): pass (checked=3, failed=0).
  - default-path strict parity smoke (no explicit batched2 Lt env overrides) also passed: week3_parity_report_batched2_lt_defaultdelay_20260228T110825Z.json.
  - same-window mixed-load A/B (mixed_load_defaultdelay_onoff_ab3_20260228T115010Z.json) regressed with delayed-on: on-off mean +0.846 ms, p95 +1.627 ms, p99 +0.679 ms.
  - decision: keep the defaults off (TRENI_LINEAR_BATCHED2_USE_LT=0, TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS=0) and leave delayed-on as opt-in.
  - post-revert default-path strict parity also passed: week3_parity_report_postrevert_defaults_20260228T115543Z.json.
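The delayed-enable policy can be sketched as a process-uptime gate; the semantics below are assumed from the flag name (TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS), not taken from the runtime source:

```python
import time

class DelayedToggle:
    """Sketch of a delayed-enable policy: keep a faster-warm but
    slower-cold code path OFF until the process has been up for N ms,
    so the cold first hit takes the safe path while steady-state
    traffic gets the win.  Assumed semantics, illustrative class name.
    """

    def __init__(self, enable_after_ms, now_ms=None):
        self.enable_after_ms = enable_after_ms
        self.start_ms = time.monotonic() * 1000.0 if now_ms is None else now_ms

    def enabled(self, now_ms=None):
        if self.enable_after_ms <= 0:  # 0 means the policy is off entirely
            return False
        now = time.monotonic() * 1000.0 if now_ms is None else now_ms
        return (now - self.start_ms) >= self.enable_after_ms
```

The mixed-load regression above shows the limitation of a pure uptime gate: once the delay elapses, the Lt path is on for all traffic, including load patterns where it loses.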
- Foundation parser-default rerun pack is now published (foundation_defaultdelay_pack_20260228T114315Z):
  - warm AB3 means (foundation_defaultdelay_warm_ab3_20260228T114315Z/summary_ab3.json): request 147.258 ms, p99 247.617 ms, infer 128.450 ms, TTFT 16.999 ms.
  - cold AB3 means (foundation_defaultdelay_cold_ab3_20260228T114315Z/summary_ab3.json): startup 425.532 ms, full 598.787 ms, infer 580.173 ms, TTFT 12.210 ms.
  - mixed repeatability (mixed_load_repeatability_summary_defaultdelay_20260228T114748Z.json) vs the prior canonical summary (mixed_load_repeatability_summary_20260228T005626Z.json) remained slower (mean +2.841 ms, p95 +5.587 ms, p99 +5.140 ms), reinforcing the non-canonical decision for delayed-on defaults.
- Added an experimental Lt prewarm path for FFN batched2 (TRENI_DECODER_FFN_BATCHED2_LT_PREWARM) and measured it with Lt fixed-on:
  - warm AB2 (batched2_lt_prewarm_warm_ab2_20260228T042453Z/summary_gate_ab2.json): small gain (request -0.328 ms, infer -0.394 ms).
  - cold AB3 (batched2_lt_prewarm_cold_ab3_20260228T042649Z/summary_ab3.json): first-hit gain (full -1.497 ms, infer -1.406 ms).
- Direct same-window combo A/B (lt=0, prewarm=0 vs lt=1, prewarm=1) is mixed and non-promotable:
  - combined summary artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_lt_prewarm_combo_summary_20260228T042733Z.json
  - warm AB3 (batched2_lt_prewarm_combo_warm_ab2_20260228T042733Z/summary_ab3.json): regression (request +0.198 ms, infer +0.178 ms, p99 +0.407 ms).
  - cold AB3 (batched2_lt_prewarm_combo_cold_ab3_20260228T042733Z): still improved (full -1.099 ms, infer -0.819 ms).
  - decision: keep the prewarm path experimental/default-off; not canonical.
- FFN down fast-compute lane (TRENI_DECODER_FFN_DOWN_U16_FAST_COMPUTE) is now promoted default-on (2026-02-28) after full-depth canonical A/B + strict parity:
  - warm AB3 (ffn_down_fast_compute_gate_ab3_20260228T044546Z/summary_ab3.json): request -0.565 ms, infer -0.566 ms, p99 -1.405 ms, TTFT -0.030 ms.
  - cold AB3 (ffn_down_fast_compute_cold_ab3_20260228T044753Z/summary_ab3.json): startup -8.405 ms, full -0.351 ms, infer -0.406 ms, TTFT -0.028 ms.
  - strict parity (week3_parity_report_ffn_down_fast_20260228T044846Z.json): pass (checked=3, failed=0).
- Post-promotion retest cycle on the updated canonical baseline (2026-02-28) closed additional FFN toggle candidates as non-canonical:
  - new structural TRENI_LINEAR_BATCHED2_STACKED_SEQ1=1 AB3 probe regressed warm materially (request +1.259 ms, infer +1.229 ms, p99 +2.830 ms) with near-flat cold full (+0.030 ms), so it remains experimental/default-off.
  - TRENI_LINEAR_BATCHED2_SPLIT_SEQ1 AB3 retest regressed warm and cold.
  - TRENI_LINEAR_BATCHED2_USE_LT fixed-on AB3 retest improved warm but still regressed cold startup/full; delayed-on improved warm/cold but still regressed mixed-load, so the lane stays non-canonical.
  - combo TRENI_LINEAR_BATCHED2_USE_LT=1 + TRENI_DECODER_FFN_BATCHED2_LT_PREWARM=1 looked positive at AB3 but failed AB5 cold confirmation.
  - TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1 was non-canonical in that cycle (later temporarily promoted in Decision Update (2026-02-28 Late 8), then rejected in Decision Update (2026-02-28 Late 9) and returned to canonical off).
  - TRENI_LINEAR_U16_FAST_COMPUTE=1 was revalidated in a later AB5/cold/parity cycle and is now promoted (see Decision Update (2026-02-28 Late) above).
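Most of the decisions above reduce to env-var toggles with default-on or default-off semantics ("enabled unless explicitly disabled (=0)" vs "opt-in"). A minimal sketch of that convention (the helper is illustrative, not the runtime's actual flag parsing):

```python
import os

def env_flag(name, default=False):
    """Read a TRENI_*-style boolean toggle from the environment.

    default=True models promoted, default-on lanes: they stay enabled
    unless explicitly set to 0.  default=False models opt-in
    experimental lanes.  Sketch of the convention only -- the runtime's
    actual accepted spellings may differ.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip() not in ("0", "false", "off", "")
```

Keeping rejected lanes behind default-off flags (rather than deleting the code) is what makes the later retest cycles in this log cheap to rerun.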
Decision Update (2026-02-26)
- Full-depth FFN activation-to-u16 fused path (TRENI_DECODER_FFN_ACT_U16_FUSED) was implemented, benchmarked, and promoted to default-on.
- Runtime-only 3-seed A/B:
  - off: TTFT=15.333 ms, full=1715.700 ms, cold_full=4721.653 ms
  - on: TTFT=15.193 ms, full=1704.987 ms, cold_full=4710.958 ms
  - delta (on-off): TTFT -0.140 ms, full -10.713 ms, cold_full -10.696 ms
- Runtime-vLLM 3-seed A/B (same host class/window):
  - off: runtime full=1716.052 ms, vLLM full=1299.219 ms (runtime/vLLM=1.3208x)
  - on: runtime full=1704.248 ms, vLLM full=1309.801 ms (runtime/vLLM=1.3012x)
- Strict parity passed for explicit-on and default-on runs:
  - week3_parity_report_ffnactu16_20260226T1100.json
  - week3_parity_report_ffnactu16_default_20260226T1108.json
- cuBLASLt workspace probe (TRENI_LINEAR_LT_WORKSPACE_MB=32) was tested and rejected in this lane (full 1711.213 -> 1754.568 ms in trial A/B).
Decision Update (2026-02-24)
- cuDNN/frontend optimization lane is parked for now.
  - Reason: high fused coverage remains slower than custom on the warm path and dramatically worse on cold first-hit.
  - Active priority is the custom-kernel best path only.
- Custom-lane implementation update: added a seq1 microfused attention path (TRENI_ATTN_SEQ1_USE_MICROFUSED) and cached cuBLAS stream binding.
- G5 A/B update (2026-02-23): the microfused path showed no net win (mean/TTFT regressions across qwen/bart profiles), so it remains opt-in and defaults off.
  - summary artifact: benchmarks/phase2_runtime/seq1_microfused_ab/seq1_microfused_ab_summary_20260223T014848Z.md.
- G5 stream-cache A/B update (2026-02-23): TRENI_LINEAR_STREAM_CACHE / TRENI_ATTN_STREAM_CACHE showed near-neutral impact in short runs; the cache stays enabled by default.
  - summary artifact: benchmarks/phase2_runtime/results/stream_cache_ab_summary_20260223T015222Z.md.
- G5 registry/model-index hash A/B update (2026-02-23): TRENI_REGISTRY_LOOKUP_HASH / TRENI_MODEL_INDEX_NAME_HASH showed no meaningful cold/setup gain on this profile; the path remains opt-in and defaults off.
  - summary artifact: benchmarks/phase2_runtime/results/registry_hash_ab_summary_20260223T020353Z.md.
- Cold-start measurement contract fix (2026-02-23): phase2_runtime_benchmark.py health polling moved from a 1 s cadence to a 50 ms cadence.
  - implication: startup_to_healthy_ms is now high-fidelity (the older ~1002 ms plateaus were quantization artifacts of the 1 s poll, not true startup plateaus).
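The quantization artifact is a property of poll cadence: with a 1 s poll, any startup that completes between the first and second probe reads as ~1000 ms. A minimal sketch of the fixed contract (the health probe itself is left abstract; this is not the harness's actual code):

```python
import time

def wait_until_healthy(probe, poll_interval_s=0.05, timeout_s=30.0):
    """Poll probe() (e.g. an HTTP GET against the runtime's health
    endpoint) on a 50 ms cadence and return elapsed ms at first success.

    With the old 1 s cadence, startup_to_healthy_ms was quantized to
    ~1000 ms steps; a 50 ms cadence bounds the quantization error to
    one poll interval.  Illustrative sketch only.
    """
    start = time.monotonic()
    while True:
        if probe():
            return (time.monotonic() - start) * 1000.0  # startup_to_healthy_ms
        if time.monotonic() - start >= timeout_s:
            raise TimeoutError("health probe did not succeed in time")
        time.sleep(poll_interval_s)
```

The residual measurement error is now bounded by the poll interval plus probe round-trip time, which is why the post-fix references report values like 437.653 ms rather than ~1002 ms plateaus.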
- Runtime startup-smoke control is now first-class in the harness (--runtime-skip-startup-smoke, default true).
  - validated A/B (startup_smoke_ab_hf_20260223T030059Z): startup-to-healthy improved 488.027 -> 404.184 ms (-17.18%) and start-to-first-response improved 705.454 -> 622.167 ms (-11.81%) with the startup smoke skipped.
  - the runtime default now also skips the startup smoke unless explicitly disabled (TRENI_SKIP_STARTUP_SMOKE=0).
- New high-fidelity cold reference (cold_foundation_hf_20260223T030257Z, qwen profile):
  - startup-to-healthy 437.653 ms
  - request TTFT 4.295 ms
  - request full 217.998 ms
  - dominant request-path stage remains decoder_tensor_upload / decoder_tensor_h2d.
- Consolidated knob-probe summary artifact: benchmarks/phase2_runtime/results/cold_path_knob_probe_20260223T0303Z.md
- Per-tensor upload hotspot probe is now available: TRENI_TENSOR_UPLOAD_TOPK identified model.embed_tokens.weight as dominant in qwen cold upload (~79.3 ms, ~63.8% share in the probe artifact).
  - artifact: benchmarks/phase2_runtime/results/cold_upload_hotspot_summary_20260223T1915Z.md.
- Container readahead probe (TRENI_CONTAINER_WILLNEED) is now benchmarked:
  - 8-run A/B (container_willneed_ab8_20260223T191145Z) shows a modest, repeatable cold-total improvement (~-1.94% start-to-first-response).
  - the TRENI_CONTAINER_WILLNEED + TRENI_TENSOR_HOST_REGISTER combo did not improve further on this profile (container_hostreg_ab8_20260223T191255Z).
  - the runtime default now enables TRENI_CONTAINER_WILLNEED unless explicitly disabled (=0).
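A top-K upload report of the kind TRENI_TENSOR_UPLOAD_TOPK produces can be sketched as a ranking of per-tensor upload times by share of the total (field names and the function itself are illustrative, not the runtime's output format):

```python
def upload_topk(per_tensor_ms, k=5):
    """Rank tensors by upload time and report each one's share of the
    total cold-upload cost.  This kind of share breakdown is how a
    single tensor (e.g. model.embed_tokens.weight) shows up as the
    dominant ~63.8% contributor.  Sketch only.
    """
    total = sum(per_tensor_ms.values())
    ranked = sorted(per_tensor_ms.items(), key=lambda kv: kv[1], reverse=True)
    return [
        {"tensor": name, "ms": ms, "share": ms / total}
        for name, ms in ranked[:k]
    ]
```

A heavily skewed share like this is the signal that cold-upload work should target one tensor's transfer path before spreading effort across the long tail.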
- Staged H2D upload follow-up is now complete (TRENI_TENSOR_H2D_STAGING):
  - min64/chunk32 8-run A/B (h2d_staging_ab_20260224T100915Z) regressed full latency (+21.22%) and the upload/h2d stages (+37.70% / +38.68%).
  - min64/chunk128 3-run probe (h2d_staging_chunk128_probe_20260224T101012Z) regressed further (full +44.43%, decoder_tensor_h2d +76.92%).
  - Decision: keep the staging path parked (opt-in only), and focus Track A cold work on non-staging upload/H2D plus decoder_step0_layers.
  - consolidated artifact: benchmarks/phase2_runtime/results/h2d_staging_followup_summary_20260224T101324Z.md.
- Non-staging H2D chunk-size matrix (TRENI_TENSOR_H2D_CHUNK_MB=0/64/128, 8 runs each) is now complete:
  - request-path and upload-stage deltas were near-neutral in that initial profile sweep; this was later superseded by the 2026-02-28 full-depth AB3 promotion of default TRENI_TENSOR_H2D_CHUNK_MB=0 (see Decision Update (2026-02-28 Late 3)).
  - consolidated artifact: benchmarks/phase2_runtime/results/h2d_chunk_matrix_summary_20260224T101730Z.md.
- Host page-touch pre-fault path (TRENI_TENSOR_HOST_TOUCH) is now implemented and benchmarked (TRENI_TENSOR_HOST_TOUCH_MIN_MB=256, 8-run A/B):
  - decoder_tensor_h2d decreased, but the prefetch/upload stages increased and net request latency regressed (full +7.73%, infer +8.22%).
  - Decision: keep host-touch opt-in/default-off; not promoted into canonical Track A settings.
  - consolidated artifact: benchmarks/phase2_runtime/results/host_touch_ab_summary_20260224T102444Z.md.
- Upload sync diagnostic probe (TRENI_TENSOR_UPLOAD_SYNC=0/1, 3 runs each) is now complete:
  - with synchronization on, conversion is visible (~6 ms) but H2D remains dominant (~118 ms) on this profile.
  - implication: cold upload optimization remains transfer-path first.
  - consolidated artifact: benchmarks/phase2_runtime/results/upload_sync_probe_summary_20260224T102618Z.md.
- Synchronized host-register probe (TRENI_TENSOR_HOST_REGISTER=0/1, with TRENI_TENSOR_UPLOAD_SYNC=1) is now complete:
  - no transfer-stage gain and a slight request-path regression on this profile.
  - implication: the host-register lane is currently deprioritized.
  - consolidated artifact: benchmarks/phase2_runtime/results/host_register_sync_probe_summary_20260224T102915Z.md.
- Decoder logits u16 path (TRENI_DECODER_LOGITS_U16_PATH) is now implemented and benchmarked:
  - cold upload/setup improves slightly, but request-path latency regresses materially (ttft/infer/full) in valid A/B runs.
  - a follow-up fix2 pilot still regresses the request path materially; the lane remains parked.
  - implication: keep this lane parked as opt-in experimental; not part of canonical Track A settings.
  - consolidated artifact: benchmarks/phase2_runtime/results/logits_u16_ab_fix1_summary_20260224T105532Z.md.
- Tensor-cache hash lookup path (TRENI_TENSOR_CACHE_HASH) is now implemented and benchmarked:
  - mixed + warm 3-seed A/B remains near-neutral, with a slight warm p99 regression (+0.149 ms) in this profile.
  - implication: keep this lane opt-in/default-off.
  - artifacts:
    - benchmarks/phase2_runtime/results/tensor_cache_hash_ab_20260224T113911Z/
    - benchmarks/phase2_runtime/results/tensor_cache_hash_warm3_20260224T114126Z/
- Sampler direct-store path (TRENI_SAMPLE_DIRECT_STORE) is now implemented and benchmarked:
  - the enabled path regressed warm request latency (3-seed A/B: mean +0.062 ms, p95 +0.076 ms, p99 +0.143 ms).
  - implication: keep this lane opt-in/default-off.
  - artifact: benchmarks/phase2_runtime/results/sample_direct_store_ab_20260224T114633Z/.
- Decoder direct-out residual path (TRENI_DECODER_DIRECT_OUT_HIDDEN) initial warm-profile A/B (2026-02-24) regressed and was kept opt-in at that time:
  - the enabled path regressed warm request and infer metrics (3-seed A/B: mean +0.540 ms, p95 +0.495 ms, p99 +0.444 ms, infer +0.150 ms).
  - artifact: benchmarks/phase2_runtime/results/direct_outhidden_ab_20260224T115051Z/.
  - note: this is superseded for the current full-depth lane by the 2026-02-27 late-cycle promotion (see Decision Update above).
- Consolidated summary for these three custom-path probes: benchmarks/phase2_runtime/results/custom_path_probe_summary_20260224T115602Z.md.
- Multi-head seq1 attention path (TRENI_ATTN_SEQ1_USE_MULTIHEAD) is now implemented and benchmarked:
  - qwen warm 3-seed: request mean 1.041x, p99 1.042x, infer 1.074x (seq1_multihead_ab_20260224T125127Z).
  - qwen mixed 3-seed: request mean 1.036x, p99 1.045x, infer 1.074x, cold wall 1.010x (seq1_multihead_ab_20260224T125127Z).
  - bart warm 3-seed: request mean 1.097x, p99 1.112x, TTFT 1.429x, infer 1.185x (seq1_multihead_bart_ab_20260224T125404Z).
  - default sanity rerun (no env override) remains faster than the forced-off profile (seq1_multihead_default_sanity_20260224T125713Z).
  - decision: promoted default-on (TRENI_ATTN_SEQ1_USE_MULTIHEAD=1, TRENI_ATTN_SEQ1_MULTIHEAD_MAX_KV=2048).
  - consolidated artifact: benchmarks/phase2_runtime/results/seq1_multihead_ab_summary_20260224T125619Z.md.
- External-cold rerun after the seq1 multi-head default promotion is now complete (2026-02-24, same G5 host/config, 3 runs):
  - runtime means: startup 1003.315 ms, TTFT 4.022 ms, request full 239.277 ms, cold-total first response 1242.592 ms.
  - runtime-normalized ratios: PyTorch 127.900x TTFT / 9.378x full / 6.320x cold-total; vLLM 12.350x TTFT / 4.139x full / 19.333x cold-total.
  - note: Ollama was skipped on this host because the Ollama service/model were not installed for this rerun.
  - consolidated artifact: benchmarks/phase2_external_cold/results/external_cold_seq1mh_default_repeatability_20260224T192020Z.md.
- First decoder_step0_layers optimization follow-up on the seq1 multi-head path is now benchmarked (2026-02-24):
  - change: reuse normalized probs in the multi-head seq1 softmax+PV (removing the repeated exp in the inner PV accumulation loop).
  - 3-run external-cold repeatability (runtime + PyTorch + vLLM) runtime deltas vs the prior seq1mh baseline:
    - TTFT: 4.022 -> 4.018 ms
    - request full: 239.277 -> 238.400 ms
    - cold-total first response: 1242.592 -> 1241.688 ms
  - interpretation: measurable but small gain; confirms the direction, and more step0 work is still needed for material uplift.
  - consolidated artifact: benchmarks/phase2_external_cold/results/external_cold_step0expfix_repeatability_20260224T194226Z.md.
- Second decoder_step0_layers follow-up (seq1 multi-head shared-prob cache) was benchmarked and reverted:
  - 3-run means were slightly worse than step0expfix (full +0.278 ms, cold-total +0.282 ms) while still better than the older seq1mh baseline.
  - decision: keep step0expfix as the current best path and revert the shared-prob patch.
  - artifact: benchmarks/phase2_external_cold/results/external_cold_step0shared_repeatability_20260224T194913Z.md.
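The first follow-up amounts to computing the softmax weights once and reusing them in the PV accumulation, rather than re-exponentiating inside the inner loop. A plain-Python sketch for a single head (shapes and names are illustrative, not the kernel's actual layout):

```python
import math

def seq1_softmax_pv(scores, values):
    """Softmax + PV accumulation for one query position (seq_q=1), one head.

    scores: kv_len attention logits for the new token.
    values: kv_len rows of head_dim value-cache entries.
    The optimization described above: exponentiate and normalize ONCE
    into `probs`, then reuse those probs in the PV loop, instead of
    recomputing exp(score - max) per element of the accumulation.
    """
    m = max(scores)                        # numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)
    probs = [e / denom for e in exps]      # computed exactly once
    head_dim = len(values[0])
    out = [0.0] * head_dim
    for p, row in zip(probs, values):      # PV accumulation reuses probs
        for d in range(head_dim):
            out[d] += p * row[d]
    return out
```

The small measured deltas are consistent with this: the change removes redundant transcendental work but leaves the matrix-multiply cost of the layer stack untouched.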
- Decode-stage profiling beyond step0 is now available (TRENI_DECODE_STAGE_PROFILE):
  - first profiled run (external_cold_stepn_profile_20260225T001334Z) shows decoder_stepN_logits_sample_mean=2.671 ms and decoder_stepN_layers_mean=1.360 ms (qwen fast profile: --layers 2, 64 tokens, no preload).
  - implication (fast profile): the next custom-kernel priority is the logits projection/sampling path.
- Decode split follow-up (2026-02-25) now isolates logits projection from sampling:
  - external_cold_stepn_split_20260225T081450Z and external_cold_stepn_split_revert_20260225T082055Z show decoder_stepN_logits_proj_mean=2.458 ms vs decoder_stepN_sample_mean=0.106 ms.
  - implication: the residual decode hotspot is specifically logits projection.
- Immediate logits-projection probe matrix (2026-02-25) is complete and near-neutral:
  - lt16 A/B: external_cold_stepn_lt16_off/on_20260225T081717Z/081718Z
  - fast16 GEMMEx probe: external_cold_stepn_split_fast16_20260225T082158Z
  - direct-u16-input A/B: external_cold_stepn_u16direct_off/on_20260225T082445Z/082447Z
  - lt_u16 workspace A/B: external_cold_stepn_ltu16ws_off/on_20260225T082735Z/082737Z
  - decision: all no-gain probe code paths were reverted; the baseline remains canonical.
- Uncertainty-capture A/B on the same profile (TRENI_DEMO_CAPTURE_UNCERTAINTY=1/0) is now complete:
  - request full 479.889 -> 473.367 ms
  - infer 461.771 -> 454.878 ms
  - decoder_stepN_logits_sample_mean 2.671 -> 2.562 ms
  - implication: uncertainty overhead is measurable but not the dominant decode cost.
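Stage means like decoder_stepN_logits_proj_mean come from accumulating per-step stage timings and averaging them at the end of a run. A minimal illustrative accumulator (not the runtime's actual profiler interface):

```python
import time
from collections import defaultdict

class StageProfiler:
    """Sketch of a decode-stage profiler: time named stages per token
    step, then report per-stage means.  Names and interface are
    illustrative of how fields like decoder_stepN_logits_proj_mean
    could be derived, not the runtime's actual implementation.
    """

    def __init__(self):
        self.samples = defaultdict(list)

    def record(self, stage, ms):
        self.samples[stage].append(ms)

    def timed(self, stage):
        profiler = self

        class _Timer:
            def __enter__(self):
                self.t0 = time.perf_counter()

            def __exit__(self, *exc):
                profiler.record(stage, (time.perf_counter() - self.t0) * 1000.0)

        return _Timer()

    def means(self):
        return {f"{s}_mean": sum(v) / len(v) for s, v in self.samples.items()}
```

Splitting one coarse stage into finer ones (as the logits_proj vs sample split above does) is purely a matter of where the timer scopes are placed in the decode loop.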
- Runtime-vLLM cold rerun (external_cold_runtime_vllm_uncertoff_20260225T001929Z) confirms the runtime remains clearly ahead on this profile:
  - runtime TTFT/full/cold-total full: 3.929 / 472.724 / 1476.116 ms
  - vLLM TTFT/full/cold-total full: 49.577 / 1311.481 / 24344.013 ms
- Full-depth qwen check (--layers 36, --pool-mb 16384) is now explicitly captured:
  - profiled runtime-only artifact: external_cold_stepn_split_layers36_pool16g_20260225T083216Z shows decoder_stepN_layers_mean=24.306 ms, decoder_stepN_logits_proj_mean=2.458 ms, decoder_stepN_total_mean=26.875 ms.
  - implication (full depth): decoder layers are the dominant request-path stage; logits projection is secondary.
- Full-depth runtime-vLLM cold comparison (external_cold_runtime_vllm_layers36_pool16g_20260225T083306Z):
  - runtime TTFT/full/cold-total full: 26.775 / 2983.780 / 3987.092 ms
  - vLLM TTFT/full/cold-total full: 49.998 / 1315.478 / 24346.938 ms
  - implication: the runtime is better on TTFT and cold-total, but currently slower on first-request full latency in this full-depth configuration.
- Full-depth preload follow-up (external_cold_runtime_vllm_layers36_pool16g_preload_20260225T150209Z):
  - the runtime request path improves to TTFT/full/infer = 26.748 / 2136.131 / 2114.951 ms with cache hits (cache_hit_delta=434, cache_miss_delta=0).
  - vLLM in the same run remains faster on full (1279.729 ms) but much worse on cold-total (24310.219 ms).
  - implication: after removing upload misses, the residual gap is decode/layer compute.
- Full-depth preload-max-tokens probe (external_cold_runtime_vllm_layers36_pool16g_preload64_20260225T150410Z) is near-neutral vs preload=1 on runtime request full (2133.948 ms), but increases cold-total due to the heavier startup preload.
- Full-depth seq1 hybrid matrix rerun (external_cold_layers36_hybrid_*_20260225T1508*.json):
  - default custom is best (infer ~2113 ms).
  - qk/pv/both cublas variants regress materially (infer ~2459-2556 ms).
- Full-depth direct-u16-input probe (external_cold_layers36_preload_a2_u16direct_off/on_20260225T150710Z/150715Z) is near-neutral/regressed and was reverted.
TRENI_DECODER_FFN_U16_PATH=1) is now complete:- artifacts:
external_cold_layers36_preload64_ab2_base_20260225T1628Zexternal_cold_layers36_preload64_ab2_ffnu16_20260225T1628Z
- runtime deltas (
ffnu16 - base):- TTFT
26.872 -> 18.077 ms - request full
2148.336 -> 1820.345 ms - cold-total full
6155.513 -> 4826.635 ms
- TTFT
- implication: significant full-depth gain is validated, but runtime request full is still slower than vLLM (
~1.38x) in this matched run.
- artifacts:
- Full-depth 3-seed expansion (
basevsATTN+FFN u16vsATTN+FFN+LOGITS u16) is now complete:- baseline means: runtime
TTFT=26.863 ms,full=2147.754 ms,cold_full=6154.978 ms ATTN+FFN u16means: runtimeTTFT=17.080 ms,full=1791.873 ms,cold_full=4797.910 msATTN+FFN+LOGITS u16means: runtimeTTFT=16.104 ms,full=1775.313 ms,cold_full=4780.830 ms- implication: best runtime/vLLM full ratio improved to
1.365x(from1.653xbaseline), but full request parity is still not reached.
- baseline means: runtime
- Full-depth decode-input reuse + u16-Lt follow-up (
2026-02-25) is now complete:- pre-cast reuse 3-seed means: runtime
TTFT=15.866 ms,full=1755.374 ms,cold_full=4761.440 ms - pre-cast reuse + u16-Lt 3-seed means: runtime
TTFT=15.522 ms,full=1729.351 ms,cold_full=4735.345 ms - vs prior best (
ATTN+FFN+LOGITS u16): request full-45.962 ms, TTFT-0.582 ms, cold-total full-45.485 ms - implication: best runtime/vLLM full ratio improved further to
1.323x, but request-full parity is still open.
- pre-cast reuse 3-seed means: runtime
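The deltas and runtime/vLLM ratios tracked through these updates are simple arithmetic over run-mean summaries. A small illustrative helper (field names mirror the reports; the function is not the harness's actual API):

```python
def ab_report(off, on, vllm_full_ms=None):
    """Compute on-off deltas (ms) between two run-mean summaries.

    off/on: dicts of metric means, e.g. {"TTFT": 26.863, "full": 2147.754}.
    Negative deltas mean the 'on' configuration is faster.  Optionally
    also report the runtime/vLLM full-latency ratio used throughout
    these updates as the parity yardstick.  Sketch only.
    """
    report = {k: round(on[k] - off[k], 3) for k in off}
    if vllm_full_ms is not None:
        report["runtime_vllm_full_ratio"] = round(on["full"] / vllm_full_ms, 3)
    return report
```

Because vLLM's own numbers drift between rerun windows, the log consistently recomputes the ratio from same-window pairs rather than comparing against a fixed vLLM baseline.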
- Full-depth residual-fused u16-Lt follow-up (2026-02-26) is now complete:
  - 3-seed means: runtime TTFT=15.400 ms, full=1719.302 ms, cold_full=4725.923 ms
  - vs the prior precastreuse+u16lt set: request full -10.049 ms, TTFT -0.122 ms, cold-total full -9.422 ms
  - implication: the runtime request path improved again; vLLM moved in the same rerun window, so the ratio remained mixed and request-full parity is still open.
- Full-depth FFN gate+up fused-batch probe (2026-02-26) is now closed as non-canonical:
  - trial and 3-seed runtime-only A/B were completed.
  - result: regression on 3-seed means (full +4.768 ms when enabled), so the path was reverted.
- Full-depth attention qkv fused-alias follow-up (TRENI_DECODER_ATTN_U16_QKV_FUSED) is now complete:
  - runtime-only 3-seed means:
    - off: TTFT=15.412 ms, full=1720.295 ms, cold_full=4726.358 ms
    - on: TTFT=15.323 ms, full=1714.426 ms, cold_full=4720.376 ms
    - delta (on-off): TTFT -0.089 ms, full -5.869 ms, cold_full -5.982 ms
  - runtime-vLLM 3-seed means:
    - off: runtime full=1720.062 ms, vLLM full=1282.024 ms (runtime/vLLM=1.3417x)
    - on: runtime full=1713.520 ms, vLLM full=1295.140 ms (runtime/vLLM=1.3230x)
  - implementation note: the fused alias is now default-on in this lane; runtime logs confirm activation (attn qkv fused alias=on) on the current qwen profile.
- Post-rebuild full-depth sanity checks (2026-02-26) confirm no regression in this lane:
  - external_cold_layers36_sanity_postltwsoff_residfuse_u16lt_20260226T093127Z: runtime TTFT=15.384 ms, full=1720.056 ms, cold_full=4726.451 ms
  - external_cold_layers36_sanity_postbatch2revert_residfuse_u16lt_20260226T093905Z: runtime TTFT=15.399 ms, full=1720.835 ms, cold_full=4726.968 ms
  - external_cold_layers36_sanity_postffnsubprof_residfuse_u16lt_20260226T094109Z: runtime TTFT=15.432 ms, full=1720.919 ms, cold_full=4727.189 ms
  - external_cold_layers36_sanity_qkvfuseddefault_residfuse_u16lt_20260226T102520Z: runtime TTFT=15.319 ms, full=1713.886 ms, cold_full=4720.190 ms
- Full-depth FFN sub-stage profile split (external_cold_layers36_stepn_profile_ffnsub_20260226T094140Z.log):
  - decoder_step_profile_ffn_proj_mean=0.205 ms
  - decoder_step_profile_ffn_proj_cast_mean=0.005 ms
  - decoder_step_profile_ffn_proj_gate_mean=0.101 ms
  - decoder_step_profile_ffn_proj_up_mean=0.099 ms
  - implication: the cast is minor; the remaining ffn_proj hotspot is gate/up linear compute itself.
- FAST_16 follow-up probe on top of u16-Lt (2026-02-25) was evaluated and not promoted:
  - request-full deltas were small (~1-2 ms) and one startup run in the repeatability set showed a large shared-host outlier.
  - decision: keep the canonical baseline on non-fast compute and continue next work from residfuse+u16lt.
What Has Been Run
Phase 1 (Baseline, Python stack)
- T4 set: baseline JSON exists.
- G5 set: baseline JSON exists.
- Includes cold start breakdown, warm model runs, and pipeline runs.
Phase 2 (Minimal runtime benchmark)
- T4 set: runtime JSON exists.
- G5 set: runtime JSON exists.
- Includes cold starts, model run timing, and HTTP request latency.
- True TTFT rerun exists (runtime timing, not SSE proxy).
- Cold optimization rerun exists after tensor index-cache fix.
- Stage-level cold decomposition exists (tokenizer/index/upload/prefill/step0 timings).
- Fast tensor collect optimization rerun exists (`clean4`).
- External cold canonical run exists across four backends (runtime, PyTorch, vLLM, Ollama) on G5 (2026-02-18).
- External cold optimized run exists with runtime startup preload + tokenizer cache (2026-02-18).
- External cold token-parity rerun exists after decoder/sampling fixes; runtime now wins request and cold-total vs vLLM (2026-02-18).
- Qwen cold upload sub-stage ablation exists with GPU conversion toggle on G5 (2026-02-19).
- External-cold runtime-only GPU-convert ablation exists on G5 (on/off toggle, preload + token-parity settings, 2026-02-19).
- External-cold runtime-vLLM rerun exists after vLLM env restore (2026-02-19, 3-run repeatability).
- External-cold all-backend repeatability exists after GPU-convert fix (2026-02-19, 3 runs; runtime + PyTorch + vLLM + Ollama).
- Runtime-only cold stability sweep exists (2026-02-19, 5 runs with preload upload sub-stage inspection).
- Runtime host-prefetch cold-variance fix exists (`TRENI_TENSOR_HOST_PREFETCH`, 2026-02-19) with a stable runtime-only 5-run sweep.
- External-cold all-backend repeatability rerun exists after the host-prefetch fix (2026-02-19, 3 runs).
- External-cold repeatability rerun exists after seq1 multi-head default promotion (2026-02-24, 3 runs, runtime + PyTorch + vLLM; Ollama skipped on host).
- AWS G5 speedpass matrix exists for upload-sync + `cublasLt` toggles (2026-02-22).
- AWS G5 TTFT kernel pass exists (2026-02-22): softmax near-parity, then a norm-kernel rewrite (rmsnorm/layernorm) produced measurable cold/warm latency gains.
- AWS G5 TTFT follow-up exists (2026-02-22): a `seq_q=1` tiny-attention kernel path + direct K/V cache writes further improved TTFT/warm latency and moved Bart TTFT materially.
- AWS G5 attention backend A/B exists for `custom` vs `cudnn_sdpa` proxy, including a reverse-order rerun to remove call-order cold bias (2026-02-22).
- AWS G5 seq1 hybrid tuning matrix exists (2026-02-22): default custom seq1 vs qk-cublas vs pv-cublas vs both-cublas.
- AWS G5 seq1 fused-softmax/PV follow-up exists (2026-02-22): default custom path rerun after fused seq1 softmax+PV + QK block retune (`seq1_hybrid_fused_20260222T192656Z`).
- Attention runtime now caches backend env config once per process (removes per-call parse overhead on the request path).
- `cudnn_sdpa` is now fused-only by default; legacy proxy A/B runs are explicit opt-in via `TRENI_ATTN_ALLOW_SDPA_PROXY=1`.
- H100 fused cuDNN SDPA probe pack exists (2026-02-22): alignment/shape/layout sweeps plus debug traces; no viable SDPA engine configs under the current backend descriptor path.
- AWS G5 strict fused frontend A/B rerun exists (`attn_backend_ab_frontend_20260222T220111Z`, fixed `qwen`, warmed query set): warm path near parity while cold-first-hit still regresses heavily.
- Fused frontend stage-profile probe exists (`cudnn_frontend_profile_probe_20260222T2204Z`): miss-cost root cause isolated (`~705 ms` per plan-build miss on A10G; pack/execute/unpack are negligible).
- Frontend A/B runner now hard-fails non-fused contamination (missing fused log marker or a `TRENI_WITH_CUDNN=0` runtime build).
- AWS G5 frontend repeatability matrix exists (`attn_backend_frontend_matrix_20260222T221948Z`, 3 repeats/profile, `warm_fixed` + `mixed_churn`) and shows custom wins all tracked metrics in both profiles.
- Frontend claim-strength report exists (`attn_backend_frontend_claim_report_20260222T222958Z`) with paired delta CI95 summaries for each metric/profile.
- AWS G5 frontend miss-trace probe exists (`attn_backend_ab_frontend_20260222T224739Z`) with explicit miss-key logging (`TRENI_ATTN_CUDNN_FRONTEND_TRACE_MISSES=1`).
- AWS G5 startup-preload mitigation matrix exists (`attn_backend_frontend_matrix_20260222T224521Z`) plus a direct compare report vs no-preload (`attn_backend_frontend_missmit_compare_20260222T225215Z`).
- AWS G5 preload splitter fix is verified (`TRENI_HTTP_PRELOAD_PROMPTS` now executes the full list; `run=1/4 ... run=4/4` in runtime logs).
- AWS G5 benchmark-query preload mitigation matrix exists (`attn_backend_frontend_matrix_20260222T231139Z`) with a direct compare vs a matched no-preload baseline (`attn_backend_frontend_missmit_compare_20260222T231335Z`).
- AWS G5 shape-level no-preload mitigation probe exists (`prebuild_startup_nopreload_probe_20260222T232932Z`): fused cold TTFT/request latency are near custom with startup prebuild enabled.
- AWS G5 no-preload shape-prebuild matrix probe exists (`attn_backend_frontend_matrix_20260222T233003Z`) with a direct compare vs the no-preload baseline (`attn_backend_frontend_missmit_compare_20260222T233116Z`).
- AWS G5 hybrid no-preload fused policy rerun exists (`attn_backend_frontend_matrix_20260223T001959Z`) with a direct compare vs the prior tuned no-gate shape-prebuild baseline (`attn_backend_frontend_missmit_compare_20260223T002153Z`), plus 3x startup probe repeatability (`prebuild_hybrid10_nopreload_probe_r{1,2,3}_20260223T002214Z`).
- AWS G5 broader-shape sanity run exists for the hybrid policy (`hybrid_shape_sanity_20260223T002857Z`): startup stays near `2.0 s` and inference stays valid, but long-prompt growth past `seq_kv=10` still triggers fused miss cascades.
- AWS G5 bounded-hybrid follow-up exists (`TRENI_ATTN_CUDNN_FRONTEND_MAX_SEQ_KV=10`): the broader-shape sanity rerun (`hybrid_shape_sanity_maxgate_20260223T003453Z`) removes miss cascades, and the 3x matrix rerun (`attn_backend_frontend_matrix_20260223T003611Z`) remains near-parity with prior hybrid fixed-profile metrics.
- Runtime now emits per-request attention backend telemetry (`attention.total_calls`, custom/fused/proxy shares, gate/fail counters) in chat responses; the phase2 harness/reporting now aggregates this in `attention_backend`.
- Coverage-instrumented fused reruns exist (2026-02-23): 3x matrix (`attn_backend_frontend_matrix_20260223T011158Z`) plus warm/cold coverage profiles (`fused_coverage_profiles_20260223T011504Z`, `fused_coverage_cold_profiles_20260223T011534Z`).
- Execution direction is now explicit: fused/frontend work is parked; next optimization cycles are custom-only.
- Routing failure-amplification stress run exists with injected tool failures/timeouts plus controller retries (2026-02-18).
- Routing matrix expansion exists on G5 (baseline + 5 stress profiles, 2026-02-19).
- Cross-host routing pilot exists (local client via SSH tunnel to G5 runtime/controller; baseline + mild-timeout + stress, 2026-02-19).
- Split-host routing matrix exists (CPU controller/tool host + GPU runtime host, 6 profiles, 2026-02-19).
- Internet multi-hop expansion exists (Fly.io controller/tool hops with commercial APIs):
  - OpenAI `gpt-5.2` profile matrix (2026-02-20, repeatability rerun at `runs=3`/profile).
  - OpenRouter `openai/gpt-5.2` profile matrix (2026-02-20, repeatability rerun at `runs=3`/profile).
  - OpenRouter `anthropic/claude-sonnet-4.6` profile matrix (2026-02-20, repeatability rerun at `runs=3`/profile).
- Local control routing matrices exist (same harness, local standalone external router, no Fly scheduler path):
  - OpenAI `gpt-5.2` (2026-02-20, `runs=3`/profile).
  - OpenRouter `anthropic/claude-sonnet-4.6` (2026-02-20, `runs=3`/profile).
  - higher-N reruns (`runs=8`/profile) for the same pair.
- Task-family parity split (`model_only`, `tool_only`) exists on the same pair (`runs=8`).
- Grouped commercial root-cause report exists (`commercial_gap_root_cause_20260222T222958Z`) combining fairness artifacts (r4+r8) by provider/model/task family with stage decomposition.
Week 3 (Numerical parity)
- T4 parity: strict mode, 0 failures.
- G5 parity: strict mode, 0 failures.
- Donut is intentionally skipped in parity check and marked as skipped.
Phase 3 comparison report
- T4 comparison report exists.
- G5 comparison report exists.
Phase 3 agentic loop benchmark (canonical G5 set)
- Dedicated harness implemented with 3 scenarios:
- retrieval correction
- tool-state adaptation
- confidence-gated branching
- Evaluator metrics included:
- success rate
- steps-to-convergence / correction efficiency
- latency per task and per successful task
- failure taxonomy
- Canonical G5 set run complete (2026-02-19):
  - baseline profile: 3 seeds
  - stress profile: 3 seeds
  - consolidated summary artifact published.
- Realistic-v1 fixture run set complete (2026-02-22):
  - baseline profile: 3 seeds
  - stress profile: 3 seeds
  - consolidated summary artifact published.
Phase 3 uncertainty-awareness ablation (baseline + stress + comparison)
- Harness now supports:
  - uncertainty source modes: `normalized_logprob`, `raw_logit_margin`, `hybrid`, `runtime_native`
  - independent uncertainty toggles: internal/external on/off
- Matrix runner added for 4-arm ablation per source.
- Baseline repeatability set complete (2026-02-19, `runs=8`, seeds `7/11/19`, all 3 sources).
- Stress repeatability set complete (2026-02-19, same seeds/sources, injected timeout/failure profile).
- Consolidated baseline-vs-stress comparison report published.
- Runtime/kernel-native uncertainty wiring is now implemented:
  - runtime HTTP response includes an `uncertainty` payload
  - Phase 3 harness can consume it via `runtime_native`
- Canonical G5 C2 rerun with the `runtime_native` source is now published (2026-02-19, baseline+stress, 3 seeds each).
- Realistic-v1 C2 baseline+stress pair is now published for `normalized_logprob`, `raw_logit_margin`, and `hybrid` (2026-02-22, seed `7`).
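The exact formulas behind these uncertainty sources are not spelled out in this report. As an illustration only, here is one plausible sketch of a `normalized_logprob`-style source and a `hybrid` blend; the function names, the mapping, and the blend weight are all hypothetical, not the harness's actual definitions:

```python
import math

def normalized_logprob_uncertainty(token_logprobs):
    """Hypothetical 'normalized_logprob'-style source: map the mean token
    logprob (logprobs <= 0) into [0, 1], where 0 means fully confident."""
    if not token_logprobs:
        return 1.0  # no evidence -> maximally uncertain
    mean_lp = sum(token_logprobs) / len(token_logprobs)
    # exp(mean logprob) is the geometric-mean token probability in (0, 1].
    return 1.0 - math.exp(mean_lp)

def hybrid_uncertainty(norm_lp_u, margin_u, weight=0.5):
    """Hypothetical 'hybrid' source: convex blend of two uncertainty signals."""
    return weight * norm_lp_u + (1.0 - weight) * margin_u

u = normalized_logprob_uncertainty([-0.05, -0.10, -0.02])
print(f"uncertainty={u:.4f}")
```

A `runtime_native` source would instead read the `uncertainty` payload from the runtime HTTP response directly rather than recomputing anything harness-side.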
Phase 4 hardware expansion (Lambda A100/H100)
- Full A100 run set complete (phase2 cold/hot + routing matrix + C2 runtime-native calibrated).
- Full H100 run set complete (phase2 cold/hot + routing matrix + C2 runtime-native calibrated).
- Canonical loop summaries on A100/H100 are also complete (baseline+stress, 3 seeds each).
- Paper-grade package generated from canonical G5 + A100 + H100 artifacts:
  - `/benchmarks/paper_package/latest/package_summary.json`
  - `/benchmarks/paper_package/latest/paper_package.md`
  - `/benchmarks/paper_package/latest/tables/*.csv`
  - `/benchmarks/paper_package/latest/manuscript/*` (captions, claims, figure manifest, mermaid figure specs)
Latest Key Findings (2026-02-17)
- Warm path on G5 remains strong (`~80.6 ms` mean, `~90.4 ms` p99 in the latest clean7 sanity run).
- Internal routing is faster than external routing (`1.032x` external/internal ratio).
- Cold TTFT dropped further after stage decomposition + fast tensor collect:
  - qwen: `1.41s -> 1.10s` (`22.1%` lower)
  - donut: `619ms -> 150ms` (`75.7%` lower)
  - bart: `777ms -> 125ms` (`83.9%` lower)
  - minilm: `23.4ms -> 22.6ms` (`3.4%` lower)
- `model_tensor_index_build` is no longer dominant (`~1-2.3 ms` mean across models in clean4).
- An async pinned-upload experiment regressed Qwen cold TTFT and was reverted; clean4 remains the accepted cold-path reference.
- Revert validation set (clean7, 2026-02-18 UTC) confirms clean4 numbers are reproducible within noise.
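The report mixes two improvement conventions, "Nx faster" ratios and "% lower" percentages; both derive from the same before/after pair. A small helper makes the mapping explicit, using the qwen cold-TTFT numbers above as input:

```python
def improvement(before_ms: float, after_ms: float) -> tuple[float, float]:
    """Return (ratio, percent_lower) for a before -> after latency drop.
    ratio is the 'Nx faster' convention; percent_lower is the '% lower' one."""
    ratio = before_ms / after_ms
    pct_lower = 100.0 * (1.0 - after_ms / before_ms)
    return ratio, pct_lower

# qwen cold TTFT from the clean4 findings above: 1.41 s -> 1.10 s
ratio, pct = improvement(1410.0, 1100.0)
print(f"{ratio:.3f}x faster, {pct:.1f}% lower")  # ~1.282x, ~22.0%
```

The tiny difference vs the reported `22.1%` comes from the source rounding `1.41s`/`1.10s` to two decimals before publishing.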
Latest Key Findings (2026-02-22, True Fused cuDNN Frontend Rerun)
- Strict fused frontend A/B (`attn_backend_ab_frontend_20260222T220111Z`) with fixed `qwen` and warmup policy (`http_warmup_runs=8`) shows:
  - warm request mean: custom `19.324 ms` vs fused frontend `21.503 ms` (custom/frontend = `0.899`)
  - warm infer mean: custom `18.803 ms` vs fused frontend `20.976 ms` (custom/frontend = `0.896`)
  - warm TTFT: custom `4.199 ms` vs fused frontend `4.498 ms`
- Cold-first-hit remains the blocker:
  - cold TTFT: custom `4.220 ms` vs fused frontend `710.641 ms`
  - cold full latency: custom `250.929 ms` vs fused frontend `6610.148 ms`
- Stage-profile probe (`TRENI_ATTN_CUDNN_FRONTEND_PROFILE=1`) shows the root cause is miss compile cost, not execution kernels:
  - plan-build miss cost: `~704.8 ms` per miss
  - pack/execute/unpack per-call costs are tiny (`~0.010 / 0.021-0.048 / 0.005 ms`)
- Interpretation:
- fused path is real and validated.
- warm steady-state is close to custom when shapes are warmed.
- unresolved work is miss mitigation for cold/mixed shape churn.
Latest Key Findings (2026-02-22, Frontend Repeatability Matrix)
- Artifact: `attn_backend_frontend_matrix_20260222T221948Z` (`repeats=3` per profile).
- Profiles:
  - `warm_fixed`: fixed model (qwen) with warmup (`http_warmup_runs=8`)
  - `mixed_churn`: fixed model with no warmup (`http_warmup_runs=0`) to expose miss churn
- Win counts:
  - custom is faster on every tracked metric in both profiles (`3/3` wins each metric)
- Warm-fixed aggregate:
  - request mean: custom `19.271 +/- 0.050 ms` vs fused `21.468 +/- 0.018 ms`
  - infer mean: custom `18.812 +/- 0.059 ms` vs fused `20.984 +/- 0.026 ms`
  - TTFT mean: custom `4.198 +/- 0.001 ms` vs fused `4.498 +/- 0.001 ms`
- Mixed-churn aggregate:
  - request mean: custom `47.864 +/- 0.018 ms` vs fused `843.141 +/- 0.735 ms`
  - infer mean: custom `47.331 +/- 0.050 ms` vs fused `842.542 +/- 0.747 ms`
  - TTFT mean: custom `4.197 +/- 0.002 ms` vs fused `179.744 +/- 0.263 ms`
- Interpretation:
- custom clearly wins under both stable warmed and churned request conditions.
- fused frontend remains sensitive to shape misses; miss mitigation is still the blocker for cold/mixed competitiveness.
Latest Key Findings (2026-02-22, Frontend Claim-Strength Report)
- Artifact: `attn_backend_frontend_claim_report_20260222T222958Z`.
- This report computes paired deltas (`frontend - custom`) with CI95 from the repeatability matrix.
- Warm-fixed signal:
  - request mean delta: `+2.197 ms` CI95 `[2.125, 2.238]`
  - TTFT delta: `+0.300 ms` CI95 `[0.299, 0.301]`
- Mixed-churn signal:
  - request mean delta: `+795.277 ms` CI95 `[794.408, 795.747]`
  - TTFT delta: `+175.546 ms` CI95 `[175.300, 175.820]`
- Interpretation:
- current custom path is faster than current fused frontend path in both profiles, with non-overlapping positive deltas for all tracked latency metrics.
- repeat count remains low (`n=3`/profile), but effect sizes are large and stable.
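The report's exact CI construction is not specified here; as a minimal sketch, a paired-delta CI95 under a normal approximation looks like the following. The repeat values are illustrative stand-ins for an `n=3` matrix, not the actual artifact data:

```python
import math
import statistics

def paired_delta_ci95(frontend_ms, custom_ms):
    """Paired (frontend - custom) mean delta with a normal-approximation CI95.
    The actual report's CI method may differ (e.g. t-interval or bootstrap)."""
    deltas = [f - c for f, c in zip(frontend_ms, custom_ms)]
    mean = statistics.fmean(deltas)
    if len(deltas) < 2:
        return mean, (mean, mean)  # degenerate interval for a single repeat
    sem = statistics.stdev(deltas) / math.sqrt(len(deltas))
    return mean, (mean - 1.96 * sem, mean + 1.96 * sem)

# Illustrative warm request means per repeat (ms), fused vs custom:
mean, (lo, hi) = paired_delta_ci95([21.47, 21.45, 21.49], [19.27, 19.22, 19.32])
print(f"delta={mean:+.3f} ms CI95[{lo:.3f}, {hi:.3f}]")
```

A consistently positive interval, as in the warm-fixed and mixed-churn signals above, is what supports the "custom faster" claim despite the low repeat count.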
Latest Key Findings (2026-02-22, Startup-Preload Miss-Mitigation, Updated Canonical)
- Artifacts:
  - baseline matrix (`no_preload`): `attn_backend_frontend_matrix_20260222T230445Z`
  - candidate matrix (`startup_preload_benchmark_queries`): `attn_backend_frontend_matrix_20260222T231139Z`
  - comparison report: `attn_backend_frontend_missmit_compare_20260222T231335Z`
  - exact-prompt probe: `preload_exact_prompt_probe_20260222T231050Z.json`
- Mitigation used:
  - startup multi-prompt preload (`TRENI_HTTP_PRELOAD_PROMPTS`) with the prompt set matched to benchmark cold/warm queries.
- Mixed-churn deltas (`no_preload` -> `startup_preload_benchmark_queries`):
  - fused warm request mean: `843.242 -> 22.433 ms` (`37.590x` faster)
  - fused warm infer mean: `842.684 -> 21.965 ms` (`38.365x` faster)
  - fused warm TTFT: `179.541 -> 4.497 ms` (`39.928x` faster)
  - fused cold TTFT: `704.521 -> 4.495 ms` (`156.723x` faster)
  - fused cold full latency: `6593.495 -> 25.785 ms` (`255.707x` faster)
- Exact-prompt probe result:
  - preload on the exact cold prompt drops first-hit fused TTFT to `4.499 ms` and full latency to `26.090 ms`.
- Interpretation:
- fused cold/mixed miss penalty can be removed on this harness when preload coverage matches serving prompts.
- custom remains slightly faster in warmed steady state (the ratio remains about `0.90` custom/frontend), but the prior `~704 ms` first-hit fused TTFT blocker is resolved for this canonical prompt set.
- still open: make this robust without prompt-list curation (shape-level prebuild/reuse path).
Latest Key Findings (2026-02-22, Shape-Prebuild No-Preload Probe)
- Artifacts:
  - cold probe (no preload, startup shape prebuild): `prebuild_startup_nopreload_probe_20260222T232932Z.json`
  - matrix probe (`repeats=1`): `attn_backend_frontend_matrix_20260222T233003Z`
  - compare vs no-preload baseline: `attn_backend_frontend_missmit_compare_20260222T233116Z`
- Mitigation used:
  - `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MAX_KV=16` (initial probe)
  - `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_HEAD_DIM=128`
  - no startup preload prompts (`TRENI_HTTP_PRELOAD` unset)
- Cold probe (qwen, fused frontend):
  - startup->healthy: `11017.541 ms`
  - request TTFT: `5.814 ms`
  - request full latency: `255.434 ms`
- Matrix probe highlights (`shape_prebuild_nopreload`, fused frontend):
  - mixed-churn cold TTFT: `5.805 ms`
  - mixed-churn cold full latency: `255.267 ms`
  - mixed-churn warm request mean: `51.482 ms`
  - mixed-churn warm TTFT: `4.824 ms`
- Interpretation:
  - shape-level prebuild removes the no-preload fused cold/mixed request-path spike without curated prompt lists.
  - the current tradeoff is a startup cost shift (`http_attn_prebuild` dominates startup time), so next work is reducing compile-at-startup overhead.
- Follow-up tuning (`seq_kv_max: 16 -> 10`) artifact: `prebuild_startup10_nopreload_probe_20260222T235944Z.json`
  - startup->healthy: `11017.541 -> 7011.472 ms` (`1.571x` faster startup)
  - request TTFT: `5.814 -> 5.826 ms` (near-identical)
  - request full latency: `255.434 -> 254.936 ms` (near-identical)
- Matrix confirmation for the tuned range:
  - tuned matrix (`seq_kv_max=10`): `attn_backend_frontend_matrix_20260223T000256Z`
  - compare vs `seq_kv_max=16`: `attn_backend_frontend_missmit_compare_20260223T000343Z`
  - request-path behavior stayed near-identical while startup dropped materially:
    - warm-fixed fused request mean: `22.556 -> 22.265 ms`
    - mixed fused request mean: `51.482 -> 50.974 ms`
- Lower-range probe (`seq_kv_max=8`) artifact: `prebuild_startup8_nopreload_probe_20260223T000600Z.json`
  - startup->healthy: `6010.381 ms` (faster startup)
  - request TTFT: `703.771 ms` (regression)
  - request full latency: `1660.576 ms` (regression)
  - interpretation: `seq_kv_max=8` under-covers this query profile; `10` is the minimum safe tuned range in the current harness.
- Heuristic-mode probe (`TRENI_ATTN_CUDNN_FRONTEND_HEUR_MODE`) on the current `sm86` path:
  - `A` and `B` had near-identical prebuild/startup behavior.
  - `FALLBACK` produced no valid engine configs for this frontend descriptor path.
Latest Key Findings (2026-02-23, Coverage-Instrumented Fused Reruns)
- Coverage-instrumented 3x matrix (`attn_backend_frontend_matrix_20260223T011158Z`) confirms:
  - warm-fixed fused coverage is low (`warm_attn_fused_share ~0.030303`) under the bounded hybrid policy.
  - mixed-churn fused coverage is similarly low (`~0.030303`), with custom handling most calls.
  - warm TTFT remains slightly better for custom (`4.194 ms` custom vs `4.269 ms` fused profile).
  - warm request mean/p99 stay near-parity on the fixed profile; the mixed profile still favors custom in request-path totals.
- High-coverage fused profile (`fused_coverage_profiles_20260223T011504Z`) shows the current fused frontend path is slower when heavily used:
  - `frontend_all` fused share `~0.878788` with warm request mean `22.310 ms` vs custom `20.292 ms` (`~1.099x` slower).
  - warm TTFT `4.496 ms` vs custom `4.196 ms`.
- Cold coverage profile (`fused_coverage_cold_profiles_20260223T011534Z`) shows a strong first-hit regression when fused coverage is high:
  - `frontend_all` fused share `~0.9` with cold TTFT `704.176 ms` vs custom `4.215 ms`.
  - cold full latency `6595.157 ms` vs custom `246.306 ms`.
- Interpretation:
- fused frontend path is now measurable and reproducible with explicit coverage accounting.
- in current implementation, high fused coverage still regresses latency; bounded gating avoids worst regressions by keeping most calls on custom.
- next optimization target remains dynamic shape plan reuse/coverage so fused can be exercised without miss-build penalties.
Latest Key Findings (2026-02-23, Hybrid Shape-Gate Frontend Policy)
- Artifacts:
  - 3x startup probe: `prebuild_hybrid10_nopreload_probe_r{1,2,3}_20260223T002214Z.json`
  - 3x frontend matrix: `attn_backend_frontend_matrix_20260223T001959Z`
  - compare vs prior tuned no-gate baseline: `attn_backend_frontend_missmit_compare_20260223T002153Z`
  - broader-shape sanity (initial): `hybrid_shape_sanity_20260223T002857Z`
  - broader-shape sanity (bounded gate): `hybrid_shape_sanity_maxgate_20260223T003453Z`
  - 3x bounded-gate matrix: `attn_backend_frontend_matrix_20260223T003611Z`
  - bounded-gate compare vs prior hybrid: `attn_backend_frontend_missmit_compare_20260223T003734Z`
- Policy used:
  - `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MIN_KV=10`
  - `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MAX_KV=10`
  - `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_HEAD_DIM=128`
  - `TRENI_ATTN_CUDNN_FRONTEND_MIN_SEQ_KV=10`
  - bounded-gate follow-up adds: `TRENI_ATTN_CUDNN_FRONTEND_MAX_SEQ_KV=10`
- 3-run startup probe summary (qwen, fused frontend, no preload prompts):
  - startup->healthy: `2004.840 +/- 0.146 ms`
  - request TTFT: `4.955 +/- 0.011 ms`
  - request full latency: `242.673 +/- 0.352 ms`
- Delta vs prior tuned shape-prebuild no-gate probe (`prebuild_startup10_nopreload_probe_20260222T235944Z`):
  - startup->healthy: `7011.472 -> 2004.840 ms` (`3.497x` faster)
  - request TTFT: `5.826 -> 4.955 ms` (`1.176x` faster)
  - request full latency: `254.936 -> 242.673 ms` (`1.051x` faster)
- Matrix deltas vs prior tuned no-gate matrix (`attn_backend_frontend_matrix_20260223T000256Z`):
  - warm-fixed fused request mean: `22.265 -> 20.354 ms` (`1.094x` faster)
  - mixed fused request mean: `50.974 -> 47.904 ms` (`1.064x` faster)
  - cold fused TTFT: `5.819 -> 4.959 ms` (`1.173x` faster)
  - cold fused full latency: `254.146 -> 242.569 ms` (`1.048x` faster)
- Bounded-gate broader-shape follow-up (`TRENI_ATTN_CUDNN_FRONTEND_MAX_SEQ_KV=10`):
  - broader-shape set mean full latency: `9974.576 -> 274.072 ms` (`36.395x` faster)
  - broader-shape set max full latency: `30654.303 -> 434.776 ms` (`70.504x` faster)
  - the fixed-profile matrix stayed near-parity vs the prior hybrid (`attn_backend_frontend_missmit_compare_20260223T003734Z`).
- Interpretation:
- the startup compile-burst tradeoff has been materially reduced while preserving low no-preload request-path latency.
- strict fused runs remain inference-valid with low-shape custom fallback (`inference.used=true`), so this is now the best prompt-independent frontend policy in this harness.
- the broader-shape limitation seen in the initial hybrid sanity (`hybrid_shape_sanity_20260223T002857Z`) is mitigated by the bounded-gate follow-up (`hybrid_shape_sanity_maxgate_20260223T003453Z`), which removes miss cascades by routing out-of-window shapes to custom.
- remaining work is wider fused coverage without fallback (dynamic shape-reuse/plan persistence).
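The bounded-gate policy above reduces to a simple dispatch predicate: seq1 decode shapes inside the prebuilt KV window go to the fused frontend, and everything else falls back to custom. A hypothetical sketch of that decision (env names are from this section; the real gate lives in the C runtime, and the function name here is invented):

```python
import os

def use_fused_frontend(seq_q: int, seq_kv: int, head_dim: int) -> bool:
    """Hypothetical sketch of the bounded shape-gate dispatch: only shapes
    inside the prebuilt window go to the fused cuDNN frontend; everything
    else stays on the custom kernels, avoiding plan-build miss cascades."""
    min_kv = int(os.environ.get("TRENI_ATTN_CUDNN_FRONTEND_MIN_SEQ_KV", "10"))
    max_kv = int(os.environ.get("TRENI_ATTN_CUDNN_FRONTEND_MAX_SEQ_KV", "10"))
    prebuilt_head_dim = int(
        os.environ.get("TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_HEAD_DIM", "128"))
    if seq_q != 1 or head_dim != prebuilt_head_dim:
        return False  # only seq1 decode shapes are prebuilt
    return min_kv <= seq_kv <= max_kv

print(use_fused_frontend(1, 10, 128))  # inside window -> fused
print(use_fused_frontend(1, 11, 128))  # past max gate -> custom fallback
```

With `MIN_SEQ_KV=MAX_SEQ_KV=10` this window is a single shape, which is exactly why the telemetry above shows fused shares near `0.03`: custom handles everything outside it.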
Latest Key Findings (2026-02-22, Commercial Root-Cause Grouped Analysis)
- Artifact: `commercial_gap_root_cause_20260222T222958Z`.
- Grouped on fairness-hardened splits (r4+r8) by provider/model/task-family.
- OpenAI `gpt-5.2` model-only (`paired_n=36`):
  - latency delta mean (external-internal): `-69.311 ms`, CI95 `[-193.985, 61.444]` (near parity/noise).
  - external controller overhead mean: `2.081 ms`; external model-hop mean: `1406.971 ms`.
- OpenAI `gpt-5.2` tool-only parity (`paired_n=12`):
  - latency delta mean: `+49.601 ms`, CI95 `[-162.047, 274.981]` (near parity/noise).
  - external controller overhead mean: `12.842 ms`; external model-hop mean: `2456.108 ms`.
- OpenRouter Sonnet 4.6 model-only (`paired_n=24`):
  - latency delta mean: `+204.883 ms`, CI95 `[-148.517, 683.114]` (near parity/noise).
  - external controller overhead mean: `2.254 ms`; external model-hop mean: `2220.251 ms`.
- Interpretation:
- current commercial control evidence does not show a statistically locked directional win/loss.
- controller overhead is small relative to model-hop variance; higher-N reruns are required before claiming directional commercial gap outcomes.
Latest Key Findings (2026-02-18, External Cold Canonical)
- Runtime cold total first response: `2342.996 ms`.
- PyTorch cold total first response: `8725.259 ms` (`3.724x` runtime).
- vLLM cold total first response: `25069.018 ms` (`10.7x` runtime).
- Ollama cold total first response: `3530.106 ms` (`1.507x` runtime).
- vLLM has the fastest request-path TTFT once healthy (`51.763 ms`), but startup (`24032.203 ms`) dominates end-to-end cold in this run.
Latest Key Findings (2026-02-18, External Cold Optimized Runtime)
- Runtime request full latency: `271.346 ms` (vs vLLM `1035.826 ms`).
- Runtime cold total first response: `2276.081 ms` (vs vLLM `28072.508 ms`).
- Runtime still trails vLLM in request TTFT (`91.596 ms` vs `51.725 ms`).
- This run was not yet token-parity (runtime decode steps were still 4 while the others used 48).
Latest Key Findings (2026-02-18, External Cold Token-Parity = 48, Pre-Fix)
- Runtime request full latency: `2518.142 ms` (vLLM: `1075.404 ms`).
- Runtime request TTFT: `91.207 ms` (vLLM: `51.310 ms`).
- Runtime cold total first response: `4522.345 ms` (vLLM: `28111.652 ms`, `6.216x` runtime advantage).
- The request-path gap remains: runtime per-token decode is now the dominant issue.
Latest Key Findings (2026-02-18, External Cold Token-Parity = 48, Decoder/Sampling Fix)
- Runtime request TTFT: `5.022 ms` (vLLM: `52.995 ms`, runtime `10.553x` faster).
- Runtime request full latency: `311.289 ms` (vLLM: `1094.517 ms`, runtime `3.516x` faster).
- Runtime cold total first response: `2316.048 ms` (vLLM: `25131.279 ms`, runtime `10.851x` better).
- Startup remained stable (`~2004.8 ms`) while the request-path bottleneck was removed.
- Confirmation rerun (runtime+vLLM) matched the direction: runtime `5.021/310.376/2314.581 ms` vs vLLM `51.655/1033.214/24065.623 ms` (TTFT/full/cold-total).
- Initial 3-run repeatability set (2026-02-18) showed TTFT `10.333x`, full `3.380x`, cold-total `10.688x`; this was superseded by the 2026-02-19 rerun below.
Latest Key Findings (2026-02-19, Qwen Cold Upload GPU-Convert Fix)
- A/B setup:
  - same G5 host, same runtime build, same `cold_first_hit` harness.
  - only toggle changed: `TRENI_TENSOR_CONVERT_GPU=0` (off) vs default on.
- Qwen results:
  - `full_latency_ms`: `1116.567 -> 238.740` (`4.677x` faster).
  - `decoder_tensor_upload`: `1007 ms -> 129 ms` (`7.806x` faster).
  - `decoder_tensor_convert`: `862 ms -> 6 ms` (`143.667x` faster).
  - `decoder_tensor_h2d`: `143 ms -> 121 ms` (`1.182x` faster).
  - startup + first response total: `2119.906 ms -> 1242.057 ms` (`1.707x` faster).
- Interpretation:
- the dominant cold bottleneck was CPU-side BF16/F16 conversion; moving conversion to GPU largely removed that bottleneck.
- External-cold runtime-only confirmation (2026-02-19, preload enabled, `max_tokens=48`):
  - startup-to-healthy: `2004.560 -> 1003.455 ms` (`1.997x` faster).
  - request full latency: `317.989 -> 317.276 ms` (effectively unchanged).
  - cold total first response: `2322.549 -> 1320.731 ms` (`1.759x` faster).
  - cold total first token: `2009.697 -> 1008.582 ms` (`1.993x` faster).
Latest Key Findings (2026-02-19, Runtime vs vLLM After GPU-Convert Fix, 3-run)
- Setup:
  - same G5 host and model family (Qwen 3B), token parity (`max_tokens=48`), runtime preload enabled.
  - backends run: runtime + vLLM (PyTorch/Ollama skipped in this repeatability set).
- Mean over 3 runs:
  - runtime TTFT `5.135 ms` vs vLLM `84.390 ms` (`16.433x` faster).
  - runtime request full `319.063 ms` vs vLLM `1111.463 ms` (`3.484x` faster).
  - runtime cold-total first response `1656.573 ms` vs vLLM `31151.892 ms` (`18.805x` better).
- Runs 2-3 only (post-first-run stabilization):
  - TTFT `17.211x`, full `3.416x`, cold-total `22.395x` in runtime's favor.
- Interpretation:
- after restoring vLLM env and rerunning on matched settings, runtime remains decisively ahead on request path and end-to-end cold total.
Latest Key Findings (2026-02-19, External Cold All-Backend Repeatability, GPU-Convert Fix2)
- Setup:
  - same G5 host, same model family (Qwen 3B), token parity (`max_tokens=48`), runtime preload enabled.
  - backends run: runtime + PyTorch + vLLM + Ollama.
- 3-run means (all runs):
  - runtime: startup `2339.131 ms`, TTFT `5.131 ms`, request full `318.315 ms`, cold-total first response `2657.447 ms`.
  - runtime-normalized ratios:
    - PyTorch: TTFT `115.313x`, full `7.508x`, cold-total `3.921x`.
    - vLLM: TTFT `16.091x`, full `3.852x`, cold-total `10.887x`.
    - Ollama: TTFT `2108.743x`, full `35.118x`, cold-total `4.584x`.
- Stable reference (runs 1-2):
  - runtime: startup `1003.915 ms`, TTFT `5.131 ms`, request full `317.290 ms`, cold-total first response `1321.205 ms`.
  - vLLM vs runtime (runs 1-2): TTFT `18.275x`, full `4.298x`, cold-total `21.875x`.
- Interpretation:
- runtime remains decisively ahead on request path and cold-total across all backends.
- one run had a startup/preload outlier (a `decoder_tensor_h2d` spike) that inflated the all-3-run startup mean.
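The "runtime-normalized ratios" used throughout these sections are each backend's metric divided by the runtime's. A sketch reproducing the vLLM row from this 3-run set; note the vLLM absolute means below are back-computed from the reported ratios for illustration, not taken from the artifact:

```python
def runtime_normalized(backend: dict, runtime: dict) -> dict:
    """Each backend metric divided by the runtime metric (>1 means slower)."""
    return {k: backend[k] / runtime[k] for k in runtime}

# Runtime 3-run means from this section.
runtime_means = {"ttft_ms": 5.131, "full_ms": 318.315, "cold_total_ms": 2657.447}
# vLLM means implied by the reported ratios (illustrative reconstruction).
vllm_means = {"ttft_ms": 82.563, "full_ms": 1226.149, "cold_total_ms": 28931.63}

ratios = runtime_normalized(vllm_means, runtime_means)
print({k: round(v, 3) for k, v in ratios.items()})
```

The same helper applied to the PyTorch and Ollama means yields the other two rows; the Ollama TTFT ratio is so large because Ollama's first token includes model load in this harness.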
Latest Key Findings (2026-02-19, External Cold All-Backend Repeatability, GPU-Convert + Host-Prefetch Fix)
- Setup:
  - same G5 host, same model family (Qwen 3B), token parity (`max_tokens=48`), runtime preload enabled.
  - backends run: runtime + PyTorch + vLLM + Ollama.
  - runtime cold path change: `TRENI_TENSOR_HOST_PREFETCH=1` with host-page `MADV_WILLNEED` on large tensor ranges.
- 3-run means:
  - runtime: startup `1003.836 ms`, TTFT `5.130 ms`, request full `316.403 ms`, cold-total first response `1320.240 ms`.
  - runtime-normalized ratios:
    - PyTorch: TTFT `108.567x`, full `7.341x`, cold-total `14.601x`.
    - vLLM: TTFT `16.537x`, full `3.896x`, cold-total `21.918x`.
    - Ollama: TTFT `514.414x`, full `9.471x`, cold-total `3.029x`.
- Runtime-only 5-run stability comparison (before vs after host-prefetch):
  - startup max: `3006.388 -> 1003.627 ms`.
  - cold-total first response max: `3324.212 -> 1322.338 ms`.
  - decoder tensor h2d max: `1869.296 -> 120.671 ms`.
  - decoder tensor upload max: `1877.485 -> 128.777 ms`.
- Interpretation:
- the intermittent preload upload outlier is removed in this sweep while request-path lead is preserved.
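The host-prefetch fix amounts to hinting the kernel about upcoming reads before the upload path touches the pages, so first-touch page faults do not land inside timed stages. A Python analogue of the idea (the runtime does this in C via `madvise`; the file, offsets, and sizes here are illustrative):

```python
import mmap
import tempfile

def prefetch_tensor_range(mm: mmap.mmap, offset: int, length: int) -> None:
    """Sketch of the TRENI_TENSOR_HOST_PREFETCH idea: advise the kernel
    (MADV_WILLNEED) that a large tensor range will be read soon, so the
    pages are faulted in ahead of the upload path. Linux-only semantics."""
    page = mmap.PAGESIZE
    start = (offset // page) * page  # madvise requires a page-aligned start
    mm.madvise(mmap.MADV_WILLNEED, start, (offset + length) - start)

# Demo on a throwaway file standing in for a weights blob.
with tempfile.NamedTemporaryFile() as f:
    f.write(b"\x00" * (1 << 20))
    f.flush()
    with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as mm:
        prefetch_tensor_range(mm, offset=4096, length=512 * 1024)
        _ = mm[4096:4160]  # subsequent reads hit already-resident pages
```

This is purely a hint: it changes when the page-in work happens, not whether it happens, which is exactly why it shows up as reduced variance (outlier removal) rather than a lower steady-state mean.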
Latest Key Findings (2026-02-24, External Cold Repeatability After Seq1 Multi-Head Default)
- Setup:
  - same G5 host class and token-parity prompt budget (`max_tokens=48`), runtime preload enabled.
  - backends run: runtime + PyTorch + vLLM (Ollama skipped in this rerun host environment).
- 3-run means:
  - runtime: startup `1003.315 ms`, TTFT `4.022 ms`, request full `239.277 ms`, cold-total first response `1242.592 ms`.
  - runtime-normalized ratios:
    - PyTorch: TTFT `127.900x`, full `9.378x`, cold-total `6.320x`.
    - vLLM: TTFT `12.350x`, full `4.139x`, cold-total `19.333x`.
- Delta vs prior host-prefetch repeatability (2026-02-19, 3-run means):
  - runtime TTFT: `5.130 -> 4.022 ms` (`1.275x` faster).
  - runtime request full: `316.403 -> 239.277 ms` (`1.322x` faster).
  - runtime cold-total first response: `1320.240 -> 1242.592 ms` (`1.062x` faster).
- Interpretation:
- after default-on seq1 multi-head promotion, runtime keeps a large cross-system margin and also improved its own cold request path vs the prior repeatability baseline.
Latest Key Findings (2026-02-24, External Cold Repeatability After Step0 Exp-Reuse Patch)
- Setup:
  - same G5 host class and token-parity budget (`max_tokens=48`), runtime preload enabled.
  - backends run: runtime + PyTorch + vLLM (Ollama skipped in this host environment).
  - custom-kernel change: the seq1 multi-head softmax/PV path now reuses normalized probabilities rather than recomputing `exp` in the inner PV loop.
- 3-run means:
  - runtime: startup `1003.287 ms`, TTFT `4.018 ms`, request full `238.400 ms`, cold-total first response `1241.688 ms`.
  - runtime-normalized ratios:
    - PyTorch: TTFT `126.786x`, full `9.374x`, cold-total `6.320x`.
    - vLLM: TTFT `12.545x`, full `4.184x`, cold-total `19.622x`.
- Delta vs immediate pre-patch repeatability baseline (`external_cold_seq1mh_default_repeatability_20260224T192020Z`):
  - runtime TTFT: `4.022 -> 4.018 ms` (`-0.004 ms`)
  - runtime request full: `239.277 -> 238.400 ms` (`-0.877 ms`)
  - runtime cold-total first response: `1242.592 -> 1241.688 ms` (`-0.904 ms`)
- Interpretation:
- this step0 patch is valid and non-regressing with a small positive shift.
- next gains likely require deeper reduction-path/launch-structure work in `decoder_step0_layers`, not only exp reuse.
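The exp-reuse structure can be illustrated in plain Python: compute the softmax normalization once, then run the PV accumulation as pure multiply-adds over the stored probabilities, instead of re-deriving `exp(s - max)` inside the inner loop. This is an illustrative sketch only; the actual kernel is CUDA and operates per head:

```python
import math

def seq1_softmax_pv(scores, values):
    """Seq1 softmax + PV with exp computed once (the 'exp reuse' structure).
    scores: attention scores for one query against seq_kv keys.
    values:  seq_kv value vectors; returns the probability-weighted sum."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # single exp pass
    inv_denom = 1.0 / sum(exps)
    probs = [e * inv_denom for e in exps]     # normalized once, then reused
    dim = len(values[0])
    out = [0.0] * dim
    for p, v in zip(probs, values):           # PV loop: multiply-add only
        for d in range(dim):
            out[d] += p * v[d]
    return out

out = seq1_softmax_pv([0.2, 0.1, -0.3], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

The small measured gain is consistent with this being a per-inner-iteration transcendental removed from an already short seq1 loop.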
Latest Key Findings (2026-02-24, External Cold Repeatability After Step0 Shared-Prob Follow-Up)
- Setup:
  - same G5 host class and token-parity budget (`max_tokens=48`), runtime preload enabled.
  - backends run: runtime + PyTorch + vLLM (Ollama skipped in this host environment).
  - follow-up change: cached per-head seq1 probabilities in shared memory inside the multi-head softmax/PV path.
- 3-run means:
  - runtime: TTFT `4.019 ms`, request full `238.678 ms`, cold-total first response `1241.970 ms`.
- Delta vs immediate `step0expfix` run:
  - TTFT: `4.018 -> 4.019 ms` (`+0.001 ms`).
  - request full: `238.400 -> 238.678 ms` (`+0.278 ms`).
  - cold-total first response: `1241.688 -> 1241.970 ms` (`+0.282 ms`).
- Interpretation:
  - this follow-up did not beat the prior exp-reuse patch.
  - the path was reverted; the current best remains `step0expfix`.
Latest Key Findings (2026-02-18, Routing Failure-Amplification Stress)
- Stress profile: injected tool `503` every 2nd request, injected tool timeout every 3rd request, controller tool timeout `0.25s`, controller tool retries `1`.
- Internal mean latency: `76.071 ms`.
- External mean latency: `109.806 ms` (`1.443x` external/internal).
- Internal error rate: `0.0000`.
- External error rate: `0.0833` (4 tool-hop failures over 48 requests).
- External/internal error-rate ratio: `inf` (external errored while internal did not).
- Retry signal: external tool retries mean `0.182`; taxonomy shows `tool_hop_failed=4`.
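The `inf` ratio above follows from dividing a non-zero external error rate by a zero internal one. A small sketch of that bookkeeping (the `1.0` convention for the zero/zero case is an assumption, not documented harness behavior):

```python
import math

def error_rate_ratio(external_errors, internal_errors, requests):
    """External/internal error-rate ratio, with the inf convention used above."""
    ext = external_errors / requests
    inte = internal_errors / requests
    if inte == 0.0:
        # external errored while internal did not -> inf;
        # both clean -> 1.0 (assumed convention for this sketch)
        return math.inf if ext > 0.0 else 1.0
    return ext / inte

print(error_rate_ratio(4, 0, 48))   # -> inf (the 0.0833 vs 0.0000 case above)
```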
Latest Key Findings (2026-02-19, Routing Matrix Expansion, G5)
- Matrix set: 6 profiles (`p00` baseline + `p01..p05` stress variants).
- Baseline profile: external/internal latency ratio `1.0420x`, external error rate `0.0000`.
- Mild timeout profile (`p02`): ratio `1.1420x`, external error rate `0.0000`.
- Mixed moderate profile (`p03`): ratio `1.1640x`, external error rate `0.0417`.
- Mixed aggressive profile (`p04`): ratio `1.4360x`, external error rate `0.0833`.
- Mixed aggressive + retry2 (`p05`): ratio `1.4160x`, external error rate `0.0833`.
- Internal error rate stayed `0.0000` across all 6 profiles.
- Interpretation: external-path degradation scales with timeout/failure pressure; extra retries reduce some retry counts but do not close the latency/error gap.
Latest Key Findings (2026-02-19, Routing Cross-Host Pilot)
- Topology:
  - local benchmark client
  - SSH tunnel to G5 host
  - runtime and external router on G5 host
- Baseline profile (`crosshost-p00-baseline`, 12 runs):
  - internal mean: `1071.477 ms`
  - external mean: `1059.478 ms`
  - external/internal ratio: `0.989x`
  - error rates: internal `0.0000`, external `0.0000`
- Mild-timeout profile (`crosshost-p02-timeout-mild`, 12 runs):
  - internal mean: `1054.123 ms`
  - external mean: `1123.393 ms`
  - external/internal ratio: `1.066x`
  - error rates: internal `0.0000`, external `0.0000`
  - external tool retries mean: `0.083`
- Stress profile (`crosshost-p04-stress`, 12 runs, fail/timeout injection):
  - internal mean: `1056.013 ms`
  - external mean: `1100.010 ms`
  - external/internal ratio: `1.042x`
  - error rates: internal `0.0000`, external `0.0833`
  - external tool retries mean: `0.182`
- Interpretation:
  - under cross-host stress, the external path again degrades in both latency and errors while the internal path remains error-free.
  - this is a pilot sanity check; canonical Track B completion is the split-host matrix below.
Latest Key Findings (2026-02-19, Routing Split-Host Matrix, Canonical Track B)
- Topology:
  - GPU host: runtime endpoint
  - CPU host: external controller + tool services
  - same-VPC private-network runtime calls from controller/tool to runtime
- Matrix set: 6 profiles (`splithost-p00` baseline + `splithost-p01..p05` stress variants), each with 12 runs.
- Baseline (`splithost-p00-baseline`):
  - internal mean `1052.392 ms`
  - external mean `1046.702 ms`
  - ratio `0.995x`
  - external error `0.0000`
- Mild fail (`splithost-p01_fail_mild`): ratio `0.998x`, external error `0.0000`, external tool retries `0.021`.
- Mild timeout (`splithost-p02_timeout_mild`): ratio `1.042x`, external error `0.0000`, external tool retries `0.021`.
- Mixed moderate (`splithost-p03_mixed_moderate`): ratio `1.001x`, external error `0.0417`, external tool retries `0.065`.
- Mixed aggressive (`splithost-p04_mixed_aggressive`): ratio `1.087x`, external error `0.0833`, external tool retries `0.182`.
- Mixed aggressive + retry2 (`splithost-p05_mixed_aggressive_retry2`): ratio `1.045x`, external error `0.0833`, external tool retries `0.091`.
- Matrix-wide summary:
  - external/internal latency ratio mean `1.028x`
  - internal error mean `0.0000`
  - external error mean `0.0347`
- Interpretation:
  - the split-host matrix confirms the same failure-amplification shape: baseline is near parity, but under timeout/failure pressure the external path degrades in latency and error rate while the internal path remains error-free.
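The matrix-wide summary is a plain mean over the six per-profile results. Reproducing it from the published split-host numbers:

```python
# Per-profile values as published in the split-host matrix above.
profiles = {
    "splithost-p00-baseline":               {"ratio": 0.995, "ext_err": 0.0000},
    "splithost-p01_fail_mild":              {"ratio": 0.998, "ext_err": 0.0000},
    "splithost-p02_timeout_mild":           {"ratio": 1.042, "ext_err": 0.0000},
    "splithost-p03_mixed_moderate":         {"ratio": 1.001, "ext_err": 0.0417},
    "splithost-p04_mixed_aggressive":       {"ratio": 1.087, "ext_err": 0.0833},
    "splithost-p05_mixed_aggressive_retry2":{"ratio": 1.045, "ext_err": 0.0833},
}
ratio_mean = sum(p["ratio"] for p in profiles.values()) / len(profiles)
err_mean = sum(p["ext_err"] for p in profiles.values()) / len(profiles)
print(round(ratio_mean, 3), round(err_mean, 4))  # 1.028 0.0347
```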
Latest Key Findings (2026-02-20, Internet Multi-Hop Matrix on Commercial APIs)
- Topology:
  - local benchmark client
  - Fly.io-hosted external controller/tool hop
  - commercial model runtime endpoints (`api.openai.com`, `openrouter.ai`)
- OpenAI (`gpt-5.2`, 3 profiles, `runs=3`):
  - matrix mean external/internal ratio: `1.1123x`
  - baseline profile: `1.110x`
  - timeout-mild profile: `1.082x`
  - mixed-aggressive profile: `1.145x`
  - internal error rate `0.0000`; mixed-aggressive external error rate `0.0833`
- OpenRouter (`openai/gpt-5.2`, 3 profiles, `runs=3`):
  - matrix mean external/internal ratio: `0.7553x`
  - baseline profile: `0.686x`
  - timeout-mild profile: `0.891x`
  - mixed-aggressive profile: `0.689x`
  - internal error rate `0.0000`; mixed-aggressive external error rate `0.1667`
- OpenRouter (`anthropic/claude-sonnet-4.6`, 3 profiles, `runs=3`):
  - matrix mean external/internal ratio: `1.0277x`
  - baseline profile: `1.236x`
  - timeout-mild profile: `0.968x`
  - mixed-aggressive profile: `0.879x`
  - internal error rate `0.0000`; mixed-aggressive external error rate `0.1667`
- Interpretation:
  - the OpenAI matrix supports the routing thesis directionally: public-network external hops are slower and less reliable under stress.
  - OpenRouter remains non-canonical for Track B direction claims in this topology due to mixed/inverted profile direction and elevated external errors.
Latest Key Findings (2026-02-20, Local Control Matrix, No Fly Scheduler Path)
- Topology:
  - local benchmark client
  - local standalone external controller/tool server
  - commercial model runtime endpoints
- OpenAI (`gpt-5.2`, `runs=3`, 3 profiles):
  - matrix mean external/internal ratio: `0.9867x`
  - profiles: baseline `0.995x`, timeout-mild `0.977x`, mixed-aggressive `0.988x`
  - external error mean `0.0313`
- OpenRouter (`anthropic/claude-sonnet-4.6`, `runs=3`, 3 profiles):
  - matrix mean external/internal ratio: `1.0663x`
  - profiles: baseline `1.055x`, timeout-mild `1.141x`, mixed-aggressive `1.003x`
  - external error mean `0.0313`
- Interpretation:
  - with higher-N local controls, OpenAI is near parity while OpenRouter Sonnet trends in the expected direction (external > internal).
  - external errors still appear under stress, while the internal path stayed error-free in these runs.
Latest Key Findings (2026-02-20, Task-Family Parity Split, Local Control, runs=8)
- OpenAI `gpt-5.2`:
  - `model_only`: external/internal `0.958x` (slight inversion, near parity).
  - `tool_only`: external/internal `1.136x` (external slower).
- OpenRouter `anthropic/claude-sonnet-4.6`:
  - `model_only`: external/internal `1.044x` (external slower).
  - `tool_only`: external/internal `1.051x` (external slower).
- Errors:
  - all four task-family runs recorded `0` internal errors and `0` external errors.
- Interpretation:
  - when isolating tool-required tasks, the architecture hypothesis holds on both providers.
  - `model_only` remains provider-sensitive; OpenAI is close to parity while Sonnet keeps the external path slower.
Latest Key Findings (2026-02-19, Phase 3 Canonical G5 Baseline, 3 seeds)
- Internal success rate mean: `1.0000`.
- External success rate mean: `0.9006`.
- External/internal latency ratio mean: `16.0603x`.
- External/internal steps ratio mean: `1.8147x`.
- Scenario split (external success mean):
  - retrieval correction: `1.0000`
  - tool-state adaptation: `0.7417`
  - confidence-gated branching: `1.0000`
Latest Key Findings (2026-02-19, Phase 3 Canonical G5 Stress, 3 seeds)
- Stress profile: tool fail every 9th request, timeout sleep every 11th request (`1.1s`), controller timeout `0.35s`, controller retries `2`.
- Internal success rate mean: `1.0000`.
- External success rate mean: `0.8782`.
- External/internal latency ratio mean: `77.1703x`.
- External/internal steps ratio mean: `1.8240x`.
- Scenario split (external success mean):
  - retrieval correction: `1.0000`
  - tool-state adaptation: `0.6833`
  - confidence-gated branching: `1.0000`
Latest Key Findings (2026-02-19, Phase 3 Uncertainty Ablation, baseline runs=8)
- Uncertainty enabled vs disabled changes success materially (same tasks, same hardware):
  - internal success: `1.0000 -> 0.7692` when internal uncertainty is disabled (`-0.2308`).
  - external success: `0.8846 -> 0.6538` when external uncertainty is disabled (`-0.2308`).
- Direction is consistent across all uncertainty sources:
  - `normalized_logprob`
  - `raw_logit_margin`
  - `hybrid`
- Interpretation:
  - uncertainty-aware branching improves loop-task completion in this benchmark.
  - this was the first-pass synthetic-signal proof; the runtime-native canonical rerun is published separately below.
Latest Key Findings (2026-02-19, Phase 3 Uncertainty Ablation Repeatability + Stress)
- Baseline repeatability set (3 seeds) confirms stable uncertainty gains:
  - internal uncertainty-on success delta mean: `+0.2308` (all three sources).
  - external uncertainty-on success delta mean: `+0.2308` (all three sources).
- Stress repeatability set (3 seeds, tool fail every 9, timeout every 11, sleep `1.1s`, controller timeout `0.35s`, retries `2`) shows:
  - internal uncertainty-on success delta mean: `+0.2308` (all three sources).
  - external uncertainty-on success delta mean: `+0.2212` (all three sources).
- Stress minus baseline:
  - internal uncertainty gain change: `0.0000`.
  - external uncertainty gain change: `-0.0096`.
- Interpretation:
  - the uncertainty-aware branching benefit is stable in this harness under both normal and stressed routing conditions.
  - the synthetic-source result had an initial runtime-native corroboration; the final canonical interpretation now uses the calibrated zero-fallback rerun below.
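The "uncertainty-on success delta" used throughout these sections is the success rate with the uncertainty signal enabled minus the matched disabled arm, averaged over seeds. A sketch with illustrative per-seed arms that reproduce the `+0.2308` baseline delta:

```python
def uncertainty_delta(on_success, off_success):
    """Mean over seeds of (uncertainty-on success - uncertainty-off success)."""
    assert len(on_success) == len(off_success)
    per_seed = [on - off for on, off in zip(on_success, off_success)]
    return sum(per_seed) / len(per_seed)

# illustrative 3-seed arms matching the published 1.0000 vs 0.7692 internal split
delta = uncertainty_delta([1.0, 1.0, 1.0], [0.7692, 0.7692, 0.7692])
print(round(delta, 4))  # 0.2308
```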
Latest Key Findings (2026-02-19, Phase 3 Uncertainty Ablation Runtime-Native Canonical Rerun, Superseded)
- Root-cause fix before rerun:
  - greedy decode uncertainty in `/monolith/compute/sample.cu` was emitting flat zeros (`mean_logprob=0`, `mean_entropy=0`).
  - the patched greedy path now computes logprob + entropy from logits (log-sum-exp).
- Initial runtime-native rerun (`runs=8`, seeds 7/11/19) showed positive uncertainty-on deltas:
  - baseline internal uncertainty success delta: `+0.1026`
  - baseline external uncertainty success delta: `+0.1155`
  - stress internal uncertainty success delta: `+0.2308`
  - stress external uncertainty success delta: `+0.2212`
- Runtime-native `int_on_ext_on` arm means:
  - baseline: internal success `0.8718`, external success `0.7853`, ext/int latency `10.9504x`
  - stress: internal success `1.0000`, external success `0.8782`, ext/int latency `74.1471x`
- Interpretation:
  - this run confirmed runtime-native plumbing after the kernel fix, but was later superseded due to fallback contamination in part of the seed set.
  - the runtime API now emits a unified `awareness` payload with both `route` and `generation` uncertainty sections (legacy `uncertainty` preserved); the Phase 3 runtime-native client now consumes `awareness.generation` first when present.
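The greedy-path fix above derives the chosen token's logprob and the step entropy from raw logits via a stable log-sum-exp, instead of emitting flat zeros. A numpy sketch of that math (the actual patch lives in CUDA in `/monolith/compute/sample.cu`; this mirrors only the per-step arithmetic):

```python
import numpy as np

def greedy_uncertainty(logits):
    """Greedy pick plus logprob/entropy derived from logits via log-sum-exp."""
    m = logits.max()
    lse = m + np.log(np.exp(logits - m).sum())   # numerically stable log-sum-exp
    logprobs = logits - lse                      # log softmax
    probs = np.exp(logprobs)
    token = int(logits.argmax())                 # greedy selection
    entropy = float(-(probs * logprobs).sum())   # distribution entropy (nats)
    return token, float(logprobs[token]), entropy

token, lp, ent = greedy_uncertainty(np.array([2.0, 1.0, 0.5, 0.5]))
assert token == 0 and lp < 0.0 and ent > 0.0
```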
Latest Key Findings (2026-02-20, Phase 3 Runtime-Native Rerun with Unified Awareness, Quality-Gated)
- Rerun profile:
  - source: `runtime_native` only
  - seeds: 7/11/19
  - baseline + stress
  - runtime configured with fast probe path (`TRENI_DEMO_LAYERS=2`)
  - client consumes unified `awareness.generation` first (legacy fallback preserved)
- Probe quality gate:
  - all runtime-native arm artifacts in this rerun have `fallback=0`, `errors=0`, and non-zero `requests`/`ok`.
- Clean rerun deltas (runtime-native):
  - baseline internal uncertainty success delta: `-0.1538`
  - baseline external uncertainty success delta: `-0.1217`
  - stress internal uncertainty success delta: `-0.1538`
  - stress external uncertainty success delta: `-0.1089`
  - stress-baseline external uncertainty delta change: `+0.0128`
- Important interpretation correction:
  - previously published positive runtime-native deltas were influenced by runtime probe fallback in part of the seed set (notably s11/s19 in older fix1 artifacts).
  - with zero-fallback runtime-native probes, this awareness3 rerun showed uncertainty-on was not yet beneficial in this harness.
  - this kept the runtime-native uncertainty wiring validated and motivated the calibration pass documented below.
Latest Key Findings (2026-02-20, Phase 3 Runtime-Native Calibration Rerun calib1, Quality-Gated)
- Calibration update:
  - runtime-native confidence is now calibrated/blended for decision usage (runtime confidence floor/ceil scaling + prior blend + optional route blend), while preserving raw runtime fields.
  - the runner now forwards calibration knobs through the ablation harness to child benchmark runs.
- Canonical rerun profile:
  - source: `runtime_native` only
  - seeds: 7/11/19
  - baseline + stress
  - calibration params: prior weight `0.75`, confidence floor `0.10`, confidence ceil `0.35`, route blend `0.10`
- Probe quality gate:
  - all runtime-native arm artifacts in this rerun have non-zero `requests`/`ok` and `fallback=0`, `errors=0`.
- Calibrated rerun deltas (runtime-native):
  - baseline internal uncertainty success delta: `+0.1539`
  - baseline external uncertainty success delta: `+0.1058`
  - stress internal uncertainty success delta: `+0.1539`
  - stress external uncertainty success delta: `+0.1154`
  - stress-baseline external uncertainty delta change: `+0.0096`
- Interpretation:
  - calibrated runtime-native uncertainty recovers positive uncertainty-on gains in both baseline and stress while staying zero-fallback.
  - C2 is re-locked for the current harness; core phases are complete, and remaining optional work is region-pinned internet-hop controls (plus higher-N where needed).
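The exact calibration formula is not reproduced in this report, so the following is only a hedged sketch of one plausible shape for the described floor/ceil rescale plus prior and route blending, using the published knob values; the `calibrate` function and its argument names are illustrative, not the runtime's API:

```python
def calibrate(raw_conf, prior, route_conf,
              prior_w=0.75, floor=0.10, ceil=0.35, route_w=0.10):
    """Hypothetical calib1-style shaping: rescale, then blend with prior/route."""
    # rescale raw runtime confidence from [floor, ceil] into [0, 1]
    scaled = min(max((raw_conf - floor) / (ceil - floor), 0.0), 1.0)
    # blend with a task prior, then optionally with route-level confidence
    blended = prior_w * prior + (1.0 - prior_w) * scaled
    return (1.0 - route_w) * blended + route_w * route_conf

c = calibrate(raw_conf=0.22, prior=0.8, route_conf=0.5)
assert 0.0 <= c <= 1.0
```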
Latest Key Findings (2026-02-20, Phase 4 Lambda Full Reruns + Paper Package)
- A100 (Phase 2):
  - cold summary: startup `1002.708 ms`, TTFT `29.657 ms`, full `32.008 ms`
  - warm request latency: mean `10.356 ms`, p99 `14.536 ms`
  - routing matrix overall: external/internal `2.4300x`, external error `0.0347`, internal error `0.0000`
- H100 (Phase 2):
  - cold summary: startup `1004.890 ms`, TTFT `56.944 ms`, full `62.064 ms`
  - warm request latency: mean `18.491 ms`, p99 `24.944 ms`
  - routing matrix overall: external/internal `2.3972x`, external error `0.0347`, internal error `0.0000`
- A100 + H100 C2 runtime-native calibrated deltas:
  - baseline internal/external: `+0.2308 / +0.2308`
  - stress internal/external: `+0.2308 / +0.2212`
- Paper package is generated and published at:
Latest Key Findings (2026-02-20, Track B Fairness-Hardened Commercial Reruns, Local Control r8)
- Harness changes applied and validated:
  - interleaved internal/external ordering (`pair_order=alternate`)
  - deterministic generation default (`temperature=0`)
  - token-usage export and `ms/completion_token` normalization
  - strict tool parity enabled for `tool_only` runs
- OpenAI `gpt-5.2`:
  - model-only ext/int: `0.971x` (near parity/slight inversion remains)
  - tool-only ext/int (strict parity): `1.038x` (internal faster)
  - model-only ms/token internal/external: `57.657 / 57.663` (effectively tied)
  - tool-only ms/token internal/external: `37.553 / 38.990` (internal better)
- OpenRouter `anthropic/claude-sonnet-4.6`:
  - model-only ext/int: `1.102x` (internal faster)
  - tool-only ext/int (strict parity): `1.063x` (internal faster)
  - model-only ms/token internal/external: `61.606 / 70.054` (internal better)
  - tool-only ms/token internal/external: `41.212 / 43.791` (internal better)
- Interpretation update:
  - fairness hardening removes most of the ambiguity for tool tasks; `tool_only` now favors internal on both providers.
  - OpenAI `model_only` remains near parity/provider-sensitive, so claim language stays task-family-stratified.
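The `ms/completion_token` normalization divides each request's latency by the number of completion tokens it produced, so answer-length differences do not masquerade as speed differences. A sketch with illustrative latencies and token counts (chosen to land near the published OpenAI model-only values; the zero-token guard behavior is an assumption):

```python
def ms_per_completion_token(latency_ms, completion_tokens):
    """Length-normalized latency; the harness's exact zero-token handling is assumed."""
    if completion_tokens <= 0:
        return None
    return latency_ms / completion_tokens

# two hypothetical runs with different answer lengths compare fairly per token
a = ms_per_completion_token(2306.3, 40)   # ~57.66 ms/token
b = ms_per_completion_token(4036.4, 70)   # ~57.66 ms/token
assert abs(a - b) < 0.5
```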
Latest Key Findings (2026-02-22, AWS G5 TTFT Kernel Pass)
- Matched setup:
  - same AWS G5 host (`g5.2xlarge`, A10G), same container, same benchmark harness.
  - baseline reference: `lt0_sync0` post-cache (`TRENI_LINEAR_USE_LT=0`, `TRENI_TENSOR_UPLOAD_SYNC=0`).
  - TTFT pass:
    - softmax/reduction parallelization (near-parity result),
    - norm kernel rewrite (`rmsnorm`/`layernorm` from single-thread to row-parallel 256-thread reductions).
- Best measured config in this pass: `norm+softmax`, `TRENI_LINEAR_USE_LT=1`, `TRENI_TENSOR_UPLOAD_SYNC=0`.
- Baseline -> best deltas:
  - cold TTFT: `16.738 -> 13.974 ms` (`1.198x` faster).
  - cold full latency: `424.685 -> 396.814 ms` (`1.070x` faster).
  - warm mean latency: `174.237 -> 147.269 ms` (`1.183x` faster).
  - warm p99 latency: `1035.823 -> 936.297 ms` (`1.106x` faster).
- Per-model cold TTFT signal:
  - `qwen`: `39.537 -> 29.411 ms` (dominant gain).
  - `donut`: `3.505 -> 2.619 ms`.
  - `bart`: near-flat (`16.523 -> 16.573 ms`), so the seq2seq-specific step0 bottleneck still needs isolation.
- Interpretation:
  - linear GEMM plumbing is no longer the main limiter for this run profile.
  - norm/reduction work gives a real TTFT lift; next targeted work should isolate the residual Bart/seq2seq path.
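The norm rewrite changes only the execution strategy (one thread per row becomes a 256-thread parallel reduction per row); the per-row math is unchanged. A numpy reference for the `rmsnorm` semantics, where the `eps` placement and weight handling are assumptions of this sketch rather than confirmed kernel details:

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    """Per-row RMS normalization: one mean-of-squares reduction, then scale."""
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

x = np.random.default_rng(1).normal(size=(4, 256)).astype(np.float32)
w = np.ones(256, dtype=np.float32)
y = rmsnorm(x, w)
assert y.shape == x.shape
```

The GPU version parallelizes the `(x * x).mean(...)` reduction across 256 threads per row, which is where the TTFT lift comes from.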
Latest Key Findings (2026-02-22, AWS G5 TTFT Follow-Up: seq_q=1 Attention Path)
- Follow-up work after step0 profiling:
  - profile gate `TRENI_STEP0_PROFILE=1` added for the stage split (`decoder_step0_embed`, `decoder_step0_layers`, `decoder_step0_logits_sample`).
  - the seq2seq/Bart profile showed step0 dominated by `decoder_step0_layers` (not embedding/logits).
  - implemented tiny-shape `seq_q=1` attention kernels (QK + PV) and direct K/V projection-to-cache in the decoder step path.
  - `TRENI_ATTN_SEQ1_USE_KERNEL` now defaults to on (`1`; set `0` to force the cuBLAS fallback).
- Previous best (`norm+softmax`, `lt1_sync0`) -> new default path:
  - cold TTFT: `13.974 -> 12.504 ms` (`1.118x` faster).
  - cold full latency: `396.814 -> 390.099 ms` (`1.017x` faster).
  - warm mean latency: `147.269 -> 143.230 ms` (`1.028x` faster).
  - warm p99 latency: `936.297 -> 924.276 ms` (`1.013x` faster).
- Bart-specific impact:
  - cold TTFT `16.573 -> 12.842 ms` (`1.29x` faster).
- 3-seed repeatability on the new default path:
  - cold TTFT `12.563 ± 0.037 ms`.
  - cold full `390.961 ± 0.270 ms`.
  - warm mean `143.297 ± 0.222 ms`.
  - warm p99 `925.668 ± 1.070 ms`.
- Parity status note:
  - the parser now classifies interleaved runtime logs correctly (fallback/failure markers are detected even when stderr is merged into tensor lines).
  - a debug rerun identified the old-container root cause: `minilm` used out-of-bounds tensor offsets in `monolith_phase3.bin`.
  - the strict gate is now resolved with the rebuilt parity container `monolith_phase3_qbm.bin` (qwen+bart+minilm): `week3_parity_qbm_report_20260222T132155Z.json` => `checked_total=3`, `failed_total=0`, `missing_decoder_models=[]`, `missing_encoder_models=[]`.
  - the runtime-on/off Bart step0 logits A/B remains numerically stable (max abs diff `~2e-6`, cosine `~1.0`).
- Interpretation:
  - this confirms the residual TTFT loss was in tiny decode-attention execution overhead, not the linear GEMM path.
  - the seq2seq step0 path now moved in the expected direction while preserving gains from the prior norm pass.
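The tiny-shape `seq_q=1` case is why dedicated kernels pay off: with a single query row, QK^T collapses to one dot product per cached key and PV to one weighted sum over cached values, so general batched-GEMM machinery is overkill. A numpy sketch of the decode-step math (shapes illustrative; the real kernels run per head in CUDA):

```python
import numpy as np

def seq1_attention(q, k_cache, v_cache):
    """Single-query decode attention: QK dot products, softmax, PV weighted sum."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = (k_cache @ q) * scale          # QK: one dot product per cached key
    scores -= scores.max()                  # numerically stable softmax
    p = np.exp(scores)
    p /= p.sum()
    return p @ v_cache                      # PV: weighted sum of cached values

rng = np.random.default_rng(2)
q = rng.normal(size=64)                     # the single decode-step query
k = rng.normal(size=(10, 64))               # cached keys
v = rng.normal(size=(10, 64))               # cached values
out = seq1_attention(q, k, v)
assert out.shape == (64,)
```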
Latest Key Findings (2026-02-22, AWS G5 Attention Backend A/B, Deconfounded)
- Setup:
  - runtime rebuilt with `WITH_CUDNN=1` and `TRENI_ATTN_BACKEND_STRICT=1`.
  - compared `TRENI_ATTN_BACKEND=custom` vs `TRENI_ATTN_BACKEND=cudnn_sdpa` in explicit proxy mode (`TRENI_ATTN_ALLOW_SDPA_PROXY=1`).
  - the reverse-order rerun is used as canonical to remove first-run cold-cache bias.
- Reverse-order canonical (`attn_backend_ab_rev_20260222T144736Z`):
  - cold TTFT: custom `6.460 ms`, cudnn `6.447 ms` (custom/cudnn `1.002x`).
  - cold full: custom `147.789 ms`, cudnn `146.707 ms` (`1.007x`).
  - warm mean: custom `53.545 ms`, cudnn `53.341 ms` (`1.004x`).
  - warm p99: custom `82.031 ms`, cudnn `80.754 ms` (`1.016x`).
- Interpretation:
  - legacy proxy mode is near parity/slightly faster in this decomposition.
  - the runtime now treats `cudnn_sdpa` as fused-only by default; proxy behavior is explicit opt-in (`TRENI_ATTN_ALLOW_SDPA_PROXY=1`).
  - a true fused cuDNN SDPA/flash-attention path is still pending.
Latest Key Findings (2026-02-22, AWS G5 Seq1 Hybrid Tuning + Fused Follow-Up)
- Setup:
  - runtime rebuilt with seq1-path tuning changes:
    - specialized `seq_q=1` softmax kernel
    - one-time cached attention env-config reads
    - optional hybrid knobs: `TRENI_ATTN_SEQ1_USE_CUBLAS_QK`, `TRENI_ATTN_SEQ1_USE_CUBLAS_PV`
  - warm matrix (12 runs, 4 warmups): default vs qk-cublas vs pv-cublas vs both-cublas.
- Warm results (`seq1_hybrid_20260222T1554Z`):
  - default: mean `54.505 ms`, p99 `82.134 ms`
  - qk-cublas: mean `54.572 ms`, p99 `81.776 ms`
  - pv-cublas: mean `54.281 ms`, p99 `80.754 ms`
  - both-cublas: mean `54.822 ms`, p99 `79.947 ms`
- Cold sanity (`seq1_hybrid_20260222T1558Z`):
  - default: TTFT `6.447 ms`, full `147.756 ms`
  - pv-cublas: TTFT `6.450 ms`, full `149.293 ms`
- Fused follow-up (`seq1_hybrid_fused_20260222T192656Z`):
  - code changes:
    - fused `seq_q=1` softmax+PV custom kernel
    - seq1 QK kernel block retune (`64/128/256` based on `head_dim`)
  - warm default: mean `54.505 -> 52.535 ms`, p99 `82.134 -> 80.554 ms`
  - warm pv-cublas: mean `54.281 -> 51.964 ms`, p99 `80.754 -> 78.519 ms`
  - cold default: TTFT `6.447 -> 6.209 ms`, full `147.756 -> 145.587 ms`
  - cold pv-cublas: TTFT `6.450 -> 6.215 ms`, full `149.293 -> 147.937 ms`
- Interpretation:
  - the fused seq1 follow-up improved both warm and cold for the default custom path.
  - pv-cublas remains the fastest warm variant in this pass, but the default custom path keeps a stronger cold-first-hit balance.
  - this closes more request-path overhead without changing model/tool behavior.
Latest Key Findings (2026-02-22, H100 cuDNN SDPA Fused Probe)
- Probe pack: `phase2_runtime/results/cudnn_sdpa_h100_probe_20260222T1935Z`.
- Environment:
  - Lambda H100 (`sm90`) probe host.
  - tested staged system cuDNN 9.19 and pip `nvidia-cudnn-cu12==9.19.0.56`.
- Results:
  - alignment sweep: `cnt=0` for `align={16,32,64,128,256}`.
  - shape/layout sweep: `tested=1440`, `supported=0`.
  - debug logs show candidate SDPA engines (`8/9/10/11`) but no viable configs:
    - `NOT_SUPPORTED_GRAPH_PATTERN` (8/9/11)
    - `NOT_SUPPORTED_ARCH_MISMATCH` (10, Blackwell-only).
- Interpretation:
  - true fused `cudnn_sdpa` is still unresolved in the current backend descriptor path, even on H100.
  - the runtime stays on explicit fused-only semantics for `cudnn_sdpa`; the proxy remains opt-in only.
Latest Key Findings (2026-02-22, Phase 3 Realistic-v1 Reruns)
- Realistic-v1 loop summary (`phase3_realistic_v1_summary_20260222T143919Z`):
  - baseline:
    - internal success `1.0000`
    - external success `0.9010`
    - external/internal latency ratio `15.8563x`
    - external/internal steps ratio `1.8037x`
  - stress:
    - internal success `1.0000`
    - external success `0.9010`
    - external/internal latency ratio `75.3563x`
    - external/internal steps ratio `1.8037x`
- Realistic-v1 uncertainty compare (`phase3_uncertainty_compare_realistic_v1_s7_20260222T144116Z`):
  - baseline uncertainty-on success deltas: internal `+0.2500`, external `+0.2500` (all 3 sources).
  - stress uncertainty-on success deltas: internal `+0.2500`, external `+0.2344` (all 3 sources).
- Interpretation:
  - richer file-backed fixtures keep the same thesis direction: internal loops are faster and more stable.
  - uncertainty-aware branching remains beneficial on realistic-v1.
What Is Still Missing Per Plan
If following the full sequence:
- Optional: add region-pinned internet multi-hop controls (Fly-to-Fly or fixed-region affinity) to reduce provider-path confounding.
- Still open: replace the `cudnn_sdpa` proxy route with a true fused cuDNN SDPA/flash-attention frontend path and rerun the A/B.
Canonical Clarification
- The full-system canonical set remains `g5-20260216-foundation`.
- Cold optimization is tracked as `g5-20260217-cold-indexcache` (latest cold-specific canonical evidence).
- Cold decomposition/collect optimization is tracked as phase2-runtime `clean4` (latest cold-stage evidence).
- External-cold canonical repeatability after the GPU-convert + host-prefetch fix is tracked in `phase2_external_cold` under `external_cold_gpuconvert_prefetch_allbackends_repeatability_20260219T203017Z`.
Artifact Pointers
- True TTFT set: `/benchmarks/g5-20260217-truettft/`
- Cold index-cache set: `/benchmarks/g5-20260217-cold-indexcache/`
- AWS G5 seq1 fused follow-up set: `/benchmarks/phase2_runtime/results/seq1_hybrid_fused_20260222T192656Z/`
- H100 fused cuDNN SDPA probe pack: `/benchmarks/phase2_runtime/results/cudnn_sdpa_h100_probe_20260222T1935Z/`
- Routing comparison set: `/benchmarks/g5-20260217-routing/`
- Routing failure stress set: `/benchmarks/phase2_internal_external/results/`
- Routing matrix report: `/benchmarks/phase2_internal_external/results/routing_matrix_20260219T005022Z.md`
- Routing split-host matrix report: `/benchmarks/phase2_internal_external/results/routing_matrix_splithost_20260219T161945Z.md`
- Routing internet multi-hop matrix report (OpenAI, repeatability): `/benchmarks/phase2_internal_external/results/routing_matrix_fly_20260220T104002Z.md`
- Routing internet multi-hop matrix report (OpenRouter, repeatability): `/benchmarks/phase2_internal_external/results/routing_matrix_fly_20260220T104550Z.md`
- Routing internet multi-hop matrix report (OpenRouter Claude Sonnet 4.6, repeatability): `/benchmarks/phase2_internal_external/results/routing_matrix_fly_20260220T112444Z.md`
- Routing local control matrix report (OpenAI): `/benchmarks/phase2_internal_external/results/routing_matrix_control_openai_20260220T115444Z.md`
- Routing local control matrix report (OpenRouter Claude Sonnet 4.6): `/benchmarks/phase2_internal_external/results/routing_matrix_control_openrouter_sonnet46_20260220T115815Z.md`
- Routing local control matrix report (OpenAI, higher-N): `/benchmarks/phase2_internal_external/results/routing_matrix_control_openai_r8_20260220T121820Z.md`
- Routing local control matrix report (OpenRouter Claude Sonnet 4.6, higher-N): `/benchmarks/phase2_internal_external/results/routing_matrix_control_openrouter_sonnet46_r8_20260220T122446Z.md`
- Routing task-family run (OpenAI model-only, higher-N): `/benchmarks/phase2_internal_external/results/internal_vs_external_control_openai_model_only_r8_20260220T123747Z.json`
- Routing task-family run (OpenRouter Sonnet model-only, higher-N): `/benchmarks/phase2_internal_external/results/internal_vs_external_control_openrouter_sonnet46_model_only_r8_20260220T123923Z.json`
- Routing task-family run (OpenAI tool-only, higher-N): `/benchmarks/phase2_internal_external/results/internal_vs_external_control_openai_tool_only_r8_20260220T124123Z.json`
- Routing task-family run (OpenRouter Sonnet tool-only, higher-N): `/benchmarks/phase2_internal_external/results/internal_vs_external_control_openrouter_sonnet46_tool_only_r8_20260220T124216Z.json`
- Track B commercial parity appendix (JSON): `/benchmarks/phase2_internal_external/results/trackb_commercial_parity_appendix_20260220T124509Z.json`
- Track B commercial parity appendix (Markdown): `/benchmarks/phase2_internal_external/results/trackb_commercial_parity_appendix_20260220T124509Z.md`
- Track B fairness-hardened commercial appendix (JSON): `/benchmarks/phase2_internal_external/results/trackb_commercial_parity_appendix_fairness_20260220T193757Z.json`
- Track B fairness-hardened commercial appendix (Markdown): `/benchmarks/phase2_internal_external/results/trackb_commercial_parity_appendix_fairness_20260220T193757Z.md`
- OpenAI model-only fairness run (`r8`): `/benchmarks/phase2_internal_external/results/internal_vs_external_control_openai_model_only_fairness_r8_20260220T193120Z.json`
- OpenAI tool-only fairness run (`r8`, strict parity): `/benchmarks/phase2_internal_external/results/internal_vs_external_control_openai_tool_only_fairness_r8_20260220T193246Z.json`
- OpenRouter Sonnet model-only fairness run (`r8`): `/benchmarks/phase2_internal_external/results/internal_vs_external_control_openrouter_sonnet46_model_only_fairness_r8_20260220T193432Z.json`
- OpenRouter Sonnet tool-only fairness run (`r8`, strict parity): `/benchmarks/phase2_internal_external/results/internal_vs_external_control_openrouter_sonnet46_tool_only_fairness_r8_20260220T193633Z.json`
- Routing internet multi-hop matrix report (OpenAI, initial exploratory): `/benchmarks/phase2_internal_external/results/routing_matrix_fly_20260220T102124Z.md`
- Routing internet multi-hop matrix report (OpenRouter, initial exploratory): `/benchmarks/phase2_internal_external/results/routing_matrix_fly_20260220T102706Z.md`
- Qwen cold upload GPU-convert ablation summary: `/benchmarks/phase2_runtime/results/cold_gpuconvert_ablation_qwen_20260219T164135Z.md`
- External cold runtime-only GPU-convert ablation summary: `/benchmarks/phase2_external_cold/results/external_cold_gpuconvert_ablation_runtime_20260219T164521Z.md`
- External cold runtime-vLLM repeatability summary: `/benchmarks/phase2_external_cold/results/external_cold_gpuconvert_fix2_runtime_vllm_repeatability_20260219T184234Z.md`
- External cold runtime-vLLM full-depth AB5 (new defaults) summary: `/benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z/summary_ab5.md`
- External cold runtime-vLLM full-depth AB5 (new defaults) summary JSON: `/benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z/summary_ab5.json`
- External cold runtime-vLLM AB5 vs prior AB3 compare: `/benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z/compare_vs_prev_linearfastdefault_ab3.md`
- External cold all-backend repeatability summary (host-prefetch fix): `/benchmarks/phase2_external_cold/results/external_cold_gpuconvert_prefetch_allbackends_repeatability_20260219T203017Z.md`
- External cold repeatability summary after seq1 multi-head default (runtime + PyTorch + vLLM): `/benchmarks/phase2_external_cold/results/external_cold_seq1mh_default_repeatability_20260224T192020Z.md`
- External cold repeatability summary after first step0 exp-reuse optimization (runtime + PyTorch + vLLM): `/benchmarks/phase2_external_cold/results/external_cold_step0expfix_repeatability_20260224T194226Z.md`
- External cold repeatability summary after second step0 shared-prob follow-up (runtime + PyTorch + vLLM, reverted): `/benchmarks/phase2_external_cold/results/external_cold_step0shared_repeatability_20260224T194913Z.md`
- Runtime cold-stability before-vs-after summary: `/benchmarks/phase2_external_cold/results/runtime_cold_stability_prefetch_compare_20260219T203017Z.md`
- External cold all-backend repeatability summary: `/benchmarks/phase2_external_cold/results/external_cold_gpuconvert_fix2_allbackends_repeatability_20260219T185610Z.md`
- Runtime cold-stability sweep summary: `/benchmarks/phase2_external_cold/results/runtime_cold_stability_gpuconvert_fix2_20260219T185738Z.md`
- Phase 3 canonical set: `/benchmarks/phase3_agentic_loops/results/`
- Phase 3 realistic-v1 summary: `/benchmarks/phase3_agentic_loops/results/phase3_realistic_v1_summary_20260222T143919Z.md`
- Phase 3 uncertainty ablation set: `/benchmarks/phase3_agentic_loops/results/phase3_uncertainty_ablation_20260219T094047Z.md`
- Phase 3 realistic-v1 uncertainty compare: `/benchmarks/phase3_agentic_loops/results/phase3_uncertainty_compare_realistic_v1_s7_20260222T144116Z.md`
- Phase 3 uncertainty baseline-vs-stress comparison: `/benchmarks/phase3_agentic_loops/results/phase3_uncertainty_compare_20260219T113526Z.md`
- Phase 3 uncertainty runtime-native canonical comparison: `/benchmarks/phase3_agentic_loops/results/phase3_uncertainty_compare_runtime_native_fix1_20260219T123933Z.md`
- Phase 3 uncertainty runtime-native awareness3 comparison (clean zero-fallback rerun): `/benchmarks/phase3_agentic_loops/results/phase3_uncertainty_compare_runtime_native_awareness3_20260220T020947Z.md`
- Phase 3 uncertainty runtime-native calibrated comparison (`calib1`): `/benchmarks/phase3_agentic_loops/results/phase3_uncertainty_compare_runtime_native_calib1_20260220T023521Z.md`
- Phase 4 paper package summary: `/benchmarks/paper_package/latest/package_summary.json`
- Phase 4 paper package markdown: `/benchmarks/paper_package/latest/paper_package.md`
- Phase 4 paper package manuscript figure manifest: `/benchmarks/paper_package/latest/manuscript/figure_manifest.json`
- Phase 4 paper package manuscript captions: `/benchmarks/paper_package/latest/manuscript/captions.md`
- Phase 4 paper package manuscript claims: `/benchmarks/paper_package/latest/manuscript/claims.md`
- AWS G5 speedpass + TTFT kernel summary: `/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_20260222_g5_summary.md`
- AWS G5 attention backend A/B summary (first order): `/benchmarks/phase2_runtime/results/attn_backend_ab_20260222T143605Z/attn_backend_ab_20260222T143605Z.md`
- AWS G5 attention backend A/B summary (reverse-order canonical): `/benchmarks/phase2_runtime/results/attn_backend_ab_rev_20260222T144736Z/attn_backend_ab_rev_20260222T144736Z.md`
- AWS G5 TTFT kernel artifact (`lt0_sync0`, norm+softmax, cold): `/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_norm_ttft_lt0_sync0_cold_20260222T121958Z.json`
- AWS G5 TTFT kernel artifact (`lt0_sync0`, norm+softmax, warm16): `/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_norm_ttft_lt0_sync0_warm16_20260222T122012Z.json`
- AWS G5 TTFT kernel artifact (
lt1_sync0, norm+softmax, cold):/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_norm_ttft_lt1_sync0_cold_20260222T122202Z.json - AWS G5 TTFT kernel artifact (
lt1_sync0, norm+softmax, warm16):/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_norm_ttft_lt1_sync0_warm16_20260222T122439Z.json - AWS G5 TTFT follow-up artifact (
lt1_sync0, seq1 tiny-kernel default, cold):/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_seq1kernel_default_lt1_sync0_cold_20260222T124156Z.json - AWS G5 TTFT follow-up artifact (
lt1_sync0, seq1 tiny-kernel default, warm16):/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_seq1kernel_default_lt1_sync0_warm16_20260222T124212Z.json - AWS G5 TTFT follow-up repeatability artifacts:
/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_seq1kernel_default_repeat1_cold_20260222T124601Z.json,/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_seq1kernel_default_repeat2_cold_20260222T124608Z.json,/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_seq1kernel_default_repeat3_cold_20260222T124615Z.json,/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_seq1kernel_default_repeat1_warm16_20260222T124630Z.json,/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_seq1kernel_default_repeat2_warm16_20260222T124634Z.json,/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_seq1kernel_default_repeat3_warm16_20260222T124638Z.json - AWS G5 parity artifacts (parser fix, root-cause debug, strict-pass rerun):
/benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_bart_report_trace_20260222.json,/benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_bart_report_trace_seq1off_20260222.json,/benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_report_trace_fix_20260222T130917Z.json,/benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_runtime_trace_fix_20260222T130917Z.log,/benchmarks/phase2_runtime/results/aws_speedpass/minilm_demo_dbg_20260222T131659Z.log,/benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_qbm_report_20260222T132155Z.json,/benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_qbm_runtime_trace_20260222T132155Z.log - Cold decomposition clean4 set:
/benchmarks/phase2_runtime/results/ - External cold canonical set:
/benchmarks/phase2_external_cold/results/ - Cross-host routing pilot matrix:
/benchmarks/phase2_internal_external/results/routing_matrix_crosshost_20260219T145513Z.md
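An inventory this long goes stale easily. A small checker like the following (a hypothetical helper, not part of the repo) can confirm that every cited artifact path still exists before claims are built on it:

```python
import tempfile
from pathlib import Path


def missing_artifacts(paths, root="."):
    """Return the artifact paths that do not exist under `root`.

    Inventory entries are written repo-rooted ("/benchmarks/...");
    the leading slash is stripped so they resolve relative to `root`.
    """
    base = Path(root)
    return [p for p in paths if not (base / p.lstrip("/")).exists()]


# Demo against a throwaway tree so the sketch is runnable anywhere.
with tempfile.TemporaryDirectory() as d:
    present = Path(d) / "benchmarks/paper_package/latest/package_summary.json"
    present.parent.mkdir(parents=True)
    present.write_text("{}")
    gone = missing_artifacts(
        ["/benchmarks/paper_package/latest/package_summary.json",
         "/benchmarks/phase9_missing/results/none.md"],
        root=d,
    )
    print(gone)  # only the nonexistent path remains
```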
Latest Qwen3.5 Status
- Prompt/token parity for the failing IFEval probe is confirmed against HF tokenization.
- The new batched prefill path is not the remaining cause of IFEval quality drift.
- Current strict one-host control lane state:
  - runtime is ahead on latency
  - runtime still trails vLLM slightly on IFEval-style instruction fidelity
- New evaluator-guided IFEval repair loop improves runtime quality over runtime control, but does not yet fully surpass vLLM control.
Latest Qwen Family Runtime Status (2026-03-10)
- Live AWS `qwen35` (`Qwen/Qwen3.5-0.8B`) is healthy again, and direct `/v1/chat/completions` now returns `inference.used=true`.
- Live AWS `qwen35_4b` (`Qwen/Qwen3.5-4B`) also performs real inference on the same host when launched with `runtime_pool_mb=15360`.
- Direct runtime smoke against the root runtime URL is now split clearly:
  - `qwen35` passes the direct tool-call smoke on AWS, including first-turn function calling and follow-up tool-result handling (`benchmarks/qwen35_smoke/results/live-qwen35-toolsmoke-root-20260310.json`).
  - `qwen35_4b` loads and infers, but still fails the same exact-output/tool-call smoke contract on the current harness (`benchmarks/qwen35_smoke/results/live-qwen35_4b-toolsmoke-root-20260310.json`).
- Backward compatibility is re-proven:
  - a fresh `Qwen/Qwen2.5-0.5B-Instruct` container was packed and proved on AWS before cleanup
  - the old `qwen2.5` host artifacts were then removed to free disk while preserving code-path compatibility
- Fresh `qwen35_9b` family wiring exists in `/Users/andrewcorrea/treni/scripts/qwen_runtime_env.py`, but the current AWS host does not yet have a packed `9B` container and is not the intended proof box for that model size.
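The smoke contract above boils down to a few checks on the response body. This sketch validates an already-parsed payload; the `inference.used` field is from the contract described above, while the `choices`/`message` shape is the usual OpenAI-style layout and is assumed here (the real harness may check more):

```python
def smoke_ok(response: dict) -> bool:
    """Check a /v1/chat/completions response body against the smoke contract:
    real inference was used, and the first choice carries either text
    content or at least one tool call."""
    if not response.get("inference", {}).get("used"):
        return False  # echo/stub path, not real inference
    choices = response.get("choices") or []
    if not choices:
        return False
    msg = choices[0].get("message", {})
    return bool(msg.get("content") or msg.get("tool_calls"))


# Minimal illustrative payload shaped like an OpenAI-style chat response.
sample = {
    "inference": {"used": True},
    "choices": [{"message": {"role": "assistant", "content": "4"}}],
}
print(smoke_ok(sample))  # True
```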
Same-VM Promotion Status (2026-03-10, 4B + 9B follow-up)
- The old negative `4B` status is now superseded.
- Root cause was a runtime parity bug in the cached linear-attention decode path for `Qwen3.5-4B`:
  - the step path repeated key heads before the depthwise-conv update instead of after it
  - Hugging Face on the same host already proved the model itself was fine
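The ordering is easiest to see in code. Below is a minimal numpy sketch of one cached decode step under assumed shapes (the runtime's real kernel, state layout, and filter differ); it shows the repaired order, where the grouped key heads go through the depthwise-conv update first and are only then repeated up to the full attention-head count:

```python
import numpy as np


def decode_step_keys(k_new, conv_state, conv_w, n_rep):
    """One cached decode step for grouped key heads (illustrative only).

    k_new:      (kv_heads, head_dim)        new key for this token
    conv_state: (kv_heads, taps, head_dim)  rolling per-kv-head window
    conv_w:     (taps,)                     depthwise filter taps
    n_rep:      heads / kv_heads repeat factor
    """
    # Shift the rolling window and append the new key.
    conv_state = np.concatenate([conv_state[:, 1:], k_new[:, None]], axis=1)
    # Depthwise conv over the time taps, still at kv_heads granularity.
    k_conv = np.einsum("t,htd->hd", conv_w, conv_state)
    # Expand grouped KV heads to full attention heads AFTER the update.
    return np.repeat(k_conv, n_rep, axis=0), conv_state
```

Performing the repeat before the conv update is the ordering the broken step path used, per the root-cause note above.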
- Repaired canonical 4B same-VM artifact: `/Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/samevm-agent-full-aws-r4-qwen35_4b_20260310T184433Z.json`
- Repaired 4B result: `15/15`
  - direct runtime smoke passes
  - direct PDF RAG passes
  - direct embed/rerank passes
  - direct TTS/STT passes
  - Hermes runtime-status/RAG/SQLite/memory/`execute_code` all pass
- Current same-VM status on AWS A10G:
  - `qwen35` (0.8B) remains the speed-first lane
  - `qwen35_4b` (4B) is now a real end-to-end valid agent lane
- Lambda 9B state:
  - account auth and SSH key are valid
  - current launch attempts are blocked by provider-side capacity / rate limiting
  - no live Lambda 9B proof host exists yet from this sweep
Latest Same-VM Agent Compare Matrix (2026-03-10)
- New clean model-dependent comparison suite:
  - `/Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/samevm-agent-compare-aws-r2-qwen35.json`
  - `/Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/samevm-agent-compare-aws-r2-qwen35_4b.json`
- Scope of this lane:
  - runtime health
  - worker health
  - direct runtime smoke
  - Hermes runtime-status
  - Hermes RAG search
  - Hermes SQLite exec/query
  - Hermes memory add/read
  - Hermes `execute_code`
- Result:
  - `qwen35` (0.8B): `10/10` pass (1.0)
  - `qwen35_4b` (4B): `2/10` pass (0.2)
- Claim-safe interpretation:
  - this selector artifact is stale and predates the repaired `4B` decoder path
  - use the repaired full suite as the current source of truth for `4B`
- Isolated speed probes: `/Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/qwen35_model_speed_compare_20260310.md`
  - 0.8B warm steady state: about `113.7 tok/s`, `ttft ≈ 95.4 ms`
  - 4B repeated warm lane: about `38.5 tok/s`, `ttft ≈ 158.9 ms`
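For reference, tok/s and TTFT figures like those above can be derived from a per-token timestamp trace as follows (a generic sketch; the actual probe script's trace format and field names may differ):

```python
def speed_metrics(token_times, request_start):
    """Compute TTFT and warm steady-state throughput from a decode trace.

    token_times:   monotonically increasing completion times (seconds),
                   one entry per generated token
    request_start: time the request was issued (seconds)

    TTFT is first-token time minus request start. Steady-state tok/s
    excludes the first token, so prefill cost does not dilute decode speed.
    """
    ttft = token_times[0] - request_start
    decode_tokens = len(token_times) - 1
    decode_span = token_times[-1] - token_times[0]
    toks_per_s = decode_tokens / decode_span if decode_span > 0 else float("nan")
    return ttft, toks_per_s


# One token every 10 ms after a 95 ms TTFT -> about 100 tok/s steady state.
times = [0.095 + 0.010 * i for i in range(101)]
ttft, tps = speed_metrics(times, request_start=0.0)
print(round(ttft, 3), round(tps, 1))  # 0.095 100.0
```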
Stub Audit Clarification (2026-03-10)
- Direct Phase 5 runtime/vLLM comparisons do not rely on Hermes tool stubs.
- Same-VM Hermes wrapper had one localized optional-import shim in `/Users/andrewcorrea/treni/scripts/hermes_same_vm_mvp.py`; that path has been fixed.
- Same-VM Hermes now loads the real file/code tools (`read_file`, `write_file`, `search_files`, `patch`, `execute_code`) after the `tools` package import fix in the wrapper.
- Live Hermes single-tool validation on AWS now shows:
  - `qwen35` can execute real `samevm_rag_search` successfully against a raw-PDF-ingested local RAG store.
  - `qwen35` can call real `execute_code`; the current small-model weakness is argument/code quality, not missing tool plumbing.
  - `qwen35` also calls `samevm_sqlite_query`, but still tends to generate malformed SQL unless tightly guided.
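Given that malformed-SQL failure mode, one way to "tightly guide" a small model is a thin guard in front of the SQLite tool that rejects anything but a single complete `SELECT` and returns a structured error instead of raising. This is a hypothetical helper, not the real worker's interface:

```python
import sqlite3


def guarded_query(conn, sql):
    """Run model-generated SQL only if it is a single, syntactically
    complete SELECT; return a structured error instead of raising."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:
        return {"ok": False, "error": "multiple statements rejected"}
    if not stripped.lower().startswith("select"):
        return {"ok": False, "error": "only SELECT is allowed"}
    if not sqlite3.complete_statement(stripped + ";"):
        return {"ok": False, "error": "incomplete SQL"}
    try:
        rows = conn.execute(stripped).fetchall()
    except sqlite3.Error as exc:
        return {"ok": False, "error": str(exc)}
    return {"ok": True, "rows": rows}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (k TEXT, v TEXT)")
conn.execute("INSERT INTO facts VALUES ('lane', 'qwen35')")
print(guarded_query(conn, "SELECT v FROM facts WHERE k = 'lane'"))
print(guarded_query(conn, "DROP TABLE facts"))
```

The structured-error shape mirrors the `400` JSON worker errors mentioned in the 2026-03-11 update, so a malformed query becomes a correctable tool result rather than a crashed turn.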
- Phase 3 loop studies still include synthetic fixture profiles by design. Those results remain useful, but they are not equivalent to the direct benchmark lane.