Treni

Benchmark Status

What has already been run, what remains, and why.

Direct Answers

  • "Did we run the start benchmark (Phase 1 baseline)?" Yes.
  • "Did we rerun after true TTFT and cold fixes?" Yes (2026-02-17 on G5).
  • "Did we run all benchmarks in the full plan?" Core plan: yes (through Phase 4 + paper package).
  • "Did we run a full-depth runtime-vLLM cold check?" Yes (2026-02-25, --layers 36, --pool-mb 16384).
  • "Is there anything else to run?" Yes: the next real blocker is thinking-mode answer emission on GPQA-style tasks. The old sampled reproducibility blocker is fixed, and the non-thinking strict lanes are green.

Latest 2026-03-11 Update: Native Hermes 4B Conversation Lane

  • Native Hermes same-VM registration is now the real integration path:
    • /_vendor/hermes-agent/tools/treni_samevm_tools.py
    • /_vendor/hermes-agent/model_tools.py
    • /_vendor/hermes-agent/toolsets.py
    • /_vendor/hermes-agent/hermes_cli/tools_config.py
  • Tool name audit on the AWS Hermes checkout is clean:
    • 73 total tool names, 73 unique
    • no duplicate execute_code, browser_*, or samevm_* registrations
  • Multi-turn 4B conversation bugfixes now in place:
    • unique runtime tool-call IDs from monolith/server/http.c
    • compact multi-turn carry-over in scripts/samevm_agent_conversation_suite.py
    • structured 400 JSON worker errors in scripts/treni_local_tool_worker.py
    • samevm_rag_ingest HTTP bridge now preserves worker-side error payloads in scripts/hermes_same_vm_mvp.py
  • Canonical repaired split workflow artifact:
    • benchmarks/same_vm_mvp/results/hemkesh-v22_20260311T020710Z.json
  • Result:
    • turn 1: local discovery + grounded facts
    • turn 2: exact facts stored in SQLite and queried back
    • turn 3: broader background stored in RAG and retrieval-checked
    • turn 4: memory note saved
    • turn 5: final recall correctly distinguishes SQLite exact facts vs RAG broader background
  • Interpretation:
    • the native Hermes 4B lane is now green for the split real-world persistence workflow,
    • the remaining non-canonical case is the single freeform turn that tries to do SQLite + RAG + memory all at once.
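
The tool-name audit above amounts to a uniqueness check over the registered names. A minimal sketch, assuming the registrations can be flattened into a plain list of strings; `audit_tool_names` and the sample list (including `browser_open`) are hypothetical and not taken from the Hermes checkout:

```python
from collections import Counter

def audit_tool_names(names):
    """Return (total, unique, duplicates) for a list of registered tool names."""
    counts = Counter(names)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    return len(names), len(counts), duplicates

# Hypothetical sample; the real audit covered 73 names.
sample = ["execute_code", "browser_open", "samevm_rag_ingest", "samevm_runtime_status"]
total, unique, duplicates = audit_tool_names(sample)
print(f"{total} total, {unique} unique, duplicates={duplicates}")
```

A clean audit is the case where total equals unique and the duplicates list is empty.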

Latest 2026-03-08 Update: Deterministic Strict Lane

  • Runtime-side request override handling is now serialized in monolith/server/http.c, so request-scoped decode overrides no longer race through process-global env state.
  • Direct runtime reproducibility proof on AWS:
    • benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_t0_r1.json
    • benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_t0_r2.json
    • repeated temperature=0 IFEval seed-7 runs are identical (score_mean=0.5625 both).
  • New deterministic one-host strict matrix:
    • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T204248Z.json
    • overall:
      • runtime score 0.295139
      • vLLM score 0.267361
      • runtime latency 824.714 ms
      • vLLM latency 1572.529 ms
    • gpqa_diamond:
      • score parity (0.166667 vs 0.166667)
      • runtime slower (671.640 ms vs 436.583 ms)
    • ifeval:
      • runtime higher score (0.423611 vs 0.368055)
      • runtime much faster (977.787 ms vs 2708.475 ms)
  • Interpretation:
    • the runtime now has a claim-safe deterministic strict lane where it wins overall on both score and latency,
    • sampled runs are now also fixed separately below.
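
The serialization fix above lives in C (monolith/server/http.c). The idea can be sketched in Python as an analogy only: request-scoped overrides that touch process-global environment state are applied and restored while holding one lock, so two requests can never see each other's values. `request_overrides`, `handle_request`, and the `DEMO_TEMPERATURE` variable are hypothetical names for this sketch.

```python
import os
import threading
from contextlib import contextmanager

_ENV_LOCK = threading.Lock()  # one lock guards all process-global override state

@contextmanager
def request_overrides(overrides):
    """Apply per-request env overrides under a lock, then restore the old values."""
    with _ENV_LOCK:
        saved = {key: os.environ.get(key) for key in overrides}
        os.environ.update(overrides)
        try:
            yield
        finally:
            for key, old in saved.items():
                if old is None:
                    os.environ.pop(key, None)
                else:
                    os.environ[key] = old

def handle_request(temperature):
    # DEMO_TEMPERATURE is a made-up variable name used only for this sketch.
    with request_overrides({"DEMO_TEMPERATURE": str(temperature)}):
        return os.environ["DEMO_TEMPERATURE"]

print(handle_request(0.0))
```

The trade-off is that decode work inside the locked region runs one request at a time. Passing overrides as explicit per-request arguments would avoid the global state entirely.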

Latest 2026-03-08 Update: Sampled Lane Fixed

  • Root cause:
    • the bug was in scripts/phase5_awareness_realbench.py, not in runtime decode math
    • the shared first-pass used for arm_a_control skipped the request seed and task-specific decode payload
  • Post-fix runtime-only sampled reproducibility probes on AWS:
    • benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_raw_seedfix_r1.json
    • benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_raw_seedfix_r2.json
  • Result:
    • repeated sampled IFEval seed-7 runs are identical (score_mean=0.3125 both)
    • all 8/8 outputs are identical across the two reruns
  • New post-fix sampled one-host strict matrix:
    • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T220806Z.json
    • overall:
      • runtime score 0.409722
      • vLLM score 0.302083
      • runtime latency 1617.187 ms
      • vLLM latency 2017.206 ms
    • gpqa_diamond:
      • runtime higher score (0.3750 vs 0.2500)
      • runtime slower (710.693 ms vs 435.823 ms)
    • ifeval:
      • runtime higher score (0.4444 vs 0.3542)
      • runtime faster (2523.680 ms vs 3598.588 ms)
  • Immediate repeatability confirmation:
    • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T221823Z.json
    • overall:
      • runtime score 0.409722
      • vLLM score 0.281250
      • runtime latency 1607.757 ms
      • vLLM latency 2008.759 ms
  • Interpretation:
    • sampled-lane drift is fixed,
    • the new sampled strict lane is promotable and runtime wins overall on both score and latency,
    • and a second full-matrix rerun stays aligned with that conclusion.
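
The root cause above was a shared first pass that did not forward the request seed. A toy sketch of why that matters, assuming nothing about the real harness: `sample_tokens` is a hypothetical stand-in for a sampled decode call, and the two seeded runs match only because the same seed reaches the random generator both times.

```python
import random

def sample_tokens(vocab, n, seed=None):
    """Draw n tokens; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)  # seed=None falls back to OS entropy
    return [rng.choice(vocab) for _ in range(n)]

vocab = ["A", "B", "C", "D"]
run1 = sample_tokens(vocab, 8, seed=7)
run2 = sample_tokens(vocab, 8, seed=7)
print(run1 == run2)  # True: same seed, same draws
```

Dropping the seed argument on any code path reintroduces run-to-run drift even when the decode math is unchanged.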

Latest 2026-03-08 Update: Larger-N Sampled Strict Confirmation

  • Stronger sampled strict matrix:
    • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T235013Z.json
  • Result (16 samples/task, 3 seeds):
    • overall:
      • runtime score 0.371528
      • vLLM score 0.296875
      • runtime latency 1255.344 ms
      • vLLM latency 1585.043 ms
    • gpqa_diamond:
      • runtime higher score (0.3750 vs 0.3125)
      • runtime slower (801.900 ms vs 433.256 ms)
    • ifeval:
      • runtime higher score (0.368056 vs 0.281250)
      • runtime faster (1708.789 ms vs 2736.831 ms)
  • Interpretation:
    • the non-thinking sampled win is now stronger than the original 8-sample pass,
    • score and latency stay runtime-positive overall, with confidence intervals that exclude zero.
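
The overall rows in this matrix are numerically consistent with an unweighted mean of the two per-task rows. That is an observation from the published numbers, not a documented property of the harness. A short check using only the values listed above:

```python
from statistics import mean

# Per-task (score, latency_ms) copied from the 16-sample, 3-seed matrix above.
runtime = {"gpqa_diamond": (0.3750, 801.900), "ifeval": (0.368056, 1708.789)}
vllm = {"gpqa_diamond": (0.3125, 433.256), "ifeval": (0.281250, 2736.831)}

def overall(per_task):
    """Unweighted mean of score and latency across tasks."""
    scores, latencies = zip(*per_task.values())
    return mean(scores), mean(latencies)

rt_score, rt_ms = overall(runtime)
vl_score, vl_ms = overall(vllm)
print(f"score delta (runtime - vLLM): {rt_score - vl_score:+.6f}")
print(f"latency delta (runtime - vLLM): {rt_ms - vl_ms:+.3f} ms")
```

Because the aggregate is unweighted, the slower GPQA lane and the faster IFEval lane contribute equally to the overall latency figure.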

Latest 2026-03-08 Update: Thinking-Mode Parity Exploration

  • First explicit thinking-mode strict matrix:
    • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T223442Z.json
  • Budget-fixed follow-up:
    • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T224358Z.json
  • Finalized follow-up:
    • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T235628Z.json
  • Lower-cost finalized follow-up:
    • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260309T010353Z.json
  • Key result:
    • the lane is no longer all-zero on gpqa_diamond; the close-form finalize pass makes it measurable
    • lower-cost finalized result (gpqa_max_tokens=256):
      • overall:
        • runtime score 0.250000
        • vLLM score 0.194444
        • runtime latency 6823.816 ms
        • vLLM latency 7503.000 ms
      • gpqa_diamond:
        • runtime score 0.166667
        • vLLM score 0.166667
        • runtime near parity on latency (7727.880 ms vs 7741.028 ms)
      • ifeval:
        • runtime score 0.333333
        • vLLM score 0.222222
        • runtime faster (5919.753 ms vs 7264.973 ms)
  • One-example long-budget probes:
    • benchmarks/phase5_awareness_realbench/results/qwen35_thinking_gpqa_oneoff_runtime_1024_20260308T230352Z.json
    • benchmarks/phase5_awareness_realbench/results/qwen35_thinking_gpqa_oneoff_vllm_1024_20260308T230352Z.json
  • Interpretation:
    • the old blockers were real and are now fixed enough to measure the lane:
      • runtime no longer hard-clips at 512,
      • long-decode host-buffer corruption is fixed,
      • close-form finalize converts length-exhausted reasoning into parseable answers
    • the better current thinking tradeoff is the reduced-budget finalized lane:
      • runtime still leads on score overall,
      • and with gpqa_max_tokens=256 it now also beats vLLM overall on latency.
  • Early GSM8K-only finalized thinking follow-up:
    • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260310T022347Z.json
    • result (32 samples/task, 3 seeds):
      • runtime score 0.197917
      • vLLM score 0.177083
      • runtime latency 7174.829 ms
      • vLLM latency 7643.231 ms
    • interpretation:
      • this extends the close-form thinking lane beyond gpqa_diamond,
      • runtime remains directionally ahead on both score and latency,
      • but the score CI still crosses zero, so this GSM8K thinking lane is exploratory, not claim-safe yet.
  • AIME25 isolated follow-up:
    • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260310T021732Z.json
    • result (8 samples, 1 seed, 512 thinking tokens, patched AIME prompts):
      • runtime score 0.0
      • vLLM score 0.0
      • runtime latency 19776.254 ms
      • vLLM latency 16092.718 ms
    • interpretation:
      • AIME25 does not recover even after an AIME-specific prompt/finalize pass adjustment,
      • so this is currently a task-family limitation, not a benchmark-wide thinking win.
  • AIME25 second-thinking recovery attempt:
    • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260309T021331Z.json
    • result (8 samples, 1 seed):
      • runtime score 0.0
      • vLLM score 0.0
      • runtime latency 21409.322 ms
      • vLLM latency 22110.402 ms
    • interpretation:
      • giving AIME a second short thinking pass did not recover score,
      • that experiment is non-canonical and should not replace the lower-cost default finalized path.
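
The finalize pass described above turns length-exhausted reasoning into a parseable answer. The source does not show how it is implemented, so the following is only a sketch of one plausible shape: `finalize_answer`, the follow-up prompt text, the 8-token finalize budget, and `fake_generate` are all hypothetical.

```python
import re

def finalize_answer(generate, prompt, max_tokens):
    """Run a thinking pass; if it hits the length limit, ask for a short final answer."""
    text, finish_reason = generate(prompt, max_tokens)
    if finish_reason == "length":
        # Second, short pass: feed the truncated reasoning back and force a closing form.
        followup = prompt + text + "\nFinal answer (single letter A-D):"
        text, finish_reason = generate(followup, 8)
    # Take the last standalone option letter; crude, but enough for a sketch.
    letters = re.findall(r"\b([A-D])\b", text)
    return letters[-1] if letters else None

def fake_generate(prompt, max_tokens):
    # Stand-in backend: truncates on the first call, answers on the finalize call.
    if prompt.endswith("(single letter A-D):"):
        return " C", "stop"
    return "Let me compare the options step by step", "length"

print(finalize_answer(fake_generate, "Which option is correct?", 256))  # C
```

The second pass adds latency to every truncated example, which fits the observation above that a smaller gpqa_max_tokens budget gives the better overall tradeoff.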

Late 2026-03-08 Update: Fast Sampler + Tie-Stable AB3

  • After the hybrid prefill fix, sampled decode became the dominant remaining hotspot.
  • Focused GPQA probe after fast top-k sampling:
    • q35-gpqa-profile-aws-samplefast1_20260308T003727Z.json
    • first-call moves:
      • decoder_step0_logits_sample 40.701 -> 3.538 ms
      • decoder_ttft 1019.079 -> 982.663 ms
    • step-N moves:
      • decoder_stepN_sample_mean 37.090 -> 2.366 ms
      • decoder_stepN_total_mean 47.748 -> 12.721 ms
  • First clean strict AB3 after fast sampling:
    • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T003749Z.json
    • overall:
      • runtime score 0.305556
      • vLLM score 0.347222
      • runtime latency 1405.707 ms
      • vLLM latency 1676.336 ms
    • interpretation:
      • runtime flipped to a real overall latency win,
      • but quality regressed enough that this run was not promotable as canonical.
  • Tie-stable sampler follow-up:
    • one-seed proof: phase5_qwen35_remote_strict_matrix_20260308T004511Z.json
      • runtime wins both score (0.4375 vs 0.375) and latency (1497.984 ms vs 2026.199 ms) on seed 7
    • full AB3 rerun: phase5_qwen35_remote_strict_matrix_20260308T004758Z.json
    • overall:
      • runtime score 0.315972
      • vLLM score 0.347222
      • runtime latency 1422.818 ms
      • vLLM latency 1659.878 ms
    • task split:
      • gpqa_diamond: runtime better score (0.291667 vs 0.208333) but still slower
      • ifeval: runtime lower score (0.340278 vs 0.486111) but much faster
  • Interpretation:
    • the runtime now has a clean strict latency lead on this Qwen3.5 one-host matrix,
    • the remaining work is score recovery, not another large latency rescue.
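
The source names a fast top-k sampler and a tie-stable follow-up but does not describe their internals. As a hedged illustration of what tie stability can mean, the sketch below breaks equal logits by ascending token id, so the selected candidates never depend on incidental ordering. `top_k_sample` is a hypothetical function, not the runtime's sampler.

```python
import math
import random

def top_k_sample(logits, k, temperature, rng):
    """Sample a token id from the k highest logits with deterministic tie-breaking."""
    # Sort by logit descending, then by token id ascending, so equal logits
    # always resolve the same way.
    ranked = sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]
    if temperature == 0:
        return ranked[0]  # greedy: highest logit, lowest id on ties
    top = logits[ranked[0]]
    weights = [math.exp((logits[i] - top) / temperature) for i in ranked]
    return rng.choices(ranked, weights=weights, k=1)[0]

logits = [1.0, 3.0, 3.0, 0.5]
print(top_k_sample(logits, k=2, temperature=0, rng=random.Random(7)))  # 1
```

With a fixed seed and a fixed tie rule, repeated runs select from the same candidate set in the same order, which is a precondition for the reproducibility checks reported elsewhere in this document.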

Late 2026-03-08 Update: Batched Hybrid Qwen3.5 Prefill

  • Qwen3.5 hybrid prompt prefill is now materially improved on AWS:
    • new code paths:
      • monolith/models/decoder.cu
      • monolith/main.c
      • monolith/include/treni_models.h
    • changes:
      • batched linear-attention hidden forward for sequence prefill,
      • batched full-attention prefill with K/V cache materialization,
      • hybrid layer-major prefill in main.c instead of token-by-token fallback for Qwen3.5 prompt runs.
  • Focused GPQA profile progression:
    • old clean profile: q35-gpqa-profile-aws-clean_20260307T220200Z.json
      • decoder_prefill=3263.527 ms
      • decoder_ttft=3317.441 ms
    • linear-batch profile: q35-gpqa-profile-aws-linearbatch_20260307T235448Z.json
      • decoder_prefill=1341.628 ms
      • decoder_ttft=1405.739 ms
    • full-batch profile: q35-gpqa-profile-aws-fullbatch_20260308T000420Z.json
      • decoder_prefill=275.372 ms
      • decoder_ttft=1017.876 ms
  • Latest strict AB3 summary:
    • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T000429Z.json
    • overall:
      • runtime score 0.413195
      • vLLM score 0.347222
      • runtime latency 2940.172 ms
      • vLLM latency 1686.263 ms
    • task split:
      • gpqa_diamond: runtime 0.458333 vs vLLM 0.208333, runtime 1347.582 ms vs vLLM 512.075 ms
      • ifeval: runtime 0.368055 vs vLLM 0.486111, runtime 4532.763 ms vs vLLM 2860.452 ms
  • Interpretation:
    • this older AB3 still matters because it proved prompt prefill was a real architectural blocker and fixable,
    • but it is no longer the latest strict state after the fast-sampler reruns above.
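
The profile progression above can be restated as prefill speedups plus the part of TTFT that the profile does not attribute to decoder_prefill. The numbers are copied from the three profiles. The "non-prefill" label is just the difference of the two reported fields, and the source does not break it down further.

```python
# (decoder_prefill_ms, decoder_ttft_ms) copied from the three profiles above.
profiles = {
    "clean": (3263.527, 3317.441),
    "linearbatch": (1341.628, 1405.739),
    "fullbatch": (275.372, 1017.876),
}

base_prefill, _ = profiles["clean"]
for name, (prefill, ttft) in profiles.items():
    speedup = base_prefill / prefill
    residual = ttft - prefill  # TTFT time not attributed to prefill
    print(f"{name}: prefill speedup {speedup:.2f}x, non-prefill TTFT {residual:.3f} ms")
```

Prefill itself shrinks by roughly an order of magnitude in the full-batch profile, while TTFT falls by a smaller factor because the non-prefill remainder grows.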

Late 2026-03-07 Update: ORPO Reload + Cache-Tier A/B

  • Same-VM ORPO reload loop is now real on AWS:
    • artifact: benchmarks/same_vm_mvp/results/samevm-orpo-reload-aws_20260307T222341Z.json
    • path proved:
      • local ORPO job finishes,
      • adapter output is merged into a full HF model dir,
      • merged model is packed into a new monolith container,
      • a second runtime is restarted against that new container,
      • that runtime answers a real chat request.
  • Important scope note:
    • this proof currently uses the Qwen2.5 ORPO demo model path, not the main Qwen3.5 strict benchmark target.
    • the harness/control-plane path is real; the remaining work is promoting the same self-improvement loop onto the main target family.
  • Qwen3.5 smarter shared-prefix tiering now has a clean runtime-side A/B:
    • direct sequential GPQA profile:
      • q35-gpqa-profile-aws-seq2-cap112_20260307T222540Z.json
      • second related request improved:
        • decoder_prefill 2696.101 -> 2544.202 ms
        • decoder_ttft 2747.697 -> 2595.907 ms
    • clean strict seed-7 spot A/B:
      • cap112: phase5_qwen35_remote_strict_matrix_20260307T223218Z.json
      • cap64: phase5_qwen35_remote_strict_matrix_20260307T223555Z.json
    • runtime-only latency effect (112 - 64):
      • overall: -363.908 ms
      • gpqa_diamond: -420.699 ms
      • ifeval: -307.116 ms
    • quality effect on this one-seed spot:
      • overall score unchanged (0.291667 both),
      • per-task scores moved in opposite directions (gpqa down, ifeval up), so this is not a quality claim yet.
  • Non-canonical artifact note:
    • phase5_qwen35_remote_strict_matrix_20260307T222736Z.json is contaminated and should not be cited.
    • cause: an ORPO demo runtime was still alive on port 18081 and holding GPU memory during that A/B run.
    • the clean strict comparison is 20260307T223218Z vs 20260307T223555Z.
  • Qwen3.5 strict launcher/config drift is now corrected across the strict AWS runner and same-VM worker/runtime launcher:
    • shared env source: scripts/qwen_runtime_env.py
    • clean AB3 artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T231500Z.json
    • effect vs the older clean cap64 AB3 (20260307T225716Z):
      • runtime overall score: 0.333334 -> 0.335648
      • runtime overall latency: 3801.258 -> 3690.124 ms
      • runtime ifeval score: 0.333333 -> 0.421296
      • runtime gpqa_diamond latency: 2857.693 -> 2753.838 ms
    • current clean paired result:
      • overall: runtime 0.335648 vs vLLM 0.291667, runtime 3690.124 ms vs vLLM 1646.672 ms
      • gpqa_diamond: runtime 0.25 vs vLLM 0.25, runtime 2753.838 ms vs vLLM 529.098 ms
      • ifeval: runtime 0.421296 vs vLLM 0.333333, runtime 4626.410 ms vs vLLM 2764.246 ms
    • interpretation: score-side evidence improved, but the remaining blocker is still long-prompt prefill latency.
    • code-level explanation:
      • monolith/models/decoder.cu rejects treni_decoder_forward_f32(...) for ctx->is_linear_attn,
      • Qwen3.5 linear-attention is therefore only covered by cached/token decode today,
      • so strict long-prompt Qwen3.5 prefill still runs through the token-by-token cached loop in monolith/main.c.
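
The contaminated artifact noted above came from a leftover runtime still listening on port 18081. A small preflight sketch that would catch that class of problem before an A/B run, assuming the stale process accepts TCP connections on localhost; `port_in_use` and `preflight` are hypothetical helpers, and the check says nothing about GPU memory held by processes that are not listening.

```python
import socket

def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0

def preflight(ports):
    """Raise if any port that should be idle is still being served."""
    busy = [p for p in ports if port_in_use(p)]
    if busy:
        raise RuntimeError(f"stale listeners on ports {busy}; stop them before benchmarking")
```

Calling `preflight([18081])` at the start of a strict run would abort early instead of producing an artifact that later has to be marked non-canonical.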

Qwen3.5 One-Host Strict Rerun + Request-Path Fixes (2026-03-07)

  • New contract-validation artifacts on the active AWS host:
    • tokenizer audit: benchmarks/qwen35_tokenizer_audit/results/qwen35-tokenizer-audit-active_20260307T173024Z.json
    • runtime smoke: benchmarks/qwen35_smoke/results/qwen35-runtime-smoke-active2_20260307T173132Z.json
    • isolated semantic A/B: benchmarks/qwen35_smoke/results/qwen35-isolated-ab-active_20260307T173228Z.json
  • New strict one-host matrix runner:
    • scripts/phase5_qwen35_remote_strict_matrix.py
  • New late strict one-host matrix summary after request-path fixes:
    • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T191653Z.json
    • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T191653Z.md
  • Contract status:
    • packed tokenizer/full vocab now matches HF exactly for Qwen/Qwen3.5-0.8B (248077 tokens),
    • runtime extended non-thinking smoke passes 7/7 cases on the active AWS host,
    • isolated non-thinking probe A/B is mixed but useful:
      • runtime all_ok=true,
      • vLLM all_ok=false in that probe harness because current text-only launch rejects multimodal placeholders and the forced-thinking exact-output case still ends at finish_reason=length.
  • Request-path changes validated before the late rerun:
    • Qwen3.5 decoder prefix cache is now default-on (TRENI_DECODER_PREFIX_CACHE=1, 64 prefix tokens),
    • timing.ttft_ms now includes request-path pre-decode time plus decoder-first-token timing, not only the decode-loop step-0 proxy,
    • on the repeated prompt-family hot probe on AWS, the second related request (with a logged prefix-cache hit) dropped infer_ms from ~1798.5 to 842.4 ms and ttft_ms from ~1531.9 to 782.5 ms.
  • Strict one-host realbench result (gpqa_diamond+ifeval, Arm A, seeds 7/17/27, 8/task, request_logprobs=false):
    • overall score: runtime 0.333333 vs vLLM 0.315972
    • overall latency: runtime 3809.745 ms vs vLLM 1626.068 ms
    • task split:
      • gpqa_diamond: runtime 0.291667 vs vLLM 0.291667, runtime 2867.493 ms vs vLLM 418.173 ms
      • ifeval: runtime 0.375000 vs vLLM 0.340278, runtime 4751.996 ms vs vLLM 2833.964 ms
  • Status impact:
    • Qwen3.5 compatibility is no longer the question; tokenizer/chat/tool contract is working.
    • Score is no longer behind on this strict set.
    • The remaining blocker is request-path latency, especially benchmark-prompt prefill behavior.
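
The prefix cache mentioned above is keyed on a fixed number of leading prompt tokens (64 by default). A toy sketch of that lookup pattern, assuming nothing about the runtime's actual data structures: the `PrefixCache` class is hypothetical, and `compute_state` stands in for whatever prefill work the real cache avoids.

```python
PREFIX_TOKENS = 64  # matches the default prefix length quoted above

class PrefixCache:
    """Toy cache keyed by the first PREFIX_TOKENS token ids of a prompt."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def lookup_or_compute(self, token_ids, compute_state):
        key = tuple(token_ids[:PREFIX_TOKENS])
        if len(key) == PREFIX_TOKENS and key in self._store:
            self.hits += 1
            return self._store[key]
        state = compute_state(key)  # stand-in for prefilling the prefix
        if len(key) == PREFIX_TOKENS:
            self._store[key] = state
        return state

cache = PrefixCache()
shared = list(range(64))
cache.lookup_or_compute(shared + [900, 901], compute_state=len)
cache.lookup_or_compute(shared + [902], compute_state=len)
print(cache.hits)  # 1
```

A cache like this only helps prompts that share their first 64 tokens, which fits the report that the gain shows up on the second related request of a prompt family.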

Same-VM Wrapper Recovery (2026-03-07)

  • The explicit same-VM AWS wrapper is now recovered and usable:
    • benchmarks/same_vm_mvp/results/samevm-q35-stack_20260307T172158Z.json
  • Current entrypoints:
    • scripts/hermes_same_vm_mvp.py
    • scripts/run_samevm_qwen35_stack.sh
  • What changed in the harness:
    • runtime prompt-token cap is now passed explicitly (4096) for Hermes-started Qwen3.5 runs,
    • system prompt only advertises tools actually loaded in the session,
    • wrapper no longer loads unrelated builtin tools by default,
    • final wrapper response is now deterministically rewritten from tool outputs when the model emits malformed JSON-like summaries.
  • End-to-end result on the recovered wrapper path:
    • runtime health: ok
    • extended smoke: PASS, 7/7 cases
    • case latencies:
      • plain_chat: 234.641 ms
      • multi_turn_memory: 422.339 ms
      • multimodal_content_items: 647.002 ms
      • tool_call_first_turn: 3080.0 ms
      • tool_followup_after_result: 2596.057 ms
      • thinking_plain_chat: 1194.161 ms
      • tool_followup_after_result_no_tool_call_id: 3339.867 ms
  • Current caveat:
    • runtime log still shows an intermittent first tool-turn retry in one observed wrapper run (compute/ops.cu:765, invalid argument during prefill gather). The request recovered and the smoke artifact still passed, but this retry path is not yet closed.
  • New multimodal same-VM tool surface is now wired in code:
    • status/embed/rerank/tts/stt live in:
      • scripts/samevm_multimodal_models.py
      • scripts/treni_local_tool_worker.py
      • scripts/hermes_same_vm_mvp.py
      • scripts/samevm_stack_probe.py
    • defaults:
      • embedding: Qwen/Qwen3-VL-Embedding-2B
      • reranker: Qwen/Qwen3-VL-Reranker-2B
      • tts: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
      • stt: Qwen/Qwen3-ASR-0.6B
      • whisper fallback: supported when model contains whisper
    • worker-level smoke on POST /v1/mm/status passes and reports the AWS machine state accurately.
    • runtime-admin proof on AWS is now clean:
      • benchmarks/same_vm_mvp/results/samevm-q35-runtime-admin-proof-v5_20260307T212852Z.json
      • Hermes calls samevm_runtime_status + samevm_multimodal_status and the wrapper deterministically rewrites the final summary from tool outputs if the model truncates.
    • first real same-VM local-tool stack proof is complete on AWS:
      • benchmarks/same_vm_mvp/results/samevm-stack-probe-aws-v5.json
      • covered in one pass: runtime status, SQLite exec/query, RAG ingest/search, TTS, Qwen ASR STT, embedding, reranking
      • observed outputs:
        • SQLite rows: 1
        • RAG top hit: Same VM locality
        • TTS output path: /home/ubuntu/treni/benchmarks/same_vm_mvp/results/samevm_probe_tts.wav
        • Qwen ASR transcript: usable but still imperfect on the synthetic voice (Treni was still heard as Trinity)
        • embedding dim: 2048
        • rerank top document: the local-inference sentence ranked first
    • current caveat:
      • timestamped STT still depends on the forced-aligner path and enough local disk to materialize that model on the AWS box
    • new ORPO control-plane proof is complete:
      • benchmarks/same_vm_mvp/results/samevm-orpo-probe-aws_20260307T215307Z.json
      • one local dataset write + background ORPO train + job polling cycle completed with returncode=0
      • hot-reload of trained output back into the runtime is still not implemented
    • new operational fix:
      • the local multimodal worker was retaining about 13.3 GiB of GPU memory after model loads, which can starve the runtime and invalidate latency experiments
      • mitigation now exists via POST /v1/mm/clear_cache
      • status endpoint now exposes loaded multimodal models and current CUDA allocation/reservation
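
The wrapper behavior described above, rewriting the final response from tool outputs when the model's summary is malformed, can be sketched as a parse-then-fallback step. This is an illustration under assumptions: `final_summary` is hypothetical, and the tool output shapes are invented for the example.

```python
import json

def final_summary(model_text, tool_outputs):
    """Use the model's JSON summary if it parses; otherwise rebuild it from tool outputs."""
    try:
        parsed = json.loads(model_text)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass
    # Deterministic fallback: sorted keys, values taken verbatim from the tools.
    return {name: tool_outputs[name] for name in sorted(tool_outputs)}

# Invented tool payloads; only the tool names come from this document.
tools = {"samevm_runtime_status": {"health": "ok"}, "samevm_multimodal_status": {"loaded": []}}
print(final_summary('{"health": "ok", ', tools))
```

Sorting the keys keeps the fallback output stable across runs, so a truncated model reply does not change the recorded result.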

Canonical Same-VM MVP (2026-03-10)

  • The canonical investor-demo same-VM MVP is now green on AWS:
    • benchmarks/same_vm_mvp/results/samevm-full-mvp-aws-v15.json
    • benchmarks/same_vm_mvp/results/samevm-full-mvp-aws-v15.md
  • What the canonical v15 run proves in one flow:
    • local Qwen3.5 runtime health: ok
    • local tool worker health: ok
    • Hermes runtime-status tool call: ok
    • Hermes multimodal-status tool call: ok
    • direct same-VM runtime smoke: all_ok=True on the basic non-thinking profile (5 cases, includes first-turn tool calling)
    • direct same-VM runtime thinking smoke: all_ok=True on the extended/thinking profile with exact-match checks
    • local stack probe: SQLite + RAG + embedding + reranking + TTS + Qwen ASR STT all pass
    • Qwen3.5 ORPO reload proof is available and reused from the latest successful sidecar artifact:
      • benchmarks/same_vm_mvp/results/samevm-orpo-reload-q35-fixed.json
    • sidecar cleanup now stops cleanly on port 18081
    • multimodal cache clear runs at the end and returns GPU memory close to idle
  • Additional post-v15 Hermes tool proofs on AWS:
    • SQLite query via Hermes: benchmarks/same_vm_mvp/results/hermes-demo-sqlite-query-v2.json
      • returned row count 1 from demo_notes_v3
    • RAG search via Hermes: benchmarks/same_vm_mvp/results/hermes-demo-rag-search-v1.json
      • returned a valid top result for same machine
    • TTS via Hermes: benchmarks/same_vm_mvp/results/hermes-tts-v2.json
      • generated /home/ubuntu/treni/benchmarks/same_vm_mvp/results/hermes_tts_v2.wav
    • STT via Hermes: benchmarks/same_vm_mvp/results/hermes-stt-v2.json
      • transcribed the generated WAV successfully
  • Current observed v15 stack outputs:
    • SQLite rows: 1
    • RAG top hit: Same VM locality
    • embedding dim: 2048
    • top reranked text: Treni keeps inference and tools on one local machine.
    • TTS output path: /home/ubuntu/treni/benchmarks/same_vm_mvp/results/samevm_probe_tts.wav
    • Qwen ASR STT transcript is directionally correct but still imperfect on synthetic voice ("Trinity" drift observed in the current probe)
  • Live speed snapshot on the current AWS Qwen3.5 runtime (2026-03-10):
    • 3 deterministic runs on a 130-token response
    • mean infer_ms: about 1156.9
    • mean ttft_ms: about 98.6
    • mean end-to-end throughput: 112.37 tok/s
    • mean decode-only throughput: 121.90 tok/s
  • Live current-model speed probe on AWS (2026-03-10):
    • qwen35 (0.8B): 128 completion tokens in about 1111.4 ms, ttft_ms≈103.1, decode_tps≈115.37
    • qwen35_4b (4B): 128 completion tokens in about 3313.3 ms, ttft_ms≈170.7, decode_tps≈38.64
  • Real-world document caveat:
    • current same-VM RAG ingests text payloads, text files, and raw PDF paths directly
    • live worker proof now ingests /home/ubuntu/treni/benchmarks/same_vm_mvp/data/manual-pncp-api.pdf through samevm_rag_ingest(paths=[...])
  • Runtime compatibility note:
    • the live runtime now accepts both /v1/chat/completions and /chat/completions
    • the live runtime now exposes both /v1/models and /models
    • Hermes can therefore target the runtime root URL directly on AWS without wrapper-specific path rewriting
  • Scope note:
    • the canonical MVP acceptance gate now includes both the basic non-thinking runtime smoke lane and the extended/thinking runtime smoke lane.
    • the extended non-thinking lane now also passes on AWS (benchmarks/qwen35_smoke/results/postmvp-extended_20260308T185130Z.json, 7/7 cases).
    • latest passing thinking artifact: benchmarks/qwen35_smoke/results/postmvp-extended-thinking-r4_20260308T193858Z.json.
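
The two throughput figures in the live speed snapshot can be approximately reproduced from the reported means. The sketch assumes decode-only throughput excludes the first token and the time to produce it; that assumption is not stated in the source. The inputs are also rounded ("about") values, so the result lands near the reported numbers rather than exactly on them.

```python
def throughput(tokens, infer_ms, ttft_ms):
    """Return (end_to_end_tps, decode_only_tps) for one response."""
    end_to_end = tokens / (infer_ms / 1000.0)
    # Assumption: decode-only excludes the first token and the time to produce it.
    decode_only = (tokens - 1) / ((infer_ms - ttft_ms) / 1000.0)
    return end_to_end, decode_only

e2e, decode = throughput(tokens=130, infer_ms=1156.9, ttft_ms=98.6)
print(f"end-to-end {e2e:.2f} tok/s, decode-only {decode:.2f} tok/s")
```

Separating the two numbers shows how much of the response time goes to the first token versus steady-state decoding.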

Clean GPQA Runtime Profile (2026-03-07)

  • New direct runtime profile artifact:
    • benchmarks/phase5_awareness_realbench/results/q35-gpqa-profile-aws-clean_20260307T220200Z.json
  • New probe runner:
    • scripts/q35_gpqa_profile_once.py
  • Method:
    • restart the runtime cleanly on AWS with TRENI_STEP0_PROFILE=1 and TRENI_DECODE_STAGE_PROFILE=1
    • send the same real GPQA prompt twice through the Qwen3.5 runtime API path
    • parse timing lines from the managed runtime log
  • Result:
    • call 1:
      • decoder_tensor_upload: 218.091 ms
      • decoder_prefill: 3263.527 ms
      • decoder_ttft: 3317.441 ms
    • call 2:
      • decoder_tensor_upload: 11.216 ms
      • decoder_prefix_cache_copy: 0.162 ms
      • decoder_prefill: 2690.001 ms
      • decoder_ttft: 2750.672 ms
    • step-0 decode is not the main limiter:
      • decoder_step0_layers: about 8 ms
      • decoder_step0_logits_sample: about 33-36 ms
  • Interpretation:
    • the current strict GPQA latency gap is still dominated by prefill, not tokenizer cost and not decoder step-0.
    • prefix cache helps, but only partially on this prompt family.
    • next optimization target remains long-prompt prefill/kernel path, not sampling logic.

Qwen3.5 Probe Matrix + Same-VM MVP (2026-03-06)

  • New tokenizer/full-vocab audit is complete:
    • benchmarks/qwen35_tokenizer_audit/results/runtime-q35-tokenizer-audit-r4_20260306T190418Z.json
    • result: packed runtime tokenizer matches HF exactly at full-vocab level for Qwen/Qwen3.5-0.8B (248077 tokens), with control probes like <think>, <|im_start|>, <|vision_start|>, and <|image_pad|> all aligned.
  • New endpoint smoke/probe work is complete:
    • base smoke: benchmarks/qwen35_smoke/results/runtime-q35-smoke-r2_20260306T190530Z.json
    • consolidated matrix: benchmarks/qwen35_smoke/results/qwen35-probe-matrix-r2_20260306T200035Z.json
  • Probe matrix summary (profile=extended, same cases on both backends):
    • runtime non-thinking: all_ok=true
    • runtime thinking: all_ok=true, but outputs are verbose and tool path is very slow
    • vLLM non-thinking: all_ok=false
    • vLLM thinking: all_ok=false
  • Important case-level interpretation:
    • runtime non-thinking is the strongest current functional lane for Qwen3.5:
      • plain_chat: 387.672 ms
      • multi_turn_memory: 573.434 ms
      • tool_call_first_turn: 5885.725 ms
      • tool_followup_after_result: 4406.168 ms
    • vLLM non-thinking is much faster on tool path:
      • plain_chat: 112.543 ms
      • tool_call_first_turn: 1162.850 ms
      • tool_followup_after_result: 490.202 ms
    • vLLM failures in this matrix are concrete and expected from launch/config:
      • multimodal placeholder case fails because current launch is --language-model-only
      • several thinking/exact-output cases stop at finish_reason=length
  • Same-VM harness status:
    • Hermes same-VM Qwen3.5 smoke succeeds:
      • benchmarks/same_vm_mvp/results/hermes-samevm-q35-smoke-r5_20260306T192703Z.json
    • Hermes same-VM ORPO smoke-train succeeds and launches a real job:
      • benchmarks/same_vm_mvp/results/hermes-samevm-q35-orpo-smoke-r1_20260306T194152Z.json
    • local worker ORPO run completed successfully on-host:
      • training output: benchmarks/same_vm_mvp/trainings/samevm-orpo-qwen25-smoke3/
  • Status impact:
    • Qwen3.5 runtime is now functionally testable and smoke-clean in a real same-VM harness.
    • The main blocker is no longer “does it run?”; it is the long-prompt/tool latency gap and thinking-mode output discipline.

Phase 5 Strict Parse-Fix AB3 (2026-03-04)

  • New paired AB3 summary (gpqa_diamond+ifeval, Arm A, seeds 7/17/27, 16/task, request_logprobs=false) is published:
    • benchmarks/phase5_awareness_realbench/results/phase5_q35_runtime_vs_vllm_ifeval_gpqa_ab3_20260304T034227Z.json
    • benchmarks/phase5_awareness_realbench/results/phase5_q35_runtime_vs_vllm_ifeval_gpqa_ab3_20260304T034227Z.md
  • Outcome:
    • Overall score: runtime 0.3403 vs vLLM 0.3229 (small runtime edge, CI includes parity).
    • Overall latency: runtime 1772.931 ms vs vLLM 1553.034 ms (runtime slower on aggregate due to GPQA).
    • Task-family split:
      • gpqa_diamond: score parity, runtime latency deficit remains large.
      • ifeval: runtime is both faster and slightly higher-scoring.
  • Status impact:
    • the strict matrix is now better framed as task-family stratified; it does not yet show universal runtime superiority.
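
"CI includes parity" means the confidence interval for the paired score difference contains zero. The source does not say how the harness computes its intervals, so the sketch below uses a percentile bootstrap as one common choice. The deltas are synthetic and for illustration only, and `bootstrap_ci` is a hypothetical helper.

```python
import random

def bootstrap_ci(deltas, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired score deltas."""
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Synthetic per-item deltas (runtime minus vLLM), for illustration only.
deltas = [1, 0, 0, -1, 0, 1, 0, 0, -1, 0, 0, 1, 0, 0, 0, 0]
lo, hi = bootstrap_ci(deltas)
print("CI includes parity" if lo <= 0 <= hi else "CI excludes parity")
```

A small positive mean delta with an interval that spans zero is the pattern described above: a directional edge that is not yet claim-safe.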

Phase 5 Real-Benchmark Update (2026-03-01)

  • Canonical diagnostic run is now complete on the active G5 host:
    • phase5_awareness_realbench_qwen-realbench-r5-tokenizerfix2_20260301T114510Z.json
  • Runtime fixes validated before this run:
    • full message aggregation in HTTP path (system + user, not just last message),
    • prompt cap default increased (32 -> 256),
    • tokenizer BPE merges + added_tokens loading + improved pretokenization/decode behavior.
  • r5 key outcomes (max-samples-per-task=8):
    • gpqa_diamond: A=0.500, B=0.500, C=0.375
    • ifeval: A=0.5625, B=0.5625, C=0.5625
    • gsm8k: A/B/C=0.0
    • aime25: A/B/C=0.0
  • Qwen-template auto mode A/B (r6: phase5_awareness_realbench_qwen-realbench-r6-qwentpl1_20260301T120235Z.json) regressed quality and latency vs r5, so this mode is kept opt-in-only (env-controlled) and not canonical.
  • HF-reference parity run on the same sampled set is now complete:
    • phase5_hf_reference_qwen_r5_20260301T1900Z.json
    • score deltas (HF minus runtime Arm A): gpqa -0.25, ifeval +0.0625, gsm8k 0.0, aime25 0.0
    • key claim-safe interpretation: GSM8K/AIME 0.0 is not runtime-only breakage in this setup (HF control is also 0.0).
  • Current status:
    • first real-data set is run and documented,
    • claim-safe parity interpretation is now locked for this sampled set,
    • next open work is raising the math-task quality floor (prompt/eval/model-task fit), not proving that runtime-vs-HF parity exists.

Phase 5 + Qwen05 Follow-up (2026-03-02)

  • qwen05 deterministic empty-completion parity gap is now resolved in runtime:
    • root cause fixed in HTTP chat-template build (inject default Qwen system preamble when no system message is present),
    • validation artifact (runtime+vLLM): benchmarks/phase2_external_cold/results/external_cold_qwen05_templatefix_20260302T154019Z.json
    • same harness rerun (--no-vllm-ignore-eos): benchmarks/phase2_external_cold/results/external_cold_qwen05_templatefix_nofixeos_20260302T154151Z.json
    • key signal: runtime now returns non-empty output (usage_completion_tokens=3, completion_chars=241) instead of stopping at token 0.
  • qwen05 Phase 5 diagnostic rerun completed after parity fix:
    • benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen05-realbench-r2-templatefix1_20260302T154443Z.json
    • quality remains low on this small model/sample (A/B/C all 0.0 across tasks in this run), so this is a correctness fix, not a quality win.
  • canonical qwen rerun with matched depth and sample count is now complete:
    • new run (layers=36, max_samples_per_task=8): benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen-realbench-r9-templatefix1-l36s8_20260302T161123Z.json
    • prior canonical reference: benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen-realbench-r5-tokenizerfix2_20260301T114510Z.json
    • r9 outcomes:
      • gpqa_diamond: A/B/C = 0.125 / 0.125 / 0.125
      • ifeval: A/B/C = 0.500 / 0.5625 / 0.5625
      • gsm8k: A/B/C = 0.625 / 0.625 / 0.750
      • aime25: A/B/C = 0.000 / 0.000 / 0.125
      • overall awareness deltas vs A: B +0.015625, C +0.078125
    • interpretation:
      • math floor improved materially (gsm8k, aime25 Arm C) vs r5,
      • GPQA dropped vs r5, so current claim stays mixed by task family (not universal quality uplift yet).
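
The overall awareness deltas reported for r9 are consistent with an unweighted mean across the four tasks, which is reasonable here because every task uses the same sample count. The sketch below reproduces that arithmetic from the per-task scores listed above; the dictionary layout and the helper name are illustrative assumptions, not the harness's own data model.

```python
# Per-task Arm A/B/C scores for r9, copied from the list above.
r9 = {
    "gpqa_diamond": (0.125, 0.125, 0.125),
    "ifeval": (0.500, 0.5625, 0.5625),
    "gsm8k": (0.625, 0.625, 0.750),
    "aime25": (0.000, 0.000, 0.125),
}

def overall(scores, arm_index):
    """Unweighted mean across tasks (assumes equal sample counts per task)."""
    return sum(task[arm_index] for task in scores.values()) / len(scores)

arm_a, arm_b, arm_c = (overall(r9, i) for i in range(3))
delta_b = arm_b - arm_a  # B vs A
delta_c = arm_c - arm_a  # C vs A
```

Here `delta_b` evaluates to 0.015625 and `delta_c` to 0.078125, matching the reported values.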

Phase 5 + Qwen3.5 Nightly vLLM Follow-up (2026-03-02)

  • vLLM main/nightly path is now validated for Qwen3.5 on AWS G5:
    • env: .venv-vllm-nightly-q35
    • server: vllm 0.16.1rc1.dev...
    • endpoint: http://127.0.0.1:18081/v1/*
  • Infra issue resolved during setup:
    • root filesystem hit 100%, causing Python/vLLM tempdir failure.
    • cleaned caches/old envs, restored ~21GB free, and launched with explicit TMPDIR.
  • Qwen3.5 diagnostic run set:
    • baseline sampled run: phase5_awareness_realbench_qwen35-realbench-r1-s8-nonthinking_20260302T184159Z.json
    • conservative retry/vote policy probe: phase5_awareness_realbench_qwen35-realbench-r2-policyfix1-s8-nonthinking_20260302T184624Z.json
    • fairness-fixed canonical probe (shared-first across arms): phase5_awareness_realbench_qwen35-realbench-r3-sharedfirst-s8-nonthinking_20260302T184947Z.json
  • Strict canonical runtime-vs-vLLM matrix is now completed (2026-03-02) with strict inference guard enabled:
    • runtime strict mode (TRENI_HTTP_REQUIRE_INFERENCE=1) hard-fails invalid inference paths (502 {"error":"inference_required"}),
    • matrix runner: scripts/phase5_qwen35_runtime_vs_vllm_matrix.py,
    • canonical artifacts:
      • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_runtime_vs_vllm_matrix_20260302T221546Z.json
      • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_runtime_vs_vllm_matrix_20260302T222013Z.json
    • canonical outcome (20260302T222013Z, Arm A):
      • score: runtime 0.0503 vs vLLM 0.2170 (delta -0.1667)
      • latency: runtime 1881.188 ms vs vLLM 178.093 ms (delta +1703.095 ms)
    • interpretation: Qwen3.5 strict matrix is no longer blocked; it is now a negative-result benchmark that defines the next optimization target.
  • Post-fix rerun (qnorm-check1, 2026-03-02) after wiring decoder Q/K head RMS-norm:
    • artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_runtime_vs_vllm_matrix_20260302T225529Z.json
    • result remained negative (rt_score=0.0000, vllm_score=0.0625; runtime latency 1880.622 ms vs 187.453 ms)
  • Decoder full-attn q_proj gate-layout parity fix landed (2026-03-03), and strict matrix was rerun with Arm A-only backend mode (--phase5-arms arm_a_control) to remove retry/vote-path contamination:
    • artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_runtime_vs_vllm_matrix_20260303T104038Z.json
    • overall Arm A score: runtime 0.15625 vs vLLM 0.19097 (delta -0.03472, CI includes near-parity)
    • overall Arm A latency: runtime 1723.685 ms vs vLLM 958.757 ms (delta +764.928 ms)
    • result quality gap is materially narrower than 20260302T222013Z, but runtime is still slower overall and still behind on aggregate score.
  • r3 (shared-first probe) result snapshot:
    • gpqa_diamond: A/B/C = 0.375 / 0.375 / 0.375
    • ifeval: A/B/C = 0.3125 / 0.3125 / 0.3125
    • gsm8k: A/B/C = 0.0 / 0.0 / 0.0
    • aime25: A/B/C = 0.0 / 0.0 / 0.0
    • awareness deltas: B-A=0.0 and C-A=0.0 on every task (no regression, no uplift).
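
Because strict mode turns an invalid inference path into an explicit 502 with {"error":"inference_required"}, a benchmark client can exclude those calls from scoring instead of counting them as wrong answers. The following is a minimal sketch of such a check; both helper names are hypothetical and are not part of scripts/phase5_qwen35_runtime_vs_vllm_matrix.py.

```python
import json

def is_inference_required_failure(status_code, body_text):
    """True when a response matches the strict-mode failure shape described above."""
    if status_code != 502:
        return False
    try:
        payload = json.loads(body_text)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("error") == "inference_required"

def scorable_count(responses):
    """Count (status, body) pairs that completed normally and may be scored."""
    return sum(1 for status, _body in responses if status == 200)

def run_is_contaminated(responses):
    """Flag a run in which any call hit the strict-mode guard."""
    return any(is_inference_required_failure(s, b) for s, b in responses)
```

A run where every call trips the guard has nothing scorable, so it should be marked invalid rather than reported as a zero score.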

Phase 5 Paper-Mode Debug (2026-03-03)

  • Harness bug fix is now applied:
    • in paper mode, retry now commits the refined output directly (paper semantics) instead of filtering the replacement by confidence margin.
    • code: /Users/andrewcorrea/treni/scripts/phase5_awareness_realbench.py
  • Live sanity outcomes after fix:
    • vLLM sanity (gpqa_diamond,ifeval, 8 samples/task):
      • phase5_awareness_realbench_qwen35-paperfix-sanity1_20260303T201156Z.json
      • outcome: overall B-A=0.0 (gpqa +0.125, ifeval -0.125), latency up due to retries.
    • runtime sanity on isolated GPU:
      • phase5_awareness_realbench_qwen35-paperfix-sanity2-runtime_20260303T201744Z.json
      • outcome: overall B-A=-0.125, retry 100%, large latency penalty.
  • Important contamination note:
    • phase5_awareness_realbench_qwen35-paperfix-sanity1-runtime_20260303T201620Z.json is invalid for performance interpretation (vLLM and runtime co-located; runtime OOM; strict 502 inference_required on all calls).
  • Calibration sweep result (runtime):
    • perplexity thresholds 1.4/1.8/2.2 produced identical outcomes with full retry (the max_entropy condition dominated the trigger).
    • entropy threshold 7.0 reduced retry volume (16 -> 9) but still produced no score uplift.
    • artifacts:
      • phase5_awareness_realbench_qwen35-paperfix-runtime-sweep-p1_4_20260303T202135Z.json
      • phase5_awareness_realbench_qwen35-paperfix-runtime-sweep-p1_8_20260303T202255Z.json
      • phase5_awareness_realbench_qwen35-paperfix-runtime-sweep-p2_2_20260303T202415Z.json
      • phase5_awareness_realbench_qwen35-paperfix-runtime-sweep-ent7_20260303T202617Z.json
  • Summary-mode calibration fix is now implemented in harness:
    • summary uncertainty detection now uses uncertainty_source=runtime_summary,
    • paper trigger uses guarded summary vote rule (paper_summary_max_entropy_threshold, paper_summary_confidence_threshold, paper_summary_min_votes).
    • code: /Users/andrewcorrea/treni/scripts/phase5_awareness_realbench.py
  • Post-fix runtime sanity (8/task):
    • phase5_awareness_realbench_qwen35-papersummaryfix-runtime-sanity2_20260303T204120Z.json
    • result: retry 9/16 (down from 16/16) and quality recovered to parity (overall B-A=0.0), but latency overhead remains high (~+1386 ms).
  • Post-fix confidence sweep (8/task):
    • conf=0.40: phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf0_40_20260303T204257Z.json
    • conf=0.45: phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf0_45_20260303T204357Z.json
    • conf=0.50: phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf0_50_20260303T204500Z.json
    • conf=0.55: phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf0_55_20260303T204602Z.json
    • all four remained overall B-A=0.0 (latency deltas ~+1252 to +1386 ms, retry ~0.50 to 0.5625).
  • Higher-N check (32/task, conf 0.45):
    • phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf045-s32_20260303T204751Z.json
    • result: gpqa +0.03125, ifeval -0.0625, overall B-A=-0.015626.
  • Task-aware follow-up (summary-mode retries disabled for IFEval) produced the first positive repeatable signal on this track:
    • larger run (32/task):
      • phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf045-s32-ifevaloff_20260303T222841Z.json
      • overall B-A=+0.015624 (arm_a 0.273438 -> arm_b 0.289062)
      • latency delta +618.068 ms
      • per-task deltas: gpqa +0.03125, ifeval +0.0
    • 3-seed repeatability (16/task, s7/s17/s27):
      • phase5_awareness_realbench_qwen35-papersummaryfix-runtime-ifevaloff-rpt-s7_20260303T223228Z.json
      • phase5_awareness_realbench_qwen35-papersummaryfix-runtime-ifevaloff-rpt-s17_20260303T223410Z.json
      • phase5_awareness_realbench_qwen35-papersummaryfix-runtime-ifevaloff-rpt-s27_20260303T223541Z.json
      • overall B-A mean +0.020833 (range 0.0 to +0.03125)
      • mean latency delta +712.276 ms
      • retries occurred only on GPQA in this policy (IFEval retries=0).
  • Late optimization pass (2026-03-03): compact invalid-parse retry prompt + confidence-gated invalid-parse retries (--invalid-parse-retry-confidence-max).
    • code: /Users/andrewcorrea/treni/scripts/phase5_awareness_realbench.py
    • best tradeoff policy on this host so far: invalid_parse_retry_confidence_max=0.73 with paper_summary_disable_ifeval_retry=true.
    • 3-seed (16/task) artifacts:
      • phase5_awareness_realbench_qwen35-papersummaryfix-runtime-compact-invmax073-s16_20260303T232029Z.json
      • phase5_awareness_realbench_qwen35-papersummaryfix-runtime-compact-invmax073-rpt-s17_20260303T232254Z.json
      • phase5_awareness_realbench_qwen35-papersummaryfix-runtime-compact-invmax073-rpt-s27_20260303T232516Z.json
    • result vs prior ...ifevaloff-rpt-s{7,17,27} baseline:
      • quality preserved (overall B-A mean: +0.020833 -> +0.020833),
      • latency overhead reduced (+712.276 ms -> +404.603 ms),
      • GPQA retry rate reduced (0.5833 -> 0.2917).
    • 32/task confirmation (s7):
      • phase5_awareness_realbench_qwen35-papersummaryfix-runtime-compact-invmax073-s32_20260303T232755Z.json
      • same quality delta as prior s32 policy (overall B-A=+0.015624) with lower latency overhead (+618.068 ms -> +326.187 ms).
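
The best-tradeoff policy above combines two gates: summary-mode retries are switched off for IFEval, and invalid-parse retries are allowed only up to a confidence ceiling. The sketch below shows one plausible reading of that policy. The function signature, the ordering of the checks, and the interpretation of the ceiling as "retry only when confidence <= max" are assumptions; the actual logic lives in scripts/phase5_awareness_realbench.py.

```python
def should_retry(task, invalid_parse, confidence, summary_trigger_fired,
                 invalid_parse_retry_confidence_max=0.73,
                 paper_summary_disable_ifeval_retry=True):
    """Decide whether to issue a second pass (illustrative reading of the policy)."""
    if invalid_parse:
        # Assumed gate: skip the retry when the model was already confident,
        # since a confident but unparseable answer rarely changes on retry.
        return confidence <= invalid_parse_retry_confidence_max
    if task == "ifeval" and paper_summary_disable_ifeval_retry:
        # Task-aware rule: no summary-mode retries for IFEval.
        return False
    return summary_trigger_fired
```

Under this reading, lowering the ceiling trades retry volume (and latency) against the chance of repairing an unparseable answer, which matches the direction of the reported GPQA retry-rate drop.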

Phase 5 Paper-Loop Alignment (2026-03-02 Late)

  • Reference paper code is now local in this workspace:
    • third_party/weave-logprobs-reasoning-loop
  • Phase 5 harness now uses paper-aligned uncertainty triggering:
    • --awareness-trigger-mode paper|confidence|hybrid (default: paper)
    • paper trigger = any of:
      • perplexity > trigger_perplexity_threshold (default 1.4)
      • max_entropy > trigger_max_entropy_threshold (default 1.5)
      • low_confidence_tokens >= trigger_low_confidence_tokens (default 3)
  • Retry/refinement prompts now carry first-pass uncertainty summary (top uncertain token positions + alternatives).
  • Artifacts now store per-call loop trace with uncertainty metrics/tables for case-level debugging.
  • End-to-end smoke run completed on AWS Qwen3.5 nightly with paper mode:
    • benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen35-paper-smoke_20260302T191420Z.json
    • validation signal: paper trigger fired with explicit reason fields (paper_reasons) and per-call uncertainty traces in output.
  • Full r4 run is now complete on the same config/sampling envelope:
    • benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen35-realbench-r4-paper-s8-nonthinking_20260302T191642Z.json
  • r4 snapshot:
    • gpqa_diamond: A/B/C = 0.375 / 0.375 / 0.375 (unchanged vs r3)
    • ifeval: A/B/C = 0.625 / 0.4375 / 0.625 (Arm A and Arm C up vs r3; Arm B below Arm A)
    • gsm8k: A/B/C = 0.0 / 0.0 / 0.0 (unchanged)
    • aime25: A/B/C = 0.0 / 0.0 / 0.0 (unchanged)
    • overall deltas vs Arm A: B -0.046875, C 0.0; both with higher latency from retries.
  • Interpretation:
    • paper-mode trigger path is functionally integrated and reproducible,
    • current default thresholds are too eager for this setup and do not yet produce net quality uplift on Qwen3.5.
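
The paper trigger described above is an "any of" rule over three uncertainty metrics. A minimal sketch with the listed default thresholds follows; the dataclass, the function name, and the reason strings are hypothetical (the artifacts expose a paper_reasons field, but its exact values are not shown here).

```python
from dataclasses import dataclass

@dataclass
class UncertaintySummary:
    perplexity: float
    max_entropy: float
    low_confidence_tokens: int

def paper_trigger_reasons(u,
                          trigger_perplexity_threshold=1.4,
                          trigger_max_entropy_threshold=1.5,
                          trigger_low_confidence_tokens=3):
    """Return the list of conditions that fired; a non-empty list means retry."""
    reasons = []
    if u.perplexity > trigger_perplexity_threshold:
        reasons.append("perplexity")
    if u.max_entropy > trigger_max_entropy_threshold:
        reasons.append("max_entropy")
    if u.low_confidence_tokens >= trigger_low_confidence_tokens:
        reasons.append("low_confidence_tokens")
    return reasons

calm = UncertaintySummary(perplexity=1.1, max_entropy=0.9, low_confidence_tokens=0)
spiky = UncertaintySummary(perplexity=1.2, max_entropy=2.4, low_confidence_tokens=1)
```

In this example `calm` triggers nothing while `spiky` fires on max_entropy alone. A single high-entropy token is enough to force a retry, which is one way such a rule can end up too eager on short answers.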

Phase 5 Adaptive Uncertainty Fix (2026-03-02 Late 2)

  • Harness fix landed: adaptive uncertainty mode now uses rolling per-task uncertainty history (perplexity, max_entropy, low_conf_ratio) with robust thresholds.
    • Script: scripts/phase5_awareness_realbench.py
    • New mode/default: --awareness-trigger-mode adaptive
  • Full rerun (r5, adaptive default):
    • benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen35-realbench-r5-adaptive-s8-nonthinking_20260302T202105Z.json
    • vs r4 paper:
      • B-A: -0.046875 -> -0.015625 (improved),
      • C-A: 0.0 -> 0.0 (kept parity),
      • latency deltas reduced:
        • Arm B: +904 ms -> +536 ms
        • Arm C: +1427 ms -> +623 ms
  • Stricter adaptive variant (r6) was tested:
    • benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen35-realbench-r6-adaptive-strict-s8-nonthinking_20260302T202314Z.json
    • Result: Arm B reached parity (B-A=0.0) but Arm C regressed (C-A=-0.03125) and latency worsened vs r5.
  • Decision:
    • keep adaptive default settings from r5 as current best policy for this setup.
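
The adaptive mode replaces fixed thresholds with ones derived from a rolling per-task history. The sketch below illustrates that idea using median plus a multiple of the median absolute deviation (MAD) as the robust threshold. The choice of median/MAD, the window size, the multiplier, and the minimum history length are all assumptions; the source states only that the thresholds are "robust" and per task.

```python
from collections import defaultdict, deque
from statistics import median

class AdaptiveTrigger:
    def __init__(self, window=32, k=3.0, min_history=4):
        self.k = k
        self.min_history = min_history
        # history[task][metric] -> bounded deque of recent first-pass values
        self.history = defaultdict(lambda: defaultdict(lambda: deque(maxlen=window)))

    def _threshold(self, values):
        med = median(values)
        mad = median(abs(v - med) for v in values)
        return med + self.k * mad

    def should_retry(self, task, metrics):
        """metrics: dict such as {"perplexity": ..., "max_entropy": ..., "low_conf_ratio": ...}."""
        fired = False
        for name, value in metrics.items():
            past = self.history[task][name]
            # Only compare once enough history exists for this task and metric.
            if len(past) >= self.min_history and value > self._threshold(past):
                fired = True
            past.append(value)
        return fired

trigger = AdaptiveTrigger()
warmup = [trigger.should_retry("gpqa_diamond", {"perplexity": p}) for p in (1.20, 1.21, 1.19, 1.20)]
outlier = trigger.should_retry("gpqa_diamond", {"perplexity": 5.0})
other_task = trigger.should_retry("ifeval", {"perplexity": 5.0})
```

Keeping history per task means a metric that is routinely high on one task family does not force retries on another, which fits the direction of the r5 latency reductions.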

Decision Update (2026-02-28 Late)

  • TRENI_LINEAR_U16_FAST_COMPUTE has now been rerun with higher-confidence repeats and is promoted default-on.
  • Validation pack:
    • warm+mixed AB5: benchmarks/phase2_runtime/results/aws_speedpass/linearfast_ab5_20260228T124736Z/summary_ab5.json
      • warm on-off: request -0.139 ms, p95 -0.128 ms, p99 -0.009 ms
      • mixed on-off: request -0.139 ms, p95 -0.156 ms, p99 -0.208 ms
    • cold AB3: benchmarks/phase2_runtime/results/aws_speedpass/linearfast_cold_ab3_20260228T124510Z/summary_ab3.json
      • full +0.302 ms, TTFT -0.019 ms, startup -4.207 ms (near-flat on cold full, improved on startup/TTFT)
    • strict parity: benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_report_linearfast_20260228T124557Z.json (checked=3, failed=0)
    • post-default strict parity smoke: benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_report_post_linearfast_default_20260228T125804Z.json (checked=3, failed=0)
  • Runtime parser default is now TRENI_LINEAR_U16_FAST_COMPUTE=1 (override to 0 for strict fallback A/B).
  • Same-window sanity A/B after promotion (linearfast_default_sanity_20260228T125957Z) confirms default-on behavior is directionally better than forced-off on mixed request path:
    • default - force_off: mean -0.603 ms, p95 -0.984 ms, p99 +0.029 ms.
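
The on-off deltas in this update are differences of summary statistics between two arms, so a negative value means the "on" (or default) arm is faster. A minimal sketch of that computation follows; the helper names and the nearest-rank percentile method are assumptions, and the summary_ab*.json files may use a different interpolation.

```python
import math
from statistics import mean

def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def ab_delta(on_ms, off_ms):
    """Return on - off for mean, p95, and p99 request latency in ms."""
    return {
        "mean": mean(on_ms) - mean(off_ms),
        "p95": percentile(on_ms, 95) - percentile(off_ms, 95),
        "p99": percentile(on_ms, 99) - percentile(off_ms, 99),
    }

# Toy latencies (not from the artifacts): the "on" arm is uniformly 0.5 ms faster.
off = [float(i) for i in range(1, 101)]
on = [x - 0.5 for x in off]
delta = ab_delta(on, off)
```

Sub-millisecond deltas like the ones reported here are close to run-to-run noise, which is why the decision rests on repeated AB3/AB5 packs rather than a single pair.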

Rerun Update (2026-02-28 Late 2)

  • Fresh canonical foundation rerun on the new default is now published:
    • pack root: benchmarks/phase2_runtime/results/aws_speedpass/foundation_linearfastdefault_pack_20260228T134157Z
    • summary: benchmarks/phase2_runtime/results/aws_speedpass/foundation_linearfastdefault_pack_20260228T134157Z/summary_ab3.json
  • Versus prior parser-default foundation pack (20260228T114315Z):
    • warm AB3: near-flat/slightly slower (request +0.101 ms, p95 +0.326 ms, p99 +0.208 ms)
    • cold AB3: near-flat/slightly slower (full +0.491 ms, infer +0.530 ms, TTFT +0.002 ms)
    • mixed AB3: improved (request -0.629 ms, p95 -1.281 ms, p99 -0.163 ms)
  • Same-window runtime-vLLM full-depth AB3 was rerun on this updated canonical lane:
    • run set root: benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_linearfastdefault_ab3_20260228T134630Z
    • summary: benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_linearfastdefault_ab3_20260228T134630Z/summary_ab3.json
    • averages:
      • runtime first-request full: 1185.186 ms
      • vLLM first-request full: 1305.971 ms
      • vLLM/runtime full ratio: 1.102x (runtime faster on this run set)
      • vLLM/runtime cold-total-first-response ratio: 5.807x
      • vLLM/runtime cold-total-first-token ratio: 7.648x
  • Batched2-Lt fast-fallback short-circuit experiment (skip Lt gate/state/timing work when Lt is disabled) was tested and reverted:
    • isolation A/B (fastfallback_isolation_ab3_20260228T140122Z/summary_ab3.json) using on (short-circuit) vs off (forced old path) showed:
      • warm on-off: request +1.155 ms, p95 +2.124 ms, p99 +1.504 ms (regression)
      • cold on-off: full -0.846 ms (improvement)
      • mixed on-off: mean +0.144 ms, p95 +0.569 ms, p99 -0.221 ms (mixed/slightly worse overall)
    • decision: keep reverted (not canonical).
    • post-revert strict parity passed: week3_parity_report_post_fastfallback_revert_20260228T140626Z.json.

Decision Update (2026-02-28 Late 3)

  • TRENI_TENSOR_H2D_CHUNK_MB default is now promoted from 64 to 0 (no chunking) on this canonical profile.
  • AB3 evidence:
    • cold AB3 (h2d_chunk_cold_ab3_20260228T142114Z/summary_ab3.json), chunk0 - chunk64:
      • startup -4.022 ms, full -2.562 ms, infer -2.542 ms, TTFT -0.060 ms
      • decoder_tensor_h2d -3.347 ms, decoder_tensor_upload -3.222 ms
    • warm+mixed AB3 (h2d_chunk_warm_mixed_ab3_20260228T142258Z/summary_ab3.json), chunk0 - chunk64:
      • warm: mean -0.442 ms, p95 -0.697 ms, p99 -0.966 ms
      • mixed: mean -0.044 ms, p95 -0.368 ms, p99 -0.279 ms
  • Post-promotion strict parity passed:
    • week3_parity_report_h2dchunk0_default_20260228T142805Z.json (checked=3, failed=0)
  • Single-run sanity (h2d_chunk_default_vs64_sanity_20260228T142845Z) showed small mixed sensitivity (default - force64 mean +0.340 ms), so this lane should be kept under repeatability watch in future packs.

Decision Update (2026-02-28 Late 4)

  • Higher-N same-window runtime-vLLM full-depth rerun is now complete on the updated defaults (AB5):
    • run root: benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z
    • summary: benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z/summary_ab5.json
    • summary markdown: benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z/summary_ab5.md
  • AB5 means (runtime vs vLLM):
    • first-request full: 1184.812 ms vs 1318.675 ms (vLLM/runtime=1.113x)
    • TTFT: 14.640 ms vs 50.309 ms (vLLM/runtime=3.436x)
    • cold-total first response: 4190.848 ms vs 24350.818 ms (vLLM/runtime=5.810x)
  • Comparison vs prior same-window AB3 (...linearfastdefault_ab3_20260228T134630Z) is published:
    • benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z/compare_vs_prev_linearfastdefault_ab3.json
    • runtime full mean improved slightly (1185.186 -> 1184.812 ms, -0.375 ms).
    • full-latency ratio improved (1.102x -> 1.113x), while TTFT ratio narrowed because vLLM TTFT was lower in this run window.
  • Interpretation:
    • request-path win vs vLLM remains stable at higher-N under claim-safe fixed-token settings.
    • the remaining active Track A work is still deeper custom layer-compute reduction (decoder_stepN_layers / FFN-heavy path), not re-establishing baseline direction.
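
The vLLM/runtime ratios quoted for AB5 follow directly from the reported means; a ratio above 1 means vLLM took longer. The short check below recomputes them from the numbers above (the dictionary keys are illustrative labels, not artifact field names).

```python
# (runtime_ms, vllm_ms) AB5 means copied from the list above.
ab5 = {
    "first_request_full": (1184.812, 1318.675),
    "ttft": (14.640, 50.309),
    "cold_total_first_response": (4190.848, 24350.818),
}

# vLLM/runtime ratio, rounded to three decimals as in the report.
ratios = {name: round(vllm / runtime, 3) for name, (runtime, vllm) in ab5.items()}
```

This yields 1.113 for first-request full, 3.436 for TTFT, and 5.81 for cold-total first response, in line with the published figures.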

Decision Update (2026-02-28 Late 5)

  • Full-depth gate sweep on top of current defaults is now complete:
    • gate root: benchmarks/phase2_runtime/results/aws_speedpass/fulldepth_gate_newdefaults_20260228T150709Z
    • gate summary: benchmarks/phase2_runtime/results/aws_speedpass/fulldepth_gate_newdefaults_20260228T150709Z/summary_gate_ab2.json
  • AB2 gate outcomes:
    • delayed-Lt (TRENI_LINEAR_BATCHED2_USE_LT=1, TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS=10000) was directionally positive in both modes:
      • warm on-off: request -0.384 ms, infer -0.343 ms, p99 -0.719 ms
      • mixed on-off: request -0.256 ms, infer -0.200 ms, p99 -0.279 ms
    • FFN proj_fast (TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1) remained mixed/noise:
      • warm on-off: request -0.096 ms, infer -0.082 ms, p99 +0.129 ms
      • mixed on-off: request -0.327 ms, infer -0.207 ms, p99 +0.022 ms
    • decision at gate stage: only delayed-Lt advanced to AB3 confirmation.
  • delayed-Lt AB3 confirmation is complete:
    • run root: benchmarks/phase2_runtime/results/aws_speedpass/fulldepth_delayedlt_ab3_20260228T151322Z
    • summary: benchmarks/phase2_runtime/results/aws_speedpass/fulldepth_delayedlt_ab3_20260228T151322Z/summary_ab3.json
    • warm on-off: request -0.330 ms, infer -0.270 ms, p99 -0.098 ms
    • mixed on-off: request +0.173 ms, infer +0.191 ms, p99 +0.291 ms
  • Decision:
    • keep delayed-Lt non-canonical on defaults (TRENI_LINEAR_BATCHED2_USE_LT=0, TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS=0).
    • at that stage, keep TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=0 as canonical (later temporarily promoted in Decision Update (2026-02-28 Late 8), then rejected in Decision Update (2026-02-28 Late 9) and restored to canonical off).
    • next custom-kernel focus remains structural layer-compute reduction (decoder_stepN_layers/FFN-heavy path), not env-toggle promotion.

Decision Update (2026-02-28 Late 6)

  • Tuned delayed-Lt slow-gate rescue probe is complete:
    • run root: benchmarks/phase2_runtime/results/aws_speedpass/delayedlt_tunedslow_ab2_20260228T152358Z
    • summary: benchmarks/phase2_runtime/results/aws_speedpass/delayedlt_tunedslow_ab2_20260228T152358Z/summary_gate_ab2.json
  • Tuned config for the on arm:
    • TRENI_LINEAR_BATCHED2_USE_LT=1
    • TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS=10000
    • TRENI_LINEAR_BATCHED2_LT_SLOW_RATIO_PCT=0
    • TRENI_LINEAR_BATCHED2_LT_SLOW_STREAK_DISABLE=4
  • AB2 deltas (on-off):
    • warm: request -0.185 ms, infer -0.054 ms, TTFT +0.016 ms, p99 -0.417 ms
    • mixed: request -0.004 ms, infer -0.032 ms, TTFT -0.011 ms, p99 +0.221 ms
  • Decision:
    • tuned policy is still non-promotable (mixed mean near-zero and mixed p99 regresses), so delayed-Lt remains non-canonical on defaults.

Decision Update (2026-02-28 Late 7)

  • FFN proj batched2 f32_input fallback-path patch is now validated:
    • code change: /Users/andrewcorrea/treni/monolith/models/linear.cu
    • behavior: cache unsupported mixed-input batched2 GEMM combos and short-circuit repeated failing calls.
  • Forced-Lt diagnostic (same profile/settings) before vs after patch:
    • before: benchmarks/phase2_runtime/results/aws_speedpass/ffnproj_f32input_ltalways_20260228T153113Z.json
    • after: benchmarks/phase2_runtime/results/aws_speedpass/ffnproj_f32input_ltalways_patch_20260228T154942Z.json
    • delta:
      • request mean 175.208 -> 173.124 ms (-2.084 ms)
      • p99 206.780 -> 204.405 ms (-2.375 ms)
      • linear_batched2_lt_failures 26112 -> 1 (repeated failure loop removed)
  • Canonical AB2 re-gate for TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=0/1 after patch:
    • root: benchmarks/phase2_runtime/results/aws_speedpass/ffnproj_f32input_gate_patch_ab2_20260228T155033Z
    • summary: benchmarks/phase2_runtime/results/aws_speedpass/ffnproj_f32input_gate_patch_ab2_20260228T155033Z/summary_gate_ab2.json
    • deltas (on-off):
      • warm: request +0.026 ms, infer -0.060 ms, p99 +0.099 ms
      • mixed: request +0.057 ms, infer +0.028 ms, p99 +0.446 ms
  • Decision:
    • keep TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=0 canonical (still non-promotable on default path).
    • keep fallback-path patch (removes pathological repeated-failure overhead and improves robustness in forced-Lt/stress configurations).
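
The fallback-path patch works by remembering which GEMM configurations are unsupported so that later calls skip straight to the fallback. The real change is in monolith/models/linear.cu; the Python sketch below only illustrates the caching idea, and the class name, the key layout, and the use of RuntimeError as the failure signal are hypothetical.

```python
class UnsupportedComboCache:
    """Remember configurations whose fast path failed so they are not retried."""

    def __init__(self):
        self._failed = set()
        self.failures = 0

    def run(self, key, fast_path, fallback):
        if key in self._failed:
            return fallback()          # short-circuit: known-bad combination
        try:
            return fast_path()
        except RuntimeError:
            self.failures += 1         # counted once per distinct key
            self._failed.add(key)
            return fallback()

def unsupported_fast_path():
    raise RuntimeError("mixed-input batched2 GEMM not supported")

cache = UnsupportedComboCache()
key = ("batched2", "f32_input", "u16_weight")  # hypothetical combo identifier
results = [cache.run(key, unsupported_fast_path, lambda: "fallback") for _ in range(1000)]
```

After 1000 calls the failure counter is 1 instead of 1000, which mirrors the reported drop in linear_batched2_lt_failures from 26112 to 1.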

Decision Update (2026-02-28 Late 8)

  • Full-depth FFN projection fast-compute rerun is now complete on clean inference path (pool=16384, classifier disabled, no fallback errors):
    • profiled AB3 (ffnprojfast_fullstep_ab3_20260228T160255Z/summary_ab3.json), on-off:
      • request -0.370 ms, infer -0.348 ms, TTFT -0.045 ms, p99 -0.533 ms.
    • non-profiled warm AB3 (ffnprojfast_fullwarm_ab3_20260228T160358Z/summary_ab3.json), on-off:
      • request -0.249 ms, infer -0.225 ms, TTFT -0.015 ms, p99 -0.328 ms.
  • Strict parity passed with explicit candidate env and then again on a temporary promoted parser build:
    • candidate env: week3_parity_report_ffnprojfast_candidate_20260228T160459Z.json (checked=3, failed=0)
    • temporary promoted build: week3_parity_report_ffnprojfast_default_20260228T160639Z.json (checked=3, failed=0)
  • Interim interpretation:
    • this looked promotable on qwen-focused clean-path profiling, but needed full foundation validation before final canonical decision.
  • Post-promotion same-window sanity AB3 (ffnprojfast_default_sanity_ab3_20260228T160557Z/summary_ab3.json) confirms near-flat but directionally positive default behavior:
    • default - force_off: request -0.094 ms, infer -0.093 ms, TTFT -0.003 ms, p99 +0.057 ms.

Decision Update (2026-02-28 Late 9)

  • Canonical foundation rerun + same-window gate resolved the contradiction and rejected global promotion:
    • foundation pack: benchmarks/phase2_runtime/results/aws_speedpass/foundation_ffnprojfastdefault_pack_20260228T194204Z/summary_ab3.json
      • vs prior canonical (foundation_newdefaults_pack_20260228T143605Z), all three modes were slower:
        • warm request +1.317 ms, cold full +3.117 ms, mixed request +1.112 ms.
    • same-window foundation gate AB2 (default vs force_off):
      • root: benchmarks/phase2_runtime/results/aws_speedpass/foundation_ffnprojfast_gate_ab2_20260228T195240Z
      • summary: benchmarks/phase2_runtime/results/aws_speedpass/foundation_ffnprojfast_gate_ab2_20260228T195240Z/summary_gate_ab2.json
      • default - force_off:
        • warm: request +0.489 ms, infer +0.479 ms, p99 +0.841 ms
        • cold: full +0.746 ms, infer +0.537 ms (startup improved -1.323 ms)
        • mixed: mean near-flat +0.004 ms, tails improved (p95 -0.320 ms, p99 -0.823 ms)
  • Decision:
    • keep TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=0 canonical parser default.
    • retain the lane as opt-in for qwen-focused profiling where it can still be useful.

Decision Update (2026-02-27)

  • Additional full-depth FFN/linear probe cycle is complete and did not produce a new canonical win.
  • 3-seed outcomes:
    • TRENI_DECODER_FFN_PROJ_U16_FUSED=1: slight regression vs off in runtime-only and runtime-vLLM sets.
    • TRENI_LINEAR_U16_FAST_COMPUTE=1: near-neutral/slight regression vs off in the initial runtime-only set (later superseded by 2026-02-28 AB5 promotion evidence).
    • TRENI_LINEAR_LT_WORKSPACE_MB=64: clear regression (first_request_full_ms +40.546 ms, +2.38%).
    • TRENI_LINEAR_USE_LT=0: clear regression (first_request_full_ms +48.826 ms, +2.87%).
    • shape-scoped Lt fail cache (replacing process-wide disable-on-first-fail) was implemented and validated; perf impact was near-neutral (~0.05% full-latency movement) in both runtime-only and runtime-vLLM checks.
  • FFN projection batched2 lane (TRENI_DECODER_FFN_PROJ_U16_BATCHED2) is now validated and promoted default-on:
    • runtime-only 3-seed delta (on-off): first_request_full_ms -12.199 ms (-0.72%), TTFT -0.171 ms.
    • runtime-vLLM 3-seed runtime-leg delta (on-off): first_request_full_ms -12.974 ms (-0.76%), TTFT -0.175 ms.
    • stage profile corroborates layer compute reduction (decoder_stepN_layers_mean 19.140 -> 18.447 ms).
  • Canonical full-depth linear lane remains:
    • TRENI_LINEAR_USE_LT=1
    • TRENI_LINEAR_LT_WORKSPACE_MB=0
    • TRENI_DECODER_FFN_PROJ_U16_FUSED=0
    • TRENI_DECODER_FFN_PROJ_U16_BATCHED2=1 (default-on)
  • Fresh stage profiles (external_cold_layers36_stageprofile_ffnprojbatch2_off_20260227T182949Z, ..._on_20260227T182728Z) still show decoder_stepN_layers as dominant, though improved under batched2 (19.140 -> 18.447 ms). FFN projection remains the top layer sub-stage (0.205 -> 0.196 ms/layer), so the next optimization remains structural layer-compute work.

Decision Update (2026-02-27 Late, Full-Depth Lane)

  • TRENI_DECODER_DIRECT_OUT_HIDDEN is now promoted default-on in this full-depth lane after positive 3-seed runtime-only A/B:
    • off: TTFT=15.024 ms, full=1690.855 ms, cold_full=4696.944 ms, infer=1668.381 ms
    • on: TTFT=14.950 ms, full=1684.908 ms, cold_full=4691.002 ms, infer=1662.753 ms
    • delta (on-off): full -5.948 ms, infer -5.629 ms
    • strict parity passed: week3_parity_report_directouthidden_default_20260227T184738Z.json (checked=3, failed=0).
  • External-cold harness now captures completion-length signals and supports fixed-token vLLM fairness:
    • new fields: completion_chars, completion_words, streamed usage_* (when available).
    • vLLM path now uses ignore_eos=true for fixed-token comparisons.
    • fixed-length rerun confirms matched completion_tokens=64 for runtime and vLLM.
  • New fused qkv split+bias path (TRENI_DECODER_QKV_SPLIT_BIAS_FUSED) is implemented and promoted default-on in this lane:
    • runtime-only 3-seed A/B:
      • off: TTFT=14.951 ms, full=1684.135 ms, cold_full=4690.132 ms, infer=1662.833 ms
      • on: TTFT=14.687 ms, full=1663.776 ms, cold_full=4669.847 ms, infer=1641.322 ms
      • delta (on-off): TTFT -0.265 ms, full -20.359 ms, cold_full -20.285 ms, infer -21.511 ms
    • strict parity passed: week3_parity_report_qkvsplitbias_default_20260227T190739Z.json (checked=3, failed=0).
  • Latest fixed-length runtime-vLLM 3-seed set (both completion_tokens=64):
    • runtime: TTFT=14.685 ms, full=1662.478 ms
    • vLLM: TTFT=50.272 ms, full=1293.215 ms
    • interpretation: runtime remains clearly ahead on TTFT, but request full still trails in this profile.

Decision Update (2026-02-27 Night, Logits Fast-Compute Hook)

  • TRENI_DECODER_LOGITS_U16_FAST_COMPUTE is now wired into the runtime logits projection path (*_f32_input_ex(..., use_fast_compute)).
  • Runtime-only 3-seed A/B (layers=36, pool=16384, preload64):
    • off: TTFT=14.687 ms, full=1661.945 ms, infer=1640.884 ms, cold_full=4667.855 ms
    • on: TTFT=14.676 ms, full=1662.713 ms, infer=1640.797 ms, cold_full=4668.751 ms
    • delta (on-off): TTFT -0.011 ms, full +0.767 ms, infer -0.086 ms, cold_full +0.896 ms
  • Decision: keep this knob disabled by default in this lane (TRENI_DECODER_LOGITS_U16_FAST_COMPUTE=0), because there is no material win and request-full regresses slightly.
  • Fixed-token runtime-vLLM sanity rerun (completion_tokens=64):
    • runtime: TTFT=14.700 ms, full=1662.793 ms
    • vLLM: TTFT=49.778 ms, full=1306.676 ms
    • artifact: benchmarks/phase2_external_cold/results/external_cold_layers36_preload64_logitsfast_off_vllm_s1_20260227T193632Z.json
  • Strict Week 3 parity after hook integration passed:
    • artifact: benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_report_logitsfast_hook_20260227T193756Z.json
    • summary: checked=3, failed=0.

Decision Update (2026-02-27 Night, U16 Cache Unlock)

  • Implemented structural cache fix:
    • copy_tensor_to_gpu_u16 now uses tensor-cache lookup/store.
    • new env gate TRENI_TENSOR_CACHE_U16 (default 1) for explicit A/B.
    • logits-u16 request path now goes through shared cached helper.
  • Runtime-only 3-seed A/B (u16cache off/on, full-depth preload64):
    • off: TTFT=14.679 ms, full=1661.982 ms, infer=1640.118 ms, cold_full=4667.860 ms
    • on: TTFT=14.682 ms, full=1189.452 ms, infer=1168.883 ms, cold_full=4195.511 ms
    • delta (on-off): TTFT +0.003 ms, full -472.529 ms, infer -471.235 ms, cold_full -472.349 ms
  • Runtime-vLLM same-window A/B (u16cache off/on, 2 seeds each):
    • off means:
      • runtime: TTFT=14.681 ms, full=1663.314 ms
      • vLLM: TTFT=50.073 ms, full=1325.189 ms
      • runtime-vLLM full delta: +338.124 ms (runtime slower)
    • on means:
      • runtime: TTFT=14.688 ms, full=1192.145 ms
      • vLLM: TTFT=50.183 ms, full=1290.816 ms
      • runtime-vLLM full delta: -98.671 ms (runtime faster)
  • Mechanism check from logs (measured request after preload):
    • off: decoder_tensor_upload ~476 ms, decoder_tensor_h2d ~468 ms
    • on: decoder_tensor_upload ~5 ms, decoder_tensor_h2d 0 ms
  • Strict parity on final default-on build passed:
    • artifact: benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_report_u16cache_toggle_default_20260227T200652Z.json
    • summary: checked=3, failed=0.
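
The mechanism check above (upload and H2D time collapsing once the u16 path goes through the tensor cache) can be illustrated with a minimal Python sketch. This is not the runtime's C implementation: `U16UploadCache`, `env_flag`, `count_uploads`, and the `lm_head.weight` tensor name are hypothetical, and the upload call is a stand-in. Only the env gate name TRENI_TENSOR_CACHE_U16 and its default of 1 come from the notes above.

```python
import os


def env_flag(name: str, default: bool) -> bool:
    """Read a 0/1 style environment gate; unset means `default`."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip() != "0"


class U16UploadCache:
    """Hypothetical lookup/store cache placed in front of an upload call."""

    def __init__(self, upload_fn, enabled: bool):
        self._upload_fn = upload_fn   # stand-in for conversion + host-to-device copy
        self._enabled = enabled
        self._device_copies = {}      # tensor name -> device-side handle
        self.uploads = 0              # number of real uploads performed

    def get(self, name, host_tensor):
        # Lookup: reuse the device copy if this tensor was uploaded before.
        if self._enabled and name in self._device_copies:
            return self._device_copies[name]
        # Miss (or cache disabled): pay for the upload again.
        handle = self._upload_fn(host_tensor)
        self.uploads += 1
        # Store: remember the device copy for later requests.
        if self._enabled:
            self._device_copies[name] = handle
        return handle


def count_uploads(enabled: bool, requests: int = 3) -> int:
    """Simulate one preload pass plus `requests` requests over the same weights."""
    cache = U16UploadCache(upload_fn=lambda t: ("dev", id(t)), enabled=enabled)
    weights = {
        "model.embed_tokens.weight": [0.0] * 4,
        "lm_head.weight": [0.0] * 4,  # hypothetical second tensor
    }
    for _ in range(1 + requests):
        for name, tensor in weights.items():
            cache.get(name, tensor)
    return cache.uploads


if __name__ == "__main__":
    print("gate:", env_flag("TRENI_TENSOR_CACHE_U16", default=True))
    print("uploads, cache off:", count_uploads(False))
    print("uploads, cache on: ", count_uploads(True))
```

In this toy setup the cache-off path uploads every tensor on every pass (8 uploads), while the cache-on path uploads each tensor once during preload (2 uploads), which is the same shape as the measured-request upload time dropping to near zero after preload.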

Decision Update (2026-02-27 Late Night, FFN Follow-Up)

  • Consolidated artifact:
    • benchmarks/phase2_external_cold/results/external_cold_layers36_ffn_followup_summary_20260227T223458Z.json
    • benchmarks/phase2_external_cold/results/external_cold_layers36_ffn_followup_summary_20260227T223458Z.md
  • New optional TRENI_LINEAR_BATCHED2_USE_LT lane was implemented and tested:
    • runtime-only ab3 delta (on-off): TTFT +0.162 ms, full +12.469 ms, infer +12.534 ms.
    • decision: not promoted.
  • Higher-N retest of TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=1 + TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1 (ab8 runtime-only):
    • delta (on-off): TTFT -0.001 ms, full -0.198 ms, infer -0.101 ms.
    • decision: not promoted (near-noise).
  • FFN fused path follow-up:
    • code path now allows gate/up bias deferral into fused SiLU*Up activation when TRENI_DECODER_FFN_PROJ_U16_FUSED=1.
    • runtime-only ab3 delta (on-off): TTFT -0.003 ms, full -0.383 ms, infer -0.161 ms.
    • decision: not promoted (near-noise).
  • Net status:
    • no canonical change from this cycle.
    • full-depth hotspot remains layer compute (decoder_stepN_layers / FFN-heavy path), so next work stays on deeper structural compute reductions plus mixed-load repeatability.

Decision Update (2026-02-28 Early, Fast-Profile + Mixed-Load Repeatability)

  • Fast-profile (--layers 2) higher-N logits fast-compute retest is complete:
    • artifact: benchmarks/phase2_external_cold/results/external_cold_layers2_logitsfast_ab8_summary_20260228T005529Z.json
    • runtime-only AB8 delta (on-off): TTFT -0.002 ms, full -0.299 ms, infer -0.013 ms, cold_full -0.345 ms
    • stage means remained effectively unchanged (decoder_stepN_logits_proj_mean ~1.261 ms in both modes)
    • decision: not promoted (near-noise effect).
  • Mixed-load repeatability on canonical lane is complete (run_mode=mixed_load, http_runs=120, 3 runs):
    • artifact: benchmarks/phase2_runtime/results/aws_speedpass/mixed_load_repeatability_summary_20260228T005626Z.json
    • means across runs: mean=122.247 ms, p95=198.518 ms, p99=199.608 ms
    • decision: stable; no canonical config change from this sweep.
  • Strict Week 3 parity follow-up on latest patched build:
    • artifact: benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_report_followup_20260228T005805Z.json
    • summary: checked=3, failed=0, strict.
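
For readers reproducing the mixed-load repeatability numbers above, the following sketch shows one way to get "means across runs" for mean, p95, and p99. It assumes each run yields a flat list of per-request latencies in milliseconds and uses a nearest-rank percentile. The harness's actual percentile method is not stated here, so treat `percentile_nearest_rank`, `summarize_run`, and `mean_across_runs` as illustrative names rather than the harness API.

```python
from statistics import fmean


def percentile_nearest_rank(samples, pct: int):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, so no float rounding is involved.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize_run(latencies_ms):
    """Per-run summary: mean, p95, and p99 of request latency."""
    return {
        "mean": fmean(latencies_ms),
        "p95": percentile_nearest_rank(latencies_ms, 95),
        "p99": percentile_nearest_rank(latencies_ms, 99),
    }


def mean_across_runs(run_summaries):
    """Average each summary statistic over the runs (the 'means across runs' row)."""
    keys = list(run_summaries[0].keys())
    return {key: fmean([summary[key] for summary in run_summaries]) for key in keys}


if __name__ == "__main__":
    # Synthetic latencies for illustration only; not measured data.
    runs = [list(range(1, 101)), list(range(11, 111)), list(range(21, 121))]
    summaries = [summarize_run(run) for run in runs]
    print(mean_across_runs(summaries))
```

Averaging per-run percentiles (rather than pooling all requests into one list) keeps each run's tail visible, which is what a repeatability check needs.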

Decision Update (2026-02-28, Parser Fix + Full-Depth FFN Follow-Up)

  • phase2_runtime_benchmark.py timing parser was fixed to preserve decimals in timing stage=... ms=... lines.
    • root cause: regex escaped decimal point incorrectly, which truncated stage values to integer prefixes.
    • impact: request-level metrics (ttft, infer, full) were unaffected; stage telemetry was underreported.
  • Rerun artifacts with fixed parser:
    • benchmarks/phase2_runtime/results/aws_speedpass/cold_profile_qwen_layers36_fixparse_20260228T011037Z.json
    • benchmarks/phase2_runtime/results/aws_speedpass/warm_profile_qwen_layers36_fixparse_20260228T011037Z.json
  • Confirmed full-depth hotspot (qwen, layers=36) remains FFN-heavy:
    • decoder_step_profile_ffn_proj_mean ~0.366 ms/layer
    • decoder_step_profile_ffn_down_resid_mean ~0.190 ms/layer
    • decoder_step_profile_total_mean ~0.705 ms/layer
  • Full-depth warm AB3 on TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE:
    • artifact: benchmarks/phase2_runtime/results/aws_speedpass/ffn_fast_compute_ab3_20260228T011146Z_summary.json
    • delta (on-off): request +0.317 ms, infer +0.305 ms, stage means flat.
    • decision: not promoted.
  • New strided-batched Lt path for batched2 FFN Lt fallback was implemented and benchmarked:
    • warm AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2lt_strided_ab3_20260228T011651Z_summary.json
    • warm AB3 delta (on-off): request -0.190 ms, infer -0.194 ms, stage means flat.
    • runtime-only external-cold sanity (layers=36, preload64): slight regression (full +0.579 ms, infer +0.609 ms).
    • decision: keep path opt-in (TRENI_LINEAR_BATCHED2_USE_LT=1) and not canonical.
  • FFN gate/up dual-bias fused add path (TRENI_DECODER_FFN_BIAS_PAIR_FUSED) is now implemented and benchmarked:
    • warm AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/ffn_bias_pair_ab3_20260228T020257Z/summary.json
    • warm AB3 delta (on-off): request -0.229 ms, infer -0.090 ms, p99 -0.390 ms, TTFT +0.009 ms.
    • cold follow-up artifact: benchmarks/phase2_runtime/results/aws_speedpass/ffn_bias_pair_cold_ab2_20260228T020723Z/summary.json (3 seeds each after extension).
    • cold delta (on-off): TTFT -0.003 ms, infer +1.875 ms, full +1.928 ms.
    • decision: keep the lane opt-in (non-canonical) until cold regression is eliminated.
  • Batched2 seq1 split-GEMM lane (TRENI_LINEAR_BATCHED2_SPLIT_SEQ1) is now implemented and benchmarked:
    • warm AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_splitseq1_ab3_20260228T025841Z/summary.json
    • warm AB3 delta (on-off): request +0.014 ms, infer +0.105 ms, p99 +0.124 ms, TTFT +0.004 ms (near-noise/slight regression).
    • cold AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_splitseq1_cold_ab3_20260228T025841Z/summary.json
    • cold AB3 delta (on-off): TTFT -0.021 ms, infer -2.002 ms, full -2.070 ms.
    • decision: keep opt-in and non-canonical (no warm-path win).
  • Batched2 dup-input strided lane (TRENI_LINEAR_BATCHED2_DUP_INPUT) is now implemented and benchmarked:
    • warm AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_dupinput_ab3_20260228T031816Z/summary.json
    • warm AB3 delta (on-off): request +0.317 ms, infer +0.293 ms, TTFT +0.009 ms, p99 -0.208 ms.
    • cold AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_dupinput_cold_ab3_20260228T031816Z/summary.json
    • cold AB3 delta (on-off): TTFT +0.010 ms, infer +1.388 ms, full +1.307 ms.
    • decision: keep opt-in and non-canonical (regresses mean request path in both warm and cold).
  • Batched2 dup-input v2 probe (duplication-kernel swap for the dup path) was run as a warm AB2 gate set and rejected:
    • gate artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_dupinput_v2warm_ab2_20260228T032741Z/summary_gate_ab2.json
    • gate delta (on-off): request +0.438 ms, infer +0.381 ms, TTFT +0.015 ms, p99 +0.217 ms.
    • decision: reverted probe implementation; no AB3/cold expansion.
  • FFN proj u16 fused gate rerun (TRENI_DECODER_FFN_PROJ_U16_FUSED=0/1, warm AB2):
    • gate artifact: benchmarks/phase2_runtime/results/aws_speedpass/ffn_proj_u16_fused_gate_ab2_20260228T033524Z/summary_gate_ab2.json
    • gate delta (on-off): request +0.149 ms, infer +0.173 ms, TTFT +0.002 ms, p99 -0.006 ms.
    • decision: near-flat/slight mean regression; no AB3 expansion.
  • FFN proj batched2 f32-input gate rerun (TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=0/1, warm AB2):
    • gate artifact: benchmarks/phase2_runtime/results/aws_speedpass/ffn_proj_batched2_f32input_gate_ab2_20260228T033758Z/summary_gate_ab2.json
    • gate delta (on-off): request +0.236 ms, infer +0.248 ms, TTFT +0.011 ms, p99 +0.512 ms.
    • decision: rejected at gate stage; no AB3 expansion.
  • Linear u16 compute16f gate probe (TRENI_LINEAR_U16_FORCE_COMPUTE_16F=0/1, warm AB2):
    • gate artifact: benchmarks/phase2_runtime/results/aws_speedpass/linear_u16_compute16f_gate_ab2_20260228T034412Z/summary_gate_ab2.json
    • gate delta (on-off): request +0.210 ms, infer +0.240 ms, TTFT -0.001 ms, p99 +0.594 ms.
    • decision: rejected at gate stage and reverted; no AB3 expansion.
  • Full-depth warm u16-lane re-baseline (explicit u16 decode flags, qwen, layers=36) confirms active hotspot split:
    • request mean in this lane is ~173 ms (120-request warm profile), with decoder_step_profile_total_mean ~0.402 ms.
    • FFN projection remains dominant (decoder_step_profile_ffn_proj_mean ~0.196 ms, mostly ffn_proj_gate; ffn_proj_up stays 0.0 under batched2).
  • FFN gate/up contiguous-pair packing probe (TRENI_DECODER_FFN_PAIR_PACK_U16) is implemented as experimental and benchmarked:
    • AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/ffn_pair_pack_gate_ab2_20260228T040616Z/summary_ab3.json
    • warm AB3 delta (on-off): request -0.423 ms, infer -0.442 ms, p99 -0.673 ms.
    • caveat: both sides already reported contiguous gate/up pair active, so this delta is not a causal promotion signal.
    • decision: keep path default-off (TRENI_DECODER_FFN_PAIR_PACK_U16=0) and experimental only.
  • Batched2 Lt rerun on the explicit u16 lane (TRENI_LINEAR_BATCHED2_USE_LT) now has fresh warm+cold evidence:
    • warm AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_use_lt_u16lane_gate_ab2_20260228T041041Z/summary_ab3.json
    • warm AB3 delta (on-off): request -0.313 ms, infer -0.468 ms, p99 -0.511 ms, TTFT -0.058 ms.
    • cold AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_use_lt_u16lane_cold_ab2_20260228T041359Z/summary_ab3.json
    • cold AB3 delta (on-off): full +1.165 ms, infer +1.424 ms, TTFT +0.001 ms.
    • fixed-on decision: keep non-canonical (warm gain did not survive cold-first-hit tradeoff).
  • Adaptive delayed batched2 Lt policy (TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS) has warm/cold wins but is not canonical (2026-02-28):
    • 5000ms AB3 (batched2_lt_enable_after_ms5000_warm_ab3_20260228T104525Z, batched2_lt_enable_after_ms5000_cold_ab3_20260228T104712Z): warm improved but cold full still regressed (+0.422 ms).
    • 10000ms AB3 (batched2_lt_enable_after_ms10000_warm_ab3_20260228T105028Z, batched2_lt_enable_after_ms10000_cold_ab3_20260228T105213Z): both modes improved.
      • warm delta (on-off): request -0.363 ms, infer -0.326 ms, p99 -0.696 ms.
      • cold delta (on-off): startup -4.307 ms, full -0.635 ms, infer -0.347 ms, TTFT -0.070 ms.
    • strict parity (week3_parity_report_batched2_lt_delay10000_20260228T105329Z.json): pass (checked=3, failed=0).
    • default-path strict parity smoke (no explicit batched2 Lt env overrides) also passed: week3_parity_report_batched2_lt_defaultdelay_20260228T110825Z.json.
    • same-window mixed-load A/B (mixed_load_defaultdelay_onoff_ab3_20260228T115010Z.json) regressed with delayed-on:
      • on-off: mean +0.846 ms, p95 +1.627 ms, p99 +0.679 ms.
    • decision: keep parser defaults off (TRENI_LINEAR_BATCHED2_USE_LT=0, TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS=0) and leave delayed-on as opt-in.
    • post-revert default-path strict parity also passed: week3_parity_report_postrevert_defaults_20260228T115543Z.json.
  • Foundation parser-default rerun pack is now published (foundation_defaultdelay_pack_20260228T114315Z):
    • warm AB3 means (foundation_defaultdelay_warm_ab3_20260228T114315Z/summary_ab3.json): request 147.258 ms, p99 247.617 ms, infer 128.450 ms, TTFT 16.999 ms.
    • cold AB3 means (foundation_defaultdelay_cold_ab3_20260228T114315Z/summary_ab3.json): startup 425.532 ms, full 598.787 ms, infer 580.173 ms, TTFT 12.210 ms.
    • mixed repeatability (mixed_load_repeatability_summary_defaultdelay_20260228T114748Z.json) vs prior canonical summary (mixed_load_repeatability_summary_20260228T005626Z.json) remained slower (mean +2.841 ms, p95 +5.587 ms, p99 +5.140 ms), reinforcing the non-canonical decision for delayed-on defaults.
  • Added experimental Lt prewarm path for FFN batched2 (TRENI_DECODER_FFN_BATCHED2_LT_PREWARM) and measured it with Lt fixed-on:
    • warm AB2 (batched2_lt_prewarm_warm_ab2_20260228T042453Z/summary_gate_ab2.json): small gain (request -0.328 ms, infer -0.394 ms).
    • cold AB3 (batched2_lt_prewarm_cold_ab3_20260228T042649Z/summary_ab3.json): first-hit gain (full -1.497 ms, infer -1.406 ms).
  • Direct same-window combo A/B (lt=0,prewarm=0 vs lt=1,prewarm=1) is mixed and non-promotable:
    • combined summary artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_lt_prewarm_combo_summary_20260228T042733Z.json
    • warm AB3 (batched2_lt_prewarm_combo_warm_ab2_20260228T042733Z/summary_ab3.json): regression (request +0.198 ms, infer +0.178 ms, p99 +0.407 ms).
    • cold AB3 (batched2_lt_prewarm_combo_cold_ab3_20260228T042733Z): still improved (full -1.099 ms, infer -0.819 ms).
    • decision: keep prewarm path experimental/default-off; not canonical.
  • FFN down fast-compute lane (TRENI_DECODER_FFN_DOWN_U16_FAST_COMPUTE) is now promoted default-on (2026-02-28) after full-depth canonical A/B + strict parity:
    • warm AB3 (ffn_down_fast_compute_gate_ab3_20260228T044546Z/summary_ab3.json): request -0.565 ms, infer -0.566 ms, p99 -1.405 ms, TTFT -0.030 ms.
    • cold AB3 (ffn_down_fast_compute_cold_ab3_20260228T044753Z/summary_ab3.json): startup -8.405 ms, full -0.351 ms, infer -0.406 ms, TTFT -0.028 ms.
    • strict parity (week3_parity_report_ffn_down_fast_20260228T044846Z.json): pass (checked=3, failed=0).
  • Post-promotion retest cycle on the updated canonical baseline (2026-02-28) closed additional FFN toggle candidates as non-canonical:
    • new structural TRENI_LINEAR_BATCHED2_STACKED_SEQ1=1 AB3 probe regressed warm materially (request +1.259 ms, infer +1.229 ms, p99 +2.830 ms) with near-flat cold full (+0.030 ms), so it remains experimental/default-off.
    • TRENI_LINEAR_BATCHED2_SPLIT_SEQ1 AB3 retest regressed warm and cold.
    • TRENI_LINEAR_BATCHED2_USE_LT fixed-on AB3 retest improved warm but still regressed cold startup/full; delayed-on improved warm/cold but still regressed mixed-load, so lane stays non-canonical.
    • combo TRENI_LINEAR_BATCHED2_USE_LT=1 + TRENI_DECODER_FFN_BATCHED2_LT_PREWARM=1 looked positive at AB3 but failed AB5 cold confirmation.
    • TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1 was non-canonical in that cycle (later temporarily promoted in Decision Update (2026-02-28 Late 8), then rejected in Decision Update (2026-02-28 Late 9) and returned to canonical off).
    • TRENI_LINEAR_U16_FAST_COMPUTE=1 was revalidated in a later AB5/cold/parity cycle and is now promoted (see Decision Update (2026-02-28 Late) above).
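
The timing-parser fix at the top of this update (decimals lost from timing stage=... ms=... lines) can be reproduced with a small sketch. The exact pattern used in phase2_runtime_benchmark.py is not shown in these notes, so `BUGGY` below is a hypothetical reconstruction of that bug class, and the sample log line and `parse_timing` helper are assumptions.

```python
import re

# Hypothetical reconstruction of the bug class: in a raw string, `\\.` means
# "a literal backslash followed by any character", so the optional fractional
# group never matches and only the integer prefix is captured.
BUGGY = re.compile(r"timing stage=(\S+) ms=([0-9]+(?:\\.[0-9]+)?)")

# Fixed: `\.` is a literal decimal point, so the fraction is preserved.
FIXED = re.compile(r"timing stage=(\S+) ms=([0-9]+(?:\.[0-9]+)?)")


def parse_timing(line, pattern=FIXED):
    """Return (stage, milliseconds) for a timing line, or None if it does not match."""
    match = pattern.search(line)
    if match is None:
        return None
    return match.group(1), float(match.group(2))


if __name__ == "__main__":
    # Assumed line shape; the stage name is illustrative.
    line = "timing stage=decoder_step_profile_ffn_proj ms=0.366"
    print("buggy:", parse_timing(line, BUGGY))
    print("fixed:", parse_timing(line, FIXED))
```

With sub-millisecond stages, the buggy pattern reports 0.0, which matches the described symptom: request-level metrics were unaffected, but stage telemetry was underreported.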

Decision Update (2026-02-26)

  • Full-depth FFN activation-to-u16 fused path (TRENI_DECODER_FFN_ACT_U16_FUSED) was implemented, benchmarked, and promoted to default-on.
  • Runtime-only 3-seed A/B:
    • off: TTFT=15.333 ms, full=1715.700 ms, cold_full=4721.653 ms
    • on: TTFT=15.193 ms, full=1704.987 ms, cold_full=4710.958 ms
    • delta (on-off): TTFT -0.140 ms, full -10.713 ms, cold_full -10.696 ms
  • Runtime-vLLM 3-seed A/B (same host class/window):
    • off: runtime full=1716.052 ms, vLLM full=1299.219 ms (runtime/vLLM=1.3208x)
    • on: runtime full=1704.248 ms, vLLM full=1309.801 ms (runtime/vLLM=1.3012x)
  • Strict parity passed for explicit-on and default-on runs:
    • week3_parity_report_ffnactu16_20260226T1100.json
    • week3_parity_report_ffnactu16_default_20260226T1108.json
  • cuBLASLt workspace probe (TRENI_LINEAR_LT_WORKSPACE_MB=32) was tested and rejected in this lane (full 1711.213 -> 1754.568 ms in trial A/B).
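
As a worked check of the arithmetic in this update, the sketch below recomputes the on-off deltas and the runtime/vLLM full-latency ratios from the means reported above. The helper names `on_off_delta` and `full_ratio` are illustrative and not part of the harness.

```python
def on_off_delta(off: dict, on: dict) -> dict:
    """Per-metric delta computed as on minus off (negative means the knob is faster)."""
    return {key: round(on[key] - off[key], 3) for key in off}


def full_ratio(runtime_full_ms: float, vllm_full_ms: float) -> float:
    """Runtime request-full latency relative to vLLM (above 1.0 means runtime is slower)."""
    return runtime_full_ms / vllm_full_ms


# Runtime-only 3-seed means from the bullets above (milliseconds).
OFF = {"ttft": 15.333, "full": 1715.700, "cold_full": 4721.653}
ON = {"ttft": 15.193, "full": 1704.987, "cold_full": 4710.958}

if __name__ == "__main__":
    print("delta (on-off):", on_off_delta(OFF, ON))
    print("ratio off: %.4f" % full_ratio(1716.052, 1299.219))
    print("ratio on:  %.4f" % full_ratio(1704.248, 1309.801))
```

Recomputing from the published three-decimal means can differ from the reported deltas in the last digit (for example cold_full), because the reported deltas were presumably taken before rounding.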

Decision Update (2026-02-24)

  • cuDNN/frontend optimization lane is parked for now.
  • Reason: high fused coverage remains slower than custom on warm path and dramatically worse on cold-first-hit.
  • Active priority is custom-kernel best path only.
  • Custom-lane implementation update: added seq1 microfused attention path (TRENI_ATTN_SEQ1_USE_MICROFUSED) and cached cuBLAS stream binding.
  • G5 A/B update (2026-02-23): microfused path showed no net win (mean/TTFT regressions across qwen/bart profiles), so it remains opt-in and defaults off.
    • summary artifact: benchmarks/phase2_runtime/seq1_microfused_ab/seq1_microfused_ab_summary_20260223T014848Z.md.
  • G5 stream-cache A/B update (2026-02-23): TRENI_LINEAR_STREAM_CACHE/TRENI_ATTN_STREAM_CACHE showed near-neutral impact in short runs; cache stays enabled by default.
    • summary artifact: benchmarks/phase2_runtime/results/stream_cache_ab_summary_20260223T015222Z.md.
  • G5 registry/model-index hash A/B update (2026-02-23): TRENI_REGISTRY_LOOKUP_HASH/TRENI_MODEL_INDEX_NAME_HASH showed no meaningful cold/setup gain on this profile; path remains opt-in and defaults off.
    • summary artifact: benchmarks/phase2_runtime/results/registry_hash_ab_summary_20260223T020353Z.md.
  • Cold-start measurement contract fix (2026-02-23): phase2_runtime_benchmark.py health polling moved from 1s cadence to 50ms cadence.
    • implication: startup_to_healthy_ms is now high-fidelity (older ~1002 ms plateaus were quantization artifacts of the 1 s polling cadence, not true startup plateaus).
  • Runtime startup-smoke control is now first-class in harness (--runtime-skip-startup-smoke, default true).
    • validated A/B (startup_smoke_ab_hf_20260223T030059Z): startup-to-healthy improved 488.027 -> 404.184 ms (-17.18%) and start-to-first-response improved 705.454 -> 622.167 ms (-11.81%) with startup smoke skipped.
    • runtime default now also skips startup smoke unless explicitly disabled (TRENI_SKIP_STARTUP_SMOKE=0).
  • New high-fidelity cold reference (cold_foundation_hf_20260223T030257Z, qwen profile):
    • startup-to-healthy 437.653 ms
    • request TTFT 4.295 ms
    • request full 217.998 ms
    • dominant request-path stage remains decoder_tensor_upload/decoder_tensor_h2d.
  • Consolidated knob-probe summary artifact:
    • benchmarks/phase2_runtime/results/cold_path_knob_probe_20260223T0303Z.md
  • Per-tensor upload hotspot probe is now available:
    • TRENI_TENSOR_UPLOAD_TOPK identified model.embed_tokens.weight as dominant in qwen cold upload (~79.3 ms, ~63.8% share in probe artifact).
    • artifact: benchmarks/phase2_runtime/results/cold_upload_hotspot_summary_20260223T1915Z.md.
  • Container readahead probe (TRENI_CONTAINER_WILLNEED) is now benchmarked:
    • 8-run A/B (container_willneed_ab8_20260223T191145Z) shows modest repeatable cold-total improvement (~-1.94% start-to-first-response).
    • TRENI_CONTAINER_WILLNEED + TRENI_TENSOR_HOST_REGISTER combo did not improve further on this profile (container_hostreg_ab8_20260223T191255Z).
    • runtime default now enables TRENI_CONTAINER_WILLNEED unless explicitly disabled (=0).
  • Staged H2D upload follow-up is now complete (TRENI_TENSOR_H2D_STAGING):
    • min64/chunk32 8-run A/B (h2d_staging_ab_20260224T100915Z) regressed full latency (+21.22%) and upload/h2d stages (+37.70% / +38.68%).
    • min64/chunk128 3-run probe (h2d_staging_chunk128_probe_20260224T101012Z) regressed further (full +44.43%, decoder_tensor_h2d +76.92%).
    • Decision: keep staging path parked (opt-in only), and focus Track A cold work on non-staging upload/H2D plus decoder_step0_layers.
    • consolidated artifact: benchmarks/phase2_runtime/results/h2d_staging_followup_summary_20260224T101324Z.md.
  • Non-staging H2D chunk-size matrix (TRENI_TENSOR_H2D_CHUNK_MB=0/64/128, 8 runs each) is now complete:
    • request-path and upload-stage deltas were near-neutral in that initial profile sweep; this was later superseded by 2026-02-28 full-depth AB3 promotion of default TRENI_TENSOR_H2D_CHUNK_MB=0 (see Decision Update (2026-02-28 Late 3)).
    • consolidated artifact: benchmarks/phase2_runtime/results/h2d_chunk_matrix_summary_20260224T101730Z.md.
  • Host page-touch pre-fault path (TRENI_TENSOR_HOST_TOUCH) is now implemented and benchmarked (TRENI_TENSOR_HOST_TOUCH_MIN_MB=256, 8-run A/B):
    • decoder_tensor_h2d decreased, but prefetch/upload stages increased and net request latency regressed (full +7.73%, infer +8.22%).
    • Decision: keep host-touch opt-in/default-off; not promoted into canonical Track A settings.
    • consolidated artifact: benchmarks/phase2_runtime/results/host_touch_ab_summary_20260224T102444Z.md.
  • Upload sync diagnostic probe (TRENI_TENSOR_UPLOAD_SYNC=0/1, 3 runs each) is now complete:
    • with synchronization on, conversion is visible (~6 ms) but H2D remains dominant (~118 ms) on this profile.
    • implication: cold upload optimization remains transfer-path first.
    • consolidated artifact: benchmarks/phase2_runtime/results/upload_sync_probe_summary_20260224T102618Z.md.
  • Synchronized host-register probe (TRENI_TENSOR_HOST_REGISTER=0/1, with TRENI_TENSOR_UPLOAD_SYNC=1) is now complete:
    • no transfer-stage gain and slight request-path regression on this profile.
    • implication: host-register lane is currently deprioritized.
    • consolidated artifact: benchmarks/phase2_runtime/results/host_register_sync_probe_summary_20260224T102915Z.md.
  • Decoder logits u16 path (TRENI_DECODER_LOGITS_U16_PATH) is now implemented and benchmarked:
    • cold upload/setup improves slightly, but request-path latency regresses materially (ttft/infer/full) in valid A/B runs.
    • follow-up fix2 pilot still regresses request-path latency materially.
    • implication: keep this lane parked as opt-in experimental; not part of canonical Track A settings.
    • consolidated artifact: benchmarks/phase2_runtime/results/logits_u16_ab_fix1_summary_20260224T105532Z.md.
  • Tensor-cache hash lookup path (TRENI_TENSOR_CACHE_HASH) is now implemented and benchmarked:
    • mixed + warm 3-seed A/B remains near-neutral, with slight warm p99 regression (+0.149 ms) in this profile.
    • implication: keep this lane opt-in/default-off.
    • artifacts:
      • benchmarks/phase2_runtime/results/tensor_cache_hash_ab_20260224T113911Z/
      • benchmarks/phase2_runtime/results/tensor_cache_hash_warm3_20260224T114126Z/
  • Sampler direct-store path (TRENI_SAMPLE_DIRECT_STORE) is now implemented and benchmarked:
    • enabled path regressed warm request latency (3-seed A/B: mean +0.062 ms, p95 +0.076 ms, p99 +0.143 ms).
    • implication: keep this lane opt-in/default-off.
    • artifact: benchmarks/phase2_runtime/results/sample_direct_store_ab_20260224T114633Z/.
  • Decoder direct-out residual path (TRENI_DECODER_DIRECT_OUT_HIDDEN) initial warm-profile A/B (2026-02-24) regressed and was kept opt-in at that time:
    • enabled path regressed warm request and infer metrics (3-seed A/B: mean +0.540 ms, p95 +0.495 ms, p99 +0.444 ms, infer +0.150 ms).
    • artifact: benchmarks/phase2_runtime/results/direct_outhidden_ab_20260224T115051Z/.
    • note: this is superseded for the current full-depth lane by the 2026-02-27 late-cycle promotion (see Decision Update above).
  • Consolidated summary for these three custom-path probes:
    • benchmarks/phase2_runtime/results/custom_path_probe_summary_20260224T115602Z.md.
  • Multi-head seq1 attention path (TRENI_ATTN_SEQ1_USE_MULTIHEAD) is now implemented and benchmarked:
    • qwen warm 3-seed: request mean 1.041x, p99 1.042x, infer 1.074x (seq1_multihead_ab_20260224T125127Z).
    • qwen mixed 3-seed: request mean 1.036x, p99 1.045x, infer 1.074x, cold wall 1.010x (seq1_multihead_ab_20260224T125127Z).
    • bart warm 3-seed: request mean 1.097x, p99 1.112x, TTFT 1.429x, infer 1.185x (seq1_multihead_bart_ab_20260224T125404Z).
    • default sanity rerun (no env override) remains faster than forced-off profile (seq1_multihead_default_sanity_20260224T125713Z).
    • decision: promoted default-on (TRENI_ATTN_SEQ1_USE_MULTIHEAD=1, TRENI_ATTN_SEQ1_MULTIHEAD_MAX_KV=2048).
    • consolidated artifact: benchmarks/phase2_runtime/results/seq1_multihead_ab_summary_20260224T125619Z.md.
  • External-cold rerun after seq1 multi-head default promotion is now complete (2026-02-24, same G5 host/config, 3 runs):
    • runtime means: startup 1003.315 ms, TTFT 4.022 ms, request full 239.277 ms, cold-total first response 1242.592 ms.
    • runtime-normalized ratios: PyTorch 127.900x TTFT / 9.378x full / 6.320x cold-total; vLLM 12.350x TTFT / 4.139x full / 19.333x cold-total.
    • note: Ollama was skipped on this host because the Ollama service and model were not installed for this rerun.
    • consolidated artifact: benchmarks/phase2_external_cold/results/external_cold_seq1mh_default_repeatability_20260224T192020Z.md.
  • First decoder_step0_layers optimization follow-up on seq1 multi-head path is now benchmarked (2026-02-24):
    • change: reuse normalized probs in multi-head seq1 softmax+PV (remove repeated exp in inner PV accumulation loop).
    • 3-run external-cold repeatability (runtime + PyTorch + vLLM) runtime deltas vs prior seq1mh baseline:
      • TTFT: 4.022 -> 4.018 ms
      • request full: 239.277 -> 238.400 ms
      • cold-total first response: 1242.592 -> 1241.688 ms
    • interpretation: measurable but small gain; confirms direction, and more step0 work is still needed for material uplift.
    • consolidated artifact: benchmarks/phase2_external_cold/results/external_cold_step0expfix_repeatability_20260224T194226Z.md.
  • Second decoder_step0_layers follow-up (seq1 multi-head shared-prob cache) was benchmarked and reverted:
    • 3-run means were slightly worse than step0expfix (full +0.278 ms, cold-total +0.282 ms) while still better than the older seq1mh baseline.
    • decision: keep step0expfix as current best path and revert shared-prob patch.
    • artifact: benchmarks/phase2_external_cold/results/external_cold_step0shared_repeatability_20260224T194913Z.md.
  • Decode-stage profiling beyond step0 is now available (TRENI_DECODE_STAGE_PROFILE):
    • first profiled run (external_cold_stepn_profile_20260225T001334Z) shows decoder_stepN_logits_sample_mean=2.671 ms and decoder_stepN_layers_mean=1.360 ms (qwen fast profile: --layers 2, 64 tokens, no preload).
    • implication (fast profile): next custom-kernel priority is logits projection/sampling path.
  • Decode split follow-up (2026-02-25) now isolates logits projection from sampling:
    • external_cold_stepn_split_20260225T081450Z and external_cold_stepn_split_revert_20260225T082055Z show decoder_stepN_logits_proj_mean=2.458 ms vs decoder_stepN_sample_mean=0.106 ms.
    • implication: residual decode hotspot is specifically logits projection.
  • Immediate logits-projection probe matrix (2026-02-25) is complete and near-neutral:
    • lt16 A/B: external_cold_stepn_lt16_off/on_20260225T081717Z/081718Z
    • fast16 GEMMEx probe: external_cold_stepn_split_fast16_20260225T082158Z
    • direct-u16-input A/B: external_cold_stepn_u16direct_off/on_20260225T082445Z/082447Z
    • lt_u16 workspace A/B: external_cold_stepn_ltu16ws_off/on_20260225T082735Z/082737Z
    • decision: all no-gain probe code paths were reverted; baseline remains canonical.
  • Uncertainty-capture A/B on the same profile (TRENI_DEMO_CAPTURE_UNCERTAINTY=1/0) is now complete:
    • request full 479.889 -> 473.367 ms
    • infer 461.771 -> 454.878 ms
    • decoder_stepN_logits_sample_mean 2.671 -> 2.562 ms
    • implication: uncertainty overhead is measurable but not the dominant decode cost.
  • Runtime-vLLM cold rerun (external_cold_runtime_vllm_uncertoff_20260225T001929Z) confirms runtime remains clearly ahead on this profile:
    • runtime TTFT/full/cold-total full: 3.929 / 472.724 / 1476.116 ms
    • vLLM TTFT/full/cold-total full: 49.577 / 1311.481 / 24344.013 ms
  • Full-depth qwen check (--layers 36, --pool-mb 16384) is now explicitly captured:
    • profiled runtime-only artifact: external_cold_stepn_split_layers36_pool16g_20260225T083216Z shows decoder_stepN_layers_mean=24.306 ms, decoder_stepN_logits_proj_mean=2.458 ms, decoder_stepN_total_mean=26.875 ms.
    • implication (full depth): decoder layers are the dominant request-path stage; logits projection is secondary.
  • Full-depth runtime-vLLM cold comparison (external_cold_runtime_vllm_layers36_pool16g_20260225T083306Z):
    • runtime TTFT/full/cold-total full: 26.775 / 2983.780 / 3987.092 ms
    • vLLM TTFT/full/cold-total full: 49.998 / 1315.478 / 24346.938 ms
    • implication: runtime is better on TTFT and cold-total, but currently slower on first-request full latency in this full-depth configuration.
  • Full-depth preload follow-up (external_cold_runtime_vllm_layers36_pool16g_preload_20260225T150209Z):
    • runtime request path improves to TTFT/full/infer = 26.748 / 2136.131 / 2114.951 ms with cache hits (cache_hit_delta=434, cache_miss_delta=0).
    • vLLM in same run remains faster on full (1279.729 ms) but much worse on cold-total (24310.219 ms).
    • implication: after removing upload misses, residual gap is decode/layer compute.
  • Full-depth preload-max-tokens probe (external_cold_runtime_vllm_layers36_pool16g_preload64_20260225T150410Z) is near-neutral vs preload=1 on runtime request full (2133.948 ms), but increases cold-total due to heavier startup preload.
  • Full-depth seq1 hybrid matrix rerun (external_cold_layers36_hybrid_*_20260225T1508*.json):
    • default custom is best (infer ~2113 ms).
    • qk/pv/both cublas variants regress materially (infer ~2459-2556 ms).
  • Full-depth direct-u16-input probe (external_cold_layers36_preload_a2_u16direct_off/on_20260225T150710Z/150715Z) is near-neutral/regressed and was reverted.
  • Full-depth FFN u16 path A/B (TRENI_DECODER_FFN_U16_PATH=1) is now complete:
    • artifacts:
      • external_cold_layers36_preload64_ab2_base_20260225T1628Z
      • external_cold_layers36_preload64_ab2_ffnu16_20260225T1628Z
    • runtime changes (base -> ffnu16):
      • TTFT 26.872 -> 18.077 ms
      • request full 2148.336 -> 1820.345 ms
      • cold-total full 6155.513 -> 4826.635 ms
    • implication: significant full-depth gain is validated, but runtime request full is still slower than vLLM (~1.38x) in this matched run.
  • Full-depth 3-seed expansion (base vs ATTN+FFN u16 vs ATTN+FFN+LOGITS u16) is now complete:
    • baseline means: runtime TTFT=26.863 ms, full=2147.754 ms, cold_full=6154.978 ms
    • ATTN+FFN u16 means: runtime TTFT=17.080 ms, full=1791.873 ms, cold_full=4797.910 ms
    • ATTN+FFN+LOGITS u16 means: runtime TTFT=16.104 ms, full=1775.313 ms, cold_full=4780.830 ms
    • implication: best runtime/vLLM full ratio improved to 1.365x (from 1.653x baseline), but full request parity is still not reached.
  • Full-depth decode-input reuse + u16-Lt follow-up (2026-02-25) is now complete:
    • pre-cast reuse 3-seed means: runtime TTFT=15.866 ms, full=1755.374 ms, cold_full=4761.440 ms
    • pre-cast reuse + u16-Lt 3-seed means: runtime TTFT=15.522 ms, full=1729.351 ms, cold_full=4735.345 ms
    • vs prior best (ATTN+FFN+LOGITS u16): request full -45.962 ms, TTFT -0.582 ms, cold-total full -45.485 ms
    • implication: best runtime/vLLM full ratio improved further to 1.323x, but request-full parity is still open.
  • Full-depth residual-fused u16-Lt follow-up (2026-02-26) is now complete:
    • 3-seed means: runtime TTFT=15.400 ms, full=1719.302 ms, cold_full=4725.923 ms
    • vs prior precastreuse+u16lt set: request full -10.049 ms, TTFT -0.122 ms, cold-total full -9.422 ms
    • implication: runtime request path improved again; vLLM moved in the same rerun window, so ratio remained mixed and request-full parity is still open.
  • Full-depth FFN gate+up fused-batch probe (2026-02-26) is now closed as non-canonical:
    • trial and 3-seed runtime-only A/B were completed.
    • result: regression on 3-seed means (full +4.768 ms when enabled), so the path was reverted.
  • Full-depth attention qkv fused-alias follow-up (TRENI_DECODER_ATTN_U16_QKV_FUSED) is now complete:
    • runtime-only 3-seed means:
      • off: TTFT=15.412 ms, full=1720.295 ms, cold_full=4726.358 ms
      • on: TTFT=15.323 ms, full=1714.426 ms, cold_full=4720.376 ms
      • delta (on-off): TTFT -0.089 ms, full -5.869 ms, cold_full -5.982 ms
    • runtime-vLLM 3-seed means:
      • off: runtime full=1720.062 ms, vLLM full=1282.024 ms (runtime/vLLM=1.3417x)
      • on: runtime full=1713.520 ms, vLLM full=1295.140 ms (runtime/vLLM=1.3230x)
    • implementation note: fused alias is now default-on in this lane; runtime logs confirm activation (attn qkv fused alias=on) on current qwen profile.
  • Post-rebuild full-depth sanity checks (2026-02-26) confirm no regression in this lane:
    • external_cold_layers36_sanity_postltwsoff_residfuse_u16lt_20260226T093127Z: runtime TTFT=15.384 ms, full=1720.056 ms, cold_full=4726.451 ms
    • external_cold_layers36_sanity_postbatch2revert_residfuse_u16lt_20260226T093905Z: runtime TTFT=15.399 ms, full=1720.835 ms, cold_full=4726.968 ms
    • external_cold_layers36_sanity_postffnsubprof_residfuse_u16lt_20260226T094109Z: runtime TTFT=15.432 ms, full=1720.919 ms, cold_full=4727.189 ms
    • external_cold_layers36_sanity_qkvfuseddefault_residfuse_u16lt_20260226T102520Z: runtime TTFT=15.319 ms, full=1713.886 ms, cold_full=4720.190 ms
  • Full-depth FFN sub-stage profile split (external_cold_layers36_stepn_profile_ffnsub_20260226T094140Z.log):
    • decoder_step_profile_ffn_proj_mean=0.205 ms
    • decoder_step_profile_ffn_proj_cast_mean=0.005 ms
    • decoder_step_profile_ffn_proj_gate_mean=0.101 ms
    • decoder_step_profile_ffn_proj_up_mean=0.099 ms
    • implication: cast is minor; remaining ffn_proj hotspot is gate/up linear compute itself.
  • FAST_16 follow-up probe on top of u16-Lt (2026-02-25) was evaluated and not promoted:
    • request-full deltas were small (~1-2 ms) and one startup run in the repeatability set showed a large shared host outlier.
    • decision: keep canonical baseline on non-fast compute and continue next work from residfuse+u16lt.
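
The fused-alias A/B figures above can be rechecked with a few lines of arithmetic. The sketch below only re-derives deltas and ratios from the 3-seed means quoted in this section; the variable names are illustrative and not part of the benchmark harness.

```python
# 3-seed means quoted above (ms); names are illustrative only.
runtime_only = {
    "off": {"ttft": 15.412, "full": 1720.295, "cold_full": 4726.358},
    "on": {"ttft": 15.323, "full": 1714.426, "cold_full": 4720.376},
}
runtime_vs_vllm = {
    "off": {"runtime_full": 1720.062, "vllm_full": 1282.024},
    "on": {"runtime_full": 1713.520, "vllm_full": 1295.140},
}

# Delta is on minus off, so a negative value means the fused alias is faster.
deltas = {
    metric: round(runtime_only["on"][metric] - runtime_only["off"][metric], 3)
    for metric in runtime_only["off"]
}

# A ratio above 1.0 means runtime is still slower than vLLM on request full.
ratios = {
    arm: round(row["runtime_full"] / row["vllm_full"], 4)
    for arm, row in runtime_vs_vllm.items()
}

# Split the ratio change into runtime-side and vLLM-side movement.
off, on = runtime_vs_vllm["off"], runtime_vs_vllm["on"]
runtime_shift = round(on["runtime_full"] - off["runtime_full"], 3)
vllm_shift = round(on["vllm_full"] - off["vllm_full"], 3)

print(deltas)
print(ratios)
print(runtime_shift, vllm_shift)
```

In this pair of arms the vLLM mean moved by more than the runtime mean, so part of the ratio change reflects vLLM run-to-run movement rather than the runtime patch alone.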

What Has Been Run

Phase 1 (Baseline, Python stack)

  • T4 set: baseline JSON exists.
  • G5 set: baseline JSON exists.
  • Includes cold start breakdown, warm model runs, and pipeline runs.

Phase 2 (Minimal runtime benchmark)

  • T4 set: runtime JSON exists.
  • G5 set: runtime JSON exists.
  • Includes cold starts, model run timing, and HTTP request latency.
  • True TTFT rerun exists (runtime timing, not SSE proxy).
  • Cold optimization rerun exists after tensor index-cache fix.
  • Stage-level cold decomposition exists (tokenizer/index/upload/prefill/step0 timings).
  • Fast tensor collect optimization rerun exists (clean4).
  • External cold canonical run exists across four backends (runtime, PyTorch, vLLM, Ollama) on G5 (2026-02-18).
  • External cold optimized run exists with runtime startup preload + tokenizer cache (2026-02-18).
  • External cold token-parity rerun exists after decoder/sampling fixes; runtime now wins request and cold-total vs vLLM (2026-02-18).
  • Qwen cold upload sub-stage ablation exists with GPU conversion toggle on G5 (2026-02-19).
  • External-cold runtime-only GPU-convert ablation exists on G5 (on/off toggle, preload+token-parity settings, 2026-02-19).
  • External-cold runtime-vLLM rerun exists after vLLM env restore (2026-02-19, 3-run repeatability).
  • External-cold all-backend repeatability exists after GPU-convert fix (2026-02-19, 3 runs; runtime+PyTorch+vLLM+Ollama).
  • Runtime-only cold stability sweep exists (2026-02-19, 5 runs with preload upload sub-stage inspection).
  • Runtime host-prefetch cold-variance fix exists (TRENI_TENSOR_HOST_PREFETCH, 2026-02-19) with stable runtime-only 5-run sweep.
  • External-cold all-backend repeatability rerun exists after host-prefetch fix (2026-02-19, 3 runs).
  • External-cold repeatability rerun exists after seq1 multi-head default promotion (2026-02-24, 3 runs, runtime + PyTorch + vLLM; Ollama skipped on host).
  • AWS G5 speedpass matrix exists for upload-sync + cublasLt toggles (2026-02-22).
  • AWS G5 TTFT kernel pass exists (2026-02-22): softmax near-parity, then norm-kernel rewrite (rmsnorm/layernorm) produced measurable cold/warm latency gains.
  • AWS G5 TTFT follow-up exists (2026-02-22): seq_q=1 tiny-attention kernel path + direct K/V cache writes further improved TTFT/warm latency and moved Bart TTFT materially.
  • AWS G5 attention backend A/B exists for custom vs cudnn_sdpa proxy, including reverse-order rerun to remove call-order cold bias (2026-02-22).
  • AWS G5 seq1 hybrid tuning matrix exists (2026-02-22): default custom seq1 vs qk-cublas vs pv-cublas vs both-cublas.
  • AWS G5 seq1 fused-softmax/PV follow-up exists (2026-02-22): default custom path rerun after fused seq1 softmax+PV + QK block retune (seq1_hybrid_fused_20260222T192656Z).
  • Attention runtime now caches backend env config once per process (removes per-call parse overhead on request path).
  • cudnn_sdpa is now fused-only by default; legacy proxy A/B runs are explicit opt-in via TRENI_ATTN_ALLOW_SDPA_PROXY=1.
  • H100 fused cuDNN SDPA probe pack exists (2026-02-22): alignment/shape/layout sweeps plus debug traces; no viable SDPA engine configs under current backend descriptor path.
  • AWS G5 strict fused frontend A/B rerun exists (attn_backend_ab_frontend_20260222T220111Z, fixed qwen, warmed query set): warm path near parity while cold-first-hit still regresses heavily.
  • Fused frontend stage-profile probe exists (cudnn_frontend_profile_probe_20260222T2204Z): miss-cost root cause isolated (~705 ms per plan-build miss on A10G; pack/execute/unpack are negligible).
  • Frontend A/B runner now hard-fails non-fused contamination (missing fused log marker or TRENI_WITH_CUDNN=0 runtime build).
  • AWS G5 frontend repeatability matrix exists (attn_backend_frontend_matrix_20260222T221948Z, 3 repeats/profile, warm_fixed + mixed_churn) and shows custom wins all tracked metrics in both profiles.
  • Frontend claim-strength report exists (attn_backend_frontend_claim_report_20260222T222958Z) with paired delta CI95 summaries for each metric/profile.
  • AWS G5 frontend miss-trace probe exists (attn_backend_ab_frontend_20260222T224739Z) with explicit miss-key logging (TRENI_ATTN_CUDNN_FRONTEND_TRACE_MISSES=1).
  • AWS G5 startup-preload mitigation matrix exists (attn_backend_frontend_matrix_20260222T224521Z) plus direct compare report vs no-preload (attn_backend_frontend_missmit_compare_20260222T225215Z).
  • AWS G5 preload splitter fix is verified (TRENI_HTTP_PRELOAD_PROMPTS now executes full list, run=1/4 ... run=4/4 in runtime logs).
  • AWS G5 benchmark-query preload mitigation matrix exists (attn_backend_frontend_matrix_20260222T231139Z) with direct compare vs matched no-preload baseline (attn_backend_frontend_missmit_compare_20260222T231335Z).
  • AWS G5 shape-level no-preload mitigation probe exists (prebuild_startup_nopreload_probe_20260222T232932Z): fused cold TTFT/request latency are near custom with startup prebuild enabled.
  • AWS G5 no-preload shape-prebuild matrix probe exists (attn_backend_frontend_matrix_20260222T233003Z) with direct compare vs no-preload baseline (attn_backend_frontend_missmit_compare_20260222T233116Z).
  • AWS G5 hybrid no-preload fused policy rerun exists (attn_backend_frontend_matrix_20260223T001959Z) with direct compare vs prior tuned no-gate shape-prebuild baseline (attn_backend_frontend_missmit_compare_20260223T002153Z), plus 3x startup probe repeatability (prebuild_hybrid10_nopreload_probe_r{1,2,3}_20260223T002214Z).
  • AWS G5 broader-shape sanity run exists for the hybrid policy (hybrid_shape_sanity_20260223T002857Z): startup stays near 2.0s and inference stays valid, but long-prompt growth past seq_kv=10 still triggers fused miss cascades.
  • AWS G5 bounded-hybrid follow-up exists (TRENI_ATTN_CUDNN_FRONTEND_MAX_SEQ_KV=10): broader-shape sanity rerun (hybrid_shape_sanity_maxgate_20260223T003453Z) removes miss cascades, and 3x matrix rerun (attn_backend_frontend_matrix_20260223T003611Z) remains near-parity with prior hybrid fixed-profile metrics.
  • Runtime now emits per-request attention backend telemetry (attention.total_calls, custom/fused/proxy shares, gate/fail counters) in chat responses; phase2 harness/reporting now aggregates this in attention_backend.
  • Coverage-instrumented fused reruns exist (2026-02-23): 3x matrix (attn_backend_frontend_matrix_20260223T011158Z) plus warm/cold coverage profiles (fused_coverage_profiles_20260223T011504Z, fused_coverage_cold_profiles_20260223T011534Z).
  • Execution direction is now explicit: fused/frontend work is parked; next optimization cycles are custom-only.
  • Routing failure-amplification stress run exists with injected tool failures/timeouts plus controller retries (2026-02-18).
  • Routing matrix expansion exists on G5 (baseline + 5 stress profiles, 2026-02-19).
  • Cross-host routing pilot exists (local client via SSH tunnel to G5 runtime/controller; baseline + mild-timeout + stress, 2026-02-19).
  • Split-host routing matrix exists (CPU controller/tool host + GPU runtime host, 6 profiles, 2026-02-19).
  • Internet multi-hop expansion exists (Fly.io controller/tool hops with commercial APIs):
    • OpenAI gpt-5.2 profile matrix (2026-02-20, repeatability rerun at runs=3/profile).
    • OpenRouter openai/gpt-5.2 profile matrix (2026-02-20, repeatability rerun at runs=3/profile).
    • OpenRouter anthropic/claude-sonnet-4.6 profile matrix (2026-02-20, repeatability rerun at runs=3/profile).
  • Local control routing matrices exist (same harness, local standalone external router, no Fly scheduler path):
    • OpenAI gpt-5.2 (2026-02-20, runs=3/profile).
    • OpenRouter anthropic/claude-sonnet-4.6 (2026-02-20, runs=3/profile).
    • higher-N reruns (runs=8/profile) for the same pair.
    • task-family parity split (model_only, tool_only) on the same pair (runs=8).
  • Grouped commercial root-cause report exists (commercial_gap_root_cause_20260222T222958Z) combining fairness artifacts (r4+r8) by provider/model/task family with stage decomposition.

Week 3 (Numerical parity)

  • T4 parity: strict mode, 0 failures.
  • G5 parity: strict mode, 0 failures.
  • Donut is intentionally excluded from the parity check and is reported as skipped.

Phase 3 comparison report

  • T4 comparison report exists.
  • G5 comparison report exists.

Phase 3 agentic loop benchmark (canonical G5 set)

  • Dedicated harness implemented with 3 scenarios:
    • retrieval correction
    • tool-state adaptation
    • confidence-gated branching
  • Evaluator metrics included:
    • success rate
    • steps-to-convergence / correction efficiency
    • latency per task and per successful task
    • failure taxonomy
  • Canonical G5 set run complete (2026-02-19):
    • baseline profile: 3 seeds
    • stress profile: 3 seeds
    • consolidated summary artifact published.
  • Realistic-v1 fixture run set complete (2026-02-22):
    • baseline profile: 3 seeds
    • stress profile: 3 seeds
    • consolidated summary artifact published.

Phase 3 uncertainty-awareness ablation (baseline + stress + comparison)

  • Harness now supports:
    • uncertainty source modes: normalized_logprob, raw_logit_margin, hybrid, runtime_native
    • independent uncertainty toggles: internal/external on/off
  • Matrix runner added for 4-arm ablation per source.
  • Baseline repeatability set complete (2026-02-19, runs=8, seeds 7/11/19, all 3 sources).
  • Stress repeatability set complete (2026-02-19, same seeds/sources, injected timeout/failure profile).
  • Consolidated baseline-vs-stress comparison report published.
  • Runtime/kernel-native uncertainty wiring is now implemented:
    • runtime HTTP response includes uncertainty payload
    • Phase 3 harness can consume it via runtime_native
  • Canonical G5 C2 rerun with runtime_native source is now published (2026-02-19, baseline+stress, 3 seeds each).
  • Realistic-v1 C2 baseline+stress pair is now published for normalized_logprob, raw_logit_margin, and hybrid (2026-02-22, seed 7).

Phase 4 hardware expansion (Lambda A100/H100)

  • Full A100 run set complete (phase2 cold/hot + routing matrix + C2 runtime-native calibrated).
  • Full H100 run set complete (phase2 cold/hot + routing matrix + C2 runtime-native calibrated).
  • Canonical loop summaries on A100/H100 are also complete (baseline+stress, 3 seeds each).
  • Paper-grade package generated from canonical G5 + A100 + H100 artifacts:
    • /benchmarks/paper_package/latest/package_summary.json
    • /benchmarks/paper_package/latest/paper_package.md
    • /benchmarks/paper_package/latest/tables/*.csv
    • /benchmarks/paper_package/latest/manuscript/* (captions, claims, figure manifest, mermaid figure specs)

Latest Key Findings (2026-02-17)

  • Warm path on G5 remains strong (~80.6 ms mean, ~90.4 ms p99 in latest clean7 sanity run).
  • Internal routing is faster than external routing (external/internal latency ratio 1.032x).
  • Cold TTFT dropped further after stage decomposition + fast tensor collect:
    • qwen: 1.41s -> 1.10s (22.1% lower)
    • donut: 619ms -> 150ms (75.7% lower)
    • bart: 777ms -> 125ms (83.9% lower)
    • minilm: 23.4ms -> 22.6ms (3.4% lower)
  • model_tensor_index_build is no longer dominant (~1-2.3 ms mean across models in clean4).
  • An async pinned-upload experiment regressed Qwen cold TTFT and was reverted; clean4 remains the accepted cold-path reference.
  • Revert validation set (clean7, 2026-02-18 UTC) confirms clean4 numbers are reproducible within noise.

Latest Key Findings (2026-02-22, True Fused cuDNN Frontend Rerun)

  • Strict fused frontend A/B (attn_backend_ab_frontend_20260222T220111Z) with fixed qwen and warmup policy (http_warmup_runs=8) shows:
    • warm request mean: custom 19.324 ms vs fused frontend 21.503 ms (custom/frontend=0.899)
    • warm infer mean: custom 18.803 ms vs fused frontend 20.976 ms (custom/frontend=0.896)
    • warm TTFT: custom 4.199 ms vs fused frontend 4.498 ms
  • Cold-first-hit remains the blocker:
    • cold TTFT: custom 4.220 ms vs fused frontend 710.641 ms
    • cold full latency: custom 250.929 ms vs fused frontend 6610.148 ms
  • Stage-profile probe (TRENI_ATTN_CUDNN_FRONTEND_PROFILE=1) shows root cause is miss compile cost, not execution kernels:
    • plan-build miss cost: ~704.8 ms per miss
    • per-call costs are tiny: pack ~0.010 ms, execute ~0.021-0.048 ms, unpack ~0.005 ms
  • Interpretation:
    • fused path is real and validated.
    • warm steady-state is close to custom when shapes are warmed.
    • unresolved work is miss mitigation for cold/mixed shape churn.
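
As a rough cross-check of the miss-cost interpretation, the excess cold latency of the fused frontend can be divided by the measured plan-build miss cost. This is a back-of-envelope sketch that assumes the whole cold gap versus custom comes from plan-build misses; that assumption is not confirmed by the artifacts, and the implied miss count is an estimate, not a logged value.

```python
# Values quoted above (ms).
MISS_COST_MS = 704.8  # approximate plan-build cost per miss
cold_ttft = {"custom": 4.220, "fused": 710.641}
cold_full = {"custom": 250.929, "fused": 6610.148}


def implied_misses(custom_ms: float, fused_ms: float,
                   miss_cost_ms: float = MISS_COST_MS) -> float:
    """Excess fused latency expressed in units of one plan-build miss."""
    return (fused_ms - custom_ms) / miss_cost_ms


ttft_misses = implied_misses(cold_ttft["custom"], cold_ttft["fused"])
full_misses = implied_misses(cold_full["custom"], cold_full["fused"])

print(f"implied misses before first token: {ttft_misses:.2f}")
print(f"implied misses over the full cold request: {full_misses:.2f}")
```

Under that assumption the first token pays for roughly one miss and the full cold request for roughly nine, which would be consistent with additional plans being built as the KV length grows during decode.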

Latest Key Findings (2026-02-22, Frontend Repeatability Matrix)

  • Artifact: attn_backend_frontend_matrix_20260222T221948Z (repeats=3 per profile).
  • Profiles:
    • warm_fixed: fixed model (qwen) with warmup (http_warmup_runs=8)
    • mixed_churn: fixed model with no warmup (http_warmup_runs=0) to expose miss churn
  • Win counts:
    • custom is faster on every tracked metric in both profiles (3/3 wins each metric)
  • Warm-fixed aggregate:
    • request mean: custom 19.271 +/- 0.050 ms vs fused 21.468 +/- 0.018 ms
    • infer mean: custom 18.812 +/- 0.059 ms vs fused 20.984 +/- 0.026 ms
    • TTFT mean: custom 4.198 +/- 0.001 ms vs fused 4.498 +/- 0.001 ms
  • Mixed-churn aggregate:
    • request mean: custom 47.864 +/- 0.018 ms vs fused 843.141 +/- 0.735 ms
    • infer mean: custom 47.331 +/- 0.050 ms vs fused 842.542 +/- 0.747 ms
    • TTFT mean: custom 4.197 +/- 0.002 ms vs fused 179.744 +/- 0.263 ms
  • Interpretation:
    • custom clearly wins under both stable warmed and churned request conditions.
    • fused frontend remains sensitive to shape misses; miss mitigation is still the blocker for cold/mixed competitiveness.

Latest Key Findings (2026-02-22, Frontend Claim-Strength Report)

  • Artifact: attn_backend_frontend_claim_report_20260222T222958Z.
  • This report computes paired deltas (frontend - custom) with CI95 from the repeatability matrix.
  • Warm-fixed signal:
    • request mean delta: +2.197 ms CI95 [2.125, 2.238]
    • TTFT delta: +0.300 ms CI95 [0.299, 0.301]
  • Mixed-churn signal:
    • request mean delta: +795.277 ms CI95 [794.408, 795.747]
    • TTFT delta: +175.546 ms CI95 [175.300, 175.820]
  • Interpretation:
    • current custom path is faster than current fused frontend path in both profiles: paired deltas are positive and their CI95 intervals exclude zero for all tracked latency metrics.
    • repeat count remains low (n=3/profile), but effect sizes are large and stable.
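
The report's exact CI method is not described here. As one plausible approach, a percentile bootstrap over paired per-repeat deltas looks like the sketch below. The per-repeat values are hypothetical placeholders (only aggregate means are quoted above), and `paired_bootstrap_ci` is an illustrative helper, not the report's code.

```python
import random
import statistics


def paired_bootstrap_ci(frontend, custom, n_resamples=10_000, alpha=0.05, seed=0):
    """Mean paired delta (frontend - custom) with a percentile bootstrap CI."""
    if len(frontend) != len(custom):
        raise ValueError("paired samples must have equal length")
    deltas = [f - c for f, c in zip(frontend, custom)]
    rng = random.Random(seed)
    # Resample the paired deltas with replacement and collect the means.
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=len(deltas)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(deltas), lo, hi


# Hypothetical per-repeat request means (ms), NOT taken from the artifact.
custom_runs = [19.22, 19.27, 19.32]
frontend_runs = [21.45, 21.47, 21.49]

mean_delta, ci_lo, ci_hi = paired_bootstrap_ci(frontend_runs, custom_runs)
print(f"delta={mean_delta:+.3f} ms, CI95=[{ci_lo:.3f}, {ci_hi:.3f}]")
```

With n=3 the bootstrap distribution has at most ten distinct means, so any interval built this way is coarse, which is in line with the caveat about low repeat count.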

Latest Key Findings (2026-02-22, Startup-Preload Miss-Mitigation, Updated Canonical)

  • Artifacts:
    • baseline matrix (no_preload): attn_backend_frontend_matrix_20260222T230445Z
    • candidate matrix (startup_preload_benchmark_queries): attn_backend_frontend_matrix_20260222T231139Z
    • comparison report: attn_backend_frontend_missmit_compare_20260222T231335Z
    • exact-prompt probe: preload_exact_prompt_probe_20260222T231050Z.json
  • Mitigation used:
    • startup multi-prompt preload (TRENI_HTTP_PRELOAD_PROMPTS) with prompt set matched to benchmark cold/warm queries.
  • Mixed-churn deltas (no_preload -> startup_preload_benchmark_queries):
    • fused warm request mean: 843.242 -> 22.433 ms (37.590x faster)
    • fused warm infer mean: 842.684 -> 21.965 ms (38.365x faster)
    • fused warm TTFT: 179.541 -> 4.497 ms (39.928x faster)
    • fused cold TTFT: 704.521 -> 4.495 ms (156.723x faster)
    • fused cold full latency: 6593.495 -> 25.785 ms (255.707x faster)
  • Exact-prompt probe result:
    • preload on exact cold prompt drops first-hit fused TTFT to 4.499 ms and full latency to 26.090 ms.
  • Interpretation:
    • fused cold/mixed miss penalty can be removed on this harness when preload coverage matches serving prompts.
    • custom remains slightly faster in warmed steady state (ratio remains about 0.90 custom/frontend), but the prior ~704 ms first-hit fused TTFT blocker is resolved for this canonical prompt set.
    • still open: make this robust without prompt-list curation (shape-level prebuild/reuse path).

Latest Key Findings (2026-02-22, Shape-Prebuild No-Preload Probe)

  • Artifacts:
    • cold probe (no preload, startup shape prebuild): prebuild_startup_nopreload_probe_20260222T232932Z.json
    • matrix probe (repeats=1): attn_backend_frontend_matrix_20260222T233003Z
    • compare vs no-preload baseline: attn_backend_frontend_missmit_compare_20260222T233116Z
  • Mitigation used:
    • TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MAX_KV=16 (initial probe)
    • TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_HEAD_DIM=128
    • no startup preload prompts (TRENI_HTTP_PRELOAD unset)
  • Cold probe (qwen, fused frontend):
    • startup->healthy: 11017.541 ms
    • request TTFT: 5.814 ms
    • request full latency: 255.434 ms
  • Matrix probe highlights (shape_prebuild_nopreload, fused frontend):
    • mixed-churn cold TTFT: 5.805 ms
    • mixed-churn cold full latency: 255.267 ms
    • mixed-churn warm request mean: 51.482 ms
    • mixed-churn warm TTFT: 4.824 ms
  • Interpretation:
    • shape-level prebuild removes the no-preload fused cold/mixed request-path spike without curated prompt lists.
    • current tradeoff is startup cost shift (http_attn_prebuild dominates startup time), so next work is reducing compile-at-startup overhead.
  • Follow-up tuning (seq_kv_max: 16 -> 10) artifact: prebuild_startup10_nopreload_probe_20260222T235944Z.json
    • startup->healthy: 11017.541 -> 7011.472 ms (1.571x faster startup)
    • request TTFT: 5.814 -> 5.826 ms (near-identical)
    • request full latency: 255.434 -> 254.936 ms (near-identical)
  • Matrix confirmation for tuned range:
    • tuned matrix (seq_kv_max=10): attn_backend_frontend_matrix_20260223T000256Z
    • compare vs seq_kv_max=16: attn_backend_frontend_missmit_compare_20260223T000343Z
    • request-path behavior stayed near-identical while startup dropped materially:
      • warm-fixed fused request mean: 22.556 -> 22.265 ms
      • mixed fused request mean: 51.482 -> 50.974 ms
  • Lower-range probe (seq_kv_max=8) artifact: prebuild_startup8_nopreload_probe_20260223T000600Z.json
    • startup->healthy: 6010.381 ms (faster startup)
    • request TTFT: 703.771 ms (regression)
    • request full latency: 1660.576 ms (regression)
    • interpretation: seq_kv_max=8 under-covers this query profile; 10 is the minimum safe tuned range in current harness.
  • Heuristic-mode probe (TRENI_ATTN_CUDNN_FRONTEND_HEUR_MODE) on current sm86 path:
    • modes A and B had near-identical prebuild/startup behavior.
    • FALLBACK produced no valid engine configs for this frontend descriptor path.

Latest Key Findings (2026-02-23, Coverage-Instrumented Fused Reruns)

  • Coverage-instrumented 3x matrix (attn_backend_frontend_matrix_20260223T011158Z) confirms:
    • warm-fixed fused coverage is low (warm_attn_fused_share ~0.030303) under the bounded hybrid policy.
    • mixed-churn fused coverage is similarly low (~0.030303) with custom handling most calls.
    • warm TTFT remains slightly better for custom (4.194 ms custom vs 4.269 ms fused profile).
    • warm request mean/p99 stay near-parity on fixed profile; mixed profile still favors custom in request-path totals.
  • High-coverage fused profile (fused_coverage_profiles_20260223T011504Z) shows current fused frontend path is slower when heavily used:
    • frontend_all fused share ~0.878788 with warm request mean 22.310 ms vs custom 20.292 ms (~1.099x slower).
    • warm TTFT 4.496 ms vs custom 4.196 ms.
  • Cold coverage profile (fused_coverage_cold_profiles_20260223T011534Z) shows strong first-hit regression when fused coverage is high:
    • frontend_all fused share ~0.9 with cold TTFT 704.176 ms vs custom 4.215 ms.
    • cold full latency 6595.157 ms vs custom 246.306 ms.
  • Interpretation:
    • fused frontend path is now measurable and reproducible with explicit coverage accounting.
    • in current implementation, high fused coverage still regresses latency; bounded gating avoids worst regressions by keeping most calls on custom.
    • next optimization target remains dynamic shape plan reuse/coverage so fused can be exercised without miss-build penalties.
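
The coverage shares above are per-backend fractions of attention calls. The sketch below shows that computation on hypothetical counters; the counter names are illustrative stand-ins for the `attention.total_calls` and custom/fused/proxy telemetry, and the counts are chosen only because 1/33 and 29/33 reproduce the reported shares.

```python
def backend_shares(counts: dict) -> dict:
    """Fraction of attention calls served by each backend."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: calls / total for name, calls in counts.items()}


# Hypothetical per-request counters (not read from any artifact).
bounded_hybrid = {"custom": 32, "fused": 1, "proxy": 0}
frontend_all = {"custom": 4, "fused": 29, "proxy": 0}

bounded_share = round(backend_shares(bounded_hybrid)["fused"], 6)
all_share = round(backend_shares(frontend_all)["fused"], 6)
print(bounded_share)
print(all_share)
```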

Latest Key Findings (2026-02-23, Hybrid Shape-Gate Frontend Policy)

  • Artifacts:
    • 3x startup probe: prebuild_hybrid10_nopreload_probe_r{1,2,3}_20260223T002214Z.json
    • 3x frontend matrix: attn_backend_frontend_matrix_20260223T001959Z
    • compare vs prior tuned no-gate baseline: attn_backend_frontend_missmit_compare_20260223T002153Z
    • broader-shape sanity (initial): hybrid_shape_sanity_20260223T002857Z
    • broader-shape sanity (bounded gate): hybrid_shape_sanity_maxgate_20260223T003453Z
    • 3x bounded-gate matrix: attn_backend_frontend_matrix_20260223T003611Z
    • bounded-gate compare vs prior hybrid: attn_backend_frontend_missmit_compare_20260223T003734Z
  • Policy used:
    • TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MIN_KV=10
    • TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MAX_KV=10
    • TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_HEAD_DIM=128
    • TRENI_ATTN_CUDNN_FRONTEND_MIN_SEQ_KV=10
    • bounded-gate follow-up adds: TRENI_ATTN_CUDNN_FRONTEND_MAX_SEQ_KV=10
  • 3-run startup probe summary (qwen, fused frontend, no preload prompts):
    • startup->healthy: 2004.840 +/- 0.146 ms
    • request TTFT: 4.955 +/- 0.011 ms
    • request full latency: 242.673 +/- 0.352 ms
  • Delta vs prior tuned shape-prebuild no-gate probe (prebuild_startup10_nopreload_probe_20260222T235944Z):
    • startup->healthy: 7011.472 -> 2004.840 ms (3.497x faster)
    • request TTFT: 5.826 -> 4.955 ms (1.176x faster)
    • request full latency: 254.936 -> 242.673 ms (1.051x faster)
  • Matrix deltas vs prior tuned no-gate matrix (attn_backend_frontend_matrix_20260223T000256Z):
    • warm-fixed fused request mean: 22.265 -> 20.354 ms (1.094x faster)
    • mixed fused request mean: 50.974 -> 47.904 ms (1.064x faster)
    • cold fused TTFT: 5.819 -> 4.959 ms (1.173x faster)
    • cold fused full latency: 254.146 -> 242.569 ms (1.048x faster)
  • Bounded-gate broader-shape follow-up (TRENI_ATTN_CUDNN_FRONTEND_MAX_SEQ_KV=10):
    • broader-shape set mean full latency: 9974.576 -> 274.072 ms (36.395x faster)
    • broader-shape set max full latency: 30654.303 -> 434.776 ms (70.504x faster)
    • fixed-profile matrix stayed near-parity vs prior hybrid (attn_backend_frontend_missmit_compare_20260223T003734Z).
  • Interpretation:
    • the startup compile-burst tradeoff has been materially reduced while preserving low no-preload request-path latency.
    • strict fused runs remain inference-valid with low-shape custom fallback (inference.used=true), so this is now the best prompt-independent frontend policy in this harness.
    • broader-shape limitation seen in initial hybrid sanity (hybrid_shape_sanity_20260223T002857Z) is mitigated by bounded-gate follow-up (hybrid_shape_sanity_maxgate_20260223T003453Z), which removes miss cascades by routing out-of-window shapes to custom.
    • remaining work is wider fused coverage without fallback (dynamic shape-reuse/plan persistence).
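
The bounded-gate policy can be read as a window check on seq_kv. The following is an illustrative Python sketch of that routing decision, not the runtime's implementation; it assumes the two gate variables are inclusive integer bounds and that an unset bound means no limit on that side.

```python
import os


def choose_attention_backend(seq_kv: int, env=None) -> str:
    """Route to the fused frontend only inside the [min, max] seq_kv window.

    Assumption: bounds are inclusive and an unset variable disables that bound.
    """
    env = os.environ if env is None else env
    min_kv = env.get("TRENI_ATTN_CUDNN_FRONTEND_MIN_SEQ_KV")
    max_kv = env.get("TRENI_ATTN_CUDNN_FRONTEND_MAX_SEQ_KV")
    if min_kv is not None and seq_kv < int(min_kv):
        return "custom"
    if max_kv is not None and seq_kv > int(max_kv):
        return "custom"
    return "fused"


# Policy values quoted above.
policy = {
    "TRENI_ATTN_CUDNN_FRONTEND_MIN_SEQ_KV": "10",
    "TRENI_ATTN_CUDNN_FRONTEND_MAX_SEQ_KV": "10",
}
for seq_kv in (4, 10, 11, 64):
    print(seq_kv, choose_attention_backend(seq_kv, policy))
```

With only the minimum set, every longer shape stays on the fused path, which corresponds to the case where the initial sanity run saw miss cascades.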

Latest Key Findings (2026-02-22, Commercial Root-Cause Grouped Analysis)

  • Artifact: commercial_gap_root_cause_20260222T222958Z.
  • Grouped on fairness-hardened splits (r4+r8) by provider/model/task-family.
  • OpenAI gpt-5.2 model-only (paired_n=36):
    • latency delta mean (external-internal): -69.311 ms, CI95 [-193.985, 61.444] (near parity/noise).
    • external controller overhead mean: 2.081 ms; external model-hop mean: 1406.971 ms.
  • OpenAI gpt-5.2 tool-only parity (paired_n=12):
    • latency delta mean: +49.601 ms, CI95 [-162.047, 274.981] (near parity/noise).
    • external controller overhead mean: 12.842 ms; external model-hop mean: 2456.108 ms.
  • OpenRouter Sonnet 4.6 model-only (paired_n=24):
    • latency delta mean: +204.883 ms, CI95 [-148.517, 683.114] (near parity/noise).
    • external controller overhead mean: 2.254 ms; external model-hop mean: 2220.251 ms.
  • Interpretation:
    • current commercial control evidence does not show a statistically locked directional win/loss.
    • controller overhead is small relative to model-hop variance; higher-N reruns are required before claiming directional commercial gap outcomes.
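
The "near parity/noise" labels follow from each CI95 spanning zero. A small sketch of that check over the intervals quoted above; the label strings and helper name are illustrative.

```python
def classify_delta(ci_low: float, ci_high: float) -> str:
    """Label a paired latency delta (external - internal) from its CI95."""
    if ci_low > 0:
        return "external slower"
    if ci_high < 0:
        return "external faster"
    return "inconclusive"  # interval contains zero


# (mean delta ms, CI95 low, CI95 high) as quoted above.
GROUPS = {
    "openai gpt-5.2 model-only": (-69.311, -193.985, 61.444),
    "openai gpt-5.2 tool-only": (49.601, -162.047, 274.981),
    "openrouter sonnet 4.6 model-only": (204.883, -148.517, 683.114),
}

labels = {name: classify_delta(lo, hi) for name, (_, lo, hi) in GROUPS.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```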

Latest Key Findings (2026-02-18, External Cold Canonical)

  • Runtime cold total first response: 2342.996 ms.
  • PyTorch cold total first response: 8725.259 ms (3.724x runtime).
  • vLLM cold total first response: 25069.018 ms (10.700x runtime).
  • Ollama cold total first response: 3530.106 ms (1.507x runtime).
  • vLLM has the fastest request-path TTFT once healthy (51.763 ms), but startup (24032.203 ms) dominates end-to-end cold in this run.

Latest Key Findings (2026-02-18, External Cold Optimized Runtime)

  • Runtime request full latency: 271.346 ms (vs vLLM 1035.826 ms).
  • Runtime cold total first response: 2276.081 ms (vs vLLM 28072.508 ms).
  • Runtime still trails vLLM in request TTFT (91.596 ms vs 51.725 ms).
  • This run was not token-parity yet (runtime decode steps still 4 while others used 48).

Latest Key Findings (2026-02-18, External Cold Token-Parity = 48, Pre-Fix)

  • Runtime request full latency: 2518.142 ms (vLLM: 1075.404 ms).
  • Runtime request TTFT: 91.207 ms (vLLM: 51.310 ms).
  • Runtime cold total first response: 4522.345 ms (vLLM: 28111.652 ms, 6.216x runtime advantage).
  • Request-path gap remains: runtime per-token decode is now the dominant issue.

Latest Key Findings (2026-02-18, External Cold Token-Parity = 48, Decoder/Sampling Fix)

  • Runtime request TTFT: 5.022 ms (vLLM: 52.995 ms, runtime 10.553x faster).
  • Runtime request full latency: 311.289 ms (vLLM: 1094.517 ms, runtime 3.516x faster).
  • Runtime cold total first response: 2316.048 ms (vLLM: 25131.279 ms, runtime 10.851x better).
  • Startup remained stable (~2004.8 ms) while request-path bottleneck was removed.
  • Confirmation rerun (runtime+vLLM) matched the direction: runtime 5.021/310.376/2314.581 ms vs vLLM 51.655/1033.214/24065.623 ms (TTFT/full/cold-total).
  • Initial 3-run repeatability set (2026-02-18) showed TTFT 10.333x, full 3.380x, cold-total 10.688x; this was superseded by the 2026-02-19 rerun below.

Latest Key Findings (2026-02-19, Qwen Cold Upload GPU-Convert Fix)

  • A/B setup:
    • same G5 host, same runtime build, same cold_first_hit harness.
    • only toggle changed: TRENI_TENSOR_CONVERT_GPU=0 (off) vs default on.
  • Qwen results:
    • full_latency_ms: 1116.567 -> 238.740 (4.677x faster).
    • decoder_tensor_upload: 1007 ms -> 129 ms (7.806x faster).
    • decoder_tensor_convert: 862 ms -> 6 ms (143.667x faster).
    • decoder_tensor_h2d: 143 ms -> 121 ms (1.182x faster).
    • startup + first response total: 2119.906 ms -> 1242.057 ms (1.707x faster).
  • Interpretation:
    • the dominant cold bottleneck was CPU-side BF16/F16 conversion; moving conversion to GPU largely removed that bottleneck.
  • External-cold runtime-only confirmation (2026-02-19, preload enabled, max_tokens=48):
    • startup-to-healthy: 2004.560 -> 1003.455 ms (1.998x faster).
    • request full latency: 317.989 -> 317.276 ms (effectively unchanged).
    • cold total first response: 2322.549 -> 1320.731 ms (1.759x faster).
    • cold total first token: 2009.697 -> 1008.582 ms (1.993x faster).
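
The external-cold totals decompose additively, which makes the effect of the fix easy to localize. A short check using the figures quoted above; the identity "cold total = startup + request full latency" is inferred from the numbers, not stated by the harness.

```python
# External-cold runtime-only confirmation, values in ms.
runs = {
    "convert_off": {"startup": 2004.560, "request_full": 317.989, "cold_total": 2322.549},
    "convert_on": {"startup": 1003.455, "request_full": 317.276, "cold_total": 1320.731},
}

residuals = {}
for name, r in runs.items():
    # Assumed identity: cold total = startup-to-healthy + request full latency.
    residuals[name] = r["cold_total"] - (r["startup"] + r["request_full"])
    print(f"{name}: residual={residuals[name]:+.3f} ms")

saved_total = runs["convert_off"]["cold_total"] - runs["convert_on"]["cold_total"]
saved_startup = runs["convert_off"]["startup"] - runs["convert_on"]["startup"]
startup_share = saved_startup / saved_total
print(f"saved {saved_total:.3f} ms, {startup_share:.1%} of it in startup")
```

Almost all of the saving lands in startup, which fits with preload being enabled in this configuration while request full latency stays effectively unchanged.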

Latest Key Findings (2026-02-19, Runtime vs vLLM After GPU-Convert Fix, 3-run)

  • Setup:
    • same G5 host and model family (Qwen 3B), token parity (max_tokens=48), runtime preload enabled.
    • backends run: runtime + vLLM (PyTorch/Ollama skipped in this repeatability set).
  • Mean over 3 runs:
    • runtime TTFT 5.135 ms vs vLLM 84.390 ms (16.433x faster).
    • runtime request full 319.063 ms vs vLLM 1111.463 ms (3.484x faster).
    • runtime cold-total first response 1656.573 ms vs vLLM 31151.892 ms (18.805x better).
  • Runs 2-3 only (post-first-run stabilization):
    • TTFT 17.211x, full 3.416x, cold-total 22.395x in runtime’s favor.
  • Interpretation:
    • after restoring vLLM env and rerunning on matched settings, runtime remains decisively ahead on request path and end-to-end cold total.

Latest Key Findings (2026-02-19, External Cold All-Backend Repeatability, GPU-Convert Fix2)

  • Setup:
    • same G5 host, same model family (Qwen 3B), token parity (max_tokens=48), runtime preload enabled.
    • backends run: runtime + PyTorch + vLLM + Ollama.
  • 3-run means (all runs):
    • runtime: startup 2339.131 ms, TTFT 5.131 ms, request full 318.315 ms, cold-total first response 2657.447 ms.
    • runtime-normalized ratios:
      • PyTorch: TTFT 115.313x, full 7.508x, cold-total 3.921x.
      • vLLM: TTFT 16.091x, full 3.852x, cold-total 10.887x.
      • Ollama: TTFT 2108.743x, full 35.118x, cold-total 4.584x.
  • Stable reference (runs 1-2):
    • runtime: startup 1003.915 ms, TTFT 5.131 ms, request full 317.290 ms, cold-total first response 1321.205 ms.
    • vLLM vs runtime (runs 1-2): TTFT 18.275x, full 4.298x, cold-total 21.875x.
  • Interpretation:
    • runtime remains decisively ahead on request path and cold-total across all backends.
    • one run had a startup/preload outlier (decoder_tensor_h2d spike) that inflated all-3-run startup mean.
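
The size of the outlier can be inferred from the two startup means above. This is plain arithmetic on the quoted means, assuming the stable reference covers exactly two of the three runs; the per-run value is implied, not read from the artifact.

```python
def implied_remaining_value(mean_all: float, n_all: int,
                            mean_subset: float, n_subset: int) -> float:
    """Mean of the runs left out of the subset, given both means."""
    if n_subset >= n_all:
        raise ValueError("subset must be smaller than the full set")
    return (mean_all * n_all - mean_subset * n_subset) / (n_all - n_subset)


# Runtime startup means quoted above (ms).
outlier_startup = implied_remaining_value(2339.131, 3, 1003.915, 2)
print(f"implied outlier startup: {outlier_startup:.3f} ms")
```

That single run is roughly five times the stable startup, which is enough to more than double the 3-run mean.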

Latest Key Findings (2026-02-19, External Cold All-Backend Repeatability, GPU-Convert + Host-Prefetch Fix)

  • Setup:
    • same G5 host, same model family (Qwen 3B), token parity (max_tokens=48), runtime preload enabled.
    • backends run: runtime + PyTorch + vLLM + Ollama.
    • runtime cold path change: TRENI_TENSOR_HOST_PREFETCH=1 with host-page MADV_WILLNEED on large tensor ranges.
  • 3-run means:
    • runtime: startup 1003.836 ms, TTFT 5.130 ms, request full 316.403 ms, cold-total first response 1320.240 ms.
    • runtime-normalized ratios:
      • PyTorch: TTFT 108.567x, full 7.341x, cold-total 14.601x.
      • vLLM: TTFT 16.537x, full 3.896x, cold-total 21.918x.
      • Ollama: TTFT 514.414x, full 9.471x, cold-total 3.029x.
  • Runtime-only 5-run stability comparison (before vs after host-prefetch):
    • startup max: 3006.388 -> 1003.627 ms.
    • cold-total first response max: 3324.212 -> 1322.338 ms.
    • decoder tensor h2d max: 1869.296 -> 120.671 ms.
    • decoder tensor upload max: 1877.485 -> 128.777 ms.
  • Interpretation:
    • the intermittent preload upload outlier is removed in this sweep while request-path lead is preserved.
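
For readers unfamiliar with the hint, the sketch below shows the MADV_WILLNEED pattern in Python on a temporary file. It illustrates the OS-level mechanism only; the runtime applies it in its own loader, and the file contents and size here are placeholders.

```python
import mmap
import os
import tempfile


def read_with_prefetch(path: str) -> int:
    """Map a file read-only, hint the kernel to page it in, then sum its bytes."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # madvise and MADV_WILLNEED are platform-dependent (Python 3.8+, mostly Unix).
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
                m.madvise(mmap.MADV_WILLNEED)
            return sum(m[:])


# Placeholder "tensor" file: 1 MiB of a repeating byte pattern.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 4096)
    tmp_path = tmp.name

try:
    checksum = read_with_prefetch(tmp_path)
finally:
    os.remove(tmp_path)

print(checksum)
```

The hint is advisory: the kernel may start reading the pages ahead of first access, which is the behavior the host-prefetch change relies on to avoid a slow first touch during upload.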

Latest Key Findings (2026-02-24, External Cold Repeatability After Seq1 Multi-Head Default)

  • Setup:
    • same G5 host class and token-parity prompt budget (max_tokens=48), runtime preload enabled.
    • backends run: runtime + PyTorch + vLLM (Ollama skipped in this rerun host environment).
  • 3-run means:
    • runtime: startup 1003.315 ms, TTFT 4.022 ms, request full 239.277 ms, cold-total first response 1242.592 ms.
    • runtime-normalized ratios:
      • PyTorch: TTFT 127.900x, full 9.378x, cold-total 6.320x.
      • vLLM: TTFT 12.350x, full 4.139x, cold-total 19.333x.
  • Delta vs prior host-prefetch repeatability (2026-02-19, 3-run means):
    • runtime TTFT: 5.130 -> 4.022 ms (1.275x faster).
    • runtime request full: 316.403 -> 239.277 ms (1.322x faster).
    • runtime cold-total first response: 1320.240 -> 1242.592 ms (1.062x faster).
  • Interpretation:
    • after default-on seq1 multi-head promotion, runtime keeps a large cross-system margin and also improved its own cold request path vs the prior repeatability baseline.
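
The "x faster" figures above are ratios of the older mean to the newer mean. The reported cold-total values are also numerically consistent with startup plus request-full latency; that relationship is inferred from the published means rather than stated by the harness, so treat it as an assumption. A small sketch for re-deriving the numbers (helper names are illustrative):

```python
def speedup(before_ms, after_ms):
    """Ratio of old to new latency; values above 1.0 mean the new run is faster."""
    return before_ms / after_ms


def cold_total_ms(startup_ms, request_full_ms):
    """Assumed decomposition: cold-total first response = startup + request full."""
    return startup_ms + request_full_ms


# Means copied from the 2026-02-19 and 2026-02-24 runs above.
ttft_gain = speedup(5.130, 4.022)
full_gain = speedup(316.403, 239.277)
cold_gain = speedup(1320.240, 1242.592)
cold_total = cold_total_ms(1003.315, 239.277)
```

Rounded to three decimals, these reproduce the 1.275x, 1.322x, and 1.062x deltas and the 1242.592 ms cold-total.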

Latest Key Findings (2026-02-24, External Cold Repeatability After Step0 Exp-Reuse Patch)

  • Setup:
    • same G5 host class and token-parity budget (max_tokens=48), runtime preload enabled.
    • backends run: runtime + PyTorch + vLLM (Ollama skipped in this host environment).
    • custom-kernel change: seq1 multi-head softmax/PV path now reuses normalized probabilities rather than recomputing exp in the inner PV loop.
  • 3-run means:
    • runtime: startup 1003.287 ms, TTFT 4.018 ms, request full 238.400 ms, cold-total first response 1241.688 ms.
    • runtime-normalized ratios:
      • PyTorch: TTFT 126.786x, full 9.374x, cold-total 6.320x.
      • vLLM: TTFT 12.545x, full 4.184x, cold-total 19.622x.
  • Delta vs immediate pre-patch repeatability baseline (external_cold_seq1mh_default_repeatability_20260224T192020Z):
    • runtime TTFT: 4.022 -> 4.018 ms (-0.004 ms)
    • runtime request full: 239.277 -> 238.400 ms (-0.877 ms)
    • runtime cold-total first response: 1242.592 -> 1241.688 ms (-0.904 ms)
  • Interpretation:
    • this step0 patch is valid and non-regressing with a small positive shift.
    • next gains likely require deeper reduction-path/launch-structure work in decoder_step0_layers, not only exp reuse.
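
To make the exp-reuse change concrete, here is a pure-Python reference model of single-query (seq_q=1) attention for one head. It is a sketch of the arithmetic only, not the CUDA kernel: function names and the list-based layout are assumptions. The first version recomputes exp inside the PV loop; the second computes normalized probabilities once and reuses them.

```python
import math


def attn_seq1_recompute(scores, values):
    """Softmax + PV with exp recomputed for every (output dim, key) pair."""
    m = max(scores)
    denom = sum(math.exp(s - m) for s in scores)
    out = []
    for d in range(len(values[0])):
        acc = 0.0
        for j, s in enumerate(scores):
            acc += (math.exp(s - m) / denom) * values[j][d]  # redundant exp
        out.append(acc)
    return out


def attn_seq1_reuse(scores, values):
    """Softmax + PV with probabilities computed once and reused in the PV loop."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)
    probs = [e / denom for e in exps]
    out = []
    for d in range(len(values[0])):
        acc = 0.0
        for p, v in zip(probs, values):
            acc += p * v[d]
        out.append(acc)
    return out
```

With K keys and head dimension D, the first form evaluates exp roughly K*D times and the second K times. The small measured delta above suggests exp was not the dominant cost in this path.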

Latest Key Findings (2026-02-24, External Cold Repeatability After Step0 Shared-Prob Follow-Up)

  • Setup:
    • same G5 host class and token-parity budget (max_tokens=48), runtime preload enabled.
    • backends run: runtime + PyTorch + vLLM (Ollama skipped in this host environment).
    • follow-up change: cached per-head seq1 probabilities in shared memory inside multi-head softmax/PV.
  • 3-run means:
    • runtime: TTFT 4.019 ms, request full 238.678 ms, cold-total first response 1241.970 ms.
  • Delta vs immediate step0expfix run:
    • TTFT: 4.018 -> 4.019 ms (+0.001 ms)
    • request full: 238.400 -> 238.678 ms (+0.278 ms)
    • cold-total first response: 1241.688 -> 1241.970 ms (+0.282 ms)
  • Interpretation:
    • this follow-up did not beat the prior exp-reuse patch.
    • path was reverted; current best remains step0expfix.

Latest Key Findings (2026-02-18, Routing Failure-Amplification Stress)

  • Stress profile: injected tool 503 every 2nd request, injected tool timeout every 3rd request, controller tool timeout 0.25s, controller tool retries 1.
  • Internal mean latency: 76.071 ms.
  • External mean latency: 109.806 ms (1.443x external/internal).
  • Internal error rate: 0.0000.
  • External error rate: 0.0833 (4 tool-hop failures over 48 requests).
  • External/internal error-rate ratio: inf (external errored while internal did not).
  • Retry signal: external tool retries mean 0.182; taxonomy shows tool_hop_failed=4.
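
The stress profile can be pictured as a deterministic fault schedule plus a bounded retry loop. The sketch below is illustrative only and does not reproduce the harness or its measured rates: the function names are hypothetical, and it assumes that a timeout takes precedence when both faults land on the same request and that each retry consumes its own slot in the schedule.

```python
import itertools


def injected_fault(n, fail_every=2, timeout_every=3):
    """Return the fault injected on the n-th tool request (1-based), or None."""
    if timeout_every and n % timeout_every == 0:
        return "timeout"  # assumed precedence when both faults coincide
    if fail_every and n % fail_every == 0:
        return "503"
    return None


def call_tool(counter, retries=1):
    """Attempt one tool hop with bounded retries; returns (ok, retries_used)."""
    for attempt in range(retries + 1):
        n = next(counter)  # every attempt is a separate request in the schedule
        if injected_fault(n) is None:
            return True, attempt
    return False, retries


counter = itertools.count(1)
outcomes = [call_tool(counter) for _ in range(5)]
```

A hop fails only when the first attempt and every retry all land on injected faults, which is why a single retry absorbs many but not all injected failures.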

Latest Key Findings (2026-02-19, Routing Matrix Expansion, G5)

  • Matrix set: 6 profiles (p00 baseline + p01..p05 stress variants).
  • Baseline profile: external/internal latency ratio 1.0420x, external error rate 0.0000.
  • Mild timeout profile (p02): ratio 1.1420x, external error rate 0.0000.
  • Mixed moderate profile (p03): ratio 1.1640x, external error rate 0.0417.
  • Mixed aggressive profile (p04): ratio 1.4360x, external error rate 0.0833.
  • Mixed aggressive + retry2 (p05): ratio 1.4160x, external error rate 0.0833.
  • Internal error rate stayed 0.0000 across all 6 profiles.
  • Interpretation: external path degradation scales with timeout/failure pressure; extra retries reduce some retry counts but do not close the latency/error gap.

Latest Key Findings (2026-02-19, Routing Cross-Host Pilot)

  • Topology:
    • local benchmark client
    • SSH tunnel to G5 host
    • runtime and external router on G5 host
  • Baseline profile (crosshost-p00-baseline, 12 runs):
    • Internal mean: 1071.477 ms
    • External mean: 1059.478 ms
    • External/Internal ratio: 0.989x
    • Error rates: internal 0.0000, external 0.0000
  • Mild-timeout profile (crosshost-p02-timeout-mild, 12 runs):
    • Internal mean: 1054.123 ms
    • External mean: 1123.393 ms
    • External/Internal ratio: 1.066x
    • Error rates: internal 0.0000, external 0.0000
    • External tool retries mean: 0.083
  • Stress profile (crosshost-p04-stress, 12 runs, fail/timeout injection):
    • Internal mean: 1056.013 ms
    • External mean: 1100.010 ms
    • External/Internal ratio: 1.042x
    • Error rates: internal 0.0000, external 0.0833
    • External tool retries mean: 0.182
  • Interpretation:
    • under cross-host stress, external path again degrades in both latency and errors while internal remains error-free.
    • this is a pilot sanity check; canonical Track B completion is the split-host matrix below.

Latest Key Findings (2026-02-19, Routing Split-Host Matrix, Canonical Track B)

  • Topology:
    • GPU host: runtime endpoint
    • CPU host: external controller + tool services
    • controller and tool services call the runtime over the same-VPC private network
  • Matrix set: 6 profiles (splithost-p00 baseline + splithost-p01..p05 stress variants), each with 12 runs.
  • Baseline (splithost-p00-baseline):
    • internal mean 1052.392 ms
    • external mean 1046.702 ms
    • ratio 0.995x
    • external error 0.0000
  • Mild fail (splithost-p01_fail_mild): ratio 0.998x, external error 0.0000, external tool retries 0.021.
  • Mild timeout (splithost-p02_timeout_mild): ratio 1.042x, external error 0.0000, external tool retries 0.021.
  • Mixed moderate (splithost-p03_mixed_moderate): ratio 1.001x, external error 0.0417, external tool retries 0.065.
  • Mixed aggressive (splithost-p04_mixed_aggressive): ratio 1.087x, external error 0.0833, external tool retries 0.182.
  • Mixed aggressive + retry2 (splithost-p05_mixed_aggressive_retry2): ratio 1.045x, external error 0.0833, external tool retries 0.091.
  • Matrix-wide summary:
    • external/internal latency ratio mean 1.028x
    • internal error mean 0.0000
    • external error mean 0.0347
  • Interpretation:
    • split-host confirms the same failure-amplification shape: baseline is near parity, but under timeout/failure pressure external path degrades in latency and error rate while internal remains error-free.
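
The matrix-wide summary can be cross-checked from the per-profile values listed above. This sketch averages the published (already rounded) figures, so it assumes the summary is an unweighted mean over the six profiles; the harness may average unrounded values.

```python
from statistics import mean

# (external/internal latency ratio, external error rate) per profile, from above.
profiles = {
    "splithost-p00-baseline": (0.995, 0.0000),
    "splithost-p01_fail_mild": (0.998, 0.0000),
    "splithost-p02_timeout_mild": (1.042, 0.0000),
    "splithost-p03_mixed_moderate": (1.001, 0.0417),
    "splithost-p04_mixed_aggressive": (1.087, 0.0833),
    "splithost-p05_mixed_aggressive_retry2": (1.045, 0.0833),
}

ratio_mean = mean(ratio for ratio, _ in profiles.values())
error_mean = mean(err for _, err in profiles.values())
```

Rounded, these give 1.028x and 0.0347, matching the summary lines.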

Latest Key Findings (2026-02-20, Internet Multi-Hop Matrix on Commercial APIs)

  • Topology:
    • local benchmark client
    • Fly.io hosted external controller/tool hop
    • commercial model runtime endpoints (api.openai.com, openrouter.ai)
  • OpenAI (gpt-5.2, 3 profiles, runs=3):
    • matrix mean external/internal ratio: 1.1123x
    • baseline profile: 1.110x
    • timeout-mild profile: 1.082x
    • mixed-aggressive profile: 1.145x
    • internal error rate 0.0000; mixed-aggressive external error rate 0.0833
  • OpenRouter (openai/gpt-5.2, 3 profiles, runs=3):
    • matrix mean external/internal ratio: 0.7553x
    • baseline profile: 0.686x
    • timeout-mild profile: 0.891x
    • mixed-aggressive profile: 0.689x
    • internal error rate 0.0000; mixed-aggressive external error rate 0.1667
  • OpenRouter (anthropic/claude-sonnet-4.6, 3 profiles, runs=3):
    • matrix mean external/internal ratio: 1.0277x
    • baseline profile: 1.236x
    • timeout-mild profile: 0.968x
    • mixed-aggressive profile: 0.879x
    • internal error rate 0.0000; mixed-aggressive external error rate 0.1667
  • Interpretation:
    • OpenAI matrix supports the routing thesis directionally: public-network external hops are slower and less reliable under stress.
    • OpenRouter remains non-canonical for Track B direction claims in this topology due to mixed/inverted profile direction and elevated external errors.

Latest Key Findings (2026-02-20, Local Control Matrix, No Fly Scheduler Path)

  • Topology:
    • local benchmark client
    • local standalone external controller/tool server
    • commercial model runtime endpoints
  • OpenAI (gpt-5.2, runs=3, 3 profiles):
    • matrix mean external/internal ratio: 0.9867x
    • profiles: baseline 0.995x, timeout-mild 0.977x, mixed-aggressive 0.988x
    • external error mean 0.0313
  • OpenRouter (anthropic/claude-sonnet-4.6, runs=3, 3 profiles):
    • matrix mean external/internal ratio: 1.0663x
    • profiles: baseline 1.055x, timeout-mild 1.141x, mixed-aggressive 1.003x
    • external error mean 0.0313
  • Interpretation:
    • with higher-N local controls, OpenAI is near parity while OpenRouter Sonnet trends in the expected direction (external > internal).
    • external errors still appear under stress while internal stayed error-free in these runs.

Latest Key Findings (2026-02-20, Task-Family Parity Split, Local Control, runs=8)

  • OpenAI gpt-5.2:
    • model_only: external/internal 0.958x (slight inversion, near parity).
    • tool_only: external/internal 1.136x (external slower).
  • OpenRouter anthropic/claude-sonnet-4.6:
    • model_only: external/internal 1.044x (external slower).
    • tool_only: external/internal 1.051x (external slower).
  • Errors:
    • all four task-family runs recorded 0 internal errors and 0 external errors.
  • Interpretation:
    • when isolating tool-required tasks, the architecture hypothesis holds on both providers.
    • model_only remains provider-sensitive; OpenAI is close to parity while Sonnet keeps external slower.

Latest Key Findings (2026-02-19, Phase 3 Canonical G5 Baseline, 3 seeds)

  • Internal success rate mean: 1.0000.
  • External success rate mean: 0.9006.
  • External/Internal latency ratio mean: 16.0603x.
  • External/Internal steps ratio mean: 1.8147x.
  • Scenario split (external success mean):
    • retrieval correction: 1.0000
    • tool-state adaptation: 0.7417
    • confidence-gated branching: 1.0000

Latest Key Findings (2026-02-19, Phase 3 Canonical G5 Stress, 3 seeds)

  • Stress profile: tool fail every 9th request, timeout sleep every 11th request (1.1s), controller timeout 0.35s, controller retries 2.
  • Internal success rate mean: 1.0000.
  • External success rate mean: 0.8782.
  • External/Internal latency ratio mean: 77.1703x.
  • External/Internal steps ratio mean: 1.8240x.
  • Scenario split (external success mean):
    • retrieval correction: 1.0000
    • tool-state adaptation: 0.6833
    • confidence-gated branching: 1.0000

Latest Key Findings (2026-02-19, Phase 3 Uncertainty Ablation, baseline runs=8)

  • Uncertainty enabled vs disabled changes success materially (same tasks, same hardware):
    • Internal success: 1.0000 -> 0.7692 when internal uncertainty is disabled (-0.2308).
    • External success: 0.8846 -> 0.6538 when external uncertainty is disabled (-0.2308).
  • Direction is consistent across all uncertainty sources:
    • normalized_logprob
    • raw_logit_margin
    • hybrid
  • Interpretation:
    • uncertainty-aware branching improves loop task completion in this benchmark.
    • this was the first-pass synthetic-signal proof; runtime-native canonical rerun is now published separately below.

Latest Key Findings (2026-02-19, Phase 3 Uncertainty Ablation Repeatability + Stress)

  • Baseline repeatability set (3 seeds) confirms stable uncertainty gains:
    • Internal uncertainty-on success delta mean: +0.2308 (all three sources).
    • External uncertainty-on success delta mean: +0.2308 (all three sources).
  • Stress repeatability set (3 seeds, tool fail every 9, timeout every 11, sleep 1.1s, controller timeout 0.35s, retries 2) shows:
    • Internal uncertainty-on success delta mean: +0.2308 (all three sources).
    • External uncertainty-on success delta mean: +0.2212 (all three sources).
  • Stress minus baseline:
    • Internal uncertainty gain change: 0.0000.
    • External uncertainty gain change: -0.0096.
  • Interpretation:
    • uncertainty-aware branching benefit is stable in this harness under both normal and stressed routing conditions.
    • the synthetic-source result had initial runtime-native corroboration; the final canonical interpretation now uses the calibrated zero-fallback rerun below.

Latest Key Findings (2026-02-19, Phase 3 Uncertainty Ablation Runtime-Native Canonical Rerun, Superseded)

  • Root-cause fix before rerun:
    • greedy decode uncertainty in /monolith/compute/sample.cu was emitting flat zeros (mean_logprob=0, mean_entropy=0).
    • patched greedy path now computes logprob + entropy from logits (log-sum-exp).
  • Initial runtime-native rerun (runs=8, seeds 7/11/19) showed positive uncertainty-on deltas:
    • baseline internal uncertainty success delta: +0.1026
    • baseline external uncertainty success delta: +0.1155
    • stress internal uncertainty success delta: +0.2308
    • stress external uncertainty success delta: +0.2212
  • Runtime-native int_on_ext_on arm means:
    • baseline: internal success 0.8718, external success 0.7853, ext/int latency 10.9504x
    • stress: internal success 1.0000, external success 0.8782, ext/int latency 74.1471x
  • Interpretation:
    • this run confirmed runtime-native plumbing after the kernel fix, but was later superseded due to fallback contamination in part of the seed set.
    • runtime API now emits a unified awareness payload with both route and generation uncertainty sections (legacy uncertainty preserved); Phase 3 runtime-native client now consumes awareness.generation first when present.
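
The greedy-path fix computes the chosen token's log-probability and the distribution entropy from raw logits via log-sum-exp. A minimal reference sketch of that computation follows; it is a Python model of the math, not the code in sample.cu, and the function name is illustrative.

```python
import math


def greedy_uncertainty(logits):
    """Return (argmax index, logprob of argmax, entropy in nats) from raw logits."""
    m = max(logits)
    # Numerically stable log-sum-exp: subtract the max before exponentiating.
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    logprobs = [x - lse for x in logits]
    best = max(range(len(logits)), key=logits.__getitem__)
    entropy = -sum(math.exp(lp) * lp for lp in logprobs)
    return best, logprobs[best], entropy
```

A uniform distribution over four tokens yields logprob -ln(4) and entropy ln(4), while a sharply peaked distribution yields values near zero for both. A decode path that reports exactly zero for every step, as the pre-fix path did, cannot tell those two cases apart.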

Latest Key Findings (2026-02-20, Phase 3 Runtime-Native Rerun with Unified Awareness, Quality-Gated)

  • Rerun profile:
    • source: runtime_native only
    • seeds: 7/11/19
    • baseline + stress
    • runtime configured with fast probe path (TRENI_DEMO_LAYERS=2)
    • client consumes unified awareness.generation first (legacy fallback preserved)
  • Probe quality gate:
    • all runtime-native arm artifacts in this rerun have fallback=0, errors=0, and non-zero requests/ok.
  • Clean rerun deltas (runtime-native):
    • baseline internal uncertainty success delta: -0.1538
    • baseline external uncertainty success delta: -0.1217
    • stress internal uncertainty success delta: -0.1538
    • stress external uncertainty success delta: -0.1089
    • stress-baseline external uncertainty delta change: +0.0128
  • Important interpretation correction:
    • previously published positive runtime-native deltas were influenced by runtime probe fallback in part of the seed set (notably s11/s19 in older fix1 artifacts).
    • with zero-fallback runtime-native probes, this awareness3 rerun showed uncertainty-on was not yet beneficial in this harness.
    • this kept runtime-native uncertainty wiring validated and motivated the calibration pass documented below.

Latest Key Findings (2026-02-20, Phase 3 Runtime-Native Calibration Rerun calib1, Quality-Gated)

  • Calibration update:
    • runtime-native confidence is now calibrated/blended for decision usage (runtime confidence floor/ceil scaling + prior blend + optional route blend), while preserving raw runtime fields.
    • runner now forwards calibration knobs through ablation harness to child benchmark runs.
  • Canonical rerun profile:
    • source: runtime_native only
    • seeds: 7/11/19
    • baseline + stress
    • calibration params: prior weight 0.75, confidence floor 0.10, confidence ceil 0.35, route blend 0.10
  • Probe quality gate:
    • all runtime-native arm artifacts in this rerun have non-zero requests/ok and fallback=0, errors=0.
  • Calibrated rerun deltas (runtime-native):
    • baseline internal uncertainty success delta: +0.1539
    • baseline external uncertainty success delta: +0.1058
    • stress internal uncertainty success delta: +0.1539
    • stress external uncertainty success delta: +0.1154
    • stress-baseline external uncertainty delta change: +0.0096
  • Interpretation:
    • calibrated runtime-native uncertainty now recovers positive uncertainty-on gains in both baseline and stress while staying zero-fallback.
    • C2 is re-locked for current harness; core phases are complete, and remaining optional work is region-pinned internet-hop controls (plus higher-N where needed).
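
One way to read the calibration described above (floor/ceil scaling, prior blend, optional route blend) is sketched below. The exact formula used by the harness is not given here, so the composition order, the linear blends, and the function and argument names are all assumptions; only the four default parameter values come from the rerun profile.

```python
def calibrate_confidence(raw, prior, route=None, prior_weight=0.75,
                         floor=0.10, ceil=0.35, route_blend=0.10):
    """Hypothetical calibration: rescale raw confidence, then blend with priors."""
    # Map [floor, ceil] onto [0, 1] and clamp values outside that window.
    scaled = (raw - floor) / (ceil - floor)
    scaled = min(1.0, max(0.0, scaled))
    # Blend the rescaled runtime signal with a task-level prior.
    blended = prior_weight * prior + (1.0 - prior_weight) * scaled
    # Optionally mix in route-level confidence.
    if route is not None:
        blended = (1.0 - route_blend) * blended + route_blend * route
    return blended
```

Under these assumptions, a raw confidence at the floor with a prior of 0.5 maps to 0.375, and any raw value at or above the ceiling maps to 0.625. The raw runtime fields are left untouched, consistent with the note that raw values are preserved.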

Latest Key Findings (2026-02-20, Phase 4 Lambda Full Reruns + Paper Package)

Latest Key Findings (2026-02-20, Track B Fairness-Hardened Commercial Reruns, Local Control r8)

  • Harness changes applied and validated:
    • interleaved internal/external ordering (pair_order=alternate)
    • deterministic generation default (temperature=0)
    • token usage export and ms/completion_token normalization
    • strict tool parity enabled for tool_only runs
  • OpenAI gpt-5.2:
    • model-only ext/int: 0.971x (near parity/slight inversion remains)
    • tool-only ext/int (strict parity): 1.038x (internal faster)
    • model-only ms/token internal/external: 57.657 / 57.663 (effectively tied)
    • tool-only ms/token internal/external: 37.553 / 38.990 (internal better)
  • OpenRouter anthropic/claude-sonnet-4.6:
    • model-only ext/int: 1.102x (internal faster)
    • tool-only ext/int (strict parity): 1.063x (internal faster)
    • model-only ms/token internal/external: 61.606 / 70.054 (internal better)
    • tool-only ms/token internal/external: 41.212 / 43.791 (internal better)
  • Interpretation update:
    • fairness hardening removes most of the ambiguity for tool tasks; tool_only now favors internal on both providers.
    • OpenAI model_only remains near-parity/provider-sensitive, so claim language stays task-family-stratified.
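
Two of the harness changes are easy to state in code: alternating which path runs first in each internal/external pair, and normalizing latency by completion tokens. The sketch below is illustrative; the function names are hypothetical, and the assumption that alternation flips on every pair index is a guess at what pair_order=alternate means.

```python
def pair_order(pair_index):
    """Alternate which path runs first so neither side always goes second."""
    if pair_index % 2 == 0:
        return ("internal", "external")
    return ("external", "internal")


def ms_per_completion_token(latency_ms, completion_tokens):
    """Latency normalized by generated tokens, to compare unequal-length outputs."""
    if completion_tokens <= 0:
        raise ValueError("completion_tokens must be positive")
    return latency_ms / completion_tokens
```

Alternating order removes a systematic first-run or second-run advantage, and the per-token metric keeps a path from looking faster merely because it produced a shorter completion.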

Latest Key Findings (2026-02-22, AWS G5 TTFT Kernel Pass)

  • Matched setup:
    • same AWS G5 host (g5.2xlarge, A10G), same container, same benchmark harness.
    • baseline reference: lt0_sync0 post-cache (TRENI_LINEAR_USE_LT=0, TRENI_TENSOR_UPLOAD_SYNC=0).
    • TTFT pass:
      1. softmax/reduction parallelization (near-parity result),
      2. norm kernel rewrite (rmsnorm/layernorm from single-thread to row-parallel 256-thread reductions).
  • Best measured config in this pass: norm+softmax, TRENI_LINEAR_USE_LT=1, TRENI_TENSOR_UPLOAD_SYNC=0.
  • Baseline -> best deltas:
    • cold TTFT: 16.738 -> 13.974 ms (1.198x faster).
    • cold full latency: 424.685 -> 396.814 ms (1.070x faster).
    • warm mean latency: 174.237 -> 147.269 ms (1.183x faster).
    • warm p99 latency: 1035.823 -> 936.297 ms (1.106x faster).
  • Per-model cold TTFT signal:
    • qwen: 39.537 -> 29.411 ms (dominant gain).
    • donut: 3.505 -> 2.619 ms.
    • bart: near-flat (16.523 -> 16.573 ms), so seq2seq-specific step0 bottleneck still needs isolation.
  • Interpretation:
    • linear GEMM plumbing is no longer the main limiter for this run profile.
    • norm/reduction work gives a real TTFT lift; next targeted work should isolate the residual Bart/seq2seq path.
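
The row-parallel reduction used in the norm rewrite can be modeled in plain Python: each of 256 lanes accumulates a strided partial sum of squares, then the partials are combined in a tree. This is a sequential model of the reduction pattern only, not the CUDA kernel; the function name, the eps value, and the absence of a bias term are assumptions.

```python
import math


def rmsnorm_row_parallel(row, weight, eps=1e-6, lanes=256):
    """RMSNorm of one row using a lane-strided partial sum and tree reduction.

    lanes must be a power of two; it models the threads of one block.
    """
    # Phase 1: lane t accumulates elements t, t + lanes, t + 2*lanes, ...
    partial = [0.0] * lanes
    for i, x in enumerate(row):
        partial[i % lanes] += x * x
    # Phase 2: tree reduction, halving the number of active lanes each step.
    stride = lanes // 2
    while stride > 0:
        for lane in range(stride):
            partial[lane] += partial[lane + stride]
        stride //= 2
    inv_rms = 1.0 / math.sqrt(partial[0] / len(row) + eps)
    return [x * inv_rms * w for x, w in zip(row, weight)]
```

On a GPU, phase 1 runs concurrently across threads and phase 2 takes log2(256) = 8 steps, compared with a single thread walking the whole row.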

Latest Key Findings (2026-02-22, AWS G5 TTFT Follow-Up: seq_q=1 Attention Path)

  • Follow-up work after step0 profiling:
    • profile gate TRENI_STEP0_PROFILE=1 added for stage split (decoder_step0_embed, decoder_step0_layers, decoder_step0_logits_sample).
    • seq2seq/Bart profile showed step0 dominated by decoder_step0_layers (not embedding/logits).
    • implemented tiny-shape seq_q=1 attention kernels (QK + PV) and direct K/V projection-to-cache in decoder step path.
    • TRENI_ATTN_SEQ1_USE_KERNEL now defaults to on (1; set 0 to force cuBLAS fallback).
  • Previous best (norm+softmax, lt1_sync0) -> new default path:
    • cold TTFT: 13.974 -> 12.504 ms (1.118x faster).
    • cold full latency: 396.814 -> 390.099 ms (1.017x faster).
    • warm mean latency: 147.269 -> 143.230 ms (1.028x faster).
    • warm p99 latency: 936.297 -> 924.276 ms (1.013x faster).
  • Bart-specific impact:
    • cold TTFT 16.573 -> 12.842 ms (1.29x faster).
  • 3-seed repeatability on the new default path:
    • cold TTFT 12.563 ± 0.037 ms.
    • cold full 390.961 ± 0.270 ms.
    • warm mean 143.297 ± 0.222 ms.
    • warm p99 925.668 ± 1.070 ms.
  • Parity status note:
    • parser now classifies interleaved runtime logs correctly (fallback/failure markers are detected even when stderr is merged into tensor lines).
    • debug rerun identified old-container root cause: minilm used out-of-bounds tensor offsets in monolith_phase3.bin.
    • strict gate is now resolved with rebuilt parity container monolith_phase3_qbm.bin (qwen+bart+minilm):
      • week3_parity_qbm_report_20260222T132155Z.json => checked_total=3, failed_total=0, missing_decoder_models=[], missing_encoder_models=[].
    • runtime-on/off Bart step0 logits A/B remains numerically stable (max abs diff ~2e-6, cosine ~1.0).
  • Interpretation:
    • this confirms residual TTFT loss was in tiny decode attention execution overhead, not linear GEMM path.
    • seq2seq step0 path now moved in the expected direction while preserving gains from prior norm pass.
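
The logits A/B check above reports a maximum absolute difference and a cosine similarity. A minimal sketch of those two metrics follows; the helper names and the pass thresholds in parity_ok are assumptions, not the parity gate's actual settings.

```python
import math


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two logits vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def parity_ok(a, b, atol=1e-5, min_cos=0.9999):
    """Hypothetical gate: both metrics must fall within tolerance."""
    return max_abs_diff(a, b) <= atol and cosine_similarity(a, b) >= min_cos
```

Using both metrics is a reasonable design choice: max abs diff catches a single badly wrong logit, while cosine similarity catches a global scale or direction drift that stays small element by element.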

Latest Key Findings (2026-02-22, AWS G5 Attention Backend A/B, Deconfounded)

  • Setup:
    • runtime rebuilt with WITH_CUDNN=1 and TRENI_ATTN_BACKEND_STRICT=1.
    • compared TRENI_ATTN_BACKEND=custom vs TRENI_ATTN_BACKEND=cudnn_sdpa in explicit proxy mode (TRENI_ATTN_ALLOW_SDPA_PROXY=1).
    • reverse-order rerun used as canonical to remove first-run cold cache bias.
  • Reverse-order canonical (attn_backend_ab_rev_20260222T144736Z):
    • cold TTFT: custom 6.460 ms, cudnn 6.447 ms (custom/cudnn=1.002x).
    • cold full: custom 147.789 ms, cudnn 146.707 ms (1.007x).
    • warm mean: custom 53.545 ms, cudnn 53.341 ms (1.004x).
    • warm p99: custom 82.031 ms, cudnn 80.754 ms (1.016x).
  • Interpretation:
    • legacy proxy mode is near-parity/slightly faster in this decomposition.
    • runtime now treats cudnn_sdpa as fused-only by default; proxy behavior is explicit opt-in (TRENI_ATTN_ALLOW_SDPA_PROXY=1).
    • true fused cuDNN SDPA/flash-attention path is still pending.

Latest Key Findings (2026-02-22, AWS G5 Seq1 Hybrid Tuning + Fused Follow-Up)

  • Setup:
    • runtime rebuilt with seq1-path tuning changes:
      • specialized seq_q=1 softmax kernel
      • one-time cached attention env config reads
      • optional hybrid knobs: TRENI_ATTN_SEQ1_USE_CUBLAS_QK, TRENI_ATTN_SEQ1_USE_CUBLAS_PV
    • warm matrix (12 runs, 4 warmups): default vs qk-cublas vs pv-cublas vs both-cublas.
  • Warm results (seq1_hybrid_20260222T1554Z):
    • default: mean 54.505 ms, p99 82.134 ms
    • qk-cublas: mean 54.572 ms, p99 81.776 ms
    • pv-cublas: mean 54.281 ms, p99 80.754 ms
    • both-cublas: mean 54.822 ms, p99 79.947 ms
  • Cold sanity (seq1_hybrid_20260222T1558Z):
    • default: TTFT 6.447 ms, full 147.756 ms
    • pv-cublas: TTFT 6.450 ms, full 149.293 ms
  • Fused follow-up (seq1_hybrid_fused_20260222T192656Z):
    • code changes:
      • fused seq_q=1 softmax+PV custom kernel
      • seq1 QK kernel block retune (64/128/256 based on head_dim)
    • warm default: mean 54.505 -> 52.535 ms, p99 82.134 -> 80.554 ms
    • warm pv-cublas: mean 54.281 -> 51.964 ms, p99 80.754 -> 78.519 ms
    • cold default: TTFT 6.447 -> 6.209 ms, full 147.756 -> 145.587 ms
    • cold pv-cublas: TTFT 6.450 -> 6.215 ms, full 149.293 -> 147.937 ms
  • Interpretation:
    • fused seq1 follow-up improved both warm and cold for default custom path.
    • pv-cublas remains the fastest warm variant in this pass, but the default custom path still keeps the stronger cold first-hit balance.
    • this closes more request-path overhead without changing model/tool behavior.

Latest Key Findings (2026-02-22, H100 cuDNN SDPA Fused Probe)

  • Probe pack: phase2_runtime/results/cudnn_sdpa_h100_probe_20260222T1935Z.
  • Environment:
    • Lambda H100 (sm90) probe host.
    • tested staged system cuDNN 9.19 and pip nvidia-cudnn-cu12==9.19.0.56.
  • Results:
    • alignment sweep: cnt=0 for align={16,32,64,128,256}.
    • shape/layout sweep: tested=1440, supported=0.
    • debug logs show candidate SDPA engines (8/9/10/11) but no viable configs:
      • NOT_SUPPORTED_GRAPH_PATTERN (8/9/11)
      • NOT_SUPPORTED_ARCH_MISMATCH (10, Blackwell-only).
  • Interpretation:
    • true fused cudnn_sdpa is still unresolved in current backend descriptor path even on H100.
    • runtime stays on explicit fused-only semantics for cudnn_sdpa; proxy remains opt-in only.

Latest Key Findings (2026-02-22, Phase 3 Realistic-v1 Reruns)

  • Realistic-v1 loop summary (phase3_realistic_v1_summary_20260222T143919Z):
    • baseline:
      • internal success 1.0000
      • external success 0.9010
      • external/internal latency ratio 15.8563x
      • external/internal steps ratio 1.8037x
    • stress:
      • internal success 1.0000
      • external success 0.9010
      • external/internal latency ratio 75.3563x
      • external/internal steps ratio 1.8037x
  • Realistic-v1 uncertainty compare (phase3_uncertainty_compare_realistic_v1_s7_20260222T144116Z):
    • baseline uncertainty-on success deltas: internal +0.2500, external +0.2500 (all 3 sources).
    • stress uncertainty-on success deltas: internal +0.2500, external +0.2344 (all 3 sources).
  • Interpretation:
    • richer file-backed fixtures keep the same thesis direction: internal loops are faster and more stable.
    • uncertainty-aware branching remains beneficial on realistic-v1.

What Is Still Missing Per Plan

If following the full sequence:

  1. Optional: add region-pinned internet multi-hop controls (Fly-to-Fly or fixed-region affinity) to reduce provider-path confounding.
  2. Still open: replace cudnn_sdpa proxy route with true fused cuDNN SDPA/flash-attention frontend path and rerun A/B.

Canonical Clarification

  • Full-system canonical set remains g5-20260216-foundation.
  • Cold optimization is tracked as g5-20260217-cold-indexcache (latest cold-specific canonical evidence).
  • Cold decomposition/collect optimization is tracked as phase2-runtime clean4 (latest cold-stage evidence).
  • External-cold canonical repeatability after GPU-convert + host-prefetch fix is tracked in phase2_external_cold external_cold_gpuconvert_prefetch_allbackends_repeatability_20260219T203017Z.

Artifact Pointers

Latest Qwen3.5 Status

  • Prompt/token parity for the failing IFEval probe is confirmed against HF tokenization.
  • The new batched prefill path is not the remaining cause of IFEval quality drift.
  • Current strict one-host control lane state:
    • runtime is ahead on latency
    • runtime still trails vLLM slightly on IFEval-style instruction fidelity
  • New evaluator-guided IFEval repair loop improves runtime quality over runtime control, but does not yet fully surpass vLLM control.

Latest Qwen Family Runtime Status (2026-03-10)

  • Live AWS qwen35 (Qwen/Qwen3.5-0.8B) is healthy again and direct /v1/chat/completions now returns inference.used=true.
  • Live AWS qwen35_4b (Qwen/Qwen3.5-4B) also performs real inference on the same host when launched with runtime_pool_mb=15360.
  • Direct runtime smoke against the root runtime URL is now split clearly:
    • qwen35 passes the direct tool-call smoke on AWS, including first-turn function calling and follow-up tool-result handling (benchmarks/qwen35_smoke/results/live-qwen35-toolsmoke-root-20260310.json).
    • qwen35_4b loads and infers, but still fails the same exact-output/tool-call smoke contract on the current harness (benchmarks/qwen35_smoke/results/live-qwen35_4b-toolsmoke-root-20260310.json).
  • Backward compatibility is re-proven:
    • a fresh Qwen/Qwen2.5-0.5B-Instruct container was packed and proved on AWS before cleanup
    • the old qwen2.5 host artifacts were then removed to free disk while preserving code-path compatibility
  • qwen35_9b family wiring exists in /Users/andrewcorrea/treni/scripts/qwen_runtime_env.py, but the current AWS host does not yet have a packed 9B container and is not the intended proof box for that model size.

Same-VM Promotion Status (2026-03-10, 4B + 9B follow-up)

  • The old negative 4B status is now superseded.
  • Root cause was a runtime parity bug in the cached linear-attention decode path for Qwen3.5-4B:
    • the step-path repeated key heads before the depthwise-conv update instead of after it
    • Hugging Face on the same host already proved the model itself was fine
  • Repaired canonical 4B same-VM artifact:
    • /Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/samevm-agent-full-aws-r4-qwen35_4b_20260310T184433Z.json
  • Repaired 4B result:
    • 15/15
    • direct runtime smoke passes
    • direct PDF RAG passes
    • direct embed/rerank passes
    • direct TTS/STT passes
    • Hermes runtime-status/RAG/SQLite/memory/execute_code all pass
  • Current same-VM status on AWS A10G:
    • qwen35 (0.8B) remains the speed-first lane
    • qwen35_4b (4B) is now a real end-to-end valid agent lane
  • Lambda 9B state:
    • account auth and SSH key are valid
    • current launch attempts are blocked by provider-side capacity / rate limiting
    • no live Lambda 9B proof host exists yet from this sweep

Latest Same-VM Agent Compare Matrix (2026-03-10)

  • New clean model-dependent comparison suite:
    • /Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/samevm-agent-compare-aws-r2-qwen35.json
    • /Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/samevm-agent-compare-aws-r2-qwen35_4b.json
  • Scope of this lane:
    • runtime health
    • worker health
    • direct runtime smoke
    • Hermes runtime-status
    • Hermes RAG search
    • Hermes SQLite exec/query
    • Hermes memory add/read
    • Hermes execute_code
  • Result:
    • qwen35 (0.8B): 10/10 pass (1.0)
    • qwen35_4b (4B): 2/10 pass (0.2)
  • Claim-safe interpretation:
    • this selector artifact is stale and predates the repaired 4B decoder path
    • use the repaired full suite as the current source of truth for 4B
  • Isolated speed probes:
    • /Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/qwen35_model_speed_compare_20260310.md
      • 0.8B warm steady state: about 113.7 tok/s, ttft ≈ 95.4 ms
      • 4B repeated warm lane: about 38.5 tok/s, ttft ≈ 158.9 ms

Stub Audit Clarification (2026-03-10)

  • Direct Phase 5 runtime/vLLM comparisons do not rely on Hermes tool stubs.
  • Same-VM Hermes wrapper had one localized optional-import shim in /Users/andrewcorrea/treni/scripts/hermes_same_vm_mvp.py; that path has been fixed.
  • Same-VM Hermes now loads the real file/code tools (read_file, write_file, search_files, patch, execute_code) after the tools package import fix in the wrapper.
  • Live Hermes single-tool validation on AWS now shows:
    • qwen35 can execute real samevm_rag_search successfully against a raw-PDF-ingested local RAG store.
    • qwen35 can call real execute_code; the current small-model weakness is argument/code quality, not missing tool plumbing.
    • qwen35 also calls samevm_sqlite_query, but still tends to generate malformed SQL unless tightly guided.
  • Phase 3 loop studies still include synthetic fixture profiles by design. Those results remain useful, but they are not equivalent to the direct benchmark lane.
