Treni

TODO

Live execution checklist and next actions.

Priority Order

Current Checklist

Same-VM Agent Productization

  • Register Treni same-VM tools natively inside Hermes instead of wrapper-only injection.
  • Verify no duplicate Hermes native tools for browser/code execution/same-VM registrations.
  • Fix multi-turn native Hermes 4B session replay bug (tool_call_id uniqueness).
  • Make the split real-world conversation lane (discover -> SQLite -> RAG -> memory -> recall) pass on native Hermes 4B.
  • Make the single-turn combined persistence prompt reliably satisfy SQLite + RAG + memory in one freeform turn.
  • Add true token streaming for agent-mode turns in the public GPU Agent console.
  • Move the public demo from AWS g5.2xlarge to a larger GPU host once Lambda (or another provider) gives stable capacity.
  • Harden Qwen3.5-4B runtime supervision on port 18081 so long demo sessions survive without manual restarts.
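The runtime-supervision item above can be sketched as a small watchdog loop. Everything below is a placeholder assumption, not the real deployment: `start_cmd`, `health_url`, and the failure thresholds stand in for the actual launch command and health endpoint on port 18081.

```python
import subprocess
import time
import urllib.request

def next_action(proc_alive, probe_ok, fails, fail_limit=3):
    # Pure decision step for a supervision loop: reset the failure count on a
    # good probe, and request a restart when the process died or the probe
    # failed fail_limit times in a row. Returns (restart, new_fail_count).
    fails = 0 if probe_ok else fails + 1
    if not proc_alive or fails >= fail_limit:
        return True, 0
    return False, fails

def supervise(start_cmd, health_url, poll_s=10):
    # Hypothetical driver: start_cmd and health_url are placeholders for the
    # real runtime launch command and its health endpoint.
    proc, fails = subprocess.Popen(start_cmd), 0
    while True:
        time.sleep(poll_s)
        try:
            urllib.request.urlopen(health_url, timeout=5).close()
            ok = True
        except Exception:
            ok = False
        restart, fails = next_action(proc.poll() is None, ok, fails)
        if restart:
            proc.kill()
            proc.wait()
            proc = subprocess.Popen(start_cmd)
```

Keeping the restart decision pure (`next_action`) makes the policy testable without spawning a server.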

Track A: Cold/Hot Foundations

  • True TTFT instrumentation in runtime request path.
  • 3x cold-first-hit repeatability set (G5).
  • 3x warm steady-state repeatability set (G5).
  • Cold bottleneck fix: per-model tensor lookup index cache.
  • Cold rerun after fix with artifact pack.
  • Add stage-level cold decomposition metrics (tokenizer load, index build, tensor upload, first decode step).
  • Optimize model_tensor_index_build via fast tensor collect path and rerun 3x cold validation.
  • Rerun 3x cold validation after reverting regressed upload path (clean7) and confirm clean4 parity.
  • Add sub-stage upload instrumentation (decoder_tensor_convert, decoder_tensor_h2d, decoder_tensor_copy_total).
  • Add startup preload + tokenizer cache path to cut first-request upload/tokenizer overhead.
  • Wire request max_tokens through runtime HTTP path for token-parity comparisons.
  • Disable decoder per-step trace by default (TRENI_DEMO_TRACE opt-in).
  • Reduce remaining Qwen request-path TTFT/full gap vs vLLM (decoder/sampling fixes validated on token-parity reruns).
  • Align decode-stop behavior with vLLM/HF semantics (stop on end markers, not im_start) and keep chat cleanup token-level (TRENI_HTTP_OUTPUT_FILTER=1 default, sanitize still opt-in).
    • Validation (2026-03-02, AWS qwen05 probe): prior "<|im" leak removed in direct /v1/chat/completions responses with decode-stop on.
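A minimal text-level sketch of the decode-stop semantics above: generation halts at an end marker (vLLM/HF-style), while `<|im_start|>` alone is not a stop reason. The runtime applies this at token level; the marker set here is an illustrative assumption.

```python
END_MARKERS = ("<|im_end|>", "<|endoftext|>")  # illustrative stop-marker set

def apply_decode_stop(text):
    # Cut the visible output at the earliest end marker, if any. <|im_start|>
    # is deliberately absent from END_MARKERS, so it never triggers a stop.
    # Returns (visible_text, stopped).
    cut = len(text)
    stopped = False
    for marker in END_MARKERS:
        idx = text.find(marker)
        if idx != -1 and idx < cut:
            cut, stopped = idx, True
    return text[:cut], stopped
```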
  • Fix tokenizer special-token encode parity for chat templates (<|...|> now encoded as atomic tokens in BPE path).
    • Validation (2026-03-02, AWS qwen05 prompt-id probe): template prompt length dropped (35 -> 25) and first token is now the expected chat-control token id instead of punctuation-fragment ids.
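The atomic special-token fix can be illustrated as a pre-split pass ahead of BPE: chat-control markers map to their single reserved ids instead of being shattered into punctuation-fragment ids by the merge loop. `special_ids` and `bpe_encode` are hypothetical stand-ins for the runtime tokenizer's real internals.

```python
import re

def encode_with_specials(text, special_ids, bpe_encode):
    # Split the prompt on the special-token strings (kept via the capture
    # group), then encode each piece: specials atomically, the rest via BPE.
    pattern = "(" + "|".join(re.escape(t) for t in special_ids) + ")"
    ids = []
    for piece in re.split(pattern, text):
        if piece in special_ids:
            ids.append(special_ids[piece])   # atomic: one token per marker
        elif piece:
            ids.extend(bpe_encode(piece))    # ordinary text goes through BPE
    return ids
```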
  • Prevent HTTP heuristic route-text fallback when inference succeeds with empty output.
    • Validation (2026-03-02): empty generation now returns empty assistant content, not synthetic "Routed to ..." text.
  • Resolve qwen05 deterministic MCQ empty-completion parity gap (runtime token-0 stop vs vLLM non-empty output).
    • Root cause (2026-03-02): runtime Qwen template did not inject the default system preamble for user-only chats (HF/vLLM template does), shifting next-token distribution toward immediate EOS for that prompt.
    • Fix: inject Qwen default system preamble in HTTP chat-template build when no system message is provided.
    • Validation (2026-03-02, user-only prompts): runtime now returns non-empty MCQ output ("12") and no longer stops immediately on EOS for that case.
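A minimal sketch of the template-build fix, assuming a placeholder default preamble (the exact Qwen preamble text lives in the HF chat template, not here): when the caller supplies no system message, one is injected before rendering, matching the HF/vLLM behavior described in the root cause.

```python
def build_chat_prompt(messages, default_system="You are a helpful assistant."):
    # Inject the default system preamble for user-only chats so the runtime's
    # rendered prompt matches what the HF/vLLM template would produce.
    if not any(m["role"] == "system" for m in messages):
        messages = [{"role": "system", "content": default_system}] + messages
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # open the assistant turn
    return "".join(parts)
```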
  • Re-run qwen05 external-cold runtime-vLLM benchmark after template/decoder fixes and confirm non-empty runtime response on the prior failing path.
    • Artifacts (2026-03-02): external_cold_qwen05_templatefix_20260302T154019Z.json, external_cold_qwen05_templatefix_nofixeos_20260302T154151Z.json.
    • Result: runtime completion is non-empty (usage_completion_tokens=3), TTFT remains strongly ahead of vLLM on this profile.
  • Re-run Phase 5 awareness benchmark on canonical qwen with matched depth/samples after qwen05 parity fixes.
    • Artifact (2026-03-02, layers=36, samples=8): phase5_awareness_realbench_qwen-realbench-r9-templatefix1-l36s8_20260302T161123Z.json.
    • Result snapshot: gsm8k recovered materially (A=0.625, C=0.750), while gpqa_diamond regressed vs r5 (A=0.125), so the quality claim remains mixed across task families.
  • Bring up Qwen3.5 serving path on AWS using vLLM nightly (main branch wheel path) and validate OpenAI-compatible endpoint.
    • Runtime env (2026-03-02): .venv-vllm-nightly-q35, vllm 0.16.1rc1.dev....
    • Launch mode: --language-model-only, --max-model-len 32768, --enforce-eager.
  • Resolve host infra blocker that broke Qwen3.5 startup ("No usable temporary directory").
    • Root cause: root filesystem at 100%.
    • Fix: cleanup old caches/venvs and run server with explicit TMPDIR.
  • Run Qwen3.5 Phase 5 diagnostic sequence and remove A/B/C fairness noise.
    • r1: phase5_awareness_realbench_qwen35-realbench-r1-s8-nonthinking_20260302T184159Z.json (showed down deltas).
    • r2: phase5_awareness_realbench_qwen35-realbench-r2-policyfix1-s8-nonthinking_20260302T184624Z.json (partial improvement).
    • r3 shared-first fairness fix: phase5_awareness_realbench_qwen35-realbench-r3-sharedfirst-s8-nonthinking_20260302T184947Z.json (all B-A/C-A deltas 0.0).
  • Clone paper reference implementation and align Phase 5 trigger policy to paper-style uncertainty loop.
    • Repo: third_party/weave-logprobs-reasoning-loop
    • Harness updates (scripts/phase5_awareness_realbench.py):
      • new trigger mode paper|confidence|hybrid (default paper),
      • paper trigger signals: perplexity, max_entropy, low_confidence_tokens,
      • retry prompt now carries uncertainty summary from first pass,
      • artifact trace now includes per-call uncertainty metrics/table.
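The three paper-trigger signals above can be computed from chosen-token logprobs alone. This is a sketch, not the harness code: treating negative logprob as an entropy proxy is an assumption here, since full next-token distributions may not be returned by the endpoint, and the threshold is illustrative.

```python
import math

def uncertainty_signals(token_logprobs, low_conf_threshold=-1.5):
    # Per-call uncertainty summary in the spirit of the paper-style trigger:
    # sequence perplexity, the worst per-token surprise (negative logprob as an
    # entropy proxy), and a count of low-confidence tokens.
    n = len(token_logprobs)
    avg_nll = -sum(token_logprobs) / n
    return {
        "perplexity": math.exp(avg_nll),
        "max_entropy": max(-lp for lp in token_logprobs),
        "low_confidence_tokens": sum(1 for lp in token_logprobs
                                     if lp < low_conf_threshold),
    }
```

A retry decision then thresholds any of these signals, and the same dict can be attached to the artifact trace per call.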
  • Validate paper-mode path with an end-to-end AWS smoke run (Qwen3.5 nightly vLLM).
    • Artifact: benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen35-paper-smoke_20260302T191420Z.json
    • Validation: retry_decision.paper_reasons and loop_trace[*].uncertainty populated per run.
  • Run Qwen3.5 Phase 5 rerun (r4) with --awareness-trigger-mode paper and compare deltas vs r3 no-up baseline.
    • Artifact: benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen35-realbench-r4-paper-s8-nonthinking_20260302T191642Z.json
    • Outcome: loop path works and triggers correctly, but net quality uplift is not present on this sample (overall B -0.046875, overall C 0.0 vs A).
  • Retune from fixed paper thresholds to adaptive uncertainty gating and rerun on Qwen3.5.
    • Implementation: adaptive trigger mode with rolling per-task uncertainty history in scripts/phase5_awareness_realbench.py.
    • r5 artifact: benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen35-realbench-r5-adaptive-s8-nonthinking_20260302T202105Z.json.
    • Result vs r4: lower negative delta (B-A -0.015625 vs -0.046875) and much lower latency overhead.
  • Run stricter adaptive follow-up (r6) and compare.
    • Artifact: benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen35-realbench-r6-adaptive-strict-s8-nonthinking_20260302T202314Z.json.
    • Result: B-A=0.0 but C-A=-0.03125; kept r5 adaptive defaults as better balance.
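The adaptive gating idea (rolling per-task uncertainty history instead of fixed paper thresholds) reduces to something like the following. Window, quantile, and minimum-history values are illustrative, not the tuned r5 defaults.

```python
from collections import defaultdict, deque

class AdaptiveTrigger:
    # Keep a rolling per-task history of a scalar uncertainty signal and retry
    # only when the current value is high relative to that task's own recent
    # history, rather than against a fixed global threshold.
    def __init__(self, window=32, quantile=0.75, min_history=4):
        self.history = defaultdict(lambda: deque(maxlen=window))
        self.quantile = quantile
        self.min_history = min_history

    def should_retry(self, task, value):
        hist = self.history[task]
        decision = False
        if len(hist) >= self.min_history:
            # Empirical quantile of the task's recent uncertainty values.
            cutoff = sorted(hist)[int(self.quantile * (len(hist) - 1))]
            decision = value > cutoff
        hist.append(value)
        return decision
```

Per-task normalization is the point: a perplexity that is routine for gpqa_diamond can be an outlier for ifeval, so a shared fixed threshold over-triggers on one task while never firing on another.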
  • Add strict inference hard-fail mode for HTTP benchmarking (TRENI_HTTP_REQUIRE_INFERENCE=1) so empty/fallback outputs are rejected instead of silently counted.
    • Runtime now returns 502 {"error":"inference_required"} when model inference is unused/invalid in strict mode.
  • Add strict canonical matrix runner for Qwen/Qwen3.5-0.8B runtime-vs-vLLM (scripts/phase5_qwen35_runtime_vs_vllm_matrix.py).
    • Enforces fixed seeds/params, endpoint preflight, hard artifact validation, and bootstrap CI output.
  • Add explicit arm selection to Phase 5 harness + matrix runner (--arms, --phase5-arms) so strict backend matrix can run Arm A-only.
  • Unblock runtime decoder support for Qwen3.5 linear_attn layers and complete strict canonical runtime-vs-vLLM matrix.
    • Matrix artifacts (2026-03-02): phase5_qwen35_runtime_vs_vllm_matrix_20260302T221546Z.json, phase5_qwen35_runtime_vs_vllm_matrix_20260302T222013Z.json.
    • Current canonical snapshot (r1, 3 seeds): runtime score 0.0503 vs vLLM 0.2170; runtime latency 1881.188 ms vs vLLM 178.093 ms.
  • Rerun strict Qwen3.5 matrix after decoder gate-layout fix in Arm A-only mode (3 seeds, 4 tasks, 8 samples/task).
    • Matrix artifacts (2026-03-03): phase5_qwen35_runtime_vs_vllm_matrix_20260303T104038Z.json, phase5_qwen35_runtime_vs_vllm_matrix_20260303T104038Z.md.
    • Outcome: quality gap narrowed (runtime 0.15625 vs vLLM 0.19097), but runtime is still slower overall (1723.685 ms vs 958.757 ms).
  • Harden Phase 5 closed-form parsing to prevent false positives from long reasoning traces.
    • Changes (2026-03-04, scripts/phase5_awareness_realbench.py):
      • strict final-answer extraction for GPQA/GSM8K/AIME (ANSWER: / Final Answer: / boxed / strict numeric-only),
      • strip <think>...</think> blocks before parse,
      • reject long chain-of-thought "last number" fallback parses.
    • Validation artifact: phase5_awareness_realbench_q35-parsefix-vllm-thinking1_20260304T032441Z.json now returns prediction_parsed=null for unresolved thinking traces.
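A condensed sketch of the hardened closed-form parse; the patterns are abbreviated from the rules above and the helper name is hypothetical.

```python
import re

def parse_final_answer(text):
    # Strip <think> blocks first, accept only explicit final-answer markers or
    # a strict numeric-only completion, and reject the old "last number in a
    # long trace" fallback that produced false positives.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    for pat in (r"ANSWER:\s*([^\n]+)", r"Final Answer:\s*([^\n]+)",
                r"\\boxed\{([^}]+)\}"):
        m = re.search(pat, text)
        if m:
            return m.group(1).strip()
    stripped = text.strip()
    if re.fullmatch(r"-?\d+(\.\d+)?", stripped):
        return stripped  # strict numeric-only completion
    return None  # unresolved trace -> prediction_parsed = null
```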
  • Run post-parse-fix strict paired AB3 rerun on gpqa_diamond+ifeval (16/task, seeds 7/17/27, Arm A, request_logprobs=false).
    • Summary artifact: phase5_q35_runtime_vs_vllm_ifeval_gpqa_ab3_20260304T034227Z.json
    • Result:
      • overall score: runtime 0.3403 vs vLLM 0.3229 (small edge, CI includes parity),
      • overall latency: runtime 1772.931 ms vs vLLM 1553.034 ms (runtime slower on aggregate),
      • stratified: runtime wins strongly on ifeval latency/score, but remains far slower on gpqa_diamond.
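The "CI includes parity" judgment above comes from a paired bootstrap over per-sample score deltas. A minimal sketch (iteration count and seed are arbitrary here): zero inside the interval means the small edge is not distinguishable from parity at that confidence level.

```python
import random

def paired_bootstrap_ci(runtime_scores, vllm_scores,
                        iters=10000, alpha=0.05, seed=7):
    # Resample paired runtime-minus-vLLM deltas with replacement and report a
    # percentile interval for the mean delta.
    rng = random.Random(seed)
    deltas = [r - v for r, v in zip(runtime_scores, vllm_scores)]
    means = []
    for _ in range(iters):
        sample = [rng.choice(deltas) for _ in deltas]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return lo, hi
```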
  • Add Qwen3.5 tokenizer/full-vocab audit and extended endpoint probe matrix.
    • Tokenizer audit: runtime-q35-tokenizer-audit-r4_20260306T190418Z.json
    • Consolidated probe matrix: qwen35-probe-matrix-r2_20260306T200035Z.json
  • Build same-VM Hermes MVP for local runtime + CPU tools and validate ORPO smoke training.
    • Smoke: hermes-samevm-q35-smoke-r5_20260306T192703Z.json
    • ORPO smoke launch: hermes-samevm-q35-orpo-smoke-r1_20260306T194152Z.json
  • Recover the explicit AWS same-VM Qwen3.5 wrapper so it can auto-start runtime + tool worker and emit a usable final summary.
    • Wrapper artifact (2026-03-07): benchmarks/same_vm_mvp/results/samevm-q35-stack_20260307T172158Z.json
    • Smoke sub-artifacts: benchmarks/qwen35_smoke/results/samevm-smoke-20260307T172124Z_20260307T172124Z.json, benchmarks/qwen35_smoke/results/samevm-smoke-20260307T172124Z_20260307T172124Z.md
  • Add a sequential one-host strict matrix runner for Qwen3.5 runtime-vs-vLLM and rerun it on the active AWS host.
    • Runner: scripts/phase5_qwen35_remote_strict_matrix.py
    • Contract artifacts (2026-03-07): qwen35-tokenizer-audit-active_20260307T173024Z.json, qwen35-runtime-smoke-active2_20260307T173132Z.json, qwen35-isolated-ab-active_20260307T173228Z.json
  • Enable Qwen3.5 prefix cache by default and fix request-path TTFT accounting before the next strict rerun.
    • Code path: monolith/main.c
    • Late strict rerun artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T191653Z.json
    • Result: score recovered overall (runtime 0.333333 vs vLLM 0.315972), but latency is still far behind (runtime 3809.745 ms vs vLLM 1626.068 ms).
  • Keep two Qwen3.5 strict lanes explicit and fix sampled reproducibility.
    • Deterministic canonical strict run (2026-03-08):
      • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T204248Z.json
      • overall: runtime 0.295139 vs vLLM 0.267361, runtime 824.714 ms vs vLLM 1572.529 ms
      • gpqa_diamond: parity score, runtime slower
      • ifeval: runtime higher score and much faster
    • Sampled reproducibility is now fixed:
      • phase5_repro_runtime_ifeval_s7_raw_seedfix_r1.json vs phase5_repro_runtime_ifeval_s7_raw_seedfix_r2.json
      • same seed/config holds at 0.3125 with 8/8 outputs identical
    • Sampled canonical strict run (2026-03-08):
      • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T220806Z.json
      • overall: runtime 0.409722 vs vLLM 0.302083, runtime 1617.187 ms vs vLLM 2017.206 ms
      • gpqa_diamond: runtime higher score, runtime slower
      • ifeval: runtime higher score and faster
  • Profile the Qwen3.5 strict benchmark hotspot directly with TRENI_STEP0_PROFILE / TRENI_DECODE_STAGE_PROFILE.
    • Artifact: benchmarks/phase5_awareness_realbench/results/q35-gpqa-profile-aws-clean_20260307T220200Z.json
    • Clean current finding:
      • first call: decoder_tensor_upload=218.091 ms, decoder_prefill=3263.527 ms, decoder_ttft=3317.441 ms
      • second call: decoder_tensor_upload=11.216 ms, decoder_prefix_cache_copy=0.162 ms, decoder_prefill=2690.001 ms, decoder_ttft=2750.672 ms
      • step-0 decode remains small (decoder_step0_layers ~8 ms, decoder_step0_logits_sample ~33-36 ms)
    • Conclusion: the remaining GPQA gap is still dominated by long-prompt prefill, not tokenizer or step-0 decode.
  • [~] Keep the Qwen3.5 prefix-cache path correctness-safe while continuing latency work.
    • Focused AWS profile (2x gpqa + 2x ifeval, 2026-03-07) found:
      • GPQA gets a real 64-token prefix-cache hit (decoder_prefill ~3075 -> ~2697 ms),
      • short IFEval prompts were tripping a prefix-cache/store CUDA invalid-argument path.
    • Safe runtime policy now skips prefix-cache store on short prompts while preserving long-prompt GPQA cache hits.
    • Follow-up smarter tiering (cap=112, quartile tiers + exact replay) now has clean latency evidence:
      • direct sequential GPQA profile: q35-gpqa-profile-aws-seq2-cap112_20260307T222540Z.json
        • second related-call decoder_prefill 2696.101 -> 2544.202 ms
        • second related-call decoder_ttft 2747.697 -> 2595.907 ms
      • clean strict seed-7 spot:
        • cap112: phase5_qwen35_remote_strict_matrix_20260307T223218Z.json
        • cap64: phase5_qwen35_remote_strict_matrix_20260307T223555Z.json
        • runtime latency delta (112 - 64):
          • overall -363.908 ms
          • gpqa_diamond -420.699 ms
          • ifeval -307.116 ms
    • Next requirement: convert this real but partial latency win into a multi-seed strict result that is no longer behind vLLM overall.
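The safe store policy plus longest-prefix lookup reduces to the following toy model. Real K/V block management and the CUDA path are elided; the thresholds simply mirror the short-prompt store skip and the cap idea, and the class itself is hypothetical.

```python
class PrefixCache:
    # Skip storing prefixes for short prompts (the case that tripped the CUDA
    # invalid-argument path) while still serving cache hits for long prompts.
    # Keys are token-id tuples; the cached state is opaque here.
    def __init__(self, min_store_tokens=64, cap_tokens=112):
        self.min_store_tokens = min_store_tokens
        self.cap_tokens = cap_tokens
        self.entries = {}  # prefix tuple -> cached state

    def store(self, tokens, state):
        if len(tokens) < self.min_store_tokens:
            return False  # short-prompt skip keeps the path correctness-safe
        self.entries[tuple(tokens[:self.cap_tokens])] = state
        return True

    def lookup(self, tokens):
        # Longest stored prefix of the incoming prompt wins.
        best = None
        for prefix in self.entries:
            if tuple(tokens[:len(prefix)]) == prefix:
                if best is None or len(prefix) > len(best):
                    best = prefix
        return (best, self.entries[best]) if best else (None, None)
```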
  • Remove Qwen3.5 launcher/config drift between the strict runner and the AWS same-VM stack.
    • Shared env source: scripts/qwen_runtime_env.py
    • Updated launchers:
      • scripts/qwen35_remote_isolated_ab.py
      • scripts/treni_local_tool_worker.py
      • scripts/hermes_same_vm_mvp.py
    • Clean strict AB3 artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T231500Z.json
    • Current effect:
      • runtime overall score now leads in the paired set: 0.335648 vs vLLM 0.291667
      • runtime overall latency is still far behind: 3690.124 ms vs 1646.672 ms
      • gpqa_diamond score is now parity, but latency remains the main loss
      • ifeval score improves clearly in runtime while remaining slower
  • [~] Recover strict benchmark quality without giving back the new latency profile.
    • Batched hybrid prefill is now implemented:
      • linear-attention sequence forward
      • full-attention sequence prefill + K/V cache materialization
      • hybrid layer-major prefill in main.c
    • Latest clean split (2026-03-08, tie-stable fast-sampler AB3):
      • overall latency delta -237.060 ms (runtime faster)
      • gpqa_diamond score delta +0.083333
      • ifeval score delta -0.145833
    • Next implementation target is sampler/output-fidelity recovery on the runtime side, not another major prefill pass.
  • Add explicit thinking-mode parity lane (vLLM --reasoning-parser qwen3 + runtime equivalent output contract) before using thinking benchmarks for claim-grade comparisons.
    • First strict thinking artifact:
      • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T223442Z.json
    • Budget-fixed follow-up:
      • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T224358Z.json
    • Finalized follow-up:
      • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T235628Z.json
    • Lower-cost finalized follow-up:
      • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260309T010353Z.json
    • Current result:
      • lane is runnable and measurable end-to-end,
      • runtime now leads on score overall (0.250000 vs 0.194444),
      • and with gpqa_max_tokens=256 it also wins overall latency (6823.816 ms vs 7503.000 ms).
    • Early extension:
      • gsm8k finalized thinking AB3 is directionally positive:
        • phase5_qwen35_remote_strict_matrix_20260310T022347Z.json
        • runtime 0.197917 vs vLLM 0.177083, runtime 7174.829 ms vs vLLM 7643.231 ms
  • [~] Tighten runtime thinking-mode output contract for Qwen3.5.
    • Current issue:
      • closed-form finalize now recovers parseable final answers, but long reasoning tasks are still too expensive and quality remains modest.
    • One-example probe artifacts:
      • benchmarks/phase5_awareness_realbench/results/qwen35_thinking_gpqa_oneoff_runtime_1024_20260308T230352Z.json
      • benchmarks/phase5_awareness_realbench/results/qwen35_thinking_gpqa_oneoff_vllm_1024_20260308T230352Z.json
    • Next focus:
      • the old 512 runtime cap and long-decode host-buffer corruption are fixed,
      • the gpqa_max_tokens=256 sweep already removed the worst latency collapse,
      • the remaining blocker is improving closed-form thinking quality beyond the current modest score while keeping this lower-cost lane,
      • and understanding why aime25 stays 0.0 on both backends even after raising the reasoning budget and adding AIME-specific prompt/finalize guidance.
  • Fix sampled Qwen3.5 reproducibility on the runtime path.
    • Root cause was harness-side:
      • scripts/phase5_awareness_realbench.py shared-first arm_a_control request skipped the request seed and task-specific decode payload.
    • Fixed state:
      • repeated sampled runtime-only reruns are identical:
        • phase5_repro_runtime_ifeval_s7_raw_seedfix_r1.json
        • phase5_repro_runtime_ifeval_s7_raw_seedfix_r2.json
      • post-fix sampled strict matrix is now promotable:
        • phase5_qwen35_remote_strict_matrix_20260308T220806Z.json
  • Investigate the intermittent same-VM first tool-turn CUDA retry in Qwen3.5 wrapper runs.
    • Observed in benchmarks/same_vm_mvp/logs/runtime_20260307T171918Z.log: compute/ops.cu:765 invalid argument during prefill gather; the request recovered on retry and the smoke still passed.
  • Turn the same-VM Hermes path into an explicit demoable MVP flow with local runtime + local CPU tools + ORPO loop entrypoint.
    • Current entrypoints: scripts/hermes_same_vm_mvp.py, scripts/run_samevm_qwen35_stack.sh, scripts/samevm_full_mvp_demo.py, scripts/run_samevm_full_mvp.sh
    • Current multimodal additions already wired in code: samevm_multimodal_status, samevm_embed, samevm_rerank, samevm_tts, samevm_stt
    • New proof/demo entrypoints: scripts/bootstrap_samevm_multimodal.sh, scripts/samevm_stack_probe.py, benchmarks/same_vm_mvp/README.md
    • New proof artifacts:
      • canonical full MVP proof: benchmarks/same_vm_mvp/results/samevm-full-mvp-aws-v15.json
      • runtime-admin Hermes proof: benchmarks/same_vm_mvp/results/samevm-q35-runtime-admin-proof-v5_20260307T212852Z.json
      • Hermes SQLite query proof: benchmarks/same_vm_mvp/results/hermes-demo-sqlite-query-v2.json
      • Hermes RAG search proof: benchmarks/same_vm_mvp/results/hermes-demo-rag-search-v1.json
      • local stack proof: benchmarks/same_vm_mvp/results/samevm-stack-probe-aws-v5.json
      • ORPO control-plane proof: benchmarks/same_vm_mvp/results/samevm-orpo-probe-aws_20260307T215307Z.json
      • Qwen3.5 ORPO reload proof: benchmarks/same_vm_mvp/results/samevm-orpo-reload-q35-fixed.json
    • Hot-reload/hand-off now exists on AWS:
      • local ORPO output is merged into a full HF model dir,
      • packed into a new monolith container,
      • restarted as a second local runtime and verified with a real chat response.
    • MVP contract now covered in the canonical run:
      • runtime health
      • Hermes runtime-status call
      • Hermes multimodal-status call
      • basic non-thinking runtime smoke with first-turn tool calling
      • extended thinking runtime smoke with exact-match checks
      • SQLite + RAG + embedding + reranking
      • TTS + STT
      • ORPO reload sidecar proof
  • Audit the current harness for stubbed-tool paths and lock the scope.
    • Direct Phase 5 runtime/vLLM lane is not stubbed.
    • Same-VM Hermes wrapper had a localized optional-import stub path in /Users/andrewcorrea/treni/scripts/hermes_same_vm_mvp.py; fixed.
    • Phase 3 still contains synthetic profiles by design; realistic_v1 only reduces that bias.
  • Re-prove dynamic Qwen-family runtime support on AWS.
    • qwen35 (0.8B) direct inference restored on the live host.
    • qwen35_4b direct inference proved on the same host with correct pool sizing.
    • qwen25 (Qwen2.5-0.5B-Instruct) packed fresh and direct inference proved.
  • Clean the AWS host down to the active Qwen same-VM target set after the compatibility repro.
    • removed stale monolith_qwen05* artifacts
    • removed temporary qwen25 host cache/artifacts after proving backward compatibility
    • kept the active qwen35 (0.8B), qwen35_4b (4B), and multimodal model caches
  • Deep-clean stale same-VM training/checkpoint/debug artifacts after the 4B promotion sweep.
    • removed the old q35-orpo-notemplate-1772992302 training tree
    • removed checkpoint-1 from samevm-orpo-reload-q35-fixed_20260308T182430Z
    • pruned old debug WAVs and surplus worker logs
    • current AWS root disk is back to about 4.0 GB free
  • Pack and prove qwen35_9b on a larger GPU host.
    • Current Lambda sweep is still blocked by provider-side insufficient-capacity errors plus Cloudflare 1015 rate limiting.
  • Build and run a real model-dependent same-VM agent comparison suite.
    • Canonical current selector artifacts:
      • benchmarks/same_vm_mvp/results/samevm-agent-compare-aws-r2-qwen35.json
      • benchmarks/same_vm_mvp/results/samevm-agent-compare-aws-r2-qwen35_4b.json
    • Current result on AWS A10G:
      • qwen35 (0.8B) = 10/10
      • qwen35_4b (4B) = 2/10
    • This selector lane is now historical for 4B; the repaired full suite below is the current source of truth.
  • Promote the ORPO self-improvement loop from the current Qwen2.5 demo model to the main Qwen3.5 target family.
    • Current passing proof: benchmarks/same_vm_mvp/results/samevm-orpo-reload-q35-fixed.json
    • Current passing canonical full MVP: benchmarks/same_vm_mvp/results/samevm-full-mvp-aws-v15.json
  • Recover the stricter extended/thinking same-VM runtime smoke lane as a separate quality target.
    • Current MVP gate now includes the thinking profile and passes in benchmarks/same_vm_mvp/results/samevm-full-mvp-aws-v15.json.
    • Extended non-thinking profile now also passes cleanly: benchmarks/qwen35_smoke/results/postmvp-extended_20260308T185130Z.json
  • Revalidate real Hermes tool availability after the wrapper import/stub audit.
    • file/code tools now load again in the same-VM wrapper (read_file, write_file, search_files, patch, execute_code)
    • direct live qwen35 tool-call smoke passes at the runtime level
    • live Hermes single-tool RAG search succeeds on raw-PDF-ingested local data
  • Fix qwen35_4b exact-output and tool-call contract parity in the same-VM Hermes path.
    • Root cause was a decoder bug in the cached linear-attention step path.
    • Repaired full-suite artifact:
      • benchmarks/same_vm_mvp/results/samevm-agent-full-aws-r4-qwen35_4b_20260310T184433Z.json
    • Repaired result:
      • 15/15
    • qwen35_4b now passes:
      • direct runtime smoke
      • direct PDF RAG
      • direct embed/rerank
      • direct TTS/STT
      • Hermes runtime-status/RAG/SQLite/memory/execute_code
  • Refresh the short model-selector lane for qwen35_4b after the decoder fix.
    • The old samevm-agent-compare-aws-r2-qwen35_4b.json artifact is stale.
    • Use the repaired full suite as the current truth until the short compare lane is rerun cleanly.
  • Run the first real same-VM multimodal proof pass after bootstrap.
    • Artifact: benchmarks/same_vm_mvp/results/samevm-stack-probe-aws-v5.json
    • Confirmed on AWS:
      • samevm_multimodal_status
      • Qwen TTS
      • Qwen ASR STT on the generated WAV
      • Qwen embedding + reranking
      • SQLite + RAG in the same local tool worker
  • Run a live operator-style validation pass on the current AWS deployment.
    • Direct generation speed:
      • mean end-to-end throughput: 112.37 tok/s
      • mean decode-only throughput: 121.90 tok/s
    • Hermes tool proofs now include:
      • benchmarks/same_vm_mvp/results/hermes-demo-sqlite-query-v2.json
      • benchmarks/same_vm_mvp/results/hermes-demo-rag-search-v1.json
      • benchmarks/same_vm_mvp/results/hermes-tts-v2.json
      • benchmarks/same_vm_mvp/results/hermes-stt-v2.json
    • Real-world document caveat is now explicit:
      • extracted PDF text ingests/searches correctly,
      • raw PDF parsing is not yet native in the worker
    • Qwen3.5-4B feasibility on the current AWS host:
      • GPU memory looks plausible on the A10G 24 GB box,
      • current root disk headroom (~12 GB) is the first practical blocker for download + pack
  • Move same-VM multimodal bootstrap into an isolated environment for AWS runs.
    • Active AWS runs are now executed from /home/ubuntu/.venvs/hermes-treni.
    • Local Mac cleanup remains separate from the current AWS experiment path.
  • Add a worker-side multimodal cache clear path so local tool models do not keep starving the main runtime GPU.
    • New endpoint: POST /v1/mm/clear_cache
    • New Hermes tool: samevm_multimodal_clear_cache
    • Status now reports loaded_model_count, loaded_models, and CUDA allocation/reservation.
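The clear-cache path can be summarized as: drop every loaded tool model, report what was freed. `release_fn` below is a placeholder for the real unload step (model deletion plus emptying the CUDA cache); the handler shape is a sketch, not the worker's actual route code.

```python
def clear_mm_cache(loaded_models, release_fn):
    # Worker-side logic behind POST /v1/mm/clear_cache: unload every loaded
    # multimodal tool model so idle TTS/STT/embedding models stop starving the
    # main runtime GPU, then report what was cleared.
    freed = sorted(loaded_models)
    for name in freed:
        release_fn(loaded_models.pop(name))
    return {"cleared": freed, "loaded_model_count": len(loaded_models)}
```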
  • Add a true vision-encoder parity lane for Qwen3.5.
    • Current state:
      • runtime probe only validates multimodal placeholder handling,
      • current vLLM launch is --language-model-only, so multimodal cases fail by configuration.
  • Wire Q/K head RMS-norm into decoder path (q_norm_weight/k_norm_weight) and rerun strict Qwen3.5 smoke check.
    • Artifacts (qnorm-check1, seed=7, 2 samples/task): phase5_qwen35_runtime_vs_vllm_matrix_20260302T225529Z.json.
    • Result: still negative (rt_score=0.0000, vllm_score=0.0625; runtime latency 1880.622 ms vs 187.453 ms), so missing linear-attn parity remains the dominant blocker.
  • Recover awareness uplift on Qwen3.5 with task-aware paper mode (gpqa retries on, summary ifeval retries off).
    • 32/task: phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf045-s32-ifevaloff_20260303T222841Z.json (overall B-A=+0.015624).
    • 3-seed (16/task, s7/s17/s27): ...ifevaloff-rpt-* mean +0.020833.
  • Calibrate paper-trigger metrics for runtime-native uncertainty before next Phase 5 claim run.
    • New evidence (2026-03-03):
      • paper-mode selection bug is fixed in harness (phase5_awareness_realbench.py),
      • default paper thresholds over-trigger on runtime (max_entropy true on 16/16 cases in qwen35-paperfix-runtime-sweep-p1_4_20260303T202135Z.json),
      • summary-mode calibration fix is now implemented (uncertainty_source=runtime_summary + guarded vote rule), reducing retries (16 -> 9) and removing the immediate negative delta (phase5_awareness_realbench_qwen35-papersummaryfix-runtime-sanity2_20260303T204120Z.json),
      • task-aware follow-up (disable summary retries on IFEval) now yields the first repeatable positive signal:
        • phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf045-s32-ifevaloff_20260303T222841Z.json (overall B-A=+0.015624)
        • 3-seed (s7/s17/s27, 16/task) mean +0.020833.
      • compact invalid-parse recovery prompt + invalid-parse confidence gate (--invalid-parse-retry-confidence-max) now reduce overhead while preserving mean quality on repeatability:
        • invmax=0.73 3-seed (16/task) artifacts: phase5_awareness_realbench_qwen35-papersummaryfix-runtime-compact-invmax073-s16_20260303T232029Z.json, ...-rpt-s17_20260303T232254Z.json, ...-rpt-s27_20260303T232516Z.json
        • quality unchanged vs prior baseline (overall B-A mean +0.020833), latency overhead reduced (+712.276 ms -> +404.603 ms).
        • 32/task confirmation: phase5_awareness_realbench_qwen35-papersummaryfix-runtime-compact-invmax073-s32_20260303T232755Z.json kept overall B-A=+0.015624 while lowering latency overhead (+618.068 ms -> +326.187 ms).
    • Next implementation target:
      • reduce absolute GPQA malformed-output rate on first pass (current retries are still dominated by invalid_parse failures from decode quality, not uncertainty only).
  • Add GPU-side BF16/F16 cold tensor conversion path (TRENI_TENSOR_CONVERT_GPU) and validate Qwen cold upload ablation on G5.
  • Stabilize preload upload cold variance (intermittent decoder_tensor_h2d spikes) with explicit page-residency strategy (TRENI_TENSOR_HOST_PREFETCH) and verify with runtime/all-backend reruns.
  • Run TTFT softmax-kernel pass on AWS G5 (lt0_sync0) and confirm effect.
  • Replace single-thread norm kernels (rmsnorm/layernorm) with row-parallel reductions and rerun cold/warm matrix.
  • Isolate seq2seq/Bart TTFT hotspot via step0 stage profiling (TRENI_STEP0_PROFILE) and implement seq_q=1 attention follow-up (tiny-kernel + direct K/V cache write).
  • Run 3x repeatability for the new default seq_q=1 attention path and publish mean/std.
  • Resolve strict Week-3 parity gate: rebuilt parity container (monolith_phase3_qbm.bin, qwen+bart+minilm) and strict external-HF parity now passes (checked=3, failed=0, missing=0).
  • Add strict attention backend selector + repeatable A/B harness (custom vs cudnn_sdpa proxy) with runtime env overrides and summary reporting.
  • Run AWS G5 attention backend A/B and deconfound call-order effects with reverse-order rerun (attn_backend_ab_rev_20260222T144736Z).
  • Cache attention runtime env config values once per process (remove per-call getenv overhead in request path).
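In Python terms, the once-per-process env read looks like the sketch below (the real fix is in the C runtime request path; the default values shown are assumptions, and only knobs named elsewhere in this checklist are referenced).

```python
import os

_ATTN_CFG = None  # populated on first use, then reused for every request

def attn_config():
    # Read the attention env overrides once per process instead of calling
    # getenv on every attention invocation in the hot request path.
    global _ATTN_CFG
    if _ATTN_CFG is None:
        _ATTN_CFG = {
            "seq1_use_cublas_qk":
                os.environ.get("TRENI_ATTN_SEQ1_USE_CUBLAS_QK", "0") == "1",
            "seq1_use_cublas_pv":
                os.environ.get("TRENI_ATTN_SEQ1_USE_CUBLAS_PV", "0") == "1",
            "allow_sdpa_proxy":
                os.environ.get("TRENI_ATTN_ALLOW_SDPA_PROXY", "0") == "1",
        }
    return _ATTN_CFG
```

Later mutations of the environment deliberately do not change the cached config, which is exactly the behavior the per-call getenv removal buys.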
  • Add seq_q=1 hybrid tuning knobs (TRENI_ATTN_SEQ1_USE_CUBLAS_QK/PV) and run warm/cold matrix on G5.
  • Fuse seq_q=1 softmax+PV custom path and retune seq1 QK block sizing; rerun warm/cold matrix (seq1_hybrid_fused_20260222T192656Z).
  • Make cudnn_sdpa proxy behavior explicit opt-in (TRENI_ATTN_ALLOW_SDPA_PROXY=1) and keep strict fused-only semantics by default.
  • Probe fused cuDNN SDPA availability on H100 across alignment/shape/layout sweeps and pip/system cuDNN sources (cudnn_sdpa_h100_probe_20260222T1935Z).
  • Add hard A/B validation guard: fail frontend runs when fused marker is missing or runtime was built with TRENI_WITH_CUDNN=0.
  • Add fused frontend stage profiler (TRENI_ATTN_CUDNN_FRONTEND_PROFILE) and capture miss-cost probe artifacts.
  • Publish strict fused frontend A/B rerun with fixed qwen model + warmed query set (attn_backend_ab_frontend_20260222T220111Z).
  • Publish repeatability proof matrix for custom vs fused frontend (attn_backend_frontend_matrix_20260222T221948Z, 3 repeats each for warm_fixed and mixed_churn).
  • Publish frontend claim-strength report (paired deltas + CI95) for the repeatability matrix (attn_backend_frontend_claim_report_20260222T222958Z).
  • Add fused miss-trace + startup-preload knobs (TRENI_ATTN_CUDNN_FRONTEND_TRACE_MISSES, TRENI_HTTP_PRELOAD_PROMPTS, frontend A/B preload flag).
  • Run strict frontend matrix A/B no_preload vs startup_preload_4prompts and publish compare report (attn_backend_frontend_missmit_compare_20260222T225215Z).
  • Fix runtime preload prompt splitter bug (TRENI_HTTP_PRELOAD_PROMPTS) and verify multi-run execution from logs (run=1/4 ... run=4/4).
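The splitter-bug class above is easy to reproduce: a trailing separator or stray whitespace yields empty prompt entries and silently shortens the preload run list. A hedged sketch of a robust splitter (the runtime's actual delimiter for TRENI_HTTP_PRELOAD_PROMPTS is not shown here; `||` is an assumption):

```python
def split_preload_prompts(raw: str, sep: str = "||") -> list[str]:
    """Split a TRENI_HTTP_PRELOAD_PROMPTS-style value into prompts.
    Empty segments and surrounding whitespace are dropped so a trailing
    separator cannot truncate or pad the preload run count."""
    if not raw:
        return []
    return [p.strip() for p in raw.split(sep) if p.strip()]

prompts = split_preload_prompts("warmup one||warmup two||warmup three||")
```

Verifying the fix from logs (run=1/4 ... run=4/4), as the item above does, is the right check: it confirms the parsed count matches the configured count.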
  • Run strict frontend matrix A/B no_preload vs startup_preload_benchmark_queries and publish compare report (attn_backend_frontend_missmit_compare_20260222T231335Z).
  • Add shape-level seq1 prebuild controls (TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MAX_KV, TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_HEAD_DIM) and expose them in frontend scripts.
  • Validate no-preload fused cold TTFT fix with startup shape prebuild (prebuild_startup_nopreload_probe_20260222T232932Z).
  • Run no-preload frontend matrix probe with shape prebuild and compare against no-preload baseline (attn_backend_frontend_matrix_20260222T233003Z, attn_backend_frontend_missmit_compare_20260222T233116Z).
  • Tune shape-prebuild range (seq_kv_max: 16 -> 10) and reduce startup penalty while preserving no-preload fused TTFT/full (prebuild_startup10_nopreload_probe_20260222T235944Z).
  • Probe cuDNN frontend heuristic modes (A/B/FALLBACK) for startup/build relief on current path.
  • Run tuned shape-prebuild (seq_kv_max=10) matrix probe and compare against prior seq_kv_max=16 matrix (attn_backend_frontend_matrix_20260223T000256Z, attn_backend_frontend_missmit_compare_20260223T000343Z).
  • Run lower-range shape-prebuild bound probe (seq_kv_max=8) and confirm request-path regression (prebuild_startup8_nopreload_probe_20260223T000600Z).
  • Add shape-gated fused policy controls (TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MIN_KV, TRENI_ATTN_CUDNN_FRONTEND_MIN_SEQ_KV) and expose them in frontend matrix flags.
  • Fix strict-mode frontend gate fallback path so low-shape custom fallback remains inference-valid (inference.used=true under strict fused runs).
  • Run 3x hybrid no-preload startup probe (prebuild_hybrid10_nopreload_probe_r{1,2,3}_20260223T002214Z) and lock startup/request repeatability.
  • Run 3x hybrid frontend matrix (attn_backend_frontend_matrix_20260223T001959Z) and compare vs prior tuned no-gate baseline (attn_backend_frontend_missmit_compare_20260223T002153Z).
  • Run broader-shape sanity probe for hybrid policy (hybrid_shape_sanity_20260223T002857Z) and capture limitation: fused misses reappear for seq_kv>10 long-prompt growth.
  • Add upper seq-kv gate control (TRENI_ATTN_CUDNN_FRONTEND_MAX_SEQ_KV) and expose --attn-fused-max-seq-kv in frontend runners.
  • Rerun broader-shape sanity with bounded gate (hybrid_shape_sanity_maxgate_20260223T003453Z) and confirm miss-cascade removal (miss_lines_head=[], inference.used=true all requests).
  • Rerun 3x hybrid matrix with max gate (attn_backend_frontend_matrix_20260223T003611Z) and compare vs prior hybrid policy (attn_backend_frontend_missmit_compare_20260223T003734Z).
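The min/max seq-kv gates above reduce to a simple routing predicate: fused cuDNN frontend only inside the prebuilt shape window, custom kernel everywhere else, so misses never cascade on long-prompt growth. A sketch with the G5 hybrid bounds (the default bounds here are illustrative, not the runtime's hardcoded values):

```python
def use_fused_seq1(seq_kv: int, min_kv: int = 2, max_kv: int = 10) -> bool:
    """Hybrid gate: route a seq_q=1 attention call to the fused frontend
    only when seq_kv falls inside the prebuilt window [min_kv, max_kv]
    (cf. TRENI_ATTN_CUDNN_FRONTEND_MIN_SEQ_KV / _MAX_SEQ_KV). Anything
    outside falls back to the custom kernel, avoiding build-miss cost."""
    return min_kv <= seq_kv <= max_kv
```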
  • Add per-request attention backend telemetry (attention counters/shares in runtime HTTP responses) and aggregate it in benchmark artifacts.
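The telemetry item above boils down to turning per-request backend call counters into shares that ride along in the HTTP response and benchmark artifacts. A sketch (field names are illustrative, not the runtime's exact response schema):

```python
from collections import Counter

def backend_shares(counters: Counter) -> dict:
    """Convert per-request attention backend call counters into totals and
    shares so artifacts can report e.g. fused coverage per request."""
    total = sum(counters.values())
    return {
        "attention_calls_total": total,
        "attention_shares": (
            {name: count / total for name, count in counters.items()}
            if total else {}
        ),
    }

telemetry = backend_shares(Counter({"cudnn_frontend_fused": 60, "custom_seq1": 40}))
```

Shares (rather than raw counts) make coverage comparable across requests with different decode lengths.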
  • Run coverage-instrumented fused matrix and profile sweeps (attn_backend_frontend_matrix_20260223T011158Z, fused_coverage_profiles_20260223T011504Z, fused_coverage_cold_profiles_20260223T011534Z).
  • Decide lane direction from evidence: park cuDNN/frontend optimization and prioritize custom kernels (2026-02-23).
  • (Parked) Replace current cudnn_sdpa proxy path with true fused cuDNN SDPA/flash-attention frontend path, then rerun A/B.
  • Reduce shape-prebuild startup penalty while preserving no-preload fused TTFT/request gains (~7.0s -> ~2.0s startup baseline on G5 hybrid policy).
  • (Parked) Implement dynamic shape-reuse/coverage for fused path.
  • Add/validate seq1 custom microfused path (TRENI_ATTN_SEQ1_USE_MICROFUSED) to reduce decoder_step0_layers launch overhead for small seq_kv.
    • G5 A/B result (2026-02-23): no net win (mean/TTFT regressions; isolated bart p99 improvement only), so kept as opt-in and defaulted off.
    • Artifact summary: benchmarks/phase2_runtime/seq1_microfused_ab/seq1_microfused_ab_summary_20260223T014848Z.json and .md.
  • Add stream-cache toggles (TRENI_LINEAR_STREAM_CACHE, TRENI_ATTN_STREAM_CACHE) and run G5 on/off A/B.
    • Result (2026-02-23): near-neutral; keep cache enabled by default and prioritize higher-impact kernel paths.
    • Artifact summary: benchmarks/phase2_runtime/results/stream_cache_ab_summary_20260223T015222Z.json and .md.
  • Prototype hash-backed registry/model-index lookups (TRENI_REGISTRY_LOOKUP_HASH, TRENI_MODEL_INDEX_NAME_HASH) and run G5 on/off A/B.
    • Result (2026-02-23): no meaningful cold/setup win on this profile; kept as opt-in and defaulted off.
    • Artifact summary: benchmarks/phase2_runtime/results/registry_hash_ab_summary_20260223T020353Z.json and .md.
  • Fix cold-start harness startup timing granularity (wait_for_health now polls every 50ms instead of 1s cadence).
  • Add benchmark flag --runtime-skip-startup-smoke (default true) and validate cold startup impact on G5.
    • Result (startup_smoke_ab_hf_20260223T030059Z): startup-to-healthy 488.027 -> 404.184 ms (-17.18%) and start-to-first-response 705.454 -> 622.167 ms (-11.81%) with smoke skipped.
    • Runtime default also moved to skip startup smoke unless explicitly disabled (TRENI_SKIP_STARTUP_SMOKE=0).
  • Run custom cold-path A/B probes (TRENI_TENSOR_ENV_CACHE, TRENI_TENSOR_H2D_CHUNK_MB, TRENI_TENSOR_HOST_REGISTER) on G5.
    • Result (2026-02-23): all near-neutral/noise-level on this profile; keep as optional knobs and prioritize upload/decoder kernel hotspots.
  • Add per-tensor upload hotspot profiler (TRENI_TENSOR_UPLOAD_TOPK) and run cold probe on G5.
    • Result (tensor_upload_topk_probe_20260223T190829Z): dominant upload hotspot is model.embed_tokens.weight (~79.3 ms, ~63.8% of decoder_tensor_upload in that probe).
  • Add/benchmark container-level readahead hint (TRENI_CONTAINER_WILLNEED) on G5.
    • Result (container_willneed_ab8_20260223T191145Z): modest but repeatable cold-total gain (start->first-response: -1.94%, startup: -3.02%), request-path TTFT/full near-flat.
    • Runtime default moved to enable this hint unless explicitly disabled (TRENI_CONTAINER_WILLNEED=0).
  • Validate TRENI_CONTAINER_WILLNEED + TRENI_TENSOR_HOST_REGISTER combo on G5.
    • Result (container_hostreg_ab8_20260223T191255Z): no clear gain beyond readahead-only profile.
  • Validate staged-H2D upload path (TRENI_TENSOR_H2D_STAGING) with chunk-size follow-up and decide lane status.
    • Result (h2d_staging_followup_summary_20260224T101324Z): both min64/chunk32 (8-run A/B) and min64/chunk128 (3-run probe) regress materially on this G5 profile.
    • Decision: keep staged-H2D as opt-in experimental path, default-off, and continue Track A cold optimization on the non-staging custom path.
  • Run non-staging H2D chunk-size matrix (TRENI_TENSOR_H2D_CHUNK_MB=0/64/128, 8 runs each) on G5.
    • Result (h2d_chunk_matrix_summary_20260224T101730Z): request-path and upload-stage deltas were near-neutral in this initial run set (later superseded by 2026-02-28 full-depth AB3 promotion to default 0).
  • Implement and benchmark host page-touch pre-fault path (TRENI_TENSOR_HOST_TOUCH) on G5.
    • Result (host_touch_ab_summary_20260224T102444Z): decoder_tensor_h2d decreased but decoder_tensor_prefetch/upload increased, yielding net request regression (full +7.73%, infer +8.22%).
    • Decision: keep host-touch path as opt-in/default-off, not part of canonical Track A settings.
  • Run synchronized upload diagnostic probe (TRENI_TENSOR_UPLOAD_SYNC=0/1, 3 runs each) to isolate conversion vs transfer cost.
    • Result (upload_sync_probe_summary_20260224T102618Z): conversion is measurable with sync (~6 ms) but H2D remains dominant (~118 ms), so optimization focus stays transfer-path first.
  • Run synchronized host-register probe (TRENI_TENSOR_HOST_REGISTER=0/1, TRENI_TENSOR_UPLOAD_SYNC=1) on G5.
    • Result (host_register_sync_probe_summary_20260224T102915Z): no transfer-stage gain and slight request regression, so this lane is deprioritized.
  • Implement and benchmark decoder logits u16 path (TRENI_DECODER_LOGITS_U16_PATH) on G5.
    • Result (logits_u16_ab_fix1_summary_20260224T105532Z): cold upload/setup improved slightly, but request path regressed materially (ttft/infer/full), so path remains opt-in/default-off.
  • Implement and benchmark tensor-cache hash lookup path (TRENI_TENSOR_CACHE_HASH) on G5.
    • Result (tensor_cache_hash_warm3_20260224T114126Z): near-neutral request path with slight warm p99 regression (+0.149 ms) in this profile, so path remains opt-in/default-off.
  • Implement and benchmark sampler direct-store path (TRENI_SAMPLE_DIRECT_STORE) on G5.
    • Result (sample_direct_store_ab_20260224T114633Z): enabled path regressed warm request metrics (mean +0.062 ms, p95 +0.076 ms, p99 +0.143 ms), so it remains opt-in/default-off.
  • Implement and benchmark decoder direct-out residual path (TRENI_DECODER_DIRECT_OUT_HIDDEN) on G5.
    • Result (direct_outhidden_ab_20260224T115051Z): enabled path regressed warm request and infer metrics (mean +0.540 ms, p95 +0.495 ms, p99 +0.444 ms, infer +0.150 ms), so it remains opt-in/default-off.
  • Implement and benchmark multi-head seq1 attention path (TRENI_ATTN_SEQ1_USE_MULTIHEAD) on G5.
    • Result (seq1_multihead_ab_20260224T125127Z, seq1_multihead_bart_ab_20260224T125404Z): clear request-path wins on qwen warm/mixed and bart warm (including TTFT/infer improvements).
    • Decision: promote to default-on (TRENI_ATTN_SEQ1_USE_MULTIHEAD=1, TRENI_ATTN_SEQ1_MULTIHEAD_MAX_KV=2048), keep off-switch for fallback.
  • Add decode-stage profiling beyond step0 (TRENI_DECODE_STAGE_PROFILE) and publish first profile artifact (external_cold_stepn_profile_20260225T001334Z).
  • Add external-cold runtime env passthrough (--runtime-env) for reproducible flag-driven A/B runs.
  • Run uncertainty capture A/B (TRENI_DEMO_CAPTURE_UNCERTAINTY=1/0) with matched profile and capture decode-stage deltas (external_cold_uncert_on/off_20260225T0017*).
  • Rerun runtime-vLLM cold comparison on same profile (external_cold_runtime_vllm_uncertoff_20260225T001929Z).
  • Add non-step0 split metrics (decoder_stepN_logits_proj vs decoder_stepN_sample) and run immediate qwen A/B probes (lt16, fast16 GEMMEx, direct-u16-input, lt_u16 workspace); all probes were near-neutral or regressed and were reverted.
  • Run full-depth (--layers 36, --pool-mb 16384) runtime-vLLM cold compare and validate hotspot shift (decoder_stepN_layers dominant).
  • Run full-depth preload follow-up (preload=1 and preload=64) to isolate cache-miss vs decode-compute contribution.
  • Rerun full-depth seq1 hybrid matrix (default vs qk vs pv vs both) and confirm default custom remains best.
  • Re-test full-depth direct-u16-input linear path; no gain, reverted.
  • Implement full-depth FFN u16 weight path (TRENI_DECODER_FFN_U16_PATH) and run runtime-vLLM A/B (ab2 artifacts, 2026-02-25).
  • Implement full-depth decoder attention u16 path (TRENI_DECODER_ATTN_U16_PATH) and run 3-seed runtime-vLLM matrix (ab3, 2026-02-25).
  • Re-test logits u16 on top of full-depth attention/ffn u16 (TRENI_DECODER_LOGITS_U16_PATH) and run 3-seed runtime-vLLM matrix (ab3, 2026-02-25).
  • Revert regressing fused gate+up FFN projection path after A/B regression and restore non-fused baseline.
  • Implement shared decode-input pre-cast reuse for full-depth u16 decode GEMMs (q/k/v and gate/up) and run 3-seed runtime-vLLM matrix.
  • Add u16 cublasLt cached path (with safe fallback) for decode u16 GEMMs and run 3-seed runtime-vLLM matrix.
  • Implement residual-fused u16 Lt decode path (o_proj + ffn_down no-bias accumulate) and run 3-seed runtime-vLLM matrix.
  • Add full-depth FFN sub-stage profiling (ffn_proj_cast, ffn_proj_gate, ffn_proj_up) and publish split profile artifact (external_cold_layers36_stepn_profile_ffnsub_20260226T094140Z.log).
  • Probe batched gate+up FFN projection follow-up and revert after regression (higher ffn_proj and slower full decode in A/B).
  • Implement attention qkv fused-alias path (TRENI_DECODER_ATTN_U16_QKV_FUSED) and run 3-seed runtime-only + runtime-vLLM A/B matrices.
  • Promote qkv fused-alias path as default-on in the full-depth u16 lane after parity pass and repeatability wins.
  • Probe TRENI_LINEAR_LT_WORKSPACE_MB in full-depth lane and reject after regression (full 1711.213 -> 1754.568 ms in trial A/B).
  • Implement FFN activation-to-u16 fused path (TRENI_DECODER_FFN_ACT_U16_FUSED) and run 3-seed runtime-only + runtime-vLLM A/B matrices.
  • Promote FFN activation-to-u16 fused path as default-on after strict parity pass and repeatability wins.
  • Probe FAST_16 compute modes on top of u16-Lt; keep as non-canonical lane and revert promotion (tiny request-full delta, noisy startup outlier in repeatability set).
  • Probe full-depth TRENI_DECODER_FFN_PROJ_U16_FUSED in 3-seed runtime-only + runtime-vLLM A/B and reject after consistent regression.
  • Add/probe TRENI_LINEAR_U16_FAST_COMPUTE in full-depth runtime-only 3-seed A/B; initial signal was near-neutral to slightly regressed (superseded by the later AB5 promotion rerun).
  • Probe full-depth linear Lt knobs (TRENI_LINEAR_LT_WORKSPACE_MB=64, TRENI_LINEAR_USE_LT=0) and reject both after material regressions.
  • Replace process-wide Lt disable-on-first-fail with shape-scoped Lt fail cache and run full-depth 3-seed runtime-only + runtime-vLLM validation (near-neutral; no canonical shift).
  • Refresh full-depth decode-stage profile (TRENI_DECODE_STAGE_PROFILE + TRENI_DECODER_STEP_PROFILE) and relock hotspot ordering (stepN_layers dominant, FFN ffn_proj still top layer sub-stage).
  • Implement full-depth FFN proj batched-two u16 GEMM path (TRENI_DECODER_FFN_PROJ_U16_BATCHED2) and run 3-seed runtime-only + runtime-vLLM A/B matrices.
  • Promote FFN proj batched-two path as default-on after strict parity pass and stage-profile corroboration.
  • Promote full-depth direct-out hidden path as default-on in this lane (TRENI_DECODER_DIRECT_OUT_HIDDEN) after 3-seed runtime-only A/B and strict parity pass.
  • Add completion-length capture to external-cold harness (completion_chars, completion_words, streamed usage fields) for runtime and vLLM.
  • Add fixed-token fairness controls to vLLM leg (ignore_eos, streamed usage capture) and rerun 3-seed runtime-vLLM comparison with matched completion_tokens=64.
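The fairness controls above pin both engines to the same completion length so request-full comparisons are token-matched. A sketch of the vLLM-leg request body (field support is as documented for vLLM's OpenAI-compatible server; treat the exact field set here as an assumption, and the model id as a placeholder):

```python
# Matched-token fairness payload for the vLLM leg of the A/B.
payload = {
    "model": "qwen",                            # placeholder model id
    "prompt": "benchmark prompt",
    "max_tokens": 64,                           # matched completion_tokens target
    "ignore_eos": True,                         # don't stop early on EOS
    "stream": True,
    "stream_options": {"include_usage": True},  # streamed usage in final chunk
    "temperature": 0.0,                         # deterministic decode
}
```

Capturing streamed usage on both legs lets the harness assert after the fact that completion_tokens actually matched, instead of trusting the request parameters.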
  • Implement fused qkv split+bias path (TRENI_DECODER_QKV_SPLIT_BIAS_FUSED) replacing copy+bias sequence, validate 3-seed runtime-only A/B, and promote default-on after strict parity pass.
  • Wire TRENI_DECODER_LOGITS_U16_FAST_COMPUTE into runtime logits projection path (*_f32_input_ex(..., use_fast_compute)) and run full-depth runtime-only 3-seed A/B.
    • Result (2026-02-27): no material win and slight request-full regression (full +0.767 ms), so the knob is not promoted.
  • Run fixed-token runtime-vLLM sanity rerun after logits-fast hook integration.
    • Result: matched completion_tokens=64 still shows runtime TTFT lead and vLLM request-full lead in this profile.
  • Run strict Week 3 parity after logits-fast hook integration.
    • Result: pass (checked=3, failed=0, strict).
  • Implement u16 tensor-cache path (copy_tensor_to_gpu_u16 lookup/store) and add explicit env gate TRENI_TENSOR_CACHE_U16 (default-on) for claim-safe A/B.
  • Route logits-u16 upload path through shared cached helper (copy_tensor_to_gpu_u16) instead of uncached manual copy.
  • Run full-depth runtime-only 3-seed A/B for TRENI_TENSOR_CACHE_U16=0/1.
    • Result: large request-path win (full -472.529 ms, infer -471.235 ms) with near-flat TTFT.
  • Run same-window runtime-vLLM A/B for TRENI_TENSOR_CACHE_U16=0/1.
    • Result: request-full ordering flipped from runtime slower (+338.124 ms) to runtime faster (-98.671 ms).
  • Re-run strict Week 3 parity on final u16-cache default-on build.
    • Result: pass (checked=3, failed=0, strict).
  • Add optional TRENI_LINEAR_BATCHED2_USE_LT lane for FFN batched2 GEMMs and run full-depth A/B (ab3 runtime-only + ab2 runtime-vLLM).
    • Result (2026-02-27T222830Z): regressed runtime request path (full +12.469 ms, infer +12.534 ms); not promoted.
  • Run higher-N full-depth repeatability on TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=1 + TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1 (ab8 runtime-only).
    • Result (2026-02-27T223241Z): near-noise uplift (full -0.198 ms, infer -0.101 ms); not promoted.
  • Extend FFN fused path bias-deferral logic (fold gate/up bias into fused SiLU*Up activation when TRENI_DECODER_FFN_PROJ_U16_FUSED=1) and rerun full-depth A/B (ab3 runtime-only + ab2 runtime-vLLM).
    • Result (2026-02-27T223458Z): runtime-only effect is negligible (full -0.383 ms), no canonical shift; not promoted.
  • Run fast-profile (--layers 2) higher-N A/B for logits fast-compute (TRENI_DECODER_LOGITS_U16_FAST_COMPUTE=0/1, runtime-only ab8).
    • Result (2026-02-28T005529Z): near-noise movement (full -0.299 ms), stage profile unchanged; not promoted.
  • Run mixed-load p99 repeatability on canonical lane (run_mode=mixed_load, http_runs=120, 3 runs).
    • Result (2026-02-28T005626Z): stable set (mean 122.247 ms, p95 198.518 ms, p99 199.608 ms), no canonical config change.
  • Re-run strict Week 3 parity after latest follow-up patches.
    • Result (2026-02-28T005805Z): pass (checked=3, failed=0, strict).
  • Fix phase2_runtime_benchmark.py timing parser decimal handling (TIMING_RX) so stage telemetry preserves sub-ms values.
    • Result (2026-02-28): decoder_step_profile_* fields now parse as true decimals (for example ffn_proj_mean ~0.366 ms/layer) instead of integer-truncated values.
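The parser-bug class fixed above is a timing regex whose value group only admits integers, truncating "0.366ms"-style telemetry to 0. A sketch of the corrected pattern class (this is illustrative, not the harness's exact TIMING_RX):

```python
import re

# The value group must admit an optional decimal part, otherwise sub-ms
# stage telemetry integer-truncates to 0.
TIMING_RX = re.compile(r"(\w+)=(\d+(?:\.\d+)?)ms")

def parse_timings(line: str) -> dict:
    """Extract stage-name -> milliseconds pairs, preserving decimals."""
    return {name: float(val) for name, val in TIMING_RX.findall(line)}

stages = parse_timings("ffn_proj_mean=0.366ms step_total_mean=0.705ms")
```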
  • Rerun full-depth profile probes after parser fix (cold_first_hit + warm_steady_state, qwen, layers=36).
    • Result (2026-02-28T011037Z): hotspot remains FFN-heavy (decoder_step_profile_ffn_proj_mean ~0.366 ms/layer, ffn_down_resid_mean ~0.190 ms/layer, step_total_mean ~0.705 ms/layer).
  • Run full-depth warm AB3 for TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=0/1 on the fixed profile.
    • Result (ffn_fast_compute_ab3_20260228T011146Z_summary): slight regression (request +0.317 ms, infer +0.305 ms), no stage win; not promoted.
  • Replace batched2 Lt fallback with one-call strided-batched Lt path and rerun AB3 (TRENI_LINEAR_BATCHED2_USE_LT=0/1).
    • Result (batched2lt_strided_ab3_20260228T011651Z_summary): near-noise in warm AB3 (request -0.190 ms, infer -0.194 ms, stage flat) and slight regression in runtime-only external-cold probe (full +0.579 ms); not promoted.
  • Implement FFN gate/up dual-bias fused add path (TRENI_DECODER_FFN_BIAS_PAIR_FUSED) and run full-depth warm/cold A/B.
    • Result (ffn_bias_pair_ab3_20260228T020257Z/summary.json, warm AB3): small warm gain (request -0.229 ms, p99 -0.390 ms, infer -0.090 ms) with near-flat TTFT (+0.009 ms).
    • Cold follow-up (ffn_bias_pair_cold_ab2_20260228T020723Z/summary.json, 3 seeds each after extension): slight cold regression (full +1.928 ms, infer +1.875 ms), so this remains non-canonical for now.
  • Add optional batched2 seq1 split-GEMM path (TRENI_LINEAR_BATCHED2_SPLIT_SEQ1) and run full-depth warm/cold AB3.
    • Warm AB3 (batched2_splitseq1_ab3_20260228T025841Z/summary.json): near-noise/slight regression (request +0.014 ms, p99 +0.124 ms, infer +0.105 ms).
    • Cold AB3 (batched2_splitseq1_cold_ab3_20260228T025841Z/summary.json): small gain (full -2.070 ms, infer -2.002 ms, ttft -0.021 ms).
    • Decision: keep opt-in and non-canonical (no warm-path win).
  • Add optional batched2 dup-input strided lane (TRENI_LINEAR_BATCHED2_DUP_INPUT) and run full-depth warm/cold AB3.
    • Warm AB3 (batched2_dupinput_ab3_20260228T031816Z/summary.json): slight mean regression (request +0.317 ms, infer +0.293 ms, ttft +0.009 ms) despite minor p99 drop (-0.208 ms).
    • Cold AB3 (batched2_dupinput_cold_ab3_20260228T031816Z/summary.json): regression (full +1.307 ms, infer +1.388 ms, ttft +0.010 ms).
    • Decision: keep opt-in and non-canonical.
  • Probe dup-input v2 implementation (replace two D2D memcpys with one duplicate kernel) as warm AB2 gate and revert if not better.
    • Gate AB2 (batched2_dupinput_v2warm_ab2_20260228T032741Z/summary_gate_ab2.json): regression (request +0.438 ms, infer +0.381 ms, ttft +0.015 ms, p99 +0.217 ms).
    • Decision: rejected and reverted before AB3/cold expansion.
  • Recheck prior FFN projection alternatives on current baseline with warm AB2 gates.
    • TRENI_DECODER_FFN_PROJ_U16_FUSED=0/1 (ffn_proj_u16_fused_gate_ab2_20260228T033524Z/summary_gate_ab2.json): near-flat/slight mean regression (request +0.149 ms, infer +0.173 ms), not expanded.
    • TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=0/1 (ffn_proj_batched2_f32input_gate_ab2_20260228T033758Z/summary_gate_ab2.json): regression (request +0.236 ms, infer +0.248 ms, p99 +0.512 ms), not expanded.
  • Probe optional linear u16 CUBLAS_COMPUTE_16F lane as warm AB2 gate and revert if non-winning.
    • Gate AB2 (linear_u16_compute16f_gate_ab2_20260228T034412Z/summary_gate_ab2.json): regression (request +0.210 ms, infer +0.240 ms, p99 +0.594 ms).
    • Decision: rejected and reverted; no AB3 expansion.
  • Rebaseline full-depth warm profile on explicit u16 lane (qwen, layers=36) before next FFN probes.
    • Result (warm_profile_qwen_layers36_refresh_20260228T040010Z + run logs): active hotspot remains FFN projection (ffn_proj ~0.196 ms of step_total ~0.402 ms) under batched2.
  • Implement optional FFN gate/up contiguous pair-pack path (TRENI_DECODER_FFN_PAIR_PACK_U16) and run warm AB3 gate.
    • AB3 artifact (ffn_pair_pack_gate_ab2_20260228T040616Z/summary_ab3.json): small warm uplift (request -0.423 ms), but both the off and on runs already showed the contiguous pair active, so the delta is non-causal and cannot justify promotion.
    • Decision: keep implementation as experimental default-off (TRENI_DECODER_FFN_PAIR_PACK_U16=0), non-canonical.
  • Rerun batched2 Lt on explicit u16 lane (TRENI_LINEAR_BATCHED2_USE_LT) with warm AB3 + cold AB3.
    • Warm AB3 (batched2_use_lt_u16lane_gate_ab2_20260228T041041Z/summary_ab3.json): small gain (request -0.313 ms, infer -0.468 ms, p99 -0.511 ms).
    • Cold AB3 (batched2_use_lt_u16lane_cold_ab2_20260228T041359Z/summary_ab3.json): regression (full +1.165 ms, infer +1.424 ms).
    • Decision: keep opt-in/non-canonical (warm-only win not enough).
  • Add adaptive delayed-on policy for batched2 Lt (TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS) and rerun full-depth warm/cold AB3.
    • 5000ms AB3 (batched2_lt_enable_after_ms5000_warm_ab3_20260228T104525Z, batched2_lt_enable_after_ms5000_cold_ab3_20260228T104712Z): warm gain, but small cold full regression (+0.422 ms) remained.
    • 10000ms AB3 (batched2_lt_enable_after_ms10000_warm_ab3_20260228T105028Z, batched2_lt_enable_after_ms10000_cold_ab3_20260228T105213Z): warm and cold both improved:
      • warm: request -0.363 ms, infer -0.326 ms, p99 -0.696 ms.
      • cold: startup -4.307 ms, full -0.635 ms, infer -0.347 ms, TTFT -0.070 ms.
    • Strict parity passed (week3_parity_report_batched2_lt_delay10000_20260228T105329Z.json, checked=3, failed=0).
    • Default-path strict parity smoke also passed (week3_parity_report_batched2_lt_defaultdelay_20260228T110825Z.json).
    • Same-window mixed-load A/B (mixed_load_defaultdelay_onoff_ab3_20260228T115010Z.json) regressed with delayed-on (mean +0.846 ms, p95 +1.627 ms, p99 +0.679 ms).
    • Decision: keep lane opt-in/non-canonical; parser defaults remain off (TRENI_LINEAR_BATCHED2_USE_LT=0, TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS=0).
    • Post-revert strict parity passed on defaults (week3_parity_report_postrevert_defaults_20260228T115543Z.json).
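The delayed-on policy probed above amounts to a lane that stays off during cold start and only activates after a process-uptime threshold. A sketch of the assumed semantics (the runtime's exact interaction between TRENI_LINEAR_BATCHED2_USE_LT and the _ENABLE_AFTER_MS knob may differ; here 0 means the lane never enables, matching the parser defaults):

```python
import time

class DelayedLane:
    """Enable an optional kernel lane only after the process has been up
    for enable_after_ms milliseconds (0 = lane stays off), mirroring the
    delayed-on policy probed for batched2 Lt."""
    def __init__(self, enable_after_ms: int):
        self.enable_after_ms = enable_after_ms
        self.start = time.monotonic()

    def active(self) -> bool:
        if self.enable_after_ms <= 0:
            return False
        return (time.monotonic() - self.start) * 1000.0 >= self.enable_after_ms

lane_off = DelayedLane(enable_after_ms=0)   # parser default: lane off
lane_on = DelayedLane(enable_after_ms=10)
time.sleep(0.05)                            # simulate uptime past the gate
```

The design intent is to buy the warm-path Lt win without paying the cold-start Lt setup cost; the mixed-load regression above is why it still stayed non-canonical.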
  • Rerun Track A parser-default foundation pack (warm/cold/mixed) after delayed-Lt probe and publish canonical decision artifact.
    • Pack summary: foundation_defaultdelay_pack_20260228T114315Z.json (.md companion).
    • Warm AB3 means: request 147.258 ms, p99 247.617 ms, infer 128.450 ms, TTFT 16.999 ms.
    • Cold AB3 means: startup 425.532 ms, full 598.787 ms, infer 580.173 ms, TTFT 12.210 ms.
    • Mixed repeatability vs prior canonical: mean +2.841 ms, p95 +5.587 ms, p99 +5.140 ms (mixed_load_repeatability_compare_defaultdelay_vs_prev_20260228T114748Z.json).
    • Decision unchanged: delayed batched2 Lt stays opt-in/non-canonical.
  • Add experimental FFN batched2 Lt prewarm path (TRENI_DECODER_FFN_BATCHED2_LT_PREWARM) and run fixed-Lt warm/cold A/B.
    • Warm AB2 (batched2_lt_prewarm_warm_ab2_20260228T042453Z/summary_gate_ab2.json): small gain (request -0.328 ms, infer -0.394 ms).
    • Cold AB3 (batched2_lt_prewarm_cold_ab3_20260228T042649Z/summary_ab3.json): first-hit gain (full -1.497 ms, infer -1.406 ms).
  • Run direct same-window combo A/B (lt=0,prewarm=0 vs lt=1,prewarm=1) to test promotability.
    • Combined summary (batched2_lt_prewarm_combo_summary_20260228T042733Z.json): mixed outcome.
    • Warm AB3 (batched2_lt_prewarm_combo_warm_ab2_20260228T042733Z/summary_ab3.json): regression (request +0.198 ms, infer +0.178 ms, p99 +0.407 ms).
    • Cold AB3 (batched2_lt_prewarm_combo_cold_ab3_20260228T042733Z): improvement (full -1.099 ms, infer -0.819 ms).
    • Decision: keep prewarm path experimental default-off; non-canonical.
  • Probe and promote TRENI_DECODER_FFN_DOWN_U16_FAST_COMPUTE on canonical full-depth lane.
    • Warm AB3 (ffn_down_fast_compute_gate_ab3_20260228T044546Z/summary_ab3.json): request -0.565 ms, infer -0.566 ms, p99 -1.405 ms, TTFT -0.030 ms.
    • Cold AB3 (ffn_down_fast_compute_cold_ab3_20260228T044753Z/summary_ab3.json): startup -8.405 ms, full -0.351 ms, infer -0.406 ms, TTFT -0.028 ms.
    • Strict parity (week3_parity_report_ffn_down_fast_20260228T044846Z.json): pass (checked=3, failed=0, strict).
    • Decision: promote default-on in runtime parser (TRENI_DECODER_FFN_DOWN_U16_FAST_COMPUTE=1).
  • Run post-promotion retest matrix on updated canonical lane (AB3/AB5) for remaining FFN toggles.
    • New structural stacked-GEMM lane TRENI_LINEAR_BATCHED2_STACKED_SEQ1 AB3: regressed warm (request +1.259 ms, infer +1.229 ms, p99 +2.830 ms) with near-flat cold full (+0.030 ms), remains experimental/default-off.
    • TRENI_LINEAR_BATCHED2_SPLIT_SEQ1 retest AB3: regressed warm (request +0.964 ms) and cold (full +1.496 ms), remains non-canonical.
    • TRENI_LINEAR_BATCHED2_USE_LT fixed-on retest AB3: warm gain (request -0.855 ms) but cold startup/full penalty (startup +10.474 ms, full +0.330 ms); delayed-on follow-up still regressed mixed-load and remains non-canonical.
    • lt=1 + prewarm=1 combo retest: AB3 looked positive, but AB5 confirm failed on cold (full +1.152 ms, startup +3.199 ms), remains non-canonical.
    • TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE retest AB3: near-noise warm gain with cold regression (full +0.577 ms), non-canonical.
    • TRENI_LINEAR_U16_FAST_COMPUTE initial AB3 signal was mixed and stayed pending.
  • Revalidate TRENI_LINEAR_U16_FAST_COMPUTE with higher-N repeats and promote if stable.
    • warm+mixed AB5 (linearfast_ab5_20260228T124736Z/summary_ab5.json): on-off stayed positive in both modes (warm request -0.139 ms, mixed request -0.139 ms).
    • cold AB3 (linearfast_cold_ab3_20260228T124510Z/summary_ab3.json): near-flat full (+0.302 ms), better startup (-4.207 ms) and TTFT (-0.019 ms).
    • strict parity pass (week3_parity_report_linearfast_20260228T124557Z.json, checked=3, failed=0).
    • post-default strict parity smoke also passed (week3_parity_report_post_linearfast_default_20260228T125804Z.json).
    • Decision: promote runtime parser default TRENI_LINEAR_U16_FAST_COMPUTE=1.
  • [~] Optimize custom-kernel best path with explicit profile split:
    • fast profile (--layers 2): decoder_stepN_logits_proj first.
    • full depth (--layers 36, --pool-mb 16384): preserve and extend post-cache lead (runtime full ~1190 ms in latest same-window A/B) with deeper layer-compute work (still FFN-heavy) and higher-N repeatability.
    • continue mixed-load p99 tracking and cold upload/convert/H2D optimization in parallel.
  • Rerun canonical foundation pack (warm/cold/mixed AB3) after TRENI_LINEAR_U16_FAST_COMPUTE promotion and compare vs prior parser-default pack.
    • pack root: foundation_linearfastdefault_pack_20260228T134157Z (summary_ab3.json).
    • warm/cold: near-flat/slightly slower vs prior parser-default foundation (warm request +0.101 ms, cold full +0.491 ms).
    • mixed: improved (request -0.629 ms, p95 -1.281 ms, p99 -0.163 ms).
  • Rerun same-window runtime-vLLM full-depth AB3 on updated canonical lane.
    • run set: aws_speedpass_runtime_vllm_linearfastdefault_ab3_20260228T134630Z (summary_ab3.json).
    • averaged first-request full: runtime 1185.186 ms, vLLM 1305.971 ms (vLLM/runtime=1.102x).
  • Run higher-N same-window runtime-vLLM full-depth rerun on updated defaults (AB5) and publish aggregate summary.
    • run set: aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z.
    • summary: aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z/summary_ab5.json and .md.
    • AB5 means:
      • runtime full 1184.812 ms, runtime TTFT 14.640 ms, runtime cold-total full 4190.848 ms.
      • vLLM full 1318.675 ms, vLLM TTFT 50.309 ms, vLLM cold-total full 24350.818 ms.
      • ratios (vLLM/runtime): full 1.113x, TTFT 3.436x, cold-total full 5.810x.
    • compare vs prior AB3: compare_vs_prev_linearfastdefault_ab3.json / .md confirms runtime full stayed slightly better (-0.375 ms) while keeping the same direction on full and cold-total ratios.
  • Test the batched2-Lt default-path fast-fallback short-circuit experiment and decide its status.
    • isolation AB3 (fastfallback_isolation_ab3_20260228T140122Z/summary_ab3.json) showed warm regression (request +1.155 ms) and mixed near-flat/slightly worse (mean +0.144 ms), despite cold full improvement (-0.846 ms).
    • Decision: reverted; keep prior canonical path.
    • post-revert strict parity passed (week3_parity_report_post_fastfallback_revert_20260228T140626Z.json).
  • Re-evaluate TRENI_TENSOR_H2D_CHUNK_MB on current canonical full-depth profile and promote if still positive.
    • cold AB3 (h2d_chunk_cold_ab3_20260228T142114Z/summary_ab3.json): chunk0 beat chunk64 on startup/full/infer and reduced decoder_tensor_h2d/decoder_tensor_upload.
    • warm+mixed AB3 (h2d_chunk_warm_mixed_ab3_20260228T142258Z/summary_ab3.json): warm improved and mixed was near-neutral/slightly better.
    • strict parity after promotion passed (week3_parity_report_h2dchunk0_default_20260228T142805Z.json).
    • Decision: promote parser default to TRENI_TENSOR_H2D_CHUNK_MB=0 (no chunking).
  • Run full-depth post-AB5 gate sweep on current defaults and re-check delayed-Lt/FFN-proj-fast promotability.
    • gate AB2 set (fulldepth_gate_newdefaults_20260228T150709Z/summary_gate_ab2.json):
      • delayed-Lt: warm/mixed request deltas -0.384 / -0.256 ms (promoted to AB3 confirmation).
      • TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1: mixed/noise signal at that gate stage (warm p99 +0.129 ms, mixed p99 +0.022 ms), not promoted in that cycle.
    • delayed-Lt AB3 confirmation (fulldepth_delayedlt_ab3_20260228T151322Z/summary_ab3.json):
      • warm on-off: request -0.330 ms, infer -0.270 ms, p99 -0.098 ms;
      • mixed on-off: request +0.173 ms, infer +0.191 ms, p99 +0.291 ms.
    • Decision: keep delayed-Lt non-canonical on parser defaults.
  • Run tuned delayed-Lt slow-gate rescue probe and verify mixed-load tail behavior.
    • AB2 artifact (delayedlt_tunedslow_ab2_20260228T152358Z/summary_gate_ab2.json).
    • tuned on config: TRENI_LINEAR_BATCHED2_LT_SLOW_RATIO_PCT=0, TRENI_LINEAR_BATCHED2_LT_SLOW_STREAK_DISABLE=4 (+ delayed-Lt envs).
    • deltas (on-off):
      • warm: request -0.185 ms, infer -0.054 ms, p99 -0.417 ms;
      • mixed: request -0.004 ms, infer -0.032 ms, p99 +0.221 ms.
    • Decision: still non-promotable (mixed near-flat with p99 regression); keep delayed-Lt non-canonical.
  • Patch FFN-proj batched2 mixed-input fallback loop and re-gate TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT.
    • code patch: monolith/models/linear.cu caches unsupported mixed-input batched2 GEMM combos and short-circuits repeated failed calls.
    • forced-Lt diagnostic before/after:
      • pre-patch: ffnproj_f32input_ltalways_20260228T153113Z.json
      • post-patch: ffnproj_f32input_ltalways_patch_20260228T154942Z.json
      • request mean improved 175.208 -> 173.124 ms; linear_batched2_lt_failures dropped 26112 -> 1.
    • canonical AB2 re-gate (ffnproj_f32input_gate_patch_ab2_20260228T155033Z/summary_gate_ab2.json):
      • warm on-off: request +0.026 ms, p99 +0.099 ms;
      • mixed on-off: request +0.057 ms, p99 +0.446 ms.
    • Decision: keep TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=0 canonical; patch retained for robustness.
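The linear.cu patch above memoizes GEMM input combinations that have already failed, so repeated calls short-circuit to the fallback instead of re-attempting (and re-failing) the kernel, which is what collapsed linear_batched2_lt_failures from 26112 to 1. The same pattern sketched in Python, with a hypothetical `try_gemm` stand-in for the CUDA call:

```python
_unsupported_combos = set()

def gemm_with_failure_cache(dtype_a, dtype_b, try_gemm, fallback):
    """Attempt the fast batched2 GEMM once per (dtype_a, dtype_b) combo;
    after the first failure, short-circuit straight to the fallback path."""
    key = (dtype_a, dtype_b)
    if key in _unsupported_combos:
        return fallback()              # known-bad combo: skip the attempt
    try:
        return try_gemm()
    except RuntimeError:
        _unsupported_combos.add(key)   # remember the unsupported combo
        return fallback()
```

The point of the cache is robustness at near-zero cost: the fallback result is unchanged, only the repeated failed launch attempts disappear.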
  • Re-run TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE on clean full-depth path and validate promotability.
    • profiled AB3 (ffnprojfast_fullstep_ab3_20260228T160255Z/summary_ab3.json): on-off request -0.370 ms, infer -0.348 ms, p99 -0.533 ms.
    • non-profiled warm AB3 (ffnprojfast_fullwarm_ab3_20260228T160358Z/summary_ab3.json): on-off request -0.249 ms, infer -0.225 ms, p99 -0.328 ms.
    • strict parity passed with explicit candidate env and temporary promoted build (week3_parity_report_ffnprojfast_candidate_20260228T160459Z.json, week3_parity_report_ffnprojfast_default_20260228T160639Z.json).
    • post-promotion sanity AB3 (ffnprojfast_default_sanity_ab3_20260228T160557Z/summary_ab3.json) stayed near-flat and directionally positive on means (default vs force_off: request -0.094 ms).
    • Interim result: qwen-focused path looked positive; required full foundation gate before canonical decision.
  • Run full foundation same-window gate (default vs force_off) for TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE and finalize canonical default.
    • foundation pack (foundation_ffnprojfastdefault_pack_20260228T194204Z/summary_ab3.json) was slower vs prior canonical across warm/cold/mixed.
    • same-window gate AB2 (foundation_ffnprojfast_gate_ab2_20260228T195240Z/summary_gate_ab2.json) showed warm/cold mean regressions with default (warm request +0.489 ms, cold full +0.746 ms), mixed near-flat with better tails.
    • Decision: keep parser canonical default TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=0 (opt-in only).

Track B: Internal vs External Routing

  • Minimal external baseline harness.
  • Matched task set and budgets.
  • Internal vs external run and report (G5).
  • Add explicit failure-amplification tests (timeouts/retries under load).
  • Publish first multi-profile stress matrix on G5 (baseline + 5 stress profiles).
  • Add cross-host benchmark harness + standalone external router server.
  • Expand cross-host pilot into full split-host matrix (6 profiles).
  • Optional internet multi-hop expansion (Fly.io controller/tool hops) with commercial endpoints (OpenAI/OpenRouter).
  • Publish grouped commercial root-cause analysis across fairness artifacts (commercial_gap_root_cause_20260222T222958Z).

Track B2: External Cold-Start Proof (Runtime vs PyTorch/vLLM/Ollama)

  • Implement unified cold-start harness with matched prompt/output budget.
  • Run G5 canonical set for all four backends.
  • Publish report with startup/TTFT/full-latency plus caveat tags (BF16 vs quantized).
  • Add canonical artifact links and leaderboard row for external-cold comparison.
  • Run 3x all-backend repeatability after GPU-convert fix and publish summary.
  • Rerun external-cold on G5 after default-on seq1 multi-head path and publish 3-run repeatability summary (external_cold_seq1mh_default_repeatability_20260224T192020Z).
  • Apply first decoder_step0_layers kernel optimization pass (seq1 multi-head softmax/PV exp-reuse) and rerun 3-run external-cold repeatability (external_cold_step0expfix_repeatability_20260224T194226Z).
  • Validate second decoder_step0_layers follow-up (seq1 multi-head shared-prob cache), compare against exp-reuse patch, and revert because it underperformed (external_cold_step0shared_repeatability_20260224T194913Z).
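The unified cold-start harness above times startup, TTFT, and full latency for each backend against a matched output budget. A minimal per-backend timing wrapper, assuming a streaming-client interface (the `start_backend`/`stream_tokens` callables are illustrative, not the actual harness API):

```python
import time

def measure_cold_request(start_backend, stream_tokens, max_tokens=64):
    """Time one cold-first-hit request: backend startup, time-to-first-token,
    and full completion, all measured from the same t0."""
    t0 = time.perf_counter()
    start_backend()                      # cold process/model startup
    startup_s = time.perf_counter() - t0
    ttft_s = None
    count = 0
    for _tok in stream_tokens(max_tokens):
        if ttft_s is None:               # first emitted token marks TTFT
            ttft_s = time.perf_counter() - t0
        count += 1
    full_s = time.perf_counter() - t0
    return {"startup_s": startup_s, "ttft_s": ttft_s,
            "full_s": full_s, "tokens": count}
```

Anchoring all three measurements to one t0 is what keeps startup/TTFT/full comparable across backends with very different process models.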

Track C: Agentic Loop Capability

  • Freeze 3 loop scenarios and success criteria.
  • Implement evaluators (success rate + steps-to-convergence).
  • Run internal vs external loop benchmark (canonical G5 set complete: baseline + stress, 3 seeds each).
  • Publish trace-backed capability report.
  • Add file-backed realistic_v1 profile to reduce synthetic stub bias in loop scenarios.
  • Run realistic-v1 multi-seed loop pack (baseline + stress) and publish summary (phase3_realistic_v1_summary_20260222T143919Z).
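The Track C evaluators reduce each loop pack to success rate and steps-to-convergence. A compact sketch of that reduction, assuming each run is summarized as a small dict (the trace format here is an assumption, not the harness schema):

```python
def evaluate_loop_runs(runs):
    """runs: [{"success": bool, "steps": int}, ...]
    Returns success rate over all runs and mean steps-to-convergence
    over successful runs only (failures never converged)."""
    if not runs:
        return {"success_rate": 0.0, "mean_steps_to_convergence": None}
    successes = [r for r in runs if r["success"]]
    rate = len(successes) / len(runs)
    mean_steps = (sum(r["steps"] for r in successes) / len(successes)
                  if successes else None)
    return {"success_rate": rate, "mean_steps_to_convergence": mean_steps}
```

Restricting mean steps to successful runs avoids rewarding fast failures, which matters when comparing internal vs external loops under stress profiles.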

Track C2: Uncertainty-Awareness Ablation

  • Add uncertainty metric source modes (normalized_logprob, raw_logit_margin, hybrid, runtime_native) to Phase 3 harness.
  • Add independent uncertainty toggles on internal and external paths.
  • Add matrix runner for uncertainty on/off ablation arms.
  • Run first baseline ablation set (runs=8, all three uncertainty sources).
  • Wire runtime uncertainty export/ingestion path (runtime HTTP uncertainty -> C2 runtime_native source).
  • Run 3-seed repeatability baseline ablation set.
  • Run stress-profile uncertainty ablation set.
  • Publish consolidated baseline-vs-stress ablation summary.
  • Run runtime-native quality-gated rerun with unified awareness payload (awareness3, zero fallback/errors confirmed).
  • Retune runtime-native uncertainty policy (thresholds/mapping/decision gating) and recover positive deltas (calib1).
  • Rerun runtime-native C2 baseline+stress after policy retune and relock canonical interpretation.
  • Unify runtime response awareness payload (awareness.route + awareness.generation) and keep uncertainty backward compatible for existing clients.
  • Run realistic-v1 uncertainty ablation baseline+stress pair and publish comparison (phase3_uncertainty_compare_realistic_v1_s7_20260222T144116Z).
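Of the C2 uncertainty sources, normalized_logprob is the simplest: mean per-token log-probability mapped to an uncertainty score in [0, 1]. A hedged sketch of one plausible mapping (the exact normalization used in the harness is an assumption):

```python
import math

def normalized_logprob_uncertainty(token_logprobs):
    """Average per-token log-probability -> uncertainty in [0, 1].
    exp(mean logprob) is the geometric-mean token probability, so
    1 - exp(mean) is high when the model was unsure on average."""
    if not token_logprobs:
        return 1.0                      # no evidence -> max uncertainty
    mean_lp = sum(token_logprobs) / len(token_logprobs)
    return 1.0 - math.exp(mean_lp)
```

Averaging in log space before exponentiating keeps the score length-invariant, which is why it can be compared across responses of different token counts.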

Track C3: Real-Benchmark Awareness A/B/C

  • Implement Phase 5 real benchmark harness (scripts/phase5_awareness_realbench.py) with three arms:
    • arm_a_control (single-pass)
    • arm_b_awareness_retry (uncertainty-gated second pass)
    • arm_c_awareness_consistency (uncertainty-gated consistency voting / IFEval self-check)
  • Add run wrapper (scripts/run_phase5_awareness_realbench.sh) and benchmark README (benchmarks/phase5_awareness_realbench/README.md).
  • Run first canonical real-data set on current GPU host (gpqa_diamond, ifeval, gsm8k, aime25) and publish artifact pack (r5: phase5_awareness_realbench_qwen-realbench-r5-tokenizerfix2_20260301T114510Z.json).
  • Publish first diagnostic A/B/C deltas and failure-mode traces (r5 + template A/B r6) with fixed token budgets.
  • Add HF-reference parity evaluation on the same sampled prompts and lock claim-safe interpretation (phase5_hf_reference_qwen_r5_20260301T1900Z.json).
  • Improve Phase 5 math-task quality floor (prompt/eval contract and model/task fit), then rerun runtime-vs-HF parity.
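The arm_b_awareness_retry arm above spends a second pass only when the first answer's uncertainty crosses a gate. A minimal sketch of that control flow (the `generate`/`uncertainty` callables and the 0.5 threshold are illustrative stand-ins, not the Phase 5 script's interface):

```python
def awareness_retry(prompt, generate, uncertainty, threshold=0.5):
    """Arm B: single pass, plus at most one uncertainty-gated retry.
    Returns (answer, passes_used), keeping the less-uncertain attempt."""
    first = generate(prompt)
    u_first = uncertainty(first)
    if u_first < threshold:
        return first, 1                  # confident enough: stop at one pass
    second = generate(prompt)
    u_second = uncertainty(second)
    best = first if u_first <= u_second else second
    return best, 2
```

Because the retry only fires above the threshold, the extra token budget is spent exactly where the control arm is most likely to be wrong, which is what the A-vs-B deltas are measuring.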

Expansion

  • Launch Lambda A100/H100 hosts and complete Phase 3 canonical loop reruns (baseline + stress, 3 seeds each).
  • Stage monolith_phase3.bin on Lambda A100/H100 and run Phase 4 hardware-pack script (phase2 + c2) end-to-end.
  • Full A100 run set.
  • Full H100 run set.
  • Paper-grade figure/table package.

Immediate Next Actions

  1. Recover score on the latest Qwen3.5 fast-sampler lane without giving back the strict latency win.
  2. Continue isolating the remaining ifeval fidelity gap on the one-host strict Qwen3.5 set:
    • exact instruction-following misses
    • small decode/logit drift vs vLLM on some prompts
    • stack-level repair loop policy that improves quality without erasing the latency lead
  3. Make the same-VM Hermes MVP explicit around scripts/hermes_same_vm_mvp.py and scripts/run_samevm_qwen35_stack.sh, then extend the demo flow for SQLite/RAG/ORPO.
  4. Improve custom mixed-load p99 (decode-shape specialization + cache/write-path tuning) and publish repeatability.
  5. Reduce custom cold-first-hit upload/convert/H2D cost and rerun external-cold token-parity pack.
