TODO
Live execution checklist and next actions.
Priority Order
Current Checklist
Track A: Cold/Hot Foundations
- True TTFT instrumentation in runtime request path.
- 3x cold-first-hit repeatability set (G5).
- 3x warm steady-state repeatability set (G5).
- Cold bottleneck fix: per-model tensor lookup index cache.
- Cold rerun after fix with artifact pack.
- Add stage-level cold decomposition metrics (tokenizer load, index build, tensor upload, first decode step).
- Optimize `model_tensor_index_build` via fast tensor collect path and rerun 3x cold validation.
- Rerun 3x cold validation after reverting regressed upload path (`clean7`) and confirm clean4 parity.
- Add sub-stage upload instrumentation (`decoder_tensor_convert`, `decoder_tensor_h2d`, `decoder_tensor_copy_total`).
- Add startup preload + tokenizer cache path to cut first-request upload/tokenizer overhead.
- Wire request `max_tokens` through runtime HTTP path for token-parity comparisons.
- Disable decoder per-step trace by default (`TRENI_DEMO_TRACE` opt-in).
- Reduce remaining Qwen request-path TTFT/full gap vs vLLM (decoder per-token path).
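
For reference, a minimal harness-side sketch of what "true TTFT" means for one streamed request: the clock starts before the request is issued and TTFT is taken at the first token that arrives. This is not the runtime's own instrumentation; `measure_ttft` and its injectable `clock` parameter are hypothetical names used only for illustration.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_ttft(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, int]:
    """Return (ttft_s, full_latency_s, token_count) for one streamed request.

    `stream` is any iterable that yields tokens as they arrive (assumption:
    the request is issued lazily, on first iteration, or just before this call).
    """
    start = clock()
    ttft: Optional[float] = None
    count = 0
    for _token in stream:
        if ttft is None:
            # First token observed: this is the time-to-first-token.
            ttft = clock() - start
        count += 1
    full = clock() - start
    # ttft stays None when the stream produced no tokens at all.
    return ttft, full, count
```

Passing a fake `clock` makes the helper deterministic in tests; in a real run the default `time.perf_counter` is used.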
Track B: Internal vs External Routing
- Minimal external baseline harness.
- Matched task set and budgets.
- Internal vs external run and report (G5).
- Add explicit failure-amplification tests (timeouts/retries under load).
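
A rough sketch of how failure amplification could be counted in such a test, assuming amplification is defined as total attempts divided by logical requests. `RetryStats` and `run_with_retries` are hypothetical helpers, and the `call` argument stands in for a real request by returning a simulated latency in seconds.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RetryStats:
    requests: int = 0   # logical requests issued by the caller
    attempts: int = 0   # actual attempts, including retries
    failures: int = 0   # requests that exhausted all retries

    @property
    def amplification(self) -> float:
        """Attempts per logical request (1.0 means no retries happened)."""
        return self.attempts / self.requests if self.requests else 0.0


def run_with_retries(
    call: Callable[[], float],
    timeout_s: float,
    max_retries: int,
    stats: RetryStats,
) -> bool:
    """Run one logical request; an attempt times out if its latency exceeds timeout_s."""
    stats.requests += 1
    for _ in range(max_retries + 1):
        stats.attempts += 1
        if call() <= timeout_s:
            return True
    stats.failures += 1
    return False
```

Running the same load through the internal and external paths with identical `timeout_s` and `max_retries` would give directly comparable amplification figures.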
Track B2: External Cold-Start Proof (Runtime vs PyTorch/vLLM/Ollama)
- Implement unified cold-start harness with matched prompt/output budget.
- Run G5 canonical set for all four backends.
- Publish report with startup/TTFT/full-latency plus caveat tags (BF16 vs quantized).
- Add canonical artifact links and leaderboard row for external-cold comparison.
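
One possible shape for the per-backend report rows, sketched under the assumption that each cold run records startup, TTFT, and full latency plus a precision caveat tag. `ColdRun` and `summarize_cold_runs` are illustrative names, not part of the existing harness.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass
class ColdRun:
    backend: str      # e.g. "runtime", "pytorch", "vllm", "ollama"
    precision: str    # caveat tag, e.g. "bf16" or "quantized"
    startup_s: float
    ttft_s: float
    full_s: float


def summarize_cold_runs(runs: List[ColdRun]) -> Dict[str, dict]:
    """Group runs by backend and report medians plus the caveat tags seen."""
    by_backend: Dict[str, List[ColdRun]] = {}
    for run in runs:
        by_backend.setdefault(run.backend, []).append(run)

    report: Dict[str, dict] = {}
    for backend, group in by_backend.items():
        report[backend] = {
            "n": len(group),
            "caveats": sorted({r.precision for r in group}),
            "startup_s": median([r.startup_s for r in group]),
            "ttft_s": median([r.ttft_s for r in group]),
            "full_s": median([r.full_s for r in group]),
        }
    return report
```

Medians are used here because a 3x set is small and a single slow run would skew a mean; keeping the caveat tags in the row makes BF16 vs quantized comparisons visible in the leaderboard.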
Track C: Agentic Loop Capability
- Freeze 3 loop scenarios and success criteria.
- Implement evaluators (success rate + steps-to-convergence).
- Run internal vs external loop benchmark.
- Publish trace-backed capability report.
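
A minimal sketch of the two evaluator metrics, assuming each loop run records whether it met its success criteria and how many steps it took. `LoopRun` and `summarize_loop_runs` are hypothetical names; steps-to-convergence is averaged over successful runs only, which is one reasonable convention rather than a fixed definition.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class LoopRun:
    scenario: str
    succeeded: bool
    steps: int  # steps taken until success, or until the run gave up


def summarize_loop_runs(runs: List[LoopRun]) -> Dict[str, Optional[float]]:
    """Return success rate and mean steps-to-convergence over successful runs."""
    total = len(runs)
    converged = [r.steps for r in runs if r.succeeded]
    return {
        "success_rate": len(converged) / total if total else 0.0,
        # None when nothing converged, so a failed set is not reported as "0 steps".
        "mean_steps_to_convergence": mean(converged) if converged else None,
    }
```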
Expansion
- Full A100 run set.
- Full H100 run set.
- Paper-grade figure/table package.
Immediate Next Actions
- Run 3x repeatability for token-parity external cold benchmark (`max_tokens=48`) on G5.
- Optimize Qwen decoder per-token step path to close request-path gap vs vLLM.
- Add timeout/retry failure-amplification tests to internal-vs-external routing.