Treni Experiment Docs

A unified GPU agent runtime that can feel, not just be told.

What Is Treni

Treni is a single C/CUDA binary that packs multiple ML models, routing, tokenization, and tool execution into one GPU process. Instead of an SDK agent calling remote tools and receiving serialized responses, the agent lives inside the runtime — it can see logprobs, feel tokenization quality, inspect model state, and adapt its next step inline.

The core hypothesis: an agent that can observe its own execution context makes better decisions than one that gets sent results.

This is not just about speed. It's about aware generation.
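To make "aware generation" concrete: an agent inside the runtime can read the next-token log-probabilities directly and branch on its own uncertainty. A minimal illustrative sketch, assuming a Shannon-entropy signal and a hypothetical `gather`/`commit` decision rule (the threshold and function names are not from the runtime, they are placeholders):

```python
import math

def token_entropy(logprobs):
    """Shannon entropy (in nats) of a next-token distribution,
    given per-token log-probabilities."""
    return -sum(math.exp(lp) * lp for lp in logprobs)

def step_decision(logprobs, threshold=1.0):
    """Toy routing rule: low entropy -> commit the token,
    high entropy -> gather more context (tool call, re-route, etc.)."""
    return "commit" if token_entropy(logprobs) < threshold else "gather"
```

A uniform distribution over four candidate tokens has entropy ln 4 ≈ 1.386 and would trigger `gather`; a sharply peaked distribution (e.g. 0.97 on one token) has entropy ≈ 0.17 and would `commit`. The point is that the signal is read inline, not serialized back over an SDK boundary.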

Fast Facts (Canonical)

  • G5 warm path: 80.602 ms mean, 90.350 ms p99.
  • External-cold (G5) runtime vs vLLM:
    • TTFT: 5.130 ms vs 84.837 ms (16.537x).
    • Request full: 316.403 ms vs 1232.660 ms (3.896x).
    • Cold total first response: 1320.240 ms vs 28937.430 ms (21.918x).
  • Frontend A/B repeatability (repeats=3): current custom path wins all tracked metrics in both warm_fixed and mixed_churn profiles.
  • Runtime-native uncertainty (calibrated) is positive in both baseline and stress:
    • Baseline: internal +0.1539, external +0.1058.
    • Stress: internal +0.1539, external +0.1154.
  • Commercial control snapshot (OpenAI gpt-5.2, model-only): current external-internal delta is near parity/noise at present sample size.
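The speedup multipliers above are plain latency ratios. A one-line sanity check against the external-cold G5 numbers quoted in this list:

```python
def speedup(baseline_ms, runtime_ms):
    """Ratio of baseline latency to runtime latency; >1 means the runtime is faster."""
    return baseline_ms / runtime_ms

# External-cold G5: runtime vs vLLM, numbers from the list above.
assert round(speedup(84.837, 5.130), 3) == 16.537       # TTFT
assert round(speedup(1232.660, 316.403), 3) == 3.896    # request full
assert round(speedup(28937.430, 1320.240), 3) == 21.918 # cold total first response
```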

What We've Proven So Far

| Claim | Status | Key Number |
| --- | --- | --- |
| Faster than Python baseline | Proven | 29x on warm path |
| Sub-100ms steady-state | Proven | 80.602 ms mean, 90.350 ms p99 |
| Internal routing beats external | Proven | G5 overall ext/int 1.208x (higher is worse for external) |
| Cold start manageable | Proven (after staged fixes) | 25-620x speedup vs early true-TTFT set |
| Identical numerical outputs | Proven | 0 parity failures, strict mode |
| Aware generation improves loops | Proven (harness) | runtime-native C2 deltas positive after calibration |
| Phase 4 cross-hardware reruns | Proven | A100/H100 full sets complete + paper package generated |
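The routing row reads "ext/int 1.208x (higher is worse for external)": it is the external-routing latency divided by the internal-routing latency, so values above 1.0 mean external routing is slower. A minimal sketch with hypothetical latencies (only the 1.208x ratio comes from the measurements; the millisecond figures below are made up for illustration):

```python
def ext_over_int(external_ms, internal_ms):
    """External/internal latency ratio; >1.0 means external routing is slower."""
    return external_ms / internal_ms

# Hypothetical latencies chosen only to reproduce the reported 1.208x ratio.
assert round(ext_over_int(120.8, 100.0), 3) == 1.208
```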

What Is Still Open

  • Reduce startup overhead for shape-level fused miss mitigation.
  • Run higher-N, region-pinned commercial reruns for tighter confidence intervals.

Reading Order

Start here, then follow the links in order:

  1. Objectives and thesis — the core claim and why a GPU agent that can feel beats one that gets told
  2. Paper — Entropy-Guided Loop — the research foundation for uncertainty-aware generation
  3. Experiment logbook — the clean map of canonical lanes vs scratch work
  4. Benchmark status — detailed completion status by lane
  5. Leaderboard — the main benchmark numbers
  6. Track B claim-safe table — scoped commercial claims (model_only vs tool_only)
  7. Routing comparison — internal vs external routing breakdown
  8. Canonical G5 artifact set — the official reference run set
  9. Paper package — paper-ready cross-hardware tables
  10. Findings changelog — the long chronological record
  11. Raw artifacts — every JSON and report file
  12. Scratch experiments — exploratory and non-canonical work
  13. TODO and next actions — what's coming next
