# Treni Experiment Docs
A unified GPU agent runtime that can feel, not just be told.
## What Is Treni
Treni is a single C/CUDA binary that packs multiple ML models, routing, tokenization, and tool execution into one GPU process. Instead of an SDK agent calling remote tools and receiving serialized responses, the agent lives inside the runtime — it can see logprobs, feel tokenization quality, inspect model state, and adapt its next step inline.
The core hypothesis: an agent that can observe its own execution context makes better decisions than one that gets sent results.
This is not just about speed. It's about aware generation.
## Fast Facts (Canonical)
- G5 warm path: `80.602 ms` mean, `90.350 ms` p99.
- External-cold (G5) runtime vs vLLM:
  - TTFT: `5.130 ms` vs `84.837 ms` (`16.537x`).
  - Request full: `316.403 ms` vs `1232.660 ms` (`3.896x`).
  - Cold total first response: `1320.240 ms` vs `28937.430 ms` (`21.918x`).
- Frontend A/B repeatability (`repeats=3`): current custom path wins all tracked metrics in both `warm_fixed` and `mixed_churn` profiles.
- Runtime-native uncertainty (calibrated) is positive in both baseline and stress:
  - Baseline: internal `+0.1539`, external `+0.1058`.
  - Stress: internal `+0.1539`, external `+0.1154`.
- Commercial control snapshot (OpenAI `gpt-5.2`, model-only): current external-internal delta is near parity/noise at present sample size.
## What We've Proven So Far
| Claim | Status | Key Number |
|---|---|---|
| Faster than Python baseline | Proven | 29x on warm path |
| Sub-100ms steady-state | Proven | 80.602 ms mean, 90.350 ms p99 |
| Internal routing beats external | Proven | G5 overall ext/int 1.208x (higher is worse for external) |
| Cold start manageable | Proven (after staged fixes) | 25-620x speedup vs early true-TTFT set |
| Identical numerical outputs | Proven | 0 parity failures, strict mode |
| Aware generation improves loops | Proven (harness) | runtime-native C2 deltas positive after calibration |
| Phase 4 cross-hardware reruns | Proven | A100/H100 full sets complete + paper package generated |
## What Is Still Open
- Reduce startup overhead for shape-level fused miss mitigation.
- Run higher-N, region-pinned commercial reruns for tighter confidence intervals.
## Reading Order
Start here, then follow the links in order:
- Objectives and thesis — the core claim and why a GPU agent that can feel beats one that gets told
- Paper — Entropy-Guided Loop — the research foundation for uncertainty-aware generation
- Experiment logbook — the clean map of canonical lanes vs scratch work
- Benchmark status — detailed completion status by lane
- Leaderboard — the main benchmark numbers
- Track B claim-safe table — scoped commercial claims (`model_only` vs `tool_only`)
- Routing comparison — internal vs external routing breakdown
- Canonical G5 artifact set — the official reference run set
- Paper package — paper-ready cross-hardware tables
- Findings changelog — the long chronological record
- Raw artifacts — every JSON and report file
- Scratch experiments — exploratory and non-canonical work
- TODO and next actions — what's coming next