Research Project

Measure time-to-aha.
Optimize what matters.

ROIBench builds AI apps and measures how quickly a new user gets real, verified value. Not model accuracy. Not feature count. Activation ROI: value delivered per unit of user effort.

What is Activation ROI?

Most AI benchmarks measure model answers. ROIBench measures something different: can a new user reach a verified "aha moment" quickly and without friction?

The Investment

User's first-session effort

Time spent before getting value. Steps taken (clicks, uploads, form fills). Friction endured (errors, confusion, dead ends).

The Return

Verified aha moment

The user actually got what the app promises, confirmed by a deterministic validator. Not vibes — proof.

score = activated × e^(−time/τ) × e^(−friction/φ)

where activated is 0 or 1 (did they get value?), τ = 60 s is the time decay constant, and φ = 3 friction events is the friction decay constant.

Reading the score: 0 = never activated; ~1 = instant value, zero friction; low = got there, but it was painful.
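As a reference, the formula can be computed directly. A minimal sketch: the τ and φ values come from the definition above, while the 0–5 scale factor is an assumption (the raw formula yields 0–1, but the published scores appear to be on a 0–5 scale).

```python
import math

TAU = 60.0  # time decay constant, seconds (from the definition above)
PHI = 3.0   # friction decay constant, friction events (from the definition above)

def activation_roi(activated: bool, time_s: float, friction: int,
                   scale: float = 5.0) -> float:
    """Activation ROI: zero unless activated, then decays
    exponentially with time spent and friction endured.
    `scale=5.0` is an assumption, not part of the stated formula."""
    if not activated:
        return 0.0
    return scale * math.exp(-time_s / TAU) * math.exp(-friction / PHI)
```

Note how activation gates everything: a fast, frictionless session that never reaches verified value still scores exactly 0.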

Agentic build-test-iterate loop

An agentic conveyor builds apps, tests them with synthetic personas, and iterates until activation quality hits a target — or proves it can't.

Discovery
Build
Test
Iterate
Ship or Kill
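The loop above can be sketched as follows. This is an illustrative skeleton, not the actual pipeline API: the callables, the round cap, and the 4.5 target (read off the trajectory chart) are assumptions.

```python
def run_conveyor(build_app, evaluate, improve,
                 target: float = 4.5, max_rounds: int = 25):
    """Build -> Test -> Iterate until the activation target is hit
    ("ship") or the round budget is exhausted ("kill").
    `evaluate` returns (score, change_requests)."""
    app = build_app()                     # Discovery + Build
    history = []
    for round_no in range(1, max_rounds + 1):
        score, change_requests = evaluate(app)   # Test with personas
        history.append(score)
        if score >= target:
            return "ship", round_no, history
        app = improve(app, change_requests)      # Iterate
    return "kill", max_rounds, history
```

A real pipeline would also bail out early when the score trajectory proves an architecture can't reach the target, as happened with TailorCV.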
Quantitative

Activation ROI Score

Computed from instrumented action traces. Every click, upload, and wait is logged with timestamps. The headline number that tracks improvement.
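Deriving the two score inputs from a trace can be sketched as below. The trace format (timestamp, event_type) and the set of event types counted as friction are assumptions, not the project's actual instrumentation schema.

```python
# Hypothetical: which logged event types count as friction.
FRICTION_EVENTS = {"error", "dead_end", "retry"}

def trace_metrics(trace: list[tuple[float, str]]) -> tuple[float, int]:
    """From a timestamped action trace, derive:
    - time-to-value: seconds from first event to the final (aha) event
    - friction: count of friction-type events endured along the way."""
    if not trace:
        return 0.0, 0
    time_to_value = trace[-1][0] - trace[0][0]
    friction = sum(1 for _, kind in trace if kind in FRICTION_EVENTS)
    return time_to_value, friction
```

These two numbers, plus the validator's activated bit, are all the score formula needs.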

Qualitative

Persona Narratives

Synthetic users write free-text experience reports. These explain why the score is what it is and generate ranked change requests for the builder.

The score measures progress. The feedback drives it. A score without diagnostics is a number you can't act on. Diagnostics without a score is improvement you can't measure.

5 apps through the conveyor

4 out of 5 hit the activation target. The one that didn't revealed an architectural ceiling — which is itself a useful result.

4/5 hit target · 51 eval rounds · 17 personas · 4.86 best score
| App | Description | Rounds | Score | Status | Key finding |
|---|---|---|---|---|---|
| DocBench | Agentic document workspace with citations | 23 | 4.75 | Target hit (R6) | Weakest persona 1.8 → 5.0 over 10 rounds. Citation grounding was the bottleneck. |
| Night Desk | AI detective game with generated scenes | 10 | 4.86 | Target hit (R5) | Image quality upgrade drove visual impact to 5.0 across all personas. |
| ClawTrade Ops | Trading automation with receipts | 4 | 4.65 | Target hit (R4) | Fastest to target. NLP-first UX was the breakthrough. |
| Annotate | AI annotation platform for students | 4 | 4.53 | Target hit (R2) | Deterministic fallback: 3.07. Real LLM: +1.46 in one round. |
| TailorCV | Resume tailoring tool | 10 | 3.43 | Target missed | Architecture ceiling ~3.5. Single-LLM-pass can't solve the accuracy tension. |

Score trajectory across rounds

[Chart: Activation ROI per round, R1–R10, target line at 4.5. Series: DocBench, Night Desk, ClawTrade, Annotate, TailorCV (missed).]

DocBench

Upload documents. Ask questions. Get answers with grounded citations you can verify. An agentic document workspace — not a chatbot wrapper.

4.75 activation score · 23 iteration rounds · 4 personas tested
  • Upload PDFs, markdown, or text documents
  • AI agent reads and indexes your documents
  • Ask questions in natural language
  • Every answer includes grounded citations with verbatim quotes
  • Click any citation to see the source in context
  • Persistent workspaces — pick up where you left off (coming soon)
Demo: sources loaded are policy_retention.pdf, compliance_guide.md, and data_handling.txt.

Q: What is the data retention policy for EU customers?
A: According to the retention policy, EU customer data must be deleted within 90 days of account closure [1]. The compliance guide specifies that this applies to all PII including derived analytics [2].
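A deterministic grounding check for answers like the one above might look like the sketch below. The citation-map format and the verbatim-containment rule are assumptions based on the product description, not DocBench's actual validator.

```python
import re

def citations_grounded(answer: str,
                       citations: dict[int, tuple[str, str]],
                       sources: dict[str, str]) -> bool:
    """Every [n] marker in the answer must map to a (source_name, quote)
    pair whose quote appears verbatim in the named source document."""
    for marker in re.findall(r"\[(\d+)\]", answer):
        entry = citations.get(int(marker))
        if entry is None:
            return False                     # dangling citation marker
        source_name, quote = entry
        if quote not in sources.get(source_name, ""):
            return False                     # quote not verbatim in source
    return True
```

Because the check is pure string containment, it is deterministic and cheap: no model call is needed to verify that an answer is grounded.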

Persona scores (Round 19)

  • Alex (undergrad): 5.0
  • Maya (grad student): 4.4
  • James (researcher): 4.6
  • Sarah (compliance): 5.0

Research directions

ROIBench is an ongoing research project exploring the intersection of synthetic evaluation, agentic building, and product activation.

  • Can synthetic persona evaluation replace human expert judgment for UX quality?
  • How well do synthetic Activation ROI scores predict real user activation?
  • What is the minimum number of iteration rounds needed to hit an activation target, and can we predict early whether an architecture will plateau?
  • How should quantitative activation metrics and qualitative persona feedback be combined to maximize improvement per iteration round?

Early evidence: TailorCV's flat trajectory (10 rounds, no improvement past R2) suggests architectural ceilings are detectable early. Annotate's +1.46 jump when switching from a deterministic fallback to a real LLM suggests that infrastructure choices can dominate UX polish.
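A ceiling detector of the kind this evidence points at can be very simple. The sketch below is illustrative: the window size and improvement threshold are assumptions, not values used by the project.

```python
def plateaued(history: list[float], window: int = 3, eps: float = 0.05) -> bool:
    """True if the best score in the last `window` rounds improves on the
    best earlier score by no more than `eps` — a cheap early signal
    that an architecture has hit its ceiling."""
    if len(history) <= window:
        return False                         # not enough rounds to judge
    recent_best = max(history[-window:])
    earlier_best = max(history[:-window])
    return recent_best - earlier_best <= eps
```

On a TailorCV-shaped trajectory this fires within a few rounds of the plateau, long before the round budget runs out.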

A solo research project

ROIBench is built by Anton as a side project alongside a full-time role. The entire pipeline — discovery, building, persona testing, iteration — is agentic, built with Claude Code (Anthropic) and tested via Playwright.

All code, Value Contracts, persona definitions, and round-by-round results are available in the repository.

GitHub · Follow on X