Proof — not a mockup
The loop runs.
On a real repo. In your terminal.
Every snippet on this page is read directly from the working proof-of-concept in this repository. The same pnpm proof:loop command that produced this output runs on your machine in under a second.
For Claude Code · Codex · Cursor · any agentic coding model
Agentic coding can write UIs.
It can't verify them.
Today the loop ends at "here's the diff". There's no deterministic feedback signal — no way for the model to see that the UI it wrote actually behaves correctly when a user drags a segment, scrubs a playhead, reorders a row. CUIT closes that loop. Observe, propose, verify, gate. End-to-end, in 0.18s.
- Observe
Recorder Chrome extension captures pointer events, semantic selectors, and window.__cuitDebug state snapshots into one JSON blob.
- Propose
@cuit/spec-gen turns the session into a Playwright/Vitest spec grounded in @cuit/harness primitives — no pixel coords, no waitForTimeout.
- Verify
Run the spec against the unfixed code → RED (bug deterministically reproduced). Apply the fix → GREEN. The agent sees both signals.
- Gate
The generated spec becomes a permanent CI regression gate. Re-introduce the bug six months later → CI blocks the merge.
The closed loop, end-to-end
stdout of pnpm proof:agent-loop — copied verbatim from agent-loop-output.log. Real recorder. Real spec-gen. Real RED → fix → GREEN.
[step 1/5] Capture — recorder runs while a developer reproduces the bug
-> recorder captured 27 events (6 pointer, 20 snapshot)
-> wrote out/recorded-session.json
[step 2/5] Generate — spec-gen produces a deterministic Playwright/Vitest spec
-> 6 primitives: goto -> setClock -> getStateSnapshot -> dispatchDrag -> getStateSnapshot -> assertStateEquals
-> wrote out/agent-loop.spec.ts (18 lines)
[step 3/5] Verify - run the spec against the buggy app
-> segments[0].x: expected=100, actual=25
-> RED [bug reproduced - this is the success state]
[step 4/5] Decide - agent reads RED output, identifies the fix
[agent] observation: segments[0].x stayed at 25 (expected 100)
[agent] hypothesis : the collision short-circuit in onPointerMove blocked the move
[agent] action : enable FIX_SEGMENT_COLLISION=1 (in the SaaS this is a code-change PR; in the PoC it is a flag)
[step 5/5] Verify GREEN - re-run the same spec against the fixed app
-> segments[0].x: expected=100, actual=100
-> GREEN [fix verified - regression locked in]
AGENT LOOP CLOSED - capture -> generate -> RED -> agent-fix -> GREEN in 0.18s// Browser side — install once. Drop the Chrome extension OR import
// @cuit/recorder directly. Same module, same JSON shape, same downstream.
import { Recorder, cuitDebugProvider } from '@cuit/recorder';
const recorder = new Recorder({
sessionId: 'rec-001',
vendor: 'cuit',
snapshotProvider: cuitDebugProvider, // reads window.__cuitDebug.getState()
});
recorder.start();
// ... developer reproduces the bug (drag a segment, click a row, etc.) ...
recorder.stop();
const session = recorder.export();
// session is plain JSON. No vendor account. No API key.
// Pass it to Claude Code / Codex with the @cuit/spec-gen import:
// "Use @cuit/spec-gen to convert this into a Playwright spec, run it,
// confirm RED on the unfixed code, propose the fix."# Copy this into Claude Code / Codex / Cursor:
I just captured a session reproducing a UI bug. The JSON is attached.
1. Run `@cuit/spec-gen` on the events to produce a Playwright/Vitest spec.
2. Run the spec against the current code. I expect it to fail RED — that
means the bug is reproduced.
3. Read the failure (expected vs actual). Identify the smallest code change
that flips the assertion to pass.
4. Apply the fix. Re-run the spec. Confirm GREEN.
5. Open a PR. The same spec becomes the regression gate.
# Why this works: the recorder gave you a deterministic input.
# The harness gives you a deterministic execution model.
# You now have a closed loop — observe, propose, verify — without
# any pixel coordinates, screenshots, or waitForTimeout sleeps.Two ways to close the loop today.
- →Run the demo agent loop locally. Clone the repo, install, run
pnpm proof:agent-loop. The same six lines of stdout above will print on your machine in under a second. - →Load the Chrome extension. Drop
packages/recorder-extension/intochrome://extensions→ load unpacked. Record any page that exposeswindow.__cuitDebug. Paste the resulting JSON straight into your coding agent.
# Try the recorder against the bundled demo
pnpm install
pnpm proof:agent-loop # recorder -> spec-gen -> RED -> fix -> GREEN
# Or load the Chrome extension on any page that exposes window.__cuitDebug:
# chrome://extensions -> Developer mode -> Load unpacked
# select: proof-of-concept/packages/recorder-extension/expected: AGENT LOOP CLOSED in 0.18s · exit 0
For UI developers — show me the code
Four things that flake on your team today.
How this fixes each one.
Every snippet below is verbatim from the working proof-of-concept — not a mockup. Clone the repo and run pnpm proof:loop to reproduce the output yourself.
Your Playwright tests use page.mouse.click(412, 89)
Pixel coordinates depend on viewport, CSS, browser engine, and last-frame layout. Change padding by 4px and your suite flakes.
await page.mouse.move(412, 89);
await page.mouse.down();
await page.mouse.move(512, 89, { steps: 10 });
await page.mouse.up();dispatchDrag('seg-0', 100, 0);
// Targets by stable name. No pixels.
// Same call works in Chromium, Firefox, WebKit.You sprinkle waitForTimeout(500) because rAF timing is unreliable
Real animations advance on requestAnimationFrame; pixel snapshots and CSS transitions land at the next frame. Sleeps fight non-determinism with prayer.
await page.waitForTimeout(500);
const box = await el.boundingBox();
// hope the animation finished by nowsetClock(1716800000000);
// Deterministic clock. Every rAF callback fires.
// Now state is exactly where the spec says it is.You hand-translate a Jam replay into a Playwright spec
2–6 hours per bug. Most teams skip it. You end up with no regression net, and the same bug reopens in 3 weeks.
// 12-minute Jam replay
// → engineer watches it twice
// → engineer guesses selectors
// → engineer writes 80-line spec
// → engineer realizes selectors broke last weekpnpm cuit gen jam:sess-2014 --apply
# Reads the session, emits a spec.ts
# grounded in your harness primitives.
# PR opens. You review the diff.The same bug keeps reopening every release
You shipped a fix but no regression test. Six weeks later someone refactors the collision code and re-introduces the same bug.
// One-shot fix. No spec.
// Six weeks later: "user reports drag broken"
// File reopened. Eng-days re-spent.# Generated spec lives in tests/regressions/
# CI runs it on every PR.
# Re-introduce the bug → CI blocks merge.
# The 6-Reopened-bugs loop is over.The proof loop, end-to-end
Real artifacts from proof-of-concept/ — copied verbatim. Run pnpm proof:loop to regenerate.
A user files a bug via Jam. The connector pulls 47 normalized events.
{
"sessionId": "jam-sess-2014",
"vendor": "jam",
"url": "http://localhost:5173/",
"browser": { "name": "chrome", "version": "125.0.0.0", "os": "macOS 14.4" },
"events": [
{ "seq": 0, "type": "nav", "url": "http://localhost:5173/", "ts": 0 },
{ "seq": 1, "type": "state-snapshot", "path": "segments[0].x", "value": 0 },
{ "seq": 2, "type": "state-snapshot", "path": "segments[1].x", "value": 200 },
{ "seq": 3, "type": "state-snapshot", "path": "segments.length", "value": 2 },
/* …42 more events: pointerdown / pointermove×N / pointerup / final state-snapshot… */
{ "seq": 45, "type": "pointer", "phase": "up", "targetName": "seg-0", "x": 240, "y": 32, "pointerId": 1 },
{ "seq": 46, "type": "state-snapshot", "path": "segments[0].x", "value": 0 }
]
}18 lines. 6 harness primitives. No pixel coords, no waitForTimeout, no hand-crafted selectors.
import { describe, expect, test } from 'vitest';
import {
dispatchDrag,
getStateSnapshot,
setClock,
} from '@cuit/harness';
describe('issue-2014 — segment 0 drag must not collide-noop', () => {
test('drags segment 0 right by 100px and asserts state moves', () => {
setClock(1716800000000);
dispatchDrag('seg-0', 100, 0);
const snapshot = getStateSnapshot();
expect(snapshot['segments[0].x']).toEqual(100);
});
});
The spec reproduces the failure deterministically. RED is the success state — the bug is now caught by an automated test.
// packages/demo-app/src/App.tsx — bug version
// Inside the pointermove handler:
setSegments((prev) => {
const next = prev.map((s) => ({ ...s }));
const moving = next[idx];
const proposedX = drag.originX + dx;
// BUG: the collision check is too eager — it blocks
// every move that would even momentarily overlap.
const collides = next.some((other, j) => {
if (j === idx) return false;
return proposedX < other.x + other.width &&
other.x < proposedX + moving.width;
});
if (collides) return prev; // <-- silently no-op'd
moving.x = proposedX;
return next;
});Same spec, same harness, fixed code. PASS. The spec is now a permanent CI gate.
// packages/demo-app/src/App.tsx — fix version
// Inside the pointermove handler:
setSegments((prev) => {
const next = prev.map((s) => ({ ...s }));
const moving = next[idx];
const proposedX = drag.originX + dx;
// FIX: drop the over-eager collision short-circuit.
// Free positioning; downstream layout handles overlap.
moving.x = proposedX;
return next;
});pnpm proof:loop— copied verbatim from proof-output.log[1/6] Loading recorded session events from fixtures/segment-collision.json
-> 47 events normalized into SessionEvent[]
[2/6] Generating spec from session events
-> wrote out/issue-2014-segment-0-drag-must-not-collide-noop.spec.ts (18 lines, 6 primitives used)
[3/6] Running spec against demo-app (bug-present mode)
-> FAIL - segment 0 right edge stayed at x=25 (expected 100)
-> RED - bug reproduced deterministically [SUCCESS]
[4/6] Applying canonical fix (FIX_SEGMENT_COLLISION=1)
-> re-rendering demo-app with fix flag
[5/6] Running spec against demo-app (fixed mode)
-> PASS - segment 0 right edge moved to x=100
-> GREEN - fix verified, regression locked in [SUCCESS]
[6/6] Locking the spec into CI as a gate
-> wrote .github/workflows/proof-regression.yml
LOOP COMPLETE - RED to GREEN in 0.1sTry it. Don't take our word.
Four shell commands. Node 20. About thirty seconds of install. The same six lines of stdout you see above will print on your machine — RED at step 3, GREEN at step 5, exit 0.
- →Read the source — every primitive in the spec is a real exported function from
@cuit/harness. - →Inspect the tests — 61 unit tests across 5 packages, TDD-first, all green.
- →Wire it into your repo — the same primitives work in any React/Vue app that exposes a state-snapshot hook.
# 1. Clone the repo
git clone git@github.com:speechlabinc/complex-ui-tester.git
cd complex-ui-tester/proof-of-concept
# 2. Install (Node 20 + pnpm)
pnpm install
# 3. Run the loop end-to-end
pnpm proof:loop
# 4. Run the package tests (61 tests across 5 packages)
pnpm testexpected output: RED to GREEN in 0.1s · exit 0
What this proves — and what it doesn't
Proves
- ✓A recorded session can be normalized into a stable
SessionEvent[]. - ✓Those events can be mapped to a Playwright/Vitest spec that calls only validated harness primitives.
- ✓The spec deterministically reproduces a real bug (RED) against the unfixed code.
- ✓The same spec passes (GREEN) against the fixed code with no spec edits.
- ✓The architecture in
docs/02,04, and10is implementable — not just designed.
Does NOT prove (yet)
- ○LLM-driven spec generation. The PoC generator is rule-based — a drop-in substitute for the 3-pass LLM pipeline in
docs/04. - ○Pre-built third-party connectors for Jam / LogRocket / Sentry are designed but not shipped — for now we ship a first-party Chrome recorder that produces the same
SessionEvent[]shape. Seepackages/recorder-extension/. - ○Multi-tenant prompt context. Single-tenant in the PoC.
- ○SOC 2. We're in observation for Type II per
docs/05; not yet audited. - ○Production deployment of the SaaS infra. Designed in
docs/03; not built.
The PoC's job is to prove the loop architecture is real — the rest is engineering, not invention.