Proof — not a mockup

The loop runs.
On a real repo. In your terminal.

Every snippet on this page is read directly from the working proof-of-concept in this repository. The same pnpm proof:loop command that produced this output runs on your machine in under a second.

61/61 tests passing · RED → GREEN verified · 0.1s end-to-end · 5 packages · 47-event fixture

For Claude Code · Codex · Cursor · any agentic coding model

Agentic coding can write UIs.
It can't verify them.

Today the loop ends at "here's the diff". There's no deterministic feedback signal — no way for the model to see that the UI it wrote actually behaves correctly when a user drags a segment, scrubs a playhead, reorders a row. CUIT closes that loop. Observe, propose, verify, gate. End-to-end, in 0.18s.

  1. Observe

    Recorder Chrome extension captures pointer events, semantic selectors, and window.__cuitDebug state snapshots into one JSON blob.

  2. Propose

    @cuit/spec-gen turns the session into a Playwright/Vitest spec grounded in @cuit/harness primitives — no pixel coords, no waitForTimeout.

  3. Verify

    Run the spec against the unfixed code → RED (bug deterministically reproduced). Apply the fix → GREEN. The agent sees both signals.

  4. Gate

    The generated spec becomes a permanent CI regression gate. Re-introduce the bug six months later → CI blocks the merge.
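As a rough illustration of the verify and gate steps above, the two signals can be reduced to a single verdict. This is a minimal sketch, not code from the repository: `SpecRun`, `LoopVerdict`, and `judgeLoop` are hypothetical names, and the PoC's real result shapes may differ.

```typescript
// Hypothetical result shape for one run of the generated spec.
type SpecRun = { passed: boolean; failures: string[] };
type LoopVerdict = 'closed' | 'bug-not-reproduced' | 'fix-not-verified';

// RED on the unfixed code is the success state for the first run;
// GREEN on the fixed code is the success state for the second run.
function judgeLoop(beforeFix: SpecRun, afterFix: SpecRun): LoopVerdict {
  if (beforeFix.passed) return 'bug-not-reproduced';
  if (!afterFix.passed) return 'fix-not-verified';
  return 'closed';
}

const red: SpecRun = {
  passed: false,
  failures: ['segments[0].x: expected=100, actual=25'],
};
const green: SpecRun = { passed: true, failures: [] };

console.log(judgeLoop(red, green)); // closed
```

The point of the sketch is the ordering: a spec that passes before the fix has not reproduced anything, so it cannot serve as a gate.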

The closed loop, end-to-end

stdout of pnpm proof:agent-loop — copied verbatim from agent-loop-output.log. Real recorder. Real spec-gen. Real RED → fix → GREEN.

closed · 0.18s end-to-end
[step 1/5] Capture — recorder runs while a developer reproduces the bug
              -> recorder captured 27 events (6 pointer, 20 snapshot)
              -> wrote out/recorded-session.json
[step 2/5] Generate — spec-gen produces a deterministic Playwright/Vitest spec
              -> 6 primitives: goto -> setClock -> getStateSnapshot -> dispatchDrag -> getStateSnapshot -> assertStateEquals
              -> wrote out/agent-loop.spec.ts (18 lines)
[step 3/5] Verify - run the spec against the buggy app
              -> segments[0].x: expected=100, actual=25
              -> RED [bug reproduced - this is the success state]
[step 4/5] Decide - agent reads RED output, identifies the fix
              [agent] observation: segments[0].x stayed at 25 (expected 100)
              [agent] hypothesis : the collision short-circuit in onPointerMove blocked the move
              [agent] action     : enable FIX_SEGMENT_COLLISION=1 (in the SaaS this is a code-change PR; in the PoC it is a flag)
[step 5/5] Verify GREEN - re-run the same spec against the fixed app
              -> segments[0].x: expected=100, actual=100
              -> GREEN [fix verified - regression locked in]

AGENT LOOP CLOSED - capture -> generate -> RED -> agent-fix -> GREEN in 0.18s
browser side · first-party, no vendor account
recorder.ts · typescript
// Browser side — install once. Drop the Chrome extension OR import
// @cuit/recorder directly. Same module, same JSON shape, same downstream.

import { Recorder, cuitDebugProvider } from '@cuit/recorder';

const recorder = new Recorder({
  sessionId: 'rec-001',
  vendor: 'cuit',
  snapshotProvider: cuitDebugProvider,  // reads window.__cuitDebug.getState()
});

recorder.start();
// ... developer reproduces the bug (drag a segment, click a row, etc.) ...
recorder.stop();

const session = recorder.export();
// session is plain JSON. No vendor account. No API key.
// Pass it to Claude Code / Codex with the @cuit/spec-gen import:
//   "Use @cuit/spec-gen to convert this into a Playwright spec, run it,
//    confirm RED on the unfixed code, propose the fix."
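The recorder above reads state through window.__cuitDebug.getState(). One way an app could expose such a hook is sketched below. It assumes the snapshot is a flat map keyed by path strings such as "segments[0].x"; that matches how the generated spec indexes its snapshot, but the real contract may differ. `flattenState`, `installDebugHook`, and `DebugHost` are hypothetical names, not exports of @cuit/recorder.

```typescript
// Assumption: getState() returns a flat map keyed by path strings.
type FlatState = Record<string, unknown>;
type DebugHost = { __cuitDebug?: { getState: () => FlatState } };

// Walk a nested state object and record each leaf under its path.
// Arrays also record a "<path>.length" entry.
function flattenState(value: unknown, prefix = '', out: FlatState = {}): FlatState {
  if (Array.isArray(value)) {
    out[prefix ? `${prefix}.length` : 'length'] = value.length;
    value.forEach((item, i) => flattenState(item, `${prefix}[${i}]`, out));
  } else if (value !== null && typeof value === 'object') {
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      flattenState(child, prefix ? `${prefix}.${key}` : key, out);
    }
  } else {
    out[prefix] = value;
  }
  return out;
}

// Attach the hook to a host object. In a browser the host would be window.
function installDebugHook(host: DebugHost, readState: () => unknown): void {
  host.__cuitDebug = { getState: () => flattenState(readState()) };
}

// Usage with a stand-in host so the sketch runs outside a browser.
const fakeWindow: DebugHost = {};
const appState = { segments: [{ x: 0 }, { x: 200 }] };
installDebugHook(fakeWindow, () => appState);

const snap = fakeWindow.__cuitDebug?.getState() ?? {};
console.log(snap['segments[0].x'], snap['segments[1].x'], snap['segments.length']); // 0 200 2
```

Reading state through a function rather than a stored object means each snapshot reflects the app's state at the moment the recorder asks for it.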
agent side · paste into Claude Code · Codex · Cursor
agent-prompt.md · markdown
# Copy this into Claude Code / Codex / Cursor:

I just captured a session reproducing a UI bug. The JSON is attached.

1. Run `@cuit/spec-gen` on the events to produce a Playwright/Vitest spec.
2. Run the spec against the current code. I expect it to fail RED — that
   means the bug is reproduced.
3. Read the failure (expected vs actual). Identify the smallest code change
   that flips the assertion to pass.
4. Apply the fix. Re-run the spec. Confirm GREEN.
5. Open a PR. The same spec becomes the regression gate.

# Why this works: the recorder gave you a deterministic input.
# The harness gives you a deterministic execution model.
# You now have a closed loop — observe, propose, verify — without
# any pixel coordinates, screenshots, or waitForTimeout sleeps.

Two ways to close the loop today.

  • Run the demo agent loop locally. Clone the repo, install, run pnpm proof:agent-loop. The same five-step stdout shown above will print on your machine in under a second.
  • Load the Chrome extension. Drop packages/recorder-extension/ into chrome://extensions → load unpacked. Record any page that exposes window.__cuitDebug. Paste the resulting JSON straight into your coding agent.
run it · bash
# Try the recorder against the bundled demo
pnpm install
pnpm proof:agent-loop        # recorder -> spec-gen -> RED -> fix -> GREEN

# Or load the Chrome extension on any page that exposes window.__cuitDebug:
#   chrome://extensions  ->  Developer mode  ->  Load unpacked
#   select: proof-of-concept/packages/recorder-extension/

expected: AGENT LOOP CLOSED in 0.18s · exit 0

For UI developers — show me the code

Four things that flake on your team today.
How this fixes each one.

Every snippet below is verbatim from the working proof-of-concept — not a mockup. Clone the repo and run pnpm proof:loop to reproduce the output yourself.

problem

Your Playwright tests use page.mouse.click(412, 89)

Pixel coordinates depend on viewport, CSS, browser engine, and last-frame layout. Change padding by 4px and your suite flakes.

today · flaky / manual / brittle
before.ts · typescript
await page.mouse.move(412, 89);
await page.mouse.down();
await page.mouse.move(512, 89, { steps: 10 });
await page.mouse.up();
with CUIT · deterministic / generated / permanent
after.ts · typescript
dispatchDrag('seg-0', 100, 0);
// Targets by stable name. No pixels.
// Same call works in Chromium, Firefox, WebKit.
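To make the contrast concrete, here is one way a name-targeted drag could be planned as offsets relative to the target instead of viewport pixels. This is a sketch under assumptions, not the @cuit/harness implementation: `planDrag` and `PointerStep` are hypothetical, and the step count of 4 is arbitrary.

```typescript
// Each step is an offset from the target's own starting position,
// so the plan does not depend on viewport size or layout.
type PointerStep = {
  phase: 'down' | 'move' | 'up';
  targetName: string;
  dx: number;
  dy: number;
};

function planDrag(targetName: string, dx: number, dy: number, steps = 4): PointerStep[] {
  const plan: PointerStep[] = [{ phase: 'down', targetName, dx: 0, dy: 0 }];
  for (let i = 1; i <= steps; i++) {
    // Evenly spaced intermediate moves ending exactly at (dx, dy).
    plan.push({ phase: 'move', targetName, dx: (dx * i) / steps, dy: (dy * i) / steps });
  }
  plan.push({ phase: 'up', targetName, dx, dy });
  return plan;
}

const dragPlan = planDrag('seg-0', 100, 0);
console.log(dragPlan.map((s) => `${s.phase}@${s.dx}`).join(' '));
// down@0 move@25 move@50 move@75 move@100 up@100
```

A runner would resolve the name to an element at dispatch time, which is why a padding change does not invalidate the plan.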
problem

You sprinkle waitForTimeout(500) because rAF timing is unreliable

Real animations advance on requestAnimationFrame; pixel snapshots and CSS transitions land at the next frame. Sleeps fight non-determinism with prayer.

today · flaky / manual / brittle
before.ts · typescript
await page.waitForTimeout(500);
const box = await el.boundingBox();
// hope the animation finished by now
with CUIT · deterministic / generated / permanent
after.ts · typescript
setClock(1716800000000);
// Deterministic clock. Every rAF callback fires.
// Now state is exactly where the spec says it is.
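The idea behind a deterministic clock can be sketched in a few lines: frame callbacks are queued instead of scheduled, and the test decides when time moves. This is an illustration only. `FakeClock` is a hypothetical class, not how setClock is implemented in @cuit/harness, and the 16 ms frame length is an assumption.

```typescript
type FrameCallback = (now: number) => void;

class FakeClock {
  private queue: FrameCallback[] = [];
  private now: number;

  constructor(startMs: number) {
    this.now = startMs;
  }

  // Stand-in for requestAnimationFrame: queue the callback, do not schedule it.
  requestAnimationFrame(cb: FrameCallback): void {
    this.queue.push(cb);
  }

  // Advance one frame and run only the callbacks queued before this tick.
  tick(frameMs = 16): void {
    this.now += frameMs;
    const due = this.queue;
    this.queue = [];
    for (const cb of due) cb(this.now);
  }

  // Tick until the queue is empty; bounded so a runaway animation cannot hang the test.
  flush(maxFrames = 1000): number {
    let ran = 0;
    while (this.queue.length > 0 && ran < maxFrames) {
      this.tick();
      ran++;
    }
    return ran;
  }
}

// A toy animation that moves 25 units per frame until it reaches 100.
const clock = new FakeClock(1716800000000);
let posX = 0;
function animate(): void {
  posX += 25;
  if (posX < 100) clock.requestAnimationFrame(animate);
}
clock.requestAnimationFrame(animate);

const frameCount = clock.flush();
console.log(posX, frameCount); // 100 4
```

Because nothing waits on wall-clock time, the animation lands on the same final state on every run, with no sleep to tune.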
problem

You hand-translate a Jam replay into a Playwright spec

2–6 hours per bug. Most teams skip it. You end up with no regression net, and the same bug reopens in 3 weeks.

today · flaky / manual / brittle
before.ts · typescript
// 12-minute Jam replay
// → engineer watches it twice
// → engineer guesses selectors
// → engineer writes 80-line spec
// → engineer realizes selectors broke last week
with CUIT · deterministic / generated / permanent
after.sh · bash
pnpm cuit gen jam:sess-2014 --apply
# Reads the session, emits a spec.ts
# grounded in your harness primitives.
# PR opens. You review the diff.
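For a sense of what turning a session into a spec can involve, the sketch below maps recorded events to primitive calls with simple rules. It is not @cuit/spec-gen: `toPrimitives` and `RecordedEvent` are hypothetical, the drag delta is assumed to be the difference between the down and up pointer coordinates, and assertion generation is left out entirely.

```typescript
type RecordedEvent =
  | { type: 'nav'; url: string }
  | { type: 'pointer'; phase: 'down' | 'move' | 'up'; targetName: string; x: number; y: number }
  | { type: 'state-snapshot'; path: string; value: unknown };

// Rule 1: a nav event becomes goto(url).
// Rule 2: a down...up pair collapses into a single dispatchDrag(name, dx, dy).
// Intermediate moves and snapshots are ignored in this sketch.
function toPrimitives(events: RecordedEvent[]): string[] {
  const calls: string[] = [];
  let down: { targetName: string; x: number; y: number } | null = null;
  for (const e of events) {
    if (e.type === 'nav') {
      calls.push(`goto(${JSON.stringify(e.url)})`);
    } else if (e.type === 'pointer' && e.phase === 'down') {
      down = e;
    } else if (e.type === 'pointer' && e.phase === 'up' && down) {
      const name = JSON.stringify(down.targetName);
      calls.push(`dispatchDrag(${name}, ${e.x - down.x}, ${e.y - down.y})`);
      down = null;
    }
  }
  return calls;
}

// Sample input; the down/move coordinates here are made up for illustration.
const calls = toPrimitives([
  { type: 'nav', url: 'http://localhost:5173/' },
  { type: 'pointer', phase: 'down', targetName: 'seg-0', x: 140, y: 32 },
  { type: 'pointer', phase: 'move', targetName: 'seg-0', x: 190, y: 32 },
  { type: 'pointer', phase: 'up', targetName: 'seg-0', x: 240, y: 32 },
]);
console.log(calls);
// [ 'goto("http://localhost:5173/")', 'dispatchDrag("seg-0", 100, 0)' ]
```

Collapsing the pointer stream into one named drag is what keeps pixel coordinates out of the emitted spec.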
problem

The same bug keeps reopening every release

You shipped a fix but no regression test. Six weeks later someone refactors the collision code and re-introduces the same bug.

today · flaky / manual / brittle
before.ts · typescript
// One-shot fix. No spec.
// Six weeks later: "user reports drag broken"
// File reopened. Eng-days re-spent.
with CUIT · deterministic / generated / permanent
after.ts · typescript
// Generated spec lives in tests/regressions/
// CI runs it on every PR.
// Re-introduce the bug → CI blocks merge.
// The reopened-bug loop is over.

The proof loop, end-to-end

Real artifacts from proof-of-concept/ — copied verbatim. Run pnpm proof:loop to regenerate.

61 tests passing · 0.1s end-to-end
STEP 1 · Input: recorded Jam session

A user files a bug via Jam. The connector pulls 47 normalized events.

fixtures/segment-collision.json · json
{
  "sessionId": "jam-sess-2014",
  "vendor": "jam",
  "url": "http://localhost:5173/",
  "browser": { "name": "chrome", "version": "125.0.0.0", "os": "macOS 14.4" },
  "events": [
    { "seq": 0, "type": "nav", "url": "http://localhost:5173/", "ts": 0 },
    { "seq": 1, "type": "state-snapshot", "path": "segments[0].x", "value": 0 },
    { "seq": 2, "type": "state-snapshot", "path": "segments[1].x", "value": 200 },
    { "seq": 3, "type": "state-snapshot", "path": "segments.length", "value": 2 },
    /* …41 more events: pointerdown / pointermove×N / pointerup / final state-snapshot… */
    { "seq": 45, "type": "pointer", "phase": "up",   "targetName": "seg-0", "x": 240, "y": 32, "pointerId": 1 },
    { "seq": 46, "type": "state-snapshot", "path": "segments[0].x", "value": 0 }
  ]
}
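The event shapes visible in this fixture could be typed roughly as follows. This is a sketch inferred from the fields shown above, not the repository's SessionEvent definition: the 'down' and 'move' phases are assumed from the elided pointer events, and `countByType` and `lastSnapshot` are hypothetical helpers.

```typescript
type NavEvent = { seq: number; type: 'nav'; url: string; ts: number };
type SnapshotEvent = { seq: number; type: 'state-snapshot'; path: string; value: unknown };
type PointerEvt = {
  seq: number;
  type: 'pointer';
  phase: 'down' | 'move' | 'up'; // only 'up' appears in the excerpt; the others are assumed
  targetName: string;
  x: number;
  y: number;
  pointerId: number;
};
type SessionEvent = NavEvent | SnapshotEvent | PointerEvt;

// Tally events per type.
function countByType(events: SessionEvent[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const e of events) counts[e.type] = (counts[e.type] ?? 0) + 1;
  return counts;
}

// Return the most recent recorded value for a state path, or undefined.
function lastSnapshot(events: SessionEvent[], path: string): unknown {
  let found: unknown = undefined;
  for (const e of events) {
    if (e.type === 'state-snapshot' && e.path === path) found = e.value;
  }
  return found;
}

// The six events shown in the excerpt above.
const sampleEvents: SessionEvent[] = [
  { seq: 0, type: 'nav', url: 'http://localhost:5173/', ts: 0 },
  { seq: 1, type: 'state-snapshot', path: 'segments[0].x', value: 0 },
  { seq: 2, type: 'state-snapshot', path: 'segments[1].x', value: 200 },
  { seq: 3, type: 'state-snapshot', path: 'segments.length', value: 2 },
  { seq: 45, type: 'pointer', phase: 'up', targetName: 'seg-0', x: 240, y: 32, pointerId: 1 },
  { seq: 46, type: 'state-snapshot', path: 'segments[0].x', value: 0 },
];

const typeCounts = countByType(sampleEvents);
console.log(typeCounts, lastSnapshot(sampleEvents, 'segments[0].x'));
```

A discriminated union on `type` lets downstream code narrow each event without casts, which suits a generator that branches per event kind.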
STEP 2 · Output: generated Playwright/Vitest spec

18 lines. 6 harness primitives. No pixel coords, no waitForTimeout, no hand-crafted selectors.

out/issue-2014.spec.ts · typescript
import { describe, expect, test } from 'vitest';
import {
  dispatchDrag,
  getStateSnapshot,
  setClock,
} from '@cuit/harness';

describe('issue-2014 — segment 0 drag must not collide-noop', () => {
  test('drags segment 0 right by 100px and asserts state moves', () => {
    setClock(1716800000000);

    dispatchDrag('seg-0', 100, 0);

    const snapshot = getStateSnapshot();
    expect(snapshot['segments[0].x']).toEqual(100);
  });
});
STEP 3 · RED · Run the spec against the buggy code

The spec reproduces the failure deterministically. RED is the success state — the bug is now caught by an automated test.

App.tsx (buggy) · typescript
// packages/demo-app/src/App.tsx — bug version

// Inside the pointermove handler:
setSegments((prev) => {
  const next = prev.map((s) => ({ ...s }));
  const moving = next[idx];
  const proposedX = drag.originX + dx;

  // BUG: the collision check is too eager — it blocks
  // every move that would even momentarily overlap.
  const collides = next.some((other, j) => {
    if (j === idx) return false;
    return proposedX < other.x + other.width &&
           other.x < proposedX + moving.width;
  });
  if (collides) return prev;          // <-- silently no-op'd

  moving.x = proposedX;
  return next;
});
STEP 4 · GREEN · Apply the fix, re-run the same spec

Same spec, same harness, fixed code. PASS. The spec is now a permanent CI gate.

App.tsx (fixed) · typescript
// packages/demo-app/src/App.tsx — fix version

// Inside the pointermove handler:
setSegments((prev) => {
  const next = prev.map((s) => ({ ...s }));
  const moving = next[idx];
  const proposedX = drag.originX + dx;

  // FIX: drop the over-eager collision short-circuit.
  // Free positioning; downstream layout handles overlap.

  moving.x = proposedX;
  return next;
});
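To see why an eager overlap check can strand a segment partway through a drag, the two handlers can be replayed as pure functions. This is a standalone sketch, not the demo app: the segment widths (170) and the four 25px steps are hypothetical values chosen for illustration, and `applyMove` and `dragBy` are made-up names.

```typescript
type SegmentBox = { x: number; width: number };

// One pointermove update. With eagerCollision=true this mirrors the buggy
// handler's overlap test; with false it mirrors the fixed handler.
function applyMove(
  prev: SegmentBox[],
  idx: number,
  proposedX: number,
  eagerCollision: boolean,
): SegmentBox[] {
  const next = prev.map((s) => ({ ...s }));
  const moving = next[idx];
  if (!moving) return prev;
  if (eagerCollision) {
    const collides = next.some(
      (other, j) =>
        j !== idx &&
        proposedX < other.x + other.width &&
        other.x < proposedX + moving.width,
    );
    if (collides) return prev; // the move is silently dropped
  }
  moving.x = proposedX;
  return next;
}

// Replay a drag as evenly spaced moves measured from the drag origin.
function dragBy(segments: SegmentBox[], idx: number, dx: number, eager: boolean, steps = 4): SegmentBox[] {
  const originX = segments[idx]?.x ?? 0;
  let state = segments;
  for (let i = 1; i <= steps; i++) {
    state = applyMove(state, idx, originX + (dx * i) / steps, eager);
  }
  return state;
}

// Hypothetical layout: two 170px-wide segments at x=0 and x=200.
const initial: SegmentBox[] = [{ x: 0, width: 170 }, { x: 200, width: 170 }];

const buggyX = dragBy(initial, 0, 100, true)[0]?.x;
const fixedX = dragBy(initial, 0, 100, false)[0]?.x;
console.log(buggyX, fixedX); // 25 100
```

With these assumed widths, the first 25px step fits in the gap, and every later step overlaps the neighbor and is dropped, so the segment stops at 25 instead of reaching 100.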
Actual stdout of pnpm proof:loop, copied verbatim from proof-output.log
[1/6] Loading recorded session events from fixtures/segment-collision.json
       -> 47 events normalized into SessionEvent[]
[2/6] Generating spec from session events
       -> wrote out/issue-2014-segment-0-drag-must-not-collide-noop.spec.ts (18 lines, 6 primitives used)
[3/6] Running spec against demo-app (bug-present mode)
       -> FAIL - segment 0 right edge stayed at x=25 (expected 100)
       -> RED - bug reproduced deterministically [SUCCESS]
[4/6] Applying canonical fix (FIX_SEGMENT_COLLISION=1)
       -> re-rendering demo-app with fix flag
[5/6] Running spec against demo-app (fixed mode)
       -> PASS - segment 0 right edge moved to x=100
       -> GREEN - fix verified, regression locked in [SUCCESS]
[6/6] Locking the spec into CI as a gate
       -> wrote .github/workflows/proof-regression.yml

LOOP COMPLETE - RED to GREEN in 0.1s

Try it. Don't take our word.

Four shell commands. Node 20. About thirty seconds of install. The same six-step stdout you see above will print on your machine: RED at step 3, GREEN at step 5, exit 0.

  • Read the source — every primitive in the spec is a real exported function from @cuit/harness.
  • Inspect the tests — 61 unit tests across 5 packages, TDD-first, all green.
  • Wire it into your repo — the same primitives work in any React/Vue app that exposes a state-snapshot hook.
run on your machine · bash
# 1. Clone the repo
git clone git@github.com:speechlabinc/complex-ui-tester.git
cd complex-ui-tester/proof-of-concept

# 2. Install (Node 20 + pnpm)
pnpm install

# 3. Run the loop end-to-end
pnpm proof:loop

# 4. Run the package tests (61 tests across 5 packages)
pnpm test

expected output: RED to GREEN in 0.1s · exit 0

What this proves — and what it doesn't

Proves

  • A recorded session can be normalized into a stable SessionEvent[].
  • Those events can be mapped to a Playwright/Vitest spec that calls only validated harness primitives.
  • The spec deterministically reproduces a real bug (RED) against the unfixed code.
  • The same spec passes (GREEN) against the fixed code with no spec edits.
  • The architecture in docs/02, 04, and 10 is implementable — not just designed.

Does NOT prove (yet)

  • LLM-driven spec generation. The PoC generator is rule-based — a drop-in substitute for the 3-pass LLM pipeline in docs/04.
  • Pre-built third-party connectors. Connectors for Jam / LogRocket / Sentry are designed but not shipped; for now we ship a first-party Chrome recorder that produces the same SessionEvent[] shape. See packages/recorder-extension/.
  • Multi-tenant prompt context. Single-tenant in the PoC.
  • SOC 2. We're in observation for Type II per docs/05; not yet audited.
  • Production deployment of the SaaS infra. Designed in docs/03; not built.

The PoC's job is to prove the loop architecture is real — the rest is engineering, not invention.