The best UI feedback loop for Claude Code & Codex

MCP server (Claude Code & Codex) + Claude Code skill · deterministic · CI-native

Turn a recorded interaction into a committed regression gate.

A deterministic harness: no pixel-coordinate tests, no fragile selectors. Run /cuit-loop in Claude Code: CUIT generates a grounded Playwright spec from semantic events and commits a green CI gate, all in one conversation turn.

01 · Record interaction in Chrome
02 · Run /cuit-loop in Claude Code
03 · Regression gate committed to CI

0.18s · loop closed, end-to-end

73 / 73 · package tests passing

0% · CI flake rate

Chrome ext · first-party recorder shipping

Claude Code · MCP · skills — not REST, not curl

Claude Code closes the loop.
One skill. No endpoints.

Every edge of the observe → propose → verify → gate cycle is closed by an MCP tool or a Claude Code skill — not a REST call. An engineer types /cuit-loop and the full cycle runs inside the conversation in 0.18s.

Observe
MCP · cuit__get_session
Whatever feedback source you already use — your recorder, Jam, LogRocket, Sentry Replay, RUM — normalizes into one SessionEvent[] representation: pointer events, semantic selectors, console/errors, window.__cuitDebug snapshots. No curl. Claude Code calls the tool directly.
Propose
skill · /cuit-loop
Typing /cuit-loop in Claude Code invokes @cuit/spec-gen to turn the session into a Playwright/Vitest spec grounded in @cuit/harness primitives — no pixel coords, no waitForTimeout.
Verify
skill · /cuit-loop
The same /cuit-loop skill runs the spec against the unfixed code → RED (bug deterministically reproduced), then against the fix → GREEN. Claude Code sees both signals without leaving the conversation.
Gate
MCP · mcp__gitnexus__detect_changes
The generated spec is committed as a permanent CI regression gate. The GitNexus MCP tool tracks the change graph so future model calls know which specs are load-bearing before they touch related code.

Claude Code · one skill invocation: the full cycle, no REST endpoints

/cuit-loop: Claude Code conversation · bash

# Claude Code conversation — no curl, no POST /v1/sessions

> /cuit-loop

  [cuit-loop] Reading session via cuit__get_session (source: any adapter)...
  [cuit-loop] 27 events normalized (6 pointer, 20 state-snapshot)

  [cuit-loop] Generating spec with @cuit/spec-gen...
  [cuit-loop] wrote out/agent-loop.spec.ts (18 lines, 6 primitives)

  [cuit-loop] Running spec — EXPECT RED...
  [cuit-loop]   segments[0].x: expected=100, actual=25 ✗  ← bug reproduced

  [cuit-loop] Hypothesis: collision short-circuit in onPointerMove blocks move
  [cuit-loop] Applying fix...

  [cuit-loop] Re-running spec — EXPECT GREEN...
  [cuit-loop]   segments[0].x: expected=100, actual=100 ✓  ← fix verified

  [cuit-loop] Opening PR with spec as regression gate.

  LOOP CLOSED in 0.18s

observe → propose → verify → gate · 0.18s end-to-end

The closed loop, end-to-end

stdout of pnpm proof:agent-loop — copied verbatim from agent-loop-output.log. Real recorder. Real spec-gen. Real RED → fix → GREEN.

closed · 0.18s end-to-end

[step 1/5] Capture — recorder runs while a developer reproduces the bug
              -> recorder captured 27 events (6 pointer, 20 snapshot)
              -> wrote out/recorded-session.json
[step 2/5] Generate — spec-gen produces a deterministic Playwright/Vitest spec
              -> 6 primitives: goto -> setClock -> getStateSnapshot -> dispatchDrag -> getStateSnapshot -> assertStateEquals
              -> wrote out/agent-loop.spec.ts (18 lines)
[step 3/5] Verify - run the spec against the buggy app
              -> segments[0].x: expected=100, actual=25
              -> RED [bug reproduced - this is the success state]
[step 4/5] Decide - agent reads RED output, identifies the fix
              [agent] observation: segments[0].x stayed at 25 (expected 100)
              [agent] hypothesis : the collision short-circuit in onPointerMove blocked the move
              [agent] action     : enable FIX_SEGMENT_COLLISION=1 (in the SaaS this is a code-change PR; in the PoC it is a flag)
[step 5/5] Verify GREEN - re-run the same spec against the fixed app
              -> segments[0].x: expected=100, actual=100
              -> GREEN [fix verified - regression locked in]

AGENT LOOP CLOSED - capture -> generate -> RED -> agent-fix -> GREEN in 0.18s

browser side · first-party, no vendor account

recorder.ts · typescript

// Browser side — install once. Drop the Chrome extension OR import
// @cuit/recorder directly. Same module, same JSON shape, same downstream.

import { Recorder, cuitDebugProvider } from '@cuit/recorder';

const recorder = new Recorder({
  sessionId: 'rec-001',
  vendor: 'cuit',
  snapshotProvider: cuitDebugProvider,  // reads window.__cuitDebug.getState()
});

recorder.start();
// ... developer reproduces the bug (drag a segment, click a row, etc.) ...
recorder.stop();

const session = recorder.export();
// session is plain JSON. No vendor account. No API key.
// Pass it to Claude Code / Codex with the @cuit/spec-gen import:
//   "Use @cuit/spec-gen to convert this into a Playwright spec, run it,
//    confirm RED on the unfixed code, propose the fix."

Claude Code side · paste into Claude Code, or just type /cuit-loop

agent-prompt.md · markdown

# Copy this into Claude Code / Codex / Cursor:

I just captured a session reproducing a UI bug. The JSON is attached.

1. Run `@cuit/spec-gen` on the events to produce a Playwright/Vitest spec.
2. Run the spec against the current code. I expect it to fail RED — that
   means the bug is reproduced.
3. Read the failure (expected vs actual). Identify the smallest code change
   that flips the assertion to pass.
4. Apply the fix. Re-run the spec. Confirm GREEN.
5. Open a PR. The same spec becomes the regression gate.

# Why this works: the recorder gave you a deterministic input.
# The harness gives you a deterministic execution model.
# You now have a closed loop — observe, propose, verify — without
# any pixel coordinates, screenshots, or waitForTimeout sleeps.

Two ways to close the loop today.

→ Type /cuit-loop in Claude Code. The skill wires whatever feedback source you have, spec-gen, and the harness into one conversation turn. Observe → propose → verify → gate without leaving your editor.
→ Run the demo loop locally. Clone the repo, install, run pnpm proof:agent-loop. The same stdout above prints on your machine in under a second, with no API key and no account.

Full proof artifacts → · Extension source on GitHub ↗

run it · bash

# Try the recorder against the bundled demo
pnpm install
pnpm proof:agent-loop        # recorder -> spec-gen -> RED -> fix -> GREEN

# Or load the Chrome extension on any page that exposes window.__cuitDebug:
#   chrome://extensions  ->  Developer mode  ->  Load unpacked
#   select: proof-of-concept/packages/recorder-extension/

expected: AGENT LOOP CLOSED in 0.18s · exit 0

One representation · every source · one loop

The data representation
is the protocol.

Whatever feedback you already capture — a recorder, a bug-report tool, session replay, RUM, raw console logs — normalizes into one canonical representation: SessionEvent[]. The loop closes on the representation, never on the vendor.

That's the moat: a posted, versioned contract — pointer, state, nav, console, error, keyboard — that any source maps into through the same SessionAdapter interface. Swap the source, swap the model — the representation, the spec, and the CI gate are unchanged.
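As a rough illustration of that contract, the sketch below models SessionEvent as a TypeScript discriminated union and maps one source into it. The event fields follow the sample session JSON shown on this page. The `toSessionEvents` method name, the `VendorClick` record, and `vendorClickAdapter` are assumptions made for this example, not the published contract in docs/10.

```typescript
// Hypothetical sketch: field names mirror the sample session JSON,
// but the real contract lives in docs/10 and may differ.
type SessionEvent =
  | { seq: number; type: "nav"; url: string }
  | { seq: number; type: "pointer"; phase: "down" | "move" | "up";
      targetName: string; x: number; y: number; pointerId: number }
  | { seq: number; type: "state-snapshot"; path: string; value: unknown }
  | { seq: number; type: "console"; level: "log" | "warn" | "error";
      message: string; stack?: string }
  | { seq: number; type: "uncaught-error"; message: string };

// Assumed adapter shape: one method that maps a source payload to events.
interface SessionAdapter<TSource> {
  vendor: string;
  toSessionEvents(source: TSource): SessionEvent[];
}

// Made-up vendor record, used only to exercise the adapter.
interface VendorClick {
  testId: string;
  clientX: number;
  clientY: number;
}

const vendorClickAdapter: SessionAdapter<VendorClick[]> = {
  vendor: "example-vendor",
  toSessionEvents(clicks) {
    const out: SessionEvent[] = [];
    clicks.forEach((click) => {
      // Each click becomes a down/up pair on the same semantic target.
      for (const phase of ["down", "up"] as const) {
        out.push({
          seq: out.length,
          type: "pointer",
          phase,
          targetName: click.testId,
          x: click.clientX,
          y: click.clientY,
          pointerId: 1,
        });
      }
    });
    return out;
  },
};

const adaptedEvents = vendorClickAdapter.toSessionEvents([
  { testId: "seg-0", clientX: 40, clientY: 32 },
]);
```

Everything downstream of the adapter sees only `SessionEvent[]`, which is the property the paragraph above describes: swapping the source means writing another adapter, not touching the spec or the gate.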

Read the posted spec — docs/10 adapter contract

Any feedback source

Your recorder · Jam · LogRocket · Sentry Replay · FullStory · Datadog RUM · Console + errors · Custom adapter

SessionEvent[]

canonical representation

One loop, in /cuit-loop

observe → propose → verify → gate

Two flows — same recorder, same loop, same final artifact

Lock in what works.
Or reproduce what doesn't.

The recorder doesn't care whether the code is broken when you press Start. Capture a working interaction → the agent recognizes a baseline and gates it. Capture a buggy interaction → the agent recognizes a regression and walks RED → fix → GREEN. The skill detects which flow you're in on the first run of the generated spec.

Flow B · proactive

Lock in a known-good interaction

Code works. Record the interaction. Lock the behavior in before someone refactors it into a bug.

You start with

Your code is on a known-good build. The waveform drag works. The undo stack works. You want to keep it that way forever — without hand-writing the Playwright test that proves it.

01 · Record the working interaction
Click the recorder extension. Reproduce the interaction the way a real user would. Stop.
02 · Generate the spec
@cuit/spec-gen turns the captured events into a deterministic Playwright/Vitest spec grounded in @cuit/harness primitives, with no pixel coords.
03 · Run the spec
Expect GREEN. The interaction works on current code and the spec confirms it. This is the baseline.
04 · Commit the spec as the baseline
The PR adds the spec to tests/regressions/. Any future PR that breaks the interaction now fails CI before merge.

You end with

A GREEN spec.ts committed to your repo. Future regressions caught automatically — no human had to hand-write a Playwright test.

How the agent detects this: When the spec passes on the first run, the agent recognizes Flow B and commits + opens a PR adding regression coverage — no fix needed.

Flow A · reactive

Reproduce a known bug, fix it, lock in the regression

Bug is on prod. Reproduce it deterministically. Watch RED. Fix the code. Watch GREEN. Lock it in.

You start with

A user files a bug. The bug is real in the current code. You want to reproduce it deterministically, fix the smallest possible thing, and make sure it never reopens.

01 · Record the bug
Click the recorder extension. Reproduce the bug exactly as the user did. Stop. The recording captures the broken final state via window.__cuitDebug.
02 · Generate the spec
Same step as Flow B. @cuit/spec-gen produces the spec. The assertion targets the state where the bug should NOT have happened.
03 · Run the spec, expect RED
RED is the success state. The bug is now caught by a deterministic, semantically-grounded automated test. The failure shows expected vs actual, pointing the agent at the diagnosis.
04 · Agent proposes the smallest fix
Claude Code / Codex reads the RED output, identifies the root cause (e.g. over-eager collision check, missing setClock advance), and applies the minimum change.
05 · Re-run the spec, expect GREEN
Same spec. Fixed code. PASS. The fix is verified by exactly the test that proved the bug existed. No new manual test was written.
06 · Commit fix + spec, open PR
The PR contains both the code fix and the spec. The reviewer sees the spec went RED before the fix and GREEN after. The spec becomes the permanent regression gate.

You end with

Bug fixed. Regression test that proves the fix locked in. The same bug cannot reopen without CI catching it.

How the agent detects this: When the spec fails on the first run, the agent recognizes Flow A and walks the diagnose → fix → re-run loop until GREEN.

How the skill decides which flow you're in

Verbatim logic from .claude/skills/cuit-loop/SKILL.md — the agent reads this exact decision tree.

SKILL.md flow detection · bash

# .claude/skills/cuit-loop/SKILL.md auto-detects the flow:

read session
generateSpec(events) -> spec.ts
run spec against the current code

  if PASS  ->  Flow B (baseline lock-in)
                commit spec, open PR. The interaction is now gated.

  if FAIL  ->  Flow A (bug reproduction)
                read expected vs actual,
                identify smallest code change,
                apply fix, re-run, expect GREEN,
                commit fix + spec, open PR.

# One skill. One CLI. Two flows. Both end in a GREEN .spec.ts.
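The decision tree above can be sketched as a small pure function. Only the PASS/FAIL branching comes from the SKILL.md excerpt; `SpecResult`, `FlowPlan`, and `planFlow` are names invented for this illustration, and the real skill is Markdown read by the agent rather than compiled code.

```typescript
// Hypothetical names throughout; only the branching mirrors SKILL.md.
type SpecResult = { passed: boolean; expected?: unknown; actual?: unknown };

type FlowPlan =
  | { flow: "B"; steps: string[] }
  | { flow: "A"; steps: string[]; diagnosis: string };

function planFlow(firstRun: SpecResult): FlowPlan {
  if (firstRun.passed) {
    // Flow B: the interaction already works, so the spec is a baseline.
    return { flow: "B", steps: ["commit spec", "open PR"] };
  }
  // Flow A: the spec reproduced a bug; the failure drives the diagnosis.
  return {
    flow: "A",
    diagnosis: `expected=${String(firstRun.expected)}, actual=${String(firstRun.actual)}`,
    steps: [
      "identify smallest code change",
      "apply fix",
      "re-run, expect GREEN",
      "commit fix + spec",
      "open PR",
    ],
  };
}

const baselinePlan = planFlow({ passed: true });
const bugPlan = planFlow({ passed: false, expected: 100, actual: 25 });
```

The point of the sketch is that the first run of the generated spec is the only input needed to pick a flow; nothing about the recording itself has to be labeled as "bug" or "baseline".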

Alpha — Chrome extension shipping today

Download the recorder.
Feed your coding agent real data.

10 KB. Chrome MV3. No account, no signup, no telemetry. Captures pointer events with semantic selectors, window.__cuitDebug state snapshots, and every console.log / warn / error and uncaught exception that fires during the session — all in one JSON blob. Drop the blob into Claude Code and the bundled /cuit-loop skill closes the loop: spec generated, spec run, console-error assertion included.

alpha · v0.1.0-alpha.1 · MV3 · 10 KB · no telemetry · local only

cuit-recorder-alpha.zip

Unzipped extension ready for chrome://extensions → Load unpacked. Source is in the repo — review every line before you install if you want.

INSTALL.txt · Source on GitHub ↗

Download
Click the download button below to get cuit-recorder-alpha.zip (about 10 KB).
Unzip
Unzip anywhere. You’ll see a folder with manifest.json, content.js, popup.html, and four icon PNGs.
Load unpacked in Chrome
Open chrome://extensions, toggle Developer mode (top-right), click Load unpacked, and select the unzipped folder. Pin the extension from the puzzle-piece menu.
Wire window.__cuitDebug in your app
In a useEffect, set window.__cuitDebug = { getState: () => yourReduxOrZustandOrWhatever }. The recorder reads from this hook to capture state snapshots.
Record. Stop. Copy. Drop into Claude Code.
Click the pin, hit Start, reproduce the bug, hit Stop, click Copy JSON. Paste into your agent and invoke /cuit-loop. The skill walks the closed loop from observe to GREEN.
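For the wiring step above, a minimal sketch of mounting the hook might look like the following. The `{ getState }` shape comes from the install steps; `mountCuitDebug`, `DebugHost`, and the demo store are names made up for this example, and a plain object stands in for `window` so the snippet runs outside a browser.

```typescript
// Sketch only: the { getState } shape comes from the install steps;
// mountCuitDebug and DebugHost are invented for this example.
interface CuitDebug<S> {
  getState: () => S;
}

interface DebugHost {
  __cuitDebug?: CuitDebug<unknown>;
}

// Attaches the hook and returns a cleanup function, which is the shape
// a React useEffect callback is expected to return.
function mountCuitDebug<S>(host: DebugHost, getState: () => S): () => void {
  host.__cuitDebug = { getState };
  return () => {
    delete host.__cuitDebug;
  };
}

// Stand-in for a store; in an app this would be Redux, Zustand, etc.
const demoStore = { segments: [{ x: 0 }, { x: 120 }] };

// A plain object stands in for `window` here.
const fakeWindow: DebugHost = {};
const unmount = mountCuitDebug(fakeWindow, () => demoStore);
const snapshotWhileMounted = fakeWindow.__cuitDebug?.getState();
unmount();
const hookAfterUnmount = fakeWindow.__cuitDebug;
```

In a React component the same call would sit inside a `useEffect`, with `window` as the host and the returned function as the effect's cleanup, so the hook disappears when the component unmounts.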

what gets captured · real recorder output · ~6 KB for a 13-event session

cuit-session-demo-collision-001.json · json

{
  "sessionId": "demo-collision-001",
  "vendor": "cuit",
  "createdAt": 1748952000123,
  "url": "http://localhost:5173/",
  "events": [
    { "seq": 0, "type": "nav",   "url": "http://localhost:5173/" },
    { "seq": 1, "type": "state-snapshot", "path": "segments[0].x",  "value": 0 },
    { "seq": 5, "type": "state-snapshot", "path": "segments[1].x",  "value": 120 },
    { "seq": 7, "type": "pointer", "phase": "down",
      "targetName": "seg-0", "x": 40,  "y": 32, "pointerId": 1 },
    { "seq": 8,  "type": "pointer", "phase": "move",
      "targetName": "seg-0", "x": 65,  "y": 32, "pointerId": 1 },
    { "seq": 9,  "type": "pointer", "phase": "move",
      "targetName": "seg-0", "x": 90,  "y": 32, "pointerId": 1 },
    { "seq": 10, "type": "pointer", "phase": "move",
      "targetName": "seg-0", "x": 115, "y": 32, "pointerId": 1 },
    { "seq": 11, "type": "pointer", "phase": "move",
      "targetName": "seg-0", "x": 140, "y": 32, "pointerId": 1 },
    { "seq": 12, "type": "pointer", "phase": "up",
      "targetName": "seg-0", "x": 140, "y": 32, "pointerId": 1 },
    { "seq": 13, "type": "state-snapshot", "path": "segments[0].x",  "value": 25 },
    { "seq": 14, "type": "console", "level": "error",
      "message": "Cannot read properties of undefined (reading 'x')",
      "stack": "at onPointerMove (App.tsx:42:18)" },
    { "seq": 15, "type": "console", "level": "warn",
      "message": "Segment collision detected — clamping to boundary" },
    { "seq": 16, "type": "uncaught-error",
      "message": "ResizeObserver loop completed with undelivered notifications." }
  ]
}

Five event types in one blob: nav for the page URL, pointer for interactions (semantic targetName resolved from data-testid / data-cuit-id), state-snapshot for the before/after of __cuitDebug.getState(), console for every log / warn / error with stack trace, and uncaught-error for unhandled exceptions. The generated spec asserts expect(consoleLogs.errors).toHaveLength(0) automatically — zero-error CI gate included.
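To make the zero-error gate concrete, here is a hedged sketch of how console events in a session could be grouped before an assertion like the one above runs. `LoggedEvent`, `ConsoleLogs`, and `collectConsoleLogs` are assumptions for illustration; how the real generator builds `consoleLogs` is not shown on this page.

```typescript
// Assumed shapes, trimmed to what this sketch needs.
type LoggedEvent =
  | { seq: number; type: "console"; level: "log" | "warn" | "error"; message: string }
  | { seq: number; type: "uncaught-error"; message: string }
  | { seq: number; type: "nav" | "pointer" | "state-snapshot" };

interface ConsoleLogs {
  errors: string[];
  warnings: string[];
  uncaught: string[];
}

// Groups console output by severity. Whether the real generator counts
// uncaught exceptions under `errors` is not stated, so they are kept apart.
function collectConsoleLogs(events: LoggedEvent[]): ConsoleLogs {
  const logs: ConsoleLogs = { errors: [], warnings: [], uncaught: [] };
  for (const ev of events) {
    if (ev.type === "console" && ev.level === "error") logs.errors.push(ev.message);
    else if (ev.type === "console" && ev.level === "warn") logs.warnings.push(ev.message);
    else if (ev.type === "uncaught-error") logs.uncaught.push(ev.message);
  }
  return logs;
}

// Events 13 to 16 of the sample session above, with messages abbreviated.
const sampleTail: LoggedEvent[] = [
  { seq: 13, type: "state-snapshot" },
  { seq: 14, type: "console", level: "error",
    message: "Cannot read properties of undefined (reading 'x')" },
  { seq: 15, type: "console", level: "warn", message: "Segment collision detected" },
  { seq: 16, type: "uncaught-error",
    message: "ResizeObserver loop completed with undelivered notifications." },
];

const consoleLogs = collectConsoleLogs(sampleTail);
```

On this recording `consoleLogs.errors` holds one entry, so a zero-error assertion would stay RED until the fix removes the `console.error` raised during the drag.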

how an agent uses it · the /cuit-loop Claude Code skill

claude-code · codex · cursor · bash

# Claude Code · Codex · Cursor · Aider
# Drop the session JSON in your repo, then run:

/cuit-loop ./cuit-session-demo-collision-001.json

# The agent reads .claude/skills/cuit-loop/SKILL.md and walks the loop:
#   1. validate the session shape
#   2. generateSpec(events) -> spec.ts grounded in @cuit/harness
#      includes: expect(consoleLogs.errors).toHaveLength(0)
#   3. run the spec -> RED expected (bug reproduced)
#   4. diagnose: actual segments[0].x = 25, expected 100
#              + 1 console.error captured during drag
#   5. propose minimum fix to onPointerMove collision check
#   6. re-run -> GREEN (interaction correct + zero console errors)
#   7. open PR with the spec + the fix
#
# Total: ~30 seconds of agent time for a fix that took your engineer
# 2-6 hours by hand.

The skill lives at .claude/skills/cuit-loop/SKILL.md. Codex and Cursor read the same Markdown via .codexrules / .cursorrules.

We capture what Jam captures — and more

Clicks, network, and console errors — but the output is a regression spec your CI gates, not a replay you watch.

Jam, LogRocket, and Sentry Replay capture console output and network requests behind their own accounts and pricing tiers. CUIT captures the same signal — pointer events, console logs, uncaught errors — and feeds it directly into Claude Code via /cuit-loop. What comes out is not a video you hand to an engineer. It is a .spec.ts that runs green in CI from that point on. Self-instrument in minutes; /cuit-instrument handles the wiring.

console.log / warn / error captured with full stack trace
Uncaught exceptions surfaced in the session blob
expect(consoleLogs.errors).toHaveLength(0) auto-asserted in every generated spec
No account, no SaaS pricing — runs local, open source
Jam / LogRocket give you a replay to watch. CUIT gives Claude Code a spec to gate.

What the agent reports back when the loop closes

Structured output from the /cuit-loop skill — agent-readable, human-readable, PR-pasteable.

loop closed

cuit-loop complete

  session     demo-collision-001 (17 events)
  spec        out/generated.spec.ts (7 primitives)
  red-actual  segments[0].x = 25 (expected 100)
              console.error: "Cannot read properties of undefined (reading 'x')"
  fix         remove over-eager collision check in onPointerMove
              (App.tsx, 4 lines deleted)
  green       ✓ interaction correct after fix
              ✓ expect(consoleLogs.errors).toHaveLength(0) — PASS
  pr          https://github.com/your-org/your-app/pull/4218

Alpha caveat: this is the first public release of the recorder. Expect rough edges. Chrome Web Store submission and Firefox/Safari ports are on the v0.2 roadmap. If you find a bug, file it at github.com/speechlabinc/complex-ui-tester/issues.

What the SaaS adds — the org-wide QA data warehouse

One developer needs a loop.
A team needs a corpus.

The OSS library is enough for one developer running the loop on their own laptop. The moment you have two — and the moment AI coding tools start writing UI changes for the whole team — you need somewhere central for every session to land, every spec to roll up, every QA insight to query. That somewhere is the SaaS.

free, MIT

OSS — your laptop

Run the recorder. Run the harness. Run the spec generator. Commit the spec to your repo. Your loop, your machine, your repo. Perfect for one developer.

What it covers

One developer captures a session and generates a spec — works on day one
Specs land in your own repo — no vendor in the data path
Free forever, MIT licensed, no account

Where it stops

Each session lives only on the machine of the developer who captured it
No cross-developer reuse — Engineer A and Engineer B capture the same bug twice
No history — yesterday's sessions are gone unless someone manually saved them
No queries — can't ask "show me all sessions where drag failed in the last quarter"
No agent memory — Claude/Codex sees one session at a time, never the corpus

Team · Business · Enterprise

SaaS — your org

Every developer's sessions land in one secure, versioned, queryable corpus your whole team — and your AI tools — can tap into.

What it unlocks

Every developer's captures roll up to one central, multi-tenant store
Sessions are versioned against git revisions so you can replay a year-old bug on today's code
Processed into derived data: per-component failure rates, bug-class clusters, generated-spec accept/reject signals
Queryable via dashboard, REST API, or MCP — for humans and agents both
Encrypted at rest with per-tenant KMS keys, SOC 2 Type II posture

Trade-offs

$499/mo Team tier and up — see /pricing
Requires you to install the recorder extension on developer machines

What the corpus unlocks

Every session, one place

Engineer A captures a drag bug Monday. Engineer B reproduces it Tuesday. Both sessions land in the same tenant corpus, deduped, linked to the same issue. No more "did anyone else see this?" Slack threads.

Versioned against your code

Every session is tagged with the git SHA at capture time. Replay yesterday's sessions on today's code to find which deploy introduced a regression. Walk backwards through the corpus to find when a behavior changed.

Processed, not just stored

We extract per-tenant signal from the raw sessions: a selector dictionary of your stable component names, a bug-class corpus of every accepted/rejected spec, per-component flake rates. Your AI tools query the derived data, not the raw bytes.

Queryable QA insights

Ask: "Show me every session in the last 30 days where waveform drag failed." "Which components have the highest reopen rate?" "Cluster bugs by failure mode — top 5." Get answers in the dashboard, via REST API, or via Claude Code through MCP.

Three ways your agent taps in

The corpus is reachable through the Claude Code skill (drop-in), the REST API (CI integrations and batch work), and an MCP server (mid-task investigation without leaving the agent loop). Pick whichever fits.

/cuit-loop · the skill

Claude Code skill (.claude/skills/cuit-loop/SKILL.md)

For a single-session loop. Drop a session, agent walks Flow A or Flow B, opens a PR with the regression spec.

example · bash

/cuit-loop ./cuit-session-2014.json

Read the spec ↗

POST /v1/specs/generate · the REST API

HTTPS REST · Bearer auth · OpenAPI 3.1 spec published

For batch and integration work. CI gates that auto-generate specs from PR-attached sessions; webhook integrations; team-internal tooling that pulls from the corpus.

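example · bash

curl -X POST https://api.cuit.dev/v1/specs/generate -H "Authorization: Bearer $CUIT_API_KEY" \
  -H "Content-Type: application/json" -d @session.json  # CUIT_API_KEY: assumed env var name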

Read the spec ↗

cuit-mcp · the MCP server

Model Context Protocol server · exposes 8 tools to any MCP client

For deep investigation. Claude Code, Cursor, Aider can query the corpus mid-task: similar bugs, flake rates, per-component history — without leaving the agent loop.

example · bash

mcp__cuit__query_sessions({ predicate: "type=drag-fail AND ts>30d" })

Read the spec ↗

Same question, three surfaces

Ask the corpus in natural language inside Claude Code, in MCP tool calls, or via the REST API. Same answer, different shape.

“Show me every session in the last 30 days where waveform drag failed.”

via MCP (Claude Code, Cursor)

mcp tool call · typescript

mcp__cuit__query_sessions({
  tenant: 'acme-corp',
  predicate: "interaction='drag' AND outcome='red' AND ts > now-30d",
  limit: 50
})

via REST API

HTTPS · bash

GET /v1/sessions?interaction=drag&outcome=red&since=30d
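For CI scripts that call this endpoint, the query string can be assembled from a typed filter object. This is a sketch under assumptions: the base URL is borrowed from the curl example earlier on this page, the three filter names are copied from the GET request above, and the `"green"` outcome value and the `buildSessionsUrl` helper are hypothetical.

```typescript
// Filter names copied from the GET example; "green" is an assumed value.
interface SessionQuery {
  interaction?: string;
  outcome?: "red" | "green";
  since?: string;
}

function buildSessionsUrl(baseUrl: string, query: SessionQuery): string {
  const pairs: string[] = [];
  // Keep a fixed key order so the resulting URL is stable across runs.
  const keys: (keyof SessionQuery)[] = ["interaction", "outcome", "since"];
  for (const key of keys) {
    const value = query[key];
    if (value !== undefined) {
      pairs.push(`${encodeURIComponent(key)}=${encodeURIComponent(value)}`);
    }
  }
  const suffix = pairs.length > 0 ? `?${pairs.join("&")}` : "";
  return `${baseUrl}/v1/sessions${suffix}`;
}

const dragFailuresUrl = buildSessionsUrl("https://api.cuit.dev", {
  interaction: "drag",
  outcome: "red",
  since: "30d",
});
```

A real client would also send the Bearer token the REST API section mentions; that part is left out here because the header format is not documented on this page.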

“Which UI components have the highest reopen rate this quarter?”

via MCP (Claude Code, Cursor)

mcp tool call · typescript

mcp__cuit__bug_class_distribution({
  groupBy: 'component',
  metric: 'reopen_rate',
  since: 'q-current'
})

via REST API

HTTPS · bash

GET /v1/insights/bug-classes?groupBy=component&metric=reopen_rate&since=q-current

“Find sessions similar to this one: has anyone already filed this bug?”

via MCP (Claude Code, Cursor)

mcp tool call · typescript

```
mcp__cuit__find_similar_sessions({
  reference: sessionId,
  threshold: 0.85
})
```

via REST API

HTTPS · bash

```
POST /v1/sessions/${id}/similar?threshold=0.85
```

Full data model, security posture (encryption, per-tenant KMS keys, SOC 2 posture), and the complete REST + MCP surface live in docs/12-qa-data-warehouse.md.

Claude Code-native · MCP server + skills · zero curl required

The only testing tool
built for Claude Code & Codex.

Every other approach hands you a REST endpoint and says good luck. CUIT ships an MCP server and two Claude Code skills — the feedback loop lives inside your coding session, not a separate dashboard. And when the model changes, the loop stays: the substrate is deterministic specs, not LLM calls in CI.

Key differentiator

Feature

CUIT

Mabl

Octomind

Playwright (DIY)

claude-in-chrome

Agentic-tool-native (Claude Code, Cursor, MCP)

Ships an MCP server + skills so Claude Code can generate and run specs without leaving the conversation — no curl, no REST.

Yes

unless you build it

Partial

Chrome plugin only

Playwright tests written by hand

The standard. A human reads the bug, writes the spec, lands the PR.

Where it works

Battle-tested, predictable execution model
Source-controllable, diff-reviewable
Free, open source

Where it fails

Pixel coordinates flake when CSS changes — 5–15% CI flake on complex UIs
2–6 hours of engineering time per spec, so most teams skip writing them
No path from a recorded user session to a working spec
Manual selector authoring drifts when components rename

CUIT

We emit Playwright/Vitest specs automatically from real sessions, grounded in semantic targets (data-segment-id, data-testid) and harness primitives — no pixel coordinates, no manual authoring.

Claude Code / Codex writes the Playwright test for you

Tell the agent: "write a Playwright test for this bug." It writes one.

Where it works

Faster than hand-writing
Source-controllable output
AI fluency with Playwright API is solid

Where it fails

The agent has no recording — it guesses selectors and clientX/Y
Same pixel-flake problem as hand-written tests
No deterministic state-snapshot — agent infers expected state from DOM, often wrong
No feedback loop — agent writes the test, can't verify it caught the actual bug

CUIT

We give the agent a deterministic input (recorded SessionEvent[]) and a deterministic execution model (harness primitives) — so the spec it generates actually catches the bug and stays stable across CSS changes.

Agent driving a real browser (claude-in-chrome MCP, etc.)

The agent itself opens Chrome, clicks around, reads the page.

Where it works

No human in the loop for exploration
Can interact with arbitrary pages
Useful for one-off "is this UI broken?" questions

Where it fails

Non-deterministic — same prompt produces different action sequences
No artifact left behind — nothing gates CI on future PRs
Token-expensive — every interaction is an LLM call
Slow — wall-clock minutes per interaction sequence
Can't reliably reproduce a complex multi-step user bug

CUIT

Our loop is deterministic. The output is a .spec.ts file that runs in your existing CI without an LLM in the hot path. The agent runs once at generation time; from then on, the regression test is free.

Session replay vendors (Jam, LogRocket, Sentry Replay, FullStory)

Customer files a bug; vendor sends you a video and a DOM event timeline.

Where it works

Captures the full user session out of the box
Mature integrations, account-driven
Useful for triage and root-cause analysis

Where it fails

No regression test output — the replay is a watching artifact, not a CI gate
Vendor lock-in and per-seat / per-session pricing
No semantic selectors — replays use pixel coords or best-effort CSS
No state-snapshot — vendors don't know about window.__cuitDebug
Manual translation needed: engineer watches the replay, writes the spec by hand

CUIT

We adapt their replays (see docs/10) and ship a first-party Chrome extension that captures semantic + state data they structurally cannot. Then we generate the spec — no human translation.

Screenshot diff testing (Percy, Chromatic, Argos)

Take a screenshot, diff pixels, fail if anything changes.

Where it works

Catches visual regressions humans would miss
Mature CI integrations
Component-library workflows are well-supported

Where it fails

Pixel-noisy — anti-aliasing, font hinting, browser version differences create false positives
Says nothing about behavior — segments could overlap visually but state could be correct, or vice versa
Doesn't capture interactions — only the final rendered state
High-maintenance: every intentional design change needs baseline re-approval

CUIT

We test behavior, not pixels. Our specs assert against the host app's state model — segments[0].x === 100 — not against rendered pixels. Pixel-diff tools and our tool are complementary, not competing.
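A state-model assertion of that kind can be sketched in a few lines. `readPath` and `checkState` below are illustrative names, not the @cuit/harness API; the path syntax and the 25 / 100 values are taken from the demo output on this page.

```typescript
// Minimal sketch: readPath and checkState are illustrative names,
// not the @cuit/harness API.
function readPath(state: unknown, path: string): unknown {
  // "segments[0].x" becomes ["segments", "0", "x"]
  const keys = path
    .replace(/\[(\d+)\]/g, ".$1")
    .split(".")
    .filter((k) => k.length > 0);
  let current: unknown = state;
  for (const key of keys) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

interface StateCheck {
  path: string;
  expected: unknown;
  actual: unknown;
  pass: boolean;
}

function checkState(state: unknown, path: string, expected: unknown): StateCheck {
  const actual = readPath(state, path);
  return { path, expected, actual, pass: actual === expected };
}

// Values from the demo log: the buggy app leaves x at 25, the fix moves it to 100.
const redCheck = checkState({ segments: [{ x: 25 }, { x: 120 }] }, "segments[0].x", 100);
const greenCheck = checkState({ segments: [{ x: 100 }, { x: 120 }] }, "segments[0].x", 100);
```

Because the check reads application state rather than rendered output, anti-aliasing, fonts, and layout shifts cannot change its result; only a change in `segments[0].x` can.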

Where we stand on this: Pixel-diff and session-replay tools solve adjacent problems and are worth keeping. We're specifically the test-generation + deterministic-execution edge that the others structurally cannot deliver. If you already use one of the above, you can keep using it — we don't replace your visual-regression pipeline or your session-replay vendor. We give you the regression spec they can't.

The problem

Teams shipping complex UIs are on a treadmill.

Waveform editors, video tools, design tools, dashboards with reorderable rows. The existing tooling stack (Playwright, Cypress, screenshot diff, session replay) produces flaky tests, misses canvas regressions, and offers no automated path from a recorded user session to a deterministic regression test.

BEFORE CUIT

What teams do today: 6 reopened bugs in 60 days

Reopen-after-fix loop

6/14 bugs reopened in 60 days (43%). No regression net specific to your visual bug class.

0.5–2 eng-days per reopen

boundingBox() flakes

Pixel coordinates depend on viewport, CSS, browser engine. Tests fight rAF non-determinism with sleeps.

5–15% CI flake rate

Session → spec translation

Engineer eyeballs the replay, hand-writes a test. Most teams skip it entirely.

2–6 hours per bug

Canvas / animation blindness

Pixel screenshot diff hides sub-pixel and opacity glitches. Bugs reach prod.

Surface via support tickets

WITH CUIT

Branch B: 8 bugs locked in, 0% flake, 3 browsers verified

Spec is a CI gate forever

Generated spec runs on every PR. Reintroduce the bug six months later — CI catches it before merge.

0% reopen on locked specs

Harness primitives, not pixels

dispatchDrag, setClock, getStateSnapshot — no coordinates, no sleeps, no layout dependency.

0% flake on 9 new specs

Session → spec in minutes

Jam session URL lands in Slack. 8 minutes later, a PR is open with a grounded Playwright spec.

<8 min median

Three-browser verification

All generated specs are dry-run on Chromium, Firefox, and WebKit before the PR opens.

3 browsers, 1 command
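The "no sleeps" claim in the harness-primitives card rests on controlling time instead of waiting for it. The sketch below shows the general idea with a manually advanced frame clock. `ManualClock` is an invented name and the real `setClock` primitive may work differently; this only illustrates why a controlled clock removes timing flake.

```typescript
// Sketch of a manually advanced frame clock; not the harness implementation.
type FrameCallback = (timeMs: number) => void;

class ManualClock {
  private nowMs = 0;
  private queue: FrameCallback[] = [];

  // Stand-in for requestAnimationFrame: callbacks wait for the next frame.
  requestFrame(cb: FrameCallback): void {
    this.queue.push(cb);
  }

  // Advances time in fixed frames, running queued callbacks on each one.
  advance(totalMs: number, frameMs = 16): void {
    const end = this.nowMs + totalMs;
    while (this.nowMs + frameMs <= end) {
      this.nowMs += frameMs;
      const due = this.queue;
      this.queue = [];
      for (const cb of due) cb(this.nowMs);
    }
  }
}

// A toy animation that moves 10 units per frame until it reaches 100.
const clock = new ManualClock();
let positionX = 0;
const step: FrameCallback = () => {
  if (positionX < 100) {
    positionX += 10;
    clock.requestFrame(step);
  }
};
clock.requestFrame(step);
clock.advance(160); // ten 16 ms frames, with no wall-clock waiting
```

The test decides how many frames elapse, so the animation ends in the same state on every run and on every machine, which is what a `waitForTimeout` sleep cannot guarantee.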

How it works

Three Claude Code commands. One permanent CI gate.

The full loop — from filed bug to merged regression spec — runs inside Claude Code. MCP tool, skill, done. Median time under 8 minutes. After that, the gate costs nothing to maintain.

Drop in the MCP server

One-time setup

Add CUIT to Claude Code in one edit. Paste the server block into ~/.claude/mcp_servers.json and you're wired — no SDK changes, no new instrumentation, nothing your users ever see.

mcp_servers.json · json

{
  "mcpServers": {
    "cuit": {
      "command": "npx",
      "args": ["-y", "@cuit/mcp-server"]
    }
  }
}

Save this as ~/.claude/mcp_servers.json. Claude Code picks it up on next launch; verify with /mcp.

Run /cuit-instrument in your repo

< 60 seconds

Open Claude Code, type /cuit-instrument, and the skill auto-detects your framework and state library, mounts window.__cuitDebug, installs the recorder bridge, and sets up the GitHub Action. Compress what used to be a day of wiring into under a minute.

cuit-instrument.sh · bash

# In Claude Code — just type the skill name:

> /cuit-instrument

  ✔ Detected: Next.js 14 + Zustand
  ✔ Mounted window.__cuitDebug bridge
  ✔ Installed @cuit/recorder (dev dep)
  ✔ Added .github/workflows/cuit.yml
  ✔ Round-trip test session: PASS

  Ready. Hit a bug and run /cuit-loop.

Hit a bug — type /cuit-loop, watch the gate land

Gate is permanent

When a bug surfaces, run /cuit-loop in Claude Code. It reads the recorded session, generates a grounded Playwright spec, runs it red to prove the bug is real, and opens a PR. Ship the fix in the same PR — the spec goes green and becomes a permanent CI gate. The feedback loop is the substrate: model-invariant, zero maintenance.

cuit-loop.sh · bash

# Bug filed. You type in Claude Code:

> /cuit-loop

  ✔ Session ingested (Jam / LogRocket / Sentry Replay)
  ✔ Spec generated — confidence 0.91, AST grounded
  ✔ Dry-run: RED  (bug reproduced ✓)
  ✔ PR opened: fix + spec in one branch

  After your fix lands:
  cuit/dry-run   ✅ GREEN — gate permanent

Prefer HTTP? REST fallback docs →

90-second walkthrough

From bug-filed to CI-locked, scene by scene.

Click through the 8 scenes — or use ← →. Each step shows a real artifact of the loop, no marketing illustrations.

Scene 01 / 8

The bug appears.

It's the kind of bug that takes 90 seconds to file and 5 hours to reproduce. A user dragged a waveform segment and it didn't move.

[Scene mockup: waveform-editor (your app). Timeline from 00:00 to 01:00 in 15-second ticks, with segments seg-0, seg-1, and seg-2 and a collision marker.]

Segment 0's right edge collides with segment 1's left edge. The drag silently no-ops. No error in the console.

For UI developers — show me the code

Four things that flake on your team today.
How this fixes each one.

Every snippet below is verbatim from the working proof-of-concept — not a mockup. Clone the repo and run pnpm proof:loop to reproduce the output yourself.

problem

Your Playwright tests use page.mouse.click(412, 89)

Pixel coordinates depend on viewport, CSS, browser engine, and last-frame layout. Change padding by 4px and your suite flakes.

today: flaky / manual / brittle

before.ts · typescript

await page.mouse.move(412, 89);
await page.mouse.down();
await page.mouse.move(512, 89, { steps: 10 });
await page.mouse.up();

with CUIT: deterministic / generated / permanent

after.ts · typescript

dispatchDrag('seg-0', 100, 0);
// Targets by stable name. No pixels.
// Same call works in Chromium, Firefox, WebKit.
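One way a name-based drag can stay stable across layouts is to resolve pointer positions from the target's current rectangle at dispatch time. The sketch below illustrates that idea only: `planDrag`, `travel`, and the rectangle values are hypothetical and are not taken from the @cuit/harness source.

```typescript
// Hypothetical sketch, not the @cuit/harness implementation.
interface Rect { x: number; y: number; width: number; height: number }
interface PointerStep { phase: 'down' | 'move' | 'up'; x: number; y: number }

// Plan a drag from the element's current rect instead of fixed pixels:
// press at the rect's center, interpolate `steps` moves, release at center + delta.
function planDrag(rect: Rect, dx: number, dy: number, steps = 10): PointerStep[] {
  const startX = rect.x + rect.width / 2;
  const startY = rect.y + rect.height / 2;
  const plan: PointerStep[] = [{ phase: 'down', x: startX, y: startY }];
  for (let i = 1; i <= steps; i++) {
    plan.push({
      phase: 'move',
      x: startX + (dx * i) / steps,
      y: startY + (dy * i) / steps,
    });
  }
  plan.push({ phase: 'up', x: startX + dx, y: startY + dy });
  return plan;
}

// Net movement of a plan: release position minus press position.
function travel(plan: PointerStep[]): { dx: number; dy: number } {
  const first = plan[0];
  const last = plan[plan.length - 1];
  if (!first || !last) return { dx: 0, dy: 0 };
  return { dx: last.x - first.x, dy: last.y - first.y };
}

// The same logical drag under two layouts (for example after a 4px padding change).
const planBefore = planDrag({ x: 0, y: 20, width: 150, height: 24 }, 100, 0);
const planAfter = planDrag({ x: 4, y: 24, width: 150, height: 24 }, 100, 0);
```

The absolute coordinates differ between the two plans, but the net travel is identical, which is the property a hard-coded `page.mouse.move(412, 89)` cannot offer.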

problem

You sprinkle waitForTimeout(500) because rAF timing is unreliable

Real animations advance on requestAnimationFrame; pixel snapshots and CSS transitions land at the next frame. Sleeps fight non-determinism with prayer.

today: flaky / manual / brittle

before.ts · typescript

await page.waitForTimeout(500);
const box = await el.boundingBox();
// hope the animation finished by now

with CUIT: deterministic / generated / permanent

after.ts · typescript

setClock(1716800000000);
// Deterministic clock. Every rAF callback fires.
// Now state is exactly where the spec says it is.
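For readers wondering what a deterministic clock buys you, here is a minimal sketch of a fake clock that owns its own frame queue. It is an illustration under stated assumptions: `createFakeClock`, `requestFrame`, and the 16ms frame length are hypothetical, and the real `setClock` in @cuit/harness may work differently.

```typescript
// Hypothetical sketch of a deterministic clock; not the @cuit/harness source.
type FrameCallback = (now: number) => void;

function createFakeClock(startMs: number, frameMs = 16) {
  let now = startMs;
  let queue: FrameCallback[] = [];
  return {
    now: (): number => now,
    // Stand-in for requestAnimationFrame: callbacks wait for the next frame.
    requestFrame(cb: FrameCallback): void {
      queue.push(cb);
    },
    // Advance time frame by frame, flushing queued callbacks in order.
    tick(ms: number): void {
      const end = now + ms;
      while (now + frameMs <= end) {
        now += frameMs;
        const due = queue;
        queue = [];
        for (const cb of due) cb(now);
      }
      now = end;
    },
  };
}

// A 160ms animation moving a position from 0 to 100, driven only by the fake clock.
const fakeClock = createFakeClock(1716800000000);
const startedAt = fakeClock.now();
let animatedX = 0;

function animate(now: number): void {
  const progress = Math.min((now - startedAt) / 160, 1);
  animatedX = 100 * progress;
  if (progress < 1) fakeClock.requestFrame(animate);
}

fakeClock.requestFrame(animate);
fakeClock.tick(160); // runs exactly ten 16ms frames, no sleeping involved
```

Because the test controls when frames fire, the final position is known as soon as `tick` returns, with no `waitForTimeout` guesswork.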

problem

You hand-translate a Jam replay into a Playwright spec

2–6 hours per bug. Most teams skip it. You end up with no regression net, and the same bug reopens in 3 weeks.

today: flaky / manual / brittle

before.ts · typescript

// 12-minute Jam replay
// → engineer watches it twice
// → engineer guesses selectors
// → engineer writes 80-line spec
// → engineer realizes selectors broke last week

with CUIT: deterministic / generated / permanent

after.sh · bash

pnpm cuit gen jam:sess-2014 --apply
# Reads the session, emits a spec.ts
# grounded in your harness primitives.
# PR opens. You review the diff.

problem

The same bug keeps reopening every release

You shipped a fix but no regression test. Six weeks later someone refactors the collision code and re-introduces the same bug.

today: flaky / manual / brittle

before.ts · typescript

// One-shot fix. No spec.
// Six weeks later: "user reports drag broken"
// File reopened. Eng-days re-spent.

with CUIT: deterministic / generated / permanent

after.sh · bash

# Generated spec lives in tests/regressions/
# CI runs it on every PR.
# Re-introduce the bug → CI blocks merge.
# The reopened-bug loop is over.

The proof loop, end-to-end

Real artifacts from proof-of-concept/ — copied verbatim. Run pnpm proof:loop to regenerate.

61 tests passing · 0.1s end-to-end

STEP 1 · Input: recorded Jam session

A user files a bug via Jam. The connector pulls 47 normalized events.

fixtures/segment-collision.json · json

{
  "sessionId": "jam-sess-2014",
  "vendor": "jam",
  "url": "http://localhost:5173/",
  "browser": { "name": "chrome", "version": "125.0.0.0", "os": "macOS 14.4" },
  "events": [
    { "seq": 0, "type": "nav", "url": "http://localhost:5173/", "ts": 0 },
    { "seq": 1, "type": "state-snapshot", "path": "segments[0].x", "value": 0 },
    { "seq": 2, "type": "state-snapshot", "path": "segments[1].x", "value": 200 },
    { "seq": 3, "type": "state-snapshot", "path": "segments.length", "value": 2 },
    /* …41 more events: pointerdown / pointermove×N / pointerup / final state-snapshot… */
    { "seq": 45, "type": "pointer", "phase": "up",   "targetName": "seg-0", "x": 240, "y": 32, "pointerId": 1 },
    { "seq": 46, "type": "state-snapshot", "path": "segments[0].x", "value": 0 }
  ]
}
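The fixture suggests a small discriminated union for events, and a rule-based pass over it is enough to recognize a drag. The sketch below infers the event shapes from the fields visible in the fixture. The `detectDrag` helper and the sample pointer-down and pointer-move coordinates are hypothetical, because those events are elided in the excerpt above.

```typescript
// Event shapes inferred from the fixture excerpt; fields not shown there are not modeled.
interface PointerRecord {
  seq: number;
  type: 'pointer';
  phase: 'down' | 'move' | 'up';
  targetName: string;
  x: number;
  y: number;
  pointerId: number;
}
type SessionEvent =
  | { seq: number; type: 'nav'; url: string; ts: number }
  | { seq: number; type: 'state-snapshot'; path: string; value: number }
  | PointerRecord;

interface DragShape { targetName: string; dx: number; dy: number }

// Rule: the first pointer-down and the last matching pointer-up bound one drag;
// the drag delta is the release position minus the press position.
function detectDrag(events: SessionEvent[]): DragShape | null {
  let down: PointerRecord | null = null;
  let up: PointerRecord | null = null;
  for (const e of events) {
    if (e.type !== 'pointer') continue;
    if (e.phase === 'down' && down === null) down = e;
    if (
      e.phase === 'up' &&
      down !== null &&
      e.pointerId === down.pointerId &&
      e.targetName === down.targetName
    ) {
      up = e;
    }
  }
  if (down === null || up === null) return null;
  return { targetName: down.targetName, dx: up.x - down.x, dy: up.y - down.y };
}

// Illustrative events: only the final pointer-up matches the fixture excerpt.
const sampleEvents: SessionEvent[] = [
  { seq: 0, type: 'nav', url: 'http://localhost:5173/', ts: 0 },
  { seq: 1, type: 'state-snapshot', path: 'segments[0].x', value: 0 },
  { seq: 2, type: 'pointer', phase: 'down', targetName: 'seg-0', x: 140, y: 32, pointerId: 1 },
  { seq: 3, type: 'pointer', phase: 'move', targetName: 'seg-0', x: 190, y: 32, pointerId: 1 },
  { seq: 4, type: 'pointer', phase: 'up', targetName: 'seg-0', x: 240, y: 32, pointerId: 1 },
];
const detectedDrag = detectDrag(sampleEvents);
```

A shape like `{ targetName: 'seg-0', dx: 100, dy: 0 }` carries everything a name-based primitive such as `dispatchDrag` needs, without keeping any absolute coordinates.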

STEP 2 · Output: generated Playwright/Vitest spec

18 lines. 6 harness primitives. No pixel coords, no waitForTimeout, no hand-crafted selectors.

out/issue-2014.spec.ts · typescript

import { describe, expect, test } from 'vitest';
import {
  dispatchDrag,
  getStateSnapshot,
  setClock,
} from '@cuit/harness';

describe('issue-2014 — segment 0 drag must not collide-noop', () => {
  test('drags segment 0 right by 100px and asserts state moves', () => {
    setClock(1716800000000);

    dispatchDrag('seg-0', 100, 0);

    const snapshot = getStateSnapshot();
    expect(snapshot['segments[0].x']).toEqual(100);
  });
});
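A rule-based generator can produce a spec of this shape by filling a fixed template. The sketch below is one possible emitter, assuming a hypothetical `SpecPlan` input. The real @cuit/spec-gen output format and option names may differ, and this version quotes strings with `JSON.stringify`, so it emits double quotes rather than the single quotes shown above.

```typescript
// Hypothetical sketch of a template-based emitter; not the @cuit/spec-gen source.
interface SpecPlan {
  title: string;
  testName: string;
  clockMs: number;
  drag: { targetName: string; dx: number; dy: number };
  expectation: { path: string; value: number };
}

function emitSpec(plan: SpecPlan): string {
  const { drag, expectation } = plan;
  const q = (s: string): string => JSON.stringify(s); // safe string quoting
  return [
    "import { describe, expect, test } from 'vitest';",
    "import { dispatchDrag, getStateSnapshot, setClock } from '@cuit/harness';",
    '',
    `describe(${q(plan.title)}, () => {`,
    `  test(${q(plan.testName)}, () => {`,
    `    setClock(${plan.clockMs});`,
    `    dispatchDrag(${q(drag.targetName)}, ${drag.dx}, ${drag.dy});`,
    '    const snapshot = getStateSnapshot();',
    `    expect(snapshot[${q(expectation.path)}]).toEqual(${expectation.value});`,
    '  });',
    '});',
    '',
  ].join('\n');
}

const specSource = emitSpec({
  title: 'issue-2014: segment 0 drag must not collide-noop',
  testName: 'drags segment 0 right by 100px and asserts state moves',
  clockMs: 1716800000000,
  drag: { targetName: 'seg-0', dx: 100, dy: 0 },
  expectation: { path: 'segments[0].x', value: 100 },
});
```

Because the template only ever references harness primitives, the output cannot contain pixel coordinates or sleeps, whatever the recorded session looked like.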

STEP 3 (RED) · Run the spec against the buggy code

The spec reproduces the failure deterministically. RED is the success state: the bug is now caught by an automated test.

App.tsx (buggy) · typescript

// packages/demo-app/src/App.tsx — bug version

// Inside the pointermove handler:
setSegments((prev) => {
  const next = prev.map((s) => ({ ...s }));
  const moving = next[idx];
  const proposedX = drag.originX + dx;

  // BUG: the collision check is too eager — it blocks
  // every move that would even momentarily overlap.
  const collides = next.some((other, j) => {
    if (j === idx) return false;
    return proposedX < other.x + other.width &&
           other.x < proposedX + moving.width;
  });
  if (collides) return prev;          // <-- silently no-op'd

  moving.x = proposedX;
  return next;
});

STEP 4 (GREEN) · Apply the fix, re-run the same spec

Same spec, same harness, fixed code. PASS. The spec is now a permanent CI gate.

App.tsx (fixed) · typescript

// packages/demo-app/src/App.tsx — fix version

// Inside the pointermove handler:
setSegments((prev) => {
  const next = prev.map((s) => ({ ...s }));
  const moving = next[idx];
  const proposedX = drag.originX + dx;

  // FIX: drop the over-eager collision short-circuit.
  // Free positioning; downstream layout handles overlap.

  moving.x = proposedX;
  return next;
});
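The difference between the two handlers is easier to see once the state update is reduced to a pure function. In the sketch below, `moveSegment` and the `eagerCollisionCheck` flag are hypothetical names, and the segment width of 150 is an assumption, since the fixture records only x positions.

```typescript
// Hypothetical reduction of the handlers above to a pure function.
// Segment widths (150) are assumed; the fixture only records x positions.
interface Segment { x: number; width: number }

function moveSegment(
  segments: Segment[],
  idx: number,
  proposedX: number,
  eagerCollisionCheck: boolean,
): Segment[] {
  const next = segments.map((s) => ({ ...s }));
  const moving = next[idx];
  if (!moving) return segments;

  if (eagerCollisionCheck) {
    // Same overlap test as the buggy handler: any overlap blocks the move.
    const collides = next.some(
      (other, j) =>
        j !== idx &&
        proposedX < other.x + other.width &&
        other.x < proposedX + moving.width,
    );
    if (collides) return segments; // silent no-op: the reported bug
  }

  moving.x = proposedX;
  return next;
}

const initialSegments: Segment[] = [
  { x: 0, width: 150 },
  { x: 200, width: 150 },
];
// Buggy mode: 100 + 150 overlaps segment 1 at 200, so segment 0 stays put (RED).
const buggyX = moveSegment(initialSegments, 0, 100, true)[0]?.x;
// Fixed mode: the short-circuit is gone, so segment 0 moves to 100 (GREEN).
const fixedX = moveSegment(initialSegments, 0, 100, false)[0]?.x;
```

The same assertion, that segment 0 ends at x = 100, fails in the first mode and passes in the second, which mirrors the RED and GREEN runs described on this page.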

Actual stdout of pnpm proof:loop, copied verbatim from proof-output.log

[1/6] Loading recorded session events from fixtures/segment-collision.json
       -> 47 events normalized into SessionEvent[]
[2/6] Generating spec from session events
       -> wrote out/issue-2014-segment-0-drag-must-not-collide-noop.spec.ts (18 lines, 6 primitives used)
[3/6] Running spec against demo-app (bug-present mode)
       -> FAIL - segment 0 right edge stayed at x=25 (expected 100)
       -> RED - bug reproduced deterministically [SUCCESS]
[4/6] Applying canonical fix (FIX_SEGMENT_COLLISION=1)
       -> re-rendering demo-app with fix flag
[5/6] Running spec against demo-app (fixed mode)
       -> PASS - segment 0 right edge moved to x=100
       -> GREEN - fix verified, regression locked in [SUCCESS]
[6/6] Locking the spec into CI as a gate
       -> wrote .github/workflows/proof-regression.yml

LOOP COMPLETE - RED to GREEN in 0.1s
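Steps 3 and 5 of the log imply a simple acceptance rule: a spec only earns its place as a gate if it fails on the buggy build and passes on the fixed one. A minimal sketch of that rule follows; `judgeLoop` and the verdict names are hypothetical and not taken from the proof-of-concept source.

```typescript
// Hypothetical sketch of the RED -> GREEN acceptance rule; names are illustrative.
type RunResult = 'pass' | 'fail';
type Verdict = 'gate-locked' | 'bug-not-reproduced' | 'fix-not-verified';

function judgeLoop(onBuggyCode: RunResult, onFixedCode: RunResult): Verdict {
  // A spec that passes on the buggy build proves nothing about the bug.
  if (onBuggyCode === 'pass') return 'bug-not-reproduced';
  // A spec that still fails after the fix means the fix is incomplete.
  if (onFixedCode === 'fail') return 'fix-not-verified';
  // RED then GREEN: the spec is safe to commit as a regression gate.
  return 'gate-locked';
}

const loopVerdict = judgeLoop('fail', 'pass');
```

Treating RED as a required precondition is what separates a regression gate from a test that merely happens to pass.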

Try it. Don't take our word.

Four shell commands. Node 20. About thirty seconds of install. The same six steps of stdout you see above will print on your machine: RED at step 3, GREEN at step 5, exit 0.

→ Read the source: every primitive in the spec is a real exported function from @cuit/harness.
→ Inspect the tests: 61 unit tests across 5 packages, TDD-first, all green.
→ Wire it into your repo: the same primitives work in any React/Vue app that exposes a state-snapshot hook.

run on your machine · bash

# 1. Clone the repo
git clone git@github.com:speechlabinc/complex-ui-tester.git
cd complex-ui-tester/proof-of-concept

# 2. Install (Node 20 + pnpm)
pnpm install

# 3. Run the loop end-to-end
pnpm proof:loop

# 4. Run the package tests (61 tests across 5 packages)
pnpm test

expected output: RED to GREEN in 0.1s · exit 0

Evidence — Branch B

This already works in production.

The 6-layer harness shipped in PR #1995 on SpeechLab's waveform editor. Every claim in this section is grounded in that PR — open the link to verify.

historical bugs locked in across the waveform editor · PR #1995 ↗

9 specs + 37 tests, all GREEN on Chromium, Firefox, WebKit

harness bug self-caught: dispatchDrag off-by-seg.x · PR #1995 ↗

The 8 bugs we locked in

Every spec ships on PR #1995. All RED before fix, GREEN after, with zero flakes since.

View the PR ↗

Meta-evidence: the harness itself was caught by its own loop. Issue #1967 (dispatchDrag off-by-seg.x for segment 0) was discovered when a generated spec consistently went RED on code we believed was correct. If the loop catches its own bugs, it catches yours.

Maturity ladder — no surprises

Here's exactly what runs today.

We'd rather you trust a precise status than be surprised by what's not there yet. Every item in Shipping Now runs on your machine from the public repo. Items in In Progress are under active development — reach out if you want early access. Items in Not Yet are on the roadmap but we're not shipping until they're production-grade.

Shipping Now

v0.x — OSS

Runs on your machine today. Pull the repo and go.

Deterministic harness: rule-based spec-gen, zero LLM required
Generalized spec-gen: drag, click, and text-input shapes
Real spec execution via primitive-exec
Recorder with console + error capture: Chrome extension, first-party
Local MCP shim: OSS runs fully standalone, no cloud dependency
2 adapters: Jam + CUIT
/cuit-loop + /cuit-instrument Claude Code skills
AX envelopes + step-back debug primitives

In Progress

private pilot

Private pilots underway. Not yet generally available.

Hosted SaaS data warehouse: private pilot on Fly + Neon, reach out to join
LLM 3-pass spec-gen: rule-based is the default today; LLM pass is additive
Additional interaction shapes: hover, focus, keyboard nav, drag-to-resize
Self-healing selectors: resilient to minor DOM changes without re-recording

Not Yet

roadmap

On the roadmap. We won't ship until it's production-grade.

AWS production infrastructure: currently Fly; AWS migration is a deliberate later step
General step-extractor for arbitrary interactions: beyond the current shape set
SOC 2 Type II report: audit begins when the SaaS exits private pilot

Why publish this? A tool that catches UI regressions should itself be honest about what it catches. Overpromising is the bug we're fixing in your codebase — we won't ship it in ours.

Pricing

You pay for the corpus that compounds, not for specs.

The feedback loop is the durable asset — model-invariant, version-invariant, and owned by your team. Free spec generation and unlimited CI runs, always. The paid tiers cover the warehouse, the similarity search, and the agent's institutional memory that makes every future spec smarter than the last.

Start in Claude Code: connect the MCP server, run the /cuit-loop skill, and capture your first session in minutes — no curl, no dashboard, no sprint-planning required.

OSS

$0 · Free forever, MIT licensed

Use the harness in your own repo. No SaaS, no account, no telemetry.

Unlimited spec generation. Unlimited runs. Always.

@cuit/harness — all 6 layers
dispatchDrag, dispatchResize, seekTo
State snapshot via getStateSnapshot()
Deterministic clock (setClock / tick)
DOM mutation + CSS observer invariants
Vitest + Playwright adapter

Team

$499 / month · unlimited specs, UI Intelligence Chat included

Close the loop for your AI coding tools. Unlimited recorded sessions → unlimited generated specs → unlimited regression gates. Plus a natural-language chat over the whole QA corpus.

You're paying for the warehouse + similarity search + the agent's institutional memory — not for tests.

Everything in OSS
Unlimited spec generation — no cap, no overage
UI Intelligence Chat — query the QA corpus in plain English
First-party Chrome recorder (no vendor lock-in)
Selector dictionary (unlimited entries on Team and above)
Confidence scoring + auto-PR at 0.75+

Business

$2,500 / month · unlimited specs, full connector coverage, SOC 2

Everything in Team, plus all five session sources, the SOC 2 report procurement wants, and the audit log your SIEM expects.

Everything in Team
All 5 connectors (+ FullStory, Datadog RUM)
Bug-class corpus training (custom to your UI)
SOC 2 report on request (listed under Not Yet above; available once the audit completes)
Audit log export (S3 / BigQuery)
Up to 25 seats

See full pricing — including Enterprise

FAQ

Questions worth asking.

Every answer links to the design doc that goes deeper. If you have a question we should add, file a docs issue.

The harness library — every primitive in Layers 1–6, the React and Vue adapters, the Playwright runner integration — is MIT-licensed with no usage gating. You can audit the code, fork it, and ship it without ever touching our SaaS. The SaaS sits on top: it provides LLM inference, session connectors, multi-tenant cost accounting, and SOC 2 audit logging. Those are the parts you pay for, not the library.
See doc 01 ↗

The loop is the product

Close the loop Claude Code can't close alone.

Get a free token, connect the CUIT MCP server in Claude Code (or Codex), capture a session, and run /cuit-loop. Ten seconds. No human in the loop. Your first verified regression gate lands as a PR.

Browse the OSS harness

Free token. No credit card. MIT-licensed harness. Yours to fork.
