The deterministic feedback edge for agentic UI engineering

Your AI tool ships UI changes. Make the regression spec land with it.

Yes, Playwright works — when a human writes it. Yes, Claude Code can drive a Chrome extension — non-deterministically. Yes, Jam and LogRocket record sessions — without producing a regression test. CUIT captures the interaction in a 10 KB Chrome extension, generates a Playwright spec grounded in semantic events (not pixel coords), and locks it in as a CI gate. Two flows — lock in a baseline, or reproduce a bug. Closed loop in 0.18s.

0.18s · closed loop, end-to-end
73 / 73 · package tests passing
0% · CI flake rate
Chrome ext · first-party recorder shipping

For Claude Code · Codex · Cursor · any agentic coding model

Agentic coding can write UIs.
It can't verify them.

Today the loop ends at "here's the diff". There's no deterministic feedback signal — no way for the model to see that the UI it wrote actually behaves correctly when a user drags a segment, scrubs a playhead, reorders a row. CUIT closes that loop. Observe, propose, verify, gate. End-to-end, in 0.18s.

  1. Observe

    Recorder Chrome extension captures pointer events, semantic selectors, and window.__cuitDebug state snapshots into one JSON blob.

  2. Propose

    @cuit/spec-gen turns the session into a Playwright/Vitest spec grounded in @cuit/harness primitives — no pixel coords, no waitForTimeout.

  3. Verify

    Run the spec against the unfixed code → RED (bug deterministically reproduced). Apply the fix → GREEN. The agent sees both signals.

  4. Gate

    The generated spec becomes a permanent CI regression gate. Re-introduce the bug six months later → CI blocks the merge.
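The RED/GREEN signal in the Verify step comes down to reading one path out of the app state and comparing it with the recorded expectation. The sketch below illustrates that idea only; `readPath` and `verifyPath` are hypothetical names and are not the actual @cuit/harness API.

```typescript
// Hypothetical helpers for illustration; not the real @cuit/harness API.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

/** Resolve a snapshot path such as "segments[0].x" against a state object. */
function readPath(state: Json, path: string): Json | undefined {
  // Turn "segments[0].x" into ["segments", "0", "x"].
  const keys = path.replace(/\[(\d+)\]/g, ".$1").split(".").filter((k) => k.length > 0);
  let current: Json | undefined = state;
  for (const key of keys) {
    if (current === null || current === undefined || typeof current !== "object") return undefined;
    current = Array.isArray(current) ? current[Number(key)] : current[key];
  }
  return current;
}

interface Verdict {
  signal: "RED" | "GREEN";
  message: string;
}

/** Compare the live state against the expected value for a single path. */
function verifyPath(state: Json, path: string, expected: Json): Verdict {
  const actual = readPath(state, path);
  const same = JSON.stringify(actual) === JSON.stringify(expected);
  return {
    signal: same ? "GREEN" : "RED",
    message: `${path}: expected=${JSON.stringify(expected)}, actual=${JSON.stringify(actual)}`,
  };
}

// Values taken from the demo: the buggy app leaves segments[0].x at 25.
const buggy = verifyPath({ segments: [{ x: 25 }, { x: 120 }] }, "segments[0].x", 100);
const fixed = verifyPath({ segments: [{ x: 100 }, { x: 120 }] }, "segments[0].x", 100);
console.log(buggy.signal, buggy.message);
console.log(fixed.signal, fixed.message);
```

Because the verdict carries both the expected and the actual value, an agent reading the RED output has a concrete number to diagnose against rather than a bare failure.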

The closed loop, end-to-end

stdout of pnpm proof:agent-loop — copied verbatim from agent-loop-output.log. Real recorder. Real spec-gen. Real RED → fix → GREEN.

closed · 0.18s end-to-end
[step 1/5] Capture — recorder runs while a developer reproduces the bug
              -> recorder captured 27 events (6 pointer, 20 snapshot)
              -> wrote out/recorded-session.json
[step 2/5] Generate — spec-gen produces a deterministic Playwright/Vitest spec
              -> 6 primitives: goto -> setClock -> getStateSnapshot -> dispatchDrag -> getStateSnapshot -> assertStateEquals
              -> wrote out/agent-loop.spec.ts (18 lines)
[step 3/5] Verify - run the spec against the buggy app
              -> segments[0].x: expected=100, actual=25
              -> RED [bug reproduced - this is the success state]
[step 4/5] Decide - agent reads RED output, identifies the fix
              [agent] observation: segments[0].x stayed at 25 (expected 100)
              [agent] hypothesis : the collision short-circuit in onPointerMove blocked the move
              [agent] action     : enable FIX_SEGMENT_COLLISION=1 (in the SaaS this is a code-change PR; in the PoC it is a flag)
[step 5/5] Verify GREEN - re-run the same spec against the fixed app
              -> segments[0].x: expected=100, actual=100
              -> GREEN [fix verified - regression locked in]

AGENT LOOP CLOSED - capture -> generate -> RED -> agent-fix -> GREEN in 0.18s
browser side: first-party, no vendor account
recorder.ts · typescript
// Browser side — install once. Drop the Chrome extension OR import
// @cuit/recorder directly. Same module, same JSON shape, same downstream.

import { Recorder, cuitDebugProvider } from '@cuit/recorder';

const recorder = new Recorder({
  sessionId: 'rec-001',
  vendor: 'cuit',
  snapshotProvider: cuitDebugProvider,  // reads window.__cuitDebug.getState()
});

recorder.start();
// ... developer reproduces the bug (drag a segment, click a row, etc.) ...
recorder.stop();

const session = recorder.export();
// session is plain JSON. No vendor account. No API key.
// Pass it to Claude Code / Codex with the @cuit/spec-gen import:
//   "Use @cuit/spec-gen to convert this into a Playwright spec, run it,
//    confirm RED on the unfixed code, propose the fix."
agent side: paste into Claude Code · Codex · Cursor
agent-prompt.md · markdown
# Copy this into Claude Code / Codex / Cursor:

I just captured a session reproducing a UI bug. The JSON is attached.

1. Run `@cuit/spec-gen` on the events to produce a Playwright/Vitest spec.
2. Run the spec against the current code. I expect it to fail RED — that
   means the bug is reproduced.
3. Read the failure (expected vs actual). Identify the smallest code change
   that flips the assertion to pass.
4. Apply the fix. Re-run the spec. Confirm GREEN.
5. Open a PR. The same spec becomes the regression gate.

# Why this works: the recorder gave you a deterministic input.
# The harness gives you a deterministic execution model.
# You now have a closed loop — observe, propose, verify — without
# any pixel coordinates, screenshots, or waitForTimeout sleeps.

Two ways to close the loop today.

  • Run the demo agent loop locally. Clone the repo, install, run pnpm proof:agent-loop. The same six lines of stdout above will print on your machine in under a second.
  • Load the Chrome extension. Drop packages/recorder-extension/ into chrome://extensions → load unpacked. Record any page that exposes window.__cuitDebug. Paste the resulting JSON straight into your coding agent.
run it · bash
# Try the recorder against the bundled demo
pnpm install
pnpm proof:agent-loop        # recorder -> spec-gen -> RED -> fix -> GREEN

# Or load the Chrome extension on any page that exposes window.__cuitDebug:
#   chrome://extensions  ->  Developer mode  ->  Load unpacked
#   select: proof-of-concept/packages/recorder-extension/

expected: AGENT LOOP CLOSED in 0.18s · exit 0

Two flows — same recorder, same loop, same final artifact

Lock in what works.
Or reproduce what doesn't.

The recorder doesn't care whether the code is broken when you press Start. Capture a working interaction → the agent recognizes a baseline and gates it. Capture a buggy interaction → the agent recognizes a regression and walks RED → fix → GREEN. The skill detects which flow you're in on the first run of the generated spec.

Flow B · proactive

Lock in a known-good interaction

Code works. Record the interaction. Lock the behavior in before someone refactors it into a bug.

You start with

Your code is on a known-good build. The waveform drag works. The undo stack works. You want to keep it that way forever — without hand-writing the Playwright test that proves it.

  1. Record the working interaction: Click the recorder extension. Reproduce the interaction the way a real user would. Stop.
  2. Generate the spec: @cuit/spec-gen turns the captured events into a deterministic Playwright/Vitest spec grounded in @cuit/harness primitives, with no pixel coords.
  3. Run the spec: Expect GREEN. The interaction works on current code, and the spec confirms it. This is the baseline.
  4. Commit the spec as the baseline: The PR adds the spec to tests/regressions/. Any future PR that breaks the interaction now fails CI before merge.

You end with

A GREEN spec.ts committed to your repo. Future regressions caught automatically — no human had to hand-write a Playwright test.

How the agent detects this: When the spec passes on the first run, the agent recognizes Flow B and commits + opens a PR adding regression coverage — no fix needed.

Flow A · reactive

Reproduce a known bug, fix it, lock in the regression

Bug is on prod. Reproduce it deterministically. Watch RED. Fix the code. Watch GREEN. Lock it in.

You start with

A user files a bug. The bug is real in the current code. You want to reproduce it deterministically, fix the smallest possible thing, and make sure it never reopens.

  1. Record the bug: Click the recorder extension. Reproduce the bug exactly as the user did. Stop. The recording captures the broken final state via window.__cuitDebug.
  2. Generate the spec: Same step as Flow B. @cuit/spec-gen produces the spec. The assertion targets the state where the bug should NOT have happened.
  3. Run the spec, expect RED: RED is the success state. The bug is now caught by a deterministic, semantically grounded automated test. The failure shows expected vs actual, pointing the agent at the diagnosis.
  4. Agent proposes the smallest fix: Claude Code / Codex reads the RED output, identifies the root cause (e.g. over-eager collision check, missing setClock advance), and applies the minimum change.
  5. Re-run the spec, expect GREEN: Same spec. Fixed code. PASS. The fix is verified by exactly the test that proved the bug existed. No new manual test was written.
  6. Commit fix + spec, open PR: The PR contains both the code fix and the spec. The reviewer sees the spec went RED before the fix and GREEN after. The spec becomes the permanent regression gate.

You end with

Bug fixed. Regression test that proves the fix locked in. The same bug cannot reopen without CI catching it.

How the agent detects this: When the spec fails on the first run, the agent recognizes Flow A and walks the diagnose → fix → re-run loop until GREEN.

How the skill decides which flow you're in

Verbatim logic from .claude/skills/cuit-loop/SKILL.md — the agent reads this exact decision tree.

SKILL.md (flow detection) · bash
# .claude/skills/cuit-loop/SKILL.md auto-detects the flow:

read session
generateSpec(events) -> spec.ts
run spec against the current code

  if PASS  ->  Flow B (baseline lock-in)
                commit spec, open PR. The interaction is now gated.

  if FAIL  ->  Flow A (bug reproduction)
                read expected vs actual,
                identify smallest code change,
                apply fix, re-run, expect GREEN,
                commit fix + spec, open PR.

# One skill. One CLI. Two flows. Both end in a GREEN .spec.ts.
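
The decision tree above is small enough to express as a single function. This is a sketch under the assumption that the only input is whether the first run of the generated spec passed; `detectFlow` and `FlowPlan` are illustrative names, not part of the skill.

```typescript
// Illustrative only: mirrors the PASS/FAIL branch described in SKILL.md.
type Flow = "A" | "B";

interface FlowPlan {
  flow: Flow;
  steps: string[];
}

/** Pick the flow from the outcome of the first run of the generated spec. */
function detectFlow(firstRunPassed: boolean): FlowPlan {
  if (firstRunPassed) {
    // PASS on first run: the interaction already works, so lock it in.
    return { flow: "B", steps: ["commit spec", "open PR"] };
  }
  // FAIL on first run: the bug is reproduced, so walk the fix loop.
  return {
    flow: "A",
    steps: [
      "read expected vs actual",
      "identify smallest code change",
      "apply fix",
      "re-run spec, expect GREEN",
      "commit fix + spec",
      "open PR",
    ],
  };
}

const baselinePlan = detectFlow(true);
const bugPlan = detectFlow(false);
console.log(baselinePlan.flow, baselinePlan.steps.length);
console.log(bugPlan.flow, bugPlan.steps.length);
```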

Alpha — Chrome extension shipping today

Download the recorder.
Feed your coding agent real data.

10 KB. Chrome MV3. No account, no signup, no telemetry. Captures pointer events with semantic selectors and window.__cuitDebug state snapshots into one JSON blob. Drop the JSON into Claude Code / Codex / Cursor and the bundled /cuit-loop skill closes the loop for you.

alpha · v0.1.0-alpha.1 · MV3 · 10 KB · no telemetry · local only

cuit-recorder-alpha.zip

Unzipped extension ready for chrome://extensions → Load unpacked. Source is in the repo — review every line before you install if you want.

  1. Download

    Click the download button below to get cuit-recorder-alpha.zip (about 10 KB).

  2. Unzip

    Unzip anywhere. You’ll see a folder with manifest.json, content.js, popup.html, and four icon PNGs.

  3. Load unpacked in Chrome

    Open chrome://extensions, toggle Developer mode (top-right), click Load unpacked, and select the unzipped folder. Pin the extension from the puzzle-piece menu.

  4. Wire window.__cuitDebug in your app

    In a useEffect, set window.__cuitDebug = { getState: () => yourReduxOrZustandOrWhatever }. The recorder reads from this hook to capture state snapshots.

  5. Record. Stop. Copy. Drop into Claude Code.

    Click the pin, hit Start, reproduce the bug, hit Stop, click Copy JSON. Paste into your agent and invoke /cuit-loop. The skill walks the closed loop from observe to GREEN.
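
Step 4 above can be sketched as a small install/uninstall helper. This is one possible shape, assuming the recorder only needs a `getState()` function on `__cuitDebug`; `installCuitDebug` is a hypothetical helper, and a plain object stands in for `window` so the example runs outside a browser.

```typescript
// Hypothetical helper; the only contract taken from the text is
// window.__cuitDebug = { getState: () => state }.
interface CuitDebugHook<S> {
  getState: () => S;
}

/** Attach the debug hook to a window-like object and return a cleanup function. */
function installCuitDebug<S>(target: Record<string, unknown>, getState: () => S): () => void {
  const hook: CuitDebugHook<S> = { getState };
  target["__cuitDebug"] = hook;
  return () => {
    // Only remove the hook if it is still the one this call installed.
    if (target["__cuitDebug"] === hook) delete target["__cuitDebug"];
  };
}

// Stand-in for window and for a store; both are assumptions for the demo.
const fakeWindow: Record<string, unknown> = {};
let demoSegments = [{ x: 0 }, { x: 120 }];
const uninstall = installCuitDebug(fakeWindow, () => ({ segments: demoSegments }));

const installed = fakeWindow["__cuitDebug"] as CuitDebugHook<{ segments: { x: number }[] }>;
const beforeDrag = installed.getState().segments.map((s) => s.x);
demoSegments = [{ x: 25 }, { x: 120 }]; // state changes after an interaction
const afterDrag = installed.getState().segments.map((s) => s.x);
uninstall();
console.log(beforeDrag, afterDrag, "__cuitDebug" in fakeWindow);
```

Returning the cleanup function means the helper can be the body of a React `useEffect`, for example `useEffect(() => installCuitDebug(window as unknown as Record<string, unknown>, () => store.getState()), [])`, where `store` is whatever state container your app uses.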

what gets captured: real recorder output · ~6 KB for a 14-event session
cuit-session-demo-collision-001.json · json
{
  "sessionId": "demo-collision-001",
  "vendor": "cuit",
  "createdAt": 1748952000123,
  "url": "http://localhost:5173/",
  "events": [
    { "seq": 0, "type": "nav",   "url": "http://localhost:5173/" },
    { "seq": 1, "type": "state-snapshot", "path": "segments[0].x",  "value": 0 },
    { "seq": 5, "type": "state-snapshot", "path": "segments[1].x",  "value": 120 },
    { "seq": 7, "type": "pointer", "phase": "down",
      "targetName": "seg-0", "x": 40,  "y": 32, "pointerId": 1 },
    { "seq": 8,  "type": "pointer", "phase": "move",
      "targetName": "seg-0", "x": 65,  "y": 32, "pointerId": 1 },
    { "seq": 9,  "type": "pointer", "phase": "move",
      "targetName": "seg-0", "x": 90,  "y": 32, "pointerId": 1 },
    { "seq": 10, "type": "pointer", "phase": "move",
      "targetName": "seg-0", "x": 115, "y": 32, "pointerId": 1 },
    { "seq": 11, "type": "pointer", "phase": "move",
      "targetName": "seg-0", "x": 140, "y": 32, "pointerId": 1 },
    { "seq": 12, "type": "pointer", "phase": "up",
      "targetName": "seg-0", "x": 140, "y": 32, "pointerId": 1 },
    { "seq": 13, "type": "state-snapshot", "path": "segments[0].x",  "value": 25 }
  ]
}

Three event types — nav for the page URL, pointer for interactions (with the semantic targetName resolved from data-segment-id / data-testid / data-cuit-id), and state-snapshot for the before/after of __cuitDebug.getState(). That's everything @cuit/spec-gen needs.
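
The three event types map naturally onto a TypeScript discriminated union. The field names below are read off the sample session above; the type and function names (`SessionEvent`, `countByType`, `isStrictlyOrdered`) are assumptions for illustration and may not match what @cuit/recorder exports.

```typescript
// Shapes inferred from the sample JSON; names are illustrative assumptions.
type SessionEvent =
  | { seq: number; type: "nav"; url: string }
  | {
      seq: number;
      type: "pointer";
      phase: "down" | "move" | "up";
      targetName: string;
      x: number;
      y: number;
      pointerId: number;
    }
  | { seq: number; type: "state-snapshot"; path: string; value: unknown };

/** Count events per type, e.g. to sanity-check a session before generating a spec. */
function countByType(events: SessionEvent[]): Record<SessionEvent["type"], number> {
  const counts: Record<SessionEvent["type"], number> = { nav: 0, pointer: 0, "state-snapshot": 0 };
  for (const e of events) counts[e.type] += 1;
  return counts;
}

/** True when seq values strictly increase, which a replayable session needs. */
function isStrictlyOrdered(events: SessionEvent[]): boolean {
  return events.every((e, i) => i === 0 || e.seq > events[i - 1]!.seq);
}

// A subset of the sample session shown above.
const sampleEvents: SessionEvent[] = [
  { seq: 0, type: "nav", url: "http://localhost:5173/" },
  { seq: 1, type: "state-snapshot", path: "segments[0].x", value: 0 },
  { seq: 7, type: "pointer", phase: "down", targetName: "seg-0", x: 40, y: 32, pointerId: 1 },
  { seq: 11, type: "pointer", phase: "move", targetName: "seg-0", x: 140, y: 32, pointerId: 1 },
  { seq: 12, type: "pointer", phase: "up", targetName: "seg-0", x: 140, y: 32, pointerId: 1 },
  { seq: 13, type: "state-snapshot", path: "segments[0].x", value: 25 },
];

const sampleCounts = countByType(sampleEvents);
console.log(sampleCounts, isStrictlyOrdered(sampleEvents));
```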

how an agent uses it: the /cuit-loop Claude Code skill
claude-code · codex · cursor · bash
# Claude Code · Codex · Cursor · Aider
# Drop the session JSON in your repo, then run:

/cuit-loop ./cuit-session-demo-collision-001.json

# The agent reads .claude/skills/cuit-loop/SKILL.md and walks the loop:
#   1. validate the session shape
#   2. generateSpec(events) -> spec.ts grounded in @cuit/harness
#   3. run the spec -> RED expected (bug reproduced)
#   4. diagnose: actual segments[0].x = 25, expected 100
#   5. propose minimum fix to onPointerMove collision check
#   6. re-run -> GREEN
#   7. open PR with the spec + the fix
#
# Total: ~30 seconds of agent time for a fix that took your engineer
# 2-6 hours by hand.

The skill lives at .claude/skills/cuit-loop/SKILL.md. Codex and Cursor read the same Markdown via .codexrules / .cursorrules.

What the agent reports back when the loop closes

Structured output from the /cuit-loop skill — agent-readable, human-readable, PR-pasteable.

loop closed
cuit-loop complete

  session     demo-collision-001 (14 events)
  spec        out/generated.spec.ts (6 primitives)
  red-actual  segments[0].x = 25 (expected 100)
  fix         remove over-eager collision check in onPointerMove
              (App.tsx, 4 lines deleted)
  green       ✓ same spec passes after fix
  pr          https://github.com/your-org/your-app/pull/4218

Alpha caveat: this is the first public release of the recorder. Expect rough edges. Chrome Web Store submission and Firefox/Safari ports are on the v0.2 roadmap. If you find a bug, file it at github.com/speechlabinc/complex-ui-tester/issues.

What the SaaS adds — the org-wide QA data warehouse

One developer needs a loop.
A team needs a corpus.

The OSS library is enough for one developer running the loop on their own laptop. The moment you have two — and the moment AI coding tools start writing UI changes for the whole team — you need somewhere central for every session to land, every spec to roll up, every QA insight to query. That somewhere is the SaaS.

free, MIT

OSS — your laptop

Run the recorder. Run the harness. Run the spec generator. Commit the spec to your repo. Your loop, your machine, your repo. Perfect for one developer.

What it covers

  • One developer captures a session and generates a spec — works on day one
  • Specs land in your own repo — no vendor in the data path
  • Free forever, MIT licensed, no account

Where it stops

  • Each session lives only on the machine of the developer who captured it
  • No cross-developer reuse — Engineer A and Engineer B capture the same bug twice
  • No history — yesterday's sessions are gone unless someone manually saved them
  • No queries — can't ask "show me all sessions where drag failed in the last quarter"
  • No agent memory — Claude/Codex sees one session at a time, never the corpus
Team · Business · Enterprise

SaaS — your org

Every developer's sessions land in one secure, versioned, queryable corpus your whole team — and your AI tools — can tap into.

What it unlocks

  • Every developer's captures roll up to one central, multi-tenant store
  • Sessions are versioned against git revisions so you can replay a year-old bug on today's code
  • Processed into derived data: per-component failure rates, bug-class clusters, generated-spec accept/reject signals
  • Queryable via dashboard, REST API, or MCP — for humans and agents both
  • Encrypted at rest with per-tenant KMS keys, SOC 2 Type II posture

Trade-offs

  • $499/mo Team tier and up — see /pricing
  • Requires you to install the recorder extension on developer machines

What the corpus unlocks

Every session, one place

Engineer A captures a drag bug Monday. Engineer B reproduces it Tuesday. Both sessions land in the same tenant corpus, deduped, linked to the same issue. No more "did anyone else see this?" Slack threads.

Versioned against your code

Every session is tagged with the git SHA at capture time. Replay yesterday's sessions on today's code to find which deploy introduced a regression. Walk backwards through the corpus to find when a behavior changed.

Processed, not just stored

We extract per-tenant signal from the raw sessions: a selector dictionary of your stable component names, a bug-class corpus of every accepted/rejected spec, per-component flake rates. Your AI tools query the derived data, not the raw bytes.

Queryable QA insights

Ask: "Show me every session in the last 30 days where waveform drag failed." "Which components have the highest reopen rate?" "Cluster bugs by failure mode — top 5." Get answers in the dashboard, via REST API, or via Claude Code through MCP.

Three ways your agent taps in

The corpus is reachable through the Claude Code skill (drop-in), the REST API (CI integrations and batch work), and an MCP server (mid-task investigation without leaving the agent loop). Pick whichever fits.

/cuit-loop · the skill

Claude Code skill (.claude/skills/cuit-loop/SKILL.md)

For a single-session loop. Drop a session, agent walks Flow A or Flow B, opens a PR with the regression spec.

example · bash
/cuit-loop ./cuit-session-2014.json
Read the spec ↗

POST /v1/specs/generate · the REST API

HTTPS REST · Bearer auth · OpenAPI 3.1 spec published

For batch and integration work. CI gates that auto-generate specs from PR-attached sessions; webhook integrations; team-internal tooling that pulls from the corpus.

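example · bash
# $CUIT_API_TOKEN is a placeholder for your API token
curl -X POST https://api.cuit.dev/v1/specs/generate \
  -H "Authorization: Bearer $CUIT_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d @session.json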
Read the spec ↗

cuit-mcp · the MCP server

Model Context Protocol server · exposes 8 tools to any MCP client

For deep investigation. Claude Code, Cursor, Aider can query the corpus mid-task: similar bugs, flake rates, per-component history — without leaving the agent loop.

example · typescript
mcp__cuit__query_sessions({ predicate: "interaction='drag' AND outcome='red' AND ts > now-30d" })
Read the spec ↗

Same question, three surfaces

Ask the corpus in natural language inside Claude Code, in MCP tool calls, or via the REST API. Same answer, different shape.

  • Show me every session in the last 30 days where waveform drag failed.

    via MCP (Claude Code, Cursor)

    mcp tool call · typescript
    mcp__cuit__query_sessions({
      tenant: 'acme-corp',
      predicate: "interaction='drag' AND outcome='red' AND ts > now-30d",
      limit: 50
    })

    via REST API

    HTTPS · bash
    GET /v1/sessions?interaction=drag&outcome=red&since=30d
  • Which UI components have the highest reopen rate this quarter?

    via MCP (Claude Code, Cursor)

    mcp tool call · typescript
    mcp__cuit__bug_class_distribution({
      groupBy: 'component',
      metric: 'reopen_rate',
      since: 'q-current'
    })

    via REST API

    HTTPS · bash
    GET /v1/insights/bug-classes?groupBy=component&metric=reopen_rate&since=q-current
  • Find sessions similar to this one — anyone already filed this bug?

    via MCP (Claude Code, Cursor)

    mcp tool call · typescript
    mcp__cuit__find_similar_sessions({
      reference: sessionId,
      threshold: 0.85
    })

    via REST API

    HTTPS · bash
    POST /v1/sessions/${id}/similar?threshold=0.85
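
For CI scripts, the REST queries above can be assembled with a small URL builder. This sketch assumes the `https://api.cuit.dev` host from the earlier curl example and Bearer auth as described; `buildSessionsQuery` and `authHeaders` are hypothetical helpers, and no request is actually sent.

```typescript
// Hypothetical helpers; only the path, query parameters, and Bearer scheme
// come from the examples above.

/** Build a GET /v1/sessions URL from a flat set of filters. */
function buildSessionsQuery(baseUrl: string, filters: Record<string, string>): string {
  const query = Object.entries(filters)
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  return `${baseUrl}/v1/sessions${query ? `?${query}` : ""}`;
}

/** Headers for a Bearer-authenticated JSON request. */
function authHeaders(token: string): Record<string, string> {
  return { Authorization: `Bearer ${token}`, Accept: "application/json" };
}

const dragFailuresUrl = buildSessionsQuery("https://api.cuit.dev", {
  interaction: "drag",
  outcome: "red",
  since: "30d",
});
console.log(dragFailuresUrl);
console.log(authHeaders("example-token"));
```

Encoding each key and value keeps filters with spaces or quotes from breaking the query string.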

Full data model, security posture (encryption, per-tenant KMS keys, SOC 2 posture), and the complete REST + MCP surface live in docs/12-qa-data-warehouse.md.

What about Playwright? · claude-in-chrome? · screenshot diff?

All of these work.
Each one fails differently.

Verifying UI is not a solved problem with zero alternatives — five approaches already exist, and they all do something useful. They also all have specific failure modes that bite teams shipping complex UIs daily. Here's the honest comparison.

Playwright tests written by hand

The standard. A human reads the bug, writes the spec, lands the PR.

Where it works

  • Battle-tested, predictable execution model
  • Source-controllable, diff-reviewable
  • Free, open source

Where it fails

  • Pixel coordinates flake when CSS changes — 5–15% CI flake on complex UIs
  • 2–6 hours of engineering time per spec, most teams skip writing them
  • No path from a recorded user session to a working spec
  • Manual selector authoring drifts when components rename
CUIT

We emit Playwright/Vitest specs automatically from real sessions, grounded in semantic targets (data-segment-id, data-testid) and harness primitives — no pixel coordinates, no manual authoring.

Claude Code / Codex writes the Playwright test for you

Tell the agent: "write a Playwright test for this bug." It writes one.

Where it works

  • Faster than hand-writing
  • Source-controllable output
  • AI fluency with Playwright API is solid

Where it fails

  • The agent has no recording — it guesses selectors and clientX/Y
  • Same pixel-flake problem as hand-written tests
  • No deterministic state-snapshot — agent infers expected state from DOM, often wrong
  • No feedback loop — agent writes the test, can't verify it caught the actual bug
CUIT

We give the agent a deterministic input (recorded SessionEvent[]) and a deterministic execution model (harness primitives) — so the spec it generates actually catches the bug and stays stable across CSS changes.

Agent driving a real browser (claude-in-chrome MCP, etc.)

The agent itself opens Chrome, clicks around, reads the page.

Where it works

  • No human in the loop for exploration
  • Can interact with arbitrary pages
  • Useful for one-off "is this UI broken?" questions

Where it fails

  • Non-deterministic — same prompt produces different action sequences
  • No artifact left behind — nothing gates CI on future PRs
  • Token-expensive — every interaction is an LLM call
  • Slow — wall-clock minutes per interaction sequence
  • Can't reliably reproduce a complex multi-step user bug
CUIT

Our loop is deterministic. The output is a .spec.ts file that runs in your existing CI without an LLM in the hot path. The agent runs once at generation time; from then on, the regression test is free.

Session replay vendors (Jam, LogRocket, Sentry Replay, FullStory)

Customer files a bug; vendor sends you a video and a DOM event timeline.

Where it works

  • Captures the full user session out of the box
  • Mature integrations, account-driven
  • Useful for triage and root-cause analysis

Where it fails

  • No regression test output — the replay is a watching artifact, not a CI gate
  • Vendor lock-in and per-seat / per-session pricing
  • No semantic selectors — replays use pixel coords or best-effort CSS
  • No state-snapshot — vendors don't know about window.__cuitDebug
  • Manual translation needed: engineer watches the replay, writes the spec by hand
CUIT

We adapt their replays (see docs/10) and ship a first-party Chrome extension that captures semantic + state data they structurally cannot. Then we generate the spec — no human translation.

Screenshot diff testing (Percy, Chromatic, Argos)

Take a screenshot, diff pixels, fail if anything changes.

Where it works

  • Catches visual regressions humans would miss
  • Mature CI integrations
  • Component-library workflows are well-supported

Where it fails

  • Pixel-noisy — anti-aliasing, font hinting, browser version differences create false positives
  • Says nothing about behavior — segments could overlap visually but state could be correct, or vice versa
  • Doesn't capture interactions — only the final rendered state
  • High-maintenance: every intentional design change needs baseline re-approval
CUIT

We test behavior, not pixels. Our specs assert against the host app's state model — segments[0].x === 100 — not against rendered pixels. Pixel-diff tools and our tool are complementary, not competing.

Where we stand on this: Pixel-diff and session-replay tools solve adjacent problems and are worth keeping. We're specifically the test-generation + deterministic-execution edge that the others structurally cannot deliver. If you already use one of the above, you can keep using it — we don't replace your visual-regression pipeline or your session-replay vendor. We give you the regression spec they can't.

The problem

Teams shipping complex UIs are on a treadmill.

Waveform editors, video tools, design tools, dashboards with reorderable rows. The existing tooling stack (Playwright, Cypress, screenshot diff, session replay) produces flaky tests, misses canvas regressions, and offers no automated path from a recorded user session to a deterministic regression test.

BEFORE CUIT

What teams do today — 6 Reopened bugs in 60 days

Reopen-after-fix loop

6/14 bugs reopened in 60 days (43%). No regression net specific to your visual bug class.

0.5–2 eng-days per reopen

boundingBox() flakes

Pixel coordinates depend on viewport, CSS, browser engine. Tests fight rAF non-determinism with sleeps.

5–15% CI flake rate

Session → spec translation

Engineer eyeballs the replay, hand-writes a test. Most teams skip it entirely.

2–6 hours per bug

Canvas / animation blindness

Pixel screenshot diff hides sub-pixel and opacity glitches. Bugs reach prod.

Surface via support tickets

WITH CUIT

Branch B: 8 bugs locked in, 0% flake, 3 browsers verified

Spec is a CI gate forever

Generated spec runs on every PR. Reintroduce the bug six months later — CI catches it before merge.

0% reopen on locked specs

Harness primitives, not pixels

dispatchDrag, setClock, getStateSnapshot — no coordinates, no sleeps, no layout dependency.

0% flake on 9 new specs

Session → spec in minutes

Jam session URL lands in Slack. 8 minutes later, a PR is open with a grounded Playwright spec.

<8 min median

Three-browser verification

All generated specs are dry-run on Chromium, Firefox, and WebKit before the PR opens.

3 browsers, 1 command

How it works

Three steps. One permanent CI gate.

The full loop from filed bug to merged regression spec takes under 8 minutes on median sessions. After that, the gate is permanent — it costs nothing to maintain.

Record

5 vendors supported

Your users use Jam, LogRocket, Sentry Replay, FullStory, or Datadog RUM as they do today. No SDK changes, no new instrumentation, no behavior change for your users.

no-changes.sh · bash
# No code changes in your app.
# Your users file bugs the same way they always did.

user → Jam "drag didn't work" → session URL in Slack
                    ↓
    CUIT connector picks it up

Generate

< $0.50 all-in

A 3-pass LLM pipeline normalizes the session, grounds selectors against your tenant's selector dictionary and bug-class corpus, then materializes a Playwright spec that calls only validated harness primitives. AST validation enforces it — hallucinations don't compile.
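
The grounding rule can be pictured as an allowlist check. The sketch below is deliberately simpler than real AST validation: it assumes the call names have already been extracted from the generated spec and only checks them against a set. The primitive names come from the agent-loop output earlier on this page; `findUngroundedCalls` is a hypothetical function.

```typescript
// Simplified stand-in for AST validation: a name-level allowlist check.
// Primitive names are taken from the agent-loop stdout on this page.
const HARNESS_PRIMITIVES: ReadonlySet<string> = new Set([
  "goto",
  "setClock",
  "getStateSnapshot",
  "dispatchDrag",
  "assertStateEquals",
]);

/** Return every call name that is not a known harness primitive. */
function findUngroundedCalls(calls: string[], allowed: ReadonlySet<string>): string[] {
  return calls.filter((name) => !allowed.has(name));
}

const groundedSpec = ["goto", "setClock", "getStateSnapshot", "dispatchDrag", "getStateSnapshot", "assertStateEquals"];
const hallucinatedSpec = ["goto", "setClock", "waitForTimeout", "dispatchDrag"];

console.log(findUngroundedCalls(groundedSpec, HARNESS_PRIMITIVES));
console.log(findUngroundedCalls(hallucinatedSpec, HARNESS_PRIMITIVES));
```

A spec with a non-empty result would be rejected before it reaches a PR, which is the effect the AST pass is described as having.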

issue-2014-segment-collision.spec.ts · typescript
// Generated: issue-2014-segment-collision.spec.ts
// Confidence: 0.91 — AST grounded ✓

import { test, expect } from '@playwright/test';
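// getSegment is assumed here to be exported by @cuit/harness with the other primitives.
import { dispatchDrag, getSegment, getStateSnapshot, setClock } from '@cuit/harness';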

test('segment 0 drag — no collision regression', async ({ page }) => {
  await page.goto('/waveform');
  await setClock(page, 0);

  const before = await getStateSnapshot(page);
  expect(before.segments[0].x).toBe(0);

  await dispatchDrag(page, getSegment(page, 'seg-0'), { dx: 100, dy: 0 });

  const after = await getStateSnapshot(page);
  expect(after.segments[0].x).toBe(100);
});

Lock in

Gate is permanent

The GitHub App opens a PR with the generated spec. Dry-run goes RED (proving the bug is reproducible). Engineer reviews, ships the fix in the same PR, dry-run goes GREEN. From that moment, the spec is a CI gate — reintroducing the bug in any future PR fails CI before merge.

ci-output.txt · text
# GitHub Actions — runs on every PR

  cuit/spec-grounded     ✅ PASS
  cuit/dry-run           ✅ GREEN (after fix)
  cuit/confidence        ✅ 0.91 / threshold 0.75

  All checks passed — ready to merge

For UI developers — show me the code

Four things that flake on your team today.
How this fixes each one.

Every snippet below is verbatim from the working proof-of-concept — not a mockup. Clone the repo and run pnpm proof:loop to reproduce the output yourself.

problem

Your Playwright tests use page.mouse.click(412, 89)

Pixel coordinates depend on viewport, CSS, browser engine, and last-frame layout. Change padding by 4px and your suite flakes.

today: flaky / manual / brittle
before.ts · typescript
await page.mouse.move(412, 89);
await page.mouse.down();
await page.mouse.move(512, 89, { steps: 10 });
await page.mouse.up();
with CUIT: deterministic / generated / permanent
after.ts · typescript
dispatchDrag('seg-0', 100, 0);
// Targets by stable name. No pixels.
// Same call works in Chromium, Firefox, WebKit.
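
One way to picture how a recorded pointer sequence becomes a coordinate-free call like the one above: take the first down and the last up for the same pointer and keep only the delta. This is a sketch of the idea, not the actual @cuit/spec-gen algorithm; `toDragCall` and both interfaces are hypothetical.

```typescript
// Illustrative reduction from recorded pointer events to a relative drag.
// Not the real @cuit/spec-gen implementation.
interface PointerRecord {
  phase: "down" | "move" | "up";
  targetName: string;
  x: number;
  y: number;
  pointerId: number;
}

interface DragCall {
  targetName: string;
  dx: number;
  dy: number;
}

/** Collapse a down ... up sequence into a target name plus a relative delta. */
function toDragCall(events: PointerRecord[]): DragCall | undefined {
  const down = events.find((e) => e.phase === "down");
  if (!down) return undefined;
  // Use the last "up" that belongs to the same pointer as the "down".
  const up = [...events].reverse().find((e) => e.phase === "up" && e.pointerId === down.pointerId);
  if (!up) return undefined;
  return { targetName: down.targetName, dx: up.x - down.x, dy: up.y - down.y };
}

// Endpoints from the sample recorder session: down at x=40, up at x=140.
const recordedDrag: PointerRecord[] = [
  { phase: "down", targetName: "seg-0", x: 40, y: 32, pointerId: 1 },
  { phase: "move", targetName: "seg-0", x: 90, y: 32, pointerId: 1 },
  { phase: "up", targetName: "seg-0", x: 140, y: 32, pointerId: 1 },
];

const dragCall = toDragCall(recordedDrag);
console.log(dragCall);
```

Only the target name and the delta survive, so the absolute coordinates that depend on viewport and layout never reach the spec.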
problem

You sprinkle waitForTimeout(500) because rAF timing is unreliable

Real animations advance on requestAnimationFrame; pixel snapshots and CSS transitions land at the next frame. Sleeps fight non-determinism with prayer.

today: flaky / manual / brittle
before.ts · typescript
await page.waitForTimeout(500);
const box = await el.boundingBox();
// hope the animation finished by now
with CUIT: deterministic / generated / permanent
after.ts · typescript
setClock(1716800000000);
// Deterministic clock. Every rAF callback fires.
// Now state is exactly where the spec says it is.
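
The general technique behind a deterministic clock is a fake frame scheduler: callbacks queue up and only fire when the test advances time. The class below is a minimal sketch of that technique and makes no claim about how `setClock` is implemented inside @cuit/harness; `FakeFrameClock` is a hypothetical name.

```typescript
// A minimal fake frame clock. This illustrates the technique only and is
// not the @cuit/harness implementation.
type FrameCallback = (timestampMs: number) => void;

class FakeFrameClock {
  private nowMs: number;
  private queue: FrameCallback[] = [];

  constructor(startMs: number) {
    this.nowMs = startMs;
  }

  /** Queue a callback for the next frame, like requestAnimationFrame. */
  requestAnimationFrame(callback: FrameCallback): void {
    this.queue.push(callback);
  }

  /** Advance a fixed number of frames, firing each queued callback exactly once per frame. */
  advanceFrames(frameCount: number, frameMs = 16): void {
    for (let i = 0; i < frameCount; i++) {
      this.nowMs += frameMs;
      const due = this.queue;
      this.queue = []; // callbacks scheduled during this frame run on the next one
      for (const callback of due) callback(this.nowMs);
    }
  }

  now(): number {
    return this.nowMs;
  }
}

// A toy animation that moves 10px per frame until it reaches 100px.
const frameClock = new FakeFrameClock(1716800000000);
let positionX = 0;
function animateStep(): void {
  positionX += 10;
  if (positionX < 100) frameClock.requestAnimationFrame(animateStep);
}
frameClock.requestAnimationFrame(animateStep);
frameClock.advanceFrames(10);
console.log(positionX, frameClock.now());
```

Because the test decides when frames run, the animation always finishes after exactly ten frames, with no sleep and no dependence on machine speed.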
problem

You hand-translate a Jam replay into a Playwright spec

2–6 hours per bug. Most teams skip it. You end up with no regression net, and the same bug reopens in 3 weeks.

today: flaky / manual / brittle
before.ts · typescript
// 12-minute Jam replay
// → engineer watches it twice
// → engineer guesses selectors
// → engineer writes 80-line spec
// → engineer realizes selectors broke last week
with CUIT: deterministic / generated / permanent
after.sh · bash
pnpm cuit gen jam:sess-2014 --apply
# Reads the session, emits a spec.ts
# grounded in your harness primitives.
# PR opens. You review the diff.
problem

The same bug keeps reopening every release

You shipped a fix but no regression test. Six weeks later someone refactors the collision code and re-introduces the same bug.

today: flaky / manual / brittle
before.ts · typescript
// One-shot fix. No spec.
// Six weeks later: "user reports drag broken"
// File reopened. Eng-days re-spent.
with CUIT: deterministic / generated / permanent
after.ts · typescript
// Generated spec lives in tests/regressions/
// CI runs it on every PR.
// Re-introduce the bug → CI blocks merge.
// The 6-Reopened-bugs loop is over.

The proof loop, end-to-end

Real artifacts from proof-of-concept/ — copied verbatim. Run pnpm proof:loop to regenerate.

61 tests passing · 0.1s end-to-end
STEP 1 · Input: recorded Jam session

A user files a bug via Jam. The connector pulls 47 normalized events.

fixtures/segment-collision.json · json
{
  "sessionId": "jam-sess-2014",
  "vendor": "jam",
  "url": "http://localhost:5173/",
  "browser": { "name": "chrome", "version": "125.0.0.0", "os": "macOS 14.4" },
  "events": [
    { "seq": 0, "type": "nav", "url": "http://localhost:5173/", "ts": 0 },
    { "seq": 1, "type": "state-snapshot", "path": "segments[0].x", "value": 0 },
    { "seq": 2, "type": "state-snapshot", "path": "segments[1].x", "value": 200 },
    { "seq": 3, "type": "state-snapshot", "path": "segments.length", "value": 2 },
    /* …41 more events: pointerdown / pointermove×N / pointerup / final state-snapshot… */
    { "seq": 45, "type": "pointer", "phase": "up",   "targetName": "seg-0", "x": 240, "y": 32, "pointerId": 1 },
    { "seq": 46, "type": "state-snapshot", "path": "segments[0].x", "value": 0 }
  ]
}
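The fixture suggests a small discriminated union keyed on `type`. Below is a minimal sketch of that shape, inferred only from the fields visible above. `SessionEvent` is the name used in the proof log, but the member types, the `'down' | 'move' | 'up'` phase values, and both helpers are assumptions, not the package's actual definitions.

```typescript
// Hypothetical types inferred from the fixture; not the real package definitions.
type NavEvent = { seq: number; type: 'nav'; url: string; ts: number };
type StateSnapshotEvent = { seq: number; type: 'state-snapshot'; path: string; value: unknown };
type PointerSessionEvent = {
  seq: number;
  type: 'pointer';
  phase: 'down' | 'move' | 'up'; // assumed phase names
  targetName: string;
  x: number;
  y: number;
  pointerId: number;
};
type SessionEvent = NavEvent | StateSnapshotEvent | PointerSessionEvent;

// Returns the first index whose seq differs from its position, or -1 if seq runs 0, 1, 2, ...
function firstSequenceGap(events: SessionEvent[]): number {
  return events.findIndex((event, index) => event.seq !== index);
}

// Counts events per type so a connector can sanity-check what it normalized.
function countByType(events: SessionEvent[]): Record<SessionEvent['type'], number> {
  const counts: Record<SessionEvent['type'], number> = { nav: 0, 'state-snapshot': 0, pointer: 0 };
  for (const event of events) counts[event.type] += 1;
  return counts;
}

// The first four events of the fixture above.
const sample: SessionEvent[] = [
  { seq: 0, type: 'nav', url: 'http://localhost:5173/', ts: 0 },
  { seq: 1, type: 'state-snapshot', path: 'segments[0].x', value: 0 },
  { seq: 2, type: 'state-snapshot', path: 'segments[1].x', value: 200 },
  { seq: 3, type: 'state-snapshot', path: 'segments.length', value: 2 },
];
```

A contiguity check like `firstSequenceGap` is one cheap way to detect a session that lost events in transit before any spec is generated from it.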
STEP 2 · Output: generated Playwright/Vitest spec

18 lines. 6 harness primitives. No pixel coords, no waitForTimeout, no hand-crafted selectors.

out/issue-2014.spec.ts · typescript
import { describe, expect, test } from 'vitest';
import {
  dispatchDrag,
  getStateSnapshot,
  setClock,
} from '@cuit/harness';

describe('issue-2014 — segment 0 drag must not collide-noop', () => {
  test('drags segment 0 right by 100px and asserts state moves', () => {
    setClock(1716800000000);

    dispatchDrag('seg-0', 100, 0);

    const snapshot = getStateSnapshot();
    expect(snapshot['segments[0].x']).toEqual(100);
  });
});
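One way a recorded pointer stream could collapse into the single `dispatchDrag('seg-0', 100, 0)` call above is to keep only the target and the down-to-up delta. The sketch below illustrates that idea. It is not the @cuit/spec-gen implementation: `PointerSample`, `summarizeDrag`, and `toHarnessCall` are hypothetical names, and the pointerdown position in the usage data is invented (the fixture shows only the pointerup at x=240, y=32).

```typescript
// Hypothetical trimmed pointer event, using the field names from the fixture.
interface PointerSample {
  phase: 'down' | 'move' | 'up';
  targetName: string;
  x: number;
  y: number;
  pointerId: number;
}

interface DragSummary {
  target: string;
  dx: number;
  dy: number;
}

// Collapse one pointer's down..up stream into a semantic drag (target + delta).
function summarizeDrag(samples: PointerSample[], pointerId: number): DragSummary | null {
  const own = samples.filter((s) => s.pointerId === pointerId);
  const down = own.find((s) => s.phase === 'down');
  const up = [...own].reverse().find((s) => s.phase === 'up');
  if (!down || !up) return null; // incomplete gesture: nothing to emit
  return { target: down.targetName, dx: up.x - down.x, dy: up.y - down.y };
}

// Render the summary as a harness call; absolute coordinates never reach the spec.
function toHarnessCall(drag: DragSummary): string {
  return `dispatchDrag('${drag.target}', ${drag.dx}, ${drag.dy});`;
}

// Usage with an assumed pointerdown at x=140 and the recorded pointerup at x=240.
const recorded: PointerSample[] = [
  { phase: 'down', targetName: 'seg-0', x: 140, y: 32, pointerId: 1 },
  { phase: 'move', targetName: 'seg-0', x: 190, y: 32, pointerId: 1 },
  { phase: 'up', targetName: 'seg-0', x: 240, y: 32, pointerId: 1 },
];
const summary = summarizeDrag(recorded, 1);
```

Because only the delta survives, the generated line stays valid if the page layout shifts the segment's absolute position.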
STEP 3 — RED · Run the spec against the buggy code

The spec reproduces the failure deterministically. RED is the success state — the bug is now caught by an automated test.

App.tsx (buggy) · typescript
// packages/demo-app/src/App.tsx — bug version

// Inside the pointermove handler:
setSegments((prev) => {
  const next = prev.map((s) => ({ ...s }));
  const moving = next[idx];
  const proposedX = drag.originX + dx;

  // BUG: the collision check is too eager — it blocks
  // every move that would even momentarily overlap.
  const collides = next.some((other, j) => {
    if (j === idx) return false;
    return proposedX < other.x + other.width &&
           other.x < proposedX + moving.width;
  });
  if (collides) return prev;          // <-- silently no-op'd

  moving.x = proposedX;
  return next;
});
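The `collides` expression in the handler is a one-dimensional interval overlap test. Here it is in isolation as a minimal sketch. The widths of 150 are assumed purely for illustration, since the fixture records only `x` positions, and `Span` and `overlaps` are hypothetical names.

```typescript
interface Span {
  x: number;
  width: number;
}

// True when [a.x, a.x + a.width) and [b.x, b.x + b.width) share any point.
// Edges that merely touch do not count as overlap because both comparisons are strict.
function overlaps(a: Span, b: Span): boolean {
  return a.x < b.x + b.width && b.x < a.x + a.width;
}

// Neighbour at x=200 as in the fixture; width 150 is an assumption.
const neighbour: Span = { x: 200, width: 150 };

const touching = overlaps({ x: 50, width: 150 }, neighbour); // right edge at 200: false
const crossing = overlaps({ x: 100, width: 150 }, neighbour); // right edge at 250: true
```

Under these assumed widths, a proposed position of x=100 for segment 0 overlaps its neighbour, so the buggy handler returns `prev` for that move and the state never reaches 100.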
STEP 4 — GREEN · Apply the fix, re-run the same spec

Same spec, same harness, fixed code. PASS. The spec is now a permanent CI gate.

App.tsx (fixed) · typescript
// packages/demo-app/src/App.tsx — fix version

// Inside the pointermove handler:
setSegments((prev) => {
  const next = prev.map((s) => ({ ...s }));
  const moving = next[idx];
  const proposedX = drag.originX + dx;

  // FIX: drop the over-eager collision short-circuit.
  // Free positioning; downstream layout handles overlap.

  moving.x = proposedX;
  return next;
});
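To see the RED and GREEN outcomes without a browser, the handler body can be lifted into a pure function and replayed over a stream of pointermove offsets. This is a sketch under stated assumptions: the widths of 150 and the offsets in steps of 20 are made up, `applyMove` and `replayDrag` are hypothetical names, and `fixEnabled` merely stands in for the FIX_SEGMENT_COLLISION flag that appears in the proof:loop output.

```typescript
interface Segment {
  x: number;
  width: number;
}

// Pure version of the setSegments updater. fixEnabled = false keeps the eager collision check.
function applyMove(prev: Segment[], idx: number, proposedX: number, fixEnabled: boolean): Segment[] {
  const moving = prev[idx];
  if (moving === undefined) return prev;
  if (!fixEnabled) {
    const collides = prev.some(
      (other, j) =>
        j !== idx && proposedX < other.x + other.width && other.x < proposedX + moving.width,
    );
    if (collides) return prev; // buggy behaviour: the move is silently dropped
  }
  return prev.map((s, j) => (j === idx ? { ...s, x: proposedX } : { ...s }));
}

// Replays pointermove offsets measured from the drag origin, as the handler does with originX + dx.
function replayDrag(initial: Segment[], idx: number, offsets: number[], fixEnabled: boolean): Segment[] {
  const originX = initial[idx]?.x ?? 0;
  return offsets.reduce(
    (segments, dx) => applyMove(segments, idx, originX + dx, fixEnabled),
    initial,
  );
}

// Positions from the fixture; widths and offsets are illustrative assumptions.
const initial: Segment[] = [
  { x: 0, width: 150 },
  { x: 200, width: 150 },
];
const offsets = [20, 40, 60, 80, 100];

const buggy = replayDrag(initial, 0, offsets, false); // segment 0 stalls at x=40
const fixed = replayDrag(initial, 0, offsets, true); // segment 0 ends at x=100
```

With these numbers the buggy replay accepts the moves to 20 and 40, then drops every later move because the right edge would cross x=200. The fixed replay lands on 100, which is the value the generated spec asserts.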
Actual stdout of pnpm proof:loop, copied verbatim from proof-output.log
[1/6] Loading recorded session events from fixtures/segment-collision.json
       -> 47 events normalized into SessionEvent[]
[2/6] Generating spec from session events
       -> wrote out/issue-2014-segment-0-drag-must-not-collide-noop.spec.ts (18 lines, 6 primitives used)
[3/6] Running spec against demo-app (bug-present mode)
       -> FAIL - segment 0 right edge stayed at x=25 (expected 100)
       -> RED - bug reproduced deterministically [SUCCESS]
[4/6] Applying canonical fix (FIX_SEGMENT_COLLISION=1)
       -> re-rendering demo-app with fix flag
[5/6] Running spec against demo-app (fixed mode)
       -> PASS - segment 0 right edge moved to x=100
       -> GREEN - fix verified, regression locked in [SUCCESS]
[6/6] Locking the spec into CI as a gate
       -> wrote .github/workflows/proof-regression.yml

LOOP COMPLETE - RED to GREEN in 0.1s
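An agent or CI wrapper that consumes this output needs to confirm that RED arrived before GREEN, not just that the process exited 0. The sketch below shows one way to check that, assuming the `-> RED` and `-> GREEN` line format in the log above. `extractSignals` and `isRedThenGreen` are hypothetical helpers, not part of any CUIT package.

```typescript
type Signal = 'RED' | 'GREEN';

// Pulls the ordered RED/GREEN markers out of proof-loop stdout.
// Only lines of the form "-> RED ..." or "-> GREEN ..." count; the summary line is ignored.
function extractSignals(stdout: string): Signal[] {
  const signals: Signal[] = [];
  for (const line of stdout.split('\n')) {
    const marker = /->\s+(RED|GREEN)\b/.exec(line)?.[1];
    if (marker === 'RED' || marker === 'GREEN') signals.push(marker);
  }
  return signals;
}

// The loop is closed only if exactly one RED is followed by exactly one GREEN.
function isRedThenGreen(stdout: string): boolean {
  const signals = extractSignals(stdout);
  return signals.length === 2 && signals[0] === 'RED' && signals[1] === 'GREEN';
}
```

Requiring RED first guards against a spec that passes on the buggy code, which would mean it never reproduced the bug.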

Try it. Don't take our word.

Four shell commands. Node 20. About thirty seconds of install. The same six-step stdout you see above will print on your machine — RED at step 3, GREEN at step 5, exit 0.

  • Read the source — every primitive in the spec is a real exported function from @cuit/harness.
  • Inspect the tests — 61 unit tests across 5 packages, TDD-first, all green.
  • Wire it into your repo — the same primitives work in any React/Vue app that exposes a state-snapshot hook.
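A state-snapshot hook could be as small as a function that flattens app state into the path keys seen in the fixture, such as `segments[0].x` and `segments.length`. The sketch below shows that flattening. It is only an assumption about the format: `Json` and `flattenState` are hypothetical names, and how the real harness reads state from an app is not shown here.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Flattens nested state into path -> value pairs, e.g. 'segments[0].x'.
// Arrays additionally get a '<path>.length' entry, matching the fixture's 'segments.length'.
function flattenState(
  value: Json,
  prefix = '',
  out: Record<string, Json> = {},
): Record<string, Json> {
  if (Array.isArray(value)) {
    out[prefix ? `${prefix}.length` : 'length'] = value.length;
    value.forEach((item, i) => {
      flattenState(item, `${prefix}[${i}]`, out);
    });
  } else if (value !== null && typeof value === 'object') {
    for (const [key, child] of Object.entries(value)) {
      flattenState(child, prefix ? `${prefix}.${key}` : key, out);
    }
  } else {
    out[prefix] = value;
  }
  return out;
}

// Usage: positions from the fixture, widths assumed for illustration.
const snapshot = flattenState({
  segments: [
    { x: 0, width: 150 },
    { x: 200, width: 150 },
  ],
});
```

A flat map keeps assertions in the style of the generated spec, where a value is looked up by its path string.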
run on your machine · bash
# 1. Clone the repo
git clone git@github.com:speechlabinc/complex-ui-tester.git
cd complex-ui-tester/proof-of-concept

# 2. Install (Node 20 + pnpm)
pnpm install

# 3. Run the loop end-to-end
pnpm proof:loop

# 4. Run the package tests (61 tests across 5 packages)
pnpm test

expected output: RED to GREEN in 0.1s · exit 0

Evidence — Branch B

This already works in production.

The 6-layer harness shipped in PR #1995 on SpeechLab's waveform editor. Every claim in this section is grounded in that PR — open the link to verify.

Meta-evidence: the harness itself was caught by its own loop. Issue #1967 (dispatchDrag off-by-seg.x for segment 0) was discovered when a generated spec consistently went RED on code we believed was correct. If the loop catches its own bugs, it catches yours.

Pricing

The library is free. The SaaS pays for itself.

On the Team tier, even one avoided regression a quarter covers your annual SaaS bill. The OSS harness costs nothing — install it today.

OSS

$0 · Free forever — MIT licensed

Use the harness in your own repo. No SaaS, no account, no telemetry.

  • @cuit/harness — all 6 layers
  • dispatchDrag, dispatchResize, seekTo
  • State snapshot via getStateSnapshot()
  • Deterministic clock (setClock / tick)
  • DOM mutation + CSS observer invariants
  • Vitest + Playwright adapter
Most popular

Team

$499/month — unlimited specs, UI Intelligence Chat included

Close the loop for your AI coding tools. Unlimited recorded sessions → unlimited generated specs → unlimited regression gates. Plus a natural-language chat over the whole QA corpus.

  • Everything in OSS
  • Unlimited spec generation — no cap, no overage
  • UI Intelligence Chat — query the QA corpus in plain English
  • First-party Chrome recorder (no vendor lock-in)
  • Selector dictionary (unlimited entries on Team and above)
  • Confidence scoring + auto-PR at 0.75+

Business

$2,500/month — unlimited specs, full connector coverage, SOC 2

Everything in Team, plus all five session sources, the SOC 2 report procurement wants, and the audit log your SIEM expects.

  • Everything in Team
  • All 5 connectors (+ FullStory, Datadog RUM)
  • Bug-class corpus training (custom to your UI)
  • SOC 2 report on request
  • Audit log export (S3 / BigQuery)
  • Up to 25 seats

FAQ

Questions worth asking.

Every answer links to the design doc that goes deeper. If you have a question we should add, file a docs issue.

  • The harness library — every primitive in Layers 1–6, the React and Vue adapters, the Playwright runner integration — is MIT-licensed with no usage gating. You can audit the code, fork it, and ship it without ever touching our SaaS. The SaaS sits on top: it provides LLM inference, session connectors, multi-tenant cost accounting, and SOC 2 audit logging. Those are the parts you pay for, not the library.

    See doc 01

Ready when you are

End the 6-Reopened-bugs treadmill.

Three design-partner slots remain in Q3 2026. White-glove onboarding, shared Slack channel, six months free, founder on-call. Bring one bug ticket and one repo — we'll do the rest.

MIT-licensed. Zero telemetry. Yours to fork.