Curated Claude Code catalog
Updated 07.05.2026 · 19:39 CET
01 / Skill
trycua

cua

Quality
9.0

Cua is a comprehensive platform enabling AI agents to interact with computers through background macOS automation, cross-OS sandboxes, and robust benchmarking tools. It excels when developers need to build, evaluate, and deploy agents capable of visually interacting with GUIs, clicking, typing, and performing complex tasks across diverse operating systems.

USP

Its unique selling point is the ability to drive native macOS apps in the background without stealing focus or moving the cursor, combined with a unified API for sandboxed GUI automation across Linux, macOS, Windows, and Android.

Use cases

  • 01 Automating GUI tasks
  • 02 End-to-end QA for applications
  • 03 Benchmarking AI computer-use agents
  • 04 Developing cross-OS agent interactions
  • 05 Background macOS app control

Detected files (3)

  • libs/cua-driver/Skills/cua-driver/SKILL.md (skill)
    ---
    name: cua-driver
    description: Drive a native macOS app via the cua-driver CLI (default) or MCP server — snapshot its AX tree, click/type/scroll by element_index, verify via re-snapshot. Use when the user asks you to operate, drive, automate, or perform a GUI task in a real macOS application on the host (e.g. "open a file in TextEdit", "navigate to /Applications in Finder", "click the Save button in Numbers").
    ---
    
    # cua-driver
    
    Orchestrates macOS app automation via `cua-driver`. Whenever a user
    asks to drive a native macOS app, follow the loop in this skill rather
    than calling tools ad-hoc — the snapshot-before-action invariant is not
    optional and silently breaks if you skip it.
    
    ## The no-foreground contract — read this first
    
    **The user's frontmost app MUST NOT change.** This is the whole
    reason cua-driver exists. Users pay for the right to keep typing in
    their editor while an agent drives another app in the background.
    Violate this rule and every other nice property the driver gives
    you (no cursor warp, no Space switch, no window raise) stops
    mattering — you just shipped the Accessibility Inspector with extra
    steps.
    
    Before running any shell command, ask: **"does this raise,
    activate, foreground, or make-key any app?"** If yes, don't run it.
    Every one of the commands below activates the target on macOS and
    is therefore forbidden unless the user **explicitly** asked for
    frontmost state:
    
    - **Every form of the `open` CLI — `open -a <App>`, `open -b
      <bundle-id>`, `open <file>`, `open <path-to-App.app>`, `open
      <url>` — always activates.** macOS routes all forms through
      LaunchServices, which unhides and foregrounds the target
      regardless of whether you passed an app name, a bundle id, a
      document, a URL, or the bundle path itself. The activation
      happens even when the only intent was "start the process."
      **Never use `open` for any app launch.** This includes launching
      a just-built .app from a local build dir (e.g. `open
      build/Build/Products/Debug/MyApp.app`) — resolve the
      `CFBundleIdentifier` from `Info.plist` and use `launch_app`
      with that id. See the `FocusRestoreGuard` note under "Open \<app\>"
      below for why `launch_app` is safe even when the app internally
      calls `NSApp.activate`.
    - `osascript -e 'tell application "X" to activate'` —
      activates by design. Same for `... to open <file>`,
      `... to launch`, and anything with `activate` in the tell block.
    - `osascript -e 'tell application "System Events" to ... frontmost'`
      in a mutating form (setting `frontmost` rather than reading it).
    - AppleScript files that invoke `activate`, `launch`, or `open`
      against the target app.
    - `cliclick` (moves the user's real cursor to the target coords
      before clicking — a focus-steal-equivalent even if the app's
      window state is unchanged).
    - `CGEventPost` with `cghidEventTap` targeting a coordinate over
      a different app's window (warps the cursor, possibly activates
      on hit).
    - `AppleScriptTask`, `NSAppleScript`, `Process` wrapping `osascript`
      that contains any of the above.
    - `NSRunningApplication.activate(options:)` called from your own
      helper binary — same class.
    - Dock clicks and any `open` invocation (see the first bullet —
      every form of `open` goes through LaunchServices which
      activates, full stop).
    - **Keyboard shortcuts that semantically mean "focus here" —
      most notably Chrome / Safari / Arc's `⌘L` (focus omnibox) and
      Finder's `⌘⇧G` (Go to Folder).** These aren't pure key events —
      the receiving app interprets "user wants to type here" as
      activation intent and raises its window to be key. Even when
      delivered to a backgrounded pid via `hotkey`, the downstream app
      pulls focus. **For omnibox navigation specifically**, the correct
      path is `launch_app({bundle_id: "com.google.Chrome", urls:
      ["https://…"]})` — no omnibox dance, no `⌘L`, no focus-steal. Do
      NOT try `set_value` on the omnibox: Chrome's commit logic requires
      a "user-typed" signal that neither an AX value write nor
      `CGEvent.postToPid` keystrokes supply from a backgrounded pid —
      the URL lands in the field but Return fires as a no-op. See
      `WEB_APPS.md` → "Navigate to a URL" for the full pattern. The
      general principle: a shortcut that says "put my cursor inside this
      app" is a focus-steal; a shortcut that says "do this thing" (copy,
      save, quit) is fine.
    - **Tab-switching shortcuts in browsers (`⌘1..⌘9`, `⌘]`, `⌘[`,
      `⌘⇧[`, `⌘⇧]`) are visibly disruptive even when delivered to a
      backgrounded pid.** The app's key handler processes the shortcut,
      the window re-renders the new tab's content, the user sees their
      tabs flipping. There is no AX-only workaround: page content (HTML,
      form state, `AXWebArea`) populates only for the focused tab;
      inspecting a background tab requires activating it, which is the
      visible flip. Observed with Dia; the same mechanic applies to every
      Chromium-family browser (Chrome, Arc, Brave, Edge).
    
      **Prefer the windows-over-tabs pattern**: for each URL you need to
      drive backgrounded, use `launch_app({bundle_id, urls: [url]})` —
      browsers open each URL in a new **window**. Each window has its own
      `window_id`, its own AX tree, and can be inspected / interacted with
      via `element_index` without activating or switching anything. Tabs
      are a UX grouping for humans; cua-driver workflows should default to
      windows. See `WEB_APPS.md` → "Tabs vs windows" for the full pattern.
    
      Tab-title enumeration (read-only) IS safe — walk a window's toolbar
      AX tree for `AXTab` / `AXRadioButton` children and read their
      `AXTitle`s. Tab switching (activating one) is not.
    
    Reading frontmost state is fine (`osascript -e 'tell application
    "System Events" to get name of first application process whose
    frontmost is true'`). Mutating it is not.
    
    **Corollary — the AXMenuBar rule.** `AXMenuBarItem` + AXPick
    dispatches at the AX layer regardless of which app is frontmost,
    but macOS's on-screen menu bar always belongs to the frontmost
    app. If you drive a *backgrounded* app's menu bar, the AX call
    succeeds but the viewer sees the dispatch rendered over the
    *frontmost* app's menu bar — confusing in any observed session and
    routinely a silent no-op too, because action menu items go
    `DISABLED` when their owning app isn't the key window. **So: only
    use menu-bar navigation when the target is already frontmost.** For
    backgrounded targets, read state via in-window AX (window title,
    toolbar `AXStaticText`) and dispatch via in-window `element_index`
    or pixel clicks — both paths are frontmost-insensitive. Full
    rationale in "Navigating native menu bars" below.
    
    **"Open \<app\>" in user speech means launch, not activate.**
    `cua-driver launch_app` is the one correct path for process
    startup — it's idempotent (no-op on a running app), returns the
    pid, and has an internal `FocusRestoreGuard` that catches
    `NSApp.activate(ignoringOtherApps:)` calls the target makes during
    `application(_:open:)` and restores the frontmost app to what it
    was before the launch. That guard is why `launch_app` with `urls`
    (e.g. `{"bundle_id": "com.colliderli.iina", "urls": ["~/video.mp4"]}`)
    is safe even for apps that normally foreground on media-load
    (Chrome, Electron, media players).
    
    ## Defaults — always prefer cua-driver over shell shims
    
    **Default transport is the `cua-driver` CLI** — `Bash` shelling out
    to `cua-driver <tool-name> '<JSON-args>'`. MCP tools (prefix
    `mcp__cua-driver__*`) only when the user explicitly asks for them.
    CLI wins because it picks up rebuilds instantly, failures are
    easier to diagnose, and there's no per-tool schema-load overhead.
    
    Every reference to `click(...)`, `get_window_state(...)` etc. in this
    skill means `cua-driver click '{...}'` — translate to MCP form only
    when MCP is requested.
    
    ### Claude Code computer-use compatibility mode
    
    For normal Claude Code use, keep the default CLI or `cua-driver` MCP server path above. If the user explicitly wants Claude Code's vision/computer-use-style flow, they can register:
    
    ```bash
    claude mcp add --transport stdio cua-computer-use -- cua-driver mcp --claude-code-computer-use-compat
    ```
    
    Observation: Claude Code vision flows appear to treat a screenshot MCP tool as the image-grounding anchor. This compatibility mode keeps the normal CuaDriver tools and changes only `screenshot`. The compatibility `screenshot` requires `pid` and `window_id`, captures only that target window, and returns the window-local pixel coordinate frame. Start with `launch_app` or `list_windows`, then call `screenshot({pid, window_id})`; do not assume desktop coordinates or a full-screen capture.
    
    Use MCP for this Claude Code vision/computer-use-style path. Do not shell out to `cua-driver screenshot` as a substitute: CLI screenshots still work as CuaDriver calls, but they do not expose the `mcp__cua-computer-use__screenshot` tool name that Claude Code appears to use as the image-grounding cue.
    
    Intent → tool mapping. If you find yourself reaching for the
    "Don't use" column, something has gone wrong; re-read "The
    no-foreground contract" above:
    
    | Intent | Use | Don't use |
    |---|---|---|
    | Open / launch an app | `launch_app({bundle_id})` or `launch_app({bundle_id, urls:[...]})` | `open -a`, `osascript 'tell app … to launch/activate/open'` |
    | Find a pid | `list_apps` or `launch_app`'s return | `pgrep`, `ps`, `osascript frontmost` |
    | Enumerate an app's windows | `list_windows({pid})` — or read the `windows` array `launch_app` already returns | `osascript 'every window of app …'` |
    | Click / type / scroll / keys | `click`, `type_text`, `scroll`, `press_key`, `hotkey` | `osascript`, `cliclick`, raw `CGEvent`, `open <url>` |
    | Drag / drag-and-drop / marquee select | `drag({pid, from_x, from_y, to_x, to_y})` (pixel-only — macOS AX has no semantic drag) | `cliclick dd:`, `osascript drag` |
    | Screenshot | `screenshot` or the PNG in `get_window_state` | `screencapture` |
    | Quit an app | ask the user first, then `hotkey({pid, keys:["cmd","q"]})` | `kill`, `killall`, `pkill` |
    | Hand a file/URL to an app | `launch_app({bundle_id, urls:[<path>]})` | `open -a <App> <path>`, `open <url>` |
    
    ### The narrow carve-out
    
    The **only** legitimate use of `osascript -e 'tell app X to
    activate'` is when the user **explicitly** asked for frontmost
    state ("bring Chrome to the front", "make it frontmost", "I want
    to see X"). Reaching for it because a tool call returned something
    confusing is wrong — that's the skill's classic foot-in-the-door
    failure mode and it steals focus every time.
    
    When a cua-driver call surprises you, diagnose cua-driver first:
    
    - **Tiny screenshot / empty `tree_markdown`?** Check
      `cua-driver get_config` → `capture_mode`. Default `"som"` returns
      both the AX tree and screenshot. `"vision"` omits the AX tree
      (PNG only), `"ax"` omits the PNG. If a snapshot lacks a tree,
      `capture_mode` is almost certainly `"vision"` — either reason
      purely from the PNG or flip to `"som"` / `"ax"` via `set_config`.
    - **`has_screenshot: false`?** The window capture failed (transient
      race against a close, or the window has no backing store yet).
      Re-snapshot; if persistent, pick a different `window_id` via
      `list_windows`.
    - **`Invalid element_index` / `No cached AX state`?** You either
      skipped `get_window_state` this turn or passed a different
      `window_id` than the one the snapshot cached against. The cache
      is keyed on `(pid, window_id)` — indices don't carry across
      windows of the same app. Re-snapshot with the same window_id
      you're about to click in.
    - **Sparse Chromium AX tree?** Retry `get_window_state` once — the
      tree populates on second call.
    
    Only after those are ruled out, and only if the user's action
    genuinely needs frontmost state, fall through to the activate
    fallback. Always name the focus steal in your response ("I'll
    briefly bring Chrome to the front because …").
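
    The triage list above can be folded into a small lookup, so a diagnosis step is always the first reaction to a surprising response. This is a minimal sketch: `next_step_for` is a hypothetical helper, not part of cua-driver, and it assumes the symptom strings quoted in this section appear verbatim in the tool output.

    ```shell
    # Hypothetical helper: map a symptom string from a cua-driver response
    # to the diagnosis step recommended in this section. Pure string matching.
    next_step_for() {
      case "$1" in
        *"Invalid element_index"*|*"No cached AX state"*)
          echo "re-snapshot: get_window_state with the same (pid, window_id)" ;;
        *'"has_screenshot": false'*|*'"has_screenshot":false'*|*"has_screenshot: false"*)
          echo "re-snapshot; if persistent, pick another window_id via list_windows" ;;
        *)
          echo "check get_config for capture_mode before anything else" ;;
      esac
    }
    ```

    Note that none of the branches returns "activate the app": the activate fallback stays a deliberate, user-requested decision.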
    
    ### Self-check pattern
    
    Before every `Bash` call whose command line touches any macOS app
    (launching, opening, clicking, typing, scripting, screenshotting),
    run the self-check:
    
    1. **Does this command foreground the target?** If yes — stop and
       translate to the cua-driver equivalent from the mapping table.
    2. **Does this command move the user's real cursor?** (`cliclick`,
       any `CGEventPost` at `cghidEventTap` over another app's window).
       If yes — stop; use `click({pid, x, y})` which routes per-pid
       via SkyLight and never warps the cursor.
    3. **Does this command bypass cua-driver entirely?** (`osascript`
       mutating GUI state, AppleScript files, external helpers.) If
       yes — stop; find the cua-driver tool that does the intent.
    
    If all three are "no," the command is safe. If you can't answer,
    default to stop and ask rather than proceed. A single `open -a`
    run by accident kills the demo, the trust, and the user's in-flight
    editor state.
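
    Part of the self-check can be approximated mechanically. The sketch below is a hypothetical filter (`steals_focus` is an illustrative name) that flags a few of the forbidden command shapes by substring match. It is a heuristic only: it does not parse shell syntax, it covers only some of the listed patterns, and it exempts the daemon-start form of `open` that the Prerequisites section recommends.

    ```shell
    # Hypothetical filter: exit status 0 means "this command line matches a
    # focus-stealing pattern from this skill, do not run it".
    # Plain substring matching; it does not parse shell syntax.
    steals_focus() {
      case " $1 " in
        *" open -n -g -a CuaDriver "*) return 1 ;;  # daemon start, see Prerequisites
        *" open "*)                    return 0 ;;  # every other form of `open`
        *" cliclick "*)                return 0 ;;  # moves the user's real cursor
        *osascript*activate*)          return 0 ;;  # activates by design
        *osascript*" to launch"*)      return 0 ;;  # same class as activate
        *)                             return 1 ;;
      esac
    }
    ```

    A caller might use it as `steals_focus "$cmd" && echo "blocked: use the cua-driver equivalent"`. A clean result does not prove a command is safe, so the three questions above still apply.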
    
    ## Prerequisites — check before starting
    
    1. `cua-driver` is on `$PATH` (`which cua-driver`). If not, point the
       user at `scripts/install-local.sh` and stop.
    2. Run `cua-driver check_permissions` (with the daemon up — see step 3).
       The default behavior also raises the system permission dialogs for
       any missing grants, so the user can grant on the spot. If either
       grant still reads `false` after that (user dismissed the dialog),
       tell them to open System Settings → Privacy & Security and grant
       Accessibility and Screen Recording to `CuaDriver.app`, then stop.
       Pass `'{"prompt":false}'` for a purely read-only status check that
       won't steal focus.
    3. Start the daemon with `open -n -g -a CuaDriver --args serve` (the
       recommended form — goes through LaunchServices so TCC attributes
       the process to CuaDriver.app). `cua-driver serve &` also works;
       the CLI auto-relaunches through `open -n -g -a CuaDriver` when it
       detects a wrong-TCC context (any IDE-spawned shell: Claude Code,
       Cursor, VS Code, Conductor). Verify with `cua-driver status`.
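
    The three checks above could be chained into one pre-flight function. This is a sketch under stated assumptions rather than documented behavior: it assumes `cua-driver status` exits non-zero when the daemon is down, and that a missing grant shows up as the literal text `false` in the `check_permissions` output. `DRIVER` and `preflight` are hypothetical names introduced for illustration.

    ```shell
    # DRIVER is a hypothetical override so the sketch can run against a stub;
    # it defaults to the real binary name.
    DRIVER="${DRIVER:-cua-driver}"

    # Prints "ok" and returns 0 when all three prerequisites hold.
    preflight() {
      command -v "$DRIVER" >/dev/null 2>&1 || {
        echo "cua-driver not on PATH: see scripts/install-local.sh"; return 1; }
      # Assumption: status exits non-zero when the daemon is not running.
      "$DRIVER" status >/dev/null 2>&1 || {
        echo "daemon not running: start it, then re-check"; return 1; }
      # prompt:false keeps this a read-only check that will not steal focus.
      perms=$("$DRIVER" check_permissions '{"prompt":false}') || return 1
      case "$perms" in
        *false*) echo "grant Accessibility and Screen Recording to CuaDriver.app"; return 1 ;;
      esac
      echo "ok"
    }
    ```

    The daemon check runs before the permission check because `check_permissions` is meant to be called with the daemon up.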
    
    ## Using cua-driver from the shell
    
    Tool names are `snake_case` and management subcommands are
    `kebab-case`, so the two never collide. Tools are invoked as
    `cua-driver <tool-name> '<JSON-args>'`. Management subcommands:
    
    - `open -n -g -a CuaDriver --args serve` — start persistent daemon
      (**required** for `element_index` workflows; without it each CLI
      invocation spawns a fresh process and the per-pid element cache
      dies between calls). `cua-driver serve &` also works — the CLI
      auto-relaunches via `open` when the shell's TCC context is wrong.
      Pass `--no-relaunch` / `CUA_DRIVER_NO_RELAUNCH=1` to opt out.
    - `cua-driver stop` / `status`
    - `cua-driver list-tools`, `describe <tool>`
    - `cua-driver recording start|stop|status` — see `RECORDING.md`
    
    Canonical multi-step workflow:
    
    ```bash
    open -n -g -a CuaDriver --args serve
    cua-driver launch_app '{"bundle_id":"com.apple.calculator"}'
    # → {pid: 844, windows: [{window_id: 10725, ...}]}
    cua-driver get_window_state '{"pid":844,"window_id":10725}'
    cua-driver click '{"pid":844,"window_id":10725,"element_index":14}'
    cua-driver stop
    ```
    
    ## Agent cursor overlay
    
    Visual cursor overlay for demos and screen recordings. Default:
    enabled. Toggle with `cua-driver set_agent_cursor_enabled
    '{"enabled":true|false}'`. A triangle pointer Bezier-glides to each
    click target, ring-ripples on landing, idle-hides after ~1.5s.
    Motion knobs: `set_agent_cursor_motion` takes any subset of
    `start_handle`, `end_handle`, `arc_size`, `arc_flow`, `spring` —
    tuneable at runtime, persisted to config.
    
    Requires an AppKit runloop, which `cua-driver serve` / `mcp`
    bootstraps. One-shot CLI invocations skip the overlay entirely.
    
    ## The core invariant — snapshot before AND after every action
    
    **Every action MUST be bracketed by `get_window_state(pid, window_id)`**:
    
    - **Before** — the pre-action snapshot resolves the `element_index`
      you're about to use. Indices from previous turns are stale; the
      server replaces the element index map on every snapshot, keyed
      on `(pid, window_id)`. Indices from turn N don't resolve in turn
      N+1, and indices from window A don't resolve against window B of
      the same app. Skip this and element-indexed actions fail with
      `No cached AX state`.
    - **After** — the post-action snapshot verifies the action actually
      landed. Without it you can't tell a silent no-op from a real
      effect. The AX tree change (new value, new window, disappeared
      menu, disabled button, etc.) is your evidence that the action
      fired. If nothing changed, the action probably failed silently —
      say so, don't assume success.
    
    This applies to pixel clicks too — re-snapshot after to confirm the
    click landed on the intended target.
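
    A minimal sketch of the bracket, assuming the snapshot output carries no per-call noise (timestamps, inline image data) that would make two snapshots differ for unrelated reasons. `DRIVER`, `snapshot`, and `snapshot_changed` are hypothetical names; comparing the whole output is a crude stand-in for diffing `tree_markdown`.

    ```shell
    DRIVER="${DRIVER:-cua-driver}"   # hypothetical override, defaults to the real CLI

    # Take a snapshot of one window and print the tool's raw output.
    snapshot() {
      "$DRIVER" get_window_state "$(printf '{"pid":%d,"window_id":%d}' "$1" "$2")"
    }

    # Exit status 0 when the two snapshots differ, i.e. there is evidence
    # that the action landed; non-zero when nothing changed.
    snapshot_changed() {
      [ "$1" != "$2" ]
    }

    # Intended call order (not executed here):
    #   before=$(snapshot 844 10725)   # read the tree, pick element_index
    #   "$DRIVER" click '{"pid":844,"window_id":10725,"element_index":14}'
    #   after=$(snapshot 844 10725)
    #   snapshot_changed "$before" "$after" || echo "no change: possible silent no-op"
    ```

    The `element_index` passed to the action has to come from the `before` snapshot of the same `(pid, window_id)`, which is why the action sits between the two calls rather than inside a wrapper that hides them.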
    
    ### Why window selection is the caller's job now
    
    `get_app_state` used to pick a window for you via a max-area heuristic
    that returned the wrong surface on apps with large off-screen utility
    panels. Concrete reproducer: IINA's OpenSubtitles helper (600×432
    off-screen) out-area'd the visible 320×240 player window, so
    `get_app_state(pid)` screenshot'd the invisible panel and clicks landed
    there silently. The new `get_window_state(pid, window_id)` makes the
    caller name the window explicitly — the driver validates that the
    window belongs to the pid and is on the current Space, then snapshots
    exactly what was asked for. Enumerate candidates via `list_windows` or
    read the `windows` array `launch_app` already returns.
    
    ## Behavior matrix
    
    Two orthogonal axes shape what the agent can do.
    
    **capture_mode → addressing mode**
    
    | `capture_mode` | `get_window_state` returns | Use for actions |
    |---|---|---|
    | **`som`** (default) | tree + screenshot | `element_index` preferred; pixel fallback |
    | **`ax`** | tree only (no PNG) | `element_index` only |
    | **`vision`** | PNG only (no tree) | pixel only — see [SCREENSHOT.md](./SCREENSHOT.md) |
    
    `vision` was renamed from `screenshot` — the old name still decodes
    as a deprecated alias, so an on-disk `"capture_mode": "screenshot"`
    keeps working. Default is `som` so element_index clicks work the
    first time a user calls `get_window_state`; the other modes are
    opt-in when the caller specifically doesn't want one half of the
    work. Note the tool named `screenshot` is separate (raw PNG, no AX
    walk) and unrelated to the capture mode.
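
    The table and the alias rule can be mirrored client-side when an agent needs to decide how to address elements from a `capture_mode` value it has read. `addressing_for` is a hypothetical helper written for illustration; the driver itself does not expose it.

    ```shell
    # Hypothetical lookup mirroring the capture_mode table above.
    # The deprecated alias `screenshot` is treated as `vision`.
    addressing_for() {
      case "$1" in
        som)               echo "element_index preferred, pixel fallback" ;;
        ax)                echo "element_index only" ;;
        vision|screenshot) echo "pixel only" ;;
        *)                 echo "unknown capture_mode: $1" >&2; return 1 ;;
      esac
    }
    ```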
    
    When a snapshot looks wrong (tiny screenshot / empty tree), check
    `cua-driver get_config` for `capture_mode` before anything else.
    
    Pure-vision mode has its own caveats — Claude Code's vision
    pipeline downsamples dense text aggressively, so pixel grounding
    takes multiple correction cycles on text-heavy UIs. Read
    [SCREENSHOT.md](./SCREENSHOT.md) before driving anything in that
    mode; it documents the iterate/annotate/verify recipe plus the
    JPEG-over-PNG finding.
    
    **Window state → what works**
    
    | state | `get_window_state` | `click`/`set_value` (AX) | `press_key` commit (Return/Space/Tab) | pixel click |
    |---|---|---|---|---|
    | frontmost | ✅ | ✅ | ✅ | ✅ |
    | backgrounded / visible | ✅ | ✅ | ✅ | ✅ |
    | **minimized** (Dock genie) | ✅ | ✅ (no deminiaturize — AX actions fire on the minimized window in place) | ❌ silent no-op / system beep — use `set_value` or click equivalent | ❌ no on-screen bounds |
    | hidden (`hides=true` / `NSApp.hide`) | ✅ | ✅ | depends | ❌ |
    | on another Space | ⚠️ AX tree often stripped to menu-bar-only on SwiftUI apps (System Settings) — AppKit apps usually fine. Response carries `off_space: true` + `window_space_ids` so you can detect it | ✅ | ✅ | ❌ window not in current-Space list |
    
    **Critical cell — minimized + keyboard commit.** The keystroke
    reaches the app but AX focus doesn't propagate to renderer focus on
    a minimized window. Workarounds in order of preference:
    `set_value` to write the field's entire value directly, or AX-click
    a commit-equivalent button (Go, Submit, checkbox). Tell the user
    the window needs to un-minimize only as a last resort.
    
    ## The canonical loop
    
    ```
    launch_app(target)
      → pick window_id from the returned `windows` array
        (or call list_windows(pid) separately)
      → get_window_state(pid, window_id)
        → [act]  # every action also takes (pid, window_id)
      → get_window_state(pid, window_id) → verify
    ```
    
    `launch_app` now returns a `windows` array alongside the pid, so the
    common case collapses to two calls (`launch_app` → `get_window_state`)
    without a separate `list_windows` hop.
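
    The two-call case might look like the sketch below. It assumes `launch_app` prints JSON with quoted `"pid"` and `"window_id"` keys (the abbreviated output shown earlier suggests those names, but the exact serialization is an assumption). `DRIVER`, `json_first_int`, and `launch_and_snapshot` are hypothetical names. Taking the first `window_id` is a simplification for the single-window case; with several windows, pick one deliberately.

    ```shell
    DRIVER="${DRIVER:-cua-driver}"   # hypothetical override, defaults to the real CLI

    # Print the first integer value stored under <key> in <json-text>.
    # grep-based on purpose so the sketch has no dependencies; jq is the
    # sturdier choice when it is installed.
    json_first_int() {
      printf '%s\n' "$1" | grep -o "\"$2\" *: *[0-9][0-9]*" | head -n 1 | grep -o '[0-9][0-9]*$'
    }

    # launch_app, then snapshot the first window it reported.
    launch_and_snapshot() {
      launched=$("$DRIVER" launch_app "$(printf '{"bundle_id":"%s"}' "$1")") || return 1
      pid=$(json_first_int "$launched" pid)
      wid=$(json_first_int "$launched" window_id)
      [ -n "$pid" ] && [ -n "$wid" ] || {
        echo "no pid/window_id found in launch_app output" >&2; return 1; }
      "$DRIVER" get_window_state "$(printf '{"pid":%d,"window_id":%d}' "$pid" "$wid")"
    }
    ```

    Usage would be `launch_and_snapshot com.apple.calculator`, followed by an action and a second `get_window_state` on the same pair to verify.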
    
    ### 1. Resolve target pid — always via `launch_app`
    
    **Always start with `launch_app`**, whether or not the target is already
    running. It's idempotent (relaunching returns the existing pid with no
    side effects) and gives you the pid in one call — no `list_apps` hop.
    
    - `launch_app({bundle_id: "com.apple.finder"})` — preferred, unambiguous.
    - `launch_app({name: "Calculator"})` — when bundle_id isn't known.
    
    `launch_app` is a **hidden-launch primitive by design** — that's the
    entire point of cua-driver: agents drive apps in the background while
    the user keeps typing in their real foreground app. The target's
    window is initialized (AX tree fully populated, clickable via
    `element_index`, the pid appears in `list_apps`) but not drawn on
    screen. The driver never activates or unhides apps on its own; that
    would violate the no-foreground contract the whole driver exists to
    protect.
    
    If the user explicitly wants the window visible (usually for a demo
    or recording), they unhide it themselves — Dock click, Cmd-Tab, or
    Spotlight. Do not reach for `open` / `osascript activate` as a
    shortcut to make the window visible; those paths break the backgrounded
    invariant on every call, not just the call that "needed" the
    foreground. Say out loud what the user needs to do ("click the
    Todo app in your Dock to bring it forward") and let them do it.
    
    Never shell out to **any** form of `open` (including `open
    <path-to-App.app>` for a just-built binary — resolve the bundle id
    from `Info.plist` and use `launch_app` with that), `osascript 'tell
    app … to launch/open'`, or similar. Those paths activate the target,
    bypass the driver's focus-restore guard, and require a Bash
    permission prompt the agent loop shouldn't be burning on app launch.
    See "Defaults" above (the "always prefer cua-driver over shell
    shims" section) for the full intent → tool mapping.
    
    `list_apps` is for app-level discovery (answering "what's installed /
    running / frontmost?") — not part of the core action loop. Skip it in
    the loop. For **window-level** questions — "does this app have a
    visible window?", "which Space is this window on?", "which of this
    pid's windows is the main one?" — call `list_windows` instead; the
    app record doesn't carry window state on purpose. In the common
    single-window case you can skip `list_windows` entirely and read the
    `windows` array that `launch_app` already returned.
    
    ### 2. Snapshot and act by element_index
    
    Call `get_window_state({pid, window_id})` with the `window_id` from
    `launch_app`'s `windows` array (or a fresh `list_windows({pid})` if
    you're interacting with a long-lived process). The default `som`
    capture_mode returns **both the AX tree and screenshot**, so the
    canonical loop works immediately without any config change. The rest
    of this section walks through `som` mode. If you're in `vision` mode
    (PNG only, no AX tree), flip back: `cua-driver set_config '{"key":
    "capture_mode", "value": "som"}'`.
    
    In `som` mode (the default) the response carries:
    
    - `tree_markdown` — every actionable element tagged `[N]`. That `N`
      is the `element_index`. The tree can be very large (Finder is
      ~1600 elements, ~190 KB); when it exceeds token limits the MCP
      harness saves it to a file and returns the path. Use `Bash` +
      `jq -r '.tree_markdown'` + `grep` to pull the section you need.
    - `screenshot_file_path` — absolute path to the saved screenshot when
      `screenshot_out_file` was passed. Absent otherwise.
    - `screenshot_width` / `_height` / `_scale_factor` — dimensions of the
      captured image. Present whenever a screenshot was taken.

    **Getting the screenshot as a file (CLI and context-constrained agents):**
    
    ```bash
    # write to file — stdout stays readable (AX tree / summary only, no base64)
    cua-driver get_window_state '{"pid":N,"window_id":W,"screenshot_out_file":"/tmp/shot.jpg"}'
    
    # CLI --screenshot-out-file flag is equivalent and works for all capture modes
    cua-driver get_window_state '{"pid":N,"window_id":W}' --screenshot-out-file /tmp/shot.jpg
    ```
    
    Pass `screenshot_out_file` when using `get_window_state` via CLI or from an
    agent whose context window can't absorb ~31 KB of inline base64 (e.g.
    OpenCode with a local Ollama model). The MCP image content block is omitted
    from the response when this param is set — the model receives only the AX
    tree and `screenshot_file_path`, then reads the image from disk.
    
    **Reason over both the tree AND the screenshot — they're
    complementary, not redundant.** In `som` mode every
    turn's `get_window_state` gives you both halves and you should pull
    signal from each:
    
    - The **AX tree** tells you *what's clickable* — roles, labels,
      `element_index` handles, advertised actions, parent-child
      structure. This is the ground truth for dispatching.
    - The **screenshot** tells you *which one* — the tree often has
      many buttons with similar or empty labels ("Delete", "OK",
      anonymous UUID-labeled buttons, five `AXStaticText = " "`), and
      visual context disambiguates. Captions, colors, layout relationships
      visible in pixels often don't show up in the AX tree at all
      (especially in Chromium / Electron / web content).
    
    Canonical pattern: look at the screenshot to decide "the blue
    Subscribe button on the top-right of the video card", then walk the
    tree to find the matching `AXButton` and dispatch by its
    `element_index`. Don't try to do it from just the tree — you'll
    pick the wrong element when labels repeat. Don't try to do it from
    just the screenshot — you lose the reliable AX-action path and the
    safe backgrounded-dispatch.
    
    Reach for pixel coordinates only when the target is a canvas /
    video / WebGL / custom-drawn surface that isn't in the AX tree
    (see Pixel-coordinate clicks below).
    
    The `actions=[...]` list on each element is **advisory**, not
    authoritative. cua-driver does not pre-flight check against it —
    `click({pid, element_index})` always attempts `AXPress` (or the
    action you pass) and surfaces whatever the target returns. Many
    apps accept `AXPress` on elements that don't advertise it — Chrome's
    omnibox suggestion `AXMenuItem` is a live example. **Try the click
    first** — pivot only on the returned AX error code.
    
    Dispatch table (every row assumes a `(pid, window_id)` pair from the
    last `get_window_state`; `window_id` is required alongside
    `element_index`, ignored on pixel-only forms unless you want to
    anchor the conversion against a specific window):
    
    | Intent | Tool | Notes |
    |---|---|---|
    | List an app's windows | `list_windows({pid})` | returns `window_id`, `title`, `bounds`, `z_index`, `is_on_screen`, `on_current_space`. Already included in `launch_app`'s response — only call this for long-lived pids |
    | Snapshot a window | `get_window_state({pid, window_id})` | returns `tree_markdown` + `screenshot_*`; populates the `(pid, window_id)` element_index cache |
    | Left click | `click({pid, window_id, element_index})` | default `action: "press"`. Pixel form: `click({pid, x, y})` (`window_id` optional; when supplied, it pins the anchor window). Modifier example: `modifier: ["cmd"]` |
    | Double-click / open | `double_click({pid, window_id, element_index})` | AXOpen when advertised (Finder items, openable rows); else stamped pixel double-click at the element's center. Pixel form: `double_click({pid, x, y})` — primer-gated recipe lands on backgrounded Chromium web content (YouTube fullscreen, Finder open-on-dbl). `click({..., count: 2})` still works and routes through the same recipe; `double_click` is the intent-first spelling |
    | Right click / context menu | `right_click({pid, window_id, element_index})` or `click({pid, window_id, element_index, action: "show_menu"})` | Chromium web-content coerces pixel right-click to left — see `WEB_APPS.md` |
    | Type at cursor | `type_text({pid, text, window_id, element_index})` | `AXSelectedText` write; focuses first |
    | Set whole field value | `set_value({pid, window_id, element_index, value})` | sliders, steppers, text fields; **use for keyboard-commit workarounds on minimized windows** |
    | Scroll | `scroll({pid, direction, amount, by, window_id, element_index})` | synthesizes PageUp/PageDown/arrows via SLEventPostToPid |
    | Focus + send key | `press_key({pid, key, window_id, element_index, modifiers})` | element_index sets AXFocused, then posts key |
    | Send key to pid | `press_key({pid, key, modifiers})` | no focus change; key goes to pid's current focus |
    | Modifier combo | `hotkey({pid, keys})` | e.g. `["cmd","c"]`; posted per-pid, not HID tap |
    | Unicode keystrokes | `type_text({pid, text, delay_ms})` | AX write with automatic CGEvent fallback; reaches Chromium/Electron inputs |
    
    **All keyboard/text primitives require `pid`.** There is no
    frontmost-routed variant — every key goes to the named target via
    `CGEvent.postToPid`, so the driver cannot leak keystrokes into the
    user's foreground app.
    
    **Why `element_index` is the primary path:** works on hidden /
    occluded / off-Space windows, no focus steal, stable across
    rebuilds, labels tell you what you're clicking. Reach for pixel
    coordinates only when AX can't.
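
A minimal sketch of how a client might assemble `click` arguments so that exactly one addressing mode is used per call. `build_click_args` is a hypothetical helper and not part of cua-driver; only the parameter names (`pid`, `window_id`, `element_index`, `x`, `y`, `modifier`) come from the table above.

```python
from typing import Optional


def build_click_args(
    pid: int,
    window_id: Optional[int] = None,
    element_index: Optional[int] = None,
    x: Optional[int] = None,
    y: Optional[int] = None,
    modifier: Optional[list] = None,
) -> dict:
    """Return the argument dict for a `click` call (hypothetical helper)."""
    by_index = element_index is not None
    by_pixel = x is not None or y is not None
    if by_index == by_pixel:
        raise ValueError("pass exactly one of element_index or (x, y)")
    if by_pixel and (x is None or y is None):
        raise ValueError("pixel clicks need both x and y")

    args = {"pid": pid}
    if window_id is not None:
        args["window_id"] = window_id
    if by_index:
        args["element_index"] = element_index  # primary path: AX-addressed
    else:
        args.update({"x": x, "y": y})  # fallback: window-local pixels
    if modifier:
        args["modifier"] = list(modifier)
    return args
```

Rejecting mixed or empty addressing up front keeps an agent from silently falling back to pixels when it meant to click by index.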
    
    ### Pixel-coordinate clicks
    
    The pixel path (`click({pid, x, y})`) is for surfaces the AX tree
    doesn't reach — canvases, video players, WebGL, custom-drawn controls.
    Coords are **window-local screenshot pixels** (same space as the PNG
    `get_window_state` returns). Top-left origin, y-down. The driver
    handles screen-point conversion internally. Passing `window_id`
    alongside `x, y` is optional but recommended — it pins the
    coordinate conversion to the window whose screenshot produced the
    pixel, rather than the driver's heuristic choice.
    
    #### Reading coordinates from the PNG
    
    PNGs returned by `get_window_state` are capped at **1568 px
    long-side by default** (`max_image_dimension` config), matching
    Anthropic's multimodal-vision downsampling limit. That means the
    image the model reasons over and the image the click tool's
    coordinate system lives in are the **same resolution** — just look
    at the PNG, pick a pixel, click at that pixel. No scaling math.
    
    This is the default because the mismatch between "rendered
    thumbnail" and "native PNG" was a recurring coord-estimation
    footgun. If you opt out (explicit `max_image_dimension=0` for
    pixel-perfect verification flows), the old rule applies: don't
    eyeball coords from whatever your client renders — it may be
    2-4× smaller than the PNG on disk, and a 2% error in thumbnail
    space becomes ~80 px in the real image. Use the crosshair recipe
    below against the full-resolution file in that case.
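
The arithmetic behind the cap and the opt-out case can be sketched as follows. Both functions are hypothetical helpers; the sketch assumes the cap scales both sides by one factor and rounds to the nearest pixel, which may differ from the driver's actual rounding. The 4000 px image width in the usage note is an assumed figure chosen to match the "~80 px" estimate above.

```python
def capped_size(width: int, height: int, max_dim: int = 1568) -> tuple:
    """Scale (width, height) so the long side is at most max_dim.

    max_dim <= 0 means "no cap", mirroring max_image_dimension=0.
    """
    long_side = max(width, height)
    if max_dim <= 0 or long_side <= max_dim:
        return (width, height)
    scale = max_dim / long_side
    return (max(1, round(width * scale)), max(1, round(height * scale)))


def thumbnail_to_png(x: int, y: int, thumb_size: tuple, png_size: tuple) -> tuple:
    """Map a point read off a client-rendered thumbnail to PNG pixels."""
    thumb_w, thumb_h = thumb_size
    png_w, png_h = png_size
    return (round(x * png_w / thumb_w), round(y * png_h / thumb_h))
```

With the default cap no mapping is needed. With the cap disabled, a 20 px misread on a 1000 px wide thumbnail (2%) maps to 80 px on a 4000 px wide PNG: `thumbnail_to_png(20, 0, (1000, 625), (4000, 2500))` gives `(80, 0)`.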
    
    1. `get_window_state({pid, window_id})` returns an image capped
       at 1568 long-side (default) plus its dimensions
       (`screenshot_width` / `screenshot_height`). Write the bytes to
       disk with `--screenshot-out-file <path>` in any capture mode — works
       identically in `vision` (where it's the only way) and `som`
       (where it sidesteps the jq + base64 dance on the spliced
       `screenshot_png_b64` field).
    2. You are a multimodal model — look at the PNG. Since the PNG
       matches what you see, pick the target pixel directly. No
       fractional math needed.
    3. When precision matters (small targets, dense UIs), draw a
       crosshair on the image (do **not** crop — cropping loses the
       coordinate system and requires error-prone offset math) and
       show it before clicking:
    
    ```python
    from PIL import Image, ImageDraw
    img = Image.open('/tmp/shot.png')
    draw = ImageDraw.Draw(img)
    x, y = 640, 360  # example target pixel; substitute the coordinate you picked
    r = 18
    draw.ellipse([x-r, y-r, x+r, y+r], outline='red', width=4)
    draw.line([x-30, y, x+30, y], fill='red', width=3)
    draw.line([x, y-30, x, y+30], fill='red', width=3)
    img.save('/tmp/shot_annotated.png')
    ```
    
    4. Only dispatch the click after the user (or your own re-read of
       the annotated image) confirms the crosshair is on target.
    
    #### Addressing variants
    
    - `click({pid, x, y})` — single left-click.
    - `click({pid, x, y, count: 2})` — double-click.
    - `click({pid, x, y, modifier: ["cmd"]})` — cmd-click. Accepts any
      subset of `cmd/shift/option/ctrl`.
    - `right_click({pid, x, y})` — also takes `modifier`.
    
    The pixel path animates the agent cursor overlay but never warps
    the real cursor. If the pid has no on-screen window the call errors
    with `pid X has no on-screen window` — you need a visible window to
    anchor the conversion.
    
    #### How the pixel click is dispatched
    
    The recipe is the backgrounded "noraise" sequence: yabai's
    focus-without-raise SLPS event records followed by an off-screen
    user-activation primer and the real click, all stamped via
    `SLEventPostToPid`. The target app becomes AppKit-active for event
    routing but its window does **not** rise to the front of the
    z-stack, and macOS's "switch to Space with windows for app" follow
    is suppressed. Full mechanics in
    `Sources/CuaDriverCore/Input/MouseInput.swift` (`clickViaAuthSignedPost`)
    and the companion `FocusWithoutRaise.swift`.
    
    #### Known limits
    
    - **Chromium `<video>` play/pause**: pixel click is often rejected
      by HTML5's click-to-play handler on some builds. Use keyboard
      instead: `press_key({pid, key: "k"})` (YouTube) or
      `press_key({pid, key: "space"})` (generic). Keyboard events
      travel through a different auth envelope.
    - **Pixel right-click on Chromium web content** coerces to a
      left-click — a known Chromium renderer-IPC limitation that affects
      every non-HID-tap synthesis path. For context menus on
      AX-addressable elements (links, buttons, toolbar items), use
      `right_click({pid, element_index})` instead.
    
    ### Canvases, viewports, games (Blender, Unity, GHOST, Qt, wxWidgets)
    
    Apps whose main surface is an OpenGL / Metal / Qt / wxWidgets
    viewport expose **no useful AX tree** — the whole surface is one
    opaque `AXGroup` or `AXWindow` from AX's perspective. Per-pid event
    paths (`SLEventPostToPid`, `CGEvent.postToPid`) are filtered by the
    viewport's own event-source check and silently dropped — the event
    loop wants "real HID origin".
    
    The working pattern:
    
    1. Bring the target frontmost (a brief `osascript activate` is
       acceptable here — this is the carve-out the skill's osascript
       gate allows).
    2. `CGEvent.post(tap: .cghidEventTap)` with a leading `mouseMoved`
       event (~30 ms before the click). `cua-driver click` when the
       target is frontmost automatically takes this path.
    3. Accept that the real cursor visibly moves — `cghidEventTap` is
       the system HID stream, the cursor warps to the click point.
    
    There is no backgrounded path that reaches these apps today.
    
    ## Navigating native menu bars (AXMenuBar)
    
    **Only drive the menu bar when the target app is frontmost.** This
    is the single most-misused cua-driver capability. If the target is
    backgrounded, don't reach for `AXMenuBarItem` + AXPick — use
    in-window `element_index` or pixel clicks instead. Two reasons, one
    functional and one perceptual:
    
    - **Functional:** menu items that touch document/playback/editor
      state go `DISABLED` when their owning app isn't the key window
      (Preview rotate, IINA speed change, most editor commands). AXPick
      + AXPress will dispatch successfully from the driver's side but
      no-op at the target — you get a silent false-pass.
    - **Perceptual (matters for demos, screen recordings, and anything
      the user watches live):** macOS's screen-rendered menu bar
      always belongs to the *frontmost* app. AXPick on a backgrounded
      app's `AXMenuBarItem` dispatches to that app's per-process menu at
      the AX layer, but any visible menu render happens over the
      frontmost app's menu bar — the viewer sees an IINA submenu
      flashing on top of Chrome's menus, which reads as "the agent
      clicked the wrong app." The AX call was correct; the frame the
      user sees is not. For recorded or observed sessions, this is an
      integrity bug even though it's not a correctness bug.
    
    **Good decision rule:** if the target is not already frontmost, do
    not use `AXMenuBarItem` at all. For *reading* in-window state,
    snapshot the window AX tree — most apps expose the same state via
    an in-window `AXStaticText`, title bar, or toolbar. For *dispatching*
    actions, use in-window `element_index` (buttons, toolbar items) or
    pixel clicks on in-window controls — both dispatch via AppKit's
    window-under-pointer hit-test and are **not** frontmost-gated.
    
    When the target IS frontmost, the menu-bar flow below is fine and
    the canonical path for menus.
    
    ### The two-snapshot pattern (target frontmost only)
    
    Menu contents are a two-snapshot flow. Closed AXMenu subtrees are
    deliberately skipped during snapshot — otherwise every app's File /
    Edit / View hierarchy plus every Recent Items macOS has ever seen
    would inflate the tree 10-100x. But once a menu is *open*, its
    AXMenuItem children do receive `element_index` values so you can
    click them normally.
    
    1. Find the `[N] AXMenuBarItem "<Menu Name>"` in the tree.
    2. `click({pid, element_index: N, action: "pick"})` — menu bar items
       implement `AXPick` ("open my submenu"), not `AXPress`. Using the
       default action on an AXMenuBarItem is a no-op.
    3. Re-snapshot. The expanded menu's items now appear under the bar
       item as `[M] AXMenuItem "<Item Name>"`.
    4. Click the target item — most items respond to `AXPress` (default
       action). Submenus nest under the item and are walked the same way.
    5. Re-snapshot and verify.
    
    If you ever need to back out without selecting, `press_key({pid, key:
    "escape"})` closes the open menu. Leaving a menu expanded between
    turns poisons subsequent snapshots for that pid.
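
The two-snapshot flow above can be sketched as below. This is an illustration under assumptions: `call_tool(name, args)` is a hypothetical callable supplied by the caller, it is assumed to return the AX tree as plain text for `get_window_state`, and the regular expression only covers the `[N] Role "Title"` line shape quoted in the steps.

```python
import re
from typing import Callable, Optional

# Matches lines such as: [12] AXMenuBarItem "File"
_ELEMENT_LINE = re.compile(r'\[(\d+)\]\s+(\w+)\s+"([^"]*)"')


def find_element_index(tree_text: str, role: str, title: str) -> Optional[int]:
    """Return the first element_index whose role and title match, else None."""
    for match in _ELEMENT_LINE.finditer(tree_text):
        index, found_role, found_title = match.groups()
        if found_role == role and found_title == title:
            return int(index)
    return None


def open_menu_item(call_tool: Callable, pid: int, window_id: int,
                   menu: str, item: str) -> bool:
    """Open a menu with AXPick, re-snapshot, then press the item."""
    tree = call_tool("get_window_state", {"pid": pid, "window_id": window_id})
    bar_index = find_element_index(tree, "AXMenuBarItem", menu)
    if bar_index is None:
        return False
    call_tool("click", {"pid": pid, "element_index": bar_index, "action": "pick"})

    # Indices belong to the snapshot that produced them, so snapshot again.
    tree = call_tool("get_window_state", {"pid": pid, "window_id": window_id})
    item_index = find_element_index(tree, "AXMenuItem", item)
    if item_index is None:
        # Back out rather than leaving the menu expanded.
        call_tool("press_key", {"pid": pid, "key": "escape"})
        return False
    call_tool("click", {"pid": pid, "element_index": item_index})
    return True
```

A `True` return only means both clicks were dispatched; the caller still needs the final re-snapshot to verify the effect.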
    
    ### Commands gated on the target being frontmost
    
    Some menu items and global shortcuts (Preview's Tools → Rotate
    Right, ⌘R; anything in the View menu that manipulates the current
    document; most editor commands) are **disabled unless the target
    app is the key / frontmost window**. You'll see it in the AX tree
    as `DISABLED` on the menu item even though the user's intent is
    obviously valid.
    
    Before activating, confirm you're in this narrow case — the menu
    item still reads `DISABLED` after a fresh snapshot AND the action
    the user requested genuinely requires frontmost (Preview rotate,
    View menu document manipulation, editor commands). If either
    check fails, don't activate.
    
    When both checks pass, the driver has no `activate` tool
    (deliberately — the whole point is backgroundable control), so
    this is the one legitimate `osascript` fallback:
    
    ```
    osascript -e 'tell application "<App Name>" to activate'
    ```
    
    Then re-snapshot — the menu item loses its `DISABLED` tag — and
    `click({action: "pick"})` the item. Alternatively, a `hotkey`
    call delivered to the now-frontmost app works for the shortcut
    form (`⌘R`, `⌘+`, etc.).
    
    **Always name the focus steal in your response** so the user isn't
    surprised — "Briefly activating Preview to enable Tools → Rotate
    Right" or similar. Don't silently steal focus. You don't need to
    restore the previous frontmost afterwards unless the user asks —
    they can cmd-tab back.
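
A small sketch of the gate described above: activate only when both checks pass, and announce the focus steal. All three functions are hypothetical helpers; the only command taken from the text is the `osascript` invocation, built here as an argv list that could be passed to `subprocess.run`.

```python
def needs_activation(disabled_after_fresh_snapshot: bool,
                     action_requires_frontmost: bool) -> bool:
    """Both checks from the text must hold before stealing focus."""
    return disabled_after_fresh_snapshot and action_requires_frontmost


def activate_command(app_name: str) -> list:
    """Build the osascript argv; quotes and backslashes are escaped for AppleScript."""
    escaped = app_name.replace("\\", "\\\\").replace('"', '\\"')
    return ["osascript", "-e", f'tell application "{escaped}" to activate']


def focus_steal_notice(app_name: str, reason: str) -> str:
    """Message to show the user so the activation is never silent."""
    return f"Briefly activating {app_name} to enable {reason}"
```

Building the command separately from running it makes the gate easy to test and keeps the activation a visible, deliberate step.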
    
    ## Web-rendered apps (browsers, Electron, Tauri)
    
    For Chrome / Edge / Brave / Arc / Safari, Electron apps (Slack,
    VSCode, Notion, Discord), and Tauri apps — see **`WEB_APPS.md`**.
    
    Covers: sparse AX tree population (retry-once pattern for Chromium),
    URL navigation via omnibox suggestions, the `set_value` workaround
    for keyboard commits on **minimized** windows (Return silently
    no-ops — symptom is a macOS system beep; use `set_value` or click a
    clickable equivalent), scrolling via synthetic PageUp/Down keystrokes,
    in-page clicks, and typing into web inputs.
    
    Chromium web content specifically also coerces `right_click` back to
    left — use `element_index` for AX-addressable targets and accept the
    limit otherwise.
    
    ### Browser JS primitives — `page` tool and `get_window_state(javascript=)`
    
    When the AX tree doesn't expose the data you need (common in
    Chromium/Electron — the tree is sparse for web content), use the
    `page` tool or the `javascript` param on `get_window_state` to query
    the DOM directly via Apple Events. Requires "Allow JavaScript from
    Apple Events" to be enabled — see `WEB_APPS.md` for the setup path.
    
    **Three actions on the `page` tool:**
    
    - `page({pid, window_id, action: "get_text"})` — returns
      `document.body.innerText`. Fastest way to read page content, prices,
      article text, or any raw text the AX tree truncates or omits.
    
    - `page({pid, window_id, action: "query_dom", css_selector: "a[href]",
      attributes: ["href"]})` — runs `querySelectorAll` and returns each
      match's tag, text, and requested attributes as a JSON array. Use for
      table rows, link hrefs, data attributes, structured page data.
    
    - `page({pid, window_id, action: "execute_javascript", javascript:
      "..."})` — raw JS. Wrap in an IIFE with try-catch. Don't use this for
      elements already indexed by `get_window_state` — `click` and
      `set_value` are more reliable there.
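
One way to follow the "wrap in an IIFE with try-catch" advice is sketched below. `wrap_in_iife` and `execute_javascript_args` are hypothetical helpers; the argument names (`pid`, `window_id`, `action`, `javascript`) are the ones shown in the bullets above, and the error shape returned from the `catch` branch is an arbitrary choice for illustration.

```python
import textwrap


def wrap_in_iife(js_body: str) -> str:
    """Wrap JS statements in an IIFE so a thrown error comes back as data."""
    return (
        "(() => {\n"
        "  try {\n"
        f"{textwrap.indent(js_body, '    ')}\n"
        "  } catch (err) {\n"
        "    return JSON.stringify({error: String(err)});\n"
        "  }\n"
        "})()"
    )


def execute_javascript_args(pid: int, window_id: int, js_body: str) -> dict:
    """Build the argument dict for page(action: "execute_javascript")."""
    return {
        "pid": pid,
        "window_id": window_id,
        "action": "execute_javascript",
        "javascript": wrap_in_iife(js_body),
    }
```

For example, `wrap_in_iife("return document.title;")` yields a single expression that either evaluates to the title or to a JSON string describing the error.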
    
    **Co-located read — `get_window_state` with `javascript`:**
    
    ```
    get_window_state({pid, window_id, javascript: "document.title"})
    ```
    
    Runs the JS and appends the result as a `## JavaScript result` section
    alongside the AX snapshot — one round-trip instead of two. Use this
    when you need both the element tree (for subsequent clicks) and some
    page data in the same turn.
    
    **Decision rule — AX vs JS:**
    
    | Need | Use |
    |---|---|
    | Click / type into an element | `get_window_state` → `click` / `set_value` (AX, works backgrounded) |
    | Read text the AX tree drops | `page(get_text)` or `get_window_state(javascript=)` |
    | Scrape structured data (tables, hrefs) | `page(query_dom)` |
    | Trigger JS events / mutations | `page(execute_javascript)` |
    
    Supported backends:
    
    | App type | How | Context |
    |---|---|---|
    | Chrome / Brave / Edge | Apple Events `execute javascript` | Full DOM ✅ |
    | Safari | Apple Events `do JavaScript` | Full DOM ✅ |
    | Electron (VS Code, Cursor…) | SIGUSR1 → V8 inspector → CDP | Main process only: `process`, `Buffer` — no `document`, no `require` in sandboxed apps |
    | Electron (with `--remote-debugging-port`) | CDP page target | Full DOM ✅ |
    
    **Electron sandbox note:** SIGUSR1 connects to the Node.js *main* process.
    Sandboxed Electron apps (VS Code, Cursor) strip `require` and Electron
    APIs there. Useful for: `process.env`, `process.versions`, `process.cwd()`,
    `process.pid`. For full DOM/renderer access, launch the app with
    `--remote-debugging-port=9222` — cua-driver will detect and prefer the
    page target automatically.
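
To check by hand whether an app launched with `--remote-debugging-port=9222` exposes a page target, the standard DevTools HTTP endpoint `/json/list` can be queried. The sketch below is a manual diagnostic under that assumption and does not describe how cua-driver performs its own detection; both function names are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def list_targets(port: int = 9222, timeout: float = 2.0) -> list:
    """Fetch the DevTools target list from a locally running app."""
    url = f"http://127.0.0.1:{port}/json/list"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def first_page_target(targets: list) -> Optional[dict]:
    """Return the first target of type "page" (renderer with a DOM), if any."""
    for target in targets:
        if target.get("type") == "page":
            return target
    return None
```

If `first_page_target(list_targets())` returns `None`, only non-page targets are exposed and DOM queries should not be expected to work.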
    
    Arc returns no values; Firefox has no JS-via-AppleEvents support — see
    `WEB_APPS.md` for the full matrix.
    
    ### 3. Re-snapshot and verify — mandatory
    
    **Always** call `get_window_state({pid, window_id})` after the action.
    This isn't optional verification — it's the second half of the
    snapshot invariant.
    
    Check the AX tree diff: a changed value, a new element, a new
    window, or the disappearance of the thing you just clicked (menus
    collapse after selection, buttons may become disabled, etc.). If
    nothing changed, the action likely failed silently — **tell the
    user what you attempted and what you observed**, don't paper over
    with "done" language. Agents that skip this step report success on
    silently-dropped actions — the single most common failure mode.
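
The verification step can be made mechanical by diffing the snapshot text taken before and after the action. This sketch assumes the AX tree is available as plain text, one element per line; `snapshot_changes` and `describe_outcome` are hypothetical helpers built on the standard `difflib` module.

```python
import difflib


def snapshot_changes(before: str, after: str) -> list:
    """Return only the removed ("- ") and added ("+ ") lines between snapshots."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


def describe_outcome(action: str, changes: list) -> str:
    """Report what was attempted and what was observed, without claiming success."""
    if not changes:
        return f"Attempted {action}; no observable change in the AX tree."
    return f"Attempted {action}; observed {len(changes)} changed line(s)."
```

An empty change list is the signal to tell the user the action may have been dropped, not to report "done".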
    
    ## Recording trajectories
    
    Session-scoped action recording + replay, for demos, regressions, and
    training data. Only invoke when the user explicitly asks to record a
    session — the skill does not auto-enable this. CLI surface:
    `cua-driver recording start|stop|status`; raw tool: `set_recording`.
    
    See **`RECORDING.md`** for the full flow: enable/disable, turn folder
    contents, replay via `replay_trajectory`, and the element_index
    doesn't-survive-across-sessions caveat.
    
    ## Common error patterns
    
    | Error text | Meaning | Fix |
    |---|---|---|
    | `No cached AX state for pid X window_id W` | You either skipped `get_window_state` this turn, or passed a different `window_id` to the click than the one the snapshot cached against | Call `get_window_state({pid: X, window_id: W})` first — the same window_id you intend to click in |
    | `Invalid element_index N for pid X window_id W` | Index is stale or out of range | Re-run `get_window_state` with the same window_id, pick a fresh index from the new tree |
    | `window_id W belongs to pid P, not …` | Passed a window_id that's owned by a different process | Use `list_windows({pid: X})` to enumerate this pid's own windows |
    | `AX action AXPress failed with code …` | Element doesn't support AXPress | Try `show_menu`, `confirm`, `cancel`, or `pick` |
    | macOS system-alert beep on `press_key` with no visible change | Target window is minimized; Return / Space / Tab commits don't establish real renderer focus on minimized windows | AX-click a clickable equivalent (Go button, Submit button, checkbox) instead of pressing the key; see "Keyboard commits on minimized windows" under the Browser section |
    | `Accessibility permission not granted` | TCC not granted | Stop; tell user to grant in System Settings |
    | `Screen Recording permission not granted` | TCC not granted for capture | Affects `screenshot` and `get_window_state` (which always captures). Grant in System Settings — the driver can't operate without it |
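
A sketch of how an agent loop might map the error texts in the table to a next step. The patterns are written against the error strings quoted above; `suggest_fix` is a hypothetical helper and the remedy wording is a paraphrase of the Fix column.

```python
import re
from typing import Optional

_REMEDIES = [
    (re.compile(r"No cached AX state for pid \S+ window_id \S+"),
     "call get_window_state with the same pid and window_id, then retry"),
    (re.compile(r"Invalid element_index \S+ for pid"),
     "re-run get_window_state and pick a fresh index"),
    (re.compile(r"window_id \S+ belongs to pid"),
     "use list_windows to enumerate this pid's own windows"),
    (re.compile(r"AX action AXPress failed"),
     "try show_menu, confirm, cancel, or pick"),
    (re.compile(r"(Accessibility|Screen Recording) permission not granted"),
     "stop and ask the user to grant the permission in System Settings"),
]


def suggest_fix(error_text: str) -> Optional[str]:
    """Return the remedy for the first matching pattern, or None if unknown."""
    for pattern, remedy in _REMEDIES:
        if pattern.search(error_text):
            return remedy
    return None
```

Unknown errors return `None` so the caller surfaces them to the user instead of guessing.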
    
    ## Things to avoid
    
    - **Never** reuse an `element_index` across a re-snapshot of the same pid.
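    - **Never** translate screenshot pixels into a click when the target
      is AX-addressable. There the screenshot is for visual
      disambiguation, not coordinates; use the `element_index`. Pixel
      clicks are reserved for surfaces the AX tree doesn't reach (see
      "Pixel-coordinate clicks").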
    - **Prefer AX over pixels.** `click({pid, x, y})` works for
      canvas / WebView regions, but it lands blindly and skips the
      agent-cursor overlay. Exhaust AX paths (menu bars, cmd-k palettes,
      toolbar items, keyboard shortcuts) before dropping to coordinates.
    - **Never** drive destructive actions (delete files, close unsaved
      documents, send messages, submit forms) without explicit user
      intent for that specific destructive step.
    - **Never** launch apps autonomously; confirm with the user first
      unless their original request clearly implies the launch.
    
    ## Example end-to-end task
    
    **User:** "Open the Downloads folder in Finder."
    
    1. `launch_app({bundle_id: "com.apple.finder", urls: ["~/Downloads"]})`
       → `{pid: 844, windows: [{window_id: 6123, title: "Downloads", ...}]}`.
       The launch is idempotent, and Finder opens a hidden window rooted
       at `~/Downloads` via `application(_:open:)`, with zero activation
       and no focus steal. The `windows` array lets you skip a
       `list_windows` hop.
    2. `get_window_state({pid: 844, window_id: 6123})` → verify an
       `AXWindow` whose title contains "Downloads" is present with a
       populated AX subtree (sidebar, list view, files).
    3. Done.
    
    If the user instead asks to navigate *within* an already-open Finder
    window, use the menu-bar flow from the "Navigating native menu bars"
    section above (click Go → pick a menu item → re-snapshot → click it).
    
  • skills/gui-automation/SKILL.md (skill)
    ---
    name: gui-automation
    description: >-
      Use when you need to visually interact with a GUI — test buttons, fill forms,
      verify visual layouts, fuzz web pages, automate user flows, take screenshots,
      or perform end-to-end QA on any application. Works on cloud VMs, Docker
      containers, local machines, and sandboxes. Install: pip install cua.
    ---
    
    # GUI Automation
    
    CUA gives you **eyes and hands on a real computer**: see the screen, move the
    mouse, click, type, drag, and manage windows — like a human at the keyboard.
    
    Use this skill for **visual interaction** that can't be done via shell or API.
    
    ## Setup
    
    ```bash
    cua --version          # check install; if missing: pip install cua
    
    # Connect to target (pick one)
    cua do switch cloud my-vm
    cua do switch docker my-container
    cua do-host-consent && cua do switch host   # local machine (one-time consent)
    ```
    
    > `ANTHROPIC_API_KEY` is optional. With it, `cua do snapshot` returns an
    > AI-annotated screen with element coordinates. Without it, use `screenshot`
    > and read the image yourself.
    
    ## Workflow
    
    **Look → Act → Verify** — repeat until done, then share:
    
    ```bash
    cua do screenshot          # look
    cua do click 450 280       # act
    cua do screenshot          # verify
    cua trajectory share       # share replay link with user
    ```
    
    > Re-screenshot after every UI change — coordinates go stale when the screen changes.
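
When scripting the loop from Python rather than typing commands, the same sequence can be expressed as argv lists. This is a sketch: `look_act_verify` and `run_all` are hypothetical helpers, and the only CLI invocations used are the `cua do screenshot` and `cua do click <x> <y>` forms shown above.

```python
import subprocess
from typing import Callable


def look_act_verify(x: int, y: int) -> list:
    """Return the argv lists for one Look, Act, Verify cycle."""
    return [
        ["cua", "do", "screenshot"],             # look
        ["cua", "do", "click", str(x), str(y)],  # act
        ["cua", "do", "screenshot"],             # verify
    ]


def run_all(commands: list, runner: Callable = subprocess.run) -> None:
    """Run each command in order, stopping on the first failure."""
    for argv in commands:
        runner(argv, check=True)
```

Passing a different `runner` makes the sequence easy to dry-run or log before touching a real machine.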
    
    ## Scenarios
    
    ### Click a button
    
    ```bash
    cua do screenshot
    cua do click 450 280
    cua do screenshot
    ```
    
    ### Fill a form
    
    ```bash
    cua do screenshot
    cua do click 400 200 && cua do type "Jane Doe"
    cua do key tab            && cua do type "jane@example.com"
    cua do key tab            && cua do type "SecureP@ss123"
    cua do click 400 500
    cua do screenshot
    ```
    
    ### File upload dialog
    
    ```bash
    cua do click 350 400       # "Choose File"
    cua do type "/home/user/report.pdf"
    cua do key enter
    cua do screenshot
    ```
    
    ### Zoom in for precision clicks (host or small targets)
    
    When clicking small or dense UI elements — especially on the host machine —
    zoom into the target window first. Coordinates become **window-relative** and
    screenshots show only that window, giving you higher effective resolution.
    
    ```bash
    cua do zoom "Google Chrome"   # crop to Chrome window; coords are now window-relative
    cua do screenshot              # zoomed view — easier to locate small elements
    cua do click 112 44            # precise click on a small tab or button
    cua do screenshot              # verify
    cua do unzoom                  # restore full-screen coords when done
    cua do screenshot              # back to full desktop view
    ```
    
    > Use `zoom` any time click accuracy is uncertain. `unzoom` before switching
    > windows or when you need to see the full desktop again.
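
The switch between full-screen and window-relative coordinates is a plain offset if the window is shown at its native scale. The sketch below assumes exactly that, and assumes the window's top-left corner in screen coordinates is known; both helpers are hypothetical and do not model any rescaling `zoom` may apply.

```python
def to_window_relative(screen_x: int, screen_y: int, window_origin: tuple) -> tuple:
    """Convert full-screen coordinates to coordinates inside the zoomed window."""
    origin_x, origin_y = window_origin
    return (screen_x - origin_x, screen_y - origin_y)


def to_screen(window_x: int, window_y: int, window_origin: tuple) -> tuple:
    """Convert window-relative coordinates back to full-screen coordinates."""
    origin_x, origin_y = window_origin
    return (window_x + origin_x, window_y + origin_y)
```

For a window whose top-left corner sits at `(200, 100)`, the screen point `(312, 144)` becomes `(112, 44)` after zooming. Coordinates read before `zoom` must not be reused after it, and vice versa.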
    
    ### Drag and drop
    
    ```bash
    cua do window ls               # list open windows
    cua do drag 150 300 650 400    # source → destination
    cua do screenshot
    ```
    
    ### Fuzz a form
    
    ```bash
    cua do screenshot
    cua do click 400 200
    cua do type "<script>alert(1)</script>"
    cua do key tab && cua do type "'; DROP TABLE users; --"
    cua do key tab && cua do type "AAAAAAAAAAAAAAAAAAAAAAA"
    cua do click 400 500
    cua do screenshot              # check for errors, crashes, unexpected behavior
    ```
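
Fuzz payloads often contain quotes and semicolons that a shell would interpret, so generating the commands as argv lists avoids quoting mistakes. A minimal sketch, assuming the same field layout as the example above; `fuzz_commands` and `as_shell_line` are hypothetical helpers, and the payload list simply repeats the three strings already shown.

```python
import shlex

PAYLOADS = [
    "<script>alert(1)</script>",
    "'; DROP TABLE users; --",
    "AAAAAAAAAAAAAAAAAAAAAAA",
]


def fuzz_commands(first_field: tuple, submit_button: tuple, payloads=PAYLOADS) -> list:
    """Click the first field, type one payload per field, then submit."""
    commands = [
        ["cua", "do", "screenshot"],
        ["cua", "do", "click", str(first_field[0]), str(first_field[1])],
    ]
    for position, payload in enumerate(payloads):
        if position > 0:
            commands.append(["cua", "do", "key", "tab"])  # move to the next field
        commands.append(["cua", "do", "type", payload])
    commands.append(["cua", "do", "click", str(submit_button[0]), str(submit_button[1])])
    commands.append(["cua", "do", "screenshot"])
    return commands


def as_shell_line(argv: list) -> str:
    """Render an argv list as a safely quoted shell line for logs or scripts."""
    return shlex.join(argv)
```

Running the argv lists directly never needs quoting; `as_shell_line` is only for producing a copy-pasteable record of what was sent.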
    
    ## Trajectory
    
    Every action is auto-recorded to `~/.cua/trajectories/{machine}/{session}/`.
    
    ```bash
    cua trajectory share           # upload and get shareable HTTPS link (always do this at end)
    cua trajectory ls              # list sessions
    cua trajectory export          # generate HTML report
    cua do --no-record click 100 200   # disable recording for a single action
    ```
    
    Tell the user: `"Here is the trajectory of my session: {url}"`
    
    ## Quick Reference
    
    | Action              | Command                                      |
    | ------------------- | -------------------------------------------- |
    | Connect to target   | `cua do switch <provider> [name]`            |
    | Screenshot          | `cua do screenshot`                          |
    | AI-annotated screen | `cua do snapshot ["instructions"]`           |
    | Click               | `cua do click <x> <y> [left\|right\|middle]` |
    | Double-click        | `cua do dclick <x> <y>`                      |
    | Type text           | `cua do type "text"`                         |
    | Press key           | `cua do key <key>`                           |
    | Hotkey              | `cua do hotkey <combo>` (e.g. `ctrl+c`)      |
    | Scroll              | `cua do scroll <direction> [amount]`         |
    | Drag                | `cua do drag <x1> <y1> <x2> <y2>`            |
    | Move cursor         | `cua do move <x> <y>`                        |
    | Shell command       | `cua do shell "command"`                     |
    | Open URL/file       | `cua do open <url\|path>`                    |
    | List windows        | `cua do window ls [app]`                     |
    | Focus window        | `cua do window focus <id>`                   |
    | Zoom to window      | `cua do zoom "App Name"`                     |
    | Unzoom              | `cua do unzoom`                              |
    | Share trajectory    | `cua trajectory share`                       |
    
    ## Providers
    
    | Provider     | Example                             |
    | ------------ | ----------------------------------- |
    | `cloud`      | `cua do switch cloud my-vm`         |
    | `cloudv2`    | `cua do switch cloudv2 my-vm`       |
    | `docker`     | `cua do switch docker my-container` |
    | `lume`       | `cua do switch lume my-vm`          |
    | `lumier`     | `cua do switch lumier my-vm`        |
    | `winsandbox` | `cua do switch winsandbox`          |
    | `host`       | `cua do switch host`                |
    
    See [references/command-reference.md](references/command-reference.md) for full argument syntax.
    
  • libs/cuabot/src/prompts/.mcp.json (mcp_server)
    {
      "mcpServers": {
        "computer-use": {
          "command": "/home/user/.local/bin/uv",
          "args": ["run", "--script", "/home/user/.cuabot/mcp/computer-use-mcp.py"]
        }
      }
    }
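
Because this config hard-codes absolute paths under `/home/user/`, a quick structural check before use can catch a malformed entry. A minimal sketch, assuming only the shape visible above (`mcpServers` mapping names to `command` and `args`); `check_mcp_config` is a hypothetical helper, not part of cuabot.

```python
import json


def check_mcp_config(text: str) -> list:
    """Return a list of problems found in an .mcp.json document (empty if none)."""
    config = json.loads(text)
    problems = []
    for name, server in config.get("mcpServers", {}).items():
        if not isinstance(server.get("command"), str):
            problems.append(f"{name}: command must be a string")
        if not isinstance(server.get("args", []), list):
            problems.append(f"{name}: args must be a list")
    return problems
```

The check is structural only; it does not verify that the `uv` binary or the script path exists on the machine.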
    

README

Build, benchmark, and deploy agents that use computers

Choose Your Path


Cua Driver - Background computer-use on macOS

Drive any native macOS app in the background — agents click, type, and verify without stealing the cursor, focus, or Space, even on non-AX surfaces like Chromium web content and canvas-based tools (Blender, Figma, DAWs, game engines). Use with the CLI or MCP server for Claude Code, Cursor, and custom clients. Every session records as a replayable trajectory.

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/cua-driver/scripts/install.sh)"

Full tool reference, architecture notes, and the Claude Code skill ship with the package: libs/cua-driver/README.md.


Cua - Agent-Ready Sandboxes for Any OS

Build agents that see screens, click buttons, and complete tasks autonomously. One API for any VM or container image — cloud or local.

pip install cua
# Requires Python 3.11 or later
import asyncio

from cua import Sandbox, Image


async def main():
    # Same API regardless of OS or runtime
    async with Sandbox.ephemeral(Image.linux()) as sb:   # or .macos() .windows() .android()
        result = await sb.shell.run("echo hello")        # run a shell command in the sandbox
        screenshot = await sb.screenshot()               # capture the current screen
        await sb.mouse.click(100, 200)
        await sb.keyboard.type("Hello from Cua!")
        await sb.mobile.gesture((100, 500), (100, 200))  # multi-touch gestures


asyncio.run(main())
Supported images: Linux container, Linux VM, macOS, Windows, Android, BYOI (.qcow2, .iso)
Runtimes: Cloud (cua.ai), with some image types marked 🔜 soon; Local (QEMU)

Get Started | Examples | API Reference


CuaBot - Co-op computer-use for any agent

cuabot gives any coding agent a seamless sandbox for computer-use. Individual windows appear natively on your desktop with H.265, shared clipboard, and audio.

npx cuabot                 # Setup onboarding
# Run any agent in a sandbox
cuabot claude              # Claude Code
cuabot openclaw            # OpenClaw in the sandbox

# Run any GUI workflow in a sandbox
cuabot chromium
cuabot --screenshot
cuabot --type "hello"
cuabot --click <x> <y> [button]

Built-in support for agent-browser and agent-device (iOS, Android) out of the box.

Get Started | Installation | First spotted at ClawCon

Cua-Bench - Benchmarks & RL Environments

Evaluate computer-use agents on OSWorld, ScreenSpot, Windows Arena, and custom tasks. Export trajectories for training.

# Install and create base image
cd cua-bench
uv tool install -e . && cb image create linux-docker

# Run benchmark with agent
cb run dataset datasets/cua-bench-basic --agent cua-agent --max-parallel 4

Get Started | Partner With Us | Registry | CLI Reference


Lume - macOS Virtualization

Create and manage macOS/Linux VMs with near-native performance on Apple Silicon using Apple's Virtualization.Framework.

# Install Lume
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh)"

# Pull & start a macOS VM
lume run macos-sequoia-vanilla:latest

Get Started | FAQ | CLI Reference


Packages

| Package | Description |
| --- | --- |
| cuabot | Multi-agent computer-use sandbox CLI |
| cua-agent | AI agent framework for computer-use tasks |
| cua-sandbox | SDK for creating and controlling sandboxes |
| cua-computer-server | Driver for UI interactions and code execution in sandboxes |
| cua-bench | Benchmarks and RL environments for computer-use |
| lume | macOS/Linux VM management on Apple Silicon |
| lumier | Docker-compatible interface for Lume VMs |

Resources

  • Documentation — Guides, examples, and API reference
  • Blog — Tutorials, updates, and research
  • Discord — Community support and discussions
  • GitHub Issues — Bug reports and feature requests

Contributing

We welcome contributions! See our Contributing Guidelines for details.

License

MIT License — see LICENSE for details.

Third-party components have their own licenses:

  • Kasm (MIT)
  • OmniParser (CC-BY-4.0)
  • Optional cua-agent[omni] includes ultralytics (AGPL-3.0)

Trademarks

Apple, macOS, Ubuntu, Canonical, and Microsoft are trademarks of their respective owners. This project is not affiliated with or endorsed by these companies.


Thank you to all our GitHub Sponsors!

coderabbit-cli