Babysitter CLI Surface Spec (cli_tool)
Scope & Intent
- Define the external CLI (`babysitter`) that ships with `@a5c-ai/babysitter-sdk` and gives humans or automation a thin, deterministic shell around run folders produced by the SDK.
- Cover the commands already sketched in `sdk.md` §12 and implemented in `packages/sdk/src/cli/*`: run lifecycle inspection, deterministic orchestration loops, task introspection/execution, and state-repair utilities.
- Keep the interface consistent across macOS, Linux, and Windows shells, honoring `cli_tool` domain guardrails (stable flags/defaults, explicit config precedence, and no sensitive payloads echoed to stdout).
Behavior
- Global invocation
  - Binary name `babysitter`. Subcommands follow `babysitter <area>:<verb>` (e.g., `run:iterate`).
  - Supported top-level flags on every command: `--runs-dir <path>` (advanced override; default root is `~/.a5c/runs`, or `<repo>/.a5c/runs` when `BABYSITTER_RUNS_SCOPE=repo`), `--json`, `--dry-run` (commands that mutate state must honor it), and `--verbose` (when set, log filesystem paths and resolved options to stderr).
  - Exit codes: `0` for success, `1` for expected user errors (bad args, missing run), `>1` for unexpected crashes. `--json` never changes exit semantics.
  - All paths returned to the user are normalized to POSIX separators relative to `<runDir>`, even on Windows; the CLI accepts either slash style as input.
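
The runs-root precedence and POSIX normalization rules can be illustrated with a minimal sketch. The helper names `resolveRunsRoot` and `toPosixRef` are hypothetical and are not taken from the SDK; the home directory and repository root are passed in as parameters so the example stays pure, and the real implementation may differ.

```typescript
// Hypothetical helpers for illustration only; not the SDK's actual API.
interface RunsRootOptions {
  runsDirFlag?: string; // value of --runs-dir, when explicitly provided
  env: Record<string, string | undefined>;
  homeDir: string; // assumed to be supplied by the caller
  repoRoot: string; // assumed to be discovered by the caller
}

// Precedence: explicit --runs-dir, then repo scope, then the global default.
function resolveRunsRoot(opts: RunsRootOptions): string {
  if (opts.runsDirFlag) return opts.runsDirFlag;
  const base =
    opts.env.BABYSITTER_RUNS_SCOPE === "repo" ? opts.repoRoot : opts.homeDir;
  return `${base}/.a5c/runs`;
}

// Convert a path inside the run directory into a POSIX-style ref relative to it.
// Accepts either slash style; paths outside runDir are returned normalized but absolute.
function toPosixRef(runDir: string, absolutePath: string): string {
  const norm = (p: string): string => p.replace(/\\/g, "/").replace(/\/+$/, "");
  const root = norm(runDir);
  const target = norm(absolutePath);
  return target.startsWith(root + "/") ? target.slice(root.length + 1) : target;
}
```

This sketch compares paths case-sensitively, so Windows drive-letter casing differences would need extra handling in a real implementation.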
- Run lifecycle management
  - `run:create` writes `run.json`, optional `inputs.json`, and appends `RUN_CREATED` via the runtime API. Required flags: `--process-id`, `--entry`. Optional flags: `--inputs`, `--run-id`, `--process-revision`, `--request`.
  - `run:status` prints `[run:status] state=<created|waiting|completed|failed> last=<TYPE#SEQ ISO> pending[...]` plus one line per pending kind; JSON mirrors `{ state, lastEvent, pendingByKind }`. Works even if journal/state files are missing by treating them as empty.
  - `run:events` streams journal entries with `--limit`, `--reverse`, `--filter-type`, and `--json`. Missing run directory or unreadable event files emit a single error line and exit `1`.
  - `run:rebuild-state` (surface for `rebuildStateCache`) locks the run, replays the journal, writes `state/state.json`, and prints/returns the rebuild reason, event counts, and resulting `stateVersion`.
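
As an illustration of how the `run:status` JSON shape could map onto the human-readable header, here is a minimal sketch. The field names inside `lastEvent` (`type`, `seq`, `recordedAt`) and the function name `formatStatusHeader` are assumptions, and only the `pending[total]` segment is rendered because that is the only pending form this spec spells out.

```typescript
type RunState = "created" | "waiting" | "completed" | "failed";

// Assumed shape; the spec only names { state, lastEvent, pendingByKind }.
interface LastEvent {
  type: string;
  seq: number;
  recordedAt: string; // ISO 8601 timestamp
}

interface RunStatusJson {
  state: RunState;
  lastEvent: LastEvent | null;
  pendingByKind: Record<string, number>;
}

// Hypothetical formatter: builds the single-line header from the JSON payload.
function formatStatusHeader(status: RunStatusJson): string {
  const last = status.lastEvent
    ? `${status.lastEvent.type}#${status.lastEvent.seq} ${status.lastEvent.recordedAt}`
    : "none";
  let total = 0;
  for (const kind of Object.keys(status.pendingByKind)) {
    total += status.pendingByKind[kind];
  }
  return `[run:status] state=${status.state} last=${last} pending[total]=${total}`;
}
```

Treating a missing journal as `lastEvent: null` with an empty `pendingByKind` yields the empty-run header without a special code path.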
- Orchestration control loops
  - `run:continue` was removed; callers should loop `run:iterate`, execute pending effects externally, and commit via `task:post`.
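
A caller-owned loop of this kind might look like the sketch below. Everything here is an assumption made for illustration: the `IterationResult` shape, the `LoopDeps` interface, and the `driveRun` name are hypothetical, and the injected functions stand in for whatever wrappers a caller builds around `run:iterate` and `task:post`.

```typescript
interface PendingEffect {
  effectId: string;
  kind: string;
}

// Assumed result shape; the actual run:iterate schema is not defined in this section.
interface IterationResult {
  status: "waiting" | "completed" | "failed";
  pending: PendingEffect[];
}

interface LoopDeps {
  iterate(): Promise<IterationResult>; // e.g. a wrapper around `babysitter run:iterate --json`
  execute(effect: PendingEffect): Promise<unknown>; // caller-owned execution policy
  post(effectId: string, result: unknown): Promise<void>; // e.g. a wrapper around `babysitter task:post`
}

// Iterate until the run leaves the waiting state, executing and committing each pending effect.
async function driveRun(
  deps: LoopDeps,
  maxIterations: number = 100
): Promise<"completed" | "failed"> {
  for (let i = 0; i < maxIterations; i++) {
    const result = await deps.iterate();
    if (result.status !== "waiting") return result.status;
    for (const effect of result.pending) {
      const output = await deps.execute(effect);
      await deps.post(effect.effectId, output);
    }
  }
  throw new Error(`run did not settle within ${maxIterations} iterations`);
}
```

The iteration cap is a design choice for the sketch: it keeps a run that stays in the waiting state (for example, on an unresolved breakpoint) from spinning forever.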
- Task introspection and execution
  - `task:list` reads the effect index and prints `- <effectId> [<kind> <status>] <label?> (taskId=<taskId>)`. Flags: `--pending`, `--kind`. JSON payload is `{ tasks: TaskListEntry[] }`, where every entry includes refs for task/result/stdout/stderr with POSIX paths.
  - `task:show` pretty-prints `task.json` and `result.json` (or `(not yet written)` if pending) and mirrors the list entry in JSON mode.
  - `task:post` commits externally produced results for any effect kind. It validates the effect is still `requested`, writes `tasks/<effectId>/result.json`, and appends `EFFECT_RESOLVED` via `commitEffectResult`. `--dry-run` previews the mutation without committing. JSON response includes `{ status, committed, stdoutRef, stderrRef, resultRef }`.
  - Manual breakpoint resolution stays manual: `task:list` highlights `kind="breakpoint"`. Dedicated `breakpoint:resolve`/`sleep:list` commands are tracked separately and are not required to ship with this part.
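
The `task:list` filtering and line format can be sketched as follows. This is a simplified, hypothetical model: the entry type omits the ref fields for brevity, `--pending` is assumed to mean `status === "requested"`, and the handling of a missing label (dropping it without leaving a double space) is a guess rather than documented behavior.

```typescript
// Simplified entry; the real TaskListEntry also carries task/result/stdout/stderr refs.
interface TaskListEntry {
  effectId: string;
  kind: string;
  status: string;
  label?: string;
  taskId: string;
}

interface TaskListFilter {
  pending?: boolean; // --pending (assumed to select status "requested")
  kind?: string; // --kind
}

function filterTasks(tasks: TaskListEntry[], filter: TaskListFilter): TaskListEntry[] {
  return tasks.filter(
    (task) =>
      (!filter.pending || task.status === "requested") &&
      (filter.kind === undefined || task.kind === filter.kind)
  );
}

// Renders "- <effectId> [<kind> <status>] <label?> (taskId=<taskId>)".
function formatTaskLine(task: TaskListEntry): string {
  const label = task.label ? ` ${task.label}` : "";
  return `- ${task.effectId} [${task.kind} ${task.status}]${label} (taskId=${task.taskId})`;
}
```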
- Output and UX conventions
  - Human text is intentionally terse (single-line headers with prefixed command ids) for easy parsing in CI logs.
  - `--json` outputs single JSON documents (no streams) so scripts can `jq` them. All timestamps are ISO 8601 strings; numbers stay numeric.
  - Errors include the command prefix, the resolved `<runDir>`, and the underlying message (`[run:events] unable to read run metadata at ...`). `--verbose` adds stack traces.
  - Secrets from task definitions are never echoed: the CLI logs file refs instead of dumping blobs/result payloads unless `--verbose` is paired with `--json` and `BABYSITTER_ALLOW_SECRET_LOGS=true`.
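
The secret-logging gate amounts to a three-way conjunction. The sketch below uses hypothetical names (`mayEchoPayloads`, `renderResult`) to show one way to keep that check in a single place so payloads are never printed by accident; it is not the SDK's implementation.

```typescript
interface OutputFlags {
  verbose: boolean;
  json: boolean;
}

// Payloads may be echoed only when all three conditions hold.
function mayEchoPayloads(
  flags: OutputFlags,
  env: Record<string, string | undefined>
): boolean {
  return flags.verbose && flags.json && env.BABYSITTER_ALLOW_SECRET_LOGS === "true";
}

// Hypothetical renderer: returns the file ref unless echoing is explicitly allowed.
function renderResult(ref: string, payload: unknown, allowed: boolean): string {
  return allowed ? JSON.stringify(payload) : ref;
}
```

Any other value of the environment variable, including an unset variable, keeps the output limited to refs.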
Acceptance Criteria
- Flag & path consistency: Every command resolves runs through the central default path policy, honors `--runs-dir` when explicitly provided, validates required positional args, and prints actionable errors with non-zero exit codes when resolution fails. Tests cover Windows-style and POSIX-style inputs.
- Deterministic JSON contracts: `run:create`, `run:status`, `run:events`, `run:iterate`, `task:list`, `task:show`, and `task:post` emit the schemas described above; snapshot tests guard against accidental drift.
- Safe automation loops: Orchestration loops are owned by the caller (skill/hook/worker). The CLI provides deterministic primitives (`run:iterate`, `task:list`, `task:post`) and never embeds task-execution policy.
- State repair tooling: `run:rebuild-state` rebuilds derived state when `state/state.json` is missing or stale and reports the rebuild result in both human and JSON modes. Subsequent `run:status` reflects the rebuilt `stateVersion`.
- Process integration: CLI surfaces are thin wrappers over runtime APIs (`createRun`, `orchestrateIteration`, `commitEffectResult`, `rebuildStateCache`). Unit tests stub these APIs to ensure argument translation and error propagation are correct.
- Documentation & help: `babysitter --help` (or bare invocation, or wrong-syntax error) prints the agent-facing usage block (commands intended for skill/hook automation). `babysitter --help-human` prints the human-facing usage block (commands intended for direct interactive use, e.g. `harness:*`, `session:init`, `mcp:serve`, `compress-output`). README/`sdk.md` tables stay in sync with both surfaces.
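
To make the argument-translation and exit-code criteria concrete, here is a minimal sketch of a `run:create` handler that takes a stubbed `createRun`. The `CreateRunArgs` field names, the stub's signature, and the message wording are all assumptions for illustration; only the flag names and exit-code classes come from this spec.

```typescript
// Assumed argument shape; the real createRun signature may differ.
interface CreateRunArgs {
  processId: string;
  entry: string;
  inputs?: string;
  runId?: string;
  processRevision?: string;
  request?: string;
}

type CreateRunFn = (args: CreateRunArgs) => { runId: string };

const FLAG_TO_FIELD = new Map<string, keyof CreateRunArgs>([
  ["--process-id", "processId"],
  ["--entry", "entry"],
  ["--inputs", "inputs"],
  ["--run-id", "runId"],
  ["--process-revision", "processRevision"],
  ["--request", "request"],
]);

// Returns the exit code: 0 on success, 1 for user errors, 2 for unexpected failures.
function runCreateCommand(
  argv: string[],
  createRun: CreateRunFn,
  log: (line: string) => void
): number {
  const parsed: Partial<CreateRunArgs> = {};
  for (let i = 0; i < argv.length; i += 2) {
    const field = FLAG_TO_FIELD.get(argv[i]);
    const value = argv[i + 1];
    if (field === undefined || value === undefined) {
      log(`[run:create] invalid or incomplete flag: ${argv[i]}`); // wording is illustrative
      return 1;
    }
    parsed[field] = value;
  }
  if (!parsed.processId || !parsed.entry) {
    log("[run:create] --process-id and --entry are required"); // wording is illustrative
    return 1;
  }
  try {
    const { runId } = createRun({ ...parsed, processId: parsed.processId, entry: parsed.entry });
    log(`[run:create] runId=${runId}`); // output format is illustrative
    return 0;
  } catch (err) {
    // Any error thrown by the runtime API is treated here as unexpected.
    log(`[run:create] ${err instanceof Error ? err.message : String(err)}`);
    return 2;
  }
}
```

Because `createRun` and `log` are injected, a unit test can assert both the translated arguments and the exit code without touching the filesystem.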
Edge Cases
- Missing or deleted run directories: commands fail fast with `[command] unable to read run metadata` and exit `1`.
- Empty journals: `run:status` reports `created` with `last=none` and `pending[total]=0`; `run:events --json` returns an empty array.
- Task output blobs larger than 1 MiB: `task:list` and `task:show` print refs to blob files rather than dumping whole payloads; `task:post --json` points to `stdoutRef`, `stderrRef`, and `resultRef`.
- Windows drive letters and UNC paths: `--runs-dir` and `<runDir>` may include drive prefixes; the CLI resolves them but continues to emit POSIX-style refs in JSON/logs.
- Legacy compatibility: when the active runs root is global, commands that read existing runs should also probe `<repo>/.a5c/runs` before reporting a missing run.
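
The legacy-compatibility probe can be sketched as a two-step lookup. The names `candidateRunDirs` and `locateRunDir` are hypothetical, and the sketch assumes a run lives at `<root>/<runId>`, which this section does not state explicitly; the existence check is injected so the example needs no filesystem access.

```typescript
type RunsScope = "global" | "repo";

// Build the ordered list of directories to probe for a run.
// Assumption: each run is stored at <root>/<runId>.
function candidateRunDirs(
  runId: string,
  activeRoot: string,
  scope: RunsScope,
  repoRoot: string
): string[] {
  const candidates = [`${activeRoot}/${runId}`];
  if (scope === "global") {
    const legacy = `${repoRoot}/.a5c/runs/${runId}`;
    if (legacy !== candidates[0]) candidates.push(legacy);
  }
  return candidates;
}

// Return the first candidate that exists, or undefined so the caller can report a missing run.
function locateRunDir(
  candidates: string[],
  exists: (dir: string) => boolean
): string | undefined {
  for (const dir of candidates) {
    if (exists(dir)) return dir;
  }
  return undefined;
}
```

Only when `locateRunDir` returns `undefined` would the command emit its missing-run error and exit `1`.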
Non-Goals
- Implementing interactive TUIs, dashboards, or VS Code surfaces (handled elsewhere in Babysitter).
- Remote/distributed task execution backends; CLI focuses on run iteration + result commits, not execution.
- New intrinsic kinds or scheduler policies; CLI simply reflects what the runtime reports.
- Packaging/distribution mechanics (npm publish, Homebrew formulas) and telemetry collection—tracked in separate operational docs.
- Auto-resolving breakpoints, orchestrator tasks, or sleep gates in this part; those require explicit manual commands or future automation.