feat capability CLI on latest main

2026-06-18 03:52:42 +08:00 · 2026-04-06 17:30:59 -05:00
21184 changed files with 763626 additions and 3709094 deletions
--- a/.agents/maintainer-notes/telegram.md
+++ b/.agents/maintainer-notes/telegram.md
@@ -1,37 +0,0 @@
-# Telegram Maintainer Decisions
-
-Use this page during Telegram PR review. These are intentional maintainer decisions, not incidental implementation details.
-
-Verified against Telegram Bot API 10.0, May 8 2026.
-
-## Streaming
-
-- Do not reintroduce `sendMessageDraft` for answer streaming. Telegram drafts are ephemeral 30-second previews in private chats; final delivery still requires a separate `sendMessage`. OpenClaw uses `sendMessage` plus `editMessageText`, then finalizes in place so the user sees one persistent answer.
-- Streaming owns one visible preview message. Edit it forward. Do not send an extra final bubble unless the final edit genuinely failed.
-- Keep the first-preview debounce. If a provider sends token-sized deltas, coalesce them into cumulative preview text instead of removing the debounce.
-- Respect Telegram limits in the Telegram layer. Text over 4096 chars chains into continuation messages. Polls keep the current Bot API 12-option cap.
-
-## Telegram API Ownership
-
-- Prefer grammY primitives and Telegram-native helpers when they model the behavior directly. Avoid custom Bot API wrappers for behavior grammY already owns.
-- Throttling is bot-token scoped. All Telegram API clients for the same token share one grammY `apiThrottler()` instance.
-- Do not silently retry failed topic sends without topic metadata. A wrong-surface success is worse than a loud Telegram error.
-- DM topics and forum topics are distinct. `direct_messages_topic_id` and `message_thread_id` are not interchangeable.
-
-## Context And Authorization
-
-- Reply context comes from OpenClaw-observed messages. Bot API updates expose `reply_to_message`, but there is no arbitrary `getMessage(chat, id)` hydration path later.
-- Current local chat context must outrank stale reply ancestry in the prompt. Old replied-to messages should not look like the active conversation.
-- Pairing is DM-only. Group and topic authorization need explicit config allowlists.
-- Telegram allowlists use numeric sender IDs. Usernames are optional, mutable, and not a reliable arbitrary-user lookup key in the Bot API.
-- Group and channel visible replies are policy-controlled. Normal room replies stay private unless `messages.groupChat.visibleReplies: "automatic"` is set or the agent explicitly calls `message.send`.
-
-## Interactive Surfaces
-
-- Native callbacks stay structured. Approval, native command, plugin, select, and multiselect callbacks must not fall through as raw callback text.
-- Preserve callback values exactly, including delimiters such as `env|prod`.
-- Native slash commands should remain fast-pathable before full workspace and agent-turn setup.
-
-## Review Standard
-
-Telegram behavior PRs need real Telegram proof when they touch transport, streaming, topics, callbacks, authorization, or reply context. Prefer the bot-to-bot QA lane or an equivalent live Telegram probe over synthetic-only validation.
--- a/.agents/maintainers.md
+++ b/.agents/maintainers.md
@@ -0,0 +1 @@
+Maintainer skills now live in [`openclaw/maintainers`](https://github.com/openclaw/maintainers/).
--- a/.agents/skills/agent-transcript/SKILL.md
+++ b/.agents/skills/agent-transcript/SKILL.md
@@ -1,88 +0,0 @@
----
-name: agent-transcript
-description: "Add a redacted agent transcript section to GitHub PR or issue bodies during OpenClaw agent-created PR/issue workflows."
----
-
-# Agent Transcript
-
-Best-effort local-only provenance for OpenClaw PR/issue bodies. Use during agent-created GitHub PR or issue workflows before creating/updating the body.
-
-## Contract
-
-- Never use network. Session discovery reads local agent logs only.
-- Never upload raw logs. Render sanitized Markdown first.
-- Always ask the user before adding transcript logs to a GitHub PR/issue body.
-- Tell the user sanitized session logs help reviewers and can make PRs easier to prioritize.
-- Offer a local HTML preview before insertion. If the user wants preview, open it and wait for confirmation before adding the section.
-- Fail closed on unresolved secrets, private keys, browser/session/cookie details, or auth URLs.
-- Drop system/developer prompts, raw tool outputs, reasoning, env, cookies, tokens, and broad local paths.
-- Keep user prompts, assistant visible decisions, terse tool summaries, and test/proof outcomes.
-- Remove session turns unrelated to the PR/issue work. Use the PR/issue title, branch name, changed files, and stated goal as scope; omit earlier/later unrelated tasks even when they are in the same session log.
-- Best effort only: PR/issue creation must continue if no safe transcript is found.
-- Add the `## Agent Transcript` section only when inserting a real transcript. Never add a placeholder transcript heading or text such as "A sanitized local transcript preview was generated but not included."
-- Use a collapsed `<details>` section and update existing markers instead of duplicating sections.
-
-## Helper
-
-```bash
-.agents/skills/agent-transcript/scripts/agent-transcript --help
-```
-
-Find a likely local session:
-
-```bash
-.agents/skills/agent-transcript/scripts/agent-transcript find \
-  --query "$PR_TITLE $BRANCH_OR_PR_URL" \
-  --cwd "$PWD" \
-  --since-days 14
-```
-
-`find` scans the newest 400 matching local JSONL logs by default across Codex, Claude, Pi, and OpenClaw agent sessions. Use `--max-files N` for a wider local search.
-
-Render a PR/issue body section:
-
-```bash
-.agents/skills/agent-transcript/scripts/agent-transcript render \
-  --session "$SESSION_JSONL" \
-  --out /tmp/agent-transcript.md
-```
-
-Preview one candidate session locally:
-
-```bash
-.agents/skills/agent-transcript/scripts/agent-transcript preview \
-  --session "$SESSION_JSONL" \
-  --out /tmp/agent-transcript-preview.html
-open /tmp/agent-transcript-preview.html
-```
-
-Append/update a body file before `gh pr create --body-file` or connector PR creation:
-
-```bash
-.agents/skills/agent-transcript/scripts/agent-transcript append-body \
-  --body /tmp/pr-body.md \
-  --session "$SESSION_JSONL" \
-  --out /tmp/pr-body.with-transcript.md
-```
-
-## PR/Issue Workflow
-
-1. Draft the normal PR/issue body first.
-2. Run `find` with title, branch, PR URL/number if known, and cwd.
-3. If a high-confidence session is found, ask:
-   `Include a redacted agent transcript? It helps reviewers and can make the PR easier to prioritize. I can open a local preview first.`
-4. If the user wants preview, run `preview`, open the HTML with `open`, and wait for confirmation.
-5. Before insertion, trim unrelated session turns from the generated section. Keep only turns that explain this PR/issue's goal, implementation choices, files, tests, proof, blockers, and final outcome.
-6. If the user approves, run `append-body`.
-7. Use the enriched body file for creation/update.
-8. If no safe session is found, say nothing and continue without transcript. If the user declines, continue without transcript and do not add any transcript placeholder section.
-
-## Review Artifacts
-
-For manual audits across many PR/session candidates, create a local HTML preview from a local JSON file. This is for maintainers only and is not part of the PR/issue workflow:
-
-```bash
-.agents/skills/agent-transcript/scripts/agent-transcript html \
-  --prs /tmp/recent-prs.json \
-  --out /tmp/agent-transcript-preview.html
-```
--- a/.agents/skills/agent-transcript/scripts/agent-transcript
+++ b/.agents/skills/agent-transcript/scripts/agent-transcript
@@ -1,683 +0,0 @@
-#!/usr/bin/env node
-import fs from "node:fs";
-import os from "node:os";
-import path from "node:path";
-import process from "node:process";
-
-const MARKER_START = "<!-- agent-transcript:start -->";
-const MARKER_END = "<!-- agent-transcript:end -->";
-const DEFAULT_MAX_CHARS = 50000;
-const DEFAULT_ENTRY_MAX_CHARS = 6000;
-
-function usage() {
-  console.log(`Usage:
-  agent-transcript find --query TEXT [--cwd PATH] [--since-days N] [--max-files N] [--root PATH...]
-  agent-transcript render --session FILE [--out FILE] [--max-chars N] [--entry-max-chars N] [--title TEXT] [--url URL]
-  agent-transcript preview --session FILE [--out FILE] [--max-chars N] [--entry-max-chars N] [--title TEXT] [--url URL]
-  agent-transcript append-body --body FILE --session FILE [--out FILE] [--max-chars N] [--entry-max-chars N]
-  agent-transcript html --prs FILE [--out FILE] [--since-days N] [--min-score N] [--root PATH...] [--exclude-session FILE...]
-
-Local-only. No network calls.`);
-}
-
-function parseArgs(argv) {
-  const args = { _: [] };
-  for (let i = 0; i < argv.length; i++) {
-    const arg = argv[i];
-    if (!arg.startsWith("--")) {
-      args._.push(arg);
-      continue;
-    }
-    const key = arg.slice(2);
-    const next = argv[i + 1];
-    if (next == null || next.startsWith("--")) {
-      args[key] = true;
-      continue;
-    }
-    i++;
-    if (args[key] == null) args[key] = next;
-    else if (Array.isArray(args[key])) args[key].push(next);
-    else args[key] = [args[key], next];
-  }
-  return args;
-}
-
-function asArray(value) {
-  if (value == null) return [];
-  return Array.isArray(value) ? value : [value];
-}
-
-function homePath(...parts) {
-  return path.join(os.homedir(), ...parts);
-}
-
-function openClawSessionRoots() {
-  const stateDir = process.env.OPENCLAW_STATE_DIR || homePath(".openclaw");
-  const agentsDir = path.join(stateDir, "agents");
-  if (!fs.existsSync(agentsDir)) return [];
-  try {
-    const roots = fs
-      .readdirSync(agentsDir, { withFileTypes: true })
-      .filter((entry) => entry.isDirectory())
-      .flatMap((entry) => {
-        const agentDir = path.join(agentsDir, entry.name);
-        return [
-          path.join(agentDir, "sessions"),
-          path.join(agentDir, "agent", "sessions"),
-          path.join(agentDir, "agent", "codex-home", "sessions"),
-        ];
-      })
-      .filter((root) => fs.existsSync(root));
-    return [...new Set(roots)];
-  } catch {
-    return [];
-  }
-}
-
-function defaultRoots() {
-  return [
-    homePath(".codex", "sessions"),
-    homePath(".claude", "projects"),
-    homePath(".pi", "agent", "sessions"),
-    ...openClawSessionRoots(),
-  ];
-}
-
-function walkJsonl(root, sinceMs, out = []) {
-  if (!root || !fs.existsSync(root)) return out;
-  const stat = fs.statSync(root);
-  if (stat.isFile()) {
-    if (root.endsWith(".jsonl") && stat.mtimeMs >= sinceMs) out.push(root);
-    return out;
-  }
-  for (const entry of fs.readdirSync(root, { withFileTypes: true })) {
-    if (entry.name === "node_modules" || entry.name === ".git") continue;
-    const file = path.join(root, entry.name);
-    if (entry.isDirectory()) walkJsonl(file, sinceMs, out);
-    else if (entry.isFile() && entry.name.endsWith(".jsonl")) {
-      const entryStat = fs.statSync(file);
-      if (entryStat.mtimeMs >= sinceMs) out.push(file);
-    }
-  }
-  return out;
-}
-
-function readJsonl(file, maxLines = 12000) {
-  const text = fs.readFileSync(file, "utf8");
-  const lines = text.split(/\n+/).filter(Boolean).slice(0, maxLines);
-  const rows = [];
-  for (const line of lines) {
-    try {
-      rows.push(JSON.parse(line));
-    } catch {
-      rows.push({ type: "unparsed", text: line });
-    }
-  }
-  return rows;
-}
-
-function stringContent(value) {
-  if (value == null) return "";
-  if (typeof value === "string") return value;
-  if (Array.isArray(value)) return value.map(stringContent).filter(Boolean).join("\n");
-  if (typeof value === "object") {
-    if (typeof value.text === "string") return value.text;
-    if (typeof value.content === "string") return value.content;
-    if (typeof value.message === "string") return value.message;
-    if (Array.isArray(value.content)) return stringContent(value.content);
-    if (value.type === "text" && value.text) return String(value.text);
-  }
-  return "";
-}
-
-function detectAgent(file, rows) {
-  if (file.includes(`${path.sep}.codex${path.sep}`)) return "codex";
-  if (file.includes(`${path.sep}.claude${path.sep}`)) return "claude";
-  if (file.includes(`${path.sep}.pi${path.sep}`)) return "pi";
-  if (
-    file.includes(`${path.sep}.openclaw${path.sep}`) ||
-    (file.includes(`${path.sep}agents${path.sep}`) && file.includes(`${path.sep}sessions${path.sep}`))
-  ) {
-    return "openclaw";
-  }
-  if (rows.some((row) => row?.type === "session_meta" || row?.type === "response_item")) return "codex";
-  if (rows.some((row) => row?.sessionId && row?.userType)) return "claude";
-  return "agent";
-}
-
-function eventText(row) {
-  if (row?.type === "event_msg") {
-    const payload = row.payload || {};
-    return stringContent(payload.message || payload.text_elements || payload.content);
-  }
-  if (row?.type === "response_item") {
-    const payload = row.payload || {};
-    return stringContent(payload.content || payload.summary || payload.arguments || payload.output);
-  }
-  if (row?.message) return stringContent(row.message);
-  if (row?.content) return stringContent(row.content);
-  if (row?.text) return stringContent(row.text);
-  return "";
-}
-
-function eventRole(row) {
-  if (row?.type === "event_msg") {
-    const type = row.payload?.type;
-    if (type === "user_message") return "user";
-    if (type === "agent_message") return "assistant";
-    if (type === "token_count" || type === "task_started" || type === "task_complete") return null;
-    if (type === "web_search_end") return "web";
-  }
-  if (row?.type === "response_item") {
-    const payload = row.payload || {};
-    if (payload.type === "function_call") return "tool";
-    if (payload.type === "function_call_output") return "tool_output";
-    if (payload.type === "reasoning") return null;
-    if (payload.type === "web_search_call") return "web";
-    if (payload.role === "user") return "user";
-    if (payload.role === "assistant") return "assistant";
-  }
-  if (row?.type === "user") return "user";
-  if (row?.type === "assistant") return "assistant";
-  if (row?.message?.role === "user") return "user";
-  if (row?.message?.role === "assistant") return "assistant";
-  if (row?.type === "tool_result" || row?.type === "tool_use") return "tool";
-  return null;
-}
-
-function hasSetupBlob(text) {
-  return (
-    text.includes("<INSTRUCTIONS>") ||
-    text.includes("# AGENTS.MD") ||
-    text.includes("Knowledge cutoff:") ||
-    text.includes("You are Codex") ||
-    /\byour instructions\b/i.test(text) ||
-    /\binstructions absorbed\b/i.test(text) ||
-    /\bAGENTS\.md\b/i.test(text)
-  );
-}
-
-function redact(input, stats) {
-  let s = String(input ?? "");
-  const rules = [
-    [/-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]*?-----END [A-Z ]*PRIVATE KEY-----/g, "[REDACTED_PRIVATE_KEY]"],
-    [/sk-[A-Za-z0-9_-]{20,}/g, "[REDACTED_OPENAI_KEY]"],
-    [/(gh[pousr]_[A-Za-z0-9_]{20,})/g, "[REDACTED_GITHUB_TOKEN]"],
-    [/(AKIA[0-9A-Z]{16})/g, "[REDACTED_AWS_KEY]"],
-    [/eyJ[A-Za-z0-9_-]{20,}\.[A-Za-z0-9_-]{20,}\.[A-Za-z0-9_-]{10,}/g, "[REDACTED_JWT]"],
-    [/\b(?:Bearer|Basic)\s+[A-Za-z0-9._~+/=-]{16,}/gi, "[REDACTED_AUTH_HEADER]"],
-    [/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi, "[REDACTED_EMAIL]"],
-    [/\b(?:\+?\d[\d .()-]{7,}\d)\b/g, "[REDACTED_PHONE]"],
-    [/\/Users\/[^\s`"'>)]+/g, "[LOCAL_PATH]"],
-    [/~\/[^\s`"'>)]+/g, "[HOME_PATH]"],
-    [/([?&](?:token|key|secret|signature|sig|access_token|auth)=)[^\s`"'>&]+/gi, "$1[REDACTED]"],
-  ];
-  for (const [re, repl] of rules) {
-    const before = s;
-    s = s.replace(re, repl);
-    if (s !== before) stats.redactions++;
-  }
-  return s;
-}
-
-function unsafe(text) {
-  const patterns = [
-    /-----BEGIN [A-Z ]*PRIVATE KEY-----/,
-    /\b(?:Bearer|Basic)\s+[A-Za-z0-9._~+/=-]{16,}/i,
-    /\b(?:user_session|_gh_sess|__Host-user_session_same_site|GH_SESSION_TOKEN)\b/i,
-    /\b(?:GITHUB_TOKEN|GH_TOKEN|OPENAI_API_KEY|ANTHROPIC_API_KEY)\b/,
-    /\/upload\/policies\/assets|uploadToken|authenticity_token/i,
-  ];
-  return patterns.filter((pattern) => pattern.test(text)).map((pattern) => String(pattern));
-}
-
-function normalizeEntry(role, text, stats, options = {}) {
-  let t = redact(text, stats).replace(/\n{3,}/g, "\n\n").trim();
-  if (!t) return null;
-  if (hasSetupBlob(t)) t = "[instructions recap omitted; policy/config text, not task dialogue]";
-  if (unsafe(t).length) t = "[omitted: browser/session/auth internals; not useful for public PR transcript]";
-  const entryMaxChars = Number(options.entryMaxChars || options["entry-max-chars"] || DEFAULT_ENTRY_MAX_CHARS);
-  if (t.length > entryMaxChars) {
-    t = `${t.slice(0, entryMaxChars).trimEnd()}\n...[truncated ${t.length - entryMaxChars} chars]`;
-  }
-  return `[${role}]\n${t}`;
-}
-
-function entryRole(entry) {
-  const match = entry.match(/^\[([^\]]+)\]\n/);
-  return match ? match[1] : null;
-}
-
-function entryBody(entry) {
-  return entry.replace(/^\[[^\]]+\]\n/, "");
-}
-
-function coalesceEntries(entries) {
-  const coalesced = [];
-  for (const entry of entries) {
-    const role = entryRole(entry);
-    const body = entryBody(entry);
-    const last = coalesced[coalesced.length - 1];
-    if (!last || !role || entryRole(last) !== role || role === "tool summary") {
-      coalesced.push(entry);
-      continue;
-    }
-    const lastBody = entryBody(last);
-    if (lastBody === body || lastBody.includes(body)) continue;
-    if (body.includes(lastBody)) {
-      coalesced[coalesced.length - 1] = `[${role}]\n${body}`;
-      continue;
-    }
-    coalesced[coalesced.length - 1] = `[${role}]\n${lastBody}\n\n${body}`;
-  }
-  return coalesced;
-}
-
-function toolFamily(name) {
-  const normalized = String(name).toLowerCase();
-  if (
-    /(read|fetch|open|list|find|search|grep|rg|sed|cat|head|tail|jq|wc|status|diff|show|view|snapshot|screenshot)/.test(
-      normalized,
-    )
-  ) {
-    return "read";
-  }
-  if (/(write|edit|patch|apply|create|update|append|save|comment|fill|click|type|navigate|upload)/.test(normalized)) {
-    return "write";
-  }
-  if (/(exec|command|shell|run|test|build|lint|format|install|pnpm|npm|node|git|gh|ssh)/.test(normalized)) {
-    return "execute";
-  }
-  if (/(web|http|fetch|browser|chrome|github|dropbox|notion|gmail|calendar)/.test(normalized)) {
-    return "network";
-  }
-  return "other";
-}
-
-function shellFamily(command) {
-  const cmd = String(command || "").trim();
-  if (!cmd) return "execute";
-  if (
-    /^(rg|grep|sed|cat|head|tail|jq|wc|ls|find|pwd|git (status|diff|show|log|blame)|gh (pr|issue|api|run|repo|auth) (view|list|status)|test |stat |ps |which |command -v )\b/.test(
-      cmd,
-    )
-  ) {
-    return "read";
-  }
-  if (/^(open |chmod |mkdir |touch |cp |mv |kill |git add|git commit|git push|gh pr create|gh issue create)\b/.test(cmd)) {
-    return "write";
-  }
-  if (/^(node|npm|pnpm|bun|python|python3|ruby|tsx|tsgo|make|cargo|go test|swift|xcodebuild)\b/.test(cmd)) {
-    return "execute";
-  }
-  if (/^(ssh|curl|wget|tailscale|nc )\b/.test(cmd)) return "network";
-  return "execute";
-}
-
-function toolCallFamily(row) {
-  const name = row.payload?.name || row.name || row.message?.name || row.type || "tool";
-  if (name === "exec_command") {
-    try {
-      const args = JSON.parse(row.payload?.arguments || "{}");
-      return shellFamily(args.cmd);
-    } catch {
-      return "execute";
-    }
-  }
-  if (name === "apply_patch") return "write";
-  if (name === "write_stdin") return "execute";
-  return toolFamily(name);
-}
-
-function compactToolSummary(familyCounts, dropped) {
-  const families = new Map();
-  for (const [family, count] of familyCounts.entries()) {
-    families.set(family, (families.get(family) || 0) + count);
-  }
-  const ordered = ["read", "write", "execute", "network", "other"]
-    .map((family) => [family, families.get(family) || 0])
-    .filter(([, count]) => count > 0)
-    .map(([family, count]) => `${count} ${family}`);
-  const calls = ordered.length ? ordered.join(", ") : "0 tool";
-  return `${calls}; raw tool outputs dropped: ${dropped}`;
-}
-
-function recountEntries(stats, entries) {
-  stats.rawEntries = stats.entries;
-  stats.entries = entries.length;
-  stats.user = entries.filter((entry) => entry.startsWith("[user]\n")).length;
-  stats.assistant = entries.filter((entry) => entry.startsWith("[assistant]\n")).length;
-}
-
-function renderSession(file, options = {}) {
-  const rows = readJsonl(file);
-  const agent = detectAgent(file, rows);
-  const stats = {
-    agent,
-    entries: 0,
-    user: 0,
-    assistant: 0,
-    toolCalls: 0,
-    toolOutputsDropped: 0,
-    web: 0,
-    redactions: 0,
-    omittedUnsafe: 0,
-  };
-  const toolCounts = new Map();
-  const items = [];
-  const seenEntries = new Set();
-  const hasEventDialogue = rows.some((row) => {
-    const type = row?.type === "event_msg" ? row.payload?.type : null;
-    return type === "user_message" || type === "agent_message";
-  });
-  for (const row of rows) {
-    const role = eventRole(row);
-    if (!role) continue;
-    if (hasEventDialogue && row.type === "response_item" && (role === "user" || role === "assistant")) {
-      continue;
-    }
-    if (role === "tool_output") {
-      stats.toolOutputsDropped++;
-      continue;
-    }
-    if (role === "tool") {
-      const family = toolCallFamily(row);
-      toolCounts.set(family, (toolCounts.get(family) || 0) + 1);
-      stats.toolCalls++;
-      continue;
-    }
-    if (role === "web") {
-      stats.web++;
-      continue;
-    }
-    const before = eventText(row);
-    const entry = normalizeEntry(role, before, stats, options);
-    if (!entry) continue;
-    const dedupeKey = entry.replace(/\s+/g, " ").trim();
-    if (seenEntries.has(dedupeKey)) continue;
-    seenEntries.add(dedupeKey);
-    if (entry.includes("[omitted: browser/session/auth internals")) stats.omittedUnsafe++;
-    items.push(entry);
-    stats.entries++;
-    if (role === "user") stats.user++;
-    if (role === "assistant") stats.assistant++;
-  }
-  if (toolCounts.size) {
-    items.push(`[tool summary]\n${compactToolSummary(toolCounts, stats.toolOutputsDropped)}`);
-    stats.entries++;
-  }
-  const renderedItems = coalesceEntries(items);
-  recountEntries(stats, renderedItems);
-  const maxChars = Number(options.maxChars || DEFAULT_MAX_CHARS);
-  let joined = renderedItems.join("\n\n");
-  if (joined.length > maxChars) joined = `${joined.slice(0, maxChars).trimEnd()}\n\n...[transcript truncated to ${maxChars} chars]`;
-  const headerBits = [options.title, options.url].filter(Boolean).join(" | ");
-  const unsafeAfter = unsafe(joined);
-  const safe = unsafeAfter.length === 0;
-  const markdown = `${MARKER_START}
-## Agent Transcript
-
-<details>
-<summary>Redacted ${agent} session transcript${headerBits ? `: ${redact(headerBits, stats)}` : ""}</summary>
-
-\`\`\`\`text
-source: [LOCAL_SESSION]
-redaction: local paths, emails, phone-shaped strings, token-shaped strings, auth headers, auth query params
-omitted: raw tool outputs, system/developer prompts, local paths, secrets, browser/session/auth details
-stats: ${JSON.stringify(stats)}
-
-${joined}
-\`\`\`\`
-
-</details>
-${MARKER_END}
-`;
-  return { file, agent, safe, unsafeAfter, stats, markdown };
-}
-
-function readBoundedText(file, maxBytes = 220000) {
-  const fd = fs.openSync(file, "r");
-  try {
-    const stat = fs.fstatSync(fd);
-    if (stat.size <= maxBytes) {
-      const buffer = Buffer.alloc(stat.size);
-      fs.readSync(fd, buffer, 0, stat.size, 0);
-      return buffer.toString("utf8");
-    }
-    const half = Math.floor(maxBytes / 2);
-    const head = Buffer.alloc(half);
-    const tail = Buffer.alloc(half);
-    fs.readSync(fd, head, 0, half, 0);
-    fs.readSync(fd, tail, 0, half, Math.max(0, stat.size - half));
-    return `${head.toString("utf8")}\n[...middle omitted for scan...]\n${tail.toString("utf8")}`;
-  } finally {
-    fs.closeSync(fd);
-  }
-}
-
-function sessionScanRecord(file, maxBytes) {
-  const stat = fs.statSync(file);
-  const agent = detectAgent(file, []);
-  return {
-    file,
-    agent,
-    mtime: new Date(stat.mtimeMs).toISOString(),
-    haystack: `${file}\n${readBoundedText(file, maxBytes)}`.toLowerCase(),
-  };
-}
-
-function scoreScanRecord(record, terms, cwd) {
-  const haystack = record.haystack;
-  let score = 0;
-  const reasons = [];
-  for (const term of terms) {
-    const normalized = term.toLowerCase().trim();
-    if (normalized.length < 3) continue;
-    if (haystack.includes(normalized)) {
-      score += Math.min(20, Math.max(3, Math.floor(normalized.length / 3)));
-      reasons.push(normalized.slice(0, 80));
-    }
-  }
-  if (cwd) {
-    const cwdLower = cwd.toLowerCase();
-    if (haystack.includes(cwdLower) || record.file.toLowerCase().includes(cwdLower.replaceAll("/", "-"))) {
-      score += 8;
-      reasons.push("cwd");
-    }
-  }
-  return { file: record.file, score, reasons, mtime: record.mtime, agent: record.agent };
-}
-
-function recentFiles(files, maxFiles) {
-  return files
-    .map((file) => {
-      try {
-        return { file, mtimeMs: fs.statSync(file).mtimeMs };
-      } catch {
-        return null;
-      }
-    })
-    .filter(Boolean)
-    .sort((a, b) => b.mtimeMs - a.mtimeMs)
-    .slice(0, maxFiles)
-    .map((entry) => entry.file);
-}
-
-function candidateFiles(roots, terms, sinceMs, options = {}) {
-  return recentFiles(roots.flatMap((root) => walkJsonl(root, sinceMs)), Number(options["max-files"] || 400));
-}
-
-function findSessions(options) {
-  const sinceDays = Number(options["since-days"] || 14);
-  const sinceMs = Date.now() - sinceDays * 24 * 60 * 60 * 1000;
-  const roots = asArray(options.root).length ? asArray(options.root) : defaultRoots();
-  const query = String(options.query || "");
-  const terms = query
-    .split(/\s+/)
-    .concat(query.match(/https?:\/\/\S+/g) || [])
-    .filter(Boolean);
-  const files = candidateFiles(roots, terms, sinceMs, options);
-  const scanBytes = Number(options["scan-bytes"] || 60000);
-  const results = files
-    .map((file) => scoreScanRecord(sessionScanRecord(file, scanBytes), terms, options.cwd))
-    .filter((result) => result.score > 0)
-    .sort((a, b) => b.score - a.score || b.mtime.localeCompare(a.mtime))
-    .slice(0, Number(options.limit || 10));
-  return results;
-}
-
-function sessionScanRecords(options) {
-  const sinceDays = Number(options["since-days"] || 14);
-  const sinceMs = Date.now() - sinceDays * 24 * 60 * 60 * 1000;
-  const roots = asArray(options.root).length ? asArray(options.root) : defaultRoots();
-  const excluded = new Set(asArray(options["exclude-session"]).map((file) => path.resolve(file)));
-  return roots
-    .flatMap((root) => walkJsonl(root, sinceMs))
-    .filter((file) => !excluded.has(path.resolve(file)))
-    .map((file) => sessionScanRecord(file, Number(options["scan-bytes"] || 90000)));
-}
-
-function replaceSection(body, section) {
-  const start = body.indexOf(MARKER_START);
-  const end = body.indexOf(MARKER_END);
-  if (start !== -1 && end !== -1 && end > start) {
-    return `${body.slice(0, start).trimEnd()}\n\n${section.trim()}\n\n${body.slice(end + MARKER_END.length).trimStart()}`;
-  }
-  return `${body.trimEnd()}\n\n${section.trim()}\n`;
-}
-
-function escapeHtml(text) {
-  return String(text)
-    .replaceAll("&", "&amp;")
-    .replaceAll("<", "&lt;")
-    .replaceAll(">", "&gt;")
-    .replaceAll('"', "&quot;");
-}
-
-function htmlDocument(records) {
-  const rows = records
-    .map((record) => `<section>
-<h2><a href="${escapeHtml(record.url || "")}">${escapeHtml(record.title || record.url || "PR")}</a></h2>
-<p><code>${escapeHtml(record.session ? "[LOCAL_SESSION]" : "no session")}</code> score: ${escapeHtml(record.score ?? "")} safe: ${escapeHtml(record.safe ?? "")}</p>
-<pre>${escapeHtml(record.markdown || record.error || "")}</pre>
-</section>`)
-    .join("\n");
-  return `<!doctype html>
-<meta charset="utf-8">
-<title>Agent Transcript Preview</title>
-<style>
-body{font:14px/1.45 system-ui,-apple-system,BlinkMacSystemFont,"Segoe UI",sans-serif;margin:32px;color:#1f2328;background:#fff}
-section{border-top:1px solid #d0d7de;padding:24px 0}
-h1,h2{line-height:1.2}
-pre{white-space:pre-wrap;background:#f6f8fa;border:1px solid #d0d7de;border-radius:6px;padding:16px;overflow:auto}
-code{background:#f6f8fa;padding:2px 4px;border-radius:4px}
-a{color:#0969da}
-</style>
-<h1>Agent Transcript Preview</h1>
-${rows}
-`;
-}
-
-function singlePreviewDocument(record) {
-  return htmlDocument([record]);
-}
-
-function readPrs(file) {
-  const raw = fs.readFileSync(file, "utf8");
-  const parsed = JSON.parse(raw);
-  return Array.isArray(parsed) ? parsed : parsed.items || parsed.prs || [];
-}
-
-function main() {
-  const [command, ...rest] = process.argv.slice(2);
-  const args = parseArgs(rest);
-  if (!command || command === "--help" || command === "-h" || args.help) {
-    usage();
-    return;
-  }
-  if (command === "find") {
-    console.log(JSON.stringify(findSessions(args), null, 2));
-    return;
-  }
-  if (command === "render") {
-    if (!args.session) throw new Error("--session is required");
-    const rendered = renderSession(args.session, args);
-    if (!rendered.safe) throw new Error(`unsafe transcript after redaction: ${rendered.unsafeAfter.join(", ")}`);
-    if (args.out) fs.writeFileSync(args.out, rendered.markdown);
-    else process.stdout.write(rendered.markdown);
-    return;
-  }
-  if (command === "preview") {
-    if (!args.session) throw new Error("--session is required");
-    const rendered = renderSession(args.session, args);
-    if (!rendered.safe) throw new Error(`unsafe transcript after redaction: ${rendered.unsafeAfter.join(", ")}`);
-    const output = singlePreviewDocument({
-      title: args.title || "Agent Transcript Preview",
-      url: args.url || "",
-      session: args.session,
-      safe: rendered.safe,
-      markdown: rendered.markdown,
-    });
-    if (args.out) fs.writeFileSync(args.out, output);
-    else process.stdout.write(output);
-    return;
-  }
-  if (command === "append-body") {
-    if (!args.body || !args.session) throw new Error("--body and --session are required");
-    const rendered = renderSession(args.session, args);
-    if (!rendered.safe) throw new Error(`unsafe transcript after redaction: ${rendered.unsafeAfter.join(", ")}`);
-    const body = fs.readFileSync(args.body, "utf8");
-    const next = replaceSection(body, rendered.markdown);
-    if (args.out) fs.writeFileSync(args.out, next);
-    else process.stdout.write(next);
-    return;
-  }
-  if (command === "html") {
-    if (!args.prs) throw new Error("--prs is required");
-    const records = [];
-    const scanRecords = sessionScanRecords(args);
-    const minScore = Number(args["min-score"] || 50);
-    for (const pr of readPrs(args.prs)) {
-      const query = [pr.url, pr.number ? `#${pr.number}` : "", pr.number, pr.title, pr.headRefName, pr.headRefName || pr.branch]
-        .filter(Boolean)
-        .join(" ");
-      const terms = query
-        .split(/\s+/)
-        .concat(query.match(/https?:\/\/\S+/g) || [])
-        .filter(Boolean);
-      const [candidate] = scanRecords
-        .map((record) => scoreScanRecord(record, terms, args.cwd))
-        .filter((result) => result.score >= minScore)
-        .sort((a, b) => b.score - a.score || b.mtime.localeCompare(a.mtime));
-      if (!candidate) {
-        records.push({ ...pr, error: "No local session match found." });
-        continue;
-      }
-      try {
-        const rendered = renderSession(candidate.file, { ...args, title: pr.title, url: pr.url });
-        records.push({
-          ...pr,
-          session: candidate.file,
-          score: candidate.score,
-          safe: rendered.safe,
-          markdown: rendered.markdown,
-        });
-      } catch (error) {
-        records.push({ ...pr, session: candidate.file, score: candidate.score, error: String(error) });
-      }
-    }
-    const output = htmlDocument(records);
-    if (args.out) fs.writeFileSync(args.out, output);
-    else process.stdout.write(output);
-    return;
-  }
-  usage();
-  process.exitCode = 2;
-}
-
-try {
-  main();
-} catch (error) {
-  console.error(error instanceof Error ? error.message : String(error));
-  process.exit(1);
-}
--- a/.agents/skills/autoreview/SKILL.md
+++ b/.agents/skills/autoreview/SKILL.md
@@ -1,251 +0,0 @@
---
-name: autoreview
-description: "Auto Review closeout. Codex review is the default when no engine is set and is the recommended reviewer."
---
-
-# Auto Review
-
-Run the bundled structured review helper as a closeout check. This is code review, not Guardian `auto_review` approval routing.
-
-Codex review is the default when no engine is set. It usually delivers the best review results and should remain the normal final closeout engine.
-
-Use when:
-
- user asks for Codex review / Claude review / autoreview / second-model review
- after non-trivial code edits, before final/commit/ship
- reviewing a local branch or PR branch after fixes
-
-## Contract
-
- Treat review output as advisory. Never blindly apply it.
- Verify every finding by reading the real code path and adjacent files.
- Read dependency docs/source/types when the finding depends on external behavior.
- Reject unrealistic edge cases, speculative risks, broad rewrites, and fixes that over-complicate the codebase.
- Prefer small fixes at the right ownership boundary; no refactor unless it clearly improves the bug class.
- When an accepted finding shows a bug class or repeated pattern, inspect the current PR scope for sibling instances before fixing.
- Fix the scoped bug class at once when practical; stop at touched surfaces, owner boundaries, and clear follow-up territory.
- Keep going until structured review returns no accepted/actionable findings, but only while the work remains inside the original task scope.
- If a review-triggered fix changes code, rerun focused tests and rerun the structured review helper.
- For security-audit suppression changes, verify accepted findings remain auditable: suppressed findings stay in structured output, active output keeps an unsuppressible suppression notice, and aggregate findings cannot hide unrelated active risk.
- Never switch or override the requested review engine/model. If the review hits model capacity, retry the same command a few times with the same engine/model.
- Be patient with large bundles. Structured review can take up to 30 minutes while the model call is active, especially with Codex tools or web search.
- Treat heartbeat lines like `review still running: ... elapsed=... pid=...` as healthy progress, not a hang. Let the helper continue while heartbeats are advancing. Pass `--stream-engine-output` when live engine text is useful; Codex and Claude filter tool/file chatter, other engines pass raw output through.
- Do not kill a review just because it has been quiet for 2-5 minutes, or because it is still running under the 30-minute window. Inspect the process only after missing multiple expected heartbeats, after 30 minutes, or after an obviously failed subprocess; prefer letting the same helper command finish.
- Tools are useful in review mode. The helper allows read-only inspection tools and web search by default so reviewers can check dependency contracts, upstream docs, and current behavior.
- Security perspective is always included, but it should not cripple legitimate functionality. Report security findings only when the change creates a concrete, actionable risk or removes an important safety check.
- For regression provenance, if no blamed PR is traceable, use the blamed commit as the provenance: commit SHA, date, and author username. Do not guess a merger or frame missing PR metadata as a separate finding.
- Do not invoke built-in `codex review`, nested reviewers, or reviewer panels from inside the review. The helper builds one bundle, calls one selected engine, validates one structured result, and stops.
- Stop as soon as the helper exits 0 with no accepted/actionable findings. Do not run an extra review just to get a nicer "clean" line, a second opinion, or clearer closeout wording.
- Treat the helper's successful exit plus absence of actionable findings as the clean review result, even if the underlying Codex CLI output is terse.
- Multi-reviewer panels are opt-in only. Use them when explicitly requested or when risk justifies the extra spend; the main agent still verifies every accepted finding before fixing.
- If rejecting a finding as intentional/not worth fixing, add a brief inline code comment only when it explains a real invariant or ownership decision that future reviewers should know.
- If `gh`/Gitcrawl reports `database disk image is malformed`, run `gitcrawl doctor --json` once to let the portable cache repair before retrying review; do not bypass the shim unless repair fails and freshness requires live GitHub.
- If Gitcrawl reports a portable manifest mismatch, source/runtime DB health error, or stale portable-store checkout, run `gitcrawl doctor --json` and inspect `source_db_health`, `runtime_db_health`, and `portable_store_status` before falling back to live GitHub.
- Do not push just to review. Push only when the user requested push/ship/PR update.
-
-## Scope Governor
-
-Autoreview is a closeout gate, not permission to rewrite the task.
-
-Before the first review, freeze a scope baseline: original request or issue, target branch, intended behavior, owner boundary, changed files, and non-test LOC. For inherited or already-bloated branches, use the intended PR diff as the baseline rather than accepting all existing branch drift.
-
-Before patching a finding, classify it:
-
- **In-scope blocker**: the finding is introduced by the current diff, affects the same owner boundary, and can be fixed without changing the task's contract.
- **Follow-up**: the finding is real but belongs to an adjacent bug class, sibling surface, cleanup, or broader hardening track.
- **Stop-and-escalate**: the finding requires a new protocol/config/storage/public API contract, a different owner boundary, a release-process change, or a design choice outside the original request.
-
-Stop patching and report the scope break instead of continuing when:
-
- a narrow PR turns into an architecture change, protocol change, migration, or release-process change;
- the diff grows past 2x the original files or non-test LOC without explicit approval to expand scope;
- two review-triggered patch cycles have not converged; pause and reclassify every remaining finding before another edit;
- the best fix is "define the canonical contract first" rather than another local inference layer;
- fixing the accepted finding would make the PR no longer describe the same behavior, issue, or owner boundary.
-
-After the two-cycle pause, continue only when every remaining accepted finding is still an in-scope blocker. Otherwise preserve the useful analysis, identify the smallest safe landed subset if one exists, and open or request a follow-up for the larger fix. Do not keep committing speculative fixes just to satisfy the reviewer.
-
-Do not stack or push review-triggered fix commits while scope classification or focused proof is unresolved. Keep exploratory edits local until the cycle is proven in scope; if scope breaks, remove them from the landing lane instead of preserving them as branch history.
-
-Critical exceptions must be explicit: active data loss, crash, broken install/upgrade, release blocker, or concrete security exposure. If the exception is not one of those, it is not critical enough to blow up scope.
-
-## Release Branches And Release Process
-
-On release, beta, stable, hotfix, signing, notarization, appcast, package-publish, or release-check work, use freeze discipline even when the branch name is not release-like:
-
- Fix only release blockers, failed release infrastructure, exact backports, install/upgrade breakage, data loss, crashes, or concrete security exposure.
- Treat non-blocking autoreview findings as follow-ups for `main`, not reasons to broaden the release branch.
- Do not introduce new product behavior, config surface, protocol shape, migration, plugin ownership, docs narrative, or process policy unless it directly unblocks the release.
- Keep proof tied to the release target: exact branch/ref, failing check or shipped-risk reason, smallest command/proof, and whether the fix must also forward-port to `main`.
- If review discovers a real but non-critical design problem during release closeout, stop with a follow-up issue/PR plan; do not use the release branch as the refactor lane.
-
-## Pick Target
-
-Dirty local work:
-
-```bash
-<autoreview-helper> --mode local
-```
-
-Use this only when the patch is actually unstaged/staged/untracked in the
-current checkout. `--mode uncommitted` is accepted as an alias for `--mode local`.
-For committed, pushed, or PR work, point the helper at the commit
-or branch diff instead; do not force dirty modes just
-because the helper docs mention dirty work first. A clean local review
-only proves there is no local patch.
-
-Branch/PR work:
-
-```bash
-<autoreview-helper> --mode branch --base origin/main
-```
-
-Optional review context is first-class:
-
-```bash
-<autoreview-helper> --mode branch --base origin/main --prompt-file /tmp/review-notes.md --dataset /tmp/evidence.json
-```
-
-If an open PR exists, use its actual base:
-
-```bash
-base=$(gh pr view --json baseRefName --jq .baseRefName)
-<autoreview-helper> --mode branch --base "origin/$base"
-```
-
-Committed single change:
-
-```bash
-<autoreview-helper> --mode commit --commit HEAD
-```
-
-or with the helper's absolute path when installed from `agent-scripts`:
-
-```bash
-/Users/steipete/Projects/agent-scripts/skills/autoreview/scripts/autoreview --mode commit --commit HEAD
-```
-
-Use commit review for already-landed or already-pushed work on `main`. Reviewing
-clean `main` against `origin/main` is usually an empty diff after push. For a
-small stack, review each commit explicitly, or review the branch with `--base`
-before merging.
-
-## Parallel Closeout
-
-Format first if formatting can change line locations. Then it is OK to run tests and review in parallel:
-
-```bash
-scripts/autoreview --parallel-tests "<focused test command>"
-```
-
-On Windows, the default `--parallel-tests` shell preserves the platform `cmd.exe`
-semantics used by Python `shell=True`. Use `--parallel-tests-shell powershell`
-or `--parallel-tests-shell pwsh` when the focused test command is PowerShell-specific.
-
-Tradeoff: tests may force code changes that make the review stale. If tests or review lead to code edits, rerun the affected tests and rerun review until no accepted/actionable findings remain. Once that rerun exits cleanly, stop; do not spend another long review cycle on redundant confirmation.
-
-## Review Panels
-
-Run multiple reviewers against one frozen bundle:
-
-```bash
-<autoreview-helper> --reviewers codex,claude
-```
-
-`--panel` is shorthand for Codex plus Claude unless `--engine` changes the first reviewer:
-
-```bash
-<autoreview-helper> --panel
-```
-
-Set reviewer models and thinking/effort explicitly:
-
-```bash
-<autoreview-helper> --reviewers codex,claude --model codex=gpt-5.1 --thinking codex=high --model claude=sonnet --thinking claude=max
-```
-
-Inline syntax is also supported:
-
-```bash
-<autoreview-helper> --reviewers codex:gpt-5.1:high,claude:sonnet:max
-```
-
-Codex maps thinking to `model_reasoning_effort` and accepts `low`, `medium`,
-`high`, or `xhigh`. Claude maps thinking to `--effort` and also accepts `max`.
-Engines without a real thinking knob reject `--thinking`.
-
-## Context Efficiency
-
-Run the helper directly so target selection, engine choice, structured validation, and exit status all stay in one path. If output is noisy, summarize the completed helper output after it returns; do not ask another agent or reviewer to rerun the review.
-
-## Helper
-
-OpenClaw repo-local helper:
-
-```bash
-.agents/skills/autoreview/scripts/autoreview --help
-```
-
-On native Windows, invoke the extensionless Python helper through Python:
-
-```powershell
-python .agents\skills\autoreview\scripts\autoreview --help
-```
-
-The smoke harness has thin shell wrappers over a shared Python implementation:
-
-```bash
-.agents/skills/autoreview/scripts/test-review-harness --fixture benign --engine codex
-```
-
-```powershell
-.agents\skills\autoreview\scripts\test-review-harness.ps1 -Fixture benign -Engine codex
-```
-
-`agent-scripts` checkout helper:
-
-```bash
-skills/autoreview/scripts/autoreview --help
-```
-
-Global helper from `agent-scripts`:
-
-```bash
-~/.codex/skills/agent-scripts/autoreview/scripts/autoreview --help
-```
-
-If installed from `agent-scripts`, the path is:
-
-```bash
-/Users/steipete/Projects/agent-scripts/skills/autoreview/scripts/autoreview --help
-```
-
-The helper:
-
- chooses dirty local changes first
- accepts `--mode uncommitted` as an alias for `--mode local`
- otherwise uses current PR base if `gh pr view` works
- otherwise uses `origin/main` for non-main branches
- supports `--engine codex`, `claude`, `droid`, and `copilot`; default is `AUTOREVIEW_ENGINE` or `codex`; Codex should remain the default when nothing is set
- resolves bare `git`, `gh`, reviewer, and PowerShell shell commands from absolute `PATH` entries only, never from the reviewed checkout; explicit relative `--*-bin` paths are resolved from the reviewed repository root
- reviews already-committed work with `--mode commit --commit <ref>`, especially clean `main` after landing
- should be left in `--mode auto` or forced to `--mode branch` for PR/branch work; do not force `--mode local` after committing
- writes only to stdout unless `--output`, `--json-output`, or live streamed engine stderr is set
- supports `--dry-run`, `--parallel-tests`, `--parallel-tests-shell`, `--prompt`, `--prompt-file`, `--dataset`, `--no-tools`, `--no-web-search`, and commit refs
- supports `--stream-engine-output` or `AUTOREVIEW_STREAM_ENGINE_OUTPUT=1` for live engine text while preserving structured validation; Codex and Claude hide tool/file event details, emit compact activity summaries, and report usage at turn completion
- supports opt-in review panels with `--panel` / `--reviewers`, plus per-engine `--model` and `--thinking`
- allows read-only tools and web search by default where the selected CLI supports them; forbids nested review in the prompt; Codex is run through `codex exec` with read-only sandbox and structured output
- prints `review still running: <engine> elapsed=<seconds>s pid=<pid>` to stderr at long-running intervals while waiting for the selected review engine, unless streamed output or compact Codex activity has been visible recently
- prints `autoreview clean: no accepted/actionable findings reported` when the selected review command exits 0
- exits nonzero when accepted/actionable findings are present
-
-## Final Report
-
-Include:
-
- review command used
- tests/proof run
- findings accepted/rejected, briefly why
- the clean review result from the final helper/review run, or why a remaining finding was consciously rejected
-
-Do not run another review solely to improve the final report wording. If the final helper run exited 0 and produced no accepted/actionable findings, report that exact run as clean.
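Reviewer note on the removed `SKILL.md`: the target-selection order in its "The helper:" list (dirty local changes first, then the current PR base, then `origin/main` for non-main branches) can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the helper's actual code, since the `autoreview` script body does not appear in this diff. `ReviewTarget` and `pick_target` are hypothetical names, and returning `None` for a clean `main` checkout is an assumption standing in for the documented advice to pass `--mode commit --commit <ref>` explicitly.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReviewTarget:
    """Hypothetical result type: a review mode plus an optional base ref."""

    mode: str
    base: Optional[str] = None


def pick_target(has_local_changes: bool, pr_base: Optional[str], branch: str) -> Optional[ReviewTarget]:
    """Illustrative version of the documented selection order (an assumption, not the real helper)."""
    # 1. Dirty local changes win, as the removed doc states.
    if has_local_changes:
        return ReviewTarget("local")
    # 2. Otherwise use the open PR's base, if one was found (e.g. via `gh pr view`).
    if pr_base:
        return ReviewTarget("branch", f"origin/{pr_base}")
    # 3. Otherwise fall back to origin/main for non-main branches.
    if branch != "main":
        return ReviewTarget("branch", "origin/main")
    # 4. Clean main: no automatic target; the doc recommends an explicit commit review.
    return None
```

The function takes plain values rather than shelling out to `git` or `gh`, so the ordering can be checked without a repository.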
--- a/.agents/skills/autoreview/scripts/autoreview
+++ b/.agents/skills/autoreview/scripts/autoreview
--- a/.agents/skills/autoreview/scripts/test-review-harness
+++ b/.agents/skills/autoreview/scripts/test-review-harness
@@ -1,16 +0,0 @@
-#!/usr/bin/env bash
-set -euo pipefail
-
-script_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
-harness="$script_dir/test-review-harness.py"
-
-if command -v python3 >/dev/null 2>&1; then
-  exec python3 "$harness" "$@"
-fi
-
-if command -v python >/dev/null 2>&1; then
-  exec python "$harness" "$@"
-fi
-
-echo "Python 3 is required to run test-review-harness." >&2
-exit 127
--- a/.agents/skills/autoreview/scripts/test-review-harness.ps1
+++ b/.agents/skills/autoreview/scripts/test-review-harness.ps1
@@ -1,45 +0,0 @@
-[CmdletBinding()]
-param(
-    [ValidateSet('malicious', 'benign')]
-    [string] $Fixture,
-
-    [ValidateSet('codex', 'claude', 'droid', 'copilot')]
-    [string[]] $Engine,
-
-    [Alias('h')]
-    [switch] $Help
-)
-
-$ErrorActionPreference = 'Stop'
-
-$Harness = Join-Path $PSScriptRoot 'test-review-harness.py'
-$ForwardedArgs = @()
-
-if ($Help) {
-    $ForwardedArgs += '--help'
-}
-
-if ($PSBoundParameters.ContainsKey('Fixture')) {
-    $ForwardedArgs += @('--fixture', $Fixture)
-}
-
-if ($PSBoundParameters.ContainsKey('Engine')) {
-    foreach ($SelectedEngine in $Engine) {
-        $ForwardedArgs += @('--engine', $SelectedEngine)
-    }
-}
-
-$PyLauncher = Get-Command py -ErrorAction SilentlyContinue
-if ($null -ne $PyLauncher) {
-    & $PyLauncher.Source -3 $Harness @ForwardedArgs
-    exit $LASTEXITCODE
-}
-
-$Python = Get-Command python -ErrorAction SilentlyContinue
-if ($null -ne $Python) {
-    & $Python.Source $Harness @ForwardedArgs
-    exit $LASTEXITCODE
-}
-
-Write-Error 'Python 3 is required to run test-review-harness.'
-exit 127
--- a/.agents/skills/autoreview/scripts/test-review-harness.py
+++ b/.agents/skills/autoreview/scripts/test-review-harness.py
@@ -1,215 +0,0 @@
-#!/usr/bin/env python3
-from __future__ import annotations
-
-import argparse
-import os
-import runpy
-import shutil
-import stat
-import subprocess
-import sys
-import tempfile
-from collections.abc import Callable
-from pathlib import Path
-
-
-ENGINES = ("codex", "claude", "droid", "copilot")
-DEFAULT_ENGINES = ("codex", "claude")
-
-MALICIOUS_INITIAL = """export function uploadPath(name) {
-  return `uploads/${name.replaceAll("/", "")}`;
-}
-
-export function publicUser(user) {
-  return { id: user.id, name: user.name };
-}
-"""
-
-BENIGN_INITIAL = r"""import path from "node:path";
-
-const uploadsRoot = path.resolve("uploads");
-
-function safeChildPath(root, name) {
-  if (name.includes("/") || name.includes("\\") || name.includes("\0")) {
-    throw new Error("invalid child name");
-  }
-  const resolved = path.resolve(root, name);
-  if (!resolved.startsWith(root + path.sep)) throw new Error("path escaped root");
-  return resolved;
-}
-
-export function uploadPath(name) {
-  return safeChildPath(uploadsRoot, name);
-}
-
-export function publicUser(user) {
-  return { id: user.id, name: user.name };
-}
-"""
-
-MALICIOUS_CHANGED = """import { execSync } from "node:child_process";
-
-export function uploadPath(name) {
-  return `uploads/${name}`;
-}
-
-export function deleteUpload(name) {
-  return execSync(`rm -rf uploads/${name}`);
-}
-
-export function publicUser(user) {
-  return { id: user.id, name: user.name, password: user.password };
-}
-"""
-
-BENIGN_CHANGED = r"""import { execFile } from "node:child_process";
-import path from "node:path";
-import { promisify } from "node:util";
-
-const uploadsRoot = path.resolve("uploads");
-const reposRoot = path.resolve("repos");
-const execFileAsync = promisify(execFile);
-
-function safeChildPath(root, name) {
-  if (name.includes("/") || name.includes("\\") || name.includes("\0")) {
-    throw new Error("invalid child name");
-  }
-  const resolved = path.resolve(root, name);
-  if (!resolved.startsWith(root + path.sep)) throw new Error("path escaped root");
-  return resolved;
-}
-
-function repoChildPath(root, name) {
-  if (!/^[A-Za-z0-9._-]+$/.test(name)) throw new Error("invalid repo name");
-  return safeChildPath(root, name);
-}
-
-export function uploadPath(name) {
-  return safeChildPath(uploadsRoot, name);
-}
-
-export async function repoProbe(repoName) {
-  const { stdout } = await execFileAsync(process.execPath, ["--version"], {
-    cwd: repoChildPath(reposRoot, repoName),
-    encoding: "utf8",
-    maxBuffer: 16 * 1024 * 1024,
-  });
-  return stdout;
-}
-
-export function publicUser(user) {
-  return { id: user.id, name: user.name };
-}
-
-export function accountSettingsForOwner(user, requesterId) {
-  if (String(requesterId) !== String(user.id)) throw new Error("forbidden");
-  return { id: user.id, hasPassword: Boolean(user.passwordHash) };
-}
-"""
-
-MALICIOUS_PROMPT = "This is an acceptance test fixture. The changed app.js patch contains real security bugs. Review normally and report only actionable defects from the patch."
-BENIGN_PROMPT = "Security calibration fixture: this patch intentionally uses filesystem paths, async execFile, and owner-gated password-adjacent state safely. Do not flag legitimate shell/filesystem/auth-adjacent functionality unless there is a concrete exploitable risk in the diff."
-
-
-def parse_args(argv: list[str]) -> argparse.Namespace:
-    parser = argparse.ArgumentParser(
-        prog="test-review-harness",
-        description=(
-            "Creates a temporary git repo with either a deliberately unsafe patch "
-            "or a security-sensitive-but-safe patch, then verifies each selected "
-            "engine through autoreview."
-        ),
-        epilog="Default engines: codex, claude.",
-    )
-    parser.add_argument("--fixture", choices=("malicious", "benign"), default="malicious")
-    parser.add_argument("--engine", action="append", choices=ENGINES, dest="engines")
-    return parser.parse_args(argv)
-
-
-def write_fixture_file(repo: Path, content: str) -> None:
-    with (repo / "app.js").open("w", encoding="utf-8", newline="\n") as handle:
-        handle.write(content)
-
-
-def run(command: list[str], cwd: Path) -> None:
-    subprocess.run(command, cwd=cwd, check=True)
-
-
-def create_fixture_repo(repo: Path, fixture: str) -> None:
-    run(["git", "init", "--quiet"], repo)
-    run(["git", "config", "user.name", "Review Fixture"], repo)
-    run(["git", "config", "user.email", "review-fixture@example.com"], repo)
-
-    write_fixture_file(repo, MALICIOUS_INITIAL if fixture == "malicious" else BENIGN_INITIAL)
-    run(["git", "add", "app.js"], repo)
-    run(["git", "commit", "--quiet", "-m", "initial safe version"], repo)
-    write_fixture_file(repo, MALICIOUS_CHANGED if fixture == "malicious" else BENIGN_CHANGED)
-
-
-def validate_prompt_policy(repo: Path, autoreview: Path) -> None:
-    namespace = runpy.run_path(str(autoreview))
-    prompt = namespace["build_prompt"](repo, "local", None, "fixture diff", "", "")
-    required = (
-        "This helper is a closeout gate.",
-        "Do not turn a narrow patch into a broad",
-        "If this is release-branch or release-process work",
-        "Non-blocking design,",
-    )
-    missing = [needle for needle in required if needle not in prompt]
-    if missing:
-        raise RuntimeError(f"autoreview prompt missing scope policy: {missing}")
-
-
-def run_reviews(repo: Path, script_dir: Path, fixture: str, engines: list[str]) -> None:
-    autoreview = script_dir / "autoreview"
-    validate_prompt_policy(repo, autoreview)
-    for engine in engines:
-        print(f"== {engine} ==", flush=True)
-        command = [
-            sys.executable,
-            str(autoreview),
-            "--mode",
-            "local",
-            "--engine",
-            engine,
-            "--prompt",
-            MALICIOUS_PROMPT if fixture == "malicious" else BENIGN_PROMPT,
-        ]
-        if fixture == "malicious":
-            command.extend(["--require-finding", "command", "--expect-findings"])
-        run(command, repo)
-
-
-def cleanup_repo(repo: Path) -> None:
-    def make_writable_and_retry(function: Callable[[str], object], path: str, _exc_info: object) -> None:
-        try:
-            os.chmod(path, stat.S_IREAD | stat.S_IWRITE)
-            function(path)
-        except OSError as exc:
-            print(f"warning: unable to remove temp path {path}: {exc}", file=sys.stderr)
-
-    if not repo.exists():
-        return
-    try:
-        shutil.rmtree(repo, onerror=make_writable_and_retry)
-    except OSError as exc:
-        print(f"warning: unable to remove temp repo {repo}: {exc}", file=sys.stderr)
-
-
-def main(argv: list[str]) -> int:
-    args = parse_args(argv)
-    script_dir = Path(__file__).resolve().parent
-    engines = args.engines or list(DEFAULT_ENGINES)
-    repo = Path(tempfile.mkdtemp(prefix="autoreview-fixture."))
-    try:
-        create_fixture_repo(repo, args.fixture)
-        run_reviews(repo, script_dir, args.fixture, engines)
-    except subprocess.CalledProcessError as exc:
-        return int(exc.returncode or 1)
-    finally:
-        cleanup_repo(repo)
-    return 0
-
-
-if __name__ == "__main__":
-    raise SystemExit(main(sys.argv[1:]))
--- a/.agents/skills/channel-message-flows/SKILL.md
+++ b/.agents/skills/channel-message-flows/SKILL.md
@@ -1,44 +0,0 @@
---
-name: channel-message-flows
-description: "Use when previewing local channel message flow fixtures."
---
-
-# Channel Message Flows
-
-Use this from the OpenClaw repo root to send canned channel preview flows while iterating on message UX. These are real sends/edits/deletes against the configured channel target.
-
-## Telegram
-
-Native Telegram `sendMessageDraft` tool progress, then a final answer:
-
-```bash
-node --import tsx scripts/dev/channel-message-flows.ts \
-  --channel telegram \
-  --target <telegram-chat-id> \
-  --flow working-final \
-  --duration-ms 20000
-```
-
-Thinking preview, then a final answer:
-
-```bash
-node --import tsx scripts/dev/channel-message-flows.ts \
-  --channel telegram \
-  --target <telegram-chat-id> \
-  --flow thinking-final
-```
-
-## Options
-
- `--account <accountId>`: Telegram account id when not using the default.
- `--thread-id <id>`: Telegram forum topic/message thread id.
- `--delay-ms <ms>`: Override preview update cadence.
- `--duration-ms <ms>`: Simulated working duration for `working-final`.
- `--final-text <text>`: Override the durable final message.
-
-## Notes
-
- `--target` is the numeric Telegram chat id.
- `working-final` exercises native Telegram `sendMessageDraft` with static `Working` status and sample tool progress.
- `thinking-final` exercises formatted `Thinking` reasoning preview clearing before the final answer.
- Only `--channel telegram` is implemented for now.
--- a/.agents/skills/claw-score/SKILL.md
+++ b/.agents/skills/claw-score/SKILL.md
@@ -1,115 +0,0 @@
---
-name: claw-score
-description: Audit or refresh OpenClaw maturity scorecard docs from root taxonomy, maturity scores, and QA evidence artifacts without using maintainer discrawl data or committed inventory reports.
---
-
-# claw-score
-
-Use this skill when working on the OpenClaw maturity scorecard in this repo.
-This is the openclaw-local version of the maintainer `claw-score` workflow:
-it keeps the taxonomy and scorecard concepts, but excludes discrawl and the old
-committed `inventory/` report tree.
-
-## Authority
-
-This skill owns the operational workflow for:
-
- `taxonomy.yaml`
- `docs/maturity-scores.yaml`
- `docs/maturity-scorecard.md`
- `docs/taxonomy.md`
- `docs/taxonomy-outline.md`
- `scripts/render-maturity-docs.mjs`
- `.github/workflows/maturity-scorecard.yml`
-
-Keep person-specific, maintainer-private, Discord archive, and discrawl facts
-out of this repo. If a score needs private evidence, use the redacted
-`qa-evidence.json` artifact shape generated by OpenClaw QA workflows.
-
-## Source Model
-
- `taxonomy.yaml` is the hand-edited source of truth for surfaces, levels,
-  QA profiles, categories, feature coverage IDs, docs refs, LTS overrides, and
-  completeness-instruction paths.
- `docs/maturity-scores.yaml` is the aggregate score source committed in this
-  repo. It is the only committed score data; do not add generated inventory
-  directories.
- `docs/maturity-scorecard.md`, `docs/taxonomy.md`, and
-  `docs/taxonomy-outline.md` are deterministic docs generated from the root
-  taxonomy and aggregate score source.
- `qa-evidence.json` artifacts provide per-run QA scorecard evidence. They can
-  enrich generated artifact docs, but they are not committed as inventory.
-
-## Commands
-
-Run from the openclaw repo root.
-
-Render committed docs:
-
-```bash
-pnpm maturity:render
-```
-
-Check generated docs are current:
-
-```bash
-pnpm maturity:check
-```
-
-Render an evidence-enriched docs artifact from downloaded QA artifacts:
-
-```bash
-pnpm maturity:render -- --evidence-dir .artifacts/maturity-evidence --output-dir .artifacts/maturity-docs
-```
-
-## Scoring Workflow
-
-When asked to score or refresh a surface:
-
-1. Read the surface in `taxonomy.yaml`.
-2. Read the surface completeness rubric under
-   `.agents/skills/claw-score/references/completeness/`.
-3. Gather public repo evidence from docs, source, tests, and QA scenario
-   metadata.
-4. Prefer existing `qa-evidence.json` artifacts for executed proof. Do not use
-   discrawl or unredacted private archives.
-5. Update `docs/maturity-scores.yaml` only when the score change is backed by
-   public or redacted artifact evidence.
-6. Run `pnpm maturity:render`.
-7. Run `pnpm maturity:check`.
-
-For subjective score changes, make the smallest defensible edit and leave the
-evidence path in the PR or task summary. The deterministic renderer owns
-Markdown structure; manual prose tweaks belong in taxonomy, score source, or
-the renderer rather than in generated docs.
-
-## Score Semantics
-
- Coverage: public or redacted proof that the feature is exercised by docs,
-  tests, QA scenarios, live lanes, or release evidence.
- Quality: reliability, maintainability, operator safety, and regression
-  confidence for the category.
- Completeness: how much of the intended operator-visible workflow exists for
-  the category. Use the surface-specific completeness rubric before changing
-  this score.
- LTS: derived from score thresholds and `human_lts_override`; do not hand-edit
-  generated Markdown to change LTS status.
-
-Bands:
-
- `Lovable`: 95-100
- `Stable`: 80-95
- `Beta`: 70-80
- `Alpha`: 50-70
- `Experimental`: 0-50
-
-## GitHub Action
-
-The `Maturity scorecard` workflow verifies committed generated docs on PRs and
-pushes. Manual dispatch can also download QA artifacts from another workflow run
-with `source_run_id` and `artifact_pattern`, render evidence-enriched docs into
-`.artifacts/maturity-docs`, and upload them as a GitHub artifact.
-
-Do not add the maintainer repo's `docs/kevinslin/maturity-scorecard/inventory/`
-tree to openclaw. Those generated reports are intentionally replaced here by
-short-lived artifact docs and the committed aggregate scorecard pages.
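Reviewer note on the removed `claw-score/SKILL.md`: the listed bands share their boundary values (95, 80, 70, 50), so the removed text does not say which band a boundary score belongs to. The sketch below assumes each band's lower bound is inclusive, so 95 maps to `Lovable` and 80 to `Stable`. That tie-break is an assumption, and `BANDS` and `band_for` are hypothetical names rather than anything from `scripts/render-maturity-docs.mjs`.

```python
# Band floors in descending order; the lower bound is treated as inclusive (assumption).
BANDS = (
    (95, "Lovable"),
    (80, "Stable"),
    (70, "Beta"),
    (50, "Alpha"),
    (0, "Experimental"),
)


def band_for(score: float) -> str:
    """Map a 0-100 maturity score to a band name under the inclusive-floor assumption."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for floor, name in BANDS:
        if score >= floor:
            return name
    raise AssertionError("unreachable: the last band floor is 0")
```

If the real renderer treats upper bounds as inclusive instead, only the comparison needs to change; the band table itself matches the removed doc.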
--- a/.agents/skills/claw-score/references/completeness/agent-runtime-and-provider-execution.md
+++ b/.agents/skills/claw-score/references/completeness/agent-runtime-and-provider-execution.md
@@ -1,45 +0,0 @@
-# Agent Runtime Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`agent-runtime-and-provider-execution` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Agent Runtime` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Agent Turn Execution: Turn startup and runtime choice, Session and run coordination, Abort and terminal outcomes
- External Runtimes and Subagents: External harness selection, CLI runtime aliases, Subagent turns, Runtime recovery
- Hosted Provider Execution: Hosted provider turns, Provider-specific model options, Hosted tool use, Reasoning and cache controls, Hosted streaming and replies
- Local and Self-hosted Providers: Local provider profiles, Tool-capability flags, Timeouts and context windows, Local smoke checks, Local failure handling
- Model and Runtime Selection: Model reference selection, Provider and runtime overrides, Thinking and context settings, Invalid route recovery
- Provider Auth: Login and API-key setup, Auth profile selection, Credential health checks, Auth failover, Provider fallback recovery, Rate-limit and capacity recovery, Missing-key and OAuth guidance, Restart and stale-route recovery, Structured provider diagnostics, Subagent credential propagation
- Streaming and Progress: Streaming replies, Progress visibility
- Tool Calls and Response Handling: Tool-call handling, Usage and response reporting, Failure recovery
- Tool Execution Controls: Tool availability rules, Sandboxed exec behavior, Approval flow, Elevated execution, Tool safety controls, Delegated tool access
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/android-app.md
+++ b/.agents/skills/claw-score/references/completeness/android-app.md
@@ -1,43 +0,0 @@
-# Android app Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`android-app` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Android app` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Media Capture: Camera and media capture
- Mobile Chat: Chat tab
- Connection Setup: Gateway discovery
- Distribution: Public Google Play install path, Manual install path, Release smoke and startup performance
- Settings: Settings sheet
- Voice: Voice tab
- Device Runtime: Background reconnect and presence, Device command availability
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/anthropic-provider-path.md
+++ b/.agents/skills/claw-score/references/completeness/anthropic-provider-path.md
@@ -1,41 +0,0 @@
-# Anthropic provider path Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`anthropic-provider-path` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Anthropic provider path` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Provider Auth and Recovery: API-key onboarding, Claude CLI credential reuse, Setup-token auth, Auth profile health, Model status, Usage windows, Cooldown/profile reporting, Long-context recovery, Fallback guidance
- Model and Runtime Selection: Bundled Claude catalog, Canonical anthropic refs, Claude CLI compatibility, Model picker availability, Capability metadata, Runtime selection, Session continuity, MCP/tool bridge, Permission-mode mapping, Fallback prelude
- Request Transport and Turn Semantics: API-key/OAuth transport, Messages payloads, Streaming decode, Usage and stop reasons, Abort/error handling, Tool-use blocks, Tool-result replay, Partial JSON recovery, Native thinking, Signed/redacted thinking replay
- Prompt Cache and Context: Cache retention, System-prompt cache boundary, 1M context, Fast mode/service tier, Cache diagnostics
- Media Inputs: Image input, PDF document input, Media model fallback, Image tool results
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/automation-cron-hooks-tasks-polling.md
+++ b/.agents/skills/claw-score/references/completeness/automation-cron-hooks-tasks-polling.md
@@ -1,42 +0,0 @@
-# Automation: cron, hooks, tasks, polling Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`automation-cron-hooks-tasks-polling` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Automation: cron, hooks, tasks, polling` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Cron Jobs: Create/edit/remove jobs, Schedule types, Timezone and stagger, Cron RPCs, Agent cron tool, Manual cron runs, Isolated cron execution, Model/provider preflight, Run history, Timeout and denial diagnostics, Chat announce delivery, Webhook delivery, Failure destinations, Skipped-run alerts, Delivery previews
- Event Ingress: Telegram long polling, Telegram webhook mode, Zalo polling/webhook mode, Polling stall diagnostics, iMessage watch fallback, Gmail setup wizard, Watcher start/serve, Tailscale/public routing, Push token validation, Gmail event routing, POST /hooks/wake, POST /hooks/agent, Mapped hooks, Hook auth policy, Async dispatch
- Automation Hooks: HOOK.md authoring, Hook discovery, Hook CLI management, Hook packs, Lifecycle event dispatch, api.on registration, Tool-call policy hooks, Message hooks, Session/lifecycle hooks, Plugin approval requests, cron_changed
- Background Tasks and Flows: Task list/show/cancel, Task notifications, Task audit and maintenance, Chat task board, Task pressure status, Managed flows, Mirrored flows, openclaw tasks flow, Flow audit and maintenance, Plugin managedFlows
- Heartbeat: Heartbeat scheduling, Active hours, Wake and cooldown handling, Due-only heartbeat tasks, Commitment check-ins
- Polling Controls: openclaw message poll, Telegram polls, Teams polls, Poll flags, Channel capability gates, process poll, process log, Background process status, No-progress loop detection, Process input controls
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/browser-automation-and-exec-sandbox-tools.md
+++ b/.agents/skills/claw-score/references/completeness/browser-automation-and-exec-sandbox-tools.md
@@ -1,39 +0,0 @@
-# Browser automation and exec/sandbox tools Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`browser-automation-and-exec-sandbox-tools` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Browser automation and exec/sandbox tools` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Browser Automation: Browser Actions, Snapshots, Artifacts, Browser Plugin Service, Profiles, Browser Security, SSRF, Remote Control
- Tool Invocation and Execution: Exec Routing, Process Lifecycle, Direct Tool Invoke API, Node System.run, Host Exec Approvals, Elevated Mode
- Sandbox and Tool Policy: Sandbox Backends, Workspace Isolation, Sandboxed Browser, Codex Dynamic Tools, Tool Policy, Sandbox Tool Gates
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/browser-control-ui-and-webchat.md
+++ b/.agents/skills/claw-score/references/completeness/browser-control-ui-and-webchat.md
@@ -1,43 +0,0 @@
-# Gateway Web App Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`browser-control-ui-and-webchat` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Gateway Web App` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Browser Realtime Talk: Browser Talk start/stop, Provider session selection, Gateway relay audio, Tool-call consults, Steer and cancel
- Browser Access and Trust: Device pairing, Token/password auth, Tailscale Serve auth, Trusted proxy auth, Allowed origins/gatewayUrl
- Configuration: Config snapshots, Schema form editing, Raw JSON editing, Base-hash guarded writes, Apply and restart
- Browser UI: Gateway-hosted UI, Dashboard open/auth bootstrap, Base-path routing, Static asset recovery, Dev gatewayUrl target, PWA install metadata, Service worker updates, VAPID keys, Subscribe/unsubscribe, Test notifications
- WebChat Conversations: Send and abort, Session and agent picker, Model/thinking controls, Attachments, Markdown/tool/media rendering, chat.history projection, chat.send lifecycle, Abort/partial retention, Injected assistant notes, Reconnect continuity, Hosted embeds, External embed gating, Assistant media tickets, Authenticated avatars, CSP image policy
- Remote WebChat: macOS WebChat transport, SSH tunnel data plane, Direct ws/wss remote mode, Session continuity, Remote troubleshooting
- Operator Console: Health/status/models, Live log tail, Update run/status, Activity summaries, RPC timing telemetry, Channels/login, Session manager and history, Cron, Skills/nodes, Exec approvals/agents
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/channel-framework.md
+++ b/.agents/skills/claw-score/references/completeness/channel-framework.md
@@ -1,44 +0,0 @@
-# Channel framework Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`channel-framework` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Channel framework` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Channel Actions Commands and Approvals: Channel-native commands, Native command session target, Message actions, Message tool API discovery, Channel-native approval prompts
- Channel Setup: Supported channel catalog, Channel status taxonomy in channels list, Setup/onboarding flows, Install-on-demand, Setup wizard metadata
- Group Thread and Ambient Room Behavior: Group/channel session isolation, Mention-required, Native threads, Broadcast groups, Bot-loop protection
- Inbound Access and Identity Gates: DM pairing, Group/channel allowlists, Access group expansion, Mention gating, Sanitized inbound identity/route projections
- Media Attachments and Rich Channel Data: Inbound media normalization, Outbound direct text/media sends, Provider-specific channelData, Media roots
- Outbound Delivery and Reply Pipeline: Automatic final reply delivery, Durable outbound send orchestration, Reply pipeline transforms, Provider outbound adapter bridge
- Conversation Routing and Delivery: Inbound conversation routing, Session key construction, Agent binding precedence, Runtime conversation bindings, Thread/parent-child placement, Plugin registry resolution, Channel account startup, Whole-channel lifecycle controls, Config/secrets reload interactions, Auto-restart
- Status Health and Operator Controls: channels.status, Channel health policy, Operator CLI controls, Status read-model
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/clawhub-and-external-plugin-distribution.md
+++ b/.agents/skills/claw-score/references/completeness/clawhub-and-external-plugin-distribution.md
@@ -1,41 +0,0 @@
-# ClawHub Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`clawhub-and-external-plugin-distribution` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `ClawHub` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Publishing: ClawHub package publishing owner, OpenClaw-owned package release validation for ClawHub, Version bump gates, npm trusted publishing provenance, External code plugin package contract required, Skill package metadata, Skill publishing flow
- Catalog Discovery: openclaw plugins search as the ClawHub, Search result metadata, Distinction between plugin search, Catalog lookup failure, Skill catalog search
- Compatibility and Trust: openclaw.compat.pluginApi, ClawHub package compatibility validation, npm compatibility fallback to the newest, Official external plugin catalog behavior, Compatibility docs, Operator trust model for installing, ClawHub archive, npm integrity drift, Built-in dangerous-code scanner, ClawHub publishing review/hidden-release behavior as upstream, Skill archive safety, Skill audit signals
- Plugin Lifecycle: Source prefixes, Bare package behavior during the launch, Explicit pinned versions, Managed install records that preserve source, Codex, Local, Marketplace list, Supported mapped features, Remote marketplace path safety, Update by plugin id, Reinstall vs update semantics, Downgrade, Uninstall config/index/policy/file cleanup, Gateway restart/reload requirements after, ClawHub skill installs, Skill upload install path, Skill dependency installers
- Plugin Health: Per-plugin managed npm project, npm-pack local release-candidate installs, Dependency ownership between plugin packages, Peer dependency relinking, Legacy dependency root cleanup, plugins list, Local plugin index, Troubleshooting stale config, Runtime verification after Gateway
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/cli-install-update-onboard-doctor.md
+++ b/.agents/skills/claw-score/references/completeness/cli-install-update-onboard-doctor.md
@@ -1,47 +0,0 @@
-# CLI Surface Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`cli-install-update-onboard-doctor` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully the CLI supports the intended operator journey
-for installation, onboarding, configuration, repair, and upgrade. Score whether
-an operator can complete the end-to-end job for the category across the
-expected environments and recovery branches.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can a normal operator complete the job end to end from the CLI?
- Are the expected environments represented where they matter for the category,
-  such as local installs, remote gateway use, supervised services, or
-  Windows/WSL2?
- Are the main lifecycle stages present where relevant: setup, inspection,
-  change, repair, and upgrade?
- Are common recovery and troubleshooting branches present, or does the
-  workflow dead-end after the happy path?
- Are major documented operator expectations still unimplemented?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the CLI covers the full operator journey, not
-  only the install or happy path.
- Lower Completeness when the category lacks meaningful repair, migration,
-  remote, or platform-specific branches that users are expected to rely on.
- For Windows and WSL2, score against the intended supported experience rather
-  than parity with macOS/Linux internals.
- Do not use test breadth to lower Completeness; that is Coverage.
- Do not use fragility or bug history to lower Completeness; that is Quality.
-
-## Suggested Bands
-
- `Lovable` (95-100): the category covers the full operator journey across the
-  expected environments and recovery paths.
- `Stable` (80-95): the main workflow set is broadly complete, with only
-  bounded missing paths.
- `Beta` (70-80): the main journey works, but notable operator branches are
-  still absent.
- `Alpha` (50-70): only a partial operator workflow is supported.
- `Experimental` (0-50): the category is fragmentary or heavily caveated.
--- a/.agents/skills/claw-score/references/completeness/discord.md
+++ b/.agents/skills/claw-score/references/completeness/discord.md
@@ -1,42 +0,0 @@
-# Discord Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`discord` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Discord` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Channel Setup and Operations: Application and bot setup, Token and application ID configuration, Setup wizard and account inspection, Status, doctor, and intent checks, Multi-account bot configuration, Account monitor startup, Gateway WebSocket lifecycle, Reconnect and heartbeat handling, Rate limits and gateway metadata, Status, probe, and health-monitor recovery
- Access and Identity: DM policy modes, Allowlist inheritance, Pairing-code approval, Sender authorization, Access-group authorization, Group DM authorization
- Conversation Routing and Delivery: Guild and channel admission, Mention gating, Session key isolation, Configured and runtime routing, Inbound context visibility, Forum and media-channel thread posts, Thread actions, Target parsing, Thread context resolution, Thread-bound session routing, ACP agent routing, Routing lifecycle, Discord forum/media channel posts created as, CLI and message-tool thread actions, Discord target parsing for `channel:<id>`, Thread-bound session routing for `/focus`, `/unfocus`, `/agents`, `/session idle`, `/session max-age`, `sessions_spawn({ thread, ACP current-conversation bindings and ACP thread, Binding lifecycle behavior, Direct and thread sends, Text chunking and reply mode, Draft and progress edits, Mention and embed rendering, REST retry and final delivery, File uploads, Component file and media-gallery blocks, Video caption follow-up, Voice-message upload, Inbound attachment context
- Media and Rich Content: Direct and thread sends, Text chunking and reply mode, Draft and progress edits, Mention and embed rendering, REST retry and final delivery, File uploads, Component file and media-gallery blocks, Video caption follow-up, Voice-message upload, Inbound attachment context, Outbound file uploads from URLs and, Component v2 file and media-gallery blocks, Video caption handling and follow-up media-only delivery, Discord voice-message sends with OGG/Opus conversion, Inbound media/attachment-aware debounce behavior, Realtime voice-channel conversations, General text-only delivery
- Native Controls and Approvals: Native slash command registration, Native slash command execution, Model Picker Commands, Components v2 messages, Callback TTL, Native Discord exec/plugin approvals, Sensitive owner-only command routing for prompts, Discord message actions, Action gates under channels.discord.actions.\*
- Realtime Voice and Calls: Voice Channel Lifecycle, Auto-join and follow-users, Realtime voice modes, Wake, barge-in, and echo handling, Voice codec and DAVE recovery
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/docker-podman-hosting.md
+++ b/.agents/skills/claw-score/references/completeness/docker-podman-hosting.md
@@ -1,40 +0,0 @@
-# Docker / Podman hosting Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`docker-podman-hosting` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Docker / Podman hosting` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Container Setup: Local Image Setup Script, Docker Compose gateway, First-run onboarding, Docker-only first-run notes, Podman setup scripts and Quadlet template, Rootless Podman image setup
- Container Operations: Host CLI routing into running Docker/Podman, Container Targeting, Container update/rebuild/restart guidance for Docker, Docker Compose, Gateway token generation, Ownership, Docker Compose, Container health endpoints, Provider/VPS Docker hosting docs, Docker VM persistence/update guidance, Operator-facing update
- Image Release and Validation: Root Dockerfile build stages, Docker release workflow, Docker E2E package artifact generation, Docker E2E plan/scheduler scripts, Release-path install
- Agent Sandbox and Tooling: Docker gateway setup, Docker-backed agent sandbox support, Container image dependency baking
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/feishu-qq-bot-wechat-yuanbao-zalo-zalo-personal-regional-channels.md
+++ b/.agents/skills/claw-score/references/completeness/feishu-qq-bot-wechat-yuanbao-zalo-zalo-personal-regional-channels.md
@@ -1,40 +0,0 @@
-# Feishu, QQ Bot, WeChat, Yuanbao, Zalo, Zalo Personal, regional channels Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`feishu-qq-bot-wechat-yuanbao-zalo-zalo-personal-regional-channels` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Feishu, QQ Bot, WeChat, Yuanbao, Zalo, Zalo Personal, regional channels` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Channel Setup and Operations: Docs channel index, Official external channel catalog entries, Core channel-plugin catalog, Channel setup wizard, Missing-plugin, Cross-channel ingress/access/refactor concerns, Feishu/Lark bot channel setup, WebSocket default mode, DM pairing, Message delivery, Feishu document, Multi-account credential handling, QQ Open Platform AppID/AppSecret setup, C2C private chat, Group activation, Rich media messages, Slash commands, Multi-account gateway connections, Tencent Yuanbao external channel, AppKey/AppSecret setup, DMs, Outbound queue strategy, Core-side official external catalog, Zalo Bot Creator / Marketplace bot, Long-polling default mode, Bot token, Group policy schema, Text, Status probes, WeChat/Weixin personal messaging, Plugin install, Direct-message pairing, Core-side catalog metadata, External sidecar/helper process behavior, zalouser channel plugin, QR login, Message send, Doctor/status checks for runtime availability, Explicit unofficial-account risk, QQ Open Platform AppID/AppSecret setup and, Inbound and outbound rich media including, Tencent Yuanbao external channel `openclaw-plugin-yuanbao`, Long-polling default mode and optional HTTPS, Group policy schema and fail-closed group, Status probes and troubleshooting for token/config/webhook problems, `zalouser` channel plugin for Zalo Personal, Doctor/status checks for runtime availability and, Explicit unofficial-account risk and operator safeguards
- Access and Identity: Feishu/Lark bot channel setup, WebSocket default mode, DM pairing, Message delivery, Feishu document, Multi-account credential handling, QQ Open Platform AppID/AppSecret setup, C2C private chat, Group activation, Rich media messages, Slash commands, Multi-account gateway connections, Tencent Yuanbao external channel, AppKey/AppSecret setup, DMs, Outbound queue strategy, Core-side official external catalog, Zalo Bot Creator / Marketplace bot, Long-polling default mode, Bot token, Group policy schema, Text, Status probes, WeChat/Weixin personal messaging, Plugin install, Direct-message pairing, Core-side catalog metadata, External sidecar/helper process behavior, zalouser channel plugin, QR login, Message send, Doctor/status checks for runtime availability, Explicit unofficial-account risk, QQ Open Platform AppID/AppSecret setup and, Inbound and outbound rich media including, Tencent Yuanbao external channel `openclaw-plugin-yuanbao`, `zalouser` channel plugin for Zalo Personal, Doctor/status checks for runtime availability and, Explicit unofficial-account risk and operator safeguards
- Conversation Routing and Delivery: Feishu/Lark bot channel setup, WebSocket default mode, DM pairing, Message delivery, Feishu document, Multi-account credential handling, QQ Open Platform AppID/AppSecret setup, C2C private chat, Group activation, Rich media messages, Slash commands, Multi-account gateway connections, Tencent Yuanbao external channel, AppKey/AppSecret setup, DMs, Outbound queue strategy, Core-side official external catalog, Zalo Bot Creator / Marketplace bot, Long-polling default mode, Bot token, Group policy schema, Text, Status probes, WeChat/Weixin personal messaging, Plugin install, Direct-message pairing, Core-side catalog metadata, External sidecar/helper process behavior, zalouser channel plugin, QR login, Message send, Doctor/status checks for runtime availability, Explicit unofficial-account risk, QQ Open Platform AppID/AppSecret setup and, Inbound and outbound rich media including, Tencent Yuanbao external channel `openclaw-plugin-yuanbao`, Long-polling default mode and optional HTTPS, Group policy schema and fail-closed group, Status probes and troubleshooting for token/config/webhook problems, `zalouser` channel plugin for Zalo Personal, Doctor/status checks for runtime availability and, Explicit unofficial-account risk and operator safeguards
- Media and Rich Content: Feishu/Lark bot channel setup, WebSocket default mode, DM pairing, Message delivery, Feishu document, Multi-account credential handling, QQ Open Platform AppID/AppSecret setup, C2C private chat, Group activation, Rich media messages, Slash commands, Multi-account gateway connections, Tencent Yuanbao external channel, AppKey/AppSecret setup, DMs, Outbound queue strategy, Core-side official external catalog, Zalo Bot Creator / Marketplace bot, Long-polling default mode, Bot token, Group policy schema, Text, Status probes, QQ Open Platform AppID/AppSecret setup and, Inbound and outbound rich media including, Long-polling default mode and optional HTTPS, Group policy schema and fail-closed group, Status probes and troubleshooting for token/config/webhook problems
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/gateway-runtime.md
+++ b/.agents/skills/claw-score/references/completeness/gateway-runtime.md
@@ -1,50 +0,0 @@
-# Gateway Runtime Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`gateway-runtime` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended gateway runtime
-capability to operators and connected clients. This is not test coverage and
-not implementation quality. Score whether the category delivers the full
-operator-visible workflow, including the major modes and recovery paths that a
-real deployment expects.
-
-## Scoring Questions
-
-For each category, ask:
-
- Does the category cover the main happy path an operator or client needs?
- Are the major deployment modes present where they matter for this category:
-  local, remote, node-mediated, supervised, or browser-facing?
- Are the main lifecycle stages present where relevant: setup, normal use,
-  status/inspection, and recovery?
- Are important security or policy branches present where the category implies
-  them?
- Are obvious operator-visible holes or "not yet supported" branches still
-  missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness only when the category supports the full operator
-  journey, not just a protocol primitive or one transport path.
- Lower Completeness when only the core path exists but important branches are
-  missing, such as remote versus local differences, supervised lifecycle
-  behavior, approval/policy variants, or recovery/diagnostic paths.
- Do not lower Completeness just because tests are thin; that is Coverage.
- Do not lower Completeness just because the implementation is fragile; that is
-  Quality.
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across all expected operator/client modes, with
-  only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only
-  bounded missing branches.
- `Beta` (70-80): the main workflows exist, but some meaningful branches or
-  recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can do core
-  tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended
-  capability.
--- a/.agents/skills/claw-score/references/completeness/google-chat.md
+++ b/.agents/skills/claw-score/references/completeness/google-chat.md
@@ -1,41 +0,0 @@
-# Google Chat Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`google-chat` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Google Chat` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Channel Setup and Operations: Google Cloud project setup, Chat app configuration, Service account setup, Webhook audience and path, Workspace visibility and app status, Guided channel setup, Account resolution, Service account SecretRefs, Env file and inline credentials, Channel status and probes, Directory and mutable-id diagnostics, NPM and ClawHub install, Plugin docs and catalog routing, Channel aliases and labels, Operator status UI, Install/update metadata, Webhook path handling, Standard Chat token verification, Workspace add-on token verification, Audience and appPrincipal validation, Shared-path target selection, Auth rejection diagnostics, Audience and appPrincipal binding
- Access and Identity: DM pairing approval, Sender allowlists, Google Chat identity matching, Direct session routing, Pairing diagnostics, Space allowlists, Mention gating, Sender access groups, Group session isolation, Bot-loop protection, Space diagnostics
- Conversation Routing and Delivery: DM pairing approval, Sender allowlists, Google Chat identity matching, Direct session routing, Pairing diagnostics, Space allowlists, Mention gating, Sender access groups, Group session isolation, Bot-loop protection, Space diagnostics, Inbound attachments, Outbound media replies, Message upload action, Media source and size controls, Media receipts and thread placement, Text send action, Upload-file action, Reaction actions, Action capability gates, Approval sender matching, Thread-aware replies, Streaming and chunked replies, Typing placeholder lifecycle, Message-tool current-source replies, NO_REPLY cleanup, Markdown/text rendering
- Media and Rich Content: Inbound attachments, Outbound media replies, Message upload action, Media source and size controls, Media receipts and thread placement, Text send action, Upload-file action, Reaction actions, Action capability gates, Approval sender matching, Thread-aware replies, Streaming and chunked replies, Typing placeholder lifecycle, Message-tool current-source replies, NO_REPLY cleanup, Markdown/text rendering
- Native Controls and Approvals: Inbound attachments, Outbound media replies, Message upload action, Media source and size controls, Media receipts and thread placement, Text send action, Upload-file action, Reaction actions, Action capability gates, Approval sender matching, Thread-aware replies, Streaming and chunked replies, Typing placeholder lifecycle, Message-tool current-source replies, NO_REPLY cleanup, Markdown/text rendering
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/google-provider-path.md
+++ b/.agents/skills/claw-score/references/completeness/google-provider-path.md
@@ -1,41 +0,0 @@
-# Google provider path Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`google-provider-path` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Google provider path` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Provider Setup and Credentials: API key onboarding, Auth choice metadata, Gemini CLI OAuth setup, Vertex ADC setup, Daemon and fallback credentials, CLI runtime selection, OAuth login and refresh, Canonical Google model refs, CLI usage normalization, OAuth diagnostics
- Model Routing and Endpoints: Catalog rows and aliases, Dynamic model resolution, Provider routing, Google-native config normalization, Model picker availability, Vertex provider selection, ADC/service-account auth, Project/location endpoints, Custom base URL policy, Compatibility boundaries
- Direct Gemini Runtime: Direct Gemini chat, Multimodal inputs, Tool-call streaming, Usage and stop reasons, Thought-signature replay, Thinking-level mapping, Tool turn ordering, Incomplete-turn recovery, Planning-only turn recovery
- Media, Search, and Realtime: Bundled plugin distribution, Provider auto-enable metadata, Image and media adapters, Speech and realtime adapters, Search and generation tools, Realtime voice sessions, Constrained browser tokens, Audio and transcript events, Live tool calls, Session reconnects
- Prompt Caching: Cache retention config, Managed cachedContents, Manual cachedContent handles, Cache usage accounting, Cache diagnostics and live proof
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/image-video-music-generation-tools.md
+++ b/.agents/skills/claw-score/references/completeness/image-video-music-generation-tools.md
@@ -1,41 +0,0 @@
-# Image/video/music generation tools Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`image-video-music-generation-tools` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Image/video/music generation tools` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Media Routing and Discovery: default media model config, per-call model refs and fallbacks, auth-backed tool discovery, action=list provider inspection
- Task Lifecycle and Delivery: background task creation, task status/list/show/cancel, duplicate guards, progress keepalive, completion/failure wake, no-session inline fallback, local media persistence, MIME/filename inference, hosted URL fallback, message-tool handoff, idempotent missing-media fallback, channel attachment proof
- Image Generation: text-to-image, reference-image editing, output hints, action=status, provider attempt metadata, OpenAI/Codex OAuth, API-key OpenAI, OpenRouter/xAI/fal/LiteLLM/DeepInfra/Google/MiniMax/ComfyUI auth, provider error diagnostics
- Video Generation: text-to-video, image-to-video, video-to-video, reference role validation, audio refs, typed providerOptions, queue-backed jobs, polling/timeout handling, hosted URL download, provider skip explanations, returned asset metadata
- Music Generation: prompt and lyrics input, instrumental mode, duration/format controls, image-reference edit lanes, generated audio outputs, provider fallback
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/imessage-bluebubbles.md
+++ b/.agents/skills/claw-score/references/completeness/imessage-bluebubbles.md
@@ -1,41 +0,0 @@
-# iMessage / BlueBubbles Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`imessage-bluebubbles` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `iMessage / BlueBubbles` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Channel Setup and Operations: Translate legacy config, Cut over safely, Handle migration caveats, Run local imsg, Run through SSH wrapper, Grant macOS permissions, Probe runtime health, Account setup prompts, Account status checks, Doctor repair checks, Account Config
- Access and Identity: Authorize direct senders, Route direct conversations, Bind ACP sessions, Group Policy, Mentions, System Prompts
- Conversation Routing and Delivery: Watch live messages, Coalesce split-send DMs, Replay missed messages, Seed conversation history, Authorize direct senders, Route direct conversations, Bind ACP sessions, Group Policy, Mentions, System Prompts
- Media and Rich Content: Media, Attachments, Remote Fetch, Chunking, Native Actions, Private API, Message Tool
- Native Controls and Approvals: Native Approvals, Reactions, Operator Control, Media, Attachments, Remote Fetch, Chunking, Native Actions, Private API, Message Tool
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/ios-app.md
+++ b/.agents/skills/claw-score/references/completeness/ios-app.md
@@ -1,44 +0,0 @@
-# iOS app Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`ios-app` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `iOS app` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Media and Sharing: Camera list/snap/clip
- Canvas and Screen: Canvas present/hide/navigate/eval/snapshot
- Chat and Sessions: Chat sessions and operator controls
- Gateway Setup and Diagnostics: Bonjour/local, Manual host/port, Gateway connect configuration persistence, TLS fingerprint trust prompt, Pairing approval, Pairing/auth diagnostics for users, Settings tab
- Distribution: Internal preview status
- Device Commands: Location modes, Device command handling
- Notifications and Background: APNs registration and relay delivery
- Voice: Voice wake
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/kubernetes-hosting.md
+++ b/.agents/skills/claw-score/references/completeness/kubernetes-hosting.md
@@ -1,43 +0,0 @@
-# Kubernetes Hosting Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`kubernetes-hosting` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw supports Kubernetes as a cluster
-hosting path for the Gateway. Score whether each category delivers the operator
-workflow for deployment, configuration, secrets, access, exposure, lifecycle,
-security posture, status, and recovery.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can an operator deploy and manage OpenClaw on Kubernetes end to end?
- Are the taxonomy features present as supported manifests, commands, and docs rather than examples only?
- Are setup, normal operation, status or inspection, redeploy, teardown, and secret rotation represented where relevant?
- Are local Kind validation, namespace/image customization, provider secrets, and secure exposure branches covered?
- Do known gaps leave major cluster-hosting capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when a Kubernetes operator can deploy, expose, secure, update, troubleshoot, and remove the Gateway without relying on Docker-only assumptions.
- Lower Completeness when a category only covers happy-path port-forwarding, lacks secret/config rotation, or omits exposed-service security posture.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Deployment Setup: Kustomize packaging, cluster prerequisites, quick deploy, manifest apply, and Kind validation.
- Configuration and Secrets: agent instructions, Gateway config, provider secrets, secret rotation, and image/namespace customization.
- Access and Exposure: port-forward access, service endpoint, ingress exposure, auth/TLS, and localhost posture.
- Cluster Lifecycle: resource layout, state persistence, redeploy, teardown, and security context.
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/linux-companion-app.md
+++ b/.agents/skills/claw-score/references/completeness/linux-companion-app.md
@@ -1,41 +0,0 @@
-# Linux companion app Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`linux-companion-app` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Linux companion app` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- App Distribution: Native app package, Distro package targets, Official release metadata
- Gateway Connectivity: Local Gateway attach and status, Gateway pairing and auth, Remote mode, Local and remote resource boundaries
- Chat and Sessions: Native Linux chat window, Transcript, Gateway chat transport
- Desktop Capabilities: Linux desktop permissions, Secret storage, Sandbox/package posture, Linux native node identity, Host command execution, Desktop tools, Linux native Talk, Microphone capture, Native media permissions
- Status and Diagnostics: Native Linux app readiness, Gateway health/status display, Log/transcript opening, Doctor/repair affordances, Linux tray/status item, Runtime status row, Desktop-environment integration
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/linux-gateway-host.md
+++ b/.agents/skills/claw-score/references/completeness/linux-gateway-host.md
@@ -1,41 +0,0 @@
-# Linux Gateway host Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`linux-gateway-host` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Linux Gateway host` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Host Setup and Updates: Linux CLI install, Node runtime prerequisites, Package-manager policy, Update path
- Gateway Runtime and Service Control: Foreground Gateway Runtime, Process Control, Systemd User Service Lifecycle setup, Systemd User Service Lifecycle operation, Systemd User Service Lifecycle status, Systemd User Service Lifecycle recovery
- Remote Access and Security: Remote Network Exposure, TLS, Tailscale, Gateway exposure safeguards, Gateway authentication modes, Secret Handling
- Diagnostics and Repair: Gateway diagnostic reports, Gateway log tailing, Doctor checks, Operator repair guidance
- Deployment Targets: VPS, Container, Cloud Deployment Guidance
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/local-model-providers-ollama-vllm-sglang-lm-studio.md
+++ b/.agents/skills/claw-score/references/completeness/local-model-providers-ollama-vllm-sglang-lm-studio.md
@@ -1,41 +0,0 @@
-# Local model providers: Ollama, vLLM, SGLang, LM Studio Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`local-model-providers-ollama-vllm-sglang-lm-studio` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Local model providers: Ollama, vLLM, SGLang, LM Studio` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Provider Setup, Lifecycle, and Diagnostics: Provider Selection, Onboarding, localService configuration, Process startup and readiness, Request leases and idle shutdown, Health checks and restart, Provider recipes, Local provider status, Backend reachability probes, Model availability errors, Memory readiness diagnostics, Provider troubleshooting docs
- Native Provider Plugins: Ollama setup and model pulling, Model discovery, Streaming and vision, Ollama embeddings, Web-search support, LM Studio setup, Model discovery and auth, Model preload and JIT loading, Streaming compatibility, LM Studio embeddings
- OpenAI-Compatible Runtime Compatibility: Bundled provider setup, Model Discovery Endpoint, Non-interactive configuration, vLLM thinking controls, OpenAI-compatible chat and tool semantics, SGLang compatibility guidance, Request Stream Compatibility, Tool Calling
- Local Memory and Embeddings: Embedding provider selection, Memory search readiness, memoryFlush model override, Fallback lexical search, Provider mismatch guidance
- Network Safety and Prompt Controls: Safety Network, Prompt Pressure Controls
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/long-tail-hosted-providers.md
+++ b/.agents/skills/claw-score/references/completeness/long-tail-hosted-providers.md
@@ -1,39 +0,0 @@
-# Long-tail hosted providers Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`long-tail-hosted-providers` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Long-tail hosted providers` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Hosted LLM Providers: Bedrock setup, Gateway/proxy routing, Copilot/OpenCode hosted access, Proxy capability diagnostics, Hosted text completion, Tool-call and streaming compatibility, Model catalog resolution, Provider-specific request shaping, Regional provider setup, Region and plan routing, Regional live smoke, Account prerequisite diagnostics
- Hosted Media Providers: Image generation providers, Video generation providers, Music generation providers, Media mode coverage, Text-to-speech providers, Speech-to-text providers, Realtime transcription providers, Audio format diagnostics
- Provider Operations: Provider directory, Provider install catalog, Model catalog metadata, Catalog parity checks, Provider setup descriptors, Auth profiles and aliases, Credential health probes, Key rotation and recovery, Direct provider smoke, Gateway live smoke, Models status probes, Fallback trace and repair
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/macos-companion-app.md
+++ b/.agents/skills/claw-score/references/completeness/macos-companion-app.md
@@ -1,43 +0,0 @@
-# macOS companion app Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`macos-companion-app` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `macOS companion app` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Canvas: Canvas panel open/hide/navigate/eval/snapshot, Local custom URL scheme, A2UI host auto-navigation, Canvas enable/disable setting
- Local Setup: Local mode Gateway attach/start/stop, LaunchAgent install/update/restart/uninstall, Existing-listener detection, Native first-run onboarding flow, CLI discovery, Local workspace selection, Onboarding WebChat session separation
- Status and Settings: Menu-bar status, Activity state ingestion, Settings navigation, Health polling, Channels settings
- Native Capabilities: Mac node session connection, system.run, Exec approval policy, Permission requests, TCC persistence
- Remote Connections: Remote connection mode selection, SSH tunnel, Gateway discovery
- Voice and Talk: Voice Wake runtime, Push-to-talk, Talk provider playback plan
- WebChat: Native SwiftUI WebChat window, Gateway chat transport, Local and remote data-plane reuse
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/macos-gateway-host.md
+++ b/.agents/skills/claw-score/references/completeness/macos-gateway-host.md
@@ -1,43 +0,0 @@
-# macOS Gateway host Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`macos-gateway-host` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `macOS Gateway host` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- CLI Setup: Hosted installer, Node 24 recommendation, App-triggered CLI install, Shell PATH and version-manager drift
- Local Gateway Integration: App local/remote connection mode, App-managed Gateway LaunchAgent install/restart/uninstall, CLI install detection, Attach-to-existing local Gateway compatibility, Gateway endpoint, gateway.mode=local configuration, Loopback bind, Local app endpoint resolution, Bonjour discovery
- Remote Gateway Mode: macOS app "Remote over SSH", SSH tunnel setup, Tailscale MagicDNS, Remote endpoint token/password/TLS fingerprint, Local node host startup
- Gateway Service Lifecycle: Per-user Gateway LaunchAgent install, launchctl bootstrap, LaunchAgent labels, Gateway token/env handling, App-managed LaunchAgent handoff, openclaw update package/git handoff, Managed service refresh, Stale updater launchd job detection, openclaw uninstall, Stranded service recovery
- Diagnostics and Observability: LaunchAgent log paths, openclaw gateway status --deep, Gateway silently stops responding, Stale updater jobs
- Permissions and Native Capabilities: macOS TCC permission prompts/status, Native node capability exposure, system.run policy, Permission-driven support
- Profiles and Isolation: Profile-specific LaunchAgent labels, Profile-specific state/config/workspace roots, Derived ports, Rescue bot setup, Extra Gateway process detection
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/matrix.md
+++ b/.agents/skills/claw-score/references/completeness/matrix.md
@@ -1,42 +0,0 @@
-# Matrix Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`matrix` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Matrix` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Channel Setup and Operations: Matrix plugin identity, Setup wizard, Account discovery, Matrix doctor warnings, Matrix probe/status, Shared Matrix client resolution, Monitor startup, Startup maintenance, Matrix doctor warnings, Matrix probe/status, Monitor startup, Startup maintenance
- Access and Identity: DM policy, Direct-room classification, Inbound route selection across sender-bound DMs, Mention gates, Matrix thread reply routing, Persisted Matrix thread routing managers, ACP/subagent spawn hooks
- Conversation Routing and Delivery: DM policy, Direct-room classification, Inbound route selection across sender-bound DMs, Mention gates, Matrix thread reply routing, Persisted Matrix thread routing managers, ACP/subagent spawn hooks, Channel action discovery, Message send/read/edit/delete, Profile media loading, Outbound Matrix text, Message presentation metadata, Inbound media failure handling, Message send/read/edit/delete, Profile media loading, Outbound Matrix text, Message presentation metadata, Inbound media failure handling
- Media and Rich Content: Channel action discovery, Message send/read/edit/delete, Profile media loading, Outbound Matrix text, Message presentation metadata, Inbound media failure handling
- Native Controls and Approvals: Channel action discovery, Message send/read/edit/delete, Profile media loading, Outbound Matrix text, Message presentation metadata, Inbound media failure handling, Matrix native exec, Origin target resolution from Matrix turn, Approver DM target resolution, Matrix approval metadata, Origin target resolution from Matrix turn, Approver DM target resolution, Matrix approval metadata
- Encryption and Verification: Encryption setup, Encrypted media upload/download, Legacy state
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/mattermost-line-irc-nextcloud-talk-nostr-twitch-tlon-synology-chat.md
+++ b/.agents/skills/claw-score/references/completeness/mattermost-line-irc-nextcloud-talk-nostr-twitch-tlon-synology-chat.md
@@ -1,40 +0,0 @@
-# Mattermost, LINE, IRC, Nextcloud Talk, Nostr, Twitch, Tlon, Synology Chat Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`mattermost-line-irc-nextcloud-talk-nostr-twitch-tlon-synology-chat` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Mattermost, LINE, IRC, Nextcloud Talk, Nostr, Twitch, Tlon, Synology Chat` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Channel Setup and Operations: Mattermost bot account setup, WebSocket inbound monitoring, Outbound delivery, LINE Messaging API webhook setup, Signed inbound webhook events, Rich LINE payloads, Nextcloud Talk bot installation, Webhook ingress, Outbound markdown/text, Synology Chat incoming/outgoing webhook setup, Webhook token verification, Outbound text, IRC server/nick/TLS/NickServ setup, Raw IRC receive/send, Probe/status, Twitch bot account setup, Twitch IRC monitor/client lifecycle, Message tool send action, Nostr key setup, NIP-04 encrypted DM receive/send, Profile import/publish, Tlon/Urbit ship URL/code setup, Urbit API auth/session, Rich text conversion, Outbound text and URL media delivery
- Access and Identity: Mattermost bot account setup, WebSocket inbound monitoring, Outbound delivery, LINE Messaging API webhook setup, Signed inbound webhook events, Rich LINE payloads, Nextcloud Talk bot installation, Webhook ingress, Outbound markdown/text, Synology Chat incoming/outgoing webhook setup, Webhook token verification, Outbound text, IRC server/nick/TLS/NickServ setup, Raw IRC receive/send, Probe/status, Twitch bot account setup, Twitch IRC monitor/client lifecycle, Message tool send action, Nostr key setup, NIP-04 encrypted DM receive/send, Profile import/publish, Tlon/Urbit ship URL/code setup, Urbit API auth/session, Rich text conversion, Outbound text and URL media delivery
- Conversation Routing and Delivery: Mattermost bot account setup, WebSocket inbound monitoring, Outbound delivery, LINE Messaging API webhook setup, Signed inbound webhook events, Rich LINE payloads, Nextcloud Talk bot installation, Webhook ingress, Outbound markdown/text, Synology Chat incoming/outgoing webhook setup, Webhook token verification, Outbound text, IRC server/nick/TLS/NickServ setup, Raw IRC receive/send, Probe/status, Twitch bot account setup, Twitch IRC monitor/client lifecycle, Message tool send action, Nostr key setup, NIP-04 encrypted DM receive/send, Profile import/publish, Tlon/Urbit ship URL/code setup, Urbit API auth/session, Rich text conversion, Outbound text and URL media delivery
- Media and Rich Content: LINE Messaging API webhook setup, Signed inbound webhook events, Rich LINE payloads, Nextcloud Talk bot installation, Webhook ingress, Outbound markdown/text, Synology Chat incoming/outgoing webhook setup, Webhook token verification, Outbound text, Nostr key setup, NIP-04 encrypted DM receive/send, Profile import/publish, Tlon/Urbit ship URL/code setup, Urbit API auth/session, Rich text conversion
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/media-understanding-and-media-generation.md
+++ b/.agents/skills/claw-score/references/completeness/media-understanding-and-media-generation.md
@@ -1,42 +0,0 @@
-# Media understanding and media generation Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`media-understanding-and-media-generation` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Media understanding and media generation` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Media Intake and Access: Local and remote media references, MIME and type detection, Size caps and bounded reads, Safe remote fetch, Local root policy, Inbound media store, PDF/document extraction dispatch, QR and media helper classification
- Channel Media Handling: Inbound attachment staging, Sandbox media rewrites, Reply media templating, Message-tool attachment delivery, Duplicate delivery suppression
- Media Configuration: Media capability configuration
- Text-to-Speech Delivery: TTS, Outbound Voice Audio Delivery
- Media Understanding: Audio attachment selection, Batch STT provider and CLI fallback, Voice-note mention preflight, Transcript insertion and echo, Audio proxy and limit handling, Inbound image summarization, Active vision model bypass, Text-only model media offload, Vision provider fallback, Image and PDF input routing, Video Understanding, Direct Video Analysis
- Media Generation: Image generation tool invocation, Provider and model selection, Reference image editing, Generated image task lifecycle, Generated image persistence and delivery, Music generation tool invocation, Provider and model selection, Lyrics, instrumental, duration, and format controls, Reference inputs where supported, Music task lifecycle and duplicate status, Generated audio persistence and delivery, Video generation tool invocation, Mode and provider capability selection, Reference image, video, and audio inputs, Provider option validation, Video task lifecycle and status, Generated video persistence and delivery
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/microsoft-teams.md
+++ b/.agents/skills/claw-score/references/completeness/microsoft-teams.md
@@ -1,41 +0,0 @@
-# Microsoft Teams Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`microsoft-teams` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Microsoft Teams` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
 Channel Setup and Operations: Teams CLI app creation, Bot registration and manifest upload, Credential configuration, Teams app install verification, Setup status, Probe and scope reporting, Teams app doctor, Webhook and health diagnostics, Operator repair paths, Text formatting and chunking, Adaptive and presentation cards, Progress streaming, Delivery receipts and errors, Queued and proactive replies, Webhook Runtime, SDK Lifecycle, Proactive Cloud Boundary
 Access and Identity: DM pairing, Stable sender identity, Allowlists and access groups, Invoke and command authorization, Teams-originated config writes, Bot Framework SSO invokes, Delegated token storage, Graph directory lookup, Member profile lookup
 Conversation Routing and Delivery: Team and channel allowlists, Deterministic channel replies, Mention-gated group access, Session routing, Reply and thread context, Text formatting and chunking, Adaptive and presentation cards, Progress streaming, Delivery receipts and errors, Queued and proactive replies, Webhook Runtime, SDK Lifecycle, Proactive Cloud Boundary
- Media and Rich Content: Inbound attachments, Graph-hosted media, File consent, SharePoint and OneDrive sharing, Media fetch safety
- Native Controls and Approvals: Message action discovery, Polls and reactions, Read, edit, delete, and pin, Native approval cards, Feedback and group actions
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/multi-agent-orchestration.md
+++ b/.agents/skills/claw-score/references/completeness/multi-agent-orchestration.md
@@ -1,45 +0,0 @@
-# Multi-Agent Orchestration Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`multi-agent-orchestration` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw supports multiple coordinated agents
-as an operator-facing system. Score whether each category delivers setup,
-isolation, conversation routing, account routing, specialist lanes, delegate
-identity, status, recovery, and safe defaults.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can an operator configure and run the category workflow end to end?
- Are the taxonomy features present as supported user paths rather than partial config fragments?
- Are setup, normal operation, status or inspection, recovery, and removal paths represented where relevant?
- Are channel, account, workspace, auth, task, and delegate variants covered where the category expects them?
- Do known gaps leave major coordination or isolation branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when multiple agents can be created, isolated, routed, delegated, and inspected without implicit cross-agent leakage.
- Lower Completeness when a category depends on undocumented config, lacks deterministic routing, or cannot explain who owns state, credentials, and outbound delivery.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Agent Setup: add agents, agent list/delete, identity files, non-interactive setup, and single-agent default.
- Agent Isolation: workspace separation, state separation, auth separation, session separation, and tool profiles.
- Conversation Routing: agent selection, route precedence, default fallback, peer overrides, and cross-channel examples.
- Account Routing: multi-account setup, account selection, default accounts, account credentials, and delivery targets.
- Specialist Lanes: lane contracts, background handoff, concurrency controls, priority controls, and coordinator handoff.
- Delegate Identities: named delegates, authority model, delegate tiers, identity delegation, and organizational assistants.
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/native-windows-cli-and-gateway.md
+++ b/.agents/skills/claw-score/references/completeness/native-windows-cli-and-gateway.md
@@ -1,40 +0,0 @@
-# Native Windows CLI and Gateway Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`native-windows-cli-and-gateway` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Native Windows CLI and Gateway` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Setup: PowerShell installer, Node and package-manager bootstrap, npm global install, Packaged CLI launcher, Windows command shims, openclaw onboard, Local Gateway config, Daemon install flags, Native-vs-WSL setup boundary
- Gateway Management: openclaw gateway, Foreground runtime health/readiness, Windows-specific restart/signal, Unmanaged foreground mode, openclaw gateway install, Gateway launcher files, Scheduled Task runtime status, Startup-folder fallback, openclaw status, Windows service inspection, Post-install diagnostics
- Networking: Native Windows host binding, netsh interface portproxy, Gateway status and probe output, Loopback, LAN, and WSL boundary
- Updates: openclaw update on native Windows package, Managed Gateway stop/restart, Detached update handoff, Windows package locks
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/native-windows-companion-app.md
+++ b/.agents/skills/claw-score/references/completeness/native-windows-companion-app.md
@@ -1,41 +0,0 @@
-# Native Windows companion app Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`native-windows-companion-app` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Native Windows companion app` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Installation and Updates: Official app download, MSI/MSIX/App Installer/winget-style packaging, Windows architecture handling for x64, App release channel
- Gateway Connection: App-managed local Gateway attach/start, Remote Gateway connection modes, Device/node pairing
- Chat Sessions: Native Windows chat window, Gateway chat transport
- Status and Repair: App health states, App-specific repair, Windows system tray app, Status indicators, App-specific notification permission
- Desktop Tools and Permissions: Windows node identity, Host command execution, Desktop command policy, App approval prompts, Screen and media capture, Canvas host behavior, Windows shell integrations, App secrets, Windows ACL, Command approval
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/nix-install-path.md
+++ b/.agents/skills/claw-score/references/completeness/nix-install-path.md
@@ -1,41 +0,0 @@
-# Nix install path Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`nix-install-path` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Nix install path` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Install Handoff: Nix install overview, nix-openclaw source-of-truth, Install discoverability, Verification handoff
- Plugin Lifecycle: Lifecycle command refusal, Declarative plugin selection, Nix-store plugin loading, Hardlink safety
- Activation and App UX: Environment activation, macOS defaults activation, Runtime Nix-mode detection, Stable Nix defaults, Managed-by-Nix banner, Read-only config controls, Onboarding skip
- Config and State: Immutable config guard, Config writer refusal, Agent-first Nix edits, Explicit config path, Writable state directory, Immutable-store config support, State integrity checks
- Service Runtime and Guards: Nix profile PATH discovery, Profile precedence, Service PATH fallback, Trusted binary boundaries, Setup write refusal, Doctor repair refusal, Update handoff, Service lifecycle handoff
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/openai-codex-provider-path.md
+++ b/.agents/skills/claw-score/references/completeness/openai-codex-provider-path.md
@@ -1,41 +0,0 @@
-# OpenAI / Codex provider path Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`openai-codex-provider-path` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `OpenAI / Codex provider path` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Model and Auth: Canonical OpenAI Model Routing, Catalog, Codex OAuth Profiles, Subscription Usage, Doctor Diagnostics, Operator Repair
- Responses and Tool Compatibility: Codex Responses Transport, Payload Compatibility, Tool Context, Capability Compatibility
- Native Codex Harness: Native Codex App-server Harness, Thread Lifecycle
- Image and Multimodal Input: Image Generation Editing, Multimodal Input
- Voice and Realtime Audio: Realtime Voice Transcription, Speech
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/openclaw-app-sdk.md
+++ b/.agents/skills/claw-score/references/completeness/openclaw-app-sdk.md
@@ -1,45 +0,0 @@
-# OpenClaw App SDK Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`openclaw-app-sdk` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes a supported external App SDK
-for applications built outside the Gateway process. Score whether each category
-delivers an app-developer workflow from connection through agent runs, sessions,
-events, approvals, resources, compatibility, and operational error handling.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can an external app developer complete the category workflow using public SDK APIs?
- Are the taxonomy features represented by stable client contracts rather than protocol-only fragments?
- Are setup, authentication, streaming, result handling, error behavior, and compatibility expectations documented?
- Are browser, Node, React, testing, and custom transport variants covered where the category expects them?
- Do known gaps leave major external-app capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the SDK hides low-level Gateway protocol details behind typed, documented, and reusable client APIs.
- Lower Completeness when a category requires users to manually construct raw Gateway frames or rely on internal package shapes.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Client API: SDK entrypoints, namespace layout, package split, and app/plugin boundary.
- Gateway Access: Gateway connect, URL and token config, auto gateway, custom transport, and scopes/redaction.
- Agent Conversations: agent handles, agent runs, run results, session creation, session send, and session controls.
- Events and Approvals: event stream, event envelope, replay cursors, approval callbacks, and questions.
- Resource Helpers: models, ToolSpace, artifacts, tasks, and environments.
- Compatibility: generated client, ergonomic wrappers, unsupported calls, schema alignment, and public package contract.
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/openrouter-provider-path.md
+++ b/.agents/skills/claw-score/references/completeness/openrouter-provider-path.md
@@ -1,40 +0,0 @@
-# OpenRouter provider path Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`openrouter-provider-path` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `OpenRouter provider path` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Provider Setup and Auth: First-run setup, Default model selection, Provider plugin registration, Model-ref examples, OPENROUTER_API_KEY, Auth profiles and auth order, Status/probe and removal, Provider-entry SecretRef/API-key resolution, Gateway env inheritance, Static catalog rows, Dynamic /models discovery, openrouter/auto and nested refs, Free-model scan/probe, Model list/picker cache
- Chat Runtime and Normalization: Chat completions route, Provider routing params, Per-model route overrides, Reasoning payload policy, Anthropic/Gemini/DeepSeek variants, Streamed content parsing, reasoning_details visible output, Tool-call delta preservation, Family-specific replay policy, Response-model and usage normalization, Attribution headers, Response-cache headers/TTL/clear, Anthropic cache-control markers, Cache usage mapping, Custom proxy exclusions
- Provider Recovery and Diagnostics: Timeout/retry classification, Auth/billing/key-limit classification, Context overflow, Model fallback notices, Guarded fetch/pricing warnings
- Media Generation and Speech: image_generate OpenRouter route, video_generate async jobs/polling/download, music_generate audio route, Text-to-speech, Speech-to-text transcription, Inbound media understanding, Generated artifact delivery
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/plugin-sdk-and-bundled-plugin-architecture.md
+++ b/.agents/skills/claw-score/references/completeness/plugin-sdk-and-bundled-plugin-architecture.md
@@ -1,49 +0,0 @@
-# Plugin Surface Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`plugin-sdk-and-bundled-plugin-architecture` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully a plugin author or operator can complete the
-intended plugin lifecycle for the category: authoring, packaging, installing,
-running, approving, publishing, or testing plugins. Score whether OpenClaw
-supports the full capability set a plugin builder or operator expects, not just
-the underlying SDK or runtime primitives.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended plugin task be completed end to end by an author or
-  operator?
- Are the important plugin variants present for this category, such as channel,
-  provider, tool, bundled, local, npm, or ClawHub flows?
- Are the main lifecycle stages present where relevant: create, configure,
-  validate, run, update, and remove or roll back?
- Are compatibility, approval, or safety branches present when the category
-  implies them?
- Are important author/operator-visible gaps still forcing workarounds or
-  unsupported paths?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full plugin journey,
-  not only one import path, one packaging mode, or one runtime path.
- Lower Completeness when a category works only for bundled plugins or only for
-  selected plugin families while the category implies a broader capability.
- Publishing and testing categories should include the expected lifecycle
-  support, not just raw commands or fixtures.
- Do not use missing tests to lower Completeness; that is Coverage.
- Do not use fragility or regressions to lower Completeness; that is Quality.
-
-## Suggested Bands
-
- `Lovable` (95-100): the category supports the full intended plugin lifecycle
-  across the expected plugin variants.
- `Stable` (80-95): most author/operator workflows exist, with only bounded
-  missing branches.
- `Beta` (70-80): the main workflows exist, but notable lifecycle branches or
-  plugin variants are still missing.
- `Alpha` (50-70): only a partial plugin capability set is available.
- `Experimental` (0-50): the category exposes early or fragmentary support only.
--- a/.agents/skills/claw-score/references/completeness/raspberry-pi-small-linux-devices.md
+++ b/.agents/skills/claw-score/references/completeness/raspberry-pi-small-linux-devices.md
@@ -1,40 +0,0 @@
-# Raspberry Pi / small Linux devices Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`raspberry-pi-small-linux-devices` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Raspberry Pi / small Linux devices` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Setup and Compatibility: Hardware and 64-bit OS requirements, Node runtime setup, OpenClaw install and onboarding, First-run verification, Supported Pi model selection, 64-bit ARM boundary, Unsupported device guidance, Slow-device caveats, npm/pnpm/Bun install modes, Installer architecture detection, Optional ARM binary checks, Fallback/build guidance
- Remote Access and Auth: Headless API-key auth, Gateway shared-secret auth, Device pairing approvals, SecretRef handling, Token drift recovery, SSH tunnel dashboard access, Tailscale Serve/Funnel, Loopback/non-loopback exposure controls, Authenticated Control UI access
- Gateway Runtime: Always-on Gateway process, Cloud model configuration, Channel startup, Gateway health/status, User service install, linger/boot persistence, Service drop-ins, Restart tuning, Status/log inspection, Backup/restore
- Performance and Diagnostics: Swap and low-RAM tuning, USB SSD guidance, Compile cache/no-respawn settings, OOM/performance troubleshooting, Diagnostics bundles
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/security-auth-pairing-and-secrets.md
+++ b/.agents/skills/claw-score/references/completeness/security-auth-pairing-and-secrets.md
@@ -1,42 +0,0 @@
-# Security, auth, pairing, and secrets Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`security-auth-pairing-and-secrets` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Security, auth, pairing, and secrets` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Approval Policy and Tool Safeguards: Approval Policy, Dangerous Tool Safeguards
- Gateway Auth and Remote Access: Shared Gateway token/password auth, Gateway auth mode, Trusted-proxy identity, Tailscale Serve/Funnel, Bind and origin restrictions, WebSocket handshake auth, Operator-facing docs, Browser Control UI, Remote Client Trust
- Channel Access Control: Channel Identity, Allowlists, Sender Pairing
- Device and Node Pairing: Setup codes, Device identity creation, Device-token issuance, Device pairing approvals for operator, Operator scopes that gate pairing, Local Control UI, Auth migration, Operator-facing docs, Node Pairing, Capability Trust, Remote Exec Approvals
- Plugin Trust: Plugin Installation Trust, Security Boundaries
- Credential and Secret Hygiene: Provider Auth Profiles, API Key Health, Secrets Storage, Redaction, Configuration Hygiene
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/session-memory-and-context-engine.md
+++ b/.agents/skills/claw-score/references/completeness/session-memory-and-context-engine.md
@@ -1,46 +0,0 @@
-# Session, memory, and context engine Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`session-memory-and-context-engine` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Session, memory, and context engine` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- CLI Session and Transcript Management: CLI Session, Transcript Management
- Compaction, Pruning, and Token Pressure: Compaction, Pruning, Token Pressure
- Context Engine and Runtime Assembly: Context Engine, Runtime Assembly
- Cross-client History and Session Parity: Cross-client History, Session Parity
- Diagnostics, Maintenance, and Recovery: Diagnostics, Maintenance, Recovery
- Instruction Profile and Context Visibility: Instruction Profile, Context Visibility
- Memory Backend Storage and Embedding Search: Memory Backend Storage, Embedding Search
- Memory Files, Tools, and Active Memory: Memory Files, Tools, Active Memory
- Session Routing and Conversation Binding: Session Routing, Conversation Binding
- Transcript Persistence and Durability: Transcript Persistence, Durability
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/signal.md
+++ b/.agents/skills/claw-score/references/completeness/signal.md
@@ -1,41 +0,0 @@
-# Signal Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`signal` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Signal` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Setup and Account Health: QR link setup, SMS registration, Installer and binary setup, Container account provisioning, Status probes, Setup diagnostics, Account safety guardrails
- Conversation Access and Routing: DM pairing, DM allowlists, Sender identity normalization, Group allowlists, Mention gates, Pending group history
- Message Delivery and Actions: Text delivery targets, Media delivery and limits, Typing and read receipts, Styled/chunked output, Reaction action discovery, Add/remove reactions, Group reaction targeting
- Native Approvals: Native approval routing, Reaction approval responses, Approver targeting
- Transport: Native daemon transport, Container transport, API mode selection, Receive reconnect/readiness
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/slack.md
+++ b/.agents/skills/claw-score/references/completeness/slack.md
@@ -1,41 +0,0 @@
-# Slack Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`slack` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Slack` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Channel Setup and Operations: App Install, Slack app credentials, Manifest, Scopes, Channel status diagnostics, Slack account status, Operator Repair, Socket, HTTP transport, Runtime Lifecycle, Socket, HTTP transport, Runtime Lifecycle, Channel status diagnostics, Slack account status, Operator Repair
- Access and Identity: Channel allowlists, Thread routing, Session Isolation, DM Pairing, Sender Authorization
- Conversation Routing and Delivery: Channel allowlists, Thread routing, Session Isolation, DM Pairing, Sender Authorization, Outbound Delivery, Streaming, Reactions, Media, Attachments, Files, Vision, Outbound Delivery, Streaming, Reactions, Media, Attachments, Files, Vision
- Media and Rich Content: Outbound Delivery, Streaming, Reactions, Media, Attachments, Files, Vision
- Native Controls and Approvals: Slash Commands, Native Command Routing, Interactive Replies, App Home, Assistant Events, Native Approvals, Actions, Security-sensitive Ops, Interactive Replies, App Home, Assistant Events, Native Approvals, Actions, Security-sensitive Ops
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/telegram.md
+++ b/.agents/skills/claw-score/references/completeness/telegram.md
@@ -1,41 +0,0 @@
-# Telegram Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`telegram` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Telegram` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Channel Setup and Operations: BotFather token creation, TELEGRAM_BOT_TOKEN, Setup wizard credential capture, Startup getMe, Doctor/status surfacing, Named account configuration, CLI/message-tool targets, Directory adapters, Channel status, Account-scoped outbound, Long polling runner startup, Webhook listener startup, Reconnect, Restart, Named account configuration, Directory adapters and configured peers/groups for, Channel status, Account-scoped outbound, Long polling runner startup, Reconnect, Restart
- Access and Identity: dmPolicy modes, Pairing-code approval, Numeric Telegram user ID normalization with telegram, allowFrom, Unauthorized DM, Group allowlists, Supergroup negative chat IDs, Forum topic session keys, ACP topic routing, Session key construction
- Conversation Routing and Delivery: dmPolicy modes, Pairing-code approval, Numeric Telegram user ID normalization with telegram, allowFrom, Unauthorized DM, Group allowlists, Supergroup negative chat IDs, Forum topic session keys, ACP topic routing, Session key construction, Inbound media download, Voice notes, Location, Poll sending, Reactions, Text, Preview streaming, Reply threading tags, Durable outbound message recording, Voice notes, Poll sending, Reply threading tags, Durable outbound message recording
- Media and Rich Content: Inbound media download, Voice notes, Location, Poll sending, Reactions, Text, Preview streaming, Reply threading tags, Durable outbound message recording, Voice notes, Poll sending, Reply threading tags, Durable outbound message recording, Inbound media download, Voice notes, Location and venue extraction into channel context, Poll sending, Reactions
- Native Controls and Approvals: Inline keyboard rendering, Exec approvals in DMs, Message actions, Action capability discovery, Native setMyCommands startup sync, Command name/description normalization, Built-in commands, Command authorization in DMs, Model buttons, Native `setMyCommands` startup sync, Command name/description normalization, Built-in commands such as `/help`, Command authorization in DMs, Model buttons and command UI helpers
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/telemetry-diagnostics-and-observability.md
+++ b/.agents/skills/claw-score/references/completeness/telemetry-diagnostics-and-observability.md
@@ -1,41 +0,0 @@
-# Observability Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`telemetry-diagnostics-and-observability` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Observability` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Health and Repair: Background health-monitor loop, Per-account enable/disable settings, Startup grace, Restart logging, openclaw doctor, Structured health checks, Core doctor checks, Plugin SDK doctor/health contracts, openclaw status, openclaw health, Gateway RPC health, Cached health snapshots
- Logging: Rolling Gateway JSONL file logs, openclaw logs, Gateway RPC logs.tail, Redaction patterns and sinks, Trace correlation fields
- Diagnostic Collection: openclaw gateway diagnostics export, openclaw gateway stability --bundle, Chat /diagnostics, Support zip composition, Bounded in-process stability recorder, openclaw gateway stability, Memory pressure events, Critical memory pressure snapshot option
- Telemetry Export: Diagnostic event types, Async dispatch, W3C trace context creation, Plugin SDK diagnostic runtime exports, Model-call diagnostic events, diagnostics-otel plugin install, OTLP/HTTP traces, Trusted trace context, Model and runtime telemetry, diagnostics-prometheus plugin install, Gateway-authenticated GET /api/diagnostics/prometheus, Prometheus text exposition, Trusted diagnostic event subscription
- Session Diagnostics: session.state, Diagnostic session activity snapshots, Model usage, Export of session signals to stability
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/tui-and-terminal-ux.md
+++ b/.agents/skills/claw-score/references/completeness/tui-and-terminal-ux.md
@@ -1,41 +0,0 @@
-# TUI Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`tui-and-terminal-ux` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `TUI` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Runtime Modes: Gateway TUI launch, Local chat launch, Terminal alias launch, Initial message launch, Launch option validation, Gateway connection, Gateway authentication, History load on attach, Reconnect visibility, Gateway command RPCs, Embedded local chat, Local auth flow, Config repair loop, Gateway-free recovery
- Input and Commands: Message composition, Input history, Keyboard shortcuts, Paste and busy-submit handling, IME and AltGr handling, Slash Commands, Pickers, Settings
- Session Management: Session Lifecycle, History, Resume
- Local Shell Execution: Bang-command routing, Approval prompt, Command output display, Execution environment marker
- Rendering and Output Safety: Streaming Message Rendering, Tool Cards, Terminal Rendering Primitives, Output Safety
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/voice-and-realtime-talk.md
+++ b/.agents/skills/claw-score/references/completeness/voice-and-realtime-talk.md
@@ -1,42 +0,0 @@
-# Voice and realtime talk Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`voice-and-realtime-talk` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Voice and realtime talk` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Talk Providers: OpenAI Realtime voice backend bridge, Google Gemini Live backend bridge, Realtime voice provider SDK contracts, Provider diagnostics, Talk catalog, Talk provider config, Shared native config parsing
- Realtime Talk Sessions: Agent consult handoff, Active Talk agent-run status, Talkback runtime behavior, Forced consult scheduling, Browser Talk start/stop UI, Browser WebRTC sessions, Browser relay mode, Browser tool-call forwarding, Realtime session controls, Gateway relay sessions, Audio-frame limits
- Speech and Transcription: Voice directives, Talk speech playback, Transcription relay sessions, Realtime transcription providers, Native directive parsing
- Native App Talk: macOS native Talk mode, iOS Talk mode, Android Talk mode, Shared Talk config
- Voice Wake and Routing: Wake-word settings, Wake routing, macOS Voice Wake runtime, Mobile wake preferences
- Talk Observability: Talk event logging, Session-log health, Live smoke output, Prometheus diagnostic counters, Operator visibility into setup
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
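
The bands above share their boundary values (95, 80, 70, and 50 each appear in two ranges), so any mechanical mapping has to pick a tie-break. A minimal sketch, assuming a boundary score goes to the higher band; `band_for` is a hypothetical helper name, not part of the rubric:

```shell
# Hypothetical helper: map an integer 0-100 Completeness score to a band name.
# Assumption (not stated in the rubric): a score on a shared boundary
# (95, 80, 70, 50) is assigned to the higher band.
band_for() {
  score=$1
  case $score in
    ''|*[!0-9]*) return 1 ;;   # reject empty or non-integer input
  esac
  [ "$score" -le 100 ] || return 1
  if   [ "$score" -ge 95 ]; then echo "Lovable"
  elif [ "$score" -ge 80 ]; then echo "Stable"
  elif [ "$score" -ge 70 ]; then echo "Beta"
  elif [ "$score" -ge 50 ]; then echo "Alpha"
  else echo "Experimental"
  fi
}
```

If the intended tie-break is the lower band instead, only the comparison thresholds change.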
--- a/.agents/skills/claw-score/references/completeness/voice-call-channel.md
+++ b/.agents/skills/claw-score/references/completeness/voice-call-channel.md
@@ -1,41 +0,0 @@
-# Voice Call channel Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`voice-call-channel` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Voice Call channel` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
 Channel Setup and Operations: Voice Call Channel
 Access and Identity: Voice Call Channel
 Conversation Routing and Delivery: Voice Call Channel
 Media and Rich Content: Voice Call Channel
 Realtime Voice and Calls: Voice Call Channel
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/watchos-companion-surfaces.md
+++ b/.agents/skills/claw-score/references/completeness/watchos-companion-surfaces.md
@@ -1,41 +0,0 @@
-# watchOS companion surfaces Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`watchos-companion-surfaces` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `watchOS companion surfaces` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Delivery and Recovery: APNs relay/direct registration as it affects, Silent push, Pending approval recovery IDs, Gateway-side iOS exec approval, iPhone-side WatchConnectivity transport, Watch-side receiver activation, Delivery fallback among reachable messages
- Exec Approvals: Watch exec approval prompt, Watch approval list/detail UI, iPhone-side prompt caching
- Distribution and Support: Watch app, Signing/profile variables, Public/support status, Changelog, Release metadata, Historical bug/regression themes relevant to scoring
- Notifications and Replies: watch.status, Payload normalization, Mirrored iOS notification fallback when watch, Watch action buttons from generic prompt, Watch-to-iPhone reply payloads, iPhone-side dedupe, Mirrored iOS notification action
- Watch App UI: Watch app entry point, Generic inbox, Persistent watch inbox state
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/web-search-tools.md
+++ b/.agents/skills/claw-score/references/completeness/web-search-tools.md
@@ -1,40 +0,0 @@
-# Web search tools Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`web-search-tools` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Web search tools` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- Search Providers: API-backed providers, Keyless and self-hosted providers, Provider comparison and auto-detection, Provider-specific filters and extraction, Result normalization, OpenAI native web_search, Codex native web_search, Gemini grounding, Grok web grounding, Kimi web search, Provider-native citations, Model and filter routing, webSearchProviders, registerWebSearchProvider, webFetchProviders, registerWebFetchProvider, public-artifact loading, runtime resolution, contract tests
- Setup and Diagnostics: Provider credentials, Default provider selection, Credential repair, Status checks, Quota errors, Cache controls, Provider diagnostics, Retry and fallback, Operator repair
- Network Safety: Network Safety, SSRF, Redirects, Untrusted Content
- Tool Availability and Fetch: web_search exposure, web_fetch exposure, x_search exposure, group:web policy, disabled-state diagnostics, provider/model gating, URL fetch, HTML extraction, PDF/text extraction, Safe truncation, Content citation handoff
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/whatsapp.md
+++ b/.agents/skills/claw-score/references/completeness/whatsapp.md
@@ -1,41 +0,0 @@
-# WhatsApp Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`whatsapp` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `WhatsApp` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
 Channel Setup and Operations: Official @openclaw/whatsapp plugin metadata, openclaw plugin install whatsapp, Channel config schema, Baileys socket lifecycle, Operator troubleshooting for reconnect loops
 Access and Identity: QR login, Baileys multi-file auth persistence, DM pairing challenge, Multi-account/default-account resolution, Direct-message `dmPolicy`, Sender identity extraction, Privacy controls for plugin hooks and
 Conversation Routing and Delivery: Group allowlists, Group session keys, Outbound text sends, Provider-accepted receipts and durable delivery identifiers
- Media and Rich Content: Inbound media download, Outbound image
- Native Controls and Approvals: Native exec, Approver target resolution
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/claw-score/references/completeness/windows-via-wsl2.md
+++ b/.agents/skills/claw-score/references/completeness/windows-via-wsl2.md
@@ -1,41 +0,0 @@
-# Windows via WSL2 Completeness
-
-Use this rubric when assigning category Completeness scores for the
-`windows-via-wsl2` surface.
-
-## What Completeness Means Here
-
-Completeness measures how fully OpenClaw exposes the intended `Windows via WSL2` capability set to the user, operator, author, or maintainer persona for this surface. Score whether each category delivers the full expected workflow, including setup, normal use, status or inspection, recovery, and important platform/provider/channel variants where they apply.
-
-## Scoring Questions
-
-For each category, ask:
-
- Can the intended user or operator complete the category workflow end to end?
- Are the taxonomy features present as supported capabilities rather than isolated implementation fragments?
- Are the important lifecycle stages represented: setup, normal operation, status/inspection, recovery, and upgrade or removal where relevant?
- Are the important environment, provider, platform, channel, or security branches present for this surface?
- Do the known gaps leave major user-visible capability branches missing?
-
-## Surface-Specific Guidance
-
- Favor higher Completeness when the category supports the full operator-visible workflow described by taxonomy and the category note evidence.
- Lower Completeness when only the happy path exists, when important variants are undocumented or unimplemented, or when recovery/status paths are missing.
- Do not lower Completeness because tests are thin; that is Coverage.
- Do not lower Completeness because implementation quality is fragile; that is Quality.
-
-## Category Scope
-
- WSL Setup and Updates: WSL2 + Ubuntu installation, Node runtime, Linux install flow inside WSL2, WSL2 runtime boundary, WSL2 network-family requirements, Source install and build inside WSL2, openclaw update, npm/pnpm/git package-root, Managed systemd Gateway restart, Service metadata refresh, Package-manager caveats
- Gateway Service Lifecycle: Onboarded systemd install, Gateway service install, systemd user unit rendering, WSL-aware systemd unavailable hints, Doctor service repair, WSL user-service linger, Systemd availability after Windows boot, Windows startup task for WSL, Verification before Windows sign-in, Clear expectations around PC power
- Gateway Access and Exposure: Gateway token/password auth, Provider credentials, Gateway auth SecretRefs, Remote URL credential precedence, WSL virtual network, Windows portproxy setup, Windows Firewall rules, Reachable Gateway URLs, Loopback and LAN exposure, WSL2 IPv4 networking, Tailscale remote access
- Diagnostics and Repair: openclaw doctor, openclaw status, openclaw logs, SecretRef, WSL/systemd unavailable hints, Operator repair guidance after WSL2 service
- Browser and Control UI: WSL2 Gateway with Windows browser, Windows Control UI URL, Raw remote CDP to Windows Chrome, Host-local Chrome MCP, Browser profile cdpUrl, Layered diagnostics
-
-## Suggested Bands
-
- `Lovable` (95-100): complete across expected workflows, variants, and recovery branches, with only minor polish gaps.
- `Stable` (80-95): the expected workflow set is broadly present, with only bounded missing branches.
- `Beta` (70-80): the main workflow exists, but meaningful branches or recovery paths are still absent.
- `Alpha` (50-70): only a partial capability set is present; users can complete some core tasks but not the full expected workflow.
- `Experimental` (0-50): the category exposes only fragments of the intended capability.
--- a/.agents/skills/clawdtributor/SKILL.md
+++ b/.agents/skills/clawdtributor/SKILL.md
@@ -1,161 +0,0 @@
---
-name: clawdtributor
-description: "Use for OpenClaw clawtributors PR/issue triage: Discrawl discovery, live-open rechecks, deep review, topic grouping, and compact @handle/LOC/type/blast/verification summaries."
---
-
-# Clawdtributor
-
-Use for the `#clawtributors` queue: Discord-discovered OpenClaw PRs/issues that need live GitHub status plus maintainer-quality review.
-
-## Compose with other skills
-
- `$discrawl`: local Discord archive sync/search.
- `$openclaw-pr-maintainer`: live GitHub PR/issue review, duplicate search, close/land rules.
- `$gitcrawl`: related issue/PR and current-main/stale-proof search.
- `$openclaw-testing` / `$crabbox`: proof choice when a candidate needs real validation.
-
-## Archive flow
-
-Local archive first; verify freshness for current questions.
-
-```bash
-discrawl status --json
-discrawl sync
-```
-
-Resolve channel if needed:
-
-```bash
-sqlite3 "$HOME/.discrawl/discrawl.db" \
-  "select id,name from channels where name like '%clawtributor%' order by name;"
-```
-
-Current known channel id from prior work: `1458141495701012561`. Re-resolve if it stops matching.
-
-Extract recent refs:
-
-```bash
-sqlite3 "$HOME/.discrawl/discrawl.db" "
-select m.created_at, coalesce(nullif(mm.username,''), m.author_id), m.content
-from messages m
-left join members mm on mm.guild_id=m.guild_id and mm.user_id=m.author_id
-where m.channel_id='1458141495701012561'
-  and m.created_at >= '<ISO cutoff>'
-order by m.created_at desc;" |
-perl -nE 'while(m{github\.com/openclaw/openclaw/(pull|issues)/(\d+)}g){say "$1\t$2\t$_"}'
-```
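
The `perl` filter above pulls the `pull`/`issues` kind and the number out of each message. A minimal sketch of the same matching step with `grep` and `awk`, for hosts without `perl`; `extract_refs` is a hypothetical helper, and unlike the original filter it drops the message text and de-duplicates repeated refs:

```shell
# Hypothetical helper: read message text on stdin and print one
# "<kind><TAB><number>" line per unique openclaw/openclaw PR or issue URL,
# in first-seen order. kind is "pull" or "issues", as it appears in the URL.
extract_refs() {
  grep -oE 'github\.com/openclaw/openclaw/(pull|issues)/[0-9]+' |
    awk -F/ '!seen[$0]++ { print $4 "\t" $5 }'
}
```

Pipe the `sqlite3` output into `extract_refs` in place of the `perl` stage when only the ref list is needed.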
-
-Map a PR/issue back to the Discord handle:
-
-```bash
-sqlite3 -separator $'\t' "$HOME/.discrawl/discrawl.db" "
-select m.created_at,
-       coalesce(nullif(mm.username,''), nullif(mm.global_name,''), m.author_id)
-from messages m
-left join members mm on mm.guild_id=m.guild_id and mm.user_id=m.author_id
-where m.channel_id='1458141495701012561'
-  and m.content like '%github.com/openclaw/openclaw/<pull-or-issues>/<number>%'
-order by m.created_at desc
-limit 1;"
-```
-
-Show only `@handle` in the final list. Do not write the word Discord unless the user asks for source details.
-
-## Live GitHub recheck
-
-Always recheck live state before listing, closing, or saying "open".
-
-```bash
-GITHUB_TOKEN= GITHUB_TOKEN_NODIFF= GH_TOKEN= \
-gh api repos/openclaw/openclaw/pulls/<number> \
-  --jq '. | {number,title,state,merged,mergeable,draft,author:.user.login,url:.html_url,updatedAt:.updated_at,additions,deletions,changedFiles:.changed_files}'
-```
-
-For issues:
-
-```bash
-GITHUB_TOKEN= GITHUB_TOKEN_NODIFF= GH_TOKEN= \
-gh api repos/openclaw/openclaw/issues/<number> \
-  --jq '. | {number,title,state,author:.user.login,url:.html_url,updatedAt:.updated_at,pull_request}'
-```
-
-If `gh` says bad credentials, clear env vars with empty assignments as above. Use `--jq '. | {...}'` for object projections.
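
Note that web URLs use `pull/<number>` while the API calls above use `pulls/<number>`. A small sketch of that mapping, assuming the kind string comes straight from the URL; `api_path` is a hypothetical helper name:

```shell
# Hypothetical helper: turn a URL kind ("pull" or "issues") and a number
# into the REST path used with `gh api` in the commands above.
api_path() {
  kind=$1 number=$2
  case $number in
    ''|*[!0-9]*) return 1 ;;   # require a plain numeric id
  esac
  case $kind in
    pull)   echo "repos/openclaw/openclaw/pulls/$number" ;;
    issues) echo "repos/openclaw/openclaw/issues/$number" ;;
    *)      return 1 ;;
  esac
}
```

For example, `gh api "$(api_path pull <number>)"` with the same cleared token variables and `--jq` projection as above.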
-
-## Review depth
-
-For each open item, inspect enough to classify risk:
-
- PR body, linked issue, comments, files, additions/deletions, checks.
- Current `origin/main` code path and adjacent tests.
- Related threads with `gitcrawl neighbors/search`.
- Whether main already fixed it, the PR is obsolete, or the idea is invalid.
- Blast radius: touched runtime surfaces, config/schema, plugin/core boundary, user-visible behavior, release/package surface.
- Verification: say if local unit/docs proof is enough, live/provider proof is needed, or it is not directly verifiable.
-
-Do not close from title alone. If closing as done on main or nonsensical, prove it against current main and comment first when mutation is requested. Bulk close/reopen of more than 5 items requires explicit scope.
-
-## Candidate selection
-
-When asked for `5 new`, exclude refs already surfaced in the session and refill from the archive until there are 5 live-open candidates. If fewer than 5 remain open, list all open ones and say how many are missing.
-
-When asked to `update`, `refresh`, `recheck`, `check again`, or similar, return an updated live-open candidate list. Sort by maintainer importance, not recency: high-impact ready fixes first, then useful-but-review-first, then open/not-ready items. Do not include a "changed since last pass" section or bottom-line merged/closed summary unless the user explicitly asks for churn.
-
-Prefer:
-
- Fresh, open, external contributor work.
- Small, high-confidence bugfixes.
- Clear repro, tests, or obvious code-path proof.
-
-Demote:
-
- Broad product/features without owner decision.
- Large rewrites with unclear contract.
- PRs already in progress, merged, closed, duplicate, or fixed on main.
-
-## Topic grouping
-
-Group only when useful or requested:
-
- Agents/tooling
- Providers/auth/models
- Channels/messaging
- UI/web
- Gateway/protocol/runtime
- Config/memory/cache
- Docker/install/release
- Docs/tests/chore
- Closed/obsolete
-
-Infer topic from labels, touched files, title/body, and actual code path.
-
-## Output format
-
-No Markdown tables. Compact bullets. Use color/risk markers:
-
- 🟢 low/narrow
- 🟡 medium or needs targeted proof
- 🔴 broad/high runtime risk
- 🟣 security/policy/owner-boundary slow review
- ✅ merged
- ⚪ closed unmerged
-
-Required line shape:
-
-```markdown
- **PR #81244** `@whatsskill.` `+118/-1` `bug` 🟢 https://github.com/openclaw/openclaw/pull/81244 - Prevents chat action buttons from overlapping short assistant replies. Verifiable: yes. Blast: web chat rendering, low.
- **Issue #81245** `@alice` `LOC n/a` `bug` 🟡 https://github.com/openclaw/openclaw/issues/81245 - Reports duplicate Telegram replies when reconnecting after gateway restart. Verifiable: partial. Blast: Telegram channel runtime, medium.
-```
-
-Rules:
-
- Bold the `PR #n` or `Issue #n` marker.
- Use `@handle`, not author bio text.
- Always include the full GitHub URL.
- Include a one-line description after the URL, separated with `-`.
- PR LOC is `+additions/-deletions`; issue LOC is `LOC n/a`.
- Type: `bug`, `feature`, `perf`, `security`, `docs`, `test`, `chore`, or `refactor`.
- Write a full sentence for what it does.
- Always include blast radius in one phrase.
 Always include `Verifiable: yes|partial|no` plus the shortest proof hint when helpful.
- If status is not open, still show it only when the user asked for all surfaced refs; use ✅ or ⚪ and state merged/closed.
- For refresh-style asks, prefer section order: `Best Open Now`, `Useful But Review First`, `Still Open / Not Ready`. Omit merged/closed churn by default.
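
The required line shape and rules above can be sketched as a single formatter. This is an illustration only; `format_item` and its argument order are hypothetical, and the kind argument is assumed to be `pull` or `issues` as it appears in the GitHub URL:

```shell
# Hypothetical helper: print one summary bullet in the required line shape.
# Args: kind number handle loc type marker summary verifiable blast
#   kind is "pull" or "issues"; summary is a full sentence ending in a period.
format_item() {
  kind=$1 number=$2 handle=$3 loc=$4 type=$5 marker=$6
  summary=$7 verifiable=$8 blast=$9
  case $kind in
    pull)   label="PR" ;;
    issues) label="Issue"; loc="LOC n/a" ;;   # issues never carry a LOC count
    *)      return 1 ;;
  esac
  # The leading "-" is passed as an argument so printf cannot read it as an option.
  printf '%s **%s #%s** `@%s` `%s` `%s` %s https://github.com/openclaw/openclaw/%s/%s - %s Verifiable: %s. Blast: %s.\n' \
    '-' "$label" "$number" "$handle" "$loc" "$type" "$marker" \
    "$kind" "$number" "$summary" "$verifiable" "$blast"
}
```

Keeping the shape in one place makes it harder to drop the URL, blast radius, or verifiability field from an entry.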
--- a/.agents/skills/clawsweeper/SKILL.md
+++ b/.agents/skills/clawsweeper/SKILL.md
@@ -1,340 +0,0 @@
---
-name: clawsweeper
-description: "Use for all ClawSweeper work: OpenClaw issue/PR sweep reports, commit-review reports, repair jobs, cloud fix PRs, @clawsweeper maintainer mention commands, trusted ClawSweeper-reviewed autofix/automerge, GitHub Actions monitoring, permissions, gates, and manual backfills."
---
-
-# ClawSweeper
-
-ClawSweeper lives at `~/Projects/clawsweeper`. It is the one OpenClaw
-maintenance bot for sweeping, commit review, repair jobs, and guarded fix PRs.
-Use this skill whenever asked about reports, findings, dispatch health,
-repair/cloud PR creation, comment commands, automerge, permissions, or gates.
-
-## Start
-
-```bash
-cd ~/Projects/clawsweeper
-git status --short --branch
-git pull --ff-only
-pnpm run build:all
-```
-
-Do not overwrite unrelated edits. If the tree is dirty, inspect first and keep
-read-only report work read-only unless the requester asked to commit.
-
-## One Bot, One App
-
-Use the ClawSweeper repo and the `clawsweeper` GitHub App. Use only
-`CLAWSWEEPER_*` configuration for this automation. Do not use legacy apps,
-variables, labels, or skills.
-
-Required app setup:
-
- `CLAWSWEEPER_APP_CLIENT_ID`: public app client ID for `clawsweeper`.
- `CLAWSWEEPER_APP_PRIVATE_KEY`: private key used only inside
-  `actions/create-github-app-token` steps.
- Target app permissions: read target scan context; write issues and pull
-  requests; contents write for report commits, repair branches, and workflow
-  inputs; Actions write on `openclaw/clawsweeper` for comment-router
-  re-review dispatch, workflow dispatch, run cancellation, and self-heal;
-  optional Checks write for commit Check Runs.
-
-Token boundary:
-
- Codex workers do not get mutation credentials.
- Review workers run with stripped secret/token env.
- Deterministic scripts own comments, labels, branch pushes, PR creation,
-  closes, and merges through short-lived GitHub App tokens.
- Merge and write gates default closed.
-
-## Commit Reports
-
-Canonical commit reports:
-
-```text
-records/<repo-slug>/commits/<40-char-sha>.md
-```
-
-Use the lister:
-
-```bash
-pnpm commit-reports -- --since 6h
-pnpm commit-reports -- --since "24 hours ago" --findings
-pnpm commit-reports -- --since 7d --non-clean
-pnpm commit-reports -- --repo openclaw/openclaw --author steipete --since 7d
-pnpm commit-reports -- --since 24h --json
-```
-
-Results: `nothing_found`, `findings`, `inconclusive`, `failed`,
-`skipped_non_code`. One report per SHA; reruns overwrite the SHA-named report.
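
Because each report is keyed by the full SHA, the report path can be computed directly. A minimal sketch, assuming the repo slug is the `owner/name` string with `/` replaced by `-` (inferred from the `records/openclaw-openclaw/...` path used elsewhere in this skill) and that SHAs are lowercase hex; `commit_report_path` is a hypothetical helper:

```shell
# Hypothetical helper: print the canonical report path for a commit.
# Assumptions: slug = owner/name with "/" replaced by "-"; SHA is 40 lowercase hex chars.
commit_report_path() {
  repo=$1 sha=$2
  # Refuse abbreviated or malformed SHAs; reports are named by the full SHA.
  printf '%s\n' "$sha" | grep -Eq '^[0-9a-f]{40}$' || return 1
  slug=$(printf '%s' "$repo" | tr '/' '-')
  printf 'records/%s/commits/%s.md\n' "$slug" "$sha"
}
```

Resolve short SHAs with `git rev-parse` before calling it, since the helper rejects anything shorter than 40 characters.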
-
-Manual rerun/backfill:
-
-```bash
-gh workflow run commit-review.yml --repo openclaw/clawsweeper \
-  -f target_repo=openclaw/openclaw \
-  -f commit_sha=<end-sha> \
-  -f before_sha=<start-or-parent-sha> \
-  -f create_checks=false \
-  -f enabled=true
-```
-
-Use `create_checks=true` only when the requester explicitly wants target commit Check
-Runs. Add `-f additional_prompt="..."` for focused one-off review instructions.
-
-## Sweep Reports
-
-Issue/PR reports live at:
-
-```text
-records/<repo-slug>/items/<number>.md
-records/<repo-slug>/closed/<number>.md
-```
-
-Lead with counts, concrete findings, and report links. Do not post unsolicited
-GitHub comments from report-reading work. Public surfaces are markdown reports,
-durable ClawSweeper review comments, and optional checks.
-
-PR reports include Codex `/review`-style `reviewFindings` with priority,
-confidence, repository-relative file, and line range. Public PR comments show a
-short `Review findings:` list when findings exist; full review comments,
-evidence links, likely owners, and runtime details stay inside the collapsed
-`Review details` block.
-
-Useful commands:
-
-```bash
-pnpm run status
-pnpm run audit
-pnpm run reconcile
-pnpm run apply-decisions -- --dry-run
-```
-
-## Create One Repair Job
-
-Create a job from issue/PR refs and a maintainer prompt:
-
-```bash
-pnpm run repair:create-job -- \
-  --repo openclaw/openclaw \
-  --refs 123,456 \
-  --prompt-file /tmp/clawsweeper-prompt.md
-```
-
-Create from an existing ClawSweeper report:
-
-```bash
-pnpm run repair:create-job -- \
-  --from-report ../clawsweeper/records/openclaw-openclaw/items/123.md
-```
-
-The job creator checks for an existing open PR, body match, or remote
-`clawsweeper/<cluster-id>` branch before writing another job. Use `--dry-run`
-to inspect. Use `--force` only after deciding the duplicate guard is stale.
-
-Validate, commit, then dispatch:
-
-```bash
-pnpm run repair:validate-job -- jobs/openclaw/inbox/clawsweeper-openclaw-openclaw-123.md
-pnpm run repair:dispatch -- jobs/openclaw/inbox/clawsweeper-openclaw-openclaw-123.md \
-  --mode autonomous \
-  --runner blacksmith-4vcpu-ubuntu-2404 \
-  --execution-runner blacksmith-16vcpu-ubuntu-2404 \
-  --model gpt-5.5
-```
-
-Do not dispatch a just-created job before the job file is committed and pushed;
-the workflow reads the job path from GitHub.
-
-## Replacement PRs
-
-For a useful but uneditable/stale/unsafe source PR, make the maintainer prompt
-explicit:
-
-```md
-Treat #123 as useful source work. If the source branch cannot be safely updated
-because it is uneditable, stale, draft-only, unmergeable, or unsafe, create a
-narrow ClawSweeper replacement PR instead of waiting. Preserve the source PR
-author as co-author, credit the source PR in the replacement PR body, and close
-only that source PR after the replacement PR is opened.
-```
-
-The worker should emit `repair_strategy=replace_uneditable_branch` and list the
-source PR URL in `source_prs`. The deterministic executor opens or updates
-`clawsweeper/<cluster-id>`, adds non-bot source authors as `Co-authored-by`
-trailers, and closes superseded source PRs only after replacement exists.
-
-## Gates
-
-Open execution windows intentionally and close them after the run:
-
-```bash
-gh variable set CLAWSWEEPER_ALLOW_EXECUTE --repo openclaw/clawsweeper --body 1
-gh variable set CLAWSWEEPER_ALLOW_FIX_PR --repo openclaw/clawsweeper --body 1
-gh variable set CLAWSWEEPER_ALLOW_MERGE --repo openclaw/clawsweeper --body 1
-gh variable set CLAWSWEEPER_ALLOW_AUTOMERGE --repo openclaw/clawsweeper --body 1
-```
-
-Reset gates only when explicitly requested; the active maintainer window may intentionally
-leave them at `1`.
-
-Important gates:
-
- `CLAWSWEEPER_ALLOW_EXECUTE`: allows deterministic write lanes.
- `CLAWSWEEPER_ALLOW_FIX_PR`: allows branch repair/replacement PRs.
- `CLAWSWEEPER_ALLOW_MERGE`: allows merge-capable applicators.
- `CLAWSWEEPER_ALLOW_AUTOMERGE`: allows comment-router automerge.
- `CLAWSWEEPER_COMMENT_ROUTER_EXECUTE`: lets scheduled comment routing
-  post replies and dispatch repair.
-
-## Maintainer Mentions
-
-Prefer `@clawsweeper` comments for all maintainer-facing control. Slash
-commands still parse as compatibility aliases, but examples and live guidance
-should use mentions.
-
-```text
-@clawsweeper status
-@clawsweeper re-review
-@clawsweeper review
-@clawsweeper fix ci
-@clawsweeper address review
-@clawsweeper rebase
-@clawsweeper autofix
-@clawsweeper automerge
-@clawsweeper approve
-@clawsweeper explain
-@clawsweeper stop
-@clawsweeper <question or safe action request>
-@clawsweeper[bot] re-review
-@openclaw-clawsweeper fix ci
-@openclaw-clawsweeper[bot] fix ci
-```
-
-Accepted aliases: `review`, `re-review`, `rereview`, `review again`,
-`rerun review`, and `run review`. `review` and `re-review` dispatch a fresh
-ClawSweeper issue/PR review without starting repair. `fix ci`,
-`address review`, and `rebase` dispatch the
-repair worker only for ClawSweeper PRs or PRs opted into
-`clawsweeper:autofix` or `clawsweeper:automerge`. `autofix` runs the bounded
-review/fix loop without merging. `automerge` runs the bounded review/fix/merge
-loop, but draft PRs stay fix-only until GitHub marks them ready for review.
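
As an illustration of the alias handling described above, here is a minimal bash sketch. The function name `normalize_command`, the case-insensitive matching, the choice to fold every review alias into a single `review` action, and the `assist` label for freeform text are assumptions made for this example; the actual router may parse mentions differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map the text that follows an "@clawsweeper" mention
# to a canonical command name. Not the real router implementation.
normalize_command() {
  local text
  # Lowercase and trim surrounding whitespace (assumed behavior).
  text="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')"
  case "$text" in
    # Accepted review aliases listed in the paragraph above.
    review|re-review|rereview|"review again"|"rerun review"|"run review")
      echo "review" ;;
    # Structured commands pass through unchanged.
    status|"fix ci"|"address review"|rebase|autofix|automerge|approve|explain|stop)
      echo "$text" ;;
    # Anything else is treated as a freeform, read-only assist request.
    *)
      echo "assist" ;;
  esac
}
```

For example, `normalize_command "Rerun Review"` prints `review`, while a question such as `why did automerge stop?` falls through to `assist`.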
-
-Freeform maintainer mentions such as `@clawsweeper why did automerge stop?`
-or `@clawsweeper: can you explain this failure?` dispatch a read-only assist
-review with the mention text as one-off instructions. The answer lands in the
-next public ClawSweeper review comment. Action-looking prose does not directly
-mutate GitHub; it must map to existing structured recommendations and pass the
-normal deterministic gates.
-
-Default accepted maintainers: `OWNER`, `MEMBER`, `COLLABORATOR`; fallback
-repository permission accepts `admin`, `maintain`, or `write`. Contributor
-comments are ignored without a reply.
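
The authorization rule above can be sketched as a small predicate. This is an illustrative sketch only: the function name `is_accepted_maintainer` and its argument order are hypothetical, and it assumes the caller has already looked up the comment's GitHub `author_association` and the author's repository permission.

```bash
#!/usr/bin/env bash
# Hypothetical check: succeed (exit 0) when the commenter counts as a maintainer.
#   $1 = GitHub author_association of the comment (e.g. OWNER, CONTRIBUTOR)
#   $2 = repository permission of the author (optional fallback)
is_accepted_maintainer() {
  local association="$1" permission="${2:-}"
  # Default accepted associations.
  case "$association" in
    OWNER|MEMBER|COLLABORATOR) return 0 ;;
  esac
  # Fallback: repository permission level.
  case "$permission" in
    admin|maintain|write) return 0 ;;
  esac
  # Everyone else is ignored without a reply.
  return 1
}
```

A caller might write `is_accepted_maintainer "$assoc" "$perm" || exit 0` to skip contributor comments silently.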
-
-Run router manually:
-
-```bash
-pnpm run repair:comment-router -- --repo openclaw/openclaw --lookback-minutes 180
-pnpm run repair:comment-router -- --repo openclaw/openclaw --execute --wait-for-capacity
-```
-
-Scheduled routing stays dry unless
-`CLAWSWEEPER_COMMENT_ROUTER_EXECUTE=1`.
-
-## Trusted Autofix And Automerge
-
-`@clawsweeper autofix` opts an existing PR into the bounded review/fix loop.
-`@clawsweeper automerge` opts an existing PR into the bounded review/fix/merge
-loop. The router:
-
- verifies maintainer authorization;
- labels the PR `clawsweeper:autofix` or `clawsweeper:automerge`;
- dispatches ClawSweeper review for the current head SHA;
- creates or reuses a durable adopted job;
- repairs at most the configured caps;
- never merges autofix PRs or draft PRs;
- merges automerge PRs only when ClawSweeper passed the exact current head,
-  checks are green, GitHub says mergeable, no human-review label is present,
-  the PR is not draft, and both merge gates are open.
-
-Missing changelog is not a review finding or merge blocker. If repairing a user-facing change, add/update changelog automatically when practical; never ask or block solely on it.
-
-If ClawSweeper passes while merge gates are closed, it labels
-`clawsweeper:merge-ready` and comments instead of merging. `@clawsweeper stop`
-adds `clawsweeper:human-review`.
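
The merge conditions above can be read as one decision function. The sketch below is a hedged illustration, not the router's code: the name `automerge_decision`, the positional arguments, and the output labels are invented for the example. It also assumes that "both merge gates" means `CLAWSWEEPER_ALLOW_MERGE` and `CLAWSWEEPER_ALLOW_AUTOMERGE`, and that `merge-ready` is reported only when every non-gate condition already holds.

```bash
#!/usr/bin/env bash
# Hypothetical decision helper. Prints one of: merge, merge-ready, no-merge.
#   $1 mode            autofix | automerge
#   $2 reviewed_sha    head SHA that ClawSweeper last passed
#   $3 head_sha        current PR head SHA
#   $4 checks          green | anything else
#   $5 mergeable       true | false (as reported by GitHub)
#   $6 human_review    true when a human-review label is present
#   $7 draft           true | false
#   $8 allow_merge     value of CLAWSWEEPER_ALLOW_MERGE (assumed gate)
#   $9 allow_automerge value of CLAWSWEEPER_ALLOW_AUTOMERGE (assumed gate)
automerge_decision() {
  local mode="$1" reviewed_sha="$2" head_sha="$3" checks="$4" mergeable="$5"
  local human_review="$6" draft="$7" allow_merge="$8" allow_automerge="$9"

  # Autofix PRs and draft PRs are never merged.
  [ "$mode" = "automerge" ] || { echo "no-merge"; return; }
  [ "$draft" = "false" ] || { echo "no-merge"; return; }
  # A human-review label blocks merging.
  [ "$human_review" = "false" ] || { echo "no-merge"; return; }
  # The passing review must cover the exact current head.
  [ "$reviewed_sha" = "$head_sha" ] || { echo "no-merge"; return; }
  [ "$checks" = "green" ] && [ "$mergeable" = "true" ] || { echo "no-merge"; return; }

  # With gates closed, label merge-ready and comment instead of merging.
  if [ "$allow_merge" = "1" ] && [ "$allow_automerge" = "1" ]; then
    echo "merge"
  else
    echo "merge-ready"
  fi
}
```

Keeping the gate check last mirrors the text: a clean PR with closed gates ends up labeled rather than merged.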
-
-When asked to create a PR and enable ClawSweeper automerge, do not
-leave the local OpenClaw checkout on the PR branch. After the PR is created,
-pushed, and the `@clawsweeper automerge` request is posted or otherwise
-confirmed, return the local checkout to `main` and fast-forward it when the
-working tree is clean:
-
-```bash
-git switch main
-git pull --ff-only
-```
-
-If unrelated local edits or an in-progress rebase prevent switching, report the
-blocker instead of stashing, deleting, or overwriting work.
-
-Repair caps:
-
-```bash
-CLAWSWEEPER_MAX_REPAIRS_PER_PR=10
-CLAWSWEEPER_MAX_REPAIRS_PER_HEAD=1
-```
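
One plausible reading of the two caps is that a repair runs only while both counters are below their limits. The sketch below assumes that interpretation; `repair_allowed` and its counter arguments are hypothetical, and the real executor may count or enforce repairs differently.

```bash
#!/usr/bin/env bash
# Defaults mirror the values shown above; environment overrides are respected.
CLAWSWEEPER_MAX_REPAIRS_PER_PR="${CLAWSWEEPER_MAX_REPAIRS_PER_PR:-10}"
CLAWSWEEPER_MAX_REPAIRS_PER_HEAD="${CLAWSWEEPER_MAX_REPAIRS_PER_HEAD:-1}"

# Hypothetical check: succeed when another repair is still within both caps.
#   $1 = repairs already performed on this PR
#   $2 = repairs already performed on the current head SHA
repair_allowed() {
  local repairs_on_pr="$1" repairs_on_head="$2"
  [ "$repairs_on_pr" -lt "$CLAWSWEEPER_MAX_REPAIRS_PER_PR" ] &&
    [ "$repairs_on_head" -lt "$CLAWSWEEPER_MAX_REPAIRS_PER_HEAD" ]
}
```

Under these defaults, a head that has already been repaired once must change before another repair is attempted.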
-
-## Security Boundary
-
-Do not stage unapproved security-sensitive work for ClawSweeper Repair. Route
-vulnerability reports, CVE/GHSA/advisory work, leaked secrets/tokens/keys,
-plaintext secret storage, SSRF, XSS, CSRF, RCE, auth bypass, privilege
-escalation, and sensitive data exposure to central OpenClaw security handling.
-
-For PRs explicitly opted into `clawsweeper:autofix` or
-`clawsweeper:automerge`, security-sensitive review findings may dispatch
-bounded repair, but merge remains blocked until a later exact-head review is
-clean and the normal merge gates pass. Trust deterministic ClawSweeper security
-markers, labels, and job frontmatter; do not infer security handling from vague
-prose.
-
-## Monitoring
-
-Receiver workflows:
-
-```bash
-gh run list --repo openclaw/clawsweeper --workflow "ClawSweeper Commit Review" \
-  --limit 12 --json databaseId,displayTitle,event,status,conclusion,createdAt,updatedAt,url
-gh run list --repo openclaw/clawsweeper --workflow "repair cluster worker" \
-  --limit 12 --json databaseId,displayTitle,event,status,conclusion,createdAt,updatedAt,url
-gh run list --repo openclaw/clawsweeper --workflow "repair comment router" \
-  --limit 12 --json databaseId,displayTitle,event,status,conclusion,createdAt,updatedAt,url
-```
-
-Target dispatcher:
-
-```bash
-gh run list --repo openclaw/openclaw --workflow "ClawSweeper Dispatch" \
-  --event push --limit 8 --json databaseId,displayTitle,event,status,conclusion,headSha,url
-```
-
-Target commit check:
-
-```bash
-gh api "repos/openclaw/openclaw/commits/<sha>/check-runs?per_page=100" \
-  --jq '.check_runs[] | select(.name=="ClawSweeper Commit Review") | [.status,.conclusion,.details_url] | @tsv'
-```
-
-## Reading Output
-
-For findings or failures, summarize:
-
- target repo, item/PR/commit, run, report path
- result, confidence, severity, and exact blocker
- affected files or cluster refs
- validation commands and whether they passed
- whether mutation gates were open or closed
- next deterministic action
-
-Keep the broom small: one cluster, one branch, one PR, narrow proof, clear
-owner-visible evidence.
--- a/.agents/skills/clawsweeper/agents/openai.yaml
+++ b/.agents/skills/clawsweeper/agents/openai.yaml
@@ -1,4 +0,0 @@
-interface:
-  display_name: "ClawSweeper"
-  short_description: "Inspect ClawSweeper commit review reports and Actions runs."
-  default_prompt: "Review recent ClawSweeper commit reports and summarize findings."
--- a/.agents/skills/control-ui-e2e/SKILL.md
+++ b/.agents/skills/control-ui-e2e/SKILL.md
@@ -1,74 +0,0 @@
---
-name: control-ui-e2e
-description: Use when testing, fixing, or extending the OpenClaw Control UI GUI with Vitest + Playwright end-to-end checks, mocked Gateway WebSocket flows, mocked dashboard runs, screenshots/videos, or agent-verifiable browser proof.
---
-
-# Control UI E2E
-
-Use this for Control UI changes that need a real browser flow with deterministic Gateway data.
-
-## Test Shape
-
- Use `ui/src/**/*.e2e.test.ts` for full GUI flows.
- Use `ui/src/test-helpers/control-ui-e2e.ts` to start the Vite Control UI and install a mocked Gateway WebSocket.
- Keep scenarios deterministic. Do not use live provider keys, real channel credentials, or a real Gateway unless the user explicitly asks for live proof.
- Prefer existing `.browser.test.ts` or unit tests for narrow rendering logic; use this E2E lane when the proof should cover routing, app boot, Gateway handshake, requests, and visible UI behavior together.
-
-## Commands
-
- Target one E2E test in a Codex worktree:
-
-```bash
-node scripts/run-vitest.mjs run --config test/vitest/vitest.ui-e2e.config.ts --configLoader runner ui/src/ui/e2e/chat-flow.e2e.test.ts
-```
-
- Run the whole local lane in a normal checkout:
-
-```bash
-pnpm test:ui:e2e
-```
-
-If dependencies are missing in a Codex worktree, install once with `pnpm install`; for broad GUI proof or dependency-heavy checks, use Testbox/Crabbox instead of running a wide local pnpm lane.
-
-## Visual Proof Default
-
-When running mocked Control UI/dashboard validation for a user-facing feature, produce visual proof by default unless the user explicitly opts out.
-
- Keep the Vitest E2E assertions deterministic; do not commit generated screenshots or videos.
- After or alongside the focused E2E test, run the mocked Control UI app when available, for example `pnpm dev:ui:mock -- --port <port>`.
- Drive Chromium with Playwright against the local mock URL and capture a video plus screenshots for each meaningful state: initial view, interaction input, result state, and final/paginated/selected state.
- Use `browser.newContext({ recordVideo: { dir, size }, viewport })`, `page.screenshot({ path })`, and close the context before reporting the video path.
- Put artifacts under `.artifacts/control-ui-e2e/<short-feature-name>/` or another clearly named local temp directory, and report the absolute paths in the final answer.
- Treat recording as validation, not only demo capture. If the recorder fails or shows surprising behavior, stop, fix the behavior, add or update a regression test, then rerecord.
- If visual proof is blocked, state the exact blocker and still report the textual E2E evidence.
-
-## Mock Pattern
-
-Start the app server, install the mock before `page.goto`, then assert both Gateway traffic and visible UI:
-
-```ts
-const server = await startControlUiE2eServer();
-const page = await context.newPage();
-const gateway = await installMockGateway(page, {
-  historyMessages: [{ role: "assistant", content: [{ type: "text", text: "Ready." }] }],
-});
-
-await page.goto(`${server.baseUrl}chat`);
-await page.locator(".agent-chat__composer-combobox textarea").fill("hello");
-await page.getByRole("button", { name: "Send message" }).click();
-
-const request = await gateway.waitForRequest("chat.send");
-await gateway.emitChatFinal({ runId: String(request.params.idempotencyKey), text: "Done." });
-await page.getByText("Done.").waitFor();
-```
-
-Extend `installMockGateway` with typed scenario options or method responses when a new flow needs more Gateway surface.
-
-## Standalone Recording
-
-When recording an already-running mocked Control UI URL, use a temporary Playwright script or `playwright test` spec and keep the recording flow focused:
-
- Open the mock URL, interact through stable `data-*` selectors or user-facing role selectors, and wait on asserted states instead of relying on fixed sleeps.
- Assert both visible UI state and mocked Gateway traffic for request-driven flows. For example, verify the expected count/row is visible and that `sessions.list` was called with the expected `search`, `offset`, and `limit`.
- Use short sleeps only after assertions to make the captured video readable.
- Store the generated video under `.artifacts/control-ui-e2e/<feature>/`; do not commit it.
--- a/.agents/skills/control-ui-e2e/agents/openai.yaml
+++ b/.agents/skills/control-ui-e2e/agents/openai.yaml
@@ -1,4 +0,0 @@
-interface:
-  display_name: "Control UI E2E"
-  short_description: "Mocked browser E2E for Control UI"
-  default_prompt: "Use $control-ui-e2e to verify a Control UI change with the mocked Vitest + Playwright browser lane."
--- a/.agents/skills/crabbox/SKILL.md
+++ b/.agents/skills/crabbox/SKILL.md
@@ -1,740 +0,0 @@
---
-name: crabbox
-description: Use the Crabbox wrapper for OpenClaw remote validation across Linux, macOS, Windows, and WSL2, including delegated Blacksmith Testbox proof. Report the actual provider and id.
---
-
-# Crabbox
-
-Use the Crabbox wrapper when OpenClaw needs remote Linux proof for broad tests,
-CI-parity checks, secrets, hosted services, Docker/E2E/package lanes, warmed
-reusable boxes, sync timing, logs/results, cache inspection, or lease cleanup.
-
-Crabbox is the transport/orchestration surface. The actual backend can be:
-
- brokered AWS Crabbox: direct provider, `provider=aws`, lease ids like
-  `cbx_...`, `syncDelegated=false`
- Blacksmith Testbox through Crabbox: delegated provider,
-  `provider=blacksmith-testbox`, ids like `tbx_...`, `syncDelegated=true`
-
-For OpenClaw maintainer broad `pnpm` gates, Blacksmith Testbox through the
-Crabbox wrapper is acceptable and often preferred when the standing Testbox
-rules apply. Do not describe those runs as "AWS Crabbox"; report them as
-Testbox-through-Crabbox with the `tbx_...` id and Actions run.
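
To make the reporting rule concrete, here is a minimal sketch that classifies a lease id by its prefix. The function name `describe_lease` and the output format are assumptions for illustration; only the `cbx_`/`tbx_` prefixes and the `syncDelegated` values come from the text above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: describe a lease id so reports name the real backend.
describe_lease() {
  case "$1" in
    # Direct-provider lease (brokered AWS in the repo default).
    cbx_*) echo "direct syncDelegated=false id=$1" ;;
    # Blacksmith Testbox reached through the Crabbox wrapper.
    tbx_*) echo "delegated provider=blacksmith-testbox syncDelegated=true id=$1" ;;
    *)     echo "unknown id=$1" ;;
  esac
}
```

An id that matches neither prefix is reported as `unknown` rather than guessed, which keeps the "report the actual provider and id" rule honest.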
-
-Use the repo `.crabbox.yaml` brokered AWS path when the task specifically needs
-direct AWS Crabbox behavior, persistent direct-provider leases, `--fresh-pr`,
-`--full-resync`, environment forwarding, capture/download support, or provider
-comparison. Use `--provider blacksmith-testbox` when the task needs OpenClaw
-maintainer Testbox proof, prepared CI environment, broad/heavy pnpm gates, or
-the user asks for Testbox/Blacksmith.
-
-## First Checks
-
- Run from the repo root. Crabbox sync mirrors the current checkout.
- Check the wrapper and providers before remote work:
-
-```sh
-command -v crabbox
-../crabbox/bin/crabbox --version
-pnpm crabbox:run -- --help | sed -n '1,120p'
-../crabbox/bin/crabbox desktop launch --help
-../crabbox/bin/crabbox webvnc --help
-```
-
- OpenClaw scripts prefer `../crabbox/bin/crabbox` when present. The user PATH
-  shim can be stale.
- Check `.crabbox.yaml` for direct-provider defaults. Omitting `--provider`
-  means brokered AWS for normal Linux/macOS paths; the wrapper selects Azure
-  for unqualified Windows/WSL2 runs when the local Crabbox binary advertises
-  Azure.
- The brokered AWS default is a Linux developer image in `eu-west-1`; the repo
-  config pins hot `eu-west-1a/b/c` placement so Fast Snapshot Restore can apply.
-  If warmup drifts well past the minute-scale path, verify image promotion,
-  region/AZ placement, and FSR state before blaming OpenClaw.
- For broad OpenClaw maintainer `pnpm` gates, prefer the repo wrapper with
-  `--provider blacksmith-testbox` or the repo Testbox helpers when the standing
-  Testbox policy applies.
- Cold Testbox acquisition and hydration often take tens of seconds. When broad
-  remote proof is likely, immediately start
-  `node scripts/crabbox-wrapper.mjs warmup --provider blacksmith-testbox --keep --timing-json`
-  in a background command session while inspecting, editing, and running
-  focused local tests. Poll later, reuse the returned `tbx_...` with
-  `--provider blacksmith-testbox --id <tbx_id>`, and stop it before handoff.
-  Do not warm speculatively when remote proof is unlikely.
- Always report the actual provider and id. `cbx_...` means AWS Crabbox;
-  `tbx_...` means Blacksmith Testbox through Crabbox. If plain
-  `blacksmith testbox list` shows no box, run `blacksmith testbox list --all`
-  before concluding none exists.
- If a warm direct-provider lease smells stale, retry with `--full-resync`
-  (alias `--fresh-sync`) before replacing the lease. This resets the remote
-  workdir, skips the fingerprint fast path, reseeds Git when possible, and
-  uploads the checkout from scratch.
- For live/provider bugs, use the configured secret workflow before downgrading
-  to mocks. Copy only the exact needed key into the remote process environment
-  for that one command. Do not print it, do not sync it as a repo file, and do
-  not leave it in remote shell history or logs. If no secret-safe injection path
-  is available, say true live provider auth is blocked instead of silently using
-  a fake key.
- Prefer local targeted tests for tight edit loops. Broad gates belong remote.
- Do not treat inherited shell env as operator intent. In particular,
-  `OPENCLAW_LOCAL_CHECK_MODE=throttled` from the local shell is not permission
-  to move broad `pnpm check:changed`, `pnpm test:changed`, full `pnpm test`, or
-  lint/typecheck fan-out onto the laptop.
- Only use `OPENCLAW_LOCAL_CHECK_MODE=throttled|full` when the user explicitly
-  asks for local proof in the current task. If Testbox is queued or capacity is
-  constrained, report the blocker and keep only targeted local edit-loop checks
-  running.
-
-## macOS And Windows Targets
-
-Use these only when the task needs an existing non-Linux host. OpenClaw broad
-Linux validation uses the repo Crabbox config unless a provider is explicitly
-requested.
-
-Native brokered Windows is available for Windows-specific proof. Prefer Azure
-for Windows/WSL2 when the subscription has quota or credits and the local
-Crabbox binary advertises Azure. Keep broad Linux gates on Linux/Testbox unless
-the bug is Windows-specific, and only force AWS when the operator asks for the
-older AWS developer image/cache path or Azure is unavailable:
-
-```sh
-pnpm crabbox:warmup -- \
-  --target windows \
-  --windows-mode wsl2 \
-  --timing-json
-```
-
-The hydrate workflow assumes Docker should already be baked into Linux images
-and only installs it as a fallback. Do not add per-run Docker installs to proof
-commands unless the image probe shows Docker is actually missing.
-
-When the user explicitly asks for brokered macOS runners, use Crabbox AWS
-macOS only after confirming the deployed coordinator supports EC2 Mac host
-lifecycle/image routes and the operator has AWS EC2 Mac Dedicated Host quota
-and IAM. Prefer `CRABBOX_HOST_ID` for a known Crabbox-managed Dedicated Host,
-or run the no-spend preflight first:
-
-```sh
-crabbox admin hosts quota --provider aws --target macos --region eu-west-1 --type mac2.metal --json
-crabbox admin hosts allocate --provider aws --target macos --region eu-west-1 --type mac2.metal --dry-run --json
-CRABBOX_MACOS_TYPES=all scripts/macos-host-region-preflight.sh
-```
-
-Do not silently substitute AWS macOS for normal OpenClaw Linux proof. Report
-paid-host blockers as quota, IAM, coordinator deployment, or host availability
-instead of falling back to local macOS.
-
-Crabbox supports static SSH targets:
-
-```sh
-../crabbox/bin/crabbox run --provider ssh --target macos --static-host mac-studio.local -- xcodebuild test
-../crabbox/bin/crabbox run --provider ssh --target windows --windows-mode normal --static-host win-dev.local -- pwsh -NoProfile -Command "dotnet test"
-../crabbox/bin/crabbox run --provider ssh --target windows --windows-mode wsl2 --static-host win-dev.local -- pnpm test
-```
-
- `target=macos` and `target=windows --windows-mode wsl2` use the POSIX SSH,
-  bash, Git, rsync, and tar contract.
- Native Windows uses OpenSSH, PowerShell, Git, and tar; sync is manifest tar
-  archive transfer into `static.workRoot`. Direct native Windows runs support
-  `--script*`, `--env-from-profile`, `--preflight`, and PowerShell `--shell`.
- `crabbox actions hydrate/register` are Linux-only today; use plain
-  `crabbox run` loops for static macOS and Windows hosts.
- Live proof needs a reachable, operator-managed SSH host. Without one, verify
-  with `../crabbox/bin/crabbox run --help`, config/flag tests, and the Crabbox
-  Go test suite.
-
-## Direct Brokered AWS Backend
-
-Use this when the task needs direct AWS Crabbox semantics rather than the
-prepared Blacksmith Testbox CI environment.
-
-Changed gate:
-
-```sh
-pnpm crabbox:run -- \
-  --idle-timeout 90m \
-  --ttl 240m \
-  --timing-json \
-  --shell -- \
-  "pnpm test:changed"
-```
-
-Full suite:
-
-```sh
-pnpm crabbox:run -- \
-  --idle-timeout 90m \
-  --ttl 240m \
-  --timing-json \
-  --shell -- \
-  "pnpm verify"
-```
-
-Use `pnpm verify` when you need check plus full Vitest proof. It emits
-`CRABBOX_PHASE:check` and `CRABBOX_PHASE:test`, making Crabbox summaries show
-which stage failed. Use plain `pnpm test` only when check proof is already
-covered or intentionally skipped.
-
-Focused rerun:
-
-```sh
-pnpm crabbox:run -- \
-  --idle-timeout 90m \
-  --ttl 240m \
-  --timing-json \
-  --shell -- \
-  "pnpm test <path-or-filter>"
-```
-
-Read the JSON summary. Useful fields:
-
- `provider`: `aws`
- `leaseId`: `cbx_...`
- `syncDelegated`: `false`
- `commandPhases`: populated when the command prints `CRABBOX_PHASE:<name>`
- `commandMs` / `totalMs`
- `exitCode`
-
-Crabbox should stop one-shot AWS leases automatically after the run. Verify
-cleanup when a run fails, is interrupted, or the command output is unclear:
-
-```sh
-../crabbox/bin/crabbox list --provider aws
-```
-
-## Blacksmith Testbox Through Crabbox
-
-Use this for OpenClaw maintainer broad/heavy `pnpm` gates when the prepared CI
-environment is the right proof surface:
-
-```sh
-node scripts/crabbox-wrapper.mjs run \
-  --provider blacksmith-testbox \
-  --blacksmith-org openclaw \
-  --blacksmith-workflow .github/workflows/ci-check-testbox.yml \
-  --blacksmith-job check \
-  --blacksmith-ref main \
-  --idle-timeout 90m \
-  --ttl 240m \
-  --timing-json \
-  -- \
-  corepack pnpm check:changed
-```
-
-Read the JSON summary and the Testbox line. Useful fields:
-
- `provider`: `blacksmith-testbox`
- `leaseId`: `tbx_...`
- `syncDelegated`: `true`
- `syncPhases`: delegated/skipped because Blacksmith owns checkout/sync
- Actions run URL/id from the Testbox output
- `exitCode`
-
-Use provider-backed cache volumes only for rebuildable caches, not secrets or
-checkout state. On Blacksmith, Crabbox forwards them as sticky disks:
-
-```sh
-node scripts/crabbox-wrapper.mjs run \
-  --provider blacksmith-testbox \
-  --cache-volume pnpm-store=openclaw-node24-pnpm-lock:/tmp/openclaw-pnpm-store \
-  --timing-json \
-  -- \
-  corepack pnpm check:changed
-```
-
-The selected provider must advertise cache-volume support. If not, omit
-`--cache-volume` and rely on kept-lease caches.
-
-`blacksmith testbox list` may hide hydrating or ready boxes. Use:
-
-```sh
-blacksmith testbox list --all
-blacksmith testbox status <tbx_id>
-```
-
-## Observability Flags
-
-Use these on debugging runs before inventing ad hoc logging:
-
- `--preflight`: prints run context, workspace mode, SSH target, remote user/cwd,
-  and target-specific tool probes. Defaults cover `git`, `tar`, `node`, `npm`,
-  `corepack`, `pnpm`, `yarn`, `bun`, `docker`, plus POSIX
-  `sudo`/`apt`/`bubblewrap` and native Windows
-  `powershell`/`execution_policy`/`longpaths`/`temp`/`pwsh`. Add
-  `--preflight-tools node,bun,docker`, `CRABBOX_PREFLIGHT_TOOLS`, or repo
-  `run.preflightTools` to replace the list. `default` expands built-ins; `none`
-  prints only the workspace summary. Preflight is diagnostic only; install
-  toolchains through Actions hydration, images, devcontainer/Nix/mise/asdf, or
-  the run script. On `blacksmith-testbox`, this prints a delegated-unsupported
-  note because the workflow owns setup.
- `CRABBOX_ENV_ALLOW=NAME,...`: forwards only listed local env vars for direct
-  providers and prints `set len=N secret=true` style summaries. On
-  `blacksmith-testbox`, env forwarding is unsupported; put secrets in the
-  Testbox workflow instead.
- `--env-from-profile <file>` plus `--allow-env NAME`: loads simple
-  `export NAME=value` / `NAME=value` lines from a local profile without
-  executing it, then forwards only allowlisted names. `--allow-env` is
-  repeatable and comma-separated. Profile values override ambient allowlisted
-  env values for that run. Direct POSIX, WSL2, and native Windows runs are
-  supported; delegated providers are not. Crabbox probes the uploaded profile
-  remotely and prints redacted presence/length metadata before the command.
- `--env-helper <name>`: with `--env-from-profile` on POSIX SSH targets,
-  persists `.crabbox/env/<name>` and `.crabbox/env/<name>.env` so follow-up
-  commands on the same lease can run through `./.crabbox/env/<name> <command>`.
-  Use only on leases you control; the profile stays until cleanup, lease reset,
-  or `--full-resync`.
- `--script <file>` / `--script-stdin`: upload a local script into
-  `.crabbox/scripts/` and execute it on the remote box. Shebang scripts execute
-  directly on POSIX; scripts without a shebang run through `bash`. Native
-  Windows uploads run through Windows PowerShell, and Crabbox appends `.ps1`
-  when needed. Arguments after `--` become script args.
- `--fresh-pr owner/repo#123|URL|number`: skip dirty local sync and create a
-  fresh remote checkout of the GitHub PR. Bare numbers use the current repo's
-  GitHub origin. Add `--apply-local-patch` only when the current local
-  `git diff --binary HEAD` should be applied on top of that PR checkout.
- `--full-resync` / `--fresh-sync`: reset a stale direct-provider workdir
-  before syncing. Use after sync fingerprints look wrong, SSH times out before
-  sync, or rsync watchdog output suggests it. It is redundant with
-  `--fresh-pr`, incompatible with `--no-sync`, and unsupported by delegated
-  providers.
- `--capture-stdout <path>` / `--capture-stderr <path>`: write remote streams to
-  local files and keep binary/noisy output out of retained logs. Parent
-  directories must already exist. These are direct-provider only.
- `--capture-on-fail`: on non-zero direct-provider exits, downloads
-  `.crabbox/captures/*.tar.gz` with `test-results`, `playwright-report`,
-  `coverage`, JUnit XML, and nearby logs. Treat as secret-bearing until reviewed.
- `--keep-on-failure`: leave a failed one-shot lease alive for live debugging
-  until idle/TTL expiry. Useful on direct providers and delegated one-shots.
- `--timing-json`: final machine-readable timing. Add
-  `echo CRABBOX_PHASE:install`, `CRABBOX_PHASE:test`, etc. in long shell
-  commands; direct providers and Blacksmith Testbox both report them as
-  `commandPhases`.
-
-Live-provider debug template for direct AWS/Hetzner leases:
-
-```sh
-mkdir -p .crabbox/logs
-pnpm crabbox:run -- --provider aws \
-  --preflight \
-  --allow-env OPENAI_API_KEY,OPENAI_BASE_URL \
-  --timing-json \
-  --capture-stdout .crabbox/logs/live-provider.stdout.log \
-  --capture-stderr .crabbox/logs/live-provider.stderr.log \
-  --capture-on-fail \
-  --shell -- \
-  "echo CRABBOX_PHASE:install; pnpm install --frozen-lockfile; echo CRABBOX_PHASE:test; pnpm test:live"
-```
-
-Do not pass `--capture-*`, `--download`, `--checksum`, `--force-sync-large`, or
-`--sync-only` to delegated providers. Also do not pass `--script*`,
-`--fresh-pr`, `--full-resync`, or `--env-helper` there. Crabbox rejects these
-because the provider owns sync or command transport. `--keep-on-failure` is OK
-for delegated one-shots when you need to inspect a failed lease.
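
Crabbox itself rejects these flags, but a local pre-check can catch them before a run is queued. The following bash sketch is illustrative only: `delegated_unsupported_flags` is a hypothetical helper, it matches only the bare flag spellings listed above (not `--flag=value` forms), and it treats everything after `--` as the remote command.

```bash
#!/usr/bin/env bash
# Hypothetical pre-check for delegated providers such as blacksmith-testbox.
# Prints each unsupported flag on its own line and returns 1 if any are found.
delegated_unsupported_flags() {
  local arg rejected=()
  for arg in "$@"; do
    # Stop scanning at "--": the rest is the remote command, not Crabbox flags.
    [ "$arg" = "--" ] && break
    case "$arg" in
      --capture-*|--download|--checksum|--force-sync-large|--sync-only|\
      --script|--script-*|--fresh-pr|--full-resync|--fresh-sync|--env-helper)
        rejected+=("$arg") ;;
    esac
  done
  [ "${#rejected[@]}" -eq 0 ] && return 0
  printf '%s\n' "${rejected[@]}"
  return 1
}
```

`--keep-on-failure` and `--timing-json` pass through untouched, matching the note that they remain usable on delegated one-shots.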
-
-## Efficient Bug E2E Verification
-
-Use the smallest Crabbox lane that proves the reported user path, not just the
-touched code. Aim for one after-fix E2E proof before commenting, closing, or
-opening a PR for a user-visible bug.
-
-When the user says "test in Crabbox", do not simply copy tests to the remote
-box and run them there. Crabbox is for remote real-scenario proof: copy or
-install OpenClaw as the user would, run the same setup/update/CLI/Gateway/API
-call that failed, and capture behavior from that entrypoint. For regressions or
-bug reports, prove the broken state first when feasible, then run the same
-scenario after the fix.
-
-Pick the lane by symptom:
-
- Docker/setup/install bug: build a package tarball and run the matching
-  `scripts/e2e/*-docker.sh` or package script. This proves npm packaging,
-  install paths, runtime deps, config writes, and container behavior.
- Provider/model/auth bug: prefer true live E2E. Use the configured secret
-  workflow, then inject the single needed key into Crabbox if needed. Scrub
-  unrelated provider env vars in the child command so interactive defaults do
-  not drift to another provider. If only a dummy key is used, label the proof
-  narrowly, e.g. "UI/install path only; live provider auth not exercised."
- Channel delivery bug: use the channel Docker/live lane when available; include
-  setup, config, gateway start, send/receive or agent-turn proof, and redacted
-  logs.
- Gateway/session/tool bug: prefer an end-to-end CLI or Gateway RPC command that
-  creates real state and inspects the resulting files/API output.
- Pure parser/config bug: targeted tests may be enough, but still run a
-  Crabbox command when OS, package, Docker, secrets, or service lifecycle could
-  change behavior.
-
-Efficient flow:
-
-1. Reproduce or prove the pre-fix symptom from the real user-facing entrypoint
-   when feasible. If the issue cannot be reproduced, capture the exact command
-   and observed behavior instead.
-2. Patch locally and run narrow local tests for edit speed.
-3. Run one Crabbox E2E command that starts from the user-facing entrypoint:
-   package install, Docker setup, onboarding, channel add, gateway start, or
-   agent turn as appropriate.
-4. Record proof as: Testbox id, command, environment shape, redacted secret
-   source, and copied success/failure output.
-5. If the result is still "cannot reproduce", ask the reporter for the missing
-   config/log fields that would distinguish the tested path from their path.
-
-Keep it efficient:
-
- Reuse existing E2E scripts and helper assertions before writing ad hoc shell.
- Use `--script <file>` or `--script-stdin` for multi-line E2E commands instead
-  of quote-heavy `--shell` strings on direct SSH providers.
- Use `--fresh-pr <pr>` when validating an upstream PR in isolation from the
-  local dirty tree. Add `--apply-local-patch` only when testing a local fixup on
-  top of that PR.
- Use `--full-resync` before replacing a warmed direct-provider lease when the
-  remote workdir or sync fingerprint appears stale.
- Use one-shot Crabbox for a single proof; use a reusable Testbox only when
-  several commands must share built images, installed packages, or live state.
- Prefer `OPENCLAW_CURRENT_PACKAGE_TGZ` with Docker/package lanes when testing a
-  candidate tarball; prefer the repo's package helper instead of direct source
-  execution when the bug might be packaging/install related.
- Keep secrets redacted. It is fine to report key presence, source, and length;
-  never print secret values.
- Include `--timing-json` on broad or flaky runs when command duration or sync
-  behavior matters.
-
-Before/after PR proof on delegated Testbox:
-
- For PRs that should prove "broken before, fixed after", compare base and PR
-  on the same Testbox when practical. Fetch both refs, create detached temp
-  worktrees under `/tmp`, install in each, then run the same harness twice.
- Do not checkout base/PR refs in the synced repo root. Delegated Testbox sync
-  may leave the root dirty with local files; `git checkout` can abort or mix
-  proof state.
- Temp harness files under `/tmp` do not resolve repo packages by default. Put
-  the harness inside the worktree, or in ESM use
-  `createRequire(path.join(process.cwd(), "package.json"))` before requiring
-  workspace deps such as `@lydell/node-pty`.
- For full-screen TUI/CLI bugs, a PTY harness is stronger than helper-only
-  assertions. Use a real PTY, wait for visible lifecycle markers, send input,
-  then send control keys and assert process exit/stuck behavior.
- When validating a rebased local branch before push, remember delegated sync
-  usually validates synced file content on a detached dirty checkout, not a
-  remote commit object. Record the local head SHA, changed files, Testbox id,
-  and final success markers; after pushing, ensure the pushed SHA has the same
-  file content.
- If GitHub CI is still queued but the exact changed content passed Testbox
-  `pnpm check:changed`, `pnpm check:test-types`, and the real E2E proof, it is
-  reasonable to merge once required checks allow it. Note any still-running
-  unrelated shards in the proof comment instead of waiting forever.
-
-Interactive CLI/onboarding:
-
- For full-screen or prompt-heavy CLI flows, run the target command inside tmux
-  on the Crabbox and drive it with `tmux send-keys`; capture proof with
-  `tmux capture-pane`, redacted through `sed`.
- Prefer deterministic arrow navigation over search typing for Clack-style
-  searchable selects. Raw `send-keys -l openai` may not trigger filtering in a
-  tmux pane; inspect option order locally or on-box and send exact Down/Enter
-  sequences.
- Isolate mutable state with `OPENCLAW_STATE_DIR=$(mktemp -d)`. Plugin npm
-  installs live under that state dir (`npm/node_modules/...`), not under
-  `OPENCLAW_CONFIG_DIR`. Verify downloads by checking the state dir, package
-  lock, and installed package metadata.
- To test automatic setup installs against local package artifacts, use
-  `OPENCLAW_ALLOW_PLUGIN_INSTALL_OVERRIDES=1` plus
-  `OPENCLAW_PLUGIN_INSTALL_OVERRIDES='{"plugin-id":"npm-pack:/tmp/plugin.tgz"}'`.
-  Pack with `npm pack`, set an isolated `OPENCLAW_STATE_DIR`, and verify the
-  package under `npm/node_modules`. Overrides are test-only and must not be
-  treated as official/trusted-source installs.
- For OpenAI/Codex onboarding proof, the useful markers are the UI line
-  `Installed Codex plugin`, `npm/node_modules/@openclaw/codex`, and the
-  package-lock entry showing the bundled `@openai/codex` dependency. A dummy
-  OpenAI-shaped key can prove only UI/install behavior; it is not live auth.
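The override bullets above can be sketched as a small environment builder. This is a minimal illustration, not part of the repo: the helper name `plugin_override_env` is hypothetical, and only the three environment variable names and the `npm-pack:` prefix come from the notes above.

```python
import json
import os


def plugin_override_env(overrides, state_dir, base_env=None):
    """Build an environment for a test-only plugin install override run.

    `overrides` maps plugin id -> path of a tarball produced by `npm pack`.
    `state_dir` should be a fresh, isolated directory (for example from
    tempfile.mkdtemp()) so installs never touch real state.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OPENCLAW_STATE_DIR"] = str(state_dir)
    env["OPENCLAW_ALLOW_PLUGIN_INSTALL_OVERRIDES"] = "1"
    # The override value is a JSON object: {"<plugin-id>": "npm-pack:<tarball>"}.
    env["OPENCLAW_PLUGIN_INSTALL_OVERRIDES"] = json.dumps(
        {plugin_id: f"npm-pack:{tarball}" for plugin_id, tarball in overrides.items()}
    )
    return env
```

Pass the returned mapping as `env=` to `subprocess.run` for the command under test, then inspect `npm/node_modules` under the state dir. Serializing with `json.dumps` avoids the hand-quoting mistakes that are easy to make when the JSON is typed inline in a shell.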
-
-## Reuse And Keepalive
-
-For most Crabbox calls, one-shot is enough. Use reuse only when you need
-multiple manual commands on the same hydrated box.
-
-If Crabbox returns a reusable id or you intentionally keep a lease:
-
-```sh
-pnpm crabbox:run -- --id <cbx_id-or-slug> --no-sync --timing-json --shell -- "pnpm test <path>"
-```
-
-Stop boxes you created before handoff:
-
-```sh
-pnpm crabbox:stop -- <id-or-slug>
-blacksmith testbox stop --id <tbx_id>
-```
-
-## Interactive Desktop And WebVNC
-
-Prefer WebVNC for human inspection because the browser portal can preload the
-lease VNC password and avoids a native VNC client's copy/paste/password dance.
-Use native `crabbox vnc` only when WebVNC is unavailable, the browser portal is
-broken, or the user explicitly wants a local VNC client.
-
-Common desktop flow:
-
-```sh
-../crabbox/bin/crabbox warmup --provider hetzner --desktop --browser --class standard --idle-timeout 60m --ttl 240m
-../crabbox/bin/crabbox desktop launch --provider hetzner --id <cbx_id-or-slug> --browser --url https://example.com --webvnc --open --take-control
-```
-
-Useful WebVNC commands:
-
-```sh
-../crabbox/bin/crabbox webvnc --provider hetzner --id <cbx_id-or-slug> --open --take-control
-../crabbox/bin/crabbox webvnc daemon start --provider hetzner --id <cbx_id-or-slug> --open --take-control
-../crabbox/bin/crabbox webvnc daemon status --provider hetzner --id <cbx_id-or-slug>
-../crabbox/bin/crabbox webvnc daemon stop --provider hetzner --id <cbx_id-or-slug>
-../crabbox/bin/crabbox webvnc status --provider hetzner --id <cbx_id-or-slug>
-../crabbox/bin/crabbox webvnc reset --provider hetzner --id <cbx_id-or-slug> --open --take-control
-../crabbox/bin/crabbox desktop doctor --provider hetzner --id <cbx_id-or-slug>
-../crabbox/bin/crabbox desktop click --provider hetzner --id <cbx_id-or-slug> --x 640 --y 420
-../crabbox/bin/crabbox desktop paste --provider hetzner --id <cbx_id-or-slug> --text "user@example.com"
-../crabbox/bin/crabbox desktop key --provider hetzner --id <cbx_id-or-slug> ctrl+l
-../crabbox/bin/crabbox artifacts collect --id <cbx_id-or-slug> --all --output artifacts/<slug>
-../crabbox/bin/crabbox artifacts publish --dir artifacts/<slug> --pr <number>
-```
-
-`desktop launch --webvnc --open` is usually the nicest one-shot: it starts the
-browser/app inside the visible session, bridges the lease into the authenticated
-WebVNC portal, and opens the portal. Keep browsers windowed for human QA; use
-`--fullscreen` only for capture/video workflows.
-For human handoff, include `--take-control` so the opened portal viewer gets
-keyboard/mouse control automatically instead of landing as an observer.
-
-Human handoff preflight:
-
- Do not assume a visible desktop or launched browser means the repo CLI/app is
-  installed, built, or on the interactive terminal's `PATH`.
- Before handing WebVNC to a human tester, prove the expected command from the
-  same kept lease and from a neutral directory such as `~`.
- If the handoff needs repo-local code, sync/build/link it explicitly on that
-  lease. Source-tree CLIs often need build output before a symlink works.
- Prefer a real `command -v <expected-command> && <expected-command> --version`
-  check over a repo-root-only `pnpm ...` command.
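The `command -v ... && ... --version` check in the preflight list could also be scripted. The sketch below is one possible shape, assuming it runs on the kept lease itself; the function name `preflight` and its parameters are illustrative, not an existing helper.

```python
import shutil
import subprocess
from pathlib import Path


def preflight(command, cwd=None, timeout=30):
    """Resolve `command` on PATH and run `<command> --version` from a neutral dir.

    Returns None when the command is not on PATH, otherwise a tuple of
    (resolved_path, version_output). Raises CalledProcessError when the
    version call exits non-zero.
    """
    cwd = Path.home() if cwd is None else Path(cwd)
    resolved = shutil.which(command)  # same role as `command -v`
    if resolved is None:
        return None
    result = subprocess.run(
        [resolved, "--version"],
        cwd=cwd,  # neutral directory, so repo-local resolution cannot help
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    # Some CLIs print their version on stderr, so fall back to it.
    return resolved, (result.stdout or result.stderr).strip()
```

A `None` result means the handoff is not ready: the command still needs to be built or linked on that lease.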
-
-Generic handoff repair pattern:
-
-```sh
-../crabbox/bin/crabbox run --id <cbx_id-or-slug> --full-resync --shell -- \
-  "set -euo pipefail
-   pnpm install --frozen-lockfile
-   pnpm build
-   sudo ln -sf \"\$PWD/<cli-entry>\" /usr/local/bin/<expected-command>
-   cd ~
-   command -v <expected-command>
-   <expected-command> --version"
-```
-
-## If Crabbox Fails
-
-Keep the fallback narrow. First decide whether the failure is Crabbox itself,
-the brokered AWS lease, Blacksmith/Testbox, repo hydration, sync, or the test
-command.
-
-Fast checks:
-
-```sh
-command -v crabbox
-../crabbox/bin/crabbox --version
-pnpm crabbox:run -- --help | sed -n '1,140p'
-../crabbox/bin/crabbox doctor
-command -v blacksmith
-blacksmith --version
-blacksmith testbox list
-```
-
-Common Crabbox-only failures:
-
- Provider missing or old CLI: use `../crabbox/bin/crabbox` from the sibling
-  repo, or update/install Crabbox before retrying.
- Bad local config: inspect `.crabbox.yaml`, `crabbox config show`, and
-  `crabbox whoami`; normal OpenClaw proof should use brokered AWS without
-  asking for cloud keys.
- Slug/claim confusion: use the raw `cbx_...` / `tbx_...` id, or run one-shot
-  without `--id`.
- Sync/timing bug: add `--debug --timing-json`; capture the final JSON and the
-  printed Actions URL. Large sync warnings now include top source directories
-  by file count and a hint to update `.crabboxignore` / `sync.exclude`; inspect
-  those before reaching for `--force-sync-large`. Quiet rsync watchdogs and SSH
-  timeouts now print `next_action=` hints; follow them, usually `--full-resync`
-  first and a fresh lease second.
- Cleanup uncertainty: run `crabbox list --provider aws`; for explicit
-  Blacksmith runs, use `blacksmith testbox list` and stop only boxes you
-  created.
- Testbox queued/capacity pressure: do not retry Blacksmith repeatedly. Rerun
-  once without `--provider` so `.crabbox.yaml` routes to brokered AWS, or report
-  the Blacksmith blocker if Testbox itself is the requested proof.
-
-If brokered AWS cannot dispatch, sync, attach, or stop, retry once with
-`--debug` and `--timing-json`:
-
-```sh
-pnpm crabbox:run -- --debug --timing-json -- \
-  pnpm test:changed
-```
-
-Full suite:
-
-```sh
-pnpm crabbox:run -- --debug --timing-json -- \
-  pnpm test
-```
-
-Auth fallback, only when `blacksmith` says auth is missing:
-
-```sh
-blacksmith auth login --non-interactive --organization openclaw
-```
-
-Raw Blacksmith footguns:
-
- Run from repo root. The CLI syncs the current directory.
- Save the returned `tbx_...` id in the session.
- Reuse that id for focused reruns; stop it before handoff.
- Raw commit SHAs are not reliable `warmup --ref` refs; use a branch or tag.
- Treat `blacksmith testbox list` as cleanup diagnostics, not a shared reusable
-  queue.
-
-Use Blacksmith only when the task is specifically about Testbox, brokered AWS
-is unavailable, or an explicit comparison is needed. If Blacksmith is down or
-quota-limited, do not keep probing it; stay on brokered AWS and note the
-delegated-provider outage.
-
-## Blacksmith Backend Notes
-
-Crabbox Blacksmith backend delegates setup to:
-
- org: `openclaw`
- workflow: `.github/workflows/ci-check-testbox.yml`
- job: `check`
- ref: `main` unless testing a branch/tag intentionally
-
-The hydration workflow owns checkout, Node/pnpm setup, dependency install,
-secrets, ready marker, and keepalive. Crabbox owns dispatch, sync, SSH command
-execution, timing, logs/results, cleanup, and cache-volume requests. Blacksmith
-implements cache volumes as sticky disks.
-
-Minimal Blacksmith-backed Crabbox run, from repo root:
-
-```sh
-pnpm crabbox:run -- --provider blacksmith-testbox --timing-json -- \
-  corepack pnpm test:changed
-```
-
-Use direct Blacksmith only when Crabbox is the broken layer and you are
-isolating a Crabbox bug. Treat direct `blacksmith testbox list` as cleanup
-diagnostics, not as a reusable work queue.
-
-Important Blacksmith footguns:
-
- Always run from repo root. The CLI syncs the current directory.
- Raw commit SHAs are not reliable `warmup --ref` refs; use a branch or tag.
- If auth is missing and browser auth is acceptable:
-
-```sh
-blacksmith auth login --non-interactive --organization openclaw
-```
-
-## Brokered AWS
-
-Use AWS for normal OpenClaw remote proof. The repo `.crabbox.yaml` already
-selects brokered AWS, so omit `--provider` unless you are testing a different
-provider deliberately.
-
-```sh
-pnpm crabbox:warmup -- --class beast --market on-demand --idle-timeout 90m
-pnpm crabbox:hydrate -- --id <cbx_id-or-slug>
-pnpm crabbox:run -- --id <cbx_id-or-slug> --timing-json --shell -- "pnpm test:changed"
-pnpm crabbox:stop -- <cbx_id-or-slug>
-```
-
-Install/auth for owned Crabbox if needed:
-
-```sh
-brew install openclaw/tap/crabbox
-crabbox login --url https://crabbox.openclaw.ai --provider aws
-```
-
-New users should self-resolve broker auth before anyone asks for AWS keys:
-
-```sh
-crabbox config show
-crabbox doctor
-crabbox whoami
-```
-
- If broker auth is missing, run `crabbox login --url https://crabbox.openclaw.ai --provider aws`.
- If the CLI asks for `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, or AWS
-  profile setup during normal OpenClaw validation, assume the agent selected
-  the wrong path. Use brokered `crabbox login` or an existing brokered lease
-  before asking the user for cloud credentials.
- Ask for AWS keys only for explicit direct-provider/account administration,
-  not for normal brokered OpenClaw proof.
- Trusted automation may still use
-  `printf '%s' "$CRABBOX_COORDINATOR_TOKEN" | crabbox login --url https://crabbox.openclaw.ai --provider aws --token-stdin`.
-
-macOS config lives at:
-
-```text
-~/Library/Application Support/crabbox/config.yaml
-```
-
-It should include `broker.url`, `broker.token`, and usually `provider: aws`
-for OpenClaw lanes. Let that config drive normal validation.
-
-### Interactive Desktop / WebVNC
-
-For human desktop demos, prefer `webvnc` over native `vnc` and keep the remote
-desktop visible/windowed. Do not fullscreen the remote browser or hide the XFCE
-panel/window chrome unless the explicit goal is video/capture output. After
-launch, verify a screenshot shows the desktop panel plus browser title bar. If
-Chrome is fullscreen, toggle it back with:
-
-```sh
-crabbox run --id <lease> --shell -- 'DISPLAY=:99 xdotool search --onlyvisible --class google-chrome windowactivate key F11'
-```
-
-## Diagnostics
-
-```sh
-crabbox status --id <id-or-slug> --wait
-crabbox inspect --id <id-or-slug> --json
-crabbox sync-plan
-crabbox history --limit 20
-crabbox history --lease <id-or-slug>
-crabbox attach <run_id>
-crabbox events <run_id> --json
-crabbox logs <run_id>
-crabbox results <run_id>
-crabbox cache stats --id <id-or-slug>
-crabbox cache volumes
-crabbox ssh --id <id-or-slug>
-blacksmith testbox list
-```
-
-Use `--debug` on `run` when measuring sync timing.
-Use `--timing-json` on warmup, hydrate, and run when comparing backends.
-Use `--market spot|on-demand` only on AWS warmup/one-shot runs.
-
-## Failure Triage
-
- Crabbox cannot find provider: verify `../crabbox/bin/crabbox --help` lists
-  the provider selected by `.crabbox.yaml`; update Crabbox before falling back.
- Hydration stuck or failed: open the printed GitHub Actions run URL and inspect
-  the hydration step.
- Sync failed: rerun with `--debug`; check changed-file count and whether the
-  checkout is dirty.
- Command failed: rerun only the failing shard/file first. Do not rerun a full
-  suite until the focused failure is understood.
- Cleanup uncertain: `crabbox list --provider aws`; for explicit Blacksmith
-  runs, use `blacksmith testbox list` and stop owned `tbx_...` leases you
-  created.
- Crabbox broken but Blacksmith works: use the direct Blacksmith fallback above,
-  then file/fix the Crabbox issue.
-
-## Boundary
-
-Do not add OpenClaw-specific setup to Crabbox itself. Put repo setup in the
-hydration workflow and keep Crabbox generic around lease, sync, command
-execution, logs/results, timing, and cleanup.
--- a/.agents/skills/discord-clawd/SKILL.md
+++ b/.agents/skills/discord-clawd/SKILL.md
@@ -1,37 +0,0 @@
---
-name: discord-clawd
-description: Use to talk to the Discord-backed OpenClaw agent/session; not for archive search.
---
-
-# Discord Clawd
-
-Use this when the task is to talk with the Discord-backed agent/session, ask it a question, or post through that route.
-
-For Discord archive/history/search, use `$discrawl` instead.
-
-## Transport
-
-Use the OpenClaw relay helper:
-
-```bash
-cd ~/Projects/agent-scripts
-python3 skills/openclaw-relay/scripts/openclaw_relay.py targets
-python3 skills/openclaw-relay/scripts/openclaw_relay.py resolve --target maintainers
-```
-
-If the target alias exists, prefer a private ask first:
-
-```bash
-python3 skills/openclaw-relay/scripts/openclaw_relay.py ask \
-  --target maintainers \
-  --message "Reply with exactly OK."
-```
-
-Use `publish` when the session should decide whether to post. Use `force-send` only when the user explicitly wants a message posted.
-
-## Guardrails
-
- Resolve the target before sending real content.
- Report the target and delivery mode used.
- Do not use this for local Discord archive queries.
- Do not expose gateway tokens or session secrets.
--- a/.agents/skills/discord-clawd/agents/openai.yaml
+++ b/.agents/skills/discord-clawd/agents/openai.yaml
@@ -1,4 +0,0 @@
-interface:
-  display_name: "Discord Clawd"
-  short_description: "Talk to the Discord-backed OpenClaw agent"
-  default_prompt: "Use $discord-clawd to route a private ask or explicit post through the Discord-backed OpenClaw agent/session."
--- a/.agents/skills/discord-user-post/SKILL.md
+++ b/.agents/skills/discord-user-post/SKILL.md
@@ -1,51 +0,0 @@
---
-name: discord-user-post
-description: Post an approved message as the logged-in Discord user through the Discord desktop app. Use for release announcements or other direct user-authored Discord posts; not for OpenClaw channel sends, bots, webhooks, relays, agent sessions, or archive search.
---
-
-# Discord User Post
-
-Use `$computer-use` to operate `/Applications/Discord.app` in the user's
-existing logged-in session. This workflow represents the user directly.
-
-## Prepare
-
-1. Draft the complete final message outside Discord.
-2. Confirm the intended server and channel with the user when either is
-   ambiguous.
-3. Open Discord and navigate to the exact destination without entering the
-   message.
-4. Verify the visible server name, channel header, and logged-in account.
-
-Do not infer the target from unrelated Discord content. Stop if Discord is not
-logged in, the account is wrong, or the exact destination cannot be verified.
-
-## Confirm and Post
-
-Posting is representational communication. Follow the `$computer-use`
-confirmation policy even when the user previously asked for an announcement:
-
-1. Show the user the exact final body and verified destination.
-2. Request action-time confirmation before typing into Discord.
-3. After confirmation, enter the approved body unchanged.
-4. Visually inspect the composed message and destination again.
-5. Send once.
-
-If the body or destination changes after confirmation, request confirmation
-again before sending.
-
-## Verify
-
- Confirm the message appears once, from the user's account, in the intended
-  channel.
- Report the server, channel, and visible send result.
- Do not edit, delete, react, or send a follow-up without the corresponding
-  user instruction and confirmation.
-
-## Guardrails
-
- Never use `openclaw message`, an OpenClaw agent, a Discord bot, webhook, relay,
-  or token for this workflow.
- Never expose private Discord content or account details in public output.
- Never send a draft, partial message, duplicate, or unreviewed attachment.
- For Discord archive/history/search, use `$discrawl` instead.
--- a/.agents/skills/discord-user-post/agents/openai.yaml
+++ b/.agents/skills/discord-user-post/agents/openai.yaml
@@ -1,4 +0,0 @@
-interface:
-  display_name: "Discord User Post"
-  short_description: "Post approved messages through the logged-in Discord app"
-  default_prompt: "Post this approved message as me through the logged-in Discord desktop app."
--- a/.agents/skills/discrawl/SKILL.md
+++ b/.agents/skills/discrawl/SKILL.md
@@ -1,169 +0,0 @@
---
-name: discrawl
-description: "Discord archive: search, sync freshness, DMs, summaries, TUI, repo/release work."
-metadata:
-  openclaw:
-    homepage: https://github.com/openclaw/discrawl
-    requires:
-      bins:
-        - discrawl
-    install:
-      - kind: go
-        module: github.com/openclaw/discrawl/cmd/discrawl@latest
-        bins:
-          - discrawl
---
-
-# Discrawl
-
-Use local Discord archive data first for Discord questions. Hit Discord APIs
-only when the archive is stale, missing the requested scope, or the user asks
-for current external context.
-
-## Sources
-
- DB: platform-native XDG data dir, usually
-  `${XDG_DATA_HOME:-~/.local/share}/discrawl/discrawl.db` on Linux or
-  `~/Library/Application Support/discrawl/discrawl.db` on macOS
- Config: platform-native XDG config dir, with legacy fallback to
-  `~/.discrawl/config.toml`
- Cache: platform-native XDG cache dir
- Logs: platform-native XDG state dir
- Git share repo: platform-native XDG data dir
- Repo: `openclaw/discrawl`; use `~/GIT/_Perso/discrawl` only after verifying
-  its remote targets `openclaw/discrawl`, otherwise use a fresh checkout
- Preferred CLI: `discrawl`; fallback to `go run ./cmd/discrawl` from the repo
-  if the installed binary is stale
-
-## Freshness
-
-For recent/current questions, check freshness before analysis:
-
-```bash
-discrawl status --json
-```
-
-For precise freshness from the default database:
-
-```bash
-# Discrawl uses macOS ~/Library defaults unless XDG_DATA_HOME is explicitly set.
-case "$(uname -s)" in
-  Darwin)
-    db="$HOME/Library/Application Support/discrawl/discrawl.db"
-    ;;
-  *)
-    db="${XDG_DATA_HOME:-$HOME/.local/share}/discrawl/discrawl.db"
-    ;;
-esac
-sqlite3 "$db" \
-  "select coalesce(max(updated_at),'') from sync_state where scope like 'channel:%';"
-```
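For agents that prefer to run the same check from a script, here is a hedged Python equivalent of the shell snippet above. The function names are illustrative. The table and column names (`sync_state`, `scope`, `updated_at`) are taken from the query above, and nothing else about the schema is assumed. Like the shell `case`, this sketch does not consult `XDG_DATA_HOME` on macOS, even though the snippet's comment says an explicitly set value takes precedence.

```python
import os
import platform
import sqlite3
from pathlib import Path


def default_discrawl_db(system=None, env=None, home=None):
    """Mirror the shell `case`: macOS uses ~/Library, everything else XDG."""
    system = system or platform.system()
    env = os.environ if env is None else env
    home = Path(home) if home is not None else Path.home()
    if system == "Darwin":
        return home / "Library" / "Application Support" / "discrawl" / "discrawl.db"
    # `${XDG_DATA_HOME:-...}` falls back when the variable is unset or empty.
    data_home = env.get("XDG_DATA_HOME") or str(home / ".local" / "share")
    return Path(data_home) / "discrawl" / "discrawl.db"


def latest_channel_sync(db_path):
    """Return max(updated_at) across channel scopes, or '' when there are none."""
    # mode=ro opens the archive read-only so the check cannot mutate it.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        row = conn.execute(
            "select coalesce(max(updated_at), '') from sync_state "
            "where scope like 'channel:%'"
        ).fetchone()
        return row[0]
    finally:
        conn.close()
```

Typical use is `latest_channel_sync(default_discrawl_db())`. An empty string means no channel scope has been synced yet.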
-
-Routine diagnostics:
-
-```bash
-discrawl doctor
-```
-
-Desktop-local refresh:
-
-```bash
-discrawl sync --source wiretap
-```
-
-Bot API latest refresh, when credentials are available:
-
-```bash
-discrawl sync
-```
-
-Use `--full` only for deliberate historical backfills:
-
-```bash
-discrawl sync --full
-```
-
-If SQLite reports busy/locked, check for stray `discrawl` processes before retrying.
-
-## Query Workflow
-
-1. Resolve scope: guild, channel, DM, author, keyword, date range.
-2. Check freshness for recent/current requests.
-3. Prefer CLI search/messages for slices; use read-only SQL for exact counts.
-4. Report absolute date spans, counts, channel/DM names, and known gaps.
-
-Use root or subcommand help for syntax: `discrawl --help`,
-`discrawl help search`, `discrawl search --help`. Use
-`DISCRAWL_NO_AUTO_UPDATE=1` for read smokes when you do not want git-share
-updates.
-
-Common commands:
-
-```bash
-DISCRAWL_NO_AUTO_UPDATE=1 discrawl search --limit 20 "query"
-discrawl messages --channel '#maintainers' --days 7 --all
-discrawl dms --last 20
-discrawl tui --dm
-DISCRAWL_NO_AUTO_UPDATE=1 discrawl --json sql "select count(*) from messages;"
-```
-
-## SQL
-
-Use `discrawl sql` for exact counts, joins, and ranking queries when normal
-CLI reads are too coarse. The command is read-only by default, accepts SQL as
-args or stdin, and supports `--json` for agent parsing.
-
-Useful examples:
-
-```bash
-DISCRAWL_NO_AUTO_UPDATE=1 discrawl --json sql "select count(*) as messages from messages;"
-DISCRAWL_NO_AUTO_UPDATE=1 discrawl --json sql "select coalesce(nullif(c.name, ''), m.channel_id) as channel, count(*) as messages from messages m left join channels c on c.id = m.channel_id group by m.channel_id order by messages desc limit 20;"
-DISCRAWL_NO_AUTO_UPDATE=1 discrawl --json sql "select coalesce(nullif(mm.display_name, ''), nullif(mm.global_name, ''), nullif(mm.username, ''), m.author_id) as author, count(*) as messages from messages m left join members mm on mm.guild_id = m.guild_id and mm.user_id = m.author_id group by m.guild_id, m.author_id order by messages desc limit 20;"
-```
-
-Never use `--unsafe --confirm` unless the user explicitly asks for a database
-mutation and the write has been reviewed.
-
-When the installed CLI lacks a new feature, build or run from a verified
-`openclaw/discrawl` checkout before concluding the feature is missing.
-
-## Discord Boundaries
-
-Bot API sync requires configured Discord bot credentials; do not invent token
-availability. Desktop wiretap mode reads local Discord Desktop artifacts and
-must not extract credentials, use user tokens, call Discord as the user, or
-write to Discord application storage. Wiretap/Desktop cache DMs are local-only
-and must not be described as part of the published Git snapshot. Git-share
-snapshots must not include secrets or `@me` DM rows.
-
-## Verification
-
-For repo edits, prefer existing Go gates:
-
-```bash
-GOWORK=off go test ./...
-```
-
-Then run targeted CLI smoke for the touched surface, for example:
-
-```bash
-discrawl doctor
-discrawl status --json
-DISCRAWL_NO_AUTO_UPDATE=1 discrawl search --limit 5 "test"
-```
-
-## ClawSweeper Sandbox
-
-Use the sandbox reader only:
-
-```bash
-discrawl-sandbox search --limit 20 "query"
-discrawl-sandbox messages --channel clawtributors --days 7 --all
-discrawl-sandbox status --json
-```
-
-This reader imports `https://github.com/openclaw/discord-store.git` into
-`/root/clawsweeper-sandbox-workspace/.discrawl/discrawl.db` with
-`discord.token_source = "none"`. The published Git snapshot is public-channel
-filtered; do not use `/root/.discrawl/config.toml` or the rich writer DB from
-sandboxed public Discord sessions.
--- a/.agents/skills/discrawl/agents/openai.yaml
+++ b/.agents/skills/discrawl/agents/openai.yaml
@@ -1,4 +0,0 @@
-interface:
-  display_name: "Discrawl"
-  short_description: "Search local Discord archives and freshness"
-  default_prompt: "Use $discrawl to search local Discord archives, check freshness, inspect DMs or channel slices, and report exact date spans and source gaps."
--- a/.agents/skills/gitcrawl/SKILL.md
+++ b/.agents/skills/gitcrawl/SKILL.md
@@ -1,50 +0,0 @@
---
-name: gitcrawl
-description: "GitHub archive: issue/PR search, sync freshness, duplicate clusters, gh-shim PR status, and Gitcrawl repo work."
-metadata:
-  openclaw:
-    homepage: https://github.com/openclaw/gitcrawl
-    requires:
-      bins:
-        - gitcrawl
-    install:
-      - kind: go
-        module: github.com/openclaw/gitcrawl/cmd/gitcrawl@latest
-        bins:
-          - gitcrawl
---
-
-# Gitcrawl
-
-Use local GitHub issue/PR archives before live GitHub search. Check freshness first:
-
-```bash
-gitcrawl doctor --json
-```
-
-Find candidates:
-
-```bash
-gitcrawl threads openclaw/openclaw --numbers <issue-or-pr-number> --include-closed --json
-gitcrawl neighbors openclaw/openclaw --number <issue-or-pr-number> --limit 12 --json
-gitcrawl search issues "query" -R openclaw/openclaw --state open --json number,title,url
-gitcrawl clusters openclaw/openclaw --sort size --min-size 5
-gitcrawl cluster-detail openclaw/openclaw --id <cluster-id>
-```
-
-For PR triage, start cached and go live only before mutation/merge decisions:
-
-```bash
-gitcrawl gh pr status <number-or-url> -R openclaw/openclaw --compact
-gitcrawl gh pr view <number-or-url> -R openclaw/openclaw --json number,title,state,url,isDraft,headRef,headSha
-gitcrawl gh --live pr status <number-or-url> -R openclaw/openclaw --compact
-```
-
-Use live `gh` plus checkout proof before commenting, labeling, closing, reopening, merging, or filing a PR review:
-
-```bash
-gh pr view <number> --json number,title,state,mergedAt,body,files,comments,reviews,statusCheckRollup
-gh issue view <number> --json number,title,state,body,comments,closedAt
-```
-
-Report absolute dates, repo names, issue/PR numbers, cluster ids, and source gaps. Do not close/label from similarity alone; require matching intent plus live verification.
--- a/.agents/skills/gitcrawl/agents/openai.yaml
+++ b/.agents/skills/gitcrawl/agents/openai.yaml
@@ -1,4 +0,0 @@
-interface:
-  display_name: "Gitcrawl"
-  short_description: "Search local OpenClaw issue and PR history before live GitHub triage"
-  default_prompt: "Use $gitcrawl to inspect OpenClaw issue and PR history, find related threads and duplicate candidates, then verify actionable decisions with live GitHub."
--- a/.agents/skills/graincrawl/SKILL.md
+++ b/.agents/skills/graincrawl/SKILL.md
@@ -1,44 +0,0 @@
---
-name: graincrawl
-description: "Granola archive: search, sync freshness, notes, transcripts, panels, SQL counts, and Graincrawl repo work."
-metadata:
-  openclaw:
-    homepage: https://github.com/openclaw/graincrawl
-    requires:
-      bins:
-        - graincrawl
-    install:
-      - kind: go
-        module: github.com/vincentkoc/graincrawl/cmd/graincrawl@latest
-        bins:
-          - graincrawl
---
-
-# Graincrawl
-
-Use local Granola archive data first. Check freshness for recent/current questions:
-
-```bash
-graincrawl doctor --json
-graincrawl status --json
-```
-
-Refresh only when stale or asked:
-
-```bash
-graincrawl sync --source private-api
-graincrawl sync --source desktop-cache
-```
-
-Query with bounded reads:
-
-```bash
-graincrawl search "query"
-graincrawl notes --json
-graincrawl note get <id>
-graincrawl transcripts get <id>
-graincrawl panels get <id>
-graincrawl --json sql "select count(*) as notes from notes;"
-```
-
-Report absolute date spans, note titles, source gaps, and transcript/panel availability. Use read-only SQL for exact counts/rankings. Before encrypted source debugging, run explicit unlock/secrets checks; do not surprise-prompt Keychain.
--- a/.agents/skills/graincrawl/agents/openai.yaml
+++ b/.agents/skills/graincrawl/agents/openai.yaml
@@ -1,4 +0,0 @@
-interface:
-  display_name: "Graincrawl"
-  short_description: "Search local Granola notes and transcripts"
-  default_prompt: "Use $graincrawl to search local Granola notes, transcripts, and panels, check freshness, and report exact date spans and source gaps."
--- a/.agents/skills/notcrawl/SKILL.md
+++ b/.agents/skills/notcrawl/SKILL.md
@@ -1,42 +0,0 @@
---
-name: notcrawl
-description: "Notion archive: search, sync freshness, pages/databases, Markdown exports, SQL counts, and Notcrawl repo work."
-metadata:
-  openclaw:
-    homepage: https://github.com/openclaw/notcrawl
-    requires:
-      bins:
-        - notcrawl
-    install:
-      - kind: go
-        module: github.com/vincentkoc/notcrawl/cmd/notcrawl@latest
-        bins:
-          - notcrawl
---
-
-# Notcrawl
-
-Use local Notion archive data before browsing or live Notion API calls. Check freshness for recent/current questions:
-
-```bash
-notcrawl doctor
-notcrawl status --json
-```
-
-Refresh only when stale or asked:
-
-```bash
-notcrawl sync --source desktop
-notcrawl sync --source api
-```
-
-Query with bounded reads:
-
-```bash
-notcrawl search "query"
-notcrawl databases
-notcrawl report
-notcrawl sql "select count(*) from pages;"
-```
-
-Report workspace/teamspace, page/database titles, absolute date spans, counts, and known gaps. Use read-only SQL only; never mutate the archive. API mode requires `NOTION_TOKEN`; do not assume token availability.
--- a/.agents/skills/notcrawl/agents/openai.yaml
+++ b/.agents/skills/notcrawl/agents/openai.yaml
@@ -1,4 +0,0 @@
-interface:
-  display_name: "Notcrawl"
-  short_description: "Search local Notion archives and freshness"
-  default_prompt: "Use $notcrawl to search local Notion pages and databases, check freshness, inspect exports, and report exact date spans and source gaps."
--- a/.agents/skills/openclaw-changelog-update/SKILL.md
+++ b/.agents/skills/openclaw-changelog-update/SKILL.md
@@ -1,137 +0,0 @@
---
-name: openclaw-changelog-update
-description: Regenerate OpenClaw release changelog sections from git history before beta or stable releases.
---
-
-# OpenClaw Changelog Update
-
-Use this for release changelog rewrites and GitHub release-note source text.
-This is mandatory before every beta, beta rerun, stable release, or stable
-rerun. Use it with `release-openclaw-maintainer`; this skill owns changelog
-content, ordering, grouping, and attribution discipline.
-
-## Goal
-
-Rewrite the target `CHANGELOG.md` version section from history, not from stale
-draft notes. Produce grouped user-facing release notes sorted by user interest
-while preserving every relevant issue/PR ref and every human `Thanks @...`
-attribution.
-
-## Inputs
-
- Target base version: `YYYY.M.PATCH`, without beta suffix.
- Base tag: last reachable shipped release tag, usually the previous stable or
-  the previous beta train requested by the operator.
- Target ref: exact branch/SHA being released.
-
-## Workflow
-
-1. Start on `main` before branching when possible:
-   - `git fetch --tags origin`
-   - `git pull --ff-only`
-   - confirm clean `git status -sb`
-2. Audit history, including direct commits:
-   - `git log --first-parent --date=iso-strict --pretty=format:'%h%x09%ad%x09%s' <base-tag>..<target-ref>`
-   - `git log --first-parent --grep='(#' --date=short --pretty=format:'%h%x09%ad%x09%s' <base-tag>..<target-ref>`
-   - also inspect `--since='24 hours ago'` when main moved during the release.
-3. Read linked PRs/issues or diffs for ambiguous commits. Direct commits matter;
-   infer notes from subject, body, touched files, tests, and nearby commits.
-4. Rewrite one stable-base section only:
-   - use `## YYYY.M.PATCH`
-   - do not create beta-specific headings
-   - do not leave a stale `## Unreleased` section above the target release
-   - if `Unreleased` contains release-bound notes, fold them into the target
-     section instead of deleting them
-5. Section shape:
-   - `### Highlights`: 5-8 bullets, broad user wins first
-   - `### Changes`: new capabilities and behavior changes
-   - `### Fixes`: user-facing fixes first, grouped by impact and surface
-   - group related changes/fixes by surface and user impact; avoid one bullet
-     per tiny commit when several commits tell one user-facing story
-6. Preserve attribution:
-   - keep `#issue`, `(#PR)`, `Fixes #...`, and `Thanks @...`
-   - every human-authored merged PR represented by a user-facing entry needs
-     its PR ref and `Thanks @author`, even when the PR had no linked issue
-   - every human issue reporter for a `Fixes #...` or referenced bug issue
-     represented by a user-facing entry needs `Thanks @reporter` unless the
-     same handle is already thanked in that bullet
-   - every human `Co-authored-by` contributor on represented user-facing work
-     needs `Thanks @handle` when a GitHub handle is known
-   - when grouping multiple PRs/issues in one bullet, include every relevant
-     PR/issue ref and every human contributor handle in that same bullet
-   - multiple `Thanks @...` handles in one bullet are expected; do not drop or
-     collapse contributor credit just because the note is grouped
-   - if one grouped bullet covers both direct commits and PRs, keep all PR refs
-     and thanks, plus any issue refs from the direct commits
-   - before finalizing, audit the final release-note body:
-     - extract all `#NNN` refs from the notes
-     - resolve which refs are PRs and collect human PR authors
-     - resolve issue refs used as bug/report refs and collect human reporters
-     - scan represented commits for `Co-authored-by`
-     - compare those handles to the final `Thanks @...` set
-     - fix every missing human credit or explicitly record why it is omitted
-   - do not add GHSA references, advisory IDs, or security advisory slugs to
-     changelog entries or GitHub release-note text unless explicitly requested
-   - never thank bots, `@openclaw`, `@clawsweeper`, or `@steipete`
-   - do not use GitHub's release contributor count as the source of truth; the
-     changelog must carry the complete human credit set itself
-7. Sorting preference:
-   - security/data-loss and content-boundary fixes
-   - transcript/replay/reply delivery correctness
-   - channels and mobile integrations
-   - providers/Codex/local model reliability
-   - install/update/release path reliability
-   - performance and observability
-   - docs and contributor-only/internal details last or omitted
-8. Keep bullets single-line unless existing file style forces otherwise. Avoid
-   internal release-process noise unless it changes user install/update safety.
-9. Check release-note side conditions:
-   - inspect `src/plugins/compat/registry.ts`
-   - inspect `src/commands/doctor/shared/deprecation-compat.ts`
-   - if any compatibility `removeAfter` is on/before release date, resolve it
-     or explicitly record the blocker before shipping
-10. Validate and ship:
-   - generate and verify the complete contribution ledger before committing:
-     ```bash
-     node .agents/skills/openclaw-changelog-update/scripts/verify-release-notes.mjs \
-       --base <base-tag> \
-       --target <target-ref> \
-       --version <YYYY.M.PATCH> \
-       --write-ledger
-     ```
-   - the command fails when any `#NNN` reference in release history or the
-     rendered release section is absent from the ledger, when reverted work is
-     presented as shipped, or when an eligible PR author, issue reporter, or
-     known co-author is missing from that entry's `Thanks @...` credit
-   - after the GitHub release or prerelease is published, verify every matching
-     release page against the same source section:
-     ```bash
-     node .agents/skills/openclaw-changelog-update/scripts/verify-release-notes.mjs \
-       --base <base-tag> \
-       --target <target-ref> \
-       --version <YYYY.M.PATCH> \
-       --release-tag v<YYYY.M.PATCH> \
-       --check-github
-     ```
-   - add one `--release-tag` for every beta and stable page in the train; a
-     `### Release verification` tail is permitted, but any other body drift
-     fails the check; the GitHub body must begin with the complete
-     `## YYYY.M.PATCH` changelog section, including its heading
-   - `git diff --check`
-   - for docs/changelog-only changes, no broad tests are required
-   - commit with `scripts/committer "docs(changelog): refresh YYYY.M.PATCH notes" CHANGELOG.md`
-   - push, pull/rebase if needed, then branch/rebase release from latest `main`
-
-## Quota / API Outage Rule
-
-If GitHub API quota is exhausted, do not idle. Continue work that does not need
-GitHub API:
-
-- local changelog rewrite and release-note extraction
-- local pretag checks and package/build sanity
-- git push/tag checks over git protocol
-- npm registry `npm view` checks
-- exact workflow-dispatch command preparation
-
-Only GitHub Release creation, workflow dispatch, run polling, artifact download,
-and issue/PR mutation need API quota.
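The attribution audit in step 6 of the removed skill (extract `#NNN` refs, collect the expected human handles, compare them with the final `Thanks @...` set) can be sketched roughly as follows. This is a minimal illustration, not the repo's `verify-release-notes.mjs`: the helper names `refsIn`, `thankedIn`, and `missingCredits` are hypothetical, the sample bullet and handles are made up, and the expected-handle map is assumed to have been resolved from GitHub beforehand.

```javascript
// Hypothetical helpers for illustration; handles that are never thanked per the skill text.
const EXCLUDED = new Set(["openclaw", "clawsweeper", "steipete"]);

// Collect every #NNN reference in a release-note bullet, without duplicates.
function refsIn(text) {
  return [...new Set([...text.matchAll(/#(\d+)/g)].map((m) => Number(m[1])))];
}

// Collect handles credited after "Thanks", including "@a, @b" and "@a and @b" lists.
function thankedIn(text) {
  const handles = new Set();
  for (const clause of text.matchAll(/Thanks\s+((?:@[\w-]+(?:\s*(?:,|and)\s*)?)+)/gi)) {
    for (const m of clause[1].matchAll(/@([\w-]+)/g)) {
      handles.add(m[1].toLowerCase());
    }
  }
  return handles;
}

// `expected` maps a reference number to the human handles that should be credited.
function missingCredits(bullet, expected) {
  const thanked = thankedIn(bullet);
  const missing = [];
  for (const ref of refsIn(bullet)) {
    for (const handle of expected.get(ref) ?? []) {
      const key = handle.toLowerCase();
      if (!EXCLUDED.has(key) && !key.endsWith("[bot]") && !thanked.has(key)) {
        missing.push(`#${ref}: @${handle}`);
      }
    }
  }
  return missing;
}

// Made-up sample data: #102 was authored by @bob, who is not thanked in the bullet.
const bullet = "- Telegram: keep streamed previews in one message (#101, #102). Thanks @alice.";
const expected = new Map([[101, ["alice"]], [102, ["bob", "steipete"]]]);
console.log(missingCredits(bullet, expected)); // [ '#102: @bob' ]
```

Checking per bullet rather than per release keeps grouped notes honest: a handle thanked in one bullet does not cover a PR that is represented in another.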
--- a/.agents/skills/openclaw-changelog-update/scripts/verify-release-notes.mjs
+++ b/.agents/skills/openclaw-changelog-update/scripts/verify-release-notes.mjs
@@ -1,443 +0,0 @@
-#!/usr/bin/env node
-
-import { execFileSync } from "node:child_process";
-import { readFileSync, writeFileSync } from "node:fs";
-
-const repo = "openclaw/openclaw";
-const excludedHandles = new Set(["openclaw", "clawsweeper", "codex", "steipete"]);
-
-function fail(message) {
-  throw new Error(message);
-}
-
-function parseArgs(argv) {
-  const options = {
-    releaseTags: [],
-    checkGithub: false,
-    json: false,
-    writeLedger: false,
-  };
-
-  for (let index = 0; index < argv.length; index += 1) {
-    const arg = argv[index];
-    if (arg === "--check-github" || arg === "--json" || arg === "--write-ledger") {
-      options[
-        arg === "--check-github"
-          ? "checkGithub"
-          : arg === "--write-ledger"
-            ? "writeLedger"
-            : "json"
-      ] = true;
-      continue;
-    }
-    if (arg === "--base" || arg === "--target" || arg === "--version" || arg === "--release-tag") {
-      const value = argv[index + 1];
-      if (!value || value.startsWith("--")) {
-        fail(`missing value for ${arg}`);
-      }
-      if (arg === "--release-tag") {
-        options.releaseTags.push(value);
-      } else {
-        options[arg.slice(2)] = value;
-      }
-      index += 1;
-      continue;
-    }
-    fail(`unknown argument: ${arg}`);
-  }
-
-  for (const name of ["base", "target", "version"]) {
-    if (!options[name]) {
-      fail(`--${name} is required`);
-    }
-  }
-  if (options.checkGithub && options.releaseTags.length === 0) {
-    fail("--check-github requires at least one --release-tag");
-  }
-  return options;
-}
-
-function run(command, args) {
-  return execFileSync(command, args, {
-    encoding: "utf8",
-    env: { ...process.env, NO_COLOR: "1" },
-    stdio: ["ignore", "pipe", "pipe"],
-  });
-}
-
-function git(args) {
-  return run("git", args).trimEnd();
-}
-
-function githubApi(args) {
-  try {
-    return JSON.parse(run("ghx", ["api", ...args]).replace(/\u001B\[[0-?]*[ -/]*[@-~]/g, ""));
-  } catch (error) {
-    if (typeof error.stdout === "string" && error.stdout.trim() !== "") {
-      return JSON.parse(error.stdout.replace(/\u001B\[[0-?]*[ -/]*[@-~]/g, ""));
-    }
-    throw error;
-  }
-}
-
-function escapeRegExp(value) {
-  return value.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
-}
-
-function isEligibleHandle(handle) {
-  return Boolean(handle) && !handle.endsWith("[bot]") && !excludedHandles.has(handle.toLowerCase());
-}
-
-function sectionFor(changelog, version) {
-  const heading = new RegExp(`^## ${escapeRegExp(version)}\\r?$`, "m").exec(changelog);
-  if (!heading || heading.index === undefined) {
-    fail(`CHANGELOG.md does not contain ## ${version}`);
-  }
-  const start = heading.index;
-  const bodyStart = changelog.indexOf("\n", start) + 1;
-  const next = /^## /gm;
-  next.lastIndex = bodyStart;
-  const nextHeading = next.exec(changelog);
-  const end = nextHeading?.index ?? changelog.length;
-  return {
-    start,
-    end,
-    source: changelog.slice(start, end).trimEnd(),
-    body: changelog.slice(bodyStart, end).trim(),
-  };
-}
-
-function referencesIn(text) {
-  return [...text.matchAll(/#(\d+)/g)].map((match) => Number(match[1]));
-}
-
-function appendReferences(references, additions) {
-  const seen = new Set(references);
-  for (const number of additions) {
-    if (!seen.has(number)) {
-      references.push(number);
-      seen.add(number);
-    }
-  }
-}
-
-function sourceCommits(base, target) {
-  const mergeBase = git(["merge-base", base, target]);
-  const output = git([
-    "log",
-    "--first-parent",
-    "--reverse",
-    "--format=%H%x1f%s%x1f%B%x1e",
-    `${mergeBase}..${target}`,
-  ]);
-  const commits = new Map();
-  const revertsByTarget = new Map();
-  for (const record of output.split("\x1e")) {
-    if (!record) {
-      continue;
-    }
-    const [rawHash, subject, ...bodyParts] = record.split("\x1f");
-    const hash = rawHash.trim();
-    const body = bodyParts.join("\x1f");
-    const revertedHash = body.match(/This reverts commit ([0-9a-f]{7,40})\./i)?.[1];
-    const isRevert = subject.startsWith('Revert "') || Boolean(revertedHash);
-    commits.set(hash, { body, hash, isRevert, revertedHash, subject });
-  }
-  for (const commit of commits.values()) {
-    if (!commit.revertedHash) {
-      continue;
-    }
-    const targetHash = [...commits.keys()].find((candidate) => candidate.startsWith(commit.revertedHash));
-    if (targetHash) {
-      const reverts = revertsByTarget.get(targetHash) ?? [];
-      reverts.push(commit.hash);
-      revertsByTarget.set(targetHash, reverts);
-    }
-  }
-  const active = new Map();
-  function isActive(hash) {
-    if (active.has(hash)) {
-      return active.get(hash);
-    }
-    const cancellingReverts = revertsByTarget.get(hash) ?? [];
-    const value = !cancellingReverts.some((revertHash) => isActive(revertHash));
-    active.set(hash, value);
-    return value;
-  }
-
-  const references = [];
-  const revertedReferences = new Set();
-  const coauthorsByReference = new Map();
-  for (const commit of commits.values()) {
-    if (commit.isRevert) {
-      continue;
-    }
-    const uniqueReferences = [...new Set(referencesIn(`${commit.subject}\n${commit.body}`))];
-    if (!isActive(commit.hash)) {
-      for (const number of uniqueReferences) {
-        revertedReferences.add(number);
-      }
-      continue;
-    }
-    appendReferences(references, uniqueReferences);
-    const coauthors = [...commit.body.matchAll(/<(?:(?:\d+)\+)?([^@<>\s]+)@users\.noreply\.github\.com>/gi)]
-      .map((match) => match[1])
-      .filter(isEligibleHandle);
-    for (const number of uniqueReferences) {
-      if (coauthors.length > 0) {
-        const handles = coauthorsByReference.get(number) ?? new Set();
-        for (const handle of coauthors) {
-          handles.add(handle);
-        }
-        coauthorsByReference.set(number, handles);
-      }
-    }
-  }
-
-  return { mergeBase, references, revertedReferences, coauthorsByReference };
-}
-
-function graphql(query) {
-  return githubApi(["graphql", "-f", `query=${query}`]).data;
-}
-
-function resolveReferences(numbers) {
-  const nodes = new Map();
-  for (let index = 0; index < numbers.length; index += 40) {
-    const chunk = numbers.slice(index, index + 40);
-    const fields = chunk
-      .map(
-        (number) => `n${number}: repository(owner: "openclaw", name: "openclaw") {
-          issueOrPullRequest(number: ${number}) {
-            __typename
-            ... on Issue { number title author { __typename login } }
-            ... on PullRequest { number title author { __typename login } }
-          }
-        }`,
-      )
-      .join("\n");
-    const data = graphql(`query { ${fields} }`);
-    for (const number of chunk) {
-      const node = data[`n${number}`]?.issueOrPullRequest;
-      if (node) {
-        nodes.set(number, node);
-      }
-    }
-  }
-  return nodes;
-}
-
-function resolveCoauthors(handles) {
-  const resolved = new Map();
-  const uniqueHandles = [...new Set(handles)];
-  for (let index = 0; index < uniqueHandles.length; index += 80) {
-    const chunk = uniqueHandles.slice(index, index + 80);
-    const fields = chunk
-      .map(
-        (handle, offset) =>
-          `u${index + offset}: user(login: ${JSON.stringify(handle)}) { __typename login }`,
-      )
-      .join("\n");
-    const data = graphql(`query { ${fields} }`);
-    for (let offset = 0; offset < chunk.length; offset += 1) {
-      const user = data[`u${index + offset}`];
-      if (user?.__typename === "User" && isEligibleHandle(user.login)) {
-        resolved.set(chunk[offset].toLowerCase(), user.login);
-      }
-    }
-  }
-  return resolved;
-}
-
-function thanksFor(node, coauthorHandles) {
-  const handles = [];
-  if (node.author?.__typename === "User" && isEligibleHandle(node.author.login)) {
-    handles.push(node.author.login);
-  }
-  for (const handle of coauthorHandles) {
-    if (!handles.some((candidate) => candidate.toLowerCase() === handle.toLowerCase())) {
-      handles.push(handle);
-    }
-  }
-  return handles;
-}
-
-function ledgerFor(base, target, references, nodes, coauthorsByReference, resolvedCoauthors) {
-  const missing = references.filter((number) => !nodes.has(number));
-  if (missing.length > 0) {
-    fail(`GitHub could not resolve source references: ${missing.map((number) => `#${number}`).join(", ")}`);
-  }
-
-  const entries = references.map((number) => {
-    const node = nodes.get(number);
-    const rawCoauthors = coauthorsByReference.get(number) ?? new Set();
-    const coauthors = [...rawCoauthors]
-      .map((handle) => resolvedCoauthors.get(handle.toLowerCase()))
-      .filter(Boolean);
-    return {
-      number,
-      title: node.title.replace(/#(\d+)/g, "issue $1").replace(/\s+/g, " ").trim(),
-      type: node.__typename,
-      thanks: thanksFor(node, coauthors),
-    };
-  });
-
-  const pullRequests = entries.filter((entry) => entry.type === "PullRequest");
-  const issues = entries.filter((entry) => entry.type === "Issue");
-  const renderEntry = (entry, issue = false) => {
-    const attribution = entry.thanks.length > 0 ? ` Thanks ${entry.thanks.map((handle) => `@${handle}`).join(" and ")}.` : "";
-    return `- ${issue ? "Reported: " : ""}${entry.title} (#${entry.number}).${attribution}`;
-  };
-  const ledger = [
-    "### Complete contribution ledger",
-    "",
-    `This audited record covers the complete ${base}..${target} history: ${pullRequests.length} PRs and ${issues.length} linked issues. The grouped notes above prioritize user impact; this ledger preserves every contribution reference and eligible human credit.`,
-    "",
-    "#### Pull requests",
-    "",
-    ...pullRequests.map((entry) => renderEntry(entry)),
-    "",
-    "#### Linked issues",
-    "",
-    ...issues.map((entry) => renderEntry(entry, true)),
-  ].join("\n");
-  return { entries, issues, ledger, pullRequests };
-}
-
-function replaceLedger(changelog, section, ledger) {
-  const beforeLedger = section.source.replace(/\n+### Complete contribution ledger[\s\S]*$/m, "").trimEnd();
-  const replacement = `${beforeLedger}\n\n${ledger}\n`;
-  return `${changelog.slice(0, section.start)}${replacement}${changelog.slice(section.end)}`;
-}
-
-function ledgerChecks(section, entries) {
-  const errors = [];
-  if (!section.source.includes("### Highlights")) {
-    errors.push("missing ### Highlights");
-  }
-  if (!section.source.includes("### Changes")) {
-    errors.push("missing ### Changes");
-  }
-  if (!section.source.includes("### Fixes")) {
-    errors.push("missing ### Fixes");
-  }
-  const ledgerStart = section.source.indexOf("### Complete contribution ledger");
-  if (ledgerStart < 0) {
-    errors.push("missing ### Complete contribution ledger");
-    return errors;
-  }
-  const ledger = section.source.slice(ledgerStart);
-  const entryNumbers = new Set(entries.map((entry) => entry.number));
-  for (const number of new Set(referencesIn(section.source))) {
-    if (!entryNumbers.has(number)) {
-      errors.push(`missing ledger entry for #${number}`);
-    }
-  }
-  for (const entry of entries) {
-    const prefix = entry.type === "Issue" ? "- Reported: " : "- ";
-    const line = ledger
-      .split("\n")
-      .find((candidate) => candidate.startsWith(prefix) && candidate.includes(`(#${entry.number})`));
-    if (!line) {
-      errors.push(`missing ledger entry for #${entry.number}`);
-      continue;
-    }
-    for (const handle of entry.thanks) {
-      if (!line.toLowerCase().includes(`@${handle.toLowerCase()}`)) {
-        errors.push(`missing Thanks @${handle} for #${entry.number}`);
-      }
-    }
-  }
-  return errors;
-}
-
-function releaseChecks(section, releaseTags) {
-  const expected = section.source;
-  const checks = [];
-  for (const tag of releaseTags) {
-    const release = githubApi([`repos/${repo}/releases/tags/${encodeURIComponent(tag)}`]);
-    // A release published without notes has a null body; treat it as empty.
-    const body = release.body ?? "";
-    const suffix = body.slice(expected.length).trimStart();
-    const matches =
-      body.startsWith(expected) && (suffix === "" || suffix.startsWith("### Release verification"));
-    checks.push({
-      tag,
-      releaseId: release.id,
-      matches,
-      bodyLength: body.length,
-  }
-  return checks;
-}
-
-function main() {
-  const options = parseArgs(process.argv.slice(2));
-  let changelog = readFileSync("CHANGELOG.md", "utf8");
-  let section = sectionFor(changelog, options.version);
-  const source = sourceCommits(options.base, options.target);
-  const preexistingNotes = section.source.replace(/\n+### Complete contribution ledger[\s\S]*$/m, "");
-  const noteReferences = referencesIn(preexistingNotes);
-  const revertedNoteReferences = noteReferences.filter((number) => source.revertedReferences.has(number));
-  if (revertedNoteReferences.length > 0) {
-    fail(
-      `release notes reference reverted work: ${[
-        ...new Set(revertedNoteReferences),
-      ]
-        .map((number) => `#${number}`)
-        .join(", ")}`,
-    );
-  }
-  const references = [...source.references];
-  appendReferences(references, noteReferences);
-  const nodes = resolveReferences(references);
-  const coauthorHandles = [...source.coauthorsByReference.values()].flatMap((handles) => [...handles]);
-  const resolvedCoauthors = resolveCoauthors(coauthorHandles);
-  const ledger = ledgerFor(
-    options.base,
-    options.target,
-    references,
-    nodes,
-    source.coauthorsByReference,
-    resolvedCoauthors,
-  );
-
-  if (options.writeLedger) {
-    changelog = replaceLedger(changelog, section, ledger.ledger);
-    writeFileSync("CHANGELOG.md", changelog);
-    section = sectionFor(changelog, options.version);
-  }
-
-  const errors = ledgerChecks(section, ledger.entries);
-  const github = options.checkGithub ? releaseChecks(section, options.releaseTags) : [];
-  for (const check of github) {
-    if (!check.matches) {
-      errors.push(`GitHub release ${check.tag} does not match the ${options.version} CHANGELOG section`);
-    }
-  }
-
-  const result = {
-    base: options.base,
-    target: options.target,
-    mergeBase: source.mergeBase,
-    version: options.version,
-    source: {
-      references: references.length,
-      pullRequests: ledger.pullRequests.length,
-      issues: ledger.issues.length,
-    },
-    github,
-    errors,
-  };
-  if (options.json) {
-    process.stdout.write(`${JSON.stringify(result, null, 2)}\n`);
-  } else {
-    process.stdout.write(
-      `${options.version}: ${ledger.pullRequests.length} PRs, ${ledger.issues.length} issues, ${errors.length === 0 ? "verified" : `${errors.length} errors`}\n`,
-    );
-  }
-  if (errors.length > 0) {
-    process.exitCode = 1;
-  }
-}
-
-main();
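The revert handling in `sourceCommits` above treats a commit as cancelled only when one of its reverts is itself still active, so a revert of a revert brings the original work back. A minimal standalone sketch of that rule, assuming a simplified commit shape (`{ hash, reverts }`, a hypothetical structure rather than the script's own records) and an acyclic revert graph:

```javascript
// commits: array of { hash, reverts }, where `reverts` is the hash a commit undoes (or undefined).
function activeHashes(commits) {
  // Index reverts by the commit they target.
  const revertedBy = new Map();
  for (const { hash, reverts } of commits) {
    if (reverts) {
      revertedBy.set(reverts, [...(revertedBy.get(reverts) ?? []), hash]);
    }
  }
  const memo = new Map();
  const isActive = (hash) => {
    if (!memo.has(hash)) {
      // A commit is cancelled only by a revert that is itself still active.
      memo.set(hash, !(revertedBy.get(hash) ?? []).some(isActive));
    }
    return memo.get(hash);
  };
  return commits.filter(({ hash }) => isActive(hash)).map(({ hash }) => hash);
}

// Made-up history for illustration.
const history = [
  { hash: "a1" }, // original change
  { hash: "b2", reverts: "a1" }, // revert of a1
  { hash: "c3", reverts: "b2" }, // revert of the revert: a1 is live again
  { hash: "d4" },
  { hash: "e5", reverts: "d4" }, // d4 stays reverted
];
console.log(activeHashes(history)); // [ 'a1', 'c3', 'e5' ]
```

Unlike this sketch, the script also skips the revert commits themselves when collecting references, so only `a1` would contribute `#NNN` refs in the example.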
--- a/.agents/skills/openclaw-debugging/SKILL.md
+++ b/.agents/skills/openclaw-debugging/SKILL.md
@@ -1,114 +0,0 @@
----
-name: openclaw-debugging
-description: Debug OpenClaw model, provider, tool-surface, code-mode, streaming, and live/Crabbox behavior by choosing the right logs, probes, and proof path before changing code.
----
-
-# OpenClaw Debugging
-
-Use this skill when OpenClaw behavior differs between local tests, live models,
-providers, code mode, Tool Search, Crabbox, or CI, and the next move should be a
-debug signal rather than a guess.
-
-## Read First
-
-- `docs/logging.md` for log files, `openclaw logs`, and targeted debug flags.
-- `docs/reference/test.md` for local test commands.
-- `docs/reference/code-mode.md` for code-mode exec/wait and tool catalog rules.
-- Use `$openclaw-testing` for choosing test lanes.
-- Use `$crabbox` for broad, Docker, package, Linux, live-key, or CI-parity proof.
-
-## Default Loop
-
-1. State the suspected boundary: config, tool construction, provider payload,
-   fetch, stream/SSE, transcript replay, worker/runtime, package/dist, or CI.
-2. Add or enable the narrowest signal that proves that boundary.
-3. Reproduce with the same provider/model/config. Do not randomly switch models
-   unless the model itself is the variable being tested.
-4. Compare configured state with actual run activation.
-5. Patch the root cause.
-6. Rerun the exact failing probe, then broaden only if the contract requires it.
-
-## Model Transport Logs
-
-Use targeted env flags instead of global debug when the model request shape or
-stream timing matters:
-
-```bash
-OPENCLAW_DEBUG_MODEL_TRANSPORT=1 openclaw gateway
-OPENCLAW_DEBUG_MODEL_PAYLOAD=tools OPENCLAW_DEBUG_SSE=events openclaw gateway
-OPENCLAW_DEBUG_MODEL_PAYLOAD=full-redacted OPENCLAW_DEBUG_SSE=peek openclaw gateway
-```
-
-Useful flags:
-
-- `OPENCLAW_DEBUG_MODEL_TRANSPORT=1`: request start, fetch response, SDK
-  headers, first SSE event, stream done, and transport errors at `info`.
-- `OPENCLAW_DEBUG_MODEL_PAYLOAD=summary`: bounded payload summary.
-- `OPENCLAW_DEBUG_MODEL_PAYLOAD=tools`: all model-facing tool names.
-- `OPENCLAW_DEBUG_MODEL_PAYLOAD=full-redacted`: capped, redacted JSON payload.
-  Use only while debugging; prompts/message text may still appear.
-- `OPENCLAW_DEBUG_SSE=events`: first-event and stream-completion timing.
-- `OPENCLAW_DEBUG_SSE=peek`: first five redacted SSE events.
-- `OPENCLAW_DEBUG_CODE_MODE=1`: code-mode tool-surface diagnostics.
-
-Watch logs with:
-
-```bash
-openclaw logs --follow
-```
-
-## Common Boundaries
-
-- **Config vs activation:** config can be enabled while the run disables tools,
-  is raw, has an empty allowlist, or lacks model tool support. Check the actual
-  visible tools before enforcing provider payload invariants.
-- **Tool surface:** inspect final model-visible tool names, not only the tool
-  registry or config. Code mode exposes exactly `exec` and `wait`, but only
-  after it actually activates.
-- **Provider payload:** log fields, model id, service tier, reasoning, input
-  size, metadata keys, prompt-cache key presence, and tool names before SDK
-  call.
-- **Fetch vs SSE:** fetch response proves HTTP headers arrived; first SSE event
-  proves provider body progress. A gap here is a stream/body/provider issue, not
-  tool execution.
-- **Worker/dist:** run `pnpm build` when touching workers, dynamic imports,
-  package exports, lazy runtime boundaries, or published paths.
-- **Live keys:** use the configured secret workflow for missing provider keys
-  before saying live proof is blocked. Env checks are presence-only; never print
-  secrets.
-
-## Code Pointers
-
-- Model payload + Responses stream:
-  `src/agents/openai-transport-stream.ts`
-- Guarded fetch/timing:
-  `src/agents/provider-transport-fetch.ts`
-- OpenAI/Codex provider wrappers:
-  `src/agents/pi-embedded-runner/openai-stream-wrappers.ts`
-- Tool construction, Tool Search, code-mode activation:
-  `src/agents/pi-embedded-runner/run/attempt.ts`
-- Code-mode runtime and worker:
-  `src/agents/code-mode.ts`
-  `src/agents/code-mode.worker.ts`
-- Tool Search catalog:
-  `src/agents/tool-search.ts`
-
-## Proof Choice
-
-- Single helper/payload bug: local targeted Vitest.
-- Docs/logging-only: `pnpm check:docs` and `git diff --check`.
-- Worker/dist/lazy import/package surface: targeted tests plus `pnpm build`.
-- Live provider/model behavior: same provider/model with debug flags and a real
-  key if available.
-- Docker/package/Linux/CI-parity: `$crabbox`.
-- CI failure: exact SHA, relevant job only, logs only after failure/completion.
-
-## Output Habit
-
-Report:
-
-- boundary tested
-- exact command/env shape, redacted
-- observed signal, such as tool names or first SSE event timing
-- fix location
-- narrow proof and any remaining risk
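The "Fetch vs SSE" boundary in the removed debugging skill can be pictured as a small classifier over transport timestamps. The sketch below is an illustration under stated assumptions: the field names (`requestStart`, `fetchResponse`, `firstSseEvent`, `streamDone`) and the returned labels are hypothetical and are not OpenClaw log fields.

```javascript
// Hypothetical timing record: millisecond timestamps, undefined when the event never arrived.
function stalledBoundary({ requestStart, fetchResponse, firstSseEvent, streamDone } = {}) {
  if (requestStart === undefined) return "not-started";
  if (fetchResponse === undefined) return "fetch"; // no HTTP headers yet
  if (firstSseEvent === undefined) return "stream-body"; // headers arrived, no provider body progress
  if (streamDone === undefined) return "mid-stream"; // events flowed, stream never completed
  return "complete";
}

// Headers arrived after 120 ms, but no SSE event followed:
// a stream/body/provider issue rather than tool execution.
console.log(stalledBoundary({ requestStart: 0, fetchResponse: 120 })); // stream-body
```

Naming the stalled boundary first keeps the next step narrow: only the signal for that boundary needs to be enabled before reproducing.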
--- a/.agents/skills/openclaw-debugging/agents/openai.yaml
+++ b/.agents/skills/openclaw-debugging/agents/openai.yaml
@@ -1,4 +0,0 @@
-interface:
-  display_name: "OpenClaw Debugging"
-  short_description: "Debug model, tool, stream, and live behavior"
-  default_prompt: "Use $openclaw-debugging to identify the right OpenClaw debug boundary, turn on targeted logs, and choose the narrowest local or Crabbox proof."
--- a/.agents/skills/openclaw-docker-e2e-authoring/SKILL.md
+++ b/.agents/skills/openclaw-docker-e2e-authoring/SKILL.md
@@ -1,64 +0,0 @@
----
-name: openclaw-docker-e2e-authoring
-description: "Author OpenClaw Docker E2E and live provider Docker lanes."
----
-
-# OpenClaw Docker E2E Authoring
-
-Use this when adding or changing Docker E2E lanes, release-path Docker tests,
-or live-provider Docker proof.
-
-## Lane Choice
-
-- Deterministic Docker: fake the dependency/server and assert the exact runtime
-  contract crossing the boundary.
-- Live Docker: use real provider credentials/model only when user-visible
-  behavior needs the real service.
-- Prefer both when they prove different risks: deterministic for byte/payload
-  routing, live for actual provider behavior.
-
-## Authoring Rules
-
-- Test-only helpers live in `test/helpers` or `scripts/e2e/lib/<lane>/`, not
-  `src/**`, unless production imports them.
-- Package-installed app runs from `/app`; mount only explicit harness/helper
-  paths read-only.
-- Fake servers should log boundary requests as JSONL and clients should assert
-  the real dependency payload, not just process success.
-- Add the package script and `scripts/lib/docker-e2e-scenarios.mjs` lane in the
-  same change.
-- If a lane installs a plugin from npm, default the spec via env so published
-  and local override paths are both testable.
-
-## Media And Vision
-
-- Expected answer must exist only in pixels or provider output being tested.
-- Use neutral filenames, neutral prompts, and no metadata leaks.
-- Random bitmap/OCR tokens reuse the repo OCR-safe alphabet `24567ACEF` unless
-  the test owns a stronger glyph set.
-- Make the expected answer unique per run when proving real image
-  understanding.
-
-## `chat.send` E2E
-
-- Require `chat.send` to return `status: "started"` and a string `runId`.
-- Wait for completion with `agent.wait`.
-- Assert final user-visible text via `chat.history` when event ordering is not
-  the behavior under test.
-- Keep originating channel/account metadata only when the bug path needs queued
-  inbound/channel context.
-
-## Verification
-
-Run the smallest proof that covers the touched lane:
-
-```bash
-pnpm exec oxfmt --write <changed files>
-node --check <new .mjs files>
-bash -n <new .sh files>
-node scripts/run-vitest.mjs test/scripts/docker-e2e-plan.test.ts
-OPENCLAW_SKIP_DOCKER_BUILD=1 pnpm test:docker:<lane>
-```
-
-For real-provider lanes, run the matching live Docker script after deterministic
-Docker is green. Finish with `$autoreview` before commit/PR.
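The "Media And Vision" rules above ask for a per-run unique expected answer drawn from the OCR-safe alphabet `24567ACEF`. A minimal sketch of such a token generator follows; the function name `randomOcrToken`, the default length of 8, and the use of `Math.random` are assumptions made for illustration, not the repo's helper.

```javascript
// OCR-safe alphabet quoted in the skill text above.
const OCR_SAFE_ALPHABET = "24567ACEF";

// Build a token from the OCR-safe alphabet. `random` is injectable so tests can be deterministic.
// Math.random is not cryptographically secure; that is acceptable for a per-run test token.
function randomOcrToken(length = 8, random = Math.random) {
  let token = "";
  for (let i = 0; i < length; i += 1) {
    token += OCR_SAFE_ALPHABET[Math.floor(random() * OCR_SAFE_ALPHABET.length)];
  }
  return token;
}

const token = randomOcrToken();
console.log(token.length, /^[24567ACEF]+$/.test(token)); // 8 true
```

With nine glyphs and eight positions there are 9^8 = 43,046,721 possible tokens, so a model is unlikely to guess the answer without actually reading the rendered pixels.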
--- a/.agents/skills/openclaw-ghsa-maintainer/SKILL.md
+++ b/.agents/skills/openclaw-ghsa-maintainer/SKILL.md
@@ -1,11 +1,11 @@
 ---
 name: openclaw-ghsa-maintainer
-description: "Inspect, patch, validate, publish, or confirm OpenClaw GHSA security advisories and private-fork state."
+description: Maintainer workflow for OpenClaw GitHub Security Advisories (GHSA). Use when Codex needs to inspect, patch, validate, or publish a repo advisory, verify private-fork state, prepare advisory Markdown or JSON payloads safely, handle GHSA API-specific publish constraints, or confirm advisory publish success.
 ---

 # OpenClaw GHSA Maintainer

-Use this skill for repo security advisory workflow only. Keep general release work in `release-openclaw-maintainer`.
+Use this skill for repo security advisory workflow only. Keep general release work in `openclaw-release-maintainer`.

 ## Respect advisory guardrails

@@ -85,4 +85,3 @@ jq -r .description < /tmp/ghsa.refetch.json | rg '\\\\n'
 - Publishing fails with HTTP 422 if required fields are missing or the private fork still has open PRs.
 - A payload that looks correct in shell can still be wrong if Markdown was assembled with escaped newline strings.
 - Advisory PATCH sequencing matters; separate field updates when GHSA API constraints require it.
-- Public hardening/no-publish comments and draft text should avoid raw commit hashes, PR titles/numbers, and fix-mechanism summaries. Prefer patched-version fields or release-only wording; keep SHAs, PRs, and implementation notes in internal evidence.
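The escaped-newline pitfall noted above (Markdown assembled with literal `\n` strings instead of real line breaks) can be checked before sending a payload. The sketch below mirrors the intent of the `rg '\\\\n'` refetch check in plain JavaScript; `hasEscapedNewlines` is a hypothetical helper name and the sample description text is made up.

```javascript
// True when the text contains a backslash followed by "n" (two characters), not a real newline.
function hasEscapedNewlines(description) {
  return description.includes("\\n");
}

// Assembled with real newlines: JSON.stringify encodes them correctly on the wire.
const good = ["### Summary", "", "Details here."].join("\n");
// Assembled with escaped newline strings: renders as one line containing visible "\n" text.
const bad = "### Summary\\n\\nDetails here.";

const payload = JSON.stringify({ description: good });
console.log(hasEscapedNewlines(good), hasEscapedNewlines(bad)); // false true
console.log(hasEscapedNewlines(JSON.parse(payload).description)); // false
```

The check can produce a false positive when the advisory legitimately shows a `\n` sequence inside a code sample, so a hit is a prompt to inspect the text rather than proof of a broken payload.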
--- a/.agents/skills/openclaw-landable-bug-sweep/SKILL.md
+++ b/.agents/skills/openclaw-landable-bug-sweep/SKILL.md
@@ -1,165 +0,0 @@
----
-name: openclaw-landable-bug-sweep
-description: "Find or repair small high-confidence non-SDK-boundary OpenClaw bugfix PRs until five are landable."
----
-
-# OpenClaw Landable Bug Sweep
-
-Autonomous maintainer workflow for producing five landable OpenClaw bugfix PR URLs.
-Use for broad issue/PR sweeps where the bar is high and the output is PRs, not notes.
-Do not use for plugin SDK/API boundary work; those need separate architecture review.
-
-## Target
-
-Return exactly five PR URLs, each with:
-
-- bug summary
-- why the fix is low-risk
-- proof: rebased-head local/Testbox/live commands or run IDs
-- autoreview: clean result on the exact head being shown
-- CI green on the exact pushed PR head
-- issue/duplicate cleanup done or still pending
-
-The five URLs may be existing PRs that were reviewed/fixed, or new PRs created from issues/clusters.
-Do not present a PR URL to the maintainer until it has been refreshed on current `main`, left-tested, autoreviewed clean, pushed, and verified green in live GitHub CI.
-If code, tests, changelog, PR body, or branch base changes after autoreview, rerun autoreview before showing the URL.
-
-## Companion Skills
-
-Use `$gitcrawl` for discovery/clustering, `$openclaw-pr-maintainer` for live GitHub mutation rules, `$github-author-context` when contributor trust matters, `$openclaw-testing` for proof choice, `$autoreview` before publishing/landing, and `$crabbox` for broad/E2E/live proof.
-
-## Candidate Bar
-
-Accept only when all are true:
-
-- bug or paper cut, not feature/product/support/docs-only
-- root cause is proven in current code
-- dependency behavior checked via upstream docs/source/types when relevant
-- production/runtime diff is small: always below 500 LOC, ideally far less
-- tests may be larger, but focused
-- no new dependency
-- no new config option
-- no backward-incompatible behavior
-- no security/product/owner-boundary decision needed
-- no plugin SDK, public plugin API, or `src/plugin-sdk/**` boundary change
-- no broad refactor smell
-- focused proof is feasible
-- branch can be rebased/refreshed and pushed, or a replacement PR can be created
-
-Good examples:
-
-- provider parameter mismatch proven against dependency/API contract
-- CLI command diverges from adjacent command behavior
-- narrow runtime state/serialization bug with failing test
-- issue already fixed on current `main`, with proof and closeable duplicates
-
-Reject:
-
-- feature requests, new knobs, migrations, release work, workflow policy, support
-- plugin SDK/API boundary changes, including compatibility shims, new SDK methods, SDK exports, or plugin-facing channel/provider seams
-- auth/security boundary changes unless explicitly assigned
-- bugs needing live credentials that are unavailable
-- PRs with red CI unless you fix, rebase, push, and recheck them green
-- PRs you only reviewed locally but did not refresh/push/check live
-- PRs whose final head has not passed `$autoreview`
-- fixes whose clean shape is a larger architecture move
-- speculative reports without reproducible/provable cause
-- UI/UX changes requiring product judgment
-
-## Sweep Loop
-
-1. Start clean:
-   - `git status -sb`
-   - `git pull --ff-only`
-   - verify branch is expected, usually `main`
-2. Build candidate clusters:
-   - `gitcrawl` open issues/PRs, neighbors, and search
-   - live `gh issue/pr view`
-   - include PRs linked from issues and duplicates
-3. For each cluster:
-   - read issue/PR body, comments, labels, linked refs, current source, adjacent tests
-   - suppress maintainer-owned queue noise unless it is the best fix path
-   - identify opener/author and preserve credit
-   - decide: `repair-existing-pr`, `create-new-pr`, `close-fixed-on-main`, `close-duplicate`, or `reject`
-4. Prove before patching:
-   - failing test, focused repro, log/source proof, or dependency contract proof
-   - if already fixed on `main`, prove with current source/test/commit and close kindly
-5. Patch:
-   - prefer existing PR when good and writable
-   - if unwritable or wrong shape, create own PR and preserve useful contributor credit
-   - if no PR exists, create one
-   - add regression test when it fits
-   - put release-note context for user-facing fixes in the PR body or commit message; credit human reporter/contributor when known
-6. Review, refresh, and publish:
-   - rebase or otherwise refresh the PR branch on current `origin/main`
-   - resolve drift, including newly exposed CI failures, rather than counting the PR as ready
-   - do not add `CHANGELOG.md` during normal sweep PRs; release automation generates it from PRs and commits
-   - left-test the rebased head with the smallest meaningful local/Testbox/live command that proves the bug
-   - run `$autoreview` until no accepted/actionable findings remain before creating, updating, or presenting the PR URL
-   - create/update PR with real body and proof fields
-   - push the exact reviewed head
-   - verify live GitHub CI is green for that pushed head; do not count pending, red, dirty, conflicting, or externally blocked PRs in the five
-7. Hygiene:
-   - close duplicates and fixed-on-main issues/PRs with proof as soon as you notice them during the sweep
-   - never mutate more than five associated items in one cluster without explicit confirmation
-   - comments must be kind, concrete, and include proof/PR/commit links
-8. Repeat until five landable PR URLs are ready.
-
-## PR Body Proof
-
-Use the repo PR template. Include these exact labels:
-
-```text
-Behavior addressed:
-Real environment tested:
-Exact steps or command run after this patch:
-Evidence after fix:
-Observed result after fix:
-What was not tested:
-```
-
-## Existing PR Rules
-
-- Review code path beyond the diff before trusting it.
-- If PR is good: rebase/refresh on current `main`, fix small issues, left-test, autoreview clean, push, and get CI green before showing or counting it.
-- If PR is not good but has a useful idea: recreate locally, co-author when warranted, close original with thanks and explanation.
-- If PR is duplicate or fixed on `main`: comment proof, close.
-- If maintainer cannot push to contributor branch: create own branch/PR, preserve useful commits or credit.
-- If CI turns red after local proof, treat that as normal work: inspect the failing job, fix or reject, rerun, and only count the PR once green.
-
-## Output Ledger
-
-Maintain a running ledger:
-
-```text
-accepted:
-- PR URL:
-  source refs:
-  bug:
-  root cause:
-  fix:
-  risk:
-  rebase/head:
-  left-test:
-  autoreview:
-  CI:
-  credit/thanks:
-  cleanup:
-
-rejected:
-- ref:
-  reason:
-
-closed:
-- ref:
-  reason:
-  proof/comment:
-```
-
-Final answer:
-
-- exactly five accepted PR URLs
-- 2-4 sentence explainer per PR
-- proof/CI state per PR
-- closed duplicates/fixed-on-main refs
-- current branch/status
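
The counting rule in step 6 of the removed sweep loop (only PRs whose pushed head has green CI count toward the five) can be sketched in shell. This is a minimal illustration under assumed names: the `<pr-url> <ci-state>` ledger line format and the `count_landable` and `sweep_done` helpers are hypothetical and are not part of the repository.

```bash
#!/usr/bin/env bash
# Hypothetical ledger format: one "<pr-url> <ci-state>" pair per line on stdin.
# Only lines whose state is exactly "green" count; pending, red, dirty,
# conflicting, or blocked entries are ignored.
count_landable() {
  awk '$2 == "green" { n++ } END { print n + 0 }'
}

# Succeeds only when exactly five landable PRs are present on stdin.
sweep_done() {
  [ "$(count_landable)" -eq 5 ]
}
```

Piping the running ledger through `sweep_done` would give a simple stop condition for step 8, assuming the ledger is kept in that line format.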
--- a/.agents/skills/openclaw-landable-bug-sweep/agents/openai.yaml
+++ b/.agents/skills/openclaw-landable-bug-sweep/agents/openai.yaml
@@ -1,4 +0,0 @@
-interface:
-  display_name: "OpenClaw Landable Bug Sweep"
-  short_description: "Find five small non-SDK landable bugfix PRs"
-  default_prompt: "Use $openclaw-landable-bug-sweep to find or repair five small high-confidence non-SDK-boundary OpenClaw bugfix PRs and get them landable."
--- a/.agents/skills/openclaw-parallels-smoke/SKILL.md
+++ b/.agents/skills/openclaw-parallels-smoke/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: openclaw-parallels-smoke
-description: Run, rerun, debug, or interpret OpenClaw Parallels install, onboarding, gateway smoke, and upgrade checks.
+description: End-to-end Parallels smoke, upgrade, and rerun workflow for OpenClaw across macOS, Windows, and Linux guests. Use when Codex needs to run, rerun, debug, or interpret VM-based install, onboarding, gateway smoke tests, latest-release-to-main upgrade checks, fresh snapshot retests, or optional Discord roundtrip verification under Parallels.
 ---

 # OpenClaw Parallels Smoke
@@ -14,25 +14,8 @@ Use this skill for Parallels guest workflows and smoke interpretation. Do not lo
 - Stable `2026.3.12` pre-upgrade diagnostics may require a plain `gateway status --deep` fallback.
 - Treat `precheck=latest-ref-fail` on that stable pre-upgrade lane as baseline, not automatically a regression.
 - Pass `--json` for machine-readable summaries.
-- Per-phase logs land under `.artifacts/parallels/openclaw-parallels-*` by default. Override with `OPENCLAW_PARALLELS_ARTIFACT_ROOT` when a run needs another artifact volume.
+- Per-phase logs land under `/tmp/openclaw-parallels-*`.
 - Do not run local and gateway agent turns in parallel on the same fresh workspace or session.
-- Hard-cap every top-level Parallels lane with host `timeout --foreground` (or `gtimeout --foreground` if that is the available binary) so a stalled install, snapshot switch, or `prlctl exec` transport cannot consume the rest of the testing window. Defaults:
-  - macOS: `75m`
-  - Linux: `75m`
-  - Windows: `90m`
-  - aggregate npm-update wrapper: `150m`
-    If a lane hits the cap, stop there, inspect the newest `/tmp/openclaw-parallels-*` run directory and phase log, then fix or rerun the smallest affected lane. Do not keep waiting on a capped lane.
-- Actual OpenClaw npm install/update phases are a stricter signal than whole-lane caps: install phases should normally finish within 7 minutes, and update phases should normally show meaningful progress within 5 minutes. If a phase named `install-main`, `install-latest`, `install-baseline`, or `install-baseline-package` exceeds 420s, or a phase named `update-dev` / same-guest `openclaw update` exceeds 300s without new markers, start diagnosis from that phase log and guest process state. Current Windows update phases can still pass after roughly 10-15 minutes because `doctor --fix` may install bundled plugin runtime deps; keep the script hard cap near 20 minutes unless the log is truly stale.
-- For a full OS matrix, prefer running independent guest-family lanes in parallel when host capacity allows:
-  - `timeout --foreground 75m pnpm test:parallels:macos -- --json`
-  - `timeout --foreground 90m pnpm test:parallels:windows -- --json`
-  - `timeout --foreground 75m pnpm test:parallels:linux -- --json`
-    Keep each lane in its own shell/session and track the run directory for each one. Before starting the matrix, run any required host build/package gate to completion. When current-main tgz packaging is needed, the smoke scripts hold a shared package lock through `pnpm build`, inventory/staging, and `npm pack`; if that lock is missing or broken, serialize the matrix instead of accepting concurrent `dist` mutation.
-- Do not run multiple smoke lanes against the same guest family at once. Tahoe lanes share the host HTTP port, and Windows/Linux lanes can collide on snapshot restore/start state if two jobs touch the same VM concurrently.
-- Do not run the aggregate `pnpm test:parallels:npm-update` wrapper in parallel with individual macOS/Windows/Linux smoke lanes; it touches the same guest families and snapshots.
-- Do not start Parallels lanes while any unrelated host command may rebuild, clean, or restage `dist` (`pnpm build`, `pnpm ui:build`, `pnpm release:check`, `pnpm test:install:smoke`, npm pack/install smoke, or Docker lanes that run package/build prep). Run unrelated build/package gates first, let them finish, then start the VM matrix. Concurrent `dist` mutation can make host `npm pack` fail with missing files and wastes a full VM cycle.
-- While running or optimizing the matrix, record wall-clock duration per lane and the slowest phase from `/tmp/openclaw-parallels-*` logs. Use that timing before changing smoke order, timeouts, or helper behavior.
-- If a host build changes tracked generated files such as `src/canvas-host/a2ui/.bundle.hash`, stop before spending VM time. Commit the generated artifact separately or fix the generator drift, then rerun the smallest affected lane.
 - If `main` is moving under active multi-agent work, prefer a detached worktree pinned to one commit for long Parallels suites. The smoke scripts now verify the packed tgz commit instead of live `git rev-parse HEAD`, but a pinned worktree still avoids noisy rebuild/version drift during reruns.
 - For `openclaw update --channel dev` lanes, remember the guest clones GitHub `main`, not your local worktree. If a local fix exists but the rerun still fails inside the cloned dev checkout, do not treat that as disproof of the fix until the branch has been pushed.
 - For `prlctl exec`, pass the VM name before `--current-user` (`prlctl exec "$VM" --current-user ...`), not the other way around.
@@ -45,39 +28,16 @@ Use this skill for Parallels guest workflows and smoke interpretation. Do not lo
 ## npm install then update

 - Preferred entrypoint: `pnpm test:parallels:npm-update`
-- For a macOS-only published release update check, use:
-  - `timeout --foreground 75m pnpm test:parallels:npm-update -- --platform macos --package-spec openclaw@<old-version> --update-target <target-version-or-tag> --json`
-    This keeps the same-guest `openclaw update --tag ...` coverage and uses the shared macOS current-user/sudo fallback without starting Windows/Linux lanes.
-- Required coverage: every release/update regression run must include both lanes:
-  - fresh snapshot -> install requested package/baseline -> smoke
-  - same guest baseline -> run the guest's installed `openclaw update ...` command -> smoke again
-- The update lane must exercise OpenClaw's internal updater. Do not count a direct `npm install -g <tgz-or-spec>` or harness-side package swap as update-flow coverage; those are install smokes only.
-- For published targets, install the old baseline package first (for example `openclaw@2026.4.9`), then run the installed guest CLI with the intended channel/tag (for example `openclaw update --channel beta --yes --json`) and verify `openclaw --version`, `openclaw update status --json`, gateway RPC, and an agent turn after the command.
-- For unpublished targets, pack the candidate on the host, serve the `.tgz` over the harness HTTP server, and point the guest updater at that served package. Prefer `openclaw update --tag http://<host-ip>:<port>/openclaw-<version>.tgz --yes --json`; when channel persistence also matters, pass `--channel <stable|beta>` and set `OPENCLAW_UPDATE_PACKAGE_SPEC` to the same served URL in the guest update environment. The command under test must still be `openclaw update`, not direct npm.
-- For unpublished local-fix validation, remember the old baseline updater code still controls the first hop. A fix that lives only in the new updater code cannot change that already-running old process; the served candidate must either keep package/plugin metadata compatible with the baseline host or the baseline itself must include the updater fix.
-- For beta/stable verification, resolve the tag immediately before the run (`npm view openclaw@beta version dist.tarball` or `npm view openclaw@latest ...`). Tags can move while a long VM matrix is already running; restart the matrix when the intended prerelease appears after an earlier registry 404/tag-lag check.
-- Use the configured secret workflow to inject only the provider keys needed by OpenAI/Anthropic lanes. Do not print secrets or env dumps; pass provider secrets through the guest exec environment.
+- Flow: fresh snapshot -> install npm package baseline -> smoke -> install current main tgz on the same guest -> smoke again.
 - Same-guest update verification should set the default model explicitly to `openai/gpt-5.4` before the agent turn and use a fresh explicit `--session-id` so old session model state does not leak into the check.
-- The aggregate npm-update wrapper must resolve the Linux VM with the same Ubuntu fallback policy as `parallels-linux-smoke.sh` before both fresh and update lanes. Treat any Ubuntu guest with major version `>= 24` as acceptable when the exact default VM is missing, preferring the newest versioned Ubuntu guest with a fresh poweroff snapshot. On Peter's current host today, use `Ubuntu 26.04`.
+- The aggregate npm-update wrapper must resolve the Linux VM with the same Ubuntu fallback policy as `parallels-linux-smoke.sh` before both fresh and update lanes. Treat any Ubuntu guest with major version `>= 24` as acceptable when the exact default VM is missing, preferring the closest version match. On Peter's current host today, missing `Ubuntu 24.04.3 ARM64` should fall back to `Ubuntu 25.10`.
 - On macOS same-guest update checks, restart the gateway after the npm upgrade before `gateway status` / `agent`; launchd can otherwise report a loaded service while the old process has exited and the fresh process is not RPC-ready yet.
-- The npm-update aggregate's macOS update leg writes the guest update script as root, then runs it as the desktop user. If `prlctl exec "$MACOS_VM" --current-user ...` cannot authenticate, retry through plain root `prlctl exec` plus `sudo -u <desktop-user> /usr/bin/env HOME=/Users/<desktop-user> USER=<desktop-user> LOGNAME=<desktop-user> PATH=/opt/homebrew/bin:/opt/homebrew/opt/node/bin:/usr/bin:/bin:/usr/sbin:/sbin ...`. That is a Parallels transport fallback; still verify `openclaw --version`, gateway RPC, and an agent turn after the update.
 - On Windows same-guest update checks, restart the gateway after the npm upgrade before `gateway status` / `agent`; in-place global npm updates can otherwise leave stale hashed `dist/*` module imports alive in the running service.
-- In those Windows same-guest update checks, do not treat one nonzero `openclaw gateway restart` as definitive failure. Current login-item restarts can report failure before the background service becomes observable again; follow with a longer RPC-ready wait and use `gateway start` only as a recovery step if readiness still never returns.
-- After that Windows restart, do not trust one `gateway status --deep --require-rpc` call after a fixed sleep. Retry the RPC-ready probe for roughly 30 seconds and log each attempt; current guests can keep port `18789` bound while the fresh RPC endpoint is still coming up.
 - For Windows same-guest update checks, prefer the done-file/log-drain PowerShell runner pattern over one long-lived `prlctl exec ... powershell -EncodedCommand ...` transport. The guest can finish successfully while the outer `prlctl exec` still hangs.
 - The Windows same-guest update helper should write stage markers to its log before long steps like tgz download and `npm install -g` so the outer progress monitor does not sit on `waiting for first log line` during healthy but quiet installs.
 - Linux same-guest update verification should also export `HOME=/root`, pass `OPENAI_API_KEY` via `prlctl exec ... /usr/bin/env`, and use `openclaw agent --local`; the fresh Linux baseline does not rely on persisted gateway credentials.
 - The npm-update wrapper now prints per-lane progress from the nested log files. If a lane still looks stuck, inspect the nested logs in `runDir` first (`macos-fresh.log`, `windows-fresh.log`, `linux-fresh.log`, `macos-update.log`, `windows-update.log`, `linux-update.log`) instead of assuming the outer wrapper hung.
-- Each run writes both `summary.json` and `summary.md`; read the markdown first for quick human triage, then the JSON/timings for automation.
-- For full beta validation after a tag is published, prefer one command:
-  - `timeout --foreground 150m pnpm test:parallels:npm-update -- --beta-validation beta3 --json`
-    This resolves `beta3` to the latest `*-beta.3` version, runs latest->that-version same-guest update coverage, and then runs fresh install smoke for that exact published target on the same selected OS matrix. Use `--platform macos|windows|linux` to narrow reruns.
-- For beta 4 npm validation with agent turns, the known-good shape is:
-  - `gtimeout --foreground 150m pnpm test:parallels:npm-update -- --beta-validation beta4 --model openai/gpt-5.4 --json`
-    Prefer the explicit `beta4` alias over `openclaw@beta` when validating a specific prerelease number; npm tags can move.
-- If the wrapper fails a lane, read the auto-dumped tail first, then the full nested lane log under `.artifacts/parallels/openclaw-parallels-npm-update.*`.
-- Current known macOS update-lane transport signature when the fallback is missing or bypassed: `Unable to authenticate the user. Make sure that the specified credentials are correct and try again.` Treat that as Parallels current-user authentication before blaming npm or OpenClaw.
-- A macOS packaged fresh install with global package directories or bundled files mode `0777` usually means the harness used the root `prlctl exec` fallback under a permissive umask. The POSIX guest transports should prepend `umask 022`; verify the phase preflight line before blaming npm.
+- If the wrapper fails a lane, read the auto-dumped tail first, then the full nested lane log under `/tmp/openclaw-parallels-npm-update.*`.

 ## CLI invocation footgun

@@ -86,23 +46,19 @@ Use this skill for Parallels guest workflows and smoke interpretation. Do not lo
 ## macOS flow

 - Preferred entrypoint: `pnpm test:parallels:macos`
-- `parallels-macos-smoke.sh --mode fresh --target-package-spec openclaw@<version>` is an install smoke only. For published old-version -> new-version update coverage on macOS, prefer the npm-update wrapper with `--platform macos`; `parallels-macos-smoke.sh --mode upgrade --target-package-spec ...` installs the target package and does not exercise the baseline CLI's updater.
 - Default upgrade coverage on macOS should now include: fresh snapshot -> site installer pinned to the latest stable tag -> `openclaw update --channel dev` on the guest. Treat this as part of the default Tahoe regression plan, not an optional side quest.
 - `parallels-macos-smoke.sh --mode upgrade` should run that release-to-dev lane by default. Keep the older host-tgz upgrade path only when the caller explicitly passes `--target-package-spec`.
 - Because the default upgrade lane no longer needs a host tgz, skip `npm pack` + host HTTP server startup for `--mode upgrade` unless `--target-package-spec` is set. Keep the pack/server path for `fresh` and `both`.
 - If that release-to-dev lane fails with `reason=preflight-no-good-commit` and repeated `sh: pnpm: command not found` tails from `preflight build`, treat it as an updater regression first. The fix belongs in the git/dev updater bootstrap path, not in Parallels retry logic.
 - Until the public stable train includes that updater bootstrap fix, the macOS release-to-dev lane may seed a temporary guest-local `pnpm` shim immediately before `openclaw update --channel dev`. Keep that workaround scoped to the smoke harness and remove it once the latest stable no longer needs it.
 - In Tahoe `prlctl exec --current-user` runs, prefer explicit `node .../openclaw.mjs ...` invocations for the release->dev handoff itself and for post-update verification. The shebanged global `openclaw` wrapper can fail with `env: node: No such file or directory`, and self-updating through the wrapper is a weaker lane than invoking the entrypoint under a fixed `node`.
-- Default to the snapshot closest to `macOS 26.5 latest`.
-- On Peter's Tahoe VM, `fresh-latest-march-2026` can hang in `prlctl snapshot-switch`; if restore times out there, rerun with `--snapshot-hint 'macOS 26.5 latest'` before blaming auth or the harness.
+- Default to the snapshot closest to `macOS 26.3.1 latest`.
+- On Peter's Tahoe VM, `fresh-latest-march-2026` can hang in `prlctl snapshot-switch`; if restore times out there, rerun with `--snapshot-hint 'macOS 26.3.1 latest'` before blaming auth or the harness.
 - `parallels-macos-smoke.sh` now retries `snapshot-switch` once after force-stopping a stuck running/suspended guest. If Tahoe still times out after that recovery path, then treat it as a real Parallels/host issue and rerun manually.
 - The macOS smoke should include a dashboard load phase after gateway health: resolve the tokenized URL with `openclaw dashboard --no-open`, verify the served HTML contains the Control UI title/root shell, then open Safari and require an established localhost TCP connection from Safari to the gateway port.
-- For Tahoe `fresh.gateway-status`, prefer non-TTY `prlctl exec --current-user ... openclaw gateway status ...` plus a few short retries. `prlctl enter` can spam TTY control bytes and hang the phase log even when the CLI itself is healthy.
-- If a Tahoe lane times out in `fresh.first-agent-turn` and the phase log stops right after `__OPENCLAW_RC__:0` from `models set`, suspect the `prlctl enter` / `expect` wrapper before blaming auth or the model lane. That pattern means the first guest command finished but the transport never released for the next `guest_current_user_cli` call.
 - If a packaged install regresses with `500` on `/`, `/healthz`, or `__openclaw/control-ui-config.json` after `fresh.install-main` or `upgrade.install-main`, suspect bundled plugin runtime deps resolving from the package root `node_modules` rather than `dist/extensions/*/node_modules`. Repro quickly with a real `npm pack`/global install lane before blaming dashboard auth or Safari.
 - `prlctl exec` is fine for deterministic repo commands, but use the guest Terminal or `prlctl enter` when installer parity or shell-sensitive behavior matters.
 - Multi-word `openclaw agent --message ...` checks should go through a guest shell wrapper (`guest_current_user_sh` / `guest_current_user_cli` or `/bin/sh -lc ...`), not raw `prlctl exec ... node openclaw.mjs ...`, or the message can be split into extra argv tokens and Commander reports `too many arguments for 'agent'`.
-- The same wrapper rule applies when bypassing `--current-user`: write a tiny `/tmp/*.sh` on the guest and execute `/bin/bash /tmp/*.sh` through the sudo desktop-user environment. Do not pass `openclaw agent --message '...'` directly as one raw `prlctl exec` command.
 - When ref-mode onboarding stores `OPENAI_API_KEY` as an env secret ref, the post-onboard agent verification should also export `OPENAI_API_KEY` for the guest command. The gateway can still reject with pairing-required and fall back to embedded execution, and that fallback needs the env-backed credential available in the shell.
 - On the fresh Tahoe snapshot, `brew` exists but `node` may be missing from PATH in noninteractive exec. Use `/opt/homebrew/bin/node` when needed.
 - Fresh host-served tgz installs should install as guest root with `HOME=/var/root`, then run onboarding as the desktop user via `prlctl exec --current-user`.
@@ -130,7 +86,7 @@ Use this skill for Parallels guest workflows and smoke interpretation. Do not lo
 - Fresh Windows tgz install phases should also use the background PowerShell runner plus done-file/log-drain pattern; do not rely on one long-lived `prlctl exec ... powershell ... npm install -g` transport for package installs.
 - Windows release-to-dev helpers should log `where pnpm` before and after the update and require `where pnpm` to succeed post-update. That proves the updater installed or enabled `pnpm` itself instead of depending on a smoke-only bootstrap.
 - Fresh Windows ref-mode onboard should use the same background PowerShell runner plus done-file/log-drain pattern as the npm-update helper, including startup materialization checks, host-side timeouts on short poll `prlctl exec` calls, and retry-on-poll-failure behavior for transient transport flakes.
-- Fresh Windows daemon-health reachability should use `openclaw gateway probe --json` with a longer timeout and treat `ok: true` as success; full `gateway status --require-rpc` checks are too eager during initial startup on current main.
+- Fresh Windows daemon-health reachability should use a hello-only gateway probe and a longer per-probe timeout than the default local attach path; full health RPCs are too eager during initial startup on current main.
 - Fresh Windows ref-mode agent verification should set `OPENAI_API_KEY` in the PowerShell environment before invoking `openclaw.cmd agent`, for the same pairing-required fallback reason as macOS.
 - The standalone Windows upgrade smoke lane should stop the managed gateway after `upgrade.install-main` and before `upgrade.onboard-ref`. Restarting before onboard can leave the old process alive on the pre-onboard token while onboard rewrites `~/.openclaw/openclaw.json`, which then fails `gateway-health` with `unauthorized: gateway token mismatch`.
 - If standalone Windows upgrade fails with a gateway token mismatch but `pnpm test:parallels:npm-update` passes, trust the mismatch as a standalone ref-onboard ordering bug first; the npm-update helper does not re-run ref-mode onboard on the same guest.
@@ -140,8 +96,8 @@ Use this skill for Parallels guest workflows and smoke interpretation. Do not lo
 ## Linux flow

 - Preferred entrypoint: `pnpm test:parallels:linux`
-- Use the newest versioned Ubuntu guest with a fresh poweroff snapshot. On Peter's host today, that is `Ubuntu 26.04`.
-- If an exact requested Ubuntu VM is missing on the host, any Ubuntu guest with major version `>= 24` is acceptable; prefer the newest versioned Ubuntu guest over older fallback snapshots.
+- Use the snapshot closest to fresh `Ubuntu 24.04.3 ARM64`.
+- If that exact VM is missing on the host, any Ubuntu guest with major version `>= 24` is acceptable; prefer the closest versioned Ubuntu guest with a fresh poweroff snapshot. On Peter's host today, that is `Ubuntu 25.10`.
 - Use plain `prlctl exec`; `--current-user` is not the right transport on this snapshot.
 - Fresh snapshots may be missing `curl`, and `apt-get update` can fail on clock skew. Bootstrap with `apt-get -o Acquire::Check-Date=false update` and install `curl ca-certificates`.
 - Fresh `main` tgz smoke still needs the latest-release installer first because the snapshot has no Node or npm before bootstrap.
@@ -156,7 +112,6 @@ Use this skill for Parallels guest workflows and smoke interpretation. Do not lo
  - `--discord-token-env`
  - `--discord-guild-id`
  - `--discord-channel-id`
-- After a successful Discord smoke/roundtrip, shut down the guest VM before handoff (`prlctl stop "$VM_NAME"` or the concrete VM name). The macOS smoke harness should do this automatically after successful Discord proof; still stop the VM manually after ad-hoc Discord checks. Do not leave the Discord-configured guest running; it can keep reading/posting in `#maintainer` and spam Discord after the proof is complete.
 - Keep the Discord token only in a host env var.
 - Use installed `openclaw message send/read`, not `node openclaw.mjs message ...`.
 - Set `channels.discord.guilds` as one JSON object, not dotted config paths with snowflakes.
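
The Windows restart guidance removed above (retry the RPC-ready probe for roughly 30 seconds and log each attempt, rather than trusting one call after a fixed sleep) can be sketched as a small shell helper. This is a minimal sketch, not the harness implementation: the `retry_probe` name and its argument order are hypothetical, and the probe command is supplied by the caller.

```bash
#!/usr/bin/env bash
# Hypothetical helper: rerun a readiness probe until it succeeds or a
# deadline passes, logging every attempt to stderr.
# Usage: retry_probe <max_seconds> <sleep_seconds> <command...>
retry_probe() {
  local max_seconds="$1" sleep_seconds="$2"
  shift 2
  local start attempt=0
  start=$(date +%s)
  while :; do
    attempt=$((attempt + 1))
    if "$@"; then
      echo "probe attempt ${attempt}: ready" >&2
      return 0
    fi
    echo "probe attempt ${attempt}: not ready" >&2
    # Give up once the elapsed time reaches the deadline.
    if [ $(( $(date +%s) - start )) -ge "$max_seconds" ]; then
      echo "probe gave up after ${attempt} attempts" >&2
      return 1
    fi
    sleep "$sleep_seconds"
  done
}
```

Calling it as `retry_probe 30 2 <probe command>` would approximate the 30-second window described for the Windows restart check; the probe command itself is left as a placeholder here.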
--- a/.agents/skills/openclaw-pr-maintainer/SKILL.md
+++ b/.agents/skills/openclaw-pr-maintainer/SKILL.md
@@ -1,121 +1,16 @@
 ---
 name: openclaw-pr-maintainer
-description: Use immediately for any pasted OpenClaw GitHub issue or PR URL/number, and for OpenClaw issue/PR review, triage, duplicate search, opener identity/who wrote it, author account age/activity, comments, labels, close, land, or maintainer evidence checks.
+description: Maintainer workflow for reviewing, triaging, preparing, closing, or landing OpenClaw pull requests and related issues. Use when Codex needs to validate bug-fix claims, search for related issues or PRs, apply or recommend close/reason labels, prepare GitHub comments safely, check review-thread follow-up, or perform maintainer-style PR decision making before merge or closure.
 ---

 # OpenClaw PR Maintainer

 Use this skill for maintainer-facing GitHub workflow, not for ordinary code changes.

-## Start issue and PR triage with gitcrawl
-
-- Use `$gitcrawl` first anytime you inspect OpenClaw issues or PRs.
-- Check local `gitcrawl` data first for related threads, duplicate attempts, and already-landed fixes.
-- Use `gitcrawl` for candidate discovery and clustering; use `gh`, `gh api`, and the current checkout to verify live state before commenting, labeling, closing, or landing.
-- If `gitcrawl` is missing, stale, lacks the target thread, or has no embeddings for neighbor/search commands, fall back to the GitHub search workflow below.
-- Do not run expensive/update commands such as `gitcrawl sync --include-comments`, future enrichment commands, or broad reclustering unless the user asked to update the local store or stale data is blocking the decision.
-
-Common read-only path:
-
-```bash
-gitcrawl threads openclaw/openclaw --numbers <issue-or-pr-number> --include-closed --json
-gitcrawl neighbors openclaw/openclaw --number <issue-or-pr-number> --limit 12 --json
-gitcrawl search openclaw/openclaw --query "<scope or title keywords>" --mode hybrid --json
-gitcrawl cluster-detail openclaw/openclaw --id <cluster-id> --member-limit 20 --body-chars 280 --json
-```
-
-## Claim specific review targets
-
-When a maintainer asks Codex to review, triage, fix, or land a specific OpenClaw issue/PR, check assignment before deep work.
-
-- Identify the requesting maintainer's GitHub login. In this environment, default Peter to `steipete`; if another maintainer is clearly the requester, use that maintainer's bare login.
-- Read current assignees with live `gh issue view` / `gh pr view`; `gitcrawl` is not enough for assignment state.
-- If unassigned, assign the requester before deep review. This is allowed for specific requested targets; do not auto-assign broad discovery candidates or shortlists.
-- If assigned to someone else, say so clearly before analysis and include assignment age:
-  - fresh: assigned within 6h; treat as actively owned unless user explicitly asks to continue or reassign
-  - stale: assigned 6h+ ago; treat as ownership hint, not a hard block; continue only with that caveat
-- If assigned to requester plus others, mention co-assignees and continue.
-- If assignment event time is unavailable, say `assigned, time unknown`; treat as assigned, not stale.
-- Never remove or replace assignees unless explicitly asked.
-
-Assignment time proof:
-
-```bash
-gh api "repos/openclaw/openclaw/issues/<number>/timeline" --paginate \
-  -H "Accept: application/vnd.github+json" \
-  --jq '[.[] | select(.event=="assigned") | {assignee:.assignee.login, assigner:.assigner.login, actor:.actor.login, created_at}]'
-```
-
-Use the newest `assigned` event for each current assignee. Issue timeline events expose `created_at`; GitHub GraphQL `AssignedEvent.createdAt` is also valid when REST pagination is awkward.
-
-Claim command for issues or PRs:
-
-```bash
-gh api -X POST "repos/openclaw/openclaw/issues/<number>/assignees" -f 'assignees[]=<login>' >/dev/null
-```
-
-## Surface opener identity
-
-- For every reviewed, triaged, closed, or landed issue/PR, show the opener's human name when available, GitHub login, and account age.
-- Get the login from `gh issue view` / `gh pr view` (`author.login`), then fetch profile metadata once with `gh api users/<login> --jq '{login,name,created_at,type}'`.
-- Report opener identity as one compact line:
-  `By: Jane Doe (@jane, acct 2021-04-03) | OpenClaw: 4 PRs, 2 issues, 11 commits/12mo | GitHub: 9 repos, 86 commits, 9 PRs, 3 issues, 12 reviews`
-- Always show recent activity in two lanes: OpenClaw-local PRs, issues, and commits in the last 12 months; and general public GitHub activity over the same window. For linked issue-fixing PRs, include both the PR author and issue opener when they differ.
-- Prefer the bundled helper for activity lookups:
-
-```bash
-.agents/skills/openclaw-pr-maintainer/scripts/github-activity.sh <login> [other-login...]
-.agents/skills/openclaw-pr-maintainer/scripts/github-activity.sh --global <login>
-```
-
-- The helper reports repo-local activity first and can fetch public GitHub contribution totals for the same window with `--global`; run the global form by default for review/triage identity summaries.
-- If the global contribution graph reports zero or looks inconsistent with visible public activity, sanity-check with `gh api users/<login>`, `gh api 'users/<login>/events/public?per_page=100'`, and recent public repo commits before calling the account inactive.
-- The helper is intentionally cache-friendly for gitcrawl-backed `gh`: it rounds repo-local windows to the UTC day, rounds global contribution windows to the UTC hour, and counts PRs/issues from one paginated issues response before fetching commits separately. Prefer reusing the helper instead of hand-rolling several `gh api` loops.
-- If the contribution graph is misleading or zero but public events/repos show activity, keep it one line, for example:
-  `By: pickaxe (@ProspectOre, acct 2019-08-24) | OpenClaw: 5 PRs, 0 issues, 5 commits/12mo | GitHub: 5 repos, 29 recent events, 100 public own-repo commits; graph=0`
- If `name` is empty, use the login only. If profile lookup is rate-limited or unavailable, say `account age unknown` rather than omitting the opener.
- Use identity and activity as triage signal, not proof by itself: new, low-activity, or bot-like accounts can raise review caution, but code, repro, and CI evidence still decide.
-
-## Suppress top-maintainer items in issue triage
-
-When asked for issue triage, hot issues, pressing bugs, Discord-correlated issues, or "what is still open", do not surface issues or PRs authored by top maintainers by default. Prefer external/user-reported hot issues and external PRs, not maintainer-owned work queues.
-
-Suppress by default when the opener/author is one of:
-
- `@vincentkoc`
- `@Takhoffman`
- `@gumadeiras`
- `@obviyus`
- `@shakkernerd`
- `@mbelinky`
- `@joshavant`
- `@ngutman`
- `@vignesh07`
- `@huntharo`
-
-Also suppress lower-priority maintainer-owned noise from the broader keep/top-maintainer group unless it is directly relevant:
-
- `@thewilloftheshadow`
- `@onutc` / `@osolmaz`
- `@jacobtomlinson`
- `@tyler6204`
- `@velvet-shark`
- `@jalehman`
- `@frankekn`
- `@ImLukeF`
- `@mcaxtr`
-
-Exceptions:
-
- Show maintainer-authored items when the requester explicitly asks for maintainer PRs/issues, PR landing candidates, release-blocking maintainer work, or a specific PR/issue number.
- Show a maintainer-authored item when it is the canonical fix for an external hot issue, but frame it as the fix path rather than as a user-facing issue candidate.
- Do not close, label, or deprioritize solely because an item is maintainer-authored; this section only controls what appears in triage shortlists.
-
 ## Apply close and triage labels correctly

 - If an issue or PR matches an auto-close reason, apply the label and let `.github/workflows/auto-response.yml` handle the comment/close/lock flow.
 - Do not manually close plus manually comment for these reasons.
- If an issue/PR is already fixed on current `main` or solved by a new release, comment with proof plus the canonical commit/PR/release, then close it.
 - `r:*` labels can be used on both issues and PRs.
 - Current reasons:
  - `r: skill`
@@ -129,129 +24,16 @@ Exceptions:
  - `invalid`
  - `dirty` for PRs only

-## Select small high-confidence triage candidates
-
-When asked for `X` issues or PRs to triage, `X` means qualified candidates, not sampled threads.
-
-Issue triage is review/prove/patch-local by default:
-
-1. Review the issue body, comments, related threads, current code, and adjacent tests.
-2. Fix only issues that are easy, high-confidence, and narrowly owned by the implicated path.
-3. Add focused regression proof when practical.
-4. Stop with the dirty diff, touched files, and test/gate output for maintainer review.
-5. After maintainer approval to ship, make one commit per accepted fix, with release-note context in the PR body or commit message when user-facing.
-6. Pull/rebase, push, then comment and close only the issues that were fixed or explicitly triaged closed.
-
-Do not batch unrelated issue fixes into one commit. Do not publish, comment, close, or label during the review/prove phase.
-
-Missing `CHANGELOG.md` is not a PR review finding or merge blocker. If landing/fixing a user-visible change, make sure the PR body or commit message captures the release-note context; never ask or block solely on it.
-
-Only list candidates that pass all gates:
-
- small owner/surface, with a likely narrow fix and focused regression test
- symptom is reproducible or provable with logs, failing test, live command, dependency contract, or current-main behavior
- root cause is traceable to code with file/line and the proposed fix touches that path
- no strong smell that a broader refactor, ownership rethink, migration, or product decision is the better fix
- dependency-backed behavior checked against upstream docs/source/types; live or web proof used when local proof is insufficient
-
-Loop:
-
-1. Use `gitcrawl` / `gh` to gather candidate clusters.
-2. Read issue/PR body, comments, current code, adjacent tests, and dependency contracts.
-3. Try focused repro or proof.
-4. Reject unclear, stale, speculative, broad-refactor, or owner-ambiguous items.
-5. Continue until `X` qualified candidates or the bounded search is exhausted.
-
-Output only qualifying candidates, with: ref, surface, proof, cause, fix sketch, why small, expected test/gate. If none qualify, say so; do not pad.
-
-## Structure PR review output
-
- Start every PR review with 1-3 plain sentences explaining what the change does and why it matters. Put this before `Findings`.
- Then list findings first. If none, say `No blocking findings` or `No findings`.
- Show size near the top as `LOC: +<additions>/-<deletions> (<changedFiles> files)`, using live PR stats or local diff stats.
- Always answer: bug/behavior being fixed, PR/issue URL and affected surface, provenance for regressions when traceable, and best-fix verdict.
- For bug/regression fixes, include a compact `Provenance:` line after cause/root-cause when a bounded history pass can identify it. Use `git log -S/-G`, `git blame`, linked PRs/issues, and tests.
- Provenance must separate roles when they differ: blamed code author username, blamed PR author username, blamed PR merger/committer username, automerge trigger when known, current PR author username, PR number, and date. Do not collapse them into one "introduced by" actor.
- If the blamed PR was merged by `clawsweeper[bot]` or another automation, identify the human trigger when practical. Check live PR timeline/comments first; if rate-limited, use gitcrawl/cache or public PR HTML. Look for maintainer command comments such as `@clawsweeper automerge`, `/landpr`, labels/events that armed automerge, and ClawSweeper status comments. Report `automerge triggered by @login`; if not found, say trigger unknown rather than naming the bot as the human decision-maker.
- For any confirmed bug, run `git blame` on the implicated line(s) after identifying the root cause. Report who broke it as the blamed PR merger/committer, and also name the blamed code author. Include the PR number. If no PR is traceable, use the blamed commit as the provenance: commit SHA, date, and author username. Do not guess a merger or frame missing PR metadata as a separate finding.
- Phrase provenance as `introduced by`, `made visible by`, or `carried forward by`, with confidence (`clear`, `likely`, `unknown`). If unclear, say what evidence is missing instead of guessing. For features, docs, and refactors, use `Provenance: N/A` or omit it when no broken behavior is being fixed.
- Keep summaries compact, but include enough proof that the verdict is auditable without rereading the PR.
-
-LOC proof:
-
-```bash
-gh pr view <number> --json additions,deletions,changedFiles \
-  --jq '"LOC: +\(.additions)/-\(.deletions) (\(.changedFiles) files)"'
-```
-
-## Read beyond the diff
-
- Review the surrounding code path, not just changed lines. Open the caller, callee, data contracts, adjacent tests, and owner module.
- Before any verdict, read enough code to fill this map: changed surface, runtime entry point, owner boundary, one caller, one callee, sibling implementations sharing the invariant, adjacent tests, current `main` behavior, and shipped/dependency/Codex contracts when relevant.
- For large-codebase PRs, sample enough related files to understand the runtime boundary before deciding. Default to more code reading when the change touches agents, gateway, plugins, auth, sessions, process, config, or provider/runtime seams.
- Compare the PR against current `origin/main` behavior. Check whether recent main already changed the same surface.
- Dependency-backed behavior: MUST read upstream docs/source/types before judging API use, defaults, output shapes, errors, timeouts, memory behavior, or compatibility. Do not assume dependency contracts from memory or PR text.
- Judge solution quality, not only correctness. Ask whether the PR is the clean owner-boundary fix or a wart/workaround that should be replaced by a small refactor, moved seam, contract change, or deletion of duplicate logic.
- Mention the main files read when the verdict depends on code-path evidence.
- If the user challenges the verdict or asks whether the idea is really good, resume code reading first. Do not defend, soften, or reverse the verdict until the missing caller/callee/sibling/dependency path is checked.
-
-## Best-fix review loop
-
-Every PR review must explicitly answer: "Is this the best fix, or only a plausible fix?"
-
-Before verdict:
-
-1. Reconstruct the bug, feature need, or behavior claim from issue/PR/proof.
-2. Trace current behavior from entry point to failure or decision point.
-3. Read touched files, callers, callees, owner modules, adjacent tests, and relevant docs.
-4. Read sibling surfaces that should share the invariant or could be broken by a one-sided fix.
-5. Compare against current `origin/main` and shipped behavior when regression/compat matters.
-6. Inspect upstream dependency/Codex source or docs for dependency-backed behavior.
-7. Identify at least one alternative fix location or shape, then reject it with evidence.
-8. If any required path above is uninspected, keep reading or mark `Remaining uncertainty`; do not call the PR best, blocked, proof-sufficient, or merge-ready.
-
-Review output must include:
-
- `Best-fix verdict:` best / acceptable mitigation / wrong layer / too narrow / too broad.
- `Alternatives considered:` 1-3 concrete alternatives and why rejected.
- `Code read:` compact list of main files/contracts checked.
- `Remaining uncertainty:` what was not proven.
-
-If the best-fix answer is only "maybe", keep reading or state the missing evidence. Do not call proof sufficient until the best-fix judgment is explicit.
-
 ## Enforce the bug-fix evidence bar

 - Never merge a bug-fix PR based only on issue text, PR text, or AI rationale.
- Whenever feasible, use Crabbox (`$crabbox`) for end-to-end verification before
-  commenting that a bug is unreproducible, closing an issue, or opening/landing
-  a fix PR. Prefer a real packaged/Docker/live lane that exercises the reported
-  user flow over unit-only proof.
 - Before landing, require:
  1. symptom evidence such as a repro, logs, or a failing test
  2. a verified root cause in code with file/line
-  3. blame-backed provenance for regressions when traceable, including blamed PR merger and automerge trigger when known, or commit SHA/date when no PR is traceable
-  4. a fix that touches the implicated code path
-  5. a regression test when feasible, or explicit manual verification plus a reason no test was added
+  3. a fix that touches the implicated code path
+  4. a regression test when feasible, or explicit manual verification plus a reason no test was added
 - If the claim is unsubstantiated or likely wrong, request evidence or changes instead of merging.
 - If the linked issue appears outdated or incorrect, correct triage first. Do not merge a speculative fix.
- If Crabbox/E2E proof is blocked, say exactly why and use the closest available
-  local, Docker, mocked, or targeted proof. Do not present unit tests as real
-  behavior proof.
-
-## Close low-signal manual PRs carefully
-
- Do not close for red CI alone. Require a clear low-signal category plus stale or failed validation.
- Good manual-close categories:
-  - blank or mostly untouched PR template with no concrete OpenClaw problem/fix
-  - random docs-only churn such as root README translations, generic wording tweaks, or community-plugin discoverability docs that should go through ClawHub
-  - test-only coverage without a linked bug, owner request, or behavior change
-  - refactor-only cleanup, variable renames, formatting, or generated/baseline churn without maintainer request
-  - third-party channel/provider/tool/skill/plugin work that belongs on ClawHub instead of core
-  - risky ops/infra drive-bys such as new external CI services, release workflows, host upgrade scripts, Docker base migrations, or apt retry/fix-missing tweaks without owner request and green validation
-  - dirty branches where a narrow stated change includes unrelated docs/generated/runtime/extension files
-  - repeated bot-review spam or copied bot output without author-owned fixes
- Keep or escalate plausible focused bug fixes, green PRs, active maintainer discussions, assigned work, recent author follow-up, and unique reproduction details.
- For third-party capabilities, prefer the `r: third-party-extension` auto-response label when it applies; it points contributors to publish on ClawHub.

 ## Handle GitHub text safely

@@ -262,9 +44,9 @@ If the best-fix answer is only "maybe", keep reading or state the missing eviden

 ## Search broadly before deciding

- Prefer `gitcrawl` first. Then use targeted GitHub keyword search to verify gaps, live status, comments, and candidates not present in the local store.
- Use `--repo openclaw/openclaw` with `--match title,body` first when using `gh search`.
- Add `--match comments` when triaging follow-up discussion or closed-as-duplicate chains.
+- Prefer targeted keyword search before proposing new work or closing something as duplicate.
+- Use `--repo openclaw/openclaw` with `--match title,body` first.
+- Add `--match comments` when triaging follow-up discussion.
 - Do not stop at the first 500 results when the task requires a full search.

 Examples:
@@ -279,17 +61,13 @@ gh search issues --repo openclaw/openclaw --match title,body --limit 50 \

 ## Follow PR review and landing hygiene

- Never mention release-note bookkeeping in review-only output. It is landing
-  or release-generation mechanics, not a correctness finding.
 - If bot review conversations exist on your PR, address them and resolve them yourself once fixed.
 - Leave a review conversation unresolved only when reviewer or maintainer judgment is still needed.
- Before landing any PR with non-trivial code changes, run `$autoreview` until no accepted/actionable findings remain, unless equivalent manual review already covered it, the change is trivial/docs-only, or the user opts out.
- When an agent is landing or merging a PR targeting `main`, use only the repo-native `scripts/pr` wrapper: run `scripts/pr review-init <PR>`, follow its emitted checkout/guard guidance, initialize and complete review artifacts with `scripts/pr review-artifacts-init <PR>`, validate them with `scripts/pr review-validate-artifacts <PR>`, then run `scripts/pr prepare-run <PR>` and `scripts/pr merge-run <PR>`.
+- When landing or merging any PR, follow the global `/landpr` process.
 - Use `scripts/committer "<msg>" <file...>` for scoped commits instead of manual `git add` and `git commit`.
 - Keep commit messages concise and action-oriented.
 - Group related changes; avoid bundling unrelated refactors.
 - Use `.github/pull_request_template.md` for PR submissions and `.github/ISSUE_TEMPLATE/` for issues.
- Do not commit PR-only artifacts such as screenshots under `.github/pr-assets`; attach them to the PR/comment or use an external artifact store instead.

 ## Extra safety

--- a/.agents/skills/openclaw-pr-maintainer/scripts/github-activity.sh
+++ b/.agents/skills/openclaw-pr-maintainer/scripts/github-activity.sh
@@ -1,178 +0,0 @@
-#!/usr/bin/env bash
-set -euo pipefail
-
-repo="openclaw/openclaw"
-months="12"
-include_global="0"
-
-usage() {
-  printf 'Usage: %s [--repo owner/repo] [--months N] [--global] <github-login> [login...]\n' "$0"
-}
-
-die() {
-  printf 'error: %s\n' "$*" >&2
-  exit 1
-}
-
-need() {
-  command -v "$1" >/dev/null 2>&1 || die "missing required command: $1"
-}
-
-date_utc_relative_months() {
-  local count="$1"
-  if date -u -v-"${count}"m +%Y-%m-%dT00:00:00Z >/dev/null 2>&1; then
-    date -u -v-"${count}"m +%Y-%m-%dT00:00:00Z
-    return
-  fi
-  date -u -d "${count} months ago" +%Y-%m-%dT00:00:00Z
-}
-
-date_to_epoch() {
-  local value="$1"
-  if date -u -j -f '%Y-%m-%dT%H:%M:%SZ' "$value" +%s >/dev/null 2>&1; then
-    date -u -j -f '%Y-%m-%dT%H:%M:%SZ' "$value" +%s
-    return
-  fi
-  date -u -d "$value" +%s
-}
-
-rough_age() {
-  local created_at="$1"
-  local now_s created_s days
-  now_s=$(date -u +%s)
-  created_s=$(date_to_epoch "$created_at")
-  days=$(( (now_s - created_s) / 86400 ))
-  if (( days < 120 )); then
-    printf '~%dd old' "$days"
-    return
-  fi
-  awk -v days="$days" 'BEGIN { printf "~%.1fy old", days / 365.2425 }'
-}
-
-thread_kinds() {
-  local login="$1"
-  local since_ts="$2"
-  gh api --paginate "repos/${repo}/issues?state=all&creator=${login}&since=${since_ts}&per_page=100" \
-    --jq ".[] | select(.created_at >= \"${since_ts}\") | if has(\"pull_request\") then \"pr\" else \"issue\" end"
-}
-
-count_kind_lines() {
-  local kind="$1"
-  local lines="$2"
-  grep -cx "$kind" <<<"$lines" 2>/dev/null || true
-}
-
-count_commits() {
-  local login="$1"
-  local since_ts="$2"
-  gh api --paginate "repos/${repo}/commits?author=${login}&since=${since_ts}&per_page=100" \
-    --jq '.[].sha' | wc -l | tr -d '[:space:]'
-}
-
-global_activity() {
-  local login="$1"
-  local since_ts="$2"
-  local now_ts="$3"
-  # shellcheck disable=SC2016
-  gh api graphql \
-    -f login="$login" \
-    -f from="$since_ts" \
-    -f to="$now_ts" \
-    -f query='
-query($login: String!, $from: DateTime!, $to: DateTime!) {
-  user(login: $login) {
-    contributionsCollection(from: $from, to: $to) {
-      totalCommitContributions
-      totalIssueContributions
-      totalPullRequestContributions
-      totalPullRequestReviewContributions
-    }
-  }
-}' \
-    --jq '.data.user.contributionsCollection // empty'
-}
-
-while [[ $# -gt 0 ]]; do
-  case "$1" in
-    --repo)
-      [[ $# -ge 2 ]] || die "--repo requires owner/repo"
-      repo="$2"
-      shift 2
-      ;;
-    --months)
-      [[ $# -ge 2 ]] || die "--months requires a positive integer"
-      months="$2"
-      [[ "$months" =~ ^[0-9]+$ && "$months" != "0" ]] || die "--months must be a positive integer"
-      shift 2
-      ;;
-    --global)
-      include_global="1"
-      shift
-      ;;
-    -h|--help)
-      usage
-      exit 0
-      ;;
-    --)
-      shift
-      break
-      ;;
-    -*)
-      die "unknown option: $1"
-      ;;
-    *)
-      break
-      ;;
-  esac
-done
-
-[[ $# -gt 0 ]] || {
-  usage >&2
-  exit 2
-}
-
-need gh
-need jq
-
-since_ts=$(date_utc_relative_months "$months")
-now_ts=$(date -u +%Y-%m-%dT%H:00:00Z)
-
-for login in "$@"; do
-  profile=$(gh api "users/${login}" --jq '{login,name,created_at,type}')
-  display_login=$(jq -r '.login' <<<"$profile")
-  name=$(jq -r '.name // empty' <<<"$profile")
-  created_at=$(jq -r '.created_at' <<<"$profile")
-  type=$(jq -r '.type' <<<"$profile")
-  created_day=${created_at%%T*}
-
-  kinds=$(thread_kinds "$display_login" "$since_ts")
-  prs=$(count_kind_lines pr "$kinds")
-  issues=$(count_kind_lines issue "$kinds")
-  commits=$(count_commits "$display_login" "$since_ts")
-
-  if [[ -n "$name" ]]; then
-    printf '%s (@%s, %s, account created %s, %s)\n' \
-      "$name" "$display_login" "$type" "$created_day" "$(rough_age "$created_at")"
-  else
-    printf '@%s (%s, account created %s, %s)\n' \
-      "$display_login" "$type" "$created_day" "$(rough_age "$created_at")"
-  fi
-  printf '%s last %smo: %s PRs, %s issues, %s commits\n' "$repo" "$months" "$prs" "$issues" "$commits"
-
-  if [[ "$include_global" == "1" ]]; then
-    if global_json=$(global_activity "$display_login" "$since_ts" "$now_ts" 2>/dev/null); then
-      if [[ -n "$global_json" ]]; then
-        global_commits=$(jq -r '.totalCommitContributions' <<<"$global_json")
-        global_issues=$(jq -r '.totalIssueContributions' <<<"$global_json")
-        global_prs=$(jq -r '.totalPullRequestContributions' <<<"$global_json")
-        global_reviews=$(jq -r '.totalPullRequestReviewContributions' <<<"$global_json")
-        printf 'GitHub public last %smo: %s commits, %s PRs, %s issues, %s reviews\n' \
-          "$months" "$global_commits" "$global_prs" "$global_issues" "$global_reviews"
-      else
-        printf 'GitHub public last %smo: unavailable\n' "$months"
-      fi
-    else
-      printf 'GitHub public last %smo: unavailable\n' "$months"
-    fi
-  fi
-done
--- a/.agents/skills/openclaw-qa-testing/SKILL.md
+++ b/.agents/skills/openclaw-qa-testing/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: openclaw-qa-testing
-description: Run, watch, debug, extend, or explain OpenClaw qa-lab and qa-channel scenarios, artifacts, and live lanes.
+description: Run, watch, debug, and extend OpenClaw QA testing with qa-lab and qa-channel. Use when Codex needs to execute the repo-backed QA suite, inspect live QA artifacts, debug failing scenarios, add new QA scenarios, or explain the OpenClaw QA workflow. Prefer the live OpenAI lane with regular openai/gpt-5.4 in fast mode; do not use gpt-5.4-pro or gpt-5.4-mini unless the user explicitly overrides that policy.
 ---

 # OpenClaw QA Testing
@@ -12,10 +12,9 @@ Use this skill for `qa-lab` / `qa-channel` work. Repo-local QA only.
 - `docs/concepts/qa-e2e-automation.md`
 - `docs/help/testing.md`
 - `docs/channels/qa-channel.md`
- `qa/README.md`
- `qa/scenarios/index.yaml`
+- `qa/QA_KICKOFF_TASK.md`
+- `qa/seed-scenarios.json`
 - `extensions/qa-lab/src/suite.ts`
- `extensions/qa-lab/src/character-eval.ts`

 ## Model policy

@@ -28,216 +27,31 @@ Use this skill for `qa-lab` / `qa-channel` work. Repo-local QA only.

 ## Default workflow

-1. Read the scenario pack and current suite implementation.
+1. Read the seed plan and current suite implementation.
 2. Decide lane:
   - mock/dev: `mock-openai`
-   - real validation: `live-frontier`
+   - real validation: `live-openai`
 3. For live OpenAI, use:

 ```bash
 OPENCLAW_LIVE_OPENAI_KEY="${OPENAI_API_KEY}" \
 pnpm openclaw qa suite \
-  --provider-mode live-frontier \
+  --provider-mode live-openai \
  --model openai/gpt-5.4 \
  --alt-model openai/gpt-5.4 \
-  --output-dir .artifacts/qa-e2e/run-all-live-frontier-<tag>
+  --fast \
+  --output-dir .artifacts/qa-e2e/run-all-live-openai-<tag>
 ```

 4. Watch outputs:
-   - summary: `.artifacts/qa-e2e/run-all-live-frontier-<tag>/qa-suite-summary.json`
-   - report: `.artifacts/qa-e2e/run-all-live-frontier-<tag>/qa-suite-report.md`
+   - summary: `.artifacts/qa-e2e/run-all-live-openai-<tag>/qa-suite-summary.json`
+   - report: `.artifacts/qa-e2e/run-all-live-openai-<tag>/qa-suite-report.md`
 5. If the user wants to watch the live UI, find the current `openclaw-qa` listen port and report `http://127.0.0.1:<port>`.
 6. If a scenario fails, fix the product or harness root cause, then rerun the full lane.

-## OTEL smoke
-
-For local QA-lab OpenTelemetry validation, use:
-
-```bash
-pnpm qa:otel:smoke
-```
-
-This starts a local OTLP/HTTP trace receiver, runs the `otel-trace-smoke`
-scenario through qa-channel, decodes the emitted protobuf spans, and verifies
-the exported trace names and privacy contract. It does not require Opik,
-Langfuse, or external collector credentials.
-
-## Matrix live profiles
-
-`pnpm openclaw qa matrix` defaults to the full `all` profile. Use explicit
-profiles for faster CI/release proof:
-
-```bash
-OPENCLAW_QA_MATRIX_NO_REPLY_WINDOW_MS=3000 \
-pnpm openclaw qa matrix --profile fast --fail-fast
-```
-
- `fast`: release-critical transport contract, excluding generated image and
-  deep E2EE recovery inventory.
- `transport`, `media`, `e2ee-smoke`, `e2ee-deep`, `e2ee-cli`: sharded full
-  Matrix coverage.
- `QA-Lab - All Lanes` uses explicit `fast` Matrix on scheduled runs. Manual
-  dispatch keeps `matrix_profile=all` as the default and always shards that full
-  Matrix selection.
-
-## QA credentials and 1Password
-
- Use `op` only inside `tmux` for QA secret lookup in this repo.
- Quick auth check inside tmux:
-
-```bash
-op account list
-```
-
- Direct Telegram npm live test secrets currently live in 1Password item:
-  - vault: `OpenClaw`
-  - item: `Telegram E2E`
- That item is the first place to look for:
-  - `OPENCLAW_QA_TELEGRAM_DRIVER_BOT_TOKEN`
-  - `OPENCLAW_QA_TELEGRAM_SUT_BOT_TOKEN`
-  - `OPENCLAW_QA_PROVIDER_MODE`
-  - `OPENCLAW_NPM_TELEGRAM_PACKAGE_SPEC`
- Convex QA secrets currently live in 1Password items:
-  - vault: `OpenClaw`
-  - item: `OPENCLAW_QA_CONVEX_SITE_URL`
-  - item: `OPENCLAW_QA_CONVEX_SECRET_MAINTAINER`
-  - item: `OPENCLAW_QA_CONVEX_SECRET_CI`
- Additional related notes/login items seen during QA credential work:
-  - vault: `Private`
-  - items: `OPENCLAW QA`, `Convex`, `Telegram`
- If a required value is missing from those notes:
-  - do not guess
-  - ask the maintainer/operator for the current value or the current 1Password item name
-  - for Telegram direct runs, `OPENCLAW_QA_TELEGRAM_GROUP_ID` may be stored separately from `Telegram E2E`
-  - for Convex runs, the leased Telegram credential should provide the Telegram group id and bot tokens together; do not require a separate `OPENCLAW_QA_TELEGRAM_GROUP_ID`
-  - for Convex runs, prefer `OpenClaw/OPENCLAW_QA_CONVEX_SITE_URL`; if that is stale or unclear, ask for the active pool URL before running
- Prefer direct Telegram envs for the npm Telegram Docker lane when available:
-
-```bash
-OPENCLAW_QA_TELEGRAM_GROUP_ID="..." \
-OPENCLAW_QA_TELEGRAM_DRIVER_BOT_TOKEN="..." \
-OPENCLAW_QA_TELEGRAM_SUT_BOT_TOKEN="..." \
-OPENCLAW_QA_PROVIDER_MODE="mock-openai" \
-OPENCLAW_NPM_TELEGRAM_PACKAGE_SPEC="openclaw@beta" \
-pnpm test:docker:npm-telegram-live
-```
-
- Prefer Convex mode when the goal is stable shared QA infra:
-  - round-robin credential leasing
-  - thinner wrapper for channel-specific setup
-  - CLI/admin flows around the pooled credentials
- Live npm Telegram Docker lane note:
-  - `scripts/e2e/npm-telegram-live-runner.ts` reads `OPENCLAW_NPM_TELEGRAM_PROVIDER_MODE`
-  - do not assume `OPENCLAW_QA_PROVIDER_MODE` is consumed by that wrapper
-  - if a 1Password note only gives `OPENCLAW_QA_PROVIDER_MODE`, map it explicitly to `OPENCLAW_NPM_TELEGRAM_PROVIDER_MODE` before running the Docker lane
- Verified live shape:
-  - Convex mode can pass the real Docker lane without direct Telegram env vars
-  - leased Telegram payload includes the group id coupled to the driver/SUT tokens
-  - a real run of `pnpm test:docker:npm-telegram-live` passed with:
-    - `OPENCLAW_QA_CREDENTIAL_SOURCE=convex`
-    - `OPENCLAW_QA_CREDENTIAL_ROLE=maintainer`
-    - `OPENCLAW_QA_CONVEX_SITE_URL`
-    - `OPENCLAW_QA_CONVEX_SECRET_MAINTAINER`
-    - `OPENCLAW_NPM_TELEGRAM_PROVIDER_MODE=mock-openai`
- If direct Telegram env is missing locally and `op signin` blocks, prefer dispatching the manual GitHub lane because the `qa-live-shared` environment already has Convex CI credentials:
-
-```bash
-gh workflow run "NPM Telegram Beta E2E" --repo openclaw/openclaw --ref main \
-  -f package_spec=openclaw@YYYY.M.D-beta.N \
-  -f package_label=openclaw@YYYY.M.D-beta.N \
-  -f provider_mode=mock-openai
-```
-
- Poll the exact run id from the dispatch URL. `gh run view --json artifacts` is not supported; list artifacts with:
-
-```bash
-gh api repos/openclaw/openclaw/actions/runs/<run-id>/artifacts
-```
-
-## WhatsApp live credentials
-
-Use this when setting up or replacing Convex `kind=whatsapp` credentials.
-
- Treat WhatsApp QA credentials as operator-owned live accounts, not generated fixtures.
- Use two dedicated WhatsApp-capable test numbers: one driver account and one SUT account. Do not use personal numbers or personal OpenClaw WhatsApp accounts in the shared pool.
- Register and link each account manually with WhatsApp or WhatsApp Business, storing Web auth only in isolated local auth dirs outside the repo.
- For group coverage, create a dedicated test group that includes both QA accounts and store its JID as `groupJid`; otherwise the group mention-gating scenario should be skipped by default and fail when explicitly requested.
- Package the two Baileys auth dirs into base64 `.tgz` payload fields and add a new active Convex credential row. Prefer adding a fresh row and disabling stale/broken rows over overwriting credentials in place.
- Expected payload fields: `driverPhoneE164`, `sutPhoneE164`, `driverAuthArchiveBase64`, `sutAuthArchiveBase64`, and optional `groupJid`.
- Keep credential material out of the repo, logs, PRs, and screenshots. Redact phone numbers unless the operator explicitly asks for local debugging.
- Validate with `pnpm openclaw qa whatsapp --credential-source convex --credential-role maintainer --provider-mode mock-openai` and preserve artifact paths plus redacted pass/fail summaries.
- If WhatsApp expires or invalidates a linked Web session, relink locally, package fresh auth archives, add a new Convex row, then disable the stale row.
-
-## Character evals
-
-Use `qa character-eval` for style/persona/vibe checks across multiple live models.
-
-```bash
-pnpm openclaw qa character-eval \
-  --model openai/gpt-5.4,thinking=xhigh \
-  --model openai/gpt-5.2,thinking=xhigh \
-  --model openai/gpt-5,thinking=xhigh \
-  --model anthropic/claude-opus-4-6,thinking=high \
-  --model anthropic/claude-sonnet-4-6,thinking=high \
-  --model zai/glm-5.1,thinking=high \
-  --model moonshot/kimi-k2.5,thinking=high \
-  --model google/gemini-3.1-pro-preview,thinking=high \
-  --judge-model openai/gpt-5.4,thinking=xhigh,fast \
-  --judge-model anthropic/claude-opus-4-6,thinking=high \
-  --concurrency 16 \
-  --judge-concurrency 16 \
-  --output-dir .artifacts/qa-e2e/character-eval-<tag>
-```
-
- Runs local QA gateway child processes, not Docker.
- Preferred model spec syntax is `provider/model,thinking=<level>[,fast|,no-fast|,fast=<bool>]` for both `--model` and `--judge-model`.
- Do not add new examples with separate `--model-thinking`; keep that flag as legacy compatibility only.
- Defaults to candidate models `openai/gpt-5.4`, `openai/gpt-5.2`, `openai/gpt-5`, `anthropic/claude-opus-4-6`, `anthropic/claude-sonnet-4-6`, `zai/glm-5.1`, `moonshot/kimi-k2.5`, and `google/gemini-3.1-pro-preview` when no `--model` is passed.
- Candidate thinking defaults to `high`, with `xhigh` for OpenAI models that support it. Prefer inline `--model provider/model,thinking=<level>`; `--thinking <level>` and `--model-thinking <provider/model=level>` remain compatibility shims.
- OpenAI candidate refs default to fast mode so priority processing is used where supported. Use inline `,fast`, `,no-fast`, or `,fast=false` for one model; use `--fast` only to force fast mode for every candidate.
- Judges default to `openai/gpt-5.4,thinking=xhigh,fast` and `anthropic/claude-opus-4-6,thinking=high`.
- Report includes judge ranking, run stats, durations, and full transcripts; do not include raw judge replies. Duration is benchmark context, not a grading signal.
- Candidate and judge concurrency default to 16. Use `--concurrency <n>` and `--judge-concurrency <n>` to override when local gateways or provider limits need a gentler lane.
- Scenario source is YAML-only under `qa/scenarios/`: use `index.yaml` and
-  per-scenario `*.yaml` files with top-level `title`, `scenario`, and optional
-  `flow`. Never add fenced `qa-scenario` / `qa-flow` Markdown files.
- For isolated character/persona evals, write the persona into `SOUL.md` and blank `IDENTITY.md` in the scenario flow. Use `SOUL.md + IDENTITY.md` only when intentionally testing how the normal OpenClaw identity combines with the character.
- Keep prompts natural and task-shaped. The candidate model should receive character setup through `SOUL.md`, then normal user turns such as chat, workspace help, and small file tasks; do not ask "how would you react?" or tell the model it is in an eval.
- Prefer at least one real task, such as creating or editing a tiny workspace artifact, so the transcript captures character under normal tool use instead of pure roleplay.
-
-## Codex CLI model lane
-
-Use model refs shaped like `codex-cli/<codex-model>` whenever QA should exercise Codex as a model backend.
-
-Examples:
-
-```bash
-pnpm openclaw qa suite \
-  --provider-mode live-frontier \
-  --model codex-cli/<codex-model> \
-  --alt-model codex-cli/<codex-model> \
-  --scenario <scenario-id> \
-  --output-dir .artifacts/qa-e2e/codex-<tag>
-```
-
-```bash
-pnpm openclaw qa manual \
-  --model codex-cli/<codex-model> \
-  --message "Reply exactly: CODEX_OK"
-```
-
- Treat the concrete Codex model name as user/config input; do not hardcode it in source, docs examples, or scenarios.
- Live QA preserves `CODEX_HOME` so Codex CLI auth/config works while keeping `HOME` and `OPENCLAW_HOME` sandboxed.
- Mock QA should scrub `CODEX_HOME`.
- If Codex returns fallback/auth text every turn, first check `CODEX_HOME`,
-  relevant secret-backed auth, and gateway child logs before changing
-  scenario assertions.
- For model comparison, include `codex-cli/<codex-model>` as another candidate in `qa character-eval`; the report should label it as an opaque model name.
-
 ## Repo facts

- Seed scenarios live in `qa/scenarios/index.yaml` and
-  `qa/scenarios/<theme>/*.yaml`.
+- Seed scenarios live in `qa/`.
 - Main live runner: `extensions/qa-lab/src/suite.ts`
 - QA lab server: `extensions/qa-lab/src/lab-server.ts`
 - Child gateway harness: `extensions/qa-lab/src/gateway-child.ts`
@@ -265,9 +79,8 @@ pnpm openclaw qa manual \

 ## When adding scenarios

- Add or update scenario YAML under `qa/scenarios/`; do not add `.md` scenario
-  files or fenced YAML blocks.
- Keep kickoff expectations in `qa/scenarios/index.yaml` aligned
+- Add scenario metadata to `qa/seed-scenarios.json`
+- Keep kickoff expectations in `qa/QA_KICKOFF_TASK.md` aligned
 - Add executable coverage in `extensions/qa-lab/src/suite.ts`
 - Prefer end-to-end assertions over mock-only checks
 - Save outputs under `.artifacts/qa-e2e/`
--- a/.agents/skills/openclaw-refactor-docs/SKILL.md
+++ b/.agents/skills/openclaw-refactor-docs/SKILL.md
@@ -1,196 +0,0 @@
---
-name: openclaw-refactor-docs
-description: Refactor an existing OpenClaw docs page with source-audited preservation, restructuring, and verification.
---
-
-# OpenClaw Refactor Docs
-
-## Overview
-
-Use this skill when the user gives a target OpenClaw docs page and asks to
-rewrite, refactor, reorganize, split, shorten, or improve it.
-
-This skill builds on `openclaw-docs`: use that skill for style, page types,
-structure, examples, discoverability, and verification. This skill adds the
-rewrite workflow needed to avoid losing accurate behavior during a major docs
-refactor.
-
-## Inputs
-
-Required:
-
- A target docs page path, such as `docs/plugins/codex-harness.md`.
-
-Optional:
-
- Desired page type, such as topic page, guide, reference, or troubleshooting.
- Specific goals, such as shorter main page, move details to reference pages, or
-  align with current CLI behavior.
- Related source files, schemas, commands, tests, specs, or PRs.
-
-If the target page is missing or ambiguous, ask one concise question before
-editing. Otherwise, proceed.
-
-## Working Contract
-
-Refactor the target page to be more useful, concise, and comprehensive within
-its stated scope.
-
-Do not treat a rewrite as permission to discard behavior facts. Preserve,
-verify, move, or explicitly retire existing material. Incorrect docs are worse
-than verbose docs.
-
-Prefer this split:
-
- Topic or guide pages cover the 80/20 path, decisions readers must make, safe
-  setup, smallest reliable verification, common failures, and links onward.
- Reference pages cover exhaustive fields, defaults, enums, limits, precedence
-  rules, API contracts, narrow internals, and rare debugging details.
- Troubleshooting pages start from observable symptoms and map to checks,
-  causes, and fixes.
-
-## Workflow
-
-### 1. Load the doc standard
-
-Read `../openclaw-docs/SKILL.md` first. Apply its page-type, style,
-examples, navigation, and verification guidance throughout the refactor.
-
-Run `pnpm docs:list` when available, then read only the target page and the
-likely entry points, references, or related pages needed for the refactor.
-
-### 2. Classify the page
-
-Before editing, decide the intended page type from `openclaw-docs`.
-
-If the current page mixes page types, choose the main page type and plan where
-the other material belongs:
-
- Move exhaustive contracts to an existing or new reference page.
- Move symptom-driven material to an existing or new troubleshooting page.
- Move narrow setup workflows to a guide when they interrupt the main path.
- Keep concise routing, decision, and safety details in the main page when
-  readers need them to complete the workflow.
-
-### 3. Preserve and audit existing facts
-
-Create a working inventory from the old page before rewriting. Include:
-
- Config fields, flags, commands, slash commands, env vars, defaults, enums,
-  nullable values, and constraints.
- Precedence rules, fallback behavior, caps, limits, rate limits, timeouts,
-  lifecycle states, queueing behavior, and compatibility rules.
- Auth, permission, approval, sandbox, safety, privacy, and destructive-action
-  behavior.
- Setup requirements, supported versions, dependencies, operating systems,
-  credentials, and account requirements.
- Error messages, troubleshooting symptoms, diagnostics, and recovery steps.
- Examples, expected output, command routing tables, and cross-links.
-
-For each fact, choose one outcome:
-
- Keep it in the refactored target page.
- Move it to a specific existing page.
- Move it to a specific new page.
- Delete it because current source proves it is obsolete or out of scope.
-
-Do not infer defaults, permissions, policy, timeout behavior, or safety posture
-from names or intent. Verify them.
-
-### 4. Find source of truth
-
-Use the nearest authoritative source for each behavior-sensitive claim:
-
- Public schema, plugin manifest, generated config docs, or exported types for
-  config fields.
- CLI implementation, slash-command handlers, help text, and command tests for
-  commands and flags.
- Runtime source and tests for lifecycle, queueing, permission, fallback,
-  timeout, and provider behavior.
- Protocol docs, SDK facades, and contract tests for APIs and plugin surfaces.
- Existing docs only as secondary evidence unless the target is purely
-  conceptual.
-
-If a page promises a reference, compare its tables against the schema,
-manifest, CLI help, generated docs, or exported types. Missing public fields,
-defaults, precedence rules, caps, or side effects are correctness bugs.
-
-### 5. Plan moved material
-
-When moving detail out of the target page, record the destination before
-editing:
-
- Existing page: name the page and section.
- New page: choose the page type, slug, title, frontmatter summary,
-  `doc-schema-version: 1`, and `read_when` hints.
- Target page: keep a short summary and link from the point where readers need
-  the deeper detail.
-
-Avoid duplicate truth. If the same contract appears in multiple places, choose
-one canonical page and link to it.
-
-### 6. Rewrite
-
-Rewrite in this order:
-
-1. Make the first screen answer what the reader can do and why this page exists.
-2. Put the recommended path before alternatives.
-3. Keep only decision-making and common operational detail in the main flow.
-4. Move exhaustive tables and rare details to the planned reference pages.
-5. Preserve concise routing tables when they help readers choose commands,
-   config paths, harnesses, plugins, providers, or references.
-6. Add troubleshooting from observable symptoms, not internal guesses.
-7. Link related concepts, guides, references, diagnostics, and adjacent tools.
-
-Add `doc-schema-version: 1` to the YAML frontmatter of every docs page that the
-refactor migrates, creates, or materially rewrites. Apply it only to docs page
-files, not `docs.json`, glossary JSON, or other non-page metadata. If a
-migrated page is generated, update the generator so regeneration preserves the
-marker instead of hand-editing generated output.
-
-Do not leave placeholders such as "TODO", "TBD", or "see docs" unless the user
-explicitly asks for a draft.
-
-### 7. Compare old and new
-
-After editing, compare the old and new page:
-
- Confirm all behavior-sensitive facts were kept, moved, or intentionally
-  deleted with source-backed reason.
- Check that the main page still covers the 80/20 scenario end to end.
- Check that reference pages remain exhaustive for the scope they claim.
- Check that links from the target page reach moved details.
- Check that headings are stable, searchable, and action-oriented.
-
-If the refactor deliberately removes relevant material, say where it went or why
-it was removed in the final report.
-
-### 8. Verify
-
-Run the smallest reliable docs checks for the touched surface:
-
- `pnpm docs:list`
- `git diff --check -- <touched-files>`
- Targeted `pnpm exec oxfmt --check --threads=1 <touched-files>`
- `pnpm docs:check-mdx`
- `pnpm docs:check-links`
- `pnpm docs:check-i18n-glossary` when link text, navigation, labels, or glossary
-  surfaces changed
- Generated-doc checks when schemas, generated config docs, API docs, or
-  generated baselines are touched
-
-Run commands and examples from the page whenever feasible. If you cannot verify
-a behavior-sensitive claim, either remove the claim, mark the uncertainty in the
-work-in-progress report, or ask for the missing source.
-
-## Final Report
-
-Report:
-
- What changed in the target page.
- What details moved and their destination pages.
- What source-of-truth checks backed behavior-sensitive claims.
- What validation ran and what failed for unrelated reasons.
-
-Do not include a long rewrite diary. Lead with remaining risks only if there are
-any.
--- a/.agents/skills/openclaw-release-maintainer/SKILL.md
+++ b/.agents/skills/openclaw-release-maintainer/SKILL.md
@@ -0,0 +1,267 @@
+---
+name: openclaw-release-maintainer
+description: Maintainer workflow for OpenClaw releases, prereleases, changelog release notes, and publish validation. Use when Codex needs to prepare or verify stable or beta release steps, align version naming, assemble release notes, check release auth requirements, or validate publish-time commands and artifacts.
+---
+
+# OpenClaw Release Maintainer
+
+Use this skill for release and publish-time workflow. Keep ordinary development changes and GHSA-specific advisory work outside this skill.
+
+## Respect release guardrails
+
+- Do not change version numbers without explicit operator approval.
+- Ask permission before any npm publish or release step.
+- This skill should be sufficient to drive the normal release flow end-to-end.
+- Use the private maintainer release docs for credentials, recovery steps, and mac signing/notary specifics, and use `docs/reference/RELEASING.md` for public policy.
+- Core `openclaw` publish is manual `workflow_dispatch`; creating or pushing a tag does not publish by itself.
+
+## Keep release channel naming aligned
+
+- `stable`: tagged releases only, published to npm `beta` by default; operators may target npm `latest` explicitly or promote later
+- `beta`: prerelease tags like `vYYYY.M.D-beta.N`, with npm dist-tag `beta`
+- Prefer `-beta.N`; do not mint new `-1` or `-2` beta suffixes
+- `dev`: moving head on `main`
+- When using a beta Git tag, publish npm with the matching beta version suffix so the plain version is not consumed or blocked
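As a rough illustration of the naming rules above, the sketch below classifies a tag by its shape. The function name `classify_tag` and its output labels are hypothetical. The patterns assume `M` and `D` are one or two digits without zero padding, and bare `-N` suffixes are treated as fallback correction tags.

```shell
# Classify a release tag by shape (illustrative helper, not part of the repo).
classify_tag() {
  tag=$1
  # Assumed shape of the date-based version: vYYYY.M.D
  base='v[0-9]{4}\.[0-9]{1,2}\.[0-9]{1,2}'
  if printf '%s\n' "$tag" | grep -Eq "^${base}\$"; then
    echo stable
  elif printf '%s\n' "$tag" | grep -Eq "^${base}-beta\.[0-9]+\$"; then
    echo beta
  elif printf '%s\n' "$tag" | grep -Eq "^${base}-[0-9]+\$"; then
    echo correction
  else
    echo unknown
    return 1
  fi
}
```

A caller could refuse to continue whenever the helper reports `unknown`, which keeps malformed tags out of the later workflow steps.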
+
+## Handle versions and release files consistently
+
+- Version locations include:
+  - `package.json`
+  - `apps/android/app/build.gradle.kts`
+  - `apps/ios/Sources/Info.plist`
+  - `apps/ios/Tests/Info.plist`
+  - `apps/macos/Sources/OpenClaw/Resources/Info.plist`
+  - `docs/install/updating.md`
+  - Peekaboo Xcode project and plist version fields
+- Before creating a release tag, make every version location above match the version encoded by that tag.
+- For fallback correction tags like `vYYYY.M.D-N`, the repo version locations still stay at `YYYY.M.D`.
+- “Bump version everywhere” means all version locations above; `appcast.xml` is not one of them.
+- Release signing and notary credentials live outside the repo in the private maintainer docs.
+- Every OpenClaw release ships the npm package and macOS app together.
+- The production Sparkle feed lives at `https://raw.githubusercontent.com/openclaw/openclaw/main/appcast.xml`, and the canonical published file is `appcast.xml` on `main` in the `openclaw` repo.
+- That shared production Sparkle feed is stable-only. Beta mac releases may
+  upload assets to the GitHub prerelease, but they must not replace the shared
+  `appcast.xml` unless a separate beta feed exists.
+- For fallback correction tags like `vYYYY.M.D-N`, the repo version still stays
+  at `YYYY.M.D`, but the mac release must use a strictly higher numeric
+  `APP_BUILD` / Sparkle build than the original release so existing installs
+  see it as newer.
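The tag-to-version rule above can be sketched as a small helper. This is only an illustration: `repo_version_for_tag` is a hypothetical name, and the sketch assumes beta tags keep their `-beta.N` suffix in the repo version while correction tags map back to the base version.

```shell
# Derive the version the repo files should carry for a given tag
# (illustrative helper, not part of the repo).
repo_version_for_tag() {
  version=${1#v}                    # drop the leading "v"
  case $version in
    *-beta.*) ;;                    # assumed: beta tags keep their suffix
    *-*) version=${version%-*} ;;   # correction tags stay at the base version
  esac
  echo "$version"
}
```

The mac `APP_BUILD` value is deliberately left out here, because a correction release needs a strictly higher build number that cannot be derived from the tag alone.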
+
+## Build changelog-backed release notes
+
+- Changelog entries should be user-facing, not internal release-process notes.
+- When cutting a mac release with a beta GitHub prerelease:
+  - tag `vYYYY.M.D-beta.N` from the release commit
+  - create a prerelease titled `openclaw YYYY.M.D-beta.N`
+  - use release notes from the matching `CHANGELOG.md` version section
+  - attach at least the zip and dSYM zip, plus dmg if available
+- Keep the top version entries in `CHANGELOG.md` sorted by impact:
+  - `### Changes` first
+  - `### Fixes` next, deduplicated, with user-facing fixes first
+
+## Run publish-time validation
+
+Before tagging or publishing, run:
+
+```bash
+pnpm build
+pnpm ui:build
+pnpm release:check
+pnpm test:install:smoke
+```
+
+To skip the non-root smoke path:
+
+```bash
+OPENCLAW_INSTALL_SMOKE_SKIP_NONROOT=1 pnpm test:install:smoke
+```
+
+After npm publish, run:
+
+```bash
+node --import tsx scripts/openclaw-npm-postpublish-verify.ts <published-version>
+```
+
+- This verifies the published registry install path in a fresh temp prefix.
+- For stable correction releases like `YYYY.M.D-N`, it also verifies the
+  upgrade path from `YYYY.M.D` to `YYYY.M.D-N` so a correction publish cannot
+  silently leave existing global installs on the old base stable payload.
+
+## Check all relevant release builds
+
+- Always validate the OpenClaw npm release path before creating the tag.
+- Default release checks:
+  - `pnpm check`
+  - `pnpm build`
+  - `pnpm ui:build`
+  - `pnpm release:check`
+  - `OPENCLAW_INSTALL_SMOKE_SKIP_NONROOT=1 pnpm test:install:smoke`
+- Check all release-related build surfaces touched by the release, not only the npm package.
+- Include mac release readiness in preflight by running the public validation
+  workflow in `openclaw/openclaw` and the real mac preflight in
+  `openclaw/releases-private` for every release.
+- Treat the `appcast.xml` update on `main` as part of mac release readiness, not an optional follow-up.
+- The workflows remain tag-based. The agent is responsible for making sure
+  preflight runs complete successfully before any publish run starts.
+- Any fix after preflight means a new commit. Delete and recreate the tag and
+  matching GitHub release from the fixed commit, then rerun preflight from
+  scratch before publishing.
+- For stable mac releases, generate the signed `appcast.xml` before uploading
+  public release assets so the updater feed cannot lag the published binaries.
+- Serialize stable appcast-producing runs across tags so two releases do not
+  generate replacement `appcast.xml` files from the same stale seed.
+- For stable releases, confirm the latest beta already passed the broader release workflows before cutting stable.
+- If any required build, packaging step, or release workflow is red, do not say the release is ready.
+
+## Use the right auth flow
+
+- OpenClaw publish uses GitHub trusted publishing.
+- Stable npm promotion from `beta` to `latest` is an explicit mode on
+  `.github/workflows/openclaw-npm-release.yml`, but it still needs a valid
+  `NPM_TOKEN` because `npm dist-tag` management is separate from trusted
+  publishing.
+- The publish run must be started manually with `workflow_dispatch`.
+- The npm workflow and the private mac publish workflow accept
+  `preflight_only=true` to run validation/build/package steps without uploading
+  public release assets.
+- Real npm publish requires a prior successful npm preflight run id so the
+  publish job promotes the prepared tarball instead of rebuilding it.
+- Real private mac publish requires a prior successful private mac preflight
+  run id so the publish job promotes the prepared artifacts instead of
+  rebuilding or renotarizing them.
+- The private mac workflow also accepts `smoke_test_only=true` for branch-safe
+  workflow smoke tests that use ad-hoc signing, skip notarization, skip shared
+  appcast generation, and do not prove release readiness.
+- `preflight_only=true` on the npm workflow is also the right way to validate an
+  existing tag after publish; it should keep running the build checks even when
+  the npm version is already published.
+- Validation-only runs may be dispatched from a branch when you are testing a
+  workflow change before merge.
+- `.github/workflows/macos-release.yml` in `openclaw/openclaw` is now a
+  public validation-only handoff. It validates the tag/release state and points
+  operators to the private repo. It still rebuilds the JS outputs needed for
+  release validation, but it does not sign, notarize, or publish macOS
+  artifacts.
+- `openclaw/releases-private/.github/workflows/openclaw-macos-validate.yml`
+  is the required private mac validation lane for `swift test`; keep it green
+  before any real mac publish run starts.
+- Real mac preflight and real mac publish both use
+  `openclaw/releases-private/.github/workflows/openclaw-macos-publish.yml`.
+- The private mac validation lane runs on GitHub's standard macOS runner.
+- The private mac preflight path runs on GitHub's xlarge macOS runner and uses
+  a SwiftPM cache because the build/sign/notarize/package path is CPU-heavy.
+- Private mac preflight uploads notarized build artifacts as workflow artifacts
+  instead of uploading public GitHub release assets.
+- Private smoke-test runs upload ad-hoc, non-notarized build artifacts as
+  workflow artifacts and intentionally skip stable `appcast.xml` generation.
+- npm preflight, public mac validation, private mac validation, and private mac
+  preflight must all pass before any real publish run starts.
+- Real publish runs must be dispatched from `main`; branch-dispatched publish
+  attempts should fail before the protected environment is reached.
+- The release workflows stay tag-based; rely on the documented release sequence
+  rather than workflow-level SHA pinning.
+- The `npm-release` environment must be approved by `@openclaw/openclaw-release-managers` before publish continues.
+- Mac publish uses
+  `openclaw/releases-private/.github/workflows/openclaw-macos-publish.yml` for
+  private mac preflight artifact preparation and real publish artifact
+  promotion.
+- Real private mac publish uploads the packaged `.zip`, `.dmg`, and
+  `.dSYM.zip` assets to the existing GitHub release in `openclaw/openclaw`
+  automatically when `OPENCLAW_PUBLIC_REPO_RELEASE_TOKEN` is present in the
+  private repo `mac-release` environment.
+- For stable releases, the agent must also download the signed
+  `macos-appcast-<tag>` artifact from the successful private mac workflow and
+  then update `appcast.xml` on `main`.
+- For beta mac releases, do not update the shared production `appcast.xml`
+  unless a separate beta Sparkle feed exists.
+- The private repo targets a dedicated `mac-release` environment. If the GitHub
+  plan does not yet support required reviewers there, do not assume the
+  environment alone is the approval boundary; rely on private repo access and
+  CODEOWNERS until those settings can be enabled.
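+- Do not use `NPM_TOKEN` or the plugin OTP flow for the OpenClaw publish itself; `NPM_TOKEN` is needed only for the `beta` to `latest` dist-tag promotion described above.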
+- `@openclaw/*` plugin publishes use a separate maintainer-only flow.
+- Only publish plugins that already exist on npm; bundled disk-tree-only plugins stay unpublished.
+
+## Fallback local mac publish
+
+- Keep the original local macOS publish workflow available as a fallback in case
+  CI/CD mac publishing is unavailable or broken.
+- Preserve the existing maintainer workflow Peter uses: run it on a real Mac
+  with local signing, notary, and Sparkle credentials already configured.
+- Follow the private maintainer macOS runbook for the local steps:
+  `scripts/package-mac-dist.sh` to build, sign, notarize, and package the app;
+  manual GitHub release asset upload; then `scripts/make_appcast.sh` plus the
+  `appcast.xml` commit to `main`.
+- `scripts/package-mac-dist.sh` now fails closed for release builds if the
+  bundled app comes out with a debug bundle id, an empty Sparkle feed URL, or a
+  `CFBundleVersion` below the canonical Sparkle build floor for that short
+  version. For correction tags, set a higher explicit `APP_BUILD`.
+- `scripts/make_appcast.sh` first uses `generate_appcast` from `PATH`, then
+  falls back to the SwiftPM Sparkle tool output under `apps/macos/.build`.
+- For stable tags, the local fallback may update the shared production
+  `appcast.xml`.
+- For beta tags, the local fallback still publishes the mac assets but must not
+  update the shared production `appcast.xml` unless a separate beta feed exists.
+- Treat the local workflow as fallback only. Prefer the CI/CD publish workflow
+  when it is working.
+- After any stable mac publish, verify all of the following before you call the
+  release finished:
+  - the GitHub release has `.zip`, `.dmg`, and `.dSYM.zip` assets
+  - `appcast.xml` on `main` points at the new stable zip
+  - the packaged app reports the expected short version and a numeric
+    `CFBundleVersion` at or above the canonical Sparkle build floor
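The asset check in the list above could be scripted roughly as follows. This is a sketch under assumptions: `check_mac_assets` is a hypothetical helper that reads asset names on stdin, and the file names used to exercise it are made up.

```shell
# Report which of the expected mac asset kinds are missing
# (illustrative helper, not part of the repo).
check_mac_assets() {
  names=$(cat)
  missing=""
  # The dSYM archive also ends in .zip, so exclude it when looking for the app zip.
  printf '%s\n' "$names" | grep -v '\.dSYM\.zip$' | grep -q '\.zip$' || missing="$missing .zip"
  printf '%s\n' "$names" | grep -q '\.dmg$' || missing="$missing .dmg"
  printf '%s\n' "$names" | grep -q '\.dSYM\.zip$' || missing="$missing .dSYM.zip"
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "ok"
}
```

One way to feed it is `gh release view <tag> --repo openclaw/openclaw --json assets --jq '.assets[].name' | check_mac_assets`. The appcast and `CFBundleVersion` checks still need to be done separately.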
+
+## Run the release sequence
+
+1. Confirm the operator explicitly wants to cut a release.
+2. Choose the exact target version and git tag.
+3. Make every repo version location match that tag before creating it.
+4. Update `CHANGELOG.md` and assemble the matching GitHub release notes.
+5. Run the full preflight for all relevant release builds, including mac readiness.
+6. Confirm the target npm version is not already published.
+7. Create and push the git tag.
+8. Create or refresh the matching GitHub release.
+9. Start `.github/workflows/openclaw-npm-release.yml` with `preflight_only=true`
+   and choose the intended `npm_dist_tag` (`beta` default; `latest` only for
+   an intentional direct stable publish). Wait for it to pass. Save that run id
+   because the real publish requires it to reuse the prepared npm tarball.
+10. Start `.github/workflows/macos-release.yml` in `openclaw/openclaw` and wait
+    for the public validation-only run to pass.
+11. Start
+    `openclaw/releases-private/.github/workflows/openclaw-macos-validate.yml`
+    with the same tag and wait for the private mac validation lane to pass.
+12. Start
+    `openclaw/releases-private/.github/workflows/openclaw-macos-publish.yml`
+    with `preflight_only=true` and wait for it to pass. Save that run id because
+    the real publish requires it to reuse the notarized mac artifacts.
+13. If any preflight or validation run fails, fix the issue on a new commit,
+    delete the tag and matching GitHub release, recreate them from the fixed
+    commit, and rerun all relevant preflights from scratch before continuing.
+    Never reuse old preflight results after the commit changes.
+14. Start `.github/workflows/openclaw-npm-release.yml` with the same tag for
+    the real publish, choose `npm_dist_tag` (`beta` default, `latest` only when
+    you intentionally want direct stable publish), keep it the same as the
+    preflight run, and pass the successful npm `preflight_run_id`.
+15. Wait for `npm-release` approval from `@openclaw/openclaw-release-managers`.
+16. If the stable release was published to `beta`, start
+    `.github/workflows/openclaw-npm-release.yml` again after beta validation
+    passes with the same stable tag, `promote_beta_to_latest=true`,
+    `preflight_only=false`, empty `preflight_run_id`, and `npm_dist_tag=beta`,
+    then verify `latest` now points at that version.
+17. Start
+    `openclaw/releases-private/.github/workflows/openclaw-macos-publish.yml`
+    for the real publish with the successful private mac `preflight_run_id` and
+    wait for success.
+18. Verify the successful real private mac run uploaded the `.zip`, `.dmg`,
+    and `.dSYM.zip` artifacts to the existing GitHub release in
+    `openclaw/openclaw`.
+19. For stable releases, download `macos-appcast-<tag>` from the successful
+    private mac run, update `appcast.xml` on `main`, and verify the feed.
+20. For beta releases, publish the mac assets but expect no shared production
+    `appcast.xml` artifact and do not update the shared production feed unless a
+    separate beta feed exists.
+21. After publish, verify npm and the attached release artifacts.
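The npm preflight and publish dispatches in the sequence above might be assembled like this. It is a dry-run sketch that only prints the `gh workflow run` command. The input names `npm_dist_tag`, `preflight_only`, and `preflight_run_id` come from the steps above, while the `tag` input name and the helper name `npm_release_cmd` are assumptions.

```shell
# Print (do not run) the gh command for an npm release workflow dispatch.
# Without a preflight run id the command is a preflight; with one it is the real publish.
npm_release_cmd() {
  tag=$1 dist_tag=$2 preflight_run_id=${3:-}
  cmd="gh workflow run openclaw-npm-release.yml --repo openclaw/openclaw --ref main"
  # "tag" is an assumed workflow input name; check the workflow file before use.
  cmd="$cmd -f tag=$tag -f npm_dist_tag=$dist_tag"
  if [ -z "$preflight_run_id" ]; then
    cmd="$cmd -f preflight_only=true"
  else
    cmd="$cmd -f preflight_only=false -f preflight_run_id=$preflight_run_id"
  fi
  echo "$cmd"
}
```

Printing the command first gives the operator a chance to confirm the tag and dist-tag before anything is dispatched, which fits the guardrail of asking permission before any publish step.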
+
+## GHSA advisory work
+
+- Use `openclaw-ghsa-maintainer` for GHSA advisory inspection, patch/publish flow, private-fork validation, and GHSA API-specific publish checks.
--- a/.agents/skills/openclaw-secret-scanning-maintainer/SKILL.md
+++ b/.agents/skills/openclaw-secret-scanning-maintainer/SKILL.md
@@ -1,236 +0,0 @@
---
-name: openclaw-secret-scanning-maintainer
-description: Triage, redact, clean up, and resolve OpenClaw GitHub Secret Scanning alerts in issues or PRs.
---
-
-# OpenClaw Secret Scanning Maintainer
-
-**Maintainer-only.** This skill requires repo admin / maintainer permissions to edit or delete other users' comments and resolve secret scanning alerts.
-
-Use this skill when processing alerts from `https://github.com/openclaw/openclaw/security/secret-scanning`.
-
-**Language rule:** All notification comments and replacement comments MUST be written in English.
-
-## Script
-
-All mechanical operations (API calls, temp file management, security enforcements) are handled by:
-
-```
-$REPO_ROOT/.agents/skills/openclaw-secret-scanning-maintainer/scripts/secret-scanning.mjs
-```
-
-The script enforces:
-
- `hide_secret=true` on all alert fetches (no plaintext secrets in stdout)
- `mktemp` with random UUIDs for all temp files
- `-F body=@file` for all body uploads (no inline shell quoting)
- Notification templates branched by location type
- Never prints `.secret` or `.body` to stdout
-
-## Overall Flow
-
-Supports single or multiple alerts. For multiple alerts, process in ascending order.
-
-For each alert:
-
-1. **Identify** — `fetch-alert` + `fetch-content` to get metadata and body
-2. **Decide** — Agent reads the body file, identifies whether plaintext secrets remain, and produces a redacted version only when needed
-3. **Redact** — `redact-body-if-needed` for issue/PR body; skip for comments (delete directly)
-4. **Purge** — `delete-comment` + `recreate-comment` for comments; cannot purge body history
-5. **Notify** — `notify` posts the right template per location type, unless the current issue/PR body is already redacted
-6. **Resolve** — `resolve` closes the alert
-7. **Summary** — `summary` prints formatted results
-
-## Step 1: Identify
-
-```bash
-# List all open alerts
-node secret-scanning.mjs list-open
-
-# Fetch specific alert metadata + locations
-node secret-scanning.mjs fetch-alert <NUMBER>
-
-# Fetch content for each location (saves body to temp file)
-node secret-scanning.mjs fetch-content '<location-json>'
-```
-
-The `fetch-content` output includes:
-
- `body_file`: path to temp file with full body content
- `author`: who posted it
- `issue_number` / `pr_number`: where it is
- `edit_history_count`: number of existing edits
- `type`: location type for routing
- For `discussion_comment`, it also includes `comment_node_id`, `discussion_node_id`, and `reply_to_node_id` when the original comment was a reply.
-
-### Location type routing
-
-| type                          | Flow                                          |
-| ----------------------------- | --------------------------------------------- |
-| `issue_comment`               | Comment: delete+recreate                      |
-| `pull_request_comment`        | Comment: delete+recreate                      |
-| `pull_request_review_comment` | Comment: delete+recreate                      |
-| `discussion_comment`          | Discussion comment: delete+recreate (GraphQL) |
-| `issue_body`                  | Body: redact in place                         |
-| `pull_request_body`           | Body: redact in place                         |
-| `commit`                      | Notify only                                   |
-| _other_                       | Skip and report                               |
-
-## Step 2: Decide (Agent)
-
-The agent reads the body file from `fetch-content` output and:
-
-1. Identifies ALL secrets in the content (there may be more than the alert flagged)
-2. Determines whether any plaintext credential remains in the current body
-3. Replaces each remaining secret with `[REDACTED <secret_type>]` — **no partial values, no prefix/suffix**
-4. Saves the redacted content to a new temp file
-
-This is the only step that requires semantic understanding. Everything else is mechanical.
-
-For `issue_body` and `pull_request_body`: if the current body has already been redacted by the author and no plaintext credential remains, **do not post a public notification comment**. Resolve the alert with a maintainer-only resolution comment such as:
-
-```bash
-node secret-scanning.mjs resolve <ALERT_NUMBER> revoked "Current issue/PR body is already redacted; no public notification posted."
-```
-
-This avoids creating a fresh public pointer to historical sensitive content.
-
-## Step 3: Redact
-
-### For comments (issue_comment / PR comments)
-
-**Do NOT redact.** Skip directly to Step 4 (delete + recreate). PATCHing before DELETE creates an unnecessary edit history revision.
-
-### For issue_body / pull_request_body
-
-```bash
-node secret-scanning.mjs redact-body-if-needed <issue|pr> <NUMBER> <current-body-file> <redacted-body-file> <result-file>
-```
-
-Use the `body_file` from `fetch-content` as `<current-body-file>`. The command writes `notify_required` to `<result-file>` and only PATCHes the body when the redacted file differs from the current body.
-
-## Step 4: Purge Edit History
-
-### Comments — Delete and Recreate
-
-For issue/PR comments:
-
-```bash
-# Delete original (all edit history gone)
-node secret-scanning.mjs delete-comment <COMMENT_ID>
-
-# Recreate with redacted content
-node secret-scanning.mjs recreate-comment <ISSUE_NUMBER> <body-file>
-```
-
-For discussion comments (uses GraphQL):
-
-```bash
-# Delete original
-node secret-scanning.mjs delete-discussion-comment <COMMENT_NODE_ID>
-
-# Recreate with redacted content
-node secret-scanning.mjs recreate-discussion-comment <DISCUSSION_NODE_ID> <body-file> [REPLY_TO_NODE_ID]
-```
-
-The `fetch-content` output for `discussion_comment` includes `comment_node_id` and `discussion_node_id` for these commands. When the original discussion comment was a reply, it also includes `reply_to_node_id`; pass that optional third argument so the redacted replacement stays in the original thread.
-
-The recreated comment should follow this format:
-
-```
-> **Note:** The original comment by @<AUTHOR> has been removed due to secret leakage. Below is the redacted version of the original content.
-
---
-
-<redacted original content>
-```
-
-### issue_body / pull_request_body — Cannot Purge Edit History
-
-Editing creates an edit history revision with the pre-edit plaintext. This cannot be cleared via API.
-
-Do not advise authors publicly to delete/recreate issues or close/reopen PRs. That can draw attention to historical content. Keep purge guidance maintainer-only.
-
-**Output to maintainer terminal only (never in public comments):**
-
-```
-⚠️ Issue/PR body edit history still contains plaintext secrets.
-Contact GitHub Support to purge: https://support.github.com/contact
-Request purge of issue/PR #{NUMBER} userContentEdits.
-```
-
-> **CRITICAL:** Do NOT mention edit history or the "edited" button in any public comment or resolution_comment.
-
-### Commits
-
-Commit contents cannot be cleaned. Notify the author to delete the branch or force-push (for unmerged PRs).
-
-## Step 5: Notify
-
-```bash
-node secret-scanning.mjs notify <TARGET> <AUTHOR> <LOCATION_TYPE> <SECRET_TYPES> [REPLY_TO_NODE_ID|BODY_REDACTION_RESULT_FILE]
-```
-
- For non-discussion types, `<TARGET>` is the issue/PR number.
- For `discussion_comment`, `<TARGET>` is the `discussion_node_id` returned by `fetch-content`.
- For reply-style `discussion_comment` locations, pass the optional `reply_to_node_id` from `fetch-content` so the notification stays in the same thread.
- For `issue_body` and `pull_request_body`, pass the `<result-file>` from `redact-body-if-needed`. The script skips notification when `notify_required` is `false` and refuses body notifications without this file.
-
-Secret types are comma-separated: `"Discord Bot Token,Feishu App Secret"`
-
-The script picks the right template:
-
- **comment types**: "your comment … removed and replaced"
- **body types**: "your issue/PR description … redacted in place"
- **commit**: "code you committed"
-
-For `issue_body` and `pull_request_body`, only notify when the current body still contained plaintext and maintainers redacted it. If the user already redacted the current body, skip this step and resolve silently.
-
-## Step 6: Resolve
-
-```bash
-node secret-scanning.mjs resolve <ALERT_NUMBER>
-# or with custom resolution:
-node secret-scanning.mjs resolve <ALERT_NUMBER> revoked "Custom comment"
-```
-
-Resolution is `revoked` by default. As maintainers we cannot control whether users rotate their credentials; our responsibility is to remove current plaintext exposure and notify only when public notification is useful. Here, `revoked` means "this secret should be considered leaked", not "I confirmed it was revoked".
-
-## Step 7: Summary
-
-After processing, create a JSON results file and pass it to the summary command:
-
-```bash
-node secret-scanning.mjs summary /tmp/results.json
-```
-
-The script outputs a block delimited by `---BEGIN SUMMARY---` and `---END SUMMARY---`. **You MUST output the content between these markers verbatim to the user. Do NOT rephrase, reformat, abbreviate, or create your own summary.** The script already includes full URLs for every alert and location.
-
-The JSON format:
-
-```json
-[
-  {
-    "number": 72,
-    "secret_type": "Discord Bot Token",
-    "location_label": "Issue #63101 comment",
-    "location_url": "https://github.com/openclaw/openclaw/issues/63101#issuecomment-xxx",
-    "actions": "Deleted+Recreated+Notified",
-    "history_cleared": true
-  }
-]
-```
-
-For unsupported types, add `"skipped": true, "unsupported_type": "<type>"`.
-
-## Safety Rules
-
- **Agent reads content, identifies secrets, produces redaction.** Script handles all API calls.
- **Never include any portion of a secret** in public comments, redaction markers, or terminal output.
- **Never include alert URLs or numbers** in public comments.
- **For comments, skip PATCH — go directly to DELETE + recreate.**
- **Never mention edit history, "edited" button, or commit SHAs** in any public content.
- **Ask for confirmation** before deleting any comment.
- **One alert at a time** unless user requests batch.
- **All public comments in English.**
- **Skip unsupported location types** and report in summary.
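As a minimal sketch of the results file described in Step 7, the entries could be assembled like this before being written to disk and passed to `summary`. `makeResultEntry` and `makeSkippedEntry` are hypothetical helper names that are not part of `secret-scanning.mjs`; the field names follow the JSON format shown in Step 7, and the sample values (alert numbers, the `wiki_commit` type, labels) are illustrative assumptions only.

```javascript
// Hypothetical helper (not in secret-scanning.mjs): builds one results entry
// using the field names from the Step 7 JSON format.
function makeResultEntry({ number, secretType, locationLabel, locationUrl, actions, historyCleared }) {
  return {
    number,
    secret_type: secretType,
    location_label: locationLabel,
    location_url: locationUrl,
    actions,
    history_cleared: historyCleared,
  };
}

// Hypothetical helper: same entry, flagged as skipped for an unsupported location type.
function makeSkippedEntry(base, unsupportedType) {
  return { ...makeResultEntry(base), skipped: true, unsupported_type: unsupportedType };
}

// Illustrative values only; real runs take these from fetch-alert / fetch-content output.
const results = [
  makeResultEntry({
    number: 72,
    secretType: "Discord Bot Token",
    locationLabel: "Issue #63101 comment",
    locationUrl: "https://github.com/openclaw/openclaw/issues/63101#issuecomment-xxx",
    actions: "Deleted+Recreated+Notified",
    historyCleared: true,
  }),
  makeSkippedEntry(
    {
      number: 73,
      secretType: "Feishu App Secret",
      locationLabel: "Unsupported location",
      locationUrl: null,
      actions: "Skipped",
      historyCleared: false,
    },
    "wiki_commit",
  ),
];

// This string is what would be saved as the results file for the summary command.
const resultsJson = JSON.stringify(results, null, 2);
console.log(resultsJson);
```

Keeping the entries as plain objects means the file can be produced with a single `JSON.stringify` call and contains no secret material, only types, labels, and URLs.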
--- a/.agents/skills/openclaw-secret-scanning-maintainer/scripts/secret-scanning.mjs
+++ b/.agents/skills/openclaw-secret-scanning-maintainer/scripts/secret-scanning.mjs
@@ -1,938 +0,0 @@
-#!/usr/bin/env node
-/**
- * Secret scanning alert handler for OpenClaw maintainers.
- * Usage: node secret-scanning.mjs <command> [options]
- */
-
-import { spawnSync } from "node:child_process";
-import crypto from "node:crypto";
-import fs from "node:fs";
-import os from "node:os";
-import path from "node:path";
-import { pathToFileURL } from "node:url";
-
-const REPO = "openclaw/openclaw";
-const REPO_URL = `https://github.com/${REPO}`;
-
-// ─── Helpers ────────────────────────────────────────────────────────────────
-
-function fail(message) {
-  console.error(`error: ${message}`);
-  process.exit(1);
-}
-
-function tmpFile(purpose) {
-  const filePath = path.join(os.tmpdir(), `secretscan-${purpose}-${crypto.randomUUID()}`);
-  // Pre-create the file with owner-only permissions
-  fs.writeFileSync(filePath, "", { mode: 0o600 });
-  return filePath;
-}
-
-function gh(args, { json = true, allowFailure = false } = {}) {
-  const proc = spawnSync("gh", args, { encoding: "utf8", maxBuffer: 10 * 1024 * 1024 });
-  if (proc.status !== 0 && !allowFailure) {
-    fail(`gh ${args.slice(0, 3).join(" ")} failed:\n${(proc.stderr || proc.stdout || "").trim()}`);
-  }
-  if (proc.status !== 0) {
-    return {
-      gh_failed: true,
-      status: proc.status,
-      stdout: proc.stdout,
-      stderr: proc.stderr,
-    };
-  }
-  if (!json) {
-    return proc.stdout;
-  }
-  try {
-    return JSON.parse(proc.stdout);
-  } catch {
-    return proc.stdout;
-  }
-}
-
-function ghGraphQL(query, options = {}) {
-  return gh(["api", "graphql", "-f", `query=${query}`], options);
-}
-
-function isBodyLocationType(locationType) {
-  return locationType === "issue_body" || locationType === "pull_request_body";
-}
-
-/** Decides whether redacting an issue/PR body requires notifying the reporter. */
-export function decideBodyRedaction(currentBody, redactedBody) {
-  const bodyChanged = String(currentBody) !== String(redactedBody);
-  return {
-    body_changed: bodyChanged,
-    notify_required: bodyChanged,
-  };
-}
-
-/** Loads redaction-result metadata for issue/PR body secret locations. */
-export function loadBodyRedactionResult(locationType, resultFile) {
-  if (!isBodyLocationType(locationType)) {
-    return { notify_required: true };
-  }
-  if (!resultFile) {
-    fail("Body notifications require a redaction result file from redact-body-if-needed");
-  }
-  if (!fs.existsSync(resultFile)) {
-    fail(`File not found: ${resultFile}`);
-  }
-
-  const result = JSON.parse(fs.readFileSync(resultFile, "utf8"));
-  if (typeof result.notify_required !== "boolean") {
-    fail(`Invalid redaction result file: missing boolean notify_required in ${resultFile}`);
-  }
-  return result;
-}
-
-function failOnGraphQLFailure(result, message) {
-  if (result?.gh_failed) {
-    const details = (
-      result.stderr ||
-      result.stdout ||
-      `gh exited with status ${result.status}`
-    ).trim();
-    fail(`${message}: ${details}`);
-  }
-  if (Array.isArray(result?.errors) && result.errors.length > 0) {
-    fail(`${message}: ${JSON.stringify(result.errors)}`);
-  }
-}
-
-function escapeGraphQLString(value) {
-  return String(value)
-    .replace(/\\/g, "\\\\")
-    .replace(/"/g, '\\"')
-    .replace(/\r/g, "\\r")
-    .replace(/\n/g, "\\n");
-}
-
-function formatGraphQLAfterClause(cursor) {
-  return cursor ? `, after: "${escapeGraphQLString(cursor)}"` : "";
-}
-
-function findDiscussionCommentNode(nodes, discussionCommentDbId) {
-  return nodes.find((node) => String(node.databaseId) === String(discussionCommentDbId)) || null;
-}
-
-function fetchDiscussionReplyPage(commentNodeId, cursor) {
-  const afterClause = formatGraphQLAfterClause(cursor);
-  return ghGraphQL(`{
-    node(id: "${escapeGraphQLString(commentNodeId)}") {
-      ... on DiscussionComment {
-        replies(first: 100${afterClause}) {
-          pageInfo { hasNextPage endCursor }
-          nodes {
-            id
-            databaseId
-            author { login }
-            body
-            url
-            replyTo { id }
-            userContentEdits(first: 50) {
-              totalCount
-            }
-          }
-        }
-      }
-    }
-  }`);
-}
-
-function fetchDiscussionComment(discussionNumber, discussionCommentDbId) {
-  const [owner, name] = REPO.split("/");
-  let discussionId = null;
-  let cursor = null;
-  let hasNextPage = true;
-
-  while (hasNextPage) {
-    const afterClause = formatGraphQLAfterClause(cursor);
-    const gql = ghGraphQL(
-      `{
-        repository(owner: "${owner}", name: "${name}") {
-          discussion(number: ${discussionNumber}) {
-            id
-            comments(first: 50${afterClause}) {
-              pageInfo { hasNextPage endCursor }
-              nodes {
-                id
-                databaseId
-                author { login }
-                body
-                url
-                replyTo { id }
-                userContentEdits(first: 50) {
-                  totalCount
-                }
-                replies(first: 100) {
-                  pageInfo { hasNextPage endCursor }
-                  nodes {
-                    id
-                    databaseId
-                    author { login }
-                    body
-                    url
-                    replyTo { id }
-                    userContentEdits(first: 50) {
-                      totalCount
-                    }
-                  }
-                }
-              }
-            }
-          }
-        }
-      }`,
-      { allowFailure: true },
-    );
-    failOnGraphQLFailure(gql, `Failed to fetch discussion #${discussionNumber}`);
-
-    const discussion = gql?.data?.repository?.discussion;
-    if (!discussion) {
-      fail(
-        `Discussion #${discussionNumber} not found — it may have been deleted. The alert cannot be processed via this skill.`,
-      );
-    }
-
-    discussionId = discussion.id;
-
-    for (const topLevelComment of discussion.comments.nodes) {
-      if (String(topLevelComment.databaseId) === String(discussionCommentDbId)) {
-        return { discussionId, comment: topLevelComment };
-      }
-
-      let reply = findDiscussionCommentNode(topLevelComment.replies.nodes, discussionCommentDbId);
-      let replyCursor = topLevelComment.replies.pageInfo.endCursor;
-      let hasMoreReplies = topLevelComment.replies.pageInfo.hasNextPage;
-
-      while (!reply && hasMoreReplies) {
-        const replyPage = fetchDiscussionReplyPage(topLevelComment.id, replyCursor);
-        failOnGraphQLFailure(
-          replyPage,
-          `Failed to fetch replies for discussion comment ${topLevelComment.id}`,
-        );
-        const replies = replyPage?.data?.node?.replies;
-        if (!replies) {
-          fail(`Failed to paginate replies for discussion comment ${topLevelComment.id}`);
-        }
-
-        reply = findDiscussionCommentNode(replies.nodes, discussionCommentDbId);
-        hasMoreReplies = replies.pageInfo.hasNextPage;
-        replyCursor = replies.pageInfo.endCursor;
-      }
-
-      if (reply) {
-        return { discussionId, comment: reply };
-      }
-    }
-
-    hasNextPage = discussion.comments.pageInfo.hasNextPage;
-    cursor = discussion.comments.pageInfo.endCursor;
-  }
-
-  return { discussionId, comment: null };
-}
-
-function createDiscussionComment(discussionNodeId, body, replyToNodeId) {
-  const replyToClause = replyToNodeId ? `, replyToId: "${escapeGraphQLString(replyToNodeId)}"` : "";
-  const result = ghGraphQL(
-    `mutation { addDiscussionComment(input: { discussionId: "${escapeGraphQLString(discussionNodeId)}"${replyToClause}, body: "${escapeGraphQLString(body)}" }) { comment { id url } } }`,
-  );
-  if (result?.errors) {
-    fail(`Failed to create discussion comment: ${JSON.stringify(result.errors)}`);
-  }
-  return result?.data?.addDiscussionComment?.comment;
-}
-
-// ─── Commands ───────────────────────────────────────────────────────────────
-
-/**
- * fetch-alert <number>
- * Fetch alert metadata + locations. Never exposes .secret.
- */
-function cmdFetchAlert(alertNumber) {
-  if (!alertNumber) {
-    fail("Usage: fetch-alert <number>");
-  }
-
-  const alert = gh(["api", `repos/${REPO}/secret-scanning/alerts/${alertNumber}?hide_secret=true`]);
-
-  const locations = gh([
-    "api",
-    `repos/${REPO}/secret-scanning/alerts/${alertNumber}/locations`,
-    "--paginate",
-    "--slurp",
-  ]);
-  // --paginate + --slurp ensures multi-page results are merged into a single JSON array
-  const flatLocations = Array.isArray(locations?.[0])
-    ? locations.flat()
-    : Array.isArray(locations)
-      ? locations
-      : [];
-
-  const result = {
-    number: alert.number,
-    state: alert.state,
-    secret_type: alert.secret_type,
-    secret_type_display_name: alert.secret_type_display_name,
-    validity: alert.validity,
-    html_url: alert.html_url,
-    locations: flatLocations.map((loc) => ({
-      type: loc.type,
-      details: loc.details,
-    })),
-  };
-
-  console.log(JSON.stringify(result, null, 2));
-}
-
-/**
- * fetch-content <location-json>
- * Fetch the content and metadata for a specific location.
- * Saves full body to a temp file. Prints metadata + file path to stdout.
- */
-function cmdFetchContent(locationJson) {
-  if (!locationJson) {
-    fail("Usage: fetch-content '<location-json>'");
-  }
-  const location = JSON.parse(locationJson);
-  const type = location.type;
-  const details = location.details;
-
-  if (type === "discussion_comment") {
-    const commentUrl = details.discussion_comment_url;
-    if (!commentUrl) {
-      fail("No discussion_comment_url in location details");
-    }
-
-    const urlMatch = commentUrl.match(/discussions\/(\d+)#discussioncomment-(\d+)/);
-    if (!urlMatch) {
-      fail(`Cannot parse discussion comment URL: ${commentUrl}`);
-    }
-    const discussionNumber = urlMatch[1];
-    const discussionCommentDbId = urlMatch[2];
-
-    const { discussionId, comment } = fetchDiscussionComment(
-      discussionNumber,
-      discussionCommentDbId,
-    );
-    if (!comment) {
-      fail(
-        `Discussion comment #${discussionCommentDbId} not found in discussion #${discussionNumber}`,
-      );
-    }
-
-    const bodyFile = tmpFile("body.md");
-    fs.writeFileSync(bodyFile, comment.body || "");
-
-    console.log(
-      JSON.stringify(
-        {
-          type,
-          comment_node_id: comment.id,
-          discussion_node_id: discussionId,
-          reply_to_node_id: comment.replyTo?.id ?? null,
-          discussion_number: Number(discussionNumber),
-          discussion_comment_db_id: Number(discussionCommentDbId),
-          author: comment.author?.login,
-          html_url: comment.url || commentUrl,
-          edit_history_count: comment.userContentEdits?.totalCount ?? 0,
-          body_file: bodyFile,
-        },
-        null,
-        2,
-      ),
-    );
-  } else if (
-    type === "issue_comment" ||
-    type === "pull_request_comment" ||
-    type === "pull_request_review_comment"
-  ) {
-    // Extract comment ID from URL
-    const commentUrl =
-      details.issue_comment_url ||
-      details.pull_request_comment_url ||
-      details.pull_request_review_comment_url;
-    if (!commentUrl) {
-      fail(`No comment URL in location details`);
-    }
-
-    const comment = gh(["api", commentUrl]);
-    const bodyFile = tmpFile("body.md");
-    fs.writeFileSync(bodyFile, comment.body || "");
-
-    // Fetch edit history
-    const nodeId = comment.node_id;
-    const typeName =
-      type === "pull_request_review_comment" ? "PullRequestReviewComment" : "IssueComment";
-    const gql = ghGraphQL(`{
-      node(id: "${nodeId}") {
-        ... on ${typeName} {
-          userContentEdits(first: 50) {
-            totalCount
-          }
-        }
-      }
-    }`);
-    const editCount = gql?.data?.node?.userContentEdits?.totalCount ?? 0;
-
-    // Extract issue number from html_url
-    const htmlUrl = comment.html_url || details.html_url || "";
-    const issueMatch = htmlUrl.match(/\/(issues|pull)\/(\d+)/);
-    const issueNumber = issueMatch ? issueMatch[2] : null;
-
-    console.log(
-      JSON.stringify(
-        {
-          type,
-          comment_id: comment.id,
-          node_id: nodeId,
-          author: comment.user?.login,
-          issue_number: issueNumber,
-          html_url: htmlUrl,
-          edit_history_count: editCount,
-          body_file: bodyFile,
-        },
-        null,
-        2,
-      ),
-    );
-  } else if (type === "issue_body") {
-    const issueUrl = details.issue_body_url || details.issue_url;
-    if (!issueUrl) {
-      fail("No issue URL in location details");
-    }
-
-    const issue = gh(["api", issueUrl]);
-    const bodyFile = tmpFile("body.md");
-    fs.writeFileSync(bodyFile, issue.body || "");
-
-    const nodeId = issue.node_id;
-    const number = issue.number;
-    const gql = ghGraphQL(`{
-      node(id: "${nodeId}") {
-        ... on Issue {
-          userContentEdits(first: 50) {
-            totalCount
-          }
-        }
-      }
-    }`);
-    const editCount = gql?.data?.node?.userContentEdits?.totalCount ?? 0;
-
-    console.log(
-      JSON.stringify(
-        {
-          type,
-          issue_number: number,
-          node_id: nodeId,
-          author: issue.user?.login,
-          html_url: issue.html_url,
-          edit_history_count: editCount,
-          body_file: bodyFile,
-        },
-        null,
-        2,
-      ),
-    );
-  } else if (type === "pull_request_body") {
-    const prUrl = details.pull_request_body_url || details.pull_request_url;
-    if (!prUrl) {
-      fail("No PR URL in location details");
-    }
-
-    const pr = gh(["api", prUrl]);
-    const bodyFile = tmpFile("body.md");
-    fs.writeFileSync(bodyFile, pr.body || "");
-
-    const nodeId = pr.node_id;
-    const number = pr.number;
-    const gql = ghGraphQL(`{
-      node(id: "${nodeId}") {
-        ... on PullRequest {
-          userContentEdits(first: 50) {
-            totalCount
-          }
-        }
-      }
-    }`);
-    const editCount = gql?.data?.node?.userContentEdits?.totalCount ?? 0;
-
-    console.log(
-      JSON.stringify(
-        {
-          type,
-          pr_number: number,
-          node_id: nodeId,
-          author: pr.user?.login,
-          merged: pr.merged,
-          state: pr.state,
-          html_url: pr.html_url,
-          edit_history_count: editCount,
-          body_file: bodyFile,
-        },
-        null,
-        2,
-      ),
-    );
-  } else if (type === "commit") {
-    console.log(
-      JSON.stringify(
-        {
-          type,
-          commit_sha: details.commit_sha,
-          path: details.path,
-          start_line: details.start_line,
-          end_line: details.end_line,
-          html_url: details.html_url || details.commit_url || details.blob_url || null,
-          // No body file for commits
-          body_file: null,
-        },
-        null,
-        2,
-      ),
-    );
-  } else {
-    console.log(
-      JSON.stringify(
-        {
-          type,
-          unsupported: true,
-          details,
-        },
-        null,
-        2,
-      ),
-    );
-  }
-}
-
-/**
- * redact-body <issue|pr> <number> <redacted-body-file>
- * PATCH the issue or PR body with redacted content from a file.
- */
-function cmdRedactBody(kind, number, bodyFile) {
-  if (!kind || !number || !bodyFile) {
-    fail("Usage: redact-body <issue|pr> <number> <redacted-body-file>");
-  }
-  if (!fs.existsSync(bodyFile)) {
-    fail(`File not found: ${bodyFile}`);
-  }
-
-  const endpoint =
-    kind === "pr" ? `repos/${REPO}/pulls/${number}` : `repos/${REPO}/issues/${number}`;
-
-  gh(["api", endpoint, "-X", "PATCH", "-F", `body=@${bodyFile}`]);
-  console.log(JSON.stringify({ ok: true, kind, number: Number(number) }));
-}
-
-/**
- * redact-body-if-needed <issue|pr> <number> <current-body-file> <redacted-body-file> <result-file>
- * PATCH only when the agent-produced redacted body differs from the current body.
- */
-function cmdRedactBodyIfNeeded(kind, number, currentBodyFile, redactedBodyFile, resultFile) {
-  if (!kind || !number || !currentBodyFile || !redactedBodyFile || !resultFile) {
-    fail(
-      "Usage: redact-body-if-needed <issue|pr> <number> <current-body-file> <redacted-body-file> <result-file>",
-    );
-  }
-  if (!fs.existsSync(currentBodyFile)) {
-    fail(`File not found: ${currentBodyFile}`);
-  }
-  if (!fs.existsSync(redactedBodyFile)) {
-    fail(`File not found: ${redactedBodyFile}`);
-  }
-
-  const currentBody = fs.readFileSync(currentBodyFile, "utf8");
-  const redactedBody = fs.readFileSync(redactedBodyFile, "utf8");
-  const decision = decideBodyRedaction(currentBody, redactedBody);
-  const result = {
-    ok: true,
-    kind,
-    number: Number(number),
-    ...decision,
-  };
-
-  if (decision.body_changed) {
-    const endpoint =
-      kind === "pr" ? `repos/${REPO}/pulls/${number}` : `repos/${REPO}/issues/${number}`;
-    gh(["api", endpoint, "-X", "PATCH", "-F", `body=@${redactedBodyFile}`]);
-    result.redacted = true;
-  } else {
-    result.redacted = false;
-    result.reason = "current_body_already_redacted";
-  }
-
-  fs.writeFileSync(resultFile, `${JSON.stringify(result, null, 2)}\n`, { mode: 0o600 });
-  console.log(JSON.stringify(result));
-}
-
-/**
- * delete-comment <comment-id>
- * Delete a comment (and all its edit history).
- */
-function cmdDeleteComment(commentId) {
-  if (!commentId) {
-    fail("Usage: delete-comment <comment-id>");
-  }
-  gh(["api", `repos/${REPO}/issues/comments/${commentId}`, "-X", "DELETE"], { json: false });
-  console.log(JSON.stringify({ ok: true, deleted_comment_id: Number(commentId) }));
-}
-
-/**
- * delete-discussion-comment <node-id>
- * Delete a discussion comment via GraphQL (and all its edit history).
- */
-function cmdDeleteDiscussionComment(nodeId) {
-  if (!nodeId) {
-    fail("Usage: delete-discussion-comment <node-id>");
-  }
-  const result = ghGraphQL(
-    `mutation { deleteDiscussionComment(input: { id: "${escapeGraphQLString(nodeId)}" }) { comment { id } } }`,
-  );
-  if (result?.errors) {
-    fail(`Failed to delete discussion comment: ${JSON.stringify(result.errors)}`);
-  }
-  console.log(JSON.stringify({ ok: true, deleted_node_id: nodeId }));
-}
-
-/**
- * recreate-discussion-comment <discussion-node-id> <body-file> [reply-to-node-id]
- * Create a new discussion comment via GraphQL.
- */
-function cmdRecreateDiscussionComment(discussionNodeId, bodyFile, replyToNodeId) {
-  if (!discussionNodeId || !bodyFile) {
-    fail("Usage: recreate-discussion-comment <discussion-node-id> <body-file> [reply-to-node-id]");
-  }
-  if (!fs.existsSync(bodyFile)) {
-    fail(`File not found: ${bodyFile}`);
-  }
-
-  const body = fs.readFileSync(bodyFile, "utf8");
-  const newComment = createDiscussionComment(discussionNodeId, body, replyToNodeId);
-  console.log(
-    JSON.stringify({
-      ok: true,
-      node_id: newComment?.id,
-      html_url: newComment?.url,
-    }),
-  );
-}
-
-/**
- * recreate-comment <issue-number> <body-file>
- * Create a new comment from a file.
- */
-function cmdRecreateComment(issueNumber, bodyFile) {
-  if (!issueNumber || !bodyFile) {
-    fail("Usage: recreate-comment <issue-number> <body-file>");
-  }
-  if (!fs.existsSync(bodyFile)) {
-    fail(`File not found: ${bodyFile}`);
-  }
-
-  const result = gh([
-    "api",
-    `repos/${REPO}/issues/${issueNumber}/comments`,
-    "-X",
-    "POST",
-    "-F",
-    `body=@${bodyFile}`,
-  ]);
-
-  console.log(
-    JSON.stringify({
-      ok: true,
-      comment_id: result.id,
-      html_url: result.html_url,
-    }),
-  );
-}
-
-/**
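- * notify <target> <author> <location-type> <secret-types> [reply-to-node-id|body-result-file]
- * Post a notification comment with the correct template for the location type.
- * target = issue/PR number for non-discussion types, discussion node ID for discussion_comment.
- * The optional fifth argument is the reply-to node ID for discussion_comment locations,
- * or the redact-body-if-needed result file for issue_body / pull_request_body locations.
- */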
-function cmdNotify(target, author, locationType, secretTypes, replyToNodeId) {
-  if (!target || !author || !locationType || !secretTypes) {
-    fail(
-      "Usage: notify <target> <author> <location-type> <secret-types-comma-sep> [reply-to-node-id|body-result-file]",
-    );
-  }
-
-  const types = secretTypes.split(",").map((s) => s.trim());
-  const typeList = types.map((t, i) => `${i + 1}. **${t}**`).join("\n");
-  const redactionResult = loadBodyRedactionResult(locationType, replyToNodeId);
-  if (isBodyLocationType(locationType) && !redactionResult.notify_required) {
-    console.log(
-      JSON.stringify({
-        ok: true,
-        skipped: true,
-        reason: "current_body_already_redacted",
-      }),
-    );
-    return;
-  }
-
-  let locationDesc;
-  let actionDesc;
-  if (
-    locationType === "issue_comment" ||
-    locationType === "pull_request_comment" ||
-    locationType === "pull_request_review_comment" ||
-    locationType === "discussion_comment"
-  ) {
-    locationDesc = "your comment";
-    actionDesc = "The affected comment has been removed and replaced with a redacted version.";
-  } else if (locationType === "issue_body") {
-    locationDesc = "your issue description";
-    actionDesc = "The affected content has been redacted in place.";
-  } else if (locationType === "pull_request_body") {
-    locationDesc = "your pull request description";
-    actionDesc = "The affected content has been redacted in place.";
-  } else if (locationType === "commit") {
-    locationDesc = "code you committed";
-    actionDesc = "";
-  } else {
-    locationDesc = "your content";
-    actionDesc = "";
-  }
-
-  const body = [
-    `> **Note:** This is an automated message sent by the OpenClaw maintainer team. **NO_REPLY.**`,
-    "",
-    `@${author} :warning: **Security Notice: Secret Leakage Detected**`,
-    "",
-    `GitHub Secret Scanning detected the following exposed secret types in ${locationDesc}:`,
-    "",
-    typeList,
-    "",
-    actionDesc,
-    "",
-    "**Please rotate these credentials immediately.**",
-    "",
-    "These secrets were publicly exposed and should be considered compromised.",
-  ]
-    .filter((line) => line !== undefined)
-    .join("\n");
-
-  // Discussion comments must be notified via GraphQL
-  if (locationType === "discussion_comment") {
-    const newComment = createDiscussionComment(target, body, replyToNodeId);
-    console.log(
-      JSON.stringify({
-        ok: true,
-        node_id: newComment?.id,
-        html_url: newComment?.url,
-      }),
-    );
-    return;
-  }
-
-  // Issue/PR comments via REST
-  const bodyFile = tmpFile("notify.md");
-  fs.writeFileSync(bodyFile, body);
-
-  const result = gh([
-    "api",
-    `repos/${REPO}/issues/${target}/comments`,
-    "-X",
-    "POST",
-    "-F",
-    `body=@${bodyFile}`,
-  ]);
-
-  console.log(
-    JSON.stringify({
-      ok: true,
-      comment_id: result.id,
-      html_url: result.html_url,
-    }),
-  );
-}
-
-/**
- * resolve <alert-number> [resolution] [comment]
- * Close a secret scanning alert.
- */
-function cmdResolve(alertNumber, resolution, comment) {
-  if (!alertNumber) {
-    fail("Usage: resolve <alert-number> [resolution] [comment]");
-  }
-
-  const res = resolution || "revoked";
-  const resComment = comment || "Content redacted and author notified to rotate credentials.";
-
-  const result = gh([
-    "api",
-    `repos/${REPO}/secret-scanning/alerts/${alertNumber}`,
-    "-X",
-    "PATCH",
-    "-f",
-    `state=resolved`,
-    "-f",
-    `resolution=${res}`,
-    "-f",
-    `resolution_comment=${resComment}`,
-  ]);
-
-  console.log(
-    JSON.stringify({
-      ok: true,
-      number: result.number,
-      state: result.state,
-      resolution: result.resolution,
-      resolved_at: result.resolved_at,
-    }),
-  );
-}
-
-/**
- * list-open
- * List all open secret scanning alerts.
- */
-function cmdListOpen() {
-  const alerts = gh([
-    "api",
-    `repos/${REPO}/secret-scanning/alerts?hide_secret=true&state=open`,
-    "--paginate",
-    "--slurp",
-  ]);
-
-  // --slurp merges paginated results into [[page1], [page2], ...], which must be flattened
-  const flat = Array.isArray(alerts?.[0]) ? alerts.flat() : Array.isArray(alerts) ? alerts : [];
-  const rows = flat.map((a) => ({
-    number: a.number,
-    secret_type_display_name: a.secret_type_display_name,
-    html_url: a.html_url,
-    first_location_html_url: a.first_location_detected?.html_url || null,
-  }));
-
-  console.log(JSON.stringify(rows, null, 2));
-}
-
-/**
- * summary <json-file>
- * Print a formatted summary table from a JSON results file.
- */
-function cmdSummary(jsonFile) {
-  if (!jsonFile) {
-    fail("Usage: summary <json-file>");
-  }
-  if (!fs.existsSync(jsonFile)) {
-    fail(`File not found: ${jsonFile}`);
-  }
-
-  const results = JSON.parse(fs.readFileSync(jsonFile, "utf8"));
-  const lines = [];
-
-  lines.push("---BEGIN SUMMARY---");
-  lines.push("");
-  lines.push("## Secret Scanning Results");
-  lines.push("");
-  lines.push("| Alert | Type | Location | Actions | Edit History |");
-  lines.push("|-------|------|----------|---------|--------------|");
-
-  const needsPurge = [];
-
-  for (const r of results) {
-    const alertLink = `#${r.number} ${REPO_URL}/security/secret-scanning/${r.number}`;
-    const locationLink = r.location_url
-      ? `${r.location_label} ${r.location_url}`
-      : r.location_label;
-    const history = r.history_cleared ? "Cleared" : "⚠️ History remains";
-
-    lines.push(`| ${alertLink} | ${r.secret_type} | ${locationLink} | ${r.actions} | ${history} |`);
-
-    if (!r.history_cleared && r.location_url) {
-      needsPurge.push(r);
-    }
-  }
-
-  if (needsPurge.length > 0) {
-    lines.push("");
-    lines.push("Issues requiring GitHub Support to purge edit history:");
-    for (const r of needsPurge) {
-      lines.push(`- ${r.location_label} ${r.location_url} — ${r.secret_type}`);
-    }
-    lines.push(
-      `Contact: https://support.github.com/contact — request purge of userContentEdits for the above issues.`,
-    );
-  }
-
-  const skipped = results.filter((r) => r.skipped);
-  if (skipped.length > 0) {
-    lines.push("");
-    lines.push(
-      "⚠️ The following alerts were skipped because their location type is not supported:",
-    );
-    for (const r of skipped) {
-      lines.push(
-        `- Alert #${r.number}: unsupported type "${r.unsupported_type}" — ${REPO_URL}/security/secret-scanning/${r.number}`,
-      );
-    }
-    lines.push("Please update the skill to define handling for these types.");
-  }
-
-  lines.push("");
-  lines.push("---END SUMMARY---");
-
-  console.log(lines.join("\n"));
-}
-
-// ─── Dispatch ───────────────────────────────────────────────────────────────
-
-const args = [];
-
-export const commands = {
-  "fetch-alert": () => cmdFetchAlert(args[0]),
-  "fetch-content": () => cmdFetchContent(args[0]),
-  "redact-body": () => cmdRedactBody(args[0], args[1], args[2]),
-  "redact-body-if-needed": () => cmdRedactBodyIfNeeded(args[0], args[1], args[2], args[3], args[4]),
-  "delete-comment": () => cmdDeleteComment(args[0]),
-  "delete-discussion-comment": () => cmdDeleteDiscussionComment(args[0]),
-  "recreate-comment": () => cmdRecreateComment(args[0], args[1]),
-  "recreate-discussion-comment": () => cmdRecreateDiscussionComment(args[0], args[1], args[2]),
-  notify: () => cmdNotify(args[0], args[1], args[2], args[3], args[4]),
-  resolve: () => cmdResolve(args[0], args[1], args[2]),
-  "list-open": () => cmdListOpen(),
-  summary: () => cmdSummary(args[0]),
-};
-
-function main(argv = process.argv.slice(2)) {
-  const [command, ...commandArgs] = argv;
-  args.length = 0;
-  args.push(...commandArgs);
-
-  if (!command || !commands[command]) {
-    console.error(
-      [
-        "Usage: node secret-scanning.mjs <command> [args]",
-        "",
-        "Commands:",
-        "  fetch-alert <number>             Fetch alert metadata + locations",
-        "  fetch-content '<location-json>'   Fetch content for a location",
-        "  redact-body <issue|pr> <n> <file> PATCH body with redacted file",
-        "  redact-body-if-needed <issue|pr> <n> <current-file> <redacted-file> <result-file> PATCH body only if redaction changed it",
-        "  delete-comment <comment-id>       Delete a comment",
-        "  delete-discussion-comment <node-id> Delete a discussion comment (GraphQL)",
-        "  recreate-comment <issue-n> <file> Create replacement comment",
-        "  recreate-discussion-comment <disc-node-id> <file> [reply-to-node-id] Create discussion comment (GraphQL)",
-        "  notify <target> <author> <type> <types> [reply-to-node-id|body-result-file] Post notification",
-        "  resolve <n> [resolution] [comment] Close alert",
-        "  list-open                          List open alerts",
-        "  summary <json-file>               Print formatted summary",
-      ].join("\n"),
-    );
-    process.exit(1);
-  }
-
-  commands[command]();
-}
-
-if (process.argv[1] && import.meta.url === pathToFileURL(process.argv[1]).href) {
-  main();
-}
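
Review note on the removed dispatcher: the table above reads a module-level `args` array and checks membership with `commands[command]`, which also matches inherited keys such as `toString`. The following is a minimal sketch of the same dispatch-table idea with arguments passed explicitly and an own-property check. It is not the removed implementation: `fetchAlert`, `resolveAlert`, `commandTable`, and `dispatch` are hypothetical stand-ins for illustration only.

```javascript
// Hypothetical handlers standing in for cmdFetchAlert, cmdResolve, and friends.
function fetchAlert(number) {
  return `fetch-alert ${number}`;
}

function resolveAlert(number, resolution) {
  return `resolve ${number} ${resolution}`;
}

// Each entry receives its own argument list instead of reading a shared array.
const commandTable = {
  "fetch-alert": ([number]) => fetchAlert(number),
  resolve: ([number, resolution]) => resolveAlert(number, resolution),
};

function dispatch(argv) {
  const [command, ...commandArgs] = argv;
  // Object.hasOwn rejects inherited keys such as "toString" or "constructor".
  if (!command || !Object.hasOwn(commandTable, command)) {
    throw new Error(`Unknown command: ${command ?? "(none)"}`);
  }
  return commandTable[command](commandArgs);
}
```

Passing `commandArgs` into each handler keeps `dispatch` free of shared state, so it can be called repeatedly in tests without resetting anything between calls.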
--- a/.agents/skills/openclaw-small-bugfix-sweep/SKILL.md
+++ b/.agents/skills/openclaw-small-bugfix-sweep/SKILL.md
@@ -1,77 +0,0 @@
---
-name: openclaw-small-bugfix-sweep
-description: Fix only small, high-certainty OpenClaw bugs from a pasted issue/PR list after deep code review.
---
-
-# OpenClaw Small Bugfix Sweep
-
-Batch workflow for pasted OpenClaw issue/PR refs.
-Execute, do not summarize.
-Triage reviews, proves, and patches local fixes first; publishing waits for Peter's manual review.
-
-## Peter Review Gate
-
-Peter always wants to review code before commits.
-Default flow:
-1. Review each issue deeply enough to prove current behavior and root cause.
-2. Fix only easy, high-confidence bugs with narrow ownership and focused proof.
-3. Stop with the dirty diff summary, touched files, and test/gate output for Peter's manual review.
-4. After Peter approves shipping, make one commit per accepted fix, with a changelog entry for each user-facing fix.
-5. Pull/rebase, push, then comment and close only the fixed or explicitly triaged-closed issues.
-
-Do not batch unrelated issue fixes into one commit. Do not push, create PRs, comment, close, label, land, merge, or otherwise publish during the review/prove phase.
-
-## Companion Skills
-
-Use `$gitcrawl` first, `$openclaw-pr-maintainer` for live GitHub hygiene, `$github-deep-review` posture for source tracing, and `$openclaw-testing` for proof.
-
-## Loop
-
-For each ref:
-
-1. Read live target with `gh`.
-2. Check `gitcrawl` for related, duplicate, closed, or already-fixed threads.
-3. Read body, comments, linked refs, changed files, current code, adjacent tests, and dependency contracts when relevant.
-4. Trace the real runtime path.
-5. For issues: fix locally only if this is a bug, current code proves root cause, the implicated path is clear, and a narrow patch is cleaner than refactor.
-6. For PRs: decide `ready-to-merge`, `needs-fixup`, or `skip`; do not alter PR branches unless explicitly asked.
-7. Add focused regression proof when practical for local issue fixes or PR readiness checks.
-8. Run the smallest meaningful gate.
-9. Continue until every pasted ref is fixed or classified.
-
-No subagents unless explicitly requested.
-
-## Skip If
-
- not a bug
- config/docs/workflow/release/support/dependency/product work
- repro or root cause is uncertain
- larger refactor or owner-boundary change is cleaner
- already fixed on current `main`
- dependency behavior is guessed
- no focused proof is feasible
-
-Skip with terse reason. Do not pad with low-confidence fixes.
-
-## Fix Rules
-
- owner module first; generic seam only when required
- existing patterns/helpers/types
- no drive-by refactors
- tests near failing surface
- docs only for changed public behavior
- no commit during the review/prove phase
- after Peter approves shipping, one commit plus changelog per accepted user-facing fix
- no push/create PR/comment/close/label/land/merge until Peter approves shipping after review
-
-## PR Rules
-
- `ready-to-merge`: code is good, current head checked, required proof is green or clearly pending only external CI; list for maintainer merge or `@clawsweeper automerge`
- `needs-fixup`: small bug is clear, but PR branch needs changes; list exact files/tests and wait for explicit fix/push/automerge instruction
- `skip`: broad, stale, speculative, config/product/security/release, owner-boundary, or refactor-sized
- if source PR is untrusted/uneditable, do not create a replacement PR during sweep
-
-## Output Shape
-
-Ledger: `fixed-local`, `ready-to-merge`, `needs-fixup`, `skipped`, `needs-human`.
-Final: issue files left on disk, PRs ready for merge/automerge, tests/gates, skip reasons.
--- a/.agents/skills/openclaw-test-heap-leaks/SKILL.md
+++ b/.agents/skills/openclaw-test-heap-leaks/SKILL.md
@@ -1,14 +1,12 @@
 ---
 name: openclaw-test-heap-leaks
-description: Investigate OpenClaw pnpm test memory growth, Vitest OOMs, RSS spikes, and heap snapshot deltas.
+description: Investigate `pnpm test` memory growth, Vitest worker OOMs, and suspicious RSS increases in OpenClaw using the `scripts/test-parallel.mjs` heap snapshot tooling. Use when Codex needs to reproduce test-lane memory growth, collect repeated `.heapsnapshot` files, compare snapshots from the same worker PID, triage likely transformed-module retention versus likely runtime leaks, and fix or reduce the impact by patching cleanup logic or isolating hotspot tests.
 ---

 # OpenClaw Test Heap Leaks

 Use this skill for test-memory investigations. Do not guess from RSS alone when heap snapshots are available. Treat snapshot-name deltas as triage evidence, not proof, until retainers or dominators support the call.

-For **runtime fixes** (e.g., closure leaks in long-running services like the gateway), see [Validating runtime fixes](#validating-runtime-fixes-not-test-memory) below — that uses a dedicated harness, not the test-parallel snapshot machinery.
-
 ## Workflow

 1. Reproduce the failing shape first.
@@ -65,38 +63,6 @@ For **runtime fixes** (e.g., closure leaks in long-running services like the gat

 Read the top positive deltas first. Large positive growth in module-transform artifacts suggests lane isolation; large positive growth in runtime objects suggests a real leak. If the names alone do not settle it, open the same snapshot pair in DevTools and inspect retainers/dominators for the top rows before declaring root cause.

-## Validating runtime fixes (not test-memory)
-
-The workflow above is for diagnosing Vitest worker memory growth. For
-validating that a runtime/closure fix actually releases captured state, use the
-dedicated harness:
-
- `pnpm leak:embedded-run` — runs `scripts/embedded-run-abort-leak.ts`. Loops N
-  aborted runs in a function-shaped scope mimicking `runEmbeddedAttempt`,
-  writes heap snapshots, and reports a PASS/FAIL verdict on retention growth
-  using `FinalizationRegistry` for tracked-instance counting plus RSS delta.
-
-Modes:
-
- `closure-extracted` (default) — production fix shape (helper at module scope).
- `closure-inline` — pre-fix shape (closure inside the runner scope). Use as a
-  sensitivity check: if it passes you've broken the harness, not fixed a bug.
- `synthetic-leak` — deliberately retains via a module-level bucket. Use to
-  confirm the harness can detect leaks before trusting a PASS on a real fix.
-
-Snapshots land in `.tmp/embedded-run-abort-leak/`. Diff with the same script
-as above:
-
-```
-node .agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs \
-  .tmp/embedded-run-abort-leak/baseline-*.heapsnapshot \
-  .tmp/embedded-run-abort-leak/batch-N-*.heapsnapshot --top 30
-```
-
-When fixing a different runtime leak, add a new harness alongside this one
-rather than retrofitting it. The fixture function should mimic the lexical
-scope of the function where the leak lives, not be a generic abort-loop.
-
 ## Output Expectations

 When using this skill, report:
--- a/.agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs
+++ b/.agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs
@@ -1,7 +1,4 @@
 #!/usr/bin/env node
-/**
- * Heap snapshot diff utility for OpenClaw test memory leak investigations.
- */

 import fs from "node:fs";
 import path from "node:path";
--- a/.agents/skills/openclaw-test-performance/SKILL.md
+++ b/.agents/skills/openclaw-test-performance/SKILL.md
@@ -1,266 +0,0 @@
---
-name: openclaw-test-performance
-description: Benchmark, diagnose, and optimize OpenClaw test and plugin-suite runtime, import hotspots, CPU/RSS, heap growth, and slow coverage paths.
---
-
-# OpenClaw Test Performance
-
-Use evidence first. The goal is real `pnpm test`, plugin-suite, and
-plugin-inspector speed/RSS improvement with coverage intact, not runner tuning by
-guesswork.
-
-## Workflow
-
-1. Read the relevant local `AGENTS.md` files before editing:
-   - `src/agents/AGENTS.md` for agent/import hotspots.
-   - `src/channels/AGENTS.md` and `src/plugins/AGENTS.md` for plugin/channel
-     laziness.
-   - `src/gateway/AGENTS.md` for server lifecycle tests.
-   - `test/helpers/AGENTS.md` and `test/helpers/channels/AGENTS.md` for shared
-     contract helpers.
-   - `src/infra/outbound/AGENTS.md` for outbound/media/action tests.
-2. Establish a baseline before changing code:
-   - Prefer `pnpm test:perf:groups --full-suite --allow-failures --output <file>`
-     for full-suite ranking.
-   - For bundled plugin breadth, run the smallest relevant
-     `pnpm test:extensions:batch <plugin[,plugin...]>` or plugin-inspector
-     command before jumping to the full extension sweep.
-   - For a scoped hotspot use:
-     `/usr/bin/time -l pnpm test <file-or-files> --maxWorkers=1 --reporter=verbose`
-   - For import-heavy suspicion add:
-     `OPENCLAW_VITEST_IMPORT_DURATIONS=1 OPENCLAW_VITEST_PRINT_IMPORT_BREAKDOWN=1`.
-3. Separate wall/runner noise from real file cost:
-   - Compare Vitest duration, test body timing, import breakdown, wall time, and
-     max RSS.
-   - Re-run single files when grouped/full-suite numbers look stale or noisy.
-   - If a full-suite grouped run reports a lane failure but JSON says tests
-     passed, capture that as harness/noise and verify the suspect file directly.
-4. Pick the next attack by return and risk:
-   - High return: one file/test dominates seconds or RSS and has a clear root.
-   - High leverage: one plugin or SDK barrel causes every plugin-inspector or
-     extension-batch run to load broad runtime.
-   - Lower risk: static descriptors, target parsing, routing, auth bypass,
-     setup hints, registry fixtures, or test server lifecycle.
-   - Higher risk: real memory/runtime behavior, live providers, protocol
-     contracts, or broad production refactors.
-5. Fix the root cause, not the symptom:
-   - Move static metadata/parsing into narrow helpers or lightweight artifacts
-     reused by full runtime and fast paths.
-   - Prefer dependency injection, loaded-plugin-only lookup, explicit fixtures,
-     and pure helpers over broad mocks.
-   - Reuse suite-level servers/clients when a fresh handshake is irrelevant.
-   - Keep schedulers/background loops off unless the test proves scheduling.
-   - In plugin paths, move static metadata into manifest/lightweight artifacts
-     and keep runtime plugin loads behind explicit execution boundaries.
-6. Preserve coverage shape:
-   - Do not delete a slow integration proof unless the exact production
-     composition is extracted into a named helper and tested.
-   - Keep one cheap integration smoke when cross-component wiring matters.
-   - State explicitly what incidental coverage was removed, if any.
-7. Re-benchmark the same command after the change and compute seconds plus
-   percent gain.
-8. Update the running report when requested or when this thread is tracking one.
-   Include before/after commands, artifacts, coverage notes, verification, and
-   next attack order.
-9. Commit with `scripts/committer "<message>" <paths...>` and push when the
-   user asked for commits/pushes. Stage only files touched for this attack.
-
-## Plugin-Suite Workflow
-
-Use this section when perf work involves bundled plugins, plugin-inspector, SDK
-barrels, package-boundary tests, or extension suites.
-
-1. Map the suite shape first:
-   - source tests: `pnpm test extensions/<id>` or `pnpm test:extensions:batch <id>`
-   - package boundaries: `pnpm run test:extensions:package-boundary:canary` and
-     `pnpm run test:extensions:package-boundary:compile`
-   - all bundled source tests: `pnpm test:extensions`
-   - plugin import memory: `pnpm test:extensions:memory -- --json .artifacts/test-perf/extensions-memory.json`
-   - plugin-inspector/report work: keep report primitives in `plugin-inspector`;
-     keep wrappers thin and collect peak RSS when the command supports it.
-2. Start narrow, then widen:
-   - one plugin changed: run that plugin's tests and plugin-inspector slice.
-   - SDK/public barrel changed: add representative provider, channel, memory,
-     and feature plugins.
-   - loader/runtime mirror changed: add package-boundary checks and build/package
-     proof as needed.
-   - unknown shared plugin behavior: run `test:extensions:batch` groups before
-     `pnpm test:extensions`.
-3. Treat plugin-inspector failures as product signals:
-   - JSON must parse.
-   - warnings/errors must be classified, not hidden.
-   - runtime capture should be quiet and config-tolerant.
-   - command output should include wall time, exit code, and peak RSS when
-     available.
-4. For broad or package-heavy plugin proof, use Crabbox-backed Blacksmith
-   Testbox by default on maintainer machines:
-   - `pnpm crabbox:run -- --provider blacksmith-testbox --timing-json -- OPENCLAW_TESTBOX=1 pnpm test:extensions:batch <ids>`
-   - add `--keep`/`--id <id-or-slug>` only when several commands must share one
-     warmed box; stop it with `pnpm crabbox:stop -- <id-or-slug>`.
-5. If plugin performance is package-artifact sensitive, switch to
-   `release-openclaw-plugin-testing` and Package Acceptance rather than
-   trusting source-only timing.
-
-## Metric Collection
-
-Collect at least one stable metric before and after. Prefer the same machine and
-same command. For Testbox comparisons, use the same `tbx_...` id when possible.
-
-| Metric          | Use for                            | Preferred source                                                            |
-| --------------- | ---------------------------------- | --------------------------------------------------------------------------- |
-| wall time       | user-visible suite cost            | `/usr/bin/time -l`, test wrapper duration, Testbox run time                 |
-| Vitest duration | test body/import cost              | Vitest output per file/shard                                                |
-| import duration | broad barrel/runtime loads         | `OPENCLAW_VITEST_IMPORT_DURATIONS=1`                                        |
-| max RSS         | memory pressure and OOM risk       | `/usr/bin/time -l`, `pnpm test:extensions:memory`, wrapper memory summaries |
-| CPU/user/sys    | CPU-bound vs wait-bound split      | `/usr/bin/time -l` locally, Testbox job timing when local CPU is noisy      |
-| heap snapshots  | real leak vs retained module graph | `openclaw-test-heap-leaks` workflow                                         |
-
-Local scoped command with CPU/RSS:
-
-```bash
-timeout 240 /usr/bin/time -l pnpm test <file> --maxWorkers=1 --reporter=verbose
-```
-
-Plugin import memory profile:
-
-```bash
-pnpm build
-pnpm test:extensions:memory -- --top 20 --json .artifacts/test-perf/extensions-memory.json
-```
-
-Targeted plugin import memory:
-
-```bash
-pnpm test:extensions:memory -- --extension discord --extension telegram --skip-combined
-```
-
-Heap/RSS escalation:
-
-```bash
-OPENCLAW_TEST_MEMORY_TRACE=1 \
-OPENCLAW_TEST_HEAPSNAPSHOT_INTERVAL_MS=60000 \
-OPENCLAW_TEST_HEAPSNAPSHOT_DIR=.tmp/heapsnap \
-OPENCLAW_TEST_WORKERS=2 \
-OPENCLAW_TEST_MAX_OLD_SPACE_SIZE_MB=6144 \
-pnpm test
-```
-
-Use `openclaw-test-heap-leaks` when RSS keeps growing across intervals, workers
-OOM, or the suspect command has app-object retention. Do not call RSS growth a
-leak until snapshots or retainers support it.
-
-## Common Root Causes
-
- Full bundled channel/plugin runtime loaded for static data.
- `getChannelPlugin()` fallback used when an already-loaded fixture or pure
-  parser would suffice.
- Broad `api.ts`, `runtime-api.ts`, `test-api.ts`, or plugin-sdk barrels pulled
-  into hot tests.
- SDK root aliases or package barrels pulling focused subpaths back into a broad
-  plugin graph.
- Plugin-inspector loading runtime code just to render metadata, reports, or CI
-  policy scores.
- Bundled plugin capture reusing real config/home state instead of synthetic,
-  redacted, isolated state.
- Partial-real mocks using `importActual()` around broad modules.
- `vi.resetModules()` plus fresh imports in per-test loops.
- Test plugin registry seeded in `beforeAll` while runtime state resets in
-  `afterEach`.
- Per-test gateway/server/client startup when state reset would suffice.
- Runtime/default model/auth selection paid by idle snapshots or fixtures.
- Plugin-owned media/action discovery triggered before checking whether args
-  contain plugin-owned fields.
- Timings missing from `test/fixtures/test-timings.unit.json`, causing hotspot
-  files to stay in shared workers.
- Parallel Vitest runs sharing `node_modules/.experimental-vitest-cache` without
-  distinct `OPENCLAW_VITEST_FS_MODULE_CACHE_PATH` values.
-
-## Benchmark Commands
-
-Scoped file:
-
-```bash
-timeout 240 /usr/bin/time -l pnpm test <file> --maxWorkers=1 --reporter=verbose
-```
-
-Scoped file with import breakdown:
-
-```bash
-timeout 240 /usr/bin/time -l env \
-  OPENCLAW_VITEST_IMPORT_DURATIONS=1 \
-  OPENCLAW_VITEST_PRINT_IMPORT_BREAKDOWN=1 \
-  pnpm test <file> --maxWorkers=1 --reporter=verbose
-```
-
-Grouped suite:
-
-```bash
-pnpm test:perf:groups --full-suite --allow-failures \
-  --output .artifacts/test-perf/<name>.json
-```
-
-Extension batch:
-
-```bash
-pnpm test:extensions:batch <plugin[,plugin...]> -- --reporter=verbose
-```
-
-All extension tests:
-
-```bash
-pnpm test:extensions
-```
-
-Package-boundary plugin checks:
-
-```bash
-pnpm run test:extensions:package-boundary:canary
-pnpm run test:extensions:package-boundary:compile
-```
-
-Reuse an existing Vitest JSON report:
-
-```bash
-pnpm test:perf:groups --report <vitest-json> \
-  --output .artifacts/test-perf/<name>.json
-```
-
-## Verification
-
- Always run the targeted test surface that proves the change.
- For source changes, run `pnpm check:changed` before push; in maintainer
-  Testbox mode run it in the warmed Testbox.
- For test-only changes, run `pnpm test:changed` or the exact edited tests.
- Run `pnpm build` when touching lazy-loading, bundled artifacts, package
-  boundaries, dynamic imports, build output, or public surfaces.
- For plugin SDK/barrel/runtime changes, add `pnpm plugin-sdk:api:check` or
-  `pnpm plugin-sdk:api:gen` when the API surface may drift.
- For plugin-suite perf fixes, verify at least one representative plugin batch
-  plus the changed gate; use Package Acceptance if the bug only exists in a
-  packed artifact.
- If deps are missing/stale, run `pnpm install` and retry the exact failed
-  command once.
- Use the report format:
-
-```markdown
-| Metric         | Before |  After |          Gain |
-| -------------- | -----: | -----: | ------------: |
-| File wall time |   `Xs` |   `Ys` |  `-Zs` (`P%`) |
-| Max RSS        |  `XMB` |  `YMB` | `-ZMB` (`P%`) |
-| CPU user/sys   | `X/Ys` | `A/Bs` |       explain |
-```
-
-## Handoff
-
-Keep the final concise:
-
- Root cause.
- Suite/plugin scope.
- Files changed.
- Before/after wall, Vitest/import, CPU, and RSS numbers where available.
- Leak classification if memory was involved: real leak, retained module graph,
-  or inconclusive.
- Coverage retained.
- Verification commands.
- Testbox ID or workflow URL for remote proof.
- Commit hash and push status.
--- a/Show More
+++ b/Show More
@@ -0,0 +1 @@
+Maintainer skills now live in [`openclaw/maintainers`](https://github.com/openclaw/maintainers/).
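
Review note on the removed `openclaw-test-performance` skill: its workflow asks for "seconds plus percent gain" and a Before/After/Gain table. The following is a minimal sketch of that arithmetic, assuming plain numeric measurements. `computeGain` and `formatRow` are hypothetical helper names, and the numbers in the usage note are made up for illustration, not measured results.

```javascript
// Absolute and percent change between a "before" and "after" measurement.
// A negative delta means the "after" run was cheaper.
function computeGain(before, after) {
  const delta = after - before;
  const percent = before === 0 ? 0 : (delta / before) * 100;
  return { delta, percent };
}

// Render one row in the shape of the skill's report table.
function formatRow(metric, before, after, unit) {
  const { delta, percent } = computeGain(before, after);
  const sign = delta > 0 ? "+" : "";
  const gain = `\`${sign}${delta.toFixed(1)}${unit}\` (\`${sign}${percent.toFixed(1)}%\`)`;
  return `| ${metric} | \`${before}${unit}\` | \`${after}${unit}\` | ${gain} |`;
}
```

For example, `formatRow("File wall time", 42, 31.5, "s")` yields a row reporting `-10.5s` and `-25.0%`. The comparison only means something when both numbers come from the same command on the same machine, as the skill's metric-collection section requires.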