feat capability CLI on latest main

2026-06-06 14:01:24 +08:00 · 2026-04-06 17:30:59 -05:00
58 changed files with 5875 additions and 50 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,12 +6,14 @@ Docs: https://docs.openclaw.ai

 ### Changes

+- CLI/capabilities: add a first-class `openclaw capability ...` hub for provider-backed inference workflows across model, media, web, and embedding tasks, with capability inspection, provider discovery, and consistent JSON output. Thanks @Takhoffman.
 - Providers/Anthropic: restore Claude CLI as the preferred local Anthropic path in onboarding, model-auth guidance, and doctor flows again, and keep the Docker Claude CLI live lane aligned with the restored guidance.
 - Plugins/webhooks: add a bundled webhook ingress plugin so external automation can create and drive bound TaskFlows through per-route shared-secret endpoints. (#61892) Thanks @mbelinky.
 - Tools/media: document per-provider music and video generation capabilities, and add shared live video-to-video sweep coverage for providers that support local reference clips.

 ### Fixes

+- CLI/capabilities: keep provider-backed capability behavior aligned with actual runtime execution by fixing explicit TTS override handling, profile-aware gateway TTS prefs resolution, per-request transcription `prompt`/`language` overrides, image output MIME/extension mismatches, configured web-search fallback behavior, and agent-vs-CLI web-search execution drift.
 - Channels/secrets: keep bundled channel artifact and secret-contract loading stable under lazy loading so bundled channel secrets continue to appear in `openclaw secret`, status, and security-audit surfaces.
 - Providers/xAI: recognize `api.grok.x.ai` as an xAI-native endpoint again so native xAI web-search attribution keeps working on Grok-hosted base URLs. (#61377) Thanks @jjjojoj.
 - Providers/Anthropic/cache: preserve thinking blocks for Claude Opus 4.5+, Sonnet 4.5+, and newer Claude 4-family models so Anthropic prompt-cache prefixes keep matching after thinking turns. (#61793)
--- a/apps/shared/OpenClawKit/Sources/OpenClawKit/Resources/tool-display.json
+++ b/apps/shared/OpenClawKit/Sources/OpenClawKit/Resources/tool-display.json
@@ -361,14 +361,6 @@
        }
      }
    },
-    "update_plan": {
-      "emoji": "🗺️",
-      "title": "Update Plan",
-      "detailKeys": [
-        "explanation",
-        "plan.0.step"
-      ]
-    },
    "gateway": {
      "emoji": "🔌",
      "title": "Gateway",
--- a/docs/cli/capability.md
+++ b/docs/cli/capability.md
@@ -0,0 +1,116 @@
+---
+summary: "Capability-first CLI for provider-backed model, media, web, and embedding workflows"
+read_when:
+  - Adding or modifying `openclaw capability` commands
+  - Designing stable headless capability automation
+title: "Capability CLI"
+---
+
+# Capability CLI
+
+`openclaw capability` is the canonical headless surface for provider-backed capabilities.
+
+It intentionally exposes capability families rather than raw gateway RPC names or raw agent tool ids.
+
+## Command tree
+
+```text
+openclaw capability
+  list
+  inspect
+
+  model
+    run
+    list
+    inspect
+    providers
+    auth login
+    auth logout
+    auth status
+
+  media
+    image
+      generate
+      edit
+      describe
+      describe-many
+      providers
+    audio
+      transcribe
+      providers
+    tts
+      convert
+      voices
+      providers
+      status
+      enable
+      disable
+      set-provider
+    video
+      generate
+      describe
+      providers
+
+  web
+    search
+    fetch
+    providers
+
+  memory
+    embedding
+      create
+      providers
+```
+
+## Transport
+
+Supported transport flags:
+
+- `--local`
+- `--gateway`
+
+When neither flag is passed, the transport is selected automatically per command family:
+
+- Stateless execution commands default to local.
+- Gateway-managed state commands default to gateway.
+
+Examples:
+
+```bash
+openclaw capability model run --prompt "hello" --json
+openclaw capability media image generate --prompt "friendly lobster" --json
+openclaw capability media tts status --json
+openclaw capability embedding create --text "hello world" --json
+```
+
+## JSON output
+
+Capability commands normalize JSON output under a shared envelope:
+
+```json
+{
+  "ok": true,
+  "capability": "media.image.generate",
+  "transport": "local",
+  "provider": "openai",
+  "model": "gpt-image-1",
+  "attempts": [],
+  "outputs": []
+}
+```
+
+Top-level fields are stable:
+
+- `ok`
+- `capability`
+- `transport`
+- `provider`
+- `model`
+- `attempts`
+- `outputs`
+- `error`
+
+## Notes
+
+- `model run` reuses the agent runtime so provider/model overrides behave like normal agent execution.
+- `media tts status` defaults to gateway because it reflects gateway-managed TTS state.
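As a reading aid for the JSON envelope documented in `docs/cli/capability.md` above, here is a minimal sketch of how a script might type and check the output of a `--json` run. The field names come from the doc; the field types, the optionality of `provider`, `model`, and `error`, and the helper name `parseCapabilityEnvelope` are assumptions made for illustration, not the CLI's actual schema.

```typescript
// Hypothetical consumer-side view of the capability JSON envelope.
// Field names follow docs/cli/capability.md; the types are assumptions.
type CapabilityEnvelope = {
  ok: boolean;
  capability: string;
  transport: string;
  provider?: string;
  model?: string;
  attempts: unknown[];
  outputs: unknown[];
  error?: unknown;
};

// Parse stdout from a `--json` run and check the fields a script relies on.
function parseCapabilityEnvelope(stdout: string): CapabilityEnvelope {
  const value: unknown = JSON.parse(stdout);
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    throw new Error("capability output is not a JSON object");
  }
  const record = value as Record<string, unknown>;
  if (typeof record.ok !== "boolean") {
    throw new Error('capability output is missing boolean "ok"');
  }
  if (typeof record.capability !== "string" || typeof record.transport !== "string") {
    throw new Error('capability output is missing "capability" or "transport"');
  }
  return {
    ok: record.ok,
    capability: record.capability,
    transport: record.transport,
    provider: typeof record.provider === "string" ? record.provider : undefined,
    model: typeof record.model === "string" ? record.model : undefined,
    // Missing arrays are treated as empty so callers can iterate safely.
    attempts: Array.isArray(record.attempts) ? record.attempts : [],
    outputs: Array.isArray(record.outputs) ? record.outputs : [],
    error: record.error,
  };
}
```

A caller would typically branch on `ok` first and only then read `outputs`, so that a failed run is never mistaken for an empty result.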
--- a/docs/cli/index.md
+++ b/docs/cli/index.md
@@ -35,6 +35,7 @@ This page describes the current CLI behavior. If commands change, update this do
 - [`logs`](/cli/logs)
 - [`system`](/cli/system)
 - [`models`](/cli/models)
+- [`capability`](/cli/capability)
 - [`memory`](/cli/memory)
 - [`directory`](/cli/directory)
 - [`nodes`](/cli/nodes)
@@ -248,6 +249,16 @@ openclaw [--dev] [--profile <name>] <command>
    fallbacks list|add|remove|clear
    image-fallbacks list|add|remove|clear
    scan
+  capability
+    list
+    inspect
+    model run|list|inspect|providers|auth login|logout|status
+    media image generate|edit|describe|describe-many|providers
+    media audio transcribe|providers
+    media tts convert|voices|providers|status|enable|disable|set-provider
+    media video generate|describe|providers
+    web search|fetch|providers
+    embedding create|providers
    auth add|login|login-github-copilot|setup-token|paste-token
    auth order get|set|clear
  sandbox
--- a/extensions/microsoft-foundry/index.ts
+++ b/extensions/microsoft-foundry/index.ts
@@ -1,5 +1,6 @@
 import { definePluginEntry } from "openclaw/plugin-sdk/plugin-entry";
 import { buildMicrosoftFoundryProvider } from "./provider.js";
+import { buildMicrosoftFoundryRealtimeTranscriptionProvider } from "./realtime-transcription-provider.js";

 export default definePluginEntry({
  id: "microsoft-foundry",
@@ -7,5 +8,6 @@ export default definePluginEntry({
  description: "Microsoft Foundry provider with Entra ID and API key auth",
  register(api) {
    api.registerProvider(buildMicrosoftFoundryProvider());
+    api.registerRealtimeTranscriptionProvider(buildMicrosoftFoundryRealtimeTranscriptionProvider());
  },
 });
--- a/extensions/microsoft-foundry/realtime-transcription-provider.test.ts
+++ b/extensions/microsoft-foundry/realtime-transcription-provider.test.ts
@@ -0,0 +1,58 @@
+import { describe, expect, it } from "vitest";
+import { buildMicrosoftFoundryRealtimeTranscriptionProvider } from "./realtime-transcription-provider.js";
+
+describe("buildMicrosoftFoundryRealtimeTranscriptionProvider", () => {
+  it("normalizes foundry config from the voice provider block", () => {
+    const provider = buildMicrosoftFoundryRealtimeTranscriptionProvider();
+    const resolved = provider.resolveConfig?.({
+      cfg: {} as never,
+      rawConfig: {
+        providers: {
+          "microsoft-foundry": {
+            apiKey: "azure-test-key",
+            baseUrl: "https://example.services.ai.azure.com/openai/v1",
+            deployment: "gpt-realtime",
+            apiVersion: "2025-04-01-preview",
+          },
+        },
+      },
+    });
+
+    expect(resolved).toEqual({
+      apiKey: "azure-test-key",
+      baseUrl: "https://example.services.ai.azure.com/openai/v1",
+      deployment: "gpt-realtime",
+      apiVersion: "2025-04-01-preview",
+    });
+  });
+
+  it("accepts model-provider style config with api-key headers", () => {
+    const provider = buildMicrosoftFoundryRealtimeTranscriptionProvider();
+    const resolved = provider.resolveConfig?.({
+      cfg: {} as never,
+      rawConfig: {
+        providers: {
+          "microsoft-foundry": {
+            baseUrl: "https://example.services.ai.azure.com/openai/v1",
+            headers: {
+              "api-key": "azure-test-key",
+            },
+            model: "gpt-realtime",
+          },
+        },
+      },
+    });
+
+    expect(resolved).toEqual({
+      apiKey: "azure-test-key",
+      baseUrl: "https://example.services.ai.azure.com/openai/v1",
+      deployment: "gpt-realtime",
+      model: "gpt-realtime",
+    });
+  });
+
+  it("registers foundry aliases for voice provider selection", () => {
+    const provider = buildMicrosoftFoundryRealtimeTranscriptionProvider();
+    expect(provider.aliases).toContain("azure-foundry");
+  });
+});
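The two config shapes exercised by these tests (a direct `apiKey` versus an `api-key` header, and `deployment` versus `model`) come down to a fixed lookup order. The sketch below restates that order as standalone functions, following `extractFoundryProviderConfig` in this change. The small helpers are re-declared locally so the snippet runs by itself, and `resolveApiKey` and `resolveDeployment` are hypothetical names used only here.

```typescript
function trimToUndefined(value: unknown): string | undefined {
  return typeof value === "string" && value.trim() ? value.trim() : undefined;
}

function asObject(value: unknown): Record<string, unknown> | undefined {
  return typeof value === "object" && value !== null && !Array.isArray(value)
    ? (value as Record<string, unknown>)
    : undefined;
}

// Key lookup order: explicit apiKey, then the "api-key" header,
// then an Authorization header with any "Bearer " prefix removed.
function resolveApiKey(raw: Record<string, unknown>): string | undefined {
  const headers = asObject(raw.headers);
  return (
    trimToUndefined(raw.apiKey) ??
    trimToUndefined(headers?.["api-key"]) ??
    trimToUndefined(headers?.["Authorization"])?.replace(/^Bearer\s+/i, "")
  );
}

// Deployment lookup order: deployment, then model, then deploymentName.
function resolveDeployment(raw: Record<string, unknown>): string | undefined {
  return (
    trimToUndefined(raw.deployment) ??
    trimToUndefined(raw.model) ??
    trimToUndefined(raw.deploymentName)
  );
}
```

Because `model` doubles as a deployment fallback, a model-provider style block with only `model: "gpt-realtime"` resolves to the same deployment as a voice-provider block with `deployment: "gpt-realtime"`, which is what the second test asserts.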
--- a/extensions/microsoft-foundry/realtime-transcription-provider.ts
+++ b/extensions/microsoft-foundry/realtime-transcription-provider.ts
@@ -0,0 +1,313 @@
+import type {
+  RealtimeTranscriptionProviderConfig,
+  RealtimeTranscriptionProviderPlugin,
+  RealtimeTranscriptionSession,
+  RealtimeTranscriptionSessionCreateRequest,
+} from "openclaw/plugin-sdk/realtime-transcription";
+import WebSocket from "ws";
+import { normalizeFoundryEndpoint, PROVIDER_ID } from "./shared.js";
+
+type FoundryRealtimeTranscriptionProviderConfig = {
+  apiKey?: string;
+  baseUrl?: string;
+  endpoint?: string;
+  deployment?: string;
+  model?: string;
+  apiVersion?: string;
+  silenceDurationMs?: number;
+  vadThreshold?: number;
+};
+
+type FoundryRealtimeTranscriptionSessionConfig = RealtimeTranscriptionSessionCreateRequest & {
+  apiKey: string;
+  baseUrl: string;
+  deployment: string;
+  apiVersion: string;
+  silenceDurationMs: number;
+  vadThreshold: number;
+};
+
+type RealtimeEvent = {
+  type: string;
+  delta?: string;
+  transcript?: string;
+  error?: unknown;
+  item?: { transcript?: string } | null;
+};
+
+function trimToUndefined(value: unknown): string | undefined {
+  return typeof value === "string" && value.trim() ? value.trim() : undefined;
+}
+
+function asNumber(value: unknown): number | undefined {
+  return typeof value === "number" && Number.isFinite(value) ? value : undefined;
+}
+
+function asObject(value: unknown): Record<string, unknown> | undefined {
+  return typeof value === "object" && value !== null && !Array.isArray(value)
+    ? (value as Record<string, unknown>)
+    : undefined;
+}
+
+function extractFoundryProviderConfig(
+  rawConfig: RealtimeTranscriptionProviderConfig,
+): FoundryRealtimeTranscriptionProviderConfig {
+  const providers = asObject(rawConfig.providers);
+  const raw =
+    asObject(providers?.[PROVIDER_ID]) ??
+    asObject(rawConfig[PROVIDER_ID]) ??
+    asObject(rawConfig.microsoftFoundry) ??
+    asObject(rawConfig);
+  const providerBaseUrl = trimToUndefined(raw?.baseUrl);
+  const endpoint = trimToUndefined(raw?.endpoint);
+  return {
+    apiKey:
+      trimToUndefined(raw?.apiKey) ??
+      trimToUndefined(asObject(raw?.headers)?.["api-key"]) ??
+      trimToUndefined(asObject(raw?.headers)?.Authorization)?.replace(/^Bearer\s+/i, ""),
+    baseUrl: providerBaseUrl,
+    endpoint,
+    deployment:
+      trimToUndefined(raw?.deployment) ??
+      trimToUndefined(raw?.model) ??
+      trimToUndefined(raw?.deploymentName),
+    model: trimToUndefined(raw?.transcriptionModel) ?? trimToUndefined(raw?.model),
+    apiVersion: trimToUndefined(raw?.apiVersion),
+    silenceDurationMs: asNumber(raw?.silenceDurationMs),
+    vadThreshold: asNumber(raw?.vadThreshold),
+  };
+}
+
+function resolveFoundryRealtimeBaseUrl(
+  config: FoundryRealtimeTranscriptionProviderConfig,
+): string | undefined {
+  if (config.endpoint) {
+    return normalizeFoundryEndpoint(config.endpoint);
+  }
+  if (!config.baseUrl) {
+    return undefined;
+  }
+  return normalizeFoundryEndpoint(config.baseUrl);
+}
+
+class FoundryRealtimeTranscriptionSession implements RealtimeTranscriptionSession {
+  private static readonly MAX_RECONNECT_ATTEMPTS = 5;
+  private static readonly RECONNECT_DELAY_MS = 1000;
+  private static readonly CONNECT_TIMEOUT_MS = 10_000;
+
+  private ws: WebSocket | null = null;
+  private connected = false;
+  private closed = false;
+  private reconnectAttempts = 0;
+  private pendingTranscript = "";
+
+  constructor(private readonly config: FoundryRealtimeTranscriptionSessionConfig) {}
+
+  async connect(): Promise<void> {
+    this.closed = false;
+    this.reconnectAttempts = 0;
+    await this.doConnect();
+  }
+
+  sendAudio(audio: Buffer): void {
+    if (this.ws?.readyState !== WebSocket.OPEN) {
+      return;
+    }
+    this.sendEvent({
+      type: "input_audio_buffer.append",
+      audio: audio.toString("base64"),
+    });
+  }
+
+  close(): void {
+    this.closed = true;
+    this.connected = false;
+    if (this.ws) {
+      this.ws.close(1000, "Transcription session closed");
+      this.ws = null;
+    }
+  }
+
+  isConnected(): boolean {
+    return this.connected;
+  }
+
+  private async doConnect(): Promise<void> {
+    await new Promise<void>((resolve, reject) => {
+      const wsUrl = this.buildWebSocketUrl();
+      this.ws = new WebSocket(wsUrl, {
+        headers: {
+          "api-key": this.config.apiKey,
+        },
+      });
+
+      const connectTimeout = setTimeout(() => {
+        reject(new Error("Microsoft Foundry realtime transcription connection timeout"));
+      }, FoundryRealtimeTranscriptionSession.CONNECT_TIMEOUT_MS);
+
+      this.ws.on("open", () => {
+        clearTimeout(connectTimeout);
+        this.connected = true;
+        this.reconnectAttempts = 0;
+        this.sendEvent({
+          type: "session.update",
+          session: {
+            input_audio_format: "pcm16",
+            input_audio_transcription: {
+              model: this.config.deployment,
+            },
+            turn_detection: {
+              type: "server_vad",
+              threshold: this.config.vadThreshold,
+              prefix_padding_ms: 300,
+              silence_duration_ms: this.config.silenceDurationMs,
+            },
+          },
+        });
+        resolve();
+      });
+
+      this.ws.on("message", (data: Buffer) => {
+        try {
+          this.handleEvent(JSON.parse(data.toString()) as RealtimeEvent);
+        } catch (error) {
+          this.config.onError?.(error instanceof Error ? error : new Error(String(error)));
+        }
+      });
+
+      this.ws.on("error", (error) => {
+        if (!this.connected) {
+          clearTimeout(connectTimeout);
+          reject(error);
+          return;
+        }
+        this.config.onError?.(error instanceof Error ? error : new Error(String(error)));
+      });
+
+      this.ws.on("close", () => {
+        this.connected = false;
+        if (this.closed) {
+          return;
+        }
+        void this.attemptReconnect();
+      });
+    });
+  }
+
+  private buildWebSocketUrl(): string {
+    const httpBaseUrl = this.config.baseUrl.replace(/\/+$/, "");
+    const wsBaseUrl = httpBaseUrl.replace(/^http:/i, "ws:").replace(/^https:/i, "wss:");
+    const url = new URL(`${wsBaseUrl}/openai/realtime`);
+    url.searchParams.set("api-version", this.config.apiVersion);
+    url.searchParams.set("deployment", this.config.deployment);
+    return url.toString();
+  }
+
+  private async attemptReconnect(): Promise<void> {
+    if (this.closed) {
+      return;
+    }
+    if (this.reconnectAttempts >= FoundryRealtimeTranscriptionSession.MAX_RECONNECT_ATTEMPTS) {
+      this.config.onError?.(
+        new Error("Microsoft Foundry realtime transcription reconnect limit reached"),
+      );
+      return;
+    }
+    this.reconnectAttempts += 1;
+    const delay =
+      FoundryRealtimeTranscriptionSession.RECONNECT_DELAY_MS * 2 ** (this.reconnectAttempts - 1);
+    await new Promise((resolve) => setTimeout(resolve, delay));
+    if (this.closed) {
+      return;
+    }
+    try {
+      await this.doConnect();
+    } catch (error) {
+      this.config.onError?.(error instanceof Error ? error : new Error(String(error)));
+      await this.attemptReconnect();
+    }
+  }
+
+  private handleEvent(event: RealtimeEvent): void {
+    switch (event.type) {
+      case "conversation.item.input_audio_transcription.delta":
+      case "conversation.item.audio_transcription.delta":
+        if (event.delta) {
+          this.pendingTranscript += event.delta;
+          this.config.onPartial?.(this.pendingTranscript);
+        }
+        return;
+
+      case "conversation.item.input_audio_transcription.completed":
+      case "conversation.item.audio_transcription.completed": {
+        const transcript = event.transcript ?? event.item?.transcript;
+        if (transcript) {
+          this.config.onTranscript?.(transcript);
+        }
+        this.pendingTranscript = "";
+        return;
+      }
+
+      case "input_audio_buffer.speech_started":
+        this.pendingTranscript = "";
+        this.config.onSpeechStart?.();
+        return;
+
+      case "error": {
+        const detail =
+          event.error && typeof event.error === "object" && "message" in event.error
+            ? String((event.error as { message?: unknown }).message ?? "Unknown error")
+            : event.error
+              ? String(event.error)
+              : "Unknown error";
+        this.config.onError?.(new Error(detail));
+        return;
+      }
+
+      default:
+        return;
+    }
+  }
+
+  private sendEvent(event: unknown): void {
+    if (this.ws?.readyState === WebSocket.OPEN) {
+      this.ws.send(JSON.stringify(event));
+    }
+  }
+}
+
+export function buildMicrosoftFoundryRealtimeTranscriptionProvider(): RealtimeTranscriptionProviderPlugin {
+  return {
+    id: PROVIDER_ID,
+    label: "Microsoft Foundry Realtime Transcription",
+    aliases: ["azure-foundry", "azure-openai-foundry"],
+    autoSelectOrder: 20,
+    resolveConfig: ({ rawConfig }) => extractFoundryProviderConfig(rawConfig),
+    isConfigured: ({ providerConfig }) => {
+      const config = extractFoundryProviderConfig(providerConfig);
+      return Boolean(config.apiKey && resolveFoundryRealtimeBaseUrl(config) && config.deployment);
+    },
+    createSession: (req) => {
+      const config = extractFoundryProviderConfig(req.providerConfig);
+      const baseUrl = resolveFoundryRealtimeBaseUrl(config);
+      if (!config.apiKey) {
+        throw new Error("Microsoft Foundry realtime transcription API key missing");
+      }
+      if (!baseUrl) {
+        throw new Error("Microsoft Foundry realtime transcription endpoint missing");
+      }
+      if (!config.deployment) {
+        throw new Error("Microsoft Foundry realtime transcription deployment missing");
+      }
+      return new FoundryRealtimeTranscriptionSession({
+        ...req,
+        apiKey: config.apiKey,
+        baseUrl,
+        deployment: config.deployment,
+        apiVersion: config.apiVersion ?? "2025-04-01-preview",
+        silenceDurationMs: config.silenceDurationMs ?? 800,
+        vadThreshold: config.vadThreshold ?? 0.5,
+      });
+    },
+  };
+}
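For readers skimming the session class above, the reconnect delay and the WebSocket URL are the two pieces of arithmetic worth seeing in isolation. The sketch below restates them as pure functions using the constants from the diff. The function names are hypothetical, and the URL helper assumes it receives a base URL that has already been through `normalizeFoundryEndpoint`, whose behavior is not shown in this change.

```typescript
const RECONNECT_DELAY_MS = 1000;
const MAX_RECONNECT_ATTEMPTS = 5;

// Delay before reconnect attempt `attempt` (1-based): base * 2^(attempt - 1).
function reconnectDelayMs(attempt: number): number {
  return RECONNECT_DELAY_MS * 2 ** (attempt - 1);
}

// Full delay schedule before the reconnect limit is reported.
function reconnectSchedule(): number[] {
  return Array.from({ length: MAX_RECONNECT_ATTEMPTS }, (_, index) =>
    reconnectDelayMs(index + 1),
  );
}

// Same steps as buildWebSocketUrl: strip trailing slashes, swap the
// scheme to ws/wss, then attach api-version and deployment as query params.
function buildRealtimeWebSocketUrl(
  baseUrl: string,
  apiVersion: string,
  deployment: string,
): string {
  const httpBaseUrl = baseUrl.replace(/\/+$/, "");
  const wsBaseUrl = httpBaseUrl.replace(/^http:/i, "ws:").replace(/^https:/i, "wss:");
  const url = new URL(`${wsBaseUrl}/openai/realtime`);
  url.searchParams.set("api-version", apiVersion);
  url.searchParams.set("deployment", deployment);
  return url.toString();
}
```

With these constants the waits are 1 s, 2 s, 4 s, 8 s, and 16 s, so an unreachable endpoint is retried for roughly half a minute of backoff (plus connect timeouts) before `onError` receives the reconnect-limit error.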
--- a/extensions/openai/realtime-transcription-provider.ts
+++ b/extensions/openai/realtime-transcription-provider.ts
@@ -18,6 +18,7 @@ type OpenAIRealtimeTranscriptionProviderConfig = {
  model?: string;
  silenceDurationMs?: number;
  vadThreshold?: number;
+  inputAudioFormat?: string;
 };

 type OpenAIRealtimeTranscriptionSessionConfig = RealtimeTranscriptionSessionCreateRequest & {
@@ -25,6 +26,7 @@ type OpenAIRealtimeTranscriptionSessionConfig = RealtimeTranscriptionSessionCrea
  model: string;
  silenceDurationMs: number;
  vadThreshold: number;
+  inputAudioFormat: string;
 };

 type RealtimeEvent = {
@@ -51,6 +53,7 @@ function normalizeProviderConfig(
    model: trimToUndefined(raw?.model) ?? trimToUndefined(raw?.sttModel),
    silenceDurationMs: asFiniteNumber(raw?.silenceDurationMs),
    vadThreshold: asFiniteNumber(raw?.vadThreshold),
+    inputAudioFormat: trimToUndefined(raw?.inputAudioFormat),
  };
 }

@@ -116,7 +119,7 @@ class OpenAIRealtimeTranscriptionSession implements RealtimeTranscriptionSession
        this.sendEvent({
          type: "transcription_session.update",
          session: {
-            input_audio_format: "g711_ulaw",
+            input_audio_format: this.config.inputAudioFormat,
            input_audio_transcription: {
              model: this.config.model,
            },
@@ -241,6 +244,7 @@ export function buildOpenAIRealtimeTranscriptionProvider(): RealtimeTranscriptio
        model: config.model ?? "gpt-4o-transcribe",
        silenceDurationMs: config.silenceDurationMs ?? 800,
        vadThreshold: config.vadThreshold ?? 0.5,
+        inputAudioFormat: config.inputAudioFormat ?? "g711_ulaw",
      });
    },
  };
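The effect of the new `inputAudioFormat` option is easiest to see as one function: a configured value is trimmed and used, and anything missing or blank falls back to the previous hard-coded `"g711_ulaw"`. In the diff the trimming happens in `normalizeProviderConfig` and the default is applied in `createSession`. The sketch below collapses both steps under the hypothetical name `resolveInputAudioFormat`, and `"pcm16"` is used only as an example value.

```typescript
function trimToUndefined(value: unknown): string | undefined {
  return typeof value === "string" && value.trim() ? value.trim() : undefined;
}

// Configured format wins; blank or non-string values keep the old default.
function resolveInputAudioFormat(raw: Record<string, unknown>): string {
  return trimToUndefined(raw.inputAudioFormat) ?? "g711_ulaw";
}
```

Existing configs that never set the option therefore keep sending `g711_ulaw` in the session update.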
--- a/extensions/speech-core/runtime-api.ts
+++ b/extensions/speech-core/runtime-api.ts
@@ -9,6 +9,7 @@ export {
  isTtsProviderConfigured,
  listSpeechVoices,
  maybeApplyTtsToPayload,
+  resolveExplicitTtsOverrides,
  resolveTtsAutoMode,
  resolveTtsConfig,
  resolveTtsPrefsPath,
--- a/extensions/speech-core/src/tts.ts
+++ b/extensions/speech-core/src/tts.ts
@@ -23,7 +23,7 @@ import { resolveSendableOutboundReplyParts } from "openclaw/plugin-sdk/reply-pay
 import type { ReplyPayload } from "openclaw/plugin-sdk/reply-runtime";
 import { isVerbose, logVerbose } from "openclaw/plugin-sdk/runtime-env";
 import { resolvePreferredOpenClawTmpDir } from "openclaw/plugin-sdk/sandbox";
-import { CONFIG_DIR, resolveUserPath, stripMarkdown } from "openclaw/plugin-sdk/text-runtime";
+import { resolveConfigDir, resolveUserPath, stripMarkdown } from "openclaw/plugin-sdk/text-runtime";
 import {
  canonicalizeSpeechProviderId,
  getSpeechProvider,
@@ -35,6 +35,7 @@ import {
  summarizeText,
  type SpeechModelOverridePolicy,
  type SpeechProviderConfig,
+  type SpeechProviderOverrides,
  type SpeechVoiceOption,
  type TtsDirectiveOverrides,
  type TtsDirectiveParseResult,
@@ -167,7 +168,7 @@ function resolveTtsPrefsPathValue(prefsPath: string | undefined): string {
  if (envPath) {
    return resolveUserPath(envPath);
  }
-  return path.join(CONFIG_DIR, "settings", "tts.json");
+  return path.join(resolveConfigDir(process.env), "settings", "tts.json");
 }

 function resolveModelOverridePolicy(
@@ -494,6 +495,66 @@ export function setTtsProvider(prefsPath: string, provider: TtsProvider): void {
  });
 }

+export function resolveExplicitTtsOverrides(params: {
+  cfg: OpenClawConfig;
+  prefsPath?: string;
+  provider?: string;
+  modelId?: string;
+  voiceId?: string;
+}): TtsDirectiveOverrides {
+  const providerInput = params.provider?.trim();
+  const modelId = params.modelId?.trim();
+  const voiceId = params.voiceId?.trim();
+  const config = resolveTtsConfig(params.cfg);
+  const prefsPath = params.prefsPath ?? resolveTtsPrefsPath(config);
+  const selectedProvider =
+    canonicalizeSpeechProviderId(providerInput, params.cfg) ??
+    (modelId || voiceId ? getTtsProvider(config, prefsPath) : undefined);
+
+  if (providerInput && !selectedProvider) {
+    throw new Error(`Unknown TTS provider "${providerInput}".`);
+  }
+
+  if (!modelId && !voiceId) {
+    return selectedProvider ? { provider: selectedProvider } : {};
+  }
+
+  if (!selectedProvider) {
+    throw new Error("TTS model or voice overrides require a resolved provider.");
+  }
+
+  const provider = getSpeechProvider(selectedProvider, params.cfg);
+  if (!provider) {
+    throw new Error(`speech provider ${selectedProvider} is not registered`);
+  }
+  if (!provider.resolveTalkOverrides) {
+    throw new Error(
+      `TTS provider "${selectedProvider}" does not support model or voice overrides.`,
+    );
+  }
+
+  const providerOverrides = provider.resolveTalkOverrides({
+    talkProviderConfig: {},
+    params: {
+      ...(voiceId ? { voiceId } : {}),
+      ...(modelId ? { modelId } : {}),
+    },
+  });
+  if ((voiceId || modelId) && (!providerOverrides || Object.keys(providerOverrides).length === 0)) {
+    throw new Error(
+      `TTS provider "${selectedProvider}" ignored the requested model or voice overrides.`,
+    );
+  }
+
+  const overridesRecord = providerOverrides as SpeechProviderOverrides;
+  return {
+    provider: selectedProvider,
+    providerOverrides: {
+      [provider.id]: overridesRecord,
+    },
+  };
+}
+
 export function getTtsMaxLength(prefsPath: string): number {
  const prefs = readPrefs(prefsPath);
  return prefs.tts?.maxLength ?? DEFAULT_TTS_MAX_LENGTH;
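The branching in `resolveExplicitTtsOverrides` is dense, so here is a self-contained sketch of the same decision order with the OpenClaw lookups replaced by stand-ins. Everything outside the control flow is an assumption: the registry `Map`, the `defaultProvider` parameter (standing in for `getTtsProvider`), the type names, and the modeling of provider canonicalization as a plain registry lookup.

```typescript
type ProviderOverrides = Record<string, unknown>;

type SpeechProviderSketch = {
  id: string;
  resolveTalkOverrides?: (ctx: {
    params: { voiceId?: string; modelId?: string };
  }) => ProviderOverrides | undefined;
};

type ExplicitOverrides = {
  provider?: string;
  providerOverrides?: Record<string, ProviderOverrides>;
};

function resolveExplicitOverridesSketch(params: {
  registry: Map<string, SpeechProviderSketch>;
  defaultProvider?: string;
  provider?: string;
  modelId?: string;
  voiceId?: string;
}): ExplicitOverrides {
  const providerInput = params.provider?.trim();
  const modelId = params.modelId?.trim();
  const voiceId = params.voiceId?.trim();

  // Assumption: canonicalization is modeled as a plain registry lookup.
  const canonical =
    providerInput && params.registry.has(providerInput) ? providerInput : undefined;
  // The configured default is consulted only when a model or voice was requested.
  const selected = canonical ?? (modelId || voiceId ? params.defaultProvider : undefined);

  if (providerInput && !selected) {
    throw new Error(`Unknown TTS provider "${providerInput}".`);
  }
  if (!modelId && !voiceId) {
    return selected ? { provider: selected } : {};
  }
  if (!selected) {
    throw new Error("TTS model or voice overrides require a resolved provider.");
  }
  const provider = params.registry.get(selected);
  if (!provider) {
    throw new Error(`speech provider ${selected} is not registered`);
  }
  if (!provider.resolveTalkOverrides) {
    throw new Error(`TTS provider "${selected}" does not support model or voice overrides.`);
  }
  const overrides = provider.resolveTalkOverrides({
    params: { ...(voiceId ? { voiceId } : {}), ...(modelId ? { modelId } : {}) },
  });
  // A provider that returns nothing has silently dropped the request, so fail loudly.
  if (!overrides || Object.keys(overrides).length === 0) {
    throw new Error(`TTS provider "${selected}" ignored the requested model or voice overrides.`);
  }
  return { provider: selected, providerOverrides: { [provider.id]: overrides } };
}
```

The point of the final check is that an override which the provider cannot honor becomes an error instead of a silently ignored flag.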
--- a/src/agents/pi-embedded-runner/model.ts
+++ b/src/agents/pi-embedded-runner/model.ts
@@ -131,9 +131,6 @@ function normalizeResolvedModel(params: {
  const normalizedInputModel = {
    ...params.model,
    input: resolveProviderModelInput({
-      provider: params.provider,
-      modelId: params.model.id,
-      modelName: params.model.name,
      input: params.model.input,
    }),
  } as Model<Api>;
@@ -233,7 +230,6 @@ function findInlineModelMatch(params: {
 }

 export { buildModelAliasLines };
-export { buildInlineProviderModels };

 function resolveConfiguredProviderConfig(
  cfg: OpenClawConfig | undefined,
@@ -250,6 +246,17 @@ function resolveConfiguredProviderConfig(
  return findNormalizedProviderValue(configuredProviders, provider);
 }

+function resolveProviderModelInput(params: {
+  input?: unknown;
+  fallbackInput?: unknown;
+}): Array<"text" | "image"> {
+  const resolvedInput = Array.isArray(params.input) ? params.input : params.fallbackInput;
+  const normalizedInput = Array.isArray(resolvedInput)
+    ? resolvedInput.filter((item): item is "text" | "image" => item === "text" || item === "image")
+    : [];
+  return normalizedInput.length > 0 ? normalizedInput : ["text"];
+}
+
 function applyConfiguredProviderOverrides(params: {
  provider: string;
  discoveredModel: ProviderRuntimeModel;
@@ -290,9 +297,6 @@ function applyConfiguredProviderOverrides(params: {
    };
  }
  const normalizedInput = resolveProviderModelInput({
-    provider: params.provider,
-    modelId,
-    modelName: configuredModel?.name ?? discoveredModel.name,
    input: configuredModel?.input,
    fallbackInput: discoveredModel.input,
  });
@@ -337,6 +341,54 @@ function applyConfiguredProviderOverrides(params: {
  );
 }

+export function buildInlineProviderModels(
+  providers: Record<string, InlineProviderConfig>,
+): InlineModelEntry[] {
+  return Object.entries(providers).flatMap(([providerId, entry]) => {
+    const trimmed = providerId.trim();
+    if (!trimmed) {
+      return [];
+    }
+    const providerHeaders = sanitizeModelHeaders(entry?.headers, {
+      stripSecretRefMarkers: true,
+    });
+    const providerRequest = sanitizeConfiguredModelProviderRequest(entry?.request);
+    return (entry?.models ?? []).map((model) => {
+      const transport = resolveProviderTransport({
+        provider: trimmed,
+        api: model.api ?? entry?.api,
+        baseUrl: entry?.baseUrl,
+      });
+      const modelHeaders = sanitizeModelHeaders((model as InlineModelEntry).headers, {
+        stripSecretRefMarkers: true,
+      });
+      const requestConfig = resolveProviderRequestConfig({
+        provider: trimmed,
+        api: transport.api ?? model.api,
+        baseUrl: transport.baseUrl,
+        providerHeaders,
+        modelHeaders,
+        authHeader: entry?.authHeader,
+        request: providerRequest,
+        capability: "llm",
+        transport: "stream",
+      });
+      return attachModelProviderRequestTransport(
+        {
+          ...model,
+          input: resolveProviderModelInput({
+            input: model.input,
+          }),
+          provider: trimmed,
+          baseUrl: requestConfig.baseUrl ?? transport.baseUrl,
+          api: requestConfig.api ?? model.api,
+          headers: requestConfig.headers,
+        },
+        providerRequest,
+      );
+    });
+  });
+}
 function resolveExplicitModelWithRegistry(params: {
  provider: string;
  modelId: string;
@@ -505,9 +557,6 @@ function resolveConfiguredFallbackModel(params: {
        baseUrl: requestConfig.baseUrl,
        reasoning: configuredModel?.reasoning ?? false,
        input: resolveProviderModelInput({
-          provider,
-          modelId,
-          modelName: configuredModel?.name ?? modelId,
          input: configuredModel?.input,
        }),
        cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
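Since every call site in this file now passes only `input` and, in one case, `fallbackInput`, the behavior of the local `resolveProviderModelInput` is worth spelling out with a few inputs. The function body below is copied from the hunk above so the snippet runs standalone. The sample inputs, including the `"audio"` entry, are made up for illustration.

```typescript
function resolveProviderModelInput(params: {
  input?: unknown;
  fallbackInput?: unknown;
}): Array<"text" | "image"> {
  const resolvedInput = Array.isArray(params.input) ? params.input : params.fallbackInput;
  const normalizedInput = Array.isArray(resolvedInput)
    ? resolvedInput.filter((item): item is "text" | "image" => item === "text" || item === "image")
    : [];
  return normalizedInput.length > 0 ? normalizedInput : ["text"];
}

// Unknown modalities are dropped.
const filtered = resolveProviderModelInput({ input: ["text", "image", "audio"] });
// -> ["text", "image"]

// A missing `input` falls back to `fallbackInput`.
const fromFallback = resolveProviderModelInput({ fallbackInput: ["image"] });
// -> ["image"]

// An empty array is still an array, so the fallback is NOT consulted.
const emptyInput = resolveProviderModelInput({ input: [], fallbackInput: ["image"] });
// -> ["text"]

// Nothing usable at all defaults to text-only.
const nothing = resolveProviderModelInput({});
// -> ["text"]
```

The third case is the one to keep in mind when reading `applyConfiguredProviderOverrides`: a configured model with `input: []` resolves to text-only even if the discovered model supports images.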
--- a/src/agents/tool-display-config.ts
+++ b/src/agents/tool-display-config.ts
@@ -249,11 +249,6 @@ export const TOOL_DISPLAY_CONFIG: ToolDisplayConfig = {
        },
      },
    },
-    update_plan: {
-      emoji: "🗺️",
-      title: "Update Plan",
-      detailKeys: ["explanation", "plan.0.step"],
-    },
    gateway: {
      emoji: "🔌",
      title: "Gateway",
--- a/src/agents/tools/web-search.ts
+++ b/src/agents/tools/web-search.ts
@@ -4,6 +4,7 @@ import type { RuntimeWebSearchMetadata } from "../../secrets/runtime-web-tools.t
 import {
  resolveWebSearchDefinition,
  resolveWebSearchProviderId,
+  runWebSearch,
 } from "../../web-search/runtime.js";
 import type { AnyAgentTool } from "./common.js";
 import { jsonResult } from "./common.js";
@@ -16,16 +17,17 @@ export function createWebSearchTool(options?: {
 }): AnyAgentTool | null {
  const runtimeProviderId =
    options?.runtimeWebSearch?.selectedProvider ?? options?.runtimeWebSearch?.providerConfigured;
+  const preferRuntimeProviders =
+    Boolean(runtimeProviderId) &&
+    !resolveManifestContractOwnerPluginId({
+      contract: "webSearchProviders",
+      value: runtimeProviderId,
+      origin: "bundled",
+      config: options?.config,
+    });
  const resolved = resolveWebSearchDefinition({
    ...options,
-    preferRuntimeProviders:
-      Boolean(runtimeProviderId) &&
-      !resolveManifestContractOwnerPluginId({
-        contract: "webSearchProviders",
-        value: runtimeProviderId,
-        origin: "bundled",
-        config: options?.config,
-      }),
+    preferRuntimeProviders,
  });
  if (!resolved) {
    return null;
@@ -36,7 +38,19 @@ export function createWebSearchTool(options?: {
    name: "web_search",
    description: resolved.definition.description,
    parameters: resolved.definition.parameters,
-    execute: async (_toolCallId, args) => jsonResult(await resolved.definition.execute(args)),
+    execute: async (_toolCallId, args) => {
+      const result = await runWebSearch({
+        config: options?.config,
+        sandboxed: options?.sandboxed,
+        runtimeWebSearch: options?.runtimeWebSearch,
+        preferRuntimeProviders,
+        args,
+      });
+      return jsonResult({
+        ...result.result,
+        provider: result.provider,
+      });
+    },
  };
 }
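The new `execute` body does two things: it routes the call through the shared `runWebSearch` runtime, and it stamps the provider that actually ran onto the returned payload. A minimal sketch of that second step follows. The types, the function names, and the injected `runWebSearch` stand-in are assumptions for illustration and are not the real runtime signature.

```typescript
type WebSearchRun = {
  provider: string;
  result: Record<string, unknown>;
};

// Same spread order as the diff: `provider` is written last, so the runtime's
// answer replaces any `provider` key already present in the raw result.
function mergeWebSearchResult(run: WebSearchRun): Record<string, unknown> {
  return { ...run.result, provider: run.provider };
}

// Hypothetical wrapper showing where the merge sits in a tool's execute path.
function createExecuteSketch(
  runWebSearch: (args: Record<string, unknown>) => Promise<WebSearchRun>,
) {
  return async (_toolCallId: string, args: Record<string, unknown>) =>
    mergeWebSearchResult(await runWebSearch(args));
}
```

Because the agent tool and the CLI both report the provider returned by the same runtime call, their outputs should no longer disagree about which provider served a query.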

--- a/src/cli/capability-cli.test.ts
+++ b/src/cli/capability-cli.test.ts
@@ -0,0 +1,703 @@
+import fs from "node:fs/promises";
+import os from "node:os";
+import path from "node:path";
+import { Command } from "commander";
+import { beforeEach, describe, expect, it, vi } from "vitest";
+import { runRegisteredCli } from "../test-utils/command-runner.js";
+import { registerCapabilityCli } from "./capability-cli.js";
+
+const mocks = vi.hoisted(() => ({
+  runtime: {
+    log: vi.fn(),
+    error: vi.fn(),
+    exit: vi.fn((code: number) => {
+      throw new Error(`exit ${code}`);
+    }),
+    writeJson: vi.fn(),
+    writeStdout: vi.fn(),
+  },
+  loadConfig: vi.fn(() => ({})),
+  loadAuthProfileStoreForRuntime: vi.fn(() => ({ profiles: {}, order: {} })),
+  listProfilesForProvider: vi.fn(() => []),
+  resolveMemorySearchConfig: vi.fn(() => null),
+  loadModelCatalog: vi.fn(async () => []),
+  agentCommand: vi.fn(async () => ({
+    payloads: [{ text: "local reply" }],
+    meta: { agentMeta: { provider: "openai", model: "gpt-5.4" } },
+  })),
+  callGateway: vi.fn(async ({ method }: { method: string }) => {
+    if (method === "tts.status") {
+      return { enabled: true, provider: "openai" };
+    }
+    if (method === "agent") {
+      return {
+        result: {
+          payloads: [{ text: "gateway reply" }],
+          meta: { agentMeta: { provider: "anthropic", model: "claude-sonnet-4-6" } },
+        },
+      };
+    }
+    return {};
+  }),
+  describeImageFile: vi.fn(async () => ({
+    text: "friendly lobster",
+    provider: "openai",
+    model: "gpt-4.1-mini",
+  })),
+  generateImage: vi.fn(),
+  transcribeAudioFile: vi.fn(async () => ({ text: "meeting notes" })),
+  textToSpeech: vi.fn(async () => ({
+    success: true,
+    audioPath: "/tmp/tts-source.mp3",
+    provider: "openai",
+    outputFormat: "mp3",
+    voiceCompatible: false,
+    attempts: [],
+  })),
+  setTtsProvider: vi.fn(),
+  resolveExplicitTtsOverrides: vi.fn(
+    ({
+      provider,
+      modelId,
+      voiceId,
+    }: {
+      provider?: string;
+      modelId?: string;
+      voiceId?: string;
+    }) => ({
+      ...(provider ? { provider } : {}),
+      ...(modelId || voiceId
+        ? {
+            providerOverrides: {
+              [provider ?? "openai"]: {
+                ...(modelId ? { modelId } : {}),
+                ...(voiceId ? { voiceId } : {}),
+              },
+            },
+          }
+        : {}),
+    }),
+  ),
+  createEmbeddingProvider: vi.fn(async () => ({
+    provider: {
+      id: "openai",
+      model: "text-embedding-3-small",
+      embedQuery: async () => [0.1, 0.2],
+      embedBatch: async (texts: string[]) => texts.map(() => [0.1, 0.2]),
+    },
+  })),
+  registerMemoryEmbeddingProvider: vi.fn(),
+  listMemoryEmbeddingProviders: vi.fn(() => [
+    { id: "openai", defaultModel: "text-embedding-3-small", transport: "remote" },
+  ]),
+  registerBuiltInMemoryEmbeddingProviders: vi.fn(),
+  isWebSearchProviderConfigured: vi.fn(() => false),
+  isWebFetchProviderConfigured: vi.fn(() => false),
+  modelsStatusCommand: vi.fn(
+    async (_opts: unknown, runtime: { log: (...args: unknown[]) => void }) => {
+      runtime.log(JSON.stringify({ ok: true, providers: [{ id: "openai" }] }));
+    },
+  ),
+}));
+
+vi.mock("../runtime.js", () => ({
+  defaultRuntime: mocks.runtime,
+  writeRuntimeJson: (runtime: { writeJson: (value: unknown) => void }, value: unknown) =>
+    runtime.writeJson(value),
+}));
+
+vi.mock("../config/config.js", () => ({
+  loadConfig: (...args: unknown[]) => mocks.loadConfig(...args),
+}));
+
+vi.mock("../agents/agent-command.js", () => ({
+  agentCommand: (...args: unknown[]) => mocks.agentCommand(...args),
+}));
+
+vi.mock("../agents/agent-scope.js", () => ({
+  resolveDefaultAgentId: () => "main",
+  resolveAgentDir: () => "/tmp/agent",
+}));
+
+vi.mock("../agents/model-catalog.js", () => ({
+  loadModelCatalog: (...args: unknown[]) => mocks.loadModelCatalog(...args),
+}));
+
+vi.mock("../agents/auth-profiles.js", () => ({
+  loadAuthProfileStoreForRuntime: (...args: unknown[]) =>
+    mocks.loadAuthProfileStoreForRuntime(...args),
+  listProfilesForProvider: (...args: unknown[]) => mocks.listProfilesForProvider(...args),
+}));
+
+vi.mock("../agents/memory-search.js", () => ({
+  resolveMemorySearchConfig: (...args: unknown[]) => mocks.resolveMemorySearchConfig(...args),
+}));
+
+vi.mock("../commands/models.js", () => ({
+  modelsAuthLoginCommand: vi.fn(),
+  modelsStatusCommand: (...args: unknown[]) => mocks.modelsStatusCommand(...args),
+}));
+
+vi.mock("../gateway/call.js", () => ({
+  callGateway: (...args: unknown[]) => mocks.callGateway(...args),
+  randomIdempotencyKey: () => "run-1",
+}));
+
+vi.mock("../gateway/connection-details.js", () => ({
+  buildGatewayConnectionDetailsWithResolvers: vi.fn(() => ({
+    url: "ws://127.0.0.1:18789",
+    urlSource: "local loopback",
+    message: "Gateway target: ws://127.0.0.1:18789",
+  })),
+}));
+
+vi.mock("../media-understanding/runtime.js", () => ({
+  describeImageFile: (...args: unknown[]) => mocks.describeImageFile(...args),
+  describeVideoFile: vi.fn(),
+  transcribeAudioFile: (...args: unknown[]) => mocks.transcribeAudioFile(...args),
+}));
+
+vi.mock("../../extensions/memory-core/src/memory/embeddings.js", () => ({
+  createEmbeddingProvider: (...args: unknown[]) => mocks.createEmbeddingProvider(...args),
+}));
+
+vi.mock("../plugins/memory-embedding-providers.js", () => ({
+  listMemoryEmbeddingProviders: (...args: unknown[]) => mocks.listMemoryEmbeddingProviders(...args),
+  registerMemoryEmbeddingProvider: (...args: unknown[]) =>
+    mocks.registerMemoryEmbeddingProvider(...args),
+}));
+
+vi.mock("../../extensions/memory-core/src/memory/provider-adapters.js", () => ({
+  registerBuiltInMemoryEmbeddingProviders: (...args: unknown[]) =>
+    mocks.registerBuiltInMemoryEmbeddingProviders(...args),
+}));
+
+vi.mock("../image-generation/runtime.js", () => ({
+  generateImage: (...args: unknown[]) => mocks.generateImage(...args),
+  listRuntimeImageGenerationProviders: vi.fn(() => []),
+}));
+
+vi.mock("../video-generation/runtime.js", () => ({
+  generateVideo: vi.fn(),
+  listRuntimeVideoGenerationProviders: vi.fn(() => []),
+}));
+
+vi.mock("../tts/tts.js", () => ({
+  getTtsProvider: vi.fn(() => "openai"),
+  listSpeechVoices: vi.fn(async () => []),
+  resolveTtsConfig: vi.fn(() => ({})),
+  resolveTtsPrefsPath: vi.fn(() => "/tmp/tts.json"),
+  setTtsEnabled: vi.fn(),
+  setTtsProvider: (...args: unknown[]) => mocks.setTtsProvider(...args),
+  resolveExplicitTtsOverrides: (...args: unknown[]) => mocks.resolveExplicitTtsOverrides(...args),
+  textToSpeech: (...args: unknown[]) => mocks.textToSpeech(...args),
+}));
+
+vi.mock("../tts/provider-registry.js", () => ({
+  canonicalizeSpeechProviderId: vi.fn((provider: string) => provider),
+  listSpeechProviders: vi.fn(() => []),
+}));
+
+vi.mock("../web-search/runtime.js", () => ({
+  listWebSearchProviders: vi.fn(() => []),
+  isWebSearchProviderConfigured: (...args: unknown[]) =>
+    mocks.isWebSearchProviderConfigured(...args),
+  runWebSearch: vi.fn(),
+}));
+
+vi.mock("../web-fetch/runtime.js", () => ({
+  listWebFetchProviders: vi.fn(() => []),
+  isWebFetchProviderConfigured: (...args: unknown[]) => mocks.isWebFetchProviderConfigured(...args),
+  resolveWebFetchDefinition: vi.fn(),
+}));
+
+describe("capability cli", () => {
+  beforeEach(() => {
+    mocks.runtime.log.mockClear();
+    mocks.runtime.error.mockClear();
+    mocks.runtime.writeJson.mockClear();
+    mocks.loadModelCatalog
+      .mockReset()
+      .mockResolvedValue([{ id: "gpt-5.4", provider: "openai", name: "GPT-5.4" }]);
+    mocks.loadAuthProfileStoreForRuntime.mockReset().mockReturnValue({ profiles: {}, order: {} });
+    mocks.listProfilesForProvider.mockReset().mockReturnValue([]);
+    mocks.resolveMemorySearchConfig.mockReset().mockReturnValue(null);
+    mocks.agentCommand.mockClear();
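+    // Clear call history only. The gateway implementation, including the
+    // tts.convert branch, is installed once at the end of this hook.
+    mocks.callGateway.mockClear();
+    // Later tests override these with persistent mockReturnValue calls, so
+    // restore the defaults declared in vi.hoisted to keep the suite
+    // order-independent.
+    mocks.loadConfig.mockReset().mockReturnValue({});
+    mocks.listMemoryEmbeddingProviders
+      .mockReset()
+      .mockReturnValue([
+        { id: "openai", defaultModel: "text-embedding-3-small", transport: "remote" },
+      ]);
+    mocks.runtime.exit.mockClear();
+    mocks.runtime.writeStdout.mockClear();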
+    mocks.describeImageFile.mockClear();
+    mocks.generateImage.mockReset();
+    mocks.transcribeAudioFile.mockClear();
+    mocks.textToSpeech.mockClear();
+    mocks.setTtsProvider.mockClear();
+    mocks.resolveExplicitTtsOverrides.mockClear();
+    mocks.createEmbeddingProvider.mockClear();
+    mocks.registerMemoryEmbeddingProvider.mockClear();
+    mocks.registerBuiltInMemoryEmbeddingProviders.mockClear();
+    mocks.isWebSearchProviderConfigured.mockReset().mockReturnValue(false);
+    mocks.isWebFetchProviderConfigured.mockReset().mockReturnValue(false);
+    mocks.modelsStatusCommand.mockClear();
+    mocks.callGateway.mockImplementation(async ({ method }: { method: string }) => {
+      if (method === "tts.status") {
+        return { enabled: true, provider: "openai" };
+      }
+      if (method === "tts.convert") {
+        return {
+          audioPath: "/tmp/gateway-tts.mp3",
+          provider: "openai",
+          outputFormat: "mp3",
+          voiceCompatible: false,
+        };
+      }
+      if (method === "agent") {
+        return {
+          result: {
+            payloads: [{ text: "gateway reply" }],
+            meta: { agentMeta: { provider: "anthropic", model: "claude-sonnet-4-6" } },
+          },
+        };
+      }
+      return {};
+    });
+  });
+
+  it("lists canonical capabilities", async () => {
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: ["capability", "list", "--json"],
+    });
+
+    const payload = mocks.runtime.writeJson.mock.calls[0]?.[0] as Array<{ id: string }>;
+    expect(payload.some((entry) => entry.id === "model.run")).toBe(true);
+    expect(payload.some((entry) => entry.id === "media.image.describe")).toBe(true);
+  });
+
+  it("defaults model run to local transport", async () => {
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: ["capability", "model", "run", "--prompt", "hello", "--json"],
+    });
+
+    expect(mocks.agentCommand).toHaveBeenCalledTimes(1);
+    expect(mocks.callGateway).not.toHaveBeenCalled();
+    expect(mocks.runtime.writeJson).toHaveBeenCalledWith(
+      expect.objectContaining({
+        capability: "model.run",
+        transport: "local",
+      }),
+    );
+  });
+
+  it("defaults tts status to gateway transport", async () => {
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: ["capability", "media", "tts", "status", "--json"],
+    });
+
+    expect(mocks.callGateway).toHaveBeenCalledWith(
+      expect.objectContaining({ method: "tts.status" }),
+    );
+    expect(mocks.runtime.writeJson).toHaveBeenCalledWith(
+      expect.objectContaining({ transport: "gateway" }),
+    );
+  });
+
+  it("routes image describe through media understanding, not generation", async () => {
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: ["capability", "media", "image", "describe", "--file", "photo.jpg", "--json"],
+    });
+
+    expect(mocks.describeImageFile).toHaveBeenCalledWith(
+      expect.objectContaining({ filePath: expect.stringMatching(/photo\.jpg$/) }),
+    );
+    expect(mocks.runtime.writeJson).toHaveBeenCalledWith(
+      expect.objectContaining({
+        capability: "media.image.describe",
+        outputs: [expect.objectContaining({ kind: "image.description" })],
+      }),
+    );
+  });
+
+  it("fails image describe when no description text is returned", async () => {
+    mocks.describeImageFile.mockResolvedValueOnce({
+      text: undefined,
+      provider: undefined,
+      model: undefined,
+    });
+
+    await expect(
+      runRegisteredCli({
+        register: registerCapabilityCli as (program: Command) => void,
+        argv: ["capability", "media", "image", "describe", "--file", "photo.jpg", "--json"],
+      }),
+    ).rejects.toThrow("exit 1");
+    expect(mocks.runtime.error).toHaveBeenCalledWith(
+      expect.stringMatching(/No description returned for image/),
+    );
+  });
+
+  it("rewrites mismatched explicit image output extensions to the detected file type", async () => {
+    const jpegBase64 =
+      "/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEAAkGBxAQEBUQEBAVFRUVFRUVFRUVFRUVFRUVFRUXFhUVFRUYHSggGBolHRUVITEhJSkrLi4uFx8zODMsNygtLisBCgoKDg0OGhAQGi0fHyUtLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLf/AABEIAAEAAQMBIgACEQEDEQH/xAAXAAEBAQEAAAAAAAAAAAAAAAAAAQID/8QAFhEBAQEAAAAAAAAAAAAAAAAAAAER/9oADAMBAAIQAxAAAAH2AP/EABgQAQEAAwAAAAAAAAAAAAAAAAEAEQIS/9oACAEBAAEFAk1o7//EABYRAQEBAAAAAAAAAAAAAAAAAAABEf/aAAgBAwEBPwGn/8QAFhEBAQEAAAAAAAAAAAAAAAAAABEB/9oACAECAQE/AYf/xAAaEAACAgMAAAAAAAAAAAAAAAABEQAhMUFh/9oACAEBAAY/AjK9cY2f/8QAGhABAQACAwAAAAAAAAAAAAAAAAERITFBUf/aAAgBAQABPyGQk7W5jVYkA//Z";
+    mocks.generateImage.mockResolvedValue({
+      provider: "openai",
+      model: "gpt-image-1",
+      attempts: [],
+      images: [
+        {
+          buffer: Buffer.from(jpegBase64, "base64"),
+          mimeType: "image/png",
+          fileName: "provider-output.png",
+        },
+      ],
+    });
+
+    const tempOutput = path.join(os.tmpdir(), `openclaw-image-mismatch-${Date.now()}.png`);
+    await fs.rm(tempOutput, { force: true });
+    await fs.rm(tempOutput.replace(/\.png$/, ".jpg"), { force: true });
+
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: [
+        "capability",
+        "media",
+        "image",
+        "generate",
+        "--prompt",
+        "friendly lobster",
+        "--output",
+        tempOutput,
+        "--json",
+      ],
+    });
+
+    expect(mocks.runtime.writeJson).toHaveBeenCalledWith(
+      expect.objectContaining({
+        outputs: [
+          expect.objectContaining({
+            path: tempOutput.replace(/\.png$/, ".jpg"),
+            mimeType: "image/jpeg",
+          }),
+        ],
+      }),
+    );
+  });
+
+  it("routes audio transcribe through transcription, not realtime", async () => {
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: ["capability", "media", "audio", "transcribe", "--file", "memo.m4a", "--json"],
+    });
+
+    expect(mocks.transcribeAudioFile).toHaveBeenCalledWith(
+      expect.objectContaining({ filePath: expect.stringMatching(/memo\.m4a$/) }),
+    );
+    expect(mocks.runtime.writeJson).toHaveBeenCalledWith(
+      expect.objectContaining({
+        capability: "media.audio.transcribe",
+        outputs: [expect.objectContaining({ kind: "audio.transcription" })],
+      }),
+    );
+  });
+
+  it("fails audio transcribe when no transcript text is returned", async () => {
+    mocks.transcribeAudioFile.mockResolvedValueOnce({ text: undefined });
+
+    await expect(
+      runRegisteredCli({
+        register: registerCapabilityCli as (program: Command) => void,
+        argv: ["capability", "media", "audio", "transcribe", "--file", "memo.m4a", "--json"],
+      }),
+    ).rejects.toThrow("exit 1");
+    expect(mocks.runtime.error).toHaveBeenCalledWith(
+      expect.stringMatching(/No transcript returned for audio/),
+    );
+  });
+
+  it("forwards transcription prompt and language hints", async () => {
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: [
+        "capability",
+        "media",
+        "audio",
+        "transcribe",
+        "--file",
+        "memo.m4a",
+        "--language",
+        "en",
+        "--prompt",
+        "Focus on names",
+        "--json",
+      ],
+    });
+
+    expect(mocks.transcribeAudioFile).toHaveBeenCalledWith(
+      expect.objectContaining({
+        filePath: expect.stringMatching(/memo\.m4a$/),
+        language: "en",
+        prompt: "Focus on names",
+      }),
+    );
+  });
+
+  it("uses request-scoped TTS overrides without mutating prefs", async () => {
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: [
+        "capability",
+        "media",
+        "tts",
+        "convert",
+        "--text",
+        "hello",
+        "--model",
+        "openai/gpt-4o-mini-tts",
+        "--voice",
+        "alloy",
+        "--json",
+      ],
+    });
+
+    expect(mocks.textToSpeech).toHaveBeenCalledWith(
+      expect.objectContaining({
+        overrides: expect.objectContaining({
+          provider: "openai",
+          providerOverrides: expect.objectContaining({
+            openai: expect.objectContaining({
+              modelId: "gpt-4o-mini-tts",
+              voiceId: "alloy",
+            }),
+          }),
+        }),
+      }),
+    );
+    expect(mocks.setTtsProvider).not.toHaveBeenCalled();
+  });
+
+  it("disables TTS fallback when explicit provider or voice/model selection is requested", async () => {
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: [
+        "capability",
+        "media",
+        "tts",
+        "convert",
+        "--text",
+        "hello",
+        "--model",
+        "openai/gpt-4o-mini-tts",
+        "--voice",
+        "alloy",
+        "--json",
+      ],
+    });
+
+    expect(mocks.textToSpeech).toHaveBeenCalledWith(
+      expect.objectContaining({
+        disableFallback: true,
+      }),
+    );
+  });
+
+  it("does not infer and forward a local provider guess for gateway TTS overrides", async () => {
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: [
+        "capability",
+        "media",
+        "tts",
+        "convert",
+        "--gateway",
+        "--text",
+        "hello",
+        "--voice",
+        "alloy",
+        "--json",
+      ],
+    });
+
+    expect(mocks.callGateway).toHaveBeenCalledWith(
+      expect.objectContaining({
+        method: "tts.convert",
+        params: expect.objectContaining({
+          provider: undefined,
+          voiceId: "alloy",
+        }),
+      }),
+    );
+  });
+
+  it("fails clearly when gateway TTS output is requested against a remote gateway", async () => {
+    const gatewayConnection = await import("../gateway/connection-details.js");
+    vi.mocked(gatewayConnection.buildGatewayConnectionDetailsWithResolvers).mockReturnValueOnce({
+      url: "wss://gateway.example.com",
+      urlSource: "config gateway.remote.url",
+      message: "Gateway target: wss://gateway.example.com",
+    });
+
+    await expect(
+      runRegisteredCli({
+        register: registerCapabilityCli as (program: Command) => void,
+        argv: [
+          "capability",
+          "media",
+          "tts",
+          "convert",
+          "--gateway",
+          "--text",
+          "hello",
+          "--output",
+          "hello.mp3",
+          "--json",
+        ],
+      }),
+    ).rejects.toThrow("exit 1");
+
+    expect(mocks.runtime.error).toHaveBeenCalledWith(
+      expect.stringContaining("--output is not supported for remote gateway TTS yet"),
+    );
+  });
+
+  it("uses only embedding providers for embedding creation", async () => {
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: ["capability", "embedding", "create", "--text", "hello", "--json"],
+    });
+
+    expect(mocks.createEmbeddingProvider).toHaveBeenCalledWith(
+      expect.objectContaining({
+        provider: "auto",
+        fallback: "none",
+      }),
+    );
+    expect(mocks.runtime.writeJson).toHaveBeenCalledWith(
+      expect.objectContaining({
+        capability: "embedding.create",
+        provider: "openai",
+        model: "text-embedding-3-small",
+      }),
+    );
+  });
+
+  it("bootstraps built-in embedding providers when the registry is empty", async () => {
+    mocks.listMemoryEmbeddingProviders.mockReturnValueOnce([]);
+
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: ["capability", "embedding", "providers", "--json"],
+    });
+
+    expect(mocks.registerBuiltInMemoryEmbeddingProviders).toHaveBeenCalledWith(
+      expect.objectContaining({
+        registerMemoryEmbeddingProvider: expect.any(Function),
+      }),
+    );
+  });
+
+  it("surfaces available, configured, and selected for web providers", async () => {
+    mocks.loadConfig.mockReturnValue({
+      tools: {
+        web: {
+          search: { provider: "gemini" },
+          fetch: { provider: "firecrawl" },
+        },
+      },
+    });
+    const webSearchRuntime = await import("../web-search/runtime.js");
+    const webFetchRuntime = await import("../web-fetch/runtime.js");
+    vi.mocked(webSearchRuntime.listWebSearchProviders).mockReturnValue([
+      { id: "brave", envVars: ["BRAVE_API_KEY"] } as never,
+      { id: "gemini", envVars: ["GEMINI_API_KEY"] } as never,
+    ]);
+    vi.mocked(webFetchRuntime.listWebFetchProviders).mockReturnValue([
+      { id: "firecrawl", envVars: ["FIRECRAWL_API_KEY"] } as never,
+    ]);
+    mocks.isWebSearchProviderConfigured.mockReturnValueOnce(false).mockReturnValueOnce(true);
+    mocks.isWebFetchProviderConfigured.mockReturnValueOnce(true);
+
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: ["capability", "web", "providers", "--json"],
+    });
+
+    expect(mocks.runtime.writeJson).toHaveBeenCalledWith({
+      search: [
+        {
+          available: true,
+          configured: false,
+          selected: false,
+          id: "brave",
+          envVars: ["BRAVE_API_KEY"],
+        },
+        {
+          available: true,
+          configured: true,
+          selected: true,
+          id: "gemini",
+          envVars: ["GEMINI_API_KEY"],
+        },
+      ],
+      fetch: [
+        {
+          available: true,
+          configured: true,
+          selected: true,
+          id: "firecrawl",
+          envVars: ["FIRECRAWL_API_KEY"],
+        },
+      ],
+    });
+  });
+
+  it("surfaces selected and configured embedding provider state", async () => {
+    mocks.loadConfig.mockReturnValue({});
+    mocks.resolveMemorySearchConfig.mockReturnValue({
+      provider: "gemini",
+      model: "gemini-embedding-001",
+    });
+    mocks.listMemoryEmbeddingProviders.mockReturnValue([
+      { id: "openai", defaultModel: "text-embedding-3-small", transport: "remote" },
+      { id: "gemini", defaultModel: "gemini-embedding-001", transport: "remote" },
+    ]);
+
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: ["capability", "embedding", "providers", "--json"],
+    });
+
+    expect(mocks.runtime.writeJson).toHaveBeenCalledWith([
+      {
+        available: true,
+        configured: false,
+        selected: false,
+        id: "openai",
+        defaultModel: "text-embedding-3-small",
+        transport: "remote",
+        autoSelectPriority: undefined,
+      },
+      {
+        available: true,
+        configured: true,
+        selected: true,
+        id: "gemini",
+        defaultModel: "gemini-embedding-001",
+        transport: "remote",
+        autoSelectPriority: undefined,
+      },
+    ]);
+  });
+});
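The "rewrites mismatched explicit image output extensions" test above feeds JPEG bytes labelled as `image/png` and expects a `.jpg` path with `image/jpeg`. The implementation lives in `src/cli/capability-cli.ts`, which this diff does not show, so the following is only a minimal sketch of one way to get that behavior, assuming detection by leading magic bytes. `detectImageType` and `resolveOutputPath` are hypothetical names, not functions from the repository.

```typescript
// Hypothetical sketch: not the actual capability-cli.ts implementation.
type DetectedImage = { mimeType: string; extension: string };

// Well-known file signatures (leading "magic bytes") for JPEG and PNG.
const SIGNATURES: ReadonlyArray<{ bytes: readonly number[]; image: DetectedImage }> = [
  { bytes: [0xff, 0xd8, 0xff], image: { mimeType: "image/jpeg", extension: ".jpg" } },
  {
    bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a],
    image: { mimeType: "image/png", extension: ".png" },
  },
];

function detectImageType(data: Uint8Array): DetectedImage | undefined {
  return SIGNATURES.find(({ bytes }) => bytes.every((byte, index) => data[index] === byte))
    ?.image;
}

function resolveOutputPath(
  requestedPath: string,
  data: Uint8Array,
): { path: string; mimeType?: string } {
  const detected = detectImageType(data);
  if (!detected) {
    // Unknown content: leave the caller's path untouched.
    return { path: requestedPath };
  }
  // Treat only a dot inside the final path segment (and not a leading dot) as an extension.
  const lastSeparator = Math.max(requestedPath.lastIndexOf("/"), requestedPath.lastIndexOf("\\"));
  const dot = requestedPath.lastIndexOf(".");
  const stem = dot > lastSeparator + 1 ? requestedPath.slice(0, dot) : requestedPath;
  return { path: `${stem}${detected.extension}`, mimeType: detected.mimeType };
}

const jpegHeader = new Uint8Array([0xff, 0xd8, 0xff, 0xe0]);
const rewritten = resolveOutputPath("/tmp/out.png", jpegHeader);
const untouched = resolveOutputPath("/tmp/out.png", new Uint8Array([1, 2, 3]));
```

Trusting the bytes over the provider-reported MIME type is the design choice the test pins down. This sketch always normalizes to the canonical extension, so a `.jpeg` request would also become `.jpg`; whether the real command does that is not visible here.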
--- a/src/cli/capability-cli.ts
+++ b/src/cli/capability-cli.ts
--- a/src/cli/program/register.subclis.ts
+++ b/src/cli/program/register.subclis.ts
@@ -74,6 +74,11 @@ const entrySpecs: readonly CommandGroupDescriptorSpec<SubCliRegistrar>[] = [
      loadModule: () => import("../models-cli.js"),
      exportName: "registerModelsCli",
    },
+    {
+      commandNames: ["capability"],
+      loadModule: () => import("../capability-cli.js"),
+      exportName: "registerCapabilityCli",
+    },
    {
      commandNames: ["approvals"],
      loadModule: () => import("../exec-approvals-cli.js"),
--- a/src/cli/program/subcli-descriptors.ts
+++ b/src/cli/program/subcli-descriptors.ts
@@ -22,6 +22,11 @@ const subCliCommandCatalog = defineCommandDescriptorCatalog([
    description: "Discover, scan, and configure models",
    hasSubcommands: true,
  },
+  {
+    name: "capability",
+    description: "Run provider-backed capability commands",
+    hasSubcommands: true,
+  },
  {
    name: "approvals",
    description: "Manage exec approvals (gateway or node host)",
--- a/src/config/schema.help.ts
+++ b/src/config/schema.help.ts
@@ -75,6 +75,16 @@ export const FIELD_HELP: Record<string, string> = {
    "Control UI hosting settings including enablement, pathing, and browser-origin/auth hardening behavior. Keep UI exposure minimal and pair with strong auth controls before internet-facing deployments.",
  "gateway.controlUi.enabled":
    "Enables serving the gateway Control UI from the gateway HTTP process when true. Keep enabled for local administration, and disable when an external control surface replaces it.",
+  "gateway.controlUi.voice":
+    "Browser voice settings for the Control UI chat, including realtime transcription provider selection and optional assistant speech playback.",
+  "gateway.controlUi.voice.enabled":
+    "Enables realtime browser voice sessions for the Control UI chat when a transcription provider is configured.",
+  "gateway.controlUi.voice.transcriptionProvider":
+    "Registered realtime transcription provider id used for browser mic input. Keep this explicit so browser voice fails closed when no provider is configured.",
+  "gateway.controlUi.voice.providers":
+    "Provider-owned realtime transcription config keyed by provider id for browser voice sessions.",
+  "gateway.controlUi.voice.playbackEnabled":
+    "Enables browser speech-synthesis playback for finalized assistant replies during a voice session.",
  "gateway.auth":
    "Authentication policy for gateway HTTP/WebSocket access including mode, credentials, trusted-proxy behavior, and rate limiting. Keep auth enabled for every non-loopback deployment.",
  "gateway.auth.mode":
--- a/src/config/types.gateway.ts
+++ b/src/config/types.gateway.ts
@@ -100,6 +100,17 @@ export type GatewayControlUiConfig = {
  allowInsecureAuth?: boolean;
  /** DANGEROUS: Disable device identity checks for the Control UI (default: false). */
  dangerouslyDisableDeviceAuth?: boolean;
+  /** Realtime voice settings for the browser chat UI. */
+  voice?: {
+    /** Enable browser voice sessions for the Control UI chat. */
+    enabled?: boolean;
+    /** Registered realtime transcription provider id to use for browser voice. */
+    transcriptionProvider?: string;
+    /** Provider-owned realtime transcription config keyed by provider id. */
+    providers?: Record<string, Record<string, unknown>>;
+    /** Enable browser speech synthesis playback for assistant replies. */
+    playbackEnabled?: boolean;
+  };
 };

 export type GatewayAuthMode = "none" | "token" | "password" | "trusted-proxy";
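As an illustration of the `gateway.controlUi.voice` block added above, here is a small sketch of a config value and a fail-closed provider lookup, in line with the help text's note that browser voice should fail closed when no provider is configured. The provider id `example-stt`, its options, and the helper `resolveVoiceProvider` are hypothetical placeholders, not names from the repository.

```typescript
// Mirrors the `voice` block added to GatewayControlUiConfig in this diff.
type ControlUiVoiceConfig = {
  enabled?: boolean;
  transcriptionProvider?: string;
  providers?: Record<string, Record<string, unknown>>;
  playbackEnabled?: boolean;
};

// "example-stt" and its options are placeholders, not a real provider id.
const voiceConfig: ControlUiVoiceConfig = {
  enabled: true,
  transcriptionProvider: "example-stt",
  providers: { "example-stt": { language: "en" } },
  playbackEnabled: false,
};

// Hypothetical helper: voice is usable only when it is enabled AND a provider
// is named explicitly. There is no implicit default provider.
function resolveVoiceProvider(config: ControlUiVoiceConfig | undefined): string | undefined {
  if (!config?.enabled) {
    return undefined;
  }
  const providerId = config.transcriptionProvider?.trim();
  return providerId ? providerId : undefined;
}

const configuredProvider = resolveVoiceProvider(voiceConfig);
const missingProvider = resolveVoiceProvider({ enabled: true });
const disabledProvider = resolveVoiceProvider({
  enabled: false,
  transcriptionProvider: "example-stt",
});
```

Returning `undefined` rather than picking the first registered provider keeps a misconfigured deployment from silently streaming microphone audio to an unintended service.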
--- a/src/config/zod-schema.ts
+++ b/src/config/zod-schema.ts
@@ -676,6 +676,15 @@ export const OpenClawSchema = z
            dangerouslyAllowHostHeaderOriginFallback: z.boolean().optional(),
            allowInsecureAuth: z.boolean().optional(),
            dangerouslyDisableDeviceAuth: z.boolean().optional(),
+            voice: z
+              .object({
+                enabled: z.boolean().optional(),
+                transcriptionProvider: z.string().min(1).optional(),
+                providers: z.record(z.string(), z.record(z.string(), z.unknown())).optional(),
+                playbackEnabled: z.boolean().optional(),
+              })
+              .strict()
+              .optional(),
          })
          .strict()
          .optional(),
--- a/src/gateway/chat-voice-sessions.ts
+++ b/src/gateway/chat-voice-sessions.ts
@@ -0,0 +1,106 @@
+import type { RealtimeTranscriptionSession } from "../realtime-transcription/provider-types.js";
+
+export type ChatVoiceEventPayload = {
+  sessionKey: string;
+  state:
+    | "ready"
+    | "speech_start"
+    | "partial_transcript"
+    | "final_transcript"
+    | "assistant_started"
+    | "assistant_completed"
+    | "playback_clear"
+    | "interrupted"
+    | "error"
+    | "closed";
+  transcript?: string;
+  runId?: string;
+  errorMessage?: string;
+  playbackEnabled?: boolean;
+};
+
+export type ChatVoiceSessionEntry = {
+  sessionKey: string;
+  connId: string;
+  providerId: string;
+  playbackEnabled: boolean;
+  sttSession: RealtimeTranscriptionSession;
+  transcriptPartial: string;
+  transcriptFinal: string;
+  activeRunId: string | null;
+};
+
+const sessionsByKey = new Map<string, ChatVoiceSessionEntry>();
+const sessionKeyByRunId = new Map<string, string>();
+
+export function getChatVoiceSession(sessionKey: string): ChatVoiceSessionEntry | undefined {
+  return sessionsByKey.get(sessionKey);
+}
+
+export function setChatVoiceSession(entry: ChatVoiceSessionEntry) {
+  const existing = sessionsByKey.get(entry.sessionKey);
+  if (existing && existing !== entry) {
+    try {
+      existing.sttSession.close();
+    } catch {
+      // ignore replacement cleanup errors
+    }
+    if (existing.activeRunId) {
+      sessionKeyByRunId.delete(existing.activeRunId);
+    }
+  }
+  sessionsByKey.set(entry.sessionKey, entry);
+}
+
+export function deleteChatVoiceSession(sessionKey: string): ChatVoiceSessionEntry | undefined {
+  const entry = sessionsByKey.get(sessionKey);
+  if (!entry) {
+    return undefined;
+  }
+  sessionsByKey.delete(sessionKey);
+  if (entry.activeRunId) {
+    sessionKeyByRunId.delete(entry.activeRunId);
+  }
+  return entry;
+}
+
+export function setChatVoiceRunId(sessionKey: string, runId: string | null) {
+  const entry = sessionsByKey.get(sessionKey);
+  if (!entry) {
+    return;
+  }
+  if (entry.activeRunId) {
+    sessionKeyByRunId.delete(entry.activeRunId);
+  }
+  entry.activeRunId = runId;
+  if (runId) {
+    sessionKeyByRunId.set(runId, sessionKey);
+  }
+}
+
+export function getChatVoiceSessionByRunId(runId: string): ChatVoiceSessionEntry | undefined {
+  const sessionKey = sessionKeyByRunId.get(runId);
+  return sessionKey ? sessionsByKey.get(sessionKey) : undefined;
+}
+
+export function closeChatVoiceSessionsForConn(
+  connId: string,
+  emit: (connId: string, payload: ChatVoiceEventPayload) => void,
+) {
+  for (const entry of sessionsByKey.values()) {
+    if (entry.connId !== connId) {
+      continue;
+    }
+    try {
+      entry.sttSession.close();
+    } catch {
+      // ignore cleanup errors on disconnect
+    }
+    deleteChatVoiceSession(entry.sessionKey);
+    emit(connId, {
+      sessionKey: entry.sessionKey,
+      state: "closed",
+      playbackEnabled: entry.playbackEnabled,
+    });
+  }
+}
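The registry above keeps two maps in step: sessions by key, and a reverse index from the active run id back to the session key. The sketch below reduces that to its core invariant, namely that a replaced run id stops resolving. It is a simplified stand-in and not the module itself: `VoiceEntry`, `trackRun`, and `findByRunId` are illustrative names, and the entry type omits the STT session and transcript fields.

```typescript
// Simplified stand-in for ChatVoiceSessionEntry: only the fields the index touches.
type VoiceEntry = { sessionKey: string; activeRunId: string | null };

const entriesByKey = new Map<string, VoiceEntry>();
const keyByRunId = new Map<string, string>();

function trackRun(sessionKey: string, runId: string | null): void {
  const entry = entriesByKey.get(sessionKey);
  if (!entry) {
    return;
  }
  // Drop the previous run's reverse mapping before recording the new one,
  // so a finished run can never resolve to a live session.
  if (entry.activeRunId) {
    keyByRunId.delete(entry.activeRunId);
  }
  entry.activeRunId = runId;
  if (runId) {
    keyByRunId.set(runId, sessionKey);
  }
}

function findByRunId(runId: string): VoiceEntry | undefined {
  const sessionKey = keyByRunId.get(runId);
  return sessionKey ? entriesByKey.get(sessionKey) : undefined;
}

entriesByKey.set("main", { sessionKey: "main", activeRunId: null });
trackRun("main", "run-1");
trackRun("main", "run-2");
const staleLookup = findByRunId("run-1");
const liveLookup = findByRunId("run-2");
trackRun("main", null);
const clearedLookup = findByRunId("run-2");
```

Storing the session key in the reverse index, rather than the entry object, means a lookup after the session is deleted returns `undefined` instead of a dangling entry.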
--- a/src/gateway/method-scopes.test.ts
+++ b/src/gateway/method-scopes.test.ts
@@ -27,6 +27,7 @@ describe("method scope resolution", () => {
  it.each([
    ["sessions.resolve", ["operator.read"]],
    ["config.schema.lookup", ["operator.read"]],
+    ["chat.voice.start", ["operator.write"]],
    ["sessions.create", ["operator.write"]],
    ["sessions.send", ["operator.write"]],
    ["sessions.abort", ["operator.write"]],
@@ -85,6 +86,10 @@ describe("operator scope authorization", () => {
      allowed: false,
      missingScope: "operator.write",
    });
+    expect(authorizeOperatorScopesForMethod("chat.voice.start", ["operator.read"])).toEqual({
+      allowed: false,
+      missingScope: "operator.write",
+    });
  });

  it("requires pairing scope for node pairing approvals", () => {
--- a/src/gateway/method-scopes.ts
+++ b/src/gateway/method-scopes.ts
@@ -117,14 +117,23 @@ const METHOD_SCOPE_GROUPS: Record<OperatorScope, readonly string[]> = {
    "wake",
    "talk.mode",
    "talk.speak",
+    "chat.voice.start",
    "tts.enable",
    "tts.disable",
    "tts.convert",
    "tts.setProvider",
+    "realtimeTranscription.start",
+    "realtimeTranscription.pushAudio",
+    "realtimeTranscription.pull",
+    "realtimeTranscription.finish",
    "voicewake.set",
    "node.invoke",
    "chat.send",
    "chat.abort",
+    "chat.voice.audio",
+    "chat.voice.commit",
+    "chat.voice.interrupt",
+    "chat.voice.stop",
    "sessions.create",
    "sessions.send",
    "sessions.steer",
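To make the scope change concrete, the following sketch shows a method-to-scope lookup that returns the same result shape the test above expects for `chat.voice.start`. It is a simplified illustration under stated assumptions: the table is a three-entry excerpt, `authorize` is a hypothetical stand-in for `authorizeOperatorScopesForMethod`, and it uses exact scope membership because the diff does not show whether broader scopes imply narrower ones.

```typescript
// Excerpt only: the real METHOD_SCOPE_GROUPS table covers many more methods.
const REQUIRED_SCOPE = {
  "sessions.resolve": "operator.read",
  "chat.voice.start": "operator.write",
  "chat.voice.audio": "operator.write",
} as const;

type Method = keyof typeof REQUIRED_SCOPE;
type OperatorScope = (typeof REQUIRED_SCOPE)[Method];

type ScopeDecision = { allowed: true } | { allowed: false; missingScope: OperatorScope };

// Hypothetical stand-in: allow the call only when the caller holds exactly the
// scope the method requires (assumption: no scope implies another).
function authorize(method: Method, granted: readonly string[]): ScopeDecision {
  const required = REQUIRED_SCOPE[method];
  return granted.includes(required)
    ? { allowed: true }
    : { allowed: false, missingScope: required };
}

const readOnlyCaller = authorize("chat.voice.start", ["operator.read"]);
const writeCaller = authorize("chat.voice.start", ["operator.write"]);
```

Placing every `chat.voice.*` method in the write group means a read-only operator token cannot open a microphone session or push audio into one.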
--- a/src/gateway/protocol/index.ts
+++ b/src/gateway/protocol/index.ts
@@ -63,6 +63,18 @@ import {
  ChatHistoryParamsSchema,
  type ChatInjectParams,
  ChatInjectParamsSchema,
+  type ChatVoiceAudioParams,
+  ChatVoiceAudioParamsSchema,
+  type ChatVoiceCommitParams,
+  ChatVoiceCommitParamsSchema,
+  type ChatVoiceEvent,
+  ChatVoiceEventSchema,
+  type ChatVoiceInterruptParams,
+  ChatVoiceInterruptParamsSchema,
+  type ChatVoiceStartParams,
+  ChatVoiceStartParamsSchema,
+  type ChatVoiceStopParams,
+  ChatVoiceStopParamsSchema,
  ChatSendParamsSchema,
  type ConfigApplyParams,
  ConfigApplyParamsSchema,
@@ -474,6 +486,21 @@ export const validateChatSendParams = ajv.compile(ChatSendParamsSchema);
 export const validateChatAbortParams = ajv.compile<ChatAbortParams>(ChatAbortParamsSchema);
 export const validateChatInjectParams = ajv.compile<ChatInjectParams>(ChatInjectParamsSchema);
 export const validateChatEvent = ajv.compile(ChatEventSchema);
+export const validateChatVoiceStartParams = ajv.compile<ChatVoiceStartParams>(
+  ChatVoiceStartParamsSchema,
+);
+export const validateChatVoiceAudioParams = ajv.compile<ChatVoiceAudioParams>(
+  ChatVoiceAudioParamsSchema,
+);
+export const validateChatVoiceCommitParams = ajv.compile<ChatVoiceCommitParams>(
+  ChatVoiceCommitParamsSchema,
+);
+export const validateChatVoiceInterruptParams = ajv.compile<ChatVoiceInterruptParams>(
+  ChatVoiceInterruptParamsSchema,
+);
+export const validateChatVoiceStopParams =
+  ajv.compile<ChatVoiceStopParams>(ChatVoiceStopParamsSchema);
+export const validateChatVoiceEvent = ajv.compile<ChatVoiceEvent>(ChatVoiceEventSchema);
 export const validateUpdateRunParams = ajv.compile<UpdateRunParams>(UpdateRunParamsSchema);
 export const validateWebLoginStartParams =
  ajv.compile<WebLoginStartParams>(WebLoginStartParamsSchema);
--- a/src/gateway/protocol/schema/logs-chat.ts
+++ b/src/gateway/protocol/schema/logs-chat.ts
@@ -68,6 +68,68 @@ export const ChatInjectParamsSchema = Type.Object(
  { additionalProperties: false },
 );

+export const ChatVoiceStartParamsSchema = Type.Object(
+  {
+    sessionKey: NonEmptyString,
+  },
+  { additionalProperties: false },
+);
+
+export const ChatVoiceAudioParamsSchema = Type.Object(
+  {
+    sessionKey: NonEmptyString,
+    audio: NonEmptyString,
+    format: Type.Optional(Type.String()),
+    sampleRate: Type.Optional(Type.Integer({ minimum: 1 })),
+  },
+  { additionalProperties: false },
+);
+
+export const ChatVoiceCommitParamsSchema = Type.Object(
+  {
+    sessionKey: NonEmptyString,
+    transcript: Type.Optional(Type.String()),
+  },
+  { additionalProperties: false },
+);
+
+export const ChatVoiceInterruptParamsSchema = Type.Object(
+  {
+    sessionKey: NonEmptyString,
+  },
+  { additionalProperties: false },
+);
+
+export const ChatVoiceStopParamsSchema = Type.Object(
+  {
+    sessionKey: NonEmptyString,
+  },
+  { additionalProperties: false },
+);
+
+export const ChatVoiceEventSchema = Type.Object(
+  {
+    sessionKey: NonEmptyString,
+    state: Type.Union([
+      Type.Literal("ready"),
+      Type.Literal("speech_start"),
+      Type.Literal("partial_transcript"),
+      Type.Literal("final_transcript"),
+      Type.Literal("assistant_started"),
+      Type.Literal("assistant_completed"),
+      Type.Literal("playback_clear"),
+      Type.Literal("interrupted"),
+      Type.Literal("error"),
+      Type.Literal("closed"),
+    ]),
+    transcript: Type.Optional(Type.String()),
+    runId: Type.Optional(Type.String()),
+    errorMessage: Type.Optional(Type.String()),
+    playbackEnabled: Type.Optional(Type.Boolean()),
+  },
+  { additionalProperties: false },
+);
+
 export const ChatEventSchema = Type.Object(
  {
    runId: NonEmptyString,
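
For reference, a minimal sketch of how a client might assemble `chat.voice.audio` params that satisfy `ChatVoiceAudioParamsSchema` above. The helper name `buildVoiceAudioParams`, the sample session key, and the choice of little-endian 16-bit PCM are assumptions for illustration only. They follow from `chat.ts` in this diff, which pins `inputAudioFormat` to `pcm16` and adds a strict base64 parser for audio, but none of this code is part of the PR.

```typescript
import { Buffer } from "node:buffer";

// Hand-written mirror of ChatVoiceAudioParamsSchema (illustrative, not generated).
type ChatVoiceAudioParamsSketch = {
  sessionKey: string;
  audio: string;
  format?: string;
  sampleRate?: number;
};

// Hypothetical helper: clamps samples to the signed 16-bit range, packs them
// as little-endian PCM, and base64-encodes the result.
function buildVoiceAudioParams(
  sessionKey: string,
  samples: readonly number[],
  sampleRate: number,
): ChatVoiceAudioParamsSketch {
  if (sessionKey.length === 0) {
    throw new Error("sessionKey must be non-empty");
  }
  if (samples.length === 0) {
    throw new Error("audio must be non-empty");
  }
  if (!Number.isInteger(sampleRate) || sampleRate < 1) {
    throw new Error("sampleRate must be an integer >= 1");
  }
  const pcm = Buffer.alloc(samples.length * 2);
  samples.forEach((sample, index) => {
    const clamped = Math.max(-32768, Math.min(32767, Math.round(sample)));
    pcm.writeInt16LE(clamped, index * 2);
  });
  return { sessionKey, audio: pcm.toString("base64"), sampleRate };
}

// "agent:main:main" is a made-up session key used only for this example.
const frame = buildVoiceAudioParams("agent:main:main", [0, 1, -1], 16000);
console.log(frame);
```

The optional `format` field is left unset here because the schema accepts any string and the values the handler recognizes are not visible in this part of the diff.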
--- a/src/gateway/protocol/schema/protocol-schemas.ts
+++ b/src/gateway/protocol/schema/protocol-schemas.ts
@@ -118,6 +118,12 @@ import {
  ChatEventSchema,
  ChatHistoryParamsSchema,
  ChatInjectParamsSchema,
+  ChatVoiceAudioParamsSchema,
+  ChatVoiceCommitParamsSchema,
+  ChatVoiceEventSchema,
+  ChatVoiceInterruptParamsSchema,
+  ChatVoiceStartParamsSchema,
+  ChatVoiceStopParamsSchema,
  ChatSendParamsSchema,
  LogsTailParamsSchema,
  LogsTailResultSchema,
@@ -330,7 +336,13 @@ export const ProtocolSchemas = {
  ChatSendParams: ChatSendParamsSchema,
  ChatAbortParams: ChatAbortParamsSchema,
  ChatInjectParams: ChatInjectParamsSchema,
+  ChatVoiceStartParams: ChatVoiceStartParamsSchema,
+  ChatVoiceAudioParams: ChatVoiceAudioParamsSchema,
+  ChatVoiceCommitParams: ChatVoiceCommitParamsSchema,
+  ChatVoiceInterruptParams: ChatVoiceInterruptParamsSchema,
+  ChatVoiceStopParams: ChatVoiceStopParamsSchema,
  ChatEvent: ChatEventSchema,
+  ChatVoiceEvent: ChatVoiceEventSchema,
  UpdateRunParams: UpdateRunParamsSchema,
  TickEvent: TickEventSchema,
  ShutdownEvent: ShutdownEventSchema,
--- a/src/gateway/protocol/schema/types.ts
+++ b/src/gateway/protocol/schema/types.ts
@@ -144,6 +144,12 @@ export type DeviceTokenRevokeParams = SchemaType<"DeviceTokenRevokeParams">;
 export type ChatAbortParams = SchemaType<"ChatAbortParams">;
 export type ChatInjectParams = SchemaType<"ChatInjectParams">;
 export type ChatEvent = SchemaType<"ChatEvent">;
+export type ChatVoiceStartParams = SchemaType<"ChatVoiceStartParams">;
+export type ChatVoiceAudioParams = SchemaType<"ChatVoiceAudioParams">;
+export type ChatVoiceCommitParams = SchemaType<"ChatVoiceCommitParams">;
+export type ChatVoiceInterruptParams = SchemaType<"ChatVoiceInterruptParams">;
+export type ChatVoiceStopParams = SchemaType<"ChatVoiceStopParams">;
+export type ChatVoiceEvent = SchemaType<"ChatVoiceEvent">;
 export type UpdateRunParams = SchemaType<"UpdateRunParams">;
 export type TickEvent = SchemaType<"TickEvent">;
 export type ShutdownEvent = SchemaType<"ShutdownEvent">;
--- a/src/gateway/realtime-transcription-session-manager.test.ts
+++ b/src/gateway/realtime-transcription-session-manager.test.ts
@@ -0,0 +1,154 @@
+import { describe, expect, it, vi } from "vitest";
+import type { OpenClawConfig } from "../config/config.js";
+import type { RealtimeTranscriptionProviderPlugin } from "../plugins/types.js";
+import { RealtimeTranscriptionSessionManager } from "./realtime-transcription-session-manager.js";
+
+function createProvider(params?: {
+  id?: string;
+  configured?: boolean;
+  onCreate?: (callbacks: Record<string, unknown>) => void;
+}): RealtimeTranscriptionProviderPlugin {
+  return {
+    id: params?.id ?? "openai",
+    label: "Test",
+    autoSelectOrder: 1,
+    resolveConfig: ({ rawConfig }) => rawConfig,
+    isConfigured: () => params?.configured ?? true,
+    createSession: (req) => {
+      params?.onCreate?.(req as unknown as Record<string, unknown>);
+      return {
+        connect: async () => {},
+        sendAudio: vi.fn(),
+        close: vi.fn(),
+        isConnected: () => true,
+      };
+    },
+  };
+}
+
+describe("RealtimeTranscriptionSessionManager", () => {
+  it("starts a session, auto-selects the first configured provider, and queues events", async () => {
+    let callbacks: Record<string, unknown> | undefined;
+    const provider = createProvider({
+      onCreate: (req) => {
+        callbacks = req;
+      },
+    });
+    const manager = new RealtimeTranscriptionSessionManager({
+      loadConfig: () => ({}) as OpenClawConfig,
+      listProviders: () => [provider],
+      getProvider: () => provider,
+      now: () => 123,
+      createId: () => "session-1",
+    });
+
+    const started = await manager.startSession({
+      format: "s16le",
+      sampleRate: 16000,
+      channels: 1,
+    });
+    expect(started).toEqual({
+      sessionId: "session-1",
+      provider: "openai",
+      format: "s16le",
+      sampleRate: 16000,
+      channels: 1,
+    });
+
+    (callbacks?.onPartial as ((value: string) => void) | undefined)?.("hello");
+    (callbacks?.onTranscript as ((value: string) => void) | undefined)?.("hello world");
+
+    const pulled = manager.pullEvents({ sessionId: "session-1" });
+    expect(pulled.events).toEqual([
+      { type: "session.started", provider: "openai", transport: "gateway", timestamp: 123 },
+      { type: "partial", text: "hello", timestamp: 123 },
+      { type: "final", text: "hello world", timestamp: 123 },
+    ]);
+  });
+
+  it("rejects unsupported audio shapes", async () => {
+    const provider = createProvider();
+    const manager = new RealtimeTranscriptionSessionManager({
+      loadConfig: () => ({}) as OpenClawConfig,
+      listProviders: () => [provider],
+      getProvider: () => provider,
+      now: () => 123,
+      createId: () => "session-1",
+    });
+
+    await expect(
+      manager.startSession({
+        format: "s16le",
+        sampleRate: 16000,
+        channels: 2,
+      }),
+    ).rejects.toThrow(/mono audio/);
+  });
+
+  it("returns pending terminal events on finish and removes the session", async () => {
+    let callbacks: Record<string, unknown> | undefined;
+    const close = vi.fn();
+    const provider = createProvider({
+      onCreate: (req) => {
+        callbacks = req;
+      },
+    });
+    provider.createSession = (req) => {
+      callbacks = req as unknown as Record<string, unknown>;
+      return {
+        connect: async () => {},
+        sendAudio: vi.fn(),
+        close,
+        isConnected: () => false,
+      };
+    };
+    const manager = new RealtimeTranscriptionSessionManager({
+      loadConfig: () => ({}) as OpenClawConfig,
+      listProviders: () => [provider],
+      getProvider: () => provider,
+      now: () => 123,
+      createId: () => "session-1",
+    });
+
+    await manager.startSession({
+      format: "s16le",
+      sampleRate: 16000,
+      channels: 1,
+    });
+    (callbacks?.onPartial as ((value: string) => void) | undefined)?.("hello");
+
+    expect(manager.finishSession({ sessionId: "session-1" })).toEqual({
+      sessionId: "session-1",
+      provider: "openai",
+      closed: true,
+      events: [
+        { type: "session.started", provider: "openai", transport: "gateway", timestamp: 123 },
+        { type: "partial", text: "hello", timestamp: 123 },
+        { type: "session.ended", reason: "client_finish", timestamp: 123 },
+      ],
+    });
+    expect(close).toHaveBeenCalledTimes(1);
+    expect(() => manager.pullEvents({ sessionId: "session-1" })).toThrow(
+      /Unknown realtime transcription session/,
+    );
+  });
+
+  it("fails when no configured provider is available", async () => {
+    const provider = createProvider({ configured: false });
+    const manager = new RealtimeTranscriptionSessionManager({
+      loadConfig: () => ({}) as OpenClawConfig,
+      listProviders: () => [provider],
+      getProvider: () => provider,
+      now: () => 123,
+      createId: () => "session-1",
+    });
+
+    await expect(
+      manager.startSession({
+        format: "s16le",
+        sampleRate: 16000,
+        channels: 1,
+      }),
+    ).rejects.toThrow(/No configured realtime transcription provider/);
+  });
+});
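
The test file above only exercises the mono-audio branch of `validateSessionShape`. As a rough sketch of the remaining branches, the snippet below mirrors the rules from `realtime-transcription-session-manager.ts` in this diff. `describeShapeProblem` is a hypothetical stand-in that returns the message instead of throwing, so the cases can be checked without importing the gateway module.

```typescript
type AudioFormat = "s16le" | "pcm16" | "g711_ulaw";

type SessionShape = {
  format: AudioFormat;
  sampleRate: number;
  channels: number;
};

// Same checks, in the same order, as validateSessionShape in this diff.
// Returns null when the shape is accepted.
function describeShapeProblem(shape: SessionShape): string | null {
  if (!Number.isFinite(shape.sampleRate) || shape.sampleRate <= 0) {
    return "sampleRate must be a positive number.";
  }
  if (!Number.isFinite(shape.channels) || shape.channels <= 0) {
    return "channels must be a positive number.";
  }
  if (shape.channels !== 1) {
    return "realtime transcription currently requires mono audio (channels=1).";
  }
  if (shape.format === "g711_ulaw" && shape.sampleRate !== 8000) {
    return "g711_ulaw realtime transcription requires sampleRate=8000.";
  }
  return null;
}

const shapes: SessionShape[] = [
  { format: "s16le", sampleRate: 16000, channels: 1 },
  { format: "g711_ulaw", sampleRate: 8000, channels: 1 },
  { format: "g711_ulaw", sampleRate: 16000, channels: 1 },
  { format: "pcm16", sampleRate: Number.NaN, channels: 1 },
  { format: "pcm16", sampleRate: 16000, channels: 0 },
];
for (const shape of shapes) {
  console.log(shape, describeShapeProblem(shape) ?? "accepted");
}
```

Equivalent `rejects.toThrow` cases for the `g711_ulaw` sample-rate rule and the non-finite inputs could be added next to the existing mono-audio test.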
--- a/src/gateway/realtime-transcription-session-manager.ts
+++ b/src/gateway/realtime-transcription-session-manager.ts
@@ -0,0 +1,297 @@
+import { randomUUID } from "node:crypto";
+import type { OpenClawConfig } from "../config/config.js";
+import { loadConfig } from "../config/config.js";
+import type { RealtimeTranscriptionProviderPlugin } from "../plugins/types.js";
+import {
+  getRealtimeTranscriptionProvider,
+  listRealtimeTranscriptionProviders,
+} from "../realtime-transcription/provider-registry.js";
+import type {
+  RealtimeTranscriptionProviderConfig,
+  RealtimeTranscriptionSession,
+} from "../realtime-transcription/provider-types.js";
+
+type AudioFormat = "s16le" | "pcm16" | "g711_ulaw";
+
+export type RealtimeTranscriptionSessionEvent =
+  | { type: "session.started"; provider: string; transport: "gateway"; timestamp: number }
+  | { type: "partial"; text: string; timestamp: number }
+  | { type: "final"; text: string; timestamp: number }
+  | { type: "warning"; message: string; timestamp: number }
+  | { type: "error"; message: string; timestamp: number }
+  | { type: "session.ended"; reason: string; timestamp: number };
+
+type ManagedSession = {
+  id: string;
+  provider: string;
+  format: AudioFormat;
+  sampleRate: number;
+  channels: number;
+  session: RealtimeTranscriptionSession;
+  events: RealtimeTranscriptionSessionEvent[];
+  closed: boolean;
+};
+
+type SessionStartParams = {
+  provider?: string;
+  providerConfig?: RealtimeTranscriptionProviderConfig;
+  format: AudioFormat;
+  sampleRate: number;
+  channels: number;
+};
+
+type ManagerDeps = {
+  loadConfig: () => OpenClawConfig;
+  listProviders: (cfg?: OpenClawConfig) => RealtimeTranscriptionProviderPlugin[];
+  getProvider: (
+    providerId: string | undefined,
+    cfg?: OpenClawConfig,
+  ) => RealtimeTranscriptionProviderPlugin | undefined;
+  now: () => number;
+  createId: () => string;
+};
+
+const defaultDeps: ManagerDeps = {
+  loadConfig,
+  listProviders: listRealtimeTranscriptionProviders,
+  getProvider: getRealtimeTranscriptionProvider,
+  now: () => Date.now(),
+  createId: () => randomUUID(),
+};
+
+function normalizeAudioFormat(raw: string | undefined): AudioFormat | null {
+  const value = raw?.trim().toLowerCase();
+  if (!value) {
+    return null;
+  }
+  if (value === "s16le" || value === "pcm16" || value === "g711_ulaw") {
+    return value;
+  }
+  return null;
+}
+
+function validateSessionShape(params: {
+  format: AudioFormat;
+  sampleRate: number;
+  channels: number;
+}) {
+  if (!Number.isFinite(params.sampleRate) || params.sampleRate <= 0) {
+    throw new Error("sampleRate must be a positive number.");
+  }
+  if (!Number.isFinite(params.channels) || params.channels <= 0) {
+    throw new Error("channels must be a positive number.");
+  }
+  if (params.channels !== 1) {
+    throw new Error("realtime transcription currently requires mono audio (channels=1).");
+  }
+  if (params.format === "g711_ulaw" && params.sampleRate !== 8000) {
+    throw new Error("g711_ulaw realtime transcription requires sampleRate=8000.");
+  }
+}
+
+function sortProviders(providers: RealtimeTranscriptionProviderPlugin[]) {
+  return providers.toSorted((left, right) => {
+    const leftOrder = left.autoSelectOrder ?? Number.MAX_SAFE_INTEGER;
+    const rightOrder = right.autoSelectOrder ?? Number.MAX_SAFE_INTEGER;
+    if (leftOrder !== rightOrder) {
+      return leftOrder - rightOrder;
+    }
+    return left.id.localeCompare(right.id);
+  });
+}
+
+function buildProviderConfig(params: {
+  provider: RealtimeTranscriptionProviderPlugin;
+  cfg: OpenClawConfig;
+  providerConfig?: RealtimeTranscriptionProviderConfig;
+  format: AudioFormat;
+}): RealtimeTranscriptionProviderConfig {
+  const rawConfig = {
+    ...params.providerConfig,
+    ...(params.format === "s16le" || params.format === "pcm16"
+      ? { inputAudioFormat: "pcm16" }
+      : params.format === "g711_ulaw"
+        ? { inputAudioFormat: "g711_ulaw" }
+        : {}),
+  };
+  return params.provider.resolveConfig?.({ cfg: params.cfg, rawConfig }) ?? rawConfig;
+}
+
+export class RealtimeTranscriptionSessionManager {
+  private readonly sessions = new Map<string, ManagedSession>();
+
+  constructor(private readonly deps: ManagerDeps = defaultDeps) {}
+
+  async startSession(params: SessionStartParams) {
+    validateSessionShape({
+      format: params.format,
+      sampleRate: params.sampleRate,
+      channels: params.channels,
+    });
+    const cfg = this.deps.loadConfig();
+    const provider = this.resolveProvider(params.provider, cfg, params);
+    const providerConfig = buildProviderConfig({
+      provider,
+      cfg,
+      providerConfig: params.providerConfig,
+      format: params.format,
+    });
+    const sessionId = this.deps.createId();
+    const events: RealtimeTranscriptionSessionEvent[] = [];
+    const queueEvent = (event: RealtimeTranscriptionSessionEvent) => {
+      events.push(event);
+    };
+    const session = provider.createSession({
+      providerConfig,
+      onPartial: (partial) => {
+        if (partial.trim()) {
+          queueEvent({ type: "partial", text: partial, timestamp: this.deps.now() });
+        }
+      },
+      onTranscript: (transcript) => {
+        if (transcript.trim()) {
+          queueEvent({ type: "final", text: transcript, timestamp: this.deps.now() });
+        }
+      },
+      onError: (error) => {
+        queueEvent({
+          type: "error",
+          message: error.message || String(error),
+          timestamp: this.deps.now(),
+        });
+      },
+    });
+    await session.connect();
+    queueEvent({
+      type: "session.started",
+      provider: provider.id,
+      transport: "gateway",
+      timestamp: this.deps.now(),
+    });
+    this.sessions.set(sessionId, {
+      id: sessionId,
+      provider: provider.id,
+      format: params.format,
+      sampleRate: params.sampleRate,
+      channels: params.channels,
+      session,
+      events,
+      closed: false,
+    });
+    return {
+      sessionId,
+      provider: provider.id,
+      format: params.format,
+      sampleRate: params.sampleRate,
+      channels: params.channels,
+    };
+  }
+
+  pushAudio(params: { sessionId: string; audio: Buffer }) {
+    const managed = this.getOpenSession(params.sessionId);
+    managed.session.sendAudio(params.audio);
+    return {
+      sessionId: managed.id,
+      acceptedBytes: params.audio.byteLength,
+      connected: managed.session.isConnected(),
+    };
+  }
+
+  pullEvents(params: { sessionId: string; limit?: number }) {
+    const managed = this.getSession(params.sessionId);
+    const requested = params.limit ?? (managed.events.length || 100);
+    const count = Math.max(1, Math.floor(requested));
+    const events = managed.events.splice(0, count);
+    return {
+      sessionId: managed.id,
+      provider: managed.provider,
+      connected: managed.session.isConnected(),
+      closed: managed.closed,
+      events,
+    };
+  }
+
+  finishSession(params: { sessionId: string; reason?: string }) {
+    const managed = this.getSession(params.sessionId);
+    if (!managed.closed) {
+      managed.closed = true;
+      managed.session.close();
+      managed.events.push({
+        type: "session.ended",
+        reason: params.reason?.trim() || "client_finish",
+        timestamp: this.deps.now(),
+      });
+    }
+    const events = managed.events.splice(0, managed.events.length);
+    this.sessions.delete(params.sessionId);
+    return {
+      sessionId: managed.id,
+      provider: managed.provider,
+      closed: true,
+      events,
+    };
+  }
+
+  private resolveProvider(
+    providerId: string | undefined,
+    cfg: OpenClawConfig,
+    params: SessionStartParams,
+  ): RealtimeTranscriptionProviderPlugin {
+    if (providerId?.trim()) {
+      const provider = this.deps.getProvider(providerId, cfg);
+      if (!provider) {
+        throw new Error(`Unknown realtime transcription provider: ${providerId}`);
+      }
+      const providerConfig = buildProviderConfig({
+        provider,
+        cfg,
+        providerConfig: params.providerConfig,
+        format: params.format,
+      });
+      if (!provider.isConfigured({ cfg, providerConfig })) {
+        throw new Error(`Realtime transcription provider "${provider.id}" is not configured.`);
+      }
+      return provider;
+    }
+
+    const provider = sortProviders(this.deps.listProviders(cfg)).find((candidate) => {
+      const providerConfig = buildProviderConfig({
+        provider: candidate,
+        cfg,
+        providerConfig: params.providerConfig,
+        format: params.format,
+      });
+      return candidate.isConfigured({ cfg, providerConfig });
+    });
+    if (!provider) {
+      throw new Error("No configured realtime transcription provider is available.");
+    }
+    return provider;
+  }
+
+  private getSession(sessionId: string): ManagedSession {
+    const managed = this.sessions.get(sessionId);
+    if (!managed) {
+      throw new Error(`Unknown realtime transcription session: ${sessionId}`);
+    }
+    return managed;
+  }
+
+  private getOpenSession(sessionId: string): ManagedSession {
+    const managed = this.getSession(sessionId);
+    if (managed.closed) {
+      throw new Error(`Realtime transcription session is already closed: ${sessionId}`);
+    }
+    return managed;
+  }
+}
+
+const sharedManager = new RealtimeTranscriptionSessionManager();
+
+export function getRealtimeTranscriptionSessionManager() {
+  return sharedManager;
+}
+
+export const __testing = {
+  normalizeAudioFormat,
+};
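
A note on `pullEvents` above: it drains the per-session queue rather than peeking at it, and the `limit` handling has a couple of edge cases worth spelling out. The sketch below isolates that logic in a hypothetical `drain` helper (not part of the PR) that uses the same expressions as the method body.

```typescript
// Mirrors the limit handling in pullEvents: default to the whole queue
// (or 100 when it is empty), floor fractional limits, and never take fewer
// than one event per call.
function drain<T>(queue: T[], limit?: number): T[] {
  const requested = limit ?? (queue.length || 100);
  const count = Math.max(1, Math.floor(requested));
  // splice removes the returned events from the queue in place.
  return queue.splice(0, count);
}

const queue = ["session.started", "partial", "final", "session.ended"];

const firstTwo = drain(queue, 2.9); // fractional limit is floored to 2
const nextOne = drain(queue, 0); // a limit of 0 still drains one event
const rest = drain(queue); // no limit drains everything that remains
const empty = drain(queue); // an empty queue yields an empty batch

console.log({ firstTwo, nextOne, rest, empty });
```

Because events are removed once returned, two clients polling the same `sessionId` would each see only part of the stream. Whether that matters depends on how the `realtimeTranscription.pull` handler is expected to be used, which this part of the diff does not show.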
--- a/src/gateway/server-broadcast.ts
+++ b/src/gateway/server-broadcast.ts
@@ -21,6 +21,7 @@ const EVENT_SCOPE_GUARDS: Record<string, string[]> = {
  "sessions.changed": [READ_SCOPE],
  "session.message": [READ_SCOPE],
  "session.tool": [READ_SCOPE],
+  "chat.voice.event": [READ_SCOPE],
 };

 export type GatewayBroadcastStateVersion = {
--- a/src/gateway/server-chat.ts
+++ b/src/gateway/server-chat.ts
@@ -5,6 +5,7 @@ import { loadConfig } from "../config/config.js";
 import { type AgentEventPayload, getAgentRunContext } from "../infra/agent-events.js";
 import { resolveHeartbeatVisibility } from "../infra/heartbeat-visibility.js";
 import { stripInlineDirectiveTagsForDisplay } from "../utils/directive-tags.js";
+import { getChatVoiceSessionByRunId, setChatVoiceRunId } from "./chat-voice-sessions.js";
 import { loadGatewaySessionRow } from "./server-chat.load-gateway-session-row.runtime.js";
 import { persistGatewaySessionLifecycleEvent } from "./server-chat.persist-session-lifecycle.runtime.js";
 import { deriveGatewaySessionLifecycleSnapshot } from "./session-lifecycle-state.js";
@@ -948,6 +949,72 @@ export function createAgentEventHandler({
      }
      if (!isAborted && evt.stream === "assistant" && typeof evt.data?.text === "string") {
        emitChatDelta(sessionKey, clientRunId, evt.runId, evt.seq, evt.data.text, evt.data.delta);
+      } else if (!isAborted && (lifecyclePhase === "end" || lifecyclePhase === "error")) {
+        const evtStopReason =
+          typeof evt.data?.stopReason === "string" ? evt.data.stopReason : undefined;
+        if (chatLink) {
+          const finished = chatRunState.registry.shift(evt.runId);
+          if (!finished) {
+            clearAgentRunContext(evt.runId);
+            return;
+          }
+          emitChatFinal(
+            finished.sessionKey,
+            finished.clientRunId,
+            evt.runId,
+            evt.seq,
+            lifecyclePhase === "error" ? "error" : "done",
+            evt.data?.error,
+            evtStopReason,
+          );
+        } else {
+          emitChatFinal(
+            sessionKey,
+            eventRunId,
+            evt.runId,
+            evt.seq,
+            lifecyclePhase === "error" ? "error" : "done",
+            evt.data?.error,
+            evtStopReason,
+          );
+        }
+        const voiceSession = getChatVoiceSessionByRunId(clientRunId);
+        if (voiceSession) {
+          setChatVoiceRunId(voiceSession.sessionKey, null);
+          broadcastToConnIds(
+            "chat.voice.event",
+            {
+              sessionKey: voiceSession.sessionKey,
+              state: "assistant_completed",
+              runId: clientRunId,
+              playbackEnabled: voiceSession.playbackEnabled,
+            },
+            new Set([voiceSession.connId]),
+          );
+        }
+      } else if (isAborted && (lifecyclePhase === "end" || lifecyclePhase === "error")) {
+        chatRunState.abortedRuns.delete(clientRunId);
+        chatRunState.abortedRuns.delete(evt.runId);
+        chatRunState.buffers.delete(clientRunId);
+        chatRunState.deltaSentAt.delete(clientRunId);
+        if (chatLink) {
+          chatRunState.registry.remove(evt.runId, clientRunId, sessionKey);
+        }
+        const voiceSession = getChatVoiceSessionByRunId(clientRunId);
+        if (voiceSession) {
+          setChatVoiceRunId(voiceSession.sessionKey, null);
+          broadcastToConnIds(
+            "chat.voice.event",
+            {
+              sessionKey: voiceSession.sessionKey,
+              state: "interrupted",
+              runId: clientRunId,
+              playbackEnabled: voiceSession.playbackEnabled,
+            },
+            new Set([voiceSession.connId]),
+            { dropIfSlow: true },
+          );
+        }
      }
    }

--- a/src/gateway/server-methods-list.ts
+++ b/src/gateway/server-methods-list.ts
@@ -17,6 +17,10 @@ const BASE_METHODS = [
  "tts.disable",
  "tts.convert",
  "tts.setProvider",
+  "realtimeTranscription.start",
+  "realtimeTranscription.pushAudio",
+  "realtimeTranscription.pull",
+  "realtimeTranscription.finish",
  "config.get",
  "config.set",
  "config.apply",
@@ -118,6 +122,11 @@ const BASE_METHODS = [
  "chat.history",
  "chat.abort",
  "chat.send",
+  "chat.voice.start",
+  "chat.voice.audio",
+  "chat.voice.commit",
+  "chat.voice.interrupt",
+  "chat.voice.stop",
 ];

 export function listGatewayMethods(): string[] {
@@ -129,6 +138,7 @@ export const GATEWAY_EVENTS = [
  "connect.challenge",
  "agent",
  "chat",
+  "chat.voice.event",
  "session.message",
  "session.tool",
  "sessions.changed",
--- a/src/gateway/server-methods.ts
+++ b/src/gateway/server-methods.ts
@@ -20,6 +20,7 @@ import { modelsHandlers } from "./server-methods/models.js";
 import { nodePendingHandlers } from "./server-methods/nodes-pending.js";
 import { nodeHandlers } from "./server-methods/nodes.js";
 import { pushHandlers } from "./server-methods/push.js";
+import { realtimeTranscriptionHandlers } from "./server-methods/realtime-transcription.js";
 import { sendHandlers } from "./server-methods/send.js";
 import { sessionsHandlers } from "./server-methods/sessions.js";
 import { skillsHandlers } from "./server-methods/skills.js";
@@ -84,6 +85,7 @@ export const coreGatewayHandlers: GatewayRequestHandlers = {
  ...toolsCatalogHandlers,
  ...toolsEffectiveHandlers,
  ...ttsHandlers,
+  ...realtimeTranscriptionHandlers,
  ...skillsHandlers,
  ...sessionsHandlers,
  ...systemHandlers,
--- a/src/gateway/server-methods/chat.ts
+++ b/src/gateway/server-methods/chat.ts
@@ -1,3 +1,4 @@
+import { randomUUID } from "node:crypto";
 import fs from "node:fs";
 import path from "node:path";
 import { CURRENT_SESSION_VERSION, SessionManager } from "@mariozechner/pi-coding-agent";
@@ -19,6 +20,8 @@ import { jsonUtf8Bytes } from "../../infra/json-utf8-bytes.js";
 import type { PromptImageOrderEntry } from "../../media/prompt-image-order.js";
 import { type SavedMedia, saveMediaBuffer } from "../../media/store.js";
 import { createChannelReplyPipeline } from "../../plugin-sdk/channel-reply-pipeline.js";
+import { getRealtimeTranscriptionProvider } from "../../plugin-sdk/realtime-transcription.js";
+import type { RealtimeTranscriptionSession } from "../../realtime-transcription/provider-types.js";
 import { normalizeInputProvenance, type InputProvenance } from "../../sessions/input-provenance.js";
 import { resolveSendPolicy } from "../../sessions/send-policy.js";
 import { parseAgentSessionKey } from "../../sessions/session-key-utils.js";
@@ -48,6 +51,13 @@ import {
  parseMessageWithAttachments,
 } from "../chat-attachments.js";
 import { stripEnvelopeFromMessage, stripEnvelopeFromMessages } from "../chat-sanitize.js";
+import {
+  deleteChatVoiceSession,
+  getChatVoiceSession,
+  setChatVoiceRunId,
+  setChatVoiceSession,
+  type ChatVoiceEventPayload,
+} from "../chat-voice-sessions.js";
 import { augmentChatHistoryWithCliSessionImports } from "../cli-session-history.js";
 import { ADMIN_SCOPE } from "../method-scopes.js";
 import {
@@ -57,6 +67,11 @@ import {
  hasGatewayClientCap,
 } from "../protocol/client-info.js";
 import {
+  validateChatVoiceAudioParams,
+  validateChatVoiceCommitParams,
+  validateChatVoiceInterruptParams,
+  validateChatVoiceStartParams,
+  validateChatVoiceStopParams,
  ErrorCodes,
  errorShape,
  formatValidationErrors,
@@ -1011,6 +1026,88 @@ function normalizeOptionalText(value?: string | null): string | undefined {
  return trimmed || undefined;
 }

+function getActiveChatVoiceCallbackSession(params: {
+  sessionKey: string;
+  connId: string;
+  sttSession: RealtimeTranscriptionSession;
+}) {
+  const active = getChatVoiceSession(params.sessionKey);
+  if (!active || active.connId !== params.connId || active.sttSession !== params.sttSession) {
+    return undefined;
+  }
+  return active;
+}
+
+function isStrictBase64(value: string): boolean {
+  const normalized = value.replace(/\s+/g, "");
+  if (!normalized || normalized.length % 4 !== 0) {
+    return false;
+  }
+  if (!/^[A-Za-z0-9+/]+={0,2}$/.test(normalized)) {
+    return false;
+  }
+  const decoded = Buffer.from(normalized, "base64");
+  return decoded.length > 0 && decoded.toString("base64") === normalized;
+}
+
+function parseStrictBase64AudioBuffer(value: unknown): Buffer {
+  const audio = typeof value === "string" ? value.trim() : "";
+  if (!audio) {
+    throw new Error("audio is required.");
+  }
+  if (!isStrictBase64(audio)) {
+    throw new Error("audio must be base64 encoded.");
+  }
+  return Buffer.from(audio, "base64");
+}
+
+function resolveControlUiVoiceConfig(cfg: ReturnType<typeof loadSessionEntry>["cfg"]) {
+  return cfg.gateway?.controlUi?.voice;
+}
+
+function emitChatVoiceEvent(
+  context: GatewayRequestContext,
+  connId: string,
+  payload: ChatVoiceEventPayload,
+) {
+  context.broadcastToConnIds("chat.voice.event", payload, new Set([connId]));
+}
+
+async function closeChatVoiceSession(params: {
+  context: GatewayRequestContext;
+  sessionKey: string;
+  connId: string;
+  emitClosed?: boolean;
+  errorMessage?: string;
+}) {
+  const entry = deleteChatVoiceSession(params.sessionKey);
+  if (!entry) {
+    return;
+  }
+  try {
+    entry.sttSession.close();
+  } catch (err) {
+    params.context.logGateway.debug(
+      `chat.voice session close cleanup failed: ${formatForLog(err)}`,
+    );
+  }
+  if (params.errorMessage) {
+    emitChatVoiceEvent(params.context, params.connId, {
+      sessionKey: params.sessionKey,
+      state: "error",
+      errorMessage: params.errorMessage,
+      playbackEnabled: entry.playbackEnabled,
+    });
+  }
+  if (params.emitClosed !== false) {
+    emitChatVoiceEvent(params.context, params.connId, {
+      sessionKey: params.sessionKey,
+      state: "closed",
+      playbackEnabled: entry.playbackEnabled,
+    });
+  }
+}
+
 function normalizeExplicitChatSendOrigin(
  params: ChatSendExplicitOrigin,
 ): { ok: true; value?: ChatSendExplicitOrigin } | { ok: false; error: string } {
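
To illustrate what the `isStrictBase64` helper added in this hunk accepts and rejects, here is a standalone copy with a few sample inputs. The function body is taken from the diff; the sample strings are made up for this example. The point of the round-trip comparison is that Node's `Buffer.from(value, "base64")` is lenient and will decode malformed or non-canonical input without complaint.

```typescript
import { Buffer } from "node:buffer";

// Copy of isStrictBase64 from this diff, reproduced so the cases below can run
// on their own.
function isStrictBase64(value: string): boolean {
  const normalized = value.replace(/\s+/g, "");
  if (!normalized || normalized.length % 4 !== 0) {
    return false;
  }
  if (!/^[A-Za-z0-9+/]+={0,2}$/.test(normalized)) {
    return false;
  }
  const decoded = Buffer.from(normalized, "base64");
  return decoded.length > 0 && decoded.toString("base64") === normalized;
}

const samples: Array<[string, string]> = [
  ["aGVsbG8=", "canonical encoding of 'hello'"],
  ["aGVs\nbG8=", "same payload with embedded whitespace"],
  ["aGVsbG8", "missing padding"],
  ["aGVsbG9=", "non-canonical trailing bits"],
  ["aGk-", "URL-safe alphabet"],
  ["====", "padding only"],
];
for (const [input, label] of samples) {
  console.log(label, JSON.stringify(input), isStrictBase64(input));
}
```

Whitespace is stripped before the check, so line-wrapped base64 is accepted, while unpadded and URL-safe variants are rejected. Clients that produce those variants would need to re-encode before calling `chat.voice.audio`.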
@@ -1954,6 +2051,425 @@ export const chatHandlers: GatewayRequestHandlers = {
      });
    }
  },
+  "chat.voice.start": async ({ params, respond, context, client }) => {
+    if (!validateChatVoiceStartParams(params)) {
+      respond(
+        false,
+        undefined,
+        errorShape(
+          ErrorCodes.INVALID_REQUEST,
+          `invalid chat.voice.start params: ${formatValidationErrors(validateChatVoiceStartParams.errors)}`,
+        ),
+      );
+      return;
+    }
+    const connId = normalizeOptionalText(client?.connId);
+    if (!connId) {
+      respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, "voice requires connId"));
+      return;
+    }
+
+    const { sessionKey: rawSessionKey } = params as { sessionKey: string };
+    const { cfg, canonicalKey: sessionKey } = loadSessionEntry(rawSessionKey);
+    const voiceConfig = resolveControlUiVoiceConfig(cfg);
+    if (voiceConfig?.enabled !== true) {
+      respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, "web voice is disabled"));
+      return;
+    }
+
+    const providerId = normalizeOptionalText(voiceConfig.transcriptionProvider);
+    if (!providerId) {
+      respond(
+        false,
+        undefined,
+        errorShape(ErrorCodes.INVALID_REQUEST, "voice transcription provider is not configured"),
+      );
+      return;
+    }
+
+    const provider = getRealtimeTranscriptionProvider(providerId, cfg);
+    if (!provider) {
+      respond(
+        false,
+        undefined,
+        errorShape(
+          ErrorCodes.INVALID_REQUEST,
+          `voice transcription provider not found: ${providerId}`,
+        ),
+      );
+      return;
+    }
+
+    // Seed the transcription config from the model-provider entry with the
+    // same id; the voice-specific overrides and the fixed pcm16 input format
+    // are layered on top of it below.
+    const modelProviderConfig = cfg.models?.providers?.[provider.id];
+    const providerConfig = {
+      providers: {
+        [provider.id]: {
+          ...modelProviderConfig,
+          ...voiceConfig.providers?.[provider.id],
+          inputAudioFormat: "pcm16",
+        },
+      },
+    };
+    if (!provider.isConfigured({ cfg, providerConfig })) {
+      respond(
+        false,
+        undefined,
+        errorShape(
+          ErrorCodes.INVALID_REQUEST,
+          `voice transcription provider is not configured: ${provider.id}`,
+        ),
+      );
+      return;
+    }
+
+    const existing = getChatVoiceSession(sessionKey);
+    if (existing?.connId === connId) {
+      await closeChatVoiceSession({
+        context,
+        sessionKey,
+        connId,
+        emitClosed: false,
+      });
+    }
+
+    const playbackEnabled = voiceConfig.playbackEnabled !== false;
+    try {
+      let sttSession: RealtimeTranscriptionSession;
+      sttSession = provider.createSession({
+        providerConfig,
+        onSpeechStart: () => {
+          const active = getActiveChatVoiceCallbackSession({ sessionKey, connId, sttSession });
+          if (!active) {
+            return;
+          }
+          active.transcriptPartial = "";
+          emitChatVoiceEvent(context, connId, {
+            sessionKey,
+            state: "speech_start",
+            playbackEnabled: active.playbackEnabled,
+          });
+        },
+        onPartial: (partial) => {
+          const active = getActiveChatVoiceCallbackSession({ sessionKey, connId, sttSession });
+          if (!active) {
+            return;
+          }
+          active.transcriptPartial = partial;
+          emitChatVoiceEvent(context, connId, {
+            sessionKey,
+            state: "partial_transcript",
+            transcript: partial,
+            playbackEnabled: active.playbackEnabled,
+          });
+        },
+        onTranscript: (transcript) => {
+          const active = getActiveChatVoiceCallbackSession({ sessionKey, connId, sttSession });
+          if (!active) {
+            return;
+          }
+          active.transcriptFinal = transcript;
+          active.transcriptPartial = "";
+          emitChatVoiceEvent(context, connId, {
+            sessionKey,
+            state: "final_transcript",
+            transcript,
+            playbackEnabled: active.playbackEnabled,
+          });
+        },
+        onError: (error) => {
+          const active = getActiveChatVoiceCallbackSession({ sessionKey, connId, sttSession });
+          if (!active) {
+            return;
+          }
+          void closeChatVoiceSession({
+            context,
+            sessionKey,
+            connId,
+            errorMessage: error.message || String(error),
+          });
+        },
+      });
+      await sttSession.connect();
+      setChatVoiceSession({
+        sessionKey,
+        connId,
+        providerId: provider.id,
+        playbackEnabled,
+        sttSession,
+        transcriptPartial: "",
+        transcriptFinal: "",
+        activeRunId: null,
+      });
+      respond(true, {
+        ok: true,
+        providerId: provider.id,
+        playbackEnabled,
+      });
+      emitChatVoiceEvent(context, connId, {
+        sessionKey,
+        state: "ready",
+        playbackEnabled,
+      });
+    } catch (err) {
+      respond(false, undefined, errorShape(ErrorCodes.UNAVAILABLE, String(err)));
+      context.logGateway.warn(`chat.voice.start failed: ${formatForLog(err)}`);
+    }
+  },
+  "chat.voice.audio": ({ params, respond, client }) => {
+    if (!validateChatVoiceAudioParams(params)) {
+      respond(
+        false,
+        undefined,
+        errorShape(
+          ErrorCodes.INVALID_REQUEST,
+          `invalid chat.voice.audio params: ${formatValidationErrors(validateChatVoiceAudioParams.errors)}`,
+        ),
+      );
+      return;
+    }
+    const connId = normalizeOptionalText(client?.connId);
+    if (!connId) {
+      respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, "voice requires connId"));
+      return;
+    }
+    const {
+      sessionKey: rawSessionKey,
+      audio,
+      format,
+    } = params as {
+      sessionKey: string;
+      audio: string;
+      format?: string;
+    };
+    const { canonicalKey: sessionKey } = loadSessionEntry(rawSessionKey);
+    const entry = getChatVoiceSession(sessionKey);
+    if (!entry || entry.connId !== connId) {
+      respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, "voice session not found"));
+      return;
+    }
+    if (format && format.toLowerCase() !== "pcm16") {
+      respond(
+        false,
+        undefined,
+        errorShape(ErrorCodes.INVALID_REQUEST, `unsupported voice audio format: ${format}`),
+      );
+      return;
+    }
+    let audioBuffer: Buffer;
+    try {
+      audioBuffer = parseStrictBase64AudioBuffer(audio);
+    } catch (err) {
+      respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, String(err)));
+      return;
+    }
+    try {
+      entry.sttSession.sendAudio(audioBuffer);
+      respond(true, { ok: true });
+    } catch (err) {
+      respond(false, undefined, errorShape(ErrorCodes.UNAVAILABLE, String(err)));
+    }
+  },
+  "chat.voice.commit": async ({ params, req, respond, context, client }) => {
+    if (!validateChatVoiceCommitParams(params)) {
+      respond(
+        false,
+        undefined,
+        errorShape(
+          ErrorCodes.INVALID_REQUEST,
+          `invalid chat.voice.commit params: ${formatValidationErrors(validateChatVoiceCommitParams.errors)}`,
+        ),
+      );
+      return;
+    }
+    const connId = normalizeOptionalText(client?.connId);
+    if (!connId) {
+      respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, "voice requires connId"));
+      return;
+    }
+    const { sessionKey: rawSessionKey, transcript: transcriptOverride } = params as {
+      sessionKey: string;
+      transcript?: string;
+    };
+    const { canonicalKey: sessionKey } = loadSessionEntry(rawSessionKey);
+    const entry = getChatVoiceSession(sessionKey);
+    if (!entry || entry.connId !== connId) {
+      respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, "voice session not found"));
+      return;
+    }
+    if (entry.activeRunId) {
+      respond(true, { ok: false, status: "in_flight", runId: entry.activeRunId });
+      return;
+    }
+    const transcript = (
+      transcriptOverride ??
+      // `||`, not `??`: transcriptFinal starts as "" and must fall through to the partial.
+      (entry.transcriptFinal || entry.transcriptPartial)
+    ).trim();
+    if (!transcript) {
+      respond(
+        false,
+        undefined,
+        errorShape(ErrorCodes.INVALID_REQUEST, "voice transcript is empty"),
+      );
+      return;
+    }
+
+    const runId = randomUUID();
+    const voiceSendResult = await new Promise<{
+      ok: boolean;
+      payload?: unknown;
+      error?: ReturnType<typeof errorShape>;
+    }>((resolve) => {
+      void chatHandlers["chat.send"]({
+        req,
+        params: {
+          sessionKey,
+          message: transcript,
+          deliver: false,
+          idempotencyKey: runId,
+        },
+        client,
+        isWebchatConnect: () => false,
+        context,
+        respond: (ok, payload, error) => resolve({ ok, payload, error }),
+      });
+    });
+    if (!voiceSendResult.ok) {
+      respond(false, voiceSendResult.payload, voiceSendResult.error);
+      return;
+    }
+    entry.transcriptFinal = "";
+    entry.transcriptPartial = "";
+    setChatVoiceRunId(sessionKey, runId);
+    emitChatVoiceEvent(context, connId, {
+      sessionKey,
+      state: "assistant_started",
+      runId,
+      playbackEnabled: entry.playbackEnabled,
+    });
+    respond(true, {
+      ok: true,
+      runId,
+      transcript,
+      playbackEnabled: entry.playbackEnabled,
+      result: voiceSendResult.payload,
+    });
+  },
+  "chat.voice.interrupt": ({ params, req, respond, context, client }) => {
+    if (!validateChatVoiceInterruptParams(params)) {
+      respond(
+        false,
+        undefined,
+        errorShape(
+          ErrorCodes.INVALID_REQUEST,
+          `invalid chat.voice.interrupt params: ${formatValidationErrors(validateChatVoiceInterruptParams.errors)}`,
+        ),
+      );
+      return;
+    }
+    const connId = normalizeOptionalText(client?.connId);
+    if (!connId) {
+      respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, "voice requires connId"));
+      return;
+    }
+    const { sessionKey: rawSessionKey } = params as { sessionKey: string };
+    const { canonicalKey: sessionKey } = loadSessionEntry(rawSessionKey);
+    const entry = getChatVoiceSession(sessionKey);
+    if (!entry || entry.connId !== connId) {
+      respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, "voice session not found"));
+      return;
+    }
+
+    emitChatVoiceEvent(context, connId, {
+      sessionKey,
+      state: "playback_clear",
+      playbackEnabled: entry.playbackEnabled,
+    });
+
+    const runId = entry.activeRunId;
+    if (!runId) {
+      emitChatVoiceEvent(context, connId, {
+        sessionKey,
+        state: "interrupted",
+        playbackEnabled: entry.playbackEnabled,
+      });
+      respond(true, { ok: true, aborted: false });
+      return;
+    }
+
+    void chatHandlers["chat.abort"]({
+      req,
+      params: {
+        sessionKey,
+        runId,
+      },
+      client,
+      isWebchatConnect: () => false,
+      context,
+      respond: () => undefined,
+    });
+    setChatVoiceRunId(sessionKey, null);
+    emitChatVoiceEvent(context, connId, {
+      sessionKey,
+      state: "interrupted",
+      runId,
+      playbackEnabled: entry.playbackEnabled,
+    });
+    respond(true, { ok: true, aborted: true, runId });
+  },
+  "chat.voice.stop": async ({ params, req, respond, context, client }) => {
+    if (!validateChatVoiceStopParams(params)) {
+      respond(
+        false,
+        undefined,
+        errorShape(
+          ErrorCodes.INVALID_REQUEST,
+          `invalid chat.voice.stop params: ${formatValidationErrors(validateChatVoiceStopParams.errors)}`,
+        ),
+      );
+      return;
+    }
+    const connId = normalizeOptionalText(client?.connId);
+    if (!connId) {
+      respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, "voice requires connId"));
+      return;
+    }
+    const { sessionKey: rawSessionKey } = params as { sessionKey: string };
+    const { canonicalKey: sessionKey } = loadSessionEntry(rawSessionKey);
+    const entry = getChatVoiceSession(sessionKey);
+    if (!entry || entry.connId !== connId) {
+      respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, "voice session not found"));
+      return;
+    }
+    emitChatVoiceEvent(context, connId, {
+      sessionKey,
+      state: "playback_clear",
+      playbackEnabled: entry.playbackEnabled,
+    });
+    if (entry.activeRunId) {
+      void chatHandlers["chat.abort"]({
+        req,
+        params: {
+          sessionKey,
+          runId: entry.activeRunId,
+        },
+        client,
+        isWebchatConnect: () => false,
+        context,
+        respond: () => undefined,
+      });
+    }
+    setChatVoiceRunId(sessionKey, null);
+    await closeChatVoiceSession({
+      context,
+      sessionKey,
+      connId,
+    });
+    respond(true, { ok: true });
+  },
  "chat.inject": async ({ params, respond, context }) => {
    if (!validateChatInjectParams(params)) {
      respond(
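Note on the stale-callback guard in `chat.voice.start`: each STT callback looks up the active voice session and returns early when the session object it was created for is no longer the registered one. A minimal sketch of that identity check, assuming a hypothetical in-memory registry (`registry`, `getActiveEntry`, `startSession`, and `FakeSession` are illustrative names, not the gateway's actual helpers):

```typescript
// Hypothetical stand-ins for illustration only; not the gateway's real types.
type FakeSession = { id: number };
type Entry = { connId: string; session: FakeSession; lastError: string | null };

const registry = new Map<string, Entry>();

// Returns the registered entry only if it still belongs to this connection
// and still wraps the exact session object the callback was created for.
function getActiveEntry(sessionKey: string, connId: string, session: FakeSession): Entry | null {
  const entry = registry.get(sessionKey);
  if (!entry || entry.connId !== connId || entry.session !== session) {
    return null;
  }
  return entry;
}

// Registers a new session (replacing any previous one) and returns its onError callback.
function startSession(sessionKey: string, connId: string, id: number): (message: string) => void {
  const session: FakeSession = { id };
  registry.set(sessionKey, { connId, session, lastError: null });
  return (message) => {
    const active = getActiveEntry(sessionKey, connId, session);
    if (!active) {
      return; // stale callback from a replaced session: ignore it
    }
    active.lastError = message;
  };
}

const firstOnError = startSession("main", "conn-1", 1);
const secondOnError = startSession("main", "conn-1", 2); // replaces the first session

firstOnError("late"); // ignored, because session 1 is no longer registered
const afterStale = registry.get("main")?.lastError ?? null;
secondOnError("boom"); // applied, because session 2 is the registered one
const afterLive = registry.get("main")?.lastError ?? null;
```

Comparing the session object by identity, rather than only the session key and connection id, is what lets a restart on the same connection discard late events from the session it replaced.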
--- a/src/gateway/server-methods/chat.voice.test.ts
+++ b/src/gateway/server-methods/chat.voice.test.ts
@@ -0,0 +1,202 @@
+import { afterEach, describe, expect, it, vi } from "vitest";
+import {
+  deleteChatVoiceSession,
+  getChatVoiceSession,
+  setChatVoiceSession,
+} from "../chat-voice-sessions.js";
+import { ErrorCodes } from "../protocol/index.js";
+
+const mockState = vi.hoisted(() => ({
+  cfg: {
+    gateway: {
+      controlUi: {
+        voice: {
+          enabled: true,
+          transcriptionProvider: "mock-stt",
+          playbackEnabled: true,
+        },
+      },
+    },
+    models: {
+      providers: {
+        "mock-stt": {},
+      },
+    },
+  } as Record<string, unknown>,
+  provider: null as {
+    id: string;
+    isConfigured: ReturnType<typeof vi.fn>;
+    createSession: ReturnType<typeof vi.fn>;
+  } | null,
+}));
+
+vi.mock("../session-utils.js", async () => {
+  const original =
+    await vi.importActual<typeof import("../session-utils.js")>("../session-utils.js");
+  return {
+    ...original,
+    loadSessionEntry: (rawKey: string) => ({
+      cfg: mockState.cfg,
+      storePath: "/tmp/sessions.json",
+      entry: {
+        sessionId: "sess-voice-1",
+        sessionFile: "/tmp/sess-voice-1.jsonl",
+      },
+      canonicalKey: rawKey || "main",
+    }),
+  };
+});
+
+vi.mock("../../plugin-sdk/realtime-transcription.js", () => ({
+  getRealtimeTranscriptionProvider: vi.fn(() => mockState.provider),
+}));
+
+const { chatHandlers } = await import("./chat.js");
+
+function createContext() {
+  return {
+    broadcastToConnIds: vi.fn(),
+    logGateway: {
+      warn: vi.fn(),
+      debug: vi.fn(),
+    },
+  };
+}
+
+function createClient(connId = "conn-1") {
+  return { connId } as const;
+}
+
+afterEach(() => {
+  vi.restoreAllMocks();
+  deleteChatVoiceSession("main");
+  mockState.provider = null;
+});
+
+describe("chat voice handlers", () => {
+  it("ignores stale onError callbacks from replaced voice sessions", async () => {
+    const callbacks: Array<{
+      onError?: (error: Error) => void;
+    }> = [];
+    const sessions = [
+      {
+        connect: vi.fn(async () => undefined),
+        sendAudio: vi.fn(),
+        close: vi.fn(),
+        isConnected: vi.fn(() => true),
+      },
+      {
+        connect: vi.fn(async () => undefined),
+        sendAudio: vi.fn(),
+        close: vi.fn(),
+        isConnected: vi.fn(() => true),
+      },
+    ];
+    mockState.provider = {
+      id: "mock-stt",
+      isConfigured: vi.fn(() => true),
+      createSession: vi.fn((params) => {
+        callbacks.push(params);
+        return sessions[callbacks.length - 1];
+      }),
+    };
+    const context = createContext();
+    const respond = vi.fn();
+
+    await chatHandlers["chat.voice.start"]({
+      params: { sessionKey: "main" },
+      respond,
+      context: context as never,
+      client: createClient(),
+    } as never);
+    await chatHandlers["chat.voice.start"]({
+      params: { sessionKey: "main" },
+      respond,
+      context: context as never,
+      client: createClient(),
+    } as never);
+
+    expect(getChatVoiceSession("main")?.sttSession).toBe(sessions[1]);
+
+    callbacks[0].onError?.(new Error("late"));
+
+    expect(getChatVoiceSession("main")?.sttSession).toBe(sessions[1]);
+  });
+
+  it("rejects malformed base64 audio before forwarding to the session", async () => {
+    const sendAudio = vi.fn();
+    setChatVoiceSession({
+      sessionKey: "main",
+      connId: "conn-1",
+      providerId: "mock-stt",
+      playbackEnabled: true,
+      sttSession: {
+        connect: vi.fn(async () => undefined),
+        sendAudio,
+        close: vi.fn(),
+        isConnected: vi.fn(() => true),
+      },
+      transcriptPartial: "",
+      transcriptFinal: "",
+      activeRunId: null,
+    });
+    const respond = vi.fn();
+
+    await chatHandlers["chat.voice.audio"]({
+      params: { sessionKey: "main", audio: "not@base64", format: "pcm16" },
+      respond,
+      client: createClient(),
+    } as never);
+
+    expect(sendAudio).not.toHaveBeenCalled();
+    expect(respond).toHaveBeenCalledWith(
+      false,
+      undefined,
+      expect.objectContaining({
+        code: ErrorCodes.INVALID_REQUEST,
+        message: expect.stringContaining("base64"),
+      }),
+    );
+  });
+
+  it("preserves buffered transcript when commit send fails", async () => {
+    const sttSession = {
+      connect: vi.fn(async () => undefined),
+      sendAudio: vi.fn(),
+      close: vi.fn(),
+      isConnected: vi.fn(() => true),
+    };
+    setChatVoiceSession({
+      sessionKey: "main",
+      connId: "conn-1",
+      providerId: "mock-stt",
+      playbackEnabled: true,
+      sttSession,
+      transcriptPartial: "draft tail",
+      transcriptFinal: "hello from voice",
+      activeRunId: null,
+    });
+    vi.spyOn(chatHandlers, "chat.send").mockImplementation(async ({ respond }) => {
+      respond(false, undefined, { code: ErrorCodes.UNAVAILABLE, message: "send failed" } as never);
+    });
+    const respond = vi.fn();
+
+    await chatHandlers["chat.voice.commit"]({
+      params: { sessionKey: "main" },
+      req: {} as never,
+      respond,
+      context: createContext() as never,
+      client: createClient(),
+    } as never);
+
+    expect(getChatVoiceSession("main")).toMatchObject({
+      transcriptFinal: "hello from voice",
+      transcriptPartial: "draft tail",
+    });
+    expect(respond).toHaveBeenCalledWith(
+      false,
+      undefined,
+      expect.objectContaining({ code: ErrorCodes.UNAVAILABLE }),
+    );
+  });
+});
--- a/src/gateway/server-methods/realtime-transcription.test.ts
+++ b/src/gateway/server-methods/realtime-transcription.test.ts
@@ -0,0 +1,140 @@
+import { beforeEach, describe, expect, it, vi } from "vitest";
+
+const mocks = vi.hoisted(() => ({
+  manager: {
+    startSession: vi.fn(),
+    pushAudio: vi.fn(),
+    pullEvents: vi.fn(),
+    finishSession: vi.fn(),
+  },
+}));
+
+vi.mock("../realtime-transcription-session-manager.js", () => ({
+  getRealtimeTranscriptionSessionManager: () => mocks.manager,
+  __testing: {
+    normalizeAudioFormat: (value: string | undefined) =>
+      value === "s16le" || value === "pcm16" || value === "g711_ulaw" ? value : null,
+  },
+}));
+
+import { realtimeTranscriptionHandlers } from "./realtime-transcription.js";
+
+describe("realtimeTranscriptionHandlers", () => {
+  beforeEach(() => {
+    mocks.manager.startSession.mockReset();
+    mocks.manager.pushAudio.mockReset();
+    mocks.manager.pullEvents.mockReset();
+    mocks.manager.finishSession.mockReset();
+  });
+
+  it("starts a session with validated audio metadata", async () => {
+    mocks.manager.startSession.mockResolvedValue({ sessionId: "s1", provider: "openai" });
+    const respond = vi.fn();
+
+    await realtimeTranscriptionHandlers["realtimeTranscription.start"]({
+      req: { method: "realtimeTranscription.start", id: "1" } as never,
+      params: { format: "s16le", sampleRate: 16000, channels: 1 },
+      client: null,
+      isWebchatConnect: () => false,
+      respond,
+      context: {} as never,
+    });
+
+    expect(mocks.manager.startSession).toHaveBeenCalledWith({
+      provider: undefined,
+      providerConfig: undefined,
+      format: "s16le",
+      sampleRate: 16000,
+      channels: 1,
+    });
+    expect(respond).toHaveBeenCalledWith(true, { sessionId: "s1", provider: "openai" });
+  });
+
+  it("rejects invalid start formats", async () => {
+    const respond = vi.fn();
+
+    await realtimeTranscriptionHandlers["realtimeTranscription.start"]({
+      req: { method: "realtimeTranscription.start", id: "1" } as never,
+      params: { format: "wav", sampleRate: 16000, channels: 1 },
+      client: null,
+      isWebchatConnect: () => false,
+      respond,
+      context: {} as never,
+    });
+
+    expect(mocks.manager.startSession).not.toHaveBeenCalled();
+    expect(respond.mock.calls[0]?.[0]).toBe(false);
+  });
+
+  it("pushes audio chunks to an existing session", async () => {
+    mocks.manager.pushAudio.mockReturnValue({ sessionId: "s1", acceptedBytes: 4, connected: true });
+    const respond = vi.fn();
+
+    await realtimeTranscriptionHandlers["realtimeTranscription.pushAudio"]({
+      req: { method: "realtimeTranscription.pushAudio", id: "2" } as never,
+      params: { sessionId: "s1", audio: Buffer.from("test").toString("base64") },
+      client: null,
+      isWebchatConnect: () => false,
+      respond,
+      context: {} as never,
+    });
+
+    expect(mocks.manager.pushAudio).toHaveBeenCalledWith({
+      sessionId: "s1",
+      audio: expect.any(Buffer),
+    });
+    expect(respond).toHaveBeenCalledWith(
+      true,
+      expect.objectContaining({ sessionId: "s1", acceptedBytes: 4 }),
+    );
+  });
+
+  it("rejects malformed base64 audio payloads before forwarding to the manager", async () => {
+    const respond = vi.fn();
+
+    await realtimeTranscriptionHandlers["realtimeTranscription.pushAudio"]({
+      req: { method: "realtimeTranscription.pushAudio", id: "2b" } as never,
+      params: { sessionId: "s1", audio: "%%%not-base64%%%" },
+      client: null,
+      isWebchatConnect: () => false,
+      respond,
+      context: {} as never,
+    });
+
+    expect(mocks.manager.pushAudio).not.toHaveBeenCalled();
+    expect(respond.mock.calls[0]?.[0]).toBe(false);
+    expect(JSON.stringify(respond.mock.calls[0]?.[2] ?? {})).toContain("audio must be base64 encoded");
+  });
+
+  it("returns final events from finish and lets the manager clean up immediately", async () => {
+    mocks.manager.finishSession.mockReturnValue({
+      sessionId: "s1",
+      provider: "openai",
+      closed: true,
+      events: [{ type: "session.ended", reason: "client_finish", timestamp: 123 }],
+    });
+    const respond = vi.fn();
+
+    await realtimeTranscriptionHandlers["realtimeTranscription.finish"]({
+      req: { method: "realtimeTranscription.finish", id: "3" } as never,
+      params: { sessionId: "s1" },
+      client: null,
+      isWebchatConnect: () => false,
+      respond,
+      context: {} as never,
+    });
+
+    expect(mocks.manager.finishSession).toHaveBeenCalledWith({
+      sessionId: "s1",
+      reason: undefined,
+    });
+    expect(respond).toHaveBeenCalledWith(
+      true,
+      expect.objectContaining({
+        sessionId: "s1",
+        closed: true,
+        events: [{ type: "session.ended", reason: "client_finish", timestamp: 123 }],
+      }),
+    );
+  });
+});
--- a/src/gateway/server-methods/realtime-transcription.ts
+++ b/src/gateway/server-methods/realtime-transcription.ts
@@ -0,0 +1,118 @@
+import { ErrorCodes, errorShape } from "../protocol/index.js";
+import {
+  getRealtimeTranscriptionSessionManager,
+  __testing as managerTesting,
+} from "../realtime-transcription-session-manager.js";
+import { formatForLog } from "../ws-log.js";
+import type { GatewayRequestHandlers } from "./types.js";
+
+function parsePositiveNumber(value: unknown, name: string): number {
+  const number =
+    typeof value === "number"
+      ? value
+      : typeof value === "string" && value.trim()
+        ? Number(value)
+        : Number.NaN;
+  if (!Number.isFinite(number) || number <= 0) {
+    throw new Error(`${name} must be a positive number.`);
+  }
+  return number;
+}
+
+function parseSessionId(value: unknown): string {
+  const sessionId = typeof value === "string" ? value.trim() : "";
+  if (!sessionId) {
+    throw new Error("sessionId is required.");
+  }
+  return sessionId;
+}
+
+function parseAudioBuffer(value: unknown): Buffer {
+  const audio = typeof value === "string" ? value.trim() : "";
+  if (!audio) {
+    throw new Error("audio is required.");
+  }
+  if (!isStrictBase64(audio)) {
+    throw new Error("audio must be base64 encoded.");
+  }
+  return Buffer.from(audio, "base64");
+}
+
+function isStrictBase64(value: string): boolean {
+  const normalized = value.replace(/\s+/g, "");
+  if (!normalized || normalized.length % 4 !== 0) {
+    return false;
+  }
+  if (!/^[A-Za-z0-9+/]+={0,2}$/.test(normalized)) {
+    return false;
+  }
+  const decoded = Buffer.from(normalized, "base64");
+  return decoded.length > 0 && decoded.toString("base64") === normalized;
+}
+
+export const realtimeTranscriptionHandlers: GatewayRequestHandlers = {
+  "realtimeTranscription.start": async ({ params, respond }) => {
+    try {
+      const format = managerTesting.normalizeAudioFormat(
+        typeof params.format === "string" ? params.format : undefined,
+      );
+      if (!format) {
+        respond(
+          false,
+          undefined,
+          errorShape(
+            ErrorCodes.INVALID_REQUEST,
+            "format is required and must be one of: s16le, pcm16, g711_ulaw",
+          ),
+        );
+        return;
+      }
+      const result = await getRealtimeTranscriptionSessionManager().startSession({
+        provider: typeof params.provider === "string" ? params.provider.trim() : undefined,
+        providerConfig:
+          params.providerConfig && typeof params.providerConfig === "object"
+            ? (params.providerConfig as Record<string, unknown>)
+            : undefined,
+        format,
+        sampleRate: parsePositiveNumber(params.sampleRate, "sampleRate"),
+        channels: parsePositiveNumber(params.channels, "channels"),
+      });
+      respond(true, result);
+    } catch (err) {
+      respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, formatForLog(err)));
+    }
+  },
+  "realtimeTranscription.pushAudio": async ({ params, respond }) => {
+    try {
+      const result = getRealtimeTranscriptionSessionManager().pushAudio({
+        sessionId: parseSessionId(params.sessionId),
+        audio: parseAudioBuffer(params.audio),
+      });
+      respond(true, result);
+    } catch (err) {
+      respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, formatForLog(err)));
+    }
+  },
+  "realtimeTranscription.pull": async ({ params, respond }) => {
+    try {
+      const result = getRealtimeTranscriptionSessionManager().pullEvents({
+        sessionId: parseSessionId(params.sessionId),
+        limit: params.limit === undefined ? undefined : parsePositiveNumber(params.limit, "limit"),
+      });
+      respond(true, result);
+    } catch (err) {
+      respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, formatForLog(err)));
+    }
+  },
+  "realtimeTranscription.finish": async ({ params, respond }) => {
+    try {
+      const result = getRealtimeTranscriptionSessionManager().finishSession({
+        sessionId: parseSessionId(params.sessionId),
+        reason: typeof params.reason === "string" ? params.reason : undefined,
+      });
+      respond(true, result);
+    } catch (err) {
+      respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, formatForLog(err)));
+    }
+  },
+};
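Note on the strict base64 check: Node's `Buffer.from(value, "base64")` does not throw on malformed input, so the handler validates the string shape and then confirms that re-encoding the decoded bytes reproduces the input. The following standalone sketch restates that check outside the module (it assumes a Node.js runtime for `Buffer`; the `checks` object is illustrative only):

```typescript
// Standalone restatement of the strict check for illustration; assumes Node.js `Buffer`.
function isStrictBase64(value: string): boolean {
  const normalized = value.replace(/\s+/g, "");
  // Padded base64 always has a length that is a multiple of 4.
  if (!normalized || normalized.length % 4 !== 0) {
    return false;
  }
  // Only the standard alphabet, with at most two trailing "=" characters.
  if (!/^[A-Za-z0-9+/]+={0,2}$/.test(normalized)) {
    return false;
  }
  // Round trip: rejects inputs whose unused trailing bits are not zero.
  const decoded = Buffer.from(normalized, "base64");
  return decoded.length > 0 && decoded.toString("base64") === normalized;
}

const checks = {
  valid: isStrictBase64(Buffer.from("test").toString("base64")),
  wrapped: isStrictBase64("dGVz\ndA=="), // whitespace is stripped before checking
  garbage: isStrictBase64("%%%not-base64%%%"),
  unpadded: isStrictBase64("dGVzdA"),
  nonCanonical: isStrictBase64("dGVzdB=="), // decodes, but does not re-encode to itself
};
```

The round-trip comparison is the strictest part: it rejects strings that pass the alphabet test yet are not the canonical encoding of any byte sequence.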
--- a/src/gateway/server-methods/tts.ts
+++ b/src/gateway/server-methods/tts.ts
@@ -9,6 +9,7 @@ import {
  getTtsProvider,
  isTtsEnabled,
  isTtsProviderConfigured,
+  resolveExplicitTtsOverrides,
  resolveTtsAutoMode,
  resolveTtsConfig,
  resolveTtsPrefsPath,
@@ -89,7 +90,22 @@ export const ttsHandlers: GatewayRequestHandlers = {
    try {
      const cfg = loadConfig();
      const channel = typeof params.channel === "string" ? params.channel.trim() : undefined;
-      const result = await textToSpeech({ text, cfg, channel });
+      const providerRaw = typeof params.provider === "string" ? params.provider.trim() : undefined;
+      const modelId = typeof params.modelId === "string" ? params.modelId.trim() : undefined;
+      const voiceId = typeof params.voiceId === "string" ? params.voiceId.trim() : undefined;
+      const overrides = resolveExplicitTtsOverrides({
+        cfg,
+        provider: providerRaw,
+        modelId,
+        voiceId,
+      });
+      const result = await textToSpeech({
+        text,
+        cfg,
+        channel,
+        overrides,
+        disableFallback: Boolean(overrides.provider || modelId || voiceId),
+      });
      if (result.success && result.audioPath) {
        respond(true, {
          audioPath: result.audioPath,
--- a/src/gateway/server/ws-connection.ts
+++ b/src/gateway/server/ws-connection.ts
@@ -8,6 +8,7 @@ import { truncateUtf16Safe } from "../../utils.js";
 import { isWebchatClient } from "../../utils/message-channel.js";
 import type { AuthRateLimiter } from "../auth-rate-limit.js";
 import type { ResolvedGatewayAuth } from "../auth.js";
+import { closeChatVoiceSessionsForConn } from "../chat-voice-sessions.js";
 import { getPreauthHandshakeTimeoutMsFromEnv } from "../handshake-timeouts.js";
 import { isLoopbackAddress } from "../net.js";
 import type { GatewayRequestContext, GatewayRequestHandlers } from "../server-methods/types.js";
@@ -270,6 +271,9 @@ export function attachGatewayWsConnectionHandler(params: AttachGatewayWsConnecti
      }
      const context = buildRequestContext();
      context.unsubscribeAllSessionEvents(connId);
+      closeChatVoiceSessionsForConn(connId, (targetConnId, payload) => {
+        context.broadcastToConnIds("chat.voice.event", payload, new Set([targetConnId]));
+      });
      if (client?.connect?.role === "node") {
        const nodeId = context.nodeRegistry.unregister(connId);
        if (nodeId) {
--- a/src/media-understanding/runner.auto-audio.test.ts
+++ b/src/media-understanding/runner.auto-audio.test.ts
@@ -121,6 +121,43 @@ describe("runCapability auto audio entries", () => {
    expect(seenModel).toBe("whisper-1");
  });

+  it("lets per-request transcription hints override configured model-entry hints", async () => {
+    let seenLanguage: string | undefined;
+    let seenPrompt: string | undefined;
+    const result = await runAutoAudioCase({
+      transcribeAudio: async (req) => {
+        seenLanguage = req.language;
+        seenPrompt = req.prompt;
+        return { text: "ok", model: req.model ?? "unknown" };
+      },
+      cfgExtra: {
+        tools: {
+          media: {
+            audio: {
+              enabled: true,
+              prompt: "configured prompt",
+              language: "fr",
+              _requestPromptOverride: "Focus on names",
+              _requestLanguageOverride: "en",
+              models: [
+                {
+                  provider: "openai",
+                  model: "whisper-1",
+                  prompt: "entry prompt",
+                  language: "de",
+                },
+              ],
+            },
+          },
+        },
+      } as Partial<OpenClawConfig>,
+    });
+
+    expect(result.outputs[0]?.text).toBe("ok");
+    expect(seenLanguage).toBe("en");
+    expect(seenPrompt).toBe("Focus on names");
+  });
+
  it("uses mistral when only mistral key is configured", async () => {
    const isolatedAgentDir = await fs.mkdtemp(path.join(os.tmpdir(), "openclaw-audio-agent-"));
    let runResult: Awaited<ReturnType<typeof runCapability>> | undefined;
--- a/src/media-understanding/runner.cli-audio.test.ts
+++ b/src/media-understanding/runner.cli-audio.test.ts
@@ -0,0 +1,67 @@
+import { afterEach, beforeAll, beforeEach, describe, expect, it, vi } from "vitest";
+import type { OpenClawConfig } from "../config/config.js";
+import { withAudioFixture } from "./runner.test-utils.js";
+
+const runExecMock = vi.hoisted(() => vi.fn());
+
+vi.mock("../process/exec.js", () => ({
+  runExec: (...args: unknown[]) => runExecMock(...args),
+}));
+
+let runCliEntry: typeof import("./runner.entries.js").runCliEntry;
+
+describe("media-understanding CLI audio entry", () => {
+  beforeAll(async () => {
+    ({ runCliEntry } = await import("./runner.entries.js"));
+  });
+
+  beforeEach(() => {
+    runExecMock.mockReset().mockResolvedValue({ stdout: "cli transcript" });
+  });
+
+  afterEach(() => {
+    vi.clearAllMocks();
+  });
+
+  it("applies per-request prompt and language overrides to CLI transcription templating", async () => {
+    await withAudioFixture("openclaw-cli-audio", async ({ ctx, cache }) => {
+      await runCliEntry({
+        capability: "audio",
+        entry: {
+          type: "cli",
+          command: "mock-transcriber",
+          args: ["--prompt", "{{Prompt}}", "--language", "{{Language}}", "--file", "{{MediaPath}}"],
+          prompt: "entry prompt",
+          language: "de",
+        },
+        cfg: {
+          tools: {
+            media: {
+              audio: {
+                prompt: "configured prompt",
+                language: "fr",
+                _requestPromptOverride: "Focus on names",
+                _requestLanguageOverride: "en",
+              },
+            },
+          },
+        } as OpenClawConfig,
+        ctx,
+        attachmentIndex: 0,
+        cache,
+        config: {
+          prompt: "configured prompt",
+          language: "fr",
+          _requestPromptOverride: "Focus on names",
+          _requestLanguageOverride: "en",
+        } as never,
+      });
+    });
+
+    expect(runExecMock).toHaveBeenCalledWith(
+      "mock-transcriber",
+      expect.arrayContaining(["--prompt", "Focus on names", "--language", "en"]),
+      expect.any(Object),
+    );
+  });
+});
--- a/src/media-understanding/runner.entries.ts
+++ b/src/media-understanding/runner.entries.ts
@@ -372,6 +372,20 @@ function resolveEntryRunOptions(params: {
  return { maxBytes, maxChars, timeoutMs, prompt };
 }

+function resolveAudioRequestOverrides(config: MediaUnderstandingConfig | undefined): {
+  prompt?: string;
+  language?: string;
+} {
+  const overrides = (config ?? {}) as MediaUnderstandingConfig & {
+    _requestPromptOverride?: string;
+    _requestLanguageOverride?: string;
+  };
+  return {
+    prompt: overrides._requestPromptOverride,
+    language: overrides._requestLanguageOverride,
+  };
+}
+
 async function resolveProviderExecutionAuth(params: {
  providerId: string;
  cfg: OpenClawConfig;
@@ -530,6 +544,7 @@ export async function runProviderEntry(params: {
      throw new Error(`Audio transcription provider "${providerId}" not available.`);
    }
    const transcribeAudio = provider.transcribeAudio;
+    const requestOverrides = resolveAudioRequestOverrides(params.config);
    const media = await params.cache.getBuffer({
      attachmentIndex: params.attachmentIndex,
      maxBytes,
@@ -569,8 +584,12 @@ export async function runProviderEntry(params: {
          headers,
          request,
          model,
-          language: entry.language ?? params.config?.language ?? cfg.tools?.media?.audio?.language,
-          prompt,
+          language:
+            requestOverrides.language ??
+            entry.language ??
+            params.config?.language ??
+            cfg.tools?.media?.audio?.language,
+          prompt: requestOverrides.prompt ?? prompt,
          query: providerQuery,
          timeoutMs,
          fetchFn,
@@ -651,6 +670,7 @@ export async function runCliEntry(params: {
  if (!command) {
    throw new Error(`CLI entry missing command for ${capability}`);
  }
+  const requestOverrides = resolveAudioRequestOverrides(params.config);
  const { maxBytes, maxChars, timeoutMs, prompt } = resolveEntryRunOptions({
    capability,
    entry,
@@ -683,7 +703,8 @@ export async function runCliEntry(params: {
    MediaDir: path.dirname(mediaPath),
    OutputDir: outputDir,
    OutputBase: outputBase,
-    Prompt: prompt,
+    Prompt: requestOverrides.prompt ?? prompt,
+    ...(requestOverrides.language ? { Language: requestOverrides.language } : {}),
    MaxChars: maxChars,
  };
  const argv = [command, ...args].map((part, index) =>
--- a/src/media-understanding/runtime.ts
+++ b/src/media-understanding/runtime.ts
@@ -150,7 +150,28 @@ export async function transcribeAudioFile(params: {
  agentDir?: string;
  mime?: string;
  activeModel?: ActiveMediaModel;
+  language?: string;
+  prompt?: string;
 }): Promise<{ text: string | undefined }> {
-  const result = await runMediaUnderstandingFile({ ...params, capability: "audio" });
+  const cfg =
+    params.language || params.prompt
+      ? {
+          ...params.cfg,
+          tools: {
+            ...params.cfg.tools,
+            media: {
+              ...params.cfg.tools?.media,
+              audio: {
+                ...params.cfg.tools?.media?.audio,
+                ...(params.language ? { _requestLanguageOverride: params.language } : {}),
+                ...(params.prompt ? { _requestPromptOverride: params.prompt } : {}),
+                ...(params.language ? { language: params.language } : {}),
+                ...(params.prompt ? { prompt: params.prompt } : {}),
+              },
+            },
+          },
+        }
+      : params.cfg;
+  const result = await runMediaUnderstandingFile({ ...params, cfg, capability: "audio" });
  return { text: result.text };
 }
--- a/src/tts/status-config.test.ts
+++ b/src/tts/status-config.test.ts
@@ -1,6 +1,6 @@
 import fs from "node:fs";
 import path from "node:path";
-import { describe, expect, it } from "vitest";
+import { describe, expect, it, vi } from "vitest";
 import { withTempHome } from "../../test/helpers/temp-home.js";
 import type { OpenClawConfig } from "../config/config.js";
 import { resolveStatusTtsSnapshot } from "./status-config.js";
@@ -61,4 +61,44 @@ describe("resolveStatusTtsSnapshot", () => {
      });
    });
  });
+
+  it("derives the default prefs path from OPENCLAW_CONFIG_PATH when set", async () => {
+    await withTempHome(
+      async (home) => {
+        const stateDir = path.join(home, ".openclaw-dev");
+        const prefsPath = path.join(stateDir, "settings", "tts.json");
+        fs.mkdirSync(path.dirname(prefsPath), { recursive: true });
+        fs.writeFileSync(
+          prefsPath,
+          JSON.stringify({
+            tts: {
+              auto: "always",
+              provider: "openai",
+            },
+          }),
+        );
+
+        vi.stubEnv("OPENCLAW_CONFIG_PATH", path.join(stateDir, "openclaw.json"));
+        try {
+          expect(
+            resolveStatusTtsSnapshot({
+              cfg: {
+                messages: {
+                  tts: {},
+                },
+              } as OpenClawConfig,
+            }),
+          ).toEqual({
+            autoMode: "always",
+            provider: "openai",
+            maxLength: 1500,
+            summarize: true,
+          });
+        } finally {
+          vi.unstubAllEnvs();
+        }
+      },
+      { env: { OPENCLAW_STATE_DIR: undefined } },
+    );
+  });
 });
--- a/src/tts/status-config.ts
+++ b/src/tts/status-config.ts
@@ -2,7 +2,7 @@ import fs from "node:fs";
 import path from "node:path";
 import type { OpenClawConfig } from "../config/config.js";
 import type { TtsAutoMode, TtsConfig, TtsProvider } from "../config/types.tts.js";
-import { CONFIG_DIR, resolveUserPath } from "../utils.js";
+import { resolveConfigDir, resolveUserPath } from "../utils.js";
 import { normalizeTtsAutoMode } from "./tts-auto-mode.js";

 const DEFAULT_TTS_MAX_LENGTH = 1500;
@@ -47,7 +47,7 @@ function resolveTtsPrefsPathValue(prefsPath: string | undefined): string {
  if (envPath) {
    return resolveUserPath(envPath);
  }
-  return path.join(CONFIG_DIR, "settings", "tts.json");
+  return path.join(resolveConfigDir(process.env), "settings", "tts.json");
 }

 function readPrefs(prefsPath: string): TtsUserPrefs {
--- a/src/tts/tts.ts
+++ b/src/tts/tts.ts
@@ -10,6 +10,7 @@ export {
  isTtsProviderConfigured,
  listSpeechVoices,
  maybeApplyTtsToPayload,
+  resolveExplicitTtsOverrides,
  resolveTtsAutoMode,
  resolveTtsConfig,
  resolveTtsPrefsPath,
--- a/src/utils.test.ts
+++ b/src/utils.test.ts
@@ -50,6 +50,15 @@ describe("resolveConfigDir", () => {

    expect(resolveConfigDir(env)).toBe(path.resolve("/tmp/openclaw-home", "state"));
  });
+
+  it("falls back to the config file directory when only OPENCLAW_CONFIG_PATH is set", () => {
+    const env = {
+      HOME: "/tmp/openclaw-home",
+      OPENCLAW_CONFIG_PATH: "~/profiles/dev/openclaw.json",
+    } as NodeJS.ProcessEnv;
+
+    expect(resolveConfigDir(env)).toBe(path.resolve("/tmp/openclaw-home", "profiles", "dev"));
+  });
 });

 describe("resolveHomeDir", () => {
--- a/src/utils.ts
+++ b/src/utils.ts
@@ -141,6 +141,10 @@ export function resolveConfigDir(
  if (override) {
    return resolveUserPath(override, env, homedir);
  }
+  const configPath = env.OPENCLAW_CONFIG_PATH?.trim();
+  if (configPath) {
+    return path.dirname(resolveUserPath(configPath, env, homedir));
+  }
  const newDir = path.join(resolveRequiredHomeDir(env, homedir), ".openclaw");
  try {
    const hasNew = fs.existsSync(newDir);
--- a/src/web-fetch/runtime.ts
+++ b/src/web-fetch/runtime.ts
@@ -64,6 +64,16 @@ function hasEntryCredential(
  });
 }

+export function isWebFetchProviderConfigured(params: {
+  provider: Pick<
+    PluginWebFetchProviderEntry,
+    "envVars" | "getConfiguredCredentialValue" | "getCredentialValue" | "requiresCredential"
+  >;
+  config?: OpenClawConfig;
+}): boolean {
+  return hasEntryCredential(params.provider, params.config, resolveFetchConfig(params.config));
+}
+
 export function listWebFetchProviders(params?: {
  config?: OpenClawConfig;
 }): PluginWebFetchProviderEntry[] {
--- a/src/web-search/runtime.test.ts
+++ b/src/web-search/runtime.test.ts
@@ -289,4 +289,162 @@ describe("web search runtime", () => {
      result: { query: "runtime", provider: "beta", runtimeSelectedProvider: "beta" },
    });
  });
+
+  it("falls back to another provider when auto-selected search execution fails", async () => {
+    resolveRuntimeWebSearchProvidersMock.mockReturnValue([
+      createProvider({
+        pluginId: "google",
+        id: "google",
+        credentialPath: "tools.web.search.google.apiKey",
+        autoDetectOrder: 1,
+        getCredentialValue: () => "configured",
+        createTool: () => ({
+          description: "google",
+          parameters: {},
+          execute: async () => {
+            throw new Error("google aborted");
+          },
+        }),
+      }),
+      createProvider({
+        pluginId: "duckduckgo",
+        id: "duckduckgo",
+        credentialPath: "",
+        autoDetectOrder: 100,
+        requiresCredential: false,
+        createTool: () => ({
+          description: "duckduckgo",
+          parameters: {},
+          execute: async (args) => ({ ...args, provider: "duckduckgo" }),
+        }),
+      }),
+    ]);
+
+    await expect(
+      runWebSearch({
+        config: {},
+        args: { query: "fallback" },
+      }),
+    ).resolves.toEqual({
+      provider: "duckduckgo",
+      result: { query: "fallback", provider: "duckduckgo" },
+    });
+  });
+
+  it("does not prebuild fallback provider tools before attempting the selected provider", async () => {
+    resolveRuntimeWebSearchProvidersMock.mockReturnValue([
+      createProvider({
+        pluginId: "google",
+        id: "google",
+        credentialPath: "tools.web.search.google.apiKey",
+        autoDetectOrder: 1,
+        getCredentialValue: () => "configured",
+        createTool: () => ({
+          description: "google",
+          parameters: {},
+          execute: async (args) => ({ ...args, provider: "google" }),
+        }),
+      }),
+      createProvider({
+        pluginId: "broken-fallback",
+        id: "broken-fallback",
+        credentialPath: "",
+        autoDetectOrder: 100,
+        requiresCredential: false,
+        createTool: () => {
+          throw new Error("fallback createTool exploded");
+        },
+      }),
+    ]);
+
+    await expect(
+      runWebSearch({
+        config: {},
+        args: { query: "selected-first" },
+      }),
+    ).resolves.toEqual({
+      provider: "google",
+      result: { query: "selected-first", provider: "google" },
+    });
+  });
+
+  it("does not fall back when the provider came from explicit config selection", async () => {
+    resolveRuntimeWebSearchProvidersMock.mockReturnValue([
+      createProvider({
+        pluginId: "google",
+        id: "google",
+        credentialPath: "tools.web.search.google.apiKey",
+        autoDetectOrder: 1,
+        getCredentialValue: () => "configured",
+        createTool: () => ({
+          description: "google",
+          parameters: {},
+          execute: async () => {
+            throw new Error("google aborted");
+          },
+        }),
+      }),
+      createProvider({
+        pluginId: "duckduckgo",
+        id: "duckduckgo",
+        credentialPath: "",
+        autoDetectOrder: 100,
+        requiresCredential: false,
+        createTool: () => ({
+          description: "duckduckgo",
+          parameters: {},
+          execute: async (args) => ({ ...args, provider: "duckduckgo" }),
+        }),
+      }),
+    ]);
+
+    await expect(
+      runWebSearch({
+        config: {
+          tools: {
+            web: {
+              search: {
+                provider: "google",
+              },
+            },
+          },
+        },
+        args: { query: "configured" },
+      }),
+    ).rejects.toThrow("google aborted");
+  });
+
+  it("does not fall back when the caller explicitly selects a provider", async () => {
+    resolveRuntimeWebSearchProvidersMock.mockReturnValue([
+      createProvider({
+        pluginId: "google",
+        id: "google",
+        credentialPath: "tools.web.search.google.apiKey",
+        autoDetectOrder: 1,
+        getCredentialValue: () => "configured",
+        createTool: () => ({
+          description: "google",
+          parameters: {},
+          execute: async () => {
+            throw new Error("google aborted");
+          },
+        }),
+      }),
+      createProvider({
+        pluginId: "duckduckgo",
+        id: "duckduckgo",
+        credentialPath: "",
+        autoDetectOrder: 100,
+        requiresCredential: false,
+      }),
+    ]);
+
+    await expect(
+      runWebSearch({
+        config: {},
+        providerId: "google",
+        args: { query: "explicit" },
+      }),
+    ).rejects.toThrow("google aborted");
+  });
 });
--- a/src/web-search/runtime.ts
+++ b/src/web-search/runtime.ts
@@ -78,6 +78,21 @@ function hasEntryCredential(
  });
 }

+export function isWebSearchProviderConfigured(params: {
+  provider: Pick<
+    PluginWebSearchProviderEntry,
+    | "credentialPath"
+    | "id"
+    | "envVars"
+    | "getConfiguredCredentialValue"
+    | "getCredentialValue"
+    | "requiresCredential"
+  >;
+  config?: OpenClawConfig;
+}): boolean {
+  return hasEntryCredential(params.provider, params.config, resolveSearchConfig(params.config));
+}
+
 export function listWebSearchProviders(params?: {
  config?: OpenClawConfig;
 }): PluginWebSearchProviderEntry[] {
@@ -197,21 +212,117 @@ export function resolveWebSearchDefinition(
  });
 }

+function resolveWebSearchCandidates(
+  options?: ResolveWebSearchDefinitionParams,
+): PluginWebSearchProviderEntry[] {
+  const search = resolveSearchConfig(options?.config);
+  const runtimeWebSearch = options?.runtimeWebSearch ?? getActiveRuntimeWebToolsMetadata()?.search;
+  if (!resolveWebSearchEnabled({ search, sandboxed: options?.sandboxed })) {
+    return [];
+  }
+
+  const providers = sortWebSearchProvidersForAutoDetect(
+    options?.preferRuntimeProviders
+      ? resolveRuntimeWebSearchProviders({
+          config: options?.config,
+          bundledAllowlistCompat: true,
+        })
+      : resolvePluginWebSearchProviders({
+          config: options?.config,
+          bundledAllowlistCompat: true,
+          origin: "bundled",
+        }),
+  ).filter(Boolean);
+  if (providers.length === 0) {
+    return [];
+  }
+
+  const preferredIds = [
+    options?.providerId,
+    runtimeWebSearch?.selectedProvider,
+    runtimeWebSearch?.providerConfigured,
+    resolveWebSearchProviderId({ config: options?.config, search, providers }),
+  ].filter(
+    (value, index, array): value is string => Boolean(value) && array.indexOf(value) === index,
+  );
+
+  const orderedProviders = [
+    ...preferredIds
+      .map((id) => providers.find((entry) => entry.id === id))
+      .filter((entry): entry is PluginWebSearchProviderEntry => Boolean(entry)),
+    ...providers.filter((entry) => !preferredIds.includes(entry.id)),
+  ];
+  return orderedProviders;
+}
+
+function hasExplicitWebSearchSelection(params: {
+  search?: WebSearchConfig;
+  runtimeWebSearch?: RuntimeWebSearchMetadata;
+  providerId?: string;
+}): boolean {
+  if (params.providerId?.trim()) {
+    return true;
+  }
+  if (
+    params.search &&
+    "provider" in params.search &&
+    typeof params.search.provider === "string" &&
+    params.search.provider.trim()
+  ) {
+    return true;
+  }
+  return params.runtimeWebSearch?.providerSource === "configured";
+}
+
 export async function runWebSearch(
  params: RunWebSearchParams,
 ): Promise<{ provider: string; result: Record<string, unknown> }> {
-  const resolved = resolveWebSearchDefinition({ ...params, preferRuntimeProviders: true });
-  if (!resolved) {
+  const search = resolveSearchConfig(params.config);
+  const runtimeWebSearch = params.runtimeWebSearch ?? getActiveRuntimeWebToolsMetadata()?.search;
+  const candidates = resolveWebSearchCandidates({
+    ...params,
+    runtimeWebSearch,
+    preferRuntimeProviders: true,
+  });
+  if (candidates.length === 0) {
    throw new Error("web_search is disabled or no provider is available.");
  }
-  return {
-    provider: resolved.provider.id,
-    result: await resolved.definition.execute(params.args),
-  };
+  const allowFallback = !hasExplicitWebSearchSelection({
+    search,
+    runtimeWebSearch,
+    providerId: params.providerId,
+  });
+  let lastError: unknown;
+
+  for (const candidate of candidates) {
+    try {
+      const definition = candidate.createTool({
+        config: params.config,
+        searchConfig: search as Record<string, unknown> | undefined,
+        runtimeMetadata: runtimeWebSearch,
+      });
+      if (!definition) {
+        continue;
+      }
+      return {
+        provider: candidate.id,
+        result: await definition.execute(params.args),
+      };
+    } catch (error) {
+      lastError = error;
+      if (!allowFallback) {
+        throw error;
+      }
+    }
+  }
+
+  throw lastError instanceof Error ? lastError : new Error(String(lastError ?? "web_search is disabled or no provider is available."));
 }

 export const __testing = {
  resolveSearchConfig,
  resolveSearchProvider: resolveWebSearchProviderId,
  resolveWebSearchProviderId,
+  resolveWebSearchCandidates,
+  hasExplicitWebSearchSelection,
 };
--- a/ui/src/ui/app-render.ts
+++ b/ui/src/ui/app-render.ts
@@ -752,6 +752,13 @@ export function renderApp(state: AppViewState) {
              onSettingsChange: (next) => state.applySettings(next),
              onPasswordChange: (next) => (state.password = next),
              onSessionKeyChange: (next) => {
+                if (state.client && state.connected && state.chatVoiceActive) {
+                  void state.client
+                    .request("chat.voice.stop", { sessionKey: state.sessionKey })
+                    .catch(() => {
+                      // ignore best-effort voice cleanup errors during navigation
+                    });
+                }
                state.sessionKey = next;
                state.chatMessage = "";
                state.resetToolStream();
@@ -1532,12 +1539,24 @@ export function renderApp(state: AppViewState) {
          ? renderChat({
              sessionKey: state.sessionKey,
              onSessionKeyChange: (next) => {
+                if (state.client && state.connected && state.chatVoiceActive) {
+                  void state.client
+                    .request("chat.voice.stop", { sessionKey: state.sessionKey })
+                    .catch(() => {
+                      // ignore best-effort voice cleanup errors during navigation
+                    });
+                }
                state.sessionKey = next;
                state.chatMessage = "";
                state.chatAttachments = [];
                state.chatStream = null;
                state.chatStreamStartedAt = null;
                state.chatRunId = null;
+                state.chatVoiceActive = false;
+                state.chatVoiceState = "idle";
+                state.chatVoiceTranscript = "";
+                state.chatVoiceRunId = null;
+                state.chatVoiceError = null;
                state.chatQueue = [];
                state.resetToolStream();
                state.resetChatScroll();
@@ -1569,6 +1588,11 @@ export function renderApp(state: AppViewState) {
              canSend: state.connected,
              disabledReason: chatDisabledReason,
              error: state.lastError,
+              voiceActive: state.chatVoiceActive,
+              voiceState: state.chatVoiceState,
+              voiceTranscript: state.chatVoiceTranscript,
+              voiceError: state.chatVoiceError,
+              voicePlaybackEnabled: state.chatVoicePlaybackEnabled,
              sessions: state.sessionsResult,
              focusMode: chatFocus,
              onRefresh: () => {
@@ -1591,6 +1615,69 @@ export function renderApp(state: AppViewState) {
              attachments: state.chatAttachments,
              onAttachmentsChange: (next) => (state.chatAttachments = next),
              onSend: () => state.handleSendChat(),
+              onVoiceStart: async () => {
+                if (!state.client || !state.connected) {
+                  return false;
+                }
+                state.chatVoiceActive = false;
+                state.chatVoiceState = "connecting";
+                state.chatVoiceTranscript = "";
+                state.chatVoiceRunId = null;
+                state.chatVoiceError = null;
+                try {
+                  const res = (await state.client.request("chat.voice.start", {
+                    sessionKey: state.sessionKey,
+                  })) as { playbackEnabled?: boolean } | undefined;
+                  state.chatVoiceActive = true;
+                  state.chatVoiceState = "listening";
+                  state.chatVoicePlaybackEnabled = res?.playbackEnabled !== false;
+                  return true;
+                } catch (error) {
+                  state.chatVoiceActive = false;
+                  state.chatVoiceState = "error";
+                  state.chatVoiceError = String(error);
+                  return false;
+                }
+              },
+              onVoiceAudioChunk: async (chunkBase64) => {
+                if (!state.client || !state.connected || !state.chatVoiceActive) {
+                  return;
+                }
+                try {
+                  await state.client.request("chat.voice.audio", {
+                    sessionKey: state.sessionKey,
+                    audio: chunkBase64,
+                    format: "pcm16",
+                    sampleRate: 16000,
+                  });
+                } catch (error) {
+                  state.chatVoiceState = "error";
+                  state.chatVoiceError = String(error);
+                }
+              },
+              onVoiceStop: async () => {
+                state.chatVoiceActive = false;
+                state.chatVoiceRunId = null;
+                state.chatVoiceTranscript = "";
+                if (!state.client || !state.connected) {
+                  state.chatVoiceState = "idle";
+                  return;
+                }
+                try {
+                  await state.client.request("chat.voice.stop", { sessionKey: state.sessionKey });
+                } catch (error) {
+                  state.chatVoiceState = "error";
+                  state.chatVoiceError = String(error);
+                }
+              },
+              onVoiceInterrupt: async () => {
+                if (!state.client || !state.connected) {
+                  return;
+                }
+                await state.client.request("chat.voice.interrupt", {
+                  sessionKey: state.sessionKey,
+                });
+              },
              canAbort: Boolean(state.chatRunId),
              onAbort: () => void state.handleAbortChat(),
              onQueueRemove: (id) => state.removeQueuedMessage(id),
--- a/ui/src/ui/app-view-state.ts
+++ b/ui/src/ui/app-view-state.ts
@@ -72,6 +72,19 @@ export type AppViewState = {
  chatStream: string | null;
  chatStreamStartedAt: number | null;
  chatRunId: string | null;
+  chatVoiceActive: boolean;
+  chatVoiceState:
+    | "idle"
+    | "connecting"
+    | "listening"
+    | "processing"
+    | "speaking"
+    | "interrupted"
+    | "error";
+  chatVoiceTranscript: string;
+  chatVoiceRunId: string | null;
+  chatVoicePlaybackEnabled: boolean;
+  chatVoiceError: string | null;
  compactionStatus: CompactionStatus | null;
  fallbackStatus: FallbackStatus | null;
  chatAvatarUrl: string | null;
--- a/ui/src/ui/app.ts
+++ b/ui/src/ui/app.ts
@@ -165,6 +165,19 @@ export class OpenClawApp extends LitElement {
  @state() chatStream: string | null = null;
  @state() chatStreamStartedAt: number | null = null;
  @state() chatRunId: string | null = null;
+  @state() chatVoiceActive = false;
+  @state() chatVoiceState:
+    | "idle"
+    | "connecting"
+    | "listening"
+    | "processing"
+    | "speaking"
+    | "interrupted"
+    | "error" = "idle";
+  @state() chatVoiceTranscript = "";
+  @state() chatVoiceRunId: string | null = null;
+  @state() chatVoicePlaybackEnabled = true;
+  @state() chatVoiceError: string | null = null;
  @state() compactionStatus: CompactionStatus | null = null;
  @state() fallbackStatus: FallbackStatus | null = null;
  @state() chatAvatarUrl: string | null = null;
--- a/ui/src/ui/chat/speech.ts
+++ b/ui/src/ui/chat/speech.ts
@@ -125,6 +125,195 @@ export function isSttActive(): boolean {
  return activeRecognition !== null;
 }

+// ─── Realtime Voice Capture ───
+
+type RealtimeVoiceCallbacks = {
+  onChunk: (chunkBase64: string) => void;
+  onStart?: () => void;
+  onStop?: () => void;
+  onError?: (error: string) => void;
+};
+
+type RealtimeVoiceCapture = {
+  stop: () => void;
+};
+
+const REALTIME_VOICE_TARGET_SAMPLE_RATE = 16_000;
+const REALTIME_VOICE_CHUNK_MS = 250;
+
+let activeRealtimeVoiceCapture: RealtimeVoiceCapture | null = null;
+
+export function isRealtimeVoiceSupported(): boolean {
+  const hasGetUserMedia =
+    typeof navigator !== "undefined" && typeof navigator.mediaDevices?.getUserMedia === "function";
+  return (
+    typeof window !== "undefined" &&
+    Boolean(window.isSecureContext) &&
+    hasGetUserMedia &&
+    typeof AudioContext !== "undefined"
+  );
+}
+
+export async function startRealtimeVoiceCapture(
+  callbacks: RealtimeVoiceCallbacks,
+): Promise<boolean> {
+  if (!isRealtimeVoiceSupported()) {
+    callbacks.onError?.("Realtime voice requires a secure context with microphone access");
+    return false;
+  }
+
+  stopRealtimeVoiceCapture();
+
+  let stream: MediaStream;
+  try {
+    stream = await navigator.mediaDevices.getUserMedia({
+      audio: {
+        channelCount: 1,
+        echoCancellation: true,
+        noiseSuppression: true,
+        autoGainControl: true,
+      },
+    });
+  } catch (error) {
+    callbacks.onError?.(error instanceof Error ? error.message : String(error));
+    return false;
+  }
+
+  const audioContext = new AudioContext();
+  try {
+    if (audioContext.state !== "running") {
+      await audioContext.resume();
+    }
+  } catch (error) {
+    stream.getTracks().forEach((track) => track.stop());
+    callbacks.onError?.(
+      error instanceof Error ? error.message : "Failed to start realtime voice capture",
+    );
+    void audioContext.close();
+    return false;
+  }
+
+  const source = audioContext.createMediaStreamSource(stream);
+  const processor = audioContext.createScriptProcessor(4096, 1, 1);
+  const samplesPerChunk = Math.max(
+    1,
+    Math.round((REALTIME_VOICE_TARGET_SAMPLE_RATE * REALTIME_VOICE_CHUNK_MS) / 1000),
+  );
+  let pcmBuffer = new Int16Array(0);
+  let stopped = false;
+
+  const flushChunk = () => {
+    if (pcmBuffer.length < samplesPerChunk) {
+      return;
+    }
+    const chunk = pcmBuffer.slice(0, samplesPerChunk);
+    pcmBuffer = pcmBuffer.slice(samplesPerChunk);
+    callbacks.onChunk(encodePcm16Chunk(chunk));
+  };
+
+  processor.onaudioprocess = (event) => {
+    if (stopped) {
+      return;
+    }
+    const input = event.inputBuffer.getChannelData(0);
+    const downsampled = downsampleFloat32Buffer(
+      input,
+      audioContext.sampleRate,
+      REALTIME_VOICE_TARGET_SAMPLE_RATE,
+    );
+    if (downsampled.length === 0) {
+      return;
+    }
+    const next = new Int16Array(pcmBuffer.length + downsampled.length);
+    next.set(pcmBuffer, 0);
+    next.set(downsampled, pcmBuffer.length);
+    pcmBuffer = next;
+    flushChunk();
+  };
+
+  source.connect(processor);
+  processor.connect(audioContext.destination);
+
+  const stop = () => {
+    if (stopped) {
+      return;
+    }
+    stopped = true;
+    activeRealtimeVoiceCapture = null;
+    if (pcmBuffer.length > 0) {
+      callbacks.onChunk(encodePcm16Chunk(pcmBuffer));
+      pcmBuffer = new Int16Array(0);
+    }
+    processor.disconnect();
+    source.disconnect();
+    stream.getTracks().forEach((track) => track.stop());
+    void audioContext.close();
+    callbacks.onStop?.();
+  };
+
+  activeRealtimeVoiceCapture = { stop };
+  callbacks.onStart?.();
+  return true;
+}
+
+export function stopRealtimeVoiceCapture(): void {
+  activeRealtimeVoiceCapture?.stop();
+}
+
+function downsampleFloat32Buffer(
+  buffer: Float32Array,
+  inputSampleRate: number,
+  outputSampleRate: number,
+): Int16Array {
+  if (outputSampleRate >= inputSampleRate) {
+    return float32ToPcm16(buffer);
+  }
+  const ratio = inputSampleRate / outputSampleRate;
+  const outputLength = Math.max(1, Math.round(buffer.length / ratio));
+  const output = new Int16Array(outputLength);
+  let offsetBuffer = 0;
+  for (let i = 0; i < outputLength; i += 1) {
+    const nextOffsetBuffer = Math.min(buffer.length, Math.round((i + 1) * ratio));
+    let sum = 0;
+    let count = 0;
+    for (let j = offsetBuffer; j < nextOffsetBuffer; j += 1) {
+      sum += buffer[j];
+      count += 1;
+    }
+    const sample = count > 0 ? sum / count : 0;
+    output[i] = float32SampleToPcm16(sample);
+    offsetBuffer = nextOffsetBuffer;
+  }
+  return output;
+}
+
+function float32ToPcm16(buffer: Float32Array): Int16Array {
+  const output = new Int16Array(buffer.length);
+  for (let i = 0; i < buffer.length; i += 1) {
+    output[i] = float32SampleToPcm16(buffer[i]);
+  }
+  return output;
+}
+
+function float32SampleToPcm16(sample: number): number {
+  const clamped = Math.max(-1, Math.min(1, sample));
+  return clamped < 0 ? Math.round(clamped * 0x8000) : Math.round(clamped * 0x7fff);
+}
+
+function encodePcm16Chunk(chunk: Int16Array): string {
+  const bytes = new Uint8Array(chunk.length * 2);
+  for (let i = 0; i < chunk.length; i += 1) {
+    const value = chunk[i];
+    bytes[i * 2] = value & 0xff;
+    bytes[i * 2 + 1] = (value >> 8) & 0xff;
+  }
+  let binary = "";
+  for (const byte of bytes) {
+    binary += String.fromCharCode(byte);
+  }
+  return btoa(binary);
+}
+
 // ─── TTS (Text-to-Speech) ───

 export function isTtsSupported(): boolean {
--- a/ui/src/ui/views/chat.browser.test.ts
+++ b/ui/src/ui/views/chat.browser.test.ts
@@ -44,6 +44,11 @@ function createProps(overrides: Partial<ChatProps> = {}): ChatProps {
    canSend: true,
    disabledReason: null,
    error: null,
+    voiceActive: false,
+    voiceState: "idle",
+    voiceTranscript: "",
+    voiceError: null,
+    voicePlaybackEnabled: true,
    sessions: {
      ts: 0,
      path: "",
--- a/ui/src/ui/views/chat.test.ts
+++ b/ui/src/ui/views/chat.test.ts
@@ -220,6 +220,11 @@ function createProps(overrides: Partial<ChatProps> = {}): ChatProps {
    canSend: true,
    disabledReason: null,
    error: null,
+    voiceActive: false,
+    voiceState: "idle",
+    voiceTranscript: "",
+    voiceError: null,
+    voicePlaybackEnabled: true,
    sessions: createSessions(),
    focusMode: false,
    assistantName: "OpenClaw",
--- a/ui/src/ui/views/chat.ts
+++ b/ui/src/ui/views/chat.ts
@@ -29,7 +29,14 @@ import {
  type SlashCommandCategory,
  type SlashCommandDef,
 } from "../chat/slash-commands.ts";
-import { isSttSupported, startStt, stopStt } from "../chat/speech.ts";
+import {
+  isRealtimeVoiceSupported,
+  isSttSupported,
+  startRealtimeVoiceCapture,
+  startStt,
+  stopRealtimeVoiceCapture,
+  stopStt,
+} from "../chat/speech.ts";
 import { icons } from "../icons.ts";
 import { detectTextDirection } from "../text-direction.ts";
 import type { GatewaySessionRow, SessionsListResult } from "../types.ts";
@@ -62,6 +69,18 @@ export type ChatProps = {
  canSend: boolean;
  disabledReason: string | null;
  error: string | null;
+  voiceActive: boolean;
+  voiceState:
+    | "idle"
+    | "connecting"
+    | "listening"
+    | "processing"
+    | "speaking"
+    | "interrupted"
+    | "error";
+  voiceTranscript: string;
+  voiceError: string | null;
+  voicePlaybackEnabled: boolean;
  sessions: SessionsListResult | null;
  focusMode: boolean;
  sidebarOpen?: boolean;
@@ -80,6 +99,10 @@ export type ChatProps = {
  onDraftChange: (next: string) => void;
  onRequestUpdate?: () => void;
  onSend: () => void;
+  onVoiceStart?: () => Promise<boolean> | boolean;
+  onVoiceAudioChunk?: (chunkBase64: string) => Promise<void> | void;
+  onVoiceStop?: () => Promise<void> | void;
+  onVoiceInterrupt?: () => Promise<void> | void;
  onAbort?: () => void;
  onQueueRemove: (id: string) => void;
  onNewSession: () => void;
@@ -130,6 +153,7 @@ function getDeletedMessages(sessionKey: string): DeletedMessages {
 interface ChatEphemeralState {
  sttRecording: boolean;
  sttInterimText: string;
+  voiceRecording: boolean;
  slashMenuOpen: boolean;
  slashMenuItems: SlashCommandDef[];
  slashMenuIndex: number;
@@ -145,6 +169,7 @@ function createChatEphemeralState(): ChatEphemeralState {
  return {
    sttRecording: false,
    sttInterimText: "",
+    voiceRecording: false,
    slashMenuOpen: false,
    slashMenuItems: [],
    slashMenuIndex: 0,
@@ -167,6 +192,9 @@ export function resetChatViewState() {
  if (vs.sttRecording) {
    stopStt();
  }
+  if (vs.voiceRecording) {
+    stopRealtimeVoiceCapture();
+  }
  Object.assign(vs, createChatEphemeralState());
 }

@@ -254,6 +282,32 @@ function renderFallbackIndicator(status: FallbackIndicatorStatus | null | undefi
  `;
 }

+function renderVoiceStatus(props: ChatProps) {
+  if (!props.voiceActive && !props.voiceError) {
+    return nothing;
+  }
+  const label =
+    props.voiceState === "connecting"
+      ? "Connecting voice..."
+      : props.voiceState === "listening"
+        ? "Listening..."
+        : props.voiceState === "processing"
+          ? "Processing..."
+          : props.voiceState === "speaking"
+            ? "Speaking..."
+            : props.voiceState === "interrupted"
+              ? "Interrupted"
+              : props.voiceState === "error"
+                ? "Voice error"
+                : "Voice ready";
+  const detail = props.voiceError || props.voiceTranscript;
+  return html`
+    <div class="agent-chat__stt-interim">
+      <strong>${label}</strong>${detail ? html` ${detail}` : nothing}
+    </div>
+  `;
+}
+
 /**
 * Compact notice when context usage reaches 85%+.
 * Progressively shifts from amber (85%) to red (90%+).
@@ -913,6 +967,11 @@ export function renderChat(props: ChatProps) {
  const requestUpdate = props.onRequestUpdate ?? (() => {});
  const getDraft = props.getDraft ?? (() => props.draft);

+  if (!props.voiceActive && vs.voiceRecording) {
+    stopRealtimeVoiceCapture();
+    vs.voiceRecording = false;
+  }
+
  const splitRatio = props.splitRatio ?? 0.6;
  const sidebarOpen = Boolean(props.sidebarOpen && props.onCloseSidebar);

@@ -1262,6 +1321,7 @@ export function renderChat(props: ChatProps) {
        ${vs.sttRecording && vs.sttInterimText
          ? html`<div class="agent-chat__stt-interim">${vs.sttInterimText}</div>`
          : nothing}
+        ${renderVoiceStatus(props)}

        <textarea
          ${ref((el) => el && adjustTextareaHeight(el as HTMLTextAreaElement))}
@@ -1342,6 +1402,56 @@ export function renderChat(props: ChatProps) {
                  </button>
                `
              : nothing}
+            ${isRealtimeVoiceSupported() && props.onVoiceStart && props.onVoiceStop
+              ? html`
+                  <button
+                    class="agent-chat__input-btn ${props.voiceActive
+                      ? "agent-chat__input-btn--recording"
+                      : ""}"
+                    @click=${async () => {
+                      if (props.voiceActive) {
+                        stopRealtimeVoiceCapture();
+                        vs.voiceRecording = false;
+                        await props.onVoiceStop?.();
+                        requestUpdate();
+                        return;
+                      }
+                      const started = await props.onVoiceStart?.();
+                      if (!started) {
+                        requestUpdate();
+                        return;
+                      }
+                      const captureStarted = await startRealtimeVoiceCapture({
+                        onChunk: (chunkBase64) => {
+                          void props.onVoiceAudioChunk?.(chunkBase64);
+                        },
+                        onStart: () => {
+                          vs.voiceRecording = true;
+                          requestUpdate();
+                        },
+                        onStop: () => {
+                          vs.voiceRecording = false;
+                          requestUpdate();
+                        },
+                        onError: async () => {
+                          vs.voiceRecording = false;
+                          await props.onVoiceStop?.();
+                          requestUpdate();
+                        },
+                      });
+                      if (!captureStarted) {
+                        await props.onVoiceStop?.();
+                        requestUpdate();
+                      }
+                    }}
+                    title=${props.voiceActive ? "Stop live voice" : "Start live voice"}
+                    aria-label=${props.voiceActive ? "Stop live voice" : "Start live voice"}
+                    ?disabled=${!props.connected || props.voiceState === "connecting"}
+                  >
+                    ${props.voiceActive ? icons.volume2 : icons.radio}
+                  </button>
+                `
+              : nothing}
            ${tokens ? html`<span class="agent-chat__token-count">${tokens}</span>` : nothing}
          </div>