mirror of
https://github.com/openclaw/openclaw.git
synced 2026-06-06 14:01:24 +08:00
Compare commits: v2026.5.3...codex/capa (1 commit)

| Author | SHA1 | Date |
|---|---|---|
| | 91625aa9f3 | |
@@ -6,12 +6,14 @@ Docs: https://docs.openclaw.ai
|
||||
|
||||
### Changes
|
||||
|
||||
- CLI/capabilities: add a first-class `openclaw capability ...` hub for provider-backed inference workflows across model, media, web, and embedding tasks, with capability inspection, provider discovery, and consistent JSON output. Thanks @Takhoffman.
|
||||
- Providers/Anthropic: restore Claude CLI as the preferred local Anthropic path in onboarding, model-auth guidance, and doctor flows, and keep the Docker Claude CLI live lane aligned with the restored guidance.
|
||||
- Plugins/webhooks: add a bundled webhook ingress plugin so external automation can create and drive bound TaskFlows through per-route shared-secret endpoints. (#61892) Thanks @mbelinky.
|
||||
- Tools/media: document per-provider music and video generation capabilities, and add shared live video-to-video sweep coverage for providers that support local reference clips.
|
||||
|
||||
### Fixes
|
||||
|
||||
- CLI/capabilities: keep provider-backed capability behavior aligned with actual runtime execution by fixing explicit TTS override handling, profile-aware gateway TTS prefs resolution, per-request transcription `prompt`/`language` overrides, image output MIME/extension mismatches, configured web-search fallback behavior, and agent-vs-CLI web-search execution drift.
|
||||
- Channels/secrets: keep bundled channel artifact and secret-contract loading stable under lazy loading so bundled channel secrets continue to appear in `openclaw secret`, status, and security-audit surfaces.
|
||||
- Providers/xAI: recognize `api.grok.x.ai` as an xAI-native endpoint again so native xAI web-search attribution keeps working on Grok-hosted base URLs. (#61377) Thanks @jjjojoj.
|
||||
- Providers/Anthropic/cache: preserve thinking blocks for Claude Opus 4.5+, Sonnet 4.5+, and newer Claude 4-family models so Anthropic prompt-cache prefixes keep matching after thinking turns. (#61793)
|
||||
|
||||
@@ -361,14 +361,6 @@
|
||||
}
|
||||
}
|
||||
},
|
||||
- "update_plan": {
- "emoji": "🗺️",
- "title": "Update Plan",
- "detailKeys": [
- "explanation",
- "plan.0.step"
- ]
- },
|
||||
"gateway": {
|
||||
"emoji": "🔌",
|
||||
"title": "Gateway",
|
||||
|
||||
116
docs/cli/capability.md
Normal file
@@ -0,0 +1,116 @@
|
||||
---
|
||||
summary: "Capability-first CLI for provider-backed model, media, web, and embedding workflows"
|
||||
read_when:
|
||||
- Adding or modifying `openclaw capability` commands
|
||||
- Designing stable headless capability automation
|
||||
title: "Capability CLI"
|
||||
---
|
||||
|
||||
# Capability CLI
|
||||
|
||||
`openclaw capability` is the canonical headless surface for provider-backed capabilities.
|
||||
|
||||
It intentionally exposes capability families rather than raw gateway RPC names or raw agent tool IDs.
|
||||
|
||||
## Command tree
|
||||
|
||||
```text
|
||||
openclaw capability
|
||||
list
|
||||
inspect
|
||||
|
||||
model
|
||||
run
|
||||
list
|
||||
inspect
|
||||
providers
|
||||
auth login
|
||||
auth logout
|
||||
auth status
|
||||
|
||||
media
|
||||
image
|
||||
generate
|
||||
edit
|
||||
describe
|
||||
describe-many
|
||||
providers
|
||||
audio
|
||||
transcribe
|
||||
providers
|
||||
tts
|
||||
convert
|
||||
voices
|
||||
providers
|
||||
status
|
||||
enable
|
||||
disable
|
||||
set-provider
|
||||
video
|
||||
generate
|
||||
describe
|
||||
providers
|
||||
|
||||
web
|
||||
search
|
||||
fetch
|
||||
providers
|
||||
|
||||
memory
|
||||
embedding
|
||||
create
|
||||
providers
|
||||
```
|
||||
|
||||
## Transport
|
||||
|
||||
Supported transport flags:
|
||||
|
||||
- `--local`
|
||||
- `--gateway`
|
||||
|
||||
When neither flag is passed, the transport is chosen automatically per command family:
|
||||
|
||||
- Stateless execution commands default to local.
|
||||
- Gateway-managed state commands default to gateway.
|
||||
|
||||
Examples:
|
||||
|
||||
```bash
|
||||
openclaw capability model run --prompt "hello" --json
|
||||
openclaw capability media image generate --prompt "friendly lobster" --json
|
||||
openclaw capability media tts status --json
|
||||
openclaw capability embedding create --text "hello world" --json
|
||||
```
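As a rough illustration of the default-transport rule above, the selection could be modeled as below. This is a minimal sketch, not the CLI's actual implementation: `resolveTransport`, `CommandKind`, and the `TransportFlags` shape are hypothetical names introduced here, and this page does not say how the real CLI treats both flags being passed together (the sketch simply rejects that case).

```typescript
// Hypothetical model of the transport defaults described above.
type Transport = "local" | "gateway";
type CommandKind = "stateless-execution" | "gateway-managed-state";

interface TransportFlags {
  local?: boolean; // corresponds to --local
  gateway?: boolean; // corresponds to --gateway
}

function resolveTransport(kind: CommandKind, flags: TransportFlags = {}): Transport {
  if (flags.local && flags.gateway) {
    // Assumption: the docs do not specify this case, so the sketch refuses it.
    throw new Error("Pass at most one of --local or --gateway.");
  }
  // An explicit flag always wins over the per-family default.
  if (flags.local) return "local";
  if (flags.gateway) return "gateway";
  // No flag: stateless execution runs locally, gateway-managed state uses the gateway.
  return kind === "gateway-managed-state" ? "gateway" : "local";
}
```

Under this model, `media tts status` (gateway-managed state) resolves to `gateway` with no flags, while `model run` (stateless execution) resolves to `local`.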
|
||||
|
||||
## JSON output
|
||||
|
||||
Capability commands normalize JSON output under a shared envelope:
|
||||
|
||||
```json
|
||||
{
|
||||
"ok": true,
|
||||
"capability": "media.image.generate",
|
||||
"transport": "local",
|
||||
"provider": "openai",
|
||||
"model": "gpt-image-1",
|
||||
"attempts": [],
|
||||
"outputs": []
|
||||
}
|
||||
```
|
||||
|
||||
Top-level fields are stable:
|
||||
|
||||
- `ok`
|
||||
- `capability`
|
||||
- `transport`
|
||||
- `provider`
|
||||
- `model`
|
||||
- `attempts`
|
||||
- `outputs`
|
||||
- `error`
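
For automation that consumes this output, the envelope could be typed roughly as follows. This is a sketch under assumptions: the field names come from the list above, but the value types, which fields are optional, and the `CapabilityEnvelope` and `parseEnvelope` names are guesses made for illustration, not a published schema.

```typescript
// Hypothetical typing of the shared JSON envelope; value types are assumptions.
interface CapabilityEnvelope {
  ok: boolean;
  capability: string;
  transport: "local" | "gateway";
  provider?: string;
  model?: string;
  attempts: unknown[];
  outputs: unknown[];
  error?: unknown;
}

function parseEnvelope(raw: string): CapabilityEnvelope {
  const value: unknown = JSON.parse(raw);
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    throw new Error("Envelope must be a JSON object.");
  }
  const { ok, capability, transport, provider, model, attempts, outputs, error } =
    value as Record<string, unknown>;
  if (typeof ok !== "boolean") throw new Error("Missing boolean `ok`.");
  if (typeof capability !== "string") throw new Error("Missing string `capability`.");
  if (transport !== "local" && transport !== "gateway") {
    throw new Error("`transport` must be \"local\" or \"gateway\".");
  }
  return {
    ok,
    capability,
    transport,
    // Optional fields are kept only when they have the expected shape.
    provider: typeof provider === "string" ? provider : undefined,
    model: typeof model === "string" ? model : undefined,
    attempts: Array.isArray(attempts) ? attempts : [],
    outputs: Array.isArray(outputs) ? outputs : [],
    error,
  };
}
```

A caller would typically branch on `ok` first and only then read `outputs` or `error`.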
|
||||
|
||||
## Notes
|
||||
|
||||
- `model run` reuses the agent runtime so provider/model overrides behave like normal agent execution.
|
||||
- `media tts status` defaults to gateway because it reflects gateway-managed TTS state.
|
||||
@@ -35,6 +35,7 @@ This page describes the current CLI behavior. If commands change, update this do
|
||||
- [`logs`](/cli/logs)
|
||||
- [`system`](/cli/system)
|
||||
- [`models`](/cli/models)
|
||||
- [`capability`](/cli/capability)
|
||||
- [`memory`](/cli/memory)
|
||||
- [`directory`](/cli/directory)
|
||||
- [`nodes`](/cli/nodes)
|
||||
@@ -248,6 +249,16 @@ openclaw [--dev] [--profile <name>] <command>
|
||||
fallbacks list|add|remove|clear
|
||||
image-fallbacks list|add|remove|clear
|
||||
scan
|
||||
capability
|
||||
list
|
||||
inspect
|
||||
model run|list|inspect|providers|auth login|logout|status
|
||||
media image generate|edit|describe|describe-many|providers
|
||||
media audio transcribe|providers
|
||||
media tts convert|voices|providers|status|enable|disable|set-provider
|
||||
media video generate|describe|providers
|
||||
web search|fetch|providers
|
||||
embedding create|providers
|
||||
auth add|login|login-github-copilot|setup-token|paste-token
|
||||
auth order get|set|clear
|
||||
sandbox
|
||||
|
||||
@@ -1,5 +1,6 @@
|
||||
import { definePluginEntry } from "openclaw/plugin-sdk/plugin-entry";
|
||||
import { buildMicrosoftFoundryProvider } from "./provider.js";
|
||||
import { buildMicrosoftFoundryRealtimeTranscriptionProvider } from "./realtime-transcription-provider.js";
|
||||
|
||||
export default definePluginEntry({
|
||||
id: "microsoft-foundry",
|
||||
@@ -7,5 +8,6 @@ export default definePluginEntry({
|
||||
description: "Microsoft Foundry provider with Entra ID and API key auth",
|
||||
register(api) {
|
||||
api.registerProvider(buildMicrosoftFoundryProvider());
|
||||
api.registerRealtimeTranscriptionProvider(buildMicrosoftFoundryRealtimeTranscriptionProvider());
|
||||
},
|
||||
});
|
||||
|
||||
@@ -0,0 +1,58 @@
|
||||
import { describe, expect, it } from "vitest";
|
||||
import { buildMicrosoftFoundryRealtimeTranscriptionProvider } from "./realtime-transcription-provider.js";
|
||||
|
||||
describe("buildMicrosoftFoundryRealtimeTranscriptionProvider", () => {
|
||||
it("normalizes foundry config from the voice provider block", () => {
|
||||
const provider = buildMicrosoftFoundryRealtimeTranscriptionProvider();
|
||||
const resolved = provider.resolveConfig?.({
|
||||
cfg: {} as never,
|
||||
rawConfig: {
|
||||
providers: {
|
||||
"microsoft-foundry": {
|
||||
apiKey: "azure-test-key",
|
||||
baseUrl: "https://example.services.ai.azure.com/openai/v1",
|
||||
deployment: "gpt-realtime",
|
||||
apiVersion: "2025-04-01-preview",
|
||||
},
|
||||
},
|
||||
},
|
||||
});
|
||||
|
||||
expect(resolved).toEqual({
|
||||
apiKey: "azure-test-key",
|
||||
baseUrl: "https://example.services.ai.azure.com/openai/v1",
|
||||
deployment: "gpt-realtime",
|
||||
apiVersion: "2025-04-01-preview",
|
||||
});
|
||||
});
|
||||
|
||||
it("accepts model-provider style config with api-key headers", () => {
|
||||
const provider = buildMicrosoftFoundryRealtimeTranscriptionProvider();
|
||||
const resolved = provider.resolveConfig?.({
|
||||
cfg: {} as never,
|
||||
rawConfig: {
|
||||
providers: {
|
||||
"microsoft-foundry": {
|
||||
baseUrl: "https://example.services.ai.azure.com/openai/v1",
|
||||
headers: {
|
||||
"api-key": "azure-test-key",
|
||||
},
|
||||
model: "gpt-realtime",
|
||||
},
|
||||
},
|
||||
},
|
||||
});
|
||||
|
||||
expect(resolved).toEqual({
|
||||
apiKey: "azure-test-key",
|
||||
baseUrl: "https://example.services.ai.azure.com/openai/v1",
|
||||
deployment: "gpt-realtime",
|
||||
model: "gpt-realtime",
|
||||
});
|
||||
});
|
||||
|
||||
it("registers foundry aliases for voice provider selection", () => {
|
||||
const provider = buildMicrosoftFoundryRealtimeTranscriptionProvider();
|
||||
expect(provider.aliases).toContain("azure-foundry");
|
||||
});
|
||||
});
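
The second test above relies on the provider falling back to header credentials when no `apiKey` is set. The standalone sketch below isolates that precedence as it appears in `extractFoundryProviderConfig` in `realtime-transcription-provider.ts`: `apiKey` first, then the `api-key` header, then an `Authorization` header with any `Bearer ` prefix stripped. `pickApiKey` is a hypothetical name used only for illustration and is not part of this commit.

```typescript
// Same trimming helper as in the provider source.
function trimToUndefined(value: unknown): string | undefined {
  return typeof value === "string" && value.trim() ? value.trim() : undefined;
}

// Hypothetical standalone version of the credential fallback.
function pickApiKey(raw: {
  apiKey?: unknown;
  headers?: Record<string, unknown>;
}): string | undefined {
  return (
    trimToUndefined(raw.apiKey) ??
    trimToUndefined(raw.headers?.["api-key"]) ??
    trimToUndefined(raw.headers?.["Authorization"])?.replace(/^Bearer\s+/i, "")
  );
}

console.log(pickApiKey({ headers: { "api-key": "azure-test-key" } })); // "azure-test-key"
console.log(pickApiKey({ headers: { Authorization: "Bearer abc" } })); // "abc"
```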
|
||||
313
extensions/microsoft-foundry/realtime-transcription-provider.ts
Normal file
@@ -0,0 +1,313 @@
|
||||
import type {
|
||||
RealtimeTranscriptionProviderConfig,
|
||||
RealtimeTranscriptionProviderPlugin,
|
||||
RealtimeTranscriptionSession,
|
||||
RealtimeTranscriptionSessionCreateRequest,
|
||||
} from "openclaw/plugin-sdk/realtime-transcription";
|
||||
import WebSocket from "ws";
|
||||
import { normalizeFoundryEndpoint, PROVIDER_ID } from "./shared.js";
|
||||
|
||||
type FoundryRealtimeTranscriptionProviderConfig = {
|
||||
apiKey?: string;
|
||||
baseUrl?: string;
|
||||
endpoint?: string;
|
||||
deployment?: string;
|
||||
model?: string;
|
||||
apiVersion?: string;
|
||||
silenceDurationMs?: number;
|
||||
vadThreshold?: number;
|
||||
};
|
||||
|
||||
type FoundryRealtimeTranscriptionSessionConfig = RealtimeTranscriptionSessionCreateRequest & {
|
||||
apiKey: string;
|
||||
baseUrl: string;
|
||||
deployment: string;
|
||||
apiVersion: string;
|
||||
silenceDurationMs: number;
|
||||
vadThreshold: number;
|
||||
};
|
||||
|
||||
type RealtimeEvent = {
|
||||
type: string;
|
||||
delta?: string;
|
||||
transcript?: string;
|
||||
error?: unknown;
|
||||
item?: { transcript?: string } | null;
|
||||
};
|
||||
|
||||
function trimToUndefined(value: unknown): string | undefined {
|
||||
return typeof value === "string" && value.trim() ? value.trim() : undefined;
|
||||
}
|
||||
|
||||
function asNumber(value: unknown): number | undefined {
|
||||
return typeof value === "number" && Number.isFinite(value) ? value : undefined;
|
||||
}
|
||||
|
||||
function asObject(value: unknown): Record<string, unknown> | undefined {
|
||||
return typeof value === "object" && value !== null && !Array.isArray(value)
|
||||
? (value as Record<string, unknown>)
|
||||
: undefined;
|
||||
}
|
||||
|
||||
function extractFoundryProviderConfig(
|
||||
rawConfig: RealtimeTranscriptionProviderConfig,
|
||||
): FoundryRealtimeTranscriptionProviderConfig {
|
||||
const providers = asObject(rawConfig.providers);
|
||||
const raw =
|
||||
asObject(providers?.[PROVIDER_ID]) ??
|
||||
asObject(rawConfig[PROVIDER_ID]) ??
|
||||
asObject(rawConfig.microsoftFoundry) ??
|
||||
asObject(rawConfig);
|
||||
const providerBaseUrl = trimToUndefined(raw?.baseUrl);
|
||||
const endpoint = trimToUndefined(raw?.endpoint);
|
||||
return {
|
||||
apiKey:
|
||||
trimToUndefined(raw?.apiKey) ??
|
||||
trimToUndefined(asObject(raw?.headers)?.["api-key"]) ??
|
||||
trimToUndefined(asObject(raw?.headers)?.Authorization)?.replace(/^Bearer\s+/i, ""),
|
||||
baseUrl: providerBaseUrl,
|
||||
endpoint,
|
||||
deployment:
|
||||
trimToUndefined(raw?.deployment) ??
|
||||
trimToUndefined(raw?.model) ??
|
||||
trimToUndefined(raw?.deploymentName),
|
||||
model: trimToUndefined(raw?.transcriptionModel) ?? trimToUndefined(raw?.model),
|
||||
apiVersion: trimToUndefined(raw?.apiVersion),
|
||||
silenceDurationMs: asNumber(raw?.silenceDurationMs),
|
||||
vadThreshold: asNumber(raw?.vadThreshold),
|
||||
};
|
||||
}
|
||||
|
||||
function resolveFoundryRealtimeBaseUrl(
|
||||
config: FoundryRealtimeTranscriptionProviderConfig,
|
||||
): string | undefined {
|
||||
if (config.endpoint) {
|
||||
return normalizeFoundryEndpoint(config.endpoint);
|
||||
}
|
||||
if (!config.baseUrl) {
|
||||
return undefined;
|
||||
}
|
||||
return normalizeFoundryEndpoint(config.baseUrl);
|
||||
}
|
||||
|
||||
class FoundryRealtimeTranscriptionSession implements RealtimeTranscriptionSession {
|
||||
private static readonly MAX_RECONNECT_ATTEMPTS = 5;
|
||||
private static readonly RECONNECT_DELAY_MS = 1000;
|
||||
private static readonly CONNECT_TIMEOUT_MS = 10_000;
|
||||
|
||||
private ws: WebSocket | null = null;
|
||||
private connected = false;
|
||||
private closed = false;
|
||||
private reconnectAttempts = 0;
|
||||
private pendingTranscript = "";
|
||||
|
||||
constructor(private readonly config: FoundryRealtimeTranscriptionSessionConfig) {}
|
||||
|
||||
async connect(): Promise<void> {
|
||||
this.closed = false;
|
||||
this.reconnectAttempts = 0;
|
||||
await this.doConnect();
|
||||
}
|
||||
|
||||
sendAudio(audio: Buffer): void {
|
||||
if (this.ws?.readyState !== WebSocket.OPEN) {
|
||||
return;
|
||||
}
|
||||
this.sendEvent({
|
||||
type: "input_audio_buffer.append",
|
||||
audio: audio.toString("base64"),
|
||||
});
|
||||
}
|
||||
|
||||
close(): void {
|
||||
this.closed = true;
|
||||
this.connected = false;
|
||||
if (this.ws) {
|
||||
this.ws.close(1000, "Transcription session closed");
|
||||
this.ws = null;
|
||||
}
|
||||
}
|
||||
|
||||
isConnected(): boolean {
|
||||
return this.connected;
|
||||
}
|
||||
|
||||
private async doConnect(): Promise<void> {
|
||||
await new Promise<void>((resolve, reject) => {
|
||||
const wsUrl = this.buildWebSocketUrl();
|
||||
this.ws = new WebSocket(wsUrl, {
|
||||
headers: {
|
||||
"api-key": this.config.apiKey,
|
||||
},
|
||||
});
|
||||
|
||||
const connectTimeout = setTimeout(() => {
|
||||
reject(new Error("Microsoft Foundry realtime transcription connection timeout"));
|
||||
}, FoundryRealtimeTranscriptionSession.CONNECT_TIMEOUT_MS);
|
||||
|
||||
this.ws.on("open", () => {
|
||||
clearTimeout(connectTimeout);
|
||||
this.connected = true;
|
||||
this.reconnectAttempts = 0;
|
||||
this.sendEvent({
|
||||
type: "session.update",
|
||||
session: {
|
||||
input_audio_format: "pcm16",
|
||||
input_audio_transcription: {
|
||||
model: this.config.deployment,
|
||||
},
|
||||
turn_detection: {
|
||||
type: "server_vad",
|
||||
threshold: this.config.vadThreshold,
|
||||
prefix_padding_ms: 300,
|
||||
silence_duration_ms: this.config.silenceDurationMs,
|
||||
},
|
||||
},
|
||||
});
|
||||
resolve();
|
||||
});
|
||||
|
||||
this.ws.on("message", (data: Buffer) => {
|
||||
try {
|
||||
this.handleEvent(JSON.parse(data.toString()) as RealtimeEvent);
|
||||
} catch (error) {
|
||||
this.config.onError?.(error instanceof Error ? error : new Error(String(error)));
|
||||
}
|
||||
});
|
||||
|
||||
this.ws.on("error", (error) => {
|
||||
if (!this.connected) {
|
||||
clearTimeout(connectTimeout);
|
||||
reject(error);
|
||||
return;
|
||||
}
|
||||
this.config.onError?.(error instanceof Error ? error : new Error(String(error)));
|
||||
});
|
||||
|
||||
this.ws.on("close", () => {
|
||||
this.connected = false;
|
||||
if (this.closed) {
|
||||
return;
|
||||
}
|
||||
void this.attemptReconnect();
|
||||
});
|
||||
});
|
||||
}
|
||||
|
||||
private buildWebSocketUrl(): string {
|
||||
const httpBaseUrl = this.config.baseUrl.replace(/\/+$/, "");
|
||||
const wsBaseUrl = httpBaseUrl.replace(/^http:/i, "ws:").replace(/^https:/i, "wss:");
|
||||
const url = new URL(`${wsBaseUrl}/openai/realtime`);
|
||||
url.searchParams.set("api-version", this.config.apiVersion);
|
||||
url.searchParams.set("deployment", this.config.deployment);
|
||||
return url.toString();
|
||||
}
|
||||
|
||||
private async attemptReconnect(): Promise<void> {
|
||||
if (this.closed) {
|
||||
return;
|
||||
}
|
||||
if (this.reconnectAttempts >= FoundryRealtimeTranscriptionSession.MAX_RECONNECT_ATTEMPTS) {
|
||||
this.config.onError?.(
|
||||
new Error("Microsoft Foundry realtime transcription reconnect limit reached"),
|
||||
);
|
||||
return;
|
||||
}
|
||||
this.reconnectAttempts += 1;
|
||||
const delay =
|
||||
FoundryRealtimeTranscriptionSession.RECONNECT_DELAY_MS * 2 ** (this.reconnectAttempts - 1);
|
||||
await new Promise((resolve) => setTimeout(resolve, delay));
|
||||
if (this.closed) {
|
||||
return;
|
||||
}
|
||||
try {
|
||||
await this.doConnect();
|
||||
} catch (error) {
|
||||
this.config.onError?.(error instanceof Error ? error : new Error(String(error)));
|
||||
await this.attemptReconnect();
|
||||
}
|
||||
}
|
||||
|
||||
private handleEvent(event: RealtimeEvent): void {
|
||||
switch (event.type) {
|
||||
case "conversation.item.input_audio_transcription.delta":
|
||||
case "conversation.item.audio_transcription.delta":
|
||||
if (event.delta) {
|
||||
this.pendingTranscript += event.delta;
|
||||
this.config.onPartial?.(this.pendingTranscript);
|
||||
}
|
||||
return;
|
||||
|
||||
case "conversation.item.input_audio_transcription.completed":
|
||||
case "conversation.item.audio_transcription.completed": {
|
||||
const transcript = event.transcript ?? event.item?.transcript;
|
||||
if (transcript) {
|
||||
this.config.onTranscript?.(transcript);
|
||||
}
|
||||
this.pendingTranscript = "";
|
||||
return;
|
||||
}
|
||||
|
||||
case "input_audio_buffer.speech_started":
|
||||
this.pendingTranscript = "";
|
||||
this.config.onSpeechStart?.();
|
||||
return;
|
||||
|
||||
case "error": {
|
||||
const detail =
|
||||
event.error && typeof event.error === "object" && "message" in event.error
|
||||
? String((event.error as { message?: unknown }).message ?? "Unknown error")
|
||||
: event.error
|
||||
? String(event.error)
|
||||
: "Unknown error";
|
||||
this.config.onError?.(new Error(detail));
|
||||
return;
|
||||
}
|
||||
|
||||
default:
|
||||
return;
|
||||
}
|
||||
}
|
||||
|
||||
private sendEvent(event: unknown): void {
|
||||
if (this.ws?.readyState === WebSocket.OPEN) {
|
||||
this.ws.send(JSON.stringify(event));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
export function buildMicrosoftFoundryRealtimeTranscriptionProvider(): RealtimeTranscriptionProviderPlugin {
|
||||
return {
|
||||
id: PROVIDER_ID,
|
||||
label: "Microsoft Foundry Realtime Transcription",
|
||||
aliases: ["azure-foundry", "azure-openai-foundry"],
|
||||
autoSelectOrder: 20,
|
||||
resolveConfig: ({ rawConfig }) => extractFoundryProviderConfig(rawConfig),
|
||||
isConfigured: ({ providerConfig }) => {
|
||||
const config = extractFoundryProviderConfig(providerConfig);
|
||||
return Boolean(config.apiKey && resolveFoundryRealtimeBaseUrl(config) && config.deployment);
|
||||
},
|
||||
createSession: (req) => {
|
||||
const config = extractFoundryProviderConfig(req.providerConfig);
|
||||
const baseUrl = resolveFoundryRealtimeBaseUrl(config);
|
||||
if (!config.apiKey) {
|
||||
throw new Error("Microsoft Foundry realtime transcription API key missing");
|
||||
}
|
||||
if (!baseUrl) {
|
||||
throw new Error("Microsoft Foundry realtime transcription endpoint missing");
|
||||
}
|
||||
if (!config.deployment) {
|
||||
throw new Error("Microsoft Foundry realtime transcription deployment missing");
|
||||
}
|
||||
return new FoundryRealtimeTranscriptionSession({
|
||||
...req,
|
||||
apiKey: config.apiKey,
|
||||
baseUrl,
|
||||
deployment: config.deployment,
|
||||
apiVersion: config.apiVersion ?? "2025-04-01-preview",
|
||||
silenceDurationMs: config.silenceDurationMs ?? 800,
|
||||
vadThreshold: config.vadThreshold ?? 0.5,
|
||||
});
|
||||
},
|
||||
};
|
||||
}
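
The reconnect path in `attemptReconnect` waits `RECONNECT_DELAY_MS * 2 ** (reconnectAttempts - 1)` before each retry and reports an error once `MAX_RECONNECT_ATTEMPTS` is reached. The standalone sketch below reproduces only that arithmetic so the resulting schedule is easy to see; `reconnectDelays` is a hypothetical helper for illustration and is not part of this commit.

```typescript
// Mirrors the constants in FoundryRealtimeTranscriptionSession above.
const RECONNECT_DELAY_MS = 1000;
const MAX_RECONNECT_ATTEMPTS = 5;

// Hypothetical helper: delay in milliseconds before each reconnect attempt, in order.
function reconnectDelays(
  baseDelayMs: number = RECONNECT_DELAY_MS,
  maxAttempts: number = MAX_RECONNECT_ATTEMPTS,
): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    // Doubles on every attempt: base, 2x base, 4x base, ...
    delays.push(baseDelayMs * 2 ** (attempt - 1));
  }
  return delays;
}

console.log(reconnectDelays()); // [1000, 2000, 4000, 8000, 16000]
```

With the defaults this adds up to roughly 31 seconds of cumulative delay before the reconnect-limit error, ignoring connection time and assuming a single retry chain with no successful connection in between (a successful `open` resets the counter).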
|
||||
@@ -18,6 +18,7 @@ type OpenAIRealtimeTranscriptionProviderConfig = {
|
||||
model?: string;
|
||||
silenceDurationMs?: number;
|
||||
vadThreshold?: number;
|
||||
inputAudioFormat?: string;
|
||||
};
|
||||
|
||||
type OpenAIRealtimeTranscriptionSessionConfig = RealtimeTranscriptionSessionCreateRequest & {
|
||||
@@ -25,6 +26,7 @@ type OpenAIRealtimeTranscriptionSessionConfig = RealtimeTranscriptionSessionCrea
|
||||
model: string;
|
||||
silenceDurationMs: number;
|
||||
vadThreshold: number;
|
||||
inputAudioFormat: string;
|
||||
};
|
||||
|
||||
type RealtimeEvent = {
|
||||
@@ -51,6 +53,7 @@ function normalizeProviderConfig(
|
||||
model: trimToUndefined(raw?.model) ?? trimToUndefined(raw?.sttModel),
|
||||
silenceDurationMs: asFiniteNumber(raw?.silenceDurationMs),
|
||||
vadThreshold: asFiniteNumber(raw?.vadThreshold),
|
||||
inputAudioFormat: trimToUndefined(raw?.inputAudioFormat),
|
||||
};
|
||||
}
|
||||
|
||||
@@ -116,7 +119,7 @@ class OpenAIRealtimeTranscriptionSession implements RealtimeTranscriptionSession
|
||||
this.sendEvent({
|
||||
type: "transcription_session.update",
|
||||
session: {
|
||||
- input_audio_format: "g711_ulaw",
+ input_audio_format: this.config.inputAudioFormat,
|
||||
input_audio_transcription: {
|
||||
model: this.config.model,
|
||||
},
|
||||
@@ -241,6 +244,7 @@ export function buildOpenAIRealtimeTranscriptionProvider(): RealtimeTranscriptio
|
||||
model: config.model ?? "gpt-4o-transcribe",
|
||||
silenceDurationMs: config.silenceDurationMs ?? 800,
|
||||
vadThreshold: config.vadThreshold ?? 0.5,
|
||||
inputAudioFormat: config.inputAudioFormat ?? "g711_ulaw",
|
||||
});
|
||||
},
|
||||
};
|
||||
|
||||
@@ -9,6 +9,7 @@ export {
|
||||
isTtsProviderConfigured,
|
||||
listSpeechVoices,
|
||||
maybeApplyTtsToPayload,
|
||||
resolveExplicitTtsOverrides,
|
||||
resolveTtsAutoMode,
|
||||
resolveTtsConfig,
|
||||
resolveTtsPrefsPath,
|
||||
|
||||
@@ -23,7 +23,7 @@ import { resolveSendableOutboundReplyParts } from "openclaw/plugin-sdk/reply-pay
|
||||
import type { ReplyPayload } from "openclaw/plugin-sdk/reply-runtime";
|
||||
import { isVerbose, logVerbose } from "openclaw/plugin-sdk/runtime-env";
|
||||
import { resolvePreferredOpenClawTmpDir } from "openclaw/plugin-sdk/sandbox";
|
||||
- import { CONFIG_DIR, resolveUserPath, stripMarkdown } from "openclaw/plugin-sdk/text-runtime";
+ import { resolveConfigDir, resolveUserPath, stripMarkdown } from "openclaw/plugin-sdk/text-runtime";
|
||||
import {
|
||||
canonicalizeSpeechProviderId,
|
||||
getSpeechProvider,
|
||||
@@ -35,6 +35,7 @@ import {
|
||||
summarizeText,
|
||||
type SpeechModelOverridePolicy,
|
||||
type SpeechProviderConfig,
|
||||
type SpeechProviderOverrides,
|
||||
type SpeechVoiceOption,
|
||||
type TtsDirectiveOverrides,
|
||||
type TtsDirectiveParseResult,
|
||||
@@ -167,7 +168,7 @@ function resolveTtsPrefsPathValue(prefsPath: string | undefined): string {
|
||||
if (envPath) {
|
||||
return resolveUserPath(envPath);
|
||||
}
|
||||
- return path.join(CONFIG_DIR, "settings", "tts.json");
+ return path.join(resolveConfigDir(process.env), "settings", "tts.json");
|
||||
}
|
||||
|
||||
function resolveModelOverridePolicy(
|
||||
@@ -494,6 +495,66 @@ export function setTtsProvider(prefsPath: string, provider: TtsProvider): void {
|
||||
});
|
||||
}
|
||||
|
||||
export function resolveExplicitTtsOverrides(params: {
|
||||
cfg: OpenClawConfig;
|
||||
prefsPath?: string;
|
||||
provider?: string;
|
||||
modelId?: string;
|
||||
voiceId?: string;
|
||||
}): TtsDirectiveOverrides {
|
||||
const providerInput = params.provider?.trim();
|
||||
const modelId = params.modelId?.trim();
|
||||
const voiceId = params.voiceId?.trim();
|
||||
const config = resolveTtsConfig(params.cfg);
|
||||
const prefsPath = params.prefsPath ?? resolveTtsPrefsPath(config);
|
||||
const selectedProvider =
|
||||
canonicalizeSpeechProviderId(providerInput, params.cfg) ??
|
||||
(modelId || voiceId ? getTtsProvider(config, prefsPath) : undefined);
|
||||
|
||||
if (providerInput && !selectedProvider) {
|
||||
throw new Error(`Unknown TTS provider "${providerInput}".`);
|
||||
}
|
||||
|
||||
if (!modelId && !voiceId) {
|
||||
return selectedProvider ? { provider: selectedProvider } : {};
|
||||
}
|
||||
|
||||
if (!selectedProvider) {
|
||||
throw new Error("TTS model or voice overrides require a resolved provider.");
|
||||
}
|
||||
|
||||
const provider = getSpeechProvider(selectedProvider, params.cfg);
|
||||
if (!provider) {
|
||||
throw new Error(`speech provider ${selectedProvider} is not registered`);
|
||||
}
|
||||
if (!provider.resolveTalkOverrides) {
|
||||
throw new Error(
|
||||
`TTS provider "${selectedProvider}" does not support model or voice overrides.`,
|
||||
);
|
||||
}
|
||||
|
||||
const providerOverrides = provider.resolveTalkOverrides({
|
||||
talkProviderConfig: {},
|
||||
params: {
|
||||
...(voiceId ? { voiceId } : {}),
|
||||
...(modelId ? { modelId } : {}),
|
||||
},
|
||||
});
|
||||
if ((voiceId || modelId) && (!providerOverrides || Object.keys(providerOverrides).length === 0)) {
|
||||
throw new Error(
|
||||
`TTS provider "${selectedProvider}" ignored the requested model or voice overrides.`,
|
||||
);
|
||||
}
|
||||
|
||||
const overridesRecord = providerOverrides as SpeechProviderOverrides;
|
||||
return {
|
||||
provider: selectedProvider,
|
||||
providerOverrides: {
|
||||
[provider.id]: overridesRecord,
|
||||
},
|
||||
};
|
||||
}
|
||||
|
||||
export function getTtsMaxLength(prefsPath: string): number {
|
||||
const prefs = readPrefs(prefsPath);
|
||||
return prefs.tts?.maxLength ?? DEFAULT_TTS_MAX_LENGTH;
|
||||
|
||||
@@ -131,9 +131,6 @@ function normalizeResolvedModel(params: {
|
||||
const normalizedInputModel = {
|
||||
...params.model,
|
||||
input: resolveProviderModelInput({
|
||||
- provider: params.provider,
- modelId: params.model.id,
- modelName: params.model.name,
|
||||
input: params.model.input,
|
||||
}),
|
||||
} as Model<Api>;
|
||||
@@ -233,7 +230,6 @@ function findInlineModelMatch(params: {
|
||||
}
|
||||
|
||||
export { buildModelAliasLines };
|
||||
- export { buildInlineProviderModels };
|
||||
|
||||
function resolveConfiguredProviderConfig(
|
||||
cfg: OpenClawConfig | undefined,
|
||||
@@ -250,6 +246,17 @@ function resolveConfiguredProviderConfig(
|
||||
return findNormalizedProviderValue(configuredProviders, provider);
|
||||
}
|
||||
|
||||
function resolveProviderModelInput(params: {
|
||||
input?: unknown;
|
||||
fallbackInput?: unknown;
|
||||
}): Array<"text" | "image"> {
|
||||
const resolvedInput = Array.isArray(params.input) ? params.input : params.fallbackInput;
|
||||
const normalizedInput = Array.isArray(resolvedInput)
|
||||
? resolvedInput.filter((item): item is "text" | "image" => item === "text" || item === "image")
|
||||
: [];
|
||||
return normalizedInput.length > 0 ? normalizedInput : ["text"];
|
||||
}
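
To make the fallback behaviour concrete, the snippet below copies `resolveProviderModelInput` as it appears in this hunk and runs it against a few inputs. The sample inputs are made up for illustration; note in particular that an empty configured array does not fall through to `fallbackInput`.

```typescript
// Copied from the hunk above so the example is self-contained.
function resolveProviderModelInput(params: {
  input?: unknown;
  fallbackInput?: unknown;
}): Array<"text" | "image"> {
  const resolvedInput = Array.isArray(params.input) ? params.input : params.fallbackInput;
  const normalizedInput = Array.isArray(resolvedInput)
    ? resolvedInput.filter((item): item is "text" | "image" => item === "text" || item === "image")
    : [];
  return normalizedInput.length > 0 ? normalizedInput : ["text"];
}

// Unknown modalities are dropped.
console.log(resolveProviderModelInput({ input: ["text", "image", "audio"] })); // ["text", "image"]
// A non-array `input` is ignored in favour of `fallbackInput`.
console.log(resolveProviderModelInput({ input: "image", fallbackInput: ["image"] })); // ["image"]
// An empty array still counts as "configured", so the result defaults to text.
console.log(resolveProviderModelInput({ input: [], fallbackInput: ["image"] })); // ["text"]
// Nothing supplied at all: text-only.
console.log(resolveProviderModelInput({})); // ["text"]
```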
|
||||
|
||||
function applyConfiguredProviderOverrides(params: {
|
||||
provider: string;
|
||||
discoveredModel: ProviderRuntimeModel;
|
||||
@@ -290,9 +297,6 @@ function applyConfiguredProviderOverrides(params: {
|
||||
};
|
||||
}
|
||||
const normalizedInput = resolveProviderModelInput({
|
||||
- provider: params.provider,
- modelId,
- modelName: configuredModel?.name ?? discoveredModel.name,
|
||||
input: configuredModel?.input,
|
||||
fallbackInput: discoveredModel.input,
|
||||
});
|
||||
@@ -337,6 +341,54 @@ function applyConfiguredProviderOverrides(params: {
|
||||
);
|
||||
}
|
||||
|
||||
export function buildInlineProviderModels(
|
||||
providers: Record<string, InlineProviderConfig>,
|
||||
): InlineModelEntry[] {
|
||||
return Object.entries(providers).flatMap(([providerId, entry]) => {
|
||||
const trimmed = providerId.trim();
|
||||
if (!trimmed) {
|
||||
return [];
|
||||
}
|
||||
const providerHeaders = sanitizeModelHeaders(entry?.headers, {
|
||||
stripSecretRefMarkers: true,
|
||||
});
|
||||
const providerRequest = sanitizeConfiguredModelProviderRequest(entry?.request);
|
||||
return (entry?.models ?? []).map((model) => {
|
||||
const transport = resolveProviderTransport({
|
||||
provider: trimmed,
|
||||
api: model.api ?? entry?.api,
|
||||
baseUrl: entry?.baseUrl,
|
||||
});
|
||||
const modelHeaders = sanitizeModelHeaders((model as InlineModelEntry).headers, {
|
||||
stripSecretRefMarkers: true,
|
||||
});
|
||||
const requestConfig = resolveProviderRequestConfig({
|
||||
provider: trimmed,
|
||||
api: transport.api ?? model.api,
|
||||
baseUrl: transport.baseUrl,
|
||||
providerHeaders,
|
||||
modelHeaders,
|
||||
authHeader: entry?.authHeader,
|
||||
request: providerRequest,
|
||||
capability: "llm",
|
||||
transport: "stream",
|
||||
});
|
||||
return attachModelProviderRequestTransport(
|
||||
{
|
||||
...model,
|
||||
input: resolveProviderModelInput({
|
||||
input: model.input,
|
||||
}),
|
||||
provider: trimmed,
|
||||
baseUrl: requestConfig.baseUrl ?? transport.baseUrl,
|
||||
api: requestConfig.api ?? model.api,
|
||||
headers: requestConfig.headers,
|
||||
},
|
||||
providerRequest,
|
||||
);
|
||||
});
|
||||
});
|
||||
}
|
||||
function resolveExplicitModelWithRegistry(params: {
|
||||
provider: string;
|
||||
modelId: string;
|
||||
@@ -505,9 +557,6 @@ function resolveConfiguredFallbackModel(params: {
|
||||
baseUrl: requestConfig.baseUrl,
|
||||
reasoning: configuredModel?.reasoning ?? false,
|
||||
input: resolveProviderModelInput({
|
||||
- provider,
- modelId,
- modelName: configuredModel?.name ?? modelId,
|
||||
input: configuredModel?.input,
|
||||
}),
|
||||
cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
|
||||
|
||||
@@ -249,11 +249,6 @@ export const TOOL_DISPLAY_CONFIG: ToolDisplayConfig = {
|
||||
},
|
||||
},
|
||||
},
|
||||
- update_plan: {
- emoji: "🗺️",
- title: "Update Plan",
- detailKeys: ["explanation", "plan.0.step"],
- },
|
||||
gateway: {
|
||||
emoji: "🔌",
|
||||
title: "Gateway",
|
||||
|
||||
@@ -4,6 +4,7 @@ import type { RuntimeWebSearchMetadata } from "../../secrets/runtime-web-tools.t
|
||||
import {
|
||||
resolveWebSearchDefinition,
|
||||
resolveWebSearchProviderId,
|
||||
runWebSearch,
|
||||
} from "../../web-search/runtime.js";
|
||||
import type { AnyAgentTool } from "./common.js";
|
||||
import { jsonResult } from "./common.js";
|
||||
@@ -16,16 +17,17 @@ export function createWebSearchTool(options?: {
|
||||
}): AnyAgentTool | null {
|
||||
const runtimeProviderId =
|
||||
options?.runtimeWebSearch?.selectedProvider ?? options?.runtimeWebSearch?.providerConfigured;
|
||||
+ const preferRuntimeProviders =
+ Boolean(runtimeProviderId) &&
+ !resolveManifestContractOwnerPluginId({
+ contract: "webSearchProviders",
+ value: runtimeProviderId,
+ origin: "bundled",
+ config: options?.config,
+ });
  const resolved = resolveWebSearchDefinition({
  ...options,
- preferRuntimeProviders:
- Boolean(runtimeProviderId) &&
- !resolveManifestContractOwnerPluginId({
- contract: "webSearchProviders",
- value: runtimeProviderId,
- origin: "bundled",
- config: options?.config,
- }),
+ preferRuntimeProviders,
|
||||
});
|
||||
if (!resolved) {
|
||||
return null;
|
||||
@@ -36,7 +38,19 @@ export function createWebSearchTool(options?: {
|
||||
name: "web_search",
|
||||
description: resolved.definition.description,
|
||||
parameters: resolved.definition.parameters,
|
||||
- execute: async (_toolCallId, args) => jsonResult(await resolved.definition.execute(args)),
+ execute: async (_toolCallId, args) => {
+ const result = await runWebSearch({
+ config: options?.config,
+ sandboxed: options?.sandboxed,
+ runtimeWebSearch: options?.runtimeWebSearch,
+ preferRuntimeProviders,
+ args,
+ });
+ return jsonResult({
+ ...result.result,
+ provider: result.provider,
+ });
+ },
|
||||
};
|
||||
}
|
||||
|
||||
|
||||
703
src/cli/capability-cli.test.ts
Normal file
@@ -0,0 +1,703 @@
+import fs from "node:fs/promises";
+import os from "node:os";
+import path from "node:path";
+import { Command } from "commander";
+import { beforeEach, describe, expect, it, vi } from "vitest";
+import { runRegisteredCli } from "../test-utils/command-runner.js";
+import { registerCapabilityCli } from "./capability-cli.js";
+
+const mocks = vi.hoisted(() => ({
+  runtime: {
+    log: vi.fn(),
+    error: vi.fn(),
+    exit: vi.fn((code: number) => {
+      throw new Error(`exit ${code}`);
+    }),
+    writeJson: vi.fn(),
+    writeStdout: vi.fn(),
+  },
+  loadConfig: vi.fn(() => ({})),
+  loadAuthProfileStoreForRuntime: vi.fn(() => ({ profiles: {}, order: {} })),
+  listProfilesForProvider: vi.fn(() => []),
+  resolveMemorySearchConfig: vi.fn(() => null),
+  loadModelCatalog: vi.fn(async () => []),
+  agentCommand: vi.fn(async () => ({
+    payloads: [{ text: "local reply" }],
+    meta: { agentMeta: { provider: "openai", model: "gpt-5.4" } },
+  })),
+  callGateway: vi.fn(async ({ method }: { method: string }) => {
+    if (method === "tts.status") {
+      return { enabled: true, provider: "openai" };
+    }
+    if (method === "agent") {
+      return {
+        result: {
+          payloads: [{ text: "gateway reply" }],
+          meta: { agentMeta: { provider: "anthropic", model: "claude-sonnet-4-6" } },
+        },
+      };
+    }
+    return {};
+  }),
+  describeImageFile: vi.fn(async () => ({
+    text: "friendly lobster",
+    provider: "openai",
+    model: "gpt-4.1-mini",
+  })),
+  generateImage: vi.fn(),
+  transcribeAudioFile: vi.fn(async () => ({ text: "meeting notes" })),
+  textToSpeech: vi.fn(async () => ({
+    success: true,
+    audioPath: "/tmp/tts-source.mp3",
+    provider: "openai",
+    outputFormat: "mp3",
+    voiceCompatible: false,
+    attempts: [],
+  })),
+  setTtsProvider: vi.fn(),
+  resolveExplicitTtsOverrides: vi.fn(
+    ({
+      provider,
+      modelId,
+      voiceId,
+    }: {
+      provider?: string;
+      modelId?: string;
+      voiceId?: string;
+    }) => ({
+      ...(provider ? { provider } : {}),
+      ...(modelId || voiceId
+        ? {
+            providerOverrides: {
+              [provider ?? "openai"]: {
+                ...(modelId ? { modelId } : {}),
+                ...(voiceId ? { voiceId } : {}),
+              },
+            },
+          }
+        : {}),
+    }),
+  ),
+  createEmbeddingProvider: vi.fn(async () => ({
+    provider: {
+      id: "openai",
+      model: "text-embedding-3-small",
+      embedQuery: async () => [0.1, 0.2],
+      embedBatch: async (texts: string[]) => texts.map(() => [0.1, 0.2]),
+    },
+  })),
+  registerMemoryEmbeddingProvider: vi.fn(),
+  listMemoryEmbeddingProviders: vi.fn(() => [
+    { id: "openai", defaultModel: "text-embedding-3-small", transport: "remote" },
+  ]),
+  registerBuiltInMemoryEmbeddingProviders: vi.fn(),
+  isWebSearchProviderConfigured: vi.fn(() => false),
+  isWebFetchProviderConfigured: vi.fn(() => false),
+  modelsStatusCommand: vi.fn(
+    async (_opts: unknown, runtime: { log: (...args: unknown[]) => void }) => {
+      runtime.log(JSON.stringify({ ok: true, providers: [{ id: "openai" }] }));
+    },
+  ),
+}));
+
+vi.mock("../runtime.js", () => ({
+  defaultRuntime: mocks.runtime,
+  writeRuntimeJson: (runtime: { writeJson: (value: unknown) => void }, value: unknown) =>
+    runtime.writeJson(value),
+}));
+
+vi.mock("../config/config.js", () => ({
+  loadConfig: (...args: unknown[]) => mocks.loadConfig(...args),
+}));
+
+vi.mock("../agents/agent-command.js", () => ({
+  agentCommand: (...args: unknown[]) => mocks.agentCommand(...args),
+}));
+
+vi.mock("../agents/agent-scope.js", () => ({
+  resolveDefaultAgentId: () => "main",
+  resolveAgentDir: () => "/tmp/agent",
+}));
+
+vi.mock("../agents/model-catalog.js", () => ({
+  loadModelCatalog: (...args: unknown[]) => mocks.loadModelCatalog(...args),
+}));
+
+vi.mock("../agents/auth-profiles.js", () => ({
+  loadAuthProfileStoreForRuntime: (...args: unknown[]) =>
+    mocks.loadAuthProfileStoreForRuntime(...args),
+  listProfilesForProvider: (...args: unknown[]) => mocks.listProfilesForProvider(...args),
+}));
+
+vi.mock("../agents/memory-search.js", () => ({
+  resolveMemorySearchConfig: (...args: unknown[]) => mocks.resolveMemorySearchConfig(...args),
+}));
+
+vi.mock("../commands/models.js", () => ({
+  modelsAuthLoginCommand: vi.fn(),
+  modelsStatusCommand: (...args: unknown[]) => mocks.modelsStatusCommand(...args),
+}));
+
+vi.mock("../gateway/call.js", () => ({
+  callGateway: (...args: unknown[]) => mocks.callGateway(...args),
+  randomIdempotencyKey: () => "run-1",
+}));
+
+vi.mock("../gateway/connection-details.js", () => ({
+  buildGatewayConnectionDetailsWithResolvers: vi.fn(() => ({
+    url: "ws://127.0.0.1:18789",
+    urlSource: "local loopback",
+    message: "Gateway target: ws://127.0.0.1:18789",
+  })),
+}));
+
+vi.mock("../media-understanding/runtime.js", () => ({
+  describeImageFile: (...args: unknown[]) => mocks.describeImageFile(...args),
+  describeVideoFile: vi.fn(),
+  transcribeAudioFile: (...args: unknown[]) => mocks.transcribeAudioFile(...args),
+}));
+
+vi.mock("../../extensions/memory-core/src/memory/embeddings.js", () => ({
+  createEmbeddingProvider: (...args: unknown[]) => mocks.createEmbeddingProvider(...args),
+}));
+
+vi.mock("../plugins/memory-embedding-providers.js", () => ({
+  listMemoryEmbeddingProviders: (...args: unknown[]) => mocks.listMemoryEmbeddingProviders(...args),
+  registerMemoryEmbeddingProvider: (...args: unknown[]) =>
+    mocks.registerMemoryEmbeddingProvider(...args),
+}));
+
+vi.mock("../../extensions/memory-core/src/memory/provider-adapters.js", () => ({
+  registerBuiltInMemoryEmbeddingProviders: (...args: unknown[]) =>
+    mocks.registerBuiltInMemoryEmbeddingProviders(...args),
+}));
+
+vi.mock("../image-generation/runtime.js", () => ({
+  generateImage: (...args: unknown[]) => mocks.generateImage(...args),
+  listRuntimeImageGenerationProviders: vi.fn(() => []),
+}));
+
+vi.mock("../video-generation/runtime.js", () => ({
+  generateVideo: vi.fn(),
+  listRuntimeVideoGenerationProviders: vi.fn(() => []),
+}));
+
+vi.mock("../tts/tts.js", () => ({
+  getTtsProvider: vi.fn(() => "openai"),
+  listSpeechVoices: vi.fn(async () => []),
+  resolveTtsConfig: vi.fn(() => ({})),
+  resolveTtsPrefsPath: vi.fn(() => "/tmp/tts.json"),
+  setTtsEnabled: vi.fn(),
+  setTtsProvider: (...args: unknown[]) => mocks.setTtsProvider(...args),
+  resolveExplicitTtsOverrides: (...args: unknown[]) => mocks.resolveExplicitTtsOverrides(...args),
+  textToSpeech: (...args: unknown[]) => mocks.textToSpeech(...args),
+}));
+
+vi.mock("../tts/provider-registry.js", () => ({
+  canonicalizeSpeechProviderId: vi.fn((provider: string) => provider),
+  listSpeechProviders: vi.fn(() => []),
+}));
+
+vi.mock("../web-search/runtime.js", () => ({
+  listWebSearchProviders: vi.fn(() => []),
+  isWebSearchProviderConfigured: (...args: unknown[]) =>
+    mocks.isWebSearchProviderConfigured(...args),
+  runWebSearch: vi.fn(),
+}));
+
+vi.mock("../web-fetch/runtime.js", () => ({
+  listWebFetchProviders: vi.fn(() => []),
+  isWebFetchProviderConfigured: (...args: unknown[]) => mocks.isWebFetchProviderConfigured(...args),
+  resolveWebFetchDefinition: vi.fn(),
+}));
+
+describe("capability cli", () => {
+  beforeEach(() => {
+    mocks.runtime.log.mockClear();
+    mocks.runtime.error.mockClear();
+    mocks.runtime.writeJson.mockClear();
+    mocks.loadModelCatalog
+      .mockReset()
+      .mockResolvedValue([{ id: "gpt-5.4", provider: "openai", name: "GPT-5.4" }]);
+    mocks.loadAuthProfileStoreForRuntime.mockReset().mockReturnValue({ profiles: {}, order: {} });
+    mocks.listProfilesForProvider.mockReset().mockReturnValue([]);
+    mocks.resolveMemorySearchConfig.mockReset().mockReturnValue(null);
+    mocks.agentCommand.mockClear();
+    mocks.callGateway.mockClear().mockImplementation(async ({ method }: { method: string }) => {
+      if (method === "tts.status") {
+        return { enabled: true, provider: "openai" };
+      }
+      if (method === "agent") {
+        return {
+          result: {
+            payloads: [{ text: "gateway reply" }],
+            meta: { agentMeta: { provider: "anthropic", model: "claude-sonnet-4-6" } },
+          },
+        };
+      }
+      return {};
+    });
+    mocks.describeImageFile.mockClear();
+    mocks.generateImage.mockReset();
+    mocks.transcribeAudioFile.mockClear();
+    mocks.textToSpeech.mockClear();
+    mocks.setTtsProvider.mockClear();
+    mocks.resolveExplicitTtsOverrides.mockClear();
+    mocks.createEmbeddingProvider.mockClear();
+    mocks.registerMemoryEmbeddingProvider.mockClear();
+    mocks.registerBuiltInMemoryEmbeddingProviders.mockClear();
+    mocks.isWebSearchProviderConfigured.mockReset().mockReturnValue(false);
+    mocks.isWebFetchProviderConfigured.mockReset().mockReturnValue(false);
+    mocks.modelsStatusCommand.mockClear();
+    mocks.callGateway.mockImplementation(async ({ method }: { method: string }) => {
+      if (method === "tts.status") {
+        return { enabled: true, provider: "openai" };
+      }
+      if (method === "tts.convert") {
+        return {
+          audioPath: "/tmp/gateway-tts.mp3",
+          provider: "openai",
+          outputFormat: "mp3",
+          voiceCompatible: false,
+        };
+      }
+      if (method === "agent") {
+        return {
+          result: {
+            payloads: [{ text: "gateway reply" }],
+            meta: { agentMeta: { provider: "anthropic", model: "claude-sonnet-4-6" } },
+          },
+        };
+      }
+      return {};
+    });
+  });
+
+  it("lists canonical capabilities", async () => {
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: ["capability", "list", "--json"],
+    });
+
+    const payload = mocks.runtime.writeJson.mock.calls[0]?.[0] as Array<{ id: string }>;
+    expect(payload.some((entry) => entry.id === "model.run")).toBe(true);
+    expect(payload.some((entry) => entry.id === "media.image.describe")).toBe(true);
+  });
+
+  it("defaults model run to local transport", async () => {
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: ["capability", "model", "run", "--prompt", "hello", "--json"],
+    });
+
+    expect(mocks.agentCommand).toHaveBeenCalledTimes(1);
+    expect(mocks.callGateway).not.toHaveBeenCalled();
+    expect(mocks.runtime.writeJson).toHaveBeenCalledWith(
+      expect.objectContaining({
+        capability: "model.run",
+        transport: "local",
+      }),
+    );
+  });
+
+  it("defaults tts status to gateway transport", async () => {
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: ["capability", "media", "tts", "status", "--json"],
+    });
+
+    expect(mocks.callGateway).toHaveBeenCalledWith(
+      expect.objectContaining({ method: "tts.status" }),
+    );
+    expect(mocks.runtime.writeJson).toHaveBeenCalledWith(
+      expect.objectContaining({ transport: "gateway" }),
+    );
+  });
+
+  it("routes image describe through media understanding, not generation", async () => {
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: ["capability", "media", "image", "describe", "--file", "photo.jpg", "--json"],
+    });
+
+    expect(mocks.describeImageFile).toHaveBeenCalledWith(
+      expect.objectContaining({ filePath: expect.stringMatching(/photo\.jpg$/) }),
+    );
+    expect(mocks.runtime.writeJson).toHaveBeenCalledWith(
+      expect.objectContaining({
+        capability: "media.image.describe",
+        outputs: [expect.objectContaining({ kind: "image.description" })],
+      }),
+    );
+  });
+
+  it("fails image describe when no description text is returned", async () => {
+    mocks.describeImageFile.mockResolvedValueOnce({
+      text: undefined,
+      provider: undefined,
+      model: undefined,
+    });
+
+    await expect(
+      runRegisteredCli({
+        register: registerCapabilityCli as (program: Command) => void,
+        argv: ["capability", "media", "image", "describe", "--file", "photo.jpg", "--json"],
+      }),
+    ).rejects.toThrow("exit 1");
+    expect(mocks.runtime.error).toHaveBeenCalledWith(
+      expect.stringMatching(/No description returned for image/),
+    );
+  });
+
+  it("rewrites mismatched explicit image output extensions to the detected file type", async () => {
+    const jpegBase64 =
"/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEAAkGBxAQEBUQEBAVFRUVFRUVFRUVFRUVFRUVFRUXFhUVFRUYHSggGBolHRUVITEhJSkrLi4uFx8zODMsNygtLisBCgoKDg0OGhAQGi0fHyUtLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLf/AABEIAAEAAQMBIgACEQEDEQH/xAAXAAEBAQEAAAAAAAAAAAAAAAAAAQID/8QAFhEBAQEAAAAAAAAAAAAAAAAAAAER/9oADAMBAAIQAxAAAAH2AP/EABgQAQEAAwAAAAAAAAAAAAAAAAEAEQIS/9oACAEBAAEFAk1o7//EABYRAQEBAAAAAAAAAAAAAAAAAAABEf/aAAgBAwEBPwGn/8QAFhEBAQEAAAAAAAAAAAAAAAAAABEB/9oACAECAQE/AYf/xAAaEAACAgMAAAAAAAAAAAAAAAABEQAhMUFh/9oACAEBAAY/AjK9cY2f/8QAGhABAQACAwAAAAAAAAAAAAAAAAERITFBUf/aAAgBAQABPyGQk7W5jVYkA//Z";
+    mocks.generateImage.mockResolvedValue({
+      provider: "openai",
+      model: "gpt-image-1",
+      attempts: [],
+      images: [
+        {
+          buffer: Buffer.from(jpegBase64, "base64"),
+          mimeType: "image/png",
+          fileName: "provider-output.png",
+        },
+      ],
+    });
+
+    const tempOutput = path.join(os.tmpdir(), `openclaw-image-mismatch-${Date.now()}.png`);
+    await fs.rm(tempOutput, { force: true });
+    await fs.rm(tempOutput.replace(/\.png$/, ".jpg"), { force: true });
+
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: [
+        "capability",
+        "media",
+        "image",
+        "generate",
+        "--prompt",
+        "friendly lobster",
+        "--output",
+        tempOutput,
+        "--json",
+      ],
+    });
+
+    expect(mocks.runtime.writeJson).toHaveBeenCalledWith(
+      expect.objectContaining({
+        outputs: [
+          expect.objectContaining({
+            path: tempOutput.replace(/\.png$/, ".jpg"),
+            mimeType: "image/jpeg",
+          }),
+        ],
+      }),
+    );
+  });
+
+  it("routes audio transcribe through transcription, not realtime", async () => {
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: ["capability", "media", "audio", "transcribe", "--file", "memo.m4a", "--json"],
+    });
+
+    expect(mocks.transcribeAudioFile).toHaveBeenCalledWith(
+      expect.objectContaining({ filePath: expect.stringMatching(/memo\.m4a$/) }),
+    );
+    expect(mocks.runtime.writeJson).toHaveBeenCalledWith(
+      expect.objectContaining({
+        capability: "media.audio.transcribe",
+        outputs: [expect.objectContaining({ kind: "audio.transcription" })],
+      }),
+    );
+  });
+
+  it("fails audio transcribe when no transcript text is returned", async () => {
+    mocks.transcribeAudioFile.mockResolvedValueOnce({ text: undefined });
+
+    await expect(
+      runRegisteredCli({
+        register: registerCapabilityCli as (program: Command) => void,
+        argv: ["capability", "media", "audio", "transcribe", "--file", "memo.m4a", "--json"],
+      }),
+    ).rejects.toThrow("exit 1");
+    expect(mocks.runtime.error).toHaveBeenCalledWith(
+      expect.stringMatching(/No transcript returned for audio/),
+    );
+  });
+
+  it("forwards transcription prompt and language hints", async () => {
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: [
+        "capability",
+        "media",
+        "audio",
+        "transcribe",
+        "--file",
+        "memo.m4a",
+        "--language",
+        "en",
+        "--prompt",
+        "Focus on names",
+        "--json",
+      ],
+    });
+
+    expect(mocks.transcribeAudioFile).toHaveBeenCalledWith(
+      expect.objectContaining({
+        filePath: expect.stringMatching(/memo\.m4a$/),
+        language: "en",
+        prompt: "Focus on names",
+      }),
+    );
+  });
+
+  it("uses request-scoped TTS overrides without mutating prefs", async () => {
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: [
+        "capability",
+        "media",
+        "tts",
+        "convert",
+        "--text",
+        "hello",
+        "--model",
+        "openai/gpt-4o-mini-tts",
+        "--voice",
+        "alloy",
+        "--json",
+      ],
+    });
+
+    expect(mocks.textToSpeech).toHaveBeenCalledWith(
+      expect.objectContaining({
+        overrides: expect.objectContaining({
+          provider: "openai",
+          providerOverrides: expect.objectContaining({
+            openai: expect.objectContaining({
+              modelId: "gpt-4o-mini-tts",
+              voiceId: "alloy",
+            }),
+          }),
+        }),
+      }),
+    );
+    expect(mocks.setTtsProvider).not.toHaveBeenCalled();
+  });
+
+  it("disables TTS fallback when explicit provider or voice/model selection is requested", async () => {
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: [
+        "capability",
+        "media",
+        "tts",
+        "convert",
+        "--text",
+        "hello",
+        "--model",
+        "openai/gpt-4o-mini-tts",
+        "--voice",
+        "alloy",
+        "--json",
+      ],
+    });
+
+    expect(mocks.textToSpeech).toHaveBeenCalledWith(
+      expect.objectContaining({
+        disableFallback: true,
+      }),
+    );
+  });
+
+  it("does not infer and forward a local provider guess for gateway TTS overrides", async () => {
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: [
+        "capability",
+        "media",
+        "tts",
+        "convert",
+        "--gateway",
+        "--text",
+        "hello",
+        "--voice",
+        "alloy",
+        "--json",
+      ],
+    });
+
+    expect(mocks.callGateway).toHaveBeenCalledWith(
+      expect.objectContaining({
+        method: "tts.convert",
+        params: expect.objectContaining({
+          provider: undefined,
+          voiceId: "alloy",
+        }),
+      }),
+    );
+  });
+
+  it("fails clearly when gateway TTS output is requested against a remote gateway", async () => {
+    const gatewayConnection = await import("../gateway/connection-details.js");
+    vi.mocked(gatewayConnection.buildGatewayConnectionDetailsWithResolvers).mockReturnValueOnce({
+      url: "wss://gateway.example.com",
+      urlSource: "config gateway.remote.url",
+      message: "Gateway target: wss://gateway.example.com",
+    });
+
+    await expect(
+      runRegisteredCli({
+        register: registerCapabilityCli as (program: Command) => void,
+        argv: [
+          "capability",
+          "media",
+          "tts",
+          "convert",
+          "--gateway",
+          "--text",
+          "hello",
+          "--output",
+          "hello.mp3",
+          "--json",
+        ],
+      }),
+    ).rejects.toThrow("exit 1");
+
+    expect(mocks.runtime.error).toHaveBeenCalledWith(
+      expect.stringContaining("--output is not supported for remote gateway TTS yet"),
+    );
+  });
+
+  it("uses only embedding providers for embedding creation", async () => {
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: ["capability", "embedding", "create", "--text", "hello", "--json"],
+    });
+
+    expect(mocks.createEmbeddingProvider).toHaveBeenCalledWith(
+      expect.objectContaining({
+        provider: "auto",
+        fallback: "none",
+      }),
+    );
+    expect(mocks.runtime.writeJson).toHaveBeenCalledWith(
+      expect.objectContaining({
+        capability: "embedding.create",
+        provider: "openai",
+        model: "text-embedding-3-small",
+      }),
+    );
+  });
+
+  it("bootstraps built-in embedding providers when the registry is empty", async () => {
+    mocks.listMemoryEmbeddingProviders.mockReturnValueOnce([]);
+
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: ["capability", "embedding", "providers", "--json"],
+    });
+
+    expect(mocks.registerBuiltInMemoryEmbeddingProviders).toHaveBeenCalledWith(
+      expect.objectContaining({
+        registerMemoryEmbeddingProvider: expect.any(Function),
+      }),
+    );
+  });
+
+  it("surfaces available, configured, and selected for web providers", async () => {
+    mocks.loadConfig.mockReturnValue({
+      tools: {
+        web: {
+          search: { provider: "gemini" },
+          fetch: { provider: "firecrawl" },
+        },
+      },
+    });
+    const webSearchRuntime = await import("../web-search/runtime.js");
+    const webFetchRuntime = await import("../web-fetch/runtime.js");
+    vi.mocked(webSearchRuntime.listWebSearchProviders).mockReturnValue([
+      { id: "brave", envVars: ["BRAVE_API_KEY"] } as never,
+      { id: "gemini", envVars: ["GEMINI_API_KEY"] } as never,
+    ]);
+    vi.mocked(webFetchRuntime.listWebFetchProviders).mockReturnValue([
+      { id: "firecrawl", envVars: ["FIRECRAWL_API_KEY"] } as never,
+    ]);
+    mocks.isWebSearchProviderConfigured.mockReturnValueOnce(false).mockReturnValueOnce(true);
+    mocks.isWebFetchProviderConfigured.mockReturnValueOnce(true);
+
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: ["capability", "web", "providers", "--json"],
+    });
+
+    expect(mocks.runtime.writeJson).toHaveBeenCalledWith({
+      search: [
+        {
+          available: true,
+          configured: false,
+          selected: false,
+          id: "brave",
+          envVars: ["BRAVE_API_KEY"],
+        },
+        {
+          available: true,
+          configured: true,
+          selected: true,
+          id: "gemini",
+          envVars: ["GEMINI_API_KEY"],
+        },
+      ],
+      fetch: [
+        {
+          available: true,
+          configured: true,
+          selected: true,
+          id: "firecrawl",
+          envVars: ["FIRECRAWL_API_KEY"],
+        },
+      ],
+    });
+  });
+
+  it("surfaces selected and configured embedding provider state", async () => {
+    mocks.loadConfig.mockReturnValue({});
+    mocks.resolveMemorySearchConfig.mockReturnValue({
+      provider: "gemini",
+      model: "gemini-embedding-001",
+    });
+    mocks.listMemoryEmbeddingProviders.mockReturnValue([
+      { id: "openai", defaultModel: "text-embedding-3-small", transport: "remote" },
+      { id: "gemini", defaultModel: "gemini-embedding-001", transport: "remote" },
+    ]);
+
+    await runRegisteredCli({
+      register: registerCapabilityCli as (program: Command) => void,
+      argv: ["capability", "embedding", "providers", "--json"],
+    });
+
+    expect(mocks.runtime.writeJson).toHaveBeenCalledWith([
+      {
+        available: true,
+        configured: false,
+        selected: false,
+        id: "openai",
+        defaultModel: "text-embedding-3-small",
+        transport: "remote",
+        autoSelectPriority: undefined,
+      },
+      {
+        available: true,
+        configured: true,
+        selected: true,
+        id: "gemini",
+        defaultModel: "gemini-embedding-001",
+        transport: "remote",
+        autoSelectPriority: undefined,
+      },
+    ]);
+  });
+});
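The image-generation test above expects a JPEG payload that the provider labelled `image/png` to be written with a `.jpg` extension and reported as `image/jpeg`. The implementation lives in `src/cli/capability-cli.ts`, whose diff is not shown here, so the following is only a sketch of one way to get that behavior. `sniffImage` and `alignOutputPath` are hypothetical names, and only JPEG and PNG signatures are handled.

```typescript
type DetectedImage = { mimeType: string; extensions: string[] };

const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];

// Identify JPEG or PNG from leading magic bytes; other formats return undefined.
function sniffImage(bytes: Uint8Array): DetectedImage | undefined {
  if (bytes.length >= 3 && bytes[0] === 0xff && bytes[1] === 0xd8 && bytes[2] === 0xff) {
    return { mimeType: "image/jpeg", extensions: [".jpg", ".jpeg"] };
  }
  if (bytes.length >= 8 && PNG_SIGNATURE.every((byte, i) => bytes[i] === byte)) {
    return { mimeType: "image/png", extensions: [".png"] };
  }
  return undefined;
}

// Hypothetical helper: keep the requested path unless its extension contradicts the bytes.
function alignOutputPath(requestedPath: string, bytes: Uint8Array, declaredMime: string) {
  const detected = sniffImage(bytes);
  if (!detected) {
    // Unknown format: trust what the provider declared.
    return { path: requestedPath, mimeType: declaredMime };
  }
  const match = /\.[^./\\]+$/.exec(requestedPath);
  const currentExt = match ? match[0].toLowerCase() : "";
  if (detected.extensions.includes(currentExt)) {
    return { path: requestedPath, mimeType: detected.mimeType };
  }
  const stem = match ? requestedPath.slice(0, match.index) : requestedPath;
  return { path: stem + detected.extensions[0], mimeType: detected.mimeType };
}
```

Sniffing the bytes rather than trusting the declared MIME type keeps the file name, the reported `mimeType`, and the actual content in agreement.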
1805
src/cli/capability-cli.ts
Normal file
File diff suppressed because it is too large
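Since the `capability-cli.ts` diff is not displayed, the TTS tests are the only visible specification of how explicit `--model` and `--voice` flags are handled: a `provider/model` reference becomes a request-scoped override, fallback is disabled, and stored prefs are left alone. A minimal sketch consistent with those tests, assuming hypothetical helpers `splitModelRef` and `buildTtsRequest` (neither name comes from the source):

```typescript
type TtsFlags = { model?: string; voice?: string; provider?: string };

type TtsRequest = {
  overrides: {
    provider?: string;
    providerOverrides?: Record<string, { modelId?: string; voiceId?: string }>;
  };
  disableFallback: boolean;
};

// Hypothetical: split "provider/model" refs; a bare model id carries no provider.
function splitModelRef(ref: string): { provider?: string; modelId: string } {
  const slash = ref.indexOf("/");
  if (slash <= 0) {
    return { modelId: ref };
  }
  return { provider: ref.slice(0, slash), modelId: ref.slice(slash + 1) };
}

function buildTtsRequest(flags: TtsFlags, defaultProvider: string): TtsRequest {
  const parsed = flags.model ? splitModelRef(flags.model) : undefined;
  const provider = flags.provider ?? parsed?.provider;
  const modelId = parsed?.modelId;
  const voiceId = flags.voice;

  const overrides: TtsRequest["overrides"] = {};
  if (provider) {
    overrides.provider = provider;
  }
  if (modelId || voiceId) {
    // Overrides are keyed by provider so they only apply to the chosen backend.
    overrides.providerOverrides = {
      [provider ?? defaultProvider]: {
        ...(modelId ? { modelId } : {}),
        ...(voiceId ? { voiceId } : {}),
      },
    };
  }
  // Any explicit selection means a silent fallback would be surprising.
  return { overrides, disableFallback: Boolean(provider || modelId || voiceId) };
}
```

The request object is built per call, so nothing in this sketch writes to persisted preferences.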
@@ -74,6 +74,15 @@ const entrySpecs: readonly CommandGroupDescriptorSpec<SubCliRegistrar>[] = [
     loadModule: () => import("../models-cli.js"),
     exportName: "registerModelsCli",
   },
+  {
+    name: "capability",
+    description: "Run provider-backed capability commands",
+    hasSubcommands: true,
+    register: async (program) => {
+      const mod = await import("../capability-cli.js");
+      mod.registerCapabilityCli(program);
+    },
+  },
   {
     commandNames: ["approvals"],
     loadModule: () => import("../exec-approvals-cli.js"),
@@ -22,6 +22,11 @@ const subCliCommandCatalog = defineCommandDescriptorCatalog([
     description: "Discover, scan, and configure models",
     hasSubcommands: true,
   },
+  {
+    name: "capability",
+    description: "Run provider-backed capability commands",
+    hasSubcommands: true,
+  },
   {
     name: "approvals",
     description: "Manage exec approvals (gateway or node host)",
@@ -75,6 +75,16 @@ export const FIELD_HELP: Record<string, string> = {
     "Control UI hosting settings including enablement, pathing, and browser-origin/auth hardening behavior. Keep UI exposure minimal and pair with strong auth controls before internet-facing deployments.",
   "gateway.controlUi.enabled":
     "Enables serving the gateway Control UI from the gateway HTTP process when true. Keep enabled for local administration, and disable when an external control surface replaces it.",
+  "gateway.controlUi.voice":
+    "Browser voice settings for the Control UI chat, including realtime transcription provider selection and optional assistant speech playback.",
+  "gateway.controlUi.voice.enabled":
+    "Enables realtime browser voice sessions for the Control UI chat when a transcription provider is configured.",
+  "gateway.controlUi.voice.transcriptionProvider":
+    "Registered realtime transcription provider id used for browser mic input. Keep this explicit so browser voice fails closed when no provider is configured.",
+  "gateway.controlUi.voice.providers":
+    "Provider-owned realtime transcription config keyed by provider id for browser voice sessions.",
+  "gateway.controlUi.voice.playbackEnabled":
+    "Enables browser speech-synthesis playback for finalized assistant replies during a voice session.",
   "gateway.auth":
     "Authentication policy for gateway HTTP/WebSocket access including mode, credentials, trusted-proxy behavior, and rate limiting. Keep auth enabled for every non-loopback deployment.",
   "gateway.auth.mode":
@@ -100,6 +100,17 @@ export type GatewayControlUiConfig = {
   allowInsecureAuth?: boolean;
   /** DANGEROUS: Disable device identity checks for the Control UI (default: false). */
   dangerouslyDisableDeviceAuth?: boolean;
+  /** Realtime voice settings for the browser chat UI. */
+  voice?: {
+    /** Enable browser voice sessions for the Control UI chat. */
+    enabled?: boolean;
+    /** Registered realtime transcription provider id to use for browser voice. */
+    transcriptionProvider?: string;
+    /** Provider-owned realtime transcription config keyed by provider id. */
+    providers?: Record<string, Record<string, unknown>>;
+    /** Enable browser speech synthesis playback for assistant replies. */
+    playbackEnabled?: boolean;
+  };
 };

 export type GatewayAuthMode = "none" | "token" | "password" | "trusted-proxy";
@@ -676,6 +676,15 @@ export const OpenClawSchema = z
             dangerouslyAllowHostHeaderOriginFallback: z.boolean().optional(),
             allowInsecureAuth: z.boolean().optional(),
             dangerouslyDisableDeviceAuth: z.boolean().optional(),
+            voice: z
+              .object({
+                enabled: z.boolean().optional(),
+                transcriptionProvider: z.string().min(1).optional(),
+                providers: z.record(z.string(), z.record(z.string(), z.unknown())).optional(),
+                playbackEnabled: z.boolean().optional(),
+              })
+              .strict()
+              .optional(),
           })
           .strict()
           .optional(),
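The help text for `gateway.controlUi.voice.transcriptionProvider` says browser voice should fail closed when no provider is configured. The resolver that enforces this is not part of the hunks shown, so the sketch below is only one plausible reading. `resolveVoice`, the `reason` strings, the `example-stt` provider id, and the `false` default for `playbackEnabled` are all assumptions.

```typescript
// Mirrors the `voice` block added to GatewayControlUiConfig.
type ControlUiVoiceConfig = {
  enabled?: boolean;
  transcriptionProvider?: string;
  providers?: Record<string, Record<string, unknown>>;
  playbackEnabled?: boolean;
};

type ResolvedVoice =
  | { enabled: false; reason: string }
  | {
      enabled: true;
      providerId: string;
      providerConfig: Record<string, unknown>;
      playbackEnabled: boolean;
    };

// Hypothetical resolver: voice stays off unless it is enabled AND a provider is named.
function resolveVoice(voice: ControlUiVoiceConfig | undefined): ResolvedVoice {
  if (!voice?.enabled) {
    return { enabled: false, reason: "voice is not enabled" };
  }
  const providerId = voice.transcriptionProvider?.trim();
  if (!providerId) {
    return { enabled: false, reason: "no transcription provider configured" };
  }
  return {
    enabled: true,
    providerId,
    providerConfig: voice.providers?.[providerId] ?? {},
    playbackEnabled: voice.playbackEnabled ?? false, // assumed default
  };
}

// Example config fragment; "example-stt" is a made-up provider id.
const exampleVoice: ControlUiVoiceConfig = {
  enabled: true,
  transcriptionProvider: "example-stt",
  providers: { "example-stt": { language: "en" } },
  playbackEnabled: true,
};
const resolvedExample = resolveVoice(exampleVoice);
```

Requiring both `enabled` and an explicit provider id means a half-finished config never opens a microphone session by accident.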
106
src/gateway/chat-voice-sessions.ts
Normal file
@@ -0,0 +1,106 @@
|
||||
import type { RealtimeTranscriptionSession } from "../realtime-transcription/provider-types.js";
|
||||
|
||||
export type ChatVoiceEventPayload = {
|
||||
sessionKey: string;
|
||||
state:
|
||||
| "ready"
|
||||
| "speech_start"
|
||||
| "partial_transcript"
|
||||
| "final_transcript"
|
||||
| "assistant_started"
|
||||
| "assistant_completed"
|
||||
| "playback_clear"
|
||||
| "interrupted"
|
||||
| "error"
|
||||
| "closed";
|
||||
transcript?: string;
|
||||
runId?: string;
|
||||
errorMessage?: string;
|
||||
playbackEnabled?: boolean;
|
||||
};
|
||||
|
||||
export type ChatVoiceSessionEntry = {
|
||||
sessionKey: string;
|
||||
connId: string;
|
||||
providerId: string;
|
||||
playbackEnabled: boolean;
|
||||
sttSession: RealtimeTranscriptionSession;
|
||||
transcriptPartial: string;
|
||||
transcriptFinal: string;
|
||||
activeRunId: string | null;
|
||||
};
|
||||
|
||||
const sessionsByKey = new Map<string, ChatVoiceSessionEntry>();
|
||||
const sessionKeyByRunId = new Map<string, string>();
|
||||
|
||||
export function getChatVoiceSession(sessionKey: string): ChatVoiceSessionEntry | undefined {
  return sessionsByKey.get(sessionKey);
}

export function setChatVoiceSession(entry: ChatVoiceSessionEntry) {
  const existing = sessionsByKey.get(entry.sessionKey);
  if (existing && existing !== entry) {
    try {
      existing.sttSession.close();
    } catch {
      // ignore replacement cleanup errors
    }
    if (existing.activeRunId) {
      sessionKeyByRunId.delete(existing.activeRunId);
    }
  }
  sessionsByKey.set(entry.sessionKey, entry);
}

export function deleteChatVoiceSession(sessionKey: string): ChatVoiceSessionEntry | undefined {
  const entry = sessionsByKey.get(sessionKey);
  if (!entry) {
    return undefined;
  }
  sessionsByKey.delete(sessionKey);
  if (entry.activeRunId) {
    sessionKeyByRunId.delete(entry.activeRunId);
  }
  return entry;
}

export function setChatVoiceRunId(sessionKey: string, runId: string | null) {
  const entry = sessionsByKey.get(sessionKey);
  if (!entry) {
    return;
  }
  if (entry.activeRunId) {
    sessionKeyByRunId.delete(entry.activeRunId);
  }
  entry.activeRunId = runId;
  if (runId) {
    sessionKeyByRunId.set(runId, sessionKey);
  }
}

export function getChatVoiceSessionByRunId(runId: string): ChatVoiceSessionEntry | undefined {
  const sessionKey = sessionKeyByRunId.get(runId);
  return sessionKey ? sessionsByKey.get(sessionKey) : undefined;
}

export function closeChatVoiceSessionsForConn(
  connId: string,
  emit: (connId: string, payload: ChatVoiceEventPayload) => void,
) {
  for (const entry of sessionsByKey.values()) {
    if (entry.connId !== connId) {
      continue;
    }
    try {
      entry.sttSession.close();
    } catch {
      // ignore cleanup errors on disconnect
    }
    deleteChatVoiceSession(entry.sessionKey);
    emit(connId, {
      sessionKey: entry.sessionKey,
      state: "closed",
      playbackEnabled: entry.playbackEnabled,
    });
  }
}
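Reviewer note: the helpers above keep two maps in step, one from session key to entry and one from active run ID back to session key. A minimal standalone sketch of that bookkeeping follows. `VoiceEntry`, `entriesByKey`, `keyByRunId`, `setRunId`, and `getByRunId` are hypothetical stand-ins for illustration, not the names or types used in this diff, and the entry is reduced to the two fields the lookup needs.

```typescript
// Hypothetical, simplified entry: the real entry in the diff also carries an
// STT session, a connection ID, and a playback flag.
type VoiceEntry = { sessionKey: string; activeRunId: string | null };

const entriesByKey = new Map<string, VoiceEntry>();
const keyByRunId = new Map<string, string>();

function setRunId(sessionKey: string, runId: string | null): void {
  const entry = entriesByKey.get(sessionKey);
  if (!entry) {
    return;
  }
  // Remove the stale reverse mapping before recording the new run ID.
  if (entry.activeRunId) {
    keyByRunId.delete(entry.activeRunId);
  }
  entry.activeRunId = runId;
  if (runId) {
    keyByRunId.set(runId, sessionKey);
  }
}

function getByRunId(runId: string): VoiceEntry | undefined {
  const sessionKey = keyByRunId.get(runId);
  return sessionKey ? entriesByKey.get(sessionKey) : undefined;
}

// Usage: a run ID resolves to its session only while it is the active run.
entriesByKey.set("main", { sessionKey: "main", activeRunId: null });
setRunId("main", "run-1");
const duringRun = getByRunId("run-1")?.sessionKey;
setRunId("main", "run-2");
const staleLookup = getByRunId("run-1");
setRunId("main", null);
const afterClear = getByRunId("run-2");
```

Deleting the old reverse entry before writing the new one is what keeps a finished run ID from resolving to a session that has since moved on to another run.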
@@ -27,6 +27,7 @@ describe("method scope resolution", () => {
  it.each([
    ["sessions.resolve", ["operator.read"]],
    ["config.schema.lookup", ["operator.read"]],
    ["chat.voice.start", ["operator.write"]],
    ["sessions.create", ["operator.write"]],
    ["sessions.send", ["operator.write"]],
    ["sessions.abort", ["operator.write"]],
@@ -85,6 +86,10 @@ describe("operator scope authorization", () => {
      allowed: false,
      missingScope: "operator.write",
    });
    expect(authorizeOperatorScopesForMethod("chat.voice.start", ["operator.read"])).toEqual({
      allowed: false,
      missingScope: "operator.write",
    });
  });

  it("requires pairing scope for node pairing approvals", () => {
@@ -117,14 +117,23 @@ const METHOD_SCOPE_GROUPS: Record<OperatorScope, readonly string[]> = {
    "wake",
    "talk.mode",
    "talk.speak",
    "chat.voice.start",
    "tts.enable",
    "tts.disable",
    "tts.convert",
    "tts.setProvider",
    "realtimeTranscription.start",
    "realtimeTranscription.pushAudio",
    "realtimeTranscription.pull",
    "realtimeTranscription.finish",
    "voicewake.set",
    "node.invoke",
    "chat.send",
    "chat.abort",
    "chat.voice.audio",
    "chat.voice.commit",
    "chat.voice.interrupt",
    "chat.voice.stop",
    "sessions.create",
    "sessions.send",
    "sessions.steer",
@@ -63,6 +63,18 @@ import {
  ChatHistoryParamsSchema,
  type ChatInjectParams,
  ChatInjectParamsSchema,
  type ChatVoiceAudioParams,
  ChatVoiceAudioParamsSchema,
  type ChatVoiceCommitParams,
  ChatVoiceCommitParamsSchema,
  type ChatVoiceEvent,
  ChatVoiceEventSchema,
  type ChatVoiceInterruptParams,
  ChatVoiceInterruptParamsSchema,
  type ChatVoiceStartParams,
  ChatVoiceStartParamsSchema,
  type ChatVoiceStopParams,
  ChatVoiceStopParamsSchema,
  ChatSendParamsSchema,
  type ConfigApplyParams,
  ConfigApplyParamsSchema,
@@ -474,6 +486,21 @@ export const validateChatSendParams = ajv.compile(ChatSendParamsSchema);
export const validateChatAbortParams = ajv.compile<ChatAbortParams>(ChatAbortParamsSchema);
export const validateChatInjectParams = ajv.compile<ChatInjectParams>(ChatInjectParamsSchema);
export const validateChatEvent = ajv.compile(ChatEventSchema);
export const validateChatVoiceStartParams = ajv.compile<ChatVoiceStartParams>(
  ChatVoiceStartParamsSchema,
);
export const validateChatVoiceAudioParams = ajv.compile<ChatVoiceAudioParams>(
  ChatVoiceAudioParamsSchema,
);
export const validateChatVoiceCommitParams = ajv.compile<ChatVoiceCommitParams>(
  ChatVoiceCommitParamsSchema,
);
export const validateChatVoiceInterruptParams = ajv.compile<ChatVoiceInterruptParams>(
  ChatVoiceInterruptParamsSchema,
);
export const validateChatVoiceStopParams =
  ajv.compile<ChatVoiceStopParams>(ChatVoiceStopParamsSchema);
export const validateChatVoiceEvent = ajv.compile<ChatVoiceEvent>(ChatVoiceEventSchema);
export const validateUpdateRunParams = ajv.compile<UpdateRunParams>(UpdateRunParamsSchema);
export const validateWebLoginStartParams =
  ajv.compile<WebLoginStartParams>(WebLoginStartParamsSchema);
@@ -68,6 +68,68 @@ export const ChatInjectParamsSchema = Type.Object(
  { additionalProperties: false },
);

export const ChatVoiceStartParamsSchema = Type.Object(
  {
    sessionKey: NonEmptyString,
  },
  { additionalProperties: false },
);

export const ChatVoiceAudioParamsSchema = Type.Object(
  {
    sessionKey: NonEmptyString,
    audio: NonEmptyString,
    format: Type.Optional(Type.String()),
    sampleRate: Type.Optional(Type.Integer({ minimum: 1 })),
  },
  { additionalProperties: false },
);

export const ChatVoiceCommitParamsSchema = Type.Object(
  {
    sessionKey: NonEmptyString,
    transcript: Type.Optional(Type.String()),
  },
  { additionalProperties: false },
);

export const ChatVoiceInterruptParamsSchema = Type.Object(
  {
    sessionKey: NonEmptyString,
  },
  { additionalProperties: false },
);

export const ChatVoiceStopParamsSchema = Type.Object(
  {
    sessionKey: NonEmptyString,
  },
  { additionalProperties: false },
);

export const ChatVoiceEventSchema = Type.Object(
  {
    sessionKey: NonEmptyString,
    state: Type.Union([
      Type.Literal("ready"),
      Type.Literal("speech_start"),
      Type.Literal("partial_transcript"),
      Type.Literal("final_transcript"),
      Type.Literal("assistant_started"),
      Type.Literal("assistant_completed"),
      Type.Literal("playback_clear"),
      Type.Literal("interrupted"),
      Type.Literal("error"),
      Type.Literal("closed"),
    ]),
    transcript: Type.Optional(Type.String()),
    runId: Type.Optional(Type.String()),
    errorMessage: Type.Optional(Type.String()),
    playbackEnabled: Type.Optional(Type.Boolean()),
  },
  { additionalProperties: false },
);

export const ChatEventSchema = Type.Object(
  {
    runId: NonEmptyString,
@@ -118,6 +118,12 @@ import {
  ChatEventSchema,
  ChatHistoryParamsSchema,
  ChatInjectParamsSchema,
  ChatVoiceAudioParamsSchema,
  ChatVoiceCommitParamsSchema,
  ChatVoiceEventSchema,
  ChatVoiceInterruptParamsSchema,
  ChatVoiceStartParamsSchema,
  ChatVoiceStopParamsSchema,
  ChatSendParamsSchema,
  LogsTailParamsSchema,
  LogsTailResultSchema,
@@ -330,7 +336,13 @@ export const ProtocolSchemas = {
  ChatSendParams: ChatSendParamsSchema,
  ChatAbortParams: ChatAbortParamsSchema,
  ChatInjectParams: ChatInjectParamsSchema,
  ChatVoiceStartParams: ChatVoiceStartParamsSchema,
  ChatVoiceAudioParams: ChatVoiceAudioParamsSchema,
  ChatVoiceCommitParams: ChatVoiceCommitParamsSchema,
  ChatVoiceInterruptParams: ChatVoiceInterruptParamsSchema,
  ChatVoiceStopParams: ChatVoiceStopParamsSchema,
  ChatEvent: ChatEventSchema,
  ChatVoiceEvent: ChatVoiceEventSchema,
  UpdateRunParams: UpdateRunParamsSchema,
  TickEvent: TickEventSchema,
  ShutdownEvent: ShutdownEventSchema,
@@ -144,6 +144,12 @@ export type DeviceTokenRevokeParams = SchemaType<"DeviceTokenRevokeParams">;
export type ChatAbortParams = SchemaType<"ChatAbortParams">;
export type ChatInjectParams = SchemaType<"ChatInjectParams">;
export type ChatEvent = SchemaType<"ChatEvent">;
export type ChatVoiceStartParams = SchemaType<"ChatVoiceStartParams">;
export type ChatVoiceAudioParams = SchemaType<"ChatVoiceAudioParams">;
export type ChatVoiceCommitParams = SchemaType<"ChatVoiceCommitParams">;
export type ChatVoiceInterruptParams = SchemaType<"ChatVoiceInterruptParams">;
export type ChatVoiceStopParams = SchemaType<"ChatVoiceStopParams">;
export type ChatVoiceEvent = SchemaType<"ChatVoiceEvent">;
export type UpdateRunParams = SchemaType<"UpdateRunParams">;
export type TickEvent = SchemaType<"TickEvent">;
export type ShutdownEvent = SchemaType<"ShutdownEvent">;
154
src/gateway/realtime-transcription-session-manager.test.ts
Normal file
@@ -0,0 +1,154 @@
import { describe, expect, it, vi } from "vitest";
import type { OpenClawConfig } from "../config/config.js";
import type { RealtimeTranscriptionProviderPlugin } from "../plugins/types.js";
import { RealtimeTranscriptionSessionManager } from "./realtime-transcription-session-manager.js";

function createProvider(params?: {
  id?: string;
  configured?: boolean;
  onCreate?: (callbacks: Record<string, unknown>) => void;
}): RealtimeTranscriptionProviderPlugin {
  return {
    id: params?.id ?? "openai",
    label: "Test",
    autoSelectOrder: 1,
    resolveConfig: ({ rawConfig }) => rawConfig,
    isConfigured: () => params?.configured ?? true,
    createSession: (req) => {
      params?.onCreate?.(req as unknown as Record<string, unknown>);
      return {
        connect: async () => {},
        sendAudio: vi.fn(),
        close: vi.fn(),
        isConnected: () => true,
      };
    },
  };
}

describe("RealtimeTranscriptionSessionManager", () => {
  it("starts a session, auto-selects the first configured provider, and queues events", async () => {
    let callbacks: Record<string, unknown> | undefined;
    const provider = createProvider({
      onCreate: (req) => {
        callbacks = req;
      },
    });
    const manager = new RealtimeTranscriptionSessionManager({
      loadConfig: () => ({}) as OpenClawConfig,
      listProviders: () => [provider],
      getProvider: () => provider,
      now: () => 123,
      createId: () => "session-1",
    });

    const started = await manager.startSession({
      format: "s16le",
      sampleRate: 16000,
      channels: 1,
    });
    expect(started).toEqual({
      sessionId: "session-1",
      provider: "openai",
      format: "s16le",
      sampleRate: 16000,
      channels: 1,
    });

    (callbacks?.onPartial as ((value: string) => void) | undefined)?.("hello");
    (callbacks?.onTranscript as ((value: string) => void) | undefined)?.("hello world");

    const pulled = manager.pullEvents({ sessionId: "session-1" });
    expect(pulled.events).toEqual([
      { type: "session.started", provider: "openai", transport: "gateway", timestamp: 123 },
      { type: "partial", text: "hello", timestamp: 123 },
      { type: "final", text: "hello world", timestamp: 123 },
    ]);
  });

  it("rejects unsupported audio shapes", async () => {
    const provider = createProvider();
    const manager = new RealtimeTranscriptionSessionManager({
      loadConfig: () => ({}) as OpenClawConfig,
      listProviders: () => [provider],
      getProvider: () => provider,
      now: () => 123,
      createId: () => "session-1",
    });

    await expect(
      manager.startSession({
        format: "s16le",
        sampleRate: 16000,
        channels: 2,
      }),
    ).rejects.toThrow(/mono audio/);
  });

  it("returns pending terminal events on finish and removes the session", async () => {
    let callbacks: Record<string, unknown> | undefined;
    const close = vi.fn();
    const provider = createProvider({
      onCreate: (req) => {
        callbacks = req;
      },
    });
    provider.createSession = (req) => {
      callbacks = req as unknown as Record<string, unknown>;
      return {
        connect: async () => {},
        sendAudio: vi.fn(),
        close,
        isConnected: () => false,
      };
    };
    const manager = new RealtimeTranscriptionSessionManager({
      loadConfig: () => ({}) as OpenClawConfig,
      listProviders: () => [provider],
      getProvider: () => provider,
      now: () => 123,
      createId: () => "session-1",
    });

    await manager.startSession({
      format: "s16le",
      sampleRate: 16000,
      channels: 1,
    });
    (callbacks?.onPartial as ((value: string) => void) | undefined)?.("hello");

    expect(manager.finishSession({ sessionId: "session-1" })).toEqual({
      sessionId: "session-1",
      provider: "openai",
      closed: true,
      events: [
        { type: "session.started", provider: "openai", transport: "gateway", timestamp: 123 },
        { type: "partial", text: "hello", timestamp: 123 },
        { type: "session.ended", reason: "client_finish", timestamp: 123 },
      ],
    });
    expect(close).toHaveBeenCalledTimes(1);
    expect(() => manager.pullEvents({ sessionId: "session-1" })).toThrow(
      /Unknown realtime transcription session/,
    );
  });

  it("fails when no configured provider is available", async () => {
    const provider = createProvider({ configured: false });
    const manager = new RealtimeTranscriptionSessionManager({
      loadConfig: () => ({}) as OpenClawConfig,
      listProviders: () => [provider],
      getProvider: () => provider,
      now: () => 123,
      createId: () => "session-1",
    });

    await expect(
      manager.startSession({
        format: "s16le",
        sampleRate: 16000,
        channels: 1,
      }),
    ).rejects.toThrow(/No configured realtime transcription provider/);
  });
});
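Reviewer note: the tests above rely on events being drained when they are pulled, so a second pull never returns the same event twice. A minimal sketch of that drain-on-read queue follows, assuming the same limit handling the manager's `pullEvents` appears to use. `QueuedEvent` and `EventQueue` are hypothetical names introduced only for this illustration.

```typescript
// Hypothetical reduced event shape; the manager's real event union is richer.
type QueuedEvent = { type: string; text?: string };

class EventQueue {
  private readonly events: QueuedEvent[] = [];

  push(event: QueuedEvent): void {
    this.events.push(event);
  }

  // Remove and return up to `limit` events, oldest first. With no limit the
  // whole backlog is drained; a limit below 1 is raised to 1.
  pull(limit?: number): QueuedEvent[] {
    const requested = limit ?? (this.events.length || 100);
    const count = Math.max(1, Math.floor(requested));
    return this.events.splice(0, count);
  }
}

// Usage: three queued events, pulled in two batches, then an empty pull.
const queue = new EventQueue();
queue.push({ type: "session.started" });
queue.push({ type: "partial", text: "hello" });
queue.push({ type: "final", text: "hello world" });
const firstBatch = queue.pull(2);
const secondBatch = queue.pull();
const emptyBatch = queue.pull();
```

Because `splice` mutates the backing array, callers that need to replay events would have to keep their own copy; the queue itself holds only what has not yet been delivered.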
297
src/gateway/realtime-transcription-session-manager.ts
Normal file
@@ -0,0 +1,297 @@
import { randomUUID } from "node:crypto";
import type { OpenClawConfig } from "../config/config.js";
import { loadConfig } from "../config/config.js";
import type { RealtimeTranscriptionProviderPlugin } from "../plugins/types.js";
import {
  getRealtimeTranscriptionProvider,
  listRealtimeTranscriptionProviders,
} from "../realtime-transcription/provider-registry.js";
import type {
  RealtimeTranscriptionProviderConfig,
  RealtimeTranscriptionSession,
} from "../realtime-transcription/provider-types.js";

type AudioFormat = "s16le" | "pcm16" | "g711_ulaw";

export type RealtimeTranscriptionSessionEvent =
  | { type: "session.started"; provider: string; transport: "gateway"; timestamp: number }
  | { type: "partial"; text: string; timestamp: number }
  | { type: "final"; text: string; timestamp: number }
  | { type: "warning"; message: string; timestamp: number }
  | { type: "error"; message: string; timestamp: number }
  | { type: "session.ended"; reason: string; timestamp: number };

type ManagedSession = {
  id: string;
  provider: string;
  format: AudioFormat;
  sampleRate: number;
  channels: number;
  session: RealtimeTranscriptionSession;
  events: RealtimeTranscriptionSessionEvent[];
  closed: boolean;
};

type SessionStartParams = {
  provider?: string;
  providerConfig?: RealtimeTranscriptionProviderConfig;
  format: AudioFormat;
  sampleRate: number;
  channels: number;
};

type ManagerDeps = {
  loadConfig: () => OpenClawConfig;
  listProviders: (cfg?: OpenClawConfig) => RealtimeTranscriptionProviderPlugin[];
  getProvider: (
    providerId: string | undefined,
    cfg?: OpenClawConfig,
  ) => RealtimeTranscriptionProviderPlugin | undefined;
  now: () => number;
  createId: () => string;
};

const defaultDeps: ManagerDeps = {
  loadConfig,
  listProviders: listRealtimeTranscriptionProviders,
  getProvider: getRealtimeTranscriptionProvider,
  now: () => Date.now(),
  createId: () => randomUUID(),
};

function normalizeAudioFormat(raw: string | undefined): AudioFormat | null {
  const value = raw?.trim().toLowerCase();
  if (!value) {
    return null;
  }
  if (value === "s16le" || value === "pcm16" || value === "g711_ulaw") {
    return value;
  }
  return null;
}

function validateSessionShape(params: {
  format: AudioFormat;
  sampleRate: number;
  channels: number;
}) {
  if (!Number.isFinite(params.sampleRate) || params.sampleRate <= 0) {
    throw new Error("sampleRate must be a positive number.");
  }
  if (!Number.isFinite(params.channels) || params.channels <= 0) {
    throw new Error("channels must be a positive number.");
  }
  if (params.channels !== 1) {
    throw new Error("realtime transcription currently requires mono audio (channels=1).");
  }
  if (params.format === "g711_ulaw" && params.sampleRate !== 8000) {
    throw new Error("g711_ulaw realtime transcription requires sampleRate=8000.");
  }
}

function sortProviders(providers: RealtimeTranscriptionProviderPlugin[]) {
  return [...providers].toSorted((left, right) => {
    const leftOrder = left.autoSelectOrder ?? Number.MAX_SAFE_INTEGER;
    const rightOrder = right.autoSelectOrder ?? Number.MAX_SAFE_INTEGER;
    if (leftOrder !== rightOrder) {
      return leftOrder - rightOrder;
    }
    return left.id.localeCompare(right.id);
  });
}

function buildProviderConfig(params: {
  provider: RealtimeTranscriptionProviderPlugin;
  cfg: OpenClawConfig;
  providerConfig?: RealtimeTranscriptionProviderConfig;
  format: AudioFormat;
}): RealtimeTranscriptionProviderConfig {
  const rawConfig = {
    ...params.providerConfig,
    ...(params.format === "s16le" || params.format === "pcm16"
      ? { inputAudioFormat: "pcm16" }
      : params.format === "g711_ulaw"
        ? { inputAudioFormat: "g711_ulaw" }
        : {}),
  };
  return params.provider.resolveConfig?.({ cfg: params.cfg, rawConfig }) ?? rawConfig;
}

export class RealtimeTranscriptionSessionManager {
  private readonly sessions = new Map<string, ManagedSession>();

  constructor(private readonly deps: ManagerDeps = defaultDeps) {}

  async startSession(params: SessionStartParams) {
    validateSessionShape({
      format: params.format,
      sampleRate: params.sampleRate,
      channels: params.channels,
    });
    const cfg = this.deps.loadConfig();
    const provider = this.resolveProvider(params.provider, cfg, params);
    const providerConfig = buildProviderConfig({
      provider,
      cfg,
      providerConfig: params.providerConfig,
      format: params.format,
    });
    const sessionId = this.deps.createId();
    const events: RealtimeTranscriptionSessionEvent[] = [];
    const queueEvent = (event: RealtimeTranscriptionSessionEvent) => {
      events.push(event);
    };
    const session = provider.createSession({
      providerConfig,
      onPartial: (partial) => {
        if (partial.trim()) {
          queueEvent({ type: "partial", text: partial, timestamp: this.deps.now() });
        }
      },
      onTranscript: (transcript) => {
        if (transcript.trim()) {
          queueEvent({ type: "final", text: transcript, timestamp: this.deps.now() });
        }
      },
      onError: (error) => {
        queueEvent({
          type: "error",
          message: error.message || String(error),
          timestamp: this.deps.now(),
        });
      },
    });
    await session.connect();
    queueEvent({
      type: "session.started",
      provider: provider.id,
      transport: "gateway",
      timestamp: this.deps.now(),
    });
    this.sessions.set(sessionId, {
      id: sessionId,
      provider: provider.id,
      format: params.format,
      sampleRate: params.sampleRate,
      channels: params.channels,
      session,
      events,
      closed: false,
    });
    return {
      sessionId,
      provider: provider.id,
      format: params.format,
      sampleRate: params.sampleRate,
      channels: params.channels,
    };
  }

  pushAudio(params: { sessionId: string; audio: Buffer }) {
    const managed = this.getOpenSession(params.sessionId);
    managed.session.sendAudio(params.audio);
    return {
      sessionId: managed.id,
      acceptedBytes: params.audio.byteLength,
      connected: managed.session.isConnected(),
    };
  }

  pullEvents(params: { sessionId: string; limit?: number }) {
    const managed = this.getSession(params.sessionId);
    const requested = params.limit ?? (managed.events.length || 100);
    const count = Math.max(1, Math.floor(requested));
    const events = managed.events.splice(0, count);
    return {
      sessionId: managed.id,
      provider: managed.provider,
      connected: managed.session.isConnected(),
      closed: managed.closed,
      events,
    };
  }

  finishSession(params: { sessionId: string; reason?: string }) {
    const managed = this.getSession(params.sessionId);
    if (!managed.closed) {
      managed.closed = true;
      managed.session.close();
      managed.events.push({
        type: "session.ended",
        reason: params.reason?.trim() || "client_finish",
        timestamp: this.deps.now(),
      });
    }
    const events = managed.events.splice(0, managed.events.length);
    this.sessions.delete(params.sessionId);
    return {
      sessionId: managed.id,
      provider: managed.provider,
      closed: true,
      events,
    };
  }

  private resolveProvider(
    providerId: string | undefined,
    cfg: OpenClawConfig,
    params: SessionStartParams,
  ): RealtimeTranscriptionProviderPlugin {
    if (providerId?.trim()) {
      const provider = this.deps.getProvider(providerId, cfg);
      if (!provider) {
        throw new Error(`Unknown realtime transcription provider: ${providerId}`);
      }
      const providerConfig = buildProviderConfig({
        provider,
        cfg,
        providerConfig: params.providerConfig,
        format: params.format,
      });
      if (!provider.isConfigured({ cfg, providerConfig })) {
        throw new Error(`Realtime transcription provider "${provider.id}" is not configured.`);
      }
      return provider;
    }

    const provider = sortProviders(this.deps.listProviders(cfg)).find((candidate) => {
      const providerConfig = buildProviderConfig({
        provider: candidate,
        cfg,
        providerConfig: params.providerConfig,
        format: params.format,
      });
      return candidate.isConfigured({ cfg, providerConfig });
    });
    if (!provider) {
      throw new Error("No configured realtime transcription provider is available.");
    }
    return provider;
  }

  private getSession(sessionId: string): ManagedSession {
    const managed = this.sessions.get(sessionId);
    if (!managed) {
      throw new Error(`Unknown realtime transcription session: ${sessionId}`);
    }
    return managed;
  }

  private getOpenSession(sessionId: string): ManagedSession {
    const managed = this.getSession(sessionId);
    if (managed.closed) {
      throw new Error(`Realtime transcription session is already closed: ${sessionId}`);
    }
    return managed;
  }
}

const sharedManager = new RealtimeTranscriptionSessionManager();

export function getRealtimeTranscriptionSessionManager() {
  return sharedManager;
}

export const __testing = {
  normalizeAudioFormat,
};
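Reviewer note: when no provider is requested explicitly, the manager above appears to pick the first configured provider after ordering by `autoSelectOrder` and then by `id`. A minimal sketch of that selection rule follows. `Provider` and `pickProvider` are hypothetical simplifications for illustration: the real plugin type takes config arguments in `isConfigured`, and this sketch uses `sort` on a copy rather than `toSorted`.

```typescript
// Hypothetical reduced provider shape for illustration only.
type Provider = {
  id: string;
  autoSelectOrder?: number;
  isConfigured: () => boolean;
};

function pickProvider(providers: Provider[]): Provider | undefined {
  // Sort a copy: lowest autoSelectOrder first, missing order last, ties by id.
  const sorted = [...providers].sort((left, right) => {
    const leftOrder = left.autoSelectOrder ?? Number.MAX_SAFE_INTEGER;
    const rightOrder = right.autoSelectOrder ?? Number.MAX_SAFE_INTEGER;
    return leftOrder !== rightOrder ? leftOrder - rightOrder : left.id.localeCompare(right.id);
  });
  // The first candidate that reports itself configured wins.
  return sorted.find((candidate) => candidate.isConfigured());
}

// Usage: "zeta" sorts first but is unconfigured, so "alpha" (order 2) is
// chosen ahead of "beta", which has no explicit order.
const candidates: Provider[] = [
  { id: "beta", isConfigured: () => true },
  { id: "zeta", autoSelectOrder: 1, isConfigured: () => false },
  { id: "alpha", autoSelectOrder: 2, isConfigured: () => true },
];
const chosen = pickProvider(candidates)?.id;
const noneConfigured = pickProvider([{ id: "only", isConfigured: () => false }]);
```

Treating a missing `autoSelectOrder` as the largest safe integer keeps unordered providers eligible while letting explicitly ordered ones take priority.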
@@ -21,6 +21,7 @@ const EVENT_SCOPE_GUARDS: Record<string, string[]> = {
  "sessions.changed": [READ_SCOPE],
  "session.message": [READ_SCOPE],
  "session.tool": [READ_SCOPE],
  "chat.voice.event": [READ_SCOPE],
};

export type GatewayBroadcastStateVersion = {
@@ -5,6 +5,7 @@ import { loadConfig } from "../config/config.js";
import { type AgentEventPayload, getAgentRunContext } from "../infra/agent-events.js";
import { resolveHeartbeatVisibility } from "../infra/heartbeat-visibility.js";
import { stripInlineDirectiveTagsForDisplay } from "../utils/directive-tags.js";
import { getChatVoiceSessionByRunId, setChatVoiceRunId } from "./chat-voice-sessions.js";
import { loadGatewaySessionRow } from "./server-chat.load-gateway-session-row.runtime.js";
import { persistGatewaySessionLifecycleEvent } from "./server-chat.persist-session-lifecycle.runtime.js";
import { deriveGatewaySessionLifecycleSnapshot } from "./session-lifecycle-state.js";
@@ -948,6 +949,72 @@ export function createAgentEventHandler({
      }
      if (!isAborted && evt.stream === "assistant" && typeof evt.data?.text === "string") {
        emitChatDelta(sessionKey, clientRunId, evt.runId, evt.seq, evt.data.text, evt.data.delta);
      } else if (!isAborted && (lifecyclePhase === "end" || lifecyclePhase === "error")) {
        const evtStopReason =
          typeof evt.data?.stopReason === "string" ? evt.data.stopReason : undefined;
        if (chatLink) {
          const finished = chatRunState.registry.shift(evt.runId);
          if (!finished) {
            clearAgentRunContext(evt.runId);
            return;
          }
          emitChatFinal(
            finished.sessionKey,
            finished.clientRunId,
            evt.runId,
            evt.seq,
            lifecyclePhase === "error" ? "error" : "done",
            evt.data?.error,
            evtStopReason,
          );
        } else {
          emitChatFinal(
            sessionKey,
            eventRunId,
            evt.runId,
            evt.seq,
            lifecyclePhase === "error" ? "error" : "done",
            evt.data?.error,
            evtStopReason,
          );
        }
        const voiceSession = getChatVoiceSessionByRunId(clientRunId);
        if (voiceSession) {
          setChatVoiceRunId(voiceSession.sessionKey, null);
          broadcastToConnIds(
            "chat.voice.event",
            {
              sessionKey: voiceSession.sessionKey,
              state: "assistant_completed",
              runId: clientRunId,
              playbackEnabled: voiceSession.playbackEnabled,
            },
            new Set([voiceSession.connId]),
          );
        }
      } else if (isAborted && (lifecyclePhase === "end" || lifecyclePhase === "error")) {
        chatRunState.abortedRuns.delete(clientRunId);
        chatRunState.abortedRuns.delete(evt.runId);
        chatRunState.buffers.delete(clientRunId);
        chatRunState.deltaSentAt.delete(clientRunId);
        if (chatLink) {
          chatRunState.registry.remove(evt.runId, clientRunId, sessionKey);
        }
        const voiceSession = getChatVoiceSessionByRunId(clientRunId);
        if (voiceSession) {
          setChatVoiceRunId(voiceSession.sessionKey, null);
          broadcastToConnIds(
            "chat.voice.event",
            {
              sessionKey: voiceSession.sessionKey,
              state: "interrupted",
              runId: clientRunId,
              playbackEnabled: voiceSession.playbackEnabled,
            },
            new Set([voiceSession.connId]),
            { dropIfSlow: true },
          );
        }
      }
    }
@@ -17,6 +17,10 @@ const BASE_METHODS = [
  "tts.disable",
  "tts.convert",
  "tts.setProvider",
  "realtimeTranscription.start",
  "realtimeTranscription.pushAudio",
  "realtimeTranscription.pull",
  "realtimeTranscription.finish",
  "config.get",
  "config.set",
  "config.apply",
@@ -118,6 +122,11 @@ const BASE_METHODS = [
  "chat.history",
  "chat.abort",
  "chat.send",
  "chat.voice.start",
  "chat.voice.audio",
  "chat.voice.commit",
  "chat.voice.interrupt",
  "chat.voice.stop",
];

export function listGatewayMethods(): string[] {
@@ -129,6 +138,7 @@ export const GATEWAY_EVENTS = [
  "connect.challenge",
  "agent",
  "chat",
  "chat.voice.event",
  "session.message",
  "session.tool",
  "sessions.changed",
@@ -20,6 +20,7 @@ import { modelsHandlers } from "./server-methods/models.js";
import { nodePendingHandlers } from "./server-methods/nodes-pending.js";
import { nodeHandlers } from "./server-methods/nodes.js";
import { pushHandlers } from "./server-methods/push.js";
import { realtimeTranscriptionHandlers } from "./server-methods/realtime-transcription.js";
import { sendHandlers } from "./server-methods/send.js";
import { sessionsHandlers } from "./server-methods/sessions.js";
import { skillsHandlers } from "./server-methods/skills.js";
@@ -84,6 +85,7 @@ export const coreGatewayHandlers: GatewayRequestHandlers = {
  ...toolsCatalogHandlers,
  ...toolsEffectiveHandlers,
  ...ttsHandlers,
  ...realtimeTranscriptionHandlers,
  ...skillsHandlers,
  ...sessionsHandlers,
  ...systemHandlers,
@@ -1,3 +1,4 @@
import { randomUUID } from "node:crypto";
import fs from "node:fs";
import path from "node:path";
import { CURRENT_SESSION_VERSION, SessionManager } from "@mariozechner/pi-coding-agent";
@@ -19,6 +20,8 @@ import { jsonUtf8Bytes } from "../../infra/json-utf8-bytes.js";
import type { PromptImageOrderEntry } from "../../media/prompt-image-order.js";
import { type SavedMedia, saveMediaBuffer } from "../../media/store.js";
import { createChannelReplyPipeline } from "../../plugin-sdk/channel-reply-pipeline.js";
import { getRealtimeTranscriptionProvider } from "../../plugin-sdk/realtime-transcription.js";
import type { RealtimeTranscriptionSession } from "../../realtime-transcription/provider-types.js";
import { normalizeInputProvenance, type InputProvenance } from "../../sessions/input-provenance.js";
import { resolveSendPolicy } from "../../sessions/send-policy.js";
import { parseAgentSessionKey } from "../../sessions/session-key-utils.js";
@@ -48,6 +51,13 @@ import {
  parseMessageWithAttachments,
} from "../chat-attachments.js";
import { stripEnvelopeFromMessage, stripEnvelopeFromMessages } from "../chat-sanitize.js";
import {
  deleteChatVoiceSession,
  getChatVoiceSession,
  setChatVoiceRunId,
  setChatVoiceSession,
  type ChatVoiceEventPayload,
} from "../chat-voice-sessions.js";
import { augmentChatHistoryWithCliSessionImports } from "../cli-session-history.js";
import { ADMIN_SCOPE } from "../method-scopes.js";
import {
@@ -57,6 +67,11 @@ import {
  hasGatewayClientCap,
} from "../protocol/client-info.js";
import {
  validateChatVoiceAudioParams,
  validateChatVoiceCommitParams,
  validateChatVoiceInterruptParams,
  validateChatVoiceStartParams,
  validateChatVoiceStopParams,
  ErrorCodes,
  errorShape,
  formatValidationErrors,
@@ -1011,6 +1026,88 @@ function normalizeOptionalText(value?: string | null): string | undefined {
  return trimmed || undefined;
}

function getActiveChatVoiceCallbackSession(params: {
  sessionKey: string;
  connId: string;
  sttSession: RealtimeTranscriptionSession;
}) {
  const active = getChatVoiceSession(params.sessionKey);
  if (!active || active.connId !== params.connId || active.sttSession !== params.sttSession) {
    return undefined;
  }
  return active;
}

function isStrictBase64(value: string): boolean {
  const normalized = value.replace(/\s+/g, "");
  if (!normalized || normalized.length % 4 !== 0) {
    return false;
  }
  if (!/^[A-Za-z0-9+/]+={0,2}$/.test(normalized)) {
    return false;
  }
  const decoded = Buffer.from(normalized, "base64");
  return decoded.length > 0 && decoded.toString("base64") === normalized;
}

function parseStrictBase64AudioBuffer(value: unknown): Buffer {
  const audio = typeof value === "string" ? value.trim() : "";
  if (!audio) {
    throw new Error("audio is required.");
  }
  if (!isStrictBase64(audio)) {
    throw new Error("audio must be base64 encoded.");
  }
  return Buffer.from(audio, "base64");
}

function resolveControlUiVoiceConfig(cfg: ReturnType<typeof loadSessionEntry>["cfg"]) {
  return cfg.gateway?.controlUi?.voice;
}

function emitChatVoiceEvent(
  context: GatewayRequestContext,
  connId: string,
  payload: ChatVoiceEventPayload,
) {
  context.broadcastToConnIds("chat.voice.event", payload, new Set([connId]));
}

async function closeChatVoiceSession(params: {
  context: GatewayRequestContext;
  sessionKey: string;
  connId: string;
  emitClosed?: boolean;
  errorMessage?: string;
}) {
  const entry = deleteChatVoiceSession(params.sessionKey);
  if (!entry) {
    return;
  }
  try {
    entry.sttSession.close();
  } catch (err) {
    params.context.logGateway.debug(
      `chat.voice session close cleanup failed: ${formatForLog(err)}`,
    );
  }
  if (params.errorMessage) {
    emitChatVoiceEvent(params.context, params.connId, {
      sessionKey: params.sessionKey,
      state: "error",
      errorMessage: params.errorMessage,
      playbackEnabled: entry.playbackEnabled,
    });
  }
  if (params.emitClosed !== false) {
    emitChatVoiceEvent(params.context, params.connId, {
      sessionKey: params.sessionKey,
      state: "closed",
      playbackEnabled: entry.playbackEnabled,
    });
  }
}

function normalizeExplicitChatSendOrigin(
  params: ChatSendExplicitOrigin,
): { ok: true; value?: ChatSendExplicitOrigin } | { ok: false; error: string } {
|
||||
@@ -1954,6 +2051,425 @@ export const chatHandlers: GatewayRequestHandlers = {
|
||||
});
|
||||
}
|
||||
},
|
||||
"chat.voice.start": async ({ params, respond, context, client }) => {
|
||||
if (!validateChatVoiceStartParams(params)) {
|
||||
respond(
|
||||
false,
|
||||
undefined,
|
||||
errorShape(
|
||||
ErrorCodes.INVALID_REQUEST,
|
||||
`invalid chat.voice.start params: ${formatValidationErrors(validateChatVoiceStartParams.errors)}`,
|
||||
),
|
||||
);
|
||||
return;
|
||||
}
|
||||
const connId = normalizeOptionalText(client?.connId);
|
||||
if (!connId) {
|
||||
respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, "voice requires connId"));
|
||||
return;
|
||||
}
|
||||
|
||||
const { sessionKey: rawSessionKey } = params as { sessionKey: string };
|
||||
const { cfg, canonicalKey: sessionKey } = loadSessionEntry(rawSessionKey);
|
||||
const voiceConfig = resolveControlUiVoiceConfig(cfg);
|
||||
if (voiceConfig?.enabled !== true) {
|
||||
respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, "web voice is disabled"));
|
||||
return;
|
||||
}
|
||||
|
||||
const providerId = normalizeOptionalText(voiceConfig.transcriptionProvider);
|
||||
if (!providerId) {
|
||||
respond(
|
||||
false,
|
||||
undefined,
|
||||
errorShape(ErrorCodes.INVALID_REQUEST, "voice transcription provider is not configured"),
|
||||
);
|
||||
return;
|
||||
}
|
||||
|
||||
const provider = getRealtimeTranscriptionProvider(providerId, cfg);
|
||||
if (!provider) {
|
||||
respond(
|
||||
false,
|
||||
undefined,
|
||||
errorShape(
|
||||
ErrorCodes.INVALID_REQUEST,
|
||||
`voice transcription provider not found: ${providerId}`,
|
||||
),
|
||||
);
|
||||
return;
|
||||
}
|
||||
|
||||
const modelProviderConfig = cfg.models?.providers?.[provider.id];
|
||||
const providerConfig = {
|
||||
providers: {
|
||||
[provider.id]: {
|
||||
...modelProviderConfig,
|
||||
...voiceConfig.providers?.[provider.id],
|
||||
inputAudioFormat: "pcm16",
|
||||
},
|
||||
},
|
||||
};
|
||||
if (!provider.isConfigured({ cfg, providerConfig })) {
|
||||
respond(
|
||||
false,
|
||||
undefined,
|
||||
errorShape(
|
||||
ErrorCodes.INVALID_REQUEST,
|
||||
`voice transcription provider is not configured: ${provider.id}`,
|
||||
),
|
||||
);
|
||||
return;
|
||||
}
|
||||
|
||||
const existing = getChatVoiceSession(sessionKey);
|
||||
if (existing?.connId === connId) {
|
||||
await closeChatVoiceSession({
|
||||
context,
|
||||
sessionKey,
|
||||
connId,
|
||||
emitClosed: false,
|
||||
});
|
||||
}
|
||||
|
||||
const playbackEnabled = voiceConfig.playbackEnabled !== false;
|
||||
try {
|
||||
let sttSession: RealtimeTranscriptionSession;
|
||||
sttSession = provider.createSession({
|
||||
providerConfig,
|
||||
onSpeechStart: () => {
|
||||
const active = getActiveChatVoiceCallbackSession({ sessionKey, connId, sttSession });
|
||||
if (!active) {
|
||||
return;
|
||||
}
|
||||
active.transcriptPartial = "";
|
||||
emitChatVoiceEvent(context, connId, {
|
||||
sessionKey,
|
||||
state: "speech_start",
|
||||
playbackEnabled: active.playbackEnabled,
|
||||
});
|
||||
},
|
||||
onPartial: (partial) => {
|
||||
const active = getActiveChatVoiceCallbackSession({ sessionKey, connId, sttSession });
|
||||
if (!active) {
|
||||
return;
|
||||
}
|
||||
active.transcriptPartial = partial;
|
||||
emitChatVoiceEvent(context, connId, {
|
||||
sessionKey,
|
||||
state: "partial_transcript",
|
||||
transcript: partial,
|
||||
playbackEnabled: active.playbackEnabled,
|
||||
});
|
||||
},
|
||||
onTranscript: (transcript) => {
|
||||
const active = getActiveChatVoiceCallbackSession({ sessionKey, connId, sttSession });
|
||||
if (!active) {
|
||||
return;
|
||||
}
|
||||
active.transcriptFinal = transcript;
|
||||
active.transcriptPartial = "";
|
||||
emitChatVoiceEvent(context, connId, {
|
||||
sessionKey,
|
||||
state: "final_transcript",
|
||||
transcript,
|
||||
playbackEnabled: active.playbackEnabled,
|
||||
});
|
||||
},
|
||||
onError: (error) => {
|
||||
const active = getActiveChatVoiceCallbackSession({ sessionKey, connId, sttSession });
|
||||
if (!active) {
|
||||
return;
|
||||
}
|
||||
void closeChatVoiceSession({
|
||||
context,
|
||||
sessionKey,
|
||||
connId,
|
||||
errorMessage: error.message || String(error),
|
||||
});
|
||||
},
|
||||
});
|
||||
await sttSession.connect();
|
||||
setChatVoiceSession({
|
||||
sessionKey,
|
||||
connId,
|
||||
providerId: provider.id,
|
||||
playbackEnabled,
|
||||
sttSession,
|
||||
transcriptPartial: "",
|
||||
transcriptFinal: "",
|
||||
activeRunId: null,
|
||||
});
|
||||
respond(true, {
|
||||
ok: true,
|
||||
providerId: provider.id,
|
||||
playbackEnabled,
|
||||
});
|
||||
emitChatVoiceEvent(context, connId, {
|
||||
sessionKey,
|
||||
state: "ready",
|
||||
playbackEnabled,
|
||||
});
|
||||
} catch (err) {
|
||||
respond(false, undefined, errorShape(ErrorCodes.UNAVAILABLE, String(err)));
|
||||
context.logGateway.warn(`chat.voice.start failed: ${formatForLog(err)}`);
|
||||
}
|
||||
},
|
||||
"chat.voice.audio": ({ params, respond, client }) => {
|
||||
if (!validateChatVoiceAudioParams(params)) {
|
||||
respond(
|
||||
false,
|
||||
undefined,
|
||||
errorShape(
|
||||
ErrorCodes.INVALID_REQUEST,
|
||||
`invalid chat.voice.audio params: ${formatValidationErrors(validateChatVoiceAudioParams.errors)}`,
|
||||
),
|
||||
);
|
||||
return;
|
||||
}
|
||||
const connId = normalizeOptionalText(client?.connId);
|
||||
if (!connId) {
|
||||
respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, "voice requires connId"));
|
||||
return;
|
||||
}
|
||||
const {
|
||||
sessionKey: rawSessionKey,
|
||||
audio,
|
||||
format,
|
||||
} = params as {
|
||||
sessionKey: string;
|
||||
audio: string;
|
||||
format?: string;
|
||||
};
|
||||
const { canonicalKey: sessionKey } = loadSessionEntry(rawSessionKey);
|
||||
const entry = getChatVoiceSession(sessionKey);
|
||||
if (!entry || entry.connId !== connId) {
|
||||
respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, "voice session not found"));
|
||||
return;
|
||||
}
|
||||
if (format && format.toLowerCase() !== "pcm16") {
|
||||
respond(
|
||||
false,
|
||||
undefined,
|
||||
errorShape(ErrorCodes.INVALID_REQUEST, `unsupported voice audio format: ${format}`),
|
||||
);
|
||||
return;
|
||||
}
|
||||
let audioBuffer: Buffer;
|
||||
try {
|
||||
audioBuffer = parseStrictBase64AudioBuffer(audio);
|
||||
} catch (err) {
|
||||
respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, String(err)));
|
||||
return;
|
||||
}
|
||||
try {
|
||||
entry.sttSession.sendAudio(audioBuffer);
|
||||
respond(true, { ok: true });
|
||||
} catch (err) {
|
||||
respond(false, undefined, errorShape(ErrorCodes.UNAVAILABLE, String(err)));
|
||||
}
|
||||
},
|
||||
"chat.voice.commit": async ({ params, req, respond, context, client }) => {
|
||||
if (!validateChatVoiceCommitParams(params)) {
|
||||
respond(
|
||||
false,
|
||||
undefined,
|
||||
errorShape(
|
||||
ErrorCodes.INVALID_REQUEST,
|
||||
`invalid chat.voice.commit params: ${formatValidationErrors(validateChatVoiceCommitParams.errors)}`,
|
||||
),
|
||||
);
|
||||
return;
|
||||
}
|
||||
const connId = normalizeOptionalText(client?.connId);
|
||||
if (!connId) {
|
||||
respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, "voice requires connId"));
|
||||
return;
|
||||
}
|
||||
const { sessionKey: rawSessionKey, transcript: transcriptOverride } = params as {
|
||||
sessionKey: string;
|
||||
transcript?: string;
|
||||
};
|
||||
const { canonicalKey: sessionKey } = loadSessionEntry(rawSessionKey);
|
||||
const entry = getChatVoiceSession(sessionKey);
|
||||
if (!entry || entry.connId !== connId) {
|
||||
respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, "voice session not found"));
|
||||
return;
|
||||
}
|
||||
if (entry.activeRunId) {
|
||||
respond(true, { ok: false, status: "in_flight", runId: entry.activeRunId });
|
||||
return;
|
||||
}
|
||||
const transcript = (
|
||||
transcriptOverride ??
|
||||
// transcriptFinal defaults to "", so `??` alone would never reach the partial transcript.
|
||||
(entry.transcriptFinal || entry.transcriptPartial)
|
||||
).trim();
|
||||
if (!transcript) {
|
||||
respond(
|
||||
false,
|
||||
undefined,
|
||||
errorShape(ErrorCodes.INVALID_REQUEST, "voice transcript is empty"),
|
||||
);
|
||||
return;
|
||||
}
|
||||
|
||||
const runId = randomUUID();
|
||||
const voiceSendResult = await new Promise<{
|
||||
ok: boolean;
|
||||
payload?: unknown;
|
||||
error?: ReturnType<typeof errorShape>;
|
||||
}>((resolve) => {
|
||||
void chatHandlers["chat.send"]({
|
||||
req,
|
||||
params: {
|
||||
sessionKey,
|
||||
message: transcript,
|
||||
deliver: false,
|
||||
idempotencyKey: runId,
|
||||
},
|
||||
client,
|
||||
isWebchatConnect: () => false,
|
||||
context,
|
||||
respond: (ok, payload, error) => resolve({ ok, payload, error }),
|
||||
});
|
||||
});
|
||||
if (!voiceSendResult.ok) {
|
||||
respond(false, voiceSendResult.payload, voiceSendResult.error);
|
||||
return;
|
||||
}
|
||||
entry.transcriptFinal = "";
|
||||
entry.transcriptPartial = "";
|
||||
setChatVoiceRunId(sessionKey, runId);
|
||||
emitChatVoiceEvent(context, connId, {
|
||||
sessionKey,
|
||||
state: "assistant_started",
|
||||
runId,
|
||||
playbackEnabled: entry.playbackEnabled,
|
||||
});
|
||||
respond(true, {
|
||||
ok: true,
|
||||
runId,
|
||||
transcript,
|
||||
playbackEnabled: entry.playbackEnabled,
|
||||
result: voiceSendResult.payload,
|
||||
});
|
||||
},
|
||||
"chat.voice.interrupt": ({ params, req, respond, context, client }) => {
|
||||
if (!validateChatVoiceInterruptParams(params)) {
|
||||
respond(
|
||||
false,
|
||||
undefined,
|
||||
errorShape(
|
||||
ErrorCodes.INVALID_REQUEST,
|
||||
`invalid chat.voice.interrupt params: ${formatValidationErrors(validateChatVoiceInterruptParams.errors)}`,
|
||||
),
|
||||
);
|
||||
return;
|
||||
}
|
||||
const connId = normalizeOptionalText(client?.connId);
|
||||
if (!connId) {
|
||||
respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, "voice requires connId"));
|
||||
return;
|
||||
}
|
||||
const { sessionKey: rawSessionKey } = params as { sessionKey: string };
|
||||
const { canonicalKey: sessionKey } = loadSessionEntry(rawSessionKey);
|
||||
const entry = getChatVoiceSession(sessionKey);
|
||||
if (!entry || entry.connId !== connId) {
|
||||
respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, "voice session not found"));
|
||||
return;
|
||||
}
|
||||
|
||||
emitChatVoiceEvent(context, connId, {
|
||||
sessionKey,
|
||||
state: "playback_clear",
|
||||
playbackEnabled: entry.playbackEnabled,
|
||||
});
|
||||
|
||||
const runId = entry.activeRunId;
|
||||
if (!runId) {
|
||||
emitChatVoiceEvent(context, connId, {
|
||||
sessionKey,
|
||||
state: "interrupted",
|
||||
playbackEnabled: entry.playbackEnabled,
|
||||
});
|
||||
respond(true, { ok: true, aborted: false });
|
||||
return;
|
||||
}
|
||||
|
||||
void chatHandlers["chat.abort"]({
|
||||
req,
|
||||
params: {
|
||||
sessionKey,
|
||||
runId,
|
||||
},
|
||||
client,
|
||||
isWebchatConnect: () => false,
|
||||
context,
|
||||
respond: () => undefined,
|
||||
});
|
||||
setChatVoiceRunId(sessionKey, null);
|
||||
emitChatVoiceEvent(context, connId, {
|
||||
sessionKey,
|
||||
state: "interrupted",
|
||||
runId,
|
||||
playbackEnabled: entry.playbackEnabled,
|
||||
});
|
||||
respond(true, { ok: true, aborted: true, runId });
|
||||
},
|
||||
"chat.voice.stop": async ({ params, req, respond, context, client }) => {
|
||||
if (!validateChatVoiceStopParams(params)) {
|
||||
respond(
|
||||
false,
|
||||
undefined,
|
||||
errorShape(
|
||||
ErrorCodes.INVALID_REQUEST,
|
||||
`invalid chat.voice.stop params: ${formatValidationErrors(validateChatVoiceStopParams.errors)}`,
|
||||
),
|
||||
);
|
||||
return;
|
||||
}
|
||||
const connId = normalizeOptionalText(client?.connId);
|
||||
if (!connId) {
|
||||
respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, "voice requires connId"));
|
||||
return;
|
||||
}
|
||||
const { sessionKey: rawSessionKey } = params as { sessionKey: string };
|
||||
const { canonicalKey: sessionKey } = loadSessionEntry(rawSessionKey);
|
||||
const entry = getChatVoiceSession(sessionKey);
|
||||
if (!entry || entry.connId !== connId) {
|
||||
respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, "voice session not found"));
|
||||
return;
|
||||
}
|
||||
emitChatVoiceEvent(context, connId, {
|
||||
sessionKey,
|
||||
state: "playback_clear",
|
||||
playbackEnabled: entry.playbackEnabled,
|
||||
});
|
||||
if (entry.activeRunId) {
|
||||
void chatHandlers["chat.abort"]({
|
||||
req,
|
||||
params: {
|
||||
sessionKey,
|
||||
runId: entry.activeRunId,
|
||||
},
|
||||
client,
|
||||
isWebchatConnect: () => false,
|
||||
context,
|
||||
respond: () => undefined,
|
||||
});
|
||||
}
|
||||
setChatVoiceRunId(sessionKey, null);
|
||||
await closeChatVoiceSession({
|
||||
context,
|
||||
sessionKey,
|
||||
connId,
|
||||
});
|
||||
respond(true, { ok: true });
|
||||
},
|
||||
"chat.inject": async ({ params, respond, context }) => {
|
||||
if (!validateChatInjectParams(params)) {
|
||||
respond(
|
||||
|
||||
202
src/gateway/server-methods/chat.voice.test.ts
Normal file
@@ -0,0 +1,202 @@
|
||||
import { afterEach, describe, expect, it, vi } from "vitest";
|
||||
import {
|
||||
deleteChatVoiceSession,
|
||||
getChatVoiceSession,
|
||||
setChatVoiceSession,
|
||||
} from "../chat-voice-sessions.js";
|
||||
import { ErrorCodes } from "../protocol/index.js";
|
||||
|
||||
const mockState = vi.hoisted(() => ({
|
||||
cfg: {
|
||||
gateway: {
|
||||
controlUi: {
|
||||
voice: {
|
||||
enabled: true,
|
||||
transcriptionProvider: "mock-stt",
|
||||
playbackEnabled: true,
|
||||
},
|
||||
},
|
||||
},
|
||||
models: {
|
||||
providers: {
|
||||
"mock-stt": {},
|
||||
},
|
||||
},
|
||||
} as Record<string, unknown>,
|
||||
provider: null as {
|
||||
id: string;
|
||||
isConfigured: ReturnType<typeof vi.fn>;
|
||||
createSession: ReturnType<typeof vi.fn>;
|
||||
} | null,
|
||||
}));
|
||||
|
||||
vi.mock("../session-utils.js", async () => {
|
||||
const original =
|
||||
await vi.importActual<typeof import("../session-utils.js")>("../session-utils.js");
|
||||
return {
|
||||
...original,
|
||||
loadSessionEntry: (rawKey: string) => ({
|
||||
cfg: mockState.cfg,
|
||||
storePath: "/tmp/sessions.json",
|
||||
entry: {
|
||||
sessionId: "sess-voice-1",
|
||||
sessionFile: "/tmp/sess-voice-1.jsonl",
|
||||
},
|
||||
canonicalKey: rawKey || "main",
|
||||
}),
|
||||
};
|
||||
});
|
||||
|
||||
vi.mock("../../plugin-sdk/realtime-transcription.js", () => ({
|
||||
getRealtimeTranscriptionProvider: vi.fn(() => mockState.provider),
|
||||
}));
|
||||
|
||||
const { chatHandlers } = await import("./chat.js");
|
||||
|
||||
function createContext() {
|
||||
return {
|
||||
broadcastToConnIds: vi.fn(),
|
||||
logGateway: {
|
||||
warn: vi.fn(),
|
||||
debug: vi.fn(),
|
||||
},
|
||||
};
|
||||
}
|
||||
|
||||
function createClient(connId = "conn-1") {
|
||||
return { connId } as const;
|
||||
}
|
||||
|
||||
afterEach(() => {
|
||||
vi.restoreAllMocks();
|
||||
deleteChatVoiceSession("main");
|
||||
mockState.provider = null;
|
||||
});
|
||||
|
||||
describe("chat voice handlers", () => {
|
||||
it("ignores stale onError callbacks from replaced voice sessions", async () => {
|
||||
const callbacks: Array<{
|
||||
onError?: (error: Error) => void;
|
||||
}> = [];
|
||||
const sessions = [
|
||||
{
|
||||
connect: vi.fn(async () => undefined),
|
||||
sendAudio: vi.fn(),
|
||||
close: vi.fn(),
|
||||
isConnected: vi.fn(() => true),
|
||||
},
|
||||
{
|
||||
connect: vi.fn(async () => undefined),
|
||||
sendAudio: vi.fn(),
|
||||
close: vi.fn(),
|
||||
isConnected: vi.fn(() => true),
|
||||
},
|
||||
];
|
||||
mockState.provider = {
|
||||
id: "mock-stt",
|
||||
isConfigured: vi.fn(() => true),
|
||||
createSession: vi.fn((params) => {
|
||||
callbacks.push(params);
|
||||
return sessions[callbacks.length - 1];
|
||||
}),
|
||||
};
|
||||
const context = createContext();
|
||||
const respond = vi.fn();
|
||||
|
||||
await chatHandlers["chat.voice.start"]({
|
||||
params: { sessionKey: "main" },
|
||||
respond,
|
||||
context: context as never,
|
||||
client: createClient(),
|
||||
} as never);
|
||||
await chatHandlers["chat.voice.start"]({
|
||||
params: { sessionKey: "main" },
|
||||
respond,
|
||||
context: context as never,
|
||||
client: createClient(),
|
||||
} as never);
|
||||
|
||||
expect(getChatVoiceSession("main")?.sttSession).toBe(sessions[1]);
|
||||
|
||||
callbacks[0].onError?.(new Error("late"));
|
||||
|
||||
expect(getChatVoiceSession("main")?.sttSession).toBe(sessions[1]);
|
||||
});
|
||||
|
||||
it("rejects malformed base64 audio before forwarding to the session", async () => {
|
||||
const sendAudio = vi.fn();
|
||||
setChatVoiceSession({
|
||||
sessionKey: "main",
|
||||
connId: "conn-1",
|
||||
providerId: "mock-stt",
|
||||
playbackEnabled: true,
|
||||
sttSession: {
|
||||
connect: vi.fn(async () => undefined),
|
||||
sendAudio,
|
||||
close: vi.fn(),
|
||||
isConnected: vi.fn(() => true),
|
||||
},
|
||||
transcriptPartial: "",
|
||||
transcriptFinal: "",
|
||||
activeRunId: null,
|
||||
});
|
||||
const respond = vi.fn();
|
||||
|
||||
await chatHandlers["chat.voice.audio"]({
|
||||
params: { sessionKey: "main", audio: "not@base64", format: "pcm16" },
|
||||
respond,
|
||||
client: createClient(),
|
||||
} as never);
|
||||
|
||||
expect(sendAudio).not.toHaveBeenCalled();
|
||||
expect(respond).toHaveBeenCalledWith(
|
||||
false,
|
||||
undefined,
|
||||
expect.objectContaining({
|
||||
code: ErrorCodes.INVALID_REQUEST,
|
||||
message: expect.stringContaining("base64"),
|
||||
}),
|
||||
);
|
||||
});
|
||||
|
||||
it("preserves buffered transcript when commit send fails", async () => {
|
||||
const sttSession = {
|
||||
connect: vi.fn(async () => undefined),
|
||||
sendAudio: vi.fn(),
|
||||
close: vi.fn(),
|
||||
isConnected: vi.fn(() => true),
|
||||
};
|
||||
setChatVoiceSession({
|
||||
sessionKey: "main",
|
||||
connId: "conn-1",
|
||||
providerId: "mock-stt",
|
||||
playbackEnabled: true,
|
||||
sttSession,
|
||||
transcriptPartial: "draft tail",
|
||||
transcriptFinal: "hello from voice",
|
||||
activeRunId: null,
|
||||
});
|
||||
vi.spyOn(chatHandlers, "chat.send").mockImplementation(async ({ respond }) => {
|
||||
respond(false, undefined, { code: ErrorCodes.UNAVAILABLE, message: "send failed" } as never);
|
||||
});
|
||||
const respond = vi.fn();
|
||||
|
||||
await chatHandlers["chat.voice.commit"]({
|
||||
params: { sessionKey: "main" },
|
||||
req: {} as never,
|
||||
respond,
|
||||
context: createContext() as never,
|
||||
client: createClient(),
|
||||
} as never);
|
||||
|
||||
expect(getChatVoiceSession("main")).toMatchObject({
|
||||
transcriptFinal: "hello from voice",
|
||||
transcriptPartial: "draft tail",
|
||||
});
|
||||
expect(respond).toHaveBeenCalledWith(
|
||||
false,
|
||||
undefined,
|
||||
expect.objectContaining({ code: ErrorCodes.UNAVAILABLE }),
|
||||
);
|
||||
});
|
||||
});
|
||||
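The first test above relies on the identity guard in `getActiveChatVoiceCallbackSession`: a callback only acts if the registry still holds the same connection and the same session object that registered it. A minimal sketch of that pattern follows. The names `registry`, `getActive`, `startSession`, and the `log` array are hypothetical stand-ins for illustration, not the gateway's actual session store.

```typescript
// Hypothetical names throughout; this is not the gateway's real registry.
type SttSession = { close(): void };
type VoiceEntry = { connId: string; sttSession: SttSession };

const registry = new Map<string, VoiceEntry>();

// Return the entry only if it still belongs to the caller's connection AND
// still wraps the exact session object that registered the callback.
function getActive(
  sessionKey: string,
  connId: string,
  sttSession: SttSession,
): VoiceEntry | undefined {
  const entry = registry.get(sessionKey);
  if (!entry || entry.connId !== connId || entry.sttSession !== sttSession) {
    return undefined;
  }
  return entry;
}

// Register a session and hand back an error callback bound to that session.
function startSession(sessionKey: string, connId: string, log: string[]) {
  const session: SttSession = { close: () => log.push("closed") };
  const fireError = () => {
    if (!getActive(sessionKey, connId, session)) {
      log.push("ignored stale error");
      return;
    }
    registry.delete(sessionKey);
    log.push("torn down");
  };
  registry.set(sessionKey, { connId, sttSession: session });
  return { session, fireError };
}

const log: string[] = [];
const first = startSession("main", "conn-1", log);
const second = startSession("main", "conn-1", log); // replaces the first entry

first.fireError(); // late callback from the replaced session: no effect
const stillSecond = registry.get("main")?.sttSession === second.session;
second.fireError(); // callback from the live session: tears it down
```

Comparing object identity rather than only `sessionKey` and `connId` is what lets a restart on the same connection survive a late error from the session it replaced.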
140
src/gateway/server-methods/realtime-transcription.test.ts
Normal file
@@ -0,0 +1,140 @@
|
||||
import { beforeEach, describe, expect, it, vi } from "vitest";
|
||||
|
||||
const mocks = vi.hoisted(() => ({
|
||||
manager: {
|
||||
startSession: vi.fn(),
|
||||
pushAudio: vi.fn(),
|
||||
pullEvents: vi.fn(),
|
||||
finishSession: vi.fn(),
|
||||
},
|
||||
}));
|
||||
|
||||
vi.mock("../realtime-transcription-session-manager.js", () => ({
|
||||
getRealtimeTranscriptionSessionManager: () => mocks.manager,
|
||||
__testing: {
|
||||
normalizeAudioFormat: (value: string | undefined) =>
|
||||
value === "s16le" || value === "pcm16" || value === "g711_ulaw" ? value : null,
|
||||
},
|
||||
}));
|
||||
|
||||
import { realtimeTranscriptionHandlers } from "./realtime-transcription.js";
|
||||
|
||||
describe("realtimeTranscriptionHandlers", () => {
|
||||
beforeEach(() => {
|
||||
mocks.manager.startSession.mockReset();
|
||||
mocks.manager.pushAudio.mockReset();
|
||||
mocks.manager.pullEvents.mockReset();
|
||||
mocks.manager.finishSession.mockReset();
|
||||
});
|
||||
|
||||
it("starts a session with validated audio metadata", async () => {
|
||||
mocks.manager.startSession.mockResolvedValue({ sessionId: "s1", provider: "openai" });
|
||||
const respond = vi.fn();
|
||||
|
||||
await realtimeTranscriptionHandlers["realtimeTranscription.start"]({
|
||||
req: { method: "realtimeTranscription.start", id: "1" } as never,
|
||||
params: { format: "s16le", sampleRate: 16000, channels: 1 },
|
||||
client: null,
|
||||
isWebchatConnect: () => false,
|
||||
respond,
|
||||
context: {} as never,
|
||||
});
|
||||
|
||||
expect(mocks.manager.startSession).toHaveBeenCalledWith({
|
||||
provider: undefined,
|
||||
providerConfig: undefined,
|
||||
format: "s16le",
|
||||
sampleRate: 16000,
|
||||
channels: 1,
|
||||
});
|
||||
expect(respond).toHaveBeenCalledWith(true, { sessionId: "s1", provider: "openai" });
|
||||
});
|
||||
|
||||
it("rejects invalid start formats", async () => {
|
||||
const respond = vi.fn();
|
||||
|
||||
await realtimeTranscriptionHandlers["realtimeTranscription.start"]({
|
||||
req: { method: "realtimeTranscription.start", id: "1" } as never,
|
||||
params: { format: "wav", sampleRate: 16000, channels: 1 },
|
||||
client: null,
|
||||
isWebchatConnect: () => false,
|
||||
respond,
|
||||
context: {} as never,
|
||||
});
|
||||
|
||||
expect(mocks.manager.startSession).not.toHaveBeenCalled();
|
||||
expect(respond.mock.calls[0]?.[0]).toBe(false);
|
||||
});
|
||||
|
||||
it("pushes audio chunks to an existing session", async () => {
|
||||
mocks.manager.pushAudio.mockReturnValue({ sessionId: "s1", acceptedBytes: 4, connected: true });
|
||||
const respond = vi.fn();
|
||||
|
||||
await realtimeTranscriptionHandlers["realtimeTranscription.pushAudio"]({
|
||||
req: { method: "realtimeTranscription.pushAudio", id: "2" } as never,
|
||||
params: { sessionId: "s1", audio: Buffer.from("test").toString("base64") },
|
||||
client: null,
|
||||
isWebchatConnect: () => false,
|
||||
respond,
|
||||
context: {} as never,
|
||||
});
|
||||
|
||||
expect(mocks.manager.pushAudio).toHaveBeenCalledWith({
|
||||
sessionId: "s1",
|
||||
audio: expect.any(Buffer),
|
||||
});
|
||||
expect(respond).toHaveBeenCalledWith(
|
||||
true,
|
||||
expect.objectContaining({ sessionId: "s1", acceptedBytes: 4 }),
|
||||
);
|
||||
});
|
||||
|
||||
it("rejects malformed base64 audio payloads before forwarding to the manager", async () => {
|
||||
const respond = vi.fn();
|
||||
|
||||
await realtimeTranscriptionHandlers["realtimeTranscription.pushAudio"]({
|
||||
req: { method: "realtimeTranscription.pushAudio", id: "2b" } as never,
|
||||
params: { sessionId: "s1", audio: "%%%not-base64%%%" },
|
||||
client: null,
|
||||
isWebchatConnect: () => false,
|
||||
respond,
|
||||
context: {} as never,
|
||||
});
|
||||
|
||||
expect(mocks.manager.pushAudio).not.toHaveBeenCalled();
|
||||
expect(respond.mock.calls[0]?.[0]).toBe(false);
|
||||
expect(JSON.stringify(respond.mock.calls[0]?.[2] ?? {})).toContain("audio must be base64 encoded");
|
||||
});
|
||||
|
||||
it("returns final events from finish and lets the manager clean up immediately", async () => {
|
||||
mocks.manager.finishSession.mockReturnValue({
|
||||
sessionId: "s1",
|
||||
provider: "openai",
|
||||
closed: true,
|
||||
events: [{ type: "session.ended", reason: "client_finish", timestamp: 123 }],
|
||||
});
|
||||
const respond = vi.fn();
|
||||
|
||||
await realtimeTranscriptionHandlers["realtimeTranscription.finish"]({
|
||||
req: { method: "realtimeTranscription.finish", id: "3" } as never,
|
||||
params: { sessionId: "s1" },
|
||||
client: null,
|
||||
isWebchatConnect: () => false,
|
||||
respond,
|
||||
context: {} as never,
|
||||
});
|
||||
|
||||
expect(mocks.manager.finishSession).toHaveBeenCalledWith({
|
||||
sessionId: "s1",
|
||||
reason: undefined,
|
||||
});
|
||||
expect(respond).toHaveBeenCalledWith(
|
||||
true,
|
||||
expect.objectContaining({
|
||||
sessionId: "s1",
|
||||
closed: true,
|
||||
events: [{ type: "session.ended", reason: "client_finish", timestamp: 123 }],
|
||||
}),
|
||||
);
|
||||
});
|
||||
});
|
||||
118
src/gateway/server-methods/realtime-transcription.ts
Normal file
@@ -0,0 +1,118 @@
|
||||
import { ErrorCodes, errorShape } from "../protocol/index.js";
|
||||
import {
|
||||
getRealtimeTranscriptionSessionManager,
|
||||
__testing as managerTesting,
|
||||
} from "../realtime-transcription-session-manager.js";
|
||||
import { formatForLog } from "../ws-log.js";
|
||||
import type { GatewayRequestHandlers } from "./types.js";
|
||||
|
||||
function parsePositiveNumber(value: unknown, name: string): number {
|
||||
const number =
|
||||
typeof value === "number"
|
||||
? value
|
||||
: typeof value === "string" && value.trim()
|
||||
? Number(value)
|
||||
: Number.NaN;
|
||||
if (!Number.isFinite(number) || number <= 0) {
|
||||
throw new Error(`${name} must be a positive number.`);
|
||||
}
|
||||
return number;
|
||||
}
|
||||
|
||||
function parseSessionId(value: unknown): string {
|
||||
const sessionId = typeof value === "string" ? value.trim() : "";
|
||||
if (!sessionId) {
|
||||
throw new Error("sessionId is required.");
|
||||
}
|
||||
return sessionId;
|
||||
}
|
||||
|
||||
function parseAudioBuffer(value: unknown): Buffer {
|
||||
const audio = typeof value === "string" ? value.trim() : "";
|
||||
if (!audio) {
|
||||
throw new Error("audio is required.");
|
||||
}
|
||||
if (!isStrictBase64(audio)) {
|
||||
throw new Error("audio must be base64 encoded.");
|
||||
}
|
||||
return Buffer.from(audio, "base64");
|
||||
}
|
||||
|
||||
function isStrictBase64(value: string): boolean {
|
||||
const normalized = value.replace(/\s+/g, "");
|
||||
if (!normalized || normalized.length % 4 !== 0) {
|
||||
return false;
|
||||
}
|
||||
if (!/^[A-Za-z0-9+/]+={0,2}$/.test(normalized)) {
|
||||
return false;
|
||||
}
|
||||
const decoded = Buffer.from(normalized, "base64");
|
||||
return decoded.length > 0 && decoded.toString("base64") === normalized;
|
||||
}
|
||||
|
||||
export const realtimeTranscriptionHandlers: GatewayRequestHandlers = {
|
||||
"realtimeTranscription.start": async ({ params, respond }) => {
|
||||
try {
|
||||
const format = managerTesting.normalizeAudioFormat(
|
||||
typeof params.format === "string" ? params.format : undefined,
|
||||
);
|
||||
if (!format) {
|
||||
respond(
|
||||
false,
|
||||
undefined,
|
||||
errorShape(
|
||||
ErrorCodes.INVALID_REQUEST,
|
||||
"format is required and must be one of: s16le, pcm16, g711_ulaw",
|
||||
),
|
||||
);
|
||||
return;
|
||||
}
|
||||
const result = await getRealtimeTranscriptionSessionManager().startSession({
|
||||
provider: typeof params.provider === "string" ? params.provider.trim() : undefined,
|
||||
providerConfig:
|
||||
params.providerConfig && typeof params.providerConfig === "object"
|
||||
? (params.providerConfig as Record<string, unknown>)
|
||||
: undefined,
|
||||
format,
|
||||
sampleRate: parsePositiveNumber(params.sampleRate, "sampleRate"),
|
||||
channels: parsePositiveNumber(params.channels, "channels"),
|
||||
});
|
||||
respond(true, result);
|
||||
} catch (err) {
|
||||
respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, formatForLog(err)));
|
||||
}
|
||||
},
|
||||
"realtimeTranscription.pushAudio": async ({ params, respond }) => {
|
||||
try {
|
||||
const result = getRealtimeTranscriptionSessionManager().pushAudio({
|
||||
sessionId: parseSessionId(params.sessionId),
|
||||
audio: parseAudioBuffer(params.audio),
|
||||
});
|
||||
respond(true, result);
|
||||
} catch (err) {
|
||||
respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, formatForLog(err)));
|
||||
}
|
||||
},
|
||||
"realtimeTranscription.pull": async ({ params, respond }) => {
|
||||
try {
|
||||
const result = getRealtimeTranscriptionSessionManager().pullEvents({
|
||||
sessionId: parseSessionId(params.sessionId),
|
||||
limit: params.limit === undefined ? undefined : parsePositiveNumber(params.limit, "limit"),
|
||||
});
|
||||
respond(true, result);
|
||||
} catch (err) {
|
||||
respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, formatForLog(err)));
|
||||
}
|
||||
},
|
||||
"realtimeTranscription.finish": async ({ params, respond }) => {
|
||||
try {
|
||||
const result = getRealtimeTranscriptionSessionManager().finishSession({
|
||||
sessionId: parseSessionId(params.sessionId),
|
||||
reason: typeof params.reason === "string" ? params.reason : undefined,
|
||||
});
|
||||
respond(true, result);
|
||||
} catch (err) {
|
||||
respond(false, undefined, errorShape(ErrorCodes.INVALID_REQUEST, formatForLog(err)));
|
||||
}
|
||||
},
|
||||
};
|
||||
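The `isStrictBase64` helper above exists because `Buffer.from(value, "base64")` does not throw on malformed input, so a handler that decodes directly would forward arbitrary bytes to the provider. The sketch below repeats the helper's three checks (length, alphabet, round-trip) and runs them over a few sample strings. The `samples` and `verdicts` names are illustrative only, and Node.js is assumed as the runtime for `Buffer`.

```typescript
// Sketch only: same shape as the isStrictBase64 helper shown in the diff above.
function isStrictBase64(value: string): boolean {
  // Whitespace is tolerated, so strip it before checking anything else.
  const normalized = value.replace(/\s+/g, "");
  // Padded base64 always has a length that is a multiple of four.
  if (!normalized || normalized.length % 4 !== 0) {
    return false;
  }
  // Standard alphabet only, with at most two trailing "=" padding characters.
  if (!/^[A-Za-z0-9+/]+={0,2}$/.test(normalized)) {
    return false;
  }
  // Round-trip: decoding and re-encoding must reproduce the input exactly.
  const decoded = Buffer.from(normalized, "base64");
  return decoded.length > 0 && decoded.toString("base64") === normalized;
}

// Illustrative inputs: two valid payloads (one with a line break), then
// a bad character, a bad alphabet, short padding, and an empty string.
const samples = ["dGVzdA==", "dGVz\ndA==", "not@base64", "%%%not-base64%%%", "dGVzdA=", ""];
const verdicts = samples.map((sample) => isStrictBase64(sample));
```

The same helper is defined in both `chat.ts` and `realtime-transcription.ts` in this change; a shared module would be one way to keep the two copies from drifting.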
@@ -9,6 +9,7 @@ import {
|
||||
getTtsProvider,
|
||||
isTtsEnabled,
|
||||
isTtsProviderConfigured,
|
||||
resolveExplicitTtsOverrides,
|
||||
resolveTtsAutoMode,
|
||||
resolveTtsConfig,
|
||||
resolveTtsPrefsPath,
|
||||
@@ -89,7 +90,22 @@ export const ttsHandlers: GatewayRequestHandlers = {
     try {
       const cfg = loadConfig();
       const channel = typeof params.channel === "string" ? params.channel.trim() : undefined;
-      const result = await textToSpeech({ text, cfg, channel });
+      const providerRaw = typeof params.provider === "string" ? params.provider.trim() : undefined;
+      const modelId = typeof params.modelId === "string" ? params.modelId.trim() : undefined;
+      const voiceId = typeof params.voiceId === "string" ? params.voiceId.trim() : undefined;
+      const overrides = resolveExplicitTtsOverrides({
+        cfg,
+        provider: providerRaw,
+        modelId,
+        voiceId,
+      });
+      const result = await textToSpeech({
+        text,
+        cfg,
+        channel,
+        overrides,
+        disableFallback: Boolean(overrides.provider || modelId || voiceId),
+      });
       if (result.success && result.audioPath) {
         respond(true, {
           audioPath: result.audioPath,
@@ -8,6 +8,7 @@ import { truncateUtf16Safe } from "../../utils.js";
|
||||
import { isWebchatClient } from "../../utils/message-channel.js";
|
||||
import type { AuthRateLimiter } from "../auth-rate-limit.js";
|
||||
import type { ResolvedGatewayAuth } from "../auth.js";
|
||||
import { closeChatVoiceSessionsForConn } from "../chat-voice-sessions.js";
|
||||
import { getPreauthHandshakeTimeoutMsFromEnv } from "../handshake-timeouts.js";
|
||||
import { isLoopbackAddress } from "../net.js";
|
||||
import type { GatewayRequestContext, GatewayRequestHandlers } from "../server-methods/types.js";
|
||||
@@ -270,6 +271,9 @@ export function attachGatewayWsConnectionHandler(params: AttachGatewayWsConnecti
|
||||
}
|
||||
const context = buildRequestContext();
|
||||
context.unsubscribeAllSessionEvents(connId);
|
||||
closeChatVoiceSessionsForConn(connId, (targetConnId, payload) => {
|
||||
context.broadcastToConnIds("chat.voice.event", payload, new Set([targetConnId]));
|
||||
});
|
||||
if (client?.connect?.role === "node") {
|
||||
const nodeId = context.nodeRegistry.unregister(connId);
|
||||
if (nodeId) {
|
||||
|
||||
@@ -121,6 +121,43 @@ describe("runCapability auto audio entries", () => {
|
||||
expect(seenModel).toBe("whisper-1");
|
||||
});
|
||||
|
||||
it("lets per-request transcription hints override configured model-entry hints", async () => {
|
||||
let seenLanguage: string | undefined;
|
||||
let seenPrompt: string | undefined;
|
||||
const result = await runAutoAudioCase({
|
||||
transcribeAudio: async (req) => {
|
||||
seenLanguage = req.language;
|
||||
seenPrompt = req.prompt;
|
||||
return { text: "ok", model: req.model ?? "unknown" };
|
||||
},
|
||||
cfgExtra: {
|
||||
tools: {
|
||||
media: {
|
||||
audio: {
|
||||
enabled: true,
|
||||
prompt: "configured prompt",
|
||||
language: "fr",
|
||||
_requestPromptOverride: "Focus on names",
|
||||
_requestLanguageOverride: "en",
|
||||
models: [
|
||||
{
|
||||
provider: "openai",
|
||||
model: "whisper-1",
|
||||
prompt: "entry prompt",
|
||||
language: "de",
|
||||
},
|
||||
],
|
||||
},
|
||||
},
|
||||
},
|
||||
} as Partial<OpenClawConfig>,
|
||||
});
|
||||
|
||||
expect(result.outputs[0]?.text).toBe("ok");
|
||||
expect(seenLanguage).toBe("en");
|
||||
expect(seenPrompt).toBe("Focus on names");
|
||||
});
|
||||
|
||||
it("uses mistral when only mistral key is configured", async () => {
|
||||
const isolatedAgentDir = await fs.mkdtemp(path.join(os.tmpdir(), "openclaw-audio-agent-"));
|
||||
let runResult: Awaited<ReturnType<typeof runCapability>> | undefined;
|
||||
|
||||
67
src/media-understanding/runner.cli-audio.test.ts
Normal file
@@ -0,0 +1,67 @@
|
||||
import { afterEach, beforeAll, beforeEach, describe, expect, it, vi } from "vitest";
|
||||
import type { OpenClawConfig } from "../config/config.js";
|
||||
import { withAudioFixture } from "./runner.test-utils.js";
|
||||
|
||||
const runExecMock = vi.hoisted(() => vi.fn());
|
||||
|
||||
vi.mock("../process/exec.js", () => ({
|
||||
runExec: (...args: unknown[]) => runExecMock(...args),
|
||||
}));
|
||||
|
||||
let runCliEntry: typeof import("./runner.entries.js").runCliEntry;
|
||||
|
||||
describe("media-understanding CLI audio entry", () => {
|
||||
beforeAll(async () => {
|
||||
({ runCliEntry } = await import("./runner.entries.js"));
|
||||
});
|
||||
|
||||
beforeEach(() => {
|
||||
runExecMock.mockReset().mockResolvedValue({ stdout: "cli transcript" });
|
||||
});
|
||||
|
||||
afterEach(() => {
|
||||
vi.clearAllMocks();
|
||||
});
|
||||
|
||||
it("applies per-request prompt and language overrides to CLI transcription templating", async () => {
|
||||
await withAudioFixture("openclaw-cli-audio", async ({ ctx, cache }) => {
|
||||
await runCliEntry({
|
||||
capability: "audio",
|
||||
entry: {
|
||||
type: "cli",
|
||||
command: "mock-transcriber",
|
||||
args: ["--prompt", "{{Prompt}}", "--language", "{{Language}}", "--file", "{{MediaPath}}"],
|
||||
prompt: "entry prompt",
|
||||
language: "de",
|
||||
},
|
||||
cfg: {
|
||||
tools: {
|
||||
media: {
|
||||
audio: {
|
||||
prompt: "configured prompt",
|
||||
language: "fr",
|
||||
_requestPromptOverride: "Focus on names",
|
||||
_requestLanguageOverride: "en",
|
||||
},
|
||||
},
|
||||
},
|
||||
} as OpenClawConfig,
|
||||
ctx,
|
||||
attachmentIndex: 0,
|
||||
cache,
|
||||
config: {
|
||||
prompt: "configured prompt",
|
||||
language: "fr",
|
||||
_requestPromptOverride: "Focus on names",
|
||||
_requestLanguageOverride: "en",
|
||||
} as never,
|
||||
});
|
||||
});
|
||||
|
||||
expect(runExecMock).toHaveBeenCalledWith(
|
||||
"mock-transcriber",
|
||||
expect.arrayContaining(["--prompt", "Focus on names", "--language", "en"]),
|
||||
expect.any(Object),
|
||||
);
|
||||
});
|
||||
});
|
||||
@@ -372,6 +372,20 @@ function resolveEntryRunOptions(params: {
|
||||
return { maxBytes, maxChars, timeoutMs, prompt };
|
||||
}
|
||||
|
||||
function resolveAudioRequestOverrides(config: MediaUnderstandingConfig | undefined): {
|
||||
prompt?: string;
|
||||
language?: string;
|
||||
} {
|
||||
const overrides = (config ?? {}) as MediaUnderstandingConfig & {
|
||||
_requestPromptOverride?: string;
|
||||
_requestLanguageOverride?: string;
|
||||
};
|
||||
return {
|
||||
prompt: overrides._requestPromptOverride,
|
||||
language: overrides._requestLanguageOverride,
|
||||
};
|
||||
}
|
||||
|
||||
async function resolveProviderExecutionAuth(params: {
|
||||
providerId: string;
|
||||
cfg: OpenClawConfig;
|
||||
@@ -530,6 +544,7 @@ export async function runProviderEntry(params: {
|
||||
throw new Error(`Audio transcription provider "${providerId}" not available.`);
|
||||
}
|
||||
const transcribeAudio = provider.transcribeAudio;
|
||||
const requestOverrides = resolveAudioRequestOverrides(params.config);
|
||||
const media = await params.cache.getBuffer({
|
||||
attachmentIndex: params.attachmentIndex,
|
||||
maxBytes,
|
||||
@@ -569,8 +584,12 @@ export async function runProviderEntry(params: {
       headers,
       request,
       model,
-      language: entry.language ?? params.config?.language ?? cfg.tools?.media?.audio?.language,
-      prompt,
+      language:
+        requestOverrides.language ??
+        entry.language ??
+        params.config?.language ??
+        cfg.tools?.media?.audio?.language,
+      prompt: requestOverrides.prompt ?? prompt,
       query: providerQuery,
       timeoutMs,
       fetchFn,
@@ -651,6 +670,7 @@ export async function runCliEntry(params: {
|
||||
if (!command) {
|
||||
throw new Error(`CLI entry missing command for ${capability}`);
|
||||
}
|
||||
const requestOverrides = resolveAudioRequestOverrides(params.config);
|
||||
const { maxBytes, maxChars, timeoutMs, prompt } = resolveEntryRunOptions({
|
||||
capability,
|
||||
entry,
|
||||
@@ -683,7 +703,8 @@ export async function runCliEntry(params: {
|
||||
MediaDir: path.dirname(mediaPath),
|
||||
OutputDir: outputDir,
|
||||
OutputBase: outputBase,
|
||||
Prompt: prompt,
|
||||
Prompt: requestOverrides.prompt ?? prompt,
|
||||
...(requestOverrides.language ? { Language: requestOverrides.language } : {}),
|
||||
MaxChars: maxChars,
|
||||
};
|
||||
const argv = [command, ...args].map((part, index) =>
|
||||
|
||||
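The hunks above make per-request transcription hints win over entry-level and configured values, both for provider calls and for CLI argument templating. A minimal sketch of that precedence follows. The type and function names (`AudioHints`, `resolveTranscriptionHints`, `applyTemplate`) are hypothetical and not part of the repository, and the `{{Name}}` substitution is an assumption based on the template strings used in the test; the diff itself only shows the order for `language` and that a request prompt overrides the already-resolved `prompt`.

```typescript
// Hypothetical, simplified shapes for illustration only.
type AudioHints = { prompt?: string; language?: string };

// Pick the first defined value per field, walking layers from highest
// to lowest priority: request > entry > capability config > global config.
function resolveTranscriptionHints(layers: {
  request?: AudioHints;
  entry?: AudioHints;
  capability?: AudioHints;
  global?: AudioHints;
}): AudioHints {
  const ordered = [layers.request, layers.entry, layers.capability, layers.global];
  const pick = (key: keyof AudioHints): string | undefined =>
    ordered.map((layer) => layer?.[key]).find((value) => value !== undefined);
  return { prompt: pick("prompt"), language: pick("language") };
}

// Assumed {{Name}} templating for CLI args; unknown names become "".
function applyTemplate(arg: string, vars: Record<string, string | undefined>): string {
  return arg.replace(/\{\{(\w+)\}\}/g, (_match: string, name: string) => vars[name] ?? "");
}

const hints = resolveTranscriptionHints({
  request: { prompt: "Focus on names", language: "en" },
  entry: { prompt: "entry prompt", language: "de" },
  global: { prompt: "configured prompt", language: "fr" },
});
const argv = ["--prompt", "{{Prompt}}", "--language", "{{Language}}"].map((part) =>
  applyTemplate(part, { Prompt: hints.prompt, Language: hints.language }),
);
```

With the request layer removed, the same call would fall through to the entry values, which mirrors the pre-change behavior for `language`.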
@@ -150,7 +150,28 @@ export async function transcribeAudioFile(params: {
   agentDir?: string;
   mime?: string;
   activeModel?: ActiveMediaModel;
+  language?: string;
+  prompt?: string;
 }): Promise<{ text: string | undefined }> {
-  const result = await runMediaUnderstandingFile({ ...params, capability: "audio" });
+  const cfg =
+    params.language || params.prompt
+      ? {
+          ...params.cfg,
+          tools: {
+            ...params.cfg.tools,
+            media: {
+              ...params.cfg.tools?.media,
+              audio: {
+                ...params.cfg.tools?.media?.audio,
+                ...(params.language ? { _requestLanguageOverride: params.language } : {}),
+                ...(params.prompt ? { _requestPromptOverride: params.prompt } : {}),
+                ...(params.language ? { language: params.language } : {}),
+                ...(params.prompt ? { prompt: params.prompt } : {}),
+              },
+            },
+          },
+        }
+      : params.cfg;
+  const result = await runMediaUnderstandingFile({ ...params, cfg, capability: "audio" });
   return { text: result.text };
 }
@@ -1,6 +1,6 @@
|
||||
import fs from "node:fs";
|
||||
import path from "node:path";
|
||||
import { describe, expect, it } from "vitest";
|
||||
import { describe, expect, it, vi } from "vitest";
|
||||
import { withTempHome } from "../../test/helpers/temp-home.js";
|
||||
import type { OpenClawConfig } from "../config/config.js";
|
||||
import { resolveStatusTtsSnapshot } from "./status-config.js";
|
||||
@@ -61,4 +61,44 @@ describe("resolveStatusTtsSnapshot", () => {
|
||||
});
|
||||
});
|
||||
});
|
||||
|
||||
it("derives the default prefs path from OPENCLAW_CONFIG_PATH when set", async () => {
|
||||
await withTempHome(
|
||||
async (home) => {
|
||||
const stateDir = path.join(home, ".openclaw-dev");
|
||||
const prefsPath = path.join(stateDir, "settings", "tts.json");
|
||||
fs.mkdirSync(path.dirname(prefsPath), { recursive: true });
|
||||
fs.writeFileSync(
|
||||
prefsPath,
|
||||
JSON.stringify({
|
||||
tts: {
|
||||
auto: "always",
|
||||
provider: "openai",
|
||||
},
|
||||
}),
|
||||
);
|
||||
|
||||
vi.stubEnv("OPENCLAW_CONFIG_PATH", path.join(stateDir, "openclaw.json"));
|
||||
try {
|
||||
expect(
|
||||
resolveStatusTtsSnapshot({
|
||||
cfg: {
|
||||
messages: {
|
||||
tts: {},
|
||||
},
|
||||
} as OpenClawConfig,
|
||||
}),
|
||||
).toEqual({
|
||||
autoMode: "always",
|
||||
provider: "openai",
|
||||
maxLength: 1500,
|
||||
summarize: true,
|
||||
});
|
||||
} finally {
|
||||
vi.unstubAllEnvs();
|
||||
}
|
||||
},
|
||||
{ env: { OPENCLAW_STATE_DIR: undefined } },
|
||||
);
|
||||
});
|
||||
});
|
||||
|
||||
@@ -2,7 +2,7 @@ import fs from "node:fs";
 import path from "node:path";
 import type { OpenClawConfig } from "../config/config.js";
 import type { TtsAutoMode, TtsConfig, TtsProvider } from "../config/types.tts.js";
-import { CONFIG_DIR, resolveUserPath } from "../utils.js";
+import { resolveConfigDir, resolveUserPath } from "../utils.js";
 import { normalizeTtsAutoMode } from "./tts-auto-mode.js";
 
 const DEFAULT_TTS_MAX_LENGTH = 1500;
@@ -47,7 +47,7 @@ function resolveTtsPrefsPathValue(prefsPath: string | undefined): string {
   if (envPath) {
     return resolveUserPath(envPath);
   }
-  return path.join(CONFIG_DIR, "settings", "tts.json");
+  return path.join(resolveConfigDir(process.env), "settings", "tts.json");
 }
 
 function readPrefs(prefsPath: string): TtsUserPrefs {
@@ -10,6 +10,7 @@ export {
|
||||
isTtsProviderConfigured,
|
||||
listSpeechVoices,
|
||||
maybeApplyTtsToPayload,
|
||||
resolveExplicitTtsOverrides,
|
||||
resolveTtsAutoMode,
|
||||
resolveTtsConfig,
|
||||
resolveTtsPrefsPath,
|
||||
|
||||
@@ -50,6 +50,15 @@ describe("resolveConfigDir", () => {
|
||||
|
||||
expect(resolveConfigDir(env)).toBe(path.resolve("/tmp/openclaw-home", "state"));
|
||||
});
|
||||
|
||||
it("falls back to the config file directory when only OPENCLAW_CONFIG_PATH is set", () => {
|
||||
const env = {
|
||||
HOME: "/tmp/openclaw-home",
|
||||
OPENCLAW_CONFIG_PATH: "~/profiles/dev/openclaw.json",
|
||||
} as NodeJS.ProcessEnv;
|
||||
|
||||
expect(resolveConfigDir(env)).toBe(path.resolve("/tmp/openclaw-home", "profiles", "dev"));
|
||||
});
|
||||
});
|
||||
|
||||
describe("resolveHomeDir", () => {
|
||||
|
||||
@@ -141,6 +141,10 @@ export function resolveConfigDir(
   if (override) {
     return resolveUserPath(override, env, homedir);
   }
+  const configPath = env.OPENCLAW_CONFIG_PATH?.trim();
+  if (configPath) {
+    return path.dirname(resolveUserPath(configPath, env, homedir));
+  }
   const newDir = path.join(resolveRequiredHomeDir(env, homedir), ".openclaw");
   try {
     const hasNew = fs.existsSync(newDir);
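The resolution order this hunk introduces can be sketched in isolation as follows. This is a simplified illustration, not the repository's implementation: `resolveConfigDirSketch` and `expandUserPath` are hypothetical names, the explicit override is assumed to come from `OPENCLAW_STATE_DIR` (the hunk only shows a variable named `override`), and the legacy-directory probing that follows `fs.existsSync(newDir)` in the real function is omitted.

```typescript
import * as os from "node:os";
import * as path from "node:path";

// Expand a leading "~" against the given home directory, then make the path absolute.
function expandUserPath(input: string, home: string): string {
  const expanded =
    input === "~" || input.startsWith("~/") ? path.join(home, input.slice(1)) : input;
  return path.resolve(expanded);
}

// Order: explicit state dir override, then the directory that contains the
// config file, then the default ~/.openclaw directory.
function resolveConfigDirSketch(env: Record<string, string | undefined>): string {
  const home = env.HOME?.trim() || os.homedir();
  const stateDir = env.OPENCLAW_STATE_DIR?.trim(); // assumed override variable
  if (stateDir) {
    return expandUserPath(stateDir, home);
  }
  const configPath = env.OPENCLAW_CONFIG_PATH?.trim();
  if (configPath) {
    return path.dirname(expandUserPath(configPath, home));
  }
  return path.join(home, ".openclaw");
}

const profileDir = resolveConfigDirSketch({
  HOME: "/tmp/openclaw-home",
  OPENCLAW_CONFIG_PATH: "~/profiles/dev/openclaw.json",
});
```

Under these assumptions, files derived from the config directory, such as `settings/tts.json`, follow the profile selected through `OPENCLAW_CONFIG_PATH` instead of always landing in the default directory.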
@@ -64,6 +64,16 @@ function hasEntryCredential(
|
||||
});
|
||||
}
|
||||
|
||||
export function isWebFetchProviderConfigured(params: {
|
||||
provider: Pick<
|
||||
PluginWebFetchProviderEntry,
|
||||
"envVars" | "getConfiguredCredentialValue" | "getCredentialValue" | "requiresCredential"
|
||||
>;
|
||||
config?: OpenClawConfig;
|
||||
}): boolean {
|
||||
return hasEntryCredential(params.provider, params.config, resolveFetchConfig(params.config));
|
||||
}
|
||||
|
||||
export function listWebFetchProviders(params?: {
|
||||
config?: OpenClawConfig;
|
||||
}): PluginWebFetchProviderEntry[] {
|
||||
|
||||
@@ -289,4 +289,162 @@ describe("web search runtime", () => {
|
||||
result: { query: "runtime", provider: "beta", runtimeSelectedProvider: "beta" },
|
||||
});
|
||||
});
|
||||
|
||||
it("falls back to another provider when auto-selected search execution fails", async () => {
|
||||
resolveRuntimeWebSearchProvidersMock.mockReturnValue([
|
||||
createProvider({
|
||||
pluginId: "google",
|
||||
id: "google",
|
||||
credentialPath: "tools.web.search.google.apiKey",
|
||||
autoDetectOrder: 1,
|
||||
getCredentialValue: () => "configured",
|
||||
createTool: () => ({
|
||||
description: "google",
|
||||
parameters: {},
|
||||
execute: async () => {
|
||||
throw new Error("google aborted");
|
||||
},
|
||||
}),
|
||||
}),
|
||||
createProvider({
|
||||
pluginId: "duckduckgo",
|
||||
id: "duckduckgo",
|
||||
credentialPath: "",
|
||||
autoDetectOrder: 100,
|
||||
requiresCredential: false,
|
||||
createTool: () => ({
|
||||
description: "duckduckgo",
|
||||
parameters: {},
|
||||
execute: async (args) => ({ ...args, provider: "duckduckgo" }),
|
||||
}),
|
||||
}),
|
||||
]);
|
||||
|
||||
await expect(
|
||||
runWebSearch({
|
||||
config: {},
|
||||
args: { query: "fallback" },
|
||||
}),
|
||||
).resolves.toEqual({
|
||||
provider: "duckduckgo",
|
||||
result: { query: "fallback", provider: "duckduckgo" },
|
||||
});
|
||||
});
|
||||
|
||||
it("does not prebuild fallback provider tools before attempting the selected provider", async () => {
|
||||
resolveRuntimeWebSearchProvidersMock.mockReturnValue([
|
||||
createProvider({
|
||||
pluginId: "google",
|
||||
id: "google",
|
||||
credentialPath: "tools.web.search.google.apiKey",
|
||||
autoDetectOrder: 1,
|
||||
getCredentialValue: () => "configured",
|
||||
createTool: () => ({
|
||||
description: "google",
|
||||
parameters: {},
|
||||
execute: async (args) => ({ ...args, provider: "google" }),
|
||||
}),
|
||||
}),
|
||||
createProvider({
|
||||
pluginId: "broken-fallback",
|
||||
id: "broken-fallback",
|
||||
credentialPath: "",
|
||||
autoDetectOrder: 100,
|
||||
requiresCredential: false,
|
||||
createTool: () => {
|
||||
throw new Error("fallback createTool exploded");
|
||||
},
|
||||
}),
|
||||
]);
|
||||
|
||||
await expect(
|
||||
runWebSearch({
|
||||
config: {},
|
||||
args: { query: "selected-first" },
|
||||
}),
|
||||
).resolves.toEqual({
|
||||
provider: "google",
|
||||
result: { query: "selected-first", provider: "google" },
|
||||
});
|
||||
});
|
||||
|
||||
it("does not fall back when the provider came from explicit config selection", async () => {
|
||||
resolveRuntimeWebSearchProvidersMock.mockReturnValue([
|
||||
createProvider({
|
||||
pluginId: "google",
|
||||
id: "google",
|
||||
credentialPath: "tools.web.search.google.apiKey",
|
||||
autoDetectOrder: 1,
|
||||
getCredentialValue: () => "configured",
|
||||
createTool: () => ({
|
||||
description: "google",
|
||||
parameters: {},
|
||||
execute: async () => {
|
||||
throw new Error("google aborted");
|
||||
},
|
||||
}),
|
||||
}),
|
||||
createProvider({
|
||||
pluginId: "duckduckgo",
|
||||
id: "duckduckgo",
|
||||
credentialPath: "",
|
||||
autoDetectOrder: 100,
|
||||
requiresCredential: false,
|
||||
createTool: () => ({
|
||||
description: "duckduckgo",
|
||||
parameters: {},
|
||||
execute: async (args) => ({ ...args, provider: "duckduckgo" }),
|
||||
}),
|
||||
}),
|
||||
]);
|
||||
|
||||
await expect(
|
||||
runWebSearch({
|
||||
config: {
|
||||
tools: {
|
||||
web: {
|
||||
search: {
|
||||
provider: "google",
|
||||
},
|
||||
},
|
||||
},
|
||||
},
|
||||
args: { query: "configured" },
|
||||
}),
|
||||
).rejects.toThrow("google aborted");
|
||||
});
|
||||
|
||||
it("does not fall back when the caller explicitly selects a provider", async () => {
|
||||
resolveRuntimeWebSearchProvidersMock.mockReturnValue([
|
||||
createProvider({
|
||||
pluginId: "google",
|
||||
id: "google",
|
||||
credentialPath: "tools.web.search.google.apiKey",
|
||||
autoDetectOrder: 1,
|
||||
getCredentialValue: () => "configured",
|
||||
createTool: () => ({
|
||||
description: "google",
|
||||
parameters: {},
|
||||
execute: async () => {
|
||||
throw new Error("google aborted");
|
||||
},
|
||||
}),
|
||||
}),
|
||||
createProvider({
|
||||
pluginId: "duckduckgo",
|
||||
id: "duckduckgo",
|
||||
credentialPath: "",
|
||||
autoDetectOrder: 100,
|
||||
requiresCredential: false,
|
||||
}),
|
||||
]);
|
||||
|
||||
await expect(
|
||||
runWebSearch({
|
||||
config: {},
|
||||
providerId: "google",
|
||||
args: { query: "explicit" },
|
||||
}),
|
||||
).rejects.toThrow("google aborted");
|
||||
});
|
||||
});
|
||||
|
||||
@@ -78,6 +78,21 @@ function hasEntryCredential(
|
||||
});
|
||||
}
|
||||
|
||||
export function isWebSearchProviderConfigured(params: {
|
||||
provider: Pick<
|
||||
PluginWebSearchProviderEntry,
|
||||
| "credentialPath"
|
||||
| "id"
|
||||
| "envVars"
|
||||
| "getConfiguredCredentialValue"
|
||||
| "getCredentialValue"
|
||||
| "requiresCredential"
|
||||
>;
|
||||
config?: OpenClawConfig;
|
||||
}): boolean {
|
||||
return hasEntryCredential(params.provider, params.config, resolveSearchConfig(params.config));
|
||||
}
|
||||
|
||||
export function listWebSearchProviders(params?: {
|
||||
config?: OpenClawConfig;
|
||||
}): PluginWebSearchProviderEntry[] {
|
||||
@@ -197,21 +212,117 @@ export function resolveWebSearchDefinition(
   });
 }
 
+function resolveWebSearchCandidates(
+  options?: ResolveWebSearchDefinitionParams,
+): PluginWebSearchProviderEntry[] {
+  const search = resolveSearchConfig(options?.config);
+  const runtimeWebSearch = options?.runtimeWebSearch ?? getActiveRuntimeWebToolsMetadata()?.search;
+  if (!resolveWebSearchEnabled({ search, sandboxed: options?.sandboxed })) {
+    return [];
+  }
+
+  const providers = sortWebSearchProvidersForAutoDetect(
+    options?.preferRuntimeProviders
+      ? resolveRuntimeWebSearchProviders({
+          config: options?.config,
+          bundledAllowlistCompat: true,
+        })
+      : resolvePluginWebSearchProviders({
+          config: options?.config,
+          bundledAllowlistCompat: true,
+          origin: "bundled",
+        }),
+  ).filter(Boolean);
+  if (providers.length === 0) {
+    return [];
+  }
+
+  const preferredIds = [
+    options?.providerId,
+    runtimeWebSearch?.selectedProvider,
+    runtimeWebSearch?.providerConfigured,
+    resolveWebSearchProviderId({ config: options?.config, search, providers }),
+  ].filter(
+    (value, index, array): value is string => Boolean(value) && array.indexOf(value) === index,
+  );
+
+  const orderedProviders = [
+    ...preferredIds
+      .map((id) => providers.find((entry) => entry.id === id))
+      .filter((entry): entry is PluginWebSearchProviderEntry => Boolean(entry)),
+    ...providers.filter((entry) => !preferredIds.includes(entry.id)),
+  ];
+  return orderedProviders;
+}
+
+function hasExplicitWebSearchSelection(params: {
+  search?: WebSearchConfig;
+  runtimeWebSearch?: RuntimeWebSearchMetadata;
+  providerId?: string;
+}): boolean {
+  if (params.providerId?.trim()) {
+    return true;
+  }
+  if (
+    params.search &&
+    "provider" in params.search &&
+    typeof params.search.provider === "string" &&
+    params.search.provider.trim()
+  ) {
+    return true;
+  }
+  return params.runtimeWebSearch?.providerSource === "configured";
+}
+
 export async function runWebSearch(
   params: RunWebSearchParams,
 ): Promise<{ provider: string; result: Record<string, unknown> }> {
-  const resolved = resolveWebSearchDefinition({ ...params, preferRuntimeProviders: true });
-  if (!resolved) {
+  const search = resolveSearchConfig(params.config);
+  const runtimeWebSearch = params.runtimeWebSearch ?? getActiveRuntimeWebToolsMetadata()?.search;
+  const candidates = resolveWebSearchCandidates({
+    ...params,
+    runtimeWebSearch,
+    preferRuntimeProviders: true,
+  });
+  if (candidates.length === 0) {
     throw new Error("web_search is disabled or no provider is available.");
   }
-  return {
-    provider: resolved.provider.id,
-    result: await resolved.definition.execute(params.args),
-  };
+  const allowFallback = !hasExplicitWebSearchSelection({
+    search,
+    runtimeWebSearch,
+    providerId: params.providerId,
+  });
+  let lastError: unknown;
+
+  for (const candidate of candidates) {
+    try {
+      const definition = candidate.createTool({
+        config: params.config,
+        searchConfig: search as Record<string, unknown> | undefined,
+        runtimeMetadata: runtimeWebSearch,
+      });
+      if (!definition) {
+        continue;
+      }
+      return {
+        provider: candidate.id,
+        result: await definition.execute(params.args),
+      };
+    } catch (error) {
+      lastError = error;
+      if (!allowFallback) {
+        throw error;
+      }
+    }
+  }
+
+  throw lastError instanceof Error ? lastError : new Error(String(lastError));
 }
 
 export const __testing = {
   resolveSearchConfig,
   resolveSearchProvider: resolveWebSearchProviderId,
   resolveWebSearchProviderId,
+  resolveWebSearchCandidates,
+  hasExplicitWebSearchSelection,
 };
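The fallback loop in `runWebSearch` can be reduced to a small standalone sketch: candidates are tried in order, each tool is built only when its turn comes, and a failure is rethrown immediately when the provider was explicitly selected. The `Candidate` type and `runWithFallback` name below are hypothetical simplifications; the real code passes config and runtime metadata to `createTool` and derives `allowFallback` from `hasExplicitWebSearchSelection`.

```typescript
type SearchArgs = Record<string, unknown>;
type Candidate = {
  id: string;
  // May throw, or return undefined when the provider cannot build a tool.
  createTool: () => { execute: (args: SearchArgs) => Promise<SearchArgs> } | undefined;
};

async function runWithFallback(
  candidates: Candidate[],
  args: SearchArgs,
  allowFallback: boolean,
): Promise<{ provider: string; result: SearchArgs }> {
  if (candidates.length === 0) {
    throw new Error("no provider is available");
  }
  let lastError: unknown;
  for (const candidate of candidates) {
    try {
      // Built lazily, so a broken fallback provider cannot break the selected one.
      const tool = candidate.createTool();
      if (!tool) {
        continue;
      }
      return { provider: candidate.id, result: await tool.execute(args) };
    } catch (error) {
      lastError = error;
      if (!allowFallback) {
        throw error; // explicit selection: surface the original failure
      }
    }
  }
  throw lastError instanceof Error ? lastError : new Error(String(lastError));
}

const failing: Candidate = {
  id: "google",
  createTool: () => ({
    execute: async () => {
      throw new Error("google aborted");
    },
  }),
};
const working: Candidate = {
  id: "duckduckgo",
  createTool: () => ({ execute: async (args) => ({ ...args, provider: "duckduckgo" }) }),
};
```

Calling `runWithFallback([failing, working], { query: "fallback" }, true)` resolves through the second candidate, while passing `false` rejects with the first provider's error, matching the cases exercised by the tests in this diff.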
@@ -752,6 +752,13 @@ export function renderApp(state: AppViewState) {
|
||||
onSettingsChange: (next) => state.applySettings(next),
|
||||
onPasswordChange: (next) => (state.password = next),
|
||||
onSessionKeyChange: (next) => {
|
||||
if (state.client && state.connected && state.chatVoiceActive) {
|
||||
void state.client
|
||||
.request("chat.voice.stop", { sessionKey: state.sessionKey })
|
||||
.catch(() => {
|
||||
// ignore best-effort voice cleanup errors during navigation
|
||||
});
|
||||
}
|
||||
state.sessionKey = next;
|
||||
state.chatMessage = "";
|
||||
state.resetToolStream();
|
||||
@@ -1532,12 +1539,24 @@ export function renderApp(state: AppViewState) {
|
||||
? renderChat({
|
||||
sessionKey: state.sessionKey,
|
||||
onSessionKeyChange: (next) => {
|
||||
if (state.client && state.connected && state.chatVoiceActive) {
|
||||
void state.client
|
||||
.request("chat.voice.stop", { sessionKey: state.sessionKey })
|
||||
.catch(() => {
|
||||
// ignore best-effort voice cleanup errors during navigation
|
||||
});
|
||||
}
|
||||
state.sessionKey = next;
|
||||
state.chatMessage = "";
|
||||
state.chatAttachments = [];
|
||||
state.chatStream = null;
|
||||
state.chatStreamStartedAt = null;
|
||||
state.chatRunId = null;
|
||||
state.chatVoiceActive = false;
|
||||
state.chatVoiceState = "idle";
|
||||
state.chatVoiceTranscript = "";
|
||||
state.chatVoiceRunId = null;
|
||||
state.chatVoiceError = null;
|
||||
state.chatQueue = [];
|
||||
state.resetToolStream();
|
||||
state.resetChatScroll();
|
||||
@@ -1569,6 +1588,11 @@ export function renderApp(state: AppViewState) {
|
||||
canSend: state.connected,
|
||||
disabledReason: chatDisabledReason,
|
||||
error: state.lastError,
|
||||
voiceActive: state.chatVoiceActive,
|
||||
voiceState: state.chatVoiceState,
|
||||
voiceTranscript: state.chatVoiceTranscript,
|
||||
voiceError: state.chatVoiceError,
|
||||
voicePlaybackEnabled: state.chatVoicePlaybackEnabled,
|
||||
sessions: state.sessionsResult,
|
||||
focusMode: chatFocus,
|
||||
onRefresh: () => {
|
||||
@@ -1591,6 +1615,69 @@ export function renderApp(state: AppViewState) {
|
||||
attachments: state.chatAttachments,
|
||||
onAttachmentsChange: (next) => (state.chatAttachments = next),
|
||||
onSend: () => state.handleSendChat(),
|
||||
onVoiceStart: async () => {
|
||||
if (!state.client || !state.connected) {
|
||||
return false;
|
||||
}
|
||||
state.chatVoiceActive = false;
|
||||
state.chatVoiceState = "connecting";
|
||||
state.chatVoiceTranscript = "";
|
||||
state.chatVoiceRunId = null;
|
||||
state.chatVoiceError = null;
|
||||
try {
|
||||
const res = (await state.client.request("chat.voice.start", {
|
||||
sessionKey: state.sessionKey,
|
||||
})) as { playbackEnabled?: boolean } | undefined;
|
||||
state.chatVoiceActive = true;
|
||||
state.chatVoiceState = "listening";
|
||||
state.chatVoicePlaybackEnabled = res?.playbackEnabled !== false;
|
||||
return true;
|
||||
} catch (error) {
|
||||
state.chatVoiceActive = false;
|
||||
state.chatVoiceState = "error";
|
||||
state.chatVoiceError = String(error);
|
||||
return false;
|
||||
}
|
||||
},
|
||||
onVoiceAudioChunk: async (chunkBase64) => {
|
||||
if (!state.client || !state.connected || !state.chatVoiceActive) {
|
||||
return;
|
||||
}
|
||||
try {
|
||||
await state.client.request("chat.voice.audio", {
|
||||
sessionKey: state.sessionKey,
|
||||
audio: chunkBase64,
|
||||
format: "pcm16",
|
||||
sampleRate: 16000,
|
||||
});
|
||||
} catch (error) {
|
||||
state.chatVoiceState = "error";
|
||||
state.chatVoiceError = String(error);
|
||||
}
|
||||
},
|
||||
onVoiceStop: async () => {
|
||||
state.chatVoiceActive = false;
|
||||
state.chatVoiceRunId = null;
|
||||
state.chatVoiceTranscript = "";
|
||||
if (!state.client || !state.connected) {
|
||||
state.chatVoiceState = "idle";
|
||||
return;
|
||||
}
|
||||
try {
|
||||
await state.client.request("chat.voice.stop", { sessionKey: state.sessionKey });
|
||||
} catch (error) {
|
||||
state.chatVoiceState = "error";
|
||||
state.chatVoiceError = String(error);
|
||||
}
|
||||
},
|
||||
onVoiceInterrupt: async () => {
|
||||
if (!state.client || !state.connected) {
|
||||
return;
|
||||
}
|
||||
await state.client.request("chat.voice.interrupt", {
|
||||
sessionKey: state.sessionKey,
|
||||
});
|
||||
},
|
||||
canAbort: Boolean(state.chatRunId),
|
||||
onAbort: () => void state.handleAbortChat(),
|
||||
onQueueRemove: (id) => state.removeQueuedMessage(id),
|
||||
|
||||
@@ -72,6 +72,19 @@ export type AppViewState = {
|
||||
chatStream: string | null;
|
||||
chatStreamStartedAt: number | null;
|
||||
chatRunId: string | null;
|
||||
chatVoiceActive: boolean;
|
||||
chatVoiceState:
|
||||
| "idle"
|
||||
| "connecting"
|
||||
| "listening"
|
||||
| "processing"
|
||||
| "speaking"
|
||||
| "interrupted"
|
||||
| "error";
|
||||
chatVoiceTranscript: string;
|
||||
chatVoiceRunId: string | null;
|
||||
chatVoicePlaybackEnabled: boolean;
|
||||
chatVoiceError: string | null;
|
||||
compactionStatus: CompactionStatus | null;
|
||||
fallbackStatus: FallbackStatus | null;
|
||||
chatAvatarUrl: string | null;
|
||||
|
||||
@@ -165,6 +165,19 @@ export class OpenClawApp extends LitElement {
|
||||
@state() chatStream: string | null = null;
|
||||
@state() chatStreamStartedAt: number | null = null;
|
||||
@state() chatRunId: string | null = null;
|
||||
@state() chatVoiceActive = false;
|
||||
@state() chatVoiceState:
|
||||
| "idle"
|
||||
| "connecting"
|
||||
| "listening"
|
||||
| "processing"
|
||||
| "speaking"
|
||||
| "interrupted"
|
||||
| "error" = "idle";
|
||||
@state() chatVoiceTranscript = "";
|
||||
@state() chatVoiceRunId: string | null = null;
|
||||
@state() chatVoicePlaybackEnabled = true;
|
||||
@state() chatVoiceError: string | null = null;
|
||||
@state() compactionStatus: CompactionStatus | null = null;
|
||||
@state() fallbackStatus: FallbackStatus | null = null;
|
||||
@state() chatAvatarUrl: string | null = null;
|
||||
|
||||
@@ -125,6 +125,195 @@ export function isSttActive(): boolean {
|
||||
return activeRecognition !== null;
|
||||
}
|
||||
|
||||
// ─── Realtime Voice Capture ───
|
||||
|
||||
type RealtimeVoiceCallbacks = {
|
||||
onChunk: (chunkBase64: string) => void;
|
||||
onStart?: () => void;
|
||||
onStop?: () => void;
|
||||
onError?: (error: string) => void;
|
||||
};
|
||||
|
||||
type RealtimeVoiceCapture = {
|
||||
stop: () => void;
|
||||
};
|
||||
|
||||
const REALTIME_VOICE_TARGET_SAMPLE_RATE = 16_000;
|
||||
const REALTIME_VOICE_CHUNK_MS = 250;
|
||||
|
||||
let activeRealtimeVoiceCapture: RealtimeVoiceCapture | null = null;
|
||||
|
||||
export function isRealtimeVoiceSupported(): boolean {
|
||||
const hasGetUserMedia =
|
||||
typeof navigator !== "undefined" && typeof navigator.mediaDevices?.getUserMedia === "function";
|
||||
return (
|
||||
typeof window !== "undefined" &&
|
||||
Boolean(window.isSecureContext) &&
|
||||
hasGetUserMedia &&
|
||||
typeof AudioContext !== "undefined"
|
||||
);
|
||||
}
|
||||
|
||||
export async function startRealtimeVoiceCapture(
|
||||
callbacks: RealtimeVoiceCallbacks,
|
||||
): Promise<boolean> {
|
||||
if (!isRealtimeVoiceSupported()) {
|
||||
callbacks.onError?.("Realtime voice requires a secure context with microphone access");
|
||||
return false;
|
||||
}
|
||||
|
||||
stopRealtimeVoiceCapture();
|
||||
|
||||
let stream: MediaStream;
|
||||
try {
|
||||
stream = await navigator.mediaDevices.getUserMedia({
|
      audio: {
        channelCount: 1,
        echoCancellation: true,
        noiseSuppression: true,
        autoGainControl: true,
      },
    });
  } catch (error) {
    callbacks.onError?.(error instanceof Error ? error.message : String(error));
    return false;
  }

  const audioContext = new AudioContext();
  try {
    if (audioContext.state !== "running") {
      await audioContext.resume();
    }
  } catch (error) {
    stream.getTracks().forEach((track) => track.stop());
    callbacks.onError?.(
      error instanceof Error ? error.message : "Failed to start realtime voice capture",
    );
    void audioContext.close();
    return false;
  }

  const source = audioContext.createMediaStreamSource(stream);
  const processor = audioContext.createScriptProcessor(4096, 1, 1);
  const samplesPerChunk = Math.max(
    1,
    Math.round((REALTIME_VOICE_TARGET_SAMPLE_RATE * REALTIME_VOICE_CHUNK_MS) / 1000),
  );
  let pcmBuffer = new Int16Array(0);
  let stopped = false;

  const flushChunk = () => {
    if (pcmBuffer.length < samplesPerChunk) {
      return;
    }
    const chunk = pcmBuffer.slice(0, samplesPerChunk);
    pcmBuffer = pcmBuffer.slice(samplesPerChunk);
    callbacks.onChunk(encodePcm16Chunk(chunk));
  };

  processor.onaudioprocess = (event) => {
    if (stopped) {
      return;
    }
    const input = event.inputBuffer.getChannelData(0);
    const downsampled = downsampleFloat32Buffer(
      input,
      audioContext.sampleRate,
      REALTIME_VOICE_TARGET_SAMPLE_RATE,
    );
    if (downsampled.length === 0) {
      return;
    }
    const next = new Int16Array(pcmBuffer.length + downsampled.length);
    next.set(pcmBuffer, 0);
    next.set(downsampled, pcmBuffer.length);
    pcmBuffer = next;
    flushChunk();
  };

  source.connect(processor);
  processor.connect(audioContext.destination);

  const stop = () => {
    if (stopped) {
      return;
    }
    stopped = true;
    activeRealtimeVoiceCapture = null;
    if (pcmBuffer.length > 0) {
      callbacks.onChunk(encodePcm16Chunk(pcmBuffer));
      pcmBuffer = new Int16Array(0);
    }
    processor.disconnect();
    source.disconnect();
    stream.getTracks().forEach((track) => track.stop());
    void audioContext.close();
    callbacks.onStop?.();
  };

  activeRealtimeVoiceCapture = { stop };
  callbacks.onStart?.();
  return true;
}

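The capture function above accumulates downsampled PCM16 samples and emits at most one fixed-size chunk per `onaudioprocess` callback. As a minimal sketch of the same buffering idea in isolation, the helper below drains every complete chunk on each push. `createPcmChunker`, `samplesPerChunk`, and `ChunkSink` are hypothetical names that do not appear in this commit, and the 24000 Hz / 100 ms figures are illustrative only, since the values of `REALTIME_VOICE_TARGET_SAMPLE_RATE` and `REALTIME_VOICE_CHUNK_MS` are not shown in this hunk.

```typescript
// Hypothetical helper, not part of the commit: accumulates PCM16 samples and
// hands fixed-size chunks to a sink, draining all complete chunks per push.
type ChunkSink = (chunk: Int16Array) => void;

function samplesPerChunk(sampleRate: number, chunkMs: number): number {
  // Same arithmetic as the diff: never fewer than one sample per chunk.
  return Math.max(1, Math.round((sampleRate * chunkMs) / 1000));
}

function createPcmChunker(chunkSize: number, sink: ChunkSink) {
  let pending = new Int16Array(0);
  return {
    push(samples: Int16Array): void {
      // Append the new samples to whatever is still waiting.
      const next = new Int16Array(pending.length + samples.length);
      next.set(pending, 0);
      next.set(samples, pending.length);
      pending = next;
      // Emit every complete chunk, not just the first one.
      while (pending.length >= chunkSize) {
        sink(pending.slice(0, chunkSize));
        pending = pending.slice(chunkSize);
      }
    },
    flush(): void {
      // Emit the trailing partial chunk, as stop() does in the diff.
      if (pending.length > 0) {
        sink(pending);
        pending = new Int16Array(0);
      }
    },
  };
}

// Usage with illustrative values (assumed, not taken from the commit).
const size = samplesPerChunk(24000, 100);
const emitted: number[] = [];
const chunker = createPcmChunker(size, (chunk) => emitted.push(chunk.length));
chunker.push(new Int16Array(5000));
chunker.flush();
```

With these assumed values, a 5000-sample push yields two full chunks and leaves a remainder that `flush()` emits as a short final chunk.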
export function stopRealtimeVoiceCapture(): void {
  activeRealtimeVoiceCapture?.stop();
}

function downsampleFloat32Buffer(
  buffer: Float32Array,
  inputSampleRate: number,
  outputSampleRate: number,
): Int16Array {
  if (outputSampleRate >= inputSampleRate) {
    return float32ToPcm16(buffer);
  }
  const ratio = inputSampleRate / outputSampleRate;
  const outputLength = Math.max(1, Math.round(buffer.length / ratio));
  const output = new Int16Array(outputLength);
  let offsetBuffer = 0;
  for (let i = 0; i < outputLength; i += 1) {
    const nextOffsetBuffer = Math.min(buffer.length, Math.round((i + 1) * ratio));
    let sum = 0;
    let count = 0;
    for (let j = offsetBuffer; j < nextOffsetBuffer; j += 1) {
      sum += buffer[j];
      count += 1;
    }
    const sample = count > 0 ? sum / count : 0;
    output[i] = float32SampleToPcm16(sample);
    offsetBuffer = nextOffsetBuffer;
  }
  return output;
}

function float32ToPcm16(buffer: Float32Array): Int16Array {
  const output = new Int16Array(buffer.length);
  for (let i = 0; i < buffer.length; i += 1) {
    output[i] = float32SampleToPcm16(buffer[i]);
  }
  return output;
}

function float32SampleToPcm16(sample: number): number {
  const clamped = Math.max(-1, Math.min(1, sample));
  return clamped < 0 ? Math.round(clamped * 0x8000) : Math.round(clamped * 0x7fff);
}

function encodePcm16Chunk(chunk: Int16Array): string {
  const bytes = new Uint8Array(chunk.length * 2);
  for (let i = 0; i < chunk.length; i += 1) {
    const value = chunk[i];
    bytes[i * 2] = value & 0xff;
    bytes[i * 2 + 1] = (value >> 8) & 0xff;
  }
  let binary = "";
  for (const byte of bytes) {
    binary += String.fromCharCode(byte);
  }
  return btoa(binary);
}

// ─── TTS (Text-to-Speech) ───

export function isTtsSupported(): boolean {

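`encodePcm16Chunk` above serializes each 16-bit sample as two little-endian bytes and base64-encodes the result. A receiver would need the inverse step; the sketch below shows one way to do it and checks the round trip. `decodePcm16Chunk` is a hypothetical name and is not part of this commit; the encoder is copied from the diff only so the example is self-contained.

```typescript
// Copied from the diff so the round trip can run on its own.
function encodePcm16Chunk(chunk: Int16Array): string {
  const bytes = new Uint8Array(chunk.length * 2);
  for (let i = 0; i < chunk.length; i += 1) {
    const value = chunk[i];
    bytes[i * 2] = value & 0xff; // low byte first (little-endian)
    bytes[i * 2 + 1] = (value >> 8) & 0xff; // then high byte
  }
  let binary = "";
  for (const byte of bytes) {
    binary += String.fromCharCode(byte);
  }
  return btoa(binary);
}

// Hypothetical inverse, not part of the commit: base64 -> little-endian PCM16.
function decodePcm16Chunk(base64: string): Int16Array {
  const binary = atob(base64);
  const samples = new Int16Array(Math.floor(binary.length / 2));
  for (let i = 0; i < samples.length; i += 1) {
    const lo = binary.charCodeAt(i * 2);
    const hi = binary.charCodeAt(i * 2 + 1);
    // Storing into an Int16Array wraps values above 0x7fff to negatives,
    // which restores the sign of the original sample.
    samples[i] = (hi << 8) | lo;
  }
  return samples;
}

// Round trip over the edge cases of the 16-bit range.
const original = Int16Array.from([0, 1, -1, 32767, -32768, 256]);
const roundTripped = decodePcm16Chunk(encodePcm16Chunk(original));
const roundTripOk =
  roundTripped.length === original.length &&
  original.every((value, index) => roundTripped[index] === value);
console.log(roundTripOk);
```

The sketch relies on the global `btoa` and `atob`, which are available in browsers and in current Node.js releases.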
@@ -44,6 +44,11 @@ function createProps(overrides: Partial<ChatProps> = {}): ChatProps {
    canSend: true,
    disabledReason: null,
    error: null,
    voiceActive: false,
    voiceState: "idle",
    voiceTranscript: "",
    voiceError: null,
    voicePlaybackEnabled: true,
    sessions: {
      ts: 0,
      path: "",

@@ -220,6 +220,11 @@ function createProps(overrides: Partial<ChatProps> = {}): ChatProps {
    canSend: true,
    disabledReason: null,
    error: null,
    voiceActive: false,
    voiceState: "idle",
    voiceTranscript: "",
    voiceError: null,
    voicePlaybackEnabled: true,
    sessions: createSessions(),
    focusMode: false,
    assistantName: "OpenClaw",

@@ -29,7 +29,14 @@ import {
  type SlashCommandCategory,
  type SlashCommandDef,
} from "../chat/slash-commands.ts";
import { isSttSupported, startStt, stopStt } from "../chat/speech.ts";
import {
  isRealtimeVoiceSupported,
  isSttSupported,
  startRealtimeVoiceCapture,
  startStt,
  stopRealtimeVoiceCapture,
  stopStt,
} from "../chat/speech.ts";
import { icons } from "../icons.ts";
import { detectTextDirection } from "../text-direction.ts";
import type { GatewaySessionRow, SessionsListResult } from "../types.ts";
@@ -62,6 +69,18 @@ export type ChatProps = {
  canSend: boolean;
  disabledReason: string | null;
  error: string | null;
  voiceActive: boolean;
  voiceState:
    | "idle"
    | "connecting"
    | "listening"
    | "processing"
    | "speaking"
    | "interrupted"
    | "error";
  voiceTranscript: string;
  voiceError: string | null;
  voicePlaybackEnabled: boolean;
  sessions: SessionsListResult | null;
  focusMode: boolean;
  sidebarOpen?: boolean;
@@ -80,6 +99,10 @@ export type ChatProps = {
  onDraftChange: (next: string) => void;
  onRequestUpdate?: () => void;
  onSend: () => void;
  onVoiceStart?: () => Promise<boolean> | boolean;
  onVoiceAudioChunk?: (chunkBase64: string) => Promise<void> | void;
  onVoiceStop?: () => Promise<void> | void;
  onVoiceInterrupt?: () => Promise<void> | void;
  onAbort?: () => void;
  onQueueRemove: (id: string) => void;
  onNewSession: () => void;
@@ -130,6 +153,7 @@ function getDeletedMessages(sessionKey: string): DeletedMessages {
interface ChatEphemeralState {
  sttRecording: boolean;
  sttInterimText: string;
  voiceRecording: boolean;
  slashMenuOpen: boolean;
  slashMenuItems: SlashCommandDef[];
  slashMenuIndex: number;
@@ -145,6 +169,7 @@ function createChatEphemeralState(): ChatEphemeralState {
  return {
    sttRecording: false,
    sttInterimText: "",
    voiceRecording: false,
    slashMenuOpen: false,
    slashMenuItems: [],
    slashMenuIndex: 0,
@@ -167,6 +192,9 @@ export function resetChatViewState() {
  if (vs.sttRecording) {
    stopStt();
  }
  if (vs.voiceRecording) {
    stopRealtimeVoiceCapture();
  }
  Object.assign(vs, createChatEphemeralState());
}

@@ -254,6 +282,32 @@ function renderFallbackIndicator(status: FallbackIndicatorStatus | null | undefi
  `;
}

function renderVoiceStatus(props: ChatProps) {
  if (!props.voiceActive && !props.voiceError) {
    return nothing;
  }
  const label =
    props.voiceState === "connecting"
      ? "Connecting voice..."
      : props.voiceState === "listening"
        ? "Listening..."
        : props.voiceState === "processing"
          ? "Processing..."
          : props.voiceState === "speaking"
            ? "Speaking..."
            : props.voiceState === "interrupted"
              ? "Interrupted"
              : props.voiceState === "error"
                ? "Voice error"
                : "Voice ready";
  const detail = props.voiceError || props.voiceTranscript;
  return html`
    <div class="agent-chat__stt-interim">
      <strong>${label}</strong>${detail ? html` ${detail}` : nothing}
    </div>
  `;
}

/**
 * Compact notice when context usage reaches 85%+.
 * Progressively shifts from amber (85%) to red (90%+).
@@ -913,6 +967,11 @@ export function renderChat(props: ChatProps) {
  const requestUpdate = props.onRequestUpdate ?? (() => {});
  const getDraft = props.getDraft ?? (() => props.draft);

  if (!props.voiceActive && vs.voiceRecording) {
    stopRealtimeVoiceCapture();
    vs.voiceRecording = false;
  }

  const splitRatio = props.splitRatio ?? 0.6;
  const sidebarOpen = Boolean(props.sidebarOpen && props.onCloseSidebar);

@@ -1262,6 +1321,7 @@ export function renderChat(props: ChatProps) {
          ${vs.sttRecording && vs.sttInterimText
            ? html`<div class="agent-chat__stt-interim">${vs.sttInterimText}</div>`
            : nothing}
          ${renderVoiceStatus(props)}

          <textarea
            ${ref((el) => el && adjustTextareaHeight(el as HTMLTextAreaElement))}
@@ -1342,6 +1402,56 @@ export function renderChat(props: ChatProps) {
                    </button>
                  `
                : nothing}
              ${isRealtimeVoiceSupported() && props.onVoiceStart && props.onVoiceStop
                ? html`
                    <button
                      class="agent-chat__input-btn ${props.voiceActive
                        ? "agent-chat__input-btn--recording"
                        : ""}"
                      @click=${async () => {
                        if (props.voiceActive) {
                          stopRealtimeVoiceCapture();
                          vs.voiceRecording = false;
                          await props.onVoiceStop?.();
                          requestUpdate();
                          return;
                        }
                        const started = await props.onVoiceStart?.();
                        if (!started) {
                          requestUpdate();
                          return;
                        }
                        const captureStarted = await startRealtimeVoiceCapture({
                          onChunk: (chunkBase64) => {
                            void props.onVoiceAudioChunk?.(chunkBase64);
                          },
                          onStart: () => {
                            vs.voiceRecording = true;
                            requestUpdate();
                          },
                          onStop: () => {
                            vs.voiceRecording = false;
                            requestUpdate();
                          },
                          onError: async () => {
                            vs.voiceRecording = false;
                            await props.onVoiceStop?.();
                            requestUpdate();
                          },
                        });
                        if (!captureStarted) {
                          await props.onVoiceStop?.();
                          requestUpdate();
                        }
                      }}
                      title=${props.voiceActive ? "Stop live voice" : "Start live voice"}
                      aria-label=${props.voiceActive ? "Stop live voice" : "Start live voice"}
                      ?disabled=${!props.connected || props.voiceState === "connecting"}
                    >
                      ${props.voiceActive ? icons.volume2 : icons.radio}
                    </button>
                  `
                : nothing}
              ${tokens ? html`<span class="agent-chat__token-count">${tokens}</span>` : nothing}
            </div>

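`renderVoiceStatus` in the hunks above maps each `voiceState` value to a status label through a chain of ternaries, falling back to "Voice ready" for the remaining `"idle"` case. As a sketch of an equivalent mapping, the lookup table below produces the same labels. `VoiceState`, `VOICE_STATE_LABELS`, and `voiceStatusLabel` are hypothetical names introduced for illustration and are not part of this commit.

```typescript
// Hypothetical alias for the union declared inline on ChatProps.voiceState.
type VoiceState =
  | "idle"
  | "connecting"
  | "listening"
  | "processing"
  | "speaking"
  | "interrupted"
  | "error";

// Same labels as the ternary chain in the diff; "idle" takes the fallback text.
const VOICE_STATE_LABELS: Record<VoiceState, string> = {
  idle: "Voice ready",
  connecting: "Connecting voice...",
  listening: "Listening...",
  processing: "Processing...",
  speaking: "Speaking...",
  interrupted: "Interrupted",
  error: "Voice error",
};

function voiceStatusLabel(state: VoiceState): string {
  return VOICE_STATE_LABELS[state];
}

console.log(voiceStatusLabel("listening"));
```

Typing the table as `Record<VoiceState, string>` makes the compiler report a missing label if a new state is added to the union, which the ternary chain would silently route to the fallback.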