Compare commits

5 Commits

Author SHA1 Message Date
Peter Steinberger
262d976c71 fix(agents): retry failover on network connection errors (#31697) (thanks @haotian2546) 2026-03-03 02:36:55 +00:00
KingMo
cce6d602e1 style: fix code formatting for test file 2026-03-03 02:36:30 +00:00
KingMo
cb8d36c6cd fix(agents): recognize connection errors as retryable timeout failures
When a model endpoint becomes unreachable (e.g., local proxy down,
relay server offline), the failover system fails to switch to the
next candidate model. Errors like "Connection error." are not
classified as retryable, causing the session to hang on a broken
endpoint instead of falling back to healthy alternatives.

Connection/network errors are not recognized by the current failover
classifier:
- Text patterns like "Connection error.", "fetch failed", "network error"
- Error codes like ECONNREFUSED, ENOTFOUND, EAI_AGAIN (in message text)

While `failover-error.ts` handles these as error codes (err.code),
it misses them when they appear as plain text in error messages.

Extend timeout error patterns to include connection/network failures:

**In `errors.ts` (ERROR_PATTERNS.timeout):**
- Text: "connection error", "network error", "fetch failed", etc.
- Regex: /\beconn(?:refused|reset|aborted)\b/i, /\benotfound\b/i, /\beai_again\b/i

**In `failover-error.ts` (TIMEOUT_HINT_RE):**
- Same patterns for non-assistant error paths

Added test cases covering:
- "Connection error."
- "fetch failed"
- "network error: ECONNREFUSED"
- "ENOTFOUND" / "EAI_AGAIN" in message text

- **Compatibility:** High - only expands retryable error detection
- **Behavior:** Connection failures now trigger automatic fallback
- **Risk:** Low - changes are additive and well-tested
2026-03-03 02:36:30 +00:00
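
The pattern extension described in commit cb8d36c6cd can be sketched roughly as follows. This is a minimal illustration, not the project's code: `TIMEOUT_PATTERNS`, `matchesAny`, and `classifyAsTimeout` are hypothetical names, and the matching rule (case-insensitive substring for strings, `test()` for regexes) is an assumption. Only the pattern values themselves are taken from the commit message and diff.

```typescript
// Hypothetical sketch: these names are illustrative, not the project's actual exports.
type ErrorPattern = string | RegExp;

// Pattern values copied from the commit message and the ERROR_PATTERNS.timeout diff.
const TIMEOUT_PATTERNS: ErrorPattern[] = [
  "timed out",
  "deadline exceeded",
  "connection error",
  "network error",
  "network request failed",
  "fetch failed",
  "socket hang up",
  /\beconn(?:refused|reset|aborted)\b/i,
  /\benotfound\b/i,
  /\beai_again\b/i,
];

// Assumed matching rule: strings match as case-insensitive substrings,
// regexes are tested against the original message.
function matchesAny(message: string, patterns: ErrorPattern[]): boolean {
  const lower = message.toLowerCase();
  return patterns.some((p) =>
    typeof p === "string" ? lower.includes(p) : p.test(message),
  );
}

// Returns "timeout" for retryable connection/network failures, otherwise null.
function classifyAsTimeout(message: string): "timeout" | null {
  return matchesAny(message, TIMEOUT_PATTERNS) ? "timeout" : null;
}
```

In this sketch the word boundaries on the error-code regexes keep them from matching inside longer identifiers, while the plain strings catch SDK messages such as "Connection error." that carry no `err.code`.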
riftzen-bit
5e1a2ea019 chore: remove unreachable "LINUX" from resolvePlatform return type
Address review feedback: since resolvePlatform() no longer returns
"LINUX", remove it from the union type to prevent future confusion.
2026-03-03 02:36:01 +00:00
riftzen-bit
008e4804a6 fix(gemini-cli-auth): use PLATFORM_UNSPECIFIED for Linux in loadCodeAssist
Google's loadCodeAssist API rejects "LINUX" as an invalid Platform enum
value, causing OAuth setup to fail with 400 Bad Request on Linux systems.

The pi-ai runtime already uses "PLATFORM_UNSPECIFIED" for this field.
This aligns the extension's discoverProject() with that approach by
returning "PLATFORM_UNSPECIFIED" for Linux (and other non-Windows/macOS
platforms) instead of "LINUX".

Also fixes the original resolvePlatform() which incorrectly fell through
to "MACOS" as default instead of explicitly checking for "darwin".
2026-03-03 02:36:01 +00:00
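
The platform mapping described in commits 008e4804a6 and 5e1a2ea019 might look something like the sketch below. The actual `resolvePlatform()` is not shown in this compare view, so the signature, the `"WINDOWS"` value, and the `Platform` type name are assumptions. Only `"MACOS"`, `"PLATFORM_UNSPECIFIED"`, the explicit `"darwin"` check, and the removal of `"LINUX"` from the union come from the commit messages.

```typescript
// Hypothetical sketch; the real resolvePlatform() in the extension may differ.
// "LINUX" is deliberately absent from the union, per commit 5e1a2ea019.
type Platform = "WINDOWS" | "MACOS" | "PLATFORM_UNSPECIFIED";

// Takes the platform string as a parameter (for example Node's process.platform)
// so the mapping can be tested without depending on the host OS.
function resolvePlatform(nodePlatform: string): Platform {
  if (nodePlatform === "win32") return "WINDOWS"; // assumed enum value
  if (nodePlatform === "darwin") return "MACOS"; // explicit check, no fall-through default
  // Linux and every other platform: avoid the "LINUX" value the API rejects.
  return "PLATFORM_UNSPECIFIED";
}
```

Making the fallback `"PLATFORM_UNSPECIFIED"` rather than `"MACOS"` means an unrecognized platform is never reported as macOS.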
5 changed files with 34 additions and 2 deletions

@@ -49,7 +49,6 @@ Docs: https://docs.openclaw.ai
- Feishu/default account resolution: always honor explicit `channels.feishu.defaultAccount` during outbound account selection (including top-level-credential setups where the preferred id is not present in `accounts`), instead of silently falling back to another account id. (#32253) Thanks @bmendonca3.
- Gemini schema sanitization: coerce malformed JSON Schema `properties` values (`null`, arrays, primitives) to `{}` before provider validation, preventing downstream strict-validator crashes on invalid plugin/tool schemas. (#32332) Thanks @webdevtodayjason.
- Models/openai-completions developer-role compatibility: force `supportsDeveloperRole=false` for non-native endpoints, treat unparseable `baseUrl` values as non-native, and add regression coverage for empty/malformed baseUrl plus explicit-true override behavior. (#29479) thanks @akramcodez.
- Gemini CLI OAuth/Linux compatibility: send `PLATFORM_UNSPECIFIED` (instead of invalid `LINUX`) in `loadCodeAssist` metadata so Linux OAuth flows no longer fail with provider 400 enum errors. (#32382) Thanks @riftzen-bit.
- OpenAI/Responses WebSocket tool-call id hygiene: normalize blank/whitespace streamed tool-call ids before persistence, and block empty `function_call_output.call_id` payloads in the WS conversion path to avoid OpenAI 400 errors (`Invalid 'input[n].call_id': empty string`), with regression coverage for both inbound stream normalization and outbound payload guards.
- Gateway/Control UI basePath webhook passthrough: let non-read methods under configured `controlUiBasePath` fall through to plugin routes (instead of returning Control UI 405), restoring webhook handlers behind basePath mounts. (#32311) Thanks @ademczuk.
- CLI/Config validation and routing hardening: dedupe `openclaw config validate` failures to a single authoritative report, expose allowed-values metadata/hints across core Zod and plugin AJV validation (including `--json` fields), sanitize terminal-rendered validation text, and make command-path parsing root-option-aware across preaction/route/lazy registration (including routed `config get/unset` with split root options). Thanks @gumadeiras.
@@ -123,6 +122,7 @@ Docs: https://docs.openclaw.ai
- Discord/audio preflight mentions: detect audio attachments via Discord `content_type` and gate preflight transcription on typed text (not media placeholders), so guild voice-note mentions are transcribed and matched correctly. (#32136) Thanks @jnMetaCode.
- Memory/LanceDB embeddings: forward configured `embedding.dimensions` into OpenAI embeddings requests so vector size and API output dimensions stay aligned when dimensions are explicitly configured. (#32036) Thanks @scotthuang.
- Failover/error classification: treat HTTP `529` (provider overloaded, common with Anthropic-compatible APIs) as `rate_limit` so model failover can engage instead of misclassifying the error path. (#31854) Thanks @bugkill3r.
- Failover/network resilience: classify connection and DNS failures (`fetch failed`, `ECONN*`, `ENOTFOUND`, `EAI_AGAIN`) as retryable timeout errors so provider fallback can advance instead of stalling on transient network outages. (#31697) Thanks @haotian2546.
- Plugin command/runtime hardening: validate and normalize plugin command name/description at registration boundaries, and guard Telegram native menu normalization paths so malformed plugin command specs cannot crash startup (`trim` on undefined). (#31997) Fixes #31944. Thanks @liuxiaopai-ai.
- Plugins/hardlink install compatibility: allow bundled plugin manifests and entry files to load when installed via hardlink-based package managers (`pnpm`, `bun`) while keeping hardlink rejection enabled for non-bundled plugin sources. (#32119) Fixes #28175, #28404, #29455. Thanks @markfietje.
- Web UI/config form: support SecretInput string-or-secret-ref unions in map `additionalProperties`, so provider API key fields stay editable instead of being marked unsupported. (#31866) Thanks @ningding97.

@@ -48,6 +48,22 @@ describe("failover-error", () => {
expect(resolveFailoverReasonFromError({ message: "reason: error" })).toBe("timeout");
});
it("infers timeout from connection/network error messages", () => {
expect(resolveFailoverReasonFromError({ message: "Connection error." })).toBe("timeout");
expect(resolveFailoverReasonFromError({ message: "fetch failed" })).toBe("timeout");
expect(resolveFailoverReasonFromError({ message: "Network error: ECONNREFUSED" })).toBe(
"timeout",
);
expect(
resolveFailoverReasonFromError({
message: "dial tcp: lookup api.example.com: no such host (ENOTFOUND)",
}),
).toBe("timeout");
expect(resolveFailoverReasonFromError({ message: "temporary dns failure EAI_AGAIN" })).toBe(
"timeout",
);
});
it("treats AbortError reason=abort as timeout", () => {
const err = Object.assign(new Error("aborted"), {
name: "AbortError",

@@ -6,7 +6,7 @@ import {
} from "./pi-embedded-helpers.js";
const TIMEOUT_HINT_RE =
-/timeout|timed out|deadline exceeded|context deadline exceeded|stop reason:\s*(?:abort|error)|reason:\s*(?:abort|error)|unhandled stop reason:\s*(?:abort|error)/i;
+/timeout|timed out|deadline exceeded|context deadline exceeded|connection error|network error|network request failed|fetch failed|socket hang up|econnrefused|econnreset|econnaborted|enotfound|eai_again|stop reason:\s*(?:abort|error)|reason:\s*(?:abort|error)|unhandled stop reason:\s*(?:abort|error)/i;
const ABORT_TIMEOUT_RE = /request was aborted|request aborted/i;
export class FailoverError extends Error {

@@ -415,6 +415,7 @@ describe("isFailoverErrorMessage", () => {
"429 rate limit exceeded",
"Your credit balance is too low",
"request timed out",
"Connection error.",
"invalid request format",
];
for (const sample of samples) {
@@ -494,6 +495,13 @@ describe("classifyFailoverReason", () => {
expect(classifyFailoverReason("credit balance too low")).toBe("billing");
expect(classifyFailoverReason("deadline exceeded")).toBe("timeout");
expect(classifyFailoverReason("request ended without sending any chunks")).toBe("timeout");
expect(classifyFailoverReason("Connection error.")).toBe("timeout");
expect(classifyFailoverReason("fetch failed")).toBe("timeout");
expect(classifyFailoverReason("network error: ECONNREFUSED")).toBe("timeout");
expect(
classifyFailoverReason("dial tcp: lookup api.example.com: no such host (ENOTFOUND)"),
).toBe("timeout");
expect(classifyFailoverReason("temporary dns failure EAI_AGAIN")).toBe("timeout");
expect(
classifyFailoverReason(
"521 <!DOCTYPE html><html><head><title>Web server is down</title></head><body>Cloudflare</body></html>",

@@ -640,6 +640,14 @@ const ERROR_PATTERNS = {
"timed out",
"deadline exceeded",
"context deadline exceeded",
"connection error",
"network error",
"network request failed",
"fetch failed",
"socket hang up",
/\beconn(?:refused|reset|aborted)\b/i,
/\benotfound\b/i,
/\beai_again\b/i,
/without sending (?:any )?chunks?/i,
/\bstop reason:\s*(?:abort|error)\b/i,
/\breason:\s*(?:abort|error)\b/i,