OpenClaw Issue Digest: Session Lifecycle, Provider Failovers, and Infrastructure Stability

Open Issues

Recent activity in the OpenClaw repository reveals several critical regressions and architectural gaps, primarily centered around session management, model provider interoperability, and platform-specific stability.

Session Lifecycle and Memory Management

One of the most severe reported issues involves the session.maintenance logic (#81492). When mode is set to "enforce", the system can evict pending subagent sessions before their results are announced or frozen. The parent then receives a "completed successfully" event with no output and zero tokens, even though the child produced a valid response. The proposed fix is lifecycle-aware eviction that protects active or pending-delivery sessions.
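The lifecycle-aware eviction proposed above could be sketched roughly as follows. This is a minimal illustration, not OpenClaw's implementation: the SessionEntry shape, the state names, and selectEvictable are all hypothetical and chosen only to show the idea of filtering by lifecycle state before filtering by age.

```typescript
// Hypothetical session states; illustrative names, not OpenClaw's actual schema.
type SessionState = "active" | "pending-delivery" | "delivered" | "idle";

interface SessionEntry {
  id: string;
  state: SessionState;
  lastTouchedMs: number;
}

// States that must never be evicted, regardless of age (assumption).
const PROTECTED_STATES = new Set<SessionState>(["active", "pending-delivery"]);

// Returns the ids that are safe to evict: sessions older than maxAgeMs
// that are neither running nor waiting to hand their result to a parent.
function selectEvictable(
  sessions: SessionEntry[],
  nowMs: number,
  maxAgeMs: number
): string[] {
  return sessions
    .filter((s) => !PROTECTED_STATES.has(s.state))
    .filter((s) => nowMs - s.lastTouchedMs > maxAgeMs)
    .map((s) => s.id);
}
```

The point of the design is ordering: the age check only runs on sessions that have already passed the lifecycle check, so an old subagent session whose result has not yet been announced cannot be pruned by age alone.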

Additionally, a regression in subagent completion handling (#81490) causes the gateway to spawn a fresh run on the parent's route instead of resuming a yielded session. This effectively orphans the paused run and overwrites the session-store pointer, silently breaking automated multi-step workflows.

Provider and Model Interoperability

Cross-provider failovers are mishandling "reasoning" blocks. When failing over from Gemini to OpenAI reasoning models (#80452), the system fails to propagate the required reasoning item, resulting in 400 errors. Similarly, MiMo models using the anthropic-messages API fall back immediately because the gateway does not preserve reasoning_content during replay, which MiMo requires for subsequent turns.

Infrastructure and Platform Stability

Stability issues are prominent on ARM64 edge devices (Raspberry Pi 5) and on Windows. On ARM64, users report CLI commands timing out or being SIGKILLed due to exec overhead, and cron jobs failing without retry logic. On Windows, a critical runtime degradation has been observed in which outbound HTTP fetches (including Telegram polling and model pricing) stall for up to 60s or time out. The same fetches do not stall in standalone Node processes.

Other notable infrastructure concerns include:

MCP Resource Exhaustion: Misconfigured MCP servers can trigger "retry storms," spawning hundreds of child processes and exhausting VM memory (#68527).
Sandbox Zombies: Sandboxed sessions are accumulating zombie processes under PID 1 that are not being reaped, risking pids.max exhaustion (#68691).
Memory Leaks: Reports indicate the gateway loads all session files into memory via readFileSync, causing RSS to grow linearly with session history (#69451).
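To illustrate the kind of guard that would stop an MCP retry storm, here is a minimal sketch combining exponential backoff with a circuit breaker. The class name, method names, and default thresholds are assumptions made for illustration; the digest does not specify how the fix is designed.

```typescript
// Delay before the next start attempt: doubles per failure, capped at capMs.
function backoffDelayMs(failures: number, baseMs = 1000, capMs = 60000): number {
  return Math.min(capMs, baseMs * Math.pow(2, failures));
}

// Hypothetical breaker guarding MCP server starts (not OpenClaw's actual API).
class StartCircuitBreaker {
  private failures = 0;
  private openedAtMs: number | null = null;
  private readonly maxFailures: number;
  private readonly cooldownMs: number;

  constructor(maxFailures = 5, cooldownMs = 300000) {
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
  }

  // True if a start attempt is allowed at time nowMs.
  canAttempt(nowMs: number): boolean {
    if (this.openedAtMs === null) return true;
    if (nowMs - this.openedAtMs >= this.cooldownMs) {
      // Half-open: allow one trial; a single further failure re-opens the breaker.
      this.openedAtMs = null;
      this.failures = this.maxFailures - 1;
      return true;
    }
    return false;
  }

  recordFailure(nowMs: number): void {
    this.failures += 1;
    if (this.failures >= this.maxFailures) this.openedAtMs = nowMs;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAtMs = null;
  }

  // Suggested wait before the next attempt, based on failures so far.
  nextDelayMs(): number {
    return backoffDelayMs(this.failures);
  }
}
```

A caller would check canAttempt before spawning a child process and wait nextDelayMs between attempts, so a misconfigured server produces a handful of spaced-out starts rather than hundreds of concurrent ones.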

Key Themes

1. The "Silent Failure" Pattern

Across multiple issues, a recurring theme is the lack of observability when critical paths fail. Examples include the silent dropping of Slack replies in group chats (#77320), the absence of log lines for successful Telegram media sends (#68770), and the silent failure of the openclaw-weixin plugin to load in the gateway (#81448). In each case the system emits no warning or error log, which makes diagnosis nearly impossible for operators.

2. Context and Token Accounting Inaccuracies

Token counting remains a volatile area. MiniMax models are experiencing premature compaction at ~20% context usage because prompt tokens are being double-counted (input + cacheRead), triggering the compaction safeguard far too early (#68470). Similarly, there are reports of cacheWrite telemetry always remaining at zero despite active cacheRead activity (#81014).
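The effect of the double-counting can be shown with a small worked example. It assumes, for illustration only, that the provider's input figure already includes cached tokens; the Usage shape, the function names, and every number below are hypothetical and not taken from the issue.

```typescript
// Hypothetical usage record; field names mirror the digest's wording only.
interface Usage {
  input: number;     // assumed to already include cached prompt tokens
  cacheRead: number; // portion of the prompt served from cache
}

// The buggy estimate described in #68470: cached tokens are counted twice.
function promptTokensDoubleCounted(u: Usage): number {
  return u.input + u.cacheRead;
}

// Corrected estimate under the assumption above.
function promptTokensCorrected(u: Usage): number {
  return u.input;
}

// Fraction of the context window that the prompt occupies.
function contextFraction(promptTokens: number, contextWindow: number): number {
  return promptTokens / contextWindow;
}

// Illustrative numbers: a 200,000-token window and a mostly cached prompt.
const exampleUsage: Usage = { input: 40000, cacheRead: 36000 };
const realFraction = contextFraction(promptTokensCorrected(exampleUsage), 200000);
const inflatedFraction = contextFraction(promptTokensDoubleCounted(exampleUsage), 200000);
```

With these numbers the prompt really fills 20% of the window, but the inflated estimate reports 38%. The higher the cache hit rate, the closer the overestimate gets to double the true usage, which is how a compaction safeguard can fire well before the context is actually full.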

3. UX and Feedback Gaps

Several feature requests highlight a need for better user-facing feedback. These include ack reactions and typing indicators for slash commands like /new (#69585) and a notification sound on agent turn completion (#69186) to help users who keep the UI in the background.

Action Required

High Severity / Blockers

#81492 (Session Eviction): Immediate attention is needed to prevent the loss of subagent results during session maintenance. This is a data-loss bug for active runs.
#81490 (Subagent Resume): This regression breaks the core orchestrator pattern of sessions_yield and must be fixed to restore automated workflow reliability.
#68703 (Discord Auth Bypass): A high-severity security vulnerability where guild-admin actions (e.g., deleting channels) bypass requester authorization checks, allowing any guild member to trigger privileged mutations.
#68587 (MCP Protocol Violation): The MCP server is sending tools/list as a notification instead of a request/response, breaking compatibility with standard MCP clients like Hermes Agent.
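For context on #68587: in JSON-RPC 2.0, which MCP uses for framing, a request carries an id and expects a response, while a notification has no id and gets none. The sketch below shows the distinction for tools/list. The type and helper names are illustrative and are not taken from the OpenClaw codebase.

```typescript
type JsonRpcId = number | string;

// A request carries an id, so the peer must reply with a matching response.
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: JsonRpcId;
  method: string;
  params?: Record<string, unknown>;
}

// A notification has no id, so no response is ever sent for it.
interface JsonRpcNotification {
  jsonrpc: "2.0";
  method: string;
  params?: Record<string, unknown>;
}

type JsonRpcMessage = JsonRpcRequest | JsonRpcNotification;

// Builds tools/list in request form, which is how MCP defines the method.
function toolsListRequest(id: JsonRpcId): JsonRpcRequest {
  return { jsonrpc: "2.0", id: id, method: "tools/list" };
}

// Distinguishes the two shapes by the presence of an id.
function isRequest(msg: JsonRpcMessage): msg is JsonRpcRequest {
  return "id" in msg;
}

// Hypothetical conformance check: tools/list without an id is malformed.
function isValidToolsList(msg: JsonRpcMessage): boolean {
  return msg.method === "tools/list" && isRequest(msg);
}
```

A client that matches responses to outstanding request ids has nothing to correlate against when tools/list arrives without an id, which is consistent with the incompatibility described in the issue.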

Blocked or High-Priority Fixes

#73323 (Windows Runtime Degradation): This chronic network/timer degradation on Windows 11 makes the gateway effectively unusable for Telegram and RPC calls.
#68527 (MCP Circuit Breaker): Implementation of exponential backoff and circuit breakers for MCP starts is required to prevent VM-level crashes.
#80452 (Gemini → OpenAI Failover): Fix the propagation of reasoning items to ensure the failover safety net actually works.