OpenClaw Issue Digest: Session Isolation, Provider Stability, and Runtime Regressions
Open Issues
Recent activity in the OpenClaw repository is dominated by high-severity issues in three areas: session isolation, provider-specific delivery failures, and regressions in the gateway's update and recovery mechanisms.
Session Isolation and Stability
Critical failures in session isolation have been reported. The most notable is #84903, where a single stalled agent session can block the entire Gateway event loop, driving CPU utilization to 100% and silently dropping messages for all other active sessions. Similarly, #85250 describes a bug where sessions_yield leaves parent sessions unwakeable by subagent completion events, forcing users to send a manual message so that the result can "piggyback" on it.
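One way to picture the per-session timeout budget proposed for #84903 is the sketch below. It is not OpenClaw's implementation: runWithBudget, SessionTimeoutError, and the budget value are hypothetical names chosen for illustration. Note the limit of this approach: it only bounds work that yields to the event loop. A session stuck in synchronous, CPU-bound code would still block every other session and would need to be moved off the main thread.

```typescript
// Hypothetical error type used to signal that a session overran its budget.
class SessionTimeoutError extends Error {
  constructor(sessionId: string, budgetMs: number) {
    super(`session ${sessionId} exceeded its ${budgetMs} ms budget`);
    this.name = "SessionTimeoutError";
  }
}

// Runs one agent turn, rejecting if it does not settle within budgetMs.
// Assumption: `turn` is async and yields to the event loop while waiting.
async function runWithBudget<T>(
  sessionId: string,
  budgetMs: number,
  turn: () => Promise<T>,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new SessionTimeoutError(sessionId, budgetMs)),
      budgetMs,
    );
  });
  try {
    // Whichever settles first wins: the turn itself or the budget timer.
    return await Promise.race([turn(), timeout]);
  } finally {
    // Always clear the timer so a finished turn leaves nothing pending.
    clearTimeout(timer);
  }
}
```

A caller would catch SessionTimeoutError for the affected session only and let the remaining sessions continue.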
On the memory front, a P0 security vulnerability (#85240) was identified where the relevant-memories recall mechanism lacks sender_id isolation, potentially leaking private memories from one user into another user's conversation context in multi-user deployments.
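The isolation that #85240 asks for can be sketched as a filter applied before any ranking. The MemoryRecord shape, the senderId and score fields, and the recallRelevantMemories function are assumptions made for this example; the actual recall layer may store and rank memories differently.

```typescript
// Hypothetical memory record; field names are assumptions for illustration.
interface MemoryRecord {
  senderId: string;
  text: string;
  score: number; // relevance score, higher is more relevant
}

// Returns the top memories for one sender only.
// Filtering by senderId happens BEFORE ranking, so another user's
// memory can never enter the result, however relevant it scores.
function recallRelevantMemories(
  memories: MemoryRecord[],
  senderId: string,
  limit = 3,
): MemoryRecord[] {
  return memories
    .filter((m) => m.senderId === senderId)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```

Filtering first, rather than ranking and then discarding, also keeps the limit meaningful: the caller gets up to limit results for the right user instead of a list thinned out by removed entries.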
Provider and Channel Regressions
Several channel-specific issues have emerged:
- Telegram: Reports indicate that messages are sometimes silently dropped due to update offset race conditions (#44930) and that forum topic replies can "jump" to the General topic despite topic-qualified sessions (#81874).
- Feishu: A critical routing bug (#45158) causes all messages to be routed to a single agent regardless of the configured bindings, leading to session pollution and privacy leaks. Additionally, the Feishu webhook mode currently ignores the configured webhookPath and accepts signed requests on arbitrary paths (#54841).
- Discord: Internal tool-call traces (e.g., NO_REPLY, commentary) are intermittently leaking into user channels (#44905).
Runtime and Provider Stability
Stability issues are prevalent in the Codex and Anthropic runtimes. The Codex app-server is experiencing silent truncation of long replies at ~1100 characters (#84516) and startup failures on Windows due to fragile command override handling (#84365).
For Anthropic providers, a regression in group chat context injection (#83419) creates consecutive user-role messages, which violates Anthropic's API requirements and triggers 500 errors via OpenRouter, causing silent fallbacks to Gemini models.
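A minimal sketch of the kind of normalization that would avoid consecutive user-role messages in #83419 follows. The ChatMessage type and mergeConsecutiveSameRole function are hypothetical, and joining with a blank line is an assumption; the real fix may need to handle structured content blocks rather than plain strings.

```typescript
type Role = "user" | "assistant";

// Hypothetical simplified message shape with plain-string content.
interface ChatMessage {
  role: Role;
  content: string;
}

// Collapses runs of adjacent messages that share a role into one message,
// so injected group-chat metadata and the real user text end up in a
// single user turn. The input array is left untouched.
function mergeConsecutiveSameRole(messages: ChatMessage[]): ChatMessage[] {
  const merged: ChatMessage[] = [];
  for (const msg of messages) {
    const last = merged[merged.length - 1];
    if (last && last.role === msg.role) {
      // Same role as the previous entry: append instead of adding a turn.
      last.content = `${last.content}\n\n${msg.content}`;
    } else {
      // Copy so later appends never mutate the caller's objects.
      merged.push({ ...msg });
    }
  }
  return merged;
}
```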
Key Themes
1. The "Stall and Block" Pattern
There is a recurring theme of internal bottlenecks causing system-wide failures. Whether it is the event loop blockage (#84903), the UV_THREADPOOL_SIZE limitation causing simultaneous API timeouts (#43374), or the Codex terminal-idle watchdog causing misleading timeouts (#85242), the system is struggling with concurrency and resource isolation.
2. State Persistence and Recovery
Issues with how state is persisted and recovered are frequent. This includes the "last-write-wins" race condition in exec-approvals.json (#44749), the loss of session history due to aggressive daily-reset archiving (#45003), and the launchd-managed gateway failing to restart after an update due to inherited XPC_SERVICE_NAME environment variables (#85224).
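The "last-write-wins" race in #44749 arises when two callers read the same state, each add an entry, and the second write overwrites the first. The sketch below serializes the read-modify-write with an in-process mutex. Everything here is illustrative: Mutex, addAllowlistEntrySafe, and the in-memory stand-in for the JSON file are assumptions, not OpenClaw code, and an in-process lock does not protect against a second process writing the same file.

```typescript
// Hypothetical promise-chain mutex: each task starts only after the
// previous one has settled.
class Mutex {
  private tail: Promise<void> = Promise.resolve();

  runExclusive<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Keep the chain usable even if a task rejects.
    this.tail = result.then(
      () => undefined,
      () => undefined,
    );
    return result;
  }
}

// In-memory stand-in for the JSON file; reads and writes are async,
// as real file I/O would be.
let fileContents = JSON.stringify({ allowlist: [] as string[] });
const readFile = async (): Promise<string> => fileContents;
const writeFile = async (data: string): Promise<void> => {
  fileContents = data;
};

const approvalsLock = new Mutex();

// Reads, modifies, and writes back while holding the lock, so no other
// caller can read stale state between the read and the write.
async function addAllowlistEntrySafe(entry: string): Promise<void> {
  await approvalsLock.runExclusive(async () => {
    const state = JSON.parse(await readFile()) as { allowlist: string[] };
    if (!state.allowlist.includes(entry)) {
      state.allowlist.push(entry);
    }
    await writeFile(JSON.stringify(state));
  });
}
```

Because the read happens inside the critical section, this combines both remedies named for the issue: the lock orders the writers, and each writer re-reads the latest state before writing.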
3. Schema-Driven Model Misbehavior
Several issues stem from the tool schema being too permissive, leading models (particularly GPT-5.x) to auto-populate optional fields that then trigger strict runtime guards. This is evident in the message.send action where poll fields or Discord modal skeletons cause valid messages to be rejected (#43015, #42820).
Action Required
High-Severity / Blocked Issues
- #85240 (P0 Security): Immediate implementation of sender_id filtering in the memory recall layer to prevent cross-user data leakage.
- #84903 (P1): Urgent need for per-session timeout budgets and better async isolation to prevent a single stalled session from blocking the Gateway event loop.
- #84886 (Beta Blocker): Implementation of a durable message dispatch idempotency ledger for Telegram to prevent duplicate agent turns during recovery.
- #85228 (P1): Optimization of the xAI OAuth auth stage to reuse cached tokens, reducing per-turn latency from ~13s to near-instant.
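To illustrate the idempotency ledger requested in #84886, the sketch below claims a key per inbound update before dispatching it, so a replay during recovery is skipped. DispatchLedger, dispatchOnce, and the chatId:updateId key format are hypothetical. The ledger here is an in-memory Set purely for demonstration; the issue calls for a durable ledger, which would need persistent storage so that claims survive a restart.

```typescript
// Hypothetical ledger of already-dispatched updates.
// In-memory only: a durable version would persist claimed keys.
class DispatchLedger {
  private seen = new Set<string>();

  // Returns true the first time a key is claimed, false on any replay.
  claim(key: string): boolean {
    if (this.seen.has(key)) {
      return false;
    }
    this.seen.add(key);
    return true;
  }
}

// Runs the agent turn at most once per (chatId, updateId) pair.
// Returns whether the turn was actually dispatched.
function dispatchOnce(
  ledger: DispatchLedger,
  chatId: number,
  updateId: number,
  runTurn: () => void,
): boolean {
  const key = `${chatId}:${updateId}`;
  if (!ledger.claim(key)) {
    return false; // duplicate delivery, skip the turn
  }
  runTurn();
  return true;
}
```

Claiming before running means a crash mid-turn leaves the update marked as handled; whether to prefer that or claim-after-completion is a design choice the real implementation would have to make.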
Critical Contributor Attention
- #44749: Fix the read-modify-write race in addAllowlistEntry using a mutex or re-read-before-write pattern.
- #83419: Merge metadata and actual user messages into a single user role block for Anthropic API compatibility.
- #85246: Resolve the handoff deadlock in the UI Update flow for npm global + launchd installations on macOS.