OpenClaw Issue Digest: Addressing Session Stability and Codex Harness Latency
Open Issues
Recent activity in the OpenClaw repository reveals a cluster of high-severity issues primarily affecting session stability, the Codex harness runtime, and security-critical credential handling.
Session Stability and Data Integrity
One of the most critical regressions is the EmbeddedAttemptSessionTakeoverError (#84059), which has rendered the system non-functional for some users. The error stems from an overly sensitive session file fingerprint mechanism in pi-agent-core@0.75.1: any nanosecond-precision mtime change triggers a takeover error, including changes caused by the system's own internal writes. Cron announce deliveries (#84583) exacerbate the problem, because background job completions modify session files while a user is actively chatting, which causes the turn to fail immediately.
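A minimal sketch of one possible mitigation, excluding internal writes from the fingerprint check, is shown here. Everything in it is hypothetical: `SessionFingerprint`, `FingerprintGuard`, and the choice of file size plus `mtimeNs` as the fingerprint are assumptions for illustration and are not taken from pi-agent-core.

```typescript
// Hypothetical sketch: none of these names come from pi-agent-core.
interface SessionFingerprint {
  size: number;
  // Decimal string, because nanosecond epoch timestamps exceed the safe
  // integer range of a JavaScript number.
  mtimeNs: string;
}

class FingerprintGuard {
  private expected: SessionFingerprint;

  constructor(initial: SessionFingerprint) {
    this.expected = initial;
  }

  // Call after every write this process performs itself, so that an
  // internal write is never mistaken for an external takeover.
  recordInternalWrite(after: SessionFingerprint): void {
    this.expected = after;
  }

  // True only when the file differs from the last state this process
  // produced or observed.
  isExternalChange(current: SessionFingerprint): boolean {
    return (
      current.size !== this.expected.size ||
      current.mtimeNs !== this.expected.mtimeNs
    );
  }
}
```

Under this design, a background writer such as a cron announce delivery would also have to report its writes through `recordInternalWrite` (or an equivalent shared channel); otherwise its changes would still register as external.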
Additionally, a race condition in session-write-lock.ts (#57019) allows an async release to delete newly-acquired locks, potentially leading to session transcript corruption. This is compounded by reports of tasks/runs.sqlite corruption (#71689), where malformed database images prevent the restoration of the durable task registry during gateway startup.
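The lock race can be illustrated with an ownership token: a release succeeds only if the caller still holds the lock it acquired. The sketch below is an in-memory stand-in under stated assumptions; `SessionLockTable` and its methods are hypothetical and do not reflect the actual contents of session-write-lock.ts.

```typescript
// Hypothetical sketch: an in-memory stand-in for an on-disk lock.
class SessionLockTable {
  private owners = new Map<string, number>();
  private nextToken = 1;

  // Returns an ownership token, or null if the path is already locked.
  tryAcquire(path: string): number | null {
    if (this.owners.has(path)) return null;
    const token = this.nextToken++;
    this.owners.set(path, token);
    return token;
  }

  // Removes the lock only if the caller still owns it. A late release
  // from a previous holder becomes a no-op instead of deleting the lock
  // that a newer holder has just acquired.
  release(path: string, token: number): boolean {
    if (this.owners.get(path) !== token) return false;
    this.owners.delete(path);
    return true;
  }

  isLocked(path: string): boolean {
    return this.owners.has(path);
  }
}
```

An on-disk variant would need the same compare-then-delete step to be atomic with respect to other processes, which this single-threaded sketch does not attempt to show.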
Codex Harness and Runtime Performance
Performance bottlenecks are emerging within the Codex app-server. Users report a "hidden" gap of 6s or more between attempt-dispatch and session.started (#84640), suggesting that the thread lifecycle (binding reads, compatibility checks, and RPC requests) is not sufficiently instrumented.
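As an illustration of what per-stage instrumentation could look like, the sketch below records a timestamp at each lifecycle stage and reports how long each stage lasted. `StageTimer` is a hypothetical helper, not an OpenClaw API, and any stage names passed to it are the caller's choice.

```typescript
// Hypothetical sketch: StageTimer is not an OpenClaw API.
interface StageMark {
  stage: string;
  atMs: number;
}

class StageTimer {
  private marks: StageMark[] = [];
  private now: () => number;

  // The clock is injectable so the timer can be tested deterministically.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Record the moment a lifecycle stage begins.
  mark(stage: string): void {
    this.marks.push({ stage, atMs: this.now() });
  }

  // Each stage lasts from its own mark until the next mark; the final
  // mark only closes the stage before it.
  summary(): Array<{ stage: string; ms: number }> {
    const out: Array<{ stage: string; ms: number }> = [];
    let prev: StageMark | undefined;
    for (const m of this.marks) {
      if (prev !== undefined) {
        out.push({ stage: prev.stage, ms: m.atMs - prev.atMs });
      }
      prev = m;
    }
    return out;
  }
}
```

Calling `mark` at attempt-dispatch, at each intermediate step, and at session.started would yield a summary showing which stage absorbs the delay.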
Stability issues also persist in the Codex bundled harness, specifically regarding isolated cron jobs that deterministically time out during setup (#84567). Furthermore, memory growth is being driven by unreaped chrome-devtools-mcp sidecars (#84413), which accumulate under the gateway cgroup and eventually require a full restart to clear.
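A bounded reaper for the sidecar leak could be built around a selection policy like the one sketched below: terminate anything idle past a time limit, and cap the number of survivors. The `Sidecar` shape, the function name, and both thresholds are assumptions for illustration; the sketch only selects process IDs and does not terminate anything.

```typescript
// Hypothetical sketch: selection policy only, no process management.
interface Sidecar {
  pid: number;
  lastUsedMs: number;
}

// Returns the pids to terminate: every sidecar idle for longer than
// maxIdleMs, plus the least recently used ones beyond the maxAlive cap.
function selectSidecarsToReap(
  sidecars: Sidecar[],
  nowMs: number,
  maxIdleMs: number,
  maxAlive: number
): number[] {
  // Most recently used first, so the cap keeps the freshest sidecars.
  const byRecency = [...sidecars].sort((a, b) => b.lastUsedMs - a.lastUsedMs);
  const reap: number[] = [];
  let kept = 0;
  for (const s of byRecency) {
    const idleTooLong = nowMs - s.lastUsedMs > maxIdleMs;
    if (idleTooLong || kept >= maxAlive) {
      reap.push(s.pid);
    } else {
      kept++;
    }
  }
  return reap;
}
```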
Security and Auth Regressions
A significant security regression has been identified in openclaw models status --probe (#84632), which rewrites models.json with resolved plaintext API keys for non-CORE custom providers. This bypasses the SecretRef system and exposes sensitive credentials in plaintext on disk.
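The intended behavior, persisting a reference rather than the resolved value, can be sketched as follows. The `SecretRef`, `ProviderConfig`, and `ProbedProvider` shapes and the `toPersistable` function are hypothetical; the real SecretRef and models.json schemas may differ.

```typescript
// Hypothetical shapes: the real SecretRef and models.json schemas may differ.
interface SecretRef {
  source: string; // for example "env"
  id: string; // for example the name of an environment variable
}

interface ProviderConfig {
  baseUrl: string;
  apiKey: string | SecretRef;
}

// What the probe works with in memory: the key has been resolved.
interface ProbedProvider {
  baseUrl: string;
  apiKey: string;
}

function isSecretRef(value: string | SecretRef): value is SecretRef {
  return typeof value !== "string";
}

// Build the object that is allowed to reach disk. Wherever the original
// config held a SecretRef, the ref is written back instead of the
// resolved plaintext value.
function toPersistable(
  original: ProviderConfig,
  probed: ProbedProvider
): ProviderConfig {
  return {
    baseUrl: probed.baseUrl,
    apiKey: isSecretRef(original.apiKey) ? original.apiKey : probed.apiKey,
  };
}
```

The key design point is that the serializer consults the original configuration, not the probe's in-memory view, when deciding what to write.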
Other security concerns include the exec tool returning raw stdout/stderr without secret redaction (#71211), and a scope deadlock in the CLI (#74484) where a paired CLI with only operator.read scope cannot approve or reject repair requests because those actions require operator.pairing scope.
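A redaction pass for exec output might, in its simplest form, mask every known secret value before the output is returned. The sketch below assumes the caller can supply the list of secret values; `redactSecrets` and the `[REDACTED]` placeholder are hypothetical names chosen for illustration.

```typescript
// Hypothetical sketch: masks known literal secret values only.
function redactSecrets(output: string, secrets: string[]): string {
  // Longest first, so a secret that contains a shorter one is masked
  // as a whole rather than partially.
  const ordered = [...secrets]
    .filter((s) => s.length > 0)
    .sort((a, b) => b.length - a.length);

  let redacted = output;
  for (const secret of ordered) {
    // split/join replaces every occurrence without regex escaping.
    redacted = redacted.split(secret).join("[REDACTED]");
  }
  return redacted;
}
```

This approach catches only exact matches. Secrets that appear encoded (for example base64 or URL-escaped) would pass through, so it is a baseline rather than a complete defense.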
Key Themes
1. The "Fragile Session" Pattern
There is a recurring theme of session-state fragility. Whether it is the EmbeddedAttemptSessionTakeoverError or the session write-lock race, the system is struggling to manage concurrent access to session files. The transition to more aggressive fingerprinting for security/takeover detection has inadvertently introduced instability in standard operational flows.
2. Observability Gaps in the Codex Pipeline
While the core embedded runner is traced, the Codex app-server's internal lifecycle remains a "black box." The gap between dispatch and session start is a primary source of perceived latency, and the lack of lifecycle logging for MCP sidecars makes memory leaks difficult to diagnose without manual ps audits.
3. Regression in Provider-Specific Logic
Several issues highlight regressions in how specific providers are handled:
- Kimi/Moonshot: Reasoning content is lost after LCM compaction, leading to 400 errors (#71491).
- DeepSeek: High thinking levels occasionally produce empty visible content, which is then dispatched as blank messages to channels (#84591).
- GitHub Copilot: GPT models fail in isolated/cron sessions due to missing API provider registration for secondary LLM calls (#84614).
Action Required
High Severity / Blocked
- #84059 & #84583 (Session Takeover): Immediate attention is needed to relax the mtimeNs precision or exclude internal writes from the fingerprint check to restore basic functionality for Feishu and Telegram users.
- #84632 (Plaintext API Keys): This is a critical security leak. The regenerator must be patched to persist non-secret markers instead of resolved values for all providers.
- #84604 (Auth Migration): The 4.x → 5.x migration path for claude-cli is broken, leaving harnesses unregistered and causing crash-loops for upgraded users.
Contributor Attention Needed
- #84413 (MCP Sidecar Leak): Implementation of a shared pool or a bounded reaper for chrome-devtools-mcp processes to prevent cgroup memory exhaustion.
- #84640 (Codex Latency): Addition of a thread lifecycle stage summary to OPENCLAW_LOG_LEVEL=trace to localize the 6s+ gap before session.started.
- #71211 (Exec Redaction): Implementation of a secret-redaction pass for exec tool output to prevent internal credential exposure.