The Danger of Invalid Surrogate Pairs: A Tale of Silent Sync Failures
If you build software for long enough, you will eventually encounter a bug that feels less like a coding error and more like a glitch in the matrix. For one engineering team, this manifested as a silent, intermittent failure where a collaborative editor would simply stop saving content. There were no console errors, no stack traces, and no obvious network failures—just data vanishing into the void.
The culprit turned out to be one of the most deceptive aspects of modern computing: the way JavaScript handles Unicode strings. Specifically, the team ran into the nightmare of invalid surrogate pairs.
The Bug: Two Emoji Enter, None Leave
The issue surfaced during the migration of a legacy editor to a collaborative experience using TipTap (a ProseMirror wrapper) and Yjs for CRDT-based real-time syncing. For most users, the system worked perfectly. However, a small subset of users reported that their edits occasionally stopped syncing. The editor remained responsive locally, but upon refreshing the page, all changes since the failure point were gone.
The breakthrough came when a product manager noticed a pattern: the bug triggered when specific characters were inserted between other multi-byte characters, in this case the 🟢 (green circle) and 🔴 (red circle) emojis.
When a user performed a "splice" operation (inserting or deleting text at a specific offset, which JavaScript counts in UTF-16 code units rather than in visible characters), the underlying CRDT library occasionally split a surrogate pair down the middle. This created a string containing an orphaned surrogate, which then crashed the synchronization process.
Understanding the Unicode Hierarchy
To understand why a simple emoji can crash a sync engine, we have to distinguish between three concepts that are often conflated as "a character":
1. Code Units
JavaScript stores strings internally as UTF-16. A code unit is a raw 16-bit value. When you call .length or .slice() in JavaScript, you are operating on code units, not necessarily human-perceived characters.
2. Code Points
A code point is the actual atomic unit defined by the Unicode standard. While many characters fit into a single 16-bit code unit, others (like most emojis) are too large. These are stored as a surrogate pair: one high surrogate and one low surrogate.
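The arithmetic behind a surrogate pair is short enough to sketch. The helper name `toSurrogatePair` below is made up for illustration; the constants are the standard UTF-16 ones.

```javascript
// Split a code point above U+FFFF into its UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("code point does not need a surrogate pair");
  }
  const offset = codePoint - 0x10000;      // 20 bits remain
  const high = 0xd800 + (offset >> 10);    // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);   // bottom 10 bits
  return [high, low];
}

// U+1F920 is the cowboy emoji used later in this article.
const [high, low] = toSurrogatePair(0x1f920);
console.log(high.toString(16), low.toString(16)); // d83e dd20
```

Either half on its own falls in the reserved surrogate range and does not encode a character by itself.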
3. Grapheme Clusters
A grapheme cluster is what a human perceives as a single character. Some emojis are actually combinations of multiple code points. For example, the female astronaut (👩🚀) consists of a woman emoji, a zero-width joiner, and a rocket emoji.
| Character | Code Units | Code Points | Graphemes |
|---|---|---|---|
| A | 1 | 1 | 1 |
| 🤠 | 2 | 1 | 1 |
| 👩🚀 | 5 | 3 | 1 |
| 👨👨👧👧 | 11 | 7 | 1 |
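The counts in the table can be reproduced with a few lines of JavaScript. This is a minimal sketch: `measure` and `samples` are illustrative names, and the emoji are written as escapes so that the zero-width joiners (U+200D) stay visible.

```javascript
// Count the three "lengths" of a string described above.
function measure(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    codeUnits: text.length,                                // UTF-16 code units
    codePoints: Array.from(text).length,                   // string iterator walks code points
    graphemes: Array.from(segmenter.segment(text)).length, // user-perceived characters
  };
}

const samples = {
  letter: "A",
  cowboy: "\u{1F920}",
  astronaut: "\u{1F469}\u200D\u{1F680}",
  family: "\u{1F468}\u200D\u{1F468}\u200D\u{1F467}\u200D\u{1F467}",
};

for (const [name, text] of Object.entries(samples)) {
  console.log(name, measure(text));
}
```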
How .slice() Breaks the System
Because JavaScript's .slice() operates on code units, it is blindly indifferent to whether it is cutting through a surrogate pair. If you slice a cowboy emoji (🤠) in the middle, you get two fragments:
"🤠".slice(0, 1); // → '\uD83E' (lone high surrogate)
"🤠".slice(1, 2); // → '\uDD20' (lone low surrogate)
These fragments are invalid Unicode. While they might render as a replacement character () in a browser, they cause critical failures when passed to certain APIs. In this specific case, the orphaned surrogate was passed to encodeURIComponent, which threw a URIError: URI malformed. Because this error was uncaught, the Yjs sync loop died silently, leaving the user in a state of local-only editing.
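The failure is easy to reproduce in isolation. The sketch below uses two hypothetical helpers, `hasLoneSurrogate` and `tryEncode`. It illustrates the failure mode only and is not the code path Yjs or lib0 actually runs.

```javascript
// A lone high surrogate, produced by slicing the cowboy emoji after one code unit.
const fragment = "\u{1F920}".slice(0, 1);

// True when the string contains a surrogate that is not part of a valid pair.
function hasLoneSurrogate(text) {
  return /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/.test(text);
}

// Wrap encodeURIComponent so malformed input is reported instead of thrown.
function tryEncode(text) {
  try {
    return { ok: true, value: encodeURIComponent(text) };
  } catch (err) {
    if (err instanceof URIError) {
      return { ok: false, reason: err.message };
    }
    throw err; // anything else is unexpected, so rethrow it
  }
}

console.log(hasLoneSurrogate(fragment));   // true
console.log(tryEncode(fragment).ok);       // false
console.log(tryEncode("\u{1F920}").value); // "%F0%9F%A4%A0"
```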
Mitigation and Resolution
Fixing this required a multi-tiered approach, moving from "nuclear" hacks to architectural changes:
The Immediate Hacks
Since the bug existed in an upstream dependency (lib0), the team initially implemented two safeguards:
- Offline Support: By enabling offline mode, the team ensured that the CRDT would continue to update locally. Once the page was reloaded or the error cleared, the system could attempt to merge the state again.
- Global Error Catching: A global window.addEventListener("error", ...) was added to specifically catch URIError: URI malformed. When detected, the app would trigger a modal asking the user to reload, preventing silent data loss.
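A guard of this kind might look roughly like the following. The article does not show the team's actual handler, so `installUriErrorGuard` and the `showReloadModal` callback in the usage comment are hypothetical names. The function takes any EventTarget so it can be exercised outside a browser.

```javascript
// Hypothetical helper: `target` is any EventTarget (window in a browser) and
// `onMalformedUri` is whatever the app uses to prompt the user to reload.
function installUriErrorGuard(target, onMalformedUri) {
  const handler = (event) => {
    // Error events dispatched for uncaught exceptions carry the thrown value in `event.error`.
    if (event.error instanceof URIError) {
      onMalformedUri(event.error);
    }
  };
  target.addEventListener("error", handler);
  // Return a cleanup function so the guard can be removed again.
  return () => target.removeEventListener("error", handler);
}

// In a browser this might be wired up as:
// const removeGuard = installUriErrorGuard(window, () => showReloadModal());
```

This only turns a silent failure into a visible one. The malformed string still exists and the sync loop has still stopped.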
The Permanent Fixes
Eventually, the root cause was addressed through two primary methods:
- Upstream Patching: The lib0 library was updated to detect orphaned surrogates during slicing and replace them with the Unicode replacement character (U+FFFD), preventing the URIError from being thrown.
- Atomic Node Types: In ProseMirror/TipTap, the team defined emojis as their own "atomic node type." This ensured the editor treated each emoji as an indivisible unit, making it impossible for a cursor or a splice operation to land inside a surrogate pair.
Lessons for Modern Development
This bug is a reminder that str[0] and .slice(0, 1) return the first code unit, not "the first character," and that treating the two as equivalent is dangerous in any application that supports non-ASCII input. The mistake is common in tools that generate user initials (e.g., firstName[0] + lastName[0]), which can render garbage, or crash a downstream API, if a user's name begins with an emoji.
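As a small illustration of the initials problem, the sketch below takes the first code point instead of the first code unit. The names `firstCodePoint` and `initials` are made up for this example.

```javascript
// First code point of a string, or "" when the string is empty.
function firstCodePoint(text) {
  const [first = ""] = text; // the string iterator yields whole code points
  return first;
}

function initials(firstName, lastName) {
  return firstCodePoint(firstName) + firstCodePoint(lastName);
}

const first = "\u{1F920}andy"; // a name that starts with the cowboy emoji
console.log(first[0] === "\uD83E");                  // true: indexing returns a lone high surrogate
console.log(initials(first, "Lee") === "\u{1F920}L"); // true: the full emoji is preserved
```

Iterating by code point keeps surrogate pairs intact, but it can still cut a multi-code-point grapheme such as a ZWJ sequence in half. That is the case Intl.Segmenter addresses.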
The Modern Solution: Intl.Segmenter
For developers performing string manipulation in JavaScript today, the correct approach is to use the Intl.Segmenter API, which is aware of grapheme clusters:
const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
// \u200D is the zero-width joiner that fuses 👩 and 🚀 into a single astronaut grapheme.
const segments = [...seg.segment("👩\u200D🚀A👍")].map((s) => s.segment);
// Result: ['👩‍🚀', 'A', '👍']
By splitting by graphemes rather than code units, you eliminate the possibility of orphaned surrogates and ensure that your application treats text the way humans do.
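Building on that, a splice that counts graphemes instead of code units could be sketched like this. `toGraphemes` and `spliceGraphemes` are hypothetical helpers, not part of any library mentioned in this article.

```javascript
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Split text into user-perceived characters.
function toGraphemes(text) {
  return Array.from(graphemeSegmenter.segment(text), (s) => s.segment);
}

// Like Array.prototype.splice, but `index` and `deleteCount` count graphemes.
function spliceGraphemes(text, index, deleteCount, insert = "") {
  const parts = toGraphemes(text);
  parts.splice(index, deleteCount, insert);
  return parts.join("");
}

const circles = "\u{1F7E2}\u{1F534}"; // green circle, red circle
console.log(spliceGraphemes(circles, 1, 0, "x") === "\u{1F7E2}x\u{1F534}"); // true
```

Because every index lands on a grapheme boundary, the result cannot contain half of a surrogate pair as long as the inputs were well-formed. Segmenting on every edit costs time proportional to the text length, so a real editor would likely cache or limit the segmented range.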
Community Perspectives
Technical discussions around this issue highlight that this is a systemic problem across many languages. Some developers noted that even languages like Dart have encountered similar "cut in half" emoji issues. Others suggested that property-based testing is one of the most effective ways to uncover these edge cases by feeding functions a wide array of Unicode characters.
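To make the property-based testing suggestion concrete, here is a hand-rolled sketch with no dependencies. A dedicated property-testing library would normally supply the generators and shrink failing cases. Every name below (`makeRng`, `randomText`, `findCounterexample`, `naiveCut`, `snappingCut`) is illustrative.

```javascript
const ALPHABET = ["a", "\u00E9", "\u{1F920}", "\u{1F7E2}", "\u{1F534}", "\u{1F469}\u200D\u{1F680}"];
const LONE = /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/;

// Small seeded generator (a linear congruential generator) so runs are reproducible.
function makeRng(seed) {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Build a well-formed random string from the alphabet above.
function randomText(rng, maxParts = 8) {
  const count = 1 + Math.floor(rng() * maxParts);
  let text = "";
  for (let i = 0; i < count; i++) {
    text += ALPHABET[Math.floor(rng() * ALPHABET.length)];
  }
  return text;
}

// Property: cutting well-formed text at any index must yield well-formed halves.
// Returns the first violating input, or null if none was found.
function findCounterexample(cut, runs = 200, seed = 1) {
  const rng = makeRng(seed);
  for (let run = 0; run < runs; run++) {
    const text = randomText(rng);
    for (let index = 0; index <= text.length; index++) {
      const [head, tail] = cut(text, index);
      if (LONE.test(head) || LONE.test(tail)) return { text, index };
    }
  }
  return null;
}

// Cuts at the raw code-unit index.
const naiveCut = (text, index) => [text.slice(0, index), text.slice(index)];

// Steps back one code unit when the index falls between a high and a low surrogate.
const snappingCut = (text, index) => {
  const prev = text.charCodeAt(index - 1);
  const next = text.charCodeAt(index);
  const splitsPair = prev >= 0xd800 && prev <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
  const safe = splitsPair ? index - 1 : index;
  return [text.slice(0, safe), text.slice(safe)];
};

console.log(findCounterexample(naiveCut) !== null); // the naive cut breaks the property
console.log(findCounterexample(snappingCut));       // null: no violation found
```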
Some argue that the complexity of UTF-16 was an "unforced error" in the history of computing, while others point out that the concept of multiple scalars contributing to a single logical unit (graphemes) was always inevitable, regardless of the encoding used.