← Back to Blogs
HN Story

From 3 GB to 10 MB: Optimizing Prefix Search with Finite State Transducers

May 11, 2026

From 3 GB to 10 MB: Optimizing Prefix Search with Finite State Transducers

When building a search-as-you-type dictionary, the standard approach is often a trie. Tries are excellent for prefix search and autocomplete, providing efficient lookups by sharing common prefixes. However, when dealing with highly agglutinative languages—where words are formed by stringing together multiple morphemes—the scale of possible word forms can cause traditional data structures to collapse under their own weight.

This is the challenge faced by the developer of Taskusanakirja (tsk), a Finnish-English dictionary. In Finnish, a single base word can have over a hundred possible endings due to consonant gradation and vowel harmony. For a beginner, the ability to "cleave" a complex word into its root and suffixes is essential. Implementing this required indexing tens of millions of inflections, a task that quickly outgrew the memory limits of a standard trie and eventually led to a cumbersome 3 GB SQLite database.

The Failure of the Trie

A trie (prefix tree) optimizes for the start of a word. If you have the words "kadun" and "kaduille," they share the first three nodes (k-a-d). However, every distinct suffix path is stored independently.

In a language like Finnish, thousands of different words might all end with the same complex inflectional pattern (e.g., -ssa-mme-kin, meaning "in our [X], as well"). In a trie, this suffix is repeated for every single word that uses it. When the dataset grows from 400,000 items to 40-60 million, the memory footprint becomes unsustainable for consumer hardware, particularly for users on older laptops.

The "Right Bad Solution": SQLite

Faced with a scaling wall, the initial fallback was to move the inflections into a separate SQLite database using the Full Text Search (FTS) extension. This was a pragmatic choice: it worked, it was fast, and it was understandable.

However, it introduced a significant distribution hurdle: a 3 GB download. While SQLite is a powerful tool, it is a general-purpose database. Using a B-tree-based system for a static dataset of prefix searches is often overkill and inefficient in terms of space.

Enter the Finite State Transducer (FST)

To solve the memory problem, the project transitioned to Rust and implemented a Finite State Transducer (FST) using the fst crate. An FST is essentially a minimal acyclic deterministic finite-state automaton.

The critical difference between a trie and an FST is suffix sharing. While a trie only merges prefixes, an FST merges any two subtrees that are structurally identical. If 100,000 words share the same dozen inflectional patterns, the FST stores those patterns only once.

Results of the Transition

By moving from a general-purpose database to a specialized, static data structure, the results were dramatic:

  • Storage Reduction: The 3 GB SQLite database was replaced by a 10 MB FST binary.
  • Memory Efficiency: A 300x reduction in space.
  • Performance: The data load is static at runtime, bypassing the primary weakness of FSTs (the cost of construction) and maintaining the speed of prefix searches.

Engineering Takeaways

1. The Value of the "Naive" Implementation

One of the most important lessons from this transition is the validity of starting with a "stupid" solution. By implementing the SQLite version first, the developer had a working baseline. This allowed them to verify the correctness of the FST implementation by ensuring it produced the same results as the naive version.

Start with the obvious, stupid solution that definitely works. Then do the optimized version, while making sure it matches the naive implementation.

2. Solving Problems Twice

There is often a guilt associated with reinventing the wheel. However, in complex software engineering, solving a problem twice—first for functionality, then for optimization—is often the only way to reach the true frontier of a problem's requirements. The first pass establishes the "what," and the second pass optimizes the "how."

3. Tool Selection

This case highlights the intersection where Rust becomes an ideal choice: when a project needs to be fast, portable, and requires precise control over memory ergonomics. By leveraging Rust's ecosystem and specialized crates like fst, it was possible to turn a 3 GB download into a 20 MB "batteries-included" binary.

References

HN Stories