OpenData Vector: MIT-Licensed Vector Search on Object Storage
The rise of generative AI and large language models (LLMs) has created demand for a vector database that can scale linearly without the cost overhead of traditional vector databases. OpenData Vector takes a different approach to vector search: it uses object storage as the primary storage layer, decoupling compute from storage to provide a scalable, MIT-licensed alternative to proprietary vector search engines.
The Architecture of OpenData Vector
At its core, OpenData Vector is designed to address a primary bottleneck of vector databases: the cost of high-performance storage at scale. By using object storage (such as AWS S3, Google Cloud Storage, or Azure Blob Storage), the system lets users maintain massive datasets of embeddings without the expensive SSD-backed infrastructure typically required for high-performance search.
This decoupling of compute and storage allows each to scale independently. If search volume increases, more compute nodes can be added to handle the query load without replicating the entire dataset across multiple expensive disks. Conversely, if data volume grows, storage cost remains low, because object storage is significantly cheaper per gigabyte than high-performance block storage.
Performance and Implementation Details
While object storage is inherently slower than local SSDs, OpenData Vector implements strategies to mitigate latency. The architecture separates the index from the data: the index is cached or stored in a faster layer so that query results are returned quickly, while the bulk of the data stays in object storage.
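The index/data separation described above can be sketched in a few lines. This is a minimal illustration, not OpenData Vector's actual implementation: it assumes a partitioned (IVF-style) layout in which a small table of centroids is held in memory and each partition of vectors is one object in the store. The names FakeObjectStore, LRUCache, and search are hypothetical, and the object store is simulated with a dictionary so the example runs without cloud credentials.

```python
from collections import OrderedDict


class FakeObjectStore:
    """In-memory stand-in for an object store; counts GET requests."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.get_count = 0

    def get(self, key):
        self.get_count += 1
        return self._objects[key]


class LRUCache:
    """Small least-recently-used cache sitting in front of the store."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get_or_load(self, key, loader):
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        value = loader(key)  # cache miss: one request to the store
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used
        return value


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(query, centroids, store, cache, k=2, nprobe=1):
    """Return the ids of the k nearest vectors found in the nprobe closest partitions.

    centroids maps partition key -> centroid vector; this is the small
    "index" kept in memory (an assumption of this sketch).
    """
    nearest = sorted(
        centroids, key=lambda part: squared_distance(query, centroids[part])
    )[:nprobe]
    candidates = []
    for part in nearest:
        # Partition contents are a list of (vector_id, vector) pairs.
        for vec_id, vec in cache.get_or_load(part, store.get):
            candidates.append((squared_distance(query, vec), vec_id))
    candidates.sort()
    return [vec_id for _, vec_id in candidates[:k]]


store = FakeObjectStore({
    "p0": [("a", [0.0, 0.0]), ("b", [0.1, 0.0])],
    "p1": [("c", [5.0, 5.0]), ("d", [5.1, 5.0])],
})
centroids = {"p0": [0.05, 0.0], "p1": [5.05, 5.0]}
cache = LRUCache(capacity=2)

first = search([0.0, 0.0], centroids, store, cache)
second = search([0.0, 0.0], centroids, store, cache)  # served from cache
```

In this sketch the second query touches only the in-memory centroids and the cached partition, so the store sees a single GET for both searches. A real system would also have to handle cache invalidation when partitions are rewritten, which is omitted here.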
This approach is similar to architectural patterns seen in other high-performance vector search engines such as Turbopuffer. The goal is high query throughput without sacrificing the accuracy of search results, achieved through efficient indexing and memory management.
Community Discussion and Key Considerations
Following the release, the community has raised several critical questions about the performance trade-offs and the project's current state of maturity.
The Performance Gap
One of the primary concerns is how OpenData Vector compares to highly optimized, hardware-accelerated systems. As community member @oliverio noted, some proprietary systems apply significant optimizations at the hardware and firmware layer, which can create a performance gap. The challenge for OpenData Vector is to bring comparable optimizations into an open-source project while keeping the general-purpose nature of object storage.
The Cost of Object Storage QPS
Another point of discussion is the cost of high request rates. A common perception is that object storage becomes expensive when the query rate (queries per second, QPS) is very high, because GET and PUT operations are billed per request.
"I was under the impression that object storage was super expensive compared to 'normal' SSDs if the QPS numbers got high."
To address this, systems built on object storage typically employ aggressive caching layers or perform significant pre-processing on the database server before data is committed to the storage layer. By reducing the number of direct calls to the object store, these systems can keep the cost advantage of object storage while approaching the query performance of a local disk.
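The effect of caching on request cost comes down to simple arithmetic, sketched here. The function name monthly_get_cost and all of its parameters are hypothetical, and the price used in the example is a placeholder rather than a quoted rate from any provider; substitute your provider's current per-request pricing.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, an assumption of this sketch


def monthly_get_cost(qps, gets_per_query, cache_hit_rate, price_per_1k_gets):
    """Estimate the monthly GET request bill for a query workload.

    Only cache misses reach the object store, so the billed request
    rate is scaled by (1 - cache_hit_rate).
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    store_gets_per_second = qps * gets_per_query * (1.0 - cache_hit_rate)
    billed_gets = store_gets_per_second * SECONDS_PER_MONTH
    return billed_gets / 1000 * price_per_1k_gets


# Placeholder price per 1,000 GET requests; not a quoted rate.
PRICE = 0.0004

uncached = monthly_get_cost(qps=100, gets_per_query=4,
                            cache_hit_rate=0.0, price_per_1k_gets=PRICE)
cached = monthly_get_cost(qps=100, gets_per_query=4,
                          cache_hit_rate=0.95, price_per_1k_gets=PRICE)
```

Under this model the bill scales linearly with the miss rate, so a 95% cache hit rate cuts the request cost to about one twentieth of the uncached figure. The model ignores PUT requests, data transfer, and storage charges, which a full estimate would need to include.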
Conclusion
OpenData Vector provides a compelling MIT-licensed alternative for those looking to avoid vendor lock-in and maintain control over their embeddings. By moving the vector search engine onto object storage, it challenges the assumption that vector search requires expensive, high-performance storage, making massive-scale vector search more accessible to developers and open-source projects.