Rocky: A Rust SQL Engine for Data Warehouse Control Planes
Data teams routinely struggle with managing data pipelines: tracking dependencies, ensuring data quality, enforcing governance, and understanding costs. Modern data warehouses provide robust storage and compute, but the orchestration and meta-management layer often remains fragmented. Rocky is a Rust-based control plane designed to address this by taking ownership of the Directed Acyclic Graph (DAG) that underpins data warehouse pipelines.
Rocky aims to complement existing data warehouse infrastructure such as Databricks or Snowflake rather than replace it. Its core value proposition is a unified system for dependencies, compile-time types, drift detection, incremental logic, cost attribution, lineage, and governance. These aspects are hard to manage cohesively in current stacks, which do not inherently "own the DAG." The project recently reached a significant milestone: its governance waveplan is now implemented end to end, covering column classification, per-environment masking, an 8-field audit trail, compliance rollups, role-graph reconciliation, and retention policies. This marks its readiness for broader adoption.
Core Innovations and Features
Rocky introduces several compelling features that streamline data pipeline management and enhance data trust.
Git-Grade Workflow with Branches and Replay
One of Rocky's standout features is bringing Git-grade workflows to data warehouses. The rocky branch create stg command creates logical copies of a pipeline's tables (currently via schema prefixes; native Delta SHALLOW CLONE and Snowflake zero-copy cloning are planned), which enables isolated development and testing. In addition, rocky replay <run_id> can reconstruct the exact SQL executed against specific inputs, which is useful for debugging, auditing, and understanding historical pipeline states.
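To make the schema-prefix idea concrete, here is a minimal Python sketch of how a branch name could be mapped onto a pipeline's tables. It is not Rocky's implementation. The TableRef type, the branch_ref and branch_plan helpers, the "<branch>__<schema>" naming convention, and the CREATE TABLE ... AS SELECT materialization are all assumptions made for illustration; the source only says that branches currently rely on schema prefixes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRef:
    """A schema-qualified table name (hypothetical helper type)."""
    schema: str
    name: str

    def qualified(self) -> str:
        return f"{self.schema}.{self.name}"


def branch_ref(table: TableRef, branch: str, sep: str = "__") -> TableRef:
    """Map a table to its branch copy by prefixing the schema name.

    The "<branch>__<schema>" convention is an assumption for this sketch.
    """
    return TableRef(schema=f"{branch}{sep}{table.schema}", name=table.name)


def branch_plan(tables: list[TableRef], branch: str) -> list[str]:
    """Emit one copy statement per pipeline table (assumed CTAS form)."""
    statements = []
    for table in tables:
        target = branch_ref(table, branch)
        statements.append(
            f"CREATE TABLE {target.qualified()} AS SELECT * FROM {table.qualified()}"
        )
    return statements


pipeline = [TableRef("analytics", "orders"), TableRef("analytics", "customers")]
plan = branch_plan(pipeline, "stg")
for statement in plan:
    print(statement)
```

Because the branch lives entirely in the schema name, the same pipeline SQL can run against either copy by swapping the prefix. A warehouse-native clone would replace the copy statement without changing the name mapping.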
Compile-time Column-Level Lineage
Unlike many data lineage tools that perform post-hoc analysis by parsing logs, Rocky's compiler provides column-level lineage from the outset. This means the type checker actively traces columns through complex SQL constructs like joins, CTEs, and window functions. This compile-time understanding significantly changes the workflow, making refactors and the implementation of masking policies far less daunting. The VS Code extension further surfaces this lineage inline via the Language Server Protocol (LSP).
As one commenter noted, this approach is a significant improvement:
The compile-time lineage part is the most interesting bit to me. A lot of “data lineage” tools feel like archaeology after the fact: parse logs, reconstruct what probably happened, then hope it matches reality. Having the compiler know “this column flows into these downstream models” before execution changes the workflow quite a bit. It makes refactors and masking policies much less scary.
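The idea of knowing which downstream columns a given column flows into can be illustrated with a small graph walk. The sketch below is in Python and does not reflect Rocky's internals. The MODEL_LINEAGE mapping, the model and column names, and both helper functions are hypothetical, and the lineage is written by hand, whereas a compiler would derive it from the SQL.

```python
from collections import defaultdict, deque

# Hypothetical per-model lineage: output column -> upstream (model, column) inputs.
# A real compiler would derive this from the SQL; here it is written by hand.
MODEL_LINEAGE = {
    "stg_orders": {
        "order_id": [("raw_orders", "id")],
        "customer_email": [("raw_orders", "email")],
    },
    "customer_summary": {
        "email": [("stg_orders", "customer_email")],
        "order_count": [("stg_orders", "order_id")],
    },
}


def build_downstream_index(lineage):
    """Invert the lineage so each column points at the columns it feeds."""
    downstream = defaultdict(set)
    for model, columns in lineage.items():
        for column, inputs in columns.items():
            for source in inputs:
                downstream[source].add((model, column))
    return downstream


def downstream_columns(lineage, start):
    """Breadth-first walk returning every column that depends on `start`."""
    index = build_downstream_index(lineage)
    seen, queue = set(), deque([start])
    while queue:
        for nxt in index.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


affected = downstream_columns(MODEL_LINEAGE, ("raw_orders", "email"))
print(sorted(affected))
```

With this kind of index available before anything runs, a masking policy on raw_orders.email can be checked against every column in affected, and a rename can be reviewed against the same set.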
This compile-time lineage also opens the door for advanced review processes, such as a