DuckDB and the next frontier of OLAP databases

matrixpartners_gray

How DuckDB is ushering in the next frontier of embedded OLAP databases.

Data warehouses are now a default piece of the modern data stack. Teams of all sizes use a data warehouse to power analytics use cases. Snowflake's rapid rise has been the poster child of this trend.

However, the current warehouse paradigm is designed for a client-server use case and ignores a growing segment of users. The next frontier of data warehouses will close this gap by focusing on in-process analytics.

The next frontier: analytical queries x embedded deployments

An emerging category of data warehouses sits at the intersection of analytical queries and embedded deployments. To illustrate why this is so compelling, it's helpful to categorize databases along two axes:

1. What types of workloads they are optimized for: analytical versus transactional

Analytical workloads — also called Online analytical processing (OLAP) — are complex queries on historical data. For example, you may want to analyze user signups broken down by demographics such as age and location. On the other hand, transactional workloads — also referred to as Online transactional processing — are optimized for quick real-time reads and writes. This is what your typical CRUD (create, retrieve, update, delete) app runs on.

2. How they are deployed: stand-alone versus embedded

Stand-alone databases are typically deployed in a client-server paradigm. The database sits on a centralized server and is queried by a client application. Embedded databases run within the host process of whatever application is accessing the database.


Putting these two axes together reveals the gap. Innovation in OLAP databases has focused on stand-alone OLAP databases such as Snowflake, ClickHouse, and Redshift. As a result, embedded analytics use cases have been underserved. DuckDB is changing this.

DuckDB: Coupling data with compute

DuckDB's core innovation is that it is an in-process OLAP database. This means it is tightly integrated with your application and runs on the same host compute workload. By design, using DuckDB is as easy as installing a package and importing a module with no external dependencies.

The elegance of DuckDB’s approach can't be understated. It contrasts with that of data warehouses like Snowflake, which decouples compute from storage in a client-server model.

Merging compute and storage unlocks a handful of use cases I am excited about:

  • Better local interactive analytics: Today, most data scientists still do a significant amount of analysis locally in R or Python Notebooks. However, these notebooks lack traditional database management systems and are cumbersome to integrate with the client-server model of warehouses like Snowflake. As a workaround, data scientists are left with using alternative data management operators such as dplyr and Pandas. With DuckDB, data scientists get the power of a Snowflake-like analytical database on their local machine with a simple import statement.
  • Serverless, single-node Snowflake: The benefits of DuckDB are not limited to local interactive analytics. Because DuckDB is designed to run as a single node, it is relatively easier to implement and manage and can be used as a lightweight enterprise data warehouse. For startups that don't have enough data to warrant a Snowflake instance, DuckDB is both a cost-effective and more manageable alternative. I expect to see several startups spring up to offer serverless warehouse offerings powered by DuckDB.
  • IoT use cases: Having an in-process analytics store also means that analysis can be done on a host device versus a centralized server. This is particularly useful for IoT devices that run compute on the edge. Without an on-device analytics store, edge nodes would typically forward data to a central location for analysis. However, this approach runs into latency and bandwidth constraints. Performing analysis directly inside an edge node removes the data transfer step and alleviates these concerns.
  • ML runtimes: One of the challenges in setting up a machine learning pipeline is pulling data from your warehouse into your project. Since DuckDB can run in-process in Python, a machine learning pipeline can run on the same application host process. This removes the need to replicate data for a given project.

The idea that an in-process OLAP database needs to exist seems obvious in hindsight. However, several technical innovations and demand drivers converged to drive DuckDB's growing popularity. First, data science and analytics is now a core business process, not just a research function. Additionally, DuckDB's implementation draws upon decades of research on topics such as execution engines, concurrency control, and hardware resilience.

Though it is still early, the DuckDB ecosystem is showing promise, with several companies like Hex, Mode, and Voltron Data supporting the project. I would love to hear from you if you are building in the DuckDB ecosystem.