Extract data from Databases into DuckLake

Introduction

In the evolving landscape of data engineering, DuckLake is emerging as a powerful solution for building data lakes with ACID transactions, versioning, and a flexible catalog backend. It combines the speed and efficiency of DuckDB with the scalability of cloud storage, making it an attractive choice for modern data platforms.

A common requirement is to populate a data lake by extracting data from various transactional or analytical databases. This is where Sling comes in, offering a simple and powerful command-line interface (CLI) to move data between different sources and destinations.

In this article, we’ll walk through how to use Sling to extract data from a PostgreSQL database and load it into DuckLake. The same principles can be applied to other databases that Sling supports, such as MySQL, SQL Server, Oracle, and more.

What is DuckLake?

DuckLake is a data lake format that brings the power of DuckDB to a data lake architecture. It provides a transactional layer over your data files (like Parquet) stored in object storage (e.g., AWS S3, Google Cloud Storage, Azure Blob Storage) or on a local filesystem. It uses a catalog database (like DuckDB, SQLite, PostgreSQL, or MySQL) to manage metadata, schemas, and versions.

This setup allows you to query your data lake using standard SQL with the performance benefits of DuckDB, while ensuring data consistency and reliability.
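
To make that architecture concrete, here is a minimal sketch of attaching and inspecting a DuckLake catalog from the DuckDB CLI. It assumes the ducklake extension and its ducklake: ATTACH syntax; the file names are illustrative and match the example later in this article.

# Attach a DuckLake catalog (a local DuckDB file) plus its data directory, then inspect it
duckdb -c "INSTALL ducklake; LOAD ducklake; \
  ATTACH 'ducklake:my_catalog.db' AS lake (DATA_PATH './ducklake_data'); \
  SHOW ALL TABLES;"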

Setting up the Environment

Before we begin, make sure you have Sling CLI installed.
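
For reference, two common ways to install it are shown below; check the Sling documentation for the latest instructions for your platform.

# Install the Sling CLI with pip, or with Homebrew on macOS/Linux, then verify
pip install sling
# or
brew install slingdata-io/sling/sling

sling --version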

Configuring the Connections

We need to configure two connections in Sling: one for our source database (PostgreSQL) and one for our target (DuckLake).

1. Source Database: PostgreSQL

Let’s set up a connection to a PostgreSQL database. You can do this by setting an environment variable or by using the sling conns command. See the Sling documentation on database connections for more details.

export POSTGRES_CONN="postgres://user:pass@host:5432/dbname"
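
If you prefer managed connections over raw environment variables, a roughly equivalent setup with sling conns set might look like this (host, credentials, and database name are placeholders):

# Define the same PostgreSQL connection via `sling conns set`
sling conns set POSTGRES_CONN type=postgres \
  host=host user=user password=pass port=5432 database=dbname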

2. Target: DuckLake

For DuckLake, we need to specify the catalog type, the connection string for the catalog, and the path where the data will be stored. For this example, we’ll use a local DuckDB file as our catalog and a local directory for our data. See the Sling documentation on the DuckLake connection for more details.

Here’s how to set up the DuckLake connection using sling conns set:

sling conns set MY_DUCKLAKE type=ducklake \
  catalog_type=duckdb \
  catalog_conn_string=my_catalog.db \
  data_path=./ducklake_data

This command creates a DuckLake connection named MY_DUCKLAKE that uses a local DuckDB file my_catalog.db for the catalog and stores data in the ./ducklake_data directory.
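
Connections can also live in Sling's env.yaml file (typically ~/.sling/env.yaml). As a sketch, the same two connections defined there might look like this:

# ~/.sling/env.yaml
connections:
  POSTGRES_CONN:
    type: postgres
    url: postgres://user:pass@host:5432/dbname

  MY_DUCKLAKE:
    type: ducklake
    catalog_type: duckdb
    catalog_conn_string: my_catalog.db
    data_path: ./ducklake_data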

You can verify that your connections are set up correctly by running:

sling conns list
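
You can also test each connection and peek at the tables Sling can see in the source before moving any data:

# Test connectivity and discover available source streams
sling conns test POSTGRES_CONN
sling conns test MY_DUCKLAKE
sling conns discover POSTGRES_CONN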

Extracting Data from PostgreSQL to DuckLake

With our connections configured, extracting data is straightforward. We can use a simple sling run command.

Let’s say we want to extract the customers table from the public schema in our PostgreSQL database and load it into a table named customers in the main schema of our DuckLake.

sling run --src-conn POSTGRES_CONN --src-stream public.customers \
          --tgt-conn MY_DUCKLAKE --tgt-object main.customers

Alternatively, you can use a YAML configuration file for more control:

# extract.yaml
source: POSTGRES_CONN
target: MY_DUCKLAKE

defaults:
  object: main.{stream_table}

streams:
  # load all tables
  public.*:
  
  finance.customers:
    object: main.finance_customers

Then run:

sling run -r extract.yaml

That’s it! Sling will handle:

  • Reading the data from every table in the public schema, as well as from the finance.customers table.
  • Creating the target tables in DuckLake if they don’t exist: main.finance_customers, plus one table per table in the source public schema.
  • Writing the data into Parquet files in the ducklake_data directory.
  • Updating the DuckLake catalog (my_catalog.db) with the new table information.
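
Once the run finishes, you can sanity-check the load by querying DuckLake directly from the DuckDB CLI, for example (assuming the ducklake extension and the file names used above):

# Count the rows that landed in one of the new DuckLake tables
duckdb -c "INSTALL ducklake; LOAD ducklake; \
  ATTACH 'ducklake:my_catalog.db' AS lake (DATA_PATH './ducklake_data'); \
  SELECT count(*) FROM lake.main.customers;"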

Incremental Loads

One of the powerful features of Sling is its ability to handle incremental loads easily. This is crucial for keeping your data lake up-to-date without having to re-extract all the data every time.

To perform an incremental load, you need a key column in your source table that indicates the order of records, such as a timestamp or an auto-incrementing ID. Let’s assume our customers table has an updated_at column.

We can use a Sling replication to manage the state of our incremental loads automatically. Here’s how you would structure the replication configuration:

# replication.yaml
source: POSTGRES_CONN
target: MY_DUCKLAKE

defaults:
  mode: incremental
  primary_key: [customer_id]
  update_key: updated_at

streams:
  public.customers:
    object: main.customers

You can then run this replication with:

sling run -r replication.yaml

The incremental mode will merge new or updated records into the target table, using the primary_key to ensure there are no duplicates. Sling automatically tracks the last updated_at value it processed and only fetches newer records on subsequent runs.
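
If you prefer a one-off command over a replication file, the same incremental load can be expressed with sling run flags (flag names as documented for the Sling CLI; see sling run --help):

sling run --src-conn POSTGRES_CONN --src-stream public.customers \
          --tgt-conn MY_DUCKLAKE --tgt-object main.customers \
          --mode incremental --primary-key customer_id --update-key updated_at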

Conclusion

DuckLake offers a compelling solution for building modern, transactional data lakes, and Sling makes it incredibly simple to populate it from any database. With just a few commands, you can perform full extracts or set up robust incremental pipelines to keep your DuckLake synchronized with your source systems.

To learn more about what you can do with Sling, check out the official documentation. Happy slinging!