Introduction
In the evolving landscape of data engineering, DuckLake is emerging as a powerful solution for building data lakes with ACID transactions, versioning, and a flexible catalog backend. It combines the speed and efficiency of DuckDB with the scalability of cloud storage, making it an attractive choice for modern data platforms.
A common requirement is to populate a data lake by extracting data from various transactional or analytical databases. This is where Sling comes in, offering a simple and powerful command-line interface (CLI) to move data between different sources and destinations.
In this article, we’ll walk through how to use Sling to extract data from a PostgreSQL database and load it into DuckLake. The same principles can be applied to other databases that Sling supports, such as MySQL, SQL Server, Oracle, and more.
What is DuckLake?
DuckLake is a data lake format that brings the power of DuckDB to a data lake architecture. It provides a transactional layer over your data files (like Parquet) stored in object storage (e.g., AWS S3, Google Cloud Storage, Azure Blob Storage, or local files). It uses a catalog database (like DuckDB, SQLite, PostgreSQL, or MySQL) to manage metadata, schemas, and versions.
This setup allows you to query your data lake using standard SQL with the performance benefits of DuckDB, while ensuring data consistency and reliability.
Setting up the Environment
Before we begin, make sure you have Sling CLI installed.
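If you haven’t installed it yet, a couple of common options are shown below; check the official installation guide for the method that fits your platform.

# install via Homebrew (macOS / Linux)
brew install slingdata-io/sling/sling

# or install via pip
pip install sling

# confirm the CLI is available
sling --version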
Configuring the Connections
We need to configure two connections in Sling: one for our source database (PostgreSQL) and one for our target (DuckLake).
1. Source Database: PostgreSQL
Let’s set up a connection to a PostgreSQL database. You can do this by setting an environment variable or by using the sling conns command. See here for more details.
export POSTGRES_CONN="postgres://user:pass@host:5432/dbname"
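Alternatively, the same connection can be defined with the sling conns set command using key/value pairs. The host, credentials, and database name below are placeholders:

sling conns set POSTGRES_CONN type=postgres \
  host=host port=5432 user=user \
  password=pass database=dbname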
2. Target: DuckLake
For DuckLake, we need to specify the catalog type, the connection string for the catalog, and the path where the data will be stored. For this example, we’ll use a local DuckDB file as our catalog and a local directory for our data. See here for more details.
Here’s how to set up the DuckLake connection using sling conns set:
sling conns set MY_DUCKLAKE type=ducklake \
catalog_type=duckdb \
catalog_conn_string=my_catalog.db \
data_path=./ducklake_data
This command creates a DuckLake connection named MY_DUCKLAKE that uses the local DuckDB file my_catalog.db as its catalog and stores data in the ./ducklake_data directory.
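The same keys also work for a cloud-backed setup. As a sketch, here is what a connection with a PostgreSQL catalog and an S3 data path could look like; the bucket and catalog connection string are placeholders, and S3 access is assumed to be configured separately (for example via your existing AWS credentials):

sling conns set MY_DUCKLAKE_S3 type=ducklake \
  catalog_type=postgres \
  catalog_conn_string="postgres://user:pass@host:5432/catalog_db" \
  data_path=s3://my-bucket/ducklake_data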
You can verify that your connections are set up correctly by running:
sling conns list
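You can also test connectivity for an individual connection:

sling conns test POSTGRES_CONN
sling conns test MY_DUCKLAKE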
Extracting Data from PostgreSQL to DuckLake
With our connections configured, extracting data is straightforward. We can use a simple sling run command.
Let’s say we want to extract the customers table from the public schema in our PostgreSQL database and load it into a table named customers in the main schema of our DuckLake.
sling run --src-conn POSTGRES_CONN --src-stream public.customers \
--tgt-conn MY_DUCKLAKE --tgt-object main.customers
Alternatively, you can use a YAML configuration file for more control:
# extract.yaml
source: POSTGRES_CONN
target: MY_DUCKLAKE

defaults:
  object: main.{stream_table}

streams:
  # load all tables
  public.*:
  finance.customers:
    object: main.finance_customers
Then run:
sling run -r extract.yaml
That’s it! Sling will handle:
- Reading the data from all the tables in the public schema, and the finance.customers table.
- Creating the main.finance_customers table in DuckLake if it doesn’t exist, as well as all respective tables from the source public schema.
- Writing the data into Parquet files in the ducklake_data directory.
- Updating the DuckLake catalog (my_catalog.db) with the new table information.
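To sanity-check the load, you can query the lake directly from the DuckDB CLI using the ducklake extension. This is only a sketch; the attach options and extension commands may differ slightly depending on your DuckDB version, so check the DuckLake documentation:

duckdb -c "
INSTALL ducklake;
LOAD ducklake;
ATTACH 'ducklake:my_catalog.db' AS lake (DATA_PATH './ducklake_data');
SELECT count(*) FROM lake.main.customers;
"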
Incremental Loads
One of the powerful features of Sling is its ability to handle incremental loads easily. This is crucial for keeping your data lake up-to-date without having to re-extract all the data every time.
To perform an incremental load, you need a key column in your source table that indicates the order of records, such as a timestamp or an auto-incrementing ID. Let’s assume our customers table has an updated_at column.
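For a quick one-off run, the incremental settings can also be passed directly to sling run. The flags below are a sketch; verify the exact names with sling run --help for your version:

sling run --src-conn POSTGRES_CONN --src-stream public.customers \
  --tgt-conn MY_DUCKLAKE --tgt-object main.customers \
  --mode incremental --primary-key customer_id --update-key updated_at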
For ongoing loads, we can use a replication configuration in Sling to manage the state of our incremental loads automatically. Here’s how you would structure the replication file:
# replication.yaml
source: POSTGRES_CONN
target: MY_DUCKLAKE

defaults:
  mode: incremental
  primary_key: [customer_id]
  update_key: updated_at

streams:
  public.customers:
    object: main.customers
You can then run this replication with:
sling run -r replication.yaml
The incremental mode will merge new or updated records into the target table, using the primary_key to avoid duplicates. Sling will automatically track the last updated_at value it processed and only fetch newer records on subsequent runs.
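Because the state is tracked for you, keeping the lake in sync can be as simple as scheduling the same command. For example, a cron entry (the project path is a placeholder) that runs the replication every hour:

# run the incremental replication at the top of every hour
0 * * * * cd /path/to/project && sling run -r replication.yaml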
Conclusion
DuckLake offers a compelling solution for building modern, transactional data lakes, and Sling makes it incredibly simple to populate it from any database. With just a few commands, you can perform full extracts or set up robust incremental pipelines to keep your DuckLake synchronized with your source systems.
To learn more about what you can do with Sling, check out the official documentation. Happy slinging!