Introduction
In the evolving landscape of data engineering, DuckLake is emerging as a powerful solution for building data lakes with ACID transactions, versioning, and a flexible catalog backend. It combines the speed and efficiency of DuckDB with the scalability of cloud storage, making it an attractive choice for modern data platforms.
A common requirement is to populate a data lake by extracting data from various transactional or analytical databases. This is where Sling comes in, offering a simple and powerful command-line interface (CLI) to move data between different sources and destinations.
In this article, we’ll walk through how to use Sling to extract data from a PostgreSQL database and load it into DuckLake. The same principles can be applied to other databases that Sling supports, such as MySQL, SQL Server, Oracle, and more.
What is DuckLake?
DuckLake is a data lake format that brings the power of DuckDB to a data lake architecture. It provides a transactional layer over your data files (like Parquet) stored in object storage (e.g., AWS S3, Google Cloud Storage, Azure Blob Storage, or local files). It uses a catalog database (like DuckDB, SQLite, PostgreSQL, or MySQL) to manage metadata, schemas, and versions.
This setup allows you to query your data lake using standard SQL with the performance benefits of DuckDB, while ensuring data consistency and reliability.
Setting up the Environment
Before we begin, make sure you have Sling CLI installed.
Configuring the Connections
We need to configure two connections in Sling: one for our source database (PostgreSQL) and one for our target (DuckLake).
1. Source Database: PostgreSQL
Let’s set up a connection to a PostgreSQL database. You can do this by setting an environment variable or by using the sling conns command. See here for more details.
export POSTGRES_CONN="postgres://user:pass@host:5432/dbname"
2. Target: DuckLake
For DuckLake, we need to specify the catalog type, the connection string for the catalog, and the path where the data will be stored. For this example, we’ll use a local DuckDB file as our catalog and a local directory for our data. See here for more details.
Here’s how to set up the DuckLake connection using sling conns set:
sling conns set MY_DUCKLAKE type=ducklake \
catalog_type=duckdb \
catalog_conn_string=my_catalog.db \
data_path=./ducklake_data
This command creates a DuckLake connection named MY_DUCKLAKE that uses a local DuckDB file my_catalog.db for the catalog and stores data in the ./ducklake_data directory.
You can verify that your connections are set up correctly by running:
sling conns list
Extracting Data from PostgreSQL to DuckLake
With our connections configured, extracting data is straightforward. We can use a simple sling run command.
Let’s say we want to extract the customers table from the public schema in our PostgreSQL database and load it into a table named customers in the main schema of our DuckLake.
sling run --src-conn POSTGRES_CONN --src-stream public.customers \
--tgt-conn MY_DUCKLAKE --tgt-object main.customers
Alternatively, you can use a YAML configuration file for more control:
# extract.yaml
source: POSTGRES_CONN
target: MY_DUCKLAKE
defaults:
object: main.{stream_table}
streams:
# load all tables
public.*:
finance.customers:
object: main.finance_customers
Then run:
sling run -r extract.yaml
That’s it! Sling will handle:
- Reading the data from all the tables in the
publicschema, and thefinance.customerstable. - Creating the
main.finance_customerstable in DuckLake if it doesn’t exist, as well as all respective tables from the sourcepublicschema. - Writing the data into Parquet files in the
ducklake_datadirectory. - Updating the DuckLake catalog (
my_catalog.db) with the new table information.
Incremental Loads
One of the powerful features of Sling is its ability to handle incremental loads easily. This is crucial for keeping your data lake up-to-date without having to re-extract all the data every time.
To perform an incremental load, you need a key column in your source table that indicates the order of records, such as a timestamp or an auto-incrementing ID. Let’s assume our customers table has a updated_at column.
We can use the replication mode in Sling to manage the state of our incremental loads automatically. Here’s how you would structure the command:
# replication.yaml
source: POSTGRES_CONN
target: MY_DUCKLAKE
defaults:
mode: incremental
primary_key: [customer_id]
update_key: updated_at
streams:
public.customers:
object: main.customers
You can then run this replication with:
sling run -r replication.yaml
The incremental mode will merge new or updated records to the target table and ensure there are no duplicates based on the primary_key. Sling will automatically track the last updated_at value it processed and only fetch newer records on subsequent runs.
Frequently asked questions
What is DuckLake? DuckLake is a transactional data lake format where a lightweight catalog database (DuckDB, Postgres, or SQLite) manages table metadata and versions, while the actual data lives as Parquet files on S3, GCS, Azure Blob, or local disk. You get ACID semantics and time-travel queries without a JVM, a cluster, or a separate metadata service.
Which databases can I use as sources for DuckLake with Sling? Any database Sling supports: PostgreSQL, MySQL, SQL Server, Oracle, Snowflake, BigQuery, Redshift, SQLite, DuckDB, ClickHouse, and more. The DuckLake target connection is separate from the source — point Sling at whatever database holds your production data and it writes Parquet into DuckLake.
Does Sling support incremental loads into DuckLake? Yes. Set mode: incremental with an update_key (any monotonically increasing column — a timestamp or an ID works). Sling tracks the high-water mark automatically: each subsequent run fetches only rows newer than the last run, merges them on the primary_key, and writes the delta. No manual state tracking needed.
What catalog backends does DuckLake support? Local DuckDB is the simplest option and works for development and single-machine pipelines. For shared or production setups, Postgres and MySQL catalogs let multiple readers and writers coordinate safely.
How is DuckLake different from Apache Iceberg or Delta Lake? Iceberg and Delta Lake both use file-based metadata (manifests, log files) stored alongside data in object storage. DuckLake uses a real relational database for metadata, which gives it stronger transactional guarantees and simpler tooling at the cost of requiring a catalog DB. If you are already running DuckDB or Postgres and want a lakehouse without a Spark cluster, DuckLake is the lighter path.
Conclusion
DuckLake is a practical way to build a transactional data lake without standing up a cluster. Sling handles the data movement side — full loads, incremental merges, and multi-table YAML configs — leaving DuckLake to do what it does well: versioning and SQL queries over Parquet.
Check the Sling documentation for the full list of source connectors and DuckLake connection options.


