Load PostgreSQL, MySQL, and SQL Server data into DuckLake with Sling

Slinger avatar
Slinger
Cover for Load PostgreSQL, MySQL, and SQL Server data into DuckLake with Sling

Introduction

In the evolving landscape of data engineering, DuckLake is emerging as a powerful solution for building data lakes with ACID transactions, versioning, and a flexible catalog backend. It combines the speed and efficiency of DuckDB with the scalability of cloud storage, making it an attractive choice for modern data platforms.

A common requirement is to populate a data lake by extracting data from various transactional or analytical databases. This is where Sling comes in, offering a simple and powerful command-line interface (CLI) to move data between different sources and destinations.

In this article, we’ll walk through how to use Sling to extract data from a PostgreSQL database and load it into DuckLake. The same principles can be applied to other databases that Sling supports, such as MySQL, SQL Server, Oracle, and more.

What is DuckLake?

DuckLake is a data lake format that brings the power of DuckDB to a data lake architecture. It provides a transactional layer over your data files (like Parquet) stored in object storage (e.g., AWS S3, Google Cloud Storage, Azure Blob Storage, or local files). It uses a catalog database (like DuckDB, SQLite, PostgreSQL, or MySQL) to manage metadata, schemas, and versions.

This setup allows you to query your data lake using standard SQL with the performance benefits of DuckDB, while ensuring data consistency and reliability.

Setting up the Environment

Before we begin, make sure you have Sling CLI installed.

Configuring the Connections

We need to configure two connections in Sling: one for our source database (PostgreSQL) and one for our target (DuckLake).

1. Source Database: PostgreSQL

Let’s set up a connection to a PostgreSQL database. You can do this by setting an environment variable or by using the sling conns command. See here for more details.

export POSTGRES_CONN="postgres://user:pass@host:5432/dbname"

2. Target: DuckLake

For DuckLake, we need to specify the catalog type, the connection string for the catalog, and the path where the data will be stored. For this example, we’ll use a local DuckDB file as our catalog and a local directory for our data. See here for more details.

Here’s how to set up the DuckLake connection using sling conns set:

sling conns set MY_DUCKLAKE type=ducklake \
  catalog_type=duckdb \
  catalog_conn_string=my_catalog.db \
  data_path=./ducklake_data

This command creates a DuckLake connection named MY_DUCKLAKE that uses a local DuckDB file my_catalog.db for the catalog and stores data in the ./ducklake_data directory.

You can verify that your connections are set up correctly by running:

sling conns list

Extracting Data from PostgreSQL to DuckLake

With our connections configured, extracting data is straightforward. We can use a simple sling run command.

Let’s say we want to extract the customers table from the public schema in our PostgreSQL database and load it into a table named customers in the main schema of our DuckLake.

sling run --src-conn POSTGRES_CONN --src-stream public.customers \
          --tgt-conn MY_DUCKLAKE --tgt-object main.customers

Alternatively, you can use a YAML configuration file for more control:

# extract.yaml
source: POSTGRES_CONN
target: MY_DUCKLAKE

defaults:
  object: main.{stream_table}

streams:
  # load all tables
  public.*:
  
  finance.customers:
    object: main.finance_customers

Then run:

sling run -r extract.yaml

That’s it! Sling will handle:

  • Reading the data from all the tables in the public schema, and the finance.customers table.
  • Creating the main.finance_customers table in DuckLake if it doesn’t exist, as well as all respective tables from the source public schema.
  • Writing the data into Parquet files in the ducklake_data directory.
  • Updating the DuckLake catalog (my_catalog.db) with the new table information.

Incremental Loads

One of the powerful features of Sling is its ability to handle incremental loads easily. This is crucial for keeping your data lake up-to-date without having to re-extract all the data every time.

To perform an incremental load, you need a key column in your source table that indicates the order of records, such as a timestamp or an auto-incrementing ID. Let’s assume our customers table has a updated_at column.

We can use the replication mode in Sling to manage the state of our incremental loads automatically. Here’s how you would structure the command:

# replication.yaml
source: POSTGRES_CONN
target: MY_DUCKLAKE

defaults:
  mode: incremental
  primary_key: [customer_id]
  update_key: updated_at

streams:
  public.customers:
    object: main.customers

You can then run this replication with:

sling run -r replication.yaml

The incremental mode will merge new or updated records to the target table and ensure there are no duplicates based on the primary_key. Sling will automatically track the last updated_at value it processed and only fetch newer records on subsequent runs.

Frequently asked questions

What is DuckLake? DuckLake is a transactional data lake format where a lightweight catalog database (DuckDB, Postgres, or SQLite) manages table metadata and versions, while the actual data lives as Parquet files on S3, GCS, Azure Blob, or local disk. You get ACID semantics and time-travel queries without a JVM, a cluster, or a separate metadata service.

Which databases can I use as sources for DuckLake with Sling? Any database Sling supports: PostgreSQL, MySQL, SQL Server, Oracle, Snowflake, BigQuery, Redshift, SQLite, DuckDB, ClickHouse, and more. The DuckLake target connection is separate from the source — point Sling at whatever database holds your production data and it writes Parquet into DuckLake.

Does Sling support incremental loads into DuckLake? Yes. Set mode: incremental with an update_key (any monotonically increasing column — a timestamp or an ID works). Sling tracks the high-water mark automatically: each subsequent run fetches only rows newer than the last run, merges them on the primary_key, and writes the delta. No manual state tracking needed.

What catalog backends does DuckLake support? Local DuckDB is the simplest option and works for development and single-machine pipelines. For shared or production setups, Postgres and MySQL catalogs let multiple readers and writers coordinate safely.

How is DuckLake different from Apache Iceberg or Delta Lake? Iceberg and Delta Lake both use file-based metadata (manifests, log files) stored alongside data in object storage. DuckLake uses a real relational database for metadata, which gives it stronger transactional guarantees and simpler tooling at the cost of requiring a catalog DB. If you are already running DuckDB or Postgres and want a lakehouse without a Spark cluster, DuckLake is the lighter path.

Conclusion

DuckLake is a practical way to build a transactional data lake without standing up a cluster. Sling handles the data movement side — full loads, incremental merges, and multi-table YAML configs — leaving DuckLake to do what it does well: versioning and SQL queries over Parquet.

Check the Sling documentation for the full list of source connectors and DuckLake connection options.