PostgreSQL to DuckLake: Load Postgres Data with Sling

Slinger avatar
Slinger
Cover for PostgreSQL to DuckLake: Load Postgres Data with Sling

Introduction

DuckLake gives you a transactional data lake — ACID semantics, time-travel queries, Parquet-on-S3 — without a JVM or a Spark cluster. The one thing it does not handle is getting your production data into it. That is where Sling comes in.

Sling is a single Go binary that connects to PostgreSQL (and 40+ other sources) and writes directly into DuckLake. One command to do a full load. One YAML file to run nightly incremental syncs. No additional infrastructure.

Prerequisites

  • Sling CLI installed: curl -fsSL https://slingdata.io/install.sh | bash
  • A running PostgreSQL database
  • A DuckLake catalog database (local DuckDB file works; Postgres and SQLite also supported)
  • Object storage for data files (local directory, S3, GCS, or Azure Blob)

Configure the connections

Sling reads connections from environment variables or from ~/.sling/env.yaml. For a quick start, environment variables are simplest.

PostgreSQL source:

export POSTGRES_CONN="postgres://user:pass@localhost:5432/mydb"

Or as a named connection with more options:

sling conns set MY_POSTGRES type=postgres host=localhost port=5432 \
  user=myuser password=mypass database=mydb

DuckLake target:

sling conns set MY_DUCKLAKE type=ducklake \
  catalog_type=duckdb \
  catalog_conn_string=./catalog.db \
  data_path=./ducklake_data

For an S3-backed DuckLake replace data_path with your bucket path:

sling conns set MY_DUCKLAKE type=ducklake \
  catalog_type=postgres \
  catalog_conn_string="postgres://user:pass@catalog-host:5432/ducklake_meta" \
  data_path=s3://my-bucket/ducklake/

Verify both connections:

sling conns list
sling conns test MY_POSTGRES
sling conns test MY_DUCKLAKE

Full load — one table

Extract the public.orders table from Postgres and load it into DuckLake:

sling run \
  --src-conn MY_POSTGRES \
  --src-stream public.orders \
  --tgt-conn MY_DUCKLAKE \
  --tgt-object main.orders

Sling infers the schema, creates the DuckLake table, writes the Parquet files, and updates the catalog. The whole operation is a single process — no staging bucket, no intermediate files to clean up.

Incremental loads — keeping DuckLake in sync

For ongoing pipelines, incremental mode fetches only rows that changed since the last run. You need a monotonically increasing column — a timestamp (updated_at, created_at) or an auto-increment ID.

Create a replication file:

# postgres_to_ducklake.yaml
source: MY_POSTGRES
target: MY_DUCKLAKE

defaults:
  mode: incremental
  primary_key: [id]
  update_key: updated_at
  object: main.{stream_table}

streams:
  public.customers:
  public.orders:
  public.line_items:
    update_key: created_at   # override for tables without updated_at

Run it:

sling run -r postgres_to_ducklake.yaml

Sling persists the high-water mark between runs automatically. Run the same command again and it will fetch only rows newer than the previous batch, then merge them on primary_key. No cron wrapper needed — just schedule sling run -r postgres_to_ducklake.yaml with whatever scheduler you already use (cron, Airflow, Dagster, systemd).

Replicate an entire schema

To mirror every table in the public schema:

source: MY_POSTGRES
target: MY_DUCKLAKE

defaults:
  mode: full-refresh
  object: main.{stream_table}

streams:
  public.*:

This is useful for bootstrapping a DuckLake from an existing Postgres database. Switch mode to incremental for subsequent runs once the initial load is complete.

Querying the result in DuckDB

Once data is in DuckLake, open a DuckDB session to query it:

INSTALL ducklake;
LOAD ducklake;
ATTACH 'ducklake:./catalog.db' AS lake (DATA_PATH './ducklake_data/');

-- Time-travel: compare current vs two snapshots ago
SELECT * FROM lake.main.orders AT (VERSION => 1);
SELECT * FROM lake.main.orders;  -- current

Sling handles the write side. DuckDB handles the query side. Neither needs to know about the other.

Why DuckLake + Sling

Both tools share the same design philosophy: do one thing well, require no cluster, and stay out of your way.

DuckLake’s catalog is a relational database you already know how to operate. Its data format is Parquet you already know how to store. Sling’s configuration is a YAML file you can read in ten seconds. The two tools compose cleanly because neither imposes a runtime dependency on the other.

For teams who want a lakehouse without adopting a new infrastructure stack, this combination gets you there with two binaries and a YAML file.

Next steps