Export PostgreSQL to Cloudflare R2 as Parquet with Sling

Slinger

Introduction

Cloudflare R2 is S3-compatible object storage with no egress fees. The no-egress part is what gets it onto data-platform shortlists. If your warehouse, your notebooks, and a couple of downstream readers all need to pull the same Parquet files, R2 doesn’t meter the bytes on the way out. PostgreSQL is where most application data lives; Parquet on R2 is a sensible place to land the analytics copy.

Sling moves the data. It speaks the S3 API, so R2 is just an S3 connection with a different endpoint. Source is Postgres, target is R2, format is Parquet — and that’s most of the configuration.

This guide replicates a Postgres schema into R2 as Parquet using Sling. The CLI output, file sizes, and row counts below are from an actual run against a Postgres source and a real R2 bucket. Both full-refresh and incremental flows are covered.

Installing Sling

Sling is a single binary. Pick whichever install method fits your environment:

# macOS / Linux
curl -fsSL https://slingdata.io/install.sh | bash

# Windows
irm https://slingdata.io/install.ps1 | iex

# Python
pip install sling

Confirm the install:

sling --version

Installation notes for every platform are in the Sling CLI Getting Started Guide.

Configuring the Postgres Source

Sling reads connection details from ~/.sling/env.yaml or from environment variables; sling conns set writes entries into that file for you. For Postgres you need host, port, database, user, and password.

A read-only user is enough for replication:

CREATE USER sling WITH PASSWORD '<password>';
GRANT CONNECT ON DATABASE mydb TO sling;
GRANT USAGE ON SCHEMA public TO sling;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO sling;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO sling;

Using sling conns set:

sling conns set POSTGRES type=postgres host=host.ip user=sling \
  database=mydb password=mypass port=5432

Or in ~/.sling/env.yaml:

connections:
  POSTGRES:
    type: postgres
    host: host.ip
    user: sling
    password: mypass
    port: 5432
    database: mydb

If your Postgres requires SSL, add sslmode: require in the YAML form (a sketch follows below). Test it:

sling conns test POSTGRES
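
For reference, the SSL variant of the YAML connection is the same block with one extra key. A sketch, assuming the server accepts sslmode: require (other libpq values such as verify-full work the same way):

connections:
  POSTGRES:
    type: postgres
    host: host.ip
    user: sling
    password: mypass
    port: 5432
    database: mydb
    sslmode: require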

The Postgres connection docs cover SSL, IAM auth, and the rest of the options.

Configuring the R2 Target

R2 buckets are S3-compatible, so from Sling’s point of view you create an S3-typed connection and point it at the R2 endpoint. You’ll need three values from the Cloudflare dashboard: an R2 access key id, the matching secret, and the S3 API endpoint for your account (it looks like <accountid>.r2.cloudflarestorage.com).

Using sling conns set:

sling conns set R2 type=s3 \
  bucket=my-bucket \
  endpoint=<accountid>.r2.cloudflarestorage.com \
  access_key_id=<r2_access_key_id> \
  secret_access_key=<r2_secret_access_key>

Or in ~/.sling/env.yaml:

connections:
  R2:
    type: s3
    bucket: my-bucket
    endpoint: <accountid>.r2.cloudflarestorage.com
    access_key_id: <r2_access_key_id>
    secret_access_key: <r2_secret_access_key>

Two things to know about R2 specifically:

  • The endpoint is per-account, not per-bucket. Buckets are addressed by name underneath that endpoint.
  • R2 doesn’t use AWS regions. Don’t set region: on the connection; Sling uses auto by default and that’s correct.

Test it:

sling conns test R2

The S3 connection docs cover R2 (and any other S3-compatible store) under the same connection type.

A Full-Refresh Replication

For this run the Postgres source has three tables in a demo_postgres_r2 schema (a schema sketch follows the list):

  • customers — 5,000 rows
  • orders — 30,000 rows, with an updated_at timestamp
  • events — 60,000 rows, with an occurred_at timestamp
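
If you want to reproduce the run, a minimal sketch of the source schema is below. The key and timestamp columns come from the streams and the readback later in this article; the remaining columns are illustrative, not the exact demo schema:

CREATE SCHEMA IF NOT EXISTS demo_postgres_r2;

CREATE TABLE demo_postgres_r2.customers (
  customer_id    bigserial PRIMARY KEY,
  name           text,
  email          text,
  country        text,            -- nullable in the demo: every seventh row is null
  lifetime_spend numeric(12,2),
  created_at     timestamptz
);

CREATE TABLE demo_postgres_r2.orders (
  order_id    bigserial PRIMARY KEY,
  customer_id bigint,
  amount      numeric(12,2),      -- illustrative
  updated_at  timestamptz         -- incremental update key
);

CREATE TABLE demo_postgres_r2.events (
  event_id    bigserial PRIMARY KEY,
  customer_id bigint,
  event_type  text,               -- illustrative
  occurred_at timestamptz         -- incremental update key
);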

The replication config lives in whatever directory you run Sling from:

# replication.yaml
source: POSTGRES
target: R2

defaults:
  mode: full-refresh
  object: 's3://my-bucket/demo_postgres_r2/{stream_table}/'
  target_options:
    format: parquet
    compression: snappy
    file_max_rows: 100000

streams:
  demo_postgres_r2.customers:
  demo_postgres_r2.orders:
    primary_key: [order_id]
    update_key: updated_at
  demo_postgres_r2.events:
    primary_key: [event_id]
    update_key: occurred_at

A few things worth pointing out:

  • s3://my-bucket/... is the URL scheme even for R2. Sling routes the request to the R2 endpoint on the connection; the path style is just how Sling addresses the bucket.
  • {stream_table} is a runtime variable. Sling substitutes the source table name into the target object, so you don’t repeat yourself per stream.
  • file_max_rows: 100000 caps each Parquet file at 100k rows. With these table sizes everything fits in one file per stream, but the cap is what keeps a 50-million-row table from landing as a single 4 GB blob.
  • primary_key and update_key are set even though the mode is full-refresh. The next section flips to incremental without touching those declarations; only the mode changes.

Run it:

sling run -r replication.yaml

Real output, trimmed for readability:

INF Sling Replication [3 streams] | POSTGRES -> R2

INF [1 / 3] running stream demo_postgres_r2.customers
INF reading from source database
INF writing to target file system (s3)
INF wrote 5000 rows [3,397 r/s] to s3://my-bucket/demo_postgres_r2/customers/
INF execution succeeded

INF [2 / 3] running stream demo_postgres_r2.orders
INF wrote 30000 rows [19,831 r/s] to s3://my-bucket/demo_postgres_r2/orders/
INF execution succeeded

INF [3 / 3] running stream demo_postgres_r2.events
INF wrote 60000 rows [38,155 r/s] to s3://my-bucket/demo_postgres_r2/events/
INF execution succeeded

INF Sling Replication Completed in 6s | POSTGRES -> R2 | 3 Successes | 0 Failures

95,000 rows across three tables, six seconds, snappy-compressed Parquet on R2.
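
While iterating on a single table's options you don't have to re-run everything: the --streams flag limits a run to the streams you name. A sketch (confirm the flag with sling run --help on your version):

sling run -r replication.yaml --streams demo_postgres_r2.orders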

Verification

A recursive listing of the bucket prefix shows what landed:

+---+---------------------------------------------+------+---------+
| # | NAME                                        | TYPE | SIZE    |
+---+---------------------------------------------+------+---------+
| 1 | demo_postgres_r2/customers/data_001.parquet | file | 138 KiB |
| 2 | demo_postgres_r2/events/data_001.parquet    | file | 700 KiB |
| 3 | demo_postgres_r2/orders/data_001.parquet    | file | 796 KiB |
+---+---------------------------------------------+------+---------+

You can run that listing with Sling itself:

sling conns discover R2 --pattern 'demo_postgres_r2/**' --recursive

To verify the contents round-trip, read one of the Parquet files back and dump it to a local CSV:

sling run \
  --src-conn R2 \
  --src-stream "s3://my-bucket/demo_postgres_r2/customers/" \
  --tgt-object "file:///tmp/customers-readback.csv"

The first few lines:

customer_id,name,email,country,lifetime_spend,created_at
1,Customer 1,[email protected],UK,338.260000,2024-01-01 01:00:00 -03
2,Customer 2,[email protected],DE,319.570000,2024-01-01 02:00:00 -03
3,Customer 3,[email protected],FR,600.760000,2024-01-01 03:00:00 -03
4,Customer 4,[email protected],JP,2174.770000,2024-01-01 04:00:00 -03
5,Customer 5,[email protected],BR,2080.100000,2024-01-01 05:00:00 -03

Numeric, varchar, and timestamp columns survive the trip cleanly. The nullable country column (every seventh source row is null) lands as a real null in the Parquet schema, not as the literal string "NULL".

Switching to Incremental

Re-uploading every event row each night is wasteful: redundant at 60,000 rows, untenable at 60 million. Sling’s incremental mode reads only the rows newer than the last checkpoint.

For database targets Sling keeps the checkpoint in a state table on the target itself. File targets can’t do that, so Sling needs an external state store. You point it at one with the SLING_STATE environment variable. The simplest form is a local directory:

export SLING_STATE="LOCAL//var/lib/sling/state"

For production runs, point SLING_STATE at a database connection (POSTGRES//sling_state) so the checkpoint survives across machines. Same shape, different prefix.
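
For example, reusing the Postgres connection defined earlier as the state store:

export SLING_STATE="POSTGRES//sling_state"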

Incremental output to files also requires partitioned object paths so Sling can update a window without rewriting the whole dataset. The {part_year_month} template tells Sling to partition by month based on the update key:

# replication-incremental.yaml
source: POSTGRES
target: R2

defaults:
  mode: incremental
  target_options:
    format: parquet
    compression: snappy

streams:
  demo_postgres_r2.orders:
    object: 's3://my-bucket/demo_postgres_r2/{stream_table}/{part_year_month}/'
    primary_key: [order_id]
    update_key: updated_at

  demo_postgres_r2.events:
    object: 's3://my-bucket/demo_postgres_r2/{stream_table}/{part_year_month}/'
    primary_key: [event_id]
    update_key: occurred_at

The first run of this config establishes the checkpoint and writes the full dataset partitioned by month:

sling run -r replication-incremental.yaml
INF [1 / 2] running stream demo_postgres_r2.orders
INF wrote 30000 rows [14,357 r/s] to s3://my-bucket/demo_postgres_r2/orders/
INF execution succeeded

INF [2 / 2] running stream demo_postgres_r2.events
INF wrote 60000 rows [31,855 r/s] to s3://my-bucket/demo_postgres_r2/events/
INF execution succeeded

INF Sling Replication Completed in 5s | POSTGRES -> R2 | 2 Successes | 0 Failures

The bucket layout now uses Hive-style partitions named after the update key:

demo_postgres_r2/orders/updated_at_year_month=2025-01/data_0.parquet
demo_postgres_r2/orders/updated_at_year_month=2025-02/data_0.parquet
demo_postgres_r2/events/occurred_at_year_month=2025-01/data_0.parquet

Insert 1,000 new orders and 2,500 new events on the source, a stand-in for a day of fresh data. A sketch of the seeding, using the illustrative columns from the schema sketch above (adjust to your own tables):
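
INSERT INTO demo_postgres_r2.orders (customer_id, amount, updated_at)
SELECT (random() * 5000)::int + 1, round((random() * 500)::numeric, 2), now()
FROM generate_series(1, 1000);

INSERT INTO demo_postgres_r2.events (customer_id, event_type, occurred_at)
SELECT (random() * 5000)::int + 1, 'page_view', now()
FROM generate_series(1, 2500);

Then run the same replication again: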

sling run -r replication-incremental.yaml
INF [1 / 2] running stream demo_postgres_r2.orders
INF wrote 1241 rows [828 r/s] to s3://my-bucket/demo_postgres_r2/orders/
INF execution succeeded

INF [2 / 2] running stream demo_postgres_r2.events
INF wrote 62500 rows [35,883 r/s] to s3://my-bucket/demo_postgres_r2/events/
INF execution succeeded

Two things to notice:

  • Sling reads the checkpoint from SLING_STATE and pulls only rows whose update_key is newer.
  • The “wrote” counts include any rows in partitions Sling re-wrote, not just the new arrivals. File-incremental mode rewrites the partitions touched by the delta window, so when 2,500 new events land in the current month, Sling writes the full current-month partition (and any earlier partition that overlaps the rewrite window). Older partitions are untouched. The listing confirms it:
demo_postgres_r2/events/occurred_at_year_month=2026-05/data_0.parquet  (36 KiB, 2500 rows)
demo_postgres_r2/orders/updated_at_year_month=2026-05/data_0.parquet   (28 KiB, 1000 rows)

The newest partitions hold exactly the new rows. Earlier partitions stay where they were.
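
To watch the partitions accumulate across runs, reuse the discover listing from earlier, scoped to a single stream:

sling conns discover R2 --pattern 'demo_postgres_r2/events/**' --recursive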

Common Tweaks

A few options worth reaching for once the basics are in place:

  • Compression. Snappy is the default and a fine baseline. For cold storage, set compression: zstd or compression: gzip on target_options. Smaller files, slightly slower writes.
  • File sizing. file_max_rows caps rows per file and is the main lever for Parquet output; file_max_bytes caps bytes per file but applies to row-based formats like CSV and JSON rather than Parquet. Pick whichever maps better to your downstream reader’s chunking. For Snowflake-on-R2 or Athena, ~128–256 MB per file is the usual sweet spot.
  • Partitioning by date. Use {part_year}/{part_month}/{part_day}/ in the object path to lay out files in date-partitioned directories your downstream engine already understands.
  • Filter at the source. Use a sql: block per stream to project columns or filter rows before they leave Postgres. Cheaper than dragging unused columns to R2, and it keeps Parquet row groups tight. A combined sketch of these tweaks follows this list.
  • Bucket policy. R2 buckets default to private. If you’re publishing the Parquet files for outside consumers, enable public access on the bucket or put a custom domain in front of it; Sling doesn’t need anything different on its side.
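
A sketch combining a few of these: zstd compression, day-level date partitioning, and a per-stream sql: filter. The column names in the query are illustrative rather than the demo schema, and the stream options are worth checking against the replication docs for your Sling version:

# replication-tuned.yaml (illustrative)
source: POSTGRES
target: R2

defaults:
  mode: full-refresh
  target_options:
    format: parquet
    compression: zstd
    file_max_rows: 500000

streams:
  demo_postgres_r2.events:
    object: 's3://my-bucket/demo_postgres_r2/{stream_table}/{part_year}/{part_month}/{part_day}/'
    update_key: occurred_at
    sql: |
      select event_id, customer_id, event_type, occurred_at
      from demo_postgres_r2.events
      where occurred_at >= '2025-01-01'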

Where to Go Next

The same pattern works for any of Sling’s 30+ database sources into R2: MySQL, SQL Server, Snowflake, BigQuery, and the rest. Swap the source connection and leave the target alone.

For different file shapes on the same target, see Postgres to S3 as JSON and Postgres to S3 as CSV; it is the same flow with a different format: value. For the AWS-S3 equivalent of this article, see Postgres to S3 as Parquet. For team workflows with scheduling and alerting on top of the same CLI, look at the Sling Platform.

Questions go to Discord or GitHub Issues.