Replicate MySQL to ClickHouse with Sling

Slinger avatar
Slinger
Cover for Replicate MySQL to ClickHouse with Sling

Introduction

ClickHouse is a columnar OLAP database. It runs aggregate queries across billions of rows in seconds. MySQL is what most apps run on for transactional reads and writes. Different jobs, different storage shapes, which is why people end up running them side by side: MySQL for the app, ClickHouse for analytics on top of the app’s data.

The piece in the middle, the bit that copies tables from MySQL into ClickHouse and keeps them current, is what Sling does.

This guide replicates a MySQL schema into ClickHouse with Sling, in both full-refresh and incremental modes. The CLI output, row counts, and timings below all come from an actual run against a Docker MySQL on the source side and a self-hosted ClickHouse 25.4 on the target side. The same configuration works against ClickHouse Cloud; only the connection URL changes.

Installing Sling

Sling is a single binary. Pick whichever install method fits your environment:

# macOS / Linux
curl -fsSL https://slingdata.io/install.sh | bash

# Windows
irm https://slingdata.io/install.ps1 | iex

# Python
pip install sling

Confirm the install:

sling --version

Installation notes for every platform are in the Sling CLI Getting Started Guide.

Configuring the MySQL Source

Sling reads connection details from ~/.sling/env.yaml, environment variables, or sling conns set. For MySQL you need host, port, database, user, and password.

A read-only Sling user is the right shape for replication:

CREATE USER 'sling'@'%' IDENTIFIED BY '<password>';
GRANT SELECT ON <source_schema>.* TO 'sling'@'%';

Using sling conns set:

sling conns set MYSQL_SOURCE type=mysql host=host.ip user=sling \
  database=mydb password=mypass port=3306

Or in ~/.sling/env.yaml:

connections:
  MYSQL_SOURCE:
    type: mysql
    host: host.ip
    user: sling
    password: mypass
    port: 3306
    database: mydb

If your MySQL requires TLS, append ?tls=skip-verify to the URL form, or set tls: skip-verify in the YAML. Test it:

sling conns test MYSQL_SOURCE

The MySQL connection docs cover SSL, IAM auth, and the rest of the options.

Configuring the ClickHouse Target

ClickHouse speaks two protocols: native (port 9000) and HTTP (port 8123 / 8443 with TLS). Sling supports both. For self-hosted clusters the native protocol is usually the fastest path; for ClickHouse Cloud, the HTTPS endpoint is the supported one.

Self-hosted, native protocol:

sling conns set CLICKHOUSE type=clickhouse host=host.ip user=default \
  password=mypass port=9000 database=default

ClickHouse Cloud over HTTPS:

sling conns set CLICKHOUSE \
  url="https://default:[email protected]:8443/default"

Or in ~/.sling/env.yaml:

connections:
  CLICKHOUSE:
    type: clickhouse
    host: host.ip
    user: default
    password: mypass
    port: 9000
    database: default

Test it:

sling conns test CLICKHOUSE

The ClickHouse connection docs list every option, including the HTTP URL form and the export_stream_format setting for tuning the staging file format.

A Full-Refresh Replication

For this run the MySQL source has three tables in a demo_mysql_clickhouse database:

  • customers — 5,000 rows
  • orders — 30,000 rows, with an updated_at timestamp
  • events — 60,000 rows, with an occurred_at timestamp

The replication file lives next to wherever you run Sling from:

# replication.yaml
source: MYSQL_SOURCE
target: CLICKHOUSE

defaults:
  mode: full-refresh
  object: demo_mysql_clickhouse.{stream_table}

streams:
  demo_mysql_clickhouse.customers:
    primary_key: [customer_id]

  demo_mysql_clickhouse.orders:
    primary_key: [order_id]
    update_key: updated_at

  demo_mysql_clickhouse.events:
    primary_key: [event_id]
    update_key: occurred_at

A few things worth pointing out:

  • object: demo_mysql_clickhouse.{stream_table} is a runtime variable. Sling substitutes the source table name into the target object, so you don’t repeat yourself per stream.
  • primary_key and update_key are set even though the mode is full-refresh. The next section flips to incremental without touching those declarations; only the mode changes.
  • The target database (demo_mysql_clickhouse on ClickHouse) gets created automatically by Sling on the first run. No manual CREATE DATABASE needed on the target side.

Run it:

sling run -r replication.yaml

Real output, trimmed for readability:

INF Sling Replication [3 streams] | MYSQL_SOURCE -> CLICKHOUSE

INF [1 / 3] running stream demo_mysql_clickhouse.customers
INF reading from source database
INF writing to target database [mode: full-refresh]
INF created table `demo_mysql_clickhouse`.`customers`
INF inserted 5000 rows into `demo_mysql_clickhouse`.`customers` in 0 secs [8,853 r/s] [396 kB]
INF execution succeeded

INF [2 / 3] running stream demo_mysql_clickhouse.orders
INF created table `demo_mysql_clickhouse`.`orders`
INF inserted 30000 rows into `demo_mysql_clickhouse`.`orders` in 1 secs [29,381 r/s] [2.8 MB]
INF execution succeeded

INF [3 / 3] running stream demo_mysql_clickhouse.events
INF created table `demo_mysql_clickhouse`.`events`
INF inserted 60000 rows into `demo_mysql_clickhouse`.`events` in 0 secs [81,559 r/s] [3.2 MB]
INF execution succeeded

INF Sling Replication Completed in 4s | MYSQL_SOURCE -> CLICKHOUSE | 3 Successes | 0 Failures

95,000 rows across three tables, end to end, in 4 seconds. The _tmp tables that show up in the full log are Sling’s staging step before it swaps the data into the final target. They get cleaned up automatically.

When Sling creates the table, it asks for MergeTree with the primary key columns as the sorting key. That’s a fine baseline for analytical queries. The “Common Tweaks” section below covers how to override it when you need partitioning, replication, or a different engine.

Verification

A count() from ClickHouse right after the run:

SELECT 'customers' AS t, count() AS c FROM demo_mysql_clickhouse.customers
UNION ALL SELECT 'orders',    count()   FROM demo_mysql_clickhouse.orders
UNION ALL SELECT 'events',    count()   FROM demo_mysql_clickhouse.events;
+-----------+-------+
| T         | C     |
+-----------+-------+
| customers |  5000 |
| orders    | 30000 |
| events    | 60000 |
+-----------+-------+

A small sample to confirm the data made the trip with types intact:

SELECT event_id, customer_id, event_type, region, occurred_at
FROM demo_mysql_clickhouse.events
ORDER BY event_id LIMIT 5;
+----------+-------------+------------+--------+-------------------------------+
| EVENT_ID | CUSTOMER_ID | EVENT_TYPE | REGION | OCCURRED_AT                   |
+----------+-------------+------------+--------+-------------------------------+
|        1 |           2 | signup     | us-2   | 2025-01-01 00:00:01 +0000 UTC |
|        2 |           3 | purchase   | us-3   | 2025-01-01 00:00:02 +0000 UTC |
|        3 |           4 | logout     | us-4   | 2025-01-01 00:00:03 +0000 UTC |
|        4 |           5 | page_view  | us-1   | 2025-01-01 00:00:04 +0000 UTC |
|        5 |           6 | click      | us-2   | 2025-01-01 00:00:05 +0000 UTC |
+----------+-------------+------------+--------+-------------------------------+

Numeric, varchar, and timestamp columns round-tripped cleanly. The nullable region column (every seventh row in the source is null) lands as ClickHouse Nullable(String) and preserves nulls as nulls, not as the literal string "NULL".

Switching to Incremental

Full-refreshing a 60,000-row event table every day is fine. Full-refreshing a 600-million-row event table every day is not. Sling’s incremental mode reads only the rows newer than the highest update_key already in the target.

Drop customers from the streams (it changes slowly enough to keep on full-refresh in a separate run, or rebuild weekly) and switch the mode:

# replication-incremental.yaml
source: MYSQL_SOURCE
target: CLICKHOUSE

defaults:
  mode: incremental
  object: demo_mysql_clickhouse.{stream_table}

streams:
  demo_mysql_clickhouse.orders:
    primary_key: [order_id]
    update_key: updated_at

  demo_mysql_clickhouse.events:
    primary_key: [event_id]
    update_key: occurred_at

Insert 1,000 new orders and 2,500 new events on the source (a stand-in for a day of fresh data), then run again:

sling run -r replication-incremental.yaml
INF Sling Replication [2 streams] | MYSQL_SOURCE -> CLICKHOUSE

INF [1 / 2] running stream demo_mysql_clickhouse.orders
INF getting checkpoint value (updated_at)
INF writing to target database [mode: incremental]
INF inserted 1000 rows into `demo_mysql_clickhouse`.`orders` in 0 secs [1,926 r/s] [93 kB]
INF execution succeeded

INF [2 / 2] running stream demo_mysql_clickhouse.events
INF getting checkpoint value (occurred_at)
INF writing to target database [mode: incremental]
INF inserted 2500 rows into `demo_mysql_clickhouse`.`events` in 0 secs [4,040 r/s] [134 kB]
INF execution succeeded

INF Sling Replication Completed in 2s | MYSQL_SOURCE -> CLICKHOUSE | 2 Successes | 0 Failures

The getting checkpoint value line is where Sling looks at the target, finds the largest updated_at already present, and uses that as the lower bound on the source query. Only the new rows come across:

SELECT 'orders' AS t, count() c FROM demo_mysql_clickhouse.orders
UNION ALL SELECT 'events', count()   FROM demo_mysql_clickhouse.events;
+--------+-------+
| T      | C     |
+--------+-------+
| orders | 31000 |
| events | 62500 |
+--------+-------+

Orders went from 30,000 to 31,000. Events went from 60,000 to 62,500. Matches what was inserted on the source.

ClickHouse’s MergeTree family is append-friendly. In incremental mode Sling inserts the new rows directly into the main table without rewriting partitions. If you also need updates (a row’s updated_at changes and you want the existing target row replaced rather than duplicated), keep mode: incremental and make sure primary_key is set. Sling will use a ReplacingMergeTree-style upsert path against that key. The replication modes docs cover the trade-offs.

Common Tweaks

A few options worth reaching for once the basics are in place:

  • Pick a table engine. ClickHouse’s default MergeTree is a fine baseline, but for high-write or replicated clusters you’ll want ReplicatedMergeTree, partitioning by month, and a TTL. Set target_options.table_ddl per stream with the full CREATE TABLE you want; Sling will use it instead of generating its own. Example: engine = MergeTree() ORDER BY (customer_id, occurred_at) PARTITION BY toYYYYMM(occurred_at).
  • Add new columns automatically. When the source schema changes, set target_options: { add_new_columns: true } so Sling alters the ClickHouse table on the next run. Without it, new source columns get dropped at the boundary.
  • Tune the staging format. Sling stages data as a file before bulk-loading into ClickHouse. The default is CSVWithNames, which is robust but verbose. For wide rows or large text values, set export_stream_format: Parquet on the ClickHouse connection. Usually faster and more compact on the wire.
  • Filter at the source. Use a custom sql: block in a stream to project columns or filter rows before they leave MySQL. Cheaper than dragging unused columns to ClickHouse, and it keeps row payloads small for the network hop.

Where to Go Next

The same replication pattern works for any of Sling’s 30+ database sources into ClickHouse: PostgreSQL, SQL Server, Snowflake, BigQuery, and the rest. Swap the source connection and leave the target alone. For the equivalent flow from a Postgres source, see PostgreSQL to ClickHouse.

If your downstream is more cloud-warehouse than columnar engine, MySQL to MotherDuck covers the same setup with DuckDB-on-the-cloud as the target. For team workflows with scheduling and alerting on top of the same CLI, look at the Sling Platform.

Questions go to Discord or GitHub Issues.