Loading Data into SQL Server with Sling and ADBC

Introduction

Loading large tables into SQL Server is one of the places ETL tools quietly fall over. The TDS protocol works, but row-by-row inserts are slow at scale, and bcp requires you to leave Python and write to disk first. ADBC — Arrow Database Connectivity — was designed to fix that: a single columnar protocol with native bulk-load support across drivers.

Sling supports SQL Server through both the standard TDS path and the ADBC path. This article walks through the ADBC path end to end with a working lab:

  • A SQL Server 2022 container.
  • A CLI pipeline that loads a CSV into SQL Server via ADBC.
  • A Python script that generates 100k rows as multiple parquet files, pushes them into a SeaweedFS S3 bucket, lets Sling replicate the whole prefix into SQL Server via ADBC, and reads the result back through stream_arrow() for native Arrow types.

All output below is from an actual run. The full source lives in the Sling CLI repository at tests/pipelines/adbc/sqlserver. The connector docs are at docs.slingdata.io/connections/database-connections/sqlserver#adbc.

Why ADBC for SQL Server

Three concrete reasons to use the ADBC path over the default TDS driver:

  • Bulk load is built in. The Microsoft ADBC driver pushes rows as Arrow columnar batches. No bcp shell-out, no temp files.
  • Type fidelity from Arrow sources. When the source is already Arrow (DuckDB, Parquet, Polars), ADBC carries types through without a round-trip to SQL strings.
  • One driver, one configuration. The same dbc install <name> mechanism installs drivers for SQL Server, Postgres, MySQL, DuckDB, Snowflake, and Trino. Sling auto-detects the installed drivers; you flip a single flag.

The trade-off: the ADBC driver manager is an external dependency you have to install once. After that, it’s transparent.

Setting Up the Lab

Sling needs four things on the host: the binary itself, the dbc driver manager, a SQL Server instance, and (for the Python part) an S3 endpoint.

Install Sling

# macOS / Linux
curl -fsSL https://slingdata.io/install.sh | bash

# Python (for the Python example below)
pip install 'sling[arrow]'

The [arrow] extra is needed for stream_arrow() in the Python example.

Install the dbc driver manager

The dbc CLI ships ADBC drivers. Install it once, then ask it for the SQL Server driver:

curl -LsSf https://dbc.columnar.tech/install.sh | sh
dbc install mssql

This drops libadbc_driver_mssql.* into ~/.dbc/drivers/, where Sling can find it. Sling auto-resolves the path at connection time.
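
To confirm the driver landed there, a quick check from Python (path assumed from the default dbc install above):

# Sketch: list installed ADBC drivers under the default dbc location.
# Assumes the ~/.dbc/drivers/ path mentioned above; adjust if you moved it.
from pathlib import Path

for lib in sorted((Path.home() / ".dbc" / "drivers").rglob("libadbc_driver_*")):
    print(lib)  # e.g. libadbc_driver_mssql.so / .dylib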

Start SQL Server and SeaweedFS

For the lab, one docker-compose.yaml brings up both: SQL Server 2022 on port 51444, and SeaweedFS exposing an S3-compatible API on port 18333.

# docker-compose.yaml
services:
  mssql:
    image: mcr.microsoft.com/mssql/server:2022-latest
    container_name: sling-adbc-mssql
    ports:
      - "51444:1433"
    environment:
      ACCEPT_EULA: "Y"
      MSSQL_SA_PASSWORD: "AdbcPipeline123!"
      MSSQL_PID: "Developer"
    healthcheck:
      test: ["CMD-SHELL", "/opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P 'AdbcPipeline123!' -C -Q 'select 1'"]
      interval: 5s
      retries: 30

  seaweedfs:
    image: chrislusf/seaweedfs:latest
    container_name: sling-adbc-seaweedfs
    ports:
      - "18333:8333"
      - "19333:9333"
    command: server -s3 -dir=/data
    volumes:
      - seaweedfs_data:/data

volumes:
  seaweedfs_data:

docker compose up -d
docker inspect sling-adbc-mssql --format='{{.State.Health.Status}}'
# wait until → healthy

SeaweedFS’s server -s3 mode runs anonymous-write by default. That’s fine for a local lab. For production, configure credentials with weed shell and the s3.configure command.

Configuring the Sling Connections

Sling reads connection details from ~/.sling/env.yaml, environment variables, or sling conns set. We need two: one for SQL Server (with use_adbc: true) and one for SeaweedFS (a standard S3 connection pointed at the local endpoint).

sling conns set MSSQL_ADBC type=sqlserver \
  host=localhost port=51444 user=sa password='AdbcPipeline123!' \
  database=master encrypt=disable use_adbc=true

sling conns set SEAWEED_S3 type=s3 endpoint=http://localhost:18333 \
  bucket=pipeline access_key_id=any secret_access_key=any

Or in ~/.sling/env.yaml:

connections:
  MSSQL_ADBC:
    type: sqlserver
    host: localhost
    port: 51444
    user: sa
    password: 'AdbcPipeline123!'
    database: master
    encrypt: disable
    use_adbc: true

  SEAWEED_S3:
    type: s3
    endpoint: http://localhost:18333
    bucket: pipeline
    access_key_id: any
    secret_access_key: any

Test them:

sling conns test MSSQL_ADBC  # → INF success!
sling conns test SEAWEED_S3  # → INF success!

Behind the scenes, the use_adbc: true flag on a regular type: sqlserver connection routes both reads and writes through the ADBC driver. Flip it off and the same connection falls back to the standard TDS path; everything else in your replication or pipeline stays the same.
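
To see the fallback in practice, here’s a hedged sketch. It assumes a second connection named MSSQL_TDS, defined exactly like MSSQL_ADBC above but without use_adbc: true (MSSQL_TDS is not part of this lab). The same call runs over either path:

# Hypothetical A/B between the ADBC and TDS paths. MSSQL_TDS is an assumed
# connection, identical to MSSQL_ADBC minus the use_adbc flag.
from sling import Sling

for conn in ("MSSQL_ADBC", "MSSQL_TDS"):
    rows = list(Sling(src_conn=conn, src_stream="select 1 as ok").stream())
    print(conn, rows)  # same rows either way; only the wire protocol differs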

CLI Pipeline: CSV → ADBC

The CLI side of the lab stays simple: write a CSV, load it into dbo.adbc_pipeline via ADBC, read it back, verify the rows. The whole thing runs as a Sling pipeline so the setup, the load, and the assertions live in one file:

steps:
  - type: write
    to: file:///tmp/adbc_sqlserver_pipeline.csv
    content: |
      id,name,score
      1,alice,98.5
      2,bob,82.0
      3,charlie,75.25
      4,dana,91.75
      5,eve,88.0

  - type: query
    connection: MSSQL_ADBC
    query: |
      if object_id('dbo.adbc_pipeline', 'U') is not null drop table dbo.adbc_pipeline

  - replication:
      source: LOCAL
      target: MSSQL_ADBC
      defaults:
        mode: full-refresh
      streams:
        file:///tmp/adbc_sqlserver_pipeline.csv:
          object: dbo.adbc_pipeline

  - type: query
    connection: MSSQL_ADBC
    query: select id, name, score from dbo.adbc_pipeline order by id
    into: rows

  - type: check
    check: length(store.rows) == 5
    failure_message: "Expected 5 rows, got {length(store.rows)}"

Two things worth noting:

  • The replication: block embedded inside the pipeline is a regular Sling replication. The same YAML body runs unchanged outside the pipeline (a Python equivalent follows this list).
  • Sling’s hooks (query, check, log, write) are first-class steps in pipelines, so the lab’s setup, assertions, and cleanup all stay in one file.
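
The embedded replication also maps one-to-one onto the Python Replication API used later in this article. A minimal standalone equivalent, for reference:

# The replication: block above as a standalone Python call, using the same
# Replication / ReplicationStream API as the script in the next section.
from sling import Mode, Replication, ReplicationStream

Replication(
    source="LOCAL",
    target="MSSQL_ADBC",
    streams={
        "file:///tmp/adbc_sqlserver_pipeline.csv": ReplicationStream(
            mode=Mode.FULL_REFRESH,
            object="dbo.adbc_pipeline",
        ),
    },
).run()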

Run it:

sling run -p tests/pipelines/adbc/sqlserver/p.41.adbc_sqlserver.yaml

Real output (trimmed):

INF Sling CLI | https://slingdata.io
INF Wrote /tmp/adbc_sqlserver_pipeline.csv
INF Sling Replication | LOCAL -> MSSQL_ADBC | file:///tmp/adbc_sqlserver_pipeline.csv
INF connecting to target database (sqlserver)
INF reading from source file system (file)
INF writing to target database [mode: full-refresh]
INF created table "dbo"."adbc_pipeline"
INF inserted 5 rows into "dbo"."adbc_pipeline" in 0 secs [33 r/s]
INF execution succeeded
INF ADBC write completed
INF Loaded rows: [{"id":1,"name":"alice","score":"98.500000"},{"id":2,"name":"bob","score":"82.000000"},...]
INF ADBC SQL Server pipeline test PASSED

A single replication, an ADBC read, and its check assertions, all in under a second. The Python script below shows the same flow at scale — 100k rows via S3 — with native Arrow types on the read-back.

Python: 100k Rows via S3 and stream_arrow()

The Python side does the more interesting thing — the shape you’d actually see in production. The script generates 100,000 rows in DuckDB, writes them as four parquet files into a SeaweedFS S3 bucket via PyArrow, and lets Sling replicate the full S3 prefix into SQL Server via ADBC. Then it reads back through stream_arrow() so the rows arrive as native Arrow RecordBatch objects instead of stringified CSV-shaped dicts.

Save it as run_pipeline.py — the inline script header tells uv which dependencies to fetch:

# /// script
# requires-python = ">=3.10"
# dependencies = ["sling[arrow]", "duckdb", "pyarrow", "boto3"]
# ///
import io
import boto3
import duckdb
import pyarrow as pa
from pyarrow import parquet as pq

from sling import Mode, Replication, ReplicationStream, Sling
from sling.hooks import HookQuery

MSSQL_CONN = "MSSQL_ADBC"
S3_CONN = "SEAWEED_S3"
S3_BUCKET = "pipeline"
S3_PREFIX = "adbc-py/"

ROWS_PER_FILE = 25_000
NUM_FILES = 4  # 100k rows total

# 1. Generate 100k rows in DuckDB, split into 4 parquet files via pyarrow,
#    push each to SeaweedFS S3 under the same prefix.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:18333",
    aws_access_key_id="any", aws_secret_access_key="any",
    region_name="us-east-1",
)
try:
    s3.create_bucket(Bucket=S3_BUCKET)
except Exception:
    pass  # bucket may already exist from a prior run

# Clear any prior run.
for obj in s3.list_objects_v2(Bucket=S3_BUCKET, Prefix=S3_PREFIX).get("Contents", []):
    s3.delete_object(Bucket=S3_BUCKET, Key=obj["Key"])

con = duckdb.connect()
for n in range(NUM_FILES):
    start_id = n * ROWS_PER_FILE + 1
    end_id = start_id + ROWS_PER_FILE
    arrow_tbl: pa.Table = con.execute(f"""
        select i as id,
               'user_' || i::varchar as name,
               round(i * 1.5, 2) as score,
               'cat_' || ((i % 10)::varchar) as category
        from range({start_id}, {end_id}) t(i)
    """).arrow()

    buf = io.BytesIO()
    pq.write_table(arrow_tbl, buf, compression="snappy")
    buf.seek(0)
    s3.upload_fileobj(buf, S3_BUCKET, f"{S3_PREFIX}part-{n+1:04d}.parquet")

# 2. Replicate the S3 prefix → SQL Server via ADBC.
Replication(
    source=S3_CONN,
    target=MSSQL_CONN,
    streams={
        f"{S3_BUCKET}/{S3_PREFIX}": ReplicationStream(
            mode=Mode.FULL_REFRESH,
            object="dbo.adbc_pipeline_py",
        ),
    },
    hooks={
        "start": [HookQuery(
            connection=MSSQL_CONN,
            query="if object_id('dbo.adbc_pipeline_py','U') is not null drop table dbo.adbc_pipeline_py",
        )],
    },
).run()

# 3. Read back through ADBC with stream_arrow().
stream = Sling(
    src_conn=MSSQL_CONN,
    src_stream="select top 5 id, name, score, category from dbo.adbc_pipeline_py order by id",
).stream_arrow()

batches = list(stream)
print(f"got {len(batches)} arrow batch(es), schema: {batches[0].schema}")
for row in pa.Table.from_batches(batches).to_pylist():
    print(f"  {row}")

# 4. Verify the count.
cnt_batches = list(Sling(
    src_conn=MSSQL_CONN,
    src_stream="select count(*) as cnt from dbo.adbc_pipeline_py",
).stream_arrow())
total = pa.Table.from_batches(cnt_batches).to_pylist()[0]["cnt"]
print(f"total rows: {total:,}")
assert total == NUM_FILES * ROWS_PER_FILE

Run it without setting up a virtualenv:

uv run run_pipeline.py

Actual output:

generating 100,000 rows in pyarrow...
  wrote s3://pipeline/adbc-py/part-0001.parquet (25,000 rows)
  wrote s3://pipeline/adbc-py/part-0002.parquet (25,000 rows)
  wrote s3://pipeline/adbc-py/part-0003.parquet (25,000 rows)
  wrote s3://pipeline/adbc-py/part-0004.parquet (25,000 rows)

replicating s3://pipeline/adbc-py/ -> MSSQL_ADBC via ADBC (full-refresh)...
INF connecting to source file system (s3)
INF connecting to target database (sqlserver)
INF reading from source file system (s3)
INF writing to target database [mode: full-refresh]
INF created table "dbo"."adbc_pipeline_py"
INF inserted 100,000 rows into "dbo"."adbc_pipeline_py" in 3 secs [33,333 r/s]

reading back via ADBC stream_arrow() (first 5 rows by id)...
  got 1 arrow batch(es), 5 rows
  schema: id: int64, name: string, score: double, category: string
  {'id': 1, 'name': 'user_1', 'score': 1.5, 'category': 'cat_1'}
  {'id': 2, 'name': 'user_2', 'score': 3.0, 'category': 'cat_2'}
  {'id': 3, 'name': 'user_3', 'score': 4.5, 'category': 'cat_3'}
  {'id': 4, 'name': 'user_4', 'score': 6.0, 'category': 'cat_4'}
  {'id': 5, 'name': 'user_5', 'score': 7.5, 'category': 'cat_5'}

total rows: 100,000

ADBC SQL Server Python pipeline PASSED

A few things worth calling out:

  • One Sling replication, N parquet files. The stream key pipeline/adbc-py/ resolves as a prefix on the S3 connection; Sling picks up every parquet file underneath. Add more files and the next run just loads them.
  • stream_arrow() returns Arrow RecordBatches. id is an int64, score is a double, not a stringified version. That’s the difference that makes downstream Polars / DuckDB / pandas conversions essentially zero-copy. The plain stream() method returns a Python-dict iterator with string-shaped values; use it when you want to iterate row by row without an Arrow dependency (see the sketch after this list).
  • DuckDB → PyArrow is a zero-copy hop. con.execute(sql).arrow() returns the result as a pa.Table directly. From there, pq.write_table() produces real columnar parquet, not pandas-shaped parquet — which matters when the receiving end is also Arrow-native.
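
To make the stream() / stream_arrow() contrast concrete, a minimal sketch against the table just loaded (same connection and query shape as above):

# stream() yields plain Python dicts with string-shaped values;
# stream_arrow() yields pyarrow.RecordBatch objects with native types.
import pyarrow as pa
from sling import Sling

q = "select top 3 id, score from dbo.adbc_pipeline_py order by id"

for row in Sling(src_conn="MSSQL_ADBC", src_stream=q).stream():
    print(row)  # plain dict per row

batches = list(Sling(src_conn="MSSQL_ADBC", src_stream=q).stream_arrow())
print(pa.Table.from_batches(batches).schema)  # id: int64, score: double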

When to Use ADBC vs. the Default Path

Sling’s default SQL Server path is the go-mssqldb TDS driver, which is fine for most workloads — connect, run queries, stream results. ADBC is the right call in three situations:

  • Bulk loads from Arrow-native sources. Loading from DuckDB, Parquet, or a Polars DataFrame? ADBC keeps the data in Arrow format end to end and bulk-inserts using SQL Server’s columnar bulk path. The go-mssqldb route would re-serialize through SQL strings (see the sketch after this list).
  • Big writes you’d otherwise reach for bcp for. ADBC’s bulk insert is faster than row-by-row TDS and doesn’t need an external bcp invocation, which means it stays inside one process and one transaction boundary.
  • Read-heavy analytics. Sling’s ADBC reads return Arrow record batches; downstream Polars / DuckDB / pandas conversions are essentially zero-copy.
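
For the first case, a hedged sketch of an Arrow-native source. It assumes a DUCKDB connection pointing at a local database file (not part of this lab) and a placeholder table main.events; the replication body is the same shape as everything above:

# Hypothetical: DuckDB file -> SQL Server via ADBC. DUCKDB is an assumed
# connection (type: duckdb) and main.events a placeholder table; data stays
# Arrow-columnar end to end instead of round-tripping through SQL strings.
from sling import Mode, Replication, ReplicationStream

Replication(
    source="DUCKDB",
    target="MSSQL_ADBC",
    streams={
        "main.events": ReplicationStream(
            mode=Mode.FULL_REFRESH,
            object="dbo.events",
        ),
    },
).run()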

The default path is still the right answer for short interactive queries, smaller writes, environments where you’d rather skip the dbc driver install, or features the ADBC driver doesn’t expose yet (Kerberos, certain Azure auth modes).

Wrapping Up

Two switches separate Sling-with-TDS from Sling-with-ADBC: a one-time dbc install mssql and use_adbc: true on the connection. Same connection, same replications, same hooks — only the protocol underneath changes. From there, the same YAML pipeline drops into Python through the sling library, and Sling.stream_arrow() gives you native Arrow RecordBatches when the next hop is also columnar.

The full lab — Docker compose file (SQL Server + SeaweedFS), CLI pipeline, Python script — is in the sling-cli repo. Clone it, docker compose up, run the pipeline, then run the Python script. Both should finish in under ten seconds against the local containers.

For Sling’s full SQL Server connector docs (TDS and ADBC together), see docs.slingdata.io/connections/database-connections/sqlserver. Sling Platform handles the same workloads with a UI, scheduling, and observability on top — see slingdata.io/platform. For local CLI work past the free tier, slingdata.io/cli-pro covers higher concurrency and longer-running pipelines.