Loading Local Parquet Files into Snowflake with Sling

Introduction

Moving data from local Parquet files to Snowflake traditionally involves multiple steps, tools, and considerable setup time. Whether you’re dealing with analytics data, user events, or business metrics, the process can be complex and error-prone. Enter Sling: a modern data movement tool that simplifies this entire process into a streamlined operation.

Sling provides a powerful yet simple approach to data integration, offering a command-line interface that makes data transfer between local storage and Snowflake effortless. This guide will walk you through the process of setting up and using Sling to export your Parquet files to Snowflake, demonstrating how it eliminates common pain points and accelerates your data pipeline development.

Traditional Data Pipeline Challenges

Setting up a data pipeline to move Parquet files to Snowflake traditionally involves several complex steps and considerations:

  1. Infrastructure Setup

    • Setting up staging areas for data transfer
    • Configuring network access and security policies
    • Managing compute resources for data processing
  2. Development Overhead

    • Writing custom scripts for file handling
    • Implementing error handling and retry logic
    • Managing data type conversions and schema changes
  3. Operational Complexity

    • Monitoring data transfer processes
    • Handling failed transfers and data validation
    • Managing credentials and access controls
  4. Cost Considerations

    • Storage costs for intermediate staging
    • Compute costs for data processing
    • Maintenance and operational overhead

These challenges often lead to lengthy development cycles and increased operational costs. Sling addresses these pain points by providing a unified solution that handles all these aspects efficiently.

Getting Started with Sling

The first step in simplifying your Parquet to Snowflake data pipeline is installing Sling. The installation process is straightforward and supports multiple platforms and package managers.

Installation

Choose the installation method that best suits your environment:

# Install using Homebrew (macOS)
brew install slingdata-io/sling/sling

# Install using curl (Linux)
curl -LO 'https://github.com/slingdata-io/sling-cli/releases/latest/download/sling_linux_amd64.tar.gz' \
  && tar xf sling_linux_amd64.tar.gz \
  && rm -f sling_linux_amd64.tar.gz \
  && chmod +x sling

# Install using Scoop (Windows)
scoop bucket add sling https://github.com/slingdata-io/scoop-sling.git
scoop install sling

# Install using Python pip
pip install sling

After installation, verify that Sling is properly installed by checking its version:

# Check Sling version
sling --version

For more detailed installation instructions and options, visit the Sling CLI Installation Guide.

Setting Up Connections

Sling needs to know how to connect to your Snowflake instance; the local filesystem is available out of the box as the built-in local connection. There are several ways to configure the Snowflake connection:

Using Environment Variables

The simplest way to set up connections is through environment variables:

# Set up Snowflake connection
export SNOWFLAKE_CONN='snowflake://user:pass@account/database/schema?warehouse=compute_wh'

Using the CLI

Alternatively, you can use the sling conns set command to configure your connections:

# Set up Snowflake connection using URL
sling conns set snowflake_conn 'snowflake://user:pass@account/database/schema?warehouse=compute_wh'

# Or set up using individual parameters
sling conns set snowflake_conn \
  type=snowflake \
  account=myaccount \
  user=myuser \
  password=mypassword \
  warehouse=compute_wh \
  database=mydb \
  schema=myschema

Using YAML Configuration

For a more permanent solution, you can create an env.yaml file in your Sling environment directory (typically ~/.sling/env.yaml):

connections:
  snowflake_conn:
    type: snowflake
    account: myaccount
    user: myuser
    password: mypassword
    warehouse: compute_wh
    database: mydb
    schema: myschema

For more details about connection configuration, refer to the Environment Configuration Guide.
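
Whichever method you use, it is worth confirming that Sling can actually see and reach the connection before moving any data. A quick check might look like this (assuming the connection is named snowflake_conn as above):

# List the connections Sling has picked up from env vars, the CLI, or env.yaml
sling conns list

# Test connectivity to Snowflake before running any syncs
sling conns test snowflake_conn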

Basic Data Sync Operations

Once you have Sling installed and your connections configured, you can start syncing data from your Parquet files to Snowflake. Let’s explore different approaches to achieve this.

Using CLI Commands

The simplest way to sync data is using the CLI. Here’s a basic example:

# Sync a single Parquet file to Snowflake
sling run \
  --src-stream "data/users.parquet" \
  --tgt-conn snowflake_conn \
  --tgt-object "raw_data.users"

# Sync multiple Parquet files with pattern matching
sling run \
  --src-stream "data/events/*.parquet" \
  --tgt-conn snowflake_conn \
  --tgt-object "raw_data.events"

# Sync with additional options
sling run \
  --src-stream "data/events/*.parquet" \
  --tgt-conn snowflake_conn \
  --tgt-object "raw_data.events" \
  --mode full-refresh \
  --src-options '{"empty_as_null": true}' \
  --tgt-options '{"column_casing": "snake", "add_new_columns": true}'

For a complete list of available CLI options, visit the CLI Documentation.
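
If a run does not behave as expected, enabling debug logging is the quickest way to see what Sling is doing under the hood. A minimal sketch, assuming the -d (debug) flag available in recent Sling CLI versions; the path and table name are the same illustrative ones used above:

# Re-run a sync with debug output to inspect file discovery, schema
# inference, and the load steps Sling performs
sling run -d \
  --src-stream "data/users.parquet" \
  --tgt-conn snowflake_conn \
  --tgt-object "raw_data.users"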

Basic Replication Configuration

For more maintainable and repeatable syncs, you can use a YAML configuration file. Here’s a basic example:

source: local
target: snowflake_conn

defaults:
  target_options:
    column_casing: snake
    add_new_columns: true

streams:
  data/users.parquet:
    object: raw_data.users
    mode: full-refresh
    source_options:
      empty_as_null: true

  data/events/*.parquet:
    object: raw_data.events
    mode: full-refresh
    source_options:
      empty_as_null: true

Save this configuration as local_to_snowflake.yaml and run it:

# Run the replication configuration
sling run local_to_snowflake.yaml

This configuration provides several advantages:

  • Version control for your data pipeline
  • Reusable configurations
  • Easier maintenance and updates
  • Better documentation of your data flow

For more details about replication configurations, check out the Replication Documentation.
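
Once a replication file holds several streams, you will often want to run just one of them while iterating on its settings. Here is a sketch of how that could look, assuming the --streams flag described in the CLI documentation:

# Run only the users stream from the replication file
# (--streams limits the run to the named stream keys)
sling run local_to_snowflake.yaml --streams "data/users.parquet"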

Advanced Replication Configuration

For more complex data pipelines, Sling offers advanced configuration options that provide greater control over your data movement and transformation processes.

Advanced YAML Configuration Example

Here’s a comprehensive example that showcases various advanced features:

source: local
target: snowflake_conn

defaults:
  mode: full-refresh
  source_options:
    empty_as_null: true
    datetime_format: "YYYY-MM-DD HH:mm:ss"
  target_options:
    column_casing: snake
    add_new_columns: true
    table_keys:
      primary: ["id"]
      unique: ["email"]

streams:
  # User data with transformations
  data/users/*.parquet:
    object: raw_data.users
    columns:
      id: number
      email: string
      created_at: timestamp
      status: string

  # Event data with runtime variables
  data/events/${stream_date}/*.parquet:
    object: raw_data.events_{stream_date}
    source_options:
      empty_as_null: true
    target_options:
      post_sql: |
        grant select on table {object_name} to role analyst;

env:
  stream_date: "2024-01-01"  # or use env vars

Let’s break down the key components of this configuration:

Default Settings

The defaults section specifies settings that apply to all streams unless overridden:

  • Common source and target options
  • Default replication mode
  • Table key configurations

Stream-Specific Configurations

Each stream can have its own specific settings:

  • Column definitions and data types
  • Custom SQL for table creation
  • Data transformations
  • Pre and post-sync SQL operations

Runtime Variables

The configuration uses runtime variables (like ${stream_date}) that can be:

  • Defined in the env section
  • Passed via command line
  • Set through environment variables

For more information about runtime variables, visit the Runtime Variables Documentation.
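
As a concrete sketch of the command-line route: if the env entry is changed to read an operating-system variable (for example stream_date: ${STREAM_DATE}, an assumption about how you would wire this up), the partition to load can be chosen at run time. The file name advanced_replication.yaml is illustrative:

# Assuming the replication's env section reads an OS variable,
# e.g.  stream_date: ${STREAM_DATE}
export STREAM_DATE="2024-01-02"

# Sling substitutes the value into the stream path and target object
sling run advanced_replication.yaml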

Complex Example with Multiple Data Types

Here’s another example handling different types of Parquet files:

source: local
target: snowflake_conn

streams:
  # Customer analytics data
  analytics/customers/*.parquet:
    object: analytics.customer_data
    mode: incremental
    primary_key: ["customer_id"]
    update_key: "updated_at"
    transforms:
      email: lower
    source_options:
      empty_as_null: true
    target_options:
      add_new_columns: true
      column_casing: snake
      table_keys:
        unique: ["email"]

  # Product usage metrics
  metrics/usage/{table}/*.parquet:
    object: metrics.{stream_table}
    mode: full-refresh
    source_options:
      datetime_format: "YYYY-MM-DD"
    target_options:
      batch_limit: 50000

env:
  table: "daily_usage"

This configuration demonstrates:

  • Incremental loading with update tracking
  • Dynamic table naming with runtime variables
  • Batch size limits via target options
  • Different replication modes per stream

For more examples and detailed documentation, refer to the Replication Documentation and the Runtime Variables Documentation.

Getting Started Guide

Here’s a step-by-step guide to get you started with Sling; a condensed command sequence follows the list:

  1. Installation and Setup

    • Install Sling using your preferred method
    • Configure your Snowflake connection
    • Verify connectivity using sling conns test
  2. Basic Data Sync

    • Start with a simple CLI command to sync a single file
    • Monitor the sync process
    • Verify data in Snowflake
  3. Configuration Development

    • Create a basic YAML configuration
    • Test with a small dataset
    • Gradually add more advanced features
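
Putting those steps together, a minimal end-to-end session could look like this (connection name, file path, and table name are the illustrative ones used throughout this guide):

# 1. Confirm the Snowflake connection is configured and reachable
sling conns test snowflake_conn

# 2. Sync a single Parquet file into a Snowflake table
sling run \
  --src-stream "data/users.parquet" \
  --tgt-conn snowflake_conn \
  --tgt-object "raw_data.users"

# 3. Once that works, move the same sync into a replication YAML
sling run local_to_snowflake.yaml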

Additional Resources

To learn more about Sling and its capabilities, explore the official documentation and the guides linked throughout this article.

Sling simplifies the process of moving data from Parquet files to Snowflake, providing powerful CLI tools to get the job done efficiently. Whether you’re handling simple data transfers or complex transformation pipelines, Sling offers the flexibility and features you need.