Export and Load JSON Data from SFTP to Snowflake using Sling

Slinger avatar
Slinger
Cover for Export and Load JSON Data from SFTP to Snowflake using Sling

Introduction

In today’s data-driven world, organizations frequently need to move data between different systems efficiently and reliably. A common scenario is transferring JSON data from SFTP servers to Snowflake data warehouses. While this seems straightforward, it often involves complex ETL processes, custom scripts, and ongoing maintenance.

Enter Sling, an open-source data integration tool designed to simplify these data movement challenges. Sling provides a streamlined approach to data replication, making it easy to transfer data between various sources and destinations, including SFTP servers and Snowflake.

In this comprehensive guide, we’ll walk through how to use Sling to efficiently move JSON data from an SFTP server to Snowflake. You’ll learn about:

  • Setting up Sling in your environment
  • Configuring SFTP and Snowflake connections
  • Creating and running data replication tasks
  • Using both the CLI and Platform interfaces
  • Best practices for optimal performance

Whether you’re a data engineer, developer, or analyst, this guide will help you implement a robust data pipeline using Sling’s powerful features.

Understanding the Data Pipeline Challenge

Moving data from SFTP to Snowflake traditionally involves several complex steps and considerations:

  1. Authentication and Access: Managing SFTP credentials, SSH keys, and Snowflake authentication
  2. Data Parsing: Handling JSON files with varying structures and nested data
  3. Type Mapping: Converting JSON data types to appropriate Snowflake column types
  4. Error Handling: Dealing with malformed JSON, network issues, and failed transfers
  5. Scalability: Processing large files and multiple directories efficiently
  6. Monitoring: Tracking successful transfers and troubleshooting failures

Many organizations resort to writing custom scripts or using expensive ETL tools to handle these challenges. This approach often leads to:

  • High development and maintenance costs
  • Complex error handling and retry logic
  • Limited scalability and performance
  • Difficulty in monitoring and troubleshooting
  • Technical debt from custom solutions

Sling addresses these challenges by providing:

  • Built-in support for SFTP and Snowflake connections
  • Automatic JSON parsing and type inference
  • Efficient bulk loading capabilities
  • Robust error handling and retry mechanisms
  • Comprehensive monitoring and logging
  • Simple configuration through YAML or CLI flags

Let’s explore how to implement this solution using Sling.

Getting Started with Sling

Installing Sling is straightforward and supports multiple platforms. Choose the installation method that best suits your environment:

# Install using Homebrew (macOS)
brew install slingdata-io/sling/sling

# Install using curl (Linux)
curl -LO 'https://github.com/slingdata-io/sling-cli/releases/latest/download/sling_linux_amd64.tar.gz' \
  && tar xf sling_linux_amd64.tar.gz \
  && rm -f sling_linux_amd64.tar.gz \
  && chmod +x sling

# Install using Scoop (Windows)
scoop bucket add sling https://github.com/slingdata-io/scoop-sling.git
scoop install sling

# Install using Python pip
pip install sling

After installation, verify that Sling is properly installed by checking its version:

# Check Sling version
sling --version

Setting Up Connections

SFTP Connection Setup

Sling provides multiple ways to configure your SFTP connection. Here are the available options:

Using sling conns set Command

# Set up SFTP connection with password authentication
sling conns set MY_SFTP type=sftp host=sftp.example.com user=myuser password=mypassword port=22

# Set up SFTP connection with SSH key authentication
sling conns set MY_SFTP type=sftp host=sftp.example.com user=myuser private_key=/path/to/private_key port=22

Using Environment Variables

# Using JSON format
export MY_SFTP='{
  "type": "sftp",
  "host": "sftp.example.com",
  "user": "myuser",
  "password": "mypassword",
  "port": 22
}'

Using YAML Configuration

Create or edit ~/.sling/env.yaml:

connections:
  MY_SFTP:
    type: sftp
    host: sftp.example.com
    user: myuser
    port: 22
    password: mypassword  # Or use private_key
    private_key: /path/to/private_key  # Optional, alternative to password

Snowflake Connection Setup

Similarly, you can configure your Snowflake connection using multiple methods:

Using sling conns set Command

# Set up Snowflake connection
sling conns set MY_SNOWFLAKE type=snowflake account=myaccount user=myuser password=mypassword database=mydatabase warehouse=mywarehouse role=myrole

Using Environment Variables

# Using connection URL
export MY_SNOWFLAKE='snowflake://myuser:mypassword@myaccount/mydatabase?warehouse=mywarehouse&role=myrole'

Using YAML Configuration

Add to your ~/.sling/env.yaml:

connections:
  MY_SNOWFLAKE:
    type: snowflake
    account: myaccount
    user: myuser
    password: mypassword
    database: mydatabase
    warehouse: mywarehouse
    role: myrole
    schema: myschema  # Optional

Data Synchronization Methods

Sling offers two primary methods for configuring and running data synchronization: CLI flags and replication YAML files. Let’s explore both approaches.

Using CLI Flags

For quick, one-off transfers or simple configurations, you can use CLI flags:

Basic Example

# Basic JSON file transfer
sling run \
  --src-conn MY_SFTP \
  --src-stream '/path/to/data.json' \
  --src-options '{ "format": "json", "flatten": true }' \
  --tgt-conn MY_SNOWFLAKE \
  --tgt-object 'mydatabase.myschema.mytable' \
  --mode full-refresh

Advanced Example

# Advanced JSON transfer with options, create a new table for each file
sling run \
  --src-conn MY_SFTP \
  --src-stream '/path/to/data/*.json' \
  --src-options '{
    "format": "json",
    "flatten": true,
    "jmespath": "records[*]",
    "empty_as_null": true
  }' \
  --transforms '[remove_accents, trim]' \
  --tgt-conn MY_SNOWFLAKE \
  --tgt-object 'mydatabase.myschema.{stream_file_name}' \
  --tgt-options '{
    "column_casing": "snake",
    "table_keys": {"primary": ["id"]},
    "add_new_columns": true
  }'

Using Replication YAML

For more complex scenarios or when managing multiple streams, use a replication YAML file:

Basic Multi-stream Example

source: MY_SFTP
target: MY_SNOWFLAKE

defaults:
  mode: full-refresh
  source_options:
    format: json
    flatten: true
    empty_as_null: true

streams:
  '/data/customers/*.json':
    object: mydatabase.myschema.customers
    transforms: [remove_accents, trim]
    
  '/data/orders/*.json':
    object: mydatabase.myschema.orders
    transforms: [remove_accents, trim]

Complex Multi-stream Example

source: MY_SFTP
target: MY_SNOWFLAKE

env:
  SLING_SAMPLE_SIZE: 2000
  SLING_STREAM_URL_COLUMN: true

defaults:
  mode: incremental
  source_options:
    format: json
    flatten: true
    empty_as_null: true
    jmespath: "records[*]"
  target_options:
    column_casing: snake
    add_new_columns: true

streams:
  '/data/customers/${stream_date}/*.json':
    object: mydatabase.myschema.customers
    transforms:
      - remove_accents
      - trim
    primary_key: [customer_id]
    update_key: last_modified
    source_options:
      datetime_format: "YYYY-MM-DD HH:mm:ss"
    target_options:
      table_keys:
        primary: [customer_id]
        unique: [email]
  
  '/data/orders/${stream_date}/*.json':
    object: mydatabase.myschema.orders
    transforms:
      - remove_accents
      - trim
      email: [lowercase, trim]
      amount: to_number
    primary_key: [order_id]
    update_key: order_date
    source_options:
      datetime_format: "YYYY-MM-DD"
    target_options:
      table_keys:
        primary: [order_id]

To run a replication YAML configuration:

# Run the replication
sling run -r path/to/replication.yaml

The Sling Platform

While the CLI is powerful for local development and automation, Sling also offers a comprehensive platform for managing data operations at scale. The platform provides a user-friendly interface for:

  • Managing connections
  • Creating and editing replications
  • Monitoring executions
  • Scheduling jobs
  • Team collaboration

Sling Platform Connections

The Connections page allows you to manage your data sources and destinations securely.

Sling Platform Editor

The Editor provides a visual interface for creating and modifying replication configurations.

Sling Platform Execution

The Execution view shows detailed information about your data transfers, including progress, logs, and statistics.

Platform Components

The Sling Platform consists of several key components:

  1. Control Plane

    • Manages authentication and authorization
    • Handles job scheduling and orchestration
    • Provides the web interface
  2. Agents

    • Execute data operations
    • Run in your infrastructure
    • Maintain secure access to your data sources
  3. Editor

    • Visual replication configuration
    • Syntax highlighting and validation
    • Template management
  4. Monitoring

    • Real-time execution tracking
    • Error reporting and alerting
    • Performance metrics

To get started with the platform:

  1. Sign up at https://app.slingdata.io/signup
  2. Create your first project
  3. Install and configure a Sling agent
  4. Set up your connections
  5. Create and run your first replication

Best Practices and Tips

To ensure optimal performance and reliability when transferring JSON data from SFTP to Snowflake:

  1. JSON Handling

    • Use flatten: true for nested JSON structures
    • Specify jmespath expressions for complex JSON parsing
    • Set empty_as_null: true to handle empty strings consistently
  2. Performance Optimization

    • Use wildcards for processing multiple files
    • Implement incremental loading for large datasets
    • Configure appropriate batch sizes
  3. Error Handling

    • Set up proper logging and monitoring
    • Use primary keys to prevent duplicates
    • Implement retry mechanisms for network issues
  4. Security

    • Use SSH keys instead of passwords for SFTP
    • Rotate credentials regularly
    • Follow the principle of least privilege
  5. Maintenance

    • Regular testing of connections
    • Monitoring of job execution times
    • Cleanup of processed files

For more detailed information and examples, visit:

Conclusion

Sling provides a powerful and flexible solution for transferring JSON data from SFTP to Snowflake. Its combination of simple configuration, robust features, and comprehensive platform makes it an excellent choice for organizations of all sizes.

Key benefits include:

  • Easy setup and configuration
  • Support for complex JSON structures
  • Robust error handling
  • Scalable performance
  • Comprehensive monitoring
  • Both CLI and Platform interfaces

To take your data integration to the next level:

  1. Install Sling and set up your connections
  2. Start with simple transfers using CLI flags
  3. Progress to replication YAML for complex scenarios
  4. Consider the Platform for enterprise needs
  5. Join the Sling community for support and updates

For more examples and detailed documentation, visit https://docs.slingdata.io/.