Export from BigQuery to PostgreSQL with Sling

Slinger

Introduction

Moving data between Google BigQuery and PostgreSQL traditionally involves complex ETL processes, custom scripts, and significant engineering effort. Organizations often face challenges such as:

  • Setting up and maintaining data extraction processes from BigQuery
  • Managing authentication and permissions across platforms
  • Handling schema compatibility and data type conversions
  • Implementing efficient data loading into PostgreSQL
  • Monitoring and maintaining the data pipeline
  • Dealing with incremental updates and schema changes
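To make the schema-compatibility challenge concrete, a hand-rolled pipeline has to own and maintain a type mapping like the sketch below. This is illustrative only, not exhaustive, and not Sling's internal mapping:

```python
# Illustrative BigQuery -> PostgreSQL type mapping a custom pipeline
# must maintain by hand (not Sling's internal table).
BQ_TO_PG = {
    "STRING": "text",
    "INT64": "bigint",
    "FLOAT64": "double precision",
    "NUMERIC": "numeric",
    "BOOL": "boolean",
    "DATE": "date",
    "TIMESTAMP": "timestamptz",
    "BYTES": "bytea",
}

def pg_type(bq_type: str) -> str:
    # Fall back to text for types without a clean one-to-one equivalent.
    return BQ_TO_PG.get(bq_type.upper(), "text")

print(pg_type("INT64"))      # bigint
print(pg_type("GEOGRAPHY"))  # text (fallback)
```

Every such mapping decision (and its edge cases) becomes code you have to test and maintain yourself.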

Building a traditional data pipeline between BigQuery and PostgreSQL can take weeks or even months, and requires specialized knowledge of both platforms along with custom code development. Common approaches include:

  1. Writing custom Python scripts using libraries like pandas and sqlalchemy
  2. Using ETL tools that require extensive configuration and maintenance
  3. Implementing Apache Airflow DAGs with custom operators
  4. Developing and maintaining complex data transformation logic
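To give a feel for what approach 1 involves, here is a stripped-down sketch of the extract-in-batches, load-in-batches loop such scripts implement. It uses an in-memory SQLite database as a stand-in target so it runs anywhere; a real script would connect to PostgreSQL and add BigQuery authentication, retries, and type handling on top:

```python
import sqlite3

# Stand-in for rows streamed out of BigQuery page by page.
def extract_batches(rows, batch_size=2):
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

source_rows = [
    (1, "2024-02-10", 120.5),
    (2, "2024-02-10", 80.0),
    (3, "2024-02-11", 42.0),
]

# Stand-in target: a real script would open a PostgreSQL connection instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_sales (order_id INTEGER, order_date TEXT, total REAL)"
)

# Load each extracted batch into the target table.
for batch in extract_batches(source_rows):
    conn.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)", batch)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
print(count)  # 3
```

Even this toy version omits error handling, schema evolution, and incremental logic, which is where most of the real maintenance cost lives.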

These approaches often lead to:

  • Increased development and maintenance costs
  • Complex error handling and retry mechanisms
  • Difficulty in handling schema changes
  • Performance bottlenecks
  • Limited monitoring and observability

Sling simplifies this entire process by providing a streamlined, configuration-based approach that eliminates the need for custom code and complex infrastructure setup. With Sling, you can:

  • Configure connections with simple environment variables or CLI commands
  • Automatically handle schema mapping and data type conversions
  • Optimize performance with built-in batch processing and parallel execution
  • Monitor and manage replications through both CLI and web interface
  • Implement incremental updates with minimal configuration

In this guide, we’ll walk through the process of setting up a BigQuery to PostgreSQL replication using Sling, demonstrating how to overcome common challenges and implement an efficient data pipeline in minutes rather than days or weeks.

Installation

Getting started with Sling is straightforward. You can install it using various package managers depending on your operating system:

# Install using Homebrew (macOS)
brew install slingdata-io/sling/sling

# Install using curl (Linux)
curl -LO 'https://github.com/slingdata-io/sling-cli/releases/latest/download/sling_linux_amd64.tar.gz' \
  && tar xf sling_linux_amd64.tar.gz \
  && rm -f sling_linux_amd64.tar.gz \
  && chmod +x sling

# Install using Scoop (Windows)
scoop bucket add sling https://github.com/slingdata-io/scoop-sling.git
scoop install sling

# Install using Python pip
pip install sling

After installation, verify that Sling is properly installed by running:

# Check sling version
sling --version

For more detailed installation instructions and options, visit the installation guide.

Setting Up Connections

Before we can start replicating data, we need to configure our source (BigQuery) and target (PostgreSQL) connections. Sling provides multiple ways to manage connections, including environment variables, the sling conns command, and a YAML configuration file.

BigQuery Connection Setup

For BigQuery, you’ll need:

  • Google Cloud project ID
  • Service account credentials with appropriate permissions
  • Dataset information
  • Google Cloud Storage bucket (for data transfer)

You can set up the BigQuery connection in several ways:

  1. Using the sling conns set Command
# Set up BigQuery connection using CLI
sling conns set bigquery_source type=bigquery \
  project=<project> \
  dataset=<dataset> \
  gc_bucket=<gc_bucket> \
  key_file=/path/to/service.account.json \
  location=<location>
  2. Using Environment Variables
# Set up using service account JSON content
export GC_KEY_BODY='{"type": "service_account", ...}'
export BIGQUERY_SOURCE='{type: bigquery, project: <project>, dataset: <dataset>, gc_bucket: <gc_bucket>}'
  3. Using Sling Environment YAML

Create or edit ~/.sling/env.yaml:

connections:
  bigquery_source:
    type: bigquery
    project: your-project
    dataset: your_dataset
    gc_bucket: your-bucket
    key_file: /path/to/service.account.json
    location: US  # optional

PostgreSQL Connection Setup

For PostgreSQL, you’ll need:

  • Host address
  • Port number (default: 5432)
  • Database name
  • Username and password
  • Schema (optional)
  • SSL mode (if required)

Here’s how to set up the PostgreSQL connection:

  1. Using the sling conns set Command
# Set up PostgreSQL connection using CLI
sling conns set postgres_target type=postgres \
  host=<host> \
  user=<user> \
  database=<database> \
  password=<password> \
  port=<port> \
  schema=<schema>

# Or use connection URL format
sling conns set postgres_target url="postgresql://user:password@host:5432/database?sslmode=require"
  2. Using Environment Variables
# Set up using connection URL format
export POSTGRES_TARGET='postgresql://user:password@host:5432/database?sslmode=require'
  3. Using Sling Environment YAML

Add to your ~/.sling/env.yaml:

connections:
  postgres_target:
    type: postgres
    host: your-host
    user: your-username
    password: your-password
    database: your-database
    port: 5432
    schema: public
    sslmode: require  # optional

Testing Connections

After setting up your connections, it’s important to verify they work correctly:

# Test BigQuery connection
sling conns test bigquery_source

# Test PostgreSQL connection
sling conns test postgres_target

# List available tables in BigQuery
sling conns discover bigquery_source

You can also manage your connections through the Sling Platform’s web interface:

Sling Connections Management

For more details about connection configuration, visit the environment documentation.

Data Replication Methods

Sling provides multiple ways to replicate data from BigQuery to PostgreSQL. Let’s explore both CLI-based and YAML-based approaches, moving from simple configurations to more advanced use cases.

Using CLI Flags

The quickest way to start a replication is using CLI flags. Here are two examples:

Basic CLI Example

This example shows how to replicate a single table with default settings:

# Replicate a single table from BigQuery to PostgreSQL
sling run \
  --src-conn bigquery_source \
  --src-stream "analytics.daily_sales" \
  --tgt-conn postgres_target \
  --tgt-object "analytics.daily_sales" \
  --tgt-options '{ "column_casing": "snake" }'

Advanced CLI Example

This example demonstrates more advanced options including column selection and incremental updates:

# Replicate with advanced options
sling run \
  --src-conn bigquery_source \
  --src-stream "analytics.customer_orders" \
  --select "order_id, customer_id, order_date, total_amount" \
  --tgt-conn postgres_target \
  --tgt-object "analytics.customer_orders" \
  --mode incremental \
  --primary-key order_id \
  --update-key order_date \
  --tgt-options '{ "column_casing": "snake", "add_new_columns": true, "table_keys": { "unique": ["order_id"] } }'
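Conceptually, incremental mode with an update key works like the sketch below: only source rows newer than the highest update_key value already loaded are transferred. This is illustrative logic, not Sling's actual implementation:

```python
from datetime import date

# Rows already in the target; the watermark is the max update_key loaded so far.
target_rows = [
    {"order_id": 1, "order_date": date(2024, 2, 8)},
    {"order_id": 2, "order_date": date(2024, 2, 9)},
]
watermark = max(r["order_date"] for r in target_rows)

# Source rows: only those past the watermark need to be transferred.
source_rows = [
    {"order_id": 2, "order_date": date(2024, 2, 9)},
    {"order_id": 3, "order_date": date(2024, 2, 10)},
]
new_rows = [r for r in source_rows if r["order_date"] > watermark]
print([r["order_id"] for r in new_rows])  # [3]
```

The primary key then lets the target upsert those rows instead of appending duplicates.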

For more CLI flag options, visit the CLI flags documentation.

Using YAML Configuration

For more complex replication scenarios, YAML configuration files provide better maintainability and reusability. Let’s look at two examples:

Basic YAML Example

Create a file named bigquery_to_postgres.yaml:

# Define source and target connections
source: bigquery_source
target: postgres_target

# Default settings for all streams
defaults:
  mode: full-refresh
  target_options:
    column_casing: snake
    add_new_columns: true

# Define streams to replicate
streams:
  analytics.daily_sales:
    object: analytics.daily_sales
    primary_key: [date, product_id]

  analytics.customer_orders:
    object: analytics.{stream_table}
    primary_key: order_id

Run the replication:

# Run the replication using YAML config
sling run -r bigquery_to_postgres.yaml
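In the basic example above, the object value analytics.{stream_table} uses a runtime variable: {stream_table} resolves to the table portion of the stream name. A rough sketch of the substitution (illustrative only, not Sling's implementation; check the replication documentation for the full variable list):

```python
def render_object(template: str, stream: str) -> str:
    # Split "schema.table" and substitute the runtime variables.
    schema, table = stream.split(".", 1)
    return template.replace("{stream_schema}", schema).replace("{stream_table}", table)

print(render_object("analytics.{stream_table}", "analytics.customer_orders"))
# analytics.customer_orders
```

This is what lets one defaults block serve many streams without repeating table names.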

Advanced YAML Example

Here’s a more complex example that demonstrates various features including runtime variables, custom SQL, and multiple streams:

source: bigquery_source
target: postgres_target

env:
  DATE: ${DATE}  # from env var

defaults:
  mode: incremental
  target_options:
    column_casing: snake
    add_new_columns: true

streams:
  # Stream with custom SQL and runtime variables
  analytics.orders_{DATE}:
    object: analytics.orders
    sql: |
      SELECT *
      FROM analytics.orders
      WHERE DATE(created_at) = '{DATE}'
    primary_key: order_id
    update_key: updated_at

  # Stream with column selection and transforms
  analytics.customers:
    object: analytics.customers
    select:
      - customer_id
      - first_name
      - last_name
      - email
      - -internal_notes  # exclude this column
    transforms:
      email: [lower, trim]
    primary_key: customer_id
    target_options:
      table_keys:
        primary: [customer_id]
        unique: [email]

  # Stream with wildcard pattern
  analytics.events_*:
    object: analytics.events
    mode: full-refresh
    primary_key: event_id

Run the replication with runtime variables:

# Run the replication with a specific date
export DATE=2024-02-10
sling run -r bigquery_to_postgres.yaml
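The transforms entry in the YAML above chains built-in functions per column; for email, lower is applied and then trim. In effect it behaves like this sketch (illustrative, not Sling's implementation):

```python
# Hypothetical per-column transform chain mirroring `email: [lower, trim]`.
TRANSFORMS = {
    "lower": str.lower,
    "trim": str.strip,
}

def apply_transforms(value: str, names: list[str]) -> str:
    # Apply each named transform in order.
    for name in names:
        value = TRANSFORMS[name](value)
    return value

print(apply_transforms("  Jane.Doe@Example.COM ", ["lower", "trim"]))
# jane.doe@example.com
```

Order matters: the functions run left to right as listed in the YAML.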

For more details about replication configuration, visit the replication documentation.

Sling Platform UI

While the CLI provides powerful functionality for local development and automation, the Sling Platform offers a comprehensive web interface for managing and monitoring your data replications at scale.

Platform Overview

The Sling Platform provides:

  • Visual interface for creating and managing data workflows
  • Team collaboration features
  • Monitoring and alerting
  • Centralized connection management
  • Job scheduling and orchestration
  • Agent-based architecture for secure execution

Managing Connections

The platform provides an intuitive interface for managing your connections:

Sling Connections Management

You can:

  • Create and edit connections with a visual form
  • Test connections directly from the UI
  • View and manage connection permissions
  • Share connections with team members

Visual Replication Editor

The platform includes a powerful visual editor for creating and managing replications:

Sling Editor

Features include:

  • Visual stream configuration
  • Syntax highlighting for SQL and YAML
  • Real-time validation
  • Version control integration

Monitoring and Execution

Track your replication jobs with detailed execution information:

Sling Execution

The platform provides:

  • Real-time execution monitoring
  • Detailed logs and error messages
  • Performance metrics and statistics
  • Historical execution data

For more information about the Sling Platform, visit the platform documentation.

Getting Started

Now that we’ve covered the various aspects of using Sling for BigQuery to PostgreSQL data replication, here’s a quick guide to get you started:

  1. Install Sling

    • Choose the appropriate installation method for your system
    • Verify the installation with sling --version
  2. Set Up Connections

    • Configure BigQuery source connection
    • Configure PostgreSQL target connection
    • Test both connections using sling conns test
  3. Start Simple

    • Begin with a basic CLI command to replicate a single table
    • Monitor the replication process
    • Verify the data in PostgreSQL
  4. Scale Up

    • Create a YAML configuration file for multiple streams
    • Add incremental updates and transformations
    • Implement runtime variables for flexibility
  5. Consider Platform Features

    • Sign up for the Sling Platform for advanced features
    • Set up team access and permissions
    • Configure monitoring and alerts

Best Practices

  1. Connection Management

    • Use environment variables or YAML files for connection configuration
    • Keep credentials secure and never commit them to version control
    • Use separate connections for development and production
  2. Replication Configuration

    • Start with simple configurations and gradually add complexity
    • Use YAML files for better maintainability
    • Document your configurations with clear descriptions
  3. Performance Optimization

    • Use appropriate batch sizes for your data volume
    • Implement incremental updates when possible
    • Monitor and adjust configurations based on performance metrics
  4. Monitoring and Maintenance

    • Regularly check replication logs
    • Set up alerts for failed replications
    • Keep Sling updated to the latest version

Next Steps

To learn more about Sling’s capabilities, explore the official documentation. For additional examples and community support, visit the Sling GitHub repository.