Export Data from Insider One to Databricks

This guide explains how to export data from Insider One to Databricks using cloud object storage as the intermediary data layer.

You can currently use these options:

  • Option 1: Export via Amazon S3

  • Option 2: Export via Google Cloud Storage (GCS)

Both methods leverage Insider One’s native export capabilities and Databricks’ ability to ingest data directly from cloud storage buckets.

How does the integration work?

The integration between Insider One and Databricks follows this high-level workflow:

  1. Insider One exports event or user data to a customer-owned cloud storage bucket (Amazon S3 or Google Cloud Storage).

  2. Databricks reads data from that bucket using customer-configured ingestion pipelines.

  3. Data is processed, transformed, or queried in Databricks according to customer needs.

Prerequisites

Before setting up the export to Databricks, make sure you meet the following requirements across Insider One, your cloud provider, and Databricks.

  • Insider One

    • Access to an Insider One account with permissions to configure data exports

    • Ability to configure Amazon S3 or Google Cloud Storage as an export destination

  • Amazon Web Services (AWS) (if using Amazon S3)

    • An AWS account

    • Permissions to create and manage S3 buckets

    • IAM permissions allowing Insider One to write data to the selected bucket

  • Google Cloud Platform (GCP) (if using Google Cloud Storage)

    • A GCP project

    • Permissions to create and manage Google Cloud Storage buckets

    • Service account and IAM permissions allowing Insider One to write data to the selected bucket

  • Databricks

    • Access to a Databricks workspace

    • Permissions to configure data ingestion from your chosen cloud storage (S3 or GCS)

After meeting the prerequisites, select the cloud storage provider that fits your infrastructure and follow the corresponding setup steps below.

Option 1: Export via Amazon S3

Follow the steps below to export data from Insider One to Amazon S3 and configure Databricks to ingest the exported files.

Step 1: Set Up Amazon S3 Export in Insider One

Insider One supports exporting data directly to Amazon S3.

Using the Insider One dashboard, you can:

  • Select or create an Amazon S3 bucket for export

  • Configure IAM credentials and bucket policies that allow Insider One to write data

  • Choose export frequency and data types (for example, events or user attributes)

  • Validate the integration and monitor exported files in S3

Refer to Export Data to Amazon S3 for detailed information.
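If you manage the bucket policy yourself, the sketch below shows one way to let an external principal write objects into the export bucket using boto3. The bucket name and the Insider One principal ARN are placeholders, and the exact permissions Insider One requires are described in Export Data to Amazon S3.

    import json
    import boto3

    BUCKET = "your-insider-one-export-bucket"  # placeholder bucket name
    INSIDER_PRINCIPAL = "arn:aws:iam::111111111111:role/insider-one-export"  # placeholder ARN

    # Bucket policy allowing the external principal to write exported files into the bucket.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowInsiderOneWrites",
                "Effect": "Allow",
                "Principal": {"AWS": INSIDER_PRINCIPAL},
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{BUCKET}/*",
            }
        ],
    }

    boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))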

Step 2: Ingest Data from Amazon S3 into Databricks

Once data is available in Amazon S3, it can be consumed by Databricks using ingestion pipelines. Databricks supports multiple approaches for ingesting data from Amazon S3, including:

  • Continuous ingestion for near real-time or incremental processing
    (for example, streaming-based ingestion using Auto Loader)

  • Batch ingestion for scheduled, ad-hoc, or backfill workloads
    (for example, using COPY INTO)

The choice of ingestion method depends on:

  • Desired data freshness

  • Pipeline architecture

  • Operational preferences of your team

Insider One does not require or enforce a specific Databricks ingestion method.
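As an illustration of the continuous approach, the sketch below uses Auto Loader in a Databricks notebook or job to incrementally load newly exported files into a Delta table. The S3 paths, file format (JSON), and table name are assumptions; adjust them to match your actual export configuration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # already defined in Databricks notebooks

    # Incrementally read new files exported by Insider One from S3.
    events = (
        spark.readStream
        .format("cloudFiles")                    # Auto Loader source
        .option("cloudFiles.format", "json")     # assumed format of the exported files
        .option("cloudFiles.schemaLocation",
                "s3://your-export-bucket/_schemas/insider_one_events")
        .load("s3://your-export-bucket/insider-one/events/")
    )

    # Write into a Delta table, processing all available files and then stopping.
    (
        events.writeStream
        .option("checkpointLocation",
                "s3://your-export-bucket/_checkpoints/insider_one_events")
        .trigger(availableNow=True)              # omit for a continuously running stream
        .toTable("insider_one_events")
    )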

Refer to the Databricks documentation on Auto Loader and COPY INTO for implementation details.

Option 2: Export via Google Cloud Storage (GCS)

Follow the steps below to export data from Insider One to Google Cloud Storage and configure Databricks to ingest the exported files.

Step 1: Set Up Google Cloud Storage Export in Insider One

Insider One also supports exporting data directly to Google Cloud Storage.

Using the Insider One dashboard, you can:

  • Select or create a Google Cloud Storage bucket for export

  • Configure a GCP service account, key, and bucket IAM permissions that allow Insider One to write data

  • Choose export frequency and data types (for example, events or user attributes)

  • Validate the integration and monitor exported files in GCS

Refer to Export Data to Google Cloud Storage for further details.
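If you manage bucket permissions yourself, the sketch below shows one way to grant a service account object-creation rights on the export bucket with the google-cloud-storage client. The bucket name, service account email, and role are placeholders; the exact setup Insider One expects is described in Export Data to Google Cloud Storage.

    from google.cloud import storage

    BUCKET = "your-insider-one-export-bucket"  # placeholder bucket name
    SERVICE_ACCOUNT = "insider-one-export@your-project.iam.gserviceaccount.com"  # placeholder

    client = storage.Client()
    bucket = client.bucket(BUCKET)

    # Grant the service account permission to create objects in the bucket.
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append(
        {
            "role": "roles/storage.objectCreator",
            "members": {f"serviceAccount:{SERVICE_ACCOUNT}"},
        }
    )
    bucket.set_iam_policy(policy)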

Step 2: Ingest Data from Google Cloud Storage into Databricks

Once data is available in Google Cloud Storage, it can be consumed by Databricks using ingestion pipelines. Databricks supports multiple approaches for ingesting data from Google Cloud Storage, including:

  • Continuous ingestion for near real-time or incremental processing
    (for example, streaming-based ingestion using Auto Loader)

  • Batch ingestion for scheduled, ad-hoc, or backfill workloads
    (for example, using COPY INTO)

The choice of ingestion method depends on:

  • Desired data freshness

  • Pipeline architecture

  • Operational preferences of your team

Insider One does not require or enforce a specific Databricks ingestion method.
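As an illustration of the batch approach, the sketch below runs COPY INTO from a Databricks notebook or job to load exported files into an existing Delta table. The GCS path, file format (JSON), and table name are assumptions; adjust them to match your actual export configuration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # already defined in Databricks notebooks

    # Idempotently load any new exported files into an existing Delta table.
    spark.sql("""
        COPY INTO insider_one_events
        FROM 'gs://your-export-bucket/insider-one/events/'
        FILEFORMAT = JSON
        COPY_OPTIONS ('mergeSchema' = 'true')
    """)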

Refer to the Databricks documentation on Auto Loader and COPY INTO for implementation details.

Automation and Monitoring

To operate a reliable pipeline:

  • Configure Insider One export schedules to match your desired data refresh cadence

  • Use Databricks monitoring and alerting tools to track ingestion health and troubleshoot issues
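Databricks provides built-in job, pipeline, and query monitoring in its UI; as a lightweight complement, the sketch below checks the health of active streaming ingestion queries from a notebook. The printed fields are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # already defined in Databricks notebooks

    for query in spark.streams.active:
        progress = query.lastProgress            # metrics from the most recent micro-batch, or None
        if progress is not None:
            print(query.name, progress["timestamp"], progress["numInputRows"])
        if query.exception() is not None:        # surfaces ingestion failures
            print(f"Query {query.name} failed: {query.exception()}")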