Export Data from Insider One to Databricks

This guide explains how to export data from Insider One to Databricks using cloud object storage as the intermediary data layer.

You can currently use these options:

  • Option 1: Export via Amazon S3

  • Option 2: Export via Google Cloud Storage (GCS)

Both methods leverage Insider One’s native export capabilities and Databricks’ ability to ingest data directly from cloud storage buckets.

How does the integration work?

The integration between Insider One and Databricks follows this high-level workflow:

  1. Insider One exports event or user data to a customer-owned cloud storage bucket (Amazon S3 or Google Cloud Storage).

  2. Databricks reads data from that bucket using customer-configured ingestion pipelines.

  3. Data is processed, transformed, or queried in Databricks according to customer needs.

Prerequisites

Before setting up the export to Databricks, make sure you meet the following requirements across Insider One, your cloud provider, and Databricks.

  • Insider One

    • Access to an Insider One account with permissions to configure data exports

    • Ability to configure Amazon S3 or Google Cloud Storage as an export destination

  • Amazon Web Services (AWS) (if using Amazon S3)

    • An AWS account

    • Permissions to create and manage S3 buckets

    • IAM permissions allowing Insider One to write data to the selected bucket

  • Google Cloud Platform (GCP) (if using Google Cloud Storage)

    • A GCP project

    • Permissions to create and manage Google Cloud Storage buckets

    • Service account and IAM permissions allowing Insider One to write data to the selected bucket

  • Databricks

    • Access to a Databricks workspace

    • Permissions to configure data ingestion from your chosen cloud storage (S3 or GCS)

After meeting the prerequisites, select the cloud storage provider that fits your infrastructure and follow the corresponding setup steps below.

Option 1: Export via Amazon S3

Follow the steps below to export data from Insider One to Amazon S3 and configure Databricks to ingest the exported files.

Step 1: Set Up Amazon S3 Export in Insider One

Insider One supports exporting data directly to Amazon S3.

Using the Insider One dashboard, you can:

  • Select or create an Amazon S3 bucket for export

  • Configure IAM credentials and bucket policies that allow Insider One to write data

  • Choose export frequency and data types (for example, events or user attributes)

  • Validate the integration and monitor exported files in S3

Refer to Export Data to Amazon S3 for detailed information.
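If you manage the bucket policy yourself, the sketch below shows one way to let an external principal write objects into the export bucket using boto3. The bucket name and the Insider One principal ARN are placeholders, and the exact permissions Insider One requires are described in Export Data to Amazon S3.

    import json
    import boto3

    BUCKET = "your-insider-one-export-bucket"  # placeholder bucket name
    INSIDER_PRINCIPAL = "arn:aws:iam::111111111111:role/insider-one-export"  # placeholder ARN

    # Bucket policy allowing the external principal to write exported files into the bucket.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowInsiderOneWrites",
                "Effect": "Allow",
                "Principal": {"AWS": INSIDER_PRINCIPAL},
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{BUCKET}/*",
            }
        ],
    }

    boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))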

Step 2: Ingest Data from Amazon S3 into Databricks

Once data is available in Amazon S3, it can be consumed by Databricks using ingestion pipelines. Databricks supports multiple approaches for ingesting data from Amazon S3, including:

  • Continuous ingestion for near real-time or incremental processing
    (for example, streaming-based ingestion using Auto Loader)

  • Batch ingestion for scheduled, ad-hoc, or backfill workloads
    (for example, using COPY INTO)

The choice of ingestion method depends on:

  • Desired data freshness

  • Pipeline architecture

  • Operational preferences of your team

Insider One does not require or enforce a specific Databricks ingestion method.
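As an illustration of the continuous approach, the sketch below uses Auto Loader in a Databricks notebook or job to incrementally load newly exported files into a Delta table. The S3 paths, file format (JSON), and table name are assumptions; adjust them to match your actual export configuration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # already defined in Databricks notebooks

    # Incrementally read new files exported by Insider One from S3.
    events = (
        spark.readStream
        .format("cloudFiles")                    # Auto Loader source
        .option("cloudFiles.format", "json")     # assumed format of the exported files
        .option("cloudFiles.schemaLocation",
                "s3://your-export-bucket/_schemas/insider_one_events")
        .load("s3://your-export-bucket/insider-one/events/")
    )

    # Write into a Delta table, processing all available files and then stopping.
    (
        events.writeStream
        .option("checkpointLocation",
                "s3://your-export-bucket/_checkpoints/insider_one_events")
        .trigger(availableNow=True)              # omit for a continuously running stream
        .toTable("insider_one_events")
    )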

Refer to the Databricks documentation on Auto Loader and COPY INTO for implementation details.

Option 2: Export via Google Cloud Storage (GCS)

Follow the steps below to export data from Insider One to Google Cloud Storage and configure Databricks to ingest the exported files.

Step 1: Set Up Google Cloud Storage Export in Insider One

Insider One also supports exporting data directly to Google Cloud Storage.

Using the Insider One dashboard, you can:

  • Select or create a Google Cloud Storage bucket for export

  • Configure a GCP service account, key, and bucket IAM permissions that allow Insider One to write data

  • Choose export frequency and data types (for example, events or user attributes)

  • Validate the integration and monitor exported files in GCS

Refer to Export Data to Google Cloud Storage for further details.
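If you manage bucket permissions yourself, the sketch below shows one way to grant a service account object-creation rights on the export bucket with the google-cloud-storage client. The bucket name, service account email, and role are placeholders; the exact setup Insider One expects is described in Export Data to Google Cloud Storage.

    from google.cloud import storage

    BUCKET = "your-insider-one-export-bucket"  # placeholder bucket name
    SERVICE_ACCOUNT = "insider-one-export@your-project.iam.gserviceaccount.com"  # placeholder

    client = storage.Client()
    bucket = client.bucket(BUCKET)

    # Grant the service account permission to create objects in the bucket.
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append(
        {
            "role": "roles/storage.objectCreator",
            "members": {f"serviceAccount:{SERVICE_ACCOUNT}"},
        }
    )
    bucket.set_iam_policy(policy)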

Step 2: Ingest Data from Google Cloud Storage into Databricks

Once data is available in Google Cloud Storage, it can be consumed by Databricks using ingestion pipelines. Databricks supports multiple approaches for ingesting data from Google Cloud Storage, including:

  • Continuous ingestion for near real-time or incremental processing
    (for example, streaming-based ingestion using Auto Loader)

  • Batch ingestion for scheduled, ad-hoc, or backfill workloads
    (for example, using COPY INTO)

The choice of ingestion method depends on:

  • Desired data freshness

  • Pipeline architecture

  • Operational preferences of your team

Insider One does not require or enforce a specific Databricks ingestion method.
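As an illustration of the batch approach, the sketch below runs COPY INTO from a Databricks notebook or job to load exported files into an existing Delta table. The GCS path, file format (JSON), and table name are assumptions; adjust them to match your actual export configuration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # already defined in Databricks notebooks

    # Idempotently load any new exported files into an existing Delta table.
    spark.sql("""
        COPY INTO insider_one_events
        FROM 'gs://your-export-bucket/insider-one/events/'
        FILEFORMAT = JSON
        COPY_OPTIONS ('mergeSchema' = 'true')
    """)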

Refer to the Databricks documentation on Auto Loader and COPY INTO for implementation details.

Automation and Monitoring

To operate a reliable pipeline:

  • Configure Insider One export schedules to match your desired data refresh cadence

  • Use Databricks monitoring and alerting tools to track ingestion health and troubleshoot issues
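Databricks provides built-in job, pipeline, and query monitoring in its UI; as a lightweight complement, the sketch below checks the health of active streaming ingestion queries from a notebook. The printed fields are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # already defined in Databricks notebooks

    for query in spark.streams.active:
        progress = query.lastProgress            # metrics from the most recent micro-batch, or None
        if progress is not None:
            print(query.name, progress["timestamp"], progress["numInputRows"])
        if query.exception() is not None:        # surfaces ingestion failures
            print(f"Query {query.name} failed: {query.exception()}")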