CUSTOMER DATA INFRASTRUCTURE
Customer.io's identify and track calls look simple. But Databricks has Delta Lake tables with evolving schemas, Spark ML upgrade likelihood scores with StructType outputs, and Unity Catalog permission boundaries at every integration point. Meiro Pipes resolves the identity gap, adapts to Delta Lake schema evolution in the transform layer, and keeps ML-enriched profiles flowing to Customer.io — without a custom pipeline that breaks every time a data scientist updates a model.
Free trial · No credit card · Live in minutes
Identity is the first structural problem. Customer.io identifies users by a customer id you define, with email as optional. Databricks stores records keyed on internal IDs, Salesforce contact IDs, or other upstream-assigned identifiers. When these don't map to Customer.io's customer id, identify calls create duplicates or miss the intended user — anonymous-to-identified lifecycle merges fail at whichever stage the identifier breaks.
The identify versus track classification is the second problem. Persistent attributes (plan tier, feature flags) belong in identify calls; behavioral events (feature activations, milestones) belong in track calls. Getting this wrong affects segmentation, triggers, and billing. Databricks tables arrive without that label — and Delta Lake schema evolution can shift column types between runs, causing silent failures when Customer.io receives an unexpected property type. B2B teams add another layer: Customer.io Objects require a separate API endpoint, a different schema, and manual object-to-person relationship maintenance.
Customer.io's warehouse export targets Redshift and BigQuery natively — not Databricks. Getting engagement data into Databricks requires S3 exports or a third-party connector. The reverse direction requires direct API integration. Neither is configuration; both are infrastructure work.
Problem
Data scientists update model schemas between notebook runs — new columns, renamed fields, changed types. Delta Lake handles it. The downstream Customer.io sync doesn't. Changed upgrade score columns mean wrong identify calls or silent failures.
Meiro solves it
Pipes is schema-aware at the transform layer. When Delta Lake schemas evolve, you update the transform function — not the pipeline infrastructure. Version-controlled transforms mean schema changes are deliberate and auditable, not silent breaking changes.
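As a minimal sketch of what a schema-tolerant transform looks like, the function below accepts both the old and the renamed score column during a transition window. The field names and fallback logic are illustrative assumptions, not the actual Pipes internals or a real Delta schema.

```javascript
// Hypothetical sketch: a transform that survives a renamed model column.
// Field names are illustrative, not a real Delta Lake schema.
function mapRow(row) {
  // The data science team renamed upgrade_score to upgrade_likelihood_score;
  // accept either name so the sync keeps working across the transition.
  const score = row.upgrade_likelihood_score ?? row.upgrade_score;
  return {
    type: 'identify',
    userId: row.user_id,
    traits: { upgrade_score: score, account_tier: row.account_tier }
  };
}
```

Because the transform is version-controlled, dropping the fallback once the rename settles is a deliberate, reviewable change.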
Problem
Customer.io uses identify for persistent attributes and track for behavioral events. Getting this wrong affects segmentation, triggers, and billing. Databricks data doesn't arrive pre-classified — the identify/track split is a modeling decision that has to be made explicitly.
Meiro solves it
Pipes lets you model your Databricks data before it reaches Customer.io. Decide what becomes a persistent attribute versus a behavioral event at the infrastructure layer — visible, version-controlled, and changeable without touching Customer.io.
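The classification decision can be sketched as a function over one Delta row: persistent state becomes an identify call, occurrences become track events. Column names and the event are illustrative assumptions, not a real schema.

```javascript
// Illustrative sketch: classify one Delta row into an identify call
// (persistent attributes) plus zero or more track calls (behavioral events).
function classify(row) {
  const calls = [{
    type: 'identify',
    userId: row.user_id,
    // Plan tier and account tier are state — they belong on the profile.
    traits: { account_tier: row.account_tier, plan: row.plan }
  }];
  // A milestone is an occurrence, not a state — model it as a track event.
  if (row.milestone_reached) {
    calls.push({
      type: 'track',
      userId: row.user_id,
      event: 'Milestone Reached',
      properties: { milestone: row.milestone_reached }
    });
  }
  return calls;
}
```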
Problem
Spark ML upgrade likelihood models produce DoubleType scores, StructType prediction metadata, and ArrayType feature vectors. Customer.io's API requires flat attribute objects and property dictionaries. Converting Spark ML output types requires explicit transformation logic outside the notebook.
Meiro solves it
Pipes transform functions handle Spark type conversion in the JavaScript sandbox. DoubleType scores become float attributes. StructType metadata gets traversed and mapped to Customer.io traits. ArrayType feature vectors get summarized or selectively extracted. The transform layer bridges the type gap.
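A sketch of that type bridging, assuming StructType fields arrive as nested objects and ArrayType fields as arrays once the Delta row is serialized to JSON. The field names are illustrative, not a real model output schema.

```javascript
// Sketch: flatten Spark ML output types into Customer.io-compatible traits.
function flattenPrediction(row) {
  const summary = row.feature_summary || {};  // StructType -> nested object
  const vector = row.feature_vector || [];    // ArrayType -> plain array
  return {
    upgrade_score: Number(row.upgrade_likelihood_score), // DoubleType -> float
    top_feature: summary.top_feature,   // selectively extract one struct field
    feature_count: vector.length        // summarize the vector, don't send it raw
  };
}
```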
Problem
Databricks model training uses internal customer_id or numeric user IDs. Customer.io expects a customer id and optionally email. When these diverge, identify calls create duplicate profiles or miss the right user. Anonymous-to-known merges fail silently.
Meiro solves it
Pipes resolves identity across every identifier type — email, user_id, anonymous ID, Stripe customer ID, CRM contact ID — using deterministic matching. One unified Customer.io profile, regardless of which identifier Databricks model training used.
Problem
Your data science team builds upgrade likelihood models in Databricks. Outputs land in Delta tables. Getting those scores into Customer.io to trigger upgrade campaigns requires a pipeline that doesn't exist out of the box — and breaks when the model output schema changes.
Meiro solves it
Pipes connects directly to the Delta table where model outputs land. Upgrade likelihood scores become Customer.io identify attributes. Users who cross the upgrade threshold receive a track event that triggers the upgrade campaign. When the model schema evolves, you update the transform, not the pipeline.
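The score-to-campaign path can be sketched as one function: every scored user gets an identify call, and users above the 0.65 threshold also get the campaign-triggering track event. The event name and payload shape are illustrative assumptions.

```javascript
// Sketch: identify everyone, track only threshold-crossers.
const UPGRADE_THRESHOLD = 0.65; // matches the example threshold on this page

function toCalls(row) {
  const calls = [{
    type: 'identify',
    userId: row.user_id,
    traits: { upgrade_score: row.upgrade_score }
  }];
  if (row.upgrade_score > UPGRADE_THRESHOLD) {
    calls.push({
      type: 'track',
      userId: row.user_id,
      event: 'Upgrade Threshold Crossed', // illustrative event name
      properties: { score: row.upgrade_score }
    });
  }
  return calls;
}
```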
Customer.io engagement data — email opens, clicks, conversions, campaign events — flows into Pipes via webhook or export. Events land without replacing your existing Customer.io setup.
Events land in Databricks Delta tables automatically. Pipes connects via Unity Catalog — browse schemas, map columns, join with Spark ML model outputs or feature store tables. Databricks stays your source of truth for ML-enriched user intelligence.
Pipes stitches profiles across Customer.io customer ids, email addresses, Databricks customer_ids, and model training identifiers. Deterministic matching with configurable limits. Full lifecycle coverage from anonymous to paid.
Enriched profiles push back to Customer.io via correctly structured identify calls and track events. Spark ML type conversions handled in the transform layer. Delta schema evolution absorbed at the transform layer. Scheduled or real time.
Your data science team builds an upgrade likelihood model using Spark ML in Databricks. The model scores SaaS users on their probability of converting from free to paid, producing a Delta table with customer_id, upgrade_likelihood_score (DoubleType), account_tier, and a StructType feature_summary. Users who score above 0.65 should receive a targeted upgrade campaign in Customer.io.
The problem: the Delta table schema changed last week — the data science team added a confidence_interval field and renamed upgrade_score to upgrade_likelihood_score. Customer.io identifies users by customer id, not the internal customer_id the model uses. The StructType feature_summary needs to be unpacked before it can become a Customer.io attribute.
Without Meiro: You'd write a Databricks job that queries the Delta table using Spark SQL (::DOUBLE casts and DATEADD(DAY, -1, CURRENT_DATE()) for change detection), resolves Customer.io customer id from internal customer_id, converts StructType fields manually, classifies high-scoring users as identify calls (persistent attribute update) versus track calls (milestone event), and pushes via the Customer.io API. Every model schema change requires a pipeline rewrite.
With Meiro Pipes: The Delta table is connected via Unity Catalog. A Spark SQL query with DATEADD(DAY, -1, CURRENT_DATE()) fetches recent model outputs. The Pipes transform handles StructType traversal and type coercion in the JavaScript sandbox — the renamed field gets mapped to the correct Customer.io attribute without a pipeline rewrite. Pipes resolves internal customer_id to Customer.io customer id using the identity graph. Upgrade likelihood scores push as identify attributes. Users above the 0.65 threshold receive a track event that fires the upgrade campaign flow in Customer.io.
Time from Spark ML model output to triggered Customer.io upgrade campaign: hours, not sprints.
Your Databricks Delta table
SELECT
user_id,
email,
upgrade_likelihood_score::DOUBLE AS upgrade_score,
account_tier,
last_active_date
FROM catalog.ml_outputs.upgrade_scores
WHERE updated_at > DATEADD(DAY, -1, CURRENT_DATE())
Pipes transform
// Pipes send function (Event Destination)
async function send(payload, headers) {
return payload.events.map(row => ({
type: 'identify',
userId: row.user_id,
traits: {
email: row.email,
upgrade_score: row.upgrade_score,
account_tier: row.account_tier,
last_active_date: row.last_active_date
}
}));
}
What Customer.io receives
{
"type": "identify",
"userId": "usr_8472",
"traits": {
"email": "[email protected]",
"upgrade_score": 0.82,
"account_tier": "enterprise",
"last_active_date": "2026-03-15"
}
}
No custom API client code. Spark ML type conversion handled in the transform layer — not in Databricks notebooks. When the Delta table schema evolves, you update the transform function, not the pipeline infrastructure.
The standard stack
Meiro Pipes
A reverse ETL tool syncs rows. It doesn't handle Delta Lake schema evolution gracefully, convert Spark ML output types, or resolve lifecycle identity. Meiro Pipes does all of that — and the pipeline that remains is one your team can actually understand.
You want to trigger Customer.io upgrade campaigns, churn prevention flows, and retention sequences based on ML scores your data science team produces in Databricks — signals that exist today but never make it to Customer.io.
You're tired of maintaining the Databricks → Customer.io pipeline. The customer id resolution. The Spark ML type conversion code. The sync job that breaks silently every time a data scientist updates the model output schema.
Native connector. Sends identify calls (user attributes) and track calls (behavioral events) to Customer.io in the correct API format. Handles timestamp formatting, property serialization, and B2B Object API calls with relationship mapping.
Direct connection via Unity Catalog. Supports Spark SQL syntax including ::DOUBLE casts, DATEADD(DAY, -1, CURRENT_DATE()), and Delta table references. Browse catalogs, schemas, and tables. Model warehouse data as identify attributes, track events, or B2B Object records.
Deterministic stitching across Customer.io customer id, email, user_id, anonymous ID, Stripe ID, and CRM IDs. Full lifecycle coverage from anonymous visitor through paid customer. Configurable merge limits to prevent false merges.
Sandboxed JavaScript functions for schema translation. Handle Spark ML type conversions — DoubleType, StructType, ArrayType — to Customer.io-compatible flat JSON. Classify data as identify or track calls. Adapts to Delta Lake schema evolution without pipeline rewrites. 47 allowlisted packages available.
Scheduled or real-time Live Profile Sync. Delta table watermark-based change detection. Push ML-enriched profiles and events to Customer.io via identify and track calls. Full delivery history and retry logic.
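Watermark-based change detection can be sketched as two small steps: fetch only rows past the last watermark, then advance the watermark to the newest timestamp seen. How Pipes actually stores the watermark is internal; the query builder and field names below are assumptions.

```javascript
// Sketch: incremental sync driven by an updated_at watermark.
function buildIncrementalQuery(table, lastWatermark) {
  // lastWatermark is an ISO timestamp recorded after the previous run.
  return `SELECT * FROM ${table} WHERE updated_at > '${lastWatermark}'`;
}

function advanceWatermark(rows, lastWatermark) {
  // The next watermark is the max updated_at seen in this batch
  // (ISO timestamps compare correctly as strings).
  return rows.reduce(
    (max, r) => (r.updated_at > max ? r.updated_at : max),
    lastWatermark
  );
}
```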
Model Databricks company and account records as Customer.io Objects. Pipes handles the Object API endpoint, schema differences, and person-to-object relationship maintenance — so B2B teams can sync account context alongside person records from Delta tables.
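As a rough sketch of the modeling step, an account row might map to an object record with a person relationship like the shape below. The field names and payload structure are illustrative assumptions — Pipes produces the actual Customer.io Object API format.

```javascript
// Hypothetical sketch: an account row mapped to an object record
// with a person-to-object relationship. Not the real API payload.
function toObjectCall(accountRow, personId) {
  return {
    type: 'object',
    objectTypeId: 'account',            // assumed object type
    objectId: accountRow.account_id,
    attributes: {
      name: accountRow.account_name,
      plan: accountRow.plan
    },
    // Relationship: this person belongs to this account.
    relationships: [{ userId: personId }]
  };
}
```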
Delta Lake schema evolution is the first structural problem. Data science teams iterate on models between deployments. Delta Lake handles schema changes automatically. Downstream sync pipelines don't. A renamed field or a new confidence interval column silently breaks the Customer.io identify call that was working last week. A durable integration needs to be schema-aware at the transform layer.
The identify versus track decision is the second structural problem. Persistent user attributes — upgrade likelihood score, account tier, feature adoption flags — belong in identify calls. Behavioral occurrences — milestone completions, API calls, feature activations — belong in track calls. Getting this classification wrong affects segmentation, trigger logic, and billing. Databricks data arrives as rows in Delta tables. The identify/track classification is a modeling decision that has to be made explicitly and maintained when the underlying data model changes.
Spark ML type mapping adds a third layer. Databricks MLflow and Spark ML model outputs carry Spark-native types — DoubleType scores, StructType prediction metadata, ArrayType feature vectors — that Customer.io's API cannot consume directly. Converting these types requires explicit transformation logic that lives outside the Databricks notebook.
Identity reconciliation is the fourth gap. Databricks stores customer records using whatever identifier the model training pipeline used. Customer.io identifies users by a customer id you define, with email as an optional secondary identifier. When these don't reconcile, identify calls create duplicate profiles or miss the intended user.
Connect Databricks and Customer.io through Meiro Pipes. Identity-resolved. Schema-aware. Bidirectional. Start free.