Meiro Pipes Integration

Connect Amplitude and Databricks. Model outputs that actually reach the right user.

Amplitude captures product events. Databricks runs the models — churn scores, LTV predictions, PQL scoring. Pipes resolves identity and makes sure the output reaches the right Amplitude user, not a partial match.

Talk to a Consultant

Free trial · No credit card · Live in minutes

Two teams. Same broken pipe.

You're in Amplitude. You can see who's active, who's dropping off, which features drive conversion. What you can't see is whether the churning users are the same ones your sales team has open deals with, or whether the inactive accounts are enterprise customers with high expansion potential.

Your data team has that context in Databricks — CRM, billing, product usage, all joined on Delta Lake, probably with a churn or PQL model on top. Getting it into Amplitude as user properties would let you build cohorts that matter. But when the sync runs, warehouse records are keyed differently from how Amplitude tracked those users before they authenticated. Properties arrive. Some land on the wrong user. Others don't land at all.

The Real Problem

Why connecting Amplitude and Databricks requires more than a connector

Amplitude doesn't have a native Databricks connector — events reach Databricks via S3 export, adding latency and a schema translation step before data is usable for modeling. On the return leg, getting Databricks ML outputs back into Amplitude requires a reverse ETL connector that reads from Delta Lake and calls Amplitude's Identify API.

Delta Lake schema evolution is a specific failure mode. Delta Lake allows column types to change between pipeline writes — a feature for iterative ML workflows. But Amplitude's Data Management layer enforces property types at ingestion: a property whose type changes between syncs is silently dropped. You won't see it in the reverse ETL delivery log. You'll see it when your product manager notices that churn_risk_score stopped updating and traces it back to a model iteration three weeks ago.
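One way to catch this class of drift is a plain type comparison between the current Delta column types and the property types the last successful sync carried — run before anything is pushed. The function and schema maps below are an illustrative sketch, not Pipes internals:

```python
# Illustrative pre-sync guard: compare current Delta column types against
# the types recorded at the last sync. Names and type strings are
# hypothetical, not Pipes internals.

def detect_type_drift(delta_schema: dict, last_synced_schema: dict) -> list:
    """Return (column, old_type, new_type) for every column whose type changed."""
    drift = []
    for column, new_type in delta_schema.items():
        old_type = last_synced_schema.get(column)
        if old_type is not None and old_type != new_type:
            drift.append((column, old_type, new_type))
    return drift

# A model iteration widens churn_risk_score from FLOAT to DOUBLE:
previous = {"user_id": "STRING", "churn_risk_score": "FLOAT"}
current = {"user_id": "STRING", "churn_risk_score": "DOUBLE"}

drift = detect_type_drift(current, previous)
# → [("churn_risk_score", "FLOAT", "DOUBLE")]
# Fail the sync loudly here, instead of letting the property vanish downstream.
```

The point of the sketch: the check is cheap and deterministic, so there is no reason for a type change to surface weeks later as a missing property.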

Identity is the deeper problem. Amplitude's internal identity graph merges anonymous device sessions into authenticated user records. Databricks has no visibility into that graph. ML models built in Databricks produce output records keyed on whatever identifier the training data carried — often email from CRM, account_id from the product database, or customer_id from billing. A reverse ETL connector maps one identifier to one Amplitude user. Multi-device users, users who converted from anonymous sessions, and users whose Databricks identifier doesn't match their Amplitude user_id all receive partial or incorrect enrichment. The model is correct. The activation isn't.
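The gap is easiest to see side by side: a one-identifier connector returns at most one profile per key, while graph-based resolution reaches every merged profile. The data and helper names below are hypothetical, chosen only to show the contrast:

```python
# Illustrative contrast between one-to-one identifier mapping and
# identity-graph resolution. Data and function names are hypothetical.

identity_graph = {
    # One Databricks account_id can map to several Amplitude user_ids:
    # multi-seat accounts, multi-device users, converted anonymous sessions.
    "acct_991": ["u_17", "u_204", "u_611"],
}

def naive_reverse_etl(account_id: str) -> list:
    # Typical connector behavior: one identifier in, one user out.
    users = identity_graph.get(account_id, [])
    return users[:1]

def graph_resolved(account_id: str) -> list:
    # Resolution across the identity graph reaches every merged profile.
    return identity_graph.get(account_id, [])

naive_reverse_etl("acct_991")  # ["u_17"] — two users never receive the score
graph_resolved("acct_991")     # ["u_17", "u_204", "u_611"]
```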

Pipes resolves identity across device_id, user_id, email, account_id, and any identifier your Databricks records carry — before data moves. Type changes from Delta Lake schema evolution are surfaced as visible transform-layer errors before reaching Amplitude's ingestion API, not discovered weeks later via a missing property.

One platform. Collect, resolve, model, activate.

1

Collect

Pipes connects to Amplitude via its export API and warehouse connector. Events are ingested on a scheduled or near-real-time basis — no replacement of your existing Amplitude SDK or tracking plan required.

2

Load & Model

Events land in your Databricks warehouse automatically. Pipes connects directly — browse tables, map columns, model data. Your warehouse stays your source of truth.

3

Resolve Identity

Pipes stitches user profiles across Amplitude events and Databricks records using deterministic matching on email, user_id, device_id, or any identifier you define. Configurable merge limits prevent false matches on shared devices. No probabilistic guesswork.
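A merge limit is what keeps deterministic stitching safe on shared devices: if merging would push a profile past the identifier cap, the merge is refused rather than guessed. The sketch below assumes a simplified profile structure and a hypothetical `max_identifiers` parameter to show the mechanic:

```python
# Sketch of deterministic stitching with a merge cap. The profile structure
# and `max_identifiers` parameter are illustrative, not the Pipes API.

def stitch(profiles: list, match_key: str, max_identifiers: int) -> list:
    """Merge profiles sharing the same `match_key` value; skip any merge
    that would exceed the identifier cap (e.g. a shared kiosk device
    accumulating hundreds of user_ids)."""
    merged = {}
    for p in profiles:
        key = p.get(match_key)
        if key is None:
            continue  # no deterministic evidence — never guess
        bucket = merged.setdefault(key, {"identifiers": set()})
        candidate = bucket["identifiers"] | set(p["identifiers"])
        if len(candidate) <= max_identifiers:
            bucket["identifiers"] = candidate
        # else: skip the merge rather than create a false match
    return [sorted(b["identifiers"]) for b in merged.values()]

events = [
    {"email": "ana@acme.io", "identifiers": {"device_abc", "u_17"}},
    {"email": "ana@acme.io", "identifiers": {"device_xyz"}},
]
stitch(events, "email", max_identifiers=5)
# one profile: ['device_abc', 'device_xyz', 'u_17']
```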

4

Activate

Enriched profiles and segments flow back into Amplitude via scheduled or real-time sync. Your growth team gets warehouse-enriched cohorts directly in the tool they already use — no reverse ETL vendor required.

Use case: Product-qualified lead scores from Databricks to Amplitude

Your data team builds a product-qualified lead (PQL) model in Databricks. It combines Amplitude behavioral signals — feature depth, API call volume, team size in-product — with Salesforce data: deal stage, account tier, last sales contact. The model writes a pql_score and recommended_action per account to a Delta Lake table.

You want sales and growth teams to filter Amplitude cohorts by pql_score — "high product engagement, not yet in active sales cycle" — without CSVs or Databricks access.

Without Pipes: you write a reverse ETL job that reads the Delta Lake output and calls Amplitude's Identify API. The model output is keyed on account_id. Amplitude users are keyed on user_id. Individual users within an account have different device_id histories from before they authenticated. The mapping breaks for users who joined the product before the account existed in Salesforce. A model iteration changes pql_score from FLOAT to DOUBLE — Amplitude drops the property silently. By the time sales runs their outreach sequence, the cohort is stale and missing a third of the accounts it should contain.

With Pipes: the Delta Lake output table is a Databricks source. Pipes resolves account_id to individual Amplitude user_ids via the identity graph, handling multi-user accounts and anonymous-to-authenticated transitions. The FLOAT→DOUBLE type change is caught in the transform layer and surfaces as a fixable error before the API call. The pql_score reaches the right Amplitude users. Cohorts built on it reflect the actual model output.
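The activation step above reduces to a fan-out: resolve each account-keyed model row to individual user_ids, then shape one user-properties object per user. The identity map and scores below are hypothetical; the `{user_id, user_properties}` shape follows the objects Amplitude's Identify API accepts:

```python
# Sketch of fanning an account-keyed PQL score out to individual users.
# The identity map and model rows are hypothetical; the payload entries
# follow Amplitude's Identify API {user_id, user_properties} shape.

account_to_users = {"acct_991": ["u_17", "u_204"], "acct_350": ["u_611"]}

model_output = [  # one row per account from the Delta Lake table
    {"account_id": "acct_991", "pql_score": 0.87, "recommended_action": "sales_outreach"},
    {"account_id": "acct_350", "pql_score": 0.31, "recommended_action": "nurture"},
]

def build_identify_payload(rows: list) -> list:
    payload = []
    for row in rows:
        # Resolve the account to every Amplitude user behind it.
        for user_id in account_to_users.get(row["account_id"], []):
            payload.append({
                "user_id": user_id,
                "user_properties": {
                    "pql_score": row["pql_score"],
                    "recommended_action": row["recommended_action"],
                },
            })
    return payload

build_identify_payload(model_output)  # 3 entries — one per resolved user
```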

The pain is real

Extracting full value usually requires a dedicated analyst or someone with strong technical skills to manage schemas, plan taxonomies, and validate events.
— Amplitude user review, G2
A fragile pipeline for your customer behavioral tool will often lead to missing and inaccurate data and require a full-time team dedicated to maintaining it.
— Data engineering community, 2024

Under the hood

Amplitude Connector

Connects to Amplitude via its export API and warehouse connector. Ingests events on a scheduled or near-real-time basis. Supports event filtering and transformation via Pipes sandbox functions. No replacement of your existing Amplitude SDK.

Databricks Connector

Direct Databricks connection via personal access token and SQL warehouse endpoint. Browse Unity Catalog schemas, Delta Lake tables, and column definitions. Map identifier columns to Meiro identity types. Handles Delta Lake schema evolution: column type changes between pipeline runs are surfaced as transform-layer errors before reaching Amplitude's ingestion API rather than failing silently after.
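The schema-browsing step can be sketched as a query against Unity Catalog's information_schema, the same metadata a connector would read to detect type changes between runs. Table names and the helper are illustrative; the connection snippet in the comment assumes the open-source databricks-sql-connector package:

```python
# Illustrative introspection: read column names and types for a Delta table
# from Unity Catalog's information_schema. Catalog/schema/table names here
# are hypothetical.

def column_types_sql(catalog: str, schema: str, table: str) -> str:
    return (
        f"SELECT column_name, data_type "
        f"FROM {catalog}.information_schema.columns "
        f"WHERE table_schema = '{schema}' AND table_name = '{table}' "
        f"ORDER BY ordinal_position"
    )

# With databricks-sql-connector (pip install databricks-sql-connector),
# the query would run against a SQL warehouse endpoint:
#
#   from databricks import sql
#   with sql.connect(server_hostname=HOST, http_path=WAREHOUSE_PATH,
#                    access_token=TOKEN) as conn:
#       with conn.cursor() as cur:
#           cur.execute(column_types_sql("main", "ml", "pql_scores"))
#           current_schema = dict(cur.fetchall())

print(column_types_sql("main", "ml", "pql_scores"))
```

Comparing `current_schema` against the previous run's snapshot is what turns Delta schema evolution into a visible pre-sync error instead of a silent drop.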

Identity Resolution

Deterministic stitching across identifier types: email, user_id, device_id, cookie. Configurable merge limits (maxIdentifiers) and an identifier priority hierarchy prevent false merges. No probabilistic matching.

Reverse ETL / Profile Sync

Scheduled exports or real-time Live Profile Sync. Push enriched profiles and audience segments back to Amplitude or any downstream destination via custom send functions.

Transform Layer

Sandboxed JavaScript functions for event transformation, filtering, and enrichment. Run inline — no external orchestrator needed.

Self-Hosted Option

Deploy on your own infrastructure for full data sovereignty. Or use Meiro Cloud. Your data never leaves your perimeter unless you want it to.

Live in minutes, not months

1

Connect Amplitude

Add Amplitude as a Source via its export API or warehouse connector. Events start landing in your pipeline.

2

Connect Databricks

Add your Databricks credentials. Browse tables, map identifiers, start modeling.

3

Resolve & Activate

Pipes stitches identity across both systems. Push enriched profiles back to Amplitude or anywhere in your stack.

Stop shipping model outputs to the wrong Amplitude users.

Connect Amplitude and Databricks through Pipes. Resolve identity across device, user, and account. Start free.

Talk to a Consultant