HubSpot and Databricks Integration: Transform and Sync Data Easily

Written by Ralph Vugts | Jul 8, 2024 11:03:57 PM

Databricks is a fantastic data lake/warehouse option for storing and centralizing your business data. It's extremely easy to feed data into Databricks as they have a wide range of existing connectors and notebooks to help you manage large datasets.

Once you have your data in Databricks, you can easily process and transform the data to your requirements.

What we love about Databricks

Ease of use:
Databricks has made it really easy to load your data into the platform. There are many prebuilt connectors to help ingest data from all the major players. This means you can get up and running quickly, often with just a few clicks and some minimal config.

For anything that doesn't exist, creating your own pipelines to ingest data is quite straightforward as well using Python/SQL notebooks.

Documentation:
The documentation and examples on the Databricks side are excellent, which means you can get up to speed quickly and onboard new staff easily.

Transforming data:
Quite often, we need to reformat or restructure data before we can sync it over to HubSpot. Databricks makes transforming data a breeze with real-time and batch transformations.

Hosting options:
"Databricks workspaces can be hosted on Amazon AWS, Microsoft Azure, and Google Cloud Platform. You can use Databricks on any of these hosting platforms to access data wherever you keep it, regardless of cloud."

This can really help with compliance and IT security requirements.

Sets businesses up for the future!
Most businesses have complex tech stacks with data spread out all over the place, but once you consolidate all your data into one place, life becomes a lot easier.

Need to integrate with another platform in the future? Easy, you already have access to your data!

A platform you're using goes away suddenly? Again, no worries, you already have a copy of your data.

How we integrate HubSpot with Databricks

We've helped clients stand up Databricks from scratch. The first step is determining what data sources they want to load into Databricks and what they are planning to do with the data.

Generally we are more focussed on the HubSpot side, but clients will often use Databricks for many other purposes.

Our high-level HubSpot and Databricks approach:

Determine data sources that need to be in Databricks
Design the HubSpot structure by defining objects & properties to help drive business automation and reporting
Data mapping from Databricks to the new HubSpot structure
Data transformation: We want to do as much of the heavy lifting on the Databricks side to get the tables in an ideal state before we reverse ETL to HubSpot
Decide on the reverse ETL tool that will be used
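
As an illustration of the data mapping step, here is a minimal Python sketch. The Databricks column names and the `lifetime_value` custom property are hypothetical; `email`, `firstname` and `lastname` are default HubSpot contact properties. It is not tied to any particular reverse ETL tool.

```python
# Hypothetical mapping from Databricks column names to HubSpot property names.
COLUMN_TO_PROPERTY = {
    "email_address": "email",
    "first_name": "firstname",
    "last_name": "lastname",
    "lifetime_value_usd": "lifetime_value",  # assumed custom property
}


def map_row(row: dict) -> dict:
    """Rename mapped columns to HubSpot property names and drop everything else."""
    return {
        prop: row[col]
        for col, prop in COLUMN_TO_PROPERTY.items()
        if col in row and row[col] is not None
    }


def find_unmapped(columns: list[str]) -> list[str]:
    """List source columns that do not have a HubSpot property mapped yet."""
    return [col for col in columns if col not in COLUMN_TO_PROPERTY]
```

Running `find_unmapped` against a table's column list is a quick way to spot fields that still need a decision before the first sync.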

If you want to get data from HubSpot into Databricks, there are two options:

Via HubSpot APIs or webhooks
Snowflake Data Sharing: If you have a HubSpot Ops Hub Enterprise license, you get access to a Snowflake Data Share. You can then ingest this data using the Databricks connector for Snowflake.
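
For the API route, a minimal sketch of pulling contacts page by page could look like the following. It assumes a HubSpot private app access token and uses the CRM v3 contacts list endpoint with its `after` paging cursor. The function names are our own for illustration, and retries and error handling are left out.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator, Optional

BASE_URL = "https://api.hubapi.com/crm/v3/objects/contacts"


def fetch_page(token: str, after: Optional[str] = None, limit: int = 100) -> dict:
    """Fetch one page of contacts. Requires a valid access token."""
    params = {"limit": str(limit)}
    if after:
        params["after"] = after
    request = urllib.request.Request(
        f"{BASE_URL}?{urllib.parse.urlencode(params)}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_contacts(get_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every contact by following the paging cursor until it runs out."""
    after = None
    while True:
        page = get_page(after)
        yield from page.get("results", [])
        after = page.get("paging", {}).get("next", {}).get("after")
        if not after:
            break
```

In a notebook this might be called as `iter_contacts(lambda after: fetch_page(token, after))`, with the yielded records written to a table.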

HubSpot friendly Reverse ETL tools

Once we have all the data in Databricks, we need to get it back out and into HubSpot.

Luckily, there are many reverse ETL tools out there already that can handle this.

The main downside of these third-party tools is that they generally have limited functionality (e.g. they can only sync certain objects) and can be slow to process (they often do not use batch calls, etc.).

Costs are another factor. Generally, these tools charge either by volume or by object synced, and over the long run this can really add up.

For simple syncs, we recommend Census. It also allows you to create two object syncs to HubSpot for free.

Complex Databricks Syncs

For anything more complex, we recommend getting us to script the Reverse ETL process for you.

We have developed a custom processing engine which is optimized for the HubSpot batch API calls.
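
To show why batch calls matter, here is a rough sketch of grouping records into batch request bodies. This is not our engine, just an illustration: it assumes a cap of 100 records per batch request (check the limit for your object type) and the `inputs` body shape used by the CRM v3 batch update endpoints. The function names are hypothetical.

```python
from typing import Iterator

BATCH_SIZE = 100  # assumed per-request cap; confirm for your object type


def chunked(records: list[dict], size: int = BATCH_SIZE) -> Iterator[list[dict]]:
    """Split records into consecutive lists of at most `size` items."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def build_batch_update_payloads(records: list[dict]) -> list[dict]:
    """Turn rows shaped {"id": ..., "properties": {...}} into batch request bodies."""
    return [
        {"inputs": [{"id": str(r["id"]), "properties": r["properties"]} for r in batch]}
        for batch in chunked(records)
    ]
```

With this grouping, 250 record updates become 3 requests instead of 250, which is where most of the speed difference comes from.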

The main benefits to this approach:

We have zero platform limitations. Need some kind of crazy processing logic to dedupe or associate records together? No problem, this can be coded in.
Ownership: We can host this in your environment and also hand it off to your tech team if they would like to own/manage it after we complete the setup.
Speed: We can optimize the sync to your HubSpot portal. For example, if you have a higher API limit, we can optimize toward this. Quite often, this then means we can sync millions of records in a few hours.
Error logging and alerts: We can integrate the sync with your error logging platform of choice. We use Sentry.io and can monitor and address issues for you as they arise.
Cost: There is a higher upfront setup cost, but ongoing costs for processing the data can be significantly lower, so this approach can work out cheaper in the long run.
We have developed an easy-to-use last-mile data transformation interface. This means marketing teams can easily map new columns from Databricks tables with a few clicks.
Selective re-sync options: So you've added a new column and need that new data point in HubSpot quickly? No problem, map it and select the time period you want to sync back to.
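
The idea behind a selective re-sync can be sketched in a few lines of Python. This assumes each row carries an `id` and a timezone-aware `updated_at` timestamp; those field names and the function name are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def resync_payload(rows, new_columns, days_back, now=None):
    """Build updates for the newly mapped columns, limited to recently changed rows."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days_back)
    return [
        {"id": row["id"], "properties": {col: row[col] for col in new_columns}}
        for row in rows
        if row["updated_at"] >= cutoff
    ]
```

Only the new columns are sent, and only for rows inside the chosen window, so a re-sync touches far fewer records than a full reload.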

Easily trigger a bulk resync of a subset or the entire data set

Map and transform new properties

Last-mile transformations:
Easily convert values during the sync
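
As a rough illustration of converting values during the sync, the sketch below applies a per-property converter just before a record is sent. The property names and the plan codes are hypothetical, and the target formats (lowercase `true`/`false` strings, ISO dates) are assumptions to check against your portal's property types.

```python
from datetime import date

# Hypothetical per-property converters applied right before the sync.
CONVERTERS = {
    "is_subscribed": lambda v: "true" if v else "false",
    "renewal_date": lambda v: v.isoformat(),
    "plan_code": lambda v: {"B": "basic", "P": "pro"}.get(v, "unknown"),
}


def convert_values(properties: dict) -> dict:
    """Apply a converter where one is defined and pass other values through."""
    return {
        name: CONVERTERS.get(name, lambda v: v)(value)
        for name, value in properties.items()
    }
```

Keeping the converters in one lookup means a new rule is a one-line change rather than a change to the sync logic itself.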

Our processing engine can also be used with other data warehouse or data source platforms like Snowflake, MySQL, RDS, Google Datastore, and more.
