Simplify Data Ingestion from HubSpot to BigQuery with DLT Hub and Dagster

I want to dive into an exciting project I recently completed using DLT Hub, a Python library designed to simplify data ingestion and replication. In this project, I built a straightforward data ingestion pipeline that transfers data from HubSpot to BigQuery, leveraging the power of both DLT Hub and Dagster. This project turned out to be easier than expected, and I’m eager to share the details with you.

Tackling the Challenges of Data Ingestion

Those of us in the data engineering realm know that replicating and ingesting data into a data warehouse is far from trivial. It’s a challenge, especially when you’re trying to set realistic expectations with stakeholders about the time and effort required to integrate a new data source.

Here’s why it can be complicated:

  • API Limitations: Many SaaS (Software as a Service) platforms have underdeveloped or inconsistent APIs, making it difficult to query data effectively for analytical purposes.
  • Diverse Data Sources: At Linden Digital Marketing, we deal with over 30 different data sources. Each comes with its own schema, querying logic, authentication requirements, and unique quirks.
  • Change Data Capture (CDC): Ideally, we’d love to have a transaction log capturing every operation from all systems, but this isn’t always possible. Often, we have to implement logic downstream in our data warehouse to handle changes, which adds another layer of complexity.

These are just some of the hurdles that make data engineering both challenging and rewarding. But this is where tools like DLT Hub can make a significant difference.

Why DLT Hub?

DLT Hub simplifies the initial stages of data ingestion or replication, making it an excellent choice for proof-of-concept projects. It’s especially handy when you need to get data from sources like HubSpot into a data warehouse such as BigQuery. Here’s what makes DLT Hub appealing:

  • Ease of Integration: DLT Hub supports straightforward integration with various data sources and platforms.
  • Flexibility: It allows for custom queries and adjustments, making it adaptable to different project requirements.
  • Simplified Configuration: Setting up DLT Hub with Dagster and managing the integration is surprisingly simple, as I’ll show you in the next section.

Project Walkthrough: HubSpot to BigQuery

Setting Up the Project

In VS Code, I organized the project with a clear folder structure:

  • Dagster Project: Contains all Dagster-related assets and resources.
  • DLT Sources: Houses the configurations and setup files for DLT Hub’s data sources.
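For reference, the layout looked roughly like this (folder names are my own choices, not prescribed by either tool):

```
my_project/
├── dagster_project/        # Dagster definitions, assets, resources
│   └── assets.py
├── dlt_sources/            # dlt source configs and helper scripts
│   ├── hubspot/            # generated by `dlt init hubspot bigquery`
│   └── .dlt/
│       ├── config.toml
│       └── secrets.toml    # API keys and warehouse credentials (gitignored)
└── pyproject.toml
```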

After setting up the folder structure, integrating DLT Hub with Dagster was quite straightforward. Here’s a step-by-step breakdown:

  1. Integrating DLT Hub:
    • After running a few command-line scripts to initialize DLT Hub, the tool generated configuration files (`.toml` files) and helper scripts in my project directory.
    • These files manage the connection and data extraction logic for the HubSpot API.
  2. Configuring Dagster Assets:
    • I defined the assets within the Dagster project to handle data ingestion from HubSpot. This included specifying how data is pulled, processed, and loaded into BigQuery.
    • Adjusting these configurations for custom queries was easy, thanks to DLT Hub’s flexible support for different query logic.
  3. Running the Pipeline:
    • With the setup complete, I executed the pipeline in Dagster. It efficiently ingested all the HubSpot data into our BigQuery dataset, making the entire process smooth and manageable.

Weighing Build vs. Buy as a Data Team

While tools like DLT Hub and SaaS solutions such as Fivetran can streamline data ingestion, it’s crucial to weigh their costs against your project’s requirements and budget.

Some factors to consider:

  • Cost Efficiency: For smaller or less critical data sources, using a SaaS tool can be cost-effective. However, for high-volume or mission-critical data, custom pipelines might be more economical.
  • Long-Term Value: Building custom pipelines in Python, as I did with this project, can often save costs and offer greater control over your data workflows. It also allows your team to focus more on adding value through data analysis, dashboard creation, machine learning, and process automation.

Wrapping Up

This project was a fantastic opportunity to explore the capabilities of DLT Hub and see how it simplifies data ingestion tasks. Integrating HubSpot data into BigQuery using DLT Hub and Dagster turned out to be a straightforward process, and it’s a solution I’ll likely incorporate into future projects due to its ease of use and flexibility.

If you’re working on similar projects or interested in enhancing your data ingestion processes, I highly recommend checking out DLT Hub. For more details on my setup and to stay connected, feel free to follow me on Twitter or LinkedIn.

Stay Connected

If you found this walkthrough helpful and want more insights into data engineering projects, be sure to like and subscribe to my channel. Your feedback is invaluable, so let me know if you have any questions or suggestions for future topics!
