Python Data Orchestration Project: Leveraging Dagster for Spotify Ads Integration

In this project, I built a data pipeline that integrates Spotify Ads data into our data warehouse at Linden Digital Marketing, using Dagster, a Python-based data orchestration tool. In this post, I’ll walk you through what data orchestration is, how I used Dagster, and the specifics of the project, including how I handled data ingestion, integration with BigQuery, and where DBT fits into the mix.

Understanding Data Orchestration

Data orchestration is a term that might sound a bit complex, but it’s essentially the backbone of any robust data pipeline. Imagine having multiple data sources and systems – your task is to tie them all together in a way that makes the data flow smoothly and efficiently from one stage to the next.

In my project, data orchestration involved several steps:

  1. Ingestion: Pulling data from the Spotify Ads API.
  2. Transformation: Cleaning and formatting the data to match our reporting and business requirements.
  3. Consumption: Making the data available for analysis and reporting, whether that’s through regular reports, business intelligence tools, or even machine learning models.

The orchestration process not only schedules and coordinates these steps but also provides visibility and metadata for the entire data flow, ensuring everything works as expected.
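
To make that concrete, here’s a minimal Dagster sketch of those three stages as chained assets. The data and asset names are purely illustrative, not the project’s actual code:

```python
import pandas as pd
from dagster import Definitions, asset

@asset
def raw_ad_metrics() -> pd.DataFrame:
    # Ingestion: the real pipeline pulls this from the Spotify Ads API;
    # a tiny hard-coded payload keeps the sketch runnable.
    return pd.DataFrame([{"ad_id": "a1", "impressions": 1000, "clicks": 40}])

@asset
def clean_ad_metrics(raw_ad_metrics: pd.DataFrame) -> pd.DataFrame:
    # Transformation: enforce types and derive a reporting metric.
    df = raw_ad_metrics.astype({"impressions": "int64", "clicks": "int64"})
    df["ctr"] = df["clicks"] / df["impressions"]
    return df

@asset
def daily_ad_report(clean_ad_metrics: pd.DataFrame) -> str:
    # Consumption: hand the cleaned data to reporting (here, just a summary string).
    return f"{len(clean_ad_metrics)} ads, avg CTR {clean_ad_metrics['ctr'].mean():.2%}"

defs = Definitions(assets=[raw_ad_metrics, clean_ad_metrics, daily_ad_report])
```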

Why Dagster?

When I started exploring tools for this project, I quickly became a fan of Dagster. Here’s why:

  • All-in-One Python: Being a Python enthusiast, I appreciated how Dagster integrates seamlessly with Python scripts and libraries.
  • Structured Approach: As a self-taught data engineer, I found Dagster’s emphasis on project structure and dependency management incredibly valuable. It helped me grow my skills in version control and file management.
  • Modern and Intuitive: The tool’s modern interface and asset-centric view allowed me to visualize my entire data pipeline clearly. I could see how different assets (data sets, DBT models, etc.) connected and when they were last updated.

My Project Overview

The goal of my project was to create a data pipeline that integrates Spotify Ads data into our data warehouse. We needed to collect daily performance metrics for various ads and structure this data in a way that’s ready for business analysis.

Here’s a quick breakdown of what I worked with:

  • Data Source: Spotify Ads API.
  • Schema: I set up a star schema with dimensions like accounts, ads, campaigns, and ad sets. The fact table tracked daily ad statistics.
  • Update Frequency: To keep the data current and accurate, especially since ad metrics can change over time, the pipeline needed to refresh and rebuild the last 90 days of data daily (sketched below).
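
To illustrate that rolling window, here’s a small, purely illustrative helper showing how the date range for a given daily run could be computed:

```python
from datetime import date, timedelta

def refresh_window(run_date: date, lookback_days: int = 90) -> tuple[date, date]:
    """Return the (start, end) date range a daily run should rebuild."""
    end = run_date - timedelta(days=1)                # data through yesterday
    start = end - timedelta(days=lookback_days - 1)   # 90 days, inclusive
    return start, end

# A run on 2024-06-01 would rebuild 2024-03-03 through 2024-05-31.
print(refresh_window(date(2024, 6, 1)))
```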

Building the Pipeline with Dagster

Project Structure

I organized the project into several components, which made it easier to manage and maintain:

  • Configuration Files: These manage all the environment variables and settings needed for the pipeline (a minimal example follows this list).
  • DBT and Dagster Directories: Separate directories for DBT models and Dagster assets, keeping a clear separation of concerns.
  • Virtual Environment: This keeps all the Python packages consistent and isolated, which is crucial for avoiding conflicts and ensuring reproducibility.
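
As a minimal sketch of the configuration idea (with made-up variable names), the pipeline settings can be read from environment variables rather than hard-coded:

```python
import os

# Illustrative variable names only; the project's actual settings differ.
SPOTIFY_CLIENT_ID = os.environ["SPOTIFY_CLIENT_ID"]
SPOTIFY_CLIENT_SECRET = os.environ["SPOTIFY_CLIENT_SECRET"]
BIGQUERY_PROJECT = os.environ["BIGQUERY_PROJECT"]
BIGQUERY_DATASET = os.environ.get("BIGQUERY_DATASET", "spotify_ads")
LOOKBACK_DAYS = int(os.environ.get("LOOKBACK_DAYS", "90"))
```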

Coding the Pipeline

  1. Ingesting Spotify Data:
    • I used Python scripts to connect to the Spotify Ads API and extract data in JSON format.
    • The data was then loaded into Pandas DataFrames for initial processing.
    • Finally, these DataFrames were loaded into Google BigQuery, our chosen data warehouse (the first sketch after this list illustrates the flow).
  2. Defining Dagster Assets and Jobs:
    • Dagster allows you to define assets for each part of the data pipeline. For my project, this included assets for raw data ingestion and DBT-transformed data (the second sketch after this list shows how assets, a job, and a daily schedule fit together).
    • I set up a CLI resource to manage DBT commands, ensuring smooth building and testing of data models.
    • Helper functions were crucial for managing date calculations and handling data frame operations efficiently.
  3. Scheduling and Running the Pipeline:
    • I created jobs to refresh dimension tables and update the fact table incrementally. This setup ensures that the pipeline captures all necessary updates daily.
    • A simple cron job schedules the pipeline to run every day at midnight, keeping our data up-to-date.
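
Here’s a simplified sketch of the ingestion flow from item 1. The Spotify Ads API base URL, endpoint, and parameters are assumptions for illustration only; the BigQuery load uses the standard google-cloud-bigquery client:

```python
import pandas as pd
import requests
from google.cloud import bigquery

# Hypothetical base URL; check the Spotify Ads API docs for the real endpoints.
SPOTIFY_ADS_API = "https://api-partner.spotify.com/ads"

def fetch_daily_ad_stats(token: str, ad_account_id: str, start: str, end: str) -> pd.DataFrame:
    """Pull daily ad statistics as JSON and return them as a DataFrame."""
    resp = requests.get(
        f"{SPOTIFY_ADS_API}/ad_accounts/{ad_account_id}/stats",  # illustrative endpoint
        headers={"Authorization": f"Bearer {token}"},
        params={"granularity": "DAY", "start_date": start, "end_date": end},
        timeout=30,
    )
    resp.raise_for_status()
    return pd.DataFrame(resp.json()["data"])

def load_to_bigquery(df: pd.DataFrame, table_id: str) -> None:
    """Replace the target BigQuery table with the freshly pulled data."""
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")
    client.load_table_from_dataframe(df, table_id, job_config=job_config).result()
```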
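
And here’s roughly how the assets, job, and daily schedule from items 2 and 3 might be wired together. The asset names are illustrative, and I’m showing Dagster’s built-in ScheduleDefinition as one way to express the midnight run, even though the project itself uses a plain cron job:

```python
import pandas as pd
from dagster import Definitions, ScheduleDefinition, asset, define_asset_job

@asset
def spotify_ad_stats_raw() -> pd.DataFrame:
    # In the project this calls the Spotify Ads API and lands the data in BigQuery;
    # a stub keeps the sketch self-contained.
    return pd.DataFrame([{"date": "2024-06-01", "ad_id": "a1", "impressions": 1000}])

@asset
def fct_daily_ad_stats(spotify_ad_stats_raw: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for the downstream fact-table build (handled by DBT in the project).
    return spotify_ad_stats_raw

# One job rebuilds the raw ingest plus everything downstream of it.
daily_refresh_job = define_asset_job("daily_refresh", selection="*fct_daily_ad_stats")

# Every day at midnight, matching the schedule described above.
daily_schedule = ScheduleDefinition(job=daily_refresh_job, cron_schedule="0 0 * * *")

defs = Definitions(
    assets=[spotify_ad_stats_raw, fct_daily_ad_stats],
    jobs=[daily_refresh_job],
    schedules=[daily_schedule],
)
```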

Integrating DBT for Data Transformation

For data transformations, I integrated DBT (Data Build Tool) into the project. DBT models, defined using SQL, allowed me to:

  • Map Data Types and Resolve Errors: Ensuring the data was clean and consistent.
  • Perform Schema Testing: Implementing simple tests, like checking for unique IDs, to maintain data integrity.

Combining DBT with Dagster created a powerful framework for managing the entire data pipeline, from initial extraction to final transformation and loading.
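
As a sketch of that combination, the dagster-dbt integration can load every DBT model as a Dagster asset and run dbt build (models plus schema tests) inside the pipeline. This uses the newer dagster-dbt API, so the actual CLI-resource setup in the project may look different:

```python
from pathlib import Path

from dagster import Definitions
from dagster_dbt import DbtCliResource, dbt_assets

DBT_PROJECT_DIR = Path("dbt_project")  # illustrative path to the DBT project

# Assumes `dbt compile` (or `dbt build`) has already produced target/manifest.json.
@dbt_assets(manifest=DBT_PROJECT_DIR / "target" / "manifest.json")
def spotify_dbt_assets(context, dbt: DbtCliResource):
    # "dbt build" runs every model and its schema tests; Dagster tracks each model as an asset.
    yield from dbt.cli(["build"], context=context).stream()

defs = Definitions(
    assets=[spotify_dbt_assets],
    resources={"dbt": DbtCliResource(project_dir=str(DBT_PROJECT_DIR))},
)
```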

Wrapping Up

This project was an exciting journey into the world of data orchestration with Dagster. It showcased how flexible and powerful Dagster can be, especially when paired with DBT for managing complex data transformations. The end result is a robust and scalable pipeline that efficiently integrates Spotify Ads data into our data warehouse, providing invaluable insights for our business.

If you’re interested in exploring the project further, I’ve made the entire repository available. You can find the link in the video description. Don’t hesitate to reach out if you have any questions or thoughts – I’m always happy to connect on Twitter or LinkedIn.

Stay Connected

If you enjoyed this walkthrough and want to see more content focused on data engineering rather than just analysis, please let me know! And don’t forget to subscribe to my channel for more updates.
