The Bezos API mandate

The Jeff Bezos API mandate, referenced in Steve Yegge's iconic platform rant, describes how Amazon forced every team onto service interfaces, the platform that eventually grew into AWS. Using REST APIs for applications has plenty of benefits: flexibility, scalability, security, and interoperability. But analytics use cases have different needs than the usual CRUD operations most APIs are built for.
Managing change data capture is where REST APIs become an issue for data ingestion.
If you’ve gone through the process of building a data pipeline from a REST API, you will run into the following questions during requirements gathering:
- How do I pull updates from a specific point in time?
- How do I manage deletes?
- What are the API's limits (hard/soft)?
- Does the in-app reporting tool fulfill our reporting requirements?
- Does this thing support capturing slowly changing dimensions?
The answer is that you end up pushing this logic downstream to the data warehouse, which is good for data engineers, but this problem feels like it should have been solved long ago. My first job out of college just had a read replica of the production database, and honestly, things have never been that good since.
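To make the gap concrete, here is a minimal sketch of the kind of incremental pull those requirements questions are circling. Everything specific in it (the /records endpoint, the `updated_since` filter, the page-number pagination) is a hypothetical stand-in, which is sort of the point: every API does this differently, if at all.

```python
import requests


def pull_updates(base_url: str, token: str, updated_since: str) -> list[dict]:
    """Generic incremental pull. The /records endpoint, `updated_since` filter,
    and page-number pagination are hypothetical; plenty of APIs offer none of them."""
    rows: list[dict] = []
    page = 1
    while True:
        response = requests.get(
            f"{base_url}/records",
            headers={"Authorization": f"Bearer {token}"},
            params={"updated_since": updated_since, "page": page},
            timeout=30,
        )
        response.raise_for_status()
        batch = response.json().get("results", [])
        if not batch:
            break
        rows.extend(batch)
        page += 1
    # Hard-deleted records never appear in a pull like this, so delete handling
    # (and slowly changing dimensions) ends up as merge logic in the warehouse.
    return rows
```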
Making everything an HTTP service is great when you have a world-class engineering organization like Amazon with high standards, but after 15 years of every SaaS idea under the sun getting funded, we data folks are left with half-baked APIs that keep us busy with toil. So an ecosystem has developed to get the data that is allegedly yours to where it can actually be leveraged. Marketing tech is especially poor in this regard.
It’s America though, you can pay someone to do the thing. Fivetran is a popular ingestion tool: it’s super easy to set up and manage, and all it costs is your entire data budget, killing data team ROI before you even think about adding value. It’s best used tactically on miscellaneous sources that aren’t worth the developer time to build a pipeline and that pull relatively low volumes of so-called “monthly active rows”.
Personally, I felt like a sucker when I got my first Fivetran bill, which set me off on a quest for an alternative. There are open-source tools such as Meltano, Airbyte, and dltHub that essentially package up basic ingestion/replication scripts. Open-source tools are great, but you are making the tradeoff that managing and understanding their abstractions is worth the savings over Fivetran.
One of these days software services will just have an option to sync their transaction log to the warehouse, or we can go back to the good ol’ days of FTP CSV drops.
LinkedIn Ads API
A product of a truly deranged mind. This thing reeks of over-engineering and committee decision making. Why is there a new version every month? I really enjoyed seeing that the docs, the Postman collection, and what the endpoints actually need were three completely different things. It was also my first time encountering a Finder method, because I had never had any issues with a plain GET request.
Setting up the app and authenticating can be a little confusing, but you just gotta do it. Going through these ingestion projects across different systems made me more confident as a data engineer, because it forces you to formalize your workflow, read the documentation, and understand how the apps are structured. Pretty soon you start to notice patterns.
I like to start these projects in a Colab notebook; I find it keeps the feedback loop tight and makes it easy to build out the logical blocks of a data pipeline.
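When I do, the first working cell usually ends up looking something like this. Treat the version header, the adAccounts search finder, and the lack of search criteria as assumptions to verify against the docs linked below rather than a drop-in snippet.

```python
import requests

# Rough sketch of a first notebook cell. The token comes from the OAuth token
# generator linked below; the version header and finder params are assumptions
# to check against the current docs.
ACCESS_TOKEN = "paste-your-token-here"

HEADERS = {
    "Authorization": f"Bearer {ACCESS_TOKEN}",
    "LinkedIn-Version": "202402",            # versions are YYYYMM and rotate often
    "X-Restli-Protocol-Version": "2.0.0",
}

# adAccounts is exposed through a Finder method (q=search) rather than a plain GET.
response = requests.get(
    "https://api.linkedin.com/rest/adAccounts",
    headers=HEADERS,
    params={"q": "search"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```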
LinkedIn API
- LinkedIn Developer: https://developer.linkedin.com/
- App Quick Start: https://learn.microsoft.com/en-us/linkedin/marketing/quick-start?view=li-lms-2024-02
- OAuth Token generator: https://www.linkedin.com/developers/tools/oauth/token-generator
- LinkedIn Developer Postman Collection: https://www.postman.com/linkedin-developer-apis
Code Example
This is part of an existing Dagster project but the code can be adjusted to fit your stack.

The basic flow I have followed for marketing APIs is to fully refresh the dimensions (accounts, campaign groups, campaigns, creatives) and then pull in the fact table data (creative performance by day). For the initial population I pull the full account history broken out into 30-day increments. To keep the incremental refresh simple, the asset is just a full pull of the dimension objects plus the last 30 days of creative history, which captures and writes over any adjustments LinkedIn Ads made on their end.
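For reference, here is a minimal sketch of that flow as a single Dagster asset. The fetch and overwrite helpers are placeholders rather than the actual project code, and the 30-day chunking helper is what the initial backfill loops over.

```python
from datetime import date, timedelta
from typing import Iterator

from dagster import asset

# Object names follow the LinkedIn Ads dimensions described above.
DIMENSIONS = ["adAccounts", "adCampaignGroups", "adCampaigns", "creatives"]


def thirty_day_chunks(start: date, end: date) -> Iterator[tuple[date, date]]:
    """Split the full account history into 30-day windows for the initial backfill."""
    cursor = start
    while cursor < end:
        window_end = min(cursor + timedelta(days=30), end)
        yield cursor, window_end
        cursor = window_end


def fetch_dimension(name: str) -> list[dict]:
    """Placeholder: full pull of one dimension object from the LinkedIn REST API."""
    return []


def fetch_creative_performance(start: date, end: date) -> list[dict]:
    """Placeholder: creative performance by day for the given window."""
    return []


def overwrite(table: str, rows: list[dict]) -> None:
    """Placeholder: write rows to the warehouse, replacing what is already there."""


@asset
def linkedin_ads_refresh() -> None:
    """Incremental refresh: full pull of the dimensions plus the trailing 30 days
    of creative history, overwriting that window to capture LinkedIn-side adjustments."""
    for dimension in DIMENSIONS:
        overwrite(dimension, fetch_dimension(dimension))

    end = date.today()
    start = end - timedelta(days=30)
    overwrite("creative_performance_daily", fetch_creative_performance(start, end))
```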
I’ll be making adjustments to this code and enhancing it with tests and dbt models in the future, so make sure you follow me on LinkedIn and Twitter.
