Deploying a Complex Data Pipeline on Azure: Dagster vs. Prefect
Recently, I had to deploy a pretty complex data pipeline originally built as a Jupyter notebook (with some backing Python modules). This pipeline had tons of input parameters, processed a substantial dataset (around 600k to 1M rows), required significant RAM due to intensive data processing, and was only used occasionally—roughly 10-50 times per month. Running the entire thing took about 20 minutes, mostly thanks to Excel being painfully slow (seriously, Excel files are the sloths of the data world).
The goal was clear: get this pipeline into a production-ready setup on Azure, complete with observability, retries for failures, file-based triggers, scalability, low costs, and minimal changes to the existing codebase. My first thought was just deploying the notebook directly onto Azure Databricks—but I quickly dropped that idea. Managing notebooks in production is like herding cats—no real version control, poor observability, no retries, no triggers, just messy chaos.
Exploring Azure-native options turned out to be disappointing; they were either crazy expensive (looking at you, Microsoft Fabric) or required extensive rewriting of the code for Spark clusters. Running plain Python scripts on Azure Synapse Analytics is technically doable, but each Spark cluster needs at least three nodes, making it roughly three times as expensive as the single machine the job actually needs. No thanks.
After some digging, two promising contenders (both open-source) emerged:
- Dagster (on Azure Container Instances)
- Prefect (also on Azure Container Instances)
Here’s what I discovered after testing both:
Dagster: Elegant, but Costly
Dagster has a fantastic conceptual model based on "data assets," which made the pipelines very intuitive. Observability was decent, though clearly displaying tabular results during runs wasn't straightforward. However, Dagster uses a pull-based execution model, meaning a worker instance has to run continuously, burning money even when the pipeline is idle. On top of that, it would have required significant refactoring of my existing pipeline into Dagster's more declarative style. Dagster's cloud offering was also restricted to US regions, which was a no-go due to data privacy constraints.
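To give you a flavor of that asset model, here's a minimal sketch (the file name, asset names, and transformation are illustrative, not taken from my real pipeline). Each asset becomes a node in Dagster's lineage graph, which is exactly what makes the model so intuitive:

```python
import pandas as pd
from dagster import Definitions, asset


@asset
def raw_orders() -> pd.DataFrame:
    # Source asset: read the (hypothetical) Excel input.
    return pd.read_excel("orders.xlsx")


@asset
def cleaned_orders(raw_orders: pd.DataFrame) -> pd.DataFrame:
    # Dagster infers the dependency on raw_orders from the parameter name.
    return raw_orders.dropna()


# Register both assets so they show up in the Dagster UI.
defs = Definitions(assets=[raw_orders, cleaned_orders])
```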
Don’t get me wrong—Dagster seems awesome, and I hope to use it someday—but it wasn’t right for this specific project.
Prefect: Practical and Cost-effective
Prefect took a more straightforward, traditional task-and-flow approach. Although not as elegant as Dagster's asset concept, it did the job very effectively. Observability was actually great, with support for artifacts like tables, markdown, and images directly in the UI. Prefect also handled retries and event triggers exceptionally well. Its UI, admittedly, felt slightly less polished than Dagster's.
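Here's a minimal sketch of what that looks like in practice (the task names, retry settings, and artifact content are made up for illustration):

```python
from prefect import flow, task
from prefect.artifacts import create_markdown_artifact


@task(retries=3, retry_delay_seconds=60)
def load_rows() -> list[dict]:
    # Hypothetical extraction step; retries are declared per task.
    return [{"id": 1, "value": 42}]


@task
def summarize(rows: list[dict]) -> None:
    # Publish a markdown artifact that appears directly in the Prefect UI.
    create_markdown_artifact(
        key="row-summary",
        markdown=f"Processed **{len(rows)}** rows.",
    )


@flow
def pipeline() -> None:
    rows = load_rows()
    summarize(rows)


if __name__ == "__main__":
    pipeline()
```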
The big win with Prefect was its push-based worker model: spin up an Azure Container Instance (ACI), run the flow, then shut the container down immediately afterward, with no idle compute costs. Prefect still required a continuously running server, though. The Prefect Cloud free tier didn't allow custom compute, and the next tier cost about $100/month, so hosting the server myself on a small, always-on ACI instance was the only practical choice.
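Getting there is mostly configuration. Below is a sketch of the deployment call, assuming the prefect-azure integration is installed and an ACI-backed work pool exists; the pool name and image are placeholders:

```python
from prefect import flow


@flow
def pipeline() -> None:
    ...  # the actual 20-minute job goes here


if __name__ == "__main__":
    # Register the flow against a work pool whose worker launches one
    # ACI container per run and tears it down afterward.
    # "aci-pool" and the image name are placeholders; this assumes an
    # azure-container-instance work pool created via prefect-azure.
    pipeline.deploy(
        name="production",
        work_pool_name="aci-pool",
        image="myregistry.azurecr.io/pipeline:latest",
    )
```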
Why Prefect Came Out on Top
In the end, Prefect was the clear winner for my project. It balanced functionality, observability, and cost-effectiveness better—even though I still admire Dagster’s elegant approach.
Hopefully, my journey helps someone else navigate the surprisingly murky waters of Azure data pipeline solutions. Honestly, the Azure-native offerings disappointed me: they're pricey, inflexible, and oddly unsuited for typical Python data pipelines. In the same project, I'm already using Azure Synapse Pipelines for a simple file-to-database job, and it's actually really bad. Who at Microsoft decided that dynamic input schemas weren't important?
It genuinely surprises me that Microsoft hasn’t provided a straightforward, serverless solution for Python-based data pipelines. Until then, Prefect has my vote.