In modern data engineering and MLOps, workflow management platforms are becoming increasingly important for orchestrating distributed data pipelines. Workflow orchestration tools help process and send data between systems and tasks, which is still a pretty tricky problem.
Over time, workflow orchestration has grown to encompass the increasing complexity of the workflows and pipelines. Teams often begin by managing and processing tasks manually, including data cleaning, training, results tracking, deployment, etc. As tasks and workflows become more complex, manual orchestration becomes increasingly time-consuming.
Enter workflow management and orchestration platforms.
What Are Workflow Management and Orchestration Tools?
As data pipelines and their various tasks grow in complexity, creating automated workflows that handle tasks and their dependencies eventually becomes necessary. Tasks and dependencies form networks that can be modelled as a directed acyclic graph (DAG). In a DAG, each task is a node and each dependency is a directed edge between two nodes. Because the graph contains no cycles, there is always a valid order in which to run the tasks.
Workflow orchestration tools enable data engineers to define pipelines as DAGs, including their dependencies, and then execute the tasks in the correct order.
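To make the idea of executing tasks in dependency order concrete, here is a minimal sketch using only Python's standard library (graphlib, available from Python 3.9). The task names are illustrative assumptions and are not taken from any particular tool. Real orchestrators add scheduling, retries and monitoring on top of this basic ordering step.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key is a task, each value is the set of
# tasks that must finish before it can start.
dependencies = {
    "clean": {"extract"},
    "train": {"clean"},
    "report": {"clean"},
    "deploy": {"train", "report"},
}


def execution_order(deps):
    """Return one valid run order for the DAG described by `deps`."""
    # static_order() raises graphlib.CycleError if the graph has a cycle,
    # i.e. if it is not actually a DAG.
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Note that "train" and "report" do not depend on each other, so either may come first. An orchestrator is free to run such tasks in parallel.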
Additionally, workflow orchestration tools create progress reports and notifications that enable team members to monitor what’s going on. Workflow orchestration tools connect to a wide range of data sources, such as APIs, databases and data warehouses. Some key uses include:
- Monitoring data flow between APIs, warehouses, etc.
- Managing pipelines that change at relatively slow, even intervals
- Extracting batch data from multiple sources
- ML model training
- DevOps tasks, like submitting Spark jobs
The end goal is to create a dependable, repeatable, centralised workflow for orchestrating data pipelines and MLOps-related tasks.
This is a relatively new category of tools, but there are already quite a few options, including:
- Apache Airflow: Originally developed by Airbnb, Airflow joined the Apache Incubator in 2016 and became a top-level Apache Software Foundation project in early 2019. Airflow is written in Python and is probably the go-to workflow orchestration tool with its easy-to-use UI.
- Luigi: Luigi is a Python package for building data orchestration and workflows. It’s simpler for Python users than Airflow overall.
- Dagster: Dagster is closer to Prefect than to Airflow. It works via graphs of metadata-rich functions called ops, connected by gradually typed dependencies.
- Prefect: Prefect has become a key competitor to Airflow and provides a cloud offering with a hybrid architecture.
- Kubeflow: For Kubernetes users who want to define tasks with Python.
- MLflow: Orchestration specifically for ML projects.
Here, we’ll be comparing Airflow and Prefect.
An Apache project, Airflow has become the go-to workflow orchestration tool and is well suited to medium and large-scale businesses and projects.
Written in Python, Airflow is popular amongst developers and is designed to be distributed, flexible and scalable while handling complex business logic. Airflow is used by at least 10,000 large organisations, including Disney and Zoom. In addition, Airflow connects to cloud services like AWS and is backed by a huge community.
Issues with Airflow
Airflow was designed at a large company, Airbnb, and is therefore angled more towards large and enterprise deployments. For a long time, Airflow was the only real option for orchestration at scale, but fitting very complex projects into it can be tricky, particularly in the case of ML projects.
Prefect is fresh and modern and, like Airflow, is an open-source project. However, there is a paid cloud version too, which is one of its major differentiators from Airflow.
That means you can execute workflows on any server and monitor them from Prefect’s cloud portal. It’s much simpler than Airflow, which is a benefit to most users. Overall, Prefect feels like a slicker Airflow: workflow code reads much like ordinary Python functions, wrapped in a ‘with’ statement.
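The following library-free sketch illustrates the general pattern of plain functions composed inside a ‘with’ block. It is a toy written for this explanation, not Prefect's API: ToyFlow, toy_task and the example task names are all hypothetical. Inside the block, calling a task records it instead of running it, and the work only happens when run() is called.

```python
_active_flows = []  # stack of flows currently being built


class _Node:
    """A recorded task call: the function plus its (possibly deferred) args."""

    def __init__(self, fn, args):
        self.fn = fn
        self.args = args


class ToyFlow:
    def __init__(self, name):
        self.name = name
        self.nodes = []

    def __enter__(self):
        _active_flows.append(self)
        return self

    def __exit__(self, *exc_info):
        _active_flows.pop()
        return False  # do not swallow exceptions

    def run(self):
        # Nodes are stored in creation order, which is already a valid
        # dependency order: a node can only use nodes created before it.
        results = {}
        for node in self.nodes:
            args = [results[id(a)] if isinstance(a, _Node) else a
                    for a in node.args]
            results[id(node)] = node.fn(*args)
        return [results[id(node)] for node in self.nodes]


def toy_task(fn):
    def wrapper(*args):
        if not _active_flows:
            return fn(*args)  # outside a flow: behave like a normal function
        node = _Node(fn, args)
        _active_flows[-1].nodes.append(node)
        return node
    return wrapper


@toy_task
def extract():
    return [1, 2, 3]


@toy_task
def double(values):
    return [v * 2 for v in values]


@toy_task
def total(values):
    return sum(values)


with ToyFlow("etl") as flow:
    total(double(extract()))  # records three tasks, runs nothing yet

outputs = flow.run()  # one result per recorded task, in run order
```

The appeal of this style is that the data dependencies are expressed by ordinary function calls, so no separate wiring step is needed.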
1: Ease of Setup
Both Airflow and Prefect can be set up using pip, Docker or other containerisation options. However, Prefect is very well organised and is probably more extensible out-of-the-box. To run Airflow, you’ll need a scheduler and webserver, but AWS and GCP both provide managed services for the platform.
Both Prefect and Airflow can run on cloud services, but Prefect offers a paid cloud version to simplify monitoring pipelines through an intuitive real-time UI. Access is simple but secure, with the user requiring an account and API key.
2: Ease of Use
Airflow utilises DAGs, as described above. These are pretty intuitive and describe workflows visually, but Airflow DAGs are built from operators, which are tricky to learn at first despite a straightforward syntax structure.
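In Airflow, dependencies between operators are commonly declared with the >> operator, as in extract >> transform. The sketch below shows how that kind of syntax can work in plain Python. It is a toy for illustration only, not Airflow's implementation: ToyOperator and run_all are hypothetical names, and real operators carry far more configuration.

```python
from graphlib import TopologicalSorter


class ToyOperator:
    """A unit of work with an id and a record of what must run before it."""

    def __init__(self, task_id, python_callable):
        self.task_id = task_id
        self.python_callable = python_callable
        self.upstream = set()

    def __rshift__(self, other):
        # `a >> b` means "b depends on a".
        other.upstream.add(self.task_id)
        return other  # returning `other` lets declarations chain: a >> b >> c


def run_all(operators):
    """Run every operator once, respecting the declared dependencies."""
    by_id = {op.task_id: op for op in operators}
    graph = {op.task_id: op.upstream for op in operators}
    order = list(TopologicalSorter(graph).static_order())
    for task_id in order:
        by_id[task_id].python_callable()
    return order


log = []
extract = ToyOperator("extract", lambda: log.append("extract"))
transform = ToyOperator("transform", lambda: log.append("transform"))
load = ToyOperator("load", lambda: log.append("load"))

# Declare the pipeline shape; nothing runs at this point.
extract >> transform >> load

# The order in which operators are listed does not matter.
order = run_all([load, transform, extract])
```

Separating the declaration of the graph from its execution is what lets a scheduler inspect, display and retry individual tasks.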
Prefect code is similar to writing Python functions, and it’s unnecessary to refactor the code when creating new workflows, which is a big bonus vs Airflow. Prefect also has excellent code modularisation features, which are great for test cycles.
Prefect is pretty adamant that its tool solves many of the issues users report with Airflow, including:
- DAGs with the same start time
- DAGs run off-schedule or with no schedule at all
- DAGs that rely on the exchange of data
- DAGs with complex branching logic
- DAGs with lots of fast tasks
- Dynamic DAGs
- Parametrized DAGs
In Airflow, many enterprises resort to writing custom DSLs or building proprietary plugins to support their internal needs, whereas Prefect supports these functions more-or-less ‘out of the box’. After all, Prefect was made to beat Airflow, and the devs knew and understood why Airflow might be limited in some use cases.
Of course, whether or not these limitations apply to your specific project is a different matter.
3: UI
Airflow’s UI is part of the webserver and contains plenty of intuitive features once users have mastered DAGs. Tasks, schedules and runs are all clearly displayed. Multiple views, such as calendar and graph views, make navigating workflows easy once everything is set up.
Prefect’s UI is a very configurable, easy-to-manage dashboard that enables you to manage workflows centrally whilst allowing you to check the health of your data pipelines.
4: AI and ML
Airflow is better supported by the ML community due to its integrations. Airflow can accomplish many tasks, such as training models at specific intervals, retraining the models, batch processing, data scraping, portfolio tracking, etc.
Since Airflow is more popular right now, it’s usually possible to find guidance on even the most complex MLOps.
Prefect is not as well equipped for ML right now, but this is changing (and possibly already has by the time you’re reading this blog). It’s not so much that Prefect cannot cater to advanced ML pipelines, but that it makes more assumptions about what you will use the tool for. That said, Prefect is built with the modern data stack in mind and has a cleaner Python API.
Airflow is the more extensible tool right now and has a broader following and community base. However, Prefect provides support through a Slack channel that is regularly monitored. In the end, both tools are backed by professional developers, students and enthusiasts from multiple backgrounds in data engineering and ML.
Summary: Airflow vs Prefect
Prefect is understandably bullish when comparing itself with Airflow. Prefect’s UI is excellent, supporting tons of real-time features and visualisations. In addition, the API is cleaner and easier to get to grips with without sacrificing flexibility.
But what Prefect lacks, for now, is the massive community backing of Airflow, and this probably won’t change for a few years. Airflow also makes a lot of the technical details of its workflows and pipelines available to users, which is excellent for technical users.
If you’ve not used either tool before, the marginal preference is probably Prefect, purely because it’s new, it’s there, it’s actively supported, and it does offer some nifty features for modern MLOps.
Overall, though, you can’t easily split these two workflow orchestration tools without putting preference heavily into the frame.
What is workflow orchestration in data?
Workflow orchestration in data engineering and ML is the organisation of data pipelines and their sources. Workflow orchestration tools enable users to set up workflows between data sources and pipelines, running tasks with dependencies on specific schedules.
What is Airflow?
Airflow is an Apache project for data workflow orchestration. It allows users to organise their pipelines and schedule tasks between data sources. It’s used in both machine learning and data engineering for larger businesses and projects with lots of automated dependencies.
What is Prefect?
Prefect is an open-source competitor to Airflow that adds a paid cloud gateway. It’s newer than Airflow and utilises a straightforward and clean Python API.