Airflow – Sub-DAGs

Introduction Workflows can quickly become quite complicated which makes them difficult to understand. To reduce complexity tasks can be grouped together into smaller DAGs and then called. This makes DAGs easier to manage and maintain. Creating a sub dag consists of Creating the actual Sub DAG and testing it. Create a DAG which calls the Sub-DAG created in Step1. For those of you wondering how to call a sub-dag. It is easy using the SubDagOperator. In the next example, we would re-use one of our earlier examples as a Sub-DAG and call it from another DAG. Create a Sub-DAG Let’s first see the code of the Sub-DAG. You will observe that it is very similar to a DAG from the previous entries except that it is called inside a function. See Below Before moving forward a few important points Airflow Sub DAG is in a separate file in the same … Read more

Airflow Branch joins

Introduction In many use cases, there is a requirement of having different branches(see blog entry) in a workflow. However, these branches can also join together and execute a common task. Something similar to the pic below There is no special operator which is used but only the way we assign & set upstream and downstream relationships between the tasks. An example of this is shown below. Branch Join DAG The hello_joins example extends a DAG from the previous blog entry. An additional task is added to the dag which has been quite imaginatively named join_task. Understand DAG Let’s look at the changes introduced in the DAG. Step 5 – A new task called join_task was added. Observe the TriggerRule which has been added. It is set to ONE_SUCCESS which means that if any one of the preceding tasks has been successful join_task should be executed. Step 6 – Adds the … Read more

Airflow – Variables

In the last entry on airflow(which was many moons ago!) we had created a simple DAG and executed it. It was simple to get us started. However, it did not allow us any flexibility. We could not change the behaviour of the DAG. In the real world scenarios, we may need to change the behaviour of our workflow based on certain parameters. This is accomplished by Airflow Variables Airflow Variables are simple key-value pairs which are stored in the database which holds the airflow metadata. These variables can be created & managed via the airflow UI or airflow CLI. Airflow WebUI -> Admin -> Variables Some of the features of Airflow variables are below Can be defined as a simple key-value pair One variable can hold a list of key-value pairs as well! Stored in airflow database which holds the metadata Can be used in the Airflow DAG code as … Read more

Airflow – Branching

Introduction Branching is a useful concept when creating workflows. Simply speaking it is a way to implement if-then-else logic in airflow. For example, there may be a requirement to execute a certain task(s) only when a particular condition is met. This is achieved via airflow branching. If the condition is true, certain task(s) are executed and if the condition is false, different task(s) are executed. Branching is achieved by implementing an Airflow operator called the BranchPythonOperator. To keep it simple – it is essentially, an API which implements a task. This task then calls a simple method written in python – whose only job is to implement an if-then-else logic and return to airflow the name of the next task to execute. Branching Let’s take an example. We want to check if the value of a variable is greater than 10. If the value is greater than or equal 15 … Read more

Apache Airflow – First DAG

Now that we have a working Airflow it is time to look at DAGs in detail. In the previous post, we saw how to execute DAGs from the UI. In this post, we will talk more about DAGs.DAGs are the core concept of airflow. Simple DAG Simply speaking a DAG is a python script. Here is the code of a hello world DAG. It executes a command to print “helllooooo world”. It may not do much but it provides a lot of information about how to write an airflow DAG. Understanding an Airflow DAG Remember this code is stored in the $DAGS_FOLDER. Please refer to the previous blog which has the details on the location. An important thing to note and I quote from the airflow website One thing to wrap your head around (it may not be very intuitive for everyone at first) is that this Airflow Python script … Read more

Apache Airflow – Getting Started

I recently finished a project where Apache Airflow(just airflow for short) was being touted as the next generating Workflow Mangement System and the whole place was just going gaga over. Well, that got me thinking how I could get to understand and learn it. Hence the blog post. Here are some things you may want to know before getting your hands dirty into Apache Airflow What is Airflow?The definition of Apache Airflow goes like this Airflow is a platform to programmatically author, schedule and monitor workflows. Use airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The airflow scheduler executes your tasks on an array of workers while following the specified dependencies. So Airflow executes a series of unrelated tasks which when executed together accomplish a business outcome. For those folks who are working on the likes of Informatica – airflow is similar to Workflow Designer or those working in Oracle Data … Read more