Apache Airflow – Getting Started

I recently finished a project where Apache Airflow (just "airflow" for short) was being touted as the next-generation Workflow Management System, and the whole place was going gaga over it. Well, that got me thinking about how I could understand and learn it. Hence this blog post.

Here are some things you may want to know before getting your hands dirty with Apache Airflow.

What is Airflow?
The definition of Apache Airflow goes like this:

Airflow is a platform to programmatically author, schedule and monitor workflows. Use airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The airflow scheduler executes your tasks on an array of workers while following the specified dependencies.

So Airflow executes a series of tasks which, when executed together, accomplish a business outcome. For folks working on the likes of Informatica, airflow is similar to Workflow Designer; for those working in Oracle Data Integrator (ODI), it would be ODI packages. This should put things in a bit of context if you are coming from a proprietary ETL software background.

More product information is available here – http://airflow.apache.org

This post covers the following items
– Airflow Components
– Installation of Airflow
– Configuration of Airflow
– Initialise Airflow
– Startup Airflow
– Execution of DAGs

Airflow Components

Before you jump into the depths of airflow, it is a good idea to familiarise yourself with its various components:

-DAGs
-Operators
-Tasks
-Workflows
-Hooks
-Pools
-Connections
-Queues
-Variables
-Branching
-Sub DAGs
-Trigger Rules
-Scheduler
-Worker
-XComs

A very nice explanation is available on the airflow website:
– https://airflow.apache.org/concepts.html
– https://airflow.apache.org/scheduler.html
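To make the DAG/task/scheduler vocabulary concrete before diving into the real thing, here is a tiny pure-Python sketch (not Airflow code – the task names are made up) of what a scheduler conceptually does: run tasks in an order that respects the dependency edges of a directed acyclic graph.

```python
# Illustrative only: a DAG as a dict of task -> upstream dependencies,
# and a naive "scheduler" that runs tasks in dependency order.
# (Real Airflow does this with operators, workers and a metadata DB.)
dag = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
    "notify": ["load"],
}

def run_order(dag):
    """Return a topological ordering of the tasks."""
    done, order = set(), []
    while len(order) < len(dag):
        ready = [t for t, deps in dag.items()
                 if t not in done and all(d in done for d in deps)]
        if not ready:
            raise ValueError("cycle detected - not a DAG")
        for t in sorted(ready):  # deterministic tie-break
            done.add(t)
            order.append(t)
    return order

print(run_order(dag))  # -> ['extract', 'transform', 'load', 'notify']
```

If a task's upstream dependencies are never all satisfied, the graph has a cycle – which is exactly why Airflow insists on the "acyclic" in DAG.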

Installation of Airflow

Pre-Reqs

Airflow can be installed using pip – the recommended tool for installing Python packages. But before you go ahead and install airflow, check the following pre-reqs:

python2.7 (yes, I am a bit old fashioned), pip, gcc, gcc-c++, python-devel

You can copy and paste the following

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
sudo python get-pip.py
sudo yum install gcc
sudo yum install gcc-c++
sudo yum install python-devel

These commands have been executed on an Amazon Linux EC2 instance. Once the above libraries are installed, move to the next step.

Installation

Airflow consists of a core airflow package written in Python. In addition, it comes with optional sub-packages which can be added as and when required. More information is available on this page
https://airflow.apache.org/installation.html

There are packages available for various integrations – Slack, S3, Postgres, MySQL – an ever-increasing list of contributions to the airflow project.

sudo pip install "apache-airflow[s3,crypto,celery,slack,jdbc]"
sudo pip install "apache-airflow[postgres]"

Make sure the crypto extra is installed as part of the initial installation, else you may face startup issues. They can be resolved, and we will look at that when doing the airflow configuration.
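The crypto-related startup issue comes down to the fernet_key setting in airflow.cfg (under [core]). A Fernet key is just 32 random bytes, url-safe base64-encoded, so you can generate one with the standard library alone – this is equivalent to calling Fernet.generate_key() from the cryptography package:

```python
# Generate a Fernet key suitable for the fernet_key setting in airflow.cfg.
# (Equivalent to cryptography's Fernet.generate_key().)
import base64
import os

fernet_key = base64.urlsafe_b64encode(os.urandom(32)).decode("ascii")
print(fernet_key)  # paste this value into airflow.cfg under [core] fernet_key
```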

Configuration

All the metadata for airflow is stored in a relational database. By default, it comes pre-packaged with a SQLite database. But SQLite only supports sequential execution, which is good for trying out single tasks but not suitable for running real DAGs.

Step-1 – Environment Variables

So before we jump into airflow configuration, let's set up two environment variables.

export AIRFLOW_HOME=~/airflow
export DAGS_FOLDER=$AIRFLOW_HOME/dags

Step-2 – Airflow Directory

Create the directories

I have created these directories in my home directory

~/airflow - This is airflow home.
~/airflow/dags - This is where the DAGs are stored.

Step-3 – Configure airflow.cfg

airflow.cfg is stored in $AIRFLOW_HOME/airflow.cfg

Airflow is highly customizable, but to get started you just need to configure the following parameters:

- Set the database connection
- Set the executor

  • Database connection

Airflow uses SQLAlchemy for connecting to databases. The name of the variable used to configure the connection is sql_alchemy_conn.

Here is an example (substitute your own username, password, host and database name):

sql_alchemy_conn = postgresql+psycopg2://username:password@host:5432/dbname
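Airflow hands this string straight to SQLAlchemy, so you can sanity-check it with SQLAlchemy's own URL parser before editing airflow.cfg. The credentials, host and database name below are illustrative:

```python
# Sketch: validate a connection string with SQLAlchemy's URL parser
# before putting it into airflow.cfg (values here are made up).
from sqlalchemy.engine.url import make_url

url = make_url("postgresql+psycopg2://airflow:secret@localhost:5432/airflow")
print(url.drivername)           # postgresql+psycopg2
print(url.host, url.database)   # localhost airflow
```

A typo in the dialect or driver part (the bit before ://) is one of the more common reasons the webserver fails to come up.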
  • Executor

By default, SequentialExecutor is configured in airflow. However, the sequential executor can only execute tasks one at a time, which is unsuited to any practical purpose. For anything more serious, we need to use LocalExecutor, DaskExecutor or CeleryExecutor.

executor = LocalExecutor
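Since airflow.cfg is a standard INI file, you can double-check the two settings you changed with Python's stdlib configparser. The sample config below is illustrative, not taken from a real installation:

```python
# Sketch: airflow.cfg is INI-format, so configparser can read back the
# two settings configured above (sample values are illustrative).
import configparser

sample_cfg = """
[core]
sql_alchemy_conn = postgresql+psycopg2://airflow:secret@localhost:5432/airflow
executor = LocalExecutor
"""

cfg = configparser.ConfigParser()
cfg.read_string(sample_cfg)
print(cfg["core"]["executor"])  # -> LocalExecutor
print(cfg["core"]["sql_alchemy_conn"])
```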

Step-4 – Initialise Airflow

Once the configuration is done, you need to initialise airflow, which creates the metadata tables in the configured database.

airflow initdb

Once the initialisation is complete, the metadata tables are created and the airflow home directory will contain the generated configuration and log files.

Startup Airflow

Airflow is now initialised with the metadata tables created, so we can start the airflow components from the CLI. We need two components up to get airflow into a usable state.

  • Web Server UI – to view & execute dags
  • Airflow Scheduler – service to execute requests

For both of these to work together, start two different ssh sessions and fire the following commands, one in each. The order of firing does not matter.

airflow scheduler
airflow webserver -p 8080

We now have a working Apache Airflow…….Tadaaaaaa!

Execution of DAGs

Once you have the web server and the scheduler running, you can execute DAGs. The easiest way is to trigger them from the UI. In case you don't have access to the UI, you can always use the Airflow CLI, which is quite extensive. A couple of commands are already shown above; for more of these commands, you can refer to https://airflow.apache.org/cli.html

You can trigger a DAG manually from the UI by clicking on the first icon in the Links column. Alternatively, from the CLI you can fire the following command:

airflow trigger_dag hello_world2

Until next time!
