I recently finished a project where Apache Airflow (just "airflow" for short) was being touted as the next-generation Workflow Management System, and the whole place was going gaga over it. Well, that got me thinking about how I could get to understand and learn it. Hence the blog post.
Here are some things you may want to know before getting your hands dirty with Apache Airflow.
What is Airflow?
The definition of Apache Airflow goes like this
Airflow is a platform to programmatically author, schedule and monitor workflows. Use airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The airflow scheduler executes your tasks on an array of workers while following the specified dependencies.
So Airflow executes a series of tasks which, when executed together, accomplish a business outcome. For those folks who are working on the likes of Informatica, airflow is similar to Workflow Designer; for those working in Oracle Data Integrator (ODI), it would be ODI packages. This should put things in a bit of context if you are coming from a proprietary ETL software background.
More product information is available on this – http://airflow.apache.org
This post covers the following items
– Airflow Components
– Installation of Airflow
– Configuration of Airflow
– Initialise Airflow
– Startup Airflow
– Execution of DAGs
Before you jump into the depths of airflow it is a good idea to familiarise yourself with its various components.
A very nice explanation is available on the airflow website – https://airflow.apache.org/concepts.html
Installation of Airflow
Airflow can be installed using pip, the recommended tool for installing Python packages. But before you go ahead and install airflow, check the following pre-reqs:
python 2.7 (yes, I am a bit old fashioned), pip, gcc, gcc-c++, python-devel
You can copy and paste the following
curl -O https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py
sudo yum install gcc
sudo yum install gcc-c++
sudo yum install python-devel
These commands have been executed on an Amazon Linux EC2 instance. Once the above libraries are installed, move to the next step.
Airflow is made of the core airflow package, written in Python. In addition to that it also comes with sub-packages which can be added as and when required. More information is available on this page
There are packages available for various integrations – Slack, S3, Postgres, MySQL – an ever-increasing list of contributions to the airflow project.
sudo pip install apache-airflow[postgres]
Make sure that the crypto sub-package (apache-airflow[crypto]) is installed along with the initial installation, else you would face startup issues – though they can be resolved later. We will look at that when doing the airflow configuration.
All the metadata for airflow is stored in a relational database. By default, it comes pre-packaged with a SQLite database. But that is only good for trying out single tasks; it is not suitable for running DAGs in any serious way.
Step-1 – Environment Variables
So before we go and jump into airflow configuration, let's set up two environment variables.
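The post doesn't list the two variables. For airflow installs of this vintage they are usually AIRFLOW_HOME (where airflow keeps its config and metadata) and SLUGIFY_USES_TEXT_UNIDECODE (to sidestep a GPL dependency at install time) – treat both as assumptions and adjust to your setup:

```shell
# Assumed variables - adjust to your environment
export AIRFLOW_HOME=$HOME/airflow              # where airflow.cfg, logs and the DB live
export SLUGIFY_USES_TEXT_UNIDECODE=yes         # avoids the GPL unidecode dependency on install
```

Add these to ~/.bash_profile so they survive new ssh sessions.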
Step-2 – Airflow Directory
Create the directories
I have created these directories in my home directory
~/airflow - This is airflow home.
~/airflow/dags - This is where the DAGs are stored.
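Both directories can be created in one go (assuming the home-directory layout above):

```shell
# -p creates the airflow home and the dags folder in one command
mkdir -p ~/airflow/dags
```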
Step-3 – Configure airflow.cfg
airflow.cfg is stored in $AIRFLOW_HOME/airflow.cfg
Airflow is highly customisable, but to get started you only need to configure the following parameters:
- Set the database connection
- Set the executor

Database connection
Airflow uses sqlalchemy for connecting to databases. The name of the variable used to configure the connection is sql_alchemy_conn
Here is an example below
sql_alchemy_conn = postgresql+psycopg2://username:password@hostname:5432/dbname
By default, the SequentialExecutor is configured in airflow. However, with the sequential executor tasks can only run one at a time, which is unsuited for any practical purposes. For anything more serious, we need to use the LocalExecutor, DaskExecutor or CeleryExecutor.
executor = LocalExecutor
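Putting the two settings together, the relevant lines in $AIRFLOW_HOME/airflow.cfg would look roughly like this (the connection details are placeholders – substitute your own):

```
[core]
# placeholder credentials - substitute your own
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/airflow
executor = LocalExecutor
```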
Step-4 Initialise Airflow
Before you can start airflow you need to initialise its metadata database.
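The post doesn't show the command itself; on the Airflow 1.x CLI that this post targets, initialisation is done with the following (newer releases renamed it to `airflow db init`):

```shell
# creates the metadata tables in the configured database
airflow initdb
```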
Once the initialisation is complete the airflow directory should look something like this
Airflow is now initialised with the metadata tables created. Now we can start the airflow components from the CLI. We need to get two components up to bring airflow into a usable state.
- Web Server UI – to view & trigger DAGs
- Airflow Scheduler – the service that schedules and runs tasks
For both of these to work together, start two different ssh sessions and fire the following commands, one in each. The order of firing does not matter.
airflow webserver -p 8080
airflow scheduler
We now have a working Apache Airflow…….Tadaaaaaa!
Execution of DAGs
Once you have the web server and the scheduler running you can execute DAGs. The easiest way is to trigger them from the UI. In case you don't have access to the UI, you can always use the Airflow CLI, which is quite extensive. A couple of commands are already shown above. For more of these commands, you can refer to https://airflow.apache.org/cli.html
You can trigger a DAG manually from the UI by clicking on the first icon in the links column. Alternatively, from the CLI you can fire the following command
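The command itself is missing from the post; on the Airflow 1.x CLI a manual trigger looks like this (the DAG id is a placeholder – use your own):

```shell
# trigger a single run of the DAG with the given dag_id
airflow trigger_dag example_etl
```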
Until next time!