Airflow – External Task Sensor

Airflow External Task Sensor deserves a separate blog entry. It is a really powerful feature in airflow and can help you sort out dependencies for many use-cases – a must-have tool. This blog entry introduces external task sensors and how they can be quickly implemented in your ecosystem. Before you dive into this post, if this is the first time you are reading about sensors I would recommend you read the following entry: Airflow – Sensors. Why use an External Task Sensor? Here is my thought as to why an external task sensor is very useful. Most traditional scheduling is time-based. This means that the dependencies between jobs are based on an assumption that the first job will definitely finish before the next job starts. But what happens if the first job fails, or is processing more data than usual and is delayed? Well, we have what is called … Read more
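Here is a minimal, hypothetical sketch of the idea: an ExternalTaskSensor in one DAG waits for a task in another DAG to finish before the downstream work runs. All DAG and task ids below are made up for illustration, and the imports follow the Airflow 1.10 layout.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.sensors.external_task_sensor import ExternalTaskSensor

with DAG(
    dag_id="downstream_dag",                 # hypothetical DAG id
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
) as dag:

    # Wait for "final_task" in "upstream_dag" to succeed before proceeding.
    wait_for_upstream = ExternalTaskSensor(
        task_id="wait_for_upstream",
        external_dag_id="upstream_dag",      # assumed name of the other DAG
        external_task_id="final_task",       # assumed task id inside that DAG
        execution_delta=timedelta(hours=1),  # upstream schedule runs 1 hour earlier
        poke_interval=60,                    # check once a minute
        timeout=60 * 60,                     # give up after an hour
    )

    process = DummyOperator(task_id="process")

    wait_for_upstream >> process
```

Note that execution_delta is only needed when the two DAGs run on different schedules; if both run at the same time it can be left out.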

Airflow – Sensors

Airflow sensors are like operators but perform a special task in an airflow DAG. They check for a particular condition at regular intervals and, when it is met, they pass control to downstream tasks in a DAG. There are various types of sensors, and in this mini blog series we intend to explore them. Before you read further: in case you are beginning to learn airflow, do have a look at these blog posts: Getting Started with Airflow, First DAG, Airflow Connections. Introduction: The fastest way to learn how to use an airflow sensor is to look at an example. In this blog post, we will be looking at an example using S3KeySensor for reading files as soon as they arrive in S3. There are other sensors that are available as well. Some of them are the S3 Key Sensor, SQL Sensor, HTTP Sensor, HDFS Sensor and Hive Sensor … Read more
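As a minimal sketch of the S3KeySensor pattern (not the post's exact code), the DAG below waits for matching files to land in a bucket and only then runs a downstream task. The bucket name, key pattern and connection id are assumptions, and the import paths match Airflow 1.10.x.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.sensors.s3_key_sensor import S3KeySensor

with DAG(
    dag_id="s3_sensor_demo",              # hypothetical DAG id
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
) as dag:

    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_name="my-data-bucket",     # assumed bucket
        bucket_key="incoming/data_*.csv", # assumed key pattern
        wildcard_match=True,
        aws_conn_id="aws_default",
        poke_interval=60,                 # check S3 every minute
        timeout=60 * 60 * 6,              # fail after 6 hours of waiting
    )

    process_file = BashOperator(
        task_id="process_file",
        bash_command="echo 'file arrived, processing...'",
    )

    wait_for_file >> process_file
```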

Airflow – Dynamic DAGs

Introduction: Over the last couple of months I have received lots of queries around dynamic DAGs, so here is a post on dynamic DAGs in airflow. Dynamic tasks are probably one of the best features of airflow. They allow you to launch airflow tasks dynamically inside an airflow DAG. A simple use case can be launching a shell script with different parameters from a list, all at the same time. To make things more fun, the list size changes all the time. This is where airflow really shines. Let's see how. If you are new to airflow, before you jump into reading this post read these posts for some background: Airflow getting started, How to write a DAG, Airflow Variables. Airflow DAG – Dynamic Tasks – Example 1: Creating dynamic tasks in a DAG is pretty simple. For the purpose of this section, … Read more
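As a rough sketch of what dynamic tasks look like (not the post's exact example), the loop below builds one task per entry in an Airflow Variable, so the number of tasks changes whenever the list changes. The Variable name, script path and DAG id are made up.

```python
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash_operator import BashOperator

# e.g. the Variable "script_params" set to: ["alpha", "beta", "gamma"]
params = Variable.get("script_params", deserialize_json=True, default_var=[])

with DAG(
    dag_id="dynamic_tasks_demo",          # hypothetical DAG id
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
) as dag:

    # One BashOperator per parameter; all of them can run in parallel.
    for p in params:
        BashOperator(
            task_id="run_script_{}".format(p),
            bash_command="./my_script.sh {}".format(p),  # hypothetical script
        )
```

Because the scheduler re-parses the DAG file regularly, updating the Variable is enough to change the set of tasks on the next parse.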

Airflow & SLA Management

Introduction: Came across this interesting feature of managing SLAs natively in airflow. Failures can happen not just because of an actual failure of a task/pipeline but also because of a slow-running task/pipeline. A slow-running task/pipeline may cause downstream tasks or DAGs which depend upon it to fail. This thought got me searching, and I decided to write a post about Airflow and SLA management. What I found was a simple solution which can help manage failures and delays. Before you jump into understanding how to manage SLAs in airflow, make sure you are familiar with how to create airflow DAGs. This post is divided into the following parts. A time duration (the SLA) is defined on a task. Airflow will monitor the performance of the task/DAG when SLAs are enabled. When SLAs are breached, airflow can trigger the following: an email notification, or an action via a callback function call. Note 1: SLA checks occur … Read more
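As a hedged illustration of the two pieces above (an SLA duration on a task, plus an email and a callback when it is breached), the snippet below sets a 30-minute SLA. The DAG id, email address and callback body are placeholders, not the post's code.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator


def sla_miss_alert(dag, task_list, blocking_task_list, slas, blocking_tis):
    # This is the signature Airflow passes to sla_miss_callback.
    print("SLA missed for: {}".format(task_list))


with DAG(
    dag_id="sla_demo",                                # hypothetical DAG id
    start_date=datetime(2020, 1, 1),
    schedule_interval="@hourly",
    sla_miss_callback=sla_miss_alert,                 # action via a callback
    default_args={"email": ["oncall@example.com"]},   # placeholder alert address
) as dag:

    slow_task = BashOperator(
        task_id="slow_task",
        bash_command="sleep 10",
        sla=timedelta(minutes=30),   # task should finish within 30 minutes
    )
```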

Airflow remote logging using AWS S3

Airflow logs are stored on the local filesystem by default, in the $AIRFLOW_HOME/logs directory. Airflow can also be easily configured to store its logs on AWS S3 instead; this is called remote logging. This blog entry describes the steps required to configure Airflow to store its logs in an S3 bucket. The blog entry is divided into the following sections: Introduction, Create S3 Connection, Configure airflow.cfg, Demo. Introduction: For this blog entry, we are running airflow on an Ubuntu server which has access to AWS S3 buckets via the AWS CLI. Note: if you are using an EC2 instance, please make sure that your instance has read-write access to S3 buckets configured. Create S3 Connection: To enable remote logging in airflow, we need to make use of an airflow plugin which can be installed as part of the airflow pip install command. Please refer to this blog entry for more details. Go to … Read more
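For reference, the airflow.cfg settings involved (in the Airflow 1.10.x layout) look roughly like the sketch below. The bucket name and connection id are placeholders and must match the S3 connection created in the Airflow UI.

```ini
[core]
# Ship task logs to S3 instead of keeping them only on the local filesystem.
remote_logging = True
# Placeholder bucket/prefix for the logs.
remote_base_log_folder = s3://my-airflow-logs/logs
# Id of the S3 connection defined under Admin->Connections.
remote_log_conn_id = my_s3_conn
encrypt_s3_logs = False
```

After changing the config, restart the webserver and scheduler so the new logging settings take effect.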

Airflow RBAC – Role-Based Access Control

Airflow version 1.10.0 onwards introduced Role-Based Access Control (RBAC) as part of its security landscape. RBAC is the quickest way to secure airflow. It comes with pre-built roles, which makes it easy to implement, and not stopping there, you can add your own roles as well. In addition to securing various features of the airflow web UI, RBAC can be used to secure access to DAGs as well, though this feature is only available from 1.10.2. And as with any software, a few bugs popped up, but these were fixed in 1.10.7; there is an entry for the bug, AIRFLOW-2694. In this blog entry, we will touch upon DAG-level access as well. The blog entry is divided into a few parts: Enable RBAC, Create users using standard roles, Roles & permissions, Secure DAGs with RBAC. Needless to say, I am assuming you have a working airflow. If not, please head … Read more
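As a quick sketch of the first two parts, enabling RBAC in Airflow 1.10.x is a single airflow.cfg switch:

```ini
[webserver]
# Switch the web UI to the RBAC (Flask-AppBuilder) interface.
rbac = True
```

A first admin user can then be created with something like `airflow create_user -r Admin -u admin -e admin@example.com -f Ad -l Min -p changeme` (all of these values are placeholders), and further users can be given the standard Viewer, User or Op roles in the same way.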

Airflow – XCOM

Introduction: Airflow XCom is used for inter-task communication. It sounds a bit complex, but it is really very simple: its implementation inside airflow is straightforward, it is easy to use and, needless to say, it has numerous use cases. Inter-task communication is achieved by passing key-value pairs between tasks. Tasks can run on any airflow worker and need not run on the same worker: to pass information, a task pushes a key-value pair, and the key-value pair is then pulled by another task and utilized. This blog entry requires some knowledge of airflow. If you are just starting out, I would suggest you first get familiar with airflow; you can try this link. Pushing values to XCom: Before we dive headlong into XCom, let's see where to find them. XComs can be accessed via the Airflow Web UI -> Admin -> XComs. Let's look at the DAG below. Its sole … Read more
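Here is a minimal, self-contained sketch of the push/pull pattern (not the DAG from the post): one PythonOperator pushes a key-value pair, another pulls it. The ids and key name are illustrative, and `provide_context=True` is the Airflow 1.10 way to get the task instance inside the callable.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def push_value(**context):
    # Explicitly push a key-value pair to XCom.
    context["ti"].xcom_push(key="record_count", value=42)


def pull_value(**context):
    # Pull the value pushed by the upstream task.
    count = context["ti"].xcom_pull(task_ids="push_task", key="record_count")
    print("record_count from XCom: {}".format(count))


with DAG(
    dag_id="xcom_demo",                  # hypothetical DAG id
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
) as dag:

    push_task = PythonOperator(
        task_id="push_task",
        python_callable=push_value,
        provide_context=True,
    )

    pull_task = PythonOperator(
        task_id="pull_task",
        python_callable=pull_value,
        provide_context=True,
    )

    push_task >> pull_task
```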

Airflow – Scale-out with Redis and Celery

Introduction: This post uses Redis and Celery to scale out airflow. Redis is a simple caching server and scales out quite well; it can be made resilient by deploying it as a cluster. In my previous post, the airflow scale-out was done using Celery with RabbitMQ as the message broker. On the whole, I found maintaining RabbitMQ a bit fiddly unless you happen to be a RabbitMQ expert. Redis seems to be a better solution: it is a lot easier to deploy and maintain when compared with the various steps taken to deploy a RabbitMQ broker. In a nutshell, I like it more than RabbitMQ! To create an infrastructure like this, we need to do the following steps: Install & configure a Redis server on a separate host – 1 server; Install & configure Airflow with Redis and the Celery Executor … Read more
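The Airflow side of such a setup boils down to a couple of airflow.cfg changes, sketched below for Airflow 1.10.x; hostnames, credentials and database names are placeholders.

```ini
[core]
executor = CeleryExecutor

[celery]
# Redis as the Celery message broker (placeholder host and db number).
broker_url = redis://redis-host:6379/0
# Celery result backend stored in the metadata database (placeholder credentials).
result_backend = db+postgresql://airflow:airflow@metadata-host:5432/airflow
```

Each worker host then runs `airflow worker` to start pulling tasks from the broker.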

Airflow – Scale out with RabbitMQ and Celery

Introduction: Airflow queues and workers are required if there is a need to make the airflow infrastructure more flexible and resilient. This blog entry describes how to install, set up and configure additional components so that we can use airflow in a more flexible and resilient manner. Till now we have used a local executor to execute all our jobs, which works fine if we have a small number of jobs and they are not running while another job is running. However, in the real world, this is rarely the case. Additionally, we also need to take care of the possibility of the airflow local executor becoming unavailable. Airflow queues are like any other queues and use a messaging system – like RabbitMQ or ActiveMQ. The airflow scheduler sends tasks as messages to the queues and hence acts as a publisher. Airflow workers are configured to listen for events (i.e. tasks) coming on particular queues and execute … Read more
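To make the queue idea concrete, here is a small sketch (not from the post) of routing a task to a named queue; a worker started with `airflow worker -q data_queue` would be the only one to pick it up. The queue and DAG names are made up.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    dag_id="queue_demo",                 # hypothetical DAG id
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
) as dag:

    heavy_job = BashOperator(
        task_id="heavy_job",
        bash_command="echo 'running on a dedicated worker'",
        queue="data_queue",  # only workers subscribed to this queue pick it up
    )
```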

Airflow – Connections

Introduction: We understand that airflow is a workflow/job orchestration engine and can execute various tasks by connecting to our environments. There are various ways to connect to an environment, but one needs connection details about that environment in order to connect to it. For example: Postgres DB – hostname, port, schema; SSH – a hostname which allows SSH connections. The list may extend to AWS, Google Cloud or Azure as well, but all of them need some sort of connection information. Airflow is no different and needs connection information. One only needs to log in and go to Admin->Connections to see the exhaustive list of connections which are possible. Airflow allows for various connections, and some of the documentation can be found on this link. You can see the various types of connections which can be made by airflow; this makes connecting to different types of technologies very easy with airflow. Keep in mind … Read more
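Once a connection has been defined under Admin->Connections, it is used from code through a hook. The snippet below is a small sketch assuming a Postgres connection saved with the id `my_postgres` (a made-up id) on an Airflow 1.10-style install.

```python
from airflow.hooks.postgres_hook import PostgresHook

# Look up the connection details stored in Airflow and open a connection.
hook = PostgresHook(postgres_conn_id="my_postgres")

# Run a trivial query just to prove the connection works.
records = hook.get_records("SELECT 1")
print(records)
```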