Airflow – Scale out with RabbitMQ and Celery

Introduction Airflow queues and workers are required if there is a need to make the Airflow infrastructure more flexible and resilient. This blog entry describes how to install, set up and configure the additional components so that we can use Airflow in a more flexible and resilient manner. Till now we have used the LocalExecutor to execute all our jobs, which works fine if we have a small number of jobs and no two jobs run at the same time. However, in the real world this is rarely the case. Additionally, we also need to take care of the possibility of the local executor becoming unavailable. Airflow queues are like any other queues and use a messaging system – like RabbitMQ or ActiveMQ. The Airflow scheduler sends tasks as messages to the queues and hence acts as a publisher. Airflow workers are configured to listen for events (i.e. tasks) coming on particular queues and execute … Read more
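The moving parts above reduce to a couple of configuration changes. A minimal sketch of the relevant `airflow.cfg` entries, assuming a RabbitMQ broker and a Postgres result backend on localhost (all hostnames and credentials below are placeholders, not values from this post):

```ini
# airflow.cfg -- illustrative values only
[core]
# Switch from LocalExecutor to CeleryExecutor
executor = CeleryExecutor

[celery]
# RabbitMQ acts as the message broker between scheduler and workers
broker_url = amqp://guest:guest@localhost:5672//
# Workers report task state here
result_backend = db+postgresql://airflow:airflow@localhost:5432/airflow
```

A worker can then be pointed at a specific queue, e.g. `airflow worker -q reporting_queue` in Airflow 1.x, so that it only picks up tasks sent to that queue.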

Airflow – Connections

Introduction We understand that Airflow is a workflow/job orchestration engine and can execute various tasks by connecting to our environments. There are various ways to connect to an environment, but one needs the connection details of that environment to connect to it. For example: Postgres DB – hostname, port, schema; SSH – a hostname which allows SSH connections. The list extends to AWS, Google Cloud or Azure as well, but all of them need some sort of connection information. Airflow is no different and needs connection information. One only needs to log in and go to Admin -> Connections to see the exhaustive list of connections which are possible. Airflow allows for various connections and some of the documentation can be found on this link. You can see the various types of connections which can be made by Airflow, which makes connecting to different types of technologies easy. Keep in mind … Read more
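The connection fields mentioned above (hostname, port, schema, credentials) are what Airflow stores per connection, and internally they can be expressed as a single connection URI. A plain-Python sketch of how those fields combine (no Airflow required; the field values are made up for illustration):

```python
def build_conn_uri(conn_type, host, port, schema, login=None, password=None):
    """Combine the connection fields Airflow stores (type, host, port,
    schema, credentials) into a single URI string."""
    auth = ""
    if login:
        auth = login + (":" + password if password else "") + "@"
    return f"{conn_type}://{auth}{host}:{port}/{schema}"

uri = build_conn_uri("postgres", "db.example.com", 5432, "analytics",
                     login="airflow", password="secret")
print(uri)  # postgres://airflow:secret@db.example.com:5432/analytics
```

Airflow builds and parses such URIs for you once the connection is saved under Admin -> Connections; the sketch only shows what the individual fields contribute.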

Airflow – Sub-DAGs

Introduction Workflows can quickly become quite complicated, which makes them difficult to understand. To reduce complexity, tasks can be grouped together into smaller DAGs which are then called from a parent DAG. This makes DAGs easier to manage and maintain. Creating a sub-DAG consists of two steps: create the actual sub-DAG and test it, then create a DAG which calls the sub-DAG created in step 1. For those of you wondering how to call a sub-DAG – it is easy using the SubDagOperator. In the next example, we will re-use one of our earlier examples as a sub-DAG and call it from another DAG. Create a Sub-DAG Let's first see the code of the sub-DAG. You will observe that it is very similar to a DAG from the previous entries except that it is created inside a function. See below. Before moving forward, a few important points: the Airflow sub-DAG is in a separate file in the same … Read more
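The two steps can be sketched as follows. This is a minimal, hypothetical DAG definition assuming Airflow 1.x-style imports; all DAG and task names are made up and it only runs inside an Airflow installation:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator

default_args = {"start_date": datetime(2019, 1, 1)}

def create_subdag(parent_dag_id, child_task_id, args):
    # Step 1: the sub-DAG, built inside a function. Its dag_id must be
    # exactly "<parent dag_id>.<task_id of the SubDagOperator>".
    subdag = DAG(dag_id=f"{parent_dag_id}.{child_task_id}",
                 default_args=args, schedule_interval=None)
    DummyOperator(task_id="hello_task", dag=subdag)
    return subdag

# Step 2: the parent DAG which calls the sub-DAG via SubDagOperator.
with DAG("parent_dag", default_args=default_args,
         schedule_interval=None) as dag:
    call_subdag = SubDagOperator(
        task_id="call_subdag",
        subdag=create_subdag("parent_dag", "call_subdag", default_args),
    )
```

The dag_id naming convention in step 1 is the detail that most often trips people up: if it does not match the parent dag_id plus the SubDagOperator's task_id, the sub-DAG will not be picked up.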

Airflow – Branch Joins

Introduction In many use cases, there is a requirement for having different branches (see blog entry) in a workflow. However, these branches can also join together and execute a common task – something similar to the pic below. There is no special operator for this; it is only the way we assign and set upstream and downstream relationships between the tasks. An example of this is shown below. Branch Join DAG The hello_joins example extends a DAG from the previous blog entry. An additional task is added to the DAG which has been quite imaginatively named join_task. Understand the DAG Let's look at the changes introduced in the DAG. Step 5 – A new task called join_task was added. Observe the TriggerRule which has been added: it is set to ONE_SUCCESS, which means that if any one of the preceding tasks has been successful, join_task should be executed. Step 6 – Adds the … Read more
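In Airflow the join is expressed by setting `trigger_rule=TriggerRule.ONE_SUCCESS` on the joining operator. As a plain-Python toy model (no Airflow required) of what that rule evaluates over the upstream task states:

```python
def one_success(upstream_states):
    # TriggerRule.ONE_SUCCESS: fire as soon as at least one upstream
    # task instance has succeeded, regardless of the others.
    return any(state == "success" for state in upstream_states)

# Only one branch of the DAG actually ran and succeeded; the other
# branch was skipped -- join_task still fires.
print(one_success(["success", "skipped"]))   # True
print(one_success(["skipped", "skipped"]))   # False
```

Without this rule the default (ALL_SUCCESS) would keep join_task waiting forever, since a branched DAG always skips one of the upstream paths.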

Airflow – Variables

In the last entry on Airflow (which was many moons ago!) we created a simple DAG and executed it. It was simple, to get us started, but it did not allow us any flexibility: we could not change the behaviour of the DAG. In real-world scenarios, we may need to change the behaviour of our workflow based on certain parameters. This is accomplished by Airflow Variables. Airflow Variables are simple key-value pairs which are stored in the database which holds the Airflow metadata. These variables can be created and managed via the Airflow UI or the Airflow CLI: Airflow WebUI -> Admin -> Variables. Some of the features of Airflow Variables: they can be defined as simple key-value pairs; one variable can hold a list of key-value pairs as well; they are stored in the Airflow metadata database; and they can be used in the Airflow DAG code as … Read more
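A variable that holds a list of key-value pairs is stored as a JSON string; inside Airflow, `Variable.get(key, deserialize_json=True)` does the parsing for you. A plain-Python sketch of the same round trip (the variable name and values here are made up):

```python
import json

# What you would paste into Admin -> Variables under the key "etl_config"
stored_value = '{"source_table": "sales", "retries": 3, "regions": ["EU", "US"]}'

# Airflow does this step for you via
# Variable.get("etl_config", deserialize_json=True)
config = json.loads(stored_value)

print(config["source_table"])  # sales
print(config["regions"])       # ['EU', 'US']
```

Packing related settings into one JSON variable like this keeps the Variables screen manageable and means a single lookup per DAG run instead of one per setting.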

Airflow – Branching

Introduction Branching is a useful concept when creating workflows. Simply speaking, it is a way to implement if-then-else logic in Airflow. For example, there may be a requirement to execute certain task(s) only when a particular condition is met. This is achieved via Airflow branching: if the condition is true, certain task(s) are executed, and if the condition is false, different task(s) are executed. Branching is achieved by implementing an Airflow operator called the BranchPythonOperator. To keep it simple, it is essentially an operator which implements a task. This task then calls a simple method written in Python whose only job is to implement the if-then-else logic and return to Airflow the name of the next task to execute. Branching Let's take an example. We want to check if the value of a variable is greater than 10. If the value is greater than or equal to 15 … Read more
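The Python callable behind a BranchPythonOperator only needs to return the task_id of the task that should run next. A minimal sketch of such a callable (the task names and the threshold of 15 are illustrative, not the ones from this post):

```python
def choose_branch(value):
    # The returned string must match the task_id of a downstream task.
    # In Airflow this function would be passed to BranchPythonOperator
    # via the python_callable argument.
    if value >= 15:
        return "greater_equal_15_task"
    return "less_than_15_task"

print(choose_branch(20))  # greater_equal_15_task
print(choose_branch(7))   # less_than_15_task
```

Airflow then runs only the task whose id was returned and marks the other branch's tasks as skipped.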

Spark DataSets Performance Tuning – Resource Planning

Introduction In this post we will focus on Datasets and see how they can be tuned, and also see whether they are more efficient to work with than RDDs, even though Datasets are converted to RDDs for processing – more on that later. To set the stage for this post we will use the same resource plans as were used in Spark RDDs Performance Tuning – Resource Planning, answer the same queries, use the same data and use a cluster with exactly the same components and resources. That way we will have a nice comparison as well. If you haven't read it – then you should :). So let's dive in, run our tests and see how everything stacks up. Step – 1 Modify spark-defaults.conf Make the following changes to spark-defaults.conf. In Amazon EMR it is found in the /etc/spark/conf/ directory. Step – 2 Cluster Configuration For … Read more
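The edit to spark-defaults.conf in Step 1 typically pins the executor sizing so that runs are comparable. A sketch of the kind of entries involved – the values below are placeholders, not the resource plan used in the post:

```properties
# /etc/spark/conf/spark-defaults.conf (Amazon EMR) -- illustrative values
spark.executor.memory            4g
spark.executor.cores             2
spark.executor.instances         6
spark.driver.memory              2g
# Pinning executor counts usually means turning dynamic allocation off
spark.dynamicAllocation.enabled  false
```

Fixing these values explicitly keeps the RDD and Dataset runs on an identical footing, so any runtime difference comes from the API, not from the cluster.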

Spark RDDs Performance Tuning – Partitioning

Now that we have gone through resource planning, shuffling and caching to tune our Spark application, we have still not looked at a couple of areas which can give us some additional performance gains and make our application a bit more efficient. One of the areas we can look at tuning is how Spark partitions its data and processes those partitions. If we have too many small partitions, we would have a scenario where Spark takes more time to start tasks than to process the data, and of course this leads to additional shuffling. So the overly granular approach may not be the answer. In the same way, having a few big partitions may lead to inefficient use of cluster resources. The answer lies somewhere in between these two extremes. If you have run the code in any of the previous posts you will have … Read more
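One common way to land between the two extremes is to size partitions toward something like an HDFS block (128 MB is a frequent rule of thumb, not a hard rule). A back-of-the-envelope sketch of that arithmetic – the target size and dataset sizes are illustrative assumptions:

```python
import math

def suggest_partitions(dataset_mb, target_partition_mb=128, min_partitions=1):
    # Aim for roughly target_partition_mb per partition:
    # too little data per partition -> task startup overhead dominates,
    # too much -> idle cores and possible spilling.
    return max(min_partitions, math.ceil(dataset_mb / target_partition_mb))

print(suggest_partitions(10_000))  # 79 partitions for a ~10 GB dataset
print(suggest_partitions(50))      # 1 -- don't over-split tiny data
```

In Spark the resulting number would then be applied with `rdd.repartition(n)` (or `coalesce(n)` when only reducing the partition count, which avoids a full shuffle).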

Spark RDDs Performance Tuning – Shuffling & Caching

Let’s look at the next step in our performance tuning journey. This one involves changing which APIs are called and how data is stored in the distributed memory. Both involve code changes – not too many, but they do help. Both these topics have been covered to an extent in some of my previous blog posts; for a quick recap, refer to the links below: Resource Planning, Spark Caching, Spark Shuffling. We will use the same dataset and the same transformations (as in the previous post) but with minor changes, which are highlighted, and see how we can reduce shuffling and bring down the runtimes further. Reducing Shuffling There are two simple ways of reducing shuffling: reduce the dataset on which the shuffle occurs, or change the code to use a more efficient API. Reduce dataset size When doing data analytics it is usually observed that not all the attributes which … Read more
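A toy model (plain Python, not Spark) of why the second approach helps: with `groupByKey` every record crosses the network, while `reduceByKey` combines values per key on each partition first (the map-side combine), so only one record per distinct key per partition is shuffled. The partition contents below are made up:

```python
def records_shuffled_group_by_key(partitions):
    # groupByKey: every (key, value) record is sent across the network.
    return sum(len(p) for p in partitions)

def records_shuffled_reduce_by_key(partitions):
    # reduceByKey: values are combined per key on each partition first,
    # so one record per distinct key per partition moves.
    return sum(len({k for k, _ in p}) for p in partitions)

partitions = [
    [("a", 1), ("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1)],
]
print(records_shuffled_group_by_key(partitions))   # 7
print(records_shuffled_reduce_by_key(partitions))  # 4
```

The gap widens with skewed keys and large partitions, which is why swapping the API is often the cheapest shuffle reduction available.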

Spark RDDs Performance Tuning – Resource Planning

Introduction Tuning Spark jobs can dramatically increase performance and help squeeze more from the resources at hand. This post is a high-level overview of how to tune Spark jobs and talks about the various tools at your disposal. Performance tuning may be black magic to some, but to most engineers it comes down to: how many resources are provided, how well the provided resources are used, and how you write your code. Before you can tune a Spark job it is important to identify where the potential performance bottlenecks are: resources (memory, CPU cores and executors); partitioning and parallelism; long-running straggling tasks; and caching. To help with the above areas, Spark provides (and has access to) some tools which can be used for tuning: the Resource Manager UI (for example, YARN), the Spark Web UI and History Server, the Tungsten and Catalyst optimizers, and the explain plan. For the purpose of this post, a pre-cleaned dataset … Read more
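As a concrete sketch of the resources bullet, here is the common back-of-the-envelope executor sizing math. The node size, the 5-cores-per-executor rule of thumb and the ~10% memory-overhead factor are illustrative assumptions, not values from this post:

```python
def size_executors(node_cores, node_mem_gb, cores_per_executor=5,
                   reserved_cores=1, reserved_mem_gb=1, overhead_frac=0.10):
    # Leave a core and some memory for the OS / Hadoop daemons,
    # then split what's left into executors of cores_per_executor each.
    usable_cores = node_cores - reserved_cores
    executors_per_node = usable_cores // cores_per_executor
    mem_per_executor = (node_mem_gb - reserved_mem_gb) / executors_per_node
    # spark.executor.memory must leave room for memory overhead (~10%).
    heap_gb = int(mem_per_executor * (1 - overhead_frac))
    return executors_per_node, heap_gb

print(size_executors(16, 64))  # (3, 18): 3 executors/node, ~18g heap each
```

The resulting numbers would feed `spark.executor.cores`, `spark.executor.memory` and (multiplied by the node count) `spark.executor.instances`.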