Apache Airflow – First DAG

Now that we have a working Airflow, it is time to look at DAGs in detail. In the previous post, we saw how to execute DAGs from the UI. In this post, we will look at how a DAG is written.

DAGs are the core concept of Airflow. But how are they created? Here is the code of a hello world DAG I created.

# Filename: hello_world2.py
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
  'owner': 'airflow',
  'depends_on_past': False,
  'start_date': datetime(2018, 5, 30),
  'email': ['airflow@example.com'],
  'email_on_failure': False,
  'email_on_retry': False,
  'retries': 1,
  'retry_delay': timedelta(minutes=5),
# 'queue': 'bash_queue',
# 'pool': 'backfill',
# 'priority_weight': 10,
# 'end_date': datetime(2016, 1, 1),
}

dag = DAG('hello_world2', schedule_interval='0 0 * * *',
  default_args=default_args)
create_command = 'echo   HELLOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO '
t1 = BashOperator(
  task_id='print_date',
  bash_command='date',
  dag=dag
)

t2 = BashOperator(
  task_id='myBashTest',
  bash_command=create_command,
  dag=dag
)
t2.set_upstream(t1)  # t1 must finish before t2 runs

Remember, this code is stored in $DAGS_FOLDER. Please refer to the previous post for details on its location. An important thing to note, and I quote from the Airflow website:

One thing to wrap your head around (it may not be very intuitive for everyone at first) is that this Airflow Python script is really just a configuration file specifying the DAG’s structure as code.
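
To make that idea a bit more concrete, here is a minimal sketch in plain Python with no Airflow involved. The ToyTask class is a hypothetical stand-in and not Airflow's implementation: it only records a command and its upstream tasks. The point is that reading the file builds a description of the tasks and their dependencies, and no command is executed along the way. The echo command is shortened for the example.

```python
# Hypothetical stand-in for an operator; this is NOT Airflow's real class.
class ToyTask:
    def __init__(self, task_id, bash_command):
        self.task_id = task_id
        self.bash_command = bash_command  # stored only, never executed here
        self.upstream = []                # tasks that must finish first

    def set_upstream(self, other):
        self.upstream.append(other)


t1 = ToyTask('print_date', 'date')
t2 = ToyTask('myBashTest', 'echo HELLO')
t2.set_upstream(t1)

# At this point the "DAG" is just data describing which task waits for which.
structure = {t.task_id: [u.task_id for u in t.upstream] for t in (t1, t2)}
print(structure)  # {'print_date': [], 'myBashTest': ['print_date']}
```

Running the commands at the scheduled time is a separate step, which in real Airflow is handled by the scheduler and its workers.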

So let’s go through the code and try to understand it.

  1. The first two import lines – These bring in the Airflow components we will be working with: DAG and BashOperator.
  2. The third import line – This brings in the date-related classes datetime and timedelta.
  3. default_args – Default arguments is a dictionary of arguments which you want to pass to the operators.
  4. dag – This variable defines an Airflow DAG object. The parameters we are passing are
    • DAG id – hello_world2
    • schedule interval (think cron) – 0 0 * * *, which means every day at midnight
    • default arguments
  5. create_command – The Linux command you want to run.
  6. t1 and t2 – Once the DAG is defined, we go on to create the operators which make up the DAG. t1 and t2 are operators 🙄. An instance of an operator is also called a task, so t1 and t2 are the two tasks of this DAG.
  7. t2.set_upstream(t1) – Finally, we define the dependencies between the operators: t1 has to finish before t2 runs. Once that is done, the DAG configuration is ready to be tested and placed inside $DAGS_FOLDER.
  8. If your scheduler is still running, Airflow should pick up the Python file stored in $DAGS_FOLDER and show it in the list of DAGs.
  9. If you click on hello_world2 and then click on “Graph View”, you should see a graph of the two tasks, with print_date followed by myBashTest.
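
Since the schedule interval in the walkthrough is a cron expression, a small sketch of how to read it may help. This is plain Python and does not use Airflow; the describe_cron helper and the field names are my own labels for the standard five cron fields, not part of any library.

```python
# Names of the five fields of a standard cron expression, in order.
CRON_FIELDS = ('minute', 'hour', 'day_of_month', 'month', 'day_of_week')


def describe_cron(expression):
    """Pair each field of a five-field cron expression with its name."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError('expected 5 fields, got %d' % len(parts))
    return dict(zip(CRON_FIELDS, parts))


fields = describe_cron('0 0 * * *')
print(fields)
```

Read this way, 0 0 * * * means minute 0 of hour 0, on every day of the month, every month and every day of the week. In other words, once a day at midnight.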

Now that you have a basic understanding of DAGs, let’s look at some of the operators in the next post. Until next time…
