Airflow tasks do not run at specified time as scheduled

We observed a problem where dags did not run at the specified time at all but consistently started at a random time. Let’s dig into it.

Expected Behavior:

We have a job chain of three dags which are scheduled for daily execution and should run in succession. For that we defined the start_date at 22.11.2021 11:30 (12:30 / 13:30 repectively).

The start_date configured for the dags

Observation:

Airflow dags are running almost at the same time, although they are clearly scheduled 1 hour apart from each other. Moreover they do not start nearly at the same time as indicated by start_date even if you consider different time zones.

Those dags with the prefix 2000, 3000 and 4000 were deployed and should run 1 hour apart from each other.

Background information:

To solve this kind of mystery you need to know 2 important things about the airflow scheduler, especially about the start_date parameter.

  • if you set a start_date in the past with a specific time interval, Airflow will try to fill up and execute all missing runs immediately after you activate the dag
  • “Note that Airflow simply looks at the latest execution_date and adds the schedule_interval to determine the next execution_date” – taken from the official docs. If there is no last execution_date (because it is the first run) it will run at the specified start_date instead
  • optional background information: start date just defines the starting time interval, on when the dag should be executed

Now let’s put the pieces together:

  • we deployed the dags, and set the start_date at a point in time in the past
  • we activated all the above mentioned dags apparently at around 12:02
  • Airflow realizes, that it has been missing some runs and executes the dag at 12:02
  • now Airflow calculates the next execution_date which would be 12:02 on the next day – for all the dags! (disregarding the start time of 11.00 /12:00 / 13:00)

So there you have it. We actually figured it out just minutes before stumbling upon the blog post Apache Airflow Gotchas which mentions the execution_date calculation and implicitely hints to the understanding of our problem at hand

Learnings:

  • a very obvious best practice is, to create a job chain, which controls and triggers the dags to ensure their dependency
  • when deploying a new dag, the start_date should be set to a future point in time to ensure, that execution of the first dag takes place at the specified time

Story time:

Part of why we were so confused was, that we had two deployments with that same issue. One of which would start all dags at around 12:00 pm and the other one at around 03:00 am (!). Those hours were just too wild and arbitrary to have any pattern. After actually understanding what is going on, we were seriously confused on who or how a deployment at 3 am took place. Turns out it was I who deployed it on a Wednesday night to our dev system …

If you found this post to be helpful or encountered a similiar problem, feel free to leave a comment behind 🙂

Leave a Comment

Your email address will not be published.

hungsblog | Nguyen Hung Manh | Dresden