We observed a problem where dags did not run at the specified time at all but consistently started at a random time. Let’s dig into it.
We have a job chain of three dags which are scheduled for daily execution and should run in succession. For that we defined the start_date at 22.11.2021 11:30 (12:30 / 13:30 repectively).
Airflow dags are running almost at the same time, although they are clearly scheduled 1 hour apart from each other. Moreover they do not start nearly at the same time as indicated by start_date even if you consider different time zones.
To solve this kind of mystery you need to know 2 important things about the airflow scheduler, especially about the start_date parameter.
- if you set a start_date in the past with a specific time interval, Airflow will try to fill up and execute all missing runs immediately after you activate the dag
- “Note that Airflow simply looks at the latest
execution_dateand adds the
schedule_intervalto determine the next
execution_date” – taken from the official docs. If there is no last execution_date (because it is the first run) it will run at the specified start_date instead
- optional background information: start date just defines the starting time interval, on when the dag should be executed
Now let’s put the pieces together:
- we deployed the dags, and set the start_date at a point in time in the past
- we activated all the above mentioned dags apparently at around 12:02
- Airflow realizes, that it has been missing some runs and executes the dag at 12:02
- now Airflow calculates the next execution_date which would be 12:02 on the next day – for all the dags! (disregarding the start time of 11.00 /12:00 / 13:00)
So there you have it. We actually figured it out just minutes before stumbling upon the blog post Apache Airflow Gotchas which mentions the execution_date calculation and implicitely hints to the understanding of our problem at hand
- a very obvious best practice is, to create a job chain, which controls and triggers the dags to ensure their dependency
- when deploying a new dag, the start_date should be set to a future point in time to ensure, that execution of the first dag takes place at the specified time
Part of why we were so confused was, that we had two deployments with that same issue. One of which would start all dags at around 12:00 pm and the other one at around 03:00 am (!). Those hours were just too wild and arbitrary to have any pattern. After actually understanding what is going on, we were seriously confused on who or how a deployment at 3 am took place. Turns out it was I who deployed it on a Wednesday night to our dev system …
If you found this post to be helpful or encountered a similiar problem, feel free to leave a comment behind 🙂