Hello Airflow users,

We just upgraded from Airflow 1.10.10 to Airflow 2.2.5. We are using a standard
installation with the Celery executor: one master node running the webserver,
scheduler, and Flower, plus 4 worker nodes. We are using hosted MySQL 8, Redis,
and Python 3.6.10.
We have around 2300 DAGs. With version 1.10.10 the scheduler was able to
process all 2300 DAGs; not efficiently, but it worked. With version 2.2.5 the
scheduler worked fine with 519 DAGs. We then added ~300 DAGs, and that is when
the scheduler started returning the error below:

2022-04-06 01:44:39,039 ERROR - DagFileProcessorManager (PID=9876) last sent a heartbeat 50.59 seconds ago! Restarting it

2022-04-06 01:44:39,067 INFO - Sending Signals.SIGTERM to group 9876. PIDs of all processes in the group: [9876]

2022-04-06 01:44:39,067 INFO - Sending the signal Signals.SIGTERM to group 9876

2022-04-06 01:44:39,320 INFO - Process psutil.Process(pid=9876, status='terminated', exitcode=0, started='01:43:47') (9876) terminated with exit code 0

2022-04-06 01:44:39,327 INFO - Launched DagFileProcessorManager with pid: 9988

2022-04-06 01:44:39,344 INFO - Configured default timezone Timezone('UTC')



We started a second scheduler on one of the worker nodes, thinking it would
help with the load, but that made no difference: both schedulers returned the
same error message as above.



More than an hour after the schedulers started, some DAGs were processed
sporadically, but the rest of the time the logs showed nothing but
DagFileProcessorManager error messages.


I came across this post,
https://github.com/apache/airflow/discussions/19270, which suggested increasing
the value of scheduler_health_check_threshold. I changed it to 120, but that
did not solve the problem.
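
For reference, here is a minimal sketch of that change, assuming the option
sits in the [scheduler] section of airflow.cfg. The config fragment and the
helper name read_health_check_threshold are illustrative only, not copied from
our deployment. The snippet just parses the fragment with Python's standard
configparser to confirm the value that would be picked up:

```python
import configparser

# Hypothetical fragment of airflow.cfg; only the option mentioned above is shown.
AIRFLOW_CFG_FRAGMENT = """
[scheduler]
scheduler_health_check_threshold = 120
"""


def read_health_check_threshold(cfg_text: str) -> int:
    """Return scheduler_health_check_threshold from airflow.cfg-style text."""
    # Interpolation is disabled so that '%' characters elsewhere in a real
    # config file would not be treated as configparser placeholders.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return parser.getint("scheduler", "scheduler_health_check_threshold")


if __name__ == "__main__":
    print(read_health_check_threshold(AIRFLOW_CFG_FRAGMENT))
```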

Any suggestions on how to fix this issue? Or should we downgrade to a
different version?





Thanks,

-mo







