Hello Airflow users,

We just upgraded from Airflow 1.10.10 to Airflow 2.2.5. We are using a standard installation with the Celery executor: one master node running the webserver, scheduler, and Flower, and 4 worker nodes. We are using hosted MySQL 8, Redis, and Python 3.6.10. We have around 2300 DAGs.

With version 1.10.10 the scheduler was able to process all 2300 DAGs, not efficiently, but it was working. With version 2.2.5 the scheduler worked fine with 519 DAGs. We then added ~300 DAGs, and that's when the scheduler started returning the error below:
2022-04-06 01:44:39,039 ERROR - DagFileProcessorManager (PID=9876) last sent a heartbeat 50.59 seconds ago! Restarting it
2022-04-06 01:44:39,067 INFO - Sending Signals.SIGTERM to group 9876. PIDs of all processes in the group: [9876]
2022-04-06 01:44:39,067 INFO - Sending the signal Signals.SIGTERM to group 9876
2022-04-06 01:44:39,320 INFO - Process psutil.Process(pid=9876, status='terminated', exitcode=0, started='01:43:47') (9876) terminated with exit code 0
2022-04-06 01:44:39,327 INFO - Launched DagFileProcessorManager with pid: 9988
2022-04-06 01:44:39,344 INFO - Configured default timezone Timezone('UTC')

We started a second scheduler on one of the worker nodes, thinking it would help with the load, but that did not make a difference: both schedulers returned the same error message as above. More than an hour after the schedulers started, some DAGs were processed sporadically, but the rest of the time the logs showed nothing but DagFileProcessorManager error messages.

I came across this post, https://github.com/apache/airflow/discussions/19270, which suggested increasing the value of scheduler_health_check_threshold. I changed it to 120, but that did not solve the problem.

Any suggestions on how to fix this issue, or should we possibly downgrade to a different version?

Thanks,
-mo
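
For anyone triaging a scheduler log like the one quoted above, here is a minimal sketch for pulling out each DagFileProcessorManager stall and how long it had gone without a heartbeat. It is not part of the original setup. The regular expression is based only on the ERROR line shown in this post, and the names HEARTBEAT_RE, find_manager_restarts, and SAMPLE are hypothetical, chosen for illustration.

```python
import re
from datetime import datetime

# Assumption: the pattern mirrors only the ERROR line quoted in this post;
# other Airflow versions or log formats may need a different expression.
HEARTBEAT_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} ERROR - "
    r"DagFileProcessorManager \(PID=(?P<pid>\d+)\) last sent a heartbeat "
    r"(?P<age>\d+(?:\.\d+)?) seconds ago"
)


def find_manager_restarts(log_text):
    """Return (timestamp, pid, heartbeat_age_seconds) for each stall message."""
    restarts = []
    for match in HEARTBEAT_RE.finditer(log_text):
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        restarts.append((ts, int(match.group("pid")), float(match.group("age"))))
    return restarts


# Two lines copied from the log in this post, used as sample input.
SAMPLE = (
    "2022-04-06 01:44:39,039 ERROR - DagFileProcessorManager (PID=9876) "
    "last sent a heartbeat 50.59 seconds ago! Restarting it\n"
    "2022-04-06 01:44:39,327 INFO - Launched DagFileProcessorManager with pid: 9988\n"
)

if __name__ == "__main__":
    for ts, pid, age in find_manager_restarts(SAMPLE):
        print(f"{ts.isoformat()} pid={pid} heartbeat_age={age:.2f}s")
```

Running the same function over a full scheduler log would show how often the manager is restarted and whether the reported heartbeat age stays near the same value each time, which may help narrow down which timeout is actually being hit.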