Follow the best practices: https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code

It looks like you have DAGs that do "a lot" in the top-level code, and it takes an awfully long time to parse them.
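
As a minimal sketch of what that page recommends (the DAG id and the expensive call are illustrative, not taken from your DAGs): keep the module level limited to DAG and task definitions, and push heavy imports and I/O into the task callables, because the top level of the file runs on every parse by the DAG file processor, not only when a task runs.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Anti-pattern (shown commented out): module-level work runs on EVERY
    # parse of this file by the scheduler's DAG file processor.
    #   import requests
    #   CONFIG = requests.get("https://example.internal/config").json()


    def fetch_and_process():
        # Heavy imports and network calls live inside the callable, so they
        # only run when the task executes on a worker, not when the file is
        # parsed by the scheduler.
        import requests  # hypothetical expensive dependency

        config = requests.get("https://example.internal/config", timeout=10).json()
        print(config)


    with DAG(
        dag_id="cheap_to_parse_example",  # illustrative DAG id
        start_date=datetime(2022, 4, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(task_id="fetch_and_process", python_callable=fetch_and_process)

With 2300+ DAG files, even a second or two of module-level work per file adds up to more than the DagFileProcessorManager heartbeat window, which is consistent with the restarts you are seeing.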
J.

On Wed, Apr 6, 2022 at 4:54 AM HANNAOUI, MOHAMAD <mh7...@att.com> wrote:

> Hello Airflow users,
>
> We just upgraded from airflow 1.10.10 to airflow 2.2.5. We are using a
> standard installation with the celery executor, one master node running
> the webserver, scheduler, and flower, and 4 worker nodes. We are using
> hosted mysql8, redis, and python 3.6.10.
>
> We have around 2300 dags. With version 1.10.10 the scheduler was able to
> process all 2300 dags; not efficiently, but it was working. With version
> 2.2.5 the scheduler worked fine with 519 dags, but when we added ~300 more
> dags the scheduler started returning the error below:
>
> 2022-04-06 01:44:39,039 ERROR - DagFileProcessorManager (PID=9876) last sent a heartbeat 50.59 seconds ago! Restarting it
> 2022-04-06 01:44:39,067 INFO - Sending Signals.SIGTERM to group 9876. PIDs of all processes in the group: [9876]
> 2022-04-06 01:44:39,067 INFO - Sending the signal Signals.SIGTERM to group 9876
> 2022-04-06 01:44:39,320 INFO - Process psutil.Process(pid=9876, status='terminated', exitcode=0, started='01:43:47') (9876) terminated with exit code 0
> 2022-04-06 01:44:39,327 INFO - Launched DagFileProcessorManager with pid: 9988
> 2022-04-06 01:44:39,344 INFO - Configured default timezone Timezone('UTC')
>
> We started a second scheduler on one of the worker nodes thinking it would
> help with the load, but that did not make a difference; both schedulers
> returned the same error message as above.
>
> More than an hour after the schedulers started, there was sporadic
> processing of some dags, but the rest of the time nothing but
> DagFileProcessorManager error messages.
>
> I came across this post https://github.com/apache/airflow/discussions/19270
> that suggested increasing the value of scheduler_health_check_threshold,
> which I changed to 120, but it did not solve the problem.
>
> Any suggestions on how to fix this issue, or should we downgrade to a
> different version?
>
> Thanks,
> -mo
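
For reference, the setting mentioned in the quoted message lives in the [scheduler] section of airflow.cfg (or the matching AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD environment variable). Below is a sketch of the related parsing knobs; the values are illustrative rather than recommendations, and the exact behaviour of each option can differ between Airflow versions, so please check the configuration reference for 2.2.5:

    [core]
    # Per-file parsing timeout; DAG files with slow top-level code tend to
    # hit this limit first.
    dag_file_processor_timeout = 50

    [scheduler]
    # Threshold the linked discussion suggests raising when the
    # DagFileProcessorManager keeps being restarted for missed heartbeats.
    scheduler_health_check_threshold = 120

    # Number of parallel DAG-parsing processes.
    parsing_processes = 2

Raising timeouts only hides the symptom, though; the durable fix is making the DAG files themselves cheap to parse, as described in the best-practices link above.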