OK, thanks for your response. We will review the dags against the best
practices documentation.

By the way, the same dags used to load fine with 1.10.10.

In the meantime, we have increased dag_file_processor_timeout and
scheduler_health_check_threshold to 180, which seems to get the scheduler
going, but we still have not been able to get rid of all the
DagFileProcessorManager errors.
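
For reference, this is roughly where those settings live (an airflow.cfg
sketch based on the Airflow 2.2 configuration reference; the 180-second
values are just what we picked while experimenting, not recommended
defaults):

[core]
# Maximum time (seconds) a DagFileProcessor may spend parsing one DAG file
# before the manager kills and restarts it.
dag_file_processor_timeout = 180

[scheduler]
# How long (seconds) since the last scheduler heartbeat before the scheduler
# is considered unhealthy.
scheduler_health_check_threshold = 180

The same values can also be set via environment variables, e.g.
AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=180.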

I am not sure if this is expected, but the only way we can get the scheduler
to recognize the dags is by adding them incrementally, 50-100 dags at a time.
Once the number of dags reached 1900, we had to reduce the batch size to 5-10
dags at a time, otherwise newly added dags were not picked up by the
scheduler.

The “airflow dags list” command works fine; it lists all the dags.

Any other suggestions besides ensuring the dags follow the best practices
documentation?

Thanks,
-mo


From: Jarek Potiuk <ja...@potiuk.com>
Reply-To: "users@airflow.apache.org" <users@airflow.apache.org>
Date: Wednesday, April 6, 2022 at 8:30 AM
To: "users@airflow.apache.org" <users@airflow.apache.org>
Subject: Re: Airflow 2.2.5 - scheduler error

Follow the best practices: 
https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code

It looks like you have DAGs that do "a lot" in their top-level code, and it
takes a very long time to parse them.
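
A minimal sketch of what that usually means (illustrative only, not your
actual DAG code; the URL and dag_id below are made up):

# Anti-pattern: expensive work at the top level of the DAG file runs on
# every scheduler parse, e.g.:
#
#     import requests
#     rows = requests.get("https://example.com/api/items").json()
#
# Better: keep the top level cheap and move the expensive work into a task.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_items():
    # Heavy imports and network/database calls happen only when the task
    # actually runs, not every time the file is parsed.
    import requests
    return requests.get("https://example.com/api/items").json()


with DAG(
    dag_id="example_cheap_top_level",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="fetch_items", python_callable=fetch_items)

With a few thousand DAG files, even a second or two of top-level work per
file can add up to far more than the processor timeout.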

J.


On Wed, Apr 6, 2022 at 4:54 AM HANNAOUI, MOHAMAD 
<mh7...@att.com> wrote:
Hello Airflow users,

We just upgraded from airflow 1.10.10 to airflow 2.2.5. We are using a
standard installation with the celery executor: one master node running the
webserver, scheduler, and flower, plus 4 worker nodes. We are using hosted
mysql8, redis, and python 3.6.10.
We have around 2300 dags. With version 1.10.10 the scheduler was able to
process all 2300 dags, although not efficiently, it was working. With
version 2.2.5, the scheduler worked fine with 519 dags; we then added ~300
dags and that’s when the scheduler started returning the error below:

2022-04-06 01:44:39,039 ERROR - DagFileProcessorManager (PID=9876) last sent a 
heartbeat 50.59 seconds ago! Restarting it

2022-04-06 01:44:39,067 INFO - Sending Signals.SIGTERM to group 9876. PIDs of 
all processes in the group: [9876]

2022-04-06 01:44:39,067 INFO - Sending the signal Signals.SIGTERM to group 9876

2022-04-06 01:44:39,320 INFO - Process psutil.Process(pid=9876, 
status='terminated', exitcode=0, started='01:43:47') (9876) terminated with 
exit code 0

2022-04-06 01:44:39,327 INFO - Launched DagFileProcessorManager with pid: 9988

2022-04-06 01:44:39,344 INFO - Configured default timezone Timezone('UTC')



We started a second scheduler on one of the worker nodes, thinking it would
help with the load, but that did not make a difference; both schedulers
returned the same error message as above.



For more than an hour after the schedulers started, there was only sporadic
processing of some dags; the rest of the time, nothing but
DagFileProcessorManager error messages.


I came across this post
https://github.com/apache/airflow/discussions/19270
which suggested increasing the value of scheduler_health_check_threshold. I
changed it to 120, but that did not solve the problem.

Any suggestions on how to fix this issue, or should we possibly downgrade to
a different version?





Thanks,

-mo







