I think parsing your DAGs simply takes too long, and you should follow the best practices. I am guessing here, of course - I do not know what you do in your DAG top-level code - but fixing badly written DAGs is known to have solved similar problems for others.
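If you want to confirm that guess before changing anything, you can time the parse of a single DAG file with a few lines of Python. This is just a sketch - the file path is an example, point it at one of your own files:

    # Roughly measure how long Airflow needs to parse one DAG file.
    import time

    from airflow.models.dagbag import DagBag

    start = time.monotonic()
    # dag_folder can be a single file; the path below is a placeholder.
    bag = DagBag(dag_folder="/path/to/dags/my_dag.py", include_examples=False)
    elapsed = time.monotonic() - start

    print(f"Parsed {len(bag.dags)} DAG(s) in {elapsed:.2f}s")
    print(f"Import errors: {bag.import_errors}")

With ~2300 files, even a couple of seconds per file adds up to far more than the DagFileProcessorManager heartbeat allows.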
Airflow 1.10 had performance problems that made scheduling much slower in general. That is why it "could" look better there: scheduling was simply so slow that the parsing problem never surfaced (you had huge gaps between tasks instead). The scheduler in Airflow 2 is lightning fast, so, as usual when one part of a system gets faster, bottlenecks in another part surface - this time most likely in your DAG code. This is normal and expected when you improve performance. The good news is that once you fix it, you will get much smaller (unnoticeable) scheduling delays and your DAGs will be much snappier.

Another measure you can take is to add more schedulers: Airflow 2 allows running multiple schedulers, which helps when there is a lot of parsing to do. In the upcoming Airflow 2.3 you will also be able to run multiple DAG processors, decoupled from the scheduler. But I would look at your DAG practices first - they are the fastest way to get more out of the Airflow you have now, without paying for additional computing.
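To make that advice concrete, here is the usual shape of the fix described on the best-practices page (a minimal sketch - expensive_lookup and the my_company.catalog module are hypothetical stand-ins for whatever slow work your files currently do at import time):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Antipattern: module-level work like this runs on EVERY parse:
    #
    #     from my_company.catalog import expensive_lookup  # hypothetical
    #     TABLES = expensive_lookup()  # network/DB call at parse time!

    def load_tables():
        # Moved inside the callable, the expensive call runs only when the
        # task executes - not every time the scheduler re-parses the file.
        from my_company.catalog import expensive_lookup  # hypothetical
        for table in expensive_lookup():
            print(table)

    with DAG(
        dag_id="example_fast_parsing",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(task_id="load_tables", python_callable=load_tables)

The same goes for heavy imports: importing a slow library at the top of the file costs you on every parse, so defer it into the callable too.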
J.

On Thu, Apr 7, 2022 at 9:10 AM HANNAOUI, MOHAMAD <mh7...@att.com> wrote:

> OK, thanks for your response, we will investigate the dags against the
> best-practice documentation.
>
> By the way, the same dags used to load fine with 1.10.10.
>
> In the meantime, we have increased the values of dag_file_processor_timeout
> and scheduler_health_check_threshold to 180, and that seems to get the
> scheduler going, but we still have not been able to get rid of all the
> DagFileProcessorManager errors.
>
> I am not sure if this is expected, but the only way we can get the
> scheduler to recognize the dags is by incrementally adding between 50 and
> 100 dags at a time. As the number of dags reached 1900, we had to reduce
> the number of dags added at one time to between 5 and 10; otherwise, newly
> added dags were not being picked up by the scheduler.
>
> The “airflow dags list” command works fine, it lists all the dags.
>
> Any other suggestions besides ensuring the dags follow the best-practice
> documentation?
>
> Thanks,
> -mo
>
> *From:* Jarek Potiuk <ja...@potiuk.com>
> *Reply-To:* "users@airflow.apache.org" <users@airflow.apache.org>
> *Date:* Wednesday, April 6, 2022 at 8:30 AM
> *To:* "users@airflow.apache.org" <users@airflow.apache.org>
> *Subject:* Re: Airflow 2.2.5 - scheduler error
>
> Follow the best practices:
> https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code
>
> It looks like you have DAGs that do "a lot" in the top-level code, and it
> takes an awfully long time to parse them.
>
> J.
>
> On Wed, Apr 6, 2022 at 4:54 AM HANNAOUI, MOHAMAD <mh7...@att.com> wrote:
>
> Hello Airflow users,
>
> We just upgraded from Airflow 1.10.10 to Airflow 2.2.5. We are using a
> standard installation with the Celery executor: one master node running
> the webserver, scheduler, and Flower, plus 4 worker nodes. We are using
> hosted MySQL 8, Redis, and Python 3.6.10.
>
> We have around 2300 dags. With version 1.10.10 the scheduler was able to
> process all 2300 dags - not efficiently, but it was working. With version
> 2.2.5, the scheduler worked fine with 519 dags; we then added ~300 dags,
> and that's when the scheduler started returning the error below:
>
> 2022-04-06 01:44:39,039 ERROR - DagFileProcessorManager (PID=9876) last sent a heartbeat 50.59 seconds ago! Restarting it
> 2022-04-06 01:44:39,067 INFO - Sending Signals.SIGTERM to group 9876. PIDs of all processes in the group: [9876]
> 2022-04-06 01:44:39,067 INFO - Sending the signal Signals.SIGTERM to group 9876
> 2022-04-06 01:44:39,320 INFO - Process psutil.Process(pid=9876, status='terminated', exitcode=0, started='01:43:47') (9876) terminated with exit code 0
> 2022-04-06 01:44:39,327 INFO - Launched DagFileProcessorManager with pid: 9988
> 2022-04-06 01:44:39,344 INFO - Configured default timezone Timezone('UTC')
>
> We started a second scheduler on one of the worker nodes, thinking it
> would help with the load, but that made no difference; both schedulers
> returned the same error message as above.
>
> For more than an hour after the schedulers started, there was sporadic
> processing of some dags, but the rest of the time there was nothing but
> DagFileProcessorManager error messages.
>
> I came across this post, https://github.com/apache/airflow/discussions/19270,
> which suggested increasing the value of scheduler_health_check_threshold.
> I changed it to 120, but it did not solve the problem.
>
> Any suggestions on how to fix this issue, or should we downgrade to a
> different version?
>
> Thanks,
> -mo