The source of the issue turned out to be network latency, which has since been fixed. It was tolerable with the pre-2.x version of the scheduler, but not with version 2.x.
The DAGs did not require much optimization, and whatever changes we made did not improve performance. Fixing the network latency is what made the difference. The timeout errors have all disappeared and the scheduler is running smoothly now. Thanks for your help.

-mo

From: MOHAMAD HANNAOUI <mh7...@att.com>
Date: Thursday, April 7, 2022 at 1:00 PM
To: "users@airflow.apache.org" <users@airflow.apache.org>
Subject: Re: Airflow 2.2.5 - scheduler error

We made some changes today based on the best practices and there was a slight improvement in load time; whether that will solve the problem, we won't know until we do further tests and apply the changes to all the DAGs. As far as system and database load, both are very low. We already have 2 schedulers running and can increase that to 5; we will start more schedulers and see if that helps. Our DAGs are simple: one file per DAG, and each DAG has only 5 to 7 tasks.

I agree about the performance of the scheduler in previous versions, especially 1.10.10. It was very frustrating; I have seen many cases where the gap between two tasks in the same DAG was up to 10 hours. I was very pleased to see the fast-follow feature in action, so thanks for the work that went into improving the scheduler. I'll report back once the DAG best-practice improvements are done; in the meantime, let me know if you can think of other things we can try.

Thanks,
-mo

From: Ash Berlin-Taylor <a...@apache.org>
Reply-To: "users@airflow.apache.org" <users@airflow.apache.org>
Date: Thursday, April 7, 2022 at 5:57 AM
To: "users@airflow.apache.org" <users@airflow.apache.org>
Subject: Re: Airflow 2.2.5 - scheduler error

Hmmm, something else is going on here. DAG files taking too long to parse should result in the individual DAG file parsing process timing out and being terminated, but ideally nothing should ever cause the DagFileProcessorManager process to stop heartbeating.

Do you perhaps have one file with "a lot" of DAGs in it? Is your database under heavy load?

-ash

On Thu, Apr 7 2022 at 11:36:24 +0200, Jarek Potiuk <ja...@potiuk.com> wrote:

I think simply parsing your DAGs takes too long, and you should follow the best practices. I am guessing here, of course - I do not know what you do in your DAG top-level code - but fixing badly written DAGs is known to fix similar problems for others.

Airflow 1.10 had performance problems which made scheduling much slower in general; that's why it "could" look better, but in reality scheduling was simply much slower, so the problem did not surface (you had, for example, huge gaps between tasks instead). Now the scheduler in Airflow 2 is lightning fast, and as usual, when one part of the system gets faster, bottlenecks in other parts (this time likely your DAG code) surface. This is pretty normal and expected when you improve performance. The good thing is that once you fix it, you will get much smaller (unnoticeable) scheduling delays and your DAGs will be much snappier.

Another measure you can take is to add more schedulers. Airflow 2 allows running multiple schedulers to account for the case when there is a lot of parsing to do. Also, in the upcoming Airflow 2.3 you will be able to run multiple DAG processors, which can be decoupled from the scheduler. But I'd look at your DAG practices first; they are the fastest way to get more out of the Airflow you have now (without paying extra for additional compute).

J.
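For illustration, a minimal sketch of what the top-level-code best practice looks like in a DAG file; the DAG id, table name, and callable below are hypothetical and only show where expensive work should (and should not) live:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Anti-pattern: module-level work runs on every parse by the
    # DagFileProcessor, e.g.
    #   tables = expensive_metadata_query()   # slows down or times out parsing

    def load_table(table_name, **_):
        # Expensive work (DB queries, API calls) belongs inside the task
        # callable, so it only runs when the task actually executes.
        print(f"loading {table_name}")

    with DAG(
        dag_id="example_deferred_work",        # hypothetical DAG id
        start_date=datetime(2022, 4, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="load_table",
            python_callable=load_table,
            op_kwargs={"table_name": "my_table"},  # hypothetical table name
        )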
On Thu, Apr 7, 2022 at 9:10 AM HANNAOUI, MOHAMAD <mh7...@att.com> wrote:

OK, thanks for your response; we will investigate the DAGs against the best-practices documentation. By the way, the same DAGs used to load fine with 1.10.10. In the meantime, we have increased the values of dag_file_processor_timeout and scheduler_health_check_threshold to 180, and that seems to get the scheduler going, but we still have not been able to get rid of all the DagFileProcessorManager errors.

I am not sure if this is expected, but the only way we can get the scheduler to recognize the DAGs is by adding them incrementally, 50-100 DAGs at a time. As the number of DAGs reached 1900, we had to reduce the batch size to between 5 and 10, otherwise newly added DAGs were not being picked up by the scheduler. The "airflow dags list" command works fine; it lists all the DAGs.

Any other suggestions besides ensuring the DAGs follow the best-practices documentation?

Thanks,
-mo

From: Jarek Potiuk <ja...@potiuk.com>
Reply-To: "users@airflow.apache.org" <users@airflow.apache.org>
Date: Wednesday, April 6, 2022 at 8:30 AM
To: "users@airflow.apache.org" <users@airflow.apache.org>
Subject: Re: Airflow 2.2.5 - scheduler error

Follow the best practices: https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code

It looks like you have DAGs that do "a lot" in the top-level code, and it takes an awfully long time to parse them.

J.

On Wed, Apr 6, 2022 at 4:54 AM HANNAOUI, MOHAMAD <mh7...@att.com> wrote:

Hello Airflow users,

We just upgraded from Airflow 1.10.10 to Airflow 2.2.5. We are using a standard installation with the Celery executor: one master node running the webserver, scheduler, and Flower, plus 4 worker nodes. We are using hosted MySQL 8, Redis, and Python 3.6.10. We have around 2300 DAGs.

With version 1.10.10 the scheduler was able to process all 2300 DAGs - not efficiently, but it was working. With version 2.2.5, the scheduler worked fine with 519 DAGs; we then added ~300 DAGs, and that's when the scheduler started returning the error below:

2022-04-06 01:44:39,039 ERROR - DagFileProcessorManager (PID=9876) last sent a heartbeat 50.59 seconds ago! Restarting it
2022-04-06 01:44:39,067 INFO - Sending Signals.SIGTERM to group 9876. PIDs of all processes in the group: [9876]
2022-04-06 01:44:39,067 INFO - Sending the signal Signals.SIGTERM to group 9876
2022-04-06 01:44:39,320 INFO - Process psutil.Process(pid=9876, status='terminated', exitcode=0, started='01:43:47') (9876) terminated with exit code 0
2022-04-06 01:44:39,327 INFO - Launched DagFileProcessorManager with pid: 9988
2022-04-06 01:44:39,344 INFO - Configured default timezone Timezone('UTC')

We started a second scheduler on one of the worker nodes, thinking it would help with the load, but that did not make a difference; both schedulers returned the same error message as above. More than an hour after the schedulers started, there was sporadic processing of some DAGs, but the rest of the time, nothing but DagFileProcessorManager error messages.
I came across a post, https://github.com/apache/airflow/discussions/19270, which suggested increasing the value of scheduler_health_check_threshold. I changed it to 120, but that did not solve the problem.

Any suggestions on how to fix this issue, or should we possibly downgrade to a different version?

Thanks,
-mo
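For reference, both settings discussed above live in airflow.cfg. A minimal sketch of the change described in the thread (values as reported above; section placement is per the Airflow 2.2 configuration reference, so verify against your own airflow.cfg):

    [core]
    # How long (seconds) before timing out a DagFileProcessor parsing one DAG file
    dag_file_processor_timeout = 180

    [scheduler]
    # Seconds since the last scheduler heartbeat before it is considered unhealthy
    scheduler_health_check_threshold = 180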