Hmmm something else is going on here.

DAG files taking too long to parse should result in the individual DAG-file parsing process timing out and being terminated, but nothing should ever cause the DagFileProcessorManager process itself to stop heartbeating.

Do you perhaps have one file with "a lot" of dags in it? Is your database under heavy load?

-ash

On Thu, Apr 7 2022 at 11:36:24 +0200, Jarek Potiuk <ja...@potiuk.com> wrote:
I think parsing your DAGs simply takes too long, and you should follow the best practices. I am guessing here of course - I do not know what you do in your DAG top-level code. But fixing badly written DAGs is known to fix similar problems for others.
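
To illustrate what I mean by top-level code (the DAG, URL and requests call below are made up - I don't know what your DAGs actually do): anything at module level runs on every parse of the file, so expensive work should live inside the task callable, roughly like this:

    from datetime import datetime

    import requests  # purely illustrative dependency
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Anti-pattern: this would run at parse time, i.e. every time the DAG file
    # processor re-reads the file, not only when the DAG actually executes:
    # config = requests.get("https://example.com/config", timeout=10).json()


    def fetch_and_process():
        # Better: the expensive call happens only when the task itself runs.
        config = requests.get("https://example.com/config", timeout=10).json()
        print(config)


    with DAG(
        dag_id="top_level_code_example",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(task_id="fetch_and_process", python_callable=fetch_and_process)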

Airflow 1.10 had performance problems which made scheduling much slower in general - that's why it "could" look better: scheduling was simply much slower, so the problem did not surface (instead you had, for example, huge gaps between tasks). Now the scheduler in Airflow 2 is lightning fast. So, as usual, when one part of the system gets faster, bottlenecks in another part (this time likely caused by your DAG code) surface. This is pretty normal and expected when you improve performance.

The good thing is that after you fix it, you will get much smaller (unnoticeable) scheduling delays and your DAGs will be much snappier.

Another measure you can take is to add more schedulers. Airflow 2 allows running multiple schedulers, which helps when there is a lot of parsing to do. In the upcoming Airflow 2.3 you will also be able to run multiple DAG processors - they can be decoupled from the scheduler.
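
Roughly, that looks like this operationally (the standalone DAG processor command and config flag below are the 2.3 setup as I remember it - please check the 2.3 release notes):

    # Scheduler HA: start another scheduler process (e.g. on a second machine)
    # pointing at the same metadata database; no extra configuration is needed.
    airflow scheduler

    # Airflow 2.3+: run DAG parsing as its own process, decoupled from the
    # scheduler (assumes [scheduler] standalone_dag_processor = True).
    airflow dag-processor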

But I'd look at your DAG practices first. They are the fastest way to get more out of the Airflow you have now (without paying extra for additional compute).

J.

On Thu, Apr 7, 2022 at 9:10 AM HANNAOUI, MOHAMAD <mh7...@att.com> wrote:
OK, thanks for your response, will investigate the dags against the best-practices documentation.

By the way, the same dags used to load fine with 1.10.10.

In the meantime, we have increased the values of dag_file_processor_timeout and scheduler_health_check_threshold to 180, and that seems to get the scheduler going, but we still have not been able to get rid of all the DagFileProcessorManager errors.
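
Concretely, the change looks roughly like this in our airflow.cfg (paraphrased from memory, so the section names may need double-checking against the configuration reference):

    [core]
    # How long a single DAG file is allowed to take to parse (default 50s).
    dag_file_processor_timeout = 180

    [scheduler]
    # Threshold for the scheduler heartbeat health check (default 30s).
    scheduler_health_check_threshold = 180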

I am not sure if this is expected, but the only way we can get the scheduler to recognize the dags is by incrementally adding 50-100 dags at a time. Once the number of dags reached 1900, we had to reduce the number that can be added at one time to between 5 and 10, otherwise newly added dags were not picked up by the scheduler.

The "airflow dags list" command works fine; it lists all the dags.

Any other suggestions besides ensuring the dags follow the best-practices documentation?

Thanks,

-mo

From: Jarek Potiuk <ja...@potiuk.com>
Reply-To: users@airflow.apache.org
Date: Wednesday, April 6, 2022 at 8:30 AM
To: users@airflow.apache.org
Subject: Re: Airflow 2.2.5 - scheduler error

Follow the best practices: https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code

It looks like you have DAGs that do "a lot" in their top-level code, and it takes an awfully long time to parse them.

J.

On Wed, Apr 6, 2022 at 4:54 AM HANNAOUI, MOHAMAD <mh7...@att.com> wrote:

Hello Airflow users,

We just upgraded from Airflow 1.10.10 to Airflow 2.2.5. We are using a standard installation with the Celery executor: one master node running the webserver, scheduler, and Flower, and 4 worker nodes. We are using hosted MySQL 8, Redis, and Python 3.6.10.

We have around 2300 dags. With version 1.10.10 the scheduler was able to process all 2300 dags - not efficiently, but it was working. With version 2.2.5 the scheduler worked fine with 519 dags; we then added ~300 dags, and that's when the scheduler started returning the error below:

2022-04-06 01:44:39,039 ERROR - DagFileProcessorManager (PID=9876) last sent a heartbeat 50.59 seconds ago! Restarting it
2022-04-06 01:44:39,067 INFO - Sending Signals.SIGTERM to group 9876. PIDs of all processes in the group: [9876]
2022-04-06 01:44:39,067 INFO - Sending the signal Signals.SIGTERM to group 9876
2022-04-06 01:44:39,320 INFO - Process psutil.Process(pid=9876, status='terminated', exitcode=0, started='01:43:47') (9876) terminated with exit code 0
2022-04-06 01:44:39,327 INFO - Launched DagFileProcessorManager with pid: 9988
2022-04-06 01:44:39,344 INFO - Configured default timezone Timezone('UTC')

We started a second scheduler on one of the worker nodes, thinking it would help with the load, but that did not make a difference; both schedulers returned the same error message as above.

For more than an hour after the schedulers started, there was sporadic processing of some dags, but the rest of the time, nothing but DagFileProcessorManager error messages.

I came across this post https://github.com/apache/airflow/discussions/19270 that suggested increasing the value of scheduler_health_check_threshold, which I changed to 120, but it did not solve the problem.

Any suggestions on how to fix this issue, or should we possibly downgrade to a different version?

Thanks,

-mo
