The source of the issue turned out to be network latency, which has since been fixed. It was tolerable with the pre-2.x scheduler, but not with version 2.x.

The DAGs did not require much optimization, and the changes we did make did not improve performance. Fixing the network latency is what made the difference.

The timeout errors have all disappeared and the scheduler is running smoothly 
now.

Thanks for your help.

-mo


From: MOHAMAD HANNAOUI <mh7...@att.com>
Date: Thursday, April 7, 2022 at 1:00 PM
To: "users@airflow.apache.org" <users@airflow.apache.org>
Subject: Re: Airflow 2.2.5 - scheduler error

We made some changes today based on the best practices, and there was a slight improvement in load time. Whether that will solve the problem, we won’t know until we run further tests and apply the changes to all the DAGs. As far as system and database load, both are very low. We already have 2 schedulers running and can increase that to 5. We will start more schedulers and see if that helps.

Our DAGs are simple: one file per DAG, and each DAG has only 5 to 7 tasks.

I agree about the performance of the scheduler in previous versions, especially 1.10.10. It was very frustrating. I encountered many occurrences where the gap between two tasks in the same DAG was up to 10 hours. I was very pleased to see the fast-follow feature in action, so thanks for the work that was done on improving the scheduler.

I’ll report back once the DAG best-practice improvements are done. In the meantime, let me know if you can think of other things we can try.

Thanks,
-mo


From: Ash Berlin-Taylor <a...@apache.org>
Reply-To: "users@airflow.apache.org" <users@airflow.apache.org>
Date: Thursday, April 7, 2022 at 5:57 AM
To: "users@airflow.apache.org" <users@airflow.apache.org>
Subject: Re: Airflow 2.2.5 - scheduler error

Hmmm something else is going on here.

DAG files taking too long to parse should result in the individual DAG file parsing process timing out and being terminated, but ideally nothing should ever cause the DagFileProcessorManager process to stop heartbeating.

Do you perhaps have one file with "a lot" of DAGs in it? Is your database under heavy load?
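Something like this, for instance, where a single file generates hundreds of DAGs at parse time (a hypothetical sketch - the names and count here are made up):

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

# one .py file that emits many DAGs in a loop
for i in range(500):
    dag_id = f"generated_dag_{i}"
    with DAG(dag_id, start_date=datetime(2022, 1, 1), schedule_interval="@daily") as dag:
        DummyOperator(task_id="start")
    globals()[dag_id] = dag  # each DAG must land in module globals to be discovered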

-ash

On Thu, Apr 7 2022 at 11:36:24 +0200, Jarek Potiuk <ja...@potiuk.com> wrote:
I think simply parsing your DAGs takes too long, and you should follow the best practices. I am guessing here of course - I do not know what you do in your DAG top-level code.
But fixing badly written DAGs is known to fix similar problems for others.
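For example (a hypothetical sketch of the fix - the endpoint and names are mine, not yours), moving expensive work out of top-level code and into the task callable:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# BAD: module-level work runs on *every* parse cycle of this file, e.g.:
#
#   import requests
#   ROWS = requests.get("https://example.com/api/rows").json()  # network call at parse time!
#
# GOOD: defer the work so it only runs when the task executes:

def fetch_rows():
    import requests  # heavy import deferred to run time as well
    return requests.get("https://example.com/api/rows").json()

with DAG("example_deferred_work", start_date=datetime(2022, 1, 1), schedule_interval="@daily") as dag:
    PythonOperator(task_id="fetch_rows", python_callable=fetch_rows)

This file now parses in milliseconds; the HTTP call happens only inside the worker.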

Airflow 1.10 had performance problems that made scheduling much slower in general - that's why it "could" look better: scheduling was simply so slow that the problem never surfaced (you had huge gaps between tasks instead).
Now the scheduler in Airflow 2 is lightning fast. So, as usual, when one part of the system gets faster, bottlenecks in another part (this time likely caused by your DAG code) surface. This is pretty normal and expected when you improve performance.

The good thing is that after you fix it, you will get much smaller 
(unnoticeable) scheduling delays and your DAGs will be much snappier.

Another measure you can take is to add more schedulers. Airflow 2 allows running multiple schedulers to account for the case when there is a lot of parsing to do. Also, in the upcoming Airflow 2.3 you will be able to run multiple DAG processors - they can be decoupled from the scheduler.
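Starting an extra scheduler is just running the same command on another node, assuming every node points at the same metadata database and that database supports SELECT ... FOR UPDATE SKIP LOCKED (MySQL 8 does):

# on each additional scheduler node
airflow scheduler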

But I'd look at your DAG practices. They are the fastest way to get more out of the Airflow you have now (without paying extra for additional computing).

J.

On Thu, Apr 7, 2022 at 9:10 AM HANNAOUI, MOHAMAD <mh7...@att.com> wrote:
OK, thanks for your response. We will investigate the DAGs against the best-practices documentation.

By the way, the same DAGs used to load fine with 1.10.10.

In the meantime, we have increased the values of dag_file_processor_timeout and scheduler_health_check_threshold to 180, and that seems to get the scheduler going, but we still have not been able to get rid of all the DagFileProcessorManager errors.
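For reference, this is what we changed in airflow.cfg (section names per the Airflow 2.2 configuration reference; the same values can be set via the AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT and AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD environment variables):

[core]
# seconds a single DAG file may take to parse before its processor is killed
dag_file_processor_timeout = 180

[scheduler]
# seconds without a scheduler heartbeat before it is considered unhealthy
scheduler_health_check_threshold = 180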

I am not sure if this is expected, but the only way we can get the scheduler to recognize the DAGs is by incrementally adding 50-100 DAGs at a time. As the number of DAGs reached 1900, we had to reduce the number of DAGs added at one time to between 5 and 10; otherwise, newly added DAGs were not being picked up by the scheduler.

The “airflow dags list” command works fine; it lists all the DAGs.

Any other suggestions besides ensuring the DAGs follow the best-practices documentation?

Thanks,
-mo


From: Jarek Potiuk <ja...@potiuk.com>
Reply-To: "users@airflow.apache.org" <users@airflow.apache.org>
Date: Wednesday, April 6, 2022 at 8:30 AM
To: "users@airflow.apache.org" <users@airflow.apache.org>
Subject: Re: Airflow 2.2.5 - scheduler error

Follow the best practices: https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code

It looks like you have DAGs that do "a lot" in their top-level code, and it takes an awfully long time to parse them.
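One way to check which files are the slow ones - a sketch using the DagBag API (the dags folder path here is a placeholder; I believe "airflow dags report" prints similar per-file numbers):

import time
from airflow.models.dagbag import DagBag

start = time.monotonic()
dagbag = DagBag(dag_folder="/path/to/dags", include_examples=False)
elapsed = time.monotonic() - start
print(f"Parsed {len(dagbag.dags)} DAGs in {elapsed:.1f}s, import errors: {len(dagbag.import_errors)}")

# dagbag_stats holds one FileLoadStat per parsed file - sort to find the slowest
for stat in sorted(dagbag.dagbag_stats, key=lambda s: s.duration, reverse=True)[:10]:
    print(stat.duration, stat.file)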

J.


On Wed, Apr 6, 2022 at 4:54 AM HANNAOUI, MOHAMAD <mh7...@att.com> wrote:
Hello Airflow users,

We just upgraded from Airflow 1.10.10 to Airflow 2.2.5. We are using a standard installation with the Celery executor: one master node running the webserver, scheduler, and Flower, plus 4 worker nodes. We are using hosted MySQL 8, Redis, and Python 3.6.10.
We have around 2300 DAGs. With version 1.10.10 the scheduler was able to process all 2300 DAGs - not efficiently, but it was working. With version 2.2.5, the scheduler worked fine with 519 DAGs; we then added ~300 DAGs, and that’s when the scheduler started returning the error below:

2022-04-06 01:44:39,039 ERROR - DagFileProcessorManager (PID=9876) last sent a heartbeat 50.59 seconds ago! Restarting it
2022-04-06 01:44:39,067 INFO - Sending Signals.SIGTERM to group 9876. PIDs of all processes in the group: [9876]
2022-04-06 01:44:39,067 INFO - Sending the signal Signals.SIGTERM to group 9876
2022-04-06 01:44:39,320 INFO - Process psutil.Process(pid=9876, status='terminated', exitcode=0, started='01:43:47') (9876) terminated with exit code 0
2022-04-06 01:44:39,327 INFO - Launched DagFileProcessorManager with pid: 9988
2022-04-06 01:44:39,344 INFO - Configured default timezone Timezone('UTC')



We started a second scheduler on one of the worker nodes, thinking it would help with the load, but that did not make a difference; both schedulers returned the same error message as above.

More than an hour after the schedulers started, there was sporadic processing of some DAGs, but the rest of the time, nothing but DagFileProcessorManager error messages.


I came across this post https://github.com/apache/airflow/discussions/19270 which suggested increasing the value of scheduler_health_check_threshold. I changed it to 120, but it did not solve the problem.

Any suggestions on how to fix this issue, or should we possibly downgrade to a different version?

Thanks,

-mo