Hmmm something else is going on here.
DAG files taking too long to parse should result in the individual dag
file parsing process timing out and being terminated, but ideally
nothing should ever cause the DagFileProcessorManager process to stop
heartbeating.
Do you perhaps have one file with "a lot" of dags in it? Is your
database under heavy load?
-ash
On Thu, Apr 7 2022 at 11:36:24 +0200, Jarek Potiuk <ja...@potiuk.com>
wrote:
I think parsing your DAGs simply takes too long and you should follow
the best practices. I am guessing here of course - I do not know what
you do in your DAG top-level code.
But fixing badly written DAGs is known to fix similar problems for
others.
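Roughly, the parse-time part of a DAG file should look like no more
than this (just a sketch - the dag_id, callable and the pandas import
are made up for illustration; the point is that anything expensive
lives inside the callable):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_partition(**context):
    # Expensive imports and any I/O belong here, so they run only when
    # the task executes - not every time the file is parsed.
    import pandas as pd  # heavy import deferred on purpose
    ...


with DAG(
    dag_id="example_cheap_top_level",  # made-up dag_id
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="load_partition", python_callable=load_partition)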
Airflow 1.10 had performance problems which made scheduling much
slower in general - that's why it "could" look better, but in reality
scheduling was simply much slower, so this problem did not surface
(instead you had, for example, huge gaps between tasks).
Now - the scheduler in Airflow 2 is lightning fast. So as usual, when
one part of the system gets faster, bottlenecks in the other parts
(this time likely caused by your DAG code) surface. This is pretty
normal and expected when you improve performance.
The good thing is that after you fix it, you will get much smaller
(unnoticeable) scheduling delays and your DAGs will be much snappier.
Another measure you can take is to add more schedulers. Airflow 2
allows running multiple schedulers to account for the case when there
is a lot of parsing to do. Also, in the upcoming Airflow 2.3 you will
be able to run multiple DAG processors - they can be decoupled from
the scheduler.
But I'd look at your DAG practices first. They are the fastest way to
get more out of the Airflow you have now (without paying extra for
additional computing).
J.
On Thu, Apr 7, 2022 at 9:10 AM HANNAOUI, MOHAMAD <mh7...@att.com> wrote:
OK, thanks for your response, will investigate the dags against the
best practice documentation.

By the way, the same dags used to load fine with 1.10.10.
In the meantime, we have increased the values of
dag_file_processor_timeout and scheduler_health_check_threshold to
180, and that seems to get the scheduler going, but we still have not
been able to get rid of all the DagFileProcessorManager errors.
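For reference, these are the entries we changed (in airflow.cfg -
section names as I understand them from the 2.2 configuration
reference, so treat this as approximate):

[core]
dag_file_processor_timeout = 180

[scheduler]
scheduler_health_check_threshold = 180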
I am not sure if this is expected, but the only way we can get the
scheduler to recognize the dags is by incrementally adding between
50 and 100 dags at a time. As the number of dags reached 1900, we had
to reduce the number of dags added at one time to between 5 and 10,
otherwise newly added dags were not being picked up by the scheduler.
The “airflow dags list” command works fine; it lists all the dags.
Any other suggestions besides ensuring the dags follow the best
practice documentation?
Thanks,
-mo
*From:* Jarek Potiuk <ja...@potiuk.com>
*Reply-To:* "users@airflow.apache.org" <users@airflow.apache.org>
*Date:* Wednesday, April 6, 2022 at 8:30 AM
*To:* "users@airflow.apache.org" <users@airflow.apache.org>
*Subject:* Re: Airflow 2.2.5 - scheduler error
Follow the best practices:
https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code
It looks like you have DAGs that do "a lot" in the top-level code
and it takes an awfully long time to parse them.
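By "a lot" I mean things like this at the module level (the variable
and connection names below are invented, just to illustrate the
pattern):

from airflow.models import Variable
from airflow.providers.postgres.hooks.postgres import PostgresHook

# Everything at the top level runs on every parse of this file by the
# DAG file processor (roughly every 30 seconds by default), not only
# when a task runs - so these two lines hit the metadata DB and an
# external DB on every loop.
config = Variable.get("my_pipeline_config", deserialize_json=True)
tables = PostgresHook(postgres_conn_id="my_dwh").get_records(
    "SELECT table_name FROM information_schema.tables"
)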
J.
On Wed, Apr 6, 2022 at 4:54 AM HANNAOUI, MOHAMAD <mh7...@att.com> wrote:
Hello Airflow users,
We just upgraded from airflow 1.10.10 to airflow 2.2.5. We are using
a standard installation with the celery executor: one master node
running the webserver, scheduler, and flower, and 4 worker nodes. We
are using hosted mysql8, redis and python 3.6.10.
We have around 2300 dags. With version 1.10.10 the scheduler was able
to process all 2300 dags - not efficiently, but it was working. With
version 2.2.5, the scheduler worked fine with 519 dags; we then added
~300 dags, and that's when the scheduler started returning the below
error:
2022-04-06 01:44:39,039 ERROR - DagFileProcessorManager (PID=9876) last sent a heartbeat 50.59 seconds ago! Restarting it
2022-04-06 01:44:39,067 INFO - Sending Signals.SIGTERM to group 9876. PIDs of all processes in the group: [9876]
2022-04-06 01:44:39,067 INFO - Sending the signal Signals.SIGTERM to group 9876
2022-04-06 01:44:39,320 INFO - Process psutil.Process(pid=9876, status='terminated', exitcode=0, started='01:43:47') (9876) terminated with exit code 0
2022-04-06 01:44:39,327 INFO - Launched DagFileProcessorManager with pid: 9988
2022-04-06 01:44:39,344 INFO - Configured default timezone Timezone('UTC')
We started a second scheduler on one of the worker nodes thinking it
would help with the load, but that did not make a difference; both
schedulers returned the same error message as above.
More than 1 hour after the schedulers' start time, there was sporadic
processing of some dags, but the rest of the time, nothing but
DagFileProcessorManager error messages.
I came across this post
https://github.com/apache/airflow/discussions/19270
that suggested increasing the value of
scheduler_health_check_threshold, which I changed to 120, but it did
not solve the problem.
Any suggestions on how to fix this issue, or should we possibly
downgrade to a different version?
Thanks,
-mo