Hmmm something else is going on here.
DAG files taking too long to parse should result in the individual dag
file parsing process timing out and being terminated, but ideally
nothing should ever cause the DagFileProcessorManager process to stop
heartbeating.
Do you perhaps have one file with "a lot" of dags in it? Is your
database under heavy load?
-ash
On Thu, Apr 7 2022 at 11:36:24 +0200, Jarek Potiuk <ja...@potiuk.com>
wrote:
I think parsing your DAGs simply takes too long and you should follow
the best practices. I am guessing here of course - I do not know what
you do in your DAG top-level code.
But fixing badly written DAGs is known to fix similar problems for
others.
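Roughly, the parse-time part of a DAG file should look like no more
than this (just a sketch - the dag_id, callable and the pandas import
are made up for illustration; the point is that anything expensive
lives inside the callable):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_partition(**context):
    # Expensive imports and any I/O belong here, so they run only when
    # the task executes - not every time the file is parsed.
    import pandas as pd  # heavy import deferred on purpose
    ...


with DAG(
    dag_id="example_cheap_top_level",  # made-up dag_id
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="load_partition", python_callable=load_partition)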
Airflow 1.10 had performance problems which made scheduling much
slower in general - that's why it "could" look better, but in reality
scheduling was simply much slower, so this problem did not surface
(instead you had, for example, huge gaps between tasks).
Now - the scheduler in Airflow 2 is lightning fast. So as usual, when
one part of the system gets faster, bottlenecks in the other parts
(this time likely caused by your DAG code) surface. This is pretty
normal and expected when you improve performance.
The good thing is that after you fix it, you will get much smaller
(unnoticeable) scheduling delays and your DAGs will be much snappier.
Another measure you can take is to add more schedulers. Airflow 2
allows running multiple schedulers to account for the case when there
is a lot of parsing to do. Also, in the upcoming Airflow 2.3 you will
be able to run multiple DAG processors - they can be decoupled from
the scheduler.
But I'd look at your DAG practices first. They are the fastest way to
get more out of the Airflow you have now (without paying extra for
additional computing).
J.
On Thu, Apr 7, 2022 at 9:10 AM HANNAOUI, MOHAMAD <mh7...@att.com> wrote:
OK, thanks for your response, will investigate the dags against the
best practice documentation.

By the way, the same dags used to load fine with 1.10.10.
In the meantime, we have increased the values of
dag_file_processor_timeout and scheduler_health_check_threshold to
180, and that seems to get the scheduler going, but we still have not
been able to get rid of all the DagFileProcessorManager errors.
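For reference, these are the entries we changed (in airflow.cfg -
section names as I understand them from the 2.2 configuration
reference, so treat this as approximate):

[core]
dag_file_processor_timeout = 180

[scheduler]
scheduler_health_check_threshold = 180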
I am not sure if this is expected, but the only way we can get the
scheduler to recognize the dags is by incrementally adding between
50 and 100 dags at a time. As the number of dags reached 1900, we had
to reduce the number of dags added at one time to between 5 and 10,
otherwise newly added dags were not being picked up by the scheduler.
The “airflow dags list” command works fine; it lists all the dags.
Any other suggestions besides ensuring the dags follow the best
practice documentation?
Thanks,
-mo
*From:* Jarek Potiuk <ja...@potiuk.com>
*Reply-To:* "users@airflow.apache.org" <users@airflow.apache.org>
*Date:* Wednesday, April 6, 2022 at 8:30 AM
*To:* "users@airflow.apache.org" <users@airflow.apache.org>
*Subject:* Re: Airflow 2.2.5 - scheduler error
Follow the best practices:
https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code
It looks like you have DAGs that do "a lot" in the top-level code
and it takes an awfully long time to parse them.
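By "a lot" I mean things like this at the module level (the variable
and connection names below are invented, just to illustrate the
pattern):

from airflow.models import Variable
from airflow.providers.postgres.hooks.postgres import PostgresHook

# Everything at the top level runs on every parse of this file by the
# DAG file processor (roughly every 30 seconds by default), not only
# when a task runs - so these two lines hit the metadata DB and an
# external DB on every loop.
config = Variable.get("my_pipeline_config", deserialize_json=True)
tables = PostgresHook(postgres_conn_id="my_dwh").get_records(
    "SELECT table_name FROM information_schema.tables"
)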
J.
On Wed, Apr 6, 2022 at 4:54 AM HANNAOUI, MOHAMAD <mh7...@att.com> wrote:
Hello Airflow users,
We just upgraded from airflow 1.10.10 to airflow 2.2.5. We are using
a standard installation with the celery executor: one master node
running the webserver, scheduler, and flower, and 4 worker nodes. We
are using hosted mysql8, redis and python 3.6.10.
We have around 2300 dags. With version 1.10.10 the scheduler was able
to process all 2300 dags - not efficiently, but it was working. With
version 2.2.5, the scheduler worked fine with 519 dags; we then added
~300 dags, and that's when the scheduler started returning the below
error:
2022-04-06 01:44:39,039 ERROR - DagFileProcessorManager (PID=9876) last sent a heartbeat 50.59 seconds ago! Restarting it
2022-04-06 01:44:39,067 INFO - Sending Signals.SIGTERM to group 9876. PIDs of all processes in the group: [9876]
2022-04-06 01:44:39,067 INFO - Sending the signal Signals.SIGTERM to group 9876
2022-04-06 01:44:39,320 INFO - Process psutil.Process(pid=9876, status='terminated', exitcode=0, started='01:43:47') (9876) terminated with exit code 0
2022-04-06 01:44:39,327 INFO - Launched DagFileProcessorManager with pid: 9988
2022-04-06 01:44:39,344 INFO - Configured default timezone Timezone('UTC')
We started a second scheduler on one of the worker nodes thinking it
would help with the load, but that did not make a difference; both
schedulers returned the same error message as above.
More than 1 hour after the schedulers' start time, there was sporadic
processing of some dags, but the rest of the time, nothing but
DagFileProcessorManager error messages.
I came across this post
https://github.com/apache/airflow/discussions/19270
that suggested increasing the value of
scheduler_health_check_threshold, which I changed to 120, but it did
not solve the problem.
Any suggestions on how to fix this issue, or should we possibly
downgrade to a different version?
Thanks,
-mo