Follow the best practices:
https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code


It looks like you have DAGs that do "a lot" in the top-level code, and it
takes an awfully long time to parse them.
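
For example (just a sketch - the URL, DAG id and callable names below are
made up), the idea from that doc is to move expensive imports and calls out
of module level and into the task callable, so they only run when the task
executes, not on every parse of the file:

# Heavy work at module level runs on every parse of this file, so avoid:
#
#   import pandas as pd                       # expensive import at parse time
#   config = requests.get(CONFIG_URL).json()  # network call at parse time
#
# Instead keep the top level cheap and defer the work into the callable:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def process_table():
    # Expensive imports and calls happen only when the task runs,
    # not every time the DAG file processor parses the file.
    import pandas as pd
    import requests

    data = requests.get("https://example.com/config.json").json()
    return pd.DataFrame(data).shape


with DAG(
    dag_id="deferred_top_level_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="process_table", python_callable=process_table)

With ~2300 DAG files, anything at module level is executed by the DAG file
processor on every parsing loop, which is where the parsing time goes.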

J.


On Wed, Apr 6, 2022 at 4:54 AM HANNAOUI, MOHAMAD <mh7...@att.com> wrote:

> Hello Airflow users,
>
>
>
> We just upgraded from Airflow 1.10.10 to Airflow 2.2.5. We are using a
> standard installation with the Celery executor: one master node running
> the webserver, scheduler, and Flower, and 4 worker nodes. We are using
> hosted MySQL 8, Redis, and Python 3.6.10.
>
> We have around 2300 dags. With version 1.10.10 the scheduler was able to
> process all 2300 dags; not efficiently, but it was working. With version
> 2.2.5, the scheduler worked fine with 519 dags; we then added ~300 dags,
> and that's when the scheduler started returning the error below:
>
> 2022-04-06 01:44:39,039 ERROR - DagFileProcessorManager (PID=9876) last
> sent a heartbeat 50.59 seconds ago! Restarting it
>
> 2022-04-06 01:44:39,067 INFO - Sending Signals.SIGTERM to group 9876. PIDs
> of all processes in the group: [9876]
>
> 2022-04-06 01:44:39,067 INFO - Sending the signal Signals.SIGTERM to group
> 9876
>
> 2022-04-06 01:44:39,320 INFO - Process psutil.Process(pid=9876,
> status='terminated', exitcode=0, started='01:43:47') (9876) terminated with
> exit code 0
>
> 2022-04-06 01:44:39,327 INFO - Launched DagFileProcessorManager with pid:
> 9988
>
> 2022-04-06 01:44:39,344 INFO - Configured default timezone Timezone('UTC')
>
>
>
> We started a second scheduler on one of the worker nodes, thinking it
> would help with the load, but that did not make a difference; both
> schedulers returned the same error message as above.
>
>
>
> More than an hour after the schedulers started, there was sporadic
> processing of some dags, but the rest of the time there was nothing but
> DagFileProcessorManager error messages.
>
>
>
> I came across this post
> https://github.com/apache/airflow/discussions/19270 that suggested
> increasing the value of scheduler_health_check_threshold, which I changed
> to 120, but it did not solve the problem.
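>
> (For reference, this was set in the [scheduler] section of airflow.cfg -
> a sketch of the relevant lines, assuming the default config layout:
>
> [scheduler]
> # default is 30 seconds
> scheduler_health_check_threshold = 120
>
> or the equivalent environment variable
> AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD=120.)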
>
>
>
> Any suggestions on how to fix this issue, or should we possibly downgrade
> to a different version?
>
>
>
>
>
> Thanks,
>
> -mo
>
>
