If you are interested in the journey that led to that PR, I've just published a blog post about it: https://www.astronomer.io/blog/profiling-the-airflow-scheduler/

Improving Airflow’s scheduler is one of our top priorities at Astronomer, and I think this work should help anyone with short-running tasks.
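If you want to poke at a scheduler-style code path yourself, here is a minimal sketch using only the standard library's cProfile. It illustrates the general approach, not the exact setup used for the blog post, and the function name is just a placeholder:

    import cProfile
    import pstats

    def scheduler_work():
        # Stand-in for the code path you want to measure,
        # e.g. one pass of the scheduling loop in a test harness.
        sum(i * i for i in range(1000000))

    profiler = cProfile.Profile()
    profiler.enable()
    scheduler_work()
    profiler.disable()

    # Print the 10 most expensive calls by cumulative time.
    pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)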
-ash

> On 5 Dec 2019, at 21:33, Aaron Grubb <[email protected]> wrote:
>
> That’s great! Thanks for your reply!
>
> From: Kamil Breguła <[email protected]>
> Sent: Thursday, December 5, 2019 4:19 PM
> To: [email protected]
> Subject: Re: Celery Task Startup Overhead
>
> Hello,
>
> This is caused by strict process isolation. Each task is started in a new
> process, where the Python interpreter is loaded completely anew.
> This change can help solve some of your problems:
> https://github.com/apache/airflow/pull/6627
>
> Best regards,
> Kamil
>
> On Thu, Dec 5, 2019 at 9:41 PM Aaron Grubb <[email protected]> wrote:
> Hi everyone,
>
> I’ve been testing Celery workers with both prefork and eventlet pools and I'm
> noticing massive startup overhead for simple BashOperators. For example, 20x
> instances of:
>
>     BashOperator(
>         task_id='test0',
>         bash_command="echo 'test'",
>         dag=dag)
>
> executed concurrently spike my worker machine from ~150 MB to ~3 GB
> (eventlet) or ~3.5 GB (prefork) of memory and take ~50 seconds. Is this an
> expected artifact of the 20x Python executions, or is there some way to reduce
> this?
>
> Thanks,
> Aaron
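For anyone who wants to reproduce the measurement Aaron describes above, here is a minimal sketch of a DAG with 20 concurrent BashOperator tasks. It is my reconstruction against the 1.10-era API, not Aaron's exact DAG; the dag_id and start_date are placeholders:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # Hypothetical dag_id and start_date; adjust to taste.
    with DAG(
            dag_id='celery_startup_overhead_test',
            start_date=datetime(2019, 12, 1),
            schedule_interval=None,
            concurrency=20,  # let all 20 tasks run at once
    ) as dag:
        for i in range(20):
            BashOperator(
                task_id='test{}'.format(i),
                bash_command="echo 'test'",
            )

Triggering this once and watching the Celery worker's memory should show the per-task interpreter startup cost that the PR above is aimed at reducing.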
