This looks to me like the gunicorn master process timing out unresponsive workers, most likely because a worker blocks for too long while handling a single request.
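For reference, the knobs involved are exposed in airflow.cfg (a sketch using the 1.10-era option names and defaults under [webserver]; double-check the names for your version):

    [webserver]
    # number of gunicorn workers serving the UI
    workers = 4
    # gunicorn worker class; the default "sync" worker blocks on each request
    worker_class = sync
    # seconds the gunicorn master waits for a worker heartbeat
    # before it considers the worker dead and kills it
    web_server_worker_timeout = 120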
When we ran into this problem, we were using the default sync worker class. We solved it by switching to one of the other concurrency models (tornado + asyncio), which allowed the workers to stay responsive even when request handling took longer than `--timeout` seconds. In other words, we switched to asynchronous workers so that a worker can yield and keep answering heartbeats from the master, making sure it doesn't get killed. The switch was pretty non-trivial due to several factors, the biggest one being that concurrency in Python is hard to reason about. Asyncio eased our pain.

Note: raising the timeout may very well work as a quick fix, but it is also a great way to mask the actual problem.

I guess this should be discussed further on a dev list / ticket. I'm new to Airflow development, but I have looked at this particular problem before and might be able to help out. Maybe someone can help out here?

With best concurrent regards,

On Thu, 16 Jan 2020 at 02:41, Reed Villanueva <[email protected]> wrote:

> Having a problem where I am unable to turn on a DAG in the Airflow
> webserver UI.
>
> One thing to note is that the DAG in question was originally causing
> timeout errors when trying to run, so I edited the airflow.cfg file to
> have the line...
>
> dagbag_import_timeout = 300
>
> Now after making this change, running...
>
> airflow list_dags
>
> I can see the DAG gets built successfully.
>
> Then going to the webserver, I refresh the DAG in the UI, switch the DAG
> status to "On", and click on the DAG to attempt to see the graph view.
>
> Either I get a message about a PID timeout, or the webserver page shows
> some browser error like "page sent no data", and after reloading I see
> that the DAG has been switched off (in either case, no indication of the
> problem in airflow-webserver.log).
>
> More debugging info if it helps:
>
> [root@airflowetl airflow]# ps -aux | grep webserver
> airflow  16740  0.8 0.2 782620 134804 ? S  15:17 0:06 [ready] gunicorn: worker [airflow-webserver]
> airflow  29758  2.3 0.2 756164 108644 ? S  15:26 0:03 [ready] gunicorn: worker [airflow-webserver]
> airflow  33820 14.8 0.1 724788  78036 ? S  15:29 0:01 gunicorn: worker [airflow-webserver]
> airflow  33854 26.7 0.1 724784  78032 ? S  15:29 0:01 gunicorn: worker [airflow-webserver]
> airflow  33855 26.5 0.1 724816  78064 ? S  15:29 0:01 gunicorn: worker [airflow-webserver]
> root     34072  0.0 0.0 112712    968 pts/0 S+ 15:29 0:00 grep --color=auto webserver
> airflow  91174  1.6 0.1 735708  82468 ? S  14:14 1:14 /usr/bin/python3 /home/airflow/.local/bin/airflow webserver -D
> airflow  91211  0.0 0.1 355040  53472 ? S  14:14 0:01 gunicorn: master [airflow-webserver]
>
> Anyone with more Airflow experience have any ideas why this could be
> happening and how to fix it? (Maybe some airflow.cfg timeout config that
> I should extend?)
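P.S. To make the worker-class switch above concrete, here is roughly how it looks for a bare gunicorn invocation (a sketch: `myapp:app` stands in for your WSGI application, and the tornado worker requires the tornado package to be installed; the Airflow webserver wraps gunicorn itself, so there you would set worker_class in airflow.cfg rather than pass flags):

    # default sync worker: blocks for the whole request, so a request that
    # outlives --timeout means missed heartbeats and the master kills the worker
    gunicorn --workers 4 --timeout 120 myapp:app

    # tornado worker: keeps answering the master's heartbeat while a slow
    # request is in flight, so long requests no longer get the worker killed
    gunicorn --workers 4 --worker-class tornado --timeout 120 myapp:app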
