yuqian90 edited a comment on issue #15938:
URL: https://github.com/apache/airflow/issues/15938#issuecomment-845994258


   > Just how slow does it have to be to happen?
   > We can probably guard this by closing of the current pid when we register 
them, and checking that the signal is received by the same pid
   
   Hi @ashb, it's not clear to me exactly how slow it must be for this to 
happen. It looks like as long as some child processes are a fraction of a 
second slower than the others, they can easily deadlock when a SIGTERM 
is received. So even transient slowness on a beefy machine can cause this to 
happen. 
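
For reference, the guard suggested in the quoted question (capture the pid at registration time, then check it when the signal arrives) might look roughly like the sketch below. The name `pid_guarded` is hypothetical and not part of Airflow. Note that making `_exit_gracefully` a no-op in child processes, a similar idea, is reported in this comment as not sufficient on its own.

```python
import os


def pid_guarded(handler):
    """Wrap a signal handler so it only runs in the process that registered it.

    Hypothetical helper for illustration; not an Airflow API.
    """
    owner_pid = os.getpid()  # captured by the closure at registration time

    def guarded(signum, frame):
        # A forked child inherits this handler, but its pid differs from
        # owner_pid, so the wrapped handler is skipped there.
        if os.getpid() != owner_pid:
            return None
        return handler(signum, frame)

    return guarded
```

It would be registered with something like `signal.signal(signal.SIGTERM, pid_guarded(self._exit_gracefully))`.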
   
   Here's what I tried so far. Only the last method seems to fix the issue 
completely (i.e. we have to stop using `multiprocessing.Pool`):
   - Tried to reset the signal handler to `signal.SIG_DFL` in 
`register_signals` if the current process is a child process. This doesn't help 
because the child process inherits the parent's signal handler when it's 
forked. Still hangs occasionally.
   - Tried to make `_exit_gracefully` a no-op if the current process is a child 
process. This isn't sufficient. Still hangs occasionally.
   - Tried to change multiprocessing to use "spawn" instead of "fork", as some 
people suggested [on the 
internet](https://pythonspeed.com/articles/python-multiprocessing/). This 
greatly reduced the chance of the issue happening. However, after running the 
reproducing example about 8000 times, it still happened. So it doesn't fix the 
issue completely.
   - **Replaced `multiprocessing.Pool` with 
`concurrent.futures.process.ProcessPoolExecutor`. Once this is done, the 
reproducing example no longer hangs even after running it tens of thousands 
of times.** So I put up PR #15989, which fixes the issue using this method. 
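
As a rough illustration of the last item, swapping `multiprocessing.Pool` for `ProcessPoolExecutor` is usually a small change. This is a minimal sketch, not the code from the PR; `process_file` and `run_in_pool` are hypothetical names, and the `executor_cls` parameter is only there so the executor can be swapped out.

```python
import concurrent.futures


def process_file(path):
    """Hypothetical stand-in for the per-item work done in a child process."""
    return len(path)


def run_in_pool(func, items, max_workers=2,
                executor_cls=concurrent.futures.ProcessPoolExecutor):
    """Map `func` over `items` using worker processes.

    The multiprocessing.Pool version would have been:
        with multiprocessing.Pool(processes=max_workers) as pool:
            return pool.map(func, items)
    """
    with executor_cls(max_workers=max_workers) as executor:
        # Executor.map returns a lazy iterator, whereas Pool.map returns a
        # list, so materialize the results before the executor shuts down.
        return list(executor.map(func, items))
```

A call such as `run_in_pool(process_file, ["a.py", "bb.py"])` should sit under an `if __name__ == "__main__":` guard so it also works when the start method is not "fork".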
   
   From experience, `multiprocessing.Pool` is notorious for causing mysterious 
hangs like these. Using `ProcessPoolExecutor` does not cause the same problems. 
It has a similar interface and uses similar underlying libraries. I don't 
understand exactly why it fixes the issue, but in practice it always seems to 
help.

