Got that. Thanks for explaining your POV. What I was trying to say is that
without being able to reproduce it, I can only guess. And since the scheduler
code that handles processes is quite streamlined, there are not a
lot of places where a zombie can be created.
So, let's guess...
Are the zombie processes "still running and churning data", or just leftovers?
Usually there are two kinds of zombie processes: completely detached
processes that keep running and consuming resources (which, in theory, the
scheduler should avoid at all costs) and simple leftovers in the "virtual"
process table that are there because nobody (usually the parent process)
reclaimed either their output or their status code. The second is something
to avoid, of course, but it shouldn't hurt the functionality of the app.
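To make the second kind concrete, here is a minimal sketch (Unix only, not scheduler code) of a "leftover": the child has already exited, but its entry stays in the process table until the parent reclaims the status code. The function name leftover_demo and the exit status 7 are made up for illustration.

```python
import os
import time


def leftover_demo():
    """Fork a child that exits at once, then reap it from the parent."""
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately with a recognizable status code.
        os._exit(7)
    # Parent: the child is dead by now but not yet reaped, so ps/pstree
    # would list it as <defunct>. It no longer consumes CPU or memory.
    time.sleep(0.2)
    # Reclaiming the status code removes the entry from the process table.
    reaped_pid, status = os.waitpid(pid, 0)
    return reaped_pid == pid, os.WEXITSTATUS(status)
```

If nobody ever calls waitpid() (or join(), for a multiprocessing.Process), the entry stays there until the parent itself exits.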
From the pstree you posted, it seems that the relevant code is this one
(shortening it here and there):
    p = multiprocessing.Process(target=executor, ....)
    p.start()
    try:
        # task runs
        p.join(run_timeout)
    except:
        # this should be raised only when a general error on the task
        # happened, so it's a STOPPED one
        p.terminate()
        p.join()
    else:
        # this is the codepath your task takes, since it's the one landing
        # TIMEOUT tasks
        if p.is_alive():
            # this is ultimately the call that SHOULD kill the process you
            # later find as a zombie
            p.terminate()
            ....
This terminate() sends SIGTERM (not SIGKILL) to the process on Unix. Any
child processes of that one (e.g. a subprocess call inside the task itself)
are not guaranteed to be terminated, but then they'd show up as orphaned,
while your pstree reports python processes still "attached" to the
scheduler worker process.
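As an illustration of the orphaning case, here is a rough sketch (Unix only, using the "fork" start method as an assumption to keep it self-contained) where the task spawns its own subprocess and the parent then terminates the task. All function names are hypothetical and none of this is scheduler code.

```python
import multiprocessing
import os
import signal
import subprocess
import sys

# Assumption: "fork" start method, available on Unix only.
ctx = multiprocessing.get_context("fork")


def task_with_subprocess(conn):
    # The "task" spawns its own child, like a subprocess call inside a task.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"])
    conn.send(child.pid)
    child.wait()


def pid_exists(pid):
    # Signal 0 only checks whether the pid is still present.
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    return True


def grandchild_survives_terminate():
    parent_conn, child_conn = ctx.Pipe()
    p = ctx.Process(target=task_with_subprocess, args=(child_conn,))
    p.start()
    grandchild_pid = parent_conn.recv()
    p.terminate()  # only the task process itself is signalled
    p.join()       # reap the task so it does not linger as a leftover
    survived = pid_exists(grandchild_pid)
    if survived:
        # Clean up the orphan so the sketch does not leave it running.
        os.kill(grandchild_pid, signal.SIGKILL)
    return survived
```

On a typical Unix box I'd expect the grandchild to outlive the task and get re-parented (usually to init), so it would no longer hang under the worker in pstree.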
From where I stand, if the result is the task being labelled as TIMEOUT
(with the corresponding "task timeout" debug line), it can only
originate there.
Maybe the culprit is there... can you add a p.join() after that
p.terminate(), and maybe a few debug lines?
i.e.
    ...
    else:
        if p.is_alive():
            logger.debug(' MD: terminating')
            p.terminate()
            logger.debug(' MD: terminated')
            logger.debug(' MD: joining')
            p.join()
            logger.debug(' MD: joined')
            logger.debug(' task timeout')
            try:
                # we try to get a traceback here
                tr = queue.get(timeout=2)
                ...
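If it helps to try the idea outside the scheduler first, here is a standalone sketch of the same terminate-then-join pattern. It is not the scheduler's code: run_with_timeout, slow_task, quick_task and the returned status strings are made up for the example, and it assumes Unix with the "fork" start method.

```python
import logging
import multiprocessing
import time

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("timeout_sketch")

# Assumption: "fork" start method, available on Unix only.
ctx = multiprocessing.get_context("fork")


def slow_task():
    # Stand-in for a task that overruns its timeout.
    time.sleep(30)


def quick_task():
    # Stand-in for a task that finishes in time.
    pass


def run_with_timeout(target, run_timeout):
    p = ctx.Process(target=target)
    p.start()
    p.join(run_timeout)  # returns when the task ends or the timeout expires
    if p.is_alive():
        logger.debug(' MD: terminating')
        p.terminate()
        logger.debug(' MD: joining')
        p.join()  # reap the child so no leftover entry stays behind
        logger.debug(' MD: joined, exitcode=%s', p.exitcode)
        return 'TIMEOUT', p.exitcode
    return 'COMPLETED', p.exitcode
```

Calling run_with_timeout(slow_task, 0.5) should take the timeout branch. A negative exitcode after the join means the child was killed by a signal and has been reaped.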