Hi Everyone,

This is somewhat of a re-post of an old issue (
https://groups.google.com/forum/#!topic/slurm-devel/59xPbuhb_78).

It caught my attention recently so I re-investigated. The reason we
experience the problem is a curious interaction between older versions of
the hydra MPI launcher and SSSD. That's not really a SLURM problem, though.
What I think is a SLURM issue is the way srun reacts.

Here is a simple reproducer that mimics the bizarre interaction of hydra
and sssd:

dd if=/dev/zero bs=1k | srun sleep 5

It produces this error most of the time (still happens on a build of master
from today):

srun: debug:  IO error on node 0
srun: error: step_launch_notify_io_failure: aborting, io error with
slurmstepd on node 0

What I think I narrowed it down to (I was actually wrong about my initial
assessment in my earlier post) is a race condition between
eio_handle_mainloop and eio_signal_shutdown. If eio_handle_mainloop exits
before eio_signal_shutdown is called, the socket is never shutdown(). That
leaves data that srun sent to the socket but that stepd never dequeued and
fed to sleep because, well, sleep doesn't read anything from stdin. When
stepd exits, the kernel closes the socket while that unread data is still
pending on stepd's side, so the next read() attempted by srun fails with
ECONNRESET.
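
To illustrate just the kernel behavior described above, independent of
SLURM, here is a minimal Python sketch. It assumes a TCP socket pair on
loopback and Linux-like semantics; the function name peer_close_result and
the 1k payload are made up for the example and are not part of srun or
stepd. One side sends data, the other closes either with the data still
unread or after draining it, and the sender then attempts a read.

```python
import socket


def peer_close_result(drain_first: bool) -> str:
    """Report what the sender sees on its next read after the receiver closes.

    Hypothetical helper for illustration only; not SLURM code.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()
    listener.close()

    payload = b"\0" * 1024
    sender.sendall(payload)

    # Block until the data has reached the receiver's kernel buffer.
    # MSG_PEEK leaves it pending, i.e. it is NOT dequeued.
    receiver.recv(1, socket.MSG_PEEK)

    if drain_first:
        # Dequeue everything so nothing is pending at close time.
        remaining = len(payload)
        while remaining:
            remaining -= len(receiver.recv(remaining))

    # Closing with unread data pending makes the kernel reset the
    # connection instead of closing it gracefully.
    receiver.close()

    sender.settimeout(5)
    try:
        data = sender.recv(1)
        return "EOF" if data == b"" else "data"
    except ConnectionResetError:
        return "ECONNRESET"
    except socket.timeout:
        return "timeout"
    finally:
        sender.close()


if __name__ == "__main__":
    print("unread data at close:", peer_close_result(drain_first=False))
    print("drained before close:", peer_close_result(drain_first=True))
```

On a Linux box I would expect the first case to report ECONNRESET and the
second a clean EOF, which matches what srun sees when stepd goes away with
stdin data still queued.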

I've come up with a seemingly functional fix. Should eio_handle_mainloop
exit before eio_signal_shutdown is called, it ensures the eio objects still
get set to shutdown and that shutdown() is subsequently called. The
proposed fix is here:

https://github.com/aaronknister/slurm/commits/connection_reset_fix

I've attempted to run the regression test suite but haven't had much luck.
My SLURM VM is being a little squirrelly. I also plan to write a regression
test for this problem but I haven't made it that far yet.

-Aaron
