Hi Everyone,

This is somewhat of a re-post of an old issue (https://groups.google.com/forum/#!topic/slurm-devel/59xPbuhb_78).
It caught my attention recently, so I re-investigated. The reason we experience the problem is a curious interaction between older versions of the hydra MPI launcher and SSSD. That's not really a SLURM problem, though. What I think is a SLURM issue is the way srun reacts.

The reproducer below is a simple way of triggering the same behavior without hydra or sssd:

    dd if=/dev/zero bs=1k | srun sleep 5

It produces this error most of the time (it still happens on a build of master from today):

    srun: debug: IO error on node 0
    srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 0

What I think I've narrowed it down to (I was wrong in my initial assessment in the earlier post) is a race condition between eio_handle_mainloop and eio_signal_shutdown. If eio_handle_mainloop exits before eio_signal_shutdown is called, the socket is never shutdown(). That leaves data on the socket that srun has sent but stepd has not dequeued and fed to sleep because, well, sleep doesn't read anything from stdin. When stepd exits, the kernel closes the socket with that data still unread, so the next read() attempted by srun fails with ECONNRESET.

I've come up with a seemingly functional fix. It ensures the eio objects get marked as shut down, and that shutdown() is subsequently called on them, should eio_handle_mainloop exit before eio_signal_shutdown is called. The proposed fix is here:

https://github.com/aaronknister/slurm/commits/connection_reset_fix

I've attempted to run the regression test suite but haven't had much luck; my SLURM VM is being a little squirrelly. I also plan to write a regression test for this problem, but I haven't made it that far yet.

-Aaron
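
P.S. For anyone who wants to see the kernel behavior described above in isolation, here is a minimal sketch. It is not SLURM code and does not touch the eio layer. It assumes a plain TCP connection over loopback on Linux, and the function name peer_exit_result and the labels "sender" (standing in for srun) and "receiver" (standing in for stepd) are made up for illustration. It only shows that closing a socket with unread data pending makes the peer's next read fail with ECONNRESET, while a receiver that consumes the data first gives the peer a clean EOF.

```python
import select
import socket


def peer_exit_result(drain_before_close: bool) -> str:
    """Report what the sender sees on its next recv() after the receiver closes.

    Hypothetical stand-ins: "sender" plays srun, "receiver" plays stepd.
    """
    payload = b"\0" * 1024

    # Build a connected TCP pair on loopback (assumption: TCP, as a stand-in
    # for the srun <-> stepd I/O connection).
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()
    listener.close()

    try:
        # The sender pushes data that the receiver may never consume,
        # like stdin forwarded to a task that does not read it.
        sender.sendall(payload)

        # Wait until the data is actually queued on the receiver's socket.
        select.select([receiver], [], [], 5.0)

        if drain_before_close:
            # Consume everything that was sent before closing.
            remaining = len(payload)
            while remaining > 0:
                chunk = receiver.recv(remaining)
                if not chunk:
                    break
                remaining -= len(chunk)

        # The receiver goes away, as when the step daemon exits.
        receiver.close()

        try:
            data = sender.recv(4096)
        except ConnectionResetError:
            return "ECONNRESET"
        return "EOF" if data == b"" else "data"
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print("close with unread data:", peer_exit_result(False))
    print("drain, then close:     ", peer_exit_result(True))
```

On Linux the first case should report ECONNRESET and the second EOF, which matches the srun error above: the reset comes from the unread data at close time, not from anything srun did wrong on its side.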
