On Mon, May 2, 2016 at 10:36 AM, Paul Edmon <[email protected]> wrote:
>
> I'm seeing quite a few of these errors:
>
> May 2 11:33:29 holy-slurm01 slurmctld[47253]: error: slurm_receive_msg: Zero Bytes were transmitted or received
> May 2 11:33:29 holy-slurm01 slurmctld[47253]: error: slurm_receive_msg: Zero Bytes were transmitted or received
>
> I know that this can be caused by a node or client that is in a bad state,
> but I can't figure out how to trace it back to which one. Does anyone have
> any tricks for tracing this sort of error back? I turned on the Protocol
> Debug Flag but none of the additional debug statements lead to the culprit.
>

It's funny you should mention this! I ran into the same problem a week or
two back. I ended up having to use strace and backtrack to where the
connection came in from.

This irritated me immensely, so I built a patch to address this very
problem. It adds a new format operator for error(...) and friends that
expects a file descriptor as its argument and calls getpeername() on it, so
the peer's address shows up in the log message (sorry for the hand-wave-y
description).

I'd love to post this or send it to somebody who could look it over.

devs: what should I do here?
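To make the idea a bit more concrete, here is a rough sketch in Python of what "call getpeername on the file descriptor" amounts to. This is not the patch itself (that is C inside Slurm's logging code); the names describe_peer and demo are made up for illustration, and the sketch only handles IPv4/IPv6 sockets.

```python
import socket


def describe_peer(fd):
    """Return 'host:port' for the remote end of a connected socket fd.

    Hypothetical helper for illustration only; it is not part of Slurm.
    """
    try:
        # Wrap the raw descriptor; family and type are detected from the fd
        # (Python 3.7+).
        sock = socket.socket(fileno=fd)
    except OSError:
        return "<unknown peer>"
    try:
        if sock.family not in (socket.AF_INET, socket.AF_INET6):
            return "<non-IP peer>"
        host, port = sock.getpeername()[:2]
        return f"{host}:{port}"
    except OSError:
        # For example ENOTCONN: the socket has no peer (yet, or any more).
        return "<unknown peer>"
    finally:
        # Release the wrapper without closing the caller's descriptor.
        sock.detach()


def demo():
    """Connect to a local listener and resolve the peer from the accepted fd."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    server.listen(1)
    client = socket.create_connection(server.getsockname())
    conn, _ = server.accept()
    try:
        peer = describe_peer(conn.fileno())
        expected = "%s:%d" % client.getsockname()
        return peer, expected
    finally:
        conn.close()
        client.close()
        server.close()


if __name__ == "__main__":
    peer, expected = demo()
    print("peer seen by server:", peer)
    print("client's own address:", expected)
```

With something like this wired into the error path, the "Zero Bytes were transmitted or received" line could carry the address of the offending node or client, so no strace session is needed to find it. One caveat: if the peer has already gone away, the lookup can fail, which is why the sketch falls back to a placeholder string instead of raising.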
