On Mon, May 2, 2016 at 10:36 AM, Paul Edmon <[email protected]> wrote:

>
> I'm seeing quite a few of these errors:
>
> May  2 11:33:29 holy-slurm01 slurmctld[47253]: error: slurm_receive_msg:
> Zero Bytes were transmitted or received
> May  2 11:33:29 holy-slurm01 slurmctld[47253]: error: slurm_receive_msg:
> Zero Bytes were transmitted or received
>
> I know that this can be caused by a node or client that is in a bad state,
> but I can't figure out how to trace it back to which one. Does anyone have
> any tricks for tracing this sort of error back?  I turned on the Protocol
> Debug Flag but none of the additional debug statements lead to the culprit.
>

Funny you should mention this!
I ran into the same problem a week or two back and ended up using strace
to backtrack where the connection came in from. That irritated me
immensely, so I built a patch to address this very problem. It adds a new
format operator for error(...) and friends that expects a file descriptor
argument and calls getpeername on it to identify the remote peer (sorry
for the hand-wavy description). I'd love to post it or send it to somebody
who could look it over.
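
To make the getpeername idea a bit less hand-wavy, here is a rough
standalone sketch of the lookup. It is not the patch itself and does not
touch Slurm's logging code: fd_peer_to_str is a hypothetical name, and the
"addr:port" output format is just an assumption for illustration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not from the actual patch): format the address of
 * the peer connected to fd as "addr:port" (or "[addr]:port" for IPv6).
 * Returns 0 on success, -1 if fd is not a connected IPv4/IPv6 socket.
 */
int fd_peer_to_str(int fd, char *buf, size_t len)
{
    struct sockaddr_storage ss;
    socklen_t sslen = sizeof(ss);
    char host[INET6_ADDRSTRLEN];

    /* Ask the kernel who is on the other end of this descriptor. */
    if (getpeername(fd, (struct sockaddr *)&ss, &sslen) < 0)
        return -1;

    if (ss.ss_family == AF_INET) {
        struct sockaddr_in *in4 = (struct sockaddr_in *)&ss;
        if (!inet_ntop(AF_INET, &in4->sin_addr, host, sizeof(host)))
            return -1;
        snprintf(buf, len, "%s:%u", host, (unsigned)ntohs(in4->sin_port));
    } else if (ss.ss_family == AF_INET6) {
        struct sockaddr_in6 *in6 = (struct sockaddr_in6 *)&ss;
        if (!inet_ntop(AF_INET6, &in6->sin6_addr, host, sizeof(host)))
            return -1;
        snprintf(buf, len, "[%s]:%u", host, (unsigned)ntohs(in6->sin6_port));
    } else {
        /* Other address families (e.g. AF_UNIX) are not handled here. */
        return -1;
    }
    return 0;
}
```

With something like this, an error path that still holds the descriptor
could append the peer string to its message instead of leaving you to dig
it out with strace.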

devs: what should I do here?
