Hi Paddy,

On Sun, Aug 17, 2014 at 01:26:12PM -0700, Gerben Roest wrote:


I run a slurmctld and slurmdbd on a Scientific Linux (SL) 5 server and
have three SL6 nodes, all running Slurm 14.03.6, with one node behind
another slurmctld on another cluster. The whole slurm setup seems to run
fine with tests, even submitting from one cluster to the other.
The slurmctld daemon on the machine where slurmdbd is also running logs:

error: slurm_receive_msg: Zero Bytes were transmitted or received

For me, that's usually a version mismatch somewhere. One of the daemons is a
version behind, so there's a protocol mismatch when they try to communicate.
I'd first double-check that all versions are the same (and that every daemon
has been restarted since any upgrade).
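
A rough sketch of that check, in case it helps: collect the version line each
daemon reports on every host and group the hosts by version. This assumes the
output of something like "slurmd -V" or "slurmctld -V" contains a version such
as "slurm 14.03.6"; the host names and sample strings below are made up for
illustration, and how you gather the output (ssh, pdsh, by hand) is up to you.

```python
import re
from collections import defaultdict


def group_hosts_by_version(reports):
    """Map each Slurm version string to the sorted list of hosts reporting it.

    `reports` maps a host name to the raw version output captured on that
    host (assumed to look like "slurm 14.03.6").
    """
    groups = defaultdict(list)
    for host, output in reports.items():
        # Pull out the first x.y.z-style token; anything else is "unknown".
        match = re.search(r"(\d+\.\d+\.\d+\S*)", output)
        version = match.group(1) if match else "unknown"
        groups[version].append(host)
    return {version: sorted(hosts) for version, hosts in groups.items()}


if __name__ == "__main__":
    # Hypothetical sample data, not taken from a real cluster.
    reports = {
        "master-sl5": "slurm 14.03.6",
        "node1-sl6": "slurm 14.03.6",
        "node2-sl6": "slurm 14.03.4",
    }
    groups = group_hosts_by_version(reports)
    if len(groups) > 1:
        print("Version mismatch found:")
        for version, hosts in sorted(groups.items()):
            print(f"  {version}: {', '.join(hosts)}")
    else:
        print("All hosts report the same version.")
```

More than one group means at least one daemon is on a different version (or
could not be parsed) and is worth a closer look. Note that this only compares
installed binaries; a daemon that was not restarted after an upgrade would
still be running the old code.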

I have checked the versions of the main slurmctld and the slurmd's on its nodes, and of the slurmctld on the other cluster and the slurmd's on those nodes, and all use 14.03.6. I didn't upgrade; I started straight from 14.03.6.

The only difference I can think of is that the main master runs 14.03.6 compiled for SL5 with "-O0", while the others run it from another directory (on NFS), compiled from the same source but without "-O0" and installed there with "make install". That directory was created for the SL6 machines (because of GLIBC dependencies). But I guess you should be able to run Slurm from different builds, provided they are the same version?

It all seems to work; I only get these strange log messages. If you need more verbose information, please let me know what I should check.

Gerben
