Hello,
I am trying to get SLURM running without errors for the first time.  When I
start the controller, I get no errors:

slurmctld: slurmctld version 14.11.7 started on cluster cluster

slurmctld: layouts: no layout to initialize

slurmctld: layouts: loading entities/relations information

slurmctld: Recovered state of 1 nodes

slurmctld: Recovered information about 0 jobs

slurmctld: Recovered state of 0 reservations

slurmctld: read_slurm_conf: backup_controller not specified.

slurmctld: Running as primary controller


However, when I start slurmd on my compute node, I get the following errors
on the compute node, and they keep repeating until I kill the daemon:

slurmd: Node configuration differs from hardware

   Procs=1:8(hw) Sockets=1:2(hw)

   CoresPerSocket=1:4(hw) ThreadsPerCore=1:1(hw)

slurmd: topology NONE plugin loaded

slurmd: task NONE plugin loaded

slurmd: Null authentication plugin loaded

slurmd: OpenSSL cryptographic signature plugin loaded

slurmd: Warning: Core limit is only 0 KB

slurmd: slurmd version 2.3.2 started

slurmd: switch NONE plugin loaded

slurmd: slurmd started on Mon 15 Jun 2015 16:17:37 -0400

slurmd: Procs=1 Sockets=1 Cores=1 Threads=1 Memory=40231 TmpDisk=162114
Uptime=5411

*slurmd: error: slurm_receive_msg: Zero Bytes were transmitted or received*

*slurmd: error: Unable to register: Zero Bytes were transmitted or received*

*slurmd: error: Invalid Protocol Version 7168 from uid=1000 at
"CONTROLLER'S IP ADDRESS"*

*slurmd: error: slurm_receive_msg_and_forward: Protocol version has
changed, re-link your code*

*slurmd: error: service_connection: slurm_receive_msg: Protocol version has
changed, re-link your code*

*slurmd: error: slurm_receive_msg: Zero Bytes were transmitted or received*

*slurmd: error: Unable to register: Zero Bytes were transmitted or received*

*slurmd: error: slurm_receive_msg: Zero Bytes were transmitted or received*


At the same time, the controller starts giving me similar errors:

*slurmctld: error: slurm_receive_msg: Invalid Protocol Version 5888 from
uid=0 at "COMPUTE NODE'S IP"*

*slurmctld: error: slurm_receive_msg: Incompatible versions of client and
server code*

*slurmctld: error: slurm_receive_msg: Incompatible versions of client and
server code*

*slurmctld: error: Invalid Protocol Version 5888 from uid=0 at "COMPUTE
NODE'S IP"*

*slurmctld: error: slurm_receive_msgs: Incompatible versions of client and
server code*

*slurmctld: error: slurm_receive_msg: Invalid Protocol Version 5888 from
uid=0 at "COMPUTE NODE'S IP"*

*slurmctld: error: slurm_receive_msg: Incompatible versions of client and
server code*

*slurmctld: error: slurm_receive_msg: Incompatible versions of client and
server code*


Clearly, something is wrong with how the controller and the compute node are
passing messages to each other.  I searched the archive for "Invalid
Protocol Version", but I'm not sure how relevant those threads were.  I just
installed the most recent release, and this is my first time using SLURM.
Any advice is appreciated!  I've also attached my current configuration
file.  To stop SLURM from complaining, I put it in /usr/local/etc on the
controller machine and in /etc/slurm-llnl.
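
For reference, here is a minimal Python sketch of how the versions that the
two daemons print at startup could be compared.  It only parses the two
banner lines quoted in the logs above.  The regular expression and the
helper name banner_version are my own assumptions for illustration, not
anything provided by SLURM, and other releases may format the banner
differently.

```python
import re

# Startup banner lines copied verbatim from the logs above.
CONTROLLER_BANNER = "slurmctld: slurmctld version 14.11.7 started on cluster cluster"
NODE_BANNER = "slurmd: slurmd version 2.3.2 started"

# Assumed banner shape: "<daemon>: <daemon> version X.Y.Z started".
# This pattern is a guess based only on the two lines above.
BANNER_RE = re.compile(r"^(slurmctld|slurmd): \1 version (\d+(?:\.\d+)*) started")


def banner_version(line):
    """Return (daemon, version tuple) for a startup banner, or None if no match."""
    match = BANNER_RE.match(line)
    if match is None:
        return None
    daemon = match.group(1)
    version = tuple(int(part) for part in match.group(2).split("."))
    return daemon, version


controller = banner_version(CONTROLLER_BANNER)
node = banner_version(NODE_BANNER)

print("controller:", controller)
print("compute node:", node)
# Compare only the major.minor part of each version.
print("same major.minor:", controller[1][:2] == node[1][:2])
```

Running it on the two banners above shows whether the controller and the
compute node report the same major.minor release.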


Thanks,

Adam Cooper

Brown University Computer Engineering '16




Attachment: slurm.conf
Description: Binary data
