Hello, I am trying to run slurm without errors for the first time. When I run the controller, I get no errors:

slurmctld: slurmctld version 14.11.7 started on cluster cluster
slurmctld: layouts: no layout to initialize
slurmctld: layouts: loading entities/relations information
slurmctld: Recovered state of 1 nodes
slurmctld: Recovered information about 0 jobs
slurmctld: Recovered state of 0 reservations
slurmctld: read_slurm_conf: backup_controller not specified.
slurmctld: Running as primary controller

However, when I run slurmd on my compute node, I get these errors on the compute node, and they persist until I kill the daemon:

slurmd: Node configuration differs from hardware Procs=1:8(hw) Sockets=1:2(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:1(hw)
slurmd: topology NONE plugin loaded
slurmd: task NONE plugin loaded
slurmd: Null authentication plugin loaded
slurmd: OpenSSL cryptographic signature plugin loaded
slurmd: Warning: Core limit is only 0 KB
slurmd: slurmd version 2.3.2 started
slurmd: switch NONE plugin loaded
slurmd: slurmd started on Mon 15 Jun 2015 16:17:37 -0400
slurmd: Procs=1 Sockets=1 Cores=1 Threads=1 Memory=40231 TmpDisk=162114 Uptime=5411
*slurmd: error: slurm_receive_msg: Zero Bytes were transmitted or received*
*slurmd: error: Unable to register: Zero Bytes were transmitted or received*
*slurmd: error: Invalid Protocol Version 7168 from uid=1000 at "CONTROLLER'S IP ADDRESS"*
*slurmd: error: slurm_receive_msg_and_forward: Protocol version has changed, re-link your code*
*slurmd: error: service_connection: slurm_receive_msg: Protocol version has changed, re-link your code*
*slurmd: error: slurm_receive_msg: Zero Bytes were transmitted or received*
*slurmd: error: Unable to register: Zero Bytes were transmitted or received*
*slurmd: error: slurm_receive_msg: Zero Bytes were transmitted or received*

At the same time, the controller starts giving me similar errors:

*slurmctld: error: slurm_receive_msg: Invalid Protocol Version 5888 from uid=0 at "COMPUTE NODE'S IP"*
*slurmctld: error: slurm_receive_msg: Incompatible versions of client and server code*
*slurmctld: error: slurm_receive_msg: Incompatible versions of client and server code*
*slurmctld: error: Invalid Protocol Version 5888 from uid=0 at "COMPUTE NODE'S IP"*
*slurmctld: error: slurm_receive_msgs: Incompatible versions of client and server code*
*slurmctld: error: slurm_receive_msg: Invalid Protocol Version 5888 from uid=0 at "COMPUTE NODE'S IP"*
*slurmctld: error: slurm_receive_msg: Incompatible versions of client and server code*
*slurmctld: error: slurm_receive_msg: Incompatible versions of client and server code*

Clearly, something is wrong with how the controller and compute node are passing messages to each other. I checked the archive for "Invalid Protocol Version", but I'm not sure how relevant those queries were. I just installed the most recent release, and this is my first time using SLURM. Any advice is appreciated!

Also, I've attached my current configuration file. To stop SLURM from complaining, I put it in /usr/local/etc on the controller machine and /etc/slurm-llnl.

Thanks,
Adam Cooper
Brown University
Computer Engineering '16
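
P.S. One thing I can check from the logs themselves: the two daemons print different version strings at startup (14.11.7 on the controller, 2.3.2 on the compute node). Here is a minimal Python sketch, not anything from SLURM itself, that pulls those strings out of the startup lines and compares them. The names daemon_version and same_release are my own, and comparing only the first two version components is my assumption about what counts as "the same release". I have not confirmed that this difference is what causes the errors.

```python
import re

# Startup lines copied from the controller and compute node logs in this post.
CONTROLLER_LINE = "slurmctld: slurmctld version 14.11.7 started on cluster cluster"
NODE_LINE = "slurmd: slurmd version 2.3.2 started"

# Matches e.g. "slurmd version 2.3.2" and captures the daemon name and version.
VERSION_RE = re.compile(r"(slurmctld|slurmd) version (\d+(?:\.\d+)*)")


def daemon_version(line):
    """Return (daemon name, version tuple) parsed from a startup log line."""
    match = VERSION_RE.search(line)
    if match is None:
        raise ValueError(f"no version found in: {line!r}")
    daemon, version = match.groups()
    return daemon, tuple(int(part) for part in version.split("."))


def same_release(a, b):
    """Assumption: two versions are the same release if major.minor agree."""
    return a[:2] == b[:2]


ctld_name, ctld_version = daemon_version(CONTROLLER_LINE)
node_name, node_version = daemon_version(NODE_LINE)

print(ctld_name, ctld_version)
print(node_name, node_version)
print("same release:", same_release(ctld_version, node_version))
```

Running this against the two startup lines reports whether the controller and the compute node are on the same release, which seems worth ruling out given the "Incompatible versions of client and server code" messages.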
