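It appears your slurmctld is 14.11.7 and your slurmd is 2.3.2. I've seen the "Zero Bytes were transmitted or received" error before when there was a major version mismatch between slurmctld and slurmd. The "Invalid Protocol Version" and "Incompatible versions of client and server code" messages in your logs point the same way.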
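
A quick way to confirm this kind of mismatch is to pull the version out of each daemon's startup banner and compare. The following is only a rough sketch: the function names are made up for illustration, the regular expression is modeled on the two banner lines quoted below, and treating the first two version components as the "release series" is an assumption of this sketch rather than anything Slurm itself defines here.

```python
import re

# Matches banners such as:
#   slurmctld: slurmctld version 14.11.7 started on cluster cluster
#   slurmd: slurmd version 2.3.2 started
BANNER = re.compile(r"^(slurmctld|slurmd): \1 version (\d+)\.(\d+)(?:\.\d+)* started")


def release_series(log_lines):
    """Return {daemon: (major, minor)} for every startup banner found."""
    found = {}
    for line in log_lines:
        match = BANNER.match(line.strip())
        if match:
            found[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return found


def series_match(versions):
    """True only if both daemons were seen and share the same (major, minor)."""
    ctld = versions.get("slurmctld")
    node = versions.get("slurmd")
    return ctld is not None and ctld == node


# Log lines copied from the message quoted in this thread.
LOG = [
    "slurmctld: slurmctld version 14.11.7 started on cluster cluster",
    "slurmctld: Running as primary controller",
    "slurmd: Warning: Core limit is only 0 KB",
    "slurmd: slurmd version 2.3.2 started",
]

if __name__ == "__main__":
    versions = release_series(LOG)
    for daemon, series in sorted(versions.items()):
        print(daemon, "release series", "%d.%d" % series)
    if not series_match(versions):
        print("slurmctld and slurmd are from different release series")
```

If the two series differ, as they do here, the protocol errors on both sides are the expected symptom.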

- Trey

=============================

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: [email protected]
Jabber: [email protected]

On Mon, Jun 15, 2015 at 3:36 PM, Cooper, Adam <[email protected]> wrote:

> Hello,
>
> I am trying to run slurm without errors for the first time. When I run
> the controller, I get no errors:
>
> slurmctld: slurmctld version 14.11.7 started on cluster cluster
> slurmctld: layouts: no layout to initialize
> slurmctld: layouts: loading entities/relations information
> slurmctld: Recovered state of 1 nodes
> slurmctld: Recovered information about 0 jobs
> slurmctld: Recovered state of 0 reservations
> slurmctld: read_slurm_conf: backup_controller not specified.
> slurmctld: Running as primary controller
>
> However, when I run the slurmd on my compute node, I get these errors on
> the compute node which persist as long as I don't kill the daemon:
>
> slurmd: Node configuration differs from hardware
>     Procs=1:8(hw) Sockets=1:2(hw)
>     CoresPerSocket=1:4(hw) ThreadsPerCore=1:1(hw)
> slurmd: topology NONE plugin loaded
> slurmd: task NONE plugin loaded
> slurmd: Null authentication plugin loaded
> slurmd: OpenSSL cryptographic signature plugin loaded
> slurmd: Warning: Core limit is only 0 KB
> slurmd: slurmd version 2.3.2 started
> slurmd: switch NONE plugin loaded
> slurmd: slurmd started on Mon 15 Jun 2015 16:17:37 -0400
> slurmd: Procs=1 Sockets=1 Cores=1 Threads=1 Memory=40231 TmpDisk=162114 Uptime=5411
> *slurmd: error: slurm_receive_msg: Zero Bytes were transmitted or received*
> *slurmd: error: Unable to register: Zero Bytes were transmitted or received*
> *slurmd: error: Invalid Protocol Version 7168 from uid=1000 at "CONTROLLER'S IP ADDRESS"*
> *slurmd: error: slurm_receive_msg_and_forward: Protocol version has changed, re-link your code*
> *slurmd: error: service_connection: slurm_receive_msg: Protocol version has changed, re-link your code*
> *slurmd: error: slurm_receive_msg: Zero Bytes were transmitted or received*
> *slurmd: error: Unable to register: Zero Bytes were transmitted or received*
> *slurmd: error: slurm_receive_msg: Zero Bytes were transmitted or received*
>
> At the same time the controller starts giving me similar errors:
>
> *slurmctld: error: slurm_receive_msg: Invalid Protocol Version 5888 from uid=0 at "COMPUTE NODE'S IP"*
> *slurmctld: error: slurm_receive_msg: Incompatible versions of client and server code*
> *slurmctld: error: slurm_receive_msg: Incompatible versions of client and server code*
> *slurmctld: error: Invalid Protocol Version 5888 from uid=0 at "COMPUTE NODE'S IP"*
> *slurmctld: error: slurm_receive_msgs: Incompatible versions of client and server code*
> *slurmctld: error: slurm_receive_msg: Invalid Protocol Version 5888 from uid=0 at "COMPUTE NODE'S IP"*
> *slurmctld: error: slurm_receive_msg: Incompatible versions of client and server code*
> *slurmctld: error: slurm_receive_msg: Incompatible versions of client and server code*
>
> Clearly, something is wrong with how the controller & compute node are
> passing message to each other. I checked the archive for the Invalid
> Protocol Version, but I'm not sure how relevant those queries were. I just
> installed the most recent release, and this is my first time using SLURM.
> Any advice is appreciated! Also I've attached my current configuration
> file. To stop SLURM from complaining, I put it in /usr/local/etc on the
> controller machine and /etc/slurm-llnl.
>
> Thanks,
>
> Adam Cooper
> Brown University Computer Engineering '16
