Judging by the startup lines in your logs, your slurmctld is 14.11.7 and
your slurmd is 2.3.2.  I've seen the "Zero Bytes were transmitted or
received" error before when there was a major version mismatch between
slurmctld and slurmd.
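
A minimal sketch of that comparison, assuming both daemons print a startup
banner of the form "<daemon> version X.Y.Z started" as in the output quoted
below. The function names are made up for illustration, and treating only
identical X.Y releases as compatible is a deliberately strict simplification:

```python
import re

# Matches startup banners such as "slurmd: slurmd version 2.3.2 started".
BANNER = re.compile(r"(slurmctld|slurmd) version (\d+)\.(\d+)\.(\d+)")

def daemon_versions(log_text):
    """Return {daemon: (major, minor, micro)} for each banner found."""
    found = {}
    for name, major, minor, micro in BANNER.findall(log_text):
        found[name] = (int(major), int(minor), int(micro))
    return found

def same_release(a, b):
    """Compare only the first two components (e.g. 14.11 vs 2.3)."""
    return a[:2] == b[:2]

# Banner lines copied from the quoted controller and compute-node output.
log = """\
slurmctld: slurmctld version 14.11.7 started on cluster cluster
slurmd: slurmd version 2.3.2 started
"""
versions = daemon_versions(log)
print(versions)
print(same_release(versions["slurmctld"], versions["slurmd"]))
```

For the two banners above, the comparison reports that the daemons come
from different releases.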

- Trey

=============================

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: [email protected]
Jabber: [email protected]

On Mon, Jun 15, 2015 at 3:36 PM, Cooper, Adam <[email protected]> wrote:

>  Hello,
> I am trying to run slurm without errors for the first time.  When I run
> the controller, I get no errors:
>
> slurmctld: slurmctld version 14.11.7 started on cluster cluster
> slurmctld: layouts: no layout to initialize
> slurmctld: layouts: loading entities/relations information
> slurmctld: Recovered state of 1 nodes
> slurmctld: Recovered information about 0 jobs
> slurmctld: Recovered state of 0 reservations
> slurmctld: read_slurm_conf: backup_controller not specified.
> slurmctld: Running as primary controller
>
>
> However, when I run the slurmd on my compute node, I get these errors on
> the compute node which persist as long as I don't kill the daemon:
>
> slurmd: Node configuration differs from hardware
>    Procs=1:8(hw) Sockets=1:2(hw)
>    CoresPerSocket=1:4(hw) ThreadsPerCore=1:1(hw)
> slurmd: topology NONE plugin loaded
> slurmd: task NONE plugin loaded
> slurmd: Null authentication plugin loaded
> slurmd: OpenSSL cryptographic signature plugin loaded
> slurmd: Warning: Core limit is only 0 KB
> slurmd: slurmd version 2.3.2 started
> slurmd: switch NONE plugin loaded
> slurmd: slurmd started on Mon 15 Jun 2015 16:17:37 -0400
> slurmd: Procs=1 Sockets=1 Cores=1 Threads=1 Memory=40231 TmpDisk=162114 Uptime=5411
> *slurmd: error: slurm_receive_msg: Zero Bytes were transmitted or received*
> *slurmd: error: Unable to register: Zero Bytes were transmitted or received*
> *slurmd: error: Invalid Protocol Version 7168 from uid=1000 at "CONTROLLER'S IP ADDRESS"*
> *slurmd: error: slurm_receive_msg_and_forward: Protocol version has changed, re-link your code*
> *slurmd: error: service_connection: slurm_receive_msg: Protocol version has changed, re-link your code*
> *slurmd: error: slurm_receive_msg: Zero Bytes were transmitted or received*
> *slurmd: error: Unable to register: Zero Bytes were transmitted or received*
> *slurmd: error: slurm_receive_msg: Zero Bytes were transmitted or received*
>
>
> At the same time the controller starts giving me similar errors:
>
> *slurmctld: error: slurm_receive_msg: Invalid Protocol Version 5888 from uid=0 at "COMPUTE NODE'S IP"*
> *slurmctld: error: slurm_receive_msg: Incompatible versions of client and server code*
> *slurmctld: error: slurm_receive_msg: Incompatible versions of client and server code*
> *slurmctld: error: Invalid Protocol Version 5888 from uid=0 at "COMPUTE NODE'S IP"*
> *slurmctld: error: slurm_receive_msgs: Incompatible versions of client and server code*
> *slurmctld: error: slurm_receive_msg: Invalid Protocol Version 5888 from uid=0 at "COMPUTE NODE'S IP"*
> *slurmctld: error: slurm_receive_msg: Incompatible versions of client and server code*
> *slurmctld: error: slurm_receive_msg: Incompatible versions of client and server code*
>
>
> Clearly, something is wrong with how the controller & compute node are
> passing messages to each other.  I checked the archive for the Invalid
> Protocol Version, but I'm not sure how relevant those queries were.  I just
> installed the most recent release, and this is my first time using SLURM.
> Any advice is appreciated!  Also I've attached my current configuration
> file.  To stop SLURM from complaining, I put it in /usr/local/etc on the
> controller machine and /etc/slurm-llnl.
>
>
> Thanks,
>
> Adam Cooper
>
> Brown University Computer Engineering '16
>
