RPC 4017 is RESPONSE_JOB_ALLOCATION_INFO_LITE (see src/common/slurm_protocol_defs.h), and it contains only a job id. Nothing in the message contents has changed. Most plugins are loaded on demand rather than all being loaded when a program (e.g., srun) starts. My best guess is that the srun command had some version 2.3 plugins loaded, and some version 2.4 plugins were loaded after the upgrade, resulting in an inconsistent set of software.
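The 6144 in the quoted "Invalid Protocol Version" error looks consistent with a 2.3 srun receiving 2.4 traffic. Here is a rough sketch of the arithmetic. It assumes the 2.x protocol version is packed as ((major * 10 + minor) << 8) | low byte, which is how I recall the SLURM_2_x_PROTOCOL_VERSION macros being defined in src/common/slurm_protocol_common.h; please check your own source tree. The function names are made up for illustration and are not part of SLURM.

```python
# Assumption (not confirmed here): 2.x protocol versions are packed as
# ((major * 10 + minor) << 8) | low_byte. Helper names are hypothetical.

def encode_protocol_version(major, minor, low_byte=0):
    """Pack a release such as 2.4 into a 16-bit protocol version."""
    return ((major * 10 + minor) << 8) | low_byte


def decode_protocol_version(value):
    """Split a protocol version back into (major, minor, low_byte)."""
    release, low_byte = value >> 8, value & 0xFF
    major, minor = divmod(release, 10)
    return (major, minor, low_byte)


if __name__ == "__main__":
    print(decode_protocol_version(6144))  # (2, 4, 0)
    print(encode_protocol_version(2, 3))  # 5888
```

Under that assumption, 6144 decodes to 2.4, while a 2.3 build would expect 5888, so the old srun rejects the message before it ever looks at the RPC body.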
You definitely don't want to keep using a version 2.3 srun with version 2.4 daemons. The other commands (sinfo, sbatch, squeue, etc.) should all work with new daemons though.

Quoting [email protected]:

> Danny,
> We are having some trouble with the transition from v2.3.5 to v2.4.1. I
> tried to keep the test and logs as simple as possible. I have a single
> node and start job and have a job queued awaiting resources. When I
> terminate v2.3.5 and start v2.4.1 the job terminates correctly, but the
> queued job does not start with the following error coming to the console.
> The logs are attached as well.
> Thanks for any help,
> Nancy
>
> [sulu] (slurm) slurm>srun: error: Invalid Protocol Version 6144 from
> uid=200 at 141.112.17.124:39306
> srun: error: slurm_receive_msg: Protocol version has changed, re-link your
> code
> srun: error: _accept_msg_connection[sulu.gpv.az05.bull.com]: Protocol
> version has changed, re-link your code
> srun: error: Malformed RPC of type 4017 received
> srun: error: slurm_receive_msg: Header lengths are longer than data
> received
> srun: error: Invalid Protocol Version 6144 from uid=200 at
> 141.112.17.124:53548
> srun: error: slurm_receive_msg: Protocol version has changed, re-link your
> code
> srun: error: slurm_receive_msg[141.112.17.124]: Protocol version has
> changed, re-link your code
> srun: error: Unable to allocate resources: Header lengths are longer than
> data received
