Did the srun start on v2.3, fail to get a resource allocation, and then continue execution on v2.4? In that case, it could have a combination of plugins, some from v2.3 and others from v2.4, which would probably not work. That is what I think happened.
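
To make the scenario concrete, here is a toy model of on-demand plugin loading. It is only a sketch and not Slurm code: the `Install` and `Command` classes and the `is_consistent` helper are hypothetical, and the plugin names are used purely as labels. It assumes a long-running command resolves each plugin against whatever is installed at the moment the plugin is first needed.

```python
class Install:
    """Hypothetical stand-in for the plugin directory on disk."""

    def __init__(self, version):
        self.version = version


class Command:
    """Hypothetical long-running command that loads plugins lazily."""

    def __init__(self, install):
        self.install = install
        self.loaded = {}  # plugin name -> version it was loaded from

    def load_plugin(self, name):
        # A plugin is resolved only on first use, against whatever
        # version is installed at that moment.
        if name not in self.loaded:
            self.loaded[name] = self.install.version
        return self.loaded[name]


def is_consistent(command):
    """True if every loaded plugin came from the same release."""
    return len(set(command.loaded.values())) <= 1


install = Install("2.3")
srun = Command(install)
srun.load_plugin("auth")      # needed at startup, comes from 2.3

install.version = "2.4"       # upgrade happens while srun is still queued

srun.load_plugin("select")    # first needed after the upgrade, comes from 2.4

print(srun.loaded)            # {'auth': '2.3', 'select': '2.4'}
print(is_consistent(srun))    # False
```

In this model the only way to get back to a consistent set is to start a new command after the upgrade, which matches the advice not to keep a 2.3 srun running against 2.4 daemons.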
Quoting [email protected]:

> Moe,
> Thank you for your reply, but I am not sure I understand what you are saying.
> I have the same slurm.conf file for both releases. The srun that is
> queued is started with the 2.3 release, and I expected it to be started
> even when I upgrade to V2.4.1 once resources are available. Maybe this is
> not how it works...
> Nancy
>
> From: Moe Jette <[email protected]>
> To: slurm-dev <[email protected]>, [email protected],
> Date: 07/04/2012 09:12 AM
> Subject: Re: [slurm-dev] Re: Problems upgrading to 2.4.0
>
> RPC 4017 is RESPONSE_JOB_ALLOCATION_INFO_LITE (see
> src/common/slurm_protocol_defs.h) and that only contains a job id.
> Nothing in the message contents has changed. Most plugins are loaded
> on demand rather than all being loaded when a program (e.g. srun)
> starts. My best guess is that the srun command has some version 2.3
> plugins loaded and some version 2.4 plugins were loaded after the
> upgrade, resulting in an inconsistent set of software.
>
> You definitely don't want to keep using a version 2.3 srun with
> version 2.4 daemons. The other commands (sinfo, sbatch, squeue, etc.)
> should all work with new daemons though.
>
> Quoting [email protected]:
>
>> Danny,
>> We are having some trouble with the transition from v2.3.5 to v2.4.1. I
>> tried to keep the test and logs as simple as possible. I have a single
>> node and start a job and have a job queued awaiting resources. When I
>> terminate v2.3.5 and start v2.4.1 the job terminates correctly, but the
>> queued job does not start, with the following error coming to the console.
>> The logs are attached as well.
>> Thanks for any help,
>> Nancy
>>
>> [sulu] (slurm) slurm>srun: error: Invalid Protocol Version 6144 from uid=200 at 141.112.17.124:39306
>> srun: error: slurm_receive_msg: Protocol version has changed, re-link your code
>> srun: error: _accept_msg_connection[sulu.gpv.az05.bull.com]: Protocol version has changed, re-link your code
>> srun: error: Malformed RPC of type 4017 received
>> srun: error: slurm_receive_msg: Header lengths are longer than data received
>> srun: error: Invalid Protocol Version 6144 from uid=200 at 141.112.17.124:53548
>> srun: error: slurm_receive_msg: Protocol version has changed, re-link your code
>> srun: error: slurm_receive_msg[141.112.17.124]: Protocol version has changed, re-link your code
>> srun: error: Unable to allocate resources: Header lengths are longer than data received
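
As a side note on the "Invalid Protocol Version 6144" lines in the log, the number can be decoded with a little arithmetic. The sketch below assumes the version is packed as (release << 8) | age, with the release in the high byte; that layout is an assumption on my part, and the function names are made up for illustration. Under that assumption 6144 decodes to release 24, which would be consistent with a message from a 2.4 daemon reaching a srun that still expects the 2.3 value.

```python
def encode_protocol_version(release, age=0):
    """Pack a release number and age into one integer.

    Assumed layout (not confirmed by the thread): (release << 8) | age.
    """
    return (release << 8) | age


def decode_protocol_version(version):
    """Split a packed version into (release, age) under the same assumption."""
    return version >> 8, version & 0xFF


received = 6144                          # value reported in the srun error
release, age = decode_protocol_version(received)
print(release, age)                      # 24 0

expected = encode_protocol_version(23)   # what a 2.3 client would expect here
print(expected)                          # 5888
print(received == expected)              # False, so the message is rejected
```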
