Nancy, could you test this scenario going from 2.2 to 2.3? I wouldn't expect a pending srun to still work. I would expect sbatch and perhaps salloc to work, though salloc is questionable. This applies only to pending jobs. Only a few RPCs are supported between versions, and I believe the pending-srun scenario is not one that works.
[email protected] wrote:

Moe,
I'm sorry, I guess I am still a bit confused. I was assuming that I was having a similar problem to everyone else on this issue, just with different symptoms. To clarify first: is it supported for jobs to be queued awaiting resources in a 2.3.x release, then upgrade to a 2.4.1 release and expect these jobs to run once resources are available? If this is true, are you saying that we cannot have any plugins configured if we want to do this? I am surprised that the previous reports of problems moving from previous releases to 2.4.1 do not have any plugins configured.
Nancy

From: Moe Jette <[email protected]>
To: "slurm-dev" <[email protected]>,
Date: 07/04/2012 10:48 AM
Subject: [slurm-dev] Re: Problems upgrading to 2.4.0
_____________________________________________

Did the srun start on v2.3, but not get a resource allocation, then continue execution on v2.4? In that case, it could have a combination of plugins, some from v2.3 and others from v2.4, which would probably not work. That is what I am thinking happened.

Quoting [email protected]:

> Moe,
> Thank you for your reply, but I am not sure I understand what you are saying.
> I have the same slurm.conf file for both releases. The srun that is
> queued is started with the 2.3 release, and I expected it to be started
> even when I upgrade to v2.4.1, once resources are available. Maybe this is
> not how it works...
> Nancy
>
> From: Moe Jette <[email protected]>
> To: slurm-dev <[email protected]>, [email protected],
> Date: 07/04/2012 09:12 AM
> Subject: Re: [slurm-dev] Re: Problems upgrading to 2.4.0
>
> RPC 4017 is RESPONSE_JOB_ALLOCATION_INFO_LITE (see
> src/common/slurm_protocol_defs.h) and that only contains a job id.
> Nothing in the message contents has changed. Most plugins are loaded
> on demand rather than all being loaded when a program (e.g. srun)
> starts. My best guess is that the srun command has some version 2.3
> plugins loaded and some version 2.4 plugins were loaded after the
> upgrade, resulting in an inconsistent set of software.
>
> You definitely don't want to keep using a version 2.3 srun with
> version 2.4 daemons. The other commands (sinfo, sbatch, squeue, etc.)
> should all work with new daemons though.
>
> Quoting [email protected]:
>
>> Danny,
>> We are having some trouble with the transition from v2.3.5 to v2.4.1. I
>> tried to keep the test and logs as simple as possible. I have a single
>> node and start a job and have a job queued awaiting resources. When I
>> terminate v2.3.5 and start v2.4.1 the job terminates correctly, but the
>> queued job does not start, with the following error coming to the console.
>> The logs are attached as well.
>> Thanks for any help,
>> Nancy
>>
>> [sulu] (slurm) slurm>srun: error: Invalid Protocol Version 6144 from uid=200 at 141.112.17.124:39306
>> srun: error: slurm_receive_msg: Protocol version has changed, re-link your code
>> srun: error: _accept_msg_connection[sulu.gpv.az05.bull.com]: Protocol version has changed, re-link your code
>> srun: error: Malformed RPC of type 4017 received
>> srun: error: slurm_receive_msg: Header lengths are longer than data received
>> srun: error: Invalid Protocol Version 6144 from uid=200 at 141.112.17.124:53548
>> srun: error: slurm_receive_msg: Protocol version has changed, re-link your code
>> srun: error: slurm_receive_msg[141.112.17.124]: Protocol version has changed, re-link your code
>> srun: error: Unable to allocate resources: Header lengths are longer than data received
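For anyone reading the console output above, the number in "Invalid Protocol Version 6144" can be decoded to see which release sent the message. The sketch below is only illustrative. It assumes, which this thread does not confirm, that 2.x releases encode the protocol version as ((major * 10 + minor) << 8) | age. The helper names decode_protocol_version and versions_in_log are hypothetical and not part of Slurm.

```python
import re

# Assumption (not confirmed in this thread): Slurm 2.x encodes the protocol
# version as ((major * 10 + minor) << 8) | age, so 2.4 would be (24 << 8).
def decode_protocol_version(value):
    """Split a numeric protocol version into (major, minor, age)."""
    release = value >> 8      # e.g. 24 for a 2.4 release
    age = value & 0xFF        # low byte
    return release // 10, release % 10, age

# Matches the srun error text quoted in this thread.
VERSION_ERROR = re.compile(r"Invalid Protocol Version (\d+)")

def versions_in_log(text):
    """Decode every protocol version reported in a block of srun output."""
    return [decode_protocol_version(int(v)) for v in VERSION_ERROR.findall(text)]

sample = ("srun: error: Invalid Protocol Version 6144 "
          "from uid=200 at 141.112.17.124:39306")
print(versions_in_log(sample))
```

Under that assumption, 6144 decodes to release 2.4, which would match the explanation above: a v2.3 srun receiving a message from a v2.4 daemon.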
