Hi,

As far as I understand, the incompatibility is now with srun jobs. My first
post was about batch jobs (sorry I didn't specify that). I suppose that the
protocol to talk to a "waiting" srun has changed and that is what makes
your jobs fail, but I'm just guessing.
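
For what it's worth, the 6144 in the "Invalid Protocol Version 6144" error quoted below looks like a 2.4 protocol version rather than garbage. Here is a minimal Python sketch, assuming Slurm 2.x packs the protocol version as (release << 8) | revision. That layout is my recollection of slurm_protocol_common.h and is not confirmed anywhere in this thread; the function name is made up for illustration.

```python
def decode_protocol_version(version):
    """Split a packed protocol version into (release, revision).

    ASSUMPTION: the value is laid out as (release << 8) | revision,
    so a 2.4 daemon would advertise (24 << 8) | 0.
    """
    return version >> 8, version & 0xFF


# The value reported in the srun error message.
release, revision = decode_protocol_version(6144)
print(release, revision)
```

If that assumption holds, 6144 decodes to release 24, which would fit a v2.3 srun rejecting messages sent by the upgraded v2.4 daemons.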

regards,
Carles Fenoy

On Thu, Jul 5, 2012 at 1:49 AM, <[email protected]> wrote:

>  Moe,
> I'm sorry, I guess I am still a bit confused.  I was assuming that I was
> having a similar problem to everyone else on this issue, just with
> different symptoms.  To clarify first: is it supported for jobs to be
> queued waiting for resources in a 2.3.x release, then upgrade to a
> 2.4.1 release and expect these jobs to run once resources are available?
>  If this is true, are you saying that we cannot have any plugins
> configured if we want to do this?  I would be surprised if the previous
> reports of problems moving from earlier releases to 2.4.1 did not have
> any plugins configured.
> Nancy
>
>
>
> From:        Moe Jette <[email protected]>
> To:        "slurm-dev" <[email protected]>,
> Date:        07/04/2012 10:48 AM
> Subject:        [slurm-dev] Re: Problems upgrading to 2.4.0
> ------------------------------
>
>
>
>
> Did the srun start on v2.3, but not get a resource allocation, then
> continue execution on v2.4? In that case, it could have a combination
> of plugins, some from v2.3 and others from v2.4, which would probably
> not work. That is what I am thinking happened.
>
> Quoting [email protected]:
>
> > Moe,
> > Thank you for your reply, but I am not sure I understand what you are
> > saying.  I have the same slurm.conf file for both releases.  The srun
> > that is queued was started with the 2.3 release, and I expected it to be
> > started once resources are available, even after I upgrade to v2.4.1.
> > Maybe this is not how it works...
> > Nancy
> >
> >
> >
> > From:   Moe Jette <[email protected]>
> > To:     slurm-dev <[email protected]>, [email protected],
> > Date:   07/04/2012 09:12 AM
> > Subject:        Re: [slurm-dev] Re: Problems upgrading to 2.4.0
> >
> >
> >
> > RPC 4017 is RESPONSE_JOB_ALLOCATION_INFO_LITE (see
> > src/common/slurm_protocol_defs.h) and that only contains a job id.
> > Nothing in the message contents has changed. Most plugins are loaded
> > on demand rather than all being loaded when a program (e.g. srun)
> > starts. My best guess is that the srun command has some version 2.3
> > plugins loaded and some version 2.4 plugins were loaded after the
> > upgrade resulting in an inconsistent set of software.
> >
> > You definitely don't want to keep using a version 2.3 srun with
> > version 2.4 daemons. The other commands (sinfo, sbatch, squeue, etc.)
> > should all work with new daemons though.
> >
> > Quoting [email protected]:
> >
> >> Danny,
> >> We are having some trouble with the transition from v2.3.5 to v2.4.1.  I
> >> tried to keep the test and logs as simple as possible.  I have a single
> >> node, start one job, and have a second job queued awaiting resources.
> >> When I terminate v2.3.5 and start v2.4.1, the running job terminates
> >> correctly, but the queued job does not start, and the following error
> >> is printed to the console.
> >> The logs are attached as well.
> >> Thanks for any help,
> >> Nancy
> >>
> >>  [sulu] (slurm) slurm>srun: error: Invalid Protocol Version 6144 from uid=200 at 141.112.17.124:39306
> >> srun: error: slurm_receive_msg: Protocol version has changed, re-link your code
> >> srun: error: _accept_msg_connection[sulu.gpv.az05.bull.com]: Protocol version has changed, re-link your code
> >> srun: error: Malformed RPC of type 4017 received
> >> srun: error: slurm_receive_msg: Header lengths are longer than data received
> >> srun: error: Invalid Protocol Version 6144 from uid=200 at 141.112.17.124:53548
> >> srun: error: slurm_receive_msg: Protocol version has changed, re-link your code
> >> srun: error: slurm_receive_msg[141.112.17.124]: Protocol version has changed, re-link your code
> >> srun: error: Unable to allocate resources: Header lengths are longer than data received
> >>


-- 
Carles Fenoy
