Hello,

We currently use rather old versions of SLURM at NSC: some clusters run
2.4, some 2.3, and we even have one cluster still on 2.2. I am currently
preparing to upgrade at least some systems to 2.6.

While testing 2.4 to 2.6 upgrades in our lab I have learned that there
have been rather big changes in how well slurm commands and daemons of
different versions can communicate with each other.

In the past slurmdbd could communicate with older clusters, and
slurmctld/slurmd could communicate just enough with old stepds that
running jobs could finish. Most other things did not work and returned
SLURM_PROTOCOL_VERSION_ERROR.

If you upgraded your slurmctld, you had SlurmdTimeout seconds to upgrade
all your slurmds before they would time out. I would typically increase
SlurmdTimeout before an upgrade. Another approach was to stop all
slurmctlds and slurmds and not start any of them until all packages had
been upgraded.
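
For reference, raising the timeout is just a matter of temporarily
bumping SlurmdTimeout in slurm.conf and telling slurmctld to re-read its
configuration, roughly like this (the value is only an example):

    # In slurm.conf, temporarily set e.g.:  SlurmdTimeout=3600
    [root@krut ~]# scontrol reconfigure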

Now all this has changed. Slurmctld can communicate with older slurmds,
and many commands (squeue/sinfo) also seem to work against a newer
slurmctld. There was some discussion here:
https://groups.google.com/d/msg/slurm-devel/R0ptc32Pre8/f9DiR7H-q0gJ

Unfortunately I have found some things that do not work very well: one
job-killing bug and some usability issues.


Lack of information
-------------------

    [paran@krut ~]$ scontrol version
    slurm 2.6.2
    [root@krut ~]# sinfo
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    kryo*        up 7-00:00:00      5   idle n[1-5]
    [root@krut ~]# 

Everything looks fine here. However, only 4 out of the 5 compute nodes
actually run slurm 2.6.2; n1 is still running 2.5.7, but slurm does not
give any information about this.
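
Of course one can check by hand which version each node actually has
installed, e.g. something like this (a rough sketch, assuming the nodes
are reachable over ssh; it checks the installed tools, not the running
slurmd), but nothing in the sinfo/squeue output hints that anything is
wrong:

    [root@krut ~]# for n in n1 n2 n3 n4 n5; do echo -n "$n: "; ssh $n scontrol version; done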

During an upgrade this is likely not a very big issue, as you then
probably know what you are doing. But what if this happened at some
other time, for example if a node were accidentally rebooted into an old
OS image containing an old slurm version? Having versions mixed like
this can lead to all kinds of breakage, and I think Slurm should make
some noise about this situation.

Maybe nodes with an old version could get set as drained? Slurmctld
can't start jobs on old nodes anyway. Or maybe it is necessary to add a
new node state to handle this?
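
Until something like that exists, the workaround I can think of is to
drain such nodes by hand once they are spotted, e.g.:

    [root@krut ~]# scontrol update nodename=n1 state=drain reason="slurmd too old"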

Today I think the only way to find out that something is wrong is to
look in slurmctld.log:

    [2013-10-24T17:20:59.912] debug:  validate_node_specs: node n1 registered with 0 jobs
    [2013-10-24T17:20:59.912] debug:  validate_node_specs: node n1 registered with 0 jobs
    [2013-10-24T17:21:00.215] agent/is_node_resp: node:n1 rpc:1001 : Protocol version has changed, re-link your code
    [2013-10-24T17:21:00.215] agent/is_node_resp: node:n1 rpc:1001 : Protocol version has changed, re-link your code
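
So for now I simply grep the log after an upgrade, e.g. (the log path is
whatever SlurmctldLogFile points to; /var/log/slurm/slurmctld.log here
is just an example):

    [root@krut ~]# grep -i 'protocol version' /var/log/slurm/slurmctld.log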


Jobs fail
---------

Slurm will attempt to schedule jobs on nodes running old versions. This
makes jobs fail and get stuck in state completing.

Same example setup as before: n1 is running 2.5.7 and everything else
runs 2.6.2.

Empty queue and idle nodes:
    [paran@krut ~]$ squeue
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    [paran@krut ~]$ sinfo
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    kryo*        up 7-00:00:00      5   idle n[1-5]

Submit a job to n1:
    [paran@krut ~]$ sbatch -n1 -wn1 -t 10:00 sleep-job 10m
    Submitted batch job 293

Job fails, and gets stuck in state completing:
    [paran@krut ~]$ squeue
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                   293      kryo sleep-jo    paran CG       0:00      1 n1
    [paran@krut ~]$ sinfo
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    kryo*        up 7-00:00:00      1   comp n1
    kryo*        up 7-00:00:00      4   idle n[2-5]

The only way I have found to get rid of the completing job is to
manually set n1 to state down:
    [root@krut ~]# scontrol update nodename=n1 state=down
    [root@krut ~]# squeue
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

slurmctld.log:
    [2013-10-24T17:27:10.293] Weighted Age priority is 0.000000 * 1000 = 0.00
    [2013-10-24T17:27:10.293] Weighted Fairshare priority is 0.000000 * 1000000 = 0.00
    [2013-10-24T17:27:10.293] Weighted JobSize priority is 0.000000 * 0 = 0.00
    [2013-10-24T17:27:10.293] Weighted Partition priority is 0.000000 * 0 = 0.00
    [2013-10-24T17:27:10.293] Weighted QOS priority is 0.000000 * 1000000000 = 0.00
    [2013-10-24T17:27:10.293] Job 293 priority: 0.00 + 0.00 + 0.00 + 0.00 + 0.00 - 0 = 1.00
    [2013-10-24T17:27:10.293] _slurm_rpc_submit_batch_job JobId=293 usec=769
    [2013-10-24T17:27:10.294] debug:  sched: Running job scheduler
    [2013-10-24T17:27:10.294] sched: Allocate JobId=293 NodeList=n1 #CPUs=1
    [2013-10-24T17:27:10.311] Killing non-startable batch job 293: Protocol version has changed, re-link your code
    [2013-10-24T17:27:10.323] completing job 293
    [2013-10-24T17:27:10.324] priority_p_job_end: called for job 293
    [2013-10-24T17:27:10.324] job 293 ran for 0 seconds on 1 cpus
    [2013-10-24T17:27:10.324] QOS normal has grp_used_cpu_run_secs of 600, will subtract 600
    [2013-10-24T17:27:10.324] assoc 19 (user='paran' acct='nsc') has grp_used_cpu_run_secs of 600, will subtract 600
    [2013-10-24T17:27:10.324] adding 0.000000 new usage to assoc 19 (user='paran' acct='nsc') raw usage is now 42074.092832.  Group wall added 0.000000 making it 6522.774147. GrpCPURunMins is 0
    [2013-10-24T17:27:10.324] assoc 3 (user='(null)' acct='nsc') has grp_used_cpu_run_secs of 600, will subtract 600
    [2013-10-24T17:27:10.324] adding 0.000000 new usage to assoc 3 (user='(null)' acct='nsc') raw usage is now 63024.155596.  Group wall added 0.000000 making it 8698.365278. GrpCPURunMins is 0
    [2013-10-24T17:27:10.324] assoc 1 (user='(null)' acct='root') has grp_used_cpu_run_secs of 600, will subtract 600
    [2013-10-24T17:27:10.324] adding 0.000000 new usage to assoc 1 (user='(null)' acct='root') raw usage is now 63030.154468.  Group wall added 0.000000 making it 8704.364151. GrpCPURunMins is 0
    [2013-10-24T17:27:10.324] sched: job_complete for JobId=293 successful, exit code=256
    [2013-10-24T17:27:10.335] agent/is_node_resp: node:n1 rpc:6011 : Protocol version has changed, re-link your code

slurmd.log:
    [2013-10-24T17:27:10+02:00] error: Invalid Protocol Version 6656 from uid=400 at 10.32.254.1:47887
    [2013-10-24T17:27:10+02:00] error: slurm_receive_msg_and_forward: Protocol version has changed, re-link your code
    [2013-10-24T17:27:10+02:00] error: service_connection: slurm_receive_msg: Protocol version has changed, re-link your code
    [2013-10-24T17:27:10+02:00] error: Invalid Protocol Version 6656 from uid=400 at 10.32.254.1:47888
    [2013-10-24T17:27:10+02:00] error: slurm_receive_msg_and_forward: Protocol version has changed, re-link your code
    [2013-10-24T17:27:10+02:00] error: service_connection: slurm_receive_msg: Protocol version has changed, re-link your code
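
Once n1 has been upgraded to 2.6.2 the node can be returned to service
with something like:

    [root@krut ~]# scontrol update nodename=n1 state=resume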


Bad error messages
------------------

Running, for example, sinfo 2.5.7 against slurmctld 2.6.2 works:
    [paran@n1 ~]$ sinfo
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    kryo*        up 7-00:00:00      5   idle n[1-5]

But trying to submit a job using sbatch/srun/salloc 2.5.7 to a slurmctld
running 2.6.2 gives unexpected error messages:
    [paran@n1 ~]$ sbatch -n1 -t5 sleep-job
    sbatch: error: Batch job submission failed: Invalid accounting frequency requested

slurmctld.log:
    [2013-10-24T17:47:52.889] error: Invalid accounting frequency (65534 > 30)
    [2013-10-24T17:47:52.889] _slurm_rpc_submit_batch_job: Invalid accounting frequency requested

I had expected SLURM_PROTOCOL_VERSION_ERROR on both the client and in
the slurmctld log instead.
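
Right now the only way I see to spot the mismatch is to compare versions
by hand on the submit host and on the controller:

    [paran@n1 ~]$ scontrol version
    slurm 2.5.7
    [paran@krut ~]$ scontrol version
    slurm 2.6.2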


Kind regards,
Pär Lindfors, NSC
