Hello,

We currently use rather old versions of SLURM at NSC: some 2.4, some 2.3, and we even have one cluster still on 2.2. I am currently preparing to upgrade at least some systems to 2.6.

While testing 2.4 to 2.6 upgrades in our lab I have learned that there have been rather big changes in how well SLURM commands and daemons of different versions can communicate with each other.

In the past slurmdbd could communicate with older clusters, and slurmctld/slurmd could communicate just enough with old slurmstepds so that running jobs could finish, but most other things did not work and returned SLURM_PROTOCOL_VERSION_ERROR. If you upgraded your slurmctld you had SlurmdTimeout seconds to upgrade all your slurmds or they would time out, so I would typically increase SlurmdTimeout before an upgrade (a sketch of that follows below). Another approach was to stop all slurmctlds and slurmds and not start any of them until after all packages had been upgraded.
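For reference, that preparation step looked roughly like this (only a sketch: the path to slurm.conf and the timeout value are just examples, and slurm.conf has to be kept in sync on all nodes):

  # Raise SlurmdTimeout before upgrading slurmctld so that slurmds which
  # have not been upgraded yet are not marked down during the upgrade
  # window, then tell slurmctld to re-read the configuration.
  $ grep SlurmdTimeout /etc/slurm/slurm.conf
  SlurmdTimeout=7200      # temporarily raised for the upgrade
  $ scontrol reconfigure
  # Once all slurmds run the new version, restore the normal value and
  # run "scontrol reconfigure" again.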
Now all this has changed. Slurmctld can communicate with older slurmds, and many commands (squeue/sinfo) also seem to work against a newer slurmctld. There was some discussion here:

https://groups.google.com/d/msg/slurm-devel/R0ptc32Pre8/f9DiR7H-q0gJ

Unfortunately I have found some things that do not work very well: one job-killing bug, and some usability issues.


Lack of information
-------------------

[paran@krut ~]$ scontrol version
slurm 2.6.2
[root@krut ~]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
kryo*        up 7-00:00:00      5   idle n[1-5]
[root@krut ~]#

Everything looks fine here. However, only 4 out of the 5 compute nodes actually run slurm 2.6.2. n1 is still running 2.5.7, but SLURM does not give any information about this.

During an upgrade this is likely not a very big issue, since you probably know what you are doing. But what if this happens at some other time, for example if a node is accidentally rebooted with an old OS image containing an old SLURM version? Having versions mixed like this can lead to all kinds of breakage, and I think SLURM should make some noise about this situation. Maybe nodes with an old version could be set as drained? Slurmctld can't start jobs on old nodes anyway. Or maybe it is necessary to add a new node state to handle this?

Today I think the only way to find out that something is wrong is to look in slurmctld.log:

[2013-10-24T17:20:59.912] debug: validate_node_specs: node n1 registered with 0 jobs
[2013-10-24T17:20:59.912] debug: validate_node_specs: node n1 registered with 0 jobs
[2013-10-24T17:21:00.215] agent/is_node_resp: node:n1 rpc:1001 : Protocol version has changed, re-link your code
[2013-10-24T17:21:00.215] agent/is_node_resp: node:n1 rpc:1001 : Protocol version has changed, re-link your code
Jobs fail
---------

SLURM will attempt to schedule jobs on nodes running old versions. This makes jobs fail and get stuck in state completing. Same example setup as before: n1 is running 2.5.7 and everything else runs 2.6.2.

Empty queue and idle nodes:

[paran@krut ~]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[paran@krut ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
kryo*        up 7-00:00:00      5   idle n[1-5]

Submit a job to n1:

[paran@krut ~]$ sbatch -n1 -wn1 -t 10:00 sleep-job 10m
Submitted batch job 293

The job fails and gets stuck in state completing:

[paran@krut ~]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    293      kryo sleep-jo    paran CG       0:00      1 n1
[paran@krut ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
kryo*        up 7-00:00:00      1   comp n1
kryo*        up 7-00:00:00      4   idle n[2-5]

The only way I have found to get rid of the completing job is to manually set n1 to state down (see the sketch at the end of this section):

[root@krut ~]# scontrol update nodename=n1 state=down
[root@krut ~]# squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

slurmctld.log:

[2013-10-24T17:27:10.293] Weighted Age priority is 0.000000 * 1000 = 0.00
[2013-10-24T17:27:10.293] Weighted Fairshare priority is 0.000000 * 1000000 = 0.00
[2013-10-24T17:27:10.293] Weighted JobSize priority is 0.000000 * 0 = 0.00
[2013-10-24T17:27:10.293] Weighted Partition priority is 0.000000 * 0 = 0.00
[2013-10-24T17:27:10.293] Weighted QOS priority is 0.000000 * 1000000000 = 0.00
[2013-10-24T17:27:10.293] Job 293 priority: 0.00 + 0.00 + 0.00 + 0.00 + 0.00 - 0 = 1.00
[2013-10-24T17:27:10.293] _slurm_rpc_submit_batch_job JobId=293 usec=769
[2013-10-24T17:27:10.294] debug: sched: Running job scheduler
[2013-10-24T17:27:10.294] sched: Allocate JobId=293 NodeList=n1 #CPUs=1
[2013-10-24T17:27:10.311] Killing non-startable batch job 293: Protocol version has changed, re-link your code
[2013-10-24T17:27:10.323] completing job 293
[2013-10-24T17:27:10.324] priority_p_job_end: called for job 293
[2013-10-24T17:27:10.324] job 293 ran for 0 seconds on 1 cpus
[2013-10-24T17:27:10.324] QOS normal has grp_used_cpu_run_secs of 600, will subtract 600
[2013-10-24T17:27:10.324] assoc 19 (user='paran' acct='nsc') has grp_used_cpu_run_secs of 600, will subtract 600
[2013-10-24T17:27:10.324] adding 0.000000 new usage to assoc 19 (user='paran' acct='nsc') raw usage is now 42074.092832. Group wall added 0.000000 making it 6522.774147. GrpCPURunMins is 0
[2013-10-24T17:27:10.324] assoc 3 (user='(null)' acct='nsc') has grp_used_cpu_run_secs of 600, will subtract 600
[2013-10-24T17:27:10.324] adding 0.000000 new usage to assoc 3 (user='(null)' acct='nsc') raw usage is now 63024.155596. Group wall added 0.000000 making it 8698.365278. GrpCPURunMins is 0
[2013-10-24T17:27:10.324] assoc 1 (user='(null)' acct='root') has grp_used_cpu_run_secs of 600, will subtract 600
[2013-10-24T17:27:10.324] adding 0.000000 new usage to assoc 1 (user='(null)' acct='root') raw usage is now 63030.154468. Group wall added 0.000000 making it 8704.364151. GrpCPURunMins is 0
[2013-10-24T17:27:10.324] sched: job_complete for JobId=293 successful, exit code=256
[2013-10-24T17:27:10.335] agent/is_node_resp: node:n1 rpc:6011 : Protocol version has changed, re-link your code

slurmd.log:

[2013-10-24T17:27:10+02:00] error: Invalid Protocol Version 6656 from uid=400 at 10.32.254.1:47887
[2013-10-24T17:27:10+02:00] error: slurm_receive_msg_and_forward: Protocol version has changed, re-link your code
[2013-10-24T17:27:10+02:00] error: service_connection: slurm_receive_msg: Protocol version has changed, re-link your code
[2013-10-24T17:27:10+02:00] error: Invalid Protocol Version 6656 from uid=400 at 10.32.254.1:47888
[2013-10-24T17:27:10+02:00] error: slurm_receive_msg_and_forward: Protocol version has changed, re-link your code
[2013-10-24T17:27:10+02:00] error: service_connection: slurm_receive_msg: Protocol version has changed, re-link your code
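In case it is useful to anyone else, the complete workaround then looks roughly like this (a sketch only: the reason string is arbitrary, and state=resume assumes the node is otherwise healthy):

  # Set the old-version node down so slurmctld gives up on the completing
  # job, upgrade slurmd on it, then return the node to service.
  scontrol update nodename=n1 state=down reason="slurmd version mismatch"
  # ... upgrade and restart slurmd on n1 ...
  scontrol update nodename=n1 state=resume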
Bad error messages
------------------

Running, for example, sinfo 2.5.7 against slurmctld 2.6.2 works:

[paran@n1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
kryo*        up 7-00:00:00      5   idle n[1-5]

But trying to submit a job using sbatch/srun/salloc 2.5.7 to a slurmctld running 2.6.2 gives unexpected error messages:

[paran@n1 ~]$ sbatch -n1 -t5 sleep-job
sbatch: error: Batch job submission failed: Invalid accounting frequency requested

slurmctld.log:

[2013-10-24T17:47:52.889] error: Invalid accounting frequency (65534 > 30)
[2013-10-24T17:47:52.889] _slurm_rpc_submit_batch_job: Invalid accounting frequency requested

I had expected SLURM_PROTOCOL_VERSION_ERROR on both the client and in the slurmctld log instead.

Kind regards,
Pär Lindfors, NSC
