[slurm-dev] Re: Change in meaning of --nodelist
Loris,

Taking Slurm v2.3 as a data point, the sbatch behavior seems consistent with what you report for v16.05. I deliberately attempted to request fewer nodes than I specified with the --nodelist option (attempting the potential behavior you describe below), and sbatch complained:

$ sbatch -N2 --wrap hostname --nodelist=cab[2-5] -p pdebug
sbatch: error: Batch job submission failed: Node count specification invalid

I had to specify a node count that matched the nodelist-specified node count, or omit the -N spec, for it to succeed:

$ sbatch --wrap hostname --nodelist=cab[2-5] -p pdebug
Submitted batch job 2857543
$ sbatch -N4 --wrap hostname --nodelist=cab[10-13] -p pdebug
Submitted batch job 2857564
$ sacct -X -j 2857543,2857564 -o jobid,nnodes,nodelist
       JobID   NNodes        NodeList
------------ -------- ---------------
     2857543        4        cab[2-5]
     2857564        4      cab[10-13]

Don

On 7/26/17, 1:59 AM, "Loris Bennett" wrote:

Hi,

With version 16.05.10-2, --nodelist seems to be a list of nodes which are *all* assigned to the job. In previous versions I seem to remember it being complementary to --exclude, i.e. a list of *potential* nodes for the job. Is this correct and, if so, are there any plans to reintroduce the old functionality, say, with an option '--include'?

Cheers,

Loris

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin
Email loris.benn...@fu-berlin.de
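Since sbatch rejects a -N value smaller than the number of hosts named in --nodelist, one way to keep the two consistent is to derive the node count from the hostlist expression itself. Below is a minimal, hypothetical bash helper (`count_hosts` is not a Slurm command) that counts hosts in the single-range `prefix[lo-hi]` form used in the examples above; it does not handle the full Slurm hostlist grammar (comma lists, zero padding, multiple bracket groups).

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the hosts in a simple Slurm-style hostlist
# expression such as 'cab[2-5]', so a matching -N can be passed to sbatch.
# Only the single 'prefix[lo-hi]' form is handled; anything else is
# treated as a single host.
count_hosts() {
  local expr=$1
  local re='^[^[]+\[([0-9]+)-([0-9]+)\]$'
  if [[ $expr =~ $re ]]; then
    # Size of the numeric range, inclusive.
    echo $(( BASH_REMATCH[2] - BASH_REMATCH[1] + 1 ))
  else
    echo 1
  fi
}

n=$(count_hosts "cab[2-5]")
echo "$n"   # prints 4
# The derived count can then feed the submission, e.g.:
# sbatch -N "$n" --nodelist="cab[2-5]" --wrap hostname -p pdebug
```

For a production script, `scontrol show hostnames 'cab[2-5]' | wc -l` (which uses Slurm's own hostlist parser) is the more robust route.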
[slurm-dev] updating ExcNodeList attribute of running job
Dear all,

Is there a way for a running sbatch job to modify its own ExcNodeList before it is requeued? (Slurm 17.02.4)

What I am trying to do is the following: a batch script is submitted and running; now, under certain conditions, it is supposed to requeue itself to a _different_ node than it has been running on before the requeue (only a single node is required).

Up to now, I have learned that there is the command 'scontrol update ExcNodeList=', but this seems to work only for pending jobs?

Thanks,
Jan
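For context, a minimal sketch of the kind of batch script described above. This is an untested assumption, not a confirmed recipe: whether an ExcNodeList update issued while the job is still running actually takes effect (rather than only on pending jobs) is exactly the open question; `$SLURM_JOB_ID` and `$SLURMD_NODENAME` are the standard environment variables Slurm sets inside a batch job, and `some_condition` is a placeholder for the site-specific check.

```shell
#!/bin/bash
#SBATCH -N1

# Sketch: exclude the node we are currently on, then requeue, hoping to
# land elsewhere. 'some_condition' is a hypothetical placeholder.
if some_condition; then
  # Attempt to add the current node to the job's exclude list ...
  scontrol update JobId="$SLURM_JOB_ID" ExcNodeList="$SLURMD_NODENAME"
  # ... then put the job back in the pending queue.
  scontrol requeue "$SLURM_JOB_ID"
  exit 0
fi
```

If the update is indeed rejected for running jobs, an alternative is to reverse the order: requeue first (the job becomes pending), then issue the ExcNodeList update from outside the job, e.g. from an epilog or a watcher script.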
[slurm-dev] Re: Slurm with High Availabilty/Automatic failover
Hello,

On 25.07.2017 at 16:19, J. Smith wrote:
> Does anyone have any suggestions for setting up high availability and
> automatic failover between two servers that run a Controller daemon,
> Database daemon and MySQL database (i.e. replication vs. Galera cluster)?
>
> Any input would be appreciated.

We use Ganeti instances for most services, in our case KVM (configurable on a per-cluster basis) + DRBD (instance storage). On Debian they are rock solid. While HA is experimentally possible, the default is intentionally to go without automatic fail-over: http://docs.ganeti.org/ganeti/2.15/html/design-linuxha.html#risks

From my point of view, a failing Slurm controller is such a rare event that I prefer to have a look first, and only then be able to do a manually triggered fast fail-over. On the other hand, the (unwritten) expected SLA for most services here is 90% per week and month, 95% per year -- sure, relaxed; not knowing your needs, that might just be HPC kindergarten from your perspective.

Regards,
Benjamin

--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
☎ +49 3641 9 44323
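Besides VM-level fail-over, Slurm of this era also has a built-in backup-controller mechanism configured in slurm.conf. A minimal sketch, using the pre-18.08 parameter names (hostnames ctl1/ctl2 and the path are assumptions for illustration; newer releases replace these parameters with multiple SlurmctldHost lines):

```
# slurm.conf fragment (sketch): slurmctld fail-over via a backup host.
ControlMachine=ctl1
BackupController=ctl2
# Both controllers must see the same state directory, so this path
# needs to live on shared storage (e.g. NFS).
StateSaveLocation=/shared/slurm/state
```

The backup slurmctld takes over when the primary stops responding, which may complement (or replace) Ganeti/DRBD-level HA for the controller itself; the database daemon and MySQL still need their own replication story.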