[slurm-dev] Re: Change in meaning of --nodelist

2017-07-26 Thread Lipari, Don
Loris,

Taking Slurm v2.3 as a data point, the sbatch behavior seems consistent with 
what you report for v16.05.

I deliberately attempted to request fewer nodes than I specified with the 
--nodelist option (attempting the potential behavior you describe below) and 
sbatch complained:

$ sbatch -N2 --wrap hostname --nodelist=cab[2-5] -p pdebug
sbatch: error: Batch job submission failed: Node count specification invalid

I had to specify a node count that matched the nodelist-specified node count,
or omit the -N spec entirely, for it to succeed:

$ sbatch --wrap hostname --nodelist=cab[2-5] -p pdebug
Submitted batch job 2857543

$ sbatch -N4 --wrap hostname --nodelist=cab[10-13] -p pdebug
Submitted batch job 2857564

$ sacct -X -j 2857543,2857564 -o jobid,nnodes,nodelist
       JobID   NNodes        NodeList
------------ -------- ---------------
2857543             4        cab[2-5]
2857564             4      cab[10-13]
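The check sbatch applies here can be modeled in a few lines: expand the
--nodelist expression and, if -N was given, require the counts to match.
This is a minimal sketch, not Slurm's actual implementation; the
expand_hostlist helper handles only a single bracketed numeric range, whereas
real Slurm hostlists are far more general.

```python
import re

def expand_hostlist(expr):
    """Expand a simple Slurm hostlist like 'cab[2-5]' into node names.

    Only a plain name or one bracketed numeric range is handled here.
    """
    m = re.fullmatch(r"([^\[]+)\[(\d+)-(\d+)\]", expr)
    if not m:
        return [expr]
    prefix, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
    return [f"{prefix}{i}" for i in range(lo, hi + 1)]

def node_count_ok(n_spec, nodelist):
    """Mimic the behavior described above: -N, if given, must equal the
    number of nodes named in --nodelist; omitting -N always passes."""
    return n_spec is None or n_spec == len(expand_hostlist(nodelist))

# -N2 with a four-node list is rejected, matching the sbatch error above:
print(node_count_ok(2, "cab[2-5]"))     # False
print(node_count_ok(4, "cab[2-5]"))     # True
print(node_count_ok(None, "cab[2-5]"))  # True
```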

Don

On 7/26/17, 1:59 AM, "Loris Bennett"  wrote:


Hi,

With Version 16.05.10-2

  --nodelist

seems to be a list of nodes which are *all* assigned to the job.  In
previous versions I seem to remember it being complementary to

  --exclude

i.e. a list of *potential* nodes for the job.

Is this correct, and if so, are there any plans to reintroduce the old
functionality, say, with an option '--include'?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de





[slurm-dev] updating ExcNodeList attribute of running job

2017-07-26 Thread Jan Schulze

Dear all,
is there a way for a running sbatch job to modify its own ExcNodeList 
before it is requeued? (Slurm 17.02.4)


What I am trying to do is the following: a batch script is submitted and 
running; now, under certain conditions, it is supposed to requeue itself 
to a _different_ node than it has been running on before the requeue. 
(Only a single node is required.)


Up to now, I have learned that there is the command 'scontrol update 
JobId=<jobid> ExcNodeList=<nodelist>', but this seems to work only for 
pending jobs?


Thanks.

Jan


[slurm-dev] Re: Slurm with High Availabilty/Automatic failover

2017-07-26 Thread Benjamin Redling
Hello,

Am 25.07.2017 um 16:19 schrieb J. Smith:
> Does anyone have any suggestions for setting up high availability and
> automatic failover between two servers that run a Controller daemon,
> Database daemon and MySQL database (i.e. replication vs. Galera cluster)?
> 
> Any input would be appreciated.

We use Ganeti instances for most services -- in our case KVM (configurable
on a per-cluster basis) plus DRBD for instance storage.
On Debian they are rock solid.
While HA is experimentally possible, the default is intentionally to go
without automatic fail-over:
http://docs.ganeti.org/ganeti/2.15/html/design-linuxha.html#risks

From my point of view, a failing Slurm controller is such a rare event
that I prefer to have a look first and only then do a manually
triggered fast fail-over.
On the other hand, the (unwritten) expected SLA for most services here
is 90% per week and month, 95% per year
-- sure, relaxed; not knowing your needs, that might just look like an
HPC kindergarten from your perspective.


Regards,
Benjamin
-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
☎ +49 3641 9 44323


