Hi,
I am trying to set up a small cluster for a university research group I am a 
member of. From googling around, Slurm seems like a good option, but I have no 
experience with it.
In my environment memory (RAM) is the most valuable resource. Users run 
interactive jobs (in R) which may take days, and they find it hard to estimate 
peak memory usage in advance. These jobs also do actual computation for only a 
small fraction of the time (i.e. they keep large ~50GB data structures in 
memory while the user plays with the data and interprets the results).
Therefore I am aiming to set up an environment where the most offending 
jobs (or groups of jobs, grouped by user) can be preempted before the node 
becomes unresponsive due to running out of memory. Of course the usual cluster 
tasks are also required: queueing batch jobs, allocating cores for 
multi-threaded jobs, and postponing jobs whose known memory requirement 
exceeds what is available.

My question is whether Slurm is the right choice (and if not, then which 
software is?)

From what I learnt from the documentation, Slurm can preempt jobs by killing 
or suspending them, but it is not clear to me on what conditions this happens 
other than priority in gang scheduling, i.e. whether it monitors the actual 
memory usage and can trigger actions based on that. Moreover, I don't 
understand how other system processes are accounted for in Slurm (I use 
jobacct_gather/linux). In particular, some nodes need to run MySQL servers, 
which are not expected to change memory usage randomly, but small fluctuations 
are possible. Should I be cheating about how much memory a node has in the 
config (RealMemory in the node configuration)? Or should I instead run MySQL 
inside Slurm instead of as a system service?
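
To make the RealMemory question concrete, this is the kind of node definition 
I mean. The node name, core count and sizes are made up for illustration: a 
node with 64 GB of physical RAM where I would keep roughly 4 GB out of Slurm's 
view for MySQL and the OS.
# hypothetical node, values for illustration only
# RealMemory is in megabytes, deliberately below the physical 64 GB
NodeName=node01 CPUs=16 RealMemory=60000 State=UNKNOWN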

From the documentation, these options seem most reasonable to me:
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
PreemptType=preempt/qos        # ???
PreemptMode=CANCEL
and I need to be running slurmdbd, not use gang scheduling, and set Shared=NO. 
Is that right?

BTW, http://slurm.schedmd.com/cons_res.html discusses a --job-mem option for 
srun, which does not exist.

Thanks,
Piotr
