Hi, I am trying to set up a small cluster for a university research group I am a member of. From googling around, Slurm seems like a good option, but I have no experience with it. In my environment memory (RAM) is the most valuable resource. Users run interactive jobs (in R) which may take days, and they find it hard to estimate peak memory usage in advance. These jobs also do actual computation only a small fraction of the time (i.e. they keep large ~50GB data structures in memory while the user plays with the data and interprets the results). Therefore I am aiming for an environment where the most offending jobs (or groups of jobs, grouped by user) can be preempted before the node becomes unresponsive because it runs out of memory. Of course the usual cluster tasks are also required: queueing batch jobs, allocating cores for multi-threaded jobs, and postponing jobs whose memory requirement is known and exceeds what is available.
My question is whether Slurm is the right choice (and if not, which software is?). From what I learnt from the documentation, Slurm can preempt jobs by killing or suspending them, but it is not clear to me what conditions other than priority (as in gang scheduling) can trigger this, i.e. whether it monitors actual memory usage and can trigger some action based on that. Moreover, I don't understand how other system processes are accounted for in Slurm (I use jobacct_gather/linux). In particular, some nodes need to run MySQL servers, which are not expected to change memory usage randomly, but small fluctuations are possible. Should I understate how much memory a node has in the config (RealMemory in the node configuration)? Or should I instead run MySQL inside Slurm rather than as a system service?
From the documentation these options seem most reasonable to me:
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory
    PreemptType=preempt/qos # ???
    PreemptMode=CANCEL
and I need to be running SlurmDBD, not use gang scheduling, and set Shared=NO. Is that right?
BTW, http://slurm.schedmd.com/cons_res.html discusses a --job-mem option for srun, which does not exist.
Thanks,
Piotr
