While this isn't a SLURM issue, it's something we all face. Since my system serves primarily students, it's something I face a lot.
I second the use of ulimits, although this can kill off long-running file transfers. What you can do to help users is set a low soft limit and a somewhat larger hard limit, then encourage users who want to do a file transfer to raise their limit (they won't be able to go over the hard limit).

A method I am testing is to run each login node as a KVM virtual machine and limit the amount of CPU the virtual machine can use. Each login VM is identical apart from its MAC and IP address, and iptables on the VM host pushes connections out to the VM that responds first. The idea is that a loaded-down VM would be slower to respond, so a user lands on a login node that doesn't have any users on it. I'm sure someone has already blazed this trail before, but this is how I am going about it.

--
Nicholas McCollum
HPC Systems Administrator
Alabama Supercomputer Authority

On Thu, 2017-02-09 at 07:32 -0800, Ryan Cox wrote:
> John,
>
> We use /etc/security/limits.conf to set cputime limits on processes:
>
>     *    hard  cpu  60
>     root hard  cpu  unlimited
>
> It works pretty well, but long-running file transfers can get killed.
> We have a script that periodically removes the limit from whitelisted
> programs. We haven't experienced problems with this approach (that
> anyone has reported to us, at least). Threaded programs get killed
> more quickly than multi-process programs, since the limit is per
> process.
>
> Additionally, we use cgroups for limits in a similar way to Sean, but
> with an older approach than pam_cgroup. We also use the cpu cgroup
> rather than cpuset, because it doesn't pin users to particular CPUs
> and doesn't limit them when no one else is running (it's
> shares-based). We also have an OOM notifier daemon that writes to a
> user's tty so they know when they run out of memory. "Killed" isn't
> usually an error message that they understand.
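The soft/hard split described above can be sketched as limits.conf entries plus the ulimit call a user would run before a long transfer. The values below are purely illustrative, not any site's actual policy:

```shell
# Illustrative /etc/security/limits.conf entries (the "cpu" item is in
# CPU-minutes; these numbers are examples only):
#
#   *     soft  cpu  30         # processes killed after 30 CPU-minutes
#   *     hard  cpu  240        # users may raise themselves up to 4 hours
#   root  hard  cpu  unlimited
#
# Before a long file transfer, a user raises their own soft limit up to
# the hard limit. Note that ulimit -t takes CPU-seconds:
ulimit -S -t $((240 * 60))   # raise soft limit to the 4-hour hard cap
ulimit -S -t                 # show the new soft CPU-time limit
```

Because a process can never exceed the hard limit, this lets users opt in to longer CPU times without being able to run unbounded.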
>
> We have this in a github repo: https://github.com/BYUHPC/uft.
> Directories that may be useful include cputime_controls,
> oom_notifierd, and loginlimits (which lets users see their cgroup
> limits with some explanations).
>
> Ryan
>
> On 02/09/2017 07:18 AM, Sean McGrath wrote:
> > Hi,
> >
> > We use cgroups to limit usage to 3 cores and 4G of memory on the
> > head nodes. I didn't set it up myself, but I will copy and paste
> > our documentation below.
> >
> > Those limits, 3 cores and 4G, are global to all non-root users, I
> > think, as they apply to a group. We obviously don't do this on the
> > compute nodes.
> >
> > We also monitor system utilisation with nagios and will intervene
> > if needed. Before we had cgroups in place I very occasionally had
> > to do a pkill -u baduser and lock them out temporarily until the
> > situation was explained to them.
> >
> > Any questions, please let me know.
> >
> > Sean
> >
> >
> > ===== How to configure Cgroups locally on a system =====
> >
> > This is a step-by-step guide to configuring Cgroups locally on a
> > system.
> >
> > ==== 1. Install the libraries to control Cgroups and to enforce
> > them via PAM ====
> >
> > <code bash>$ yum install libcgroup libcgroup-pam</code>
> >
> > ==== 2. Load the Cgroups module in PAM ====
> >
> > <code bash>
> > $ echo "session required pam_cgroup.so" >> /etc/pam.d/login
> > $ echo "session required pam_cgroup.so" >> /etc/pam.d/password-auth-ac
> > $ echo "session required pam_cgroup.so" >> /etc/pam.d/system-auth-ac
> > </code>
> >
> > ==== 3. Set the Cgroup limits and associate them with a user group ====
> >
> > Add to /etc/cgconfig.conf:
> > <code bash>
> > # cpuset.mems may be different on different architectures, e.g. on
> > # Parsons there is only "0".
> > group users {
> >     memory {
> >         memory.limit_in_bytes = "4G";
> >         memory.memsw.limit_in_bytes = "6G";
> >     }
> >     cpuset {
> >         cpuset.mems = "0-1";
> >         cpuset.cpus = "0-2";
> >     }
> > }
> > </code>
> >
> > Note that the ''memory.memsw.limit_in_bytes'' limit is
> > //inclusive// of the ''memory.limit_in_bytes'' limit. So in the
> > above example, the limit is 4 GB of RAM followed by a further 2 GB
> > of swap. See:
> >
> > [[https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html#proc-cpu_and_mem]]
> >
> > Set no limit for root and set limits for every other individual
> > user:
> >
> > <code bash>
> > $ echo "root * /" >> /etc/cgrules.conf
> > $ echo "* cpuset,memory users" >> /etc/cgrules.conf
> > </code>
> >
> > Note also that the ''users'' cgroup defined above is inclusive of
> > **all** users (the * wildcard). So it is not a 4GB RAM limit per
> > user; it is a 4GB RAM limit in total for every non-root user.
> >
> > ==== 4. Start the daemon that manages Cgroups configuration and
> > set it to start on boot ====
> >
> > <code bash>
> > $ /etc/init.d/cgconfig start
> > $ chkconfig cgconfig on
> > </code>
> >
> >
> > On Thu, Feb 09, 2017 at 05:12:12AM -0800, John Hearns wrote:
> > > Does anyone have a good suggestion for this problem?
> > >
> > > On a cluster I am implementing I noticed a user running a code on
> > > 16 cores on one of the login nodes, outside the batch system.
> > > What are the accepted techniques to combat this? Other than
> > > applying a LART, if you all know what that means.
> > >
> > > On one system I set up a year or so ago I was asked to implement
> > > a shell timeout, so that if a user was idle for 30 minutes they
> > > would be logged out. This is actually quite easy to set up, as I
> > > recall. I guess in this case, as the user is connected to a
> > > running process, they are not 'idle'.
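The idle-logout John describes is typically done with the shell's TMOUT variable. A minimal sketch, assuming the snippet is installed as a file under /etc/profile.d/ (the path and filename are an assumption; any file sourced by login shells works):

```shell
# Sketch of /etc/profile.d/idle-timeout.sh (path assumed). bash and ksh
# terminate an interactive shell after TMOUT seconds with no input at
# the prompt; readonly prevents users from simply unsetting it.
TMOUT=1800        # 30 minutes, matching the timeout John mentions
readonly TMOUT
export TMOUT
```

As John notes, this only catches idle shells: a session attached to a running foreground process never hits the timeout, which is where the cputime and cgroup limits discussed above come in.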
