While this isn't a SLURM issue, it's something we all face.  Since my
system's users are primarily students, it's something I face a lot.

I second the use of ulimits, although these can kill off long-running
file transfers.  What you can do to help users is set a low soft limit
and a somewhat larger hard limit, then encourage users who want to do
a file transfer to raise their own limit (they won't be able to go
over the hard limit).
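
For example (the values here are just illustrative, not what I run in
production), a limits.conf pair like this gives everyone a low soft
CPU-time cap that they can raise themselves, up to the hard cap:

```shell
# Illustrative /etc/security/limits.conf entries (the cpu item is in minutes):
#   *    soft    cpu    30
#   *    hard    cpu    240

# A user about to start a long transfer raises their own soft limit.
# Note bash's ulimit -t works in seconds, and cannot exceed the hard cap:
ulimit -S -t $((240 * 60))
echo "soft CPU-time limit is now $(ulimit -S -t) seconds"
```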

A method that I am testing is to run each login node as a KVM virtual
machine and limit the amount of CPU that the virtual machine can use.
Each login VM is identical apart from its MAC and IP address, and
iptables on the VM host pushes incoming connections out to the VM that
responds first.  The idea is that a loaded-down VM will be slower to
respond, so a user tends to land on a login node that doesn't have
many users on it.
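
As a rough sketch of the VM-host side (the addresses and interface are
hypothetical, and this does plain round-robin rather than the
respond-first behaviour I described), iptables' statistic module can
spread incoming SSH connections across the login VMs:

```shell
# Hypothetical VM-host NAT rules (root required): alternate new SSH
# connections between two login VMs at 10.0.0.11 and 10.0.0.12.
# The nat PREROUTING chain only sees the first packet of a connection,
# so each established session sticks to the VM it was sent to.
iptables -t nat -A PREROUTING -p tcp --dport 22 \
    -m statistic --mode nth --every 2 --packet 0 \
    -j DNAT --to-destination 10.0.0.11
iptables -t nat -A PREROUTING -p tcp --dport 22 \
    -j DNAT --to-destination 10.0.0.12
```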

I'm sure someone has already blazed this trail before, but this is how
I am going about it.


-- 
Nicholas McCollum
HPC Systems Administrator
Alabama Supercomputer Authority

On Thu, 2017-02-09 at 07:32 -0800, Ryan Cox wrote:
> John,
> 
> We use /etc/security/limits.conf to set cputime limits on processes:
> * hard cpu 60
> root hard cpu unlimited
> 
> It works pretty well but long-running file transfers can get killed.
> We have a script that looks for whitelisted programs and removes the
> limit from them on a periodic basis.  We haven't experienced problems
> with this approach (none that anyone has reported to us, at least).
> Threaded programs get killed more quickly than multi-process programs
> since the limit is per process.
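
The per-process behaviour is easy to demonstrate with a subshell; a
minimal sketch:

```shell
# RLIMIT_CPU is enforced per process: this subshell gets 1 second of
# CPU time and is killed with SIGXCPU once it burns through it, while
# the parent shell's own limit is untouched.
( ulimit -t 1; while :; do :; done )
echo "busy loop terminated with status $?"
```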
> 
> Additionally, we use cgroups for limits in a similar way to Sean,
> but with an older approach than pam_cgroup.  We also use the cpu
> cgroup rather than cpuset because it doesn't limit users to
> particular CPUs and doesn't limit them when no one else is running
> (it's shares-based).  We also have an OOM notifier daemon that
> writes to a user's tty so they know if they ran out of memory;
> "Killed" isn't usually an error message that they understand.
> 
> We have this in a github repo: https://github.com/BYUHPC/uft.
> Directories that may be useful include cputime_controls,
> oom_notifierd, and loginlimits (which lets users see their cgroup
> limits with some explanations).
> 
> Ryan
> 
> On 02/09/2017 07:18 AM, Sean McGrath wrote:
> > Hi,
> > 
> > We use cgroups to limit usage to 3 cores and 4G of memory on the
> > head nodes. I didn't set it up myself, but I will copy and paste
> > our documentation below.
> > 
> > Those limits (3 cores and 4G) are global to all non-root users, I
> > think, as they apply to a group. We obviously don't do this on the
> > compute nodes.
> > 
> > We also monitor system utilisation with nagios and will intervene
> > if needed.
> > Before we had cgroups in place I very occasionally had to do a
> > pkill -u baduser
> > and lock them out temporarily until the situation was explained to
> > them.
> > 
> > Any questions please let me know.
> > 
> > Sean
> > 
> > 
> > 
> > ===== How to configure Cgroups locally on a system =====
> > 
> > This is a step-by-step guide to configuring Cgroups locally on a
> > system.
> > 
> > ==== 1. Install the libraries to control Cgroups and to enforce it via PAM ====
> > 
> > <code bash>$ yum install libcgroup libcgroup-pam</code>
> > 
> > ==== 2. Load the Cgroups module on PAM ====
> > 
> > <code bash>
> > $ echo "session    required    pam_cgroup.so" >> /etc/pam.d/login
> > $ echo "session    required    pam_cgroup.so" >> /etc/pam.d/password-auth-ac
> > $ echo "session    required    pam_cgroup.so" >> /etc/pam.d/system-auth-ac
> > </code>
> > 
> > ==== 3. Set the Cgroup limits and associate them with a user group ====
> > 
> > add to /etc/cgconfig.conf:
> > <code bash>
> > # cpuset.mems may be different in different architectures, e.g. in
> > Parsons there
> > # is only "0".
> > group users {
> >    memory {
> >      memory.limit_in_bytes="4G";
> >      memory.memsw.limit_in_bytes="6G";
> >    }
> >    cpuset {
> >      cpuset.mems="0-1";
> >      cpuset.cpus="0-2";
> >    }
> > }
> > </code>
> > 
> > Note that the ''memory.memsw.limit_in_bytes'' limit is
> > //inclusive// of the ''memory.limit_in_bytes'' limit. So in the
> > above example, the limit is 4GB of RAM followed by a further 2GB
> > of swap. See:
> > 
> > [[https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html#proc-cpu_and_mem]]
> > 
> > Set no limit for root and set limits for every other individual
> > user:
> > 
> > <code bash>
> > $ echo "root    *      /">>/etc/cgrules.conf
> > $ echo "*   cpuset,memory    users">>/etc/cgrules.conf
> > </code>
> > 
> > Note also that the ''users'' cgroup defined above is inclusive of
> > **all** users (the * wildcard). So it is not a 4GB RAM limit per
> > user; it is a 4GB RAM limit in total across every non-root user.
> > 
> > ==== 4. Start the daemon that manages Cgroups configuration and set it to start on boot ====
> > 
> > <code bash>
> > $ /etc/init.d/cgconfig start
> > $ chkconfig cgconfig on
> > </code>
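
Once cgconfig is running, a quick sanity check (assuming libcgroup's
cgget utility is installed) is to read the limits back from the group
defined above:

```shell
# Should echo back the memory and cpuset values from /etc/cgconfig.conf
cgget -r memory.limit_in_bytes -r cpuset.cpus users
```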
> > 
> > 
> > 
> > 
> > 
> > On Thu, Feb 09, 2017 at 05:12:12AM -0800, John Hearns wrote:
> > 
> > > Does anyone have a good suggestion for this problem?
> > > 
> > > On a cluster I am implementing I noticed a user is running a code
> > > on 16 cores, on one of the login nodes, outside the batch system.
> > > What are the accepted techniques to combat this? Other than
> > > applying a LART, if you all know what this means.
> > > 
> > > On one system I set up a year or so ago I was asked to implement
> > > a shell timeout, so if the user was idle for 30 minutes they
> > > would be logged out.
> > > This actually is quite easy to set up as I recall.
> > > I guess in this case as the user is connected to a running
> > > process then they are not 'idle'.
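
For reference, the idle-timeout John mentions is usually just the
shell's TMOUT variable; a minimal sketch (path and value are
illustrative):

```shell
# Drop this in e.g. /etc/profile.d/tmout.sh: bash logs out an
# interactive shell after TMOUT seconds of inactivity at the prompt.
# readonly stops users from simply unsetting it.
readonly TMOUT=1800
export TMOUT
```

As John notes, a shell attached to a running foreground process is not
idle at the prompt, so this doesn't catch his 16-core offender.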
> > > 
> 
> 
