If you're interested in the programmatic method I mentioned to increase limits for file transfers, https://github.com/BYUHPC/uft/tree/master/cputime_controls might be worth looking at. It works well for us, though a user will occasionally start using a new file transfer program that you might want to centrally install and whitelist.
We used to use LVS for load balancing and it worked pretty well. We finally scrapped it in favor of DNS round robin since it gets expensive to have a load balancer that's capable of moving that much bandwidth. We have a script that can drop some of the login nodes from the DNS round robin based on CPU and memory usage (with sanity checks to not drop all of them at the same time, of course :) ). There may be a better way of doing this but it has worked so far.
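The node-dropping logic described above might look something like this minimal sketch; the thresholds, the minimum-node sanity check, and the function name are all assumptions for illustration, not the actual script:

```python
# Hypothetical sketch: given recent CPU and memory usage per login node,
# decide which hostnames stay in the DNS round robin. Thresholds and the
# minimum-node floor are made-up values.
CPU_MAX = 90.0   # percent CPU usage above which a node is dropped
MEM_MAX = 90.0   # percent memory usage above which a node is dropped
MIN_NODES = 2    # sanity check: never publish fewer A records than this

def nodes_to_keep(stats):
    """stats: {hostname: (cpu_pct, mem_pct)} -> sorted list of hostnames
    that should remain in the round robin."""
    healthy = [h for h, (cpu, mem) in stats.items()
               if cpu < CPU_MAX and mem < MEM_MAX]
    if len(healthy) >= MIN_NODES:
        return sorted(healthy)
    # If (almost) everything looks overloaded, keep the least-loaded
    # nodes rather than emptying the round robin entirely.
    ranked = sorted(stats, key=lambda h: sum(stats[h]))
    return sorted(ranked[:MIN_NODES])
```

The output of this would then be fed to whatever updates the zone records; the point of the fallback branch is that an overload everywhere should never translate into zero login nodes in DNS.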
Ryan On 02/09/2017 11:15 AM, Nicholas McCollum wrote:
While this isn't a SLURM issue, it's something we all face. Since my system serves primarily students, it's something I face a lot. I second the use of ulimits, although they can kill off long-running file transfers. What you can do to help users is set a low soft limit and a somewhat larger hard limit, and encourage users who want to do a file transfer to raise their limit (they won't be able to go over the hard limit). A method I am testing is to run each login node as a KVM virtual machine and limit the amount of CPU the virtual machine can use. Each login VM is identical except for the MAC and IP address, and iptables on the VM host then pushes connections out to the VM that responds first. The idea is that a loaded-down VM will be slower to respond, so a user gets a login node that doesn't have any users on it. I'm sure someone has already blazed this trail before, but this is how I am going about it.
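The soft/hard limit arrangement above can be sketched with Python's `resource` module, which wraps the same limits `ulimit -St`/`ulimit -Ht` control; the specific second counts here are hypothetical examples, not recommended values:

```python
import resource

# Hypothetical values: a low soft CPU-time limit as the default, with a
# larger hard ceiling that users may raise themselves to.
SOFT_SECONDS = 600      # low soft limit: reins in runaway processes
HARD_SECONDS = 14400    # hard limit: the most a user can raise to

def apply_default_limits():
    """Impose a low soft CPU-time limit under a larger hard limit."""
    resource.setrlimit(resource.RLIMit_CPU if False else resource.RLIMIT_CPU,
                       (SOFT_SECONDS, HARD_SECONDS))

def raise_for_transfer():
    """A user can raise the soft limit up to the hard limit without any
    privileges, but never beyond it -- exactly the behavior described."""
    _, hard = resource.getrlimit(resource.RLIMIT_CPU)
    resource.setrlimit(resource.RLIMIT_CPU, (hard, hard))
```

In practice the defaults would live in `/etc/security/limits.conf` (soft and hard `cpu` entries) rather than in a script; the snippet just shows the soft-vs-hard mechanics.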
-- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
