Are you using pam_limits.so in any of your /etc/pam.d/ configuration files?
That would enforce the limits in /etc/security/limits.conf for all users; for
root these are usually unlimited. Root is almost always allowed to do things bad
enough to crash the machine or run it out of resources. If the /etc/pam.d/sshd
file has pam_limits.so in it, that's probably where the unlimited setting for
root is coming from.
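
A quick way to see which PAM service files load pam_limits.so is to grep the
directory. This is only a sketch: the function name find_pam_limits and the
PAM_DIR variable are made up for illustration, and it assumes a grep that
supports -r (GNU and BSD grep both do).

```shell
#!/bin/sh
# List PAM service files that load pam_limits.so, skipping commented-out lines.
# PAM_DIR is a hypothetical override; it defaults to the usual /etc/pam.d.
PAM_DIR="${PAM_DIR:-/etc/pam.d}"

find_pam_limits() {
    # -r: recurse into the directory, -l: print file names only,
    # -E: extended regex. The first non-blank character must not be '#'.
    grep -rlE '^[[:space:]]*[^#[:space:]].*pam_limits\.so' "$1" 2>/dev/null
}

find_pam_limits "$PAM_DIR" || echo "no active pam_limits.so lines under $PAM_DIR"
```

If sshd shows up in the output, ssh logins on that node go through pam_limits.so
and pick up whatever /etc/security/limits.conf says for that user.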
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445
On 4/15/18, 1:26 PM, "slurm-users on behalf of Mahmood Naderan"
<slurm-users-boun...@lists.schedmd.com on behalf of mahmood...@gmail.com> wrote:
I have actually disabled the swap partition (!) since the system gets into a
really bad state and, in my experience, I have to enter the room and
reset the affected machine (!). Otherwise I have to wait a long
time for it to get back to normal.
When I ssh to the node as root, ulimit -a reports unlimited
virtual memory. So it seems that root has an unlimited value while
users have a limited value.
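
To compare root and an ordinary user directly, a small helper like the one
below can be run in each login. It is a sketch only; show_vmem_limits is a
made-up name, and it reads just the soft and hard values behind ulimit -v.

```shell
#!/bin/sh
# Print the soft and hard virtual memory limits (kbytes) of the current shell,
# together with the user they apply to.
show_vmem_limits() {
    user=$(id -un 2>/dev/null || id -u)   # fall back to the numeric UID
    printf 'user=%s soft=%s hard=%s\n' "$user" "$(ulimit -Sv)" "$(ulimit -Hv)"
}

show_vmem_limits
```

Running it once over ssh as root and once as the job's user shows whether the
two logins really end up with different limits.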
On Sun, Apr 15, 2018 at 10:26 PM, Ole Holm Nielsen
> Hi Mahmood,
> It seems your compute node is configured with this limit:
> virtual memory (kbytes, -v) 72089600
> So when the batch job tries to set a higher limit (ulimit -v 82089600) than
> permitted by the system (72089600), this must surely get rejected, as you
> have discovered!
> You may want to reconfigure your compute nodes' limits, for example by
> setting the virtual memory limit to "unlimited" in your configuration. If
> the nodes have a very small RAM + swap space size, you might see
> Out Of Memory errors...
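
The rejection described above can be checked up front in a job script before
calling ulimit -v. The following is a minimal sketch, assuming an unprivileged
job (root can raise its own hard limit, so the check does not apply there);
fits_hard_limit is a made-up helper name and 82089600 is the value from this
thread.

```shell
#!/bin/sh
# Return 0 if a requested "ulimit -v" value (kbytes) fits under the hard limit.
# Each argument is either a non-negative integer or the word "unlimited".
fits_hard_limit() {
    requested=$1
    hard=$2
    [ "$hard" = "unlimited" ] && return 0       # no ceiling at all
    [ "$requested" = "unlimited" ] && return 1  # cannot exceed a finite ceiling
    [ "$requested" -le "$hard" ]
}

requested=82089600        # value the batch job asked for in this thread
hard=$(ulimit -Hv)        # hard limit of the current shell
if fits_hard_limit "$requested" "$hard"; then
    echo "ulimit -v $requested should be accepted (hard limit: $hard)"
else
    echo "ulimit -v $requested would be rejected (hard limit: $hard)"
fi
```

With the node's hard limit at 72089600, the second branch is the one that
fires, which matches the failure seen in the batch job.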