Can anyone shed some light on where the _virtual_ memory limit comes from?  
We're getting jobs killed with the message:

slurmstepd: error: Step 3664.0 exceeded virtual memory limit (79348101120 > 72638634393), being killed

Is this a limit that's dictated by cgroup.conf or by some srun option (like 
--mem-per-cpu)?  And where could this number come from on a machine that has 64 
GB nodes, where DefMemPerCPU for the partition is 64 GB / 32 (threads) and 
cgroup.conf has AllowedSwapSpace=75?
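
As a rough sanity check on the numbers in that message, the sketch below converts them to GiB and compares the limit against a full 64 GiB node. The 110% factor at the end is an assumption, not something confirmed from the configuration; it is tried only because it makes the limit come out to a whole number of MiB, which would be consistent with a percentage-based virtual memory limit (for example a VSizeFactor setting in slurm.conf).

```python
GIB = 1024 ** 3
MIB = 1024 ** 2
THREADS = 32  # threads per node, as stated in the post

used_bytes = 79348101120   # first number in the slurmstepd message
limit_bytes = 72638634393  # second number in the slurmstepd message

print(f"used:  {used_bytes / GIB:.2f} GiB")
print(f"limit: {limit_bytes / GIB:.2f} GiB")
print(f"limit relative to 64 GiB: {limit_bytes / (64 * GIB):.4f}")

# ASSUMPTION: the virtual limit is 110% of some real-memory allocation.
# This factor is a guess that happens to fit, not a confirmed setting.
assumed_factor = 1.10
base_mib = round(limit_bytes / assumed_factor / MIB)
per_thread_mib = base_mib / THREADS

print(f"implied real-memory allocation: {base_mib} MiB")
print(f"implied per-thread allocation:  {per_thread_mib} MiB")
```

If that guess is right, the limit would correspond to a real-memory allocation slightly under 64 GiB, split evenly over 32 threads, rather than to the AllowedSwapSpace=75 setting.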

And a couple of related questions:
1. If I define DefMemPerCPU in the partition line, and the job doesn't request 
anything else, what memory measure should I expect this to be the limit on? RSS?

2. In general, what's the right way to disable swapping by default, but allow 
individual jobs to request to be allowed to swap?

                                                                        thanks,
                                                                        Noam
