I have a user who is running a problem that uses 512 GB of memory. She requests this from SLURM on a node which has this much. However, her code dies:
slurmd[holy2b09101]: error: Job 6497 exceeded 268435456 KB memory limit, being killed
slurmd[holy2b09101]: error: Exceeded job memory limit
slurmd[holy2b09101]: error: *** JOB 6497 CANCELLED AT 2013-05-23T00:53:31 ***

This is half of the 512 GB which was requested. Is there something I am missing? The nodes in question have:

NodeName=DEFAULT CPUs=64 RealMemory=529247 Sockets=4 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN

These are AMD Abu Dhabi processors with 8 GB per core, so 512 GB total. She is requesting 8 GB per cpu and is asking for 64 cores.

Thoughts?

-Paul Edmon-
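
For reference, here is a quick unit check of the numbers quoted above, as a small Python sketch. It assumes the KB in the slurmd log are binary units (1 GB = 1024 * 1024 KB), and the variable names are illustrative only. It does not establish why the job was killed; it only shows which CPU count the reported limit corresponds to at 8 GB per CPU.

```python
# Values copied from the slurmd log and the node definition in this post.
KILL_LIMIT_KB = 268435456        # "exceeded 268435456 KB memory limit"
REAL_MEMORY_MB = 529247          # RealMemory from the node definition
SOCKETS = 4
CORES_PER_SOCKET = 8
THREADS_PER_CORE = 2
CPUS = 64                        # CPUs from the node definition
MEM_PER_CPU_GB = 8               # memory requested per CPU

# Assumption: binary units, 1 GB = 1024 MB = 1024 * 1024 KB.
KB_PER_GB = 1024 * 1024
MB_PER_GB = 1024

limit_gb = KILL_LIMIT_KB / KB_PER_GB            # 256.0
requested_gb = MEM_PER_CPU_GB * CPUS            # 512
requested_mb = requested_gb * MB_PER_GB         # 524288

# How many CPUs the enforced limit corresponds to at 8 GB per CPU.
implied_cpus = limit_gb / MEM_PER_CPU_GB        # 32.0

# Core counts implied by the node definition.
physical_cores = SOCKETS * CORES_PER_SOCKET          # 32
logical_cpus = physical_cores * THREADS_PER_CORE     # 64

print(f"limit: {limit_gb:g} GB, requested: {requested_gb} GB")
print(f"limit corresponds to {implied_cpus:g} CPUs at {MEM_PER_CPU_GB} GB each")
print(f"physical cores: {physical_cores}, logical CPUs: {logical_cpus}")
print(f"request fits in RealMemory: {requested_mb <= REAL_MEMORY_MB}")
```

Under these assumptions the limit works out to exactly 256 GB, i.e. 8 GB times 32, and 32 is also Sockets times CoresPerSocket, while the full 512 GB request does fit within RealMemory. Whether the limit was actually computed from the physical core count rather than the 64 CPUs is only a guess based on this arithmetic.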
