[slurm-users] Jobs killed by OOM-killer only on certain nodes.

Prentice Bisbal Thu, 02 Jul 2020 11:00:00 -0700

I maintain a very heterogeneous cluster (different processors, different amounts of RAM, etc.). I have a user reporting the following problem.

He's running the same job multiple times with different input parameters. The jobs run fine unless they land on specific nodes. He's specifying --mem=2G in his sbatch files. On the nodes where the jobs fail, I see that the OOM killer is invoked, so I asked him to specify more RAM, and he did. He set --mem=4G, and the jobs still fail on these 2 nodes. However, they run just fine on other nodes with --mem=2G.

When I look at the slurm log file on the nodes, I see something like this for a failing job (in this case, --mem=4G was set):

[2020-07-01T16:19:06.222] _run_prolog: prolog with lock for job 801777 ran for 0 seconds
[2020-07-01T16:19:06.479] [801777.extern] task/cgroup: /slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-07-01T16:19:06.483] [801777.extern] task/cgroup: /slurm/uid_40324/job_801777/step_extern: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-07-01T16:19:06.506] Launching batch job 801777 for UID 40324
[2020-07-01T16:19:06.621] [801777.batch] task/cgroup: /slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-07-01T16:19:06.623] [801777.batch] task/cgroup: /slurm/uid_40324/job_801777/step_batch: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-07-01T16:19:19.385] [801777.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
[2020-07-01T16:19:19.389] [801777.batch] done with job
[2020-07-01T16:19:19.463] [801777.extern] _oom_event_monitor: oom-kill event count: 1
[2020-07-01T16:19:19.508] [801777.extern] done with job
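
To compare the good and bad nodes, it may help to pull the cgroup memory limit and the OOM event count out of each node's slurmd log. Below is a minimal Python sketch that does this for lines shaped like the ones above. The regular expressions are based only on this excerpt, and the function name summarize is made up for illustration; other Slurm versions or log levels may format these lines differently.

```python
import re

# Patterns derived from the log excerpt above (an assumption, not a Slurm spec).
# \s* tolerates missing or extra whitespace around the wrapped fields.
CGROUP_RE = re.compile(
    r"\[(?P<job>\d+)\.(?P<step>\w+)\]\s*task/cgroup:\s*(?P<path>\S+):\s*"
    r"alloc=(?P<alloc>\d+)MB\s*mem\.limit=(?P<limit>\d+)MB\s*"
    r"memsw\.limit=(?P<memsw>\S+)"
)
OOM_RE = re.compile(
    r"\[(?P<job>\d+)\.(?P<step>\w+)\]\s*_oom_event_monitor:\s*"
    r"oom-kill\s*event count:\s*(?P<count>\d+)"
)


def summarize(lines):
    """Hypothetical helper: collect per-job memory limits and OOM events.

    Returns {job_id: {"mem_limit_mb": int or None,
                      "oom_events": {step_name: count}}}.
    """
    jobs = {}
    for line in lines:
        cgroup = CGROUP_RE.search(line)
        if cgroup:
            entry = jobs.setdefault(
                cgroup["job"], {"mem_limit_mb": None, "oom_events": {}}
            )
            # Keep the most recently logged limit for the job.
            entry["mem_limit_mb"] = int(cgroup["limit"])
            continue
        oom = OOM_RE.search(line)
        if oom:
            entry = jobs.setdefault(
                oom["job"], {"mem_limit_mb": None, "oom_events": {}}
            )
            # Record the count reported by each step separately.
            entry["oom_events"][oom["step"]] = int(oom["count"])
    return jobs
```

Passing the lines of a node's slurmd log (for example, an open file object) to summarize gives one entry per job ID, so jobs that hit the OOM killer can be listed next to the limit they were given.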

Any ideas why the jobs are failing on just these two nodes, while they run just fine on many other nodes?

For now, the user is excluding these two nodes using the -x option to sbatch, but I'd really like to understand what's going on here.


--

Prentice
