Do you by any chance see
Out of memory: Kill process ...
on the nodes?
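
A minimal sketch of one way to check: the kernel logs a line for each process the OOM killer terminates, so searching the kernel log on each node should show it. The helper name find_oom_kills is made up for illustration, and the exact wording of the message varies between kernel versions, so the pattern is an assumption.

```shell
# Hypothetical helper: print only the lines that look like OOM-killer
# messages. Reads the named files, or stdin when no file is given.
# The pattern is an assumption; message wording differs across kernels.
find_oom_kills() {
    grep -i -E 'out of memory|oom-killer' "$@"
}

# Typical use on a node (reading the kernel log may require root):
#   dmesg | find_oom_kills
#   find_oom_kills /var/log/messages
```

If this prints anything with a timestamp close to when the tasks died, memory exhaustion is the likely cause.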
Andrej
On 11/08/2012 11:23 PM, Bronevetsky, Greg wrote:
I’m using SLURM to schedule my tasks inside an existing slurm
allocation (this is on LLNL OCF clusters). This technique was working
fine for months but recently (probably the last OS update, although
I’m not sure) I’ve been running into problems where my tasks are being
killed off prematurely and my SLURM reports the nodes within my
allocation as dead. This problem was first observed when I was using
SLURM 2.3.3, and upgrading to 2.4.4 hasn’t helped. Below I’ve included
my log of events. It shows the stdout of slurmd -D -vvvvvvvvvvv, and
every 30 seconds I print the following:
<<<----------------- hostname -------------------<<<
Output of my SLURM’s sinfo
grep MemFree /proc/meminfo
ps -ef | grep $USER
>>>----------------- Time ------------------->>>
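
For anyone wanting to reproduce this kind of log, the periodic snapshot described above could be produced by something like the following. This is only a sketch, not the script actually used: the function name snapshot is made up, plain sinfo on the PATH stands in for the private SLURM instance's sinfo, and uname -n and date stand in for the hostname and time fields.

```shell
# Hypothetical reconstruction of the 30-second snapshot; not the
# original script. Each command is allowed to fail so that one
# missing tool does not stop the log.
snapshot() {
    echo "<<<----------------- $(uname -n) -------------------<<<"
    sinfo || true                                   # assumed: the private SLURM's sinfo is on PATH
    grep MemFree /proc/meminfo || true              # free memory on this node
    ps -ef | grep "${USER:-$(id -un)}" || true      # this user's processes
    echo ">>>----------------- $(date) ------------------->>>"
}

# Run until interrupted, one snapshot every 30 seconds:
#   while true; do snapshot; sleep 30; done
```

Redirecting the loop's output to a file next to the slurmd output makes it easier to line up the memory readings with the slurmd messages.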
As you can observe, things run mostly smoothly for the first 180
seconds. The only oddity is that task 2 is aborted; I don’t know why.
After this point slurmd gets a REQUEST_TERMINATE_JOB message for each
of the remaining 15 tasks, kills them off and the next call to sinfo
shows that the node has transitioned from alloc to down status. I’ve
confirmed that my tasks get a TERM signal before they complete.
Does anybody know what might be going wrong or can suggest what I can
do to debug it? Thanks!
Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
[email protected]
http://greg.bronevetsky.com
--
_____________________________________________________________
prof. dr. Andrej Filipcic, E-mail: [email protected]
Department of Experimental High Energy Physics - F9
Jozef Stefan Institute, Jamova 39, P.o.Box 3000
SI-1001 Ljubljana, Slovenia
Tel.: +386-1-477-3674 Fax: +386-1-477-3166
-------------------------------------------------------------