[slurm-dev] Re: slurmd aborting jobs before their completion

Andrej Filipcic Thu, 08 Nov 2012 22:58:07 -0800


Do you by any chance see
Out of memory: Kill process ...
on the nodes?
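
A quick way to look for this, as a rough sketch only: filter the kernel log on each node for OOM-killer lines. The helper name find_oom_lines is made up for illustration, and the patterns assume the usual Linux kernel wording ("Out of memory" and "invoked oom-killer"); reading dmesg may need elevated privileges on some systems.

```shell
#!/bin/sh
# Hypothetical helper: print lines from stdin that look like OOM-killer
# activity. Reading stdin means it works on dmesg output or a saved log.
find_oom_lines() {
    # "|| true" keeps the exit status zero when nothing matches.
    grep -i -E 'out of memory|oom-killer' || true
}

# Typical use on a compute node (not run automatically here):
#   dmesg | find_oom_lines
```

If this prints nothing on the affected nodes, memory exhaustion is less likely to be the cause of the TERM signals.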


Andrej

On 11/08/2012 11:23 PM, Bronevetsky, Greg wrote:

I’m using SLURM to schedule my tasks inside an existing slurm allocation (this is on LLNL OCF clusters). This technique was working fine for months but recently (probably the last OS update, although I’m not sure) I’ve been running into problems where my tasks are being killed off prematurely and my SLURM reports the nodes within my allocation as dead. This problem was first observed when I was using slurm 2.3.3 and upgrading to 2.4.4 hasn’t helped. Below I’ve included my log of events. It shows the stdout of slurmd -D -vvvvvvvvvvv and every 30 seconds I print the following:
<<<----------------- hostname -------------------<<<

Output of my SLURM’s sinfo

grep MemFree /proc/meminfo

ps -ef | grep $USER

>>>----------------- Time ------------------->>>
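
The periodic snapshot described above could be produced by something like the following. This is only a sketch of the idea, not the script actually used: the function name snapshot, the fallback messages, and the use of id -un when $USER is unset are all assumptions added for illustration.

```shell
#!/bin/sh
# Hypothetical reconstruction of one monitoring snapshot.
snapshot() {
    user="${USER:-$(id -un)}"
    echo "<<<----------------- $(hostname) -------------------<<<"
    # sinfo is the SLURM node/partition status command.
    sinfo 2>/dev/null || echo "(sinfo unavailable)"
    grep MemFree /proc/meminfo 2>/dev/null || echo "(MemFree unavailable)"
    ps -ef | grep -- "$user" || true
    echo ">>>----------------- $(date) ------------------->>>"
}

# Repeat every 30 seconds, as in the log (not run automatically here):
#   while true; do snapshot; sleep 30; done
```

Watching the MemFree line shrink across successive snapshots is one way to spot memory pressure before the tasks are killed.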
As you can observe, things run mostly smoothly for the first 180 seconds. The only oddity is that task 2 is aborted; I don’t know why. After this point slurmd gets a REQUEST_TERMINATE_JOB message for each of the remaining 15 tasks, kills them off, and the next call to sinfo shows that the node has transitioned from alloc to down status. I’ve confirmed that my tasks get a TERM signal before they complete.
Does anybody know what might be going wrong, or can anyone suggest what I can do to debug it? Thanks!
Greg Bronevetsky

Lawrence Livermore National Lab

(925) 424-5756

[email protected]

http://greg.bronevetsky.com


--
_____________________________________________________________
   prof. dr. Andrej Filipcic,   E-mail: [email protected]
   Department of Experimental High Energy Physics - F9
   Jozef Stefan Institute, Jamova 39, P.o.Box 3000
   SI-1001 Ljubljana, Slovenia
   Tel.: +386-1-477-3674    Fax: +386-1-477-3166
-------------------------------------------------------------
