I don’t believe that this is happening, but to eliminate the possibility: where 
would such information get logged? I’m tracking the stdout and stderr of the 
job submitted through the main scheduler (it starts my slurmd daemons), as well 
as the stdout and stderr of the individual jobs scheduled by my SLURM.

Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
[email protected]
http://greg.bronevetsky.com

From: Andrej Filipcic [mailto:[email protected]]
Sent: Thursday, November 08, 2012 11:00 PM
To: slurm-dev
Subject: [slurm-dev] Re: slurmd aborting jobs before their completion


Do you by any chance see
Out of memory: Kill process ...
on the nodes?
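
One way to look for such messages, as a rough sketch: the OOM killer writes to the kernel log, so the check amounts to a case-insensitive search of that log. The helper name find_oom_kills is hypothetical (not part of SLURM), and where the kernel log is kept is an assumption that varies by distribution.

```shell
# Hypothetical helper: print kernel-log lines that look like OOM-killer activity.
# Reads the files given as arguments, or stdin when no file is given.
find_oom_kills() {
    # -i: ignore case; -h: omit file names so file and stdin output look alike.
    # Exit status is grep's: 0 if something matched, 1 if nothing did.
    grep -h -i -E 'out of memory|oom-killer' "$@"
}
```

Typical use on a compute node would be `dmesg | find_oom_kills`, or `find_oom_kills /var/log/messages` where that file exists and is readable. Reading the kernel log may require elevated privileges on some systems.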

Andrej

On 11/08/2012 11:23 PM, Bronevetsky, Greg wrote:
I’m using SLURM to schedule my tasks inside an existing SLURM allocation (this 
is on LLNL OCF clusters). This technique worked fine for months, but 
recently (probably since the last OS update, although I’m not sure) I’ve been 
running into problems where my tasks are killed off prematurely and my SLURM 
reports the nodes within my allocation as dead. I first observed the problem 
with SLURM 2.3.3, and upgrading to 2.4.4 hasn’t helped. Below I’ve 
included my log of events. It shows the stdout of slurmd -D -vvvvvvvvvvv, and 
every 30 seconds I print the following:
<<<----------------- hostname -------------------<<<
Output of my SLURM’s sinfo
grep MemFree /proc/meminfo
ps -ef | grep $USER
>>>----------------- Time ------------------->>>
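
For anyone wanting to reproduce this kind of log, the periodic snapshot described above could be sketched roughly as follows. This is not the original script: the function names snapshot and monitor are made up for illustration, the closing line uses the output of date as the "Time" field (an assumption), and sinfo is assumed to be the one from the nested SLURM installation found first in PATH.

```shell
# Print one snapshot: hostname, sinfo output, free memory, and my processes.
snapshot() {
    user=${USER:-$(id -un)}
    echo "<<<----------------- $(uname -n) -------------------<<<"
    # Each probe falls back to a note so one missing tool does not stop the log.
    if command -v sinfo >/dev/null 2>&1; then
        sinfo || echo "sinfo failed"
    else
        echo "sinfo not found in PATH"
    fi
    grep MemFree /proc/meminfo || echo "MemFree unavailable"
    ps -ef | grep "$user" || true
    echo ">>>----------------- $(date) ------------------->>>"
}

# Print a snapshot every N seconds (default 30) until interrupted.
monitor() {
    interval=${1:-30}
    while :; do
        snapshot
        sleep "$interval"
    done
}
```

Running `monitor 30 >> monitor.log &` alongside slurmd would give one block per interval, which makes it easier to line up memory and process state against the slurmd debug output.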

As you can see, things run mostly smoothly for the first 180 seconds. 
The only oddity is that task 2 is aborted; I don’t know why. After this point 
slurmd gets a REQUEST_TERMINATE_JOB message for each of the remaining 15 tasks 
and kills them off, and the next call to sinfo shows that the node has 
transitioned from alloc to down status. I’ve confirmed that my tasks get a TERM 
signal before they complete.

Does anybody know what might be going wrong or can suggest what I can do to 
debug it? Thanks!

Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
[email protected]<mailto:[email protected]>
http://greg.bronevetsky.com





--
_____________________________________________________________
   prof. dr. Andrej Filipcic,   E-mail: [email protected]
   Department of Experimental High Energy Physics - F9
   Jozef Stefan Institute, Jamova 39, P.o.Box 3000
   SI-1001 Ljubljana, Slovenia
   Tel.: +386-1-477-3674    Fax: +386-1-477-3166
-------------------------------------------------------------
