On 11/09/2012 06:11 PM, Bronevetsky, Greg wrote:
I don’t believe this is happening, but to rule it out, where would
such information get logged? I’m tracking the stdout and stderr of the
job submitted through the main scheduler (it starts my slurmd daemons),
as well as the stdout and stderr of the individual jobs scheduled by
my SLURM.
Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
[email protected]
http://greg.bronevetsky.com
Those entries are usually in /var/log/messages.
Barbara
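A minimal sketch of how one might search for the kernel's out-of-memory messages asked about in this thread. The function name find_oom_kills is hypothetical, the default path is simply the /var/log/messages location suggested here, and reading that file typically requires root; other systems may log these messages elsewhere.

```shell
#!/bin/sh
# Hypothetical helper: print OOM-killer lines from a syslog-style file.
# The default path is the one suggested in this thread; it is an
# assumption that the node's kernel messages are written there.
find_oom_kills() {
    logfile=${1:-/var/log/messages}
    # Case-insensitive so that wording differences between kernel
    # versions ("Kill process" vs. "Killed process") still match.
    grep -i 'out of memory' "$logfile"
}
```

Running `find_oom_kills` on each node of the allocation (or `find_oom_kills /path/to/other/log`) prints any matching lines; no output means no match was found in that file, not that the OOM killer never ran.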
*From:* Andrej Filipcic [mailto:[email protected]]
*Sent:* Thursday, November 08, 2012 11:00 PM
*To:* slurm-dev
*Subject:* [slurm-dev] Re: slurmd aborting jobs before their completion
Do you by any chance see
Out of memory: Kill process ...
on the nodes?
Andrej
On 11/08/2012 11:23 PM, Bronevetsky, Greg wrote:
I’m using SLURM to schedule my tasks inside an existing slurm
allocation (this is on LLNL OCF clusters). This technique was
working fine for months but recently (probably the last OS update,
although I’m not sure) I’ve been running into problems where my
tasks are being killed off prematurely and my SLURM reports the
nodes within my allocation as dead. This problem was first
observed when I was using SLURM 2.3.3, and upgrading to 2.4.4 hasn’t
helped. Below I’ve included my log of events. It shows the stdout
of slurmd -D -vvvvvvvvvvv, and every 30 seconds I print the following:
<<<----------------- hostname -------------------<<<
Output of my SLURM’s sinfo
grep MemFree /proc/meminfo
ps -ef | grep $USER
>>>----------------- Time ------------------->>>
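The snapshot format above could be produced by a loop along these lines. This is only a sketch under stated assumptions: the function names snapshot and monitor, and the SINFO_CMD and INTERVAL variables, are illustrative and not taken from the actual script; SINFO_CMD is assumed to point at the sinfo of the nested SLURM instance.

```shell
#!/bin/sh
# Hypothetical reconstruction of the periodic status snapshot.
# SINFO_CMD and INTERVAL are illustrative names, not from the original script.

snapshot() {
    user=${USER:-$(id -un)}
    printf '<<<----------------- %s -------------------<<<\n' \
        "$(hostname 2>/dev/null || uname -n)"
    # Node states as seen by the nested SLURM instance.
    ${SINFO_CMD:-sinfo} || true
    # Free memory on this node (Linux only).
    grep MemFree /proc/meminfo || true
    # Processes belonging to the current user.
    ps -ef | grep "$user" || true
    printf '>>>----------------- %s ------------------->>>\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')"
}

monitor() {
    # Print one snapshot every INTERVAL seconds (default 30) until killed.
    while :; do
        snapshot
        sleep "${INTERVAL:-30}"
    done
}
```

Calling `monitor &` before starting slurmd would interleave the snapshots with the daemon's output, which is roughly how the log described here is laid out.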
As you can see, things run mostly smoothly for the first
180 seconds. The only oddity is that task 2 is aborted; I don’t
know why. After this point slurmd gets a REQUEST_TERMINATE_JOB
message for each of the remaining 15 tasks, kills them off and the
next call to sinfo shows that the node has transitioned from alloc
to down status. I’ve confirmed that my tasks get a TERM signal
before they complete.
Does anybody know what might be going wrong or can suggest what I
can do to debug it? Thanks!
Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
[email protected]
http://greg.bronevetsky.com
--
_____________________________________________________________
prof. dr. Andrej Filipcic, E-mail: [email protected]
Department of Experimental High Energy Physics - F9
Jozef Stefan Institute, Jamova 39, P.o.Box 3000
SI-1001 Ljubljana, Slovenia
Tel.: +386-1-477-3674 Fax: +386-1-477-3166
-------------------------------------------------------------