On 11/09/2012 06:11 PM, Bronevetsky, Greg wrote:

I don’t believe that this is happening, but to eliminate this possibility, where might such information get logged? I’m tracking the stdout and stderr of the job submitted through the main scheduler (it starts my slurmd daemons) as well as the stdout and stderr of the individual jobs scheduled by my SLURM.

Greg Bronevetsky

Lawrence Livermore National Lab

(925) 424-5756

[email protected] <mailto:[email protected]>

http://greg.bronevetsky.com

Those entries are usually in /var/log/messages.

Barbara
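
A minimal sketch of how one might check a node for such entries. The helper name find_oom_kills is hypothetical, and the default path is simply the location mentioned above; some distributions log kernel messages elsewhere, and reading the file may require root.

```shell
#!/bin/sh
# Hypothetical helper: print kernel OOM-killer lines from a log file.
# The default path is an assumption taken from the reply above.
find_oom_kills() {
    logfile="${1:-/var/log/messages}"
    # Case-insensitive match so that both "Kill process" and
    # "Killed process" variants of the kernel message are caught.
    grep -i 'out of memory' "$logfile"
}
```

Running `find_oom_kills` on each node of the allocation (or `find_oom_kills /path/to/other.log` if the kernel log lives elsewhere) prints any matching lines and prints nothing if there are none.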

*From:* Andrej Filipcic [mailto:[email protected]]
*Sent:* Thursday, November 08, 2012 11:00 PM
*To:* slurm-dev
*Subject:* [slurm-dev] Re: slurmd aborting jobs before their completion


Do you by any chance see
Out of memory: Kill process ...
on the nodes?

Andrej

On 11/08/2012 11:23 PM, Bronevetsky, Greg wrote:

    I’m using SLURM to schedule my tasks inside an existing slurm
    allocation (this is on LLNL OCF clusters). This technique was
    working fine for months but recently (probably the last OS update,
    although I’m not sure) I’ve been running into problems where my
    tasks are being killed off prematurely and my SLURM reports the
    nodes within my allocation as dead. This problem was first
    observed when I was using SLURM 2.3.3, and upgrading to 2.4.4 hasn’t
    helped. Below I’ve included my log of events. It shows the stdout
    of slurmd -D -vvvvvvvvvvv, and every 30 seconds I print the following:

    <<<----------------- hostname -------------------<<<

    Output of my SLURM’s sinfo

    grep MemFree /proc/meminfo

    ps -ef | grep $USER

    >>>----------------- Time ------------------->>>
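
The periodic snapshot described above could be produced by a loop along these lines. This is only a sketch, not the script actually used: the function names snapshot and monitor and the MEMINFO override are assumptions made for illustration, and sinfo is assumed to resolve to the private SLURM installation's binary.

```shell
#!/bin/sh
# Sketch of the monitoring loop; function names are hypothetical.

# Print one snapshot: hostname header, sinfo output, free memory,
# the user's processes, and a footer carrying the current time.
snapshot() {
    meminfo="${MEMINFO:-/proc/meminfo}"   # override is an assumption, for testing
    user="${USER:-$(id -un)}"
    echo "<<<----------------- $(hostname) -------------------<<<"
    sinfo || true                         # assumed to be the private SLURM's sinfo
    grep MemFree "$meminfo" || true
    ps -ef | grep "$user" || true
    echo ">>>----------------- $(date) ------------------->>>"
}

# Take a snapshot every $1 seconds (default 30), $2 times (default: forever).
monitor() {
    interval="${1:-30}"
    remaining="${2:--1}"
    while [ "$remaining" -ne 0 ]; do
        snapshot
        if [ "$remaining" -gt 0 ]; then remaining=$((remaining - 1)); fi
        if [ "$remaining" -ne 0 ]; then sleep "$interval"; fi
    done
}
```

Calling `monitor 30` alongside slurmd would interleave a snapshot with the daemon's output every 30 seconds; `monitor 30 10` stops after ten snapshots.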

    As you can observe, things run mostly smoothly for the first
    180 seconds. The only oddity is that task 2 is aborted; I don’t
    know why. After this point slurmd gets a REQUEST_TERMINATE_JOB
    message for each of the remaining 15 tasks, kills them off and the
    next call to sinfo shows that the node has transitioned from alloc
    to down status. I’ve confirmed that my tasks get a TERM signal
    before they complete.

    Does anybody know what might be going wrong or can suggest what I
    can do to debug it? Thanks!

    Greg Bronevetsky

    Lawrence Livermore National Lab

    (925) 424-5756

    [email protected] <mailto:[email protected]>

    http://greg.bronevetsky.com



--
_____________________________________________________________
    prof. dr. Andrej Filipcic,   E-mail:[email protected]  
<mailto:[email protected]>
    Department of Experimental High Energy Physics - F9
    Jozef Stefan Institute, Jamova 39, P.o.Box 3000
    SI-1001 Ljubljana, Slovenia
    Tel.: +386-1-477-3674    Fax: +386-1-477-3166
-------------------------------------------------------------

