I don’t believe that this is happening, but to rule it out: where would such information get logged? I’m capturing the stdout and stderr of the job submitted through the main scheduler (the job that starts my slurmd daemons) as well as the stdout and stderr of the individual jobs scheduled by SLURM.

Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
[email protected]
http://greg.bronevetsky.com

From: Andrej Filipcic [mailto:[email protected]]
Sent: Thursday, November 08, 2012 11:00 PM
To: slurm-dev
Subject: [slurm-dev] Re: slurmd aborting jobs before their completion

Do you by any chance see

Out of memory: Kill process ...

on the nodes?

Andrej

On 11/08/2012 11:23 PM, Bronevetsky, Greg wrote:

I’m using SLURM to schedule my tasks inside an existing SLURM allocation (this is on LLNL OCF clusters). This technique was working fine for months, but recently (probably since the last OS update, although I’m not sure) I’ve been running into problems where my tasks are killed off prematurely and my SLURM reports the nodes within my allocation as dead. The problem was first observed when I was using SLURM 2.3.3, and upgrading to 2.4.4 hasn’t helped.

Below I’ve included my log of events. It shows the stdout of slurmd -D -vvvvvvvvvvv, and every 30 seconds I print the following:

<<<----------------- hostname -------------------<<<
Output of my SLURM’s sinfo
grep MemFree /proc/meminfo
ps -ef | grep $USER
>>>----------------- Time ------------------->>>

As you can see, things run mostly smoothly for the first 180 seconds. The only oddity is that task 2 is aborted; I don’t know why. After this point slurmd gets a REQUEST_TERMINATE_JOB message for each of the remaining 15 tasks and kills them off, and the next call to sinfo shows that the node has transitioned from alloc to down status. I’ve confirmed that my tasks get a TERM signal before they complete.

Does anybody know what might be going wrong, or can anybody suggest what I can do to debug it?

Thanks!

Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
[email protected]
http://greg.bronevetsky.com

--
_____________________________________________________________
prof. dr. Andrej Filipcic, E-mail: [email protected]
Department of Experimental High Energy Physics - F9
Jozef Stefan Institute, Jamova 39, P.o.Box 3000
SI-1001 Ljubljana, Slovenia
Tel.: +386-1-477-3674 Fax: +386-1-477-3166
-------------------------------------------------------------
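
On where such messages end up: the kernel's OOM killer writes to the kernel log, not to a job's stdout or stderr, so it would not show up in the output captured so far. A minimal sketch of a filter for that log follows. The function name find_oom_kills is hypothetical, and the patterns are an assumption based on common kernel message formats; the exact wording varies between kernel versions.

```shell
# Hypothetical helper: print only the lines of a kernel log (read from stdin)
# that look like OOM-killer activity. The patterns are assumptions based on
# common kernel message formats and may need adjusting for a given kernel.
find_oom_kills() {
    grep -Ei 'out of memory: kill|invoked oom-killer|killed process'
}
```

On a compute node this could be run as "dmesg | find_oom_kills", or against a syslog file where one exists, for example "find_oom_kills < /var/log/messages" (the path depends on the distribution and may require root to read). No output means no matching lines were found in that log, which does not by itself rule out an OOM kill if the log has rotated.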
