Quick question: are you limiting memory usage for the job (i.e. h_vmem)?

On Tue, Jan 07, 2014 at 02:57:00PM -0800, Joshua Baker-LePain wrote:
> We're running OGS/GE 2011.11p1 on top of fully updated CentOS 6 on a
> cluster with ~650 nodes. Spool directories are local to the nodes.
> Our jobs are primarily serial, but with some parallel usage. One
> user has been having issues with random tasks of parallel array jobs
> failing, and I'm having trouble tracking it down.
>
> The application is compiled and running against almost the stock
> version of OpenMPI. That is, the stock openmpi in C6 is 1.5.4. I
> used the same SRPM but upgraded to 1.5.5 to get a patch which fixes
> issues in our environment (which includes multiple queues). Note
> that the failing jobs are submitted to a single queue. OpenMPI is
> compiled --with-sge, and the PE looks like this:
>
> $ qconf -sp ompi
> pe_name            ompi
> slots              5000
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
>
> The user typically submits an array of 41 tasks, each requesting 50
> slots.
> Some tasks run, but some fail with this message in qstat (which then
> puts the queue on that host in QERROR):
>
> error reason 10: 01/07/2014 14:34:17 [11511:22892]: unable to find
> job file "/var/spool/sge/qb3cell/iq237/job_scripts/8568851"
>
> Similarly, in the qmaster messages file:
>
> 01/07/2014 14:35:56|worker|sortinghat|W|job 8568851.10 failed on host iq237
> general before job because: 01/07/2014 14:34:17 [11511:22892]: unable to find
> job file "/var/spool/sge/qb3cell/iq237/job_scripts/8568851"
> 01/07/2014 14:35:56|worker|sortinghat|W|rescheduling job 8568851.10
> 01/07/2014 14:35:56|worker|sortinghat|E|queue lab.q marked QERROR as result
> of job 8568851's failure at host iq237
>
> Often, when I go look at a node that gave that error message, I'll
> find other tasks from the same array job running there. But I can't
> confirm that it *always* happens. The rescheduled task generally
> ends up running anyway.
>
> Any ideas as to how to track this down? I'm a bit stumped...
>
> Thanks.
>
> --
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
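For what it's worth, here is a minimal sketch of the two checks implied above: whether the job actually requests an h_vmem limit, and whether the job script file made it into the node's local spool. The `hard resource_list` line format and the use of a temp directory as a stand-in for `/var/spool/sge/<cell>/<host>/job_scripts` are assumptions so the sketch runs anywhere; on a live node you'd point these at `qstat -j <jobid>` output and the real spool path instead.

```shell
#!/bin/sh
# Check 1: does the job request an h_vmem limit?
# On a live cluster:  qstat -j <jobid> | grep h_vmem
# Here we parse a sample line in the assumed qstat output format.
sample='hard resource_list:         h_vmem=4G'
if printf '%s\n' "$sample" | grep -q 'h_vmem='; then
    echo "job requests h_vmem"
else
    echo "no h_vmem request"
fi

# Check 2: was the job script delivered to the node's local spool?
# Real path would look like /var/spool/sge/qb3cell/iq237/job_scripts/8568851;
# a temp directory stands in here so the sketch is self-contained.
spool=$(mktemp -d)/job_scripts
mkdir -p "$spool"
: > "$spool/8568851"          # simulate a delivered job script
if [ -f "$spool/8568851" ]; then
    echo "job script present"
else
    echo "job script missing"
fi
```

Running the first check on a node right after a task fails (and comparing against a node where a sibling task succeeded) would at least tell you whether the script is vanishing before or after delivery.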
--
-- Skylar Thompson ([email protected])
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
