[gridengine users] (Seemingly) Random failures of OpenMPI jobs

Joshua Baker-LePain Tue, 07 Jan 2014 15:00:09 -0800

We're running OGS/GE 2011.11p1 on top of fully updated CentOS 6 on a cluster with ~650 nodes. Spool directories are local to the nodes. Our jobs are primarily serial, but with some parallel usage. One user has been having issues with random tasks of parallel array jobs failing, and I'm having trouble tracking it down.

The application is compiled and running against an almost stock version of OpenMPI. That is, the stock openmpi in C6 is 1.5.4. I used the same SRPM but upgraded to 1.5.5 to get a patch which fixes issues in our environment (which includes multiple queues). Note that the failing jobs are submitted to a single queue. OpenMPI is compiled --with-sge, and the PE looks like this:


$ qconf -sp ompi
pe_name            ompi
slots              5000
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

The user typically submits an array of 41 tasks, each requesting 50 slots. Some tasks run, but some fail with this message in qstat (which then puts the queue on that host in QERROR):


error reason   10:          01/07/2014 14:34:17 [11511:22892]: unable to find job file "/var/spool/sge/qb3cell/iq237/job_scripts/8568851

Similarly, in the qmaster messages file:

01/07/2014 14:35:56|worker|sortinghat|W|job 8568851.10 failed on host iq237 general before job because: 01/07/2014 14:34:17 [11511:22892]: unable to find job file "/var/spool/sge/qb3cell/iq237/job_scripts/8568851"
01/07/2014 14:35:56|worker|sortinghat|W|rescheduling job 8568851.10
01/07/2014 14:35:56|worker|sortinghat|E|queue lab.q marked QERROR as result of job 8568851's failure at host iq237

Often, when I go look at a node that gave that error message, I'll find other tasks from the same array job running there. But I can't confirm that it *always* happens. The rescheduled task generally ends up running anyway.
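
One way to check whether the failures cluster on particular hosts would be to tally the "failed on host" warnings from the qmaster messages file. The following is only a minimal sketch: it assumes the pipe-delimited format shown in the excerpt from the messages file, with each entry on a single line in the actual file, and the function name tally_failures and the regular expression are illustrative, not part of Grid Engine.

```python
import re
from collections import Counter

# Matches warnings such as:
#   ...|W|job 8568851.10 failed on host iq237 general before job because: ...
# Assumption: the ID is "<job>.<task>" and each log entry is a single line.
FAILED_RE = re.compile(r"\|W\|job (\d+)\.(\d+) failed on host (\S+)")


def tally_failures(lines):
    """Count failed tasks per host and per (job, host) pair."""
    per_host = Counter()
    per_job_host = Counter()
    for line in lines:
        match = FAILED_RE.search(line)
        if match is None:
            continue  # not a "failed on host" warning
        job, _task, host = match.groups()
        per_host[host] += 1
        per_job_host[(job, host)] += 1
    return per_host, per_job_host
```

Passing an open file handle for the messages file to tally_failures and printing per_host.most_common() would list the hosts with the most failures first, which could then be compared against where the other tasks of the same array job were running.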


Any ideas as to how to track this down?  I'm a bit stumped...

Thanks.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
