We're running OGS/GE 2011.11p1 on top of fully updated CentOS 6 on a cluster with ~650 nodes. Spool directories are local to the nodes. Our jobs are primarily serial, but with some parallel usage. One user has been having issues with random tasks of parallel array jobs failing, and I'm having trouble tracking it down.

The application is compiled and running against an almost-stock version of OpenMPI. The stock openmpi package in C6 is 1.5.4; I used the same SRPM but upgraded to 1.5.5 to get a patch that fixes issues in our environment (which includes multiple queues). Note that the failing jobs are submitted to a single queue. OpenMPI is compiled --with-sge, and the PE looks like this:

$ qconf -sp ompi
pe_name            ompi
slots              5000
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

The user typically submits an array of 41 tasks, each requesting 50 slots. Some tasks run, but some fail with this message in qstat (which then puts the queue on that host in QERROR):

error reason   10:          01/07/2014 14:34:17 [11511:22892]: unable to find job file "/var/spool/sge/qb3cell/iq237/job_scripts/8568851

Similarly, in the qmaster messages file:

01/07/2014 14:35:56|worker|sortinghat|W|job 8568851.10 failed on host iq237 general before job because: 01/07/2014 14:34:17 [11511:22892]: unable to find job file "/var/spool/sge/qb3cell/iq237/job_scripts/8568851"
01/07/2014 14:35:56|worker|sortinghat|W|rescheduling job 8568851.10
01/07/2014 14:35:56|worker|sortinghat|E|queue lab.q marked QERROR as result of job 8568851's failure at host iq237
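
One rough way to see whether these failures cluster on particular hosts is to tally them from the qmaster messages file. The sketch below is only an illustration: the function name count_jobfile_failures is made up, it assumes each log entry occupies a single line in the file on disk, and the path to the messages file (passed as the first argument) depends on how the qmaster spool is set up at a given site.

```shell
#!/bin/sh
# Sketch: count "unable to find job file" failures per execution host.
# Assumes one log entry per line, in the pipe-delimited format quoted above.
# count_jobfile_failures is a hypothetical helper name, not part of SGE.
count_jobfile_failures() {
    # $1: path to a qmaster messages file (location varies by site)
    awk '
        /unable to find job file/ && match($0, /failed on host [^ ]+/) {
            # "failed on host " is 15 characters; the rest of the match is the host
            host = substr($0, RSTART + 15, RLENGTH - 15)
            count[host]++
        }
        END { for (h in count) print count[h], h }
    ' "$1" | sort -k1,1nr -k2,2
}

# Example call (the path here is an assumption):
# count_jobfile_failures /path/to/qmaster/messages
```

Each output line is a failure count followed by a host name, with the most affected hosts first. Comparing that list against where other tasks of the same array job were running at the time may help confirm or rule out a pattern.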

Often, when I look at a node that gave that error message, I find other tasks from the same array job running there, but I can't confirm that this *always* happens. The rescheduled task generally ends up running anyway.

Any ideas on how to track this down? I'm a bit stumped...

Thanks.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF