We're running OGS/GE 2011.11p1 on top of fully updated CentOS 6 on a
cluster with ~650 nodes. Spool directories are local to the nodes. Our
jobs are primarily serial, but with some parallel usage. One user has
been having issues with random tasks of parallel array jobs failing, and
I'm having trouble tracking it down.
The application is compiled and running against a nearly stock version
of OpenMPI. The stock openmpi in C6 is 1.5.4; I rebuilt the same SRPM
at 1.5.5 to pick up a patch which fixes issues in our environment
(which includes multiple queues). Note that the failing jobs are
submitted to a single queue. OpenMPI is compiled --with-sge, and the
PE looks like this:
$ qconf -sp ompi
pe_name ompi
slots 5000
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
The user typically submits an array of 41 tasks, each requesting 50 slots.
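For reference, the submission would look roughly like this (the script
name is a placeholder; the PE name and task/slot counts are as described
above):

```shell
# Hypothetical submission matching the description above:
# 41 array tasks, each requesting 50 slots from the ompi PE,
# all going to the one queue. "mpi_job.sh" is a placeholder name.
qsub -t 1-41 -pe ompi 50 -q lab.q mpi_job.sh
```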
Some tasks run, but some fail with this message in qstat (which then puts
the queue on that host in QERROR):
error reason 10: 01/07/2014 14:34:17 [11511:22892]: unable to find job
file "/var/spool/sge/qb3cell/iq237/job_scripts/8568851
Similarly, in the qmaster messages file:
01/07/2014 14:35:56|worker|sortinghat|W|job 8568851.10 failed on host iq237 general
before job because: 01/07/2014 14:34:17 [11511:22892]: unable to find job file
"/var/spool/sge/qb3cell/iq237/job_scripts/8568851"
01/07/2014 14:35:56|worker|sortinghat|W|rescheduling job 8568851.10
01/07/2014 14:35:56|worker|sortinghat|E|queue lab.q marked QERROR as result of
job 8568851's failure at host iq237
Often, when I go look at a node that gave that error message, I'll
find other tasks from the same array job running there, though I can't
confirm that it *always* happens. The rescheduled task generally ends
up running anyway.
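One thing that might help narrow it down is checking, on a suspect node,
whether the job script is actually present in the local spool around the
time the error fires. A minimal sketch (the spool layout is taken from
the path in the error message; the helper name is mine):

```shell
# check_job_script: report whether a job's script file is present in a
# node's local spool. Usage: check_job_script SPOOL_DIR JOB_ID
# The layout (<spool_dir>/job_scripts/<job_id>) matches the path in the
# error message; adjust the base path for your cell and host.
check_job_script() {
    if [ -f "$1/job_scripts/$2" ]; then
        echo "present: $1/job_scripts/$2"
    else
        echo "missing: $1/job_scripts/$2"
    fi
}

# Example invocation on a node (path from the error above):
# check_job_script /var/spool/sge/qb3cell/iq237 8568851
```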
Any ideas as to how to track this down? I'm a bit stumped...
Thanks.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users