On 07.01.2014 at 23:57, Joshua Baker-LePain wrote:

> We're running OGS/GE 2011.11p1 on top of fully updated CentOS 6 on a cluster 
> with ~650 nodes.  Spool directories are local to the nodes.  Our jobs are 
> primarily serial, but with some parallel usage.  One user has been having 
> issues with random tasks of parallel array jobs failing, and I'm having 
> trouble tracking it down.
> 
> The application is compiled and running against almost the stock version of 
> OpenMPI.  That is, the stock openmpi in C6 is 1.5.4.  I used the same SRPM 
> but upgraded to 1.5.5 to get a patch which fixes issues in our environment 
> (which includes multiple queues).

I never noticed it before: the job script is named only after the job_id, 
without the task ID attached (unlike the directory in "active_jobs"). So it 
could be a race condition: one of the tasks has just finished and the file was 
removed before the next task on the same node started successfully.
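
To look for this on a node, one could compare the shared job script with the 
per-task directories. This is only a minimal sketch: check_job_spool is a 
hypothetical helper, and it assumes the task directories in "active_jobs" are 
named job_id.task_id. The spool path and job id in the usage line are the ones 
from the error message quoted below.

```shell
# Hypothetical helper (not part of Grid Engine): report whether the shared
# job script exists and which per-task directories are present.
# Assumes active_jobs/<job_id>.<task_id> naming.
check_job_spool() {
    spool_dir=$1   # execd spool directory of the node
    job_id=$2      # job id without the task number
    if [ -e "$spool_dir/job_scripts/$job_id" ]; then
        echo "job script present: $spool_dir/job_scripts/$job_id"
    else
        echo "job script MISSING: $spool_dir/job_scripts/$job_id"
    fi
    for d in "$spool_dir/active_jobs/$job_id".*; do
        # The glob stays unexpanded when nothing matches, hence the -d test.
        [ -d "$d" ] && echo "active task: ${d##*/}"
    done
    return 0
}
```

Usage on the affected node would be something like 
`check_job_spool /var/spool/sge/qb3cell/iq237 8568851`. A missing job script 
while task directories of the same job are still listed would fit the race 
described above.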

Can you run a test: use a fixed allocation_rule so that each job always gets 
complete nodes. The requested number of slots must then be a multiple of it, 
though.
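
As a sketch of such a test, with the value 10 being purely illustrative (it 
should be the number of slots per node, which I don't know for your cluster) 
and job.sh being a placeholder for the user's script:

```
$ qconf -mp ompi        # opens the PE in an editor; change only this line:
allocation_rule    10

$ qsub -t 1-41 -pe ompi 50 job.sh
```

With a rule of 10, each 50-slot task would be spread over exactly 5 hosts with 
10 slots each, so no two tasks of the array should share a node as long as the 
nodes have no more than 10 slots.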

-- Reuti

NB: What about using a plain 1.6.5, compiled from the original source? I 
never feel safe with the odd-numbered feature releases of Open MPI.


>  Note that the failing jobs are submitted to a single queue.  Openmpi is 
> compiled --with-sge, and the PE looks like this:
> 
> $ qconf -sp ompi
> pe_name            ompi
> slots              5000
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
> 
> The user typically submits an array of 41 tasks, each requesting 50 slots. 
> Some tasks run, but some fail with this message in qstat (which then puts the 
> queue on that host in QERROR):
> 
> error reason   10:          01/07/2014 14:34:17 [11511:22892]: unable to find 
> job file "/var/spool/sge/qb3cell/iq237/job_scripts/8568851
> 
> Similarly, in the qmaster messages file:
> 
> 01/07/2014 14:35:56|worker|sortinghat|W|job 8568851.10 failed on host iq237 
> general before job because: 01/07/2014 14:34:17 [11511:22892]: unable to find 
> job file "/var/spool/sge/qb3cell/iq237/job_scripts/8568851"
> 01/07/2014 14:35:56|worker|sortinghat|W|rescheduling job 8568851.10
> 01/07/2014 14:35:56|worker|sortinghat|E|queue lab.q marked QERROR as result 
> of job 8568851's failure at host iq237
> 
> Often times when I go look at a node that gave that error message, I'll find 
> other tasks from the same array job running there.  But I can't confirm that 
> it *always* happens.  The rescheduled task generally ends up running anyway.
> 
> Any ideas as to how to track this down?  I'm a bit stumped...
> 
> Thanks.
> 
> -- 
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

