On 07.01.2014 at 23:57, Joshua Baker-LePain wrote:

> We're running OGS/GE 2011.11p1 on top of fully updated CentOS 6 on a cluster
> with ~650 nodes. Spool directories are local to the nodes. Our jobs are
> primarily serial, but with some parallel usage. One user has been having
> issues with random tasks of parallel array jobs failing, and I'm having
> trouble tracking it down.
>
> The application is compiled and running against almost the stock version of
> OpenMPI. That is, the stock openmpi in C6 is 1.5.4. I used the same SRPM
> but upgraded to 1.5.5 to get a patch which fixes issues in our environment
> (which includes multiple queues).

I never noticed it before: the job script is stored under the bare job_id, without the task ID attached (unlike the directory in "active_jobs"). So this could be a race condition: one of the tasks just finished and the file was removed before the next one started successfully.

Can you run a test? Use a fixed allocation_rule so that you always get complete nodes. The requested number of slots must then be a multiple of it, though.

-- Reuti

NB: What about using a plain 1.6.5, compiled from the original source? I never feel safe with the odd-numbered feature releases of Open MPI.

> Note that the failing jobs are submitted to a single queue. Openmpi is
> compiled --with-sge, and the PE looks like this:
>
> $ qconf -sp ompi
> pe_name            ompi
> slots              5000
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
>
> The user typically submits an array of 41 tasks, each requesting 50 slots.
> Some tasks run, but some fail with this message in qstat (which then puts the
> queue on that host in QERROR):
>
> error reason 10: 01/07/2014 14:34:17 [11511:22892]: unable to find
> job file "/var/spool/sge/qb3cell/iq237/job_scripts/8568851
>
> Similarly, in the qmaster messages file:
>
> 01/07/2014 14:35:56|worker|sortinghat|W|job 8568851.10 failed on host iq237
> general before job because: 01/07/2014 14:34:17 [11511:22892]: unable to find
> job file "/var/spool/sge/qb3cell/iq237/job_scripts/8568851"
> 01/07/2014 14:35:56|worker|sortinghat|W|rescheduling job 8568851.10
> 01/07/2014 14:35:56|worker|sortinghat|E|queue lab.q marked QERROR as result
> of job 8568851's failure at host iq237
>
> Often times when I go look at a node that gave that error message, I'll find
> other tasks from the same array job running there. But I can't confirm that
> it *always* happens. The rescheduled task generally ends up running anyway.
>
> Any ideas as to how to track this down? I'm a bit stumped...
>
> Thanks.
>
> --
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
