Quick question: are you limiting memory usage for the job (i.e. h_vmem)?

On Tue, Jan 07, 2014 at 02:57:00PM -0800, Joshua Baker-LePain wrote:
> We're running OGS/GE 2011.11p1 on top of fully updated CentOS 6 on a
> cluster with ~650 nodes. Spool directories are local to the nodes.
> Our jobs are primarily serial, but with some parallel usage. One
> user has been having issues with random tasks of parallel array jobs
> failing, and I'm having trouble tracking it down.
>
> The application is compiled and running against almost the stock
> version of OpenMPI. That is, the stock openmpi in C6 is 1.5.4. I
> used the same SRPM but upgraded to 1.5.5 to get a patch which fixes
> issues in our environment (which includes multiple queues). Note
> that the failing jobs are submitted to a single queue. OpenMPI is
> compiled --with-sge, and the PE looks like this:
>
> $ qconf -sp ompi
> pe_name            ompi
> slots              5000
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
>
> The user typically submits an array of 41 tasks, each requesting 50
> slots.
> Some tasks run, but some fail with this message in qstat (which then
> puts the queue on that host in QERROR):
>
> error reason 10: 01/07/2014 14:34:17 [11511:22892]: unable to find
> job file "/var/spool/sge/qb3cell/iq237/job_scripts/8568851"
>
> Similarly, in the qmaster messages file:
>
> 01/07/2014 14:35:56|worker|sortinghat|W|job 8568851.10 failed on host iq237
> general before job because: 01/07/2014 14:34:17 [11511:22892]: unable to find
> job file "/var/spool/sge/qb3cell/iq237/job_scripts/8568851"
> 01/07/2014 14:35:56|worker|sortinghat|W|rescheduling job 8568851.10
> 01/07/2014 14:35:56|worker|sortinghat|E|queue lab.q marked QERROR as result
> of job 8568851's failure at host iq237
>
> Often, when I go look at a node that gave that error message, I'll
> find other tasks from the same array job running there. But I can't
> confirm that it *always* happens. The rescheduled task generally
> ends up running anyway.
>
> Any ideas as to how to track this down? I'm a bit stumped...
>
> Thanks.
>
> --
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
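For what it's worth, here is a minimal sketch of the two checks implied above: whether the job actually requests an h_vmem limit, and whether the job script file made it into the node's local spool. The `hard resource_list` line format and the use of a temp directory as a stand-in for `/var/spool/sge/<cell>/<host>/job_scripts` are assumptions so the sketch runs anywhere; on a live node you'd point these at `qstat -j <jobid>` output and the real spool path instead.

```shell
#!/bin/sh
# Check 1: does the job request an h_vmem limit?
# On a live cluster:  qstat -j <jobid> | grep h_vmem
# Here we parse a sample line in the assumed qstat output format.
sample='hard resource_list:         h_vmem=4G'
if printf '%s\n' "$sample" | grep -q 'h_vmem='; then
    echo "job requests h_vmem"
else
    echo "no h_vmem request"
fi

# Check 2: was the job script delivered to the node's local spool?
# Real path would look like /var/spool/sge/qb3cell/iq237/job_scripts/8568851;
# a temp directory stands in here so the sketch is self-contained.
spool=$(mktemp -d)/job_scripts
mkdir -p "$spool"
: > "$spool/8568851"          # simulate a delivered job script
if [ -f "$spool/8568851" ]; then
    echo "job script present"
else
    echo "job script missing"
fi
```

Running the first check on a node right after a task fails (and comparing against a node where a sibling task succeeded) would at least tell you whether the script is vanishing before or after delivery.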
--
-- Skylar Thompson ([email protected])
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
