On Wed, 4 Apr 2012 at 12:12am, Reuti wrote:
Sometimes a job will report that error and seem to still run, and
other times it won't report the error but will fail.
The error from the job is different from a timeout. What is it, in detail?
These jobs are submitted with "-sync y". For jobs that fail, qsub
reports "Unable to run job $JOBID". The SGE error logs of those jobs
usually (but not always) contain commlib errors, but they always
contain the following Open MPI errors:
[opt53:20930] [[6569,0],114] routed:binomial: Connection to lifeline [[6569,0],0] lost
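
To see which hosts lose their lifeline across many failed jobs, the message above can be pulled apart with a regular expression. This is only a sketch: the field labels (host, pid, jobfam, job, vpid) are names chosen here for illustration, and reading `[[6569,0],0]` as the mpirun process and `[[6569,0],114]` as a remote daemon is an assumption about Open MPI's process naming, not something confirmed in this thread.

```python
import re

# Pattern for the "lifeline lost" message quoted above.
# The group names are illustrative labels, not Open MPI terminology.
LIFELINE_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+"
    r"\[\[(?P<jobfam>\d+),(?P<job>\d+)\],(?P<vpid>\d+)\]\s+"
    r"routed:binomial: Connection to lifeline\s+"
    r"\[\[(?P<l_jobfam>\d+),(?P<l_job>\d+)\],(?P<l_vpid>\d+)\]\s+lost"
)


def parse_lifeline_errors(text):
    """Return one dict per 'lifeline lost' message found in text."""
    # Collapse whitespace so a message wrapped over two lines still matches.
    flat = " ".join(text.split())
    return [
        {k: (v if k == "host" else int(v)) for k, v in m.groupdict().items()}
        for m in LIFELINE_RE.finditer(flat)
    ]
```

Running this over the error logs of the failed jobs and counting the `host` values would show whether the losses cluster on particular nodes.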
Are you running your jobs across more than one queue? IIRC, there was a
recent report on the Open MPI mailing list with similar output, triggered
when the hostfile contains more than one queue per machine.
Heh. That was me, and I'm running version 1.5.5 of Open MPI, which
includes the fix for the multiple queue issue. And this issue is
completely separate from that one anyway -- that issue caused the MPI
spawned processes to segfault, which isn't happening here.
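
For anyone who wants to rule out the multiple-queue case on their own cluster, a quick check of the job's hostfile could look like the sketch below. It assumes the usual `$PE_HOSTFILE` layout of one `host slots queue@host processor-range` entry per line; the function name and the sample hostnames are made up for illustration.

```python
from collections import defaultdict


def hosts_with_multiple_queues(hostfile_text):
    """Map each host that appears under more than one queue to its queues."""
    queues = defaultdict(set)
    for line in hostfile_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        host, queue_instance = fields[0], fields[2]
        # Third field is assumed to look like "queue@host"; keep the queue part.
        queues[host].add(queue_instance.split("@", 1)[0])
    return {h: sorted(q) for h, q in queues.items() if len(q) > 1}


# Hypothetical hostfile contents, for illustration only.
example = """\
node01 4 long.q@node01 UNDEFINED
node01 2 short.q@node01 UNDEFINED
node02 4 long.q@node02 UNDEFINED
"""
print(hosts_with_multiple_queues(example))
```

Inside a running job, the same function could be fed the contents of the file named by the `PE_HOSTFILE` environment variable. An empty result means no machine was granted slots from more than one queue.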
So we have two issues: for SGE it's between a slave and the master
machines. But for your job it's between the slaves - right?
Yes. We have the SGE commlib errors, and the Open MPI "routed:binomial"
errors. I'm mainly focusing on the SGE problem right now, as I think
(hope) that fixing that will also fix the MPI issue.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF