On 04.04.2012 at 00:19, Joshua Baker-LePain wrote:

> On Wed, 4 Apr 2012 at 12:12am, Reuti wrote
>
>>>>> Sometimes a job will report that error and seem to still run, and other
>>>>> times it won't report the error but will fail.
>>>>
>>>> The error from the job is different from a timeout - what in detail?
>>>
>>> These jobs are submitted with "-sync y". For jobs that fail, qsub reports
>>> "Unable to run job $JOBID". The SGE error logs of those jobs usually (but
>>> not always) contain commlib errors, but they always contain the following
>>> Open MPI errors:
>>>
>>> [opt53:20930] [[6569,0],114] routed:binomial: Connection to lifeline
>>> [[6569,0],0] lost
>>
>> Are you running your jobs across more than one queue? There was an issue
>> recently when the hostfile contains more than one queue per machine on the
>> Open MPI mailing list with a similar output IIRC.
>
> Heh. That was me, and I'm running version 1.5.5 of Open MPI, which includes
> the fix for the multiple queue issue. And this issue is completely separate
> from that one anyway -- that issue caused the MPI spawned processes to
> segfault, which isn't happening here.

Not for my tests regarding this issue. The jobs ran, but only a part of the granted slots was used, and at the end I got this message: "Connection to lifeline...".

>> So we have two issues: for SGE it's between a slave and the master machines.
>> But for your job it's between the slaves - right?
>
> Yes. We have the SGE commlib errors, and the Open MPI "routed:binomial"
> errors. I'm mainly focusing on the SGE problem right now, as I think (hope)
> that fixing that will also fix the MPI issue.

Does it also happen with an mpihello job?

-- Reuti

> --
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF
