Re: [gridengine users] Parallel jobs failure after OS upgrade

Joshua Baker-LePain Tue, 03 Apr 2012 16:11:09 -0700

On Wed, 4 Apr 2012 at 12:30am, Reuti wrote

On 04.04.2012 at 00:19, Joshua Baker-LePain wrote:
On Wed, 4 Apr 2012 at 12:12am, Reuti wrote
Are you running your jobs across more than one queue? There was an issue recently on the Open MPI mailing list, when the hostfile contains more than one queue per machine, with a similar output IIRC.
Heh. That was me, and I'm running version 1.5.5 of Open MPI, which includes the fix for the multiple queue issue. And this issue is completely separate from that one anyway -- that issue caused the MPI spawned processes to segfault, which isn't happening here.
Not for my tests regarding this issue. The jobs ran, but only a part of the granted slots were used; and at the end I got this message: "Connection to lifeline...".
So we have two issues: for SGE it's between a slave and the master machines. But for your job it's between the slaves - right?
Yes. We have the SGE commlib errors, and the Open MPI "routed:binomial" errors. I'm mainly focusing on the SGE problem right now, as I think (hope) that fixing that will also fix the MPI issue.
Does it also happen with an mpihello job?

Actually, yes. I see commlib errors in jobs which successfully complete, and in those I do *not* see "Connection to lifeline" errors. Those latter errors pop up when a hung job hits h_rt and gets killed by SGE. So I think those are more a symptom than a cause.
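
To check that pattern over a larger batch of job output files, something like the following rough sketch could help. It only looks for the two strings mentioned in this thread ("commlib" and "Connection to lifeline"); the function name and the labels it returns are made up for illustration, and real log lines may be worded differently.

```python
import re

# Markers taken from the thread; the exact wording in real logs may differ.
COMMLIB = re.compile(r"commlib", re.IGNORECASE)
LIFELINE = re.compile(r"Connection to lifeline", re.IGNORECASE)


def classify_job_output(text):
    """Return a coarse, hypothetical label for one job's stdout/stderr text."""
    has_commlib = bool(COMMLIB.search(text))
    has_lifeline = bool(LIFELINE.search(text))
    if has_lifeline:
        # Seen here only on hung jobs that were killed at h_rt.
        return "lifeline+commlib" if has_commlib else "lifeline-only"
    return "commlib-only" if has_commlib else "clean"
```

If the observation above holds, completed jobs should come out as "commlib-only" or "clean", and only the killed ones should carry a lifeline label.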

So the main questions remain a) why am I seeing these commlib errors and b) why do some jobs run anyway while others fail? I'm assuming that the latter is due to SGE retrying the qrsh call a limited number of times.
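
To illustrate the retry assumption in b), here is a toy simulation, not SGE's actual logic. The retry limit, the per-attempt failure probability, and the idea that each slave launch fails independently are all assumptions for the sake of the sketch; the function names are hypothetical.

```python
import random


def launch_with_retries(attempt_once, max_attempts):
    """Call attempt_once() until it returns True or the attempts run out."""
    for _ in range(max_attempts):
        if attempt_once():
            return True
    return False


def job_success_rate(n_slaves, p_fail, max_attempts, trials=10000, seed=0):
    """Estimate how often every slave launch of a job succeeds.

    p_fail is the assumed chance that a single launch attempt fails;
    max_attempts is the assumed retry limit per slave launch.
    """
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        # A job counts as started only if all of its slave launches succeed.
        if all(
            launch_with_retries(lambda: rng.random() >= p_fail, max_attempts)
            for _ in range(n_slaves)
        ):
            ok += 1
    return ok / trials
```

Under these assumptions, a flaky connection plus a small retry limit would let most jobs start while a fraction still hang, and jobs spanning more slaves would fail more often. That at least matches the mix of successes and failures I'm seeing.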


--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF