On Wed, 4 Apr 2012 at 12:30am, Reuti wrote
On 04.04.2012 at 00:19, Joshua Baker-LePain wrote:
On Wed, 4 Apr 2012 at 12:12am, Reuti wrote
Are you running your jobs across more than one queue? IIRC, an issue
was reported recently on the Open MPI mailing list, with similar output,
that occurs when the hostfile contains more than one queue per machine.
Heh. That was me, and I'm running version 1.5.5 of Open MPI, which
includes the fix for the multiple queue issue. And this issue is
completely separate from that one anyway -- that issue caused the MPI
spawned processes to segfault, which isn't happening here.
Not in my tests of this issue. The jobs ran, but used only part of
the granted slots; and at the end I got this message
"Connection to lifeline...".
So we have two issues: for SGE it's between a slave and the master
machine. But for your job it's between the slaves - right?
Yes. We have the SGE commlib errors, and the Open MPI
"routed:binomial" errors. I'm mainly focusing on the SGE problem right
now, as I think (hope) that fixing that will also fix the MPI issue.
Does it also happen with an mpihello job?
Actually, yes. I see commlib errors in jobs which successfully complete,
and in those I do *not* see "Connection to lifeline" errors. Those latter
errors pop up when a hung job hits h_rt and gets killed by SGE. So I
think those are more a symptom than a cause.
So the main questions remain a) why am I seeing these commlib errors and
b) why do some jobs run anyway while others fail? I'm assuming that the
latter is due to SGE retrying the qrsh call a limited number of times.
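
To check the pattern described above (commlib errors in jobs that complete,
"Connection to lifeline" only in jobs that hang), something like the rough
Python sketch below could tally the messages per job output file. It matches
only the substrings quoted in this thread and assumes nothing about the rest
of each log line. The function names and the three labels are made up for
illustration; they are not part of SGE or Open MPI.

```python
import sys
from collections import Counter

# Substrings quoted in this thread. The full text of the real log lines
# is not assumed, so matching is by case-insensitive substring only.
MARKERS = ("commlib", "routed:binomial", "Connection to lifeline")


def tally_markers(lines):
    """Count how many lines mention each marker."""
    counts = Counter({marker: 0 for marker in MARKERS})
    for line in lines:
        lowered = line.lower()
        for marker in MARKERS:
            if marker.lower() in lowered:
                counts[marker] += 1
    return counts


def classify(counts):
    """Assign a hypothetical label to one job's output.

    "lifeline"     -> lifeline message present (seen here in hung jobs)
    "commlib-only" -> commlib errors but no lifeline message
    "clean"        -> neither kind of message
    """
    if counts["Connection to lifeline"] > 0:
        return "lifeline"
    if counts["commlib"] > 0:
        return "commlib-only"
    return "clean"


if __name__ == "__main__":
    # Usage: python tally.py job1.out job2.out ...
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            counts = tally_markers(handle)
        print(path, classify(counts), dict(counts))
```

Run over the output files of both completed and killed jobs, this would show
whether any "commlib-only" job ever hung, which bears on question b).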
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF