On Wed, 4 Apr 2012 at 12:30am, Reuti wrote

On 04.04.2012 at 00:19, Joshua Baker-LePain wrote:

On Wed, 4 Apr 2012 at 12:12am, Reuti wrote

Are you running your jobs across more than one queue? There was a recent issue on the Open MPI mailing list, with similar output IIRC, when the hostfile contains more than one queue per machine.

Heh. That was me, and I'm running version 1.5.5 of Open MPI, which includes the fix for the multiple queue issue. And this issue is completely separate from that one anyway -- that issue caused the MPI spawned processes to segfault, which isn't happening here.

Not for my tests regarding this issue. The jobs ran, but only a part of the granted slots were used; and at the end I got this message "Connection to lifeline...".


So we have two issues: for SGE it's between a slave and the master machine. But for your job it's between the slaves - right?

Yes. We have the SGE commlib errors, and the Open MPI "routed:binomial" errors. I'm mainly focusing on the SGE problem right now, as I think (hope) that fixing that will also fix the MPI issue.

Does it also happen with an mpihello job?

Actually, yes. I see commlib errors in jobs which successfully complete, and in those I do *not* see "Connection to lifeline" errors. Those latter errors pop up when a hung job hits h_rt and gets killed by SGE. So I think those are more a symptom than a cause.

So the main questions remain a) why am I seeing these commlib errors and b) why do some jobs run anyway while others fail? I'm assuming that the latter is due to SGE retrying the qrsh call a limited number of times.
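
To make the assumption in (b) concrete, here is a toy model of a bounded retry. It is only a sketch and does not reflect SGE's actual qrsh handling: the function names, the retry limit, and the per-attempt failure probability are all hypothetical. It shows how, if each slave launch is retried only a fixed number of times, some jobs would start on every slave while others would exhaust their retries on at least one slave and hang.

```python
def launch_with_retries(attempt, max_tries):
    """Call attempt() up to max_tries times; return True on the first success.

    `attempt` is a hypothetical stand-in for one qrsh-style remote launch.
    """
    for _ in range(max_tries):
        if attempt():
            return True
    return False


def job_start_probability(p_fail, max_tries, n_slaves):
    """Chance that all slaves start, assuming independent attempts.

    p_fail    -- assumed probability that a single launch attempt fails
    max_tries -- assumed retry limit per slave
    n_slaves  -- number of slave launches the job needs
    """
    # One slave fails only if every one of its attempts fails.
    p_slave_ok = 1.0 - p_fail ** max_tries
    # The job starts only if every slave launch eventually succeeds.
    return p_slave_ok ** n_slaves


if __name__ == "__main__":
    for tries in (1, 2, 3):
        print(tries, job_start_probability(0.2, tries, 16))
```

Under these assumptions, wider jobs would be more likely to fail than narrow ones, which could be compared against which jobs actually hang.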

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
