On Wed, 4 Apr 2012 at 12:12am, Reuti wrote

Sometimes a job will report that error and seem to still run, and other times it won't report the error but will fail.

The error from the job is different from a timeout. What does it say in detail?

These jobs are submitted with "-sync y". For jobs that fail, qsub reports "Unable to run job $JOBID". The SGE error logs of those jobs usually (but not always) contain commlib errors, but they always contain the following Open MPI errors:

[opt53:20930] [[6569,0],114] routed:binomial: Connection to lifeline [[6569,0],0] lost
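
When many job logs have to be checked for this message, pulling out the reporting host and process can help show whether the failures cluster on particular nodes. The following is a minimal sketch, assuming the message always has the shape quoted above; the function name and the dictionary keys ("reporter", "lifeline") are illustrative labels chosen here, not Open MPI terminology.

```python
import re

# Pattern for the Open MPI "lifeline lost" message quoted above.
# Group names are illustrative labels, not official Open MPI terms.
LIFELINE_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+"
    r"\[\[\d+,\d+\],(?P<reporter>\d+)\]\s+"
    r"routed:binomial: Connection to lifeline\s+"
    r"\[\[\d+,\d+\],(?P<lifeline>\d+)\]\s+lost"
)


def parse_lifeline_errors(text):
    """Return one dict per 'lifeline lost' message found in text."""
    # Collapse all whitespace so a message wrapped across lines still matches.
    flat = " ".join(text.split())
    return [
        {
            "host": m["host"],
            "pid": int(m["pid"]),
            "reporter": int(m["reporter"]),
            "lifeline": int(m["lifeline"]),
        }
        for m in LIFELINE_RE.finditer(flat)
    ]


sample = """[opt53:20930] [[6569,0],114] routed:binomial: Connection to lifeline
[[6569,0],0] lost"""

for err in parse_lifeline_errors(sample):
    print(err)
```

Counting the "host" values across the logs of all failed jobs would then show whether a few machines account for most of the lost connections.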

Are you running your jobs across more than one queue? IIRC, there was a recent report on the Open MPI mailing list of an issue with similar output that occurs when the hostfile contains more than one queue per machine.

Heh. That was me, and I'm running version 1.5.5 of Open MPI, which includes the fix for the multiple queue issue. And this issue is completely separate from that one anyway -- that issue caused the MPI-spawned processes to segfault, which isn't happening here.
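
For anyone who wants to rule out the multiple-queue condition on their own jobs, the hostfile that SGE hands to a parallel job can be inspected directly. This is a rough sketch, assuming the usual $PE_HOSTFILE layout of one line per host and queue with the columns hostname, slot count, queue instance, and processor range; the function name and the queue names in the sample are made up for illustration.

```python
from collections import defaultdict


def hosts_with_multiple_queues(hostfile_text):
    """Map each host that appears under more than one queue to its queues."""
    queues = defaultdict(set)
    for line in hostfile_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        host, queue_instance = fields[0], fields[2]
        # The third column is assumed to look like "queue@host";
        # keep only the queue part.
        queues[host].add(queue_instance.split("@", 1)[0])
    return {h: sorted(q) for h, q in queues.items() if len(q) > 1}


# Hypothetical hostfile contents; the queue names are invented.
sample_hostfile = """\
opt53 4 long.q@opt53 UNDEFINED
opt53 2 short.q@opt53 UNDEFINED
opt54 8 long.q@opt54 UNDEFINED
"""

print(hosts_with_multiple_queues(sample_hostfile))
```

Inside a running job, the same function could be fed the contents of the file named by the PE_HOSTFILE environment variable.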

So we have two issues: for SGE, it's between a slave and the master machine. But for your job, it's between the slaves, right?

Yes. We have the SGE commlib errors, and the Open MPI "routed:binomial" errors. I'm mainly focusing on the SGE problem right now, as I think (hope) that fixing that will also fix the MPI issue.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF