On 04.04.2012 at 00:19, Joshua Baker-LePain wrote:

> On Wed, 4 Apr 2012 at 12:12am, Reuti wrote
> 
>>>>> Sometimes a job will report that error and seem to still run, and other 
>>>>> times it won't report the error but will fail.
>>>> 
>>>> The error from the job is different from a timeout - what in detail?
>>> 
>>> These jobs are submitted with "-sync y".  For jobs that fail, qsub reports 
>>> "Unable to run job $JOBID".  The SGE error logs of those jobs usually (but 
>>> not always) contain commlib errors, but they always contain the following 
>>> Open MPI errors:
>>> 
>>> [opt53:20930] [[6569,0],114] routed:binomial: Connection to lifeline 
>>> [[6569,0],0] lost
>> 
>> Are you running your jobs across more than one queue? IIRC there was an 
>> issue reported recently on the Open MPI mailing list, with similar output, 
>> when the hostfile contains more than one queue per machine.
> 
> Heh.  That was me, and I'm running version 1.5.5 of Open MPI, which includes 
> the fix for the multiple queue issue.  And this issue is completely separate 
> from that one anyway -- that issue caused the MPI spawned processes to 
> segfault, which isn't happening here.

Not in my tests regarding this issue. The jobs ran, but only a part of the 
granted slots was used; and at the end I got this message "Connection to 
lifeline...".
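
To see which hosts report the lost lifeline across many job error logs, the 
messages can be picked apart with a short script. This is only a minimal 
sketch: the pattern is modelled on the single example line quoted above, the 
function name parse_lifeline is made up for illustration, and the reading of 
the bracketed fields (job family, local job, process index) is my assumption 
about the Open MPI process name format.

```python
import re

# Pattern modelled on the one example line quoted in this thread; other
# Open MPI versions may format the message differently (assumption).
LIFELINE_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+"
    r"\[\[(?P<jobfam>\d+),(?P<job>\d+)\],(?P<proc>\d+)\]\s+"
    r"routed:binomial:\s+Connection to lifeline\s+"
    r"\[\[\d+,\d+\],(?P<lifeline>\d+)\]\s+lost"
)


def parse_lifeline(text):
    """Return one dict per 'Connection to lifeline ... lost' message in text."""
    # The message may be wrapped across lines, so collapse whitespace first.
    flat = " ".join(text.split())
    results = []
    for m in LIFELINE_RE.finditer(flat):
        results.append({
            "host": m.group("host"),
            "pid": int(m.group("pid")),
            "jobfam": int(m.group("jobfam")),
            "proc": int(m.group("proc")),          # reporting daemon/process
            "lifeline": int(m.group("lifeline")),  # process it lost contact with
        })
    return results


if __name__ == "__main__":
    sample = ("[opt53:20930] [[6569,0],114] routed:binomial: "
              "Connection to lifeline\n[[6569,0],0] lost")
    for entry in parse_lifeline(sample):
        print(entry["host"], entry["proc"], "->", entry["lifeline"])
```

Counting the "host" values over all failed jobs would show whether the same 
few nodes are always involved.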


>> So we have two issues: for SGE it's between a slave and the master machine. 
>> But for your job it's between the slaves - right?
> 
> Yes.  We have the SGE commlib errors, and the Open MPI "routed:binomial" 
> errors.  I'm mainly focusing on the SGE problem right now, as I think (hope) 
> that fixing that will also fix the MPI issue.

Does it also happen with an mpihello job?

-- Reuti

> -- 
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users