On Dec 16, 2010, at 3:29 AM, Gilbert Grosdidier wrote:
>> Does this problem *always* happen, or does it only happen once in a great
>> while?
>>
> gg= No, this problem happens rather often, almost every other time.
> Seems to happen more often as the number of cores increases.
Well, that's a …
Hello Jeff,
On 16/12/2010 01:40, Jeff Squyres wrote:
On Dec 15, 2010, at 3:24 PM, Ralph Castain wrote:
I am not using the TCP BTL, only the OPENIB one. Does this change the number of
sockets in use per node, please?
I believe the openib btl opens sockets for connection purposes, so the …
On Dec 15, 2010, at 1:11 PM, Gilbert Grosdidier wrote:
> Ralph,
>
> I am not using the TCP BTL, only the OPENIB one. Does this change the number
> of sockets in use per node, please?

I believe the openib btl opens sockets for connection purposes, so the count is
likely the same. An IB person can confirm …
Ralph,
I am not using the TCP BTL, only the OPENIB one. Does this change the
number of sockets in use per node, please?
But I suspect the ORTE daemons are communicating only through TCP
anyway, right?
Also, is there anybody in the OpenMPI team using an SGI Altix cluster
with a high number …
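Gilbert's question about the number of sockets per node can be put into rough numbers. The sketch below is a hypothetical back-of-the-envelope estimate, not Open MPI's actual accounting: it assumes one socket per communicating process pair, opened lazily, with every rank eventually talking to every other rank (fully connected worst case). The 16-ranks-per-node figure is an assumption, not stated in the thread.

```python
# Hypothetical worst-case estimate of per-node socket counts for a fully
# connected MPI job. ASSUMPTION: one socket per process pair, opened lazily;
# this is an illustration, not Open MPI's real bookkeeping.

def sockets_per_node(total_ranks: int, ranks_per_node: int) -> int:
    """Worst case: every local rank talks to every remote rank."""
    remote = total_ranks - ranks_per_node
    # each local rank holds one socket per remote peer, plus sockets
    # among local ranks (counted once per pair)
    local_pairs = ranks_per_node * (ranks_per_node - 1) // 2
    return ranks_per_node * remote + local_pairs

# e.g. the 4096-core job from this thread, assuming 16 ranks per node
print(sockets_per_node(4096, 16))  # 65400
```

At this scale the worst case runs well past typical per-process file-descriptor limits, which is one reason connections are opened lazily rather than eagerly.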
On Dec 15, 2010, at 12:30 PM, Gilbert Grosdidier wrote:
> Good evening Ralph,
>
> On 15/12/2010 18:45, Ralph Castain wrote:
>> It looks like all the messages are flowing within a single job (all three
>> processes mentioned in the error have the same identifier). Only possibility
>> I can think of is that somehow you are reusing ports - is it possible your
>> system …
Hello Ralph,
Thanks for taking the time to help me.
On Dec 15, 2010, at 16:27, Ralph Castain wrote:

It would appear that there is something trying to talk to a socket opened by
one of your daemons. At a guess, I would bet the problem is that a prior job
left a daemon alive that is talking on the same socket.
Are you by chance using static ports for the job? Did you run another job just …
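The failure mode Ralph describes - a daemon from a prior job still holding a static port - can be modeled with two plain sockets contending for the same port. This is only an illustration of the collision, not Open MPI code; the port number is whatever the OS hands out first.

```python
# Minimal sketch of the stale-daemon scenario: a leftover listener holds a
# port, and a new process asking for the same "static" port fails with
# EADDRINUSE. Illustrative only; not Open MPI's OOB code.
import socket

stale = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
stale.bind(("127.0.0.1", 0))          # "leftover daemon" grabs a port
stale.listen(1)
port = stale.getsockname()[1]

fresh = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    fresh.bind(("127.0.0.1", port))   # new job asks for the same port
except OSError:
    print(f"port {port} already in use by a stale listener")
finally:
    fresh.close()
    stale.close()
```

The nastier variant is the one in this thread: the new job's connect() *succeeds*, but it reaches the old daemon, whose identity does not match what the connecting process expects.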
Hello,
Running with OpenMPI 1.4.3 on an SGI Altix cluster with 4096 cores, I got
this error message right at startup:

mca_oob_tcp_peer_recv_connect_ack: received unexpected process
identifier [[13816,0],209]

and the whole job then spins for an undefined period, without …
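For readers hitting the same message: it is raised during the out-of-band (OOB) TCP handshake, when the process name announced by the connecting peer does not match the name the receiver expected on that socket. The sketch below is an illustrative model of that check, not Open MPI source; the class and function names are invented, and the stale jobid 13817 is an assumed example.

```python
# Illustrative model (NOT Open MPI source) of the check behind the error:
# during the OOB TCP handshake each peer announces its process name
# (jobid, vpid); the receiver rejects a name that doesn't match the peer
# it believes owns that connection.
from typing import NamedTuple

class ProcName(NamedTuple):
    jobid: int
    vpid: int

def recv_connect_ack(expected: ProcName, received: ProcName) -> bool:
    """Return True if the announced peer name matches the expected one."""
    if received != expected:
        print(f"mca_oob_tcp_peer_recv_connect_ack: received unexpected "
              f"process identifier [[{received.jobid},0],{received.vpid}]")
        return False
    return True

# a stale daemon from an earlier mpirun (hypothetical old jobid 13816)
# answers a connection meant for the current job's daemon (jobid 13817)
ok = recv_connect_ack(ProcName(13817, 209), ProcName(13816, 209))
```

This is consistent with the stale-daemon explanation later in the thread: the connection itself succeeds, but the identity carried in the ack belongs to a different (old) job.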