Re: [OMPI users] Issue with : mca_oob_tcp_peer_recv_connect_ack on SGI Altix

2010-12-16 Thread Jeff Squyres
On Dec 16, 2010, at 3:29 AM, Gilbert Grosdidier wrote: >> Does this problem *always* happen, or does it only happen once in a great >> while? >> > gg= No, this problem happens rather often, almost every other time. > Seems to happen more often as the number of cores increases. Well that's a

Re: [OMPI users] Issue with : mca_oob_tcp_peer_recv_connect_ack on SGI Altix

2010-12-16 Thread Gilbert Grosdidier
Hello Jeff, On 16/12/2010 at 01:40, Jeff Squyres wrote: On Dec 15, 2010, at 3:24 PM, Ralph Castain wrote: I am not using the TCP BTL, only OPENIB one. Does this change the number of sockets in use per node, please ? I believe the openib btl opens sockets for connection purposes, so the

Re: [OMPI users] Issue with : mca_oob_tcp_peer_recv_connect_ack on SGI Altix

2010-12-15 Thread Jeff Squyres
On Dec 15, 2010, at 3:24 PM, Ralph Castain wrote: >> I am not using the TCP BTL, only OPENIB one. Does this change the number of >> sockets in use per node, please ? > > I believe the openib btl opens sockets for connection purposes, so the count > is likely the same. An IB person can confirm

Re: [OMPI users] Issue with : mca_oob_tcp_peer_recv_connect_ack on SGI Altix

2010-12-15 Thread Ralph Castain
On Dec 15, 2010, at 1:11 PM, Gilbert Grosdidier wrote: > Ralph, > > I am not using the TCP BTL, only OPENIB one. Does this change the number of > sockets in use per node, please ? I believe the openib btl opens sockets for connection purposes, so the count is likely the same. An IB person

Re: [OMPI users] Issue with : mca_oob_tcp_peer_recv_connect_ack on SGI Altix

2010-12-15 Thread Gilbert Grosdidier
Ralph, I am not using the TCP BTL, only OPENIB one. Does this change the number of sockets in use per node, please ? But I suspect the ORTE daemons are communicating only through TCP anyway, right ? Also, is there anybody in the OpenMPI team using an SGI Altix cluster with a high number

Re: [OMPI users] Issue with : mca_oob_tcp_peer_recv_connect_ack on SGI Altix

2010-12-15 Thread Ralph Castain
On Dec 15, 2010, at 12:30 PM, Gilbert Grosdidier wrote: > Good evening Ralph, > > On 15/12/2010 at 18:45, Ralph Castain wrote: >> It looks like all the messages are flowing within a single job (all three >> processes mentioned in the error have the same identifier). Only possibility >> I can think

Re: [OMPI users] Issue with : mca_oob_tcp_peer_recv_connect_ack on SGI Altix

2010-12-15 Thread Gilbert Grosdidier
Good evening Ralph, On 15/12/2010 at 18:45, Ralph Castain wrote: It looks like all the messages are flowing within a single job (all three processes mentioned in the error have the same identifier). Only possibility I can think of is that somehow you are reusing ports - is it possible your system

Re: [OMPI users] Issue with : mca_oob_tcp_peer_recv_connect_ack on SGI Altix

2010-12-15 Thread Gilbert Grosdidier
Hello Ralph, Thanks for taking time to help me. On 15 Dec. 10 at 16:27, Ralph Castain wrote: It would appear that there is something trying to talk to a socket opened by one of your daemons. At a guess, I would bet the problem is that a prior job left a daemon alive that is talking on

Re: [OMPI users] Issue with : mca_oob_tcp_peer_recv_connect_ack on SGI Altix

2010-12-15 Thread Ralph Castain
It would appear that there is something trying to talk to a socket opened by one of your daemons. At a guess, I would bet the problem is that a prior job left a daemon alive that is talking on the same socket. Are you by chance using static ports for the job? Did you run another job just

[OMPI users] Issue with : mca_oob_tcp_peer_recv_connect_ack on SGI Altix

2010-12-15 Thread Gilbert Grosdidier
Hello, Running with OpenMPI 1.4.3 on an SGI Altix cluster with 4096 cores, I got this error message, right at startup : mca_oob_tcp_peer_recv_connect_ack: received unexpected process identifier [[13816,0],209] and the whole job is going to spin for an undefined period, without