Re: [OMPI devel] Change in btl/tcp

2008-04-18 Thread Tim Prins
To echo what Josh said, there are no special compile flags being used. 
If you send me a patch with debug output, I'd be happy to run it for you.


Both odin and sif are fairly normal Linux-based clusters, with Ethernet
and openib IP networks. The Ethernet network has both IPv4 and IPv6, and
the openib network runs IPv4.


Tim

Adrian Knoth wrote:
> On Fri, Apr 18, 2008 at 01:00:40PM -0400, Josh Hursey wrote:
>
> > The trick is to force Open MPI to use only tcp,self and nothing else.
> > Did you try adding this (-mca btl tcp,self) to the runtime parameter
> > set?
>
> Sure. Even with 64 processes, I cannot trigger this behaviour. Neither
> on Linux nor Solaris.
>
> Any special compile flags?
>
> I guess a little bit more debug output could probably reveal the
> culprit.






Re: [OMPI devel] Change in btl/tcp

2008-04-18 Thread Josh Hursey
I'm seeing this problem as well even running just 4 processes on a  
single node (though not as frequently as with higher process counts).  
The trick is to force Open MPI to use only tcp,self and nothing else.  
Did you try adding this (-mca btl tcp,self) to the runtime parameter  
set?


-- Josh

On Apr 18, 2008, at 12:56 PM, Adrian Knoth wrote:


> On Fri, Apr 18, 2008 at 08:04:17AM -0400, Tim Prins wrote:
>
> > Hi Adrian,
>
> Hi!
>
> > After this change, I am getting a lot of errors of the form:
> > [sif2][[12854,1],9][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> > mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> >
> > See for instance: http://www.open-mpi.org/mtt/index.php?do_redir=615
>
> That's weird. I've tried hello_c.c on about ten machines with different
> network configurations, none of them showed any problems at all.
>
> Do you have a very special setup? And if need be, would it be possible
> to debug on your machine?
>
> From all MTT sites, this error only occurs on Odin and Sif. What's so
> special with these clusters?
>
> > I have found this especially easy to reproduce if I run 16 processes all
> > with just the tcp and self btls on the same machine, running the
> > 'hello_c' program in the examples directory.
>
> Unfortunately, I can't reproduce it that way. If this is related to the
> change, then it would mean that mca_btl_tcp_proc_accept() returns false,
> either after the large loop or in mca_btl_tcp_endpoint_accept().
>
> Do you have the cycles to add some BTL_VERBOSE-lines to see where things
> go wrong? Or even to step through with the debugger?
>
> If you want me to do it, I would provide you with my ssh key?
>
> Cheerio
>
> --
> mail: a...@thur.de  http://adi.thur.de  PGP/GPG: key via keyserver
>
> Das Sterben wird nur halb so schlimm, rauchst du KIM.




Re: [OMPI devel] Change in btl/tcp

2008-04-18 Thread Adrian Knoth
On Fri, Apr 18, 2008 at 08:04:17AM -0400, Tim Prins wrote:

> Hi Adrian,

Hi!

> After this change, I am getting a lot of errors of the form:
> [sif2][[12854,1],9][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] 
> mca_btl_tcp_frag_recv: readv failed: Connection reset by
> peer (104)
> 
> See for instance: http://www.open-mpi.org/mtt/index.php?do_redir=615

That's weird. I've tried hello_c.c on about ten machines with different
network configurations, none of them showed any problems at all.

Do you have a very special setup? And if need be, would it be possible
to debug on your machine?


From all MTT sites, this error only occurs on Odin and Sif. What's so
special with these clusters?

> I have found this especially easy to reproduce if I run 16 processes all 
> with just the tcp and self btls on the same machine, running the 
> 'hello_c' program in the examples directory.

Unfortunately, I can't reproduce it that way. If this is related to the
change, then it would mean that mca_btl_tcp_proc_accept() returns false,
either after the large loop or in mca_btl_tcp_endpoint_accept().

Do you have the cycles to add some BTL_VERBOSE-lines to see where things
go wrong? Or even to step through with the debugger?

If you want me to do it, I would provide you with my ssh key?


Cheerio


-- 
mail: a...@thur.de  http://adi.thur.de  PGP/GPG: key via keyserver

Das Sterben wird nur halb so schlimm, rauchst du KIM.


Re: [OMPI devel] Change in btl/tcp

2008-04-18 Thread Tim Prins

Hi Adrian,

After this change, I am getting a lot of errors of the form:
[sif2][[12854,1],9][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

See for instance: http://www.open-mpi.org/mtt/index.php?do_redir=615

I have found this especially easy to reproduce if I run 16 processes all 
with just the tcp and self btls on the same machine, running the 
'hello_c' program in the examples directory.


Tim


Adrian Knoth wrote:

Hi!

As of r18169, I've changed the acceptance rules for incoming BTL-TCP
connections.

The old code would have denied a connection in case of non-matching
addresses (comparison between source address and expected source
address).

Unfortunately, you cannot always say which source address an incoming
packet will have (the sender's kernel decides), so rejecting a
connection due to a "wrong" source address caused a complete hang.

I had several cases, mostly multi-cluster setups, where this happened
all the time. (Typical scenario: you're expecting the headnode's
internal address, but since you're talking to another cluster,
the kernel uses the headnode's external address.)

Though I've tested it as much as possible, I don't know if it breaks
your setup, especially the multi-rail stuff. George?


Cheerio