Re: [OMPI devel] Change in btl/tcp
To echo what Josh said, there are no special compile flags being used. If you send me a patch with debug output, I'd be happy to run it for you.

Both odin and sif are fairly normal Linux-based clusters, with Ethernet and OpenIB IP networks. The Ethernet network has both IPv4 and IPv6, and the OpenIB network runs IPv4.

Tim

Adrian Knoth wrote:
> On Fri, Apr 18, 2008 at 01:00:40PM -0400, Josh Hursey wrote:
>> The trick is to force Open MPI to use only tcp,self and nothing else.
>> Did you try adding this (-mca btl tcp,self) to the runtime parameter set?
>
> Sure. Even with 64 processes, I cannot trigger this behaviour. Neither on Linux nor Solaris.
>
> Any special compile flags?
>
> I guess a little bit more debug output could probably reveal the culprit.
Re: [OMPI devel] Change in btl/tcp
I'm seeing this problem as well even running just 4 processes on a single node (though not as frequently as with higher process counts).

The trick is to force Open MPI to use only tcp,self and nothing else. Did you try adding this (-mca btl tcp,self) to the runtime parameter set?

-- Josh

On Apr 18, 2008, at 12:56 PM, Adrian Knoth wrote:
> On Fri, Apr 18, 2008 at 08:04:17AM -0400, Tim Prins wrote:
>> Hi Adrian,
>
> Hi!
>
>> After this change, I am getting a lot of errors of the form:
>> [sif2][[12854,1],9][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>>
>> See for instance: http://www.open-mpi.org/mtt/index.php?do_redir=615
>
> That's weird. I've tried hello_c.c on about ten machines with different network configurations, none of them showed any problems at all.
>
> Do you have a very special setup? And if need be, would it be possible to debug on your machine?
>
> From all MTT sites, this error only occurs on Odin and Sif. What's so special with these clusters?
>
>> I have found this especially easy to reproduce if I run 16 processes all
>> with just the tcp and self btls on the same machine, running the
>> 'hello_c' program in the examples directory.
>
> Unfortunately, I can't reproduce it that way.
>
> If this is related to the change, then it would mean that mca_btl_tcp_proc_accept() returns false, either after the large loop or in mca_btl_tcp_endpoint_accept().
>
> Do you have the cycles to add some BTL_VERBOSE-lines to see where things go wrong? Or even to step through with the debugger?
>
> If you want me to do it, I would provide you with my ssh key?
>
> Cheerio
>
> --
> mail: a...@thur.de http://adi.thur.de PGP/GPG: key via keyserver
>
> Das Sterben wird nur halb so schlimm, rauchst du KIM.

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Change in btl/tcp
On Fri, Apr 18, 2008 at 08:04:17AM -0400, Tim Prins wrote:
> Hi Adrian,

Hi!

> After this change, I am getting a lot of errors of the form:
> [sif2][[12854,1],9][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>
> See for instance: http://www.open-mpi.org/mtt/index.php?do_redir=615

That's weird. I've tried hello_c.c on about ten machines with different network configurations, none of them showed any problems at all.

Do you have a very special setup? And if need be, would it be possible to debug on your machine?

From all MTT sites, this error only occurs on Odin and Sif. What's so special with these clusters?

> I have found this especially easy to reproduce if I run 16 processes all
> with just the tcp and self btls on the same machine, running the
> 'hello_c' program in the examples directory.

Unfortunately, I can't reproduce it that way.

If this is related to the change, then it would mean that mca_btl_tcp_proc_accept() returns false, either after the large loop or in mca_btl_tcp_endpoint_accept().

Do you have the cycles to add some BTL_VERBOSE-lines to see where things go wrong? Or even to step through with the debugger?

If you want me to do it, I would provide you with my ssh key?

Cheerio

--
mail: a...@thur.de http://adi.thur.de PGP/GPG: key via keyserver

Das Sterben wird nur halb so schlimm, rauchst du KIM.
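The two failure paths named above (falling out of the loop without a match, or the endpoint itself refusing) can be told apart with one trace line per path. The following is only a rough sketch of that idea and not Open MPI code: struct endpoint, endpoint_accept(), proc_accept() and the enum are hypothetical stand-ins, the refusal condition is invented for illustration, and fprintf(stderr, ...) stands in for whatever verbose-output macro the BTL provides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical outcome codes: one per way the accept path can end. */
enum accept_result { ACCEPT_OK, ACCEPT_NO_ENDPOINT, ACCEPT_ENDPOINT_REFUSED };

/* Hypothetical endpoint: an id and whether it already holds a connection. */
struct endpoint {
    int id;
    bool connected;
};

/* Stand-in for the per-endpoint accept: refuses if already connected. */
static bool endpoint_accept(struct endpoint *ep)
{
    assert(ep != NULL);
    if (ep->connected) {
        return false;
    }
    ep->connected = true;
    return true;
}

/* Stand-in for the per-process accept: walk the endpoints and report
 * on stderr which of the two failure paths was taken. */
static enum accept_result proc_accept(struct endpoint *eps, size_t n, int wanted_id)
{
    assert(eps != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (eps[i].id != wanted_id) {
            continue;
        }
        if (endpoint_accept(&eps[i])) {
            return ACCEPT_OK;
        }
        /* Failure path 1: a matching endpoint exists but refused. */
        fprintf(stderr, "proc_accept: endpoint %d refused the connection\n",
                eps[i].id);
        return ACCEPT_ENDPOINT_REFUSED;
    }
    /* Failure path 2: the loop finished without any matching endpoint. */
    fprintf(stderr, "proc_accept: no endpoint matched id %d among %zu candidates\n",
            wanted_id, n);
    return ACCEPT_NO_ENDPOINT;
}
```

With output like this in the log, a single failing run would show whether the connection was dropped because no endpoint matched or because the matching endpoint rejected it.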
Re: [OMPI devel] Change in btl/tcp
Hi Adrian,

After this change, I am getting a lot of errors of the form:
[sif2][[12854,1],9][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

See for instance: http://www.open-mpi.org/mtt/index.php?do_redir=615

I have found this especially easy to reproduce if I run 16 processes all with just the tcp and self btls on the same machine, running the 'hello_c' program in the examples directory.

Tim

Adrian Knoth wrote:
> Hi!
>
> As of r18169, I've changed the acceptance rules for incoming BTL-TCP connections.
>
> The old code would have denied a connection in case of non-matching addresses (comparison between source address and expected source address).
>
> Unfortunately, you cannot always say which source address an incoming packet will have (it's the sender's kernel that decides), so rejecting a connection due to a "wrong" source address caused a complete hang.
>
> I had several cases, mostly multi-cluster setups, where this happened all the time. (Typical scenario: you're expecting the headnode's internal address, but since you're talking to another cluster, the kernel uses the headnode's external address.)
>
> Though I've tested it as much as possible, I don't know if it breaks your setup, especially the multi-rail stuff. George?
>
> Cheerio
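To make the difference between the old and new acceptance rules in the r18169 announcement concrete, here is a minimal sketch under stated assumptions. It is not the actual btl/tcp code: struct peer, parse_ipv4(), accept_strict() and accept_relaxed() are hypothetical names, a peer is assumed to announce a numeric id when it connects, and only IPv4 is modelled.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of a known peer: its id and the source address
 * we expect its connections to come from. */
struct peer {
    unsigned id;
    struct in_addr expected_src;
};

/* Parse dotted-quad IPv4 text; returns false on malformed input. */
static bool parse_ipv4(const char *text, struct in_addr *out)
{
    assert(text != NULL && out != NULL);
    return inet_pton(AF_INET, text, out) == 1;
}

/* Old-style rule: the announced id must match AND the observed source
 * address must equal the expected one. */
static bool accept_strict(const struct peer *p, unsigned claimed_id,
                          struct in_addr src)
{
    assert(p != NULL);
    return p->id == claimed_id && p->expected_src.s_addr == src.s_addr;
}

/* Relaxed rule: only the announced id decides; the source address is
 * ignored because the sender's kernel picks it. */
static bool accept_relaxed(const struct peer *p, unsigned claimed_id,
                           struct in_addr src)
{
    assert(p != NULL);
    (void)src; /* deliberately unused */
    return p->id == claimed_id;
}
```

In the headnode scenario described above, a peer registered with its internal address (say 10.0.0.1) but connecting from its external address would be refused by accept_strict() and admitted by accept_relaxed(), as long as the announced id matches.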