I thought about it again: there is probably no call to dat_ep_query() *because* it returns wrong port numbers, so the port numbers saved by the uDAPL BTL code itself are used instead.
I'll leave the debugging to those who know the code ... ;-)

Boris

Andrew Friedley wrote:
> OK, strange but good. Yeah I wouldn't be surprised if something has
> been changed, though I wouldn't know what, and I don't have time right
> now to go digging :( Maybe Don Kerr knows something?
>
> Andrew
>
>
> Boris Bierbaum wrote:
>> I've run the whole IMB Benchmark Suite on 2, 3, and 4 nodes with 2
>> processes per node and --mca btl udapl,self. I didn't encounter any problems.
>>
>> The comment above line 197 says that dat_ep_query() returns wrong port
>> numbers (which it does indeed), but I can't find any call to
>> dat_ep_query() in the uDAPL BTL code. Maybe the comment is out of date?
>>
>> Boris
>>
>>
>> Andrew Friedley wrote:
>>> You say that fixes the problem, does it work even when running more than
>>> one MPI process per node? (that is the case the hack fixes) Simply
>>> doing an mpirun with a -np parameter higher than the number of nodes you
>>> have set up should trigger this case, and making sure to use '-mca btl
>>> udapl,self' (i.e. not SM or anything else).
>>>
>>> Andrew
>>>
>>> Boris Bierbaum wrote:
>>>> It has been explained in a different thread on [ofa-general] that the
>>>> problem lies in a combination of the OpenIB-cma provider not setting the
>>>> local and remote port numbers on endpoints correctly and Open MPI
>>>> stepping over the IA to save the port number to circumvent this problem,
>>>> thereby confusing the provider.
>>>>
>>>> I commented out line 197 in ompi/mca/btl/udapl/btl_udapl.c (Open MPI
>>>> 1.2.1 release) and this fixes the problem. As the problem in the
>>>> provider is currently being fixed, the whole saving of the port number
>>>> in the uDAPL BTL code will be unnecessary in the future.
>>>>
>>>> Steve Wise wrote:
>>>>>>> Can the UDAPL OFED wizards shed any light on the error messages that
>>>>>>> are listed below? In particular, these seem to be worrisome:
>>>>>>>
>>>>>>>> setup_listener Permission denied
>>>>>>> setup_listener Address already in use
>>>>>> These failures are from rdma_cm_bind indicating the port is already
>>>>>> bound to this IA address. How are you creating the service point?
>>>>>> dat_psp_create or dat_psp_create_any? If it is psp_create_any then you
>>>>>> will see some failures until it gets to a free port. That is normal.
>>>>>> Just make sure your create call returns DAT_SUCCESS.
>>>>>>
>>>>> Arlin, why doesn't dapl_psp_create_any() just pass a port of zero down
>>>>> and let the rdma-cma pick an available port number?
>>>>>
>>>>> _______________________________________________
>>>>> general mailing list
>>>>> gene...@lists.openfabrics.org
>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>>>
>>>>> To unsubscribe, please visit
>>>>> http://openib.org/mailman/listinfo/openib-general
>>>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>

--
Boris Bierbaum
Lehrstuhl fuer Betriebssysteme
RWTH Aachen, D-52056 Aachen
Tel: +49-241-80-27805
Fax: +49-241-80-22339
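
The two port-selection strategies raised in the quoted openfabrics thread (trying ports until one is free, which is how dat_psp_create_any is described there, versus passing port zero and letting the lower layer pick) can be illustrated with plain TCP sockets. This is only an analogy in Python: it does not use uDAPL or rdma-cma, and the names bind_first_free and bind_any are hypothetical, made up for this sketch.

```python
import errno
import socket


def bind_first_free(start_port, attempts=50, host="127.0.0.1"):
    """Probe consecutive ports until bind() succeeds.

    Returns (socket, port, failures). Failed binds on busy or
    privileged ports are expected and only counted, which mirrors the
    "you will see some failures until it gets to a free port" remark.
    """
    failures = 0
    last_port = min(start_port + attempts, 65536)
    for port in range(start_port, last_port):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError as exc:
            sock.close()
            if exc.errno in (errno.EADDRINUSE, errno.EACCES):
                failures += 1  # port busy or not permitted: try the next one
                continue
            raise
        return sock, port, failures
    raise OSError(errno.EADDRINUSE, "no free port in the probed range")


def bind_any(host="127.0.0.1"):
    """Bind to port 0 so the OS chooses a free port; no probing needed."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    return sock, sock.getsockname()[1]
```

With the probing variant, the intermediate "Address already in use" errors are harmless as long as the final bind succeeds. With port zero they never show up, but the caller has to read the chosen port back afterwards, so that lookup must return the correct value.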