Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-07-27 Thread Brock Palen
Sorry to bring this back up. We recently had an outage, updated the firmware on our GD4700, and installed a new Mellanox-provided OFED stack, and the problem has returned. Specifically, I am able to reproduce the problem with IMB on 4 twelve-core nodes when it tries to go to 16 cores. I have verified that
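A reproduction along those lines would look roughly like the sketch below (rank count from 4 x 12 cores; the BTL list is an assumption, not Brock's exact command). IMB walks the active process group up through 2, 4, 8, 16, ... ranks, which is where the reported hang at 16 would appear mid-run:

  mpirun -np 48 --mca btl openib,sm,self ./IMB-MPI1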

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-24 Thread Dave Love
Brock Palen writes: > Well, I have a new wrench in this situation. > We had a power failure at our datacenter that took down our entire system: > nodes, switch, SM. > Now I am unable to reproduce the error with oob, default ibflags, etc. As far as I know, we could still reproduce it.

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-18 Thread Brock Palen
Well, I have a new wrench in this situation. We had a power failure at our datacenter that took down our entire system: nodes, switch, SM. Now I am unable to reproduce the error with oob, default ibflags, etc. Does this shed any light on the issue? It also makes it hard to debug the issue now without

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-17 Thread Brock Palen
Sorry, typo: 314, not 313. Brock Palen www.umich.edu/~brockp Center for Advanced Computing bro...@umich.edu (734)936-1985 On May 17, 2011, at 2:02 PM, Brock Palen wrote: > Thanks, I thought of looking at ompi_info after I sent that note, sigh. > > SEND_INPLACE appears to help performance of

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-17 Thread Brock Palen
Thanks, I thought of looking at ompi_info after I sent that note, sigh. SEND_INPLACE appears to help performance of larger messages in my synthetic benchmarks over regular SEND. Also, it appears that SEND_INPLACE still allows our code to run. We're working on getting devs access to our system and

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-16 Thread George Bosilca
Here is the output of "ompi_info --param btl openib": MCA btl: parameter "btl_openib_flags" (current value: <306>, data source: default value) BTL bit flags (general flags: SEND=1, PUT=2, GET=4,
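For anyone decoding those numbers: using the flag values quoted above (SEND=1, PUT=2, GET=4) plus the remaining bits as defined in the btl.h of that era (ACK=16, CHECKSUM=32, HETEROGENEOUS_RDMA=256 -- my reading of the header, worth double-checking), the default value and the 305 workaround discussed in this thread decompose as:

  306 = 256 + 32 + 16 + 2   (HETEROGENEOUS_RDMA + CHECKSUM + ACK + PUT)
  305 = 256 + 32 + 16 + 1   (HETEROGENEOUS_RDMA + CHECKSUM + ACK + SEND)

So setting 305 clears the PUT bit and sets SEND, steering large transfers away from RDMA put and onto the send/receive path.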

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-16 Thread Brock Palen
On May 16, 2011, at 10:23 AM, Samuel K. Gutierrez wrote: > Hi, > > Just out of curiosity - what happens when you add the following MCA option to > your openib runs? > > -mca btl_openib_flags 305 You, sir, found the magic combination. I verified this lets IMB and CRASH progress past their

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-16 Thread Samuel K. Gutierrez
Hi, Just out of curiosity - what happens when you add the following MCA option to your openib runs? -mca btl_openib_flags 305 Thanks, Samuel Gutierrez Los Alamos National Laboratory On May 13, 2011, at 2:38 PM, Brock Palen wrote: > On May 13, 2011, at 4:09 PM, Dave Love wrote: > >> Jeff
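For anyone following along, a complete invocation with this workaround would look something like the line below (the process count, BTL list, and binary name are illustrative, not from Samuel's mail):

  mpirun -np 16 --mca btl openib,sm,self --mca btl_openib_flags 305 ./your_app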

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-13 Thread Brock Palen
On May 13, 2011, at 4:09 PM, Dave Love wrote: > Jeff Squyres writes: > >> On May 11, 2011, at 3:21 PM, Dave Love wrote: >> >>> We can reproduce it with IMB. We could provide access, but we'd have to >>> negotiate with the owners of the relevant nodes to give you

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-13 Thread Dave Love
Jeff Squyres writes: > On May 11, 2011, at 3:21 PM, Dave Love wrote: > >> We can reproduce it with IMB. We could provide access, but we'd have to >> negotiate with the owners of the relevant nodes to give you interactive >> access to them. Maybe Brock's would be more

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-12 Thread Brock Palen
I am pretty sure MTLs and BTLs are very different, but just as a note: this user's code (CRASH) hangs at MPI_Allreduce() in openib but runs on: tcp, psm (an MTL, different hardware). Putting it out there in case it has any bearing; otherwise ignore. Brock Palen www.umich.edu/~brockp Center
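Roughly, the three configurations being compared here would be launched like this (component names per Open MPI 1.4/1.5; the binary name is a placeholder):

  mpirun --mca btl openib,sm,self ./crash        # hangs in MPI_Allreduce()
  mpirun --mca btl tcp,sm,self ./crash           # runs
  mpirun --mca pml cm --mca mtl psm ./crash      # runs (PSM hardware)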

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-12 Thread Brock Palen
On May 12, 2011, at 10:13 AM, Jeff Squyres wrote: > On May 11, 2011, at 3:21 PM, Dave Love wrote: > >> We can reproduce it with IMB. We could provide access, but we'd have to >> negotiate with the owners of the relevant nodes to give you interactive >> access to them. Maybe Brock's would be

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-12 Thread Jeff Squyres
On May 11, 2011, at 3:21 PM, Dave Love wrote: > We can reproduce it with IMB. We could provide access, but we'd have to > negotiate with the owners of the relevant nodes to give you interactive > access to them. Maybe Brock's would be more accessible? (If you > contact me, I may not be able to

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-12 Thread Ralph Castain
On May 11, 2011, at 4:27 PM, Dave Love wrote: > Ralph Castain writes: > >> I'll go back to my earlier comments. Users always claim that their >> code doesn't have the sync issue, but it has proved to help more often >> than not, and costs nothing to try, > > Could you

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-11 Thread Dave Love
Ralph Castain writes: > I'll go back to my earlier comments. Users always claim that their > code doesn't have the sync issue, but it has proved to help more often > than not, and costs nothing to try, Could you point to that post, or tell us what to try exactly, given

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-11 Thread Ralph Castain
Sent from my iPad On May 11, 2011, at 2:05 PM, Brock Palen wrote: > On May 9, 2011, at 9:31 AM, Jeff Squyres wrote: > >> On May 3, 2011, at 6:42 AM, Dave Love wrote: >> We managed to have another user hit the bug that causes collectives (this time MPI_Bcast() )

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-11 Thread Brock Palen
On May 9, 2011, at 9:31 AM, Jeff Squyres wrote: > On May 3, 2011, at 6:42 AM, Dave Love wrote: > >>> We managed to have another user hit the bug that causes collectives (this >>> time MPI_Bcast() ) to hang on IB that was fixed by setting: >>> >>> btl_openib_cpc_include rdmacm >> >> Could

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-09 Thread Jeff Squyres
On May 3, 2011, at 6:42 AM, Dave Love wrote: >> We managed to have another user hit the bug that causes collectives (this >> time MPI_Bcast() ) to hang on IB that was fixed by setting: >> >> btl_openib_cpc_include rdmacm > > Could someone explain this? We also have problems with collective

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-09 Thread Jeff Squyres
Sorry for the delay on this -- it looks like the problem is caused by messages like this (from your first message): "[nyx0665.engin.umich.edu:06399] openib BTL: rdmacm IP address not found on port". RDMA CM requires IP addresses (i.e., IPoIB) to be enabled on every port/LID where you want to use
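A quick sanity check on a node (the interface name ib0 is an assumption; substitute whatever your IPoIB interfaces are called):

  ip -4 addr show ib0            # should list an "inet" address
  ibv_devinfo | grep -i state    # ports should report PORT_ACTIVE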

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-05 Thread Brock Palen
Yeah, we have run into more issues, with rdmacm not being available on all of our hosts. So it would be nice to know what we can do to test whether a host supports rdmacm. Example: -- No OpenFabrics connection schemes
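Pending a better answer, a crude cluster-wide probe, under the assumption that rdmacm support comes down to "the port has an IPoIB address" as described elsewhere in this thread (pdsh, the host range, and the ib0 interface name are all illustrative):

  pdsh -w 'nyx[0001-0100]' \
    'ip -4 addr show ib0 2>/dev/null | grep -q inet && echo rdmacm-ok || echo rdmacm-missing'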

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-03 Thread Dave Love
Brock Palen writes: > We managed to have another user hit the bug that causes collectives (this > time MPI_Bcast() ) to hang on IB that was fixed by setting: > > btl_openib_cpc_include rdmacm Could someone explain this? We also have problems with collective hangs with

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-28 Thread Brock Palen
Attached is the output of running with verbose 100: mpirun --mca btl_openib_cpc_include rdmacm --mca btl_base_verbose 100 NPmpi [nyx0665.engin.umich.edu:06399] mca: base: components_open: Looking for btl components [nyx0666.engin.umich.edu:07210] mca: base: components_open: Looking for btl

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-28 Thread Jeff Squyres
On Apr 27, 2011, at 10:02 AM, Brock Palen wrote: > Argh, our messed-up environment with three generations of InfiniBand bit us. > Setting openib_cpc_include to rdmacm causes IB to not be used on our old DDR > IB on some of our hosts. Note that jobs will never run across our old DDR IB > and

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-27 Thread Brock Palen
Argh, our messed-up environment with three generations of InfiniBand bit us. Setting openib_cpc_include to rdmacm causes IB to not be used on our old DDR IB on some of our hosts. Note that jobs will never run across our old DDR IB and our new QDR stuff, where rdmacm does work. I am doing some
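One way to scope the parameter to the QDR hosts only is the per-installation MCA parameter file, which each node reads locally (the path assumes a default-prefix install; adjust for your site):

  # on the QDR hosts only, e.g. in /opt/openmpi/etc/openmpi-mca-params.conf
  btl_openib_cpc_include = rdmacm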

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-22 Thread Brock Palen
On Apr 21, 2011, at 6:49 PM, Ralph Castain wrote: > > On Apr 21, 2011, at 4:41 PM, Brock Palen wrote: >> Given that part of our cluster is TCP only, openib wouldn't even start up on >> those hosts > > That is correct - it would have no impact on those hosts > >> and this would be ignored on

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-21 Thread Ralph Castain
On Apr 21, 2011, at 4:41 PM, Brock Palen wrote: > Given that part of our cluster is TCP only, openib wouldn't even start up on > those hosts That is correct - it would have no impact on those hosts > and this would be ignored on hosts with IB adaptors? Ummm...not sure I understand this one.

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-21 Thread Brock Palen
Given that part of our cluster is TCP only, openib wouldn't even start up on those hosts, and this would be ignored on hosts with IB adaptors? Just checking, thanks! Brock Palen www.umich.edu/~brockp Center for Advanced Computing bro...@umich.edu (734)936-1985 On Apr 21, 2011, at 6:21 PM,

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-21 Thread Jeff Squyres
Over IB, I'm not sure there is much of a drawback. It might be slightly slower to establish QPs, but I don't think that matters much. Over iWARP, rdmacm can cause connection storms as you scale to thousands of MPI processes. On Apr 20, 2011, at 5:03 PM, Brock Palen wrote: > We managed to
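For completeness, the two connect methods discussed throughout this thread are selected with the same parameter (values as they appear in the thread; application arguments elided):

  mpirun --mca btl_openib_cpc_include oob ...      # out-of-band TCP connection setup
  mpirun --mca btl_openib_cpc_include rdmacm ...   # RDMA CM; requires IPoIB on every port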