Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-07-27 Thread Brock Palen
Sorry to bring this back up.
We recently had an outage, updated the firmware on our GD4700, and installed a
new Mellanox-provided OFED stack, and the problem has returned.
Specifically, I am able to reproduce the problem with IMB on 4 twelve-core nodes when it
tries to go to 16 cores.  I have verified that setting openib_flags to 313
fixes the issue, albeit with lower bandwidth for some message sizes.
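(A hedged sketch of the kind of run being described -- the rank count, hostfile, and
IMB binary name are my assumptions, not details stated in this thread:)

# Hypothetical reproduction: 4 x 12-core nodes, 48 ranks of the Intel MPI Benchmarks.
mpirun -np 48 --hostfile hosts ./IMB-MPI1

# Workaround being discussed: override the openib BTL flags
# (313 = HETEROGENEOUS_RDMA(256) + CHECKSUM(32) + ACK(16) + SEND_INPLACE(8) + SEND(1),
#  per the flag values quoted later in the thread).
mpirun -np 48 --hostfile hosts --mca btl_openib_flags 313 ./IMB-MPI1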

Has there been any progress on this issue?

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On May 18, 2011, at 10:25 AM, Brock Palen wrote:

> Well, I have a new wrench thrown into this situation.
> We had a power failure at our datacenter that took down our entire system
> (nodes, switch, SM).
> Now I am unable to reproduce the error with oob, default ibflags, etc.
> 
> Does this shed any light on the issue?  It also makes it hard to debug the
> issue now, without being able to reproduce it.
> 
> Any thoughts?  Am I overlooking something? 
> 
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On May 17, 2011, at 2:18 PM, Brock Palen wrote:
> 
>> Sorry typo 314 not 313, 
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>> On May 17, 2011, at 2:02 PM, Brock Palen wrote:
>> 
>>> Thanks, I though of looking at ompi_info after I sent that note sigh.
>>> 
>>> SEND_INPLACE appears to help performance of larger messages in my synthetic 
>>> benchmarks over regular SEND.  Also it appears that SEND_INPLACE still 
>>> allows our code to run.
>>> 
>>> We working on getting devs access to our system and code. 
>>> 
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> Center for Advanced Computing
>>> bro...@umich.edu
>>> (734)936-1985
>>> 
>>> 
>>> 
>>> On May 16, 2011, at 11:49 AM, George Bosilca wrote:
>>> 
 Here is the output of the "ompi_info --param btl openib":
 
  MCA btl: parameter "btl_openib_flags" (current value: <306>, 
 data
   source: default value)
   BTL bit flags (general flags: SEND=1, PUT=2, GET=4,
   SEND_INPLACE=8, RDMA_MATCHED=64, 
 HETEROGENEOUS_RDMA=256; flags
   only used by the "dr" PML (ignored by others): 
 ACK=16,
   CHECKSUM=32, RDMA_COMPLETION=128; flags only used by 
 the "bfo"
   PML (ignored by others): FAILOVER_SUPPORT=512)
 
 So the 305 flags means: HETEROGENEOUS_RDMA | CHECKSUM | ACK | SEND. Most 
 of these flags are totally useless in the current version of Open MPI (DR 
 is not supported), so the only value that really matter is SEND | 
 HETEROGENEOUS_RDMA.
 
 If you want to enable the send protocol try first with SEND | SEND_INPLACE 
 (9), if not downgrade to SEND (1)
 
 george.
 
 On May 16, 2011, at 11:33 , Samuel K. Gutierrez wrote:
 
> 
> On May 16, 2011, at 8:53 AM, Brock Palen wrote:
> 
>> 
>> 
>> 
>> On May 16, 2011, at 10:23 AM, Samuel K. Gutierrez wrote:
>> 
>>> Hi,
>>> 
>>> Just out of curiosity - what happens when you add the following MCA 
>>> option to your openib runs?
>>> 
>>> -mca btl_openib_flags 305
>> 
>> You Sir found the magic combination.
> 
> :-)  - cool.
> 
> Developers - does this smell like a registered memory availability hang?
> 
>> I verified this lets IMB and CRASH progress pass their lockup points,
>> I will have a user test this, 
> 
> Please let us know what you find.
> 
>> Is this an ok option to put in our environment?  What does 305 mean?
> 
> There may be a performance hit associated with this configuration, but if 
> it lets your users run, then I don't see a problem with adding it to your 
> environment.
> 
> If I'm reading things correctly, 305 turns off RDMA PUT/GET and turns on 
> SEND.
> 
> OpenFabrics gurus - please correct me if I'm wrong :-).
> 
> Samuel Gutierrez
> Los Alamos National Laboratory
> 
> 
>> 
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>> 
>>> 
>>> Thanks,
>>> 
>>> Samuel Gutierrez
>>> Los Alamos National Laboratory
>>> 
>>> On May 13, 2011, at 2:38 PM, Brock Palen wrote:
>>> 
 On May 13, 2011, at 4:09 PM, Dave Love wrote:
 
> Jeff Squyres  writes:
> 
>> On May 11, 2011, at 3:21 PM, Dave Love wrote:
>> 
>>> We can reproduce it with IMB.  We could provide access, but we'd 
>>> have to
>>> negotiate with the owners of the relevant nodes to give you 
>>> interactive
>>> access to them.  Maybe Brock's would be more accessible?  (If you

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-24 Thread Dave Love
Brock Palen  writes:

> Well, I have a new wrench thrown into this situation.
> We had a power failure at our datacenter that took down our entire system
> (nodes, switch, SM).
> Now I am unable to reproduce the error with oob, default ibflags, etc.

As far as I know, we could still reproduce it.  Mail me if you need an
alternative, but we may have trouble getting access to the relevant
nodes.

-- 
Excuse the typing -- I have a broken wrist


Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-18 Thread Brock Palen
Well, I have a new wrench thrown into this situation.
We had a power failure at our datacenter that took down our entire system
(nodes, switch, SM).
Now I am unable to reproduce the error with oob, default ibflags, etc.

Does this shed any light on the issue?  It also makes it hard to debug the
issue now, without being able to reproduce it.

Any thoughts?  Am I overlooking something? 

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On May 17, 2011, at 2:18 PM, Brock Palen wrote:

> Sorry typo 314 not 313, 
> 
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On May 17, 2011, at 2:02 PM, Brock Palen wrote:
> 
>> Thanks, I though of looking at ompi_info after I sent that note sigh.
>> 
>> SEND_INPLACE appears to help performance of larger messages in my synthetic 
>> benchmarks over regular SEND.  Also it appears that SEND_INPLACE still 
>> allows our code to run.
>> 
>> We working on getting devs access to our system and code. 
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>> On May 16, 2011, at 11:49 AM, George Bosilca wrote:
>> 
>>> Here is the output of the "ompi_info --param btl openib":
>>> 
>>>   MCA btl: parameter "btl_openib_flags" (current value: <306>, 
>>> data
>>>source: default value)
>>>BTL bit flags (general flags: SEND=1, PUT=2, GET=4,
>>>SEND_INPLACE=8, RDMA_MATCHED=64, 
>>> HETEROGENEOUS_RDMA=256; flags
>>>only used by the "dr" PML (ignored by others): 
>>> ACK=16,
>>>CHECKSUM=32, RDMA_COMPLETION=128; flags only used by 
>>> the "bfo"
>>>PML (ignored by others): FAILOVER_SUPPORT=512)
>>> 
>>> So the 305 flags means: HETEROGENEOUS_RDMA | CHECKSUM | ACK | SEND. Most of 
>>> these flags are totally useless in the current version of Open MPI (DR is 
>>> not supported), so the only value that really matter is SEND | 
>>> HETEROGENEOUS_RDMA.
>>> 
>>> If you want to enable the send protocol try first with SEND | SEND_INPLACE 
>>> (9), if not downgrade to SEND (1)
>>> 
>>> george.
>>> 
>>> On May 16, 2011, at 11:33 , Samuel K. Gutierrez wrote:
>>> 
 
 On May 16, 2011, at 8:53 AM, Brock Palen wrote:
 
> 
> 
> 
> On May 16, 2011, at 10:23 AM, Samuel K. Gutierrez wrote:
> 
>> Hi,
>> 
>> Just out of curiosity - what happens when you add the following MCA 
>> option to your openib runs?
>> 
>> -mca btl_openib_flags 305
> 
> You Sir found the magic combination.
 
 :-)  - cool.
 
 Developers - does this smell like a registered memory availability hang?
 
> I verified this lets IMB and CRASH progress pass their lockup points,
> I will have a user test this, 
 
 Please let us know what you find.
 
> Is this an ok option to put in our environment?  What does 305 mean?
 
 There may be a performance hit associated with this configuration, but if 
 it lets your users run, then I don't see a problem with adding it to your 
 environment.
 
 If I'm reading things correctly, 305 turns off RDMA PUT/GET and turns on 
 SEND.
 
 OpenFabrics gurus - please correct me if I'm wrong :-).
 
 Samuel Gutierrez
 Los Alamos National Laboratory
 
 
> 
> 
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
>> 
>> Thanks,
>> 
>> Samuel Gutierrez
>> Los Alamos National Laboratory
>> 
>> On May 13, 2011, at 2:38 PM, Brock Palen wrote:
>> 
>>> On May 13, 2011, at 4:09 PM, Dave Love wrote:
>>> 
 Jeff Squyres  writes:
 
> On May 11, 2011, at 3:21 PM, Dave Love wrote:
> 
>> We can reproduce it with IMB.  We could provide access, but we'd 
>> have to
>> negotiate with the owners of the relevant nodes to give you 
>> interactive
>> access to them.  Maybe Brock's would be more accessible?  (If you
>> contact me, I may not be able to respond for a few days.)
> 
> Brock has replied off-list that he, too, is able to reliably 
> reproduce the issue with IMB, and is working to get access for us.  
> Many thanks for your offer; let's see where Brock's access takes us.
 
 Good.  Let me know if we could be useful
 
>>> -- we have not closed this issue,
>> 
>> Which issue?   I couldn't find a relevant-looking one.
> 
> https://svn.open-mpi.org/trac/ompi/ticket/2714
 
 Thanks.  In csse it's useful info, it hangs for me with 1.5.3 & np=32 
 on
 connectx with more than one collective I can't recall.
>>> 

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-17 Thread Brock Palen
Sorry, typo: 314, not 313.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On May 17, 2011, at 2:02 PM, Brock Palen wrote:

> Thanks, I though of looking at ompi_info after I sent that note sigh.
> 
> SEND_INPLACE appears to help performance of larger messages in my synthetic 
> benchmarks over regular SEND.  Also it appears that SEND_INPLACE still allows 
> our code to run.
> 
> We working on getting devs access to our system and code. 
> 
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On May 16, 2011, at 11:49 AM, George Bosilca wrote:
> 
>> Here is the output of the "ompi_info --param btl openib":
>> 
>>MCA btl: parameter "btl_openib_flags" (current value: <306>, 
>> data
>> source: default value)
>> BTL bit flags (general flags: SEND=1, PUT=2, GET=4,
>> SEND_INPLACE=8, RDMA_MATCHED=64, 
>> HETEROGENEOUS_RDMA=256; flags
>> only used by the "dr" PML (ignored by others): 
>> ACK=16,
>> CHECKSUM=32, RDMA_COMPLETION=128; flags only used by 
>> the "bfo"
>> PML (ignored by others): FAILOVER_SUPPORT=512)
>> 
>> So the 305 flags means: HETEROGENEOUS_RDMA | CHECKSUM | ACK | SEND. Most of 
>> these flags are totally useless in the current version of Open MPI (DR is 
>> not supported), so the only value that really matter is SEND | 
>> HETEROGENEOUS_RDMA.
>> 
>> If you want to enable the send protocol try first with SEND | SEND_INPLACE 
>> (9), if not downgrade to SEND (1)
>> 
>> george.
>> 
>> On May 16, 2011, at 11:33 , Samuel K. Gutierrez wrote:
>> 
>>> 
>>> On May 16, 2011, at 8:53 AM, Brock Palen wrote:
>>> 
 
 
 
 On May 16, 2011, at 10:23 AM, Samuel K. Gutierrez wrote:
 
> Hi,
> 
> Just out of curiosity - what happens when you add the following MCA 
> option to your openib runs?
> 
> -mca btl_openib_flags 305
 
 You Sir found the magic combination.
>>> 
>>> :-)  - cool.
>>> 
>>> Developers - does this smell like a registered memory availability hang?
>>> 
 I verified this lets IMB and CRASH progress pass their lockup points,
 I will have a user test this, 
>>> 
>>> Please let us know what you find.
>>> 
 Is this an ok option to put in our environment?  What does 305 mean?
>>> 
>>> There may be a performance hit associated with this configuration, but if 
>>> it lets your users run, then I don't see a problem with adding it to your 
>>> environment.
>>> 
>>> If I'm reading things correctly, 305 turns off RDMA PUT/GET and turns on 
>>> SEND.
>>> 
>>> OpenFabrics gurus - please correct me if I'm wrong :-).
>>> 
>>> Samuel Gutierrez
>>> Los Alamos National Laboratory
>>> 
>>> 
 
 
 Brock Palen
 www.umich.edu/~brockp
 Center for Advanced Computing
 bro...@umich.edu
 (734)936-1985
 
> 
> Thanks,
> 
> Samuel Gutierrez
> Los Alamos National Laboratory
> 
> On May 13, 2011, at 2:38 PM, Brock Palen wrote:
> 
>> On May 13, 2011, at 4:09 PM, Dave Love wrote:
>> 
>>> Jeff Squyres  writes:
>>> 
 On May 11, 2011, at 3:21 PM, Dave Love wrote:
 
> We can reproduce it with IMB.  We could provide access, but we'd have 
> to
> negotiate with the owners of the relevant nodes to give you 
> interactive
> access to them.  Maybe Brock's would be more accessible?  (If you
> contact me, I may not be able to respond for a few days.)
 
 Brock has replied off-list that he, too, is able to reliably reproduce 
 the issue with IMB, and is working to get access for us.  Many thanks 
 for your offer; let's see where Brock's access takes us.
>>> 
>>> Good.  Let me know if we could be useful
>>> 
>> -- we have not closed this issue,
> 
> Which issue?   I couldn't find a relevant-looking one.
 
 https://svn.open-mpi.org/trac/ompi/ticket/2714
>>> 
>>> Thanks.  In csse it's useful info, it hangs for me with 1.5.3 & np=32 on
>>> connectx with more than one collective I can't recall.
>> 
>> Extra data point, that ticket said it ran with mpi_preconnect_mpi 1,  
>> well that doesn't help here, both my production code (crash) and IMB 
>> still hang.
>> 
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>> 
>>> 
>>> -- 
>>> Excuse the typping -- I have a broken wrist
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>> 
>> 
>> ___
>> 

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-17 Thread Brock Palen
Thanks, I thought of looking at ompi_info after I sent that note, sigh.

SEND_INPLACE appears to help performance of larger messages in my synthetic 
benchmarks over regular SEND.  Also it appears that SEND_INPLACE still allows 
our code to run.

We're working on getting devs access to our system and code.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On May 16, 2011, at 11:49 AM, George Bosilca wrote:

> Here is the output of the "ompi_info --param btl openib":
> 
> MCA btl: parameter "btl_openib_flags" (current value: <306>, 
> data
>  source: default value)
>  BTL bit flags (general flags: SEND=1, PUT=2, GET=4,
>  SEND_INPLACE=8, RDMA_MATCHED=64, 
> HETEROGENEOUS_RDMA=256; flags
>  only used by the "dr" PML (ignored by others): 
> ACK=16,
>  CHECKSUM=32, RDMA_COMPLETION=128; flags only used by 
> the "bfo"
>  PML (ignored by others): FAILOVER_SUPPORT=512)
> 
> So the 305 flags means: HETEROGENEOUS_RDMA | CHECKSUM | ACK | SEND. Most of 
> these flags are totally useless in the current version of Open MPI (DR is not 
> supported), so the only value that really matter is SEND | HETEROGENEOUS_RDMA.
> 
> If you want to enable the send protocol try first with SEND | SEND_INPLACE 
> (9), if not downgrade to SEND (1)
> 
>  george.
> 
> On May 16, 2011, at 11:33 , Samuel K. Gutierrez wrote:
> 
>> 
>> On May 16, 2011, at 8:53 AM, Brock Palen wrote:
>> 
>>> 
>>> 
>>> 
>>> On May 16, 2011, at 10:23 AM, Samuel K. Gutierrez wrote:
>>> 
 Hi,
 
 Just out of curiosity - what happens when you add the following MCA option 
 to your openib runs?
 
 -mca btl_openib_flags 305
>>> 
>>> You Sir found the magic combination.
>> 
>> :-)  - cool.
>> 
>> Developers - does this smell like a registered memory availability hang?
>> 
>>> I verified this lets IMB and CRASH progress pass their lockup points,
>>> I will have a user test this, 
>> 
>> Please let us know what you find.
>> 
>>> Is this an ok option to put in our environment?  What does 305 mean?
>> 
>> There may be a performance hit associated with this configuration, but if it 
>> lets your users run, then I don't see a problem with adding it to your 
>> environment.
>> 
>> If I'm reading things correctly, 305 turns off RDMA PUT/GET and turns on 
>> SEND.
>> 
>> OpenFabrics gurus - please correct me if I'm wrong :-).
>> 
>> Samuel Gutierrez
>> Los Alamos National Laboratory
>> 
>> 
>>> 
>>> 
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> Center for Advanced Computing
>>> bro...@umich.edu
>>> (734)936-1985
>>> 
 
 Thanks,
 
 Samuel Gutierrez
 Los Alamos National Laboratory
 
 On May 13, 2011, at 2:38 PM, Brock Palen wrote:
 
> On May 13, 2011, at 4:09 PM, Dave Love wrote:
> 
>> Jeff Squyres  writes:
>> 
>>> On May 11, 2011, at 3:21 PM, Dave Love wrote:
>>> 
 We can reproduce it with IMB.  We could provide access, but we'd have 
 to
 negotiate with the owners of the relevant nodes to give you interactive
 access to them.  Maybe Brock's would be more accessible?  (If you
 contact me, I may not be able to respond for a few days.)
>>> 
>>> Brock has replied off-list that he, too, is able to reliably reproduce 
>>> the issue with IMB, and is working to get access for us.  Many thanks 
>>> for your offer; let's see where Brock's access takes us.
>> 
>> Good.  Let me know if we could be useful
>> 
> -- we have not closed this issue,
 
 Which issue?   I couldn't find a relevant-looking one.
>>> 
>>> https://svn.open-mpi.org/trac/ompi/ticket/2714
>> 
>> Thanks.  In csse it's useful info, it hangs for me with 1.5.3 & np=32 on
>> connectx with more than one collective I can't recall.
> 
> Extra data point, that ticket said it ran with mpi_preconnect_mpi 1,  
> well that doesn't help here, both my production code (crash) and IMB 
> still hang.
> 
> 
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
>> 
>> -- 
>> Excuse the typping -- I have a broken wrist
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
 
 
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
 
 
>>> 
>>> 
>>> ___
>>> 

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-16 Thread George Bosilca
Here is the output of the "ompi_info --param btl openib":

 MCA btl: parameter "btl_openib_flags" (current value: <306>, data source: default value)
          BTL bit flags (general flags: SEND=1, PUT=2, GET=4, SEND_INPLACE=8,
          RDMA_MATCHED=64, HETEROGENEOUS_RDMA=256; flags only used by the "dr"
          PML (ignored by others): ACK=16, CHECKSUM=32, RDMA_COMPLETION=128;
          flags only used by the "bfo" PML (ignored by others): FAILOVER_SUPPORT=512)

So the 305 value means HETEROGENEOUS_RDMA | CHECKSUM | ACK | SEND. Most of
these flags are useless in the current version of Open MPI (DR is not
supported), so the only flags that really matter are SEND | HETEROGENEOUS_RDMA.

If you want to enable the send protocol, first try SEND | SEND_INPLACE (9);
if that does not work, downgrade to SEND (1).
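(A minimal sketch of how one might try those values from the command line; the launch
details here are assumptions -- only the btl_openib_flags values come from this message:)

# Try SEND | SEND_INPLACE first (1 + 8 = 9) ...
mpirun --mca btl_openib_flags 9 -np 48 ./IMB-MPI1
# ... and if that misbehaves, fall back to plain SEND (1).
mpirun --mca btl_openib_flags 1 -np 48 ./IMB-MPI1
# The same thing via the environment, e.g. in a batch script:
export OMPI_MCA_btl_openib_flags=9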

  george.

On May 16, 2011, at 11:33 , Samuel K. Gutierrez wrote:

> 
> On May 16, 2011, at 8:53 AM, Brock Palen wrote:
> 
>> 
>> 
>> 
>> On May 16, 2011, at 10:23 AM, Samuel K. Gutierrez wrote:
>> 
>>> Hi,
>>> 
>>> Just out of curiosity - what happens when you add the following MCA option 
>>> to your openib runs?
>>> 
>>> -mca btl_openib_flags 305
>> 
>> You Sir found the magic combination.
> 
> :-)  - cool.
> 
> Developers - does this smell like a registered memory availability hang?
> 
>> I verified this lets IMB and CRASH progress pass their lockup points,
>> I will have a user test this, 
> 
> Please let us know what you find.
> 
>> Is this an ok option to put in our environment?  What does 305 mean?
> 
> There may be a performance hit associated with this configuration, but if it 
> lets your users run, then I don't see a problem with adding it to your 
> environment.
> 
> If I'm reading things correctly, 305 turns off RDMA PUT/GET and turns on SEND.
> 
> OpenFabrics gurus - please correct me if I'm wrong :-).
> 
> Samuel Gutierrez
> Los Alamos National Laboratory
> 
> 
>> 
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>> 
>>> 
>>> Thanks,
>>> 
>>> Samuel Gutierrez
>>> Los Alamos National Laboratory
>>> 
>>> On May 13, 2011, at 2:38 PM, Brock Palen wrote:
>>> 
 On May 13, 2011, at 4:09 PM, Dave Love wrote:
 
> Jeff Squyres  writes:
> 
>> On May 11, 2011, at 3:21 PM, Dave Love wrote:
>> 
>>> We can reproduce it with IMB.  We could provide access, but we'd have to
>>> negotiate with the owners of the relevant nodes to give you interactive
>>> access to them.  Maybe Brock's would be more accessible?  (If you
>>> contact me, I may not be able to respond for a few days.)
>> 
>> Brock has replied off-list that he, too, is able to reliably reproduce 
>> the issue with IMB, and is working to get access for us.  Many thanks 
>> for your offer; let's see where Brock's access takes us.
> 
> Good.  Let me know if we could be useful
> 
 -- we have not closed this issue,
>>> 
>>> Which issue?   I couldn't find a relevant-looking one.
>> 
>> https://svn.open-mpi.org/trac/ompi/ticket/2714
> 
> Thanks.  In csse it's useful info, it hangs for me with 1.5.3 & np=32 on
> connectx with more than one collective I can't recall.
 
 Extra data point, that ticket said it ran with mpi_preconnect_mpi 1,  well 
 that doesn't help here, both my production code (crash) and IMB still hang.
 
 
 Brock Palen
 www.umich.edu/~brockp
 Center for Advanced Computing
 bro...@umich.edu
 (734)936-1985
 
> 
> -- 
> Excuse the typping -- I have a broken wrist
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
 
 
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

George Bosilca
Research Assistant Professor
Innovative Computing Laboratory
Department of Electrical Engineering and Computer Science
University of Tennessee, Knoxville
http://web.eecs.utk.edu/~bosilca/




Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-16 Thread Brock Palen



On May 16, 2011, at 10:23 AM, Samuel K. Gutierrez wrote:

> Hi,
> 
> Just out of curiosity - what happens when you add the following MCA option to 
> your openib runs?
> 
> -mca btl_openib_flags 305

You, sir, found the magic combination.
I verified this lets IMB and CRASH progress past their lockup points;
I will have a user test this.
Is this an OK option to put in our environment?  What does 305 mean?


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985

> 
> Thanks,
> 
> Samuel Gutierrez
> Los Alamos National Laboratory
> 
> On May 13, 2011, at 2:38 PM, Brock Palen wrote:
> 
>> On May 13, 2011, at 4:09 PM, Dave Love wrote:
>> 
>>> Jeff Squyres  writes:
>>> 
 On May 11, 2011, at 3:21 PM, Dave Love wrote:
 
> We can reproduce it with IMB.  We could provide access, but we'd have to
> negotiate with the owners of the relevant nodes to give you interactive
> access to them.  Maybe Brock's would be more accessible?  (If you
> contact me, I may not be able to respond for a few days.)
 
 Brock has replied off-list that he, too, is able to reliably reproduce the 
 issue with IMB, and is working to get access for us.  Many thanks for your 
 offer; let's see where Brock's access takes us.
>>> 
>>> Good.  Let me know if we could be useful
>>> 
>> -- we have not closed this issue,
> 
> Which issue?   I couldn't find a relevant-looking one.
 
 https://svn.open-mpi.org/trac/ompi/ticket/2714
>>> 
>>> Thanks.  In csse it's useful info, it hangs for me with 1.5.3 & np=32 on
>>> connectx with more than one collective I can't recall.
>> 
>> Extra data point, that ticket said it ran with mpi_preconnect_mpi 1,  well 
>> that doesn't help here, both my production code (crash) and IMB still hang.
>> 
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>> 
>>> 
>>> -- 
>>> Excuse the typping -- I have a broken wrist
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 




Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-16 Thread Samuel K. Gutierrez
Hi,

Just out of curiosity - what happens when you add the following MCA option to 
your openib runs?

-mca btl_openib_flags 305

Thanks,

Samuel Gutierrez
Los Alamos National Laboratory

On May 13, 2011, at 2:38 PM, Brock Palen wrote:

> On May 13, 2011, at 4:09 PM, Dave Love wrote:
> 
>> Jeff Squyres  writes:
>> 
>>> On May 11, 2011, at 3:21 PM, Dave Love wrote:
>>> 
 We can reproduce it with IMB.  We could provide access, but we'd have to
 negotiate with the owners of the relevant nodes to give you interactive
 access to them.  Maybe Brock's would be more accessible?  (If you
 contact me, I may not be able to respond for a few days.)
>>> 
>>> Brock has replied off-list that he, too, is able to reliably reproduce the 
>>> issue with IMB, and is working to get access for us.  Many thanks for your 
>>> offer; let's see where Brock's access takes us.
>> 
>> Good.  Let me know if we could be useful
>> 
> -- we have not closed this issue,
 
 Which issue?   I couldn't find a relevant-looking one.
>>> 
>>> https://svn.open-mpi.org/trac/ompi/ticket/2714
>> 
>> Thanks.  In csse it's useful info, it hangs for me with 1.5.3 & np=32 on
>> connectx with more than one collective I can't recall.
> 
> Extra data point, that ticket said it ran with mpi_preconnect_mpi 1,  well 
> that doesn't help here, both my production code (crash) and IMB still hang.
> 
> 
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
>> 
>> -- 
>> Excuse the typping -- I have a broken wrist
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-13 Thread Brock Palen
On May 13, 2011, at 4:09 PM, Dave Love wrote:

> Jeff Squyres  writes:
> 
>> On May 11, 2011, at 3:21 PM, Dave Love wrote:
>> 
>>> We can reproduce it with IMB.  We could provide access, but we'd have to
>>> negotiate with the owners of the relevant nodes to give you interactive
>>> access to them.  Maybe Brock's would be more accessible?  (If you
>>> contact me, I may not be able to respond for a few days.)
>> 
>> Brock has replied off-list that he, too, is able to reliably reproduce the 
>> issue with IMB, and is working to get access for us.  Many thanks for your 
>> offer; let's see where Brock's access takes us.
> 
> Good.  Let me know if we could be useful
> 
 -- we have not closed this issue,
>>> 
>>> Which issue?   I couldn't find a relevant-looking one.
>> 
>> https://svn.open-mpi.org/trac/ompi/ticket/2714
> 
> Thanks.  In csse it's useful info, it hangs for me with 1.5.3 & np=32 on
> connectx with more than one collective I can't recall.

Extra data point: that ticket said it ran with mpi_preconnect_mpi 1, but that
doesn't help here; both my production code (CRASH) and IMB still hang.


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985

> 
> -- 
> Excuse the typping -- I have a broken wrist
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 




Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-13 Thread Dave Love
Jeff Squyres  writes:

> On May 11, 2011, at 3:21 PM, Dave Love wrote:
>
>> We can reproduce it with IMB.  We could provide access, but we'd have to
>> negotiate with the owners of the relevant nodes to give you interactive
>> access to them.  Maybe Brock's would be more accessible?  (If you
>> contact me, I may not be able to respond for a few days.)
>
> Brock has replied off-list that he, too, is able to reliably reproduce the 
> issue with IMB, and is working to get access for us.  Many thanks for your 
> offer; let's see where Brock's access takes us.

Good.  Let me know if we could be useful

>>> -- we have not closed this issue,
>> 
>> Which issue?   I couldn't find a relevant-looking one.
>
> https://svn.open-mpi.org/trac/ompi/ticket/2714

Thanks.  In case it's useful info: it hangs for me with 1.5.3 & np=32 on
ConnectX, with more than one collective whose name I can't recall.

-- 
Excuse the typing -- I have a broken wrist



Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-12 Thread Brock Palen
I am pretty sure MTLs and BTLs are very different, but just as a note:
this user's code (CRASH) hangs at MPI_Allreduce() on

openib

but runs on:
tcp
psm (an MTL, different hardware)

Putting it out there in case it has any bearing.  Otherwise ignore.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On May 12, 2011, at 10:20 AM, Brock Palen wrote:

> On May 12, 2011, at 10:13 AM, Jeff Squyres wrote:
> 
>> On May 11, 2011, at 3:21 PM, Dave Love wrote:
>> 
>>> We can reproduce it with IMB.  We could provide access, but we'd have to
>>> negotiate with the owners of the relevant nodes to give you interactive
>>> access to them.  Maybe Brock's would be more accessible?  (If you
>>> contact me, I may not be able to respond for a few days.)
>> 
>> Brock has replied off-list that he, too, is able to reliably reproduce the 
>> issue with IMB, and is working to get access for us.  Many thanks for your 
>> offer; let's see where Brock's access takes us.
> 
> I should also note that as far as I know I have three codes (CRASH, Namd 
> (some cases), and another user code.  That lockup on a collective on OpenIB 
> but run with the same library on Gig-e.
> 
> So I am not sure it is limited to IMB, or I could be crossing errors, 
> normally I would assume unmatched eager recvs for this sort of problem. 
> 
>> 
 -- we have not closed this issue,
>>> 
>>> Which issue?   I couldn't find a relevant-looking one.
>> 
>> https://svn.open-mpi.org/trac/ompi/ticket/2714
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 




Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-12 Thread Brock Palen
On May 12, 2011, at 10:13 AM, Jeff Squyres wrote:

> On May 11, 2011, at 3:21 PM, Dave Love wrote:
> 
>> We can reproduce it with IMB.  We could provide access, but we'd have to
>> negotiate with the owners of the relevant nodes to give you interactive
>> access to them.  Maybe Brock's would be more accessible?  (If you
>> contact me, I may not be able to respond for a few days.)
> 
> Brock has replied off-list that he, too, is able to reliably reproduce the 
> issue with IMB, and is working to get access for us.  Many thanks for your 
> offer; let's see where Brock's access takes us.

I should also note that, as far as I know, I have three codes (CRASH, NAMD in
some cases, and another user code) that lock up on a collective on openib but
run with the same library over Gig-E.

So I am not sure it is limited to IMB, or I could be conflating errors;
normally I would assume unmatched eager receives for this sort of problem.

> 
>>> -- we have not closed this issue,
>> 
>> Which issue?   I couldn't find a relevant-looking one.
> 
> https://svn.open-mpi.org/trac/ompi/ticket/2714
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 




Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-12 Thread Jeff Squyres
On May 11, 2011, at 3:21 PM, Dave Love wrote:

> We can reproduce it with IMB.  We could provide access, but we'd have to
> negotiate with the owners of the relevant nodes to give you interactive
> access to them.  Maybe Brock's would be more accessible?  (If you
> contact me, I may not be able to respond for a few days.)

Brock has replied off-list that he, too, is able to reliably reproduce the 
issue with IMB, and is working to get access for us.  Many thanks for your 
offer; let's see where Brock's access takes us.

>> -- we have not closed this issue,
> 
> Which issue?   I couldn't find a relevant-looking one.

https://svn.open-mpi.org/trac/ompi/ticket/2714

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-12 Thread Ralph Castain

On May 11, 2011, at 4:27 PM, Dave Love wrote:

> Ralph Castain  writes:
> 
>> I'll go back to my earlier comments. Users always claim that their
>> code doesn't have the sync issue, but it has proved to help more often
>> than not, and costs nothing to try,
> 
> Could you point to that post, or tell us what to try exactly, given
> we're running IMB?  Thanks.

http://www.open-mpi.org/community/lists/users/2011/04/16243.php

> 
> (As far as I know, this isn't happening with real codes, just IMB, but
> only a few have been in use.)

Interesting - my prior experience was with real codes, typically "legacy" codes 
that worked fine until you loaded the node.


> 
> -- 
> Excuse the typping -- I have a broken wrist
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-11 Thread Dave Love
Ralph Castain  writes:

> I'll go back to my earlier comments. Users always claim that their
> code doesn't have the sync issue, but it has proved to help more often
> than not, and costs nothing to try,

Could you point to that post, or tell us what to try exactly, given
we're running IMB?  Thanks.

(As far as I know, this isn't happening with real codes, just IMB, but
only a few have been in use.)

-- 
Excuse the typing -- I have a broken wrist


Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-11 Thread Ralph Castain


Sent from my iPad

On May 11, 2011, at 2:05 PM, Brock Palen  wrote:

> On May 9, 2011, at 9:31 AM, Jeff Squyres wrote:
> 
>> On May 3, 2011, at 6:42 AM, Dave Love wrote:
>> 
 We managed to have another user hit the bug that causes collectives (this 
 time MPI_Bcast() ) to hang on IB that was fixed by setting:
 
 btl_openib_cpc_include rdmacm
>>> 
>>> Could someone explain this?  We also have problems with collective hangs
>>> with openib/mlx4 (specifically in IMB), but not with psm, and I couldn't
>>> see any relevant issues filed.  However, rdmacm isn't an available value
>>> for that parameter with our 1.4.3 or 1.5.3 installations, only oob (not
>>> that I understand what these things are...).
>> 
>> Sorry for the delay -- perhaps an IB vendor can reply here with more 
>> detail...
>> 
>> We had a user-reported issue of some hangs that the IB vendors have been 
>> unable to replicate in their respective labs.  We *suspect* that it may be 
>> an issue with the oob openib CPC, but that code is pretty old and pretty 
>> mature, so all of us would be at least somewhat surprised if that were the 
>> case.  If anyone can reliably reproduce this error, please let us know 
>> and/or give us access to your machines -- we have not closed this issue, but 
>> are unable to move forward because the customers who reported this issue 
>> switched to rdmacm and moved on (i.e., we don't have access to their 
>> machines to test any more).
> 
> An update, we set all our ib0 interfaces to have IP's on a 172. network. This 
> allowed the use of rdmacm to work and get latencies that we would expect.  
> That said we are still getting hangs.  I can very reliably reproduce it using 
> IMB with a specific core count on a specific test case. 
> 
> Just an update.  Has anyone else had luck fixing the lockup issues on openib 
> BTL for collectives in some cases? Thanks!

I'll go back to my earlier comments. Users always claim that their code doesn't
have the sync issue, but the suggested fix has proved to help more often than
not, and it costs nothing to try.

My $.0002
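(For reference, a guess at what the earlier "sync" suggestion refers to: Open MPI's
coll/sync component, which inserts barriers around collectives.  The parameter names
below are my assumption of the usual knobs and may differ by version; check
"ompi_info --param coll sync" before relying on them:)

# Assumed sketch: raise the coll/sync priority and insert a barrier before
# every 100th collective, to keep unexpected messages from piling up.
mpirun --mca coll_sync_priority 100 \
       --mca coll_sync_barrier_before 100 \
       -np 48 ./IMB-MPI1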


> 
> 
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-11 Thread Brock Palen
On May 9, 2011, at 9:31 AM, Jeff Squyres wrote:

> On May 3, 2011, at 6:42 AM, Dave Love wrote:
> 
>>> We managed to have another user hit the bug that causes collectives (this 
>>> time MPI_Bcast() ) to hang on IB that was fixed by setting:
>>> 
>>> btl_openib_cpc_include rdmacm
>> 
>> Could someone explain this?  We also have problems with collective hangs
>> with openib/mlx4 (specifically in IMB), but not with psm, and I couldn't
>> see any relevant issues filed.  However, rdmacm isn't an available value
>> for that parameter with our 1.4.3 or 1.5.3 installations, only oob (not
>> that I understand what these things are...).
> 
> Sorry for the delay -- perhaps an IB vendor can reply here with more detail...
> 
> We had a user-reported issue of some hangs that the IB vendors have been 
> unable to replicate in their respective labs.  We *suspect* that it may be an 
> issue with the oob openib CPC, but that code is pretty old and pretty mature, 
> so all of us would be at least somewhat surprised if that were the case.  If 
> anyone can reliably reproduce this error, please let us know and/or give us 
> access to your machines -- we have not closed this issue, but are unable to 
> move forward because the customers who reported this issue switched to rdmacm 
> and moved on (i.e., we don't have access to their machines to test any more).

An update: we set all our ib0 interfaces to have IPs on a 172. network. This
allowed rdmacm to work and gave the latencies that we would expect.  That said,
we are still getting hangs; I can very reliably reproduce it using IMB with a
specific core count on a specific test case.

Just an update.  Has anyone else had luck fixing the lockup issues on openib 
BTL for collectives in some cases? Thanks!


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985

> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 




Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-09 Thread Jeff Squyres
On May 3, 2011, at 6:42 AM, Dave Love wrote:

>> We managed to have another user hit the bug that causes collectives (this 
>> time MPI_Bcast() ) to hang on IB that was fixed by setting:
>> 
>> btl_openib_cpc_include rdmacm
> 
> Could someone explain this?  We also have problems with collective hangs
> with openib/mlx4 (specifically in IMB), but not with psm, and I couldn't
> see any relevant issues filed.  However, rdmacm isn't an available value
> for that parameter with our 1.4.3 or 1.5.3 installations, only oob (not
> that I understand what these things are...).

Sorry for the delay -- perhaps an IB vendor can reply here with more detail...

We had a user-reported issue of some hangs that the IB vendors have been unable 
to replicate in their respective labs.  We *suspect* that it may be an issue 
with the oob openib CPC, but that code is pretty old and pretty mature, so all 
of us would be at least somewhat surprised if that were the case.  If anyone 
can reliably reproduce this error, please let us know and/or give us access to 
your machines -- we have not closed this issue, but are unable to move forward 
because the customers who reported this issue switched to rdmacm and moved on 
(i.e., we don't have access to their machines to test any more).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-09 Thread Jeff Squyres
Sorry for the delay on this -- it looks like the problem is caused by messages 
like this (from your first message):

[nyx0665.engin.umich.edu:06399] openib BTL: rdmacm IP address not found on port

RDMA CM requires IP addresses (i.e., IPoIB) to be enabled on every port/LID 
where you want to use it.
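(A quick way to check for that on a given node, as a hedged sketch; the interface
name ib0 is an assumption and will vary by host:)

# Does the IPoIB interface exist and carry an address on this port?
ip addr show ib0
# Or, on older distributions:
ifconfig ib0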


On May 5, 2011, at 1:15 PM, Brock Palen wrote:

> Yeah we have ran into more issues, with rdmacm not being avialable on all of 
> our hosts.  So it would be nice to know what we can do to test that a host 
> would support rdmacm,
> 
> Example:
> 
> --
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port.  As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
> 
>  Local host:   nyx5067.engin.umich.edu
>  Local device: mlx4_0
>  Local port:   1
>  CPCs attempted:   rdmacm
> --
> 
> This is one of our QDR hosts that rdmacm generally works on. Which this code 
> (CRASH) requires to avoid a collective hang in MPI_Allreduce() 
> 
> I look on this hosts and I find:
> [root@nyx5067 ~]# rpm -qa | grep rdma
> librdmacm-1.0.11-1
> librdmacm-1.0.11-1
> librdmacm-devel-1.0.11-1
> librdmacm-devel-1.0.11-1
> librdmacm-utils-1.0.11-1
> 
> So all the libraries are installed (I think) is there a way to verify this?  
> Or to have OpenMPI be more verbose what caused rdmacm to fail as an oob 
> option?
> 
> 
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On May 3, 2011, at 9:42 AM, Dave Love wrote:
> 
>> Brock Palen  writes:
>> 
>>> We managed to have another user hit the bug that causes collectives (this 
>>> time MPI_Bcast() ) to hang on IB that was fixed by setting:
>>> 
>>> btl_openib_cpc_include rdmacm
>> 
>> Could someone explain this?  We also have problems with collective hangs
>> with openib/mlx4 (specifically in IMB), but not with psm, and I couldn't
>> see any relevant issues filed.  However, rdmacm isn't an available value
>> for that parameter with our 1.4.3 or 1.5.3 installations, only oob (not
>> that I understand what these things are...).
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-05 Thread Brock Palen
Yeah, we have run into more issues, with rdmacm not being available on all of
our hosts.  So it would be nice to know what we can do to test whether a host
supports rdmacm.

Example:

--
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:   nyx5067.engin.umich.edu
  Local device: mlx4_0
  Local port:   1
  CPCs attempted:   rdmacm
--

This is one of our QDR hosts on which rdmacm generally works; this code (CRASH)
requires rdmacm to avoid a collective hang in MPI_Allreduce().

I look on this hosts and I find:
[root@nyx5067 ~]# rpm -qa | grep rdma
librdmacm-1.0.11-1
librdmacm-1.0.11-1
librdmacm-devel-1.0.11-1
librdmacm-devel-1.0.11-1
librdmacm-utils-1.0.11-1

So all the libraries are installed (I think); is there a way to verify this?  Or
to have Open MPI be more verbose about what caused rdmacm to fail as an option?
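(Not an authoritative answer, but a few checks that might confirm rdmacm support on a
host, assuming a stock OFED install; the module, device, and interface names below are
assumptions:)

# Userspace RDMA-CM library visible to the linker?
ldconfig -p | grep librdmacm
# Kernel RDMA-CM userspace interface loaded?
lsmod | grep rdma_ucm
ls -l /dev/infiniband/rdma_cm
# IPoIB address configured on the port (rdmacm needs one)?
ip addr show ib0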


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On May 3, 2011, at 9:42 AM, Dave Love wrote:

> Brock Palen  writes:
> 
>> We managed to have another user hit the bug that causes collectives (this 
>> time MPI_Bcast() ) to hang on IB that was fixed by setting:
>> 
>> btl_openib_cpc_include rdmacm
> 
> Could someone explain this?  We also have problems with collective hangs
> with openib/mlx4 (specifically in IMB), but not with psm, and I couldn't
> see any relevant issues filed.  However, rdmacm isn't an available value
> for that parameter with our 1.4.3 or 1.5.3 installations, only oob (not
> that I understand what these things are...).
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 




Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-03 Thread Dave Love
Brock Palen  writes:

> We managed to have another user hit the bug that causes collectives (this 
> time MPI_Bcast() ) to hang on IB that was fixed by setting:
>
> btl_openib_cpc_include rdmacm

Could someone explain this?  We also have problems with collective hangs
with openib/mlx4 (specifically in IMB), but not with psm, and I couldn't
see any relevant issues filed.  However, rdmacm isn't an available value
for that parameter with our 1.4.3 or 1.5.3 installations, only oob (not
that I understand what these things are...).



Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-28 Thread Brock Palen
Attached is the output of running with verbose 100: mpirun --mca
btl_openib_cpc_include rdmacm --mca btl_base_verbose 100 NPmpi
[nyx0665.engin.umich.edu:06399] mca: base: components_open: Looking for btl 
components
[nyx0666.engin.umich.edu:07210] mca: base: components_open: Looking for btl 
components
[nyx0665.engin.umich.edu:06399] mca: base: components_open: opening btl 
components
[nyx0665.engin.umich.edu:06399] mca: base: components_open: found loaded 
component ofud
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component ofud has 
no register function
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component ofud open 
function successful
[nyx0665.engin.umich.edu:06399] mca: base: components_open: found loaded 
component openib
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component openib 
has no register function
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component openib 
open function successful
[nyx0665.engin.umich.edu:06399] mca: base: components_open: found loaded 
component self
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component self has 
no register function
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component self open 
function successful
[nyx0665.engin.umich.edu:06399] mca: base: components_open: found loaded 
component sm
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component sm has no 
register function
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component sm open 
function successful
[nyx0665.engin.umich.edu:06399] mca: base: components_open: found loaded 
component tcp
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component tcp has 
no register function
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component tcp open 
function successful
[nyx0666.engin.umich.edu:07210] mca: base: components_open: opening btl 
components
[nyx0666.engin.umich.edu:07210] mca: base: components_open: found loaded 
component ofud
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component ofud has 
no register function
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component ofud open 
function successful
[nyx0666.engin.umich.edu:07210] mca: base: components_open: found loaded 
component openib
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component openib 
has no register function
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component openib 
open function successful
[nyx0666.engin.umich.edu:07210] mca: base: components_open: found loaded 
component self
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component self has 
no register function
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component self open 
function successful
[nyx0666.engin.umich.edu:07210] mca: base: components_open: found loaded 
component sm
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component sm has no 
register function
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component sm open 
function successful
[nyx0666.engin.umich.edu:07210] mca: base: components_open: found loaded 
component tcp
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component tcp has 
no register function
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component tcp open 
function successful
[nyx0665.engin.umich.edu:06399] select: initializing btl component ofud
[nyx0665.engin.umich.edu:06399] select: init of component ofud returned failure
[nyx0665.engin.umich.edu:06399] select: module ofud unloaded
[nyx0665.engin.umich.edu:06399] select: initializing btl component openib
[nyx0666.engin.umich.edu:07210] select: initializing btl component ofud
[nyx0666.engin.umich.edu:07210] select: init of component ofud returned failure
[nyx0666.engin.umich.edu:07210] select: module ofud unloaded
[nyx0666.engin.umich.edu:07210] select: initializing btl component openib
[nyx0665.engin.umich.edu:06399] openib BTL: rdmacm IP address not found on port
[nyx0665.engin.umich.edu:06399] openib BTL: rdmacm CPC unavailable for use on 
mthca0:1; skipped
--
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:   nyx0665.engin.umich.edu
  Local device: mthca0
  Local port:   1
  CPCs attempted:   rdmacm
--
[nyx0665.engin.umich.edu:06399] select: init of component openib returned 
failure
[nyx0665.engin.umich.edu:06399] select: module openib unloaded
[nyx0665.engin.umich.edu:06399] select: initializing btl component self
[nyx0665.engin.umich.edu:06399] select: init of component self returned success
[nyx0665.engin.umich.edu:06399] select: initializing btl component sm
[nyx0665.engin.umich.edu:06399] select: 

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-28 Thread Jeff Squyres
On Apr 27, 2011, at 10:02 AM, Brock Palen wrote:

> Argh, our messed up environment with three generations on infiniband bit us,
> Setting openib_cpc_include to rdmacm causes ib to not be used on our old DDR 
> ib on some of our hosts.  Note that jobs will never run across our old DDR ib 
> and our new QDR stuff where rdmacm does work.

Hmm -- odd.  I use RDMACM on some old DDR (and SDR!) IB hardware and it seems 
to work fine.

Do you have any indication as to why OMPI is refusing to use rdmacm on your 
older hardware, other than "No OF connection schemes reported..."?  Try running 
with --mca btl_base_verbose 100 (beware: it will be a truckload of output).  
Make sure that you have rdmacm support available on those machines, both in 
OMPI and in OFED/the OS.

> I am doing some testing with:
> export OMPI_MCA_btl_openib_cpc_include=rdmacm,oob,xoob
> 
> What I want to know is there a way to tell mpirun to 'dump all resolved mca 
> settings'  Or something similar. 

I'm not quite sure what you're asking here -- do you want to override MCA 
params on specific hosts?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-27 Thread Brock Palen
Argh, our messed-up environment with three generations of InfiniBand bit us:
setting openib_cpc_include to rdmacm causes IB to not be used on our old DDR IB
on some of our hosts.  Note that jobs will never run across both our old DDR IB
and our new QDR stuff, where rdmacm does work.

I am doing some testing with:
export OMPI_MCA_btl_openib_cpc_include=rdmacm,oob,xoob

What I want to know: is there a way to tell mpirun to 'dump all resolved MCA
settings', or something similar?
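(One candidate, hedged: Open MPI has an mpi_show_mca_params parameter that makes each
job dump its resolved MCA values at MPI_Init time; the exact value syntax below is my
recollection and worth double-checking with ompi_info:)

# Assumed sketch: print every resolved MCA parameter (and where it came from) at startup.
mpirun --mca mpi_show_mca_params all -np 2 ./a.out
# Offline inspection of defaults/current values:
ompi_info --param btl openib
ompi_info --all | less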

The error we get, which I think is expected when we set only rdmacm, is:
--
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:   nyx0665.engin.umich.edu
  Local device: mthca0
  Local port:   1
  CPCs attempted:   rdmacm
--

Again I think this is expected on this older hardware. 

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On Apr 22, 2011, at 10:23 AM, Brock Palen wrote:

> On Apr 21, 2011, at 6:49 PM, Ralph Castain wrote:
> 
>> 
>> On Apr 21, 2011, at 4:41 PM, Brock Palen wrote:
>> 
>>> Given that part of our cluster is TCP only, openib wouldn't even startup on 
>>> those hosts
>> 
>> That is correct - it would have no impact on those hosts
>> 
>>> and this would be ignored on hosts with IB adaptors?  
>> 
>> Ummm...not sure I understand this one. The param -will- be used on hosts 
>> with IB adaptors because that is what it is controlling.
>> 
>> However, it -won't- have any impact on hosts without IB adaptors, which is 
>> what I suspect you meant to ask?
> 
> Correct typo, Thanks, I am going to add the environment variable to our 
> OpenMPI modules so rdmacm is our default for now,  Thanks!
> 
>> 
>> 
>>> 
>>> Just checking thanks!
>>> 
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> Center for Advanced Computing
>>> bro...@umich.edu
>>> (734)936-1985
>>> 
>>> 
>>> 
>>> On Apr 21, 2011, at 6:21 PM, Jeff Squyres wrote:
>>> 
 Over IB, I'm not sure there is much of a drawback.  It might be slightly 
 slower to establish QP's, but I don't think that matters much.
 
 Over iWARP, rdmacm can cause connection storms as you scale to thousands 
 of MPI processes.
 
 
 On Apr 20, 2011, at 5:03 PM, Brock Palen wrote:
 
> We managed to have another user hit the bug that causes collectives (this 
> time MPI_Bcast() ) to hang on IB that was fixed by setting:
> 
> btl_openib_cpc_include rdmacm
> 
> My question is if we set this to the default on our system with an 
> environment variable does it introduce any performance or other issues we 
> should be aware of?
> 
> Is there a reason we should not use rdmacm?
> 
> Thanks!
> 
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
 
 
 -- 
 Jeff Squyres
 jsquy...@cisco.com
 For corporate legal information go to:
 http://www.cisco.com/web/about/doing_business/legal/cri/
 
 
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
 
 
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 




Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-22 Thread Brock Palen
On Apr 21, 2011, at 6:49 PM, Ralph Castain wrote:

> 
> On Apr 21, 2011, at 4:41 PM, Brock Palen wrote:
> 
>> Given that part of our cluster is TCP only, openib wouldn't even startup on 
>> those hosts
> 
> That is correct - it would have no impact on those hosts
> 
>> and this would be ignored on hosts with IB adaptors?  
> 
> Ummm...not sure I understand this one. The param -will- be used on hosts with 
> IB adaptors because that is what it is controlling.
> 
> However, it -won't- have any impact on hosts without IB adaptors, which is 
> what I suspect you meant to ask?

Correct - that was a typo on my part. Thanks! I am going to add the environment 
variable to our OpenMPI modules so rdmacm is our default for now.
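
For what it's worth, a minimal sketch of what that default could look like. Open MPI reads any MCA parameter from an environment variable named OMPI_MCA_<param>; the install paths below are placeholders, and the exact mechanism (modulefile setenv, profile fragment, etc.) is whatever the site's module system uses:

    # e.g. in a shell fragment applied when the Open MPI module is loaded
    export PATH=/opt/openmpi/current/bin:$PATH
    export LD_LIBRARY_PATH=/opt/openmpi/current/lib:$LD_LIBRARY_PATH
    # make rdmacm the default connection manager for the openib BTL
    export OMPI_MCA_btl_openib_cpc_include=rdmacm

A user can still override this per job with -mca btl_openib_cpc_include on the mpirun command line, since command-line MCA values take precedence over the environment.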

> 
> 
>> 
>> Just checking thanks!
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>> On Apr 21, 2011, at 6:21 PM, Jeff Squyres wrote:
>> 
>>> Over IB, I'm not sure there is much of a drawback.  It might be slightly 
>>> slower to establish QP's, but I don't think that matters much.
>>> 
>>> Over iWARP, rdmacm can cause connection storms as you scale to thousands of 
>>> MPI processes.
>>> 
>>> 
>>> On Apr 20, 2011, at 5:03 PM, Brock Palen wrote:
>>> 
 We managed to have another user hit the bug that causes collectives (this 
 time MPI_Bcast() ) to hang on IB that was fixed by setting:
 
 btl_openib_cpc_include rdmacm
 
 My question is if we set this to the default on our system with an 
 environment variable does it introduce any performance or other issues we 
 should be aware of?
 
 Is there a reason we should not use rdmacm?
 
 Thanks!
 
 Brock Palen
 www.umich.edu/~brockp
 Center for Advanced Computing
 bro...@umich.edu
 (734)936-1985
 
 
 
 
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>>> -- 
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 




Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-21 Thread Ralph Castain

On Apr 21, 2011, at 4:41 PM, Brock Palen wrote:

> Given that part of our cluster is TCP only, openib wouldn't even startup on 
> those hosts

That is correct - it would have no impact on those hosts.

> and this would be ignored on hosts with IB adaptors?  

Ummm...not sure I understand this one. The param -will- be used on hosts with 
IB adaptors because that is what it is controlling.

However, it -won't- have any impact on hosts without IB adaptors, which is what 
I suspect you meant to ask?
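
If it helps to confirm per host whether the openib BTL (and which CPC) is actually being used, a hedged sketch using the BTL verbose knob (the verbosity level and program name are arbitrary):

    # print BTL selection decisions from each rank
    mpirun --mca btl_base_verbose 10 -np 2 ./a.out
    # hosts without an IB adaptor should simply not select openib;
    # hosts with one will typically log which CPCs were considered/chosen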


> 
> Just checking thanks!
> 
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On Apr 21, 2011, at 6:21 PM, Jeff Squyres wrote:
> 
>> Over IB, I'm not sure there is much of a drawback.  It might be slightly 
>> slower to establish QP's, but I don't think that matters much.
>> 
>> Over iWARP, rdmacm can cause connection storms as you scale to thousands of 
>> MPI processes.
>> 
>> 
>> On Apr 20, 2011, at 5:03 PM, Brock Palen wrote:
>> 
>>> We managed to have another user hit the bug that causes collectives (this 
>>> time MPI_Bcast() ) to hang on IB that was fixed by setting:
>>> 
>>> btl_openib_cpc_include rdmacm
>>> 
>>> My question is if we set this to the default on our system with an 
>>> environment variable does it introduce any performance or other issues we 
>>> should be aware of?
>>> 
>>> Is there a reason we should not use rdmacm?
>>> 
>>> Thanks!
>>> 
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> Center for Advanced Computing
>>> bro...@umich.edu
>>> (734)936-1985
>>> 
>>> 
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-21 Thread Brock Palen
Given that part of our cluster is TCP only, openib wouldn't even start up on 
those hosts, and this would be ignored on hosts with IB adaptors?  

Just checking thanks!

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On Apr 21, 2011, at 6:21 PM, Jeff Squyres wrote:

> Over IB, I'm not sure there is much of a drawback.  It might be slightly 
> slower to establish QP's, but I don't think that matters much.
> 
> Over iWARP, rdmacm can cause connection storms as you scale to thousands of 
> MPI processes.
> 
> 
> On Apr 20, 2011, at 5:03 PM, Brock Palen wrote:
> 
>> We managed to have another user hit the bug that causes collectives (this 
>> time MPI_Bcast() ) to hang on IB that was fixed by setting:
>> 
>> btl_openib_cpc_include rdmacm
>> 
>> My question is if we set this to the default on our system with an 
>> environment variable does it introduce any performance or other issues we 
>> should be aware of?
>> 
>> Is there a reason we should not use rdmacm?
>> 
>> Thanks!
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 




Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-21 Thread Jeff Squyres
Over IB, I'm not sure there is much of a drawback.  It might be slightly slower 
to establish QPs, but I don't think that matters much.

Over iWARP, rdmacm can cause connection storms as you scale to thousands of MPI 
processes.


On Apr 20, 2011, at 5:03 PM, Brock Palen wrote:

> We managed to have another user hit the bug that causes collectives (this 
> time MPI_Bcast() ) to hang on IB that was fixed by setting:
> 
> btl_openib_cpc_include rdmacm
> 
> My question is if we set this to the default on our system with an 
> environment variable does it introduce any performance or other issues we 
> should be aware of?
> 
> Is there a reason we should not use rdmacm?
> 
> Thanks!
> 
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-20 Thread Brock Palen
We managed to have another user hit the bug that causes collectives (this time 
MPI_Bcast()) to hang on IB, which was fixed by setting:

btl_openib_cpc_include rdmacm

My question is: if we set this as the default on our system with an environment 
variable, does it introduce any performance or other issues we should be aware 
of?

Is there a reason we should not use rdmacm?
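
For reference, a sketch of the usual places an MCA parameter like this can be set; the paths shown are the conventional defaults and may differ per install:

    # per-run, on the command line:
    mpirun --mca btl_openib_cpc_include rdmacm -np 16 ./a.out

    # system-wide, via the environment (e.g. exported from a module or profile):
    export OMPI_MCA_btl_openib_cpc_include=rdmacm

    # or in an MCA parameter file, e.g. $HOME/.openmpi/mca-params.conf
    # or <prefix>/etc/openmpi-mca-params.conf, as the line:
    #   btl_openib_cpc_include = rdmacm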

Thanks!

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985