Sorry to bring this back up.
During a recent outage we updated the firmware on our GD4700 and installed a
new Mellanox-provided OFED stack, and the problem has returned.
Specifically, I am able to reproduce the problem with IMB on 4 12-core nodes when
it tries to go to 16 cores. I have verified that
Brock Palen writes:
> Well, I have a new wrench in this situation.
> We had a power failure at our datacenter that took down our entire system:
> nodes, switch, SM.
> Now I am unable to reproduce the error with oob, default ibflags, etc.
As far as I know, we could still reproduce it.
Well, I have a new wrench in this situation.
We had a power failure at our datacenter that took down our entire system:
nodes, switch, SM.
Now I am unable to reproduce the error with oob, default ibflags, etc.
Does this shed any light on the issue? It also makes it hard now to debug the
issue without
Sorry typo 314 not 313,
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985
On May 17, 2011, at 2:02 PM, Brock Palen wrote:
> Thanks, I thought of looking at ompi_info after I sent that note, sigh.
>
> SEND_INPLACE appears to help performance of
Thanks, I thought of looking at ompi_info after I sent that note, sigh.
SEND_INPLACE appears to help performance of larger messages in my synthetic
benchmarks over regular SEND. Also it appears that SEND_INPLACE still allows
our code to run.
We are working on getting devs access to our system and
Here is the output of the "ompi_info --param btl openib":
MCA btl: parameter "btl_openib_flags" (current value: <306>, data source: default value)
         BTL bit flags (general flags: SEND=1, PUT=2, GET=4,
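Since the reported <306> and the suggested 305 differ only in the low bits, it can help to decode the bitmask. A minimal shell sketch: the SEND/PUT/GET values come straight from the ompi_info output above, while the remaining names (NEED_ACK=16, NEED_CSUM=32, HETEROGENEOUS_RDMA=256) are assumptions based on the Open MPI btl.h of that era:

```shell
# Decode an Open MPI btl_openib_flags bitmask.
# SEND/PUT/GET are from the ompi_info output; the other flag values
# are assumptions and should be checked against your Open MPI source.
decode_btl_flags() {
    local v=$1 out=""
    [ $((v & 1)) -ne 0 ]   && out="$out SEND"
    [ $((v & 2)) -ne 0 ]   && out="$out PUT"
    [ $((v & 4)) -ne 0 ]   && out="$out GET"
    [ $((v & 16)) -ne 0 ]  && out="$out NEED_ACK"
    [ $((v & 32)) -ne 0 ]  && out="$out NEED_CSUM"
    [ $((v & 256)) -ne 0 ] && out="$out HETEROGENEOUS_RDMA"
    echo "$v:$out"
}

decode_btl_flags 306   # the reported default
decode_btl_flags 305   # the suggested workaround value
```

If those assumed values are right, the only difference is that 305 swaps the PUT bit for SEND, i.e. the workaround disables RDMA put on the openib BTL.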
On May 16, 2011, at 10:23 AM, Samuel K. Gutierrez wrote:
> Hi,
>
> Just out of curiosity - what happens when you add the following MCA option to
> your openib runs?
>
> -mca btl_openib_flags 305
You, sir, found the magic combination.
I verified this lets IMB and CRASH progress past their
Hi,
Just out of curiosity - what happens when you add the following MCA option to
your openib runs?
-mca btl_openib_flags 305
Thanks,
Samuel Gutierrez
Los Alamos National Laboratory
On May 13, 2011, at 2:38 PM, Brock Palen wrote:
> On May 13, 2011, at 4:09 PM, Dave Love wrote:
>
>> Jeff
On May 13, 2011, at 4:09 PM, Dave Love wrote:
> Jeff Squyres writes:
>
>> On May 11, 2011, at 3:21 PM, Dave Love wrote:
>>
>>> We can reproduce it with IMB. We could provide access, but we'd have to
>>> negotiate with the owners of the relevant nodes to give you
Jeff Squyres writes:
> On May 11, 2011, at 3:21 PM, Dave Love wrote:
>
>> We can reproduce it with IMB. We could provide access, but we'd have to
>> negotiate with the owners of the relevant nodes to give you interactive
>> access to them. Maybe Brock's would be more
I am pretty sure MTLs and BTLs are very different, but just as a note,
this user's code (CRASH) hangs at MPI_Allreduce() on:
openib
But runs on:
tcp
psm (an MTL, different hardware)
Putting it out there in case it has any bearing. Otherwise ignore.
Brock Palen
www.umich.edu/~brockp
Center
On May 12, 2011, at 10:13 AM, Jeff Squyres wrote:
> On May 11, 2011, at 3:21 PM, Dave Love wrote:
>
>> We can reproduce it with IMB. We could provide access, but we'd have to
>> negotiate with the owners of the relevant nodes to give you interactive
>> access to them. Maybe Brock's would be
On May 11, 2011, at 3:21 PM, Dave Love wrote:
> We can reproduce it with IMB. We could provide access, but we'd have to
> negotiate with the owners of the relevant nodes to give you interactive
> access to them. Maybe Brock's would be more accessible? (If you
> contact me, I may not be able to
On May 11, 2011, at 4:27 PM, Dave Love wrote:
> Ralph Castain writes:
>
>> I'll go back to my earlier comments. Users always claim that their
>> code doesn't have the sync issue, but it has proved to help more often
>> than not, and costs nothing to try,
>
> Could you
Ralph Castain writes:
> I'll go back to my earlier comments. Users always claim that their
> code doesn't have the sync issue, but it has proved to help more often
> than not, and costs nothing to try,
Could you point to that post, or tell us exactly what to try, given
Sent from my iPad
On May 11, 2011, at 2:05 PM, Brock Palen wrote:
> On May 9, 2011, at 9:31 AM, Jeff Squyres wrote:
>
>> On May 3, 2011, at 6:42 AM, Dave Love wrote:
>>
We managed to have another user hit the bug that causes collectives (this
time MPI_Bcast() )
On May 9, 2011, at 9:31 AM, Jeff Squyres wrote:
> On May 3, 2011, at 6:42 AM, Dave Love wrote:
>
>>> We managed to have another user hit the bug that causes collectives (this
>>> time MPI_Bcast() ) to hang on IB that was fixed by setting:
>>>
>>> btl_openib_cpc_include rdmacm
>>
>> Could
On May 3, 2011, at 6:42 AM, Dave Love wrote:
>> We managed to have another user hit the bug that causes collectives (this
>> time MPI_Bcast() ) to hang on IB that was fixed by setting:
>>
>> btl_openib_cpc_include rdmacm
>
> Could someone explain this? We also have problems with collective
Sorry for the delay on this -- it looks like the problem is caused by messages
like this (from your first message):
[nyx0665.engin.umich.edu:06399] openib BTL: rdmacm IP address not found on port
RDMA CM requires IP addresses (i.e., IPoIB) to be enabled on every port/LID
where you want to use
Yeah, we have run into more issues, with rdmacm not being available on all of
our hosts. So it would be nice to know what we can do to test whether a host
would support rdmacm,
Example:
--
No OpenFabrics connection schemes
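Since the error message says rdmacm needs an IP address on the port, one hedged way to pre-check a host before a job lands on it (a sketch assuming Linux with iproute2 and an IPoIB interface named ib0; your interface name may differ) is:

```shell
# rdmacm requires an IPoIB address configured on the IB port.
# Returns success if the named interface has an IPv4 address.
has_ipoib_addr() {
    ip -4 addr show "$1" 2>/dev/null | grep -q 'inet '
}

if has_ipoib_addr ib0; then
    echo "ib0 has an IPoIB address; rdmacm should be usable here"
else
    echo "ib0 has no IPoIB address; expect 'rdmacm IP address not found on port'"
fi
```

Running this in a prologue or health check would flag DDR hosts without IPoIB before Open MPI falls over at startup.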
Brock Palen writes:
> We managed to have another user hit the bug that causes collectives (this
> time MPI_Bcast() ) to hang on IB that was fixed by setting:
>
> btl_openib_cpc_include rdmacm
Could someone explain this? We also have problems with collective hangs
with
Attached is the output of running with verbose 100, mpirun --mca
btl_openib_cpc_include rdmacm --mca btl_base_verbose 100 NPmpi
[nyx0665.engin.umich.edu:06399] mca: base: components_open: Looking for btl
components
[nyx0666.engin.umich.edu:07210] mca: base: components_open: Looking for btl
On Apr 27, 2011, at 10:02 AM, Brock Palen wrote:
> Argh, our messed-up environment with three generations of InfiniBand bit us.
> Setting openib_cpc_include to rdmacm causes ib to not be used on our old DDR
> ib on some of our hosts. Note that jobs will never run across our old DDR ib
> and
Argh, our messed-up environment with three generations of InfiniBand bit us.
Setting openib_cpc_include to rdmacm causes ib to not be used on our old DDR ib
on some of our hosts. Note that jobs will never run across our old DDR ib and
our new QDR stuff, where rdmacm does work.
I am doing some
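One way to keep rdmacm confined to the hosts where it works (a sketch, assuming each generation of nodes has its own Open MPI install or host-local config file) is to set the CPC list in the per-host MCA params file rather than on the mpirun command line, so jobs that never span both fabrics pick it up automatically:

```conf
# <prefix>/etc/openmpi-mca-params.conf on the QDR hosts (hedged sketch):
btl_openib_cpc_include = rdmacm

# On the old DDR hosts, leave btl_openib_cpc_include unset so the default
# (oob) connection scheme is used and openib still starts.
```

The same syntax works in a user's $HOME/.openmpi/mca-params.conf if the installs are shared.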
On Apr 21, 2011, at 6:49 PM, Ralph Castain wrote:
>
> On Apr 21, 2011, at 4:41 PM, Brock Palen wrote:
>
>> Given that part of our cluster is TCP only, openib wouldn't even startup on
>> those hosts
>
> That is correct - it would have no impact on those hosts
>
>> and this would be ignored on
On Apr 21, 2011, at 4:41 PM, Brock Palen wrote:
> Given that part of our cluster is TCP only, openib wouldn't even startup on
> those hosts
That is correct - it would have no impact on those hosts
> and this would be ignored on hosts with IB adaptors?
Ummm...not sure I understand this one.
Given that part of our cluster is TCP only, openib wouldn't even startup on
those hosts and this would be ignored on hosts with IB adaptors?
Just checking thanks!
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985
On Apr 21, 2011, at 6:21 PM,
Over IB, I'm not sure there is much of a drawback. It might be slightly slower
to establish QPs, but I don't think that matters much.
Over iWARP, rdmacm can cause connection storms as you scale to thousands of MPI
processes.
On Apr 20, 2011, at 5:03 PM, Brock Palen wrote:
> We managed to