Hey Jeff, what did you run to generate the memory corruption? Can you
run the same test with --mca btl_openib_memalign_threshold 12288 and see
if you get the same corruption? I'm not hitting any corruption over
iw_cxgb4 with a simple test.
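
For reference, a minimal sketch of the run requested above. The test binary name (`./my_mpi_test`), the process count, and the `--mca btl openib,self` selection are assumptions for illustration; only the `btl_openib_memalign_threshold 12288` setting comes from this thread.

```shell
#!/bin/sh
# Hypothetical test binary: substitute the program that showed the corruption.
TEST_PROG=${TEST_PROG:-./my_mpi_test}
# Threshold value suggested in the thread.
THRESHOLD=12288

# Assumption: restrict the run to the openib BTL (plus self) so the setting is exercised.
CMD="mpirun -np 2 --mca btl openib,self --mca btl_openib_memalign_threshold $THRESHOLD $TEST_PROG"
echo "$CMD"

# Only launch when Open MPI is on PATH and the test binary exists.
if command -v mpirun >/dev/null 2>&1 && [ -x "$TEST_PROG" ]; then
    $CMD
fi
```

The same parameter can also be set through the environment as `OMPI_MCA_btl_openib_memalign_threshold=12288` instead of the `--mca` flag.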
On 6/10/2015 2:39 PM, Jeff Squyres (jsquyres) wrote
e change
that breaks things until we can figure this out.
-----Original Message-----
From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com]
Sent: Wednesday, June 10, 2015 3:30 PM
To: Open MPI User's List
Cc: Nathan Hjelm; Steve Wise
Subject: Re: [OMPI users] Default value of btl_openib
Nathan / Steve -- you guys are nominally the owners of the openib BTL: can you
please investigate?
> On Jun 10, 2015, at 4:15 PM, Ralph Castain wrote:
Odd - without that setting, the value is essentially undefined, so it’s hard to
understand how that is any better. Maybe the whole alignment thing is busted,
and leaving it undefined (which usually defaults to zero, but not always)
causes it to be turned “off”?
I don’t really care, mind you - b
Ralph --
This change was not correct
(https://github.com/open-mpi/ompi/commit/ce915b5757d428d3e914dcef50bd4b2636561bca).
It is causing memory corruption in the openib BTL.
> On May 25, 2015, at 11:56 AM, Ralph Castain wrote:
I don’t see a problem with it. FWIW: I’m getting ready to release 1.8.6 in the
next week
> On May 25, 2015, at 8:46 AM, Xavier Besseron wrote:
Good that it will be fixed in the next release!
In the meantime, and because it might impact other users,
I would like to ask my sysadmins to set btl_openib_memalign_threshold=12288
in etc/openmpi-mca-params.conf on our clusters.
Do you see any good reason not to do it?
Thanks!
Xavier
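
A minimal sketch of the site-wide setting described above. `OMPI_PREFIX` is a placeholder for the Open MPI install prefix (the value passed to `--prefix`); the `/tmp` default here is only so the snippet can be tried safely and is not a real install location.

```shell
#!/bin/sh
# Placeholder prefix: point this at the real Open MPI install prefix.
OMPI_PREFIX=${OMPI_PREFIX:-/tmp/openmpi-example}
CONF="$OMPI_PREFIX/etc/openmpi-mca-params.conf"

mkdir -p "$OMPI_PREFIX/etc"

# Append the setting only if the parameter is not already present in the file.
grep -q '^btl_openib_memalign_threshold' "$CONF" 2>/dev/null || \
    echo 'btl_openib_memalign_threshold = 12288' >> "$CONF"

cat "$CONF"
```

Values in this file act as defaults, so an individual user can still override them on the `mpirun` command line with `--mca`.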
On Mo
I found the problem - someone had a typo in btl_openib_mca.c. The threshold
needs to be set to the module eager limit, as that is the only thing defined at
that point.
Thanks for bringing it to our attention! I’ll set it up to go into 1.8.6.
> On May 25, 2015, at 3:04 AM, Xavier Besseron wrote:
Hi,
Thanks for your reply Ralph.
The only option I'm using when configuring Open MPI is '--prefix'.
When checking the config.log file, I see
configure:208504: checking whether the openib BTL will use malloc hooks
configure:208510: result: yes
so I guess it is properly enabled (full config.log in
Looking at the code, we do in fact set the memalign_threshold = eager_limit by
default, but only if you configured with --enable-btl-openib-malloc-alignment
AND/OR we found the malloc hook functions were available.
You might check config.log to see if the openib malloc hooks were enabled. My
gue
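
The config.log check discussed in this thread can be sketched as follows. The function name `check_malloc_hooks` is hypothetical; the configure message it searches for is the one quoted above, and the `ompi_info` step is skipped when Open MPI is not on PATH.

```shell
#!/bin/sh
# Hypothetical helper: print the configure check for openib malloc hooks
# and the "result:" line that follows it in the given config.log.
check_malloc_hooks() {
    log=${1:-config.log}
    if [ -f "$log" ]; then
        grep -A1 'checking whether the openib BTL will use malloc hooks' "$log"
    else
        echo "no $log found"
    fi
}

check_malloc_hooks "${CONFIG_LOG:-config.log}"

# If Open MPI is installed, also show the value the openib BTL actually uses.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info --param btl openib --level 9 | grep memalign \
        || echo "memalign parameter not reported"
fi
```

A `result: yes` line only indicates that the hooks were enabled at build time; the `ompi_info` output is what shows the threshold in effect.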