Hi - I'm trying to compile OpenMPI 4.0.0 with srun support, so I'm trying to
tell OpenMPI's configure where to find the relevant files by doing

$ ./configure --with-verbs --with-ofi --with-pmi=/usr/include/slurm \
    --with-pmi-libdir=/usr/lib64 --prefix=/share/apps/mpi/openmpi/4.0.0/ib/gnu
Hi Jeff - do you remember this issue from a couple of months ago?
Unfortunately, the failure to find pmi.h is still happening. I just tried with
4.0.1 (not the rc), and I still run into the same error (failing to find the
#include when compiling opal/mca/pmix/s1/mca_pmix_s1_la-pmix_s1.lo):
make[2]: …
I tried to disable ucx (successfully, I think - I replaced the "--mca btl ucx
--mca btl ^vader,tcp,openib" with "--mca btl_openib_allow_ib 1", and attaching
gdb to a running process shows no ucx-related routines active). It still has
the same fast-growing (1 GB/s) memory usage problem.
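For anyone trying to reproduce this, a minimal sketch of how I watch the resident set size grow (the PID below is a stand-in for a running MPI rank's PID):

```shell
# Minimal sketch: sample a process's resident set size (RSS, in kB).
# $$ (this shell itself) is a placeholder; substitute a real rank's PID
# and wrap this in a sleep loop to watch the ~1 GB/s growth.
pid=$$
ps -o rss= -p "$pid"
```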
> On Jun 19, 2019, at 2:00 PM, John Hearns via users
> wrote:
>
> Noam, it may be a stupid question. Could you try running slabtop as the
> program executes?
The top SIZE usage is this line:
OBJS    ACTIVE  USE  OBJ SIZE  SLABS  OBJ/SLAB  CACHE SIZE  NAME
5937540 5937540 …
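As an aside, the same "top SIZE" line can be pulled out of slabinfo-style output with a plain sort; a sketch on hypothetical sample data (cache names and counts invented for illustration):

```shell
# Hypothetical sample rows ("name object-count"); the numeric reverse sort
# picks out the largest cache, i.e. the line slabtop shows at the top when
# sorted by size.
printf 'cache_a 5937540\ncache_b 1200\ncache_c 88000\n' \
  | sort -k2,2 -rn | head -1
```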
Hi - we’re having a weird problem with OpenMPI on our newish infiniband EDR
(mlx5) nodes. We're running CentOS 7.6, with all the infiniband and ucx
libraries as provided by CentOS, i.e.
ucx-1.4.0-1.el7.x86_64
libibverbs-utils-17.2-3.el7.x86_64
libibverbs-17.2-3.el7.x86_64
> On Jun 19, 2019, at 5:05 PM, Joshua Ladd wrote:
>
> Hi, Noam
>
> Can you try your original command line with the following addition:
>
> mpirun --mca pml ucx --mca btl ^vader,tcp,openib --mca osc ucx
>
> I think we're seeing some conflict between UCX PML and UCT OSC.
I did this, although …
> On Jun 19, 2019, at 2:44 PM, George Bosilca wrote:
>
> To completely disable UCX you need to disable the UCX MTL and not only the
> BTL. I would use "--mca pml ob1 --mca btl ^ucx --mca btl_openib_allow_ib 1".
Thanks for the pointer. Disabling ucx this way _does_ seem to fix the memory
problem.
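For the archives, the full UCX-disabling invocation as I understand George's suggestion (`./your_app` is a placeholder for the real binary):

```shell
# Assemble George's UCX-disabling MCA flags; './your_app' is a placeholder.
mca_args="--mca pml ob1 --mca btl ^ucx --mca btl_openib_allow_ib 1"
echo mpirun $mca_args ./your_app
```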
> On Jun 20, 2019, at 4:44 AM, Charles A Taylor wrote:
>
> This looks a lot like a problem I had with OpenMPI 3.1.2. I thought the fix
> had landed in 4.0.0, but you might want to check the code to be sure there
> wasn't a regression in 4.1.x. Most of our codes are still running 3.1.2.
> On Jun 20, 2019, at 12:25 PM, Carlson, Timothy S
> wrote:
>
> As of recently you needed to use --with-slurm and --with-pmi2.
>
> While the configure line indicates that it picks up pmi2 as part of slurm,
> that is not in fact true, and you need to specifically tell it about pmi2.
When I do …
> On Jun 20, 2019, at 12:55 PM, Carlson, Timothy S
> wrote:
>
> Just pass /usr to configure instead of /usr/include/slurm
This seems to have done it (as did passing CPPFLAGS, but this feels cleaner).
Thank you all for the suggestions.
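For anyone who finds this thread later, the configure line that worked for me, with /usr passed per Tim's suggestion so both the headers and the libraries are found (the prefix is our local install path; adjust to taste):

```shell
./configure --with-verbs --with-ofi --with-pmi=/usr \
    --with-pmi-libdir=/usr/lib64 \
    --prefix=/share/apps/mpi/openmpi/4.0.0/ib/gnu
```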
> On Jun 20, 2019, at 10:42 AM, Noam Bernstein via users
> wrote:
>
> I haven’t yet tried the latest OFED or Mellanox low level stuff. That’s next
> on my list, but slightly more involved to do, so I’ve been avoiding it.
>
Aha - using Mellanox's OFED packaging seems to fix the problem.
> On Jun 20, 2019, at 1:38 PM, Nathan Hjelm via users
> wrote:
>
> THAT is a good idea. When using Omnipath we see an issue with stale files in
> /dev/shm if the application exits abnormally. I don't know if UCX uses that
> space as well.
No stale shm files. echo 3 > …
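For reference, a sketch of the kind of check I ran for leftover segments (the `vader_segment` name pattern is my assumption about what Open MPI's shared-memory BTL leaves behind; a temp dir stands in for /dev/shm so this is self-contained):

```shell
# Simulate checking a shm directory for leftover segments after an abnormal
# exit; the temp dir stands in for /dev/shm and the file name is hypothetical.
shm=$(mktemp -d)
touch "$shm/vader_segment.node1.0"
ls "$shm" | grep -c vader_segment   # a non-zero count would mean stale files
rm -r "$shm"
```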
> On Jun 21, 2019, at 4:04 PM, Ralph Castain via users
> wrote:
>
> I’m unaware of any “map-to cartofile” option, nor do I find it in mpirun’s
> help or man page. Are you seeing it somewhere?
From "mpirun --help":

tin 1431 : mpirun --help mapping
mpirun (Open MPI) 4.0.1
Usage: mpirun …
Hi - are there any examples of the cartofile format? Or is there some combo of
--map, --rank, or --bind options to achieve this mapping?
[BB/..][../..]
[../BB][../..]
[../..][BB/..]
[../..][../BB]
I tried everything I could think of for --bind-to, --map-by, and --rank-by, and
I can't get it to happen. I …
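To be concrete about the layout I'm after, a little sketch that prints the bracket diagram from above (2 sockets x 2 cores with 2 hwthreads each, rank r bound to core r) - this is my reading of the notation mpirun's --report-bindings output uses:

```shell
# Sketch: render the target layout - 4 ranks over 2 sockets x 2 cores,
# rank r bound (BB = both hwthreads) to core r.
for r in 0 1 2 3; do
  awk -v r="$r" 'BEGIN {
    line = ""
    for (c = 0; c < 4; c++) {
      cell = (c == r) ? "BB" : ".."
      if (c % 2 == 0) line = line "[" cell "/"   # open socket bracket
      else            line = line cell "]"       # close socket bracket
    }
    print line
  }'
done
```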
> On Jun 21, 2019, at 4:45 PM, Ralph Castain wrote:
>
> Hilarious - I wrote that code and I have no idea who added that option or
> what it is supposed to do. I can assure you, however, that it isn't
> implemented anywhere.
Not really a big deal, since the documentation doesn't explain them, …
> On Jun 21, 2019, at 5:02 PM, Ralph Castain wrote:
>
> Too many emails to track :-(
>
> Should just be "--map-by core --rank-by core" - nothing fancy required.
> Sounds like you are getting --map-by node, or at least --rank-by node, which
> means somebody has set an MCA param either in …
> On Jun 20, 2019, at 11:54 AM, Jeff Squyres (jsquyres)
> wrote:
>
> On Jun 14, 2019, at 2:02 PM, Noam Bernstein via users
> wrote:
>>
>> Hi Jeff - do you remember this issue from a couple of months ago?
>
> Noam: I'm sorry, I totally missed this …
Perhaps I spoke too soon. Now, with the Mellanox OFED stack, we occasionally
get the following failure on exit:
[compute-4-20:68008:0:68008] Caught signal 11 (Segmentation fault: address not
mapped to object at address 0x10)
 0 0x0002a3c5 opal_free_list_destruct() opal_free_list.c:0
 1 …
> On Jun 21, 2019, at 9:57 PM, Carlson, Timothy S
> wrote:
>
> Switch back to stock OFED?
Well, CentOS included OFED has a memory leak (at least when using ucx). I
haven't tried OFED's stack yet.
>
> Make sure all your cards are patched to the latest firmware.
That's a good idea.
> On Jun 20, 2019, at 9:40 AM, Jeff Squyres (jsquyres)
> wrote:
>
> On Jun 20, 2019, at 9:31 AM, Noam Bernstein via users
> wrote:
>>
>> One thing that I'm wondering if anyone familiar with the internals can
>> explain is how you get a memory leak that isn't …