[OMPI users] OpenMPI 4 and pmi2 support

2019-03-22 Thread Noam Bernstein via users
Hi - I'm trying to compile openmpi 4.0.0 with srun support, so I'm trying to tell openmpi's configure where to find the relevant files by doing $ ./configure --with-verbs --with-ofi --with-pmi=/usr/include/slurm --with-pmi-libdir=/usr/lib64 --prefix=/share/apps/mpi/openmpi/4.0.0/ib/gnu verbs
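Before pointing configure at Slurm's PMI, it can help to confirm where the headers and libraries actually live; a quick check under the usual CentOS 7 locations (paths are assumptions, not from the post):

```shell
# Slurm's PMI headers and libraries on a stock CentOS 7 install:
ls /usr/include/slurm/pmi*.h   # expect pmi.h and pmi2.h
ls /usr/lib64/libpmi*          # expect libpmi.so* and libpmi2.so*
```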

Re: [OMPI users] OpenMPI 4 and pmi2 support

2019-06-14 Thread Noam Bernstein via users
Hi Jeff - do you remember this issue from a couple of months ago? Unfortunately, the failure to find pmi.h is still happening. I just tried with 4.0.1 (not rc), and I still run into the same error (failing to find the #include of pmi.h when compiling opal/mca/pmix/s1/mca_pmix_s1_la-pmix_s1.lo): make[2]:

Re: [OMPI users] growing memory use from MPI application

2019-06-19 Thread Noam Bernstein via users
I tried to disable ucx (successfully, I think - I replaced the "--mca btl ucx --mca btl ^vader,tcp,openib" with "--mca btl_openib_allow_ib 1", and attaching gdb to a running process shows no ucx-related routines active). It still has the same fast-growing (1 GB/s) memory usage problem.
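Reconstructed as a single command sketch (the executable name and process count are placeholders, not from the post):

```shell
# Hedged sketch: allow the openib BTL over InfiniBand with UCX left
# out of the run; './a.out' and '-np 16' are placeholder values.
mpirun --mca btl_openib_allow_ib 1 -np 16 ./a.out
```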

Re: [OMPI users] growing memory use from MPI application

2019-06-19 Thread Noam Bernstein via users
> On Jun 19, 2019, at 2:00 PM, John Hearns via users wrote: > Noam, it may be a stupid question. Could you try running slabtop as the program executes? The top SIZE usage is this line: OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME 5937540 5937540
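John's suggestion can also be run non-interactively; a minimal sketch using procps slabtop's batch flags to snapshot the largest slab caches while the job executes:

```shell
# Print one snapshot of /proc/slabinfo sorted by cache size (-s c),
# keeping the header plus the biggest consumers. Usually needs root
# (or a readable /proc/slabinfo).
slabtop --once --sort=c | head -n 15
```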

[OMPI users] growing memory use from MPI application

2019-06-19 Thread Noam Bernstein via users
Hi - we’re having a weird problem with OpenMPI on our newish infiniband EDR (mlx5) nodes. We're running CentOS 7.6, with all the infiniband and ucx libraries as provided by CentOS, i.e. ucx-1.4.0-1.el7.x86_64 libibverbs-utils-17.2-3.el7.x86_64 libibverbs-17.2-3.el7.x86_64

Re: [OMPI users] growing memory use from MPI application

2019-06-19 Thread Noam Bernstein via users
> On Jun 19, 2019, at 5:05 PM, Joshua Ladd wrote: > Hi, Noam > Can you try your original command line with the following addition: > mpirun --mca pml ucx --mca btl ^vader,tcp,openib --mca osc ucx > I think we're seeing some conflict between UCX PML and UCT OSC. I did this, although
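Joshua's suggested line, reconstructed with plain double dashes (the application name is a placeholder):

```shell
# Select the UCX PML, exclude the vader/tcp/openib BTLs, and force
# the UCX one-sided (osc) component, per the suggestion above.
mpirun --mca pml ucx --mca btl ^vader,tcp,openib --mca osc ucx ./a.out
```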

Re: [OMPI users] growing memory use from MPI application

2019-06-19 Thread Noam Bernstein via users
> On Jun 19, 2019, at 2:44 PM, George Bosilca wrote: > To completely disable UCX you need to disable the UCX MTL and not only the BTL. > I would use "--mca pml ob1 --mca btl ^ucx --mca btl_openib_allow_ib 1". Thanks for the pointer. Disabling ucx this way _does_ seem to fix the memory
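George's alternative, which avoids UCX entirely, as a command sketch (application name is a placeholder; the BTL exclusion is kept exactly as quoted in the mail):

```shell
# Force the ob1 PML so the UCX PML is never selected, exclude the
# UCX byte-transfer layer, and let openib run over InfiniBand.
mpirun --mca pml ob1 --mca btl ^ucx --mca btl_openib_allow_ib 1 ./a.out
```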

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Noam Bernstein via users
> On Jun 20, 2019, at 4:44 AM, Charles A Taylor wrote: > This looks a lot like a problem I had with OpenMPI 3.1.2. I thought the fix was landed in 4.0.0 but you might want to check the code to be sure there wasn't a regression in 4.1.x. Most of our codes are still running 3.1.2

Re: [OMPI users] OpenMPI 4 and pmi2 support

2019-06-20 Thread Noam Bernstein via users
> On Jun 20, 2019, at 12:25 PM, Carlson, Timothy S wrote: > As of recent you needed to use --with-slurm and --with-pmi2. > While the configure line indicates it picks up pmi2 as part of slurm, that is not in fact true and you need to specifically tell it about pmi2. When I do

Re: [OMPI users] OpenMPI 4 and pmi2 support

2019-06-20 Thread Noam Bernstein via users
> On Jun 20, 2019, at 12:55 PM, Carlson, Timothy S wrote: > Just pass /usr to configure instead of /usr/include/slurm. This seems to have done it (as did passing CPPFLAGS, but this feels cleaner). Thank you all for the suggestions.
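The working configuration, as a sketch: --with-pmi takes an installation prefix, under which configure looks for the headers and libraries itself (other flags and the install prefix are carried over from the original configure line):

```shell
# Pass the installation prefix (/usr), not the header directory;
# configure then locates slurm/pmi*.h and libpmi* beneath it.
./configure --with-verbs --with-ofi \
    --with-pmi=/usr \
    --prefix=/share/apps/mpi/openmpi/4.0.0/ib/gnu
```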

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Noam Bernstein via users
> On Jun 20, 2019, at 10:42 AM, Noam Bernstein via users wrote: > I haven't yet tried the latest OFED or Mellanox low-level stuff. That's next on my list, but slightly more involved to do, so I've been avoiding it. Aha - using Mellanox's OFED packaging se

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Noam Bernstein via users
> On Jun 20, 2019, at 1:38 PM, Nathan Hjelm via users wrote: > THAT is a good idea. When using Omnipath we see an issue with stale files in /dev/shm if the application exits abnormally. I don't know if UCX uses that space as well. No stale shm files. echo 3 >
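Checking for leftover shared-memory segments is quick; a minimal sketch (the component names in the pattern are guesses at what Open MPI or UCX might leave behind, not an exhaustive list):

```shell
# List anything in /dev/shm that looks like an MPI/UCX segment;
# after all jobs have exited, any match is a candidate stale file.
ls -l /dev/shm 2>/dev/null | grep -Ei 'ucx|vader|psm|mpi' \
    || echo "no stale shm files"
```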

Re: [OMPI users] process mapping

2019-06-21 Thread Noam Bernstein via users
> On Jun 21, 2019, at 4:04 PM, Ralph Castain via users wrote: > I'm unaware of any "map-to cartofile" option, nor do I find it in mpirun's help or man page. Are you seeing it somewhere? From "mpirun --help": tin 1431 : mpirun --help mapping mpirun (Open MPI) 4.0.1 Usage: mpirun

[OMPI users] process mapping

2019-06-21 Thread Noam Bernstein via users
Hi - are there any examples of the cartofile format? Or is there some combo of --map, --rank, or --bind to achieve this mapping? [BB/..][../..] [../BB][../..] [../..][BB/..] [../..][../BB] I tried everything I could think of for --bind-to, --map-by, and --rank-by, and I can't get it to happen. I

Re: [OMPI users] process mapping

2019-06-21 Thread Noam Bernstein via users
> On Jun 21, 2019, at 4:45 PM, Ralph Castain wrote: > Hilarious - I wrote that code and I have no idea who added that option or what it is supposed to do. I can assure you, however, that it isn't implemented anywhere. Not really a big deal, since the documentation doesn't explain them,

Re: [OMPI users] process mapping

2019-06-21 Thread Noam Bernstein via users
> On Jun 21, 2019, at 5:02 PM, Ralph Castain wrote: > Too many emails to track :-( > Should just be "--map-by core --rank-by core" - nothing fancy required. Sounds like you are getting --map-by node, or at least --rank-by node, which means somebody has set an MCA param either in
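Ralph's answer as a command sketch, plus one way to look for the stray MCA parameter he suspects (the per-user and system parameter file locations are the standard Open MPI ones; the install prefix and process count are assumptions):

```shell
# Sequential mapping/ranking: consecutive ranks on consecutive cores.
mpirun --map-by core --rank-by core --bind-to core -np 8 ./a.out

# Look for an MCA param that could be forcing a different mapping,
# in the parameter files and in the environment.
grep -H 'rmaps\|map_by\|rank_by' \
    "$HOME/.openmpi/mca-params.conf" \
    /share/apps/mpi/openmpi/*/etc/openmpi-mca-params.conf 2>/dev/null
env | grep -i OMPI_MCA
```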

Re: [OMPI users] OpenMPI 4 and pmi2 support

2019-06-20 Thread Noam Bernstein via users
> On Jun 20, 2019, at 11:54 AM, Jeff Squyres (jsquyres) wrote: > On Jun 14, 2019, at 2:02 PM, Noam Bernstein via users wrote: >> Hi Jeff - do you remember this issue from a couple of months ago? > Noam: I'm sorry, I totally missed thi

Re: [OMPI users] growing memory use from MPI application

2019-06-21 Thread Noam Bernstein via users
Perhaps I spoke too soon. Now, with the Mellanox OFED stack, we occasionally get the following failure on exit: [compute-4-20:68008:0:68008] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10) 0 0x0002a3c5 opal_free_list_destruct() opal_free_list.c:0 1

Re: [OMPI users] growing memory use from MPI application

2019-06-21 Thread Noam Bernstein via users
> On Jun 21, 2019, at 9:57 PM, Carlson, Timothy S wrote: > Switch back to stock OFED? Well, the CentOS-included OFED has a memory leak (at least when using ucx). I haven't tried OFED's stack yet. > Make sure all your cards are patched to the latest firmware. That's a good idea.
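Checking the current firmware level before flashing anything is cheap; ibv_devinfo ships with the libibverbs-utils package already listed in the first post of the thread:

```shell
# Show each HCA with its firmware and board ID; compare the fw_ver
# against the latest release published for that board.
ibv_devinfo | grep -E 'hca_id|fw_ver|board_id'
```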

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Noam Bernstein via users
> On Jun 20, 2019, at 9:40 AM, Jeff Squyres (jsquyres) wrote: > On Jun 20, 2019, at 9:31 AM, Noam Bernstein via users wrote: >> One thing that I'm wondering if anyone familiar with the internals can explain is how you get a memory leak that isn't