Re: [OMPI users] Vader - Where to Look for Shared Memory Use

2020-07-22 Thread John Hearns via users
John, as an aside, it is always worth running 'lstopo' from the hwloc
package to look at the layout of your CPUs, cores and caches.
It's getting a bit late now, so I apologise for being too lazy to boot up my Pi
to capture the output.
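
For what it's worth, if you would rather query the topology programmatically
than read the lstopo diagram, a minimal sketch against the hwloc C API
(assuming hwloc 2.x is installed) is:

/* Count cores and L2 caches with hwloc, as a complement to lstopo. */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int cores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    int l2    = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L2CACHE);
    printf("cores: %d, L2 caches: %d\n", cores, l2);

    hwloc_topology_destroy(topo);
    return 0;
}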

On Wed, 22 Jul 2020 at 19:55, George Bosilca via users <
users@lists.open-mpi.org> wrote:

> John,
>
> There are many things in play in such an experiment. Plus, expecting
> linear speedup even at the node level is certainly overly optimistic.
>
> 1. A single core experiment has full memory bandwidth, so you will
> asymptotically reach the max flops. Adding more cores will increase the
> memory pressure, and at some point the memory will not be able to deliver,
> and will become the limiting factor (not the computation capabilities of
> the cores).
>
> 2. The HPL communication pattern is composed of 3 types of messages: a
> single element in the panel (column) in the context of an allreduce (to find
> the max), medium-sized messages (a decreasing multiple of NB as you progress
> through the computation) for the swap operation, and finally some large
> messages of NB*NB*sizeof(elem) for the update. All this to say that
> CMA_SIZE_MBYTES=5 should be more than enough for you.
>
> Have fun,
>   George.
>
>
>
> On Wed, Jul 22, 2020 at 2:19 PM John Duffy via users <
> users@lists.open-mpi.org> wrote:
>
>> Hi Joseph, John
>>
>> Thank you for your replies.
>>
>> I’m using Ubuntu 20.04 aarch64 on an 8 x Raspberry Pi 4 cluster.
>>
>> The symptoms I’m experiencing are that the HPL Linpack performance in
>> Gflops increases on a single core as NB is increased from 32 to 256. The
>> theoretical maximum is 6 Gflops per core. I can achieve 4.8 Gflops, which I
>> think is a reasonable expectation. However, as I add more cores on a single
>> node, 2, 3 and finally 4 cores, the performance scaling is nowhere near
>> linear, and tails off dramatically as NB is increased. I can achieve 15
>> Gflops on a single node of 4 cores, whereas the theoretical maximum is 24
>> Gflops per node.
>>
>> ompi_info suggests vader is available/working…
>>
>>  MCA btl: openib (MCA v2.1.0, API v3.1.0, Component
>> v4.0.3)
>>  MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.3)
>>  MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.3)
>>  MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.3)
>>
>> I’m wondering whether the Ubuntu kernel CMA_SIZE_MBYTES=5 is limiting
>> Open-MPI message number/size. So, I’m currently building a new kernel with
>> CMA_SIZE_MBYTES=16.
>>
>> I have attached 2 plots from my experiments…
>>
>> Plot 1 - shows an increase in Gflops for 1 core as NB increases, up to a
>> maximum value of 4.75 Gflops when NB = 240.
>>
>> Plot 2 - shows an increase in Gflops for 4 x cores (all on the same
>> node) as NB increases. The maximum Gflops achieved is 15 Gflops. I would
>> hope that rather than drop off dramatically at NB = 168, the performance
>> would trend upwards towards somewhere near 4 x 4.75 = 19 Gflops.
>>
>> This is why I am wondering whether Open-MPI messages via vader are being
>> hampered by a limiting CMA size.
>>
>> Let’s see what happens with my new kernel...
>>
>> Best regards
>>
>> John
>>
>>
>>


Re: [OMPI users] Vader - Where to Look for Shared Memory Use

2020-07-22 Thread George Bosilca via users
John,

There are many things in play in such an experiment. Plus, expecting linear
speedup even at the node level is certainly overly optimistic.

1. A single core experiment has full memory bandwidth, so you will
asymptotically reach the max flops. Adding more cores will increase the
memory pressure, and at some point the memory will not be able to deliver,
and will become the limiting factor (not the computation capabilities of
the cores).

2. The HPL communication pattern is composed of 3 types of messages: a single
element in the panel (column) in the context of an allreduce (to find the max),
medium-sized messages (a decreasing multiple of NB as you progress through the
computation) for the swap operation, and finally some large messages of
NB*NB*sizeof(elem) for the update. All this to say that CMA_SIZE_MBYTES=5
should be more than enough for you.
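
As a rough sanity check (a sketch, assuming 8-byte double-precision elements,
which is what HPL uses), the largest update message stays well below a 5 MB
CMA region for every NB you are testing:

/* Largest HPL update message is roughly NB*NB*sizeof(double) bytes. */
#include <stdio.h>

int main(void)
{
    for (int nb = 32; nb <= 256; nb += 32) {
        double mib = (double)nb * nb * sizeof(double) / (1024.0 * 1024.0);
        printf("NB = %3d  ->  ~%.2f MiB per update message\n", nb, mib);
    }
    return 0;   /* NB = 256 gives ~0.50 MiB, far below CMA_SIZE_MBYTES=5 */
}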

Have fun,
  George.



On Wed, Jul 22, 2020 at 2:19 PM John Duffy via users <
users@lists.open-mpi.org> wrote:

> Hi Joseph, John
>
> Thank you for your replies.
>
> I’m using Ubuntu 20.04 aarch64 on an 8 x Raspberry Pi 4 cluster.
>
> The symptoms I’m experiencing are that the HPL Linpack performance in
> Gflops increases on a single core as NB is increased from 32 to 256. The
> theoretical maximum is 6 Gflops per core. I can achieve 4.8 Gflops, which I
> think is a reasonable expectation. However, as I add more cores on a single
> node, 2, 3 and finally 4 cores, the performance scaling is nowhere near
> linear, and tails off dramatically as NB is increased. I can achieve 15
> Gflops on a single node of 4 cores, whereas the theoretical maximum is 24
> Gflops per node.
>
> ompi_info suggests vader is available/working…
>
>  MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.0.3)
>  MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.3)
>  MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.3)
>  MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.3)
>
> I’m wondering whether the Ubuntu kernel CMA_SIZE_MBYTES=5 is limiting
> Open-MPI message number/size. So, I’m currently building a new kernel with
> CMA_SIZE_MBYTES=16.
>
> I have attached 2 plots from my experiments…
>
> Plot 1 - shows an increase in Gflops for 1 core as NB increases, up to a
> maximum value of 4.75 Gflops when NB = 240.
>
> Plot 2 - shows an increase in Gflops for 4 x cores (all on the same
> node) as NB increases. The maximum Gflops achieved is 15 Gflops. I would
> hope that rather than drop off dramatically at NB = 168, the performance
> would trend upwards towards somewhere near 4 x 4.75 = 19 Gflops.
>
> This is why I am wondering whether Open-MPI messages via vader are being
> hampered by a limiting CMA size.
>
> Let’s see what happens with my new kernel...
>
> Best regards
>
> John
>
>
>


Re: [OMPI users] Vader - Where to Look for Shared Memory Use

2020-07-22 Thread John Duffy via users
Hi Joseph, John

Thank you for your replies.

I’m using Ubuntu 20.04 aarch64 on an 8 x Raspberry Pi 4 cluster.

The symptoms I’m experiencing are that the HPL Linpack performance in Gflops
increases on a single core as NB is increased from 32 to 256. The theoretical
maximum is 6 Gflops per core. I can achieve 4.8 Gflops, which I think is a
reasonable expectation. However, as I add more cores on a single node, 2, 3
and finally 4 cores, the performance scaling is nowhere near linear, and tails
off dramatically as NB is increased. I can achieve 15 Gflops on a single node
of 4 cores, whereas the theoretical maximum is 24 Gflops per node.

ompi_info suggests vader is available/working…

 MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.0.3)
 MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.3)
 MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.3)
 MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.3)

I’m wondering whether the Ubuntu kernel CMA_SIZE_MBYTES=5 is limiting Open-MPI
message number/size. So, I’m currently building a new kernel with
CMA_SIZE_MBYTES=16.

I have attached 2 plots from my experiments…

Plot 1 - shows an increase in Gflops for 1 core as NB increases, up to a
maximum value of 4.75 Gflops when NB = 240.

Plot 2 - shows an increase in Gflops for 4 x cores (all on the same node) as NB
increases. The maximum Gflops achieved is 15 Gflops. I would hope that rather
than drop off dramatically at NB = 168, the performance would trend upwards
towards somewhere near 4 x 4.75 = 19 Gflops.

This is why I am wondering whether Open-MPI messages via vader are being
hampered by a limiting CMA size.

Let’s see what happens with my new kernel...

Best regards

John

gflops_vs_nb_1_core_80_percent_memory.pdf
Description: Adobe PDF document


gflops_vs_nb_1_node_80_percent_memory.pdf
Description: Adobe PDF document


Re: [OMPI users] choosing network: infiniband vs. ethernet

2020-07-22 Thread Jeff Squyres (jsquyres) via users
Glad you figured it out!

I was waiting for Mellanox support to jump in and answer here; I am not part of 
the UCX community, so I can't really provide definitive UCX answers.



On Jul 22, 2020, at 1:16 PM, Lana Deere <lana.de...@gmail.com> wrote:

Never mind.  This was apparently because I had ucx configured for static 
libraries while openmpi was configured for shared libraries.

.. Lana (lana.de...@gmail.com)




On Tue, Jul 21, 2020 at 12:58 PM Lana Deere <lana.de...@gmail.com> wrote:
I'm using the infiniband drivers in the CentOS7 distribution, not the Mellanox 
drivers.  The version of Lustre we're using is built against the distro drivers 
and breaks if the Mellanox drivers get installed.

Is there a particular version of ucx which should be used with openmpi 4.0.4?  
I downloaded ucx 1.8.1 and installed it, then tried to configure openmpi with 
--with-ucx= but the configure failed.  The configure finds the ucx 
installation OK but thinks some symbols are undeclared.  I tried to find those 
in the ucx source area (in case I configured ucx wrong) but didn't turn them up 
anywhere.  Here is the bottom of the configure output showing mostly "yes" for 
checks but a series of "no" at the end.

[...]
checking ucp/api/ucp.h usability... yes
checking ucp/api/ucp.h presence... yes
checking for ucp/api/ucp.h... yes
checking for library containing ucp_cleanup... no
checking whether ucp_tag_send_nbr is declared... yes
checking whether ucp_ep_flush_nb is declared... yes
checking whether ucp_worker_flush_nb is declared... yes
checking whether ucp_request_check_status is declared... yes
checking whether ucp_put_nb is declared... yes
checking whether ucp_get_nb is declared... yes
checking whether ucm_test_events is declared... yes
checking whether UCP_ATOMIC_POST_OP_AND is declared... yes
checking whether UCP_ATOMIC_POST_OP_OR is declared... yes
checking whether UCP_ATOMIC_POST_OP_XOR is declared... yes
checking whether UCP_ATOMIC_FETCH_OP_FAND is declared... yes
checking whether UCP_ATOMIC_FETCH_OP_FOR is declared... yes
checking whether UCP_ATOMIC_FETCH_OP_FXOR is declared... yes
checking whether UCP_PARAM_FIELD_ESTIMATED_NUM_PPN is declared... yes
checking whether UCP_WORKER_ATTR_FIELD_ADDRESS_FLAGS is declared... yes
checking whether ucp_tag_send_nbx is declared... no
checking whether ucp_tag_send_sync_nbx is declared... no
checking whether ucp_tag_recv_nbx is declared... no
checking for ucp_request_param_t... no
configure: error: UCX support requested but not found.  Aborting


.. Lana (lana.de...@gmail.com)




On Mon, Jul 20, 2020 at 12:43 PM Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
Correct, UCX = OpenUCX.org.

If you have the Mellanox drivers package installed, it probably would have 
installed UCX (and Open MPI).  You'll have to talk to your sysadmin and/or 
Mellanox support for details about that.


On Jul 20, 2020, at 11:36 AM, Lana Deere <lana.de...@gmail.com> wrote:

I assume UCX is https://www.openucx.org?  (Google 
found several things called UCX when I searched, but that seemed the right 
one.)  I will try installing it and then reinstall OpenMPI.  Hopefully it will 
then choose between network transports automatically based on what's available. 
 I'll also look at the slides and see if I can make sense of them.  Thanks.

.. Lana (lana.de...@gmail.com)




On Sat, Jul 18, 2020 at 9:41 AM Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
On Jul 16, 2020, at 2:56 PM, Lana Deere via users <users@lists.open-mpi.org> wrote:

I am new to open mpi.  I built 4.0.4 on a CentOS7 machine and tried doing an 
mpirun of a small program compiled against openmpi.  It seems to have failed 
because my host does not have infiniband.  I can't seem to figure out how I 
should configure when I build so it will do what I want, namely use infiniband 
if there are IB HCAs on the system and otherwise use the ethernet on the system.

UCX is the underlying library that Mellanox/Nvidia prefers these days for use 
with MPI and InfiniBand.

Meaning: you should first install UCX and then build Open MPI with 
--with-ucx=/directory/of/ucx/installation.

We just hosted parts 1 and 2 of a seminar entitled "The ABCs of Open MPI" that 
covered topics like this.  Check out:

https://www.open-mpi.org/video/?category=general#abcs-of-open-mpi-part-1
and
https://www.open-mpi.org/video/?category=general#abcs-of-open-mpi-part-2

In particular, you might want to look at slides 28-42 in part 2 for a bunch of 
discussion about how Open MPI (by default) picks the underlying network / APIs 
to use, and then how you can override that if you want to.

--
Jeff Squyres
jsquy...@cisco.com



--
Jeff Squyres
jsquy...@cisco.com



--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] choosing network: infiniband vs. ethernet

2020-07-22 Thread Lana Deere via users
Never mind.  This was apparently because I had ucx configured for static
libraries while openmpi was configured for shared libraries.
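
For anyone who hits the same configure failure, a small link test (a sketch,
assuming UCX was built as shared libraries, installed where the compiler can
find it, and linked with -lucp) is a quick way to check the install before
pointing Open MPI at it:

/* If this does not compile and link against your UCX install,
 * Open MPI's configure link tests will likely fail the same way. */
#include <ucp/api/ucp.h>
#include <stdio.h>

int main(void)
{
    unsigned major, minor, release;
    ucp_get_version(&major, &minor, &release);
    printf("UCX %u.%u.%u\n", major, minor, release);
    return 0;
}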

.. Lana (lana.de...@gmail.com)




On Tue, Jul 21, 2020 at 12:58 PM Lana Deere  wrote:

> I'm using the infiniband drivers in the CentOS7 distribution, not the
> Mellanox drivers.  The version of Lustre we're using is built against the
> distro drivers and breaks if the Mellanox drivers get installed.
>
> Is there a particular version of ucx which should be used with openmpi
> 4.0.4?  I downloaded ucx 1.8.1 and installed it, then tried to configure
> openmpi with --with-ucx= but the configure failed.  The configure
> finds the ucx installation OK but thinks some symbols are undeclared.  I
> tried to find those in the ucx source area (in case I configured ucx wrong)
> but didn't turn them up anywhere.  Here is the bottom of the configure
> output showing mostly "yes" for checks but a series of "no" at the end.
>
> [...]
> checking ucp/api/ucp.h usability... yes
> checking ucp/api/ucp.h presence... yes
> checking for ucp/api/ucp.h... yes
> checking for library containing ucp_cleanup... no
> checking whether ucp_tag_send_nbr is declared... yes
> checking whether ucp_ep_flush_nb is declared... yes
> checking whether ucp_worker_flush_nb is declared... yes
> checking whether ucp_request_check_status is declared... yes
> checking whether ucp_put_nb is declared... yes
> checking whether ucp_get_nb is declared... yes
> checking whether ucm_test_events is declared... yes
> checking whether UCP_ATOMIC_POST_OP_AND is declared... yes
> checking whether UCP_ATOMIC_POST_OP_OR is declared... yes
> checking whether UCP_ATOMIC_POST_OP_XOR is declared... yes
> checking whether UCP_ATOMIC_FETCH_OP_FAND is declared... yes
> checking whether UCP_ATOMIC_FETCH_OP_FOR is declared... yes
> checking whether UCP_ATOMIC_FETCH_OP_FXOR is declared... yes
> checking whether UCP_PARAM_FIELD_ESTIMATED_NUM_PPN is declared... yes
> checking whether UCP_WORKER_ATTR_FIELD_ADDRESS_FLAGS is declared... yes
> checking whether ucp_tag_send_nbx is declared... no
> checking whether ucp_tag_send_sync_nbx is declared... no
> checking whether ucp_tag_recv_nbx is declared... no
> checking for ucp_request_param_t... no
> configure: error: UCX support requested but not found.  Aborting
>
>
> .. Lana (lana.de...@gmail.com)
>
>
>
>
> On Mon, Jul 20, 2020 at 12:43 PM Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
>
>> Correct, UCX = OpenUCX.org.
>>
>> If you have the Mellanox drivers package installed, it probably would
>> have installed UCX (and Open MPI).  You'll have to talk to your sysadmin
>> and/or Mellanox support for details about that.
>>
>>
>> On Jul 20, 2020, at 11:36 AM, Lana Deere  wrote:
>>
>> I assume UCX is https://www.openucx.org?  (Google found several things
>> called UCX when I searched, but that seemed the right one.)  I will try
>> installing it and then reinstall OpenMPI.  Hopefully it will then choose
>> between network transports automatically based on what's available.  I'll
>> also look at the slides and see if I can make sense of them.  Thanks.
>>
>> .. Lana (lana.de...@gmail.com)
>>
>>
>>
>>
>> On Sat, Jul 18, 2020 at 9:41 AM Jeff Squyres (jsquyres) <
>> jsquy...@cisco.com> wrote:
>>
>>> On Jul 16, 2020, at 2:56 PM, Lana Deere via users <
>>> users@lists.open-mpi.org> wrote:
>>>
>>>
>>> I am new to open mpi.  I built 4.0.4 on a CentOS7 machine and tried
>>> doing an mpirun of a small program compiled against openmpi.  It seems to
>>> have failed because my host does not have infiniband.  I can't seem to
>>> figure out how I should configure when I build so it will do what I want,
>>> namely use infiniband if there are IB HCAs on the system and otherwise use
>>> the ethernet on the system.
>>>
>>>
>>> UCX is the underlying library that Mellanox/Nvidia prefers these days
>>> for use with MPI and InfiniBand.
>>>
>>> Meaning: you should first install UCX and then build Open MPI with
>>> --with-ucx=/directory/of/ucx/installation.
>>>
>>> We just hosted parts 1 and 2 of a seminar entitled "The ABCs of Open
>>> MPI" that covered topics like this.  Check out:
>>>
>>> https://www.open-mpi.org/video/?category=general#abcs-of-open-mpi-part-1
>>> and
>>> https://www.open-mpi.org/video/?category=general#abcs-of-open-mpi-part-2
>>>
>>> In particular, you might want to look at slides 28-42 in part 2 for a
>>> bunch of discussion about how Open MPI (by default) picks the underlying
>>> network / APIs to use, and then how you can override that if you want to.
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>>
>>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>>
>>


Re: [OMPI users] Vader - Where to Look for Shared Memory Use

2020-07-22 Thread Joseph Schuchart via users

Hi John,

Depending on your platform, the default behavior of Open MPI is to mmap a
shared backing file that is located either in a session directory under
/dev/shm or under $TMPDIR (I believe under Linux it is /dev/shm). You
will find a set of files there that are used to back shared memory; they
should be deleted automatically at the end of a run.
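
If you want to watch that happening, a minimal sketch (assuming a Linux system
where the session directory ends up under /dev/shm) is simply to list that
directory while your job is running, no more than ls /dev/shm in program form:

/* List whatever lives under /dev/shm while the MPI job is running. */
#include <dirent.h>
#include <stdio.h>

int main(void)
{
    DIR *d = opendir("/dev/shm");
    if (d == NULL) {
        perror("opendir /dev/shm");
        return 1;
    }
    struct dirent *entry;
    while ((entry = readdir(d)) != NULL)
        printf("%s\n", entry->d_name);
    closedir(d);
    return 0;
}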


What symptoms are you experiencing and on what platform?

Cheers
Joseph

On 7/22/20 10:15 AM, John Duffy via users wrote:

Hi

I’m trying to investigate an HPL Linpack scaling issue on a single node, 
increasing from 1 to 4 cores.

Regarding single node messages, I think I understand that Open-MPI will select 
the most efficient mechanism, which in this case I think should be vader shared 
memory.

But when I run Linpack, ipcs -m gives…

-- Shared Memory Segments 
key        shmid      owner      perms      bytes      nattch     status


And, ipcs -u gives…

-- Messages Status 
allocated queues = 0
used headers = 0
used space = 0 bytes

-- Shared Memory Status 
segments allocated 0
pages allocated 0
pages resident  0
pages swapped   0
Swap performance: 0 attempts 0 successes

-- Semaphore Status 
used arrays = 0
allocated semaphores = 0


Am I looking in the wrong place to see how/if vader is using shared memory? I’m 
wondering if a slower mechanism is being used.

My ompi_info includes...

MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.0.3)
MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.3)
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.3)
MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.3)


Best wishes



[OMPI users] Vader - Where to Look for Shared Memory Use

2020-07-22 Thread John Duffy via users
Hi

I’m trying to investigate an HPL Linpack scaling issue on a single node, 
increasing from 1 to 4 cores.

Regarding single node messages, I think I understand that Open-MPI will select 
the most efficient mechanism, which in this case I think should be vader shared 
memory.

But when I run Linpack, ipcs -m gives…

-- Shared Memory Segments 
key        shmid      owner      perms      bytes      nattch     status


And, ipcs -u gives…

-- Messages Status 
allocated queues = 0
used headers = 0
used space = 0 bytes

-- Shared Memory Status 
segments allocated 0
pages allocated 0
pages resident  0
pages swapped   0
Swap performance: 0 attempts 0 successes

-- Semaphore Status 
used arrays = 0
allocated semaphores = 0


Am I looking in the wrong place to see how/if vader is using shared memory? I’m 
wondering if a slower mechanism is being used.

My ompi_info includes...

MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.0.3)
MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.3)
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.3)
MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.3)


Best wishes