John,

There are many things in play in such an experiment. Plus, expecting linear
speedup even at the node level is certainly overly optimistic.

1. A single-core experiment has the full memory bandwidth to itself, so you
will asymptotically reach the maximum flops. Adding more cores increases the
memory pressure, and at some point the memory will not be able to deliver and
will become the limiting factor (not the computation capabilities of the
cores); see the first sketch below.

2. The HPL communication pattern is composed of 3 types of messages: a single
element in the panel (column) exchanged in an allreduce (to find the max),
medium-sized messages (a decreasing multiple of NB as you progress in the
computation) for the swap operation, and finally some large messages of
NB*NB*sizeof(elem) for the update; see the second sketch below. All this to
say that CMA_SIZE_MBYTES=5 should be more than enough for you.
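
To make point 1 concrete, here is a minimal STREAM-style triad sketch (my own
illustration, not part of HPL; the array size and the 4-thread upper bound are
assumptions for a Raspberry Pi 4-class node). Run it with OMP_NUM_THREADS set
to 1, 2, 3 and 4 and the delivered GB/s will flatten out well before 4x, which
is the same wall the HPL update kernel hits:

/* stream_triad.c: gcc -O2 -fopenmp stream_triad.c -o stream_triad
 * Illustrative only: watch the GB/s flatten as you add threads. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 10000000L                      /* 3 arrays x 80 MB, fits a Pi 4 */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];        /* 2 flops per 24 bytes moved */
    double t1 = omp_get_wtime();

    printf("%d threads: %.2f GB/s\n", omp_get_max_threads(),
           3.0 * N * sizeof(double) / 1e9 / (t1 - t0));
    free(a); free(b); free(c);
    return 0;
}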
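
And for point 2, even the largest message class (the update) is only
NB*NB*8 bytes, i.e. roughly 460 KB at your best NB of 240, comfortably below
5 MB. A tiny sketch of the three message classes (again just an illustration
with assumed values, not HPL's actual code):

/* hpl_msgs.c: mpicc hpl_msgs.c -o hpl_msgs && mpirun -np 4 ./hpl_msgs
 * The three HPL message classes and their rough sizes, assuming NB = 240
 * and double-precision elements. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int me, nb = 240;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);

    /* 1. Pivot search: a single {value, rank} pair in an allreduce. */
    struct { double val; int rank; } loc = { (double)me, me }, glob;
    MPI_Allreduce(&loc, &glob, 1, MPI_DOUBLE_INT, MPI_MAXLOC, MPI_COMM_WORLD);

    /* 2. Row swaps: a few multiples of NB doubles, a few KB each. */
    size_t swap_bytes = 4 * (size_t)nb * sizeof(double);

    /* 3. Update: roughly NB*NB elements, ~460 KB with NB = 240. */
    size_t update_bytes = (size_t)nb * nb * sizeof(double);

    if (me == 0)
        printf("pivot on rank %d, swap ~ %zu bytes, update ~ %zu bytes\n",
               glob.rank, swap_bytes, update_bytes);
    MPI_Finalize();
    return 0;
}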

Have fun,
  George.



On Wed, Jul 22, 2020 at 2:19 PM John Duffy via users <
users@lists.open-mpi.org> wrote:

> Hi Joseph, John
>
> Thank you for your replies.
>
> I’m using Ubuntu 20.04 aarch64 on an 8 x Raspberry Pi 4 cluster.
>
> The symptoms I’m experiencing are that the HPL Linpack performance in
> Gflops increases on a single core as NB is increased from 32 to 256. The
> theoretical maximum is 6 Gflops per core. I can achieve 4.8 Gflops, which I
> think is a reasonable expectation. However, as I add more cores on a single
> node, 2, 3 and finally 4 cores, the performance scaling is nowhere near
> linear, and tails off dramatically as NB is increased. I can achieve 15
> Gflops on a single node of 4 cores, whereas the theoretical maximum is 24
> Gflops per node.
>
> ompi_info suggests vader is available/working…
>
>                  MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.0.3)
>                  MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.3)
>                  MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.3)
>                  MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.3)
>
> I’m wondering whether the Ubuntu kernel CMA_SIZE_MBYTES=5 is limiting
> Open-MPI message number/size. So, I’m currently building a new kernel with
> CMA_SIZE_MBYTES=16.
>
> I have attached 2 plots from my experiments…
>
> Plot 1 - shows an increase in Gflops for 1 core as NB increases, up to a
> maximum value of 4.75 Gflops when NB = 240.
>
> Plot 2 - shows an increase in Gflops for 4 x cores (all on the same
> node) as NB increases. The maximum Gflops achieved is 15 Gflops. I would
> hope that rather than drop off dramatically at NB = 168, the performance
> would trend upwards towards somewhere near 4 x 4.75 = 19 Gflops.
>
> This is why I am wondering whether Open-MPI messages via vader are being
> hampered by a limiting CMA size.
>
> Let's see what happens with my new kernel...
>
> Best regards
>
> John
>
>
>
