On Wed, 28 Aug 2019 09:45:15 -0500
Cooper Burns <cooper.bu...@convergecfd.com> wrote:

> Peter,
> Thanks for your input!
> I tried some things:
> *1) The app was placed/pinned differently by the two MPIs. Often this
> would probably not cause such a big difference.*
> I agree this is unlikely to be the cause. However, I tried various
> configurations of map-by, bind-to, etc., and none of them had any
> measurable impact at all, which points to this not being the cause
> (as you suspected).

OK, there's still one thing to rule out: which rank was placed on which
node.

For OpenMPI you can pass "-report-bindings" and verify that the first N
ranks are placed on the first node (for N cores or ranks per node).

node0: r0 r4 r8 ...
node1: r1 ...
node2: r2 ...
node3: r3 ...


node0: r0 r1 r2 r3 ...
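
To make the two layouts above concrete, here is a small sketch in plain
shell that prints which ranks land on which node for a round-robin
placement over nodes and for a packed placement that fills node0 first.
The function names and the node and rank counts are made up for
illustration, and the script does not call Open MPI at all.

```shell
#!/bin/sh
# Illustration only: print rank placement for two mapping policies.
# layout_by_node and layout_by_slot are hypothetical helper names.

# Round-robin over nodes: rank r goes to node (r mod nnodes).
layout_by_node() {
    nnodes=$1; nranks=$2
    n=0
    while [ "$n" -lt "$nnodes" ]; do
        line="node$n:"
        r=$n
        while [ "$r" -lt "$nranks" ]; do
            line="$line r$r"
            r=$((r + nnodes))
        done
        echo "$line"
        n=$((n + 1))
    done
}

# Packed: the first ppn ranks fill node0, the next ppn fill node1, ...
layout_by_slot() {
    nnodes=$1; nranks=$2
    ppn=$((nranks / nnodes))   # assumes nranks divides evenly by nnodes
    n=0
    while [ "$n" -lt "$nnodes" ]; do
        line="node$n:"
        r=$((n * ppn))
        end=$((r + ppn))
        while [ "$r" -lt "$end" ]; do
            line="$line r$r"
            r=$((r + 1))
        done
        echo "$line"
        n=$((n + 1))
    done
}

layout_by_node 4 8
layout_by_slot 4 8
```

In Open MPI terms these two layouts correspond roughly to "--map-by node"
and "--map-by slot". Comparing the "-report-bindings" output against
tables like these should show which placement a given run actually got.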

> *2) Bad luck wrt collective performance. Different MPIs have
> different weak spots across the parameter space of
> numranks,transfersize,mpi-collective.*
> This is possible... But the magnitude of the runtime difference seems
> too large to me... Are there any options we can give to OMPI to cause
> it to use different collective algorithms so that we can test this
> theory?

It can certainly cause the observed difference. I've seen very large

To get collective tunables from OpenMPI do something like:

 ompi_info --param coll all --level 5

But it will really help to know or suspect what collectives the
application depends on.

For example, if you suspected alltoall to be a factor you could sweep
all valid alltoall algorithms by setting:

 -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_alltoall_algorithm X

Where X is 0..6 in my case (ompi_info returned: 0 ignore, 1 basic
linear, 2 bruck, 3 recursive doubling, 4 ring, 5 neighbor exchange,
6 two proc only).
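
A sweep like that could be scripted roughly as follows. This is only a
sketch: APP, NP and DRY_RUN are placeholder variables, sweep_cmd is a
made-up helper name, and the 0..6 range is just what ompi_info listed
above, so check it against your own installation. The
coll_tuned_use_dynamic_rules setting is an assumption added here; as far
as I know the tuned component only honors a forced algorithm when
dynamic rules are enabled. By default the script only prints the
command lines.

```shell
#!/bin/sh
# Sketch: one mpirun command line per forced alltoall algorithm.
# APP and NP are placeholders; set DRY_RUN=0 to actually launch the runs.
APP=${APP:-./my_app}
NP=${NP:-64}
DRY_RUN=${DRY_RUN:-1}

# Build the command line for one algorithm id ($1), as listed by ompi_info.
sweep_cmd() {
    echo "mpirun -np $NP -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_alltoall_algorithm $1 $APP"
}

for alg in 0 1 2 3 4 5 6; do
    cmd=$(sweep_cmd "$alg")
    echo "$cmd"
    if [ "$DRY_RUN" -eq 0 ]; then
        start=$(date +%s)
        $cmd    # unquoted on purpose so the string splits into arguments
        end=$(date +%s)
        echo "algorithm $alg: $((end - start)) s wall time"
    fi
done
```

If one algorithm id gives a runtime close to the other MPI, that is a
strong hint that the default algorithm choice is the weak spot.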
