On Wed, 25 Nov 2020, Dave Love via users wrote:
The perf test says romio performs a bit better. Also -- from overall
time -- it's faster on IMB-IO (which I haven't looked at in detail, and
ran with suboptimal striping).
I take that back. I can't reproduce a significant difference for total
IMB-IO runtime, with both run in parallel on 16 ranks, using either the
system default of a single 1MB stripe or using eight stripes. I haven't
teased out figures for different operations yet. That must have been
done elsewhere, but I've never seen figures.
But remember that IMB-IO doesn't cover everything. For example, HDF5's
t_bigio parallel test appears to be a pathological case, where OMPIO is
about 2 orders of magnitude (roughly 120x) slower than ROMIO on a Lustre
filesystem:
- OMPI's default MPI-IO implementation on Lustre (ROMIO): 21 seconds
- OMPI's alternative MPI-IO implementation on Lustre (OMPIO): 2554 seconds
End users seem to have the choice of:
- use openmpi 4.x and have some things broken (romio)
- use openmpi 4.x and have some things slow (ompio)
- use openmpi 3.x and everything works
My concern is that openmpi 3.x is near, or at, end of life.
Mark
t_bigio runs on CentOS 7, GCC 4.8.5, ppc64le, Open MPI 4.0.5, HDF5 1.10.7,
Lustre 2.12.5:
[login testpar]$ time mpirun -np 6 ./t_bigio
Testing Dataset1 write by ROW
Testing Dataset2 write by COL
Testing Dataset3 write select ALL proc 0, NONE others
Testing Dataset4 write point selection
Read Testing Dataset1 by COL
Read Testing Dataset2 by ROW
Read Testing Dataset3 read select ALL proc 0, NONE others
Read Testing Dataset4 with Point selection
***Express test mode on. Several tests are skipped
real 0m21.141s
user 2m0.318s
sys 0m3.289s
[login testpar]$ export OMPI_MCA_io=ompio
[login testpar]$ time mpirun -np 6 ./t_bigio
Testing Dataset1 write by ROW
Testing Dataset2 write by COL
Testing Dataset3 write select ALL proc 0, NONE others
Testing Dataset4 write point selection
Read Testing Dataset1 by COL
Read Testing Dataset2 by ROW
Read Testing Dataset3 read select ALL proc 0, NONE others
Read Testing Dataset4 with Point selection
***Express test mode on. Several tests are skipped
real 42m34.103s
user 213m22.925s
sys 8m6.742s
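
As a quick check on the "2 orders of magnitude" figure, the wall-clock
("real") times from the two runs above can be converted to seconds and
compared. This is only a sketch: the helper name time_to_seconds is made up
for illustration, and it assumes the minutes-and-seconds format that bash's
time builtin printed here.

```python
import re


def time_to_seconds(stamp):
    """Convert a bash `time` value such as '42m34.103s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time value: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# "real" values copied from the two t_bigio runs in the transcript.
romio_real = time_to_seconds("0m21.141s")   # default io component (ROMIO)
ompio_real = time_to_seconds("42m34.103s")  # after OMPI_MCA_io=ompio

slowdown = ompio_real / romio_real
print(f"ROMIO {romio_real:.1f} s, OMPIO {ompio_real:.1f} s, "
      f"slowdown {slowdown:.0f}x")
# prints: ROMIO 21.1 s, OMPIO 2554.1 s, slowdown 121x
```

That matches the 21 s and 2554 s figures quoted earlier in this message.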