All,

My guess is this is an "I built Open MPI incorrectly" sort of issue, but I'm
not sure how to fix it. Namely, I'm currently trying to get an MPI
project's CI working on CircleCI using Open MPI to run some unit tests (on
a single node, so I need to oversubscribe). I can build everything just
fine, but when I try to run, things just...blow up:

[root@3796b115c961 build]# /opt/openmpi-4.0.2/bin/mpirun -np 18
-oversubscribe /root/project/MAPL/build/bin/pfio_ctest_io.x -nc 6 -nsi 6
-nso 6 -ngo 1 -ngi 1 -v T,U -s mpi
 start app rank:           0
 start app rank:           1
 start app rank:           2
 start app rank:           3
 start app rank:           4
 start app rank:           5
[3796b115c961:03629] Read -1, expected 48, errno = 1
[3796b115c961:03629] *** An error occurred in MPI_Get
[3796b115c961:03629] *** reported by process [2144600065,12]
[3796b115c961:03629] *** on win rdma window 5
[3796b115c961:03629] *** MPI_ERR_OTHER: known error not in list
[3796b115c961:03629] *** MPI_ERRORS_ARE_FATAL (processes in this win will
now abort,
[3796b115c961:03629] ***    and potentially your MPI job)

I'm currently more concerned about the MPI_Get error, though I'm also not
sure what that "Read -1, expected 48, errno = 1" bit is about (an MPI-IO
error?). Now, this code is fairly fancy MPI code, so I decided to try a
simpler test. I searched the internet and found an example program here:

https://software.intel.com/en-us/blogs/2014/08/06/one-sided-communication

and when I build and run it with Intel MPI, it works:

(1027)(master) $ mpirun -V
Intel(R) MPI Library for Linux* OS, Version 2018 Update 4 Build 20180823
(id: 18555)
Copyright 2003-2018 Intel Corporation.
(1028)(master) $ mpiicc rma_test.c
(1029)(master) $ mpirun -np 2 ./a.out
srun.slurm: cluster configuration lacks support for cpu binding
Rank 0 running on borgj001
Rank 1 running on borgj001
Rank 0 sets data in the shared memory: 00 01 02 03
Rank 1 sets data in the shared memory: 10 11 12 13
Rank 0 gets data from the shared memory: 10 11 12 13
Rank 1 gets data from the shared memory: 00 01 02 03
Rank 0 has new data in the shared memory:Rank 1 has new data in the shared
memory: 10 11 12 13
 00 01 02 03
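For reference, the test is essentially the following shape. This is my own
minimal reconstruction from memory, not the exact blog code: the window
creation call (MPI_Win_allocate), the buffer size, and the names are my
guesses.

   /* Sketch of the one-sided test: each rank exposes 4 ints in an RMA
    * window, then MPI_Get's the other rank's values. */
   #include <mpi.h>
   #include <stdio.h>

   #define N 4

   int main(int argc, char **argv)
   {
       int rank, size, i, len;
       int *win_buf, recv_buf[N];
       MPI_Win win;
       char host[MPI_MAX_PROCESSOR_NAME];

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &size);
       MPI_Get_processor_name(host, &len);
       printf("Rank %d running on %s\n", rank, host);

       /* Expose N ints in an RMA window */
       MPI_Win_allocate(N * sizeof(int), sizeof(int), MPI_INFO_NULL,
                        MPI_COMM_WORLD, &win_buf, &win);

       for (i = 0; i < N; i++)
           win_buf[i] = rank * 10 + i;
       printf("Rank %d sets data in the shared memory: %02d %02d %02d %02d\n",
              rank, win_buf[0], win_buf[1], win_buf[2], win_buf[3]);

       /* Fetch the peer rank's window contents (this is where it fails
        * under Open MPI for me) */
       MPI_Win_fence(0, win);
       MPI_Get(recv_buf, N, MPI_INT, (rank + 1) % size, 0, N, MPI_INT, win);
       MPI_Win_fence(0, win);

       printf("Rank %d gets data from the shared memory: %02d %02d %02d %02d\n",
              rank, recv_buf[0], recv_buf[1], recv_buf[2], recv_buf[3]);

       MPI_Win_free(&win);
       MPI_Finalize();
       return 0;
   }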

So, I have some confidence it was written correctly. Now on the same system
I try with Open MPI (built with gcc rather than the Intel compilers):

(1032)(master) $ mpirun -V
mpirun (Open MPI) 4.0.1

Report bugs to http://www.open-mpi.org/community/help/
(1033)(master) $ mpicc rma_test.c
(1034)(master) $ mpirun -np 2 ./a.out
Rank 0 running on borgj001
Rank 1 running on borgj001
Rank 0 sets data in the shared memory: 00 01 02 03
Rank 1 sets data in the shared memory: 10 11 12 13
[borgj001:22668] *** An error occurred in MPI_Get
[borgj001:22668] *** reported by process [2514223105,1]
[borgj001:22668] *** on win rdma window 3
[borgj001:22668] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[borgj001:22668] *** MPI_ERRORS_ARE_FATAL (processes in this win will now
abort,
[borgj001:22668] ***    and potentially your MPI job)
[borgj001:22642] 1 more process has sent help message help-mpi-errors.txt /
mpi_errors_are_fatal
[borgj001:22642] Set MCA parameter "orte_base_help_aggregate" to 0 to see
all help / error messages

This is a similar failure to the one above. Any ideas what I might be doing
wrong here? I don't doubt I'm missing something, but I'm not sure what. Open
MPI was built pretty boringly:

Configure command line: '--with-slurm' '--enable-shared'
'--disable-wrapper-rpath' '--disable-wrapper-runpath'
'--enable-mca-no-build=btl-usnic' '--prefix=...'

I'm not sure we still need those disable-wrapper bits, but we needed them
long ago, and so they've lived on in "how to build" READMEs until something
breaks. The btl-usnic exclusion is a bit of an unknown to me (this install
was built by sysadmins on a cluster), but it's pretty close to how I build
on my desktop, and that has the same issue.

Any ideas from the experts?

-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton
