Howard,

If I run examples/hello_c explicitly with --mca pml ob1, it runs fine.
If I just let mpirun pick things out on its own, it bombs in the OMPI progress thread. Here is the trace:

[node39:126438] mca: base: components_register: registering framework pml components
[node39:126438] mca: base: components_register: found loaded component v
[node39:126438] mca: base: components_register: component v register function successful
[node39:126438] mca: base: components_register: found loaded component cm
[node39:126438] mca: base: components_register: component cm register function successful
[node39:126438] mca: base: components_register: found loaded component monitoring
[node39:126438] mca: base: components_register: component monitoring register function successful
[node39:126438] mca: base: components_register: found loaded component ob1
[node39:126438] mca: base: components_register: component ob1 register function successful
[node39:126438] mca: base: components_open: opening pml components
[node39:126438] mca: base: components_open: found loaded component v
[node39:126438] mca: base: components_open: component v open function successful
[node39:126438] mca: base: components_open: found loaded component cm
[node39:126438] mca: base: components_open: component cm open function successful
[node39:126438] mca: base: components_open: found loaded component monitoring
[node39:126438] mca: base: components_open: component monitoring open function successful
[node39:126438] mca: base: components_open: found loaded component ob1
[node39:126438] mca: base: components_open: component ob1 open function successful
[node39:126438] select: component v not in the include list
[node39:126438] select: initializing pml component cm
[node39:126438] select: init returned priority 10
[node39:126438] select: component monitoring not in the include list
[node39:126438] select: initializing pml component ob1
[node39:126438] select: init returned priority 20
[node39:126438] selected ob1 best priority 20
[node39:126438] *** Process received signal ***
[node39:126438] Signal: Segmentation fault (11)
[node39:126438] Signal code: Invalid permissions (2)
[node39:126438] Failing at address: 0x7fcb175bc008
[node39:126438] [ 0] /lib64/libpthread.so.0(+0xf710)[0x7fcb21a5e710]
[node39:126438] [ 1] /ascldap/users/dblarki/opt/portals4.master/lib/libportals.so.4(PtlEQPoll+0x62)[0x7fcb14a5bf62]
[node39:126438] [ 2] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_portals4.so(ompi_mtl_portals4_progress+0x3d)[0x7fcb0dbea40e]
[node39:126438] [ 3] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_portals4.so(ompi_mtl_portals4_finalize+0x20)[0x7fcb0dbe8c2f]
[node39:126438] [ 4] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_pml_cm.so(+0x67dc)[0x7fcb0f9c67dc]
[node39:126438] [ 5] /ascldap/users/dblarki/opt/ompi/lib/libmpi.so.40(mca_pml_base_select+0x56c)[0x7fcb21d7b7ce]
[node39:126438] [ 6] /ascldap/users/dblarki/opt/ompi/lib/libmpi.so.40(ompi_mpi_init+0xb54)[0x7fcb21cc7c2e]
[node39:126438] [ 7] /ascldap/users/dblarki/opt/ompi/lib/libmpi.so.40(MPI_Init+0x7d)[0x7fcb21d069b7]
[node39:126438] [ 8] ./hello_c[0x400806]
[node39:126438] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fcb216d9d5d]
[node39:126438] [10] ./hello_c[0x400719]
[node39:126438] *** End of error message ***

If I run osu_latency with --mca pml ob1, I get the same thing. The segfault is happening in Portals, but is probably due to the progress thread calling PtlEQPoll on invalid event queues.

b.

> On Feb 8, 2018, at 2:05 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
>
> Hi Brian,
>
> Thanks for the info. I'm not sure I quite get the response, though. Is the race condition in the way the Open MPI Portals4 MTL is using Portals, or is it a problem in the Portals implementation itself?
>
> Howard
>
>
> 2018-02-08 9:20 GMT-07:00 D. Brian Larkins <brianlark...@gmail.com>:
> Howard,
>
> Looks like ob1 is working fine.
> When I looked into the problems with ob1, it looked like the progress thread was polling the Portals event queue before it had been initialized.
>
> b.
>
> $ mpirun -n 2 --mca pml ob1 --mca btl self,vader,openib osu_latency
> WARNING: Ummunotify not found: Not using ummunotify can result in incorrect results
> download and install ummunotify from:
> http://support.systemfabricworks.com/downloads/ummunotify/ummunotify-v2.tar.bz2
> WARNING: Ummunotify not found: Not using ummunotify can result in incorrect results
> download and install ummunotify from:
> http://support.systemfabricworks.com/downloads/ummunotify/ummunotify-v2.tar.bz2
> # OSU MPI Latency Test
> # Size          Latency (us)
> 0                       1.87
> 1                       1.93
> 2                       1.90
> 4                       1.94
> 8                       1.94
> 16                      1.96
> 32                      1.97
> 64                      1.99
> 128                     2.43
> 256                     2.50
> 512                     2.71
> 1024                    3.01
> 2048                    3.45
> 4096                    4.56
> 8192                    6.39
> 16384                   8.79
> 32768                  11.50
> 65536                  16.59
> 131072                 27.10
> 262144                 46.97
> 524288                 87.55
> 1048576               168.89
> 2097152               331.40
> 4194304               654.08
>
>> On Feb 7, 2018, at 9:04 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
>>
>> Hi Brian,
>>
>> As a sanity check, can you see if the ob1 pml works okay, i.e.
>>
>> mpirun -n 2 --mca pml ob1 --mca btl self,vader,openib ./osu_latency
>>
>> Howard
>>
>>
>> 2018-02-07 11:03 GMT-07:00 brian larkins <brianlark...@gmail.com>:
>> Hello,
>>
>> I'm doing some work with Portals4 and am trying to run some MPI programs using Portals4 as the transport layer. I'm running into problems and am hoping that someone can help me figure out how to get things working.
>> I'm using Open MPI 3.0.0 with the following configuration:
>>
>> ./configure CFLAGS=-pipe --prefix=path/to/install --enable-picky --enable-debug --enable-mpi-fortran --with-portals4=path/to/portals4 --disable-oshmem --disable-vt --disable-java --disable-mpi-io --disable-io-romio --disable-libompitrace --disable-btl-portals4-flow-control --disable-mtl-portals4-flow-control
>>
>> I have also tried the head of the git repo and 2.1.2 with the same results. A simpler configure line (with just --prefix and --with-portals4=) also gets the same results.
>>
>> The Portals4 build is from GitHub master, configured thus:
>>
>> ./configure --prefix=path/to/portals4 --with-ev=path/to/libev --enable-transport-ib --enable-fast --enable-zero-mrs --enable-me-triggered
>>
>> If I specify the cm pml on the command line, I can get examples/hello_c to run correctly. Trying to get some latency numbers using the OSU benchmarks is where my trouble begins:
>>
>> $ mpirun -n 2 --mca mtl portals4 --mca pml cm env PTL_DISABLE_MEM_REG_CACHE=1 ./osu_latency
>> NOTE: Ummunotify and IB registered mem cache disabled, set PTL_DISABLE_MEM_REG_CACHE=0 to re-enable.
>> NOTE: Ummunotify and IB registered mem cache disabled, set PTL_DISABLE_MEM_REG_CACHE=0 to re-enable.
>> # OSU MPI Latency Test
>> # Size          Latency (us)
>> 0                      25.96
>> [node41:19740] *** An error occurred in MPI_Barrier
>> [node41:19740] *** reported by process [139815819542529,4294967297]
>> [node41:19740] *** on communicator MPI_COMM_WORLD
>> [node41:19740] *** MPI_ERR_OTHER: known error not in list
>> [node41:19740] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> [node41:19740] *** and potentially your MPI job)
>>
>> Not specifying cm gets an earlier segfault (it defaults to ob1) and looks to be a progress thread initialization problem.
>> Using PTL_IGNORE_UMMUNOTIFY=1 gets here:
>>
>> $ mpirun --mca pml cm -n 2 env PTL_IGNORE_UMMUNOTIFY=1 ./osu_latency
>> # OSU MPI Latency Test
>> # Size          Latency (us)
>> 0                      24.14
>> 1                      26.24
>> [node41:19993] *** Process received signal ***
>> [node41:19993] Signal: Segmentation fault (11)
>> [node41:19993] Signal code: Address not mapped (1)
>> [node41:19993] Failing at address: 0x141
>> [node41:19993] [ 0] /lib64/libpthread.so.0(+0xf710)[0x7fa6ac73b710]
>> [node41:19993] [ 1] /ascldap/users/dblarki/opt/portals4.master/lib/libportals.so.4(+0xcd65)[0x7fa69b770d65]
>> [node41:19993] [ 2] /ascldap/users/dblarki/opt/portals4.master/lib/libportals.so.4(PtlPut+0x143)[0x7fa69b773fb3]
>> [node41:19993] [ 3] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_portals4.so(+0xa961)[0x7fa698cf5961]
>> [node41:19993] [ 4] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_portals4.so(+0xb0e5)[0x7fa698cf60e5]
>> [node41:19993] [ 5] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_portals4.so(ompi_mtl_portals4_send+0x90)[0x7fa698cf61d1]
>> [node41:19993] [ 6] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_pml_cm.so(+0x5430)[0x7fa69a794430]
>> [node41:19993] [ 7] /ascldap/users/dblarki/opt/ompi/lib/libmpi.so.40(PMPI_Send+0x2b4)[0x7fa6ac9ff018]
>> [node41:19993] [ 8] ./osu_latency[0x40106f]
>> [node41:19993] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fa6ac3b6d5d]
>> [node41:19993] [10] ./osu_latency[0x400c59]
>>
>> This cluster is running RHEL 6.5 without the ummunotify modules, but I get the same results on a local (small) cluster running Ubuntu 16.04 with ummunotify loaded.
>>
>> Any help would be much appreciated.
>>
>> thanks,
>> brian.
> --
> D. Brian Larkins
> Assistant Professor of Computer Science
> Rhodes College

--
D. Brian Larkins
Assistant Professor of Computer Science
Rhodes College
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users