Howard,

If I run examples/hello_c explicitly with --mca pml ob1, it runs fine.
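
For reference, the explicit run is along these lines (the process count here is just an example):

  $ mpirun -n 2 --mca pml ob1 ./examples/hello_c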

If I just let mpirun pick things on its own, it bombs in the OMPI progress thread.
Here is the trace:

[node39:126438] mca: base: components_register: registering framework pml components
[node39:126438] mca: base: components_register: found loaded component v
[node39:126438] mca: base: components_register: component v register function successful
[node39:126438] mca: base: components_register: found loaded component cm
[node39:126438] mca: base: components_register: component cm register function successful
[node39:126438] mca: base: components_register: found loaded component monitoring
[node39:126438] mca: base: components_register: component monitoring register function successful
[node39:126438] mca: base: components_register: found loaded component ob1
[node39:126438] mca: base: components_register: component ob1 register function successful
[node39:126438] mca: base: components_open: opening pml components
[node39:126438] mca: base: components_open: found loaded component v
[node39:126438] mca: base: components_open: component v open function successful
[node39:126438] mca: base: components_open: found loaded component cm
[node39:126438] mca: base: components_open: component cm open function successful
[node39:126438] mca: base: components_open: found loaded component monitoring
[node39:126438] mca: base: components_open: component monitoring open function successful
[node39:126438] mca: base: components_open: found loaded component ob1
[node39:126438] mca: base: components_open: component ob1 open function successful
[node39:126438] select: component v not in the include list
[node39:126438] select: initializing pml component cm
[node39:126438] select: init returned priority 10
[node39:126438] select: component monitoring not in the include list
[node39:126438] select: initializing pml component ob1
[node39:126438] select: init returned priority 20
[node39:126438] selected ob1 best priority 20
[node39:126438] *** Process received signal ***
[node39:126438] Signal: Segmentation fault (11)
[node39:126438] Signal code: Invalid permissions (2)
[node39:126438] Failing at address: 0x7fcb175bc008
[node39:126438] [ 0] /lib64/libpthread.so.0(+0xf710)[0x7fcb21a5e710]
[node39:126438] [ 1] /ascldap/users/dblarki/opt/portals4.master/lib/libportals.so.4(PtlEQPoll+0x62)[0x7fcb14a5bf62]
[node39:126438] [ 2] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_portals4.so(ompi_mtl_portals4_progress+0x3d)[0x7fcb0dbea40e]
[node39:126438] [ 3] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_portals4.so(ompi_mtl_portals4_finalize+0x20)[0x7fcb0dbe8c2f]
[node39:126438] [ 4] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_pml_cm.so(+0x67dc)[0x7fcb0f9c67dc]
[node39:126438] [ 5] /ascldap/users/dblarki/opt/ompi/lib/libmpi.so.40(mca_pml_base_select+0x56c)[0x7fcb21d7b7ce]
[node39:126438] [ 6] /ascldap/users/dblarki/opt/ompi/lib/libmpi.so.40(ompi_mpi_init+0xb54)[0x7fcb21cc7c2e]
[node39:126438] [ 7] /ascldap/users/dblarki/opt/ompi/lib/libmpi.so.40(MPI_Init+0x7d)[0x7fcb21d069b7]
[node39:126438] [ 8] ./hello_c[0x400806]
[node39:126438] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fcb216d9d5d]
[node39:126438] [10] ./hello_c[0x400719]
[node39:126438] *** End of error message ***

If I run osu_latency with --mca pml ob1, I get the same thing. The segfault is 
happening in Portals, but it is probably due to the progress thread calling 
PtlEQPoll on invalid event queues.
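
To make that suspicion concrete, here is a rough sketch (made-up names, not the
actual ompi_mtl_portals4 code) of the kind of guard I would expect around the
poll: PtlEQPoll should never see handles that have not been allocated yet or
have already been freed.

#include <portals4.h>
#include <stdbool.h>

/* Illustrative globals only; a real implementation would need proper
 * synchronization between the init/finalize path and the progress thread. */
static ptl_handle_eq_t eq_handles[2];     /* filled in during MTL init        */
static unsigned int    eq_count = 0;      /* number of queues PtlEQAlloc'd    */
static volatile bool   eqs_ready = false; /* true only between alloc and free */

static int portals4_progress_sketch(void)
{
    ptl_event_t  ev;
    unsigned int which;
    int          handled = 0;

    /* Guard: never poll queues that do not (or no longer) exist. */
    if (!eqs_ready || 0 == eq_count) {
        return 0;
    }

    /* Non-blocking poll (timeout 0) across all allocated queues. */
    while (PTL_OK == PtlEQPoll(eq_handles, eq_count, 0, &ev, &which)) {
        /* ... dispatch ev from eq_handles[which] ... */
        handled++;
    }
    return handled;
}

In the trace above the poll is reached from ompi_mtl_portals4_finalize (after
cm loses the selection to ob1), which would fit the invalid-queue theory.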

b.

> On Feb 8, 2018, at 2:05 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
> 
> Hi Brian,
> 
> Thanks for the info.  I'm not sure I quite get the response, though.  Is the 
> race condition in the way the Open MPI Portals4 MTL is using Portals, or is it 
> a problem in the Portals implementation itself?
> 
> Howard
> 
> 
> 2018-02-08 9:20 GMT-07:00 D. Brian Larkins <brianlark...@gmail.com>:
> Howard,
> 
> Looks like ob1 is working fine. When I looked into the problems with ob1, it 
> looked like the progress thread was polling the Portals event queue before it 
> had been initialized.
> 
> b.
> 
> $ mpirun -n 2 --mca pml ob1 --mca btl self,vader,openib osu_latency
> WARNING: Ummunotify not found: Not using ummunotify can result in incorrect results download and install ummunotify from:
> http://support.systemfabricworks.com/downloads/ummunotify/ummunotify-v2.tar.bz2
> WARNING: Ummunotify not found: Not using ummunotify can result in incorrect results download and install ummunotify from:
> http://support.systemfabricworks.com/downloads/ummunotify/ummunotify-v2.tar.bz2
> # OSU MPI Latency Test
> # Size            Latency (us)
> 0                         1.87
> 1                         1.93
> 2                         1.90
> 4                         1.94
> 8                         1.94
> 16                        1.96
> 32                        1.97
> 64                        1.99
> 128                       2.43
> 256                       2.50
> 512                       2.71
> 1024                      3.01
> 2048                      3.45
> 4096                      4.56
> 8192                      6.39
> 16384                     8.79
> 32768                    11.50
> 65536                    16.59
> 131072                   27.10
> 262144                   46.97
> 524288                   87.55
> 1048576                 168.89
> 2097152                 331.40
> 4194304                 654.08
> 
> 
>> On Feb 7, 2018, at 9:04 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
>> 
>> Hi Brian,
>> 
>> As a sanity check, can you see if the ob1 pml works okay, i.e.
>> 
>>  mpirun -n 2 --mca pml ob1 --mca btl self,vader,openib ./osu_latency
>> 
>> Howard
>> 
>> 
>> 2018-02-07 11:03 GMT-07:00 brian larkins <brianlark...@gmail.com>:
>> Hello,
>> 
>> I'm doing some work with Portals4 and am trying to run some MPI programs 
>> using Portals4 as the transport layer. I'm running into problems and am 
>> hoping that someone can help me figure out how to get things working. I'm 
>> using Open MPI 3.0.0 with the following configuration:
>> 
>> ./configure CFLAGS=-pipe --prefix=path/to/install --enable-picky 
>> --enable-debug --enable-mpi-fortran --with-portals4=path/to/portals4 
>> --disable-oshmem --disable-vt --disable-java --disable-mpi-io 
>> --disable-io-romio --disable-libompitrace 
>> --disable-btl-portals4-flow-control --disable-mtl-portals4-flow-control
>> 
>> I have also tried the head of the git repo and 2.1.2 with the same results. 
>> A simpler configure line (with just --prefix and --with-portals4=) also gets 
>> the same results.
>> 
>> Portals4 is built from GitHub master and configured thus:
>> 
>> ./configure --prefix=path/to/portals4 --with-ev=path/to/libev 
>> --enable-transport-ib --enable-fast --enable-zero-mrs --enable-me-triggered
>> 
>> If I specify the cm pml on the command-line, I can get examples/hello_c to 
>> run correctly. Trying to get some latency numbers using the OSU benchmarks 
>> is where my trouble begins:
>> 
>> $ mpirun -n 2 --mca mtl portals4 --mca pml cm env PTL_DISABLE_MEM_REG_CACHE=1 ./osu_latency
>> NOTE: Ummunotify and IB registered mem cache disabled, set PTL_DISABLE_MEM_REG_CACHE=0 to re-enable.
>> NOTE: Ummunotify and IB registered mem cache disabled, set PTL_DISABLE_MEM_REG_CACHE=0 to re-enable.
>> # OSU MPI Latency Test
>> # Size            Latency (us)
>> 0                        25.96
>> [node41:19740] *** An error occurred in MPI_Barrier
>> [node41:19740] *** reported by process [139815819542529,4294967297]
>> [node41:19740] *** on communicator MPI_COMM_WORLD
>> [node41:19740] *** MPI_ERR_OTHER: known error not in list
>> [node41:19740] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> [node41:19740] ***    and potentially your MPI job)
>> 
>> Not specifying cm gets an earlier segfault (the pml selection defaults to 
>> ob1), which looks to be a progress-thread initialization problem.
>> Using PTL_IGNORE_UMMUNOTIFY=1 gets this far:
>> 
>> $ mpirun --mca pml cm -n 2 env PTL_IGNORE_UMMUNOTIFY=1 ./osu_latency
>> # OSU MPI Latency Test
>> # Size            Latency (us)
>> 0                        24.14
>> 1                        26.24
>> [node41:19993] *** Process received signal ***
>> [node41:19993] Signal: Segmentation fault (11)
>> [node41:19993] Signal code: Address not mapped (1)
>> [node41:19993] Failing at address: 0x141
>> [node41:19993] [ 0] /lib64/libpthread.so.0(+0xf710)[0x7fa6ac73b710]
>> [node41:19993] [ 1] /ascldap/users/dblarki/opt/portals4.master/lib/libportals.so.4(+0xcd65)[0x7fa69b770d65]
>> [node41:19993] [ 2] /ascldap/users/dblarki/opt/portals4.master/lib/libportals.so.4(PtlPut+0x143)[0x7fa69b773fb3]
>> [node41:19993] [ 3] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_portals4.so(+0xa961)[0x7fa698cf5961]
>> [node41:19993] [ 4] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_portals4.so(+0xb0e5)[0x7fa698cf60e5]
>> [node41:19993] [ 5] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_portals4.so(ompi_mtl_portals4_send+0x90)[0x7fa698cf61d1]
>> [node41:19993] [ 6] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_pml_cm.so(+0x5430)[0x7fa69a794430]
>> [node41:19993] [ 7] /ascldap/users/dblarki/opt/ompi/lib/libmpi.so.40(PMPI_Send+0x2b4)[0x7fa6ac9ff018]
>> [node41:19993] [ 8] ./osu_latency[0x40106f]
>> [node41:19993] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fa6ac3b6d5d]
>> [node41:19993] [10] ./osu_latency[0x400c59]
>> 
>> This cluster is running RHEL 6.5 without the ummunotify module, but I get 
>> the same results on a local (small) cluster running Ubuntu 16.04 with 
>> ummunotify loaded.
>> 
>> Any help would be much appreciated.
>> thanks,
>> 
>> brian.
>> 
>> 
>> _______________________________________________
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
> --
> D. Brian Larkins
> Assistant Professor of Computer Science
> Rhodes College
> 
> 
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

--
D. Brian Larkins
Assistant Professor of Computer Science
Rhodes College

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
