I'm sorry; I can't help debug a version from 9 years ago.  The best suggestion 
I have is to use a modern version of Open MPI.

Note, however, that your use of "--mca btl ..." has the same meaning in all 
versions of Open MPI.  The problem you showed in the first mail was with the 
shared memory transport.  Using "--mca btl tcp,self" means you're not using 
the shared memory transport.  If you don't specify "--mca btl tcp,self", 
Open MPI will automatically use the shared memory transport.  Hence, you could 
be running into the same (or a similar/related) problem that you mentioned in 
the first mail -- i.e., something is going wrong with how the v1.2.9 shared 
memory transport interacts with your system.
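
For example, here's a sketch (reusing the -np and hostfile values from your 
mails; adjust as needed) of how to control exactly which transports are 
eligible:

    # shared memory excluded -- only TCP and self-loopback
    mpirun --mca btl tcp,self -np 144 --hostfile /root/research/hostfile ./xhpl

    # shared memory ("sm") explicitly included
    mpirun --mca btl sm,tcp,self -np 144 --hostfile /root/research/hostfile ./xhpl

If the second form hangs or segfaults where the first runs cleanly, that 
points at the v1.2 sm BTL.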

Likewise, "--mca btl_tcp_if_include ib0" tells the TCP BTL plugin to use the 
"ib0" network.  But if you have the openib BTL available (i.e., the IB-native 
plug), that will be used instead of the TCP BTL because native verbs over IB 
performs much better than TCP over IB.  Meaning: if you specify 
btl_Tcp_if_include without specifying "--mca btl tcp,self", then (assuming 
openib is available) the TCP BTL likely isn't used and the btl_tcp_if_include 
value is therefore ignored.
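
If you genuinely want to test TCP over the IB interface (e.g., to take openib 
out of the picture), select the TCP BTL explicitly at the same time.  An 
untested sketch against your v1.2 install, combining your two commands:

    # force the TCP BTL and point it at the ib0 interface
    mpirun --mca btl tcp,self --mca btl_tcp_if_include ib0 \
        --hostfile /root/research/hostfile-ib -np 48 ./xhpl

You can check which BTL plugins your build actually contains with 
"ompi_info | grep btl".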

Also, what version of Linpack are you using?  The error you show (the 
MPI_COMM_SPLIT error) is usually indicative of an MPI application bug.  If 
you're running an old version of xhpl, you should upgrade to the latest.
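
Independent of the xhpl version, it may be worth turning up the BTL verbosity 
so you can see which transports are actually selected at run time.  I'm citing 
the parameter from memory for the 1.2 series, so treat this as a sketch:

    # log the BTL framework's component selection
    mpirun --mca btl_base_verbose 30 --mca btl tcp,self \
        --hostfile /root/research/hostfile -np 144 ./xhpl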




> On Mar 19, 2018, at 9:59 PM, Kaiming Ouyang <kouya...@ucr.edu> wrote:
> 
> Hi Jeff,
> Thank you for your reply. I just switched to another cluster that does not 
> have infiniband. I ran HPL with:
> mpirun --mca btl tcp,self -np 144 --hostfile /root/research/hostfile ./xhpl
> 
> It ran successfully, but if I delete "--mca btl tcp,self", it cannot run. So 
> I suspect that openmpi 1.2 cannot identify the proper network interfaces and 
> set the correct parameters for them. 
> Then I returned to the previous cluster with infiniband and typed the same 
> command as above. It got stuck forever.
> 
> I changed the command to:
> mpirun --mca btl_tcp_if_include ib0 --hostfile /root/research/hostfile-ib -np 
> 48 ./xhpl
> 
> It launches successfully, but gives me the following errors when HPL tries 
> to split the communicator:
> 
> [node1.novalocal:09562] *** An error occurred in MPI_Comm_split
> [node1.novalocal:09562] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
> [node1.novalocal:09562] *** MPI_ERR_IN_STATUS: error code in status
> [node1.novalocal:09562] *** MPI_ERRORS_ARE_FATAL (goodbye)
> [node1.novalocal:09583] *** An error occurred in MPI_Comm_split
> [node1.novalocal:09583] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
> [node1.novalocal:09583] *** MPI_ERR_IN_STATUS: error code in status
> [node1.novalocal:09583] *** MPI_ERRORS_ARE_FATAL (goodbye)
> [node1.novalocal:09637] *** An error occurred in MPI_Comm_split
> [node1.novalocal:09637] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
> [node1.novalocal:09637] *** MPI_ERR_IN_STATUS: error code in status
> [node1.novalocal:09637] *** MPI_ERRORS_ARE_FATAL (goodbye)
> [node1.novalocal:09994] *** An error occurred in MPI_Comm_split
> [node1.novalocal:09994] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
> [node1.novalocal:09994] *** MPI_ERR_IN_STATUS: error code in status
> [node1.novalocal:09994] *** MPI_ERRORS_ARE_FATAL (goodbye)
> mpirun noticed that job rank 0 with PID 46005 on node test-ib exited on 
> signal 15 (Terminated).
> 
> Hope you can give me some suggestions. Thank you.
> 
> Kaiming Ouyang, Research Assistant.
> Department of Computer Science and Engineering
> University of California, Riverside
> 900 University Avenue, Riverside, CA 92521
> 
> 
> On Mon, Mar 19, 2018 at 7:35 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
> wrote:
> That's actually failing in a shared memory section of the code.
> 
> But to answer your question, yes, Open MPI 1.2 did have IB support.
> 
> That being said, I have no idea what would cause this shared memory segv -- 
> it's quite possible that it's simple bit rot (i.e., v1.2.9 was released 9 
> years ago -- see 
> https://www.open-mpi.org/software/ompi/versions/timeline.php).  Perhaps it 
> does not function correctly on modern glibc/Linux kernel-based platforms.
> 
> Can you upgrade to a [much] newer Open MPI?
> 
> 
> 
> > On Mar 19, 2018, at 8:29 PM, Kaiming Ouyang <kouya...@ucr.edu> wrote:
> >
> > Hi everyone,
> > Recently I needed to compile the High-Performance Linpack code with 
> > openmpi 1.2 (a somewhat old version). When I finished compiling and tried 
> > to run, I got the following errors:
> >
> > [test:32058] *** Process received signal ***
> > [test:32058] Signal: Segmentation fault (11)
> > [test:32058] Signal code: Address not mapped (1)
> > [test:32058] Failing at address: 0x14a2b84b6304
> > [test:32058] [ 0] /lib64/libpthread.so.0(+0xf5e0) [0x14eb116295e0]
> > [test:32058] [ 1] 
> > /root/research/lib/openmpi-1.2.9/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x28a)
> >  [0x14eaa81258aa]
> > [test:32058] [ 2] 
> > /root/research/lib/openmpi-1.2.9/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x2b)
> >  [0x14eaa853219b]
> > [test:32058] [ 3] 
> > /root/research/lib/openmpi-1.2.9/lib/libopen-pal.so.0(opal_progress+0x4a) 
> > [0x14eb128dbaaa]
> > [test:32058] [ 4] 
> > /root/research/lib/openmpi-1.2.9/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_wait+0x1d)
> >  [0x14eaf41e6b4d]
> > [test:32058] [ 5] 
> > /root/research/lib/openmpi-1.2.9/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv+0x3a5)
> >  [0x14eaf41eac45]
> > [test:32058] [ 6] 
> > /root/research/lib/openmpi-1.2.9/lib/libopen-rte.so.0(mca_oob_recv_packed+0x33)
> >  [0x14eb12b62223]
> > [test:32058] [ 7] 
> > /root/research/lib/openmpi-1.2.9/lib/openmpi/mca_gpr_proxy.so(orte_gpr_proxy_put+0x1f9)
> >  [0x14eaf3dd7db9]
> > [test:32058] [ 8] 
> > /root/research/lib/openmpi-1.2.9/lib/libopen-rte.so.0(orte_smr_base_set_proc_state+0x31d)
> >  [0x14eb12b7893d]
> > [test:32058] [ 9] 
> > /root/research/lib/openmpi-1.2.9/lib/libmpi.so.0(ompi_mpi_init+0x8d6) 
> > [0x14eb13202136]
> > [test:32058] [10] 
> > /root/research/lib/openmpi-1.2.9/lib/libmpi.so.0(MPI_Init+0x6a) 
> > [0x14eb1322461a]
> > [test:32058] [11] ./xhpl(main+0x5d) [0x404e7d]
> > [test:32058] [12] /lib64/libc.so.6(__libc_start_main+0xf5) [0x14eb11278c05]
> > [test:32058] [13] ./xhpl() [0x4056cb]
> > [test:32058] *** End of error message ***
> > mpirun noticed that job rank 0 with PID 31481 on node test.novalocal exited 
> > on signal 15 (Terminated).
> > 23 additional processes aborted (not shown)
> >
> > The machine has infiniband, so I wonder whether openmpi 1.2 supports 
> > infiniband by default. I also tried running it without infiniband, but 
> > then the program can only handle small input sizes. When I increase the 
> > input size and grid size, it just gets stuck. The program I run is a 
> > benchmark, so I don't think the problem is in the code. Any ideas? Thanks.
> >
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> 


-- 
Jeff Squyres
jsquy...@cisco.com

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
