Re: [OMPI devel] sm BTL performance of the openmpi-2.0.0
Finally it worked, thanks!

[mishima@manage OMB-3.1.1-openmpi2.0.0]$ ompi_info --param btl openib --level 5 | grep openib_flags
    MCA btl openib: parameter "btl_openib_flags" (current value: "65847", data source: default, level: 5 tuner/detail, type: unsigned_int)
[mishima@manage OMB-3.1.1-openmpi2.0.0]$ mpirun -np 2 -report-bindings osu_bw
[manage.cluster:14439] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[manage.cluster:14439] MCW rank 1 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
# OSU MPI Bandwidth Test v3.1.1
# Size      Bandwidth (MB/s)
1           1.72
2           3.52
4           7.01
8           14.11
16          28.17
32          55.90
64          99.83
128         159.13
256         272.98
512         476.35
1024        911.49
2048        1319.96
4096        1767.78
8192        2169.53
16384       2507.96
32768       2957.28
65536       3206.90
131072      3610.33
262144      3985.18
524288      4379.47
1048576     4560.90
2097152     4661.44
4194304     4631.21

Tetsuya Mishima

On 2016/08/10 11:57:29, "devel" wrote in "Re: [OMPI devel] sm BTL performance of the openmpi-2.0.0":
> Ack, the segv is due to a typo from transcribing the patch. Fixed. Please try the following patch and let me know if it fixes the issues.
>
> https://github.com/hjelmn/ompi/commit/4079eec9749e47dddc6acc9c0847b3091601919f.patch
>
> -Nathan
>
> > On Aug 8, 2016, at 9:48 PM, tmish...@jcity.maeda.co.jp wrote:
> >
> > The latest patch also causes a segfault...
> >
> > By the way, I found a typo as below: the last argument in the last
> > line should be &mca_pml_ob1.use_all_rdma:
> >
> > +    mca_pml_ob1.use_all_rdma = false;
> > +    (void) mca_base_component_var_register (&mca_pml_ob1_component.pmlm_version, "use_all_rdma",
> > +                                            "Use all available RDMA btls for the RDMA and RDMA pipeline protocols "
> > +                                            "(default: false)", MCA_BASE_VAR_TYPE_BOOL, NULL, 0, 0,
> > +                                            OPAL_INFO_LVL_5, MCA_BASE_VAR_SCOPE_GROUP, &mca_pml_ob1.use_all_rdma);
> > +
> >
> > Here is the OSU_BW and gdb output:
> >
> > # OSU MPI Bandwidth Test v3.1.1
> > # Size      Bandwidth (MB/s)
> > 1           2.19
> > 2           4.43
> > 4           8.98
> > 8           18.07
> > 16          35.58
> > 32          70.62
> > 64          108.88
> > 128         172.97
> > 256         305.73
> > 512         536.48
> > 1024        957.57
> > 2048        1587.21
> > 4096        1638.81
> > 8192        2165.14
> > 16384       2482.43
> > 32768       2866.33
> > 65536       3655.33
> > 131072      4208.40
> > 262144      4596.12
> > 524288      4769.27
> > 1048576     4900.00
> > [manage:16596] *** Process received signal ***
> > [manage:16596] Signal: Segmentation fault (11)
> > [manage:16596] Signal code: Address not mapped (1)
> > [manage:16596] Failing at address: 0x8
> > ...
> > Core was generated by `osu_bw'.
> > Program terminated with signal 11, Segmentation fault.
> > #0  0x0031d9008806 in ?? () from /lib64/libgcc_s.so.1
> > (gdb) where
> > #0  0x0031d9008806 in ?? () from /lib64/libgcc_s.so.1
> > #1  0x0031d9008934 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
> > #2  0x0037ab8e5ee8 in backtrace () from /lib64/libc.so.6
> > #3  0x2b5060c14345 in opal_backtrace_print () at ./backtrace_execinfo.c:47
> > #4  0x2b5060c11180 in show_stackframe () at ./stacktrace.c:331
> > #5  <signal handler called>
> > #6  mca_pml_ob1_recv_request_schedule_once () at ./pml_ob1_recvreq.c:983
> > #7  0x2aaab461c71a in mca_pml_ob1_recv_request_progress_rndv ()
> >    from /home/mishima/opt/mpi/openmpi-2.0.0-pgi16.5/lib/openmpi/mca_pml_ob1.so
> > #8  0x2aaab46198e5 in mca_pml_ob1_recv_frag_match () at ./pml_ob1_recvfrag.c:715
> > #9  0x2aaab4618e46 in mca_pml_ob1_recv_frag_callback_rndv () at ./pml_ob1_recvfrag.c:267
> > #10 0x2aaab37958d3 in mca_btl_vader_poll_handle_frag () at ./btl_vader_component.c:589
> > #11 0x2aaab3795b9a in mca_btl_vader_component_progress () at ./btl_vader_component.c:231
> > #12 0x2b5060bd16fc in opal_progress () at runtime/opal_progress.c:224
> > #13
Re: [OMPI devel] sm BTL performance of the openmpi-2.0.0
I understood. Thanks.

Tetsuya Mishima

On 2016/08/09 11:33:15, "devel" wrote in "Re: [OMPI devel] sm BTL performance of the openmpi-2.0.0":
> I will add a control to select either the new behavior of using all available RDMA btls or using just the eager ones for the RDMA protocol. The flags will remain as they are. And, yes, for 2.0.0 you can set the btl flags if you do not intend to use MPI RMA.
>
> New patch:
>
> https://github.com/hjelmn/ompi/commit/43267012e58d78e3fc713b98c6fb9f782de977c7.patch
>
> -Nathan
>
> > On Aug 8, 2016, at 8:16 PM, tmish...@jcity.maeda.co.jp wrote:
> >
> > Then, my understanding is that you will restore the default value of
> > btl_openib_flags to the previous one (= 310) and add a new MCA parameter to
> > control HCA inclusion for such a situation. The workaround so far for
> > openmpi-2.0.0 is setting those flags manually. Right?
> >
> > Tetsuya Mishima
> >
> > On 2016/08/09 9:56:29, "devel" wrote in "Re: [OMPI devel] sm BTL performance of the openmpi-2.0.0":
> >> Hmm, not good. So we have a situation where it is sometimes better to
> >> include the HCA when it is the only rdma btl. Will have a new version up in
> >> a bit that adds an MCA parameter to control the behavior. The default will
> >> be the same as 1.10.x.
> >>
> >> -Nathan
> >>
> >>> On Aug 8, 2016, at 4:51 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>
> >>> Hi, unfortunately it doesn't work well. The previous one was much
> >>> better ...
> >>>
> >>> [mishima@manage OMB-3.1.1-openmpi2.0.0]$ mpirun -np 2 -report-bindings osu_bw
> >>> [manage.cluster:25107] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
> >>> [manage.cluster:25107] MCW rank 1 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
> >>> # OSU MPI Bandwidth Test v3.1.1
> >>> # Size      Bandwidth (MB/s)
> >>> 1           2.22
> >>> 2           4.53
> >>> 4           9.11
> >>> 8           18.02
> >>> 16          35.44
> >>> 32          70.84
> >>> 64          113.71
> >>> 128         176.74
> >>> 256         311.07
> >>> 512         529.03
> >>> 1024        907.83
> >>> 2048        1597.66
> >>> 4096        330.14
> >>> 8192        516.49
> >>> 16384       780.31
> >>> 32768       1038.43
> >>> 65536       1186.36
> >>> 131072      1268.87
> >>> 262144      1222.24
> >>> 524288      1232.30
> >>> 1048576     1244.62
> >>> 2097152     1260.25
> >>> 4194304     1263.47
> >>>
> >>> Tetsuya
> >>>
> >>> On 2016/08/09 2:42:24, "devel" wrote in "Re: [OMPI devel] sm BTL performance of the openmpi-2.0.0":
> >>>> Ok, there was a problem with the selection logic when only one rdma
> >>>> capable btl is available. I changed the logic to always use the RDMA btl
> >>>> over pipelined send/recv. This works better for me on an Intel Omnipath
> >>>> system. Let me know if this works for you.
> >>>>
> >>>> https://github.com/hjelmn/ompi/commit/dddb865b5337213fd73d0e226b02e2f049cfab47.patch
> >>>>
> >>>> -Nathan
> >>>>
> >>>> On Aug 07, 2016, at 10:00 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>
> >>>> Hi, here is the gdb output for additional information:
> >>>>
> >>>> (It might be inexact, because I built openmpi-2.0.0 without the debug option)
> >>>>
> >>>> Core was generated by `osu_bw'.
> >>>> Program terminated with signal 11, Segmentation fault.
> >>>> #0  0x0031d9008806 in ?? () from /lib64/libgcc_s.so.1
> >>>> (gdb) where
> >>>> #0  0x0031d9008806 in ?? () from /lib64/libgcc_s.so.1
> >>>> #1  0x0031d9008934 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
> >>>> #2  0x0037ab8e5ee8 in backtrace () from /lib64/libc.so.6
> >>>> #3  0x2ad882bd4345 in opal_backtrace_print () at ./backtrace_execinfo.c:47
> >>>> #4  0x2ad882bd1180 in show_stackframe () at ./stacktrace.c:331
> >>>> #5  <signal handler called>
> >>>> #6  mca_pml_ob1_recv_request_schedule_once () at ./pml_ob1_recvreq.c:983
> >>>> #7  0x2aaab412f47a in mca_pml_ob1_recv_request_progress_rndv ()
> >>>>    from /home/mishima/opt/mpi/openmpi-2.0.0-pgi16.5/lib/openmpi/mca_pml_ob1.so
> >>>> #8  0x2aaab412c645 in mca_pml_ob1_recv_frag_match () at ./pml_ob1_recvfrag.c:715
> >>>> #9  0x2aaab412bba6 in mca_pml_ob1_recv_frag_callback_rndv () at ./pml_ob1_recvfrag.c:267
> >>>> #10 0x2f2748d3 in mca_btl_vader_poll_handle_frag () at ./btl_vader_component.c:589
> >>>> #11 0x2f274b9a in mca_btl_vader_component_progress () at
Re: [OMPI devel] sm BTL performance of the openmpi-2.0.0
Then, my understanding is that you will restore the default value of
btl_openib_flags to the previous one (= 310) and add a new MCA parameter to
control HCA inclusion for such a situation. The workaround so far for
openmpi-2.0.0 is setting those flags manually. Right?

Tetsuya Mishima

On 2016/08/09 9:56:29, "devel" wrote in "Re: [OMPI devel] sm BTL performance of the openmpi-2.0.0":
> Hmm, not good. So we have a situation where it is sometimes better to
> include the HCA when it is the only rdma btl. Will have a new version up in
> a bit that adds an MCA parameter to control the behavior. The default will
> be the same as 1.10.x.
>
> -Nathan
>
> > On Aug 8, 2016, at 4:51 PM, tmish...@jcity.maeda.co.jp wrote:
> >
> > Hi, unfortunately it doesn't work well. The previous one was much
> > better ...
> >
> > [mishima@manage OMB-3.1.1-openmpi2.0.0]$ mpirun -np 2 -report-bindings osu_bw
> > [manage.cluster:25107] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
> > [manage.cluster:25107] MCW rank 1 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
> > # OSU MPI Bandwidth Test v3.1.1
> > # Size      Bandwidth (MB/s)
> > 1           2.22
> > 2           4.53
> > 4           9.11
> > 8           18.02
> > 16          35.44
> > 32          70.84
> > 64          113.71
> > 128         176.74
> > 256         311.07
> > 512         529.03
> > 1024        907.83
> > 2048        1597.66
> > 4096        330.14
> > 8192        516.49
> > 16384       780.31
> > 32768       1038.43
> > 65536       1186.36
> > 131072      1268.87
> > 262144      1222.24
> > 524288      1232.30
> > 1048576     1244.62
> > 2097152     1260.25
> > 4194304     1263.47
> >
> > Tetsuya
> >
> > On 2016/08/09 2:42:24, "devel" wrote in "Re: [OMPI devel] sm BTL performance of the openmpi-2.0.0":
> >> Ok, there was a problem with the selection logic when only one rdma
> >> capable btl is available. I changed the logic to always use the RDMA btl
> >> over pipelined send/recv. This works better for me on an Intel Omnipath
> >> system. Let me know if this works for you.
> >>
> >> https://github.com/hjelmn/ompi/commit/dddb865b5337213fd73d0e226b02e2f049cfab47.patch
> >>
> >> -Nathan
> >>
> >> On Aug 07, 2016, at 10:00 PM, tmish...@jcity.maeda.co.jp wrote:
> >>
> >> Hi, here is the gdb output for additional information:
> >>
> >> (It might be inexact, because I built openmpi-2.0.0 without the debug option)
> >>
> >> Core was generated by `osu_bw'.
> >> Program terminated with signal 11, Segmentation fault.
> >> #0  0x0031d9008806 in ?? () from /lib64/libgcc_s.so.1
> >> (gdb) where
> >> #0  0x0031d9008806 in ?? () from /lib64/libgcc_s.so.1
> >> #1  0x0031d9008934 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
> >> #2  0x0037ab8e5ee8 in backtrace () from /lib64/libc.so.6
> >> #3  0x2ad882bd4345 in opal_backtrace_print () at ./backtrace_execinfo.c:47
> >> #4  0x2ad882bd1180 in show_stackframe () at ./stacktrace.c:331
> >> #5  <signal handler called>
> >> #6  mca_pml_ob1_recv_request_schedule_once () at ./pml_ob1_recvreq.c:983
> >> #7  0x2aaab412f47a in mca_pml_ob1_recv_request_progress_rndv ()
> >>    from /home/mishima/opt/mpi/openmpi-2.0.0-pgi16.5/lib/openmpi/mca_pml_ob1.so
> >> #8  0x2aaab412c645 in mca_pml_ob1_recv_frag_match () at ./pml_ob1_recvfrag.c:715
> >> #9  0x2aaab412bba6 in mca_pml_ob1_recv_frag_callback_rndv () at ./pml_ob1_recvfrag.c:267
> >> #10 0x2f2748d3 in mca_btl_vader_poll_handle_frag () at ./btl_vader_component.c:589
> >> #11 0x2f274b9a in mca_btl_vader_component_progress () at ./btl_vader_component.c:231
> >> #12 0x2ad882b916fc in opal_progress () at runtime/opal_progress.c:224
> >> #13 0x2ad8820a9aa5 in ompi_request_default_wait_all () at request/req_wait.c:77
> >> #14 0x2ad8820f10dd in PMPI_Waitall () at ./pwaitall.c:76
> >> #15 0x00401108 in main () at ./osu_bw.c:144
> >>
> >> Tetsuya
> >>
> >> On 2016/08/08 12:34:57, "devel" wrote in "Re: [OMPI devel] sm BTL performance of the openmpi-2.0.0":
> >> Hi, it caused a segfault as below:
> >> [manage.cluster:25436] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
> >> [manage.cluster:25436] MCW rank 1 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt
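For readers wanting to apply the interim workaround discussed above on a stock openmpi-2.0.0 install, one option is to put the flag into an MCA parameter file. This is only a sketch based on values quoted in this thread (310 is the pre-2.0.0 default mentioned here), and per Nathan's note it is only appropriate if you do not use MPI RMA; the temp path below is a stand-in for the real per-user file, $HOME/.openmpi/mca-params.conf:

```shell
# Write the workaround into an MCA params file (temp path used here for
# illustration; the real per-user file is $HOME/.openmpi/mca-params.conf).
conf="${TMPDIR:-/tmp}/mca-params.conf"
printf 'btl_openib_flags = 310\n' > "$conf"
grep btl_openib_flags "$conf"
# One-off equivalent on the command line:
#   mpirun --mca btl_openib_flags 310 -np 2 osu_bw
```

The file form has the advantage of applying to every mpirun invocation without editing job scripts.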
Re: [OMPI devel] sm BTL performance of the openmpi-2.0.0
Hi Gilles, I confirmed that vader is used when I don't specify any BTL, as you pointed out!

Regards,
Tetsuya Mishima

[mishima@manage OMB-3.1.1-openmpi2.0.0]$ mpirun -np 2 --mca btl_base_verbose 10 -bind-to core -report-bindings osu_bw
[manage.cluster:20006] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[manage.cluster:20006] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
[manage.cluster:20011] mca: base: components_register: registering framework btl components
[manage.cluster:20011] mca: base: components_register: found loaded component self
[manage.cluster:20011] mca: base: components_register: component self register function successful
[manage.cluster:20011] mca: base: components_register: found loaded component vader
[manage.cluster:20011] mca: base: components_register: component vader register function successful
[manage.cluster:20011] mca: base: components_register: found loaded component tcp
[manage.cluster:20011] mca: base: components_register: component tcp register function successful
[manage.cluster:20011] mca: base: components_register: found loaded component sm
[manage.cluster:20011] mca: base: components_register: component sm register function successful
[manage.cluster:20011] mca: base: components_register: found loaded component openib
[manage.cluster:20011] mca: base: components_register: component openib register function successful
[manage.cluster:20011] mca: base: components_open: opening btl components
[manage.cluster:20011] mca: base: components_open: found loaded component self
[manage.cluster:20011] mca: base: components_open: component self open function successful
[manage.cluster:20011] mca: base: components_open: found loaded component vader
[manage.cluster:20011] mca: base: components_open: component vader open function successful
[manage.cluster:20011] mca: base: components_open: found loaded component tcp
[manage.cluster:20011] mca: base: components_open: component tcp open function successful
[manage.cluster:20011] mca: base: components_open: found loaded component sm
[manage.cluster:20011] mca: base: components_open: component sm open function successful
[manage.cluster:20011] mca: base: components_open: found loaded component openib
[manage.cluster:20011] mca: base: components_open: component openib open function successful
[manage.cluster:20011] select: initializing btl component self
[manage.cluster:20011] select: init of component self returned success
[manage.cluster:20011] select: initializing btl component vader
[manage.cluster:20011] select: init of component vader returned success
[manage.cluster:20011] select: initializing btl component tcp
[manage.cluster:20011] select: init of component tcp returned success
[manage.cluster:20011] select: initializing btl component sm
[manage.cluster:20011] select: init of component sm returned success
[manage.cluster:20011] select: initializing btl component openib
[manage.cluster:20011] Checking distance from this process to device=mthca0
[manage.cluster:20011] hwloc_distances->nbobjs=2
[manage.cluster:20011] hwloc_distances->latency[0]=1.00
[manage.cluster:20011] hwloc_distances->latency[1]=1.60
[manage.cluster:20011] hwloc_distances->latency[2]=1.60
[manage.cluster:20011] hwloc_distances->latency[3]=1.00
[manage.cluster:20011] ibv_obj->type set to NULL
[manage.cluster:20011] Process is bound: distance to device is 0.00
[manage.cluster:20012] mca: base: components_register: registering framework btl components
[manage.cluster:20012] mca: base: components_register: found loaded component self
[manage.cluster:20012] mca: base: components_register: component self register function successful
[manage.cluster:20012] mca: base: components_register: found loaded component vader
[manage.cluster:20012] mca: base: components_register: component vader register function successful
[manage.cluster:20012] mca: base: components_register: found loaded component tcp
[manage.cluster:20012] mca: base: components_register: component tcp register function successful
[manage.cluster:20012] mca: base: components_register: found loaded component sm
[manage.cluster:20012] mca: base: components_register: component sm register function successful
[manage.cluster:20012] mca: base: components_register: found loaded component openib
[manage.cluster:20012] mca: base: components_register: component openib register function successful
[manage.cluster:20012] mca: base: components_open: opening btl components
[manage.cluster:20012] mca: base: components_open: found loaded component self
[manage.cluster:20012] mca: base: components_open: component self open function successful
[manage.cluster:20012] mca: base: components_open: found loaded component vader
[manage.cluster:20012] mca: base: components_open: component vader open function successful
[manage.cluster:20012] mca: base: components_open: found loaded component tcp
[manage.cluster:20012] mca: base: components_open: component tcp
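A side note on the hwloc_distances lines in the verbose output above: with nbobjs=2, the four latency values are a flattened 2x2 socket-to-socket relative-latency matrix, where element [i][j] is the relative latency from socket i to socket j (1.00 on the diagonal, 1.60 across sockets). A quick sketch of reshaping the logged values:

```shell
# Reshape the four hwloc relative-latency values from the log into the
# 2x2 matrix they represent (row i = latencies as seen from socket i).
printf '1.00 1.60 1.60 1.00\n' | awk '{ print $1, $2; print $3, $4 }'
```

This is what the openib component consults when deciding how close the process is to the HCA.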
Re: [OMPI devel] sm BTL performance of the openmpi-2.0.0
Hi, Thanks. I will try it and report later.

Tetsuya Mishima

On 2016/07/27 9:20:28, "devel" wrote in "Re: [OMPI devel] sm BTL performance of the openmpi-2.0.0":
> sm is deprecated in 2.0.0 and will likely be removed in favor of vader in 2.1.0.
>
> This issue is probably this known issue: https://github.com/open-mpi/ompi-release/pull/1250
>
> Please apply those commits and see if it fixes the issue for you.
>
> -Nathan
>
> > On Jul 26, 2016, at 6:17 PM, tmish...@jcity.maeda.co.jp wrote:
> >
> > Hi Gilles,
> >
> > Thanks. I ran again with --mca pml ob1 but I've got the same results as
> > below:
> >
> > [mishima@manage OMB-3.1.1-openmpi2.0.0]$ mpirun -np 2 -mca pml ob1 -bind-to core -report-bindings osu_bw
> > [manage.cluster:18142] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
> > [manage.cluster:18142] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
> > # OSU MPI Bandwidth Test v3.1.1
> > # Size      Bandwidth (MB/s)
> > 1           1.48
> > 2           3.07
> > 4           6.26
> > 8           12.53
> > 16          24.33
> > 32          49.03
> > 64          83.46
> > 128         132.60
> > 256         234.96
> > 512         420.86
> > 1024        842.37
> > 2048        1231.65
> > 4096        264.67
> > 8192        472.16
> > 16384       740.42
> > 32768       1030.39
> > 65536       1191.16
> > 131072      1269.45
> > 262144      1238.33
> > 524288      1247.97
> > 1048576     1257.96
> > 2097152     1274.74
> > 4194304     1280.94
> > [mishima@manage OMB-3.1.1-openmpi2.0.0]$ mpirun -np 2 -mca pml ob1 -mca btl self,sm -bind-to core -report-bindings osu_bw
> > [manage.cluster:18204] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
> > [manage.cluster:18204] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
> > # OSU MPI Bandwidth Test v3.1.1
> > # Size      Bandwidth (MB/s)
> > 1           0.52
> > 2           1.05
> > 4           2.08
> > 8           4.18
> > 16          8.21
> > 32          16.65
> > 64          32.60
> > 128         66.70
> > 256         132.45
> > 512         269.27
> > 1024        504.63
> > 2048        819.76
> > 4096        874.54
> > 8192        1447.11
> > 16384       2263.28
> > 32768       3236.85
> > 65536       3567.34
> > 131072      3555.17
> > 262144      3455.76
> > 524288      3441.80
> > 1048576     3505.30
> > 2097152     3534.01
> > 4194304     3546.94
> > [mishima@manage OMB-3.1.1-openmpi2.0.0]$ mpirun -np 2 -mca pml ob1 -mca btl self,sm,openib -bind-to core -report-bindings osu_bw
> > [manage.cluster:18218] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
> > [manage.cluster:18218] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
> > # OSU MPI Bandwidth Test v3.1.1
> > # Size      Bandwidth (MB/s)
> > 1           0.51
> > 2           1.03
> > 4           2.05
> > 8           4.07
> > 16          8.14
> > 32          16.32
> > 64          32.98
> > 128         63.70
> > 256         126.66
> > 512         252.61
> > 1024        480.22
> > 2048        810.54
> > 4096        290.61
> > 8192        512.49
> > 16384       764.60
> > 32768       1036.81
> > 65536       1182.81
> > 131072      1264.48
> > 262144      1235.82
> > 524288      1246.70
> > 1048576     1254.66
> > 2097152     1274.64
> > 4194304     1280.65
> > [mishima@manage OMB-3.1.1-openmpi2.0.0]$ mpirun -np 2 -mca pml ob1 -mca btl self,openib -bind-to core -report-bindings osu_bw
> > [manage.cluster:18276] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
> > [manage.cluster:18276] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
> > # OSU MPI Bandwidth Test v3.1.1
> > # Size      Bandwidth (MB/s)
> > 1           0.54
> > 2           1.08
> > 4           2.18
> > 8           4.33
> > 16          8.69
> > 32          17.39
> > 64          34.34
> > 128         66.28
> > 256         130.36
> > 512
Re: [OMPI devel] sm BTL performance of the openmpi-2.0.0
Hi Gilles,

Thanks. I ran again with --mca pml ob1 but I've got the same results as below:

[mishima@manage OMB-3.1.1-openmpi2.0.0]$ mpirun -np 2 -mca pml ob1 -bind-to core -report-bindings osu_bw
[manage.cluster:18142] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[manage.cluster:18142] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
# OSU MPI Bandwidth Test v3.1.1
# Size      Bandwidth (MB/s)
1           1.48
2           3.07
4           6.26
8           12.53
16          24.33
32          49.03
64          83.46
128         132.60
256         234.96
512         420.86
1024        842.37
2048        1231.65
4096        264.67
8192        472.16
16384       740.42
32768       1030.39
65536       1191.16
131072      1269.45
262144      1238.33
524288      1247.97
1048576     1257.96
2097152     1274.74
4194304     1280.94
[mishima@manage OMB-3.1.1-openmpi2.0.0]$ mpirun -np 2 -mca pml ob1 -mca btl self,sm -bind-to core -report-bindings osu_bw
[manage.cluster:18204] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[manage.cluster:18204] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
# OSU MPI Bandwidth Test v3.1.1
# Size      Bandwidth (MB/s)
1           0.52
2           1.05
4           2.08
8           4.18
16          8.21
32          16.65
64          32.60
128         66.70
256         132.45
512         269.27
1024        504.63
2048        819.76
4096        874.54
8192        1447.11
16384       2263.28
32768       3236.85
65536       3567.34
131072      3555.17
262144      3455.76
524288      3441.80
1048576     3505.30
2097152     3534.01
4194304     3546.94
[mishima@manage OMB-3.1.1-openmpi2.0.0]$ mpirun -np 2 -mca pml ob1 -mca btl self,sm,openib -bind-to core -report-bindings osu_bw
[manage.cluster:18218] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[manage.cluster:18218] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
# OSU MPI Bandwidth Test v3.1.1
# Size      Bandwidth (MB/s)
1           0.51
2           1.03
4           2.05
8           4.07
16          8.14
32          16.32
64          32.98
128         63.70
256         126.66
512         252.61
1024        480.22
2048        810.54
4096        290.61
8192        512.49
16384       764.60
32768       1036.81
65536       1182.81
131072      1264.48
262144      1235.82
524288      1246.70
1048576     1254.66
2097152     1274.64
4194304     1280.65
[mishima@manage OMB-3.1.1-openmpi2.0.0]$ mpirun -np 2 -mca pml ob1 -mca btl self,openib -bind-to core -report-bindings osu_bw
[manage.cluster:18276] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[manage.cluster:18276] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
# OSU MPI Bandwidth Test v3.1.1
# Size      Bandwidth (MB/s)
1           0.54
2           1.08
4           2.18
8           4.33
16          8.69
32          17.39
64          34.34
128         66.28
256         130.36
512         241.81
1024        429.86
2048        553.44
4096        707.14
8192        879.60
16384       763.02
32768       1042.89
65536       1185.45
131072      1267.56
262144      1227.41
524288      1244.61
1048576     1255.66
2097152     1273.55
4194304     1281.05

On 2016/07/27 9:02:49, "devel" wrote in "Re: [OMPI devel] sm BTL performance of the openmpi-2.0.0":
> Hi,
>
> can you please run again with
>
> --mca pml ob1
>
> if Open MPI was built with mxm support, pml/cm and mtl/mxm are used
> instead of pml/ob1 and btl/openib
>
> Cheers,
>
> Gilles
>
> On 7/27/2016 8:56 AM, tmish...@jcity.maeda.co.jp wrote:
> > Hi folks,
> >
> > I saw a performance degradation of openmpi-2.0.0 when I ran our application
> > on a node (12 cores). So I did 4 tests using osu_bw as below:
> >
> > 1: mpirun -np 2 osu_bw
[OMPI devel] sm BTL performance of the openmpi-2.0.0
Hi folks,

I saw a performance degradation of openmpi-2.0.0 when I ran our application
on a node (12 cores). So I did 4 tests using osu_bw as below:

1: mpirun -np 2 osu_bw                            bad  (30% of test 2)
2: mpirun -np 2 -mca btl self,sm osu_bw           good (same as openmpi-1.10.3)
3: mpirun -np 2 -mca btl self,sm,openib osu_bw    bad  (30% of test 2)
4: mpirun -np 2 -mca btl self,openib osu_bw       bad  (30% of test 2)

I guess the openib BTL was used in tests 1 and 3, because these results are
almost the same as test 4. I believe the sm BTL should be used even in tests
1 and 3, because its priority is higher than openib's. Unfortunately, at the
moment, I couldn't figure out the root cause, so I would appreciate it if
someone could look into it.

Regards,
Tetsuya Mishima

P.S. I attached these test results below.

[mishima@manage OMB-3.1.1-openmpi2.0.0]$ mpirun -np 2 -bind-to core -report-bindings osu_bw
[manage.cluster:13389] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[manage.cluster:13389] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
# OSU MPI Bandwidth Test v3.1.1
# Size      Bandwidth (MB/s)
1           1.49
2           3.04
4           6.13
8           12.23
16          25.01
32          49.96
64          87.07
128         138.87
256         245.97
512         423.30
1024        865.85
2048        1279.63
4096        264.79
8192        473.92
16384       739.27
32768       1030.49
65536       1190.21
131072      1270.77
262144      1238.74
524288      1245.97
1048576     1260.09
2097152     1274.53
4194304     1285.07
[mishima@manage OMB-3.1.1-openmpi2.0.0]$ mpirun -np 2 -mca btl self,sm -bind-to core -report-bindings osu_bw
[manage.cluster:13448] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[manage.cluster:13448] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
# OSU MPI Bandwidth Test v3.1.1
# Size      Bandwidth (MB/s)
1           0.51
2           1.01
4           2.03
8           4.08
16          7.92
32          16.16
64          32.53
128         64.30
256         128.19
512         256.48
1024        468.62
2048        785.29
4096        854.78
8192        1404.51
16384       2249.20
32768       3136.40
65536       3495.84
131072      3436.69
262144      3392.11
524288      3400.07
1048576     3460.60
2097152     3488.09
4194304     3498.45
[mishima@manage OMB-3.1.1-openmpi2.0.0]$ mpirun -np 2 -mca btl self,sm,openib -bind-to core -report-bindings osu_bw
[manage.cluster:13462] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[manage.cluster:13462] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
# OSU MPI Bandwidth Test v3.1.1
# Size      Bandwidth (MB/s)
1           0.54
2           1.09
4           2.18
8           4.37
16          8.75
32          17.37
64          34.67
128         66.66
256         132.55
512         261.52
1024        489.51
2048        818.38
4096        290.48
8192        511.64
16384       765.24
32768       1043.28
65536       1180.48
131072      1261.41
262144      1232.86
524288      1245.70
1048576     1245.69
2097152     1268.67
4194304     1281.33
[mishima@manage OMB-3.1.1-openmpi2.0.0]$ mpirun -np 2 -mca btl self,openib -bind-to core -report-bindings osu_bw
[manage.cluster:13521] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[manage.cluster:13521] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
# OSU MPI Bandwidth Test v3.1.1
# Size      Bandwidth (MB/s)
1           0.54
2           1.08
4           2.16
8           4.34
16          8.64
32          17.25
64          34.30
128         66.13
256         129.99
512         242.26
1024        429.24
2048        556.00
4096        706.80
8192        874.35
16384       762.60
32768       1039.61
65536
Re: [OMPI devel] v2.0.0rc4 is released
Hi Gilles san, thank you for your quick comment. I fully understand the
meaning of the warning. Regarding the question you raise, I'm afraid I'm
not sure which solution is better ...

Regards,
Tetsuya Mishima

On 2016/07/07 14:13:02, "devel" wrote in "Re: [OMPI devel] v2.0.0rc4 is released":
> This is a warning that can be safely ignored.
>
> That being said, this can be seen as a false positive (unless we fix
> flex or its generated output).
>
> Also, and generally speaking, this kind of warning is for developers
> only (e.g. end users can do nothing about it).
>
> That raises the question: what could/should we do?
>
> - master filters out these false positives, should we backport this to
> v2.x?
>
> - should we simply not check for common symbols when building from a
> tarball?
>
> Cheers,
>
> Gilles
>
> On 7/7/2016 2:03 PM, tmish...@jcity.maeda.co.jp wrote:
> > Hi Jeff, sorry for a very short report. I saw the warning below
> > at the end of installation of openmpi-2.0.0rc4. Is this okay?
> >
> > $ make install
> > ...
> > make install-exec-hook
> > make[3]: Entering directory `/home/mishima/mis/openmpi/openmpi-pgi16.5/openmpi-2.0.0rc4'
> > WARNING! Common symbols found:
> >     show_help_lex.o: 0004 C opal_show_help_yyleng
> >     show_help_lex.o: 0008 C opal_show_help_yytext
> >     hostfile_lex.o: 0004 C orte_util_hostfile_leng
> >     hostfile_lex.o: 0008 C orte_util_hostfile_text
> >     rmaps_rank_file_lex.o: 0004 C orte_rmaps_rank_file_leng
> >     rmaps_rank_file_lex.o: 0008 C orte_rmaps_rank_file_text
> > make[3]: [install-exec-hook] Error 1 (ignored)
> > make[3]: Leaving directory `/home/mishima/mis/openmpi/openmpi-pgi16.5/openmpi-2.0.0rc4'
> > make[2]: Nothing to be done for `install-data-am'.
> > make[2]: Leaving directory `/home/mishima/mis/openmpi/openmpi-pgi16.5/openmpi-2.0.0rc4'
> > make[1]: Leaving directory `/home/mishima/mis/openmpi/openmpi-pgi16.5/openmpi-2.0.0rc4'
> >
> > Regards,
> > Tetsuya Mishima
> >
> > On 2016/07/07 2:40:25, "devel" wrote in "[OMPI devel] v2.0.0rc4 is released":
> >> While crossing our fingers and doing mystical rain dances, we're hoping that 2.0.0rc4 is the last rc before v2.0.0 (final) is released. Please test!
> >>
> >> https://www.open-mpi.org/software/ompi/v2.x/
> >>
> >> Changes since rc3 (the list may look long, but most are quite small corner cases):
> >> - Lots of threading fixes
> >> - More fixes for the new memory patcher system
> >> - Updates to NEWS and README
> >> - Fixed some hcoll bugs
> >> - Updates for external PMIx support
> >> - PMIx direct launching fixes
> >> - libudev fixes
> >> - compatibility fixes with ibv_exp_*
> >> - 32 bit compatibility fixes
> >> - fix some powerpc issues
> >> - various OMPIO / libnbc fixes from Lisandro Dalcin
> >> - fix some Solaris configury patching
> >> - fix PSM/PSM2 active state detection
> >> - disable PSM/PSM2 signal hijacking by default
> >> - datatype fixes
> >> - portals4 fixes
> >> - change ofi MTL to only use a limited set of OFI providers by default
> >> - fix OSHMEM init error check
> >>
> >> --
> >> Jeff Squyres
> >> jsquy...@cisco.com
> >> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/07/19153.php
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: http://www.open-mpi.org/community/lists/devel/2016/07/19158.php
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/07/19159.php
Re: [OMPI devel] v2.0.0rc4 is released
Hi Jeff, sorry for a very short report. I saw the warning below
at the end of installation of openmpi-2.0.0rc4. Is this okay?

$ make install
...
make install-exec-hook
make[3]: Entering directory `/home/mishima/mis/openmpi/openmpi-pgi16.5/openmpi-2.0.0rc4'
WARNING! Common symbols found:
    show_help_lex.o: 0004 C opal_show_help_yyleng
    show_help_lex.o: 0008 C opal_show_help_yytext
    hostfile_lex.o: 0004 C orte_util_hostfile_leng
    hostfile_lex.o: 0008 C orte_util_hostfile_text
    rmaps_rank_file_lex.o: 0004 C orte_rmaps_rank_file_leng
    rmaps_rank_file_lex.o: 0008 C orte_rmaps_rank_file_text
make[3]: [install-exec-hook] Error 1 (ignored)
make[3]: Leaving directory `/home/mishima/mis/openmpi/openmpi-pgi16.5/openmpi-2.0.0rc4'
make[2]: Nothing to be done for `install-data-am'.
make[2]: Leaving directory `/home/mishima/mis/openmpi/openmpi-pgi16.5/openmpi-2.0.0rc4'
make[1]: Leaving directory `/home/mishima/mis/openmpi/openmpi-pgi16.5/openmpi-2.0.0rc4'

Regards,
Tetsuya Mishima

On 2016/07/07 2:40:25, "devel" wrote in "[OMPI devel] v2.0.0rc4 is released":
> While crossing our fingers and doing mystical rain dances, we're hoping that 2.0.0rc4 is the last rc before v2.0.0 (final) is released. Please test!
>
> https://www.open-mpi.org/software/ompi/v2.x/
>
> Changes since rc3 (the list may look long, but most are quite small corner cases):
>
> - Lots of threading fixes
> - More fixes for the new memory patcher system
> - Updates to NEWS and README
> - Fixed some hcoll bugs
> - Updates for external PMIx support
> - PMIx direct launching fixes
> - libudev fixes
> - compatibility fixes with ibv_exp_*
> - 32 bit compatibility fixes
> - fix some powerpc issues
> - various OMPIO / libnbc fixes from Lisandro Dalcin
> - fix some Solaris configury patching
> - fix PSM/PSM2 active state detection
> - disable PSM/PSM2 signal hijacking by default
> - datatype fixes
> - portals4 fixes
> - change ofi MTL to only use a limited set of OFI providers by default
> - fix OSHMEM init error check
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/07/19153.php
Re: [OMPI devel] binding output error
Hi Devendar, As far as I know, the report-bindings option shows the logical CPU order. On the other hand, you are talking about the physical one, I guess. Regards, Tetsuya Mishima On 2015/04/21 9:04:37, "devel" wrote in "Re: [OMPI devel] binding output error": > HT is not enabled. All nodes are the same topo. This is reproducible even on a single node. > > > > I ran osu_latency to see whether it really is mapped to the other socket with --map-by socket. It looks like the mapping is correct as per the latency test. > > > > $mpirun -np 2 -report-bindings -map-by socket /hpc/local/benchmarks/hpc-stack-icc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4.1/osu_latency > > [clx-orion-001:10084] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././.][./././././././././././././.] > > [clx-orion-001:10084] MCW rank 1 bound to socket 1[core 14[hwt 0]]: [./././././././././././././.][B/././././././././././././.] > > # OSU MPI Latency Test v4.4.1 > > # Size Latency (us) > > 0 0.50 > > 1 0.50 > > 2 0.50 > > 4 0.49 > > > > > > $mpirun -np 2 -report-bindings -cpu-set 1,7 /hpc/local/benchmarks/hpc-stack-icc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4.1/osu_latency > > [clx-orion-001:10155] MCW rank 0 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././.][./././././././././././././.] > > [clx-orion-001:10155] MCW rank 1 bound to socket 0[core 7[hwt 0]]: [./././././././B/./././././.][./././././././././././././.]
> > # OSU MPI Latency Test v4.4.1 > > # Size Latency (us) > > 0 0.23 > > 1 0.24 > > 2 0.23 > > 4 0.22 > > 8 0.23 > > > > Both hwloc and /proc/cpuinfo indicates following cpu numbering > > socket 0 cpus: 0 1 2 3 4 5 6 14 15 16 17 18 19 20 > > socket 1 cpus: 7 8 9 10 11 12 13 21 22 23 24 25 26 27 > > > > $hwloc-info -f > > Machine (256GB) > > NUMANode L#0 (P#0 128GB) + Socket L#0 + L3 L#0 (35MB) > > L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0) > > L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1) > > L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2) > > L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3) > > L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4) > > L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5) > > L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6) > > L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#14) > > L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#15) > > L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#16) > > L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#17) > > L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#18) > > L2 L#12 (256KB) + L1 L#12 (32KB) + Core L#12 + PU L#12 (P#19) > > L2 L#13 (256KB) + L1 L#13 (32KB) + Core L#13 + PU L#13 (P#20) > > NUMANode L#1 (P#1 128GB) + Socket L#1 + L3 L#1 (35MB) > > L2 L#14 (256KB) + L1 L#14 (32KB) + Core L#14 + PU L#14 (P#7) > > L2 L#15 (256KB) + L1 L#15 (32KB) + Core L#15 + PU L#15 (P#8) > > L2 L#16 (256KB) + L1 L#16 (32KB) + Core L#16 + PU L#16 (P#9) > > L2 L#17 (256KB) + L1 L#17 (32KB) + Core L#17 + PU L#17 (P#10) > > L2 L#18 (256KB) + L1 L#18 (32KB) + Core L#18 + PU L#18 (P#11) > > L2 L#19 (256KB) + L1 L#19 (32KB) + Core L#19 + PU L#19 (P#12) > > L2 L#20 (256KB) + L1 L#20 (32KB) + Core L#20 + PU L#20 (P#13) > > L2 L#21 (256KB) + L1 L#21 (32KB) + Core L#21 + PU L#21 (P#21) > > L2 L#22 (256KB) + L1 L#22 (32KB) + Core L#22 + PU L#22 (P#22) > > L2 L#23 (256KB) + L1 L#23 (32KB) + Core L#23 + PU L#23 (P#23) > > L2 
L#24 (256KB) + L1 L#24 (32KB) + Core L#24 + PU L#24 (P#24) > > L2 L#25 (256KB) + L1 L#25 (32KB) + Core L#25 + PU L#25 (P#25) > > L2 L#26 (256KB) + L1 L#26 (32KB) + Core L#26 + PU L#26 (P#26) > > L2 L#27 (256KB) + L1 L#27 (32KB) + Core L#27 + PU L#27 (P#27) > > > > > > So, does --report-bindings show one more level of logical CPU numbering? > > > > > > -Devendar > > > > > > From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain > Sent: Monday, April 20, 2015 3:52 PM > To: Open MPI Developers > Subject: Re: [OMPI devel] binding output error > > > > Also, was this with HT's enabled? I'm wondering if the print code is incorrectly computing the core because it isn't correctly accounting for HT cpus. > > > > > > On Mon, Apr 20, 2015 at 3:49 PM, Jeff Squyres (jsquyres) wrote: > > Ralph's the authority on this one, but just to be sure: are all nodes the same topology? E.g., does adding "--hetero-nodes" to the mpirun command line fix the problem? > > > > > On Apr 20, 2015, at 9:29 AM, Elena Elkina wrote: > > > > Hi guys, > > > > I ran into an issue on our cluster related to mapping & binding policies on 1.8.5. > > > > The matter is that
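The logical-vs-physical confusion in this thread can be made concrete with a small lookup built from the hwloc output quoted above (an illustrative helper, not an Open MPI or hwloc tool; the numbers are exactly the L#/P# pairs from the posted topology):

```shell
# Logical core L# -> physical PU P#, taken from the hwloc listing above.
l2p="0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:14 8:15 9:16 10:17 11:18 12:19 13:20 \
14:7 15:8 16:9 17:10 18:11 19:12 20:13 21:21 22:22 23:23 24:24 25:25 26:26 27:27"

logical_to_physical() {
    for pair in $l2p; do
        case $pair in
            "$1":*) echo "${pair#*:}"; return ;;
        esac
    done
}

# --report-bindings prints logical indices; on this topology, logical
# core 7 is physical CPU 14, while logical core 14 (the first core of
# socket 1) is physical CPU 7 -- easy to misread one for the other.
logical_to_physical 7
logical_to_physical 14
```

This is why `-cpu-set 1,7` and the `-report-bindings` output in the quoted run appear to disagree about sockets: one side of the comparison is using physical OS indices and the other logical hwloc indices.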
Re: [OMPI devel] oshmem-openmpi-1.8.2 causes compile error with -i8 (64-bit Fortran integer) configuration
Gilles, Your patch looks good to me and I think this issue should be fixed in the upcoming openmpi-1.8.3. Could you commit it to the trunk and create a CMR for it? Tetsuya > Mishima-san, > > the root cause is macro expansion does not always occur as one would > have expected ... > > could you please give a try to the attached patch ? > > it compiles (at least with gcc) and i made zero tests so far > > Cheers, > > Gilles > > On 2014/09/01 10:44, tmish...@jcity.maeda.co.jp wrote: > > Hi folks, > > > > I tried to build openmpi-1.8.2 with PGI fortran and -i8(64bit fortran int) > > option > > as shown below: > > > > ./configure \ > > --prefix=/home/mishima/opt/mpi/openmpi-1.8.2-pgi14.7_int64 \ > > --enable-abi-breaking-fortran-status-i8-fix \ > > --with-tm \ > > --with-verbs \ > > --disable-ipv6 \ > > CC=pgcc CFLAGS="-tp k8-64e -fast" \ > > CXX=pgCC CXXFLAGS="-tp k8-64e -fast" \ > > F77=pgfortran FFLAGS="-i8 -tp k8-64e -fast" \ > > FC=pgfortran FCFLAGS="-i8 -tp k8-64e -fast" > > > > Then I saw this compile error in making oshmem at the last stage: > > > > if test ! -r pshmem_real8_swap_f.c ; then \ > > pname=`echo pshmem_real8_swap_f.c | cut -b '2-'` ; \ > > ln -s ../../../../oshmem/shmem/fortran/$pname > > pshmem_real8_swap_f.c ; \ > > fi > > CC pshmem_real8_swap_f.lo > > if test ! 
-r pshmem_int4_cswap_f.c ; then \ > > pname=`echo pshmem_int4_cswap_f.c | cut -b '2-'` ; \ > > ln -s ../../../../oshmem/shmem/fortran/$pname > > pshmem_int4_cswap_f.c ; \ > > fi > > CC pshmem_int4_cswap_f.lo > > PGC-S-0058-Illegal lvalue (pshmem_int4_cswap_f.c: 39) > > PGC/x86-64 Linux 14.7-0: compilation completed with severe errors > > make[3]: *** [pshmem_int4_cswap_f.lo] Error 1 > > make[3]: Leaving directory > > `/home/mishima/mis/openmpi/openmpi-pgi14.7/int64/openmpi-1.8.2/oshmem/shmem/fortran/profile' > > make[2]: *** [all-recursive] Error 1 > > make[2]: Leaving directory > > `/home/mishima/mis/openmpi/openmpi-pgi14.7/int64/openmpi-1.8.2/oshmem/shmem/fortran' > > make[1]: *** [all-recursive] Error 1 > > make[1]: Leaving directory > > `/home/mishima/mis/openmpi/openmpi-pgi14.7/int64/openmpi-1.8.2/oshmem' > > make: *** [all-recursive] Error 1 > > > > I confirmed that it worked if I added configure option of --disable-oshmem. > > So, I hope that oshmem experts would fix this problem. > > > > (additional note) > > I switched to use gnu compiler and checked with this configuration, then > > I got the same error: > > > > ./configure \ > > --prefix=/home/mishima/opt/mpi/openmpi-1.8.2-gnu_int64 \ > > --enable-abi-breaking-fortran-status-i8-fix \ > > --disable-ipv6 \ > > F77=gfortran \ > > FC=gfortran \ > > CC=gcc \ > > CXX=g++ \ > > FFLAGS="-m64 -fdefault-integer-8" \ > > FCFLAGS="-m64 -fdefault-integer-8" \ > > CFLAGS=-m64 \ > > CXXFLAGS=-m64 > > > > make > > > > if test ! 
-r pshmem_int4_cswap_f.c ; then \ > > pname=`echo pshmem_int4_cswap_f.c | cut -b '2-'` ; \ > > ln -s ../../../../oshmem/shmem/fortran/$pname > > pshmem_int4_cswap_f.c ; \ > > fi > > CC pshmem_int4_cswap_f.lo > > pshmem_int4_cswap_f.c: In function 'shmem_int4_cswap_f': > > pshmem_int4_cswap_f.c:39: error: invalid lvalue in unary '&' > > make[3]: *** [pshmem_int4_cswap_f.lo] Error 1 > > > > Regards > > Tetsuya Mishima > > > > ___ > > devel mailing list > > de...@open-mpi.org > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/08/15764.php > > - oshmem.i8.patch___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel Searchable archives: http://www.open-mpi.org/community/lists/devel/2014/09/index.php
Re: [OMPI devel] oshmem-openmpi-1.8.2 causes compile error with -i8 (64-bit Fortran integer) configuration
Gilles, Thank you for your fix. I successfully compiled it with PGI, although I could not verify it with an actual test run. Tetsuya > Mishima-san, > > the root cause is macro expansion does not always occur as one would > have expected ... > > could you please give a try to the attached patch ? > > it compiles (at least with gcc) and i made zero tests so far > > Cheers, > > Gilles > > On 2014/09/01 10:44, tmish...@jcity.maeda.co.jp wrote: > > Hi folks, > > > > I tried to build openmpi-1.8.2 with PGI fortran and -i8(64bit fortran int) > > option > > as shown below: > > > > ./configure \ > > --prefix=/home/mishima/opt/mpi/openmpi-1.8.2-pgi14.7_int64 \ > > --enable-abi-breaking-fortran-status-i8-fix \ > > --with-tm \ > > --with-verbs \ > > --disable-ipv6 \ > > CC=pgcc CFLAGS="-tp k8-64e -fast" \ > > CXX=pgCC CXXFLAGS="-tp k8-64e -fast" \ > > F77=pgfortran FFLAGS="-i8 -tp k8-64e -fast" \ > > FC=pgfortran FCFLAGS="-i8 -tp k8-64e -fast" > > > > Then I saw this compile error in making oshmem at the last stage: > > > > if test ! -r pshmem_real8_swap_f.c ; then \ > > pname=`echo pshmem_real8_swap_f.c | cut -b '2-'` ; \ > > ln -s ../../../../oshmem/shmem/fortran/$pname > > pshmem_real8_swap_f.c ; \ > > fi > > CC pshmem_real8_swap_f.lo > > if test !
-r pshmem_int4_cswap_f.c ; then \ > > pname=`echo pshmem_int4_cswap_f.c | cut -b '2-'` ; \ > > ln -s ../../../../oshmem/shmem/fortran/$pname > > pshmem_int4_cswap_f.c ; \ > > fi > > CC pshmem_int4_cswap_f.lo > > PGC-S-0058-Illegal lvalue (pshmem_int4_cswap_f.c: 39) > > PGC/x86-64 Linux 14.7-0: compilation completed with severe errors > > make[3]: *** [pshmem_int4_cswap_f.lo] Error 1 > > make[3]: Leaving directory > > `/home/mishima/mis/openmpi/openmpi-pgi14.7/int64/openmpi-1.8.2/oshmem/shmem/fortran/profile' > > make[2]: *** [all-recursive] Error 1 > > make[2]: Leaving directory > > `/home/mishima/mis/openmpi/openmpi-pgi14.7/int64/openmpi-1.8.2/oshmem/shmem/fortran' > > make[1]: *** [all-recursive] Error 1 > > make[1]: Leaving directory > > `/home/mishima/mis/openmpi/openmpi-pgi14.7/int64/openmpi-1.8.2/oshmem' > > make: *** [all-recursive] Error 1 > > > > I confirmed that it worked if I added configure option of --disable-oshmem. > > So, I hope that oshmem experts would fix this problem. > > > > (additional note) > > I switched to use gnu compiler and checked with this configuration, then > > I got the same error: > > > > ./configure \ > > --prefix=/home/mishima/opt/mpi/openmpi-1.8.2-gnu_int64 \ > > --enable-abi-breaking-fortran-status-i8-fix \ > > --disable-ipv6 \ > > F77=gfortran \ > > FC=gfortran \ > > CC=gcc \ > > CXX=g++ \ > > FFLAGS="-m64 -fdefault-integer-8" \ > > FCFLAGS="-m64 -fdefault-integer-8" \ > > CFLAGS=-m64 \ > > CXXFLAGS=-m64 > > > > make > > > > if test ! 
-r pshmem_int4_cswap_f.c ; then \ > > pname=`echo pshmem_int4_cswap_f.c | cut -b '2-'` ; \ > > ln -s ../../../../oshmem/shmem/fortran/$pname > > pshmem_int4_cswap_f.c ; \ > > fi > > CC pshmem_int4_cswap_f.lo > > pshmem_int4_cswap_f.c: In function 'shmem_int4_cswap_f': > > pshmem_int4_cswap_f.c:39: error: invalid lvalue in unary '&' > > make[3]: *** [pshmem_int4_cswap_f.lo] Error 1 > > > > Regards > > Tetsuya Mishima > > > > ___ > > devel mailing list > > de...@open-mpi.org > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/08/15764.php > > - oshmem.i8.patch___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel Searchable archives: http://www.open-mpi.org/community/lists/devel/2014/09/index.php
[OMPI devel] oshmem-openmpi-1.8.2 causes compile error with -i8 (64-bit Fortran integer) configuration
Hi folks, I tried to build openmpi-1.8.2 with PGI Fortran and the -i8 (64-bit Fortran integer) option as shown below: ./configure \ --prefix=/home/mishima/opt/mpi/openmpi-1.8.2-pgi14.7_int64 \ --enable-abi-breaking-fortran-status-i8-fix \ --with-tm \ --with-verbs \ --disable-ipv6 \ CC=pgcc CFLAGS="-tp k8-64e -fast" \ CXX=pgCC CXXFLAGS="-tp k8-64e -fast" \ F77=pgfortran FFLAGS="-i8 -tp k8-64e -fast" \ FC=pgfortran FCFLAGS="-i8 -tp k8-64e -fast" Then I saw this compile error while building oshmem at the last stage: if test ! -r pshmem_real8_swap_f.c ; then \ pname=`echo pshmem_real8_swap_f.c | cut -b '2-'` ; \ ln -s ../../../../oshmem/shmem/fortran/$pname pshmem_real8_swap_f.c ; \ fi CC pshmem_real8_swap_f.lo if test ! -r pshmem_int4_cswap_f.c ; then \ pname=`echo pshmem_int4_cswap_f.c | cut -b '2-'` ; \ ln -s ../../../../oshmem/shmem/fortran/$pname pshmem_int4_cswap_f.c ; \ fi CC pshmem_int4_cswap_f.lo PGC-S-0058-Illegal lvalue (pshmem_int4_cswap_f.c: 39) PGC/x86-64 Linux 14.7-0: compilation completed with severe errors make[3]: *** [pshmem_int4_cswap_f.lo] Error 1 make[3]: Leaving directory `/home/mishima/mis/openmpi/openmpi-pgi14.7/int64/openmpi-1.8.2/oshmem/shmem/fortran/profile' make[2]: *** [all-recursive] Error 1 make[2]: Leaving directory `/home/mishima/mis/openmpi/openmpi-pgi14.7/int64/openmpi-1.8.2/oshmem/shmem/fortran' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/home/mishima/mis/openmpi/openmpi-pgi14.7/int64/openmpi-1.8.2/oshmem' make: *** [all-recursive] Error 1 I confirmed that it worked if I added the configure option --disable-oshmem. So, I hope the oshmem experts can fix this problem. 
(Additional note) I switched to the GNU compiler and checked with this configuration, and got the same error: ./configure \ --prefix=/home/mishima/opt/mpi/openmpi-1.8.2-gnu_int64 \ --enable-abi-breaking-fortran-status-i8-fix \ --disable-ipv6 \ F77=gfortran \ FC=gfortran \ CC=gcc \ CXX=g++ \ FFLAGS="-m64 -fdefault-integer-8" \ FCFLAGS="-m64 -fdefault-integer-8" \ CFLAGS=-m64 \ CXXFLAGS=-m64 make if test ! -r pshmem_int4_cswap_f.c ; then \ pname=`echo pshmem_int4_cswap_f.c | cut -b '2-'` ; \ ln -s ../../../../oshmem/shmem/fortran/$pname pshmem_int4_cswap_f.c ; \ fi CC pshmem_int4_cswap_f.lo pshmem_int4_cswap_f.c: In function 'shmem_int4_cswap_f': pshmem_int4_cswap_f.c:39: error: invalid lvalue in unary '&' make[3]: *** [pshmem_int4_cswap_f.lo] Error 1 Regards Tetsuya Mishima
Re: [OMPI devel] openmpi-1.8.2rc2 and f08 interface built with PGI-14.7 causes link error
Hi Ralph, I confirmed that the openib issue was really fixed by r32395 and hope you'll be able to release the final version soon. Tetsuya > Kewl - the openib issue has been fixed in the nightly tarball. I'm waiting for review of a couple of pending CMRs, then we'll release a quick rc4 and move to release the final version > > > On Aug 1, 2014, at 9:55 PM, tmish...@jcity.maeda.co.jp wrote: > > > > > > > I confirmed openmpi-1.8.2rc3 with PGI-14.7 worked fine for me > > except for the openib issue reported by Mike Dubman. > > > > Tetsuya Mishima > > > >> Sorry, finally got through all this ompi email and see this problem was > > fixed. > >> > >> -Original Message- > >> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Pritchard > > Jr., Howard > >> Sent: Friday, August 01, 2014 8:59 AM > >> To: Open MPI Developers > >> Subject: Re: [OMPI devel] openmpi-1.8.2rc2 and f08 interface built with > > PGI-14.7 causes link error > >> > >> Hi Jeff, > >> > >> Finally got info yesterday about where the newer PGI compilers are hiding > > out at LANL. > >> I'll check this out today. > >> > >> Howard > >> > >> > >> -Original Message- > >> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff Squyres > > (jsquyres) > >> Sent: Tuesday, July 29, 2014 5:24 PM > >> To: Open MPI Developers List > >> Subject: Re: [OMPI devel] openmpi-1.8.2rc2 and f08 interface built with > > PGI-14.7 causes link error > >> > >> Tetsuya -- > >> > >> I am unable to test with the PGI compiler -- I don't have a license. I > > was hoping that LANL would be able to test today, but I don't think they > > got to it. > >> > >> Can you send more details? > >> > >> E.g., can you send all the stuff listed on > > http://www.open-mpi.org/community/help/ for 1.8 and 1.8.2rc2 for the 14.7 > > compiler? 
> >> > >> I'm *guessing* that we've done something new in the changes since 1.8 > > that PGI doesn't support, and we need to disable that something (hopefully > > while not needing to disable the entire mpi_f08 > >> bindings...). > >> > >> > >> > >> On Jul 28, 2014, at 11:43 PM, tmish...@jcity.maeda.co.jp wrote: > >> > >>> > >>> Hi folks, > >>> > >>> I tried to build openmpi-1.8.2rc2 with PGI-14.7 and execute a sample > >>> program. Then, it causes linking error: > >>> > >>> [mishima@manage work]$ cat test.f > >>> program hello_world > >>> use mpi_f08 > >>> implicit none > >>> > >>> type(MPI_Comm) :: comm > >>> integer :: myid, npes, ierror > >>> integer :: name_length > >>> character(len=MPI_MAX_PROCESSOR_NAME) :: processor_name > >>> > >>> call mpi_init(ierror) > >>> comm = MPI_COMM_WORLD > >>> call MPI_Comm_rank(comm, myid, ierror) > >>> call MPI_Comm_size(comm, npes, ierror) > >>> call MPI_Get_processor_name(processor_name, name_length, ierror) > >>> write (*,'(A,X,I4,X,A,X,I4,X,A,X,A)') > >>>+"Process", myid, "of", npes, "is on", trim(processor_name) > >>> call MPI_Finalize(ierror) > >>> > >>> end program hello_world > >>> > >>> [mishima@manage work]$ mpif90 test.f -o test.ex > >>> /tmp/pgfortran65ZcUeoncoqT.o: In function `.C1_283': > >>> test.f:(.data+0x6c): undefined reference to > > `mpi_f08_interfaces_callbacks_' > >>> test.f:(.data+0x74): undefined reference to `mpi_f08_interfaces_' > >>> test.f:(.data+0x7c): undefined reference to `pmpi_f08_interfaces_' > >>> test.f:(.data+0x84): undefined reference to `mpi_f08_sizeof_' > >>> > >>> So, I did some more tests with previous version of PGI and > >>> openmpi-1.8. 
The results are summarized as follows: > >>> > >>> PGI13.10 PGI14.7 > >>> openmpi-1.8 OK OK > >>> openmpi-1.8.2rc2 configure sets use_f08_mpi:no link error > >>> > >>> Regards, > >>> Tetsuya Mishima > >>> > >>> ___ > >>> devel mailing list > >>> de...@open-mpi.org > >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > >>> Link to this post: > >>> http://www.open-mpi.org/community/lists/devel/2014/07/15303.php > >> > >> > >> -- > >> Jeff Squyres > >> jsquy...@cisco.com > >> For corporate legal information go to: > > http://www.cisco.com/web/about/doing_business/legal/cri/ > >> > >> ___ > >> devel mailing list > >> de...@open-mpi.org > >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > >> Link to this post: > > http://www.open-mpi.org/community/lists/devel/2014/07/15335.php > >> ___ > >> devel mailing list > >> de...@open-mpi.org > >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > >> Link to this post: > > http://www.open-mpi.org/community/lists/devel/2014/08/15452.php > >> ___ > >> devel mailing list > >> de...@open-mpi.org > >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > >> Link to this post: > >
Re: [OMPI devel] openmpi-1.8.2rc2 and f08 interface built with PGI-14.7 causes link error
I confirmed openmpi-1.8.2rc3 with PGI-14.7 worked fine for me except for the openib issue reported by Mike Dubman. Tetsuya Mishima > Sorry, finally got through all this ompi email and see this problem was fixed. > > -Original Message- > From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Pritchard Jr., Howard > Sent: Friday, August 01, 2014 8:59 AM > To: Open MPI Developers > Subject: Re: [OMPI devel] openmpi-1.8.2rc2 and f08 interface built with PGI-14.7 causes link error > > Hi Jeff, > > Finally got info yesterday about where the newer PGI compilers are hiding out at LANL. > I'll check this out today. > > Howard > > > -Original Message- > From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff Squyres (jsquyres) > Sent: Tuesday, July 29, 2014 5:24 PM > To: Open MPI Developers List > Subject: Re: [OMPI devel] openmpi-1.8.2rc2 and f08 interface built with PGI-14.7 causes link error > > Tetsuya -- > > I am unable to test with the PGI compiler -- I don't have a license. I was hoping that LANL would be able to test today, but I don't think they got to it. > > Can you send more details? > > E.g., can you send all the stuff listed on http://www.open-mpi.org/community/help/ for 1.8 and 1.8.2rc2 for the 14.7 compiler? > > I'm *guessing* that we've done something new in the changes since 1.8 that PGI doesn't support, and we need to disable that something (hopefully while not needing to disable the entire mpi_f08 > bindings...). > > > > On Jul 28, 2014, at 11:43 PM, tmish...@jcity.maeda.co.jp wrote: > > > > > Hi folks, > > > > I tried to build openmpi-1.8.2rc2 with PGI-14.7 and execute a sample > > program. 
Then, it causes linking error: > > > > [mishima@manage work]$ cat test.f > > program hello_world > > use mpi_f08 > > implicit none > > > > type(MPI_Comm) :: comm > > integer :: myid, npes, ierror > > integer :: name_length > > character(len=MPI_MAX_PROCESSOR_NAME) :: processor_name > > > > call mpi_init(ierror) > > comm = MPI_COMM_WORLD > > call MPI_Comm_rank(comm, myid, ierror) > > call MPI_Comm_size(comm, npes, ierror) > > call MPI_Get_processor_name(processor_name, name_length, ierror) > > write (*,'(A,X,I4,X,A,X,I4,X,A,X,A)') > > +"Process", myid, "of", npes, "is on", trim(processor_name) > > call MPI_Finalize(ierror) > > > > end program hello_world > > > > [mishima@manage work]$ mpif90 test.f -o test.ex > > /tmp/pgfortran65ZcUeoncoqT.o: In function `.C1_283': > > test.f:(.data+0x6c): undefined reference to `mpi_f08_interfaces_callbacks_' > > test.f:(.data+0x74): undefined reference to `mpi_f08_interfaces_' > > test.f:(.data+0x7c): undefined reference to `pmpi_f08_interfaces_' > > test.f:(.data+0x84): undefined reference to `mpi_f08_sizeof_' > > > > So, I did some more tests with previous version of PGI and > > openmpi-1.8. 
The results are summarized as follows: > > > > PGI13.10 PGI14.7 > > openmpi-1.8 OK OK > > openmpi-1.8.2rc2 configure sets use_f08_mpi:no link error > > > > Regards, > > Tetsuya Mishima > > > > ___ > > devel mailing list > > de...@open-mpi.org > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > > Link to this post: > > http://www.open-mpi.org/community/lists/devel/2014/07/15303.php > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15335.php > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/08/15452.php > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/08/15455.php
Re: [OMPI devel] openmpi-1.8.2rc2 and f08 interface built with PGI-14.7 causes link error
Hi Paul, Thank you for your investigation. I'm sure you're very close to fixing the problem, although I can't do it myself. So I must owe you something... Please try Awamori, Okinawa's sake, which is very good on such a hot day. Tetsuya > On Wed, Jul 30, 2014 at 8:53 PM, Paul Hargrove wrote: > [...] > I have a clear answer to *what* is different (below) and am next looking into the why/how now. > It seems that 1.8.1 has included all dependencies into libmpi_usempif08 while 1.8.2rc2 does not. > [...] > > The difference appears to stem from the following difference in ompi/mpi/fortran/use-mpi-f08/Makefile.am: > > 1.8.1: > libmpi_usempif08_la_LIBADD = \ > $(module_sentinel_file) \ > $(OMPI_MPIEXT_USEMPIF08_LIBS) \ > $(top_builddir)/ompi/libmpi.la > > 1.8.2rc2: > libmpi_usempif08_la_LIBADD = \ > $(OMPI_MPIEXT_USEMPIF08_LIBS) \ > $(top_builddir)/ompi/libmpi.la > libmpi_usempif08_la_DEPENDENCIES = $(module_sentinel_file) > > Where in both cases one has: > > module_sentinel_file = \ > libforce_usempif08_internal_modules_to_be_built.la > > which contains all of the symbols which my previous testing found had "disappeared" from libmpi_usempif08.so between 1.8.1 and 1.8.2rc2. > > I don't have recent enough autotools to attempt the change to the Makefile.am, but instead restored the removed item from libmpi_usempif08_la_LIBADD directly in Makefile.in. However, rather than fixing > the problem, that resulted in multiple definitions of a bunch of _eq and _ne functions (e.g. mpi_f08_types_ompi_request_op_ne_). So, I am uncertain how to proceed. > > Using svn blame points at a "bulk" CMR of many fortran related changes, including one related to the eq/ne operators. So, I am turning over this investigation to Jeff and/or Ralph to figure out what > actually is required to fix this without loss of whatever benefits were in that CMR. I am still available to test the proposed fixes. Happy hunting... > > Somebody owes me a virtual beer (or nihonshu) ;-) > -Paul > > > -- > > Paul H. 
Hargrove phhargr...@lbl.gov > Future Technologies Group > Computer and Data Sciences Department Tel: +1-510-495-2352 > Lawrence Berkeley National Laboratory Fax: +1-510-486-6900 ___ > devel mailing list > de...@open-mpi.org > Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15387.php
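Paul's Makefile.am diff above hinges on an automake distinction: entries in `*_LIBADD` are linked *into* the library, so their symbols end up inside it, while `*_DEPENDENCIES` only imposes build ordering. A minimal sketch of the two variables (illustrative names, not the actual ompi Makefile.am):

```makefile
# LIBADD: libhelper.la's objects and symbols become part of libfoo.la.
libfoo_la_LIBADD = libhelper.la

# DEPENDENCIES: libhelper.la must be built before libfoo.la, but its
# symbols are NOT pulled into libfoo.la.
libfoo_la_DEPENDENCIES = libhelper.la
```

That matches the symptom Paul describes: moving the module sentinel library from LIBADD to DEPENDENCIES between 1.8.1 and 1.8.2rc2 would leave its symbols out of libmpi_usempif08.so while still building it first.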
Re: [OMPI devel] openmpi-1.8.2rc2 and f08 interface built with PGI-14.7 causes link error
Paul and Jeff, I additionally installed PGI14.4 and checked the behavior. Then, I confirmed that both versions produce the same results. PGI14.7: [mishima@manage work]$ mpif90 test.f -o test.ex --showme pgfortran test.f -o test.ex -I/home/mishima/opt/mpi/openmpi-1.8.2rc2-pgi14.7/include -I/home/mishima/opt/mpi/openmpi-1.8.2rc2-pgi14.7/lib -Wl,-rpath -Wl,/home/mishima/opt/mpi/openmpi-1.8.2rc2-pgi14.7/lib -L/home/mishima/opt/mpi/openmpi-1.8.2rc2-pgi14.7/lib -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi [mishima@manage work]$ mpif90 test.f -o test.ex /tmp/pgfortranD-vdxk_lnPL3.o: In function `.C1_283': test.f:(.data+0x6c): undefined reference to `mpi_f08_interfaces_callbacks_' test.f:(.data+0x74): undefined reference to `mpi_f08_interfaces_' test.f:(.data+0x7c): undefined reference to `pmpi_f08_interfaces_' test.f:(.data+0x84): undefined reference to `mpi_f08_sizeof_' PGI14.4: [mishima@manage work]$ mpif90 test.f -o test.ex --showme pgfortran test.f -o test.ex -I/home/mishima/opt/mpi/openmpi-1.8.2rc2-pgi14.4/include -I/home/mishima/opt/mpi/openmpi-1.8.2rc2-pgi14.4/lib -Wl,-rpath -Wl,/home/mishima/opt/mpi/openmpi-1.8.2rc2-pgi14.4/lib -L/home/mishima/opt/mpi/openmpi-1.8.2rc2-pgi14.4/lib -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi [mishima@manage work]$ mpif90 test.f -o test.ex /tmp/pgfortranm9sdKiZYkrMy.o: In function `.C1_283': test.f:(.data+0x6c): undefined reference to `mpi_f08_interfaces_callbacks_' test.f:(.data+0x74): undefined reference to `mpi_f08_interfaces_' test.f:(.data+0x7c): undefined reference to `pmpi_f08_interfaces_' test.f:(.data+0x84): undefined reference to `mpi_f08_sizeof_' As I reported before, mpi_f08*.mod is created in $prefix/lib. 
[mishima@manage openmpi-1.8.2rc2-pgi14.7]$ ll lib/mpi_f08* -rwxr-xr-x 1 mishima mishima 327 Jul 30 12:27 lib/mpi_f08_ext.mod -rwxr-xr-x 1 mishima mishima 11716 Jul 30 12:27 lib/mpi_f08_interfaces_callbacks.mod -rwxr-xr-x 1 mishima mishima 374813 Jul 30 12:27 lib/mpi_f08_interfaces.mod -rwxr-xr-x 1 mishima mishima 715615 Jul 30 12:27 lib/mpi_f08.mod -rwxr-xr-x 1 mishima mishima 14730 Jul 30 12:27 lib/mpi_f08_sizeof.mod -rwxr-xr-x 1 mishima mishima 77141 Jul 30 12:27 lib/mpi_f08_types.mod The strange thing is that openmpi-1.8 with PGI14.7 works fine. What's the difference between openmpi-1.8 and openmpi-1.8.2rc2? [mishima@manage work]$ mpif90 test.f -o test.ex --showme pgfortran test.f -o test.ex -I/home/mishima/opt/mpi/openmpi-1.8-pgi14.7/include -I/home/mishima/opt/mpi/openmpi-1.8-pgi14.7/lib -Wl,-rpath -Wl,/home/mishima/opt/mpi/openmpi-1.8-pgi14.7/lib -L/home/mishima/opt/mpi/openmpi-1.8-pgi14.7/lib -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi [mishima@manage work]$ mpif90 test.f -o test.ex [mishima@manage work]$ Tetsuya > On Jul 28, 2014, at 11:43 PM, tmish...@jcity.maeda.co.jp wrote: > > > [mishima@manage work]$ mpif90 test.f -o test.ex > > /tmp/pgfortran65ZcUeoncoqT.o: In function `.C1_283': > > test.f:(.data+0x6c): undefined reference to `mpi_f08_interfaces_callbacks_' > > test.f:(.data+0x74): undefined reference to `mpi_f08_interfaces_' > > test.f:(.data+0x7c): undefined reference to `pmpi_f08_interfaces_' > > test.f:(.data+0x84): undefined reference to `mpi_f08_sizeof_' > > Just to go back to the original post here: can you send the results of > > mpifort test.f -o test.ex --showme > > I'd like to see what fortran libraries are being linked in. 
Here's what I get when I compile OMPI with the Intel suite: > > - > $ mpifort hello_usempif08.f90 -o hello --showme > ifort hello_usempif08.f90 -o hello -I/home/jsquyres/bogus/include -I/home/jsquyres/bogus/lib -Wl,-rpath -Wl,/home/jsquyres/bogus/lib -Wl,--enable-new-dtags -L/home/jsquyres/bogus/lib -lmpi_usempif08 > -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi > > > I note that with the Intel compiler, the Fortran module files are created in the lib directory (i.e., $prefix/lib), which is -L'ed on the link line. Does the PGI compiler require something > different? Does the PGI 14 compiler make an additional library for modules that we need to link in? > > We didn't use CONTAINS, and it supposedly works fine with the mpi module (right, guys?), so I'm not sure would the same scheme wouldn't work for the mpi_f08 module...? > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15377.php
Re: [OMPI devel] openmpi-1.8.2rc2 and f08 interface built with PGI-14.7 causes link error
Hi Paul, thank you for your comment. I don't think my mpi_f08.mod is an older one, because the time stamp matches the time I rebuilt them today. [mishima@manage openmpi-1.8.2rc2-pgi14.7]$ ll lib/mpi* -rwxr-xr-x 1 mishima mishima 315 Jul 30 12:27 lib/mpi_ext.mod -rwxr-xr-x 1 mishima mishima 327 Jul 30 12:27 lib/mpi_f08_ext.mod -rwxr-xr-x 1 mishima mishima 11716 Jul 30 12:27 lib/mpi_f08_interfaces_callbacks.mod -rwxr-xr-x 1 mishima mishima 374813 Jul 30 12:27 lib/mpi_f08_interfaces.mod -rwxr-xr-x 1 mishima mishima 715615 Jul 30 12:27 lib/mpi_f08.mod -rwxr-xr-x 1 mishima mishima 14730 Jul 30 12:27 lib/mpi_f08_sizeof.mod -rwxr-xr-x 1 mishima mishima 77141 Jul 30 12:27 lib/mpi_f08_types.mod -rwxr-xr-x 1 mishima mishima 878339 Jul 30 12:27 lib/mpi.mod Regards, Tetsuya > On Tue, Jul 29, 2014 at 6:38 PM, Paul Hargrove wrote: > > On Tue, Jul 29, 2014 at 6:33 PM, Paul Hargrove wrote: > I am trying again with an explicit --enable-mpi-fortran=usempi at configure time to see what happens. > > Of course that should have said --enable-mpi-fortran=usempif08 > > I've switched to using PGI13.6 for my testing. > I find that even when I pass that flag I see that use_mpi_f08 is NOT enabled: > > checking Fortran compiler ignore TKR syntax... not cached; checking variants > checking for Fortran compiler support of TYPE(*), DIMENSION(*)... no > checking for Fortran compiler support of !DEC$ ATTRIBUTES NO_ARG_CHECK... no > checking for Fortran compiler support of !$PRAGMA IGNORE_TKR... no > checking for Fortran compiler support of !DIR$ IGNORE_TKR... yes > checking Fortran compiler ignore TKR syntax... 1:real, dimension(*):!DIR$ IGNORE_TKR > checking if Fortran compiler supports ISO_C_BINDING... yes > checking if building Fortran 'use mpi' bindings... yes > checking if Fortran compiler supports SUBROUTINE BIND(C)... yes > checking if Fortran compiler supports TYPE, BIND(C)... yes > checking if Fortran compiler supports TYPE(type), BIND(C, NAME="name")... 
yes > checking if Fortran compiler supports PROCEDURE... no > checking if building Fortran 'use mpi_f08' bindings... no > > Contrast that to openmpi-1.8.1 and the same compiler: > > checking Fortran compiler ignore TKR syntax... not cached; checking variants > checking for Fortran compiler support of TYPE(*), DIMENSION(*)... no > checking for Fortran compiler support of !DEC$ ATTRIBUTES NO_ARG_CHECK... no > checking for Fortran compiler support of !$PRAGMA IGNORE_TKR... no > checking for Fortran compiler support of !DIR$ IGNORE_TKR... yes > checking Fortran compiler ignore TKR syntax... 1:real, dimension(*):!DIR$ IGNORE_TKR > checking if building Fortran 'use mpi' bindings... yes > checking if Fortran compiler supports ISO_C_BINDING... yes > checking if Fortran compiler supports SUBROUTINE BIND(C)... yes > checking if Fortran compiler supports TYPE, BIND(C)... yes > checking if Fortran compiler supports TYPE(type), BIND(C, NAME="name")... yes > checking if Fortran compiler supports optional arguments... yes > checking if Fortran compiler supports PRIVATE... yes > checking if Fortran compiler supports PROTECTED... yes > checking if Fortran compiler supports ABSTRACT... yes > checking if Fortran compiler supports ASYNCHRONOUS... yes > checking if Fortran compiler supports PROCEDURE... no > checking size of Fortran type(test_mpi_handle)... 4 > checking Fortran compiler F08 assumed rank syntax... not cached; checking > checking for Fortran compiler support of TYPE(*), DIMENSION(..)... no > checking Fortran compiler F08 assumed rank syntax... no > checking which mpi_f08 implementation to build... "good" compiler, no array subsections > checking if building Fortran 'use mpi_f08' bindings... yes > > So, somewhere between 1.8.1 and 1.8.2rc2 something has happened in the configure logic to disqualify the pgf90 compiler. > > I am also surprised to see 1.8.2rc2 performing *fewer* tests of FC than 1.8.1 did (unless they moved elsewhere?). 
> > In the end I cannot reproduce the originally reported problem for the simple reason that I instead see: > > {hargrove@hopper04 openmpi-1.8.2rc2-linux-x86_64-pgi-14.4}$ ./INST/bin/mpif90 ../test.f > PGF90-F-0004-Unable to open MODULE file mpi_f08.mod (../test.f: 2) > PGF90/x86-64 Linux 14.4-0: compilation aborted > > > Tetsuya Mishima, > > Is it possible that your installation of 1.8.2rc2 was to the same prefix as an older build? > If that is the case, you may have the mpi_f08.mod from the older build even though no f08 support is in the new build. > > > -Paul > > > -- > > Paul H. Hargrove phhargr...@lbl.gov > Future Technologies Group > Computer and Data Sciences Department Tel: +1-510-495-2352 > Lawrence Berkeley National Laboratory Fax: +1-510-486-6900 ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel Link to this post:
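Paul's stale-module hypothesis can be tested mechanically: an mpi_f08.mod left over from an earlier install will have an mtime older than the freshly built copy. A minimal shell sketch of that check is below; the two mktemp files are only stand-ins for the installed module (e.g. $PREFIX/lib/mpi_f08.mod) and the copy in the build tree, so the paths here are placeholders, not real locations.

```shell
#!/bin/sh
# Stand-ins for the installed mpi_f08.mod and the freshly rebuilt one.
installed_mod=$(mktemp)   # would really be $PREFIX/lib/mpi_f08.mod
sleep 1                   # ensure the second file gets a later mtime
rebuilt_mod=$(mktemp)     # would really be the .mod in the build tree

# -ot ("older than") compares modification times.
if [ "$installed_mod" -ot "$rebuilt_mod" ]; then
    echo "stale: installed module predates the current build"
fi
```

If the installed copy predates the rebuild, it almost certainly came from an earlier install into the same prefix; in the listing above the timestamps match the rebuild time, so this check would come out clean.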
Re: [OMPI devel] openmpi-1.8.2rc2 and f08 interface built with PGI-14.7 causes link error
This is another one. (See attached file: openmpi-1.8.2rc2-pgi14.7.tar.gz) Tetsuya > Tetsuya -- > > I am unable to test with the PGI compiler -- I don't have a license. I was hoping that LANL would be able to test today, but I don't think they got to it. > > Can you send more details? > > E.g., can you send all the stuff listed on http://www.open-mpi.org/community/help/ for 1.8 and 1.8.2rc2 for the 14.7 compiler? > > I'm *guessing* that we've done something new in the changes since 1.8 that PGI doesn't support, and we need to disable that something (hopefully while not needing to disable the entire mpi_f08 > bindings...). > > > > On Jul 28, 2014, at 11:43 PM, tmish...@jcity.maeda.co.jp wrote: > > > > > Hi folks, > > > > I tried to build openmpi-1.8.2rc2 with PGI-14.7 and execute a sample > > program. Then, it causes linking error: > > > > [mishima@manage work]$ cat test.f > > program hello_world > > use mpi_f08 > > implicit none > > > > type(MPI_Comm) :: comm > > integer :: myid, npes, ierror > > integer :: name_length > > character(len=MPI_MAX_PROCESSOR_NAME) :: processor_name > > > > call mpi_init(ierror) > > comm = MPI_COMM_WORLD > > call MPI_Comm_rank(comm, myid, ierror) > > call MPI_Comm_size(comm, npes, ierror) > > call MPI_Get_processor_name(processor_name, name_length, ierror) > > write (*,'(A,X,I4,X,A,X,I4,X,A,X,A)') > > +"Process", myid, "of", npes, "is on", trim(processor_name) > > call MPI_Finalize(ierror) > > > > end program hello_world > > > > [mishima@manage work]$ mpif90 test.f -o test.ex > > /tmp/pgfortran65ZcUeoncoqT.o: In function `.C1_283': > > test.f:(.data+0x6c): undefined reference to `mpi_f08_interfaces_callbacks_' > > test.f:(.data+0x74): undefined reference to `mpi_f08_interfaces_' > > test.f:(.data+0x7c): undefined reference to `pmpi_f08_interfaces_' > > test.f:(.data+0x84): undefined reference to `mpi_f08_sizeof_' > > > > So, I did some more tests with previous version of PGI and > > openmpi-1.8. 
The results are summarized as follows: > > > > PGI13.10 PGI14.7 > > openmpi-1.8 OK OK > > openmpi-1.8.2rc2 configure sets use_f08_mpi:no link error > > > > Regards, > > Tetsuya Mishima > > > > ___ > > devel mailing list > > de...@open-mpi.org > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15303.php > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15335.php openmpi-1.8.2rc2-pgi14.7.tar.gz Description: Binary data
Re: [OMPI devel] openmpi-1.8.2rc2 and f08 interface built with PGI-14.7 causes link error
Hi Jeff, sorry for the poor information and the late reply. Today I attended a very, very long meeting ... Anyway, I attached the compile output and configure log. (Due to the file size limitation, I'm sending them in two messages.) I hope you can find the problem. (See attached file: openmpi-1.8-pgi14.7.tar.gz) Regards, Tetsuya > Tetsuya -- > > I am unable to test with the PGI compiler -- I don't have a license. I was hoping that LANL would be able to test today, but I don't think they got to it. > > Can you send more details? > > E.g., can you send all the stuff listed on http://www.open-mpi.org/community/help/ for 1.8 and 1.8.2rc2 for the 14.7 compiler? > > I'm *guessing* that we've done something new in the changes since 1.8 that PGI doesn't support, and we need to disable that something (hopefully while not needing to disable the entire mpi_f08 > bindings...). > > > > On Jul 28, 2014, at 11:43 PM, tmish...@jcity.maeda.co.jp wrote: > > > > > Hi folks, > > > > I tried to build openmpi-1.8.2rc2 with PGI-14.7 and execute a sample > > program. 
Then, it causes linking error: > > > > [mishima@manage work]$ cat test.f > > program hello_world > > use mpi_f08 > > implicit none > > > > type(MPI_Comm) :: comm > > integer :: myid, npes, ierror > > integer :: name_length > > character(len=MPI_MAX_PROCESSOR_NAME) :: processor_name > > > > call mpi_init(ierror) > > comm = MPI_COMM_WORLD > > call MPI_Comm_rank(comm, myid, ierror) > > call MPI_Comm_size(comm, npes, ierror) > > call MPI_Get_processor_name(processor_name, name_length, ierror) > > write (*,'(A,X,I4,X,A,X,I4,X,A,X,A)') > > +"Process", myid, "of", npes, "is on", trim(processor_name) > > call MPI_Finalize(ierror) > > > > end program hello_world > > > > [mishima@manage work]$ mpif90 test.f -o test.ex > > /tmp/pgfortran65ZcUeoncoqT.o: In function `.C1_283': > > test.f:(.data+0x6c): undefined reference to `mpi_f08_interfaces_callbacks_' > > test.f:(.data+0x74): undefined reference to `mpi_f08_interfaces_' > > test.f:(.data+0x7c): undefined reference to `pmpi_f08_interfaces_' > > test.f:(.data+0x84): undefined reference to `mpi_f08_sizeof_' > > > > So, I did some more tests with previous version of PGI and > > openmpi-1.8. The results are summarized as follows: > > > > PGI13.10 PGI14.7 > > openmpi-1.8 OK OK > > openmpi-1.8.2rc2 configure sets use_f08_mpi:no link error > > > > Regards, > > Tetsuya Mishima > > > > ___ > > devel mailing list > > de...@open-mpi.org > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15303.php > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15335.php openmpi-1.8-pgi14.7.tar.gz Description: Binary data
Re: [OMPI devel] openmpi-1.8.2rc2 and f08 interface built with PGI-14.7 causes link error
Sorry for the poor information. I attached the compile output and configure log. I hope you can find the problem. (See attached file: openmpi-pgi14.7.tar.gz) Regards, Tetsuya Mishima > Tetsuya -- > > I am unable to test with the PGI compiler -- I don't have a license. I was hoping that LANL would be able to test today, but I don't think they got to it. > > Can you send more details? > > E.g., can you send all the stuff listed on http://www.open-mpi.org/community/help/ for 1.8 and 1.8.2rc2 for the 14.7 compiler? > > I'm *guessing* that we've done something new in the changes since 1.8 that PGI doesn't support, and we need to disable that something (hopefully while not needing to disable the entire mpi_f08 > bindings...). > > > > On Jul 28, 2014, at 11:43 PM, tmish...@jcity.maeda.co.jp wrote: > > > > > Hi folks, > > > > I tried to build openmpi-1.8.2rc2 with PGI-14.7 and execute a sample > > program. Then, it causes linking error: > > > > [mishima@manage work]$ cat test.f > > program hello_world > > use mpi_f08 > > implicit none > > > > type(MPI_Comm) :: comm > > integer :: myid, npes, ierror > > integer :: name_length > > character(len=MPI_MAX_PROCESSOR_NAME) :: processor_name > > > > call mpi_init(ierror) > > comm = MPI_COMM_WORLD > > call MPI_Comm_rank(comm, myid, ierror) > > call MPI_Comm_size(comm, npes, ierror) > > call MPI_Get_processor_name(processor_name, name_length, ierror) > > write (*,'(A,X,I4,X,A,X,I4,X,A,X,A)') > > +"Process", myid, "of", npes, "is on", trim(processor_name) > > call MPI_Finalize(ierror) > > > > end program hello_world > > > > [mishima@manage work]$ mpif90 test.f -o test.ex > > /tmp/pgfortran65ZcUeoncoqT.o: In function `.C1_283': > > test.f:(.data+0x6c): undefined reference to `mpi_f08_interfaces_callbacks_' > > test.f:(.data+0x74): undefined reference to `mpi_f08_interfaces_' > > test.f:(.data+0x7c): undefined reference to `pmpi_f08_interfaces_' > > test.f:(.data+0x84): undefined reference to `mpi_f08_sizeof_' > > > > So, I did some more 
tests with previous version of PGI and > > openmpi-1.8. The results are summarized as follows: > > > > PGI13.10 PGI14.7 > > openmpi-1.8 OK OK > > openmpi-1.8.2rc2 configure sets use_f08_mpi:no link error > > > > Regards, > > Tetsuya Mishima > > > > ___ > > devel mailing list > > de...@open-mpi.org > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15303.php > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15335.php openmpi-pgi14.7.tar.gz Description: Binary data
[OMPI devel] openmpi-1.8.2rc2 and f08 interface built with PGI-14.7 causes link error
Hi folks, I tried to build openmpi-1.8.2rc2 with PGI-14.7 and execute a sample program. Then, it causes a linking error: [mishima@manage work]$ cat test.f program hello_world use mpi_f08 implicit none type(MPI_Comm) :: comm integer :: myid, npes, ierror integer :: name_length character(len=MPI_MAX_PROCESSOR_NAME) :: processor_name call mpi_init(ierror) comm = MPI_COMM_WORLD call MPI_Comm_rank(comm, myid, ierror) call MPI_Comm_size(comm, npes, ierror) call MPI_Get_processor_name(processor_name, name_length, ierror) write (*,'(A,X,I4,X,A,X,I4,X,A,X,A)') +"Process", myid, "of", npes, "is on", trim(processor_name) call MPI_Finalize(ierror) end program hello_world [mishima@manage work]$ mpif90 test.f -o test.ex /tmp/pgfortran65ZcUeoncoqT.o: In function `.C1_283': test.f:(.data+0x6c): undefined reference to `mpi_f08_interfaces_callbacks_' test.f:(.data+0x74): undefined reference to `mpi_f08_interfaces_' test.f:(.data+0x7c): undefined reference to `pmpi_f08_interfaces_' test.f:(.data+0x84): undefined reference to `mpi_f08_sizeof_' So, I did some more tests with a previous version of PGI and openmpi-1.8. The results are summarized as follows:
                  PGI13.10                       PGI14.7
openmpi-1.8       OK                             OK
openmpi-1.8.2rc2  configure sets use_f08_mpi:no  link error
Regards, Tetsuya Mishima
Re: [OMPI devel] trunk hangs when I specify a particular binding by rankfile
Hi Ralph, By the way, something is wrong with your latest rmaps_rank_file.c. I've got the error below. I'm trying to find the problem, but you could probably find it more quickly... [mishima@manage trial]$ cat rankfile rank 0=node05 slot=0-1 rank 1=node05 slot=3-4 rank 2=node05 slot=6-7 [mishima@manage trial]$ mpirun -np 3 -rf rankfile -report-bindings demos/myprog -- Error, invalid syntax in the rankfile (rankfile) syntax must be the fallowing rank i=host_i slot=string Examples of proper syntax include: rank 1=host1 slot=1:0,1 rank 0=host2 slot=0:* rank 2=host4 slot=1-2 rank 3=host3 slot=0:1;1:0-2 -- [manage.cluster:24456] [[20979,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 483 [manage.cluster:24456] [[20979,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 149 [manage.cluster:24456] [[20979,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 287 Regards, Tetsuya Mishima > My guess is that the coll/ml component may have problems with binding a single process across multiple cores like that - it might be that we'll have to have it check for that condition and disqualify > itself. It is a particularly bad binding pattern, though, as shared memory gets completely messed up when you split that way. > > > On Jun 19, 2014, at 3:57 PM, tmish...@jcity.maeda.co.jp wrote: > > > > > Hi folks, > > > > Recently I have been seeing a hang with trunk when I specify a > > particular binding by use of rankfile or "-map-by slot". > > > > This can be reproduced by the rankfile which allocates a process > > beyond socket boundary. For example, on the node05 which has 2 socket > > with 4 core, the rank 1 is allocated through socket 0 and 1 as shown > > below. Then it hangs in the middle of communication. 
> > > > [mishima@manage trial]$ cat rankfile1 > > rank 0=node05 slot=0-1 > > rank 1=node05 slot=3-4 > > rank 2=node05 slot=6-7 > > > > [mishima@manage trial]$ mpirun -rf rankfile1 -report-bindings demos/myprog > > [node05.cluster:02342] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket > > 0[core 1[hwt 0]]: [B/B/./.][./././.] > > [node05.cluster:02342] MCW rank 1 bound to socket 0[core 3[hwt 0]], socket > > 1[core 4[hwt 0]]: [./././B][B/././.] > > [node05.cluster:02342] MCW rank 2 bound to socket 1[core 6[hwt 0]], socket > > 1[core 7[hwt 0]]: [./././.][././B/B] > > Hello world from process 2 of 3 > > Hello world from process 1 of 3 > > << hang here! >> > > > > If I disable coll_ml or use 1.8 series, it works, which means it > > might be affected by coll_ml component, I guess. But, unfortunately, > > I have no idea to fix this problem. So, please somebody could resolve > > the issue. > > > > [mishima@manage trial]$ mpirun -rf rankfile1 -report-bindings -mca > > coll_ml_priority 0 demos/myprog > > [node05.cluster:02382] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket > > 0[core 1[hwt 0]]: [B/B/./.][./././.] > > [node05.cluster:02382] MCW rank 1 bound to socket 0[core 3[hwt 0]], socket > > 1[core 4[hwt 0]]: [./././B][B/././.] > > [node05.cluster:02382] MCW rank 2 bound to socket 1[core 6[hwt 0]], socket > > 1[core 7[hwt 0]]: [./././.][././B/B] > > Hello world from process 2 of 3 > > Hello world from process 0 of 3 > > Hello world from process 1 of 3 > > > > In addition, when I use the host with 12 cores, "-map-by slot" causes the > > same problem. > > [mishima@manage trial]$ mpirun -np 3 -map-by slot:pe=4 -report-bindings > > demos/myprog > > [manage.cluster:02557] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket > > 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./.][./././././.] > > [manage.cluster:02557] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket > > 0[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [././././B/B][B/B/./././.] > > [manage.cluster:02557] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket > > 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][././B/B/B/B] > > Hello world from process 1 of 3 > > Hello world from process 2 of 3 > > << hang here! >> > > > > Regards, > > Tetsuya Mishima > > > > ___ > > devel mailing list > > de...@open-mpi.org > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/06/15030.php > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/06/15032.php
Re: [OMPI devel] trunk hangs when I specify a particular binding by rankfile
I'm not sure, but I guess it's related to Gilles's ticket. It's quite a bad binding pattern, as Ralph pointed out, so checking for that condition and disqualifying coll/ml could be a practical solution as well. Tetsuya > It is related, but it means that coll/ml has a higher degree of sensitivity to the binding pattern than what you reported (which was that coll/ml doesn't work with unbound processes). What we are now > seeing is that coll/ml also doesn't work when processes are bound across sockets. > > Which means that Nathan's revised tests are going to have to cover a lot more corner cases. Our locality flags don't currently include "bound-to-multiple-sockets", and I'm not sure how he is going to > easily resolve that case. > > > On Jun 19, 2014, at 8:02 PM, Gilles Gouaillardet wrote: > > > Ralph and Tetsuya, > > > > is this related to the hang i reported at > > http://www.open-mpi.org/community/lists/devel/2014/06/14975.php ? > > > > Nathan already replied he is working on a fix. > > > > Cheers, > > > > Gilles > > > > > > On 2014/06/20 11:54, Ralph Castain wrote: > >> My guess is that the coll/ml component may have problems with binding a single process across multiple cores like that - it might be that we'll have to have it check for that condition and > disqualify itself. It is a particularly bad binding pattern, though, as shared memory gets completely messed up when you split that way. > >> > > > > ___ > > devel mailing list > > de...@open-mpi.org > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/06/15033.php > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/06/15034.php
[OMPI devel] trunk hangs when I specify a particular binding by rankfile
Hi folks, recently I have been seeing a hang with trunk when I specify a particular binding by use of a rankfile or "-map-by slot". This can be reproduced with a rankfile that allocates a process beyond a socket boundary. For example, on node05, which has 2 sockets with 4 cores each, rank 1 is allocated across sockets 0 and 1 as shown below. Then it hangs in the middle of communication. [mishima@manage trial]$ cat rankfile1 rank 0=node05 slot=0-1 rank 1=node05 slot=3-4 rank 2=node05 slot=6-7 [mishima@manage trial]$ mpirun -rf rankfile1 -report-bindings demos/myprog [node05.cluster:02342] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.] [node05.cluster:02342] MCW rank 1 bound to socket 0[core 3[hwt 0]], socket 1[core 4[hwt 0]]: [./././B][B/././.] [node05.cluster:02342] MCW rank 2 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B] Hello world from process 2 of 3 Hello world from process 1 of 3 << hang here! >> If I disable coll_ml or use the 1.8 series, it works, which suggests it is affected by the coll_ml component. Unfortunately, I have no idea how to fix this problem, so I hope somebody can resolve the issue. [mishima@manage trial]$ mpirun -rf rankfile1 -report-bindings -mca coll_ml_priority 0 demos/myprog [node05.cluster:02382] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.] [node05.cluster:02382] MCW rank 1 bound to socket 0[core 3[hwt 0]], socket 1[core 4[hwt 0]]: [./././B][B/././.] [node05.cluster:02382] MCW rank 2 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B] Hello world from process 2 of 3 Hello world from process 0 of 3 Hello world from process 1 of 3 In addition, when I use the host with 12 cores, "-map-by slot" causes the same problem. 
[mishima@manage trial]$ mpirun -np 3 -map-by slot:pe=4 -report-bindings demos/myprog [manage.cluster:02557] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./.][./././././.] [manage.cluster:02557] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [././././B/B][B/B/./././.] [manage.cluster:02557] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][././B/B/B/B] Hello world from process 1 of 3 Hello world from process 2 of 3 << hang here! >> Regards, Tetsuya Mishima
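A side note for anyone hitting the same hang: instead of adding "-mca coll_ml_priority 0" to every command line, the parameter can be made persistent through Open MPI's per-user MCA parameter file, which is read by default from $HOME/.openmpi/mca-params.conf. A minimal sketch:

```shell
#!/bin/sh
# Persistently deprioritize coll/ml instead of passing -mca on each run.
# $HOME/.openmpi/mca-params.conf is one of Open MPI's default parameter
# files, consulted by every mpirun/MPI process for this user.
mkdir -p "$HOME/.openmpi"
cat >> "$HOME/.openmpi/mca-params.conf" <<'EOF'
# work around the coll/ml hang seen with cross-socket bindings
coll_ml_priority = 0
EOF
```

Command-line -mca values still override the file, so individual runs can re-enable coll/ml for testing once a fix is available.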
Re: [OMPI devel] openmpi-1.8 - hangup using more than 4 nodes under managed state by Torque
Thanks Ralph. Tetsuya > I tracked it down - not Torque specific, but impacts all managed environments. Will fix > > > On Apr 1, 2014, at 2:23 AM, tmish...@jcity.maeda.co.jp wrote: > > > > > Hi Ralph, > > > > I saw another hangup with openmpi-1.8 when I used more than 4 nodes > > (having 8 cores each) under managed state by Torque. Although I'm not > > sure you can reproduce it with SLURM, at leaset with Torque it can be > > reproduced in this way: > > > > [mishima@manage ~]$ qsub -I -l nodes=4:ppn=8 > > qsub: waiting for job 8726.manage.cluster to start > > qsub: job 8726.manage.cluster ready > > > > [mishima@node09 ~]$ mpirun -np 65 ~/mis/openmpi/demos/myprog > > -- > > There are not enough slots available in the system to satisfy the 65 slots > > that were requested by the application: > > /home/mishima/mis/openmpi/demos/myprog > > > > Either request fewer slots for your application, or make more slots > > available > > for use. > > -- > > <<< HANG HERE!! >>> > > Abort is in progress...hit ctrl-c again within 5 seconds to forcibly > > terminate > > > > I found this behavior when I happened to input wrong procs. With less than > > 4 > > nodes or rsh - namely unmanaged state, it works. I'm afraid to say I have > > no > > idea how to resolve it. I hope you could fix the problem. > > > > Tetsuya > > > > ___ > > devel mailing list > > de...@open-mpi.org > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > > Searchable archives: http://www.open-mpi.org/community/lists/devel/2014/04/index.php > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/04/14438.php
Re: [OMPI devel] system call failed during shared memory initialization with openmpi-1.8a1r31254
Hi Jeff, it worked for me with openmpi-1.8rc1. Tetsuya > Ralph applied a bunch of CMRs to the v1.8 branch after the nightly tarball was made last night. > > I just created a new nightly tarball that includes all of those CMRs: 1.8a1r31269. It should have the fix for this error included in it. > > > On Mar 28, 2014, at 6:50 AM,wrote: > > > > > > > Thanks Jeff. It seems to be really the latest one - ticket #4474. > > > >> On Mar 28, 2014, at 5:45 AM, wrote: > >> > >>> > > -- > >>> A system call failed during shared memory initialization that should > >>> not have. It is likely that your MPI job will now either abort or > >>> experience performance degradation. > >>> > >>> Local host: node03.cluster > >>> System call: unlink > >>> > > (2) /tmp/openmpi-sessions-mishima@node03_0/17579/1/vader_segment.node03.0 > >>> Error: No such file or directory (errno 2) > >>> > > -- > >> > >> > >> This error was just fixed last night. > >> > >> -- > >> Jeff Squyres > >> jsquy...@cisco.com > >> For corporate legal information go to: > > http://www.cisco.com/web/about/doing_business/legal/cri/ > >> > >> ___ > >> devel mailing list > >> de...@open-mpi.org > >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > >> Link to this post: > > http://www.open-mpi.org/community/lists/devel/2014/03/14416.php > > > > ___ > > devel mailing list > > de...@open-mpi.org > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/03/14417.php > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/03/14419.php
Re: [OMPI devel] system call failed during shared memory initialization with openmpi-1.8a1r31254
Thanks Jeff. But I'm already offline today ... I cannot confirm it until Monday morning, sorry. Tetsuya > Ralph applied a bunch of CMRs to the v1.8 branch after the nightly tarball was made last night. > > I just created a new nightly tarball that includes all of those CMRs: 1.8a1r31269. It should have the fix for this error included in it. > > > On Mar 28, 2014, at 6:50 AM,wrote: > > > > > > > Thanks Jeff. It seems to be really the latest one - ticket #4474. > > > >> On Mar 28, 2014, at 5:45 AM, wrote: > >> > >>> > > -- > >>> A system call failed during shared memory initialization that should > >>> not have. It is likely that your MPI job will now either abort or > >>> experience performance degradation. > >>> > >>> Local host: node03.cluster > >>> System call: unlink > >>> > > (2) /tmp/openmpi-sessions-mishima@node03_0/17579/1/vader_segment.node03.0 > >>> Error: No such file or directory (errno 2) > >>> > > -- > >> > >> > >> This error was just fixed last night. > >> > >> -- > >> Jeff Squyres > >> jsquy...@cisco.com > >> For corporate legal information go to: > > http://www.cisco.com/web/about/doing_business/legal/cri/ > >> > >> ___ > >> devel mailing list > >> de...@open-mpi.org > >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > >> Link to this post: > > http://www.open-mpi.org/community/lists/devel/2014/03/14416.php > > > > ___ > > devel mailing list > > de...@open-mpi.org > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/03/14417.php > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/03/14419.php
Re: [OMPI devel] system call failed during shared memory initialization with openmpi-1.8a1r31254
Thanks Jeff. It seems to be really the latest one - ticket #4474. > On Mar 28, 2014, at 5:45 AM,wrote: > > > -- > > A system call failed during shared memory initialization that should > > not have. It is likely that your MPI job will now either abort or > > experience performance degradation. > > > > Local host: node03.cluster > > System call: unlink > > (2) /tmp/openmpi-sessions-mishima@node03_0/17579/1/vader_segment.node03.0 > > Error: No such file or directory (errno 2) > > -- > > > This error was just fixed last night. > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/03/14416.php
[OMPI devel] system call failed during shared memory initialization with openmpi-1.8a1r31254
Hi all, I saw this error as shown below with openmpi-1.8a1r31254. I've never seen it before with openmpi-1.7.5. The message implies it's related to vader, and I can stop it by excluding vader from the btl list with -mca btl ^vader. Could someone fix this problem? Tetsuya [mishima@manage openmpi]$ mpirun -np 16 -host node03,node04 -map-by numa:pe=4 -display-map -report-bindings -bind-to core ./demos/myprog Data for JOB [17579,1] offset 0 JOB MAP Data for node: node03 Num slots: 1 Max slots: 0 Num procs: 8 Process OMPI jobid: [17579,1] App: 0 Process rank: 0 Process OMPI jobid: [17579,1] App: 0 Process rank: 1 Process OMPI jobid: [17579,1] App: 0 Process rank: 2 Process OMPI jobid: [17579,1] App: 0 Process rank: 3 Process OMPI jobid: [17579,1] App: 0 Process rank: 4 Process OMPI jobid: [17579,1] App: 0 Process rank: 5 Process OMPI jobid: [17579,1] App: 0 Process rank: 6 Process OMPI jobid: [17579,1] App: 0 Process rank: 7 Data for node: node04 Num slots: 1 Max slots: 0 Num procs: 8 Process OMPI jobid: [17579,1] App: 0 Process rank: 8 Process OMPI jobid: [17579,1] App: 0 Process rank: 9 Process OMPI jobid: [17579,1] App: 0 Process rank: 10 Process OMPI jobid: [17579,1] App: 0 Process rank: 11 Process OMPI jobid: [17579,1] App: 0 Process rank: 12 Process OMPI jobid: [17579,1] App: 0 Process rank: 13 Process OMPI jobid: [17579,1] App: 0 Process rank: 14 Process OMPI jobid: [17579,1] App: 0 Process rank: 15 = [node03.cluster:23025] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] [node03.cluster:23025] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.] 
[node03.cluster:23025] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] [node03.cluster:23025] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B] [node03.cluster:23025] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] [node04.cluster:29332] MCW rank 10 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] [node04.cluster:29332] MCW rank 11 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.] [node04.cluster:29332] MCW rank 12 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] [node04.cluster:29332] MCW rank 13 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.] [node04.cluster:29332] MCW rank 14 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] 
[node04.cluster:29332] MCW rank 15 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]] , socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B] [node04.cluster:29332] MCW rank 8 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so cket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] [node04.cluster:29332] MCW rank 9 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so cket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.] [node03.cluster:23025] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so cket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.] [node03.cluster:23025] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s ocket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] [node03.cluster:23025] MCW rank 3 bound to socket 1[core 12[hwt 0]],
Re: [OMPI devel] cleanup of rr_byobj
I added two improvements. Please replace the previous patch file with this attached one, and take a look this weekend.

1. Added a pre-check for ORTE_ERR_NOT_FOUND so that the subsequent retry with byslot works correctly. Without it, the retry could fail, because some fields, such as node->procs and node->slots_inuse, have already been updated.

2. Improved the detection of oversubscription when node->slots is not a multiple of cpus_per_rank. For example, using node05 and node06 with slots = 8 and setting cpus_per_rank = 3, np = 5 should be oversubscribed, even though np x cpus_per_rank (3 x 5 = 15) is less than num_slots (= 16). I fixed the logic to detect this oversubscription.

Tetsuya

(See attached file: patch.byobj2)

> Hi Tetsuya
>
> Let me take a look when I get home this weekend - I'm giving an ORTE tutorial to a group of new developers this week and my time is very limited.
>
> Thanks
> Ralph
>
> On Tue, Mar 25, 2014 at 5:37 PM, wrote:
>
> Hi Ralph, I moved on to the development list.
>
> I'm not sure why the add_one flag is used in rr_byobj. Here, if oversubscribed, each proc is mapped to each object one by one, so I think add_one is not necessary.
>
> Instead, when the user doesn't permit oversubscription, the second pass should be skipped.
>
> I made the logic a bit clearer based on this idea and removed some outputs to synchronize it with the 1.7 branch.
>
> Please take a look at the attached patch file.
>
> Tetsuya
>
> (See attached file: patch.byobj)
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/03/14393.php
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/03/14394.php

patch.byobj2 Description: Binary data
Re: [OMPI devel] cleanup of rr_byobj
no problem - it's a minor cleanup.

Tetsuya

> Hi Tetsuya
>
> Let me take a look when I get home this weekend - I'm giving an ORTE tutorial to a group of new developers this week and my time is very limited.
>
> Thanks
> Ralph
[OMPI devel] cleanup of rr_byobj
Hi Ralph, I moved on to the development list.

I'm not sure why the add_one flag is used in rr_byobj. Here, if oversubscribed, each proc is mapped to each object one by one, so I think add_one is not necessary.

Instead, when the user doesn't permit oversubscription, the second pass should be skipped.

I made the logic a bit clearer based on this idea and removed some outputs to synchronize it with the 1.7 branch.

Please take a look at the attached patch file.

Tetsuya

(See attached file: patch.byobj)

patch.byobj Description: Binary data