The HPCX package uses pml "yalla" by default (yalla is part of the ompi master branch, not v1.8). So "-mca mtl mxm" has no effect unless "-mca pml cm" is also specified, which disables pml "yalla" and lets the mtl layer take over.
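To make the difference concrete, here is a minimal dry-run sketch that only prints the two command lines to compare; it does not launch anything. The paths and application arguments are copied from the commands quoted below. The build_cmd helper is a hypothetical name used for illustration, and adding "-mca pml_base_verbose 10" to see which pml gets selected is my assumption about what is useful here, not something taken from the HPCX README.

```shell
#!/bin/sh
# Dry-run sketch: prints mpirun command lines instead of executing them,
# because mpirun and the binaries from this thread may not exist locally.

# Application part of the command line, as used in the quoted thread.
APP_ARGS="-n 1 /root/backend localhost : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2"

# build_cmd PML [MTL] (hypothetical helper): pin the pml explicitly and,
# optionally, the mtl that the cm pml should use.
build_cmd() {
    pml="$1"
    mtl="${2:-}"
    cmd="mpirun --allow-run-as-root -mca pml $pml"
    if [ -n "$mtl" ]; then
        cmd="$cmd -mca mtl $mtl"
    fi
    # Assumption: raise pml framework verbosity so the selected component
    # shows up in the job output.
    cmd="$cmd -mca pml_base_verbose 10"
    printf '%s\n' "$cmd $APP_ARGS"
}

# Case A: disable yalla by forcing pml cm, so that mtl mxm is actually used.
build_cmd cm mxm

# Case B: the HPCX default made explicit (mxm through pml yalla, no mtl).
build_cmd yalla
```

Running the two printed commands and comparing the verbose output should show which pml was picked in each case; with the default pml, changing only the mtl setting is not expected to change anything.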
On Fri, Apr 24, 2015 at 6:36 AM, Subhra Mazumdar <subhramazumd...@gmail.com> wrote:

> I changed my downloaded MOFED version to match the one installed on the
> node and now the error goes away and it runs fine. But I still have a
> question: I get the exact same performance in all 3 cases below:
>
> 1) mpirun --allow-run-as-root --mca mtl mxm -mca mtl_mxm_np 0 -x
> MXM_TLS=self,shm,rc,ud -n 1 /root/backend localhost : -x
> LD_PRELOAD=/root/libci.so -n 1 /root/app2
>
> 2) mpirun --allow-run-as-root --mca mtl mxm -n 1 /root/backend localhost
> : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2
>
> 3) mpirun --allow-run-as-root --mca mtl ^mxm -n 1 /root/backend
> localhost : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2
>
> Seems like it doesn't matter if I use mxm, not use mxm or use it with
> reliable connection (RC). How can I be sure I am indeed using mxm over
> infiniband?
>
> Thanks,
> Subhra.
>
> On Thu, Apr 23, 2015 at 1:06 AM, Mike Dubman <mi...@dev.mellanox.co.il>
> wrote:
>
>> /usr/bin/ofed_info
>>
>> So, the OFED on your system is not MellanoxOFED 2.4.x but something else.
>>
>> try #rpm -qi libibverbs
>>
>> On Thu, Apr 23, 2015 at 7:47 AM, Subhra Mazumdar <
>> subhramazumd...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> where is the command ofed_info located? I searched from / but didn't
>>> find it.
>>>
>>> Subhra.
>>>
>>> On Tue, Apr 21, 2015 at 10:43 PM, Mike Dubman <mi...@dev.mellanox.co.il>
>>> wrote:
>>>
>>>> cool, progress!
>>>>
>>>> >>1429676565.124664] sys.c:719 MXM WARN Conflicting CPU
>>>> frequencies detected, using: 2601.00
>>>>
>>>> means that cpu governor on your machine is not on "performance" mode
>>>>
>>>> >> MXM ERROR ibv_query_device() returned 38: Function not implemented
>>>>
>>>> indicates that ofed installed on your nodes is not indeed 2.4-1.0.0 or
>>>> there is a mismatch between ofed kernel drivers version and ofed userspace
>>>> libraries version.
>>>> or you have multiple ofed libraries installed on your node and use >>>> incorrect one. >>>> could you please check that ofed_info -s indeed prints mofed 2.4-1.0.0? >>>> >>>> >>>> >>>> >>>> >>>> On Wed, Apr 22, 2015 at 7:59 AM, Subhra Mazumdar < >>>> subhramazumd...@gmail.com> wrote: >>>> >>>>> Hi, >>>>> >>>>> I compiled the openmpi that comes inside the mellanox hpcx package >>>>> with mxm support instead of separately downloaded openmpi. I also used the >>>>> environment as in the README so that no LD_PRELOAD (except our own library >>>>> which is unrelated) is needed. Now it runs fine (no segfault) but we get >>>>> same errors as before (saying initialization of MXM library failed). Is it >>>>> using MXM successfully? >>>>> >>>>> [root@JARVICE >>>>> hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5]# mpirun >>>>> --allow-run-as-root --mca mtl mxm -n 1 /root/backend localhost : -x >>>>> LD_PRELOAD=/root/libci.so -n 1 /root/app2 >>>>> >>>>> -------------------------------------------------------------------------- >>>>> WARNING: a request was made to bind a process. While the system >>>>> supports binding the process itself, at least one node does NOT >>>>> support binding memory to the process location. >>>>> >>>>> Node: JARVICE >>>>> >>>>> This usually is due to not having the required NUMA support installed >>>>> on the node. In some Linux distributions, the required support is >>>>> contained in the libnumactl and libnumactl-devel packages. >>>>> This is a warning only; your job will continue, though performance may >>>>> be degraded. 
>>>>> >>>>> -------------------------------------------------------------------------- >>>>> i am backend >>>>> [1429676565.121218] sys.c:719 MXM WARN Conflicting CPU >>>>> frequencies detected, using: 2601.00 >>>>> [1429676565.122937] [JARVICE:14767:0] ib_dev.c:445 MXM WARN >>>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>>> [1429676565.122950] [JARVICE:14767:0] ib_dev.c:456 MXM ERROR >>>>> ibv_query_device() returned 38: Function not implemented >>>>> [1429676565.123535] [JARVICE:14767:0] ib_dev.c:445 MXM WARN >>>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>>> [1429676565.123543] [JARVICE:14767:0] ib_dev.c:456 MXM ERROR >>>>> ibv_query_device() returned 38: Function not implemented >>>>> [1429676565.124664] sys.c:719 MXM WARN Conflicting CPU >>>>> frequencies detected, using: 2601.00 >>>>> [1429676565.126264] [JARVICE:14768:0] ib_dev.c:445 MXM WARN >>>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>>> [1429676565.126276] [JARVICE:14768:0] ib_dev.c:456 MXM ERROR >>>>> ibv_query_device() returned 38: Function not implemented >>>>> [1429676565.126812] [JARVICE:14768:0] ib_dev.c:445 MXM WARN >>>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>>> [1429676565.126821] [JARVICE:14768:0] ib_dev.c:456 MXM ERROR >>>>> ibv_query_device() returned 38: Function not implemented >>>>> >>>>> -------------------------------------------------------------------------- >>>>> Initialization of MXM library failed. >>>>> >>>>> Error: Input/output error >>>>> >>>>> >>>>> -------------------------------------------------------------------------- >>>>> >>>>> <application runs fine> >>>>> >>>>> >>>>> Thanks, >>>>> Subhra. >>>>> >>>>> >>>>> On Sat, Apr 18, 2015 at 12:28 AM, Mike Dubman < >>>>> mi...@dev.mellanox.co.il> wrote: >>>>> >>>>>> could you please check that ofed_info -s indeed prints mofed >>>>>> 2.4-1.0.0? >>>>>> why LD_PRELOAD needed in your command line? 
Can you try >>>>>> >>>>>> module load hpcx >>>>>> mpirun -np $np test.exe >>>>>> ? >>>>>> >>>>>> On Sat, Apr 18, 2015 at 8:39 AM, Subhra Mazumdar < >>>>>> subhramazumd...@gmail.com> wrote: >>>>>> >>>>>>> I followed the instructions as in the README, now getting a >>>>>>> different error: >>>>>>> >>>>>>> [root@JARVICE >>>>>>> hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5]# >>>>>>> ../openmpi-1.8.4/openmpinstall/bin/mpirun --allow-run-as-root --mca mtl >>>>>>> mxm >>>>>>> -x LD_PRELOAD="../openmpi-1.8.4/openmpinstall/lib/libmpi.so.1 >>>>>>> ./mxm/lib/libmxm.so.2" -n 1 ../backend localhost : -x >>>>>>> LD_PRELOAD="../openmpi-1.8.4/openmpinstall/lib/libmpi.so.1 >>>>>>> ./mxm/lib/libmxm.so.2 ../libci.so" -n 1 ../app2 >>>>>>> >>>>>>> >>>>>>> -------------------------------------------------------------------------- >>>>>>> >>>>>>> WARNING: a request was made to bind a process. While the system >>>>>>> >>>>>>> supports binding the process itself, at least one node does NOT >>>>>>> >>>>>>> support binding memory to the process location. >>>>>>> >>>>>>> Node: JARVICE >>>>>>> >>>>>>> This usually is due to not having the required NUMA support installed >>>>>>> >>>>>>> on the node. In some Linux distributions, the required support is >>>>>>> >>>>>>> contained in the libnumactl and libnumactl-devel packages. >>>>>>> >>>>>>> This is a warning only; your job will continue, though performance >>>>>>> may be degraded. 
>>>>>>> >>>>>>> >>>>>>> -------------------------------------------------------------------------- >>>>>>> >>>>>>> i am backend >>>>>>> >>>>>>> [1429334876.139452] [JARVICE:449 :0] ib_dev.c:445 MXM WARN >>>>>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>>>>> >>>>>>> [1429334876.139464] [JARVICE:449 :0] ib_dev.c:456 MXM ERROR >>>>>>> ibv_query_device() returned 38: Function not implemented >>>>>>> >>>>>>> [1429334876.139982] [JARVICE:449 :0] ib_dev.c:445 MXM WARN >>>>>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>>>>> >>>>>>> [1429334876.139990] [JARVICE:449 :0] ib_dev.c:456 MXM ERROR >>>>>>> ibv_query_device() returned 38: Function not implemented >>>>>>> >>>>>>> [1429334876.142649] [JARVICE:450 :0] ib_dev.c:445 MXM WARN >>>>>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>>>>> >>>>>>> [1429334876.142666] [JARVICE:450 :0] ib_dev.c:456 MXM ERROR >>>>>>> ibv_query_device() returned 38: Function not implemented >>>>>>> >>>>>>> [1429334876.143235] [JARVICE:450 :0] ib_dev.c:445 MXM WARN >>>>>>> failed call to ibv_exp_use_priv_env(): Function not implemented >>>>>>> >>>>>>> [1429334876.143243] [JARVICE:450 :0] ib_dev.c:456 MXM ERROR >>>>>>> ibv_query_device() returned 38: Function not implemented >>>>>>> >>>>>>> >>>>>>> -------------------------------------------------------------------------- >>>>>>> >>>>>>> Initialization of MXM library failed. 
>>>>>>> >>>>>>> Error: Input/output error >>>>>>> >>>>>>> >>>>>>> -------------------------------------------------------------------------- >>>>>>> >>>>>>> [JARVICE:449 :0] Caught signal 11 (Segmentation fault) >>>>>>> >>>>>>> [JARVICE:450 :0] Caught signal 11 (Segmentation fault) >>>>>>> >>>>>>> ==== backtrace ==== >>>>>>> >>>>>>> 2 0x000000000005640c mxm_handle_error() >>>>>>> >>>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm-v3.2/src/mxm/util/debug/debug.c:641 >>>>>>> >>>>>>> 3 0x000000000005657c mxm_error_signal_handler() >>>>>>> >>>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm-v3.2/src/mxm/util/debug/debug.c:616 >>>>>>> >>>>>>> 4 0x00000000000329a0 killpg() ??:0 >>>>>>> >>>>>>> 5 0x000000000004812c _IO_vfprintf() ??:0 >>>>>>> >>>>>>> 6 0x000000000006f6da vasprintf() ??:0 >>>>>>> >>>>>>> 7 0x0000000000059b3b opal_show_help_vstring() ??:0 >>>>>>> >>>>>>> 8 0x0000000000026630 orte_show_help() ??:0 >>>>>>> >>>>>>> 9 0x0000000000001a3f mca_bml_r2_add_procs() >>>>>>> >>>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/ompi-mellanox-v1.8/ompi/mca/bml/r2/bml_r2.c:409 >>>>>>> >>>>>>> 10 0x0000000000004475 mca_pml_ob1_add_procs() >>>>>>> >>>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/ompi-mellanox-v1.8/ompi/mca/pml/ob1/pml_ob1.c:332 >>>>>>> >>>>>>> 11 0x00000000000442f3 ompi_mpi_init() ??:0 >>>>>>> >>>>>>> 12 0x0000000000067cb0 PMPI_Init_thread() ??:0 >>>>>>> >>>>>>> 13 0x000000000000d0ca l_getLocalFromConfig() >>>>>>> /root/rain_ib/interposer/libciutils.c:83 >>>>>>> >>>>>>> 14 0x000000000000c7b4 __cudaRegisterFatBinary() 
>>>>>>> /root/rain_ib/interposer/libci.c:4055 >>>>>>> >>>>>>> 15 0x0000000000402b59 >>>>>>> _ZL70__sti____cudaRegisterAll_39_tmpxft_00000703_00000000_6_app2_cpp1_ii_hwv() >>>>>>> tmpxft_00000703_00000000-3_app2.cudafe1.cpp:0 >>>>>>> >>>>>>> 16 0x0000000000402dd6 __do_global_ctors_aux() crtstuff.c:0 >>>>>>> >>>>>>> =================== >>>>>>> >>>>>>> ==== backtrace ==== >>>>>>> >>>>>>> 2 0x000000000005640c mxm_handle_error() >>>>>>> >>>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm-v3.2/src/mxm/util/debug/debug.c:641 >>>>>>> >>>>>>> 3 0x000000000005657c mxm_error_signal_handler() >>>>>>> >>>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm-v3.2/src/mxm/util/debug/debug.c:616 >>>>>>> >>>>>>> 4 0x00000000000329a0 killpg() ??:0 >>>>>>> >>>>>>> 5 0x000000000004812c _IO_vfprintf() ??:0 >>>>>>> >>>>>>> 6 0x000000000006f6da vasprintf() ??:0 >>>>>>> >>>>>>> 7 0x0000000000059b3b opal_show_help_vstring() ??:0 >>>>>>> >>>>>>> 8 0x0000000000026630 orte_show_help() ??:0 >>>>>>> >>>>>>> 9 0x0000000000001a3f mca_bml_r2_add_procs() >>>>>>> >>>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/ompi-mellanox-v1.8/ompi/mca/bml/r2/bml_r2.c:409 >>>>>>> >>>>>>> 10 0x0000000000004475 mca_pml_ob1_add_procs() >>>>>>> >>>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/ompi-mellanox-v1.8/ompi/mca/pml/ob1/pml_ob1.c:332 >>>>>>> >>>>>>> 11 0x00000000000442f3 ompi_mpi_init() ??:0 >>>>>>> >>>>>>> 12 0x0000000000067cb0 PMPI_Init_thread() ??:0 >>>>>>> >>>>>>> 13 0x0000000000404fdf main() /root/rain_ib/backend/backend.c:1237 >>>>>>> >>>>>>> 14 0x000000000001ed1d 
__libc_start_main() ??:0 >>>>>>> >>>>>>> 15 0x0000000000402db9 _start() ??:0 >>>>>>> >>>>>>> =================== >>>>>>> >>>>>>> >>>>>>> -------------------------------------------------------------------------- >>>>>>> >>>>>>> mpirun noticed that process rank 1 with PID 450 on node JARVICE >>>>>>> exited on signal 11 (Segmentation fault). >>>>>>> >>>>>>> >>>>>>> -------------------------------------------------------------------------- >>>>>>> >>>>>>> [JARVICE:00447] 1 more process has sent help message >>>>>>> help-mtl-mxm.txt / mxm init >>>>>>> >>>>>>> [JARVICE:00447] Set MCA parameter "orte_base_help_aggregate" to 0 to >>>>>>> see all help / error messages >>>>>>> >>>>>>> [root@JARVICE >>>>>>> hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5]# >>>>>>> >>>>>>> >>>>>>> Subhra. >>>>>>> >>>>>>> >>>>>>> On Mon, Apr 13, 2015 at 10:58 PM, Mike Dubman < >>>>>>> mi...@dev.mellanox.co.il> wrote: >>>>>>> >>>>>>>> Have you followed installation steps from README (Also here for >>>>>>>> reference http://bgate.mellanox.com/products/hpcx/README.txt) >>>>>>>> >>>>>>>> ... >>>>>>>> >>>>>>>> * Load OpenMPI/OpenSHMEM v1.8 based package: >>>>>>>> >>>>>>>> % source $HPCX_HOME/hpcx-init.sh >>>>>>>> % hpcx_load >>>>>>>> % env | grep HPCX >>>>>>>> % mpirun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_usempi >>>>>>>> % oshrun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_oshmem >>>>>>>> % hpcx_unload >>>>>>>> >>>>>>>> 3. Load HPCX environment from modules >>>>>>>> >>>>>>>> * Load OpenMPI/OpenSHMEM based package: >>>>>>>> >>>>>>>> % module use $HPCX_HOME/modulefiles >>>>>>>> % module load hpcx >>>>>>>> % mpirun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c >>>>>>>> % oshrun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_oshmem >>>>>>>> % module unload hpcx >>>>>>>> >>>>>>>> ... >>>>>>>> >>>>>>>> On Tue, Apr 14, 2015 at 5:42 AM, Subhra Mazumdar < >>>>>>>> subhramazumd...@gmail.com> wrote: >>>>>>>> >>>>>>>>> I am using 2.4-1.0.0 mellanox ofed. 
>>>>>>>>> >>>>>>>>> I downloaded mofed tarball >>>>>>>>> hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5.tar and >>>>>>>>> extracted >>>>>>>>> it. It has mxm directory. >>>>>>>>> >>>>>>>>> hpcx-v1.2.0-325-[root@JARVICE ~]# ls >>>>>>>>> hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5 >>>>>>>>> archive fca hpcx-init-ompi-mellanox-v1.8.sh ibprof >>>>>>>>> modulefiles ompi-mellanox-v1.8 sources VERSION >>>>>>>>> bupc-master hcoll hpcx-init.sh knem >>>>>>>>> mxm README.txt utils >>>>>>>>> >>>>>>>>> I tried using LD_PRELOAD for libmxm, but getting a different error >>>>>>>>> stack now as following >>>>>>>>> >>>>>>>>> [root@JARVICE ~]# ./openmpi-1.8.4/openmpinstall/bin/mpirun >>>>>>>>> --allow-run-as-root --mca mtl mxm -x >>>>>>>>> LD_PRELOAD="./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1 >>>>>>>>> ./hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm/lib/libmxm.so.2" >>>>>>>>> -n 1 ./backend localhost : -x >>>>>>>>> LD_PRELOAD="./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1 >>>>>>>>> ./hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm/lib/libmxm.so.2 >>>>>>>>> ./libci.so" -n 1 ./app2 >>>>>>>>> i am backend >>>>>>>>> [JARVICE:00564] mca: base: components_open: component pml / cm >>>>>>>>> open function failed >>>>>>>>> [JARVICE:564 :0] Caught signal 11 (Segmentation fault) >>>>>>>>> [JARVICE:00565] mca: base: components_open: component pml / cm >>>>>>>>> open function failed >>>>>>>>> [JARVICE:565 :0] Caught signal 11 (Segmentation fault) >>>>>>>>> ==== backtrace ==== >>>>>>>>> 2 0x000000000005640c mxm_handle_error() >>>>>>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm-v3.2/src/mxm/util/debug/debug.c:641 >>>>>>>>> 3 0x000000000005657c mxm_error_signal_handler() >>>>>>>>> 
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm-v3.2/src/mxm/util/debug/debug.c:616 >>>>>>>>> 4 0x00000000000329a0 killpg() ??:0 >>>>>>>>> 5 0x0000000000045491 mca_base_components_close() ??:0 >>>>>>>>> 6 0x000000000004e99a mca_base_framework_close() ??:0 >>>>>>>>> 7 0x0000000000045431 mca_base_component_close() ??:0 >>>>>>>>> 8 0x000000000004515c mca_base_framework_components_open() ??:0 >>>>>>>>> 9 0x00000000000a0de9 mca_pml_base_open() pml_base_frame.c:0 >>>>>>>>> 10 0x000000000004eb1c mca_base_framework_open() ??:0 >>>>>>>>> 11 0x0000000000043eb3 ompi_mpi_init() ??:0 >>>>>>>>> 12 0x0000000000067cb0 PMPI_Init_thread() ??:0 >>>>>>>>> 13 0x0000000000404fdf main() /root/rain_ib/backend/backend.c:1237 >>>>>>>>> 14 0x000000000001ed1d __libc_start_main() ??:0 >>>>>>>>> 15 0x0000000000402db9 _start() ??:0 >>>>>>>>> =================== >>>>>>>>> >>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>> A requested component was not found, or was unable to be opened. >>>>>>>>> This >>>>>>>>> means that this component is either not installed or is unable to >>>>>>>>> be >>>>>>>>> used on your system (e.g., sometimes this means that shared >>>>>>>>> libraries >>>>>>>>> that the component requires are unable to be found/loaded). Note >>>>>>>>> that >>>>>>>>> Open MPI stopped checking at the first component that it did not >>>>>>>>> find. >>>>>>>>> >>>>>>>>> Host: JARVICE >>>>>>>>> Framework: mtl >>>>>>>>> Component: mxm >>>>>>>>> >>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>> >>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>> mpirun noticed that process rank 0 with PID 564 on node JARVICE >>>>>>>>> exited on signal 11 (Segmentation fault). 
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> [JARVICE:00562] 1 more process has sent help message
>>>>>>>>> help-mca-base.txt / find-available:not-valid
>>>>>>>>> [JARVICE:00562] Set MCA parameter "orte_base_help_aggregate" to 0
>>>>>>>>> to see all help / error messages
>>>>>>>>>
>>>>>>>>> Subhra
>>>>>>>>>
>>>>>>>>> On Sun, Apr 12, 2015 at 10:48 PM, Mike Dubman <
>>>>>>>>> mi...@dev.mellanox.co.il> wrote:
>>>>>>>>>
>>>>>>>>>> seems like mxm was not found in your LD_LIBRARY_PATH.
>>>>>>>>>>
>>>>>>>>>> what mofed version do you use?
>>>>>>>>>> does it have /opt/mellanox/mxm in it?
>>>>>>>>>> You could just run mpirun from HPCX package which looks for mxm
>>>>>>>>>> internally and recompile ompi as mentioned in README.
>>>>>>>>>>
>>>>>>>>>> On Mon, Apr 13, 2015 at 3:24 AM, Subhra Mazumdar <
>>>>>>>>>> subhramazumd...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I used mxm mtl as follows but getting segfault. It says mxm
>>>>>>>>>>> component not found but I have compiled openmpi with mxm. Any idea
>>>>>>>>>>> what I might be missing?
>>>>>>>>>>> >>>>>>>>>>> [root@JARVICE ~]# ./openmpi-1.8.4/openmpinstall/bin/mpirun >>>>>>>>>>> --allow-run-as-root --mca pml cm --mca mtl mxm -n 1 -x >>>>>>>>>>> LD_PRELOAD=./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1 ./backend >>>>>>>>>>> localhosst : -n 1 -x LD_PRELOAD="./libci.so >>>>>>>>>>> ./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1" ./app2 >>>>>>>>>>> i am backend >>>>>>>>>>> [JARVICE:08398] *** Process received signal *** >>>>>>>>>>> [JARVICE:08398] Signal: Segmentation fault (11) >>>>>>>>>>> [JARVICE:08398] Signal code: Address not mapped (1) >>>>>>>>>>> [JARVICE:08398] Failing at address: 0x10 >>>>>>>>>>> [JARVICE:08398] [ 0] >>>>>>>>>>> /lib64/libpthread.so.0(+0xf710)[0x7ff8d0ddb710] >>>>>>>>>>> [JARVICE:08398] [ 1] >>>>>>>>>>> /root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_components_close+0x21)[0x7ff8cf9ae491] >>>>>>>>>>> [JARVICE:08398] [ 2] >>>>>>>>>>> /root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_framework_close+0x6a)[0x7ff8cf9b799a] >>>>>>>>>>> [JARVICE:08398] [ 3] >>>>>>>>>>> /root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_component_close+0x21)[0x7ff8cf9ae431] >>>>>>>>>>> [JARVICE:08398] [ 4] >>>>>>>>>>> /root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_framework_components_open+0x11c)[0x7ff8cf9ae15c] >>>>>>>>>>> [JARVICE:08398] [ 5] >>>>>>>>>>> ./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1(+0xa0de9)[0x7ff8d1089de9] >>>>>>>>>>> [JARVICE:08398] [ 6] >>>>>>>>>>> /root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_framework_open+0x7c)[0x7ff8cf9b7b1c] >>>>>>>>>>> [JARVICE:08398] [ 7] [JARVICE:08398] mca: base: components_open: >>>>>>>>>>> component pml / cm open function failed >>>>>>>>>>> >>>>>>>>>>> ./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1(ompi_mpi_init+0x4b3)[0x7ff8d102ceb3] >>>>>>>>>>> [JARVICE:08398] [ 8] >>>>>>>>>>> ./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1(PMPI_Init_thread+0x100)[0x7ff8d1050cb0] >>>>>>>>>>> [JARVICE:08398] [ 9] ./backend[0x404fdf] 
>>>>>>>>>>> [JARVICE:08398] [10]
>>>>>>>>>>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7ff8cfeded1d]
>>>>>>>>>>> [JARVICE:08398] [11] ./backend[0x402db9]
>>>>>>>>>>> [JARVICE:08398] *** End of error message ***
>>>>>>>>>>>
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> A requested component was not found, or was unable to be opened. This
>>>>>>>>>>> means that this component is either not installed or is unable to be
>>>>>>>>>>> used on your system (e.g., sometimes this means that shared libraries
>>>>>>>>>>> that the component requires are unable to be found/loaded). Note that
>>>>>>>>>>> Open MPI stopped checking at the first component that it did not find.
>>>>>>>>>>>
>>>>>>>>>>> Host: JARVICE
>>>>>>>>>>> Framework: mtl
>>>>>>>>>>> Component: mxm
>>>>>>>>>>>
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> mpirun noticed that process rank 0 with PID 8398 on node JARVICE
>>>>>>>>>>> exited on signal 11 (Segmentation fault).
>>>>>>>>>>>
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> Subhra.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Apr 10, 2015 at 12:12 AM, Mike Dubman <
>>>>>>>>>>> mi...@dev.mellanox.co.il> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> No need for IPoIB; mxm uses native IB.
>>>>>>>>>>>>
>>>>>>>>>>>> Please see the HPCX (pre-compiled ompi, integrated with MXM and
>>>>>>>>>>>> FCA) README file for details on how to compile/select.
>>>>>>>>>>>>
>>>>>>>>>>>> The default transport is UD for internode communication and
>>>>>>>>>>>> shared-memory for intra-node.
>>>>>>>>>>>>
>>>>>>>>>>>> http://bgate.mellanox.com/products/hpcx/
>>>>>>>>>>>>
>>>>>>>>>>>> Also, mxm is included in the Mellanox OFED.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Apr 10, 2015 at 5:26 AM, Subhra Mazumdar <
>>>>>>>>>>>> subhramazumd...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does ipoib need to be configured on the ib cards for mxm (I
>>>>>>>>>>>>> have a separate ethernet connection too)? Also are there special
>>>>>>>>>>>>> flags in mpirun to select from UD/RC/DC? What is the default?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Subhra.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Mar 31, 2015 at 9:46 AM, Mike Dubman <
>>>>>>>>>>>>> mi...@dev.mellanox.co.il> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> mxm uses IB rdma/roce technologies. One can select UD/RC/DC
>>>>>>>>>>>>>> transports to be used in mxm.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> By selecting mxm, all MPI p2p routines will be mapped to
>>>>>>>>>>>>>> appropriate mxm functions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> M
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Mar 30, 2015 at 7:32 PM, Subhra Mazumdar <
>>>>>>>>>>>>>> subhramazumd...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Mike,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Does the mxm mtl use infiniband rdma? Also from programming
>>>>>>>>>>>>>>> perspective, do I need to use anything else other than
>>>>>>>>>>>>>>> MPI_Send/MPI_Recv?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Subhra.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sun, Mar 29, 2015 at 11:14 PM, Mike Dubman <
>>>>>>>>>>>>>>> mi...@dev.mellanox.co.il> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>> openib btl does not support this thread model.
>>>>>>>>>>>>>>>> You can use OMPI w/ mxm (-mca mtl mxm) and multiple thread
>>>>>>>>>>>>>>>> mode in the 1.8.x series or (-mca pml yalla) in the master branch.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> M
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Mar 30, 2015 at 9:09 AM, Subhra Mazumdar <
>>>>>>>>>>>>>>>> subhramazumd...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Can MPI_THREAD_MULTIPLE and openib btl work together in
>>>>>>>>>>>>>>>>> open mpi 1.8.4? If so are there any command line options
>>>>>>>>>>>>>>>>> needed during run time?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Subhra.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>> users mailing list
>>>>>>>>>>>>>>>>> us...@open-mpi.org
>>>>>>>>>>>>>>>>> Subscription:
>>>>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>>>>>>>> Link to this post:
>>>>>>>>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/03/26574.php
--
Kind Regards,
M.