scif is an OFA device from Intel. Can you please select the Mellanox device explicitly with "export MXM_IB_PORTS=mlx4_0:1" and retry?
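A minimal sketch of that check (the device/port pair mlx4_0:1 is taken from the ibv_devinfo output quoted below; substitute whatever active Mellanox port your own ibv_devinfo reports):

```shell
# Pin MXM to the Mellanox HCA so it stops probing the non-Mellanox scif0.
export MXM_IB_PORTS=mlx4_0:1
echo "MXM_IB_PORTS=$MXM_IB_PORTS"   # prints: MXM_IB_PORTS=mlx4_0:1

# Then retry the latency test from the thread:
#   node1$ ./mxm_perftest
#   node2$ ./mxm_perftest node1 -t send_lat
```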
On Mon, May 25, 2015 at 8:26 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:

> Hi, Mike,
> that is what I have:
>
> $ echo $LD_LIBRARY_PATH | tr ":" "\n"
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/fca/lib
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/hcoll/lib
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/mxm/lib
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib
> + intel compiler paths
>
> $ echo $OPAL_PREFIX
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8
>
> I don't use LD_PRELOAD.
>
> In the attached file (ompi_info.out) you will find the output of the ompi_info -l 9 command.
>
> P.S.
> node1 $ ./mxm_perftest
> node2 $ ./mxm_perftest node1 -t send_lat
> [1432568685.067067] [node151:87372:0] shm.c:65 MXM WARN Could not open the KNEM device file /dev/knem : No such file or directory. Won't use knem. (I don't have knem)
> [1432568685.069699] [node151:87372:0] ib_dev.c:531 MXM WARN skipping device scif0 (vendor_id/part_id = 0x8086/0x0) - not a Mellanox device (???)
> Failed to create endpoint: No such device
>
> $ ibv_devinfo
> hca_id: mlx4_0
>         transport:              InfiniBand (0)
>         fw_ver:                 2.10.600
>         node_guid:              0002:c903:00a1:13b0
>         sys_image_guid:         0002:c903:00a1:13b3
>         vendor_id:              0x02c9
>         vendor_part_id:         4099
>         hw_ver:                 0x0
>         board_id:               MT_1090120019
>         phys_port_cnt:          2
>                 port:   1
>                         state:          PORT_ACTIVE (4)
>                         max_mtu:        4096 (5)
>                         active_mtu:     4096 (5)
>                         sm_lid:         1
>                         port_lid:       83
>                         port_lmc:       0x00
>
>                 port:   2
>                         state:          PORT_DOWN (1)
>                         max_mtu:        4096 (5)
>                         active_mtu:     4096 (5)
>                         sm_lid:         0
>                         port_lid:       0
>                         port_lmc:       0x00
>
> Best regards,
> Timur.
>
> Monday, May 25, 2015, 19:39 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>
> Hi Timur,
> seems that the yalla component was not found in your OMPI tree.
> Can it be that your mpirun is not from hpcx? Can you please check
> LD_LIBRARY_PATH, PATH, LD_PRELOAD and OPAL_PREFIX, that they are pointing
> to the right mpirun?
>
> Also, could you please check that yalla is present in the ompi_info -l 9
> output?
>
> Thanks
>
> On Mon, May 25, 2015 at 7:11 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
> I can password-less ssh to all nodes:
> base$ ssh node1
> node1$ ssh node2
> Last login: Mon May 25 18:41:23
> node2$ ssh node3
> Last login: Mon May 25 16:25:01
> node3$ ssh node4
> Last login: Mon May 25 16:27:04
> node4$
>
> Is this correct?
>
> In ompi-1.9 I do not have the no-tree-spawn problem.
>
> Monday, May 25, 2015, 9:04 -07:00 from Ralph Castain <r...@open-mpi.org>:
>
> I can't speak to the mxm problem, but the no-tree-spawn issue indicates
> that you don't have password-less ssh authorized between the compute nodes.
>
> On May 25, 2015, at 8:55 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
> Hello!
>
> I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
> OFED-1.5.4.1;
> CentOS release 6.2;
> InfiniBand 4x FDR
>
> I have two problems:
>
> 1. I can not use mxm:
> 1.a) $ mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened.  This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded).  Note that
> Open MPI stopped checking at the first component that it did not find.
>   Host:      node14
>   Framework: pml
>   Component: yalla
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   mca_pml_base_open() failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> *** An error occurred in MPI_Init
> [node28:102377] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
> and not able to guarantee that all other processes were killed!
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node29:105600] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
> and not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node5:102409] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
> and not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node14:85284] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
> and not able to guarantee that all other processes were killed!
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus causing
> the job to be terminated.  The first process to do so was:
>
>   Process name: [[9372,1],2]
>   Exit code:    1
> --------------------------------------------------------------------------
> [login:08295] 3 more processes have sent help message help-mca-base.txt / find-available:not-valid
> [login:08295] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> [login:08295] 3 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
>
> 1.b) $ mpirun --mca pml yalla -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened.  This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded).  Note that
> Open MPI stopped checking at the first component that it did not find.
>   Host:      node5
>   Framework: pml
>   Component: yalla
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node5:102449] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
> and not able to guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   mca_pml_base_open() failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node14:85325] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
> and not able to guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus causing
> the job to be terminated.
> The first process to do so was:
>
>   Process name: [[9619,1],0]
>   Exit code:    1
> --------------------------------------------------------------------------
> [login:08552] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
> [login:08552] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>
> 2. I can not remove -mca plm_rsh_no_tree_spawn 1 from the mpirun cmd line:
> $ mpirun -host node5,node14,node28,node29 -np 4 ./hello
> sh: -c: line 0: syntax error near unexpected token `--tree-spawn'
> sh: -c: line 0: `( test ! -r ./.profile || . ./.profile; OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 ; export OPAL_PREFIX; PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin/orted --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env" -mca orte_ess_jobid "625606656" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5" -mca orte_parent_uri "625606656.1;tcp://10.65.0.105,10.64.0.105,10.67.0.105:56862" -mca orte_hnp_uri "625606656.0;tcp://10.65.0.2,10.67.0.2,83.149.214.101,10.64.0.2:54893" --mca pml "yalla" -mca plm_rsh_no_tree_spawn "0" -mca plm "rsh" ) --tree-spawn'
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
>
> * compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
> mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>
> Thank you for your comments.
>
> Best regards,
> Timur.
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26919.php
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26922.php
>
> --
> Kind Regards,
> M.

--
Kind Regards,

M.
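P.S. For the no-tree-spawn issue Ralph diagnosed in the quoted thread, a hedged sketch of one common way to enable password-less ssh between compute nodes, assuming a home directory shared across nodes (as the GPFS paths above suggest), so that a single key pair authorizes logins on every node:

```shell
# Generate a passphrase-less key pair once, if none exists yet.
mkdir -p ~/.ssh && chmod 700 ~/.ssh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa

# Append the public key to authorized_keys (idempotent: skip if present).
grep -qxF "$(cat ~/.ssh/id_rsa.pub)" ~/.ssh/authorized_keys 2>/dev/null \
  || cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# sshd ignores authorized_keys with loose permissions.
chmod 600 ~/.ssh/authorized_keys
```

Afterwards, verify the compute-node-to-compute-node hops interactively (node1$ ssh node2, and so on, as in your test above), since tree spawn logs in from compute node to compute node, not only from the login node.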