I can’t speak to the mxm problem, but the no-tree-spawn issue (your problem 2) indicates that you don’t have password-less ssh authorized between the compute nodes.
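A minimal sketch of setting up key-based, password-less ssh for the nodes named below (node5, node14, node28, node29 come from the quoted message; the shared-$HOME assumption and `SSH_DIR` override are illustrative, not part of the original post):

```shell
# Sketch: enable password-less ssh between compute nodes.
# SSH_DIR defaults to ~/.ssh; it is overridable only so the sketch can be
# tried without touching a real configuration.
SSH_DIR="${SSH_DIR:-$HOME/.ssh}"
mkdir -p "$SSH_DIR" && chmod 700 "$SSH_DIR"

# 1. A passphrase-less key (-N "") is what makes the login non-interactive.
[ -f "$SSH_DIR/id_rsa" ] || ssh-keygen -t rsa -b 2048 -N "" -f "$SSH_DIR/id_rsa" -q

# 2. With a shared $HOME (common on clusters), appending the public key to
#    authorized_keys once covers every node; otherwise run ssh-copy-id
#    against each node instead.
cat "$SSH_DIR/id_rsa.pub" >> "$SSH_DIR/authorized_keys"
chmod 600 "$SSH_DIR/authorized_keys"

# 3. Verify each hop: BatchMode=yes forbids password prompts, so the probe
#    succeeds only when key-based login really works.
for node in node5 node14 node28 node29; do
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true \
        && echo "$node: ok" || echo "$node: password-less login NOT working"
done
```

Tree spawn has each orted launch the next hop, so every compute node needs this, not just the head node.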
> On May 25, 2015, at 8:55 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
> Hello!
>
> I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
> OFED-1.5.4.1;
> CentOS release 6.2;
> infiniband 4x FDR
>
> I have two problems:
>
> 1. I cannot use mxm:
>
> 1.a) $ mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
>
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened. This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded). Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host:      node14
> Framework: pml
> Component: yalla
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   mca_pml_base_open() failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> *** An error occurred in MPI_Init
> [node28:102377] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [node29:105600] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [node5:102409] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [node14:85284] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus
> causing the job to be terminated. The first process to do so was:
>
>   Process name: [[9372,1],2]
>   Exit code:    1
> --------------------------------------------------------------------------
> [login:08295] 3 more processes have sent help message help-mca-base.txt / find-available:not-valid
> [login:08295] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> [login:08295] 3 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
>
> 1.b) $ mpirun --mca pml yalla -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
>
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened. This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded). Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host:      node5
> Framework: pml
> Component: yalla
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [node5:102449] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   mca_pml_base_open() failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [node14:85325] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus
> causing the job to be terminated. The first process to do so was:
>
>   Process name: [[9619,1],0]
>   Exit code:    1
> --------------------------------------------------------------------------
> [login:08552] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
> [login:08552] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>
> 2. I cannot remove -mca plm_rsh_no_tree_spawn 1 from the mpirun cmd line:
>
> $ mpirun -host node5,node14,node28,node29 -np 4 ./hello
> sh: -c: line 0: syntax error near unexpected token `--tree-spawn'
> sh: -c: line 0: `( test ! -r ./.profile || . ./.profile; OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 ; export OPAL_PREFIX; PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin/orted --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env" -mca orte_ess_jobid "625606656" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5" -mca orte_parent_uri "625606656.1;tcp://10.65.0.105,10.64.0.105,10.67.0.105:56862" -mca orte_hnp_uri "625606656.0;tcp://10.65.0.2,10.67.0.2,83.149.214.101,10.64.0.2:54893" --mca pml "yalla" -mca plm_rsh_no_tree_spawn "0" -mca plm "rsh" ) --tree-spawn'
>
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
>
> * compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
> mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>
> Thank you for your comments.
>
> Best regards,
> Timur.
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26919.php
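P.S. On the mxm/yalla failure quoted above: the "component was not found" help text usually means the plugin's shared object is missing or its libraries don't resolve on that node. A sketch of a per-node check, assuming only the HPC-X install prefix that appears in the quoted orted command line (adjust for your system):

```shell
# Sketch: check why "pml yalla" fails to open. Run on each node that
# reported the error (node5, node14, node28, node29 in the transcript).
# The default prefix below is the HPC-X path from the quoted output.
prefix="${OPAL_PREFIX:-/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8}"
plugin="$prefix/lib/openmpi/mca_pml_yalla.so"

if [ -f "$plugin" ]; then
    # Component is installed; the usual remaining cause is an MXM library
    # that fails to resolve at load time on this particular node.
    ldd "$plugin" | grep -E 'libmxm|not found' || echo "dependencies resolve"
else
    echo "missing: $plugin (component not installed under this prefix)"
fi
```

If the plugin is present but `ldd` reports "not found" for libmxm, the MXM library path needs to be in LD_LIBRARY_PATH on the compute nodes, not just on the login node.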