scif is an OFA device from Intel. Can you please select the Mellanox device explicitly with "export MXM_IB_PORTS=mlx4_0:1" and retry?
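A minimal sketch of that check (the device/port pair mlx4_0:1 is taken from the ibv_devinfo output quoted below; substitute whatever active Mellanox port your own ibv_devinfo reports):

```shell
# Pin MXM to the Mellanox HCA so it stops probing the non-Mellanox scif0.
export MXM_IB_PORTS=mlx4_0:1
echo "MXM_IB_PORTS=$MXM_IB_PORTS"   # prints: MXM_IB_PORTS=mlx4_0:1

# Then retry the latency test from the thread:
#   node1$ ./mxm_perftest
#   node2$ ./mxm_perftest node1 -t send_lat
```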
On Mon, May 25, 2015 at 8:26 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:

> Hi, Mike,
> that is what I have:
>
> $ echo $LD_LIBRARY_PATH | tr ":" "\n"
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/fca/lib
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/hcoll/lib
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/mxm/lib
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib
> + intel compiler paths
>
> $ echo $OPAL_PREFIX
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8
>
> I don't use LD_PRELOAD.
>
> In the attached file (ompi_info.out) you will find the output of the ompi_info -l 9 command.
>
> P.S.
> node1 $ ./mxm_perftest
> node2 $ ./mxm_perftest node1 -t send_lat
> [1432568685.067067] [node151:87372:0] shm.c:65 MXM WARN Could not open the KNEM device file /dev/knem : No such file or directory. Won't use knem. (I don't have knem)
> [1432568685.069699] [node151:87372:0] ib_dev.c:531 MXM WARN skipping device scif0 (vendor_id/part_id = 0x8086/0x0) - not a Mellanox device (???)
> Failed to create endpoint: No such device
>
> $ ibv_devinfo
> hca_id: mlx4_0
>         transport:              InfiniBand (0)
>         fw_ver:                 2.10.600
>         node_guid:              0002:c903:00a1:13b0
>         sys_image_guid:         0002:c903:00a1:13b3
>         vendor_id:              0x02c9
>         vendor_part_id:         4099
>         hw_ver:                 0x0
>         board_id:               MT_1090120019
>         phys_port_cnt:          2
>                 port:   1
>                         state:          PORT_ACTIVE (4)
>                         max_mtu:        4096 (5)
>                         active_mtu:     4096 (5)
>                         sm_lid:         1
>                         port_lid:       83
>                         port_lmc:       0x00
>
>                 port:   2
>                         state:          PORT_DOWN (1)
>                         max_mtu:        4096 (5)
>                         active_mtu:     4096 (5)
>                         sm_lid:         0
>                         port_lid:       0
>                         port_lmc:       0x00
>
> Best regards,
> Timur.
>
> Monday, May 25, 2015, 19:39 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>
> Hi Timur,
> seems that the yalla component was not found in your OMPI tree.
> Can it be that your mpirun is not from hpcx? Can you please check
> LD_LIBRARY_PATH, PATH, LD_PRELOAD and OPAL_PREFIX, that they are pointing
> to the right mpirun?
>
> Also, could you please check that yalla is present in the ompi_info -l 9
> output?
>
> Thanks
>
> On Mon, May 25, 2015 at 7:11 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
> I can password-less ssh to all nodes:
> base$ ssh node1
> node1$ ssh node2
> Last login: Mon May 25 18:41:23
> node2$ ssh node3
> Last login: Mon May 25 16:25:01
> node3$ ssh node4
> Last login: Mon May 25 16:27:04
> node4$
>
> Is this correct?
>
> In ompi-1.9 I do not have the no-tree-spawn problem.
>
> Monday, May 25, 2015, 9:04 -07:00 from Ralph Castain <r...@open-mpi.org>:
>
> I can't speak to the mxm problem, but the no-tree-spawn issue indicates
> that you don't have password-less ssh authorized between the compute nodes.
>
> On May 25, 2015, at 8:55 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
> Hello!
>
> I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
> OFED-1.5.4.1;
> CentOS release 6.2;
> InfiniBand 4x FDR
>
> I have two problems:
>
> 1. I can not use mxm:
> 1.a) $ mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened.  This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded).  Note that
> Open MPI stopped checking at the first component that it did not find.
>   Host:      node14
>   Framework: pml
>   Component: yalla
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   mca_pml_base_open() failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> *** An error occurred in MPI_Init
> [node28:102377] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
> and not able to guarantee that all other processes were killed!
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node29:105600] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
> and not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node5:102409] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
> and not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node14:85284] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
> and not able to guarantee that all other processes were killed!
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus causing
> the job to be terminated.  The first process to do so was:
>
>   Process name: [[9372,1],2]
>   Exit code:    1
> --------------------------------------------------------------------------
> [login:08295] 3 more processes have sent help message help-mca-base.txt / find-available:not-valid
> [login:08295] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> [login:08295] 3 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
>
> 1.b) $ mpirun --mca pml yalla -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened.  This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded).  Note that
> Open MPI stopped checking at the first component that it did not find.
>   Host:      node5
>   Framework: pml
>   Component: yalla
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node5:102449] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
> and not able to guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   mca_pml_base_open() failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node14:85325] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
> and not able to guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus causing
> the job to be terminated.
> The first process to do so was:
>
>   Process name: [[9619,1],0]
>   Exit code:    1
> --------------------------------------------------------------------------
> [login:08552] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
> [login:08552] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>
> 2. I can not remove -mca plm_rsh_no_tree_spawn 1 from the mpirun cmd line:
> $ mpirun -host node5,node14,node28,node29 -np 4 ./hello
> sh: -c: line 0: syntax error near unexpected token `--tree-spawn'
> sh: -c: line 0: `( test ! -r ./.profile || . ./.profile; OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 ; export OPAL_PREFIX; PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin/orted --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env" -mca orte_ess_jobid "625606656" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5" -mca orte_parent_uri "625606656.1;tcp://10.65.0.105,10.64.0.105,10.67.0.105:56862" -mca orte_hnp_uri "625606656.0;tcp://10.65.0.2,10.67.0.2,83.149.214.101,10.64.0.2:54893" --mca pml "yalla" -mca plm_rsh_no_tree_spawn "0" -mca plm "rsh" ) --tree-spawn'
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
>
> * compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
> mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>
> Thank you for your comments.
>
> Best regards,
> Timur.
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26919.php
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26922.php
>
> --
> Kind Regards,
> M.

--
Kind Regards,

M.
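P.S. For the no-tree-spawn issue Ralph diagnosed in the quoted thread, a hedged sketch of one common way to enable password-less ssh between compute nodes, assuming a home directory shared across nodes (as the GPFS paths above suggest), so that a single key pair authorizes logins on every node:

```shell
# Generate a passphrase-less key pair once, if none exists yet.
mkdir -p ~/.ssh && chmod 700 ~/.ssh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa

# Append the public key to authorized_keys (idempotent: skip if present).
grep -qxF "$(cat ~/.ssh/id_rsa.pub)" ~/.ssh/authorized_keys 2>/dev/null \
  || cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# sshd ignores authorized_keys with loose permissions.
chmod 600 ~/.ssh/authorized_keys
```

Afterwards, verify the compute-node-to-compute-node hops interactively (node1$ ssh node2, and so on, as in your test above), since tree spawn logs in from compute node to compute node, not only from the login node.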