Sorry, typo, try:

mpirun -np 128 --debug-daemons -mca plm rsh hostname
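
If that still fails, bumping the launcher verbosity might show where the
daemons are dying (a generic diagnostic sketch, assuming the usual MCA
verbosity parameters are available in your build):

mpirun -np 128 --debug-daemons -mca plm rsh -mca plm_base_verbose 10 hostname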

Josh

On Tue, Jan 28, 2020 at 12:45 PM Joshua Ladd <jladd.m...@gmail.com> wrote:

> And if you try:
> mpirun -np 128 --debug-daemons -plm rsh hostname
>
> Josh
>
> On Tue, Jan 28, 2020 at 12:34 PM Collin Strassburger <
> cstrassbur...@bihrle.com> wrote:
>
>> Input:   mpirun -np 128 --debug-daemons hostname
>>
>>
>>
>> Output:
>>
>> [Gen2Node3:54023] [[16659,0],0] orted_cmd: received add_local_procs
>>
>> [Gen2Node3:54023] [[16659,0],0] orted_cmd: received exit cmd
>>
>> [Gen2Node3:54023] [[16659,0],0] orted_cmd: all routes and children gone -
>> exiting
>>
>> --------------------------------------------------------------------------
>>
>> mpirun was unable to start the specified application as it encountered an
>>
>> error:
>>
>>
>>
>> Error code: 63
>>
>> Error name: (null)
>>
>> Node: Gen2Node3
>>
>>
>>
>> when attempting to start process rank 0.
>>
>> --------------------------------------------------------------------------
>>
>>
>>
>> Collin
>>
>>
>>
>> *From:* Joshua Ladd <jladd.m...@gmail.com>
>> *Sent:* Tuesday, January 28, 2020 12:31 PM
>> *To:* Collin Strassburger <cstrassbur...@bihrle.com>
>> *Cc:* Open MPI Users <users@lists.open-mpi.org>; Ralph Castain <
>> r...@open-mpi.org>
>> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
>> 7742 when utilizing 100+ processors per node
>>
>>
>>
>> Interesting. Can you try:
>>
>>
>>
>> mpirun -np 128 --debug-daemons hostname
>>
>>
>>
>> Josh
>>
>>
>>
>> On Tue, Jan 28, 2020 at 12:14 PM Collin Strassburger <
>> cstrassbur...@bihrle.com> wrote:
>>
>> In relation to the multi-node attempt, I haven’t set that up yet because
>> the per-node configuration doesn’t pass its tests (full node utilization,
>> etc.).
>>
>>
>>
>> Here are the results for the hostname test:
>>
>> Input: mpirun -np 128 hostname
>>
>>
>>
>> Output:
>>
>> --------------------------------------------------------------------------
>>
>> mpirun was unable to start the specified application as it encountered an
>>
>> error:
>>
>>
>>
>> Error code: 63
>>
>> Error name: (null)
>>
>> Node: Gen2Node3
>>
>>
>>
>> when attempting to start process rank 0.
>>
>> --------------------------------------------------------------------------
>>
>> 128 total processes failed to start
>>
>>
>>
>>
>>
>> Collin
>>
>>
>>
>>
>>
>> *From:* users <users-boun...@lists.open-mpi.org> *On Behalf Of *Ralph
>> Castain via users
>> *Sent:* Tuesday, January 28, 2020 12:06 PM
>> *To:* Joshua Ladd <jladd.m...@gmail.com>
>> *Cc:* Ralph Castain <r...@open-mpi.org>; Open MPI Users <
>> users@lists.open-mpi.org>
>> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
>> 7742 when utilizing 100+ processors per node
>>
>>
>>
>> Josh - if you read through the thread, you will see that disabling the
>> Mellanox/IB drivers allows the program to run. It only fails when they are
>> enabled.
>>
>>
>>
>>
>>
>> On Jan 28, 2020, at 8:49 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>
>>
>>
>> I don't see how this can be diagnosed as a "problem with the Mellanox
>> Software". This is on a single node. What happens when you try to launch on
>> more than one node?
>>
>>
>>
>> Josh
>>
>>
>>
>> On Tue, Jan 28, 2020 at 11:43 AM Collin Strassburger <
>> cstrassbur...@bihrle.com> wrote:
>>
>> Here’s the I/O for these high local core count runs. (“xhpcg” is the
>> standard HPCG benchmark.)
>>
>>
>>
>> Run command: mpirun -np 128 bin/xhpcg
>>
>> Output:
>>
>> --------------------------------------------------------------------------
>>
>> mpirun was unable to start the specified application as it encountered an
>>
>> error:
>>
>>
>>
>> Error code: 63
>>
>> Error name: (null)
>>
>> Node: Gen2Node4
>>
>>
>>
>> when attempting to start process rank 0.
>>
>> --------------------------------------------------------------------------
>>
>> 128 total processes failed to start
>>
>>
>>
>>
>>
>> Collin
>>
>>
>>
>> *From:* Joshua Ladd <jladd.m...@gmail.com>
>> *Sent:* Tuesday, January 28, 2020 11:39 AM
>> *To:* Open MPI Users <users@lists.open-mpi.org>
>> *Cc:* Collin Strassburger <cstrassbur...@bihrle.com>; Ralph Castain <
>> r...@open-mpi.org>; Artem Polyakov <art...@mellanox.com>
>> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
>> 7742 when utilizing 100+ processors per node
>>
>>
>>
>> Can you send the output of a failed run, including your command line?
>>
>>
>>
>> Josh
>>
>>
>>
>> On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users <
>> users@lists.open-mpi.org> wrote:
>>
>> Okay, so this is a problem with the Mellanox software - copying Artem.
>>
>>
>>
>> On Jan 28, 2020, at 8:15 AM, Collin Strassburger <
>> cstrassbur...@bihrle.com> wrote:
>>
>>
>>
>> I just tried that and it does indeed work with PBS and without Mellanox
>> (until a reboot makes it complain about Mellanox/IB-related defaults, since no
>> drivers were installed, etc.).
>>
>>
>>
>> After installing the Mellanox drivers, I used
>>
>> ./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --with-ucx
>> --with-platform=contrib/platform/mellanox/optimized
>>
>>
>>
>> With the new compile it fails on the higher core counts.
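>>
>> As a sanity check, I can also dump what this build actually picked up for
>> the launch and transport components (a quick sketch, assuming ompi_info is
>> on the PATH):
>>
>> ompi_info | grep -E "MCA (plm|pml|btl|coll)"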
>>
>>
>>
>>
>>
>> Collin
>>
>>
>>
>> *From:* users <users-boun...@lists.open-mpi.org> *On Behalf Of *Ralph
>> Castain via users
>> *Sent:* Tuesday, January 28, 2020 11:02 AM
>> *To:* Open MPI Users <users@lists.open-mpi.org>
>> *Cc:* Ralph Castain <r...@open-mpi.org>
>> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
>> 7742 when utilizing 100+ processors per node
>>
>>
>>
>> Does it work with pbs but not Mellanox? Just trying to isolate the
>> problem.
>>
>>
>>
>>
>>
>> On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users <
>> users@lists.open-mpi.org> wrote:
>>
>>
>>
>> Hello,
>>
>>
>>
>> I have done some additional testing and I can say that it works correctly
>> with gcc 8 and no Mellanox or PBS installed.
>>
>>
>>
>> I have done two runs with Mellanox and PBS installed.  One run
>> includes the actual run options I will be using, while the other includes a
>> truncated set which still compiles but fails to execute correctly.  As the
>> run with the actual options results in a smaller config log, I am
>> including it here.
>>
>>
>>
>> Version: 4.0.2
>>
>> The config log is available at
>> https://gist.github.com/BTemp1282020/fedca1aeed3b57296b8f21688ccae31c and
>> the ompi dump is available at https://pastebin.com/md3HwTUR.
>>
>>
>>
>> The IB network information (the runs are not explicitly communicating over
>> IB):
>>
>> Packages: MLNX_OFED and Mellanox HPC-X, both are current versions
>> (MLNX_OFED_LINUX-4.7-3.2.9.0-rhel8.1-x86_64 and
>> hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat8.1-x86_64)
>>
>> ulimit -l = unlimited
>>
>> ibv_devinfo:
>>
>> hca_id: mlx4_0
>>
>>         transport:                      InfiniBand (0)
>>
>>         fw_ver:                         2.42.5000
>>
>> …
>>
>>         vendor_id:                      0x02c9
>>
>>         vendor_part_id:                 4099
>>
>>         hw_ver:                         0x1
>>
>>         board_id:                       MT_1100120019
>>
>>         phys_port_cnt:                  1
>>
>>         Device ports:
>>
>>                 port:   1
>>
>>                         state:                  PORT_ACTIVE (4)
>>
>>                         max_mtu:                4096 (5)
>>
>>                         active_mtu:             4096 (5)
>>
>>                         sm_lid:                 1
>>
>>                         port_lid:               12
>>
>>                         port_lmc:               0x00
>>
>>                         link_layer:             InfiniBand
>>
>> It looks like the rest of the IB information is in the config file.
>>
>>
>>
>> I hope this helps,
>>
>> Collin
>>
>>
>>
>>
>>
>>
>>
>> *From:* Jeff Squyres (jsquyres) <jsquy...@cisco.com>
>> *Sent:* Monday, January 27, 2020 3:40 PM
>> *To:* Open MPI User's List <users@lists.open-mpi.org>
>> *Cc:* Collin Strassburger <cstrassbur...@bihrle.com>
>> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
>> 7742 when utilizing 100+ processors per node
>>
>>
>>
>> Can you please send all the information listed here:
>>
>>
>>
>>     https://www.open-mpi.org/community/help/
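>>
>> (In practice that usually means the config.log from your build plus the full
>> ompi_info output; roughly something like the following, though the page has
>> the definitive list:)
>>
>> ompi_info --all > ompi_info.out
>> tar czf ompi-debug.tar.gz config.log ompi_info.out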
>>
>>
>>
>> Thanks!
>>
>>
>>
>>
>>
>> On Jan 27, 2020, at 12:00 PM, Collin Strassburger via users <
>> users@lists.open-mpi.org> wrote:
>>
>>
>>
>> Hello,
>>
>>
>>
>> I had initially thought the same thing about the streams, but I have 2
>> sockets with 64 cores each.  Additionally, I have not yet turned
>> multithreading off, so lscpu reports a total of 256 logical cores and 128
>> physical cores.  As such, I don’t see how it could be running out of
>> streams unless something is being passed incorrectly.
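>>
>> If it would help rule that out, I can also pin ranks explicitly to physical
>> cores (a sketch using the standard mapping/binding options; adjust as needed):
>>
>> mpirun -np 128 --map-by core --bind-to core hostname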
>>
>>
>>
>> Collin
>>
>>
>>
>> *From:* users <users-boun...@lists.open-mpi.org> *On Behalf Of *Ray
>> Sheppard via users
>> *Sent:* Monday, January 27, 2020 11:53 AM
>> *To:* users@lists.open-mpi.org
>> *Cc:* Ray Sheppard <rshep...@iu.edu>
>> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
>> 7742 when utilizing 100+ processors per node
>>
>>
>>
>> Hi All,
>>   Just my two cents, I think error code 63 is saying it is running out of
>> streams to use.  I think you have only 64 cores, so at 100, you are
>> overloading most of them.  It feels like you are running out of resources
>> trying to swap in and out ranks on physical cores.
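>>   It might also be worth double-checking what the node actually reports;
>> nothing OMPI-specific, just standard tooling, for example:
>>
>> lscpu | grep -E 'Socket|Core|Thread|^CPU\(s\)'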
>>    Ray
>>
>> On 1/27/2020 11:29 AM, Collin Strassburger via users wrote:
>>
>>
>>
>> Hello Howard,
>>
>>
>>
>> To remove potential interactions, I have found that the issue persists
>> without ucx and hcoll support.
>>
>>
>>
>> Run command: mpirun -np 128 bin/xhpcg
>>
>> Output:
>>
>> --------------------------------------------------------------------------
>>
>> mpirun was unable to start the specified application as it encountered an
>>
>> error:
>>
>>
>>
>> Error code: 63
>>
>> Error name: (null)
>>
>> Node: Gen2Node4
>>
>>
>>
>> when attempting to start process rank 0.
>>
>> --------------------------------------------------------------------------
>>
>> 128 total processes failed to start
>>
>>
>>
>> It returns this error for any job I launch with >100 processes
>> per node.  I get the same error message for multiple different codes, so
>> the error is MPI-related rather than being program-specific.
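>>
>> If it is useful, I can also force the non-UCX transports at runtime in
>> addition to the rebuild (a sketch, assuming the ob1 PML and the self/vader
>> BTLs are present in this build):
>>
>> mpirun -np 128 -mca pml ob1 -mca btl self,vader -mca coll ^hcoll bin/xhpcg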
>>
>>
>>
>> Collin
>>
>>
>>
>> *From:* Howard Pritchard <hpprit...@gmail.com> <hpprit...@gmail.com>
>> *Sent:* Monday, January 27, 2020 11:20 AM
>> *To:* Open MPI Users <users@lists.open-mpi.org>
>> <users@lists.open-mpi.org>
>> *Cc:* Collin Strassburger <cstrassbur...@bihrle.com>
>> <cstrassbur...@bihrle.com>
>> *Subject:* Re: [OMPI users] OMPI returns error 63 on AMD 7742 when
>> utilizing 100+ processors per node
>>
>>
>>
>> Hello Collin,
>>
>>
>>
>> Could you provide more information about the error?  Is there any output
>> from either Open MPI or, perhaps, UCX that could shed more light on
>> the problem you are hitting?
>>
>>
>>
>> Howard
>>
>>
>>
>>
>>
>> Am Mo., 27. Jan. 2020 um 08:38 Uhr schrieb Collin Strassburger via users <
>> users@lists.open-mpi.org>:
>>
>> Hello,
>>
>>
>>
>> I am having difficulty with Open MPI versions 4.0.2 and 3.1.5.  Both of
>> these versions produce the same error (error code 63) when utilizing more
>> than 100 cores on a single node.  The processors I am using are AMD
>> EPYC “Rome” 7742s.  The OS is CentOS 8.1.  I have tried compiling with both
>> the default gcc 8 and a locally compiled gcc 9.  I have already tried
>> modifying the maximum name field values with no success.
>>
>>
>>
>> My compile options are:
>>
>> ./configure
>>
>>      --prefix=${HPCX_HOME}/ompi
>>
>>      --with-platform=contrib/platform/mellanox/optimized
>>
>>
>>
>> Any assistance would be appreciated,
>>
>> Collin
>>
>>
>>
>> Collin Strassburger
>>
>> Bihrle Applied Research Inc.
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>>
>>
>>
>>
>>
>>
