Can you send the output of a failed run, including your command line?
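
Something along these lines should capture it (the MCA verbosity settings are
only a suggestion to get more detail out of the launch path; adjust or drop
them as needed):

mpirun -np 128 --mca plm_base_verbose 10 --mca odls_base_verbose 10 bin/xhpcg 2>&1 | tee failed_run.log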

Josh

On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users <
users@lists.open-mpi.org> wrote:

> Okay, so this is a problem with the Mellanox software - copying Artem.
>
> On Jan 28, 2020, at 8:15 AM, Collin Strassburger <cstrassbur...@bihrle.com>
> wrote:
>
> I just tried that, and it does indeed work with pbs and without Mellanox
> (until a reboot, at which point it complains about Mellanox/IB-related
> defaults because no drivers were installed, etc.).
>
> After installing the Mellanox drivers, I used
> ./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --with-ucx
> --with-platform=contrib/platform/mellanox/optimized
>
> With the new compile it fails on the higher core counts.
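>
> (As a quick sanity check on what that build actually picked up, assuming the
> new install's ompi_info is first in PATH, something like
>
> ompi_info | grep -i -e ucx -e ': tm'
>
> should list the UCX and tm components that were compiled in.)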
>
>
> Collin
>
> *From:* users <users-boun...@lists.open-mpi.org> *On Behalf Of *Ralph
> Castain via users
> *Sent:* Tuesday, January 28, 2020 11:02 AM
> *To:* Open MPI Users <users@lists.open-mpi.org>
> *Cc:* Ralph Castain <r...@open-mpi.org>
> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
> 7742 when utilizing 100+ processors per node
>
> Does it work with pbs but not Mellanox? Just trying to isolate the problem.
>
>
>
> On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users <
> users@lists.open-mpi.org> wrote:
>
> Hello,
>
> I have done some additional testing and I can say that it works correctly
> with gcc8 and no Mellanox or pbs installed.
>
> I have done two runs with Mellanox and pbs installed.  One run includes the
> actual run options I will be using, while the other includes a truncated set
> which still compiles but fails to execute correctly.  As the configuration
> with the actual run options produces a smaller config log, I am including it
> here.
>
> Version: 4.0.2
> The config log is available at
> https://gist.github.com/BTemp1282020/fedca1aeed3b57296b8f21688ccae31c and
> the ompi dump is available at https://pastebin.com/md3HwTUR.
>
> The IB network information (which the runs are not explicitly operating across):
> Packages: MLNX_OFED and Mellanox HPC-X, both are current versions
> (MLNX_OFED_LINUX-4.7-3.2.9.0-rhel8.1-x86_64 and
> hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat8.1-x86_64)
> ulimit -l = unlimited
> ibv_devinfo:
> hca_id: mlx4_0
>         transport:                      InfiniBand (0)
>         fw_ver:                         2.42.5000
> …
>         vendor_id:                      0x02c9
>         vendor_part_id:                 4099
>         hw_ver:                         0x1
>         board_id:                       MT_1100120019
>         phys_port_cnt:                  1
>         Device ports:
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                4096 (5)
>                         active_mtu:             4096 (5)
>                         sm_lid:                 1
>                         port_lid:               12
>                         port_lmc:               0x00
>                         link_layer:             InfiniBand
> It looks like the rest of the IB information is in the config file.
>
> I hope this helps,
> Collin
>
>
>
> *From:* Jeff Squyres (jsquyres) <jsquy...@cisco.com>
> *Sent:* Monday, January 27, 2020 3:40 PM
> *To:* Open MPI User's List <users@lists.open-mpi.org>
> *Cc:* Collin Strassburger <cstrassbur...@bihrle.com>
> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
> 7742 when utilizing 100+ processors per node
>
> Can you please send all the information listed here:
>
>     https://www.open-mpi.org/community/help/
>
> Thanks!
>
>
>
>
> On Jan 27, 2020, at 12:00 PM, Collin Strassburger via users <
> users@lists.open-mpi.org> wrote:
>
> Hello,
>
> I had initially thought the same thing about the streams, but I have 2
> sockets with 64 cores each.  Additionally, I have not yet turned
> multithreading off, so lscpu reports a total of 256 logical cores and 128
> physical cores.  As such, I don’t see how it could be running out of
> streams unless something is being passed incorrectly.
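>
> (For reference, the topology can be confirmed with something like
>
> lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core'
>
> which on this machine should report 2 sockets, 64 cores per socket, and 2
> threads per core.)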
>
> Collin
>
> *From:* users <users-boun...@lists.open-mpi.org> *On Behalf Of *Ray
> Sheppard via users
> *Sent:* Monday, January 27, 2020 11:53 AM
> *To:* users@lists.open-mpi.org
> *Cc:* Ray Sheppard <rshep...@iu.edu>
> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD
> 7742 when utilizing 100+ processors per node
>
>
> Hi All,
>   Just my two cents: I think error code 63 is saying it is running out of
> streams to use.  I think you have only 64 cores, so at 100 you are
> overloading most of them.  It feels like you are running out of resources
> trying to swap ranks in and out on the physical cores.
>    Ray
> On 1/27/2020 11:29 AM, Collin Strassburger via users wrote:
>
> Hello Howard,
>
> To remove potential interactions, I have found that the issue persists
> without ucx and hcoll support.
>
> Run command: mpirun -np 128 bin/xhpcg
> Output:
> --------------------------------------------------------------------------
> mpirun was unable to start the specified application as it encountered an
> error:
>
> Error code: 63
> Error name: (null)
> Node: Gen2Node4
>
> when attempting to start process rank 0.
> --------------------------------------------------------------------------
> 128 total processes failed to start
>
> It returns this error for any job I launch with >100 processes per node.  I
> get the same error message for multiple different codes, so the error is
> MPI-related rather than program-specific.
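>
> (A quick check that would confirm it is the launch itself rather than
> anything the applications do: launch a trivial non-MPI executable at the
> same scale, e.g.
>
> mpirun -np 101 hostname
>
> If that also fails with error code 63, the problem is in the launch path
> rather than in the applications.)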
>
> Collin
>
> *From:* Howard Pritchard <hpprit...@gmail.com> <hpprit...@gmail.com>
> *Sent:* Monday, January 27, 2020 11:20 AM
> *To:* Open MPI Users <users@lists.open-mpi.org> <users@lists.open-mpi.org>
> *Cc:* Collin Strassburger <cstrassbur...@bihrle.com>
> <cstrassbur...@bihrle.com>
> *Subject:* Re: [OMPI users] OMPI returns error 63 on AMD 7742 when
> utilizing 100+ processors per node
>
> Hello Collin,
>
> Could you provide more information about the error?  Is there any output
> from either Open MPI or perhaps UCX that could shed more light on the
> problem you are hitting?
>
> Howard
>
>
> On Mon, Jan 27, 2020 at 08:38, Collin Strassburger via users <
> users@lists.open-mpi.org> wrote:
>
> Hello,
>
> I am having difficulty with Open MPI versions 4.0.2 and 3.1.5.  Both of
> these versions cause the same error (error code 63) when utilizing more
> than 100 cores on a single node.  The processors I am utilizing are AMD
> EPYC “Rome” 7742s.  The OS is CentOS 8.1.  I have tried compiling with both
> the default gcc 8 and locally compiled gcc 9.  I have already tried
> modifying the maximum name field values with no success.
>
> My compile options are:
> ./configure
>      --prefix=${HPCX_HOME}/ompi
>      --with-platform=contrib/platform/mellanox/optimized
>
> Any assistance would be appreciated,
> Collin
>
> Collin Strassburger
> Bihrle Applied Research Inc.
>
>
>
>
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
>
>
