In relation to the multi-node attempt, I haven’t set that up yet, as the
per-node configuration doesn’t pass its tests (full node utilization, etc.).

Here are the results for the hostname test:
Input: mpirun -np 128 hostname

Output:
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node3

when attempting to start process rank 0.
--------------------------------------------------------------------------
128 total processes failed to start


Collin


From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via 
users
Sent: Tuesday, January 28, 2020 12:06 PM
To: Joshua Ladd <jladd.m...@gmail.com>
Cc: Ralph Castain <r...@open-mpi.org>; Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Josh - if you read through the thread, you will see that disabling the
Mellanox/IB drivers allows the program to run. It only fails when they are enabled.
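
For reference, the same isolation can be approximated at runtime without
rebuilding; this is only a sketch using the stock 4.x component names, not
necessarily how it was disabled above:

Force the non-IB transports:
    mpirun --mca pml ob1 --mca btl self,vader,tcp -np 128 bin/xhpcg
Exclude just the UCX PML and hcoll collectives:
    mpirun --mca pml ^ucx --mca coll ^hcoll -np 128 bin/xhpcg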



On Jan 28, 2020, at 8:49 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:

I don't see how this can be diagnosed as a "problem with the Mellanox 
Software". This is on a single node. What happens when you try to launch on 
more than one node?
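
For a quick cross-node sanity check, something along these lines should do
(the hostfile below is only an example, built from the node names already in
this thread):

hosts.txt:
    Gen2Node3 slots=128
    Gen2Node4 slots=128
Run command: mpirun -np 256 --hostfile hosts.txt hostname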

Josh

On Tue, Jan 28, 2020 at 11:43 AM Collin Strassburger <cstrassbur...@bihrle.com> wrote:
Here’s the I/O for these high local-core-count runs (“xhpcg” is the standard
HPCG benchmark):

Run command: mpirun -np 128 bin/xhpcg
Output:
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node4

when attempting to start process rank 0.
--------------------------------------------------------------------------
128 total processes failed to start


Collin

From: Joshua Ladd <jladd.m...@gmail.com>
Sent: Tuesday, January 28, 2020 11:39 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Collin Strassburger <cstrassbur...@bihrle.com>; Ralph Castain <r...@open-mpi.org>; Artem Polyakov <art...@mellanox.com>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Can you send the output of a failed run, including your command line?

Josh

On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users <users@lists.open-mpi.org> wrote:
Okay, so this is a problem with the Mellanox software - copying Artem.

On Jan 28, 2020, at 8:15 AM, Collin Strassburger <cstrassbur...@bihrle.com> wrote:

I just tried that, and it does indeed work with PBS and without Mellanox
(until a reboot makes it complain about Mellanox/IB-related defaults, since no
drivers were installed, etc.).

After installing the Mellanox drivers, I used
./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --with-ucx --with-platform=contrib/platform/mellanox/optimized

With the new compile it fails on the higher core counts.


Collin

From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via users
Sent: Tuesday, January 28, 2020 11:02 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Ralph Castain <r...@open-mpi.org>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Does it work with PBS but not Mellanox? Just trying to isolate the problem.


On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users <users@lists.open-mpi.org> wrote:

Hello,

I have done some additional testing, and I can say that it works correctly
with gcc 8 and no Mellanox or PBS installed.

I have done two runs with Mellanox and PBS installed.  One run uses the actual
options I will be using, while the other uses a truncated set that still
compiles but fails to execute correctly.  As the build with the actual options
produces a smaller config log, I am including it here.

Version: 4.0.2
The config log is available at
https://gist.github.com/BTemp1282020/fedca1aeed3b57296b8f21688ccae31c and the
ompi dump is available at https://pastebin.com/md3HwTUR.

The IB network information (the jobs are not explicitly run across this network):
Packages: MLNX_OFED and Mellanox HPC-X, both are current versions 
(MLNX_OFED_LINUX-4.7-3.2.9.0-rhel8.1-x86_64 and 
hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat8.1-x86_64)
ulimit -l = unlimited
ibv_devinfo:
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.42.5000
…
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x1
        board_id:                       MT_1100120019
        phys_port_cnt:                  1
        Device ports:
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               12
                        port_lmc:               0x00
                        link_layer:             InfiniBand
It looks like the rest of the IB information is in the config file.

I hope this helps,
Collin



From: Jeff Squyres (jsquyres) <jsquy...@cisco.com>
Sent: Monday, January 27, 2020 3:40 PM
To: Open MPI User's List <users@lists.open-mpi.org>
Cc: Collin Strassburger <cstrassbur...@bihrle.com>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Can you please send all the information listed here:

    https://www.open-mpi.org/community/help/
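
(In practice that is mostly the Open MPI version, the config.log from the
build tree, and the ompi_info output; the commands and file names below are
just an illustration:)

    ompi_info --all > ompi_info_all.txt
    tar czf ompi-debug.tar.gz config.log ompi_info_all.txt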

Thanks!


On Jan 27, 2020, at 12:00 PM, Collin Strassburger via users <users@lists.open-mpi.org> wrote:

Hello,

I had initially thought the same thing about the streams, but I have 2 sockets 
with 64 cores each.  Additionally, I have not yet turned multithreading off, so 
lscpu reports a total of 256 logical cores and 128 physical cores.  As such, I 
don’t see how it could be running out of streams unless something is being 
passed incorrectly.
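
For reference, the relevant lscpu fields on this machine (the command is only
a sketch; the values are the ones described above):

    lscpu | grep -E '^CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'
    CPU(s):                256
    Thread(s) per core:    2
    Core(s) per socket:    64
    Socket(s):             2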

Collin

From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ray Sheppard via users
Sent: Monday, January 27, 2020 11:53 AM
To: users@lists.open-mpi.org
Cc: Ray Sheppard <rshep...@iu.edu>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when 
utilizing 100+ processors per node

Hi All,
  Just my two cents: I think error code 63 is saying it is running out of
streams to use.  I think you have only 64 cores, so at 100 you are overloading
most of them.  It feels like you are running out of resources trying to swap
ranks in and out on the physical cores.
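
If it were purely a slot/oversubscription limit, relaxing that limit should
change the behavior; a quick check (the flag exists in both the 3.x and 4.x
series) would be something like:

    mpirun --oversubscribe -np 128 hostname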
   Ray
On 1/27/2020 11:29 AM, Collin Strassburger via users wrote:

Hello Howard,

To rule out potential interactions, I tested without UCX and hcoll support and
found that the issue persists.
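
(For concreteness, one way to build without those components would be the line
below; the exact flags used here are an assumption, with the prefix taken from
the original build:)

    ./configure --prefix=${HPCX_HOME}/ompi --without-ucx --without-hcoll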

Run command: mpirun -np 128 bin/xhpcg
Output:
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node4

when attempting to start process rank 0.
--------------------------------------------------------------------------
128 total processes failed to start

It returns this error for any job I launch with >100 processes per node.  I
get the same error message for multiple different codes, so the error is
MPI-related rather than program-specific.
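
(A quick way to pin down the exact threshold, with hostname standing in for
the real application; the process counts below are only illustrative:)

    for n in 96 100 101 104 112 128; do
        echo "== -np $n =="
        mpirun -np $n hostname > /dev/null && echo "started OK"
    done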

Collin

From: Howard Pritchard <hpprit...@gmail.com>
Sent: Monday, January 27, 2020 11:20 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Collin Strassburger <cstrassbur...@bihrle.com>
Subject: Re: [OMPI users] OMPI returns error 63 on AMD 7742 when utilizing 100+ 
processors per node

Hello Collin,

Could you provide more information about the error?  Is there any output from
either Open MPI or perhaps UCX that could shed more light on the problem you
are hitting?

Howard


On Mon, Jan 27, 2020 at 8:38 AM Collin Strassburger via users <users@lists.open-mpi.org> wrote:
Hello,

I am having difficulty with Open MPI versions 4.0.2 and 3.1.5.  Both of these
versions produce the same error (error code 63) when utilizing more than 100
cores on a single node.  The processors I am using are AMD EPYC “Rome” 7742s.
The OS is CentOS 8.1.  I have tried compiling with both the default gcc 8 and
a locally compiled gcc 9.  I have already tried modifying the maximum name
field values, with no success.

My compile options are:
./configure \
     --prefix=${HPCX_HOME}/ompi \
     --with-platform=contrib/platform/mellanox/optimized

Any assistance would be appreciated,
Collin

Collin Strassburger
Bihrle Applied Research Inc.




--
Jeff Squyres
jsquy...@cisco.com

