Can you send the output of a failed run, including your command line?

Josh
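One way to capture such a run in a single log (a sketch: the `plm`/`odls` verbosity knobs are standard Open MPI MCA parameters, but the levels, the log name, and the `-np` count are only placeholders; `bin/xhpcg` is the application from earlier in the thread):

```shell
# Capture launch-time diagnostics plus the failure output in one log.
# plm = process lifecycle management (remote launch), odls = local process spawn.
log=failed_run.log
if command -v mpirun >/dev/null 2>&1; then
  mpirun --mca plm_base_verbose 10 --mca odls_base_verbose 10 \
         -np 128 bin/xhpcg 2>&1 | tee "$log"
else
  echo "mpirun not in PATH" | tee "$log"
fi
```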
On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users <users@lists.open-mpi.org> wrote:

> Okay, so this is a problem with the Mellanox software - copying Artem.
>
> On Jan 28, 2020, at 8:15 AM, Collin Strassburger <cstrassbur...@bihrle.com> wrote:
>
> I just tried that and it does indeed work with pbs and without Mellanox
> (until a reboot makes it complain about Mellanox/IB related defaults as no
> drivers were installed, etc).
>
> After installing the Mellanox drivers, I used
> ./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --with-ucx
> --with-platform=contrib/platform/mellanox/optimized
>
> With the new compile it fails on the higher core counts.
>
> Collin
>
> *From:* users <users-boun...@lists.open-mpi.org> *On Behalf Of *Ralph Castain via users
> *Sent:* Tuesday, January 28, 2020 11:02 AM
> *To:* Open MPI Users <users@lists.open-mpi.org>
> *Cc:* Ralph Castain <r...@open-mpi.org>
> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
>
> Does it work with pbs but not Mellanox? Just trying to isolate the problem.
>
> On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users <users@lists.open-mpi.org> wrote:
>
> Hello,
>
> I have done some additional testing and I can say that it works correctly
> with gcc8 and no mellanox or pbs installed.
>
> I have done two runs with Mellanox and pbs installed. One run includes
> the actual run options I will be using while the other includes a truncated
> set which still compiles but fails to execute correctly. As the option
> with the actual run options results in a smaller config log, I am including
> it here.
>
> Version: 4.0.2
> The config log is available at
> https://gist.github.com/BTemp1282020/fedca1aeed3b57296b8f21688ccae31c
> and the ompi dump is available at https://pastebin.com/md3HwTUR.
>
> The IB network information (which is not being explicitly operated across):
> Packages: MLNX_OFED and Mellanox HPC-X, both are current versions
> (MLNX_OFED_LINUX-4.7-3.2.9.0-rhel8.1-x86_64 and
> hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat8.1-x86_64)
> ulimit -l = unlimited
> ibv_devinfo:
> hca_id: mlx4_0
>     transport:        InfiniBand (0)
>     fw_ver:           2.42.5000
>     …
>     vendor_id:        0x02c9
>     vendor_part_id:   4099
>     hw_ver:           0x1
>     board_id:         MT_1100120019
>     phys_port_cnt:    1
>     Device ports:
>         port: 1
>             state:        PORT_ACTIVE (4)
>             max_mtu:      4096 (5)
>             active_mtu:   4096 (5)
>             sm_lid:       1
>             port_lid:     12
>             port_lmc:     0x00
>             link_layer:   InfiniBand
>
> It looks like the rest of the IB information is in the config file.
>
> I hope this helps,
> Collin
>
> *From:* Jeff Squyres (jsquyres) <jsquy...@cisco.com>
> *Sent:* Monday, January 27, 2020 3:40 PM
> *To:* Open MPI User's List <users@lists.open-mpi.org>
> *Cc:* Collin Strassburger <cstrassbur...@bihrle.com>
> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
>
> Can you please send all the information listed here:
>
> https://www.open-mpi.org/community/help/
>
> Thanks!
>
> On Jan 27, 2020, at 12:00 PM, Collin Strassburger via users <users@lists.open-mpi.org> wrote:
>
> Hello,
>
> I had initially thought the same thing about the streams, but I have 2
> sockets with 64 cores each. Additionally, I have not yet turned
> multithreading off, so lscpu reports a total of 256 logical cores and 128
> physical cores. As such, I don’t see how it could be running out of
> streams unless something is being passed incorrectly.
>
> Collin
>
> *From:* users <users-boun...@lists.open-mpi.org> *On Behalf Of *Ray Sheppard via users
> *Sent:* Monday, January 27, 2020 11:53 AM
> *To:* users@lists.open-mpi.org
> *Cc:* Ray Sheppard <rshep...@iu.edu>
> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
>
> Hi All,
> Just my two cents, I think error code 63 is saying it is running out of
> streams to use. I think you have only 64 cores, so at 100, you are
> overloading most of them. It feels like you are running out of resources
> trying to swap in and out ranks on physical cores.
> Ray
>
> On 1/27/2020 11:29 AM, Collin Strassburger via users wrote:
>
> Hello Howard,
>
> To remove potential interactions, I have found that the issue persists
> without ucx and hcoll support.
>
> Run command: mpirun -np 128 bin/xhpcg
> Output:
> --------------------------------------------------------------------------
> mpirun was unable to start the specified application as it encountered an
> error:
>
> Error code: 63
> Error name: (null)
> Node: Gen2Node4
>
> when attempting to start process rank 0.
> --------------------------------------------------------------------------
> 128 total processes failed to start
>
> It returns this error for any process I initialize with >100 processes per
> node. I get the same error message for multiple different codes, so the
> error code is mpi related rather than being program specific.
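As a quick sanity check on the "streams" reading: on Linux, errno 63 is ENOSR ("Out of streams resources"), which can be confirmed from the OS errno table (the numeric mapping is platform-specific; BSDs assign 63 differently, and whether mpirun's error code is a raw errno here is itself an assumption):

```shell
# Look up what the OS calls error 63; Python's errno module mirrors the
# platform's errno table, so this reflects the local kernel's mapping.
python3 -c 'import errno, os; print(errno.errorcode.get(63), "->", os.strerror(63))'
```

If it is ENOSR, that may point to a per-process resource limit (descriptors, shared-memory segments) being exhausted rather than an application bug, which would be consistent with the failure appearing only past ~100 local ranks.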
>
> Collin
>
> *From:* Howard Pritchard <hpprit...@gmail.com>
> *Sent:* Monday, January 27, 2020 11:20 AM
> *To:* Open MPI Users <users@lists.open-mpi.org>
> *Cc:* Collin Strassburger <cstrassbur...@bihrle.com>
> *Subject:* Re: [OMPI users] OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
>
> Hello Collin,
>
> Could you provide more information about the error? Is there any output
> from either Open MPI or, maybe, UCX, that could provide more information
> about the problem you are hitting?
>
> Howard
>
> On Mon, Jan 27, 2020 at 08:38, Collin Strassburger via users <users@lists.open-mpi.org> wrote:
>
> Hello,
>
> I am having difficulty with OpenMPI versions 4.0.2 and 3.1.5. Both of
> these versions cause the same error (error code 63) when utilizing more
> than 100 cores on a single node. The processors I am utilizing are AMD
> Epyc “Rome” 7742s. The OS is CentOS 8.1. I have tried compiling with both
> the default gcc 8 and a locally compiled gcc 9. I have already tried
> modifying the maximum name field values with no success.
>
> My compile options are:
> ./configure
> --prefix=${HPCX_HOME}/ompi
> --with-platform=contrib/platform/mellanox/optimized
>
> Any assistance would be appreciated,
> Collin
>
> Collin Strassburger
> Bihrle Applied Research Inc.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
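A minimal way to isolate whether the failure is in process launch rather than in the application or the fabric (a sketch; `/bin/true` stands in for any trivial non-MPI binary):

```shell
# If even a trivial non-MPI binary fails past 100 ranks, the problem is in
# process launch (orted/PBS integration or local resource limits), not in
# the application or in MPI communication.
if command -v mpirun >/dev/null 2>&1; then
  mpirun -np 128 /bin/true && msg="launch OK at 128 ranks" || msg="launch failed"
else
  msg="mpirun not found in PATH"
fi
echo "$msg"
```

Checking `ulimit -n` alongside this is also worthwhile, since each local rank typically consumes several file descriptors in the session daemon and a default limit of 1024 can be exhausted near this scale.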