Following Jeff's suggestion adding devel mailing list.

Hi All,
I am currently facing strange situation that I can't run OMPI on more than 65 
nodes.
It seems like environmental issue that does not allow me to open more 
connections.
Any ideas ?
Log attached, more info below in the mail.

Running OMPI from trunk
[node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: 
Error in file base/ess_base_std_orted.c at line 288

Thanks.
Lenny Verkhovsky
SW Engineer,  Mellanox Technologies
www.mellanox.com<http://www.mellanox.com>

Office:    +972 74 712 9244
Mobile:  +972 54 554 0233
Fax:        +972 72 257 9400

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lenny Verkhovsky
Sent: Tuesday, August 12, 2014 1:13 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI fails with np > 65


Hi,

Config:
./configure --enable-openib-rdmacm-ibaddr --prefix /home/sources/ompi-bin 
--enable-mpirun-prefix-by-default --with-openib=/usr/local --enable-debug 
--disable-openib-connectx-xrc

Run:
/home/sources/ompi-bin/bin/mpirun -np 65 --host 
ko0067,ko0069,ko0070,ko0074,ko0076,ko0079,ko0080,ko0082,ko0085,ko0087,ko0088,ko0090,ko0096,ko0098,ko0099,ko0101,ko0103,ko0107,ko0111,ko0114,ko0116,ko0125,ko0128,ko0134,ko0141,ko0144,ko0145,ko0148,ko0149,ko0150,ko0152,ko0154,ko0156,ko0157,ko0158,ko0162,ko0164,ko0166,ko0168,ko0170,ko0174,ko0178,ko0181,ko0185,ko0190,ko0192,ko0195,ko0197,ko0200,ko0203,ko0205,ko0207,ko0209,ko0210,ko0211,ko0213,ko0214,ko0217,ko0218,ko0223,ko0228,ko0229,ko0231,ko0235,ko0237
 --mca btl openib,self  --mca btl_openib_cpc_include rdmacm --mca pml ob1 --mca 
btl_openib_if_include mthca0:1 --mca plm_base_verbose 5 --debug-daemons 
hostname 2>&1|tee > /tmp/mpi.log

Environment:
     According to the attached log it's rsh environment


Output attached

Notes:
The problem is always with tha last node, 64 connections work, 65 connections 
fail.
node-119.ssauniversal.ssa.kodiak.nx == ko0237

mpi.log line 1034:
--------------------------------------------------------------------------
An invalid value was supplied for an enum variable.
  Variable     : orte_debug_daemons
  Value        : 1,1
  Valid values : 0: f|false|disabled, 1: t|true|enabled
--------------------------------------------------------------------------

mpi.log line 1059:
[node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: 
Error in file base/ess_base_std_orted.c at line 288



Lenny Verkhovsky
SW Engineer,  Mellanox Technologies
www.mellanox.com<http://www.mellanox.com>

Office:    +972 74 712 9244
Mobile:  +972 54 554 0233
Fax:        +972 72 257 9400

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Monday, August 11, 2014 4:53 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI fails with np > 65

Okay, let's start with the basics :-)

How was this configured? What environment are you running in (rsh, slurm, ??)? 
If you configured --enable-debug, then please run it with

--mca plm_base_verbose 5 --debug-daemons

and send the output


On Aug 11, 2014, at 12:07 AM, Lenny Verkhovsky 
<len...@mellanox.com<mailto:len...@mellanox.com>> wrote:

I don't think so,
It's always the 66th node, even if I swap between 65th and 66th
I also get the same error when setting np=66, while having only 65 hosts in 
hostfile
(I am using only tcp btl )


Lenny Verkhovsky
SW Engineer,  Mellanox Technologies
www.mellanox.com<http://www.mellanox.com/>

Office:    +972 74 712 9244
Mobile:  +972 54 554 0233
Fax:        +972 72 257 9400

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Monday, August 11, 2014 1:07 AM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI fails with np > 65

Looks to me like your 65th host is missing the dstore library - is it possible 
you don't have your paths set correctly on all hosts in your hostfile?


On Aug 10, 2014, at 1:13 PM, Lenny Verkhovsky 
<len...@mellanox.com<mailto:len...@mellanox.com>> wrote:


Hi all,

Trying to run OpenMPI ( trunk Revision: 32428 ) I faced the problem running 
OMPI with more than 65 procs.
It looks like MPI failes to open 66th connection even with running `hostname` 
over tcp.
It also seems to unrelated to specific host.
All hosts are Ubuntu 12.04.1 LTS

mpirun -np 66 --hostfile /proj/SSA/Mellanox/tmp//20140810_070156_hostfile.txt 
--mca btl tcp,self hostname
[nodename] [[4452,0],65] ORTE_ERROR_LOG: Error in file 
base/ess_base_std_orted.c at line 288

.......................................
It looks like environment issue, but I can't find any limit related.
Any ideas ?
Thanks.
Lenny Verkhovsky
SW Engineer,  Mellanox Technologies
www.mellanox.com<http://www.mellanox.com/>

Office:    +972 74 712 9244
Mobile:  +972 54 554 0233
Fax:        +972 72 257 9400

_______________________________________________
users mailing list
us...@open-mpi.org<mailto:us...@open-mpi.org>
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/24961.php

_______________________________________________
users mailing list
us...@open-mpi.org<mailto:us...@open-mpi.org>
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/24964.php

Attachment: mpi65.tgz
Description: mpi65.tgz

Reply via email to