Re: [OMPI users] OpenMPI fails with np > 65
Following Jeff's suggestion adding devel mailing list. Hi All, I am currently facing strange situation that I can't run OMPI on more than 65 nodes. It seems like environmental issue that does not allow me to open more connections. Any ideas ? Log attached, more info below in the mail. Running OMPI from trunk [node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: Error in file base/ess_base_std_orted.c at line 288 Thanks. Lenny Verkhovsky SW Engineer, Mellanox Technologies www.mellanox.com<http://www.mellanox.com> Office:+972 74 712 9244 Mobile: +972 54 554 0233 Fax:+972 72 257 9400 From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lenny Verkhovsky Sent: Tuesday, August 12, 2014 1:13 PM To: Open MPI Users Subject: Re: [OMPI users] OpenMPI fails with np > 65 Hi, Config: ./configure --enable-openib-rdmacm-ibaddr --prefix /home/sources/ompi-bin --enable-mpirun-prefix-by-default --with-openib=/usr/local --enable-debug --disable-openib-connectx-xrc Run: /home/sources/ompi-bin/bin/mpirun -np 65 --host ko0067,ko0069,ko0070,ko0074,ko0076,ko0079,ko0080,ko0082,ko0085,ko0087,ko0088,ko0090,ko0096,ko0098,ko0099,ko0101,ko0103,ko0107,ko0111,ko0114,ko0116,ko0125,ko0128,ko0134,ko0141,ko0144,ko0145,ko0148,ko0149,ko0150,ko0152,ko0154,ko0156,ko0157,ko0158,ko0162,ko0164,ko0166,ko0168,ko0170,ko0174,ko0178,ko0181,ko0185,ko0190,ko0192,ko0195,ko0197,ko0200,ko0203,ko0205,ko0207,ko0209,ko0210,ko0211,ko0213,ko0214,ko0217,ko0218,ko0223,ko0228,ko0229,ko0231,ko0235,ko0237 --mca btl openib,self --mca btl_openib_cpc_include rdmacm --mca pml ob1 --mca btl_openib_if_include mthca0:1 --mca plm_base_verbose 5 --debug-daemons hostname 2>&1|tee > /tmp/mpi.log Environment: According to the attached log it's rsh environment Output attached Notes: The problem is always with tha last node, 64 connections work, 65 connections fail. node-119.ssauniversal.ssa.kodiak.nx == ko0237 mpi.log line 1034: -- An invalid value was supplied for an enum variable. Variable : orte_debug_daemons Value: 1,1 Valid values : 0: f|false|disabled, 1: t|true|enabled -- mpi.log line 1059: [node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: Error in file base/ess_base_std_orted.c at line 288 Lenny Verkhovsky SW Engineer, Mellanox Technologies www.mellanox.com<http://www.mellanox.com> Office:+972 74 712 9244 Mobile: +972 54 554 0233 Fax:+972 72 257 9400 From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain Sent: Monday, August 11, 2014 4:53 PM To: Open MPI Users Subject: Re: [OMPI users] OpenMPI fails with np > 65 Okay, let's start with the basics :-) How was this configured? What environment are you running in (rsh, slurm, ??)? If you configured --enable-debug, then please run it with --mca plm_base_verbose 5 --debug-daemons and send the output On Aug 11, 2014, at 12:07 AM, Lenny Verkhovsky <len...@mellanox.com<mailto:len...@mellanox.com>> wrote: I don't think so, It's always the 66th node, even if I swap between 65th and 66th I also get the same error when setting np=66, while having only 65 hosts in hostfile (I am using only tcp btl ) Lenny Verkhovsky SW Engineer, Mellanox Technologies www.mellanox.com<http://www.mellanox.com/> Office:+972 74 712 9244 Mobile: +972 54 554 0233 Fax:+972 72 257 9400 From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain Sent: Monday, August 11, 2014 1:07 AM To: Open MPI Users Subject: Re: [OMPI users] OpenMPI fails with np > 65 Looks to me like your 65th host is missing the dstore library - is it possible you don't have your paths set correctly on all hosts in your hostfile? On Aug 10, 2014, at 1:13 PM, Lenny Verkhovsky <len...@mellanox.com<mailto:len...@mellanox.com>> wrote: Hi all, Trying to run OpenMPI ( trunk Revision: 32428 ) I faced the problem running OMPI with more than 65 procs. It looks like MPI failes to open 66th connection even with running `hostname` over tcp. It also seems to unrelated to specific host. All hosts are Ubuntu 12.04.1 LTS mpirun -np 66 --hostfile /proj/SSA/Mellanox/tmp//20140810_070156_hostfile.txt --mca btl tcp,self hostname [nodename] [[4452,0],65] ORTE_ERROR_LOG: Error in file base/ess_base_std_orted.c at line 288 ... It looks like environment issue, but I can't find any limit related. Any ideas ? Thanks. Lenny Verkhovsky SW Engineer, Mellanox Technologies www.mellanox.com<http://www.mellanox.com/> Office:+972 74 712 9244 Mobile: +972 54 554 0233 Fax:+972 72 257 9400 ___ users mailing list us...@open-mpi.org<mailto:us...@open-mpi.org> Subscription: http://www.open-mpi.org/mailman/listi
Re: [OMPI users] OpenMPI fails with np > 65
Lenny -- Since this is about the development trunk, how about sending these kinds of mails to the de...@open-mpi.org list? Not all the OMPI developers are on the user's list, and this very definitely is not a user-level question. On Aug 12, 2014, at 6:12 AM, Lenny Verkhovsky <len...@mellanox.com> wrote: > > Hi, > > Config: > ./configure --enable-openib-rdmacm-ibaddr --prefix /home/sources/ompi-bin > --enable-mpirun-prefix-by-default --with-openib=/usr/local --enable-debug > --disable-openib-connectx-xrc > > Run: > /home/sources/ompi-bin/bin/mpirun -np 65 --host > ko0067,ko0069,ko0070,ko0074,ko0076,ko0079,ko0080,ko0082,ko0085,ko0087,ko0088,ko0090,ko0096,ko0098,ko0099,ko0101,ko0103,ko0107,ko0111,ko0114,ko0116,ko0125,ko0128,ko0134,ko0141,ko0144,ko0145,ko0148,ko0149,ko0150,ko0152,ko0154,ko0156,ko0157,ko0158,ko0162,ko0164,ko0166,ko0168,ko0170,ko0174,ko0178,ko0181,ko0185,ko0190,ko0192,ko0195,ko0197,ko0200,ko0203,ko0205,ko0207,ko0209,ko0210,ko0211,ko0213,ko0214,ko0217,ko0218,ko0223,ko0228,ko0229,ko0231,ko0235,ko0237 > --mca btl openib,self --mca btl_openib_cpc_include rdmacm --mca pml ob1 > --mca btl_openib_if_include mthca0:1 --mca plm_base_verbose 5 --debug-daemons > hostname 2>&1|tee > /tmp/mpi.log > > Environment: > According to the attached log it’s rsh environment > > > Output attached > > Notes: > The problem is always with tha last node, 64 connections work, 65 connections > fail. > node-119.ssauniversal.ssa.kodiak.nx == ko0237 > > mpi.log line 1034: > -- > An invalid value was supplied for an enum variable. > Variable : orte_debug_daemons > Value: 1,1 > Valid values : 0: f|false|disabled, 1: t|true|enabled > -- > > mpi.log line 1059: > [node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: > Error in file base/ess_base_std_orted.c at line 288 > > > > Lenny Verkhovsky > SW Engineer, Mellanox Technologies > www.mellanox.com > > Office:+972 74 712 9244 > Mobile: +972 54 554 0233 > Fax:+972 72 257 9400 > > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain > Sent: Monday, August 11, 2014 4:53 PM > To: Open MPI Users > Subject: Re: [OMPI users] OpenMPI fails with np > 65 > > Okay, let's start with the basics :-) > > How was this configured? What environment are you running in (rsh, slurm, > ??)? If you configured --enable-debug, then please run it with > > --mca plm_base_verbose 5 --debug-daemons > > and send the output > > > On Aug 11, 2014, at 12:07 AM, Lenny Verkhovsky <len...@mellanox.com> wrote: > > > I don’t think so, > It’s always the 66th node, even if I swap between 65th and 66th > I also get the same error when setting np=66, while having only 65 hosts in > hostfile > (I am using only tcp btl ) > > > Lenny Verkhovsky > SW Engineer, Mellanox Technologies > www.mellanox.com > > Office:+972 74 712 9244 > Mobile: +972 54 554 0233 > Fax:+972 72 257 9400 > > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain > Sent: Monday, August 11, 2014 1:07 AM > To: Open MPI Users > Subject: Re: [OMPI users] OpenMPI fails with np > 65 > > Looks to me like your 65th host is missing the dstore library - is it > possible you don't have your paths set correctly on all hosts in your > hostfile? > > > On Aug 10, 2014, at 1:13 PM, Lenny Verkhovsky <len...@mellanox.com> wrote: > > > > Hi all, > > Trying to run OpenMPI ( trunk Revision: 32428 ) I faced the problem running > OMPI with more than 65 procs. > It looks like MPI failes to open 66th connection even with running `hostname` > over tcp. > It also seems to unrelated to specific host. > All hosts are Ubuntu 12.04.1 LTS > mpirun -np 66 --hostfile /proj/SSA/Mellanox/tmp//20140810_070156_hostfile.txt > --mca btl tcp,self hostname > [nodename] [[4452,0],65] ORTE_ERROR_LOG: Error in file > base/ess_base_std_orted.c at line 288 > > … > > It looks like environment issue, but I can’t find any limit related. > Any ideas ? > Thanks. > Lenny Verkhovsky > SW Engineer, Mellanox Technologies > www.mellanox.com > > Office:+972 74 712 9244 > Mobile: +972 54 554 0233 > Fax:+972 72 257 9400 > > ___ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/l
Re: [OMPI users] OpenMPI fails with np > 65
Hi, Config: ./configure --enable-openib-rdmacm-ibaddr --prefix /home/sources/ompi-bin --enable-mpirun-prefix-by-default --with-openib=/usr/local --enable-debug --disable-openib-connectx-xrc Run: /home/sources/ompi-bin/bin/mpirun -np 65 --host ko0067,ko0069,ko0070,ko0074,ko0076,ko0079,ko0080,ko0082,ko0085,ko0087,ko0088,ko0090,ko0096,ko0098,ko0099,ko0101,ko0103,ko0107,ko0111,ko0114,ko0116,ko0125,ko0128,ko0134,ko0141,ko0144,ko0145,ko0148,ko0149,ko0150,ko0152,ko0154,ko0156,ko0157,ko0158,ko0162,ko0164,ko0166,ko0168,ko0170,ko0174,ko0178,ko0181,ko0185,ko0190,ko0192,ko0195,ko0197,ko0200,ko0203,ko0205,ko0207,ko0209,ko0210,ko0211,ko0213,ko0214,ko0217,ko0218,ko0223,ko0228,ko0229,ko0231,ko0235,ko0237 --mca btl openib,self --mca btl_openib_cpc_include rdmacm --mca pml ob1 --mca btl_openib_if_include mthca0:1 --mca plm_base_verbose 5 --debug-daemons hostname 2>&1|tee > /tmp/mpi.log Environment: According to the attached log it's rsh environment Output attached Notes: The problem is always with tha last node, 64 connections work, 65 connections fail. node-119.ssauniversal.ssa.kodiak.nx == ko0237 mpi.log line 1034: -- An invalid value was supplied for an enum variable. Variable : orte_debug_daemons Value: 1,1 Valid values : 0: f|false|disabled, 1: t|true|enabled -- mpi.log line 1059: [node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: Error in file base/ess_base_std_orted.c at line 288 Lenny Verkhovsky SW Engineer, Mellanox Technologies www.mellanox.com<http://www.mellanox.com> Office:+972 74 712 9244 Mobile: +972 54 554 0233 Fax:+972 72 257 9400 From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain Sent: Monday, August 11, 2014 4:53 PM To: Open MPI Users Subject: Re: [OMPI users] OpenMPI fails with np > 65 Okay, let's start with the basics :-) How was this configured? What environment are you running in (rsh, slurm, ??)? If you configured --enable-debug, then please run it with --mca plm_base_verbose 5 --debug-daemons and send the output On Aug 11, 2014, at 12:07 AM, Lenny Verkhovsky <len...@mellanox.com<mailto:len...@mellanox.com>> wrote: I don't think so, It's always the 66th node, even if I swap between 65th and 66th I also get the same error when setting np=66, while having only 65 hosts in hostfile (I am using only tcp btl ) Lenny Verkhovsky SW Engineer, Mellanox Technologies www.mellanox.com<http://www.mellanox.com/> Office:+972 74 712 9244 Mobile: +972 54 554 0233 Fax:+972 72 257 9400 From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain Sent: Monday, August 11, 2014 1:07 AM To: Open MPI Users Subject: Re: [OMPI users] OpenMPI fails with np > 65 Looks to me like your 65th host is missing the dstore library - is it possible you don't have your paths set correctly on all hosts in your hostfile? On Aug 10, 2014, at 1:13 PM, Lenny Verkhovsky <len...@mellanox.com<mailto:len...@mellanox.com>> wrote: Hi all, Trying to run OpenMPI ( trunk Revision: 32428 ) I faced the problem running OMPI with more than 65 procs. It looks like MPI failes to open 66th connection even with running `hostname` over tcp. It also seems to unrelated to specific host. All hosts are Ubuntu 12.04.1 LTS mpirun -np 66 --hostfile /proj/SSA/Mellanox/tmp//20140810_070156_hostfile.txt --mca btl tcp,self hostname [nodename] [[4452,0],65] ORTE_ERROR_LOG: Error in file base/ess_base_std_orted.c at line 288 ... It looks like environment issue, but I can't find any limit related. Any ideas ? Thanks. Lenny Verkhovsky SW Engineer, Mellanox Technologies www.mellanox.com<http://www.mellanox.com/> Office:+972 74 712 9244 Mobile: +972 54 554 0233 Fax:+972 72 257 9400 ___ users mailing list us...@open-mpi.org<mailto:us...@open-mpi.org> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/24961.php ___ users mailing list us...@open-mpi.org<mailto:us...@open-mpi.org> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/24964.php mpi65.tgz Description: mpi65.tgz
Re: [OMPI users] OpenMPI fails with np > 65
Okay, let's start with the basics :-) How was this configured? What environment are you running in (rsh, slurm, ??)? If you configured --enable-debug, then please run it with --mca plm_base_verbose 5 --debug-daemons and send the output On Aug 11, 2014, at 12:07 AM, Lenny Verkhovsky <len...@mellanox.com> wrote: > I don’t think so, > It’s always the 66th node, even if I swap between 65th and 66th > I also get the same error when setting np=66, while having only 65 hosts in > hostfile > (I am using only tcp btl ) > > > Lenny Verkhovsky > SW Engineer, Mellanox Technologies > www.mellanox.com > > Office:+972 74 712 9244 > Mobile: +972 54 554 0233 > Fax:+972 72 257 9400 > > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain > Sent: Monday, August 11, 2014 1:07 AM > To: Open MPI Users > Subject: Re: [OMPI users] OpenMPI fails with np > 65 > > Looks to me like your 65th host is missing the dstore library - is it > possible you don't have your paths set correctly on all hosts in your > hostfile? > > > On Aug 10, 2014, at 1:13 PM, Lenny Verkhovsky <len...@mellanox.com> wrote: > > > Hi all, > > Trying to run OpenMPI ( trunk Revision: 32428 ) I faced the problem running > OMPI with more than 65 procs. > It looks like MPI failes to open 66th connection even with running `hostname` > over tcp. > It also seems to unrelated to specific host. > All hosts are Ubuntu 12.04.1 LTS > mpirun -np 66 --hostfile /proj/SSA/Mellanox/tmp//20140810_070156_hostfile.txt > --mca btl tcp,self hostname > [nodename] [[4452,0],65] ORTE_ERROR_LOG: Error in file > base/ess_base_std_orted.c at line 288 > > … > > It looks like environment issue, but I can’t find any limit related. > Any ideas ? > Thanks. > Lenny Verkhovsky > SW Engineer, Mellanox Technologies > www.mellanox.com > > Office:+972 74 712 9244 > Mobile: +972 54 554 0233 > Fax:+972 72 257 9400 > > ___ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2014/08/24961.php > > ___ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2014/08/24964.php
Re: [OMPI users] OpenMPI fails with np > 65
I don't think so, It's always the 66th node, even if I swap between 65th and 66th I also get the same error when setting np=66, while having only 65 hosts in hostfile (I am using only tcp btl ) Lenny Verkhovsky SW Engineer, Mellanox Technologies www.mellanox.com<http://www.mellanox.com> Office:+972 74 712 9244 Mobile: +972 54 554 0233 Fax:+972 72 257 9400 From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain Sent: Monday, August 11, 2014 1:07 AM To: Open MPI Users Subject: Re: [OMPI users] OpenMPI fails with np > 65 Looks to me like your 65th host is missing the dstore library - is it possible you don't have your paths set correctly on all hosts in your hostfile? On Aug 10, 2014, at 1:13 PM, Lenny Verkhovsky <len...@mellanox.com<mailto:len...@mellanox.com>> wrote: Hi all, Trying to run OpenMPI ( trunk Revision: 32428 ) I faced the problem running OMPI with more than 65 procs. It looks like MPI failes to open 66th connection even with running `hostname` over tcp. It also seems to unrelated to specific host. All hosts are Ubuntu 12.04.1 LTS mpirun -np 66 --hostfile /proj/SSA/Mellanox/tmp//20140810_070156_hostfile.txt --mca btl tcp,self hostname [nodename] [[4452,0],65] ORTE_ERROR_LOG: Error in file base/ess_base_std_orted.c at line 288 ... It looks like environment issue, but I can't find any limit related. Any ideas ? Thanks. Lenny Verkhovsky SW Engineer, Mellanox Technologies www.mellanox.com<http://www.mellanox.com/> Office:+972 74 712 9244 Mobile: +972 54 554 0233 Fax:+972 72 257 9400 ___ users mailing list us...@open-mpi.org<mailto:us...@open-mpi.org> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/24961.php
Re: [OMPI users] OpenMPI fails with np > 65
Looks to me like your 65th host is missing the dstore library - is it possible you don't have your paths set correctly on all hosts in your hostfile? On Aug 10, 2014, at 1:13 PM, Lenny Verkhovskywrote: > Hi all, > > Trying to run OpenMPI ( trunk Revision: 32428 ) I faced the problem running > OMPI with more than 65 procs. > It looks like MPI failes to open 66th connection even with running `hostname` > over tcp. > It also seems to unrelated to specific host. > All hosts are Ubuntu 12.04.1 LTS > mpirun -np 66 --hostfile /proj/SSA/Mellanox/tmp//20140810_070156_hostfile.txt > --mca btl tcp,self hostname > [nodename] [[4452,0],65] ORTE_ERROR_LOG: Error in file > base/ess_base_std_orted.c at line 288 > > … > > It looks like environment issue, but I can’t find any limit related. > Any ideas ? > Thanks. > Lenny Verkhovsky > SW Engineer, Mellanox Technologies > www.mellanox.com > > Office:+972 74 712 9244 > Mobile: +972 54 554 0233 > Fax:+972 72 257 9400 > > ___ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2014/08/24961.php