Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-26 Thread Angel de Vicente via users
Hello,

thanks for your help and suggestions.

In the end it was not an issue with OpenMPI or any other system component,
but a single line in our code. I thought I was running the tests with the
-fbounds-check option, but it turns out I was not, argh! At some point I was
writing outside one of our arrays, and you can imagine the rest... The fact
that it happened only when running with '--bind-to none' and only on my
workstation sent me down all the wrong debugging paths. Once I realized that
-fbounds-check was not being used, figuring out the issue was a matter of
seconds.
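For anyone hitting something similar: here is a minimal, hypothetical
illustration (not our actual code) of the kind of bug that gfortran's
-fbounds-check (a.k.a. -fcheck=bounds) turns from silent memory corruption
into an immediate runtime error:

  program oob_demo
     implicit none
     integer :: a(10), i

     do i = 1, 11      ! off by one: the last iteration writes past a(10)
        a(i) = i
     end do

     print *, a(10)
  end program oob_demo

Compiled with -fbounds-check, the run aborts at the offending assignment and
reports the out-of-range index instead of quietly corrupting neighbouring
memory.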

Our code now happily runs the 3000+ tests without a hitch.

Cheers,
-- 
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/


Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-23 Thread Protze, Joachim via users
Instead of using if (tid == 0) you should use the Fortran equivalent of:
#pragma omp masked
or, for pre-OpenMP-5.1 code,
#pragma omp master

This way you don't need to rely on preprocessor magic with the _OPENMP macro,
and at the same time you express your intended OpenMP semantics more clearly.
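In Fortran that is a MASTER construct (or MASKED with an OpenMP 5.1 compiler)
around the communication calls. A minimal sketch, reusing the variable names
from the snippet quoted below:

  !$omp master
  call mpi_send(my_rank, 1, mpi_integer, master, ask_job, &
                mpi_comm_world, mpierror)
  call mpi_probe(master, mpi_any_tag, mpi_comm_world, stat, mpierror)

  if (stat(mpi_tag) == stop_signal) then
     call mpi_recv(b_, 1, mpi_integer, master, stop_signal, &
                   mpi_comm_world, stat, mpierror)
  else
     call mpi_recv(iyax, 1, mpi_integer, master, give_job, &
                   mpi_comm_world, stat, mpierror)
  end if
  !$omp end master

  !$omp barrier   ! MASTER has no implied barrier, so keep the explicit one

Without -fopenmp the directives are plain comments, so the block still runs
on the single thread and no _OPENMP guard is needed.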

- Joachim

From: users  on behalf of Angel de Vicente 
via users 
Sent: Friday, April 22, 2022 10:31:38 PM
To: Keller, Rainer 
Cc: Angel de Vicente ; Open MPI Users 

Subject: Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation 
fault only when run with --bind-to none

Hello,

"Keller, Rainer"  writes:

> You’re using MPI_Probe() with Threads; that’s not safe.
> Please consider using MPI_Mprobe() together with MPI_Mrecv().

many thanks for the suggestion. I will try the M variants, though I
was under the impression that mpi_probe() was OK as long as one made sure
that the source and tag matched between the mpi_probe() and the
mpi_recv() calls.

As you can see below, I'm careful with that (in any case I'm not sure
the problem lies there, since the error I get is about an invalid memory
reference in the mpi_probe call itself).

,
|tid = 0
| #ifdef _OPENMP
|tid = omp_get_thread_num()
| #endif
|
|do
|   if (tid == 0) then
|  call mpi_send(my_rank, 1, mpi_integer, master, ask_job, &
|   mpi_comm_world, mpierror)
|  call mpi_probe(master,mpi_any_tag,mpi_comm_world,stat,mpierror)
|
|  if (stat(mpi_tag) == stop_signal) then
| call mpi_recv(b_,1,mpi_integer,master,stop_signal, &
|  mpi_comm_world,stat,mpierror)
|  else
| call mpi_recv(iyax,1,mpi_integer,master,give_job, &
|  mpi_comm_world,stat,mpierror)
|  end if
|   end if
|
|   !$omp barrier
|
|   [... actual work...]
`


> So getting into valgrind may be of help, possibly recompiling Open MPI
> enabling valgrind-checking together with debugging options.

I was hoping to avoid this route, but it certainly is looking like I'll
have to bite the bullet...

Thanks,
--
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/


Re: [OMPI users] help with M1 chip macOS openMPI installation

2022-04-22 Thread George Bosilca via users
I think you should work under the assumption of cross-compiling, because the
target architecture for the OMPI build should be x86 and not the local
architecture. It's been a while since I last cross-compiled, but I hear Gilles
does cross-compilation routinely, so he might be able to help.

  George.


On Fri, Apr 22, 2022 at 13:14 Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> Can you send all the information listed under "For compile problems"
> (please compress!):
>
> https://www.open-mpi.org/community/help/
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> 
> From: users  on behalf of Cici Feng via
> users 
> Sent: Friday, April 22, 2022 5:30 AM
> To: Open MPI Users
> Cc: Cici Feng
> Subject: Re: [OMPI users] help with M1 chip macOS openMPI installation
>
> Hi George,
>
> Thanks so much for the tips. I have installed Rosetta so that my
> computer can run the Intel software. However, the same error appears when I
> try to make OMPI, and here's how it looks:
>
> ../../../../opal/threads/thread_usage.h(163): warning #266: function
> "opal_atomic_swap_ptr" declared implicitly
>
>   OPAL_THREAD_DEFINE_ATOMIC_SWAP(void *, intptr_t, ptr)
>
>   ^
>
>
> In file included from ../../../../opal/class/opal_object.h(126),
>
>  from ../../../../opal/dss/dss_types.h(40),
>
>  from ../../../../opal/dss/dss.h(32),
>
>  from pmix3x_server_north.c(27):
>
> ../../../../opal/threads/thread_usage.h(163): warning #120: return value
> type does not match the function type
>
>   OPAL_THREAD_DEFINE_ATOMIC_SWAP(void *, intptr_t, ptr)
>
>   ^
>
>
> pmix3x_server_north.c(157): warning #266: function "opal_atomic_rmb"
> declared implicitly
>
>   OPAL_ACQUIRE_OBJECT(opalcaddy);
>
>   ^
>
>
>   CCLD mca_pmix_pmix3x.la
>
> Making all in mca/pstat/test
>
>   CCLD mca_pstat_test.la
>
> Making all in mca/rcache/grdma
>
>   CCLD mca_rcache_grdma.la
>
> Making all in mca/reachable/weighted
>
>   CCLD mca_reachable_weighted.la
>
> Making all in mca/shmem/mmap
>
>   CCLD mca_shmem_mmap.la
>
> Making all in mca/shmem/posix
>
>   CCLD mca_shmem_posix.la
>
> Making all in mca/shmem/sysv
>
>   CCLD mca_shmem_sysv.la
>
> Making all in tools/wrappers
>
>   CCLD opal_wrapper
>
> Undefined symbols for architecture x86_64:
>
>   "_opal_atomic_add_fetch_32", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_compare_exchange_strong_32", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_compare_exchange_strong_ptr", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_lock", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_lock_init", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_mb", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_rmb", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_sub_fetch_32", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_swap_32", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_swap_ptr", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_unlock", referenced from:
>
>   import-atom in libopen-pal.dylib
>
>   "_opal_atomic_wmb", referenced from:
>
>   import-atom in libopen-pal.dylib
>
> ld: symbol(s) not found for architecture x86_64
>
> make[2]: *** [opal_wrapper] Error 1
>
> make[1]: *** [all-recursive] Error 1
>
> make: *** [all-recursive] Error 1
>
>
> I am not sure whether the ld part affects the make process or not. Either
> way, Error 1 appears for "opal_wrapper", which I think has been the error
> I kept encountering.
>
> Is there any explanation to this specific error?
>
> PS: the configure command I used is as follows, provided by the official
> website of MARE2DEM:
>
> sudo  ./configure --prefix=/opt/openmpi CC=icc CXX=icc F77=ifort FC=ifort \
> lt_prog_compiler_wl_FC='-Wl,';
> make all install

Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-22 Thread Angel de Vicente via users
Hello,

"Keller, Rainer"  writes:

> You’re using MPI_Probe() with Threads; that’s not safe.
> Please consider using MPI_Mprobe() together with MPI_Mrecv().

many thanks for the suggestion. I will try the M variants, though I
was under the impression that mpi_probe() was OK as long as one made sure
that the source and tag matched between the mpi_probe() and the
mpi_recv() calls.

As you can see below, I'm careful with that (in any case I'm not sure
the problem lies there, since the error I get is about an invalid memory
reference in the mpi_probe call itself).

,
|tid = 0  
| #ifdef _OPENMP  
|tid = omp_get_thread_num()   
| #endif  
| 
|do   
|   if (tid == 0) then
|  call mpi_send(my_rank, 1, mpi_integer, master, ask_job, &  
|   mpi_comm_world, mpierror) 
|  call mpi_probe(master,mpi_any_tag,mpi_comm_world,stat,mpierror)
| 
|  if (stat(mpi_tag) == stop_signal) then 
| call mpi_recv(b_,1,mpi_integer,master,stop_signal, &
|  mpi_comm_world,stat,mpierror)  
|  else   
| call mpi_recv(iyax,1,mpi_integer,master,give_job, & 
|  mpi_comm_world,stat,mpierror)  
|  end if 
|   end if
| 
|   !$omp barrier
| 
|   [... actual work...]
`


> So getting into valgrind may be of help, possibly recompiling Open MPI
> enabling valgrind-checking together with debugging options.

I was hoping to avoid this route, but it certainly is looking like I'll
have to bite the bullet...

Thanks,
-- 
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/


Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-22 Thread Angel de Vicente via users
Hello Jeff,

"Jeff Squyres (jsquyres)"  writes:

> With THREAD_FUNNELED, it means that there can only be one thread in
> MPI at a time -- and it needs to be the same thread as the one that
> called MPI_INIT_THREAD.
>
> Is that the case in your app?


the master rank (i.e. 0) never creates threads, while the other ranks go
through the following code to communicate with it, so I do check that only
the master thread communicates:

,
|tid = 0  
| #ifdef _OPENMP  
|tid = omp_get_thread_num()   
| #endif  
| 
|do   
|   if (tid == 0) then
|  call mpi_send(my_rank, 1, mpi_integer, master, ask_job, &  
|   mpi_comm_world, mpierror) 
|  call mpi_probe(master,mpi_any_tag,mpi_comm_world,stat,mpierror)
| 
|  if (stat(mpi_tag) == stop_signal) then 
| call mpi_recv(b_,1,mpi_integer,master,stop_signal, &
|  mpi_comm_world,stat,mpierror)  
|  else   
| call mpi_recv(iyax,1,mpi_integer,master,give_job, & 
|  mpi_comm_world,stat,mpierror)  
|  end if 
|   end if
| 
|   !$omp barrier
| 
|   [... actual work...]
`


> Also, what is your app doing at src/pcorona_main.f90:627?

It is the mpi_probe call above.


In case it can clarify things, my app follows a master-worker paradigm,
where rank 0 hands over jobs, and all mpi ranks > 0 just do the following:

,
| !$OMP PARALLEL DEFAULT(NONE)
| do
|   !  (the code above) 
|   if (tid == 0) then receive job number | stop signal
|  
|   !$OMP DO schedule(dynamic)
|   loop_izax: do izax=sol_nz_min,sol_nz_max
| 
|  [big computing loop body]
| 
|   end do loop_izax  
|   !$OMP END DO  
| 
|   if (tid == 0) then 
|   call mpi_send(iyax,1,mpi_integer,master,results_tag, & 
|mpi_comm_world,mpierror)  
|   call mpi_send(stokes_buf_y,nz*8,mpi_double_precision, &
|master,results_tag,mpi_comm_world,mpierror)   
|   end if 
|  
|   !omp barrier   
|  
| end do   
| !$OMP END PARALLEL  
`



Following Gilles' suggestion, I also tried changing MPI_THREAD_FUNNELED
to MPI_THREAD_MULTIPLE just in case, but I get the same segmentation
fault at the same line (mind you, the segmentation fault doesn't happen
every time). But again, no issues when running with --bind-to socket
(and no apparent issues at all on the other computer, even with --bind-to
none).

Many thanks for any suggestions,
-- 
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/


Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-22 Thread Keller, Rainer via users
Dear Angel,
You're using MPI_Probe() with threads; that's not safe.
Please consider using MPI_Mprobe() together with MPI_Mrecv().
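In Fortran, the probe/receive pair from the loop shown earlier in this thread
would then look roughly like this (a sketch only, reusing the variable names
from that snippet plus a new integer message handle):

  integer :: msg   ! message handle returned by mpi_mprobe

  call mpi_mprobe(master, mpi_any_tag, mpi_comm_world, msg, stat, mpierror)

  ! The matched message can only be received through its handle, so no other
  ! thread can consume it between the probe and the receive.
  if (stat(mpi_tag) == stop_signal) then
     call mpi_mrecv(b_, 1, mpi_integer, msg, stat, mpierror)
  else
     call mpi_mrecv(iyax, 1, mpi_integer, msg, stat, mpierror)
  end if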

However, you mention running with only one thread (setting OMP_NUM_THREADS=1),
assuming you didn't override it with omp_set_num_threads() or a num_threads()
clause…

So getting into valgrind may be of help, possibly after recompiling Open MPI
with valgrind checking enabled together with debugging options.

Best regards,
Rainer


> On 22. Apr 2022, at 14:40, Angel de Vicente via users 
>  wrote:
> 
> Hello,
> 
> I'm running out of ideas, and wonder if someone here could have some
> tips on how to debug a segmentation fault I'm having with my
> application [due to the nature of the problem I'm wondering if the
> problem is with OpenMPI itself rather than my app, though at this point
> I'm not leaning strongly either way].
> 
> The code is hybrid MPI+OpenMP and I compile it with gcc 10.3.0 and
> OpenMPI 4.1.3.
> 
> Usually I was running the code with "mpirun -np X --bind-to none [...]"
> so that the threads created by OpenMP don't get bound to a single core
> and I actually get proper speedup out of OpenMP.
> 
> Now, since I introduced some changes to the code this week (though I
> have read the changes carefully a number of times and I don't see
> anything suspicious), I sometimes get a segmentation fault, but only
> when I run with "--bind-to none" and only on my workstation. It is not
> always with the same run configuration, but I can see a pattern: the
> problem shows up only if I run the version compiled with OpenMP
> support, and most of the time only when the number of ranks*threads goes
> above 4 or so. If I run with "--bind-to socket" all looks good all
> the time.
>
> If I run it on another server, "--bind-to none" doesn't seem to be an
> issue (I submitted the jobs many, many times and got not a single
> segmentation fault), but on my workstation it fails almost every time
> when using MPI+OpenMP with a handful of threads and "--bind-to none". On
> that other server I'm running gcc 9.3.0 and OpenMPI 4.1.3.
> 
> For example, setting OMP_NUM_THREADS to 1, I run the code like the
> following, and get the segmentation fault as below:
> 
> ,
> | angelv@sieladon:~/.../Fe13_NL3/t~gauss+isat+istim$ mpirun -np 4 --bind-to 
> none  ../../../../../pcorona+openmp~gauss Fe13_NL3.params 
> |  Reading control file: Fe13_NL3.params
> |   ... Control file parameters broadcasted
> | 
> | [...]
> |  
> |  Starting calculation loop on the line of sight
> |  Receiving results from:2
> |  Receiving results from:1
> | 
> | Program received signal SIGSEGV: Segmentation fault - invalid memory 
> reference.
> | 
> | Backtrace for this error:
> |  Receiving results from:3
> | #0  0x7fd747e7555f in ???
> | #1  0x7fd7488778e1 in ???
> | #2  0x7fd7488667a4 in ???
> | #3  0x7fd7486fe84c in ???
> | #4  0x7fd7489aa9ce in ???
> | #5  0x414959 in __pcorona_main_MOD_main_loop._omp_fn.0
> | at src/pcorona_main.f90:627
> | #6  0x7fd74813ec75 in ???
> | #7  0x412bb0 in pcorona
> | at src/pcorona.f90:49
> | #8  0x40361c in main
> | at src/pcorona.f90:17
> | 
> | [...]
> | 
> | --
> | mpirun noticed that process rank 3 with PID 0 on node sieladon exited on 
> signal 11 (Segmentation fault).
> | ---
> `
> 
> I cannot see inside the MPI library (I don't really know if that would
> be helpful) but line 627 in pcorona_main.f90 is:
> 
> ,
> |  call mpi_probe(master,mpi_any_tag,mpi_comm_world,stat,mpierror)
> `
> 
> Any ideas/suggestions on what could be going on, or how to try and get some
> more clues about the possible causes of this?
> 
> Many thanks,
> -- 
> Ángel de Vicente
> 
> Tel.: +34 922 605 747
> Web.: http://research.iac.es/proyecto/polmag/

-
Prof. Dr.-Ing. Rainer Keller, HS Esslingen
Degree program coordinator, Master Angewandte Informatik (Applied Computer Science)
Professor of operating systems, distributed and parallel systems
Faculty of Computer Science and 

Re: [OMPI users] help with M1 chip macOS openMPI installation

2022-04-22 Thread Jeff Squyres (jsquyres) via users
Can you send all the information listed under "For compile problems" (please 
compress!):

https://www.open-mpi.org/community/help/

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Cici Feng via users 

Sent: Friday, April 22, 2022 5:30 AM
To: Open MPI Users
Cc: Cici Feng
Subject: Re: [OMPI users] help with M1 chip macOS openMPI installation

Hi George,

Thanks so much for the tips. I have installed Rosetta so that my computer can
run the Intel software. However, the same error appears when I try to make
OMPI, and here's how it looks:

../../../../opal/threads/thread_usage.h(163): warning #266: function 
"opal_atomic_swap_ptr" declared implicitly

  OPAL_THREAD_DEFINE_ATOMIC_SWAP(void *, intptr_t, ptr)

  ^


In file included from ../../../../opal/class/opal_object.h(126),

 from ../../../../opal/dss/dss_types.h(40),

 from ../../../../opal/dss/dss.h(32),

 from pmix3x_server_north.c(27):

../../../../opal/threads/thread_usage.h(163): warning #120: return value type 
does not match the function type

  OPAL_THREAD_DEFINE_ATOMIC_SWAP(void *, intptr_t, ptr)

  ^


pmix3x_server_north.c(157): warning #266: function "opal_atomic_rmb" declared 
implicitly

  OPAL_ACQUIRE_OBJECT(opalcaddy);

  ^


  CCLD mca_pmix_pmix3x.la

Making all in mca/pstat/test

  CCLD mca_pstat_test.la

Making all in mca/rcache/grdma

  CCLD mca_rcache_grdma.la

Making all in mca/reachable/weighted

  CCLD mca_reachable_weighted.la

Making all in mca/shmem/mmap

  CCLD mca_shmem_mmap.la

Making all in mca/shmem/posix

  CCLD mca_shmem_posix.la

Making all in mca/shmem/sysv

  CCLD mca_shmem_sysv.la

Making all in tools/wrappers

  CCLD opal_wrapper

Undefined symbols for architecture x86_64:

  "_opal_atomic_add_fetch_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_compare_exchange_strong_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_compare_exchange_strong_ptr", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_lock", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_lock_init", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_mb", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_rmb", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_sub_fetch_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_swap_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_swap_ptr", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_unlock", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_wmb", referenced from:

  import-atom in libopen-pal.dylib

ld: symbol(s) not found for architecture x86_64

make[2]: *** [opal_wrapper] Error 1

make[1]: *** [all-recursive] Error 1

make: *** [all-recursive] Error 1


I am not sure whether the ld part affects the make process or not. Either way,
Error 1 appears for "opal_wrapper", which I think has been the error I kept
encountering.

Is there any explanation to this specific error?

PS: the configure command I used is as follows, provided by the official
website of MARE2DEM:

sudo  ./configure --prefix=/opt/openmpi CC=icc CXX=icc F77=ifort FC=ifort \
lt_prog_compiler_wl_FC='-Wl,';
make all install

Thanks again,
Cici

On Thu, Apr 21, 2022 at 11:18 PM George Bosilca via users
<users@lists.open-mpi.org> wrote:
1. I am not aware of any outstanding OMPI issues with the M1 chip that would 
prevent OMPI from compiling and running efficiently in an M1-based setup, 
assuming the compilation chain is working properly.

2. M1 supports x86 code via Rosetta, an app provided by Apple to ensure a
smooth transition from the Intel-based to the M1-based laptop line. I do
recall running an OMPI compiled on my Intel laptop on my M1 laptop to test the
performance of the Rosetta binary translator. We even had some discussions
about this on the mailing list (or in github issues).

3. Based on your original message, and their webpage, MARE2DEM does not
support any compilation chain other than Intel. As explained above, that might
not by itself be a showstopper, because you can run x86 code on the M1 chip
using Rosetta. However, MARE2DEM relies on MKL, the Intel Math Kernel Library,
and that library will not run on an M1 chip.

  George.


On Thu, Apr 21, 2022 at 7:02 AM

Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-22 Thread Jeff Squyres (jsquyres) via users
With THREAD_FUNNELED, it means that there can only be one thread in MPI at a 
time -- and it needs to be the same thread as the one that called 
MPI_INIT_THREAD.

Is that the case in your app?

Also, what is your app doing at src/pcorona_main.f90:627?  It is making a call 
to MPI, or something else?  It might be useful to compile Open MPI (and/or 
other libraries that you're using) with -g so that you can get more meaningful 
stack traces upon error -- that might give some insight into where / why the 
failure is occurring.

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Angel de Vicente 
via users 
Sent: Friday, April 22, 2022 10:54 AM
To: Gilles Gouaillardet via users
Cc: Angel de Vicente
Subject: Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation 
fault only when run with --bind-to none

Thanks Gilles,

Gilles Gouaillardet via users  writes:

> You can first double check you
> MPI_Init_thread(..., MPI_THREAD_MULTIPLE, ...)

my code uses "mpi_thread_funneled" and OpenMPI was compiled with
MPI_THREAD_MULTIPLE support:

,
| ompi_info | grep  -i thread
|   Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, 
OMPI progress: no, ORTE progress: yes, Event lib: yes)
|FT Checkpoint support: no (checkpoint thread: no)
`

Cheers,
--
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/


Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-22 Thread Angel de Vicente via users
Thanks Gilles,

Gilles Gouaillardet via users  writes:

> You can first double check you
> MPI_Init_thread(..., MPI_THREAD_MULTIPLE, ...)

my code uses "mpi_thread_funneled" and OpenMPI was compiled with
MPI_THREAD_MULTIPLE support:

,
| ompi_info | grep  -i thread
|   Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, 
OMPI progress: no, ORTE progress: yes, Event lib: yes)
|FT Checkpoint support: no (checkpoint thread: no)
`

Cheers,
-- 
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/


Re: [OMPI users] Help diagnosing MPI+OpenMP application segmentation fault only when run with --bind-to none

2022-04-22 Thread Gilles Gouaillardet via users
You can first double check your
MPI_Init_thread(..., MPI_THREAD_MULTIPLE, ...) call,
and that the provided level is MPI_THREAD_MULTIPLE, as you requested.
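A minimal sketch of that check in Fortran (a hypothetical standalone program,
not the code under discussion):

  program thread_check
     use mpi
     implicit none
     integer :: provided, ierr

     call mpi_init_thread(MPI_THREAD_MULTIPLE, provided, ierr)
     ! The thread-level constants are ordered, so a simple comparison works.
     if (provided < MPI_THREAD_MULTIPLE) then
        print *, 'requested MPI_THREAD_MULTIPLE but only got level', provided
     end if
     call mpi_finalize(ierr)
  end program thread_check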

Cheers,

Gilles

On Fri, Apr 22, 2022, 21:45 Angel de Vicente via users <
users@lists.open-mpi.org> wrote:

> Hello,
>
> I'm running out of ideas, and wonder if someone here could have some
> tips on how to debug a segmentation fault I'm having with my
> application [due to the nature of the problem I'm wondering if the
> problem is with OpenMPI itself rather than my app, though at this point
> I'm not leaning strongly either way].
>
> The code is hybrid MPI+OpenMP and I compile it with gcc 10.3.0 and
> OpenMPI 4.1.3.
>
> Usually I was running the code with "mpirun -np X --bind-to none [...]"
> so that the threads created by OpenMP don't get bound to a single core
> and I actually get proper speedup out of OpenMP.
>
> Now, since I introduced some changes to the code this week (though I
> have read the changes carefully a number of times and I don't see
> anything suspicious), I sometimes get a segmentation fault, but only
> when I run with "--bind-to none" and only on my workstation. It is not
> always with the same run configuration, but I can see a pattern: the
> problem shows up only if I run the version compiled with OpenMP
> support, and most of the time only when the number of ranks*threads goes
> above 4 or so. If I run with "--bind-to socket" all looks good all
> the time.
>
> If I run it on another server, "--bind-to none" doesn't seem to be an
> issue (I submitted the jobs many, many times and got not a single
> segmentation fault), but on my workstation it fails almost every time
> when using MPI+OpenMP with a handful of threads and "--bind-to none". On
> that other server I'm running gcc 9.3.0 and OpenMPI 4.1.3.
>
> For example, setting OMP_NUM_THREADS to 1, I run the code like the
> following, and get the segmentation fault as below:
>
> ,
> | angelv@sieladon:~/.../Fe13_NL3/t~gauss+isat+istim$ mpirun -np 4
> --bind-to none  ../../../../../pcorona+openmp~gauss Fe13_NL3.params
> |  Reading control file: Fe13_NL3.params
> |   ... Control file parameters broadcasted
> |
> | [...]
> |
> |  Starting calculation loop on the line of sight
> |  Receiving results from:2
> |  Receiving results from:1
> |
> | Program received signal SIGSEGV: Segmentation fault - invalid memory
> reference.
> |
> | Backtrace for this error:
> |  Receiving results from:3
> | #0  0x7fd747e7555f in ???
> | #1  0x7fd7488778e1 in ???
> | #2  0x7fd7488667a4 in ???
> | #3  0x7fd7486fe84c in ???
> | #4  0x7fd7489aa9ce in ???
> | #5  0x414959 in __pcorona_main_MOD_main_loop._omp_fn.0
> | at src/pcorona_main.f90:627
> | #6  0x7fd74813ec75 in ???
> | #7  0x412bb0 in pcorona
> | at src/pcorona.f90:49
> | #8  0x40361c in main
> | at src/pcorona.f90:17
> |
> | [...]
> |
> |
> --
> | mpirun noticed that process rank 3 with PID 0 on node sieladon exited on
> signal 11 (Segmentation fault).
> | ---
> `
>
> I cannot see inside the MPI library (I don't really know if that would
> be helpful) but line 627 in pcorona_main.f90 is:
>
> ,
> |  call
> mpi_probe(master,mpi_any_tag,mpi_comm_world,stat,mpierror)
> `
>
> Any ideas/suggestions on what could be going on, or how to try and get some
> more clues about the possible causes of this?
>
> Many thanks,
> --
> Ángel de Vicente
>
> Tel.: +34 922 605 747
> Web.: http://research.iac.es/proyecto/polmag/
>


Re: [OMPI users] help with M1 chip macOS openMPI installation

2022-04-22 Thread Cici Feng via users
Hi George,

Thanks so much for the tips. I have installed Rosetta so that my
computer can run the Intel software. However, the same error appears when I
try to make OMPI, and here's how it looks:

../../../../opal/threads/thread_usage.h(163): warning #266: function
"opal_atomic_swap_ptr" declared implicitly

  OPAL_THREAD_DEFINE_ATOMIC_SWAP(void *, intptr_t, ptr)

  ^


In file included from ../../../../opal/class/opal_object.h(126),

 from ../../../../opal/dss/dss_types.h(40),

 from ../../../../opal/dss/dss.h(32),

 from pmix3x_server_north.c(27):

../../../../opal/threads/thread_usage.h(163): warning #120: return value
type does not match the function type

  OPAL_THREAD_DEFINE_ATOMIC_SWAP(void *, intptr_t, ptr)

  ^


pmix3x_server_north.c(157): warning #266: function "opal_atomic_rmb"
declared implicitly

  OPAL_ACQUIRE_OBJECT(opalcaddy);

  ^


  CCLD mca_pmix_pmix3x.la

Making all in mca/pstat/test

  CCLD mca_pstat_test.la

Making all in mca/rcache/grdma

  CCLD mca_rcache_grdma.la

Making all in mca/reachable/weighted

  CCLD mca_reachable_weighted.la

Making all in mca/shmem/mmap

  CCLD mca_shmem_mmap.la

Making all in mca/shmem/posix

  CCLD mca_shmem_posix.la

Making all in mca/shmem/sysv

  CCLD mca_shmem_sysv.la

Making all in tools/wrappers

  CCLD opal_wrapper

Undefined symbols for architecture x86_64:

  "_opal_atomic_add_fetch_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_compare_exchange_strong_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_compare_exchange_strong_ptr", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_lock", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_lock_init", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_mb", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_rmb", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_sub_fetch_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_swap_32", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_swap_ptr", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_unlock", referenced from:

  import-atom in libopen-pal.dylib

  "_opal_atomic_wmb", referenced from:

  import-atom in libopen-pal.dylib

ld: symbol(s) not found for architecture x86_64

make[2]: *** [opal_wrapper] Error 1

make[1]: *** [all-recursive] Error 1

make: *** [all-recursive] Error 1


I am not sure whether the ld part affects the make process or not. Either way,
Error 1 appears for "opal_wrapper", which I think has been the error I
kept encountering.

Is there any explanation to this specific error?

PS: the configure command I used is as follows, provided by the official
website of MARE2DEM:

sudo ./configure --prefix=/opt/openmpi CC=icc CXX=icc F77=ifort FC=ifort \
lt_prog_compiler_wl_FC='-Wl,';
make all install


Thanks again,
Cici

On Thu, Apr 21, 2022 at 11:18 PM George Bosilca via users <
users@lists.open-mpi.org> wrote:

> 1. I am not aware of any outstanding OMPI issues with the M1 chip that
> would prevent OMPI from compiling and running efficiently in an M1-based
> setup, assuming the compilation chain is working properly.
>
> 2. M1 supports x86 code via Rosetta, an app provided by Apple to ensure a
> smooth transition from the Intel-based to the M1-based laptop line. I do
> recall running an OMPI compiled on my Intel laptop on my M1 laptop to test
> the performance of the Rosetta binary translator. We even had some
> discussions about this on the mailing list (or in github issues).
>
> 3. Based on your original message, and their webpage, MARE2DEM does not
> support any compilation chain other than Intel. As explained above, that
> might not by itself be a showstopper, because you can run x86 code on the
> M1 chip using Rosetta. However, MARE2DEM relies on MKL, the Intel Math
> Kernel Library, and that library will not run on an M1 chip.
>
>   George.
>
>
> On Thu, Apr 21, 2022 at 7:02 AM Jeff Squyres (jsquyres) via users <
> users@lists.open-mpi.org> wrote:
>
>> A little more color on Gilles' answer: I believe that we had some Open
>> MPI community members work on adding M1 support to Open MPI, but Gilles is
>> absolutely correct: the underlying compiler has to support the M1, or you
>> won't get anywhere.
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>>
>> 
>> From: users  on behalf of Cici Feng
>> via users 
>> Sent: Thursday, 

Re: [OMPI users] help with M1 chip macOS openMPI installation

2022-04-21 Thread George Bosilca via users
1. I am not aware of any outstanding OMPI issues with the M1 chip that
would prevent OMPI from compiling and running efficiently in an M1-based
setup, assuming the compilation chain is working properly.

2. M1 supports x86 code via Rosetta, an app provided by Apple to ensure a
smooth transition from the Intel-based to the M1-based laptop line. I do
recall running an OMPI compiled on my Intel laptop on my M1 laptop to test
the performance of the Rosetta binary translator. We even had some
discussions about this on the mailing list (or in github issues).

3. Based on your original message, and their webpage, MARE2DEM does not
support any compilation chain other than Intel. As explained above, that
might not by itself be a showstopper, because you can run x86 code on the
M1 chip using Rosetta. However, MARE2DEM relies on MKL, the Intel Math
Kernel Library, and that library will not run on an M1 chip.

  George.


On Thu, Apr 21, 2022 at 7:02 AM Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> A little more color on Gilles' answer: I believe that we had some Open MPI
> community members work on adding M1 support to Open MPI, but Gilles is
> absolutely correct: the underlying compiler has to support the M1, or you
> won't get anywhere.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> 
> From: users  on behalf of Cici Feng via
> users 
> Sent: Thursday, April 21, 2022 6:11 AM
> To: Open MPI Users
> Cc: Cici Feng
> Subject: Re: [OMPI users] help with M1 chip macOS openMPI installation
>
> Gilles,
>
> Thank you so much for the quick response!
> The openMPI installed by brew is compiled with gcc and gfortran, using the
> original compilers by Apple. I haven't yet figured out how to use this gcc
> openMPI for the inversion software :(
> Given your answer, I think I'll pause the M1 + intel compilers + openMPI
> route for now and switch to an intel cluster until someone figures out the
> M1 chip problem ~
>
> Thanks again for your help!
> Cici
>
> On Thu, Apr 21, 2022 at 5:59 PM Gilles Gouaillardet via users <
> users@lists.open-mpi.org> wrote:
> Cici,
>
> I do not think the Intel C compiler is able to generate native code for
> the M1 (aarch64).
> The best case scenario is it would generate code for x86_64 and then
> Rosetta would be used to translate it to aarch64 code,
> and this is a very downgraded solution.
>
> So if you really want to stick to the Intel compiler, I strongly encourage
> you to run on Intel/AMD processors.
> Otherwise, use a native compiler for aarch64, and in this case, brew is
> not a bad option.
>
>
> Cheers,
>
> Gilles
>
> On Thu, Apr 21, 2022 at 6:36 PM Cici Feng via users <
> users@lists.open-mpi.org> wrote:
> Hi there,
>
> I am trying to install electromagnetic inversion software (MARE2DEM), for
> which the Intel C compilers and Open MPI are considered prerequisites.
> However, since I am completely new to computer science and coding, and given
> some technical issues with the computer I am building all this on, I have
> run into some questions with the whole process.
>
> The computer I am working on is a MacBook Pro with an M1 Max chip. Despite
> my friends discouraging me from continuing to work on my M1 laptop, I still
> want to reach out to the developers since I feel you might have a solution.
>
> I downloaded the openMPI source code from the .org website and ran "sudo
> configure and make all install", but I was not able to install openMPI on my
> computer. The error mentioned something about the chip not being supported.
>
> I have also tried to install openMPI through Homebrew using the command
> "brew install openmpi" and it worked just fine. However, since Homebrew has
> automatically set up the configuration of openMPI (it uses gcc and
> gfortran), I was not able to use my Intel compilers to build openMPI, which
> causes further problems in the installation of my inversion software.
>
> In conclusion, I think the M1 chip is currently the biggest problem in the
> whole installation process, yet I think you might have some solution for
> the installation. I would assume that Apple is switching all of its chips
> to M1, which makes the shift inevitable.
>
> I would really like to hear from you about a solution for installing
> openMPI on an M1 MacBook, and I would like to thank you for taking the time
> to read my long email.
>
> Thank you very much.
> Sincerely,
>
> Cici
>
>
>
>
>
>


Re: [OMPI users] help with M1 chip macOS openMPI installation

2022-04-21 Thread Jeff Squyres (jsquyres) via users
A little more color on Gilles' answer: I believe that we had some Open MPI 
community members work on adding M1 support to Open MPI, but Gilles is 
absolutely correct: the underlying compiler has to support the M1, or you won't 
get anywhere.

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Cici Feng via users 

Sent: Thursday, April 21, 2022 6:11 AM
To: Open MPI Users
Cc: Cici Feng
Subject: Re: [OMPI users] help with M1 chip macOS openMPI installation

Gilles,

Thank you so much for the quick response!
The openMPI installed by brew is compiled with gcc and gfortran, using the
original compilers by Apple. I haven't yet figured out how to use this gcc
openMPI for the inversion software :(
Given your answer, I think I'll pause the M1 + intel compilers + openMPI
route for now and switch to an intel cluster until someone figures out the
M1 chip problem ~

Thanks again for your help!
Cici

On Thu, Apr 21, 2022 at 5:59 PM Gilles Gouaillardet via users
<users@lists.open-mpi.org> wrote:
Cici,

I do not think the Intel C compiler is able to generate native code for the M1 
(aarch64).
The best case scenario is it would generate code for x86_64 and then Rosetta 
would be used to translate it to aarch64 code,
and this is a very downgraded solution.

So if you really want to stick to the Intel compiler, I strongly encourage you 
to run on Intel/AMD processors.
Otherwise, use a native compiler for aarch64, and in this case, brew is not a 
bad option.


Cheers,

Gilles

On Thu, Apr 21, 2022 at 6:36 PM Cici Feng via users
<users@lists.open-mpi.org> wrote:
Hi there,

I am trying to install electromagnetic inversion software (MARE2DEM), for
which the Intel C compilers and Open MPI are considered prerequisites.
However, since I am completely new to computer science and coding, and given
some technical issues with the computer I am building all this on, I have run
into some questions with the whole process.

The computer I am working on is a MacBook Pro with an M1 Max chip. Despite my
friends discouraging me from continuing to work on my M1 laptop, I still want
to reach out to the developers since I feel you might have a solution.

I downloaded the openMPI source code from the .org website and ran "sudo
configure and make all install", but I was not able to install openMPI on my
computer. The error mentioned something about the chip not being supported.

I have also tried to install openMPI through Homebrew using the command "brew
install openmpi" and it worked just fine. However, since Homebrew has
automatically set up the configuration of openMPI (it uses gcc and gfortran),
I was not able to use my Intel compilers to build openMPI, which causes
further problems in the installation of my inversion software.

In conclusion, I think the M1 chip is currently the biggest problem in the
whole installation process, yet I think you might have some solution for the
installation. I would assume that Apple is switching all of its chips to M1,
which makes the shift inevitable.

I would really like to hear from you about a solution for installing openMPI
on an M1 MacBook, and I would like to thank you for taking the time to read my
long email.

Thank you very much.
Sincerely,

Cici







Re: [OMPI users] help with M1 chip macOS openMPI installation

2022-04-21 Thread Cici Feng via users
Gilles,

Thank you so much for the quick response!
The openMPI installed by brew is compiled with gcc and gfortran, using the
original compilers by Apple. I haven't yet figured out how to use this gcc
openMPI for the inversion software :(
Given your answer, I think I'll pause the M1 + intel compilers + openMPI
route for now and switch to an intel cluster until someone figures out the
M1 chip problem ~

Thanks again for your help!
Cici

On Thu, Apr 21, 2022 at 5:59 PM Gilles Gouaillardet via users <
users@lists.open-mpi.org> wrote:

> Cici,
>
> I do not think the Intel C compiler is able to generate native code for
> the M1 (aarch64).
> The best case scenario is it would generate code for x86_64 and then
> Rosetta would be used to translate it to aarch64 code,
> and this is a very downgraded solution.
>
> So if you really want to stick to the Intel compiler, I strongly encourage
> you to run on Intel/AMD processors.
> Otherwise, use a native compiler for aarch64, and in this case, brew is
> not a bad option.
>
>
> Cheers,
>
> Gilles
>
> On Thu, Apr 21, 2022 at 6:36 PM Cici Feng via users <
> users@lists.open-mpi.org> wrote:
>
>> Hi there,
>>
>> I am trying to install electromagnetic inversion software (MARE2DEM),
>> for which the Intel C compilers and Open MPI are considered prerequisites.
>> However, since I am completely new to computer science and coding, and
>> given some technical issues with the computer I am building all this on, I
>> have run into some questions with the whole process.
>>
>> The computer I am working on is a MacBook Pro with an M1 Max chip. Despite
>> my friends discouraging me from continuing to work on my M1 laptop, I still
>> want to reach out to the developers since I feel you might have a
>> solution.
>>
>> I downloaded the openMPI source code from the .org website and ran "sudo
>> configure and make all install", but I was not able to install openMPI on
>> my computer. The error mentioned something about the chip not being
>> supported.
>>
>> I have also tried to install openMPI through Homebrew using the command
>> "brew install openmpi" and it worked just fine. However, since Homebrew has
>> automatically set up the configuration of openMPI (it uses gcc and
>> gfortran), I was not able to use my Intel compilers to build openMPI, which
>> causes further problems in the installation of my inversion software.
>>
>> In conclusion, I think the M1 chip is currently the biggest problem in
>> the whole installation process, yet I think you might have some
>> solution for the installation. I would assume that Apple is switching all
>> of its chips to M1, which makes the shift inevitable.
>>
>> I would really like to hear from you about a solution for installing
>> openMPI on an M1 MacBook, and I would like to thank you for taking the time
>> to read my long email.
>>
>> Thank you very much.
>> Sincerely,
>>
>> Cici
>>
>>
>>
>>
>>
>>


Re: [OMPI users] help with M1 chip macOS openMPI installation

2022-04-21 Thread Gilles Gouaillardet via users
Cici,

I do not think the Intel C compiler is able to generate native code for the
M1 (aarch64).
The best case scenario is it would generate code for x86_64 and then
Rosetta would be used to translate it to aarch64 code,
and this is a very downgraded solution.

So if you really want to stick to the Intel compiler, I strongly encourage
you to run on Intel/AMD processors.
Otherwise, use a native compiler for aarch64, and in this case, brew is not
a bad option.


Cheers,

Gilles

On Thu, Apr 21, 2022 at 6:36 PM Cici Feng via users <
users@lists.open-mpi.org> wrote:

> Hi there,
>
> I am trying to install electromagnetic inversion software (MARE2DEM), for
> which the Intel C compilers and Open MPI are considered prerequisites.
> However, since I am completely new to computer science and coding, and given
> some technical issues with the computer I am building all this on, I have
> run into some questions with the whole process.
>
> The computer I am working on is a MacBook Pro with an M1 Max chip. Despite
> my friends discouraging me from continuing to work on my M1 laptop, I still
> want to reach out to the developers since I feel you might have a
> solution.
>
> I downloaded the openMPI source code from the .org website and ran "sudo
> configure and make all install", but I was not able to install openMPI on
> my computer. The error mentioned something about the chip not being
> supported.
>
> I have also tried to install openMPI through Homebrew using the command
> "brew install openmpi" and it worked just fine. However, since Homebrew has
> automatically set up the configuration of openMPI (it uses gcc and
> gfortran), I was not able to use my Intel compilers to build openMPI, which
> causes further problems in the installation of my inversion software.
>
> In conclusion, I think the M1 chip is currently the biggest problem in the
> whole installation process, yet I think you might have some solution
> for the installation. I would assume that Apple is switching all of its
> chips to M1, which makes the shift inevitable.
>
> I would really like to hear from you about a solution for installing
> openMPI on an M1 MacBook, and I would like to thank you for taking the time
> to read my long email.
>
> Thank you very much.
> Sincerely,
>
> Cici
>
>
>
>
>
>


Re: [OMPI users] [Help] Must orted exit after all spawned processes exit

2021-05-19 Thread Ralph Castain via users
To answer your specific questions:

The backend daemons (orted) will not exit until all locally spawned procs exit. 
This is not configurable - for one thing, OMPI procs will suicide if they see 
the daemon depart, so it makes no sense to have the daemon fail if a proc 
terminates. The logic behind this behavior spans multiple parts of the code 
base, I'm afraid.

On May 17, 2021, at 7:03 AM, Jeff Squyres (jsquyres) via users
<users@lists.open-mpi.org> wrote:

FYI: general Open MPI questions are better sent to the user's mailing list.

Up through the v4.1.x series, the "orted" is a general helper process that Open 
MPI uses on the back-end.  It will not quit until all of its children have 
died.  Open MPI's run time is designed with the intent that some external 
helper will be there for the entire duration of the job; there is no option to 
run without one.

Two caveats:

1. In Open MPI v5.0.x, from the user's perspective, "orted" has been renamed to 
be "prted".  Since this is 99.999% behind the scenes, most users won't notice 
the difference.

2. You can run without "orted" (or "prted") if you use a different run-time 
environment (e.g., SLURM).  In this case, you'll use that environment's 
launcher (e.g., srun or sbatch in SLURM environments) to directly launch MPI 
processes -- you won't use "mpirun" at all.  Fittingly, this is called "direct 
launch" in Open MPI parlance (i.e., using another run-time's daemons to launch 
processes instead of first launching orteds (or prteds)).



On May 16, 2021, at 8:34 AM, 叶安华 <yean...@sensetime.com> wrote:

Code snippet:

# sleep.sh
sleep 10001 &
/bin/sh son_sleep.sh
sleep 10002

# son_sleep.sh
sleep 10003 &
sleep 10004 &

thanks
Anhua
From: 叶安华 <yean...@sensetime.com>
Date: Sunday, May 16, 2021 at 20:31
To: "jsquy...@cisco.com  " mailto:jsquy...@cisco.com> >
Subject: [Help] Must orted exit after all spawned processes exit
 Dear Jeff, 
 Sorry to bother you but I am really curious about the conditions on which 
orted exits in the below scenario, and I am looking forward to hearing from you.
 Scenario description:
· Step 1: start a remote process via "mpirun -np 1 -host 10.211.55.4 sh 
sleep.sh"
· Step 2: check pstree in the remote host:

· Step 3: the mpirun process in step 1 does not exit until I kill all
the sleeping processes, which are 15479 15481 15482 15483

To conclude, my questions are as follows:
1.  Must orted wait until all spawned processes exit?
2.  Is this behavior configurable? What if I want orted to exit immediately
after any one of the spawned processes exits?
3.  I did not find the specific logic for orted waiting for spawned
processes to exit; I hope I can get some hints from you.
 PS (scripts):

  thanks
Anhua
 

-- 
Jeff Squyres
jsquy...@cisco.com  






Re: [OMPI users] [Help] Must orted exit after all spawned processes exit

2021-05-17 Thread Jeff Squyres (jsquyres) via users
FYI: general Open MPI questions are better sent to the user's mailing list.

Up through the v4.1.x series, the "orted" is a general helper process that Open 
MPI uses on the back-end.  It will not quit until all of its children have 
died.  Open MPI's run time is designed with the intent that some external 
helper will be there for the entire duration of the job; there is no option to 
run without one.

Two caveats:

1. In Open MPI v5.0.x, from the user's perspective, "orted" has been renamed to 
be "prted".  Since this is 99.999% behind the scenes, most users won't notice 
the difference.

2. You can run without "orted" (or "prted") if you use a different run-time 
environment (e.g., SLURM).  In this case, you'll use that environment's 
launcher (e.g., srun or sbatch in SLURM environments) to directly launch MPI 
processes -- you won't use "mpirun" at all.  Fittingly, this is called "direct 
launch" in Open MPI parlance (i.e., using another run-time's daemons to launch 
processes instead of first launching orteds (or prteds)).



On May 16, 2021, at 8:34 AM, 叶安华 <yean...@sensetime.com> wrote:

Code snippet:

# sleep.sh
sleep 10001 &
/bin/sh son_sleep.sh
sleep 10002

# son_sleep.sh
sleep 10003 &
sleep 10004 &

thanks
Anhua


From: 叶安华 <yean...@sensetime.com>
Date: Sunday, May 16, 2021 at 20:31
To: "jsquy...@cisco.com" 
mailto:jsquy...@cisco.com>>
Subject: [Help] Must orted exit after all spawned processes exit

Dear Jeff,

Sorry to bother you but I am really curious about the conditions on which orted 
exits in the below scenario, and I am looking forward to hearing from you.

Scenario description:
• Step 1: start a remote process via "mpirun -np 1 -host 10.211.55.4 sh 
sleep.sh"
• Step 2: check pstree in the remote host:

• Step 3: the mpirun process in step 1 does not exit until I kill all
the sleeping processes, which are 15479 15481 15482 15483

To conclude, my questions are as follows:

  1.  Must orted wait until all spawned processes exit?
  2.  Is this behavior configurable? What if I want orted to exit immediately
after any one of the spawned processes exits?
  3.  I did not find the specific logic for orted waiting for spawned
processes to exit; I hope I can get some hints from you.


PS (scripts):



thanks
Anhua



--
Jeff Squyres
jsquy...@cisco.com





Re: [OMPI users] help

2020-12-14 Thread Lesiano 16 via users
Thanks for the answer


On Mon, Dec 14, 2020 at 4:20 PM Jeff Squyres (jsquyres) wrote:

> On Dec 12, 2020, at 4:58 AM, Lesiano 16 via users <
> users@lists.open-mpi.org> wrote:
> >
> > My question is, can I assume that when skipping the beginning of the
> > file MPI will fill it up with zeros? Or is it implementation dependent?
> >
> > I have read the standard, but I could not find anything meaningful
> > except for:
> >
> > "Initially, all processes view the file as a linear byte stream, and
> each process views data in its own native representation (no data
> representation conversion is performed). (POSIX files are linear byte
> streams in the native representation.) The file view can be changed via the
> MPI_FILE_SET_VIEW routine."
> >
> > which I am not sure is actually relevant or not.
>
> In general, the contents of a file written by the MPI IO interface are
> going to be implementation-specific.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
>


Re: [OMPI users] help

2020-12-14 Thread Jeff Squyres (jsquyres) via users
On Dec 12, 2020, at 4:58 AM, Lesiano 16 via users  
wrote:
> 
> My question is, can I assume that when skipping the beginning of the file
> MPI will fill it up with zeros? Or is it implementation dependent?
> 
> I have read the standard, but I could not find anything meaningful except
> for:
> 
> "Initially, all processes view the file as a linear byte stream, and each 
> process views data in its own native representation (no data representation 
> conversion is performed). (POSIX files are linear byte streams in the native 
> representation.) The file view can be changed via the MPI_FILE_SET_VIEW 
> routine."
> 
> which I am not sure is actually relevant or not.

In general, the contents of a file written by the MPI IO interface are going to 
be implementation-specific.

-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI

2020-02-25 Thread Adam Simpson via users
I would start by running with the docker flags I provided to see if that fixes 
the issue:

$ docker run --privileged --security-opt label=disable --security-opt 
seccomp=unconfined --security-opt apparmor=unconfined --ipc=host --network=host 
...

These flags strip away some of the security and confinement features that are 
enabled by default in Docker and are my go-to set of flags when diagnosing what 
I suspect are container related issues. If it fixes your issues you can look 
into using a finer granularity approach to stripping away security features, 
such as editing seccomp, if you'd like; It depends on the environment and how 
important the various security and confinement features are to you. If the 
flags don't fix your issue you might check for ptrace restrictions on the host 
using the sysctl commands in my previous message.
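
As a rough sketch of that finer-grained route (the profile file name below is 
illustrative; the syscall names and flags are the ones from the default-profile 
discussion earlier in this thread):

# Copy the default seccomp profile locally (e.g. as myseccomp.json), add
# "process_vm_readv" and "process_vm_writev" to its syscalls "names" list,
# then point Docker at that file:
$ docker run --security-opt seccomp=myseccomp.json ...

# Or keep the default profile and just grant the ptrace capability:
$ docker run --cap-add=SYS_PTRACE ...

# And the yama check on the host (relax it only if the value is non-zero):
$ sysctl kernel.yama.ptrace_scope
$ sudo sysctl -w kernel.yama.ptrace_scope=0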

Best,
Adam

From: Matt Thompson 
Sent: Tuesday, February 25, 2020 5:54 AM
To: Adam Simpson 
Cc: Open MPI Users 
Subject: Re: [OMPI users] Help with One-Sided Communication: Works in Intel 
MPI, Fails in Open MPI

External email: Use caution opening links or attachments

Adam,

A couple questions. First, is seccomp the reason you think I have the 
MPI_THREAD_MULTIPLE error? Or is it more for the vader error? If so, the 
environment variable Nathan provided is probably enough. These are unit tests 
and should execute in seconds at most (building them takes 10x-100x more time).

But if it can help with the MPI_THREAD_MULTIPLE error, can you help translate 
that to "Fortran programmer who really can only do docker build/run/push/cp" 
for me? I found this page: https://docs.docker.com/engine/security/seccomp/ 
that I'm trying to read through and understand, but I'm mainly learning I 
should be looking at taking some Docker training soon!

On Mon, Feb 24, 2020 at 8:24 PM Adam Simpson <asimp...@nvidia.com> wrote:
Calls to process_vm_readv() and process_vm_writev() are disabled in the default 
Docker seccomp 
profile<https://github.com/moby/moby/blob/master/profiles/seccomp/default.json>.
 You can add the docker flag --cap-add=SYS_PTRACE or better yet modify the 
seccomp profile so that process_vm_readv and process_vm_writev are whitelisted, 
by adding them to the syscalls.names list.

You can also disable seccomp, and several other confinement and security 
features, if you prefer a heavy handed approach:

$ docker run --privileged --security-opt label=disable --security-opt 
seccomp=unconfined --security-opt apparmor=unconfined --ipc=host --network=host 
...

If you're still having trouble after fixing the above you may need to check 
yama on the host. You can check with "sysctl -w kernel.yama.ptrace_scope", if 
it returns a value other than 0 you may need to disable it with "sysctl -w 
kernel.yama.ptrace_scope=0".

Adam


From: users <users-boun...@lists.open-mpi.org> on behalf of Matt Thompson via 
users <users@lists.open-mpi.org>
Sent: Monday, February 24, 2020 5:15 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Matt Thompson <fort...@gmail.com>
Subject: Re: [OMPI users] Help with One-Sided Communication: Works in Intel 
MPI, Fails in Open MPI

External email: Use caution opening links or attachments

Nathan,

The reproducer would be that code that's on the Intel website. That is what I 
was running. You could pull my image if you like but...since you are the genius:

[root@adac3ce0cf32 ~]# mpirun --mca btl_vader_single_copy_mechanism none -np 2 
./a.out
Rank 0 running on adac3ce0cf32
Rank 1 running on adac3ce0cf32
Rank 0 sets data in the shared memory: 00 01 02 03
Rank 1 sets data in the shared memory: 10 11 12 13
Rank 0 gets data from the shared memory: 10 11 12 13
Rank 0 has new data in the shared memory: 00 01 02 03
Rank 1 gets data from the shared memory: 00 01 02 03
Rank 1 has new data in the shared memory: 10 11 12 13

And knowing this led to: https://github.com/open-mpi/ompi/issues/4948

So, good news is that setting export 
OMPI_MCA_btl_vader_single_copy_mechanism=none lets a lot of stuff work. The 
bad news is we seem to be using MPI_THREAD_MULTIPLE and it does not like it:

Start 2: pFIO_tests_mpi

2: Test command: /opt/openmpi-4.0.2/bin/mpiexec "-n" "18" "-oversubscribe" 
"/root/project/MAPL/build/bin/pfio_ctest_io.x" "-nc" "6" "-nsi" "6" "-nso" "6" 
"-ngo" "1" "-ngi" "1" "-v" "T,U" "-s" "mpi"
2: Test timeout computed to be: 1500
2: --
2: The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this release.
2: Workarounds are to run on a single node, or to use a system with an RDMA
2: capable network such as Infiniband.
2: ---

Re: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI

2020-02-25 Thread Matt Thompson via users
Adam,

A couple questions. First, is seccomp the reason you think I have the
MPI_THREAD_MULTIPLE error? Or is it more for the vader error? If so, the
environment variable Nathan provided is probably enough. These are unit
tests and should execute in seconds at most (building them takes 10x-100x
more time).

But if it can help with the MPI_THREAD_MULTIPLE error, can you help
translate that to "Fortran programmer who really can only do docker
build/run/push/cp" for me? I found this page:
https://docs.docker.com/engine/security/seccomp/ that I'm trying to read
through and understand, but I'm mainly learning I should be looking at
taking some Docker training soon!

On Mon, Feb 24, 2020 at 8:24 PM Adam Simpson  wrote:

> Calls to process_vm_readv() and process_vm_writev() are disabled in the
> default Docker seccomp profile
> <https://github.com/moby/moby/blob/master/profiles/seccomp/default.json>.
> You can add the docker flag --cap-add=SYS_PTRACE or better yet modify the
> seccomp profile so that process_vm_readv and process_vm_writev are
> whitelisted, by adding them to the syscalls.names list.
>
> You can also disable seccomp, and several other confinement and security
> features, if you prefer a heavy handed approach:
>
> $ docker run --privileged --security-opt label=disable --security-opt
> seccomp=unconfined --security-opt apparmor=unconfined --ipc=host
> --network=host ...
>
> If you're still having trouble after fixing the above you may need to
> check yama on the host. You can check with "sysctl
> kernel.yama.ptrace_scope", if it returns a value other than 0 you may
> need to disable it with "sysctl -w kernel.yama.ptrace_scope=0".
>
> Adam
>
> --
> *From:* users  on behalf of Matt
> Thompson via users 
> *Sent:* Monday, February 24, 2020 5:15 PM
> *To:* Open MPI Users 
> *Cc:* Matt Thompson 
> *Subject:* Re: [OMPI users] Help with One-Sided Communication: Works in
> Intel MPI, Fails in Open MPI
>
> *External email: Use caution opening links or attachments*
> Nathan,
>
> The reproducer would be that code that's on the Intel website. That is
> what I was running. You could pull my image if you like but...since you are
> the genius:
>
> [root@adac3ce0cf32 ~]# mpirun --mca btl_vader_single_copy_mechanism none
> -np 2 ./a.out
>
> Rank 0 running on adac3ce0cf32
> Rank 1 running on adac3ce0cf32
> Rank 0 sets data in the shared memory: 00 01 02 03
> Rank 1 sets data in the shared memory: 10 11 12 13
> Rank 0 gets data from the shared memory: 10 11 12 13
> Rank 0 has new data in the shared memory: 00 01 02 03
> Rank 1 gets data from the shared memory: 00 01 02 03
> Rank 1 has new data in the shared memory: 10 11 12 13
>
> And knowing this led to: https://github.com/open-mpi/ompi/issues/4948
>
> So, good news is that setting export
> OMPI_MCA_btl_vader_single_copy_mechanism=none lets a lot of stuff work.
> The bad news is we seem to be using MPI_THREAD_MULTIPLE and it does not
> like it:
>
> Start 2: pFIO_tests_mpi
>
> 2: Test command: /opt/openmpi-4.0.2/bin/mpiexec "-n" "18" "-oversubscribe"
> "/root/project/MAPL/build/bin/pfio_ctest_io.x" "-nc" "6" "-nsi" "6" "-nso"
> "6" "-ngo" "1" "-ngi" "1" "-v" "T,U" "-s" "mpi"
> 2: Test timeout computed to be: 1500
> 2:
> --
> 2: The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this
> release.
> 2: Workarounds are to run on a single node, or to use a system with an RDMA
> 2: capable network such as Infiniband.
> 2:
> --
> 2: [adac3ce0cf32:03619] *** An error occurred in MPI_Win_create
> 2: [adac3ce0cf32:03619] *** reported by process [270073857,16]
> 2: [adac3ce0cf32:03619] *** on communicator MPI COMMUNICATOR 4 DUP FROM 3
> 2: [adac3ce0cf32:03619] *** MPI_ERR_WIN: invalid window
> 2: [adac3ce0cf32:03619] *** MPI_ERRORS_ARE_FATAL (processes in this
> communicator will now abort,
> 2: [adac3ce0cf32:03619] ***and potentially your MPI job)
> 2: [adac3ce0cf32:03587] 17 more processes have sent help message
> help-osc-pt2pt.txt / mpi-thread-multiple-not-supported
> 2: [adac3ce0cf32:03587] Set MCA parameter "orte_base_help_aggregate" to 0
> to see all help / error messages
> 2: [adac3ce0cf32:03587] 17 more processes have sent help message
> help-mpi-errors.txt / mpi_errors_are_fatal
> 2/5 Test #2: pFIO_tests_mpi ...***Failed0.18 sec
>
> 40% tests passed, 3 tests failed out of 

Re: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI

2020-02-24 Thread Adam Simpson via users
Calls to process_vm_readv() and process_vm_writev() are disabled in the default 
Docker seccomp 
profile<https://github.com/moby/moby/blob/master/profiles/seccomp/default.json>.
 You can add the docker flag --cap-add=SYS_PTRACE or better yet modify the 
seccomp profile so that process_vm_readv and process_vm_writev are whitelisted, 
by adding them to the syscalls.names list.

You can also disable seccomp, and several other confinement and security 
features, if you prefer a heavy handed approach:

$ docker run --privileged --security-opt label=disable --security-opt 
seccomp=unconfined --security-opt apparmor=unconfined --ipc=host --network=host 
...

If you're still having trouble after fixing the above you may need to check 
yama on the host. You can check with "sysctl kernel.yama.ptrace_scope", if 
it returns a value other than 0 you may need to disable it with "sysctl -w 
kernel.yama.ptrace_scope=0".

Adam


From: users  on behalf of Matt Thompson via 
users 
Sent: Monday, February 24, 2020 5:15 PM
To: Open MPI Users 
Cc: Matt Thompson 
Subject: Re: [OMPI users] Help with One-Sided Communication: Works in Intel 
MPI, Fails in Open MPI

External email: Use caution opening links or attachments

Nathan,

The reproducer would be that code that's on the Intel website. That is what I 
was running. You could pull my image if you like but...since you are the genius:

[root@adac3ce0cf32 ~]# mpirun --mca btl_vader_single_copy_mechanism none -np 2 
./a.out
Rank 0 running on adac3ce0cf32
Rank 1 running on adac3ce0cf32
Rank 0 sets data in the shared memory: 00 01 02 03
Rank 1 sets data in the shared memory: 10 11 12 13
Rank 0 gets data from the shared memory: 10 11 12 13
Rank 0 has new data in the shared memory: 00 01 02 03
Rank 1 gets data from the shared memory: 00 01 02 03
Rank 1 has new data in the shared memory: 10 11 12 13

And knowing this led to: https://github.com/open-mpi/ompi/issues/4948

So, good news is that setting export 
OMPI_MCA_btl_vader_single_copy_mechanism=none lets a lot of stuff work. The 
bad news is we seem to be using MPI_THREAD_MULTIPLE and it does not like it:

Start 2: pFIO_tests_mpi

2: Test command: /opt/openmpi-4.0.2/bin/mpiexec "-n" "18" "-oversubscribe" 
"/root/project/MAPL/build/bin/pfio_ctest_io.x" "-nc" "6" "-nsi" "6" "-nso" "6" 
"-ngo" "1" "-ngi" "1" "-v" "T,U" "-s" "mpi"
2: Test timeout computed to be: 1500
2: --
2: The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this release.
2: Workarounds are to run on a single node, or to use a system with an RDMA
2: capable network such as Infiniband.
2: --
2: [adac3ce0cf32:03619] *** An error occurred in MPI_Win_create
2: [adac3ce0cf32:03619] *** reported by process [270073857,16]
2: [adac3ce0cf32:03619] *** on communicator MPI COMMUNICATOR 4 DUP FROM 3
2: [adac3ce0cf32:03619] *** MPI_ERR_WIN: invalid window
2: [adac3ce0cf32:03619] *** MPI_ERRORS_ARE_FATAL (processes in this 
communicator will now abort,
2: [adac3ce0cf32:03619] ***and potentially your MPI job)
2: [adac3ce0cf32:03587] 17 more processes have sent help message 
help-osc-pt2pt.txt / mpi-thread-multiple-not-supported
2: [adac3ce0cf32:03587] Set MCA parameter "orte_base_help_aggregate" to 0 to 
see all help / error messages
2: [adac3ce0cf32:03587] 17 more processes have sent help message 
help-mpi-errors.txt / mpi_errors_are_fatal
2/5 Test #2: pFIO_tests_mpi ...***Failed0.18 sec

40% tests passed, 3 tests failed out of 5

Total Test time (real) =   1.08 sec

The following tests FAILED:
  2 - pFIO_tests_mpi (Failed)
  3 - pFIO_tests_simple (Failed)
  4 - pFIO_tests_hybrid (Failed)
Errors while running CTest

The weird thing is, I *am* running on one node (it's all I have, I'm not fancy 
enough at AWS to try more yet) and ompi_info does mention MPI_THREAD_MULTIPLE:

[root@adac3ce0cf32 build]# ompi_info | grep -i mult
  Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, 
OMPI progress: no, ORTE progress: yes, Event lib: yes)

Any ideas on this one?

On Mon, Feb 24, 2020 at 7:24 PM Nathan Hjelm via users <users@lists.open-mpi.org> wrote:
The error is from btl/vader. CMA is not functioning as expected. It might work 
if you set btl_vader_single_copy_mechanism=none

Performance will suffer though. It would be worth understanding why 
process_vm_readv is failing.

Can you send a simple reproducer?

-Nathan

On Feb 24, 2020, at 2:59 PM, Gabriel, Edgar via users <users@lists.open-mpi.org> wrote:



I am not an expert for the one-sided code in Open MPI, I wanted to c

Re: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI

2020-02-24 Thread Matt Thompson via users
Nathan,

The reproducer would be that code that's on the Intel website. That is what
I was running. You could pull my image if you like but...since you are the
genius:

[root@adac3ce0cf32 ~]# mpirun --mca btl_vader_single_copy_mechanism none
-np 2 ./a.out

Rank 0 running on adac3ce0cf32
Rank 1 running on adac3ce0cf32
Rank 0 sets data in the shared memory: 00 01 02 03
Rank 1 sets data in the shared memory: 10 11 12 13
Rank 0 gets data from the shared memory: 10 11 12 13
Rank 0 has new data in the shared memory: 00 01 02 03
Rank 1 gets data from the shared memory: 00 01 02 03
Rank 1 has new data in the shared memory: 10 11 12 13

And knowing this led to: https://github.com/open-mpi/ompi/issues/4948

So, good news is that setting export
OMPI_MCA_btl_vader_single_copy_mechanism=none lets a lot of stuff work.
The bad news is we seem to be using MPI_THREAD_MULTIPLE and it does not
like it:

Start 2: pFIO_tests_mpi

2: Test command: /opt/openmpi-4.0.2/bin/mpiexec "-n" "18" "-oversubscribe"
"/root/project/MAPL/build/bin/pfio_ctest_io.x" "-nc" "6" "-nsi" "6" "-nso"
"6" "-ngo" "1" "-ngi" "1" "-v" "T,U" "-s" "mpi"
2: Test timeout computed to be: 1500
2:
--
2: The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this
release.
2: Workarounds are to run on a single node, or to use a system with an RDMA
2: capable network such as Infiniband.
2:
--
2: [adac3ce0cf32:03619] *** An error occurred in MPI_Win_create
2: [adac3ce0cf32:03619] *** reported by process [270073857,16]
2: [adac3ce0cf32:03619] *** on communicator MPI COMMUNICATOR 4 DUP FROM 3
2: [adac3ce0cf32:03619] *** MPI_ERR_WIN: invalid window
2: [adac3ce0cf32:03619] *** MPI_ERRORS_ARE_FATAL (processes in this
communicator will now abort,
2: [adac3ce0cf32:03619] ***and potentially your MPI job)
2: [adac3ce0cf32:03587] 17 more processes have sent help message
help-osc-pt2pt.txt / mpi-thread-multiple-not-supported
2: [adac3ce0cf32:03587] Set MCA parameter "orte_base_help_aggregate" to 0
to see all help / error messages
2: [adac3ce0cf32:03587] 17 more processes have sent help message
help-mpi-errors.txt / mpi_errors_are_fatal
2/5 Test #2: pFIO_tests_mpi ...***Failed0.18 sec

40% tests passed, 3 tests failed out of 5

Total Test time (real) =   1.08 sec

The following tests FAILED:
  2 - pFIO_tests_mpi (Failed)
  3 - pFIO_tests_simple (Failed)
  4 - pFIO_tests_hybrid (Failed)
Errors while running CTest

The weird thing is, I *am* running on one node (it's all I have, I'm not
fancy enough at AWS to try more yet) and ompi_info does mention
MPI_THREAD_MULTIPLE:

[root@adac3ce0cf32 build]# ompi_info | grep -i mult
  Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support:
yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)

Any ideas on this one?

On Mon, Feb 24, 2020 at 7:24 PM Nathan Hjelm via users <
users@lists.open-mpi.org> wrote:

> The error is from btl/vader. CMA is not functioning as expected. It might
> work if you set btl_vader_single_copy_mechanism=none
>
> Performance will suffer though. It would be worth understanding why
> process_vm_readv is failing.
>
> Can you send a simple reproducer?
>
> -Nathan
>
> On Feb 24, 2020, at 2:59 PM, Gabriel, Edgar via users <
> users@lists.open-mpi.org> wrote:
>
> 
>
> I am not an expert for the one-sided code in Open MPI, I wanted to comment
> briefly on the potential MPI -IO related item. As far as I can see, the
> error message
>
>
>
> “Read -1, expected 48, errno = 1”
>
> does not stem from MPI I/O, at least not from the ompio library. What file
> system did you use for these tests?
>
>
>
> Thanks
>
> Edgar
>
>
>
> *From:* users  *On Behalf Of *Matt
> Thompson via users
> *Sent:* Monday, February 24, 2020 1:20 PM
> *To:* users@lists.open-mpi.org
> *Cc:* Matt Thompson 
> *Subject:* [OMPI users] Help with One-Sided Communication: Works in Intel
> MPI, Fails in Open MPI
>
>
>
> All,
>
>
>
> My guess is this is a "I built Open MPI incorrectly" sort of issue, but
> I'm not sure how to fix it. Namely, I'm currently trying to get an MPI
> project's CI working on CircleCI using Open MPI to run some unit tests (on
> a single node, so need some oversubscribe). I can build everything just
> fine, but when I try to run, things just...blow up:
>
>
>
> [root@3796b115c961 build]# /opt/openmpi-4.0.2/bin/mpirun -np 18
> -oversubscribe /root/project/MAPL/build/bin/pfio_ctest_io.x -nc 6 -nsi 6
> -nso 6 -ngo 1 -ngi 1 -v T,U -s mpi
>  start app rank:   0
>  start app rank:   1
>  start app rank:   2
>  start app rank:   3
>  start app rank:   4
>  start app rank:   5
> [3796b115c961:03629] Read -1, expected 48, errno = 1
> [3796b115c961:03629] *** An error occurred in MPI_Get
> [3796b115c961:03629] *** reported by process [2144600065,12]
> [3796b115c961:03629] *** on 

Re: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI

2020-02-24 Thread Matt Thompson via users
On Mon, Feb 24, 2020 at 4:57 PM Gabriel, Edgar 
wrote:

> I am not an expert for the one-sided code in Open MPI, I wanted to comment
> briefly on the potential MPI -IO related item. As far as I can see, the
> error message
>
>
>
> “Read -1, expected 48, errno = 1”
>
> does not stem from MPI I/O, at least not from the ompio library. What file
> system did you use for these tests?
>

I am not sure. It was happening in a Docker image running on an AWS EC2
instance, so I guess whatever ebs is? I'm sort of a neophyte at both AWS
and Docker, so combine the two and...

Matt


Re: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI

2020-02-24 Thread Nathan Hjelm via users
The error is from btl/vader. CMA is not functioning as expected. It might work 
if you set btl_vader_single_copy_mechanism=none
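
That MCA parameter can be set either on the mpirun command line or in the 
environment; both spellings appear elsewhere in this thread, shown side by 
side here for reference:

# On the mpirun command line:
$ mpirun --mca btl_vader_single_copy_mechanism none -np 2 ./a.out

# Or via the environment, for anything launched afterwards:
$ export OMPI_MCA_btl_vader_single_copy_mechanism=none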

Performance will suffer though. It would be worth understanding why 
process_vm_readv is failing.

Can you send a simple reproducer?

-Nathan

> On Feb 24, 2020, at 2:59 PM, Gabriel, Edgar via users 
>  wrote:
> 
> 
> I am not an expert for the one-sided code in Open MPI, I wanted to comment 
> briefly on the potential MPI -IO related item. As far as I can see, the error 
> message
>  
> “Read -1, expected 48, errno = 1” 
> 
> does not stem from MPI I/O, at least not from the ompio library. What file 
> system did you use for these tests?
>  
> Thanks
> Edgar
>  
> From: users  On Behalf Of Matt Thompson via 
> users
> Sent: Monday, February 24, 2020 1:20 PM
> To: users@lists.open-mpi.org
> Cc: Matt Thompson 
> Subject: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, 
> Fails in Open MPI
>  
> All,
>  
> My guess is this is a "I built Open MPI incorrectly" sort of issue, but I'm 
> not sure how to fix it. Namely, I'm currently trying to get an MPI project's 
> CI working on CircleCI using Open MPI to run some unit tests (on a single 
> node, so need some oversubscribe). I can build everything just fine, but when 
> I try to run, things just...blow up:
>  
> [root@3796b115c961 build]# /opt/openmpi-4.0.2/bin/mpirun -np 18 
> -oversubscribe /root/project/MAPL/build/bin/pfio_ctest_io.x -nc 6 -nsi 6 -nso 
> 6 -ngo 1 -ngi 1 -v T,U -s mpi
>  start app rank:   0
>  start app rank:   1
>  start app rank:   2
>  start app rank:   3
>  start app rank:   4
>  start app rank:   5
> [3796b115c961:03629] Read -1, expected 48, errno = 1
> [3796b115c961:03629] *** An error occurred in MPI_Get
> [3796b115c961:03629] *** reported by process [2144600065,12]
> [3796b115c961:03629] *** on win rdma window 5
> [3796b115c961:03629] *** MPI_ERR_OTHER: known error not in list
> [3796b115c961:03629] *** MPI_ERRORS_ARE_FATAL (processes in this win will now 
> abort,
> [3796b115c961:03629] ***and potentially your MPI job)
>  
> I'm currently more concerned about the MPI_Get error, though I'm not sure 
> what that "Read -1, expected 48, errno = 1" bit is about (MPI-IO error?). Now 
> this code is fairly fancy MPI code, so I decided to try a simpler one. 
> Searched the internet and found an example program here:
>  
> https://software.intel.com/en-us/blogs/2014/08/06/one-sided-communication
>  
> and when I build and run with Intel MPI it works:
>  
> (1027)(master) $ mpirun -V
> Intel(R) MPI Library for Linux* OS, Version 2018 Update 4 Build 20180823 (id: 
> 18555)
> Copyright 2003-2018 Intel Corporation.
> (1028)(master) $ mpiicc rma_test.c
> (1029)(master) $ mpirun -np 2 ./a.out
> srun.slurm: cluster configuration lacks support for cpu binding
> Rank 0 running on borgj001
> Rank 1 running on borgj001
> Rank 0 sets data in the shared memory: 00 01 02 03
> Rank 1 sets data in the shared memory: 10 11 12 13
> Rank 0 gets data from the shared memory: 10 11 12 13
> Rank 1 gets data from the shared memory: 00 01 02 03
> Rank 0 has new data in the shared memory:Rank 1 has new data in the shared 
> memory: 10 11 12 13
>  00 01 02 03
>  
> So, I have some confidence it was written correctly. Now on the same system I 
> try with Open MPI (building with gcc, not Intel C):
>  
> (1032)(master) $ mpirun -V
> mpirun (Open MPI) 4.0.1
> 
> Report bugs to http://www.open-mpi.org/community/help/
> (1033)(master) $ mpicc rma_test.c
> (1034)(master) $ mpirun -np 2 ./a.out
> Rank 0 running on borgj001
> Rank 1 running on borgj001
> Rank 0 sets data in the shared memory: 00 01 02 03
> Rank 1 sets data in the shared memory: 10 11 12 13
> [borgj001:22668] *** An error occurred in MPI_Get
> [borgj001:22668] *** reported by process [2514223105,1]
> [borgj001:22668] *** on win rdma window 3
> [borgj001:22668] *** MPI_ERR_RMA_RANGE: invalid RMA address range
> [borgj001:22668] *** MPI_ERRORS_ARE_FATAL (processes in this win will now 
> abort,
> [borgj001:22668] ***and potentially your MPI job)
> [borgj001:22642] 1 more process has sent help message help-mpi-errors.txt / 
> mpi_errors_are_fatal
> [borgj001:22642] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
> help / error messages
>  
> This is a similar failure to above. Any ideas what I might be doing wrong 
> here? I don't doubt I'm missing something, but I'm not sure what. Open MPI 
> was built pretty boringly:
>  
> Configure command line: '--with-slurm' '--enable-shared' 
> '--disable-wrapper-rpath' '--disable-wrapper-runpath' 
> '--enable-mca-no-build=btl-usnic' '--prefix=...'
>  
> And I'm not sure if we need those disable-wrapper bits anymore, but long ago 
> we needed them, and so they've lived on in "how to build" READMEs until 
> something breaks. This btl-usnic is a bit unknown to me (this was built by 
> sysadmins on a cluster), but this is pretty close to how I build on my 
> 

Re: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI

2020-02-24 Thread Gabriel, Edgar via users
I am not an expert for the one-sided code in Open MPI, I wanted to comment 
briefly on the potential MPI -IO related item. As far as I can see, the error 
message

“Read -1, expected 48, errno = 1”

does not stem from MPI I/O, at least not from the ompio library. What file 
system did you use for these tests?

Thanks
Edgar

From: users  On Behalf Of Matt Thompson via 
users
Sent: Monday, February 24, 2020 1:20 PM
To: users@lists.open-mpi.org
Cc: Matt Thompson 
Subject: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, 
Fails in Open MPI

All,

My guess is this is a "I built Open MPI incorrectly" sort of issue, but I'm not 
sure how to fix it. Namely, I'm currently trying to get an MPI project's CI 
working on CircleCI using Open MPI to run some unit tests (on a single node, so 
need some oversubscribe). I can build everything just fine, but when I try to 
run, things just...blow up:

[root@3796b115c961 build]# /opt/openmpi-4.0.2/bin/mpirun -np 18 -oversubscribe 
/root/project/MAPL/build/bin/pfio_ctest_io.x -nc 6 -nsi 6 -nso 6 -ngo 1 -ngi 1 
-v T,U -s mpi
 start app rank:   0
 start app rank:   1
 start app rank:   2
 start app rank:   3
 start app rank:   4
 start app rank:   5
[3796b115c961:03629] Read -1, expected 48, errno = 1
[3796b115c961:03629] *** An error occurred in MPI_Get
[3796b115c961:03629] *** reported by process [2144600065,12]
[3796b115c961:03629] *** on win rdma window 5
[3796b115c961:03629] *** MPI_ERR_OTHER: known error not in list
[3796b115c961:03629] *** MPI_ERRORS_ARE_FATAL (processes in this win will now 
abort,
[3796b115c961:03629] ***and potentially your MPI job)

I'm currently more concerned about the MPI_Get error, though I'm not sure what 
that "Read -1, expected 48, errno = 1" bit is about (MPI-IO error?). Now this 
code is fairly fancy MPI code, so I decided to try a simpler one. Searched the 
internet and found an example program here:

https://software.intel.com/en-us/blogs/2014/08/06/one-sided-communication

and when I build and run with Intel MPI it works:

(1027)(master) $ mpirun -V
Intel(R) MPI Library for Linux* OS, Version 2018 Update 4 Build 20180823 (id: 
18555)
Copyright 2003-2018 Intel Corporation.
(1028)(master) $ mpiicc rma_test.c
(1029)(master) $ mpirun -np 2 ./a.out
srun.slurm: cluster configuration lacks support for cpu binding
Rank 0 running on borgj001
Rank 1 running on borgj001
Rank 0 sets data in the shared memory: 00 01 02 03
Rank 1 sets data in the shared memory: 10 11 12 13
Rank 0 gets data from the shared memory: 10 11 12 13
Rank 1 gets data from the shared memory: 00 01 02 03
Rank 0 has new data in the shared memory:Rank 1 has new data in the shared 
memory: 10 11 12 13
 00 01 02 03

So, I have some confidence it was written correctly. Now on the same system I 
try with Open MPI (building with gcc, not Intel C):

(1032)(master) $ mpirun -V
mpirun (Open MPI) 4.0.1

Report bugs to http://www.open-mpi.org/community/help/
(1033)(master) $ mpicc rma_test.c
(1034)(master) $ mpirun -np 2 ./a.out
Rank 0 running on borgj001
Rank 1 running on borgj001
Rank 0 sets data in the shared memory: 00 01 02 03
Rank 1 sets data in the shared memory: 10 11 12 13
[borgj001:22668] *** An error occurred in MPI_Get
[borgj001:22668] *** reported by process [2514223105,1]
[borgj001:22668] *** on win rdma window 3
[borgj001:22668] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[borgj001:22668] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[borgj001:22668] ***and potentially your MPI job)
[borgj001:22642] 1 more process has sent help message help-mpi-errors.txt / 
mpi_errors_are_fatal
[borgj001:22642] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
help / error messages

This is a similar failure to above. Any ideas what I might be doing wrong here? 
I don't doubt I'm missing something, but I'm not sure what. Open MPI was built 
pretty boringly:

Configure command line: '--with-slurm' '--enable-shared' 
'--disable-wrapper-rpath' '--disable-wrapper-runpath' 
'--enable-mca-no-build=btl-usnic' '--prefix=...'

And I'm not sure if we need those disable-wrapper bits anymore, but long ago we 
needed them, and so they've lived on in "how to build" READMEs until something 
breaks. This btl-usnic is a bit unknown to me (this was built by sysadmins on a 
cluster), but this is pretty close to how I build on my desktop and it has the 
same issue.

Any ideas from the experts?

--
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna 
Rampton


Re: [OMPI users] HELP: openmpi is not using the specified infiniband interface !!

2020-01-14 Thread George Bosilca via users
According to the error message you are using MPICH not Open MPI.

  George.
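
A quick sketch for checking which implementation a given mpirun belongs to 
(the install prefix is the one from the original message; the rest are 
standard commands, assuming ompi_info was installed under that prefix):

$ which mpirun                    # which MPI's mpirun is first on your PATH?
$ mpirun --version                # Open MPI prints "mpirun (Open MPI) x.y.z"
$ /opt/mpi/openmpi_intel-2.1.1/bin/mpirun --version
$ /opt/mpi/openmpi_intel-2.1.1/bin/ompi_info | grep btl    # lists the built btl components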


On Tue, Jan 14, 2020 at 5:53 PM SOPORTE MODEMAT via users <
users@lists.open-mpi.org> wrote:

> Hello everyone.
>
>
>
> I would like somebody to help me figure out how I can make Open MPI
> use the InfiniBand interface that I specify with the command:
>
>
>
> /opt/mpi/openmpi_intel-2.1.1/bin/mpirun --mca btl self,openib python
> mpi_hola.py
>
>
>
> But when I print the hostname and the IP address of the interface from the
> Python script, it shows the ethernet interface:
>
>
>
> # Ejemplo de mpi4py
>
> # Funcion 'Hola mundo'
>
>
>
> from mpi4py import MPI
>
> import socket
>
>
>
> ##print(socket.gethostname())
>
>
>
> comm = MPI.COMM_WORLD
>
> rank = comm.Get_rank()
>
> size = comm.Get_size()
>
>
>
> print('Hola mundo')
>
> print('Proceso {} de {}'.format(rank, size))
>
>
>
> host_name = socket.gethostname()
>
> host_ip = socket.gethostbyname(host_name)
>
> print("Hostname :  ",host_name)
>
> print("IP : ",host_ip)
>
>
>
> The output is:
>
>
>
>
>
> Hola mundo
>
> Proceso 0 de 1
>
> Hostname :   apollo-2.private
>
> IP :  10.50.1.253
>
> Hostname :   apollo-2.private
>
> IP :  10.50.1.253
>
> Hostname :   apollo-2.private
>
> IP :  10.50.1.253
>
> Hostname :   apollo-2.private
>
> IP :  10.50.1.253
>
>
>
> But the node has already the infiniband interface configured on ib0 with
> another network.
>
>
>
> I would like you to give me some advice to make this script use the
> infiniband interface that is specified via mpi.
>
>
>
> When I run:
>
>
>
> mpirun --mca btl self,openib ./a.out
>
>
>
>
>
> I get this error that confirms that MPI is using the ethernet interface:
>
>
>
> [mpiexec@apollo-2.private] match_arg (./utils/args/args.c:122):
> unrecognized argument mca
>
> [mpiexec@apollo-2.private] HYDU_parse_array (./utils/args/args.c:140):
> argument matching returned error
>
> [mpiexec@apollo-2.private] parse_args (./ui/mpich/utils.c:1387): error
> parsing input array
>
> [mpiexec@apollo-2.private] HYD_uii_mpx_get_parameters
> (./ui/mpich/utils.c:1438): unable to parse user arguments
>
>
>
> Usage: ./mpiexec [global opts] [exec1 local opts] : [exec2 local opts] :
> ...
>
>
>
> Global options (passed to all executables):
>
>
>
>
>
> Additional Information:
>
>
>
> ll /sys/class/infiniband: mlx5_0
>
>
>
> /sys/class/net/
>
> total 0
>
> lrwxrwxrwx 1 root root 0 Jan 14 12:03 eno49 ->
> ../../devices/pci:00/:00:02.0/:07:00.0/net/eno49
>
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 eno50 ->
> ../../devices/pci:00/:00:02.0/:07:00.1/net/eno50
>
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 eno51 ->
> ../../devices/pci:00/:00:02.0/:07:00.2/net/eno51
>
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 eno52 ->
> ../../devices/pci:00/:00:02.0/:07:00.3/net/eno52
>
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 ib0 ->
> ../../devices/pci:00/:00:01.0/:06:00.0/net/ib0
>
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 ib1 ->
> ../../devices/pci:00/:00:01.0/:06:00.0/net/ib1
>
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 lo -> ../../devices/virtual/net/lo
>
>
>
>
>
> The operating system is Linux Centos 7.
>
>
>
>
>
> Thank you in advance for your help.
>
>
>
> Kind regards.
>
>
>
> Soporte.modemat
>


Re: [OMPI users] HELP: openmpi is not using the specified infiniband interface !!

2020-01-14 Thread Gilles Gouaillardet via users
Soporte,

The error message is from MPICH!

If you intend to use Open MPI, fix your environment first

Cheers,

Gilles


Sent from my iPod

> On Jan 15, 2020, at 7:53, SOPORTE MODEMAT via users 
>  wrote:
> 
> Hello everyone.
>  
> I would like somebody to help me figure out how I can make Open MPI 
> use the InfiniBand interface that I specify with the command:
>  
> /opt/mpi/openmpi_intel-2.1.1/bin/mpirun --mca btl self,openib python 
> mpi_hola.py
>  
> But when I print the hostname and the IP address of the interface from the 
> Python script, it shows the ethernet interface:
>  
> # Ejemplo de mpi4py
> # Funcion 'Hola mundo'
>  
> from mpi4py import MPI
> import socket
>  
> ##print(socket.gethostname())
>  
> comm = MPI.COMM_WORLD
> rank = comm.Get_rank()
> size = comm.Get_size()
>  
> print('Hola mundo')
> print('Proceso {} de {}'.format(rank, size))
>  
> host_name = socket.gethostname()
> host_ip = socket.gethostbyname(host_name)
> print("Hostname :  ",host_name)
> print("IP : ",host_ip)
>  
> The output is:
>  
>  
> Hola mundo
> Proceso 0 de 1
> Hostname :   apollo-2.private
> IP :  10.50.1.253
> Hostname :   apollo-2.private
> IP :  10.50.1.253
> Hostname :   apollo-2.private
> IP :  10.50.1.253
> Hostname :   apollo-2.private
> IP :  10.50.1.253
>  
> But the node has already the infiniband interface configured on ib0 with 
> another network.
>  
> I would like you to give me some advice to make this script use the 
> infiniband interface that is specified via mpi.
>  
> When I run:
>  
> mpirun --mca btl self,openib ./a.out
>  
>  
> I get this error that confirms that MPI is using the ethernet interface:
>  
> [mpiexec@apollo-2.private] match_arg (./utils/args/args.c:122): unrecognized 
> argument mca
> [mpiexec@apollo-2.private] HYDU_parse_array (./utils/args/args.c:140): 
> argument matching returned error
> [mpiexec@apollo-2.private] parse_args (./ui/mpich/utils.c:1387): error 
> parsing input array
> [mpiexec@apollo-2.private] HYD_uii_mpx_get_parameters 
> (./ui/mpich/utils.c:1438): unable to parse user arguments
>  
> Usage: ./mpiexec [global opts] [exec1 local opts] : [exec2 local opts] : ...
>  
> Global options (passed to all executables):
>  
>  
> Additional Information:
>  
> ll /sys/class/infiniband: mlx5_0
>  
> /sys/class/net/
> total 0
> lrwxrwxrwx 1 root root 0 Jan 14 12:03 eno49 -> 
> ../../devices/pci:00/:00:02.0/:07:00.0/net/eno49
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 eno50 -> 
> ../../devices/pci:00/:00:02.0/:07:00.1/net/eno50
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 eno51 -> 
> ../../devices/pci:00/:00:02.0/:07:00.2/net/eno51
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 eno52 -> 
> ../../devices/pci:00/:00:02.0/:07:00.3/net/eno52
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 ib0 -> 
> ../../devices/pci:00/:00:01.0/:06:00.0/net/ib0
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 ib1 -> 
> ../../devices/pci:00/:00:01.0/:06:00.0/net/ib1
> lrwxrwxrwx 1 root root 0 Jan 14 17:17 lo -> ../../devices/virtual/net/lo
>  
>  
> The operating system is Linux Centos 7.
>  
>  
> Thank you in advance for your help.
>  
> Kind regards.
>  
> Soporte.modemat


Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-23 Thread Matt Thompson
_MAC, et al,

Things are looking up. By specifying --with-verbs=no, I can run helloworld.
But in a new-for-me wrinkle, I can only run on *more* than one node. Not sure
I've ever seen that. Using 40-core nodes,
this:

mpirun -np 41 ./helloWorld.mpi3.SLES12.OMPI400.exe

works, and -np 40 fails:

(1027)(master) $ mpirun -np 40 ./helloWorld.mpi3.SLES12.OMPI400.exe
[borga033:05598] *** An error occurred in MPI_Barrier
[borga033:05598] *** reported by process [140735567101953,140733193388034]
[borga033:05598] *** on communicator MPI_COMM_WORLD
[borga033:05598] *** MPI_ERR_OTHER: known error not in list
[borga033:05598] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
[borga033:05598] ***and potentially your MPI job)
Compiler Version: Intel(R) Fortran Intel(R) 64 Compiler for applications
running on Intel(R) 64, Version 18.0.5.274 Build 20180823
MPI Version: 3.1
MPI Library Version: Open MPI v4.0.0, package: Open MPI mathomp4@discover21
Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018
forrtl: error (78): process killed (SIGTERM)
Image  PCRoutineLineSource
helloWorld.mpi3.S  0040A38E  for__signal_handl Unknown  Unknown
libpthread-2.22.s  2B9CCB20  Unknown   Unknown  Unknown
libpthread-2.22.s  2B9CC3ED  __nanosleep   Unknown  Unknown
libopen-rte.so.40  2C3C5854  orte_show_help_no Unknown  Unknown
libopen-rte.so.40  2C3C5595  orte_show_helpUnknown  Unknown
libmpi.so.40.20.0  2B3BADC5  ompi_mpi_errors_a Unknown  Unknown
libmpi.so.40.20.0  2B3B99D9  ompi_errhandler_i Unknown  Unknown
libmpi.so.40.20.0  2B3E4586  MPI_Barrier   Unknown  Unknown
libmpi_mpifh.so.4  2B15EE53  MPI_Barrier_f08   Unknown  Unknown
libmpi_usempif08.  2ACE7742  mpi_barrier_f08_  Unknown  Unknown
helloWorld.mpi3.S  0040939F  Unknown   Unknown  Unknown
helloWorld.mpi3.S  0040915E  Unknown   Unknown  Unknown
libc-2.22.so   2BBF96D5  __libc_start_main Unknown  Unknown
helloWorld.mpi3.S  00409069  Unknown   Unknown  Unknown

So, I'm getting closer but I have to admit I've never built an MPI stack
before where running on a single node was the broken bit!

On Tue, Jan 22, 2019 at 1:31 PM Cabral, Matias A 
wrote:

> Hi Matt,
>
>
>
> There seem to be two different issues here:
>
> a)  The warning message comes from the openib btl. Given that
> Omnipath has verbs API and you have the necessary libraries in your system,
> openib btl finds itself as a potential transport and prints the warning
> during its init (openib btl is on its way to deprecation). You may try to
> explicitly ask for vader btl given you are running on shared mem: -mca btl
> self,vader -mca pml ob1. Or better, explicitly build without openib:
> ./configure --with-verbs=no …
>
> b)  Not my field of expertise, but you may be having some conflict
> with the external components you are using:
> --with-pmix=/usr/nlocal/pmix/2.1 --with-libevent=/usr . You may try not
> specifying these and using the ones provided by OMPI.
>
>
>
> _MAC
>
>
>
> *From:* users [mailto:users-boun...@lists.open-mpi.org] *On Behalf Of *Matt
> Thompson
> *Sent:* Tuesday, January 22, 2019 6:04 AM
> *To:* Open MPI Users 
> *Subject:* Re: [OMPI users] Help Getting Started with Open MPI and PMIx
> and UCX
>
>
>
> Well,
>
>
>
> By turning off UCX compilation per Howard, things get a bit better in that
> something happens! It's not a good something, as it seems to die with an
> infiniband error. As this is an Omnipath system, is OpenMPI perhaps seeing
> libverbs somewhere and compiling it in? To wit:
>
>
>
> (1006)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
>
> --
>
> By default, for Open MPI 4.0 and later, infiniband ports on a device
>
> are not used by default.  The intent is to use UCX for these devices.
>
> You can override this policy by setting the btl_openib_allow_ib MCA
> parameter
>
> to true.
>
>
>
>   Local host:  borgc129
>
>   Local adapter:   hfi1_0
>
>   Local port:  1
>
>
>
> --
>
> --
>
> WARNING: There was an error initializing an OpenFabrics device.
>
>
>
>   Local host:   borgc129
>
>   Local device: hfi1_0
>
> --
>
> Compiler Version: Intel(R) Fortran Intel(R) 64 Compiler for applications
> ru

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-22 Thread Cabral, Matias A
Hi Matt,

There seem to be two different issues here:

a)  The warning message comes from the openib btl. Given that Omnipath has 
verbs API and you have the necessary libraries in your system, openib btl finds 
itself as a potential transport and prints the warning during its init (openib 
btl is on its way to deprecation). You may try to explicitly ask for vader btl 
given you are running on shared mem: -mca btl self,vader -mca pml ob1. Or 
better, explicitly build without openib: ./configure --with-verbs=no …

b)  Not my field of expertise, but you may be having some conflict with the 
external components you are using: --with-pmix=/usr/nlocal/pmix/2.1 
--with-libevent=/usr . You may try not specifying these and using the ones 
provided by OMPI.

_MAC

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Matt Thompson
Sent: Tuesday, January 22, 2019 6:04 AM
To: Open MPI Users 
Subject: Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

Well,

By turning off UCX compilation per Howard, things get a bit better in that 
something happens! It's not a good something, as it seems to die with an 
infiniband error. As this is an Omnipath system, is OpenMPI perhaps seeing 
libverbs somewhere and compiling it in? To wit:

(1006)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
--
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:  borgc129
  Local adapter:   hfi1_0
  Local port:  1

--
--
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   borgc129
  Local device: hfi1_0
--
Compiler Version: Intel(R) Fortran Intel(R) 64 Compiler for applications 
running on Intel(R) 64, Version 18.0.5.274 Build 20180823
MPI Version: 3.1
MPI Library Version: Open MPI v4.0.0, package: Open MPI mathomp4@discover23 
Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018
[borgc129:260830] *** An error occurred in MPI_Barrier
[borgc129:260830] *** reported by process [140736833716225,46909632806913]
[borgc129:260830] *** on communicator MPI_COMM_WORLD
[borgc129:260830] *** MPI_ERR_OTHER: known error not in list
[borgc129:260830] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will 
now abort,
[borgc129:260830] ***and potentially your MPI job)
forrtl: error (78): process killed (SIGTERM)
Image  PCRoutineLineSource
helloWorld.mpi3.S  0040A38E  for__signal_handl Unknown  Unknown
libpthread-2.22.s  2B9CCB20  Unknown   Unknown  Unknown
libpthread-2.22.s  2B9C90CD  pthread_cond_wait Unknown  Unknown
libpmix.so.2.1.11  2AAAB1D780A1  PMIx_AbortUnknown  Unknown
mca_pmix_ext2x.so  2AAAB1B3AA75  ext2x_abort   Unknown  Unknown
mca_ess_pmi.so 2AAAB1724BC0  Unknown   Unknown  Unknown
libopen-rte.so.40  2C3E941C  orte_errmgr_base_ Unknown  Unknown
mca_errmgr_defaul  2AAABC401668  Unknown   Unknown  Unknown
libmpi.so.40.20.0  2B3CDBC4  ompi_mpi_abortUnknown  Unknown
libmpi.so.40.20.0  2B3BB1EF  ompi_mpi_errors_a Unknown  Unknown
libmpi.so.40.20.0  2B3B99C9  ompi_errhandler_i Unknown  Unknown
libmpi.so.40.20.0  2B3E4576  MPI_Barrier   Unknown  Unknown
libmpi_mpifh.so.4  2B15EE53  MPI_Barrier_f08   Unknown  Unknown
libmpi_usempif08.  2ACE7732  mpi_barrier_f08_  Unknown  Unknown
helloWorld.mpi3.S  0040939F  Unknown   Unknown  Unknown
helloWorld.mpi3.S  0040915E  Unknown   Unknown  Unknown
libc-2.22.so   2BBF96D5  __libc_start_main Unknown  Unknown
helloWorld.mpi3.S  00409069  Unknown   Unknown  Unknown

On Sun, Jan 20, 2019 at 4:19 PM Howard Pritchard <hpprit...@gmail.com> wrote:
Hi Matt

Definitely do not include the ucx option for an omnipath cluster.  Actually, if 
you accidentally installed ucx in its default location on the system, switch 
to this config option

--with-ucx=no

Otherwise you will hit

https://github.com/openucx/ucx/issues/750

Howard


Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote on Sat, 19 Jan 2019 at 18:41:
Matt,

There are two ways of using PMIx

- if you use mpirun, then the MPI app (e.g. the PMIx client) will talk
to mpirun and orted daemons (e.g. the PMIx server)
- if you use SLURM srun, then the MPI app will directly talk to the
PMIx server

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-22 Thread Matt Thompson
Well,

By turning off UCX compilation per Howard, things get a bit better in that
something happens! It's not a good something, as it seems to die with an
infiniband error. As this is an Omnipath system, is OpenMPI perhaps seeing
libverbs somewhere and compiling it in? To wit:

(1006)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
--
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA
parameter
to true.

  Local host:  borgc129
  Local adapter:   hfi1_0
  Local port:  1

--
--
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   borgc129
  Local device: hfi1_0
--
Compiler Version: Intel(R) Fortran Intel(R) 64 Compiler for applications
running on Intel(R) 64, Version 18.0.5.274 Build 20180823
MPI Version: 3.1
MPI Library Version: Open MPI v4.0.0, package: Open MPI mathomp4@discover23
Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018
[borgc129:260830] *** An error occurred in MPI_Barrier
[borgc129:260830] *** reported by process [140736833716225,46909632806913]
[borgc129:260830] *** on communicator MPI_COMM_WORLD
[borgc129:260830] *** MPI_ERR_OTHER: known error not in list
[borgc129:260830] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
[borgc129:260830] ***and potentially your MPI job)
forrtl: error (78): process killed (SIGTERM)
Image  PCRoutineLineSource
helloWorld.mpi3.S  0040A38E  for__signal_handl Unknown  Unknown
libpthread-2.22.s  2B9CCB20  Unknown   Unknown  Unknown
libpthread-2.22.s  2B9C90CD  pthread_cond_wait Unknown  Unknown
libpmix.so.2.1.11  2AAAB1D780A1  PMIx_AbortUnknown  Unknown
mca_pmix_ext2x.so  2AAAB1B3AA75  ext2x_abort   Unknown  Unknown
mca_ess_pmi.so 2AAAB1724BC0  Unknown   Unknown  Unknown
libopen-rte.so.40  2C3E941C  orte_errmgr_base_ Unknown  Unknown
mca_errmgr_defaul  2AAABC401668  Unknown   Unknown  Unknown
libmpi.so.40.20.0  2B3CDBC4  ompi_mpi_abortUnknown  Unknown
libmpi.so.40.20.0  2B3BB1EF  ompi_mpi_errors_a Unknown  Unknown
libmpi.so.40.20.0  2B3B99C9  ompi_errhandler_i Unknown  Unknown
libmpi.so.40.20.0  2B3E4576  MPI_Barrier   Unknown  Unknown
libmpi_mpifh.so.4  2B15EE53  MPI_Barrier_f08   Unknown  Unknown
libmpi_usempif08.  2ACE7732  mpi_barrier_f08_  Unknown  Unknown
helloWorld.mpi3.S  0040939F  Unknown   Unknown  Unknown
helloWorld.mpi3.S  0040915E  Unknown   Unknown  Unknown
libc-2.22.so   2BBF96D5  __libc_start_main Unknown  Unknown
helloWorld.mpi3.S  00409069  Unknown   Unknown  Unknown

On Sun, Jan 20, 2019 at 4:19 PM Howard Pritchard 
wrote:

> Hi Matt
>
> Definitely do not include the ucx option for an omnipath cluster.
> Actually, if you accidentally installed ucx in its default location on
> the system, switch to this config option
>
> --with-ucx=no
>
> Otherwise you will hit
>
> https://github.com/openucx/ucx/issues/750
>
> Howard
>
>
> Gilles Gouaillardet  schrieb am Sa. 19.
> Jan. 2019 um 18:41:
>
>> Matt,
>>
>> There are two ways of using PMIx
>>
>> - if you use mpirun, then the MPI app (e.g. the PMIx client) will talk
>> to mpirun and orted daemons (e.g. the PMIx server)
>> - if you use SLURM srun, then the MPI app will directly talk to the
>> PMIx server provided by SLURM. (note you might have to srun
>> --mpi=pmix_v2 or something)
>>
>> In the former case, it does not matter whether you use the embedded or
>> external PMIx.
>> In the latter case, Open MPI and SLURM have to use compatible PMIx
>> libraries, and you can either check the cross-version compatibility
>> matrix,
>> or build Open MPI with the same PMIx used by SLURM to be on the safe
>> side (not a bad idea IMHO).
>>
>>
>> Regarding the hang, I suggest you try different things
>> - use mpirun in a SLURM job (e.g. sbatch instead of salloc so mpirun
>> runs on a compute node rather than on a frontend node)
>> - try something even simpler such as mpirun hostname (both with sbatch
>> and salloc)
>> - explicitly specify the network to be used for the wire-up. you can
>> for example mpirun --mca oob_tcp_if_include 192.168.0.0/24 if this is
>> the network subnet by which all the nodes (e.g. compute nodes and
>> frontend node if you use salloc) communicate.
>>
>>
>> Cheers,
>>
>> Gilles
>>
>> On Sat, Jan 19, 2019 at 3:31 AM Matt Thompson  

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-20 Thread Howard Pritchard
Hi Matt

Definitely do not include the ucx option for an omnipath cluster.  Actually,
if you accidentally installed ucx in its default location on the
system, switch to this config option

--with-ucx=no

Otherwise you will hit

https://github.com/openucx/ucx/issues/750

Howard
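
Putting the suggestions from this thread together, a configure sketch for an 
Omni-Path cluster might look like the following (the install prefix and make 
job count are illustrative; the flags and compilers are the ones already 
mentioned in the thread):

$ ./configure --prefix=$HOME/openmpi-4.0.0 \
      --with-slurm --with-psm2 \
      --with-ucx=no --with-verbs=no \
      CC=icc CXX=icpc FC=ifort
$ make -j 8 && make install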


Gilles Gouaillardet  wrote on Sat, 19 Jan 2019 at 18:41:

> Matt,
>
> There are two ways of using PMIx
>
> - if you use mpirun, then the MPI app (e.g. the PMIx client) will talk
> to mpirun and orted daemons (e.g. the PMIx server)
> - if you use SLURM srun, then the MPI app will directly talk to the
> PMIx server provided by SLURM. (note you might have to srun
> --mpi=pmix_v2 or something)
>
> In the former case, it does not matter whether you use the embedded or
> external PMIx.
> In the latter case, Open MPI and SLURM have to use compatible PMIx
> libraries, and you can either check the cross-version compatibility
> matrix,
> or build Open MPI with the same PMIx used by SLURM to be on the safe
> side (not a bad idea IMHO).
>
>
> Regarding the hang, I suggest you try different things
> - use mpirun in a SLURM job (e.g. sbatch instead of salloc so mpirun
> runs on a compute node rather than on a frontend node)
> - try something even simpler such as mpirun hostname (both with sbatch
> and salloc)
> - explicitly specify the network to be used for the wire-up. you can
> for example mpirun --mca oob_tcp_if_include 192.168.0.0/24 if this is
> the network subnet by which all the nodes (e.g. compute nodes and
> frontend node if you use salloc) communicate.
>
>
> Cheers,
>
> Gilles
>
> On Sat, Jan 19, 2019 at 3:31 AM Matt Thompson  wrote:
> >
> > On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users <
> users@lists.open-mpi.org> wrote:
> >>
> >> On Jan 18, 2019, at 12:43 PM, Matt Thompson  wrote:
> >> >
> >> > With some help, I managed to build an Open MPI 4.0.0 with:
> >>
> >> We can discuss each of these params to let you know what they are.
> >>
> >> > ./configure --disable-wrapper-rpath --disable-wrapper-runpath
> >>
> >> Did you have a reason for disabling these?  They're generally good
> things.  What they do is add linker flags to the wrapper compilers (i.e.,
> mpicc and friends) that basically put a default path to find libraries at
> run time (that can/will in most cases override LD_LIBRARY_PATH -- but you
> can override these linked-in-default-paths if you want/need to).
> >
> >
> > I've had these in my Open MPI builds for a while now. The reason was one
> of the libraries I need for the climate model I work on went nuts if both
> of them weren't there. It was originally the rpath one but then eventually
> (Open MPI 3?) I had to add the runpath one. But I have been updating the
> libraries more aggressively recently (due to OS upgrades) so it's possible
> this is no longer needed.
> >
> >>
> >>
> >> > --with-psm2
> >>
> >> Ensure that Open MPI can include support for the PSM2 library, and
> abort configure if it cannot.
> >>
> >> > --with-slurm
> >>
> >> Ensure that Open MPI can include support for SLURM, and abort configure
> if it cannot.
> >>
> >> > --enable-mpi1-compatibility
> >>
> >> Add support for MPI_Address and other MPI-1 functions that have since
> been deleted from the MPI 3.x specification.
> >>
> >> > --with-ucx
> >>
> >> Ensure that Open MPI can include support for UCX, and abort configure
> if it cannot.
> >>
> >> > --with-pmix=/usr/nlocal/pmix/2.1
> >>
> >> Tells Open MPI to use the PMIx that is installed at
> /usr/nlocal/pmix/2.1 (instead of using the PMIx that is bundled internally
> to Open MPI's source code tree/expanded tarball).
> >>
> >> Unless you have a reason to use the external PMIx, the internal/bundled
> PMIx is usually sufficient.
> >
> >
> > Ah. I did not know that. I figured if our SLURM was built linked to a
> specific PMIx v2 that I should build Open MPI with the same PMIx. I'll
> build an Open MPI 4 without specifying this.
> >
> >>
> >>
> >> > --with-libevent=/usr
> >>
> >> Same as previous; change "pmix" to "libevent" (i.e., use the external
> libevent instead of the bundled libevent).
> >>
> >> > CC=icc CXX=icpc FC=ifort
> >>
> >> Specify the exact compilers to use.
> >>
> >> > The MPI 1 is because I need to build HDF5 eventually and I added psm2
> because it's an Omnipath cluster. The libevent was probably a red herring
> as libevent-devel wasn't installed on the system. It was eventually, and I
> just didn't remove the flag. And I saw no errors in the build!
> >>
> >> Might as well remove the --with-libevent if you don't need it.
> >>
> >> > However, I seem to have built an Open MPI that doesn't work:
> >> >
> >> > (1099)(master) $ mpirun --version
> >> > mpirun (Open MPI) 4.0.0
> >> >
> >> > Report bugs to http://www.open-mpi.org/community/help/
> >> > (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
> >> >
> >> > It just sits there...forever. Can the gurus here help me figure out
> what I managed to break? Perhaps I added too much to my configure 

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-19 Thread Gilles Gouaillardet
Matt,

There are two ways of using PMIx

- if you use mpirun, then the MPI app (e.g. the PMIx client) will talk
to mpirun and orted daemons (e.g. the PMIx server)
- if you use SLURM srun, then the MPI app will directly talk to the
PMIx server provided by SLURM. (note you might have to srun
--mpi=pmix_v2 or something)

In the former case, it does not matter whether you use the embedded or
external PMIx.
In the latter case, Open MPI and SLURM have to use compatible PMIx
libraries, and you can either check the cross-version compatibility
matrix,
or build Open MPI with the same PMIx used by SLURM to be on the safe
side (not a bad idea IMHO).


Regarding the hang, I suggest you try different things
- use mpirun in a SLURM job (e.g. sbatch instead of salloc so mpirun
runs on a compute node rather than on a frontend node)
- try something even simpler such as mpirun hostname (both with sbatch
and salloc)
- explicitly specify the network to be used for the wire-up. you can
for example mpirun --mca oob_tcp_if_include 192.168.0.0/24 if this is
the network subnet by which all the nodes (e.g. compute nodes and
frontend node if you use salloc) communicate.
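
A minimal sbatch sketch of the first two suggestions, saved as, say, 
test_ompi.sbatch and submitted with "sbatch test_ompi.sbatch" (node and task 
counts are illustrative; the binary name is the one from this thread):

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=4

mpirun hostname                     # the even-simpler sanity check
mpirun -np 8 ./helloWorld.mpi3.SLES12.OMPI400.exe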


Cheers,

Gilles

On Sat, Jan 19, 2019 at 3:31 AM Matt Thompson  wrote:
>
> On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users 
>  wrote:
>>
>> On Jan 18, 2019, at 12:43 PM, Matt Thompson  wrote:
>> >
>> > With some help, I managed to build an Open MPI 4.0.0 with:
>>
>> We can discuss each of these params to let you know what they are.
>>
>> > ./configure --disable-wrapper-rpath --disable-wrapper-runpath
>>
>> Did you have a reason for disabling these?  They're generally good things.  
>> What they do is add linker flags to the wrapper compilers (i.e., mpicc and 
>> friends) that basically put a default path to find libraries at run time 
>> (that can/will in most cases override LD_LIBRARY_PATH -- but you can 
>> override these linked-in-default-paths if you want/need to).
>
>
> I've had these in my Open MPI builds for a while now. The reason was one of 
> the libraries I need for the climate model I work on went nuts if both of 
> them weren't there. It was originally the rpath one but then eventually (Open 
> MPI 3?) I had to add the runpath one. But I have been updating the libraries 
> more aggressively recently (due to OS upgrades) so it's possible this is no 
> longer needed.
>
>>
>>
>> > --with-psm2
>>
>> Ensure that Open MPI can include support for the PSM2 library, and abort 
>> configure if it cannot.
>>
>> > --with-slurm
>>
>> Ensure that Open MPI can include support for SLURM, and abort configure if 
>> it cannot.
>>
>> > --enable-mpi1-compatibility
>>
>> Add support for MPI_Address and other MPI-1 functions that have since been 
>> deleted from the MPI 3.x specification.
>>
>> > --with-ucx
>>
>> Ensure that Open MPI can include support for UCX, and abort configure if it 
>> cannot.
>>
>> > --with-pmix=/usr/nlocal/pmix/2.1
>>
>> Tells Open MPI to use the PMIx that is installed at /usr/nlocal/pmix/2.1 
>> (instead of using the PMIx that is bundled internally to Open MPI's source 
>> code tree/expanded tarball).
>>
>> Unless you have a reason to use the external PMIx, the internal/bundled PMIx 
>> is usually sufficient.
>
>
> Ah. I did not know that. I figured if our SLURM was built linked to a 
> specific PMIx v2 that I should build Open MPI with the same PMIx. I'll build 
> an Open MPI 4 without specifying this.
>
>>
>>
>> > --with-libevent=/usr
>>
>> Same as previous; change "pmix" to "libevent" (i.e., use the external 
>> libevent instead of the bundled libevent).
>>
>> > CC=icc CXX=icpc FC=ifort
>>
>> Specify the exact compilers to use.
>>
>> > The MPI 1 is because I need to build HDF5 eventually and I added psm2 
>> > because it's an Omnipath cluster. The libevent was probably a red herring 
>> > as libevent-devel wasn't installed on the system. It was eventually, and I 
>> > just didn't remove the flag. And I saw no errors in the build!
>>
>> Might as well remove the --with-libevent if you don't need it.
>>
>> > However, I seem to have built an Open MPI that doesn't work:
>> >
>> > (1099)(master) $ mpirun --version
>> > mpirun (Open MPI) 4.0.0
>> >
>> > Report bugs to http://www.open-mpi.org/community/help/
>> > (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
>> >
>> > It just sits there...forever. Can the gurus here help me figure out what I 
>> > managed to break? Perhaps I added too much to my configure line? Not 
>> > enough?
>>
>> There could be a few things going on here.
>>
>> Are you running inside a SLURM job?  E.g., in a "salloc" job, or in an 
>> "sbatch" script?
>
>
> I have salloc'd 8 nodes of 40 cores each. Intel MPI 18 and 19 work just fine 
> (as you'd hope on an Omnipath cluster), but for some reason Open MPI is 
> twitchy on this cluster. I once managed to get Open MPI 3.0.1 working (a few 
> months ago), and it had some interesting startup scaling I liked (slow at low 
> core count, but getting 

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-18 Thread Cabral, Matias A
Hi Matt,

Few comments/questions:

-  If your cluster has Omni-Path, you won’t need UCX. Instead you can 
run using PSM2, or alternatively OFI (a.k.a. Libfabric)

-  With the command you shared below (4 ranks on the local node), (I
think) a shared-memory transport is being selected (vader?). So, if the job is not
starting, this seems to be a runtime issue rather than a transport one… PMIx? SLURM?
Thanks
_MAC


From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Matt Thompson
Sent: Friday, January 18, 2019 10:27 AM
To: Open MPI Users 
Subject: Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users 
<users@lists.open-mpi.org> wrote:
On Jan 18, 2019, at 12:43 PM, Matt Thompson 
<fort...@gmail.com> wrote:
>
> With some help, I managed to build an Open MPI 4.0.0 with:

We can discuss each of these params to let you know what they are.

> ./configure --disable-wrapper-rpath --disable-wrapper-runpath

Did you have a reason for disabling these?  They're generally good things.  
What they do is add linker flags to the wrapper compilers (i.e., mpicc and 
friends) that basically put a default path to find libraries at run time (that 
can/will in most cases override LD_LIBRARY_PATH -- but you can override these 
linked-in-default-paths if you want/need to).

I've had these in my Open MPI builds for a while now. The reason was one of the 
libraries I need for the climate model I work on went nuts if both of them 
weren't there. It was originally the rpath one but then eventually (Open MPI 
3?) I had to add the runpath one. But I have been updating the libraries more 
aggressively recently (due to OS upgrades) so it's possible this is no longer 
needed.


> --with-psm2

Ensure that Open MPI can include support for the PSM2 library, and abort 
configure if it cannot.

> --with-slurm

Ensure that Open MPI can include support for SLURM, and abort configure if it 
cannot.

> --enable-mpi1-compatibility

Add support for MPI_Address and other MPI-1 functions that have since been 
deleted from the MPI 3.x specification.

> --with-ucx

Ensure that Open MPI can include support for UCX, and abort configure if it 
cannot.

> --with-pmix=/usr/nlocal/pmix/2.1

Tells Open MPI to use the PMIx that is installed at /usr/nlocal/pmix/2.1 
(instead of using the PMIx that is bundled internally to Open MPI's source code 
tree/expanded tarball).

Unless you have a reason to use the external PMIx, the internal/bundled PMIx is 
usually sufficient.

Ah. I did not know that. I figured if our SLURM was built linked to a specific 
PMIx v2 that I should build Open MPI with the same PMIx. I'll build an Open MPI 
4 without specifying this.


> --with-libevent=/usr

Same as previous; change "pmix" to "libevent" (i.e., use the external libevent 
instead of the bundled libevent).

> CC=icc CXX=icpc FC=ifort

Specify the exact compilers to use.

> The MPI 1 is because I need to build HDF5 eventually and I added psm2 because 
> it's an Omnipath cluster. The libevent was probably a red herring as 
> libevent-devel wasn't installed on the system. It was eventually, and I just 
> didn't remove the flag. And I saw no errors in the build!

Might as well remove the --with-libevent if you don't need it.

> However, I seem to have built an Open MPI that doesn't work:
>
> (1099)(master) $ mpirun --version
> mpirun (Open MPI) 4.0.0
>
> Report bugs to http://www.open-mpi.org/community/help/
> (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
>
> It just sits there...forever. Can the gurus here help me figure out what I 
> managed to break? Perhaps I added too much to my configure line? Not enough?

There could be a few things going on here.

Are you running inside a SLURM job?  E.g., in a "salloc" job, or in an "sbatch" 
script?

I have salloc'd 8 nodes of 40 cores each. Intel MPI 18 and 19 work just fine 
(as you'd hope on an Omnipath cluster), but for some reason Open MPI is twitchy 
on this cluster. I once managed to get Open MPI 3.0.1 working (a few months 
ago), and it had some interesting startup scaling I liked (slow at low core 
count, but getting close to Intel MPI at high core count), though it seemed to 
not work after about 100 nodes (4000 processes) or so.

--
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna 
Rampton

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-18 Thread Matt Thompson
On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> On Jan 18, 2019, at 12:43 PM, Matt Thompson  wrote:
> >
> > With some help, I managed to build an Open MPI 4.0.0 with:
>
> We can discuss each of these params to let you know what they are.
>
> > ./configure --disable-wrapper-rpath --disable-wrapper-runpath
>
> Did you have a reason for disabling these?  They're generally good
> things.  What they do is add linker flags to the wrapper compilers (i.e.,
> mpicc and friends) that basically put a default path to find libraries at
> run time (that can/will in most cases override LD_LIBRARY_PATH -- but you
> can override these linked-in-default-paths if you want/need to).
>

I've had these in my Open MPI builds for a while now. The reason was one of
the libraries I need for the climate model I work on went nuts if both of
them weren't there. It was originally the rpath one but then eventually
(Open MPI 3?) I had to add the runpath one. But I have been updating the
libraries more aggressively recently (due to OS upgrades) so it's possible
this is no longer needed.


>
> > --with-psm2
>
> Ensure that Open MPI can include support for the PSM2 library, and abort
> configure if it cannot.
>
> > --with-slurm
>
> Ensure that Open MPI can include support for SLURM, and abort configure if
> it cannot.
>
> > --enable-mpi1-compatibility
>
> Add support for MPI_Address and other MPI-1 functions that have since been
> deleted from the MPI 3.x specification.
>
> > --with-ucx
>
> Ensure that Open MPI can include support for UCX, and abort configure if
> it cannot.
>
> > --with-pmix=/usr/nlocal/pmix/2.1
>
> Tells Open MPI to use the PMIx that is installed at /usr/nlocal/pmix/2.1
> (instead of using the PMIx that is bundled internally to Open MPI's source
> code tree/expanded tarball).
>
> Unless you have a reason to use the external PMIx, the internal/bundled
> PMIx is usually sufficient.
>

Ah. I did not know that. I figured if our SLURM was built linked to a
specific PMIx v2 that I should build Open MPI with the same PMIx. I'll
build an Open MPI 4 without specifying this.


>
> > --with-libevent=/usr
>
> Same as previous; change "pmix" to "libevent" (i.e., use the external
> libevent instead of the bundled libevent).
>
> > CC=icc CXX=icpc FC=ifort
>
> Specify the exact compilers to use.
>
> > The MPI 1 is because I need to build HDF5 eventually and I added psm2
> because it's an Omnipath cluster. The libevent was probably a red herring
> as libevent-devel wasn't installed on the system. It was eventually, and I
> just didn't remove the flag. And I saw no errors in the build!
>
> Might as well remove the --with-libevent if you don't need it.
>
> > However, I seem to have built an Open MPI that doesn't work:
> >
> > (1099)(master) $ mpirun --version
> > mpirun (Open MPI) 4.0.0
> >
> > Report bugs to http://www.open-mpi.org/community/help/
> > (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
> >
> > It just sits there...forever. Can the gurus here help me figure out what
> I managed to break? Perhaps I added too much to my configure line? Not
> enough?
>
> There could be a few things going on here.
>
> Are you running inside a SLURM job?  E.g., in a "salloc" job, or in an
> "sbatch" script?
>

I have salloc'd 8 nodes of 40 cores each. Intel MPI 18 and 19 work just
fine (as you'd hope on an Omnipath cluster), but for some reason Open MPI
is twitchy on this cluster. I once managed to get Open MPI 3.0.1 working (a
few months ago), and it had some interesting startup scaling I liked (slow
at low core count, but getting close to Intel MPI at high core count),
though it seemed to not work after about 100 nodes (4000 processes) or so.

-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-18 Thread Jeff Squyres (jsquyres) via users
On Jan 18, 2019, at 12:43 PM, Matt Thompson  wrote:
> 
> With some help, I managed to build an Open MPI 4.0.0 with:

We can discuss each of these params to let you know what they are.

> ./configure --disable-wrapper-rpath --disable-wrapper-runpath

Did you have a reason for disabling these?  They're generally good things.  
What they do is add linker flags to the wrapper compilers (i.e., mpicc and 
friends) that basically put a default path to find libraries at run time (that 
can/will in most cases override LD_LIBRARY_PATH -- but you can override these 
linked-in-default-paths if you want/need to).

> --with-psm2

Ensure that Open MPI can include support for the PSM2 library, and abort 
configure if it cannot.

> --with-slurm 

Ensure that Open MPI can include support for SLURM, and abort configure if it 
cannot.

> --enable-mpi1-compatibility

Add support for MPI_Address and other MPI-1 functions that have since been 
deleted from the MPI 3.x specification.

> --with-ucx

Ensure that Open MPI can include support for UCX, and abort configure if it 
cannot.

> --with-pmix=/usr/nlocal/pmix/2.1

Tells Open MPI to use the PMIx that is installed at /usr/nlocal/pmix/2.1 
(instead of using the PMIx that is bundled internally to Open MPI's source code 
tree/expanded tarball).

Unless you have a reason to use the external PMIx, the internal/bundled PMIx is 
usually sufficient.

> --with-libevent=/usr

Same as previous; change "pmix" to "libevent" (i.e., use the external libevent 
instead of the bundled libevent).

> CC=icc CXX=icpc FC=ifort

Specify the exact compilers to use.

> The MPI 1 is because I need to build HDF5 eventually and I added psm2 because 
> it's an Omnipath cluster. The libevent was probably a red herring as 
> libevent-devel wasn't installed on the system. It was eventually, and I just 
> didn't remove the flag. And I saw no errors in the build!

Might as well remove the --with-libevent if you don't need it.

> However, I seem to have built an Open MPI that doesn't work:
> 
> (1099)(master) $ mpirun --version
> mpirun (Open MPI) 4.0.0
> 
> Report bugs to http://www.open-mpi.org/community/help/
> (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
> 
> It just sits there...forever. Can the gurus here help me figure out what I 
> managed to break? Perhaps I added too much to my configure line? Not enough?

There could be a few things going on here.

Are you running inside a SLURM job?  E.g., in a "salloc" job, or in an "sbatch" 
script?

-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-18 Thread Matt Thompson
All,

With some help, I managed to build an Open MPI 4.0.0 with:

./configure --disable-wrapper-rpath --disable-wrapper-runpath --with-psm2
--with-slurm --enable-mpi1-compatibility --with-ucx
--with-pmix=/usr/nlocal/pmix/2.1 --with-libevent=/usr CC=icc CXX=icpc
FC=ifort

The MPI 1 is because I need to build HDF5 eventually and I added psm2
because it's an Omnipath cluster. The libevent was probably a red herring
as libevent-devel wasn't installed on the system. It was eventually, and I
just didn't remove the flag. And I saw no errors in the build!

However, I seem to have built an Open MPI that doesn't work:

(1099)(master) $ mpirun --version
mpirun (Open MPI) 4.0.0

Report bugs to http://www.open-mpi.org/community/help/
(1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe

It just sits there...forever. Can the gurus here help me figure out what I
managed to break? Perhaps I added too much to my configure line? Not enough?

Thanks,
Matt

On Thu, Jan 17, 2019 at 11:10 AM Matt Thompson  wrote:

> Dear Open MPI Gurus,
>
> A cluster I use recently updated their SLURM to have support for UCX and
> PMIx. These are names I've seen and heard often at SC BoFs and posters, but
> now is my first time to play with them.
>
> So, my first question is how exactly should I build Open MPI to try these
> features out. I'm guessing I'll need things like "--with-ucx" to test UCX,
> but is anything needed for PMIx?
>
> Second, when it comes to running Open MPI, are there new MCA parameters I
> need to look out for when testing?
>
> Sorry for the generic questions, but I'm more on the user end of the
> cluster than the administrator end, so I tend to get lost in the detailed
> presentations, etc. I see online.
>
> Thanks,
> Matt
> --
> Matt Thompson
>“The fact is, this is about us identifying what we do best and
>finding more ways of doing less of it better” -- Director of Better
> Anna Rampton
>


-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton

Re: [OMPI users] help installing openmpi 3.0 in ubuntu 16.04

2018-03-16 Thread Jeff Squyres (jsquyres)
(Sending this to the users list, not to just the owner of the users list)

It looks like you might have installed Open MPI correctly.

But you have to give some command line options to mpirun to tell it what to do 
-- you're basically getting an error saying "you didn't tell me what to do, so 
I didn't do anything."

You can do "mpirun --help" to see common options, or "mpirun --help --all" to 
see *all* of mpirun's options (there are many).  You can also see the man page 
for mpirun(1).

Generally, if you're not running in a scheduled environment (e.g., you're just 
running on your laptop, or a handful of nodes that are not using SLURM, Torque, 
or some other scheduler), you tell mpirun what nodes to use, what MPI 
application executable to launch, and how many copies to launch (i.e., the size 
of MPI_COMM_WORLD).  For example:

$ mpirun --host node1,node2,node3,node4 -np 96 ./my_mpi_program

Where:

- node1-node4 is the IP-resolvable name of your 4 nodes
- "-np 96" means to launch 96 copies (e.g., 24 per node -- for this example, 
I'm assuming you have 24 cores per node)
- ./my_mpi_program: a compiled MPI application that you build with mpicc or 
mpifort (e.g., a C or Fortran program that calls MPI_Init, MPI_Finalize, 
...etc.).

You might also want to see: https://www.open-mpi.org/faq/?category=running
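
For reference, a minimal sketch of such an MPI program (the file and program
names here are only illustrative):

// mpi_hello.cpp -- build with: mpicxx mpi_hello.cpp -o my_mpi_program
#include <mpi.h>
#include <cstdio>

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);                  // start the MPI runtime

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // this process' rank
  MPI_Comm_size(MPI_COMM_WORLD, &size);    // size of MPI_COMM_WORLD

  std::printf("Hello from rank %d of %d\n", rank, size);

  MPI_Finalize();                          // shut down cleanly
  return 0;
}

Launched with the mpirun line above, it should print one "Hello" line per rank.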



> On Mar 15, 2018, at 8:23 PM, Keshab Bashyal  wrote:
> 
> Dear Sir, 
> I installed openmpi version3.0 in ubuntu 16.04.
> I followed the exact instruction in the pdf file attached here.
> I set up the path as in the pdf.
> After installing I tried to type "mpirun" in the terminal I get the following 
> message:
> 
> --
> mpirun could not find anything to do.
> 
> It is possible that you forgot to specify how many processes to run via the 
> "-np" argument
> 
> 
> Previously, I had installed openmpi (1.6.5) , and when I used to type 
> "mpirun",
> it used to give me bunch of options showing how to use it. 
> Could you help me installing openmpi in a correct way. 
> 
> Thank you.
> 
> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] Help debugging invalid read

2018-02-19 Thread Florian Lindner
Ok, I think I have found the problem.

During std::vector::push_back or emplace_back a realloc happens, and thus the memory
locations that I gave to MPI_Isend
become invalid.

My loop now reads:

  std::vector<MPI_EventData> eventSendBuf(eventsSize); // Buffer to hold the MPI_EventData object

  for (int i = 0; i < eventsSize; ++i) {
    MPI_Request req;

    eventSendBuf.at(i).size = 5;

    cout << "Isending event " << i << endl;
    MPI_Isend(&eventSendBuf[i], 1, MPI_EVENTDATA, 0, 0, MPI_COMM_WORLD, &req);
    requests.push_back(req);
  }
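
Pre-sizing the vector is one way to keep the addresses handed to MPI_Isend
stable; reserving the capacity up front works as well, since reserve() rules
out the reallocation that invalidated them. A sketch along the same lines
(same MPI_EVENTDATA type and requests vector as above):

std::vector<MPI_EventData> eventSendBuf;
eventSendBuf.reserve(eventsSize);   // capacity fixed up front: no realloc, pointers stay valid

for (int i = 0; i < eventsSize; ++i) {
  MPI_Request req;
  MPI_EventData eventdata;
  eventdata.size = 5;
  eventSendBuf.push_back(eventdata);

  MPI_Isend(&eventSendBuf.back(), 1, MPI_EVENTDATA, 0, 0, MPI_COMM_WORLD, &req);
  requests.push_back(req);
}
// the buffer must stay alive and untouched until MPI_Waitall() completes these sends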

Best,
Florian


Am 19.02.2018 um 10:14 schrieb Florian Lindner:
> Hello,
> 
> I am having problems understanding an error valgrind gives me. I tried to bog 
> down the program as much as possible. The
> original program as well as the test example both work fine, but when I link 
> the created library to another application
> I get segfaults. I think that this piece of code is to blame. I run valgrind 
> on it and get an invalid read.
> 
> The code can be seen at 
> https://gist.github.com/floli/d62d16ce7cabb4522e2ae7e6b3cfda43 or below.
> 
> It's about 60 lines of C/C++ code.
> 
> I have also attached the valgrind report below the code.
> 
> The code registers a custom MPI datatype and sends that using an isend. It 
> does not crash or produces invalid data, but
> I fear that the invalid read message from valgrind is a hint of an existing 
> memory corruption.
> 
> But I got no idea where that could happen.
> 
> OpenMPI 3.0.0 @ Arch
> 
> I am very thankful of any hints whatsover!
> 
> Florian
> 
> 
> 
> 
> 
> // Compile and test with: mpicxx -std=c++11 -g -O0 mpitest.cpp  &&
> LD_PRELOAD=/usr/lib/valgrind/libmpiwrap-amd64-linux.so mpirun -n 1 valgrind 
> --read-var-info=yes --leak-check=full ./a.out
> 
> #include <cstddef>
> #include <iostream>
> #include <vector>
> 
> #include <mpi.h>
> 
> using namespace std;
> 
> struct MPI_EventData
> {
>   int size;
> };
> 
> 
> void collect()
> {
>   // Register MPI datatype
>   MPI_Datatype MPI_EVENTDATA;
>   int blocklengths[] = {1};
>   MPI_Aint displacements[] = {offsetof(MPI_EventData, size) };
>   MPI_Datatype types[] = {MPI_INT};
>   MPI_Type_create_struct(1, blocklengths, displacements, types, &MPI_EVENTDATA);
>   MPI_Type_commit(&MPI_EVENTDATA);
> 
>   int rank, MPIsize;
>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>   MPI_Comm_size(MPI_COMM_WORLD, &MPIsize);
> 
>   std::vector<MPI_Request> requests;
>   std::vector<int> eventsPerRank(MPIsize);
>   size_t eventsSize = 3; // each rank sends three events, invalid read happens only if eventsSize > 1
>   MPI_Gather(&eventsSize, 1, MPI_INT, eventsPerRank.data(), 1, MPI_INT, 0, MPI_COMM_WORLD);
> 
>   std::vector<MPI_EventData> eventSendBuf; // Buffer to hold the MPI_EventData object
> 
>   for (int i = 0; i < eventsSize; ++i) {
>     MPI_EventData eventdata;
>     MPI_Request req;
> 
>     eventdata.size = 5;
>     eventSendBuf.push_back(eventdata);
> 
>     cout << "Isending event " << i << endl;
>     MPI_Isend(&eventSendBuf.back(), 1, MPI_EVENTDATA, 0, 0, MPI_COMM_WORLD, &req);
>     requests.push_back(req);
>   }
> 
>   if (rank == 0) {
>     for (int i = 0; i < MPIsize; ++i) {
>       for (int j = 0; j < eventsPerRank[i]; ++j) {
>         MPI_EventData ev;
>         MPI_Recv(&ev, 1, MPI_EVENTDATA, i, MPI_ANY_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
> 
>         cout << "Received Size = " << ev.size << endl;
>       }
>     }
>   }
>   MPI_Waitall(requests.size(), requests.data(), MPI_STATUSES_IGNORE);
>   MPI_Type_free(&MPI_EVENTDATA);
> }
> 
> 
> int main(int argc, char *argv[])
> {
>   MPI_Init(&argc, &argv);
> 
>   collect();
> 
>   MPI_Finalize();
> }
> 
> 
> /*
> 
>  % mpicxx -std=c++11 -g -O0 mpitest.cpp  && 
> LD_PRELOAD=/usr/lib/valgrind/libmpiwrap-amd64-linux.so mpirun -n 1 valgrind
> --read-var-info=yes --leak-check=full ./a.out
> ==13584== Memcheck, a memory error detector
> ==13584== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
> ==13584== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
> ==13584== Command: ./a.out
> ==13584==
> valgrind MPI wrappers 13584: Active for pid 13584
> valgrind MPI wrappers 13584: Try MPIWRAP_DEBUG=help for possible options
> ==13584== Thread 3:
> ==13584== Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
> ==13584==at 0x61A0FE6: epoll_pwait (in /usr/lib/libc-2.26.so)
> ==13584==by 0x677CDDC: ??? (in /usr/lib/openmpi/libopen-pal.so.40.0.0)
> ==13584==by 0x6780EDA: opal_libevent2022_event_base_loop (in 
> /usr/lib/openmpi/libopen-pal.so.40.0.0)
> ==13584==by 0x93100CE: ??? (in 
> /usr/lib/openmpi/openmpi/mca_pmix_pmix2x.so)
> ==13584==by 0x5E9408B: start_thread (in /usr/lib/libpthread-2.26.so)
> ==13584==by 0x61A0E7E: clone (in /usr/lib/libc-2.26.so)
> ==13584==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
> ==13584==
> Isending event 0
> ==13584== Thread 1:
> ==13584== Invalid read of size 2
> ==13584==at 0x4C33B20: memmove (vg_replace_strmem.c:1258)
> ==13584==by 0x11A7BB: MPI_EventData* std::__copy_move std::random_access_iterator_tag>::__copy_m(MPI_EventData 
> const*, 

Re: [OMPI users] Help with binding processes correctly in Hybrid code (Openmpi +openmp)

2017-11-14 Thread Gilles Gouaillardet
Hi,

per 
https://www2.cisl.ucar.edu/resources/computational-systems/cheyenne/running-jobs/pbs-pro-job-script-examples,
you can try

#PBS -l select=2:ncpus=16:mpiprocs=2:ompthreads=8

Cheers,

Gilles
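
Once the job starts, a small hybrid probe makes it easy to see where the
ranks and threads actually land (a sketch, assuming a Linux/glibc
sched_getcpu(); build with mpicxx -fopenmp):

#include <mpi.h>
#include <omp.h>
#include <sched.h>    // sched_getcpu(), Linux/glibc only
#include <cstdio>

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  char host[MPI_MAX_PROCESSOR_NAME];
  int len;
  MPI_Get_processor_name(host, &len);

  #pragma omp parallel
  {
    // every thread reports the core it is currently running on
    std::printf("%s: rank %d, thread %d/%d, cpu %d\n",
                host, rank, omp_get_thread_num(),
                omp_get_num_threads(), sched_getcpu());
  }

  MPI_Finalize();
  return 0;
}

With 2 ranks and 16 threads each (or 4 ranks and 8 threads), the output shows
at a glance whether both nodes are used and whether threads pile up on the
same cores.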


On Tue, Nov 14, 2017 at 4:32 PM, Anil K. Dasanna
 wrote:
> Hello all,
>
> I am relatively new to mpi computing. I am doing particle simulations.
> So far, I have only used pure MPI and never had a problem. But for my system,
> it's best to use hybrid programming.
> But I always fail to correctly bind all processes and receive binding errors
> from the cluster.
> Could one of you please clarify correct parameters for mpirun for below two
> cases:
>
> 1) I would like to use, let's say, two MPI tasks and 16 processors as OpenMP
> threads.
> I request nodes:2:ppn=16.
> 2) For the same case, how should I give parameters such that I have 4 MPI
> tasks and
> 8 OpenMP threads?
>
> I also tried options with --map-by, and it happened that sometimes both MPI
> tasks were placed
> on the same node and the other node was idle. I really appreciate your help.
> And my Open MPI version is 1.8.
>
>
> --
> Kind Regards,
> Anil.
> *
> "Its impossible" - said Pride
> "Its risky" - said Experience
> "Its pointless" - said Reason
> "Give it a try" - whispered the heart
> *
>


Re: [OMPI users] Help

2017-04-27 Thread Gus Correa

On 04/27/2017 06:21 AM, Corina Jeni Tudorache wrote:

Hello,



I am trying to install Open MPI on CentOS and I got stuck. I have
installed a GNU compiler and after that I ran the command: yum install
openmpi-devel.x86_64. But when I run command mpi selector –- list I
receive this error “mpi: command not found”

I am following the instruction from here:
https://na-inet.jp/na/pccluster/centos_x86_64-en.html

Any help is much appreciated. :-)

Corina


You need to install openmpi.x86_64 also, not only openmpi-devel.x86_64.
That is the minimum.

I hope this helps,
Gus Correa







Re: [OMPI users] Help

2017-04-27 Thread gilles
 Or you can replace the mpi-selector thing with

module load mpi/openmpi-x86_64

if it does not work,

module avail

and then

module load 

note this is per session, so you should do that each time you start a 
new terminal or submit a job

Cheers,

Gilles

- Original Message -

When I run command rpm --query centos-release, it shows the 
following: centos-release-7-3.1611.el7.centos.x86_64. So maybe I should 
install CentOS 5?

 

C.

 

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of 
gil...@rist.or.jp
Sent: Thursday, April 27, 2017 12:36 PM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Help

 

 by the way, are you running CentOS 5 ?

it seems mpi-selector is no more available from CentOS 6

 

Cheers,

 

Gilles

- Original Message -

Yes, I write it wrong the previous e-mail, but actually it does 
not work. Gives the error message: mpi: command not found

 

 

Corina

 

 

 

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf 
Of gil...@rist.or.jp
Sent: Thursday, April 27, 2017 11:34 AM
To: Open MPI Users <users@lists.open-mpi.org>
    Subject: Re: [OMPI users] Help

 

 

 Hi,

 

that looks like a typo, the command is

mpi-selector --list

 

Cheers,

 

Gilles

- Original Message -

Hello,

 

 

I am trying to install Open MPI on Centos and I got stuck. I 
have installed an GNU compiler and after that I run the command: yum 
install openmpi-devel.x86_64. But when I run command mpi selector –- 
list I receive this error “mpi: command not found”

I am following the instruction from here: 
https://na-inet.jp/na/pccluster/centos_x86_64-en.html

Any help is much appreciated. J

 

 

Corina




Re: [OMPI users] Help

2017-04-27 Thread Corina Jeni Tudorache
When I run command rpm --query centos-release, it shows the following: 
centos-release-7-3.1611.el7.centos.x86_64. So maybe I should install CentOS 5?
 
C.
 
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of 
gil...@rist.or.jp
Sent: Thursday, April 27, 2017 12:36 PM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Help
 
 by the way, are you running CentOS 5 ?
it seems mpi-selector is no more available from CentOS 6
 
Cheers,
 
Gilles
- Original Message -
Yes, I write it wrong the previous e-mail, but actually it does not work. Gives 
the error message: mpi: command not found
 
 
Corina
 
 
 
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of 
gil...@rist.or.jp
Sent: Thursday, April 27, 2017 11:34 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Help
 
 
 Hi,
 
that looks like a typo, the command is
mpi-selector --list
 
Cheers,
 
Gilles
- Original Message -
Hello,
 
 
I am trying to install Open MPI on Centos and I got stuck. I have installed an 
GNU compiler and after that I run the command: yum install 
openmpi-devel.x86_64. But when I run command mpi selector –- list I receive 
this error “mpi: command not found”
I am following the instruction from here: 
https://na-inet.jp/na/pccluster/centos_x86_64-en.html
Any help is much appreciated. J
 
 
Corina

Re: [OMPI users] Help

2017-04-27 Thread Corina Jeni Tudorache
It says mpi-selector is not installed. And yes for the mpi-selector command, 
the error message is command not found.
 
C.
 
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of 
gil...@rist.or.jp
Sent: Thursday, April 27, 2017 12:32 PM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Help
 
 Well, i cannot make sense of this error message.
 
if the command is mpi-selector, the error message could be
mpi-selector: command not found
but this is not the error message you reported
 
what does
rpm -ql mpi-selector
reports ?
 
Cheers,
 
Gilles
- Original Message -
Yes, I write it wrong the previous e-mail, but actually it does not work. Gives 
the error message: mpi: command not found
 
 
Corina
 
 
 
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of 
gil...@rist.or.jp
Sent: Thursday, April 27, 2017 11:34 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Help
 
 
 Hi,
 
that looks like a typo, the command is
mpi-selector --list
 
Cheers,
 
Gilles
- Original Message -
Hello,
 
 
I am trying to install Open MPI on Centos and I got stuck. I have installed an 
GNU compiler and after that I run the command: yum install 
openmpi-devel.x86_64. But when I run command mpi selector –- list I receive 
this error “mpi: command not found”
I am following the instruction from here: 
https://na-inet.jp/na/pccluster/centos_x86_64-en.html
Any help is much appreciated. J
 
 
Corina

Re: [OMPI users] Help

2017-04-27 Thread gilles
 by the way, are you running CentOS 5 ?

it seems mpi-selector is no longer available from CentOS 6 on

Cheers,

Gilles

- Original Message -

Yes, I write it wrong the previous e-mail, but actually it does not 
work. Gives the error message: mpi: command not found

 

Corina

 

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of 
gil...@rist.or.jp
Sent: Thursday, April 27, 2017 11:34 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Help

 

 Hi,

 

that looks like a typo, the command is

mpi-selector --list

 

Cheers,

 

Gilles

- Original Message -

Hello,

 

 

I am trying to install Open MPI on Centos and I got stuck. I 
have installed an GNU compiler and after that I run the command: yum 
install openmpi-devel.x86_64. But when I run command mpi selector –- 
list I receive this error “mpi: command not found”

I am following the instruction from here: 
https://na-inet.jp/na/pccluster/centos_x86_64-en.html


Any help is much appreciated. J

 

 

Corina




Re: [OMPI users] Help

2017-04-27 Thread gilles
 Well, I cannot make sense of this error message.

if the command is mpi-selector, the error message could be

mpi-selector: command not found

but this is not the error message you reported

what does

rpm -ql mpi-selector

reports ?

Cheers,

Gilles

- Original Message -

Yes, I write it wrong the previous e-mail, but actually it does not 
work. Gives the error message: mpi: command not found

 

Corina

 

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of 
gil...@rist.or.jp
Sent: Thursday, April 27, 2017 11:34 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Help

 

 Hi,

 

that looks like a typo, the command is

mpi-selector --list

 

Cheers,

 

Gilles

- Original Message -

Hello,

 

 

I am trying to install Open MPI on Centos and I got stuck. I 
have installed an GNU compiler and after that I run the command: yum 
install openmpi-devel.x86_64. But when I run command mpi selector –- 
list I receive this error “mpi: command not found”

I am following the instruction from here: 
https://na-inet.jp/na/pccluster/centos_x86_64-en.html


Any help is much appreciated. J

 

 

Corina




Re: [OMPI users] Help

2017-04-27 Thread Corina Jeni Tudorache
Yes, I wrote it wrong in the previous e-mail, but it actually does not work. It gives 
the error message: mpi: command not found
 
Corina
 
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of 
gil...@rist.or.jp
Sent: Thursday, April 27, 2017 11:34 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Help
 
 Hi,
 
that looks like a typo, the command is
mpi-selector --list
 
Cheers,
 
Gilles
- Original Message -
Hello,
 
 
I am trying to install Open MPI on Centos and I got stuck. I have installed an 
GNU compiler and after that I run the command: yum install 
openmpi-devel.x86_64. But when I run command mpi selector –- list I receive 
this error “mpi: command not found”
I am following the instruction from here: 
https://na-inet.jp/na/pccluster/centos_x86_64-en.html
Any help is much appreciated. J
 
 
Corina

Re: [OMPI users] Help

2017-04-27 Thread gilles
 Hi,

that looks like a typo, the command is

mpi-selector --list

Cheers,

Gilles

- Original Message -

Hello,

 

I am trying to install Open MPI on Centos and I got stuck. I have 
installed an GNU compiler and after that I run the command: yum install 
openmpi-devel.x86_64. But when I run command mpi selector –- list I 
receive this error “mpi: command not found”

I am following the instruction from here: 
https://na-inet.jp/na/pccluster/centos_x86_64-en.html


Any help is much appreciated. J

 

Corina




Re: [OMPI users] Help with Open MPI 2.1.0 and PGI 16.10: Configure and C++

2017-03-24 Thread Matt Thompson
Gilles,

The library I am having issues linking is ESMF, and it is a C++/Fortran
application. From
http://www.earthsystemmodeling.org/esmf_releases/non_public/ESMF_7_0_0/ESMF_usrdoc/node9.html#SECTION00092000
:

The following compilers and utilities *are required* for compiling, linking
> and testing the ESMF software:
> Fortran90 (or later) compiler;
> C++ compiler;
> MPI implementation compatible with the above compilers (but see below);
> GNU's gcc compiler - for a standard cpp preprocessor implementation;
> GNU make;
> Perl - for running test scripts.


(Emphasis mine)

This is why I am concerned. For now, I'll build Open MPI with the (possibly
useless) C++ support for PGI and move on to the Fortran issue (which I'll
detail in another email).

But, as I *need* ESMF for my application, it would be good to get an mpicxx
that I can have confidence in with PGI.

Matt


On Thu, Mar 23, 2017 at 9:05 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Matt,
>
> a C++ compiler is required to configure Open MPI.
> That being said, C++ compiler is only used if you build the C++ bindings
> (That were removed from MPI-3)
> And unless you plan to use the mpic++ wrapper (with or without the C++
> bindings),
> a valid C++ compiler is not required at all.
> /* configure still requires one, and that could be improved */
>
> My point is you should not worry too much about configure messages related
> to C++,
> and you should instead focus on the Fortran issue.
>
> Cheers,
>
> Gilles
>
> On Thursday, March 23, 2017, Matt Thompson  wrote:
>
>> All, I'm hoping one of you knows what I might be doing wrong here.  I'm
>> trying to use Open MPI 2.1.0 for PGI 16.10 (Community Edition) on macOS.
>> Now, I built it a la:
>>
>> http://www.pgroup.com/userforum/viewtopic.php?p=21105#21105
>>
>> and found that it built, but the resulting mpifort, etc were just not
>> good. Couldn't even do Hello World.
>>
>> So, I thought I'd start from the beginning. I tried running:
>>
>> configure --disable-wrapper-rpath CC=pgcc CXX=pgc++ FC=pgfortran
>> --prefix=/Users/mathomp4/installed/Compiler/pgi-16.10/openmpi/2.1.0
>> but when I did I saw this:
>>
>> *** C++ compiler and preprocessor
>> checking whether we are using the GNU C++ compiler... yes
>> checking whether pgc++ accepts -g... yes
>> checking dependency style of pgc++... none
>> checking how to run the C++ preprocessor... pgc++ -E
>> checking for the C++ compiler vendor... gnu
>>
>> Well, that's not the right vendor. So, I took a look at configure and I
>> saw that at least some detection for PGI was a la:
>>
>>   pgCC* | pgcpp*)
>> # Portland Group C++ compiler
>> case `$CC -V` in
>> *pgCC\ [1-5].* | *pgcpp\ [1-5].*)
>>
>>   pgCC* | pgcpp*)
>> # Portland Group C++ compiler
>> lt_prog_compiler_wl_CXX='-Wl,'
>> lt_prog_compiler_pic_CXX='-fpic'
>> lt_prog_compiler_static_CXX='-Bstatic'
>> ;;
>>
>> Ah. PGI 16.9+ now use pgc++ to do C++ compiling, not pgcpp. So, I hacked
>> configure so that references to pgCC (nonexistent on macOS) are gone and
>> all pgcpp became pgc++, but:
>>
>> *** C++ compiler and preprocessor
>> checking whether we are using the GNU C++ compiler... yes
>> checking whether pgc++ accepts -g... yes
>> checking dependency style of pgc++... none
>> checking how to run the C++ preprocessor... pgc++ -E
>> checking for the C++ compiler vendor... gnu
>>
>> Well, at this point, I think I'm stopping until I get help. Will this
>> chunk of configure always return gnu for PGI? I know the C part returns
>> 'portland group':
>>
>> *** C compiler and preprocessor
>> checking for gcc... (cached) pgcc
>> checking whether we are using the GNU C compiler... (cached) no
>> checking whether pgcc accepts -g... (cached) yes
>> checking for pgcc option to accept ISO C89... (cached) none needed
>> checking whether pgcc understands -c and -o together... (cached) yes
>> checking for pgcc option to accept ISO C99... none needed
>> checking for the C compiler vendor... portland group
>>
>> so I thought the C++ section would as well. I also tried passing in
>> --enable-mpi-cxx, but that did nothing.
>>
>> Is this just a red herring? My real concern is with pgfortran/mpifort,
>> but I thought I'd start with this. If this is okay, I'll move on and detail
>> the fortran issues I'm having.
>>
>> Matt
>> --
>> Matt Thompson
>>
>> Man Among Men
>> Fulcrum of History
>>
>>



-- 
Matt Thompson

Man Among Men
Fulcrum of History

Re: [OMPI users] Help with Open MPI 2.1.0 and PGI 16.10: Configure and C++

2017-03-23 Thread Gilles Gouaillardet
Matt,

a C++ compiler is required to configure Open MPI.
That being said, the C++ compiler is only used if you build the C++ bindings
(which were removed in MPI-3).
And unless you plan to use the mpic++ wrapper (with or without the C++
bindings),
a valid C++ compiler is not required at all.
/* configure still requires one, and that could be improved */

My point is you should not worry too much about configure messages related
to C++,
and you should instead focus on the Fortran issue.

Cheers,

Gilles

On Thursday, March 23, 2017, Matt Thompson  wrote:

> All, I'm hoping one of you knows what I might be doing wrong here.  I'm
> trying to use Open MPI 2.1.0 for PGI 16.10 (Community Edition) on macOS.
> Now, I built it a la:
>
> http://www.pgroup.com/userforum/viewtopic.php?p=21105#21105
>
> and found that it built, but the resulting mpifort, etc were just not
> good. Couldn't even do Hello World.
>
> So, I thought I'd start from the beginning. I tried running:
>
> configure --disable-wrapper-rpath CC=pgcc CXX=pgc++ FC=pgfortran
> --prefix=/Users/mathomp4/installed/Compiler/pgi-16.10/openmpi/2.1.0
> but when I did I saw this:
>
> *** C++ compiler and preprocessor
> checking whether we are using the GNU C++ compiler... yes
> checking whether pgc++ accepts -g... yes
> checking dependency style of pgc++... none
> checking how to run the C++ preprocessor... pgc++ -E
> checking for the C++ compiler vendor... gnu
>
> Well, that's not the right vendor. So, I took a look at configure and I
> saw that at least some detection for PGI was a la:
>
>   pgCC* | pgcpp*)
> # Portland Group C++ compiler
> case `$CC -V` in
> *pgCC\ [1-5].* | *pgcpp\ [1-5].*)
>
>   pgCC* | pgcpp*)
> # Portland Group C++ compiler
> lt_prog_compiler_wl_CXX='-Wl,'
> lt_prog_compiler_pic_CXX='-fpic'
> lt_prog_compiler_static_CXX='-Bstatic'
> ;;
>
> Ah. PGI 16.9+ now use pgc++ to do C++ compiling, not pgcpp. So, I hacked
> configure so that references to pgCC (nonexistent on macOS) are gone and
> all pgcpp became pgc++, but:
>
> *** C++ compiler and preprocessor
> checking whether we are using the GNU C++ compiler... yes
> checking whether pgc++ accepts -g... yes
> checking dependency style of pgc++... none
> checking how to run the C++ preprocessor... pgc++ -E
> checking for the C++ compiler vendor... gnu
>
> Well, at this point, I think I'm stopping until I get help. Will this
> chunk of configure always return gnu for PGI? I know the C part returns
> 'portland group':
>
> *** C compiler and preprocessor
> checking for gcc... (cached) pgcc
> checking whether we are using the GNU C compiler... (cached) no
> checking whether pgcc accepts -g... (cached) yes
> checking for pgcc option to accept ISO C89... (cached) none needed
> checking whether pgcc understands -c and -o together... (cached) yes
> checking for pgcc option to accept ISO C99... none needed
> checking for the C compiler vendor... portland group
>
> so I thought the C++ section would as well. I also tried passing in
> --enable-mpi-cxx, but that did nothing.
>
> Is this just a red herring? My real concern is with pgfortran/mpifort, but
> I thought I'd start with this. If this is okay, I'll move on and detail the
> fortran issues I'm having.
>
> Matt
> --
> Matt Thompson
>
> Man Among Men
> Fulcrum of History
>
>

Re: [OMPI users] Help with Open MPI 2.1.0 and PGI 16.10: Configure and C++

2017-03-23 Thread Reuti
Hi,

Am 22.03.2017 um 20:12 schrieb Matt Thompson:

> […]
> 
> Ah. PGI 16.9+ now use pgc++ to do C++ compiling, not pgcpp. So, I hacked 
> configure so that references to pgCC (nonexistent on macOS) are gone and all 
> pgcpp became pgc++, but:

This is not unique to macOS. pgCC used the STLPort STL and is no longer included 
with their compiler suite; pgc++ now uses a GCC-compatible library format and 
replaces the former one on Linux too.

There I get, ignoring the gnu output during `configure` and compiling anyway:

$ mpic++ --version

pgc++ 16.10-0 64-bit target on x86-64 Linux -tp bulldozer
The Portland Group - PGI Compilers and Tools
Copyright (c) 2016, NVIDIA CORPORATION.  All rights reserved.

Maybe some options for the `mpic++` wrapper were just set in a wrong way?
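
A quick way to check which front end the wrapper really invokes is to look at
the predefined macros of a tiny probe (a sketch; pgc++ typically also defines
the GNU compatibility macros, which would explain why configure's vendor check
answers "gnu"):

// probe.cpp -- compile with: mpic++ probe.cpp -o probe && ./probe
#include <cstdio>

int main()
{
#if defined(__PGI)
  std::printf("PGI front end detected (__PGI is defined)\n");
#endif
#if defined(__GNUC__)
  // pgc++ advertises GNU compatibility, so this can be defined as well
  std::printf("GNU compatibility macros present (__GNUC__ = %d)\n", __GNUC__);
#endif
  return 0;
}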

===

Nevertheless: did you see the error on the Mac at the end of the `configure` 
step too, or was it gone after the hints on the discussion's link you posted? 
As I see it, there is still one about "libevent".

-- Reuti


> 
> *** C++ compiler and preprocessor
> checking whether we are using the GNU C++ compiler... yes
> checking whether pgc++ accepts -g... yes
> checking dependency style of pgc++... none
> checking how to run the C++ preprocessor... pgc++ -E
> checking for the C++ compiler vendor... gnu
> 
> Well, at this point, I think I'm stopping until I get help. Will this chunk 
> of configure always return gnu for PGI? I know the C part returns 'portland 
> group':
> 
> *** C compiler and preprocessor
> checking for gcc... (cached) pgcc
> checking whether we are using the GNU C compiler... (cached) no
> checking whether pgcc accepts -g... (cached) yes
> checking for pgcc option to accept ISO C89... (cached) none needed
> checking whether pgcc understands -c and -o together... (cached) yes
> checking for pgcc option to accept ISO C99... none needed
> checking for the C compiler vendor... portland group
> 
> so I thought the C++ section would as well. I also tried passing in 
> --enable-mpi-cxx, but that did nothing.
> 
> Is this just a red herring? My real concern is with pgfortran/mpifort, but I 
> thought I'd start with this. If this is okay, I'll move on and detail the 
> fortran issues I'm having.
> 
> Matt
> --
> Matt Thompson
> Man Among Men
> Fulcrum of History




Re: [OMPI users] Help - Client / server - app hangs in connect/accept by the second or next client that wants to connect to server

2016-10-21 Thread Gilles Gouaillardet
Matus,


This has very likely been fixed by
https://github.com/open-mpi/ompi/pull/2259
Can you download the patch at
https://github.com/open-mpi/ompi/pull/2259.patch and apply it manually on
v1.10 ?

Cheers,

Gilles


On Monday, August 29, 2016, M. D.  wrote:

>
> Hi,
>
> I would like to ask - are there any new solutions or investigations in
> this problem?
>
> Cheers,
>
> Matus Dobrotka
>
> 2016-07-19 15:23 GMT+02:00 Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com
> >:
>
>> my bad for the confusion,
>>
>> I misread you and miswrote my reply.
>>
>> I will investigate this again.
>>
>> strictly speaking, the clients can only start after the server first
>> write the port info to a file.
>> if you start the client right after the server start, they might use
>> incorrect/outdated info and cause all the test hang.
>>
>> I will start reproducing the hang
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On Tuesday, July 19, 2016, M. D. > > wrote:
>>
>>> Yes I understand it, but I think, this is exactly that situation you are
>>> talking about. In my opinion, the test is doing exactly what you said -
>>> when a new player is willing to join, other players must invoke 
>>> MPI_Comm_accept().
>>> All *other* players must invoke MPI_Comm_accept(). Only the last client
>>> (in this case last player which wants to join) does not
>>> invoke MPI_Comm_accept(), because this client invokes only
>>> MPI_Comm_connect(). He is connecting to communicator, in which all other
>>> players are already involved and therefore this last client doesn't have to
>>> invoke MPI_Comm_accept().
>>>
>>> Am I still missing something in this my reflection?
>>>
>>> Matus
>>>
>>> 2016-07-19 10:55 GMT+02:00 Gilles Gouaillardet :
>>>
 here is what the client is doing

 printf("CLIENT: after merging, new comm: size=%d rank=%d\n", size,
 rank) ;

 for (i = rank ; i < num_clients ; i++)
 {
   /* client performs a collective accept */
   CHK(MPI_Comm_accept(server_port_name, MPI_INFO_NULL, 0, intracomm, &intercomm)) ;

   printf("CLIENT: connected to server on port\n") ;
   [...]

 }

 2) has rank 1

 /* and 3) has rank 2) */

 so unless you run 2) with num_clients=2, MPI_Comm_accept() is never
 called, hence my analysis of the crash/hang


 I understand what you are trying to achieve, keep in mind
 MPI_Comm_accept() is a collective call, so when a new player

 is willing to join, other players must invoke MPI_Comm_accept().

 and it is up to you to make sure that happens


 Cheers,


 Gilles

 On 7/19/2016 5:48 PM, M. D. wrote:



 2016-07-19 10:06 GMT+02:00 Gilles Gouaillardet :

> MPI_Comm_accept must be called by all the tasks of the local
> communicator.
>
 Yes, that's how I understand it. In the source code of the test, all
 the tasks call  MPI_Comm_accept - server and also relevant clients.

> so if you
>
> 1) mpirun -np 1 ./singleton_client_server 2 1
>
> 2) mpirun -np 1 ./singleton_client_server 2 0
>
> 3) mpirun -np 1 ./singleton_client_server 2 0
>
> then 3) starts after 2) has exited, so on 1), intracomm is made of 1)
> and an exited task (2)
>
 This is not true in my opinion -  because of above mentioned fact that
 MPI_Comm_accept is called by all the tasks of the local communicator.

> /*
>
> strictly speaking, there is a race condition, if 2) has exited, then
> MPI_Comm_accept will crash when 1) informs 2) that 3) has joined.
>
> if 2) has not yet exited, then the test will hang because 2) does not
> invoke MPI_Comm_accept
>
> */
>
 Task 2) does not exit, because of blocking call of MPI_Comm_accept.

>
>

> there are different ways of seeing things :
>
> 1) this is an incorrect usage of the test, the number of clients
> should be the same everywhere
>
> 2) task 2) should not exit (because it did not call
> MPI_Comm_disconnect()) and the test should hang when
>
> starting task 3) because task 2) does not call MPI_Comm_accept()
>
>
> ad 1) I am sorry, but maybe I do not understand what you think - In my
 previous post I wrote that the number of clients is the same in every
 mpirun instance.
 ad 2) it is the same as above

> i do not know how you want to spawn your tasks.
>
> if 2) and 3) do not need to communicate with each other (they only
> communicate with 1)), then
>
> you can simply MPI_Comm_accept(MPI_COMM_WORLD) in 1)
>
> if 2 and 3) need to communicate with each other, it would be much
> easier to 

Re: [OMPI users] Help - Client / server - app hangs in connect/accept by the second or next client that wants to connect to server

2016-07-19 Thread Gilles Gouaillardet
my bad for the confusion,

I misread you and miswrote my reply.

I will investigate this again.

strictly speaking, the clients can only start after the server first writes
the port info to a file.
if you start the clients right after the server starts, they might use
incorrect/outdated info and cause the whole test to hang.

I will start reproducing the hang

Cheers,

Gilles
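
For reference, the hand-off described above has roughly this shape on the
server side (a sketch, not the test's exact code; the file name is
illustrative):

#include <mpi.h>
#include <fstream>

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);

  char port_name[MPI_MAX_PORT_NAME];
  MPI_Open_port(MPI_INFO_NULL, port_name);   // runtime-generated port string

  {   // publish the port; clients must only read the file once it is fully written
    std::ofstream f("server_port.txt");
    f << port_name << "\n";
  }

  MPI_Comm intercomm;
  MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);

  // ... talk to the client over intercomm ...

  MPI_Comm_disconnect(&intercomm);
  MPI_Close_port(port_name);
  MPI_Finalize();
  return 0;
}

A client reads the same file into a string and passes it to MPI_Comm_connect();
if it reads before the server has finished writing, it connects with stale or
empty data and the run hangs, which is exactly the race mentioned above.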

On Tuesday, July 19, 2016, M. D.  wrote:

> Yes I understand it, but I think, this is exactly that situation you are
> talking about. In my opinion, the test is doing exactly what you said -
> when a new player is willing to join, other players must invoke 
> MPI_Comm_accept().
> All *other* players must invoke MPI_Comm_accept(). Only the last client
> (in this case last player which wants to join) does not
> invoke MPI_Comm_accept(), because this client invokes only
> MPI_Comm_connect(). He is connecting to communicator, in which all other
> players are already involved and therefore this last client doesn't have to
> invoke MPI_Comm_accept().
>
> Am I still missing something in this my reflection?
>
> Matus
>
> 2016-07-19 10:55 GMT+02:00 Gilles Gouaillardet  >:
>
>> here is what the client is doing
>>
>> printf("CLIENT: after merging, new comm: size=%d rank=%d\n", size,
>> rank) ;
>>
>> for (i = rank ; i < num_clients ; i++)
>> {
>>   /* client performs a collective accept */
>>   CHK(MPI_Comm_accept(server_port_name, MPI_INFO_NULL, 0, intracomm, &intercomm)) ;
>>
>>   printf("CLIENT: connected to server on port\n") ;
>>   [...]
>>
>> }
>>
>> 2) has rank 1
>>
>> /* and 3) has rank 2) */
>>
>> so unless you run 2) with num_clients=2, MPI_Comm_accept() is never
>> called, hence my analysis of the crash/hang
>>
>>
>> I understand what you are trying to achieve, keep in mind
>> MPI_Comm_accept() is a collective call, so when a new player
>>
>> is willing to join, other players must invoke MPI_Comm_accept().
>>
>> and it is up to you to make sure that happens
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
>> On 7/19/2016 5:48 PM, M. D. wrote:
>>
>>
>>
>> 2016-07-19 10:06 GMT+02:00 Gilles Gouaillardet > >:
>>
>>> MPI_Comm_accept must be called by all the tasks of the local
>>> communicator.
>>>
>> Yes, that's how I understand it. In the source code of the test, all the
>> tasks call  MPI_Comm_accept - server and also relevant clients.
>>
>>> so if you
>>>
>>> 1) mpirun -np 1 ./singleton_client_server 2 1
>>>
>>> 2) mpirun -np 1 ./singleton_client_server 2 0
>>>
>>> 3) mpirun -np 1 ./singleton_client_server 2 0
>>>
>>> then 3) starts after 2) has exited, so on 1), intracomm is made of 1)
>>> and an exited task (2)
>>>
>> This is not true in my opinion -  because of above mentioned fact that
>> MPI_Comm_accept is called by all the tasks of the local communicator.
>>
>>> /*
>>>
>>> strictly speaking, there is a race condition, if 2) has exited, then
>>> MPI_Comm_accept will crash when 1) informs 2) that 3) has joined.
>>>
>>> if 2) has not yet exited, then the test will hang because 2) does not
>>> invoke MPI_Comm_accept
>>>
>>> */
>>>
>> Task 2) does not exit, because of blocking call of MPI_Comm_accept.
>>
>>>
>>>
>>
>>> there are different ways of seeing things :
>>>
>>> 1) this is an incorrect usage of the test, the number of clients should
>>> be the same everywhere
>>>
>>> 2) task 2) should not exit (because it did not call
>>> MPI_Comm_disconnect()) and the test should hang when
>>>
>>> starting task 3) because task 2) does not call MPI_Comm_accept()
>>>
>>>
>>> ad 1) I am sorry, but maybe I do not understand what you think - In my
>> previous post I wrote that the number of clients is the same in every
>> mpirun instance.
>> ad 2) it is the same as above
>>
>>> i do not know how you want to spawn your tasks.
>>>
>>> if 2) and 3) do not need to communicate with each other (they only
>>> communicate with 1)), then
>>>
>>> you can simply MPI_Comm_accept(MPI_COMM_WORLD) in 1)
>>>
>>> if 2 and 3) need to communicate with each other, it would be much easier
>>> to MPI_Comm_spawn or MPI_Comm_spawn_multiple only once in 1),
>>>
>>> so there is only one inter communicator with all the tasks.
>>>
>> My aim is that all the tasks need to communicate with each other. I am
>> implementing a distributed application - game with more players
>> communicating with each other via MPI. It should work as follows - First
>> player creates a game and waits for other players to connect to this game.
>> On different computers (in the same network) the other players can join
>> this game. When they are connected, they should be able to play this game
>> together.
>> I hope, it is clear what my idea is. If it is not, just ask me, please.
>>
>>>
>>> The current test program is growing incrementally the intercomm, which
>>> does require extra steps for synchronization.
>>>
>>>
>>> Cheers,

Re: [OMPI users] Help - Client / server - app hangs in connect/accept by the second or next client that wants to connect to server

2016-07-19 Thread M. D.
Yes, I understand it, but I think this is exactly the situation you are
talking about. In my opinion, the test is doing exactly what you said -
when a new player is willing to join, the other players must invoke
MPI_Comm_accept().
All *other* players must invoke MPI_Comm_accept(). Only the last client (in
this case the last player that wants to join) does not
invoke MPI_Comm_accept(), because this client invokes only
MPI_Comm_connect(). It is connecting to the communicator in which all other
players are already involved, and therefore this last client doesn't have to
invoke MPI_Comm_accept().

Am I still missing something in this reasoning?

Matus
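
For reference, the incremental-join pattern under discussion looks roughly
like this (a sketch, not the exact test code; names are illustrative). Every
task already in the group calls the accept helper collectively, while the
newcomer alone calls the join helper:

#include <mpi.h>

// Called collectively by every task already in 'intracomm' when one more
// player is expected; returns the grown intracommunicator.
MPI_Comm accept_one_more(const char *port_name, MPI_Comm intracomm)
{
  MPI_Comm intercomm, merged;
  MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, intracomm, &intercomm);
  MPI_Intercomm_merge(intercomm, 0, &merged);   // existing members go "low"
  MPI_Comm_disconnect(&intercomm);
  return merged;
}

// Called once by the joining player only.
MPI_Comm join_game(const char *port_name)
{
  MPI_Comm intercomm, merged;
  MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
  MPI_Intercomm_merge(intercomm, 1, &merged);   // the newcomer goes "high"
  MPI_Comm_disconnect(&intercomm);
  return merged;
}

After joining, a player keeps calling accept_one_more() on its merged
communicator once per later arrival, which is why every already-connected
client has to stay in the accept loop until the expected number of players
has been reached.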

2016-07-19 10:55 GMT+02:00 Gilles Gouaillardet :

> here is what the client is doing
>
> printf("CLIENT: after merging, new comm: size=%d rank=%d\n", size,
> rank) ;
>
> for (i = rank ; i < num_clients ; i++)
> {
>   /* client performs a collective accept */
>   CHK(MPI_Comm_accept(server_port_name, MPI_INFO_NULL, 0, intracomm, &intercomm)) ;
>
>   printf("CLIENT: connected to server on port\n") ;
>   [...]
>
> }
>
> 2) has rank 1
>
> /* and 3) has rank 2) */
>
> so unless you run 2) with num_clients=2, MPI_Comm_accept() is never
> called, hence my analysis of the crash/hang
>
>
> I understand what you are trying to achieve, keep in mind
> MPI_Comm_accept() is a collective call, so when a new player
>
> is willing to join, other players must invoke MPI_Comm_accept().
>
> and it is up to you to make sure that happens
>
>
> Cheers,
>
>
> Gilles
>
> On 7/19/2016 5:48 PM, M. D. wrote:
>
>
>
> 2016-07-19 10:06 GMT+02:00 Gilles Gouaillardet :
>
>> MPI_Comm_accept must be called by all the tasks of the local communicator.
>>
> Yes, that's how I understand it. In the source code of the test, all the
> tasks call  MPI_Comm_accept - server and also relevant clients.
>
>> so if you
>>
>> 1) mpirun -np 1 ./singleton_client_server 2 1
>>
>> 2) mpirun -np 1 ./singleton_client_server 2 0
>>
>> 3) mpirun -np 1 ./singleton_client_server 2 0
>>
>> then 3) starts after 2) has exited, so on 1), intracomm is made of 1) and
>> an exited task (2)
>>
> This is not true in my opinion -  because of above mentioned fact that
> MPI_Comm_accept is called by all the tasks of the local communicator.
>
>> /*
>>
>> strictly speaking, there is a race condition, if 2) has exited, then
>> MPI_Comm_accept will crash when 1) informs 2) that 3) has joined.
>>
>> if 2) has not yet exited, then the test will hang because 2) does not
>> invoke MPI_Comm_accept
>>
>> */
>>
> Task 2) does not exit, because of blocking call of MPI_Comm_accept.
>
>>
>>
>
>> there are different ways of seeing things :
>>
>> 1) this is an incorrect usage of the test, the number of clients should
>> be the same everywhere
>>
>> 2) task 2) should not exit (because it did not call
>> MPI_Comm_disconnect()) and the test should hang when
>>
>> starting task 3) because task 2) does not call MPI_Comm_accept()
>>
>>
>> ad 1) I am sorry, but maybe I do not understand what you think - In my
> previous post I wrote that the number of clients is the same in every
> mpirun instance.
> ad 2) it is the same as above
>
>> i do not know how you want to spawn your tasks.
>>
>> if 2) and 3) do not need to communicate with each other (they only
>> communicate with 1)), then
>>
>> you can simply MPI_Comm_accept(MPI_COMM_WORLD) in 1)
>>
>> if 2 and 3) need to communicate with each other, it would be much easier
>> to MPI_Comm_spawn or MPI_Comm_spawn_multiple only once in 1),
>>
>> so there is only one inter communicator with all the tasks.
>>
> My aim is that all the tasks need to communicate with each other. I am
> implementing a distributed application - game with more players
> communicating with each other via MPI. It should work as follows - First
> player creates a game and waits for other players to connect to this game.
> On different computers (in the same network) the other players can join
> this game. When they are connected, they should be able to play this game
> together.
> I hope, it is clear what my idea is. If it is not, just ask me, please.
>
>>
>> The current test program is growing incrementally the intercomm, which
>> does require extra steps for synchronization.
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
> Cheers,
>
> Matus
>
>> On 7/19/2016 4:37 PM, M. D. wrote:
>>
>> Hi,
>> thank you for your interest in this topic.
>>
>> So, I normally run the test as follows:
>> Firstly, I run "server" (second parameter is 1):
>> *mpirun -np 1 ./singleton_client_server number_of_clients 1*
>>
>> Secondly, I run corresponding number of "clients" via following command:
>> *mpirun -np 1 ./singleton_client_server number_of_clients 0*
>>
>> So, for example with 3 clients I do:
>> mpirun -np 1 ./singleton_client_server 3 1
>> mpirun -np 1 ./singleton_client_server 3 0
>> mpirun -np 1 ./singleton_client_server 3 0
>> mpirun -np 1 ./singleton_client_server 3 0
>>
>> It means you are right - there 

Re: [OMPI users] Help - Client / server - app hangs in connect/accept by the second or next client that wants to connect to server

2016-07-19 Thread Gilles Gouaillardet

here is what the client is doing

printf("CLIENT: after merging, new comm: size=%d rank=%d\n", size, 
rank) ;


for (i = rank ; i < num_clients ; i++)
{
  /* client performs a collective accept */
  CHK(MPI_Comm_accept(server_port_name, MPI_INFO_NULL, 0, 
intracomm, )) ;


  printf("CLIENT: connected to server on port\n") ;
  [...]

}

2) has rank 1

/* and 3) has rank 2) */

so unless you run 2) with num_clients=2, MPI_Comm_accept() is never 
called, hence my analysis of the crash/hang



I understand what you are trying to achieve, keep in mind 
MPI_Comm_accept() is a collective call, so when a new player


is willing to join, other players must invoke MPI_Comm_accept().

and it is up to you to make sure that happens


Cheers,


Gilles


On 7/19/2016 5:48 PM, M. D. wrote:



2016-07-19 10:06 GMT+02:00 Gilles Gouaillardet >:


MPI_Comm_accept must be called by all the tasks of the local
communicator.

Yes, that's how I understand it. In the source code of the test, all 
the tasks call  MPI_Comm_accept - server and also relevant clients.


so if you

1) mpirun -np 1 ./singleton_client_server 2 1

2) mpirun -np 1 ./singleton_client_server 2 0

3) mpirun -np 1 ./singleton_client_server 2 0

then 3) starts after 2) has exited, so on 1), intracomm is made of
1) and an exited task (2)

This is not true in my opinion -  because of above mentioned fact that 
MPI_Comm_accept is called by all the tasks of the local communicator.


/*

strictly speaking, there is a race condition, if 2) has exited,
then MPI_Comm_accept will crash when 1) informs 2) that 3) has joined.

if 2) has not yet exited, then the test will hang because 2) does
not invoke MPI_Comm_accept

*/

Task 2) does not exit, because of blocking call of MPI_Comm_accept.


there are different ways of seeing things :

1) this is an incorrect usage of the test, the number of clients
should be the same everywhere

2) task 2) should not exit (because it did not call
MPI_Comm_disconnect()) and the test should hang when

starting task 3) because task 2) does not call MPI_Comm_accept()


ad 1) I am sorry, but maybe I do not understand what you think - In my 
previous post I wrote that the number of clients is the same in every 
mpirun instance.

ad 2) it is the same as above

i do not know how you want to spawn your tasks.

if 2) and 3) do not need to communicate with each other (they only
communicate with 1)), then

you can simply MPI_Comm_accept(MPI_COMM_WORLD) in 1)

if 2 and 3) need to communicate with each other, it would be much
easier to MPI_Comm_spawn or MPI_Comm_spawn_multiple only once in 1),

so there is only one inter communicator with all the tasks.

My aim is that all the tasks need to communicate with each other. I am 
implementing a distributed application - game with more players 
communicating with each other via MPI. It should work as follows - 
First player creates a game and waits for other players to connect to 
this game. On different computers (in the same network) the other 
players can join this game. When they are connected, they should be 
able to play this game together.

I hope, it is clear what my idea is. If it is not, just ask me, please.


The current test program is growing incrementally the intercomm,
which does require extra steps for synchronization.


Cheers,


Gilles

Cheers,

Matus

On 7/19/2016 4:37 PM, M. D. wrote:

Hi,
thank you for your interest in this topic.

So, I normally run the test as follows:
Firstly, I run "server" (second parameter is 1):
*mpirun -np 1 ./singleton_client_server number_of_clients 1*
Secondly, I run corresponding number of "clients" via following
command:
*mpirun -np 1 ./singleton_client_server number_of_clients 0*
So, for example with 3 clients I do:
mpirun -np 1 ./singleton_client_server 3 1
mpirun -np 1 ./singleton_client_server 3 0
mpirun -np 1 ./singleton_client_server 3 0
mpirun -np 1 ./singleton_client_server 3 0

It means you are right - there should be the same number of
clients in each mpirun instance.

The test does not involve MPI_Comm_disconnect(), but the problem
in the test shows up earlier, because some of the clients
(in most cases actually the last client) sometimes cannot
connect to the server, and therefore all the clients and the server
hang (waiting for the connection with the last client(s)).

So, the behaviour of the accept/connect method is a bit confusing to
me.
If I understand you, according to your post the problem is not
in the timeout value, is it?

Cheers,

Matus

2016-07-19 6:28 GMT+02:00 Gilles Gouaillardet >:

How do you run the test ?

you should 

Re: [OMPI users] Help - Client / server - app hangs in connect/accept by the second or next client that wants to connect to server

2016-07-19 Thread M. D.
2016-07-19 10:06 GMT+02:00 Gilles Gouaillardet :

> MPI_Comm_accept must be called by all the tasks of the local communicator.
>
Yes, that's how I understand it. In the source code of the test, all the
tasks call  MPI_Comm_accept - server and also relevant clients.

> so if you
>
> 1) mpirun -np 1 ./singleton_client_server 2 1
>
> 2) mpirun -np 1 ./singleton_client_server 2 0
>
> 3) mpirun -np 1 ./singleton_client_server 2 0
>
> then 3) starts after 2) has exited, so on 1), intracomm is made of 1) and
> an exited task (2)
>
This is not true in my opinion -  because of above mentioned fact that
MPI_Comm_accept is called by all the tasks of the local communicator.

> /*
>
> strictly speaking, there is a race condition, if 2) has exited, then
> MPI_Comm_accept will crash when 1) informs 2) that 3) has joined.
>
> if 2) has not yet exited, then the test will hang because 2) does not
> invoke MPI_Comm_accept
>
> */
>
Task 2) does not exit, because of blocking call of MPI_Comm_accept.

>
>

> there are different ways of seeing things :
>
> 1) this is an incorrect usage of the test, the number of clients should be
> the same everywhere
>
> 2) task 2) should not exit (because it did not call MPI_Comm_disconnect())
> and the test should hang when
>
> starting task 3) because task 2) does not call MPI_Comm_accept()
>
>
> ad 1) I am sorry, but maybe I do not understand what you think - In my
previous post I wrote that the number of clients is the same in every
mpirun instance.
ad 2) it is the same as above

> i do not know how you want to spawn your tasks.
>
> if 2) and 3) do not need to communicate with each other (they only
> communicate with 1)), then
>
> you can simply MPI_Comm_accept(MPI_COMM_WORLD) in 1)
>
> if 2 and 3) need to communicate with each other, it would be much easier
> to MPI_Comm_spawn or MPI_Comm_spawn_multiple only once in 1),
>
> so there is only one inter communicator with all the tasks.
>
My aim is that all the tasks need to communicate with each other. I am
implementing a distributed application - game with more players
communicating with each other via MPI. It should work as follows - First
player creates a game and waits for other players to connect to this game.
On different computers (in the same network) the other players can join
this game. When they are connected, they should be able to play this game
together.
I hope, it is clear what my idea is. If it is not, just ask me, please.

>
> The current test program is growing incrementally the intercomm, which
> does require extra steps for synchronization.
>
>
> Cheers,
>
>
> Gilles
>
Cheers,

Matus

> On 7/19/2016 4:37 PM, M. D. wrote:
>
> Hi,
> thank you for your interest in this topic.
>
> So, I normally run the test as follows:
> Firstly, I run "server" (second parameter is 1):
> *mpirun -np 1 ./singleton_client_server number_of_clients 1*
>
> Secondly, I run corresponding number of "clients" via following command:
> *mpirun -np 1 ./singleton_client_server number_of_clients 0*
>
> So, for example with 3 clients I do:
> mpirun -np 1 ./singleton_client_server 3 1
> mpirun -np 1 ./singleton_client_server 3 0
> mpirun -np 1 ./singleton_client_server 3 0
> mpirun -np 1 ./singleton_client_server 3 0
>
> It means you are right - there should be the same number of clients in
> each mpirun instance.
>
> The test does not involve MPI_Comm_disconnect(), but the problem in the
> test shows up earlier, because some of the clients (in most cases
> actually the last client) sometimes cannot connect to the server, and
> therefore all the clients and the server hang (waiting for the connection
> with the last client(s)).
>
> So, the behaviour of the accept/connect method is a bit confusing to me.
> If I understand you, according to your post the problem is not in the
> timeout value, is it?
>
> Cheers,
>
> Matus
>
> 2016-07-19 6:28 GMT+02:00 Gilles Gouaillardet :
>
>> How do you run the test ?
>>
>> you should have the same number of clients in each mpirun instance, the
>> following simple shell starts the test as i think it is supposed to
>>
>> note the test itself is arguable since MPI_Comm_disconnect() is never
>> invoked
>>
>> (and you will observe some related dpm_base_disconnect_init errors)
>>
>>
>> #!/bin/sh
>>
>> clients=3
>>
>> screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 1
>> 2>&1 | tee /tmp/server.$clients"
>> for i in $(seq $clients); do
>>
>> sleep 1
>>
>> screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 0
>> 2>&1 | tee /tmp/client.$clients.$i"
>> done
>>
>>
>> Ralph,
>>
>>
>> this test fails with master.
>>
>> when the "server" (second parameter is 1), MPI_Comm_accept() fails with a
>> timeout.
>>
>> in ompi/dpm/dpm.c, there is a hard-coded 60-second timeout
>>
>> OPAL_PMIX_EXCHANGE(rc, , , 60);
>>
>> but this is not the timeout that is triggered ...
>>
>> the eviction_cbfunc timeout function is invoked, and it has been set when
>> 

Re: [OMPI users] Help - Client / server - app hangs in connect/accept by the second or next client that wants to connect to server

2016-07-19 Thread Gilles Gouaillardet

MPI_Comm_accept must be called by all the tasks of the local communicator.

so if you

1) mpirun -np 1 ./singleton_client_server 2 1

2) mpirun -np 1 ./singleton_client_server 2 0

3) mpirun -np 1 ./singleton_client_server 2 0

then 3) starts after 2) has exited, so on 1), intracomm is made of 1) 
and an exited task (2)


/*

strictly speaking, there is a race condition, if 2) has exited, then 
MPI_Comm_accept will crash when 1) informs 2) that 3) has joined.


if 2) has not yet exited, then the test will hang because 2) does not 
invoke MPI_Comm_accept


*/


there are different ways of seeing things :

1) this is an incorrect usage of the test, the number of clients should 
be the same everywhere


2) task 2) should not exit (because it did not call 
MPI_Comm_disconnect()) and the test should hang when


starting task 3) because task 2) does not call MPI_Comm_accept()


i do not know how you want to spawn your tasks.

if 2) and 3) do not need to communicate with each other (they only 
communicate with 1)), then


you can simply MPI_Comm_accept(MPI_COMM_WORLD) in 1)

if 2 and 3) need to communicate with each other, it would be much easier 
to MPI_Comm_spawn or MPI_Comm_spawn_multiple only once in 1),


so there is only one inter communicator with all the tasks.


The current test program is growing incrementally the intercomm, which 
does require extra steps for synchronization.
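
To make that concrete, here is a minimal sketch of the pattern (this is
not the original test program: the way the port name reaches the clients
and the argument handling are simplified for illustration, and error
checking and communicator cleanup are omitted):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    /* NCLIENTS plays the role of num_clients in the test program and
     * must be the same value for the server and for every client */
    const int NCLIENTS = 3;
    int is_server = (argc > 1 && argv[1][0] == '1');
    char port[MPI_MAX_PORT_NAME] = "";
    MPI_Comm intracomm = MPI_COMM_WORLD, intercomm;
    int rank, i;

    MPI_Init(&argc, &argv);

    if (is_server) {
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("server port: %s\n", port);  /* the clients need this string */
    } else {
        /* in the real test the port name is passed out of band;
         * here we simply take it from the command line */
        strncpy(port, argv[2], MPI_MAX_PORT_NAME - 1);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm);
        /* the joining client goes "high" so the existing tasks keep ranks 0..n-1 */
        MPI_Intercomm_merge(intercomm, 1, &intracomm);
    }

    /* MPI_Comm_accept() is collective over intracomm: the server and every
     * client that has already joined must call it once per remaining client */
    MPI_Comm_rank(intracomm, &rank);
    for (i = rank; i < NCLIENTS; i++) {
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, intracomm, &intercomm);
        MPI_Intercomm_merge(intercomm, 0, &intracomm);
    }

    /* ... the game is played over intracomm, which now contains everybody ... */

    MPI_Finalize();
    return 0;
}

The key point is the loop: every task already in intracomm calls
MPI_Comm_accept() once for each client that has not joined yet, which is
why the number of clients really has to be the same in every mpirun
instance.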



Cheers,


Gilles

On 7/19/2016 4:37 PM, M. D. wrote:

Hi,
thank you for your interest in this topic.

So, I normally run the test as follows:
Firstly, I run "server" (second parameter is 1):
*mpirun -np 1 ./singleton_client_server number_of_clients 1*
Secondly, I run corresponding number of "clients" via following command:
*mpirun -np 1 ./singleton_client_server number_of_clients 0*
So, for example with 3 clients I do:
mpirun -np 1 ./singleton_client_server 3 1
mpirun -np 1 ./singleton_client_server 3 0
mpirun -np 1 ./singleton_client_server 3 0
mpirun -np 1 ./singleton_client_server 3 0

It means you are right - there should be the same number of clients in 
each mpirun instance.


The test does not involve MPI_Comm_disconnect(), but the problem in 
the test shows up earlier, because some of the clients (in most cases 
actually the last client) sometimes cannot connect to the server, and 
therefore all the clients and the server hang (waiting for the 
connection with the last client(s)).


So, the behaviour of the accept/connect method is a bit confusing to me.
If I understand you, according to your post the problem is not in 
the timeout value, is it?


Cheers,

Matus

2016-07-19 6:28 GMT+02:00 Gilles Gouaillardet >:


How do you run the test ?

you should have the same number of clients in each mpirun
instance, the following simple shell starts the test as i think it
is supposed to

note the test itself is arguable since MPI_Comm_disconnect() is
never invoked

(and you will observe some related dpm_base_disconnect_init errors)


#!/bin/sh

clients=3

screen -d -m sh -c "mpirun -np 1 ./singleton_client_server
$clients 1 2>&1 | tee /tmp/server.$clients"
for i in $(seq $clients); do

sleep 1

screen -d -m sh -c "mpirun -np 1 ./singleton_client_server
$clients 0 2>&1 | tee /tmp/client.$clients.$i"
done


Ralph,


this test fails with master.

when the "server" (second parameter is 1), MPI_Comm_accept() fails
with a timeout.

in ompi/dpm/dpm.c, there is a hard-coded 60-second timeout

OPAL_PMIX_EXCHANGE(rc, , , 60);

but this is not the timeout that is triggered ...

the eviction_cbfunc timeout function is invoked, and it has been
set when opal_hotel_init() was invoked in
orte/orted/pmix/pmix_server.c


default timeout is 2 seconds, but in this case (user invokes
MPI_Comm_accept), i guess the timeout should be infinite or 60
seconds (hard coded value described above)

sadly, if i set a higher timeout value (mpirun --mca
orte_pmix_server_max_wait 180 ...), MPI_Comm_accept() does not
return when the client invokes MPI_Comm_connect()


could you please have a look at this ?


Cheers,


Gilles


On 7/15/2016 9:20 PM, M. D. wrote:

Hello,

I have a problem with basic client - server application. I tried
to run C program from this website

https://github.com/hpc/cce-mpi-openmpi-1.7.1/blob/master/orte/test/mpi/singleton_client_server.c
I saw this program mentioned in many discussions in your website,
so I expected that it should work properly, but after more
testing I found out that there is probably an error somewhere in
connect/accept method. I have read many discussions and threads
on your website, but I have not found similar problem that I am
facing. It seems that nobody had similar problem like me. When I
run this app with one server and more clients (3,4,5,6,...)
   

Re: [OMPI users] Help - Client / server - app hangs in connect/accept by the second or next client that wants to connect to server

2016-07-19 Thread M. D.
Hi,
thank you for your interest in this topic.

So, I normally run the test as follows:
Firstly, I run "server" (second parameter is 1):
*mpirun -np 1 ./singleton_client_server number_of_clients 1*

Secondly, I run corresponding number of "clients" via following command:
*mpirun -np 1 ./singleton_client_server number_of_clients 0*

So, for example with 3 clients I do:
mpirun -np 1 ./singleton_client_server 3 1
mpirun -np 1 ./singleton_client_server 3 0
mpirun -np 1 ./singleton_client_server 3 0
mpirun -np 1 ./singleton_client_server 3 0

It means you are right - there should be the same number of clients in each
mpirun instance.

The test does not involve MPI_Comm_disconnect(), but the problem in the
test shows up earlier, because some of the clients (in most cases
actually the last client) sometimes cannot connect to the server, and
therefore all the clients and the server hang (waiting for the connection
with the last client(s)).

So, the behaviour of the accept/connect method is a bit confusing to me.
If I understand you, according to your post the problem is not in the
timeout value, is it?

Cheers,

Matus

2016-07-19 6:28 GMT+02:00 Gilles Gouaillardet :

> How do you run the test ?
>
> you should have the same number of clients in each mpirun instance, the
> following simple shell starts the test as i think it is supposed to
>
> note the test itself is arguable since MPI_Comm_disconnect() is never
> invoked
>
> (and you will observe some related dpm_base_disconnect_init errors)
>
>
> #!/bin/sh
>
> clients=3
>
> screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 1
> 2>&1 | tee /tmp/server.$clients"
> for i in $(seq $clients); do
>
> sleep 1
>
> screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 0
> 2>&1 | tee /tmp/client.$clients.$i"
> done
>
>
> Ralph,
>
>
> this test fails with master.
>
> when the "server" (second parameter is 1), MPI_Comm_accept() fails with a
> timeout.
>
> in ompi/dpm/dpm.c, there is a hard-coded 60-second timeout
>
> OPAL_PMIX_EXCHANGE(rc, , , 60);
>
> but this is not the timeout that is triggered ...
>
> the eviction_cbfunc timeout function is invoked, and it has been set when
> opal_hotel_init() was invoked in orte/orted/pmix/pmix_server.c
>
>
> default timeout is 2 seconds, but in this case (user invokes
> MPI_Comm_accept), i guess the timeout should be infinite or 60 seconds
> (hard coded value described above)
>
> sadly, if i set a higher timeout value (mpirun --mca
> orte_pmix_server_max_wait 180 ...), MPI_Comm_accept() does not return when
> the client invokes MPI_Comm_connect()
>
>
> could you please have a look at this ?
>
>
> Cheers,
>
>
> Gilles
>
> On 7/15/2016 9:20 PM, M. D. wrote:
>
> Hello,
>
> I have a problem with basic client - server application. I tried to run C
> program from this website
> 
> https://github.com/hpc/cce-mpi-openmpi-1.7.1/blob/master/orte/test/mpi/singleton_client_server.c
> I saw this program mentioned in many discussions in your website, so I
> expected that it should work properly, but after more testing I found out
> that there is probably an error somewhere in connect/accept method. I have
> read many discussions and threads on your website, but I have not found
> similar problem that I am facing. It seems that nobody had similar problem
> like me. When I run this app with one server and more clients (3,4,5,6,...)
> sometimes the app hangs. It hangs when second or next client wants to
> connect to the server (it depends, sometimes third client hangs, sometimes
> fourth, sometimes second, and so on).
> So it means that app starts to hang where server waits for accept and
> client waits for connect. And it is not possible to continue, because this
> client cannot connect to the server. It is strange, because I observed this
> behaviour only in some cases... Sometimes it works without any problems,
> sometimes it does not work. The behaviour is unpredictable and not stable.
>
> I have installed openmpi 1.10.2 on my Fedora 19. I have the same problem
> with Java alternative of this application. It hangs also sometimes... I
> need this app in Java, but firstly it must work properly in C
> implementation. Because of this strange behaviour I assume that there can
> be an error maybe inside of openmpi implementation of connect/accept
> methods. I tried it also with another version of openmpi - 1.8.1. However,
> the problem did not disappear.
>
> Could you help me, what can cause the problem? Maybe I did not get
> something about openmpi (or connect/server) and the problem is with me... I
> will appreciate any help, support, or interest in this topic.
>
> Best regards,
> Matus Dobrotka
>
>
> ___
> users mailing listus...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 

Re: [OMPI users] Help - Client / server - app hangs in connect/accept by the second or next client that wants to connect to server

2016-07-19 Thread Gilles Gouaillardet

How do you run the test ?

you should have the same number of clients in each mpirun instance, the 
following simple shell starts the test as i think it is supposed to


note the test itself is arguable since MPI_Comm_disconnect() is never 
invoked


(and you will observe some related dpm_base_disconnect_init errors)


#!/bin/sh

clients=3

screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 
1 2>&1 | tee /tmp/server.$clients"

for i in $(seq $clients); do

sleep 1

screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 
0 2>&1 | tee /tmp/client.$clients.$i"

done


Ralph,


this test fails with master.

when the "server" (second parameter is 1), MPI_Comm_accept() fails with 
a timeout.


in ompi/dpm/dpm.c, there is a hard-coded 60-second timeout

OPAL_PMIX_EXCHANGE(rc, , , 60);

but this is not the timeout that is triggered ...

the eviction_cbfunc timeout function is invoked, and it has been set 
when opal_hotel_init() was invoked in orte/orted/pmix/pmix_server.c



default timeout is 2 seconds, but in this case (user invokes 
MPI_Comm_accept), i guess the timeout should be infinite or 60 seconds 
(hard coded value described above)


sadly, if i set a higher timeout value (mpirun --mca 
orte_pmix_server_max_wait 180 ...), MPI_Comm_accept() does not return 
when the client invokes MPI_Comm_connect()



could you please have a look at this ?


Cheers,


Gilles


On 7/15/2016 9:20 PM, M. D. wrote:

Hello,

I have a problem with a basic client - server application. I tried to 
run the C program from this website 
https://github.com/hpc/cce-mpi-openmpi-1.7.1/blob/master/orte/test/mpi/singleton_client_server.c
I saw this program mentioned in many discussions on your website, so I 
expected that it should work properly, but after more testing I found 
out that there is probably an error somewhere in the connect/accept 
method. I have read many discussions and threads on your website, but 
I have not found a problem similar to the one I am facing. It seems that 
nobody has had a similar problem. When I run this app with one 
server and more clients (3,4,5,6,...), sometimes the app hangs. It 
hangs when the second or a later client wants to connect to the server (it 
varies: sometimes the third client hangs, sometimes the fourth, sometimes 
the second, and so on).
So the app starts to hang where the server waits in accept and the 
client waits in connect. And it is not possible to continue, because 
this client cannot connect to the server. It is strange, because I 
observed this behaviour only in some cases... Sometimes it works 
without any problems, sometimes it does not. The behaviour is 
unpredictable and not stable.


I have installed openmpi 1.10.2 on my Fedora 19. I have the same 
problem with the Java version of this application. It also hangs 
sometimes... I need this app in Java, but first it must work 
properly in the C implementation. Because of this strange behaviour I 
assume that there may be an error inside the openmpi 
implementation of the connect/accept methods. I also tried it with another 
version of openmpi - 1.8.1. However, the problem did not disappear.


Could you help me figure out what can cause the problem? Maybe I did not get 
something about openmpi (or connect/server) and the problem is with 
me... I will appreciate any help, support, or interest in this 
topic.


Best regards,
Matus Dobrotka


___
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/07/29673.php




Re: [OMPI users] Help on Windows

2016-02-23 Thread Walt Brainerd
Thank you, Gilles! It's amazing to get such help.

It seems to work when I unplugged the ethernet
and have the wireless on, but I will check it out
further (including the firewall situation) to pin it down.

 time mpirun -np 4 ./a
 Hello from   1 out of   4 images.
 Hello from   2 out of   4 images.
 Hello from   3 out of   4 images.
 Hello from   4 out of   4 images.

real    0m0.774s   ---
user    0m0.341s
sys     0m0.933s

On Tue, Feb 23, 2016 at 4:26 PM, Gilles Gouaillardet 
wrote:

> Walt,
>
> generally speaking, that kind of things happen when you are using a
> wireless network and/or a firewall.
>
> so i recommend you first try to disconnect all your networks and see how
> things get improved
>
> Cheers,
>
> Gilles
>
>
> On 2/24/2016 5:08 AM, Walt Brainerd wrote:
>
> I am running up-to-date cygwin on W10 on a 4x i5 processor,
> including gcc 5.3.
>
> I built libcaf_mpi.a provided by OpenCoarrays.
>
> $ cat hello.f90
> program hello
>
>implicit none
>
>print *, "Hello from", this_image(), &
> "out of", num_images(), "images."
>
> end program hello
>
> I compiled the hello.f90 with
>
> $ mpifort -fcoarray=lib hello.f90 libcaf_mpi.a
>
> and ran it with
>
> $ time mpirun -np 4 ./a
>  Hello from   1 out of   4 images.
>  Hello from   2 out of   4 images.
>  Hello from   3 out of   4 images.
>  Hello from   4 out of   4 images.
>
> real    0m42.733s   ! <
> user    0m0.201s
> sys     0m0.934s
>
> So I am getting this long startup delay. The same thing
> happens with other coarray programs.
>
> Any ideas? BTW, I know almost nothing about MPI :-(.
>
> Thanks.
>
> --
> Walt Brainerd
>
>
> ___
> users mailing listus...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/02/28569.php
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/02/28570.php
>



-- 
Walt Brainerd


Re: [OMPI users] Help on Windows

2016-02-23 Thread Gilles Gouaillardet

Walt,

generally speaking, that kind of things happen when you are using a 
wireless network and/or a firewall.


so i recommend you first try to disconnect all your networks and see how 
things get improved


Cheers,

Gilles

On 2/24/2016 5:08 AM, Walt Brainerd wrote:

I am running up-to-date cygwin on W10 on a 4x i5 processor,
including gcc 5.3.

I built libcaf_mpi.a provided by OpenCoarrays.

$ cat hello.f90
program hello

   implicit none

   print *, "Hello from", this_image(), &
"out of", num_images(), "images."

end program hello

I compiled the hello.f90 with

$ mpifort -fcoarray=lib hello.f90 libcaf_mpi.a

and ran it with

$ time mpirun -np 4 ./a
 Hello from   1 out of   4 images.
 Hello from   2 out of   4 images.
 Hello from   3 out of   4 images.
 Hello from   4 out of   4 images.

real    0m42.733s   ! <
user    0m0.201s
sys     0m0.934s

So I am getting this long startup delay. The same thing
happens with other coarray programs.

Any ideas? BTW, I know almost nothing about MPI :-(.

Thanks.

--
Walt Brainerd


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/02/28569.php




Re: [OMPI users] Help with OpenMPI and Univa Grid Engine

2016-02-09 Thread Rahul Pisharody
Hello Ralph, Dave,

Thank you for your suggestions. Let me check on the nfs mounts.

The problem is I am not the grid administrator. I'm working with the grid
administrator to get it resolved. If I had my way, I would probably be
using Sun Grid.

Thank you Dave for pointing out something that I had missed. Let me ask the
admin to check with the Univa guys as well.

Thanks,
Rahul

On Tue, Feb 9, 2016 at 4:43 AM, Dave Love  wrote:

> Rahul Pisharody  writes:
>
> > Hello all,
> >
> > I'm trying to get a simple program (print the hostname of the executing
> > machine) compiled with openmpi run across multiple machines on Univa Grid
> > Engine.
> >
> > This particular configuration has many of the ports blocked. My run
> command
> > has the mca options necessary to limit the ports to the known open ports.
> >
> > However, when I launch the program with mpirun, I get the following error
> > messages:
> >
> > +
> >> error: executing task of job 23 failed: execution daemon on host
> >> "" didn't accept task
>
> So you have a grid engine problem and you're paying Univa a load of
> money (with one of the selling points being MPI support, if I recall
> correctly)...
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/02/28473.php
>


Re: [OMPI users] Help with OpenMPI and Univa Grid Engine

2016-02-09 Thread Dave Love
Rahul Pisharody  writes:

> Hello all,
>
> I'm trying to get a simple program (print the hostname of the executing
> machine) compiled with openmpi run across multiple machines on Univa Grid
> Engine.
>
> This particular configuration has many of the ports blocked. My run command
> has the mca options necessary to limit the ports to the known open ports.
>
> However, when I launch the program with mpirun, I get the following error
> messages:
>
> +
>> error: executing task of job 23 failed: execution daemon on host
>> "" didn't accept task

So you have a grid engine problem and you're paying Univa a load of
money (with one of the selling points being MPI support, if I recall
correctly)...


Re: [OMPI users] Help with OpenMPI and Univa Grid Engine

2016-02-08 Thread Ralph Castain
Is your OMPI installed on an NFS partition? If so, is it in the same mount 
point on all nodes?

Most likely problem is that the required libraries were not found on the remote 
node

> On Feb 8, 2016, at 10:45 AM, Rahul Pisharody  wrote:
> 
> Hello all, 
> 
> I'm trying to get a simple program (print the hostname of the executing 
> machine) compiled with openmpi run across multiple machines on Univa Grid 
> Engine. 
> 
> This particular configuration has many of the ports blocked. My run command 
> has the mca options necessary to limit the ports to the known open ports.
> 
> However, when I launch the program with mpirun, I get the following error 
> messages:
> 
> +
> error: executing task of job 23 failed: execution daemon on host "" 
> didn't accept task
> --
> A daemon (pid 10126) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>  
> There may be more information reported by the environment (see above).
>  
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> error: executing task of job 23 failed: execution daemon on host "machine" 
> didn't accept task
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> 
> 
> I've set the LD_LIBRARY_PATH and I've verified that path points to the 
> necessary shared libraries.
> 
> Any idea/suggestion as to what is happening here will be greatly appreciated.
> 
> Thanks,
> Rahul
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/02/28467.php



Re: [OMPI users] Help with Binding in 1.8.8: Use only second socket

2015-12-21 Thread Saliya Ekanayake
I tried the following with OpenMPI 1.8.1 and 1.10.1. Both worked. In my
case a node has 2 sockets like yours, but each socket has 12 cores, and
lstopo showed that the core numbers for the second socket are 12 to 23.

* mpirun --report-bindings --bind-to core --cpu-set 12,13,14,15,16,17,18,19
-np 8 java Hello*

[j-049:182867] MCW rank 0 bound to socket 1[core 12[hwt 0-1]]:
[../../../../../../../../../../../..][BB/../../../../../../../../../../..]
[j-049:182867] MCW rank 1 bound to socket 1[core 13[hwt 0-1]]:
[../../../../../../../../../../../..][../BB/../../../../../../../../../..]
[j-049:182867] MCW rank 2 bound to socket 1[core 14[hwt 0-1]]:
[../../../../../../../../../../../..][../../BB/../../../../../../../../..]
[j-049:182867] MCW rank 3 bound to socket 1[core 15[hwt 0-1]]:
[../../../../../../../../../../../..][../../../BB/../../../../../../../..]
[j-049:182867] MCW rank 4 bound to socket 1[core 16[hwt 0-1]]:
[../../../../../../../../../../../..][../../../../BB/../../../../../../..]
[j-049:182867] MCW rank 5 bound to socket 1[core 17[hwt 0-1]]:
[../../../../../../../../../../../..][../../../../../BB/../../../../../..]
[j-049:182867] MCW rank 6 bound to socket 1[core 18[hwt 0-1]]:
[../../../../../../../../../../../..][../../../../../../BB/../../../../..]
[j-049:182867] MCW rank 7 bound to socket 1[core 19[hwt 0-1]]:
[../../../../../../../../../../../..][../../../../../../../BB/../../../..]



On Mon, Dec 21, 2015 at 11:40 AM, Matt Thompson  wrote:

> Ralph,
>
> Huh. That isn't in the Open MPI 1.8.8 mpirun man page. It is in Open MPI
> 1.10, so I'm guessing someone noticed it wasn't there. Explains why I
> didn't try it out. I'm assuming this option is respected on all nodes?
>
> Note: a SmarterManThanI™ here at Goddard thought up this:
>
> #!/bin/bash
> rank=0
> for node in $(srun uname -n | sort); do
> echo "rank $rank=$node slots=1:*"
> let rank+=1
> done
>
> It does seem to work in synthetic tests so I'm trying it now in my real
> job. I had to hack a few run scripts so I'll probably spend the next hour
> debugging something dumb I did.
>
> What I'm wondering about all this is: can this be done with --slot-list?
> Or, perhaps, does --slot-list even work?
>
> I have tried about 20 different variations of it, e.g., --slot-list 1:*,
> --slot-list '1:*', --slot-list 1:0,1,2,3,4,5,6,7, --slot-list
> 1:8,9,10,11,12,13,14,15, --slot-list 8-15, , and every time I seem to
> trigger an error via help-rmaps_rank_file.txt. I tried to read
> through opal_hwloc_base_slot_list_parse in the source, but my C isn't great
> (see my gmail address name) so that didn't help. Might not even be the
> right function, but I was just acking the code.
>
> Thanks,
> Matt
>
>
> On Mon, Dec 21, 2015 at 10:51 AM, Ralph Castain  wrote:
>
>> Try adding —cpu-set a,b,c,…  where the a,b,c… are the core id’s of your
>> second socket. I’m working on a cleaner option as this has come up before.
>>
>>
>> On Dec 21, 2015, at 5:29 AM, Matt Thompson  wrote:
>>
>> Dear Open MPI Gurus,
>>
>> I'm currently trying to do something with Open MPI 1.8.8 that I'm pretty
>> sure is possible, but I'm just not smart enough to figure out. Namely, I'm
>> seeing some odd GPU timings and I think it's because I was dumb and assumed
>> the GPU was on the PCI bus next to Socket #0 as some older GPU nodes I ran
>> on were like that.
>>
>> But, a trip through lspci and lstopo has shown me that the GPU is
>> actually on Socket #1. These are dual socket Sandy Bridge nodes and I'd
>> like to do some tests where I run a 8 processes per node and those
>> processes all land on Socket #1.
>>
>> So, what I'm trying to figure out is how to have Open MPI bind processes
>> like that. My first thought as always is to run a helloworld job with
>> -report-bindings on. I can manage to do this:
>>
>> (1061) $ mpirun -np 8 -report-bindings -map-by core ./helloWorld.exe
>> [borg01z205:16306] MCW rank 4 bound to socket 0[core 4[hwt 0]]:
>> [././././B/././.][./././././././.]
>> [borg01z205:16306] MCW rank 5 bound to socket 0[core 5[hwt 0]]:
>> [./././././B/./.][./././././././.]
>> [borg01z205:16306] MCW rank 6 bound to socket 0[core 6[hwt 0]]:
>> [././././././B/.][./././././././.]
>> [borg01z205:16306] MCW rank 7 bound to socket 0[core 7[hwt 0]]:
>> [./././././././B][./././././././.]
>> [borg01z205:16306] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
>> [B/././././././.][./././././././.]
>> [borg01z205:16306] MCW rank 1 bound to socket 0[core 1[hwt 0]]:
>> [./B/./././././.][./././././././.]
>> [borg01z205:16306] MCW rank 2 bound to socket 0[core 2[hwt 0]]:
>> [././B/././././.][./././././././.]
>> [borg01z205:16306] MCW rank 3 bound to socket 0[core 3[hwt 0]]:
>> [./././B/./././.][./././././././.]
>> Process7 of8 is on borg01z205
>> Process5 of8 is on borg01z205
>> Process2 of8 is on borg01z205
>> Process3 of8 is on borg01z205
>> Process4 of8 is on borg01z205
>> Process6 

Re: [OMPI users] Help with Binding in 1.8.8: Use only second socket

2015-12-21 Thread Matt Thompson
Ralph,

Huh. That isn't in the Open MPI 1.8.8 mpirun man page. It is in Open MPI
1.10, so I'm guessing someone noticed it wasn't there. Explains why I
didn't try it out. I'm assuming this option is respected on all nodes?

Note: a SmarterManThanI™ here at Goddard thought up this:

#!/bin/bash
rank=0
for node in $(srun uname -n | sort); do
echo "rank $rank=$node slots=1:*"
let rank+=1
done
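
(The echoed lines are in the usual rankfile syntax, "rank N=host slots=...",
so presumably the output gets redirected to a file that is then passed to
mpirun with -rf/--rankfile.)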

It does seem to work in synthetic tests so I'm trying it now in my real
job. I had to hack a few run scripts so I'll probably spend the next hour
debugging something dumb I did.

What I'm wondering about all this is: can this be done with --slot-list?
Or, perhaps, does --slot-list even work?

I have tried about 20 different variations of it, e.g., --slot-list 1:*,
--slot-list '1:*', --slot-list 1:0,1,2,3,4,5,6,7, --slot-list
1:8,9,10,11,12,13,14,15, --slot-list 8-15, , and every time I seem to
trigger an error via help-rmaps_rank_file.txt. I tried to read
through opal_hwloc_base_slot_list_parse in the source, but my C isn't great
(see my gmail address name) so that didn't help. Might not even be the
right function, but I was just acking the code.

Thanks,
Matt


On Mon, Dec 21, 2015 at 10:51 AM, Ralph Castain  wrote:

> Try adding —cpu-set a,b,c,…  where the a,b,c… are the core id’s of your
> second socket. I’m working on a cleaner option as this has come up before.
>
>
> On Dec 21, 2015, at 5:29 AM, Matt Thompson  wrote:
>
> Dear Open MPI Gurus,
>
> I'm currently trying to do something with Open MPI 1.8.8 that I'm pretty
> sure is possible, but I'm just not smart enough to figure out. Namely, I'm
> seeing some odd GPU timings and I think it's because I was dumb and assumed
> the GPU was on the PCI bus next to Socket #0 as some older GPU nodes I ran
> on were like that.
>
> But, a trip through lspci and lstopo has shown me that the GPU is actually
> on Socket #1. These are dual socket Sandy Bridge nodes and I'd like to do
> some tests where I run a 8 processes per node and those processes all land
> on Socket #1.
>
> So, what I'm trying to figure out is how to have Open MPI bind processes
> like that. My first thought as always is to run a helloworld job with
> -report-bindings on. I can manage to do this:
>
> (1061) $ mpirun -np 8 -report-bindings -map-by core ./helloWorld.exe
> [borg01z205:16306] MCW rank 4 bound to socket 0[core 4[hwt 0]]:
> [././././B/././.][./././././././.]
> [borg01z205:16306] MCW rank 5 bound to socket 0[core 5[hwt 0]]:
> [./././././B/./.][./././././././.]
> [borg01z205:16306] MCW rank 6 bound to socket 0[core 6[hwt 0]]:
> [././././././B/.][./././././././.]
> [borg01z205:16306] MCW rank 7 bound to socket 0[core 7[hwt 0]]:
> [./././././././B][./././././././.]
> [borg01z205:16306] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
> [B/././././././.][./././././././.]
> [borg01z205:16306] MCW rank 1 bound to socket 0[core 1[hwt 0]]:
> [./B/./././././.][./././././././.]
> [borg01z205:16306] MCW rank 2 bound to socket 0[core 2[hwt 0]]:
> [././B/././././.][./././././././.]
> [borg01z205:16306] MCW rank 3 bound to socket 0[core 3[hwt 0]]:
> [./././B/./././.][./././././././.]
> Process7 of8 is on borg01z205
> Process5 of8 is on borg01z205
> Process2 of8 is on borg01z205
> Process3 of8 is on borg01z205
> Process4 of8 is on borg01z205
> Process6 of8 is on borg01z205
> Process0 of8 is on borg01z205
> Process1 of8 is on borg01z205
>
> Great...but wrong socket! Is there a way to tell it to use Socket 1
> instead?
>
> Note I'll be running under SLURM, so I will only have 8 processes per
> node, so it shouldn't need to use Socket 0.
> --
> Matt Thompson
>
> Man Among Men
> Fulcrum of History
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/12/28190.php
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/12/28195.php
>



-- 
Matt Thompson

Man Among Men
Fulcrum of History


Re: [OMPI users] Help with Binding in 1.8.8: Use only second socket

2015-12-21 Thread Ralph Castain
Try adding —cpu-set a,b,c,…  where the a,b,c… are the core id’s of your second 
socket. I’m working on a cleaner option as this has come up before.
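
For your nodes that would be something like (untested, and assuming the
second socket's cores are numbered 8-15, as in your --slot-list attempts):

mpirun -np 8 --cpu-set 8,9,10,11,12,13,14,15 --bind-to core -report-bindings ./helloWorld.exe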


> On Dec 21, 2015, at 5:29 AM, Matt Thompson  > wrote:
> 
> Dear Open MPI Gurus,
> 
> I'm currently trying to do something with Open MPI 1.8.8 that I'm pretty sure 
> is possible, but I'm just not smart enough to figure out. Namely, I'm seeing 
> some odd GPU timings and I think it's because I was dumb and assumed the GPU 
> was on the PCI bus next to Socket #0 as some older GPU nodes I ran on were 
> like that. 
> 
> But, a trip through lspci and lstopo has shown me that the GPU is actually on 
> Socket #1. These are dual socket Sandy Bridge nodes and I'd like to do some 
> tests where I run a 8 processes per node and those processes all land on 
> Socket #1.
> 
> So, what I'm trying to figure out is how to have Open MPI bind processes like 
> that. My first thought as always is to run a helloworld job with 
> -report-bindings on. I can manage to do this:
> 
> (1061) $ mpirun -np 8 -report-bindings -map-by core ./helloWorld.exe
> [borg01z205:16306] MCW rank 4 bound to socket 0[core 4[hwt 0]]: 
> [././././B/././.][./././././././.]
> [borg01z205:16306] MCW rank 5 bound to socket 0[core 5[hwt 0]]: 
> [./././././B/./.][./././././././.]
> [borg01z205:16306] MCW rank 6 bound to socket 0[core 6[hwt 0]]: 
> [././././././B/.][./././././././.]
> [borg01z205:16306] MCW rank 7 bound to socket 0[core 7[hwt 0]]: 
> [./././././././B][./././././././.]
> [borg01z205:16306] MCW rank 0 bound to socket 0[core 0[hwt 0]]: 
> [B/././././././.][./././././././.]
> [borg01z205:16306] MCW rank 1 bound to socket 0[core 1[hwt 0]]: 
> [./B/./././././.][./././././././.]
> [borg01z205:16306] MCW rank 2 bound to socket 0[core 2[hwt 0]]: 
> [././B/././././.][./././././././.]
> [borg01z205:16306] MCW rank 3 bound to socket 0[core 3[hwt 0]]: 
> [./././B/./././.][./././././././.]
> Process7 of8 is on borg01z205
> Process5 of8 is on borg01z205
> Process2 of8 is on borg01z205
> Process3 of8 is on borg01z205
> Process4 of8 is on borg01z205
> Process6 of8 is on borg01z205
> Process0 of8 is on borg01z205
> Process1 of8 is on borg01z205
> 
> Great...but wrong socket! Is there a way to tell it to use Socket 1 instead? 
> 
> Note I'll be running under SLURM, so I will only have 8 processes per node, 
> so it shouldn't need to use Socket 0.
> -- 
> Matt Thompson
> Man Among Men
> Fulcrum of History
> ___
> users mailing list
> us...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/12/28190.php



Re: [OMPI users] help understand unhelpful ORTE error message

2015-11-30 Thread Jeff Squyres (jsquyres)
On Nov 24, 2015, at 9:31 AM, Dave Love  wrote:
> 
>> btw, we already use the force, thanks to the ob1 pml and the yoda spml
> 
> I think that's assuming familiarity with something which leaves out some
> people...

FWIW, I agree: we use unhelpful names for components in Open MPI.  What Gilles 
is specifically referring to here is that there are several Star Wars-based 
names of plugins in Open MPI.  They mean something to us developers (they 
started off as a funny joke), but they mean little/nothing to end users.

I actually specifically called out this issue in the SC'15 Open MPI BOF:

http://image.slidesharecdn.com/ompi-bof-2015-for-web-151130155610-lva1-app6891/95/open-mpi-sc15-state-of-the-union-bof-28-638.jpg?cb=1448898995

This is definitely an issue that is on the agenda for the face-to-face Open MPI 
developer's meeting in February 
(https://github.com/open-mpi/ompi/wiki/Meeting-2016-02).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] help understand unhelpful ORTE error message

2015-11-24 Thread Dave Love
Gilles Gouaillardet  writes:

> Currently, ompi creates a file in the temporary directory and then mmaps it.
> an obvious requirement is that the temporary directory must have enough free
> space for that file.
> (this might be an issue on some disk less nodes)
>
>
> a simple alternative could be to try /tmp, and if there is not enough
> space, try /dev/shm
> (unless the tmpdir has been set explicitly)
>
> any thought ?

/tmp is already the default if TMPDIR et al aren't defined, isn't it?

While you may not have any choice to use /dev/shm on a diskless node, it
doesn't seem a good thing to do by default for large maps.  It wasn't
here.

[I've never been sure of the semantics of mmap over tmpfs.]

I think the important thing is clear explanation of any error, and
suggestions for workarounds.  Presumably anyone operating diskless nodes
has made arrangements for this sort of thing.
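
(And, if I remember the parameter name correctly, they can already point the
session directory somewhere else with the orte_tmpdir_base MCA parameter, or
simply by setting TMPDIR.)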

> Gilles
>
> btw, we already use the force, thanks to the ob1 pml and the yoda spml

I think that's assuming familiarity with something which leaves out some
people...


Re: [OMPI users] help understand unhelpful ORTE error message

2015-11-20 Thread Gilles Gouaillardet
Currently, ompi creates a file in the temporary directory and then mmaps it.
an obvious requirement is that the temporary directory must have enough free
space for that file.
(this might be an issue on some disk less nodes)


a simple alternative could be to try /tmp, and if there is not enough
space, try /dev/shm
(unless the tmpdir has been set explicitly)

any thought ?

Gilles

btw, we already use the force, thanks to the ob1 pml and the yoda spml

On Friday, November 20, 2015, Dave Love  wrote:

> Jeff Hammond > writes:
>
> >> Doesn't mpich have the option to use sysv memory?  You may want to try
> that
> >>
> >>
> > MPICH?  Look, I may have earned my way onto Santa's naughty list more
> than
> > a few times, but at least I have the decency not to post MPICH questions
> to
> > the Open-MPI list ;-)
> >
> > If there is a way to tell Open-MPI to use shm_open without filesystem
> > backing (if that is even possible) at configure time, I'd love to do
> that.
>
> I'm not sure I understand what's required, but is this what you're after?
>
>   $ ompi_info --param shmem all -l 9|grep priority
>  MCA shmem: parameter "shmem_mmap_priority" (current
> value: "50", data source: default, level: 3 user/all, type: int)
>  MCA shmem: parameter "shmem_posix_priority" (current
> value: "40", data source: default, level: 3 user/all, type: int)
>  MCA shmem: parameter "shmem_sysv_priority" (current
> value: "30", data source: default, level: 3 user/all, type: int)
>
> >> In the spirit OMPI - may the force be with you.
> >>
> >>
> > All I will say here is that Open-MPI has a Vader BTL :-)
>
> Whatever that might mean.
> ___
> users mailing list
> us...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/11/28084.php
>


Re: [OMPI users] help understand unhelpful ORTE error message

2015-11-20 Thread Dave Love
Jeff Hammond  writes:

>> Doesn't mpich have the option to use sysv memory?  You may want to try that
>>
>>
> MPICH?  Look, I may have earned my way onto Santa's naughty list more than
> a few times, but at least I have the decency not to post MPICH questions to
> the Open-MPI list ;-)
>
> If there is a way to tell Open-MPI to use shm_open without filesystem
> backing (if that is even possible) at configure time, I'd love to do that.

I'm not sure I understand what's required, but is this what you're after?

  $ ompi_info --param shmem all -l 9|grep priority
 MCA shmem: parameter "shmem_mmap_priority" (current value: 
"50", data source: default, level: 3 user/all, type: int)
 MCA shmem: parameter "shmem_posix_priority" (current value: 
"40", data source: default, level: 3 user/all, type: int)
 MCA shmem: parameter "shmem_sysv_priority" (current value: 
"30", data source: default, level: 3 user/all, type: int)

>> In the spirit OMPI - may the force be with you.
>>
>>
> All I will say here is that Open-MPI has a Vader BTL :-)

Whatever that might mean.


Re: [OMPI users] help understand unhelpful ORTE error message

2015-11-20 Thread Dave Love
[There must be someone better to answer this, but since I've seen it:]

Jeff Hammond  writes:

> I have no idea what this is trying to tell me.  Help?
>
> jhammond@nid00024:~/MPI/qoit/collectives> mpirun -n 2 ./driver.x 64
> [nid00024:00482] [[46168,0],0] ORTE_ERROR_LOG: Not found in file
> ../../../../../orte/mca/plm/alps/plm_alps_module.c at line 418

That must be a system error message, presumably indicating why the
process couldn't be launched; it's not in the OMPI source.

> I can run the same job with srun without incident:
>
> jhammond@nid00024:~/MPI/qoit/collectives> srun -n 2 ./driver.x 64
> MPI was initialized.
>
> This is on the NERSC Cori Cray XC40 system.  I build Open-MPI git head from
> source for OFI libfabric.
>
> I have many other issues, which I will report later.  As a spoiler, if I
> cannot use your mpirun, I cannot set any of the MCA options there.  Is
> there a method to set MCA options with environment variables?  I could not
> find this documented anywhere.

mpirun(1) documents the mechanisms under "Setting MCA Parameters",
unless it's changed since 1.8.  [I have wondered why a file in cwd isn't
a possibility, only in $HOME.]
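
(For what it's worth, the environment-variable form is
OMPI_MCA_<param_name>=<value>, e.g.

  export OMPI_MCA_mtl_ofi_provider_include=sockets

which is the same example Howard gives elsewhere in this thread.)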


Re: [OMPI users] help understand unhelpful ORTE error message

2015-11-19 Thread Jeff Hammond
On Thu, Nov 19, 2015 at 4:11 PM, Howard Pritchard 
wrote:

> Hi Jeff H.
>
> Why don't you just try configuring with
>
> ./configure --prefix=my_favorite_install_dir
> --with-libfabric=install_dir_for_libfabric
> make -j 8 install
>
> and see what happens?
>
>
That was the first thing I tried.  However, it seemed to give me a
Verbs-oriented build, and Verbs is the Sith lord to us JedOFIs :-)

>From aforementioned Wiki:

../configure \
 --with-libfabric=$HOME/OFI/install-ofi-gcc-gni-cori \
 --disable-shared \
 --prefix=$HOME/MPI/install-ompi-ofi-gcc-gni-cori

Unfortunately, this (above) leads to an mpicc that indicates support for IB
Verbs, not OFI.
I will try again though just in case.


> Make sure before you configure that you have PrgEnv-gnu or PrgEnv-intel
> module loaded.
>
>
Yeah, I know better than to use the Cray compilers for such things (e.g.
https://github.com/jeffhammond/OpenPA/commit/965ca014ea3148ee5349e16d2cec1024271a7415
)


> Those were the configure/compiler options I used to do testing of ofi mtl
> on cori.
>
> Jeff S. - this thread has gotten intermingled with mpich setup as well,
> hence
> the suggestion for the mpich shm mechanism.
>
>
The first OSS implementation of MPI that I can use on Cray XC using OFI
gets a prize at the December MPI Forum.

Best,

Jeff



> Howard
>
>
>
> 2015-11-19 16:59 GMT-07:00 Jeff Hammond :
>
>>
>>> How did you configure for Cori?  You need to be using the slurm plm
>>> component for that system.  I know this sounds like gibberish.
>>>
>>>
>> ../configure --with-libfabric=$HOME/OFI/install-ofi-gcc-gni-cori \
>>  --enable-mca-static=mtl-ofi \
>>  --enable-mca-no-build=btl-openib,btl-vader,btl-ugni,btl-tcp \
>>  --enable-static --disable-shared --disable-dlopen \
>>  --prefix=$HOME/MPI/install-ompi-ofi-gcc-gni-xpmem-cori \
>>  --with-cray-pmi --with-alps --with-cray-xpmem --with-slurm \
>>  --without-verbs --without-fca --without-mxm --without-ucx \
>>  --without-portals4 --without-psm --without-psm2 \
>>  --without-udreg --without-ugni --without-munge \
>>  --without-sge --without-loadleveler --without-tm --without-lsf \
>>  --without-pvfs2 --without-plfs \
>>  --without-cuda --disable-oshmem \
>>  --disable-mpi-fortran --disable-oshmem-fortran \
>>  LDFLAGS="-L/opt/cray/ugni/default/lib64 -lugni \
>>   -L/opt/cray/alps/default/lib64 -lalps -lalpslli -lalpsutil \   
>>-ldl -lrt"
>>
>>
>> This is copied from
>> https://github.com/jeffhammond/HPCInfo/blob/master/ofi/README.md#open-mpi,
>> which I note in case you want to see what changes I've made at any point in
>> the future.
>>
>>
>>> There should be a with-slurm configure option to pick up this component.
>>>
>>> Indeed there is.
>>
>>
>>> Doesn't mpich have the option to use sysv memory?  You may want to try
>>> that
>>>
>>>
>> MPICH?  Look, I may have earned my way onto Santa's naughty list more
>> than a few times, but at least I have the decency not to post MPICH
>> questions to the Open-MPI list ;-)
>>
>> If there is a way to tell Open-MPI to use shm_open without filesystem
>> backing (if that is even possible) at configure time, I'd love to do that.
>>
>>
>>> Oh for tuning params you can use env variables.  For example lets say
>>> rather than using the gni provider in ofi mtl you want to try sockets. Then
>>> do
>>>
>>> Export OMPI_MCA_mtl_ofi_provider_include=sockets
>>>
>>>
>> Thanks.  I'm glad that there is an option to set them this way.
>>
>>
>>> In the spirit OMPI - may the force be with you.
>>>
>>>
>> All I will say here is that Open-MPI has a Vader BTL :-)
>>
>>>
>>> > On Thu 19.11.2015 09:44:20 Jeff Hammond wrote:
>>> > > I have no idea what this is trying to tell me. Help?
>>> > >
>>> > > jhammond@nid00024:~/MPI/qoit/collectives> mpirun -n 2 ./driver.x 64
>>> > > [nid00024:00482] [[46168,0],0] ORTE_ERROR_LOG: Not found in file
>>> > > ../../../../../orte/mca/plm/alps/plm_alps_module.c at line 418
>>> > >
>>> > > I can run the same job with srun without incident:
>>> > >
>>> > > jhammond@nid00024:~/MPI/qoit/collectives> srun -n 2 ./driver.x 64
>>> > > MPI was initialized.
>>> > >
>>> > > This is on the NERSC Cori Cray XC40 system. I build Open-MPI git
>>> head from
>>> > > source for OFI libfabric.
>>> > >
>>> > > I have many other issues, which I will report later. As a spoiler,
>>> if I
>>> > > cannot use your mpirun, I cannot set any of the MCA options there. Is
>>> > > there a method to set MCA options with environment variables? I
>>> could not
>>> > > find this documented anywhere.
>>> > >
>>> > > In particular, is there a way to cause shm to not use the global
>>> > > filesystem? I see this issue comes up a lot and I read the list
>>> archives,
>>> > > but the warning message (
>>> > >
>>> 

Re: [OMPI users] help understand unhelpful ORTE error message

2015-11-19 Thread Howard Pritchard
Hi Jeff,

I finally got an allocation on cori - it's one busy machine.

Anyway, using the ompi I'd built on edison with the above recommended
configure options
I was able to run using either srun or mpirun on cori provided that in the
latter case I used

mpirun -np X -N Y --mca plm slurm ./my_favorite_app

I will make an adjustment to the alps plm launcher to disqualify itself if
the wlm_detect
facility on the cray reports that srun is the launcher.  That's a minor fix
and should make
it into v2.x in a week or so.  It will be a runtime selection so you only
have to build ompi
once for use either on edison or cori.

Howard


2015-11-19 17:11 GMT-07:00 Howard Pritchard :

> Hi Jeff H.
>
> Why don't you just try configuring with
>
> ./configure --prefix=my_favorite_install_dir
> --with-libfabric=install_dir_for_libfabric
> make -j 8 install
>
> and see what happens?
>
> Make sure before you configure that you have PrgEnv-gnu or PrgEnv-intel
> module loaded.
>
> Those were the configure/compiler options I used to do testing of ofi mtl
> on cori.
>
> Jeff S. - this thread has gotten intermingled with mpich setup as well,
> hence
> the suggestion for the mpich shm mechanism.
>
>
> Howard
>
>
>
> 2015-11-19 16:59 GMT-07:00 Jeff Hammond :
>
>>
>>> How did you configure for Cori?  You need to be using the slurm plm
>>> component for that system.  I know this sounds like gibberish.
>>>
>>>
>> ../configure --with-libfabric=$HOME/OFI/install-ofi-gcc-gni-cori \
>>  --enable-mca-static=mtl-ofi \
>>  --enable-mca-no-build=btl-openib,btl-vader,btl-ugni,btl-tcp \
>>  --enable-static --disable-shared --disable-dlopen \
>>  --prefix=$HOME/MPI/install-ompi-ofi-gcc-gni-xpmem-cori \
>>  --with-cray-pmi --with-alps --with-cray-xpmem --with-slurm \
>>  --without-verbs --without-fca --without-mxm --without-ucx \
>>  --without-portals4 --without-psm --without-psm2 \
>>  --without-udreg --without-ugni --without-munge \
>>  --without-sge --without-loadleveler --without-tm --without-lsf \
>>  --without-pvfs2 --without-plfs \
>>  --without-cuda --disable-oshmem \
>>  --disable-mpi-fortran --disable-oshmem-fortran \
>>  LDFLAGS="-L/opt/cray/ugni/default/lib64 -lugni \
>>   -L/opt/cray/alps/default/lib64 -lalps -lalpslli -lalpsutil \   
>>-ldl -lrt"
>>
>>
>> This is copied from
>> https://github.com/jeffhammond/HPCInfo/blob/master/ofi/README.md#open-mpi,
>> which I note in case you want to see what changes I've made at any point in
>> the future.
>>
>>
>>> There should be a with-slurm configure option to pick up this component.
>>>
>>> Indeed there is.
>>
>>
>>> Doesn't mpich have the option to use sysv memory?  You may want to try
>>> that
>>>
>>>
>> MPICH?  Look, I may have earned my way onto Santa's naughty list more
>> than a few times, but at least I have the decency not to post MPICH
>> questions to the Open-MPI list ;-)
>>
>> If there is a way to tell Open-MPI to use shm_open without filesystem
>> backing (if that is even possible) at configure time, I'd love to do that.
>>
>>
>>> Oh for tuning params you can use env variables.  For example lets say
>>> rather than using the gni provider in ofi mtl you want to try sockets. Then
>>> do
>>>
>>> Export OMPI_MCA_mtl_ofi_provider_include=sockets
>>>
>>>
>> Thanks.  I'm glad that there is an option to set them this way.
>>
>>
>>> In the spirit OMPI - may the force be with you.
>>>
>>>
>> All I will say here is that Open-MPI has a Vader BTL :-)
>>
>>>
>>> > On Thu 19.11.2015 09:44:20 Jeff Hammond wrote:
>>> > > I have no idea what this is trying to tell me. Help?
>>> > >
>>> > > jhammond@nid00024:~/MPI/qoit/collectives> mpirun -n 2 ./driver.x 64
>>> > > [nid00024:00482] [[46168,0],0] ORTE_ERROR_LOG: Not found in file
>>> > > ../../../../../orte/mca/plm/alps/plm_alps_module.c at line 418
>>> > >
>>> > > I can run the same job with srun without incident:
>>> > >
>>> > > jhammond@nid00024:~/MPI/qoit/collectives> srun -n 2 ./driver.x 64
>>> > > MPI was initialized.
>>> > >
>>> > > This is on the NERSC Cori Cray XC40 system. I build Open-MPI git
>>> head from
>>> > > source for OFI libfabric.
>>> > >
>>> > > I have many other issues, which I will report later. As a spoiler,
>>> if I
>>> > > cannot use your mpirun, I cannot set any of the MCA options there. Is
>>> > > there a method to set MCA options with environment variables? I
>>> could not
>>> > > find this documented anywhere.
>>> > >
>>> > > In particular, is there a way to cause shm to not use the global
>>> > > filesystem? I see this issue comes up a lot and I read the list
>>> archives,
>>> > > but the warning message (
>>> > >
>>> https://github.com/hpc/cce-mpi-openmpi-1.6.4/blob/master/ompi/mca/common/sm/
>>> > > help-mpi-common-sm.txt) suggested that I could override it by
>>> setting TMP,
>>> 

Re: [OMPI users] help understand unhelpful ORTE error message

2015-11-19 Thread Jeff Hammond
>
>
> How did you configure for Cori?  You need to be using the slurm plm
> component for that system.  I know this sounds like gibberish.
>
>
../configure --with-libfabric=$HOME/OFI/install-ofi-gcc-gni-cori \
 --enable-mca-static=mtl-ofi \
 --enable-mca-no-build=btl-openib,btl-vader,btl-ugni,btl-tcp \
 --enable-static --disable-shared --disable-dlopen \
 --prefix=$HOME/MPI/install-ompi-ofi-gcc-gni-xpmem-cori \
 --with-cray-pmi --with-alps --with-cray-xpmem --with-slurm \
 --without-verbs --without-fca --without-mxm --without-ucx \
 --without-portals4 --without-psm --without-psm2 \
 --without-udreg --without-ugni --without-munge \
 --without-sge --without-loadleveler --without-tm --without-lsf \
 --without-pvfs2 --without-plfs \
 --without-cuda --disable-oshmem \
 --disable-mpi-fortran --disable-oshmem-fortran \
 LDFLAGS="-L/opt/cray/ugni/default/lib64 -lugni \
  -L/opt/cray/alps/default/lib64 -lalps -lalpslli -lalpsutil \
  -ldl -lrt"


This is copied from
https://github.com/jeffhammond/HPCInfo/blob/master/ofi/README.md#open-mpi,
which I note in case you want to see what changes I've made at any point in
the future.


> There should be a with-slurm configure option to pick up this component.
>
Indeed there is.


> Doesn't mpich have the option to use sysv memory?  You may want to try that
>
>
MPICH?  Look, I may have earned my way onto Santa's naughty list more than
a few times, but at least I have the decency not to post MPICH questions to
the Open-MPI list ;-)

If there is a way to tell Open-MPI to use shm_open without filesystem
backing (if that is even possible) at configure time, I'd love to do that.


> Oh for tuning params you can use env variables.  For example lets say
> rather than using the gni provider in ofi mtl you want to try sockets. Then
> do
>
> Export OMPI_MCA_mtl_ofi_provider_include=sockets
>
>
Thanks.  I'm glad that there is an option to set them this way.


> In the spirit OMPI - may the force be with you.
>
>
All I will say here is that Open-MPI has a Vader BTL :-)

>
> > On Thu 19.11.2015 09:44:20 Jeff Hammond wrote:
> > > I have no idea what this is trying to tell me. Help?
> > >
> > > jhammond@nid00024:~/MPI/qoit/collectives> mpirun -n 2 ./driver.x 64
> > > [nid00024:00482] [[46168,0],0] ORTE_ERROR_LOG: Not found in file
> > > ../../../../../orte/mca/plm/alps/plm_alps_module.c at line 418
> > >
> > > I can run the same job with srun without incident:
> > >
> > > jhammond@nid00024:~/MPI/qoit/collectives> srun -n 2 ./driver.x 64
> > > MPI was initialized.
> > >
> > > This is on the NERSC Cori Cray XC40 system. I build Open-MPI git head
> from
> > > source for OFI libfabric.
> > >
> > > I have many other issues, which I will report later. As a spoiler, if I
> > > cannot use your mpirun, I cannot set any of the MCA options there. Is
> > > there a method to set MCA options with environment variables? I could
> not
> > > find this documented anywhere.
> > >
> > > In particular, is there a way to cause shm to not use the global
> > > filesystem? I see this issue comes up a lot and I read the list
> archives,
> > > but the warning message (
> > >
> https://github.com/hpc/cce-mpi-openmpi-1.6.4/blob/master/ompi/mca/common/sm/
> > > help-mpi-common-sm.txt) suggested that I could override it by setting
> TMP,
> > > TEMP or TEMPDIR, which I did to no avail.
> >
> > From my experience on edison: the one environment variable that does
> works is TMPDIR - the one that is not listed in the error message :-)
>

That's great.  I will try that now.  Is there a Github issue open already
to fix that documentation?  If not...


> > Can't help you with your mpirun problem though ...
>
No worries.  I appreciate all the help I can get.

Thanks,

Jeff

-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/


Re: [OMPI users] help understand unhelpful ORTE error message

2015-11-19 Thread Howard
Hi Jeff

How did you configure for Cori?  You need to be using the slurm plm component 
for that system.  I know this sounds like gibberish.  

There should be a with-slurm configure option to pick up this component. 

Doesn't mpich have the option to use sysv memory?  You may want to try that

Oh for tuning params you can use env variables.  For example let's say rather 
than using the gni provider in ofi mtl you want to try sockets. Then do

export OMPI_MCA_mtl_ofi_provider_include=sockets
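
More generally, any MCA parameter can be set from the environment by
prefixing its name with OMPI_MCA_, so the two invocations below should be
equivalent (./a.out just stands in for your application):

export OMPI_MCA_mtl_ofi_provider_include=sockets
mpirun -np 2 ./a.out

mpirun -np 2 --mca mtl_ofi_provider_include sockets ./a.out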

In the spirit of OMPI - may the force be with you.

Howard 

Sent from my iPhone

> On 19.11.2015 at 11:51, Martin Siegert wrote:
> 
> Hi Jeff,
>  
> On Thu 19.11.2015 09:44:20 Jeff Hammond wrote:
> > I have no idea what this is trying to tell me. Help?
> >
> > jhammond@nid00024:~/MPI/qoit/collectives> mpirun -n 2 ./driver.x 64
> > [nid00024:00482] [[46168,0],0] ORTE_ERROR_LOG: Not found in file
> > ../../../../../orte/mca/plm/alps/plm_alps_module.c at line 418
> >
> > I can run the same job with srun without incident:
> >
> > jhammond@nid00024:~/MPI/qoit/collectives> srun -n 2 ./driver.x 64
> > MPI was initialized.
> >
> > This is on the NERSC Cori Cray XC40 system. I build Open-MPI git head from
> > source for OFI libfabric.
> >
> > I have many other issues, which I will report later. As a spoiler, if I
> > cannot use your mpirun, I cannot set any of the MCA options there. Is
> > there a method to set MCA options with environment variables? I could not
> > find this documented anywhere.
> >
> > In particular, is there a way to cause shm to not use the global
> > filesystem? I see this issue comes up a lot and I read the list archives,
> > but the warning message (
> > https://github.com/hpc/cce-mpi-openmpi-1.6.4/blob/master/ompi/mca/common/sm/
> > help-mpi-common-sm.txt) suggested that I could override it by setting TMP,
> > TEMP or TEMPDIR, which I did to no avail.
>  
> From my experience on edison: the one environment variable that does works is 
> TMPDIR - the one that is not listed in the error message :-)
>  
> Can't help you with your mpirun problem though ...
>  
> Cheers,
> Martin
>  
> --
> Martin Siegert
> Head, Research Computing
> WestGrid/ComputeCanada Site Lead
> Simon Fraser University
> Burnaby, British Columbia
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/11/28063.php


Re: [OMPI users] help understand unhelpful ORTE error message

2015-11-19 Thread Martin Siegert
Hi Jeff,

On Thu 19.11.2015 09:44:20 Jeff Hammond wrote:
> I have no idea what this is trying to tell me.  Help?
> 
> jhammond@nid00024:~/MPI/qoit/collectives> mpirun -n 2 ./driver.x 64
> [nid00024:00482] [[46168,0],0] ORTE_ERROR_LOG: Not found in file
> ../../../../../orte/mca/plm/alps/plm_alps_module.c at line 418
> 
> I can run the same job with srun without incident:
> 
> jhammond@nid00024:~/MPI/qoit/collectives> srun -n 2 ./driver.x 64
> MPI was initialized.
> 
> This is on the NERSC Cori Cray XC40 system.  I build Open-MPI git head 
from
> source for OFI libfabric.
> 
> I have many other issues, which I will report later.  As a spoiler, if I
> cannot use your mpirun, I cannot set any of the MCA options there.  Is
> there a method to set MCA options with environment variables?  I could 
not
> find this documented anywhere.
> 
> In particular, is there a way to cause shm to not use the global
> filesystem?  I see this issue comes up a lot and I read the list archives,
> but the warning message (
> https://github.com/hpc/cce-mpi-openmpi-1.6.4/blob/master/ompi/mca/common/sm/
> help-mpi-common-sm.txt) suggested that I could override it by setting 
TMP,
> TEMP or TEMPDIR, which I did to no avail.

From my experience on edison: the one environment variable that does 
work is TMPDIR - the one that is not listed in the error message :-)
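
For example, pointing it at node-local scratch before launching (the path
below is only an illustration - use whatever local directory your nodes
actually provide):

export TMPDIR=/tmp
mpirun -n 2 ./driver.x 64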

Can't help you with your mpirun problem though ...

Cheers,
Martin

-- 
Martin Siegert
Head, Research Computing
WestGrid/ComputeCanada Site Lead
Simon Fraser University
Burnaby, British Columbia


Re: [OMPI users] Help with Specific Binding

2015-09-13 Thread Ralph Castain
Check out the man page “OMPI_Affinity_str” for an MPI extension that might help


> On Sep 13, 2015, at 7:28 AM, Saliya Ekanayake  wrote:
> 
> Thank you, I'll try this. Also, is there a way to know which core a process 
> is bound to within the program other than executing something like taskset 
>  from program?
> 
> On Sun, Sep 13, 2015 at 10:05 AM, Ralph Castain  > wrote:
> Actually, the error was correct - it was me that was incorrect. The correct 
> set of options would be:
> 
> —map-by ppr:12_node —bind-to core —cpu-set=0,2,4,…
> 
> Sorry about the confusion
> 
> 
>> On Sep 13, 2015, at 2:43 AM, Ralph Castain > > wrote:
>> 
>> The rankfile will certainly do it, but that error is a bug and I’ll have to 
>> fix it.
>> 
>> 
>>> On Sep 13, 2015, at 1:10 AM, Saliya Ekanayake >> > wrote:
>>> 
>>> I could get it working by manually generating a rankfile all the ranks and 
>>> not using any --map-by options.
>>> 
>>> I'll try the --map-by core as well
>>> 
>>> On Sun, Sep 13, 2015 at 3:59 AM, Tobias Kloeffel >> > wrote:
>>> Hi,
>>> use: --map-by core
>>> 
>>> regards,
>>> Tobias
>>> 
>>> 
>>> On 09/13/2015 09:41 AM, Saliya Ekanayake wrote:
 I tried,
 
  --map-by ppr:12:node --slot-list 0,2,4,6,8,10,12,14,16,18,20,22 --bind-to 
 core -np 12
 
 but it complains,
 
 "Conflicting directives for binding policy are causing the policy
 to be redefined:
 
   New policy:   socket
   Prior policy:  CORE
 
 Please check that only one policy is defined.
 "
 
 On Sun, Sep 13, 2015 at 2:57 AM, Ralph Castain > wrote:
 Try something like this instead:
 
 —map-by ppr:12:node —bind-to core —slot-list=0,2,4,6,8,…
 
 You’ll have to play a bit with the core numbers in the slot-list to get 
 the numbering right as I don’t know how your machine numbers them, and I 
 can’t guarantee it will work - but it’s worth a shot. If it doesn’t, then 
 I may have to add an option for such purposes
 
 Ralph
 
> On Sep 12, 2015, at 7:39 PM, Saliya Ekanayake  > wrote:
> 
> Hi,
> 
> We've a machine as in the following picture. I'd like to run 12 MPI procs 
> per node each bound to 1 core, but like shown in blue dots in the 
> pictures. I can use the following command to run 12 procs per node, but 
> PE=1 makes all the 12 processes will run in just 1 socket. PE=2 will make 
> a process bind to 2 cores, which is not what I want. 
> 
> --map-by ppr:12:node:PE=1,SPAN
> 
> Thank you,
> Saliya
> 
> 
> 
> -- 
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914 
>  http://saliya.org 
> ___
> users mailing list
> us...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
> 
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/09/27558.php 
> 
 
 ___
 users mailing list
 us...@open-mpi.org 
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
 
 Link to this post: 
 http://www.open-mpi.org/community/lists/users/2015/09/27559.php 
 
 
 
 
 -- 
 Saliya Ekanayake
 Ph.D. Candidate | Research Assistant
 School of Informatics and Computing | Digital Science Center
 Indiana University, Bloomington
 Cell 812-391-4914 
 http://saliya.org 
 
 ___
 users mailing list
 us...@open-mpi.org 
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
 
 Link to this post: 
 http://www.open-mpi.org/community/lists/users/2015/09/27560.php 
 
>>> -- 
>>> M.Sc. Tobias Klöffel
>>> ===
>>> Interdisciplinary Center for Molecular Materials (ICMM)
>>> and Computer-Chemistry-Center (CCC)
>>> Department Chemie und Pharmazie
>>> Friedrich-Alexander-Universität 

Re: [OMPI users] Help with Specific Binding

2015-09-13 Thread Gilles Gouaillardet
on linux, you can look at /proc/self/status and search for Cpus_allowed_list
or you can use the sched_getaffinity system call

note that in some (hopefully rare) cases, this will return different results
than hwloc
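
A minimal sketch of the sched_getaffinity route (Linux/glibc only, not from
this thread - just an illustration, with error handling trimmed):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    /* pid 0 means "the calling process" */
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        for (int c = 0; c < CPU_SETSIZE; c++) {
            if (CPU_ISSET(c, &mask))
                printf("pid %d may run on cpu %d\n", (int)getpid(), c);
        }
    }
    return 0;
}

The same mask is what shows up in the Cpus_allowed_list field of
/proc/<pid>/status.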

On Sunday, September 13, 2015, Saliya Ekanayake  wrote:

> Thank you, I'll try this. Also, is there a way to know which core a
> process is bound to within the program other than executing something like
> taskset  from program?
>
> On Sun, Sep 13, 2015 at 10:05 AM, Ralph Castain  > wrote:
>
>> Actually, the error was correct - it was me that was incorrect. The
>> correct set of options would be:
>>
>> —map-by ppr:12_node —bind-to core —cpu-set=0,2,4,…
>>
>> Sorry about the confusion
>>
>>
>> On Sep 13, 2015, at 2:43 AM, Ralph Castain > > wrote:
>>
>> The rankfile will certainly do it, but that error is a bug and I’ll have
>> to fix it.
>>
>>
>> On Sep 13, 2015, at 1:10 AM, Saliya Ekanayake > > wrote:
>>
>> I could get it working by manually generating a rankfile all the ranks
>> and not using any --map-by options.
>>
>> I'll try the --map-by core as well
>>
>> On Sun, Sep 13, 2015 at 3:59 AM, Tobias Kloeffel > > wrote:
>>
>>> Hi,
>>> use: --map-by core
>>>
>>> regards,
>>> Tobias
>>>
>>>
>>> On 09/13/2015 09:41 AM, Saliya Ekanayake wrote:
>>>
>>> I tried,
>>>
>>>  --map-by ppr:12:node --slot-list 0,2,4,6,8,10,12,14,16,18,20,22
>>> --bind-to core -np 12
>>>
>>> but it complains,
>>>
>>> "Conflicting directives for binding policy are causing the policy
>>> to be redefined:
>>>
>>>   New policy:   socket
>>>   Prior policy:  CORE
>>>
>>> Please check that only one policy is defined.
>>> "
>>>
>>> On Sun, Sep 13, 2015 at 2:57 AM, Ralph Castain >> > wrote:
>>>
 Try something like this instead:

 —map-by ppr:12:node —bind-to core —slot-list=0,2,4,6,8,…

 You’ll have to play a bit with the core numbers in the slot-list to get
 the numbering right as I don’t know how your machine numbers them, and I
 can’t guarantee it will work - but it’s worth a shot. If it doesn’t, then I
 may have to add an option for such purposes

 Ralph

 On Sep 12, 2015, at 7:39 PM, Saliya Ekanayake > wrote:

 Hi,

 We've a machine as in the following picture. I'd like to run 12 MPI
 procs per node each bound to 1 core, but like shown in blue dots in the
 pictures. I can use the following command to run 12 procs per node, but
 PE=1 makes all the 12 processes will run in just 1 socket. PE=2 will make a
 process bind to 2 cores, which is not what I want.

 --map-by ppr:12:node:PE=1,SPAN

 Thank you,
 Saliya

 

 --
 Saliya Ekanayake
 Ph.D. Candidate | Research Assistant
 School of Informatics and Computing | Digital Science Center
 Indiana University, Bloomington
 Cell 812-391-4914
 http://saliya.org
 ___
 users mailing list
 us...@open-mpi.org 
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
 Link to this post:
 http://www.open-mpi.org/community/lists/users/2015/09/27558.php



 ___
 users mailing list
 us...@open-mpi.org 
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
 Link to this post:
 http://www.open-mpi.org/community/lists/users/2015/09/27559.php

>>>
>>>
>>>
>>> --
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>> Cell 812-391-4914
>>> http://saliya.org
>>>
>>>
>>> ___
>>> users mailing listus...@open-mpi.org 
>>> 
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2015/09/27560.php
>>>
>>>
>>> --
>>> M.Sc. Tobias Klöffel
>>> ===
>>> Interdisciplinary Center for Molecular Materials (ICMM)
>>> and Computer-Chemistry-Center (CCC)
>>> Department Chemie und Pharmazie
>>> Friedrich-Alexander-Universität Erlangen-Nürnberg
>>> Nägelsbachstr. 25
>>> D-91052 Erlangen, Germany
>>>
>>> Room: 2.307
>>> Phone: +49 (0) 9131 / 85 - 20421
>>> Fax: +49 (0) 9131 / 85 - 26565
>>>
>>> 

Re: [OMPI users] Help with Specific Binding

2015-09-13 Thread Saliya Ekanayake
Thank you, I'll try this. Also, is there a way to know which core a process
is bound to within the program other than executing something like taskset
 from program?

On Sun, Sep 13, 2015 at 10:05 AM, Ralph Castain  wrote:

> Actually, the error was correct - it was me that was incorrect. The
> correct set of options would be:
>
> —map-by ppr:12_node —bind-to core —cpu-set=0,2,4,…
>
> Sorry about the confusion
>
>
> On Sep 13, 2015, at 2:43 AM, Ralph Castain  wrote:
>
> The rankfile will certainly do it, but that error is a bug and I’ll have
> to fix it.
>
>
> On Sep 13, 2015, at 1:10 AM, Saliya Ekanayake  wrote:
>
> I could get it working by manually generating a rankfile all the ranks and
> not using any --map-by options.
>
> I'll try the --map-by core as well
>
> On Sun, Sep 13, 2015 at 3:59 AM, Tobias Kloeffel 
> wrote:
>
>> Hi,
>> use: --map-by core
>>
>> regards,
>> Tobias
>>
>>
>> On 09/13/2015 09:41 AM, Saliya Ekanayake wrote:
>>
>> I tried,
>>
>>  --map-by ppr:12:node --slot-list 0,2,4,6,8,10,12,14,16,18,20,22
>> --bind-to core -np 12
>>
>> but it complains,
>>
>> "Conflicting directives for binding policy are causing the policy
>> to be redefined:
>>
>>   New policy:   socket
>>   Prior policy:  CORE
>>
>> Please check that only one policy is defined.
>> "
>>
>> On Sun, Sep 13, 2015 at 2:57 AM, Ralph Castain  wrote:
>>
>>> Try something like this instead:
>>>
>>> —map-by ppr:12:node —bind-to core —slot-list=0,2,4,6,8,…
>>>
>>> You’ll have to play a bit with the core numbers in the slot-list to get
>>> the numbering right as I don’t know how your machine numbers them, and I
>>> can’t guarantee it will work - but it’s worth a shot. If it doesn’t, then I
>>> may have to add an option for such purposes
>>>
>>> Ralph
>>>
>>> On Sep 12, 2015, at 7:39 PM, Saliya Ekanayake  wrote:
>>>
>>> Hi,
>>>
>>> We've a machine as in the following picture. I'd like to run 12 MPI
>>> procs per node each bound to 1 core, but like shown in blue dots in the
>>> pictures. I can use the following command to run 12 procs per node, but
>>> PE=1 makes all the 12 processes will run in just 1 socket. PE=2 will make a
>>> process bind to 2 cores, which is not what I want.
>>>
>>> --map-by ppr:12:node:PE=1,SPAN
>>>
>>> Thank you,
>>> Saliya
>>>
>>> 
>>>
>>> --
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>> Cell 812-391-4914
>>> http://saliya.org
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2015/09/27558.php
>>>
>>>
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2015/09/27559.php
>>>
>>
>>
>>
>> --
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>> Cell 812-391-4914
>> http://saliya.org
>>
>>
>> ___
>> users mailing listus...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2015/09/27560.php
>>
>>
>> --
>> M.Sc. Tobias Klöffel
>> ===
>> Interdisciplinary Center for Molecular Materials (ICMM)
>> and Computer-Chemistry-Center (CCC)
>> Department Chemie und Pharmazie
>> Friedrich-Alexander-Universität Erlangen-Nürnberg
>> Nägelsbachstr. 25
>> D-91052 Erlangen, Germany
>>
>> Room: 2.307
>> Phone: +49 (0) 9131 / 85 - 20421
>> Fax: +49 (0) 9131 / 85 - 26565
>>
>> ===
>>
>>
>> E-mail: tobias.kloef...@fau.de
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/09/27561.php
>>
>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914
> http://saliya.org
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/09/27562.php
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: 

Re: [OMPI users] Help with Specific Binding

2015-09-13 Thread Ralph Castain
Actually, the error was correct - it was me that was incorrect. The correct set 
of options would be:

--map-by ppr:12:node --bind-to core --cpu-set=0,2,4,…
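
Spelled out for the 12-per-node case in this thread (untested sketch - the
core IDs must match your machine's numbering, ./a.out stands in for your
binary, and --report-bindings is only there so you can verify the result):

mpirun --map-by ppr:12:node --bind-to core \
   --cpu-set 0,2,4,6,8,10,12,14,16,18,20,22 \
   --report-bindings ./a.out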

Sorry about the confusion


> On Sep 13, 2015, at 2:43 AM, Ralph Castain  wrote:
> 
> The rankfile will certainly do it, but that error is a bug and I’ll have to 
> fix it.
> 
> 
>> On Sep 13, 2015, at 1:10 AM, Saliya Ekanayake > > wrote:
>> 
>> I could get it working by manually generating a rankfile all the ranks and 
>> not using any --map-by options.
>> 
>> I'll try the --map-by core as well
>> 
>> On Sun, Sep 13, 2015 at 3:59 AM, Tobias Kloeffel > > wrote:
>> Hi,
>> use: --map-by core
>> 
>> regards,
>> Tobias
>> 
>> 
>> On 09/13/2015 09:41 AM, Saliya Ekanayake wrote:
>>> I tried,
>>> 
>>>  --map-by ppr:12:node --slot-list 0,2,4,6,8,10,12,14,16,18,20,22 --bind-to 
>>> core -np 12
>>> 
>>> but it complains,
>>> 
>>> "Conflicting directives for binding policy are causing the policy
>>> to be redefined:
>>> 
>>>   New policy:   socket
>>>   Prior policy:  CORE
>>> 
>>> Please check that only one policy is defined.
>>> "
>>> 
>>> On Sun, Sep 13, 2015 at 2:57 AM, Ralph Castain >> > wrote:
>>> Try something like this instead:
>>> 
>>> —map-by ppr:12:node —bind-to core —slot-list=0,2,4,6,8,…
>>> 
>>> You’ll have to play a bit with the core numbers in the slot-list to get the 
>>> numbering right as I don’t know how your machine numbers them, and I can’t 
>>> guarantee it will work - but it’s worth a shot. If it doesn’t, then I may 
>>> have to add an option for such purposes
>>> 
>>> Ralph
>>> 
 On Sep 12, 2015, at 7:39 PM, Saliya Ekanayake > wrote:
 
 Hi,
 
 We've a machine as in the following picture. I'd like to run 12 MPI procs 
 per node each bound to 1 core, but like shown in blue dots in the 
 pictures. I can use the following command to run 12 procs per node, but 
 PE=1 makes all the 12 processes will run in just 1 socket. PE=2 will make 
 a process bind to 2 cores, which is not what I want. 
 
 --map-by ppr:12:node:PE=1,SPAN
 
 Thank you,
 Saliya
 
 
 
 -- 
 Saliya Ekanayake
 Ph.D. Candidate | Research Assistant
 School of Informatics and Computing | Digital Science Center
 Indiana University, Bloomington
 Cell 812-391-4914 
  http://saliya.org 
 ___
 users mailing list
 us...@open-mpi.org 
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
 
 Link to this post: 
 http://www.open-mpi.org/community/lists/users/2015/09/27558.php 
 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org 
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
>>> 
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2015/09/27559.php 
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>> Cell 812-391-4914 
>>> http://saliya.org 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org 
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
>>> 
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2015/09/27560.php 
>>> 
>> -- 
>> M.Sc. Tobias Klöffel
>> ===
>> Interdisciplinary Center for Molecular Materials (ICMM)
>> and Computer-Chemistry-Center (CCC)
>> Department Chemie und Pharmazie
>> Friedrich-Alexander-Universität Erlangen-Nürnberg
>> Nägelsbachstr. 25
>> D-91052 Erlangen, Germany
>> 
>> Room: 2.307
>> Phone: +49 (0) 9131 / 85 - 20421 
>> 
>> Fax: +49 (0) 9131 / 85 - 26565 
>> 
>> 
>> ===
>> 
>> 
>> E-mail: tobias.kloef...@fau.de 
>> ___
>> users mailing list
>> us...@open-mpi.org 
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
>> 

Re: [OMPI users] Help with Specific Binding

2015-09-13 Thread Ralph Castain
The rankfile will certainly do it, but that error is a bug and I’ll have to fix 
it.


> On Sep 13, 2015, at 1:10 AM, Saliya Ekanayake  wrote:
> 
> I could get it working by manually generating a rankfile all the ranks and 
> not using any --map-by options.
> 
> I'll try the --map-by core as well
> 
> On Sun, Sep 13, 2015 at 3:59 AM, Tobias Kloeffel  > wrote:
> Hi,
> use: --map-by core
> 
> regards,
> Tobias
> 
> 
> On 09/13/2015 09:41 AM, Saliya Ekanayake wrote:
>> I tried,
>> 
>>  --map-by ppr:12:node --slot-list 0,2,4,6,8,10,12,14,16,18,20,22 --bind-to 
>> core -np 12
>> 
>> but it complains,
>> 
>> "Conflicting directives for binding policy are causing the policy
>> to be redefined:
>> 
>>   New policy:   socket
>>   Prior policy:  CORE
>> 
>> Please check that only one policy is defined.
>> "
>> 
>> On Sun, Sep 13, 2015 at 2:57 AM, Ralph Castain > > wrote:
>> Try something like this instead:
>> 
>> —map-by ppr:12:node —bind-to core —slot-list=0,2,4,6,8,…
>> 
>> You’ll have to play a bit with the core numbers in the slot-list to get the 
>> numbering right as I don’t know how your machine numbers them, and I can’t 
>> guarantee it will work - but it’s worth a shot. If it doesn’t, then I may 
>> have to add an option for such purposes
>> 
>> Ralph
>> 
>>> On Sep 12, 2015, at 7:39 PM, Saliya Ekanayake >> > wrote:
>>> 
>>> Hi,
>>> 
>>> We've a machine as in the following picture. I'd like to run 12 MPI procs 
>>> per node each bound to 1 core, but like shown in blue dots in the pictures. 
>>> I can use the following command to run 12 procs per node, but PE=1 makes 
>>> all the 12 processes will run in just 1 socket. PE=2 will make a process 
>>> bind to 2 cores, which is not what I want. 
>>> 
>>> --map-by ppr:12:node:PE=1,SPAN
>>> 
>>> Thank you,
>>> Saliya
>>> 
>>> 
>>> 
>>> -- 
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>> Cell 812-391-4914 
>>>  http://saliya.org 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org 
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
>>> 
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2015/09/27558.php 
>>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org 
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
>> 
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2015/09/27559.php 
>> 
>> 
>> 
>> 
>> -- 
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>> Cell 812-391-4914 
>> http://saliya.org 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org 
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
>> 
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2015/09/27560.php 
>> 
> -- 
> M.Sc. Tobias Klöffel
> ===
> Interdisciplinary Center for Molecular Materials (ICMM)
> and Computer-Chemistry-Center (CCC)
> Department Chemie und Pharmazie
> Friedrich-Alexander-Universität Erlangen-Nürnberg
> Nägelsbachstr. 25
> D-91052 Erlangen, Germany
> 
> Room: 2.307
> Phone: +49 (0) 9131 / 85 - 20421 
> 
> Fax: +49 (0) 9131 / 85 - 26565 
> 
> 
> ===
> 
> 
> E-mail: tobias.kloef...@fau.de 
> ___
> users mailing list
> us...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
> 
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/09/27561.php 
> 
> 
> 
> 
> -- 
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914
> http://saliya.org 
> 

Re: [OMPI users] Help with Specific Binding

2015-09-13 Thread Saliya Ekanayake
I could get it working by manually generating a rankfile for all the ranks and
not using any --map-by options.
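
For reference, a rankfile for this kind of layout looks roughly like the
following (the hostname and the socket:core pairs are made up here - adjust
them to the actual topology) and is passed to mpirun with --rankfile:

rank 0=node001 slot=0:0
rank 1=node001 slot=0:2
rank 2=node001 slot=1:0
rank 3=node001 slot=1:2

and so on for the remaining ranks, then:

mpirun -np 12 --rankfile myrankfile ./a.out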

I'll try the --map-by core as well

On Sun, Sep 13, 2015 at 3:59 AM, Tobias Kloeffel 
wrote:

> Hi,
> use: --map-by core
>
> regards,
> Tobias
>
>
> On 09/13/2015 09:41 AM, Saliya Ekanayake wrote:
>
> I tried,
>
>  --map-by ppr:12:node --slot-list 0,2,4,6,8,10,12,14,16,18,20,22 --bind-to
> core -np 12
>
> but it complains,
>
> "Conflicting directives for binding policy are causing the policy
> to be redefined:
>
>   New policy:   socket
>   Prior policy:  CORE
>
> Please check that only one policy is defined.
> "
>
> On Sun, Sep 13, 2015 at 2:57 AM, Ralph Castain  wrote:
>
>> Try something like this instead:
>>
>> —map-by ppr:12:node —bind-to core —slot-list=0,2,4,6,8,…
>>
>> You’ll have to play a bit with the core numbers in the slot-list to get
>> the numbering right as I don’t know how your machine numbers them, and I
>> can’t guarantee it will work - but it’s worth a shot. If it doesn’t, then I
>> may have to add an option for such purposes
>>
>> Ralph
>>
>> On Sep 12, 2015, at 7:39 PM, Saliya Ekanayake  wrote:
>>
>> Hi,
>>
>> We've a machine as in the following picture. I'd like to run 12 MPI procs
>> per node each bound to 1 core, but like shown in blue dots in the pictures.
>> I can use the following command to run 12 procs per node, but PE=1 makes
>> all the 12 processes will run in just 1 socket. PE=2 will make a process
>> bind to 2 cores, which is not what I want.
>>
>> --map-by ppr:12:node:PE=1,SPAN
>>
>> Thank you,
>> Saliya
>>
>> 
>>
>> --
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>> Cell 812-391-4914
>> http://saliya.org
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/09/27558.php
>>
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/09/27559.php
>>
>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914
> http://saliya.org
>
>
> ___
> users mailing listus...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/09/27560.php
>
>
> --
> M.Sc. Tobias Klöffel
> ===
> Interdisciplinary Center for Molecular Materials (ICMM)
> and Computer-Chemistry-Center (CCC)
> Department Chemie und Pharmazie
> Friedrich-Alexander-Universität Erlangen-Nürnberg
> Nägelsbachstr. 25
> D-91052 Erlangen, Germany
>
> Room: 2.307
> Phone: +49 (0) 9131 / 85 - 20421
> Fax: +49 (0) 9131 / 85 - 26565
>
> ===
>
>
> E-mail: tobias.kloef...@fau.de
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/09/27561.php
>



-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
Cell 812-391-4914
http://saliya.org


Re: [OMPI users] Help with Specific Binding

2015-09-13 Thread Tobias Kloeffel

Hi,
use: --map-by core

regards,
Tobias

On 09/13/2015 09:41 AM, Saliya Ekanayake wrote:

I tried,

 --map-by ppr:12:node --slot-list 0,2,4,6,8,10,12,14,16,18,20,22 
--bind-to core -np 12


but it complains,

"Conflicting directives for binding policy are causing the policy
to be redefined:

  New policy:   socket
  Prior policy:  CORE

Please check that only one policy is defined.
"

On Sun, Sep 13, 2015 at 2:57 AM, Ralph Castain > wrote:


Try something like this instead:

—map-by ppr:12:node —bind-to core —slot-list=0,2,4,6,8,…

You’ll have to play a bit with the core numbers in the slot-list
to get the numbering right as I don’t know how your machine
numbers them, and I can’t guarantee it will work - but it’s worth
a shot. If it doesn’t, then I may have to add an option for such
purposes

Ralph


On Sep 12, 2015, at 7:39 PM, Saliya Ekanayake > wrote:

Hi,

We've a machine as in the following picture. I'd like to run 12
MPI procs per node each bound to 1 core, but like shown in blue
dots in the pictures. I can use the following command to run 12
procs per node, but PE=1 makes all the 12 processes will run in
just 1 socket. PE=2 will make a process bind to 2 cores, which is
not what I want.

--map-by ppr:12:node:PE=1,SPAN

Thank you,
Saliya



-- 
Saliya Ekanayake

Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
Cell 812-391-4914 
http://saliya.org 
___
users mailing list
us...@open-mpi.org 
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/09/27558.php



___
users mailing list
us...@open-mpi.org 
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/09/27559.php




--
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
Cell 812-391-4914
http://saliya.org


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/09/27560.php


--
M.Sc. Tobias Klöffel
===
Interdisciplinary Center for Molecular Materials (ICMM)
and Computer-Chemistry-Center (CCC)
Department Chemie und Pharmazie
Friedrich-Alexander-Universität Erlangen-Nürnberg
Nägelsbachstr. 25
D-91052 Erlangen, Germany

Room: 2.307
Phone: +49 (0) 9131 / 85 - 20421
Fax: +49 (0) 9131 / 85 - 26565

===


E-mail: tobias.kloef...@fau.de



Re: [OMPI users] Help with Specific Binding

2015-09-13 Thread Saliya Ekanayake
I tried,

 --map-by ppr:12:node --slot-list 0,2,4,6,8,10,12,14,16,18,20,22 --bind-to
core -np 12

but it complains,

"Conflicting directives for binding policy are causing the policy
to be redefined:

  New policy:   socket
  Prior policy:  CORE

Please check that only one policy is defined.
"

On Sun, Sep 13, 2015 at 2:57 AM, Ralph Castain  wrote:

> Try something like this instead:
>
> —map-by ppr:12:node —bind-to core —slot-list=0,2,4,6,8,…
>
> You’ll have to play a bit with the core numbers in the slot-list to get
> the numbering right as I don’t know how your machine numbers them, and I
> can’t guarantee it will work - but it’s worth a shot. If it doesn’t, then I
> may have to add an option for such purposes
>
> Ralph
>
> On Sep 12, 2015, at 7:39 PM, Saliya Ekanayake  wrote:
>
> Hi,
>
> We've a machine as in the following picture. I'd like to run 12 MPI procs
> per node each bound to 1 core, but like shown in blue dots in the pictures.
> I can use the following command to run 12 procs per node, but PE=1 makes
> all the 12 processes will run in just 1 socket. PE=2 will make a process
> bind to 2 cores, which is not what I want.
>
> --map-by ppr:12:node:PE=1,SPAN
>
> Thank you,
> Saliya
>
> 
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914
> http://saliya.org
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/09/27558.php
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/09/27559.php
>



-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
Cell 812-391-4914
http://saliya.org


Re: [OMPI users] Help with Specific Binding

2015-09-13 Thread Ralph Castain
Try something like this instead:

--map-by ppr:12:node --bind-to core --slot-list=0,2,4,6,8,…

You’ll have to play a bit with the core numbers in the slot-list to get the 
numbering right as I don’t know how your machine numbers them, and I can’t 
guarantee it will work - but it’s worth a shot. If it doesn’t, then I may have 
to add an option for such purposes

Ralph

> On Sep 12, 2015, at 7:39 PM, Saliya Ekanayake  wrote:
> 
> Hi,
> 
> We've a machine as in the following picture. I'd like to run 12 MPI procs per 
> node each bound to 1 core, but like shown in blue dots in the pictures. I can 
> use the following command to run 12 procs per node, but PE=1 makes all the 12 
> processes will run in just 1 socket. PE=2 will make a process bind to 2 
> cores, which is not what I want. 
> 
> --map-by ppr:12:node:PE=1,SPAN
> 
> Thank you,
> Saliya
> 
> 
> 
> -- 
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914
> http://saliya.org 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/09/27558.php



Re: [OMPI users] Help : Slowness with OpenMPI (1.8.1) and Numpy

2015-06-12 Thread Ralph Castain
Is this a threaded code? If so, you should add --bind-to none to your 1.8 series 
command line
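
The 1.8 series binds processes by default (to a single core when only one or
two ranks are launched), so a multi-threaded BLAS underneath numpy ends up
with all of its threads pinned to one core - hence the slowdown. Something
along these lines should recover the 1.5.4 behaviour (your command from
below, just with the extra flag):

time /usr/lib64/openmpi/bin/mpirun -np 1 --bind-to none \
    python -c 'import numpy; numpy.linalg.svd(numpy.eye(1000))'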


> On Jun 12, 2015, at 7:58 AM, kishor sharma  wrote:
> 
> Hi There,
> 
> 
> 
> I am facing slowness running numpy code using mpirun with openmpi 1.8.1 
> version.
> 
> 
> 
> With Open MPI (1.8.1)
> 
> -
> 
> > /usr/lib64/openmpi/bin/mpirun -version
> 
> mpirun (Open MPI) 1.8.1
> 
>  
> Report bugs to http://www.open-mpi.org/community/help/ 
> 
> >  time /usr/lib64/openmpi/bin/mpirun -np 1 python -c 'import numpy; 
> > numpy.linalg.svd(numpy.eye(1000))'
> 
> real 23.75
> 
> user 6.95
> 
> sys 16.68
> 
> > 
> 
> 
> 
> 
> 
> With Open MPI (1.5.4):
> 
> -
> 
> > /usr/lib64/openmpi/bin/mpirun -version
> 
> mpirun (Open MPI) 1.5.4
> 
>  
> Report bugs to http://www.open-mpi.org/community/help/ 
> 
> > time /usr/lib64/openmpi/bin/mpirun -np 1 python -c 'import numpy; 
> > numpy.linalg.svd(numpy.eye(1000))'
> 
> real 1.35
> 
> user 2.11
> 
> sys 0.71
> 
> >
> 
> 
> 
> > Do you guys have any idea why the above function is 10-15x slower with openmpi 
> version 1.8.1
> 
> 
> 
> Thanks,
> 
> Kishor
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/06/27123.php



Re: [OMPI users] help in execution mpi

2015-04-23 Thread Ralph Castain
Use “orte_rsh_agent = rsh” instead
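
i.e. in the same MCA params file you already edited (the system-wide one is
<prefix>/etc/openmpi-mca-params.conf, the per-user one is
$HOME/.openmpi/mca-params.conf), simply put:

orte_rsh_agent = rsh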


> On Apr 23, 2015, at 10:48 AM, rebona...@upf.br wrote:
> 
> Hi all
> 
> I installed mpi (version 1.6.5) on ubuntu 14.04. I teach parallel 
> programming in an undergraduate course.
> I want to use rsh instead of ssh (the default).
> I changed the file "openmpi-mca-params.conf" and put plm_rsh_agent = rsh 
> there.
> The mpi application works, but a message appears for each process created:
> 
> /* begin message */
> --
> A deprecated MCA parameter value was specified in an MCA parameter
> file.  Deprecated MCA parameters should be avoided; they may disappear
> in future releases.
> 
>  Deprecated parameter: plm_rsh_agent
> --
> /* end message */
> 
> It's bad for explaining things to students. Is there any way to suppress these 
> warning messages?
> 
> Thank's a lot.
> 
> 
> Marcelo Trindade Rebonatto
> Passo Fundo University - Brazil
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/04/26773.php



Re: [OMPI users] Help on getting CMA works

2015-02-24 Thread Nathan Hjelm

I don't know the reasoning for requiring --with-cma to enable CMA but I
am looking at auto-detecting CMA instead of requiring Open MPI to be
configured with --with-cma. This will likely go into the 1.9 release
series and not 1.8.

-Nathan

On Thu, Feb 19, 2015 at 09:31:43PM -0500, Eric Chamberland wrote:
> Maybe it is a stupid question, but... why it is not tested and enabled by
> default at configure time since it is part of the kernel?
> 
> Eric
> 
> 
> On 02/19/2015 03:53 PM, Nathan Hjelm wrote:
> >Great! I will add an MCA variable to force CMA and also enable it if 1)
> >no yama and 2) no PR_SET_PTRACER.
> >
> >You might also look at using xpmem. You can find a version that supports
> >3.x @ https://github.com/hjelmn/xpmem . It is a kernel module +
> >userspace library that can be used by vader as a single-copy mechanism.
> >
> >In benchmarks it performs better than CMA but it may or may not perform
> >better with a real application.
> >
> >See:
> >
> >http://blogs.cisco.com/performance/the-vader-shared-memory-transport-in-open-mpi-now-featuring-3-flavors-of-zero-copy
> >
> >-Nathan
> >
> >On Thu, Feb 19, 2015 at 03:32:43PM -0500, Eric Chamberland wrote:
> >>On 02/19/2015 02:58 PM, Nathan Hjelm wrote:
> >>>On Thu, Feb 19, 2015 at 12:16:49PM -0500, Eric Chamberland wrote:
> On 02/19/2015 11:56 AM, Nathan Hjelm wrote:
> >If you have yama installed you can try:
> Nope, I do not have it installed... is it absolutely necessary? (and would
> it change something when it fails when I am root?)
> 
> Other question: In addition to "--with-cma" configure flag, do we have to
> pass any options to "mpicc" when compiling/linking an mpi application to 
> use
> cma?
> >>>No. CMA should work out of the box. You appear to have a setup I haven't
> >>>yet tested. It doesn't have yama nor does it have the PR_SET_PTRACER
> >>>prctl. Its quite possible there are no restriction on ptrace in this
> >>>setup. Can you try changing the following line at
> >>>opal/mca/btl/vader/btl_vader_component.c:370 from:
> >>>
> >>>bool cma_happy = false;
> >>>
> >>>to
> >>>
> >>>bool cma_happy = true;
> >>>
> >>ok! (as of the officiel release, this is line 386.)
> >>
> >>>and let me know if that works. If it does I will update vader to allow
> >>>CMA in this configuration.
> >>Yep!  It now works perfectly.  Testing with
> >>https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_bandwidth.c, on my
> >>own computer (dual Xeon), I have this:
> >>
> >>Without CMA:
> >>
> >>***Message size:  100 *** best  /  avg  / worst (MB/sec)
> >>task pair:0 -1:8363.52 / 7946.77 / 5391.14
> >>
> >>with CMA:
> >>task pair:0 -1:9137.92 / 8955.98 / 7489.83
> >>
> >>Great!
> >>
> >>Now I have to bench my real application... ;-)
> >>
> >>Thanks!
> >>
> >>Eric
> >>
> >>___
> >>users mailing list
> >>us...@open-mpi.org
> >>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>Link to this post: 
> >>http://www.open-mpi.org/community/lists/users/2015/02/26355.php
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/02/26362.php




Re: [OMPI users] Help on getting CMA works

2015-02-19 Thread Eric Chamberland
Maybe it is a stupid question, but... why is it not tested and enabled 
by default at configure time since it is part of the kernel?


Eric


On 02/19/2015 03:53 PM, Nathan Hjelm wrote:

Great! I will add an MCA variable to force CMA and also enable it if 1)
no yama and 2) no PR_SET_PTRACER.

You might also look at using xpmem. You can find a version that supports
3.x @ https://github.com/hjelmn/xpmem . It is a kernel module +
userspace library that can be used by vader as a single-copy mechanism.

In benchmarks it performs better than CMA but it may or may not perform
better with a real application.

See:

http://blogs.cisco.com/performance/the-vader-shared-memory-transport-in-open-mpi-now-featuring-3-flavors-of-zero-copy

-Nathan

On Thu, Feb 19, 2015 at 03:32:43PM -0500, Eric Chamberland wrote:

On 02/19/2015 02:58 PM, Nathan Hjelm wrote:

On Thu, Feb 19, 2015 at 12:16:49PM -0500, Eric Chamberland wrote:

On 02/19/2015 11:56 AM, Nathan Hjelm wrote:

If you have yama installed you can try:

Nope, I do not have it installed... is it absolutely necessary? (and would
it change something when it fails when I am root?)

Other question: In addition to "--with-cma" configure flag, do we have to
pass any options to "mpicc" when compiling/linking an mpi application to use
cma?

No. CMA should work out of the box. You appear to have a setup I haven't
yet tested. It doesn't have yama nor does it have the PR_SET_PTRACER
prctl. Its quite possible there are no restriction on ptrace in this
setup. Can you try changing the following line at
opal/mca/btl/vader/btl_vader_component.c:370 from:

bool cma_happy = false;

to

bool cma_happy = true;


ok! (as of the official release, this is line 386.)


and let me know if that works. If it does I will update vader to allow
CMA in this configuration.

Yep!  It now works perfectly.  Testing with
https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_bandwidth.c, on my
own computer (dual Xeon), I have this:

Without CMA:

***Message size:  100 *** best  /  avg  / worst (MB/sec)
task pair:0 -1:8363.52 / 7946.77 / 5391.14

with CMA:
task pair:0 -1:9137.92 / 8955.98 / 7489.83

Great!

Now I have to bench my real application... ;-)

Thanks!

Eric

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/02/26355.php




Re: [OMPI users] Help on getting CMA works

2015-02-19 Thread Nathan Hjelm

Aurélien, I should also point out your fix has already been applied to
the 1.8 branch and will be included in 1.8.5.

-Nathan

On Thu, Feb 19, 2015 at 02:57:38PM -0700, Nathan Hjelm wrote:
> 
> Hmm, wait. Yes. Your change went in after 1.8.4 and has the same
> effect. If yama ins't installed it is safe to assume that the ptrace
> scope is effectively 0. So, your patch does fix the issue.
> 
> -Nathan
> 
> On Thu, Feb 19, 2015 at 02:53:47PM -0700, Nathan Hjelm wrote:
> > 
> > I don't think that will fix this issue. In this case yama is not
> > installed and it appears PR_SET_PTRACER is not available. This forces
> > vader to assume that CMA can not be used when that isn't always the
> > case. I think it might be safe to assume that CMA is unrestricted here.
> > 
> > -Nathan
> > 
> > On Thu, Feb 19, 2015 at 04:35:00PM -0500, Aurélien Bouteiller wrote:
> > > Nathan, 
> > > 
> > > I think I already pushed a patch for this particular issue last month. I 
> > > do not know if it has been back ported to release yet. 
> > > 
> > > See 
> > > here:https://github.com/open-mpi/ompi/commit/ee3b0903164898750137d3b71a8f067e16521102
> > > 
> > > Aurelien 
> > > 
> > > --
> > >   ~~~ Aurélien Bouteiller, Ph.D. ~~~
> > >  ~ Research Scientist @ ICL ~
> > > The University of Tennessee, Innovative Computing Laboratory
> > > 1122 Volunteer Blvd, suite 309, Knoxville, TN 37996
> > > tel: +1 (865) 974-9375   fax: +1 (865) 974-8296
> > > https://icl.cs.utk.edu/~bouteill/
> > > 
> > > 
> > > 
> > > 
> > > > On 19 Feb. 2015, at 15:53, Nathan Hjelm wrote:
> > > > 
> > > > 
> > > > Great! I will add an MCA variable to force CMA and also enable it if 1)
> > > > no yama and 2) no PR_SET_PTRACER.
> > > > 
> > > > You might also look at using xpmem. You can find a version that supports
> > > > 3.x @ https://github.com/hjelmn/xpmem . It is a kernel module +
> > > > userspace library that can be used by vader as a single-copy mechanism.
> > > > 
> > > > In benchmarks it performs better than CMA but it may or may not perform
> > > > better with a real application.
> > > > 
> > > > See:
> > > > 
> > > > http://blogs.cisco.com/performance/the-vader-shared-memory-transport-in-open-mpi-now-featuring-3-flavors-of-zero-copy
> > > > 
> > > > -Nathan
> > > > 
> > > > On Thu, Feb 19, 2015 at 03:32:43PM -0500, Eric Chamberland wrote:
> > > >> On 02/19/2015 02:58 PM, Nathan Hjelm wrote:
> > > >>> On Thu, Feb 19, 2015 at 12:16:49PM -0500, Eric Chamberland wrote:
> > >  
> > >  On 02/19/2015 11:56 AM, Nathan Hjelm wrote:
> > > > 
> > > > If you have yama installed you can try:
> > >  
> > >  Nope, I do not have it installed... is it absolutely necessary? (and 
> > >  would
> > >  it change something when it fails when I am root?)
> > >  
> > >  Other question: In addition to "--with-cma" configure flag, do we 
> > >  have to
> > >  pass any options to "mpicc" when compiling/linking an mpi 
> > >  application to use
> > >  cma?
> > > >>> 
> > > >>> No. CMA should work out of the box. You appear to have a setup I 
> > > >>> haven't
> > > >>> yet tested. It doesn't have yama nor does it have the PR_SET_PTRACER
> > > >>> prctl. Its quite possible there are no restriction on ptrace in this
> > > >>> setup. Can you try changing the following line at
> > > >>> opal/mca/btl/vader/btl_vader_component.c:370 from:
> > > >>> 
> > > >>> bool cma_happy = false;
> > > >>> 
> > > >>> to
> > > >>> 
> > > >>> bool cma_happy = true;
> > > >>> 
> > > >> 
> > > >> ok! (as of the officiel release, this is line 386.)
> > > >> 
> > > >>> and let me know if that works. If it does I will update vader to allow
> > > >>> CMA in this configuration.
> > > >> 
> > > >> Yep!  It now works perfectly.  Testing with
> > > >> https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_bandwidth.c, on 
> > > >> my
> > > >> own computer (dual Xeon), I have this:
> > > >> 
> > > >> Without CMA:
> > > >> 
> > > >> ***Message size:  100 *** best  /  avg  / worst (MB/sec)
> > > >>   task pair:0 -1:8363.52 / 7946.77 / 5391.14
> > > >> 
> > > >> with CMA:
> > > >>   task pair:0 -1:9137.92 / 8955.98 / 7489.83
> > > >> 
> > > >> Great!
> > > >> 
> > > >> Now I have to bench my real application... ;-)
> > > >> 
> > > >> Thanks!
> > > >> 
> > > >> Eric
> > > >> 
> > > >> ___
> > > >> users mailing list
> > > >> us...@open-mpi.org
> > > >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > >> Link to this post: 
> > > >> http://www.open-mpi.org/community/lists/users/2015/02/26355.php
> > > > ___
> > > > users mailing list
> > > > us...@open-mpi.org
> > > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > > Link to this post: 
> > > > http://www.open-mpi.org/community/lists/users/2015/02/26356.php
> > > 
> > > 

Re: [OMPI users] Help on getting CMA works

2015-02-19 Thread Nathan Hjelm

Hmm, wait. Yes. Your change went in after 1.8.4 and has the same
effect. If yama isn't installed, it is safe to assume that the ptrace
scope is effectively 0. So, your patch does fix the issue.

-Nathan
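
(For context, the check being discussed here boils down to something like
the C sketch below. This is illustrative only, not the actual vader source;
it simply treats a missing yama ptrace_scope file as "yama not loaded",
i.e. an effective scope of 0, in which case CMA can be assumed to be
unrestricted.)

#include <stdio.h>

/* Illustrative sketch only: if the yama sysctl file is absent, yama is
 * not loaded and the ptrace scope is effectively 0 (classic, unrestricted
 * ptrace), so a CMA-style single copy can be attempted. */
static int effective_ptrace_scope(void)
{
    int scope = 0;                     /* no yama -> unrestricted */
    FILE *f = fopen("/proc/sys/kernel/yama/ptrace_scope", "r");

    if (f != NULL) {
        if (fscanf(f, "%d", &scope) != 1)
            scope = 0;                 /* unreadable -> assume unrestricted */
        fclose(f);
    }
    return scope;
}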

On Thu, Feb 19, 2015 at 02:53:47PM -0700, Nathan Hjelm wrote:
> 
> I don't think that will fix this issue. In this case yama is not
> installed and it appears PR_SET_PTRACER is not available. This forces
> vader to assume that CMA can not be used when that isn't always the
> case. I think it might be safe to assume that CMA is unrestricted here.
> 
> -Nathan
> 
> On Thu, Feb 19, 2015 at 04:35:00PM -0500, Aurélien Bouteiller wrote:
> > Nathan, 
> > 
> > I think I already pushed a patch for this particular issue last month. I do 
> > not know if it has been back ported to release yet. 
> > 
> > See 
> > here:https://github.com/open-mpi/ompi/commit/ee3b0903164898750137d3b71a8f067e16521102
> > 
> > Aurelien 
> > 
> > --
> >   ~~~ Aurélien Bouteiller, Ph.D. ~~~
> >  ~ Research Scientist @ ICL ~
> > The University of Tennessee, Innovative Computing Laboratory
> > 1122 Volunteer Blvd, suite 309, Knoxville, TN 37996
> > tel: +1 (865) 974-9375   fax: +1 (865) 974-8296
> > https://icl.cs.utk.edu/~bouteill/
> > 
> > 
> > 
> > 
> > > On 19 Feb 2015, at 15:53, Nathan Hjelm wrote:
> > > 
> > > 
> > > Great! I will add an MCA variable to force CMA and also enable it if 1)
> > > no yama and 2) no PR_SET_PTRACER.
> > > 
> > > You might also look at using xpmem. You can find a version that supports
> > > 3.x @ https://github.com/hjelmn/xpmem . It is a kernel module +
> > > userspace library that can be used by vader as a single-copy mechanism.
> > > 
> > > In benchmarks it performs better than CMA but it may or may not perform
> > > better with a real application.
> > > 
> > > See:
> > > 
> > > http://blogs.cisco.com/performance/the-vader-shared-memory-transport-in-open-mpi-now-featuring-3-flavors-of-zero-copy
> > > 
> > > -Nathan
> > > 
> > > On Thu, Feb 19, 2015 at 03:32:43PM -0500, Eric Chamberland wrote:
> > >> On 02/19/2015 02:58 PM, Nathan Hjelm wrote:
> > >>> On Thu, Feb 19, 2015 at 12:16:49PM -0500, Eric Chamberland wrote:
> >  
> >  On 02/19/2015 11:56 AM, Nathan Hjelm wrote:
> > > 
> > > If you have yama installed you can try:
> >  
> >  Nope, I do not have it installed... is it absolutely necessary? (and 
> >  would
> >  it change something when it fails when I am root?)
> >  
> >  Other question: In addition to "--with-cma" configure flag, do we have 
> >  to
> >  pass any options to "mpicc" when compiling/linking an mpi application 
> >  to use
> >  cma?
> > >>> 
> > >>> No. CMA should work out of the box. You appear to have a setup I haven't
> > >>> yet tested. It doesn't have yama nor does it have the PR_SET_PTRACER
> > >>> prctl. It's quite possible there are no restrictions on ptrace in this
> > >>> setup. Can you try changing the following line at
> > >>> opal/mca/btl/vader/btl_vader_component.c:370 from:
> > >>> 
> > >>> bool cma_happy = false;
> > >>> 
> > >>> to
> > >>> 
> > >>> bool cma_happy = true;
> > >>> 
> > >> 
> > >> ok! (as of the official release, this is line 386.)
> > >> 
> > >>> and let me know if that works. If it does I will update vader to allow
> > >>> CMA in this configuration.
> > >> 
> > >> Yep!  It now works perfectly.  Testing with
> > >> https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_bandwidth.c, on my
> > >> own computer (dual Xeon), I have this:
> > >> 
> > >> Without CMA:
> > >> 
> > >> ***Message size:  100 *** best  /  avg  / worst (MB/sec)
> > >>   task pair:0 -1:8363.52 / 7946.77 / 5391.14
> > >> 
> > >> with CMA:
> > >>   task pair:0 -1:9137.92 / 8955.98 / 7489.83
> > >> 
> > >> Great!
> > >> 
> > >> Now I have to bench my real application... ;-)
> > >> 
> > >> Thanks!
> > >> 
> > >> Eric
> > >> 
> > >> ___
> > >> users mailing list
> > >> us...@open-mpi.org
> > >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >> Link to this post: 
> > >> http://www.open-mpi.org/community/lists/users/2015/02/26355.php
> > > ___
> > > users mailing list
> > > us...@open-mpi.org
> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > Link to this post: 
> > > http://www.open-mpi.org/community/lists/users/2015/02/26356.php
> > 
> > ___
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/users/2015/02/26358.php



> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/02/26359.php



Re: [OMPI users] Help on getting CMA works

2015-02-19 Thread Nathan Hjelm

I don't think that will fix this issue. In this case yama is not
installed and it appears PR_SET_PTRACER is not available. This forces
vader to assume that CMA can not be used when that isn't always the
case. I think it might be safe to assume that CMA is unrestricted here.

-Nathan

On Thu, Feb 19, 2015 at 04:35:00PM -0500, Aurélien Bouteiller wrote:
> Nathan, 
> 
> I think I already pushed a patch for this particular issue last month. I do 
> not know if it has been back ported to release yet. 
> 
> See 
> here:https://github.com/open-mpi/ompi/commit/ee3b0903164898750137d3b71a8f067e16521102
> 
> Aurelien 
> 
> --
>   ~~~ Aurélien Bouteiller, Ph.D. ~~~
>  ~ Research Scientist @ ICL ~
> The University of Tennessee, Innovative Computing Laboratory
> 1122 Volunteer Blvd, suite 309, Knoxville, TN 37996
> tel: +1 (865) 974-9375   fax: +1 (865) 974-8296
> https://icl.cs.utk.edu/~bouteill/
> 
> 
> 
> 
> > On 19 Feb 2015, at 15:53, Nathan Hjelm wrote:
> > 
> > 
> > Great! I will add an MCA variable to force CMA and also enable it if 1)
> > no yama and 2) no PR_SET_PTRACER.
> > 
> > You might also look at using xpmem. You can find a version that supports
> > 3.x @ https://github.com/hjelmn/xpmem . It is a kernel module +
> > userspace library that can be used by vader as a single-copy mechanism.
> > 
> > In benchmarks it performs better than CMA but it may or may not perform
> > better with a real application.
> > 
> > See:
> > 
> > http://blogs.cisco.com/performance/the-vader-shared-memory-transport-in-open-mpi-now-featuring-3-flavors-of-zero-copy
> > 
> > -Nathan
> > 
> > On Thu, Feb 19, 2015 at 03:32:43PM -0500, Eric Chamberland wrote:
> >> On 02/19/2015 02:58 PM, Nathan Hjelm wrote:
> >>> On Thu, Feb 19, 2015 at 12:16:49PM -0500, Eric Chamberland wrote:
>  
>  On 02/19/2015 11:56 AM, Nathan Hjelm wrote:
> > 
> > If you have yama installed you can try:
>  
>  Nope, I do not have it installed... is it absolutely necessary? (and 
>  would
>  it change something when it fails when I am root?)
>  
>  Other question: In addition to "--with-cma" configure flag, do we have to
>  pass any options to "mpicc" when compiling/linking an mpi application to 
>  use
>  cma?
> >>> 
> >>> No. CMA should work out of the box. You appear to have a setup I haven't
> >>> yet tested. It doesn't have yama nor does it have the PR_SET_PTRACER
> >>> prctl. It's quite possible there are no restrictions on ptrace in this
> >>> setup. Can you try changing the following line at
> >>> opal/mca/btl/vader/btl_vader_component.c:370 from:
> >>> 
> >>> bool cma_happy = false;
> >>> 
> >>> to
> >>> 
> >>> bool cma_happy = true;
> >>> 
> >> 
> >> ok! (as of the official release, this is line 386.)
> >> 
> >>> and let me know if that works. If it does I will update vader to allow
> >>> CMA in this configuration.
> >> 
> >> Yep!  It now works perfectly.  Testing with
> >> https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_bandwidth.c, on my
> >> own computer (dual Xeon), I have this:
> >> 
> >> Without CMA:
> >> 
> >> ***Message size:  100 *** best  /  avg  / worst (MB/sec)
> >>   task pair:0 -1:8363.52 / 7946.77 / 5391.14
> >> 
> >> with CMA:
> >>   task pair:0 -1:9137.92 / 8955.98 / 7489.83
> >> 
> >> Great!
> >> 
> >> Now I have to bench my real application... ;-)
> >> 
> >> Thanks!
> >> 
> >> Eric
> >> 
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >> Link to this post: 
> >> http://www.open-mpi.org/community/lists/users/2015/02/26355.php
> > ___
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/users/2015/02/26356.php
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/02/26358.php




Re: [OMPI users] Help on getting CMA works

2015-02-19 Thread Aurélien Bouteiller
Nathan, 

I think I already pushed a patch for this particular issue last month. I do not 
know if it has been back ported to release yet. 

See 
here:https://github.com/open-mpi/ompi/commit/ee3b0903164898750137d3b71a8f067e16521102

Aurelien 

--
  ~~~ Aurélien Bouteiller, Ph.D. ~~~
 ~ Research Scientist @ ICL ~
The University of Tennessee, Innovative Computing Laboratory
1122 Volunteer Blvd, suite 309, Knoxville, TN 37996
tel: +1 (865) 974-9375   fax: +1 (865) 974-8296
https://icl.cs.utk.edu/~bouteill/




> On 19 Feb 2015, at 15:53, Nathan Hjelm wrote:
> 
> 
> Great! I will add an MCA variable to force CMA and also enable it if 1)
> no yama and 2) no PR_SET_PTRACER.
> 
> You might also look at using xpmem. You can find a version that supports
> 3.x @ https://github.com/hjelmn/xpmem . It is a kernel module +
> userspace library that can be used by vader as a single-copy mechanism.
> 
> In benchmarks it performs better than CMA but it may or may not perform
> better with a real application.
> 
> See:
> 
> http://blogs.cisco.com/performance/the-vader-shared-memory-transport-in-open-mpi-now-featuring-3-flavors-of-zero-copy
> 
> -Nathan
> 
> On Thu, Feb 19, 2015 at 03:32:43PM -0500, Eric Chamberland wrote:
>> On 02/19/2015 02:58 PM, Nathan Hjelm wrote:
>>> On Thu, Feb 19, 2015 at 12:16:49PM -0500, Eric Chamberland wrote:
 
 On 02/19/2015 11:56 AM, Nathan Hjelm wrote:
> 
> If you have yama installed you can try:
 
 Nope, I do not have it installed... is it absolutely necessary? (and would
 it change something when it fails when I am root?)
 
 Other question: In addition to "--with-cma" configure flag, do we have to
 pass any options to "mpicc" when compiling/linking an mpi application to 
 use
 cma?
>>> 
>>> No. CMA should work out of the box. You appear to have a setup I haven't
>>> yet tested. It doesn't have yama nor does it have the PR_SET_PTRACER
> >>> prctl. It's quite possible there are no restrictions on ptrace in this
>>> setup. Can you try changing the following line at
>>> opal/mca/btl/vader/btl_vader_component.c:370 from:
>>> 
>>> bool cma_happy = false;
>>> 
>>> to
>>> 
>>> bool cma_happy = true;
>>> 
>> 
> >> ok! (as of the official release, this is line 386.)
>> 
>>> and let me know if that works. If it does I will update vader to allow
>>> CMA in this configuration.
>> 
>> Yep!  It now works perfectly.  Testing with
>> https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_bandwidth.c, on my
>> own computer (dual Xeon), I have this:
>> 
>> Without CMA:
>> 
>> ***Message size:  100 *** best  /  avg  / worst (MB/sec)
>>   task pair:0 -1:8363.52 / 7946.77 / 5391.14
>> 
>> with CMA:
>>   task pair:0 -1:9137.92 / 8955.98 / 7489.83
>> 
>> Great!
>> 
>> Now I have to bench my real application... ;-)
>> 
>> Thanks!
>> 
>> Eric
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2015/02/26355.php
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/02/26356.php



Re: [OMPI users] Help on getting CMA works

2015-02-19 Thread Eric Chamberland

On 02/19/2015 03:53 PM, Nathan Hjelm wrote:


Great! I will add an MCA variable to force CMA and also enable it if 1)
no yama and 2) no PR_SET_PTRACER.


cool, thanks again!



You might also look at using xpmem. You can find a version that supports
3.x @ https://github.com/hjelmn/xpmem . It is a kernel module +
userspace library that can be used by vader as a single-copy mechanism.

In benchmarks it performs better than CMA but it may or may not perform
better with a real application.

See:

http://blogs.cisco.com/performance/the-vader-shared-memory-transport-in-open-mpi-now-featuring-3-flavors-of-zero-copy


ok, I will look (and relay the information to colleagues).

Thanks,

Eric



Re: [OMPI users] Help on getting CMA works

2015-02-19 Thread Nathan Hjelm

Great! I will add an MCA variable to force CMA and also enable it if 1)
no yama and 2) no PR_SET_PTRACER.

You might also look at using xpmem. You can find a version that supports
3.x @ https://github.com/hjelmn/xpmem . It is a kernel module +
userspace library that can be used by vader as a single-copy mechanism.

In benchmarks it performs better than CMA but it may or may not perform
better with a real application.

See:

http://blogs.cisco.com/performance/the-vader-shared-memory-transport-in-open-mpi-now-featuring-3-flavors-of-zero-copy

-Nathan
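
(The PR_SET_PTRACER prctl mentioned in 2) is a per-process opt-in: a rank
can declare which process, or any process, may ptrace it, and therefore
CMA-attach to it, even when yama is otherwise restrictive. A minimal
sketch, assuming Linux; the constants may be missing from older headers,
hence the fallback defines, and the helper name is made up:)

#include <sys/prctl.h>

#ifndef PR_SET_PTRACER
#define PR_SET_PTRACER 0x59616d61              /* "Yama" */
#endif
#ifndef PR_SET_PTRACER_ANY
#define PR_SET_PTRACER_ANY ((unsigned long) -1)
#endif

/* Hypothetical helper: allow any process to attach to (and hence copy
 * from) this one.  On kernels without yama the call just fails with
 * EINVAL, which is harmless to ignore here. */
static void allow_any_ptracer(void)
{
    (void) prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY, 0, 0, 0);
}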

On Thu, Feb 19, 2015 at 03:32:43PM -0500, Eric Chamberland wrote:
> On 02/19/2015 02:58 PM, Nathan Hjelm wrote:
> >On Thu, Feb 19, 2015 at 12:16:49PM -0500, Eric Chamberland wrote:
> >>
> >>On 02/19/2015 11:56 AM, Nathan Hjelm wrote:
> >>>
> >>>If you have yama installed you can try:
> >>
> >>Nope, I do not have it installed... is it absolutely necessary? (and would
> >>it change something when it fails when I am root?)
> >>
> >>Other question: In addition to "--with-cma" configure flag, do we have to
> >>pass any options to "mpicc" when compiling/linking an mpi application to use
> >>cma?
> >
> >No. CMA should work out of the box. You appear to have a setup I haven't
> >yet tested. It doesn't have yama nor does it have the PR_SET_PTRACER
> >prctl. It's quite possible there are no restrictions on ptrace in this
> >setup. Can you try changing the following line at
> >opal/mca/btl/vader/btl_vader_component.c:370 from:
> >
> >bool cma_happy = false;
> >
> >to
> >
> >bool cma_happy = true;
> >
> 
> ok! (as of the official release, this is line 386.)
> 
> >and let me know if that works. If it does I will update vader to allow
> >CMA in this configuration.
> 
> Yep!  It now works perfectly.  Testing with
> https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_bandwidth.c, on my
> own computer (dual Xeon), I have this:
> 
> Without CMA:
> 
> ***Message size:  100 *** best  /  avg  / worst (MB/sec)
>task pair:0 -1:8363.52 / 7946.77 / 5391.14
> 
> with CMA:
>task pair:0 -1:9137.92 / 8955.98 / 7489.83
> 
> Great!
> 
> Now I have to bench my real application... ;-)
> 
> Thanks!
> 
> Eric
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/02/26355.php




Re: [OMPI users] Help on getting CMA works

2015-02-19 Thread Eric Chamberland

On 02/19/2015 02:58 PM, Nathan Hjelm wrote:

On Thu, Feb 19, 2015 at 12:16:49PM -0500, Eric Chamberland wrote:


On 02/19/2015 11:56 AM, Nathan Hjelm wrote:


If you have yama installed you can try:


Nope, I do not have it installed... is it absolutely necessary? (and would
it change something when it fails when I am root?)

Other question: In addition to "--with-cma" configure flag, do we have to
pass any options to "mpicc" when compiling/linking an mpi application to use
cma?


No. CMA should work out of the box. You appear to have a setup I haven't
yet tested. It doesn't have yama nor does it have the PR_SET_PTRACER
prctl. It's quite possible there are no restrictions on ptrace in this
setup. Can you try changing the following line at
opal/mca/btl/vader/btl_vader_component.c:370 from:

bool cma_happy = false;

to

bool cma_happy = true;



ok! (as of the official release, this is line 386.)


and let me know if that works. If it does I will update vader to allow
CMA in this configuration.


Yep!  It now works perfectly.  Testing with 
https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_bandwidth.c, on 
my own computer (dual Xeon), I have this:


Without CMA:

***Message size:  100 *** best  /  avg  / worst (MB/sec)
   task pair:0 -1:8363.52 / 7946.77 / 5391.14

with CMA:
   task pair:0 -1:9137.92 / 8955.98 / 7489.83

Great!

Now I have to bench my real application... ;-)

Thanks!

Eric



Re: [OMPI users] Help on getting CMA works

2015-02-19 Thread Nathan Hjelm
On Thu, Feb 19, 2015 at 12:16:49PM -0500, Eric Chamberland wrote:
> 
> On 02/19/2015 11:56 AM, Nathan Hjelm wrote:
> >
> >If you have yama installed you can try:
> 
> Nope, I do not have it installed... is it absolutely necessary? (and would
> it change something when it fails when I am root?)
> 
> Other question: In addition to "--with-cma" configure flag, do we have to
> pass any options to "mpicc" when compiling/linking an mpi application to use
> cma?

No. CMA should work out of the box. You appear to have a setup I haven't
yet tested. It doesn't have yama nor does it have the PR_SET_PTRACER
prctl. It's quite possible there are no restrictions on ptrace in this
setup. Can you try changing the following line at
opal/mca/btl/vader/btl_vader_component.c:370 from:

bool cma_happy = false;

to

bool cma_happy = true;

and let me know if that works. If it does I will update vader to allow
CMA in this configuration.

-Nathan
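
(For anyone wondering what "CMA" actually is: it is Linux cross-memory
attach, i.e. the process_vm_readv()/process_vm_writev() system calls,
which let one rank copy data straight out of a peer rank's address space
with a single, kernel-mediated copy. The sketch below is illustrative
only; the peer pid and remote address are hypothetical and would normally
be exchanged through shared memory. This is the call that gets denied
under restrictive ptrace settings, which is what the warning quoted in
this thread is about.)

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

/* Minimal CMA sketch: copy len bytes from remote_addr in process pid
 * into local_buf.  Returns the number of bytes read, or -1 on error
 * (e.g. EPERM when ptrace restrictions forbid the attach). */
static ssize_t cma_read(pid_t pid, void *remote_addr,
                        void *local_buf, size_t len)
{
    struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}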




Re: [OMPI users] Help on getting CMA works

2015-02-19 Thread Eric Chamberland


On 02/19/2015 11:56 AM, Nathan Hjelm wrote:


If you have yama installed you can try:


Nope, I do not have it installed... is it absolutely necessary? (and 
would it change something when it fails when I am root?)


Other question: In addition to "--with-cma" configure flag, do we have 
to pass any options to "mpicc" when compiling/linking an mpi application 
to use cma?


Thanks,

Eric



echo 1 > /proc/sys/kernel/yama/ptrace_scope

as root.

-Nathan

On Thu, Feb 19, 2015 at 11:06:09AM -0500, Eric Chamberland wrote:

By the way,

I have tried two other things:

#1- I launched it as root:

mpiexec --mca mca_btl_vader_single_copy_mechanism cma --allow-run-as-root
-np 2 ./hw

#2- Found this 
(http://askubuntu.com/questions/146160/what-is-the-ptrace-scope-workaround-for-wine-programs-and-are-there-any-risks)
and tried this:

sudo setcap cap_sys_ptrace=eip /tmp/hw

On both RedHat 6.5 and OpenSuse 12.3 I still get the same error message!!!
:-/

Sorry, I am not a kernel expert...

What's wrong?

Thanks,

Eric

On 02/18/2015 04:48 PM, Éric Chamberland wrote:


On 2015-02-18 15:14, Nathan Hjelm wrote:

I recommend using vader for CMA. It has code to get around the ptrace
setting. Run with mca_btl_vader_single_copy_mechanism cma (should be the
default).

Ok, I tried it, but it gives exactly the same error message!

Eric


-Nathan

On Wed, Feb 18, 2015 at 02:56:01PM -0500, Eric Chamberland wrote:

Hi,

I have configured with "--with-cma" on 2 different OSes (RedHat 6.6 and
OpenSuse 12.3), but in both cases, I have the following error when
launching
a simple mpi_hello_world.c example:

/opt/openmpi-1.8.4_cma/bin/mpiexec --mca btl_sm_use_cma 1 -np 2 /tmp/hw
--

WARNING: Linux kernel CMA support was requested via the
btl_vader_single_copy_mechanism MCA variable, but CMA support is
not available due to restrictive ptrace settings.

The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.

   Local host: compile
--

Hello world from process 0 of 2
Hello world from process 1 of 2
[compile:23874] 1 more process has sent help message
help-btl-vader.txt /
cma-permission-denied
[compile:23874] Set MCA parameter "orte_base_help_aggregate" to 0 to
see all
help / error messages

After I googled the subject, it seems there is a kernel parameter to
modify,
but I can't find it for OpenSuse 12.3 or RedHat 6.6...

Here is the "config.log" issued from RedHat 6.6...

http://www.giref.ulaval.ca/~ericc/ompi_bug/config.184_cma.gz

Thanks,

Eric
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/02/26339.php


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/02/26342.php


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/02/26351.php




Re: [OMPI users] Help on getting CMA works

2015-02-19 Thread Nathan Hjelm

If you have yama installed you can try:

echo 1 > /proc/sys/kernel/yama/ptrace_scope

as root.

-Nathan

On Thu, Feb 19, 2015 at 11:06:09AM -0500, Eric Chamberland wrote:
> By the way,
> 
> I have tried two other things:
> 
> #1- I launched it as root:
> 
> mpiexec --mca mca_btl_vader_single_copy_mechanism cma --allow-run-as-root
> -np 2 ./hw
> 
> #2- Found this 
> (http://askubuntu.com/questions/146160/what-is-the-ptrace-scope-workaround-for-wine-programs-and-are-there-any-risks)
> and tried this:
> 
> sudo setcap cap_sys_ptrace=eip /tmp/hw
> 
> On both RedHat 6.5 and OpenSuse 12.3 I still get the same error message!!!
> :-/
> 
> Sorry, I am not a kernel expert...
> 
> What's wrong?
> 
> Thanks,
> 
> Eric
> 
> On 02/18/2015 04:48 PM, Éric Chamberland wrote:
> >
> >On 2015-02-18 15:14, Nathan Hjelm wrote:
> >>I recommend using vader for CMA. It has code to get around the ptrace
> >>setting. Run with mca_btl_vader_single_copy_mechanism cma (should be the
> >>default).
> >Ok, I tried it, but it gives exactly the same error message!
> >
> >Eric
> >
> >>-Nathan
> >>
> >>On Wed, Feb 18, 2015 at 02:56:01PM -0500, Eric Chamberland wrote:
> >>>Hi,
> >>>
> >>>I have configured with "--with-cma" on 2 different OSes (RedHat 6.6 and
> >>>OpenSuse 12.3), but in both cases, I have the following error when
> >>>launching
> >>>a simple mpi_hello_world.c example:
> >>>
> >>>/opt/openmpi-1.8.4_cma/bin/mpiexec --mca btl_sm_use_cma 1 -np 2 /tmp/hw
> >>>--
> >>>
> >>>WARNING: Linux kernel CMA support was requested via the
> >>>btl_vader_single_copy_mechanism MCA variable, but CMA support is
> >>>not available due to restrictive ptrace settings.
> >>>
> >>>The vader shared memory BTL will fall back on another single-copy
> >>>mechanism if one is available. This may result in lower performance.
> >>>
> >>>   Local host: compile
> >>>--
> >>>
> >>>Hello world from process 0 of 2
> >>>Hello world from process 1 of 2
> >>>[compile:23874] 1 more process has sent help message
> >>>help-btl-vader.txt /
> >>>cma-permission-denied
> >>>[compile:23874] Set MCA parameter "orte_base_help_aggregate" to 0 to
> >>>see all
> >>>help / error messages
> >>>
> >>>After I googled the subject, it seems there is a kernel parameter to
> >>>modify,
> >>>but I can't find it for OpenSuse 12.3 or RedHat 6.6...
> >>>
> >>>Here is the "config.log" issued from RedHat 6.6...
> >>>
> >>>http://www.giref.ulaval.ca/~ericc/ompi_bug/config.184_cma.gz
> >>>
> >>>Thanks,
> >>>
> >>>Eric
> >>>___
> >>>users mailing list
> >>>us...@open-mpi.org
> >>>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>Link to this post:
> >>>http://www.open-mpi.org/community/lists/users/2015/02/26339.php
> >
> >___
> >users mailing list
> >us...@open-mpi.org
> >Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >Link to this post:
> >http://www.open-mpi.org/community/lists/users/2015/02/26342.php
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/02/26351.php



