Re: [OMPI users] MPI group and stuck in communication

2018-08-11 Thread Jeff Squyres (jsquyres) via users
On Aug 10, 2018, at 6:27 PM, Diego Avesani  wrote:
> 
> The question is:
> Is it possible to have a barrier for all CPUs even though they belong to 
> different groups?
> If the answer is yes, I will go into more detail.

By "CPUs", I assume you mean "MPI processes", right?  (i.e., not threads inside 
an individual MPI process)

Again, this is not quite a specific-enough question.  Do the different groups 
(and I assume you really mean communicators) overlap?  Are they disjoint?  Is 
there a reason MPI_COMM_WORLD is not sufficient?

There are two typical ways to barrier a set of MPI processes.

1. Write your own algorithm to do sends / receives -- and possibly even 
collectives -- to ensure that no process leaves the barrier before every 
process enters the barrier.

2. Make sure that you have a communicator that includes exactly the set of 
processes that you want (and if you don't have a communicator fitting this 
description, make one), and then call MPI_BARRIER on it.

#2 is typically the easier solution.
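For illustration, a minimal C sketch of option #2 (the even/odd split is just a
hypothetical grouping; building the communicator via MPI_Comm_group /
MPI_Group_incl / MPI_Comm_create works just as well):

#include <mpi.h>

int main(int argc, char **argv)
{
    int world_rank;
    MPI_Comm subcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Split MPI_COMM_WORLD into two disjoint communicators (hypothetical
     * grouping by rank parity); each process lands in exactly one of them. */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &subcomm);

    /* Synchronizes only the processes that share 'subcomm',
     * not all of MPI_COMM_WORLD. */
    MPI_Barrier(subcomm);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}

Every process in 'subcomm' blocks in MPI_Barrier until all members of that
communicator have entered it; processes in the other communicator are unaffected.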

-- 
Jeff Squyres
jsquy...@cisco.com



[OMPI users] cannot run openmpi 2.1

2018-08-11 Thread Kapetanakis Giannis

Hi,

I'm struggling to get 2.1.x to work on our HPC cluster.

Versions 1.8.8 and 3.x work fine.

With 2.1.3 and 2.1.4 I get errors and segmentation faults. The builds are with
InfiniBand and Slurm support.

mpirun works fine locally. Any help debugging this?

[node39:20090] [[50526,1],2] usock_peer_recv_connect_ack: received unexpected process identifier [[50526,0],0] from [[50526,0],1]
[node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20088] [[50526,1],0] usock_peer_recv_connect_ack: received unexpected process identifier [[50526,0],0] from [[50526,0],1]
[node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20096] [[50526,1],8] usock_peer_recv_connect_ack: received unexpected process identifier [[50526,0],0] from [[50526,0],1]
[node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],8] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],8] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],8] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],8] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],6] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],8] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],6] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20094] [[50526,1],6] usock_peer_recv_connect_ack: received unexpected process identifier [[50526,0],0] from [[50526,0],1]
[node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],8] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],6] mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20097] [[50526,1],9] usock_peer_recv_connect_ack: received unexpected process identifier [[50526,0],0] from [[50526,0],1]
[node39:20092] [[50526,1],4] usock_peer_recv_connect_ack: received unexpected process identifier [[50526,0],0] from [[50526,0],1]



Part of the debug output:

[node39:20515] mca:oob:select: Inserting component
[node39:20515] mca:oob:select: Found 3 active transports
[node39:20515] [[50428,1],9]: set_addr to uri 3304849408.1;usock;tcp://192.168.20.113,10.1.7.69:37147;ud://181895.60.1
[node39:20515] [[50428,1],9]:set_addr checking if peer [[50428,0],1] is reachable via component usock
[node39:20515] [[50428,1],9]:[oob_usock_component.c:349] connect to [[50428,0],1]
[node39:20515] [[50428,1],9]: peer [[50428,0],1] is reachable via component usock
[node39:20515] [[50428,1],9]:set_addr checking if peer [[50428,0],1] is reachable via component tcp
[node39:20515] [[50428,1],9] oob:tcp: ignoring address usock
[node39:20515] [[50428,1],9] oob:tcp: working peer [[50428,0],1] address tcp://192.168.20.113,10.1.7.69:37147
[node39:20515] [[50428,1],9] PASSING ADDR 192.168.20.113 TO MODULE
[node39:20515] [[50428,1],9]:tcp set addr for peer [[50428,0],1]
[node39:20515] [[50428,1],9] PASSING ADDR 10.1.7.69 TO MODULE
[node39:20515] [[50428,1],9]:tcp set addr for peer [[50428,0],1]
[node39:20515] [[50428,1],9] oob:tcp: ignoring address ud://181895.60.1
[node39:20515] [[50428,1],9]: peer [[50428,0],1] is reachable via component tcp
[node39:20515] [[50428,1],

Re: [OMPI users] cannot run openmpi 2.1

2018-08-11 Thread Ralph H Castain
Put "oob=^usock” in your default mca param file, or add OMPI_MCA_oob=^usock to 
your environment
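
For reference (typical install layout; paths may differ on your system): the
default MCA param file is usually <prefix>/etc/openmpi-mca-params.conf or
$HOME/.openmpi/mca-params.conf, and the line to add there is "oob = ^usock".
The leading ^ excludes the listed component, i.e. "use any oob transport except
usock". The environment-variable form ("export OMPI_MCA_oob=^usock") has to be
set in the environment that launches mpirun, e.g. in the Slurm job script.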

> On Aug 11, 2018, at 5:54 AM, Kapetanakis Giannis wrote:
> 
> Hi,
> 
> I'm struggling to get 2.1.x to work on our HPC cluster.
> 
> Versions 1.8.8 and 3.x work fine.
> 
> With 2.1.3 and 2.1.4 I get errors and segmentation faults. The builds are with 
> InfiniBand and Slurm support.
> mpirun works fine locally. Any help debugging this?

Re: [OMPI users] cannot run openmpi 2.1

2018-08-11 Thread Kapetanakis Giannis

On 11/08/18 16:39, Ralph H Castain wrote:
> Put "oob=^usock" in your default mca param file, or add OMPI_MCA_oob=^usock to 
> your environment.

Thank you very much, that did the trick.

Could you please explain this? I cannot find any documentation about it.

G

Re: [OMPI users] know which CPU has the maximum value

2018-08-11 Thread Jeff Hammond
The MPI Forum email lists and GitHub are not secret.  Please feel free to
follow the GitHub project linked below and/or sign up for the MPI Forum
email lists if you are interested in the evolution of the MPI standard.

What MPI Forum members should avoid is creating FUD about MPI by
speculating about the removal of useful features.  There is plenty of time
to have those debates in both public and private after formal proposals are
made.

Jeff

On Fri, Aug 10, 2018 at 11:11 AM, Gus Correa  wrote:

> Hmmm ... no, no, no!
> Keep it secret why!?!?
>
> Diego Avesani's questions and questioning
> may have saved us users from getting a
> useful feature deprecated in the name of code elegance.
> Code elegance may be very cherished by developers,
> but it is not necessarily helpful to users,
> especially if it strips off useful functionality.
>
> My cheap 2 cents from a user.
> Gus Correa
>
>
> On 08/10/2018 01:52 PM, Jeff Hammond wrote:
>
>> This thread is a perfect illustration of why MPI Forum participants
>> should not flippantly discuss feature deprecation in discussion with
>> users.  Users who are not familiar with the MPI Forum process are not able
>> to evaluate whether such proposals are serious or have any hope of
>> succeeding and therefore may be unnecessarily worried about their code
>> breaking in the future, when that future is 5 to infinity years away.
>>
>> If someone wants to deprecate MPI_{MIN,MAX}LOC, they should start that
>> discussion on https://github.com/mpi-forum/mpi-issues/issues or
>> https://lists.mpi-forum.org/mailman/listinfo/mpiwg-coll.
>>
>> Jeff
>>
>> On Fri, Aug 10, 2018 at 10:27 AM, Jeff Squyres (jsquyres) via users
>> <users@lists.open-mpi.org> wrote:
>>
>> It is unlikely that MPI_MINLOC and MPI_MAXLOC will go away any time
>> soon.
>>
>> As far as I know, Nathan hasn't advanced a proposal to kill them in
>> MPI-4, meaning that they'll likely continue to be in MPI for at
>> least another 10 years.  :-)
>>
>> (And even if they did get killed in MPI-4, implementations like Open
>> MPI would continue to keep them in our implementations for quite a
>> while -- i.e., years)
>>
>>
>>  > On Aug 10, 2018, at 1:13 PM, Diego Avesani <diego.aves...@gmail.com> wrote:
>>  >
>>  > I agree about the names; it is very similar to MIN_LOC and
>> MAX_LOC in Fortran 90.
>>  > However, I find it difficult to define an algorithm able to do the
>> same thing.
>>  >
>>  >
>>  >
>>  > Diego
>>  >
>>  >
>>  > On 10 August 2018 at 19:03, Nathan Hjelm via users <users@lists.open-mpi.org> wrote:
>>  > They do not fit with the rest of the predefined operations (which
>> operate on a single basic type) and can easily be implemented as
>> user defined operations and get the same performance. Add to that
>> the fixed number of tuple types and the fact that some of them are
>> non-contiguous (MPI_SHORT_INT) plus the terrible names. If I could
>> kill them in MPI-4 I would.
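
As a minimal sketch (not from this thread) of the user-defined-operation route
Nathan describes -- the struct layout, the sample value, and the names are just
placeholders:

#include <mpi.h>
#include <stddef.h>
#include <stdio.h>

typedef struct { double value; int rank; } valrank_t;

/* User-defined combine function: keep the entry with the larger value. */
static void my_maxloc(void *in, void *inout, int *len, MPI_Datatype *dtype)
{
    valrank_t *a = (valrank_t *)in, *b = (valrank_t *)inout;
    for (int i = 0; i < *len; i++)
        if (a[i].value > b[i].value) b[i] = a[i];
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Describe the struct to MPI so one {double,int} pair is one element. */
    int blocks[2] = {1, 1};
    MPI_Aint displs[2] = {offsetof(valrank_t, value), offsetof(valrank_t, rank)};
    MPI_Datatype types[2] = {MPI_DOUBLE, MPI_INT}, pairtype;
    MPI_Type_create_struct(2, blocks, displs, types, &pairtype);
    MPI_Type_commit(&pairtype);

    MPI_Op maxloc_op;
    MPI_Op_create(my_maxloc, 1, &maxloc_op);   /* 1 = commutative */

    valrank_t local = { (double)((rank * 37) % 11), rank }, global;
    MPI_Allreduce(&local, &global, 1, pairtype, maxloc_op, MPI_COMM_WORLD);

    if (rank == 0)
        printf("max value %g is on rank %d\n", global.value, global.rank);

    MPI_Op_free(&maxloc_op);
    MPI_Type_free(&pairtype);
    MPI_Finalize();
    return 0;
}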
>>  >
>>  > On Aug 10, 2018, at 9:47 AM, Diego Avesani <diego.aves...@gmail.com> wrote:
>>  >
>>  >> Dear all,
>>  >> I have just implemented MAXLOC; why should it go away?
>>  >> It seems to work pretty well.
>>  >>
>>  >> thanks
>>  >>
>>  >> Diego
>>  >>
>>  >>
>>  >> On 10 August 2018 at 17:39, Nathan Hjelm via users <users@lists.open-mpi.org> wrote:
>>  >> The problem is that minloc and maxloc need to go away. Better to use
>> a custom op.
>>  >>
>>  >> On Aug 10, 2018, at 9:36 AM, George Bosilca wrote:
>>  >>
>>  >>> You will need to create a special variable that holds 2
>> entries, one for the max operation (with whatever type you need) and
>> an int for the rank of the process. The MAXLOC is described on the
>> OMPI man page [1] and you can find an example on how to use it on
>> the MPI Forum [2].
>>  >>>
>>  >>> George.
>>  >>>
>>  >>>
>>  >>> [1] https://www.open-mpi.org/doc/v2.0/man3/MPI_Reduce.3.php
>>  >>> [2] https://www.mpi-forum.org/docs/mpi-1.1/mpi-11-html/node79.html
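
A minimal sketch of what George describes, using the predefined MPI_DOUBLE_INT
pair type with MPI_MAXLOC (the per-process value here is an arbitrary
placeholder):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The 2-entry variable: the value to be maximized plus the owner's rank. */
    struct { double value; int rank; } local, global;
    local.value = (double)((rank * 7) % 5);   /* placeholder per-process value */
    local.rank  = rank;

    /* MPI_DOUBLE_INT is the predefined {double,int} pair used by MAXLOC/MINLOC. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MAXLOC, MPI_COMM_WORLD);

    if (rank == 0)
        printf("maximum %g was contributed by rank %d\n", global.value, global.rank);

    MPI_Finalize();
    return 0;
}

This is the same pair-plus-MAXLOC pattern described in [2]; in Fortran the
matching pair type would be MPI_2DOUBLE_PRECISION.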
>>  >>>
>>  >>> On Fri, Aug 10, 2018 at 11:25 AM Diego Avesani <diego.aves...@gmail.com> wrote:
>>  >>>  Dear all,
>>  >>> I think I have understood.
>>  >>> The trick is to use a real vector and to also store the rank.
>>  >>>
>>  >>> Have I understood correctly?
>>  >>> thanks
>>  >>>
>>  >>> Diego
>>  >>>
>>  >>>
>>  >>> On 10 August 2018 at 17:19, Diego Avesani <diego.aves...@gmail.com> wrote:
>>  >>> Dear all,
>>  >>> I do not understand how MPI_MINLOC works. It seems to locate the
>>