Re: [OMPI users] IpV6 Openmpi mpirun failed

2017-10-19 Thread Mukkie
OK, thanks. In that case, can we reopen this issue to get an update
from the participants?


Cordially,
Muku.

On Thu, Oct 19, 2017 at 4:37 PM, r...@open-mpi.org  wrote:

> Actually, I don’t see any related changes in OMPI master, let alone the
> branches. So far as I can tell, the author never actually submitted the
> work.
>
>
> On Oct 19, 2017, at 3:57 PM, Mukkie  wrote:
>
> FWIW, my issue is related to this one.
> https://github.com/open-mpi/ompi/issues/1585
>
> I have version 3.0.0, and the above issue is closed saying the fixes went
> into 3.1.0.
> However, I don't see the code changes for this issue.
>
> Cordially,
> Muku.
>
> On Wed, Oct 18, 2017 at 3:52 PM, Mukkie  wrote:
>
>> Thanks for your suggestion. However, my firewalls are already disabled on
>> both machines.
>>
>> Cordially,
>> Muku.
>>
>> On Wed, Oct 18, 2017 at 2:38 PM, r...@open-mpi.org 
>> wrote:
>>
>>> Looks like there is a firewall or something blocking communication
>>> between those nodes?
>>>
>>> On Oct 18, 2017, at 1:29 PM, Mukkie  wrote:
>>>
>>> Adding verbose output. Please check where it failed and advise. Thank you.
>>>
>>> [mselvam@ipv-rhel73 examples]$ mpirun -hostfile host --mca
>>> oob_base_verbose 100 --mca btl tcp,self ring_c
>>> [ipv-rhel73:10575] mca_base_component_repository_open: unable to open
>>> mca_plm_tm: libtorque.so.2: cannot open shared object file: No such file or
>>> directory (ignored)
>>> [ipv-rhel73:10575] mca: base: components_register: registering framework
>>> oob components
>>> [ipv-rhel73:10575] mca: base: components_register: found loaded
>>> component tcp
>>> [ipv-rhel73:10575] mca: base: components_register: component tcp
>>> register function successful
>>> [ipv-rhel73:10575] mca: base: components_open: opening oob components
>>> [ipv-rhel73:10575] mca: base: components_open: found loaded component tcp
>>> [ipv-rhel73:10575] mca: base: components_open: component tcp open
>>> function successful
>>> [ipv-rhel73:10575] mca:oob:select: checking available component tcp
>>> [ipv-rhel73:10575] mca:oob:select: Querying component [tcp]
>>> [ipv-rhel73:10575] oob:tcp: component_available called
>>> [ipv-rhel73:10575] WORKING INTERFACE 1 KERNEL INDEX 2 FAMILY: V6
>>> [ipv-rhel73:10575] [[20058,0],0] oob:tcp:init adding
>>> fe80::b9b:ac5d:9cf0:b858 to our list of V6 connections
>>> [ipv-rhel73:10575] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
>>> [ipv-rhel73:10575] [[20058,0],0] oob:tcp:init rejecting loopback
>>> interface lo
>>> [ipv-rhel73:10575] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>>> [ipv-rhel73:10575] [[20058,0],0] TCP STARTUP
>>> [ipv-rhel73:10575] [[20058,0],0] attempting to bind to IPv4 port 0
>>> [ipv-rhel73:10575] [[20058,0],0] assigned IPv4 port 53438
>>> [ipv-rhel73:10575] [[20058,0],0] attempting to bind to IPv6 port 0
>>> [ipv-rhel73:10575] [[20058,0],0] assigned IPv6 port 43370
>>> [ipv-rhel73:10575] mca:oob:select: Adding component to end
>>> [ipv-rhel73:10575] mca:oob:select: Found 1 active transports
>>> [ipv-rhel73:10575] [[20058,0],0]: get transports
>>> [ipv-rhel73:10575] [[20058,0],0]:get transports for component tcp
>>> [ipv-rhel73:10575] mca_base_component_repository_open: unable to open
>>> mca_ras_tm: libtorque.so.2: cannot open shared object file: No such file or
>>> directory (ignored)
>>> [ipv-rhel71a.locallab.local:12299] mca: base: components_register:
>>> registering framework oob components
>>> [ipv-rhel71a.locallab.local:12299] mca: base: components_register:
>>> found loaded component tcp
>>> [ipv-rhel71a.locallab.local:12299] mca: base: components_register:
>>> component tcp register function successful
>>> [ipv-rhel71a.locallab.local:12299] mca: base: components_open: opening
>>> oob components
>>> [ipv-rhel71a.locallab.local:12299] mca: base: components_open: found
>>> loaded component tcp
>>> [ipv-rhel71a.locallab.local:12299] mca: base: components_open:
>>> component tcp open function successful
>>> [ipv-rhel71a.locallab.local:12299] mca:oob:select: checking available
>>> component tcp
>>> [ipv-rhel71a.locallab.local:12299] mca:oob:select: Querying component
>>> [tcp]
>>> [ipv-rhel71a.locallab.local:12299] oob:tcp: component_available called
>>> [ipv-rhel71a.locallab.local:12299] WORKING INTERFACE 1 KERNEL INDEX 2
>>> FAMILY: V6
>>> [ipv-rhel71a.locallab.local:12299] [[20058,0],1] oob:tcp:init adding
>>> fe80::226:b9ff:fe85:6a28 to our list of V6 connections
>>> [ipv-rhel71a.locallab.local:12299] WORKING INTERFACE 2 KERNEL INDEX 1
>>> FAMILY: V4
>>> [ipv-rhel71a.locallab.local:12299] [[20058,0],1] oob:tcp:init rejecting
>>> loopback interface lo
>>> [ipv-rhel71a.locallab.local:12299] [[20058,0],1] TCP STARTUP
>>> [ipv-rhel71a.locallab.local:12299] [[20058,0],1] attempting to bind to
>>> IPv4 port 0
>>> [ipv-rhel71a.locallab.local:12299] [[20058,0],1] assigned IPv4 port
>>> 50782
>>> [ipv-rhel71a.locallab.local:12299] [[20058,0],1] 

Re: [OMPI users] IpV6 Openmpi mpirun failed

2017-10-19 Thread r...@open-mpi.org
Actually, I don’t see any related changes in OMPI master, let alone the 
branches. So far as I can tell, the author never actually submitted the work.
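
For anyone who wants to check for themselves, here is a minimal sketch, assuming
a local clone of the ompi repository; the tag, branch, and component path are
only guesses at where such a change would land:

  $ git clone https://github.com/open-mpi/ompi.git && cd ompi
  # look for IPv6-related commits touching the ORTE oob/tcp component
  # between the v3.0.0 tag and master
  $ git log --oneline v3.0.0..master -- orte/mca/oob/tcp | grep -i ipv6

An empty result would be consistent with the work never having been merged.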


> On Oct 19, 2017, at 3:57 PM, Mukkie  wrote:
> 
> FWIW, my issue is related to this one.
> https://github.com/open-mpi/ompi/issues/1585 
> 
> 
> I have version 3.0.0, and the above issue is closed saying the fixes went
> into 3.1.0.
> However, I don't see the code changes for this issue.
> 
> Cordially,
> Muku.
> 
> On Wed, Oct 18, 2017 at 3:52 PM, Mukkie wrote:
> Thanks for your suggestion. However, my firewalls are already disabled on
> both machines.
> 
> Cordially,
> Muku. 
> 
> On Wed, Oct 18, 2017 at 2:38 PM, r...@open-mpi.org wrote:
> Looks like there is a firewall or something blocking communication between 
> those nodes?
> 
>> On Oct 18, 2017, at 1:29 PM, Mukkie wrote:
>> 
>> Adding verbose output. Please check where it failed and advise. Thank you.
>> 
>> [mselvam@ipv-rhel73 examples]$ mpirun -hostfile host --mca oob_base_verbose 
>> 100 --mca btl tcp,self ring_c
>> [ipv-rhel73:10575] mca_base_component_repository_open: unable to open 
>> mca_plm_tm: libtorque.so.2: cannot open shared object file: No such file or 
>> directory (ignored)
>> [ipv-rhel73:10575] mca: base: components_register: registering framework oob 
>> components
>> [ipv-rhel73:10575] mca: base: components_register: found loaded component tcp
>> [ipv-rhel73:10575] mca: base: components_register: component tcp register 
>> function successful
>> [ipv-rhel73:10575] mca: base: components_open: opening oob components
>> [ipv-rhel73:10575] mca: base: components_open: found loaded component tcp
>> [ipv-rhel73:10575] mca: base: components_open: component tcp open function 
>> successful
>> [ipv-rhel73:10575] mca:oob:select: checking available component tcp
>> [ipv-rhel73:10575] mca:oob:select: Querying component [tcp]
>> [ipv-rhel73:10575] oob:tcp: component_available called
>> [ipv-rhel73:10575] WORKING INTERFACE 1 KERNEL INDEX 2 FAMILY: V6
>> [ipv-rhel73:10575] [[20058,0],0] oob:tcp:init adding 
>> fe80::b9b:ac5d:9cf0:b858 to our list of V6 connections
>> [ipv-rhel73:10575] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
>> [ipv-rhel73:10575] [[20058,0],0] oob:tcp:init rejecting loopback interface lo
>> [ipv-rhel73:10575] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>> [ipv-rhel73:10575] [[20058,0],0] TCP STARTUP
>> [ipv-rhel73:10575] [[20058,0],0] attempting to bind to IPv4 port 0
>> [ipv-rhel73:10575] [[20058,0],0] assigned IPv4 port 53438
>> [ipv-rhel73:10575] [[20058,0],0] attempting to bind to IPv6 port 0
>> [ipv-rhel73:10575] [[20058,0],0] assigned IPv6 port 43370
>> [ipv-rhel73:10575] mca:oob:select: Adding component to end
>> [ipv-rhel73:10575] mca:oob:select: Found 1 active transports
>> [ipv-rhel73:10575] [[20058,0],0]: get transports
>> [ipv-rhel73:10575] [[20058,0],0]:get transports for component tcp
>> [ipv-rhel73:10575] mca_base_component_repository_open: unable to open 
>> mca_ras_tm: libtorque.so.2: cannot open shared object file: No such file or 
>> directory (ignored)
>> [ipv-rhel71a.locallab.local:12299] mca: base: components_register: 
>> registering framework oob components
>> [ipv-rhel71a.locallab.local:12299] mca: base: components_register: found 
>> loaded component tcp
>> [ipv-rhel71a.locallab.local:12299] mca: base: components_register: component 
>> tcp register function successful
>> [ipv-rhel71a.locallab.local:12299] mca: base: components_open: opening oob 
>> components
>> [ipv-rhel71a.locallab.local:12299] mca: base: components_open: found loaded 
>> component tcp
>> [ipv-rhel71a.locallab.local:12299] mca: base: components_open: component tcp 
>> open function successful
>> [ipv-rhel71a.locallab.local:12299] mca:oob:select: checking available 
>> component tcp
>> [ipv-rhel71a.locallab.local:12299] mca:oob:select: Querying component [tcp]
>> [ipv-rhel71a.locallab.local:12299] oob:tcp: component_available called
>> [ipv-rhel71a.locallab.local:12299] WORKING INTERFACE 1 KERNEL INDEX 2 
>> FAMILY: V6
>> [ipv-rhel71a.locallab.local:12299] [[20058,0],1] oob:tcp:init adding 
>> fe80::226:b9ff:fe85:6a28 to our list of V6 connections
>> [ipv-rhel71a.locallab.local:12299] WORKING INTERFACE 2 KERNEL INDEX 1 
>> FAMILY: V4
>> [ipv-rhel71a.locallab.local:12299] [[20058,0],1] oob:tcp:init rejecting 
>> loopback interface lo
>> [ipv-rhel71a.locallab.local:12299] [[20058,0],1] TCP STARTUP
>> [ipv-rhel71a.locallab.local:12299] [[20058,0],1] attempting to bind to IPv4 
>> port 0
>> [ipv-rhel71a.locallab.local:12299] [[20058,0],1] assigned IPv4 port 50782
>> [ipv-rhel71a.locallab.local:12299] [[20058,0],1] attempting to bind to IPv6 
>> port 0
>> [ipv-rhel71a.locallab.local:12299] [[20058,0],1] assigned IPv6 

Re: [OMPI users] IpV6 Openmpi mpirun failed

2017-10-19 Thread Mukkie
FWIW, my issue is related to this one.
https://github.com/open-mpi/ompi/issues/1585

I have version 3.0.0, and the above issue is closed saying the fixes went
into 3.1.0.
However, I don't see the code changes for this issue.
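
As a quick sanity check on the build itself, here is a sketch assuming a
standard installation with ompi_info in the PATH. ompi_info reports whether
IPv6 support was compiled in, and which oob_tcp interface-selection parameters
the installed version exposes:

  $ ompi_info | grep -i ipv6
  $ ompi_info --all | grep -i oob_tcp_if

If IPv6 support shows as "no", the library would need to be rebuilt with
--enable-ipv6 before any runtime setting can help.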

Cordially,
Muku.

On Wed, Oct 18, 2017 at 3:52 PM, Mukkie  wrote:

> Thanks for your suggestion. However, my firewalls are already disabled on
> both machines.
>
> Cordially,
> Muku.
>
> On Wed, Oct 18, 2017 at 2:38 PM, r...@open-mpi.org 
> wrote:
>
>> Looks like there is a firewall or something blocking communication
>> between those nodes?
>>
>> On Oct 18, 2017, at 1:29 PM, Mukkie  wrote:
>>
>> Adding verbose output. Please check where it failed and advise. Thank you.
>>
>> [mselvam@ipv-rhel73 examples]$ mpirun -hostfile host --mca
>> oob_base_verbose 100 --mca btl tcp,self ring_c
>> [ipv-rhel73:10575] mca_base_component_repository_open: unable to open
>> mca_plm_tm: libtorque.so.2: cannot open shared object file: No such file or
>> directory (ignored)
>> [ipv-rhel73:10575] mca: base: components_register: registering framework
>> oob components
>> [ipv-rhel73:10575] mca: base: components_register: found loaded component
>> tcp
>> [ipv-rhel73:10575] mca: base: components_register: component tcp register
>> function successful
>> [ipv-rhel73:10575] mca: base: components_open: opening oob components
>> [ipv-rhel73:10575] mca: base: components_open: found loaded component tcp
>> [ipv-rhel73:10575] mca: base: components_open: component tcp open
>> function successful
>> [ipv-rhel73:10575] mca:oob:select: checking available component tcp
>> [ipv-rhel73:10575] mca:oob:select: Querying component [tcp]
>> [ipv-rhel73:10575] oob:tcp: component_available called
>> [ipv-rhel73:10575] WORKING INTERFACE 1 KERNEL INDEX 2 FAMILY: V6
>> [ipv-rhel73:10575] [[20058,0],0] oob:tcp:init adding
>> fe80::b9b:ac5d:9cf0:b858 to our list of V6 connections
>> [ipv-rhel73:10575] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
>> [ipv-rhel73:10575] [[20058,0],0] oob:tcp:init rejecting loopback
>> interface lo
>> [ipv-rhel73:10575] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>> [ipv-rhel73:10575] [[20058,0],0] TCP STARTUP
>> [ipv-rhel73:10575] [[20058,0],0] attempting to bind to IPv4 port 0
>> [ipv-rhel73:10575] [[20058,0],0] assigned IPv4 port 53438
>> [ipv-rhel73:10575] [[20058,0],0] attempting to bind to IPv6 port 0
>> [ipv-rhel73:10575] [[20058,0],0] assigned IPv6 port 43370
>> [ipv-rhel73:10575] mca:oob:select: Adding component to end
>> [ipv-rhel73:10575] mca:oob:select: Found 1 active transports
>> [ipv-rhel73:10575] [[20058,0],0]: get transports
>> [ipv-rhel73:10575] [[20058,0],0]:get transports for component tcp
>> [ipv-rhel73:10575] mca_base_component_repository_open: unable to open
>> mca_ras_tm: libtorque.so.2: cannot open shared object file: No such file or
>> directory (ignored)
>> [ipv-rhel71a.locallab.local:12299] mca: base: components_register:
>> registering framework oob components
>> [ipv-rhel71a.locallab.local:12299] mca: base: components_register: found
>> loaded component tcp
>> [ipv-rhel71a.locallab.local:12299] mca: base: components_register:
>> component tcp register function successful
>> [ipv-rhel71a.locallab.local:12299] mca: base: components_open: opening
>> oob components
>> [ipv-rhel71a.locallab.local:12299] mca: base: components_open: found
>> loaded component tcp
>> [ipv-rhel71a.locallab.local:12299] mca: base: components_open: component
>> tcp open function successful
>> [ipv-rhel71a.locallab.local:12299] mca:oob:select: checking available
>> component tcp
>> [ipv-rhel71a.locallab.local:12299] mca:oob:select: Querying component
>> [tcp]
>> [ipv-rhel71a.locallab.local:12299] oob:tcp: component_available called
>> [ipv-rhel71a.locallab.local:12299] WORKING INTERFACE 1 KERNEL INDEX 2
>> FAMILY: V6
>> [ipv-rhel71a.locallab.local:12299] [[20058,0],1] oob:tcp:init adding
>> fe80::226:b9ff:fe85:6a28 to our list of V6 connections
>> [ipv-rhel71a.locallab.local:12299] WORKING INTERFACE 2 KERNEL INDEX 1
>> FAMILY: V4
>> [ipv-rhel71a.locallab.local:12299] [[20058,0],1] oob:tcp:init rejecting
>> loopback interface lo
>> [ipv-rhel71a.locallab.local:12299] [[20058,0],1] TCP STARTUP
>> [ipv-rhel71a.locallab.local:12299] [[20058,0],1] attempting to bind to
>> IPv4 port 0
>> [ipv-rhel71a.locallab.local:12299] [[20058,0],1] assigned IPv4 port 50782
>> [ipv-rhel71a.locallab.local:12299] [[20058,0],1] attempting to bind to
>> IPv6 port 0
>> [ipv-rhel71a.locallab.local:12299] [[20058,0],1] assigned IPv6 port 59268
>> [ipv-rhel71a.locallab.local:12299] mca:oob:select: Adding component to
>> end
>> [ipv-rhel71a.locallab.local:12299] mca:oob:select: Found 1 active
>> transports
>> [ipv-rhel71a.locallab.local:12299] [[20058,0],1]: get transports
>> [ipv-rhel71a.locallab.local:12299] [[20058,0],1]:get transports for
>> component tcp
>> [ipv-rhel71a.locallab.local:12299] [[20058,0],1]: set_addr to uri
>> 

Re: [OMPI users] openib/mpi_alloc_mem pathology [#20160912-1315]

2017-10-19 Thread Paul Kapinos
Hi all,
sorry for the long, long latency - this message was buried in my mailbox for
months.



On 03/16/2017 10:35 AM, Alfio Lazzaro wrote:
> Hello Dave and others,
> we jump in the discussion as CP2K developers.
> We would like to ask you which version of CP2K you are using in your tests

Version 4.1 (release).

> and if you can share with us your input file and output log.

The input file is the property of Mathias Schumacher (CC:), and we need his
permission to provide it.



> Some clarifications on the way we use MPI allocate/free:
> 1) only buffers used for MPI communications are allocated with MPI 
> allocate/free
> 2) in general we use memory pools, therefore we reach a limit in the buffers
> sizes after some iterations, i.e. they are not reallocated anymore
> 3) there are some cases where we don't use memory pools, but their overall
> contribution should be very small. You can run with the CALLGRAPH option
> (https://www.cp2k.org/dev:profiling#the_cp2k_callgraph) to get more insight
> where those allocations/deallocations are.

We ran the data set again with the CALLGRAPH option. Please have a look at the
attached files: a callgraph file (from rank 0 of the 24 ranks used) and some
exported call-tree views.
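
For reference: the CALLGRAPH output is, as far as we can tell, written in a
callgrind-style format, so a viewer such as KCachegrind or QCachegrind can
usually open it; a sketch, where the file name is only a placeholder:

  $ kcachegrind rank-0.callgraph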

We can see that the *allocate* routines (mp_[de|]allocate_[i|d]) are called 33k
vs. 28k times (multiply this by the 24 processes per node). In the 'good case'
(Intel MPI, and Open MPI with the workaround) these calls account for only a
fraction of 1% of the time; in the 'bad case' (Open MPI without the workaround,
attached) the two mp_deallocate_[i|d] calls use 81% of the time in 'Self'. That
is essentially the observation we made a long time ago: if, on a node with an
Intel OmniPath fabric, the fallback to InfiniBand is not prohibited,
MPI_Free_mem() takes ages.
(I'm not familiar with KCachegrind, so forgive me if I'm mistaken.)
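
A sketch of the kind of workaround being referred to here, under the assumption
that the point is simply to keep the openib BTL from being selected on the
OmniPath nodes (the exact settings used earlier in this thread may have
differed; the executable and input names are placeholders):

  $ mpirun --mca btl ^openib -np 24 ./cp2k.popt my_input.inp
  # or, equivalently, via the environment:
  $ export OMPI_MCA_btl=^openib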

Have a nice day,

Paul Kapinos



-- 
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, IT Center
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915


Attachment: 20171019-callgraph.tar.gz (application/gzip)



Re: [OMPI users] Failed to register memory (openmpi 2.0.2)

2017-10-19 Thread Mark Dixon

Thanks Ralph, will do.

Cheers,

Mark

On Wed, 18 Oct 2017, r...@open-mpi.org wrote:


Put “oob=tcp” in your default MCA param file
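
For reference, a sketch of what that looks like in practice, assuming the
per-user default parameter file (a system-wide
$prefix/etc/openmpi-mca-params.conf works the same way):

  $ mkdir -p ~/.openmpi
  $ echo "oob = tcp" >> ~/.openmpi/mca-params.conf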


On Oct 18, 2017, at 9:00 AM, Mark Dixon  wrote:

Hi,

We're intermittently seeing messages (below) about failing to register memory
with Open MPI 2.0.2 on CentOS 7 / Mellanox FDR ConnectX-3 and the vanilla IB
stack as shipped by CentOS.

We're not using any mlx4_core module tweaks at the moment. On earlier machines
we used to set the registered memory as per the FAQ, but neither log_num_mtt nor
num_mtt seems to exist these days (according to
/sys/module/mlx4_*/parameters/*), which makes it somewhat difficult to follow
the FAQ.

The output of 'ulimit -l' shows as unlimited for every rank.
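
For anyone comparing notes, a sketch of the checks described above; which
parameter files exist depends on the mlx4 driver version, so some of these may
simply be absent:

  $ ls /sys/module/mlx4_core/parameters/
  $ cat /sys/module/mlx4_core/parameters/log_num_mtt 2>/dev/null
  $ cat /sys/module/mlx4_core/parameters/log_mtts_per_seg 2>/dev/null
  $ ulimit -l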

Does anyone have any advice, please?

Thanks,

Mark

-
Failed to register memory region (MR):

Hostname: dc1s0b1c
Address:  ec5000
Length:   20480
Error:Cannot allocate memory
--
--
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly.  This may
indicate a problem on this system.

Your job will continue, but Open MPI will ignore the "ud" oob component
in this run.
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users