Re: [OMPI users] job aborts "readv failed: Connection reset by peer"

2016-09-02 Thread John Hearns via users
Thank you.  That is helpful.

Could you run an 'ldd' on your executable, on one of the compute nodes if
possible?
I will not be able to solve your problem, but at least we now know what the
application is,
and can look at the libraries it is using.
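
For example, something along these lines (assuming 'compute-0-0' is one of
your compute nodes, and using the transiesta path from your earlier mail):

    ssh compute-0-0 ldd /share/apps/chemistry/siesta-3.2-pl-5/tpar/transiesta
    ssh compute-0-0 ldd /share/apps/chemistry/siesta-3.2-pl-5/tpar/transiesta | grep -i -E 'mpi|blas|lapack'

The second command just filters for the MPI and maths libraries, which are
the usual suspects.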



On 2 September 2016 at 17:19, Mahmood Naderan  wrote:

> The application is Siesta-3.2 and the command I use is
>
>
> /share/apps/computer/openmpi-1.6.5/bin/mpirun -hostfile hosts.txt -np
> 15 /share/apps/chemistry/siesta-3.2-pl-5/tpar/transiesta <
> trans-cc-bt-cc-163-20.fdf
>
> There is one node in the hosts.txt file. I have built transiesta
> binary from the source which uses
> /share/apps/computer/openmpi-1.6.5/bin/mpif90
>
> --
> Regards,
> Mahmood
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] job aborts "readv failed: Connection reset by peer"

2016-09-02 Thread John Hearns via users
Mahmood,
are you compiling and linking this application?
Or are you using an executable which someone else has prepared?

It would be very useful if we could know the application.




On 2 September 2016 at 16:35, Mahmood Naderan  wrote:

> >Did you ran
> >ulimit -c unlimited
> >before invoking mpirun ?
>
> Yes. On the node which says that error. Is that file created in the
> current working directory? Or it is somewhere in the system folders?
>
>
>
> As another question, I am trying to use OpenMPI-2.0.0 as a new one.
> Problem is that the application uses libmpi_f90.a from old versions
> but I don't see that in OpenMPI-2.0.0. There are some other libraries
> there.
>
>
>
>
> --
> Regards,
> Mahmood
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] New to (Open)MPI

2016-09-02 Thread John Hearns via users
Hello Lachlan.  I think Jeff Squyres will be along in a short while! He is
of course the expert on Cisco.

In the meantime a quick Google turns up:
http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/usnic/c/deployment/2_0_X/b_Cisco_usNIC_Deployment_Guide_For_Standalone_C-SeriesServers.html
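
As a quick check (only a hint, not definitive): if your Open MPI build was
configured with usNIC support you should see a usnic BTL listed; no output
simply means it was not built in:

    ompi_info | grep usnic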

On 2 September 2016 at 06:54, Gilles Gouaillardet  wrote:

> Hi,
>
>
> FCoE is for storage, Ethernet is for the network.
>
> I assume you can ssh into your nodes, which means you have a TCP/IP, and
> it is up and running.
>
> i do not know the details of Cisco hardware, but you might be able to use
> usnic (native btl or via libfabric) instead of the plain TCP/IP network.
>
>
> at first, you can build Open MPI, and run a job on two nodes with one task
> per node.
>
> in your script, you can
>
> mpirun --mca btl_base_verbose 100 --mca pml_base_verbose 100 ...
>
> this will tell you which network is used.
>
>
> Cheers,
>
>
> Gilles
> On 9/2/2016 11:06 AM, Lachlan Musicman wrote:
>
> Hola,
>
> I'm new to MPI and OpenMPI. Relatively new to HPC as well.
>
> I've just installed a SLURM cluster and added OpenMPI for the users to
> take advantage of.
>
> I'm just discovering that I have missed a vital part - the networking.
>
> I'm looking over the networking options and from what I can tell we only
> have (at the moment) Fibre Channel over Ethernet (FCoE).
>
> Is this a network technology that's supported by OpenMPI?
>
> (system is running Centos 7, on Cisco M Series hardware)
>
> Please excuse me if I have terms wrong or am missing knowledge. Am new to
> this.
>
> cheers
> Lachlan
>
>
> --
> The most dangerous phrase in the language is, "We've always done it this
> way."
>
> - Grace Hopper
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] job aborts "readv failed: Connection reset by peer"

2016-09-02 Thread John Hearns via users
Mahmood, as Gilles says, start by looking at how that application is compiled
and linked.
Run 'ldd' on the executable and look closely at the libraries.  Do this on
a compute node if you can.

There was a discussion on another mailing list recently about how to
fingerprint executables and see which architecture they were compiled for.
My mind is blank at the moment as to what that discussion concluded.
Sorry.  And if this was on OpenMPI I am doubly sorry!
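
As a very rough first pass - and it only looks at the binary itself, not the
shared libraries it pulls in - something like this shows the target
architecture and whether AVX-style (ymm/zmm) registers appear in the
disassembly:

    file ./transiesta
    objdump -d ./transiesta | grep -c -E 'ymm|zmm'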


On 2 September 2016 at 10:37, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Did you ran
> ulimit -c unlimited
> before invoking mpirun ?
>
> if your application can be ran with only one tasks, you can try to run it
> under gdb.
> you will hopefully be able to see where the illegal instruction occurs.
>
> since you are running on AMD processors, you have to make sure you are not
> using any third party library that was optimized for Intel processors (e.g.
> that uses AVX (SSE ?) instructions)
>
> Cheers,
>
> Gilles
>
> On Friday, September 2, 2016, Mahmood Naderan 
> wrote:
>
>> >Are you running under a batch manager ?
>> >On which architecture ?
>> Currently I am not using the job manager (which is actually PBS). I am
>> running from the terminal.
>>
>> The machines are AMD Opteron 64 bit
>>
>>
>> >Hopefully you will get a core file that points you to the illegal
>> instruction
>> Where is that core file. I can not find it.
>>
>> BTW, the openmpi is 1.6.5
>>
>>
>> --
>> Regards,
>> Mahmood
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] OpenMPI + InfiniBand

2016-11-01 Thread John Hearns via users
Sergei,
can you run:

ibhosts

ibstat

ibdiagnet


Lord help me for being so naive, but do you have a subnet manager running?
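
A couple of quick checks for that, run from any host on the fabric (the
opensm service name varies a little between distributions):

    sminfo
    systemctl status opensm

If sminfo errors out, there is probably no subnet manager running.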



On 1 November 2016 at 06:40, Sergei Hrushev  wrote:

> Hi Jeff !
>
> What does "ompi_info | grep openib" show?
>>
>>
> $ ompi_info | grep openib
>  MCA btl: openib (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>
> Additionally, Mellanox provides alternate support through their MXM
>> libraries, if you want to try that.
>>
>
> Yes, I know.
> But we already have a hybrid cluster with OpenMPI, OpenMP, CUDA, Torque
> and many other libraries installed,
> and because it works perfect over Ethernet interconnect my idea was to add
> InfiniBand support with minimum
> of changes. Mainly because we already have some custom-written software
> for OpenMPI.
>
>
>> If that shows that you have the openib BTL plugin loaded, try running
>> with "mpirun --mca btl_base_verbose 100 ..."  That will provide additional
>> output about why / why not each point-to-point plugin is chosen.
>>
>>
> Yes, I tried to get this info already.
> And I saw in log that rdmacm wants IP address on port.
> So my question in topc start message was:
>
> Is it enough for OpenMPI to have RDMA only or IPoIB should also be
> installed?
>
> The mpirun output is:
>
> [node1:02674] mca: base: components_register: registering btl components
> [node1:02674] mca: base: components_register: found loaded component openib
> [node1:02674] mca: base: components_register: component openib register
> function successful
> [node1:02674] mca: base: components_register: found loaded component sm
> [node1:02674] mca: base: components_register: component sm register
> function successful
> [node1:02674] mca: base: components_register: found loaded component self
> [node1:02674] mca: base: components_register: component self register
> function successful
> [node1:02674] mca: base: components_open: opening btl components
> [node1:02674] mca: base: components_open: found loaded component openib
> [node1:02674] mca: base: components_open: component openib open function
> successful
> [node1:02674] mca: base: components_open: found loaded component sm
> [node1:02674] mca: base: components_open: component sm open function
> successful
> [node1:02674] mca: base: components_open: found loaded component self
> [node1:02674] mca: base: components_open: component self open function
> successful
> [node1:02674] select: initializing btl component openib
> [node1:02674] openib BTL: rdmacm IP address not found on port
> [node1:02674] openib BTL: rdmacm CPC unavailable for use on mlx4_0:1;
> skipped
> [node1:02674] select: init of component openib returned failure
> [node1:02674] mca: base: close: component openib closed
> [node1:02674] mca: base: close: unloading component openib
> [node1:02674] select: initializing btl component sm
> [node1:02674] select: init of component sm returned failure
> [node1:02674] mca: base: close: component sm closed
> [node1:02674] mca: base: close: unloading component sm
> [node1:02674] select: initializing btl component self
> [node1:02674] select: init of component self returned success
> [node1:02674] mca: bml: Using self btl to [[16642,1],0] on node node1
> [node1:02674] mca: base: close: component self closed
> [node1:02674] mca: base: close: unloading component self
>
> Best regards,
> Sergei.
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] OpenMPI + InfiniBand

2016-10-28 Thread John Hearns via users
Sorry - shoot down my idea. Over to someone else (me hides head in shame)

On 28 October 2016 at 11:28, Sergei Hrushev  wrote:

> Sergei,   what does the command  "ibv_devinfo" return please?
>>
>> I had a recent case like this, but on Qlogic hardware.
>> Sorry if I am mixing things up.
>>
>>
> An output of ibv_devinfo from cluster's 1st node is:
>
> $ ibv_devinfo -d mlx4_0
> hca_id: mlx4_0
> transport:  InfiniBand (0)
> fw_ver: 2.35.5100
> node_guid:  7cfe:9003:00bd:dec0
> sys_image_guid: 7cfe:9003:00bd:dec3
> vendor_id:  0x02c9
> vendor_part_id: 4099
> hw_ver: 0x0
> board_id:   MT_1100120019
> phys_port_cnt:  1
> port:   1
> state:  PORT_ACTIVE (4)
> max_mtu:4096 (5)
> active_mtu: 4096 (5)
> sm_lid: 3
> port_lid:   3
> port_lmc:   0x00
> link_layer: InfiniBand
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] OpenMPI + InfiniBand

2016-10-28 Thread John Hearns via users
Sergei,   what does the command  "ibv_devinfo" return please?

I had a recent case like this, but on Qlogic hardware.
Sorry if I am mixing things up.

On 28 October 2016 at 10:48, Sergei Hrushev  wrote:

> Hello, All !
>
> We have a problem with OpenMPI version 1.10.2 on a cluster with newly
> installed Mellanox InfiniBand adapters.
> OpenMPI was re-configured and re-compiled using: --with-verbs
> --with-verbs-libdir=/usr/lib
>
> And our test MPI task returns proper results but it seems OpenMPI
> continues to use existing 1Gbit Ethernet network instead of InfiniBand.
>
> An output file contains these lines:
> --
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port.  As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
>
>   Local host:   node1
>   Local device: mlx4_0
>   Local port:   1
>   CPCs attempted:   rdmacm, udcm
> --
>
> InfiniBand network itself seems to be working:
>
> $ ibstat mlx4_0 shows:
>
> CA 'mlx4_0'
> CA type: MT4099
> Number of ports: 1
> Firmware version: 2.35.5100
> Hardware version: 0
> Node GUID: 0x7cfe900300bddec0
> System image GUID: 0x7cfe900300bddec3
> Port 1:
> State: Active
> Physical state: LinkUp
> Rate: 56
> Base lid: 3
> LMC: 0
> SM lid: 3
> Capability mask: 0x0251486a
> Port GUID: 0x7cfe900300bddec1
> Link layer: InfiniBand
>
> ibping also works.
> ibnetdiscover shows the correct topology of  IB network.
>
> Cluster works under Ubuntu 16.04 and we use drivers from OS (OFED is not
> installed).
>
> Is it enough for OpenMPI to have RDMA only or IPoIB should also be
> installed?
> What else can be checked?
>
> Thanks a lot for any help!
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] install OpenMPI on CentOS in HPC

2016-12-18 Thread John Hearns via users
Mahmoud, you should look at the OpenHPC project.
http://www.openhpc.community/
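
For a single CentOS 7 machine the stock packages are often enough to get
started; roughly (package and module names are from the standard repos and
may differ slightly by release):

    yum install openmpi openmpi-devel environment-modules
    module load mpi/openmpi-x86_64
    mpicc --version
    mpirun --version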

On 15 December 2016 at 19:50, Mahmoud MIRZAEI  wrote:

> Dears,
>
> May you please let me know if there is any procedure to install OpenMPI on
> CentOS in HPC?
>
> Thanks.
> Mahmoud
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Communicating MPI processes running in Docker containers in the same host by means of shared memory?

2017-03-24 Thread John Hearns via users
Jordi,
  this is not an answer to your question. However, have you looked at
Singularity:
http://singularity.lbl.gov/
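
For reference, the Docker IPC-namespace sharing you describe looks roughly
like this (image and container names here are made up):

    docker run -d --name ompi_a my_mpi_image sleep infinity
    docker run -d --name ompi_b --ipc=container:ompi_a my_mpi_image sleep infinity

Whether that alone is enough for sm/vader is another question - the BTLs may
also need a shared session directory (/tmp), so treat this only as a starting
point.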



On 24 March 2017 at 08:54, Jordi Guitart  wrote:

> Hello,
>
> Docker allows several containers running in the same host to share the
> same IPC namespace, thus they can share memory (see example here:
> https://github.com/docker/docker/pull/8211#issuecomment-56873448). I
> assume this could be used by OpenMPI to communicate MPI processes running
> in different Docker containers in the same host by using shared memory (sm
> or vader). However, I cannot make it work. I tried to force mpirun to use
> shared memory (--mca btl self, sm) but it complains that MPI processes
> running in other Docker containers are not reachable. It seems like OpenMPI
> cannot recognize that shared memory is available between containers. Has
> anybody any hint about how this could be worked out?
>
> Thanks
>
>
> http://bsc.es/disclaimer
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Q: Basic invoking of InfiniBand with OpenMPI

2017-07-14 Thread John Hearns via users
Boris, as Gilles says - first do some lower-level checks of your
InfiniBand network.
I suggest running:
ibdiagnet
ibhosts
and then as Gilles says 'ibstat' on each node
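
If you want to check every node quickly, a small loop over ssh does the job
(the node names here are just placeholders):

    for h in node01 node02 node03 ; do
        echo "== $h ==" ; ssh $h "ibstat | grep -E 'State|Rate|SM lid'"
    done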



On 14 July 2017 at 03:58, Gilles Gouaillardet  wrote:

> Boris,
>
>
> Open MPI should automatically detect the infiniband hardware, and use
> openib (and *not* tcp) for inter node communications
>
> and a shared memory optimized btl (e.g. sm or vader) for intra node
> communications.
>
>
> note if you "-mca btl openib,self", you tell Open MPI to use the openib
> btl between any tasks,
>
> including tasks running on the same node (which is less efficient than
> using sm or vader)
>
>
> at first, i suggest you make sure infiniband is up and running on all your
> nodes.
>
> (just run ibstat, at least one port should be listed, state should be
> Active, and all nodes should have the same SM lid)
>
>
> then try to run two tasks on two nodes.
>
>
> if this does not work, you can
>
> mpirun --mca btl_base_verbose 100 ...
>
> and post the logs so we can investigate from there.
>
>
> Cheers,
>
>
> Gilles
>
>
>
> On 7/14/2017 6:43 AM, Boris M. Vulovic wrote:
>
>>
>> I would like to know how to invoke InfiniBand hardware on CentOS 6x
>> cluster with OpenMPI (static libs.) for running my C++ code. This is how I
>> compile and run:
>>
>> /usr/local/open-mpi/1.10.7/bin/mpic++ -L/usr/local/open-mpi/1.10.7/lib
>> -Bstatic main.cpp -o DoWork
>>
>> usr/local/open-mpi/1.10.7/bin/mpiexec -mca btl tcp,self --hostfile
>> hostfile5 -host node01,node02,node03,node04,node05 -n 200 DoWork
>>
>> Here, "*-mca btl tcp,self*" reveals that *TCP* is used, and the cluster
>> has InfiniBand.
>>
>> What should be changed in compiling and running commands for InfiniBand
>> to be invoked? If I just replace "*-mca btl tcp,self*" with "*-mca btl
>> openib,self*" then I get plenty of errors with relevant one saying:
>>
>> /At least one pair of MPI processes are unable to reach each other for
>> MPI communications. This means that no Open MPI device has indicated that
>> it can be used to communicate between these processes. This is an error;
>> Open MPI requires that all MPI processes be able to reach each other. This
>> error can sometimes be the result of forgetting to specify the "self" BTL./
>>
>> Thanks very much!!!
>>
>>
>> *Boris *
>>
>>
>>
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Q: Basic invoking of InfiniBand with OpenMPI

2017-07-17 Thread John Hearns via users
Boris,
do you have a Subnet Manager running on your fabric?

I am sorry if there have been other replies to this over the weekend.

On 14 July 2017 at 18:34, Boris M. Vulovic <boris.m.vulo...@gmail.com>
wrote:

> Gus, Gilles and John,
>
> Thanks for the help. Let me first post (below) the output from checkouts
> of the IB network:
> ibdiagnet
> ibhosts
> ibstat  (for login node, for now)
>
> What do you think?
> Thanks
> --Boris
>
>
> 
> 
>
> -bash-4.1$ *ibdiagnet*
> --
> Load Plugins from:
> /usr/share/ibdiagnet2.1.1/plugins/
> (You can specify more paths to be looked in with "IBDIAGNET_PLUGINS_PATH"
> env variable)
>
> Plugin Name   Result Comment
> libibdiagnet_cable_diag_plugin-2.1.1  Succeeded  Plugin loaded
> libibdiagnet_phy_diag_plugin-2.1.1Succeeded  Plugin loaded
>
> -
> Discovery
> -E- Failed to initialize
>
> -E- Fabric Discover failed, err=IBDiag initialize wasn't done
> -E- Fabric Discover failed, MAD err=Failed to register SMI class
>
> -
> Summary
> -I- Stage Warnings   Errors Comment
> -I- Discovery   NA
> -I- Lids Check  NA
> -I- Links Check NA
> -I- Subnet Manager  NA
> -I- Port Counters   NA
> -I- Nodes Information   NA
> -I- Speed / Width checksNA
> -I- Partition Keys  NA
> -I- Alias GUIDs NA
> -I- Temperature Sensing NA
>
> -I- You can find detailed errors/warnings in: /var/tmp/ibdiagnet2/
> ibdiagnet2.log
>
> -E- A fatal error occurred, exiting...
> -bash-4.1$
> 
> 
>
> -bash-4.1$ *ibhosts*
> ibwarn: [168221] mad_rpc_open_port: client_register for mgmt 1 failed
> src/ibnetdisc.c:766; can't open MAD port ((null):0)
> /usr/sbin/ibnetdiscover: iberror: failed: discover failed
> -bash-4.1$
>
> 
> 
> -bash-4.1$ *ibstat*
> CA 'mlx5_0'
> CA type: MT4115
> Number of ports: 1
> Firmware version: 12.17.2020
> Hardware version: 0
> Node GUID: 0x248a0703005abb1c
> System image GUID: 0x248a0703005abb1c
> Port 1:
> State: Active
> Physical state: LinkUp
> Rate: 100
> Base lid: 0
> LMC: 0
> SM lid: 0
> Capability mask: 0x3c01
> Port GUID: 0x268a07fffe5abb1c
> Link layer: Ethernet
> CA 'mlx5_1'
> CA type: MT4115
> Number of ports: 1
> Firmware version: 12.17.2020
> Hardware version: 0
> Node GUID: 0x248a0703005abb1d
> System image GUID: 0x248a0703005abb1c
> Port 1:
> State: Active
> Physical state: LinkUp
> Rate: 100
> Base lid: 0
> LMC: 0
> SM lid: 0
> Capability mask: 0x3c01
> Port GUID: 0x
> Link layer: Ethernet
> CA 'mlx5_2'
> CA type: MT4115
> Number of ports: 1
> Firmware version: 12.17.2020
> Hardware version: 0
> Node GUID: 0x248a0703005abb30
> System image GUID: 0x248a0703005abb30
> Port 1:
> State: Down
> Physical state: Disabled
> Rate: 100
> Base lid: 0
> LMC: 0
> SM lid: 0
> Capability mask: 0x3c01
> Port GUID: 0x268a07fffe5abb30
> Link layer: Ethernet
> CA 'mlx5_3'
> CA type: MT4115
> Number of ports: 1
> Firmware version: 12.17.2020
> Hardware version: 0
> Node GUID: 0x248a0703005abb31
> System image GUID: 0x248a0703005abb30
> Port 1:
> State: Down
> Physical state: Disabled
>         Rate: 100
> Base lid: 0
> LMC: 0
> SM lid: 0
> Capability mask

Re: [OMPI users] Basic build trouble on RHEL7

2017-04-27 Thread John Hearns via users
Ray, probably a stupid question but do you have the hwloc-devel package
installed?
And also the libxml2-devel package?
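
On RHEL 7 that would be something like:

    yum install hwloc-devel libxml2-devel
    rpm -q hwloc-devel libxml2-devel

If those -devel packages are missing, the runtime libraries are in /usr/lib64
but typically the unversioned .so symlinks needed to resolve -lhwloc / -lxml2
at link time only ship in the -devel packages.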



On 27 April 2017 at 21:54, Ray Sheppard  wrote:

> Hi All,
>   I have searched the mail archives because I think this issue was
> addressed earlier, but I can not find anything useful.
>   We are standing up a few racks of RHEL-7 on Intel to slowly migrate the
> cluster from RHEL6.   I downloaded 2.1.0 to install. All goes well until
> about "CCLD libopen-rte.la."  Then it cannot find -lhwloc or -lxml2.
> There are copies of both in /usr/lib64.  I tried many variations of fixes.
> The most extreme is:
>
> #!/bin/bash
> export LT_SYS_LIBRARY_PATH=/usr/lib64
> export CC="gcc -L/usr/lib64 "
> export CXX="g++ -L/usr/lib64 "
> export FC="gfortran -L/usr/lib64 "
> ./configure CC="gcc -L/usr/lib64 " CXX="g++ -L/usr/lib64 " FC="gfortran
> -L/usr/lib64 " --enable-static --with-hwloc-libdir=/usr/lib64
> --with-threads=posix  --disable-vt --prefix=/N/soft/rhel7/openmpi
> /gnu/2.1.0
> #
>
> Nothing worked.  I thought maybe the older 1.X might not use HWLOC and  I
> see you still support it at 1.10.6.  I downloaded that and gave it a try.
> The  -lhwloc message was gone but -lxml2 was still there.  For fun, I tried
> the build on the rhel6 side.  With only a "regular' configure (./configure
> CC=gcc CXX=g++ FC=gfortran --enable-static --with-threads=posix
> --disable-vt --prefix=/N/soft/rhel6/openmpi/gnu/2.1.0 )  it worked just
> fine. I would appreciate knowing what I am missing.  Thanks.
> Ray
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] disable slurm/munge from mpirun

2017-06-22 Thread John Hearns via users
Michael,  try
 --mca plm_rsh_agent ssh

I've been fooling with this myself recently, in the context of a PBS cluster
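
A rough sketch of what I mean, either on the command line or via the
environment (./a.out is just a placeholder for your program):

    mpirun --mca plm rsh --mca plm_rsh_agent ssh -np 4 ./a.out

    export OMPI_MCA_plm=rsh
    export OMPI_MCA_plm_rsh_agent=ssh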

On 22 June 2017 at 16:16, Michael Di Domenico 
wrote:

> is it possible to disable slurm/munge/psm/pmi(x) from the mpirun
> command line or (better) using environment variables?
>
> i'd like to use the installed version of openmpi i have on a
> workstation, but it's linked with slurm from one of my clusters.
>
> mpi/slurm work just fine on the cluster, but when i run it on a
> workstation i get the below errors
>
> mca_base_component_repositoy_open: unable to open mca_sec_munge:
> libmunge missing
> ORTE_ERROR_LOG Not found in file ess_hnp_module.c at line 648
> opal_pmix_base_select failed
> returned value not found (-13) instead of orte_success
>
> there's probably a magical incantation of mca parameters, but i'm not
> adept enough at determining what they are
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] disable slurm/munge from mpirun

2017-06-22 Thread John Hearns via users
Having had some problems with ssh launching (a few minutes ago) I can
confirm that this works:

--mca plm_rsh_agent "ssh -v"

Stupidly I thought there was a major problem - when it turned out I could
not ssh into a host... ahem.



On 22 June 2017 at 16:35, r...@open-mpi.org <r...@open-mpi.org> wrote:

> You can add "OMPI_MCA_plm=rsh OMPI_MCA_sec=^munge” to your environment
>
>
> On Jun 22, 2017, at 7:28 AM, John Hearns via users <
> users@lists.open-mpi.org> wrote:
>
> Michael,  try
>  --mca plm_rsh_agent ssh
>
> I've been fooling with this myself recently, in the contect of a PBS
> cluster
>
> On 22 June 2017 at 16:16, Michael Di Domenico <mdidomeni...@gmail.com>
> wrote:
>
>> is it possible to disable slurm/munge/psm/pmi(x) from the mpirun
>> command line or (better) using environment variables?
>>
>> i'd like to use the installed version of openmpi i have on a
>> workstation, but it's linked with slurm from one of my clusters.
>>
>> mpi/slurm work just fine on the cluster, but when i run it on a
>> workstation i get the below errors
>>
>> mca_base_component_repositoy_open: unable to open mca_sec_munge:
>> libmunge missing
>> ORTE_ERROR_LOG Not found in file ess_hnp_module.c at line 648
>> opal_pmix_base_select failed
>> returned value not found (-13) instead of orte_success
>>
>> there's probably a magical incantation of mca parameters, but i'm not
>> adept enough at determining what they are
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] Openmpi with btl_openib_ib_service_level

2017-06-22 Thread John Hearns via users
I may have asked this recently (if so, sorry).
If anyone has worked with QoS settings with Open MPI, please ping me off list,
e.g.


mpirun --mca btl_openib_ib_service_level N
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] IBM Spectrum MPI problem

2017-05-19 Thread John Hearns via users
loaded or not
>>>> allowed itself to be used.  Your MPI job will now abort.
>>>>
>>>> You may wish to try to narrow down the problem;
>>>>  * Check the output of ompi_info to see which BTL/MTL plugins are
>>>>available.
>>>>  * Run your application with MPI_THREAD_SINGLE.
>>>>  * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>>>>if using MTL-based communications) to see exactly which
>>>>communication plugins were considered and/or discarded.
>>>> 
>>>> --
>>>> [openpower:88867] 1 more process has sent help message
>>>> help-mca-bml-r2.txt / unreachable proc
>>>> [openpower:88867] Set MCA parameter "orte_base_help_aggregate" to 0 to
>>>> see all help / error messages
>>>> [openpower:88867] 1 more process has sent help message
>>>> help-mpi-runtime.txt / mpi_init:startup:pml-add-procs-fail
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> 2017-05-19 9:22 GMT+02:00 Gabriele Fatigati <g.fatig...@cineca.it
>>>> <mailto:g.fatig...@cineca.it>>:
>>>>
>>>> Hi GIlles,
>>>>
>>>> using your command with one MPI procs I get:
>>>>
>>>> findActiveDevices Error
>>>> We found no active IB device ports
>>>> Hello world from rank 0  out of 1 processors
>>>>
>>>> So it seems to work apart the error message.
>>>>
>>>>
>>>> 2017-05-19 9:10 GMT+02:00 Gilles Gouaillardet <gil...@rist.or.jp
>>>> <mailto:gil...@rist.or.jp>>:
>>>>
>>>> Gabriele,
>>>>
>>>>
>>>> so it seems pml/pami assumes there is an infiniband card
>>>> available (!)
>>>>
>>>> i guess IBM folks will comment on that shortly.
>>>>
>>>>
>>>> meanwhile, you do not need pami since you are running on a
>>>> single node
>>>>
>>>> mpirun --mca pml ^pami ...
>>>>
>>>> should do the trick
>>>>
>>>> (if it does not work, can run and post the logs)
>>>>
>>>> mpirun --mca pml ^pami --mca pml_base_verbose 100 ...
>>>>
>>>>
>>>> Cheers,
>>>>
>>>>
>>>> Gilles
>>>>
>>>>
>>>> On 5/19/2017 4:01 PM, Gabriele Fatigati wrote:
>>>>
>>>> Hi John,
>>>> Infiniband is not used, there is a single node on this
>>>> machine.
>>>>
>>>> 2017-05-19 8:50 GMT+02:00 John Hearns via users
>>>> <users@lists.open-mpi.org
>>>> <mailto:users@lists.open-mpi.org>
>>>> <mailto:users@lists.open-mpi.org
>>>> <mailto:users@lists.open-mpi.org>>>:
>>>>
>>>> Gabriele,   pleae run  'ibv_devinfo'
>>>> It looks to me like you may have the physical
>>>> interface cards in
>>>> these systems, but you do not have the correct drivers
>>>> or
>>>> libraries loaded.
>>>>
>>>> I have had similar messages when using Infiniband on
>>>> x86 systems -
>>>> which did not have libibverbs installed.
>>>>
>>>>
>>>> On 19 May 2017 at 08:41, Gabriele Fatigati
>>>> <g.fatig...@cineca.it <mailto:g.fatig...@cineca.it>
>>>> <mailto:g.fatig...@cineca.it
>>>> <mailto:g.fatig...@cineca.it>>> wrote:
>>>>
>>>> Hi Gilles, using your command:
>>>>
>>>> [openpower:88536] mca: base: components_register:
>>>> registering
>>>> framework pml components
>>>> [openpower:88536] mca: base: components_register:
>>>> found loaded
>>>> component pami
>>>> [openpower:88536] mca: base: components_register:
>>>>  

Re: [OMPI users] IBM Spectrum MPI problem

2017-05-19 Thread John Hearns via users
 Hello world from rank 0  out of 1 processors
>>>
>>> So it seems to work apart the error message.
>>>
>>>
>>> 2017-05-19 9:10 GMT+02:00 Gilles Gouaillardet <gil...@rist.or.jp
>>> <mailto:gil...@rist.or.jp>>:
>>>
>>> Gabriele,
>>>
>>>
>>> so it seems pml/pami assumes there is an infiniband card
>>> available (!)
>>>
>>> i guess IBM folks will comment on that shortly.
>>>
>>>
>>> meanwhile, you do not need pami since you are running on a
>>> single node
>>>
>>> mpirun --mca pml ^pami ...
>>>
>>> should do the trick
>>>
>>> (if it does not work, can run and post the logs)
>>>
>>> mpirun --mca pml ^pami --mca pml_base_verbose 100 ...
>>>
>>>
>>> Cheers,
>>>
>>>
>>> Gilles
>>>
>>>
>>> On 5/19/2017 4:01 PM, Gabriele Fatigati wrote:
>>>
>>> Hi John,
>>> Infiniband is not used, there is a single node on this
>>> machine.
>>>
>>> 2017-05-19 8:50 GMT+02:00 John Hearns via users
>>> <users@lists.open-mpi.org
>>> <mailto:users@lists.open-mpi.org>
>>> <mailto:users@lists.open-mpi.org
>>> <mailto:users@lists.open-mpi.org>>>:
>>>
>>> Gabriele,   pleae run  'ibv_devinfo'
>>> It looks to me like you may have the physical
>>> interface cards in
>>> these systems, but you do not have the correct drivers or
>>> libraries loaded.
>>>
>>> I have had similar messages when using Infiniband on
>>> x86 systems -
>>> which did not have libibverbs installed.
>>>
>>>
>>> On 19 May 2017 at 08:41, Gabriele Fatigati
>>> <g.fatig...@cineca.it <mailto:g.fatig...@cineca.it>
>>> <mailto:g.fatig...@cineca.it
>>> <mailto:g.fatig...@cineca.it>>> wrote:
>>>
>>> Hi Gilles, using your command:
>>>
>>> [openpower:88536] mca: base: components_register:
>>> registering
>>> framework pml components
>>> [openpower:88536] mca: base: components_register:
>>> found loaded
>>> component pami
>>> [openpower:88536] mca: base: components_register:
>>> component
>>> pami register function successful
>>> [openpower:88536] mca: base: components_open:
>>> opening pml
>>> components
>>> [openpower:88536] mca: base: components_open:
>>> found loaded
>>> component pami
>>> [openpower:88536] mca: base: components_open:
>>> component pami
>>> open function successful
>>> [openpower:88536] select: initializing pml
>>> component pami
>>> findActiveDevices Error
>>> We found no active IB device ports
>>> [openpower:88536] select: init returned failure
>>> for component pami
>>> [openpower:88536] PML pami cannot be selected
>>>-
>>> -
>>> No components were able to be opened in the pml
>>> framework.
>>>
>>> This typically means that either no components of
>>> this type were
>>> installed, or none of the installed componnets can
>>> be loaded.
>>> Sometimes this means that shared libraries
>>> required by these
>>> components are unable to be found/loaded.
>>>
>>>   Host:  openpower
>>>   Framework: pml
>>>-

Re: [OMPI users] IBM Spectrum MPI problem

2017-05-19 Thread John Hearns via users
Gabriele, please run 'ibv_devinfo'.
It looks to me like you may have the physical interface cards in these
systems, but you do not have the correct drivers or libraries loaded.

I have had similar messages when using InfiniBand on x86 systems - which
did not have libibverbs installed.
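
A couple of quick, non-authoritative checks on the node itself:

    ibv_devinfo
    ldconfig -p | grep libibverbs
    rpm -qa | grep libibverbs

(use dpkg -l | grep libibverbs instead of rpm on Debian-based systems)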


On 19 May 2017 at 08:41, Gabriele Fatigati  wrote:

> Hi Gilles, using your command:
>
> [openpower:88536] mca: base: components_register: registering framework
> pml components
> [openpower:88536] mca: base: components_register: found loaded component
> pami
> [openpower:88536] mca: base: components_register: component pami register
> function successful
> [openpower:88536] mca: base: components_open: opening pml components
> [openpower:88536] mca: base: components_open: found loaded component pami
> [openpower:88536] mca: base: components_open: component pami open function
> successful
> [openpower:88536] select: initializing pml component pami
> findActiveDevices Error
> We found no active IB device ports
> [openpower:88536] select: init returned failure for component pami
> [openpower:88536] PML pami cannot be selected
> --
> No components were able to be opened in the pml framework.
>
> This typically means that either no components of this type were
> installed, or none of the installed componnets can be loaded.
> Sometimes this means that shared libraries required by these
> components are unable to be found/loaded.
>
>   Host:  openpower
>   Framework: pml
> --
>
>
> 2017-05-19 7:03 GMT+02:00 Gilles Gouaillardet :
>
>> Gabriele,
>>
>>
>> pml/pami is here, at least according to ompi_info
>>
>>
>> can you update your mpirun command like this
>>
>> mpirun --mca pml_base_verbose 100 ..
>>
>>
>> and post the output ?
>>
>>
>> Cheers,
>>
>> Gilles
>>
>> On 5/18/2017 10:41 PM, Gabriele Fatigati wrote:
>>
>>> Hi Gilles, attached the requested info
>>>
>>> 2017-05-18 15:04 GMT+02:00 Gilles Gouaillardet <
>>> gilles.gouaillar...@gmail.com >:
>>>
>>> Gabriele,
>>>
>>> can you
>>> ompi_info --all | grep pml
>>>
>>> also, make sure there is nothing in your environment pointing to
>>> an other Open MPI install
>>> for example
>>> ldd a.out
>>> should only point to IBM libraries
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>> On Thursday, May 18, 2017, Gabriele Fatigati >> > wrote:
>>>
>>> Dear OpenMPI users and developers, I'm using IBM Spectrum MPI
>>> 10.1.0 based on OpenMPI, so I hope there are some MPI expert
>>> can help me to solve the problem.
>>>
>>> When I run a simple Hello World MPI program, I get the follow
>>> error message:
>>>
>>>
>>> A requested component was not found, or was unable to be
>>> opened.  This
>>> means that this component is either not installed or is unable
>>> to be
>>> used on your system (e.g., sometimes this means that shared
>>> libraries
>>> that the component requires are unable to be found/loaded).
>>>Note that
>>> Open MPI stopped checking at the first component that it did
>>> not find.
>>>
>>> Host:  openpower
>>> Framework: pml
>>> Component: pami
>>> 
>>> --
>>> 
>>> --
>>> It looks like MPI_INIT failed for some reason; your parallel
>>> process is
>>> likely to abort. There are many reasons that a parallel
>>> process can
>>> fail during MPI_INIT; some of which are due to configuration
>>> or environment
>>> problems.  This failure appears to be an internal failure;
>>> here's some
>>> additional information (which may only be relevant to an Open MPI
>>> developer):
>>>
>>> mca_pml_base_open() failed
>>>   --> Returned "Not found" (-13) instead of "Success" (0)
>>> 
>>> --
>>> *** An error occurred in MPI_Init
>>> *** on a NULL communicator
>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will
>>> now abort,
>>> ***and potentially your MPI job)
>>>
>>> My sysadmin used official IBM Spectrum packages to install
>>> MPI, so It's quite strange that there are some components
>>> missing (pami). Any help? Thanks
>>>
>>>
>>> -- Ing. Gabriele Fatigati
>>>
>>> HPC specialist
>>>
>>> SuperComputing Applications and Innovation Department
>>>
>>> Via Magnanelli 6/3, Casalecchio di Reno 

Re: [OMPI users] IBM Spectrum MPI problem

2017-05-18 Thread John Hearns via users
Gabriele, as this is based on Open MPI, can you run ompi_info
and then look at which btl and mtl components are available?
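
Something like this should list them - the pattern just matches the way
ompi_info labels its components:

    ompi_info | grep -E 'MCA (btl|mtl|pml)'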



On 18 May 2017 at 14:10, Reuti  wrote:

> Hi,
>
> > Am 18.05.2017 um 14:02 schrieb Gabriele Fatigati :
> >
> > Dear OpenMPI users and developers, I'm using IBM Spectrum MPI 10.1.0
>
> I noticed this on IBM's website too. Is this freely available? Up to now I
> was always bounced back to their former Platform MPI when trying to
> download the community edition (even the evaluation link on the Spectrum
> MPI page does the same).
>
> -- Reuti
>
>
> >  based on OpenMPI, so I hope there are some MPI expert can help me to
> solve the problem.
> >
> > When I run a simple Hello World MPI program, I get the follow error
> message:
> >
> > A requested component was not found, or was unable to be opened.  This
> > means that this component is either not installed or is unable to be
> > used on your system (e.g., sometimes this means that shared libraries
> > that the component requires are unable to be found/loaded).  Note that
> > Open MPI stopped checking at the first component that it did not find.
> >
> > Host:  openpower
> > Framework: pml
> > Component: pami
> > 
> --
> > 
> --
> > It looks like MPI_INIT failed for some reason; your parallel process is
> > likely to abort.  There are many reasons that a parallel process can
> > fail during MPI_INIT; some of which are due to configuration or
> environment
> > problems.  This failure appears to be an internal failure; here's some
> > additional information (which may only be relevant to an Open MPI
> > developer):
> >
> >   mca_pml_base_open() failed
> >   --> Returned "Not found" (-13) instead of "Success" (0)
> > 
> --
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > ***and potentially your MPI job)
> >
> > My sysadmin used official IBM Spectrum packages to install MPI, so It's
> quite strange that there are some components missing (pami). Any help?
> Thanks
> >
> > --
> > Ing. Gabriele Fatigati
> >
> > HPC specialist
> >
> > SuperComputing Applications and Innovation Department
> >
> > Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
> >
> > www.cineca.itTel:   +39 051 6171722
> >
> > g.fatigati [AT] cineca.it
> > ___
> > users mailing list
> > users@lists.open-mpi.org
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] IBM Spectrum MPI problem

2017-05-18 Thread John Hearns via users
One very stupid question... what does 'ibv_devinfo' say when you run
it on the compute nodes?

PS. I know nothing about IBM MPI or PAMI, but I think this is, as you say,
some simple library being missing, etc.


On 18 May 2017 at 14:20, Gabriele Fatigati <g.fatig...@cineca.it> wrote:

> Hi John, about btl this is the output of ompi_info:
>
> MCA btl: self (MCA v2.1.0, API v3.0.0, Component v10.1.0)
> MCA btl: sm (MCA v2.1.0, API v3.0.0, Component v10.1.0)
> MCA btl: tcp (MCA v2.1.0, API v3.0.0, Component v10.1.0)
> MCA btl: vader (MCA v2.1.0, API v3.0.0, Component v10.1.0)
> MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v10.1.0)
>
>
> about mtl no information retrieve ompi_info
>
>
> 2017-05-18 14:13 GMT+02:00 John Hearns via users <users@lists.open-mpi.org
> >:
>
>> Gabriele,  as this is based on OpenMPI can you run ompi_info
>> then look for the btl which are available and the mtl which are available?
>>
>>
>>
>> On 18 May 2017 at 14:10, Reuti <re...@staff.uni-marburg.de> wrote:
>>
>>> Hi,
>>>
>>> > Am 18.05.2017 um 14:02 schrieb Gabriele Fatigati <g.fatig...@cineca.it
>>> >:
>>> >
>>> > Dear OpenMPI users and developers, I'm using IBM Spectrum MPI 10.1.0
>>>
>>> I noticed this on IBM's website too. Is this freely available? Up to now
>>> I was always bounced back to their former Platform MPI when trying to
>>> download the community edition (even the evaluation link on the Spectrum
>>> MPI page does the same).
>>>
>>> -- Reuti
>>>
>>>
>>> >  based on OpenMPI, so I hope there are some MPI expert can help me to
>>> solve the problem.
>>> >
>>> > When I run a simple Hello World MPI program, I get the follow error
>>> message:
>>> >
>>> > A requested component was not found, or was unable to be opened.  This
>>> > means that this component is either not installed or is unable to be
>>> > used on your system (e.g., sometimes this means that shared libraries
>>> > that the component requires are unable to be found/loaded).  Note that
>>> > Open MPI stopped checking at the first component that it did not find.
>>> >
>>> > Host:  openpower
>>> > Framework: pml
>>> > Component: pami
>>> > 
>>> --
>>> > 
>>> --
>>> > It looks like MPI_INIT failed for some reason; your parallel process is
>>> > likely to abort.  There are many reasons that a parallel process can
>>> > fail during MPI_INIT; some of which are due to configuration or
>>> environment
>>> > problems.  This failure appears to be an internal failure; here's some
>>> > additional information (which may only be relevant to an Open MPI
>>> > developer):
>>> >
>>> >   mca_pml_base_open() failed
>>> >   --> Returned "Not found" (-13) instead of "Success" (0)
>>> > 
>>> --
>>> > *** An error occurred in MPI_Init
>>> > *** on a NULL communicator
>>> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now
>>> abort,
>>> > ***and potentially your MPI job)
>>> >
>>> > My sysadmin used official IBM Spectrum packages to install MPI, so
>>> It's quite strange that there are some components missing (pami). Any help?
>>> Thanks
>>> >
>>> > --
>>> > Ing. Gabriele Fatigati
>>> >
>>> > HPC specialist
>>> >
>>> > SuperComputing Applications and Innovation Department
>>> >
>>> > Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>>> >
>>> > www.cineca.itTel:   +39 051 6171722
>>> >
>>> > g.fatigati [AT] cineca.it
>>> > ___
>>> > users mailing list
>>> > users@lists.open-mpi.org
>>> > https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>>
>>> ___
>>> users mailing list
>>> users@lists.open-mpi.org
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>>
>>
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>
>
>
>
> --
> Ing. Gabriele Fatigati
>
> HPC specialist
>
> SuperComputing Applications and Innovation Department
>
> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>
> www.cineca.itTel:   +39 051 6171722
> <+39%20051%20617%201722>
>
> g.fatigati [AT] cineca.it
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] IBM Spectrum MPI problem

2017-05-19 Thread John Hearns via users
Gabriele,
as Gilles says, if you are running within a single host system you do not
need the pami layer.
Usually you would use the btls sm,self, though I guess 'vader' is the
more up-to-date choice.
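
So, roughly, something like this (./hello being whatever your test program is):

    mpirun --mca pml ^pami --mca btl vader,self -np 2 ./hello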

On 19 May 2017 at 09:10, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

> Gabriele,
>
>
> so it seems pml/pami assumes there is an infiniband card available (!)
>
> i guess IBM folks will comment on that shortly.
>
>
> meanwhile, you do not need pami since you are running on a single node
>
> mpirun --mca pml ^pami ...
>
> should do the trick
>
> (if it does not work, can run and post the logs)
>
> mpirun --mca pml ^pami --mca pml_base_verbose 100 ...
>
>
> Cheers,
>
>
> Gilles
>
>
> On 5/19/2017 4:01 PM, Gabriele Fatigati wrote:
>
>> Hi John,
>> Infiniband is not used, there is a single node on this machine.
>>
>> 2017-05-19 8:50 GMT+02:00 John Hearns via users <users@lists.open-mpi.org
>> <mailto:users@lists.open-mpi.org>>:
>>
>> Gabriele,   pleae run  'ibv_devinfo'
>> It looks to me like you may have the physical interface cards in
>> these systems, but you do not have the correct drivers or
>> libraries loaded.
>>
>> I have had similar messages when using Infiniband on x86 systems -
>> which did not have libibverbs installed.
>>
>>
>> On 19 May 2017 at 08:41, Gabriele Fatigati <g.fatig...@cineca.it
>> <mailto:g.fatig...@cineca.it>> wrote:
>>
>> Hi Gilles, using your command:
>>
>> [openpower:88536] mca: base: components_register: registering
>> framework pml components
>> [openpower:88536] mca: base: components_register: found loaded
>> component pami
>> [openpower:88536] mca: base: components_register: component
>> pami register function successful
>> [openpower:88536] mca: base: components_open: opening pml
>> components
>> [openpower:88536] mca: base: components_open: found loaded
>> component pami
>> [openpower:88536] mca: base: components_open: component pami
>> open function successful
>> [openpower:88536] select: initializing pml component pami
>> findActiveDevices Error
>> We found no active IB device ports
>> [openpower:88536] select: init returned failure for component pami
>> [openpower:88536] PML pami cannot be selected
>> 
>> --
>> No components were able to be opened in the pml framework.
>>
>> This typically means that either no components of this type were
>> installed, or none of the installed componnets can be loaded.
>> Sometimes this means that shared libraries required by these
>> components are unable to be found/loaded.
>>
>>   Host:  openpower
>>   Framework: pml
>> 
>> --
>>
>>
>> 2017-05-19 7:03 GMT+02:00 Gilles Gouaillardet
>> <gil...@rist.or.jp <mailto:gil...@rist.or.jp>>:
>>
>> Gabriele,
>>
>>
>> pml/pami is here, at least according to ompi_info
>>
>>
>> can you update your mpirun command like this
>>
>> mpirun --mca pml_base_verbose 100 ..
>>
>>
>> and post the output ?
>>
>>
>> Cheers,
>>
>> Gilles
>>
>> On 5/18/2017 10:41 PM, Gabriele Fatigati wrote:
>>
>> Hi Gilles, attached the requested info
>>
>> 2017-05-18 15:04 GMT+02:00 Gilles Gouaillardet
>> <gilles.gouaillar...@gmail.com
>> <mailto:gilles.gouaillar...@gmail.com>
>> <mailto:gilles.gouaillar...@gmail.com
>> <mailto:gilles.gouaillar...@gmail.com>>>:
>>
>> Gabriele,
>>
>> can you
>> ompi_info --all | grep pml
>>
>> also, make sure there is nothing in your
>> environment pointing to
>> an other Open MPI install
>> for example
>> ldd a.out
>> should only point to IBM libraries
>>
>> Cheers

Re: [OMPI users] Many different errors with ompi version 2.1.1

2017-05-19 Thread John Hearns via users
Gilles, Allan,

if the host 'smd' is acting as a cluster head node, it does not need to have
an InfiniBand card.
So you should be able to run jobs across the other nodes, which have QLogic
cards.
I may have something mixed up here; if so I am sorry.

If you also want to run jobs on the smd host, you should take note of what
Gilles says.
You may be out of luck in that case.
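
As a rough sketch, assuming PSM support was built into your Open MPI and a
hostfile (here called nodes_qlogic) that lists only the QLogic nodes:

    mpirun -np 12 --hostfile nodes_qlogic --mca pml cm --mca mtl psm ./ring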

On 19 May 2017 at 09:15, Gilles Gouaillardet  wrote:

> Allan,
>
>
> i just noted smd has a Mellanox card, while other nodes have QLogic cards.
>
> mtl/psm works best for QLogic while btl/openib (or mtl/mxm) work best for
> Mellanox,
>
> but these are not interoperable. also, i do not think btl/openib can be
> used with QLogic cards
>
> (please someone correct me if i am wrong)
>
>
> from the logs, i can see that smd (Mellanox) is not even able to use the
> infiniband port.
>
> if you run with 2 MPI tasks, both run on smd and hence btl/vader is used,
> that is why it works
>
> if you run with more than 2 MPI tasks, then smd and other nodes are used,
> and every MPI task fall back to btl/tcp
>
> for inter node communication.
>
> [smd][[41971,1],1][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.1.196 failed: No route to host (113)
>
> this usually indicates a firewall, but since both ssh and oob/tcp are
> fine, this puzzles me.
>
>
> what if you
>
> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
> --mca btl_tcp_if_include 192.168.1.0/24 --mca pml ob1 --mca btl
> tcp,sm,vader,self  ring
>
> that should work with no error messages, and then you can try with 12 MPI
> tasks
>
> (note internode MPI communications will use tcp only)
>
>
> if you want optimal performance, i am afraid you cannot run any MPI task
> on smd (so mtl/psm can be used )
>
> (btw, make sure PSM support was built in Open MPI)
>
> a suboptimal option is to force MPI communications on IPoIB with
>
> /* make sure all nodes can ping each other via IPoIB first */
>
> mpirun --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_if_include
> 10.1.0.0/24 --mca pml ob1 --mca btl tcp,sm,vader,self
>
>
>
> Cheers,
>
>
> Gilles
>
>
> On 5/19/2017 3:50 PM, Allan Overstreet wrote:
>
>> Gilles,
>>
>> On which node is mpirun invoked ?
>>
>> The mpirun command was involed on node smd.
>>
>> Are you running from a batch manager?
>>
>> No.
>>
>> Is there any firewall running on your nodes ?
>>
>> No CentOS minimal does not have a firewall installed and Ubuntu
>> Mate's firewall is disabled.
>>
>> All three of your commands have appeared to run successfully. The outputs
>> of the three commands are attached.
>>
>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>> --mca oob_base_verbose 100 true &> cmd1
>>
>> mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>> --mca oob_base_verbose 100 true &> cmd2
>>
>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>> --mca oob_base_verbose 100 ring &> cmd3
>>
>> If I increase the number of processors in the ring program, mpirun will
>> not succeed.
>>
>> mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>> --mca oob_base_verbose 100 ring &> cmd4
>>
>>
>> On 05/19/2017 02:18 AM, Gilles Gouaillardet wrote:
>>
>>> Allan,
>>>
>>>
>>> - on which node is mpirun invoked ?
>>>
>>> - are you running from a batch manager ?
>>>
>>> - is there any firewall running on your nodes ?
>>>
>>>
>>> the error is likely occuring when wiring-up mpirun/orted
>>>
>>> what if you
>>>
>>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>> --mca oob_base_verbose 100 true
>>>
>>> then (if the previous command worked)
>>>
>>> mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>> --mca oob_base_verbose 100 true
>>>
>>> and finally (if both previous commands worked)
>>>
>>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>> --mca oob_base_verbose 100 ring
>>>
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 5/19/2017 3:07 PM, Allan Overstreet wrote:
>>>
 I experiencing many different errors with openmpi version 2.1.1. I have
 had a suspicion that this might be related to the way the servers were
 connected and configured. Regardless below is a diagram of how the server
 are configured.

 __  _
[__]|=|
/::/|_|
HOST: smd
Dual 1Gb Ethernet Bonded
.-> Bond0 IP: 192.168.1.200
|   Infiniband Card: MHQH29B-XTR <.
|   Ib0 IP: 10.1.0.1  |
|   OS: Ubuntu Mate   |
|   __ _ |
| [__]|=| 

Re: [OMPI users] Many different errors with ompi version 2.1.1

2017-05-19 Thread John Hearns via users
Allan,
remember that InfiniBand is not Ethernet.  You don't NEED to set up IPoIB
interfaces.

Two diagnostics please for you to run:

ibnetdiscover

ibdiagnet


Let us please have the results of ibnetdiscover




On 19 May 2017 at 09:25, John Hearns  wrote:

> Giles, Allan,
>
> if the host 'smd' is acting as a cluster head node it is not a must for it
> to have an Infiniband card.
> So you should be able to run jobs across the other nodes, which have
> Qlogic cards.
> I may have something mixed up here, if so I am sorry.
>
> If you want also to run jobs on the smd host, you should take note of what
> Giles says.
> You may be out of luck in that case.
>
> On 19 May 2017 at 09:15, Gilles Gouaillardet  wrote:
>
>> Allan,
>>
>>
>> i just noted smd has a Mellanox card, while other nodes have QLogic cards.
>>
>> mtl/psm works best for QLogic while btl/openib (or mtl/mxm) work best for
>> Mellanox,
>>
>> but these are not interoperable. also, i do not think btl/openib can be
>> used with QLogic cards
>>
>> (please someone correct me if i am wrong)
>>
>>
>> from the logs, i can see that smd (Mellanox) is not even able to use the
>> infiniband port.
>>
>> if you run with 2 MPI tasks, both run on smd and hence btl/vader is used,
>> that is why it works
>>
>> if you run with more than 2 MPI tasks, then smd and other nodes are used,
>> and every MPI task fall back to btl/tcp
>>
>> for inter node communication.
>>
>> [smd][[41971,1],1][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.1.196 failed: No route to host (113)
>>
>> this usually indicates a firewall, but since both ssh and oob/tcp are
>> fine, this puzzles me.
>>
>>
>> what if you
>>
>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>> --mca btl_tcp_if_include 192.168.1.0/24 --mca pml ob1 --mca btl
>> tcp,sm,vader,self  ring
>>
>> that should work with no error messages, and then you can try with 12 MPI
>> tasks
>>
>> (note internode MPI communications will use tcp only)
>>
>>
>> if you want optimal performance, i am afraid you cannot run any MPI task
>> on smd (so mtl/psm can be used )
>>
>> (btw, make sure PSM support was built in Open MPI)
>>
>> a suboptimal option is to force MPI communications on IPoIB with
>>
>> /* make sure all nodes can ping each other via IPoIB first */
>>
>> mpirun --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_if_include
>> 10.1.0.0/24 --mca pml ob1 --mca btl tcp,sm,vader,self
>>
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
>>
>> On 5/19/2017 3:50 PM, Allan Overstreet wrote:
>>
>>> Gilles,
>>>
>>> On which node is mpirun invoked ?
>>>
>>> The mpirun command was involed on node smd.
>>>
>>> Are you running from a batch manager?
>>>
>>> No.
>>>
>>> Is there any firewall running on your nodes ?
>>>
>>> No CentOS minimal does not have a firewall installed and Ubuntu
>>> Mate's firewall is disabled.
>>>
>>> All three of your commands have appeared to run successfully. The
>>> outputs of the three commands are attached.
>>>
>>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>> --mca oob_base_verbose 100 true &> cmd1
>>>
>>> mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>> --mca oob_base_verbose 100 true &> cmd2
>>>
>>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>> --mca oob_base_verbose 100 ring &> cmd3
>>>
>>> If I increase the number of processors in the ring program, mpirun will
>>> not succeed.
>>>
>>> mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>> --mca oob_base_verbose 100 ring &> cmd4
>>>
>>>
>>> On 05/19/2017 02:18 AM, Gilles Gouaillardet wrote:
>>>
 Allan,


 - on which node is mpirun invoked ?

 - are you running from a batch manager ?

 - is there any firewall running on your nodes ?


 the error is likely occuring when wiring-up mpirun/orted

 what if you

 mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
 --mca oob_base_verbose 100 true

 then (if the previous command worked)

 mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
 --mca oob_base_verbose 100 true

 and finally (if both previous commands worked)

 mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
 --mca oob_base_verbose 100 ring


 Cheers,

 Gilles

 On 5/19/2017 3:07 PM, Allan Overstreet wrote:

> I experiencing many different errors with openmpi version 2.1.1. I
> have had a suspicion that this might be related to the way the servers 
> were
> connected and configured. Regardless below is a diagram of how the server
> are configured.
>
> __  _
>[__]|=|
>/::/|_|
>

Re: [OMPI users] IBM Spectrum MPI problem

2017-05-19 Thread John Hearns via users
BTLs attempted: self

That should only allow a single process to communicate with itself.




On 19 May 2017 at 09:23, Gabriele Fatigati <g.fatig...@cineca.it> wrote:

> Oh no, by using two procs:
>
>
> findActiveDevices Error
> We found no active IB device ports
> findActiveDevices Error
> We found no active IB device ports
> --
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[12380,1],0]) is on host: openpower
>   Process 2 ([[12380,1],1]) is on host: openpower
>   BTLs attempted: self
>
> Your MPI job is now going to abort; sorry.
> --
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***and potentially your MPI job)
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***and potentially your MPI job)
> --
> MPI_INIT has failed because at least one MPI process is unreachable
> from another.  This *usually* means that an underlying communication
> plugin -- such as a BTL or an MTL -- has either not loaded or not
> allowed itself to be used.  Your MPI job will now abort.
>
> You may wish to try to narrow down the problem;
>  * Check the output of ompi_info to see which BTL/MTL plugins are
>available.
>  * Run your application with MPI_THREAD_SINGLE.
>  * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>if using MTL-based communications) to see exactly which
>communication plugins were considered and/or discarded.
> --
> [openpower:88867] 1 more process has sent help message help-mca-bml-r2.txt
> / unreachable proc
> [openpower:88867] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
> [openpower:88867] 1 more process has sent help message
> help-mpi-runtime.txt / mpi_init:startup:pml-add-procs-fail
>
>
>
>
>
> 2017-05-19 9:22 GMT+02:00 Gabriele Fatigati <g.fatig...@cineca.it>:
>
>> Hi GIlles,
>>
>> using your command with one MPI procs I get:
>>
>> findActiveDevices Error
>> We found no active IB device ports
>> Hello world from rank 0  out of 1 processors
>>
>> So it seems to work apart the error message.
>>
>>
>> 2017-05-19 9:10 GMT+02:00 Gilles Gouaillardet <gil...@rist.or.jp>:
>>
>>> Gabriele,
>>>
>>>
>>> so it seems pml/pami assumes there is an infiniband card available (!)
>>>
>>> i guess IBM folks will comment on that shortly.
>>>
>>>
>>> meanwhile, you do not need pami since you are running on a single node
>>>
>>> mpirun --mca pml ^pami ...
>>>
>>> should do the trick
>>>
>>> (if it does not work, can run and post the logs)
>>>
>>> mpirun --mca pml ^pami --mca pml_base_verbose 100 ...
>>>
>>>
>>> Cheers,
>>>
>>>
>>> Gilles
>>>
>>>
>>> On 5/19/2017 4:01 PM, Gabriele Fatigati wrote:
>>>
>>>> Hi John,
>>>> Infiniband is not used, there is a single node on this machine.
>>>>
>>>> 2017-05-19 8:50 GMT+02:00 John Hearns via users <
>>>> users@lists.open-mpi.org <mailto:users@lists.open-mpi.org>>:
>>>>
>>>> Gabriele,   pleae run  'ibv_devinfo'
>>>> It looks to me like you may have the physical interface cards in
>>>> these systems, but you do not have the correct drivers or
>>>> libraries loaded.
>>>>
>>>> I have had similar messages when using Infiniband on x86 systems -
>>>> which did not have libibverbs installed.
>>>>
>>>>
>>>> On 19 May 2017 at 08:41, Gabriele Fatigati <g.fatig...@cineca.it
>>>> <mailto:g.fatig...@cineca.it>> wrote:
>>>>
>>>> Hi Gilles, using your command:
>>>>
>>

Re: [OMPI users] mpif90 unable to find ibverbs

2017-09-14 Thread John Hearns via users
Then let me add my thoughts, please. Rocks is getting out of date.
Mahmood, I would imagine that you are not given the choice of installing
something more modern,
ie the place where you work has an existing Rocks cluster and is unwilling
to re-install it.

So what is wrong with using the  'module load openmpi' which you would
normally do in a job?

In a more constructive fashion, the University of California at San Diego
makes available Rocks Rolls
which have a much more modern software environment.

https://github.com/sdsc/cluster-guide

https://github.com/sdsc/mpi-roll

Mahmood, I would ask your systems team if they are willing to install these
rolls.










On 14 September 2017 at 12:42, Mahmood Naderan  wrote:

> So it seems that -rpath is not available with 1.4 which is ompi came with
> rocks 6.
>
> Regards,
> Mahmood
>
>
>
> On Thu, Sep 14, 2017 at 2:44 PM, Mahmood Naderan 
> wrote:
>
>> Well that may be good if someone intend to rebuild ompi.
>> Lets say, there is an ompi on the system...
>>
>> Regards,
>> Mahmood
>>
>>
>>
>> On Thu, Sep 14, 2017 at 2:31 PM, Gilles Gouaillardet <
>> gilles.gouaillar...@gmail.com> wrote:
>>
>>> Peter and all,
>>>
>>> an easier option is to configure Open MPI with --mpirun-prefix-by-default
>>> this will automagically add rpath to the libs.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Thu, Sep 14, 2017 at 6:43 PM, Peter Kjellström 
>>> wrote:
>>> > On Wed, 13 Sep 2017 20:13:54 +0430
>>> > Mahmood Naderan  wrote:
>>> > ...
>>> >> `/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../lib64/li
>>> bc.a(strcmp.o)'
>>> >> can not be used when making an executable; recompile with -fPIE and
>>> >> relink with -pie collect2: ld returned 1 exit status
>>> >>
>>> >>
>>> >> With such an error, I thought it is better to forget static linking!
>>> >> (as it is related to libc) and work with the shared libs and
>>> >> LD_LIBRARY_PATH
>>> >
>>> > First, I think giving up on static linking is the right choice.
>>> >
>>> > If the main thing you were after was the convenience of a binary that
>>> > will run without the need to setup LD_LIBRARY_PATH correctly you should
>>> > have a look at passing -rpath to the linker.
>>> >
>>> > In short, "mpicc -Wl,-rpath=/my/lib/path helloworld.c -o hello", will
>>> > compile a dynamic binary "hello" with built in search path
>>> > to "/my/lib/path".
>>> >
>>> > With OpenMPI this will be added as a "runpath" due to how the wrappers
>>> > are designed. Both rpath and runpath works for finding "/my/lib/path"
>>> > wihtout LD_LIBRARY_PATH but the difference is in priority. rpath is
>>> > higher priority than LD_LIBRARY_PATH etc. and runpath is lower.
>>> >
>>> > You can check your rpath or runpath in a binary using the command
>>> > chrpath (package on rhel/centos/... is chrpath):
>>> >
>>> > $ chrpath hello
>>> > hello: RUNPATH=/my/lib/path
>>> >
>>> > If what you really wanted is the rpath behavior (winning over any
>>> > LD_LIBRARY_PATH in the environment etc.) then you need to modify the
>>> > openmpi wrappers (rebuild openmpi) such that it does NOT pass
>>> > "--enable-new-dtags" to the linker.
>>> >
>>> > /Peter
>>> > ___
>>> > users mailing list
>>> > users@lists.open-mpi.org
>>> > https://lists.open-mpi.org/mailman/listinfo/users
>>> ___
>>> users mailing list
>>> users@lists.open-mpi.org
>>> https://lists.open-mpi.org/mailman/listinfo/users
>>>
>>
>>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] mpif90 unable to find ibverbs

2017-09-14 Thread John Hearns via users
Jeff, from what I read yesterday it is OpenMPI 2, though I am not sure of the
minor version.
I do acknowledge that Mahmood reports that the Rocks 7 beta is available -
when I last used Rocks this was not available.
But still - look at something more up to date, such as OpenHPC.
There is nothing intrinsically tied up between hardware and the operating
system / software stack.


On 14 September 2017 at 16:48, Jeff Squyres (jsquyres) 
wrote:

> Let me throw in one more item: I don't know what versions of Open MPI are
> available in those Rocks Rolls, but Open MPI v3.0.0 was released
> yesterday.  You will be much better served with a modern version of Open
> MPI (vs. v1.4, the last release of which was in 2012).
>
>
>
> > On Sep 14, 2017, at 8:21 AM, Peter Kjellström  wrote:
> >
> > On Thu, 14 Sep 2017 19:01:08 +0900
> > Gilles Gouaillardet  wrote:
> >
> > > Peter and all,
> > >
> > > an easier option is to configure Open MPI with
> > > --mpirun-prefix-by-default this will automagically add rpath to the
> > > libs.
> >
> > Yes that sorts out the OpenMPI libs but I was imagining a more general
> > situation (and the OP later tried adding openblas).
> >
> > It's also only available if the OpenMPI in question is built with it
> > or if you can rebuild OpenMPI.
> >
> > The OP seems at least partially interested in additional libraries and
> > not rebuilding the system provided OpenMPI.
> >
> > /Peter
> >
> > --
> > Sent from my Android device with K-9 Mail. Please excuse my
> brevity.___
> > users mailing list
> > users@lists.open-mpi.org
> > https://lists.open-mpi.org/mailman/listinfo/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Fwd: MCA version error

2017-10-13 Thread John Hearns via users
Abhisek ... Gilles asked which program you are trying to run, and how it
was linked with OpenMPI.

Also please realise that you do not HAVE to use the openmpi packages
provided by your linux distribution.
It is perfectly OK to download, compile and install another version.


On 13 October 2017 at 15:41, abhisek Mondal  wrote:

> Hello,
>
> I'm getting this following error:
> * [localhost.localdomain:00307] mca_base_component_repository_open: shmem
> "/usr/lib64/openmpi/lib/openmpi/mca_shmem_mmap" uses an MCA interface that
> is not recognized (component MCA v2.0.0 != supported MCA v2.1.0) -- ignored*
> * [localhost.localdomain:00307] mca_base_component_repository_open: unable
> to open mca_shmem_posix: /usr/lib64/openmpi/lib/openmpi/mca_shmem_posix.so:
> undefined symbol: opal_shmem_base_output (ignored)*
> * [localhost.localdomain:00307] mca_base_component_repository_open: shmem
> "/usr/lib64/openmpi/lib/openmpi/mca_shmem_sysv" uses an MCA interface that
> is not recognized (component MCA v2.0.0 != supported MCA v2.1.0) -- ignored*
>
> I had installed it using yum command: *yum install openmpi-1.10.0-10.el7 *
>
> My current MCA version is showing 2.1.0.
>
> On Fri, Oct 13, 2017 at 1:24 PM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
>> Hi,
>>
>> let's take one or two steps back.
>>
>> which version of Open MPI did you use in order to build your program ?
>>
>> what does "not working under MCA2.1" mean ?
>> link error ? unresolved symbols ? runtime crash ?
>>
>> please detail your environment and post all relevant error messages
>>
>> Cheers,
>>
>> Gilles
>>
>> On Fri, Oct 13, 2017 at 10:17 PM, abhisek Mondal 
>> wrote:
>> >
>> > Hi,
>> >
>> > I have installed an openmpi using following command:
>> > yum install openmpi-1.10.0-10.el7
>> >
>> > When I put ompi_info command it shows me that it is using MCAv.2.1. Is
>> there
>> > any way I can use MCA2.0 ?
>> >
>> > My program is not working under MCA2.1.
>> > Please help me out here.
>> >
>> > Thank you.
>> >
>> >
>> >
>> > --
>> > Abhisek Mondal
>> > Senior Research Fellow
>> > Structural Biology and Bioinformatics Division
>> > CSIR-Indian Institute of Chemical Biology
>> > Kolkata 700032
>> > INDIA
>> >
>> > ___
>> > users mailing list
>> > users@lists.open-mpi.org
>> > https://lists.open-mpi.org/mailman/listinfo/users
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>>
>
>
>
> --
> Abhisek Mondal
>
> *Senior Research Fellow*
>
> *Structural Biology and Bioinformatics Division*
> *CSIR-Indian Institute of Chemical Biology*
>
> *Kolkata 700032*
>
> *INDIA*
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Open MPI internal error

2017-09-28 Thread John Hearns via users
Google turns this up:
https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls


On 28 September 2017 at 01:26, Ludovic Raess  wrote:

> Hi,
>
>
> we have a issue on our 32 nodes Linux cluster regarding the usage of Open
> MPI in a Infiniband dual-rail configuration (2 IB Connect X FDR single
> port HCA, Centos 6.6, OFED 3.1, openmpi 2.0.0, gcc 5.4, cuda 7).
>
>
> On long runs (over ~10 days) involving more than 1 node (usually 64 MPI
> processes distributed on 16 nodes [node01-node16]​), we observe the freeze
> of the simulation due to an internal error displaying: "error polling LP CQ
> with status REMOTE ACCESS ERROR status number 10 for wr_id e88c00 opcode 1
>  vendor error 136 qp_idx 0" (see attached file for full output).
>
>
> The job hangs, no computation neither communication occurs anymore, but no
> exit neither unload of the nodes is observed. The job can be killed
> normally but then the concerned nodes do not fully recover. A relaunch of
> the simulation usually sustains a couple of iterations (few minutes
> runtime), and then the job hangs again due to similar reasons. The only
> workaround so far is to reboot the involved nodes.
>
>
> Since we didn't find any hints on the web regarding this
> strange behaviour, I am wondering if this is a known issue. We actually
> don't know what causes this to happen and why. So any hints were to start
> investigating or possible reasons for this to happen are welcome.​
>
>
> Ludovic
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Open MPI internal error

2017-09-28 Thread John Hearns via users
PS. Before you do the reboot of a compute node, have you run 'ibdiagnet'?

On 28 September 2017 at 11:17, John Hearns  wrote:

>
> Google turns this up:
> https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls
>
>
> On 28 September 2017 at 01:26, Ludovic Raess 
> wrote:
>
>> Hi,
>>
>>
>> we have a issue on our 32 nodes Linux cluster regarding the usage of Open
>> MPI in a Infiniband dual-rail configuration (2 IB Connect X FDR single
>> port HCA, Centos 6.6, OFED 3.1, openmpi 2.0.0, gcc 5.4, cuda 7).
>>
>>
>> On long runs (over ~10 days) involving more than 1 node (usually 64 MPI
>> processes distributed on 16 nodes [node01-node16]​), we observe the freeze
>> of the simulation due to an internal error displaying: "error polling LP CQ
>> with status REMOTE ACCESS ERROR status number 10 for wr_id e88c00 opcode 1
>>  vendor error 136 qp_idx 0" (see attached file for full output).
>>
>>
>> The job hangs, no computation neither communication occurs anymore, but
>> no exit neither unload of the nodes is observed. The job can be killed
>> normally but then the concerned nodes do not fully recover. A relaunch of
>> the simulation usually sustains a couple of iterations (few minutes
>> runtime), and then the job hangs again due to similar reasons. The only
>> workaround so far is to reboot the involved nodes.
>>
>>
>> Since we didn't find any hints on the web regarding this
>> strange behaviour, I am wondering if this is a known issue. We actually
>> don't know what causes this to happen and why. So any hints were to start
>> investigating or possible reasons for this to happen are welcome.​
>>
>>
>> Ludovic
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>>
>
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Setting LD_LIBRARY_PATH for orted

2017-08-22 Thread John Hearns via users
Gary, are you using Modules?
http://www.admin-magazine.com/HPC/Articles/Environment-Modules

On 22 August 2017 at 02:04, Gilles Gouaillardet  wrote:

> Gary,
>
>
> one option (as mentioned in the error message) is to configure Open MPI
> with --enable-orterun-prefix-by-default.
>
> this will force the build process to use rpath, so you do not have to set
> LD_LIBRARY_PATH
>
> this is the easiest option, but cannot be used if you plan to relocate the
> Open MPI installation directory.
>
>
> an other option is to use a wrapper for orted.
>
> mpirun --mca orte_launch_agent /.../myorted ...
>
> where myorted is a script that looks like
>
> #!/bin/sh
>
> export LD_LIBRARY_PATH=...
>
> exec /.../bin/orted "$@"
>
>
> you can make this setting system-wide by adding the following line to
> /.../etc/openmpi-mca-params.conf
>
> orte_launch_agent = /.../myorted
>
>
> Cheers,
>
>
> Gilles
>
>
>
> On 8/22/2017 1:06 AM, Jackson, Gary L. wrote:
>
>>
>> I’m using a binary distribution of OpenMPI 1.10.2. As linked, it requires
>> certain shared libraries outside of OpenMPI for orted itself to start. So,
>> passing in LD_LIBRARY_PATH with the “-x” flag to mpirun doesn’t do anything:
>>
>> $ mpirun –hostfile ${HOSTFILE} -N 1 -n 2 -x LD_LIBRARY_PATH hostname
>>
>> /path/to/orted: error while loading shared libraries: LIBRARY.so: cannot
>> open shared object file: No such file or directory
>>
>> 
>> --
>>
>> ORTE was unable to reliably start one or more daemons.
>>
>> This usually is caused by:
>>
>> * not finding the required libraries and/or binaries on
>>
>> one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>
>> settings, or configure OMPI with --enable-orterun-prefix-by-default
>>
>> * lack of authority to execute on one or more specified nodes.
>>
>> Please verify your allocation and authorities.
>>
>> * the inability to write startup files into /tmp
>> (--tmpdir/orte_tmpdir_base).
>>
>> Please check with your sys admin to determine the correct location to use.
>>
>> * compilation of the orted with dynamic libraries when static are required
>>
>> (e.g., on Cray). Please check your configure cmd line and consider using
>>
>> one of the contrib/platform definitions for your system type.
>>
>> * an inability to create a connection back to mpirun due to a
>>
>> lack of common network interfaces and/or no route found between
>>
>> them. Please check network connectivity (including firewalls
>>
>> and network routing requirements).
>>
>> 
>> --
>>
>> How do I get around this cleanly? This works just fine when I set
>> LD_LIBRARY_PATH in my .bashrc, but I’d rather not pollute that if I can
>> avoid it.
>>
>> --
>>
>> Gary Jackson, Ph.D.
>>
>> Johns Hopkins University Applied Physics Laboratory
>>
>>
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Error building openmpi on Raspberry pi 2

2017-09-27 Thread John Hearns via users
This might be of interest for ARM users:
https://developer.arm.com/products/software-development-tools/hpc/arm-compiler-for-hpc



On 27 September 2017 at 06:58, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Faraz,
>
> which OS are you running ?
>
> iirc, i faced similar issues, and the root cause is that though ARMv7
> does support these instructions, the compiler only generate ARMv6 code
> and hence failed to build Open MPI
>
> Cheers,
>
> Gilles
>
> On Wed, Sep 27, 2017 at 10:32 AM, Faraz Hussain 
> wrote:
> > I am receiving the make errors below on my pi 2:
> >
> > pi@pi001:~/openmpi-2.1.1 $ uname -a
> > Linux pi001 4.9.35-v7+ #1014 SMP Fri Jun 30 14:47:43 BST 2017 armv7l
> > GNU/Linux
> >
> > pi@pi001:~/openmpi-2.1.1 $ make -j 4
> > .
> > .
> > .
> > .
> > make[2]: Entering directory '/home/pi/openmpi-2.1.1/opal/asm'
> >   CPPASatomic-asm.lo
> > atomic-asm.S: Assembler messages:
> > atomic-asm.S:7: Error: selected processor does not support ARM mode `dmb'
> > atomic-asm.S:15: Error: selected processor does not support ARM mode
> `dmb'
> > atomic-asm.S:23: Error: selected processor does not support ARM mode
> `dmb'
> > atomic-asm.S:55: Error: selected processor does not support ARM mode
> `dmb'
> > atomic-asm.S:70: Error: selected processor does not support ARM mode
> `dmb'
> > atomic-asm.S:86: Error: selected processor does not support ARM mode
> `ldrexd
> > r4,r5,[r0]'
> > atomic-asm.S:91: Error: selected processor does not support ARM mode
> `strexd
> > r1,r6,r7,[r0]'
> > atomic-asm.S:107: Error: selected processor does not support ARM mode
> > `ldrexd r4,r5,[r0]'
> > atomic-asm.S:112: Error: selected processor does not support ARM mode
> > `strexd r1,r6,r7,[r0]'
> > atomic-asm.S:115: Error: selected processor does not support ARM mode
> `dmb'
> > atomic-asm.S:130: Error: selected processor does not support ARM mode
> > `ldrexd r4,r5,[r0]'
> > atomic-asm.S:135: Error: selected processor does not support ARM mode
> `dmb'
> > atomic-asm.S:136: Error: selected processor does not support ARM mode
> > `strexd r1,r6,r7,[r0]'
> > Makefile:1743: recipe for target 'atomic-asm.lo' failed
> > make[2]: *** [atomic-asm.lo] Error 1
> > make[2]: Leaving directory '/home/pi/openmpi-2.1.1/opal/asm'
> > Makefile:2307: recipe for target 'all-recursive' failed
> > make[1]: *** [all-recursive] Error 1
> > make[1]: Leaving directory '/home/pi/openmpi-2.1.1/opal'
> > Makefile:1806: recipe for target 'all-recursive' failed
> > make: *** [all-recursive] Error 1
> >
> >
> > ___
> > users mailing list
> > users@lists.open-mpi.org
> > https://lists.open-mpi.org/mailman/listinfo/users
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] OpenMPI 3.0.1 - mpirun hangs with 2 hosts

2018-05-14 Thread John Hearns via users
One very, very stupid question here. This arose over on the Slurm list
actually.
Those hostnames look like quite generic names, ie they are part of an HPC
cluster?
Do they happen to have independent home directories for your userid?
Could that possibly make a difference to the MPI launcher?

On 14 May 2018 at 06:44, Max Mellette  wrote:

> Hi Gilles,
>
> Thanks for the suggestions; the results are below. Any ideas where to go
> from here?
>
> - Seems that selinux is not installed:
>
> user@b09-30:~$ sestatus
> The program 'sestatus' is currently not installed. You can install it by
> typing:
> sudo apt install policycoreutils
>
> - Output from orted:
>
> user@b09-30:~$ /usr/bin/ssh -x b09-32 orted
> [b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> ess_env_module.c at line 147
> [b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file
> util/session_dir.c at line 106
> [b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file
> util/session_dir.c at line 345
> [b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file
> base/ess_base_std_orted.c at line 270
> --
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   orte_session_dir failed
>   --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
> --
>
> - iptables rules:
>
> user@b09-30:~$ sudo iptables -L
> Chain INPUT (policy ACCEPT)
> target prot opt source   destination
> ufw-before-logging-input  all  --  anywhere anywhere
> ufw-before-input  all  --  anywhere anywhere
> ufw-after-input  all  --  anywhere anywhere
> ufw-after-logging-input  all  --  anywhere anywhere
> ufw-reject-input  all  --  anywhere anywhere
> ufw-track-input  all  --  anywhere anywhere
>
> Chain FORWARD (policy ACCEPT)
> target prot opt source   destination
> ufw-before-logging-forward  all  --  anywhere anywhere
> ufw-before-forward  all  --  anywhere anywhere
> ufw-after-forward  all  --  anywhere anywhere
> ufw-after-logging-forward  all  --  anywhere anywhere
> ufw-reject-forward  all  --  anywhere anywhere
> ufw-track-forward  all  --  anywhere anywhere
>
> Chain OUTPUT (policy ACCEPT)
> target prot opt source   destination
> ufw-before-logging-output  all  --  anywhere anywhere
> ufw-before-output  all  --  anywhere anywhere
> ufw-after-output  all  --  anywhere anywhere
> ufw-after-logging-output  all  --  anywhere anywhere
> ufw-reject-output  all  --  anywhere anywhere
> ufw-track-output  all  --  anywhere anywhere
>
> Chain ufw-after-forward (1 references)
> target prot opt source   destination
>
> Chain ufw-after-input (1 references)
> target prot opt source   destination
>
> Chain ufw-after-logging-forward (1 references)
> target prot opt source   destination
>
> Chain ufw-after-logging-input (1 references)
> target prot opt source   destination
>
> Chain ufw-after-logging-output (1 references)
> target prot opt source   destination
>
> Chain ufw-after-output (1 references)
> target prot opt source   destination
>
> Chain ufw-before-forward (1 references)
> target prot opt source   destination
>
> Chain ufw-before-input (1 references)
> target prot opt source   destination
>
> Chain ufw-before-logging-forward (1 references)
> target prot opt source   destination
>
> Chain ufw-before-logging-input (1 references)
> target prot opt source   destination
>
> Chain ufw-before-logging-output (1 references)
> target prot opt source   destination
>
> Chain ufw-before-output (1 references)
> target prot opt source   destination
>
> Chain ufw-reject-forward (1 references)
> target prot opt source   destination
>
> Chain ufw-reject-input (1 references)
> target prot opt source   destination
>
> Chain ufw-reject-output (1 references)
> target prot opt source   destination
>
> Chain ufw-track-forward (1 references)
> target prot opt source   destination
>
> Chain ufw-track-input (1 references)
> target prot opt source   destination
>
> Chain ufw-track-output (1 references)
> 

Re: [OMPI users] peformance abnormality with openib and tcp framework

2018-05-14 Thread John Hearns via users
Xie Bin,  I do hate to ask this.  You say  "in a two-node cluster (IB
direcet-connected). "
Does that mean that you have no IB switch, and that there is a single IB
cable joining up these two servers?
If so please run:  ibstatus, ibhosts, ibdiagnet
I am trying to check if the IB fabric is functioning properly in that
situation.
(Also need to check if there is a Subnet Manager - so run  sminfo)

But you do say that the IMB test gives good results for IB, so you must
have IB working properly.
Therefore I am an idiot...



On 14 May 2018 at 11:04, Blade Shieh  wrote:

>
> Hi, Nathan:
> Thanks for you reply.
> 1) It was my mistake not to notice usage of osu_latency. Now it worked
> well, but still poorer in openib.
> 2) I did not use sm or vader because I wanted to check performance between
> tcp and openib. Besides, I will run the application in cluster, so vader is
> not so important.
> 3) Of course, I tried you suggestions. I used ^tcp/^openib and set
> btl_openib_if_include to mlx5_0 in a two-node cluster (IB
> direcet-connected).  The result did not change -- IB still better in MPI
> benchmark but poorer in my applicaion.
>
> Best Regards,
> Xie Bin
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] peformance abnormality with openib and tcp framework

2018-05-15 Thread John Hearns via users
Xie,   as far as I know you need to run OpenSM even on two hosts.

On 15 May 2018 at 03:29, Blade Shieh <bladesh...@gmail.com> wrote:

> Hi, John:
>
> You are right on the network framework. I do have no IB switch and just
> connect the servers with an IB cable. I did not even open the opensmd
> service because it seems unnecessary in this situation. Can this be the
> reason why IB performs poorer?
>
> Interconnection details are in the attachment.
>
>
>
> Best Regards,
>
> Xie Bin
>
>
> John Hearns via users <users@lists.open-mpi.org> 于 2018年5月14日 周一 17:45写道:
>
>> Xie Bin,  I do hate to ask this.  You say  "in a two-node cluster (IB
>> direcet-connected). "
>> Does that mean that you have no IB switch, and that there is a single IB
>> cable joining up these two servers?
>> If so please run:  ibstatus, ibhosts, ibdiagnet
>> I am trying to check if the IB fabric is functioning properly in that
>> situation.
>> (Also need to check if there is a Subnet Manager - so run  sminfo)
>>
>> But you do say that the IMB test gives good results for IB, so you must
>> have IB working properly.
>> Therefore I am an idiot...
>>
>>
>>
>> On 14 May 2018 at 11:04, Blade Shieh <bladesh...@gmail.com> wrote:
>>
>>>
>>> Hi, Nathan:
>>> Thanks for you reply.
>>> 1) It was my mistake not to notice usage of osu_latency. Now it worked
>>> well, but still poorer in openib.
>>> 2) I did not use sm or vader because I wanted to check performance
>>> between tcp and openib. Besides, I will run the application in cluster, so
>>> vader is not so important.
>>> 3) Of course, I tried you suggestions. I used ^tcp/^openib and set
>>> btl_openib_if_include to mlx5_0 in a two-node cluster (IB
>>> direcet-connected).  The result did not change -- IB still better in MPI
>>> benchmark but poorer in my applicaion.
>>>
>>> Best Regards,
>>> Xie Bin
>>>
>>> ___
>>> users mailing list
>>> users@lists.open-mpi.org
>>> https://lists.open-mpi.org/mailman/listinfo/users
>>>
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] problem

2018-05-09 Thread John Hearns via users
Ankita, it looks like your program is not launching correctly.
I would try the following:
define two hosts in a machinefile, then run:  mpirun -np 2 -hostfile machinefile date
i.e. can you use mpirun just to run the command 'date' on both hosts?

Secondly, compile and try to run an MPI 'Hello World' program.
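
If it helps, a minimal 'Hello World' along those lines might look like the
sketch below (only a sketch; the file name hello_mpi.c is just an example):

/* hello_mpi.c - minimal MPI test program */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                 /* start the MPI runtime      */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes  */
    MPI_Get_processor_name(host, &len);     /* node this rank runs on     */

    printf("Hello world from rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}

Compile it with  mpicc hello_mpi.c -o hello_mpi  and launch with
mpirun -np 2 -hostfile machinefile ./hello_mpi
If every rank reports in from both hosts, the launcher and the network are
basically working.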


On 9 May 2018 at 12:28, Ankita m  wrote:

> I am using ompi -3.1.0 version in my program and compiler is mpicc
>
> its a parallel program which uses multiple nodes with 16 cores in each
> node.
>
> but its not working and generates a error file . i Have attached the error
> file below.
>
> can anyone please tell what is the issue actually
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] need help finding mpi for Raspberry pi Raspian Streach

2018-05-30 Thread John Hearns via users
Forgive me for chipping in here. There is definitely momentum behind the
ARM architecture in HPC.
However it seems to me that there are a lot of architectures under the
'ARM' umbrella.
Does anyone have a simplified guide to what they all mean?



On 30 May 2018 at 02:26, Gilles Gouaillardet  wrote:

> Neil,
>
>
> If that does not work, please compress and post your config.log
>
>
> There used to be an issue with raspberry pi3 which is detected as an ARMv8
> processor but the raspbian compilers only generate
>
> ARMv6 compatible binaries.
>
>
> If such an issue occurs, you might want to
>
> configure CFLAGS=-march=armv6 LDFLAGS=-march=armv6
>
>
> and try again
>
>
> FWIW, I run openSuSE Leap for aarch64 (e.g. native ARMv8 processor) and
> have no issue building/using Open MPI
>
>
>
> Cheers,
>
> Gilles
>
>
> On 5/30/2018 9:03 AM, Jeff Squyres (jsquyres) wrote:
>
>> If your Linux distro does not have an Open MPI package readily available,
>> you can build Open MPI from source for an RPi fairly easily.  Something
>> like this (not tested on an RPi / YMMV):
>>
>> wget https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-
>> 3.1.0.tar.bz2
>> tar xf openmpi-3.1.0.tar.bz2
>> cd openmpi-3.1.0
>> ./configure |& tee config.out
>> make -j |& tee make.out
>> sudo make install |& tee install.out
>>
>> This will download, configure, build, and install Open MPI into the
>> /usr/local tree.
>>
>> You can optionally specify a prefix to have it install elsewhere, e.g.:
>>
>> ./configure --prefix=/path/to/where/you/want/it/to/install |& tee
>> config.out
>>
>> Then do the make/sudo make again.
>>
>>
>> On May 29, 2018, at 6:43 PM, Neil k8it  wrote:
>>>
>>> I  am starting to build a Raspberry pi cluster with MPI and I want to
>>> use the latest Raspian Streach Lite version from the raspberrypi.org
>>> website. After a lot of trials of watching youtubes on how to do this, I
>>> have found that the new version of Raspian Streach LITE is not compatible .
>>> I am looking for details instructions on how to install MPIwith this latest
>>> version of Raspian Streach Lite. I am using the newset hardware,RPI 3 B+
>>> which requires this OS to use the features on the  -new chipset
>>>   Thanks
>>> Neil
>>> ___
>>> users mailing list
>>> users@lists.open-mpi.org
>>> https://lists.open-mpi.org/mailman/listinfo/users
>>>
>>
>>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] MPI cartesian grid : cumulate a scalar value through the procs of a given axis of the grid

2018-05-02 Thread John Hearns via users
Peter,  how large are your models, ie how many cells in each direction?
Something inside of me is shouting that if the models are small enough then
MPI is not the way here.
Assuming use of a Xeon processor there should be some AVX instructions
which can do this.

This is rather out of date, but is it helpful?
https://www.quora.com/Is-there-an-SIMD-architecture-that-supports-horizontal-cumulative-sum-Prefix-sum-as-a-single-instruction

https://software.intel.com/sites/landingpage/IntrinsicsGuide/


On 2 May 2018 at 13:56, Peter Kjellström  wrote:

> On Wed, 2 May 2018 11:15:09 +0200
> Pierre Gubernatis  wrote:
>
> > Hello all...
> >
> > I am using a *cartesian grid* of processors which represents a spatial
> > domain (a cubic geometrical domain split into several smaller
> > cubes...), and I have communicators to address the procs, as for
> > example a comm along each of the 3 axes I,J,K, or along a plane
> > IK,JK,IJ, etc..).
> >
> > *I need to cumulate a scalar value (SCAL) through the procs which
> > belong to a given axis* (let's say the K axis, defined by I=J=0).
> >
> > Precisely, the origin proc 0-0-0 has a given value for SCAL (say
> > SCAL000). I need to update the 'following' proc (0-0-1) by doing SCAL
> > = SCAL + SCAL000, and I need to *propagate* this updating along the K
> > axis. At the end, the last proc of the axis should have the total sum
> > of SCAL over the axis. (and of course, at a given rank k along the
> > axis, the SCAL value = sum over 0,1,   K of SCAL)
> >
> > Please, do you see a way to do this ? I have tried many things (with
> > MPI_SENDRECV and by looping over the procs of the axis, but I get
> > deadlocks that prove I don't handle this correctly...)
> > Thank you in any case.
>
> Why did you try SENDRECV? As far as I understand your description above
> data only flows one direction (along K)?
>
> There is no MPI collective to support the kind of reduction you
> describe but it should not be hard to do using normal SEND and RECV.
> Something like (simplified psuedo code):
>
> if (not_first_along_K)
>  MPI_RECV(SCAL_tmp, previous)
>  SCAL += SCAL_tmp
>
> if (not_last_along_K)
>  MPI_SEND(SCAL, next)
>
> /Peter K
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
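
For what it is worth, here is a rough, untested C sketch of the chain Peter
describes above. It assumes the communicator along the K axis (in the real
code that would come from MPI_Cart_sub); MPI_COMM_WORLD stands in for it
here, and the names are only illustrative:

/* chain_sum.c - cumulate SCAL along a chain of ranks, as outlined above */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    double scal, scal_tmp;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    scal = (double)(rank + 1);            /* each rank's local value       */

    if (rank > 0) {                       /* not first along K             */
        MPI_Recv(&scal_tmp, 1, MPI_DOUBLE, rank - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        scal += scal_tmp;                 /* partial sum up to this rank   */
    }
    if (rank < size - 1)                  /* not last along K              */
        MPI_Send(&scal, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);

    printf("rank %d cumulative SCAL = %g\n", rank, scal);

    MPI_Finalize();
    return 0;
}

I believe MPI_Scan with MPI_SUM over the same per-axis communicator would
give the same inclusive partial sums in a single collective call.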
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] MPI cartesian grid : cumulate a scalar value through the procs of a given axis of the grid

2018-05-02 Thread John Hearns via users
Also my inner voice is shouting that there must be an easy way to express
this in Julia
https://discourse.julialang.org/t/apply-reduction-along-specific-axes/3301/16

OK, these are not the same stepwise cumulative operations that you want,
but the idea is close.


ps. Note to self - stop listening to the voices.


On 2 May 2018 at 14:08, John Hearns  wrote:

> Peter,  how large are your models, ie how many cells in each direction?
> Something inside of me is shouting that if the models are small enough
> then MPI is not the way here.
> Assuming use of a Xeon processor there should be some AVX instructions
> which can do this.
>
> This is rather out of date, but is it helpful?
> https://www.quora.com/Is-there-an-SIMD-architecture-that-supports-horizontal-cumulative-sum-Prefix-sum-as-a-single-instruction
>
> https://software.intel.com/sites/landingpage/IntrinsicsGuide/
>
>
> On 2 May 2018 at 13:56, Peter Kjellström  wrote:
>
>> On Wed, 2 May 2018 11:15:09 +0200
>> Pierre Gubernatis  wrote:
>>
>> > Hello all...
>> >
>> > I am using a *cartesian grid* of processors which represents a spatial
>> > domain (a cubic geometrical domain split into several smaller
>> > cubes...), and I have communicators to address the procs, as for
>> > example a comm along each of the 3 axes I,J,K, or along a plane
>> > IK,JK,IJ, etc..).
>> >
>> > *I need to cumulate a scalar value (SCAL) through the procs which
>> > belong to a given axis* (let's say the K axis, defined by I=J=0).
>> >
>> > Precisely, the origin proc 0-0-0 has a given value for SCAL (say
>> > SCAL000). I need to update the 'following' proc (0-0-1) by doing SCAL
>> > = SCAL + SCAL000, and I need to *propagate* this updating along the K
>> > axis. At the end, the last proc of the axis should have the total sum
>> > of SCAL over the axis. (and of course, at a given rank k along the
>> > axis, the SCAL value = sum over 0,1,   K of SCAL)
>> >
>> > Please, do you see a way to do this ? I have tried many things (with
>> > MPI_SENDRECV and by looping over the procs of the axis, but I get
>> > deadlocks that prove I don't handle this correctly...)
>> > Thank you in any case.
>>
>> Why did you try SENDRECV? As far as I understand your description above
>> data only flows one direction (along K)?
>>
>> There is no MPI collective to support the kind of reduction you
>> describe but it should not be hard to do using normal SEND and RECV.
>> Something like (simplified psuedo code):
>>
>> if (not_first_along_K)
>>  MPI_RECV(SCAL_tmp, previous)
>>  SCAL += SCAL_tmp
>>
>> if (not_last_along_K)
>>  MPI_SEND(SCAL, next)
>>
>> /Peter K
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>>
>
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] MPI cartesian grid : cumulate a scalar value through the procs of a given axis of the grid

2018-05-02 Thread John Hearns via users
Pierre, I may not be able to help you directly. But I had better stop
listening to the voices.
Mail me off list please.

This might do the trick using Julia
http://juliadb.org/latest/api/aggregation.html

On 2 May 2018 at 14:11, John Hearns  wrote:

> Also my inner voice is shouting that there must be an easy way to express
> this in Julia
> https://discourse.julialang.org/t/apply-reduction-along-
> specific-axes/3301/16
>
> OK, these are not the same stepwise cumulative operatiosn that you want,
> but the idea is close.
>
>
> ps. Note to self - stop listening to the voices.
>
>
> On 2 May 2018 at 14:08, John Hearns  wrote:
>
>> Peter,  how large are your models, ie how many cells in each direction?
>> Something inside of me is shouting that if the models are small enough
>> then MPI is not the way here.
>> Assuming use of a Xeon processor there should be some AVX instructions
>> which can do this.
>>
>> This is rather out of date, but is it helpful?
>> https://www.quora.com/Is-there-an-SIMD-architecture-that-supports-horizontal-cumulative-sum-Prefix-sum-as-a-single-instruction
>>
>> https://software.intel.com/sites/landingpage/IntrinsicsGuide/
>>
>>
>> On 2 May 2018 at 13:56, Peter Kjellström  wrote:
>>
>>> On Wed, 2 May 2018 11:15:09 +0200
>>> Pierre Gubernatis  wrote:
>>>
>>> > Hello all...
>>> >
>>> > I am using a *cartesian grid* of processors which represents a spatial
>>> > domain (a cubic geometrical domain split into several smaller
>>> > cubes...), and I have communicators to address the procs, as for
>>> > example a comm along each of the 3 axes I,J,K, or along a plane
>>> > IK,JK,IJ, etc..).
>>> >
>>> > *I need to cumulate a scalar value (SCAL) through the procs which
>>> > belong to a given axis* (let's say the K axis, defined by I=J=0).
>>> >
>>> > Precisely, the origin proc 0-0-0 has a given value for SCAL (say
>>> > SCAL000). I need to update the 'following' proc (0-0-1) by doing SCAL
>>> > = SCAL + SCAL000, and I need to *propagate* this updating along the K
>>> > axis. At the end, the last proc of the axis should have the total sum
>>> > of SCAL over the axis. (and of course, at a given rank k along the
>>> > axis, the SCAL value = sum over 0,1,   K of SCAL)
>>> >
>>> > Please, do you see a way to do this ? I have tried many things (with
>>> > MPI_SENDRECV and by looping over the procs of the axis, but I get
>>> > deadlocks that prove I don't handle this correctly...)
>>> > Thank you in any case.
>>>
>>> Why did you try SENDRECV? As far as I understand your description above
>>> data only flows one direction (along K)?
>>>
>>> There is no MPI collective to support the kind of reduction you
>>> describe but it should not be hard to do using normal SEND and RECV.
>>> Something like (simplified psuedo code):
>>>
>>> if (not_first_along_K)
>>>  MPI_RECV(SCAL_tmp, previous)
>>>  SCAL += SCAL_tmp
>>>
>>> if (not_last_along_K)
>>>  MPI_SEND(SCAL, next)
>>>
>>> /Peter K
>>> ___
>>> users mailing list
>>> users@lists.open-mpi.org
>>> https://lists.open-mpi.org/mailman/listinfo/users
>>>
>>
>>
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] MPI cartesian grid : cumulate a scalar value through the procs of a given axis of the grid

2018-05-02 Thread John Hearns via users
Peter is correct. We need to find out what K is.
But we may never find out https://en.wikipedia.org/wiki/The_Trial

It would be fun if we could get some real-world dimensions here and some
real-world numbers.
What range of numbers are these also?

On 2 May 2018 at 15:21, Peter Kjellström  wrote:

> On Wed, 2 May 2018 08:39:30 -0400
> Charles Antonelli  wrote:
>
> > This seems to be crying out for MPI_Reduce.
>
> No, the described reduction cannot be implemented with MPI_Reduce (note
> the need for partial sums along the axis).
>
> > Also in the previous solution given, I think you should do the
> > MPI_Sends first. Doing the MPI_Receives first forces serialization.
>
> It needs that. The first thing that happens is that the first rank
> skips the recv and sends its SCAL to the 2nd process that just posted
> recv.
>
> Each process needs to complete the recv to know what to send (unless
> you split it out into many more sends which is possible).
>
> What's the best solution depends on if this part is performance
> critical and how large K is.
>
> /Peter K
>
> > Regards,
> > Charles
> ...
> > > Something like (simplified psuedo code):
> > >
> > > if (not_first_along_K)
> > > MPI_RECV(SCAL_tmp, previous)
> > > SCAL += SCAL_tmp
> > >
> > > if (not_last_along_K)
> > > MPI_SEND(SCAL, next)
> > >
> > > /Peter K
> > > ___
> > > users mailing list
> > > users@lists.open-mpi.org
> > > https://lists.open-mpi.org/mailman/listinfo/users
> > >
>
>
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] MPI advantages over PBS

2018-08-25 Thread John Hearns via users
Diego,
I am sorry, but you are mixing up two different things here. PBS is a resource
allocation system (a batch scheduler). It will reserve the use of a compute
server, or several compute servers, for you to run your parallel job on. PBS
can also launch the MPI job - there are several mechanisms for launching
parallel jobs.
MPI is an API for parallel programming - strictly speaking a standard, with
libraries such as Open MPI implementing it.

One piece of advice I would have is that you can run MPI programs from the
command line. So Google for 'Hello World MPI'. Write your first MPI program
then use mpirun from the command line.

If you have a cluster which has the PBS batch system you can then use PBS
to run your MPI program.
If that is not clear please let us know what help you need.











On Sat, 25 Aug 2018 at 06:54, Diego Avesani  wrote:

> Dear all,
>
> I have a philosophical question.
>
> I am reading a lot of papers where people use Portable Batch System or job
> scheduler in order to parallelize their code.
>
> What are the advantages in using MPI instead?
>
> I am writing a report on my code, where of course I use openMPI. So tell
> me please how can I cite you. You deserve all the credits.
>
> Thanks a lot,
> Thanks again,
>
>
> Diego
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] RDMA over Ethernet in Open MPI - RoCE on AWS?

2018-09-07 Thread John Hearns via users
Ben, ping me off list. I know the guy who heads the HPC Solutions
Architect team for AWS and an AWS Solutions Architect here in the UK.
On Fri, 7 Sep 2018 at 03:11, Benjamin Brock  wrote:
>
> I'm setting up a cluster on AWS, which will have a 10Gb/s or 25Gb/s Ethernet 
> network.  Should I expect to be able to get RoCE to work in Open MPI on AWS?
>
> More generally, what optimizations and performance tuning can I do to an Open 
> MPI installation to get good performance on an Ethernet network?
>
> My codes use a lot of random access AMOs and asynchronous block transfers, so 
> it seems to me like setting up RDMA over Ethernet would be essential to 
> getting good performance, but I can't seem to find much information about it 
> online.
>
> Any pointers you have would be appreciated.
>
> Ben
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] Fwd: problem in cluster

2018-04-25 Thread John Hearns via users
Ankita, this is a problem with your batch queuing system. Do you know which
batch system you are using on this cluster?
Can you share with us what command you use to submit a job?

Also please do not share your teamviewer password with us. I doubt this is
of much use to anyone, but...

On 25 April 2018 at 08:03, Ankita m  wrote:

> While using the open mpi got this error. Can you please tell why so
>
> -- Forwarded message -
> From: Ankita m 
> Date: Tue, 24 Apr 2018, 12:55 pm
> Subject: Re: problem in cluster
> To: sagar mcp , Krishna Singh 
>
>
> while using openmpi- 1.4.5 the program ended by showing this error
>
> On Tue, Apr 24, 2018 at 12:28 PM, Ankita m 
> wrote:
>
>> teamviewer id 565 248 412
>>
>> password   jfu477
>>  my contact number 7830622816
>>
>> On Tue, Apr 24, 2018 at 12:18 PM, Sagar Naik  wrote:
>>
>>> Share your contact details
>>>
>>>
>>>
>>>
>>>
>>> *Thanks & Regards,*
>>>
>>>
>>>
>>> *Sagar Vijay Naik*
>>>
>>> *Sr. Customer Support Engineer*
>>>
>>> [image: Description: Logo][image: Description:
>>> cid:image002.jpg@01D084CC.4DBB74F0]
>>>
>>> Address: 17/18 Navketan Estate | Opp. Onida House | Mahakali Caves Road
>>> |Andheri ( East) | Mumbai - 400 093.
>>>
>>> Email: sa...@mpcl.in | Mobile No: +91 9969478594 | Board Line (D) :
>>> 022-40956342 |Fax No: 022- 6870250 |URL – www.mpcl.in | Follw us on : 
>>> [image:
>>> Description: download][image: Description: Facebook]
>>>
>>> P   Please don't print this e-mail unless you really need to.
>>>
>>>
>>>
>>> *From:* Ankita m [mailto:ankitamait...@gmail.com]
>>> *Sent:* 24 April 2018 10:52
>>> *To:* sagar mcp ; Krishna Singh 
>>> *Subject:* Fwd: problem in cluster
>>>
>>>
>>>
>>>
>>>
>>> -- Forwarded message --
>>> From: *Ankita m* 
>>> Date: Mon, Apr 23, 2018 at 4:18 PM
>>> Subject: problem in cluster
>>> To: sagar mcp 
>>>
>>> Hello Sir
>>>
>>>
>>>
>>> I am Ankita Maity from Mechanical Department IIT Roorkee.
>>>
>>>
>>>
>>> I am facing problem while submitting a job . all the programs
>>> automatically are going to either queue or status is showing "H". Please
>>> help sir .
>>>
>>>
>>>
>>> my program folder is Home/ankitamed/MarineTurbine1/
>>>
>>>
>>>
>>> Team viewer id password are
>>>
>>>
>>>
>>> 565 248 412
>>>
>>> jfu477
>>>
>>>
>>>
>>> Regards
>>>
>>> Ankita
>>>
>>>
>>>
>>
>>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Fwd: Fwd: problem in cluster

2018-04-25 Thread John Hearns via users
Ankita, please read here: https://www.open-mpi.org/faq/?category=mpi-apps

On 25 April 2018 at 11:44, Ankita m <ankitamait...@gmail.com> wrote:

> Can you please tell me whether to use mpicc compiler ar any other compiler
> for openmpi programs
>
> On Wed, Apr 25, 2018 at 3:13 PM, Ankita m <ankitamait...@gmail.com> wrote:
>
>> i have 16 cores per one node. I usually use 4 node each node has 16 cores
>> so total 64 processes.
>>
>> On Wed, Apr 25, 2018 at 2:57 PM, John Hearns via users <
>> users@lists.open-mpi.org> wrote:
>>
>>> I do not see much wrong with that.
>>> However nodes=4  ppn=2  makes  8 processes in all.
>>> You are using mpirun -np 64
>>>
>>> Actually it is better practice to use the PBS supplied environment
>>> variables during the job, rather than hard-wiring   64
>>> I dont have access to a PBS cluster from my desk at the moment.
>>> You could also investigate using  mpiprocs=2  Then I think with openmpi
>>> if it has compiled in PBS support all you would have to do is
>>> mpirun
>>>
>>> Are you sure your compute servers only have two cores ??
>>>
>>> I also see that you are commenting out the module load openmpi-3.0.1   I
>>> would guess you want the default Opnempi, which is OK
>>>
>>> First thing I would do, before the mpirun line in that job script:
>>>
>>> which mpirun(check that you are picking up an Openmpi version)
>>>
>>> ldd ./cgles  (check you are bringing in the libraries that you should)
>>>
>>>
>>> Also run mpirun with the verbose flag  -v
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 25 April 2018 at 11:10, Ankita m <ankitamait...@gmail.com> wrote:
>>>
>>>>
>>>>> while using openmpi- 1.4.5 the program ended by showing this error
>>>>> file (in the attachment)
>>>>>
>>>>
>>>>  I am Using PBS file . Below u can find the script that i am using to
>>>> run my program
>>>>
>>>> ___
>>>> users mailing list
>>>> users@lists.open-mpi.org
>>>> https://lists.open-mpi.org/mailman/listinfo/users
>>>>
>>>
>>>
>>> ___
>>> users mailing list
>>> users@lists.open-mpi.org
>>> https://lists.open-mpi.org/mailman/listinfo/users
>>>
>>
>>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Fwd: Fwd: problem in cluster

2018-04-25 Thread John Hearns via users
I do not see much wrong with that.
However nodes=4  ppn=2  makes  8 processes in all.
You are using mpirun -np 64

Actually it is better practice to use the PBS supplied environment
variables during the job, rather than hard-wiring 64.
I don't have access to a PBS cluster from my desk at the moment.
You could also investigate using mpiprocs=2. Then I think with openmpi, if
it has been compiled with PBS support, all you would have to do is run
mpirun with no -np at all.

Are you sure your compute servers only have two cores ??

I also see that you are commenting out the module load openmpi-3.0.1. I
would guess you want the default OpenMPI, which is OK.

First thing I would do, before the mpirun line in that job script:

which mpirun  (check that you are picking up an OpenMPI version)

ldd ./cgles  (check you are bringing in the libraries that you should)


Also run mpirun with the verbose flag  -v




























On 25 April 2018 at 11:10, Ankita m  wrote:

>
>> while using openmpi- 1.4.5 the program ended by showing this error file
>> (in the attachment)
>>
>
>  I am Using PBS file . Below u can find the script that i am using to run
> my program
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Fwd: Fwd: problem in cluster

2018-04-25 Thread John Hearns via users
That's fine. But in your job script ppn=2.

Also check   ldd cgles  on the compute servers themselves.
Are all the libraries available in your path?


On 25 April 2018 at 11:43, Ankita m <ankitamait...@gmail.com> wrote:

> i have 16 cores per one node. I usually use 4 node each node has 16 cores
> so total 64 processes.
>
> On Wed, Apr 25, 2018 at 2:57 PM, John Hearns via users <
> users@lists.open-mpi.org> wrote:
>
>> I do not see much wrong with that.
>> However nodes=4  ppn=2  makes  8 processes in all.
>> You are using mpirun -np 64
>>
>> Actually it is better practice to use the PBS supplied environment
>> variables during the job, rather than hard-wiring   64
>> I dont have access to a PBS cluster from my desk at the moment.
>> You could also investigate using  mpiprocs=2  Then I think with openmpi
>> if it has compiled in PBS support all you would have to do is
>> mpirun
>>
>> Are you sure your compute servers only have two cores ??
>>
>> I also see that you are commenting out the module load openmpi-3.0.1   I
>> would guess you want the default Opnempi, which is OK
>>
>> First thing I would do, before the mpirun line in that job script:
>>
>> which mpirun(check that you are picking up an Openmpi version)
>>
>> ldd ./cgles  (check you are bringing in the libraries that you should)
>>
>>
>> Also run mpirun with the verbose flag  -v
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On 25 April 2018 at 11:10, Ankita m <ankitamait...@gmail.com> wrote:
>>
>>>
>>>> while using openmpi- 1.4.5 the program ended by showing this error file
>>>> (in the attachment)
>>>>
>>>
>>>  I am Using PBS file . Below u can find the script that i am using to
>>> run my program
>>>
>>> ___
>>> users mailing list
>>> users@lists.open-mpi.org
>>> https://lists.open-mpi.org/mailman/listinfo/users
>>>
>>
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Old version openmpi 1.2 support infiniband?

2018-03-21 Thread John Hearns via users
Kaiming,  good luck with your project.  I think you should contact Barry
Rountree directly. You will probably get good advice!

It is worth saying that with Turboboost there is variation between each
individual CPU die, even within the same SKU.
What Turboboost does is to set a thermal envelope, and the CPU core(s) ramp
up in frequency till the thermal limit is reached.
So each CPU die is slightly different  (*)
Indeed in my last job we had a benchmarking exercise where the instruction
was to explicitly turn off Turboboost.


(*) As I work at ASML I really should understand this better... I really
should.






On 20 March 2018 at 19:34, Kaiming Ouyang <kouya...@ucr.edu> wrote:

> Hi John,
> Thank you for your advice. But this is only related to its functionality,
> and right now my problem is it cannot compile with new version openmpi.
> The reason may come from its patch file since it needs to intercept MPI
> calls to profile some data. New version openmpi may change its framework so
> that this old software does not fit it anymore.
>
>
> Kaiming Ouyang, Research Assistant.
> Department of Computer Science and Engineering
> University of California, Riverside
> 900 University Avenue, Riverside, CA 92521
>
>
> On Tue, Mar 20, 2018 at 10:46 AM, John Hearns via users <
> users@lists.open-mpi.org> wrote:
>
>> "It does not handle more recent improvements such as Intel's turbo
>> mode and the processor performance inhomogeneity that comes with it."
>> I guess it is easy enough to disable Turbo mode in the BIOS though.
>>
>>
>>
>> On 20 March 2018 at 17:48, Kaiming Ouyang <kouya...@ucr.edu> wrote:
>>
>>> I think the problem it has is it only deals with the old
>>> framework because it will intercept MPI calls and do some profiling. Here
>>> is the library:
>>> https://github.com/LLNL/Adagio
>>>
>>> I checked the openmpi changelog. From openmpi 1.3, it began to switch to
>>> a new framework, and openmpi 1.4+ has different one too. This library only
>>> works under openmpi 1.2.
>>> Thank you for your advice, I will try it. My current problem is this
>>> library seems to try to patch mpi.h file, but it fails during the patching
>>> process for new version openmpi. I don't know the reason yet, and will
>>> check it soon. Thank you.
>>>
>>> Kaiming Ouyang, Research Assistant.
>>> Department of Computer Science and Engineering
>>> University of California, Riverside
>>> 900 University Avenue, Riverside, CA 92521
>>>
>>>
>>> On Tue, Mar 20, 2018 at 4:35 AM, Jeff Squyres (jsquyres) <
>>> jsquy...@cisco.com> wrote:
>>>
>>>> On Mar 19, 2018, at 11:32 PM, Kaiming Ouyang <kouya...@ucr.edu> wrote:
>>>> >
>>>> > Thank you.
>>>> > I am using newest version HPL.
>>>> > I forgot to say I can run HPL with openmpi-3.0 under infiniband. The
>>>> reason I want to use old version is I need to compile a library that only
>>>> supports old version openmpi, so I am trying to do this tricky job.
>>>>
>>>> Gotcha.
>>>>
>>>> Is there something in particular about the old library that requires
>>>> Open MPI v1.2.x?
>>>>
>>>> More specifically: is there a particular error you get when you try to
>>>> use Open MPI v3.0.0 with that library?
>>>>
>>>> I ask because if the app supports the MPI API in Open MPI v1.2.9, then
>>>> it also supports the MPI API in Open MPI v3.0.0.  We *have* changed lots of
>>>> other things under the covers in that time, such as:
>>>>
>>>> - how those MPI API's are implemented
>>>> - mpirun (and friends) command line parameters
>>>> - MCA parameters
>>>> - compilation flags
>>>>
>>>> But many of those things might actually be mostly -- if not entirely --
>>>> hidden from a library that uses MPI.
>>>>
>>>> My point: it may be easier to get your library to use a newer version
>>>> of Open MPI than you think.  For example, if the library has some
>>>> hard-coded flags in their configure/Makefile to build with Open MPI, just
>>>> replace those flags with `mpicc --showme:BLAH` variants (see `mpicc
>>>> --showme:help` for a full listing).  This will have Open MPI tell you
>>>> exactly what flags it needs to compile, link, etc.
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>>
>>>> ___
>>>> users mailing list
>>>> users@lists.open-mpi.org
>>>> https://lists.open-mpi.org/mailman/listinfo/users
>>>>
>>>
>>>
>>> ___
>>> users mailing list
>>> users@lists.open-mpi.org
>>> https://lists.open-mpi.org/mailman/listinfo/users
>>>
>>
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Old version openmpi 1.2 support infiniband?

2018-03-20 Thread John Hearns via users
"It does not handle more recent improvements such as Intel's turbo
mode and the processor performance inhomogeneity that comes with it."
I guess it is easy enough to disable Turbo mode in the BIOS though.



On 20 March 2018 at 17:48, Kaiming Ouyang  wrote:

> I think the problem it has is it only deals with the old framework because
> it will intercept MPI calls and do some profiling. Here is the library:
> https://github.com/LLNL/Adagio
>
> I checked the openmpi changelog. From openmpi 1.3, it began to switch to a
> new framework, and openmpi 1.4+ has different one too. This library only
> works under openmpi 1.2.
> Thank you for your advice, I will try it. My current problem is this
> library seems to try to patch mpi.h file, but it fails during the patching
> process for new version openmpi. I don't know the reason yet, and will
> check it soon. Thank you.
>
> Kaiming Ouyang, Research Assistant.
> Department of Computer Science and Engineering
> University of California, Riverside
> 900 University Avenue, Riverside, CA 92521
>
>
> On Tue, Mar 20, 2018 at 4:35 AM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
>
>> On Mar 19, 2018, at 11:32 PM, Kaiming Ouyang  wrote:
>> >
>> > Thank you.
>> > I am using newest version HPL.
>> > I forgot to say I can run HPL with openmpi-3.0 under infiniband. The
>> reason I want to use old version is I need to compile a library that only
>> supports old version openmpi, so I am trying to do this tricky job.
>>
>> Gotcha.
>>
>> Is there something in particular about the old library that requires Open
>> MPI v1.2.x?
>>
>> More specifically: is there a particular error you get when you try to
>> use Open MPI v3.0.0 with that library?
>>
>> I ask because if the app supports the MPI API in Open MPI v1.2.9, then it
>> also supports the MPI API in Open MPI v3.0.0.  We *have* changed lots of
>> other things under the covers in that time, such as:
>>
>> - how those MPI API's are implemented
>> - mpirun (and friends) command line parameters
>> - MCA parameters
>> - compilation flags
>>
>> But many of those things might actually be mostly -- if not entirely --
>> hidden from a library that uses MPI.
>>
>> My point: it may be easier to get your library to use a newer version of
>> Open MPI than you think.  For example, if the library has some hard-coded
>> flags in their configure/Makefile to build with Open MPI, just replace
>> those flags with `mpicc --showme:BLAH` variants (see `mpicc --showme:help`
>> for a full listing).  This will have Open MPI tell you exactly what flags
>> it needs to compile, link, etc.
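
As an illustration of those --showme variants (these are standard options of the Open MPI compiler wrappers):

    mpicc --showme:compile   # preprocessor/compile flags Open MPI needs
    mpicc --showme:link      # linker flags and libraries Open MPI needs
    mpicc --showme           # the full underlying compiler command line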
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Cannot run MPI code on multiple cores with PBS

2018-10-09 Thread John Hearns via users
Michele, as others have said, libibverbs.so.1 is not in your library path.
Can you ask the person who manages your cluster where libibverbs is
located on the compute nodes?
Also try to run ibv_devinfo
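
A quick way to check on a compute node, as a sketch (the extra library directory is only an example of where a site might put it):

    ldconfig -p | grep libibverbs        # is libibverbs.so.1 known to the dynamic linker?
    # if it lives somewhere non-standard, add that directory before launching, e.g.
    export LD_LIBRARY_PATH=/opt/ofed/lib64:$LD_LIBRARY_PATH
    ibv_devinfo                          # should list the HCA if the verbs stack is usable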

On Tue, 9 Oct 2018 at 16:03, Castellana Michele
 wrote:
>
> Dear John,
> Thank you for your reply. Here is the output of ldd
>
> $ ldd ./code.io
> linux-vdso.so.1 =>  (0x7ffcc759f000)
> liblapack.so.3 => /usr/lib64/liblapack.so.3 (0x7fbc1c613000)
> libgsl.so.0 => /usr/lib64/libgsl.so.0 (0x7fbc1c1ea000)
> libgslcblas.so.0 => /usr/lib64/libgslcblas.so.0 (0x7fbc1bfad000)
> libmpi.so.40 => /data/users/xx/openmpi/lib/libmpi.so.40 (0x7fbc1bcad000)
> libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x7fbc1b9a6000)
> libm.so.6 => /usr/lib64/libm.so.6 (0x7fbc1b6a4000)
> libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x7fbc1b48e000)
> libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x7fbc1b272000)
> libc.so.6 => /usr/lib64/libc.so.6 (0x7fbc1aea5000)
> libblas.so.3 => /usr/lib64/libblas.so.3 (0x7fbc1ac4c000)
> libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x7fbc1a92a000)
> libsatlas.so.3 => /usr/lib64/atlas/libsatlas.so.3 (0x7fbc19cdd000)
> libopen-rte.so.40 => /data/users/xx/openmpi/lib/libopen-rte.so.40 
> (0x7fbc19a2d000)
> libopen-pal.so.40 => /data/users/xx/openmpi/lib/libopen-pal.so.40 
> (0x7fbc19733000)
> libdl.so.2 => /usr/lib64/libdl.so.2 (0x7fbc1952f000)
> librt.so.1 => /usr/lib64/librt.so.1 (0x7fbc19327000)
> libutil.so.1 => /usr/lib64/libutil.so.1 (0x7fbc19124000)
> libz.so.1 => /usr/lib64/libz.so.1 (0x7fbc18f0e000)
> /lib64/ld-linux-x86-64.so.2 (0x7fbc1cd7)
> libquadmath.so.0 => /usr/lib64/libquadmath.so.0 (0x7fbc18cd2000)
>
> and the one for the PBS version
>
> $   qstat --version
> Version: 6.1.2
> Commit: 661e092552de43a785c15d39a3634a541d86898e
>
> After I created the symbolic links libcrypto.so.0.9.8  libssl.so.0.9.8, I 
> still have one error message left from MPI:
>
> mca_base_component_repository_open: unable to open mca_btl_openib: 
> libibverbs.so.1: cannot open shared object file: No such file or directory 
> (ignored)
>
> Please let me know if you have any suggestions.
>
> Best,
>
>
> On Oct 4, 2018, at 3:12 PM, John Hearns via users  
> wrote:
>
> Michele, the command is   ldd ./code.io
> I just Googled - ldd  means List dynamic Dependencies
>
> To find out the PBS batch system type - that is a good question!
> Try this: qstat --version
>
>
>
> On Thu, 4 Oct 2018 at 10:12, Castellana Michele
>  wrote:
>
>
> Dear John,
> Thank you for your reply. I have tried
>
> ldd mpirun ./code.o
>
> but I get an error message, I do not know what is the proper syntax to use 
> ldd command. Here is the information about the Linux version
>
> $ cat /etc/os-release
> NAME="CentOS Linux"
> VERSION="7 (Core)"
> ID="centos"
> ID_LIKE="rhel fedora"
> VERSION_ID="7"
> PRETTY_NAME="CentOS Linux 7 (Core)"
> ANSI_COLOR="0;31"
> CPE_NAME="cpe:/o:centos:centos:7"
> HOME_URL="https://www.centos.org/"
> BUG_REPORT_URL="https://bugs.centos.org/"
>
> CENTOS_MANTISBT_PROJECT="CentOS-7"
> CENTOS_MANTISBT_PROJECT_VERSION="7"
> REDHAT_SUPPORT_PRODUCT="centos"
> REDHAT_SUPPORT_PRODUCT_VERSION="7"
>
> May you please tell me how to check whether the batch system is PBSPro or 
> OpenPBS?
>
> Best,
>
>
>
>
> On Oct 4, 2018, at 10:30 AM, John Hearns via users  
> wrote:
>
> Michele  one tip:   log into a compute node using ssh and as your own 
> username.
> If you use the Modules environment then load the modules you use in
> the job script
> then use the  ldd  utility to check if you can load all the libraries
> in the code.io executable
>
> Actually you are better to submit a short batch job which does not use
> mpirun but uses ldd
> A proper batch job will duplicate the environment you wish to run in.
>
>   ldd ./code.io
>
> By the way, is the batch system PBSPro or OpenPBS?  Version 6 seems a bit old.
> Can you say what version of Redhat or CentOS this cluster is installed with?
>
>
>
> On Thu, 4 Oct 2018 at 00:02, Castellana Michele
>  wrote:
>
> I fixed it, the correct file was in /lib64, not in /lib.
>
> Thank you for your help.
>
> On Oct 3, 2018, at 11:30 PM, Castellana Michele  
> wrote:
>
> Thank you, I found some libcrypto files in /usr/lib indeed:
>
> $ ls libcry*
> libcrypt-2.17.so  libcrypto.so.10  libcrypto.so.1.0.2k  libcrypt.so.1
>
> but I could not find libcrypto.so.0.9.8.

Re: [OMPI users] no openmpi over IB on new CentOS 7 system

2018-10-10 Thread John Hearns via users
Noam,  what does ompi_info say - specifically which BTLs are available?
Stupid question though - this is a single system with no connection to a switch?
You probably don't have an OpenSM subnet manager running then - could
that be the root cause?
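
If so, a quick check looks like this sketch (assuming the opensm package is installed and systemd is in use):

    sminfo                      # should report the LID/GUID of an active subnet manager
    systemctl status opensm     # is a subnet manager running on this host?
    # systemctl enable --now opensm   # one way to start it, if nothing else on the fabric runs an SM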

On Wed, 10 Oct 2018 at 09:53, Dave Love  wrote:
>
> RDMA was just broken in the last-but-one(?) RHEL7 kernel release, in
> case that's the problem.  (Fixed in 3.10.0-862.14.4.)
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] no openmpi over IB on new CentOS 7 system

2018-10-10 Thread John Hearns via users
On that system please tell us what these return:
ibstat
ibstatus
sminfo
ibdiagnet




On Wed, 10 Oct 2018 at 12:49, John Hearns  wrote:
>
> Noam,  what does ompi_info say - specifically which BTLs are available?
> Stupid question though - this is a single system with no connection to a 
> switch?
> You probably don't have an OpenSM subnet manager running then - could that be
> the root cause?
>
> On Wed, 10 Oct 2018 at 09:53, Dave Love  wrote:
> >
> > RDMA was just broken in the last-but-one(?) RHEL7 kernel release, in
> > case that's the problem.  (Fixed in 3.10.0-862.14.4.)
> > ___
> > users mailing list
> > users@lists.open-mpi.org
> > https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] Cannot run MPI code on multiple cores with PBS

2018-10-04 Thread John Hearns via users
Michele  one tip:   log into a compute node using ssh and as your own username.
If you use the Modules environment then load the modules you use in
the job script
then use the  ldd  utility to check if you can load all the libraries
in the code.io executable

Actually you are better to submit a short batch job which does not use
mpirun but uses ldd
A proper batch job will duplicate the environment you wish to run in.

ldd ./code.io
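
A minimal sketch of such a batch job (the module name and resource line are assumptions; the executable is the ./code.o from your PBS script):

    #!/bin/bash
    #PBS -l nodes=1:ppn=1
    #PBS -l walltime=00:05:00
    cd $PBS_O_WORKDIR
    # module load openmpi        # if the cluster uses environment modules
    which mpirun
    ldd ./code.o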

By the way, is the batch system PBSPro or OpenPBS?  Version 6 seems a bit old.
Can you say what version of Redhat or CentOS this cluster is installed with?



On Thu, 4 Oct 2018 at 00:02, Castellana Michele
 wrote:
>
> I fixed it, the correct file was in /lib64, not in /lib.
>
> Thank you for your help.
>
> On Oct 3, 2018, at 11:30 PM, Castellana Michele  
> wrote:
>
> Thank you, I found some libcrypto files in /usr/lib indeed:
>
> $ ls libcry*
> libcrypt-2.17.so  libcrypto.so.10  libcrypto.so.1.0.2k  libcrypt.so.1
>
> but I could not find libcrypto.so.0.9.8. Here they suggest to create a 
> hyperlink, but if I do I still get an error from MPI. Is there another way 
> around this?
>
> Best,
>
> On Oct 3, 2018, at 11:00 PM, Jeff Squyres (jsquyres) via users 
>  wrote:
>
> It's probably in your Linux distro somewhere -- I'd guess you're missing a 
> package (e.g., an RPM or a deb) out on your compute nodes...?
>
>
> On Oct 3, 2018, at 4:24 PM, Castellana Michele  
> wrote:
>
> Dear Ralph,
> Thank you for your reply. Do you know where I could find libcrypto.so.0.9.8 ?
>
> Best,
>
> On Oct 3, 2018, at 9:41 PM, Ralph H Castain  wrote:
>
> Actually, I see that you do have the tm components built, but they cannot be 
> loaded because you are missing libcrypto from your LD_LIBRARY_PATH
>
>
> On Oct 3, 2018, at 12:33 PM, Ralph H Castain  wrote:
>
> Did you configure OMPI --with-tm=<path-to-tm>? It looks like we didn’t
> build PBS support and so we only see one node with a single slot allocated to 
> it.
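
For reference, rebuilding with PBS/Torque (tm) support would look something like this sketch (the PBS install path is an assumption; point it at the directory that contains the tm.h header and the tm library):

    ./configure --prefix=$HOME/openmpi --with-tm=/opt/pbs
    make -j4 all
    make install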
>
>
> On Oct 3, 2018, at 12:02 PM, Castellana Michele  
> wrote:
>
> Dear all,
> I am having trouble running an MPI code across multiple cores on a new 
> computer cluster, which uses PBS. Here is a minimal example, where I want to 
> run two MPI processes, each on  a different node. The PBS script is
>
> #!/bin/bash
> #PBS -l walltime=00:01:00
> #PBS -l mem=1gb
> #PBS -l nodes=2:ppn=1
> #PBS -q batch
> #PBS -N test
> mpirun -np 2 ./code.o
>
> and when I submit it with
>
> $qsub script.sh
>
> I get the following message in the PBS error file
>
> $ cat test.e1234
> [shbli040:08879] mca_base_component_repository_open: unable to open 
> mca_plm_tm: libcrypto.so.0.9.8: cannot open shared object file: No such file 
> or directory (ignored)
> [shbli040:08879] mca_base_component_repository_open: unable to open 
> mca_oob_ud: libibverbs.so.1: cannot open shared object file: No such file or 
> directory (ignored)
> [shbli040:08879] mca_base_component_repository_open: unable to open 
> mca_ras_tm: libcrypto.so.0.9.8: cannot open shared object file: No such file 
> or directory (ignored)
> --
> There are not enough slots available in the system to satisfy the 2 slots
> that were requested by the application:
>  ./code.o
>
> Either request fewer slots for your application, or make more slots available
> for use.
> —
>
> The PBS version is
>
> $ qstat --version
> Version: 6.1.2
>
> and here is some additional information on the MPI version
>
> $ mpicc -v
> Using built-in specs.
> COLLECT_GCC=/bin/gcc
> COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
> Target: x86_64-redhat-linux
> […]
> Thread model: posix
> gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC)
>
> Do you guys know what may be the issue here?
>
> Thank you
> Best,
>
>
>
>
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Cannot run MPI code on multiple cores with PBS

2018-10-04 Thread John Hearns via users
Michele, the command is   ldd ./code.io
I just Googled - ldd  means List dynamic Dependencies

To find out the PBS batch system type - that is a good question!
Try this: qstat --version



On Thu, 4 Oct 2018 at 10:12, Castellana Michele
 wrote:
>
> Dear John,
> Thank you for your reply. I have tried
>
> ldd mpirun ./code.o
>
> but I get an error message, I do not know what is the proper syntax to use 
> ldd command. Here is the information about the Linux version
>
> $ cat /etc/os-release
> NAME="CentOS Linux"
> VERSION="7 (Core)"
> ID="centos"
> ID_LIKE="rhel fedora"
> VERSION_ID="7"
> PRETTY_NAME="CentOS Linux 7 (Core)"
> ANSI_COLOR="0;31"
> CPE_NAME="cpe:/o:centos:centos:7"
> HOME_URL="https://www.centos.org/"
> BUG_REPORT_URL="https://bugs.centos.org/"
>
> CENTOS_MANTISBT_PROJECT="CentOS-7"
> CENTOS_MANTISBT_PROJECT_VERSION="7"
> REDHAT_SUPPORT_PRODUCT="centos"
> REDHAT_SUPPORT_PRODUCT_VERSION="7"
>
> May you please tell me how to check whether the batch system is PBSPro or 
> OpenPBS?
>
> Best,
>
>
>
>
> On Oct 4, 2018, at 10:30 AM, John Hearns via users  
> wrote:
>
> Michele  one tip:   log into a compute node using ssh and as your own 
> username.
> If you use the Modules environment then load the modules you use in
> the job script
> then use the  ldd  utility to check if you can load all the libraries
> in the code.io executable
>
> Actually you are better to submit a short batch job which does not use
> mpirun but uses ldd
> A proper batch job will duplicate the environment you wish to run in.
>
>ldd ./code.io
>
> By the way, is the batch system PBSPro or OpenPBS?  Version 6 seems a bit old.
> Can you say what version of Redhat or CentOS this cluster is installed with?
>
>
>
> On Thu, 4 Oct 2018 at 00:02, Castellana Michele
>  wrote:
>
> I fixed it, the correct file was in /lib64, not in /lib.
>
> Thank you for your help.
>
> On Oct 3, 2018, at 11:30 PM, Castellana Michele  
> wrote:
>
> Thank you, I found some libcrypto files in /usr/lib indeed:
>
> $ ls libcry*
> libcrypt-2.17.so  libcrypto.so.10  libcrypto.so.1.0.2k  libcrypt.so.1
>
> but I could not find libcrypto.so.0.9.8. Here they suggest to create a 
> hyperlink, but if I do I still get an error from MPI. Is there another way 
> around this?
>
> Best,
>
> On Oct 3, 2018, at 11:00 PM, Jeff Squyres (jsquyres) via users 
>  wrote:
>
> It's probably in your Linux distro somewhere -- I'd guess you're missing a 
> package (e.g., an RPM or a deb) out on your compute nodes...?
>
>
> On Oct 3, 2018, at 4:24 PM, Castellana Michele  
> wrote:
>
> Dear Ralph,
> Thank you for your reply. Do you know where I could find libcrypto.so.0.9.8 ?
>
> Best,
>
> On Oct 3, 2018, at 9:41 PM, Ralph H Castain  wrote:
>
> Actually, I see that you do have the tm components built, but they cannot be 
> loaded because you are missing libcrypto from your LD_LIBRARY_PATH
>
>
> On Oct 3, 2018, at 12:33 PM, Ralph H Castain  wrote:
>
> Did you configure OMPI --with-tm=<path-to-tm>? It looks like we didn’t
> build PBS support and so we only see one node with a single slot allocated to 
> it.
>
>
> On Oct 3, 2018, at 12:02 PM, Castellana Michele  
> wrote:
>
> Dear all,
> I am having trouble running an MPI code across multiple cores on a new 
> computer cluster, which uses PBS. Here is a minimal example, where I want to 
> run two MPI processes, each on  a different node. The PBS script is
>
> #!/bin/bash
> #PBS -l walltime=00:01:00
> #PBS -l mem=1gb
> #PBS -l nodes=2:ppn=1
> #PBS -q batch
> #PBS -N test
> mpirun -np 2 ./code.o
>
> and when I submit it with
>
> $qsub script.sh
>
> I get the following message in the PBS error file
>
> $ cat test.e1234
> [shbli040:08879] mca_base_component_repository_open: unable to open 
> mca_plm_tm: libcrypto.so.0.9.8: cannot open shared object file: No such file 
> or directory (ignored)
> [shbli040:08879] mca_base_component_repository_open: unable to open 
> mca_oob_ud: libibverbs.so.1: cannot open shared object file: No such file or 
> directory (ignored)
> [shbli040:08879] mca_base_component_repository_open: unable to open 
> mca_ras_tm: libcrypto.so.0.9.8: cannot open shared object file: No such file 
> or directory (ignored)
> --
> There are not enough slots available in the system to satisfy the 2 slots
> that were requested by the application:
> ./code.o
>
> Either request fewer slots for your application, or make more slots available
> for use.

Re: [OMPI users] OpenMPI building fails on Windows Linux Subsystem(WLS).

2018-09-19 Thread John Hearns via users
Oleg, I have  a Windows 10 system and could help by testing this also.
But I have to say - it will be quicker just to install VirtualBox and
a CentOS VM. Or an Ubuntu VM.
You can then set up a small test network of VMs using the VirtualBox
HostOnly network for tests of your MPI code.
On Wed, 19 Sep 2018 at 16:59, Jeff Squyres (jsquyres) via users
 wrote:
>
> I can't say that we've tried to build on WSL; the fact that it fails is 
> probably not entirely unsurprising.  :-(
>
> I looked at your logs, and although I see the compile failure, I don't see 
> any reason *why* it failed.  Here's the relevant fail from the 
> tar_openmpi_fail file:
>
> -
> 5523 Making all in mca/filem
> 5524 make[2]: Entering directory 
> '/mnt/c/Users/ofcra/dev/openmpi-3.1.2/orte/mca/filem'
> 5525   GENERATE orte_filem.7
> 5526   CC   base/filem_base_frame.lo
> 5527   CC   base/filem_base_select.lo
> 5528   CC   base/filem_base_receive.lo
> 5529   CC   base/filem_base_fns.lo
> 5530 base/filem_base_receive.c: In function 
> ‘filem_base_process_get_remote_path_cmd’:
> 5531 base/filem_base_receive.c:250:9: warning: ignoring return value of 
> ‘getcwd’, declared with attribute warn_unused_result [-Wunused-result]
> 5532  getcwd(cwd, sizeof(cwd));
> 5533  ^~~~
> 5534 base/filem_base_receive.c:251:9: warning: ignoring return value of 
> ‘asprintf’, declared with attribute warn_unused_result [-Wunused-result]
> 5535  asprintf(_name, "%s/%s", cwd, filename);
> 5536  ^~~
> 5537 Makefile:1892: recipe for target 'base/filem_base_select.lo' failed
> 5538 make[2]: *** [base/filem_base_select.lo] Error 1
> 5539 make[2]: *** Waiting for unfinished jobs
> 5540 make[2]: Leaving directory 
> '/mnt/c/Users/ofcra/dev/openmpi-3.1.2/orte/mca/filem'
> 5541 Makefile:2586: recipe for target 'all-recursive' failed
> 5542 make[1]: *** [all-recursive] Error 1
> 5543 make[1]: Leaving directory '/mnt/c/Users/ofcra/dev/openmpi-3.1.2/orte'
> 5544 Makefile:1897: recipe for target 'all-recursive' failed
> 5545 make: *** [all-recursive] Error 1
> -
>
> I.e., I see "recipe for target 'base/filem_base_select.lo' failed" -- but 
> there's no error indicating *why* it failed.  There were 2 warnings when 
> compiling that file -- but not errors.  That should not have prevented 
> compilation for that .c file.
>
> You then went on to run "make check", but that failed predictably because 
> "make" had already failed.
>
> You might want to run "make V=1" to see if you can get more details about why 
> orte/mca/filem/base/filem_base_select.c failed to compile properly.
>
> It looks like your GitHub clone build failed in exactly the same place.
>
> There's something about filem_base_select.c that is failing to compile -- 
> that's what we need more detail on.
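
As a sketch, running make V=1 against just the failing target keeps the verbose output short (assuming you run it from the top of the build tree):

    cd orte/mca/filem
    make V=1 base/filem_base_select.lo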
>
>
>
> > On Sep 18, 2018, at 10:06 AM, Oleg Kmechak  wrote:
> >
> > Hello,
> >
> > I am student of Physics from University of Warsaw, and new to OpenMPI. 
> > Currently just trying to compile it from source code(tried both github and  
> > tar(3.1.2)).
> > I am using Windows Linux Subsystem(WLS), Ubuntu.
> >
> > uname -a:
> > >Linux Canopus 4.4.0-17134-Microsoft #285-Microsoft Thu Aug 30 17:31:00 PST 
> > >2018 x86_64 x86_64 x86_64 GNU/Linux
> >
> > I have done all steps suggested in INSTALL and HACKING files, installed 
> > > the next tools in proper order: M4(1.4.18), autoconf(2.69), automake(1.15.1),
> > libtool(2.4.6), flex(2.6.4).
> >
> > Next I enabled AUTOMAKE_JOBS=4 and ran:
> >
> > ./autogen.pl #for source code from git hub
> >
> > Then
> > ./configure --disable-picky --enable-mpi-cxx --without-cma --enable-static
> >
> > I added --without-cma cos I have a lot of warnings about compiling asprintf 
> > function
> >
> > and finally:
> > make -j 4 all #cos I have 4 logical processors
> >
> > And in both versions(from github or  tar(3.1.2)) it fails.
> > Github version error:
> > >../../../../opal/mca/hwloc/hwloc201/hwloc/include/hwloc.h:71:10: fatal 
> > >error: hwloc/bitmap.h: No such file or directory
> >  #include <hwloc/bitmap.h>
> >
> > And tar(3.1.2) version:
> > >libtool:   error: cannot find the library '../../ompi/libmpi.la' or 
> > >unhandled argument '../../ompi/libmpi.la'
> >
> > Please see also full log in attachment
> > Thanks, hope You will help(cos I passed a lot of time on it currently:) )
> >
> >
> > PS: if this is a bug or unimplemented feature(WLS is probably quite 
> > specific platform), should I rise issue on github project?
> >
> >
> > Regards, Oleg Kmechak
> >
> > ___
> > users mailing list
> > users@lists.open-mpi.org
> > https://lists.open-mpi.org/mailman/listinfo/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list

Re: [OMPI users] OpenMPI building fails on Windows Linux Subsystem(WLS).

2018-09-19 Thread John Hearns via users
Oleg, I can build the latest master branch of OpenMPI in WSL
I can give it a try with 3.1.2 if that is any help to you?

uname -a
Linux Johns-Spectre 4.4.0-17134-Microsoft #285-Microsoft Thu Aug 30
17:31:00 PST 2018 x86_64 x86_64 x86_64 GNU/Linux
apt-get upgrade
apt-get install gfortran
wget https://github.com/open-mpi/ompi/archive/master.zip
cd ompi-master

./autogen.pl
./configure --enable-mpi-cxx

make -j 2

configure returns this:

Open MPI configuration:
---
Version: 4.1.0a1
Build MPI C bindings: yes
Build MPI C++ bindings (deprecated): yes
Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
MPI Build Java bindings (experimental): no
Build Open SHMEM support: false (no spml)
Debug build: no
Platform file: (none)
Miscellaneous
---
CUDA support: no
HWLOC support: internal
Libevent support: internal
PMIx support: internal
Transports
---
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no
Intel TrueScale (PSM): no
Mellanox MXM: no
Open UCX: no
OpenFabrics Libfabric: no
OpenFabrics Verbs: no
Portals4: no
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes
Resource Managers
---
Cray Alps: no
Grid Engine: no
LSF: no
Moab: no
Slurm: yes
ssh/rsh: yes
Torque: no
OMPIO File Systems
---
DDN Infinite Memory Engine: no
Generic Unix FS: yes
Lustre: no
PVFS2/OrangeFS: no


On Wed, 19 Sep 2018 at 17:36, John Hearns  wrote:
>
> Oleg, I have  a Windows 10 system and could help by testing this also.
> But I have to say - it will be quicker just to install VirtualBox and
> a CentOS VM. Or an Ubuntu VM.
> You can then set up a small test network of VMs using the VirtualBox
> HostOnly network for tests of your MPI code.
> On Wed, 19 Sep 2018 at 16:59, Jeff Squyres (jsquyres) via users
>  wrote:
> >
> > I can't say that we've tried to build on WSL; the fact that it fails is 
> > probably not entirely unsurprising.  :-(
> >
> > I looked at your logs, and although I see the compile failure, I don't see 
> > any reason *why* it failed.  Here's the relevant fail from the 
> > tar_openmpi_fail file:
> >
> > -
> > 5523 Making all in mca/filem
> > 5524 make[2]: Entering directory 
> > '/mnt/c/Users/ofcra/dev/openmpi-3.1.2/orte/mca/filem'
> > 5525   GENERATE orte_filem.7
> > 5526   CC   base/filem_base_frame.lo
> > 5527   CC   base/filem_base_select.lo
> > 5528   CC   base/filem_base_receive.lo
> > 5529   CC   base/filem_base_fns.lo
> > 5530 base/filem_base_receive.c: In function 
> > ‘filem_base_process_get_remote_path_cmd’:
> > 5531 base/filem_base_receive.c:250:9: warning: ignoring return value of 
> > ‘getcwd’, declared with attribute warn_unused_result [-Wunused-result]
> > 5532  getcwd(cwd, sizeof(cwd));
> > 5533  ^~~~
> > 5534 base/filem_base_receive.c:251:9: warning: ignoring return value of 
> > ‘asprintf’, declared with attribute warn_unused_result [-Wunused-result]
> > 5535  asprintf(_name, "%s/%s", cwd, filename);
> > 5536  ^~~
> > 5537 Makefile:1892: recipe for target 'base/filem_base_select.lo' failed
> > 5538 make[2]: *** [base/filem_base_select.lo] Error 1
> > 5539 make[2]: *** Waiting for unfinished jobs
> > 5540 make[2]: Leaving directory 
> > '/mnt/c/Users/ofcra/dev/openmpi-3.1.2/orte/mca/filem'
> > 5541 Makefile:2586: recipe for target 'all-recursive' failed
> > 5542 make[1]: *** [all-recursive] Error 1
> > 5543 make[1]: Leaving directory '/mnt/c/Users/ofcra/dev/openmpi-3.1.2/orte'
> > 5544 Makefile:1897: recipe for target 'all-recursive' failed
> > 5545 make: *** [all-recursive] Error 1
> > -
> >
> > I.e., I see "recipe for target 'base/filem_base_select.lo' failed" -- but 
> > there's no error indicating *why* it failed.  There were 2 warnings when 
> > compiling that file -- but not errors.  That should not have prevented 
> > compilation for that .c file.
> >
> > You then went on to run "make check", but that failed predictably because 
> > "make" had already failed.
> >
> > You might want to run "make V=1" to see if you can get more details about 
> > why orte/mca/filem/base/filem_base_select.c failed to compile properly.
> >
> > It looks like your GitHub clone build failed in exactly the same place.
> >
> > There's something about filem_base_select.c that is failing to compile -- 
> > that's what we need more detail on.
> >
> >
> >
> > > On Sep 18, 2018, at 10:06 AM, Oleg Kmechak  wrote:
> > >
> > > Hello,
> > >
> > > I am student of Physics from University of Warsaw, and new to OpenMPI. 
> > > Currently just trying to compile it from source code(tried both github 
> > > and  tar(3.1.2)).
> > > I am using Windows Linux Subsystem(WLS), Ubuntu.
> > >
> > > uname -a:
> > > >Linux Canopus 4.4.0-17134-Microsoft #285-Microsoft Thu Aug 30 17:31:00 
> > > >PST 2018 x86_64 x86_64 x86_64 

Re: [OMPI users] Open MPI installation problem

2019-01-23 Thread John Hearns via users
Sorry if I am being stupid, Serdar might also have to set the location for
the includes by setting MPI_INC
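
Putting the two suggestions together, something like this sketch (whether the LIGGGHTS makefile reads MPI_INC from the environment or needs it edited into its Makefile is an assumption to verify):

    export PATH=$HOME/openmpi/bin:$PATH
    export LD_LIBRARY_PATH=$HOME/openmpi/lib:$LD_LIBRARY_PATH
    export MPI_INC=$HOME/openmpi/include
    which mpicxx        # should now resolve to $HOME/openmpi/bin/mpicxx
    make auto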

On Wed, 23 Jan 2019 at 14:47, Ralph H Castain  wrote:

> Your PATH and LD_LIBRARY_PATH setting is incorrect. You installed OMPI
> into $HOME/openmpi, so you should have done:
>
> PATH=$HOME/openmpi/bin:$PATH
> LD_LIBRARY_PATH=$HOME/openmpi/lib:$LD_LIBRARY_PATH
>
> Ralph
>
>
> On Jan 23, 2019, at 6:36 AM, Serdar Hiçdurmaz 
> wrote:
>
> Hi All,
>
> I try to install Open MPI, which is prerequiste for liggghts (DEM
> software). Some info about my current linux version :
>
> NAME="SLED"
> VERSION="12-SP3"
> VERSION_ID="12.3"
> PRETTY_NAME="SUSE Linux Enterprise Desktop 12 SP3"
> ID="sled"
>
> I installed Open MPI 1.6 by typing
>
> ./configure --prefix=$HOME/openmpi
> make all
> make install
>
> Here, it is discussed that openmpi 1.6 is compatible with OpenSuse 12.3
> https://public.kitware.com/pipermail/paraview/2014-February/030487.html
> https://build.opensuse.org/package/show/openSUSE:12.3/openmpi
>
> To add OpenMPI to my path and LD_LIBRARY_PATH, I execute the following
> comands on terminal:
>
> export PATH=$PATH:/usr/lib64/mpi/gcc/openmpi/bin
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64/mpi/gcc/openmpi/lib64
>
> Then, in /liggghts/src directory, I execute make auto, this appears :
>
> Creating list of contact models completed.
> make[1]: Entering directory
> '/home/serdarhd/liggghts/LIGGGHTS-PUBLIC/src/Obj_auto'
> Makefile:456: *** 'Could not compile a simple MPI example. Test was done
> with MPI_INC="" and MPICXX="mpicxx"'. Stop.
> make[1]: Leaving directory
> '/home/serdarhd/liggghts/LIGGGHTS-PUBLIC/src/Obj_auto'
> Makefile:106: recipe for target 'auto' failed
> make: *** [auto] Error 2
>
> Do you have any idea what the problem is here ? I went through the
> "makefile" but it looks like quite complicated as linux beginner like me.
>
> Thanks in advance. Regards,
>
> Serdar
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] job termination

2019-04-17 Thread John Hearns via users
I would do the normal things. Log into those nodes. Run  dmesg  and look at
/var/log/messages
Look at the Slurm log on the node and look for the job ending.

Also look at the sysstat files and see if there was a lot of memory being
used http://sebastien.godard.pagesperso-orange.fr/
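
For example, with the sysstat package the memory history for the day the job died can be replayed, and the kernel log will show whether the OOM killer fired (the sa file location is the common default and may differ on your system):

    sar -r -f /var/log/sa/saDD          # replace DD with the day of the month
    dmesg | grep -i -e oom -e "out of memory"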

On Wed, 17 Apr 2019 at 09:16, Mahmood Naderan  wrote:

> Hi,
> A QuantumEspresso, multinode and multiprocess MPI job has been terminated
> with the following messages in the log file
>
>
>  total cpu time spent up to now is63540.4 secs
>
>  total energy  =  -14004.61932175 Ry
>  Harris-Foulkes estimate   =  -14004.73511665 Ry
>  estimated scf accuracy<   0.84597958 Ry
>
>  iteration #  7 ecut=48.95 Ry beta= 0.70
>  Davidson diagonalization with overlap
> --
> ORTE has lost communication with a remote daemon.
>
>   HNP daemon   : [[7952,0],0] on node compute-0-0
>   Remote daemon: [[7952,0],1] on node compute-0-1
>
> This is usually due to either a failure of the TCP network
> connection to the node, or possibly an internal failure of
> the daemon itself. We cannot recover from this failure, and
> therefore will terminate the job.
> --
>
>
>
>
> The slurm script for that is
>
> #!/bin/bash
> #SBATCH --job-name=myQE
> #SBATCH --output=mos2.rlx.out
> #SBATCH --ntasks=14
> #SBATCH --mem-per-cpu=17G
> #SBATCH --nodes=6
> #SBATCH --partition=QUARTZ
> #SBATCH --account=z5
> mpirun pw.x -i mos2.rlx.in
>
>
> The job is running on Slurm 18.08 and Rocks7 which its default OpenMPI
> 2.1.1.
>
> Other jobs with OMPI and slurm and QE are fine. So, I want to know how can
> I narrow my searches to find the root of the problem of this specific
> problem. For example, I don't know if the QE job had been diverged in
> calculations or not. Is there any way to find more information about that.
>
> Any idea?
>
> Regards,
> Mahmood
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] growing memory use from MPI application

2019-06-19 Thread John Hearns via users
Noam, it may be a stupid question. Could you try running slabtop as
the program executes?

Also, 'watch cat /proc/meminfo' is a good diagnostic

On Wed, 19 Jun 2019 at 18:32, Noam Bernstein via users <
users@lists.open-mpi.org> wrote:

> Hi - we’re having a weird problem with OpenMPI on our newish infiniband
> EDR (mlx5) nodes.  We're running CentOS 7.6, with all the infiniband and
> ucx libraries as provided by CentOS, i.e.
>
> ucx-1.4.0-1.el7.x86_64
> libibverbs-utils-17.2-3.el7.x86_64
> libibverbs-17.2-3.el7.x86_64
> libibumad-17.2-3.el7.x86_64
>
> kernel is
>
> 3.10.0-957.21.2.el7.x86_64
>
> I’ve compiled my own OpenMPI, version 4.0.1 (--with-verbs --with-ofi
> --with-ucx).
>
> The job is started with
>
> mpirun --mca pml ucx --mca btl ^vader,tcp,openib
>
> as recommended for ucx.
>
> We have some jobs (one particular code, some but not all sets of input
> parameters) that appear to take an increasing amount of memory (in MPI?)
> until the node crashes.  The total memory used by all processes (reported
> by ps or top) is not increasing, but “free” reports less and less available
> memory.  Within a couple of minutes it uses all of the 96GB on each of the
> nodes. When the job is killed the processes go away, but the memory usage
> (as reported by “free”) stays the same, e.g.:
>
>               total        used        free      shared  buff/cache   available
> Mem:       98423956    88750140     7021688        2184     2652128     6793020
> Swap:      65535996      365312    65170684
>
> As far as I can tell I have to reboot to get the memory back.
>
> If I attach to a running process with “gdb -p”, I see stack traces that
> look like these two examples (starting from the first mpi-related call):
>
>
> #0  0x2b22a95134a3 in pthread_spin_lock () from /lib64/libpthread.so.0
> #1  0x2b22be73a3e8 in mlx5_poll_cq_v1 () from
> /usr/lib64/libibverbs/libmlx5-rdmav17.so
> #2  0x2b22bcb267de in uct_ud_verbs_iface_progress () from
> /lib64/libuct.so.0
> #3  0x2b22bc8d28b2 in ucp_worker_progress () from /lib64/libucp.so.0
> #4  0x2b22b7cd14e7 in mca_pml_ucx_progress ()
> from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_pml_ucx.so
> #5  0x2b22ab6064fc in opal_progress () from
> /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libopen-pal.so.40
> #6  0x2b22a9f51dc5 in ompi_request_default_wait ()
> from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
> #7  0x2b22a9fa355c in ompi_coll_base_allreduce_intra_ring ()
> from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
> #8  0x2b22a9f65cb3 in PMPI_Allreduce () from
> /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
> #9  0x2b22a9cedf9b in pmpi_allreduce__ ()
> from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi_mpifh.so.40
>
>
> #0  0x2ae0518de69d in write () from /lib64/libpthread.so.0
> #1  0x2ae064458d7f in ibv_cmd_reg_mr () from /usr/lib64/libibverbs.so.1
> #2  0x2ae066b9221b in mlx5_reg_mr () from
> /usr/lib64/libibverbs/libmlx5-rdmav17.so
> #3  0x2ae064461f08 in ibv_reg_mr () from /usr/lib64/libibverbs.so.1
> #4  0x2ae064f6e312 in uct_ib_md_reg_mr.isra.11.constprop () from
> /lib64/libuct.so.0
> #5  0x2ae064f6e4f2 in uct_ib_rcache_mem_reg_cb () from
> /lib64/libuct.so.0
> #6  0x2ae0651aec0f in ucs_rcache_get () from /lib64/libucs.so.0
> #7  0x2ae064f6d6a4 in uct_ib_mem_rcache_reg () from /lib64/libuct.so.0
> #8  0x2ae064d1fa58 in ucp_mem_rereg_mds () from /lib64/libucp.so.0
> #9  0x2ae064d21438 in ucp_request_memory_reg () from /lib64/libucp.so.0
> #10 0x2ae064d21663 in ucp_request_send_start () from /lib64/libucp.so.0
> #11 0x2ae064d335dd in ucp_tag_send_nb () from /lib64/libucp.so.0
> #12 0x2ae06420a5e6 in mca_pml_ucx_start ()
> from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_pml_ucx.so
> #13 0x2ae05236fc06 in ompi_coll_base_alltoall_intra_basic_linear ()
> from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
> #14 0x2ae05232f347 in PMPI_Alltoall () from
> /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
> #15 0x2ae0520b704c in pmpi_alltoall__ () from
> /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi_mpifh.so.40
>
> This doesn’t seem to happen on our older nodes (which have FDR mlx4
> interfaces).
>
> I don’t really have a mental model for OpenMPI's memory use, so I don’t
> know what component I should investigate: OpenMPI itself? ucx?  OFED?
> Something else?  IF anyone has any suggestions for what to try, and/or what
> other information would be useful, I’d appreciate it.
>
> thanks,
> Noam
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread John Hearns via users
Errr..   you have dropped caches?   echo 3 > /proc/sys/vm/drop_caches


On Thu, 20 Jun 2019 at 15:59, Yann Jobic via users 
wrote:

> Hi,
>
> On 6/20/2019 at 3:31 PM, Noam Bernstein via users wrote:
> >
> >
> >> On Jun 20, 2019, at 4:44 AM, Charles A Taylor wrote:
> >>
> >> This looks a lot like a problem I had with OpenMPI 3.1.2.  I thought
> >> the fix was landed in 4.0.0 but you might
> >> want to check the code to be sure there wasn’t a regression in 4.1.x.
> >>  Most of our codes are still running
> >> 3.1.2 so I haven’t built anything beyond 4.0.0 which definitely
> >> included the fix.
> >
> > Unfortunately, 4.0.0 behaves the same.
> >
> > One thing that I’m wondering if anyone familiar with the internals can
> > explain is how you get a memory leak that isn’t freed when then program
> > ends?  Doesn’t that suggest that it’s something lower level, like maybe
> > a kernel issue?
>
> Maybe it's only some data in cache memory, which is tagged as "used",
> but the kernel could use it, if needed. Have you tried to use the whole
> memory again with your code? It should work.
>
> Yann
>
> >
> > Noam
> >
> > U.S. NAVAL RESEARCH LABORATORY
> >
> > Noam Bernstein, Ph.D.
> > Center for Materials Physics and Technology
> > U.S. Naval Research Laboratory
> > T +1 202 404 8628  F +1 202 404 7546
> > https://www.nrl.navy.mil
> >
> >
> > ___
> > users mailing list
> > users@lists.open-mpi.org
> > https://lists.open-mpi.org/mailman/listinfo/users
> >
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread John Hearns via users
The kernel using memory is why I suggested running slabtop, to see the
kernel slab allocations.
Clearly I was barking up a wrong tree there...

On Thu, 20 Jun 2019 at 14:41, Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> On Jun 20, 2019, at 9:31 AM, Noam Bernstein via users <
> users@lists.open-mpi.org> wrote:
> >
> > One thing that I’m wondering if anyone familiar with the internals can
> explain is how you get a memory leak that isn’t freed when then program
> ends?  Doesn’t that suggest that it’s something lower level, like maybe a
> kernel issue?
>
> If "top" doesn't show processes eating up the memory, and killing
> processes (e.g., MPI processes) doesn't give you memory back, then it's
> likely that something in the kernel is leaking memory.
>
> Have you tried the latest version of UCX -- including their kernel drivers
> -- from Mellanox (vs. inbox/CentOS)?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] How it the rank determined (Open MPI and Podman)

2019-07-11 Thread John Hearns via users
Not really a relevant reply, however Nomad has task drivers for Docker and
Singularity
https://www.hashicorp.com/blog/singularity-and-hashicorp-nomad-a-perfect-fit

I'm not sure if it would be easier to set up an MPI environment with Nomad
though

On Thu, 11 Jul 2019 at 11:08, Adrian Reber via users <
users@lists.open-mpi.org> wrote:

> Gilles,
>
> thanks for pointing out the environment variables. I quickly created a
> wrapper which tells Podman to re-export all OMPI_ and PMIX_ variables
> (grep "\(PMIX\|OMPI\)"). Now it works:
>
> $ mpirun --hostfile ~/hosts ./wrapper -v /tmp:/tmp --userns=keep-id
> --net=host mpi-test /home/mpi/hello
>
>  Hello, world (2 procs total)
> --> Process #   0 of   2 is alive. ->test1
> --> Process #   1 of   2 is alive. ->test2
>
> I need to tell Podman to mount /tmp from the host into the container, as
> I am running rootless I also need to tell Podman to use the same user ID
> in the container as outside (so that the Open MPI files in /tmp) can be
> shared and I am also running without a network namespace.
>
> So this is now with the full Podman provided isolation except the
> network namespace. Thanks for your help!
>
> Adrian
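
For anyone else trying this, a minimal sketch of what such a wrapper could look like (an illustration of the approach Adrian describes, not his actual script):

    #!/bin/bash
    # forward all OMPI_* and PMIX_* variables from the mpirun/orted environment
    # into the container, then run podman with whatever arguments were given
    ENVARGS=""
    for name in $(env | grep '^\(OMPI\|PMIX\)_' | cut -d= -f1); do
        ENVARGS="$ENVARGS -e $name"
    done
    exec podman run $ENVARGS "$@"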
>
> On Thu, Jul 11, 2019 at 04:47:21PM +0900, Gilles Gouaillardet via users
> wrote:
> > Adrian,
> >
> >
> > the MPI application relies on some environment variables (they typically
> > start with OMPI_ and PMIX_).
> >
> > The MPI application internally uses a PMIx client that must be able to
> > contact a PMIx server located on the same host (the PMIx server is included
> > in mpirun and the orted daemon(s) spawned on the remote hosts).
> >
> >
> > If podman provides some isolation between the app inside the container
> (e.g.
> > /home/mpi/hello)
> >
> > and the outside world (e.g. mpirun/orted), that won't be an easy ride.
> >
> >
> > Cheers,
> >
> >
> > Gilles
> >
> >
> > On 7/11/2019 4:35 PM, Adrian Reber via users wrote:
> > > I did a quick test to see if I can use Podman in combination with Open
> > > MPI:
> > >
> > > [test@test1 ~]$ mpirun --hostfile ~/hosts podman run
> quay.io/adrianreber/mpi-test /home/mpi/hello
> > >
> > >   Hello, world (1 procs total)
> > >  --> Process #   0 of   1 is alive. ->789b8fb622ef
> > >
> > >   Hello, world (1 procs total)
> > >  --> Process #   0 of   1 is alive. ->749eb4e1c01a
> > >
> > > The test program (hello) is taken from
> https://raw.githubusercontent.com/openhpc/ohpc/obs/OpenHPC_1.3.8_Factory/tests/mpi/hello.c
> > >
> > >
> > > The problem with this is that each process thinks it is process 0 of 1
> > > instead of
> > >
> > >   Hello, world (2 procs total)
> > >  --> Process #   1 of   2 is alive.  ->test1
> > >  --> Process #   0 of   2 is alive.  ->test2
> > >
> > > My questions is how is the rank determined? What resources do I need
> to have
> > > in my container to correctly determine the rank.
> > >
> > > This is Podman 1.4.2 and Open MPI 4.0.1.
> > >
> > > Adrian
> > > ___
> > > users mailing list
> > > users@lists.open-mpi.org
> > > https://lists.open-mpi.org/mailman/listinfo/users
> > >
> > ___
> > users mailing list
> > users@lists.open-mpi.org
> > https://lists.open-mpi.org/mailman/listinfo/users
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] can't run MPI job under SGE

2019-07-25 Thread John Hearns via users
Have you checked your ssh between nodes?
Also how is your PATH set up?
There is a difference between interactive and non-interactive login sessions.

I advise:
A. Construct a hosts file and mpirun by hand (see the sketch below)

B. Use modules rather than .bashrc files

C. Slurm
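
For (A), a minimal sketch, run from an interactive shell on one of the nodes (the node names are examples based on the log below; adjust to your cluster):

    # check passwordless ssh between the nodes first
    ssh dblade01 hostname
    cat > hosts.txt <<EOF
    dblade01 slots=1
    dblade02 slots=1
    EOF
    mpirun -np 2 --hostfile hosts.txt ./hello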

On Thu, 25 Jul 2019, 18:00 David Laidlaw via users, <
users@lists.open-mpi.org> wrote:

> I have been trying to run some MPI jobs under SGE for almost a year
> without success.  What seems like a very simple test program fails; the
> ingredients of it are below.  Any suggestions on any piece of the test,
> reasons for failure, requests for additional info, configuration thoughts,
> etc. would be much appreciated.  I suspect the linkage between SGIEand MPI,
> but can't identify the problem.  We do have SGE support build into MPI.  We
> also have the SGE parallel environment (PE) set up as described in several
> places on the web.
>
> Many thanks for any input!
>
> Cheers,
>
> -David Laidlaw
>
>
>
>
> Here is how I submit the job:
>
>/usr/bin/qsub /gpfs/main/home/dhl/liggghtsTest/hello2/runme
>
>
> Here is what is in runme:
>
>   #!/bin/bash
>   #$ -cwd
>   #$ -pe orte_fill 1
>   env PATH="$PATH" /usr/bin/mpirun --mca plm_base_verbose 1 -display-allocation ./hello
>
>
> Here is hello.c:
>
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> int main(int argc, char** argv) {
> // Initialize the MPI environment
> MPI_Init(NULL, NULL);
>
> // Get the number of processes
> int world_size;
> MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>
> // Get the rank of the process
> int world_rank;
> MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>
> // Get the name of the processor
> char processor_name[MPI_MAX_PROCESSOR_NAME];
> int name_len;
> MPI_Get_processor_name(processor_name, &name_len);
>
> // Print off a hello world message
> printf("Hello world from processor %s, rank %d out of %d processors\n",
>processor_name, world_rank, world_size);
> // system("printenv");
>
> sleep(15); // sleep for 15 seconds
>
> // Finalize the MPI environment.
> MPI_Finalize();
> }
>
>
> This command will build it:
>
>  mpicc hello.c -o hello
>
>
> Running produces the following:
>
> /var/spool/gridengine/execd/dblade01/active_jobs/1895308.1/pe_hostfile
> dblade01.cs.brown.edu 1 shor...@dblade01.cs.brown.edu UNDEFINED
> --
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp
> (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to
> use.
>
> *  compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --
>
>
> and:
>
> [dblade01:10902] [[37323,0],0] plm:rsh: final template argv:
> /usr/bin/ssh  set path = ( /usr/bin $path ) ; if ( $?
> LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH
>  == 0 ) setenv LD_LIBRARY_PATH /usr/lib ; if ( $?OMPI_have_llp == 1 )
> setenv
> LD_LIBRARY_PATH /usr/lib:$LD_LIBRARY_PATH ; if ( $?DYLD_LIBRARY
> _PATH == 1 ) set OMPI_have_dllp ; if ( $?DYLD_LIBRARY_PATH == 0 ) setenv
> DYLD_LIBRARY_PATH /usr/lib ; if ( $?OMPI_have_dllp == 1 ) setenv DY
> LD_LIBRARY_PATH /usr/lib:$DYLD_LIBRARY_PATH ;   /usr/bin/orted
> --hnp-topo-sig
> 0N:2S:0L3:4L2:4L1:4C:4H:x86_64 -mca ess "env" -mca ess_base_jo
> bid "2446000128" -mca ess_base_vpid "" -mca ess_base_num_procs
> "2" -
> mca orte_hnp_uri "2446000128.0;usock;tcp://10.116.85.90:44791"
>  --mca plm_base_verbose "1" -mca plm "rsh" -mca orte_display_alloc "1" -mca
> pmix "^s1,s2,cray"
> ssh_exchange_identification: read: Connection reset by peer
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users