Re: [OMPI users] job aborts "readv failed: Connection reset by peer"

2016-09-02 Thread Gilles Gouaillardet
You need to run the ulimit command before mpirun, in the same shell and on the same node.
If it still does not work, then you can use a wrapper.
Instead of
mpirun a.out
you would do
mpirun a.sh

where a.sh is a script containing

ulimit -c unlimited
exec ./a.out

The core file is created in the process's current working directory, unless
the kernel's core_pattern setting redirects it elsewhere.
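
A minimal sketch of such a wrapper, generalized so that the program to run is passed as arguments instead of being hard-coded. The file name with_core.sh and the fallback to the hard limit are assumptions added for illustration; they are not part of the advice above.

```shell
#!/bin/sh
# with_core.sh (hypothetical name): raise the core-file size limit, then
# replace this shell with the real program so it inherits the new limit.

# Try "unlimited" first; if the hard limit forbids that, fall back to
# raising the soft limit as far as the hard limit allows.
ulimit -c unlimited 2>/dev/null || ulimit -S -c "$(ulimit -H -c)"

# Run the wrapped command, if one was given on the command line.
if [ "$#" -gt 0 ]; then
    exec "$@"
fi
```

After chmod +x with_core.sh it could be used as, for example, mpirun -np 4 ./with_core.sh ./a.out, so every rank gets the raised limit on whichever node it starts.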

Cheers,

Gilles

On Saturday, September 3, 2016, Mahmood Naderan 
wrote:

> >Did you run
> >ulimit -c unlimited
> >before invoking mpirun ?
>
> Yes. On the node which says that error. Is that file created in the
> current working directory? Or it is somewhere in the system folders?
>
>
>
> As another question, I am trying to use OpenMPI-2.0.0 as a new one.
> Problem is that the application uses libmpi_f90.a from old versions
> but I don't see that in OpenMPI-2.0.0. There are some other libraries
> there.
>
>
>
>
> --
> Regards,
> Mahmood
> ___
> users mailing list
> users@lists.open-mpi.org 
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>

Re: [OMPI users] Error in file runtime/orte_init.c

2016-09-02 Thread Mahmood Naderan
OK, thanks for the hint. In fact, the 'ldd' command showed that some libraries
were missing. Adding their paths to LD_LIBRARY_PATH solved the problem.
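
For readers hitting the same symptom, the check can be sketched roughly as follows. The helper name missing_libs and the paths in the usage comments are hypothetical; the helper reads ldd output on standard input and relies only on the "=> not found" marker that ldd prints for unresolved libraries.

```shell
# Print the names of shared libraries that ldd could not resolve.
# Reads ldd output on stdin, so it can be tried on saved output as well.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Typical use (all paths below are placeholders, not taken from this thread):
#   ldd ./a.out | missing_libs
#   export LD_LIBRARY_PATH=/path/to/missing/libs:$LD_LIBRARY_PATH
```

An empty result means the dynamic loader can already find everything the file depends on.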



Regards,
Mahmood

Re: [OMPI users] Error in file runtime/orte_init.c

2016-09-02 Thread Jeff Squyres (jsquyres)
Did you, perchance, install Open MPI v2.0.0 in the same directory tree where a
prior version of Open MPI was already installed?

If so, Open MPI may be trying to use plugins from the prior version of Open
MPI, which will be problematic.

Sent from my phone. No type good. 

> On Sep 2, 2016, at 11:53 AM, Mahmood Naderan  wrote:
> 
> Hi,
> Using OpenMPI-2.0.0, is there any idea about this error
> 
> A requested component was not found, or was unable to be opened.  This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded).  Note that
> Open MPI stopped checking at the first component that it did not find.
> 
> Host:  compute-0-1.local
> Framework: ess
> Component: pmi
> --
> [compute-0-1.local:22993] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in
> file runtime/orte_init.c at line 116
> 
> 
> 
> 
> -- 
> Regards,
> Mahmood


Re: [OMPI users] job aborts "readv failed: Connection reset by peer"

2016-09-02 Thread Jeff Squyres (jsquyres)
Note that Open MPI v2.0.0 is not ABI compatible with prior releases of Open
MPI. If you are trying to run an MPI executable created by a prior version of
Open MPI, you will need to recompile your application with Open MPI v2.0.0.
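
One quick way to see which Open MPI installation an existing binary resolves to at run time is to look at the libmpi line of its ldd output. The sketch below assumes a hypothetical helper name, mpi_lib_path, and reads ldd output on standard input.

```shell
# Print the resolved path of libmpi from ldd output read on stdin.
# Lines such as libmpi_f90.so.1 are skipped on purpose: only libmpi.so.* matches.
mpi_lib_path() {
    awk '$1 ~ /^libmpi\.so/ && $2 == "=>" { print $3 }'
}

# Typical use:
#   ldd /path/to/application | mpi_lib_path
#   command -v mpirun    # compare the installation prefixes of the two results
```

In the ldd output quoted in this thread the result would be /opt/openmpi/lib/libmpi.so.1, while the mpirun command used is under /share/apps/computer/openmpi-1.6.5/bin. Whether those are the same installation is not stated in the thread, so it may be worth checking.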

Sent from my phone. No type good. 

> On Sep 2, 2016, at 12:48 PM, Mahmood Naderan  wrote:
> 
> Thanks for your help. Please see below
> 
> mahmood@compute-0-1:~$ ldd /share/apps/chemistry/siesta-3.2-pl-5/tpar/transiesta
> linux-vdso.so.1 =>  (0x7fffba9a8000)
> libmpi_f90.so.1 => /opt/openmpi/lib/libmpi_f90.so.1 (0x2b472b64)
> libmpi_f77.so.1 => /opt/openmpi/lib/libmpi_f77.so.1 (0x2b472b848000)
> libmpi.so.1 => /opt/openmpi/lib/libmpi.so.1 (0x2b472ba8)
> libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x003d17e0)
> librt.so.1 => /lib64/librt.so.1 (0x003d1860)
> libnsl.so.1 => /lib64/libnsl.so.1 (0x003d1ae0)
> libutil.so.1 => /lib64/libutil.so.1 (0x003d18a0)
> libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x2b472c028000)
> libm.so.6 => /lib64/libm.so.6 (0x2b472c32)
> libtorque.so.2 => /opt/torque/lib/libtorque.so.2 (0x2b472c5a8000)
> libdl.so.2 => /lib64/libdl.so.2 (0x003d1760)
> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x003d1920)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x003d17a0)
> libc.so.6 => /lib64/libc.so.6 (0x003d1720)
> libdat.so.1 => /usr/lib64/libdat.so.1 (0x2b472c8b)
> /lib64/ld-linux-x86-64.so.2 (0x003d16e0)
> 
> 
> -- 
> Regards,
> Mahmood


Re: [OMPI users] job aborts "readv failed: Connection reset by peer"

2016-09-02 Thread Mahmood Naderan
Thanks for your help. Please see below

mahmood@compute-0-1:~$ ldd /share/apps/chemistry/siesta-3.2-pl-5/tpar/transiesta
linux-vdso.so.1 =>  (0x7fffba9a8000)
libmpi_f90.so.1 => /opt/openmpi/lib/libmpi_f90.so.1 (0x2b472b64)
libmpi_f77.so.1 => /opt/openmpi/lib/libmpi_f77.so.1 (0x2b472b848000)
libmpi.so.1 => /opt/openmpi/lib/libmpi.so.1 (0x2b472ba8)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x003d17e0)
librt.so.1 => /lib64/librt.so.1 (0x003d1860)
libnsl.so.1 => /lib64/libnsl.so.1 (0x003d1ae0)
libutil.so.1 => /lib64/libutil.so.1 (0x003d18a0)
libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x2b472c028000)
libm.so.6 => /lib64/libm.so.6 (0x2b472c32)
libtorque.so.2 => /opt/torque/lib/libtorque.so.2 (0x2b472c5a8000)
libdl.so.2 => /lib64/libdl.so.2 (0x003d1760)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x003d1920)
libpthread.so.0 => /lib64/libpthread.so.0 (0x003d17a0)
libc.so.6 => /lib64/libc.so.6 (0x003d1720)
libdat.so.1 => /usr/lib64/libdat.so.1 (0x2b472c8b)
/lib64/ld-linux-x86-64.so.2 (0x003d16e0)


-- 
Regards,
Mahmood


Re: [OMPI users] job aborts "readv failed: Connection reset by peer"

2016-09-02 Thread John Hearns via users
Thank you.  That is helpful.

Could you run 'ldd' on your executable, on one of the compute nodes if
possible?
I will not be able to solve your problem, but at least we now know what the
application is and can look at the libraries it is using.



On 2 September 2016 at 17:19, Mahmood Naderan  wrote:

> The application is Siesta-3.2 and the command I use is
>
>
> /share/apps/computer/openmpi-1.6.5/bin/mpirun -hostfile hosts.txt -np
> 15 /share/apps/chemistry/siesta-3.2-pl-5/tpar/transiesta <
> trans-cc-bt-cc-163-20.fdf
>
> There is one node in the hosts.txt file. I have built transiesta
> binary from the source which uses
> /share/apps/computer/openmpi-1.6.5/bin/mpif90
>
> --
> Regards,
> Mahmood

Re: [OMPI users] job aborts "readv failed: Connection reset by peer"

2016-09-02 Thread Mahmood Naderan
The application is Siesta-3.2 and the command I use is


/share/apps/computer/openmpi-1.6.5/bin/mpirun -hostfile hosts.txt -np 15 \
    /share/apps/chemistry/siesta-3.2-pl-5/tpar/transiesta < trans-cc-bt-cc-163-20.fdf

There is one node in the hosts.txt file. I built the transiesta
binary from source using
/share/apps/computer/openmpi-1.6.5/bin/mpif90

-- 
Regards,
Mahmood


Re: [OMPI users] job aborts "readv failed: Connection reset by peer"

2016-09-02 Thread John Hearns via users
Mahmood,
are you compiling and linking this application?
Or are you using an executable which someone else has prepared?

It would be very useful if we could know the application.




On 2 September 2016 at 16:35, Mahmood Naderan  wrote:

> >Did you run
> >ulimit -c unlimited
> >before invoking mpirun ?
>
> Yes. On the node which says that error. Is that file created in the
> current working directory? Or it is somewhere in the system folders?
>
>
>
> As another question, I am trying to use OpenMPI-2.0.0 as a new one.
> Problem is that the application uses libmpi_f90.a from old versions
> but I don't see that in OpenMPI-2.0.0. There are some other libraries
> there.
>
>
>
>
> --
> Regards,
> Mahmood

[OMPI users] Error in file runtime/orte_init.c

2016-09-02 Thread Mahmood Naderan
Hi,
Using OpenMPI-2.0.0, does anyone have an idea about this error?

A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:  compute-0-1.local
Framework: ess
Component: pmi
--
[compute-0-1.local:22993] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in
file runtime/orte_init.c at line 116




-- 
Regards,
Mahmood


Re: [OMPI users] job aborts "readv failed: Connection reset by peer"

2016-09-02 Thread Mahmood Naderan
>Did you run
>ulimit -c unlimited
>before invoking mpirun ?

Yes, on the node that reports the error. Is the core file created in the
current working directory, or somewhere in the system folders?



As another question, I am trying to switch to OpenMPI-2.0.0.
The problem is that the application uses libmpi_f90.a from the old versions,
but I don't see that library in OpenMPI-2.0.0. There are some other libraries
there instead.




-- 
Regards,
Mahmood


Re: [OMPI users] New to (Open)MPI

2016-09-02 Thread Dave Goodell (dgoodell)
Lachlan mentioned that he has "M Series" hardware, which, to the best of my 
knowledge, does not officially support usNIC.  It may not be possible to even 
configure the relevant usNIC adapter policy in UCSM for M Series 
modules/chassis.

Using the TCP BTL may be the only realistic option here.
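
As a rough sketch of what selecting the TCP BTL looks like: Open MPI reads MCA parameters from environment variables named OMPI_MCA_<parameter>, so the btl parameter can be set once for a shell session. The process count and ./a.out in the comment are placeholders, not taken from this thread.

```shell
# Restrict Open MPI to the TCP BTL plus "self" (used when a process sends
# to itself). Every mpirun started from this shell picks the setting up.
export OMPI_MCA_btl=tcp,self

# Equivalent one-off form on the command line:
#   mpirun --mca btl tcp,self -np 2 ./a.out
```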

-Dave

> On Sep 2, 2016, at 5:35 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
> Greetings Lachlan.
> 
> Yes, Gilles and John are correct: on Cisco hardware, our usNIC transport is 
> the lowest latency / best HPC-performance transport.  I'm not aware of any 
> MPI implementation (including Open MPI) that has support for FC types of 
> transports (including FCoE).
> 
> I'll ping you off-list with some usNIC details.
> 
> 
>> On Sep 1, 2016, at 10:06 PM, Lachlan Musicman  wrote:
>> 
>> Hola,
>> 
>> I'm new to MPI and OpenMPI. Relatively new to HPC as well.
>> 
>> I've just installed a SLURM cluster and added OpenMPI for the users to take 
>> advantage of.
>> 
>> I'm just discovering that I have missed a vital part - the networking.
>> 
>> I'm looking over the networking options and from what I can tell we only 
>> have (at the moment) Fibre Channel over Ethernet (FCoE).
>> 
>> Is this a network technology that's supported by OpenMPI?
>> 
>> (system is running Centos 7, on Cisco M Series hardware)
>> 
>> Please excuse me if I have terms wrong or am missing knowledge. Am new to 
>> this.
>> 
>> cheers
>> Lachlan
>> 
>> 
>> --
>> The most dangerous phrase in the language is, "We've always done it this 
>> way."
>> 
>> - Grace Hopper
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 


Re: [OMPI users] New to (Open)MPI

2016-09-02 Thread Jeff Squyres (jsquyres)
Greetings Lachlan.

Yes, Gilles and John are correct: on Cisco hardware, our usNIC transport is the 
lowest latency / best HPC-performance transport.  I'm not aware of any MPI 
implementation (including Open MPI) that has support for FC types of transports 
(including FCoE).

I'll ping you off-list with some usNIC details.


> On Sep 1, 2016, at 10:06 PM, Lachlan Musicman  wrote:
> 
> Hola,
> 
> I'm new to MPI and OpenMPI. Relatively new to HPC as well.
> 
> I've just installed a SLURM cluster and added OpenMPI for the users to take 
> advantage of.
> 
> I'm just discovering that I have missed a vital part - the networking.
> 
> I'm looking over the networking options and from what I can tell we only have 
> (at the moment) Fibre Channel over Ethernet (FCoE).
> 
> Is this a network technology that's supported by OpenMPI?
> 
> (system is running Centos 7, on Cisco M Series hardware)
> 
> Please excuse me if I have terms wrong or am missing knowledge. Am new to 
> this.
> 
> cheers
> Lachlan
> 
> 
> --
> The most dangerous phrase in the language is, "We've always done it this way."
> 
> - Grace Hopper


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] job aborts "readv failed: Connection reset by peer"

2016-09-02 Thread Jeff Squyres (jsquyres)
Also, the error message suggested that TCP is not the issue here -- the TCP 
hangups are likely because some other process exited unexpectedly.

Indeed:

-
mpirun noticed that process rank 0 with PID 4989 on node compute-0-1 exited on 
signal 4 (Illegal instruction).
-

This might be the real issue.  Getting a corefile, as was already suggested, 
might be the best way to go forward.
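
Once a core file exists, a backtrace can be pulled from it non-interactively. This is only a sketch: the helper name core_backtrace is hypothetical, and the core file's name depends on the system (often core or core.<pid>).

```shell
# Print a backtrace from a core file using gdb in batch mode.
#   $1 = path to the executable that crashed
#   $2 = path to the core file
core_backtrace() {
    if [ ! -r "$2" ]; then
        echo "core file not readable: $2" >&2
        return 1
    fi
    # -batch exits after running the commands; -ex bt prints the stack.
    gdb -batch -ex bt "$1" "$2"
}

# Typical use (file names are placeholders):
#   core_backtrace ./a.out core
```

For an "Illegal instruction" crash, the top frame indicates which function (and so which library) executed the offending instruction.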



> On Sep 2, 2016, at 5:50 AM, John Hearns via users  
> wrote:
> 
> Mahmood, as Gilles says, start by looking at how that application is compiled
> and linked.
> Run 'ldd' on the executable and look closely at the libraries.  Do this on a
> compute node if you can.
>
> There was a discussion on another mailing list recently about how to
> fingerprint executables and see which architecture they were compiled for.
> My mind is a blank at the moment as to what that discussion concluded. Sorry.
>  And if this was on OpenMPI I am doubly sorry!
> 
> 
> On 2 September 2016 at 10:37, Gilles Gouaillardet 
>  wrote:
> Did you run
> ulimit -c unlimited
> before invoking mpirun ?
>
> if your application can be run with only one task, you can try to run it
> under gdb.
> you will hopefully be able to see where the illegal instruction occurs.
> 
> since you are running on AMD processors, you have to make sure you are not 
> using any third party library that was optimized for Intel processors (e.g. 
> that uses AVX (SSE ?) instructions)
> 
> Cheers,
> 
> Gilles
> 
> On Friday, September 2, 2016, Mahmood Naderan  wrote:
> >Are you running under a batch manager ?
> >On which architecture ?
> Currently I am not using the job manager (which is actually PBS). I am
> running from the terminal.
> 
> The machines are AMD Opteron 64 bit
> 
> 
> >Hopefully you will get a core file that points you to the illegal instruction
> Where is that core file. I can not find it.
> 
> BTW, the openmpi is 1.6.5
> 
> 
> --
> Regards,
> Mahmood


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] job aborts "readv failed: Connection reset by peer"

2016-09-02 Thread John Hearns via users
Mahmood, as Gilles says, start by looking at how that application is compiled
and linked.
Run 'ldd' on the executable and look closely at the libraries.  Do this on
a compute node if you can.

There was a discussion on another mailing list recently about how to
fingerprint executables and see which architecture they were compiled for.
My mind is a blank at the moment as to what that discussion concluded.
Sorry.  And if this was on OpenMPI I am doubly sorry!


On 2 September 2016 at 10:37, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Did you run
> ulimit -c unlimited
> before invoking mpirun ?
>
> if your application can be run with only one task, you can try to run it
> under gdb.
> you will hopefully be able to see where the illegal instruction occurs.
>
> since you are running on AMD processors, you have to make sure you are not
> using any third party library that was optimized for Intel processors (e.g.
> that uses AVX (SSE ?) instructions)
>
> Cheers,
>
> Gilles
>
> On Friday, September 2, 2016, Mahmood Naderan 
> wrote:
>
>> >Are you running under a batch manager ?
>> >On which architecture ?
>> Currently I am not using the job manager (which is actually PBS). I am
>> running from the terminal.
>>
>> The machines are AMD Opteron 64 bit
>>
>>
>> >Hopefully you will get a core file that points you to the illegal
>> instruction
>> Where is that core file. I can not find it.
>>
>> BTW, the openmpi is 1.6.5
>>
>>
>> --
>> Regards,
>> Mahmood

Re: [OMPI users] job aborts "readv failed: Connection reset by peer"

2016-09-02 Thread Gilles Gouaillardet
Did you run
ulimit -c unlimited
before invoking mpirun ?

If your application can be run with only one task, you can try to run it
under gdb.
You will hopefully be able to see where the illegal instruction occurs.

Since you are running on AMD processors, you have to make sure you are not
using any third-party library that was optimized for Intel processors (e.g.
one that uses AVX (SSE ?) instructions).
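
Whether a given node's CPU advertises an instruction-set extension can be checked on Linux from the flags line in /proc/cpuinfo. A minimal sketch, assuming a hypothetical helper name cpu_has_flag; the optional second argument exists only so the function can be tried against a saved copy of cpuinfo.

```shell
# Succeed (exit status 0) if the CPU flags list contains the given flag.
#   $1 = flag name, e.g. avx or sse4_2
#   $2 = cpuinfo file to read (defaults to /proc/cpuinfo)
cpu_has_flag() {
    # Take the first "flags" line and match the flag as a whole word,
    # so that "avx" does not match "avx2".
    grep -m1 '^flags' "${2:-/proc/cpuinfo}" | grep -qw -- "$1"
}

# Typical use on a compute node:
#   cpu_has_flag avx && echo "AVX supported" || echo "AVX not supported"
```

If a library was built to use an extension the node lacks, a signal 4 (Illegal instruction) like the one in this thread is a plausible outcome.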

Cheers,

Gilles

On Friday, September 2, 2016, Mahmood Naderan  wrote:

> >Are you running under a batch manager ?
> >On which architecture ?
> Currently I am not using the job manager (which is actually PBS). I am
> running from the terminal.
>
> The machines are AMD Opteron 64 bit
>
>
> >Hopefully you will get a core file that points you to the illegal
> instruction
> Where is that core file. I can not find it.
>
> BTW, the openmpi is 1.6.5
>
>
> --
> Regards,
> Mahmood

Re: [OMPI users] job aborts "readv failed: Connection reset by peer"

2016-09-02 Thread Mahmood Naderan
>Are you running under a batch manager ?
>On which architecture ?
Currently I am not using the job manager (which is actually PBS). I am
running from the terminal.

The machines are AMD Opteron 64 bit


>Hopefully you will get a core file that points you to the illegal instruction
Where is that core file? I cannot find it.

BTW, the Open MPI version is 1.6.5


-- 
Regards,
Mahmood


Re: [OMPI users] New to (Open)MPI

2016-09-02 Thread John Hearns via users
Hello Lachlan.  I think Jeff Squyres will be along in a short while! He is,
of course, the expert on Cisco.

In the meantime a quick Google turns up:
http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/usnic/c/deployment/2_0_X/b_Cisco_usNIC_Deployment_Guide_For_Standalone_C-SeriesServers.html

On 2 September 2016 at 06:54, Gilles Gouaillardet  wrote:

> Hi,
>
>
> FCoE is for storage, Ethernet is for the network.
>
> I assume you can ssh into your nodes, which means you have a TCP/IP
> network, and it is up and running.
>
> I do not know the details of Cisco hardware, but you might be able to use
> usnic (native btl or via libfabric) instead of the plain TCP/IP network.
>
>
> To start, you can build Open MPI and run a job on two nodes with one task
> per node.
>
> In your script, you can run
>
> mpirun --mca btl_base_verbose 100 --mca pml_base_verbose 100 ...
>
> This will tell you which network is used.
>
>
> Cheers,
>
>
> Gilles
> On 9/2/2016 11:06 AM, Lachlan Musicman wrote:
>
> Hola,
>
> I'm new to MPI and OpenMPI. Relatively new to HPC as well.
>
> I've just installed a SLURM cluster and added OpenMPI for the users to
> take advantage of.
>
> I'm just discovering that I have missed a vital part - the networking.
>
> I'm looking over the networking options and from what I can tell we only
> have (at the moment) Fibre Channel over Ethernet (FCoE).
>
> Is this a network technology that's supported by OpenMPI?
>
> (system is running Centos 7, on Cisco M Series hardware)
>
> Please excuse me if I have terms wrong or am missing knowledge. Am new to
> this.
>
> cheers
> Lachlan
>
>
> --
> The most dangerous phrase in the language is, "We've always done it this
> way."
>
> - Grace Hopper
>
>