Re: [OMPI users] [btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

2010-02-24 Thread Jeff Squyres
*Usually*, I have seen these "readv failed: ..." kinds of error messages as a 
side effect of an MPI process exiting abnormally.  The "readv..." messages come 
from the remaining peers whose sockets suddenly closed unexpectedly 
(because of the dead peer).

Check into the signal 11 message (that's a segv); that might be the real error.
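
If you can get a core file out of the crashed rank, a backtrace will usually 
show where it died.  A rough sketch (the VASP binary path and core file name 
are placeholders; this assumes your nodes allow core dumps):

  # in the Torque job script, before the mpirun line:
  ulimit -c unlimited

  # after the crash, on the node that segfaulted (node11 in your output):
  gdb /path/to/vasp core
  # then type "bt" at the (gdb) prompt to print the backtrace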


On Feb 23, 2010, at 4:00 PM, Thomas Sadowski wrote:

> Hello all,
> 
> 
> I am currently attempting to use OpenMPI as the MPI implementation for my VASP 
> calculations. VASP is an ab initio DFT code. Anyhow, I was able to compile and 
> build OpenMPI v. 1.4.1 (I thought) correctly using the following command:
> 
> ./configure --prefix=/home/tes98002 F77=ifort FC=ifort --with-tm=/usr/local
> 
> 
> Note that I am compiling OpenMPI for use with Torque/PBS, which was compiled 
> using Intel v. 10 Fortran compilers and gcc for C/C++. After building OpenMPI, 
> I successfully used it to compile VASP using Intel MKL v. 10.2. I am running 
> OpenMPI in a heterogeneous cluster configuration, and I used an NFS mount so 
> that all the compute nodes could access the executable. Our hardware 
> configuration is as follows:
> 
> 7 nodes: 2 single-core AMD Opteron processors, 2GB of RAM (henceforth called 
> old nodes)
> 4 nodes: 2 dual-core AMD Opteron processors, 2GB of RAM (henceforth called new 
> nodes)
> 
> We are currently running SUSE v. 8.x. Now we have problems when we attempt to 
> run VASP on multiple nodes. A small system (~10 atoms) runs perfectly well 
> with Torque and OpenMPI in all instances: running on a single old node, a 
> single new node, or across multiple old and new nodes. Larger systems (>24 
> atoms) are able to run to completion if they are kept within a single old or 
> new node. However, if I try to run a job on multiple old or new nodes I 
> receive a segfault. In particular, the error is as follows:
> 
> 
> [node12][[7759,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] 
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer 
> (104)[node12][[7759,1],1][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> [node12][[7759,1],3][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] 
> [node12][[7759,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] 
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],1][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] 
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],3][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] 
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] 
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] 
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> --------------------------------------------------------------------------
> mpirun noticed that process rank 6 with PID 11985 on node node11 exited on 
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> forrtl: error (78): process killed (SIGTERM)
> forrtl: error (78): process killed (SIGTERM)
> forrtl: error (78): process killed (SIGTERM)
> forrtl: error (78): process killed (SIGTERM)
> 
> 
> 
> It seems to me that this is a memory issue; however, I may be mistaken. I have 
> searched the archive and have not yet seen an adequate treatment of the 
> problem. I have also tried other versions of OpenMPI. Does anyone have any 
> insight into our issue?
> 
> 
> -Tom
>  
> 
> 
> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] [btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

2010-02-23 Thread Terry Frankcombe
Vasp can be temperamental.  For example, I have a largish system (384
atoms) for which Vasp hangs if I request more than 120 MD steps at a
time.  I am not convinced that this is OMPI's problem.  However, your
case looks much more diagnosable than my silent spinning hang.

On Tue, 2010-02-23 at 16:00 -0500, Thomas Sadowski wrote:
> Hello all,
> 
> 
> I am currently attempting to use OpenMPI as the MPI implementation for
> my VASP calculations. VASP is an ab initio DFT code. Anyhow, I was able
> to compile and build OpenMPI v. 1.4.1 (I thought) correctly using the
> following command:
> 
> ./configure --prefix=/home/tes98002 F77=ifort FC=ifort
> --with-tm=/usr/local
> 
> 
> Note that I am compiling OpenMPI for use with Torque/PBS, which was
> compiled using Intel v. 10 Fortran compilers and gcc for C/C++. After
> building OpenMPI, I successfully used it to compile VASP using Intel
> MKL v. 10.2. I am running OpenMPI in a heterogeneous cluster
> configuration, and I used an NFS mount so that all the compute nodes
> could access the executable. Our hardware configuration is as follows:
> 
> 7 nodes: 2 single-core AMD Opteron processors, 2GB of RAM (henceforth
> called old nodes)
> 4 nodes: 2 dual-core AMD Opteron processors, 2GB of RAM (henceforth
> called new nodes)
> 
> We are currently running SUSE v. 8.x. Now we have problems when we
> attempt to run VASP on multiple nodes. A small system (~10 atoms) runs
> perfectly well with Torque and OpenMPI in all instances: running on a
> single old node, a single new node, or across multiple old and new
> nodes. Larger systems (>24 atoms) are able to run to completion if
> they are kept within a single old or new node. However, if I try to
> run a job on multiple old or new nodes I receive a segfault. In
> particular, the error is as follows:
> 
> 
> [node12][[7759,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer
> (104)[node12][[7759,1],1][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> [node12][[7759,1],3][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> [node12][[7759,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],1][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],3][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> --------------------------------------------------------------------------
> mpirun noticed that process rank 6 with PID 11985 on node node11
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> forrtl: error (78): process killed (SIGTERM)
> forrtl: error (78): process killed (SIGTERM)
> forrtl: error (78): process killed (SIGTERM)
> forrtl: error (78): process killed (SIGTERM)
> 
> 
> 
> It seems to me that this is a memory issue; however, I may be mistaken.
> I have searched the archive and have not yet seen an adequate treatment
> of the problem. I have also tried other versions of OpenMPI. Does
> anyone have any insight into our issue?
> 
> 
> -Tom
>  
> 
> 
> 
> 
> 
> 



[OMPI users] [btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

2010-02-23 Thread Thomas Sadowski

Hello all,


I am currently attempting to use OpenMPI as the MPI implementation for my VASP 
calculations. VASP is an ab initio DFT code. Anyhow, I was able to compile and 
build OpenMPI v. 1.4.1 (I thought) correctly using the following command:

./configure --prefix=/home/tes98002 F77=ifort FC=ifort --with-tm=/usr/local
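
As a sanity check that the Torque/TM support actually got built in, ompi_info 
can be grepped for the tm components (the component names below are what I 
would expect, not something I have verified):

  ompi_info | grep tm
  # should list lines such as "MCA plm: tm" and "MCA ras: tm"
  # if TM support was compiled in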


Note that I am compiling OpenMPI for use with Torque/PBS, which was compiled 
using Intel v. 10 Fortran compilers and gcc for C/C++. After building OpenMPI, I 
successfully used it to compile VASP using Intel MKL v. 10.2. I am running 
OpenMPI in a heterogeneous cluster configuration, and I used an NFS mount so that 
all the compute nodes could access the executable. Our hardware configuration 
is as follows:

7 nodes: 2 single-core AMD Opteron processors, 2GB of RAM (henceforth called 
old nodes)
4 nodes: 2 dual-core AMD Opteron processors, 2GB of RAM (henceforth called new 
nodes)

We are currently running SUSE v. 8.x. Now we have problems when we attempt to 
run VASP on multiple nodes. A small system (~10 atoms) runs perfectly well with 
Torque and OpenMPI in all instances: running on a single old node, a single 
new node, or across multiple old and new nodes. Larger systems (>24 atoms) are 
able to run to completion if they are kept within a single old or new node. 
However, if I try to run a job on multiple old or new nodes I receive a 
segfault. In particular, the error is as follows:


[node12][[7759,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] 
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer 
(104)[node12][[7759,1],1][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
[node12][[7759,1],3][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] 
[node12][[7759,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] 
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node12][[7759,1],1][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] 
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node12][[7759,1],3][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] 
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node12][[7759,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] 
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node12][[7759,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] 
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun noticed that process rank 6 with PID 11985 on node node11 exited on 
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)



It seems to me that this is a memory issue; however, I may be mistaken. I have 
searched the archive and have not yet seen an adequate treatment of the problem. 
I have also tried other versions of OpenMPI. Does anyone have any insight into 
our issue?
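
One thing I still intend to try (untested; the VASP path and process count are 
placeholders) is raising the per-process stack limit in the Torque job script, 
since ifort-built codes often segfault when large automatic arrays overflow 
the default stack:

  # in the PBS job script, before launching:
  ulimit -s unlimited   # ifort places large temporary arrays on the stack
  ulimit -c unlimited   # also allow core dumps for post-mortem debugging
  mpirun -np 8 /path/to/vasp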


-Tom
 




  