Re: [OMPI users] problems with Intel 12.x compilers and OpenMPI (1.4.3)

2011-09-24 Thread Jeff Squyres
As a pure guess, it might actually be this one:

- Fix to detect and avoid overlapping memcpy().  Thanks to
  Francis Pellegrini for identifying the issue.
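
For background: passing overlapping source and destination buffers to memcpy() is undefined behavior in C; memmove() is the variant that handles overlap. A minimal sketch of the kind of guard such a fix implies (hypothetical helper, not the actual Open MPI code):

#include <string.h>
#include <stdint.h>

/* Hypothetical helper, not Open MPI's actual code: route overlapping
 * copies to memmove(), since memcpy() on overlapping buffers is
 * undefined behavior. */
static void *safe_copy(void *dst, const void *src, size_t len)
{
    uintptr_t d = (uintptr_t) dst, s = (uintptr_t) src;
    if (d < s + len && s < d + len) {
        return memmove(dst, src, len);  /* ranges overlap */
    }
    return memcpy(dst, src, len);       /* ranges are disjoint */
}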

We're actually very close to releasing 1.4.4 -- using the latest RC should be 
pretty safe.


On Sep 23, 2011, at 5:51 AM, Paul Kapinos wrote:

> Hi Open MPI volks,
> 
> we see some quite strange effects with our installations of Open MPI 1.4.3 
> with Intel 12.x compilers, which makes us puzzling: Different programs 
> reproducibly deadlock or die with errors alike the below-listed ones.
> 
> Some of the errors looks alike programming issue at first look (well, a 
> deadlock *is* usually a programming error) but we do not believe it is so: 
> the errors arise in many well-tested codes including HPL (*) and only with a 
> special compiler + Open MPI version (Intel 12.x compiler + open MPI 1.4.3) 
> and only on special number of processes (usually high ones). For example, HPL 
> reproducible deadlocks with 72 procs and dies with error message #2 with 384 
> processes.
> 
> All this errors seem to be somehow related to MPI communicators; and 1.4.4rc3 
> and in 1.5.3 and 1.5.4 seem not to have this problem. Also 1.4.3 if using 
> together with Intel 11.x compielr series seem to be unproblematic. So 
> probably this:
> 
> (1.4.4 release notes:)
> - Fixed a segv in MPI_Comm_create when called with GROUP_EMPTY.
>  Thanks to Dominik Goeddeke for finding this.
> 
> is also fix for our issues? Or maybe not, because 1.5.3 is _older_ than this 
> fix?
> 
> As far as we workarounded the problem by switching our production to 1.5.3 
> this issue is not a "burning" one; but I decieded still to post this because 
> any issue on such fundamental things may be interesting for developers.
> 
> Best wishes,
> Paul Kapinos
> 
> 
> (*) http://www.netlib.org/benchmark/hpl/
> 
> 
> Fatal error in MPI_Comm_size: Invalid communicator, error stack:
> MPI_Comm_size(111): MPI_Comm_size(comm=0x0, size=0x6f4a90) failed
> MPI_Comm_size(69).: Invalid communicator
> 
> 
> [linuxbdc05.rz.RWTH-Aachen.DE:23219] *** An error occurred in MPI_Comm_split
> [linuxbdc05.rz.RWTH-Aachen.DE:23219] *** on communicator MPI COMMUNICATOR 3 
> SPLIT FROM 0
> [linuxbdc05.rz.RWTH-Aachen.DE:23219] *** MPI_ERR_IN_STATUS: error code in 
> status
> [linuxbdc05.rz.RWTH-Aachen.DE:23219] *** MPI_ERRORS_ARE_FATAL (your MPI job 
> will now abort)
> 
> 
> forrtl: severe (71): integer divide by zero
> Image PC Routine Line Source
> libmpi.so.0 2D9EDF52 Unknown Unknown Unknown
> libmpi.so.0 2D9EE45D Unknown Unknown Unknown
> libmpi.so.0 2D9C3375 Unknown Unknown Unknown
> libmpi_f77.so.0 2D75C37A Unknown Unknown Unknown
> vasp_mpi_gamma 0057E010 Unknown Unknown Unknown
> vasp_mpi_gamma 0059F636 Unknown Unknown Unknown
> vasp_mpi_gamma 00416C5A Unknown Unknown Unknown
> vasp_mpi_gamma 00A62BEE Unknown Unknown Unknown
> libc.so.6 003EEB61EC5D Unknown Unknown Unknown
> vasp_mpi_gamma 00416A29 Unknown Unknown Unknown
> 
> 
> -- 
> Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
> RWTH Aachen University, Center for Computing and Communication
> Seffenter Weg 23,  D 52074  Aachen (Germany)
> Tel: +49 241/80-24915
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI users] problems with Intel 12.x compilers and OpenMPI (1.4.3)

2011-09-23 Thread Paul Kapinos

Hi Open MPI folks,

we see some quite strange effects with our installations of Open MPI 
1.4.3 with Intel 12.x compilers, which puzzle us: different programs 
reproducibly deadlock or die with errors like those listed below.


Some of the errors look like programming issues at first glance (well, a 
deadlock *is* usually a programming error), but we do not believe that is 
the case here: the errors arise in many well-tested codes, including HPL 
(*), only with one specific compiler + Open MPI combination (Intel 12.x 
compilers + Open MPI 1.4.3), and only at particular process counts 
(usually high ones). For example, HPL reproducibly deadlocks with 72 
processes and dies with error message #2 with 384 processes.


All these errors seem to be somehow related to MPI communicators; 
1.4.4rc3, 1.5.3, and 1.5.4 seem not to have this problem. Also, 1.4.3 
together with the Intel 11.x compiler series seems to be unproblematic. 
So probably this:


(1.4.4 release notes:)
- Fixed a segv in MPI_Comm_create when called with GROUP_EMPTY.
  Thanks to Dominik Goeddeke for finding this.

is also the fix for our issues? Or maybe not, because 1.5.3 is _older_ 
than this fix?
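
For reference, the call pattern that release note describes is small; a minimal sketch of it (illustrative only, not one of our failing codes):

#include <mpi.h>
#include <stdio.h>

/* Minimal sketch of the MPI_Comm_create + MPI_GROUP_EMPTY pattern
 * mentioned in the 1.4.4 release note; illustrative only. */
int main(int argc, char **argv)
{
    MPI_Comm newcomm;

    MPI_Init(&argc, &argv);
    /* A rank that passes MPI_GROUP_EMPTY is not a member of the new
     * communicator and receives MPI_COMM_NULL back. */
    MPI_Comm_create(MPI_COMM_WORLD, MPI_GROUP_EMPTY, &newcomm);
    if (newcomm == MPI_COMM_NULL) {
        printf("not a member: got MPI_COMM_NULL as expected\n");
    } else {
        MPI_Comm_free(&newcomm);
    }
    MPI_Finalize();
    return 0;
}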


Since we have worked around the problem by switching our production to 
1.5.3, this issue is not a "burning" one; but I still decided to post 
it, because any issue in such fundamental things may be interesting for 
the developers.


Best wishes,
Paul Kapinos


(*) http://www.netlib.org/benchmark/hpl/


Fatal error in MPI_Comm_size: Invalid communicator, error stack:
MPI_Comm_size(111): MPI_Comm_size(comm=0x0, size=0x6f4a90) failed
MPI_Comm_size(69).: Invalid communicator


[linuxbdc05.rz.RWTH-Aachen.DE:23219] *** An error occurred in MPI_Comm_split
[linuxbdc05.rz.RWTH-Aachen.DE:23219] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
[linuxbdc05.rz.RWTH-Aachen.DE:23219] *** MPI_ERR_IN_STATUS: error code in status
[linuxbdc05.rz.RWTH-Aachen.DE:23219] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
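
A generic way to localize such failures (a debugging sketch, not specific to this bug) is to switch MPI_COMM_WORLD to MPI_ERRORS_RETURN and report the error string per rank instead of aborting:

#include <mpi.h>
#include <stdio.h>

/* Debugging sketch: return error codes instead of aborting, so the
 * failing call site can be reported per rank. */
int main(int argc, char **argv)
{
    char msg[MPI_MAX_ERROR_STRING];
    int err, len, rank;
    MPI_Comm split_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Split into two halves by rank parity, as a stand-in for the
     * MPI_Comm_split call that aborts above. */
    err = MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &split_comm);
    if (err != MPI_SUCCESS) {
        MPI_Error_string(err, msg, &len);
        fprintf(stderr, "rank %d: MPI_Comm_split: %s\n", rank, msg);
    } else {
        MPI_Comm_free(&split_comm);
    }
    MPI_Finalize();
    return 0;
}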



forrtl: severe (71): integer divide by zero
Image PC Routine Line Source
libmpi.so.0 2D9EDF52 Unknown Unknown Unknown
libmpi.so.0 2D9EE45D Unknown Unknown Unknown
libmpi.so.0 2D9C3375 Unknown Unknown Unknown
libmpi_f77.so.0 2D75C37A Unknown Unknown Unknown
vasp_mpi_gamma 0057E010 Unknown Unknown Unknown
vasp_mpi_gamma 0059F636 Unknown Unknown Unknown
vasp_mpi_gamma 00416C5A Unknown Unknown Unknown
vasp_mpi_gamma 00A62BEE Unknown Unknown Unknown
libc.so.6 003EEB61EC5D Unknown Unknown Unknown
vasp_mpi_gamma 00416A29 Unknown Unknown Unknown


--
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915

