Re: [OMPI users] crashes in VASP with openmpi 1.6.x

2012-10-03 Thread Noam Bernstein
Thanks to everyone who answered, in particular Ake Sandgren. It appears
to be a strange problem in ACML that somehow triggers a segfault in
libmpi, but only when running on Opterons.  I'd still be interested in
figuring out how to get a more complete backtrace, but at least the
immediate problem is solved.
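
For reference, one generic way to get such a backtrace (a sketch, not
VASP code; assumes Linux/glibc) is to install a SIGSEGV handler that
dumps the call stack of the crashing rank, and to link with -g and
-rdynamic so the frame addresses resolve to names (leftover raw
addresses can still be fed to addr2line afterwards):

    /* Minimal sketch, assuming Linux/glibc and an MPI compiler wrapper
     * such as mpicc (not from the thread).  Build with something like
     * "mpicc -g -rdynamic backtrace_demo.c" so symbol names resolve. */
    #include <execinfo.h>   /* backtrace(), backtrace_symbols_fd() */
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <mpi.h>

    static void segv_handler(int sig)
    {
        void *frames[64];
        int nframes = backtrace(frames, 64);
        fprintf(stderr, "caught signal %d, %d frames:\n", sig, nframes);
        /* backtrace_symbols_fd() writes straight to the fd and avoids
         * malloc(), so it is reasonable to call in a crash handler */
        backtrace_symbols_fd(frames, nframes, STDERR_FILENO);
        _exit(1);
    }

    int main(int argc, char **argv)
    {
        signal(SIGSEGV, segv_handler);
        MPI_Init(&argc, &argv);
        /* ... the work that may crash goes here ... */
        MPI_Finalize();
        return 0;
    }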


Noam


Re: [OMPI users] crashes in VASP with openmpi 1.6.x

2012-10-02 Thread Albert Everett
For what it's worth, on our cluster I do currently compile VASP with
OpenMPI, but we don't include ScaLAPACK because we didn't see a speedup
from it. So far we haven't seen improvements from using OpenMP in VASP
or MKL, so we're not doing much with OpenMP either.

On our shared memory machine we will probably do more with OpenMP, especially 
for MKL.

We're relatively new to VASP, though, so we're eager to hear what works for 
other people. We're also curious to see how 5.3.x behavior compares with 5.2.x.

Albert

On Oct 2, 2012, at 8:11 AM, Noam Bernstein  wrote:

> Hi - I've been trying to run VASP 5.2.12 with ScaLAPACK and openmpi
> 1.6.x on a single 32-core (4 x 8-core) Opteron node, purely shared memory.
> [...]




[OMPI users] crashes in VASP with openmpi 1.6.x

2012-10-02 Thread Noam Bernstein
Hi - I've been trying to run VASP 5.2.12 with ScaLAPACK and openmpi
1.6.x on a single 32-core (4 x 8-core) Opteron node, purely shared memory.
We've always had occasional hangs with older OpenMPI versions
(1.4.3 and 1.5.5) on these machines, but they were infrequent enough
that the machines stayed usable and it wasn't worth my time to debug.

However, now that I've moved to the 1.6 series (1.6.2, specifically), we're
getting frequent crashes, mostly but perhaps not entirely deterministic.  The
symptom is a segmentation fault in libmpi.so, somewhere under a call to
PZHEEVX, but the traceback prints only the names of routines in VASP,
despite the fact that I have ScaLAPACK compiled with -g.

ScaLAPACK is v1.8.0, because with v2.0.2 it completely fails to converge.
I've tried a couple of versions of the Intel compiler (11.1.080 and
12.1.6.631) and a couple of versions of ACML (4.4.0 and 5.2.0).  The ACML
version seems not to matter, and the two versions of ifort give the same
type of behavior but crash in different places in the run.  When I switch
compiler and ACML/ScaLAPACK libraries I recompile everything, except for
OpenMPI, which is always compiled with ifort 11.1.080.

These crashes do not seem to occur on our 2 x 4 core Xeon + IB QDR nodes.

Has anyone seen anything like this, or does anyone have an idea how to get
additional useful information, for example traceback information so I can
figure out which MPI routine is having problems?
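
One generic trick for exactly this situation (a sketch, not something
from the thread or from VASP): have each rank print its host and PID and
then spin until a debugger is attached and releases it, after which you
can step toward the failing MPI call with full symbols:

    /* Sketch of the usual "pause and attach" pattern (generic
     * illustration, not VASP code).  Run normally under mpirun, then
     * in another terminal: gdb -p <pid>, "set var holding = 0",
     * "continue", and step toward the crashing MPI routine. */
    #include <stdio.h>
    #include <unistd.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        volatile int holding = 1;   /* cleared from inside the debugger */
        char host[256];
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        gethostname(host, sizeof(host));
        printf("rank %d: pid %d on %s waiting for debugger\n",
               rank, (int)getpid(), host);
        fflush(stdout);

        while (holding)             /* spin until gdb clears the flag */
            sleep(1);

        /* ... proceed to the code that segfaults ... */
        MPI_Finalize();
        return 0;
    }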


thanks,

Noam