On Dec 18, 2013, at 10:32 AM, Dave Love <d.l...@liverpool.ac.uk> wrote:

> Noam Bernstein <noam.bernst...@nrl.navy.mil> writes:
> 
>> We specifically switched to 1.7.3 because of a bug in 1.6.4 (lock up in some 
>> collective communication), but now I'm wondering whether I should just test
>> 1.6.5.
> 
> What bug, exactly?  As you mentioned vasp, is it specifically affecting
> that?

Yes - I never characterized it fully, but we attached gdb to every
single running vasp process, and all were stuck in the same
call to MPI_Allreduce() every time. It only happens on rather large
jobs, so it's not the easiest setup to debug.
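
A minimal Python sketch of how that kind of check could be scripted,
assuming gdb is installed on the node and the PIDs of the ranks are
already known. The helper names are hypothetical, and this is not
necessarily the procedure used here; it only illustrates the idea of
collecting one backtrace per process and comparing the MPI call each
one is sitting in:

```python
import re
import subprocess
from collections import Counter

# Matches the start of a gdb backtrace frame, e.g.
#   "#2  0x00007f... in PMPI_Allreduce (...)"  or  "#0  main ()"
# and captures the function name.
FRAME_RE = re.compile(
    r"^#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([A-Za-z_]\w*)", re.MULTILINE
)
# Assumption: MPI entry points show up as MPI_*/PMPI_* (any case,
# since Fortran bindings are usually lower case).
MPI_RE = re.compile(r"^p?mpi_", re.IGNORECASE)


def gdb_backtrace_cmd(pid):
    """Command line that attaches gdb to pid, prints a backtrace, and detaches."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "bt"]


def innermost_mpi_call(backtrace):
    """Return the first (innermost) MPI function named in a backtrace, or None."""
    for name in FRAME_RE.findall(backtrace):
        if MPI_RE.match(name):
            return name
    return None


def summarize(backtraces):
    """Count how many processes are sitting in each MPI call."""
    return Counter(innermost_mpi_call(bt) for bt in backtraces)


def collect(pids):
    """Run gdb once per PID and return the raw backtrace text for each."""
    return [
        subprocess.run(gdb_backtrace_cmd(pid), capture_output=True, text=True).stdout
        for pid in pids
    ]
```

If summarize(collect(pids)) comes back with a single key on every node,
all ranks are blocked in the same MPI function. That alone does not show
they are in the same call site; for that the full backtraces still need
to be compared.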

If I can reproduce the problem with 1.6.5, and I can confirm that it's always
locking up in the same call to MPI_Allreduce(), with all processes stuck
in that same call, is there interest in looking into a possible MPI issue?

Given that 1.7.3 seems to be working now, whether 1.6.x works is a bit of a
moot point for us (although I just realized that I should check that it works
with 1.7.3 even with --bind-to core).

                                                                        Noam
