On Dec 18, 2013, at 10:32 AM, Dave Love <d.l...@liverpool.ac.uk> wrote:
> Noam Bernstein <noam.bernst...@nrl.navy.mil> writes:
>
>> We specifically switched to 1.7.3 because of a bug in 1.6.4 (lock up in some
>> collective communication), but now I'm wondering whether I should just test
>> 1.6.5.
>
> What bug, exactly?  As you mentioned vasp, is it specifically affecting
> that?

Yes - I never characterized it fully, but we attached with gdb to every single running vasp process, and all were stuck in the same call to MPI_allreduce() every time. It's only happening on rather large jobs, so it's not the easiest setup to debug.

If I can reproduce the problem with 1.6.5, and I can confirm that it's always locking up in the same call to mpi_allreduce, and all processes are stuck in the same call, is there interest in looking into a possible mpi issue?

Given that 1.7.3 seems to be working now, whether 1.6.x works is a bit of a moot point for us (although I just realized that I should check that it works with 1.7.3 even with --bind-to core).

Noam