Gus Correa wrote:
> Why wouldn't shared memory work right on Nehalem?
We don't know exactly what is driving this problem, but the issue
appears to be related to memory fences. Messages have to be posted to a
receiver's queue. By default, each process (since OMPI 1.3.2) has only
one queue. A sender acquires a lock to the queue, writes a pointer to
its message, advances the queue index, and releases the lock. If there
are problems with memory barriers (or our use of them), messages can get
lost, overwritten, etc. One manifestation could be hangs. One
workaround, as described on this mailing list, is to increase the number of
queues (FIFOs) so that each sender gets its own.
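To make the sequence above concrete, here is a rough sketch in C11 of a
single locked queue that several senders share. It is only an
illustration of the lock / write pointer / advance index / unlock
pattern, not the actual OMPI sm BTL code; fifo_t, fifo_init, fifo_post,
fifo_poll, and FIFO_SLOTS are all made-up names, and the use of C11
atomics is my assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define FIFO_SLOTS 8 /* hypothetical queue depth */

/* Hypothetical receiver-side queue; NOT Open MPI's real data structure. */
typedef struct {
    atomic_flag lock;                 /* spin lock shared by all senders */
    _Atomic(void *) slot[FIFO_SLOTS]; /* NULL marks a free slot */
    size_t head;                      /* next slot to write; guarded by lock */
    size_t tail;                      /* next slot to read; receiver only */
} fifo_t;

void fifo_init(fifo_t *q) {
    assert(q != NULL);
    atomic_flag_clear(&q->lock);
    for (size_t i = 0; i < FIFO_SLOTS; i++)
        atomic_init(&q->slot[i], NULL);
    q->head = 0;
    q->tail = 0;
}

/* Sender side: acquire the lock, write the message pointer, advance the
 * index, release the lock. Returns 0 on success, -1 if the queue is full. */
int fifo_post(fifo_t *q, void *msg) {
    int rc = -1;
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
        ; /* spin until the lock is free */
    if (atomic_load_explicit(&q->slot[q->head], memory_order_acquire) == NULL) {
        /* Release ordering: the message contents written earlier by this
         * sender become visible before the pointer does. */
        atomic_store_explicit(&q->slot[q->head], msg, memory_order_release);
        q->head = (q->head + 1) % FIFO_SLOTS;
        rc = 0;
    }
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
    return rc;
}

/* Receiver side: poll the tail slot. Returns NULL if nothing is posted. */
void *fifo_poll(fifo_t *q) {
    void *msg = atomic_load_explicit(&q->slot[q->tail], memory_order_acquire);
    if (msg != NULL) {
        /* Hand the slot back to the senders. */
        atomic_store_explicit(&q->slot[q->tail], NULL, memory_order_release);
        q->tail = (q->tail + 1) % FIFO_SLOTS;
    }
    return msg;
}
```

If the acquire/release orderings in a scheme like this are missing or
are not honored, a sender can reuse a slot the receiver has not drained,
or the receiver can see a pointer before the message behind it, which is
the kind of lost or overwritten message described above. With one queue
per sender the lock is no longer contended, which would be consistent
with the workaround helping.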
I think that's what's happening, but we don't know the root cause. The
test case in ticket 2043, on the node I used for testing, works like a gem
with GCC versions prior to 4.4.x, but with 4.4.x variants it falls hard on
its face. Is the problem with GCC 4.4.x? Or, does that compiler expose
a problem with OMPI? Etc.
It is amazing to me that this issue hasn't surfaced on this list before.
The trac ticket refers to a number of e-mail messages that might be
related. At this point, however, it's hard to know what's related and
what isn't.
Gus Correa wrote:
> FYI, I do NOT see the problem reported by Matthew et al. on our AMD
> Opteron Shanghai dual-socket quad-core nodes. They run a quite outdated
> CentOS kernel 2.6.18-92.1.22.el5, with gcc 4.1.2 and OpenMPI 1.3.2.
In my mind, GCC 4.1.2 may well be the ticket here. I see a strong
correlation with the GCC version (< 4.4.x works, >= 4.4.x fails).
> Moreover, all works fine if I oversubscribe up to 256 processes on one
> node.
> Beyond that I sometimes get a segmentation fault (not a hang), but not
> always.
> I understand that extreme oversubscription is a no-no.
Sounds like another set of problems.