Gus Correa wrote:

Why wouldn't shared memory work right on Nehalem?

We don't know exactly what is driving this problem, but the issue appears to be related to memory fences. Messages have to be posted to a receiver's queue. By default, each process (since OMPI 1.3.2) has only one queue. A sender acquires a lock on the queue, writes a pointer to its message, advances the queue index, and releases the lock. If there are problems with memory barriers (or our use of them), messages can get lost, overwritten, etc. One manifestation could be hangs. One workaround, as described on this mailing list, is to increase the number of queues (FIFOs) so that each sender gets its own.
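For illustration only, here is a minimal sketch of that lock-then-post sequence and of where a fence matters. This is not the actual sm BTL source; fifo_t, the slot count, and the helper names are all made up. The point is that if the barrier between storing the message pointer and advancing the index is missing (or not honored by the compiler/CPU), a receiver can observe the new index before it observes the pointer, which looks exactly like lost or overwritten messages.

    /* Hypothetical sketch of posting a message to a receiver's shared-memory
     * FIFO -- not the actual Open MPI code; names are invented. */
    #include <stdint.h>

    #define FIFO_SLOTS 64

    typedef struct {
        volatile int32_t  lock;               /* spinlock guarding the FIFO     */
        volatile uint32_t tail;               /* next slot the sender will fill */
        void * volatile   slots[FIFO_SLOTS];  /* pointers into shared memory    */
    } fifo_t;

    static void sm_lock(volatile int32_t *l)   { while (__sync_lock_test_and_set(l, 1)) ; }
    static void sm_unlock(volatile int32_t *l) { __sync_lock_release(l); }

    static void fifo_post(fifo_t *f, void *msg)
    {
        sm_lock(&f->lock);                    /* sender acquires the queue lock */
        f->slots[f->tail % FIFO_SLOTS] = msg; /* write a pointer to the message */
        __sync_synchronize();                 /* fence: the pointer store must be
                                                 visible before the index moves */
        f->tail = f->tail + 1;                /* advance the queue index        */
        sm_unlock(&f->lock);                  /* release the lock               */
    }

As for the workaround: in the 1.3.x/1.4.x series the per-receiver FIFO count is, if I recall correctly, controlled by the sm BTL MCA parameter btl_sm_num_fifos, e.g. "mpirun --mca btl_sm_num_fifos 7 ..." to give each of 7 peer senders its own queue on an 8-process node. Check "ompi_info --param btl sm" on your build to confirm the parameter name and default.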

I think that's what's happening, but we don't know the root cause. The test case in ticket 2043, on the node I used for testing, works like a gem with GCC versions prior to 4.4.x, but with 4.4.x variants it falls hard on its face. Is the problem with GCC 4.4.x? Or does that compiler expose a problem with OMPI? Etc.

It is amazing to me that this issue hasn't surfaced on this list before.

The trac ticket refers to a number of e-mail messages that might be related. At this point, however, it's hard to know what's related and what isn't.

Gus Correa wrote:

FYI, I do NOT see the problem reported by Matthew et al. on our AMD Opteron Shanghai dual-socket quad-core nodes. They run a quite outdated CentOS kernel 2.6.18-92.1.22.el5, with gcc 4.1.2 and Open MPI 1.3.2.

In my mind, GCC 4.1.2 may well be the key difference here. I see a strong correlation with the GCC revision (< 4.4.x vs. >= 4.4.x).

Moreover, everything works fine if I oversubscribe up to 256 processes on one node. Beyond that I sometimes, but not always, get a segmentation fault (not a hang).
I understand that extreme oversubscription is a no-no.

Sounds like another set of problems.
