On Dec 10, 2009, at 5:53 PM, Gus Correa wrote:

> How does the efficiency of loopback
> (let's say, over TCP and over IB) compare with "sm"?
Definitely not as good; that's why we have sm. :-)  I don't have any
quantification of that assertion, though (i.e., no numbers to back it up).

> FYI, I do NOT see the problem reported by Matthew et al.
> on our AMD Opteron Shanghai dual-socket quad-core.
> They run a quite outdated
> CentOS kernel 2.6.18-92.1.22.el5, with gcc 4.1.2,
> and Open MPI 1.3.2.
> (I've been lazy to upgrade; it is a production machine.)
>
> I could run all three Open MPI test programs (hello_c, ring_c, and
> connectivity_c) on all 8 cores on a single node WITH "sm" turned ON
> with no problem whatsoever.

Good.

> Moreover, all works fine if I oversubscribe up to 256 processes on
> one node.
> Beyond that I get a segmentation fault (not hanging) sometimes,
> but not always.
> I understand that extreme oversubscription is a no-no.

It's quite possible that extreme oversubscription and/or that many procs
in sm have not been well tested.

> Moreover, on the screenshots that Matthew posted, the cores
> were at 100% CPU utilization on the simple connectivity_c
> (although this was when he had "sm" turned on on Nehalem).
> On my platform I don't get anything more than 3% or so.

100% CPU utilization usually means that some expected completion hasn't
occurred, so everything is spinning, waiting for it.  The "hasn't
occurred" bit is probably the bug here -- it's likely that a completion
should have been generated but somehow got missed.  But this is
speculative -- we're still investigating...

-- 
Jeff Squyres
jsquy...@cisco.com