On Apr 12, 2006, at 8:59 PM, Jeff Squyres (jsquyres) wrote:
FWIW, the "has a different size..." errors mean that you may not have been linking against the shared libraries that you thought you were. This typically means that the executable expected to find an object of a given size in a library, but the actual size of the object was different. So some kind of mismatch was occurring, and the segv at the end was therefore not surprising.
Yeah; I wasn't surprised either. That's why I just re-compiled the app & ran it. Then it worked.
I'm suspicious (but can't prove it) that the opensm subnet manager (running on another node, and on the Mellanox 'ib gold' stack) wasn't working properly. The problem is that I have nothing to back up the suspicion. But the behavior was consistent with what I'd see if there was no subnet manager on the IB fabric (which may well have been the case, actually). It's working now, though...
-- Troy Telford