What version of GM are you running?
# rpm -qa |egrep "^gm-[0-9]+|^gm-devel"
gm-2.0.24-1
gm-devel-2.0.24-1
Is this too old?

Nope, that's just fine.

A mismatch between the list of nodes actually configured onto the Myrinet
fabric and the machine file is a common source of errors like this. The
mismatch could be caused by cable failure or other mapping issues.
Could you elaborate on the mapping issues you mentioned? What are they?

If you have 3 nodes, A, B, and C, and the mapper on node C dies for some reason (very unusual, but maybe killed by mistake, say), and node B is then rebooted, B will come back up with routes to only node A and itself, though A and C will still have routes everywhere. The map versions on A and B will match, but C will have an old map version. Thus, an MPI job spanning A, B, and C would fail, even though all 3 nodes show up in gm_board_info from node A.
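
To make the A/B/C scenario above concrete, here is a toy model in Python. It is only a sketch: the NodeView class, the job_problems function, and the map version numbers 1 and 2 are made up for illustration, and nothing here talks to GM or parses gm_board_info output. It simply shows why a view from node A alone can look healthy while a job spanning all three nodes cannot work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeView:
    """What one node believes about the fabric (hypothetical model)."""
    map_version: int        # illustrative version number, not a real GM value
    routes: frozenset       # names of the nodes this node has routes to


def job_problems(views, job_nodes):
    """Return a list of reasons why a job over job_nodes could fail."""
    problems = []

    # Every node in the job should agree on the map version.
    versions = {views[n].map_version for n in job_nodes}
    if len(versions) > 1:
        detail = ", ".join(f"{n}={views[n].map_version}" for n in job_nodes)
        problems.append(f"map versions differ: {detail}")

    # Every node in the job needs a route to every other node in the job.
    for src in job_nodes:
        missing = sorted(set(job_nodes) - views[src].routes)
        if missing:
            problems.append(f"{src} has no route to {', '.join(missing)}")

    return problems


# State after the mapper on C died and B was rebooted:
# A and C still have routes everywhere, B only knows A and itself,
# and C is left holding the old map version.
views = {
    "A": NodeView(map_version=2, routes=frozenset({"A", "B", "C"})),
    "B": NodeView(map_version=2, routes=frozenset({"A", "B"})),
    "C": NodeView(map_version=1, routes=frozenset({"A", "B", "C"})),
}

for line in job_problems(views, ["A", "B", "C"]):
    print(line)
```

In this model, node A's own view lists all three nodes, which matches the observation that everything shows up from node A. The problems only become visible when the views of all nodes in the job are compared, so a job limited to A and B reports nothing.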

Why GM instead of MX, by the way?
We have a few MX cards in-house, but no MX switch due to its current
market price. So we're only able to perform MX testing using
direct-connection cables, which is not very exciting :) On the other
hand, we already have GM boards and a switch, and have found them
sufficient for OpenMPI testing purposes. It would be great to upgrade
to MX in the near future.

MX is just a different software stack; the hardware is the same. MX works with both 2G and 10G, but GM does not work with the 10G cards. I see from your gm_board_info output that you are using D-cards, which MX supports (anything D or later is supported by MX, but not B or C cards). Switches don't care about MX vs. GM. MX will give better performance than GM for most MPI applications, and hardware too old for MX is fairly uncommon.

-reese

