I found myself with a little treat this morning to the tune of tracing
running on the entire cluster of 3,500 nodes. There were no logs I could
find to indicate *why* the tracing had started, but it was clear it had
been initiated by the cluster manager.
Some sleuthing (thanks, collectl!) allowed me
Hey All,
I've noticed after upgrading from 4.1 to 4.2.3.6 efix17 that a gpfs.snap
now takes a really long time as in... a *really* long time. Digging into
it I can see that the snap command is actually done, but the sshd child
is left waiting on a sleep process on the clients (a sleep 600 at
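In case it helps anyone else dig into the same thing, here's a minimal sketch of hunting down leftover sleep processes. The `sleep 600` value comes from what I saw above; the snippet just spawns a stand-in sleep so the `ps` incantation has something to find, and the exact parent chain back to sshd on your clients is an assumption:

```shell
# Illustrative only: spawn a stand-in for the leftover sleep, then list
# sleep processes with their parent PIDs so the chain can be traced
# back to the sshd child stuck waiting on them.
sleep 600 &
stuck=$!
ps -o pid=,ppid=,args= -p "$stuck"
kill "$stuck"
```

Walking up the reported PPIDs (e.g. with `ps -o args= -p <ppid>`) shows which remote command left the sleep behind.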
I, personally, haven't been burned by mixing UD and RC IPoIB clients on
the same fabric, but that doesn't mean it can't happen. What I *have*
been bitten by a couple of times is not having enough entries in the ARP
cache after bringing a bunch of new nodes online (that made for a long
Christmas
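For anyone who hits the same ARP exhaustion, the usual knobs are the kernel's neighbor-table garbage-collection thresholds. A sketch of a sysctl fragment; the file name and the values here are illustrative guesses, size them to your own node count:

```
# /etc/sysctl.d/99-neigh.conf  (illustrative values only)
# gc_thresh3 is the hard cap on neighbor-table entries; once you
# exceed it, new ARP entries start failing.
net.ipv4.neigh.default.gc_thresh1 = 8192
net.ipv4.neigh.default.gc_thresh2 = 16384
net.ipv4.neigh.default.gc_thresh3 = 32768
```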
Hi, Saula,
This sounds like a jumbo frame problem.
Pings and metadata queries use small packets, so you can always ping or ls a file.
However, data transfers use large packets, up to the MTU size. Your MTU 65536
nodes send out large packets, but they get dropped on the way to the 2044 nodes, because
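To make the mismatch concrete, here's a tiny sketch of sizing a do-not-fragment ping for the smaller MTU. The 20- and 8-byte overheads are the standard IPv4 and ICMP header sizes, and the peer name is a placeholder:

```shell
# Largest ICMP payload that fits in one 2044-byte frame:
# MTU minus 20 bytes (IPv4 header) minus 8 bytes (ICMP header).
mtu=2044
payload=$((mtu - 20 - 8))
echo "payload=$payload"
# A ping of this size with DF set should pass; payload+1 should not:
#   ping -M do -s "$payload" <2044-mtu-node>
```

If the larger size fails while the smaller one works, that points straight at an MTU mismatch along the path.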
Wei - So the expelled node could ping the rest of the cluster just fine. In
fact, after adding this new node to the cluster I could traverse the filesystem
for simple lookups; however, heavy data moves into or out of the filesystem
seemed to trigger the expel messages to the new node.
This