[gpfsug-discuss] spontaneous tracing?

2018-03-10 Thread Aaron Knister
I found myself with a little treat this morning to the tune of tracing running on the entire cluster of 3500 nodes. There were no logs I could find to indicate *why* the tracing had started, but it was clear it was initiated by the cluster manager. Some sleuthing (thanks, collectl!) allowed me
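For anyone doing the same kind of sleuthing, here is a minimal sketch (not from the post above) that asks each node whether the Linux GPFS trace daemon (lxtrace) is running and when it started. The node-list path and the ssh fan-out are assumptions, not anything described in the thread.

```python
#!/usr/bin/env python3
"""Rough sketch: report nodes where an lxtrace process is running and its
start time. Assumes passwordless ssh to every node; the node list file is a
hypothetical placeholder."""
import subprocess

NODES_FILE = "/tmp/cluster_nodes.txt"  # hypothetical: one hostname per line

def tracing_nodes(nodes):
    """Return {node: lxtrace start time} for nodes with a trace daemon running."""
    active = {}
    for node in nodes:
        # 'ps -C lxtrace' exits non-zero when no such process exists
        proc = subprocess.run(
            ["ssh", "-o", "ConnectTimeout=5", node,
             "ps -C lxtrace -o lstart= | head -1"],
            capture_output=True, text=True)
        start = proc.stdout.strip()
        if proc.returncode == 0 and start:
            active[node] = start
    return active

if __name__ == "__main__":
    with open(NODES_FILE) as fh:
        nodes = [line.strip() for line in fh if line.strip()]
    for node, started in sorted(tracing_nodes(nodes).items()):
        print(f"{node}: lxtrace running since {started}")
```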

[gpfsug-discuss] gpfs.snap taking a really long time (4.2.3.6 efix17)

2018-03-10 Thread Aaron Knister
Hey All, I've noticed after upgrading from 4.1 to 4.2.3.6 efix17 that a gpfs.snap now takes a really long time as in... a *really* long time. Digging into it I can see that the snap command is actually done but the sshd child is left waiting on a sleep process on the clients (a sleep 600 at
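A small illustrative sketch of the process-tree digging described above: it walks the local process table looking for sleep processes hanging off an sshd ancestor. It needs the third-party psutil package and is only a generic diagnostic, not the gpfs.snap internals themselves.

```python
#!/usr/bin/env python3
"""Sketch: find sleep processes parented (directly or indirectly) by sshd,
the pattern described above where the snap worker has finished but an sshd
child is still waiting on a long sleep."""
import psutil

def sshd_sleeps():
    """Yield (sleep_proc, sshd_ancestor) pairs for sleeps hanging off sshd."""
    for proc in psutil.process_iter(["name", "cmdline"]):
        if proc.info["name"] != "sleep":
            continue
        for parent in proc.parents():
            if parent.name() == "sshd":
                yield proc, parent
                break

if __name__ == "__main__":
    for sleep_proc, sshd_proc in sshd_sleeps():
        args = " ".join(sleep_proc.info["cmdline"] or [])
        print(f"pid {sleep_proc.pid} ({args}) under sshd pid {sshd_proc.pid}")
```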

Re: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes

2018-03-10 Thread Aaron Knister
I, personally, haven't been burned by mixing UD and RC IPoIB clients on the same fabric, but that doesn't mean it can't happen. What I *have* been bitten by a couple of times is not having enough entries in the arp cache after bringing a bunch of new nodes online (that made for a long Christmas
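On Linux the neighbour (ARP) table is capped by the gc_thresh sysctls, so a cluster with more peers than gc_thresh3 allows will start evicting entries. A minimal sketch of that sanity check follows; the expected peer count is just an example figure, and raising the thresholds would be done via sysctl rather than this script.

```python
#!/usr/bin/env python3
"""Sketch: read the kernel's neighbour-table limits and compare them to the
number of peers the node is expected to talk to."""
from pathlib import Path

EXPECTED_PEERS = 3500  # example: roughly one neighbour entry per cluster node

def neigh_thresholds(family="ipv4"):
    """Return {gc_thresh name: value} from /proc for the given address family."""
    base = Path(f"/proc/sys/net/{family}/neigh/default")
    return {p.name: int(p.read_text()) for p in base.glob("gc_thresh*")}

if __name__ == "__main__":
    thresholds = neigh_thresholds()
    for name, value in sorted(thresholds.items()):
        print(f"{name} = {value}")
    if thresholds.get("gc_thresh3", 0) < EXPECTED_PEERS:
        print(f"gc_thresh3 is below the expected peer count ({EXPECTED_PEERS}); "
              "consider raising gc_thresh1/2/3 via sysctl")
```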

Re: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes

2018-03-10 Thread Wei Guo
Hi, Saula, This sounds like a jumbo frame problem. Pings and metadata queries use small packets, so you can always ping or ls a file. Data transfers, however, use large packets up to the MTU size. Your MTU 65536 nodes send out large packets, but they get dropped at the MTU 2044 nodes, because
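One quick way to confirm that theory is to send pings with the don't-fragment flag at payload sizes straddling the smaller MTU: the small ones succeed and the large ones fail. A minimal sketch below, using the Linux iputils ping flags; the target hostname and payload sizes are placeholders.

```python
#!/usr/bin/env python3
"""Sketch: probe a host with don't-fragment pings of increasing size to show
where large packets start being dropped."""
import subprocess

TARGET = "newnode-ib0"                        # hypothetical MTU-2044 node
PAYLOADS = [1000, 1972, 2100, 8000, 65000]    # bytes, straddling a 2044 MTU

def ping_df(host, size):
    """Return True if a single don't-fragment ping of `size` bytes succeeds."""
    result = subprocess.run(
        ["ping", "-M", "do", "-c", "1", "-W", "2", "-s", str(size), host],
        capture_output=True)
    return result.returncode == 0

if __name__ == "__main__":
    for size in PAYLOADS:
        status = "ok" if ping_df(TARGET, size) else "FAILED"
        print(f"{TARGET}: {size}-byte DF ping {status}")
```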

Re: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes

2018-03-10 Thread Saula, Oluwasijibomi
Wei - So the expelled node could ping the rest of the cluster just fine. In fact, after adding this new node to the cluster I could traverse the filesystem for simple lookups; however, heavy data moves in or out of the filesystem seemed to trigger the expel messages to the new node. This