We have a 96-node cluster running 3.11 with 256 vnodes each. We're doing a rolling restart. As we restart nodes, we notice that each restarted node takes a while before all other nodes are marked as up, and this corresponds to the nodes that haven't finished replaying hints.
We looked at the hinted handoff throttle, noticed it was still at the default of 1024, and tried to turn it off by setting it to zero. Reading the source, it looks like the new rate limit won't take effect until the current set of hints has finished. So we made that change cluster-wide and then restarted the next node. However, we still saw the same issue. Looking at iftop, network throughput is very low (~10kB/s), so the few 100k of hints that accumulate while the node is restarting end up taking several minutes to get sent. Are there any other knobs we should be tuning to increase hinted handoff throughput? Or are there other reasons why hinted handoff runs so slowly?

-- Andrew Bialecki
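
For what it's worth, the ~10kB/s figure is close to what the default throttle would give on a cluster this size, if the behaviour described in the cassandra.yaml comment for hinted_handoff_throttle_in_kb applies (the configured rate is reduced in proportion to the number of other nodes in the cluster). The sketch below is only back-of-the-envelope arithmetic under that assumption. The function name is made up for illustration, and the integer division and the "zero means unlimited" handling are my reading of the behaviour, not something confirmed in this thread.

```python
from typing import Optional


def effective_hint_throttle_kb(configured_kb: int, cluster_size: int) -> Optional[int]:
    """Estimate the per-delivery-thread hint rate in KB/s.

    Assumption: the configured throttle is divided by the number of *other*
    nodes in the cluster (at least 1), using integer division.
    Assumption: a configured value of 0 disables throttling, returned as None.
    """
    if configured_kb == 0:
        return None  # treated as "no limit" in this sketch
    other_nodes = max(1, cluster_size - 1)
    return configured_kb // other_nodes


if __name__ == "__main__":
    # Default throttle of 1024 on the 96-node cluster from the question.
    print(effective_hint_throttle_kb(1024, 96))  # 1024 // 95 -> 10
    # Throttle set to zero, as attempted in the question.
    print(effective_hint_throttle_kb(0, 96))     # None (unlimited)
```

If that assumption holds, the observed rate would be consistent with nodes still dispatching under the old default, for example because the changed value had not yet been picked up by an in-flight dispatch.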