We have a 96-node cluster running Cassandra 3.11 with 256 vnodes each.
We're doing a rolling restart. As we restart nodes, we notice that each
node takes a while before it marks all other nodes as up, and the delay
corresponds to nodes that haven't finished playing hints.

We looked at the hinted handoff throttling, noticed it was still the
default of 1024, so we tried to turn it off by setting it to zero. Reading
the source, it looks like that rate limiting won't take effect until the
current set of hints have finished. So we made that change cluster-wide and
then restarted the next node. However, we still saw the same issue.

Looking at iftop, network throughput is very low (~10kB/s), and
therefore the few 100k of hints that accumulate while the node is
restarting end up taking several minutes to get sent.
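
For what it's worth, here is a back-of-the-envelope check on that number. It
assumes the behaviour described in the cassandra.yaml comments for
hinted_handoff_throttle_in_kb: the limit is per delivery thread and is
reduced in proportion to the number of nodes in the cluster. The function
name and the exact division by (nodes - 1) are my own sketch of that
description, not code copied from Cassandra.

```python
def effective_throttle_kb(throttle_kb: int, node_count: int) -> float:
    """Estimate the per-delivery-thread hint rate in kB/s.

    Assumption: the configured throttle is split across the other nodes
    in the cluster, and a value of 0 means "no throttling".
    """
    if throttle_kb == 0:
        return float("inf")  # assumed: zero disables the rate limit
    peers = max(1, node_count - 1)  # every node except this one
    return throttle_kb / peers


if __name__ == "__main__":
    # Default throttle of 1024 on our 96-node cluster.
    print(f"{effective_throttle_kb(1024, 96):.1f} kB/s")  # prints "10.8 kB/s"
```

If that assumption holds, the default alone would give roughly the ~10kB/s
we see in iftop. It would not explain why things stay slow once the
throttle is zero, unless the new value really isn't picked up yet.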

Any other knobs we should be tuning to increase hinted handoff throughput?
Or other reasons why hinted handoff runs so slowly?

Andrew Bialecki

