Ah, I see. When a tablet server crashes, a few things usually need to happen:

1. new leaders must be elected for every tablet whose leader was on the crashed server,
2. all of the affected tablets must change their Raft configs to replace the dead server, and
3. all of the affected tablets need to be re-replicated to restore the replication factor.
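To make the three steps concrete, here is a toy Python model of which tablets need which kind of work after a crash. It is only a sketch and not Kudu code: the `Tablet` class, the `recovery_work` function, and the server names are all made up for illustration, and it assumes a replication factor of 3.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Tablet:
    tablet_id: str
    replicas: List[str]  # names of the tablet servers hosting a replica
    leader: str          # one of `replicas`


def recovery_work(tablets: List[Tablet], dead_server: str,
                  replication_factor: int = 3) -> Dict[str, List[str]]:
    """Return the tablet ids that need each of the three recovery steps."""
    # Only tablets with a replica on the dead server are affected at all.
    affected = [t for t in tablets if dead_server in t.replicas]
    return {
        # Step 1: the leader replica itself was lost.
        "leader_election": [t.tablet_id for t in affected
                            if t.leader == dead_server],
        # Step 2: every affected tablet must drop the dead server
        # from its Raft config and add a replacement.
        "config_change": [t.tablet_id for t in affected],
        # Step 3: the surviving replicas no longer satisfy the
        # replication factor, so a full tablet copy is needed.
        "re_replication": [t.tablet_id for t in affected
                           if len(t.replicas) - 1 < replication_factor],
    }


tablets = [
    Tablet("t1", ["ts-a", "ts-b", "ts-c"], leader="ts-a"),
    Tablet("t2", ["ts-a", "ts-b", "ts-d"], leader="ts-b"),
    Tablet("t3", ["ts-b", "ts-c", "ts-d"], leader="ts-c"),
]
work = recovery_work(tablets, "ts-a")
print(work)
```

In this example only `t1` needs an election, but both `t1` and `t2` need a config change and a tablet copy. The copy is the step that moves the bulk of the data.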
Using `--num_tablets_to_copy_simultaneously` will probably only help with the last one. New leaders still must be elected, and a new tablet server will still be added for each tablet; both of these will still contribute some network traffic. The difference would be that, as the name implies, with the value set to 1 each tablet server will only be able to service a single tablet copy request at a time, so the time to full, "healthy" replication for the cluster will likely be longer. The tablet copies do have more data to transfer than elections and config changes, so maybe this would be sufficient.

Off the top of my head, I think the second solution you suggested seems a bit riskier because it might end up preventing tablets from reaching quorums, and it relies on operators noticing the failures and moving everything in time. The second point may be fixed by using `--evict_failed_followers` instead, but the first is a big issue to look out for.

As I mentioned before, these solutions affect the guarantees that Kudu provides: the replication factor in all of these cases will be compromised in exchange for a lighter load on your network.

Hope this helps,
Andrew

On Mon, Sep 4, 2017 at 7:11 PM, Li Jin <[email protected]> wrote:

> I have got another idea. if I change the param
> --follower_unavailable_considered_failed_sec=3600
> or more, that disable data migration almost, We can balance data by kudu
> command line tools kudu table change_config add_replica
> <https://kudu.apache.org/docs/command_line_tools_reference.html#change_config-add_replica>
> /remove_replica
> <https://kudu.apache.org/docs/command_line_tools_reference.html#change_config-add_replica>
> to
> balance data. small-scale data migration will not take so much network
> traffic. maybe this is we need.
>
> 2017-09-05 7:12 GMT+08:00 Li Jin <[email protected]>:
>
>> Hi Andrew,
>>
>> Thanks for replay. I seeing at zabbix, I would like reduce the use of network
>> traffic and make sure the business online is not affect. I try to change
>> the configure num_tablets_to_copy_simultaneously from 10 to 1, Can it be
>> meet my needs?
>>
>> King Lee
>>
>> 2017-09-05 3:35 GMT+08:00 Andrew Wong <[email protected]>:
>>
>>> Hi Li,
>>>
>>> What errors are you seeing when the network traffic is full? Kudu needs
>>> to replicate all the data that was lost to maintain the specified
>>> replication factor. As far as I know there isn't a way to throttle this
>>> without giving up some guarantees.
>>>
>>> If the concern is around Kudu scans completing, there is an
>>> `isFaultTolerant` mode for the scanner that will retry at other servers,
>>> although I'm not sure this is what you want.
>>>
>>>
>>> Andrew
>>>
>>>
>>> On Mon, Sep 4, 2017 at 5:34 AM, Li Jin <[email protected]> wrote:
>>>
>>>> Hi,I got a question.
>>>> Our kudu service is production now. when ts crash, data will migration
>>>> between ts. Network traffic will full and can not write or write
>>>> normally. it's unacceptable. Is there any good way to control network
>>>> traffic when ts crash, and write or read service is not affect. Thanks!
>>>>
>>>
>>>
>>>
>>> --
>>> Andrew Wong
>>>
>>
>>
>

--
Andrew Wong
