Re: How to control network traffic when ts crash.

Andrew Wong Mon, 04 Sep 2017 20:29:28 -0700

Ah, I see. When a tablet server crashes, a few things usually need to
happen:
1. new leaders must be elected for every tablet whose leader was on the
crashed server,
2. all of the affected tablets must change their Raft configs to replace
the dead server, and
3. all of the affected tablets need to be re-replicated to honor the
replication factor.

Lowering `--num_tablets_to_copy_simultaneously` will probably only help
with the last one. New leaders still must be elected, and a new tablet
server will still be added for each affected tablet; both of these will
still contribute some network traffic. The difference would be that, with
the flag set to 1, each tablet server will only be able to service a
single tablet copy request at a time, so the time to full, "healthy"
replication for the cluster will likely be longer. The tablet copies do
have more data to transfer than elections and config changes, so maybe
this would be sufficient.
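
A rough way to picture that trade-off is the toy model below. It is only a
sketch and not how Kudu schedules copies: it assumes every tablet copy is
the same size and runs at a fixed per-copy rate, and the function name and
all of the numbers are made up for illustration.

```python
import math

def estimate_recovery(num_tablets, tablet_size_mb, per_copy_mb_per_s,
                      concurrent_copies):
    """Toy model of one tablet server receiving tablet copies.

    Assumes (hypothetically) equal-sized tablets, a fixed throughput per
    copy session, and copies that run in full "rounds" of at most
    `concurrent_copies` sessions. Returns (peak MB/s, total seconds).
    """
    rounds = math.ceil(num_tablets / concurrent_copies)
    # Peak traffic is bounded by how many copies can run at once.
    peak_mb_per_s = min(num_tablets, concurrent_copies) * per_copy_mb_per_s
    # Each round lasts as long as one copy takes at the per-copy rate.
    total_seconds = rounds * tablet_size_mb / per_copy_mb_per_s
    return peak_mb_per_s, total_seconds

# Compare 10 simultaneous copies against 1, with made-up workload numbers.
for k in (10, 1):
    peak, secs = estimate_recovery(num_tablets=40, tablet_size_mb=1000,
                                   per_copy_mb_per_s=50, concurrent_copies=k)
    print(f"concurrent_copies={k}: peak={peak} MB/s, time={secs:.0f} s")
```

Under these assumptions, dropping the limit from 10 to 1 cuts the peak
copy traffic by a factor of ten and stretches the recovery time by the
same factor, which is the window during which the tablets stay
under-replicated.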

Off the top of my head, I think the second solution you suggested seems a
bit riskier because it might end up preventing tablets from reaching
quorums, and it relies on an operator noticing the failures and moving
everything in time. The second point may be fixed by using
`--evict_failed_followers` instead, but the first is a big issue to look
out for.
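
To make the quorum concern concrete, here is a minimal sketch assuming the
usual Raft majority rule (more than half of the voters must be live). The
helper names are hypothetical and this is not Kudu code.

```python
def majority(num_voters):
    """Smallest number of voters that forms a Raft majority."""
    return num_voters // 2 + 1

def can_reach_quorum(num_voters, num_live):
    """True if the live replicas are enough to elect a leader and commit."""
    return num_live >= majority(num_voters)

# Replication factor 3: one dead replica is tolerated, two are not.
for live in (3, 2, 1):
    print(f"voters=3 live={live} quorum={can_reach_quorum(3, live)}")
```

With a replication factor of 3, a tablet that has already lost one replica
and is not re-replicated becomes unavailable as soon as a second replica
fails, which is why delaying the replacement for a long time is risky.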

As I mentioned before, these solutions affect the guarantees that Kudu
provides: the replication factor in all of these cases will be
compromised in exchange for a lighter load on your network.

Hope this helps,
Andrew

On Mon, Sep 4, 2017 at 7:11 PM, Li Jin <[email protected]> wrote:

> I have another idea. If I set
> --follower_unavailable_considered_failed_sec=3600
> or more, that almost disables data migration. We can then balance the
> data manually with the kudu command line tools,
> kudu tablet change_config add_replica
> <https://kudu.apache.org/docs/command_line_tools_reference.html#change_config-add_replica>
> / remove_replica
> <https://kudu.apache.org/docs/command_line_tools_reference.html#change_config-add_replica>.
> Small-scale data migration will not take so much network traffic. Maybe
> this is what we need. 😄
>
> 2017-09-05 7:12 GMT+08:00 Li Jin <[email protected]>:
>
>> Hi Andrew,
>>
>> Thanks for the reply. I am watching the traffic in Zabbix. I would like
>> to reduce the network traffic and make sure the online business is not
>> affected. I am trying to change num_tablets_to_copy_simultaneously from
>> 10 to 1. Will that meet my needs?
>>
>> King Lee
>>
>> 2017-09-05 3:35 GMT+08:00 Andrew Wong <[email protected]>:
>>
>>> Hi Li,
>>>
>>> What errors are you seeing when the network traffic is full? Kudu needs
>>> to replicate all the data that was lost to maintain the specified
>>> replication factor. As far as I know there isn't a way to throttle this
>>> without giving up some guarantees.
>>>
>>> If the concern is around Kudu scans completing, there is an
>>> `isFaultTolerant` mode for the scanner that will retry at other servers,
>>> although I'm not sure this is what you want.
>>>
>>>
>>> Andrew
>>>
>>>
>>> On Mon, Sep 4, 2017 at 5:34 AM, Li Jin <[email protected]> wrote:
>>>
>>>> Hi, I have a question.
>>>> Our Kudu service is in production now. When a tablet server (ts)
>>>> crashes, data migrates between the remaining tablet servers. The
>>>> network becomes saturated and we cannot write, or cannot write
>>>> normally. That is unacceptable. Is there a good way to control network
>>>> traffic when a ts crashes, so that reads and writes are not affected?
>>>> Thanks!
>>>>
>>>
>>>
>>>
>>> --
>>> Andrew Wong
>>>
>>
>>
>

-- 
Andrew Wong
