On Mon, May 22, 2017 at 11:01 PM, Jason Heo <[email protected]> wrote:
> Thank you for confirming that.
>
> "Anecdotally, this patch has improved TTR times 5-10x on highly loaded
> clusters." => Great news!
>
> Could I know when Kudu 1.4 will be released?

I just posted the first release candidate last night, so hopefully the
release will be out within the week.

> If not planned, I'd like to know if I can use
> https://gerrit.cloudera.org/#/c/6925/ in my production cluster.

We've been doing some testing here and it seems good. As always, running a
custom build in a production cluster carries some risk, but I think this
patch is reasonably safe as far as these things go.

-Todd

2017-05-23 4:37 GMT+09:00 Dan Burkert <[email protected]>:

Woops, I meant it should land in time for 1.4.

- Dan

On Mon, May 22, 2017 at 12:32 PM, Dan Burkert <[email protected]> wrote:

Thanks for the info, Jason. I spent some more time looking at this today
and confirmed that the patch is working as intended. I've updated the
commit message with more information about the failure that was occurring,
in case you are interested. I expect this fix will land in time for 1.5.

- Dan

On Sat, May 20, 2017 at 8:47 PM, Jason Heo <[email protected]> wrote:

Hi.

I'm not sure how best to explain it.

1.
Re-replication time was reduced from 20 hours to 2 hours 40 minutes.

Here are some charts.

Before applying the patch:

- Total Tablet Size: http://i.imgur.com/QtT2sH4.png
- Network & Disk Usage: http://i.imgur.com/m4gj6p2.png (started at 10 am,
  ended at 6 am the following day)

After applying the patch:

- Total Tablet Size: http://i.imgur.com/7RmWQA4.png
- Network & Disk Usage: http://i.imgur.com/Jd7q8iY.png

2.
By the way, before applying the patch I saw many "already in progress"
messages in the Kudu master log file:

    delete failed for tablet 'tablet_id' with error code TABLET_NOT_RUNNING:
    Illegal state: State transition of tablet 'tablet_id' already in
    progress: copying tablet

After applying it, there were no such messages.

3.
Before applying the patch I was running Kudu 1.3.0; the cluster was
upgraded to 1.4 by applying the patch.

Thanks.

2017-05-21 0:02 GMT+09:00 Dan Burkert <[email protected]>:

Hey Jason,

What effect did you see with that patch applied? I've had mixed results
with it in my failover tests - it hasn't resolved some of the issues that I
expected it would, so I'm still looking into it. Any feedback you have on
it would be appreciated.

- Dan

On Fri, May 19, 2017 at 10:07 PM, Jason Heo <[email protected]> wrote:

Thanks, @dan @Todd

This issue has been resolved via https://gerrit.cloudera.org/#/c/6925/

Regards,

Jason

2017-05-09 4:55 GMT+09:00 Todd Lipcon <[email protected]>:

Hey Jason,

Sorry for the delayed response here. It looks from your ksck output like
copying is ongoing but hasn't yet finished.

FWIW, Will B is working on adding more informative output to ksck to help
diagnose cases like this: https://gerrit.cloudera.org/#/c/6772/

-Todd

On Thu, Apr 13, 2017 at 11:35 PM, Jason Heo <[email protected]> wrote:

@Dan

I monitored with `kudu ksck` while re-replication was occurring, but I'm
not sure whether this output means my cluster has a problem. (It seems to
just indicate that one tserver stopped.)

Would you please check it?

Thanks,

Jason

```
...
...
Tablet 0e29XXXXXXXXXXXXXXX1e1e3168a4d81 of table 'impala::tbl1' is
under-replicated: 1 replica(s) not RUNNING
  a7ca07f9bXXXXXXXXXXXXXXXbbb21cfb (hostname.com:7050): RUNNING
  a97644XXXXXXXXXXXXXXXdb074d4380f (hostname.com:7050): RUNNING [LEADER]
  401b6XXXXXXXXXXXXXXX5feda1de212b (hostname.com:7050): missing

Tablet 550XXXXXXXXXXXXXXX08f5fc94126927 of table 'impala::tbl1' is
under-replicated: 1 replica(s) not RUNNING
  aec55b4XXXXXXXXXXXXXXXdb469427cf (hostname.com:7050): RUNNING [LEADER]
  a7ca07f9b3d94XXXXXXXXXXXXXXX1cfb (hostname.com:7050): RUNNING
  31461XXXXXXXXXXXXXXX3dbe060807a6 (hostname.com:7050): bad state
    State:       NOT_STARTED
    Data state:  TABLET_DATA_READY
    Last status: Tablet initializing...

Tablet 4a1490fcXXXXXXXXXXXXXXX7a2c637e3 of table 'impala::tbl1' is
under-replicated: 1 replica(s) not RUNNING
  a7ca07f9b3d94414XXXXXXXXXXXXXXXb (hostname.com:7050): RUNNING
  40XXXXXXXXXXXXXXXd5b5feda1de212b (hostname.com:7050): RUNNING [LEADER]
  aec55b4e2acXXXXXXXXXXXXXXX9427cf (hostname.com:7050): bad state
    State:       NOT_STARTED
    Data state:  TABLET_DATA_COPYING
    Last status: TabletCopy: Downloading block 0000000005162382 (277/581)
...
...
==================
Errors:
==================
table consistency check error: Corruption: 52 table(s) are bad

FAILED
Runtime error: ksck discovered errors
```

2017-04-13 3:47 GMT+09:00 Dan Burkert <[email protected]>:

Hi Jason, answers inline:

On Wed, Apr 12, 2017 at 5:53 AM, Jason Heo <[email protected]> wrote:

> Q1. Can I disable redistributing tablets on failure of a tserver? The
> reason why I need this is described in Background.

We don't have any kind of built-in maintenance mode that would prevent
this, but it can be achieved by setting a flag on each of the tablet
servers. The goal is not to disable re-replicating tablets, but instead to
avoid kicking the failed replica out of the tablet groups to begin with.
There is a config flag to control exactly that: 'evict_failed_followers'.
This isn't considered a stable or supported flag, but it should have the
effect you are looking for if you set it to false on each of the tablet
servers by running:

    kudu tserver set-flag <tserver-addr> evict_failed_followers false --force

for each tablet server.
When you are done, set it back to the default 'true' value. This isn't
something we routinely test (especially setting it without restarting the
server), so please test before trying this on a production cluster.

> Q2. Redistribution goes on even if the failed tserver reconnects to the
> cluster. In my test cluster, it took 2 hours to redistribute when a
> tserver holding 3TB of data was killed.

This seems slow. What's the speed of your network? How many nodes? How many
tablet replicas were on the failed tserver, and were the replica sizes
evenly balanced? Next time this happens, you might try monitoring with
'kudu ksck' to ensure there aren't additional problems in the cluster (see
the admin guide on the ksck tool:
https://github.com/apache/kudu/blob/master/docs/administration.adoc#ksck).

> Q3. Can `--follower_unavailable_considered_failed_sec` be changed without
> restarting the cluster?

The flag can be changed, but it comes with the same caveats as above:

    kudu tserver set-flag <tserver-addr> follower_unavailable_considered_failed_sec 900 --force

- Dan

--
Todd Lipcon
Software Engineer, Cloudera
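
The maintenance workaround discussed in the answers above can be scripted.
The sketch below is illustrative only and is not from the thread: it assumes
the `kudu` CLI is on the PATH, uses a hypothetical list of tablet server
addresses, and relies on the two unstable flags mentioned above
('evict_failed_followers' and 'follower_unavailable_considered_failed_sec'),
so it should be tried on a test cluster before any production use.

```
#!/usr/bin/env bash
# Illustrative sketch of the maintenance workaround described above.
# Assumptions: the `kudu` CLI is installed, and TSERVERS is a hypothetical
# space-separated list of tablet server RPC addresses in your cluster.
set -euo pipefail

TSERVERS="tserver-01.example.com:7050 tserver-02.example.com:7050 tserver-03.example.com:7050"

case "${1:-}" in
  start)
    # Before stopping a tserver for maintenance: keep leaders from evicting
    # the unavailable follower, and lengthen the failure timeout (seconds).
    for ts in $TSERVERS; do
      kudu tserver set-flag "$ts" evict_failed_followers false --force
      kudu tserver set-flag "$ts" follower_unavailable_considered_failed_sec 900 --force
    done
    ;;
  stop)
    # After the tserver has rejoined: restore the defaults. 'true' is the
    # default named in the thread; 300 seconds is assumed here - check the
    # default for your Kudu version before relying on it.
    for ts in $TSERVERS; do
      kudu tserver set-flag "$ts" evict_failed_followers true --force
      kudu tserver set-flag "$ts" follower_unavailable_considered_failed_sec 300 --force
    done
    ;;
  *)
    echo "usage: $0 {start|stop}" >&2
    exit 1
    ;;
esac
```

A hypothetical run would be `./kudu-maintenance.sh start` before taking the
tserver down and `./kudu-maintenance.sh stop` once it is healthy again; in
between, `kudu cluster ksck <master-addresses>` (the ksck tool referenced
above) can be used to watch replica states such as TABLET_DATA_COPYING.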
