On Mon, May 22, 2017 at 11:01 PM, Jason Heo <[email protected]> wrote:
> Thank you for confirming that.
>
> "Anecdotally, this patch has improved TTR times 5-10x on highly loaded
> clusters." => Great news!
>
> Could I know when Kudu 1.4 will be released?

I just posted the first release candidate last night, so hopefully the
release will be out within the week.

> If not planned, I'd like to know if I can use
> https://gerrit.cloudera.org/#/c/6925/ in my production cluster.

We've been doing some testing here and it seems good. As always, running a
custom build in a production cluster carries some risk, but I think this
patch is reasonably safe as far as these things go.

-Todd

2017-05-23 4:37 GMT+09:00 Dan Burkert <[email protected]>:

Woops, I meant it should land in time for 1.4.

- Dan

On Mon, May 22, 2017 at 12:32 PM, Dan Burkert <[email protected]> wrote:

Thanks for the info, Jason. I spent some more time looking at this today
and confirmed that the patch is working as intended. I've updated the
commit message with more information about the failure that was occurring,
in case you are interested. I expect this fix will land in time for 1.5.

- Dan

On Sat, May 20, 2017 at 8:47 PM, Jason Heo <[email protected]> wrote:

Hi.

I'm not sure how best to explain it.

1.
Re-replication time was reduced from 20 hours to 2 hours 40 minutes.

Here are some charts.

Before applying the patch:

- Total Tablet Size: http://i.imgur.com/QtT2sH4.png
- Network & Disk Usage: http://i.imgur.com/m4gj6p2.png (started at 10 am,
  ended at 6 am the following day)

After applying the patch:

- Total Tablet Size: http://i.imgur.com/7RmWQA4.png
- Network & Disk Usage: http://i.imgur.com/Jd7q8iY.png

2.
By the way, before applying the patch I saw many "already in progress"
messages in the Kudu master log file:

    delete failed for tablet 'tablet_id' with error code TABLET_NOT_RUNNING:
    Illegal state: State transition of tablet 'tablet_id' already in
    progress: copying tablet

After applying it, there were no such messages.

3.
Before applying the patch I was running Kudu 1.3.0; the cluster was
upgraded to 1.4 by applying the patch.

Thanks.

2017-05-21 0:02 GMT+09:00 Dan Burkert <[email protected]>:

Hey Jason,

What effect did you see with that patch applied? I've had mixed results
with it in my failover tests - it hasn't resolved some of the issues that I
expected it would, so I'm still looking into it. Any feedback you have on
it would be appreciated.

- Dan

On Fri, May 19, 2017 at 10:07 PM, Jason Heo <[email protected]> wrote:

Thanks, @dan @Todd

This issue has been resolved via https://gerrit.cloudera.org/#/c/6925/

Regards,

Jason

2017-05-09 4:55 GMT+09:00 Todd Lipcon <[email protected]>:

Hey Jason,

Sorry for the delayed response here. It looks from your ksck output like
copying is ongoing but hasn't yet finished.

FWIW, Will B is working on adding more informative output to ksck to help
diagnose cases like this: https://gerrit.cloudera.org/#/c/6772/

-Todd

On Thu, Apr 13, 2017 at 11:35 PM, Jason Heo <[email protected]> wrote:

@Dan

I monitored with `kudu ksck` while re-replication was occurring, but I'm
not sure whether this output means my cluster has a problem. (It seems to
just indicate that one tserver stopped.)

Would you please check it?

Thanks,

Jason

```
...
...
Tablet 0e29XXXXXXXXXXXXXXX1e1e3168a4d81 of table 'impala::tbl1' is
under-replicated: 1 replica(s) not RUNNING
  a7ca07f9bXXXXXXXXXXXXXXXbbb21cfb (hostname.com:7050): RUNNING
  a97644XXXXXXXXXXXXXXXdb074d4380f (hostname.com:7050): RUNNING [LEADER]
  401b6XXXXXXXXXXXXXXX5feda1de212b (hostname.com:7050): missing

Tablet 550XXXXXXXXXXXXXXX08f5fc94126927 of table 'impala::tbl1' is
under-replicated: 1 replica(s) not RUNNING
  aec55b4XXXXXXXXXXXXXXXdb469427cf (hostname.com:7050): RUNNING [LEADER]
  a7ca07f9b3d94XXXXXXXXXXXXXXX1cfb (hostname.com:7050): RUNNING
  31461XXXXXXXXXXXXXXX3dbe060807a6 (hostname.com:7050): bad state
    State:       NOT_STARTED
    Data state:  TABLET_DATA_READY
    Last status: Tablet initializing...

Tablet 4a1490fcXXXXXXXXXXXXXXX7a2c637e3 of table 'impala::tbl1' is
under-replicated: 1 replica(s) not RUNNING
  a7ca07f9b3d94414XXXXXXXXXXXXXXXb (hostname.com:7050): RUNNING
  40XXXXXXXXXXXXXXXd5b5feda1de212b (hostname.com:7050): RUNNING [LEADER]
  aec55b4e2acXXXXXXXXXXXXXXX9427cf (hostname.com:7050): bad state
    State:       NOT_STARTED
    Data state:  TABLET_DATA_COPYING
    Last status: TabletCopy: Downloading block 0000000005162382 (277/581)
...
...
==================
Errors:
==================
table consistency check error: Corruption: 52 table(s) are bad

FAILED
Runtime error: ksck discovered errors
```

2017-04-13 3:47 GMT+09:00 Dan Burkert <[email protected]>:

Hi Jason, answers inline:

On Wed, Apr 12, 2017 at 5:53 AM, Jason Heo <[email protected]> wrote:

> Q1. Can I disable redistributing tablets on failure of a tserver? The
> reason why I need this is described in Background.

We don't have any kind of built-in maintenance mode that would prevent
this, but it can be achieved by setting a flag on each of the tablet
servers. The goal is not to disable re-replicating tablets, but instead to
avoid kicking the failed replica out of the tablet groups to begin with.
There is a config flag to control exactly that: 'evict_failed_followers'.
This isn't considered a stable or supported flag, but it should have the
effect you are looking for if you set it to false on each of the tablet
servers by running:

    kudu tserver set-flag <tserver-addr> evict_failed_followers false --force

for each tablet server.
When you are done, set it back to the default 'true' value. This isn't
something we routinely test (especially setting it without restarting the
server), so please test before trying this on a production cluster.

> Q2. Redistribution goes on even if the failed tserver reconnects to the
> cluster. In my test cluster, it took 2 hours to redistribute when a
> tserver holding 3TB of data was killed.

This seems slow. What's the speed of your network? How many nodes? How many
tablet replicas were on the failed tserver, and were the replica sizes
evenly balanced? Next time this happens, you might try monitoring with
'kudu ksck' to ensure there aren't additional problems in the cluster (see
the admin guide on the ksck tool:
https://github.com/apache/kudu/blob/master/docs/administration.adoc#ksck).

> Q3. Can `--follower_unavailable_considered_failed_sec` be changed without
> restarting the cluster?

The flag can be changed, but it comes with the same caveats as above:

    kudu tserver set-flag <tserver-addr> follower_unavailable_considered_failed_sec 900 --force

- Dan

--
Todd Lipcon
Software Engineer, Cloudera
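
The maintenance workaround discussed in the answers above can be scripted.
The sketch below is illustrative only and is not from the thread: it assumes
the `kudu` CLI is on the PATH, uses a hypothetical list of tablet server
addresses, and relies on the two unstable flags mentioned above
('evict_failed_followers' and 'follower_unavailable_considered_failed_sec'),
so it should be tried on a test cluster before any production use.

```
#!/usr/bin/env bash
# Illustrative sketch of the maintenance workaround described above.
# Assumptions: the `kudu` CLI is installed, and TSERVERS is a hypothetical
# space-separated list of tablet server RPC addresses in your cluster.
set -euo pipefail

TSERVERS="tserver-01.example.com:7050 tserver-02.example.com:7050 tserver-03.example.com:7050"

case "${1:-}" in
  start)
    # Before stopping a tserver for maintenance: keep leaders from evicting
    # the unavailable follower, and lengthen the failure timeout (seconds).
    for ts in $TSERVERS; do
      kudu tserver set-flag "$ts" evict_failed_followers false --force
      kudu tserver set-flag "$ts" follower_unavailable_considered_failed_sec 900 --force
    done
    ;;
  stop)
    # After the tserver has rejoined: restore the defaults. 'true' is the
    # default named in the thread; 300 seconds is assumed here - check the
    # default for your Kudu version before relying on it.
    for ts in $TSERVERS; do
      kudu tserver set-flag "$ts" evict_failed_followers true --force
      kudu tserver set-flag "$ts" follower_unavailable_considered_failed_sec 300 --force
    done
    ;;
  *)
    echo "usage: $0 {start|stop}" >&2
    exit 1
    ;;
esac
```

A hypothetical run would be `./kudu-maintenance.sh start` before taking the
tserver down and `./kudu-maintenance.sh stop` once it is healthy again; in
between, `kudu cluster ksck <master-addresses>` (the ksck tool referenced
above) can be used to watch replica states such as TABLET_DATA_COPYING.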
