Re:Re: Re: Re: What way to improve MTTR other than DLR(distributed log replay)

Allan Yang Tue, 18 Oct 2016 18:11:51 -0700

Hi, Ted
The issues I mentioned above (HBASE-13567, HBASE-12743, HBASE-13535, 
HBASE-14729) are ALL reproduced in our HBase 1.x test environment. Fixing them 
is exactly what I'm going to do. I haven't found the root causes yet, but I 
will post an update if I find solutions.
What I'm afraid of is that there are other issues I don't know about yet. So if 
you or anyone else knows of other issues related to DLR, please let me know.



Regards
Allan Yang  







At 2016-10-19 00:19:06, "Ted Yu" <[email protected]> wrote:
>Allan:
>I wonder how you deal with open issues such as HBASE-13535.
>From your description, it seems your team fixed more DLR issues.
>
>Cheers
>
>On Mon, Oct 17, 2016 at 11:37 PM, allanwin <[email protected]> wrote:
>
>>
>>
>>
>> Here is the thing. We have backported DLR (HBASE-7006) to our 0.94
>> clusters in our production environment (of course, a lot of bugs were fixed
>> and it is working well). It has proven to be a huge gain. When a large
>> cluster crashed, the MTTR improved from several hours to less than an
>> hour. Now we want to move on to HBase 1.x, and we still want DLR. This
>> time, we don't want to backport the 'backported' DLR to HBase 1.x, but it
>> seems that the community has decided to remove DLR...
>>
>>
>> The DLR feature has proven useful in our production environment, so I think
>> I will try to fix its issues in branch-1.x
>>
>>
>>
>>
>>
>>
>> At 2016-10-18 13:47:17, "Anoop John" <[email protected]> wrote:
>> >Agree with your observation. But we wanted the DLR feature removed
>> >because it is known to have issues. Otherwise we need major work to
>> >correct all these issues.
>> >
>> >-Anoop-
>> >
>> >On Tue, Oct 18, 2016 at 7:41 AM, Ted Yu <[email protected]> wrote:
>> >> If you have a cluster, I suggest you turn on DLR and observe the effect
>> >> where fewer than half the region servers are up after the crash.
>> >> You would have first hand experience that way.
>> >>
>> >> On Mon, Oct 17, 2016 at 6:33 PM, allanwin <[email protected]> wrote:
>> >>
>> >>>
>> >>>
>> >>>
>> >>> Yes, region replica is a good way to improve MTTR. Especially if one or
>> >>> two servers are down, region replicas can improve data availability. But
>> >>> for a big disaster, like 1/3 or 1/2 of the region servers shutting down, I
>> >>> think DLR is still useful to bring regions online more quickly and with
>> >>> less IO usage.
>> >>>
>> >>>
>> >>> Regards
>> >>> Allan Yang
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> At 2016-10-17 21:01:16, "Ted Yu" <[email protected]> wrote:
>> >>> >Here was the thread discussing DLR:
>> >>> >
>> >>> >http://search-hadoop.com/m/YGbbOxBK2n4ES12&subj=Re+DISCUSS+retiring+current+DLR+code
>> >>> >
>> >>> >> On Oct 17, 2016, at 4:15 AM, allanwin <[email protected]> wrote:
>> >>> >>
>> >>> >> Hi, All
>> >>> >>  DLR can improve MTTR dramatically, but since it has many bugs like
>> >>> >> HBASE-13567, HBASE-12743, HBASE-13535, HBASE-14729 (any more that I
>> >>> >> don't know of?), it has proved unreliable and has been deprecated in
>> >>> >> almost all branches now.
>> >>> >>
>> >>> >>
>> >>> >> My question is, is there any way other than DLR to improve MTTR?
>> >>> >> 'Cause if a big cluster crashes, it takes a long time to bring regions
>> >>> >> online, not to mention it will create huge pressure on the IO.
>> >>> >>
>> >>> >>
>> >>> >> To tell the truth, I still want DLR back. If the community doesn't have
>> >>> >> any plan to bring back DLR, I may want to figure out the problems in DLR
>> >>> >> and make it working and reliable. Any suggestions for that?
>> >>> >>
>> >>> >>
>> >>> >> sincerely
>> >>> >> Allan Yang
>> >>>
>>
