Varun: Take a look at http://hbase.apache.org/book.html#dm.sort
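To make the ordering that dm.sort describes concrete, here is a toy sketch. This is *not* HBase's actual KeyValue/KVComparator (it ignores column family, key type, and bytewise comparison; the `Cell` record and `SortDemo` class are invented for illustration): cells sort by row ascending, then qualifier ascending, then timestamp descending, so a row-level delete marker (empty qualifier) sorts ahead of every column of its row, and timestamps only interleave within a single column:

```java
import java.util.Comparator;
import java.util.TreeSet;

// Toy model of a cell key; NOT HBase's real KeyValue class.
public class SortDemo {
    record Cell(String row, String qualifier, long ts) {}

    // Simplified ordering per the reference guide (dm.sort): row ascending,
    // qualifier ascending, timestamp descending. A row-level delete marker
    // carries an empty qualifier, so it sorts before every column of its
    // row regardless of timestamp.
    static final Comparator<Cell> KV_ORDER =
        Comparator.comparing(Cell::row)
                  .thenComparing(Cell::qualifier)
                  .thenComparing(Comparator.comparingLong(Cell::ts).reversed());

    public static void main(String[] args) {
        TreeSet<Cell> memstore = new TreeSet<>(KV_ORDER);
        memstore.add(new Cell("ROW", "COL1", 1));
        memstore.add(new Cell("ROW", "COL2", 1));
        memstore.add(new Cell("ROW", "COL3", 1));
        memstore.add(new Cell("ROW", "", 2));      // row delete marker, empty qualifier
        memstore.add(new Cell("ROW", "COL1", 3));
        memstore.add(new Cell("ROW", "COL2", 3));
        memstore.add(new Cell("ROW", "COL3", 3));
        // Cells group by qualifier; timestamps interleave only within a column:
        // (ROW,,2) (ROW,COL1,3) (ROW,COL1,1) (ROW,COL2,3) (ROW,COL2,1) ...
        memstore.forEach(System.out::println);
    }
}
```

In this ordering the scanner reaches the delete marker first for the row, before any column, which is why the "express lane" worry below does not arise.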
There's no contradiction. Cheers

On Jan 27, 2014, at 11:40 PM, Varun Sharma <[email protected]> wrote:

> Actually, I now have another question because of the way our workload is
> structured. We use a wide schema and each time we write, we delete the
> entire row and write a fresh set of columns - we want to make sure no old
> columns survive. So, I just want to see if my picture of the memstore at
> this point is correct or not. My understanding is that the memstore is
> basically a skip list of KeyValues, ordered by the KeyValue comparator.
>
> 1) *T=1*, we write 3 columns for "ROW". So the memstore has:
>
> (ROW, COL1, T=1)
> (ROW, COL2, T=1)
> (ROW, COL3, T=1)
>
> 2) *T=2*, now we write a delete marker for the entire ROW at T=2. So the
> memstore has - my understanding is that we do not delete in the memstore
> but only add markers:
>
> (ROW, <DELETE>, T=2)
> (ROW, COL1, T=1)
> (ROW, COL2, T=1)
> (ROW, COL3, T=1)
>
> 3) Now we write our fresh set of columns at *T=3* - it should get inserted
> above the delete:
>
> (ROW, COL1, T=3)
> (ROW, COL2, T=3)
> (ROW, COL3, T=3)
> (ROW, <DELETE>, T=2)
> (ROW, COL1, T=1)
> (ROW, COL2, T=1)
> (ROW, COL3, T=1)
>
> This is the ideal scenario for the data to be correctly reflected:
>
> (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *>* (ROW, COL1, T=1), and hence
> *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
>
> But we also know that KeyValues compare first by row, then by column, and
> then by timestamp in reverse order:
>
> *(ROW, COL2, T=3) < (ROW, COL1, T=1)*
>
> This seems to be contradicting, and my main worry is that in a skip list it
> is quite possible for skipping to happen as you go through the high-level
> express lanes, and it could be possible for one of these entries to never
> actually even see the delete marker. For example, consider the case above
> where entry #1 and entry #5 form the higher level of the skip list and the
> skip list has 2 levels.
> Now someone tries to insert (ROW, COL4, T=3), and it could end up in the
> wrong location.
>
> Obviously, if we cleanse all the row contents when we get a ROW level delete
> marker, we are fine, but I want to know if that is the case. If we are not
> really cleansing all the row contents when we get a ROW level delete
> marker, then I want to know why the above scenario cannot lead to bugs :)
>
> Varun
>
> On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
> <[email protected]> wrote:
>
>> Varun,
>>
>> There is no need to open a new JIRA - there are two already:
>> https://issues.apache.org/jira/browse/HBASE-9769
>> https://issues.apache.org/jira/browse/HBASE-9778
>>
>> Both with patches, you can grab and test them.
>>
>> -Vladimir
>>
>> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <[email protected]> wrote:
>>
>>> Hi Lars,
>>>
>>> Thanks for the background. It seems that for our case, we will have to
>>> consider some solution like the Facebook one, since the next column is
>>> always the next one - this can be a simple flag. I am going to raise a
>>> JIRA and we can discuss there.
>>>
>>> Thanks
>>> Varun
>>>
>>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <[email protected]> wrote:
>>>
>>>> This is somewhat of a known issue, and I'm sure Vladimir will chime in
>>>> soon. :)
>>>>
>>>> Reseek is expensive compared to next() if next() would get us the KV
>>>> we're looking for. However, HBase does not know that ahead of time. There
>>>> might be 1000 versions of the previous KV to be skipped first.
>>>> HBase seeks in three situations:
>>>> 1. Seek to the next column (there might be a lot of versions to skip)
>>>> 2. Seek to the next row (there might be a lot of versions and other
>>>> columns to skip)
>>>> 3. Seek to a row via a hint
>>>>
>>>> #3 is definitely useful; with it one can implement very efficient "skip
>>>> scans" (see the FuzzyRowFilter and what Phoenix is doing).
>>>> #2 is helpful if there are many columns and one only "selects" a few
>>>> (and of course also if there are many versions of columns).
>>>> #1 is only helpful when we expect there to be many versions, or if the
>>>> size of a typical KV approaches the block size, since then we'd need to
>>>> seek to find the next block anyway.
>>>>
>>>> You might well be a victim of #1. Are your rows 10-20 columns, or is
>>>> that just the number of columns you return?
>>>>
>>>> Vladimir and myself have suggested a SMALL_ROW hint, where we instruct
>>>> the scanner to not seek to the next column or the next row, but just
>>>> issue next()'s until the KV is found. Another suggested approach (I think
>>>> by the Facebook guys) was to issue next() opportunistically a few times,
>>>> and only when that did not get us to the requested KV issue a reseek.
>>>> I have also thought of a near/far designation of seeks. For near seeks
>>>> we'd do a configurable number of next()'s first, then seek.
>>>> "Near" seeks would be those of category #1 (and maybe #2) above.
>>>>
>>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (and HBASE-9915, maybe).
>>>>
>>>> I'll look at the trace a bit closer.
>>>> So far my scan profiling has been focused on data in the blockcache,
>>>> since in the normal case the vast majority of all data is found there
>>>> and only recent changes are in the memstore.
>>>>
>>>> -- Lars
>>>>
>>>> ________________________________
>>>> From: Varun Sharma <[email protected]>
>>>> To: "[email protected]" <[email protected]>;
>>>> "[email protected]" <[email protected]>
>>>> Sent: Sunday, January 26, 2014 1:14 PM
>>>> Subject: Sporadic memstore slowness for Read Heavy workloads
>>>>
>>>> Hi,
>>>>
>>>> We are seeing some unfortunately low performance in the memstore - we
>>>> have researched some of the previous JIRA(s) and seen some
>>>> inefficiencies in the ConcurrentSkipListMap.
>>>> The symptom is a RegionServer hitting 100% CPU at weird points in time -
>>>> the bug is hard to reproduce, and there isn't a huge # of extra reads
>>>> going to that region server or any substantial hotspot happening. The
>>>> region server recovers the moment we flush the memstores or restart the
>>>> region server. Our queries retrieve wide rows of up to 10-20 columns.
>>>> A stack trace shows two things:
>>>>
>>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
>>>> ConcurrentSkipListMap
>>>> 2) The reseek() is being called at the "SEEK_NEXT" column inside
>>>> StoreScanner - this is understandable since the rows contain many
>>>> columns and StoreScanner iterates one KeyValue at a time.
>>>>
>>>> So, I was looking at the code, and it seems that every single time there
>>>> is a reseek call on the same memstore scanner, we make a fresh call to
>>>> build an iterator() on the skip list set - this means an additional skip
>>>> list lookup for every column retrieved. Skip list lookups are O(log n),
>>>> not O(1).
>>>>
>>>> Related JIRA HBASE-3855 made the reseek() scan some KVs and, if that
>>>> number is exceeded, do a lookup. However, it seems this behaviour was
>>>> reverted by HBASE-4195, and every next row/next column is now a reseek()
>>>> and a skip list lookup rather than an iterator step.
>>>>
>>>> Are there any strong reasons against having the previous behaviour of
>>>> scanning a small # of keys before degenerating to a skip list lookup?
>>>> Seems like it would really help for sequential memstore scans and for
>>>> memstore gets on wide tables (even 10-20 columns).
>>>>
>>>> Thanks
>>>> Varun
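[Editor's note] The "issue a few next()'s before falling back to a seek" idea discussed in this thread (the HBASE-3855-style behaviour, and lars's "near seek" suggestion) can be sketched schematically. The class and method names below are invented for illustration; this is not HBase's MemStoreScanner, just a toy scanner over a ConcurrentSkipListSet of string keys:

```java
import java.util.Iterator;
import java.util.NavigableSet;
import java.util.concurrent.ConcurrentSkipListSet;

// Schematic model of the opportunistic-next heuristic: before paying for
// a fresh O(log n) skip list lookup on every reseek, advance the existing
// iterator a few O(1) steps in case the target key is nearby. Names are
// invented; this is NOT HBase's MemStoreScanner.
class ToyMemstoreScanner {
    static final int MAX_NEXTS_BEFORE_SEEK = 8;  // tunable threshold

    private final NavigableSet<String> kvs;
    private Iterator<String> it;
    private String current;

    ToyMemstoreScanner(NavigableSet<String> kvs) {
        this.kvs = kvs;
        this.it = kvs.iterator();
        this.current = it.hasNext() ? it.next() : null;
    }

    String peek() { return current; }

    /** Position the scanner at the first key >= target. */
    void reseek(String target) {
        // Fast path: a handful of next() calls, each O(1).
        for (int i = 0; i < MAX_NEXTS_BEFORE_SEEK; i++) {
            if (current == null || current.compareTo(target) >= 0) return;
            current = it.hasNext() ? it.next() : null;
        }
        // Slow path: rebuild the iterator with a real skip list lookup,
        // which is the cost an unconditional reseek() pays every time.
        it = kvs.tailSet(target, true).iterator();
        current = it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        NavigableSet<String> kvs = new ConcurrentSkipListSet<>();
        for (int i = 1; i <= 20; i++) kvs.add(String.format("ROW/COL%02d", i));
        ToyMemstoreScanner scanner = new ToyMemstoreScanner(kvs);
        scanner.reseek("ROW/COL03");   // reached via next()s, no lookup needed
        scanner.reseek("ROW/COL15");   // too far ahead: falls back to tailSet()
        System.out.println(scanner.peek());  // prints ROW/COL15
    }
}
```

With rows of 10-20 columns, most SEEK_NEXT_COL transitions would land within a few next() calls and skip the O(log n) lookup entirely; only genuinely "far" seeks (many versions, or a different block) would take the slow path.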
