Re: Sporadic memstore slowness for Read Heavy workloads

lars hofhansl Tue, 28 Jan 2014 12:00:11 -0800

I see you figured it out. I should read all email before I sent my last reply.




________________________________
 From: Varun Sharma <[email protected]>
To: "[email protected]" <[email protected]> 
Cc: "[email protected]" <[email protected]>; lars hofhansl 
<[email protected]> 
Sent: Tuesday, January 28, 2014 9:43 AM
Subject: Re: Sporadic memstore slowness for Read Heavy workloads
 


Ohk I think I understand this better now. So the order will actually be, 
something like this, at step #3

(ROW, <DELETE>, T=2)
(ROW, COL1, T=3)
(ROW, COL1, T=1)  - filtered

(ROW, COL2, T=3)
(ROW, COL2, T=1)  - filtered
(ROW, COL3, T=3)
(ROW, COL3, T=1)  - filtered

The ScanDeleteTracker class would simply filter out columns which have a 
timestamp < 2.

Varun



On Tue, Jan 28, 2014 at 9:04 AM, Varun Sharma <[email protected]> wrote:

Lexicographically, (ROW, COL2, T=3) should come after (ROW, COL1, T=1) because 
COL2 > COL1 lexicographically. However in the above example, it comes before 
the delete marker and hence before (ROW, COL1, T=1) which is wrong, no ?
>
>
>
>On Tue, Jan 28, 2014 at 9:01 AM, Ted Yu <[email protected]> wrote:
>
>bq. Now, clearly there will be columns above the delete marker which are
>>
>>smaller than the ones below it.
>>
>>This is where closer look is needed. Part of the confusion arises from
>>usage of > and < in your example.
>>(ROW, COL2, T=3) would sort before (ROW, COL1, T=1).
>>
>>Here, in terms of sort order, 'above' means before. 'below it' would mean
>>after. So 'smaller' would mean before.
>>
>>Cheers
>>
>>
>>
>>On Tue, Jan 28, 2014 at 8:47 AM, Varun Sharma <[email protected]> wrote:
>>
>>> Hi Ted,
>>>
>>> Not satisfied with your answer, the document you sent does not talk about
>>> Delete ColumnFamily marker sort order. For the delete family marker to
>>> work, it has to mask *all* columns of a family. Hence it has to be above
>>> all the older columns. All the new columns must come above this column
>>> family delete marker. Now, clearly there will be columns above the delete
>>> marker which are smaller than the ones below it.
>>>
>>> The document talks nothing about delete marker order, could you answer the
>>> question by looking through the example ?
>>>
>>> Varun
>>>
>>>
>>> On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu <[email protected]> wrote:
>>>
>>> > Varun:
>>> > Take a look at http://hbase.apache.org/book.html#dm.sort
>>> >
>>> > There's no contradiction.
>>> >
>>> > Cheers
>>> >
>>> > On Jan 27, 2014, at 11:40 PM, Varun Sharma <[email protected]> wrote:
>>> >
>>> > > Actually, I now have another question because of the way our work load
>>> is
>>> > > structured. We use a wide schema and each time we write, we delete the
>>> > > entire row and write a fresh set of columns - we want to make sure no
>>> old
>>> > > columns survive. So, I just want to see if my picture of the memstore
>>> at
>>> > > this point is correct or not. My understanding is that Memstore is
>>> > > basically a skip list of keyvalues and compares the values using
>>> KeyValue
>>> > > comparator
>>> > >
>>> > > 1) *T=1, *We write 3 columns for "ROW". So memstore has:
>>> > >
>>> > > (ROW, COL1, T=1)
>>> > > (ROW, COL2, T=1)
>>> > > (ROW, COL3, T=1)
>>> > >
>>> > > 2) *T=2*, Now we write a delete marker for the entire ROW at T=2. So
>>> > > memstore has - my understanding is that we do not delete in the
>>> memstore
>>> > > but only add markers:
>>> > >
>>> > > (ROW, <DELETE>, T=2)
>>> > > (ROW, COL1, T=1)
>>> > > (ROW, COL2, T=1)
>>> > > (ROW, COL3, T=1)
>>> > >
>>> > > 3) Now we write our new fresh row for *T=3* - it should get inserted
>>> > above
>>> > > the delete.
>>> > >
>>> > > (ROW, COL1, T=3)
>>> > > (ROW, COL2, T=3)
>>> > > (ROW, COL3, T=3)
>>> > > (ROW, <DELETE>, T=2)
>>> > > (ROW, COL1, T=1)
>>> > > (ROW, COL2, T=1)
>>> > > (ROW, COL3, T=1)
>>> > >
>>> > > This is the ideal scenario for the data to be correctly reflected.
>>> > >
>>> > > (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *> *(ROW, COL1, T=1) and
>>> hence,
>>> > > *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
>>> > >
>>> > > But, we also know that KeyValues compare first by ROW, then by Column
>>> and
>>> > > then by timestamp in reverse order
>>> > >
>>> > > *(ROW, COL2, T=3) < (ROW, COL1, T=1) *
>>> > >
>>> > > This seems to be contradicting and my main worry is that in a skip
>>> list,
>>> > it
>>> > > is quite possible for skipping to happen as you go through the high
>>> level
>>> > > express lanes and it could be possible for one of these entries to
>>> never
>>> > > actually even see the delete marker. For example consider the case
>>> above
>>> > > where entry #1 and entry #5 form the higher level of the skip list and
>>> > the
>>> > > skip list has 2 levels. Now someone tries to insert (ROW, COL4, T=3)
>>> and
>>> > it
>>> > > could end up in the wrong location.
>>> > >
>>> > > Obviously, if we cleanse all the row contents when a get a ROW level
>>> > delete
>>> > > marker, we are fine but I want to know if that is the case. If we are
>>> not
>>> > > really cleansing all the row contents when we get a ROW level delete
>>> > > marker, then I want to know why the above scenario can not lead to bugs
>>> > :)
>>> > >
>>> > > Varun
>>> > >
>>> > >
>>> > > On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
>>> > > <[email protected]>wrote:
>>> > >
>>> > >> Varun,
>>> > >>
>>> > >> There is no need to open new JIRA - there are two already:
>>> > >> https://issues.apache.org/jira/browse/HBASE-9769
>>> > >> https://issues.apache.org/jira/browse/HBASE-9778
>>> > >>
>>> > >> Both with patches, you can grab and test them.
>>> > >>
>>> > >> -Vladimir
>>> > >>
>>> > >>
>>> > >> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <[email protected]>
>>> > wrote:
>>> > >>
>>> > >>> Hi lars,
>>> > >>>
>>> > >>> Thanks for the background. It seems that for our case, we will have
>>> to
>>> > >>> consider some solution like the Facebook one, since the next column
>>> is
>>> > >>> always the next one - this can be a simple flag. I am going to raise
>>> a
>>> > >> JIRA
>>> > >>> and we can discuss there.
>>> > >>>
>>> > >>> Thanks
>>> > >>> Varun
>>> > >>>
>>> > >>>
>>> > >>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <[email protected]>
>>> > wrote:
>>> > >>>
>>> > >>>> This is somewhat of a known issue, and I'm sure Vladimir will chime
>>> in
>>> > >>>> soon. :)
>>> > >>>>
>>> > >>>> Reseek is expensive compared to next if next would get us the KV
>>> we're
>>> > >>>> looking for. However, HBase does not know that ahead of time. There
>>> > >> might
>>> > >>>> be a 1000 versions of the previous KV to be skipped first.
>>> > >>>> HBase seeks in three situation:
>>> > >>>> 1. Seek to the next column (there might be a lot of versions to
>>> skip)
>>> > >>>> 2. Seek to the next row (there might be a lot of versions and other
>>> > >>>> columns to skip)
>>> > >>>> 3. Seek to a row via a hint
>>> > >>>>
>>> > >>>> #3 is definitely useful, with that one can implement very efficient
>>> > >> "skip
>>> > >>>> scans" (see the FuzzyRowFilter and what Phoenix is doing).
>>> > >>>> #2 is helpful if there are many columns and one only "selects" a few
>>> > >> (and
>>> > >>>> of course also if there are many versions of columns)
>>> > >>>> #1 is only helpful when we expect there to be many versions. Or of
>>> the
>>> > >>>> size of a typical KV aproaches the block size, since then we'd need
>>> to
>>> > >>> seek
>>> > >>>> to the find the next block anyway.
>>> > >>>>
>>> > >>>> You might well be a victim of #1. Are your rows 10-20 columns or is
>>> > >> that
>>> > >>>> just the number of column you return?
>>> > >>>>
>>> > >>>> Vladimir and myself have suggested a SMALL_ROW hint, where we
>>> instruct
>>> > >>> the
>>> > >>>> scanner to not seek to the next column or the next row, but just
>>> issue
>>> > >>>> next()'s until the KV is found. Another suggested approach (I think
>>> by
>>> > >>> the
>>> > >>>> Facebook guys) was to issue next() opportunistically a few times,
>>> and
>>> > >>> only
>>> > >>>> when that did not get us to ther requested KV issue a reseek.
>>> > >>>> I have also thought of a near/far designation of seeks. For near
>>> seeks
>>> > >>>> we'd do a configurable number of next()'s first, then seek.
>>> > >>>> "near" seeks would be those of category #1 (and maybe #2) above.
>>> > >>>>
>>> > >>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)
>>> > >>>>
>>> > >>>> I'll look at the trace a bit closers.
>>> > >>>> So far my scan profiling has been focused on data in the blockcache
>>> > >> since
>>> > >>>> in the normal case the vast majority of all data is found there and
>>> > >> only
>>> > >>>> recent changes are in the memstore.
>>> > >>>>
>>> > >>>> -- Lars
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>> ________________________________
>>> > >>>> From: Varun Sharma <[email protected]>
>>> > >>>> To: "[email protected]" <[email protected]>; "
>>> > >>> [email protected]"
>>> > >>>> <[email protected]>
>>> > >>>> Sent: Sunday, January 26, 2014 1:14 PM
>>> > >>>> Subject: Sporadic memstore slowness for Read Heavy workloads
>>> > >>>>
>>> > >>>>
>>> > >>>> Hi,
>>> > >>>>
>>> > >>>> We are seeing some unfortunately low performance in the memstore -
>>> we
>>> > >>> have
>>> > >>>> researched some of the previous JIRA(s) and seen some inefficiencies
>>> > in
>>> > >>> the
>>> > >>>> ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 %
>>> cpu
>>> > >> at
>>> > >>>> weird points in time - the bug is hard to reproduce and there isn't
>>> > >> like
>>> > >>> a
>>> > >>>> huge # of extra reads going to that region server or any substantial
>>> > >>>> hotspot happening. The region server recovers the moment, we flush
>>> the
>>> > >>>> memstores or restart the region server. Our queries retrieve wide
>>> rows
>>> > >>>> which are upto 10-20 columns. A stack trace shows two things:
>>> > >>>>
>>> > >>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
>>> > >>>> ConcurrentSkipListMap
>>> > >>>> 2) The reseek() is being called at the "SEEK_NEXT" column inside
>>> > >>>> StoreScanner - this is understandable since the rows contain many
>>> > >> columns
>>> > >>>> and StoreScanner iterates one KeyValue at a time.
>>> > >>>>
>>> > >>>> So, I was looking at the code and it seems that every single time
>>> > there
>>> > >>> is
>>> > >>>> a reseek call on the same memstore scanner, we make a fresh call to
>>> > >> build
>>> > >>>> an iterator() on the skip list set - this means we an additional
>>> skip
>>> > >>> list
>>> > >>>> lookup for every column retrieved. SkipList lookups are O(n) and not
>>> > >>> O(1).
>>> > >>>>
>>> > >>>> Related JIRA HBASE 3855 made the reseek() scan some KVs and if that
>>> > >>> number
>>> > >>>> if exceeded, do a lookup. However, it seems this behaviour was
>>> > reverted
>>> > >>> by
>>> > >>>> HBASE 4195 and every next row/next column is now a reseek() and a
>>> skip
>>> > >>> list
>>> > >>>> lookup rather than being an iterator.
>>> > >>>>
>>> > >>>> Are there any strong reasons against having the previous behaviour
>>> of
>>> > >>>> scanning a small # of keys before degenerating to a skip list
>>> lookup ?
>>> > >>>> Seems like it would really help for sequential memstore scans and
>>> for
>>> > >>>> memstore gets with wide tables (even 10-20 columns).
>>> > >>>>
>>> > >>>> Thanks
>>> > >>>> Varun
>>> > >>
>>> >
>>>
>>
>

Re: Sporadic memstore slowness for Read Heavy workloads

Reply via email to