Re: Sporadic memstore slowness for Read Heavy workloads

Vladimir Rodionov Tue, 28 Jan 2014 09:50:48 -0800

Its ScanQueryMatcher-ScanDeleteTracker responsibility to process  deletes
during scanning.



On Tue, Jan 28, 2014 at 9:43 AM, Varun Sharma <[email protected]> wrote:

> Ohk I think I understand this better now. So the order will actually be,
> something like this, at step #3
>
> (ROW, <DELETE>, T=2)
> (ROW, COL1, T=3)
> (ROW, COL1, T=1)  - filtered
> (ROW, COL2, T=3)
> (ROW, COL2, T=1)  - filtered
> (ROW, COL3, T=3)
> (ROW, COL3, T=1)  - filtered
>
> The ScanDeleteTracker class would simply filter out columns which have a
> timestamp < 2.
>
> Varun
>
>
> On Tue, Jan 28, 2014 at 9:04 AM, Varun Sharma <[email protected]> wrote:
>
> > Lexicographically, (ROW, COL2, T=3) should come after (ROW, COL1, T=1)
> > because COL2 > COL1 lexicographically. However in the above example, it
> > comes before the delete marker and hence before (ROW, COL1, T=1) which is
> > wrong, no ?
> >
> >
> > On Tue, Jan 28, 2014 at 9:01 AM, Ted Yu <[email protected]> wrote:
> >
> >> bq. Now, clearly there will be columns above the delete marker which are
> >> smaller than the ones below it.
> >>
> >> This is where closer look is needed. Part of the confusion arises from
> >> usage of > and < in your example.
> >> (ROW, COL2, T=3) would sort before (ROW, COL1, T=1).
> >>
> >> Here, in terms of sort order, 'above' means before. 'below it' would
> mean
> >> after. So 'smaller' would mean before.
> >>
> >> Cheers
> >>
> >>
> >> On Tue, Jan 28, 2014 at 8:47 AM, Varun Sharma <[email protected]>
> >> wrote:
> >>
> >> > Hi Ted,
> >> >
> >> > Not satisfied with your answer, the document you sent does not talk
> >> about
> >> > Delete ColumnFamily marker sort order. For the delete family marker to
> >> > work, it has to mask *all* columns of a family. Hence it has to be
> above
> >> > all the older columns. All the new columns must come above this column
> >> > family delete marker. Now, clearly there will be columns above the
> >> delete
> >> > marker which are smaller than the ones below it.
> >> >
> >> > The document talks nothing about delete marker order, could you answer
> >> the
> >> > question by looking through the example ?
> >> >
> >> > Varun
> >> >
> >> >
> >> > On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu <[email protected]> wrote:
> >> >
> >> > > Varun:
> >> > > Take a look at http://hbase.apache.org/book.html#dm.sort
> >> > >
> >> > > There's no contradiction.
> >> > >
> >> > > Cheers
> >> > >
> >> > > On Jan 27, 2014, at 11:40 PM, Varun Sharma <[email protected]>
> >> wrote:
> >> > >
> >> > > > Actually, I now have another question because of the way our work
> >> load
> >> > is
> >> > > > structured. We use a wide schema and each time we write, we delete
> >> the
> >> > > > entire row and write a fresh set of columns - we want to make sure
> >> no
> >> > old
> >> > > > columns survive. So, I just want to see if my picture of the
> >> memstore
> >> > at
> >> > > > this point is correct or not. My understanding is that Memstore is
> >> > > > basically a skip list of keyvalues and compares the values using
> >> > KeyValue
> >> > > > comparator
> >> > > >
> >> > > > 1) *T=1, *We write 3 columns for "ROW". So memstore has:
> >> > > >
> >> > > > (ROW, COL1, T=1)
> >> > > > (ROW, COL2, T=1)
> >> > > > (ROW, COL3, T=1)
> >> > > >
> >> > > > 2) *T=2*, Now we write a delete marker for the entire ROW at T=2.
> So
> >> > > > memstore has - my understanding is that we do not delete in the
> >> > memstore
> >> > > > but only add markers:
> >> > > >
> >> > > > (ROW, <DELETE>, T=2)
> >> > > > (ROW, COL1, T=1)
> >> > > > (ROW, COL2, T=1)
> >> > > > (ROW, COL3, T=1)
> >> > > >
> >> > > > 3) Now we write our new fresh row for *T=3* - it should get
> inserted
> >> > > above
> >> > > > the delete.
> >> > > >
> >> > > > (ROW, COL1, T=3)
> >> > > > (ROW, COL2, T=3)
> >> > > > (ROW, COL3, T=3)
> >> > > > (ROW, <DELETE>, T=2)
> >> > > > (ROW, COL1, T=1)
> >> > > > (ROW, COL2, T=1)
> >> > > > (ROW, COL3, T=1)
> >> > > >
> >> > > > This is the ideal scenario for the data to be correctly reflected.
> >> > > >
> >> > > > (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *> *(ROW, COL1, T=1) and
> >> > hence,
> >> > > > *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
> >> > > >
> >> > > > But, we also know that KeyValues compare first by ROW, then by
> >> Column
> >> > and
> >> > > > then by timestamp in reverse order
> >> > > >
> >> > > > *(ROW, COL2, T=3) < (ROW, COL1, T=1) *
> >> > > >
> >> > > > This seems to be contradicting and my main worry is that in a skip
> >> > list,
> >> > > it
> >> > > > is quite possible for skipping to happen as you go through the
> high
> >> > level
> >> > > > express lanes and it could be possible for one of these entries to
> >> > never
> >> > > > actually even see the delete marker. For example consider the case
> >> > above
> >> > > > where entry #1 and entry #5 form the higher level of the skip list
> >> and
> >> > > the
> >> > > > skip list has 2 levels. Now someone tries to insert (ROW, COL4,
> T=3)
> >> > and
> >> > > it
> >> > > > could end up in the wrong location.
> >> > > >
> >> > > > Obviously, if we cleanse all the row contents when a get a ROW
> level
> >> > > delete
> >> > > > marker, we are fine but I want to know if that is the case. If we
> >> are
> >> > not
> >> > > > really cleansing all the row contents when we get a ROW level
> delete
> >> > > > marker, then I want to know why the above scenario can not lead to
> >> bugs
> >> > > :)
> >> > > >
> >> > > > Varun
> >> > > >
> >> > > >
> >> > > > On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
> >> > > > <[email protected]>wrote:
> >> > > >
> >> > > >> Varun,
> >> > > >>
> >> > > >> There is no need to open new JIRA - there are two already:
> >> > > >> https://issues.apache.org/jira/browse/HBASE-9769
> >> > > >> https://issues.apache.org/jira/browse/HBASE-9778
> >> > > >>
> >> > > >> Both with patches, you can grab and test them.
> >> > > >>
> >> > > >> -Vladimir
> >> > > >>
> >> > > >>
> >> > > >> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <
> [email protected]
> >> >
> >> > > wrote:
> >> > > >>
> >> > > >>> Hi lars,
> >> > > >>>
> >> > > >>> Thanks for the background. It seems that for our case, we will
> >> have
> >> > to
> >> > > >>> consider some solution like the Facebook one, since the next
> >> column
> >> > is
> >> > > >>> always the next one - this can be a simple flag. I am going to
> >> raise
> >> > a
> >> > > >> JIRA
> >> > > >>> and we can discuss there.
> >> > > >>>
> >> > > >>> Thanks
> >> > > >>> Varun
> >> > > >>>
> >> > > >>>
> >> > > >>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <
> [email protected]>
> >> > > wrote:
> >> > > >>>
> >> > > >>>> This is somewhat of a known issue, and I'm sure Vladimir will
> >> chime
> >> > in
> >> > > >>>> soon. :)
> >> > > >>>>
> >> > > >>>> Reseek is expensive compared to next if next would get us the
> KV
> >> > we're
> >> > > >>>> looking for. However, HBase does not know that ahead of time.
> >> There
> >> > > >> might
> >> > > >>>> be a 1000 versions of the previous KV to be skipped first.
> >> > > >>>> HBase seeks in three situation:
> >> > > >>>> 1. Seek to the next column (there might be a lot of versions to
> >> > skip)
> >> > > >>>> 2. Seek to the next row (there might be a lot of versions and
> >> other
> >> > > >>>> columns to skip)
> >> > > >>>> 3. Seek to a row via a hint
> >> > > >>>>
> >> > > >>>> #3 is definitely useful, with that one can implement very
> >> efficient
> >> > > >> "skip
> >> > > >>>> scans" (see the FuzzyRowFilter and what Phoenix is doing).
> >> > > >>>> #2 is helpful if there are many columns and one only "selects"
> a
> >> few
> >> > > >> (and
> >> > > >>>> of course also if there are many versions of columns)
> >> > > >>>> #1 is only helpful when we expect there to be many versions. Or
> >> of
> >> > the
> >> > > >>>> size of a typical KV aproaches the block size, since then we'd
> >> need
> >> > to
> >> > > >>> seek
> >> > > >>>> to the find the next block anyway.
> >> > > >>>>
> >> > > >>>> You might well be a victim of #1. Are your rows 10-20 columns
> or
> >> is
> >> > > >> that
> >> > > >>>> just the number of column you return?
> >> > > >>>>
> >> > > >>>> Vladimir and myself have suggested a SMALL_ROW hint, where we
> >> > instruct
> >> > > >>> the
> >> > > >>>> scanner to not seek to the next column or the next row, but
> just
> >> > issue
> >> > > >>>> next()'s until the KV is found. Another suggested approach (I
> >> think
> >> > by
> >> > > >>> the
> >> > > >>>> Facebook guys) was to issue next() opportunistically a few
> times,
> >> > and
> >> > > >>> only
> >> > > >>>> when that did not get us to ther requested KV issue a reseek.
> >> > > >>>> I have also thought of a near/far designation of seeks. For
> near
> >> > seeks
> >> > > >>>> we'd do a configurable number of next()'s first, then seek.
> >> > > >>>> "near" seeks would be those of category #1 (and maybe #2)
> above.
> >> > > >>>>
> >> > > >>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915,
> maybe)
> >> > > >>>>
> >> > > >>>> I'll look at the trace a bit closers.
> >> > > >>>> So far my scan profiling has been focused on data in the
> >> blockcache
> >> > > >> since
> >> > > >>>> in the normal case the vast majority of all data is found there
> >> and
> >> > > >> only
> >> > > >>>> recent changes are in the memstore.
> >> > > >>>>
> >> > > >>>> -- Lars
> >> > > >>>>
> >> > > >>>>
> >> > > >>>>
> >> > > >>>>
> >> > > >>>> ________________________________
> >> > > >>>> From: Varun Sharma <[email protected]>
> >> > > >>>> To: "[email protected]" <[email protected]>; "
> >> > > >>> [email protected]"
> >> > > >>>> <[email protected]>
> >> > > >>>> Sent: Sunday, January 26, 2014 1:14 PM
> >> > > >>>> Subject: Sporadic memstore slowness for Read Heavy workloads
> >> > > >>>>
> >> > > >>>>
> >> > > >>>> Hi,
> >> > > >>>>
> >> > > >>>> We are seeing some unfortunately low performance in the
> memstore
> >> -
> >> > we
> >> > > >>> have
> >> > > >>>> researched some of the previous JIRA(s) and seen some
> >> inefficiencies
> >> > > in
> >> > > >>> the
> >> > > >>>> ConcurrentSkipListMap. The symptom is a RegionServer hitting
> 100
> >> %
> >> > cpu
> >> > > >> at
> >> > > >>>> weird points in time - the bug is hard to reproduce and there
> >> isn't
> >> > > >> like
> >> > > >>> a
> >> > > >>>> huge # of extra reads going to that region server or any
> >> substantial
> >> > > >>>> hotspot happening. The region server recovers the moment, we
> >> flush
> >> > the
> >> > > >>>> memstores or restart the region server. Our queries retrieve
> wide
> >> > rows
> >> > > >>>> which are upto 10-20 columns. A stack trace shows two things:
> >> > > >>>>
> >> > > >>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
> >> > > >>>> ConcurrentSkipListMap
> >> > > >>>> 2) The reseek() is being called at the "SEEK_NEXT" column
> inside
> >> > > >>>> StoreScanner - this is understandable since the rows contain
> many
> >> > > >> columns
> >> > > >>>> and StoreScanner iterates one KeyValue at a time.
> >> > > >>>>
> >> > > >>>> So, I was looking at the code and it seems that every single
> time
> >> > > there
> >> > > >>> is
> >> > > >>>> a reseek call on the same memstore scanner, we make a fresh
> call
> >> to
> >> > > >> build
> >> > > >>>> an iterator() on the skip list set - this means we an
> additional
> >> > skip
> >> > > >>> list
> >> > > >>>> lookup for every column retrieved. SkipList lookups are O(n)
> and
> >> not
> >> > > >>> O(1).
> >> > > >>>>
> >> > > >>>> Related JIRA HBASE 3855 made the reseek() scan some KVs and if
> >> that
> >> > > >>> number
> >> > > >>>> if exceeded, do a lookup. However, it seems this behaviour was
> >> > > reverted
> >> > > >>> by
> >> > > >>>> HBASE 4195 and every next row/next column is now a reseek()
> and a
> >> > skip
> >> > > >>> list
> >> > > >>>> lookup rather than being an iterator.
> >> > > >>>>
> >> > > >>>> Are there any strong reasons against having the previous
> >> behaviour
> >> > of
> >> > > >>>> scanning a small # of keys before degenerating to a skip list
> >> > lookup ?
> >> > > >>>> Seems like it would really help for sequential memstore scans
> and
> >> > for
> >> > > >>>> memstore gets with wide tables (even 10-20 columns).
> >> > > >>>>
> >> > > >>>> Thanks
> >> > > >>>> Varun
> >> > > >>
> >> > >
> >> >
> >>
> >
> >
>

Re: Sporadic memstore slowness for Read Heavy workloads

Reply via email to