Its ScanQueryMatcher-ScanDeleteTracker responsibility to process deletes during scanning.
On Tue, Jan 28, 2014 at 9:43 AM, Varun Sharma <[email protected]> wrote: > Ohk I think I understand this better now. So the order will actually be, > something like this, at step #3 > > (ROW, <DELETE>, T=2) > (ROW, COL1, T=3) > (ROW, COL1, T=1) - filtered > (ROW, COL2, T=3) > (ROW, COL2, T=1) - filtered > (ROW, COL3, T=3) > (ROW, COL3, T=1) - filtered > > The ScanDeleteTracker class would simply filter out columns which have a > timestamp < 2. > > Varun > > > On Tue, Jan 28, 2014 at 9:04 AM, Varun Sharma <[email protected]> wrote: > > > Lexicographically, (ROW, COL2, T=3) should come after (ROW, COL1, T=1) > > because COL2 > COL1 lexicographically. However in the above example, it > > comes before the delete marker and hence before (ROW, COL1, T=1) which is > > wrong, no ? > > > > > > On Tue, Jan 28, 2014 at 9:01 AM, Ted Yu <[email protected]> wrote: > > > >> bq. Now, clearly there will be columns above the delete marker which are > >> smaller than the ones below it. > >> > >> This is where closer look is needed. Part of the confusion arises from > >> usage of > and < in your example. > >> (ROW, COL2, T=3) would sort before (ROW, COL1, T=1). > >> > >> Here, in terms of sort order, 'above' means before. 'below it' would > mean > >> after. So 'smaller' would mean before. > >> > >> Cheers > >> > >> > >> On Tue, Jan 28, 2014 at 8:47 AM, Varun Sharma <[email protected]> > >> wrote: > >> > >> > Hi Ted, > >> > > >> > Not satisfied with your answer, the document you sent does not talk > >> about > >> > Delete ColumnFamily marker sort order. For the delete family marker to > >> > work, it has to mask *all* columns of a family. Hence it has to be > above > >> > all the older columns. All the new columns must come above this column > >> > family delete marker. Now, clearly there will be columns above the > >> delete > >> > marker which are smaller than the ones below it. > >> > > >> > The document talks nothing about delete marker order, could you answer > >> the > >> > question by looking through the example ? > >> > > >> > Varun > >> > > >> > > >> > On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu <[email protected]> wrote: > >> > > >> > > Varun: > >> > > Take a look at http://hbase.apache.org/book.html#dm.sort > >> > > > >> > > There's no contradiction. > >> > > > >> > > Cheers > >> > > > >> > > On Jan 27, 2014, at 11:40 PM, Varun Sharma <[email protected]> > >> wrote: > >> > > > >> > > > Actually, I now have another question because of the way our work > >> load > >> > is > >> > > > structured. We use a wide schema and each time we write, we delete > >> the > >> > > > entire row and write a fresh set of columns - we want to make sure > >> no > >> > old > >> > > > columns survive. So, I just want to see if my picture of the > >> memstore > >> > at > >> > > > this point is correct or not. My understanding is that Memstore is > >> > > > basically a skip list of keyvalues and compares the values using > >> > KeyValue > >> > > > comparator > >> > > > > >> > > > 1) *T=1, *We write 3 columns for "ROW". So memstore has: > >> > > > > >> > > > (ROW, COL1, T=1) > >> > > > (ROW, COL2, T=1) > >> > > > (ROW, COL3, T=1) > >> > > > > >> > > > 2) *T=2*, Now we write a delete marker for the entire ROW at T=2. > So > >> > > > memstore has - my understanding is that we do not delete in the > >> > memstore > >> > > > but only add markers: > >> > > > > >> > > > (ROW, <DELETE>, T=2) > >> > > > (ROW, COL1, T=1) > >> > > > (ROW, COL2, T=1) > >> > > > (ROW, COL3, T=1) > >> > > > > >> > > > 3) Now we write our new fresh row for *T=3* - it should get > inserted > >> > > above > >> > > > the delete. > >> > > > > >> > > > (ROW, COL1, T=3) > >> > > > (ROW, COL2, T=3) > >> > > > (ROW, COL3, T=3) > >> > > > (ROW, <DELETE>, T=2) > >> > > > (ROW, COL1, T=1) > >> > > > (ROW, COL2, T=1) > >> > > > (ROW, COL3, T=1) > >> > > > > >> > > > This is the ideal scenario for the data to be correctly reflected. > >> > > > > >> > > > (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *> *(ROW, COL1, T=1) and > >> > hence, > >> > > > *(ROW, COL2, T=3) > (ROW, COL1, T=1)* > >> > > > > >> > > > But, we also know that KeyValues compare first by ROW, then by > >> Column > >> > and > >> > > > then by timestamp in reverse order > >> > > > > >> > > > *(ROW, COL2, T=3) < (ROW, COL1, T=1) * > >> > > > > >> > > > This seems to be contradicting and my main worry is that in a skip > >> > list, > >> > > it > >> > > > is quite possible for skipping to happen as you go through the > high > >> > level > >> > > > express lanes and it could be possible for one of these entries to > >> > never > >> > > > actually even see the delete marker. For example consider the case > >> > above > >> > > > where entry #1 and entry #5 form the higher level of the skip list > >> and > >> > > the > >> > > > skip list has 2 levels. Now someone tries to insert (ROW, COL4, > T=3) > >> > and > >> > > it > >> > > > could end up in the wrong location. > >> > > > > >> > > > Obviously, if we cleanse all the row contents when a get a ROW > level > >> > > delete > >> > > > marker, we are fine but I want to know if that is the case. If we > >> are > >> > not > >> > > > really cleansing all the row contents when we get a ROW level > delete > >> > > > marker, then I want to know why the above scenario can not lead to > >> bugs > >> > > :) > >> > > > > >> > > > Varun > >> > > > > >> > > > > >> > > > On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov > >> > > > <[email protected]>wrote: > >> > > > > >> > > >> Varun, > >> > > >> > >> > > >> There is no need to open new JIRA - there are two already: > >> > > >> https://issues.apache.org/jira/browse/HBASE-9769 > >> > > >> https://issues.apache.org/jira/browse/HBASE-9778 > >> > > >> > >> > > >> Both with patches, you can grab and test them. > >> > > >> > >> > > >> -Vladimir > >> > > >> > >> > > >> > >> > > >> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma < > [email protected] > >> > > >> > > wrote: > >> > > >> > >> > > >>> Hi lars, > >> > > >>> > >> > > >>> Thanks for the background. It seems that for our case, we will > >> have > >> > to > >> > > >>> consider some solution like the Facebook one, since the next > >> column > >> > is > >> > > >>> always the next one - this can be a simple flag. I am going to > >> raise > >> > a > >> > > >> JIRA > >> > > >>> and we can discuss there. > >> > > >>> > >> > > >>> Thanks > >> > > >>> Varun > >> > > >>> > >> > > >>> > >> > > >>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl < > [email protected]> > >> > > wrote: > >> > > >>> > >> > > >>>> This is somewhat of a known issue, and I'm sure Vladimir will > >> chime > >> > in > >> > > >>>> soon. :) > >> > > >>>> > >> > > >>>> Reseek is expensive compared to next if next would get us the > KV > >> > we're > >> > > >>>> looking for. However, HBase does not know that ahead of time. > >> There > >> > > >> might > >> > > >>>> be a 1000 versions of the previous KV to be skipped first. > >> > > >>>> HBase seeks in three situation: > >> > > >>>> 1. Seek to the next column (there might be a lot of versions to > >> > skip) > >> > > >>>> 2. Seek to the next row (there might be a lot of versions and > >> other > >> > > >>>> columns to skip) > >> > > >>>> 3. Seek to a row via a hint > >> > > >>>> > >> > > >>>> #3 is definitely useful, with that one can implement very > >> efficient > >> > > >> "skip > >> > > >>>> scans" (see the FuzzyRowFilter and what Phoenix is doing). > >> > > >>>> #2 is helpful if there are many columns and one only "selects" > a > >> few > >> > > >> (and > >> > > >>>> of course also if there are many versions of columns) > >> > > >>>> #1 is only helpful when we expect there to be many versions. Or > >> of > >> > the > >> > > >>>> size of a typical KV aproaches the block size, since then we'd > >> need > >> > to > >> > > >>> seek > >> > > >>>> to the find the next block anyway. > >> > > >>>> > >> > > >>>> You might well be a victim of #1. Are your rows 10-20 columns > or > >> is > >> > > >> that > >> > > >>>> just the number of column you return? > >> > > >>>> > >> > > >>>> Vladimir and myself have suggested a SMALL_ROW hint, where we > >> > instruct > >> > > >>> the > >> > > >>>> scanner to not seek to the next column or the next row, but > just > >> > issue > >> > > >>>> next()'s until the KV is found. Another suggested approach (I > >> think > >> > by > >> > > >>> the > >> > > >>>> Facebook guys) was to issue next() opportunistically a few > times, > >> > and > >> > > >>> only > >> > > >>>> when that did not get us to ther requested KV issue a reseek. > >> > > >>>> I have also thought of a near/far designation of seeks. For > near > >> > seeks > >> > > >>>> we'd do a configurable number of next()'s first, then seek. > >> > > >>>> "near" seeks would be those of category #1 (and maybe #2) > above. > >> > > >>>> > >> > > >>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, > maybe) > >> > > >>>> > >> > > >>>> I'll look at the trace a bit closers. > >> > > >>>> So far my scan profiling has been focused on data in the > >> blockcache > >> > > >> since > >> > > >>>> in the normal case the vast majority of all data is found there > >> and > >> > > >> only > >> > > >>>> recent changes are in the memstore. > >> > > >>>> > >> > > >>>> -- Lars > >> > > >>>> > >> > > >>>> > >> > > >>>> > >> > > >>>> > >> > > >>>> ________________________________ > >> > > >>>> From: Varun Sharma <[email protected]> > >> > > >>>> To: "[email protected]" <[email protected]>; " > >> > > >>> [email protected]" > >> > > >>>> <[email protected]> > >> > > >>>> Sent: Sunday, January 26, 2014 1:14 PM > >> > > >>>> Subject: Sporadic memstore slowness for Read Heavy workloads > >> > > >>>> > >> > > >>>> > >> > > >>>> Hi, > >> > > >>>> > >> > > >>>> We are seeing some unfortunately low performance in the > memstore > >> - > >> > we > >> > > >>> have > >> > > >>>> researched some of the previous JIRA(s) and seen some > >> inefficiencies > >> > > in > >> > > >>> the > >> > > >>>> ConcurrentSkipListMap. The symptom is a RegionServer hitting > 100 > >> % > >> > cpu > >> > > >> at > >> > > >>>> weird points in time - the bug is hard to reproduce and there > >> isn't > >> > > >> like > >> > > >>> a > >> > > >>>> huge # of extra reads going to that region server or any > >> substantial > >> > > >>>> hotspot happening. The region server recovers the moment, we > >> flush > >> > the > >> > > >>>> memstores or restart the region server. Our queries retrieve > wide > >> > rows > >> > > >>>> which are upto 10-20 columns. A stack trace shows two things: > >> > > >>>> > >> > > >>>> 1) Time spent inside MemstoreScanner.reseek() and inside the > >> > > >>>> ConcurrentSkipListMap > >> > > >>>> 2) The reseek() is being called at the "SEEK_NEXT" column > inside > >> > > >>>> StoreScanner - this is understandable since the rows contain > many > >> > > >> columns > >> > > >>>> and StoreScanner iterates one KeyValue at a time. > >> > > >>>> > >> > > >>>> So, I was looking at the code and it seems that every single > time > >> > > there > >> > > >>> is > >> > > >>>> a reseek call on the same memstore scanner, we make a fresh > call > >> to > >> > > >> build > >> > > >>>> an iterator() on the skip list set - this means we an > additional > >> > skip > >> > > >>> list > >> > > >>>> lookup for every column retrieved. SkipList lookups are O(n) > and > >> not > >> > > >>> O(1). > >> > > >>>> > >> > > >>>> Related JIRA HBASE 3855 made the reseek() scan some KVs and if > >> that > >> > > >>> number > >> > > >>>> if exceeded, do a lookup. However, it seems this behaviour was > >> > > reverted > >> > > >>> by > >> > > >>>> HBASE 4195 and every next row/next column is now a reseek() > and a > >> > skip > >> > > >>> list > >> > > >>>> lookup rather than being an iterator. > >> > > >>>> > >> > > >>>> Are there any strong reasons against having the previous > >> behaviour > >> > of > >> > > >>>> scanning a small # of keys before degenerating to a skip list > >> > lookup ? > >> > > >>>> Seems like it would really help for sequential memstore scans > and > >> > for > >> > > >>>> memstore gets with wide tables (even 10-20 columns). > >> > > >>>> > >> > > >>>> Thanks > >> > > >>>> Varun > >> > > >> > >> > > > >> > > >> > > > > >
