I see you figured it out. I should read all email before I sent my last reply.
________________________________ From: Varun Sharma <[email protected]> To: "[email protected]" <[email protected]> Cc: "[email protected]" <[email protected]>; lars hofhansl <[email protected]> Sent: Tuesday, January 28, 2014 9:43 AM Subject: Re: Sporadic memstore slowness for Read Heavy workloads Ohk I think I understand this better now. So the order will actually be, something like this, at step #3 (ROW, <DELETE>, T=2) (ROW, COL1, T=3) (ROW, COL1, T=1) - filtered (ROW, COL2, T=3) (ROW, COL2, T=1) - filtered (ROW, COL3, T=3) (ROW, COL3, T=1) - filtered The ScanDeleteTracker class would simply filter out columns which have a timestamp < 2. Varun On Tue, Jan 28, 2014 at 9:04 AM, Varun Sharma <[email protected]> wrote: Lexicographically, (ROW, COL2, T=3) should come after (ROW, COL1, T=1) because COL2 > COL1 lexicographically. However in the above example, it comes before the delete marker and hence before (ROW, COL1, T=1) which is wrong, no ? > > > >On Tue, Jan 28, 2014 at 9:01 AM, Ted Yu <[email protected]> wrote: > >bq. Now, clearly there will be columns above the delete marker which are >> >>smaller than the ones below it. >> >>This is where closer look is needed. Part of the confusion arises from >>usage of > and < in your example. >>(ROW, COL2, T=3) would sort before (ROW, COL1, T=1). >> >>Here, in terms of sort order, 'above' means before. 'below it' would mean >>after. So 'smaller' would mean before. >> >>Cheers >> >> >> >>On Tue, Jan 28, 2014 at 8:47 AM, Varun Sharma <[email protected]> wrote: >> >>> Hi Ted, >>> >>> Not satisfied with your answer, the document you sent does not talk about >>> Delete ColumnFamily marker sort order. For the delete family marker to >>> work, it has to mask *all* columns of a family. Hence it has to be above >>> all the older columns. All the new columns must come above this column >>> family delete marker. Now, clearly there will be columns above the delete >>> marker which are smaller than the ones below it. >>> >>> The document talks nothing about delete marker order, could you answer the >>> question by looking through the example ? >>> >>> Varun >>> >>> >>> On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu <[email protected]> wrote: >>> >>> > Varun: >>> > Take a look at http://hbase.apache.org/book.html#dm.sort >>> > >>> > There's no contradiction. >>> > >>> > Cheers >>> > >>> > On Jan 27, 2014, at 11:40 PM, Varun Sharma <[email protected]> wrote: >>> > >>> > > Actually, I now have another question because of the way our work load >>> is >>> > > structured. We use a wide schema and each time we write, we delete the >>> > > entire row and write a fresh set of columns - we want to make sure no >>> old >>> > > columns survive. So, I just want to see if my picture of the memstore >>> at >>> > > this point is correct or not. My understanding is that Memstore is >>> > > basically a skip list of keyvalues and compares the values using >>> KeyValue >>> > > comparator >>> > > >>> > > 1) *T=1, *We write 3 columns for "ROW". So memstore has: >>> > > >>> > > (ROW, COL1, T=1) >>> > > (ROW, COL2, T=1) >>> > > (ROW, COL3, T=1) >>> > > >>> > > 2) *T=2*, Now we write a delete marker for the entire ROW at T=2. So >>> > > memstore has - my understanding is that we do not delete in the >>> memstore >>> > > but only add markers: >>> > > >>> > > (ROW, <DELETE>, T=2) >>> > > (ROW, COL1, T=1) >>> > > (ROW, COL2, T=1) >>> > > (ROW, COL3, T=1) >>> > > >>> > > 3) Now we write our new fresh row for *T=3* - it should get inserted >>> > above >>> > > the delete. >>> > > >>> > > (ROW, COL1, T=3) >>> > > (ROW, COL2, T=3) >>> > > (ROW, COL3, T=3) >>> > > (ROW, <DELETE>, T=2) >>> > > (ROW, COL1, T=1) >>> > > (ROW, COL2, T=1) >>> > > (ROW, COL3, T=1) >>> > > >>> > > This is the ideal scenario for the data to be correctly reflected. >>> > > >>> > > (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *> *(ROW, COL1, T=1) and >>> hence, >>> > > *(ROW, COL2, T=3) > (ROW, COL1, T=1)* >>> > > >>> > > But, we also know that KeyValues compare first by ROW, then by Column >>> and >>> > > then by timestamp in reverse order >>> > > >>> > > *(ROW, COL2, T=3) < (ROW, COL1, T=1) * >>> > > >>> > > This seems to be contradicting and my main worry is that in a skip >>> list, >>> > it >>> > > is quite possible for skipping to happen as you go through the high >>> level >>> > > express lanes and it could be possible for one of these entries to >>> never >>> > > actually even see the delete marker. For example consider the case >>> above >>> > > where entry #1 and entry #5 form the higher level of the skip list and >>> > the >>> > > skip list has 2 levels. Now someone tries to insert (ROW, COL4, T=3) >>> and >>> > it >>> > > could end up in the wrong location. >>> > > >>> > > Obviously, if we cleanse all the row contents when a get a ROW level >>> > delete >>> > > marker, we are fine but I want to know if that is the case. If we are >>> not >>> > > really cleansing all the row contents when we get a ROW level delete >>> > > marker, then I want to know why the above scenario can not lead to bugs >>> > :) >>> > > >>> > > Varun >>> > > >>> > > >>> > > On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov >>> > > <[email protected]>wrote: >>> > > >>> > >> Varun, >>> > >> >>> > >> There is no need to open new JIRA - there are two already: >>> > >> https://issues.apache.org/jira/browse/HBASE-9769 >>> > >> https://issues.apache.org/jira/browse/HBASE-9778 >>> > >> >>> > >> Both with patches, you can grab and test them. >>> > >> >>> > >> -Vladimir >>> > >> >>> > >> >>> > >> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <[email protected]> >>> > wrote: >>> > >> >>> > >>> Hi lars, >>> > >>> >>> > >>> Thanks for the background. It seems that for our case, we will have >>> to >>> > >>> consider some solution like the Facebook one, since the next column >>> is >>> > >>> always the next one - this can be a simple flag. I am going to raise >>> a >>> > >> JIRA >>> > >>> and we can discuss there. >>> > >>> >>> > >>> Thanks >>> > >>> Varun >>> > >>> >>> > >>> >>> > >>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <[email protected]> >>> > wrote: >>> > >>> >>> > >>>> This is somewhat of a known issue, and I'm sure Vladimir will chime >>> in >>> > >>>> soon. :) >>> > >>>> >>> > >>>> Reseek is expensive compared to next if next would get us the KV >>> we're >>> > >>>> looking for. However, HBase does not know that ahead of time. There >>> > >> might >>> > >>>> be a 1000 versions of the previous KV to be skipped first. >>> > >>>> HBase seeks in three situation: >>> > >>>> 1. Seek to the next column (there might be a lot of versions to >>> skip) >>> > >>>> 2. Seek to the next row (there might be a lot of versions and other >>> > >>>> columns to skip) >>> > >>>> 3. Seek to a row via a hint >>> > >>>> >>> > >>>> #3 is definitely useful, with that one can implement very efficient >>> > >> "skip >>> > >>>> scans" (see the FuzzyRowFilter and what Phoenix is doing). >>> > >>>> #2 is helpful if there are many columns and one only "selects" a few >>> > >> (and >>> > >>>> of course also if there are many versions of columns) >>> > >>>> #1 is only helpful when we expect there to be many versions. Or of >>> the >>> > >>>> size of a typical KV aproaches the block size, since then we'd need >>> to >>> > >>> seek >>> > >>>> to the find the next block anyway. >>> > >>>> >>> > >>>> You might well be a victim of #1. Are your rows 10-20 columns or is >>> > >> that >>> > >>>> just the number of column you return? >>> > >>>> >>> > >>>> Vladimir and myself have suggested a SMALL_ROW hint, where we >>> instruct >>> > >>> the >>> > >>>> scanner to not seek to the next column or the next row, but just >>> issue >>> > >>>> next()'s until the KV is found. Another suggested approach (I think >>> by >>> > >>> the >>> > >>>> Facebook guys) was to issue next() opportunistically a few times, >>> and >>> > >>> only >>> > >>>> when that did not get us to ther requested KV issue a reseek. >>> > >>>> I have also thought of a near/far designation of seeks. For near >>> seeks >>> > >>>> we'd do a configurable number of next()'s first, then seek. >>> > >>>> "near" seeks would be those of category #1 (and maybe #2) above. >>> > >>>> >>> > >>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe) >>> > >>>> >>> > >>>> I'll look at the trace a bit closers. >>> > >>>> So far my scan profiling has been focused on data in the blockcache >>> > >> since >>> > >>>> in the normal case the vast majority of all data is found there and >>> > >> only >>> > >>>> recent changes are in the memstore. >>> > >>>> >>> > >>>> -- Lars >>> > >>>> >>> > >>>> >>> > >>>> >>> > >>>> >>> > >>>> ________________________________ >>> > >>>> From: Varun Sharma <[email protected]> >>> > >>>> To: "[email protected]" <[email protected]>; " >>> > >>> [email protected]" >>> > >>>> <[email protected]> >>> > >>>> Sent: Sunday, January 26, 2014 1:14 PM >>> > >>>> Subject: Sporadic memstore slowness for Read Heavy workloads >>> > >>>> >>> > >>>> >>> > >>>> Hi, >>> > >>>> >>> > >>>> We are seeing some unfortunately low performance in the memstore - >>> we >>> > >>> have >>> > >>>> researched some of the previous JIRA(s) and seen some inefficiencies >>> > in >>> > >>> the >>> > >>>> ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 % >>> cpu >>> > >> at >>> > >>>> weird points in time - the bug is hard to reproduce and there isn't >>> > >> like >>> > >>> a >>> > >>>> huge # of extra reads going to that region server or any substantial >>> > >>>> hotspot happening. The region server recovers the moment, we flush >>> the >>> > >>>> memstores or restart the region server. Our queries retrieve wide >>> rows >>> > >>>> which are upto 10-20 columns. A stack trace shows two things: >>> > >>>> >>> > >>>> 1) Time spent inside MemstoreScanner.reseek() and inside the >>> > >>>> ConcurrentSkipListMap >>> > >>>> 2) The reseek() is being called at the "SEEK_NEXT" column inside >>> > >>>> StoreScanner - this is understandable since the rows contain many >>> > >> columns >>> > >>>> and StoreScanner iterates one KeyValue at a time. >>> > >>>> >>> > >>>> So, I was looking at the code and it seems that every single time >>> > there >>> > >>> is >>> > >>>> a reseek call on the same memstore scanner, we make a fresh call to >>> > >> build >>> > >>>> an iterator() on the skip list set - this means we an additional >>> skip >>> > >>> list >>> > >>>> lookup for every column retrieved. SkipList lookups are O(n) and not >>> > >>> O(1). >>> > >>>> >>> > >>>> Related JIRA HBASE 3855 made the reseek() scan some KVs and if that >>> > >>> number >>> > >>>> if exceeded, do a lookup. However, it seems this behaviour was >>> > reverted >>> > >>> by >>> > >>>> HBASE 4195 and every next row/next column is now a reseek() and a >>> skip >>> > >>> list >>> > >>>> lookup rather than being an iterator. >>> > >>>> >>> > >>>> Are there any strong reasons against having the previous behaviour >>> of >>> > >>>> scanning a small # of keys before degenerating to a skip list >>> lookup ? >>> > >>>> Seems like it would really help for sequential memstore scans and >>> for >>> > >>>> memstore gets with wide tables (even 10-20 columns). >>> > >>>> >>> > >>>> Thanks >>> > >>>> Varun >>> > >> >>> > >>> >> >
