Varun: Take a look at http://hbase.apache.org/book.html#dm.sort
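To make the ordering that dm.sort describes concrete, here is a toy sketch. This is *not* HBase's actual KeyValue/KVComparator (it ignores column family, key type, and bytewise comparison; the `Cell` record and `SortDemo` class are invented for illustration): cells sort by row ascending, then qualifier ascending, then timestamp descending, so a row-level delete marker (empty qualifier) sorts ahead of every column of its row, and timestamps only interleave within a single column:

```java
import java.util.Comparator;
import java.util.TreeSet;

// Toy model of a cell key; NOT HBase's real KeyValue class.
public class SortDemo {
    record Cell(String row, String qualifier, long ts) {}

    // Simplified ordering per the reference guide (dm.sort): row ascending,
    // qualifier ascending, timestamp descending. A row-level delete marker
    // carries an empty qualifier, so it sorts before every column of its
    // row regardless of timestamp.
    static final Comparator<Cell> KV_ORDER =
        Comparator.comparing(Cell::row)
                  .thenComparing(Cell::qualifier)
                  .thenComparing(Comparator.comparingLong(Cell::ts).reversed());

    public static void main(String[] args) {
        TreeSet<Cell> memstore = new TreeSet<>(KV_ORDER);
        memstore.add(new Cell("ROW", "COL1", 1));
        memstore.add(new Cell("ROW", "COL2", 1));
        memstore.add(new Cell("ROW", "COL3", 1));
        memstore.add(new Cell("ROW", "", 2));      // row delete marker, empty qualifier
        memstore.add(new Cell("ROW", "COL1", 3));
        memstore.add(new Cell("ROW", "COL2", 3));
        memstore.add(new Cell("ROW", "COL3", 3));
        // Cells group by qualifier; timestamps interleave only within a column:
        // (ROW,,2) (ROW,COL1,3) (ROW,COL1,1) (ROW,COL2,3) (ROW,COL2,1) ...
        memstore.forEach(System.out::println);
    }
}
```

In this ordering the scanner reaches the delete marker first for the row, before any column, which is why the "express lane" worry below does not arise.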
There's no contradiction. Cheers

On Jan 27, 2014, at 11:40 PM, Varun Sharma <[email protected]> wrote:

> Actually, I now have another question because of the way our workload is
> structured. We use a wide schema and each time we write, we delete the
> entire row and write a fresh set of columns - we want to make sure no old
> columns survive. So, I just want to see if my picture of the memstore at
> this point is correct or not. My understanding is that the memstore is
> basically a skip list of KeyValues, ordered by the KeyValue comparator.
>
> 1) *T=1*, we write 3 columns for "ROW". So the memstore has:
>
> (ROW, COL1, T=1)
> (ROW, COL2, T=1)
> (ROW, COL3, T=1)
>
> 2) *T=2*, now we write a delete marker for the entire ROW at T=2. So the
> memstore has - my understanding is that we do not delete in the memstore
> but only add markers:
>
> (ROW, <DELETE>, T=2)
> (ROW, COL1, T=1)
> (ROW, COL2, T=1)
> (ROW, COL3, T=1)
>
> 3) Now we write our fresh set of columns at *T=3* - it should get inserted
> above the delete:
>
> (ROW, COL1, T=3)
> (ROW, COL2, T=3)
> (ROW, COL3, T=3)
> (ROW, <DELETE>, T=2)
> (ROW, COL1, T=1)
> (ROW, COL2, T=1)
> (ROW, COL3, T=1)
>
> This is the ideal scenario for the data to be correctly reflected:
>
> (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *>* (ROW, COL1, T=1), and hence
> *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
>
> But we also know that KeyValues compare first by row, then by column, and
> then by timestamp in reverse order:
>
> *(ROW, COL2, T=3) < (ROW, COL1, T=1)*
>
> This seems to be contradicting, and my main worry is that in a skip list it
> is quite possible for skipping to happen as you go through the high-level
> express lanes, and it could be possible for one of these entries to never
> actually even see the delete marker. For example, consider the case above
> where entry #1 and entry #5 form the higher level of the skip list and the
> skip list has 2 levels.
> Now someone tries to insert (ROW, COL4, T=3), and it could end up in the
> wrong location.
>
> Obviously, if we cleanse all the row contents when we get a ROW level delete
> marker, we are fine, but I want to know if that is the case. If we are not
> really cleansing all the row contents when we get a ROW level delete
> marker, then I want to know why the above scenario cannot lead to bugs :)
>
> Varun
>
> On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
> <[email protected]> wrote:
>
>> Varun,
>>
>> There is no need to open a new JIRA - there are two already:
>> https://issues.apache.org/jira/browse/HBASE-9769
>> https://issues.apache.org/jira/browse/HBASE-9778
>>
>> Both with patches, you can grab and test them.
>>
>> -Vladimir
>>
>> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <[email protected]> wrote:
>>
>>> Hi Lars,
>>>
>>> Thanks for the background. It seems that for our case, we will have to
>>> consider some solution like the Facebook one, since the next column is
>>> always the next one - this can be a simple flag. I am going to raise a
>>> JIRA and we can discuss there.
>>>
>>> Thanks
>>> Varun
>>>
>>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <[email protected]> wrote:
>>>
>>>> This is somewhat of a known issue, and I'm sure Vladimir will chime in
>>>> soon. :)
>>>>
>>>> Reseek is expensive compared to next() if next() would get us the KV
>>>> we're looking for. However, HBase does not know that ahead of time. There
>>>> might be 1000 versions of the previous KV to be skipped first.
>>>> HBase seeks in three situations:
>>>> 1. Seek to the next column (there might be a lot of versions to skip)
>>>> 2. Seek to the next row (there might be a lot of versions and other
>>>> columns to skip)
>>>> 3. Seek to a row via a hint
>>>>
>>>> #3 is definitely useful; with it one can implement very efficient "skip
>>>> scans" (see the FuzzyRowFilter and what Phoenix is doing).
>>>> #2 is helpful if there are many columns and one only "selects" a few
>>>> (and of course also if there are many versions of columns).
>>>> #1 is only helpful when we expect there to be many versions, or if the
>>>> size of a typical KV approaches the block size, since then we'd need to
>>>> seek to find the next block anyway.
>>>>
>>>> You might well be a victim of #1. Are your rows 10-20 columns, or is
>>>> that just the number of columns you return?
>>>>
>>>> Vladimir and myself have suggested a SMALL_ROW hint, where we instruct
>>>> the scanner to not seek to the next column or the next row, but just
>>>> issue next()'s until the KV is found. Another suggested approach (I think
>>>> by the Facebook guys) was to issue next() opportunistically a few times,
>>>> and only when that did not get us to the requested KV issue a reseek.
>>>> I have also thought of a near/far designation of seeks. For near seeks
>>>> we'd do a configurable number of next()'s first, then seek.
>>>> "Near" seeks would be those of category #1 (and maybe #2) above.
>>>>
>>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (and HBASE-9915, maybe).
>>>>
>>>> I'll look at the trace a bit closer.
>>>> So far my scan profiling has been focused on data in the blockcache,
>>>> since in the normal case the vast majority of all data is found there
>>>> and only recent changes are in the memstore.
>>>>
>>>> -- Lars
>>>>
>>>> ________________________________
>>>> From: Varun Sharma <[email protected]>
>>>> To: "[email protected]" <[email protected]>;
>>>> "[email protected]" <[email protected]>
>>>> Sent: Sunday, January 26, 2014 1:14 PM
>>>> Subject: Sporadic memstore slowness for Read Heavy workloads
>>>>
>>>> Hi,
>>>>
>>>> We are seeing some unfortunately low performance in the memstore - we
>>>> have researched some of the previous JIRA(s) and seen some
>>>> inefficiencies in the ConcurrentSkipListMap.
>>>> The symptom is a RegionServer hitting 100% CPU at weird points in time -
>>>> the bug is hard to reproduce, and there isn't a huge # of extra reads
>>>> going to that region server or any substantial hotspot happening. The
>>>> region server recovers the moment we flush the memstores or restart the
>>>> region server. Our queries retrieve wide rows of up to 10-20 columns.
>>>> A stack trace shows two things:
>>>>
>>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
>>>> ConcurrentSkipListMap
>>>> 2) The reseek() is being called at the "SEEK_NEXT" column inside
>>>> StoreScanner - this is understandable since the rows contain many
>>>> columns and StoreScanner iterates one KeyValue at a time.
>>>>
>>>> So, I was looking at the code, and it seems that every single time there
>>>> is a reseek call on the same memstore scanner, we make a fresh call to
>>>> build an iterator() on the skip list set - this means an additional skip
>>>> list lookup for every column retrieved. Skip list lookups are O(log n),
>>>> not O(1).
>>>>
>>>> Related JIRA HBASE-3855 made the reseek() scan some KVs and, if that
>>>> number is exceeded, do a lookup. However, it seems this behaviour was
>>>> reverted by HBASE-4195, and every next row/next column is now a reseek()
>>>> and a skip list lookup rather than an iterator step.
>>>>
>>>> Are there any strong reasons against having the previous behaviour of
>>>> scanning a small # of keys before degenerating to a skip list lookup?
>>>> Seems like it would really help for sequential memstore scans and for
>>>> memstore gets on wide tables (even 10-20 columns).
>>>>
>>>> Thanks
>>>> Varun
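[Editor's note] The "issue a few next()'s before falling back to a seek" idea discussed in this thread (the HBASE-3855-style behaviour, and lars's "near seek" suggestion) can be sketched schematically. The class and method names below are invented for illustration; this is not HBase's MemStoreScanner, just a toy scanner over a ConcurrentSkipListSet of string keys:

```java
import java.util.Iterator;
import java.util.NavigableSet;
import java.util.concurrent.ConcurrentSkipListSet;

// Schematic model of the opportunistic-next heuristic: before paying for
// a fresh O(log n) skip list lookup on every reseek, advance the existing
// iterator a few O(1) steps in case the target key is nearby. Names are
// invented; this is NOT HBase's MemStoreScanner.
class ToyMemstoreScanner {
    static final int MAX_NEXTS_BEFORE_SEEK = 8;  // tunable threshold

    private final NavigableSet<String> kvs;
    private Iterator<String> it;
    private String current;

    ToyMemstoreScanner(NavigableSet<String> kvs) {
        this.kvs = kvs;
        this.it = kvs.iterator();
        this.current = it.hasNext() ? it.next() : null;
    }

    String peek() { return current; }

    /** Position the scanner at the first key >= target. */
    void reseek(String target) {
        // Fast path: a handful of next() calls, each O(1).
        for (int i = 0; i < MAX_NEXTS_BEFORE_SEEK; i++) {
            if (current == null || current.compareTo(target) >= 0) return;
            current = it.hasNext() ? it.next() : null;
        }
        // Slow path: rebuild the iterator with a real skip list lookup,
        // which is the cost an unconditional reseek() pays every time.
        it = kvs.tailSet(target, true).iterator();
        current = it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        NavigableSet<String> kvs = new ConcurrentSkipListSet<>();
        for (int i = 1; i <= 20; i++) kvs.add(String.format("ROW/COL%02d", i));
        ToyMemstoreScanner scanner = new ToyMemstoreScanner(kvs);
        scanner.reseek("ROW/COL03");   // reached via next()s, no lookup needed
        scanner.reseek("ROW/COL15");   // too far ahead: falls back to tailSet()
        System.out.println(scanner.peek());  // prints ROW/COL15
    }
}
```

With rows of 10-20 columns, most SEEK_NEXT_COL transitions would land within a few next() calls and skip the O(log n) lookup entirely; only genuinely "far" seeks (many versions, or a different block) would take the slow path.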
