Hi Lars,

Thanks for the background. It seems that for our case we will have to consider a solution like the Facebook one, since for us the next column is always the very next KV - this can be a simple flag. I am going to raise a JIRA and we can discuss there.
Thanks
Varun

On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <[email protected]> wrote:

> This is somewhat of a known issue, and I'm sure Vladimir will chime in soon. :)
>
> Reseek is expensive compared to next if next would get us the KV we're looking for. However, HBase does not know that ahead of time. There might be a 1000 versions of the previous KV to be skipped first.
> HBase seeks in three situations:
> 1. Seek to the next column (there might be a lot of versions to skip)
> 2. Seek to the next row (there might be a lot of versions and other columns to skip)
> 3. Seek to a row via a hint
>
> #3 is definitely useful; with that one can implement very efficient "skip scans" (see the FuzzyRowFilter and what Phoenix is doing).
> #2 is helpful if there are many columns and one only "selects" a few (and of course also if there are many versions of columns).
> #1 is only helpful when we expect there to be many versions, or if the size of a typical KV approaches the block size, since then we'd need to seek to find the next block anyway.
>
> You might well be a victim of #1. Are your rows 10-20 columns, or is that just the number of columns you return?
>
> Vladimir and I have suggested a SMALL_ROW hint, where we instruct the scanner to not seek to the next column or the next row, but just issue next()'s until the KV is found. Another suggested approach (I think by the Facebook guys) was to issue next() opportunistically a few times, and only when that did not get us to the requested KV issue a reseek.
> I have also thought of a near/far designation of seeks. For near seeks we'd do a configurable number of next()'s first, then seek.
> "Near" seeks would be those of category #1 (and maybe #2) above.
>
> See: HBASE-9769, HBASE-9778, HBASE-9000 (and maybe HBASE-9915)
>
> I'll look at the trace a bit closer.
> So far my scan profiling has been focused on data in the blockcache, since in the normal case the vast majority of all data is found there and only recent changes are in the memstore.
>
> -- Lars
>
> ________________________________
> From: Varun Sharma <[email protected]>
> To: "[email protected]" <[email protected]>; "[email protected]" <[email protected]>
> Sent: Sunday, January 26, 2014 1:14 PM
> Subject: Sporadic memstore slowness for Read Heavy workloads
>
> Hi,
>
> We are seeing some unfortunately low performance in the memstore - we have researched some of the previous JIRAs and seen some inefficiencies in the ConcurrentSkipListMap. The symptom is a RegionServer hitting 100% CPU at weird points in time - the bug is hard to reproduce and there isn't a huge # of extra reads going to that region server or any substantial hotspot happening. The region server recovers the moment we flush the memstores or restart the region server. Our queries retrieve wide rows which are up to 10-20 columns. A stack trace shows two things:
>
> 1) Time spent inside MemstoreScanner.reseek() and inside the ConcurrentSkipListMap
> 2) The reseek() is being called at the "SEEK_NEXT" column inside StoreScanner - this is understandable since the rows contain many columns and StoreScanner iterates one KeyValue at a time.
>
> So, I was looking at the code and it seems that every single time there is a reseek call on the same memstore scanner, we make a fresh call to build an iterator() on the skip list set - this means we do an additional skip list lookup for every column retrieved. Skip list lookups are O(log n), not O(1).
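To make the cost concrete, here is a tiny standalone sketch - a plain java.util.concurrent.ConcurrentSkipListMap with string keys, not the actual MemStoreScanner code - contrasting the two access patterns: rebuilding a tailMap iterator for every column (one O(log n) lookup each time) versus plain next() calls on a single iterator when the wanted KV really is the next one.

import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Standalone illustration, not HBase code.
public class SeekVsNextSketch {

    public static void main(String[] args) {
        ConcurrentSkipListMap<String, String> memstore = new ConcurrentSkipListMap<>();
        // Simulate one wide row: row1/col000 .. row1/col019, one version each.
        for (int c = 0; c < 20; c++) {
            memstore.put(String.format("row1/col%03d", c), "v");
        }

        // Pattern A: a fresh tailMap().iterator() per column, i.e. one
        // O(log n) skip list lookup for every column retrieved.
        for (int c = 0; c < 20; c++) {
            String seekKey = String.format("row1/col%03d", c);
            Iterator<Map.Entry<String, String>> fresh =
                memstore.tailMap(seekKey, true).entrySet().iterator();
            System.out.println("reseek -> " + fresh.next().getKey());
        }

        // Pattern B: one lookup, then plain next() calls on the same
        // iterator; works here because the next wanted KV is always the
        // immediately following entry.
        Iterator<Map.Entry<String, String>> it =
            memstore.tailMap("row1/col000", true).entrySet().iterator();
        while (it.hasNext()) {
            System.out.println("next   -> " + it.next().getKey());
        }
    }
}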
> Related JIRA HBASE-3855 made the reseek() scan some KVs and, if that number is exceeded, do a lookup. However, it seems this behaviour was reverted by HBASE-4195, and every next row/next column is now a reseek() and a skip list lookup rather than an iterator next().
>
> Are there any strong reasons against having the previous behaviour of scanning a small # of keys before degenerating to a skip list lookup? Seems like it would really help for sequential memstore scans and for memstore gets with wide tables (even 10-20 columns).
>
> Thanks
> Varun
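For the JIRA, this is roughly the shape of what I have in mind - a sketch only, with made-up names (BoundedNextSketch, MAX_NEXTS_BEFORE_SEEK) and plain string keys instead of KeyValues, not HBase code: try a bounded number of next() calls on the existing iterator and only fall back to a real skip list lookup when the target key is not reached.

import java.util.Iterator;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustration only: bounded next() before falling back to a skip list lookup.
public class BoundedNextSketch {
    // Hypothetical limit for illustration; not necessarily the value HBASE-3855 used.
    private static final int MAX_NEXTS_BEFORE_SEEK = 8;

    private final ConcurrentSkipListMap<String, String> store;
    private Iterator<String> iter;
    private String current;

    BoundedNextSketch(ConcurrentSkipListMap<String, String> store) {
        this.store = store;
        this.iter = store.keySet().iterator();
        this.current = iter.hasNext() ? iter.next() : null;
    }

    // Position at the first key >= target, preferring cheap next() calls.
    String reseek(String target) {
        for (int i = 0; i < MAX_NEXTS_BEFORE_SEEK; i++) {
            if (current == null || current.compareTo(target) >= 0) {
                return current; // reached (or passed) the target via next()
            }
            current = iter.hasNext() ? iter.next() : null;
        }
        // Target is far away: do one real O(log n) lookup and rebuild the iterator.
        iter = store.tailMap(target, true).keySet().iterator();
        current = iter.hasNext() ? iter.next() : null;
        return current;
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap<String, String> store = new ConcurrentSkipListMap<>();
        for (int c = 0; c < 20; c++) {
            store.put(String.format("row1/col%03d", c), "v");
        }
        BoundedNextSketch scanner = new BoundedNextSketch(store);
        // Each "reseek" to the next column is satisfied by a single next().
        for (int c = 0; c < 20; c++) {
            System.out.println(scanner.reseek(String.format("row1/col%03d", c)));
        }
    }
}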
