Wow, thanks again for the deep analysis. I may have to reconsider my initial design then. I've always wanted to understand more about HBase internals and this may be a good place to start digging.

Cheers,
-Kristoffer

On Mon, Mar 2, 2015 at 6:24 PM, Nick Dimiduk <[email protected]> wrote:

> Sorry Kristoffer, but I believe my previous statement was mistaken. I
> cannot find a location where the timestamp is taken into account at the
> StoreFile level. I thought the above statement about metadata from the
> HFile headers was correct, but I cannot locate the code that takes such
> information into consideration. You can start at
> org.apache.hadoop.hbase.regionserver.StoreScanner and work your way down;
> or from the o.a.h.h.regionserver.StoreFileManager implementation (currently
> there exist two: DefaultStoreFileManager and StripeStoreFileManager) and
> work your way back. The closest thing we (may) have is the
> StripeStoreFileManager implementation, which creates "mini regions" within
> a single region. Even there, the stripes are arranged by row key (i.e.,
> Scan#getStartRow()), not by key-value key.
>
> I think we have no optimizations at the HFile level for the timestamp
> limits of a query. Which means, to answer your original question, absent a
> start row and end row on your scanner, you will be consuming the entire
> table. A long way of explaining, but HBase does not index by cell version
> (it sorts by version at the end of a key-value's key, but does not index
> it), so if you want to model time in your schema, it's best to promote it
> to an indexed field -- i.e., make it a component of your row key.
>
> -n
>
> On Mon, Mar 2, 2015 at 12:42 AM, Kristoffer Sjögren <[email protected]>
> wrote:
>
> > Thanks, great explanation!
> >
> > Forgive my laziness, but do you happen to know what part(s) of the code
> > base to look into for even more details?
> >
> > On Sun, Mar 1, 2015 at 9:38 PM, Jean-Marc Spaggiari <
> > [email protected]> wrote:
> >
> > > I was going to say something similar. But as soon as you have a major
> > > compaction you end up with a single file and everything in it. So
> > > depending on your key distribution you might still read everything. If
> > > you read just the last few minutes over a huge table, then yes, skip
> > > will help. Else, I'm not sure it will help that much :(
> > >
> > > 2015-02-28 18:25 GMT-05:00 Nick Dimiduk <[email protected]>:
> > >
> > > > A Scan without start and end rows will be issued to all regions in the
> > > > table -- a full table scan. Within each region, store files will be
> > > > selected to participate in the scan based on the min/max timestamps
> > > > from their headers.
> > > >
> > > > On Saturday, February 28, 2015, Kristoffer Sjögren <[email protected]>
> > > > wrote:
> > > >
> > > > > If Scan.setTimeRange is a full table scan then it runs surprisingly
> > > > > fast on tables that host a few hundred million rows :-)
> > > > >
> > > > > On Sat, Feb 28, 2015 at 8:05 PM, Kristoffer Sjögren <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hi Jean-Marc
> > > > > >
> > > > > > I was thinking of Scan.setTimeRange to only get the x latest rows,
> > > > > > but I would like to avoid a full table scan.
> > > > > >
> > > > > > The alternative would be to set the timestamp in the key and use
> > > > > > start and stop key. But since HBase already is aware of timestamps
> > > > > > I thought it might optimize Scan.setTimeRange scans?
> > > > > >
> > > > > > Cheers,
> > > > > > -Kristoffer
> > > > > >
> > > > > > On Sat, Feb 28, 2015 at 7:45 PM, Jean-Marc Spaggiari <
> > > > > > [email protected]> wrote:
> > > > > >
> > > > > >> Hi Kristoffer,
> > > > > >>
> > > > > >> What do you mean by "timerange scans"? If you want to scan
> > > > > >> everything from your table, you will always end up with a full
> > > > > >> table scan, no?
> > > > > >>
> > > > > >> JM
> > > > > >>
> > > > > >> 2015-02-28 13:41 GMT-05:00 Kristoffer Sjögren <[email protected]>:
> > > > > >>
> > > > > >> > Hi
> > > > > >> >
> > > > > >> > I want to understand the effectiveness of timerange scans
> > > > > >> > without setting start and stop keys? Will HBase do a full table
> > > > > >> > scan or will the scan be optimized in any way?
> > > > > >> >
> > > > > >> > Cheers,
> > > > > >> > -Kristoffer
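
For anyone reading this thread later: the suggestion above to make time a component of the row key, and then bound the scan with start and stop rows, could look roughly like the sketch below. It uses only the JDK so it runs on its own. The class `TimeRowKey` and its methods are hypothetical names made up for illustration, not part of HBase, and the key layout (8-byte big-endian timestamp followed by an id) is just one possible choice. The only HBase behaviour it assumes is that row keys are compared as unsigned bytes in lexicographic order.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper: builds row keys that lead with the timestamp. */
final class TimeRowKey {
    private TimeRowKey() {}

    /** Row key layout (an assumption): 8-byte big-endian timestamp, then an id suffix. */
    static byte[] rowKey(long timestampMillis, byte[] id) {
        if (timestampMillis < 0) {
            // Negative values would sort after positive ones under unsigned comparison.
            throw new IllegalArgumentException("timestamp must be non-negative");
        }
        return ByteBuffer.allocate(Long.BYTES + id.length)
                .putLong(timestampMillis) // ByteBuffer writes big-endian by default
                .put(id)
                .array();
    }

    /** Inclusive start row for the time window [fromMillis, toMillis). */
    static byte[] startRow(long fromMillis) {
        return rowKey(fromMillis, new byte[0]);
    }

    /** Exclusive stop row for the time window [fromMillis, toMillis). */
    static byte[] stopRow(long toMillis) {
        return rowKey(toMillis, new byte[0]);
    }

    /** Unsigned lexicographic byte comparison, the ordering assumed for row keys. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** True if row falls in [start, stop), i.e. a bounded scan would visit it. */
    static boolean inRange(byte[] row, byte[] start, byte[] stop) {
        return compare(row, start) >= 0 && compare(row, stop) < 0;
    }
}
```

With keys like these, the two byte arrays from `startRow` and `stopRow` would be handed to the scanner (for example via `Scan.setStartRow` and `Scan.setStopRow` in the client API of that era) instead of relying on `Scan.setTimeRange` alone. One trade-off to keep in mind is that keys leading with a monotonically increasing timestamp tend to send all new writes to a single region, so a real schema may need a salt or bucket prefix in front of the timestamp.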
