Sorry Kristoffer, but I believe my previous statement was mistaken. I cannot find a location where the timestamp is taken into account at the StoreFile level. I thought my earlier statement about metadata from the HFile headers was correct, but I cannot locate the code that takes such information into consideration. You can start at org.apache.hadoop.hbase.regionserver.StoreScanner and work your way down, or start from the o.a.h.h.regionserver.StoreFileManager implementations (there are currently two: DefaultStoreFileManager and StripeStoreFileManager) and work your way back. The closest thing we (may) have is the StripeStoreFileManager implementation, which creates "mini regions" within a single region. Even there, the stripes are arranged by row key (i.e., Scan#getStartRow()), not by key-value key.

I think we have no optimizations at the HFile level for the timestamp limits of a query. Which means, to answer your original question, absent a start row and end row on your scanner, you will be consuming the entire table. A long way of explaining, but HBase does not index by cell version (the version is sorted on at the end of a key-value's key, but it is not indexed), so if you want to model time in your schema, it's best to promote it to an indexed field -- i.e., make it a component of your row key.

-n

On Mon, Mar 2, 2015 at 12:42 AM, Kristoffer Sjögren <[email protected]> wrote:

> Thanks, great explanation!
>
> Forgive my laziness, but do you happen to know what part(s) of the code
> base to look into for even more details?
>
> On Sun, Mar 1, 2015 at 9:38 PM, Jean-Marc Spaggiari <[email protected]>
> wrote:
>
> > I was going to say something similar. But as soon as you have a major
> > compaction you end up with a single file and everything in it. So
> > depending on your key distribution you might still read everything.
> > If you read just the last few minutes over a huge table, then yes,
> > skip will help. Else, I'm not sure it will help that much :(
> >
> > 2015-02-28 18:25 GMT-05:00 Nick Dimiduk <[email protected]>:
> >
> > > A Scan without start and end rows will be issued to all regions in
> > > the table -- a full table scan. Within each region, store files
> > > will be selected to participate in the scan based on the min/max
> > > timestamps from their headers.
> > >
> > > On Saturday, February 28, 2015, Kristoffer Sjögren <[email protected]>
> > > wrote:
> > >
> > > > If Scan.setTimeRange is a full table scan then it runs
> > > > surprisingly fast on tables that host a few hundred million
> > > > rows :-)
> > > >
> > > > On Sat, Feb 28, 2015 at 8:05 PM, Kristoffer Sjögren
> > > > <[email protected]> wrote:
> > > >
> > > > > Hi Jean-Marc
> > > > >
> > > > > I was thinking of Scan.setTimeRange to only get the x latest
> > > > > rows, but I would like to avoid a full table scan.
> > > > >
> > > > > The alternative would be to set the timestamp in the key and
> > > > > use start and stop keys. But since HBase is already aware of
> > > > > timestamps I thought it might optimize Scan.setTimeRange scans?
> > > > >
> > > > > Cheers,
> > > > > -Kristoffer
> > > > >
> > > > > On Sat, Feb 28, 2015 at 7:45 PM, Jean-Marc Spaggiari
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Hi Kristoffer,
> > > > > >
> > > > > > What do you mean by "timerange scans"? If you want to scan
> > > > > > everything from your table, you will always end up with a
> > > > > > full table scan, no?
> > > > > >
> > > > > > JM
> > > > > >
> > > > > > 2015-02-28 13:41 GMT-05:00 Kristoffer Sjögren
> > > > > > <[email protected]>:
> > > > > >
> > > > > > > Hi
> > > > > > >
> > > > > > > I want to understand the effectiveness of timerange scans
> > > > > > > without setting start and stop keys. Will HBase do a full
> > > > > > > table scan or will the scan be optimized in any way?
> > > > > > >
> > > > > > > Cheers,
> > > > > > > -Kristoffer
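
For illustration only, here is a minimal sketch of the "make time a component of your row key" suggestion from the reply at the top of this message. It does not use the HBase client; it assumes a hypothetical key layout that is just the 8-byte big-endian timestamp, and it uses a sorted in-memory map as a stand-in for a table whose rows are ordered by unsigned byte comparison. The class and method names (TimeRowKeys, rowKey, scan) are made up for this example.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical helper, not part of HBase.
class TimeRowKeys {

    /** Encodes a non-negative epoch-millis timestamp as 8 big-endian bytes,
     *  so that unsigned byte order matches numeric order. */
    public static byte[] rowKey(long timestampMillis) {
        if (timestampMillis < 0) {
            throw new IllegalArgumentException("negative timestamps sort incorrectly");
        }
        return ByteBuffer.allocate(Long.BYTES).putLong(timestampMillis).array();
    }

    /** Stand-in for a table: rows kept in unsigned lexicographic byte order. */
    public static NavigableMap<byte[], String> newTable() {
        Comparator<byte[]> unsignedOrder = Arrays::compareUnsigned;
        return new TreeMap<>(unsignedOrder);
    }

    /** Returns rows with minTs <= timestamp < maxTs by seeking to a start key
     *  and stopping at a stop key. With a real Scan, the same two byte arrays
     *  would be supplied as the start row and stop row. */
    public static NavigableMap<byte[], String> scan(
            NavigableMap<byte[], String> table, long minTs, long maxTs) {
        return table.subMap(rowKey(minTs), true, rowKey(maxTs), false);
    }

    public static void main(String[] args) {
        NavigableMap<byte[], String> table = newTable();
        for (long ts : new long[] {4000L, 1000L, 3000L, 2000L}) {
            table.put(rowKey(ts), "event@" + ts);
        }
        // Only the rows inside the key range are visited.
        System.out.println(new ArrayList<>(scan(table, 2000L, 4000L).values()));
        // prints [event@2000, event@3000]
    }
}
```

A key that leads with a monotonically increasing timestamp tends to send all new writes to a single region, so a real schema would usually put another component in front of the timestamp and scan one prefix at a time.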
