Sorry Kristoffer, but I believe my previous statement was mistaken. I
cannot find a location where the timestamp is taken into account at the
StoreFile level. I thought my earlier statement (quoted below) about
metadata from the HFile headers was correct, but I cannot locate the code
that takes such information into consideration. You can start at
org.apache.hadoop.hbase.regionserver.StoreScanner and work your way down,
or start from the o.a.h.h.regionserver.StoreFileManager implementations
(currently there are two: DefaultStoreFileManager and
StripeStoreFileManager) and work your way back. The closest thing we (may)
have is the
StripeStoreFileManager implementation, which creates "mini regions" within
a single region. Even there, the stripes are arranged by row key (i.e.,
Scan#getStartRow()), not by key-value key.

I think we have no optimizations at the HFile level for the timestamp
limits of a query. This means, to answer your original question, that
absent a start row and end row on your scanner, you will be consuming the
entire table. That was a long way of explaining it, but HBase does not
index by cell version (cells are ordered by it, since the timestamp comes
at the end of a key-value's key, but it is not indexed). So if you want to
model time in your schema, it's best to promote it to an indexed field --
i.e., make it a component of your row key.
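
To make the row key suggestion concrete, here is a minimal sketch in plain
Java (no HBase dependency). The key layout, the class name TimeRowKeys, and
the "sensor-1" prefix are all assumptions made up for illustration, not
anything from the HBase code base. It builds keys of the form
<prefix><8-byte big-endian timestamp> and shows that a time window then
becomes a contiguous range under unsigned lexicographic byte order, which is
how HBase sorts row keys.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: not part of HBase, just an illustration.
final class TimeRowKeys {

    // Assumed layout: <prefix bytes><8-byte big-endian timestamp>.
    // Only non-negative timestamps sort correctly with this encoding.
    static byte[] rowKey(String prefix, long timestampMillis) {
        byte[] p = prefix.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(p.length + Long.BYTES)
                .put(p)
                .putLong(timestampMillis) // ByteBuffer is big-endian by default
                .array();
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // True if start <= row < stop (start inclusive, stop exclusive).
    static boolean inRange(byte[] row, byte[] start, byte[] stop) {
        return compare(row, start) >= 0 && compare(row, stop) < 0;
    }

    public static void main(String[] args) {
        // Window [1000, 2000) for one hypothetical prefix.
        byte[] start = rowKey("sensor-1", 1000L);
        byte[] stop = rowKey("sensor-1", 2000L);

        System.out.println(inRange(rowKey("sensor-1", 1500L), start, stop));
        System.out.println(inRange(rowKey("sensor-1", 2000L), start, stop));
    }
}
```

With keys like these, the two byte arrays would be what you hand to the
scanner as start row and stop row, so only the matching slice of the table
is read instead of every region.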

-n

On Mon, Mar 2, 2015 at 12:42 AM, Kristoffer Sjögren <[email protected]>
wrote:

> Thanks, great explanation!
>
> Forgive my laziness, but do you happen to know what part(s) of the code
> base to look into for even more details?
>
> On Sun, Mar 1, 2015 at 9:38 PM, Jean-Marc Spaggiari <[email protected]>
> wrote:
>
> > I was going to say something similar. But as soon as you have a major
> > compaction you end up with a single file with everything in it. So
> > depending on your key distribution you might still read everything. If
> > you read just the last few minutes over a huge table, then yes, the
> > skip will help. Otherwise, I'm not sure it will help that much :(
> >
> > 2015-02-28 18:25 GMT-05:00 Nick Dimiduk <[email protected]>:
> >
> > > A Scan without start and end rows will be issued to all regions in the
> > > table -- a full table scan. Within each region, store files will be
> > > selected to participate in the scan based on the min/max timestamps
> > > from their headers.
> > >
> > > On Saturday, February 28, 2015, Kristoffer Sjögren <[email protected]>
> > > wrote:
> > >
> > > > If Scan.setTimeRange is a full table scan then it runs surprisingly
> > > > fast on tables that host a few hundred million rows :-)
> > > >
> > > >
> > > >
> > > > On Sat, Feb 28, 2015 at 8:05 PM, Kristoffer Sjögren <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi Jean-Marc
> > > > >
> > > > > I was thinking of Scan.setTimeRange to only get the x latest rows,
> > > > > but I would like to avoid a full table scan.
> > > > >
> > > > > The alternative would be to set the timestamp in the key and use
> > > > > start and stop keys. But since HBase is already aware of timestamps
> > > > > I thought it might optimize Scan.setTimeRange scans?
> > > > >
> > > > > Cheers,
> > > > > -Kristoffer
> > > > >
> > > > > On Sat, Feb 28, 2015 at 7:45 PM, Jean-Marc Spaggiari <
> > > > > > [email protected]> wrote:
> > > > >
> > > > >> Hi Kristoffer,
> > > > >>
> > > > > >> What do you mean by "timerange scans"? If you want to scan
> > > > > >> everything from your table, you will always end up with a full
> > > > > >> table scan, no?
> > > > >>
> > > > >> JM
> > > > >>
> > > > >> 2015-02-28 13:41 GMT-05:00 Kristoffer Sjögren <[email protected]
> > > > <javascript:;>>:
> > > > >>
> > > > >> > Hi
> > > > >> >
> > > > > >> > I want to understand the effectiveness of timerange scans
> > > > > >> > without setting start and stop keys. Will HBase do a full table
> > > > > >> > scan or will the scan be optimized in any way?
> > > > >> >
> > > > >> > Cheers,
> > > > >> > -Kristoffer
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>
