Re: Performance of scan setTimeRange VS manually doing it

Tom Brown Wed, 12 Sep 2012 15:42:40 -0700

It seems like the internal logic for handling a time range has two
parts. First, as you said, each file records the minimum and maximum
timestamps contained within. This provides a very rough filter for the
data, but if your data is laid out right, the effect can be huge.
Second, a time range acts as a simple filter during a scan: while
looking for the next row to return, it checks whether the timestamp
for the row is within the time range, returns that row if it is, and
continues to the next row if it isn't.
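
To make the two parts concrete, here is a small toy model in Python.
It is only a sketch of the behaviour as I understand it, not HBase
code: StoreFile and scan_with_time_range are names I made up, and the
half-open [min_ts, max_ts) range mirrors how setTimeRange treats its
arguments.

```python
from dataclasses import dataclass


@dataclass
class StoreFile:
    """Hypothetical stand-in for one store file: (row_key, timestamp) pairs."""
    cells: list

    @property
    def min_ts(self):
        return min(ts for _, ts in self.cells)

    @property
    def max_ts(self):
        return max(ts for _, ts in self.cells)


def scan_with_time_range(files, min_ts, max_ts):
    """Return (matching cells, number of cells examined) for [min_ts, max_ts)."""
    examined = 0
    results = []
    for f in files:
        # Part 1: skip the whole file if its time range cannot overlap the query.
        if f.max_ts < min_ts or f.min_ts >= max_ts:
            continue
        # Part 2: inside an overlapping file, every cell is tested one by one.
        for row, ts in f.cells:
            examined += 1
            if min_ts <= ts < max_ts:
                results.append((row, ts))
    return results, examined


if __name__ == "__main__":
    old = StoreFile([("a", 1), ("b", 2)])
    new = StoreFile([("c", 10), ("d", 20)])
    print(scan_with_time_range([old, new], 10, 21))
```

In this model the file-level check is cheap, but once a file overlaps
the query range at all, every cell in it still gets examined.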


What it *doesn't* appear to do, however, is reseek to the row with the
minimum timestamp. Since my row key also contains a copy of the
timestamp, a reseek is able to bypass a lot of rows that the generic
logic would test individually. Perhaps HBase itself could be made to
work this way, but I'm unsure enough of its internal workings that I
can't say for sure.
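
Here is a rough Python sketch of why the reseek helps. It assumes,
purely for illustration, that the row key starts with a zero-padded
timestamp (make_key is a made-up helper; my real key layout is not
shown here) and that rows are kept sorted by key. The filter-only scan
tests every row, while the seek version jumps to the first key at or
after the minimum timestamp and stops at the maximum.

```python
import bisect


def make_key(ts, suffix):
    """Hypothetical row key: zero-padded timestamp prefix, then a suffix."""
    return f"{ts:010d}-{suffix}"


def scan_filter_only(rows, min_ts, max_ts):
    """Test every row's timestamp against [min_ts, max_ts)."""
    examined = 0
    hits = []
    for key, ts in rows:
        examined += 1
        if min_ts <= ts < max_ts:
            hits.append(key)
    return hits, examined


def scan_with_seek(rows, min_ts, max_ts):
    """Seek to the first key >= min_ts prefix, stop at the max_ts prefix."""
    keys = [key for key, _ in rows]
    start = bisect.bisect_left(keys, f"{min_ts:010d}")
    stop_key = f"{max_ts:010d}"
    examined = 0
    hits = []
    for key, _ in rows[start:]:
        if key >= stop_key:
            break
        examined += 1
        hits.append(key)
    return hits, examined


if __name__ == "__main__":
    # Rows sorted by key, one per timestamp 0..999.
    rows = [(make_key(t, "x"), t) for t in range(1000)]
    print(scan_filter_only(rows, 990, 1000)[1])
    print(scan_with_seek(rows, 990, 1000)[1])
```

Both functions return the same rows; the difference is only in how
many rows get examined along the way. This only works because the
timestamp in the key matches the cell timestamp, as it does in my
schema.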

(The above is my best guess; let me know if something about that
explanation doesn't smell right.)

--Tom

On Wed, Sep 12, 2012 at 4:08 PM, n keywal <[email protected]> wrote:
> For each file, there is a time range. When you scan/search, the file is
> skipped if there is no overlap between the file timerange and the timerange
> of the query. As there are other parameters as well (row distribution,
> compaction effects, cache, bloom filters, ...) it's difficult to know in
> advance what's going to happen exactly.  But specifying a timerange does no
> harm for sure, if it matches your functional needs...
>
> This said, if you already have the rowkey, the time range is less
> interesting as you will skip a lot of files already.
>
> On Wed, Sep 12, 2012 at 11:52 PM, Tom Brown <[email protected]> wrote:
>
>> When I query HBase, I always include a time range. This has not been a
>> problem when querying recent data, but it seems to be an issue when I
>> query older data (a few hours old). All of my row keys include the
>> timestamp as part of the key (this value is the same as the HBase
>> timestamp for the row).  I recently tried an experiment where I
>> manually re-seek to the possible row (based on the timestamp as part
>> of the row key) instead of using "setTimeRange" on my scan object and
>> was amazed to see that there was no degradation for older data.
>>
>> Can someone postulate a theory as to why this might be happening? I'm
>> happy to provide extra data if it will help you theorize...
>>
>> Is there a downside to stopping using "setTimeRange"?
>>
>> --Tom
>>
