It seems like the internal logic for handling a time range has two parts. First, as you said, each file records the minimum and maximum timestamps contained within. This provides a very rough filter for the data, but if your data is laid out right, the effect can be huge. Second, a time range acts as a simple filter during a scan: while looking for the next row to return, it checks whether the row's timestamp is within the time range, returns the row if it is, and continues to the next row if it isn't.
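
To make that guess concrete, here is a minimal, self-contained sketch of the two stages as I understand them. It is not HBase code: `Row`, `StoreFile`, `ScanResult`, `scan` and `fileWithTimestamps` are hypothetical names invented for illustration, and the half-open range [start, end) is an assumption on my part.

```java
import java.util.ArrayList;
import java.util.List;

public class TimeRangeScanSketch {

    /** Hypothetical stand-in for a stored row: a row key plus its timestamp. */
    public static final class Row {
        public final String key;
        public final long timestamp;

        public Row(String key, long timestamp) {
            this.key = key;
            this.timestamp = timestamp;
        }
    }

    /** Hypothetical stand-in for a file that remembers the min/max timestamp it holds. */
    public static final class StoreFile {
        public final List<Row> rows;
        public final long minTs;
        public final long maxTs;

        public StoreFile(List<Row> rows) {
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;
            for (Row r : rows) {
                min = Math.min(min, r.timestamp);
                max = Math.max(max, r.timestamp);
            }
            this.rows = rows;
            this.minTs = min;
            this.maxTs = max;
        }

        /** Stage 1: does this file's [minTs, maxTs] overlap the query range [start, end)? */
        public boolean overlaps(long start, long end) {
            return minTs < end && maxTs >= start;
        }
    }

    /** Counters that show how much work each stage saved or spent. */
    public static final class ScanResult {
        public final List<Row> matches = new ArrayList<>();
        public int filesSkipped = 0;
        public int rowsTested = 0;
    }

    public static ScanResult scan(List<StoreFile> files, long start, long end) {
        ScanResult result = new ScanResult();
        for (StoreFile file : files) {
            // Stage 1: rough filter, skip the whole file when the ranges cannot overlap.
            if (!file.overlaps(start, end)) {
                result.filesSkipped++;
                continue;
            }
            // Stage 2: every row in a surviving file is tested individually (no reseek).
            for (Row row : file.rows) {
                result.rowsTested++;
                if (row.timestamp >= start && row.timestamp < end) {
                    result.matches.add(row);
                }
            }
        }
        return result;
    }

    /** Builds a sample file with one row per timestamp (keys are made up). */
    public static StoreFile fileWithTimestamps(String keyPrefix, long... timestamps) {
        List<Row> rows = new ArrayList<>();
        for (long ts : timestamps) {
            rows.add(new Row(keyPrefix + "-" + ts, ts));
        }
        return new StoreFile(rows);
    }

    public static void main(String[] args) {
        List<StoreFile> files = new ArrayList<>();
        files.add(fileWithTimestamps("a", 100, 101, 102, 103, 104));
        files.add(fileWithTimestamps("b", 200, 201, 202, 203, 204));

        ScanResult r = scan(files, 202, 204);
        System.out.println("files skipped: " + r.filesSkipped);
        System.out.println("rows tested:   " + r.rowsTested);
        System.out.println("rows returned: " + r.matches.size());
    }
}
```

With the sample data in `main`, the first file is skipped outright, but all five rows of the second file are still tested in order to return two. That per-row testing is the cost I'm describing.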

What it *doesn't* appear to do, however, is reseek to the row with the minimum timestamp. Since my row key also contains a copy of the timestamp, a reseek is able to bypass a lot of rows that the generic logic would test individually. Perhaps HBase itself could be made to work this way, but I'm unsure enough of its internal workings that I can't say for sure.

(The above is my best guess; let me know if something about that explanation doesn't smell right.)

--Tom

On Wed, Sep 12, 2012 at 4:08 PM, n keywal <[email protected]> wrote:
> For each file; there is a time range. When you scan/search, the file is
> skipped if there is no overlap between the file timerange and the timerange
> of the query. As there are other parameters as well (row distribution,
> compaction effects, cache, bloom filters, ...) it's difficult to know in
> advance what's going to happen exactly. But specifying a timerange does no
> harm for sure, if it matches your functional needs...
>
> This said, if you already have the rowkey, the time range is less
> interesting as you will skip a lot of file already.
>
> On Wed, Sep 12, 2012 at 11:52 PM, Tom Brown <[email protected]> wrote:
>
>> When I query HBase, I always include a time range. This has not been a
>> problem when querying recent data, but it seems to be an issue when I
>> query older data (a few hours old). All of my row keys include the
>> timestamp as part of the key (this value is the same as the HBase
>> timestamp for the row). I recently tried an experiment where I
>> manually re-seek to the possible row (based on the timestamp as part
>> of the row key) instead of using "setTimeRange" on my scan object and
>> was amazed to see that there was no degradation for older data.
>>
>> Can someone postulate a theory as to why this might be happening? I'm
>> happy to provide extra data if it will help you theorize...
>>
>> Is there a downside to stopping using "setTimeRange"?
>>
>> --Tom
>>
