What's the TTL setting for your table?

Which HBase release are you using?

Was there a compaction between the scans?
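
To show why TTL matters here, below is a minimal, self-contained sketch. It does not use the HBase client API; the class and method names (TtlVisibility, isVisible, countVisible) and the sample timestamps are made up for illustration, and the exact expiry boundary is an assumption. The idea is that a cell is returned only if its timestamp falls in the scan's time range [minStamp, maxStamp) and it has not yet aged past the column family's TTL at the moment of the scan, so the same fixed-range scan can return fewer rows when run later.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TtlVisibility {

    // Simplified model (assumption): a cell is returned only if its timestamp
    // lies in [minStamp, maxStamp) and its age at scan time does not exceed the TTL.
    public static boolean isVisible(long cellTs, long minStamp, long maxStamp,
                                    long ttlMillis, long scanTimeMillis) {
        boolean inRange = cellTs >= minStamp && cellTs < maxStamp;
        boolean expired = scanTimeMillis - cellTs > ttlMillis;
        return inRange && !expired;
    }

    // Counts the rows whose single cell is visible to a scan run at scanTimeMillis.
    public static int countVisible(Map<String, Long> cellTimestamps, long minStamp,
                                   long maxStamp, long ttlMillis, long scanTimeMillis) {
        int count = 0;
        for (long ts : cellTimestamps.values()) {
            if (isVisible(ts, minStamp, maxStamp, ttlMillis, scanTimeMillis)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical rows: row key -> timestamp of the one cell being checked.
        Map<String, Long> cells = new LinkedHashMap<>();
        cells.put("r1", 1000L);
        cells.put("r2", 2000L);
        cells.put("r3", 3000L);

        long minStamp = 1000L, maxStamp = 4000L, ttl = 5000L;

        // Same time range every time; only the moment of the scan changes.
        System.out.println(countVisible(cells, minStamp, maxStamp, ttl, 5500L)); // prints 3
        System.out.println(countVisible(cells, minStamp, maxStamp, ttl, 6500L)); // prints 2
        System.out.println(countVisible(cells, minStamp, maxStamp, ttl, 7500L)); // prints 1
    }
}
```

Note that in this model TTL alone can only make the count shrink over time. It would not account for rows that disappear and then come back, which is why the release and compaction questions matter as well.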

Thanks


> On Feb 24, 2015, at 2:32 PM, Stephen Durfey <[email protected]> wrote:
> 
> I have some code that accepts a time range and looks for data written to an 
> HBase table during that range. If anything has been written for that row 
> during that range, the row key is saved off, and sometime later in the 
> pipeline those row keys are used to extract the entire row. I’m testing 
> against a fixed time range, at some point in the past. This is being done as 
> part of a Map/Reduce job (using Apache Crunch). I have some job counters 
> set up to keep track of the number of rows extracted. Since the time range is 
> fixed, I would expect the scan to return the same number of rows with data in 
> the provided time range. However, I am seeing this number vary from scan to 
> scan (bouncing between increasing and decreasing). 
> 
> I’ve eliminated the possibility that data is being pulled in from outside the 
> time range. I did this by scanning for one column qualifier (using only 
> that qualifier to decide whether a row had data in the time range), getting 
> the cell timestamp for each returned row, and comparing it against the begin 
> and end times of the scan. I didn’t find any timestamps that fell outside 
> the range. I’ve observed some row keys show up in the 1st scan, then drop out 
> in the 2nd scan, only to show back up again in the 3rd scan (all with the 
> exact same Scan object). These numbers have varied wildly, from being off by 
> 2-3 between subsequent scans to 40 row increases, followed by a drop of 70 
> rows. 
> 
> I’m kind of looking for ideas to try to track down what could be causing this 
> to happen. The code itself is pretty simple: it creates a Scan object, scans 
> the table, extracts the row key in the map phase, and at the end dumps the 
> row keys to a directory in HDFS.
