What's the TTL setting for your table? Which HBase release are you using?
Was there a compaction in between the scans?

Thanks

> On Feb 24, 2015, at 2:32 PM, Stephen Durfey <[email protected]> wrote:
>
> I have some code that accepts a time range and looks for data written to an
> HBase table during that range. If anything has been written for that row
> during that range, the row key is saved off, and sometime later in the
> pipeline those row keys are used to extract the entire row. I’m testing
> against a fixed time range, at some point in the past. This is being done as
> part of a Map/Reduce job (using Apache Crunch). I have some job counters
> set up to keep track of the number of rows extracted. Since the time range is
> fixed, I would expect the scan to return the same number of rows with data in
> the provided time range. However, I am seeing this number vary from scan to
> scan (bouncing between increasing and decreasing).
>
> I’ve eliminated the possibility that data is being pulled in from outside the
> time range. I did this by scanning for one column qualifier (and only using
> this as the qualifier for if a row had data in the time range), getting the
> timestamp on the cell for each returned row and comparing it against the begin
> and end times for the scan, and I didn’t find any that fell outside that
> range. I’ve observed some row keys show up in the 1st scan, then drop out
> in the 2nd scan, only to show back up again in the 3rd scan (all with the
> exact same Scan object). These numbers have varied wildly, from being off by
> 2-3 between subsequent scans to 40 row increases, followed by a drop of 70
> rows.
>
> I’m kind of looking for ideas to try to track down what could be causing this
> to happen. The code itself is pretty simple: it creates a Scan object, scans
> the table, and then in the map phase, extracts the row key, and at the
> end, it dumps them to a directory in HDFS.
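
Since the row keys from each run are already dumped to a directory, one way to narrow this down might be to diff the key sets of two runs and then inspect only the rows that appear or disappear. The following is a minimal sketch of that comparison in plain Java. The class and method names (`ScanDiff`, `missingFrom`) are hypothetical, the sample keys are made up, and reading the keys back out of the job output is left out.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: compares the row keys emitted by two runs of the same scan. */
public class ScanDiff {

    /** Returns the row keys present in {@code first} but absent from {@code second}, sorted. */
    public static Set<String> missingFrom(Set<String> first, Set<String> second) {
        Set<String> result = new TreeSet<>(first); // copy so the inputs are not modified
        result.removeAll(second);
        return result;
    }

    public static void main(String[] args) {
        // Made-up row keys standing in for the output of two runs.
        Set<String> run1 = new TreeSet<>(Arrays.asList("row-a", "row-b", "row-c"));
        Set<String> run2 = new TreeSet<>(Arrays.asList("row-a", "row-c", "row-d"));

        System.out.println("dropped in run 2: " + missingFrom(run1, run2)); // [row-b]
        System.out.println("new in run 2:     " + missingFrom(run2, run1)); // [row-d]
    }
}
```

The keys that flip between runs are the ones worth checking individually, for example by looking at the cell timestamps of those rows directly.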
