I have some code that accepts a time range and looks for data written to an 
HBase table during that range. If anything was written to a row during that 
range, the row key is saved, and later in the pipeline those row keys are used 
to extract the entire row. I’m testing against a fixed time range in the past. 
This runs as part of a Map/Reduce job (using Apache Crunch). I have some job 
counters set up to keep track of the number of rows extracted. Since the time 
range is fixed, I would expect every scan to return the same number of rows 
with data in the provided time range. However, I am seeing this number vary 
from scan to scan (sometimes increasing, sometimes decreasing). 

I’ve eliminated the possibility that data is being pulled in from outside the 
time range. I did this by scanning for a single column qualifier (using only 
that qualifier to decide whether a row had data in the time range), getting the 
timestamp on the cell for each returned row, and comparing it against the begin 
and end times for the scan. I didn’t find any cell whose timestamp fell outside 
the range. I’ve observed some row keys show up in the 1st scan, drop out in the 
2nd scan, and then show up again in the 3rd scan (all with the exact same Scan 
object). The differences have varied wildly, from 2-3 rows between consecutive 
scans to an increase of 40 rows followed by a drop of 70 rows. 
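
To make the timestamp check above easy to repeat, here is a minimal sketch in 
Python. It assumes the (row key, cell timestamp) pairs have already been pulled 
out of the scan results; the function name and the sample values are made up 
for illustration. It treats the range as half-open, [begin, end), which is how 
HBase's Scan.setTimeRange interprets its arguments, so a cell stamped exactly 
at the end time counts as outside the range.

```python
from typing import Iterable, List, Tuple


def cells_outside_range(
    cells: Iterable[Tuple[str, int]], begin_ms: int, end_ms: int
) -> List[Tuple[str, int]]:
    """Return the (row key, timestamp) pairs that fall outside [begin_ms, end_ms).

    The lower bound is inclusive and the upper bound is exclusive.
    """
    return [(key, ts) for key, ts in cells if not (begin_ms <= ts < end_ms)]


if __name__ == "__main__":
    # Hypothetical sample data: row keys and cell timestamps in milliseconds.
    sample = [("row-a", 1000), ("row-b", 1999), ("row-c", 2000), ("row-d", 999)]
    for key, ts in cells_outside_range(sample, begin_ms=1000, end_ms=2000):
        print(key, ts)
```

An empty result for every run would support the conclusion that nothing is 
leaking in from outside the range, including at the two boundaries.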

I’m looking for ideas on how to track down what could be causing this. The 
code itself is pretty simple: it creates a Scan object, scans the table, 
extracts the row key in the map phase, and at the end dumps the row keys to a 
directory in HDFS. 
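
Since each run dumps its row keys to a directory, the outputs of two runs can 
be compared directly to see exactly which keys appear and disappear. The 
following is only a sketch in Python. It assumes each run's output directory 
has been copied out of HDFS to local disk, that the files are named part-*, and 
that they hold one printable row key per line; none of that is confirmed by the 
job described here, and the function names are made up.

```python
from pathlib import Path
from typing import Dict, List, Set


def load_row_keys(run_dir: str) -> Set[str]:
    """Collect the row keys from every part-* file in one run's output directory.

    Assumes one row key per line; blank lines are ignored.
    """
    keys: Set[str] = set()
    for part in sorted(Path(run_dir).glob("part-*")):
        with part.open(encoding="utf-8") as handle:
            for line in handle:
                key = line.rstrip("\n")
                if key:
                    keys.add(key)
    return keys


def diff_runs(previous: Set[str], current: Set[str]) -> Dict[str, List[str]]:
    """Report the keys that dropped out of, or were added to, the later run."""
    return {
        "dropped": sorted(previous - current),
        "added": sorted(current - previous),
    }


if __name__ == "__main__":
    # Hypothetical local copies of two runs' output directories.
    changes = diff_runs(load_row_keys("run1"), load_row_keys("run2"))
    print("dropped:", changes["dropped"])
    print("added:", changes["added"])
```

Having the actual keys that flip between runs, rather than just the counter 
totals, gives something concrete to look up in the table afterwards.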
