I can't find any documentation which says you shouldn't write to the same HBase table you're scanning. But it doesn't seem to work... I have a mapper (subclass of TableMapper) which scans a table, and for each row encountered during the scan, it updates a column of that row, writing it back to the same table immediately (using TableOutputFormat).

Occasionally the scanner ends without completing the scan. No exception is thrown and there is no indication of failure; it just reports that it's done when it hasn't actually returned all the rows. This happens even if the scan has no timestamp specification or filters. It seems to only happen when I use a cache size greater than 1 (hbase.client.scanner.caching). The behavior is also reproducible using an HTable outside of a map-reduce job.
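For reference, the standalone (non-MapReduce) reproduction looks roughly like the sketch below. The table name "mytable" and the family/qualifier "f"/"updated" are placeholders, and it assumes a 0.90-era client API (HBaseConfiguration.create(), HTable, Put.add); the point is just the shape of the loop: a scanner open on the same table that each iteration writes back to.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanWriteBack {
  public static void main(String[] args) throws IOException {
    // Placeholder table name; requires a running HBase cluster.
    HTable table = new HTable(HBaseConfiguration.create(), "mytable");

    Scan scan = new Scan();
    scan.setCaching(10); // any value > 1 seems to trigger the early termination

    ResultScanner scanner = table.getScanner(scan);
    long count = 0;
    for (Result row : scanner) {
      // Update a column of the current row and write it straight
      // back to the same table the scanner is reading from.
      Put put = new Put(row.getRow());
      put.add(Bytes.toBytes("f"), Bytes.toBytes("updated"), Bytes.toBytes("y"));
      table.put(put);
      count++;
    }
    scanner.close();
    table.close();

    // With caching > 1, count comes back smaller than the row count
    // of the table, with no exception thrown.
    System.out.println("rows seen: " + count);
  }
}
```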
The following blog entry implies that it might be risky, or worse, socially unacceptable :)

http://www.larsgeorge.com/2009/01/how-to-use-hbase-with-hadoop.html:

> Again, I have cases where I do not output but save back to
> HBase. I could easily write the records back into HBase in
> the reduce step, so why pass them on first? I think this is
> in some cases just common sense or being a "good citizen".
> ...
> Writing back to the same HBase table is OK when doing it in
> the Reduce phase as all scanning has concluded in the Map
> phase beforehand, so all rows and columns are saved to an
> intermediate Hadoop SequenceFile internally and when you
> process these and write back to HBase you have no problems
> that there is still a scanner for the same job open reading
> the table.
>
> Otherwise it is OK to write to a different HBase table even
> during the Map phase.

But I also found a JIRA issue which indicates it "should" work, though there was a bug a while back which was since fixed:

> https://issues.apache.org/jira/browse/HBASE-810: Prevent temporary
> deadlocks when, during a scan with write operations, the region splits

Is anyone else writing while scanning? Or does anyone know of documentation that addresses this case?

Thanks,
-Curt