I can't find any documentation which says you shouldn't write to the same HBase 
table you're scanning, but it doesn't seem to work.  I have a mapper 
(a subclass of TableMapper) which scans a table and, for each row encountered 
during the scan, updates a column of the row, writing it back to the same 
table immediately (using TableOutputFormat).  Occasionally the scanner ends 
without completing the scan.  No exception is thrown and there is no indication 
of failure; it just says it's done when it hasn't actually returned all the rows.  
This happens even if the scan has no timestamp specification or filters.  It seems 
to happen only when I use a scanner cache size greater than 1 
(hbase.client.scanner.caching).  This behavior is also repeatable using an 
HTable outside of a map-reduce job.

The following blog entry implies that it might be risky, or worse, socially 
unacceptable :)


 > Again, I have cases where I do not output but save back to
 > HBase. I could easily write the records back into HBase in
 > the reduce step, so why pass them on first? I think this is
 > in some cases just common sense or being a "good citizen".
 > ...
 > Writing back to the same HBase table is OK when doing it in
 > the Reduce phase as all scanning has concluded in the Map
 > phase beforehand, so all rows and columns are saved to an
 > intermediate Hadoop SequenceFile internally and when you
 > process these and write back to HBase you have no problems
 > that there is still a scanner for the same job open reading
 > the table.
 > Otherwise it is OK to write to a different HBase table even
 > during the Map phase.
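
For illustration, here is a minimal sketch of the two-phase pattern the quoted passage describes: finish reading first, then write. It makes no HBase calls. The TreeMap stands in for the table, and the class ScanThenWrite, the PendingPut type, and the uppercase "update" are all hypothetical names and behavior made up for the example, so it shows only the ordering of reads and writes, not how HBase scanners behave.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ScanThenWrite {
    // Hypothetical pending update: a row key plus the new column value.
    static final class PendingPut {
        final String row;
        final String value;
        PendingPut(String row, String value) {
            this.row = row;
            this.value = value;
        }
    }

    // Phase 1: read every row without writing to the table;
    // only record what should be written later.
    static List<PendingPut> scan(Map<String, String> table) {
        List<PendingPut> pending = new ArrayList<PendingPut>();
        for (Map.Entry<String, String> e : table.entrySet()) {
            // Assumed update rule for the example: uppercase the value.
            pending.add(new PendingPut(e.getKey(), e.getValue().toUpperCase()));
        }
        return pending;
    }

    // Phase 2: apply the buffered updates once the scan has finished.
    static void apply(Map<String, String> table, List<PendingPut> pending) {
        for (PendingPut p : pending) {
            table.put(p.row, p.value);
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for the table (an assumption, not HBase).
        Map<String, String> table = new TreeMap<String, String>();
        table.put("row1", "a");
        table.put("row2", "b");

        List<PendingPut> pending = scan(table);
        apply(table, pending);
        System.out.println(pending.size() + " rows scanned, table = " + table);
    }
}
```

In a map-reduce job the buffering would be done by the shuffle between the Map and Reduce phases, as the quote says, rather than by an in-memory list.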

But I also found a JIRA issue which indicates it "should" work; there was a 
bug a while back, which was fixed:

 > https://issues.apache.org/jira/browse/HBASE-810: Prevent temporary
 > deadlocks when, during a scan with write operations, the region splits

Is anyone else writing while scanning? Or does anyone know of documentation 
that addresses this case?

