Yes, that should work. What version of HBase are you using? I would appreciate a test case :-)
-ryan

On Thu, Sep 30, 2010 at 2:47 PM, Curt Allred <[email protected]> wrote:
> Actually, I first encountered the problem when I was using HBase standalone,
> i.e. outside of map-reduce. I use an HTable client to simultaneously scan
> and update a table. While iterating the scan, I do a put() of a single
> column back to the current row returned by the scanner.
>
> What I see is that the scan sometimes does not return all rows of the table.
> If I disable the put(), it always returns all rows. One interesting
> observation is that the number of missing rows is always a multiple of the
> cache size (hbase.client.scanner.caching). It happens on a single-node
> installation as well as a multi-node cluster. It also happens if I use
> separate instances of HTable for the scanner and the writer.
>
> But... my initial question is "should it work?". I can understand why this
> pattern of usage might be considered bad practice.
>
> If it is allowed and should work, then I'll try to write a simple test case
> to isolate the problem.
>
> Thanks!
>
>
> -----Original Message-----
> From: Ryan Rawson [mailto:[email protected]]
> Sent: Thursday, September 30, 2010 12:45 PM
> To: [email protected]
> Subject: Re: Is it "legal" to write to the same HBase table you're scanning?
>
> When you "write" to the context you are really just creating values
> that get sent to the reduce phase, where the actual puts to HBase
> occur.
>
> Could you describe more about why you think TIF is failing to return all
> rows? Setup code for your map-reduce, existing row counts, etc. would
> all be helpful. In other words, no, we have not heard of any issues of
> TIF failing to read all rows of a table - although it of course isn't
> impossible.
>
> -ryan
>
> On Thu, Sep 30, 2010 at 12:29 PM, Curt Allred <[email protected]> wrote:
>> Thanks for the reply. I wasn't clear on one point... the scan and put are
>> both in the map phase, i.e.:
>>
>> public class TestMapper extends TableMapper<ImmutableBytesWritable, Put> {
>>   protected void map(ImmutableBytesWritable rowId, Result row, Context context)
>>       throws IOException, InterruptedException {
>>     // read & process row data
>>     ...
>>     // now write new data to the same row of the same table
>>     Put put = new Put(rowId.get());
>>     put.add(family, qualifier, newStuff);
>>     context.write(rowId, put); // using TableOutputFormat
>>   }
>> }
>>
>> I don't expect to see the new data I just wrote, but it needs to be there for
>> later map-reduce processes.
>>
>> When I run this, the scanner iterator fails to return all rows, without an
>> exception or any indication of failure.
>>
>> I could write the new data to an interim location and write it to the table
>> during the Reduce phase, but this seems inefficient since I'm not actually
>> doing any reduction.
>>
>>
>>
>> -----Original Message-----
>> From: Ryan Rawson [mailto:[email protected]]
>> Sent: Thursday, September 30, 2010 11:27 AM
>> To: [email protected]
>> Subject: Re: Is it "legal" to write to the same HBase table you're scanning?
>>
>> Something else is going on, since TableInputFormat and
>> TableOutputFormat in the same map-reduce are not concurrent... the
>> maps run, then the reduces run, and there is no overlap. That's a
>> feature of MapReduce.
>>
>> So if you were expecting to see the rows you were 'just writing'
>> during your map phase, you won't, alas.
>>
>> -ryan
>>
>> On Thu, Sep 30, 2010 at 11:11 AM, Curt Allred <[email protected]> wrote:
>>> I can't find any documentation which says you shouldn't write to the same
>>> HBase table you're scanning. But it doesn't seem to work...
>>> I have a mapper (subclass of TableMapper) which scans a table, and for
>>> each row encountered during the scan, it updates a column of the row,
>>> writing it back to the same table immediately (using TableOutputFormat).
>>> Occasionally the scanner ends without completing the scan. No exception
>>> is thrown, no indication of failure; it just says it's done when it
>>> hasn't actually returned all the rows. This happens even if the scan has
>>> no timestamp specification or filters. It seems to happen only when I use
>>> a cache size greater than 1 (hbase.client.scanner.caching). This behavior
>>> is also repeatable using an HTable outside of a map-reduce job.
>>>
>>> The following blog entry implies that it might be risky, or worse, socially
>>> unacceptable :)
>>>
>>> http://www.larsgeorge.com/2009/01/how-to-use-hbase-with-hadoop.html:
>>>
>>> > Again, I have cases where I do not output but save back to
>>> > HBase. I could easily write the records back into HBase in
>>> > the reduce step, so why pass them on first? I think this is
>>> > in some cases just common sense or being a "good citizen".
>>> > ...
>>> > Writing back to the same HBase table is OK when doing it in
>>> > the Reduce phase as all scanning has concluded in the Map
>>> > phase beforehand, so all rows and columns are saved to an
>>> > intermediate Hadoop SequenceFile internally and when you
>>> > process these and write back to HBase you have no problems
>>> > that there is still a scanner for the same job open reading
>>> > the table.
>>> >
>>> > Otherwise it is OK to write to a different HBase table even
>>> > during the Map phase.
>>>
>>> But I also found a JIRA issue which indicates it "should" work, but there
>>> was a bug a while back which was fixed:
>>>
>>> > https://issues.apache.org/jira/browse/HBASE-810: Prevent temporary
>>> > deadlocks when, during a scan with write operations, the region splits
>>>
>>> Anyone else writing while scanning? Or know of documentation that addresses
>>> this case?
>>>
>>> Thanks,
>>> -Curt
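
For reference, a minimal standalone test case along the lines Curt describes might look like the sketch below. It is illustrative, not from the thread: the table name "test", family "f", and qualifier "touched" are assumed names, the table is assumed to be pre-populated, and the client API shown is the 0.90-era HTable interface. With the put() enabled and caching greater than 1, the reported symptom would be a final row count that comes up short by a multiple of the caching size.

    // Sketch of the scan-while-put reproduction described in the thread
    // (names and API version are assumptions, see note above).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanWhilePutTest {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test");  // assumed pre-populated table
        Scan scan = new Scan();
        scan.setCaching(10);                      // hbase.client.scanner.caching > 1
        ResultScanner scanner = table.getScanner(scan);
        int count = 0;
        try {
          for (Result row : scanner) {
            // write a single column back to the row the scanner just returned
            Put put = new Put(row.getRow());
            put.add(Bytes.toBytes("f"), Bytes.toBytes("touched"), Bytes.toBytes("y"));
            table.put(put);
            count++;
          }
        } finally {
          scanner.close();
        }
        // compare against the known row count; per the report, the shortfall
        // is a multiple of the caching size when the put() above is enabled
        System.out.println("rows returned by scan: " + count);
        table.close();
      }
    }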

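The workaround the blog post describes, writing back during the Reduce phase instead, can be wired up with TableMapReduceUtil and the bundled IdentityTableReducer. Again this is a sketch rather than anything from the thread: "mytable" is an assumed table name, and TestMapper refers to the mapper quoted in the thread, which already emits (rowId, Put) pairs.

    // Sketch of the reduce-phase write-back pattern: the map phase only
    // emits Puts, and IdentityTableReducer writes them to the table after
    // all scanning has concluded.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.mapreduce.Job;

    public class WriteBackJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "scan-then-write-back");
        job.setJarByClass(WriteBackJob.class);

        Scan scan = new Scan();
        scan.setCaching(10);
        // map phase: scan "mytable" and emit (rowId, Put) pairs,
        // as in the TestMapper quoted in the thread
        TableMapReduceUtil.initTableMapperJob("mytable", scan, TestMapper.class,
            ImmutableBytesWritable.class, Put.class, job);
        // reduce phase: write the emitted Puts back to the same table,
        // after the map-phase scan has finished
        TableMapReduceUtil.initTableReducerJob("mytable",
            IdentityTableReducer.class, job);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }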