Actually, I first encountered the problem when I was using HBase standalone, 
i.e. outside of map-reduce.  I use an HTable client to simultaneously scan and 
update a table.  While iterating the scan, I do a put() of a single column back 
to the current row returned by the scanner.
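
To make the pattern concrete, the standalone version looks roughly like the
sketch below (not my exact code; the table name, column family, and values
are placeholders):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanAndPut {
  public static void main(String[] args) throws IOException {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "mytable");      // placeholder table name
    Scan scan = new Scan();
    scan.setCaching(100);                            // hbase.client.scanner.caching
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        // ... read & process the row ...
        // write one column back to the row the scanner just returned
        Put put = new Put(row.getRow());
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("status"), Bytes.toBytes("processed"));
        table.put(put);                              // disabling this put makes the scan return every row
      }
    } finally {
      scanner.close();
      table.flushCommits();                          // no-op if autoFlush is on (the default)
    }
  }
}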

What I see is that the scan sometimes does not return all rows of the table.  
If I disable the put(), it always returns all rows.  One interesting 
observation is that the number of missing rows is always a multiple of the 
scanner cache size (hbase.client.scanner.caching).  It happens on a single-node 
installation as well as on a multi-node cluster.  It also happens if I use 
separate HTable instances for the scanner and the writer.

But... my initial question is "should it work?".  I can understand why this 
pattern of usage might be considered bad practice.

If it is allowed and should work, then I'll try to write a simple test case to 
isolate the problem.

Thanks!

-----Original Message-----
From: Ryan Rawson [mailto:[email protected]] 
Sent: Thursday, September 30, 2010 12:45 PM
To: [email protected]
Subject: Re: Is it "legal" to write to the same HBase table you're scanning?

When you "write" to the context you are really just creating values
that get sent to the reduce phase where the actual puts to HBase
occur.

Could you describe in more detail why you think TIF is failing to return all
rows?  Setup code for your map reduce, existing row counts, etc. would
all be helpful.  In other words, no, we have not heard of any issues of
TIF failing to read all rows of a table - although it of course isn't
impossible.

-ryan

On Thu, Sep 30, 2010 at 12:29 PM, Curt Allred <[email protected]> wrote:
> Thanks for the reply.  I wasn't clear on one point... the scan and the put are
> both in the map phase, i.e.:
>
> public class TestMapper extends TableMapper<ImmutableBytesWritable, Put> {
>   @Override
>   protected void map(ImmutableBytesWritable rowId, Result result, Context context)
>       throws IOException, InterruptedException {
>     // read & process row data
>     ...
>     // now write new data to the same row of the same table
>     Put put = new Put(rowId.get());
>     put.add(newStuff);           // newStuff = family, qualifier, value bytes
>     context.write(rowId, put);   // using TableOutputFormat
>   }
> }
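>
> For completeness, the driver is set up roughly like this (a sketch from
> memory, not the exact code; "mytable" and the class names are placeholders,
> and I run it map-only with no reducer):
>
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.Put;
> import org.apache.hadoop.hbase.client.Scan;
> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
> import org.apache.hadoop.mapreduce.Job;
>
> public class TestDriver {
>   public static void main(String[] args) throws Exception {
>     HBaseConfiguration conf = new HBaseConfiguration();
>     Job job = new Job(conf, "scan-and-update");    // placeholder job name
>     job.setJarByClass(TestDriver.class);
>
>     Scan scan = new Scan();
>     scan.setCaching(500);                          // hbase.client.scanner.caching
>
>     // TableInputFormat feeds TestMapper; TableOutputFormat receives the Puts
>     TableMapReduceUtil.initTableMapperJob("mytable", scan, TestMapper.class,
>         ImmutableBytesWritable.class, Put.class, job);
>     TableMapReduceUtil.initTableReducerJob("mytable", null, job);
>     job.setNumReduceTasks(0);                      // map-only: the puts happen during the map
>     job.waitForCompletion(true);
>   }
> }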
>
> I don't expect to see the new data I just wrote, but it needs to be there for
> later map-reduce processes.
>
> When I run this, the scanner iterator fails to return all rows, with no
> exception or any other indication of failure.
>
> I could write the new data to an interim location and write it to the table
> during the Reduce phase, but that seems inefficient since I'm not actually
> doing any reduction.
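>
> (If I did move the write to the Reduce phase, I think it would just be a
> matter of swapping the map-only wiring in the sketch above for something
> like this, letting IdentityTableReducer pass the Puts through to the table;
> a sketch, not tested:)
>
> import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;
>
> // map output is shuffled to the reducers, and IdentityTableReducer
> // writes each Put to the target table
> TableMapReduceUtil.initTableReducerJob("mytable", IdentityTableReducer.class, job);
> // (and drop the setNumReduceTasks(0) call so the reduce step runs)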
>
>
>
> -----Original Message-----
> From: Ryan Rawson [mailto:[email protected]]
> Sent: Thursday, September 30, 2010 11:27 AM
> To: [email protected]
> Subject: Re: Is it "legal" to write to the same HBase table you're scanning?
>
> Something else is going on, since TableInputFormat and
> TableOutputFormat in the same map reduce are not concurrent... the
> maps run, then the reduces run, and there is no overlap.  A feature of
> mapreduce.
>
> So if you were expecting to see the rows you were 'just writing'
> during your map phase, you won't, alas.
>
> -ryan
>
> On Thu, Sep 30, 2010 at 11:11 AM, Curt Allred <[email protected]> wrote:
>> I can't find any documentation which says you shouldn't write to the same 
>> HBase table you're scanning.  But it doesn't seem to work...  I have a mapper 
>> (a subclass of TableMapper) which scans a table and, for each row encountered 
>> during the scan, updates a column of that row, writing it back to the same 
>> table immediately (using TableOutputFormat).  Occasionally the scanner ends 
>> without completing the scan.  No exception is thrown, no indication of 
>> failure; it just says it's done when it hasn't actually returned all the rows. 
>> This happens even if the scan has no timestamp specification or filters.  
>> It seems to happen only when I use a cache size greater than 1 
>> (hbase.client.scanner.caching).  This behavior is also repeatable using an 
>> HTable outside of a map-reduce job.
>>
>> The following blog entry implies that it might be risky, or worse, socially 
>> unacceptable :)
>>
>> http://www.larsgeorge.com/2009/01/how-to-use-hbase-with-hadoop.html:
>>
>>  > Again, I have cases where I do not output but save back to
>>  > HBase. I could easily write the records back into HBase in
>>  > the reduce step, so why pass them on first? I think this is
>>  > in some cases just common sense or being a "good citizen".
>>  > ...
>>  > Writing back to the same HBase table is OK when doing it in
>>  > the Reduce phase as all scanning has concluded in the Map
>>  > phase beforehand, so all rows and columns are saved to an
>>  > intermediate Hadoop SequenceFile internally and when you
>>  > process these and write back to HBase you have no problems
>>  > that there is still a scanner for the same job open reading
>>  > the table.
>>  >
>>  > Otherwise it is OK to write to a different HBase table even
>>  > during the Map phase.
>>
>> But I also found a JIRA issue which indicates it "should" work, but there 
>> was a bug a while back which was fixed:
>>
>>  > https://issues.apache.org/jira/browse/HBASE-810: Prevent temporary
>>  > deadlocks when, during a scan with write operations, the region splits
>>
>> Anyone else writing while scanning? Or know of documentation that addresses 
>> this case?
>>
>> Thanks,
>> -Curt
>>
>>
>
