Yes, that should work.  What version of HBase are you using?  I would
appreciate a test case :-)

-ryan

On Thu, Sep 30, 2010 at 2:47 PM, Curt Allred <[email protected]> wrote:
> Actually, I first encountered the problem when I was using HBase standalone, 
> i.e. outside of map-reduce.  I use an HTable client to simultaneously scan 
> and update a table.  While iterating the scan, I do a put() of a single 
> column back to the current row returned by the scanner.
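>
> The loop looks roughly like this (a minimal sketch; "mytable", the column
> family/qualifier and the new value are just placeholders, and conf is an
> ordinary HBaseConfiguration):
>
>   // uses org.apache.hadoop.hbase.client.{HTable, Scan, ResultScanner, Result, Put}
>   HTable table = new HTable(conf, "mytable");
>   Scan scan = new Scan();
>   scan.setCaching(100);                    // hbase.client.scanner.caching
>   ResultScanner scanner = table.getScanner(scan);
>   for (Result row : scanner) {
>     // ... read & process the current row ...
>     Put put = new Put(row.getRow());       // same row the scanner just returned
>     put.add(family, qualifier, newValue);  // single column written back
>     table.put(put);                        // write while the scan is still open
>   }
>   scanner.close();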
>
> What I see is that the scan sometimes does not return all rows of the table.  
> If I disable the put(), it always returns all rows.  One interesting 
> observation is that the number of missing rows is always a multiple of the 
> cache size (hbase.client.scanner.caching).  It happens on a single-node 
> installation as well as on a multi-node cluster.  It also happens if I use 
> separate instances of HTable for the scanner and writer.
>
> But... my initial question is "should it work?".  I can understand why this 
> pattern of usage might be considered bad practice.
>
> If it is allowed and should work, then I'll try to write a simple test case 
> to isolate the problem.
>
> Thanks!
>
>
> -----Original Message-----
> From: Ryan Rawson [mailto:[email protected]]
> Sent: Thursday, September 30, 2010 12:45 PM
> To: [email protected]
> Subject: Re: Is it "legal" to write to the same HBase table you're scanning?
>
> When you "write" to the context you are really just creating values
> that get sent to the reduce phase where the actual puts to HBase
> occur.
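>
> (For reference, the wiring that description assumes looks roughly like the
> sketch below -- the table name is a placeholder, and IdentityTableReducer
> simply passes the Puts through to the output table:)
>
>   Job job = new Job(conf, "scan-and-update");
>   Scan scan = new Scan();
>   TableMapReduceUtil.initTableMapperJob(
>       "mytable", scan, TestMapper.class,
>       ImmutableBytesWritable.class, Put.class, job);
>   // the actual puts happen in the reduce phase, via TableOutputFormat
>   TableMapReduceUtil.initTableReducerJob(
>       "mytable", IdentityTableReducer.class, job);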
>
> Could you describe in more detail why you think TIF is failing to return
> all rows?  Setup code for your map-reduce job, existing row counts, etc.
> would all be helpful.  In other words, no, we have not heard of any issues
> with TIF failing to read all rows of a table - although it of course isn't
> impossible.
>
> -ryan
>
> On Thu, Sep 30, 2010 at 12:29 PM, Curt Allred <[email protected]> wrote:
>> Thanks for the reply.  I wasn't clear on one point... the scan and the put 
>> are both in the map phase, i.e.:
>>
>> public class TestMapper extends TableMapper<ImmutableBytesWritable, Put> {
>>   @Override
>>   protected void map(ImmutableBytesWritable rowId, Result result, Context context)
>>       throws IOException, InterruptedException {
>>     // read & process row data
>>     ...
>>     // now write new data to the same row of the same table
>>     Put put = new Put(rowId.get());
>>     put.add(family, qualifier, newStuff);  // placeholder column & value
>>     context.write(rowId, put); // using TableOutputFormat
>>   }
>> }
>>
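>> For completeness, the job is wired up roughly like this (a sketch, assuming a
>> map-only job so that TableOutputFormat writes directly from the map tasks;
>> the table name is a placeholder):
>>
>>   job.setOutputFormatClass(TableOutputFormat.class);
>>   job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "mytable");
>>   job.setNumReduceTasks(0);  // no reduce phase; Puts go straight to the table
>>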
>> I don't expect to see the new data I just wrote, but it needs to be there for 
>> later map-reduce processes.
>>
>> When I run this, the scanner iterator fails to return all rows, without an 
>> exception or any other indication of failure.
>>
>> I could write the new data to an interim location and write it to the table 
>> during the Reduce phase, but this seems inefficient since I'm not actually 
>> doing any reduction.
>>
>>
>>
>> -----Original Message-----
>> From: Ryan Rawson [mailto:[email protected]]
>> Sent: Thursday, September 30, 2010 11:27 AM
>> To: [email protected]
>> Subject: Re: Is it "legal" to write to the same HBase table you're scanning?
>>
>> Something else is going on, since TableInputFormat and
>> TableOutputFormat in the same map-reduce job are not concurrent... the
>> maps run, then the reduces run, and there is no overlap.  That's a
>> feature of MapReduce.
>>
>> So if you were expecting to see the rows you were 'just writing'
>> during your map phase, you won't, alas.
>>
>> -ryan
>>
>> On Thu, Sep 30, 2010 at 11:11 AM, Curt Allred <[email protected]> wrote:
>>> I can't find any documentation which says you shouldn't write to the same 
>>> HBase table you're scanning.  But it doesn't seem to work...  I have a 
>>> mapper (a subclass of TableMapper) which scans a table, and for each row 
>>> encountered during the scan, it updates a column of the row, writing it 
>>> back to the same table immediately (using TableOutputFormat).  Occasionally 
>>> the scanner ends without completing the scan.  No exception is thrown, no 
>>> indication of failure; it just says it's done when it hasn't actually 
>>> returned all the rows.  This happens even if the scan has no timestamp 
>>> specification or filters.  It seems to happen only when I use a cache size 
>>> greater than 1 (hbase.client.scanner.caching).  This behavior is also 
>>> repeatable using an HTable outside of a map-reduce job.
>>>
>>> The following blog entry implies that it might be risky, or worse, socially 
>>> unacceptable :)
>>>
>>> http://www.larsgeorge.com/2009/01/how-to-use-hbase-with-hadoop.html:
>>>
>>>  > Again, I have cases where I do not output but save back to
>>>  > HBase. I could easily write the records back into HBase in
>>>  > the reduce step, so why pass them on first? I think this is
>>>  > in some cases just common sense or being a "good citizen".
>>>  > ...
>>>  > Writing back to the same HBase table is OK when doing it in
>>>  > the Reduce phase as all scanning has concluded in the Map
>>>  > phase beforehand, so all rows and columns are saved to an
>>>  > intermediate Hadoop SequenceFile internally and when you
>>>  > process these and write back to HBase you have no problems
>>>  > that there is still a scanner for the same job open reading
>>>  > the table.
>>>  >
>>>  > Otherwise it is OK to write to a different HBase table even
>>>  > during the Map phase.
>>>
>>> But I also found a JIRA issue which indicates it "should" work, though there 
>>> was a bug a while back which was fixed:
>>>
>>>  > https://issues.apache.org/jira/browse/HBASE-810: Prevent temporary
>>>  > deadlocks when, during a scan with write operations, the region splits
>>>
>>> Anyone else writing while scanning? Or know of documentation that addresses 
>>> this case?
>>>
>>> Thanks,
>>> -Curt
>>>
>>>
>>
>
