Can you also try 0.20.6 and see if you can repro on that?  There have
been a lot of changes between those versions that could potentially
affect this issue.

-ryan

On Thu, Sep 30, 2010 at 2:58 PM, Curt Allred <[email protected]> wrote:
> Version 0.20.3.
> I'll try to reduce my code to a test case I can give you.
> Thanks!
>
> -----Original Message-----
> From: Ryan Rawson [mailto:[email protected]]
> Sent: Thursday, September 30, 2010 2:53 PM
> To: [email protected]
> Subject: Re: Is it "legal" to write to the same HBase table you're scanning?
>
> Yes, that should work.  What version of HBase are you using?  I would
> appreciate a test case :-)
>
> -ryan
>
> On Thu, Sep 30, 2010 at 2:47 PM, Curt Allred <[email protected]> wrote:
>> Actually, I first encountered the problem when I was using HBase standalone, 
>> i.e. outside of map-reduce.  I use an HTable client to simultaneously scan 
>> and update a table.  While iterating the scan, I do a put() of a single 
>> column back to the current row returned by the scanner.
>>
>> What I see is that the scan sometimes does not return all rows of the table. 
>>  If I disable the put(), it always returns all rows.  One interesting 
>> observation is that the number of missing rows is always a multiple of the 
>> cache size (hbase.client.scanner.caching).  It happens on a single node 
>> installation as well as a multi-node cluster.  It also happens if I use 
>> separate instances of HTable for the scanner and writer.
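>>
>> Roughly, the standalone loop looks like this (the table, family,
>> qualifier and value names here are just placeholders):
>>
>>   HBaseConfiguration conf = new HBaseConfiguration();
>>   HTable table = new HTable(conf, "mytable");
>>   Scan scan = new Scan();
>>   scan.setCaching(100); // same effect as hbase.client.scanner.caching
>>   ResultScanner scanner = table.getScanner(scan);
>>   for (Result row : scanner) {
>>     // process the row, then write one column back to the same row
>>     Put put = new Put(row.getRow());
>>     put.add(Bytes.toBytes("cf"), Bytes.toBytes("status"),
>>         Bytes.toBytes("processed"));
>>     table.put(put);
>>   }
>>   scanner.close();
>>   table.flushCommits();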
>>
>> But... my initial question is "should it work?".  I can understand why this 
>> pattern of usage might be considered bad practice.
>>
>> If it is allowed and should work, then I'll try to write a simple test case 
>> to isolate the problem.
>>
>> Thanks!
>>
>>
>> -----Original Message-----
>> From: Ryan Rawson [mailto:[email protected]]
>> Sent: Thursday, September 30, 2010 12:45 PM
>> To: [email protected]
>> Subject: Re: Is it "legal" to write to the same HBase table you're scanning?
>>
>> When you "write" to the context, you are really just creating values
>> that get sent to the reduce phase, where the actual puts to HBase
>> occur.
>>
>> Could you describe more why you think TIF is failing to return all
>> rows?  Setup code for your map-reduce job, existing row counts, etc.
>> would all be helpful.  In other words, no, we have not heard of any
>> issues of TIF failing to read all rows of a table - although it of
>> course isn't impossible.
>>
>> -ryan
>>
>> On Thu, Sep 30, 2010 at 12:29 PM, Curt Allred <[email protected]> wrote:
>>> Thanks for the reply.  I wasn't clear on one point... the scan and put
>>> are both in the map phase, i.e.:
>>>
>>> public class TestMapper extends TableMapper<ImmutableBytesWritable, Put> {
>>>   @Override
>>>   protected void map(ImmutableBytesWritable rowId, Result row,
>>>       Context context) throws IOException, InterruptedException {
>>>     // read & process row data
>>>     ...
>>>     // now write new data back to the same row of the same table
>>>     Put put = new Put(rowId.get());
>>>     put.add(family, qualifier, newStuff);
>>>     context.write(rowId, put); // using TableOutputFormat
>>>   }
>>> }
>>>
>>> I don't expect to see the new data I just wrote, but it needs to be
>>> there for later map-reduce processes.
>>>
>>> When I run this, the scanner iterator fails to return all rows, without
>>> an exception or any indication of failure.
>>>
>>> I could write the new data to an interim location and write it to the
>>> table during the Reduce phase, but this seems inefficient since I'm not
>>> actually doing any reduction.
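>>>
>>> In case it helps, the job setup is basically a map-only job wired up
>>> with TableMapReduceUtil, roughly like this (table and class names here
>>> are just placeholders):
>>>
>>>   Job job = new Job(conf, "scan-and-update");
>>>   job.setJarByClass(TestMapper.class);
>>>   Scan scan = new Scan();
>>>   scan.setCaching(10); // hbase.client.scanner.caching
>>>   TableMapReduceUtil.initTableMapperJob("mytable", scan, TestMapper.class,
>>>       ImmutableBytesWritable.class, Put.class, job);
>>>   // no reducer: TableOutputFormat writes the Puts emitted by the map phase
>>>   TableMapReduceUtil.initTableReducerJob("mytable", null, job);
>>>   job.setNumReduceTasks(0);
>>>   job.waitForCompletion(true);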
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Ryan Rawson [mailto:[email protected]]
>>> Sent: Thursday, September 30, 2010 11:27 AM
>>> To: [email protected]
>>> Subject: Re: Is it "legal" to write to the same HBase table you're scanning?
>>>
>>> Something else is going on, since TableInputFormat and
>>> TableOutputFormat in the same map-reduce job are not concurrent... the
>>> maps run, then the reduces run, and there is no overlap.  That's a
>>> feature of map-reduce.
>>>
>>> So if you were expecting to see the rows you were 'just writing'
>>> during your map phase, you won't, alas.
>>>
>>> -ryan
>>>
>>> On Thu, Sep 30, 2010 at 11:11 AM, Curt Allred <[email protected]> wrote:
>>>> I can't find any documentation which says you shouldn't write to the
>>>> same HBase table you're scanning.  But it doesn't seem to work...  I
>>>> have a mapper (subclass of TableMapper) which scans a table, and for
>>>> each row encountered during the scan, it updates a column of the row,
>>>> writing it back to the same table immediately (using
>>>> TableOutputFormat).  Occasionally the scanner ends without completing
>>>> the scan.  No exception is thrown and there is no indication of
>>>> failure; it just says it's done when it hasn't actually returned all
>>>> the rows.  This happens even if the scan has no timestamp
>>>> specification or filters.  It seems to happen only when I use a cache
>>>> size greater than 1 (hbase.client.scanner.caching).  This behavior is
>>>> also repeatable using an HTable outside of a map-reduce job.
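>>>>
>>>> To be explicit about what I mean by "cache size", I set it either in
>>>> the config or per scan, roughly like this (the value 10 is just an
>>>> example):
>>>>
>>>>   // in the client/job configuration
>>>>   conf.setInt("hbase.client.scanner.caching", 10);
>>>>   // or per scan
>>>>   scan.setCaching(10);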
>>>>
>>>> The following blog entry implies that it might be risky, or worse, 
>>>> socially unacceptable :)
>>>>
>>>> http://www.larsgeorge.com/2009/01/how-to-use-hbase-with-hadoop.html:
>>>>
>>>>  > Again, I have cases where I do not output but save back to
>>>>  > HBase. I could easily write the records back into HBase in
>>>>  > the reduce step, so why pass them on first? I think this is
>>>>  > in some cases just common sense or being a "good citizen".
>>>>  > ...
>>>>  > Writing back to the same HBase table is OK when doing it in
>>>>  > the Reduce phase as all scanning has concluded in the Map
>>>>  > phase beforehand, so all rows and columns are saved to an
>>>>  > intermediate Hadoop SequenceFile internally and when you
>>>>  > process these and write back to HBase you have no problems
>>>>  > that there is still a scanner for the same job open reading
>>>>  > the table.
>>>>  >
>>>>  > Otherwise it is OK to write to a different HBase table even
>>>>  > during the Map phase.
>>>>
>>>> But I also found a JIRA issue which indicates it "should" work, but
>>>> there was a bug a while back which was fixed:
>>>>
>>>>  > https://issues.apache.org/jira/browse/HBASE-810: Prevent temporary
>>>>  > deadlocks when, during a scan with write operations, the region splits
>>>>
>>>> Anyone else writing while scanning? Or know of documentation that 
>>>> addresses this case?
>>>>
>>>> Thanks,
>>>> -Curt
>>>>
>>>>
>>>
>>
>
