I didn't try changing the write buffer size from the default.  Looks like the 
default is 2MB.  I'll try setting it to zero. Thanks for the tip.
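
For reference, here's roughly what I plan to try on the client. This is just a
sketch against the 0.20.x HTable API as I understand it; the class, table, row,
and column names are placeholders:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class UnbufferedPutExample {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "my_table"); // placeholder table name

    table.setAutoFlush(true);     // push each put() straight to the region server
    table.setWriteBufferSize(0L); // don't let the client-side write buffer hold anything back

    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
    table.put(put); // with autoflush off, an explicit table.flushCommits() would be needed here
  }
}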

-----Original Message-----
From: Michael Segel [mailto:michael_se...@hotmail.com] 
Sent: Thursday, September 30, 2010 4:19 PM
To: user@hbase.apache.org
Subject: RE: Is it "legal" to write to the same HBase table you're scanning?


Sorry to jump in on the tail end of this...

What sort of buffer size do you have on your client?
What it sounds like is that you're doing a put() but you don't see the row in 
the table until either the client-side buffer is full or you've flushed the 
buffer, which then writes the records to HBase. We had the same issue, and it 
wasn't until we set the buffer cache size to 0 that it forced the put() to send 
the record to the HBase server.

Note: If you're not doing processing in your m/r job, this will cause a 
performance hit because you're not bunching up your writes to HBase. If you are 
doing processing of the records, you will find the cost of each write to be 
negligible. 

It looks like you found the issue. 

One of our developers tried setting autoflush to true, yet we still had the 
problem until we made the buffer cache size 0.

HTH

-Mike

> Date: Thu, 30 Sep 2010 15:05:38 -0700
> Subject: Re: Is it "legal" to write to the same HBase table you're scanning?
> From: ryano...@gmail.com
> To: user@hbase.apache.org
> 
> Can you also try 0.20.6 and see if you can repro on that?  There have
> been a lot of changes between those versions that potentially affect
> this issue.
> 
> -ryan
> 
> On Thu, Sep 30, 2010 at 2:58 PM, Curt Allred <c...@mediosystems.com> wrote:
> > version 0.20.3.
> > I'll try to reduce my code to a testcase I can give you.
> > Thanks!
> >
> > -----Original Message-----
> > From: Ryan Rawson [mailto:ryano...@gmail.com]
> > Sent: Thursday, September 30, 2010 2:53 PM
> > To: user@hbase.apache.org
> > Subject: Re: Is it "legal" to write to the same HBase table you're scanning?
> >
> > Yes that should work.  What version of hbase are you using?  I would
> > appreciate a test case :-)
> >
> > -ryan
> >
> > On Thu, Sep 30, 2010 at 2:47 PM, Curt Allred <c...@mediosystems.com> wrote:
> >> Actually, I first encountered the problem when I was using HBase 
> >> standalone, i.e. outside of map-reduce.  I use an HTable client to 
> >> simultaneously scan and update a table.  While iterating the scan, I do a 
> >> put() of a single column back to the current row returned by the scanner.
> >>
> >> What I see is that the scan sometimes does not return all rows of the 
> >> table.  If I disable the put(), it always returns all rows.  One 
> >> interesting observation is that the number of missing rows is always a 
> >> multiple of the cache size (hbase.client.scanner.caching).  It happens on 
> >> a single node installation as well as a multi-node cluster.  It also 
> >> happens if I use separate instances of HTable for the scanner and writer.
> >>
> >> But... my initial question is "should it work?".  I can understand why 
> >> this pattern of usage might be considered bad practice.
> >>
> >> If it is allowed and should work, then I'll try to write a simple test 
> >> case to isolate the problem.
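> >>
> >> The test case will probably look something like the following untested sketch
> >> (table, family, and qualifier names are placeholders):
> >>
> >> import java.io.IOException;
> >> import org.apache.hadoop.hbase.HBaseConfiguration;
> >> import org.apache.hadoop.hbase.client.HTable;
> >> import org.apache.hadoop.hbase.client.Put;
> >> import org.apache.hadoop.hbase.client.Result;
> >> import org.apache.hadoop.hbase.client.ResultScanner;
> >> import org.apache.hadoop.hbase.client.Scan;
> >> import org.apache.hadoop.hbase.util.Bytes;
> >>
> >> public class ScanWhilePutTest {
> >>   public static void main(String[] args) throws IOException {
> >>     HTable table = new HTable(new HBaseConfiguration(), "test_table");
> >>     Scan scan = new Scan();
> >>     scan.setCaching(10); // the rows only go missing with caching > 1
> >>     ResultScanner scanner = table.getScanner(scan);
> >>     int count = 0;
> >>     for (Result row : scanner) {
> >>       Put put = new Put(row.getRow());
> >>       put.add(Bytes.toBytes("f"), Bytes.toBytes("flag"), Bytes.toBytes("x"));
> >>       table.put(put); // write a column back to the row we just scanned
> >>       count++;
> >>     }
> >>     scanner.close();
> >>     System.out.println("rows returned by scan: " + count); // compare to the known row count
> >>   }
> >> }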
> >>
> >> Thanks!
> >>
> >>
> >> -----Original Message-----
> >> From: Ryan Rawson [mailto:ryano...@gmail.com]
> >> Sent: Thursday, September 30, 2010 12:45 PM
> >> To: user@hbase.apache.org
> >> Subject: Re: Is it "legal" to write to the same HBase table you're 
> >> scanning?
> >>
> >> When you "write" to the context you are really just creating values
> >> that get sent to the reduce phase where the actual puts to HBase
> >> occur.
> >>
> >> Could you describe more why you think TIF is failing to return all
> >> rows?  Setup code for your map reduce, existing row counts, etc. would
> >> all be helpful.  In other words, no, we have not heard of any issues of
> >> TIF failing to read all rows of a table, although it of course isn't
> >> impossible.
> >>
> >> -ryan
> >>
> >> On Thu, Sep 30, 2010 at 12:29 PM, Curt Allred <c...@mediosystems.com> 
> >> wrote:
> >>> Thanks for the reply.  I wasn't clear on one point... the scan and the put are 
> >>> both in the map phase, i.e.:
> >>>
> >>> public class TestMapper extends TableMapper<ImmutableBytesWritable, Put> {
> >>>   protected void map(ImmutableBytesWritable rowId, Result result, Context context)
> >>>       throws IOException, InterruptedException {
> >>>     // read & process row data
> >>>     ...
> >>>     // now write new data back to the same row of the same table
> >>>     Put put = new Put(rowId.get());
> >>>     put.add(family, qualifier, newStuff); // placeholder family/qualifier/value
> >>>     context.write(rowId, put); // using TableOutputFormat
> >>>   }
> >>> }
> >>>
> >>> I don't expect to see the new data I just wrote, but it needs to be there 
> >>> for later map-reduce processes.
> >>>
> >>> When I run this, the scanner iterator fails to return all rows, without an 
> >>> exception or any other indication of failure.
> >>>
> >>> I could write the new data to an interim location and write it to the 
> >>> table during the Reduce phase, but this seems inefficient since I'm not 
> >>> actually doing any reduction.
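> >>>
> >>> (If I do end up pushing the writes to the reduce phase, I assume the job wiring
> >>> would look roughly as below. This is an untested sketch based on my reading of
> >>> the 0.20 mapreduce helpers; the table name and caching value are placeholders.)
> >>>
> >>> import org.apache.hadoop.hbase.HBaseConfiguration;
> >>> import org.apache.hadoop.hbase.client.Put;
> >>> import org.apache.hadoop.hbase.client.Scan;
> >>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> >>> import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;
> >>> import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
> >>> import org.apache.hadoop.mapreduce.Job;
> >>>
> >>> public class ReducePhaseWriteJob {
> >>>   public static void main(String[] args) throws Exception {
> >>>     Job job = new Job(new HBaseConfiguration(), "rewrite-table");
> >>>     job.setJarByClass(ReducePhaseWriteJob.class);
> >>>     Scan scan = new Scan();
> >>>     scan.setCaching(10);
> >>>     // TestMapper (above) scans "my_table" and emits Puts; the identity reducer
> >>>     // applies them back to the same table after all map-side scanning is done.
> >>>     TableMapReduceUtil.initTableMapperJob("my_table", scan, TestMapper.class,
> >>>         ImmutableBytesWritable.class, Put.class, job);
> >>>     TableMapReduceUtil.initTableReducerJob("my_table", IdentityTableReducer.class, job);
> >>>     job.waitForCompletion(true);
> >>>   }
> >>> }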
> >>>
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Ryan Rawson [mailto:ryano...@gmail.com]
> >>> Sent: Thursday, September 30, 2010 11:27 AM
> >>> To: user@hbase.apache.org
> >>> Subject: Re: Is it "legal" to write to the same HBase table you're 
> >>> scanning?
> >>>
> >>> Something else is going on, since TableInputFormat and
> >>> TableOutputFormat in the same map reduce are not concurrent... the
> >>> maps run, then the reduces run, and there is no overlap.  A feature of
> >>> mapreduce.
> >>>
> >>> So if you were expecting to see the rows you were 'just writing'
> >>> during your map phase, you won't, alas.
> >>>
> >>> -ryan
> >>>
> >>> On Thu, Sep 30, 2010 at 11:11 AM, Curt Allred <c...@mediosystems.com> 
> >>> wrote:
> >>>> I can't find any documentation which says you shouldn't write to the same 
> >>>> HBase table you're scanning.  But it doesn't seem to work...  I have a 
> >>>> mapper (subclass of TableMapper) which scans a table, and for each row 
> >>>> encountered during the scan, it updates a column of the row, writing it 
> >>>> back to the same table immediately (using TableOutputFormat).  
> >>>> Occasionally the scanner ends without completing the scan.  No exception 
> >>>> is thrown, no indication of failure; it just says it's done when it hasn't 
> >>>> actually returned all the rows.  This happens even if the scan has no 
> >>>> timestamp specification or filters.  It seems to only happen when I use 
> >>>> a cache size greater than 1 (hbase.client.scanner.caching).  This 
> >>>> behavior is also repeatable using an HTable outside of a map-reduce job.
> >>>>
> >>>> The following blog entry implies that it might be risky, or worse, 
> >>>> socially unacceptable :)
> >>>>
> >>>> http://www.larsgeorge.com/2009/01/how-to-use-hbase-with-hadoop.html:
> >>>>
> >>>>  > Again, I have cases where I do not output but save back to
> >>>>  > HBase. I could easily write the records back into HBase in
> >>>>  > the reduce step, so why pass them on first? I think this is
> >>>>  > in some cases just common sense or being a "good citizen".
> >>>>  > ...
> >>>>  > Writing back to the same HBase table is OK when doing it in
> >>>>  > the Reduce phase as all scanning has concluded in the Map
> >>>>  > phase beforehand, so all rows and columns are saved to an
> >>>>  > intermediate Hadoop SequenceFile internally and when you
> >>>>  > process these and write back to HBase you have no problems
> >>>>  > that there is still a scanner for the same job open reading
> >>>>  > the table.
> >>>>  >
> >>>>  > Otherwise it is OK to write to a different HBase table even
> >>>>  > during the Map phase.
> >>>>
> >>>> But I also found a JIRA issue which indicates it "should" work, but 
> >>>> there was a bug a while back which was fixed:
> >>>>
> >>>>  > https://issues.apache.org/jira/browse/HBASE-810: Prevent temporary
> >>>>  > deadlocks when, during a scan with write operations, the region splits
> >>>>
> >>>> Anyone else writing while scanning? Or know of documentation that 
> >>>> addresses this case?
> >>>>
> >>>> Thanks,
> >>>> -Curt
> >>>>
> >>>>
> >>>
> >>
> >
