On Aug 18, 2011, at 8:22 PM, Stack wrote:

> On Fri, Aug 19, 2011 at 12:05 AM, Joseph Pallas
> <[email protected]> wrote:
>> The test program has multiple client threads, each of which is performing a 
>> stream of operations (it's actually a custom workload running in the YCSB 
>> framework).  The program is keeping track of data that was inserted by write 
>> operations, and subsequent read operations only retrieve data that was 
>> previously written.  The read operation involves first doing a 
>> HTableInterface.exists call on a row/cf/qual that is expected to exist.  It 
>> is this exists call that we have seen fail.  When the failure occurs, the 
>> client reports an exception and stops.  Then we examine the data using the 
>> HBase shell, and the item we were looking for is there: the exists call 
>> should have succeeded.  Furthermore, the item has a timestamp that shows it 
>> really was inserted several minutes previously—it was not inserted right 
>> around the time of the failure (which might happen if there were a race 
>> condition of some sort in our client).
>> 
> 
> OK.  The exists call is rarely used I'd say which may be why you are
> seeing something we don't.

Yeah, I was concerned about this as well.  It looks like the server-side 
implementation of exists really just does a get and checks whether the result 
is empty.  But there could be something more subtle there.
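
For reference, a minimal sketch of that reading of exists.  This is a toy 
in-memory model, not HBase code: ExistsSketch, its put/get/exists methods, and 
the map that stands in for a table are all hypothetical names used only for 
illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a table (hypothetical, not the HBase API).
final class ExistsSketch {
    // row -> (column -> value)
    private final Map<String, Map<String, String>> rows = new HashMap<>();

    void put(String row, String column, String value) {
        rows.computeIfAbsent(row, r -> new HashMap<>()).put(column, value);
    }

    // Returns the requested cell, or an empty result if it is not found.
    Map<String, String> get(String row, String column) {
        Map<String, String> cols = rows.get(row);
        if (cols == null || !cols.containsKey(column)) {
            return Collections.emptyMap();
        }
        return Collections.singletonMap(column, cols.get(column));
    }

    // exists is just "do the get and see whether anything came back".
    boolean exists(String row, String column) {
        return !get(row, column).isEmpty();
    }

    public static void main(String[] args) {
        ExistsSketch table = new ExistsSketch();
        table.put("row1", "cf:qual", "value");
        System.out.println(table.exists("row1", "cf:qual"));
        System.out.println(table.exists("row2", "cf:qual"));
    }
}
```

If exists really is this thin, then anything that makes the underlying get come 
back empty would show up as a false from exists, with no separate error.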

> Well, we can do a transaction that involved multiple rows.  Currently
> (as I'm sure you know by now), the steps are:
> 
> 1. close region (NSRE if anyone asks for the region after close)
> 2. offline region in edit (still NSRE'ing)
> 3. Open Daughters in parallel and then in parallel update .META.
> 
> We should add daughters, daughter B first, then daughter A, and then
> offline parent?  If we do it in this sequence, if you are looking for
> a row in daughter A, you'll get the parent still and then a NSRE
> because it's closed.... so you'll go back to .META. and then find
> daughter A eventually.  If you are looking for a row in B and A is
> online first, you'll think it has it when it doesn't... which would be
> bad.
> 
> If we offline parent first and then add daughter B first... and we're
> looking for row in daughter A, but it's not online yet, we'll get
> WrongRegionException which would be a blast from the past... something
> we used to get in the old days but like polio, managed to eradicate
> them.

Is that what would happen?  I thought the client would throw 
RegionOfflineException if .META. says the region is offline (from 
HConnectionManager.locateRegionInMeta), and if daughter A is not added to 
.META. until it is online, then wouldn't locateRegionInMeta choose the offlined 
parent instead of daughter B?
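
To make the ordering question concrete, here is a toy model of the catalog 
lookup.  It is only a sketch under stated assumptions: that .META. rows sort by 
start key and then by region id, and that a lookup picks the closest row at or 
before the requested key.  MetaLookupSketch, Region, and the string results are 
hypothetical names and are not the HConnectionManager code; row keys are 
assumed to be simple alphanumeric strings.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy model of a catalog lookup (hypothetical, not HBase code).
final class MetaLookupSketch {
    // One catalog row: a region covering [startKey, endKey) plus an offline flag.
    static final class Region {
        final String name, startKey, endKey;
        final boolean offline;

        Region(String name, String startKey, String endKey, boolean offline) {
            this.name = name;
            this.startKey = startKey;
            this.endKey = endKey;
            this.offline = offline;
        }

        boolean contains(String row) {
            return startKey.compareTo(row) <= 0 && row.compareTo(endKey) < 0;
        }
    }

    // Assumption: catalog rows are ordered by "startKey,regionId".
    private final TreeMap<String, Region> meta = new TreeMap<>();

    void add(String startKey, long regionId, Region region) {
        meta.put(String.format(Locale.ROOT, "%s,%010d", startKey, regionId), region);
    }

    // Describes what a lookup for the given row would land on in this model.
    String locate(String row) {
        Map.Entry<String, Region> e = meta.floorEntry(row + ",9999999999");
        if (e == null) return "not found";
        Region r = e.getValue();
        if (r.offline) return r.name + ": offline";
        if (!r.contains(row)) return r.name + ": wrong region";
        return r.name;
    }

    public static void main(String[] args) {
        // Parent [a,z) offlined, daughter B [m,z) added, daughter A not yet.
        MetaLookupSketch offlineFirst = new MetaLookupSketch();
        offlineFirst.add("a", 1, new Region("parent", "a", "z", true));
        offlineFirst.add("m", 2, new Region("daughterB", "m", "z", false));
        System.out.println(offlineFirst.locate("c"));
        System.out.println(offlineFirst.locate("p"));

        // Parent still listed, daughter A [a,m) added before daughter B.
        MetaLookupSketch aFirst = new MetaLookupSketch();
        aFirst.add("a", 1, new Region("parent", "a", "z", false));
        aFirst.add("a", 2, new Region("daughterA", "a", "m", false));
        System.out.println(aFirst.locate("p"));
    }
}
```

In this model, a row in daughter A's range lands on the offlined parent when 
only daughter B has been added, and a row in daughter B's range lands on 
daughter A when A is added first.  Whether the real client behaves the same way 
depends on how closely those two assumptions match locateRegionInMeta.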

> How does this sound Joe?  We could rig you a SplitTransaction to do
> the above.  We could hack one up first and if it did away with your
> issue, we'd then spend a bit of time making sure it rolled back
> properly on fail (need to make sure rollback works properly).

The awkward part is that this happens rarely enough that I can't say with 
confidence how long I would need to test it before I could say that the problem 
is gone.  That's why I was hoping to get a good theory for what happens and to 
construct a test that forces it.

joe
