On Aug 18, 2011, at 8:22 PM, Stack wrote:

> On Fri, Aug 19, 2011 at 12:05 AM, Joseph Pallas
> <[email protected]> wrote:
>> The test program has multiple client threads, each of which is performing a
>> stream of operations (it's actually a custom workload running in the YCSB
>> framework). The program is keeping track of data that was inserted by write
>> operations, and subsequent read operations only retrieve data that was
>> previously written. The read operation involves first doing a
>> HTableInterface.exists call on a row/cf/qual that is expected to exist. It
>> is this exists call that we have seen fail. When the failure occurs, the
>> client reports an exception and stops. Then we examine the data using the
>> HBase shell, and the item we were looking for is there: the exists call
>> should have succeeded. Furthermore, the item has a timestamp that shows it
>> really was inserted several minutes previously -- it was not inserted right
>> around the time of the failure (which might happen if there were a race
>> condition of some sort in our client).
>>
>
> OK. The exists call is rarely used I'd say which may be why you are
> seeing something we don't.

Yeah, I was concerned about this as well. It looks like the server-side implementation of exists is really just a get followed by a check for whether the result is empty. But there could be something more subtle there.

> Well, we can do a transaction that involved multiple rows. Currently
> (as I'm sure you know by now), the steps are:
>
> 1. close region (NSRE if anyone asks for the region after close)
> 2. offline region in edit (still NSRE'ing)
> 3. Open Daughters in parallel and then in parallel update .META.
>
> We should add daughters, daughter B first, then daughter A, and then
> offline parent? If we do it in this sequence, if you are looking for
> a row in daughter A, you'll get the parent still and then a NSRE
> because it's closed.... so you'll go back to .META. and then find
> daughter A eventually. If you are looking for a row in B and A is
> online first, you'll think it has it when it doesn't... which would be
> bad.
>
> If we offline parent first and then add daughter B first... and we're
> looking for row in daughter A, but it's not online yet, we'll get
> WrongRegionException which would be a blast from the past... something
> we used to get in the old days but like polio, managed to eradicate
> them.

Is that what would happen? I thought the client would throw RegionOfflineException if .META. says the region is offline (from HConnectionManager.locateRegionInMeta), and if daughter A is not added to .META. until it is online, then wouldn't locateRegionInMeta choose the offlined parent instead of daughter B?

> How does this sound Joe? We could rig you a SplitTransaction to do
> the above. We could hack one up first and if it did away with your
> issue, we'd then spend a bit of time making sure it rolled back
> properly on fail (need to make sure rollback works properly).

The awkward part is that this happens rarely enough that I can't say with confidence how long I would need to test it before I could say that the problem is gone. That's why I was hoping to get a good theory for what happens and to construct a test that forces it.

joe
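
To make the lookup question above concrete, here is a rough toy model, not HBase code. It assumes the .META. lookup behaves like "find the entry with the greatest start key at or before the row", which is a simplification of what locateRegionInMeta does. The class, the method names, and the single-letter keys are all hypothetical. It models only the state where the parent has been offlined and daughter B has been added but daughter A has not (daughter A would share the parent's start key, which this model does not handle).

```java
import java.util.Map;
import java.util.TreeMap;

public class MetaLookupSketch {
    /** Hypothetical stand-in for a .META. row: a region name plus an offline flag. */
    static final class Region {
        final String name;
        final boolean offline;

        Region(String name, boolean offline) {
            this.name = name;
            this.offline = offline;
        }
    }

    /** Assumed lookup rule: choose the entry with the greatest start key <= row. */
    static String locate(TreeMap<String, Region> meta, String row) {
        Map.Entry<String, Region> entry = meta.floorEntry(row);
        if (entry == null) {
            return "no region found";
        }
        Region region = entry.getValue();
        // Mimic the client refusing to use a region that .META. marks offline.
        return region.offline
                ? "RegionOfflineException(" + region.name + ")"
                : region.name;
    }

    /** Toy .META. state: parent offlined, daughter B added, daughter A not yet added. */
    static TreeMap<String, Region> metaAfterParentOfflineAndDaughterB() {
        TreeMap<String, Region> meta = new TreeMap<>();
        meta.put("a", new Region("parent", true));      // parent starts at "a", offlined
        meta.put("m", new Region("daughterB", false));  // daughter B starts at split key "m"
        return meta;
    }

    public static void main(String[] args) {
        TreeMap<String, Region> meta = metaAfterParentOfflineAndDaughterB();
        System.out.println(locate(meta, "c")); // row that belongs to daughter A
        System.out.println(locate(meta, "q")); // row that belongs to daughter B
    }
}
```

Under these assumptions, a row in daughter A's range resolves to the offlined parent rather than to daughter B, because daughter B's start key sorts after the row. Whether the real client behaves this way depends on the actual .META. key format and scan logic, which this model does not reproduce.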
