I'm not exactly at my wit's end, but I could use some insight from someone who is intimately familiar with region splitting. Any guidance (or clearing up of misunderstandings) would be great.
I'll start with the main point: I think there may be a bug in HBase 0.90.3 involving reads from a region while it is being split. Below, I'll explain where I think the bug might be and why I am not certain.

Here is the situation. We have seen our test program turn up a mysterious failure on a small number of occasions. The first couple of times we assumed it was a bug in our client code, put it in the "fix this later" category, and carried on. Then we stopped seeing it for a while and figured it had been fixed by some other change. But when it next appeared we looked more closely, and we could not explain what we were seeing.

The test program has multiple client threads, each performing a stream of operations (it's actually a custom workload running in the YCSB framework). The program keeps track of data inserted by write operations, and subsequent read operations only retrieve data that was previously written. A read operation first does an HTableInterface.exists call on a row/cf/qual that is expected to exist. It is this exists call that we have seen fail. When the failure occurs, the client reports an exception and stops. When we then examine the data using the HBase shell, the item we were looking for is there: the exists call should have succeeded. Furthermore, the item's timestamp shows it really was inserted several minutes earlier, not right around the time of the failure (which might happen if there were a race condition of some sort in our client).

What is interesting is that when we look at the region server's log files, the region involved is in the middle of a split at the time of the failure. Also, the key we failed on is greater than the split key. After much reading of the code in SplitTransaction and HRegionServer, I came up with a theory.
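For concreteness, here is a minimal, self-contained sketch of the read-side invariant the workload checks. The class name and methods are made up for illustration, and a ConcurrentHashMap stands in for the HBase table; in the real client the check is an HTableInterface.exists call on a row/cf/qual the tracker recorded as written.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the client's read-verify step. The workload only reads keys it
// previously wrote, so a false "exists" result for a tracked key is treated
// as a fatal error (the client reports an exception and stops).
public class ReadCheck {
    private final Map<String, byte[]> table = new ConcurrentHashMap<>();

    // Write path: store the value and (implicitly) record the key as written.
    public void write(String rowKey, byte[] value) {
        table.put(rowKey, value);
    }

    // Read path: the row is expected to exist; throw if it does not,
    // mirroring the failure mode described above.
    public void verifyExists(String rowKey) {
        if (!table.containsKey(rowKey)) {
            throw new IllegalStateException(
                "exists failed for previously written row: " + rowKey);
        }
    }
}
```

The point of the invariant is that a false return from exists can never be explained by client-side ordering: the key was recorded as written before any reader asked for it.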
When a region splits, daughter regions are created and the parent region is marked as offline/splitting in META (by MetaEditor.offlineParentInMeta). The daughter regions are brought online and added to META by SplitTransaction.openDaughterRegion and HRegionServer.postOpenDeployTasks. Later, the META entry for the original region is cleaned up. The two daughter regions are each managed in their own DaughterOpener thread.

This is where I am suspicious: if daughter A's thread updates META before daughter B's thread does, then there is a window of time during which a client calling HConnectionManager.locateRegionInMeta for a key in daughter B will see only daughter A. The client, I believe, does not check end rows in META, so it will think that daughter A is the region to handle the request.

Now, the question is: are there any circumstances under which sending that request to the wrong region (daughter A instead of daughter B) might yield incorrect results, instead of an exception? My gut says maybe, but my experiments have not yet managed to find one. I know that a path that goes through HRegion.checkRow should throw an exception. But around the point where RegionScanner is creating scanners on all the stores I get lost, and I am uncertain whether there is a case that can slip through the cracks and not call checkRow. Maybe involving MemStoreScanner?

So, in summary: I think it is definitely undesirable to have daughter A appear in META before daughter B, but I haven't figured out how that might lead to the error we encountered. I would be a lot more comfortable filing a JIRA if I could find that faulty case, or if someone could explain why I'm barking up the wrong tree. Can anyone help? Thanks.

joe
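To make the suspected client-side window concrete, here is a self-contained sketch. It is an assumption-laden model, not HBase code: META is modeled as a TreeMap from region start key to region range, and the lookup mimics what I believe locateRegionInMeta does, taking the entry with the greatest start key less than or equal to the row while not checking the region's end key. All region boundaries and key names are made up.

```java
import java.util.TreeMap;

public class MetaLookupSketch {
    // A region covers [startKey, endKey).
    public static class RegionInfo {
        public final String startKey;
        public final String endKey;
        public RegionInfo(String startKey, String endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    // Mimics the client lookup: floor entry on start key, no end-key check.
    public static RegionInfo locate(TreeMap<String, RegionInfo> meta, String row) {
        return meta.floorEntry(row).getValue();
    }

    public static void main(String[] args) {
        TreeMap<String, RegionInfo> meta = new TreeMap<>();
        // Parent ["", "zzz") has split at "m". Daughter A's thread has
        // already written ["", "m") to META; daughter B's entry ["m", "zzz")
        // has not appeared yet.
        meta.put("", new RegionInfo("", "m"));

        // Key "q" is greater than the split key, so it belongs to daughter B,
        // but the lookup returns daughter A because only A is visible and the
        // end key is never consulted.
        RegionInfo found = locate(meta, "q");
        System.out.println("located region: [" + found.startKey + ", "
            + found.endKey + ") for row q");
    }
}
```

In this model the mis-route is guaranteed during the window; whether the real server then throws (via HRegion.checkRow) or can silently return a wrong answer is exactly the open question above.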
