Well, it took almost a month, but I finally saw this problem in the wild again (0.20.3 + HBASE-2180), and this time I think I have more info about what happened. The symptoms were the same as usual: read throughput drops drastically, the HBase client throws tons of errors, and at least one region server has lots of errors in its log. I can't figure out exactly what went wrong, but I think this is the series of events:
1. A region split occurred, and region server rts-hdfs01 opened a new region
   (users,ce5ba693689ed0648c99fb87b5eb2b53,1275964270542)
2. There was a duplicate region assignment (the new region from #1 above), and it appears that
   both rts-hdfs01 and rts-hdfs02 had that region for about 15 seconds
3. During the time that rts-hdfs01 and rts-hdfs02 had the same region, they both started a
   compaction on that region almost simultaneously
4. rts-hdfs02 detected the duplicate region assignment and closed the region while rts-hdfs01
   was still running its compaction on that region
5. rts-hdfs01 got an error that its lease expired on some compaction directory (caused by
   rts-hdfs02 closing the region, I assume), and it logged an error that the compaction failed
6. rts-hdfs01 logged lots of "Got error in response to OP_READ_BLOCK for file", "No live nodes
   contain current block", and "Cannot open filename" errors

I was able to fix the problem by simply restarting the region server on rts-hdfs01. After that
region server came back up and the master finished reassigning all the regions, everything was
back to normal. I didn't restart any other region servers, any data nodes, or the master. I'm
still not an expert in HBase internals, but it seems like this is some bug related to region
splitting and assignment of daughter regions. Maybe this is a known issue in 0.20.3?

Here are the relevant logs:

Region server log for rts-hdfs01, which is the region server that got the error about the lease
expiring and all the read errors: http://pastebin.com/0L6fS4du
The problem starts on line 38.

Region server log for rts-hdfs02, which is the region server that got the duplicate region
(users,ce5ba693689ed0648c99fb87b5eb2b53,1275964270542): http://pastebin.com/EmMtF4V8
See lines 29-36 where the duplicate region assignment occurs.

The master log: http://pastebin.com/YCa3L9e1
See lines 14-33.

Thanks,
James

On Tue, May 11, 2010 at 1:01 AM, Todd Lipcon <[email protected]> wrote:
> Hi James,
>
> I'd recommend just the following in your log4j properties to tone down the log volume:
>
> log4j.logger.org.apache.hadoop.fs.FSNamesystem.audit=WARN
> log4j.logger.org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace=WARN
>
> This will keep the INFO level logs that are very useful for debugging the issue, but shouldn't
> generate GB after GB.
>
> I'm both sorry and glad to hear your problem is reproducing every day. The bad news is you've
> got a problem, the good news is we should be able to get to the bottom of it much faster! :)
>
> -Todd
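(Side note on the two properties Todd suggests above: if you prefer to flip those loggers from
code instead of log4j.properties, the programmatic equivalent would look roughly like the sketch
below. This is just an illustration assuming the log4j 1.2 API bundled with Hadoop; the class
name is made up.)

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class QuietNoisyHdfsLoggers {
    public static void main(String[] args) {
        // Same effect as the two log4j.properties lines quoted above: keep INFO logging in
        // general, but stop the NN audit and DN clienttrace loggers from producing GB after GB.
        Logger.getLogger("org.apache.hadoop.fs.FSNamesystem.audit").setLevel(Level.WARN);
        Logger.getLogger("org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace")
              .setLevel(Level.WARN);
    }
}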
> On Mon, May 10, 2010 at 9:59 AM, James Baldassari <[email protected]> wrote:
> > Hi Todd,
> >
> > Our log files were getting to be several gigabytes in size at the INFO level (particularly
> > the datanode logs), so we changed the log level in all log4j configs to be WARN. Do you
> > think we're potentially missing some useful information at INFO and lower? I could lower
> > the log level if you think it would help.
> >
> > I did go back as far as I could in the logs to try to find out what was going on as soon as
> > the problem appeared. I didn't see anything interesting, possibly because the log level was
> > at WARN. The only two errors I see are these:
> >
> > 2010-05-06 19:15:56,176 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> > DatanodeRegistration(10.25.119.115:50010,
> > storageID=DS-2139827131-10.25.119.115-50010-1266376030910, infoPort=50075,
> > ipcPort=50020):DataXceiver
> > java.io.EOFException
> >
> > This happens once every 5 minutes like clockwork. I'm not sure if this has anything to do
> > with the problem. Then all of a sudden those errors about the block being invalid started
> > showing up:
> >
> > 2010-05-06 19:17:37,946 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
> > DatanodeRegistration(10.25.119.115:50010,
> > storageID=DS-2139827131-10.25.119.115-50010-1266376030910, infoPort=50075,
> > ipcPort=50020):Got exception while serving blk_-26696347696547536_49275 to /10.24.166.67:
> > java.io.IOException: Block blk_-26696347696547536_49275 is not valid.
> >
> > So let me know if I should change the log level and whether I need to enable NN clienttrace.
> > I don't think I'll have to wait long for this problem to reappear. It seems to be happening
> > almost every day now.
> >
> > Thanks,
> > James
> >
> > On Sat, May 8, 2010 at 6:43 PM, Todd Lipcon <[email protected]> wrote:
> > > Hi James,
> > >
> > > You'll need to go farther back in the logs to find what happened to the block that caused
> > > it to get deleted. All of the logs below are too late (the block's already gone, we need
> > > to figure out why).
> > >
> > > Can you look backwards through the past several days of the NN logs? Have you disabled
> > > the NN clienttrace log in log4j.properties?
> > >
> > > -Todd
> > >
> > > On Fri, May 7, 2010 at 11:17 PM, James Baldassari <[email protected]> wrote:
> > > > OK, these logs are huge, so I'm just going to post the first 1,000 lines from each for
> > > > now. Let me know if it would be helpful to have more. The namenode logs didn't contain
> > > > either of the strings you were interested in. A few of the datanode logs had
> > > > '4841840178880951849':
> > > >
> > > > http://pastebin.com/4MxzkxjW
> > > > http://pastebin.com/C7XErfTJ
> > > > http://pastebin.com/tn8pk2Xa
> > > >
> > > > One of the region servers had log entries about the region 73382377:
> > > >
> > > > http://pastebin.com/8uh6hQxQ
> > > >
> > > > We rebooted the name node as well as all data nodes and region servers; then we ran a
> > > > major compaction on both of our tables. The cluster seems stable at this point, but I
> > > > have a bad feeling that this problem may hit us again soon because it has happened
> > > > twice in the past two days now.
> > > >
> > > > -James
> > > >
> > > > On Sat, May 8, 2010 at 12:30 AM, Todd Lipcon <[email protected]> wrote:
> > > > > If you can grep for '4841840178880951849' as well as
> > > > > /hbase/users/73382377/data/312780071564432169 across all of your datanode logs plus
> > > > > your NN, and put that online somewhere, that would be great. If you can grep with
> > > > > -C 20 to get some context that would help as well.
> > > > >
> > > > > Grepping for the region in question (73382377) in the RS logs would also be helpful.
> > > > >
> > > > > Thanks
> > > > > -Todd
> > > > >
> > > > > On Fri, May 7, 2010 at 9:16 PM, James Baldassari <[email protected]> wrote:
> > > > > > On Sat, May 8, 2010 at 12:02 AM, Stack <[email protected]> wrote:
> > > > > > > On Fri, May 7, 2010 at 8:27 PM, James Baldassari <[email protected]> wrote:
> > > > > > > > java.io.IOException: Cannot open filename
> > > > > > > > /hbase/users/73382377/data/312780071564432169
> > > > > > >
> > > > > > > This is the regionserver log? Is this deploying the region? It fails?
> > > > > >
> > > > > > This error is on the client accessing HBase. This exception was thrown on a get
> > > > > > call to an HTable instance. I'm not sure if it was deploying the region. All I know
> > > > > > is that the system had been running with all regions available (as far as I know),
> > > > > > and then all of a sudden these errors started showing up on the client.
> > > > > >
> > > > > > > > Our cluster throughput goes from around 3k requests/second down to 500-1000 and
> > > > > > > > does not recover without manual intervention. The region server log for that
> > > > > > > > region says:
> > > > > > > >
> > > > > > > > WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.24.166.74:50010
> > > > > > > > for file /hbase/users/73382377/data/312780071564432169 for block
> > > > > > > > -4841840178880951849:java.io.IOException: Got error in response to
> > > > > > > > OP_READ_BLOCK for file /hbase/users/73382377/data/312780071564432169 for block
> > > > > > > > -4841840178880951849
> > > > > > > >
> > > > > > > > INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on 60020, call
> > > > > > > > get([...@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd, maxVersions=1,
> > > > > > > > timeRange=[0,9223372036854775807), families={(family=data, columns=ALL}) from
> > > > > > > > 10.24.117.100:2365: error: java.io.IOException: Cannot open filename
> > > > > > > > /hbase/users/73382377/data/312780071564432169
> > > > > > > > java.io.IOException: Cannot open filename
> > > > > > > > /hbase/users/73382377/data/312780071564432169
> > > > > > > >
> > > > > > > > The datanode log for 10.24.116.74 says:
> > > > > > > >
> > > > > > > > WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > > > > > DatanodeRegistration(10.24.166.74:50010,
> > > > > > > > storageID=DS-14401423-10.24.166.74-50010-1270741415211, infoPort=50075,
> > > > > > > > ipcPort=50020):
> > > > > > > > Got exception while serving blk_-4841840178880951849_50277 to /10.25.119.113:
> > > > > > > > java.io.IOException: Block blk_-4841840178880951849_50277 is not valid.
> > > > > > >
> > > > > > > What's your hadoop? Is it 0.20.2 or CDH? Any patches?
> > > > > >
> > > > > > Hadoop is vanilla CDH 2. HBase is 0.20.3 + HBASE-2180
> > > > > >
> > > > > > > > Running a major compaction on the users table fixed the problem the first time
> > > > > > > > it happened, but this time the major compaction didn't fix it, so we're
> > > > > > > > in the process of rebooting the whole cluster. I'm wondering a few things:
> > > > > > > >
> > > > > > > > 1. What could trigger this problem?
> > > > > > > > 2. Why can't the system fail over to another block/file/datanode/region server?
> > > > > > > > We're using 3x replication in HDFS, and we have 8 data nodes which double as
> > > > > > > > our region servers.
> > > > > > > > 3. Are there any best practices for achieving high availability in an HBase
> > > > > > > > cluster? How can I configure the system to gracefully (and automatically)
> > > > > > > > handle these types of problems?
> > > > > > >
> > > > > > > Let us know what your hadoop is and then we figure more on the issues above.
> > > > > >
> > > > > > If you need complete stack traces or any additional information, please let me
> > > > > > know.
> > > > > >
> > > > > > > Thanks James,
> > > > > > > St.Ack
> > > > > > > P.S. It's an eight node cluster on what kinda hw? (You've probably said in the
> > > > > > > past and I can dig through mail -- just say -- and then what kind of loading are
> > > > > > > you applying? Ditto for if you've said this already)
> > > > >
> > > > > --
> > > > > Todd Lipcon
> > > > > Software Engineer, Cloudera
> > >
> > > --
> > > Todd Lipcon
> > > Software Engineer, Cloudera
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
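For reference, the get() that blows up with "Cannot open filename" in the region server log
quoted above is just an ordinary client-side Get. Below is a rough sketch of what such a call
looks like against the 0.20.x client API -- not our actual client code; the table name, row
key, and column family are copied from the log lines, and the last call mirrors the "major
compaction on the users table" workaround mentioned earlier in the thread.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class UsersTableReadCheck {
    public static void main(String[] args) throws Exception {
        HBaseConfiguration conf = new HBaseConfiguration();

        // The same kind of read the client was issuing when the region server threw
        // "Cannot open filename ...": one row from the 'users' table, 'data' family,
        // latest version only.
        HTable users = new HTable(conf, "users");
        Get get = new Get(Bytes.toBytes("963aba6c5f351f5655abdc9db82a4cbd"));
        get.addFamily(Bytes.toBytes("data"));
        get.setMaxVersions(1);
        Result result = users.get(get);
        System.out.println("row empty? " + result.isEmpty());

        // The workaround that helped the first time this happened: kick off a major
        // compaction of the users table.
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.majorCompact("users");
    }
}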
