So your region servers had their sessions expired, but I don't see any sign of GC activity. Usually that points to a case where ZooKeeper isn't able to answer requests fast enough because it is IO-starved. Are the ZooKeeper quorum members on the same nodes as the region servers?
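If they are, the usual advice is to give the quorum members dedicated disks (or dedicated nodes) and, as a stopgap, raise the session timeout so short IO stalls don't expire sessions. A minimal hbase-site.xml sketch; the timeout value is illustrative only and the hostnames are placeholders:

  <property>
    <name>zookeeper.session.timeout</name>
    <!-- milliseconds; illustrative value, tune for your cluster -->
    <value>120000</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <!-- ideally hosts that don't also run datanode/tasktracker/regionserver -->
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>

Keep in mind a longer timeout only masks the stalls; fixing the IO contention is the real fix.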
J-D

On Wed, Jul 14, 2010 at 10:16 AM, Jinsong Hu <[email protected]> wrote:
> I have uploaded the crashed log files to
>
> http://somewhere
> http://somewhere
>
> They include the GC log and config files too. Both of these regionservers
> crashed while I was using only 3 mappers to insert records into hbase, with
> each task limited to a data rate of 80 records/second.
>
> The machines I use are relatively old, with only 4G of RAM, a 4-core CPU,
> and a 250G disk. I run the tasktracker, datanode and regionserver on them.
> I have 3 machines, and only 1 regionserver is still up after continuous
> insertion overnight. Before the test I created the hbase tables, so I
> started with an empty table. I saw it running fine for several hours, but
> when I checked again this morning, 2 regionservers had died. I backed up
> the logs and restarted the regionservers; the processes are up, but not
> listening on any port.
>
> The log doesn't show any error:
>
> 2010-07-14 16:23:45,769 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Set watcher on master address ZNode /hbase/master
> 2010-07-14 16:23:45,951 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Read ZNode /hbase/master got 10.110.24.48:60000
> 2010-07-14 16:23:45,957 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 10.110.24.48:60000 that we are up
> 2010-07-14 16:23:46,283 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown hook thread: Shutdownhook:regionserver60020
>
> I am using the June 10 release of the Cloudera distribution for both
> hadoop and hbase.
>
> Jimmy
>
> --------------------------------------------------
> From: "Jean-Daniel Cryans" <[email protected]>
> Sent: Tuesday, July 13, 2010 4:24 PM
> To: <[email protected]>
> Subject: Re: regionserver crash under heavy load
>
>> Your region server doesn't look very loaded from the metrics POV. But I
>> specifically asked for the lines around that, not just the dump, since
>> they will contain the reason for the shutdown.
>>
>>> I do notice that the disk usage is pretty high. I am just thinking that
>>> our problem is probably a hardware limit, but the server should not
>>> crash when the hardware limit is reached.
>>
>> We still don't know why it crashed, and it may not even be related to
>> HW limits; we need those bigger log traces. Also use pastebin.com or
>> anything like that.
>>
>>> do you have any idea when the CDH3 official release will be out?
>>
>> I don't work for Cloudera, but IIRC the next beta for CDH3 is due in
>> September.
>>
>>> Jimmy
>>>
>>> --------------------------------------------------
>>> From: "Jean-Daniel Cryans" <[email protected]>
>>> Sent: Tuesday, July 13, 2010 2:55 PM
>>> To: <[email protected]>
>>> Subject: Re: regionserver crash under heavy load
>>>
>>>> Please use a pasting service for the log traces. I personally use
>>>> pastebin.com
>>>>
>>>> You probably had a GC pause that lasted too long; this is something
>>>> out of the control of the application (apart from trying to put as
>>>> little data in memory as possible, but you are inserting, so...). Your
>>>> log doesn't contain enough information for us to tell; please look for
>>>> a "Dump of metrics" line and paste the lines around it.
>>>>
>>>> J-D
>>>>
>>>> On Tue, Jul 13, 2010 at 2:49 PM, Jinsong Hu <[email protected]> wrote:
>>>>> Hi, Todd:
>>>>> I downloaded hadoop-0.20.2+320 and hbase-0.89.20100621+17 from CDH3
>>>>> and inserted data under full load; after a while the hbase
>>>>> regionserver crashed. I checked the system with "iostat -x 5" and
>>>>> noticed the disk is pretty busy. Then I modified my client code and
>>>>> reduced the insertion rate by 6 times, and the test runs fine. Is
>>>>> there any way the regionserver could be modified so that at least it
>>>>> doesn't crash under heavy load? I used the apache hbase 0.20.5
>>>>> distribution and the same problem happens. I am thinking that when
>>>>> the regionserver is too busy, it should throttle the incoming data
>>>>> rate to protect itself. Could this be done?
>>>>> Do you also know when the CDH3 official release will come out? The
>>>>> one I downloaded is a beta version.
>>>>>
>>>>> Jimmy
>>>>>
>>>>> 2010-07-13 02:24:34,389 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed Spam_MsgEventTable,56-2010-05-19 10:09:02\x099a420f4f31748828fd24aeea1d06b294,1278973678315.01dd22f517dabf53ddd135709b68ba6c.
>>>>> 2010-07-13 02:24:34,389 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at: m0002029.ppops.net,60020,1278969481450
>>>>> 2010-07-13 02:24:34,389 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Closed connection with ZooKeeper; /hbase/root-region-server
>>>>> 2010-07-13 02:24:34,389 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 exiting
>>>>> 2010-07-13 02:24:34,608 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-10,5,main]
>>>>> 2010-07-13 02:24:34,608 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook thread.
>>>>> 2010-07-13 02:24:34,608 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/.logs/m0002029.ppops.net,60020,1278969481450/10.110.24.79%3A60020.1278987220794 : java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: Error Recovery for block blk_-1605696159279298313_2395924 failed because recovery from primary datanode 10.110.24.80:50010 failed 6 times. Pipeline was 10.110.24.80:50010. Aborting...
>>>>> java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: Error Recovery for block blk_-1605696159279298313_2395924 failed because recovery from primary datanode 10.110.24.80:50010 failed 6 times. Pipeline was 10.110.24.80:50010. Aborting...
>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3214)
>>>>>     at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
>>>>>     at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:944)
>>>>>     at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:124)
>>>>>     at org.apache.hadoop.hbase.regionserver.wal.HLog.hflush(HLog.java:826)
>>>>>     at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1004)
>>>>>     at org.apache.hadoop.hbase.regionserver.wal.HLog.append(HLog.java:817)
>>>>>     at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchPut(HRegion.java:1531)
>>>>>     at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1447)
>>>>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1703)
>>>>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.multiPut(HRegionServer.java:2361)
>>>>>     at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
>>>>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919)
>>>>> 2010-07-13 02:24:34,610 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/Spam_MsgEventTable/079c7de876422e57e5f09fef5d997e06/.tmp/6773658134549268273 : java.io.IOException: All datanodes 10.110.24.80:50010 are bad. Aborting...
>>>>> java.io.IOException: All datanodes 10.110.24.80:50010 are bad. Aborting...
>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2603)
>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2139)
>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2306)
>>>>> 2010-07-13 02:24:34,729 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished.
>>>>
>>>
>>
>
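P.S. Regarding the throttling question further down the thread: the server in these versions has no graceful rate limiting for incoming writes (it blocks updates when memstores fill up, but nothing finer-grained), so clients generally rate-limit themselves. Below is a minimal sketch of the kind of client-side limit Jimmy described (~80 puts/second per task), written against the 0.20-era client API; the table, family and row names are hypothetical, for illustration only.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ThrottledInserter {
  public static void main(String[] args) throws Exception {
    // Hypothetical table name; reads the cluster config from the classpath.
    HTable table = new HTable(new HBaseConfiguration(), "test_table");
    long intervalMs = 1000 / 80; // ~12 ms between puts, i.e. roughly 80 rows/sec
    for (int i = 0; i < 100000; i++) {
      Put p = new Put(Bytes.toBytes("row-" + i));
      // Hypothetical family "f" and qualifier "q".
      p.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
      table.put(p);
      Thread.sleep(intervalMs); // crude fixed-interval throttle
    }
    table.flushCommits(); // flush anything buffered (a no-op with default auto-flush)
  }
}

A sleep per put is crude (a token bucket would smooth bursts better), but it is enough to keep a small, IO-bound cluster from being overrun.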
