Thanks for the zookeeper log, I was able to pinpoint the problem more precisely. So I see this (it seems I didn't dig deep enough in the logs earlier; I wasn't expecting a 3 minute difference):
2010-07-13 21:49:17,864 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2010-07-13 22:03:46,313 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=3.3544312MB (3517376), Free=405.42056MB (425114272), Max=408.775MB (428631648), Counts: Blocks=0, Access=6580365, Hit=0, Miss=6580365, Evictions=0, Evicted=0, Ratios: Hit Ratio=0.0%, Miss Ratio=100.0%, Evicted/Run=NaN
2010-07-13 22:03:46,483 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
2010-07-13 22:03:46,404 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 908177ms for sessionid 0x229c87a7f4e0011, closing socket connection and attempting reconnect
2010-07-13 22:03:46,400 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
2010-07-13 22:03:46,313 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: IPC Server handler 2 on 60020 took 868374ms appending an edit to hlog; editcount=4274

At the same time, we can see that the zookeeper log does expire the session:

2010-07-13 21:52:00,001 - INFO [SessionTracker:zookeeperser...@315] - Expiring session 0x229c87a7f4e0011, timeout of 120000ms exceeded
2010-07-13 21:52:00,001 - INFO [SessionTracker:zookeeperser...@315] - Expiring session 0x129c87a7f980015, timeout of 120000ms exceeded
2010-07-13 21:52:00,002 - INFO [ProcessThread:-1:preprequestproces...@385] - Processed session termination for sessionid: 0x229c87a7f4e0011
2010-07-13 21:52:00,002 - INFO [ProcessThread:-1:preprequestproces...@385] - Processed session termination for sessionid: 0x129c87a7f980015

So your region server process was paused for more than 10 minutes; the logs show that gap, and the lines that follow it talk about that big pause. Are your nodes swapping?
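If pauses that long can't be avoided on this hardware, a stopgap is to give the session more slack than the 120000ms seen above. If I remember right the knob is zookeeper.session.timeout; here is a minimal sketch of setting it programmatically, assuming the 0.20-era API (it is normally set in hbase-site.xml on the region servers, and the zookeeper server caps the value it will actually grant):

import org.apache.hadoop.hbase.HBaseConfiguration;

// Minimal sketch: raise the ZK session timeout so that a long
// stop-the-world pause does not immediately expire the session.
// 300000ms is an arbitrary example value, not a recommendation.
public class SessionTimeoutSketch {
  public static void main(String[] args) {
    HBaseConfiguration conf = new HBaseConfiguration();
    conf.setInt("zookeeper.session.timeout", 300000);
    System.out.println("zookeeper.session.timeout = "
        + conf.get("zookeeper.session.timeout"));
  }
}

The trade-off is that a genuinely dead region server also takes that much longer to be noticed by the master.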
J-D

On Wed, Jul 14, 2010 at 11:48 AM, Jinsong Hu <[email protected]> wrote:
> The zookeepers are on 3 separate physical machines, colocated with the 3
> masters. I have put the logs for them here:
> http://t-collectors1.proofpoint.com:8079/zookeeper.tar.gz
> The version is also from the CDH3 June 10 release.
>
> I checked the datanode in that period; there are lots of exceptions. I have
> put it here:
> http://t-collectors1.proofpoint.com:8079/backup.tar.gz
>
> Jinsong
>
> --------------------------------------------------
> From: "Jean-Daniel Cryans" <[email protected]>
> Sent: Wednesday, July 14, 2010 11:16 AM
> To: <[email protected]>
> Subject: Re: regionserver crash under heavy load
>
>> So your region servers had their sessions expire, but I don't see any
>> sign of GC activity. Usually that points to a case where zookeeper
>> isn't able to answer requests fast enough because it is IO starved. Are
>> the zookeeper quorum members on the same nodes as the region servers?
>>
>> J-D
>>
>> On Wed, Jul 14, 2010 at 10:16 AM, Jinsong Hu <[email protected]>
>> wrote:
>>>
>>> I have uploaded the crashed log files to
>>>
>>> http://somewhere
>>> http://somewhere
>>>
>>> It includes the GC log and config files too. Both of these regionservers
>>> crashed while I was using only 3 mappers to insert records into hbase,
>>> with each task limited to a data rate of 80 records/second.
>>>
>>> The machines I use are relatively old, with only 4G of RAM, a 4-core CPU,
>>> and a 250G disk. I run the tasktracker, datanode and regionserver on them.
>>> I have 3 machines; only 1 regionserver was still up after the continuous
>>> overnight insertion. Before the test I created the hbase tables, so I
>>> started with an empty table. I saw it running fine for several hours, and
>>> this morning I checked again: 2 regionservers had died. I backed up the
>>> logs and restarted the regionservers, and then found the regionserver
>>> process is up but not listening on any port.
>>>
>>> The log doesn't show any error:
>>> 2010-07-14 16:23:45,769 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Set watcher on master address ZNode /hbase/master
>>> 2010-07-14 16:23:45,951 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Read ZNode /hbase/master got 10.110.24.48:60000
>>> 2010-07-14 16:23:45,957 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 10.110.24.48:60000 that we are up
>>> 2010-07-14 16:23:46,283 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown hook thread: Shutdownhook:regionserver60020
>>>
>>> I am using the June 10 release of the cloudera distribution for both
>>> hadoop and hbase.
>>>
>>> Jimmy
>>>
>>> --------------------------------------------------
>>> From: "Jean-Daniel Cryans" <[email protected]>
>>> Sent: Tuesday, July 13, 2010 4:24 PM
>>> To: <[email protected]>
>>> Subject: Re: regionserver crash under heavy load
>>>
>>>> Your region server doesn't look much loaded from the metrics POV. But
>>>> I specifically asked for the lines around that, not just the dump,
>>>> since they will contain the reason for the shutdown.
>>>>
>>>>> I do notice that the disk usage is pretty high. I am just thinking
>>>>> that our problem probably is a hardware limit, but the server should
>>>>> not crash when the hardware limit is reached.
>>>>
>>>> We still don't know why it crashed, and it may not even be related to
>>>> HW limits; we need those bigger log traces. Also use pastebin.com or
>>>> anything like that.
>>>>
>>>>> Do you have any idea when the CDH3 official release will be out?
>>>>
>>>> I don't work for cloudera, but IIRC the next beta for CDH3 is due for
>>>> September.
>>>>
>>>>> Jimmy
>>>>>
>>>>> --------------------------------------------------
>>>>> From: "Jean-Daniel Cryans" <[email protected]>
>>>>> Sent: Tuesday, July 13, 2010 2:55 PM
>>>>> To: <[email protected]>
>>>>> Subject: Re: regionserver crash under heavy load
>>>>>
>>>>>> Please use a pasting service for the log traces. I personally use
>>>>>> pastebin.com
>>>>>>
>>>>>> You probably had a GC that lasted too long; this is something out of
>>>>>> the control of the application (apart from trying to keep as little
>>>>>> data in memory as possible, but you are inserting, so...). Your log
>>>>>> doesn't contain enough information for us to tell, please look for a
>>>>>> "Dump of metrics" line and paste the lines around that.
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Tue, Jul 13, 2010 at 2:49 PM, Jinsong Hu <[email protected]>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hi, Todd:
>>>>>>> I downloaded hadoop-0.20.2+320 and hbase-0.89.20100621+17 from CDH3
>>>>>>> and inserted data at full load; after a while the hbase regionserver
>>>>>>> crashed. I checked the system with "iostat -x 5" and noticed the
>>>>>>> disk is pretty busy. Then I modified my client code and reduced the
>>>>>>> insertion rate by 6 times, and the test runs fine. Is there any way
>>>>>>> the regionserver could be modified so that at least it doesn't crash
>>>>>>> under heavy load? I tried the apache hbase 0.20.5 distribution and
>>>>>>> the same problem happens.
>>>>>>> I am thinking that when the regionserver is too busy, it should
>>>>>>> throttle the incoming data rate to protect itself. Could this be
>>>>>>> done? Do you also know when the CDH3 official release will come out?
>>>>>>> The one I downloaded is a beta version.
>>>>>>>
>>>>>>> Jimmy
>>>>>>>
>>>>>>> 2010-07-13 02:24:34,389 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed Spam_MsgEventTable,56-2010-05-19 10:09:02\x099a420f4f31748828fd24aeea1d06b294,1278973678315.01dd22f517dabf53ddd135709b68ba6c.
>>>>>>> 2010-07-13 02:24:34,389 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at: m0002029.ppops.net,60020,1278969481450
>>>>>>> 2010-07-13 02:24:34,389 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Closed connection with ZooKeeper; /hbase/root-region-server
>>>>>>> 2010-07-13 02:24:34,389 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 exiting
>>>>>>> 2010-07-13 02:24:34,608 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-10,5,main]
>>>>>>> 2010-07-13 02:24:34,608 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook thread.
>>>>>>> 2010-07-13 02:24:34,608 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/.logs/m0002029.ppops.net,60020,1278969481450/10.110.24.79%3A60020.1278987220794 : java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: Error Recovery for block blk_-1605696159279298313_2395924 failed because recovery from primary datanode 10.110.24.80:50010 failed 6 times. Pipeline was 10.110.24.80:50010. Aborting...
>>>>>>> java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: Error Recovery for block blk_-1605696159279298313_2395924 failed because recovery from primary datanode 10.110.24.80:50010 failed 6 times. Pipeline was 10.110.24.80:50010. Aborting...
>>>>>>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3214)
>>>>>>>        at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
>>>>>>>        at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:944)
>>>>>>>        at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
>>>>>>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:124)
>>>>>>>        at org.apache.hadoop.hbase.regionserver.wal.HLog.hflush(HLog.java:826)
>>>>>>>        at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1004)
>>>>>>>        at org.apache.hadoop.hbase.regionserver.wal.HLog.append(HLog.java:817)
>>>>>>>        at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchPut(HRegion.java:1531)
>>>>>>>        at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1447)
>>>>>>>        at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1703)
>>>>>>>        at org.apache.hadoop.hbase.regionserver.HRegionServer.multiPut(HRegionServer.java:2361)
>>>>>>>        at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
>>>>>>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
>>>>>>>        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919)
>>>>>>> 2010-07-13 02:24:34,610 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/Spam_MsgEventTable/079c7de876422e57e5f09fef5d997e06/.tmp/6773658134549268273 : java.io.IOException: All datanodes 10.110.24.80:50010 are bad. Aborting...
>>>>>>> java.io.IOException: All datanodes 10.110.24.80:50010 are bad. Aborting...
>>>>>>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2603)
>>>>>>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2139)
>>>>>>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2306)
>>>>>>> 2010-07-13 02:24:34,729 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished.
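For reference, the client-side workaround described above (cutting the insertion rate) boils down to pacing the puts. A minimal sketch, assuming the 0.20-era client API; the column family, row keys and values here are made up for illustration:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Minimal sketch of client-side throttling: pace the puts to a fixed
// records/second budget so a struggling regionserver is not overwhelmed.
public class ThrottledInserter {
  public static void main(String[] args) throws IOException, InterruptedException {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "Spam_MsgEventTable");
    int recordsPerSecond = 80;        // the per-task rate used in the test above
    long intervalMs = 1000L / recordsPerSecond;
    for (int i = 0; i < 10000; i++) {
      // "events" is a hypothetical column family, not from the thread.
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("events"), Bytes.toBytes("msg"), Bytes.toBytes("payload-" + i));
      table.put(put);
      Thread.sleep(intervalMs);       // crude fixed-rate throttle, ~12ms per put
    }
    table.flushCommits();             // flush any buffered edits
  }
}

With 3 map tasks each limited this way, the aggregate comes to roughly 240 records/second, which matches the rate the test settled on.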
