Here is a grep of the metrics dump:

2010-07-13 02:22:45,818 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=305.0, regions=14, stores=167, storefiles=287, storefileIndexSize=54, memstoreSize=489, compactionQueueSize=1, usedHeap=488, maxHeap=2043, blockCacheSize=5800680, blockCacheFree=422830968, blockCacheCount=244, blockCacheHitRatio=29
2010-07-13 02:22:48,286 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=14, stores=167, storefiles=287, storefileIndexSize=54, memstoreSize=489, compactionQueueSize=1, usedHeap=491, maxHeap=2043, blockCacheSize=5800680, blockCacheFree=422830968, blockCacheCount=244, blockCacheHitRatio=29
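For anyone reproducing this, lines like the two above can be extracted with something along these lines (the log file name here is a placeholder; point it at the actual regionserver log):

```shell
# "hbase-regionserver.log" is a placeholder file name.
# Pull the "Dump of metrics" lines and extract the block cache hit ratio.
grep "Dump of metrics" hbase-regionserver.log |
  sed -n 's/.*blockCacheHitRatio=\([0-9]*\).*/\1/p'
```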

I logged all GC activity and the longest pause was 8.8 seconds, but most of them are not that long. I used the "-XX:+UseConcMarkSweepGC" flag in the JVM options, so GC doesn't look like the problem.
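For reference, this is roughly how the GC logging was set up alongside CMS. The hbase-env.sh placement and log path are assumptions; adjust for your deployment:

```shell
# Hypothetical hbase-env.sh additions: CMS collector plus verbose GC logging,
# so long pauses can be correlated with regionserver timeouts in the logs.
export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:/var/log/hbase/gc-hbase.log"
```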

I do notice that the disk usage is pretty high. I am thinking that our problem is probably a hardware limit, but the server should not crash when the hardware limit is reached.

Do you have any idea when the CDH3 official release will be out?

Jimmy

--------------------------------------------------
From: "Jean-Daniel Cryans" <[email protected]>
Sent: Tuesday, July 13, 2010 2:55 PM
To: <[email protected]>
Subject: Re: regionserver crash under heavy load

Please use a pasting service for the log traces. I personally use pastebin.com

You probably had a GC pause that lasted too long; this is something out of
the control of the application (apart from trying to keep as little data
in memory as possible, but you are inserting, so...). Your log doesn't
contain enough information for us to tell; please look for a "Dump of
metrics" line and paste the lines around it.

J-D

On Tue, Jul 13, 2010 at 2:49 PM, Jinsong Hu <[email protected]> wrote:
Hi, Todd:
 I downloaded hadoop-0.20.2+320 and hbase-0.89.20100621+17 from CDH3 and
inserted data at full load; after a while the HBase regionserver crashed.
I checked the system with "iostat -x 5" and noticed the disk was pretty busy.
Then I modified my client code to reduce the insertion rate by a factor of 6,
and the test ran fine. Is there any way the regionserver could be modified so
that at least it doesn't crash under heavy load? I used the Apache HBase
0.20.5 distribution and the same problem happened. I am thinking that when
the regionserver is too busy, it should throttle the incoming data rate to
protect the server. Could this be done?
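On the throttling point, one option that doesn't require regionserver changes is rate-limiting on the client side. Below is a minimal fixed-window sketch; SimpleThrottle and all names are hypothetical, not an HBase API, and the rates are placeholders:

```java
// Hypothetical client-side throttle: caps operations per time window so a
// busy regionserver gets breathing room. Not part of any HBase API.
public class SimpleThrottle {
    private final int maxPerWindow;   // max operations allowed per window
    private final long windowMillis;  // window length in milliseconds
    private long windowStart;
    private int count;
    private boolean started = false;

    public SimpleThrottle(int maxPerWindow, long windowMillis) {
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = windowMillis;
    }

    /** Returns true if the caller may proceed at time nowMillis. */
    public boolean tryAcquire(long nowMillis) {
        if (!started || nowMillis - windowStart >= windowMillis) {
            started = true;
            windowStart = nowMillis;  // start a fresh window
            count = 0;
        }
        if (count < maxPerWindow) {
            count++;
            return true;
        }
        return false;  // caller should back off (e.g. sleep and retry)
    }

    public static void main(String[] args) throws InterruptedException {
        SimpleThrottle throttle = new SimpleThrottle(1000, 1000); // ~1000 ops/sec
        for (int i = 0; i < 5000; i++) {
            while (!throttle.tryAcquire(System.currentTimeMillis())) {
                Thread.sleep(5);  // back off until the next window opens
            }
            // table.put(...) would go here
        }
        System.out.println("done");
    }
}
```

A fixed window is crude (it allows bursts at window edges); a token bucket smooths that out, but the idea is the same: cap client throughput below what "iostat -x 5" says the disks can sustain.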
Do you also know when the CDH3 official release will come out? The one I
downloaded is a beta version.

Jimmy






2010-07-13 02:24:34,389 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed Spam_MsgEventTable,56-2010-05-19 10:09:02\x099a420f4f31748828fd24aeea1d06b294,1278973678315.01dd22f517dabf53ddd135709b68ba6c.
2010-07-13 02:24:34,389 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at: m0002029.ppops.net,60020,1278969481450
2010-07-13 02:24:34,389 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Closed connection with ZooKeeper; /hbase/root-region-server
2010-07-13 02:24:34,389 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 exiting
2010-07-13 02:24:34,608 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-10,5,main]
2010-07-13 02:24:34,608 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook thread.
2010-07-13 02:24:34,608 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/.logs/m0002029.ppops.net,60020,1278969481450/10.110.24.79%3A60020.1278987220794 : java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: Error Recovery for block blk_-1605696159279298313_2395924 failed because recovery from primary datanode 10.110.24.80:50010 failed 6 times.  Pipeline was 10.110.24.80:50010. Aborting...
java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: Error Recovery for block blk_-1605696159279298313_2395924 failed because recovery from primary datanode 10.110.24.80:50010 failed 6 times.  Pipeline was 10.110.24.80:50010. Aborting...
      at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3214)
      at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
      at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:944)
      at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:124)
      at org.apache.hadoop.hbase.regionserver.wal.HLog.hflush(HLog.java:826)
      at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1004)
      at org.apache.hadoop.hbase.regionserver.wal.HLog.append(HLog.java:817)
      at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchPut(HRegion.java:1531)
      at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1447)
      at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1703)
      at org.apache.hadoop.hbase.regionserver.HRegionServer.multiPut(HRegionServer.java:2361)
      at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
      at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919)
2010-07-13 02:24:34,610 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/Spam_MsgEventTable/079c7de876422e57e5f09fef5d997e06/.tmp/6773658134549268273 : java.io.IOException: All datanodes 10.110.24.80:50010 are bad. Aborting...
java.io.IOException: All datanodes 10.110.24.80:50010 are bad. Aborting...
      at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2603)
      at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2139)
      at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2306)
2010-07-13 02:24:34,729 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished.

