jmap -heap of app-hbase-1: why is OldSize so small?

Heap Configuration:
   MinHeapFreeRatio = 40
   MaxHeapFreeRatio = 70
   MaxHeapSize      = 8388608000 (8000.0MB)
   NewSize          = 536870912 (512.0MB)
   MaxNewSize       = 536870912 (512.0MB)
   OldSize          = 5439488 (5.1875MB)
   NewRatio         = 2
   SurvivorRatio    = 8
   PermSize         = 21757952 (20.75MB)
   MaxPermSize      = 85983232 (82.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 483196928 (460.8125MB)
   used     = 263917680 (251.69151306152344MB)
   free     = 219279248 (209.12098693847656MB)
   54.6190724126458% used
Eden Space:
   capacity = 429522944 (409.625MB)
   used     = 242509200 (231.27479553222656MB)
   free     = 187013744 (178.35020446777344MB)
   56.460127075307064% used
From Space:
   capacity = 53673984 (51.1875MB)
   used     = 21408480 (20.416717529296875MB)
   free     = 32265504 (30.770782470703125MB)
   39.88613925137363% used
To Space:
   capacity = 53673984 (51.1875MB)
   used     = 0 (0.0MB)
   free     = 53673984 (51.1875MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 2857693184 (2725.30859375MB)
   used     = 1738653912 (1658.1095809936523MB)
   free     = 1119039272 (1067.1990127563477MB)
   60.841168034923655% used
Perm Generation:
   capacity = 40685568 (38.80078125MB)
   used     = 24327624 (23.20063018798828MB)
   free     = 16357944 (15.600151062011719MB)
   59.79423465342797% used
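
A note on the OldSize question above: with CMS (the collector shown in this dump), jmap -heap reports OldSize as the JVM's initial old-generation setting, not a limit. The old generation actually in use is the "concurrent mark-sweep generation" section, currently about 2725 MB of capacity, and it can grow toward MaxHeapSize minus MaxNewSize (roughly 7488 MB here). A quick cross-check, assuming the region server's JVM pid is substituted for the placeholder (the placeholder is ours, not from the thread):

    # OGC/OC show the current old-generation capacity, OGCMX its maximum (values in KB).
    jstat -gccapacity <regionserver-pid>
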
On Tue, Apr 22, 2014 at 4:00 PM, Li Li <[email protected]> wrote:
> Region Servers
> ServerName  Start time  Load
> app-hbase-1,60020,1398141516916   Tue Apr 22 12:38:36 CST 2014
>     requestsPerSecond=448799, numberOfOnlineRegions=8, usedHeapMB=1241, maxHeapMB=7948
> app-hbase-2,60020,1398141516914   Tue Apr 22 12:38:36 CST 2014
>     requestsPerSecond=0, numberOfOnlineRegions=3, usedHeapMB=829, maxHeapMB=7948
> app-hbase-4,60020,1398141525533   Tue Apr 22 12:38:45 CST 2014
>     requestsPerSecond=217349, numberOfOnlineRegions=6, usedHeapMB=1356, maxHeapMB=7948
> app-hbase-5,60020,1398141524870   Tue Apr 22 12:38:44 CST 2014
>     requestsPerSecond=844, numberOfOnlineRegions=3, usedHeapMB=285, maxHeapMB=7948
> Total: servers: 4 requestsPerSecond=666992, numberOfOnlineRegions=20
>
> hbase-2 and 4 only have 3 regions. How can I balance them?
>
> On Tue, Apr 22, 2014 at 3:53 PM, Li Li <[email protected]> wrote:
>> hbase current statistics:
>>
>> Region Servers
>> ServerName  Start time  Load
>> app-hbase-1,60020,1398141516916   Tue Apr 22 12:38:36 CST 2014
>>     requestsPerSecond=6100, numberOfOnlineRegions=7, usedHeapMB=1201, maxHeapMB=7948
>> app-hbase-2,60020,1398141516914   Tue Apr 22 12:38:36 CST 2014
>>     requestsPerSecond=1770, numberOfOnlineRegions=4, usedHeapMB=224, maxHeapMB=7948
>> app-hbase-4,60020,1398141525533   Tue Apr 22 12:38:45 CST 2014
>>     requestsPerSecond=3445, numberOfOnlineRegions=5, usedHeapMB=798, maxHeapMB=7948
>> app-hbase-5,60020,1398141524870   Tue Apr 22 12:38:44 CST 2014
>>     requestsPerSecond=57, numberOfOnlineRegions=2, usedHeapMB=328, maxHeapMB=7948
>> Total: servers: 4 requestsPerSecond=11372, numberOfOnlineRegions=18
>>
>> On Tue, Apr 22, 2014 at 3:40 PM, Li Li <[email protected]> wrote:
>>> I have restarted the server and it is running now. Maybe an hour later
>>> the load will become high.
>>>
>>> On Tue, Apr 22, 2014 at 3:02 PM, Azuryy Yu <[email protected]> wrote:
>>>> Do you still have the same issue?
>>>>
>>>> Also, with:
>>>> -Xmx8000m -server -XX:NewSize=512m -XX:MaxNewSize=512m
>>>> the Eden size is too small.
>>>>
>>>> On Tue, Apr 22, 2014 at 2:55 PM, Li Li <[email protected]> wrote:
>>>>> <property>
>>>>>   <name>dfs.datanode.handler.count</name>
>>>>>   <value>100</value>
>>>>>   <description>The number of server threads for the datanode.</description>
>>>>> </property>
>>>>>
>>>>> 1. namenode/master 192.168.10.48
>>>>> http://pastebin.com/7M0zzAAc
>>>>>
>>>>> $ free -m (these are the values after restarting hadoop and hbase just
>>>>> now, not the values when it crashed)
>>>>>                    total     used     free   shared  buffers   cached
>>>>> Mem:               15951     3819    12131        0      509     1990
>>>>> -/+ buffers/cache:           1319    14631
>>>>> Swap:               8191        0     8191
>>>>>
>>>>> 2. datanode/region 192.168.10.45
>>>>> http://pastebin.com/FiAw1yju
>>>>>
>>>>> $ free -m
>>>>>                    total     used     free   shared  buffers   cached
>>>>> Mem:               15951     3627    12324        0     1516      641
>>>>> -/+ buffers/cache:           1469    14482
>>>>> Swap:               8191        8     8183
>>>>>
>>>>> On Tue, Apr 22, 2014 at 2:29 PM, Azuryy Yu <[email protected]> wrote:
>>>>> > One big possible issue is a high rate of concurrent requests to HDFS or
>>>>> > HBase: all datanode handlers become busy, more requests queue up, and
>>>>> > they eventually time out. You can try increasing
>>>>> > dfs.datanode.handler.count and dfs.namenode.handler.count in
>>>>> > hdfs-site.xml, then restart HDFS.
>>>>> >
>>>>> > Also, what JVM options do the datanodes, namenode, and region servers
>>>>> > use? If they are all at the defaults, that can also cause this issue.
>>>>> >
>>>>> > On Tue, Apr 22, 2014 at 2:20 PM, Li Li <[email protected]> wrote:
>>>>> >> My cluster setup: all 6 machines are virtual machines. Each machine has
>>>>> >> 4 CPU cores (Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz) and 16GB memory.
>>>>> >> 192.168.10.48 namenode/jobtracker
>>>>> >> 192.168.10.47 secondary namenode
>>>>> >> 192.168.10.45 datanode/tasktracker
>>>>> >> 192.168.10.46 datanode/tasktracker
>>>>> >> 192.168.10.49 datanode/tasktracker
>>>>> >> 192.168.10.50 datanode/tasktracker
>>>>> >>
>>>>> >> hdfs logs around 20:33
>>>>> >> 192.168.10.48 namenode log http://pastebin.com/rwgmPEXR
>>>>> >> 192.168.10.45 datanode log http://pastebin.com/HBgZ8rtV (I found that
>>>>> >> this datanode crashed first)
>>>>> >> 192.168.10.46 datanode log http://pastebin.com/aQ2emnUi
>>>>> >> 192.168.10.49 datanode log http://pastebin.com/aqsWrrL1
>>>>> >> 192.168.10.50 datanode log http://pastebin.com/V7C6tjpB
>>>>> >>
>>>>> >> hbase logs around 20:33
>>>>> >> 192.168.10.48 master log http://pastebin.com/2ZfeYA1p
>>>>> >> 192.168.10.45 region log http://pastebin.com/idCF2a7Y
>>>>> >> 192.168.10.46 region log http://pastebin.com/WEh4dA0f
>>>>> >> 192.168.10.49 region log http://pastebin.com/cGtpbTLz
>>>>> >> 192.168.10.50 region log http://pastebin.com/bD6h5T6p (strangely, it
>>>>> >> has no entries at 20:33, but does have entries at 20:32 and 20:34)
>>>>> >>
>>>>> >> On Tue, Apr 22, 2014 at 12:25 PM, Ted Yu <[email protected]> wrote:
>>>>> >> > Can you post more of the data node log, around 20:33?
>>>>> >> >
>>>>> >> > Cheers
>>>>> >> >
>>>>> >> > On Mon, Apr 21, 2014 at 8:57 PM, Li Li <[email protected]> wrote:
>>>>> >> >> hadoop 1.0
>>>>> >> >> hbase 0.94.11
>>>>> >> >>
>>>>> >> >> datanode log from 192.168.10.45. Why did it shut itself down?
>>>>> >> >>
>>>>> >> >> 2014-04-21 20:33:59,309 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
>>>>> >> >> writeBlock blk_-7969006819959471805_202154 received exception
>>>>> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on channel
>>>>> >> >> java.nio.channels.SocketChannel[closed]. 0 millis timeout left.
>>>>> >> >> 2014-04-21 20:33:59,310 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
>>>>> >> >> DatanodeRegistration(192.168.10.45:50010,
>>>>> >> >> storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
>>>>> >> >> infoPort=50075, ipcPort=50020):DataXceiver
>>>>> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on channel
>>>>> >> >> java.nio.channels.SocketChannel[closed]. 0 millis timeout left.
>>>>> >> >>   at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>>>>> >> >>   at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>>>>> >> >>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>>>> >> >>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>>>>> >> >>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>>>>> >> >>   at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>>>>> >> >>   at java.io.DataInputStream.read(DataInputStream.java:149)
>>>>> >> >>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:265)
>>>>> >> >>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
>>>>> >> >>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
>>>>> >> >>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
>>>>> >> >>   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
>>>>> >> >>   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
>>>>> >> >>   at java.lang.Thread.run(Thread.java:722)
>>>>> >> >> 2014-04-21 20:33:59,310 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
>>>>> >> >> DatanodeRegistration(192.168.10.45:50010,
>>>>> >> >> storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
>>>>> >> >> infoPort=50075, ipcPort=50020):DataXceiver
>>>>> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on channel
>>>>> >> >> java.nio.channels.SocketChannel[closed]. 466924 millis timeout left.
>>>>> >> >>   at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>>>>> >> >>   at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:245)
>>>>> >> >>   at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>>>>> >> >>   at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>>>>> >> >>   at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
>>>>> >> >>   at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
>>>>> >> >>   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
>>>>> >> >>   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
>>>>> >> >>   at java.lang.Thread.run(Thread.java:722)
>>>>> >> >> 2014-04-21 20:34:00,291 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
>>>>> >> >> Waiting for threadgroup to exit, active threads is 0
>>>>> >> >> 2014-04-21 20:34:00,404 INFO org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService:
>>>>> >> >> Shutting down all async disk service threads...
>>>>> >> >> 2014-04-21 20:34:00,405 INFO org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService:
>>>>> >> >> All async disk service threads have been shut down.
>>>>> >> >> 2014-04-21 20:34:00,413 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
>>>>> >> >> Exiting Datanode
>>>>> >> >> 2014-04-21 20:34:00,424 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
>>>>> >> >> SHUTDOWN_MSG:
>>>>> >> >> /************************************************************
>>>>> >> >> SHUTDOWN_MSG: Shutting down DataNode at app-hbase-1/192.168.10.45
>>>>> >> >> ************************************************************/
>>>>> >> >>
>>>>> >> >> On Tue, Apr 22, 2014 at 11:25 AM, Ted Yu <[email protected]> wrote:
>>>>> >> >> > bq. one datanode failed
>>>>> >> >> >
>>>>> >> >> > Was the crash due to an out-of-memory error?
>>>>> >> >> > Can you post the tail of the data node log on pastebin?
>>>>> >> >> >
>>>>> >> >> > Giving us the versions of hadoop and hbase would be helpful.
>>>>> >> >> >
>>>>> >> >> > On Mon, Apr 21, 2014 at 7:39 PM, Li Li <[email protected]> wrote:
>>>>> >> >> >> I have a small hbase cluster with 1 namenode, 1 secondary namenode,
>>>>> >> >> >> and 4 datanodes. The hbase master is on the same machine as the
>>>>> >> >> >> namenode, and the 4 hbase slaves are on the datanode machines.
>>>>> >> >> >> The average requests per second was about 10,000 and the cluster
>>>>> >> >> >> crashed; the reason I found is that one datanode failed.
>>>>> >> >> >>
>>>>> >> >> >> Each datanode has about 4 CPU cores and 10GB memory.
>>>>> >> >> >> Is my cluster overloaded?
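
On the balancing question earlier in the thread ("hbase-2 and 4 only have 3 regions"): HBase 0.94 can be asked to run its balancer from the HBase shell, and a single region can also be moved by hand. A minimal sketch only; it assumes a host with the hbase client configured, and the encoded region name below is a placeholder, not a value from this cluster:

    # Make sure the balancer is enabled, then ask the master for a balance pass.
    echo "balance_switch true" | hbase shell
    echo "balancer" | hbase shell
    # Or move one region explicitly to a named server (server name taken from the
    # listing above; ENCODED_REGION_NAME is a placeholder).
    echo "move 'ENCODED_REGION_NAME', 'app-hbase-2,60020,1398141516914'" | hbase shell

Note that with so few regions (about 20 across 4 servers), request load can stay skewed even when region counts are even; splitting hot regions or presplitting tables may help more than the balancer itself.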
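
On the JVM options flagged mid-thread (-Xmx8000m with NewSize/MaxNewSize fixed at 512m): region server GC settings are normally carried in HBASE_REGIONSERVER_OPTS in conf/hbase-env.sh. A sketch only; the sizes below are illustrative assumptions to show the shape of the change, not tuned values from the thread:

    # conf/hbase-env.sh -- illustrative sizes; measure GC behaviour before and after.
    export HBASE_REGIONSERVER_OPTS="-Xms8000m -Xmx8000m \
      -XX:NewSize=1024m -XX:MaxNewSize=1024m \
      -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"

Keeping -Xms equal to -Xmx avoids heap resizing, and the GC logging flags make it possible to verify whether young collections actually get cheaper after enlarging the new generation.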
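
On the handler-count advice: dfs.datanode.handler.count is already raised to 100 earlier in the thread; the namenode side is the companion setting and defaults to 10 in Hadoop 1.0. Both live in hdfs-site.xml and take effect after restarting HDFS. A sketch with an illustrative value, not one tuned for this cluster:

    <property>
      <name>dfs.namenode.handler.count</name>
      <value>64</value>
      <description>The number of server threads for the namenode.</description>
    </property>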
