What makes you say this? HBase has a lot of very short-lived garbage (like KeyValue objects that do not outlive an RPC request) and a lot of long-lived data in the memstore and the block cache. We want to avoid accumulating the short-lived garbage and at the same time leave most of the heap for the memstores and block cache. A small eden size of 512 MB or even less makes sense to me.

-- Lars
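For reference, a region-server GC setup along those lines is typically applied through HBASE_REGIONSERVER_OPTS in hbase-env.sh. The sketch below is only illustrative (it assumes the 8 GB heap discussed in this thread and the CMS collector commonly used with HBase 0.94); it is not taken from this cluster's configuration:

# hbase-env.sh (illustrative values only; tune to your own heap and workload)
# Keep the new generation small so short-lived per-RPC garbage dies young,
# and use CMS for the large old generation that holds the memstores and block cache.
export HBASE_REGIONSERVER_OPTS="-Xms8g -Xmx8g -Xmn512m \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"

The -Xmn512m here plays the same role as the -XX:NewSize=512m / -XX:MaxNewSize=512m pair quoted below.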
----- Original Message -----
From: Azuryy Yu <[email protected]>
To: [email protected]
Cc:
Sent: Tuesday, April 22, 2014 12:02 AM
Subject: Re: is my hbase cluster overloaded?

Do you still have the same issue?

Also, regarding -Xmx8000m -server -XX:NewSize=512m -XX:MaxNewSize=512m: the Eden size is too small.

On Tue, Apr 22, 2014 at 2:55 PM, Li Li <[email protected]> wrote:
> <property>
>   <name>dfs.datanode.handler.count</name>
>   <value>100</value>
>   <description>The number of server threads for the datanode.</description>
> </property>
>
> 1. namenode/master 192.168.10.48
> http://pastebin.com/7M0zzAAc
>
> $ free -m (these are the values now, after restarting hadoop and hbase, not the values when it crashed)
>                      total       used       free     shared    buffers     cached
> Mem:                 15951       3819      12131          0        509       1990
> -/+ buffers/cache:                1319      14631
> Swap:                 8191          0       8191
>
> 2. datanode/region 192.168.10.45
> http://pastebin.com/FiAw1yju
>
> $ free -m
>                      total       used       free     shared    buffers     cached
> Mem:                 15951       3627      12324          0       1516        641
> -/+ buffers/cache:                1469      14482
> Swap:                 8191          8       8183
>
> On Tue, Apr 22, 2014 at 2:29 PM, Azuryy Yu <[email protected]> wrote:
> > One big possible issue is a high concurrent request load on HDFS or
> > HBase: all datanode handlers become busy, more requests queue up, and
> > they eventually time out. You can try to increase
> > dfs.datanode.handler.count and dfs.namenode.handler.count in
> > hdfs-site.xml, then restart HDFS.
> >
> > Also, what are the JVM options of the datanodes, the namenode, and the
> > region servers? If they are all at the defaults, that can also cause
> > this issue.
> >
> > On Tue, Apr 22, 2014 at 2:20 PM, Li Li <[email protected]> wrote:
> >
> >> my cluster setup: all 6 machines are virtual machines. each machine:
> >> 4 CPU Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz, 16GB memory
> >> 192.168.10.48 namenode/jobtracker
> >> 192.168.10.47 secondary namenode
> >> 192.168.10.45 datanode/tasktracker
> >> 192.168.10.46 datanode/tasktracker
> >> 192.168.10.49 datanode/tasktracker
> >> 192.168.10.50 datanode/tasktracker
> >>
> >> hdfs logs around 20:33
> >> 192.168.10.48 namenode log http://pastebin.com/rwgmPEXR
> >> 192.168.10.45 datanode log http://pastebin.com/HBgZ8rtV (I found this datanode crashed first)
> >> 192.168.10.46 datanode log http://pastebin.com/aQ2emnUi
> >> 192.168.10.49 datanode log http://pastebin.com/aqsWrrL1
> >> 192.168.10.50 datanode log http://pastebin.com/V7C6tjpB
> >>
> >> hbase logs around 20:33
> >> 192.168.10.48 master log http://pastebin.com/2ZfeYA1p
> >> 192.168.10.45 region log http://pastebin.com/idCF2a7Y
> >> 192.168.10.46 region log http://pastebin.com/WEh4dA0f
> >> 192.168.10.49 region log http://pastebin.com/cGtpbTLz
> >> 192.168.10.50 region log http://pastebin.com/bD6h5T6p (very strange: no log at 20:33, but there are logs at 20:32 and 20:34)
> >>
> >> On Tue, Apr 22, 2014 at 12:25 PM, Ted Yu <[email protected]> wrote:
> >> > Can you post more of the data node log, around 20:33?
> >> >
> >> > Cheers
> >> >
> >> > On Mon, Apr 21, 2014 at 8:57 PM, Li Li <[email protected]> wrote:
> >> >
> >> >> hadoop 1.0
> >> >> hbase 0.94.11
> >> >>
> >> >> datanode log from 192.168.10.45. Why did it shut itself down?
> >> >>
> >> >> 2014-04-21 20:33:59,309 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-7969006819959471805_202154 received exception java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[closed]. 0 millis timeout left.
> >> >> 2014-04-21 20:33:59,310 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.10.45:50010, storageID=DS-1676697306-192.168.10.45-50010-1392029190949, infoPort=50075, ipcPort=50020):DataXceiver
> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[closed]. 0 millis timeout left.
> >> >>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
> >> >>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> >> >>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> >> >>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> >> >>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
> >> >>         at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> >> >>         at java.io.DataInputStream.read(DataInputStream.java:149)
> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:265)
> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
> >> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
> >> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
> >> >>         at java.lang.Thread.run(Thread.java:722)
> >> >> 2014-04-21 20:33:59,310 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.10.45:50010, storageID=DS-1676697306-192.168.10.45-50010-1392029190949, infoPort=50075, ipcPort=50020):DataXceiver
> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[closed]. 466924 millis timeout left.
> >> >>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
> >> >>         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:245)
> >> >>         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> >> >>         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
> >> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
> >> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
> >> >>         at java.lang.Thread.run(Thread.java:722)
> >> >> 2014-04-21 20:34:00,291 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 0
> >> >> 2014-04-21 20:34:00,404 INFO org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService: Shutting down all async disk service threads...
> >> >> 2014-04-21 20:34:00,405 INFO org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService: All async disk service threads have been shut down.
> >> >> 2014-04-21 20:34:00,413 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
> >> >> 2014-04-21 20:34:00,424 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
> >> >> /************************************************************
> >> >> SHUTDOWN_MSG: Shutting down DataNode at app-hbase-1/192.168.10.45
> >> >> ************************************************************/
> >> >>
> >> >> On Tue, Apr 22, 2014 at 11:25 AM, Ted Yu <[email protected]> wrote:
> >> >> > bq. one datanode failed
> >> >> >
> >> >> > Was the crash due to an out-of-memory error?
> >> >> > Can you post the tail of the data node log on pastebin?
> >> >> >
> >> >> > Giving us the versions of hadoop and hbase would be helpful.
> >> >> >
> >> >> > On Mon, Apr 21, 2014 at 7:39 PM, Li Li <[email protected]> wrote:
> >> >> >
> >> >> >> I have a small hbase cluster with 1 namenode, 1 secondary namenode, and 4 datanodes.
> >> >> >> The hbase master is on the same machine as the namenode, and the 4 hbase slaves are on the datanode machines.
> >> >> >> I see about 10,000 requests per second on average, and the cluster crashed. The reason I found is that one datanode failed.
> >> >> >>
> >> >> >> Each datanode has about 4 cpu cores and 10GB memory.
> >> >> >> Is my cluster overloaded?
> >> >> >>
> >> >>
> >> >
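For reference, the dfs.namenode.handler.count setting Azuryy suggests raising above is configured the same way as the dfs.datanode.handler.count property already shown in the quoted hdfs-site.xml; the value below is only an example (the Hadoop 1.x default is 10) and should be sized to the actual request load:

<!-- hdfs-site.xml: illustrative value only, not a recommendation for this cluster -->
<property>
  <name>dfs.namenode.handler.count</name>
  <value>64</value>
  <description>Number of RPC handler threads for the namenode.</description>
</property>

As noted in the thread, HDFS needs to be restarted for the change to take effect.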
