I allocated 5 GB of heap.
I don't think OOM is the fundamental cause.
-----Original Message-----
From: "Han-Cheol Cho"<[email protected]>
To: <[email protected]>;
Cc:
Sent: 2015-04-22 (Wed) 15:32:35
Subject: RE: rolling upgrade(2.4.1 to 2.6.0) problem
Hi,
The first warning shows a JVM out-of-memory error.
Did you give the DataNode daemons enough max heap memory?
DN daemons use a max heap size of 1 GB by default, so if your DN requires more
than that, it will run into trouble.
You can check the memory consumption of your DN daemons (e.g., with the top
command) and the max heap allocated to them via the -Xmx option (e.g., with
jps -lmv).
If the max heap size is too small, you can use the HADOOP_DATANODE_OPTS variable
(e.g., HADOOP_DATANODE_OPTS="-Xmx4g") to override it.
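For example, something like this (a minimal sketch; "-Xmx4g" and the
hadoop-env.sh location are only examples and depend on your installation and
workload):

  # hadoop-env.sh (e.g., $HADOOP_HOME/etc/hadoop/hadoop-env.sh)
  # Raise the DataNode max heap without touching the other daemons:
  export HADOOP_DATANODE_OPTS="-Xmx4g ${HADOOP_DATANODE_OPTS}"

  # After restarting the DN, confirm the effective -Xmx and watch real usage:
  jps -lmv | grep DataNode            # JVM flags the DN was started with
  DN_PID=$(jps | awk '/DataNode/ {print $1}')
  top -p "$DN_PID"                    # live memory/CPU of the DN process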
Best wishes,
Han-Cheol
-----Original Message-----
From: "조주일"<[email protected]>
To: <[email protected]>;
Cc:
Sent: 2015-04-22 (Wed) 14:54:16
Subject: rolling upgrade(2.4.1 to 2.6.0) problem
My cluster:
Hadoop 2.4.1
Capacity: 1.24 PB
Used: 1.1 PB
16 DataNodes
Each node has a capacity of 65 TB, 96 TB, 80 TB, etc.
I am performing a rolling upgrade from 2.4.1 to 2.6.0.
Upgrading one DataNode takes about 40 minutes, and under-replicated blocks
appear while the upgrade is in progress.
10 nodes have completed the upgrade to 2.6.0.
At some point during the rolling upgrade of the remaining nodes, a problem
occurred: the heartbeats of many nodes (2.6.0 only) failed.
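(For each DataNode I follow roughly the standard rolling-upgrade steps; this
is only a sketch, and 40020 is my DN IPC port:)

  # Once, before starting: prepare the rolling upgrade on the NameNode
  hdfs dfsadmin -rollingUpgrade prepare
  # For each DataNode: shut it down for upgrade, install 2.6.0, restart it
  hdfs dfsadmin -shutdownDatanode <dn-host>:40020 upgrade
  hdfs dfsadmin -getDatanodeInfo <dn-host>:40020   # errors out once the DN is down
  # ... upgrade the software on <dn-host>, then start the 2.6.0 DataNode ...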
I changed the following properties, but it did not fix the problem (how I
verify the effective values is sketched below):
dfs.datanode.handler.count = 100 ---> 300, 400, 500
dfs.datanode.max.transfer.threads = 4096 ---> 8000, 10000
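This is how I double-check what the DataNodes actually pick up (a sketch; note
that hdfs getconf reads the local configuration files, not the running daemon):

  hdfs getconf -confKey dfs.datanode.handler.count
  hdfs getconf -confKey dfs.datanode.max.transfer.threads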
My guess is:
1. Something causes a delay in processing threads; I think it may be block
replication between different versions.
2. As a result, many more handlers and xceiver threads become necessary (a
sketch for counting these threads follows this list).
3. As a result, an out-of-memory error occurs, or some other problem arises on
the DataNode.
4. Heartbeats fail, and the DataNode dies.
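To see whether the DN really is running out of threads, I check roughly like
this (a sketch; the jstack grep pattern is an assumption about how the xceiver
threads are named, and ulimit shows the current shell's limit, not necessarily
the daemon's):

  DN_PID=$(jps | awk '/DataNode/ {print $1}')
  ls /proc/"$DN_PID"/task | wc -l          # total live threads in the DN JVM
  jstack "$DN_PID" | grep -c DataXceiver   # threads doing block transfers
  ulimit -u                                # per-user process/thread limit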
I found the DataNode error logs below, but I cannot determine the cause from
them. My guess is that it is caused by block replication between different
versions.
Can someone please help me?!
DATANODE LOG
--------------------------------------------------------------------------
### I observed a few thousand CLOSE_WAIT connections on the DataNode.
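(Roughly how I counted them; a sketch, where 40010 is the DN transfer port in
this cluster:)

  netstat -tan | grep ':40010' | grep -c CLOSE_WAIT
  # break the count down by remote peer:
  netstat -tan | grep ':40010' | grep CLOSE_WAIT | awk '{print $5}' | sort | uniq -c | sort -rn | head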
org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 1207ms (threshold=300ms)
2015-04-21 22:46:01,772 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode is out of memory. Will retry in 30 seconds.
java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:640)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:145)
    at java.lang.Thread.run(Thread.java:662)
2015-04-21 22:49:45,378 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: datanode-192.168.1.207:40010:DataXceiverServer:java.io.IOException: Xceiver count 8193 exceeds the limit of concurrent xcievers: 8192
    at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:140)
    at java.lang.Thread.run(Thread.java:662)
2015-04-22 01:01:25,632 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: datanode-192.168.1.207:40010:DataXceiverServer:java.io.IOException: Xceiver count 8193 exceeds the limit of concurrent xcievers: 8192
    at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:140)
    at java.lang.Thread.run(Thread.java:662)
2015-04-22 03:49:44,125 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: datanode-192.168.1.204:40010:DataXceiver error processing READ_BLOCK operation src: /192.168.2.174:45606 dst: /192.168.1.204:40010
java.io.IOException: cannot find BPOfferService for bpid=BP-1770955034-0.0.0.0-1401163460236
    at org.apache.hadoop.hdfs.server.datanode.DataNode.getDNRegistrationForBP(DataNode.java:1387)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:470)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:116)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
    at java.lang.Thread.run(Thread.java:662)
2015-04-22 05:30:28,947 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.203, datanodeUuid=654f22ef-84b3-4ecb-a959-2ea46d817c19, infoPort=40075, ipcPort=40020, storageInfo=lv=-56;cid=CID-CLUSTER;nsid=239138164;c=1404883838982):Failed to transfer BP-1770955034-0.0.0.0-1401163460236:blk_1075354042_1613403 to 192.168.2.156:40010 got
java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
    at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:405)
    at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:506)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:223)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:559)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:728)
    at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2017)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Connection reset by peer
    ... 8 more