Hi, IMHO, an upgrade *with downtime* once 2.7.1 is released is the best option left.
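Roughly, the downtime upgrade would look like the steps below. This is only a sketch based on the stock Hadoop 2.x scripts and the usual upgrade/finalize flow; adjust paths and service management to your environment.

    # stop HDFS on the whole cluster (downtime starts here)
    $HADOOP_HOME/sbin/stop-dfs.sh

    # install the new release on every node, keeping the same
    # dfs.namenode.name.dir / dfs.datanode.data.dir locations

    # start HDFS with the upgrade flag so the NameNode converts its metadata
    $HADOOP_HOME/sbin/start-dfs.sh -upgrade

    # run on the new version for a while, then make the upgrade permanent
    hdfs dfsadmin -finalizeUpgrade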
Thanks.

Drake 민영근 Ph.D
kt NexR

On Mon, Apr 27, 2015 at 5:46 PM, Nitin Pawar <[email protected]> wrote:

I had read somewhere that 2.7 has a lot of issues, so you should wait for 2.7.1, where most of them are being addressed.

On Mon, Apr 27, 2015 at 2:14 PM, 조주일 <[email protected]> wrote:

I think the heartbeat failures are caused by nodes hanging.
I found bug reports associated with this problem:

https://issues.apache.org/jira/browse/HDFS-7489
https://issues.apache.org/jira/browse/HDFS-7496
https://issues.apache.org/jira/browse/HDFS-7531
https://issues.apache.org/jira/browse/HDFS-8051

They have been fixed in 2.7.

I have no experience applying patches, and because the stability of 2.7 has not been confirmed, I cannot upgrade to it.

What do you recommend? If I do apply a patch, how should I do it? Can I patch without service downtime?

-----Original Message-----
From: "Drake민영근" <[email protected]>
To: "user" <[email protected]>; "조주일" <[email protected]>
Sent: 2015-04-24 (Fri) 17:41:59
Subject: Re: rolling upgrade(2.4.1 to 2.6.0) problem

Hi,

I think you are limited by "max user processes". See this:
https://plumbr.eu/outofmemoryerror/unable-to-create-new-native-thread
In your case, the user cannot create more than 10240 processes. In our environment, the limit is more like 65000.

I think it's worth a try. Also, if the hdfs datanode daemon's user is not root, put the limit file into /etc/security/limits.d.

Thanks.

Drake 민영근 Ph.D
kt NexR

On Fri, Apr 24, 2015 at 5:15 PM, 조주일 <[email protected]> wrote:

ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 62580
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 102400
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 10240
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

------------------------------------------------------

The Hadoop cluster was operating normally on version 2.4.1; the problems appeared with version 2.6.

For example, "Slow BlockReceiver" logs are seen often:
"org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost"

When a datanode fails and under-replicated blocks occur, the heartbeat checks of many other nodes fail as well. So I stop all the nodes and start them again, and the cluster then returns to normal.

In this regard, is there a difference between Hadoop versions 2.4 and 2.6?
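For what it's worth, the output above shows "max user processes" at 10240. Raising it for the datanode user would look roughly like this; a sketch only, where the user name "hdfs", the file name, and the values are examples that depend on your installation:

    # /etc/security/limits.d/90-hdfs.conf  (example file name)
    hdfs    soft    nproc     65536
    hdfs    hard    nproc     65536
    hdfs    soft    nofile    102400
    hdfs    hard    nofile    102400

    # verify from a fresh login of the datanode user, then restart the datanode
    su - hdfs -c 'ulimit -a'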
-----Original Message-----
From: "Drake민영근" <[email protected]>
To: "user" <[email protected]>; "조주일" <[email protected]>
Sent: 2015-04-24 (Fri) 16:58:46
Subject: Re: rolling upgrade(2.4.1 to 2.6.0) problem

Hi,

How about the ulimit settings of the user running the hdfs datanode?

Drake 민영근 Ph.D
kt NexR

On Wed, Apr 22, 2015 at 6:25 PM, 조주일 <[email protected]> wrote:

I allocated 5G. I don't think OOM is the root cause.

-----Original Message-----
From: "Han-Cheol Cho" <[email protected]>
To: <[email protected]>
Sent: 2015-04-22 (Wed) 15:32:35
Subject: RE: rolling upgrade(2.4.1 to 2.6.0) problem

Hi,

The first warning shows an out-of-memory error of the JVM.
Did you give enough max heap memory to the DataNode daemons?
DN daemons use a max heap size of 1GB by default, so if your DN requires more than that, it will be in trouble.

You can check the memory consumption of your DN daemons (e.g., with the top command) and the heap allocated to them by the -Xmx option (e.g., jps -lmv).
If the max heap size is too small, you can use the HADOOP_DATANODE_OPTS variable (e.g., HADOOP_DATANODE_OPTS="-Xmx4g") to override it.

Best wishes,
Han-Cheol

-----Original Message-----
From: "조주일" <[email protected]>
To: <[email protected]>
Sent: 2015-04-22 (Wed) 14:54:16
Subject: rolling upgrade(2.4.1 to 2.6.0) problem

My cluster is:

hadoop 2.4.1
Capacity: 1.24PB
Used: 1.1PB
16 datanodes
Each node has a capacity of 65TB, 96TB, 80TB, etc.

I proceeded with a rolling upgrade from 2.4.1 to 2.6.0.
Upgrading one datanode takes about 40 minutes, and under-replicated blocks occur while the upgrade is in progress.

10 nodes completed the upgrade to 2.6.0. Then a problem appeared at some point during the rolling upgrade of the remaining nodes: the heartbeats of many nodes (2.6.0 only) failed.

I changed the following properties, but they did not fix the problem:

dfs.datanode.handler.count = 100 ---> 300, 400, 500
dfs.datanode.max.transfer.threads = 4096 ---> 8000, 10000

My theory is:

1. Something causes a delay in processing threads; I think it may be block replication between the different versions.
2. Because of that, many more handlers and xceivers become necessary.
3. This leads to the out-of-memory error, or some other problem on the datanode.
4. Heartbeats fail, and the datanode dies.

I found the datanode errors logged below; however, I cannot determine the cause from them. My guess is that it is caused by block replication between the different versions.
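A quick way to do the heap check mentioned above (top, jps -lmv, HADOOP_DATANODE_OPTS) looks roughly like this; a sketch only, where the 4g value and the DataNode daemon user are examples:

    # how much heap was the running DataNode actually given? (look for -Xmx)
    jps -lmv | grep datanode.DataNode

    # live memory and thread usage of the DataNode process
    top -b -n 1 -p "$(jps | awk '/DataNode/ {print $1}')"

    # if the heap really is too small, override it in hadoop-env.sh
    # (4g is only an example value)
    export HADOOP_DATANODE_OPTS="-Xmx4g $HADOOP_DATANODE_OPTS"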
Someone please help me!!

DATANODE LOG
--------------------------------------------------------------------------
### I also saw a few thousand CLOSE_WAIT connections on the datanode.

org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 1207ms (threshold=300ms)

2015-04-21 22:46:01,772 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode is out of memory. Will retry in 30 seconds.
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:640)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:145)
        at java.lang.Thread.run(Thread.java:662)

2015-04-21 22:49:45,378 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: datanode-192.168.1.207:40010:DataXceiverServer:java.io.IOException: Xceiver count 8193 exceeds the limit of concurrent xcievers: 8192
        at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:140)
        at java.lang.Thread.run(Thread.java:662)

2015-04-22 01:01:25,632 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: datanode-192.168.1.207:40010:DataXceiverServer:java.io.IOException: Xceiver count 8193 exceeds the limit of concurrent xcievers: 8192
        at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:140)
        at java.lang.Thread.run(Thread.java:662)

2015-04-22 03:49:44,125 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: datanode-192.168.1.204:40010:DataXceiver error processing READ_BLOCK operation src: /192.168.2.174:45606 dst: /192.168.1.204:40010
java.io.IOException: cannot find BPOfferService for bpid=BP-1770955034-0.0.0.0-1401163460236
        at org.apache.hadoop.hdfs.server.datanode.DataNode.getDNRegistrationForBP(DataNode.java:1387)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:470)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:116)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
        at java.lang.Thread.run(Thread.java:662)

2015-04-22 05:30:28,947 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.203, datanodeUuid=654f22ef-84b3-4ecb-a959-2ea46d817c19, infoPort=40075, ipcPort=40020, storageInfo=lv=-56;cid=CID-CLUSTER;nsid=239138164;c=1404883838982):Failed to transfer BP-1770955034-0.0.0.0-1401163460236:blk_1075354042_1613403 to 192.168.2.156:40010 got
java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
        at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:405)
        at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:506)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:223)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:559)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:728)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2017)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Connection reset by peer
        ... 8 more

--
Nitin Pawar
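As an aside, the two symptoms in the log above (the CLOSE_WAIT build-up and the xceiver count near its 8192 limit) can be watched on an affected datanode with something like the following; a rough sketch only, with 40010 taken from the data-transfer port that appears in the log:

    # count CLOSE_WAIT sockets on the datanode data-transfer port (40010 in the log)
    netstat -ant | awk '$4 ~ /:40010$/ && $6 == "CLOSE_WAIT"' | wc -l

    # approximate count of active DataXceiver threads in the DataNode JVM
    jstack "$(jps | awk '/DataNode/ {print $1}')" | grep -c DataXceiver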
