[ https://issues.apache.org/jira/browse/HDFS-14770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916552#comment-16916552 ]
HuangTao edited comment on HDFS-14770 at 8/27/19 9:46 AM:
----------------------------------------------------------
{code:java}
[2019-08-13T18:51:09.587+08:00] [INFO] [stripedRead-2122] : Interrupted while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/xx.xxx.xxx.xx:44044 remote=/10.198.104.232:50010]. 180000 millis timeout left.
[2019-08-13T18:51:09.588+08:00] [WARN] [StripedBlockReconstruction-441] : Failed to reconstruct striped block: BP-714356632-xx.xxx.xxx.yy-1519726836856:blk_-9223372036844346848_941534825
java.lang.NullPointerException
[2019-08-13T18:51:09.588+08:00] [INFO] [DataXceiver for client at /xx.xxx.xxx.xx:14224 [Receiving block BP-714356632-xx.xxx.xxx.yy-1519726836856:blk_-9223372036844346845_941534825]] : Exception for BP-714356632-xx.xxx.xxx.yy-1519726836856:blk_-9223372036844346845_941534825
java.io.IOException: Premature EOF from inputStream
	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:212)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:538)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:987)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:891)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:171)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:105)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:290)
	at java.lang.Thread.run(Thread.java:745)
[2019-08-13T18:51:09.589+08:00] [WARN] [DataXceiver for client at /xx.xxx.xxx.xx:14224 [Receiving block BP-714356632-xx.xxx.xxx.yy-1519726836856:blk_-9223372036844346845_941534825]] : Block BP-714356632-xx.xxx.xxx.yy-1519726836856:blk_-9223372036844346845_941534825 unfinalized and removed.
{code}
After diving deep into the code and adding a lot of logging, I found that *StripedReconstructor#BUFFER_POOL* is the root cause. When some of the StripedBlockReaders have already read data into their buffers but one StripedBlockReader (stripedRead-2122) times out, the buffers are put back into *StripedReconstructor#BUFFER_POOL* without being cleared and the recovery is terminated. The next recovery then gets buffers whose remaining() may be 0, so nothing is read into them and these dirty buffers are used to reconstruct the internal blocks. As a result, the reconstructed internal block is corrupt.
{code:java}
/**
 * Perform actual reading of bytes from block.
 */
private BlockReadStats actualReadFromBlock() throws IOException {
  // If the pooled buffer was not cleared, remaining() is 0 here,
  // the loop body never runs, and the stale contents survive.
  int len = buffer.remaining();
  int n = 0;
  while (n < len) {
    int nread = blockReader.read(buffer);
    if (nread <= 0) {
      break;
    }
    n += nread;
    stripedReader.getReconstructor().incrBytesRead(isLocal, nread);
  }
  return new BlockReadStats(n, blockReader.isShortCircuit(),
      blockReader.getNetworkDistance());
}
{code}

> Erasure Coding: reconstructed internal block is error
> -----------------------------------------------------
>
>                 Key: HDFS-14770
>                 URL: https://issues.apache.org/jira/browse/HDFS-14770
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, erasure-coding
>            Reporter: HuangTao
>            Assignee: HuangTao
>            Priority: Major
>
> We found some missing EC internal blocks were reconstructed incorrectly.
> The incorrect block (md5sum):
> {code:java}
> 73ceb92c4223ce00ff88802fdafd19a3  /data5/dfs/current/BP-XXXXXX/current/finalized/subdir6/subdir7/blk_-9223372036570159327
> {code}
> It should be:
> {code:java}
> 137e1c8483126291c2e84fc0115cfcc6  /data3/dfs/current/BP-XXXXXX/current/finalized/subdir7/subdir6/blk_-9223372036401277247
> {code}

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
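[Editor's note] The dirty-buffer failure mode described in the comment can be sketched with plain java.nio buffers. The toy pool below is a made-up stand-in for StripedReconstructor#BUFFER_POOL (in HDFS the pool does not reset a buffer when it is returned); class and method names here are illustrative, not the actual HDFS code.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

/** Minimal sketch: a pooled buffer that is returned un-cleared looks "full" to the next user. */
public class DirtyBufferDemo {
  // Toy pool; like the real one, it does not reset buffers on return.
  static final ArrayDeque<ByteBuffer> POOL = new ArrayDeque<>();

  static ByteBuffer getBuffer(int size) {
    ByteBuffer b = POOL.pollFirst();
    return (b != null) ? b : ByteBuffer.allocate(size);
  }

  static void putBuffer(ByteBuffer b) {
    POOL.addFirst(b);  // bug: buffer returned as-is, position == limit
  }

  public static void main(String[] args) {
    // First recovery: a reader fills its buffer, then the recovery aborts
    // and the buffer goes back to the pool without being cleared.
    ByteBuffer buf = getBuffer(8);
    buf.put(new byte[] {1, 2, 3, 4, 5, 6, 7, 8});  // stale data from aborted recovery
    putBuffer(buf);

    // Second recovery: len = buffer.remaining() is 0, so a read loop like
    // actualReadFromBlock() never reads a byte and the stale bytes are
    // what the decoder sees.
    ByteBuffer reused = getBuffer(8);
    System.out.println("remaining() = " + reused.remaining());  // prints 0

    // Direction of the fix: reset the buffer before it is reused.
    reused.clear();
    System.out.println("after clear(), remaining() = " + reused.remaining());  // prints 8
  }
}
```

Resetting with clear() only rewinds position and limit; if the stale bytes themselves must not leak, the fix also has to zero or fully overwrite the buffer contents before reuse.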