[jira] [Commented] (HDFS-12642) Log block and datanode details in BlockRecoveryWorker
[ https://issues.apache.org/jira/browse/HDFS-12642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206363#comment-16206363 ]

Hudson commented on HDFS-12642:
-------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #13090 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/13090/])
HDFS-12642. Log block and datanode details in BlockRecoveryWorker. (xiao: rev 21bc85558718490e558c5b3bdb44c9c64eada994)
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockRecoveryWorker.java

> Log block and datanode details in BlockRecoveryWorker
> -----------------------------------------------------
>
>                 Key: HDFS-12642
>                 URL: https://issues.apache.org/jira/browse/HDFS-12642
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>            Reporter: Xiao Chen
>            Assignee: Xiao Chen
>             Fix For: 2.9.0, 2.8.3, 3.0.0
>
>         Attachments: HDFS-12642.01.patch, HDFS-12642.02.patch
>
>
> In a recent investigation, we saw a weird block recovery issue that was
> difficult to reach a conclusion on because of insufficient logs.
> For the most critical part of the events, we see block recovery failed to
> {{commitBlockSynchronization}} on the NN because the block was not closed.
> This leaves the file open forever (for 1+ months).
> The block was not closed on the NN because the cluster is configured with
> {{dfs.namenode.replication.min}} = 2, and only 1 replica had the latest
> genstamp.
> We were not able to tell why only 1 replica is on the latest genstamp.
> From the primary node of the recovery (ps2204), {{initReplicaRecoveryImpl}}
> was called on each of the 7 DNs the block was ever placed on. All DNs but
> ps2204 and ps3765 failed the genstamp comparison - that's expected.
> ps2204 and ps3765 got past the comparison (since there are no exceptions in
> their logs), but {{updateReplicaUnderRecovery}} only appears to have been
> called on ps3765.
> This jira proposes logging more details when {{BlockRecoveryWorker}} is
> about to call {{updateReplicaUnderRecovery}} on the DataNodes, so this
> could be figured out in the future.
> {noformat}
> $ grep "updateReplica:" ps2204.dn.log
> $ grep "updateReplica:" ps3765.dn.log
> hadoop-hdfs-datanode-ps3765.log.2:{"@timestamp":"2017-09-13T00:56:20.933Z","source_host":"ps3765.example.com","file":"FsDatasetImpl.java","method":"updateReplicaUnderRecovery","level":"INFO","line_number":"2512","thread_name":"IPC Server handler 6 on 50020","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"updateReplica: BP-550436645-17.142.147.13-1438988035284:blk_2172795728_1106150312, recoveryId=1107074793, length=65024, replica=ReplicaUnderRecovery, blk_2172795728_1106150312, RUR
> $ grep "initReplicaRecovery:" ps2204.dn.log
> hadoop-hdfs-datanode-ps2204.log.1:{"@timestamp":"2017-09-13T00:56:20.691Z","source_host":"ps2204.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2441","thread_name":"org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@5ae3cb26","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: blk_2172795728_1106150312, recoveryId=1107074793, replica=ReplicaWaitingToBeRecovered, blk_2172795728_1106150312, RWR
> hadoop-hdfs-datanode-ps2204.log.1:{"@timestamp":"2017-09-13T00:56:20.691Z","source_host":"ps2204.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2497","thread_name":"org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@5ae3cb26","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: changing replica state for blk_2172795728_1106150312 from RWR to RUR","class":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","mdc":{}}
> $ grep "initReplicaRecovery:" ps3765.dn.log
> hadoop-hdfs-datanode-ps3765.log.2:{"@timestamp":"2017-09-13T00:56:20.457Z","source_host":"ps3765.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2441","thread_name":"IPC Server handler 5 on 50020","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: blk_2172795728_1106150312, recoveryId=1107074793, replica=ReplicaBeingWritten, blk_2172795728_1106150312, RBW
> hadoop-hdfs-datanode-ps3765.log.2:{"@timestamp":"2017-09-13T00:56:20.457Z","source_host":"ps3765.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2441","thread_name":"IPC Server handler 5 on 50020","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: blk_2172795728_1106150312, recoveryId=1107074793, replica=ReplicaBeingWritten, blk_2172795728_1106150312, RBW
> {noformat}
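The correlation being done above - joining the structured DN log lines on their JSON "method" field across hosts to see which DataNodes logged initReplicaRecovery but never updateReplica - can be sketched as a small standalone program. The input lines here are abridged stand-ins for the real DN log records, and the class name is made up for illustration:

```java
import java.util.*;
import java.util.regex.*;

// Sketch: group DN recovery log lines by the "method" JSON field, the way
// the grep-based investigation above correlates initReplicaRecovery and
// updateReplicaUnderRecovery calls across DataNodes. The sample lines are
// abridged stand-ins for the real structured DN logs.
public class RecoveryLogGrep {
    static final Pattern FIELD =
        Pattern.compile("\"(source_host|method)\":\"([^\"]+)\"");

    // Extract source_host and method from one JSON-formatted log line.
    static Map<String, String> fields(String line) {
        Map<String, String> m = new HashMap<>();
        Matcher mt = FIELD.matcher(line);
        while (mt.find()) {
            m.put(mt.group(1), mt.group(2));
        }
        return m;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "{\"source_host\":\"ps2204.example.com\",\"method\":\"initReplicaRecoveryImpl\"}",
            "{\"source_host\":\"ps3765.example.com\",\"method\":\"initReplicaRecoveryImpl\"}",
            "{\"source_host\":\"ps3765.example.com\",\"method\":\"updateReplicaUnderRecovery\"}");

        // method -> hosts that logged it; a host missing from the
        // updateReplicaUnderRecovery set is the smoking gun.
        Map<String, Set<String>> byMethod = new TreeMap<>();
        for (String line : lines) {
            Map<String, String> f = fields(line);
            byMethod.computeIfAbsent(f.get("method"), k -> new TreeSet<>())
                    .add(f.get("source_host"));
        }
        byMethod.forEach((method, hosts) ->
            System.out.println(method + " -> " + hosts));
    }
}
```

With the sample input this prints initReplicaRecoveryImpl for both hosts but updateReplicaUnderRecovery for ps3765 only, mirroring the situation described in the jira.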
[jira] [Commented] (HDFS-12642) Log block and datanode details in BlockRecoveryWorker
[ https://issues.apache.org/jira/browse/HDFS-12642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206253#comment-16206253 ]

Xiao Chen commented on HDFS-12642:
----------------------------------

Thanks Yongjun. Failed test is not related to the change. No tests needed since the change is log-only. Committing this.
[jira] [Commented] (HDFS-12642) Log block and datanode details in BlockRecoveryWorker
[ https://issues.apache.org/jira/browse/HDFS-12642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205432#comment-16205432 ]

Yongjun Zhang commented on HDFS-12642:
--------------------------------------

Makes sense [~xiaochen]. I found the failed test was reported as HDFS-12279. +2 on rev2.
[jira] [Commented] (HDFS-12642) Log block and datanode details in BlockRecoveryWorker
[ https://issues.apache.org/jira/browse/HDFS-12642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205376#comment-16205376 ]

Xiao Chen commented on HDFS-12642:
----------------------------------

Thanks [~yzhangal] for the additional review. Sorry I missed the {{blockId}} part in your initial comment, but when updating patch 2 I looked at blockId too. :)
The reason I didn't add it is that blockId is used in 2 places:
# construct {{newBlock}}
# {{updateReplicaUnderRecovery}}
Since we always log {{newBlock}} before iterating through {{participatingList}}, {{blockId}} can be read from the logged {{newBlock}} info.
[jira] [Commented] (HDFS-12642) Log block and datanode details in BlockRecoveryWorker
[ https://issues.apache.org/jira/browse/HDFS-12642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205031#comment-16205031 ]

Yongjun Zhang commented on HDFS-12642:
--------------------------------------

Hi [~xiaochen],

Thanks for the updated patch. Sorry I did not make it clear earlier: I meant to add both isTruncateRecovery and blockId in my previous comment 1, because
{code}
long blockId = (isTruncateRecovery) ? rBlock.getNewBlock().getBlockId() :
    block.getBlockId();
{code}
I suggest we make the message:
{code}
LOG.info("BlockRecoveryWorker: block={} (length={}),"
    + " isTruncateRecovery={}, blockId={} syncList={}", block,
    block.getNumBytes(), isTruncateRecovery, blockId, syncList);
{code}
Thanks.
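The suggested message uses SLF4J's parameterized {} placeholders. As a rough illustration of how that message renders, here is a standalone sketch: the format() helper mimics SLF4J substitution with only the JDK, and all values are hypothetical, shaped like the ones in the jira description (not taken from real logs):

```java
// Sketch: how the suggested parameterized log message renders. The
// format() helper is a simplified stand-in for SLF4J's "{}" substitution;
// real BlockRecoveryWorker code would call LOG.info directly.
public class RecoveryLogSketch {
    // Replace each "{}" in the template with the next argument, in order.
    static String format(String template, Object... args) {
        StringBuilder sb = new StringBuilder();
        int argIdx = 0, from = 0, at;
        while ((at = template.indexOf("{}", from)) >= 0 && argIdx < args.length) {
            sb.append(template, from, at).append(args[argIdx++]);
            from = at + 2;
        }
        return sb.append(template.substring(from)).toString();
    }

    public static void main(String[] args) {
        // Hypothetical values, shaped like those in the jira description.
        String block = "BP-550436645-17.142.147.13-1438988035284:blk_2172795728_1106150312";
        long numBytes = 65024L;
        boolean isTruncateRecovery = false;
        long blockId = 2172795728L;
        String syncList = "[...]";  // a list of BlockRecord entries in real code

        System.out.println(format("BlockRecoveryWorker: block={} (length={}),"
                + " isTruncateRecovery={}, blockId={} syncList={}",
                block, numBytes, isTruncateRecovery, blockId, syncList));
    }
}
```

A practical advantage of the {} style over string concatenation is that the arguments are only stringified when the log level is enabled.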
[jira] [Commented] (HDFS-12642) Log block and datanode details in BlockRecoveryWorker
[ https://issues.apache.org/jira/browse/HDFS-12642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205029#comment-16205029 ]

Hadoop QA commented on HDFS-12642:
----------------------------------

-1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 15s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 13m 49s | trunk passed |
| +1 | compile | 0m 51s | trunk passed |
| +1 | checkstyle | 0m 38s | trunk passed |
| +1 | mvnsite | 0m 57s | trunk passed |
| +1 | shadedclient | 10m 26s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 50s | trunk passed |
| +1 | javadoc | 0m 46s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 51s | the patch passed |
| +1 | compile | 0m 48s | the patch passed |
| +1 | javac | 0m 48s | the patch passed |
| +1 | checkstyle | 0m 35s | the patch passed |
| +1 | mvnsite | 0m 53s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 9m 53s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 54s | the patch passed |
| +1 | javadoc | 0m 45s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 108m 31s | hadoop-hdfs in the patch failed. |
| +1 | asflicense | 0m 19s | The patch does not generate ASF License warnings. |
| | | 153m 35s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.namenode.ha.TestPipelinesFailover |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:0de40f0 |
| JIRA Issue | HDFS-12642 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12892263/HDFS-12642.02.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 16dccb4d8b5a 3.13.0-123-generic #172-Ubuntu SMP Mon Jun 26 18:04:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 87ea1df |
| Default Java | 1.8.0_144 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/21703/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/21703/testReport/ |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/21703/console |
| Powered by | Apache Yetus 0.6.0-SNAPSHOT |
[jira] [Commented] (HDFS-12642) Log block and datanode details in BlockRecoveryWorker
[ https://issues.apache.org/jira/browse/HDFS-12642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16204993#comment-16204993 ]

Xiao Chen commented on HDFS-12642:
----------------------------------

Thanks Yongjun for the review. Patch 2 addresses the comments. New example logs for convenience:

{noformat}
2017-10-14 19:43:25,345 [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@40bdcea2] INFO datanode.DataNode (BlockRecoveryWorker.java:logRecoverBlock(549)) - BlockRecoveryWorker: NameNode at localhost/127.0.0.1:52885 calls recoverBlock(BP-342842897-10.0.0.51-1508035399504:blk_1073741827_1003, targets=[DatanodeInfoWithStorage[127.0.0.1:52896,null,null], DatanodeInfoWithStorage[127.0.0.1:52901,null,null], DatanodeInfoWithStorage[127.0.0.1:52888,null,null]], newGenerationStamp=1004, newBlock=null, isStriped=false)
2017-10-14 19:43:25,368 [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@40bdcea2] INFO datanode.DataNode (BlockRecoveryWorker.java:syncBlock(200)) - BlockRecoveryWorker: block=BP-342842897-10.0.0.51-1508035399504:blk_1073741827_1003 (length=952), isTruncateRecovery=false, syncList=[block:blk_1073741827_1003[numBytes=952,originalReplicaState=FINALIZED] node:DatanodeInfoWithStorage[127.0.0.1:52896,null,null], block:blk_1073741827_1003[numBytes=952,originalReplicaState=FINALIZED] node:DatanodeInfoWithStorage[127.0.0.1:52901,null,null], block:blk_1073741827_1003[numBytes=952,originalReplicaState=FINALIZED] node:DatanodeInfoWithStorage[127.0.0.1:52888,null,null]]
2017-10-14 19:43:25,369 [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@40bdcea2] INFO datanode.DataNode (BlockRecoveryWorker.java:syncBlock(291)) - BlockRecoveryWorker: block=BP-342842897-10.0.0.51-1508035399504:blk_1073741827_1003 (length=952), bestState=FINALIZED, newBlock=BP-342842897-10.0.0.51-1508035399504:blk_1073741827_1004 (length=952), participatingList=[block:blk_1073741827_1003[numBytes=952,originalReplicaState=FINALIZED] node:DatanodeInfoWithStorage[127.0.0.1:52896,null,null], block:blk_1073741827_1003[numBytes=952,originalReplicaState=FINALIZED] node:DatanodeInfoWithStorage[127.0.0.1:52901,null,null], block:blk_1073741827_1003[numBytes=952,originalReplicaState=FINALIZED] node:DatanodeInfoWithStorage[127.0.0.1:52888,null,null]]
{noformat}
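The extended-block token that recurs in these logs (e.g. {{BP-342842897-10.0.0.51-1508035399504:blk_1073741827_1003}}) encodes the block pool ID, the block ID, and the generation stamp. As a minimal illustration of pulling those fields out of a log line while triaging a recovery (this is not part of the patch; the class and method names below are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BlockLogParser {
    // Extended-block notation as it appears in the example logs above:
    //   <blockPoolId>:blk_<blockId>_<generationStamp>
    private static final Pattern EXTENDED_BLOCK =
        Pattern.compile("(BP-[^:]+):blk_(\\d+)_(\\d+)");

    /** Returns {blockPoolId, blockId, generationStamp}, or null if the
     *  line contains no extended-block token. */
    static String[] parse(String line) {
        Matcher m = EXTENDED_BLOCK.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3)};
    }

    public static void main(String[] args) {
        String[] parts = parse("BlockRecoveryWorker: block="
            + "BP-342842897-10.0.0.51-1508035399504:blk_1073741827_1003 (length=952)");
        System.out.println("pool=" + parts[0]
            + " blockId=" + parts[1] + " genStamp=" + parts[2]);
    }
}
```

Running the same pattern over a whole DataNode log makes it easy to follow a replica's genstamp progression, which is exactly the kind of correlation the investigation above had to do by hand.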
[jira] [Commented] (HDFS-12642) Log block and datanode details in BlockRecoveryWorker
[ https://issues.apache.org/jira/browse/HDFS-12642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16204542#comment-16204542 ]

Yongjun Zhang commented on HDFS-12642:
--------------------------------------

Thanks [~xiaochen] for working on this. Some comments:

1. Include isTruncateRecovery and blockId in the following message:
{code}
LOG.info("BlockRecoveryWorker: block=" + block + ", (length=" + block
    .getNumBytes() + "), syncList=" + syncList);
{code}

2. Include bestState and newBlock info in the following message:
{code}
LOG.info("BlockRecoveryWorker: block=" + block + ", (length=" + block
    .getNumBytes() + "), participatingList=" + participatingList);
{code}
For example:
{code}
LOG.info("BlockRecoveryWorker: block=" + block + ", (length=" + block
    .getNumBytes() + "), bestState=" + bestState.name() + ", newBlock="
    + newBlock + ", (length=" + newBlock.getNumBytes() + "), participatingList="
    + participatingList);
{code}

BTW, I found that using the TestLeaseRecovery code helps with observing the output. Thanks.
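The suggestion above boils down to one concatenated INFO message. A minimal sketch of the resulting string, using hypothetical stand-in values rather than real HDFS objects (the helper class below is illustrative only; the format follows the example logs shown earlier in the thread):

```java
public class SyncBlockLogDemo {
    /** Builds a message in the suggested shape. All parameters are plain
     *  stand-ins for the HDFS block/state objects, which format themselves
     *  via toString() in the real log statement. */
    static String buildMessage(String block, long blockLen, String bestState,
                               String newBlock, long newBlockLen,
                               String participatingList) {
        return "BlockRecoveryWorker: block=" + block + " (length=" + blockLen
            + "), bestState=" + bestState + ", newBlock=" + newBlock
            + " (length=" + newBlockLen + "), participatingList="
            + participatingList;
    }

    public static void main(String[] args) {
        // Values taken from the example logs in this thread.
        System.out.println(buildMessage(
            "BP-342842897-10.0.0.51-1508035399504:blk_1073741827_1003", 952,
            "FINALIZED",
            "BP-342842897-10.0.0.51-1508035399504:blk_1073741827_1004", 952,
            "[...]"));
    }
}
```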
[jira] [Commented] (HDFS-12642) Log block and datanode details in BlockRecoveryWorker
[ https://issues.apache.org/jira/browse/HDFS-12642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203062#comment-16203062 ]

Hadoop QA commented on HDFS-12642:
----------------------------------

| (x) *{color:red}-1 overall{color}* |

|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 18s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 13m 35s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 6s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 22s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 50s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 54s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 9m 43s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 49s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 97m 2s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 23s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}142m 49s{color} | {color:black} {color} |

|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestReadStripedFileWithMissingBlocks |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:3d04c00 |
| JIRA Issue | HDFS-12642 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12891634/HDFS-12642.01.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux fd594502c92b 3.13.0-129-generic #178-Ubuntu SMP Fri Aug 11 12:48:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / e46d5bb |
| Default Java | 1.8.0_144 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/21677/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/21677/testReport/ |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/21677/console |
| Powered by | Apache Yetus 0.6.0-SNAPSHOT