[jira] [Work started] (HDFS-15719) [Hadoop 3] Both NameNodes can crash simultaneously due to the short JN socket timeout
[ https://issues.apache.org/jira/browse/HDFS-15719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on HDFS-15719 started by Wei-Chiu Chuang.
--
> [Hadoop 3] Both NameNodes can crash simultaneously due to the short JN socket timeout
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-15719
>                 URL: https://issues.apache.org/jira/browse/HDFS-15719
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>            Priority: Critical
>              Labels: pull-request-available
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> In Hadoop 3 we migrated from Jetty 6 to Jetty 9. This was implemented in HADOOP-10075.
> However, HADOOP-10075 erroneously set the HttpServer2 socket idle timeout too low.
> We replaced SelectChannelConnector.setLowResourceMaxIdleTime() with ServerConnector.setIdleTimeout(), but the two are not equivalent.
> Essentially, before Hadoop 3, HttpServer2's idle timeout was the default set by Jetty 6, which is 200 seconds. After Hadoop 3, the idle timeout is set to 10 seconds, which is unreasonable for the JournalNode. If a NameNode downloads a big edit log from a JournalNode (say a few hundred MB), the transfer is likely to exceed 10 seconds. When that happens, both NameNodes crash, and there is no workaround unless you apply the patch in HADOOP-15696 to add a config switch for the idle timeout. Fortunately, it doesn't happen often.
> Proposal: bump the default idle timeout to 200 seconds to match the Jetty 6 behavior. (Jetty 9 reduces the default idle timeout to 30 seconds, which is not suitable for the JournalNode.)
> Other things to consider:
> 1. The fsck servlet? (I suspect this is related to the socket timeout reported in HDFS-7175.)
> 2. WebHDFS, HttpFS? We have also received reports that WebHDFS can time out, so a longer timeout makes sense there.
> 3. KMS? Will the longer timeout cause more lingering sockets?
> Thanks [~zhenshan.wen] for the discussion.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
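A back-of-the-envelope calculation makes the failure mode above concrete. The numbers below (a 500 MB edit log and a 40 MB/s effective transfer rate) are illustrative assumptions, not figures from the issue:

```java
public class EditLogTimeoutDemo {
    // Transfer time in seconds for a given payload size (MB) and rate (MB/s).
    static double transferSeconds(double sizeMb, double rateMbPerSec) {
        return sizeMb / rateMbPerSec;
    }

    public static void main(String[] args) {
        // Assumed: a 500 MB edit log at 40 MB/s takes 12.5 seconds.
        double t = transferSeconds(500, 40);
        // Longer than the 10 s idle timeout introduced by HADOOP-10075,
        // so the JN connection times out mid-download per the report above.
        System.out.println(t > 10);   // true
        // Comfortably within the proposed 200 s (Jetty 6) default.
        System.out.println(t < 200);  // true
    }
}
```

Under these assumptions any edit log larger than roughly 400 MB would already exceed the 10-second timeout, which matches the report that "a few hundred MB" downloads trigger the crash.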
[jira] [Work logged] (HDFS-15624) Fix the SetQuotaByStorageTypeOp problem after updating hadoop
[ https://issues.apache.org/jira/browse/HDFS-15624?focusedWorklogId=531041&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-531041 ]

ASF GitHub Bot logged work on HDFS-15624:
-----------------------------------------
 Author: ASF GitHub Bot
 Created on: 05/Jan/21 07:08
 Start Date: 05/Jan/21 07:08
 Worklog Time Spent: 10m
 Work Description: huangtianhua edited a comment on pull request #2377:
URL: https://github.com/apache/hadoop/pull/2377#issuecomment-754446695

@ayushtkn, thanks for reviewing it. HDFS-15660 supports handling storage types for older clients in a generic way, and it has been merged; or did I miss something?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------
 Worklog Id: (was: 531041)
 Time Spent: 8h (was: 7h 50m)

> Fix the SetQuotaByStorageTypeOp problem after updating hadoop
> -------------------------------------------------------------
>
>                 Key: HDFS-15624
>                 URL: https://issues.apache.org/jira/browse/HDFS-15624
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 3.4.0
>            Reporter: YaYun Wang
>            Priority: Major
>              Labels: pull-request-available, release-blocker
>          Time Spent: 8h
>  Remaining Estimate: 0h
>
> HDFS-15025 adds a new storage type, NVDIMM, which changes the ordinal() values of the StorageType enum. Setting a quota by storage type depends on ordinal(); therefore the quota settings may become invalid after an upgrade.
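The root cause described in this issue — persisting enum ordinal() values across a release that reorders the enum — can be sketched with a pair of toy enums. These are simplified stand-ins, not Hadoop's actual StorageType, and the insertion point of NVDIMM below is illustrative:

```java
public class OrdinalShiftDemo {
    // Stand-in for StorageType before HDFS-15025.
    enum OldStorageType { RAM_DISK, SSD, DISK, ARCHIVE }
    // Stand-in after HDFS-15025: inserting NVDIMM shifts later ordinals.
    enum NewStorageType { RAM_DISK, NVDIMM, SSD, DISK, ARCHIVE }

    public static void main(String[] args) {
        // A quota record persisted by the old software as ordinal 3 meant ARCHIVE...
        int persisted = OldStorageType.ARCHIVE.ordinal();             // 3
        // ...but after the upgrade, ordinal 3 resolves to a different type.
        NewStorageType resolved = NewStorageType.values()[persisted];
        System.out.println(persisted + " -> " + resolved);            // 3 -> DISK
        // A robust alternative: persist the name, which survives reordering.
        NewStorageType byName = NewStorageType.valueOf(OldStorageType.ARCHIVE.name());
        System.out.println(byName);                                   // ARCHIVE
    }
}
```

This is why the edit log op becomes invalid after the upgrade: the on-disk ordinal still decodes, but it silently points at the wrong storage type.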
[jira] [Work logged] (HDFS-15624) Fix the SetQuotaByStorageTypeOp problem after updating hadoop
[ https://issues.apache.org/jira/browse/HDFS-15624?focusedWorklogId=531040&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-531040 ]

ASF GitHub Bot logged work on HDFS-15624:
-----------------------------------------
 Author: ASF GitHub Bot
 Created on: 05/Jan/21 07:07
 Start Date: 05/Jan/21 07:07
 Worklog Time Spent: 10m
 Work Description: huangtianhua commented on pull request #2377:
URL: https://github.com/apache/hadoop/pull/2377#issuecomment-754446695

@ayushtkn, thanks for reviewing it. HDFS-15660 supports handling storage types for older clients in a generic way, and it has been merged; or did I miss something?

Issue Time Tracking
-------------------
 Worklog Id: (was: 531040)
 Time Spent: 7h 50m (was: 7h 40m)
[jira] [Work logged] (HDFS-15624) Fix the SetQuotaByStorageTypeOp problem after updating hadoop
[ https://issues.apache.org/jira/browse/HDFS-15624?focusedWorklogId=531063&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-531063 ]

ASF GitHub Bot logged work on HDFS-15624:
-----------------------------------------
 Author: ASF GitHub Bot
 Created on: 05/Jan/21 07:55
 Start Date: 05/Jan/21 07:55
 Worklog Time Spent: 10m
 Work Description: huangtianhua commented on pull request #2377:
URL: https://github.com/apache/hadoop/pull/2377#issuecomment-754470304

@ayushtkn, in fact we don't have to hold this for HDFS-15660, as Vinay said. The code here fixes the specific NVDIMM issues: it avoids storage-type-related operations during a rolling upgrade and keeps the ordinal of each storage type stable, so that the edit log/fsimage still work after the NameNode restarts. IIUC, the miniCompatLV of the NameNode layout version was introduced to refuse such operations during a rolling upgrade, so I think this approach is appropriate for the situation.

Issue Time Tracking
-------------------
 Worklog Id: (was: 531063)
 Time Spent: 8h 20m (was: 8h 10m)
[jira] [Work logged] (HDFS-15624) Fix the SetQuotaByStorageTypeOp problem after updating hadoop
[ https://issues.apache.org/jira/browse/HDFS-15624?focusedWorklogId=531046&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-531046 ]

ASF GitHub Bot logged work on HDFS-15624:
-----------------------------------------
 Author: ASF GitHub Bot
 Created on: 05/Jan/21 07:20
 Start Date: 05/Jan/21 07:20
 Worklog Time Spent: 10m
 Work Description: ayushtkn commented on pull request #2377:
URL: https://github.com/apache/hadoop/pull/2377#issuecomment-754453252

@huangtianhua no, you didn't miss it; I know that is merged. That is what I said: there were assertions earlier on the jira that we should hold this code for HDFS-15660, which would fix something or change our code here. We held this jira only because of that, so I just want to wait until it is clarified what needs to be done here post HDFS-15660. Secondly, the NameNode layout version approach had objections too, as I quoted above, and we need to reach agreement there. For me the code is good enough; once we have clarification on these things, we can conclude this.

Issue Time Tracking
-------------------
 Worklog Id: (was: 531046)
 Time Spent: 8h 10m (was: 8h)
[jira] [Work logged] (HDFS-15719) [Hadoop 3] Both NameNodes can crash simultaneously due to the short JN socket timeout
[ https://issues.apache.org/jira/browse/HDFS-15719?focusedWorklogId=531016&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-531016 ]

ASF GitHub Bot logged work on HDFS-15719:
-----------------------------------------
 Author: ASF GitHub Bot
 Created on: 05/Jan/21 04:54
 Start Date: 05/Jan/21 04:54
 Worklog Time Spent: 10m
 Work Description: jojochuang merged pull request #2533:
URL: https://github.com/apache/hadoop/pull/2533

Issue Time Tracking
-------------------
 Worklog Id: (was: 531016)
 Time Spent: 1h 20m (was: 1h 10m)
[jira] [Resolved] (HDFS-15719) [Hadoop 3] Both NameNodes can crash simultaneously due to the short JN socket timeout
[ https://issues.apache.org/jira/browse/HDFS-15719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang resolved HDFS-15719.
------------------------------------
 Fix Version/s: 3.2.3
                3.1.5
                3.4.0
                3.3.1
    Resolution: Fixed
[jira] [Work logged] (HDFS-15719) [Hadoop 3] Both NameNodes can crash simultaneously due to the short JN socket timeout
[ https://issues.apache.org/jira/browse/HDFS-15719?focusedWorklogId=531017&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-531017 ]

ASF GitHub Bot logged work on HDFS-15719:
-----------------------------------------
 Author: ASF GitHub Bot
 Created on: 05/Jan/21 04:55
 Start Date: 05/Jan/21 04:55
 Worklog Time Spent: 10m
 Work Description: jojochuang commented on pull request #2533:
URL: https://github.com/apache/hadoop/pull/2533#issuecomment-754395404

Thanks Ayush and Stephen!

Issue Time Tracking
-------------------
 Worklog Id: (was: 531017)
 Time Spent: 1.5h (was: 1h 20m)
[jira] [Work logged] (HDFS-15624) Fix the SetQuotaByStorageTypeOp problem after updating hadoop
[ https://issues.apache.org/jira/browse/HDFS-15624?focusedWorklogId=530579&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-530579 ]

ASF GitHub Bot logged work on HDFS-15624:
-----------------------------------------
 Author: ASF GitHub Bot
 Created on: 04/Jan/21 09:46
 Start Date: 04/Jan/21 09:46
 Worklog Time Spent: 10m
 Work Description: ayushtkn commented on pull request #2377:
URL: https://github.com/apache/hadoop/pull/2377#issuecomment-753872315

Thanx @huangtianhua for the work here; sorry I couldn't get back to your emails and pings. @brahmareddybattula has objections on the jira to the approach itself. Quoting him from the jira:

>I dn't think bumping the namelayout is best solution, need to check other way. ( may be like checking the client version during the upgrade.)

Is there no code change needed post HDFS-15660? It was asserted that the generic solution would solve this problem or change something, so we might need changes here post HDFS-15660. We should wait for him, unless he is convinced.

Issue Time Tracking
-------------------
 Worklog Id: (was: 530579)
 Time Spent: 7h 40m (was: 7.5h)
[jira] [Work logged] (HDFS-15549) Improve DISK/ARCHIVE movement if they are on same filesystem
[ https://issues.apache.org/jira/browse/HDFS-15549?focusedWorklogId=530583&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-530583 ]

ASF GitHub Bot logged work on HDFS-15549:
-----------------------------------------
 Author: ASF GitHub Bot
 Created on: 04/Jan/21 09:51
 Start Date: 04/Jan/21 09:51
 Worklog Time Spent: 10m
 Work Description: hadoop-yetus commented on pull request #2583:
URL: https://github.com/apache/hadoop/pull/2583#issuecomment-753874888

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 47m 34s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 1s | | No case conflicting files found. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | | 0m 0s | [test4tests](test4tests) | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +0 :ok: | mvndep | 0m 21s | | Maven dependency ordering for branch |
| -1 :x: | mvninstall | 0m 23s | [/branch-mvninstall-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/1/artifact/out/branch-mvninstall-root.txt) | root in trunk failed. |
| -1 :x: | compile | 0m 25s | [/branch-compile-root-jdkUbuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/1/artifact/out/branch-compile-root-jdkUbuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.txt) | root in trunk failed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04. |
| -1 :x: | compile | 0m 22s | [/branch-compile-root-jdkPrivateBuild-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/1/artifact/out/branch-compile-root-jdkPrivateBuild-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.txt) | root in trunk failed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01. |
| -0 :warning: | checkstyle | 0m 21s | [/buildtool-branch-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/1/artifact/out/buildtool-branch-checkstyle-root.txt) | The patch fails to run checkstyle in root |
| -1 :x: | mvnsite | 0m 24s | [/branch-mvnsite-hadoop-common-project_hadoop-common.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/1/artifact/out/branch-mvnsite-hadoop-common-project_hadoop-common.txt) | hadoop-common in trunk failed. |
| -1 :x: | mvnsite | 4m 15s | [/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/1/artifact/out/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in trunk failed. |
| -1 :x: | shadedclient | 11m 37s | | branch has errors when building and testing our client artifacts. |
| -1 :x: | javadoc | 0m 23s | [/branch-javadoc-hadoop-common-project_hadoop-common-jdkUbuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/1/artifact/out/branch-javadoc-hadoop-common-project_hadoop-common-jdkUbuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.txt) | hadoop-common in trunk failed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04. |
| -1 :x: | javadoc | 0m 29s | [/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/1/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.txt) | hadoop-hdfs in trunk failed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04. |
| -1 :x: | javadoc | 0m 24s | [/branch-javadoc-hadoop-common-project_hadoop-common-jdkPrivateBuild-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/1/artifact/out/branch-javadoc-hadoop-common-project_hadoop-common-jdkPrivateBuild-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.txt) | hadoop-common in trunk failed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01. |
| -1 :x: | javadoc | 0m 24s | [/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/1/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.txt) | hadoop-hdfs in trunk failed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01. |
| +0 :ok: | spotbugs | 14m 11s | | Used deprecated FindBugs config; considering switching to SpotBugs. |
| -1 :x: | findbugs | 0m 30s |
[jira] [Commented] (HDFS-15735) NameNode memory Leak on frequent execution of fsck
[ https://issues.apache.org/jira/browse/HDFS-15735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258131#comment-17258131 ]

Ayush Saxena commented on HDFS-15735:
-------------------------------------

Tracer is a {{private}} variable, not used anywhere, and Tracer is subject to removal due to a CVE (IIRC; see HADOOP-17387 and others, one mentioned recently too). Harmless things are not always correct: closing the tracer in fsck() could have an impact if someone uses the tracer after it (if anyone does). And closing it in the last line of fsck() may not address the issue you are fixing: the moment control leaves the method, wouldn't the tracer already be subject to GC? Closing it won't help; it only makes it eligible for GC as well. If someone uses the tracer in internal code rather than in open source, or if there is nowhere it is used here, there is no need to keep it; that person can keep it in their internal code. Removal would save the memory allocation and isn't incompatible in any way, which would be even better.

Sometimes listening to others doesn't hurt; if not me, then at least [~John Smith], who also had a comment.

Now the catch: I will still respect your opinion on this, though you aren't interested in mine :( You won't see me committing shortly unless you are convinced, in "any" jira. I don't claim I am correct here; I am just proposing something that can be done if it looks good. I can be *wrong*, even completely wrong, and would be happy to accept that. I would request you to consider the other options as well. I am happy to connect with you offline too, if you want. On this note, I take my vote back.

> NameNode memory Leak on frequent execution of fsck
> --------------------------------------------------
>
>                 Key: HDFS-15735
>                 URL: https://issues.apache.org/jira/browse/HDFS-15735
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Ravuri Sushma sree
>            Assignee: Ravuri Sushma sree
>            Priority: Major
>         Attachments: HDFS-15735.001.patch
>
> The memory of the cluster's NameNode continues to grow, and the resulting full GC eventually leads to the failure of both the active and standby NameNodes.
> HTrace is used to track the processing time of fsck.
> Checking the code, we found that the Tracer object in NamenodeFsck.java is only created and never closed; because of this, the memory footprint continues to grow.
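The leak pattern under discussion (an object created per fsck call that acquires resources on construction and only releases them on close()) and the proposed fix of closing the tracer can be sketched with a hypothetical AutoCloseable stand-in for HTrace's Tracer. The class below is illustrative, not Hadoop's actual code:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class TracerLeakDemo {
    // Hypothetical stand-in: like HTrace's Tracer, it registers itself
    // somewhere global on creation and only deregisters on close().
    static final AtomicInteger liveTracers = new AtomicInteger();

    static class Tracer implements AutoCloseable {
        Tracer() { liveTracers.incrementAndGet(); }
        @Override public void close() { liveTracers.decrementAndGet(); }
    }

    // The reported bug: each fsck call creates a Tracer and never closes it,
    // so registered instances accumulate across calls and cannot be reclaimed.
    static void leakyFsck() {
        Tracer tracer = new Tracer();
        // ... run the check ...
    }

    // The fix proposed in the attached patch, sketched here as
    // try-with-resources: the tracer is closed when fsck finishes.
    static void fixedFsck() {
        try (Tracer tracer = new Tracer()) {
            // ... run the check ...
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) leakyFsck();
        System.out.println(liveTracers.get()); // 1000: instances pile up
        liveTracers.set(0);
        for (int i = 0; i < 1000; i++) fixedFsck();
        System.out.println(liveTracers.get()); // 0: create/close balanced
    }
}
```

The global registration is the key detail: a plain unreferenced object would be garbage-collected as Ayush notes, but anything still reachable from a registry survives every GC until it is explicitly deregistered.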
[jira] [Work logged] (HDFS-15754) Create packet metrics for DataNode
[ https://issues.apache.org/jira/browse/HDFS-15754?focusedWorklogId=530640&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-530640 ]

ASF GitHub Bot logged work on HDFS-15754:
-----------------------------------------
 Author: ASF GitHub Bot
 Created on: 04/Jan/21 12:43
 Start Date: 04/Jan/21 12:43
 Worklog Time Spent: 10m
 Work Description: hadoop-yetus commented on pull request #2578:
URL: https://github.com/apache/hadoop/pull/2578#issuecomment-753954286

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 0m 47s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | markdownlint | 0m 0s | | markdownlint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | | 0m 0s | [test4tests](test4tests) | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +0 :ok: | mvndep | 13m 39s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 26m 51s | | trunk passed |
| +1 :green_heart: | compile | 24m 25s | | trunk passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 |
| +1 :green_heart: | compile | 19m 57s | | trunk passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 |
| +1 :green_heart: | checkstyle | 2m 44s | | trunk passed |
| +1 :green_heart: | mvnsite | 3m 7s | | trunk passed |
| +1 :green_heart: | shadedclient | 21m 24s | | branch has no errors when building and testing our client artifacts. |
| +1 :green_heart: | javadoc | 2m 10s | | trunk passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 |
| +1 :green_heart: | javadoc | 3m 20s | | trunk passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 |
| +0 :ok: | spotbugs | 3m 17s | | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 :green_heart: | findbugs | 5m 38s | | trunk passed |
|||| _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 27s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 2m 10s | | the patch passed |
| +1 :green_heart: | compile | 20m 50s | | the patch passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 |
| +1 :green_heart: | javac | 20m 50s | | the patch passed |
| +1 :green_heart: | compile | 18m 32s | | the patch passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 |
| +1 :green_heart: | javac | 18m 32s | | the patch passed |
| -0 :warning: | checkstyle | 2m 39s | [/diff-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2578/3/artifact/out/diff-checkstyle-root.txt) | root: The patch generated 4 new + 124 unchanged - 0 fixed = 128 total (was 124) |
| +1 :green_heart: | mvnsite | 3m 4s | | the patch passed |
| +1 :green_heart: | whitespace | 0m 0s | | The patch has no whitespace issues. |
| +1 :green_heart: | shadedclient | 15m 25s | | patch has no errors when building and testing our client artifacts. |
| +1 :green_heart: | javadoc | 2m 6s | | the patch passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 |
| +1 :green_heart: | javadoc | 3m 17s | | the patch passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 |
| +1 :green_heart: | findbugs | 5m 52s | | the patch passed |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 9m 58s | | hadoop-common in the patch passed. |
| +1 :green_heart: | unit | 102m 1s | | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 1m 5s | | The patch does not generate ASF License warnings. |
| | | 311m 16s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2578/3/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/2578 |
| Optional Tests | dupname asflicense mvnsite markdownlint compile javac javadoc mvninstall unit shadedclient findbugs checkstyle |
| uname | Linux 63beae1e56dc 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 2825d060cf9 |
| Default Java | Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04
[jira] [Commented] (HDFS-15735) NameNode memory Leak on frequent execution of fsck
[ https://issues.apache.org/jira/browse/HDFS-15735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258055#comment-17258055 ] Brahma Reddy Battula commented on HDFS-15735: - {quote}I am not sure about it, why it being configurable makes it necessary to be here, why closing is better. Please hold it. -1 {quote} Removal could impact existing users who have configured and rely on this feature, whereas the proposed fix will not break anything. I am not sure why this needs to be held up with a -1; I feel that is not good practice. > NameNode memory Leak on frequent execution of fsck > > > Key: HDFS-15735 > URL: https://issues.apache.org/jira/browse/HDFS-15735 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ravuri Sushma sree >Assignee: Ravuri Sushma sree >Priority: Major > Attachments: HDFS-15735.001.patch > > > The memory of the cluster NameNode continues to grow, and the resulting full GC > eventually brings down both the active and standby NameNodes. > HTrace is used to track the processing time of fsck. > Inspecting the code shows that the tracer object in NamenodeFsck.java is only > created and never closed; because of this, the memory footprint continues to > grow. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
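The leak pattern described in HDFS-15735, and the fix of closing the tracer, can be sketched as follows. The `Tracer` class here is a hypothetical stand-in for the HTrace tracer (not the real HTrace API); it only counts open instances to make the leak visible.

```java
// Sketch of the per-fsck tracer leak and its fix via try-with-resources.
class Tracer implements AutoCloseable {
    static int openCount = 0;
    Tracer() { openCount++; }
    @Override public void close() { openCount--; }
}

public class FsckTracerSketch {
    // Leaky variant: a tracer is created per fsck call but never closed.
    static void fsckLeaky() {
        Tracer tracer = new Tracer();
        // ... track fsck processing time ...
    }

    // Fixed variant: try-with-resources guarantees close() even on exceptions.
    static void fsckFixed() {
        try (Tracer tracer = new Tracer()) {
            // ... track fsck processing time ...
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) fsckLeaky();
        System.out.println("open after leaky calls: " + Tracer.openCount);
        Tracer.openCount = 0;
        for (int i = 0; i < 1000; i++) fsckFixed();
        System.out.println("open after fixed calls: " + Tracer.openCount);
    }
}
```

With frequent fsck execution, the leaky variant accumulates one open tracer per call, matching the steadily growing memory footprint reported above.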
[jira] [Work logged] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode
[ https://issues.apache.org/jira/browse/HDFS-15759?focusedWorklogId=530681=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-530681 ] ASF GitHub Bot logged work on HDFS-15759: - Author: ASF GitHub Bot Created on: 04/Jan/21 14:13 Start Date: 04/Jan/21 14:13 Worklog Time Spent: 10m Work Description: touchida opened a new pull request #2585: URL: https://github.com/apache/hadoop/pull/2585 ## NOTICE Please create an issue in ASF JIRA before opening a pull request, and you need to set the title of the pull request which starts with the corresponding JIRA issue number. (e.g. HADOOP-X. Fix a typo in YYY.) For more details, please see https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 530681) Remaining Estimate: 0h Time Spent: 10m > EC: Verify EC reconstruction correctness on DataNode > > > Key: HDFS-15759 > URL: https://issues.apache.org/jira/browse/HDFS-15759 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, ec, erasure-coding >Affects Versions: 3.4.0 >Reporter: Toshihiko Uchida >Assignee: Toshihiko Uchida >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > EC reconstruction on DataNode has caused data corruption: HDFS-14768, > HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and > the corruption is neither detected nor auto-healed by HDFS. It is obviously > hard for users to monitor data integrity by themselves, and even if they find > corrupted data, it is difficult or sometimes impossible to recover them. > To prevent further data corruption issues, this feature proposes a simple and > effective way to verify EC reconstruction correctness on DataNode at each > reconstruction process. 
> It verifies the correctness of the decoded outputs as follows: > 1. Decode an input from the outputs; > 2. Compare the decoded input with the original input. > For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs > [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from > [d1, d2, d3, d4, d5, p1] and comparing the original and decoded data of d0. > When an EC reconstruction task goes wrong, the comparison will fail with high > probability. > The task will then also fail and be retried by the NameNode. > The next reconstruction will succeed if the condition that triggered the failure > is gone.
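The decode-then-re-decode check above can be illustrated with a toy code. This sketch uses a single XOR parity block in place of real Reed-Solomon coding (RS-6-3 itself needs a full RS codec, which is out of scope here); with XOR, any one block is the XOR of all the others, so the "reconstruct, then re-decode an input and compare" verification can be shown end to end.

```java
import java.util.Arrays;

// Illustrative sketch only: XOR single-parity stands in for RS-6-3.
public class EcVerificationSketch {
    static byte[] xorAll(byte[][] blocks) {
        byte[] out = new byte[blocks[0].length];
        for (byte[] b : blocks)
            for (int i = 0; i < out.length; i++) out[i] ^= b[i];
        return out;
    }

    public static void main(String[] args) {
        byte[][] d = { {1, 2}, {3, 4}, {5, 6}, {7, 8}, {9, 10}, {11, 12} };
        byte[] p0 = xorAll(d); // parity over d0..d5

        // Reconstruction: recover the lost d1 from [d0, d2..d5, p0].
        byte[] d1r = xorAll(new byte[][] { d[0], d[2], d[3], d[4], d[5], p0 });

        // Verification: re-decode d0 using the reconstructed d1, compare with original.
        byte[] d0check = xorAll(new byte[][] { d1r, d[2], d[3], d[4], d[5], p0 });
        System.out.println("good reconstruction verifies: " + Arrays.equals(d0check, d[0]));

        // A corrupted reconstruction makes the comparison fail.
        d1r[0] ^= 0x7f;
        byte[] d0bad = xorAll(new byte[][] { d1r, d[2], d[3], d[4], d[5], p0 });
        System.out.println("corrupt reconstruction verifies: " + Arrays.equals(d0bad, d[0]));
    }
}
```

The same invariant drives the RS-6-3 case: a silently corrupted output makes the re-decoded input disagree with the original with high probability, so the task fails instead of persisting bad data.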
[jira] [Updated] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode
[ https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDFS-15759: -- Labels: pull-request-available (was: ) > EC: Verify EC reconstruction correctness on DataNode > > > Key: HDFS-15759 > URL: https://issues.apache.org/jira/browse/HDFS-15759 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, ec, erasure-coding >Affects Versions: 3.4.0 >Reporter: Toshihiko Uchida >Assignee: Toshihiko Uchida >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h
[jira] [Created] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode
Toshihiko Uchida created HDFS-15759: --- Summary: EC: Verify EC reconstruction correctness on DataNode Key: HDFS-15759 URL: https://issues.apache.org/jira/browse/HDFS-15759 Project: Hadoop HDFS Issue Type: New Feature Components: datanode, ec, erasure-coding Affects Versions: 3.4.0 Reporter: Toshihiko Uchida EC reconstruction on DataNode has caused data corruption: HDFS-14768, HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and the corruption is neither detected nor auto-healed by HDFS. It is obviously hard for users to monitor data integrity by themselves, and even if they find corrupted data, it is difficult or sometimes impossible to recover them. To prevent further data corruption issues, this feature proposes a simple and effective way to verify EC reconstruction correctness on DataNode at each reconstruction process. It verifies correctness of outputs decoded from inputs as follows: 1. Decoding an input with the outputs; 2. Compare the decoded input with the original input. For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0. When an EC reconstruction task goes wrong, the comparison will fail with high probability. Then the task will also fail and be retried by NameNode. The next reconstruction will succeed if the condition triggered the failure is gone.
[jira] [Commented] (HDFS-15751) Add documentation for msync() API to filesystem.md
[ https://issues.apache.org/jira/browse/HDFS-15751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258259#comment-17258259 ] Steve Loughran commented on HDFS-15751: --- LGTM, though the doc reference to HDFS should be relative to the final build paths, i.e. self-contained. Still need a story for viewfs > Add documentation for msync() API to filesystem.md > -- > > Key: HDFS-15751 > URL: https://issues.apache.org/jira/browse/HDFS-15751 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation >Reporter: Konstantin Shvachko >Assignee: Konstantin Shvachko >Priority: Major > Fix For: 3.2.2, 3.3.1, 3.4.0, 3.1.5, 2.10.2, 3.2.3 > > Attachments: HDFS-15751-01.patch, HDFS-15751-02.patch, > HDFS-15751-03.patch > > > HDFS-15567 introduced new {{FileSystem}} call {{msync()}}. Should add it to > the API definitions.
[jira] [Assigned] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode
[ https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Toshihiko Uchida reassigned HDFS-15759: --- Assignee: Toshihiko Uchida > EC: Verify EC reconstruction correctness on DataNode > > > Key: HDFS-15759 > URL: https://issues.apache.org/jira/browse/HDFS-15759 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, ec, erasure-coding >Affects Versions: 3.4.0 >Reporter: Toshihiko Uchida >Assignee: Toshihiko Uchida >Priority: Major
[jira] [Commented] (HDFS-15757) RBF: Improving Router Connection Management
[ https://issues.apache.org/jira/browse/HDFS-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258364#comment-17258364 ] Íñigo Goiri commented on HDFS-15757: Thank you [~fengnanli] for the proposal. The connection manager is pretty tricky, as it can impact the performance of the Router substantially. Your proposal makes sense. Do you have specific scenarios where the metrics clearly show the connections in a bad state? It would be nice to have some benchmarks too. In any case, your proposal doesn't seem too complex, so we should go ahead with a patch and go from there. > RBF: Improving Router Connection Management > --- > > Key: HDFS-15757 > URL: https://issues.apache.org/jira/browse/HDFS-15757 > Project: Hadoop HDFS > Issue Type: Improvement > Components: rbf >Reporter: Fengnan Li >Assignee: Fengnan Li >Priority: Major > Attachments: RBF_ Router Connection Management.pdf > > > We have seen a high number of connections from Router to NameNodes, leaving > NameNodes unstable. > This ticket is trying to reduce connections through some changes. Please take > a look at the design and leave comments. > Thanks!
[jira] [Created] (HDFS-15760) Validate the target indices in ErasureCoding worker in reconstruction process
Uma Maheswara Rao G created HDFS-15760: -- Summary: Validate the target indices in ErasureCoding worker in reconstruction process Key: HDFS-15760 URL: https://issues.apache.org/jira/browse/HDFS-15760 Project: Hadoop HDFS Issue Type: Improvement Components: ec Affects Versions: 3.4.0 Reporter: Uma Maheswara Rao G As we have seen in issues like # HDFS-15186 # HDFS-14768 it is a good idea to validate the indices at the ECWorker side and skip unintended indices from the target list. Both issues were triggered because the NN accidentally scheduled reconstruction during the decommission process due to a busy node. We have fixed the NN to consider busy nodes as live replicas. However, it may be a good idea to safeguard this condition at the ECWorker as well: if any other condition leads the ECWorker to calculate the indices as in the above issues, the EC function returns wrong output. I think it's OK to recover only the missing indices from the given src indices.
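The proposed safeguard amounts to filtering the NameNode's target list against the live source indices before reconstructing. This is a hypothetical sketch of that check; the method and variable names are illustrative, not the actual HDFS code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: keep only target indices that are genuinely missing from the sources.
public class TargetIndexValidationSketch {
    static List<Integer> validTargets(Set<Integer> liveSourceIndices,
                                      List<Integer> requestedTargets) {
        List<Integer> valid = new ArrayList<>();
        for (int t : requestedTargets) {
            if (!liveSourceIndices.contains(t)) {
                valid.add(t); // reconstruct only indices that are actually missing
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        // RS-6-3: indices 0..5 are data, 6..8 parity. Index 4 is live (e.g. on a
        // busy node), so scheduling it for reconstruction would be a mistake.
        Set<Integer> live = new HashSet<>(Arrays.asList(0, 1, 2, 3, 4, 6));
        System.out.println(validTargets(live, Arrays.asList(4, 5)));
    }
}
```

Even if the NameNode mis-schedules as in HDFS-15186 or HDFS-14768, the worker then only recovers the truly missing indices instead of producing wrong output for a live one.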
[jira] [Commented] (HDFS-15748) RBF: Move the router related part from hadoop-federation-balance module to hadoop-hdfs-rbf.
[ https://issues.apache.org/jira/browse/HDFS-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258366#comment-17258366 ] Íñigo Goiri commented on HDFS-15748: +1 on [^HDFS-15748.004.patch]. > RBF: Move the router related part from hadoop-federation-balance module to > hadoop-hdfs-rbf. > --- > > Key: HDFS-15748 > URL: https://issues.apache.org/jira/browse/HDFS-15748 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15748.001.patch, HDFS-15748.002.patch, > HDFS-15748.003.patch, HDFS-15748.004.patch
[jira] [Commented] (HDFS-15757) RBF: Improving Router Connection Management
[ https://issues.apache.org/jira/browse/HDFS-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258376#comment-17258376 ] Fengnan Li commented on HDFS-15757: --- Thanks for the review, [~elgoiri]. There are two metrics we will try to improve: 1. RpcClientNumConnections should go down on each Router. 2. RpcClientNumActiveConnections / RpcClientNumConnections should go up on each Router. I will add more graphs for this in an updated doc. The first version was trying to get some initial feedback. > RBF: Improving Router Connection Management > --- > > Key: HDFS-15757 > URL: https://issues.apache.org/jira/browse/HDFS-15757 > Project: Hadoop HDFS > Issue Type: Improvement > Components: rbf >Reporter: Fengnan Li >Assignee: Fengnan Li >Priority: Major > Attachments: RBF_ Router Connection Management.pdf > > > We have seen a high number of connections from Router to NameNodes, leaving > NameNodes unstable. > This ticket is trying to reduce connections through some changes. Please take > a look at the design and leave comments. > Thanks!
[jira] [Updated] (HDFS-15748) RBF: Move the router related part from hadoop-federation-balance module to hadoop-hdfs-rbf.
[ https://issues.apache.org/jira/browse/HDFS-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Saxena updated HDFS-15748: Hadoop Flags: Reviewed Resolution: Fixed Status: Resolved (was: Patch Available) > RBF: Move the router related part from hadoop-federation-balance module to > hadoop-hdfs-rbf. > --- > > Key: HDFS-15748 > URL: https://issues.apache.org/jira/browse/HDFS-15748 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15748.001.patch, HDFS-15748.002.patch, > HDFS-15748.003.patch, HDFS-15748.004.patch
[jira] [Comment Edited] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
[ https://issues.apache.org/jira/browse/HDFS-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258452#comment-17258452 ] Ye Ni edited comment on HDFS-15761 at 1/4/21, 7:45 PM: --- cc [~mingma], [~andrew.wang], [~zhz] , [~inigoiri] was (Author: nickyye): cc [~mingma], [~andrew.wang], [~aiden_zhang], [~inigoiri] > Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately > -- > > Key: HDFS-15761 > URL: https://issues.apache.org/jira/browse/HDFS-15761 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ye Ni >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > To decommission a dead DN, the complete logic should be > Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED > *Currently logic:* > If a DN is already dead when DECOMMISSIONING starts, it becomes > DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped. > This logic is introduced by https://issues.apache.org/jira/browse/HDFS-7374 > HDFS-7374 is made because of https://issues.apache.org/jira/browse/HDFS-6791. > HDFS-6791 keeps the node in DECOMMISSION_INPROGRESS state if the node becomes > dead during decommission, which could possibly make a dead DN in > DECOMMISSION_INPROGRESS forever, if the DN could never be alive. > However, putting a dead DN to DECOMMISSIONED directly is not secure. For > example, 3 DN of the same block are dead at the same time, then the > administrator puts them to DECOMMISSIONED. Namenode should check first before > transit them to DECOMMISSIONED. Otherwise, it would be a data loss. > In this case, all 3 DNs can't become DECOMMISSIONED which is by design. The > administrator needs to do some manual intervention, either repair the dead > machine or service or recover the data before decommission them. > This change is to add Dead, DECOMMISSION_INPROGRESS back. > 1. Dead normal DN is in DECOMMISSION_INPROGRESS first. > 2. 
Then checked pendingReplicationBlocksCount and underReplicatedBlocksCount > are both 0 > 3. Transit the dead DN to DECOMMISSIONED. > 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, which > adds a check to allow dead nodes in DECOMMISSION_IN_PROGRESS to progress to > DECOMMISSIONED state if all files on the filesystem are fully-replicated, > dead DN is in DECOMMISSION_INPROGRESS, then checked, before become > DECOMMISSIONED.
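The proposed transitions above can be sketched as a small state machine. This is a hedged illustration of the rule, not the actual DatanodeAdminManager code: a dead DN enters DECOMMISSION_INPROGRESS first and only reaches DECOMMISSIONED once pendingReplicationBlocksCount and underReplicatedBlocksCount are both 0.

```java
// Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED.
public class DeadDnDecommissionSketch {
    enum AdminState { NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

    static AdminState next(AdminState s, long pendingReplication, long underReplicated) {
        if (s == AdminState.NORMAL) {
            return AdminState.DECOMMISSION_INPROGRESS; // never skip straight to DECOMMISSIONED
        }
        if (s == AdminState.DECOMMISSION_INPROGRESS
                && pendingReplication == 0 && underReplicated == 0) {
            return AdminState.DECOMMISSIONED; // safe only once all blocks are fully replicated
        }
        return s;
    }

    public static void main(String[] args) {
        AdminState s = AdminState.NORMAL;
        s = next(s, 5, 2);
        System.out.println(s); // dead DN starts decommissioning
        s = next(s, 5, 2);
        System.out.println(s); // blocks not yet safe: stays in progress
        s = next(s, 0, 0);
        System.out.println(s); // fully replicated: safe to finish
    }
}
```

Under this rule, the three-dead-DNs scenario above can never reach DECOMMISSIONED, which is the intended data-loss guard.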
[jira] [Updated] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
[ https://issues.apache.org/jira/browse/HDFS-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ye Ni updated HDFS-15761: - Description: To decommission a dead DN, the complete logic should be Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED *Currently logic:* If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped. This logic is introduced by HDFS-7374 which is made because of HDFS-6791. HDFS-6791 keeps the node in DECOMMISSION_INPROGRESS state if the node becomes dead during decommission, which could possibly make a dead DN in DECOMMISSION_INPROGRESS forever, if the DN could never be alive. However, putting a dead DN to DECOMMISSIONED directly is not secure. For example, 3 DN of the same block are dead at the same time, then the administrator wants to decommission them. Namenode should check first before transit them to DECOMMISSIONED. Otherwise, it would be a data loss. In this case, all 3 DNs can't become DECOMMISSIONED which is by design. The administrator needs to do some manual intervention, either repair the dead machine or service or recover the data before take action on them. *This change is to add Dead, DECOMMISSION_INPROGRESS back.* 1. Dead normal DN is in DECOMMISSION_INPROGRESS first. 2. NN checks pendingReplicationBlocksCount and underReplicatedBlocksCount are both 0. 3. Transit the dead DN to DECOMMISSIONED. 2 is implemented by HDFS-7409, which adds a check to allow dead nodes in DECOMMISSION_IN_PROGRESS to progress to DECOMMISSIONED state if all files on the filesystem are fully-replicated, dead DN is in DECOMMISSION_INPROGRESS, then checked, before become DECOMMISSIONED. was: To decommission a dead DN, the complete logic should be Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED *Currently logic:* If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED immediately. 
DECOMMISSION_INPROGRESS is skipped. This logic is introduced by https://issues.apache.org/jira/browse/HDFS-7374 HDFS-7374 is made because of https://issues.apache.org/jira/browse/HDFS-6791. HDFS-6791 keeps the node in DECOMMISSION_INPROGRESS state if the node becomes dead during decommission, which could possibly make a dead DN in DECOMMISSION_INPROGRESS forever, if the DN could never be alive. However, putting a dead DN to DECOMMISSIONED directly is not secure. For example, 3 DN of the same block are dead at the same time, then the administrator wants to decommission them. Namenode should check first before transit them to DECOMMISSIONED. Otherwise, it would be a data loss. In this case, all 3 DNs can't become DECOMMISSIONED which is by design. The administrator needs to do some manual intervention, either repair the dead machine or service or recover the data before take action on them. *This change is to add Dead, DECOMMISSION_INPROGRESS back.* 1. Dead normal DN is in DECOMMISSION_INPROGRESS first. 2. NN checks pendingReplicationBlocksCount and underReplicatedBlocksCount are both 0. 3. Transit the dead DN to DECOMMISSIONED. 2 is implemented by HDFS-7409, which adds a check to allow dead nodes in DECOMMISSION_IN_PROGRESS to progress to DECOMMISSIONED state if all files on the filesystem are fully-replicated, dead DN is in DECOMMISSION_INPROGRESS, then checked, before become DECOMMISSIONED. > Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately > -- > > Key: HDFS-15761 > URL: https://issues.apache.org/jira/browse/HDFS-15761 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ye Ni >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > To decommission a dead DN, the complete logic should be > Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED > *Currently logic:* > If a DN is already dead when DECOMMISSIONING starts, it becomes > DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped. 
> This logic is introduced by HDFS-7374 which is made because of HDFS-6791. > HDFS-6791 keeps the node in DECOMMISSION_INPROGRESS state if the node becomes > dead during decommission, which could possibly make a dead DN in > DECOMMISSION_INPROGRESS forever, if the DN could never be alive. > However, putting a dead DN to DECOMMISSIONED directly is not secure. For > example, 3 DN of the same block are dead at the same time, then the > administrator wants to decommission them. Namenode should check first before > transit them to DECOMMISSIONED. Otherwise, it would be a data loss. > In this case, all 3 DNs can't become DECOMMISSIONED which is by design. The > administrator needs to do some manual intervention, either repair the dead > machine or service or
[jira] [Commented] (HDFS-15748) RBF: Move the router related part from hadoop-federation-balance module to hadoop-hdfs-rbf.
[ https://issues.apache.org/jira/browse/HDFS-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258410#comment-17258410 ] Ayush Saxena commented on HDFS-15748: - Committed to trunk. Thanx [~LiJinglun] for the contribution and [~elgoiri] for the review!!! > RBF: Move the router related part from hadoop-federation-balance module to > hadoop-hdfs-rbf. > --- > > Key: HDFS-15748 > URL: https://issues.apache.org/jira/browse/HDFS-15748 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15748.001.patch, HDFS-15748.002.patch, > HDFS-15748.003.patch, HDFS-15748.004.patch
[jira] [Work logged] (HDFS-15549) Improve DISK/ARCHIVE movement if they are on same filesystem
[ https://issues.apache.org/jira/browse/HDFS-15549?focusedWorklogId=530857=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-530857 ] ASF GitHub Bot logged work on HDFS-15549: - Author: ASF GitHub Bot Created on: 04/Jan/21 20:06 Start Date: 04/Jan/21 20:06 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #2583: URL: https://github.com/apache/hadoop/pull/2583#issuecomment-754188800 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 1m 31s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | | 0m 0s | [test4tests](test4tests) | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 13m 59s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 24m 15s | | trunk passed | | +1 :green_heart: | compile | 22m 23s | | trunk passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 | | +1 :green_heart: | compile | 26m 9s | | trunk passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 | | +1 :green_heart: | checkstyle | 4m 36s | | trunk passed | | -1 :x: | mvnsite | 1m 9s | [/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/2/artifact/out/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in trunk failed. | | +1 :green_heart: | shadedclient | 9m 58s | | branch has no errors when building and testing our client artifacts. 
| | -1 :x: | javadoc | 0m 51s | [/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/2/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.txt) | hadoop-hdfs in trunk failed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04. | | -1 :x: | javadoc | 0m 59s | [/branch-javadoc-hadoop-common-project_hadoop-common-jdkPrivateBuild-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/2/artifact/out/branch-javadoc-hadoop-common-project_hadoop-common-jdkPrivateBuild-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.txt) | hadoop-common in trunk failed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01. | | -1 :x: | javadoc | 1m 0s | [/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/2/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.txt) | hadoop-hdfs in trunk failed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01. | | +0 :ok: | spotbugs | 16m 45s | | Used deprecated FindBugs config; considering switching to SpotBugs. | | -1 :x: | findbugs | 1m 2s | [/branch-findbugs-hadoop-common-project_hadoop-common.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/2/artifact/out/branch-findbugs-hadoop-common-project_hadoop-common.txt) | hadoop-common in trunk failed. | | -1 :x: | findbugs | 1m 1s | [/branch-findbugs-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/2/artifact/out/branch-findbugs-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in trunk failed. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 40s | | Maven dependency ordering for patch | | -1 :x: | mvninstall | 0m 32s | [/patch-mvninstall-hadoop-common-project_hadoop-common.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/2/artifact/out/patch-mvninstall-hadoop-common-project_hadoop-common.txt) | hadoop-common in the patch failed. | | -1 :x: | mvninstall | 0m 28s | [/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/2/artifact/out/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch failed. | | -1 :x: | compile | 0m 30s | [/patch-compile-root-jdkUbuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/2/artifact/out/patch-compile-root-jdkUbuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.txt) | root in the patch failed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04. | | -1 :x: | javac | 0m 30s |
[jira] [Work logged] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode
[ https://issues.apache.org/jira/browse/HDFS-15759?focusedWorklogId=530820=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-530820 ] ASF GitHub Bot logged work on HDFS-15759: - Author: ASF GitHub Bot Created on: 04/Jan/21 19:10 Start Date: 04/Jan/21 19:10 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #2585: URL: https://github.com/apache/hadoop/pull/2585#issuecomment-754160262 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 34s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | | 0m 0s | [test4tests](test4tests) | The patch appears to include 5 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 14m 1s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 20m 41s | | trunk passed | | +1 :green_heart: | compile | 20m 10s | | trunk passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 | | +1 :green_heart: | compile | 17m 17s | | trunk passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 | | +1 :green_heart: | checkstyle | 2m 52s | | trunk passed | | +1 :green_heart: | mvnsite | 3m 7s | | trunk passed | | +1 :green_heart: | shadedclient | 24m 41s | | branch has no errors when building and testing our client artifacts. | | +1 :green_heart: | javadoc | 2m 11s | | trunk passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 | | +1 :green_heart: | javadoc | 3m 18s | | trunk passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 | | +0 :ok: | spotbugs | 3m 19s | | Used deprecated FindBugs config; considering switching to SpotBugs. 
| | +1 :green_heart: | findbugs | 5m 38s | | trunk passed | _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 27s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 2m 6s | | the patch passed | | +1 :green_heart: | compile | 19m 21s | | the patch passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 | | +1 :green_heart: | javac | 19m 21s | | the patch passed | | +1 :green_heart: | compile | 17m 23s | | the patch passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 | | +1 :green_heart: | javac | 17m 23s | | the patch passed | | +1 :green_heart: | checkstyle | 2m 52s | | the patch passed | | +1 :green_heart: | mvnsite | 3m 6s | | the patch passed | | +1 :green_heart: | whitespace | 0m 0s | | The patch has no whitespace issues. | | +1 :green_heart: | xml | 0m 2s | | The patch has no ill-formed XML file. | | +1 :green_heart: | shadedclient | 15m 26s | | patch has no errors when building and testing our client artifacts. | | +1 :green_heart: | javadoc | 2m 8s | | the patch passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 | | +1 :green_heart: | javadoc | 3m 17s | | the patch passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 | | +1 :green_heart: | findbugs | 5m 47s | | the patch passed | _ Other Tests _ | | +1 :green_heart: | unit | 9m 52s | | hadoop-common in the patch passed. | | -1 :x: | unit | 98m 45s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2585/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 1m 9s | | The patch does not generate ASF License warnings. 
| | | | 296m 1s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.TestReconstructStripedFileWithValidator | | | hadoop.hdfs.TestMultipleNNPortQOP | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2585/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/2585 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml | | uname | Linux 98ae016c6e0c 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 2825d060cf9 | | Default Java | Private
[jira] [Created] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
Ye Ni created HDFS-15761: Summary: Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately Key: HDFS-15761 URL: https://issues.apache.org/jira/browse/HDFS-15761 Project: Hadoop HDFS Issue Type: Bug Reporter: Ye Ni To decommission a dead DN, the complete logic should be Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED *Current logic:* If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED immediately; DECOMMISSION_INPROGRESS is skipped. This logic was introduced by https://issues.apache.org/jira/browse/HDFS-7374, which was made because of https://issues.apache.org/jira/browse/HDFS-6791. HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if the node becomes dead during decommission, which could leave a dead DN in DECOMMISSION_INPROGRESS forever if the DN never comes back alive. However, moving a dead DN to DECOMMISSIONED directly is not safe. For example, if 3 DNs holding the same block are dead at the same time and the administrator decommissions them, the Namenode should check first before transitioning them to DECOMMISSIONED; otherwise it would be a data loss. In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. The administrator needs to do some manual intervention, either repairing the dead machine or service or recovering the data, before decommissioning them. This change adds Dead, DECOMMISSION_INPROGRESS back: 1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first. 2. The NN then checks that pendingReplicationBlocksCount and underReplicatedBlocksCount are both 0. 3. The dead DN is transitioned to DECOMMISSIONED. Step 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, which adds a check that allows dead nodes in DECOMMISSION_IN_PROGRESS to progress to the DECOMMISSIONED state once all files on the filesystem are fully replicated.
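The three-step transition proposed above can be sketched roughly as follows. This is a minimal illustration only: `AdminState`, `DeadDnDecommissionSketch`, and `nextState` are hypothetical names for this sketch, not the actual Hadoop DatanodeAdminManager API.

```java
// Hedged sketch of the proposed transition order for a dead DN.
// A dead NORMAL DN must pass through DECOMMISSION_INPROGRESS and may
// only finish once the NN sees no pending or under-replicated blocks.
enum AdminState { NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

class DeadDnDecommissionSketch {
    static AdminState nextState(AdminState current,
                                long pendingReplicationBlocksCount,
                                long underReplicatedBlocksCount) {
        if (current == AdminState.NORMAL) {
            // Step 1: never jump straight to DECOMMISSIONED,
            // even if the DN is already dead.
            return AdminState.DECOMMISSION_INPROGRESS;
        }
        if (current == AdminState.DECOMMISSION_INPROGRESS
                && pendingReplicationBlocksCount == 0
                && underReplicatedBlocksCount == 0) {
            // Steps 2 and 3: only when the HDFS-7409-style check passes
            // does the dead DN become DECOMMISSIONED.
            return AdminState.DECOMMISSIONED;
        }
        // Otherwise stay put: e.g. if all 3 replicas of a block are on
        // dead DNs, underReplicatedBlocksCount stays above 0 and none
        // of them can finish, forcing manual intervention.
        return current;
    }
}
```

Under this sketch, the 3-dead-DNs example from the description keeps all three nodes parked in DECOMMISSION_INPROGRESS until the data is recovered, which is the intended by-design behavior.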
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
[ https://issues.apache.org/jira/browse/HDFS-15761?focusedWorklogId=530835=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-530835 ] ASF GitHub Bot logged work on HDFS-15761: - Author: ASF GitHub Bot Created on: 04/Jan/21 19:41 Start Date: 04/Jan/21 19:41 Worklog Time Spent: 10m Work Description: NickyYe opened a new pull request #2588: URL: https://github.com/apache/hadoop/pull/2588 https://issues.apache.org/jira/browse/HDFS-15761 ## NOTICE Please create an issue in ASF JIRA before opening a pull request, and you need to set the title of the pull request which starts with the corresponding JIRA issue number. (e.g. HADOOP-X. Fix a typo in YYY.) For more details, please see https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 530835) Remaining Estimate: 0h Time Spent: 10m > Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately > -- > > Key: HDFS-15761 > URL: https://issues.apache.org/jira/browse/HDFS-15761 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ye Ni >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > To decommission a dead DN, the complete logic should be > Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED > *Currently logic:* > If a DN is already dead when DECOMMISSIONING starts, it becomes > DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped. > This logic is introduced by https://issues.apache.org/jira/browse/HDFS-7374 > HDFS-7374 is made because of https://issues.apache.org/jira/browse/HDFS-6791. 
> HDFS-6791 keeps the node in DECOMMISSION_INPROGRESS state if the node becomes > dead during decommission, which could possibly make a dead DN in > DECOMMISSION_INPROGRESS forever, if the DN could never be alive. > However, putting a dead DN to DECOMMISSIONED directly is not secure. For > example, 3 DN of the same block are dead at the same time, then the > administrator puts them to DECOMMISSIONED. Namenode should check first before > transit them to DECOMMISSIONED. Otherwise, it would be a data loss. > In this case, all 3 DNs can't become DECOMMISSIONED which is by design. The > administrator needs to do some manual intervention, either repair the dead > machine or service or recover the data before decommission them. > This change is to add Dead, DECOMMISSION_INPROGRESS back. > 1. Dead normal DN is in DECOMMISSION_INPROGRESS first. > 2. Then checked pendingReplicationBlocksCount and underReplicatedBlocksCount > are both 0 > 3. Transit the dead DN to DECOMMISSIONED. > 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, which > adds a check to allow dead nodes in DECOMMISSION_IN_PROGRESS to progress to > DECOMMISSIONED state if all files on the filesystem are fully-replicated, > dead DN is in DECOMMISSION_INPROGRESS, then checked, before become > DECOMMISSIONED.
[jira] [Updated] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
[ https://issues.apache.org/jira/browse/HDFS-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDFS-15761: -- Labels: pull-request-available (was: ) > Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately > -- > > Key: HDFS-15761 > URL: https://issues.apache.org/jira/browse/HDFS-15761 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ye Ni >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > To decommission a dead DN, the complete logic should be > Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED > *Currently logic:* > If a DN is already dead when DECOMMISSIONING starts, it becomes > DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped. > This logic is introduced by https://issues.apache.org/jira/browse/HDFS-7374 > HDFS-7374 is made because of https://issues.apache.org/jira/browse/HDFS-6791. > HDFS-6791 keeps the node in DECOMMISSION_INPROGRESS state if the node becomes > dead during decommission, which could possibly make a dead DN in > DECOMMISSION_INPROGRESS forever, if the DN could never be alive. > However, putting a dead DN to DECOMMISSIONED directly is not secure. For > example, 3 DN of the same block are dead at the same time, then the > administrator puts them to DECOMMISSIONED. Namenode should check first before > transit them to DECOMMISSIONED. Otherwise, it would be a data loss. > In this case, all 3 DNs can't become DECOMMISSIONED which is by design. The > administrator needs to do some manual intervention, either repair the dead > machine or service or recover the data before decommission them. > This change is to add Dead, DECOMMISSION_INPROGRESS back. > 1. Dead normal DN is in DECOMMISSION_INPROGRESS first. > 2. Then checked pendingReplicationBlocksCount and underReplicatedBlocksCount > are both 0 > 3. Transit the dead DN to DECOMMISSIONED. 
> 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, which > adds a check to allow dead nodes in DECOMMISSION_IN_PROGRESS to progress to > DECOMMISSIONED state if all files on the filesystem are fully-replicated, > dead DN is in DECOMMISSION_INPROGRESS, then checked, before become > DECOMMISSIONED.
[jira] [Comment Edited] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
[ https://issues.apache.org/jira/browse/HDFS-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258452#comment-17258452 ] Ye Ni edited comment on HDFS-15761 at 1/4/21, 7:46 PM: --- cc [~mingma], [~andrew.wang], [~zhz] ,[~elgoiri] was (Author: nickyye): cc [~mingma], [~andrew.wang], [~zhz] , [~inigoiri] > Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately > -- > > Key: HDFS-15761 > URL: https://issues.apache.org/jira/browse/HDFS-15761 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ye Ni >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > To decommission a dead DN, the complete logic should be > Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED > *Currently logic:* > If a DN is already dead when DECOMMISSIONING starts, it becomes > DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped. > This logic is introduced by https://issues.apache.org/jira/browse/HDFS-7374 > HDFS-7374 is made because of https://issues.apache.org/jira/browse/HDFS-6791. > HDFS-6791 keeps the node in DECOMMISSION_INPROGRESS state if the node becomes > dead during decommission, which could possibly make a dead DN in > DECOMMISSION_INPROGRESS forever, if the DN could never be alive. > However, putting a dead DN to DECOMMISSIONED directly is not secure. For > example, 3 DN of the same block are dead at the same time, then the > administrator puts them to DECOMMISSIONED. Namenode should check first before > transit them to DECOMMISSIONED. Otherwise, it would be a data loss. > In this case, all 3 DNs can't become DECOMMISSIONED which is by design. The > administrator needs to do some manual intervention, either repair the dead > machine or service or recover the data before decommission them. > This change is to add Dead, DECOMMISSION_INPROGRESS back. > 1. Dead normal DN is in DECOMMISSION_INPROGRESS first. > 2. 
Then checked pendingReplicationBlocksCount and underReplicatedBlocksCount > are both 0 > 3. Transit the dead DN to DECOMMISSIONED. > 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, which > adds a check to allow dead nodes in DECOMMISSION_IN_PROGRESS to progress to > DECOMMISSIONED state if all files on the filesystem are fully-replicated, > dead DN is in DECOMMISSION_INPROGRESS, then checked, before become > DECOMMISSIONED.
[jira] [Updated] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
[ https://issues.apache.org/jira/browse/HDFS-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ye Ni updated HDFS-15761: - Description: To decommission a dead DN, the complete logic should be Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED *Currently logic:* If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped. This logic is introduced by https://issues.apache.org/jira/browse/HDFS-7374 HDFS-7374 is made because of https://issues.apache.org/jira/browse/HDFS-6791. HDFS-6791 keeps the node in DECOMMISSION_INPROGRESS state if the node becomes dead during decommission, which could possibly make a dead DN in DECOMMISSION_INPROGRESS forever, if the DN could never be alive. However, putting a dead DN to DECOMMISSIONED directly is not secure. For example, 3 DN of the same block are dead at the same time, then the administrator wants to decommission them. Namenode should check first before transit them to DECOMMISSIONED. Otherwise, it would be a data loss. In this case, all 3 DNs can't become DECOMMISSIONED which is by design. The administrator needs to do some manual intervention, either repair the dead machine or service or recover the data before take action on them. This change is to add Dead, DECOMMISSION_INPROGRESS back. 1. Dead normal DN is in DECOMMISSION_INPROGRESS first. 2. Then NN check pendingReplicationBlocksCount and underReplicatedBlocksCount are both 0 3. Transit the dead DN to DECOMMISSIONED. 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, which adds a check to allow dead nodes in DECOMMISSION_IN_PROGRESS to progress to DECOMMISSIONED state if all files on the filesystem are fully-replicated, dead DN is in DECOMMISSION_INPROGRESS, then checked, before become DECOMMISSIONED. 
was: To decommission a dead DN, the complete logic should be Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED *Currently logic:* If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped. This logic is introduced by https://issues.apache.org/jira/browse/HDFS-7374 HDFS-7374 is made because of https://issues.apache.org/jira/browse/HDFS-6791. HDFS-6791 keeps the node in DECOMMISSION_INPROGRESS state if the node becomes dead during decommission, which could possibly make a dead DN in DECOMMISSION_INPROGRESS forever, if the DN could never be alive. However, putting a dead DN to DECOMMISSIONED directly is not secure. For example, 3 DN of the same block are dead at the same time, then the administrator puts them to DECOMMISSIONED. Namenode should check first before transit them to DECOMMISSIONED. Otherwise, it would be a data loss. In this case, all 3 DNs can't become DECOMMISSIONED which is by design. The administrator needs to do some manual intervention, either repair the dead machine or service or recover the data before decommission them. This change is to add Dead, DECOMMISSION_INPROGRESS back. 1. Dead normal DN is in DECOMMISSION_INPROGRESS first. 2. Then checked pendingReplicationBlocksCount and underReplicatedBlocksCount are both 0 3. Transit the dead DN to DECOMMISSIONED. 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, which adds a check to allow dead nodes in DECOMMISSION_IN_PROGRESS to progress to DECOMMISSIONED state if all files on the filesystem are fully-replicated, dead DN is in DECOMMISSION_INPROGRESS, then checked, before become DECOMMISSIONED. 
> Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately > -- > > Key: HDFS-15761 > URL: https://issues.apache.org/jira/browse/HDFS-15761 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ye Ni >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > To decommission a dead DN, the complete logic should be > Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED > *Currently logic:* > If a DN is already dead when DECOMMISSIONING starts, it becomes > DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped. > This logic is introduced by https://issues.apache.org/jira/browse/HDFS-7374 > HDFS-7374 is made because of https://issues.apache.org/jira/browse/HDFS-6791. > HDFS-6791 keeps the node in DECOMMISSION_INPROGRESS state if the node becomes > dead during decommission, which could possibly make a dead DN in > DECOMMISSION_INPROGRESS forever, if the DN could never be alive. > However, putting a dead DN to DECOMMISSIONED directly is not secure. For > example, 3 DN of the same block are dead at the same time, then the > administrator wants to decommission them. Namenode should check first before > transit them to
[jira] [Updated] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
[ https://issues.apache.org/jira/browse/HDFS-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ye Ni updated HDFS-15761: - Description: To decommission a dead DN, the complete logic should be Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED *Currently logic:* If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped. This logic is introduced by HDFS-7374 which is made because of HDFS-6791. HDFS-6791 keeps the node in DECOMMISSION_INPROGRESS state if the node becomes dead during decommission, which could possibly make a dead DN in DECOMMISSION_INPROGRESS forever, if the DN could never be alive. However, putting a dead DN to DECOMMISSIONED directly is not secure. For example, 3 DN of the same block are dead at the same time, then the administrator wants to decommission them. Namenode should check first before transit them to DECOMMISSIONED. Otherwise, it would be a data loss. In this case, all 3 DNs can't become DECOMMISSIONED which is by design. The administrator needs to do some manual intervention, either repair the dead machine or service or recover the data before take action on them. *This change is to add Dead, DECOMMISSION_INPROGRESS back.* 1. Dead normal DN is in DECOMMISSION_INPROGRESS first. 2. NN checks pendingReplicationBlocksCount and underReplicatedBlocksCount are both 0. 3. Transit the dead DN to DECOMMISSIONED. 2 is implemented by HDFS-7409, which adds a check to allow dead nodes in DECOMMISSION_IN_PROGRESS to progress to DECOMMISSIONED state if all files on the filesystem are fully-replicated. was: To decommission a dead DN, the complete logic should be Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED *Currently logic:* If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped. 
This logic is introduced by HDFS-7374 which is made because of HDFS-6791. HDFS-6791 keeps the node in DECOMMISSION_INPROGRESS state if the node becomes dead during decommission, which could possibly make a dead DN in DECOMMISSION_INPROGRESS forever, if the DN could never be alive. However, putting a dead DN to DECOMMISSIONED directly is not secure. For example, 3 DN of the same block are dead at the same time, then the administrator wants to decommission them. Namenode should check first before transit them to DECOMMISSIONED. Otherwise, it would be a data loss. In this case, all 3 DNs can't become DECOMMISSIONED which is by design. The administrator needs to do some manual intervention, either repair the dead machine or service or recover the data before take action on them. *This change is to add Dead, DECOMMISSION_INPROGRESS back.* 1. Dead normal DN is in DECOMMISSION_INPROGRESS first. 2. NN checks pendingReplicationBlocksCount and underReplicatedBlocksCount are both 0. 3. Transit the dead DN to DECOMMISSIONED. 2 is implemented by HDFS-7409, which adds a check to allow dead nodes in DECOMMISSION_IN_PROGRESS to progress to DECOMMISSIONED state if all files on the filesystem are fully-replicated, dead DN is in DECOMMISSION_INPROGRESS, then checked, before become DECOMMISSIONED. > Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately > -- > > Key: HDFS-15761 > URL: https://issues.apache.org/jira/browse/HDFS-15761 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ye Ni >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > To decommission a dead DN, the complete logic should be > Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED > *Currently logic:* > If a DN is already dead when DECOMMISSIONING starts, it becomes > DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped. > This logic is introduced by HDFS-7374 which is made because of HDFS-6791. 
> HDFS-6791 keeps the node in DECOMMISSION_INPROGRESS state if the node becomes > dead during decommission, which could possibly make a dead DN in > DECOMMISSION_INPROGRESS forever, if the DN could never be alive. > However, putting a dead DN to DECOMMISSIONED directly is not secure. For > example, 3 DN of the same block are dead at the same time, then the > administrator wants to decommission them. Namenode should check first before > transit them to DECOMMISSIONED. Otherwise, it would be a data loss. > In this case, all 3 DNs can't become DECOMMISSIONED which is by design. The > administrator needs to do some manual intervention, either repair the dead > machine or service or recover the data before take action on them. > *This change is to add Dead, DECOMMISSION_INPROGRESS back.* > 1. Dead normal DN is in DECOMMISSION_INPROGRESS first. >
[jira] [Commented] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
[ https://issues.apache.org/jira/browse/HDFS-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258452#comment-17258452 ] Ye Ni commented on HDFS-15761: -- cc [~mingma], [~andrew.wang], [~aiden_zhang], [~inigoiri] > Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately > -- > > Key: HDFS-15761 > URL: https://issues.apache.org/jira/browse/HDFS-15761 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ye Ni >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > To decommission a dead DN, the complete logic should be > Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED > *Currently logic:* > If a DN is already dead when DECOMMISSIONING starts, it becomes > DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped. > This logic is introduced by https://issues.apache.org/jira/browse/HDFS-7374 > HDFS-7374 is made because of https://issues.apache.org/jira/browse/HDFS-6791. > HDFS-6791 keeps the node in DECOMMISSION_INPROGRESS state if the node becomes > dead during decommission, which could possibly make a dead DN in > DECOMMISSION_INPROGRESS forever, if the DN could never be alive. > However, putting a dead DN to DECOMMISSIONED directly is not secure. For > example, 3 DN of the same block are dead at the same time, then the > administrator puts them to DECOMMISSIONED. Namenode should check first before > transit them to DECOMMISSIONED. Otherwise, it would be a data loss. > In this case, all 3 DNs can't become DECOMMISSIONED which is by design. The > administrator needs to do some manual intervention, either repair the dead > machine or service or recover the data before decommission them. > This change is to add Dead, DECOMMISSION_INPROGRESS back. > 1. Dead normal DN is in DECOMMISSION_INPROGRESS first. > 2. Then checked pendingReplicationBlocksCount and underReplicatedBlocksCount > are both 0 > 3. Transit the dead DN to DECOMMISSIONED. 
> 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, which > adds a check to allow dead nodes in DECOMMISSION_IN_PROGRESS to progress to > DECOMMISSIONED state if all files on the filesystem are fully-replicated, > dead DN is in DECOMMISSION_INPROGRESS, then checked, before become > DECOMMISSIONED.
[jira] [Updated] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
[ https://issues.apache.org/jira/browse/HDFS-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ye Ni updated HDFS-15761: - Description: To decommission a dead DN, the complete logic should be Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED *Currently logic:* If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped. This logic is introduced by https://issues.apache.org/jira/browse/HDFS-7374 HDFS-7374 is made because of https://issues.apache.org/jira/browse/HDFS-6791. HDFS-6791 keeps the node in DECOMMISSION_INPROGRESS state if the node becomes dead during decommission, which could possibly make a dead DN in DECOMMISSION_INPROGRESS forever, if the DN could never be alive. However, putting a dead DN to DECOMMISSIONED directly is not secure. For example, 3 DN of the same block are dead at the same time, then the administrator wants to decommission them. Namenode should check first before transit them to DECOMMISSIONED. Otherwise, it would be a data loss. In this case, all 3 DNs can't become DECOMMISSIONED which is by design. The administrator needs to do some manual intervention, either repair the dead machine or service or recover the data before take action on them. *This change is to add Dead, DECOMMISSION_INPROGRESS back.* 1. Dead normal DN is in DECOMMISSION_INPROGRESS first. 2. NN checks pendingReplicationBlocksCount and underReplicatedBlocksCount are both 0. 3. Transit the dead DN to DECOMMISSIONED. 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, which adds a check to allow dead nodes in DECOMMISSION_IN_PROGRESS to progress to DECOMMISSIONED state if all files on the filesystem are fully-replicated, dead DN is in DECOMMISSION_INPROGRESS, then checked, before become DECOMMISSIONED. 
was: To decommission a dead DN, the complete logic should be Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED *Currently logic:* If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped. This logic is introduced by https://issues.apache.org/jira/browse/HDFS-7374 HDFS-7374 is made because of https://issues.apache.org/jira/browse/HDFS-6791. HDFS-6791 keeps the node in DECOMMISSION_INPROGRESS state if the node becomes dead during decommission, which could possibly make a dead DN in DECOMMISSION_INPROGRESS forever, if the DN could never be alive. However, putting a dead DN to DECOMMISSIONED directly is not secure. For example, 3 DN of the same block are dead at the same time, then the administrator wants to decommission them. Namenode should check first before transit them to DECOMMISSIONED. Otherwise, it would be a data loss. In this case, all 3 DNs can't become DECOMMISSIONED which is by design. The administrator needs to do some manual intervention, either repair the dead machine or service or recover the data before take action on them. *This change is to add Dead, DECOMMISSION_INPROGRESS back.* 1. Dead normal DN is in DECOMMISSION_INPROGRESS first. 2. Then NN check pendingReplicationBlocksCount and underReplicatedBlocksCount are both 0 3. Transit the dead DN to DECOMMISSIONED. 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, which adds a check to allow dead nodes in DECOMMISSION_IN_PROGRESS to progress to DECOMMISSIONED state if all files on the filesystem are fully-replicated, dead DN is in DECOMMISSION_INPROGRESS, then checked, before become DECOMMISSIONED. 
> Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately > -- > > Key: HDFS-15761 > URL: https://issues.apache.org/jira/browse/HDFS-15761 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ye Ni >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > To decommission a dead DN, the complete logic should be > Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED > *Currently logic:* > If a DN is already dead when DECOMMISSIONING starts, it becomes > DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped. > This logic is introduced by https://issues.apache.org/jira/browse/HDFS-7374 > HDFS-7374 is made because of https://issues.apache.org/jira/browse/HDFS-6791. > HDFS-6791 keeps the node in DECOMMISSION_INPROGRESS state if the node becomes > dead during decommission, which could possibly make a dead DN in > DECOMMISSION_INPROGRESS forever, if the DN could never be alive. > However, putting a dead DN to DECOMMISSIONED directly is not secure. For > example, 3 DN of the same block are dead at the same time, then the > administrator wants to decommission them. Namenode should check first before > transit
[jira] [Updated] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
[ https://issues.apache.org/jira/browse/HDFS-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ye Ni updated HDFS-15761: - Description: To decommission a dead DN, the complete logic should be Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED *Current logic:* If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED immediately; DECOMMISSION_INPROGRESS is skipped. This logic was introduced by https://issues.apache.org/jira/browse/HDFS-7374, which was made because of https://issues.apache.org/jira/browse/HDFS-6791. HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if the node becomes dead during decommission, which could leave a dead DN in DECOMMISSION_INPROGRESS forever if the DN never comes back alive. However, moving a dead DN to DECOMMISSIONED directly is not safe. For example, if 3 DNs holding the same block are dead at the same time and the administrator wants to decommission them, the Namenode should check first before transitioning them to DECOMMISSIONED; otherwise it would be a data loss. In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. The administrator needs to do some manual intervention, either repairing the dead machine or service or recovering the data, before taking action on them. *This change is to add Dead, DECOMMISSION_INPROGRESS back.* 1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first. 2. The NN checks that pendingReplicationBlocksCount and underReplicatedBlocksCount are both 0. 3. The dead DN is transitioned to DECOMMISSIONED. Step 2 is implemented by HDFS-7409, which adds a check that allows dead nodes in DECOMMISSION_IN_PROGRESS to progress to the DECOMMISSIONED state once all files on the filesystem are fully replicated.
[jira] [Updated] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
[ https://issues.apache.org/jira/browse/HDFS-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ye Ni updated HDFS-15761: - Description: To decommission a dead DN, the complete logic should be Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED. *Current logic:* If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED immediately; DECOMMISSION_INPROGRESS is skipped. This logic was introduced by https://issues.apache.org/jira/browse/HDFS-7374, which was made because of https://issues.apache.org/jira/browse/HDFS-6791. HDFS-6791 keeps the node in DECOMMISSION_INPROGRESS if it becomes dead during decommission, which could leave a dead DN in DECOMMISSION_INPROGRESS forever if the DN never comes back alive. However, moving a dead DN directly to DECOMMISSIONED is not safe. For example, if three DNs holding replicas of the same block are dead at the same time and the administrator then wants to decommission them, the Namenode should check first before transitioning them to DECOMMISSIONED; otherwise it would risk data loss. In this case, none of the three DNs can become DECOMMISSIONED, which is by design. The administrator needs to intervene manually, either repairing the dead machine or service, or recovering the data, before taking action on them. *This change is to add Dead, DECOMMISSION_INPROGRESS back:* 1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first. 2. Then the NN checks that pendingReplicationBlocksCount and underReplicatedBlocksCount are both 0. 3. The dead DN is transitioned to DECOMMISSIONED. Step 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, which adds a check allowing dead nodes in DECOMMISSION_INPROGRESS to progress to DECOMMISSIONED once all files on the filesystem are fully replicated: the dead DN is in DECOMMISSION_INPROGRESS, is then checked, and only then becomes DECOMMISSIONED.
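The transition logic proposed in the updates above (a dead DN is held in DECOMMISSION_INPROGRESS until the NameNode verifies there are no pending-replication or under-replicated blocks) can be sketched roughly as follows. This is an illustrative model only: the class, enum, and method names below are hypothetical stand-ins, not Hadoop's actual DatanodeAdminManager/BlockManager code.

```java
// Hypothetical sketch of the proposed dead-DN decommission state machine.
// AdminState and nextState are illustrative names, not Hadoop's real API.
public class DecommissionCheck {
    enum AdminState { NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

    /**
     * A DN entering decommission always passes through
     * DECOMMISSION_INPROGRESS first; a dead DN may only reach
     * DECOMMISSIONED once no blocks are pending replication
     * and none are under-replicated.
     */
    static AdminState nextState(AdminState current,
                                boolean isDead,
                                long pendingReplicationBlocks,
                                long underReplicatedBlocks) {
        switch (current) {
            case NORMAL:
                // Never jump straight to DECOMMISSIONED, even for a dead node.
                return AdminState.DECOMMISSION_INPROGRESS;
            case DECOMMISSION_INPROGRESS:
                if (isDead
                        && pendingReplicationBlocks == 0
                        && underReplicatedBlocks == 0) {
                    return AdminState.DECOMMISSIONED;
                }
                return AdminState.DECOMMISSION_INPROGRESS;
            default:
                return current;
        }
    }

    public static void main(String[] args) {
        // Dead node with under-replicated blocks stays in progress.
        System.out.println(
            nextState(AdminState.DECOMMISSION_INPROGRESS, true, 0, 3));
        // Dead node with a fully replicated namespace may complete.
        System.out.println(
            nextState(AdminState.DECOMMISSION_INPROGRESS, true, 0, 0));
    }
}
```

In this sketch the HDFS-7409 check corresponds to the two counter comparisons: only when both reach zero does the dead node leave DECOMMISSION_INPROGRESS.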
[jira] [Commented] (HDFS-15732) EC client will not retry get block token when block token expired in kerberized cluster
[ https://issues.apache.org/jira/browse/HDFS-15732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258598#comment-17258598 ] Wei-Chiu Chuang commented on HDFS-15732: Probably similar to HDFS-10609 and HDFS-11741, where we should retry upon the invalid block token exception.
> EC client will not retry get block token when block token expired in kerberized cluster
>
> Key: HDFS-15732
> URL: https://issues.apache.org/jira/browse/HDFS-15732
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: dfsclient, ec, erasure-coding
> Affects Versions: 3.1.1
> Environment: hadoop 3.1.1, kerberos, ec RS-3-2-1024k
> Reporter: gaozhan ding
> Priority: Major
> Labels: pull-request-available
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> When the EC policy is enabled on HBase, we hit an issue: once the block token has expired on the DataNode side, the client cannot identify the InvalidToken error because of the SASL negotiation. As a result, the EC client does not retry by refetching the token when creating a block reader. The peer DataNode is then added to DeadNodes, and all subsequent createBlockReader calls against this DataNode in the current DFSStripedInputStream treat it as dead and return false. The final result is a read failure.
> Some logs:
> hbase regionserver:
> 2020-12-17 10:00:24,291 WARN [RpcServer.default.FPBQ.Fifo.handler=15,queue=0,port=16020] hdfs.DFSClient: Failed to connect to /10.65.19.41:9866 for blockBP-1601568648-10.65.19.12-1550823043026:blk_-9223372036813273566_672859566
> java.io.IOException: DIGEST-MD5: IO error acquiring password
> at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:421)
> at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:479)
> at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:393)
> at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:267)
> at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:215)
> at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
> at org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:647)
> at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2936)
> at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:821)
> at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:746)
> at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379)
> at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:647)
> at org.apache.hadoop.hdfs.DFSStripedInputStream.createBlockReader(DFSStripedInputStream.java:272)
> at org.apache.hadoop.hdfs.StripeReader.readChunk(StripeReader.java:333)
> at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:365)
> at org.apache.hadoop.hdfs.DFSStripedInputStream.fetchBlockByteRange(DFSStripedInputStream.java:514)
> at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1354)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1318)
> at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:92)
> at org.apache.hadoop.hbase.io.hfile.HFileBlock.positionalReadWithExtra(HFileBlock.java:808)
> at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readAtOffset(HFileBlock.java:1568)
> at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1772)
> at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1597)
> at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1496)
> at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$CellBasedKeyBlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:340)
> at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:856)
> at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:806)
> at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:327)
> at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:228)
> at
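The retry behavior the report asks for — refetching the block token and retrying block reader creation instead of marking the DataNode dead — could be sketched as below. The names (InvalidTokenException, BlockReaderFactory, createBlockReaderWithRetry) are hypothetical stand-ins for illustration, not the real DFSStripedInputStream/BlockReaderFactory API.

```java
// Hypothetical sketch: retry once with a refreshed block token when the
// first attempt fails with an invalid-token error, instead of adding the
// peer DataNode to DeadNodes. Names are illustrative, not Hadoop's API.
public class StripedReadRetry {
    /** Stand-in for org.apache.hadoop.security.token.SecretManager.InvalidToken. */
    static class InvalidTokenException extends RuntimeException {}

    /** Stand-in for the factory that opens a block reader with a given token. */
    interface BlockReaderFactory {
        String create(String token);
    }

    /**
     * Try to create a block reader; on an invalid/expired token, refetch
     * the token (e.g. via the located-block refresh path) and retry once.
     */
    static String createBlockReaderWithRetry(
            BlockReaderFactory factory,
            java.util.function.Supplier<String> refetchToken,
            String token) {
        try {
            return factory.create(token);
        } catch (InvalidTokenException e) {
            // Token likely expired: refresh and retry once rather than
            // declaring the peer DataNode dead for this read.
            return factory.create(refetchToken.get());
        }
    }

    public static void main(String[] args) {
        BlockReaderFactory factory = t -> {
            if (t.equals("expired")) {
                throw new InvalidTokenException();
            }
            return "reader:" + t;
        };
        // First attempt fails with an expired token; the retry succeeds.
        System.out.println(
            createBlockReaderWithRetry(factory, () -> "fresh", "expired"));
    }
}
```

The key design point is that the invalid-token failure is distinguished from a genuine connection failure, so only the former triggers a token refresh rather than DeadNodes bookkeeping.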
[jira] [Commented] (HDFS-15757) RBF: Improving Router Connection Management
[ https://issues.apache.org/jira/browse/HDFS-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258611#comment-17258611 ] Fengnan Li commented on HDFS-15757: --- Uploaded v2 with more metrics and some changes. I will start a POC in this direction. > RBF: Improving Router Connection Management > --- > > Key: HDFS-15757 > URL: https://issues.apache.org/jira/browse/HDFS-15757 > Project: Hadoop HDFS > Issue Type: Improvement > Components: rbf > Reporter: Fengnan Li > Assignee: Fengnan Li > Priority: Major > Attachments: RBF_ Improving Router Connection Management_v2.pdf, RBF_ Router Connection Management.pdf > > > We have seen a high number of connections from the Router to namenodes, leaving the namenodes unstable. > This ticket tries to reduce connections through some changes. Please take a look at the design and leave comments. > Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15757) RBF: Improving Router Connection Management
[ https://issues.apache.org/jira/browse/HDFS-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fengnan Li updated HDFS-15757: -- Attachment: RBF_ Improving Router Connection Management_v2.pdf
[jira] [Commented] (HDFS-15732) EC client will not retry get block token when block token expired in kerberized cluster
[ https://issues.apache.org/jira/browse/HDFS-15732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258600#comment-17258600 ] Wei-Chiu Chuang commented on HDFS-15732: [~lalapala] would you like to submit a PR? I see that the PR2558 was closed. Will add you to the contributor list. Thanks.
[jira] [Assigned] (HDFS-15732) EC client will not retry get block token when block token expired in kerberized cluster
[ https://issues.apache.org/jira/browse/HDFS-15732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei-Chiu Chuang reassigned HDFS-15732: -- Assignee: gaozhan ding
[jira] [Work logged] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
[ https://issues.apache.org/jira/browse/HDFS-15761?focusedWorklogId=530986=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-530986 ] ASF GitHub Bot logged work on HDFS-15761:
Author: ASF GitHub Bot
Created on: 05/Jan/21 00:48
Start Date: 05/Jan/21 00:48
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on pull request #2588: URL: https://github.com/apache/hadoop/pull/2588#issuecomment-754315251

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:---:|---:|:---|:---:|:---:|
| +0 :ok: | reexec | 1m 15s | | Docker mode activated. |

_ Prechecks _
| +1 :green_heart: | dupname | 0m 1s | | No case conflicting files found. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | | 0m 0s | [test4tests](test4tests) | The patch appears to include 1 new or modified test files. |

_ trunk Compile Tests _
| +1 :green_heart: | mvninstall | 36m 48s | | trunk passed |
| +1 :green_heart: | compile | 1m 20s | | trunk passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 |
| +1 :green_heart: | compile | 1m 11s | | trunk passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 |
| +1 :green_heart: | checkstyle | 0m 50s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 18s | | trunk passed |
| +1 :green_heart: | shadedclient | 20m 11s | | branch has no errors when building and testing our client artifacts. |
| +1 :green_heart: | javadoc | 0m 55s | | trunk passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 |
| +1 :green_heart: | javadoc | 1m 26s | | trunk passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 |
| +0 :ok: | spotbugs | 3m 33s | | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 :green_heart: | findbugs | 3m 29s | | trunk passed |

_ Patch Compile Tests _
| +1 :green_heart: | mvninstall | 1m 16s | | the patch passed |
| +1 :green_heart: | compile | 1m 15s | | the patch passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 |
| +1 :green_heart: | javac | 1m 15s | | the patch passed |
| +1 :green_heart: | compile | 1m 9s | | the patch passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 |
| +1 :green_heart: | javac | 1m 9s | | the patch passed |
| -0 :warning: | checkstyle | 0m 43s | [/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2588/1/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 13 unchanged - 0 fixed = 14 total (was 13) |
| +1 :green_heart: | mvnsite | 1m 15s | | the patch passed |
| -1 :x: | whitespace | 0m 0s | [/whitespace-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2588/1/artifact/out/whitespace-eol.txt) | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply |
| +1 :green_heart: | shadedclient | 19m 25s | | patch has no errors when building and testing our client artifacts. |
| +1 :green_heart: | javadoc | 1m 0s | | the patch passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 |
| +1 :green_heart: | javadoc | 1m 34s | | the patch passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 |
| +1 :green_heart: | findbugs | 3m 53s | | the patch passed |

_ Other Tests _
| -1 :x: | unit | 202m 5s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2588/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| -1 :x: | asflicense | 0m 49s | [/patch-asflicense-problems.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2588/1/artifact/out/patch-asflicense-problems.txt) | The patch generated 4 ASF License warnings. |
| | | 305m 33s | | |

| Reason | Tests |
|---:|:--|
| Failed junit tests | hadoop.hdfs.TestReadStripedFileWithDecodingDeletedData |
| | hadoop.hdfs.TestDatanodeDeath |
| | hadoop.hdfs.tools.offlineImageViewer.TestOfflineImageViewerForContentSummary |
| | hadoop.hdfs.server.diskbalancer.TestDiskBalancerWithMockMover |
| | hadoop.hdfs.TestFileChecksum |
| | hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots |
| | hadoop.hdfs.TestSetrepIncreasing |
| | hadoop.hdfs.server.datanode.TestDataNodeErasureCodingMetrics |
| | hadoop.hdfs.server.datanode.TestBPOfferService
[jira] [Updated] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
[ https://issues.apache.org/jira/browse/HDFS-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ye Ni updated HDFS-15761:
-
Description:
To decommission a dead DN, the complete logic should be:
Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED

*Current logic:* If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED immediately; DECOMMISSION_INPROGRESS is skipped. This logic was introduced by HDFS-7374, which was made because of HDFS-6791. HDFS-6791 keeps a node in the DECOMMISSION_INPROGRESS state if it becomes dead during decommission, which could leave a dead DN in DECOMMISSION_INPROGRESS forever if the DN never comes back alive.

However, moving a dead DN directly to DECOMMISSIONED is not safe. For example, suppose all 3 DNs holding replicas of the same block are dead at the same time and the administrator unintentionally decommissions them. The NameNode should check replication before transitioning them to DECOMMISSIONED; otherwise the data loss would go unnoticed. In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. The administrator needs to do some manual intervention, either repairing the dead machine or service, or recovering the data, before taking further action on them.

*This change adds Dead, DECOMMISSION_INPROGRESS back:*
1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first.
2. The NN checks that pendingReplicationBlocksCount and underReplicatedBlocksCount are both 0.
3. The dead DN is transitioned to DECOMMISSIONED.

Step 2 is implemented by HDFS-7409, which adds a check allowing dead nodes in DECOMMISSION_IN_PROGRESS to progress to the DECOMMISSIONED state once all files on the filesystem are fully replicated.

was:
To decommission a dead DN, the complete logic should be:
Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED

*Current logic:* If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED immediately; DECOMMISSION_INPROGRESS is skipped. This logic was introduced by HDFS-7374, which was made because of HDFS-6791. HDFS-6791 keeps a node in the DECOMMISSION_INPROGRESS state if it becomes dead during decommission, which could leave a dead DN in DECOMMISSION_INPROGRESS forever if the DN never comes back alive.

However, moving a dead DN directly to DECOMMISSIONED is not safe. For example, suppose all 3 DNs holding replicas of the same block are dead at the same time and the administrator then decommissions them. The NameNode should check replication before transitioning them to DECOMMISSIONED; otherwise the data loss would go unnoticed. In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. The administrator needs to do some manual intervention, either repairing the dead machine or service, or recovering the data, before taking further action on them.

*This change adds Dead, DECOMMISSION_INPROGRESS back:*
1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first.
2. The NN checks that pendingReplicationBlocksCount and underReplicatedBlocksCount are both 0.
3. The dead DN is transitioned to DECOMMISSIONED.

Step 2 is implemented by HDFS-7409, which adds a check allowing dead nodes in DECOMMISSION_IN_PROGRESS to progress to the DECOMMISSIONED state once all files on the filesystem are fully replicated.

> Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
> --
>
> Key: HDFS-15761
> URL: https://issues.apache.org/jira/browse/HDFS-15761
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ye Ni
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> To decommission a dead DN, the complete logic should be:
> Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED
> *Current logic:* If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED immediately; DECOMMISSION_INPROGRESS is skipped.
> This logic was introduced by HDFS-7374, which was made because of HDFS-6791.
> HDFS-6791 keeps a node in the DECOMMISSION_INPROGRESS state if it becomes dead during decommission, which could leave a dead DN in DECOMMISSION_INPROGRESS forever if the DN never comes back alive.
> However, moving a dead DN directly to DECOMMISSIONED is not safe. For example, suppose all 3 DNs holding replicas of the same block are dead at the same time and the administrator unintentionally decommissions them. The NameNode should check replication before transitioning them to DECOMMISSIONED; otherwise the data loss would go unnoticed.
> In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. The administrator needs to do some manual intervention, either repairing the dead machine or service, or recovering the data, before taking further action on them.
> *This change adds Dead, DECOMMISSION_INPROGRESS back:*
> 1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first.
> 2. The NN checks that pendingReplicationBlocksCount and underReplicatedBlocksCount are both 0.
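The three-step transition described in HDFS-15761 can be sketched as a small state machine. This is a minimal illustration only: the AdminState values and counter names mirror the description above, but DecommissionSketch, startDecommission, and checkProgress are hypothetical names, not actual HDFS classes or methods.

```java
// Hypothetical sketch of the proposed decommission flow for a dead DN.
public class DecommissionSketch {
    enum AdminState { NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

    // Step 1: even a DN that is already dead first enters
    // DECOMMISSION_INPROGRESS instead of jumping straight to DECOMMISSIONED.
    static AdminState startDecommission(AdminState current) {
        return current == AdminState.NORMAL
                ? AdminState.DECOMMISSION_INPROGRESS : current;
    }

    // Steps 2-3: only when no blocks are pending replication or
    // under-replicated (the HDFS-7409 check) may the dead DN be
    // transitioned to DECOMMISSIONED.
    static AdminState checkProgress(AdminState current,
                                    long pendingReplicationBlocksCount,
                                    long underReplicatedBlocksCount) {
        if (current == AdminState.DECOMMISSION_INPROGRESS
                && pendingReplicationBlocksCount == 0
                && underReplicatedBlocksCount == 0) {
            return AdminState.DECOMMISSIONED;
        }
        return current;
    }

    public static void main(String[] args) {
        AdminState s = startDecommission(AdminState.NORMAL);
        System.out.println(s);                      // DECOMMISSION_INPROGRESS
        System.out.println(checkProgress(s, 5, 2)); // blocks remain: unchanged
        System.out.println(checkProgress(s, 0, 0)); // DECOMMISSIONED
    }
}
```

Note how the guard in checkProgress is what prevents the data-loss scenario above: while any block is under-replicated, the dead DNs stay in DECOMMISSION_INPROGRESS, forcing the manual intervention the description calls for.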
[jira] [Updated] (HDFS-15757) RBF: Improving Router Connection Management
[ https://issues.apache.org/jira/browse/HDFS-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fengnan Li updated HDFS-15757:
--
Attachment: (was: RBF_ Improving Router Connection Management_v2.pdf)

> RBF: Improving Router Connection Management
> ---
>
> Key: HDFS-15757
> URL: https://issues.apache.org/jira/browse/HDFS-15757
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: rbf
> Reporter: Fengnan Li
> Assignee: Fengnan Li
> Priority: Major
> Attachments: RBF_ Router Connection Management.pdf
>
> We have seen a high number of connections from the Router to NameNodes, leaving the NameNodes unstable.
> This ticket tries to reduce connections through some changes. Please take a look at the design and leave comments.
> Thanks!
--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15757) RBF: Improving Router Connection Management
[ https://issues.apache.org/jira/browse/HDFS-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fengnan Li updated HDFS-15757:
--
Attachment: RBF_ Improving Router Connection Management_v2.pdf

> RBF: Improving Router Connection Management
> ---
>
> Key: HDFS-15757
> URL: https://issues.apache.org/jira/browse/HDFS-15757
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: rbf
> Reporter: Fengnan Li
> Assignee: Fengnan Li
> Priority: Major
> Attachments: RBF_ Improving Router Connection Management_v2.pdf, RBF_ Router Connection Management.pdf
>
> We have seen a high number of connections from the Router to NameNodes, leaving the NameNodes unstable.
> This ticket tries to reduce connections through some changes. Please take a look at the design and leave comments.
> Thanks!