[jira] [Work started] (HDFS-15719) [Hadoop 3] Both NameNodes can crash simultaneously due to the short JN socket timeout

2021-01-04 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-15719 started by Wei-Chiu Chuang.
--
> [Hadoop 3] Both NameNodes can crash simultaneously due to the short JN socket 
> timeout
> -
>
> Key: HDFS-15719
> URL: https://issues.apache.org/jira/browse/HDFS-15719
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> In Hadoop 3, we migrated from Jetty 6 to Jetty 9. This was implemented in 
> HADOOP-10075.
> However, HADOOP-10075 erroneously set the HttpServer2 socket idle timeout too 
> low.
> We replaced SelectChannelConnector.setLowResourceMaxIdleTime() with 
> ServerConnector.setIdleTimeout(), but they aren't the same.
> Essentially, HttpServer2's idle timeout used to be the default timeout set by 
> Jetty 6, which is 200 seconds. In Hadoop 3, the idle timeout is set to 10 
> seconds, which is unreasonable for the JN. If a NameNode tries to download a 
> big edit log from a JournalNode (say a few hundred MB), the transfer is 
> likely to exceed 10 seconds. When that happens, both NNs crash, and there is 
> no way to work around it unless you apply the patch in HADOOP-15696 to add a 
> config switch for the idle timeout. Fortunately, it doesn't happen often.
> Proposal: bump the default idle timeout to 200 seconds to match the behavior 
> of Jetty 6. (Jetty 9 reduces the default idle timeout to 30 seconds, which is 
> not suitable for the JN.)
> Other things to consider:
> 1. fsck servlet? (Somehow I suspect this is related to the socket timeout 
> reported in HDFS-7175.)
> 2. webhdfs, httpfs? --> We've also received reports that webhdfs can time 
> out, so having a longer timeout makes sense here.
> 3. KMS? Will the longer timeout cause more lingering sockets?
> Thanks [~zhenshan.wen] for the discussion.
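
For illustration, a minimal Jetty 9 sketch of the proposed change, assuming a 
standalone server rather than Hadoop's actual HttpServer2 wiring; the port is 
a placeholder:

{code:java}
// Hedged sketch: the Jetty 9 ServerConnector idle-timeout knob the proposal
// bumps to 200 seconds. Jetty 6's SelectChannelConnector.setLowResourceMaxIdleTime(),
// per its name, governed the low-resource case, which is the mismatch the
// description points out.
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;

public class IdleTimeoutSketch {
  public static void main(String[] args) throws Exception {
    Server server = new Server();
    ServerConnector connector = new ServerConnector(server);
    // 10s is too short for a NameNode pulling a multi-hundred-MB edit log
    // from a JournalNode; 200s matches the old Jetty 6 default behavior.
    connector.setIdleTimeout(200_000L); // milliseconds
    connector.setPort(8480);            // placeholder HTTP port
    server.addConnector(connector);
    server.start();
    server.join();
  }
}
{code}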



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15624) Fix the SetQuotaByStorageTypeOp problem after updating hadoop

2021-01-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15624?focusedWorklogId=531041&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-531041
 ]

ASF GitHub Bot logged work on HDFS-15624:
-

Author: ASF GitHub Bot
Created on: 05/Jan/21 07:08
Start Date: 05/Jan/21 07:08
Worklog Time Spent: 10m 
  Work Description: huangtianhua edited a comment on pull request #2377:
URL: https://github.com/apache/hadoop/pull/2377#issuecomment-754446695


   @ayushtkn, thanks for reviewing it. HDFS-15660 supports handling storage 
types for older clients in a generic way, and it has been merged; or did I 
miss something?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 531041)
Time Spent: 8h  (was: 7h 50m)

>  Fix the SetQuotaByStorageTypeOp problem after updating hadoop 
> ---
>
> Key: HDFS-15624
> URL: https://issues.apache.org/jira/browse/HDFS-15624
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: YaYun Wang
>Priority: Major
>  Labels: pull-request-available, release-blocker
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> HDFS-15025 adds a new storage type, NVDIMM, which changes the ordinal() 
> values of the StorageType enum. Setting a quota by storage type depends on 
> ordinal(), so quota settings may become invalid after an upgrade.
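
A self-contained sketch of the failure mode, using hypothetical enums rather 
than Hadoop's actual StorageType ordering:

{code:java}
// Quotas serialized by ordinal() before the upgrade get reinterpreted against
// the new enum, so inserting a constant silently remaps every stored value.
enum StorageTypeV1 { RAM_DISK, SSD, DISK, ARCHIVE }
enum StorageTypeV2 { RAM_DISK, NVDIMM, SSD, DISK, ARCHIVE } // NVDIMM inserted

public class OrdinalSketch {
  public static void main(String[] args) {
    int persisted = StorageTypeV1.SSD.ordinal();   // 1, written to the edit log
    StorageTypeV2 decoded = StorageTypeV2.values()[persisted];
    System.out.println(decoded);                   // prints NVDIMM, not SSD
  }
}
{code}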



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15624) Fix the SetQuotaByStorageTypeOp problem after updating hadoop

2021-01-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15624?focusedWorklogId=531040&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-531040
 ]

ASF GitHub Bot logged work on HDFS-15624:
-

Author: ASF GitHub Bot
Created on: 05/Jan/21 07:07
Start Date: 05/Jan/21 07:07
Worklog Time Spent: 10m 
  Work Description: huangtianhua commented on pull request #2377:
URL: https://github.com/apache/hadoop/pull/2377#issuecomment-754446695


   @ayushtkn, thanks for reviewing it. HDFS-15660 supports handling storage 
types for older clients in a generic way, and it has been merged; or did I 
miss something?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 531040)
Time Spent: 7h 50m  (was: 7h 40m)

>  Fix the SetQuotaByStorageTypeOp problem after updating hadoop 
> ---
>
> Key: HDFS-15624
> URL: https://issues.apache.org/jira/browse/HDFS-15624
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: YaYun Wang
>Priority: Major
>  Labels: pull-request-available, release-blocker
>  Time Spent: 7h 50m
>  Remaining Estimate: 0h
>
> HDFS-15025 adds a new storage type, NVDIMM, which changes the ordinal() 
> values of the StorageType enum. Setting a quota by storage type depends on 
> ordinal(), so quota settings may become invalid after an upgrade.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15624) Fix the SetQuotaByStorageTypeOp problem after updating hadoop

2021-01-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15624?focusedWorklogId=531063&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-531063
 ]

ASF GitHub Bot logged work on HDFS-15624:
-

Author: ASF GitHub Bot
Created on: 05/Jan/21 07:55
Start Date: 05/Jan/21 07:55
Worklog Time Spent: 10m 
  Work Description: huangtianhua commented on pull request #2377:
URL: https://github.com/apache/hadoop/pull/2377#issuecomment-754470304


   @ayushtkn , in fact we don't have to hold this for HDFS-15660, as Vinay 
said. The code here fixes the specific issues of NVDIMM: it avoids operations 
related to storage type during a rolling upgrade and keeps the ordinal of the 
storage type stable, to make sure the editLog/fsimage still works after a 
NameNode restart. IIUC, the minCompatLV of the NameNode layout version was 
introduced to make sure such operations are refused during a rolling upgrade, 
so I think the approach is appropriate for this situation.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 531063)
Time Spent: 8h 20m  (was: 8h 10m)

>  Fix the SetQuotaByStorageTypeOp problem after updating hadoop 
> ---
>
> Key: HDFS-15624
> URL: https://issues.apache.org/jira/browse/HDFS-15624
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: YaYun Wang
>Priority: Major
>  Labels: pull-request-available, release-blocker
>  Time Spent: 8h 20m
>  Remaining Estimate: 0h
>
> HDFS-15025 adds a new storage type, NVDIMM, which changes the ordinal() 
> values of the StorageType enum. Setting a quota by storage type depends on 
> ordinal(), so quota settings may become invalid after an upgrade.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15624) Fix the SetQuotaByStorageTypeOp problem after updating hadoop

2021-01-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15624?focusedWorklogId=531046&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-531046
 ]

ASF GitHub Bot logged work on HDFS-15624:
-

Author: ASF GitHub Bot
Created on: 05/Jan/21 07:20
Start Date: 05/Jan/21 07:20
Worklog Time Spent: 10m 
  Work Description: ayushtkn commented on pull request #2377:
URL: https://github.com/apache/hadoop/pull/2377#issuecomment-754453252


   @huangtianhua nope, you didn't. I know that it is merged; that is what I 
said. But there were assertions earlier on the jira that we should hold this 
code for HDFS-15660, which would fix something or change our code here. We 
held this jira only because of that. So I just want to wait, so that it can 
be clarified what needs to be done here post HDFS-15660.
   
   And secondly, the NameNode layout version approach had an objection too, 
as I quoted above. We need to get an agreement over there.
   
   For me the code is good enough; once we have clarifications regarding 
these things, we can conclude this.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 531046)
Time Spent: 8h 10m  (was: 8h)

>  Fix the SetQuotaByStorageTypeOp problem after updating hadoop 
> ---
>
> Key: HDFS-15624
> URL: https://issues.apache.org/jira/browse/HDFS-15624
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: YaYun Wang
>Priority: Major
>  Labels: pull-request-available, release-blocker
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> HDFS-15025 adds a new storage type, NVDIMM, which changes the ordinal() 
> values of the StorageType enum. Setting a quota by storage type depends on 
> ordinal(), so quota settings may become invalid after an upgrade.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15719) [Hadoop 3] Both NameNodes can crash simultaneously due to the short JN socket timeout

2021-01-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15719?focusedWorklogId=531016&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-531016
 ]

ASF GitHub Bot logged work on HDFS-15719:
-

Author: ASF GitHub Bot
Created on: 05/Jan/21 04:54
Start Date: 05/Jan/21 04:54
Worklog Time Spent: 10m 
  Work Description: jojochuang merged pull request #2533:
URL: https://github.com/apache/hadoop/pull/2533


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 531016)
Time Spent: 1h 20m  (was: 1h 10m)

> [Hadoop 3] Both NameNodes can crash simultaneously due to the short JN socket 
> timeout
> -
>
> Key: HDFS-15719
> URL: https://issues.apache.org/jira/browse/HDFS-15719
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> In Hadoop 3, we migrated from Jetty 6 to Jetty 9. This was implemented in 
> HADOOP-10075.
> However, HADOOP-10075 erroneously set the HttpServer2 socket idle timeout too 
> low.
> We replaced SelectChannelConnector.setLowResourceMaxIdleTime() with 
> ServerConnector.setIdleTimeout(), but they aren't the same.
> Essentially, HttpServer2's idle timeout used to be the default timeout set by 
> Jetty 6, which is 200 seconds. In Hadoop 3, the idle timeout is set to 10 
> seconds, which is unreasonable for the JN. If a NameNode tries to download a 
> big edit log from a JournalNode (say a few hundred MB), the transfer is 
> likely to exceed 10 seconds. When that happens, both NNs crash, and there is 
> no way to work around it unless you apply the patch in HADOOP-15696 to add a 
> config switch for the idle timeout. Fortunately, it doesn't happen often.
> Proposal: bump the default idle timeout to 200 seconds to match the behavior 
> of Jetty 6. (Jetty 9 reduces the default idle timeout to 30 seconds, which is 
> not suitable for the JN.)
> Other things to consider:
> 1. fsck servlet? (Somehow I suspect this is related to the socket timeout 
> reported in HDFS-7175.)
> 2. webhdfs, httpfs? --> We've also received reports that webhdfs can time 
> out, so having a longer timeout makes sense here.
> 3. KMS? Will the longer timeout cause more lingering sockets?
> Thanks [~zhenshan.wen] for the discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-15719) [Hadoop 3] Both NameNodes can crash simultaneously due to the short JN socket timeout

2021-01-04 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-15719.

Fix Version/s: 3.2.3
   3.1.5
   3.4.0
   3.3.1
   Resolution: Fixed

> [Hadoop 3] Both NameNodes can crash simultaneously due to the short JN socket 
> timeout
> -
>
> Key: HDFS-15719
> URL: https://issues.apache.org/jira/browse/HDFS-15719
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> In Hadoop 3, we migrated from Jetty 6 to Jetty 9. This was implemented in 
> HADOOP-10075.
> However, HADOOP-10075 erroneously set the HttpServer2 socket idle timeout too 
> low.
> We replaced SelectChannelConnector.setLowResourceMaxIdleTime() with 
> ServerConnector.setIdleTimeout(), but they aren't the same.
> Essentially, HttpServer2's idle timeout used to be the default timeout set by 
> Jetty 6, which is 200 seconds. In Hadoop 3, the idle timeout is set to 10 
> seconds, which is unreasonable for the JN. If a NameNode tries to download a 
> big edit log from a JournalNode (say a few hundred MB), the transfer is 
> likely to exceed 10 seconds. When that happens, both NNs crash, and there is 
> no way to work around it unless you apply the patch in HADOOP-15696 to add a 
> config switch for the idle timeout. Fortunately, it doesn't happen often.
> Proposal: bump the default idle timeout to 200 seconds to match the behavior 
> of Jetty 6. (Jetty 9 reduces the default idle timeout to 30 seconds, which is 
> not suitable for the JN.)
> Other things to consider:
> 1. fsck servlet? (Somehow I suspect this is related to the socket timeout 
> reported in HDFS-7175.)
> 2. webhdfs, httpfs? --> We've also received reports that webhdfs can time 
> out, so having a longer timeout makes sense here.
> 3. KMS? Will the longer timeout cause more lingering sockets?
> Thanks [~zhenshan.wen] for the discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15719) [Hadoop 3] Both NameNodes can crash simultaneously due to the short JN socket timeout

2021-01-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15719?focusedWorklogId=531017&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-531017
 ]

ASF GitHub Bot logged work on HDFS-15719:
-

Author: ASF GitHub Bot
Created on: 05/Jan/21 04:55
Start Date: 05/Jan/21 04:55
Worklog Time Spent: 10m 
  Work Description: jojochuang commented on pull request #2533:
URL: https://github.com/apache/hadoop/pull/2533#issuecomment-754395404


   Thanks Ayush and Stephen!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 531017)
Time Spent: 1.5h  (was: 1h 20m)

> [Hadoop 3] Both NameNodes can crash simultaneously due to the short JN socket 
> timeout
> -
>
> Key: HDFS-15719
> URL: https://issues.apache.org/jira/browse/HDFS-15719
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> In Hadoop 3, we migrated from Jetty 6 to Jetty 9. This was implemented in 
> HADOOP-10075.
> However, HADOOP-10075 erroneously set the HttpServer2 socket idle timeout too 
> low.
> We replaced SelectChannelConnector.setLowResourceMaxIdleTime() with 
> ServerConnector.setIdleTimeout(), but they aren't the same.
> Essentially, HttpServer2's idle timeout used to be the default timeout set by 
> Jetty 6, which is 200 seconds. In Hadoop 3, the idle timeout is set to 10 
> seconds, which is unreasonable for the JN. If a NameNode tries to download a 
> big edit log from a JournalNode (say a few hundred MB), the transfer is 
> likely to exceed 10 seconds. When that happens, both NNs crash, and there is 
> no way to work around it unless you apply the patch in HADOOP-15696 to add a 
> config switch for the idle timeout. Fortunately, it doesn't happen often.
> Proposal: bump the default idle timeout to 200 seconds to match the behavior 
> of Jetty 6. (Jetty 9 reduces the default idle timeout to 30 seconds, which is 
> not suitable for the JN.)
> Other things to consider:
> 1. fsck servlet? (Somehow I suspect this is related to the socket timeout 
> reported in HDFS-7175.)
> 2. webhdfs, httpfs? --> We've also received reports that webhdfs can time 
> out, so having a longer timeout makes sense here.
> 3. KMS? Will the longer timeout cause more lingering sockets?
> Thanks [~zhenshan.wen] for the discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15624) Fix the SetQuotaByStorageTypeOp problem after updating hadoop

2021-01-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15624?focusedWorklogId=530579&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-530579
 ]

ASF GitHub Bot logged work on HDFS-15624:
-

Author: ASF GitHub Bot
Created on: 04/Jan/21 09:46
Start Date: 04/Jan/21 09:46
Worklog Time Spent: 10m 
  Work Description: ayushtkn commented on pull request #2377:
URL: https://github.com/apache/hadoop/pull/2377#issuecomment-753872315


   Thanx @huangtianhua for the work here. Sorry I couldn't get back to your 
emails & pings.
   
   @brahmareddybattula has objections on the jira to the approach itself. 
Quoting him from the jira:
   
   >I dn't think bumping the namelayout is best solution, need to check other 
way. ( may be like checking the client version during the upgrade.)
   
   Is there no code change needed post HDFS-15660? It was asserted that the 
generic solution would solve this problem or change something here.
   
   So we might need changes here post HDFS-15660. We should wait for him, 
unless he is convinced.
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 530579)
Time Spent: 7h 40m  (was: 7.5h)

>  Fix the SetQuotaByStorageTypeOp problem after updating hadoop 
> ---
>
> Key: HDFS-15624
> URL: https://issues.apache.org/jira/browse/HDFS-15624
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: YaYun Wang
>Priority: Major
>  Labels: pull-request-available, release-blocker
>  Time Spent: 7h 40m
>  Remaining Estimate: 0h
>
> HDFS-15025 adds a new storage type, NVDIMM, which changes the ordinal() 
> values of the StorageType enum. Setting a quota by storage type depends on 
> ordinal(), so quota settings may become invalid after an upgrade.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15549) Improve DISK/ARCHIVE movement if they are on same filesystem

2021-01-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15549?focusedWorklogId=530583&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-530583
 ]

ASF GitHub Bot logged work on HDFS-15549:
-

Author: ASF GitHub Bot
Created on: 04/Jan/21 09:51
Start Date: 04/Jan/21 09:51
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2583:
URL: https://github.com/apache/hadoop/pull/2583#issuecomment-753874888


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |  47m 34s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  1s |  |  No case conflicting files 
found.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |   |   0m  0s | [test4tests](test4tests) |  The patch 
appears to include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 21s |  |  Maven dependency ordering for branch  |
   | -1 :x: |  mvninstall  |   0m 23s | 
[/branch-mvninstall-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/1/artifact/out/branch-mvninstall-root.txt)
 |  root in trunk failed.  |
   | -1 :x: |  compile  |   0m 25s | 
[/branch-compile-root-jdkUbuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/1/artifact/out/branch-compile-root-jdkUbuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.txt)
 |  root in trunk failed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.  |
   | -1 :x: |  compile  |   0m 22s | 
[/branch-compile-root-jdkPrivateBuild-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/1/artifact/out/branch-compile-root-jdkPrivateBuild-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.txt)
 |  root in trunk failed with JDK Private 
Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.  |
   | -0 :warning: |  checkstyle  |   0m 21s | 
[/buildtool-branch-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/1/artifact/out/buildtool-branch-checkstyle-root.txt)
 |  The patch fails to run checkstyle in root  |
   | -1 :x: |  mvnsite  |   0m 24s | 
[/branch-mvnsite-hadoop-common-project_hadoop-common.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/1/artifact/out/branch-mvnsite-hadoop-common-project_hadoop-common.txt)
 |  hadoop-common in trunk failed.  |
   | -1 :x: |  mvnsite  |   4m 15s | 
[/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/1/artifact/out/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in trunk failed.  |
   | -1 :x: |  shadedclient  |  11m 37s |  |  branch has errors when building 
and testing our client artifacts.  |
   | -1 :x: |  javadoc  |   0m 23s | 
[/branch-javadoc-hadoop-common-project_hadoop-common-jdkUbuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/1/artifact/out/branch-javadoc-hadoop-common-project_hadoop-common-jdkUbuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.txt)
 |  hadoop-common in trunk failed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.  |
   | -1 :x: |  javadoc  |   0m 29s | 
[/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/1/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.txt)
 |  hadoop-hdfs in trunk failed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.  |
   | -1 :x: |  javadoc  |   0m 24s | 
[/branch-javadoc-hadoop-common-project_hadoop-common-jdkPrivateBuild-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/1/artifact/out/branch-javadoc-hadoop-common-project_hadoop-common-jdkPrivateBuild-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.txt)
 |  hadoop-common in trunk failed with JDK Private 
Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.  |
   | -1 :x: |  javadoc  |   0m 24s | 
[/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/1/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.txt)
 |  hadoop-hdfs in trunk failed with JDK Private 
Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.  |
   | +0 :ok: |  spotbugs  |  14m 11s |  |  Used deprecated FindBugs config; 
considering switching to SpotBugs.  |
   | -1 :x: |  findbugs  |   0m 30s | 

[jira] [Commented] (HDFS-15735) NameNode memory Leak on frequent execution of fsck

2021-01-04 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258131#comment-17258131
 ] 

Ayush Saxena commented on HDFS-15735:
-

Tracer is a {{private}} variable,

Not used anywhere.

Tracer is subject to removal due to a CVE (IIRC), see HADOOP-17387 and 
others; one was mentioned recently too.

Harmless things are not always correct:

closing the tracer in fsck() may have an impact if someone is using the 
tracer after it (if so).

Closing it in the last line of fsck may not fix the issue you are targeting. 
The moment control leaves the method, wouldn't the tracer be subject to GC 
anyway? Closing it won't help; it only makes it subject to GC as well.

If someone is using the tracer in internal code, not in open source, there is 
nowhere it is used here, so there is no need to keep it; they can keep it in 
their internal code. 

Removal would save the memory allocation and isn't incompatible in any way. 
It would be even better.

Sometimes listening to others doesn't hurt; it's not just me, [~John Smith] 
had a comment too.

 

** Now the catch: I will still respect your opinion on this, though you 
aren't interested in mine :( You won't see me committing shortly unless you 
are convinced, in "any" jira. And I don't claim I am correct here; I am just 
proposing something that, if it looks good, can be done. I can be *wrong*, 
completely wrong, and would be happy to accept that.

I would request you to consider the other options as well. I shall be happy 
to connect with you offline too, if you want.

On this note, I take my vote back. 

> NameNode memory Leak on frequent execution of fsck  
> 
>
> Key: HDFS-15735
> URL: https://issues.apache.org/jira/browse/HDFS-15735
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ravuri Sushma sree
>Assignee: Ravuri Sushma sree
>Priority: Major
> Attachments: HDFS-15735.001.patch
>
>
> The memory of the cluster NameNode continues to grow, and the full GC 
> eventually leads to the failure of both the active and standby HDFS 
> NameNodes.
> HTrace is used to track the processing time of fsck.
> Checking the code, it was found that the tracer object in NamenodeFsck.java 
> is only created and never closed; because of this, the memory footprint 
> continues to grow.
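
A self-contained sketch of the suspected leak pattern and the proposed fix. 
The Tracer below is a stand-in for org.apache.htrace.core.Tracer; the 
assumption, as with HTrace's tracer pool, is that a tracer stays registered 
in a global structure until it is closed:

{code:java}
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;

public class FsckTracerSketch {
  // Stand-in for the global registry that keeps unclosed tracers reachable.
  static final List<Tracer> GLOBAL_POOL = new ArrayList<>();

  static class Tracer implements Closeable {
    Tracer() { GLOBAL_POOL.add(this); }            // registered on creation
    @Override public void close() { GLOBAL_POOL.remove(this); }
  }

  // Leaky shape: a new tracer per fsck() call, never closed, so the pool
  // (and the NameNode heap) grows with every fsck invocation.
  static void fsckLeaky() {
    Tracer tracer = new Tracer();
    // ... run fsck, timing it via the tracer ...
  }

  // Proposed shape: close the tracer when fsck() finishes.
  static void fsckFixed() {
    try (Tracer tracer = new Tracer()) {
      // ... run fsck, timing it via the tracer ...
    }
  }

  public static void main(String[] args) {
    fsckLeaky();
    fsckFixed();
    System.out.println("tracers still pooled: " + GLOBAL_POOL.size()); // 1
  }
}
{code}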



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15754) Create packet metrics for DataNode

2021-01-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15754?focusedWorklogId=530640&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-530640
 ]

ASF GitHub Bot logged work on HDFS-15754:
-

Author: ASF GitHub Bot
Created on: 04/Jan/21 12:43
Start Date: 04/Jan/21 12:43
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2578:
URL: https://github.com/apache/hadoop/pull/2578#issuecomment-753954286


   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 47s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  markdownlint  |   0m  0s |  |  markdownlint was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |   |   0m  0s | [test4tests](test4tests) |  The patch 
appears to include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  13m 39s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  26m 51s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  24m 25s |  |  trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04  |
   | +1 :green_heart: |  compile  |  19m 57s |  |  trunk passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01  |
   | +1 :green_heart: |  checkstyle  |   2m 44s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   3m  7s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  21m 24s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | +1 :green_heart: |  javadoc  |   2m 10s |  |  trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04  |
   | +1 :green_heart: |  javadoc  |   3m 20s |  |  trunk passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01  |
   | +0 :ok: |  spotbugs  |   3m 17s |  |  Used deprecated FindBugs config; 
considering switching to SpotBugs.  |
   | +1 :green_heart: |  findbugs  |   5m 38s |  |  trunk passed  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 27s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 10s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  20m 50s |  |  the patch passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04  |
   | +1 :green_heart: |  javac  |  20m 50s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  18m 32s |  |  the patch passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01  |
   | +1 :green_heart: |  javac  |  18m 32s |  |  the patch passed  |
   | -0 :warning: |  checkstyle  |   2m 39s | 
[/diff-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2578/3/artifact/out/diff-checkstyle-root.txt)
 |  root: The patch generated 4 new + 124 unchanged - 0 fixed = 128 total (was 
124)  |
   | +1 :green_heart: |  mvnsite  |   3m  4s |  |  the patch passed  |
   | +1 :green_heart: |  whitespace  |   0m  0s |  |  The patch has no 
whitespace issues.  |
   | +1 :green_heart: |  shadedclient  |  15m 25s |  |  patch has no errors 
when building and testing our client artifacts.  |
   | +1 :green_heart: |  javadoc  |   2m  6s |  |  the patch passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04  |
   | +1 :green_heart: |  javadoc  |   3m 17s |  |  the patch passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01  |
   | +1 :green_heart: |  findbugs  |   5m 52s |  |  the patch passed  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   9m 58s |  |  hadoop-common in the patch 
passed.  |
   | +1 :green_heart: |  unit  | 102m  1s |  |  hadoop-hdfs in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   1m  5s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 311m 16s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2578/3/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/2578 |
   | Optional Tests | dupname asflicense mvnsite markdownlint compile javac 
javadoc mvninstall unit shadedclient findbugs checkstyle |
   | uname | Linux 63beae1e56dc 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 
23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 2825d060cf9 |
   | Default Java | Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 

[jira] [Commented] (HDFS-15735) NameNode memory Leak on frequent execution of fsck

2021-01-04 Thread Brahma Reddy Battula (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258055#comment-17258055
 ] 

Brahma Reddy Battula commented on HDFS-15735:
-

{quote}I am not sure about it, why it being configurable makes it necessary to 
be here, why closing is better. Please hold it.
-1
{quote}
Removal can impact existing users who use this feature, since they have it 
configured, whereas the proposed fix will not break anything.

Not sure why this needs to be held and given a -1; I feel this is not good 
practice. 

> NameNode memory Leak on frequent execution of fsck  
> 
>
> Key: HDFS-15735
> URL: https://issues.apache.org/jira/browse/HDFS-15735
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ravuri Sushma sree
>Assignee: Ravuri Sushma sree
>Priority: Major
> Attachments: HDFS-15735.001.patch
>
>
> The memory of the cluster NameNode continues to grow, and the full GC 
> eventually leads to the failure of both the active and standby HDFS 
> NameNodes.
> HTrace is used to track the processing time of fsck.
> Checking the code, it was found that the tracer object in NamenodeFsck.java 
> is only created and never closed; because of this, the memory footprint 
> continues to grow.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2021-01-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15759?focusedWorklogId=530681&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-530681
 ]

ASF GitHub Bot logged work on HDFS-15759:
-

Author: ASF GitHub Bot
Created on: 04/Jan/21 14:13
Start Date: 04/Jan/21 14:13
Worklog Time Spent: 10m 
  Work Description: touchida opened a new pull request #2585:
URL: https://github.com/apache/hadoop/pull/2585


   ## NOTICE
   
   Please create an issue in ASF JIRA before opening a pull request,
   and you need to set the title of the pull request which starts with
   the corresponding JIRA issue number. (e.g. HADOOP-X. Fix a typo in YYY.)
   For more details, please see 
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 530681)
Remaining Estimate: 0h
Time Spent: 10m

> EC: Verify EC reconstruction correctness on DataNode
> 
>
> Key: HDFS-15759
> URL: https://issues.apache.org/jira/browse/HDFS-15759
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, ec, erasure-coding
>Affects Versions: 3.4.0
>Reporter: Toshihiko Uchida
>Assignee: Toshihiko Uchida
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> EC reconstruction on DataNode has caused data corruption: HDFS-14768, 
> HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and 
> the corruption is neither detected nor auto-healed by HDFS. It is obviously 
> hard for users to monitor data integrity by themselves, and even if they find 
> corrupted data, it is difficult or sometimes impossible to recover them.
> To prevent further data corruption issues, this feature proposes a simple and 
> effective way to verify EC reconstruction correctness on DataNode at each 
> reconstruction process.
> It verifies the correctness of the outputs decoded from the inputs as follows:
> 1. Decode an input from the outputs;
> 2. Compare the decoded input with the original input.
> For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs 
> [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from 
> [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0.
> When an EC reconstruction task goes wrong, the comparison will fail with high 
> probability.
> Then the task will also fail and be retried by the NameNode.
> The next reconstruction will succeed if the condition that triggered the 
> failure is gone.
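
A toy, self-contained illustration of the scheme, using a single XOR parity 
cell as a stand-in for RS-6-3 (the real implementation would go through 
Hadoop's raw erasure decoder):

{code:java}
public class EcVerifySketch {
  // XOR "code": any one unit equals the XOR of all the others.
  static byte xorOf(byte... units) {
    byte acc = 0;
    for (byte u : units) acc ^= u;
    return acc;
  }

  public static void main(String[] args) {
    byte d0 = 3, d1 = 5, d2 = 7;
    byte p0 = xorOf(d0, d1, d2);            // parity

    // Reconstruction: d1 was lost; rebuild it from [d0, d2, p0].
    byte rebuiltD1 = xorOf(d0, d2, p0);

    // Verification: re-decode input d0 from the outputs [rebuiltD1, d2, p0]
    // and compare with the original. A buggy reconstruction fails this with
    // high probability, so the task fails and the NameNode retries it.
    byte redecodedD0 = xorOf(rebuiltD1, d2, p0);
    if (redecodedD0 != d0) {
      throw new IllegalStateException("EC reconstruction verification failed");
    }
    System.out.println("reconstruction verified: d1=" + rebuiltD1);
  }
}
{code}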



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2021-01-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-15759:
--
Labels: pull-request-available  (was: )

> EC: Verify EC reconstruction correctness on DataNode
> 
>
> Key: HDFS-15759
> URL: https://issues.apache.org/jira/browse/HDFS-15759
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, ec, erasure-coding
>Affects Versions: 3.4.0
>Reporter: Toshihiko Uchida
>Assignee: Toshihiko Uchida
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> EC reconstruction on DataNode has caused data corruption: HDFS-14768, 
> HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and 
> the corruption is neither detected nor auto-healed by HDFS. It is obviously 
> hard for users to monitor data integrity by themselves, and even if they find 
> corrupted data, it is difficult or sometimes impossible to recover them.
> To prevent further data corruption issues, this feature proposes a simple and 
> effective way to verify EC reconstruction correctness on DataNode at each 
> reconstruction process.
> It verifies the correctness of the outputs decoded from the inputs as follows:
> 1. Decode an input from the outputs;
> 2. Compare the decoded input with the original input.
> For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs 
> [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from 
> [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0.
> When an EC reconstruction task goes wrong, the comparison will fail with high 
> probability.
> Then the task will also fail and be retried by the NameNode.
> The next reconstruction will succeed if the condition that triggered the 
> failure is gone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2021-01-04 Thread Toshihiko Uchida (Jira)
Toshihiko Uchida created HDFS-15759:
---

 Summary: EC: Verify EC reconstruction correctness on DataNode
 Key: HDFS-15759
 URL: https://issues.apache.org/jira/browse/HDFS-15759
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: datanode, ec, erasure-coding
Affects Versions: 3.4.0
Reporter: Toshihiko Uchida


EC reconstruction on DataNode has caused data corruption: HDFS-14768, 
HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and the 
corruption is neither detected nor auto-healed by HDFS. It is obviously hard 
for users to monitor data integrity by themselves, and even if they find 
corrupted data, it is difficult or sometimes impossible to recover them.

To prevent further data corruption issues, this feature proposes a simple and 
effective way to verify EC reconstruction correctness on DataNode at each 
reconstruction process.
It verifies the correctness of the outputs decoded from the inputs as follows:
1. Decode an input from the outputs;
2. Compare the decoded input with the original input.
For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs 
[d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from 
[d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0.

When an EC reconstruction task goes wrong, the comparison will fail with high 
probability.
Then the task will also fail and be retried by the NameNode.
The next reconstruction will succeed if the condition that triggered the 
failure is gone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15751) Add documentation for msync() API to filesystem.md

2021-01-04 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258259#comment-17258259
 ] 

Steve Loughran commented on HDFS-15751:
---

LGTM, though the doc reference to HDFS should be relative to the final build 
paths, i.e. self-contained.

Still need a story for viewfs

> Add documentation for msync() API to filesystem.md
> --
>
> Key: HDFS-15751
> URL: https://issues.apache.org/jira/browse/HDFS-15751
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
>Priority: Major
> Fix For: 3.2.2, 3.3.1, 3.4.0, 3.1.5, 2.10.2, 3.2.3
>
> Attachments: HDFS-15751-01.patch, HDFS-15751-02.patch, 
> HDFS-15751-03.patch
>
>
> HDFS-15567 introduced a new {{FileSystem}} call, {{msync()}}. We should add 
> it to the API definitions.
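
For reference, a minimal usage sketch of the call being documented. 
FileSystem.msync() exists on Hadoop 3.x; filesystems without such consistency 
semantics throw UnsupportedOperationException:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MsyncSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // msync() brings this client's view in sync with the Active NameNode so
    // that a subsequent read served by an Observer NameNode reflects all
    // writes made before the call.
    fs.msync();
    System.out.println(fs.getFileStatus(new Path("/"))); // consistent read
  }
}
{code}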



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2021-01-04 Thread Toshihiko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toshihiko Uchida reassigned HDFS-15759:
---

Assignee: Toshihiko Uchida

> EC: Verify EC reconstruction correctness on DataNode
> 
>
> Key: HDFS-15759
> URL: https://issues.apache.org/jira/browse/HDFS-15759
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, ec, erasure-coding
>Affects Versions: 3.4.0
>Reporter: Toshihiko Uchida
>Assignee: Toshihiko Uchida
>Priority: Major
>
> EC reconstruction on DataNode has caused data corruption: HDFS-14768, 
> HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and 
> the corruption is neither detected nor auto-healed by HDFS. It is obviously 
> hard for users to monitor data integrity by themselves, and even if they find 
> corrupted data, it is difficult or sometimes impossible to recover them.
> To prevent further data corruption issues, this feature proposes a simple and 
> effective way to verify EC reconstruction correctness on DataNode at each 
> reconstruction process.
> It verifies the correctness of the outputs decoded from the inputs as follows:
> 1. Decode an input from the outputs;
> 2. Compare the decoded input with the original input.
> For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs 
> [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from 
> [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0.
> When an EC reconstruction task goes wrong, the comparison will fail with high 
> probability.
> Then the task will also fail and be retried by the NameNode.
> The next reconstruction will succeed if the condition that triggered the 
> failure is gone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15757) RBF: Improving Router Connection Management

2021-01-04 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258364#comment-17258364
 ] 

Íñigo Goiri commented on HDFS-15757:


Thank you [~fengnanli] for the proposal.
The connection manager is pretty tricky, as it can impact the performance of 
the router substantially.
Your proposal makes sense. Do you have specific scenarios where the metrics 
clearly show the connections in a bad state?
It would be nice to have some benchmarks too.
In any case, your proposal doesn't seem too complex, so we should go ahead 
with a patch and iterate from there.

> RBF: Improving Router Connection Management
> ---
>
> Key: HDFS-15757
> URL: https://issues.apache.org/jira/browse/HDFS-15757
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: RBF_ Router Connection Management.pdf
>
>
> We have seen a high number of connections from the Router to the NameNodes, 
> leaving the NameNodes unstable.
> This ticket tries to reduce the number of connections through some changes. 
> Please take a look at the design and leave comments. 
> Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15760) Validate the target indices in ErasureCoding worker in reconstruction process

2021-01-04 Thread Uma Maheswara Rao G (Jira)
Uma Maheswara Rao G created HDFS-15760:
--

 Summary: Validate the target indices in ErasureCoding worker in 
reconstruction process
 Key: HDFS-15760
 URL: https://issues.apache.org/jira/browse/HDFS-15760
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: ec
Affects Versions: 3.4.0
Reporter: Uma Maheswara Rao G


As we have seen in issues like
 # HDFS-15186
 # HDFS-14768

it is a good idea to validate the indices on the ECWorker side and skip any 
unintended indices from the target list.

Both of those issues were triggered because the NN accidentally scheduled 
reconstruction during the decommission process due to a busy node. We have 
fixed the NN to consider busy nodes as live replicas. However, it may be a 
good idea to safeguard the condition at the ECWorker as well, in case some 
other condition leads the ECWorker to calculate the indices as in the above 
issues and the EC function returns wrong output. I think it's OK to recover 
only the missing indices from the given source indices.
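
A minimal sketch of the proposed safeguard, with a hypothetical helper rather 
than the actual ECWorker code: drop any target index that is not genuinely 
missing from the source indices before invoking the decoder.

{code:java}
import java.util.Arrays;
import java.util.BitSet;

public class ValidateTargetsSketch {
  static int[] sanitizeTargets(int[] srcIndices, int[] targetIndices, int totalUnits) {
    BitSet present = new BitSet(totalUnits);
    for (int i : srcIndices) {
      present.set(i);
    }
    // Keep only targets that are not already live among the sources.
    return Arrays.stream(targetIndices).filter(t -> !present.get(t)).toArray();
  }

  public static void main(String[] args) {
    int[] src = {0, 2, 3, 4, 5, 6};   // live units of an RS-6-3 block group
    int[] targets = {1, 4, 7};        // 4 is live: an unintended target
    System.out.println(Arrays.toString(sanitizeTargets(src, targets, 9)));
    // prints [1, 7]: recover only the genuinely missing indices
  }
}
{code}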

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15748) RBF: Move the router related part from hadoop-federation-balance module to hadoop-hdfs-rbf.

2021-01-04 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258366#comment-17258366
 ] 

Íñigo Goiri commented on HDFS-15748:


+1 on  [^HDFS-15748.004.patch].

> RBF: Move the router related part from hadoop-federation-balance module to 
> hadoop-hdfs-rbf.
> ---
>
> Key: HDFS-15748
> URL: https://issues.apache.org/jira/browse/HDFS-15748
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15748.001.patch, HDFS-15748.002.patch, 
> HDFS-15748.003.patch, HDFS-15748.004.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15757) RBF: Improving Router Connection Management

2021-01-04 Thread Fengnan Li (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258376#comment-17258376
 ] 

Fengnan Li commented on HDFS-15757:
---

Thanks for the review [~elgoiri]. There are two metrics we will try to improve:
1. RpcClientNumConnections should go down in each router.
2. RpcClientNumActiveConnections / RpcClientNumConnections should go up in 
each router.

I will add more graphs for this in an updated doc. The first version was meant 
to gather some initial feedback.

> RBF: Improving Router Connection Management
> ---
>
> Key: HDFS-15757
> URL: https://issues.apache.org/jira/browse/HDFS-15757
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: RBF_ Router Connection Management.pdf
>
>
> We have seen a high number of connections from the Router to the NameNodes, 
> leaving the NameNodes unstable.
> This ticket tries to reduce the number of connections through some changes. 
> Please take a look at the design and leave comments. 
> Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15748) RBF: Move the router related part from hadoop-federation-balance module to hadoop-hdfs-rbf.

2021-01-04 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-15748:

Hadoop Flags: Reviewed
  Resolution: Fixed
  Status: Resolved  (was: Patch Available)

> RBF: Move the router related part from hadoop-federation-balance module to 
> hadoop-hdfs-rbf.
> ---
>
> Key: HDFS-15748
> URL: https://issues.apache.org/jira/browse/HDFS-15748
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15748.001.patch, HDFS-15748.002.patch, 
> HDFS-15748.003.patch, HDFS-15748.004.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately

2021-01-04 Thread Ye Ni (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258452#comment-17258452
 ] 

Ye Ni edited comment on HDFS-15761 at 1/4/21, 7:45 PM:
---

cc [~mingma], [~andrew.wang], [~zhz] , [~inigoiri]


was (Author: nickyye):
cc [~mingma], [~andrew.wang], [~aiden_zhang], [~inigoiri]

> Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
> --
>
> Key: HDFS-15761
> URL: https://issues.apache.org/jira/browse/HDFS-15761
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ye Ni
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> To decommission a dead DN, the complete logic should be
> Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED
> *Current logic:*
> If a DN is already dead when DECOMMISSIONING starts, it becomes 
> DECOMMISSIONED immediately; DECOMMISSION_INPROGRESS is skipped.
> This logic was introduced by https://issues.apache.org/jira/browse/HDFS-7374
> HDFS-7374 was made because of https://issues.apache.org/jira/browse/HDFS-6791.
> HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if the node 
> becomes dead during decommission, which could possibly leave a dead DN in 
> DECOMMISSION_INPROGRESS forever if the DN never comes back alive.
> However, putting a dead DN into DECOMMISSIONED directly is not safe. For 
> example, if 3 DNs holding the same block die at the same time and the 
> administrator then puts them into DECOMMISSIONED, the NameNode should check 
> first before transitioning them to DECOMMISSIONED; otherwise it could mean 
> data loss.
> In this case, none of the 3 DNs can become DECOMMISSIONED, which is by 
> design. The administrator needs to do some manual intervention, either 
> repair the dead machine or service, or recover the data, before 
> decommissioning them.
> This change is to add Dead, DECOMMISSION_INPROGRESS back.
> 1. A dead normal DN goes into DECOMMISSION_INPROGRESS first.
> 2. Then check that pendingReplicationBlocksCount and 
> underReplicatedBlocksCount are both 0.
> 3. Transition the dead DN to DECOMMISSIONED.
> Step 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, 
> which adds a check to allow dead nodes in DECOMMISSION_INPROGRESS to 
> progress to the DECOMMISSIONED state if all files on the filesystem are 
> fully replicated: the dead DN is in DECOMMISSION_INPROGRESS first, then 
> checked, before becoming DECOMMISSIONED.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately

2021-01-04 Thread Ye Ni (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Ni updated HDFS-15761:
-
Description: 
To decommission a dead DN, the complete logic should be
 Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED

*Current logic:*

If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED 
immediately; DECOMMISSION_INPROGRESS is skipped.

This logic was introduced by HDFS-7374, which was made because of HDFS-6791.

HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if the node 
becomes dead during decommission, which could possibly leave a dead DN in 
DECOMMISSION_INPROGRESS forever if the DN never comes back alive.

However, putting a dead DN into DECOMMISSIONED directly is not safe. For 
example, if 3 DNs holding the same block die at the same time and the 
administrator then wants to decommission them, the NameNode should check first 
before transitioning them to DECOMMISSIONED; otherwise it could mean data loss.

In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. 
The administrator needs to do some manual intervention: either repair the dead 
machine or service, or recover the data, before taking action on them.

*This change is to add Dead, DECOMMISSION_INPROGRESS back.*
 1. A dead normal DN goes into DECOMMISSION_INPROGRESS first.
 2. The NN checks that pendingReplicationBlocksCount and 
underReplicatedBlocksCount are both 0.
 3. The dead DN transitions to DECOMMISSIONED.

Step 2 is implemented by HDFS-7409, which adds a check that allows dead nodes 
in DECOMMISSION_INPROGRESS to progress to the DECOMMISSIONED state if all 
files on the filesystem are fully replicated: the dead DN is in 
DECOMMISSION_INPROGRESS first, then checked, before becoming DECOMMISSIONED.

  was:
To decommission a dead DN, the complete logic should be
 Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED

*Current logic:*

If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED 
immediately; DECOMMISSION_INPROGRESS is skipped.

This logic was introduced by https://issues.apache.org/jira/browse/HDFS-7374

HDFS-7374 was made because of https://issues.apache.org/jira/browse/HDFS-6791.

HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if the node 
becomes dead during decommission, which could possibly leave a dead DN in 
DECOMMISSION_INPROGRESS forever if the DN never comes back alive.

However, putting a dead DN into DECOMMISSIONED directly is not safe. For 
example, if 3 DNs holding the same block die at the same time and the 
administrator then wants to decommission them, the NameNode should check first 
before transitioning them to DECOMMISSIONED; otherwise it could mean data loss.

In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. 
The administrator needs to do some manual intervention: either repair the dead 
machine or service, or recover the data, before taking action on them.

*This change is to add Dead, DECOMMISSION_INPROGRESS back.*
 1. A dead normal DN goes into DECOMMISSION_INPROGRESS first.
 2. The NN checks that pendingReplicationBlocksCount and 
underReplicatedBlocksCount are both 0.
 3. The dead DN transitions to DECOMMISSIONED.

Step 2 is implemented by HDFS-7409, which adds a check that allows dead nodes 
in DECOMMISSION_INPROGRESS to progress to the DECOMMISSIONED state if all 
files on the filesystem are fully replicated: the dead DN is in 
DECOMMISSION_INPROGRESS first, then checked, before becoming DECOMMISSIONED.


> Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
> --
>
> Key: HDFS-15761
> URL: https://issues.apache.org/jira/browse/HDFS-15761
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ye Ni
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> To decommission a dead DN, the complete logic should be
>  Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED
> *Current logic:*
> If a DN is already dead when DECOMMISSIONING starts, it becomes 
> DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped.
> This logic was introduced by HDFS-7374, which was made in response to HDFS-6791.
> HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if it becomes 
> dead during decommission, which could leave a dead DN in 
> DECOMMISSION_INPROGRESS forever if the DN never comes back alive.
> However, putting a dead DN into DECOMMISSIONED directly is not safe. For 
> example, if 3 DNs holding replicas of the same block die at the same time and 
> the administrator then wants to decommission them, the NameNode should check 
> first before transitioning them to DECOMMISSIONED; otherwise, it would result 
> in data loss.
> In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. 
> The administrator needs to intervene manually, either repairing the dead 
> machine or service, or recovering the data, before taking action on them.

[jira] [Commented] (HDFS-15748) RBF: Move the router related part from hadoop-federation-balance module to hadoop-hdfs-rbf.

2021-01-04 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258410#comment-17258410
 ] 

Ayush Saxena commented on HDFS-15748:
-

Committed to trunk.

Thanx [~LiJinglun] for the contribution and [~elgoiri] for the review!!!

> RBF: Move the router related part from hadoop-federation-balance module to 
> hadoop-hdfs-rbf.
> ---
>
> Key: HDFS-15748
> URL: https://issues.apache.org/jira/browse/HDFS-15748
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15748.001.patch, HDFS-15748.002.patch, 
> HDFS-15748.003.patch, HDFS-15748.004.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15549) Improve DISK/ARCHIVE movement if they are on same filesystem

2021-01-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15549?focusedWorklogId=530857=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-530857
 ]

ASF GitHub Bot logged work on HDFS-15549:
-

Author: ASF GitHub Bot
Created on: 04/Jan/21 20:06
Start Date: 04/Jan/21 20:06
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2583:
URL: https://github.com/apache/hadoop/pull/2583#issuecomment-754188800


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   1m 31s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |   |   0m  0s | [test4tests](test4tests) |  The patch 
appears to include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  13m 59s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  24m 15s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  22m 23s |  |  trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04  |
   | +1 :green_heart: |  compile  |  26m  9s |  |  trunk passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01  |
   | +1 :green_heart: |  checkstyle  |   4m 36s |  |  trunk passed  |
   | -1 :x: |  mvnsite  |   1m  9s | 
[/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/2/artifact/out/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in trunk failed.  |
   | +1 :green_heart: |  shadedclient  |   9m 58s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | -1 :x: |  javadoc  |   0m 51s | 
[/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/2/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.txt)
 |  hadoop-hdfs in trunk failed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.  |
   | -1 :x: |  javadoc  |   0m 59s | 
[/branch-javadoc-hadoop-common-project_hadoop-common-jdkPrivateBuild-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/2/artifact/out/branch-javadoc-hadoop-common-project_hadoop-common-jdkPrivateBuild-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.txt)
 |  hadoop-common in trunk failed with JDK Private 
Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.  |
   | -1 :x: |  javadoc  |   1m  0s | 
[/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/2/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.txt)
 |  hadoop-hdfs in trunk failed with JDK Private 
Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01.  |
   | +0 :ok: |  spotbugs  |  16m 45s |  |  Used deprecated FindBugs config; 
considering switching to SpotBugs.  |
   | -1 :x: |  findbugs  |   1m  2s | 
[/branch-findbugs-hadoop-common-project_hadoop-common.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/2/artifact/out/branch-findbugs-hadoop-common-project_hadoop-common.txt)
 |  hadoop-common in trunk failed.  |
   | -1 :x: |  findbugs  |   1m  1s | 
[/branch-findbugs-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/2/artifact/out/branch-findbugs-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in trunk failed.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 40s |  |  Maven dependency ordering for patch  |
   | -1 :x: |  mvninstall  |   0m 32s | 
[/patch-mvninstall-hadoop-common-project_hadoop-common.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/2/artifact/out/patch-mvninstall-hadoop-common-project_hadoop-common.txt)
 |  hadoop-common in the patch failed.  |
   | -1 :x: |  mvninstall  |   0m 28s | 
[/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/2/artifact/out/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch failed.  |
   | -1 :x: |  compile  |   0m 30s | 
[/patch-compile-root-jdkUbuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2583/2/artifact/out/patch-compile-root-jdkUbuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.txt)
 |  root in the patch failed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04.  
|
   | -1 :x: |  javac  |   0m 30s | 

[jira] [Work logged] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2021-01-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15759?focusedWorklogId=530820=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-530820
 ]

ASF GitHub Bot logged work on HDFS-15759:
-

Author: ASF GitHub Bot
Created on: 04/Jan/21 19:10
Start Date: 04/Jan/21 19:10
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2585:
URL: https://github.com/apache/hadoop/pull/2585#issuecomment-754160262


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 34s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |   |   0m  0s | [test4tests](test4tests) |  The patch 
appears to include 5 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m  1s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  20m 41s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  20m 10s |  |  trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04  |
   | +1 :green_heart: |  compile  |  17m 17s |  |  trunk passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01  |
   | +1 :green_heart: |  checkstyle  |   2m 52s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   3m  7s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  24m 41s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | +1 :green_heart: |  javadoc  |   2m 11s |  |  trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04  |
   | +1 :green_heart: |  javadoc  |   3m 18s |  |  trunk passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01  |
   | +0 :ok: |  spotbugs  |   3m 19s |  |  Used deprecated FindBugs config; 
considering switching to SpotBugs.  |
   | +1 :green_heart: |  findbugs  |   5m 38s |  |  trunk passed  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 27s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m  6s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  19m 21s |  |  the patch passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04  |
   | +1 :green_heart: |  javac  |  19m 21s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  17m 23s |  |  the patch passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01  |
   | +1 :green_heart: |  javac  |  17m 23s |  |  the patch passed  |
   | +1 :green_heart: |  checkstyle  |   2m 52s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   3m  6s |  |  the patch passed  |
   | +1 :green_heart: |  whitespace  |   0m  0s |  |  The patch has no 
whitespace issues.  |
   | +1 :green_heart: |  xml  |   0m  2s |  |  The patch has no ill-formed XML 
file.  |
   | +1 :green_heart: |  shadedclient  |  15m 26s |  |  patch has no errors 
when building and testing our client artifacts.  |
   | +1 :green_heart: |  javadoc  |   2m  8s |  |  the patch passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04  |
   | +1 :green_heart: |  javadoc  |   3m 17s |  |  the patch passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01  |
   | +1 :green_heart: |  findbugs  |   5m 47s |  |  the patch passed  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   9m 52s |  |  hadoop-common in the patch 
passed.  |
   | -1 :x: |  unit  |  98m 45s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2585/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m  9s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 296m  1s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.TestReconstructStripedFileWithValidator |
   |   | hadoop.hdfs.TestMultipleNNPortQOP |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2585/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/2585 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient findbugs checkstyle xml |
   | uname | Linux 98ae016c6e0c 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 
16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 2825d060cf9 |
   | Default Java | Private 

[jira] [Created] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately

2021-01-04 Thread Ye Ni (Jira)
Ye Ni created HDFS-15761:


 Summary: Dead NORMAL DN shouldn't transit to DECOMMISSIONED 
immediately
 Key: HDFS-15761
 URL: https://issues.apache.org/jira/browse/HDFS-15761
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Ye Ni


To decommission a dead DN, the complete logic should be
Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED

*Current logic:*

If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED 
immediately. DECOMMISSION_INPROGRESS is skipped.

This logic was introduced by https://issues.apache.org/jira/browse/HDFS-7374

HDFS-7374 was made in response to https://issues.apache.org/jira/browse/HDFS-6791.

HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if it becomes dead 
during decommission, which could leave a dead DN in DECOMMISSION_INPROGRESS 
forever if the DN never comes back alive.

However, putting a dead DN into DECOMMISSIONED directly is not safe. For example, 
if 3 DNs holding replicas of the same block die at the same time and the 
administrator puts them into DECOMMISSIONED, the NameNode should check first 
before transitioning them; otherwise, it would result in data loss.

In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. 
The administrator needs to intervene manually, either repairing the dead machine 
or service, or recovering the data, before decommissioning them.

This change adds Dead, DECOMMISSION_INPROGRESS back:
1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first.
2. It is then checked that pendingReplicationBlocksCount and 
underReplicatedBlocksCount are both 0.
3. The dead DN transitions to DECOMMISSIONED.

Step 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, which 
adds a check allowing dead nodes in DECOMMISSION_IN_PROGRESS to progress to the 
DECOMMISSIONED state once all files on the filesystem are fully replicated: the 
dead DN sits in DECOMMISSION_INPROGRESS, is checked, and only then becomes 
DECOMMISSIONED.
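Read as a state machine, the proposal can be summarized by the sketch below. The 
AdminFlow class and the counter suppliers are invented for illustration; the real 
transition would happen inside the NameNode's monitor thread, not in external code:

{code:java}
import java.util.function.LongSupplier;

/** Illustrative Dead,NORMAL -> Dead,DECOMMISSION_INPROGRESS ->
 *  Dead,DECOMMISSIONED flow for a single dead DN. Hypothetical names,
 *  not NameNode code. */
public class AdminFlow {

  enum AdminState { NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

  private AdminState state = AdminState.NORMAL;

  /** Step 1: decommission starts; even a dead DN only moves to IN_PROGRESS. */
  void startDecommission() {
    if (state == AdminState.NORMAL) {
      state = AdminState.DECOMMISSION_INPROGRESS;
    }
  }

  /** Steps 2 and 3: re-check the counters; finish only when both are 0. */
  void tick(LongSupplier pendingReplication, LongSupplier underReplicated) {
    if (state == AdminState.DECOMMISSION_INPROGRESS
        && pendingReplication.getAsLong() == 0
        && underReplicated.getAsLong() == 0) {
      state = AdminState.DECOMMISSIONED;
    }
  }

  public static void main(String[] args) {
    AdminFlow dn = new AdminFlow();
    dn.startDecommission();       // step 1: NORMAL -> DECOMMISSION_INPROGRESS
    dn.tick(() -> 5, () -> 2);    // replication still catching up: stays put
    dn.tick(() -> 0, () -> 0);    // step 2 passes, step 3 fires
    System.out.println(dn.state); // DECOMMISSIONED
  }
}
{code}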



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately

2021-01-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15761?focusedWorklogId=530835=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-530835
 ]

ASF GitHub Bot logged work on HDFS-15761:
-

Author: ASF GitHub Bot
Created on: 04/Jan/21 19:41
Start Date: 04/Jan/21 19:41
Worklog Time Spent: 10m 
  Work Description: NickyYe opened a new pull request #2588:
URL: https://github.com/apache/hadoop/pull/2588


   https://issues.apache.org/jira/browse/HDFS-15761
   
   ## NOTICE
   
   Please create an issue in ASF JIRA before opening a pull request,
   and you need to set the title of the pull request which starts with
   the corresponding JIRA issue number. (e.g. HADOOP-X. Fix a typo in YYY.)
   For more details, please see 
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 530835)
Remaining Estimate: 0h
Time Spent: 10m

> Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
> --
>
> Key: HDFS-15761
> URL: https://issues.apache.org/jira/browse/HDFS-15761
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ye Ni
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> To decommission a dead DN, the complete logic should be
> Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED
> *Current logic:*
> If a DN is already dead when DECOMMISSIONING starts, it becomes 
> DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped.
> This logic was introduced by https://issues.apache.org/jira/browse/HDFS-7374
> HDFS-7374 was made in response to https://issues.apache.org/jira/browse/HDFS-6791.
> HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if it becomes 
> dead during decommission, which could leave a dead DN in 
> DECOMMISSION_INPROGRESS forever if the DN never comes back alive.
> However, putting a dead DN into DECOMMISSIONED directly is not safe. For 
> example, if 3 DNs holding replicas of the same block die at the same time and 
> the administrator puts them into DECOMMISSIONED, the NameNode should check 
> first before transitioning them; otherwise, it would result in data loss.
> In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. 
> The administrator needs to intervene manually, either repairing the dead 
> machine or service, or recovering the data, before decommissioning them.
> This change adds Dead, DECOMMISSION_INPROGRESS back:
> 1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first.
> 2. It is then checked that pendingReplicationBlocksCount and 
> underReplicatedBlocksCount are both 0.
> 3. The dead DN transitions to DECOMMISSIONED.
> Step 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, which 
> adds a check allowing dead nodes in DECOMMISSION_IN_PROGRESS to progress to the 
> DECOMMISSIONED state once all files on the filesystem are fully replicated: the 
> dead DN sits in DECOMMISSION_INPROGRESS, is checked, and only then becomes 
> DECOMMISSIONED.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately

2021-01-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-15761:
--
Labels: pull-request-available  (was: )

> Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
> --
>
> Key: HDFS-15761
> URL: https://issues.apache.org/jira/browse/HDFS-15761
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ye Ni
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> To decommission a dead DN, the complete logic should be
> Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED
> *Current logic:*
> If a DN is already dead when DECOMMISSIONING starts, it becomes 
> DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped.
> This logic was introduced by https://issues.apache.org/jira/browse/HDFS-7374
> HDFS-7374 was made in response to https://issues.apache.org/jira/browse/HDFS-6791.
> HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if it becomes 
> dead during decommission, which could leave a dead DN in 
> DECOMMISSION_INPROGRESS forever if the DN never comes back alive.
> However, putting a dead DN into DECOMMISSIONED directly is not safe. For 
> example, if 3 DNs holding replicas of the same block die at the same time and 
> the administrator puts them into DECOMMISSIONED, the NameNode should check 
> first before transitioning them; otherwise, it would result in data loss.
> In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. 
> The administrator needs to intervene manually, either repairing the dead 
> machine or service, or recovering the data, before decommissioning them.
> This change adds Dead, DECOMMISSION_INPROGRESS back:
> 1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first.
> 2. It is then checked that pendingReplicationBlocksCount and 
> underReplicatedBlocksCount are both 0.
> 3. The dead DN transitions to DECOMMISSIONED.
> Step 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, which 
> adds a check allowing dead nodes in DECOMMISSION_IN_PROGRESS to progress to the 
> DECOMMISSIONED state once all files on the filesystem are fully replicated: the 
> dead DN sits in DECOMMISSION_INPROGRESS, is checked, and only then becomes 
> DECOMMISSIONED.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately

2021-01-04 Thread Ye Ni (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258452#comment-17258452
 ] 

Ye Ni edited comment on HDFS-15761 at 1/4/21, 7:46 PM:
---

cc [~mingma], [~andrew.wang], [~zhz], [~elgoiri]


was (Author: nickyye):
cc [~mingma], [~andrew.wang], [~zhz], [~inigoiri]

> Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
> --
>
> Key: HDFS-15761
> URL: https://issues.apache.org/jira/browse/HDFS-15761
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ye Ni
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> To decommission a dead DN, the complete logic should be
> Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED
> *Current logic:*
> If a DN is already dead when DECOMMISSIONING starts, it becomes 
> DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped.
> This logic was introduced by https://issues.apache.org/jira/browse/HDFS-7374
> HDFS-7374 was made in response to https://issues.apache.org/jira/browse/HDFS-6791.
> HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if it becomes 
> dead during decommission, which could leave a dead DN in 
> DECOMMISSION_INPROGRESS forever if the DN never comes back alive.
> However, putting a dead DN into DECOMMISSIONED directly is not safe. For 
> example, if 3 DNs holding replicas of the same block die at the same time and 
> the administrator puts them into DECOMMISSIONED, the NameNode should check 
> first before transitioning them; otherwise, it would result in data loss.
> In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. 
> The administrator needs to intervene manually, either repairing the dead 
> machine or service, or recovering the data, before decommissioning them.
> This change adds Dead, DECOMMISSION_INPROGRESS back:
> 1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first.
> 2. It is then checked that pendingReplicationBlocksCount and 
> underReplicatedBlocksCount are both 0.
> 3. The dead DN transitions to DECOMMISSIONED.
> Step 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, which 
> adds a check allowing dead nodes in DECOMMISSION_IN_PROGRESS to progress to the 
> DECOMMISSIONED state once all files on the filesystem are fully replicated: the 
> dead DN sits in DECOMMISSION_INPROGRESS, is checked, and only then becomes 
> DECOMMISSIONED.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately

2021-01-04 Thread Ye Ni (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Ni updated HDFS-15761:
-
Description: 
To decommission a dead DN, the complete logic should be
 Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED

*Current logic:*

If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED 
immediately. DECOMMISSION_INPROGRESS is skipped.

This logic was introduced by https://issues.apache.org/jira/browse/HDFS-7374

HDFS-7374 was made in response to https://issues.apache.org/jira/browse/HDFS-6791.

HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if it becomes dead 
during decommission, which could leave a dead DN in DECOMMISSION_INPROGRESS 
forever if the DN never comes back alive.

However, putting a dead DN into DECOMMISSIONED directly is not safe. For example, 
if 3 DNs holding replicas of the same block die at the same time and the 
administrator then wants to decommission them, the NameNode should check first 
before transitioning them to DECOMMISSIONED; otherwise, it would result in data 
loss.

In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. 
The administrator needs to intervene manually, either repairing the dead machine 
or service, or recovering the data, before taking action on them.

This change adds Dead, DECOMMISSION_INPROGRESS back:
 1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first.
 2. The NN then checks that pendingReplicationBlocksCount and 
underReplicatedBlocksCount are both 0.
 3. The dead DN transitions to DECOMMISSIONED.

Step 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, which 
adds a check allowing dead nodes in DECOMMISSION_IN_PROGRESS to progress to the 
DECOMMISSIONED state once all files on the filesystem are fully replicated: the 
dead DN sits in DECOMMISSION_INPROGRESS, is checked, and only then becomes 
DECOMMISSIONED.

  was:
To decommission a dead DN, the complete logic should be
Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED

*Current logic:*

If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED 
immediately. DECOMMISSION_INPROGRESS is skipped.

This logic was introduced by https://issues.apache.org/jira/browse/HDFS-7374

HDFS-7374 was made in response to https://issues.apache.org/jira/browse/HDFS-6791.

HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if it becomes dead 
during decommission, which could leave a dead DN in DECOMMISSION_INPROGRESS 
forever if the DN never comes back alive.

However, putting a dead DN into DECOMMISSIONED directly is not safe. For example, 
if 3 DNs holding replicas of the same block die at the same time and the 
administrator puts them into DECOMMISSIONED, the NameNode should check first 
before transitioning them; otherwise, it would result in data loss.

In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. 
The administrator needs to intervene manually, either repairing the dead machine 
or service, or recovering the data, before decommissioning them.

This change adds Dead, DECOMMISSION_INPROGRESS back:
1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first.
2. It is then checked that pendingReplicationBlocksCount and 
underReplicatedBlocksCount are both 0.
3. The dead DN transitions to DECOMMISSIONED.

Step 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, which 
adds a check allowing dead nodes in DECOMMISSION_IN_PROGRESS to progress to the 
DECOMMISSIONED state once all files on the filesystem are fully replicated: the 
dead DN sits in DECOMMISSION_INPROGRESS, is checked, and only then becomes 
DECOMMISSIONED.


> Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
> --
>
> Key: HDFS-15761
> URL: https://issues.apache.org/jira/browse/HDFS-15761
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ye Ni
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> To decommission a dead DN, the complete logic should be
>  Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED
> *Current logic:*
> If a DN is already dead when DECOMMISSIONING starts, it becomes 
> DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped.
> This logic was introduced by https://issues.apache.org/jira/browse/HDFS-7374
> HDFS-7374 was made in response to https://issues.apache.org/jira/browse/HDFS-6791.
> HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if it becomes 
> dead during decommission, which could leave a dead DN in 
> DECOMMISSION_INPROGRESS forever if the DN never comes back alive.
> However, putting a dead DN into DECOMMISSIONED directly is not safe. For 
> example, if 3 DNs holding replicas of the same block die at the same time and 
> the administrator then wants to decommission them, the NameNode should check 
> first before transitioning them to DECOMMISSIONED; otherwise, it would result 
> in data loss.

[jira] [Updated] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately

2021-01-04 Thread Ye Ni (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Ni updated HDFS-15761:
-
Description: 
To decommission a dead DN, the complete logic should be
 Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED

*Current logic:*

If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED 
immediately. DECOMMISSION_INPROGRESS is skipped.

This logic was introduced by HDFS-7374, which was made in response to HDFS-6791.

HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if it becomes dead 
during decommission, which could leave a dead DN in DECOMMISSION_INPROGRESS 
forever if the DN never comes back alive.

However, putting a dead DN into DECOMMISSIONED directly is not safe. For example, 
if 3 DNs holding replicas of the same block die at the same time and the 
administrator then wants to decommission them, the NameNode should check first 
before transitioning them to DECOMMISSIONED; otherwise, it would result in data 
loss.

In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. 
The administrator needs to intervene manually, either repairing the dead machine 
or service, or recovering the data, before taking action on them.

*This change adds Dead, DECOMMISSION_INPROGRESS back:*
 1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first.
 2. The NN checks that pendingReplicationBlocksCount and 
underReplicatedBlocksCount are both 0.
 3. The dead DN transitions to DECOMMISSIONED.

Step 2 is implemented by HDFS-7409, which adds a check allowing dead nodes in 
DECOMMISSION_IN_PROGRESS to progress to the DECOMMISSIONED state once all files 
on the filesystem are fully replicated.

  was:
To decommission a dead DN, the complete logic should be
 Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED

*Current logic:*

If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED 
immediately. DECOMMISSION_INPROGRESS is skipped.

This logic was introduced by HDFS-7374, which was made in response to HDFS-6791.

HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if it becomes dead 
during decommission, which could leave a dead DN in DECOMMISSION_INPROGRESS 
forever if the DN never comes back alive.

However, putting a dead DN into DECOMMISSIONED directly is not safe. For example, 
if 3 DNs holding replicas of the same block die at the same time and the 
administrator then wants to decommission them, the NameNode should check first 
before transitioning them to DECOMMISSIONED; otherwise, it would result in data 
loss.

In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. 
The administrator needs to intervene manually, either repairing the dead machine 
or service, or recovering the data, before taking action on them.

*This change adds Dead, DECOMMISSION_INPROGRESS back:*
 1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first.
 2. The NN checks that pendingReplicationBlocksCount and 
underReplicatedBlocksCount are both 0.
 3. The dead DN transitions to DECOMMISSIONED.

Step 2 is implemented by HDFS-7409, which adds a check allowing dead nodes in 
DECOMMISSION_IN_PROGRESS to progress to the DECOMMISSIONED state once all files 
on the filesystem are fully replicated: the dead DN sits in 
DECOMMISSION_INPROGRESS, is checked, and only then becomes DECOMMISSIONED.


> Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
> --
>
> Key: HDFS-15761
> URL: https://issues.apache.org/jira/browse/HDFS-15761
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ye Ni
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> To decommission a dead DN, the complete logic should be
>  Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED
> *Current logic:*
> If a DN is already dead when DECOMMISSIONING starts, it becomes 
> DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped.
> This logic was introduced by HDFS-7374, which was made in response to HDFS-6791.
> HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if it becomes 
> dead during decommission, which could leave a dead DN in 
> DECOMMISSION_INPROGRESS forever if the DN never comes back alive.
> However, putting a dead DN into DECOMMISSIONED directly is not safe. For 
> example, if 3 DNs holding replicas of the same block die at the same time and 
> the administrator then wants to decommission them, the NameNode should check 
> first before transitioning them to DECOMMISSIONED; otherwise, it would result 
> in data loss.
> In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. 
> The administrator needs to intervene manually, either repairing the dead 
> machine or service, or recovering the data, before taking action on them.
> *This change adds Dead, DECOMMISSION_INPROGRESS back:*
>  1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first.
>  

[jira] [Commented] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately

2021-01-04 Thread Ye Ni (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258452#comment-17258452
 ] 

Ye Ni commented on HDFS-15761:
--

cc [~mingma], [~andrew.wang], [~aiden_zhang], [~inigoiri]

> Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
> --
>
> Key: HDFS-15761
> URL: https://issues.apache.org/jira/browse/HDFS-15761
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ye Ni
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> To decommission a dead DN, the complete logic should be
> Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED
> *Current logic:*
> If a DN is already dead when DECOMMISSIONING starts, it becomes 
> DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped.
> This logic was introduced by https://issues.apache.org/jira/browse/HDFS-7374
> HDFS-7374 was made in response to https://issues.apache.org/jira/browse/HDFS-6791.
> HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if it becomes 
> dead during decommission, which could leave a dead DN in 
> DECOMMISSION_INPROGRESS forever if the DN never comes back alive.
> However, putting a dead DN into DECOMMISSIONED directly is not safe. For 
> example, if 3 DNs holding replicas of the same block die at the same time and 
> the administrator puts them into DECOMMISSIONED, the NameNode should check 
> first before transitioning them; otherwise, it would result in data loss.
> In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. 
> The administrator needs to intervene manually, either repairing the dead 
> machine or service, or recovering the data, before decommissioning them.
> This change adds Dead, DECOMMISSION_INPROGRESS back:
> 1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first.
> 2. It is then checked that pendingReplicationBlocksCount and 
> underReplicatedBlocksCount are both 0.
> 3. The dead DN transitions to DECOMMISSIONED.
> Step 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, which 
> adds a check allowing dead nodes in DECOMMISSION_IN_PROGRESS to progress to the 
> DECOMMISSIONED state once all files on the filesystem are fully replicated: the 
> dead DN sits in DECOMMISSION_INPROGRESS, is checked, and only then becomes 
> DECOMMISSIONED.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately

2021-01-04 Thread Ye Ni (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Ni updated HDFS-15761:
-
Description: 
To decommission a dead DN, the complete logic should be
 Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED

*Current logic:*

If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED 
immediately. DECOMMISSION_INPROGRESS is skipped.

This logic was introduced by https://issues.apache.org/jira/browse/HDFS-7374

HDFS-7374 was made in response to https://issues.apache.org/jira/browse/HDFS-6791.

HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if it becomes dead 
during decommission, which could leave a dead DN in DECOMMISSION_INPROGRESS 
forever if the DN never comes back alive.

However, putting a dead DN into DECOMMISSIONED directly is not safe. For example, 
if 3 DNs holding replicas of the same block die at the same time and the 
administrator then wants to decommission them, the NameNode should check first 
before transitioning them to DECOMMISSIONED; otherwise, it would result in data 
loss.

In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. 
The administrator needs to intervene manually, either repairing the dead machine 
or service, or recovering the data, before taking action on them.

*This change adds Dead, DECOMMISSION_INPROGRESS back:*
 1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first.
 2. The NN checks that pendingReplicationBlocksCount and 
underReplicatedBlocksCount are both 0.
 3. The dead DN transitions to DECOMMISSIONED.

Step 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, which 
adds a check allowing dead nodes in DECOMMISSION_IN_PROGRESS to progress to the 
DECOMMISSIONED state once all files on the filesystem are fully replicated: the 
dead DN sits in DECOMMISSION_INPROGRESS, is checked, and only then becomes 
DECOMMISSIONED.

  was:
To decommission a dead DN, the complete logic should be
 Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED

*Current logic:*

If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED 
immediately. DECOMMISSION_INPROGRESS is skipped.

This logic was introduced by https://issues.apache.org/jira/browse/HDFS-7374

HDFS-7374 was made in response to https://issues.apache.org/jira/browse/HDFS-6791.

HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if it becomes dead 
during decommission, which could leave a dead DN in DECOMMISSION_INPROGRESS 
forever if the DN never comes back alive.

However, putting a dead DN into DECOMMISSIONED directly is not safe. For example, 
if 3 DNs holding replicas of the same block die at the same time and the 
administrator then wants to decommission them, the NameNode should check first 
before transitioning them to DECOMMISSIONED; otherwise, it would result in data 
loss.

In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. 
The administrator needs to intervene manually, either repairing the dead machine 
or service, or recovering the data, before taking action on them.

*This change adds Dead, DECOMMISSION_INPROGRESS back:*
 1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first.
 2. The NN then checks that pendingReplicationBlocksCount and 
underReplicatedBlocksCount are both 0.
 3. The dead DN transitions to DECOMMISSIONED.

Step 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, which 
adds a check allowing dead nodes in DECOMMISSION_IN_PROGRESS to progress to the 
DECOMMISSIONED state once all files on the filesystem are fully replicated: the 
dead DN sits in DECOMMISSION_INPROGRESS, is checked, and only then becomes 
DECOMMISSIONED.


> Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
> --
>
> Key: HDFS-15761
> URL: https://issues.apache.org/jira/browse/HDFS-15761
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ye Ni
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> To decommission a dead DN, the complete logic should be
>  Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED
> *Current logic:*
> If a DN is already dead when DECOMMISSIONING starts, it becomes 
> DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped.
> This logic was introduced by https://issues.apache.org/jira/browse/HDFS-7374
> HDFS-7374 was made in response to https://issues.apache.org/jira/browse/HDFS-6791.
> HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if it becomes 
> dead during decommission, which could leave a dead DN in 
> DECOMMISSION_INPROGRESS forever if the DN never comes back alive.
> However, putting a dead DN into DECOMMISSIONED directly is not safe. For 
> example, if 3 DNs holding replicas of the same block die at the same time and 
> the administrator then wants to decommission them, the NameNode should check 
> first before transitioning them to DECOMMISSIONED; otherwise, it would result 
> in data loss.

[jira] [Updated] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately

2021-01-04 Thread Ye Ni (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Ni updated HDFS-15761:
-
Description: 
To decommission a dead DN, the complete logic should be
 Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED

*Current logic:*

If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED 
immediately. DECOMMISSION_INPROGRESS is skipped.

This logic was introduced by https://issues.apache.org/jira/browse/HDFS-7374

HDFS-7374 was made in response to https://issues.apache.org/jira/browse/HDFS-6791.

HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if it becomes dead 
during decommission, which could leave a dead DN in DECOMMISSION_INPROGRESS 
forever if the DN never comes back alive.

However, putting a dead DN into DECOMMISSIONED directly is not safe. For example, 
if 3 DNs holding replicas of the same block die at the same time and the 
administrator then wants to decommission them, the NameNode should check first 
before transitioning them to DECOMMISSIONED; otherwise, it would result in data 
loss.

In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. 
The administrator needs to intervene manually, either repairing the dead machine 
or service, or recovering the data, before taking action on them.

*This change adds Dead, DECOMMISSION_INPROGRESS back:*
 1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first.
 2. The NN checks that pendingReplicationBlocksCount and 
underReplicatedBlocksCount are both 0.
 3. The dead DN transitions to DECOMMISSIONED.

Step 2 is implemented by HDFS-7409, which adds a check allowing dead nodes in 
DECOMMISSION_IN_PROGRESS to progress to the DECOMMISSIONED state once all files 
on the filesystem are fully replicated: the dead DN sits in 
DECOMMISSION_INPROGRESS, is checked, and only then becomes DECOMMISSIONED.

  was:
To decommission a dead DN, the complete logic should be
 Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED

*Current logic:*

If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED 
immediately. DECOMMISSION_INPROGRESS is skipped.

This logic was introduced by https://issues.apache.org/jira/browse/HDFS-7374

HDFS-7374 was made in response to https://issues.apache.org/jira/browse/HDFS-6791.

HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if it becomes dead 
during decommission, which could leave a dead DN in DECOMMISSION_INPROGRESS 
forever if the DN never comes back alive.

However, putting a dead DN into DECOMMISSIONED directly is not safe. For example, 
if 3 DNs holding replicas of the same block die at the same time and the 
administrator then wants to decommission them, the NameNode should check first 
before transitioning them to DECOMMISSIONED; otherwise, it would result in data 
loss.

In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. 
The administrator needs to intervene manually, either repairing the dead machine 
or service, or recovering the data, before taking action on them.

*This change adds Dead, DECOMMISSION_INPROGRESS back:*
 1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first.
 2. The NN checks that pendingReplicationBlocksCount and 
underReplicatedBlocksCount are both 0.
 3. The dead DN transitions to DECOMMISSIONED.

Step 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, which 
adds a check allowing dead nodes in DECOMMISSION_IN_PROGRESS to progress to the 
DECOMMISSIONED state once all files on the filesystem are fully replicated: the 
dead DN sits in DECOMMISSION_INPROGRESS, is checked, and only then becomes 
DECOMMISSIONED.


> Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
> --
>
> Key: HDFS-15761
> URL: https://issues.apache.org/jira/browse/HDFS-15761
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ye Ni
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> To decommission a dead DN, the complete logic should be
>  Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED
> *Current logic:*
> If a DN is already dead when DECOMMISSIONING starts, it becomes 
> DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped.
> This logic was introduced by https://issues.apache.org/jira/browse/HDFS-7374
> HDFS-7374 was made in response to https://issues.apache.org/jira/browse/HDFS-6791.
> HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if it becomes 
> dead during decommission, which could leave a dead DN in 
> DECOMMISSION_INPROGRESS forever if the DN never comes back alive.
> However, putting a dead DN into DECOMMISSIONED directly is not safe. For 
> example, if 3 DNs holding replicas of the same block die at the same time and 
> the administrator then wants to decommission them, the NameNode should check 
> first before transitioning them to DECOMMISSIONED; otherwise, it would result 
> in data loss.

[jira] [Updated] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately

2021-01-04 Thread Ye Ni (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Ni updated HDFS-15761:
-
Description: 
To decommission a dead DN, the complete logic should be
 Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED

*Current logic:*

If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED 
immediately. DECOMMISSION_INPROGRESS is skipped.

This logic was introduced by https://issues.apache.org/jira/browse/HDFS-7374

HDFS-7374 was made in response to https://issues.apache.org/jira/browse/HDFS-6791.

HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if it becomes dead 
during decommission, which could leave a dead DN in DECOMMISSION_INPROGRESS 
forever if the DN never comes back alive.

However, putting a dead DN into DECOMMISSIONED directly is not safe. For example, 
if 3 DNs holding replicas of the same block die at the same time and the 
administrator then wants to decommission them, the NameNode should check first 
before transitioning them to DECOMMISSIONED; otherwise, it would result in data 
loss.

In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. 
The administrator needs to intervene manually, either repairing the dead machine 
or service, or recovering the data, before taking action on them.

*This change adds Dead, DECOMMISSION_INPROGRESS back:*
 1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first.
 2. The NN then checks that pendingReplicationBlocksCount and 
underReplicatedBlocksCount are both 0.
 3. The dead DN transitions to DECOMMISSIONED.

Step 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, which 
adds a check allowing dead nodes in DECOMMISSION_IN_PROGRESS to progress to the 
DECOMMISSIONED state once all files on the filesystem are fully replicated: the 
dead DN sits in DECOMMISSION_INPROGRESS, is checked, and only then becomes 
DECOMMISSIONED.

  was:
To decommission a dead DN, the complete logic should be
 Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED

*Current logic:*

If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED 
immediately. DECOMMISSION_INPROGRESS is skipped.

This logic was introduced by https://issues.apache.org/jira/browse/HDFS-7374

HDFS-7374 was made in response to https://issues.apache.org/jira/browse/HDFS-6791.

HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if it becomes dead 
during decommission, which could leave a dead DN in DECOMMISSION_INPROGRESS 
forever if the DN never comes back alive.

However, putting a dead DN into DECOMMISSIONED directly is not safe. For example, 
if 3 DNs holding replicas of the same block die at the same time and the 
administrator then wants to decommission them, the NameNode should check first 
before transitioning them to DECOMMISSIONED; otherwise, it would result in data 
loss.

In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. 
The administrator needs to intervene manually, either repairing the dead machine 
or service, or recovering the data, before taking action on them.

This change adds Dead, DECOMMISSION_INPROGRESS back:
 1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first.
 2. The NN then checks that pendingReplicationBlocksCount and 
underReplicatedBlocksCount are both 0.
 3. The dead DN transitions to DECOMMISSIONED.

Step 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, which 
adds a check allowing dead nodes in DECOMMISSION_IN_PROGRESS to progress to the 
DECOMMISSIONED state once all files on the filesystem are fully replicated: the 
dead DN sits in DECOMMISSION_INPROGRESS, is checked, and only then becomes 
DECOMMISSIONED.


> Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
> --
>
> Key: HDFS-15761
> URL: https://issues.apache.org/jira/browse/HDFS-15761
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ye Ni
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> To decommission a dead DN, the complete logic should be
>  Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED
> *Current logic:*
> If a DN is already dead when DECOMMISSIONING starts, it becomes 
> DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped.
> This logic was introduced by https://issues.apache.org/jira/browse/HDFS-7374
> HDFS-7374 was made in response to https://issues.apache.org/jira/browse/HDFS-6791.
> HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if it becomes 
> dead during decommission, which could leave a dead DN in 
> DECOMMISSION_INPROGRESS forever if the DN never comes back alive.
> However, putting a dead DN into DECOMMISSIONED directly is not safe. For 
> example, if 3 DNs holding replicas of the same block die at the same time and 
> the administrator then wants to decommission them, the NameNode should check 
> first before transitioning them to DECOMMISSIONED; otherwise, it would result 
> in data loss.

[jira] [Commented] (HDFS-15732) EC client will not retry get block token when block token expired in kerberized cluster

2021-01-04 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258598#comment-17258598
 ] 

Wei-Chiu Chuang commented on HDFS-15732:


Probably similar to HDFS-10609 and HDFS-11741, where we should retry upon the 
invalid block token exception.
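For illustration, a retry wrapper along those lines might look like the sketch 
below. The factory/refetcher interfaces are hypothetical stand-ins for what 
DFSStripedInputStream.createBlockReader would do internally, not the actual fix:

{code:java}
import java.io.IOException;

import org.apache.hadoop.security.token.SecretManager.InvalidToken;

/**
 * Illustrative retry-on-expired-block-token loop. The interfaces are
 * hypothetical stand-ins, not the real DFSStripedInputStream API.
 */
public class BlockTokenRetry {

  interface BlockReaderFactory<R> {
    R create() throws IOException;
  }

  interface TokenRefetcher {
    // E.g. refetch block locations (and thus fresh tokens) from the NN.
    void refetch() throws IOException;
  }

  /** Assumes maxRetries >= 0. */
  static <R> R createWithRetry(BlockReaderFactory<R> factory,
                               TokenRefetcher refetcher,
                               int maxRetries) throws IOException {
    InvalidToken last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return factory.create();
      } catch (InvalidToken e) {
        // Token expired on the DN side: refresh and retry instead of
        // marking the DN dead for the whole stream.
        last = e;
        refetcher.refetch();
      }
    }
    throw last;
  }
}
{code}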

> EC client will not retry get block token when block token expired  in 
> kerberized cluster
> 
>
> Key: HDFS-15732
> URL: https://issues.apache.org/jira/browse/HDFS-15732
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: dfsclient, ec, erasure-coding
>Affects Versions: 3.1.1
> Environment: hadoop 3.1.1
> kerberos
> ec RS-3-2-1024k
>Reporter: gaozhan ding
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> After enabling an EC policy on HBase, we hit some issues. Once the block token 
> expires on the datanode side, the client side cannot identify the InvalidToken 
> error because of the SASL negotiation. As a result, the EC client does not 
> retry by refetching the token when creating a block reader. The peer datanode 
> is then added to DeadNodes, and all subsequent calls to createBlockReader for 
> this datanode in the current DFSStripedInputStream consider the datanode dead 
> and return false. The final result is a read failure.
> Some logs:
> hbase regionserver:
> 2020-12-17 10:00:24,291 WARN 
> [RpcServer.default.FPBQ.Fifo.handler=15,queue=0,port=16020] hdfs.DFSClient: 
> Failed to connect to /10.65.19.41:9866 for 
> blockBP-1601568648-10.65.19.12-1550823043026:blk_-9223372036813273566_672859566
> java.io.IOException: DIGEST-MD5: IO error acquiring password
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:421)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:479)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:393)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:267)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:215)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
>  at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:647)
>  at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2936)
>  at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:821)
>  at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:746)
>  at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379)
>  at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:647)
>  at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.createBlockReader(DFSStripedInputStream.java:272)
>  at org.apache.hadoop.hdfs.StripeReader.readChunk(StripeReader.java:333)
>  at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:365)
>  at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.fetchBlockByteRange(DFSStripedInputStream.java:514)
>  at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1354)
>  at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1318)
>  at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:92)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock.positionalReadWithExtra(HFileBlock.java:808)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readAtOffset(HFileBlock.java:1568)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1772)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1597)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1496)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$CellBasedKeyBlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:340)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:856)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:806)
>  at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:327)
>  at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:228)
>  at 
> 

[jira] [Commented] (HDFS-15757) RBF: Improving Router Connection Management

2021-01-04 Thread Fengnan Li (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258611#comment-17258611
 ] 

Fengnan Li commented on HDFS-15757:
---

Uploaded v2 with more metrics and some changes. I will start a POC in this 
direction.

> RBF: Improving Router Connection Management
> ---
>
> Key: HDFS-15757
> URL: https://issues.apache.org/jira/browse/HDFS-15757
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: RBF_ Improving Router Connection Management_v2.pdf, RBF_ 
> Router Connection Management.pdf
>
>
> We have seen a high number of connections from the Router to the namenodes, 
> leaving the namenodes unstable.
> This ticket aims to reduce the number of connections through some changes. 
> Please take a look at the design and leave comments. 
> Thanks!
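
The general technique such a design implies (reuse and cap connections per user
and namenode pair, and evict idle ones so each namenode sees a bounded number of
sockets) can be sketched as follows. This is a minimal illustration under assumed
names (BoundedConnectionPool, Connection), not the Router's actual
ConnectionManager API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal sketch of a bounded, idle-evicting connection pool keyed by
 * (user, namenode). "Connection" is a stand-in for a real RPC proxy.
 */
public class BoundedConnectionPool {

  /** Placeholder for an RPC connection to a namenode. */
  public static final class Connection {
    final String key;
    long lastUsedMillis;
    Connection(String key) { this.key = key; touch(); }
    void touch() { lastUsedMillis = System.currentTimeMillis(); }
  }

  private final int maxPerKey;      // cap on borrowed connections per key
  private final long maxIdleMillis; // idle connections older than this are evicted
  private final Map<String, Deque<Connection>> idle = new HashMap<>();
  private final Map<String, Integer> borrowed = new HashMap<>();

  public BoundedConnectionPool(int maxPerKey, long maxIdleMillis) {
    this.maxPerKey = maxPerKey;
    this.maxIdleMillis = maxIdleMillis;
  }

  /** Borrow a connection for (user, namenode), reusing an idle one if possible. */
  public synchronized Connection acquire(String user, String namenode)
      throws InterruptedException {
    String key = user + "@" + namenode;
    Deque<Connection> q = idle.computeIfAbsent(key, k -> new ArrayDeque<>());
    // Block while the key is at its cap and nothing idle can be reused;
    // this is what bounds the concurrent load each namenode sees.
    while (q.isEmpty() && borrowed.getOrDefault(key, 0) >= maxPerKey) {
      wait();
    }
    Connection c = q.isEmpty() ? new Connection(key) : q.pop();
    borrowed.merge(key, 1, Integer::sum);
    c.touch();
    return c;
  }

  /** Return a connection to the idle queue for reuse. */
  public synchronized void release(Connection c) {
    borrowed.merge(c.key, -1, Integer::sum);
    c.touch();
    idle.get(c.key).push(c);
    notifyAll();
  }

  /** Evict connections that have sat idle longer than maxIdleMillis. */
  public synchronized void sweepIdle() {
    long now = System.currentTimeMillis();
    idle.values().forEach(q ->
        q.removeIf(c -> now - c.lastUsedMillis > maxIdleMillis));
  }
}
```

A background thread calling sweepIdle() on a fixed interval would keep long-idle
sockets from accumulating against the namenodes.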



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15757) RBF: Improving Router Connection Management

2021-01-04 Thread Fengnan Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fengnan Li updated HDFS-15757:
--
Attachment: RBF_ Improving Router Connection Management_v2.pdf

> RBF: Improving Router Connection Management
> ---
>
> Key: HDFS-15757
> URL: https://issues.apache.org/jira/browse/HDFS-15757
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: RBF_ Improving Router Connection Management_v2.pdf, RBF_ 
> Router Connection Management.pdf
>
>
> We have seen a high number of connections from the Router to the namenodes, 
> leaving the namenodes unstable.
> This ticket aims to reduce the number of connections through some changes. 
> Please take a look at the design and leave comments. 
> Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15757) RBF: Improving Router Connection Management

2021-01-04 Thread Fengnan Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fengnan Li updated HDFS-15757:
--
Attachment: RBF_ Improving Router Connection Management_v2.pdf

> RBF: Improving Router Connection Management
> ---
>
> Key: HDFS-15757
> URL: https://issues.apache.org/jira/browse/HDFS-15757
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: RBF_ Improving Router Connection Management_v2.pdf, RBF_ 
> Router Connection Management.pdf
>
>
> We have seen a high number of connections from the Router to the namenodes, 
> leaving the namenodes unstable.
> This ticket aims to reduce the number of connections through some changes. 
> Please take a look at the design and leave comments. 
> Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15732) EC client will not retry get block token when block token expired in kerberized cluster

2021-01-04 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258600#comment-17258600
 ] 

Wei-Chiu Chuang commented on HDFS-15732:


[~lalapala] would you like to submit a PR? I see that PR #2558 was closed. 
I will add you to the contributor list. Thanks.

> EC client will not retry get block token when block token expired  in 
> kerberized cluster
> 
>
> Key: HDFS-15732
> URL: https://issues.apache.org/jira/browse/HDFS-15732
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: dfsclient, ec, erasure-coding
>Affects Versions: 3.1.1
> Environment: hadoop 3.1.1
> kerberos
> ec RS-3-2-1024k
>Reporter: gaozhan ding
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> When we enabled an EC policy on HBase, we hit some issues. Once a block token 
> has expired on the datanode side, the client cannot identify the InvalidToken 
> error because of the SASL negotiation. As a result, the EC client does not 
> retry by refetching the token when creating a block reader. The peer datanode 
> is then added to DeadNodes, and all calls to createBlockReader against this 
> datanode in the current DFSStripedInputStream treat the datanode as dead and 
> return false. The final result is a read failure.
> Some logs:
> hbase regionserver:
> 2020-12-17 10:00:24,291 WARN 
> [RpcServer.default.FPBQ.Fifo.handler=15,queue=0,port=16020] hdfs.DFSClient: 
> Failed to connect to /10.65.19.41:9866 for 
> blockBP-1601568648-10.65.19.12-1550823043026:blk_-9223372036813273566_672859566
> java.io.IOException: DIGEST-MD5: IO error acquiring password
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:421)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:479)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:393)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:267)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:215)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
>  at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:647)
>  at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2936)
>  at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:821)
>  at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:746)
>  at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379)
>  at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:647)
>  at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.createBlockReader(DFSStripedInputStream.java:272)
>  at org.apache.hadoop.hdfs.StripeReader.readChunk(StripeReader.java:333)
>  at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:365)
>  at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.fetchBlockByteRange(DFSStripedInputStream.java:514)
>  at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1354)
>  at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1318)
>  at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:92)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock.positionalReadWithExtra(HFileBlock.java:808)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readAtOffset(HFileBlock.java:1568)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1772)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1597)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1496)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$CellBasedKeyBlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:340)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:856)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:806)
>  at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:327)
>  at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:228)
>  at 
> 
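
The retry pattern the non-striped read path already applies, and which the EC
path misses here, is: on an expired token, refetch the block token from the
NameNode and retry the same datanode instead of marking it dead. Below is a
minimal sketch; readFrom, refetchBlockToken, and markDead are illustrative
stand-ins for the real client internals, and only InvalidToken is an actual
Hadoop class.

```java
import java.io.IOException;
import org.apache.hadoop.security.token.SecretManager.InvalidToken;

/** Sketch: retry once with a refreshed block token before declaring the DN dead. */
abstract class TokenRetryingReader {

  abstract byte[] readFrom(String datanode) throws IOException;
  abstract void refetchBlockToken() throws IOException; // re-ask the NN for locations + token
  abstract void markDead(String datanode);

  byte[] readWithTokenRetry(String datanode) throws IOException {
    try {
      return readFrom(datanode);
    } catch (InvalidToken e) {
      // Expired token: refresh it and retry the same datanode.
      refetchBlockToken();
      return readFrom(datanode);
    } catch (IOException e) {
      // The failure mode described above: SASL wraps the expired token as a
      // generic "DIGEST-MD5: IO error acquiring password" IOException, so the
      // InvalidToken branch is never reached and the DN is wrongly marked dead.
      markDead(datanode);
      throw e;
    }
  }
}
```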

[jira] [Assigned] (HDFS-15732) EC client will not retry get block token when block token expired in kerberized cluster

2021-01-04 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reassigned HDFS-15732:
--

Assignee: gaozhan ding

> EC client will not retry get block token when block token expired  in 
> kerberized cluster
> 
>
> Key: HDFS-15732
> URL: https://issues.apache.org/jira/browse/HDFS-15732
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: dfsclient, ec, erasure-coding
>Affects Versions: 3.1.1
> Environment: hadoop 3.1.1
> kerberos
> ec RS-3-2-1024k
>Reporter: gaozhan ding
>Assignee: gaozhan ding
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> When we enabled an EC policy on HBase, we hit some issues. Once a block token 
> has expired on the datanode side, the client cannot identify the InvalidToken 
> error because of the SASL negotiation. As a result, the EC client does not 
> retry by refetching the token when creating a block reader. The peer datanode 
> is then added to DeadNodes, and all calls to createBlockReader against this 
> datanode in the current DFSStripedInputStream treat the datanode as dead and 
> return false. The final result is a read failure.
> Some logs:
> hbase regionserver:
> 2020-12-17 10:00:24,291 WARN 
> [RpcServer.default.FPBQ.Fifo.handler=15,queue=0,port=16020] hdfs.DFSClient: 
> Failed to connect to /10.65.19.41:9866 for 
> blockBP-1601568648-10.65.19.12-1550823043026:blk_-9223372036813273566_672859566
> java.io.IOException: DIGEST-MD5: IO error acquiring password
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:421)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:479)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:393)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:267)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:215)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
>  at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:647)
>  at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2936)
>  at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:821)
>  at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:746)
>  at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379)
>  at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:647)
>  at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.createBlockReader(DFSStripedInputStream.java:272)
>  at org.apache.hadoop.hdfs.StripeReader.readChunk(StripeReader.java:333)
>  at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:365)
>  at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.fetchBlockByteRange(DFSStripedInputStream.java:514)
>  at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1354)
>  at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1318)
>  at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:92)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock.positionalReadWithExtra(HFileBlock.java:808)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readAtOffset(HFileBlock.java:1568)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1772)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1597)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1496)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$CellBasedKeyBlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:340)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:856)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:806)
>  at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:327)
>  at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:228)
>  at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:395)
>  at 
> 

[jira] [Work logged] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately

2021-01-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15761?focusedWorklogId=530986=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-530986
 ]

ASF GitHub Bot logged work on HDFS-15761:
-

Author: ASF GitHub Bot
Created on: 05/Jan/21 00:48
Start Date: 05/Jan/21 00:48
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2588:
URL: https://github.com/apache/hadoop/pull/2588#issuecomment-754315251


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   1m 15s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  1s |  |  No case conflicting files 
found.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch 
appears to include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  36m 48s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 20s |  |  trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04  |
   | +1 :green_heart: |  compile  |   1m 11s |  |  trunk passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01  |
   | +1 :green_heart: |  checkstyle  |   0m 50s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 18s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  20m 11s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 55s |  |  trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04  |
   | +1 :green_heart: |  javadoc  |   1m 26s |  |  trunk passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01  |
   | +0 :ok: |  spotbugs  |   3m 33s |  |  Used deprecated FindBugs config; 
considering switching to SpotBugs.  |
   | +1 :green_heart: |  findbugs  |   3m 29s |  |  trunk passed  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 16s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 15s |  |  the patch passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04  |
   | +1 :green_heart: |  javac  |   1m 15s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  9s |  |  the patch passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01  |
   | +1 :green_heart: |  javac  |   1m  9s |  |  the patch passed  |
   | -0 :warning: |  checkstyle  |   0m 43s | 
[/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2588/1/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 13 unchanged - 
0 fixed = 14 total (was 13)  |
   | +1 :green_heart: |  mvnsite  |   1m 15s |  |  the patch passed  |
   | -1 :x: |  whitespace  |   0m  0s | 
[/whitespace-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2588/1/artifact/out/whitespace-eol.txt)
 |  The patch has 1 line(s) that end in whitespace. Use git apply 
--whitespace=fix <>. Refer https://git-scm.com/docs/git-apply  |
   | +1 :green_heart: |  shadedclient  |  19m 25s |  |  patch has no errors 
when building and testing our client artifacts.  |
   | +1 :green_heart: |  javadoc  |   1m  0s |  |  the patch passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04  |
   | +1 :green_heart: |  javadoc  |   1m 34s |  |  the patch passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01  |
   | +1 :green_heart: |  findbugs  |   3m 53s |  |  the patch passed  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 202m  5s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2588/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | -1 :x: |  asflicense  |   0m 49s | 
[/patch-asflicense-problems.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2588/1/artifact/out/patch-asflicense-problems.txt)
 |  The patch generated 4 ASF License warnings.  |
   |  |   | 305m 33s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | 
hadoop.hdfs.TestReadStripedFileWithDecodingDeletedData |
   |   | hadoop.hdfs.TestDatanodeDeath |
   |   | 
hadoop.hdfs.tools.offlineImageViewer.TestOfflineImageViewerForContentSummary |
   |   | hadoop.hdfs.server.diskbalancer.TestDiskBalancerWithMockMover |
   |   | hadoop.hdfs.TestFileChecksum |
   |   | hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots |
   |   | hadoop.hdfs.TestSetrepIncreasing |
   |   | hadoop.hdfs.server.datanode.TestDataNodeErasureCodingMetrics |
   |   | hadoop.hdfs.server.datanode.TestBPOfferService 

[jira] [Updated] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately

2021-01-04 Thread Ye Ni (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Ni updated HDFS-15761:
-
Description: 
To decommission a dead DN, the complete logic should be
 Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED

*Current logic:*

If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED 
immediately. DECOMMISSION_INPROGRESS is skipped.

This logic was introduced by HDFS-7374, which was made because of HDFS-6791.

HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if the node becomes 
dead during decommission, which could leave a dead DN in DECOMMISSION_INPROGRESS 
forever if the DN never comes back.

However, putting a dead DN into DECOMMISSIONED directly is not safe. For 
example, suppose 3 DNs holding replicas of the same block die at the same time 
and the administrator then decommissions them without realizing it. The Namenode 
should check first before transitioning them to DECOMMISSIONED; otherwise, it 
would cause data loss.

In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. The 
administrator needs to intervene manually, either repairing the dead machine or 
service or recovering the data, before taking action on them.

*This change adds Dead, DECOMMISSION_INPROGRESS back.*
 1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first.
 2. The NN checks that pendingReplicationBlocksCount and underReplicatedBlocksCount 
are both 0.
 3. The dead DN is transitioned to DECOMMISSIONED.

Step 2 is implemented by HDFS-7409, which adds a check to allow dead nodes in 
DECOMMISSION_IN_PROGRESS to progress to the DECOMMISSIONED state if all files on 
the filesystem are fully replicated.

  was:
To decommission a dead DN, the complete logic should be
 Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED

*Current logic:*

If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED 
immediately. DECOMMISSION_INPROGRESS is skipped.

This logic was introduced by HDFS-7374, which was made because of HDFS-6791.

HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if the node becomes 
dead during decommission, which could leave a dead DN in DECOMMISSION_INPROGRESS 
forever if the DN never comes back.

However, putting a dead DN into DECOMMISSIONED directly is not safe. For 
example, suppose 3 DNs holding replicas of the same block die at the same time 
and the administrator then decommissions them. The Namenode should check first 
before transitioning them to DECOMMISSIONED; otherwise, it would cause data loss.

In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. The 
administrator needs to intervene manually, either repairing the dead machine or 
service or recovering the data, before taking action on them.

*This change adds Dead, DECOMMISSION_INPROGRESS back.*
 1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first.
 2. The NN checks that pendingReplicationBlocksCount and underReplicatedBlocksCount 
are both 0.
 3. The dead DN is transitioned to DECOMMISSIONED.

Step 2 is implemented by HDFS-7409, which adds a check to allow dead nodes in 
DECOMMISSION_IN_PROGRESS to progress to the DECOMMISSIONED state if all files on 
the filesystem are fully replicated.
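
Step 2 of the proposal reduces to a simple guard: a dead DN entering decommission
is held in DECOMMISSION_INPROGRESS until both counters drain to zero, and only
then transitioned. A minimal sketch with illustrative names (AdminState,
DeadNodeDecommission), not the actual DatanodeAdminManager code:

```java
/** Sketch of the proposed transition guard; not the real NameNode code. */
enum AdminState { NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

final class DeadNodeDecommission {
  private AdminState state = AdminState.NORMAL;

  /** Even for an already-dead DN, enter DECOMMISSION_INPROGRESS first. */
  void startDecommission() {
    state = AdminState.DECOMMISSION_INPROGRESS;
  }

  /**
   * Called on each monitor tick with the NN's counters. Only when nothing is
   * pending replication and nothing is under-replicated is it safe to assume
   * the dead DN's blocks are fully replicated elsewhere.
   */
  void tick(long pendingReplicationBlocksCount, long underReplicatedBlocksCount) {
    if (state == AdminState.DECOMMISSION_INPROGRESS
        && pendingReplicationBlocksCount == 0
        && underReplicatedBlocksCount == 0) {
      state = AdminState.DECOMMISSIONED;
    }
  }

  AdminState getState() { return state; }
}
```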


> Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
> --
>
> Key: HDFS-15761
> URL: https://issues.apache.org/jira/browse/HDFS-15761
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ye Ni
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> To decommission a dead DN, the complete logic should be
>  Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED
> *Current logic:*
> If a DN is already dead when DECOMMISSIONING starts, it becomes 
> DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped.
> This logic was introduced by HDFS-7374, which was made because of HDFS-6791.
> HDFS-6791 keeps the node in the DECOMMISSION_INPROGRESS state if the node becomes 
> dead during decommission, which could leave a dead DN in 
> DECOMMISSION_INPROGRESS forever if the DN never comes back.
> However, putting a dead DN into DECOMMISSIONED directly is not safe. For 
> example, suppose 3 DNs holding replicas of the same block die at the same 
> time and the administrator then decommissions them without realizing it. The 
> Namenode should check first before transitioning them to DECOMMISSIONED; 
> otherwise, it would cause data loss.
> In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. The 
> administrator needs to intervene manually, either repairing the dead 
> machine or service or recovering the data, before taking action on them.
> *This change adds Dead, DECOMMISSION_INPROGRESS back.*
>  1. A dead NORMAL DN enters DECOMMISSION_INPROGRESS first.
>  2. The NN checks that pendingReplicationBlocksCount and 

[jira] [Updated] (HDFS-15757) RBF: Improving Router Connection Management

2021-01-04 Thread Fengnan Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fengnan Li updated HDFS-15757:
--
Attachment: (was: RBF_ Improving Router Connection Management_v2.pdf)

> RBF: Improving Router Connection Management
> ---
>
> Key: HDFS-15757
> URL: https://issues.apache.org/jira/browse/HDFS-15757
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: RBF_ Router Connection Management.pdf
>
>
> We have seen a high number of connections from the Router to the namenodes, 
> leaving the namenodes unstable.
> This ticket aims to reduce the number of connections through some changes. 
> Please take a look at the design and leave comments. 
> Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15757) RBF: Improving Router Connection Management

2021-01-04 Thread Fengnan Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fengnan Li updated HDFS-15757:
--
Attachment: RBF_ Improving Router Connection Management_v2.pdf

> RBF: Improving Router Connection Management
> ---
>
> Key: HDFS-15757
> URL: https://issues.apache.org/jira/browse/HDFS-15757
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: RBF_ Improving Router Connection Management_v2.pdf, RBF_ 
> Router Connection Management.pdf
>
>
> We have seen a high number of connections from the Router to the namenodes, 
> leaving the namenodes unstable.
> This ticket aims to reduce the number of connections through some changes. 
> Please take a look at the design and leave comments. 
> Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15757) RBF: Improving Router Connection Management

2021-01-04 Thread Fengnan Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fengnan Li updated HDFS-15757:
--
Attachment: (was: RBF_ Improving Router Connection Management_v2.pdf)

> RBF: Improving Router Connection Management
> ---
>
> Key: HDFS-15757
> URL: https://issues.apache.org/jira/browse/HDFS-15757
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: RBF_ Router Connection Management.pdf
>
>
> We have seen a high number of connections from the Router to the namenodes, 
> leaving the namenodes unstable.
> This ticket aims to reduce the number of connections through some changes. 
> Please take a look at the design and leave comments. 
> Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org