[jira] [Commented] (HDFS-10857) Rolling upgrade can make data unavailable when the cluster has many failed volumes

2016-10-24 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603358#comment-15603358
 ] 

Kihwal Lee commented on HDFS-10857:
---

I am not actively working on this. Please feel free to take over.

> Rolling upgrade can make data unavailable when the cluster has many failed 
> volumes
> --
>
> Key: HDFS-10857
> URL: https://issues.apache.org/jira/browse/HDFS-10857
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.4
>Reporter: Kihwal Lee
>Priority: Critical
> Attachments: HDFS-10857.branch-2.6.patch
>
>
> When the marker file or trash dir is created or removed during heartbeat 
> response processing, an {{IOException}} is thrown if the operation is 
> attempted on a failed volume. This stops processing of the rest of the 
> storage directories and any DNA commands that were part of the heartbeat 
> response.
> While this is happening, the block token key update does not happen and all 
> read and write requests start to fail until the upgrade is finalized and the 
> DN receives a new key. All it takes is one failed volume. If there are three 
> such nodes in the cluster, it is very likely that some blocks cannot be read. 
> Unlike the common missing-blocks scenarios, the NN has no idea, although the 
> effect is the same.
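
To make the failure mode concrete, here is a minimal sketch (illustrative names only, not the actual DataNode heartbeat code) of how a single failed volume's {{IOException}} aborts the per-volume loop and everything queued after it, and how handling the exception per volume would let the remaining directories and the block token key update still run:

{code:java}
import java.io.File;
import java.io.IOException;
import java.util.List;

// Simplified illustration; not the actual DataNode heartbeat code.
class RollingUpgradeMarkerSketch {

  // Hypothetical per-volume step, e.g. creating a rolling-upgrade marker/trash dir.
  static void touchMarker(File storageDir) throws IOException {
    File marker = new File(storageDir, "rolling_upgrade.marker");
    if (!marker.exists() && !marker.createNewFile()) {
      throw new IOException("cannot create marker under " + storageDir);
    }
  }

  // Failure mode described above: the first failed volume throws, the loop
  // aborts, and the block token key update that follows is never reached.
  static void processResponseFragile(List<File> storageDirs) throws IOException {
    for (File dir : storageDirs) {
      touchMarker(dir);
    }
    updateBlockTokenKeys();   // skipped once the exception propagates
  }

  // One way to harden it: handle the IOException per volume so the remaining
  // directories and the rest of the heartbeat response still get processed.
  static void processResponseTolerant(List<File> storageDirs) {
    for (File dir : storageDirs) {
      try {
        touchMarker(dir);
      } catch (IOException e) {
        System.err.println("skipping failed volume " + dir + ": " + e.getMessage());
      }
    }
    updateBlockTokenKeys();   // still runs even with failed volumes
  }

  // Stand-in for the DNA command that refreshes block token keys.
  static void updateBlockTokenKeys() {
    // no-op in this sketch
  }
}
{code}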






[jira] [Commented] (HDFS-10857) Rolling upgrade can make data unavailable when the cluster has many failed volumes

2016-10-24 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603256#comment-15603256
 ] 

Arpit Agarwal commented on HDFS-10857:
--

Looks like {{checkDiskError}} should get the DataNode object lock for the 
{{dataDirs}} modification to avoid a potential race with {{refreshVolumes}}.
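
A minimal sketch of the locking concern, using simplified stand-ins rather than the real DataNode fields: both paths mutate the same {{dataDirs}} list, so both need to hold the DataNode object lock.

{code:java}
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the concern above; not the real DataNode class.
class DataDirsLockingSketch {
  private final List<String> dataDirs = new ArrayList<>();

  // refreshVolumes-like path: runs under the object lock.
  synchronized void refreshVolumes(List<String> newDirs) {
    dataDirs.clear();
    dataDirs.addAll(newDirs);
  }

  // checkDiskError-like path: takes the same lock before removing failed
  // directories; without it the two updates can race and corrupt the list.
  void removeFailedDirs(List<String> failedDirs) {
    synchronized (this) {
      dataDirs.removeAll(failedDirs);
    }
  }
}
{code}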

> Rolling upgrade can make data unavailable when the cluster has many failed 
> volumes
> --
>
> Key: HDFS-10857
> URL: https://issues.apache.org/jira/browse/HDFS-10857
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.4
>Reporter: Kihwal Lee
>Priority: Critical
> Attachments: HDFS-10857.branch-2.6.patch
>
>
> When the marker file or trash dir is created or removed during heartbeat 
> response processing, an {{IOException}} is thrown if the operation is 
> attempted on a failed volume. This stops processing of the rest of the 
> storage directories and any DNA commands that were part of the heartbeat 
> response.
> While this is happening, the block token key update does not happen and all 
> read and write requests start to fail until the upgrade is finalized and the 
> DN receives a new key. All it takes is one failed volume. If there are three 
> such nodes in the cluster, it is very likely that some blocks cannot be read. 
> Unlike the common missing-blocks scenarios, the NN has no idea, although the 
> effect is the same.






[jira] [Commented] (HDFS-10857) Rolling upgrade can make data unavailable when the cluster has many failed volumes

2016-09-19 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504990#comment-15504990
 ] 

Hadoop QA commented on HDFS-10857:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 14m 29s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  8m  1s{color} | {color:green} branch-2.6 passed {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 49s{color} | {color:red} hadoop-hdfs in branch-2.6 failed with JDK v1.8.0_101. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 50s{color} | {color:red} hadoop-hdfs in branch-2.6 failed with JDK v1.7.0_111. {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 24s{color} | {color:green} branch-2.6 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  4s{color} | {color:green} branch-2.6 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 17s{color} | {color:green} branch-2.6 passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  3m  5s{color} | {color:red} hadoop-hdfs-project/hadoop-hdfs in branch-2.6 has 273 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 13s{color} | {color:green} branch-2.6 passed with JDK v1.8.0_101 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m  0s{color} | {color:green} branch-2.6 passed with JDK v1.7.0_111 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 57s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 48s{color} | {color:red} hadoop-hdfs in the patch failed with JDK v1.8.0_101. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 48s{color} | {color:red} hadoop-hdfs in the patch failed with JDK v1.8.0_101. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 47s{color} | {color:red} hadoop-hdfs in the patch failed with JDK v1.7.0_111. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 47s{color} | {color:red} hadoop-hdfs in the patch failed with JDK v1.7.0_111. {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 13s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  0s{color} | {color:red} The patch has 1347 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m 33s{color} | {color:red} The patch has 70 line(s) with tabs. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  3m 19s{color} | {color:red} hadoop-hdfs-project/hadoop-hdfs generated 1 new + 273 unchanged - 0 fixed = 274 total (was 273) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 10s{color} | {color:green} the patch passed with JDK v1.8.0_101 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 59s{color} | {color:green} the patch passed with JDK v1.7.0_111 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  0m 46s{color} | {color:red} hadoop-hdfs in the patch failed with JDK v1.7.0_111. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 34s{color} | {color:red} The patch generated 75 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 48m  3s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-hdfs-project/hadoop-hdfs |
|  |  Inconsistent synchronization of org.apache.hadoop.hdfs.server.datanode.DataNode.dataDirs; locked 71% of time  Unsynchronized access at DataNode.java:71% of time  Unsynchronized access at DataNode.java:[line 3098] |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  

[jira] [Commented] (HDFS-10857) Rolling upgrade can make data unavailable when the cluster has many failed volumes

2016-09-19 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504768#comment-15504768
 ] 

Kihwal Lee commented on HDFS-10857:
---

The patch partially brings in changes available in branch-2.7 to make removing 
volumes easier.

> Rolling upgrade can make data unavailable when the cluster has many failed 
> volumes
> --
>
> Key: HDFS-10857
> URL: https://issues.apache.org/jira/browse/HDFS-10857
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.4
>Reporter: Kihwal Lee
>Priority: Critical
> Attachments: HDFS-10857.branch-2.6.patch
>
>
> When the marker file or trash dir is created or removed during heartbeat 
> response processing, an {{IOException}} is thrown if the operation is 
> attempted on a failed volume. This stops processing of the rest of the 
> storage directories and any DNA commands that were part of the heartbeat 
> response.
> While this is happening, the block token key update does not happen and all 
> read and write requests start to fail until the upgrade is finalized and the 
> DN receives a new key. All it takes is one failed volume. If there are three 
> such nodes in the cluster, it is very likely that some blocks cannot be read. 
> Unlike the common missing-blocks scenarios, the NN has no idea, although the 
> effect is the same.






[jira] [Commented] (HDFS-10857) Rolling upgrade can make data unavailable when the cluster has many failed volumes

2016-09-12 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484851#comment-15484851
 ] 

Kihwal Lee commented on HDFS-10857:
---

We can have {{FsDatasetImpl#checkDataDir()}} remove the failed volumes from 
{{dataStorage}} at the end. It might need to remove them from 
{{DataNode.dataDirs}} as well.
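
A rough sketch of that idea, using simplified stand-ins (the real {{FsDatasetImpl}}, {{DataStorage}} and {{DataNode}} APIs differ): once the failed volumes have been identified, drop them from both views so later heartbeat processing never iterates over them again.

{code:java}
import java.util.HashSet;
import java.util.Set;

// Illustrative stand-ins only; not the actual FsDatasetImpl/DataStorage code.
class DropFailedVolumesSketch {
  final Set<String> dataStorageDirs = new HashSet<>(); // stand-in for DataStorage
  final Set<String> dataDirs = new HashSet<>();        // stand-in for DataNode.dataDirs

  // Would be called at the end of a checkDataDir()-style scan with the
  // volumes that scan found to be failed.
  void dropFailedVolumes(Set<String> failedVolumes) {
    dataStorageDirs.removeAll(failedVolumes);
    dataDirs.removeAll(failedVolumes);
  }
}
{code}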

> Rolling upgrade can make data unavailable when the cluster has many failed 
> volumes
> --
>
> Key: HDFS-10857
> URL: https://issues.apache.org/jira/browse/HDFS-10857
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.4
>Reporter: Kihwal Lee
>Priority: Critical
>
> When the marker file or trash dir is created or removed during heartbeat 
> response processing, an {{IOException}} is thrown if the operation is 
> attempted on a failed volume. This stops processing of the rest of the 
> storage directories and any DNA commands that were part of the heartbeat 
> response.
> While this is happening, the block token key update does not happen and all 
> read and write requests start to fail until the upgrade is finalized and the 
> DN receives a new key. All it takes is one failed volume. If there are three 
> such nodes in the cluster, it is very likely that some blocks cannot be read. 
> Unlike the common missing-blocks scenarios, the NN has no idea, although the 
> effect is the same.






[jira] [Commented] (HDFS-10857) Rolling upgrade can make data unavailable when the cluster has many failed volumes

2016-09-12 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484751#comment-15484751
 ] 

Kihwal Lee commented on HDFS-10857:
---

It looks like it is fixed in 2.8 and later. {{DataNode#checkDiskError()}} does 
remove the failed volume from {{DataStorage}}.

> Rolling upgrade can make data unavailable when the cluster has many failed 
> volumes
> --
>
> Key: HDFS-10857
> URL: https://issues.apache.org/jira/browse/HDFS-10857
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.4
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
>
> When the marker file or trash dir is created or removed during heartbeat 
> response processing, an {{IOException}} is thrown if the operation is 
> attempted on a failed volume. This stops processing of the rest of the 
> storage directories and any DNA commands that were part of the heartbeat 
> response.
> While this is happening, the block token key update does not happen and all 
> read and write requests start to fail until the upgrade is finalized and the 
> DN receives a new key. All it takes is one failed volume. If there are three 
> such nodes in the cluster, it is very likely that some blocks cannot be read. 
> Unlike the common missing-blocks scenarios, the NN has no idea, although the 
> effect is the same.


