[jira] [Commented] (HDFS-10857) Rolling upgrade can make data unavailable when the cluster has many failed volumes
[ https://issues.apache.org/jira/browse/HDFS-10857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15603358#comment-15603358 ]

Kihwal Lee commented on HDFS-10857:
-----------------------------------

I am not actively working on this. Please feel free to take over.

> Rolling upgrade can make data unavailable when the cluster has many failed
> volumes
> --------------------------------------------------------------------------
>
>                 Key: HDFS-10857
>                 URL: https://issues.apache.org/jira/browse/HDFS-10857
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.6.4
>            Reporter: Kihwal Lee
>            Priority: Critical
>         Attachments: HDFS-10857.branch-2.6.patch
>
> When the marker file or trash dir is created or removed during the heartbeat
> response processing, an {{IOException}} is thrown if tried on a failed
> volume. This stops processing of the rest of the storage directories and any
> DNA commands that were part of the heartbeat response.
>
> While this is happening, the block token key update does not happen, and all
> read and write requests start to fail until the upgrade is finalized and the
> DN receives a new key. All it takes is one failed volume. If there are three
> such nodes in the cluster, it is very likely that some blocks cannot be read.
> Unlike the common missing-blocks scenarios, the NN has no idea, although the
> effect is the same.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
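The failure mode described above is that the first {{IOException}} aborts the whole per-volume loop, taking the rest of the heartbeat response with it. A minimal sketch of the per-volume isolation fix, with hypothetical names ({{StorageDir}}, {{createTrashDir}}) standing in for the real DataNode types:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class RollingUpgradeMarkers {
  // Hypothetical stand-in for a storage directory; not the real StorageDirectory API.
  public interface StorageDir {
    String path();
    void createTrashDir() throws IOException; // throws if the volume has failed
  }

  /**
   * Visit every storage directory, collecting failures instead of letting
   * the first IOException abort the remaining directories (and the rest of
   * the heartbeat response processing).
   */
  public static List<String> createTrashOnAll(List<StorageDir> dirs) {
    List<String> failed = new ArrayList<>();
    for (StorageDir dir : dirs) {
      try {
        dir.createTrashDir();
      } catch (IOException e) {
        // Record and continue: one failed volume must not block the others.
        failed.add(dir.path());
      }
    }
    return failed;
  }
}
```

With this shape, a single bad volume is reported but the remaining directories and any queued DNA commands (including the block token key update) still get processed.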
[jira] [Commented] (HDFS-10857) Rolling upgrade can make data unavailable when the cluster has many failed volumes
[ https://issues.apache.org/jira/browse/HDFS-10857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15603256#comment-15603256 ]

Arpit Agarwal commented on HDFS-10857:
--------------------------------------

Looks like {{checkDiskError}} should get the DataNode object lock for the {{dataDirs}} modification to avoid a potential race with {{refreshVolumes}}.
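The race is that the disk-check path and {{refreshVolumes}} both mutate the same list, which is what the FindBugs "inconsistent synchronization" warning below flags. A hedged sketch of the locking discipline being suggested, using illustrative names rather than the real DataNode fields:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: both the disk-check path and refreshVolumes
// mutate the same list, so both must hold the same object lock.
public class VolumeList {
  private final List<String> dataDirs = new ArrayList<>();

  // Reconfiguration path (analogous to refreshVolumes).
  public synchronized void refreshVolumes(List<String> newDirs) {
    dataDirs.clear();
    dataDirs.addAll(newDirs);
  }

  // Disk-check path (analogous to checkDiskError): safe against a
  // concurrent refresh because it takes the same lock.
  public synchronized void removeFailedVolume(String dir) {
    dataDirs.removeIf(d -> d.equals(dir));
  }

  // Readers copy under the lock so iteration never races a mutation.
  public synchronized List<String> snapshot() {
    return new ArrayList<>(dataDirs);
  }
}
```

Guarding every access with the same monitor is what clears the "locked 71% of time" FindBugs pattern: either all accesses are synchronized or none are.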
[jira] [Commented] (HDFS-10857) Rolling upgrade can make data unavailable when the cluster has many failed volumes
[ https://issues.apache.org/jira/browse/HDFS-10857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504990#comment-15504990 ]

Hadoop QA commented on HDFS-10857:
----------------------------------

(x) *-1 overall*

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 14m 29s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | mvninstall | 8m 1s | branch-2.6 passed |
| -1 | compile | 0m 49s | hadoop-hdfs in branch-2.6 failed with JDK v1.8.0_101. |
| -1 | compile | 0m 50s | hadoop-hdfs in branch-2.6 failed with JDK v1.7.0_111. |
| +1 | checkstyle | 0m 24s | branch-2.6 passed |
| +1 | mvnsite | 1m 4s | branch-2.6 passed |
| +1 | mvneclipse | 0m 17s | branch-2.6 passed |
| -1 | findbugs | 3m 5s | hadoop-hdfs-project/hadoop-hdfs in branch-2.6 has 273 extant Findbugs warnings. |
| +1 | javadoc | 1m 13s | branch-2.6 passed with JDK v1.8.0_101 |
| +1 | javadoc | 2m 0s | branch-2.6 passed with JDK v1.7.0_111 |
| +1 | mvninstall | 0m 57s | the patch passed |
| -1 | compile | 0m 48s | hadoop-hdfs in the patch failed with JDK v1.8.0_101. |
| -1 | javac | 0m 48s | hadoop-hdfs in the patch failed with JDK v1.8.0_101. |
| -1 | compile | 0m 47s | hadoop-hdfs in the patch failed with JDK v1.7.0_111. |
| -1 | javac | 0m 47s | hadoop-hdfs in the patch failed with JDK v1.7.0_111. |
| +1 | checkstyle | 0m 22s | the patch passed |
| +1 | mvnsite | 0m 57s | the patch passed |
| +1 | mvneclipse | 0m 13s | the patch passed |
| -1 | whitespace | 0m 0s | The patch has 1347 line(s) that end in whitespace. Use git apply --whitespace=fix. Refer https://git-scm.com/docs/git-apply |
| -1 | whitespace | 0m 33s | The patch has 70 line(s) with tabs. |
| -1 | findbugs | 3m 19s | hadoop-hdfs-project/hadoop-hdfs generated 1 new + 273 unchanged - 0 fixed = 274 total (was 273) |
| +1 | javadoc | 1m 10s | the patch passed with JDK v1.8.0_101 |
| +1 | javadoc | 1m 59s | the patch passed with JDK v1.7.0_111 |
| -1 | unit | 0m 46s | hadoop-hdfs in the patch failed with JDK v1.7.0_111. |
| -1 | asflicense | 0m 34s | The patch generated 75 ASF License warnings. |
| | | 48m 3s | |

|| Reason || Tests ||
| FindBugs | module:hadoop-hdfs-project/hadoop-hdfs |
| | Inconsistent synchronization of org.apache.hadoop.hdfs.server.datanode.DataNode.dataDirs; locked 71% of time. Unsynchronized access at DataNode.java:[line 3098] |

|| Subsystem || Report/Notes ||
| Docker |
[jira] [Commented] (HDFS-10857) Rolling upgrade can make data unavailable when the cluster has many failed volumes
[ https://issues.apache.org/jira/browse/HDFS-10857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504768#comment-15504768 ]

Kihwal Lee commented on HDFS-10857:
-----------------------------------

The patch partially brings in changes available in branch-2.7 to make removing volumes easier.
[jira] [Commented] (HDFS-10857) Rolling upgrade can make data unavailable when the cluster has many failed volumes
[ https://issues.apache.org/jira/browse/HDFS-10857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15484851#comment-15484851 ]

Kihwal Lee commented on HDFS-10857:
-----------------------------------

We can have {{FsDatasetImpl#checkDataDir()}} remove the failed volumes from {{dataStorage}} at the end. It might need to remove them from {{DataNode.dataDirs}} as well.
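The point of the suggestion is that a failed volume has to disappear from every bookkeeping structure, or later per-volume operations (like the rolling-upgrade marker handling) will still trip over it. A hedged sketch of that cleanup step, with illustrative signatures rather than the real {{FsDatasetImpl}} API:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the suggested end-of-checkDataDir cleanup:
// once failed volumes are identified, drop them from both structures
// so subsequent operations no longer touch them.
public class FailedVolumeCleanup {
  public static void removeFailed(Set<String> failedVolumes,
                                  Map<String, Object> dataStorage, // stands in for DataStorage state
                                  List<String> dataDirs) {         // stands in for DataNode.dataDirs
    for (String vol : failedVolumes) {
      dataStorage.remove(vol);                 // drop the storage-directory entry
      dataDirs.removeIf(d -> d.equals(vol));   // and the configured data dir
    }
  }
}
```

Removing the volume from only one of the two structures would leave the other as a source of repeated {{IOException}}s on the same dead disk.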
[jira] [Commented] (HDFS-10857) Rolling upgrade can make data unavailable when the cluster has many failed volumes
[ https://issues.apache.org/jira/browse/HDFS-10857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15484751#comment-15484751 ]

Kihwal Lee commented on HDFS-10857:
-----------------------------------

It looks like it is fixed in 2.8 and later. {{DataNode#checkDiskError()}} does remove the failed volume from {{DataStorage}}.