[jira] [Updated] (HDFS-12070) Failed block recovery leaves files open indefinitely and at risk for data loss
[ https://issues.apache.org/jira/browse/HDFS-12070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lei (Eddy) Xu updated HDFS-12070:
---------------------------------
    Fix Version/s:     (was: 3.0.2)
                   3.0.3

> Failed block recovery leaves files open indefinitely and at risk for data loss
> ------------------------------------------------------------------------------
>
>                 Key: HDFS-12070
>                 URL: https://issues.apache.org/jira/browse/HDFS-12070
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.0.0-alpha
>            Reporter: Daryn Sharp
>            Assignee: Kihwal Lee
>            Priority: Major
>             Fix For: 3.1.0, 2.10.0, 2.9.1, 2.8.4, 3.0.3
>
>         Attachments: HDFS-12070.0.patch, HDFS-12070.1.patch, lease.patch
>
>
> Files will remain open indefinitely if block recovery fails, which creates a
> high risk of data loss. The replication monitor will not replicate these
> blocks.
> The NN provides the primary node with a list of candidate nodes for recovery,
> which involves a 2-stage process. The primary node removes any candidates that
> cannot init replica recovery (essentially: alive and knows about the block) to
> create a sync list. Stage 2 issues updates to the sync list – _but fails if
> any node fails_, unlike the first stage. The NN should be informed of the
> nodes that did succeed.
> Manual recovery will also fail until the problematic node is temporarily
> stopped, so that a connection refused will induce the bad node to be pruned
> from the candidates. Recovery then succeeds, the lease is released,
> under-replication is fixed, and the block is invalidated on the bad node.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
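The two-stage recovery flow in the description can be sketched as below. This is an illustrative sketch only, not the actual Hadoop code: the class and method names (`RecoverySketch`, `buildSyncList`, `updateSyncList`) and the predicate-based stand-ins for the init/update RPCs are hypothetical; the point is that stage 1 already tolerates per-node failure while stage 2 should skip a failing node and report the successes to the NN rather than aborting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class RecoverySketch {

    // Stage 1: prune candidates that cannot initialize replica recovery
    // (i.e. are not alive or do not know about the block) to form the
    // sync list. A failing candidate is simply dropped.
    static List<String> buildSyncList(List<String> candidates,
                                      Predicate<String> canInitRecovery) {
        List<String> syncList = new ArrayList<>();
        for (String node : candidates) {
            if (canInitRecovery.test(node)) {
                syncList.add(node);
            }
        }
        return syncList;
    }

    // Stage 2, with the behavior the report argues for: issue the replica
    // update to each sync-list node, skip individual failures instead of
    // aborting, and return the nodes that succeeded so the NN can be told.
    static List<String> updateSyncList(List<String> syncList,
                                       Predicate<String> updateReplica) {
        List<String> succeeded = new ArrayList<>();
        for (String node : syncList) {
            try {
                if (updateReplica.test(node)) {
                    succeeded.add(node);
                }
            } catch (RuntimeException e) {
                // Pre-fix behavior failed the whole recovery here, leaving
                // the file open; skipping the bad node lets recovery commit
                // with the remaining replicas.
            }
        }
        return succeeded;
    }
}
```

Under this sketch, a "connection refused" from one datanode during stage 2 no longer blocks lease recovery; the bad node is excluded and later invalidated once the block is committed on the surviving replicas.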
Kihwal Lee updated HDFS-12070:
------------------------------
       Resolution: Fixed
     Hadoop Flags: Reviewed
    Fix Version/s: 3.2.0
                   3.0.2
                   2.8.4
                   2.9.1
                   2.10.0
                   3.1.0
           Status: Resolved  (was: Patch Available)

Committed to trunk, branch-3.1, branch-3.0, branch-2, branch-2.9 and branch-2.8.
Kihwal Lee updated HDFS-12070:
------------------------------
    Attachment: HDFS-12070.1.patch
Kihwal Lee updated HDFS-12070:
------------------------------
    Attachment: HDFS-12070.0.patch
Kihwal Lee updated HDFS-12070:
------------------------------
    Status: Patch Available  (was: Open)
Kihwal Lee updated HDFS-12070:
------------------------------
    Attachment: lease.patch

Attaching a patch without any unit test to run through the QA bot.
Junping Du updated HDFS-12070:
------------------------------
    Target Version/s: 2.8.3  (was: 2.8.2)