[jira] [Updated] (HDFS-12070) Failed block recovery leaves files open indefinitely and at risk for data loss

2018-04-05 Lei (Eddy) Xu (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-12070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lei (Eddy) Xu updated HDFS-12070:
---------------------------------
    Fix Version/s: 3.0.3  (was: 3.0.2)

> Failed block recovery leaves files open indefinitely and at risk for data loss
> --
>
> Key: HDFS-12070
> URL: https://issues.apache.org/jira/browse/HDFS-12070
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.0.0-alpha
>Reporter: Daryn Sharp
>Assignee: Kihwal Lee
>Priority: Major
> Fix For: 3.1.0, 2.10.0, 2.9.1, 2.8.4, 3.0.3
>
> Attachments: HDFS-12070.0.patch, HDFS-12070.1.patch, lease.patch
>
>
> Files will remain open indefinitely if block recovery fails, which creates a
> high risk of data loss.  The replication monitor will not replicate these
> blocks.
> The NN provides the primary node a list of candidate nodes for recovery,
> which involves a two-stage process.  In stage 1, the primary node removes any
> candidates that cannot init replica recovery (essentially, nodes that are
> alive and know about the block) to create a sync list.  Stage 2 issues
> updates to the sync list, _but unlike the first stage it fails if any node
> fails_.  The NN should instead be informed of the nodes that did succeed.
> Manual recovery will also fail until the problematic node is temporarily
> stopped, so that a connection refused causes the bad node to be pruned from
> the candidates.  Then recovery succeeds, the lease is released,
> under-replication is fixed, and the block is invalidated on the bad node.
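
To make the two-stage flow and the suggested change concrete, here is a minimal, hypothetical Java sketch. It is not the HDFS-12070 patch and does not use the real InterDatanodeProtocol or DatanodeProtocol signatures; RecoveryTarget, tryInitReplicaRecovery, and tryUpdateReplica are illustrative stand-ins for the calls the primary datanode makes during recovery.

{code:java}
// Hypothetical sketch only -- not the HDFS-12070 patch.  RecoveryTarget and
// its methods stand in for the real inter-datanode recovery calls.
import java.util.ArrayList;
import java.util.List;

public class BlockRecoverySketch {

  /** Stand-in for a candidate datanode participating in recovery. */
  interface RecoveryTarget {
    /** Stage 1 probe: true if the node is alive and knows about the block. */
    boolean tryInitReplicaRecovery(long blockId);
    /** Stage 2 update: true if the node applied the new length/genstamp. */
    boolean tryUpdateReplica(long blockId, long newGenStamp, long newLength);
  }

  /** Stage 1: drop candidates that cannot init replica recovery. */
  static List<RecoveryTarget> buildSyncList(List<RecoveryTarget> candidates,
                                            long blockId) {
    List<RecoveryTarget> syncList = new ArrayList<>();
    for (RecoveryTarget t : candidates) {
      if (t.tryInitReplicaRecovery(blockId)) {
        syncList.add(t);                // keep only reachable, block-aware nodes
      }
    }
    return syncList;
  }

  /**
   * Stage 2, with the behavior the report asks for: a failed update prunes
   * that node instead of aborting the whole recovery, and only the nodes
   * that succeeded are reported back to the NN.
   */
  static List<RecoveryTarget> updateSyncList(List<RecoveryTarget> syncList,
                                             long blockId,
                                             long newGenStamp,
                                             long newLength) {
    List<RecoveryTarget> succeeded = new ArrayList<>();
    for (RecoveryTarget t : syncList) {
      if (t.tryUpdateReplica(blockId, newGenStamp, newLength)) {
        succeeded.add(t);               // node applied the update
      }                                 // failure: prune the node, keep going
    }
    if (succeeded.isEmpty()) {
      // Only give up when no node at all completed the update.
      throw new IllegalStateException(
          "block recovery failed on every datanode for block " + blockId);
    }
    return succeeded;                   // tell the NN about these nodes only
  }
}
{code}

The point of contrast is the stage-2 loop: as reported, a single failed update aborts the whole recovery and the file stays open, whereas pruning the failed node and reporting only the surviving nodes to the NN lets the lease be released, under-replication be repaired, and the bad replica be invalidated.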




[jira] [Updated] (HDFS-12070) Failed block recovery leaves files open indefinitely and at risk for data loss

2018-02-26 Kihwal Lee (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-12070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-12070:
------------------------------
       Resolution: Fixed
     Hadoop Flags: Reviewed
    Fix Version/s: 3.2.0, 3.0.2, 2.8.4, 2.9.1, 2.10.0, 3.1.0
           Status: Resolved  (was: Patch Available)

Committed to trunk, branch-3.1, branch-3.0, branch-2, branch-2.9 and branch-2.8.

> Failed block recovery leaves files open indefinitely and at risk for data loss
> --
>
> Key: HDFS-12070
> URL: https://issues.apache.org/jira/browse/HDFS-12070
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.0.0-alpha
>Reporter: Daryn Sharp
>Assignee: Kihwal Lee
>Priority: Major
> Fix For: 3.1.0, 2.10.0, 2.9.1, 2.8.4, 3.0.2, 3.2.0
>
> Attachments: HDFS-12070.0.patch, HDFS-12070.1.patch, lease.patch




[jira] [Updated] (HDFS-12070) Failed block recovery leaves files open indefinitely and at risk for data loss

2018-02-22 Kihwal Lee (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-12070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-12070:
------------------------------
    Attachment: HDFS-12070.1.patch

> Failed block recovery leaves files open indefinitely and at risk for data loss
> --
>
> Key: HDFS-12070
> URL: https://issues.apache.org/jira/browse/HDFS-12070
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.0.0-alpha
>Reporter: Daryn Sharp
>Assignee: Kihwal Lee
>Priority: Major
> Attachments: HDFS-12070.0.patch, HDFS-12070.1.patch, lease.patch




[jira] [Updated] (HDFS-12070) Failed block recovery leaves files open indefinitely and at risk for data loss

2018-02-13 Kihwal Lee (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-12070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-12070:
------------------------------
    Attachment: HDFS-12070.0.patch

> Failed block recovery leaves files open indefinitely and at risk for data loss
> --
>
> Key: HDFS-12070
> URL: https://issues.apache.org/jira/browse/HDFS-12070
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.0.0-alpha
>Reporter: Daryn Sharp
>Assignee: Kihwal Lee
>Priority: Major
> Attachments: HDFS-12070.0.patch, lease.patch




[jira] [Updated] (HDFS-12070) Failed block recovery leaves files open indefinitely and at risk for data loss

2018-01-10 Kihwal Lee (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-12070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-12070:
------------------------------
    Status: Patch Available  (was: Open)

> Failed block recovery leaves files open indefinitely and at risk for data loss
> --
>
> Key: HDFS-12070
> URL: https://issues.apache.org/jira/browse/HDFS-12070
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.0.0-alpha
>Reporter: Daryn Sharp
>Assignee: Kihwal Lee
> Attachments: lease.patch




[jira] [Updated] (HDFS-12070) Failed block recovery leaves files open indefinitely and at risk for data loss

2018-01-10 Kihwal Lee (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-12070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-12070:
------------------------------
    Attachment: lease.patch

Attaching a patch without any unit test to run it through the QA bot.

> Failed block recovery leaves files open indefinitely and at risk for data loss
> --
>
> Key: HDFS-12070
> URL: https://issues.apache.org/jira/browse/HDFS-12070
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.0.0-alpha
>Reporter: Daryn Sharp
>Assignee: Kihwal Lee
> Attachments: lease.patch




[jira] [Updated] (HDFS-12070) Failed block recovery leaves files open indefinitely and at risk for data loss

2017-09-10 Junping Du (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-12070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Junping Du updated HDFS-12070:
------------------------------
    Target Version/s: 2.8.3  (was: 2.8.2)

> Failed block recovery leaves files open indefinitely and at risk for data loss
> --
>
> Key: HDFS-12070
> URL: https://issues.apache.org/jira/browse/HDFS-12070
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.0.0-alpha
>Reporter: Daryn Sharp


