[jira] [Commented] (HDFS-10434) Fix intermittent test failure of TestDataNodeErasureCodingMetrics

2016-05-23 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296176#comment-15296176
 ] 

Li Bo commented on HDFS-10434:
--

Thanks for [~rakeshr]'s detailed explanation.  The situation described can 
cause the test case to fail, and the patch fixes the problem.  +1 for the patch.
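For context, the failing assertion (quoted below) comes from {{MetricsAsserts}}. 
A minimal sketch of that style of check, assuming a {{datanode}} handle from the 
test's MiniDFSCluster:
{code}
import static org.apache.hadoop.test.MetricsAsserts.assertCounter;
import static org.apache.hadoop.test.MetricsAsserts.getMetrics;

// Fetch the datanode's metrics record and assert the counter value. This is
// what fails with "Bad value for metric EcReconstructionTasks" when the
// reconstruction task has not been counted yet.
MetricsRecordBuilder rb = getMetrics(datanode.getMetrics().name());
assertCounter("EcReconstructionTasks", 1L, rb);
{code}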

> Fix intermittent test failure of TestDataNodeErasureCodingMetrics
> -
>
> Key: HDFS-10434
> URL: https://issues.apache.org/jira/browse/HDFS-10434
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Rakesh R
>Assignee: Rakesh R
> Attachments: HDFS-10434-00.patch
>
>
> This jira is to fix the test case failure.
> Reference : 
> [Build15485_TestDataNodeErasureCodingMetrics_testEcTasks|https://builds.apache.org/job/PreCommit-HDFS-Build/15485/testReport/org.apache.hadoop.hdfs.server.datanode/TestDataNodeErasureCodingMetrics/testEcTasks/]
> {code}
> Error Message
> Bad value for metric EcReconstructionTasks expected:<1> but was:<0>
> Stacktrace
> java.lang.AssertionError: Bad value for metric EcReconstructionTasks 
> expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at 
> org.apache.hadoop.test.MetricsAsserts.assertCounter(MetricsAsserts.java:228)
>   at 
> org.apache.hadoop.hdfs.server.datanode.TestDataNodeErasureCodingMetrics.testEcTasks(TestDataNodeErasureCodingMetrics.java:92)
> {code}






[jira] [Commented] (HDFS-10434) Fix intermittent test failure of TestDataNodeErasureCodingMetrics

2016-05-23 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296112#comment-15296112
 ] 

Li Bo commented on HDFS-10434:
--

Thanks to [~rakeshr] for finding the problem. 
{{DFSTestUtil.waitForDatanodeState()}} and 
{{StripedFileTestUtil.waitForReconstructionFinished()}} should already make sure 
that the reconstruction work is finished before the metrics are checked, so I am 
confused about why those two calls do not take effect.  Does the failure no 
longer happen after applying the patch (maybe run the test case more than 20 
times)?
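For illustration, one way to close such a race is to poll the metric instead of 
asserting once; a sketch using {{GenericTestUtils.waitFor}} (this shows the 
idea only, not necessarily what the patch does):
{code}
// Retry until the counter reaches the expected value instead of asserting
// immediately after reconstruction is triggered. Supplier here is
// com.google.common.base.Supplier, as used by GenericTestUtils.waitFor.
GenericTestUtils.waitFor(new Supplier<Boolean>() {
  @Override
  public Boolean get() {
    MetricsRecordBuilder rb = getMetrics(datanode.getMetrics().name());
    return getLongCounter("EcReconstructionTasks", rb) == 1L;
  }
}, 500, 60000); // check every 500 ms, give up after 60 s
{code}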

> Fix intermittent test failure of TestDataNodeErasureCodingMetrics
> -
>
> Key: HDFS-10434
> URL: https://issues.apache.org/jira/browse/HDFS-10434
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Rakesh R
>Assignee: Rakesh R
> Attachments: HDFS-10434-00.patch
>
>
> This jira is to fix the test case failure.
> Reference : 
> [Build15485_TestDataNodeErasureCodingMetrics_testEcTasks|https://builds.apache.org/job/PreCommit-HDFS-Build/15485/testReport/org.apache.hadoop.hdfs.server.datanode/TestDataNodeErasureCodingMetrics/testEcTasks/]
> {code}
> Error Message
> Bad value for metric EcReconstructionTasks expected:<1> but was:<0>
> Stacktrace
> java.lang.AssertionError: Bad value for metric EcReconstructionTasks 
> expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at 
> org.apache.hadoop.test.MetricsAsserts.assertCounter(MetricsAsserts.java:228)
>   at 
> org.apache.hadoop.hdfs.server.datanode.TestDataNodeErasureCodingMetrics.testEcTasks(TestDataNodeErasureCodingMetrics.java:92)
> {code}






[jira] [Commented] (HDFS-8449) Add tasks count metrics to datanode for ECWorker

2016-05-16 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284223#comment-15284223
 ] 

Li Bo commented on HDFS-8449:
-

Thanks for Kai's review and commit. I have just resolved this jira.

> Add tasks count metrics to datanode for ECWorker
> 
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, 
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch, 
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch, 
> HDFS-8449-008.patch, HDFS-8449-009.patch, HDFS-8449-010.patch, 
> HDFS-8449-v10.patch, HDFS-8449-v11.patch, HDFS-8449-v12.patch
>
>
> This sub-task tries to record the EC recovery tasks that a datanode has done, 
> including total tasks, failed tasks and successful tasks.






[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker

2016-05-16 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8449:

  Resolution: Fixed
Target Version/s:   (was: )
  Status: Resolved  (was: Patch Available)

> Add tasks count metrics to datanode for ECWorker
> 
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, 
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch, 
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch, 
> HDFS-8449-008.patch, HDFS-8449-009.patch, HDFS-8449-010.patch, 
> HDFS-8449-v10.patch, HDFS-8449-v11.patch, HDFS-8449-v12.patch
>
>
> This sub-task tries to record the EC recovery tasks that a datanode has done, 
> including total tasks, failed tasks and successful tasks.






[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker

2016-05-15 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8449:

Attachment: HDFS-8449-v12.patch

Upload v12 (same as v11) to trigger the test.

> Add tasks count metrics to datanode for ECWorker
> 
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, 
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch, 
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch, 
> HDFS-8449-008.patch, HDFS-8449-009.patch, HDFS-8449-010.patch, 
> HDFS-8449-v10.patch, HDFS-8449-v11.patch, HDFS-8449-v12.patch
>
>
> This sub-task tries to record the EC recovery tasks that a datanode has done, 
> including total tasks, failed tasks and successful tasks.






[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker

2016-05-09 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8449:

Status: Patch Available  (was: In Progress)

> Add tasks count metrics to datanode for ECWorker
> 
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, 
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch, 
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch, 
> HDFS-8449-008.patch, HDFS-8449-009.patch, HDFS-8449-010.patch, 
> HDFS-8449-v10.patch, HDFS-8449-v11.patch
>
>
> This sub-task tries to record the EC recovery tasks that a datanode has done, 
> including total tasks, failed tasks and successful tasks.






[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker

2016-05-09 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8449:

Status: In Progress  (was: Patch Available)

To trigger the test

> Add tasks count metrics to datanode for ECWorker
> 
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, 
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch, 
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch, 
> HDFS-8449-008.patch, HDFS-8449-009.patch, HDFS-8449-010.patch, 
> HDFS-8449-v10.patch, HDFS-8449-v11.patch
>
>
> This sub-task tries to record the EC recovery tasks that a datanode has done, 
> including total tasks, failed tasks and successful tasks.






[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker

2016-05-09 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8449:

Attachment: HDFS-8449-v11.patch

v11 fixes the checkstyle problems.

> Add tasks count metrics to datanode for ECWorker
> 
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, 
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch, 
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch, 
> HDFS-8449-008.patch, HDFS-8449-009.patch, HDFS-8449-010.patch, 
> HDFS-8449-v10.patch, HDFS-8449-v11.patch
>
>
> This sub-task tries to record the EC recovery tasks that a datanode has done, 
> including total tasks, failed tasks and successful tasks.






[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker

2016-05-08 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8449:

Attachment: HDFS-8449-v10.patch

Rename the patch to trigger the test.

> Add tasks count metrics to datanode for ECWorker
> 
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, 
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch, 
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch, 
> HDFS-8449-008.patch, HDFS-8449-009.patch, HDFS-8449-010.patch, 
> HDFS-8449-v10.patch
>
>
> This sub-task tries to record the EC recovery tasks that a datanode has done, 
> including total tasks, failed tasks and successful tasks.






[jira] [Updated] (HDFS-8901) Use ByteBuffer in striping positional read

2016-05-06 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8901:

Attachment: HDFS-8901-v9.patch

Fix the bugs in TestParallelShortCircuitLegacyRead.

> Use ByteBuffer in striping positional read
> --
>
> Key: HDFS-8901
> URL: https://issues.apache.org/jira/browse/HDFS-8901
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Kai Zheng
>Assignee: Kai Zheng
> Attachments: HDFS-8901-v2.patch, HDFS-8901-v3.patch, 
> HDFS-8901-v4.patch, HDFS-8901-v5.patch, HDFS-8901-v6.patch, 
> HDFS-8901-v7.patch, HDFS-8901-v8.patch, HDFS-8901-v9.patch, initial-poc.patch
>
>
> The native erasure coder prefers direct ByteBuffers for performance reasons. 
> To prepare for that, this change uses ByteBuffer throughout the code 
> implementing striped positional read. It also avoids unnecessary data copying 
> between striped-read chunk buffers and decode input buffers.
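As background on the direct-buffer preference mentioned above, a minimal 
illustration (not from the patch):
{code}
// Heap buffers are backed by a Java byte[]; direct buffers live off-heap and
// can be handed to a native (JNI) erasure coder without an extra copy.
ByteBuffer onHeap = ByteBuffer.allocate(64 * 1024);
ByteBuffer direct = ByteBuffer.allocateDirect(64 * 1024);
assert direct.isDirect() && !onHeap.isDirect();
{code}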






[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker

2016-05-06 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8449:

Attachment: HDFS-8449-010.patch

Fix the problems according to Kai's comments.

> Add tasks count metrics to datanode for ECWorker
> 
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, 
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch, 
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch, 
> HDFS-8449-008.patch, HDFS-8449-009.patch, HDFS-8449-010.patch
>
>
> This sub-task tries to record the EC recovery tasks that a datanode has done, 
> including total tasks, failed tasks and successful tasks.






[jira] [Commented] (HDFS-8449) Add tasks count metrics to datanode for ECWorker

2016-05-05 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272031#comment-15272031
 ] 

Li Bo commented on HDFS-8449:
-

Thanks very much for Kai's review.

bq. Could you enhance TestReconstructStripedFile similarly?
bq. Could we share TestReconstructStripedFile#waitForRecoveryFinished and avoid 
waitForRecoveryFinished?
I think it's a little strange to call 
{{TestReconstructStripedFile#waitForRecoveryFinished}} in 
{{TestDataNodeErasureCodingMetrics}}, because a change to 
{{TestReconstructStripedFile}} could then break 
{{TestDataNodeErasureCodingMetrics}}; we can move the shared function to a util 
class instead. I also think it's better to make the changes to 
{{TestReconstructStripedFile}} in a new separate jira, so that this jira stays 
focused on the datanode metrics test.

bq. Could we use DFSTestUtil.writeFile to generate the test file?
Both implementations are OK; many test cases write a file directly through an 
output stream, as sketched below.
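For reference, the two approaches being compared look roughly like this (a 
sketch; {{fs}} and {{path}} come from the test's cluster setup):
{code}
// Option 1: the DFSTestUtil helper.
DFSTestUtil.writeFile(fs, path, "test data");

// Option 2: writing through an output stream directly.
FSDataOutputStream out = fs.create(path);
out.write("test data".getBytes());
out.close();
{code}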

bq. I'm not sure about the following block codes are necessary.
The system executes these actions periodically; in the test we should make 
sure they have actually run before moving forward, for example as sketched below.
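For illustration, tests can also force the periodic datanode actions to run 
instead of waiting for the background timers; a sketch using the test utilities 
(assuming a handle {{dn}} to the datanode):
{code}
// Run the datanode's periodic work now so the test can assert right away.
DataNodeTestUtils.triggerHeartbeat(dn);
DataNodeTestUtils.triggerBlockReport(dn);
{code}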


> Add tasks count metrics to datanode for ECWorker
> 
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, 
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch, 
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch, 
> HDFS-8449-008.patch, HDFS-8449-009.patch
>
>
> This sub-task tries to record the EC recovery tasks that a datanode has done, 
> including total tasks, failed tasks and successful tasks.






[jira] [Updated] (HDFS-8901) Use ByteBuffer in striping positional read

2016-05-04 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8901:

Attachment: HDFS-8901-v8.patch

Rebase the patch onto the latest code;
fix the unsupported-method problem.

> Use ByteBuffer in striping positional read
> --
>
> Key: HDFS-8901
> URL: https://issues.apache.org/jira/browse/HDFS-8901
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Kai Zheng
>Assignee: Kai Zheng
> Attachments: HDFS-8901-v2.patch, HDFS-8901-v3.patch, 
> HDFS-8901-v4.patch, HDFS-8901-v5.patch, HDFS-8901-v6.patch, 
> HDFS-8901-v7.patch, HDFS-8901-v8.patch, initial-poc.patch
>
>
> The native erasure coder prefers direct ByteBuffers for performance reasons. 
> To prepare for that, this change uses ByteBuffer throughout the code 
> implementing striped positional read. It also avoids unnecessary data copying 
> between striped-read chunk buffers and decode input buffers.






[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker

2016-05-04 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8449:

Attachment: HDFS-8449-009.patch

> Add tasks count metrics to datanode for ECWorker
> 
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, 
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch, 
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch, 
> HDFS-8449-008.patch, HDFS-8449-009.patch
>
>
> This sub-task tries to record the EC recovery tasks that a datanode has done, 
> including total tasks, failed tasks and successful tasks.






[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker

2016-05-03 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8449:

Attachment: HDFS-8449-008.patch

> Add tasks count metrics to datanode for ECWorker
> 
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, 
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch, 
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch, 
> HDFS-8449-008.patch
>
>
> This sub-task tries to record the EC recovery tasks that a datanode has done, 
> including total tasks, failed tasks and successful tasks.






[jira] [Commented] (HDFS-8449) Add tasks count metrics to datanode for ECWorker

2016-05-02 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15268124#comment-15268124
 ] 

Li Bo commented on HDFS-8449:
-

Update patch 007 to fix some minor problems.

> Add tasks count metrics to datanode for ECWorker
> 
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, 
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch, 
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch
>
>
> This sub-task tries to record the EC recovery tasks that a datanode has done, 
> including total tasks, failed tasks and successful tasks.






[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker

2016-05-02 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8449:

Attachment: HDFS-8449-007.patch

> Add tasks count metrics to datanode for ECWorker
> 
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, 
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch, 
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch
>
>
> This sub-task tries to record the EC recovery tasks that a datanode has done, 
> including total tasks, failed tasks and successful tasks.






[jira] [Commented] (HDFS-8449) Add tasks count metrics to datanode for ECWorker

2016-04-27 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261472#comment-15261472
 ] 

Li Bo commented on HDFS-8449:
-

The two failed unit tests are unrelated to this patch and will be solved in 
HDFS-10334. The checkstyle problems can also be ignored.
Hi [~drankye], could you help review the patch again? Thanks.

> Add tasks count metrics to datanode for ECWorker
> 
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, 
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch, 
> HDFS-8449-005.patch, HDFS-8449-006.patch
>
>
> This sub-task tries to record the EC recovery tasks that a datanode has done, 
> including total tasks, failed tasks and successful tasks.





[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker

2016-04-27 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8449:

Attachment: HDFS-8449-006.patch

> Add tasks count metrics to datanode for ECWorker
> 
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, 
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch, 
> HDFS-8449-005.patch, HDFS-8449-006.patch
>
>
> This sub-task tries to record the EC recovery tasks that a datanode has done, 
> including total tasks, failed tasks and successful tasks.





[jira] [Commented] (HDFS-10317) dfs.domain.socket.path is not set in TestShortCircuitLocalRead.testReadWithRemoteBlockReader

2016-04-21 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253201#comment-15253201
 ] 

Li Bo commented on HDFS-10317:
--

The problem occurs when I run the test from IntelliJ IDEA, but it does not 
happen with mvn test. I will check it further.
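For reference, a sketch of pinning the socket path explicitly before bringing 
up the cluster, so the test does not depend on test ordering or on the runner 
used (the socket file name here is arbitrary):
{code}
// Set dfs.domain.socket.path explicitly so short-circuit reads always have
// a valid socket path.
TemporarySocketDirectory sockDir = new TemporarySocketDirectory();
Configuration conf = new Configuration();
conf.set(DFSConfigKeys.DFS_DOMAIN_SOCKET_PATH_KEY,
    new File(sockDir.getDir(), "dn._PORT.sock").getAbsolutePath());
MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).build();
{code}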

> dfs.domain.socket.path is not set in 
> TestShortCircuitLocalRead.testReadWithRemoteBlockReader
> 
>
> Key: HDFS-10317
> URL: https://issues.apache.org/jira/browse/HDFS-10317
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Li Bo
>
> org.apache.hadoop.HadoopIllegalArgumentException: The short-circuit local 
> reads feature is enabled but dfs.domain.socket.path is not set.
>   at 
> org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory.<init>(DomainSocketFactory.java:115)
>   at org.apache.hadoop.hdfs.ClientContext.<init>(ClientContext.java:132)
>   at org.apache.hadoop.hdfs.ClientContext.get(ClientContext.java:157)
>   at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:358)
>   at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:275)
>   at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:266)
>   at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:258)
>   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.waitActive(MiniDFSCluster.java:2466)
>   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.waitActive(MiniDFSCluster.java:2512)
>   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:1632)
>   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:844)
>   at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:482)
>   at 
> org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:441)
>   at 
> org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadWithRemoteBlockReader(TestShortCircuitLocalRead.java:608)
>   at 
> org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.testReadWithRemoteBlockReader(TestShortCircuitLocalRead.java:590)





[jira] [Created] (HDFS-10317) dfs.domain.socket.path is not set in TestShortCircuitLocalRead.testReadWithRemoteBlockReader

2016-04-20 Thread Li Bo (JIRA)
Li Bo created HDFS-10317:


 Summary: dfs.domain.socket.path is not set in 
TestShortCircuitLocalRead.testReadWithRemoteBlockReader
 Key: HDFS-10317
 URL: https://issues.apache.org/jira/browse/HDFS-10317
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Li Bo


org.apache.hadoop.HadoopIllegalArgumentException: The short-circuit local reads 
feature is enabled but dfs.domain.socket.path is not set.
at 
org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory.<init>(DomainSocketFactory.java:115)
at org.apache.hadoop.hdfs.ClientContext.<init>(ClientContext.java:132)
at org.apache.hadoop.hdfs.ClientContext.get(ClientContext.java:157)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:358)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:275)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:266)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:258)
at 
org.apache.hadoop.hdfs.MiniDFSCluster.waitActive(MiniDFSCluster.java:2466)
at 
org.apache.hadoop.hdfs.MiniDFSCluster.waitActive(MiniDFSCluster.java:2512)
at 
org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:1632)
at 
org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:844)
at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:482)
at 
org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:441)
at 
org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadWithRemoteBlockReader(TestShortCircuitLocalRead.java:608)
at 
org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.testReadWithRemoteBlockReader(TestShortCircuitLocalRead.java:590)





[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker

2016-04-20 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8449:

Attachment: HDFS-8449-005.patch

Thanks for Kai's review.
There are several other EC-related metrics to add, so I think we can put the 
unit tests in a single file at first and consider moving them to another file 
at the end.

> Add tasks count metrics to datanode for ECWorker
> 
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, 
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch, 
> HDFS-8449-005.patch
>
>
> This sub-task tries to record the EC recovery tasks that a datanode has done, 
> including total tasks, failed tasks and successful tasks.





[jira] [Commented] (HDFS-10307) Fix a bug in TestShortCircuitLocalRead

2016-04-19 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247349#comment-15247349
 ] 

Li Bo commented on HDFS-10307:
--

Marking it as a duplicate.

> Fix a bug in TestShortCircuitLocalRead
> --
>
> Key: HDFS-10307
> URL: https://issues.apache.org/jira/browse/HDFS-10307
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-10307-001.patch
>
>
> Unit tests testLocalReadFallback, testLocalReadLegacy and testSmallFileLocalRead 
> in TestShortCircuitLocalRead throw the following exception:
> java.lang.IndexOutOfBoundsException: Requested more bytes than destination 
> buffer size
>   at 
> org.apache.hadoop.fs.FSInputStream.validatePositionedReadArgs(FSInputStream.java:107)
>   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:975)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at 
> org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.checkFileContent(TestShortCircuitLocalRead.java:157)
>   at 
> org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadImpl(TestShortCircuitLocalRead.java:286)
>   at 
> org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadLegacy(TestShortCircuitLocalRead.java:235)
>   at 
> org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.testLocalReadFallback(TestShortCircuitLocalRead.java:327)





[jira] [Updated] (HDFS-10307) Fix a bug in TestShortCircuitLocalRead

2016-04-19 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-10307:
-
Resolution: Duplicate
Status: Resolved  (was: Patch Available)

> Fix a bug in TestShortCircuitLocalRead
> --
>
> Key: HDFS-10307
> URL: https://issues.apache.org/jira/browse/HDFS-10307
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-10307-001.patch
>
>
> Unit tests testLocalReadFallback, testLocalReadLegacy and testSmallFileLocalRead 
> in TestShortCircuitLocalRead throw the following exception:
> java.lang.IndexOutOfBoundsException: Requested more bytes than destination 
> buffer size
>   at 
> org.apache.hadoop.fs.FSInputStream.validatePositionedReadArgs(FSInputStream.java:107)
>   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:975)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at 
> org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.checkFileContent(TestShortCircuitLocalRead.java:157)
>   at 
> org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadImpl(TestShortCircuitLocalRead.java:286)
>   at 
> org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadLegacy(TestShortCircuitLocalRead.java:235)
>   at 
> org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.testLocalReadFallback(TestShortCircuitLocalRead.java:327)





[jira] [Updated] (HDFS-10307) Fix a bug in TestShortCircuitLocalRead

2016-04-19 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-10307:
-
Status: Patch Available  (was: Open)

> Fix a bug in TestShortCircuitLocalRead
> --
>
> Key: HDFS-10307
> URL: https://issues.apache.org/jira/browse/HDFS-10307
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-10307-001.patch
>
>
> Unit tests testLocalReadFallback, testLocalReadLegacy and testSmallFileLocalRead 
> in TestShortCircuitLocalRead throw the following exception:
> java.lang.IndexOutOfBoundsException: Requested more bytes than destination 
> buffer size
>   at 
> org.apache.hadoop.fs.FSInputStream.validatePositionedReadArgs(FSInputStream.java:107)
>   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:975)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at 
> org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.checkFileContent(TestShortCircuitLocalRead.java:157)
>   at 
> org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadImpl(TestShortCircuitLocalRead.java:286)
>   at 
> org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadLegacy(TestShortCircuitLocalRead.java:235)
>   at 
> org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.testLocalReadFallback(TestShortCircuitLocalRead.java:327)





[jira] [Updated] (HDFS-10307) Fix a bug in TestShortCircuitLocalRead

2016-04-19 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-10307:
-
Attachment: HDFS-10307-001.patch

> Fix a bug in TestShortCircuitLocalRead
> --
>
> Key: HDFS-10307
> URL: https://issues.apache.org/jira/browse/HDFS-10307
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-10307-001.patch
>
>
> Unit tests testLocalReadFallback, testLocalReadLegacy and testSmallFileLocalRead 
> in TestShortCircuitLocalRead throw the following exception:
> java.lang.IndexOutOfBoundsException: Requested more bytes than destination 
> buffer size
>   at 
> org.apache.hadoop.fs.FSInputStream.validatePositionedReadArgs(FSInputStream.java:107)
>   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:975)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at 
> org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.checkFileContent(TestShortCircuitLocalRead.java:157)
>   at 
> org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadImpl(TestShortCircuitLocalRead.java:286)
>   at 
> org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadLegacy(TestShortCircuitLocalRead.java:235)
>   at 
> org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.testLocalReadFallback(TestShortCircuitLocalRead.java:327)





[jira] [Created] (HDFS-10307) Fix a bug in TestShortCircuitLocalRead

2016-04-19 Thread Li Bo (JIRA)
Li Bo created HDFS-10307:


 Summary: Fix a bug in TestShortCircuitLocalRead
 Key: HDFS-10307
 URL: https://issues.apache.org/jira/browse/HDFS-10307
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Li Bo
Assignee: Li Bo


Unit tests testLocalReadFallback, testLocalReadLegacy and testSmallFileLocalRead 
in TestShortCircuitLocalRead throw the following exception:

java.lang.IndexOutOfBoundsException: Requested more bytes than destination 
buffer size
at 
org.apache.hadoop.fs.FSInputStream.validatePositionedReadArgs(FSInputStream.java:107)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:975)
at java.io.DataInputStream.read(DataInputStream.java:149)
at 
org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.checkFileContent(TestShortCircuitLocalRead.java:157)
at 
org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadImpl(TestShortCircuitLocalRead.java:286)
at 
org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadLegacy(TestShortCircuitLocalRead.java:235)
at 
org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.testLocalReadFallback(TestShortCircuitLocalRead.java:327)
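For context, {{validatePositionedReadArgs}} enforces that the requested length 
fits into the destination buffer; a minimal illustration of the contract 
(variable names are illustrative):
{code}
// Positional read contract: offset + length must not exceed buffer.length.
byte[] buf = new byte[512];
in.read(0, buf, 0, 1024);       // IndexOutOfBoundsException: requested more
                                // bytes than destination buffer size
in.read(0, buf, 0, buf.length); // OK
{code}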








[jira] [Commented] (HDFS-8449) Add tasks count metrics to datanode for ECWorker

2016-04-18 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245211#comment-15245211
 ] 

Li Bo commented on HDFS-8449:
-

Hi [~drankye], could you help review the newly updated patch?
Thanks

> Add tasks count metrics to datanode for ECWorker
> 
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, 
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch
>
>
> This sub-task tries to record the EC recovery tasks that a datanode has done, 
> including total tasks, failed tasks and successful tasks.





[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker

2016-04-18 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8449:

Attachment: HDFS-8449-004.patch

> Add tasks count metrics to datanode for ECWorker
> 
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, 
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch
>
>
> This sub-task tries to record the EC recovery tasks that a datanode has done, 
> including total tasks, failed tasks and successful tasks.





[jira] [Commented] (HDFS-10258) Erasure Coding: support small cluster whose #DataNode < # (Blocks in a BlockGroup)

2016-04-14 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242287#comment-15242287
 ] 

Li Bo commented on HDFS-10258:
--

Thanks for Kai's idea. I will try to find a solution with the lowest cost.
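For a sense of the constraint, with the default RS-6-3 schema a block group has 
9 internal blocks; a small sketch of the arithmetic:
{code}
// With RS(6,3) each block group has 6 data + 3 parity = 9 internal blocks.
// A cluster with fewer than 9 datanodes must place more than one internal
// block on some nodes, which reduces the group's failure tolerance.
int dataBlocks = 6, parityBlocks = 3, dataNodes = 5;
int minBlocksPerNode = (dataBlocks + parityBlocks + dataNodes - 1) / dataNodes; // = 2
{code}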

> Erasure Coding: support small cluster whose #DataNode < # (Blocks in a 
> BlockGroup)
> --
>
> Key: HDFS-10258
> URL: https://issues.apache.org/jira/browse/HDFS-10258
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
>
> Currently EC does not support small clusters whose number of datanodes is 
> smaller than the number of blocks in a block group. This sub-task will solve 
> that problem.





[jira] [Updated] (HDFS-10258) Erasure Coding: support small cluster whose #DataNode < # (Blocks in a BlockGroup)

2016-04-05 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-10258:
-
Description: Currently EC does not support small clusters whose number of 
datanodes is smaller than the number of blocks in a block group. This sub-task 
will solve that problem.

> Erasure Coding: support small cluster whose #DataNode < # (Blocks in a 
> BlockGroup)
> --
>
> Key: HDFS-10258
> URL: https://issues.apache.org/jira/browse/HDFS-10258
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
>
> Currently EC does not support small clusters whose number of datanodes is 
> smaller than the number of blocks in a block group. This sub-task will solve 
> that problem.





[jira] [Created] (HDFS-10258) Erasure Coding: support small cluster whose #DataNode < # (Blocks in a BlockGroup)

2016-04-05 Thread Li Bo (JIRA)
Li Bo created HDFS-10258:


 Summary: Erasure Coding: support small cluster whose #DataNode < # 
(Blocks in a BlockGroup)
 Key: HDFS-10258
 URL: https://issues.apache.org/jira/browse/HDFS-10258
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Li Bo
Assignee: Li Bo








[jira] [Commented] (HDFS-9822) Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time

2016-03-08 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15186510#comment-15186510
 ] 

Li Bo commented on HDFS-9822:
-

Hi Rakesh,
after reading the code of {{UnderReplicatedBlocks}} I am still a little 
confused about how this error happens. Since the situation is difficult to 
reproduce, how about creating a unit test case that simulates the error 
situation?


> Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped 
> block at the same time
> 
>
> Key: HDFS-9822
> URL: https://issues.apache.org/jira/browse/HDFS-9822
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: erasure-coding
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Rakesh R
> Attachments: HDFS-9822-001.patch, HDFS-9822-002.patch
>
>
> Found the following AssertionError in 
> https://builds.apache.org/job/PreCommit-HDFS-Build/14501/testReport/org.apache.hadoop.hdfs.server.namenode/TestReconstructStripedBlocks/testMissingStripedBlockWithBusyNode2/
> {code}
> AssertionError: Should wait the previous reconstruction to finish
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.validateReconstructionWork(BlockManager.java:1680)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1536)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1472)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4229)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4100)
>   at java.lang.Thread.run(Thread.java:745)
>   at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:126)
>   at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:170)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4119)
>   at java.lang.Thread.run(Thread.java:745)
> {code}





[jira] [Updated] (HDFS-9826) Erasure Coding: Postpone the recovery work for a configurable time period

2016-03-08 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-9826:

Status: Patch Available  (was: Open)

>  Erasure Coding: Postpone the recovery work for a configurable time period
> --
>
> Key: HDFS-9826
> URL: https://issues.apache.org/jira/browse/HDFS-9826
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-9826-001.patch, HDFS-9826-002.patch
>
>
> Currently the NameNode prepares recovery as soon as it finds an 
> under-replicated block group. This is inefficient and takes resources away 
> from other operations. It would be better to postpone the recovery work for a 
> period of time if only one internal block is corrupted, considering points 
> shown by papers such as \[1\]\[2\]:
> 1. Transient errors in which no data are lost account for more than 90% of 
> data center failures, owing to network partitions, software problems, or 
> non-disk hardware faults.
> 2. Although erasure codes tolerate multiple simultaneous failures, single 
> failures represent 99.75% of recoveries.
> Different clusters may have different conditions, so we should allow users to 
> configure the postponement time. Proper configuration will avoid a large 
> proportion of unnecessary recoveries. When multiple internal blocks in a block 
> group are found corrupted, we prepare the recovery work immediately because 
> that case is very rare and we don't want to increase the risk of losing data.
> [1] Availability in globally distributed storage systems
> http://static.usenix.org/events/osdi10/tech/full_papers/Ford.pdf
> [2] Rethinking erasure codes for cloud file systems: minimizing I/O for 
> recovery and degraded reads
> http://static.usenix.org/events/fast/tech/full_papers/Khan.pdf





[jira] [Updated] (HDFS-9826) Erasure Coding: Postpone the recovery work for a configurable time period

2016-03-08 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-9826:

Attachment: HDFS-9826-002.patch

>  Erasure Coding: Postpone the recovery work for a configurable time period
> --
>
> Key: HDFS-9826
> URL: https://issues.apache.org/jira/browse/HDFS-9826
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-9826-001.patch, HDFS-9826-002.patch
>
>
> Currently the NameNode prepares recovery as soon as it finds an 
> under-replicated block group. This is inefficient and takes resources away 
> from other operations. It would be better to postpone the recovery work for a 
> period of time if only one internal block is corrupted, considering points 
> shown by papers such as \[1\]\[2\]:
> 1. Transient errors in which no data are lost account for more than 90% of 
> data center failures, owing to network partitions, software problems, or 
> non-disk hardware faults.
> 2. Although erasure codes tolerate multiple simultaneous failures, single 
> failures represent 99.75% of recoveries.
> Different clusters may have different conditions, so we should allow users to 
> configure the postponement time. Proper configuration will avoid a large 
> proportion of unnecessary recoveries. When multiple internal blocks in a block 
> group are found corrupted, we prepare the recovery work immediately because 
> that case is very rare and we don't want to increase the risk of losing data.
> [1] Availability in globally distributed storage systems
> http://static.usenix.org/events/osdi10/tech/full_papers/Ford.pdf
> [2] Rethinking erasure codes for cloud file systems: minimizing I/O for 
> recovery and degraded reads
> http://static.usenix.org/events/fast/tech/full_papers/Khan.pdf





[jira] [Updated] (HDFS-9826) Erasure Coding: Postpone the recovery work for a configurable time period

2016-02-29 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-9826:

Attachment: HDFS-9826-001.patch

Upload an initial patch without tests.

>  Erasure Coding: Postpone the recovery work for a configurable time period
> --
>
> Key: HDFS-9826
> URL: https://issues.apache.org/jira/browse/HDFS-9826
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-9826-001.patch
>
>
> Currently the NameNode prepares recovery as soon as it finds an 
> under-replicated block group. This is inefficient and takes resources away 
> from other operations. It would be better to postpone the recovery work for a 
> period of time if only one internal block is corrupted, considering points 
> shown by papers such as \[1\]\[2\]:
> 1. Transient errors in which no data are lost account for more than 90% of 
> data center failures, owing to network partitions, software problems, or 
> non-disk hardware faults.
> 2. Although erasure codes tolerate multiple simultaneous failures, single 
> failures represent 99.75% of recoveries.
> Different clusters may have different conditions, so we should allow users to 
> configure the postponement time. Proper configuration will avoid a large 
> proportion of unnecessary recoveries. When multiple internal blocks in a block 
> group are found corrupted, we prepare the recovery work immediately because 
> that case is very rare and we don't want to increase the risk of losing data.
> [1] Availability in globally distributed storage systems
> http://static.usenix.org/events/osdi10/tech/full_papers/Ford.pdf
> [2] Rethinking erasure codes for cloud file systems: minimizing I/O for 
> recovery and degraded reads
> http://static.usenix.org/events/fast/tech/full_papers/Khan.pdf





[jira] [Commented] (HDFS-9657) Schedule EC tasks at proper time to reduce the impact of recovery traffic

2016-02-18 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15152003#comment-15152003
 ] 

Li Bo commented on HDFS-9657:
-

Thanks for Zhe's comment.

Since 90% of errors are transient and involve no permanent data loss, 
postponing the recoveries may avoid a lot of unnecessary work. Configuring the 
task number by time segments is one solution. I have just created HDFS-9826, 
which delays the recoveries for a configured time period. I am thinking about 
how to combine the two together.


> Schedule EC tasks at proper time to reduce the impact of recovery traffic
> -
>
> Key: HDFS-9657
> URL: https://issues.apache.org/jira/browse/HDFS-9657
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-9657-001.patch, HDFS-9657-002.patch
>
>
> The EC recovery tasks consume a lot of network bandwidth and disk I/O. 
> Recovering a corrupt block requires transferring 6 blocks, hence creating a 
> 6x overhead in network bandwidth and disk I/O. When a datanode fails, 
> recovering all the blocks on that datanode may use up the network bandwidth. 
> We need to start each recovery task at a proper time in order to reduce its 
> impact on the system.





[jira] [Updated] (HDFS-9826) Erasure Coding: Postpone the recovery work for a configurable time period

2016-02-18 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-9826:

Description: 
Currently the NameNode prepares recovery as soon as it finds an under-replicated 
block group. This is inefficient and takes resources away from other operations. 
It would be better to postpone the recovery work for a period of time if only 
one internal block is corrupted, considering points shown by papers such as 
\[1\]\[2\]:
1. Transient errors in which no data are lost account for more than 90% of 
data center failures, owing to network partitions, software problems, or 
non-disk hardware faults.
2. Although erasure codes tolerate multiple simultaneous failures, single 
failures represent 99.75% of recoveries.

Different clusters may have different conditions, so we should allow users to 
configure the postponement time. Proper configuration will avoid a large 
proportion of unnecessary recoveries. When multiple internal blocks in a block 
group are found corrupted, we prepare the recovery work immediately because 
that case is very rare and we don't want to increase the risk of losing data.

[1] Availability in globally distributed storage systems
http://static.usenix.org/events/osdi10/tech/full_papers/Ford.pdf
[2] Rethinking erasure codes for cloud file systems: minimizing I/O for 
recovery and degraded reads
http://static.usenix.org/events/fast/tech/full_papers/Khan.pdf


  was:
Currently the NameNode prepares recovery as soon as it finds an under-replicated 
block group. This is inefficient and takes resources away from other operations. 
It would be better to postpone the recovery work for a period of time if only 
one internal block is corrupted, considering points shown by papers such as 
\[1\]\[2\]:
1. Transient errors in which no data are lost account for more than 90% of 
data center failures, owing to network partitions, software problems, or 
non-disk hardware faults.
2. Although erasure codes tolerate multiple simultaneous failures, single 
failures represent 99.75% of recoveries.

Different clusters may have different conditions, so we should allow users to 
configure the postponement time. Proper configuration will avoid a large 
proportion of unnecessary recoveries. When multiple internal blocks in a block 
group are found corrupted, we do the recovery work immediately because that 
case is very rare and we don't want to increase the risk of losing data.

[1] Availability in globally distributed storage systems
http://static.usenix.org/events/osdi10/tech/full_papers/Ford.pdf
[2] Rethinking erasure codes for cloud file systems: minimizing I/O for 
recovery and degraded reads
http://static.usenix.org/events/fast/tech/full_papers/Khan.pdf



>  Erasure Coding: Postpone the recovery work for a configurable time period
> --
>
> Key: HDFS-9826
> URL: https://issues.apache.org/jira/browse/HDFS-9826
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
>
> Currently the NameNode prepares recovery as soon as it finds an 
> under-replicated block group. This is inefficient and takes resources away 
> from other operations. It would be better to postpone the recovery work for a 
> period of time if only one internal block is corrupted, considering points 
> shown by papers such as \[1\]\[2\]:
> 1. Transient errors in which no data are lost account for more than 90% of 
> data center failures, owing to network partitions, software problems, or 
> non-disk hardware faults.
> 2. Although erasure codes tolerate multiple simultaneous failures, single 
> failures represent 99.75% of recoveries.
> Different clusters may have different conditions, so we should allow users to 
> configure the postponement time. Proper configuration will avoid a large 
> proportion of unnecessary recoveries. When multiple internal blocks in a block 
> group are found corrupted, we prepare the recovery work immediately because 
> that case is very rare and we don't want to increase the risk of losing data.
> [1] Availability in globally distributed storage systems
> http://static.usenix.org/events/osdi10/tech/full_papers/Ford.pdf
> [2] Rethinking erasure codes for cloud file systems: minimizing I/O for 
> recovery and degraded reads
> http://static.usenix.org/events/fast/tech/full_papers/Khan.pdf





[jira] [Created] (HDFS-9826) Erasure Coding: Postpone the recovery work for a configurable time period

2016-02-18 Thread Li Bo (JIRA)
Li Bo created HDFS-9826:
---

 Summary:  Erasure Coding: Postpone the recovery work for a 
configurable time period
 Key: HDFS-9826
 URL: https://issues.apache.org/jira/browse/HDFS-9826
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Li Bo
Assignee: Li Bo


Currently the NameNode prepares recovery as soon as it finds an under-replicated 
block group. This is inefficient and takes resources away from other operations. 
It would be better to postpone the recovery work for a period of time if only 
one internal block is corrupted, considering points shown by papers such as 
\[1\]\[2\]:
1. Transient errors in which no data are lost account for more than 90% of 
data center failures, owing to network partitions, software problems, or 
non-disk hardware faults.
2. Although erasure codes tolerate multiple simultaneous failures, single 
failures represent 99.75% of recoveries.

Different clusters may have different conditions, so we should allow users to 
configure the postponement time. Proper configuration will avoid a large 
proportion of unnecessary recoveries. When multiple internal blocks in a block 
group are found corrupted, we do the recovery work immediately because that 
case is very rare and we don't want to increase the risk of losing data.

[1] Availability in globally distributed storage systems
http://static.usenix.org/events/osdi10/tech/full_papers/Ford.pdf
[2] Rethinking erasure codes for cloud file systems: minimizing I/O for 
recovery and degraded reads
http://static.usenix.org/events/fast/tech/full_papers/Khan.pdf
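
For illustration, a minimal sketch of such a postponement policy; the class 
name and the decision logic below are assumptions drawn from this description, 
not part of any patch:
{code}
// Sketch only: illustrates the postponement policy described above.
// All names are hypothetical.
class PostponedRecoveryPolicy {
  private final long postponeMs; // configured postponement window

  PostponedRecoveryPolicy(long postponeMs) {
    this.postponeMs = postponeMs;
  }

  /**
   * @param corruptInternalBlocks missing/corrupt internal blocks in the group
   * @param firstDetectedMs       when the first corruption was detected
   * @return true if recovery should be scheduled now
   */
  boolean shouldRecoverNow(int corruptInternalBlocks, long firstDetectedMs) {
    if (corruptInternalBlocks > 1) {
      return true; // rare and risky: recover immediately
    }
    // single failure: most are transient, so wait out the configured window
    return System.currentTimeMillis() - firstDetectedMs >= postponeMs;
  }
}
{code}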




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9657) Schedule EC tasks at proper time to reduce the impact of recovery traffic

2016-02-02 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-9657:

Attachment: HDFS-9657-002.patch

The patch currently implements only one policy, i.e. 
{{ECRecoveryPolicyTimeSegment}}. I think it covers most situations; more 
policies can be added if needed.
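
For readers of this thread, a rough sketch of what a time-segment policy could 
look like; only the class name comes from the patch, the body below is an 
assumption:
{code}
// Hypothetical body for ECRecoveryPolicyTimeSegment: caps how many
// recovery tasks may start within each fixed time segment.
class ECRecoveryPolicyTimeSegment {
  private final long segmentMs;          // length of one time segment
  private final int maxTasksPerSegment;  // task budget per segment
  private long segmentStartMs;
  private int tasksInSegment;

  ECRecoveryPolicyTimeSegment(long segmentMs, int maxTasksPerSegment) {
    this.segmentMs = segmentMs;
    this.maxTasksPerSegment = maxTasksPerSegment;
  }

  /** Returns true if one more recovery task may start at nowMs. */
  synchronized boolean allowTask(long nowMs) {
    if (nowMs - segmentStartMs >= segmentMs) {
      segmentStartMs = nowMs; // roll over into a new segment
      tasksInSegment = 0;
    }
    if (tasksInSegment < maxTasksPerSegment) {
      tasksInSegment++;
      return true;
    }
    return false; // budget exhausted; defer to the next segment
  }
}
{code}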

> Schedule EC tasks at proper time to reduce the impact of recovery traffic
> -
>
> Key: HDFS-9657
> URL: https://issues.apache.org/jira/browse/HDFS-9657
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-9657-001.patch, HDFS-9657-002.patch
>
>
> EC recovery tasks consume a lot of network bandwidth and disk I/O. 
> Recovering a corrupt block requires transferring 6 blocks, hence creating a 
> 6X overhead in network bandwidth and disk I/O. When a datanode fails, 
> recovering all the blocks it held may use up the network bandwidth. We need 
> to start recovery tasks at a proper time to reduce their impact on the 
> system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9657) Schedule EC tasks at proper time to reduce the impact of recovery traffic

2016-02-02 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-9657:

Status: Patch Available  (was: Open)

> Schedule EC tasks at proper time to reduce the impact of recovery traffic
> -
>
> Key: HDFS-9657
> URL: https://issues.apache.org/jira/browse/HDFS-9657
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-9657-001.patch, HDFS-9657-002.patch
>
>
> EC recovery tasks consume a lot of network bandwidth and disk I/O. 
> Recovering a corrupt block requires transferring 6 blocks, hence creating a 
> 6X overhead in network bandwidth and disk I/O. When a datanode fails, 
> recovering all the blocks it held may use up the network bandwidth. We need 
> to start recovery tasks at a proper time to reduce their impact on the 
> system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9657) Schedule EC tasks at proper time to reduce the impact of recovery traffic

2016-01-27 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-9657:

Attachment: HDFS-9657-001.patch

> Schedule EC tasks at proper time to reduce the impact of recovery traffic
> -
>
> Key: HDFS-9657
> URL: https://issues.apache.org/jira/browse/HDFS-9657
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-9657-001.patch
>
>
> EC recovery tasks consume a lot of network bandwidth and disk I/O. 
> Recovering a corrupt block requires transferring 6 blocks, hence creating a 
> 6X overhead in network bandwidth and disk I/O. When a datanode fails, 
> recovering all the blocks it held may use up the network bandwidth. We need 
> to start recovery tasks at a proper time to reduce their impact on the 
> system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-9657) Schedule EC tasks at proper time to reduce the impact of recovery traffic

2016-01-18 Thread Li Bo (JIRA)
Li Bo created HDFS-9657:
---

 Summary: Schedule EC tasks at proper time to reduce the impact of 
recovery traffic
 Key: HDFS-9657
 URL: https://issues.apache.org/jira/browse/HDFS-9657
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Li Bo
Assignee: Li Bo


EC recovery tasks consume a lot of network bandwidth and disk I/O. Recovering 
a corrupt block requires transferring 6 blocks, hence creating a 6X overhead 
in network bandwidth and disk I/O. When a datanode fails, recovering all the 
blocks it held may use up the network bandwidth. We need to start recovery 
tasks at a proper time to reduce their impact on the system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9374) Inform user when a file with corrupted data blocks are read

2016-01-14 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-9374:

Attachment: HDFS-9374-001.patch

An alternative is to add the decode info to {{ReadStatistics}}.
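
Roughly, that alternative could look like this; the decode-related fields and 
methods below are hypothetical additions for illustration, not part of the 
existing {{ReadStatistics}} API:
{code}
// Hypothetical extension of ReadStatistics carrying decode info; the
// fields and methods below do not exist upstream.
class ReadStatisticsWithDecodeInfo {
  private long totalBytesRead;
  private long totalDecodedBytes; // bytes reconstructed from parity

  void addBytesRead(long n)    { totalBytesRead += n; }
  void addDecodedBytes(long n) { totalDecodedBytes += n; }

  /** True if any corrupt data block had to be decoded during the read. */
  boolean hitCorruptBlocks() { return totalDecodedBytes > 0; }

  long getTotalDecodedBytes() { return totalDecodedBytes; }
}
{code}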

> Inform user when a file with corrupted data blocks are read
> ---
>
> Key: HDFS-9374
> URL: https://issues.apache.org/jira/browse/HDFS-9374
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-9374-001.patch
>
>
> When reading a block group that has corrupt data blocks, it would be better 
> to tell the user which blocks are corrupt, so that the user knows the status 
> of the file being read. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (HDFS-7346) Erasure Coding: perform stripping erasure encoding work given block reader and writer

2016-01-13 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo reopened HDFS-7346:
-

> Erasure Coding: perform stripping erasure encoding work given block reader 
> and writer
> -
>
> Key: HDFS-7346
> URL: https://issues.apache.org/jira/browse/HDFS-7346
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode
>Reporter: Kai Zheng
>Assignee: Li Bo
>
> This assumes that facilities like the block reader and writer are ready; it 
> implements and performs erasure encoding work in the *striping* case, 
> utilizing the erasure codec and coder provided by the codec framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7346) Erasure Coding: perform stripping erasure encoding work given block reader and writer

2016-01-13 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-7346:

Release Note:   (was: This jira is very old; we're closing it because we won't 
handle it in the near future.)

> Erasure Coding: perform stripping erasure encoding work given block reader 
> and writer
> -
>
> Key: HDFS-7346
> URL: https://issues.apache.org/jira/browse/HDFS-7346
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode
>Reporter: Kai Zheng
>Assignee: Li Bo
>
> This assumes that facilities like the block reader and writer are ready; it 
> implements and performs erasure encoding work in the *striping* case, 
> utilizing the erasure codec and coder provided by the codec framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-7346) Erasure Coding: perform stripping erasure encoding work given block reader and writer

2016-01-12 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo resolved HDFS-7346.
-
  Resolution: Won't Fix
Release Note: This jira is very old; we're closing it because we won't handle 
it in the near future.

> Erasure Coding: perform stripping erasure encoding work given block reader 
> and writer
> -
>
> Key: HDFS-7346
> URL: https://issues.apache.org/jira/browse/HDFS-7346
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode
>Reporter: Kai Zheng
>Assignee: Li Bo
>
> This assumes that facilities like the block reader and writer are ready; it 
> implements and performs erasure encoding work in the *striping* case, 
> utilizing the erasure codec and coder provided by the codec framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-8889) Erasure Coding: cover more test situations of datanode failure during client writing

2016-01-12 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo resolved HDFS-8889.
-
  Resolution: Won't Fix
Release Note: There are already enough test cases for datanode failures. 

> Erasure Coding: cover more test situations of datanode failure during client 
> writing
> 
>
> Key: HDFS-8889
> URL: https://issues.apache.org/jira/browse/HDFS-8889
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8889-HDFS-7285-001.patch
>
>
> Currently 9 streamers work together for the client write. A small number of 
> failed datanodes (<= 3) in a block group should not affect the write. There 
> are many datanode failure cases, and we should cover as many of them as 
> possible in unit tests.
> Suppose streamer 4 fails; the following situations for the next block group 
> should be considered:
> 1) all streamers succeed
> 2) streamer 4 still fails
> 3) only streamer 1 fails
> 4) only streamer 8 fails (tests a parity streamer)
> 5) streamers 4 and 6 fail
> 6) streamers 1, 4, and 6 fail
> 7) streamers 1, 2, 4, and 6 fail
> 8) streamers 2 and 6 fail
> Suppose streamers 2 and 4 fail; the following situations for the next block 
> group should be considered:
> 1) only streamers 2 and 4 fail
> 2) streamers 2, 4, and 8 fail
> 3) only streamer 2 fails
> 4) streamers 3 and 8 fail
> For a single streamer, we should consider the following timings of datanode 
> failure:
> 1) before writing the first byte
> 2) before finishing writing the first cell
> 3) right after finishing writing the first cell
> 4) before writing the last byte of the block
> Other situations:
> 1) more than 3 streamers fail at the first block group
> 2) more than 3 streamers fail at the last block group
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8762) Erasure Coding: the log of each streamer should show its index

2016-01-11 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8762:

Resolution: Duplicate
Status: Resolved  (was: Patch Available)

Other jiras have added the necessary index.

> Erasure Coding: the log of each streamer should show its index
> --
>
> Key: HDFS-8762
> URL: https://issues.apache.org/jira/browse/HDFS-8762
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8762-HDFS-7285-001.patch, 
> HDFS-8762-HDFS-7285-002.patch
>
>
> The log messages in {{DataStreamer}} don't show which streamer they come 
> from. To make the log information more convenient for debugging, each log 
> message should include the index of the streamer that generated it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Work stopped] (HDFS-8171) Extend BlockSender to support multiple block data source

2016-01-11 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-8171 stopped by Li Bo.
---
> Extend BlockSender to support multiple block data source
> 
>
> Key: HDFS-8171
> URL: https://issues.apache.org/jira/browse/HDFS-8171
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode
>Reporter: Li Bo
>Assignee: Li Bo
>
> Currently BlockSender reads a block from the disk and sends it to a remote 
> datanode. In EC encode/decode work, new blocks are generated by calculation. 
> To store these blocks on remote datanodes, we can have BlockSender read its 
> data from the output of the encode/decode calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9373) Show friendly information to user when client succeeds the writing with some failed streamers

2015-12-16 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-9373:

Attachment: HDFS-9373-003.patch

Sorry for the checkstyle error. Patch 003 fixes it.

> Show friendly information to user when client succeeds the writing with some 
> failed streamers
> -
>
> Key: HDFS-9373
> URL: https://issues.apache.org/jira/browse/HDFS-9373
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: erasure-coding
>Affects Versions: 3.0.0
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-9373-001.patch, HDFS-9373-002.patch, 
> HDFS-9373-003.patch
>
>
> When no more than PARITY_NUM streamers fail for a block group, the client 
> may still succeed in writing the data. But several exceptions are thrown to 
> the user, who then has to check the reasons. The friendlier way is to simply 
> inform the user that some streamers failed while writing a block group. It's 
> not necessary to show the details of the exceptions, because a small number 
> of streamer failures is not fatal to the client write.
> When only DATA_NUM streamers succeed, the block group is at high risk, 
> because the corruption of any one block will lose the data of all six 
> blocks. We should give the user an obvious warning when this occurs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9529) Extend Erasure Code to support POWER Chip acceleration

2015-12-08 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047924#comment-15047924
 ] 

Li Bo commented on HDFS-9529:
-

Hi, wqijun
HADOOP-11887 is under "Hadoop Common" and this JIRA tries to solve a similar 
problem, so I think it would be better to move it to "Hadoop Common" as well.

> Extend Erasure Code to support POWER Chip acceleration
> --
>
> Key: HDFS-9529
> URL: https://issues.apache.org/jira/browse/HDFS-9529
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: erasure-coding
>Affects Versions: 3.0.0
>Reporter: wqijun
>Assignee: wqijun
> Fix For: 3.0.0
>
>
> Erasure coding is a very important feature in the new HDFS version. This 
> JIRA will focus on how to extend EC to support multiple types of EC 
> acceleration via C libraries and other hardware methods, like GPU or FPGA. 
> Compared with HADOOP-11887, this JIRA focuses more on leveraging POWER chip 
> capability to accelerate EC calculation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9373) Show friendly information to user when client succeeds the writing with some failed streamers

2015-12-07 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-9373:

Attachment: HDFS-9373-002.patch

Thanks, Zhe and Daniel, for the review. I've just updated the patch against 
the newest trunk code.
The failed block IDs can be obtained from other log information, so we just 
need to tell the user which block groups have corrupt blocks.
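
As a sketch of the intended user-facing behaviour (all names below are 
illustrative, not the patch itself): per-streamer exceptions stay at debug 
level, and a single summary warning is surfaced per block group.
{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative only: summarizes streamer failures instead of surfacing
// every exception to the user.
class StreamerFailureReporter {
  private static final Logger LOG =
      LoggerFactory.getLogger(StreamerFailureReporter.class);

  static void report(long blockGroupId, int failedStreamers, int parityNum) {
    if (failedStreamers == 0 || failedStreamers > parityNum) {
      return; // full success, or a hard failure handled elsewhere
    }
    LOG.warn("{} streamer(s) failed while writing block group {}; the write "
        + "succeeded and the data is still recoverable.",
        failedStreamers, blockGroupId);
    if (failedStreamers == parityNum) {
      // only DATA_NUM streamers succeeded: no redundancy is left
      LOG.warn("Block group {} has no parity redundancy; corruption of any "
          + "single block will lose data.", blockGroupId);
    }
  }
}
{code}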


> Show friendly information to user when client succeeds the writing with some 
> failed streamers
> -
>
> Key: HDFS-9373
> URL: https://issues.apache.org/jira/browse/HDFS-9373
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: erasure-coding
>Affects Versions: 3.0.0
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-9373-001.patch, HDFS-9373-002.patch
>
>
> When no more than PARITY_NUM streamers fail for a block group, the client 
> may still succeed in writing the data. But several exceptions are thrown to 
> the user, who then has to check the reasons. The friendlier way is to simply 
> inform the user that some streamers failed while writing a block group. It's 
> not necessary to show the details of the exceptions, because a small number 
> of streamer failures is not fatal to the client write.
> When only DATA_NUM streamers succeed, the block group is at high risk, 
> because the corruption of any one block will lose the data of all six 
> blocks. We should give the user an obvious warning when this occurs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails

2015-11-23 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo resolved HDFS-8704.
-
Resolution: Cannot Reproduce

> Erasure Coding: client fails to write large file when one datanode fails
> 
>
> Key: HDFS-8704
> URL: https://issues.apache.org/jira/browse/HDFS-8704
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, 
> HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, 
> HDFS-8704-HDFS-7285-005.patch, HDFS-8704-HDFS-7285-006.patch, 
> HDFS-8704-HDFS-7285-007.patch, HDFS-8704-HDFS-7285-008.patch
>
>
> I tested the current code on a 5-node cluster using RS(3,2). When a 
> datanode is corrupt, the client succeeds in writing a file smaller than a 
> block group but fails to write a larger one. 
> {{TestDFSStripeOutputStreamWithFailure}} only tests files smaller than a 
> block group; this jira will add more test situations.
> A streamer may encounter bad datanodes when writing the blocks allocated to 
> it. When it fails to connect to a datanode or to send a packet, the streamer 
> needs to prepare for the next block. First it removes the packets of the 
> current block from its data queue. If the first packet of the next block is 
> already in the data queue, the streamer resets its state and starts waiting 
> for the next block allocated to it; otherwise it just waits for the first 
> packet of the next block. While waiting, the streamer periodically checks 
> whether it has been asked to terminate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9382) Track the acks for the packets which are sent from ErasureCodingWorker as part of reconstruction work

2015-11-16 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15008062#comment-15008062
 ] 

Li Bo commented on HDFS-9382:
-

Unifying the readers and writers is very complicated because their details 
differ, so I have postponed that work. Maybe we can consider this optimization 
after most of the EC functionality is in place.

> Track the acks for the packets which are sent from ErasureCodingWorker as 
> part of reconstruction work
> -
>
> Key: HDFS-9382
> URL: https://issues.apache.org/jira/browse/HDFS-9382
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode
>Affects Versions: 3.0.0
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
>
> Currently we are not tracking the acks for the packets sent from the DN 
> ECWorker as part of reconstruction work. This jira proposes to track the 
> acks, since reconstruction work is really expensive and we should know if 
> any packets failed to be written at the target DN. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9373) Show friendly information to user when client succeeds the writing with some failed streamers

2015-11-11 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-9373:

Attachment: HDFS-9373-001.patch

> Show friendly information to user when client succeeds the writing with some 
> failed streamers
> -
>
> Key: HDFS-9373
> URL: https://issues.apache.org/jira/browse/HDFS-9373
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-9373-001.patch
>
>
> When no more than PARITY_NUM streamers fail for a block group, the client 
> may still succeed in writing the data. But several exceptions are thrown to 
> the user, who then has to check the reasons. The friendlier way is to simply 
> inform the user that some streamers failed while writing a block group. It's 
> not necessary to show the details of the exceptions, because a small number 
> of streamer failures is not fatal to the client write.
> When only DATA_NUM streamers succeed, the block group is at high risk, 
> because the corruption of any one block will lose the data of all six 
> blocks. We should give the user an obvious warning when this occurs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7781) Use block erasure coder in client stripping

2015-11-11 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-7781:

Parent Issue: HDFS-8030  (was: HDFS-8031)

> Use block erasure coder in client stripping
> ---
>
> Key: HDFS-7781
> URL: https://issues.apache.org/jira/browse/HDFS-7781
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Fix For: HDFS-7285
>
>
> In HDFS-7729, the raw erasure coder is used so as not to depend on the 
> {{ErasureCoder}} API defined in HDFS-7662, or even the {{ErasureCodec}} API 
> defined in HDFS-7337, since they're still upcoming.
> This is a follow-up issue to work on that once the high-level constructs are 
> available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (HDFS-7781) Use block erasure coder in client stripping

2015-11-11 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo reopened HDFS-7781:
-

> Use block erasure coder in client stripping
> ---
>
> Key: HDFS-7781
> URL: https://issues.apache.org/jira/browse/HDFS-7781
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Fix For: HDFS-7285
>
>
> In HDFS-7729, the raw erasure coder is used so as not to depend on the 
> {{ErasureCoder}} API defined in HDFS-7662, or even the {{ErasureCodec}} API 
> defined in HDFS-7337, since they're still upcoming.
> This is a follow-up issue to work on that once the high-level constructs are 
> available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9373) Show friendly information to user when client succeeds the writing with some failed streamers

2015-11-08 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-9373:

Description: 
When no more than PARITY_NUM streamers fail for a block group, the client may 
still succeed in writing the data. But several exceptions are thrown to the 
user, who then has to check the reasons. The friendlier way is to simply inform 
the user that some streamers failed while writing a block group. It's not 
necessary to show the details of the exceptions, because a small number of 
streamer failures is not fatal to the client write.

When only DATA_NUM streamers succeed, the block group is at high risk, because 
the corruption of any one block will lose the data of all six blocks. We should 
give the user an obvious warning when this occurs. 

  was:When no more than PARITY_NUM streamers fail for a block group, the 
client may still succeed in writing the data. But several exceptions are thrown 
to the user, who then has to check the reasons. The friendlier way is to simply 
inform the user that some streamers failed while writing a block group. It's 
not necessary to show the details of the exceptions, because a small number of 
streamer failures is not fatal to the client write.


> Show friendly information to user when client succeeds the writing with some 
> failed streamers
> -
>
> Key: HDFS-9373
> URL: https://issues.apache.org/jira/browse/HDFS-9373
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
>
> When no more than PARITY_NUM streamers fail for a block group, the client 
> may still succeed in writing the data. But several exceptions are thrown to 
> the user, who then has to check the reasons. The friendlier way is to simply 
> inform the user that some streamers failed while writing a block group. It's 
> not necessary to show the details of the exceptions, because a small number 
> of streamer failures is not fatal to the client write.
> When only DATA_NUM streamers succeed, the block group is at high risk, 
> because the corruption of any one block will lose the data of all six 
> blocks. We should give the user an obvious warning when this occurs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-9373) Show friendly information to user when client succeeds the writing with some failed streamers

2015-11-04 Thread Li Bo (JIRA)
Li Bo created HDFS-9373:
---

 Summary: Show friendly information to user when client succeeds 
the writing with some failed streamers
 Key: HDFS-9373
 URL: https://issues.apache.org/jira/browse/HDFS-9373
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Li Bo
Assignee: Li Bo


When no more than PARITY_NUM streamers fail for a block group, the client may 
still succeed in writing the data. But several exceptions are thrown to the 
user, who then has to check the reasons. The friendlier way is to simply inform 
the user that some streamers failed while writing a block group. It's not 
necessary to show the details of the exceptions, because a small number of 
streamer failures is not fatal to the client write.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-9374) Inform user when a file with corrupted data blocks are read

2015-11-04 Thread Li Bo (JIRA)
Li Bo created HDFS-9374:
---

 Summary: Inform user when a file with corrupted data blocks are 
read
 Key: HDFS-9374
 URL: https://issues.apache.org/jira/browse/HDFS-9374
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Li Bo
Assignee: Li Bo


When reading a block group that has corrupt data blocks, it would be better to 
tell the user which blocks are corrupt, so that the user knows the status of 
the file being read. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8410) Add computation time metrics to datanode for ECWorker

2015-10-27 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8410:

Attachment: HDFS-8410-002.patch

> Add computation time metrics to datanode for ECWorker
> -
>
> Key: HDFS-8410
> URL: https://issues.apache.org/jira/browse/HDFS-8410
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8410-001.patch, HDFS-8410-002.patch
>
>
> This is a sub-task of HDFS-7674. It adds time metrics for EC decode work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8410) Add computation time metrics to datanode for ECWorker

2015-10-27 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8410:

Summary: Add computation time metrics to datanode for ECWorker  (was: Add 
time count metrics to datanode for ECWorker)

> Add computation time metrics to datanode for ECWorker
> -
>
> Key: HDFS-8410
> URL: https://issues.apache.org/jira/browse/HDFS-8410
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8410-001.patch
>
>
> This is a sub-task of HDFS-7674. It adds time metrics for EC decode work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8410) Add computation time metrics to datanode for ECWorker

2015-10-27 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976021#comment-14976021
 ] 

Li Bo commented on HDFS-8410:
-

Patch 002 reduces the number of metrics from 3 to 2. The time metrics let the 
user know how much time each datanode spends on encoding/decoding work.
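
For context, datanode metrics of this kind are typically exposed through 
Hadoop's metrics2 library; a minimal sketch, where the metric names are 
illustrative and not necessarily those used in the patch:
{code}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.MutableRate;
import org.apache.hadoop.util.Time;

// Sketch of a metrics2 source; it must be registered with
// DefaultMetricsSystem before the @Metric fields are populated.
@Metrics(about = "ECWorker computation metrics", context = "dfs")
class ECWorkerTimeMetrics {
  @Metric("Time spent in EC decoding") MutableRate decodeTime;
  @Metric("Time spent in EC encoding") MutableRate encodeTime;

  void timeDecode(Runnable decodeWork) {
    long start = Time.monotonicNow();
    decodeWork.run();                            // the actual decode
    decodeTime.add(Time.monotonicNow() - start); // record elapsed ms
  }
}
{code}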

> Add computation time metrics to datanode for ECWorker
> -
>
> Key: HDFS-8410
> URL: https://issues.apache.org/jira/browse/HDFS-8410
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8410-001.patch, HDFS-8410-002.patch
>
>
> This is a sub-task of HDFS-7674. It adds time metrics for EC decode work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8529) Add blocks count metrics to datanode for ECWorker

2015-10-27 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8529:

Attachment: HDFS-8529-002.patch

> Add blocks count metrics to datanode for ECWorker
> -
>
> Key: HDFS-8529
> URL: https://issues.apache.org/jira/browse/HDFS-8529
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8529-001.patch, HDFS-8529-002.patch
>
>
> This sub-task will add block count metrics to the datanodes that take on 
> encoding and recovery tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-7653) Block Readers and Writers used in both client side and datanode side

2015-10-21 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo resolved HDFS-7653.
-
  Resolution: Won't Fix
Release Note: The design and implementation have changed a lot since this 
issue was created. Due to the complexity of, and differences between, the 
client-side and datanode-side read/write paths, we'll not solve this issue at 
the current stage.  

> Block Readers and Writers used in both client side and datanode side
> 
>
> Key: HDFS-7653
> URL: https://issues.apache.org/jira/browse/HDFS-7653
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: BlockReadersWriters.patch
>
>
> There are a lot of block read/write operations in HDFS-EC. For example, 
> when a client writes a file in striping layout, it has to write several 
> blocks to several different datanodes; when a datanode performs an 
> encoding/decoding task, it has to read several blocks from itself and other 
> datanodes, and write one or more blocks to itself or other datanodes.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-7679) Erasure Coding: unifying common constructs like coding work, block reader and block writer across client and DataNode

2015-10-21 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo resolved HDFS-7679.
-
Resolution: Won't Fix

> Erasure Coding: unifying common constructs like coding work, block reader and 
> block writer across client and DataNode
> -
>
> Key: HDFS-7679
> URL: https://issues.apache.org/jira/browse/HDFS-7679
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: ECEncodeDecodeFramework.patch
>
>
> Based on the work done, we will have similar constructs like coding work 
> and local/remote block readers/writers on both the client and DataNode 
> sides, so it's possible to refactor the code further and unify these 
> constructs to eliminate possible duplication.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-7781) Use block erasure coder in client stripping

2015-10-21 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo resolved HDFS-7781.
-
Resolution: Won't Fix

> Use block erasure coder in client stripping
> ---
>
> Key: HDFS-7781
> URL: https://issues.apache.org/jira/browse/HDFS-7781
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Fix For: HDFS-7285
>
>
> In HDFS-7729, the raw erasure coder is used so as not to depend on the 
> {{ErasureCoder}} API defined in HDFS-7662, or even the {{ErasureCodec}} API 
> defined in HDFS-7337, since they're still upcoming.
> This is a follow-up issue to work on that once the high-level constructs are 
> available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8411) Add bytes count metrics to datanode for ECWorker

2015-10-21 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8411:

Attachment: HDFS-8411-003.patch

> Add bytes count metrics to datanode for ECWorker
> 
>
> Key: HDFS-8411
> URL: https://issues.apache.org/jira/browse/HDFS-8411
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8411-001.patch, HDFS-8411-002.patch, 
> HDFS-8411-003.patch
>
>
> This is a sub-task of HDFS-7674. It counts the bytes read locally or from 
> remote datanodes to feed decoding work, and also the bytes written to local 
> or remote datanodes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8411) Add bytes count metrics to datanode for ECWorker

2015-10-15 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8411:

Attachment: HDFS-8411-002.patch

When some datanodes are corrupted, all their blocks have to be reconstructed 
by other healthy datanodes. The network traffic incurred is very high, and we 
may want to track it, so we can record the bytes read and written by each 
datanode. In fact, I think HDFS-8529 (block counts) and HDFS-8410 (time 
consumed) are not strictly necessary: we can estimate the time cost from the 
bytes read and written, and a block count metric is not very meaningful when 
there are a lot of small files. We can adjust the metrics as future 
requirements emerge.

> Add bytes count metrics to datanode for ECWorker
> 
>
> Key: HDFS-8411
> URL: https://issues.apache.org/jira/browse/HDFS-8411
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8411-001.patch, HDFS-8411-002.patch
>
>
> This is a sub-task of HDFS-7674. It counts the bytes read locally or from 
> remote datanodes to feed decoding work, and also the bytes written to local 
> or remote datanodes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8449) Add tasks count metrics to datanode for ECWorker

2015-10-13 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954670#comment-14954670
 ] 

Li Bo commented on HDFS-8449:
-

Thanks, [~rakeshr], for the review. The failed tests seem unrelated to this 
patch. Hi [~jingzhao], could you help review the patch and commit it to trunk 
if it's OK? Then we can work on the other metrics jiras. Thanks.

> Add tasks count metrics to datanode for ECWorker
> 
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, 
> HDFS-8449-002.patch, HDFS-8449-003.patch
>
>
> This sub-task tries to record the EC recovery tasks a datanode has done, 
> including total, failed, and successful tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails

2015-10-09 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8704:

Status: In Progress  (was: Patch Available)

> Erasure Coding: client fails to write large file when one datanode fails
> 
>
> Key: HDFS-8704
> URL: https://issues.apache.org/jira/browse/HDFS-8704
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, 
> HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, 
> HDFS-8704-HDFS-7285-005.patch, HDFS-8704-HDFS-7285-006.patch, 
> HDFS-8704-HDFS-7285-007.patch, HDFS-8704-HDFS-7285-008.patch
>
>
> I tested the current code on a 5-node cluster using RS(3,2). When a 
> datanode is corrupt, the client succeeds in writing a file smaller than a 
> block group but fails to write a larger one. 
> {{TestDFSStripeOutputStreamWithFailure}} only tests files smaller than a 
> block group; this jira will add more test situations.
> A streamer may encounter bad datanodes when writing the blocks allocated to 
> it. When it fails to connect to a datanode or to send a packet, the streamer 
> needs to prepare for the next block. First it removes the packets of the 
> current block from its data queue. If the first packet of the next block is 
> already in the data queue, the streamer resets its state and starts waiting 
> for the next block allocated to it; otherwise it just waits for the first 
> packet of the next block. While waiting, the streamer periodically checks 
> whether it has been asked to terminate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails

2015-10-08 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949702#comment-14949702
 ] 

Li Bo commented on HDFS-8704:
-

Hi, Zhe
I will check and test this case further. If there's no problem, I will close 
it. Thanks.

> Erasure Coding: client fails to write large file when one datanode fails
> 
>
> Key: HDFS-8704
> URL: https://issues.apache.org/jira/browse/HDFS-8704
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, 
> HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, 
> HDFS-8704-HDFS-7285-005.patch, HDFS-8704-HDFS-7285-006.patch, 
> HDFS-8704-HDFS-7285-007.patch, HDFS-8704-HDFS-7285-008.patch
>
>
> I tested the current code on a 5-node cluster using RS(3,2). When a 
> datanode is corrupt, the client succeeds in writing a file smaller than a 
> block group but fails to write a larger one. 
> {{TestDFSStripeOutputStreamWithFailure}} only tests files smaller than a 
> block group; this jira will add more test situations.
> A streamer may encounter bad datanodes when writing the blocks allocated to 
> it. When it fails to connect to a datanode or to send a packet, the streamer 
> needs to prepare for the next block. First it removes the packets of the 
> current block from its data queue. If the first packet of the next block is 
> already in the data queue, the streamer resets its state and starts waiting 
> for the next block allocated to it; otherwise it just waits for the first 
> packet of the next block. While waiting, the streamer periodically checks 
> whether it has been asked to terminate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker

2015-09-29 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8449:

Attachment: HDFS-8449-002.patch

Reduced the number of metrics from 3 to 2; the count of failed tasks can be 
calculated from the total count and the successful count.
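
In other words (a trivial sketch with illustrative names):
{code}
import java.util.concurrent.atomic.AtomicLong;

// Only two counters are kept; the failed count is derived.
class EcTaskCounters {
  private final AtomicLong totalTasks = new AtomicLong();
  private final AtomicLong successfulTasks = new AtomicLong();

  void taskStarted()   { totalTasks.incrementAndGet(); }
  void taskSucceeded() { successfulTasks.incrementAndGet(); }

  /** Failed tasks need no counter of their own. */
  long getFailedTasks() {
    return totalTasks.get() - successfulTasks.get();
  }
}
{code}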

> Add tasks count metrics to datanode for ECWorker
> 
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, 
> HDFS-8449-002.patch
>
>
> This sub-task tries to record the EC recovery tasks a datanode has done, 
> including total, failed, and successful tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker

2015-09-29 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8449:

Attachment: HDFS-8449-003.patch

Thanks, Rakesh, for the careful review. Uploaded a new patch to fix the problems.

> Add tasks count metrics to datanode for ECWorker
> 
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, 
> HDFS-8449-002.patch, HDFS-8449-003.patch
>
>
> This sub-task tries to record the EC recovery tasks a datanode has done, 
> including total, failed, and successful tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9040) Erasure coding: Refactor DFSStripedOutputStream (Move Namenode RPC Requests to Coordinator)

2015-09-22 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902207#comment-14902207
 ] 

Li Bo commented on HDFS-9040:
-

I have just run a test with and without HDFS-9040-HDFS-7285.003.patch. The 
results are as follows:
10G:  2m15s vs 2m12s
30G:  6m28s vs 6m29s
50G: 10m38s vs 10m48s
We can see that for continuous writes, synchronizing the streamers at the 
outputstream level does not hurt performance.


> Erasure coding: Refactor DFSStripedOutputStream (Move Namenode RPC Requests 
> to Coordinator)
> ---
>
> Key: HDFS-9040
> URL: https://issues.apache.org/jira/browse/HDFS-9040
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Walter Su
>Assignee: Jing Zhao
> Attachments: HDFS-9040-HDFS-7285.002.patch, 
> HDFS-9040-HDFS-7285.003.patch, HDFS-9040.00.patch, HDFS-9040.001.wip.patch, 
> HDFS-9040.02.bgstreamer.patch
>
>
> The general idea is to simplify error handling logic.
> Proposal 1:
> A BlockGroupDataStreamer to communicate with NN to allocate/update block, and 
> StripedDataStreamer s only have to stream blocks to DNs.
> Proposal 2:
> See below the 
> [comment|https://issues.apache.org/jira/browse/HDFS-9040?focusedCommentId=14741388=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14741388]
>  from [~jingzhao].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9040) Erasure coding: Refactor DFSStripedOutputStream (Move Namenode RPC Requests to Coordinator)

2015-09-17 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14805014#comment-14805014
 ] 

Li Bo commented on HDFS-9040:
-

Thanks for the deep discussion here. One point confused me:
bq. Only based on internal block lengths we cannot identify the failure
For a block not belonging to the last block group, if its length is smaller 
than BLOCK_SIZE we can conclude it's corrupt; otherwise it's a good block, 
right? For the last block group, we can calculate the length of each block 
from the file length, so if a block doesn't have the required length, we can 
conclude it's corrupt. The precondition is that if <= NUM_PARITY streamers 
fail, we ignore their failures and treat all blocks of the block group as 
written correctly.

Each block in a block group has only one replica, and if we can judge whether 
a block is corrupt, it may not be necessary to bump the GS.

Any other points that require bumping the GS?
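
To make the length argument concrete, here is a simplified sketch of the 
expected internal block length for the last block group; it mirrors the 
round-robin cell layout (cf. {{StripedBlockUtil}}) but is not the trunk 
implementation. A block shorter than this expected length would indicate a 
failure:
{code}
// Simplified expected-length computation for internal block i of a
// block group holding dataSize bytes (illustrative, not trunk code).
final class ExpectedBlockLength {
  static long expectedLength(long dataSize, int cellSize,
      int numDataBlocks, int i) {
    final long stripeSize = (long) cellSize * numDataBlocks;
    final long fullStripes = dataSize / stripeSize;
    final long rem = dataSize % stripeSize;
    if (i >= numDataBlocks) {
      i = 0; // parity blocks are as long as data block 0
    }
    // cells are laid out round-robin, so block i gets at most one
    // (possibly partial) cell from the last, incomplete stripe
    final long lastCell =
        Math.min(Math.max(rem - (long) i * cellSize, 0L), cellSize);
    return fullStripes * cellSize + lastCell;
  }
}
{code}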

> Erasure coding: Refactor DFSStripedOutputStream (Move Namenode RPC Requests 
> to Coordinator)
> ---
>
> Key: HDFS-9040
> URL: https://issues.apache.org/jira/browse/HDFS-9040
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Walter Su
> Attachments: HDFS-9040-HDFS-7285.002.patch, 
> HDFS-9040-HDFS-7285.003.patch, HDFS-9040.00.patch, HDFS-9040.001.wip.patch, 
> HDFS-9040.02.bgstreamer.patch
>
>
> The general idea is to simplify error handling logic.
> Proposal 1:
> A BlockGroupDataStreamer to communicate with NN to allocate/update block, and 
> StripedDataStreamer s only have to stream blocks to DNs.
> Proposal 2:
> See below the 
> [comment|https://issues.apache.org/jira/browse/HDFS-9040?focusedCommentId=14741388=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14741388]
>  from [~jingzhao].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9040) Erasure coding: A BlockGroupDataStreamer to rule all internal blocks streamers

2015-09-14 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743182#comment-14743182
 ] 

Li Bo commented on HDFS-9040:
-

Thanks for Walter's and Jing's work.
In Jing's patch, {{allocateNewBlock}} is called in {{writeChunk()}}, which 
makes all the packets in a streamer's {{dataQueue}} belong to the same block. 
I think this is a good way to reduce the logic complexity. The fastest 
streamer already has to wait for the other streamers before requesting the 
following block group from the NN, so I don't think the writing speed will 
noticeably drop.


> Erasure coding: A BlockGroupDataStreamer to rule all internal blocks streamers
> --
>
> Key: HDFS-9040
> URL: https://issues.apache.org/jira/browse/HDFS-9040
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Walter Su
>Assignee: Walter Su
> Attachments: HDFS-9040.00.patch, HDFS-9040.001.wip.patch
>
>
> A {{BlockGroupDataStreamer}} to communicate with NN to allocate/update block, 
> and {{StripedDataStreamer}} s only have to stream blocks to DNs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9040) Erasure coding: A BlockGroupDataStreamer to rule all internal blocks streamers

2015-09-09 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14738076#comment-14738076
 ] 

Li Bo commented on HDFS-9040:
-

Implementing a block group writer is a better solution, but it requires 
copying a lot of existing code into the new class, which may be hard for the 
community to accept. At the beginning I also tried this way, but gave it up 
after discussion. The current design and implementation change the existing 
code as little as possible.

> Erasure coding: A BlockGroupDataStreamer to rule all internal blocks streamers
> --
>
> Key: HDFS-9040
> URL: https://issues.apache.org/jira/browse/HDFS-9040
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Walter Su
>Assignee: Walter Su
>
> A {{BlockGroupDataStreamer}} to communicate with NN to allocate/update block, 
> and {{StripedDataStreamer}} s only have to stream blocks to DNs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails

2015-09-08 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8704:

Attachment: HDFS-8704-HDFS-7285-008.patch

> Erasure Coding: client fails to write large file when one datanode fails
> 
>
> Key: HDFS-8704
> URL: https://issues.apache.org/jira/browse/HDFS-8704
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, 
> HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, 
> HDFS-8704-HDFS-7285-005.patch, HDFS-8704-HDFS-7285-006.patch, 
> HDFS-8704-HDFS-7285-007.patch, HDFS-8704-HDFS-7285-008.patch
>
>
> I tested the current code on a 5-node cluster using RS(3,2). When a 
> datanode is corrupt, the client succeeds in writing a file smaller than a 
> block group but fails to write a larger one. 
> {{TestDFSStripeOutputStreamWithFailure}} only tests files smaller than a 
> block group; this jira will add more test situations.
> A streamer may encounter bad datanodes when writing the blocks allocated to 
> it. When it fails to connect to a datanode or to send a packet, the streamer 
> needs to prepare for the next block. First it removes the packets of the 
> current block from its data queue. If the first packet of the next block is 
> already in the data queue, the streamer resets its state and starts waiting 
> for the next block allocated to it; otherwise it just waits for the first 
> packet of the next block. While waiting, the streamer periodically checks 
> whether it has been asked to terminate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails

2015-09-06 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14733248#comment-14733248
 ] 

Li Bo commented on HDFS-8704:
-

I have tried replacing a failed streamer with a new one. When replacing, the 
outputstream has to stop sending packets to the old streamer, and can only 
start sending packets to the new one after all packets of the next block have 
been moved from the old streamer to the new one. That is much more difficult 
than restarting the failed streamer. The automatic restart of a failed 
streamer frees the outputstream from having to care whether some streamer has 
failed.

> Erasure Coding: client fails to write large file when one datanode fails
> 
>
> Key: HDFS-8704
> URL: https://issues.apache.org/jira/browse/HDFS-8704
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, 
> HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, 
> HDFS-8704-HDFS-7285-005.patch, HDFS-8704-HDFS-7285-006.patch, 
> HDFS-8704-HDFS-7285-007.patch
>
>
> I tested the current code on a 5-node cluster using RS(3,2). When a 
> datanode is corrupt, the client succeeds in writing a file smaller than a 
> block group but fails to write a larger one. 
> {{TestDFSStripeOutputStreamWithFailure}} only tests files smaller than a 
> block group; this jira will add more test situations.
> A streamer may encounter bad datanodes when writing the blocks allocated to 
> it. When it fails to connect to a datanode or to send a packet, the streamer 
> needs to prepare for the next block. First it removes the packets of the 
> current block from its data queue. If the first packet of the next block is 
> already in the data queue, the streamer resets its state and starts waiting 
> for the next block allocated to it; otherwise it just waits for the first 
> packet of the next block. While waiting, the streamer periodically checks 
> whether it has been asked to terminate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails

2015-09-06 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8704:

Attachment: HDFS-8704-HDFS-7285-007.patch

> Erasure Coding: client fails to write large file when one datanode fails
> 
>
> Key: HDFS-8704
> URL: https://issues.apache.org/jira/browse/HDFS-8704
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, 
> HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, 
> HDFS-8704-HDFS-7285-005.patch, HDFS-8704-HDFS-7285-006.patch, 
> HDFS-8704-HDFS-7285-007.patch
>
>
> I tested the current code on a 5-node cluster using RS(3,2). When a 
> datanode is corrupt, the client succeeds in writing a file smaller than a 
> block group but fails to write a larger one. 
> {{TestDFSStripeOutputStreamWithFailure}} only tests files smaller than a 
> block group; this jira will add more test situations.
> A streamer may encounter bad datanodes when writing the blocks allocated to 
> it. When it fails to connect to a datanode or to send a packet, the streamer 
> needs to prepare for the next block. First it removes the packets of the 
> current block from its data queue. If the first packet of the next block is 
> already in the data queue, the streamer resets its state and starts waiting 
> for the next block allocated to it; otherwise it just waits for the first 
> packet of the next block. While waiting, the streamer periodically checks 
> whether it has been asked to terminate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails

2015-09-06 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14732314#comment-14732314
 ] 

Li Bo commented on HDFS-8704:
-

Thanks to [~walter.k.su] and [~zhz] for the review.
The test-case failures are caused by the unit test itself; I have fixed them 
in patch 007.
The test hangs at {{DFSTestUtil.waitReplication}} in 
{{TestDFSStripedOutputStreamWithFailure}}; I simply omitted that statement in 
the patch to make the tests pass. [~walter.k.su], could you help me check this 
problem?
I will switch to HDFS-8383 later.

> Erasure Coding: client fails to write large file when one datanode fails
> 
>
> Key: HDFS-8704
> URL: https://issues.apache.org/jira/browse/HDFS-8704
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Li Bo
>Assignee: Li Bo
> Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, 
> HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, 
> HDFS-8704-HDFS-7285-005.patch, HDFS-8704-HDFS-7285-006.patch, 
> HDFS-8704-HDFS-7285-007.patch
>
>
> I tested the current code on a 5-node cluster using RS(3,2). When a 
> datanode is corrupt, the client succeeds in writing a file smaller than a 
> block group but fails to write a larger one. 
> {{TestDFSStripeOutputStreamWithFailure}} only tests files smaller than a 
> block group; this jira will add more test situations.
> A streamer may encounter bad datanodes when writing the blocks allocated to 
> it. When it fails to connect to a datanode or to send a packet, the streamer 
> needs to prepare for the next block. First it removes the packets of the 
> current block from its data queue. If the first packet of the next block is 
> already in the data queue, the streamer resets its state and starts waiting 
> for the next block allocated to it; otherwise it just waits for the first 
> packet of the next block. While waiting, the streamer periodically checks 
> whether it has been asked to terminate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8383) Tolerate multiple failures in DFSStripedOutputStream

2015-09-06 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14732320#comment-14732320
 ] 

Li Bo commented on HDFS-8383:
-

Thanks [~walter.k.su] for the work! I have just read the code and found some 
points to discuss:
1) When only one streamer fails, do we need to do anything? I think we can 
just ignore the failed streamer unless more than 3 streamers have failed. The 
offline decode work will be started by some datanode later.
2) I think it's not right to set a streamer's failed status in the 
outputstream, due to the asynchrony. I have given some reasons in HDFS-8704. 
The outputstream doesn't need to care about the status of each streamer if 
just one or two streamers fail. This will not complicate the logic of the 
outputstream.
3) I'm not very clear about the error handling. For example, streamer_i fails 
to write a packet of block_j but succeeds in writing block_j+1; could you give 
a detailed description of this situation?


> Tolerate multiple failures in DFSStripedOutputStream
> 
>
> Key: HDFS-8383
> URL: https://issues.apache.org/jira/browse/HDFS-8383
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Walter Su
> Attachments: HDFS-8383.00.patch, HDFS-8383.01.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails

2015-08-28 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14718235#comment-14718235
 ] 

Li Bo commented on HDFS-8704:
-

The two failed test cases are about insufficient datanodes. They also fail 
without the patch. We can handle them in a separate jira.

 Erasure Coding: client fails to write large file when one datanode fails
 

 Key: HDFS-8704
 URL: https://issues.apache.org/jira/browse/HDFS-8704
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Li Bo
Assignee: Li Bo
 Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, 
 HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, 
 HDFS-8704-HDFS-7285-005.patch, HDFS-8704-HDFS-7285-006.patch


 I tested the current code on a 5-node cluster using RS(3,2). When a datanode is 
 corrupt, the client succeeds in writing a file smaller than a block group but 
 fails to write a larger one. {{TestDFSStripeOutputStreamWithFailure}} only tests 
 files smaller than a block group; this jira will add more test situations.
 A streamer may encounter some bad datanodes when writing the blocks allocated to 
 it. When it fails to connect to a datanode or to send a packet, the streamer 
 needs to prepare for the next block. First it removes the packets of the current 
 block from its data queue. If the first packet of the next block is already in 
 the data queue, the streamer resets its state and starts to wait for the next 
 block allocated to it; otherwise it just waits for the first packet of the next 
 block. The streamer periodically checks whether it has been asked to terminate 
 while waiting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails

2015-08-27 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8704:

Attachment: HDFS-8704-HDFS-7285-006.patch

Updated the patch based on the current code of the HDFS-7285 branch.

 Erasure Coding: client fails to write large file when one datanode fails
 

 Key: HDFS-8704
 URL: https://issues.apache.org/jira/browse/HDFS-8704
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Li Bo
Assignee: Li Bo
 Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, 
 HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, 
 HDFS-8704-HDFS-7285-005.patch, HDFS-8704-HDFS-7285-006.patch


 I tested the current code on a 5-node cluster using RS(3,2). When a datanode is 
 corrupt, the client succeeds in writing a file smaller than a block group but 
 fails to write a larger one. {{TestDFSStripeOutputStreamWithFailure}} only tests 
 files smaller than a block group; this jira will add more test situations.
 A streamer may encounter some bad datanodes when writing the blocks allocated to 
 it. When it fails to connect to a datanode or to send a packet, the streamer 
 needs to prepare for the next block. First it removes the packets of the current 
 block from its data queue. If the first packet of the next block is already in 
 the data queue, the streamer resets its state and starts to wait for the next 
 block allocated to it; otherwise it just waits for the first packet of the next 
 block. The streamer periodically checks whether it has been asked to terminate 
 while waiting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails

2015-08-26 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14712581#comment-14712581
 ] 

Li Bo commented on HDFS-8704:
-

Thanks for Zhe's review. Handling datanode failures across multiple block 
groups is complex, so let me give a detailed explanation.

Suppose packets {{p1,p2,p3,p4}} belong to the first block and {{p5,p6,p7,p8}} 
belong to the next one. Now the streamer fails to send packet p2 and sets its 
{{streamerClosed}} flag to true. Its {{dataQueue}} still contains {{p3,p4,p5}}. 
Then {{DFSStripedOutputStream}} will fail to write the data of packet p6 and 
will mark this streamer as failed. After that, {{DFSStripedOutputStream}} will 
not write {{p7,p8}} to that streamer any more. But the streamer may be 
allocated a good datanode, in which case {{p5,p6,p7,p8}} should be 
successfully written to that datanode. We only see 
{{StripedDataStreamer#setFailed(true)}}; where and when is the streamer marked 
as not failed?
Because {{StripedDataStreamer}} and {{DFSStripedOutputStream}} run 
asynchronously, it is more reasonable to keep {{DFSStripedOutputStream}} 
unaware of the status of the streamers unless there are not enough streamers 
left. When a streamer fails to connect to a datanode or to write some packet, 
it removes the remaining packets of the current block from its {{dataQueue}} 
and waits for the next block to be allocated to it. 

1.  Please see the explanation above.
2.  If you only write files smaller than a block group, you will not see the 
problem with {{setFailed}}. The failed status should be set and cleared by the 
data streamer itself, not by the output stream.
3.  When a streamer fails, it should do some extra work to prepare for the 
next block. It is very difficult to achieve this without overriding {{run}}. I 
will add a description to the JIRA summary later. When sending a packet fails, 
the following packets belonging to the same block become trivial; you just 
need to remove them from the {{dataQueue}} (a small sketch follows below).
4.  I will update the patch after HDFS-8838 is committed.
5.  For the small issues: if a streamer fails at some block, it still sends 
the coordinator an end block with a negative {{numBytes}}.
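
A worked illustration of the {{dataQueue}} cleanup in the example above; 
{{Packet}}, {{getBlockIndex()}} and the packet variables are hypothetical 
names for this sketch, not the real client types:

{code}
// Packets p1..p4 belong to block b1 and p5..p8 to block b2. The streamer
// failed while sending p2, so the queue currently holds p3, p4, p5. Only the
// later packets of the failed block are trivial and dropped; p5 must survive
// so it can be written once the next block is allocated.
Deque<Packet> dataQueue = new ArrayDeque<>(Arrays.asList(p3, p4, p5));
int failedBlockIndex = 1;  // index of block b1
for (Iterator<Packet> it = dataQueue.iterator(); it.hasNext();) {
  if (it.next().getBlockIndex() == failedBlockIndex) {
    it.remove();           // drops p3 and p4
  }
}
// dataQueue now holds only p5; p6..p8 arrive later from the output stream.
{code}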


 Erasure Coding: client fails to write large file when one datanode fails
 

 Key: HDFS-8704
 URL: https://issues.apache.org/jira/browse/HDFS-8704
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Li Bo
Assignee: Li Bo
 Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, 
 HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, 
 HDFS-8704-HDFS-7285-005.patch


 I tested the current code on a 5-node cluster using RS(3,2). When a datanode is 
 corrupt, the client succeeds in writing a file smaller than a block group but 
 fails to write a larger one. {{TestDFSStripeOutputStreamWithFailure}} only tests 
 files smaller than a block group; this jira will add more test situations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails

2015-08-26 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8704:

Description: 
I tested the current code on a 5-node cluster using RS(3,2). When a datanode 
is corrupt, the client succeeds in writing a file smaller than a block group 
but fails to write a larger one. {{TestDFSStripeOutputStreamWithFailure}} only 
tests files smaller than a block group; this jira will add more test 
situations.


A streamer may encounter some bad datanodes when writing the blocks allocated 
to it. When it fails to connect to a datanode or to send a packet, the 
streamer needs to prepare for the next block. First it removes the packets of 
the current block from its data queue. If the first packet of the next block 
is already in the data queue, the streamer resets its state and starts to wait 
for the next block allocated to it; otherwise it just waits for the first 
packet of the next block. The streamer periodically checks whether it has been 
asked to terminate while waiting.


  was: I tested the current code on a 5-node cluster using RS(3,2). When a 
datanode is corrupt, the client succeeds in writing a file smaller than a 
block group but fails to write a larger one. 
{{TestDFSStripeOutputStreamWithFailure}} only tests files smaller than a block 
group; this jira will add more test situations.


 Erasure Coding: client fails to write large file when one datanode fails
 

 Key: HDFS-8704
 URL: https://issues.apache.org/jira/browse/HDFS-8704
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Li Bo
Assignee: Li Bo
 Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, 
 HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, 
 HDFS-8704-HDFS-7285-005.patch


 I tested the current code on a 5-node cluster using RS(3,2). When a datanode is 
 corrupt, the client succeeds in writing a file smaller than a block group but 
 fails to write a larger one. {{TestDFSStripeOutputStreamWithFailure}} only tests 
 files smaller than a block group; this jira will add more test situations.
 A streamer may encounter some bad datanodes when writing the blocks allocated to 
 it. When it fails to connect to a datanode or to send a packet, the streamer 
 needs to prepare for the next block. First it removes the packets of the current 
 block from its data queue. If the first packet of the next block is already in 
 the data queue, the streamer resets its state and starts to wait for the next 
 block allocated to it; otherwise it just waits for the first packet of the next 
 block. The streamer periodically checks whether it has been asked to terminate 
 while waiting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8838) Tolerate datanode failures in DFSStripedOutputStream when the data length is small

2015-08-25 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14710695#comment-14710695
 ] 

Li Bo commented on HDFS-8838:
-

hi, Walter Su
I looked at the branches; the ones related to HDFS-7285 are:
  remotes/origin/HDFS-7285
  remotes/origin/HDFS-7285-REBASE
  remotes/origin/HDFS-7285-merge

What is the current relationship among these branches?
Thanks



 Tolerate datanode failures in DFSStripedOutputStream when the data length is 
 small
 --

 Key: HDFS-8838
 URL: https://issues.apache.org/jira/browse/HDFS-8838
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: hdfs-client
Reporter: Tsz Wo Nicholas Sze
Assignee: Tsz Wo Nicholas Sze
 Attachments: HDFS-8838-HDFS-7285-000.patch, 
 HDFS-8838-HDFS-7285-20150809-test.patch, HDFS-8838-HDFS-7285-20150809.patch, 
 HDFS-8838-HDFS-7285-20150821.patch, h8838_20150729.patch, 
 h8838_20150731-HDFS-7285.patch, h8838_20150731.log, h8838_20150731.patch, 
 h8838_20150804-HDFS-7285.patch, h8838_20150809.patch


 Currently, DFSStripedOutputStream cannot tolerate datanode failures when the 
 data length is small.  We fix the bugs here and add more tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8889) Erasure Coding: cover more test situations of datanode failure during client writing

2015-08-24 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8889:

Attachment: HDFS-8889-HDFS-7285-001.patch

 Erasure Coding: cover more test situations of datanode failure during client 
 writing
 

 Key: HDFS-8889
 URL: https://issues.apache.org/jira/browse/HDFS-8889
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Li Bo
Assignee: Li Bo
 Attachments: HDFS-8889-HDFS-7285-001.patch


 Currently 9 streamers work together for the client writing. A small number of 
 failed datanodes (<= 3) for a block group should not affect the writing. There 
 are a lot of datanode failure cases, and we should cover as many as possible 
 in unit tests.
 Suppose streamer 4 fails; the following situations for the next block group 
 should be considered:
 1) all streamers succeed
 2) streamer 4 still fails
 3) only streamer 1 fails
 4) only streamer 8 fails (test a parity streamer)
 5) streamers 4 and 6 fail
 6) streamers 1, 4 and 6 fail
 7) streamers 1, 2, 4 and 6 fail
 8) streamers 2 and 6 fail
 Suppose streamers 2 and 4 fail; the following situations for the next block 
 group should be considered:
 1) only streamers 2 and 4 fail
 2) streamers 2, 4 and 8 fail
 3) only streamer 2 fails
 4) streamers 3 and 8 fail
 For a single streamer, we should consider the following situations regarding 
 the time of datanode failure:
 1) before writing the first byte
 2) before finishing writing the first cell
 3) right after finishing writing the first cell
 4) before writing the last byte of the block
 Other situations:
 1) more than 3 streamers fail at the first block group
 2) more than 3 streamers fail at the last block group
 more …



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8838) Tolerate datanode failures in DFSStripedOutputStream when the data length is small

2015-08-24 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14708942#comment-14708942
 ] 

Li Bo commented on HDFS-8838:
-

hi, Jing, some errors still exist when applying the patch to the newest code. 
Could you check again? Thanks

 Tolerate datanode failures in DFSStripedOutputStream when the data length is 
 small
 --

 Key: HDFS-8838
 URL: https://issues.apache.org/jira/browse/HDFS-8838
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: hdfs-client
Reporter: Tsz Wo Nicholas Sze
Assignee: Tsz Wo Nicholas Sze
 Attachments: HDFS-8838-HDFS-7285-000.patch, 
 HDFS-8838-HDFS-7285-20150809-test.patch, HDFS-8838-HDFS-7285-20150809.patch, 
 HDFS-8838-HDFS-7285-20150821.patch, h8838_20150729.patch, 
 h8838_20150731-HDFS-7285.patch, h8838_20150731.log, h8838_20150731.patch, 
 h8838_20150804-HDFS-7285.patch, h8838_20150809.patch


 Currently, DFSStripedOutputStream cannot tolerate datanode failures when the 
 data length is small.  We fix the bugs here and add more tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8889) Erasure Coding: cover more test situations of datanode failure during client writing

2015-08-12 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8889:

Description: 
Currently 9 streamers work together for the client writing. A small number of 
failed datanodes (<= 3) for a block group should not affect the writing. There 
are a lot of datanode failure cases, and we should cover as many as possible 
in unit tests.
Suppose streamer 4 fails; the following situations for the next block group 
should be considered:
1)  all streamers succeed
2)  streamer 4 still fails
3)  only streamer 1 fails
4)  only streamer 8 fails (test a parity streamer)
5)  streamers 4 and 6 fail
6)  streamers 1, 4 and 6 fail
7)  streamers 1, 2, 4 and 6 fail
8)  streamers 2 and 6 fail
Suppose streamers 2 and 4 fail; the following situations for the next block 
group should be considered:
1)  only streamers 2 and 4 fail
2)  streamers 2, 4 and 8 fail
3)  only streamer 2 fails
4)  streamers 3 and 8 fail
For a single streamer, we should consider the following situations regarding 
the time of datanode failure:
1)  before writing the first byte
2)  before finishing writing the first cell
3)  right after finishing writing the first cell
4)  before writing the last byte of the block
Other situations:
1)  more than 3 streamers fail at the first block group
2)  more than 3 streamers fail at the last block group
more …

 Erasure Coding: cover more test situations of datanode failure during client 
 writing
 

 Key: HDFS-8889
 URL: https://issues.apache.org/jira/browse/HDFS-8889
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Li Bo
Assignee: Li Bo

 Currently 9 streamers work together for the client writing. A small number of 
 failed datanodes (<= 3) for a block group should not affect the writing. There 
 are a lot of datanode failure cases, and we should cover as many as possible 
 in unit tests.
 Suppose streamer 4 fails; the following situations for the next block group 
 should be considered:
 1) all streamers succeed
 2) streamer 4 still fails
 3) only streamer 1 fails
 4) only streamer 8 fails (test a parity streamer)
 5) streamers 4 and 6 fail
 6) streamers 1, 4 and 6 fail
 7) streamers 1, 2, 4 and 6 fail
 8) streamers 2 and 6 fail
 Suppose streamers 2 and 4 fail; the following situations for the next block 
 group should be considered:
 1) only streamers 2 and 4 fail
 2) streamers 2, 4 and 8 fail
 3) only streamer 2 fails
 4) streamers 3 and 8 fail
 For a single streamer, we should consider the following situations regarding 
 the time of datanode failure:
 1) before writing the first byte
 2) before finishing writing the first cell
 3) right after finishing writing the first cell
 4) before writing the last byte of the block
 Other situations:
 1) more than 3 streamers fail at the first block group
 2) more than 3 streamers fail at the last block group
 more …



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8889) Erasure Coding: cover more test situations of datanode failure during client writing

2015-08-12 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14693047#comment-14693047
 ] 

Li Bo commented on HDFS-8889:
-

Currently 9 streamers work together for the client writing. A small number of 
failed datanodes (<= 3) for a block group should not affect the writing. There 
are a lot of datanode failure cases, and we should cover as many as possible 
in unit tests (a test sketch follows below). 

Suppose streamer 4 fails; the following situations for the next block group 
should be considered:
1)  all streamers succeed
2)  streamer 4 still fails
3)  only streamer 1 fails
4)  only streamer 8 fails (test a parity streamer)
5)  streamers 4 and 6 fail
6)  streamers 1, 4 and 6 fail
7)  streamers 1, 2, 4 and 6 fail
8)  streamers 2 and 6 fail
Suppose streamers 2 and 4 fail; the following situations for the next block 
group should be considered:
1)  only streamers 2 and 4 fail
2)  streamers 2, 4 and 8 fail
3)  only streamer 2 fails
4)  streamers 3 and 8 fail

For a single streamer, we should consider the following situations regarding 
the time of datanode failure:
1)  before writing the first byte
2)  before finishing writing the first cell
3)  right after finishing writing the first cell
4)  before writing the last byte of the block

Other situations:
1)  more than 3 streamers fail at the first block group
2)  more than 3 streamers fail at the last block group
more …
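
A minimal sketch of how the "streamer 4 fails first" combinations above might 
be enumerated in a test; {{runWriteTest}} is an assumed driver that kills the 
given streamers' datanodes in the first and next block groups, not the real 
test utility:

{code}
// Illustrative enumeration of the cases above: each inner array lists the
// streamers that fail in the next block group after streamer 4 failed in the
// first group. All helper names here are assumptions for this sketch.
int[][] nextGroupFailures = {
    {},            // 1) all streamers succeed
    {4},           // 2) streamer 4 still fails
    {1},           // 3) only streamer 1 fails
    {8},           // 4) only streamer 8 fails (parity streamer)
    {4, 6},        // 5) streamers 4 and 6 fail
    {1, 4, 6},     // 6) streamers 1, 4 and 6 fail
    {1, 2, 4, 6},  // 7) streamers 1, 2, 4 and 6 fail
    {2, 6},        // 8) streamers 2 and 6 fail
};
for (int[] next : nextGroupFailures) {
  runWriteTest(new int[] {4}, next);  // streamer 4 fails in the first group
}
{code}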


 Erasure Coding: cover more test situations of datanode failure during client 
 writing
 

 Key: HDFS-8889
 URL: https://issues.apache.org/jira/browse/HDFS-8889
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Li Bo
Assignee: Li Bo





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8889) Erasure Coding: cover more test situations of datanode failure during client writing

2015-08-11 Thread Li Bo (JIRA)
Li Bo created HDFS-8889:
---

 Summary: Erasure Coding: cover more test situations of datanode 
failure during client writing
 Key: HDFS-8889
 URL: https://issues.apache.org/jira/browse/HDFS-8889
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Li Bo
Assignee: Li Bo






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8857) Erasure Coding: Fix ArrayIndexOutOfBoundsException in TestWriteStripedFileWithFailure

2015-08-05 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659363#comment-14659363
 ] 

Li Bo commented on HDFS-8857:
-

Thanks, Kai, for the review. I have just committed the code to the HDFS-7285 
branch.

 Erasure Coding: Fix ArrayIndexOutOfBoundsException in 
 TestWriteStripedFileWithFailure
 -

 Key: HDFS-8857
 URL: https://issues.apache.org/jira/browse/HDFS-8857
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Li Bo
Assignee: Li Bo
 Attachments: HDFS-8857-HDFS-7285-001.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails

2015-08-05 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654873#comment-14654873
 ] 

Li Bo commented on HDFS-8704:
-

The failure of 
{{org.apache.hadoop.hdfs.TestWriteStripedFileWithFailure.testWriteStripedFileWithDNFailure}}
 has been fixed by HDFS-8857.
The failure of 
{{org.apache.hadoop.hdfs.server.namenode.TestFileTruncate.testTruncateWithDataNodesRestartImmediately}}
 seems to have no relation to this patch.

 Erasure Coding: client fails to write large file when one datanode fails
 

 Key: HDFS-8704
 URL: https://issues.apache.org/jira/browse/HDFS-8704
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Li Bo
Assignee: Li Bo
 Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, 
 HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, 
 HDFS-8704-HDFS-7285-005.patch


 I tested the current code on a 5-node cluster using RS(3,2). When a datanode is 
 corrupt, the client succeeds in writing a file smaller than a block group but 
 fails to write a larger one. {{TestDFSStripeOutputStreamWithFailure}} only tests 
 files smaller than a block group; this jira will add more test situations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8857) Erasure Coding: Fix ArrayIndexOutOfBoundsException in TestWriteStripedFileWithFailure

2015-08-05 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8857:

Resolution: Fixed
Status: Resolved  (was: Patch Available)

 Erasure Coding: Fix ArrayIndexOutOfBoundsException in 
 TestWriteStripedFileWithFailure
 -

 Key: HDFS-8857
 URL: https://issues.apache.org/jira/browse/HDFS-8857
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Li Bo
Assignee: Li Bo
 Attachments: HDFS-8857-HDFS-7285-001.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails

2015-08-04 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-8704:

Attachment: HDFS-8704-HDFS-7285-005.patch

 Erasure Coding: client fails to write large file when one datanode fails
 

 Key: HDFS-8704
 URL: https://issues.apache.org/jira/browse/HDFS-8704
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Li Bo
Assignee: Li Bo
 Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, 
 HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, 
 HDFS-8704-HDFS-7285-005.patch


 I tested the current code on a 5-node cluster using RS(3,2). When a datanode is 
 corrupt, the client succeeds in writing a file smaller than a block group but 
 fails to write a larger one. {{TestDFSStripeOutputStreamWithFailure}} only tests 
 files smaller than a block group; this jira will add more test situations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8838) Tolerate datanode failures in DFSStripedOutputStream when the data length is small

2015-08-04 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14653136#comment-14653136
 ] 

Li Bo commented on HDFS-8838:
-

hi, [~walter.k.su] and [~szetszwo], could you help review the patch for 
HDFS-8704 if you have time? 

 Tolerate datanode failures in DFSStripedOutputStream when the data length is 
 small
 --

 Key: HDFS-8838
 URL: https://issues.apache.org/jira/browse/HDFS-8838
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: hdfs-client
Reporter: Tsz Wo Nicholas Sze
Assignee: Tsz Wo Nicholas Sze
 Attachments: h8838_20150729.patch, h8838_20150731.patch


 Currently, DFSStripedOutputStream cannot tolerate datanode failures when the 
 data length is small.  We fix the bugs here and add more tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

