[jira] [Commented] (HDFS-10434) Fix intermittent test failure of TestDataNodeErasureCodingMetrics
[ https://issues.apache.org/jira/browse/HDFS-10434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296176#comment-15296176 ]

Li Bo commented on HDFS-10434:
------------------------------

Thanks to [~rakeshr] for the detailed explanation. The situation described will cause the test case to fail, and the patch fixes the problem. +1 for the patch.

> Fix intermittent test failure of TestDataNodeErasureCodingMetrics
> -----------------------------------------------------------------
>
> Key: HDFS-10434
> URL: https://issues.apache.org/jira/browse/HDFS-10434
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Rakesh R
> Assignee: Rakesh R
> Attachments: HDFS-10434-00.patch
>
> This jira is to fix the test case failure.
> Reference :
> [Build15485_TestDataNodeErasureCodingMetrics_testEcTasks|https://builds.apache.org/job/PreCommit-HDFS-Build/15485/testReport/org.apache.hadoop.hdfs.server.datanode/TestDataNodeErasureCodingMetrics/testEcTasks/]
> {code}
> Error Message
> Bad value for metric EcReconstructionTasks expected:<1> but was:<0>
> Stacktrace
> java.lang.AssertionError: Bad value for metric EcReconstructionTasks expected:<1> but was:<0>
>     at org.junit.Assert.fail(Assert.java:88)
>     at org.junit.Assert.failNotEquals(Assert.java:743)
>     at org.junit.Assert.assertEquals(Assert.java:118)
>     at org.junit.Assert.assertEquals(Assert.java:555)
>     at org.apache.hadoop.test.MetricsAsserts.assertCounter(MetricsAsserts.java:228)
>     at org.apache.hadoop.hdfs.server.datanode.TestDataNodeErasureCodingMetrics.testEcTasks(TestDataNodeErasureCodingMetrics.java:92)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-10434) Fix intermittent test failure of TestDataNodeErasureCodingMetrics
[ https://issues.apache.org/jira/browse/HDFS-10434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296112#comment-15296112 ]

Li Bo commented on HDFS-10434:
------------------------------

Thanks to [~rakeshr] for finding the problem. {{DFSTestUtil.waitForDatanodeState()}} and {{StripedFileTestUtil.waitForReconstructionFinished()}} should already make sure that the reconstruction work is finished before the metrics are checked, so I am confused about why these two statements do not take effect. Does the failure never happen after applying the patch (for example, when running the test case more than 20 times)?

> Fix intermittent test failure of TestDataNodeErasureCodingMetrics
> -----------------------------------------------------------------
>
> Key: HDFS-10434
> URL: https://issues.apache.org/jira/browse/HDFS-10434
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Rakesh R
> Assignee: Rakesh R
> Attachments: HDFS-10434-00.patch
>
> This jira is to fix the test case failure.
> Reference :
> [Build15485_TestDataNodeErasureCodingMetrics_testEcTasks|https://builds.apache.org/job/PreCommit-HDFS-Build/15485/testReport/org.apache.hadoop.hdfs.server.datanode/TestDataNodeErasureCodingMetrics/testEcTasks/]
> {code}
> Error Message
> Bad value for metric EcReconstructionTasks expected:<1> but was:<0>
> Stacktrace
> java.lang.AssertionError: Bad value for metric EcReconstructionTasks expected:<1> but was:<0>
>     at org.junit.Assert.fail(Assert.java:88)
>     at org.junit.Assert.failNotEquals(Assert.java:743)
>     at org.junit.Assert.assertEquals(Assert.java:118)
>     at org.junit.Assert.assertEquals(Assert.java:555)
>     at org.apache.hadoop.test.MetricsAsserts.assertCounter(MetricsAsserts.java:228)
>     at org.apache.hadoop.hdfs.server.datanode.TestDataNodeErasureCodingMetrics.testEcTasks(TestDataNodeErasureCodingMetrics.java:92)
> {code}
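For illustration only: one generic way to make a counter assertion tolerant of a late update is to poll the counter until it reaches the expected value, instead of asserting right after the wait calls return. The sketch below is not the HDFS-10434 patch and does not claim to reproduce the actual cause of the failure. The class PollingAssertSketch, the waitFor helper, and the simulated 50 ms delay are all hypothetical and use only the JDK.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

/** Hypothetical helper for illustration; not part of Hadoop's test utilities. */
final class PollingAssertSketch {

  /**
   * Re-checks the condition every intervalMs until it holds.
   * Returns false if timeoutMs elapses or the thread is interrupted first.
   */
  static boolean waitFor(BooleanSupplier condition, long intervalMs, long timeoutMs) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return true;
  }

  /** Simulates a worker whose counter update arrives a little late. */
  static long runDemo() {
    AtomicLong tasks = new AtomicLong();
    Thread worker = new Thread(() -> {
      try {
        Thread.sleep(50); // assumed delay before the counter is bumped
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      tasks.incrementAndGet();
    });
    worker.start();
    // Poll instead of asserting immediately, so the late update is tolerated.
    boolean seen = waitFor(() -> tasks.get() == 1, 10, 5_000);
    return seen ? tasks.get() : -1;
  }

  public static void main(String[] args) {
    System.out.println("counter = " + runDemo());
  }
}
```

The trade-off is that a polling check hides how late the update arrives, so the timeout should stay small enough that a genuinely missing update still fails the test quickly.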
[jira] [Commented] (HDFS-8449) Add tasks count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284223#comment-15284223 ]

Li Bo commented on HDFS-8449:
-----------------------------

Thanks to Kai for the review and commit. I have just resolved this jira.

> Add tasks count metrics to datanode for ECWorker
> ------------------------------------------------
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Li Bo
> Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch,
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch,
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch,
> HDFS-8449-008.patch, HDFS-8449-009.patch, HDFS-8449-010.patch,
> HDFS-8449-v10.patch, HDFS-8449-v11.patch, HDFS-8449-v12.patch
>
> This sub-task tries to record the EC recovery tasks that a datanode has done,
> including total tasks, failed tasks and successful tasks.
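As a rough illustration of the three counts named in the issue description (total, failed and successful tasks), the sketch below keeps them in thread-safe counters. It is not the HDFS-8449 patch: the class EcTaskCountersSketch and its method names are hypothetical, and the real DataNode metrics classes are not used.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical counters for illustration; not the DataNode metrics implementation. */
final class EcTaskCountersSketch {
  private final AtomicLong totalTasks = new AtomicLong();
  private final AtomicLong failedTasks = new AtomicLong();
  private final AtomicLong successfulTasks = new AtomicLong();

  /** Runs one task and records whether it succeeded or failed. */
  void runAndRecord(Runnable task) {
    totalTasks.incrementAndGet();
    try {
      task.run();
      successfulTasks.incrementAndGet();
    } catch (RuntimeException e) {
      // In this sketch, any runtime exception counts as a failed task.
      failedTasks.incrementAndGet();
    }
  }

  long total() { return totalTasks.get(); }
  long failed() { return failedTasks.get(); }
  long successful() { return successfulTasks.get(); }

  public static void main(String[] args) {
    EcTaskCountersSketch counters = new EcTaskCountersSketch();
    counters.runAndRecord(() -> { });
    counters.runAndRecord(() -> { throw new IllegalStateException("simulated failure"); });
    System.out.println(counters.total() + " total, " + counters.failed()
        + " failed, " + counters.successful() + " successful");
  }
}
```

Once every submitted task has finished, total should equal failed plus successful, which is a convenient invariant to assert in a unit test.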
[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Bo updated HDFS-8449:
------------------------
    Resolution: Fixed
    Target Version/s: (was: )
    Status: Resolved (was: Patch Available)

> Add tasks count metrics to datanode for ECWorker
> ------------------------------------------------
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Li Bo
> Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch,
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch,
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch,
> HDFS-8449-008.patch, HDFS-8449-009.patch, HDFS-8449-010.patch,
> HDFS-8449-v10.patch, HDFS-8449-v11.patch, HDFS-8449-v12.patch
>
> This sub-task tries to record the EC recovery tasks that a datanode has done,
> including total tasks, failed tasks and successful tasks.
[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Bo updated HDFS-8449:
------------------------
    Attachment: HDFS-8449-v12.patch

Uploaded v12 (same as v11) to trigger the test.

> Add tasks count metrics to datanode for ECWorker
> ------------------------------------------------
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Li Bo
> Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch,
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch,
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch,
> HDFS-8449-008.patch, HDFS-8449-009.patch, HDFS-8449-010.patch,
> HDFS-8449-v10.patch, HDFS-8449-v11.patch, HDFS-8449-v12.patch
>
> This sub-task tries to record the EC recovery tasks that a datanode has done,
> including total tasks, failed tasks and successful tasks.
[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Bo updated HDFS-8449:
------------------------
    Status: Patch Available (was: In Progress)

> Add tasks count metrics to datanode for ECWorker
> ------------------------------------------------
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Li Bo
> Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch,
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch,
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch,
> HDFS-8449-008.patch, HDFS-8449-009.patch, HDFS-8449-010.patch,
> HDFS-8449-v10.patch, HDFS-8449-v11.patch
>
> This sub-task tries to record the EC recovery tasks that a datanode has done,
> including total tasks, failed tasks and successful tasks.
[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Bo updated HDFS-8449:
------------------------
    Status: In Progress (was: Patch Available)

Changed the status to trigger the test.

> Add tasks count metrics to datanode for ECWorker
> ------------------------------------------------
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Li Bo
> Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch,
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch,
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch,
> HDFS-8449-008.patch, HDFS-8449-009.patch, HDFS-8449-010.patch,
> HDFS-8449-v10.patch, HDFS-8449-v11.patch
>
> This sub-task tries to record the EC recovery tasks that a datanode has done,
> including total tasks, failed tasks and successful tasks.
[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Bo updated HDFS-8449:
------------------------
    Attachment: HDFS-8449-v11.patch

v11 fixes the checkstyle problems.

> Add tasks count metrics to datanode for ECWorker
> ------------------------------------------------
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Li Bo
> Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch,
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch,
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch,
> HDFS-8449-008.patch, HDFS-8449-009.patch, HDFS-8449-010.patch,
> HDFS-8449-v10.patch, HDFS-8449-v11.patch
>
> This sub-task tries to record the EC recovery tasks that a datanode has done,
> including total tasks, failed tasks and successful tasks.
[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Bo updated HDFS-8449:
------------------------
    Attachment: HDFS-8449-v10.patch

Renamed the patch to trigger the test.

> Add tasks count metrics to datanode for ECWorker
> ------------------------------------------------
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Li Bo
> Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch,
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch,
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch,
> HDFS-8449-008.patch, HDFS-8449-009.patch, HDFS-8449-010.patch,
> HDFS-8449-v10.patch
>
> This sub-task tries to record the EC recovery tasks that a datanode has done,
> including total tasks, failed tasks and successful tasks.
[jira] [Updated] (HDFS-8901) Use ByteBuffer in striping positional read
[ https://issues.apache.org/jira/browse/HDFS-8901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Bo updated HDFS-8901:
------------------------
    Attachment: HDFS-8901-v9.patch

Fixed the bugs in TestParallelShortCircuitLegacyRead.

> Use ByteBuffer in striping positional read
> ------------------------------------------
>
> Key: HDFS-8901
> URL: https://issues.apache.org/jira/browse/HDFS-8901
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Kai Zheng
> Assignee: Kai Zheng
> Attachments: HDFS-8901-v2.patch, HDFS-8901-v3.patch,
> HDFS-8901-v4.patch, HDFS-8901-v5.patch, HDFS-8901-v6.patch,
> HDFS-8901-v7.patch, HDFS-8901-v8.patch, HDFS-8901-v9.patch, initial-poc.patch
>
> The native erasure coder prefers direct ByteBuffers for performance
> reasons. To prepare for it, this change uses ByteBuffer throughout the
> code that implements striping positional read. It also avoids
> unnecessary data copying between striping read chunk buffers and decode input
> buffers.
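To illustrate the copy-avoidance idea in the description above, the sketch below shows how a direct ByteBuffer can be handed to a consumer as a view rather than a copy. It is only a JDK-level illustration under assumptions: the class DirectBufferSliceSketch and its view method are hypothetical, and nothing here is taken from the HDFS-8901 patch or from the erasure coder API.

```java
import java.nio.ByteBuffer;

/** Hypothetical illustration of sharing a direct buffer region without copying. */
final class DirectBufferSliceSketch {

  /** Returns a view of [offset, offset + length) that shares memory with source. */
  static ByteBuffer view(ByteBuffer source, int offset, int length) {
    ByteBuffer dup = source.duplicate(); // same memory, independent position and limit
    dup.clear();                         // position = 0, limit = capacity
    dup.limit(offset + length);
    dup.position(offset);
    return dup.slice();                  // the view starts at index 0 of the region
  }

  public static void main(String[] args) {
    ByteBuffer chunk = ByteBuffer.allocateDirect(8); // stands in for a read chunk buffer
    for (int i = 0; i < 8; i++) {
      chunk.put(i, (byte) i);
    }
    ByteBuffer decodeInput = view(chunk, 2, 4);      // stands in for a decode input buffer
    chunk.put(2, (byte) 42);                         // a write to the chunk is visible in the view
    System.out.println(decodeInput.isDirect() + " " + decodeInput.get(0));
  }
}
```

Because the view shares memory with the chunk buffer, the chunk must not be reused or overwritten while the consumer is still reading from the view.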
[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Bo updated HDFS-8449:
------------------------
    Attachment: HDFS-8449-010.patch

Fixed the problems according to Kai's comments.

> Add tasks count metrics to datanode for ECWorker
> ------------------------------------------------
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Li Bo
> Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch,
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch,
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch,
> HDFS-8449-008.patch, HDFS-8449-009.patch, HDFS-8449-010.patch
>
> This sub-task tries to record the EC recovery tasks that a datanode has done,
> including total tasks, failed tasks and successful tasks.
[jira] [Commented] (HDFS-8449) Add tasks count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272031#comment-15272031 ]

Li Bo commented on HDFS-8449:
-----------------------------

Thanks very much to Kai for the review.

bq. Could you enhance TestReconstructStripedFile similarly?
bq. Could we share TestReconstructStripedFile#waitForRecoveryFinished and avoid waitForRecoveryFinished?

I think it is a little strange to call {{TestReconstructStripedFile#waitForRecoveryFinished}} in {{TestDataNodeErasureCodingMetrics}}, because a change to {{TestReconstructStripedFile}} may then affect {{TestDataNodeErasureCodingMetrics}}. We can move the shared function to a util class. I think it is better to make the changes to {{TestReconstructStripedFile}} in a new, separate jira so that this jira stays focused on the test of the datanode metrics.

bq. Could we use DFSTestUtil.writeFile to generate the test file?

Both implementations are OK. Many test cases write a file directly through an output stream.

bq. I'm not sure about the following block codes are necessary.

The system executes the actions periodically. In the test we should make sure the actions have been executed before moving forward.

> Add tasks count metrics to datanode for ECWorker
> ------------------------------------------------
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Li Bo
> Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch,
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch,
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch,
> HDFS-8449-008.patch, HDFS-8449-009.patch
>
> This sub-task tries to record the EC recovery tasks that a datanode has done,
> including total tasks, failed tasks and successful tasks.
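The last point of the comment above (an action runs periodically, and the test has to be sure it ran before continuing) can be sketched with plain JDK concurrency primitives. This is only an illustration under assumptions: the class PeriodicActionLatchSketch, the 20 ms period, and the use of a CountDownLatch are hypothetical and are not how the patch or the HDFS test utilities do it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration: block a test until a periodic action has run once. */
final class PeriodicActionLatchSketch {

  /** Returns the number of runs observed once released, or -1 on timeout or interrupt. */
  static int runDemo() {
    AtomicInteger runs = new AtomicInteger();
    CountDownLatch firstRun = new CountDownLatch(1);
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    try {
      // Stands in for work that the system performs on a fixed schedule.
      scheduler.scheduleAtFixedRate(() -> {
        runs.incrementAndGet();
        firstRun.countDown(); // signal that at least one run has completed its update
      }, 20, 20, TimeUnit.MILLISECONDS);

      // The test waits here rather than assuming the action has already happened.
      boolean executed = firstRun.await(5, TimeUnit.SECONDS);
      return executed ? runs.get() : -1;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return -1;
    } finally {
      scheduler.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("periodic action ran " + runDemo() + " time(s) before the test continued");
  }
}
```

A latch needs a hook inside the periodic action; when no such hook exists, polling an observable state with a timeout is the usual fallback.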
[jira] [Updated] (HDFS-8901) Use ByteBuffer in striping positional read
[ https://issues.apache.org/jira/browse/HDFS-8901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Bo updated HDFS-8901:
------------------------
    Attachment: HDFS-8901-v8.patch

Rebased the patch onto the latest code and fixed the unsupported method problem.

> Use ByteBuffer in striping positional read
> ------------------------------------------
>
> Key: HDFS-8901
> URL: https://issues.apache.org/jira/browse/HDFS-8901
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Kai Zheng
> Assignee: Kai Zheng
> Attachments: HDFS-8901-v2.patch, HDFS-8901-v3.patch,
> HDFS-8901-v4.patch, HDFS-8901-v5.patch, HDFS-8901-v6.patch,
> HDFS-8901-v7.patch, HDFS-8901-v8.patch, initial-poc.patch
>
> The native erasure coder prefers direct ByteBuffers for performance
> reasons. To prepare for it, this change uses ByteBuffer throughout the
> code that implements striping positional read. It also avoids
> unnecessary data copying between striping read chunk buffers and decode input
> buffers.
[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Bo updated HDFS-8449:
------------------------
    Attachment: HDFS-8449-009.patch

> Add tasks count metrics to datanode for ECWorker
> ------------------------------------------------
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Li Bo
> Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch,
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch,
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch,
> HDFS-8449-008.patch, HDFS-8449-009.patch
>
> This sub-task tries to record the EC recovery tasks that a datanode has done,
> including total tasks, failed tasks and successful tasks.
[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Bo updated HDFS-8449:
------------------------
    Attachment: HDFS-8449-008.patch

> Add tasks count metrics to datanode for ECWorker
> ------------------------------------------------
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Li Bo
> Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch,
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch,
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch,
> HDFS-8449-008.patch
>
> This sub-task tries to record the EC recovery tasks that a datanode has done,
> including total tasks, failed tasks and successful tasks.
[jira] [Commented] (HDFS-8449) Add tasks count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15268124#comment-15268124 ]

Li Bo commented on HDFS-8449:
-----------------------------

Updated to patch 007 to fix some minor problems.

> Add tasks count metrics to datanode for ECWorker
> ------------------------------------------------
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Li Bo
> Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch,
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch,
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch
>
> This sub-task tries to record the EC recovery tasks that a datanode has done,
> including total tasks, failed tasks and successful tasks.
[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Bo updated HDFS-8449:
------------------------
    Attachment: HDFS-8449-007.patch

> Add tasks count metrics to datanode for ECWorker
> ------------------------------------------------
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Li Bo
> Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch,
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch,
> HDFS-8449-005.patch, HDFS-8449-006.patch, HDFS-8449-007.patch
>
> This sub-task tries to record the EC recovery tasks that a datanode has done,
> including total tasks, failed tasks and successful tasks.
[jira] [Commented] (HDFS-8449) Add tasks count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261472#comment-15261472 ]

Li Bo commented on HDFS-8449:
-----------------------------

The two failed unit tests are unrelated to this patch and will be fixed in HDFS-10334. The checkstyle problems can also be ignored.
Hi [~drankye], could you help review the patch again? Thanks.

> Add tasks count metrics to datanode for ECWorker
> ------------------------------------------------
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Li Bo
> Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch,
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch,
> HDFS-8449-005.patch, HDFS-8449-006.patch
>
> This sub-task tries to record the EC recovery tasks that a datanode has done,
> including total tasks, failed tasks and successful tasks.
[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Bo updated HDFS-8449:
------------------------
    Attachment: HDFS-8449-006.patch

> Add tasks count metrics to datanode for ECWorker
> ------------------------------------------------
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Li Bo
> Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch,
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch,
> HDFS-8449-005.patch, HDFS-8449-006.patch
>
> This sub-task tries to record the EC recovery tasks that a datanode has done,
> including total tasks, failed tasks and successful tasks.
[jira] [Commented] (HDFS-10317) dfs.domain.socket.path is not set in TestShortCircuitLocalRead.testReadWithRemoteBlockReader
[ https://issues.apache.org/jira/browse/HDFS-10317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253201#comment-15253201 ]

Li Bo commented on HDFS-10317:
------------------------------

The problem occurs when I run the test in IntelliJ IDEA, but it does not happen with mvn test. I will look into it further.

> dfs.domain.socket.path is not set in
> TestShortCircuitLocalRead.testReadWithRemoteBlockReader
>
> Key: HDFS-10317
> URL: https://issues.apache.org/jira/browse/HDFS-10317
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Li Bo
>
> org.apache.hadoop.HadoopIllegalArgumentException: The short-circuit local
> reads feature is enabled but dfs.domain.socket.path is not set.
>     at org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory.<init>(DomainSocketFactory.java:115)
>     at org.apache.hadoop.hdfs.ClientContext.<init>(ClientContext.java:132)
>     at org.apache.hadoop.hdfs.ClientContext.get(ClientContext.java:157)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:358)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:275)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:266)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:258)
>     at org.apache.hadoop.hdfs.MiniDFSCluster.waitActive(MiniDFSCluster.java:2466)
>     at org.apache.hadoop.hdfs.MiniDFSCluster.waitActive(MiniDFSCluster.java:2512)
>     at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:1632)
>     at org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:844)
>     at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:482)
>     at org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:441)
>     at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadWithRemoteBlockReader(TestShortCircuitLocalRead.java:608)
>     at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.testReadWithRemoteBlockReader(TestShortCircuitLocalRead.java:590)
[jira] [Created] (HDFS-10317) dfs.domain.socket.path is not set in TestShortCircuitLocalRead.testReadWithRemoteBlockReader
Li Bo created HDFS-10317:
-------------------------

    Summary: dfs.domain.socket.path is not set in TestShortCircuitLocalRead.testReadWithRemoteBlockReader
    Key: HDFS-10317
    URL: https://issues.apache.org/jira/browse/HDFS-10317
    Project: Hadoop HDFS
    Issue Type: Bug
    Reporter: Li Bo

org.apache.hadoop.HadoopIllegalArgumentException: The short-circuit local reads feature is enabled but dfs.domain.socket.path is not set.
    at org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory.<init>(DomainSocketFactory.java:115)
    at org.apache.hadoop.hdfs.ClientContext.<init>(ClientContext.java:132)
    at org.apache.hadoop.hdfs.ClientContext.get(ClientContext.java:157)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:358)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:275)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:266)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:258)
    at org.apache.hadoop.hdfs.MiniDFSCluster.waitActive(MiniDFSCluster.java:2466)
    at org.apache.hadoop.hdfs.MiniDFSCluster.waitActive(MiniDFSCluster.java:2512)
    at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:1632)
    at org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:844)
    at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:482)
    at org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:441)
    at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadWithRemoteBlockReader(TestShortCircuitLocalRead.java:608)
    at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.testReadWithRemoteBlockReader(TestShortCircuitLocalRead.java:590)
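The exception above is a configuration consistency check: short-circuit local reads are switched on, but no domain socket path is configured. The sketch below imitates that kind of check with a plain Map so that it runs without Hadoop. It is a simplified stand-in, not the DomainSocketFactory code. The key dfs.client.read.shortcircuit is an assumption (it does not appear in this thread), and the class ShortCircuitConfigSketch, the exception type, and the example socket path are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a client-side configuration check; not Hadoop code. */
final class ShortCircuitConfigSketch {
  // Assumed name of the switch that enables short-circuit local reads.
  static final String SHORT_CIRCUIT_KEY = "dfs.client.read.shortcircuit";
  // Key named in the exception message of this issue.
  static final String DOMAIN_SOCKET_PATH_KEY = "dfs.domain.socket.path";

  /** Throws if short-circuit reads are enabled but no socket path is configured. */
  static void validate(Map<String, String> conf) {
    boolean shortCircuit =
        Boolean.parseBoolean(conf.getOrDefault(SHORT_CIRCUIT_KEY, "false"));
    String socketPath = conf.getOrDefault(DOMAIN_SOCKET_PATH_KEY, "");
    if (shortCircuit && socketPath.isEmpty()) {
      throw new IllegalArgumentException(
          "The short-circuit local reads feature is enabled but "
              + DOMAIN_SOCKET_PATH_KEY + " is not set.");
    }
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put(SHORT_CIRCUIT_KEY, "true");
    try {
      validate(conf);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
    conf.put(DOMAIN_SOCKET_PATH_KEY, "/tmp/example_dn_socket"); // hypothetical path
    validate(conf);
    System.out.println("accepted once the socket path is set");
  }
}
```

A check of this shape depends only on which settings are visible when the client is created, so a test can pass or fail depending on how its configuration is assembled by the environment that launches it.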
[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Bo updated HDFS-8449:
------------------------
    Attachment: HDFS-8449-005.patch

Thanks to Kai for the review. There are several other EC-related metrics to add, so I think we can put the unit tests in a single file at first and consider moving them to other files at the end.

> Add tasks count metrics to datanode for ECWorker
> ------------------------------------------------
>
> Key: HDFS-8449
> URL: https://issues.apache.org/jira/browse/HDFS-8449
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Li Bo
> Assignee: Li Bo
> Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch,
> HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch,
> HDFS-8449-005.patch
>
> This sub-task tries to record the EC recovery tasks that a datanode has done,
> including total tasks, failed tasks and successful tasks.
[jira] [Commented] (HDFS-10307) Fix a bug in TestShortCircuitLocalRead
[ https://issues.apache.org/jira/browse/HDFS-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15247349#comment-15247349 ]

Li Bo commented on HDFS-10307:
------------------------------

Marked it as a duplicate.

> Fix a bug in TestShortCircuitLocalRead
> --------------------------------------
>
> Key: HDFS-10307
> URL: https://issues.apache.org/jira/browse/HDFS-10307
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Li Bo
> Assignee: Li Bo
> Attachments: HDFS-10307-001.patch
>
> The unit tests testLocalReadFallback, testLocalReadLegacy and testSmallFileLocalRead
> in TestShortCircuitLocalRead throw the following exception:
> java.lang.IndexOutOfBoundsException: Requested more bytes than destination buffer size
>     at org.apache.hadoop.fs.FSInputStream.validatePositionedReadArgs(FSInputStream.java:107)
>     at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:975)
>     at java.io.DataInputStream.read(DataInputStream.java:149)
>     at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.checkFileContent(TestShortCircuitLocalRead.java:157)
>     at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadImpl(TestShortCircuitLocalRead.java:286)
>     at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadLegacy(TestShortCircuitLocalRead.java:235)
>     at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.testLocalReadFallback(TestShortCircuitLocalRead.java:327)
[jira] [Updated] (HDFS-10307) Fix a bug in TestShortCircuitLocalRead
[ https://issues.apache.org/jira/browse/HDFS-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Bo updated HDFS-10307:
-------------------------
    Resolution: Duplicate
    Status: Resolved (was: Patch Available)

> Fix a bug in TestShortCircuitLocalRead
> --------------------------------------
>
> Key: HDFS-10307
> URL: https://issues.apache.org/jira/browse/HDFS-10307
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Li Bo
> Assignee: Li Bo
> Attachments: HDFS-10307-001.patch
>
> The unit tests testLocalReadFallback, testLocalReadLegacy and testSmallFileLocalRead
> in TestShortCircuitLocalRead throw the following exception:
> java.lang.IndexOutOfBoundsException: Requested more bytes than destination buffer size
>     at org.apache.hadoop.fs.FSInputStream.validatePositionedReadArgs(FSInputStream.java:107)
>     at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:975)
>     at java.io.DataInputStream.read(DataInputStream.java:149)
>     at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.checkFileContent(TestShortCircuitLocalRead.java:157)
>     at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadImpl(TestShortCircuitLocalRead.java:286)
>     at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadLegacy(TestShortCircuitLocalRead.java:235)
>     at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.testLocalReadFallback(TestShortCircuitLocalRead.java:327)
[jira] [Updated] (HDFS-10307) Fix a bug in TestShortCircuitLocalRead
[ https://issues.apache.org/jira/browse/HDFS-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Bo updated HDFS-10307:
-------------------------
    Status: Patch Available (was: Open)

> Fix a bug in TestShortCircuitLocalRead
> --------------------------------------
>
> Key: HDFS-10307
> URL: https://issues.apache.org/jira/browse/HDFS-10307
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Li Bo
> Assignee: Li Bo
> Attachments: HDFS-10307-001.patch
>
> The unit tests testLocalReadFallback, testLocalReadLegacy and testSmallFileLocalRead
> in TestShortCircuitLocalRead throw the following exception:
> java.lang.IndexOutOfBoundsException: Requested more bytes than destination buffer size
>     at org.apache.hadoop.fs.FSInputStream.validatePositionedReadArgs(FSInputStream.java:107)
>     at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:975)
>     at java.io.DataInputStream.read(DataInputStream.java:149)
>     at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.checkFileContent(TestShortCircuitLocalRead.java:157)
>     at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadImpl(TestShortCircuitLocalRead.java:286)
>     at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadLegacy(TestShortCircuitLocalRead.java:235)
>     at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.testLocalReadFallback(TestShortCircuitLocalRead.java:327)
[jira] [Updated] (HDFS-10307) Fix a bug in TestShortCircuitLocalRead
[ https://issues.apache.org/jira/browse/HDFS-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-10307: - Attachment: HDFS-10307-001.patch > Fix a bug in TestShortCircuitLocalRead > -- > > Key: HDFS-10307 > URL: https://issues.apache.org/jira/browse/HDFS-10307 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-10307-001.patch > > > Unit tests testLocalReadFallback, testLocalReadLegacy and testSmallFileLocalRead > in TestShortCircuitLocalRead throw the following exception: > java.lang.IndexOutOfBoundsException: Requested more bytes than destination > buffer size > at > org.apache.hadoop.fs.FSInputStream.validatePositionedReadArgs(FSInputStream.java:107) > at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:975) > at java.io.DataInputStream.read(DataInputStream.java:149) > at > org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.checkFileContent(TestShortCircuitLocalRead.java:157) > at > org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadImpl(TestShortCircuitLocalRead.java:286) > at > org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadLegacy(TestShortCircuitLocalRead.java:235) > at > org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.testLocalReadFallback(TestShortCircuitLocalRead.java:327) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-10307) Fix a bug in TestShortCircuitLocalRead
Li Bo created HDFS-10307: Summary: Fix a bug in TestShortCircuitLocalRead Key: HDFS-10307 URL: https://issues.apache.org/jira/browse/HDFS-10307 Project: Hadoop HDFS Issue Type: Bug Reporter: Li Bo Assignee: Li Bo Unit tests testLocalReadFallback, testLocalReadLegacy and testSmallFileLocalRead in TestShortCircuitLocalRead throw the following exception: java.lang.IndexOutOfBoundsException: Requested more bytes than destination buffer size at org.apache.hadoop.fs.FSInputStream.validatePositionedReadArgs(FSInputStream.java:107) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:975) at java.io.DataInputStream.read(DataInputStream.java:149) at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.checkFileContent(TestShortCircuitLocalRead.java:157) at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadImpl(TestShortCircuitLocalRead.java:286) at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.doTestShortCircuitReadLegacy(TestShortCircuitLocalRead.java:235) at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead.testLocalReadFallback(TestShortCircuitLocalRead.java:327) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
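The exception message in the trace above suggests that the test asked for more bytes than its destination array could hold from the given offset. A minimal sketch of that kind of argument check follows. The class and method names (`PositionedReadCheck`, `validate`) are hypothetical and this is not the Hadoop source; it only illustrates the condition that the message describes.

```java
// Hypothetical helper, written for illustration only; not the Hadoop implementation.
final class PositionedReadCheck {
    private PositionedReadCheck() {}

    /** Rejects a positioned read that cannot fit into the destination buffer. */
    static void validate(long position, byte[] buffer, int offset, int length) {
        if (buffer == null) {
            throw new IllegalArgumentException("Null buffer");
        }
        if (position < 0 || offset < 0 || length < 0) {
            throw new IllegalArgumentException("Negative position, offset or length");
        }
        // Only buffer.length - offset bytes are available from offset onwards.
        if (buffer.length - offset < length) {
            throw new IndexOutOfBoundsException(
                "Requested more bytes than destination buffer size: length="
                    + length + ", available=" + (buffer.length - offset));
        }
    }
}
```

Under this reading, a caller in the position of `checkFileContent` would need to pass a length no larger than `buffer.length - offset`. Whether that is what HDFS-10307-001.patch changes is not stated in this thread.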
[jira] [Commented] (HDFS-8449) Add tasks count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245211#comment-15245211 ] Li Bo commented on HDFS-8449: - Hi [~drankye], could you help me review the newly updated patch? Thanks. > Add tasks count metrics to datanode for ECWorker > > > Key: HDFS-8449 > URL: https://issues.apache.org/jira/browse/HDFS-8449 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, > HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch > > > This sub-task tries to record the EC recovery tasks that a datanode has done, > including total tasks, failed tasks and successful tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-8449: Attachment: HDFS-8449-004.patch > Add tasks count metrics to datanode for ECWorker > > > Key: HDFS-8449 > URL: https://issues.apache.org/jira/browse/HDFS-8449 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, > HDFS-8449-002.patch, HDFS-8449-003.patch, HDFS-8449-004.patch > > > This sub-task tries to record the EC recovery tasks that a datanode has done, > including total tasks, failed tasks and successful tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
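The three counts named in the description above (total, failed and successful tasks) could be tracked roughly as follows. This is a sketch only: `EcTaskCounters` and its methods are hypothetical names, and it uses plain `AtomicLong` counters rather than whatever metrics classes the attached patches use.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical counter holder; the attached patches may be structured differently.
final class EcTaskCounters {
    private final AtomicLong total = new AtomicLong();
    private final AtomicLong failed = new AtomicLong();

    /** Called once per finished recovery task, from any worker thread. */
    void taskFinished(boolean succeeded) {
        total.incrementAndGet();
        if (!succeeded) {
            failed.incrementAndGet();
        }
    }

    long total() {
        return total.get();
    }

    long failed() {
        return failed.get();
    }

    /** Derived value; may lag briefly while another thread is between the two increments. */
    long successful() {
        return total.get() - failed.get();
    }
}
```

Deriving the successful count from the other two keeps the invariant total = failed + successful without a lock, at the cost of a momentarily stale reading under concurrency.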
[jira] [Commented] (HDFS-10258) Erasure Coding: support small cluster whose #DataNode < # (Blocks in a BlockGroup)
[ https://issues.apache.org/jira/browse/HDFS-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15242287#comment-15242287 ] Li Bo commented on HDFS-10258: -- Thanks for Kai's idea. I will try to find a solution with the lowest cost. > Erasure Coding: support small cluster whose #DataNode < # (Blocks in a > BlockGroup) > -- > > Key: HDFS-10258 > URL: https://issues.apache.org/jira/browse/HDFS-10258 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > > Currently EC does not support small clusters whose number of datanodes is > smaller than the number of blocks in a block group. This sub-task will solve > this problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-10258) Erasure Coding: support small cluster whose #DataNode < # (Blocks in a BlockGroup)
[ https://issues.apache.org/jira/browse/HDFS-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-10258: - Description: Currently EC does not support small clusters whose number of datanodes is smaller than the number of blocks in a block group. This sub-task will solve this problem. > Erasure Coding: support small cluster whose #DataNode < # (Blocks in a > BlockGroup) > -- > > Key: HDFS-10258 > URL: https://issues.apache.org/jira/browse/HDFS-10258 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > > Currently EC does not support small clusters whose number of datanodes is > smaller than the number of blocks in a block group. This sub-task will solve > this problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-10258) Erasure Coding: support small cluster whose #DataNode < # (Blocks in a BlockGroup)
Li Bo created HDFS-10258: Summary: Erasure Coding: support small cluster whose #DataNode < # (Blocks in a BlockGroup) Key: HDFS-10258 URL: https://issues.apache.org/jira/browse/HDFS-10258 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9822) Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
[ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186510#comment-15186510 ] Li Bo commented on HDFS-9822: - Hi Rakesh, after reading the code of {{UnderReplicationBlocks}} I am still a little confused about how this error happens. Since the situation is difficult to reproduce, how about creating a unit test case that simulates the error situation? > Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped > block at the same time > > > Key: HDFS-9822 > URL: https://issues.apache.org/jira/browse/HDFS-9822 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: erasure-coding >Reporter: Tsz Wo Nicholas Sze >Assignee: Rakesh R > Attachments: HDFS-9822-001.patch, HDFS-9822-002.patch > > > Found the following AssertionError in > https://builds.apache.org/job/PreCommit-HDFS-Build/14501/testReport/org.apache.hadoop.hdfs.server.namenode/TestReconstructStripedBlocks/testMissingStripedBlockWithBusyNode2/ > {code} > AssertionError: Should wait the previous reconstruction to finish > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.validateReconstructionWork(BlockManager.java:1680) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1536) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1472) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4229) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4100) > at java.lang.Thread.run(Thread.java:745) > at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:126) > at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:170) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4119) > at java.lang.Thread.run(Thread.java:745) > 
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9826) Erasure Coding: Postpone the recovery work for a configurable time period
[ https://issues.apache.org/jira/browse/HDFS-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-9826: Status: Patch Available (was: Open) > Erasure Coding: Postpone the recovery work for a configurable time period > -- > > Key: HDFS-9826 > URL: https://issues.apache.org/jira/browse/HDFS-9826 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-9826-001.patch, HDFS-9826-002.patch > > > Currently NameNode prepares recovering when finding an under replicated > block group. This is inefficient and reduces resources for other operations. > It would be better to postpone the recovery work for a period of time if only > one internal block is corrupted considering points shown by papers such as > \[1\]\[2\]: > 1.Transient errors in which no data are lost account for more than 90% of > data center failures, owing to network partitions, software problems, or > non-disk hardware faults. > 2.Although erasure codes tolerate multiple simultaneous failures, single > failures represent 99.75% of recoveries. > Different clusters may have different status, so we should allow user to > configure the time for postponing the recoveries. Proper configuration will > reduce a large proportion of unnecessary recoveries. When finding multiple > internal blocks corrupted in a block group, we prepare the recovery work > immediately because it’s very rare and we don’t want to increase the risk of > losing data. > [1] Availability in globally distributed storage systems > http://static.usenix.org/events/osdi10/tech/full_papers/Ford.pdf > [2] Rethinking erasure codes for cloud file systems: minimizing I/O for > recovery and degraded reads > http://static.usenix.org/events/fast/tech/full_papers/Khan.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9826) Erasure Coding: Postpone the recovery work for a configurable time period
[ https://issues.apache.org/jira/browse/HDFS-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-9826: Attachment: HDFS-9826-002.patch > Erasure Coding: Postpone the recovery work for a configurable time period > -- > > Key: HDFS-9826 > URL: https://issues.apache.org/jira/browse/HDFS-9826 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-9826-001.patch, HDFS-9826-002.patch > > > Currently NameNode prepares recovering when finding an under replicated > block group. This is inefficient and reduces resources for other operations. > It would be better to postpone the recovery work for a period of time if only > one internal block is corrupted considering points shown by papers such as > \[1\]\[2\]: > 1.Transient errors in which no data are lost account for more than 90% of > data center failures, owing to network partitions, software problems, or > non-disk hardware faults. > 2.Although erasure codes tolerate multiple simultaneous failures, single > failures represent 99.75% of recoveries. > Different clusters may have different status, so we should allow user to > configure the time for postponing the recoveries. Proper configuration will > reduce a large proportion of unnecessary recoveries. When finding multiple > internal blocks corrupted in a block group, we prepare the recovery work > immediately because it’s very rare and we don’t want to increase the risk of > losing data. > [1] Availability in globally distributed storage systems > http://static.usenix.org/events/osdi10/tech/full_papers/Ford.pdf > [2] Rethinking erasure codes for cloud file systems: minimizing I/O for > recovery and degraded reads > http://static.usenix.org/events/fast/tech/full_papers/Khan.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9826) Erasure Coding: Postpone the recovery work for a configurable time period
[ https://issues.apache.org/jira/browse/HDFS-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-9826: Attachment: HDFS-9826-001.patch upload an initial patch without test > Erasure Coding: Postpone the recovery work for a configurable time period > -- > > Key: HDFS-9826 > URL: https://issues.apache.org/jira/browse/HDFS-9826 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-9826-001.patch > > > Currently NameNode prepares recovering when finding an under replicated > block group. This is inefficient and reduces resources for other operations. > It would be better to postpone the recovery work for a period of time if only > one internal block is corrupted considering points shown by papers such as > \[1\]\[2\]: > 1.Transient errors in which no data are lost account for more than 90% of > data center failures, owing to network partitions, software problems, or > non-disk hardware faults. > 2.Although erasure codes tolerate multiple simultaneous failures, single > failures represent 99.75% of recoveries. > Different clusters may have different status, so we should allow user to > configure the time for postponing the recoveries. Proper configuration will > reduce a large proportion of unnecessary recoveries. When finding multiple > internal blocks corrupted in a block group, we prepare the recovery work > immediately because it’s very rare and we don’t want to increase the risk of > losing data. > [1] Availability in globally distributed storage systems > http://static.usenix.org/events/osdi10/tech/full_papers/Ford.pdf > [2] Rethinking erasure codes for cloud file systems: minimizing I/O for > recovery and degraded reads > http://static.usenix.org/events/fast/tech/full_papers/Khan.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9657) Schedule EC tasks at proper time to reduce the impact of recovery traffic
[ https://issues.apache.org/jira/browse/HDFS-9657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152003#comment-15152003 ] Li Bo commented on HDFS-9657: - Thanks for Zhe’s comment. Since 90% of errors are transient and cause no permanent data loss, postponing the recoveries may avoid a lot of unnecessary work. Configuring the task number by time segments is one solution. I have just created HDFS-9826, which delays the recoveries for a configured time period. I am thinking about how to combine the two. > Schedule EC tasks at proper time to reduce the impact of recovery traffic > - > > Key: HDFS-9657 > URL: https://issues.apache.org/jira/browse/HDFS-9657 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-9657-001.patch, HDFS-9657-002.patch > > > EC recovery tasks consume a lot of network bandwidth and disk I/O. > Recovering a corrupt block requires transferring 6 blocks, hence creating a > 6X overhead in network bandwidth and disk I/O. When a datanode fails, the > recovery of all the blocks on this datanode may use up the network > bandwidth. We need to start recovery tasks at a proper time in order to > reduce the impact on the system. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9826) Erasure Coding: Postpone the recovery work for a configurable time period
[ https://issues.apache.org/jira/browse/HDFS-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-9826: Description: Currently NameNode prepares recovering when finding an under replicated block group. This is inefficient and reduces resources for other operations. It would be better to postpone the recovery work for a period of time if only one internal block is corrupted considering points shown by papers such as \[1\]\[2\]: 1. Transient errors in which no data are lost account for more than 90% of data center failures, owing to network partitions, software problems, or non-disk hardware faults. 2. Although erasure codes tolerate multiple simultaneous failures, single failures represent 99.75% of recoveries. Different clusters may have different status, so we should allow user to configure the time for postponing the recoveries. Proper configuration will reduce a large proportion of unnecessary recoveries. When finding multiple internal blocks corrupted in a block group, we prepare the recovery work immediately because it’s very rare and we don’t want to increase the risk of losing data. [1] Availability in globally distributed storage systems http://static.usenix.org/events/osdi10/tech/full_papers/Ford.pdf [2] Rethinking erasure codes for cloud file systems: minimizing I/O for recovery and degraded reads http://static.usenix.org/events/fast/tech/full_papers/Khan.pdf was: Currently NameNode prepares recovering when finding an under replicated block group. This is inefficient and reduces resources for other operations. It would be better to postpone the recovery work for a period of time if only one internal block is corrupted considering points shown by papers such as \[1\]\[2\]: 1. Transient errors in which no data are lost account for more than 90% of data center failures, owing to network partitions, software problems, or non-disk hardware faults. 2. 
Although erasure codes tolerate multiple simultaneous failures, single failures represent 99.75% of recoveries. Different clusters may have different status, so we should allow user to configure the time for postponing the recoveries. Proper configuration will reduce a large proportion of unnecessary recoveries. When finding multiple internal blocks corrupted in a block group, we do the recovery work immediately because it’s very rare and we don’t want to increase the risk of losing data. [1] Availability in globally distributed storage systems http://static.usenix.org/events/osdi10/tech/full_papers/Ford.pdf [2] Rethinking erasure codes for cloud file systems: minimizing I/O for recovery and degraded reads http://static.usenix.org/events/fast/tech/full_papers/Khan.pdf > Erasure Coding: Postpone the recovery work for a configurable time period > -- > > Key: HDFS-9826 > URL: https://issues.apache.org/jira/browse/HDFS-9826 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > > Currently NameNode prepares recovering when finding an under replicated > block group. This is inefficient and reduces resources for other operations. > It would be better to postpone the recovery work for a period of time if only > one internal block is corrupted considering points shown by papers such as > \[1\]\[2\]: > 1.Transient errors in which no data are lost account for more than 90% of > data center failures, owing to network partitions, software problems, or > non-disk hardware faults. > 2.Although erasure codes tolerate multiple simultaneous failures, single > failures represent 99.75% of recoveries. > Different clusters may have different status, so we should allow user to > configure the time for postponing the recoveries. Proper configuration will > reduce a large proportion of unnecessary recoveries. 
When finding multiple > internal blocks corrupted in a block group, we prepare the recovery work > immediately because it’s very rare and we don’t want to increase the risk of > losing data. > [1] Availability in globally distributed storage systems > http://static.usenix.org/events/osdi10/tech/full_papers/Ford.pdf > [2] Rethinking erasure codes for cloud file systems: minimizing I/O for > recovery and degraded reads > http://static.usenix.org/events/fast/tech/full_papers/Khan.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-9826) Erasure Coding: Postpone the recovery work for a configurable time period
Li Bo created HDFS-9826: --- Summary: Erasure Coding: Postpone the recovery work for a configurable time period Key: HDFS-9826 URL: https://issues.apache.org/jira/browse/HDFS-9826 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo Currently the NameNode prepares recovery work as soon as it finds an under-replicated block group. This is inefficient and takes resources away from other operations. It would be better to postpone the recovery work for a period of time if only one internal block is corrupted, considering the points shown by papers such as \[1\]\[2\]: 1. Transient errors in which no data are lost account for more than 90% of data center failures, owing to network partitions, software problems, or non-disk hardware faults. 2. Although erasure codes tolerate multiple simultaneous failures, single failures represent 99.75% of recoveries. Clusters differ, so we should allow users to configure the time for postponing the recoveries. Proper configuration will avoid a large proportion of unnecessary recoveries. When multiple internal blocks are found corrupted in a block group, we do the recovery work immediately because this is very rare and we don’t want to increase the risk of losing data. [1] Availability in globally distributed storage systems http://static.usenix.org/events/osdi10/tech/full_papers/Ford.pdf [2] Rethinking erasure codes for cloud file systems: minimizing I/O for recovery and degraded reads http://static.usenix.org/events/fast/tech/full_papers/Khan.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
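The rule proposed in the description above (wait for a configurable period when a single internal block is affected, recover at once when several are) can be sketched as a small decision function. `RecoveryPostponePolicy`, its method and its parameters are hypothetical names chosen for illustration; the thread does not describe how the attached patches implement this.

```java
// Hypothetical decision helper; not taken from the HDFS-9826 patches.
final class RecoveryPostponePolicy {
    private final long postponeMillis; // assumed to come from a user-configurable setting

    RecoveryPostponePolicy(long postponeMillis) {
        if (postponeMillis < 0) {
            throw new IllegalArgumentException("postponeMillis must be >= 0");
        }
        this.postponeMillis = postponeMillis;
    }

    /**
     * @param corruptedInternalBlocks number of internal blocks currently bad in the group
     * @param firstSeenMillis         when the block group was first found under-replicated
     * @param nowMillis               current time, on the same clock as firstSeenMillis
     */
    boolean shouldRecoverNow(int corruptedInternalBlocks, long firstSeenMillis, long nowMillis) {
        if (corruptedInternalBlocks <= 0) {
            return false; // nothing to recover
        }
        if (corruptedInternalBlocks >= 2) {
            return true; // rare case: do not add to the risk of data loss by waiting
        }
        // Single failure: most are transient, so wait out the configured period first.
        return nowMillis - firstSeenMillis >= postponeMillis;
    }
}
```

With a ten-minute delay, a group with one bad internal block would be skipped until ten minutes after it was first seen, giving a transient fault time to clear, while a group with two bad blocks would be scheduled on the first pass.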
[jira] [Updated] (HDFS-9657) Schedule EC tasks at proper time to reduce the impact of recovery traffic
[ https://issues.apache.org/jira/browse/HDFS-9657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-9657: Attachment: HDFS-9657-002.patch The patch currently implements only one policy, i.e. ECRecoveryPolicyTimeSegment. I think it satisfies most situations. I will add more policies if needed. > Schedule EC tasks at proper time to reduce the impact of recovery traffic > - > > Key: HDFS-9657 > URL: https://issues.apache.org/jira/browse/HDFS-9657 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-9657-001.patch, HDFS-9657-002.patch > > > EC recovery tasks consume a lot of network bandwidth and disk I/O. > Recovering a corrupt block requires transferring 6 blocks, hence creating a > 6X overhead in network bandwidth and disk I/O. When a datanode fails, the > recovery of all the blocks on this datanode may use up the network > bandwidth. We need to start recovery tasks at a proper time in order to > reduce the impact on the system. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
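The message above names a time-segment policy but does not describe its internals. One possible reading, limiting the number of concurrent recovery tasks by time of day, is sketched below. `TimeSegmentTaskLimit` and everything inside it are hypothetical and are not taken from ECRecoveryPolicyTimeSegment; segments are assumed not to wrap past midnight.

```java
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a time-of-day task cap; not the patch's implementation.
final class TimeSegmentTaskLimit {
    private static final class Segment {
        final LocalTime start; // inclusive
        final LocalTime end;   // exclusive
        final int maxTasks;

        Segment(LocalTime start, LocalTime end, int maxTasks) {
            this.start = start;
            this.end = end;
            this.maxTasks = maxTasks;
        }
    }

    private final List<Segment> segments = new ArrayList<>();
    private final int defaultMaxTasks;

    TimeSegmentTaskLimit(int defaultMaxTasks) {
        this.defaultMaxTasks = defaultMaxTasks;
    }

    /** Adds a segment [start, end) within a single day and returns this for chaining. */
    TimeSegmentTaskLimit addSegment(LocalTime start, LocalTime end, int maxTasks) {
        if (!start.isBefore(end)) {
            throw new IllegalArgumentException("segment must not wrap past midnight");
        }
        segments.add(new Segment(start, end, maxTasks));
        return this;
    }

    /** Returns the cap of the first matching segment, or the default outside all segments. */
    int maxTasksAt(LocalTime now) {
        for (Segment s : segments) {
            if (!now.isBefore(s.start) && now.isBefore(s.end)) {
                return s.maxTasks;
            }
        }
        return defaultMaxTasks;
    }
}
```

For example, `new TimeSegmentTaskLimit(2).addSegment(LocalTime.of(0, 0), LocalTime.of(6, 0), 8)` would allow eight concurrent tasks during an assumed quiet period at night and two at other times.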
[jira] [Updated] (HDFS-9657) Schedule EC tasks at proper time to reduce the impact of recovery traffic
[ https://issues.apache.org/jira/browse/HDFS-9657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-9657: Status: Patch Available (was: Open) > Schedule EC tasks at proper time to reduce the impact of recovery traffic > - > > Key: HDFS-9657 > URL: https://issues.apache.org/jira/browse/HDFS-9657 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-9657-001.patch, HDFS-9657-002.patch > > > EC recovery tasks consume a lot of network bandwidth and disk I/O. > Recovering a corrupt block requires transferring 6 blocks, hence creating a > 6X overhead in network bandwidth and disk I/O. When a datanode fails, the > recovery of all the blocks on this datanode may use up the network > bandwidth. We need to start recovery tasks at a proper time in order to > reduce the impact on the system. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9657) Schedule EC tasks at proper time to reduce the impact of recovery traffic
[ https://issues.apache.org/jira/browse/HDFS-9657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-9657: Attachment: HDFS-9657-001.patch > Schedule EC tasks at proper time to reduce the impact of recovery traffic > - > > Key: HDFS-9657 > URL: https://issues.apache.org/jira/browse/HDFS-9657 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-9657-001.patch > > > EC recovery tasks consume a lot of network bandwidth and disk I/O. > Recovering a corrupt block requires transferring 6 blocks, hence creating a > 6X overhead in network bandwidth and disk I/O. When a datanode fails, the > recovery of all the blocks on this datanode may use up the network > bandwidth. We need to start recovery tasks at a proper time in order to > reduce the impact on the system. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-9657) Schedule EC tasks at proper time to reduce the impact of recovery traffic
Li Bo created HDFS-9657: --- Summary: Schedule EC tasks at proper time to reduce the impact of recovery traffic Key: HDFS-9657 URL: https://issues.apache.org/jira/browse/HDFS-9657 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo EC recovery tasks consume a lot of network bandwidth and disk I/O. Recovering a corrupt block requires transferring 6 blocks, hence creating a 6X overhead in network bandwidth and disk I/O. When a datanode fails, the recovery of all the blocks on this datanode may use up the network bandwidth. We need to start recovery tasks at a proper time in order to reduce the impact on the system. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
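To put a number on the 6X overhead described above, the sketch below estimates the bytes that must be read to rebuild the blocks of one failed datanode. The helper name `RecoveryTraffic` and the example figures (1000 lost blocks of 128 MiB each) are assumptions for illustration; only the factor of 6 source blocks per recovered block comes from the description.

```java
// Hypothetical back-of-the-envelope helper; the figures used with it are examples only.
final class RecoveryTraffic {
    private RecoveryTraffic() {}

    /**
     * Bytes read across the cluster to rebuild lostBlocks internal blocks when each
     * rebuild needs sourceBlocksPerRecovery blocks of blockSizeBytes each.
     */
    static long bytesToRead(long lostBlocks, int sourceBlocksPerRecovery, long blockSizeBytes) {
        if (lostBlocks < 0 || sourceBlocksPerRecovery < 0 || blockSizeBytes < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        return Math.multiplyExact(
            Math.multiplyExact(lostBlocks, (long) sourceBlocksPerRecovery), blockSizeBytes);
    }
}
```

With the example figures, `RecoveryTraffic.bytesToRead(1000, 6, 128L * 1024 * 1024)` gives 805306368000 bytes, i.e. 750 GiB read to restore 125 GiB of lost data.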
[jira] [Updated] (HDFS-9374) Inform user when a file with corrupted data blocks are read
[ https://issues.apache.org/jira/browse/HDFS-9374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-9374: Attachment: HDFS-9374-001.patch Another way is to add the decode info to ReadStatistics. > Inform user when a file with corrupted data blocks are read > --- > > Key: HDFS-9374 > URL: https://issues.apache.org/jira/browse/HDFS-9374 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-9374-001.patch > > > When reading a block group with corrupt data blocks, it would be better to > tell user the corrupt blocks so that user can know the status of the file > being read. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (HDFS-7346) Erasure Coding: perform stripping erasure encoding work given block reader and writer
[ https://issues.apache.org/jira/browse/HDFS-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo reopened HDFS-7346: - > Erasure Coding: perform stripping erasure encoding work given block reader > and writer > - > > Key: HDFS-7346 > URL: https://issues.apache.org/jira/browse/HDFS-7346 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: datanode >Reporter: Kai Zheng >Assignee: Li Bo > > This assumes the facilities like block reader and writer are ready, > implements and performs erasure encoding work in *stripping* case utilizing > erasure codec and coder provided by the codec framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7346) Erasure Coding: perform stripping erasure encoding work given block reader and writer
[ https://issues.apache.org/jira/browse/HDFS-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-7346: Release Note: (was: The jira is very old and close it because we'll not handle it in the near future.) > Erasure Coding: perform stripping erasure encoding work given block reader > and writer > - > > Key: HDFS-7346 > URL: https://issues.apache.org/jira/browse/HDFS-7346 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: datanode >Reporter: Kai Zheng >Assignee: Li Bo > > This assumes the facilities like block reader and writer are ready, > implements and performs erasure encoding work in *stripping* case utilizing > erasure codec and coder provided by the codec framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-7346) Erasure Coding: perform stripping erasure encoding work given block reader and writer
[ https://issues.apache.org/jira/browse/HDFS-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo resolved HDFS-7346. - Resolution: Won't Fix Release Note: The jira is very old and close it because we'll not handle it in the near future. > Erasure Coding: perform stripping erasure encoding work given block reader > and writer > - > > Key: HDFS-7346 > URL: https://issues.apache.org/jira/browse/HDFS-7346 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: datanode >Reporter: Kai Zheng >Assignee: Li Bo > > This assumes the facilities like block reader and writer are ready, > implements and performs erasure encoding work in *stripping* case utilizing > erasure codec and coder provided by the codec framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-8889) Erasure Coding: cover more test situations of datanode failure during client writing
[ https://issues.apache.org/jira/browse/HDFS-8889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo resolved HDFS-8889. - Resolution: Won't Fix Release Note: There are already enough test cases for datanode failures. > Erasure Coding: cover more test situations of datanode failure during client > writing > > > Key: HDFS-8889 > URL: https://issues.apache.org/jira/browse/HDFS-8889 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-8889-HDFS-7285-001.patch > > > Currently 9 streamers are working together for the client writing. A small > number of failed datanodes (<= 3) for a block group should not influence the > writing. There’re a lot of datanode failure cases and we should cover as many > as possible in unit test. > Suppose streamer 4 fails, the following situations for the next block group > should be considered: > 1)all streamers succeed > 2)Streamer 4 still fails > 3)only streamer 1 fails > 4)only streamer 8 fails (test parity streamer) > 5)streamer 4 and 6 fail > 6)streamer 4 and 1,6 fail > 7)streamer 4 and 1,2,6 fail > 8)streamer 2, 6 fail > Suppose streamer 2 and 4 fail, the following situations for the next block > group should be considered: > 1)only streamer 2 and 4 fail > 2)streamer 2, 4, 8 fail > 3)only streamer 2 fails > 4)streamer 3 , 8 fail > For a single streamer, we should consider the following situations of the > time of datanode failure: > 1)before writing the first byte > 2)before finishing writing the first cell > 3)right after finishing writing the first cell > 4)before writing the last byte of the block > Other situations: > 1)more than 3 streamers fail at the first block group > 2)more than 3 streamers fail at the last block group > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8762) Erasure Coding: the log of each streamer should show its index
[ https://issues.apache.org/jira/browse/HDFS-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-8762: Resolution: Duplicate Status: Resolved (was: Patch Available) Other jiras have added the necessary index. > Erasure Coding: the log of each streamer should show its index > -- > > Key: HDFS-8762 > URL: https://issues.apache.org/jira/browse/HDFS-8762 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-8762-HDFS-7285-001.patch, > HDFS-8762-HDFS-7285-002.patch > > > The log in {{DataStreamer}} doesn't show which streamer it's generated from. > In order to make log information more convenient for debugging, each log > should include the index of the streamer it's generated from. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work stopped] (HDFS-8171) Extend BlockSender to support multiple block data source
[ https://issues.apache.org/jira/browse/HDFS-8171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-8171 stopped by Li Bo. --- > Extend BlockSender to support multiple block data source > > > Key: HDFS-8171 > URL: https://issues.apache.org/jira/browse/HDFS-8171 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: datanode >Reporter: Li Bo >Assignee: Li Bo > > Currently BlockSender reads a block from the disk and sends it to a remote > datanode. In EC encode/decode work, new blocks are generated by calculation. > In order to store these blocks to remote datanodes, we can ask BlockSender to > read data from the output of encode/decode calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9373) Show friendly information to user when client succeeds the writing with some failed streamers
[ https://issues.apache.org/jira/browse/HDFS-9373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-9373: Attachment: HDFS-9373-003.patch Sorry for the checkstyle error. Patch 003 is updated to fix it. > Show friendly information to user when client succeeds the writing with some > failed streamers > - > > Key: HDFS-9373 > URL: https://issues.apache.org/jira/browse/HDFS-9373 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: erasure-coding >Affects Versions: 3.0.0 >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-9373-001.patch, HDFS-9373-002.patch, > HDFS-9373-003.patch > > > When not more than PARITY_NUM streamers fail for a block group, the client > may still succeed to write the data. But several exceptions are thrown to > user and user has to check the reasons. The friendly way is just inform user > that some streamers fail when writing a block group. It’s not necessary to > show the details of exceptions because a small number of stream failures is > not vital to the client writing. > When only DATA_NUM streamers succeed, the block group is in a high risk > because the corrupt of any block will cause all the six blocks' data lost. We > should give obvious warning to user when this occurs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
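The reporting policy described in this issue can be sketched as below. It is only a rough illustration: BlockGroupWriteReport, describe() and the message texts are hypothetical and are not taken from the attached patches, and DATA_NUM = 6 / PARITY_NUM = 3 are assumed from the "six blocks" remark in the description.

```java
/** Hypothetical sketch, not the HDFS-9373 patch: picks a user-facing message per block group. */
final class BlockGroupWriteReport {
    static final int DATA_NUM = 6;    // assumed number of data streamers
    static final int PARITY_NUM = 3;  // assumed number of parity streamers

    /** Returns the message to show for one block group, or "" if nothing needs reporting. */
    static String describe(String blockGroupId, int failedStreamers) {
        if (failedStreamers < 0 || failedStreamers > DATA_NUM + PARITY_NUM) {
            throw new IllegalArgumentException("invalid failure count: " + failedStreamers);
        }
        if (failedStreamers == 0) {
            return "";
        }
        if (failedStreamers > PARITY_NUM) {
            // Too many failures: the write itself cannot succeed.
            return "ERROR: failed to write block group " + blockGroupId;
        }
        if (failedStreamers == PARITY_NUM) {
            // Only DATA_NUM streamers succeeded: no redundancy is left.
            return "WARN: block group " + blockGroupId
                + " has no redundancy left; corruption of any block loses its data";
        }
        // A small number of failures: inform the user without exception details.
        return "INFO: " + failedStreamers + " streamer(s) failed while writing block group "
            + blockGroupId;
    }
}
```

The design choice illustrated here is that the user sees one short line per affected block group instead of a stack trace per failed streamer.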
[jira] [Commented] (HDFS-9529) Extend Erasure Code to support POWER Chip acceleration
[ https://issues.apache.org/jira/browse/HDFS-9529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047924#comment-15047924 ] Li Bo commented on HDFS-9529: - Hi wqijun, Hadoop-11887 is under "Hadoop common" and this JIRA is trying to solve a similar problem, so I think it would be better to move it to "Hadoop common". > Extend Erasure Code to support POWER Chip acceleration > -- > > Key: HDFS-9529 > URL: https://issues.apache.org/jira/browse/HDFS-9529 > Project: Hadoop HDFS > Issue Type: New Feature > Components: erasure-coding >Affects Versions: 3.0.0 >Reporter: wqijun >Assignee: wqijun > Fix For: 3.0.0 > > > Erasure Code is a very important feature in new HDFS version. This JIRA will > focus on how to extend EC to support multiple types of EC acceleration by C > library and other hardware method, like GPU or FPGA. Compared with > Hadoop-11887, this JIRA will more focus on how to leverage POWER Chip > capability to accelerate the EC calculating. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9373) Show friendly information to user when client succeeds the writing with some failed streamers
[ https://issues.apache.org/jira/browse/HDFS-9373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-9373: Attachment: HDFS-9373-002.patch Thanks to Zhe and Daniel for the review. The patch is updated according to the newest trunk code. The failed block id can be obtained from other log information, so we just need to tell the user which block groups have corrupt blocks. > Show friendly information to user when client succeeds the writing with some > failed streamers > - > > Key: HDFS-9373 > URL: https://issues.apache.org/jira/browse/HDFS-9373 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: erasure-coding >Affects Versions: 3.0.0 >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-9373-001.patch, HDFS-9373-002.patch > > > When not more than PARITY_NUM streamers fail for a block group, the client > may still succeed to write the data. But several exceptions are thrown to > user and user has to check the reasons. The friendly way is just inform user > that some streamers fail when writing a block group. It’s not necessary to > show the details of exceptions because a small number of stream failures is > not vital to the client writing. > When only DATA_NUM streamers succeed, the block group is in a high risk > because the corrupt of any block will cause all the six blocks' data lost. We > should give obvious warning to user when this occurs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails
[ https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo resolved HDFS-8704. - Resolution: Cannot Reproduce > Erasure Coding: client fails to write large file when one datanode fails > > > Key: HDFS-8704 > URL: https://issues.apache.org/jira/browse/HDFS-8704 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, > HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, > HDFS-8704-HDFS-7285-005.patch, HDFS-8704-HDFS-7285-006.patch, > HDFS-8704-HDFS-7285-007.patch, HDFS-8704-HDFS-7285-008.patch > > > I test current code on a 5-node cluster using RS(3,2). When a datanode is > corrupt, client succeeds to write a file smaller than a block group but fails > to write a large one. {{TestDFSStripeOutputStreamWithFailure}} only tests > files smaller than a block group, this jira will add more test situations. > A streamer may encounter some bad datanodes when writing blocks allocated to > it. When it fails to connect datanode or send a packet, the streamer needs to > prepare for the next block. First it removes the packets of current block > from its data queue. If the first packet of next block has already been in > the data queue, the streamer will reset its state and start to wait for the > next block allocated for it; otherwise it will just wait for the first packet > of next block. The streamer will check periodically if it is asked to > terminate during its waiting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
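The recovery steps in the description above (drop the packets of the failed block, then decide what to wait for) can be sketched as follows. This is a simplified model under stated assumptions, not the real DataStreamer: StreamerFailureSketch, Packet and NextStep are hypothetical names, a packet is reduced to the index of the block it belongs to, and the periodic termination check is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical model of a streamer's data queue, not the Hadoop DataStreamer. */
final class StreamerFailureSketch {
    enum NextStep { WAIT_FOR_BLOCK_ALLOCATION, WAIT_FOR_FIRST_PACKET }

    /** A queued packet, reduced to the index of the block it belongs to. */
    static final class Packet {
        final long blockIndex;
        Packet(long blockIndex) { this.blockIndex = blockIndex; }
    }

    /**
     * Called after the streamer failed to connect to a datanode or to send a
     * packet of the block with index failedBlockIndex.
     */
    static NextStep prepareForNextBlock(Deque<Packet> dataQueue, long failedBlockIndex) {
        // Step 1: remove the packets of the current (failed) block.
        dataQueue.removeIf(p -> p.blockIndex == failedBlockIndex);

        // Step 2: if the first packet of the next block is already queued,
        // reset and wait for the next block to be allocated.
        Packet head = dataQueue.peekFirst();
        if (head != null && head.blockIndex == failedBlockIndex + 1) {
            return NextStep.WAIT_FOR_BLOCK_ALLOCATION;
        }
        // Otherwise wait until the first packet of the next block arrives.
        return NextStep.WAIT_FOR_FIRST_PACKET;
    }

    public static void main(String[] args) {
        Deque<Packet> queue = new ArrayDeque<>();
        queue.add(new Packet(0));
        queue.add(new Packet(1));
        System.out.println(prepareForNextBlock(queue, 0));
    }
}
```

A real streamer would also poll for a termination request while waiting, as the description notes.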
[jira] [Commented] (HDFS-9382) Track the acks for the packets which are sent from ErasureCodingWorker as part of reconstruction work
[ https://issues.apache.org/jira/browse/HDFS-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15008062#comment-15008062 ] Li Bo commented on HDFS-9382: - Unifying the readers and writers is very complicated due to the different details, so I have postponed the work. Maybe we can consider the optimization after most of the functionalities of EC are satisfied. > Track the acks for the packets which are sent from ErasureCodingWorker as > part of reconstruction work > - > > Key: HDFS-9382 > URL: https://issues.apache.org/jira/browse/HDFS-9382 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: datanode >Affects Versions: 3.0.0 >Reporter: Uma Maheswara Rao G >Assignee: Uma Maheswara Rao G > > Currently we are not tracking the acks for the packets which are sent from DN > ECWorker as part of reconstruction work. This jira is proposing to tracks the > acks as reconstruction work is really expensive, so we should know if any > packets failed to write at target DN -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9373) Show friendly information to user when client succeeds the writing with some failed streamers
[ https://issues.apache.org/jira/browse/HDFS-9373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-9373: Attachment: HDFS-9373-001.patch > Show friendly information to user when client succeeds the writing with some > failed streamers > - > > Key: HDFS-9373 > URL: https://issues.apache.org/jira/browse/HDFS-9373 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-9373-001.patch > > > When not more than PARITY_NUM streamers fail for a block group, the client > may still succeed to write the data. But several exceptions are thrown to > user and user has to check the reasons. The friendly way is just inform user > that some streamers fail when writing a block group. It’s not necessary to > show the details of exceptions because a small number of stream failures is > not vital to the client writing. > When only DATA_NUM streamers succeed, the block group is in a high risk > because the corrupt of any block will cause all the six blocks' data lost. We > should give obvious warning to user when this occurs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7781) Use block erasure coder in client stripping
[ https://issues.apache.org/jira/browse/HDFS-7781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-7781: Parent Issue: HDFS-8030 (was: HDFS-8031) > Use block erasure coder in client stripping > --- > > Key: HDFS-7781 > URL: https://issues.apache.org/jira/browse/HDFS-7781 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Fix For: HDFS-7285 > > > In HDFS-7729, raw erasure coder is used in order not to depend on the > {{ErasureCoder}} api defined in HDFS-7662 or even {{ErasureCodec}} api > defined in HDFS-7337 since they're still upcoming. > This is a follow up issue to work on that when the high level constructs are > available. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (HDFS-7781) Use block erasure coder in client stripping
[ https://issues.apache.org/jira/browse/HDFS-7781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo reopened HDFS-7781: - > Use block erasure coder in client stripping > --- > > Key: HDFS-7781 > URL: https://issues.apache.org/jira/browse/HDFS-7781 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Fix For: HDFS-7285 > > > In HDFS-7729, raw erasure coder is used in order not to depend on the > {{ErasureCoder}} api defined in HDFS-7662 or even {{ErasureCodec}} api > defined in HDFS-7337 since they're still upcoming. > This is a follow up issue to work on that when the high level constructs are > available. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9373) Show friendly information to user when client succeeds the writing with some failed streamers
[ https://issues.apache.org/jira/browse/HDFS-9373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-9373: Description: When not more than PARITY_NUM streamers fail for a block group, the client may still succeed to write the data. But several exceptions are thrown to user and user has to check the reasons. The friendly way is just inform user that some streamers fail when writing a block group. It’s not necessary to show the details of exceptions because a small number of stream failures is not vital to the client writing. When only DATA_NUM streamers succeed, the block group is in a high risk because the corrupt of any block will cause all the six blocks' data lost. We should give obvious warning to user when this occurs. was:When not more than PARITY_NUM streamers fail for a block group, the client may still succeed to write the data. But several exceptions are thrown to user and user has to check the reasons. The friendly way is just inform user that some streamers fail when writing a block group. It’s not necessary to show the details of exceptions because a small number of stream failures is not vital to the client writing. > Show friendly information to user when client succeeds the writing with some > failed streamers > - > > Key: HDFS-9373 > URL: https://issues.apache.org/jira/browse/HDFS-9373 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > > When not more than PARITY_NUM streamers fail for a block group, the client > may still succeed to write the data. But several exceptions are thrown to > user and user has to check the reasons. The friendly way is just inform user > that some streamers fail when writing a block group. It’s not necessary to > show the details of exceptions because a small number of stream failures is > not vital to the client writing. 
> When only DATA_NUM streamers succeed, the block group is in a high risk > because the corrupt of any block will cause all the six blocks' data lost. We > should give obvious warning to user when this occurs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-9373) Show friendly information to user when client succeeds the writing with some failed streamers
Li Bo created HDFS-9373: --- Summary: Show friendly information to user when client succeeds the writing with some failed streamers Key: HDFS-9373 URL: https://issues.apache.org/jira/browse/HDFS-9373 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo When not more than PARITY_NUM streamers fail for a block group, the client may still succeed to write the data. But several exceptions are thrown to user and user has to check the reasons. The friendly way is just inform user that some streamers fail when writing a block group. It’s not necessary to show the details of exceptions because a small number of stream failures is not vital to the client writing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-9374) Inform user when a file with corrupted data blocks are read
Li Bo created HDFS-9374: --- Summary: Inform user when a file with corrupted data blocks are read Key: HDFS-9374 URL: https://issues.apache.org/jira/browse/HDFS-9374 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo When reading a block group with corrupt data blocks, it would be better to tell user the corrupt blocks so that user can know the status of the file being read. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8410) Add computation time metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-8410: Attachment: HDFS-8410-002.patch > Add computation time metrics to datanode for ECWorker > - > > Key: HDFS-8410 > URL: https://issues.apache.org/jira/browse/HDFS-8410 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-8410-001.patch, HDFS-8410-002.patch > > > This is a sub task of HDFS-7674. It adds time metric for ec decode work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8410) Add computation time metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-8410: Summary: Add computation time metrics to datanode for ECWorker (was: Add time count metrics to datanode for ECWorker) > Add computation time metrics to datanode for ECWorker > - > > Key: HDFS-8410 > URL: https://issues.apache.org/jira/browse/HDFS-8410 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-8410-001.patch > > > This is a sub task of HDFS-7674. It adds time metric for ec decode work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8410) Add computation time metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976021#comment-14976021 ] Li Bo commented on HDFS-8410: - Patch 002 reduces the number of metrics from 3 to 2. The time metrics let the user know how much time each datanode spends on encoding/decoding work. > Add computation time metrics to datanode for ECWorker > - > > Key: HDFS-8410 > URL: https://issues.apache.org/jira/browse/HDFS-8410 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-8410-001.patch, HDFS-8410-002.patch > > > This is a sub task of HDFS-7674. It adds time metric for ec decode work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8529) Add blocks count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-8529: Attachment: HDFS-8529-002.patch > Add blocks count metrics to datanode for ECWorker > - > > Key: HDFS-8529 > URL: https://issues.apache.org/jira/browse/HDFS-8529 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-8529-001.patch, HDFS-8529-002.patch > > > This sub task will add block count metrics to datanode that takes the > encoding and recovery tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-7653) Block Readers and Writers used in both client side and datanode side
[ https://issues.apache.org/jira/browse/HDFS-7653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo resolved HDFS-7653. - Resolution: Won't Fix Release Note: The design and implementation have changed a lot since this issue was created. Due to the complexity of, and the differences between, client-side and datanode-side read/write, we will not solve this issue at the current stage. > Block Readers and Writers used in both client side and datanode side > > > Key: HDFS-7653 > URL: https://issues.apache.org/jira/browse/HDFS-7653 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: BlockReadersWriters.patch > > > There're a lot of block read/write operations in HDFS-EC, for example, when > client writes a file in striping layout, client has to write several blocks > to several different datanodes; if a datanode wants to do an > encoding/decoding task, it has to read several blocks from itself and other > datanodes, and writes one or more blocks to itself or other datanodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-7679) Erasure Coding: unifying common constructs like coding work, block reader and block writer across client and DataNode
[ https://issues.apache.org/jira/browse/HDFS-7679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo resolved HDFS-7679. - Resolution: Won't Fix > Erasure Coding: unifying common constructs like coding work, block reader and > block writer across client and DataNode > - > > Key: HDFS-7679 > URL: https://issues.apache.org/jira/browse/HDFS-7679 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: ECEncodeDecodeFramework.patch > > > Based on the work done, we will have similar constructs like coding work, > local/remote block reader/writer in both client and DataNode side, so it's > possible to refactor the codes further and unify these constructs to > eliminate possible duplicate codes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-7781) Use block erasure coder in client stripping
[ https://issues.apache.org/jira/browse/HDFS-7781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo resolved HDFS-7781. - Resolution: Won't Fix > Use block erasure coder in client stripping > --- > > Key: HDFS-7781 > URL: https://issues.apache.org/jira/browse/HDFS-7781 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Fix For: HDFS-7285 > > > In HDFS-7729, raw erasure coder is used in order not to depend on the > {{ErasureCoder}} api defined in HDFS-7662 or even {{ErasureCodec}} api > defined in HDFS-7337 since they're still upcoming. > This is a follow up issue to work on that when the high level constructs are > available. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8411) Add bytes count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-8411: Attachment: HDFS-8411-003.patch > Add bytes count metrics to datanode for ECWorker > > > Key: HDFS-8411 > URL: https://issues.apache.org/jira/browse/HDFS-8411 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-8411-001.patch, HDFS-8411-002.patch, > HDFS-8411-003.patch > > > This is a sub task of HDFS-7674. It calculates the amount of data that is > read from local or remote to attend decoding work, and also the amount of > data that is written to local or remote datanodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8411) Add bytes count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-8411: Attachment: HDFS-8411-002.patch When some datanodes are corrupted, all their blocks have to be reconstructed by other healthy datanodes. The network traffic incurred is very high, so we may want to track it. We can record the bytes read and written by each datanode. In fact, I think HDFS-8529 (block counts) and HDFS-8410 (time consumed) are not necessary. We can estimate the time cost from the bytes read and written. The block count metric is not very meaningful when there are a lot of small files. We can adjust the metrics to future requirements. > Add bytes count metrics to datanode for ECWorker > > > Key: HDFS-8411 > URL: https://issues.apache.org/jira/browse/HDFS-8411 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-8411-001.patch, HDFS-8411-002.patch > > > This is a sub task of HDFS-7674. It calculates the amount of data that is > read from local or remote to attend decoding work, and also the amount of > data that is written to local or remote datanodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8449) Add tasks count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954670#comment-14954670 ] Li Bo commented on HDFS-8449: - Thanks for the review, [~rakeshr]. The failed tests seem unrelated to this patch. Hi [~jingzhao], could you help me review the patch and commit it to trunk if it's OK? Then we can work on the other metric jiras. Thanks. > Add tasks count metrics to datanode for ECWorker > > > Key: HDFS-8449 > URL: https://issues.apache.org/jira/browse/HDFS-8449 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, > HDFS-8449-002.patch, HDFS-8449-003.patch > > > This sub task try to record ec recovery tasks that a datanode has done, > including total tasks, failed tasks and successful tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails
[ https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-8704: Status: In Progress (was: Patch Available) > Erasure Coding: client fails to write large file when one datanode fails > > > Key: HDFS-8704 > URL: https://issues.apache.org/jira/browse/HDFS-8704 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, > HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, > HDFS-8704-HDFS-7285-005.patch, HDFS-8704-HDFS-7285-006.patch, > HDFS-8704-HDFS-7285-007.patch, HDFS-8704-HDFS-7285-008.patch > > > I test current code on a 5-node cluster using RS(3,2). When a datanode is > corrupt, client succeeds to write a file smaller than a block group but fails > to write a large one. {{TestDFSStripeOutputStreamWithFailure}} only tests > files smaller than a block group, this jira will add more test situations. > A streamer may encounter some bad datanodes when writing blocks allocated to > it. When it fails to connect datanode or send a packet, the streamer needs to > prepare for the next block. First it removes the packets of current block > from its data queue. If the first packet of next block has already been in > the data queue, the streamer will reset its state and start to wait for the > next block allocated for it; otherwise it will just wait for the first packet > of next block. The streamer will check periodically if it is asked to > terminate during its waiting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails
[ https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949702#comment-14949702 ] Li Bo commented on HDFS-8704: - Hi Zhe, I will check and test this case further. If there is no problem, I will close it. Thanks. > Erasure Coding: client fails to write large file when one datanode fails > > > Key: HDFS-8704 > URL: https://issues.apache.org/jira/browse/HDFS-8704 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, > HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, > HDFS-8704-HDFS-7285-005.patch, HDFS-8704-HDFS-7285-006.patch, > HDFS-8704-HDFS-7285-007.patch, HDFS-8704-HDFS-7285-008.patch > > > I test current code on a 5-node cluster using RS(3,2). When a datanode is > corrupt, client succeeds to write a file smaller than a block group but fails > to write a large one. {{TestDFSStripeOutputStreamWithFailure}} only tests > files smaller than a block group, this jira will add more test situations. > A streamer may encounter some bad datanodes when writing blocks allocated to > it. When it fails to connect datanode or send a packet, the streamer needs to > prepare for the next block. First it removes the packets of current block > from its data queue. If the first packet of next block has already been in > the data queue, the streamer will reset its state and start to wait for the > next block allocated for it; otherwise it will just wait for the first packet > of next block. The streamer will check periodically if it is asked to > terminate during its waiting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-8449: Attachment: HDFS-8449-002.patch Patch 002 reduces the number of metrics from 3 to 2. The count of failed tasks can be calculated from the total count and the successful count. > Add tasks count metrics to datanode for ECWorker > > > Key: HDFS-8449 > URL: https://issues.apache.org/jira/browse/HDFS-8449 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, > HDFS-8449-002.patch > > > This sub task try to record ec recovery tasks that a datanode has done, > including total tasks, failed tasks and successful tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
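The idea of keeping two counters and deriving the third can be sketched as below. This is a minimal illustration, not the attached patch: EcTaskCounters is a hypothetical class built on plain AtomicLong fields, whereas the real datanode metrics would presumably go through the Hadoop metrics framework.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: two stored counters, with the failed count derived on demand. */
final class EcTaskCounters {
    private final AtomicLong totalTasks = new AtomicLong();
    private final AtomicLong successfulTasks = new AtomicLong();

    /** Records one finished EC recovery task. */
    void recordTask(boolean succeeded) {
        // Total is incremented first so that total >= successful always holds.
        totalTasks.incrementAndGet();
        if (succeeded) {
            successfulTasks.incrementAndGet();
        }
    }

    long total() {
        return totalTasks.get();
    }

    long successful() {
        return successfulTasks.get();
    }

    /** Derived value: failed = total - successful. */
    long failed() {
        // Read successful before total so a concurrent update cannot make the
        // result negative; it may briefly include a task that is still being recorded.
        long successful = successfulTasks.get();
        return totalTasks.get() - successful;
    }
}
```

Storing only two values keeps the metrics list short, at the cost of a subtraction on the reader's side.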
[jira] [Updated] (HDFS-8449) Add tasks count metrics to datanode for ECWorker
[ https://issues.apache.org/jira/browse/HDFS-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-8449: Attachment: HDFS-8449-003.patch Thanks Rakesh for the careful review. Uploaded a new patch to fix the problems. > Add tasks count metrics to datanode for ECWorker > > > Key: HDFS-8449 > URL: https://issues.apache.org/jira/browse/HDFS-8449 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-8449-000.patch, HDFS-8449-001.patch, > HDFS-8449-002.patch, HDFS-8449-003.patch > > > This sub task try to record ec recovery tasks that a datanode has done, > including total tasks, failed tasks and successful tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9040) Erasure coding: Refactor DFSStripedOutputStream (Move Namenode RPC Requests to Coordinator)
[ https://issues.apache.org/jira/browse/HDFS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902207#comment-14902207 ] Li Bo commented on HDFS-9040: - I have just done a test with and without HDFS-9040-HDFS-7285.003.patch. The results are as follows: 10G: 2m15s VS 2m12s; 30G: 6m28s VS 6m29s; 50G: 10m38s VS 10m48s. We can see that when writing continuously, synchronizing streamers at the output stream level does not hurt performance. > Erasure coding: Refactor DFSStripedOutputStream (Move Namenode RPC Requests > to Coordinator) > --- > > Key: HDFS-9040 > URL: https://issues.apache.org/jira/browse/HDFS-9040 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Walter Su >Assignee: Jing Zhao > Attachments: HDFS-9040-HDFS-7285.002.patch, > HDFS-9040-HDFS-7285.003.patch, HDFS-9040.00.patch, HDFS-9040.001.wip.patch, > HDFS-9040.02.bgstreamer.patch > > > The general idea is to simplify error handling logic. > Proposal 1: > A BlockGroupDataStreamer to communicate with NN to allocate/update block, and > StripedDataStreamer s only have to stream blocks to DNs. > Proposal 2: > See below the > [comment|https://issues.apache.org/jira/browse/HDFS-9040?focusedCommentId=14741388=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14741388] > from [~jingzhao]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9040) Erasure coding: Refactor DFSStripedOutputStream (Move Namenode RPC Requests to Coordinator)
[ https://issues.apache.org/jira/browse/HDFS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14805014#comment-14805014 ] Li Bo commented on HDFS-9040: - Thanks for the deep discussion here. One point confuses me: bq. Only based on internal block lengths we cannot identify the failure For a block that does not belong to the last block group, if its length is smaller than BLOCK_SIZE, we can conclude it’s corrupt; otherwise it is a good block, right? For the last block group, we can calculate the length of each block from the file length, so if a block doesn’t have the required length, we can conclude it’s corrupt. The precondition is that if <= NUM_PARITY streamers fail, we ignore their failures and treat all blocks of this block group as written correctly. Each block in a block group has only one replica, and if we can judge whether a block is corrupt, it may not be necessary to bump the GS. Are there any other points that require bumping the GS? > Erasure coding: Refactor DFSStripedOutputStream (Move Namenode RPC Requests > to Coordinator) > --- > > Key: HDFS-9040 > URL: https://issues.apache.org/jira/browse/HDFS-9040 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Walter Su > Attachments: HDFS-9040-HDFS-7285.002.patch, > HDFS-9040-HDFS-7285.003.patch, HDFS-9040.00.patch, HDFS-9040.001.wip.patch, > HDFS-9040.02.bgstreamer.patch > > > The general idea is to simplify error handling logic. > Proposal 1: > A BlockGroupDataStreamer to communicate with NN to allocate/update block, and > StripedDataStreamer s only have to stream blocks to DNs. > Proposal 2: > See below the > [comment|https://issues.apache.org/jira/browse/HDFS-9040?focusedCommentId=14741388=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14741388] > from [~jingzhao]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
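The length check proposed in this comment can be sketched as follows. It is a rough illustration under stated assumptions, not Hadoop's implementation: InternalBlockLengthSketch and its methods are hypothetical, data is assumed to be striped cell by cell across the data blocks, and a parity block is assumed to be as long as the first data block.

```java
/** Hypothetical sketch of the expected length of one internal block in a block group. */
final class InternalBlockLengthSketch {
    /**
     * @param groupDataSize bytes of user data stored in the block group
     * @param cellSize      striping cell size in bytes
     * @param numDataBlocks number of data blocks in the group
     * @param blockIndex    index of the internal block (data blocks first, then parity)
     */
    static long expectedLength(long groupDataSize, int cellSize, int numDataBlocks,
                               int blockIndex) {
        if (groupDataSize < 0 || cellSize <= 0 || numDataBlocks <= 0 || blockIndex < 0) {
            throw new IllegalArgumentException("invalid argument");
        }
        long stripeSize = (long) cellSize * numDataBlocks;
        long fullStripes = groupDataSize / stripeSize;
        long lastStripeBytes = groupDataSize % stripeSize;

        long lastCell;
        if (blockIndex >= numDataBlocks) {
            // Assumption: a parity block is as long as the first data block.
            lastCell = Math.min(lastStripeBytes, cellSize);
        } else {
            // Bytes of the partial last stripe that fall into this data block.
            long left = lastStripeBytes - (long) blockIndex * cellSize;
            lastCell = Math.max(0, Math.min(left, cellSize));
        }
        return fullStripes * cellSize + lastCell;
    }

    /** A block shorter than its expected length is treated as corrupt in this sketch. */
    static boolean isShorterThanExpected(long actualLength, long expectedLength) {
        return actualLength < expectedLength;
    }
}
```

For a block group that is not the last one, groupDataSize is simply BLOCK_SIZE times the number of data blocks, so the expected length of every internal block comes out as BLOCK_SIZE, which matches the first case in the comment.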
[jira] [Commented] (HDFS-9040) Erasure coding: A BlockGroupDataStreamer to rule all internal blocks streamers
[ https://issues.apache.org/jira/browse/HDFS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743182#comment-14743182 ] Li Bo commented on HDFS-9040: - Thanks to Walter and Jing for their work. In Jing’s patch, {{allocateNewBlock}} is called in {{writeChunk()}}, so all packets in a streamer’s {{dataQueue}} belong to the same block. I think this is a good way to reduce the complexity of the logic. Currently the fastest streamer already has to wait for the other streamers before requesting the next block group from the NN, so I don’t expect the write speed to become noticeably slower. > Erasure coding: A BlockGroupDataStreamer to rule all internal blocks streamers > -- > > Key: HDFS-9040 > URL: https://issues.apache.org/jira/browse/HDFS-9040 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Walter Su >Assignee: Walter Su > Attachments: HDFS-9040.00.patch, HDFS-9040.001.wip.patch > > > A {{BlockGroupDataStreamer}} to communicate with NN to allocate/update block, > and {{StripedDataStreamer}} s only have to stream blocks to DNs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9040) Erasure coding: A BlockGroupDataStreamer to rule all internal blocks streamers
[ https://issues.apache.org/jira/browse/HDFS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14738076#comment-14738076 ] Li Bo commented on HDFS-9040: - Implementing a block group writer is a better solution, but you have to copy a lot of existing code into the new class, which the community may be reluctant to accept. At the beginning I also tried this approach but gave it up after discussion. The current design and implementation make as few changes as possible to the existing code. > Erasure coding: A BlockGroupDataStreamer to rule all internal blocks streamers > -- > > Key: HDFS-9040 > URL: https://issues.apache.org/jira/browse/HDFS-9040 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Walter Su >Assignee: Walter Su > > A {{BlockGroupDataStreamer}} to communicate with NN to allocate/update block, > and {{StripedDataStreamer}} s only have to stream blocks to DNs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails
[ https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-8704: Attachment: HDFS-8704-HDFS-7285-008.patch > Erasure Coding: client fails to write large file when one datanode fails > > > Key: HDFS-8704 > URL: https://issues.apache.org/jira/browse/HDFS-8704 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, > HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, > HDFS-8704-HDFS-7285-005.patch, HDFS-8704-HDFS-7285-006.patch, > HDFS-8704-HDFS-7285-007.patch, HDFS-8704-HDFS-7285-008.patch > > > I test current code on a 5-node cluster using RS(3,2). When a datanode is > corrupt, client succeeds to write a file smaller than a block group but fails > to write a large one. {{TestDFSStripeOutputStreamWithFailure}} only tests > files smaller than a block group, this jira will add more test situations. > A streamer may encounter some bad datanodes when writing blocks allocated to > it. When it fails to connect datanode or send a packet, the streamer needs to > prepare for the next block. First it removes the packets of current block > from its data queue. If the first packet of next block has already been in > the data queue, the streamer will reset its state and start to wait for the > next block allocated for it; otherwise it will just wait for the first packet > of next block. The streamer will check periodically if it is asked to > terminate during its waiting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails
[ https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14733248#comment-14733248 ] Li Bo commented on HDFS-8704: - I have tried replacing the failed streamer with a new one. During the replacement, the output stream has to stop sending packets to the old streamer and start sending packets to the new streamer only after all packets of the next block have been moved from the old streamer to the new one. This is much more difficult than restarting the failed streamer. With automatic restart of the failed streamer, the output stream does not need to care whether a streamer has failed. > Erasure Coding: client fails to write large file when one datanode fails > > > Key: HDFS-8704 > URL: https://issues.apache.org/jira/browse/HDFS-8704 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, > HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, > HDFS-8704-HDFS-7285-005.patch, HDFS-8704-HDFS-7285-006.patch, > HDFS-8704-HDFS-7285-007.patch > > > I test current code on a 5-node cluster using RS(3,2). When a datanode is > corrupt, client succeeds to write a file smaller than a block group but fails > to write a large one. {{TestDFSStripeOutputStreamWithFailure}} only tests > files smaller than a block group, this jira will add more test situations. > A streamer may encounter some bad datanodes when writing blocks allocated to > it. When it fails to connect datanode or send a packet, the streamer needs to > prepare for the next block. First it removes the packets of current block > from its data queue. If the first packet of next block has already been in > the data queue, the streamer will reset its state and start to wait for the > next block allocated for it; otherwise it will just wait for the first packet > of next block. The streamer will check periodically if it is asked to > terminate during its waiting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails
[ https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-8704: Attachment: HDFS-8704-HDFS-7285-007.patch > Erasure Coding: client fails to write large file when one datanode fails > > > Key: HDFS-8704 > URL: https://issues.apache.org/jira/browse/HDFS-8704 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, > HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, > HDFS-8704-HDFS-7285-005.patch, HDFS-8704-HDFS-7285-006.patch, > HDFS-8704-HDFS-7285-007.patch > > > I test current code on a 5-node cluster using RS(3,2). When a datanode is > corrupt, client succeeds to write a file smaller than a block group but fails > to write a large one. {{TestDFSStripeOutputStreamWithFailure}} only tests > files smaller than a block group, this jira will add more test situations. > A streamer may encounter some bad datanodes when writing blocks allocated to > it. When it fails to connect datanode or send a packet, the streamer needs to > prepare for the next block. First it removes the packets of current block > from its data queue. If the first packet of next block has already been in > the data queue, the streamer will reset its state and start to wait for the > next block allocated for it; otherwise it will just wait for the first packet > of next block. The streamer will check periodically if it is asked to > terminate during its waiting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails
[ https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14732314#comment-14732314 ] Li Bo commented on HDFS-8704: - Thanks to [~walter.k.su] and [~zhz] for the review. The failed test cases are caused by the unit test itself; I have fixed them in patch 007. The test hangs at {{TestDFSStripedOutputStreamWithFailure#DFSTestUtil.waitReplication}}, so I have simply removed that statement in the patch to make the tests pass. [~walter.k.su], could you help me check this problem? I will switch to HDFS-8383 later. > Erasure Coding: client fails to write large file when one datanode fails > > > Key: HDFS-8704 > URL: https://issues.apache.org/jira/browse/HDFS-8704 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Li Bo >Assignee: Li Bo > Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, > HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, > HDFS-8704-HDFS-7285-005.patch, HDFS-8704-HDFS-7285-006.patch, > HDFS-8704-HDFS-7285-007.patch > > > I test current code on a 5-node cluster using RS(3,2). When a datanode is > corrupt, client succeeds to write a file smaller than a block group but fails > to write a large one. {{TestDFSStripeOutputStreamWithFailure}} only tests > files smaller than a block group, this jira will add more test situations. > A streamer may encounter some bad datanodes when writing blocks allocated to > it. When it fails to connect datanode or send a packet, the streamer needs to > prepare for the next block. First it removes the packets of current block > from its data queue. If the first packet of next block has already been in > the data queue, the streamer will reset its state and start to wait for the > next block allocated for it; otherwise it will just wait for the first packet > of next block. The streamer will check periodically if it is asked to > terminate during its waiting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8383) Tolerate multiple failures in DFSStripedOutputStream
[ https://issues.apache.org/jira/browse/HDFS-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14732320#comment-14732320 ] Li Bo commented on HDFS-8383: - Thanks [~walter.k.su] for the work! I have just read the code and found some points to discuss: 1) When only one streamer fails, do we need to do anything? I think we can just ignore the failed streamer unless more than 3 streamers are found to have failed. The offline decoding work will be started by some datanode later. 2) I don’t think it’s right to set the failed status of a streamer in the output stream, because the streamer and the output stream run asynchronously. I have given some reasons in HDFS-8704. The output stream doesn’t need to care about the status of each streamer if just one or two streamers fail. That way the logic of the output stream is not complicated. 3) The error handling is not very clear to me. For example, streamer_i fails to write a packet of block_j but succeeds in writing block_j+1; could you describe this situation in more detail? > Tolerate multiple failures in DFSStripedOutputStream > > > Key: HDFS-8383 > URL: https://issues.apache.org/jira/browse/HDFS-8383 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Tsz Wo Nicholas Sze >Assignee: Walter Su > Attachments: HDFS-8383.00.patch, HDFS-8383.01.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails
[ https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14718235#comment-14718235 ] Li Bo commented on HDFS-8704: - The two failed test cases are about insufficient datanodes. They also fail without the patch. We can handle them in a separate jira. Erasure Coding: client fails to write large file when one datanode fails Key: HDFS-8704 URL: https://issues.apache.org/jira/browse/HDFS-8704 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, HDFS-8704-HDFS-7285-005.patch, HDFS-8704-HDFS-7285-006.patch I test current code on a 5-node cluster using RS(3,2). When a datanode is corrupt, client succeeds to write a file smaller than a block group but fails to write a large one. {{TestDFSStripeOutputStreamWithFailure}} only tests files smaller than a block group, this jira will add more test situations. A streamer may encounter some bad datanodes when writing blocks allocated to it. When it fails to connect datanode or send a packet, the streamer needs to prepare for the next block. First it removes the packets of current block from its data queue. If the first packet of next block has already been in the data queue, the streamer will reset its state and start to wait for the next block allocated for it; otherwise it will just wait for the first packet of next block. The streamer will check periodically if it is asked to terminate during its waiting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails
[ https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-8704: Attachment: HDFS-8704-HDFS-7285-006.patch Update the patch based on current code of 7285 Erasure Coding: client fails to write large file when one datanode fails Key: HDFS-8704 URL: https://issues.apache.org/jira/browse/HDFS-8704 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, HDFS-8704-HDFS-7285-005.patch, HDFS-8704-HDFS-7285-006.patch I test current code on a 5-node cluster using RS(3,2). When a datanode is corrupt, client succeeds to write a file smaller than a block group but fails to write a large one. {{TestDFSStripeOutputStreamWithFailure}} only tests files smaller than a block group, this jira will add more test situations. A streamer may encounter some bad datanodes when writing blocks allocated to it. When it fails to connect datanode or send a packet, the streamer needs to prepare for the next block. First it removes the packets of current block from its data queue. If the first packet of next block has already been in the data queue, the streamer will reset its state and start to wait for the next block allocated for it; otherwise it will just wait for the first packet of next block. The streamer will check periodically if it is asked to terminate during its waiting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails
[ https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14712581#comment-14712581 ] Li Bo commented on HDFS-8704: - Thanks, Zhe, for the review. Handling datanode failure across multiple block groups is complex, so let me give a more detailed explanation. Suppose packets {{p1,p2,p3,p4}} belong to the first block and {{p5,p6,p7,p8}} belong to the next one. Now the streamer fails to send packet p2 and sets its {{streamerClosed}} to true. Its {{dataQueue}} still contains {{p3,p4,p5}}. Then {{DFSStripedOutputStream}} will fail to write the data of packet p6 and will mark this streamer as failed. After that, {{DFSStripedOutputStream}} will no longer write {{p7,p8}} to that streamer. But the streamer may be allocated a good datanode for the next block, in which case {{p5,p6,p7,p8}} should be written to that datanode successfully. We only see {{StripedDataStreamer#setFailed(true)}}; where and when is the streamer marked as not failed again? Because {{StripedDataStreamer}} and {{DFSStripedOutputStream}} run asynchronously, it’s more reasonable to keep {{DFSStripedOutputStream}} unaware of the status of the streamers unless there are not enough healthy streamers left. When a streamer fails to connect to a datanode or to write a packet, it removes the remaining, now trivial, packets of the current block from its {{dataQueue}} and waits for the next block to be allocated to it. 1. Please see the explanation above. 2. If you only write a file smaller than a block group, you will not see the problem with {{setFailed}}. The failed status should be set and cleared by the data streamer itself, not by the output stream. 3. When a streamer fails, it has to do some extra work to prepare for the next block. This is very difficult to achieve without overriding {{run}}. I will add a description to the JIRA summary later. When a packet fails to be sent, the following packets belonging to the same block are trivial packets; you just need to remove them from the {{dataQueue}}. 4. I will update the patch after HDFS-8838 is committed. 5. For the small issues: if a streamer fails at some block, it still sends the coordinator an end block with a negative {{numBytes}}. Erasure Coding: client fails to write large file when one datanode fails Key: HDFS-8704 URL: https://issues.apache.org/jira/browse/HDFS-8704 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, HDFS-8704-HDFS-7285-005.patch I test current code on a 5-node cluster using RS(3,2). When a datanode is corrupt, client succeeds to write a file smaller than a block group but fails to write a large one. {{TestDFSStripeOutputStreamWithFailure}} only tests files smaller than a block group, this jira will add more test situations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
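The p1..p8 scenario in the comment above can be replayed with a small sketch. It is not taken from any HDFS-8704 patch: the {{Packet}} class, its {{blockIndex}} field and the method name are hypothetical stand-ins, and the queue is a plain {{ArrayDeque}} rather than the real streamer's {{dataQueue}}. It only illustrates the idea that, after p2 fails, the remaining packets of the same block are dropped and the first packet of the next block stays queued.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch only: Packet and the method below are hypothetical, not HDFS classes.
final class TrivialPacketPruning {

    static final class Packet {
        final String name;
        final int blockIndex; // which block of the streamer this packet belongs to

        Packet(String name, int blockIndex) {
            this.name = name;
            this.blockIndex = blockIndex;
        }
    }

    /** Drops the queued packets that still belong to the failed block; returns how many were dropped. */
    static int dropPacketsOfFailedBlock(Deque<Packet> dataQueue, int failedBlock) {
        int dropped = 0;
        while (!dataQueue.isEmpty() && dataQueue.peekFirst().blockIndex == failedBlock) {
            dataQueue.pollFirst();
            dropped++;
        }
        return dropped;
    }

    public static void main(String[] args) {
        // Sending p2 (block 0) failed; p3, p4 (block 0) and p5 (block 1) are still queued.
        Deque<Packet> dataQueue = new ArrayDeque<>();
        dataQueue.add(new Packet("p3", 0));
        dataQueue.add(new Packet("p4", 0));
        dataQueue.add(new Packet("p5", 1));

        int dropped = dropPacketsOfFailedBlock(dataQueue, 0);
        System.out.println("dropped " + dropped + ", next packet: " + dataQueue.peekFirst().name);
    }
}
```

In this sketch the decision is made entirely on the streamer side, which matches the argument in the comment that the output stream should not need to track per-streamer failure state.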
[jira] [Updated] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails
[ https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-8704: Description: I test current code on a 5-node cluster using RS(3,2). When a datanode is corrupt, client succeeds to write a file smaller than a block group but fails to write a large one. {{TestDFSStripeOutputStreamWithFailure}} only tests files smaller than a block group, this jira will add more test situations. A streamer may encounter some bad datanodes when writing blocks allocated to it. When it fails to connect datanode or send a packet, the streamer needs to prepare for the next block. First it removes the packets of current block from its data queue. If the first packet of next block has already been in the data queue, the streamer will reset its state and start to wait for the next block allocated for it; otherwise it will just wait for the first packet of next block. The streamer will check periodically if it is asked to terminate during its waiting. was:I test current code on a 5-node cluster using RS(3,2). When a datanode is corrupt, client succeeds to write a file smaller than a block group but fails to write a large one. {{TestDFSStripeOutputStreamWithFailure}} only tests files smaller than a block group, this jira will add more test situations. Erasure Coding: client fails to write large file when one datanode fails Key: HDFS-8704 URL: https://issues.apache.org/jira/browse/HDFS-8704 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, HDFS-8704-HDFS-7285-005.patch I test current code on a 5-node cluster using RS(3,2). When a datanode is corrupt, client succeeds to write a file smaller than a block group but fails to write a large one. {{TestDFSStripeOutputStreamWithFailure}} only tests files smaller than a block group, this jira will add more test situations. 
A streamer may encounter some bad datanodes when writing blocks allocated to it. When it fails to connect datanode or send a packet, the streamer needs to prepare for the next block. First it removes the packets of current block from its data queue. If the first packet of next block has already been in the data queue, the streamer will reset its state and start to wait for the next block allocated for it; otherwise it will just wait for the first packet of next block. The streamer will check periodically if it is asked to terminate during its waiting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
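The last step of the description, waiting for the first packet of the next block while periodically checking for a termination request, might look roughly like the sketch below. This is an assumption-laden illustration rather than the patch's code: the class and method names are hypothetical, a {{BlockingQueue}} stands in for the streamer's data queue, and an {{AtomicBoolean}} stands in for whatever flag the real streamer checks.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch only: names are hypothetical, not taken from the HDFS-8704 patch.
final class WaitForNextBlock {

    /**
     * Waits for the first packet of the next block. Wakes up every pollMillis
     * to check whether termination was requested; returns null in that case
     * or if the waiting thread is interrupted.
     */
    static <T> T awaitFirstPacket(BlockingQueue<T> dataQueue, AtomicBoolean terminate, long pollMillis) {
        try {
            while (!terminate.get()) {
                T packet = dataQueue.poll(pollMillis, TimeUnit.MILLISECONDS);
                if (packet != null) {
                    return packet;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return null;
    }

    public static void main(String[] args) {
        BlockingQueue<String> dataQueue = new LinkedBlockingQueue<>();
        AtomicBoolean terminate = new AtomicBoolean(false);

        dataQueue.add("p5"); // first packet of the next block is already queued
        System.out.println("got: " + awaitFirstPacket(dataQueue, terminate, 50));

        terminate.set(true); // streamer asked to stop while the queue is empty
        System.out.println("got: " + awaitFirstPacket(dataQueue, terminate, 50));
    }
}
```

Polling with a timeout keeps the wait bounded, so a failed streamer cannot block shutdown of the output stream indefinitely.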
[jira] [Commented] (HDFS-8838) Tolerate datanode failures in DFSStripedOutputStream when the data length is small
[ https://issues.apache.org/jira/browse/HDFS-8838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14710695#comment-14710695 ] Li Bo commented on HDFS-8838: - Hi Walter Su, I looked at the branches. The ones related to 7285 are remotes/origin/HDFS-7285, remotes/origin/HDFS-7285-REBASE and remotes/origin/HDFS-7285-merge. What is the current relationship among them? Thanks Tolerate datanode failures in DFSStripedOutputStream when the data length is small -- Key: HDFS-8838 URL: https://issues.apache.org/jira/browse/HDFS-8838 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze Attachments: HDFS-8838-HDFS-7285-000.patch, HDFS-8838-HDFS-7285-20150809-test.patch, HDFS-8838-HDFS-7285-20150809.patch, HDFS-8838-HDFS-7285-20150821.patch, h8838_20150729.patch, h8838_20150731-HDFS-7285.patch, h8838_20150731.log, h8838_20150731.patch, h8838_20150804-HDFS-7285.patch, h8838_20150809.patch Currently, DFSStripedOutputStream cannot tolerate datanode failures when the data length is small. We fix the bugs here and add more tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8889) Erasure Coding: cover more test situations of datanode failure during client writing
[ https://issues.apache.org/jira/browse/HDFS-8889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-8889: Attachment: HDFS-8889-HDFS-7285-001.patch Erasure Coding: cover more test situations of datanode failure during client writing Key: HDFS-8889 URL: https://issues.apache.org/jira/browse/HDFS-8889 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo Attachments: HDFS-8889-HDFS-7285-001.patch Currently 9 streamers are working together for the client writing. A small number of failed datanodes (<= 3) for a block group should not influence the writing. There’re a lot of datanode failure cases and we should cover as many as possible in unit test. Suppose streamer 4 fails, the following situations for the next block group should be considered: 1) all streamers succeed 2) Streamer 4 still fails 3) only streamer 1 fails 4) only streamer 8 fails (test parity streamer) 5) streamer 4 and 6 fail 6) streamer 4 and 1,6 fail 7) streamer 4 and 1,2,6 fail 8) streamer 2, 6 fail Suppose streamer 2 and 4 fail, the following situations for the next block group should be considered: 1) only streamer 2 and 4 fail 2) streamer 2, 4, 8 fail 3) only streamer 2 fails 4) streamer 3, 8 fail For a single streamer, we should consider the following situations of the time of datanode failure: 1) before writing the first byte 2) before finishing writing the first cell 3) right after finishing writing the first cell 4) before writing the last byte of the block Other situations: 1) more than 3 streamers fail at the first block group 2) more than 3 streamers fail at the last block group more … -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8838) Tolerate datanode failures in DFSStripedOutputStream when the data length is small
[ https://issues.apache.org/jira/browse/HDFS-8838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14708942#comment-14708942 ] Li Bo commented on HDFS-8838: - hi, Jing, some error still exists when applying to newest code. Could you check again? Thanks Tolerate datanode failures in DFSStripedOutputStream when the data length is small -- Key: HDFS-8838 URL: https://issues.apache.org/jira/browse/HDFS-8838 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze Attachments: HDFS-8838-HDFS-7285-000.patch, HDFS-8838-HDFS-7285-20150809-test.patch, HDFS-8838-HDFS-7285-20150809.patch, HDFS-8838-HDFS-7285-20150821.patch, h8838_20150729.patch, h8838_20150731-HDFS-7285.patch, h8838_20150731.log, h8838_20150731.patch, h8838_20150804-HDFS-7285.patch, h8838_20150809.patch Currently, DFSStripedOutputStream cannot tolerate datanode failures when the data length is small. We fix the bugs here and add more tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8889) Erasure Coding: cover more test situations of datanode failure during client writing
[ https://issues.apache.org/jira/browse/HDFS-8889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-8889: Description: Currently 9 streamers are working together for the client writing. A small number of failed datanodes (<= 3) for a block group should not influence the writing. There’re a lot of datanode failure cases and we should cover as many as possible in unit test. Suppose streamer 4 fails, the following situations for the next block group should be considered: 1) all streamers succeed 2) Streamer 4 still fails 3) only streamer 1 fails 4) only streamer 8 fails (test parity streamer) 5) streamer 4 and 6 fail 6) streamer 4 and 1,6 fail 7) streamer 4 and 1,2,6 fail 8) streamer 2, 6 fail Suppose streamer 2 and 4 fail, the following situations for the next block group should be considered: 1) only streamer 2 and 4 fail 2) streamer 2, 4, 8 fail 3) only streamer 2 fails 4) streamer 3 , 8 fail For a single streamer, we should consider the following situations of the time of datanode failure: 1) before writing the first byte 2) before finishing writing the first cell 3) right after finishing writing the first cell 4) before writing the last byte of the block Other situations: 1) more than 3 streamers fail at the first block group 2) more than 3 streamers fail at the last block group more … Erasure Coding: cover more test situations of datanode failure during client writing Key: HDFS-8889 URL: https://issues.apache.org/jira/browse/HDFS-8889 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo Currently 9 streamers are working together for the client writing. A small number of failed datanodes (<= 3) for a block group should not influence the writing. There’re a lot of datanode failure cases and we should cover as many as possible in unit test. 
Suppose streamer 4 fails, the following situations for the next block group should be considered: 1)all streamers succeed 2)Streamer 4 still fails 3)only streamer 1 fails 4)only streamer 8 fails (test parity streamer) 5)streamer 4 and 6 fail 6)streamer 4 and 1,6 fail 7)streamer 4 and 1,2,6 fail 8)streamer 2, 6 fail Suppose streamer 2 and 4 fail, the following situations for the next block group should be considered: 1)only streamer 2 and 4 fail 2)streamer 2, 4, 8 fail 3)only streamer 2 fails 4)streamer 3 , 8 fail For a single streamer, we should consider the following situations of the time of datanode failure: 1)before writing the first byte 2)before finishing writing the first cell 3)right after finishing writing the first cell 4)before writing the last byte of the block Other situations: 1)more than 3 streamers fail at the first block group 2)more than 3 streamers fail at the last block group more … -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8889) Erasure Coding: cover more test situations of datanode failure during client writing
[ https://issues.apache.org/jira/browse/HDFS-8889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14693047#comment-14693047 ] Li Bo commented on HDFS-8889: - Currently 9 streamers work together for client writing. A small number of failed datanodes (<= 3) for a block group should not affect the write. There are many datanode failure cases, and we should cover as many as possible in unit tests. Suppose streamer 4 fails; the following situations for the next block group should be considered: 1) all streamers succeed 2) streamer 4 still fails 3) only streamer 1 fails 4) only streamer 8 fails (test parity streamer) 5) streamers 4 and 6 fail 6) streamers 4 and 1, 6 fail 7) streamers 4 and 1, 2, 6 fail 8) streamers 2, 6 fail Suppose streamers 2 and 4 fail; the following situations for the next block group should be considered: 1) only streamers 2 and 4 fail 2) streamers 2, 4, 8 fail 3) only streamer 2 fails 4) streamers 3, 8 fail For a single streamer, we should consider the following points in time for the datanode failure: 1) before writing the first byte 2) before finishing writing the first cell 3) right after finishing writing the first cell 4) before writing the last byte of the block Other situations: 1) more than 3 streamers fail at the first block group 2) more than 3 streamers fail at the last block group more … Erasure Coding: cover more test situations of datanode failure during client writing Key: HDFS-8889 URL: https://issues.apache.org/jira/browse/HDFS-8889 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo -- This message was sent by Atlassian JIRA (v6.3.4#6332)
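The "next block group" cases listed above for a failed streamer 4 could be driven from a small table, as in the sketch below. It is not part of the HDFS-8889 patch: the class and helper names are hypothetical, and the expected outcome per case is an assumption derived only from the stated rule that up to 3 failed streamers per block group should be tolerated (the comment itself does not spell out the expected result of each case).

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative sketch only: names are hypothetical and the tolerance rule is the
// "<= 3 failed datanodes per block group" statement from the comment above.
final class FailureScenarios {

    static final int MAX_TOLERATED_FAILURES = 3;

    /** Assumption: a block group write should succeed iff at most 3 of its 9 streamers fail. */
    static boolean writeShouldSucceed(Set<Integer> failedStreamers) {
        return failedStreamers.size() <= MAX_TOLERATED_FAILURES;
    }

    static Set<Integer> failed(Integer... streamerIndexes) {
        return new LinkedHashSet<>(Arrays.asList(streamerIndexes));
    }

    public static void main(String[] args) {
        // Failure sets for the next block group after streamer 4 failed (cases 1-8 above).
        Set<Integer>[] cases = new Set[] {
            failed(), failed(4), failed(1), failed(8),
            failed(4, 6), failed(4, 1, 6), failed(4, 1, 2, 6), failed(2, 6)
        };
        for (int i = 0; i < cases.length; i++) {
            System.out.println("case " + (i + 1) + " failed=" + cases[i]
                + " writeShouldSucceed=" + writeShouldSucceed(cases[i]));
        }
    }
}
```

A table like this keeps the scenario list and the test loop in one place, so adding the "more …" cases later only means adding rows.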
[jira] [Created] (HDFS-8889) Erasure Coding: cover more test situations of datanode failure during client writing
Li Bo created HDFS-8889: --- Summary: Erasure Coding: cover more test situations of datanode failure during client writing Key: HDFS-8889 URL: https://issues.apache.org/jira/browse/HDFS-8889 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8857) Erasure Coding: Fix ArrayIndexOutOfBoundsException in TestWriteStripedFileWithFailure
[ https://issues.apache.org/jira/browse/HDFS-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659363#comment-14659363 ] Li Bo commented on HDFS-8857: - Thanks Kai for the review. I have just committed the code to 7285. Erasure Coding: Fix ArrayIndexOutOfBoundsException in TestWriteStripedFileWithFailure - Key: HDFS-8857 URL: https://issues.apache.org/jira/browse/HDFS-8857 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo Attachments: HDFS-8857-HDFS-7285-001.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails
[ https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654873#comment-14654873 ] Li Bo commented on HDFS-8704: - The failure of {{org.apache.hadoop.hdfs.TestWriteStripedFileWithFailure.testWriteStripedFileWithDNFailure}} has been fixed by HDFS-8857. The failure of {{org.apache.hadoop.hdfs.server.namenode.TestFileTruncate.testTruncateWithDataNodesRestartImmediately}} seems to be unrelated to this patch. Erasure Coding: client fails to write large file when one datanode fails Key: HDFS-8704 URL: https://issues.apache.org/jira/browse/HDFS-8704 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, HDFS-8704-HDFS-7285-005.patch I test current code on a 5-node cluster using RS(3,2). When a datanode is corrupt, client succeeds to write a file smaller than a block group but fails to write a large one. {{TestDFSStripeOutputStreamWithFailure}} only tests files smaller than a block group, this jira will add more test situations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8857) Erasure Coding: Fix ArrayIndexOutOfBoundsException in TestWriteStripedFileWithFailure
[ https://issues.apache.org/jira/browse/HDFS-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Bo updated HDFS-8857:

Resolution: Fixed
Status: Resolved (was: Patch Available)

Erasure Coding: Fix ArrayIndexOutOfBoundsException in TestWriteStripedFileWithFailure

Key: HDFS-8857
URL: https://issues.apache.org/jira/browse/HDFS-8857
Project: Hadoop HDFS
Issue Type: Sub-task
Reporter: Li Bo
Assignee: Li Bo
Attachments: HDFS-8857-HDFS-7285-001.patch
[jira] [Updated] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails
[ https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Bo updated HDFS-8704:

Attachment: HDFS-8704-HDFS-7285-005.patch

Erasure Coding: client fails to write large file when one datanode fails

Key: HDFS-8704
URL: https://issues.apache.org/jira/browse/HDFS-8704
Project: Hadoop HDFS
Issue Type: Sub-task
Reporter: Li Bo
Assignee: Li Bo
Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, HDFS-8704-HDFS-7285-005.patch

I tested the current code on a 5-node cluster using RS(3,2). When a datanode is corrupt, the client succeeds in writing a file smaller than a block group but fails to write a larger one. {{TestDFSStripeOutputStreamWithFailure}} only tests files smaller than a block group; this JIRA will add more test cases.
[jira] [Commented] (HDFS-8838) Tolerate datanode failures in DFSStripedOutputStream when the data length is small
[ https://issues.apache.org/jira/browse/HDFS-8838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653136#comment-14653136 ]

Li Bo commented on HDFS-8838:

Hi [~walter.k.su] and [~szetszwo], could you help review the patch for HDFS-8704 if you have time?

Tolerate datanode failures in DFSStripedOutputStream when the data length is small

Key: HDFS-8838
URL: https://issues.apache.org/jira/browse/HDFS-8838
Project: Hadoop HDFS
Issue Type: Sub-task
Components: hdfs-client
Reporter: Tsz Wo Nicholas Sze
Assignee: Tsz Wo Nicholas Sze
Attachments: h8838_20150729.patch, h8838_20150731.patch

Currently, DFSStripedOutputStream cannot tolerate datanode failures when the data length is small. We fix the bugs here and add more tests.