[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-05-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846683#comment-17846683
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran merged PR #6825:
URL: https://github.com/apache/hadoop/pull/6825




> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-05-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846682#comment-17846682
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran merged PR #6824:
URL: https://github.com/apache/hadoop/pull/6824




> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-05-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846680#comment-17846680
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran commented on PR #6824:
URL: https://github.com/apache/hadoop/pull/6824#issuecomment-2112850821

   update: ignoring yetus. the field is final, this is essentially a case class.




> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-05-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846146#comment-17846146
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

hadoop-yetus commented on PR #6825:
URL: https://github.com/apache/hadoop/pull/6825#issuecomment-2109190422

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |  14m  5s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  1s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  shelldocs  |   0m  1s |  |  Shelldocs was not available.  |
   | +0 :ok: |  markdownlint  |   0m  0s |  |  markdownlint was not available.  
|
   | +0 :ok: |  xmllint  |   0m  0s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 11 new or modified test files.  |
    _ branch-3.3 Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m 53s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  40m 34s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |  19m 32s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   3m  3s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   4m  9s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   2m 35s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  spotbugs  |   7m 15s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  shadedclient  |  40m 21s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 32s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   3m 13s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  18m 42s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |  18m 42s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   2m 50s | 
[/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6825/1/artifact/out/results-checkstyle-root.txt)
 |  root: The patch generated 1 new + 22 unchanged - 0 fixed = 23 total (was 
22)  |
   | +1 :green_heart: |  mvnsite  |   3m 59s |  |  the patch passed  |
   | +1 :green_heart: |  shellcheck  |   0m 32s |  |  No new issues.  |
   | +1 :green_heart: |  javadoc  |   2m 32s |  |  the patch passed  |
   | +1 :green_heart: |  spotbugs  |   7m 58s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  41m 21s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   8m 49s |  |  hadoop-mapreduce-client-core in 
the patch passed.  |
   | +1 :green_heart: |  unit  | 159m 20s |  |  hadoop-mapreduce-project in the 
patch passed.  |
   | +1 :green_heart: |  unit  |   2m 33s |  |  hadoop-azure in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   1m  4s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 407m 26s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6825/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6825 |
   | Optional Tests | dupname asflicense codespell detsecrets shellcheck 
shelldocs compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs 
checkstyle markdownlint xmllint |
   | uname | Linux f6a1aab3dfb3 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / d16eb0a75adcc81bbc90e7c8122632eb8ead494a |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6825/1/testReport/ |
   | Max. process+thread count | 1238 (vs. ulimit of 5500) |
   | modules | C: 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-mapreduce-project hadoop-tools/hadoop-azure U: . |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6825/1/console |
   | versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 shellcheck=0.4.6 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




> 

[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-05-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846122#comment-17846122
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

hadoop-yetus commented on PR #6824:
URL: https://github.com/apache/hadoop/pull/6824#issuecomment-2109077510

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   6m 58s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  1s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  shelldocs  |   0m  1s |  |  Shelldocs was not available.  |
   | +0 :ok: |  markdownlint  |   0m  0s |  |  markdownlint was not available.  
|
   | +0 :ok: |  xmllint  |   0m  0s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 11 new or modified test files.  |
    _ branch-3.4 Compile Tests _ |
   | +0 :ok: |  mvndep  |   4m  3s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  30m 45s |  |  branch-3.4 passed  |
   | +1 :green_heart: |  compile  |   9m 33s |  |  branch-3.4 passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  compile  |   8m 57s |  |  branch-3.4 passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  checkstyle  |   2m 16s |  |  branch-3.4 passed  |
   | +1 :green_heart: |  mvnsite  |   2m 19s |  |  branch-3.4 passed  |
   | +1 :green_heart: |  javadoc  |   2m  3s |  |  branch-3.4 passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 54s |  |  branch-3.4 passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   4m  0s |  |  branch-3.4 passed  |
   | +1 :green_heart: |  shadedclient  |  21m 56s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 22s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 15s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   8m 57s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javac  |   8m 57s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   8m 32s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  javac  |   8m 32s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   2m  8s | 
[/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6824/1/artifact/out/results-checkstyle-root.txt)
 |  root: The patch generated 1 new + 22 unchanged - 0 fixed = 23 total (was 
22)  |
   | +1 :green_heart: |  mvnsite  |   2m  6s |  |  the patch passed  |
   | +1 :green_heart: |  shellcheck  |   0m 10s |  |  No new issues.  |
   | +1 :green_heart: |  javadoc  |   2m  1s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 57s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   4m 28s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  21m 57s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   6m 33s |  |  hadoop-mapreduce-client-core in 
the patch passed.  |
   | -1 :x: |  unit  | 119m 38s | 
[/patch-unit-hadoop-mapreduce-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6824/1/artifact/out/patch-unit-hadoop-mapreduce-project.txt)
 |  hadoop-mapreduce-project in the patch failed.  |
   | -1 :x: |  unit  |   0m 34s | 
[/patch-unit-hadoop-tools_hadoop-azure.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6824/1/artifact/out/patch-unit-hadoop-tools_hadoop-azure.txt)
 |  hadoop-azure in the patch failed.  |
   | +0 :ok: |  asflicense  |   0m 35s |  |  ASF License check generated no 
output?  |
   |  |   | 281m 57s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.mapreduce.v2.TestMRJobs |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.45 ServerAPI=1.45 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6824/1/artifact/out/Dockerfile
 |
   | 

[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-05-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846064#comment-17846064
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran opened a new pull request, #6825:
URL: https://github.com/apache/hadoop/pull/6825

   Improve task commit resilience everywhere
   and add an option to reduce delete IO requests on
   job cleanup (relevant for ABFS and HDFS).
   
   Task Commit Resilience
   --
   
   Task manifest saving is re-attempted on failure; the number of  attempts 
made is configurable with the option:
   
 mapreduce.manifest.committer.manifest.save.attempts
   
   * The default is 5.
   * The minimum is 1; asking for less is ignored.
   * A retry policy adds 500ms of sleep per attempt.
   * Move from classic rename() to commitFile() to rename the file, after 
calling getFileStatus() to get its length and possibly etag. This becomes a 
rename() on gcs/hdfs anyway, but on abfs it does reach the 
ResilientCommitByRename callbacks in abfs, which report on the outcome to the 
caller...which is then logged at WARN.
   * New statistic task_stage_save_summary_file to distinguish from other 
saving operations (job success/report file). This is only saved to the manifest 
on task commit retries, and provides statistics on all previous unsuccessful 
attempts to save the manifests
   + test changes to match the codepath changes, including improvements in 
fault injection.
   
   Directory size for deletion
   ---
   
   New option
   
 mapreduce.manifest.committer.cleanup.parallel.delete.base.first
   
   This attempts an initial attempt at deleting the base dir, only falling back 
to parallel deletes if there's a timeout.
   
   This option is disabled by default; Consider enabling it for abfs to reduce 
IO load. Consult the documentation for more details.
   
   Success file printing
   -
   
   The command to print a JSON _SUCCESS file from this committer and any S3A 
committer is now something which can be invoked from the mapred command:
   
 mapred successfile 
   
   Contributed by Steve Loughran
   
   
   
   ### How was this patch tested?
   
   yetus's work, if happy will validate on abfs.
   
   ### For code changes:
   
   - [ ] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   




> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-05-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846062#comment-17846062
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran opened a new pull request, #6824:
URL: https://github.com/apache/hadoop/pull/6824

   
   backport of #6716
   
   Improve task commit resilience everywhere
   and add an option to reduce delete IO requests on
   job cleanup (relevant for ABFS and HDFS).
   
   Task Commit Resilience
   --
   
   Task manifest saving is re-attempted on failure; the number of  attempts 
made is configurable with the option:
   
 mapreduce.manifest.committer.manifest.save.attempts
   
   * The default is 5.
   * The minimum is 1; asking for less is ignored.
   * A retry policy adds 500ms of sleep per attempt.
   * Move from classic rename() to commitFile() to rename the file, after 
calling getFileStatus() to get its length and possibly etag. This becomes a 
rename() on gcs/hdfs anyway, but on abfs it does reach the 
ResilientCommitByRename callbacks in abfs, which report on the outcome to the 
caller...which is then logged at WARN.
   * New statistic task_stage_save_summary_file to distinguish from other 
saving operations (job success/report file). This is only saved to the manifest 
on task commit retries, and provides statistics on all previous unsuccessful 
attempts to save the manifests
   + test changes to match the codepath changes, including improvements in 
fault injection.
   
   Directory size for deletion
   ---
   
   New option
   
 mapreduce.manifest.committer.cleanup.parallel.delete.base.first
   
   This attempts an initial attempt at deleting the base dir, only falling back 
to parallel deletes if there's a timeout.
   
   This option is disabled by default; Consider enabling it for abfs to reduce 
IO load. Consult the documentation for more details.
   
   Success file printing
   -
   
   The command to print a JSON _SUCCESS file from this committer and any S3A 
committer is now something which can be invoked from the mapred command:
   
 mapred successfile 
   
   Contributed by Steve Loughran
   
   
   ### How was this patch tested?
   
   yetus's work, if happy will validate on abfs.
   
   ### For code changes:
   
   - [ ] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   




> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-05-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846061#comment-17846061
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran merged PR #6716:
URL: https://github.com/apache/hadoop/pull/6716




> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844932#comment-17844932
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

hadoop-yetus commented on PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2102258570

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m 05s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m 05s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m 05s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  shellcheck  |   0m 05s |  |  Shellcheck was not available.  |
   | +0 :ok: |  shelldocs  |   0m 05s |  |  Shelldocs was not available.  |
   | +0 :ok: |  spotbugs  |   0m 01s |  |  spotbugs executables are not 
available.  |
   | +0 :ok: |  markdownlint  |   0m 01s |  |  markdownlint was not available.  
|
   | +0 :ok: |  xmllint  |   0m 01s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m 00s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m 00s |  |  The patch appears to 
include 11 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |   2m 35s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  87m 02s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  38m 10s |  |  trunk passed  |
   | +1 :green_heart: |  checkstyle  |   6m 01s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |  16m 14s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |  14m 05s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  | 165m 47s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   2m 13s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |  11m 25s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  36m 11s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |  36m 11s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m 01s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   5m 43s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |  16m 13s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |  13m 47s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  | 174m 45s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  asflicense  |   6m 07s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 532m 21s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | GITHUB PR | https://github.com/apache/hadoop/pull/6716 |
   | Optional Tests | dupname asflicense codespell detsecrets shellcheck 
shelldocs compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs 
checkstyle markdownlint xmllint |
   | uname | MINGW64_NT-10.0-17763 54377fa890b8 3.4.10-87d57229.x86_64 
2024-02-14 20:17 UTC x86_64 Msys |
   | Build tool | maven |
   | Personality | /c/hadoop/dev-support/bin/hadoop.sh |
   | git revision | trunk / 68dff782a05bdaaf1fcc1e4999e6fd404f07796b |
   | Default Java | Azul Systems, Inc.-1.8.0_332-b09 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6716/4/testReport/
 |
   | modules | C: 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-mapreduce-project hadoop-tools/hadoop-azure U: . |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6716/4/console
 |
   | versions | git=2.44.0.windows.1 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on 

[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-05-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844399#comment-17844399
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran commented on PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2098947031

   @saxenapranav what do you think of the patch now?




> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843643#comment-17843643
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

saxenapranav commented on code in PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#discussion_r1590568348


##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java:
##
@@ -582,19 +611,111 @@ protected final Path directoryMustExist(
* Save a task manifest or summary. This will be done by
* writing to a temp path and then renaming.
* If the destination path exists: Delete it.
+   * This will retry so that a rename failure from abfs load or IO errors
+   * will not fail the task.
* @param manifestData the manifest/success file
* @param tempPath temp path for the initial save
* @param finalPath final path for rename.
-   * @throws IOException failure to load/parse
+   * @return the manifest saved.
+   * @throws IOException failure to rename after retries.
*/
   @SuppressWarnings("unchecked")
-  protected final  void save(T manifestData,
+  protected final  T save(
+  final T manifestData,
   final Path tempPath,
   final Path finalPath) throws IOException {
-LOG.trace("{}: save('{}, {}, {}')", getName(), manifestData, tempPath, 
finalPath);
-trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () ->
-operations.save(manifestData, tempPath, true));
-renameFile(tempPath, finalPath);
+return saveManifest(() -> manifestData, tempPath, finalPath, 
OP_SAVE_TASK_MANIFEST);
+  }
+
+  /**
+   * Generate and save a task manifest or summary file.
+   * This is be done by writing to a temp path and then renaming.
+   * 
+   * If the destination path exists: Delete it before the rename.
+   * 
+   * This will retry so that a rename failure from abfs load or IO errors
+   * such as delete or save failure will not fail the task.
+   * 
+   * The {@code manifestSource} supplier is invoked to get the manifest data
+   * on every attempt.
+   * This permits statistics to be updated, including those of failures.
+   * @param manifestSource supplier the manifest/success file
+   * @param tempPath temp path for the initial save
+   * @param finalPath final path for rename.
+   * @param statistic statistic to use for timing
+   * @return the manifest saved.
+   * @throws IOException failure to save/delete/rename after retries.
+   */
+  @SuppressWarnings("unchecked")
+  protected final  T saveManifest(
+  final Supplier manifestSource,
+  final Path tempPath,
+  final Path finalPath,
+  String statistic) throws IOException {
+
+AtomicInteger retryCount = new AtomicInteger(0);
+RetryPolicy retryPolicy = retryUpToMaximumCountWithProportionalSleep(
+getStageConfig().getManifestSaveAttempts(),
+SAVE_SLEEP_INTERVAL,
+TimeUnit.MILLISECONDS);
+
+// loop until returning a value or raising an exception
+while (true) {
+  try {
+T manifestData = requireNonNull(manifestSource.get());
+trackDurationOfInvocation(getIOStatistics(), statistic, () -> {
+  LOG.info("{}: save manifest to {} then rename as {}'); retry 
count={}",
+  getName(), tempPath, finalPath, retryCount);
+
+  // delete temp path.
+  // even though this is written with overwrite=true, this extra 
recursive
+  // delete also handles a directory being there.
+  deleteRecursive(tempPath, OP_DELETE);
+
+  // save the temp file, overwriting any which remains from an earlier 
attempt
+  operations.save(manifestData, tempPath, true);
+
+  // delete the destination in case it exists either from a failed 
previous
+  // attempt or from a concurrent task commit.
+  delete(finalPath, true, OP_DELETE);
+
+  // rename temp to final
+  renameFile(tempPath, finalPath);

Review Comment:
   This shall be better, as if the recovery also fail, we would not do 
additional HEAD calls for `escalateRenameFailure`. This looks good!





> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are 

[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842473#comment-17842473
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran commented on PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2086367792

   I've now moved to commitFile() to rename the task manifest, after doing a 
getFileStatus() call first...which means its iO cost is the same as a rename 
with recovery enabled. it does let us see what happened, which we log at WARN. 




> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842389#comment-17842389
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran commented on code in PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#discussion_r1584841337


##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java:
##
@@ -582,19 +611,111 @@ protected final Path directoryMustExist(
* Save a task manifest or summary. This will be done by
* writing to a temp path and then renaming.
* If the destination path exists: Delete it.
+   * This will retry so that a rename failure from abfs load or IO errors
+   * will not fail the task.
* @param manifestData the manifest/success file
* @param tempPath temp path for the initial save
* @param finalPath final path for rename.
-   * @throws IOException failure to load/parse
+   * @return the manifest saved.
+   * @throws IOException failure to rename after retries.
*/
   @SuppressWarnings("unchecked")
-  protected final  void save(T manifestData,
+  protected final  T save(
+  final T manifestData,
   final Path tempPath,
   final Path finalPath) throws IOException {
-LOG.trace("{}: save('{}, {}, {}')", getName(), manifestData, tempPath, 
finalPath);
-trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () ->
-operations.save(manifestData, tempPath, true));
-renameFile(tempPath, finalPath);
+return saveManifest(() -> manifestData, tempPath, finalPath, 
OP_SAVE_TASK_MANIFEST);
+  }
+
+  /**
+   * Generate and save a task manifest or summary file.
+   * This is be done by writing to a temp path and then renaming.
+   * 
+   * If the destination path exists: Delete it before the rename.
+   * 
+   * This will retry so that a rename failure from abfs load or IO errors
+   * such as delete or save failure will not fail the task.
+   * 
+   * The {@code manifestSource} supplier is invoked to get the manifest data
+   * on every attempt.
+   * This permits statistics to be updated, including those of failures.
+   * @param manifestSource supplier the manifest/success file
+   * @param tempPath temp path for the initial save
+   * @param finalPath final path for rename.
+   * @param statistic statistic to use for timing
+   * @return the manifest saved.
+   * @throws IOException failure to save/delete/rename after retries.
+   */
+  @SuppressWarnings("unchecked")
+  protected final  T saveManifest(
+  final Supplier manifestSource,
+  final Path tempPath,
+  final Path finalPath,
+  String statistic) throws IOException {
+
+AtomicInteger retryCount = new AtomicInteger(0);
+RetryPolicy retryPolicy = retryUpToMaximumCountWithProportionalSleep(
+getStageConfig().getManifestSaveAttempts(),
+SAVE_SLEEP_INTERVAL,
+TimeUnit.MILLISECONDS);
+
+// loop until returning a value or raising an exception
+while (true) {
+  try {
+T manifestData = requireNonNull(manifestSource.get());
+trackDurationOfInvocation(getIOStatistics(), statistic, () -> {
+  LOG.info("{}: save manifest to {} then rename as {}'); retry 
count={}",
+  getName(), tempPath, finalPath, retryCount);
+
+  // delete temp path.
+  // even though this is written with overwrite=true, this extra 
recursive
+  // delete also handles a directory being there.
+  deleteRecursive(tempPath, OP_DELETE);
+
+  // save the temp file, overwriting any which remains from an earlier 
attempt
+  operations.save(manifestData, tempPath, true);
+
+  // delete the destination in case it exists either from a failed 
previous
+  // attempt or from a concurrent task commit.
+  delete(finalPath, true, OP_DELETE);
+
+  // rename temp to final
+  renameFile(tempPath, finalPath);

Review Comment:
   you mean use commitFile() after creating a file entry, so pushing more of 
the recovery down? we could do that. we won't have the etag of the create file 
though.





> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of 

[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840307#comment-17840307
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

hadoop-yetus commented on PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2074254789

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m 05s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m 05s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m 05s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  shellcheck  |   0m 05s |  |  Shellcheck was not available.  |
   | +0 :ok: |  shelldocs  |   0m 05s |  |  Shelldocs was not available.  |
   | +0 :ok: |  spotbugs  |   0m 01s |  |  spotbugs executables are not 
available.  |
   | +0 :ok: |  markdownlint  |   0m 01s |  |  markdownlint was not available.  
|
   | +0 :ok: |  xmllint  |   0m 01s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m 00s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m 00s |  |  The patch appears to 
include 10 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |   2m 13s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  89m 13s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  39m 10s |  |  trunk passed  |
   | +1 :green_heart: |  checkstyle  |   6m 15s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |  16m 30s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |  14m 00s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  | 170m 25s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   2m 13s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |  11m 14s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  37m 21s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |  37m 21s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m 01s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   5m 47s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |  16m 05s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |  13m 42s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  | 178m 33s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  asflicense  |   5m 29s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 543m 45s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | GITHUB PR | https://github.com/apache/hadoop/pull/6716 |
   | Optional Tests | dupname asflicense codespell detsecrets shellcheck 
shelldocs compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs 
checkstyle markdownlint xmllint |
   | uname | MINGW64_NT-10.0-17763 b033f3aa81e5 3.4.10-87d57229.x86_64 
2024-02-14 20:17 UTC x86_64 Msys |
   | Build tool | maven |
   | Personality | /c/hadoop/dev-support/bin/hadoop.sh |
   | git revision | trunk / 2b38434c46b494f7689acc125a52935d6cb17870 |
   | Default Java | Azul Systems, Inc.-1.8.0_332-b09 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6716/1/testReport/
 |
   | modules | C: 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-mapreduce-project hadoop-tools/hadoop-azure U: . |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6716/1/console
 |
   | versions | git=2.44.0.windows.1 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on 

[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840295#comment-17840295
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

saxenapranav commented on code in PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#discussion_r1577303182


##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/CleanupJobStage.java:
##
@@ -142,64 +154,104 @@ protected Result executeStage(
 }
 
 Outcome outcome = null;
-IOException exception;
+IOException exception = null;
+boolean baseDirDeleted = false;
 
 
 // to delete.
 LOG.info("{}: Deleting job directory {}", getName(), baseDir);
+final long directoryCount = args.directoryCount;
+if (directoryCount > 0) {
+  // log the expected directory count, which drives duration in GCS
+  // and may cause timeouts on azure if the count is too high for a
+  // timely permissions tree scan.
+  LOG.info("{}: Expected directory count: {}", getName(), directoryCount);
+}
 
+progress();
+// check and maybe execute parallel delete of task attempt dirs.
 if (args.deleteTaskAttemptDirsInParallel) {
-  // Attempt to do a parallel delete of task attempt dirs;
-  // don't overreact if a delete fails, but stop trying
-  // to delete the others, and fall back to deleting the
-  // job dir.
-  Path taskSubDir
-  = getStageConfig().getJobAttemptTaskSubDir();
-  try (DurationInfo info = new DurationInfo(LOG,
-  "parallel deletion of task attempts in %s",
-  taskSubDir)) {
-RemoteIterator dirs =
-RemoteIterators.filteringRemoteIterator(
-listStatusIterator(taskSubDir),
-FileStatus::isDirectory);
-TaskPool.foreach(dirs)
-.executeWith(getIOProcessors())
-.stopOnFailure()
-.suppressExceptions(false)
-.run(this::rmTaskAttemptDir);
-getIOStatistics().aggregate((retrieveIOStatistics(dirs)));
-
-if (getLastDeleteException() != null) {
-  // one of the task attempts failed.
-  throw getLastDeleteException();
+
+
+  if (args.parallelDeleteAttemptBaseDeleteFirst) {
+// attempt to delete the base dir first.
+// This can reduce ABFS delete load but may time out
+// (which the fallback to parallel delete will handle).
+// on GCS it is slow.
+try (DurationInfo info = new DurationInfo(LOG, true,
+"Initial delete of %s", baseDir)) {
+  exception = deleteOneDir(baseDir);
+  if (exception == null) {
+// success: record this as the outcome,
+outcome = Outcome.DELETED;
+// and will skip the parallel delete
+baseDirDeleted = true;
+  } else {
+// failure: log and continue
+LOG.warn("{}: Exception on initial attempt at deleting base dir {}"
++ " with directory count {}. Falling back to parallel 
delete",
+getName(), baseDir, directoryCount, exception);
+  }
+}
+  }
+  if (!baseDirDeleted) {
+// no base delete attempted or it failed.
+// Attempt to do a parallel delete of task attempt dirs;
+// don't overreact if a delete fails, but stop trying
+// to delete the others, and fall back to deleting the
+// job dir.
+Path taskSubDir
+= getStageConfig().getJobAttemptTaskSubDir();
+try (DurationInfo info = new DurationInfo(LOG, true,
+"parallel deletion of task attempts in %s",
+taskSubDir)) {
+  RemoteIterator dirs =
+  RemoteIterators.filteringRemoteIterator(
+  listStatusIterator(taskSubDir),
+  FileStatus::isDirectory);
+  TaskPool.foreach(dirs)
+  .executeWith(getIOProcessors())
+  .stopOnFailure()
+  .suppressExceptions(false)
+  .run(this::rmTaskAttemptDir);
+  getIOStatistics().aggregate((retrieveIOStatistics(dirs)));
+
+  if (getLastDeleteException() != null) {
+// one of the task attempts failed.
+throw getLastDeleteException();

Review Comment:
   This will also register an exception raised in `deleteOneDir` in line 183. 
So, although there was no task failure, but would get recorded as task-failure 
in logs.





> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>

[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840286#comment-17840286
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

saxenapranav commented on code in PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#discussion_r1577262719


##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java:
##
@@ -582,19 +611,111 @@ protected final Path directoryMustExist(
* Save a task manifest or summary. This will be done by
* writing to a temp path and then renaming.
* If the destination path exists: Delete it.
+   * This will retry so that a rename failure from abfs load or IO errors
+   * will not fail the task.
* @param manifestData the manifest/success file
* @param tempPath temp path for the initial save
* @param finalPath final path for rename.
-   * @throws IOException failure to load/parse
+   * @return the manifest saved.
+   * @throws IOException failure to rename after retries.
*/
   @SuppressWarnings("unchecked")
-  protected final  void save(T manifestData,
+  protected final  T save(
+  final T manifestData,
   final Path tempPath,
   final Path finalPath) throws IOException {
-LOG.trace("{}: save('{}, {}, {}')", getName(), manifestData, tempPath, 
finalPath);
-trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () ->
-operations.save(manifestData, tempPath, true));
-renameFile(tempPath, finalPath);
+return saveManifest(() -> manifestData, tempPath, finalPath, 
OP_SAVE_TASK_MANIFEST);
+  }
+
+  /**
+   * Generate and save a task manifest or summary file.
+   * This is be done by writing to a temp path and then renaming.
+   * 
+   * If the destination path exists: Delete it before the rename.
+   * 
+   * This will retry so that a rename failure from abfs load or IO errors
+   * such as delete or save failure will not fail the task.
+   * 
+   * The {@code manifestSource} supplier is invoked to get the manifest data
+   * on every attempt.
+   * This permits statistics to be updated, including those of failures.
+   * @param manifestSource supplier the manifest/success file
+   * @param tempPath temp path for the initial save
+   * @param finalPath final path for rename.
+   * @param statistic statistic to use for timing
+   * @return the manifest saved.
+   * @throws IOException failure to save/delete/rename after retries.
+   */
+  @SuppressWarnings("unchecked")
+  protected final  T saveManifest(
+  final Supplier manifestSource,
+  final Path tempPath,
+  final Path finalPath,
+  String statistic) throws IOException {
+
+AtomicInteger retryCount = new AtomicInteger(0);
+RetryPolicy retryPolicy = retryUpToMaximumCountWithProportionalSleep(
+getStageConfig().getManifestSaveAttempts(),
+SAVE_SLEEP_INTERVAL,
+TimeUnit.MILLISECONDS);
+
+// loop until returning a value or raising an exception
+while (true) {
+  try {
+T manifestData = requireNonNull(manifestSource.get());
+trackDurationOfInvocation(getIOStatistics(), statistic, () -> {
+  LOG.info("{}: save manifest to {} then rename as {}'); retry 
count={}",
+  getName(), tempPath, finalPath, retryCount);
+
+  // delete temp path.
+  // even though this is written with overwrite=true, this extra 
recursive
+  // delete also handles a directory being there.
+  deleteRecursive(tempPath, OP_DELETE);
+
+  // save the temp file, overwriting any which remains from an earlier 
attempt
+  operations.save(manifestData, tempPath, true);
+
+  // delete the destination in case it exists either from a failed 
previous
+  // attempt or from a concurrent task commit.
+  delete(finalPath, true, OP_DELETE);
+
+  // rename temp to final
+  renameFile(tempPath, finalPath);

Review Comment:
   Got your point for the directory case. 
   
   For the first point, I now understand that `executeRenamingOperation` would 
call `escalateRenameFailure` on fs.rename() failure which would raise 
PathIOException. I was thinking if instead of calling `renameFile` if we can do 
`operation.renameFile()` directly and raise exception from there. Reason being, 
`escalateRenameFailure` does a getFileStatus on both src and dst for logging. 
We can save 2 filesystem calls if we know the renameFile for the saveManifest 
has failed. Would like to know your view. But, I am good with this comment.





> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
>   

[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840255#comment-17840255
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

hadoop-yetus commented on PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2073739524

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |  12m 17s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  1s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  shelldocs  |   0m  1s |  |  Shelldocs was not available.  |
   | +0 :ok: |  markdownlint  |   0m  1s |  |  markdownlint was not available.  
|
   | +0 :ok: |  xmllint  |   0m  1s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 10 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  15m 14s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  32m 12s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  17m 30s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  compile  |  16m 28s |  |  trunk passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  checkstyle  |   4m 23s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   3m 35s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   3m 15s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   2m 56s |  |  trunk passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   6m 38s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  34m  4s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 34s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m  5s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m 51s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javac  |  16m 51s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  15m 57s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  javac  |  15m 57s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  1s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   4m 23s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   3m 25s |  |  the patch passed  |
   | +1 :green_heart: |  shellcheck  |   0m 24s |  |  No new issues.  |
   | +1 :green_heart: |  javadoc  |   3m  6s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   3m  1s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   7m 16s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  34m 33s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   8m 27s |  |  hadoop-mapreduce-client-core in 
the patch passed.  |
   | +1 :green_heart: |  unit  | 162m  7s |  |  hadoop-mapreduce-project in the 
patch passed.  |
   | +1 :green_heart: |  unit  |   2m 56s |  |  hadoop-azure in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   1m 18s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 422m 41s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.45 ServerAPI=1.45 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/11/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6716 |
   | Optional Tests | dupname asflicense codespell detsecrets shellcheck 
shelldocs compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs 
checkstyle markdownlint xmllint |
   | uname | Linux 8700e5c7281c 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 2b38434c46b494f7689acc125a52935d6cb17870 |
   | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
   | Multi-JDK versions | 

[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840116#comment-17840116
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran commented on PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2072467711

   One thing I'm considering here, make that "initial attempt at base dir 
delete" a numeric threshold.
   
   good: agile
   bad: harder to test, less consistent; harder to replicate
   
   for now, leaving a simple switch




> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840097#comment-17840097
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran commented on code in PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#discussion_r1576225937


##
hadoop-tools/hadoop-azure/pom.xml:
##
@@ -45,6 +45,8 @@
 7200
 
10
 
1000
+ [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840096#comment-17840096
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran commented on code in PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#discussion_r1576221197


##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java:
##
@@ -582,19 +611,111 @@ protected final Path directoryMustExist(
* Save a task manifest or summary. This will be done by
* writing to a temp path and then renaming.
* If the destination path exists: Delete it.
+   * This will retry so that a rename failure from abfs load or IO errors
+   * will not fail the task.
* @param manifestData the manifest/success file
* @param tempPath temp path for the initial save
* @param finalPath final path for rename.
-   * @throws IOException failure to load/parse
+   * @return the manifest saved.
+   * @throws IOException failure to rename after retries.
*/
   @SuppressWarnings("unchecked")
-  protected final  void save(T manifestData,
+  protected final  T save(
+  final T manifestData,
   final Path tempPath,
   final Path finalPath) throws IOException {
-LOG.trace("{}: save('{}, {}, {}')", getName(), manifestData, tempPath, 
finalPath);
-trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () ->
-operations.save(manifestData, tempPath, true));
-renameFile(tempPath, finalPath);
+return saveManifest(() -> manifestData, tempPath, finalPath, 
OP_SAVE_TASK_MANIFEST);
+  }
+
+  /**
+   * Generate and save a task manifest or summary file.
+   * This is be done by writing to a temp path and then renaming.
+   * 
+   * If the destination path exists: Delete it before the rename.
+   * 
+   * This will retry so that a rename failure from abfs load or IO errors
+   * such as delete or save failure will not fail the task.
+   * 
+   * The {@code manifestSource} supplier is invoked to get the manifest data
+   * on every attempt.
+   * This permits statistics to be updated, including those of failures.
+   * @param manifestSource supplier the manifest/success file
+   * @param tempPath temp path for the initial save
+   * @param finalPath final path for rename.
+   * @param statistic statistic to use for timing
+   * @return the manifest saved.
+   * @throws IOException failure to save/delete/rename after retries.
+   */
+  @SuppressWarnings("unchecked")
+  protected final  T saveManifest(
+  final Supplier manifestSource,
+  final Path tempPath,
+  final Path finalPath,
+  String statistic) throws IOException {
+
+AtomicInteger retryCount = new AtomicInteger(0);
+RetryPolicy retryPolicy = retryUpToMaximumCountWithProportionalSleep(
+getStageConfig().getManifestSaveAttempts(),
+SAVE_SLEEP_INTERVAL,
+TimeUnit.MILLISECONDS);
+
+// loop until returning a value or raising an exception
+while (true) {
+  try {
+T manifestData = requireNonNull(manifestSource.get());
+trackDurationOfInvocation(getIOStatistics(), statistic, () -> {
+  LOG.info("{}: save manifest to {} then rename as {}'); retry 
count={}",
+  getName(), tempPath, finalPath, retryCount);
+
+  // delete temp path.
+  // even though this is written with overwrite=true, this extra 
recursive
+  // delete also handles a directory being there.
+  deleteRecursive(tempPath, OP_DELETE);
+
+  // save the temp file, overwriting any which remains from an earlier 
attempt
+  operations.save(manifestData, tempPath, true);
+
+  // delete the destination in case it exists either from a failed 
previous
+  // attempt or from a concurrent task commit.
+  delete(finalPath, true, OP_DELETE);
+
+  // rename temp to final
+  renameFile(tempPath, finalPath);

Review Comment:
   1. renameFile javadocs `throws PathIOException – if the rename() call 
returned false.`. so no need to check the result here
   2. directory deletion, maybe: but what is going to create a directory here? 
nothing in the committer will, and if some other process is doing stuff in the 
job attempt dir you are doomed.





> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: 

[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839503#comment-17839503
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

saxenapranav commented on code in PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#discussion_r1574200780


##
hadoop-tools/hadoop-azure/pom.xml:
##
@@ -45,6 +45,8 @@
 7200
 
10
 
1000
+ [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839499#comment-17839499
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

saxenapranav commented on code in PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#discussion_r1572484303


##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java:
##
@@ -582,19 +611,111 @@ protected final Path directoryMustExist(
* Save a task manifest or summary. This will be done by
* writing to a temp path and then renaming.
* If the destination path exists: Delete it.
+   * This will retry so that a rename failure from abfs load or IO errors
+   * will not fail the task.
* @param manifestData the manifest/success file
* @param tempPath temp path for the initial save
* @param finalPath final path for rename.
-   * @throws IOException failure to load/parse
+   * @return the manifest saved.
+   * @throws IOException failure to rename after retries.
*/
   @SuppressWarnings("unchecked")
-  protected final  void save(T manifestData,
+  protected final  T save(
+  final T manifestData,
   final Path tempPath,
   final Path finalPath) throws IOException {
-LOG.trace("{}: save('{}, {}, {}')", getName(), manifestData, tempPath, 
finalPath);
-trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () ->
-operations.save(manifestData, tempPath, true));
-renameFile(tempPath, finalPath);
+return saveManifest(() -> manifestData, tempPath, finalPath, 
OP_SAVE_TASK_MANIFEST);
+  }
+
+  /**
+   * Generate and save a task manifest or summary file.
+   * This is be done by writing to a temp path and then renaming.
+   * 
+   * If the destination path exists: Delete it before the rename.
+   * 
+   * This will retry so that a rename failure from abfs load or IO errors
+   * such as delete or save failure will not fail the task.
+   * 
+   * The {@code manifestSource} supplier is invoked to get the manifest data
+   * on every attempt.
+   * This permits statistics to be updated, including those of failures.
+   * @param manifestSource supplier the manifest/success file
+   * @param tempPath temp path for the initial save
+   * @param finalPath final path for rename.
+   * @param statistic statistic to use for timing
+   * @return the manifest saved.
+   * @throws IOException failure to save/delete/rename after retries.
+   */
+  @SuppressWarnings("unchecked")
+  protected final  T saveManifest(
+  final Supplier manifestSource,
+  final Path tempPath,
+  final Path finalPath,
+  String statistic) throws IOException {
+
+AtomicInteger retryCount = new AtomicInteger(0);
+RetryPolicy retryPolicy = retryUpToMaximumCountWithProportionalSleep(
+getStageConfig().getManifestSaveAttempts(),
+SAVE_SLEEP_INTERVAL,
+TimeUnit.MILLISECONDS);
+
+// loop until returning a value or raising an exception
+while (true) {
+  try {
+T manifestData = requireNonNull(manifestSource.get());
+trackDurationOfInvocation(getIOStatistics(), statistic, () -> {
+  LOG.info("{}: save manifest to {} then rename as {}'); retry 
count={}",
+  getName(), tempPath, finalPath, retryCount);
+
+  // delete temp path.
+  // even though this is written with overwrite=true, this extra 
recursive
+  // delete also handles a directory being there.
+  deleteRecursive(tempPath, OP_DELETE);
+
+  // save the temp file, overwriting any which remains from an earlier 
attempt
+  operations.save(manifestData, tempPath, true);
+
+  // delete the destination in case it exists either from a failed 
previous
+  // attempt or from a concurrent task commit.
+  delete(finalPath, true, OP_DELETE);
+
+  // rename temp to final
+  renameFile(tempPath, finalPath);

Review Comment:
   There can be a parallel process which might create a directory in between 
line 680 and 683, should we check post line 683, if finalPath is a file or not?



##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java:
##
@@ -582,19 +611,111 @@ protected final Path directoryMustExist(
* Save a task manifest or summary. This will be done by
* writing to a temp path and then renaming.
* If the destination path exists: Delete it.
+   * This will retry so that a rename failure from abfs load or IO errors
+   * will not fail the task.
* @param manifestData the manifest/success file
* @param tempPath temp path for the initial save
* @param finalPath final path for rename.
-   * @throws 

[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839140#comment-17839140
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

hadoop-yetus commented on PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2067365776

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 34s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  shelldocs  |   0m  0s |  |  Shelldocs was not available.  |
   | +0 :ok: |  markdownlint  |   0m  1s |  |  markdownlint was not available.  
|
   | +0 :ok: |  xmllint  |   0m  1s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 10 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  15m  7s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  32m  9s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  17m 23s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  compile  |  16m 11s |  |  trunk passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  checkstyle  |   4m 22s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   3m 31s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   3m 15s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   2m 58s |  |  trunk passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   6m 39s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  34m  6s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 33s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m  7s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m 48s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javac  |  16m 48s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m 14s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  javac  |  16m 14s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   4m 17s | 
[/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/10/artifact/out/results-checkstyle-root.txt)
 |  root: The patch generated 2 new + 22 unchanged - 0 fixed = 24 total (was 
22)  |
   | +1 :green_heart: |  mvnsite  |   3m 23s |  |  the patch passed  |
   | +1 :green_heart: |  shellcheck  |   0m 23s |  |  No new issues.  |
   | +1 :green_heart: |  javadoc  |   3m  7s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   3m  1s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   7m 18s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  34m 19s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   7m 54s |  |  hadoop-mapreduce-client-core in 
the patch passed.  |
   | +1 :green_heart: |  unit  | 161m 19s |  |  hadoop-mapreduce-project in the 
patch passed.  |
   | +1 :green_heart: |  unit  |   2m 44s |  |  hadoop-azure in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   1m 17s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 408m 51s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.45 ServerAPI=1.45 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/10/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6716 |
   | Optional Tests | dupname asflicense codespell detsecrets shellcheck 
shelldocs compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs 
checkstyle markdownlint xmllint |
   | uname | Linux 1cd1cb0cbe32 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
 

[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838851#comment-17838851
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

hadoop-yetus commented on PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2065738026

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 34s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  shelldocs  |   0m  0s |  |  Shelldocs was not available.  |
   | +0 :ok: |  markdownlint  |   0m  1s |  |  markdownlint was not available.  
|
   | +0 :ok: |  xmllint  |   0m  1s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 10 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m 57s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  32m 39s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  17m 27s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  compile  |  16m 20s |  |  trunk passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  checkstyle  |   4m 19s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   3m 33s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   3m 15s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   3m  1s |  |  trunk passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   6m 39s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  34m 11s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   1m 58s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 16s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m 49s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javac  |  16m 49s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m 15s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  javac  |  16m 15s |  |  the patch passed  |
   | -1 :x: |  blanks  |   0m  0s | 
[/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/9/artifact/out/blanks-eol.txt)
 |  The patch has 1 line(s) that end in blanks. Use git apply --whitespace=fix 
<>. Refer https://git-scm.com/docs/git-apply  |
   | -0 :warning: |  checkstyle  |   4m 17s | 
[/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/9/artifact/out/results-checkstyle-root.txt)
 |  root: The patch generated 10 new + 22 unchanged - 0 fixed = 32 total (was 
22)  |
   | +1 :green_heart: |  mvnsite  |   3m 24s |  |  the patch passed  |
   | +1 :green_heart: |  shellcheck  |   0m 23s |  |  No new issues.  |
   | +1 :green_heart: |  javadoc  |   3m  9s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   3m  2s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   7m 19s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  34m 28s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   7m 59s |  |  hadoop-mapreduce-client-core in 
the patch passed.  |
   | +1 :green_heart: |  unit  | 160m 38s |  |  hadoop-mapreduce-project in the 
patch passed.  |
   | +1 :green_heart: |  unit  |   2m 43s |  |  hadoop-azure in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   1m 14s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 410m 46s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.45 ServerAPI=1.45 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/9/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6716 |
   | Optional Tests | dupname asflicense codespell detsecrets shellcheck 
shelldocs compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs 
checkstyle markdownlint xmllint |
   | uname 

[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838736#comment-17838736
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran commented on PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2064541193

   @snvijaya we actually know the total number of subdirs for the deletion!
   
   it is propagated via the manifests: each TA manifest includes the #of dirs 
as an IOStatistic, the aggregate summary adds these all up.
   
   the number of paths under the job dir is that number (counter 
committer_task_directory_count ) + any of failed task attempts.
   
   which means we could actually have a threshold of how many subdirectories 
will trigger an automatic switch to parallel delete.
   
   I'm just going to pass this down and log immediately before the cleanup 
kicks off, so if there are problems we will get the diagnostics adjacent to the 
error.
   
   Note that your details on retry timings imply that on a mapreduce job 
(rather than spark one) the progress() callback will not take place -so there's 
a risk that the job will actually timeout. I don't think that's an issue in MR 
job actions, the way it is is in task-side actions where a heartbeat back to 
the MapRed AM is required.
   
   




> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838388#comment-17838388
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

hadoop-yetus commented on PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2062530498

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 32s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  markdownlint  |   0m  1s |  |  markdownlint was not available.  
|
   | +0 :ok: |  xmllint  |   0m  1s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 7 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m 43s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  32m 23s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  17m 35s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  compile  |  16m  7s |  |  trunk passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  checkstyle  |   4m 19s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 55s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 34s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 25s |  |  trunk passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   2m 53s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  34m 12s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 33s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m  0s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m 47s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javac  |  16m 47s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m 25s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  javac  |  16m 25s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   4m 22s | 
[/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/8/artifact/out/results-checkstyle-root.txt)
 |  root: The patch generated 2 new + 22 unchanged - 0 fixed = 24 total (was 
22)  |
   | +1 :green_heart: |  mvnsite  |   1m 54s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 34s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 34s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   3m 15s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  34m 26s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   7m 50s |  |  hadoop-mapreduce-client-core in 
the patch passed.  |
   | +1 :green_heart: |  unit  |   2m 30s |  |  hadoop-azure in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   1m  3s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 228m 23s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.45 ServerAPI=1.45 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/8/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6716 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint 
xmllint |
   | uname | Linux 2eeac626505b 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 3e5e1e61d99b9a784f1f324f5875792da9b49406 |
   | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 

[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838341#comment-17838341
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran commented on code in PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#discussion_r1569270966


##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/CleanupJobStage.java:
##
@@ -142,64 +154,93 @@ protected Result executeStage(
 }
 
 Outcome outcome = null;
-IOException exception;
+IOException exception = null;
+boolean baseDirDeleted = false;
 
 
 // to delete.
 LOG.info("{}: Deleting job directory {}", getName(), baseDir);
 
 if (args.deleteTaskAttemptDirsInParallel) {
-  // Attempt to do a parallel delete of task attempt dirs;
-  // don't overreact if a delete fails, but stop trying
-  // to delete the others, and fall back to deleting the
-  // job dir.
-  Path taskSubDir
-  = getStageConfig().getJobAttemptTaskSubDir();
-  try (DurationInfo info = new DurationInfo(LOG,
-  "parallel deletion of task attempts in %s",
-  taskSubDir)) {
-RemoteIterator dirs =
-RemoteIterators.filteringRemoteIterator(
-listStatusIterator(taskSubDir),
-FileStatus::isDirectory);
-TaskPool.foreach(dirs)
-.executeWith(getIOProcessors())
-.stopOnFailure()
-.suppressExceptions(false)
-.run(this::rmTaskAttemptDir);
-getIOStatistics().aggregate((retrieveIOStatistics(dirs)));
-
-if (getLastDeleteException() != null) {
-  // one of the task attempts failed.
-  throw getLastDeleteException();
+
+  // parallel delete of task attempt dirs.
+
+  if (args.parallelDeleteAttemptBaseDeleteFirst) {
+// attempt to delete the base dir first.
+// This can reduce ABFS delete load but may time out
+// (which the fallback to parallel delete will handle).
+// on GCS it is slow.
+try (DurationInfo info = new DurationInfo(LOG, true,
+"Initial delete of %s", baseDir)) {
+  exception = deleteOneDir(baseDir);
+  if (exception == null) {
+// success: record this as the outcome, which
+// will skip the parallel delete.
+outcome = Outcome.DELETED;
+baseDirDeleted = true;
+  } else {
+// failure: log and continue
+LOG.warn("{}: Exception on initial attempt at deleting base dir 
{}\n"
++ "attempting parallel delete",
+getName(), baseDir, exception);
+  }
+}
+  }
+  if (!baseDirDeleted) {
+// no base delete attempted or it failed.
+// Attempt to do a parallel delete of task attempt dirs;
+// don't overreact if a delete fails, but stop trying
+// to delete the others, and fall back to deleting the
+// job dir.
+Path taskSubDir
+= getStageConfig().getJobAttemptTaskSubDir();
+try (DurationInfo info = new DurationInfo(LOG, true,
+"parallel deletion of task attempts in %s",
+taskSubDir)) {
+  RemoteIterator dirs =
+  RemoteIterators.filteringRemoteIterator(
+  listStatusIterator(taskSubDir),
+  FileStatus::isDirectory);
+  TaskPool.foreach(dirs)
+  .executeWith(getIOProcessors())
+  .stopOnFailure()
+  .suppressExceptions(false)
+  .run(this::rmTaskAttemptDir);
+  getIOStatistics().aggregate((retrieveIOStatistics(dirs)));
+
+  if (getLastDeleteException() != null) {
+// one of the task attempts failed.
+throw getLastDeleteException();

Review Comment:
   good q. 
   rmTaskAttemptDir will not trigger a failure, just note and store the error.
   on L213 one of any such IOEs is rethrown, but that just gets to L222 which 
logs and then goes on to the final root dir attempt again
   
   so it should be recovered from...though if the first delete timed out then 
that root delete on L233 is probably doomed too...





> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on 

[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838340#comment-17838340
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran commented on code in PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#discussion_r1569264943


##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/CleanupJobStage.java:
##
@@ -142,64 +154,93 @@ protected Result executeStage(
 }
 
 Outcome outcome = null;
-IOException exception;
+IOException exception = null;
+boolean baseDirDeleted = false;
 
 
 // to delete.
 LOG.info("{}: Deleting job directory {}", getName(), baseDir);
 
 if (args.deleteTaskAttemptDirsInParallel) {
-  // Attempt to do a parallel delete of task attempt dirs;
-  // don't overreact if a delete fails, but stop trying
-  // to delete the others, and fall back to deleting the
-  // job dir.
-  Path taskSubDir
-  = getStageConfig().getJobAttemptTaskSubDir();
-  try (DurationInfo info = new DurationInfo(LOG,
-  "parallel deletion of task attempts in %s",
-  taskSubDir)) {
-RemoteIterator dirs =
-RemoteIterators.filteringRemoteIterator(
-listStatusIterator(taskSubDir),
-FileStatus::isDirectory);
-TaskPool.foreach(dirs)
-.executeWith(getIOProcessors())
-.stopOnFailure()
-.suppressExceptions(false)
-.run(this::rmTaskAttemptDir);
-getIOStatistics().aggregate((retrieveIOStatistics(dirs)));
-
-if (getLastDeleteException() != null) {
-  // one of the task attempts failed.
-  throw getLastDeleteException();
+
+  // parallel delete of task attempt dirs.
+
+  if (args.parallelDeleteAttemptBaseDeleteFirst) {
+// attempt to delete the base dir first.
+// This can reduce ABFS delete load but may time out
+// (which the fallback to parallel delete will handle).
+// on GCS it is slow.
+try (DurationInfo info = new DurationInfo(LOG, true,
+"Initial delete of %s", baseDir)) {
+  exception = deleteOneDir(baseDir);
+  if (exception == null) {
+// success: record this as the outcome, which
+// will skip the parallel delete.
+outcome = Outcome.DELETED;
+baseDirDeleted = true;
+  } else {
+// failure: log and continue
+LOG.warn("{}: Exception on initial attempt at deleting base dir 
{}\n"
++ "attempting parallel delete",
+getName(), baseDir, exception);
+  }
+}
+  }
+  if (!baseDirDeleted) {

Review Comment:
   note that the next stage will fail immediately on the list() call, which is 
caught as a FNFE and treated as success





> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838338#comment-17838338
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran commented on code in PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#discussion_r1569262867


##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/CleanupJobStage.java:
##
@@ -142,64 +154,93 @@ protected Result executeStage(
 }
 
 Outcome outcome = null;
-IOException exception;
+IOException exception = null;
+boolean baseDirDeleted = false;
 
 
 // to delete.
 LOG.info("{}: Deleting job directory {}", getName(), baseDir);
 
 if (args.deleteTaskAttemptDirsInParallel) {
-  // Attempt to do a parallel delete of task attempt dirs;
-  // don't overreact if a delete fails, but stop trying
-  // to delete the others, and fall back to deleting the
-  // job dir.
-  Path taskSubDir
-  = getStageConfig().getJobAttemptTaskSubDir();
-  try (DurationInfo info = new DurationInfo(LOG,
-  "parallel deletion of task attempts in %s",
-  taskSubDir)) {
-RemoteIterator dirs =
-RemoteIterators.filteringRemoteIterator(
-listStatusIterator(taskSubDir),
-FileStatus::isDirectory);
-TaskPool.foreach(dirs)
-.executeWith(getIOProcessors())
-.stopOnFailure()
-.suppressExceptions(false)
-.run(this::rmTaskAttemptDir);
-getIOStatistics().aggregate((retrieveIOStatistics(dirs)));
-
-if (getLastDeleteException() != null) {
-  // one of the task attempts failed.
-  throw getLastDeleteException();
+
+  // parallel delete of task attempt dirs.
+
+  if (args.parallelDeleteAttemptBaseDeleteFirst) {
+// attempt to delete the base dir first.
+// This can reduce ABFS delete load but may time out
+// (which the fallback to parallel delete will handle).
+// on GCS it is slow.
+try (DurationInfo info = new DurationInfo(LOG, true,
+"Initial delete of %s", baseDir)) {
+  exception = deleteOneDir(baseDir);
+  if (exception == null) {
+// success: record this as the outcome, which
+// will skip the parallel delete.
+outcome = Outcome.DELETED;
+baseDirDeleted = true;
+  } else {
+// failure: log and continue
+LOG.warn("{}: Exception on initial attempt at deleting base dir 
{}\n"
++ "attempting parallel delete",
+getName(), baseDir, exception);
+  }
+}
+  }
+  if (!baseDirDeleted) {

Review Comment:
   it gets set on L180; will comment 





> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838326#comment-17838326
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran commented on code in PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#discussion_r1569228824


##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java:
##
@@ -582,19 +605,46 @@ protected final Path directoryMustExist(
* Save a task manifest or summary. This will be done by
* writing to a temp path and then renaming.
* If the destination path exists: Delete it.
+   * This will retry so that a rename failure from abfs load or IO errors
+   * will not fail the task.
* @param manifestData the manifest/success file
* @param tempPath temp path for the initial save
* @param finalPath final path for rename.
-   * @throws IOException failure to load/parse
+   * @throws IOException failure to rename after retries.
*/
   @SuppressWarnings("unchecked")
   protected final  void save(T manifestData,
   final Path tempPath,
   final Path finalPath) throws IOException {
-LOG.trace("{}: save('{}, {}, {}')", getName(), manifestData, tempPath, 
finalPath);
-trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () ->
-operations.save(manifestData, tempPath, true));
-renameFile(tempPath, finalPath);
+boolean success = false;
+int failures = 0;
+while (!success) {
+  try {
+LOG.trace("{}: attempt {} save('{}, {}, {}')",
+getName(), failures, manifestData, tempPath, finalPath);
+
+trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () 
->
+operations.save(manifestData, tempPath, true));
+renameFile(tempPath, finalPath);

Review Comment:
   any error raised during rename triggers fallback of
   * catch IOE
   * save temp file again
   * delete dest path
   * rename temp path to final path
   
   this is attempted a configurable number of times, with a sleep in between.
   no attempt to be clever about which IOEs are unrecoverable (permissions 
etc), just catch, log, retry





> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838328#comment-17838328
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran commented on code in PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#discussion_r1569233047


##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/CleanupJobStage.java:
##
@@ -142,64 +154,93 @@ protected Result executeStage(
 }
 
 Outcome outcome = null;
-IOException exception;
+IOException exception = null;
+boolean baseDirDeleted = false;
 
 
 // to delete.
 LOG.info("{}: Deleting job directory {}", getName(), baseDir);
 
 if (args.deleteTaskAttemptDirsInParallel) {
-  // Attempt to do a parallel delete of task attempt dirs;
-  // don't overreact if a delete fails, but stop trying
-  // to delete the others, and fall back to deleting the
-  // job dir.
-  Path taskSubDir
-  = getStageConfig().getJobAttemptTaskSubDir();
-  try (DurationInfo info = new DurationInfo(LOG,
-  "parallel deletion of task attempts in %s",
-  taskSubDir)) {
-RemoteIterator dirs =
-RemoteIterators.filteringRemoteIterator(
-listStatusIterator(taskSubDir),
-FileStatus::isDirectory);
-TaskPool.foreach(dirs)
-.executeWith(getIOProcessors())
-.stopOnFailure()
-.suppressExceptions(false)
-.run(this::rmTaskAttemptDir);
-getIOStatistics().aggregate((retrieveIOStatistics(dirs)));
-
-if (getLastDeleteException() != null) {
-  // one of the task attempts failed.
-  throw getLastDeleteException();
+
+  // parallel delete of task attempt dirs.
+
+  if (args.parallelDeleteAttemptBaseDeleteFirst) {
+// attempt to delete the base dir first.
+// This can reduce ABFS delete load but may time out
+// (which the fallback to parallel delete will handle).
+// on GCS it is slow.
+try (DurationInfo info = new DurationInfo(LOG, true,
+"Initial delete of %s", baseDir)) {
+  exception = deleteOneDir(baseDir);
+  if (exception == null) {

Review Comment:
   ooh, so it's going to be quite a long time to fall back.
   I'm going to make the option default to false for now.
   
   > AzureManifestCommitterFactory could probably set this config to 0 before 
FileSystem.get() call happens.
   
   it'll come from the cache, we don't want to set it for everything else, but 
a low MAX_RETRIES_RECURSIVE_DELETE might make sense everywhere. something to 
consider later. 





> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838324#comment-17838324
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran commented on code in PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#discussion_r1569225918


##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java:
##
@@ -582,19 +607,64 @@ protected final Path directoryMustExist(
* Save a task manifest or summary. This will be done by
* writing to a temp path and then renaming.
* If the destination path exists: Delete it.
+   * This will retry so that a rename failure from abfs load or IO errors
+   * will not fail the task.
* @param manifestData the manifest/success file
* @param tempPath temp path for the initial save
* @param finalPath final path for rename.
-   * @throws IOException failure to load/parse
+   * @throws IOException failure to rename after retries.
*/
   @SuppressWarnings("unchecked")
   protected final  void save(T manifestData,
   final Path tempPath,
   final Path finalPath) throws IOException {
-LOG.trace("{}: save('{}, {}, {}')", getName(), manifestData, tempPath, 
finalPath);
-trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () ->
-operations.save(manifestData, tempPath, true));
-renameFile(tempPath, finalPath);
+
+int retryCount = 0;
+RetryPolicy retryPolicy = retryUpToMaximumCountWithProportionalSleep(
+getStageConfig().getManifestSaveAttempts(),
+SAVE_SLEEP_INTERVAL,
+TimeUnit.MILLISECONDS);
+boolean success = false;
+while (!success) {
+  try {
+LOG.info("{}: save manifest to {} then rename as {}'); retry count={}",
+getName(), tempPath, finalPath, retryCount);
+
+trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () 
->
+operations.save(manifestData, tempPath, true));

Review Comment:
   so renameFile() has always deleted the destination because we need to do 
that to cope with failures of a previous/concurrent task attempt. Whoever 
commits last wins.
   
   To make this clearer I'm pulling up more of the code into this method and 
adding comments.





> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838320#comment-17838320
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran commented on code in PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#discussion_r1569210996


##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java:
##
@@ -445,9 +448,29 @@ protected Boolean delete(
   final boolean recursive,
   final String statistic)
   throws IOException {
-return trackDuration(getIOStatistics(), statistic, () -> {
-  return operations.delete(path, recursive);
-});
+if (recursive) {

Review Comment:
   ok, deleteDir will also delete a file. let me highlight that.
   
   I'd done this delete dir/file split to support different capacity requests, 
without that it is a bit over-complex. it does let us collect different 
statistics though, which may be useful





> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838318#comment-17838318
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran commented on code in PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#discussion_r1569203962


##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/ManifestCommitterConstants.java:
##
@@ -143,6 +145,20 @@ public final class ManifestCommitterConstants {
*/
   public static final boolean OPT_CLEANUP_PARALLEL_DELETE_DIRS_DEFAULT = true;
 
+  /**
+   * Should parallel cleanup try to delete teh base first?
+   * Best for azure as it skips the task attempt deletions unless
+   * the toplevel delete fails.
+   * Value: {@value}.
+   */
+  public static final String OPT_CLEANUP_PARALLEL_DELETE_BASE_FIRST =
+  OPT_PREFIX + "cleanup.parallel.delete.base.first";
+
+  /**
+   * Default value of option {@link #OPT_CLEANUP_PARALLEL_DELETE_BASE_FIRST}:  
{@value}.
+   */
+  public static final boolean OPT_CLEANUP_PARALLEL_DELETE_BASE_FIRST_DEFAULT = 
true;

Review Comment:
   really don't know here. In the docs I try to cover this





> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838316#comment-17838316
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran commented on code in PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#discussion_r1569201237


##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/impl/ManifestStoreOperations.java:
##
@@ -97,6 +97,36 @@ public boolean isFile(Path path) throws IOException {
   public abstract boolean delete(Path path, boolean recursive)
   throws IOException;
 
+  /**
+   * Forward to {@code delete(Path, true)}
+   * unless overridden.
+   * 
+   * If it returns without an error: there is no file at
+   * the end of the path.
+   * @param path path
+   * @return outcome
+   * @throws IOException failure.
+   */
+  public boolean deleteFile(Path path)
+  throws IOException {
+return delete(path, false);

Review Comment:
   no it won't





> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838315#comment-17838315
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran commented on code in PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#discussion_r1569200611


##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/impl/ManifestStoreOperations.java:
##
@@ -97,6 +97,36 @@ public boolean isFile(Path path) throws IOException {
   public abstract boolean delete(Path path, boolean recursive)
   throws IOException;
 
+  /**
+   * Forward to {@code delete(Path, true)}
+   * unless overridden.
+   * 
+   * If it returns without an error: there is no file at
+   * the end of the path.
+   * @param path path
+   * @return outcome
+   * @throws IOException failure.
+   */
+  public boolean deleteFile(Path path)
+  throws IOException {
+return delete(path, false);
+  }
+
+  /**
+   * Acquire the delete capacity then call {@code FileSystem#delete(Path, 
true)}

Review Comment:
   aah, I'd cut that. leaving delete capacity out of this PR as it'd need rate 
limiting in abfs *and* guessing about size/depth of directory. which we could 
actually add in future task manifests, leaving only the homework of aggregating 
it





> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837966#comment-17837966
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

snvijaya commented on code in PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#discussion_r1567404776


##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/CleanupJobStage.java:
##
@@ -142,64 +154,93 @@ protected Result executeStage(
 }
 
 Outcome outcome = null;
-IOException exception;
+IOException exception = null;
+boolean baseDirDeleted = false;
 
 
 // to delete.
 LOG.info("{}: Deleting job directory {}", getName(), baseDir);
 
 if (args.deleteTaskAttemptDirsInParallel) {
-  // Attempt to do a parallel delete of task attempt dirs;
-  // don't overreact if a delete fails, but stop trying
-  // to delete the others, and fall back to deleting the
-  // job dir.
-  Path taskSubDir
-  = getStageConfig().getJobAttemptTaskSubDir();
-  try (DurationInfo info = new DurationInfo(LOG,
-  "parallel deletion of task attempts in %s",
-  taskSubDir)) {
-RemoteIterator dirs =
-RemoteIterators.filteringRemoteIterator(
-listStatusIterator(taskSubDir),
-FileStatus::isDirectory);
-TaskPool.foreach(dirs)
-.executeWith(getIOProcessors())
-.stopOnFailure()
-.suppressExceptions(false)
-.run(this::rmTaskAttemptDir);
-getIOStatistics().aggregate((retrieveIOStatistics(dirs)));
-
-if (getLastDeleteException() != null) {
-  // one of the task attempts failed.
-  throw getLastDeleteException();
+
+  // parallel delete of task attempt dirs.
+
+  if (args.parallelDeleteAttemptBaseDeleteFirst) {
+// attempt to delete the base dir first.
+// This can reduce ABFS delete load but may time out
+// (which the fallback to parallel delete will handle).
+// on GCS it is slow.
+try (DurationInfo info = new DurationInfo(LOG, true,
+"Initial delete of %s", baseDir)) {
+  exception = deleteOneDir(baseDir);
+  if (exception == null) {

Review Comment:
   As added by you in this logic, when the directory tree is very large and is 
over OAuth authentication, Azure cloud could fail the baseDir delete due to 
exhaustive ACL permissions checks. But this delete will entry the retry loop as 
it was request timeout and for this scenario all the retries too might fail and 
can take a while to report failure with backoff and retry attempts as per 
AZURE_MAX_IO_RETRIES (default value 30). 
   
   Default max retry count is 30 today just to ensure any 5-10 min 
network/service transient failures do not lead failures of long running 
workloads. 
   
   If this logic to attempt basedir delete before falling back to parallel 
deletes, is optimal only for Azure cloud, we could look for ways to fail fast 
for Delete with recursive.
   
   Would this work - Add a new config MAX_RETRIES_RECURSIVE_DELETE which by 
default will be the same as AZURE_MAX_IO_RETRIES in ABFS driver. 
AzureManifestCommitterFactory could probably set this config to 0 before 
FileSystem.get() call happens.
   
   If this sounds ok, we can look into changes needed in AbfsClient, 
AbfsRestOperation and ExponentialRetry to make MAX_RETRIES_RECURSIVE_DELETE 
config effective. 





> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837694#comment-17837694
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

anmolanmol1234 commented on code in PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#discussion_r1567132937


##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/ManifestCommitterConstants.java:
##
@@ -143,6 +145,20 @@ public final class ManifestCommitterConstants {
*/
   public static final boolean OPT_CLEANUP_PARALLEL_DELETE_DIRS_DEFAULT = true;
 
+  /**
+   * Should parallel cleanup try to delete teh base first?

Review Comment:
   typo: the



##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/CleanupJobStage.java:
##
@@ -142,64 +154,93 @@ protected Result executeStage(
 }
 
 Outcome outcome = null;
-IOException exception;
+IOException exception = null;
+boolean baseDirDeleted = false;
 
 
 // to delete.
 LOG.info("{}: Deleting job directory {}", getName(), baseDir);
 
 if (args.deleteTaskAttemptDirsInParallel) {
-  // Attempt to do a parallel delete of task attempt dirs;
-  // don't overreact if a delete fails, but stop trying
-  // to delete the others, and fall back to deleting the
-  // job dir.
-  Path taskSubDir
-  = getStageConfig().getJobAttemptTaskSubDir();
-  try (DurationInfo info = new DurationInfo(LOG,
-  "parallel deletion of task attempts in %s",
-  taskSubDir)) {
-RemoteIterator dirs =
-RemoteIterators.filteringRemoteIterator(
-listStatusIterator(taskSubDir),
-FileStatus::isDirectory);
-TaskPool.foreach(dirs)
-.executeWith(getIOProcessors())
-.stopOnFailure()
-.suppressExceptions(false)
-.run(this::rmTaskAttemptDir);
-getIOStatistics().aggregate((retrieveIOStatistics(dirs)));
-
-if (getLastDeleteException() != null) {
-  // one of the task attempts failed.
-  throw getLastDeleteException();
+
+  // parallel delete of task attempt dirs.
+
+  if (args.parallelDeleteAttemptBaseDeleteFirst) {
+// attempt to delete the base dir first.
+// This can reduce ABFS delete load but may time out
+// (which the fallback to parallel delete will handle).
+// on GCS it is slow.
+try (DurationInfo info = new DurationInfo(LOG, true,
+"Initial delete of %s", baseDir)) {
+  exception = deleteOneDir(baseDir);
+  if (exception == null) {
+// success: record this as the outcome, which
+// will skip the parallel delete.
+outcome = Outcome.DELETED;
+baseDirDeleted = true;
+  } else {
+// failure: log and continue
+LOG.warn("{}: Exception on initial attempt at deleting base dir 
{}\n"
++ "attempting parallel delete",
+getName(), baseDir, exception);
+  }
+}
+  }
+  if (!baseDirDeleted) {

Review Comment:
   Do we need to change the state of this variable here because if not it will 
go to the next if block



##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/CleanupJobStage.java:
##
@@ -142,64 +154,93 @@ protected Result executeStage(
 }
 
 Outcome outcome = null;
-IOException exception;
+IOException exception = null;
+boolean baseDirDeleted = false;
 
 
 // to delete.
 LOG.info("{}: Deleting job directory {}", getName(), baseDir);
 
 if (args.deleteTaskAttemptDirsInParallel) {
-  // Attempt to do a parallel delete of task attempt dirs;
-  // don't overreact if a delete fails, but stop trying
-  // to delete the others, and fall back to deleting the
-  // job dir.
-  Path taskSubDir
-  = getStageConfig().getJobAttemptTaskSubDir();
-  try (DurationInfo info = new DurationInfo(LOG,
-  "parallel deletion of task attempts in %s",
-  taskSubDir)) {
-RemoteIterator dirs =
-RemoteIterators.filteringRemoteIterator(
-listStatusIterator(taskSubDir),
-FileStatus::isDirectory);
-TaskPool.foreach(dirs)
-.executeWith(getIOProcessors())
-.stopOnFailure()
-.suppressExceptions(false)
-.run(this::rmTaskAttemptDir);
-getIOStatistics().aggregate((retrieveIOStatistics(dirs)));
-
-if (getLastDeleteException() != null) {
-  // one of the task attempts failed.
-  throw 

[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837469#comment-17837469
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

hadoop-yetus commented on PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2058026632

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 54s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  markdownlint  |   0m  1s |  |  markdownlint was not available.  
|
   | +0 :ok: |  xmllint  |   0m  1s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 7 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m 58s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  32m 39s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  17m 26s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  compile  |  16m  8s |  |  trunk passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  checkstyle  |   4m 23s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 57s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 39s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 32s |  |  trunk passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   2m 54s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  34m 10s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 34s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m  1s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m 43s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javac  |  16m 43s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m 16s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  javac  |  16m 16s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   4m 17s | 
[/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/7/artifact/out/results-checkstyle-root.txt)
 |  root: The patch generated 1 new + 22 unchanged - 0 fixed = 23 total (was 
22)  |
   | +1 :green_heart: |  mvnsite  |   1m 52s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 35s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 29s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   3m 19s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  34m 20s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   7m 55s |  |  hadoop-mapreduce-client-core in 
the patch passed.  |
   | +1 :green_heart: |  unit  |   2m 40s |  |  hadoop-azure in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   1m  4s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 229m 15s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.45 ServerAPI=1.45 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/7/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6716 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint 
xmllint |
   | uname | Linux 34da63c6214c 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 9193085fea42ef84e475ed84303fc39c199ffd67 |
   | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 

[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837449#comment-17837449
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

mukund-thakur commented on code in PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#discussion_r1566388267


##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java:
##
@@ -582,19 +605,46 @@ protected final Path directoryMustExist(
* Save a task manifest or summary. This will be done by
* writing to a temp path and then renaming.
* If the destination path exists: Delete it.
+   * This will retry so that a rename failure from abfs load or IO errors
+   * will not fail the task.
* @param manifestData the manifest/success file
* @param tempPath temp path for the initial save
* @param finalPath final path for rename.
-   * @throws IOException failure to load/parse
+   * @throws IOException failure to rename after retries.
*/
   @SuppressWarnings("unchecked")
   protected final  void save(T manifestData,
   final Path tempPath,
   final Path finalPath) throws IOException {
-LOG.trace("{}: save('{}, {}, {}')", getName(), manifestData, tempPath, 
finalPath);
-trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () ->
-operations.save(manifestData, tempPath, true));
-renameFile(tempPath, finalPath);
+boolean success = false;
+int failures = 0;
+while (!success) {
+  try {
+LOG.trace("{}: attempt {} save('{}, {}, {}')",
+getName(), failures, manifestData, tempPath, finalPath);
+
+trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () 
->
+operations.save(manifestData, tempPath, true));
+renameFile(tempPath, finalPath);

Review Comment:
   rename may fail in second retry with SourcePathNotFound exception but it 
actually succeeded during the first try only. How are we recovering from that? 



##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java:
##
@@ -445,9 +448,29 @@ protected Boolean delete(
   final boolean recursive,
   final String statistic)
   throws IOException {
-return trackDuration(getIOStatistics(), statistic, () -> {
-  return operations.delete(path, recursive);
-});
+if (recursive) {

Review Comment:
   unable to understand this. How is a recursive flag determining that it is a 
dir or a file?



##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/impl/ManifestStoreOperations.java:
##
@@ -97,6 +97,36 @@ public boolean isFile(Path path) throws IOException {
   public abstract boolean delete(Path path, boolean recursive)
   throws IOException;
 
+  /**
+   * Forward to {@code delete(Path, true)}
+   * unless overridden.
+   * 
+   * If it returns without an error: there is no file at
+   * the end of the path.
+   * @param path path
+   * @return outcome
+   * @throws IOException failure.
+   */
+  public boolean deleteFile(Path path)
+  throws IOException {
+return delete(path, false);
+  }
+
+  /**
+   * Acquire the delete capacity then call {@code FileSystem#delete(Path, 
true)}

Review Comment:
   don't see the ratelimiter acquiring the deletecapacity. 



##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/ManifestCommitterConstants.java:
##
@@ -143,6 +145,20 @@ public final class ManifestCommitterConstants {
*/
   public static final boolean OPT_CLEANUP_PARALLEL_DELETE_DIRS_DEFAULT = true;
 
+  /**
+   * Should parallel cleanup try to delete teh base first?
+   * Best for azure as it skips the task attempt deletions unless
+   * the toplevel delete fails.
+   * Value: {@value}.
+   */
+  public static final String OPT_CLEANUP_PARALLEL_DELETE_BASE_FIRST =
+  OPT_PREFIX + "cleanup.parallel.delete.base.first";
+
+  /**
+   * Default value of option {@link #OPT_CLEANUP_PARALLEL_DELETE_BASE_FIRST}:  
{@value}.
+   */
+  public static final boolean OPT_CLEANUP_PARALLEL_DELETE_BASE_FIRST_DEFAULT = 
true;

Review Comment:
   As it is bad for GCS, shouldn't the default be false?



##
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/manifest_committer.md:
##
@@ -505,7 +491,7 @@ The core set of Azure-optimized options becomes
 
 
   spark.hadoop.fs.azure.io.rate.limit
-  1
+  1000

Review Comment:
   I guess you reduced this for testing? 



##

[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836739#comment-17836739
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

hadoop-yetus commented on PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2052490254

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 31s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  markdownlint  |   0m  1s |  |  markdownlint was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 7 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  15m  2s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  32m 11s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  17m 40s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  compile  |  16m 21s |  |  trunk passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  checkstyle  |   4m 21s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 56s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 38s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 35s |  |  trunk passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   2m 55s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  34m 34s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 33s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m  3s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m 50s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javac  |  16m 50s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m 28s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  javac  |  16m 28s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   4m 14s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 48s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 35s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 33s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   3m 18s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  35m 30s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   7m 52s |  |  hadoop-mapreduce-client-core in 
the patch passed.  |
   | +1 :green_heart: |  unit  |   2m 33s |  |  hadoop-azure in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   1m  4s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 230m 36s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.45 ServerAPI=1.45 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/5/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6716 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint 
|
   | uname | Linux 412988d12e6f 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / a3117cf5bf7ba3a0cc4ffa0f3b1912241fa13c05 |
   | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/5/testReport/ |
   | Max. process+thread count | 1592 (vs. ulimit of 5500) |
   | modules | C: 

[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836732#comment-17836732
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

hadoop-yetus commented on PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2052380306

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 19s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  markdownlint  |   0m  0s |  |  markdownlint was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 7 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m 19s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  19m 45s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   8m 54s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  compile  |   8m 10s |  |  trunk passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  checkstyle  |   2m  4s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 11s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  7s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m  0s |  |  trunk passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   1m 55s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  20m 34s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 23s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   0m 41s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   8m 34s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javac  |   8m 34s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   8m 11s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  javac  |   8m 11s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  1s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   2m  1s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 10s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m  2s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m  2s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   2m  2s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  20m 49s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   5m 44s |  |  hadoop-mapreduce-client-core in 
the patch passed.  |
   | +1 :green_heart: |  unit  |   2m  2s |  |  hadoop-azure in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 43s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 138m 37s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.45 ServerAPI=1.45 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/6/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6716 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint 
|
   | uname | Linux b03fe1ba958a 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 16e1be4a39e8415ecb45ddd3c43c70c9a35f01b0 |
   | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/6/testReport/ |
   | Max. process+thread count | 1530 (vs. ulimit of 5500) |
   | modules | C: 

[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836362#comment-17836362
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

hadoop-yetus commented on PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2050579864

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   6m 46s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  markdownlint  |   0m  0s |  |  markdownlint was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 7 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m 30s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  19m 58s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   8m 55s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  compile  |   8m 15s |  |  trunk passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  checkstyle  |   2m  8s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 13s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  6s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m  2s |  |  trunk passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   1m 54s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  20m 35s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 22s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   0m 40s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   8m 30s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javac  |   8m 30s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   8m 10s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  javac  |   8m 10s |  |  the patch passed  |
   | -1 :x: |  blanks  |   0m  0s | 
[/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/4/artifact/out/blanks-eol.txt)
 |  The patch has 1 line(s) that end in blanks. Use git apply --whitespace=fix 
<>. Refer https://git-scm.com/docs/git-apply  |
   | +1 :green_heart: |  checkstyle  |   2m  1s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 12s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 57s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m  0s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   2m  5s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  20m 44s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   5m 46s |  |  hadoop-mapreduce-client-core in 
the patch passed.  |
   | +1 :green_heart: |  unit  |   2m  4s |  |  hadoop-azure in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 42s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 145m 11s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.45 ServerAPI=1.45 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/4/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6716 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint 
|
   | uname | Linux 8e0cc7ab2749 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 0ed84a98c5e630590f376a445eb28c27c2e8e780 |
   | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
   |  Test 

[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836319#comment-17836319
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran commented on PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2050326218

   Reviews invited from @mukund-thakur @anmolanmol1234 @anujmodi2021 
@HarshitGupta11
   
   




> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835531#comment-17835531
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

hadoop-yetus commented on PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2046034751

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |  12m 27s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 3 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  45m  9s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 42s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  compile  |   0m 39s |  |  trunk passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  checkstyle  |   0m 43s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 47s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 35s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 29s |  |  trunk passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   1m 25s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  33m 50s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 32s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 32s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javac  |   0m 32s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 31s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  javac  |   0m 30s |  |  the patch passed  |
   | -1 :x: |  blanks  |   0m  0s | 
[/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/1/artifact/out/blanks-eol.txt)
 |  The patch has 2 line(s) that end in blanks. Use git apply --whitespace=fix 
<>. Refer https://git-scm.com/docs/git-apply  |
   | +1 :green_heart: |  checkstyle  |   0m 31s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 35s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 20s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 20s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   1m 22s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  33m 49s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  |   7m 41s | 
[/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-core.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/1/artifact/out/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-core.txt)
 |  hadoop-mapreduce-client-core in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 38s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 147m 25s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | 
hadoop.mapreduce.lib.output.committer.manifest.TestCreateOutputDirectoriesStage 
|
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.45 ServerAPI=1.45 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6716 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 27c1a33d07a2 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 4dc5a605e08c88f8529ce7d117ff6033a4e6594c |
   | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 

[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835511#comment-17835511
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran commented on PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2045879335

   testing: azure cardiff
   -Dparallel-tests=abfs -DtestsThreadCount=8
   lots of tests failed for me, but I've now got an account with very low IO 
threshold. We need to look at those failures in general




> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

2024-04-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835505#comment-17835505
 ] 

ASF GitHub Bot commented on MAPREDUCE-7474:
---

steveloughran opened a new pull request, #6716:
URL: https://github.com/apache/hadoop/pull/6716

   
   Improve resilience of task commit save and rename operation with retries.
 
   * Retries of save()
 5 attempts, with 500 millis sleep between them. No configuration.
 Issue: should we make this configurable?
   * Split delete(path, recursive) into deleteFile and rmdir for separate
 statistics.
 
   Test simulation expands to:
   * Support recovery through a countdown of calls to fail.
   * Simulate timeout before *and after* rename calls.
   
   This is based on #6596 but skips the rate limiting logic spanning common and 
azure,
   instead it only contains changes in manifest committer -easier to backport.
   
   
   ### How was this patch tested?
   
   * manual test of new tests
   * full test suite left to yetus
   * azure test run in progress.
   
   
   ### For code changes:
   
   - [X] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   




> [ABFS] Improve commit resilience and performance in Manifest Committer
> --
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned
> * maybe: rate limit some IO internally, but not delegate to abfs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org