[jira] [Updated] (HADOOP-15209) DistCp to eliminate needless deletion of files under already-deleted directories

2018-03-15 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-15209:

   Resolution: Fixed
Fix Version/s: 3.1.0
   Status: Resolved  (was: Patch Available)

committed to branch 3.1+

thanks for everyones review and testing...very much one where you need to play 
with distcp on big diffs to appreciate how much it matters.

> DistCp to eliminate needless deletion of files under already-deleted 
> directories
> 
>
> Key: HADOOP-15209
> URL: https://issues.apache.org/jira/browse/HADOOP-15209
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 2.9.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: HADOOP-15209-001.patch, HADOOP-15209-002.patch, 
> HADOOP-15209-003.patch, HADOOP-15209-004.patch, HADOOP-15209-005.patch, 
> HADOOP-15209-006.patch, HADOOP-15209-007.patch
>
>
> DistCP issues a delete(file) request even if is underneath an already deleted 
> directory. This generates needless load on filesystems/object stores, and, if 
> the store throttles delete, can dramatically slow down the delete operation.
> If the distcp delete operation can build a history of deleted directories, 
> then it will know when it does not need to issue those deletes.
> Care is needed here to make sure that whatever structure is created does not 
> overload the heap of the process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15209) DistCp to eliminate needless deletion of files under already-deleted directories

2018-03-09 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-15209:

Attachment: HADOOP-15209-007.patch

> DistCp to eliminate needless deletion of files under already-deleted 
> directories
> 
>
> Key: HADOOP-15209
> URL: https://issues.apache.org/jira/browse/HADOOP-15209
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 2.9.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HADOOP-15209-001.patch, HADOOP-15209-002.patch, 
> HADOOP-15209-003.patch, HADOOP-15209-004.patch, HADOOP-15209-005.patch, 
> HADOOP-15209-006.patch, HADOOP-15209-007.patch
>
>
> DistCP issues a delete(file) request even if is underneath an already deleted 
> directory. This generates needless load on filesystems/object stores, and, if 
> the store throttles delete, can dramatically slow down the delete operation.
> If the distcp delete operation can build a history of deleted directories, 
> then it will know when it does not need to issue those deletes.
> Care is needed here to make sure that whatever structure is created does not 
> overload the heap of the process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15209) DistCp to eliminate needless deletion of files under already-deleted directories

2018-03-09 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-15209:

Status: Patch Available  (was: Open)

HADOOP-15209 patch 007

* don't worry about false from deletes. The only FS which *may* return it for 
bigger problems is ftp, and ftp doesn't work as a dest for distcp
* just add it to a counter & log. At it means is that a delete of a parent dir 
has cut it, but that dir is no longer in the cache.
* print cache size at the end of the run
* tests to verify using counters that there's no files copied over to the 
remote store. Tested against: s3a, wasb, adl.
* tried to add counters to the delete operations, but they don't get picked up 
in the job results...obviously the committer is special that way.
* and the -i ignore flag also ignores delete() failure, which makes it 
consistent.


> DistCp to eliminate needless deletion of files under already-deleted 
> directories
> 
>
> Key: HADOOP-15209
> URL: https://issues.apache.org/jira/browse/HADOOP-15209
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 2.9.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HADOOP-15209-001.patch, HADOOP-15209-002.patch, 
> HADOOP-15209-003.patch, HADOOP-15209-004.patch, HADOOP-15209-005.patch, 
> HADOOP-15209-006.patch, HADOOP-15209-007.patch
>
>
> DistCP issues a delete(file) request even if is underneath an already deleted 
> directory. This generates needless load on filesystems/object stores, and, if 
> the store throttles delete, can dramatically slow down the delete operation.
> If the distcp delete operation can build a history of deleted directories, 
> then it will know when it does not need to issue those deletes.
> Care is needed here to make sure that whatever structure is created does not 
> overload the heap of the process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15209) DistCp to eliminate needless deletion of files under already-deleted directories

2018-03-08 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-15209:

Status: Open  (was: Patch Available)

> DistCp to eliminate needless deletion of files under already-deleted 
> directories
> 
>
> Key: HADOOP-15209
> URL: https://issues.apache.org/jira/browse/HADOOP-15209
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 2.9.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HADOOP-15209-001.patch, HADOOP-15209-002.patch, 
> HADOOP-15209-003.patch, HADOOP-15209-004.patch, HADOOP-15209-005.patch, 
> HADOOP-15209-006.patch
>
>
> DistCP issues a delete(file) request even if is underneath an already deleted 
> directory. This generates needless load on filesystems/object stores, and, if 
> the store throttles delete, can dramatically slow down the delete operation.
> If the distcp delete operation can build a history of deleted directories, 
> then it will know when it does not need to issue those deletes.
> Care is needed here to make sure that whatever structure is created does not 
> overload the heap of the process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15209) DistCp to eliminate needless deletion of files under already-deleted directories

2018-03-06 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-15209:

Status: Patch Available  (was: Open)

Patch 006
checksum comparison changes reverted, though there's one optimisation: if 
source doesn't have a checksum, there's no probe for the endpoint. Saves 
HTTP/RPC calls when doing a distcp from an FS without checksums. 

Tested: s3a ireland, azure & adl UK. (this patch adds the azure one)



> DistCp to eliminate needless deletion of files under already-deleted 
> directories
> 
>
> Key: HADOOP-15209
> URL: https://issues.apache.org/jira/browse/HADOOP-15209
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 2.9.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HADOOP-15209-001.patch, HADOOP-15209-002.patch, 
> HADOOP-15209-003.patch, HADOOP-15209-004.patch, HADOOP-15209-005.patch, 
> HADOOP-15209-006.patch
>
>
> DistCP issues a delete(file) request even if is underneath an already deleted 
> directory. This generates needless load on filesystems/object stores, and, if 
> the store throttles delete, can dramatically slow down the delete operation.
> If the distcp delete operation can build a history of deleted directories, 
> then it will know when it does not need to issue those deletes.
> Care is needed here to make sure that whatever structure is created does not 
> overload the heap of the process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15209) DistCp to eliminate needless deletion of files under already-deleted directories

2018-03-06 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-15209:

Attachment: HADOOP-15209-006.patch

> DistCp to eliminate needless deletion of files under already-deleted 
> directories
> 
>
> Key: HADOOP-15209
> URL: https://issues.apache.org/jira/browse/HADOOP-15209
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 2.9.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HADOOP-15209-001.patch, HADOOP-15209-002.patch, 
> HADOOP-15209-003.patch, HADOOP-15209-004.patch, HADOOP-15209-005.patch, 
> HADOOP-15209-006.patch
>
>
> DistCP issues a delete(file) request even if is underneath an already deleted 
> directory. This generates needless load on filesystems/object stores, and, if 
> the store throttles delete, can dramatically slow down the delete operation.
> If the distcp delete operation can build a history of deleted directories, 
> then it will know when it does not need to issue those deletes.
> Care is needed here to make sure that whatever structure is created does not 
> overload the heap of the process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15209) DistCp to eliminate needless deletion of files under already-deleted directories

2018-03-06 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-15209:

Status: Open  (was: Patch Available)

> DistCp to eliminate needless deletion of files under already-deleted 
> directories
> 
>
> Key: HADOOP-15209
> URL: https://issues.apache.org/jira/browse/HADOOP-15209
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 2.9.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HADOOP-15209-001.patch, HADOOP-15209-002.patch, 
> HADOOP-15209-003.patch, HADOOP-15209-004.patch, HADOOP-15209-005.patch
>
>
> DistCP issues a delete(file) request even if is underneath an already deleted 
> directory. This generates needless load on filesystems/object stores, and, if 
> the store throttles delete, can dramatically slow down the delete operation.
> If the distcp delete operation can build a history of deleted directories, 
> then it will know when it does not need to issue those deletes.
> Care is needed here to make sure that whatever structure is created does not 
> overload the heap of the process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15209) DistCp to eliminate needless deletion of files under already-deleted directories

2018-03-05 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-15209:

Status: Patch Available  (was: Open)

tested ITestS3AContractDistCp against s3 ireland

> DistCp to eliminate needless deletion of files under already-deleted 
> directories
> 
>
> Key: HADOOP-15209
> URL: https://issues.apache.org/jira/browse/HADOOP-15209
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 2.9.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HADOOP-15209-001.patch, HADOOP-15209-002.patch, 
> HADOOP-15209-003.patch, HADOOP-15209-004.patch, HADOOP-15209-005.patch
>
>
> DistCP issues a delete(file) request even if is underneath an already deleted 
> directory. This generates needless load on filesystems/object stores, and, if 
> the store throttles delete, can dramatically slow down the delete operation.
> If the distcp delete operation can build a history of deleted directories, 
> then it will know when it does not need to issue those deletes.
> Care is needed here to make sure that whatever structure is created does not 
> overload the heap of the process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15209) DistCp to eliminate needless deletion of files under already-deleted directories

2018-03-05 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-15209:

Attachment: HADOOP-15209-005.patch

> DistCp to eliminate needless deletion of files under already-deleted 
> directories
> 
>
> Key: HADOOP-15209
> URL: https://issues.apache.org/jira/browse/HADOOP-15209
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 2.9.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HADOOP-15209-001.patch, HADOOP-15209-002.patch, 
> HADOOP-15209-003.patch, HADOOP-15209-004.patch, HADOOP-15209-005.patch
>
>
> DistCP issues a delete(file) request even if is underneath an already deleted 
> directory. This generates needless load on filesystems/object stores, and, if 
> the store throttles delete, can dramatically slow down the delete operation.
> If the distcp delete operation can build a history of deleted directories, 
> then it will know when it does not need to issue those deletes.
> Care is needed here to make sure that whatever structure is created does not 
> overload the heap of the process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15209) DistCp to eliminate needless deletion of files under already-deleted directories

2018-03-05 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-15209:

Status: Open  (was: Patch Available)

> DistCp to eliminate needless deletion of files under already-deleted 
> directories
> 
>
> Key: HADOOP-15209
> URL: https://issues.apache.org/jira/browse/HADOOP-15209
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 2.9.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HADOOP-15209-001.patch, HADOOP-15209-002.patch, 
> HADOOP-15209-003.patch, HADOOP-15209-004.patch
>
>
> DistCP issues a delete(file) request even if is underneath an already deleted 
> directory. This generates needless load on filesystems/object stores, and, if 
> the store throttles delete, can dramatically slow down the delete operation.
> If the distcp delete operation can build a history of deleted directories, 
> then it will know when it does not need to issue those deletes.
> Care is needed here to make sure that whatever structure is created does not 
> overload the heap of the process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15209) DistCp to eliminate needless deletion of files under already-deleted directories

2018-03-05 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-15209:

Status: Patch Available  (was: Open)

Patch 004

adds a retry on the getFileStatus call after a delete fails. This is to try to 
handle eventual consistency quirks against, as usual, s3. 

Also, more logging, including calling targetfs.toString at the end, so you get 
more update states from s3a.

tested manually against s3a london

> DistCp to eliminate needless deletion of files under already-deleted 
> directories
> 
>
> Key: HADOOP-15209
> URL: https://issues.apache.org/jira/browse/HADOOP-15209
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 2.9.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HADOOP-15209-001.patch, HADOOP-15209-002.patch, 
> HADOOP-15209-003.patch, HADOOP-15209-004.patch
>
>
> DistCP issues a delete(file) request even if is underneath an already deleted 
> directory. This generates needless load on filesystems/object stores, and, if 
> the store throttles delete, can dramatically slow down the delete operation.
> If the distcp delete operation can build a history of deleted directories, 
> then it will know when it does not need to issue those deletes.
> Care is needed here to make sure that whatever structure is created does not 
> overload the heap of the process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15209) DistCp to eliminate needless deletion of files under already-deleted directories

2018-03-05 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-15209:

Status: Open  (was: Patch Available)

> DistCp to eliminate needless deletion of files under already-deleted 
> directories
> 
>
> Key: HADOOP-15209
> URL: https://issues.apache.org/jira/browse/HADOOP-15209
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 2.9.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HADOOP-15209-001.patch, HADOOP-15209-002.patch, 
> HADOOP-15209-003.patch, HADOOP-15209-004.patch
>
>
> DistCP issues a delete(file) request even if is underneath an already deleted 
> directory. This generates needless load on filesystems/object stores, and, if 
> the store throttles delete, can dramatically slow down the delete operation.
> If the distcp delete operation can build a history of deleted directories, 
> then it will know when it does not need to issue those deletes.
> Care is needed here to make sure that whatever structure is created does not 
> overload the heap of the process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15209) DistCp to eliminate needless deletion of files under already-deleted directories

2018-03-05 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-15209:

Attachment: HADOOP-15209-004.patch

> DistCp to eliminate needless deletion of files under already-deleted 
> directories
> 
>
> Key: HADOOP-15209
> URL: https://issues.apache.org/jira/browse/HADOOP-15209
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 2.9.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HADOOP-15209-001.patch, HADOOP-15209-002.patch, 
> HADOOP-15209-003.patch, HADOOP-15209-004.patch
>
>
> DistCP issues a delete(file) request even if is underneath an already deleted 
> directory. This generates needless load on filesystems/object stores, and, if 
> the store throttles delete, can dramatically slow down the delete operation.
> If the distcp delete operation can build a history of deleted directories, 
> then it will know when it does not need to issue those deletes.
> Care is needed here to make sure that whatever structure is created does not 
> overload the heap of the process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15209) DistCp to eliminate needless deletion of files under already-deleted directories

2018-02-21 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-15209:

Summary: DistCp to eliminate needless deletion of files under 
already-deleted directories  (was: DistCp to eliminate needless deletion of 
files under deleted directories)

> DistCp to eliminate needless deletion of files under already-deleted 
> directories
> 
>
> Key: HADOOP-15209
> URL: https://issues.apache.org/jira/browse/HADOOP-15209
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 2.9.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HADOOP-15209-001.patch, HADOOP-15209-002.patch, 
> HADOOP-15209-003.patch
>
>
> DistCP issues a delete(file) request even if is underneath an already deleted 
> directory. This generates needless load on filesystems/object stores, and, if 
> the store throttles delete, can dramatically slow down the delete operation.
> If the distcp delete operation can build a history of deleted directories, 
> then it will know when it does not need to issue those deletes.
> Care is needed here to make sure that whatever structure is created does not 
> overload the heap of the process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org