[jira] [Updated] (HIVE-16901) Distcp optimization - One distcp per CopyTask

2017-06-14 Thread Sankar Hariappan (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-16901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sankar Hariappan updated HIVE-16901:

Description: 
Currently, if a CopyTask is created to copy a list of files, distcp is 
invoked once for each file. Instead, pass the whole list of source files to 
the distcp tool in a single invocation; distcp copies the files in parallel, 
which should give a significant performance gain.

If the copy of the list of files fails, traverse the destination directory to 
find files that are missing or whose checksums do not match, then copy those 
files one by one.

  was:
Currently, if a ReplCopyTask is created to copy a list of files, then distcp is 
invoked for each and every file. Instead, need to pass the list of source files 
to be copied to distcp tool which basically copies the files in parallel and 
hence gets lot of performance gain.

If the copy of list of files fail, then traverse the destination directory to 
see which file is missing and checksum mismatches, then trigger copy of those 
files one by one.


> Distcp optimization - One distcp per CopyTask 
> --
>
> Key: HIVE-16901
> URL: https://issues.apache.org/jira/browse/HIVE-16901
> Project: Hive
> Issue Type: Sub-task
> Components: Hive, repl
> Affects Versions: 2.1.0
> Reporter: Sankar Hariappan
> Assignee: Sankar Hariappan
> Labels: DR, replication
> Fix For: 3.0.0
>
>
> Currently, if a CopyTask is created to copy a list of files, distcp is 
> invoked once for each file. Instead, pass the whole list of source files to 
> the distcp tool in a single invocation; distcp copies the files in parallel, 
> which should give a significant performance gain.
> If the copy of the list of files fails, traverse the destination directory to 
> find files that are missing or whose checksums do not match, then copy those 
> files one by one.
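
The proposed flow (one bulk copy for the whole list, then a per-file retry for
anything missing or mismatched) can be sketched roughly as below. This is only
an illustration in Python against the local filesystem, not the Hive
implementation: the function names, the `bulk_copy` callback that stands in
for a single distcp invocation, the use of SHA-256 as the checksum, and the
assumption that files land in the destination under their source file name
are all hypothetical.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Callable, Iterable, List


def checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents (assumed checksum)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def files_needing_retry(sources: Iterable[Path], dest_dir: Path) -> List[Path]:
    """List sources whose copy in dest_dir is missing or has a different checksum."""
    retry = []
    for src in sources:
        dst = dest_dir / src.name
        if not dst.is_file() or checksum(dst) != checksum(src):
            retry.append(src)
    return retry


def copy_files(
    sources: Iterable[Path],
    dest_dir: Path,
    bulk_copy: Callable[[List[Path], Path], None],
) -> List[Path]:
    """Try one bulk copy for the whole list; on failure, re-copy only bad files.

    Returns the files that had to be copied individually.
    """
    sources = list(sources)
    dest_dir.mkdir(parents=True, exist_ok=True)
    try:
        # Hypothetical stand-in for one distcp run over all source files.
        bulk_copy(sources, dest_dir)
        return []
    except OSError:
        # Bulk copy failed: inspect the destination and fix files one by one.
        retried = files_needing_retry(sources, dest_dir)
        for src in retried:
            shutil.copyfile(src, dest_dir / src.name)
        return retried
```

A caller would pass its own bulk-copy function and could log the returned list
to see how many files fell back to individual copies.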



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-16901) Distcp optimization - One distcp per CopyTask

2017-06-14 Thread Sankar Hariappan (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-16901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sankar Hariappan updated HIVE-16901:

Summary: Distcp optimization - One distcp per CopyTask
(was: Distcp optimization - One distcp per ReplCopyTask)

> Distcp optimization - One distcp per CopyTask 
> --
>
> Key: HIVE-16901
> URL: https://issues.apache.org/jira/browse/HIVE-16901
> Project: Hive
> Issue Type: Sub-task
> Components: Hive, repl
> Affects Versions: 2.1.0
> Reporter: Sankar Hariappan
> Assignee: Sankar Hariappan
> Labels: DR, replication
> Fix For: 3.0.0
>
>
> Currently, if a ReplCopyTask is created to copy a list of files, distcp is 
> invoked once for each file. Instead, pass the whole list of source files to 
> the distcp tool in a single invocation; distcp copies the files in parallel, 
> which should give a significant performance gain.
> If the copy of the list of files fails, traverse the destination directory to 
> find files that are missing or whose checksums do not match, then copy those 
> files one by one.


