[jira] [Commented] (HIVE-14776) Skip 'distcp' call when copying data from HDSF to S3

2016-09-18 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15501920#comment-15501920
 ] 

Rajesh Balamohan commented on HIVE-14776:
-

Have you tried the "--hiveconf fs.trash.interval=0" setting?

> Skip 'distcp' call when copying data from HDSF to S3
> 
>
> Key: HIVE-14776
> URL: https://issues.apache.org/jira/browse/HIVE-14776
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive
>Reporter: Sergio Peña
>Assignee: Sergio Peña
> Attachments: HIVE-14776.1.patch, HIVE-14776.2.patch
>
>
> Hive uses 'distcp' to copy files in parallel between HDFS encryption zones 
> when the {{hive.exec.copyfile.maxsize}} threshold is lower than the file to 
> copy. This 'distcp' is also executed when copying to S3, but it is causing 
> slower copies.
> We should not invoke distcp when copying to blobstore systems.
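
The rule described above can be sketched as a small decision function. This is only an illustration of the proposed behavior, not the code in the attached patches: the function name, the set of blobstore schemes, and the example sizes are all assumptions made for the sketch.

```python
from urllib.parse import urlparse

# Hypothetical list of schemes treated as blobstores (an assumption for
# this sketch, not taken from the HIVE-14776 patches).
BLOBSTORE_SCHEMES = {"s3", "s3a", "s3n"}


def should_use_distcp(file_size, max_size, dest_uri):
    """Return True if a copy of file_size bytes to dest_uri would go through distcp.

    max_size stands in for the hive.exec.copyfile.maxsize threshold.
    """
    scheme = urlparse(dest_uri).scheme.lower()
    if scheme in BLOBSTORE_SCHEMES:
        # Proposed change: never invoke distcp for blobstore destinations.
        return False
    # Behavior from the description: distcp is used when the file is
    # larger than the threshold.
    return file_size > max_size


if __name__ == "__main__":
    threshold = 32 * 1024 ** 2  # illustrative threshold in bytes
    big_file = 64 * 1024 ** 2   # illustrative file size in bytes
    print(should_use_distcp(big_file, threshold, "hdfs://nn:8020/warehouse/t"))
    print(should_use_distcp(big_file, threshold, "s3a://bucket/warehouse/t"))
```

With these illustrative values, the HDFS destination goes through distcp and the S3A destination does not, whatever the file size.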



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14776) Skip 'distcp' call when copying data from HDSF to S3

2016-09-16 Thread JIRA

[ 
https://issues.apache.org/jira/browse/HIVE-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15498177#comment-15498177
 ] 

Sergio Peña commented on HIVE-14776:


You're right, distcp does not use S3 as a temporary place. While debugging the 
code, I saw a '/user/hdfs/.Trash' directory created on S3 with data files being 
created in it, but after more investigation, I saw that they were copied by Hive 
when using INSERT OVERWRITE (old data being backed up). 

Anyway, using distcp is still slower than not using distcp at all. I have no idea 
why. I ran several tests with different file sizes (see the times below for 
copying a single file):

{noformat}
1G
S3 with distcp: 93s
S3 with no distcp: 37s

5G
S3 with distcp: 255s
S3 with no distcp: 147s
{noformat}

INSERT ... SELECT statements are going to create several files depending on the 
MR jobs and HDFS block sizes, and they might be slower than 5G. 

The S3A adapter should already manage multi-part uploads using the Amazon API. 
Perhaps this is why distcp and s3a do not work well together? 
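
To put the timings reported above in relative terms, the following minimal sketch computes how much longer each copy took with distcp. The numbers are the ones from the comment; the dictionary layout and names are only illustrative.

```python
# Copy times in seconds, taken from the measurements reported above.
timings = {
    "1G": {"distcp": 93, "no_distcp": 37},
    "5G": {"distcp": 255, "no_distcp": 147},
}

# Ratio of the distcp time to the non-distcp time, per file size.
ratios = {
    size: round(t["distcp"] / t["no_distcp"], 2)
    for size, t in timings.items()
}

# Absolute extra time spent when distcp is used, in seconds.
extra_seconds = {
    size: t["distcp"] - t["no_distcp"]
    for size, t in timings.items()
}

for size in timings:
    print(f"{size}: distcp took {ratios[size]}x as long (+{extra_seconds[size]}s)")
# prints:
# 1G: distcp took 2.51x as long (+56s)
# 5G: distcp took 1.73x as long (+108s)
```

The relative penalty shrinks as the file grows (2.51x at 1G, 1.73x at 5G), while the absolute difference grows from 56s to 108s.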



[jira] [Commented] (HIVE-14776) Skip 'distcp' call when copying data from HDSF to S3

2016-09-16 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497841#comment-15497841
 ] 

Hive QA commented on HIVE-14776:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12828886/HIVE-14776.2.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 8 failed/errored test(s), 10561 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_mapjoin]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ctas]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_join_part_col_char]
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[acid_bucket_pruning]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_3]
org.apache.hadoop.hive.metastore.TestMetaStoreMetrics.testMetaDataCounts
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testAddJarConstructorUnCaching
org.apache.hive.spark.client.TestSparkClient.testJobSubmission
{noformat}

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/1216/testReport
Console output: 
https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/1216/console
Test logs: 
http://ec2-204-236-174-241.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-MASTER-Build-1216/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 8 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12828886 - PreCommit-HIVE-MASTER-Build



[jira] [Commented] (HIVE-14776) Skip 'distcp' call when copying data from HDSF to S3

2016-09-16 Thread Szehon Ho (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497699#comment-15497699
 ] 

Szehon Ho commented on HIVE-14776:
--

I'm curious about this.  Distcp parallelizes the copy, so if the file/dir 
is very splittable then in theory it should be faster than a single thread, even 
with the overhead of a temporary location?  I understand that for 
some small files it will be slower.

And just orthogonally, I thought distcp actually puts the file in a temporary 
location on the local file system before uploading to S3, not in a temporary 
location on S3.



[jira] [Commented] (HIVE-14776) Skip 'distcp' call when copying data from HDSF to S3

2016-09-16 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496977#comment-15496977
 ] 

Hive QA commented on HIVE-14776:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12828846/HIVE-14776.1.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 70 failed/errored test(s), 10561 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_mapjoin]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ctas]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_00_nonpart_empty]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_01_nonpart]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_02_00_part_empty]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_02_part]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_03_nonpart_over_compat]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_04_all_part]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_04_evolved_parts]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_05_some_part]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_06_one_part]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_07_all_part_over_nonoverlap]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_08_nonpart_rename]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_09_part_spec_nonoverlap]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_10_external_managed]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_11_managed_external]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_12_external_location]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_13_managed_location]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_14_managed_location_over_existing]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_15_external_part]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_16_part_external]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_17_part_managed]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_18_part_external]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_19_00_part_external_location]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_19_part_external_location]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_20_part_managed_location]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_21_export_authsuccess]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_22_import_exist_authsuccess]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_23_import_part_authsuccess]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_24_import_nonexist_authsuccess]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_25_export_parentpath_has_inaccessible_children]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_hidden_files]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[repl_2_exim_basic]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[repl_3_exim_metadata]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_join_part_col_char]
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[acid_bucket_pruning]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_3]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[authorization_import]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_01_nonpart_over_loaded]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_02_all_part_over_overlap]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_03_nonpart_noncompat_colschema]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_04_nonpart_noncompat_colnumber]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_05_nonpart_noncompat_coltype]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_06_nonpart_noncompat_storage]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_07_nonpart_noncompat_ifof]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_08_nonpart_noncompat_serde]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_09_nonpart_noncompat_serdeparam]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_10_nonpart_noncompat_bucketing]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_11_nonpart_noncompat_sorting]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_13_nonnative_import]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_14_nonpart_part]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_15_part_nonpart]