[jira] [Commented] (HIVE-14776) Skip 'distcp' call when copying data from HDSF to S3
[ https://issues.apache.org/jira/browse/HIVE-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15501920#comment-15501920 ]

Rajesh Balamohan commented on HIVE-14776:

Have you tried the "--hiveconf fs.trash.interval=0" setting?

> Skip 'distcp' call when copying data from HDSF to S3
>
> Key: HIVE-14776
> URL: https://issues.apache.org/jira/browse/HIVE-14776
> Project: Hive
> Issue Type: Sub-task
> Components: Hive
> Reporter: Sergio Peña
> Assignee: Sergio Peña
> Attachments: HIVE-14776.1.patch, HIVE-14776.2.patch
>
> Hive uses 'distcp' to copy files in parallel between HDFS encryption zones
> when the {{hive.exec.copyfile.maxsize}} threshold is lower than the size of
> the file to copy. This 'distcp' is also executed when copying to S3, but it
> is causing slower copies.
> We should not invoke distcp when copying to blobstore systems.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
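The rule proposed in the issue description (use distcp only for files above the {{hive.exec.copyfile.maxsize}} threshold, and never for blobstore destinations) can be sketched roughly as follows. This is a minimal illustration and not the code from the attached patches: the class name, method names, the scheme list (s3, s3a, s3n), and the threshold value are all assumptions made for the example.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; names are illustrative and do not come from the Hive patch.
final class DistCpDecision {
    // Assumption: URI schemes treated as blobstores for this sketch.
    private static final Set<String> BLOBSTORE_SCHEMES =
        new HashSet<>(Arrays.asList("s3", "s3a", "s3n"));

    // Returns true when the destination URI uses one of the assumed blobstore schemes.
    static boolean isBlobStore(URI dest) {
        String scheme = dest.getScheme();
        return scheme != null
            && BLOBSTORE_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT));
    }

    // distcp is chosen only when the file exceeds the threshold
    // and the destination is not a blobstore.
    static boolean shouldUseDistCp(long fileSizeBytes, long copyFileMaxSizeBytes, URI dest) {
        if (isBlobStore(dest)) {
            return false; // skip distcp for blobstore targets
        }
        return fileSizeBytes > copyFileMaxSizeBytes;
    }

    public static void main(String[] args) {
        long maxSize = 32L * 1024 * 1024;      // hypothetical threshold for the example
        long oneGiB = 1024L * 1024 * 1024;
        // Large file to HDFS: distcp is used.
        System.out.println(shouldUseDistCp(oneGiB, maxSize, URI.create("hdfs://nn/warehouse/t")));
        // Same file to S3: distcp is skipped.
        System.out.println(shouldUseDistCp(oneGiB, maxSize, URI.create("s3a://bucket/warehouse/t")));
    }
}
```

Checking the scheme before the size keeps the blobstore exception independent of the threshold setting.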
[jira] [Commented] (HIVE-14776) Skip 'distcp' call when copying data from HDSF to S3
[ https://issues.apache.org/jira/browse/HIVE-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15498177#comment-15498177 ]

Sergio Peña commented on HIVE-14776:

You're right, distcp does not use S3 as a temporary place. While debugging the code, I saw a '/user/hdfs/.Trash' directory created on S3 with data files being created in it, but after more investigation, I saw that they were copied by Hive when using INSERT OVERWRITE (old data being backed up).

Anyway, using distcp is still slower than not using distcp at all. I have no idea why. I ran several tests with different file sizes (see the times below for copying a single file):

{noformat}
1G
  S3 with distcp:    93s
  S3 with no distcp: 37s
5G
  S3 with distcp:    255s
  S3 with no distcp: 147s
{noformat}

INSERT ... SELECT statements are going to create several files depending on the MR jobs and HDFS block sizes, and they might be slower than 5G.

The S3A adapter should already manage multi-part uploads using the Amazon API. Maybe this is why distcp and s3a do not work well together?
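To put the timings reported in the comment above in relative terms, the slowdown factor of distcp over a plain copy can be computed directly from the four numbers. The snippet below is only arithmetic on those reported values; the class and method names are made up for the example.

```java
import java.util.Locale;

// Hypothetical helper that only restates the timings reported in the comment.
final class CopyTimings {
    // Ratio of the distcp time to the plain-copy time (values above 1 mean distcp was slower).
    static double slowdown(double withDistCpSec, double withoutDistCpSec) {
        return withDistCpSec / withoutDistCpSec;
    }

    public static void main(String[] args) {
        // 1G file: 93s with distcp, 37s without.
        System.out.println(String.format(Locale.ROOT, "1G: %.2fx", slowdown(93, 37)));   // 1G: 2.51x
        // 5G file: 255s with distcp, 147s without.
        System.out.println(String.format(Locale.ROOT, "5G: %.2fx", slowdown(255, 147))); // 5G: 1.73x
    }
}
```

In these two measurements the relative penalty shrinks as the file grows (about 2.5x at 1G, about 1.7x at 5G), although distcp stays slower in both cases.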
[jira] [Commented] (HIVE-14776) Skip 'distcp' call when copying data from HDSF to S3
[ https://issues.apache.org/jira/browse/HIVE-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497841#comment-15497841 ]

Hive QA commented on HIVE-14776:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12828886/HIVE-14776.2.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 8 failed/errored test(s), 10561 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_mapjoin]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ctas]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_join_part_col_char]
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[acid_bucket_pruning]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_3]
org.apache.hadoop.hive.metastore.TestMetaStoreMetrics.testMetaDataCounts
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testAddJarConstructorUnCaching
org.apache.hive.spark.client.TestSparkClient.testJobSubmission
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/1216/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/1216/console
Test logs: http://ec2-204-236-174-241.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-MASTER-Build-1216/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 8 tests failed
{noformat}

This message is automatically generated.
ATTACHMENT ID: 12828886 - PreCommit-HIVE-MASTER-Build
[jira] [Commented] (HIVE-14776) Skip 'distcp' call when copying data from HDSF to S3
[ https://issues.apache.org/jira/browse/HIVE-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497699#comment-15497699 ]

Szehon Ho commented on HIVE-14776:

I'm curious about this. Distcp parallelizes the copy, so if the file or directory is very splittable, in theory it should be faster than a single thread, even with the overhead of the temporary location. I understand that for some small files it will be slower.

Orthogonally, I thought distcp actually puts the file in a temporary location on the local file system before uploading to S3, not in a temporary location on S3.
[jira] [Commented] (HIVE-14776) Skip 'distcp' call when copying data from HDSF to S3
[ https://issues.apache.org/jira/browse/HIVE-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496977#comment-15496977 ]

Hive QA commented on HIVE-14776:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12828846/HIVE-14776.1.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 70 failed/errored test(s), 10561 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_mapjoin]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ctas]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_00_nonpart_empty]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_01_nonpart]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_02_00_part_empty]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_02_part]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_03_nonpart_over_compat]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_04_all_part]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_04_evolved_parts]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_05_some_part]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_06_one_part]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_07_all_part_over_nonoverlap]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_08_nonpart_rename]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_09_part_spec_nonoverlap]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_10_external_managed]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_11_managed_external]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_12_external_location]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_13_managed_location]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_14_managed_location_over_existing]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_15_external_part]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_16_part_external]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_17_part_managed]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_18_part_external]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_19_00_part_external_location]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_19_part_external_location]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_20_part_managed_location]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_21_export_authsuccess]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_22_import_exist_authsuccess]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_23_import_part_authsuccess]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_24_import_nonexist_authsuccess]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_25_export_parentpath_has_inaccessible_children]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_hidden_files]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[repl_2_exim_basic]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[repl_3_exim_metadata]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_join_part_col_char]
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[acid_bucket_pruning]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_3]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[authorization_import]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_01_nonpart_over_loaded]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_02_all_part_over_overlap]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_03_nonpart_noncompat_colschema]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_04_nonpart_noncompat_colnumber]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_05_nonpart_noncompat_coltype]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_06_nonpart_noncompat_storage]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_07_nonpart_noncompat_ifof]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_08_nonpart_noncompat_serde]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_09_nonpart_noncompat_serdeparam]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_10_nonpart_noncompat_bucketing]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_11_nonpart_noncompat_sorting]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_13_nonnative_import]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_14_nonpart_part]
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_15_part_nonpart]