[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation

2020-02-10 Thread Sahil Takiar (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033713#comment-17033713
 ] 

Sahil Takiar commented on HIVE-14165:
-

Marking as unassigned as I am no longer working on this. IIRC this speedup only
applies to very simple queries, e.g. select / project queries.

> Remove Hive file listing during split computation
> -
>
> Key: HIVE-14165
> URL: https://issues.apache.org/jira/browse/HIVE-14165
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Abdullah Yousufi
>Priority: Major
> Attachments: HIVE-14165.02.patch, HIVE-14165.03.patch, 
> HIVE-14165.04.patch, HIVE-14165.05.patch, HIVE-14165.06.patch, 
> HIVE-14165.07.patch, HIVE-14165.patch
>
>
> The Hive side listing in FetchOperator.java is unnecessary, since Hadoop's 
> FileInputFormat.java will list the files during split computation anyway to 
> determine their size. One way to remove this is to catch the 
> InvalidInputFormat exception thrown by FileInputFormat#getSplits() on the 
> Hive side instead of doing the file listing beforehand.
> For S3 select queries on partitioned tables, this results in a 2x speedup.
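
A minimal sketch of the approach described above, using the
{{org.apache.hadoop.mapred}} API; the helper class and method names here are
hypothetical, not the actual FetchOperator code:

{code}
import java.io.IOException;

import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.InvalidInputException;
import org.apache.hadoop.mapred.JobConf;

public class SplitComputationSketch {
  /**
   * Ask the InputFormat for splits directly and treat a missing or empty input
   * directory as "no splits", instead of listing the files up front.
   */
  static InputSplit[] getSplitsOrEmpty(InputFormat<?, ?> inputFormat, JobConf job,
      int numSplits) throws IOException {
    try {
      // FileInputFormat#getSplits lists the input paths itself to size the splits.
      return inputFormat.getSplits(job, numSplits);
    } catch (InvalidInputException e) {
      // Thrown when an input path does not exist or matches no files; the
      // Hive-side listing used to detect this case before getSplits was called.
      return new InputSplit[0];
    }
  }
}
{code}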



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation

2020-02-10 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033607#comment-17033607
 ] 

Steve Loughran commented on HIVE-14165:
---

What is the current status of this? Is it a de facto WONTFIX? Or is someone
keeping the patch up to date?

> Remove Hive file listing during split computation
> -
>
> Key: HIVE-14165
> URL: https://issues.apache.org/jira/browse/HIVE-14165
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Abdullah Yousufi
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-14165.02.patch, HIVE-14165.03.patch, 
> HIVE-14165.04.patch, HIVE-14165.05.patch, HIVE-14165.06.patch, 
> HIVE-14165.07.patch, HIVE-14165.patch
>
>
> The Hive side listing in FetchOperator.java is unnecessary, since Hadoop's 
> FileInputFormat.java will list the files during split computation anyway to 
> determine their size. One way to remove this is to catch the 
> InvalidInputFormat exception thrown by FileInputFormat#getSplits() on the 
> Hive side instead of doing the file listing beforehand.
> For S3 select queries on partitioned tables, this results in a 2x speedup.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation

2019-03-29 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16804833#comment-16804833
 ] 

t oo commented on HIVE-14165:
-

gentle ping

> Remove Hive file listing during split computation
> -
>
> Key: HIVE-14165
> URL: https://issues.apache.org/jira/browse/HIVE-14165
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Abdullah Yousufi
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-14165.02.patch, HIVE-14165.03.patch, 
> HIVE-14165.04.patch, HIVE-14165.05.patch, HIVE-14165.06.patch, 
> HIVE-14165.07.patch, HIVE-14165.patch
>
>
> The Hive side listing in FetchOperator.java is unnecessary, since Hadoop's 
> FileInputFormat.java will list the files during split computation anyway to 
> determine their size. One way to remove this is to catch the 
> InvalidInputFormat exception thrown by FileInputFormat#getSplits() on the 
> Hive side instead of doing the file listing beforehand.
> For S3 select queries on partitioned tables, this results in a 2x speedup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation

2017-03-25 Thread Pengcheng Xiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941966#comment-15941966
 ] 

Pengcheng Xiong commented on HIVE-14165:


Hello, I am deferring this to Hive 3.0 as we are going to cut the first RC and
it is not marked as a blocker. Please feel free to commit to the branch if this
can be resolved before the release.

> Remove Hive file listing during split computation
> -
>
> Key: HIVE-14165
> URL: https://issues.apache.org/jira/browse/HIVE-14165
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Abdullah Yousufi
>Assignee: Sahil Takiar
> Attachments: HIVE-14165.02.patch, HIVE-14165.03.patch, 
> HIVE-14165.04.patch, HIVE-14165.05.patch, HIVE-14165.06.patch, 
> HIVE-14165.07.patch, HIVE-14165.patch
>
>
> The Hive side listing in FetchOperator.java is unnecessary, since Hadoop's 
> FileInputFormat.java will list the files during split computation anyway to 
> determine their size. One way to remove this is to catch the 
> InvalidInputFormat exception thrown by FileInputFormat#getSplits() on the 
> Hive side instead of doing the file listing beforehand.
> For S3 select queries on partitioned tables, this results in a 2x speedup.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation

2017-03-22 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937749#comment-15937749
 ] 

Hive QA commented on HIVE-14165:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12860035/HIVE-14165.07.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 10509 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[comments] (batchId=35)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_04_evolved_parts] 
(batchId=29)
org.apache.hive.hcatalog.pig.TestOrcHCatLoader.testReadMissingPartitionBasicNeg 
(batchId=175)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/4304/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/4304/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-4304/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12860035 - PreCommit-HIVE-Build

> Remove Hive file listing during split computation
> -
>
> Key: HIVE-14165
> URL: https://issues.apache.org/jira/browse/HIVE-14165
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Abdullah Yousufi
>Assignee: Sahil Takiar
> Attachments: HIVE-14165.02.patch, HIVE-14165.03.patch, 
> HIVE-14165.04.patch, HIVE-14165.05.patch, HIVE-14165.06.patch, 
> HIVE-14165.07.patch, HIVE-14165.patch
>
>
> The Hive side listing in FetchOperator.java is unnecessary, since Hadoop's 
> FileInputFormat.java will list the files during split computation anyway to 
> determine their size. One way to remove this is to catch the 
> InvalidInputFormat exception thrown by FileInputFormat#getSplits() on the 
> Hive side instead of doing the file listing beforehand.
> For S3 select queries on partitioned tables, this results in a 2x speedup.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation

2017-01-06 Thread Vihang Karajgaonkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806158#comment-15806158
 ] 

Vihang Karajgaonkar commented on HIVE-14165:


Thanks for the patch [~stakiar]. It seems like the previous implementation
ignored zero-length files when computing the splits, while
FileInputFormat.getSplits() creates an empty split for each zero-length file. I
am not sure how this impacts execution; it may be worthwhile to test. Also, if
needed, you could ignore the empty splits before adding them to
{{FetchInputFormatSplit[] inputSplit}}.
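
A minimal sketch of that filtering idea; the helper below is hypothetical, and
the real code would presumably wrap the surviving splits in
{{FetchInputFormatSplit}}:

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapred.InputSplit;

class EmptySplitFilter {
  /** Drop zero-length splits so downstream code never sees an empty split. */
  static InputSplit[] dropEmptySplits(InputSplit[] splits) throws IOException {
    List<InputSplit> nonEmpty = new ArrayList<>(splits.length);
    for (InputSplit split : splits) {
      if (split.getLength() > 0) { // zero-length files yield zero-length splits
        nonEmpty.add(split);
      }
    }
    return nonEmpty.toArray(new InputSplit[0]);
  }
}
{code}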

> Remove Hive file listing during split computation
> -
>
> Key: HIVE-14165
> URL: https://issues.apache.org/jira/browse/HIVE-14165
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Abdullah Yousufi
>Assignee: Sahil Takiar
> Attachments: HIVE-14165.02.patch, HIVE-14165.03.patch, 
> HIVE-14165.04.patch, HIVE-14165.05.patch, HIVE-14165.06.patch, 
> HIVE-14165.patch
>
>
> The Hive side listing in FetchOperator.java is unnecessary, since Hadoop's 
> FileInputFormat.java will list the files during split computation anyway to 
> determine their size. One way to remove this is to catch the 
> InvalidInputFormat exception thrown by FileInputFormat#getSplits() on the 
> Hive side instead of doing the file listing beforehand.
> For S3 select queries on partitioned tables, this results in a 2x speedup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation

2016-12-21 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767455#comment-15767455
 ] 

Sahil Takiar commented on HIVE-14165:
-

[~poeppt] just attached an RB.

I agree we shouldn't make backwards-incompatible changes to Hive. Let me know
what you think of the RB.

There are some alternatives to this approach though:

* The file listing could be done in the background, by a dedicated thread
* Listing could be done eagerly rather than lazily, so that the file listing
does not block the fetch operator

Either alternative would still offer a good speedup, but would require the same
number of metadata operations against S3.
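
A minimal sketch of the first alternative, a dedicated listing thread; the
class and method names are hypothetical:

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class BackgroundListing {
  private final ExecutorService listingPool = Executors.newSingleThreadExecutor();

  /** Start the S3 listing eagerly so the fetch operator is not blocked on it. */
  Future<FileStatus[]> listAsync(Configuration conf, Path partitionPath) {
    return listingPool.submit(() -> {
      FileSystem fs = partitionPath.getFileSystem(conf);
      // Same number of S3 metadata calls, just moved off the critical path.
      return fs.listStatus(partitionPath);
    });
  }
}
{code}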

> Remove Hive file listing during split computation
> -
>
> Key: HIVE-14165
> URL: https://issues.apache.org/jira/browse/HIVE-14165
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Abdullah Yousufi
>Assignee: Sahil Takiar
> Attachments: HIVE-14165.02.patch, HIVE-14165.03.patch, 
> HIVE-14165.04.patch, HIVE-14165.05.patch, HIVE-14165.06.patch, 
> HIVE-14165.patch
>
>
> The Hive side listing in FetchOperator.java is unnecessary, since Hadoop's 
> FileInputFormat.java will list the files during split computation anyway to 
> determine their size. One way to remove this is to catch the 
> InvalidInputFormat exception thrown by FileInputFormat#getSplits() on the 
> Hive side instead of doing the file listing beforehand.
> For S3 select queries on partitioned tables, this results in a 2x speedup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation

2016-12-21 Thread Thomas Poepping (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767035#comment-15767035
 ] 

Thomas Poepping commented on HIVE-14165:


Hi Sahil,

When you update the patch, can you create a new ReviewBoard submission?

WRT the InputFormat issue, my feeling is that we should steer away
from backwards-incompatible changes. Is there a way we can avoid the
backwards-incompatible change but still avoid the unnecessary listing?

I will be able to provide more targeted feedback once the RB submission has 
been updated.

> Remove Hive file listing during split computation
> -
>
> Key: HIVE-14165
> URL: https://issues.apache.org/jira/browse/HIVE-14165
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Abdullah Yousufi
>Assignee: Sahil Takiar
> Attachments: HIVE-14165.02.patch, HIVE-14165.03.patch, 
> HIVE-14165.04.patch, HIVE-14165.05.patch, HIVE-14165.06.patch, 
> HIVE-14165.patch
>
>
> The Hive side listing in FetchOperator.java is unnecessary, since Hadoop's 
> FileInputFormat.java will list the files during split computation anyway to 
> determine their size. One way to remove this is to catch the 
> InvalidInputFormat exception thrown by FileInputFormat#getSplits() on the 
> Hive side instead of doing the file listing beforehand.
> For S3 select queries on partitioned tables, this results in a 2x speedup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation

2016-12-21 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766615#comment-15766615
 ] 

Hive QA commented on HIVE-14165:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12844199/HIVE-14165.06.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 15 failed/errored test(s), 10825 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=234)
TestVectorizedColumnReaderBase - did not produce a TEST-*.xml file (likely 
timed out) (batchId=251)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[dbtxnmgr_showlocks] 
(batchId=71)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_04_evolved_parts] 
(batchId=29)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[str_to_map] (batchId=58)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[transform_ppr2] 
(batchId=135)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[stats_based_fetch_decision]
 (batchId=151)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] 
(batchId=93)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exchange_partition_neg_incomplete_partition]
 (batchId=84)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_00_unsupported_schema]
 (batchId=85)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query36] 
(batchId=222)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query70] 
(batchId=222)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query86] 
(batchId=222)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vector_count_distinct]
 (batchId=105)
org.apache.hive.hcatalog.pig.TestHCatLoader.testReadMissingPartitionBasicNeg[3] 
(batchId=171)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/2671/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/2671/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-2671/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 15 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12844199 - PreCommit-HIVE-Build

> Remove Hive file listing during split computation
> -
>
> Key: HIVE-14165
> URL: https://issues.apache.org/jira/browse/HIVE-14165
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Abdullah Yousufi
>Assignee: Sahil Takiar
> Attachments: HIVE-14165.02.patch, HIVE-14165.03.patch, 
> HIVE-14165.04.patch, HIVE-14165.05.patch, HIVE-14165.06.patch, 
> HIVE-14165.patch
>
>
> The Hive side listing in FetchOperator.java is unnecessary, since Hadoop's 
> FileInputFormat.java will list the files during split computation anyway to 
> determine their size. One way to remove this is to catch the 
> InvalidInputFormat exception thrown by FileInputFormat#getSplits() on the 
> Hive side instead of doing the file listing beforehand.
> For S3 select queries on partitioned tables, this results in a 2x speedup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation

2016-12-20 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766220#comment-15766220
 ] 

Hive QA commented on HIVE-14165:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12844177/HIVE-14165.05.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 14 failed/errored test(s), 10825 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=234)
TestVectorizedColumnReaderBase - did not produce a TEST-*.xml file (likely 
timed out) (batchId=251)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[dbtxnmgr_showlocks] 
(batchId=71)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_04_evolved_parts] 
(batchId=29)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[str_to_map] (batchId=58)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[transform_ppr2] 
(batchId=135)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[stats_based_fetch_decision]
 (batchId=151)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] 
(batchId=93)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exchange_partition_neg_incomplete_partition]
 (batchId=84)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_00_unsupported_schema]
 (batchId=85)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query36] 
(batchId=222)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query70] 
(batchId=222)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query86] 
(batchId=222)
org.apache.hive.hcatalog.pig.TestHCatLoader.testReadMissingPartitionBasicNeg[3] 
(batchId=171)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/2667/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/2667/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-2667/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 14 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12844177 - PreCommit-HIVE-Build

> Remove Hive file listing during split computation
> -
>
> Key: HIVE-14165
> URL: https://issues.apache.org/jira/browse/HIVE-14165
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Abdullah Yousufi
>Assignee: Sahil Takiar
> Attachments: HIVE-14165.02.patch, HIVE-14165.03.patch, 
> HIVE-14165.04.patch, HIVE-14165.05.patch, HIVE-14165.patch
>
>
> The Hive side listing in FetchOperator.java is unnecessary, since Hadoop's 
> FileInputFormat.java will list the files during split computation anyway to 
> determine their size. One way to remove this is to catch the 
> InvalidInputFormat exception thrown by FileInputFormat#getSplits() on the 
> Hive side instead of doing the file listing beforehand.
> For S3 select queries on partitioned tables, this results in a 2x speedup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation

2016-12-20 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765927#comment-15765927
 ] 

Hive QA commented on HIVE-14165:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12844165/HIVE-14165.04.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 74 failed/errored test(s), 10825 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=234)
TestVectorizedColumnReaderBase - did not produce a TEST-*.xml file (likely 
timed out) (batchId=251)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[authorization_1_sql_std] 
(batchId=40)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cbo_const] (batchId=16)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[concat_op] (batchId=67)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[dbtxnmgr_showlocks] 
(batchId=71)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[decimal_precision2] 
(batchId=47)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[dynpart_sort_opt_bucketing]
 (batchId=78)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_04_evolved_parts] 
(batchId=29)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[float_equality] 
(batchId=24)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[interval_alt] (batchId=3)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[reset_conf] (batchId=61)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[select_dummy_source] 
(batchId=21)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[str_to_map] (batchId=58)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[timestamp_date_only] 
(batchId=27)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[timestamp_literal] 
(batchId=25)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_add_months] 
(batchId=58)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_aes_decrypt] 
(batchId=49)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_aes_encrypt] 
(batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftleft] 
(batchId=18)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftright] 
(batchId=70)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftrightunsigned]
 (batchId=27)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bround] (batchId=52)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_cbrt] (batchId=75)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_chr] (batchId=27)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_crc32] (batchId=2)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_current_database] 
(batchId=61)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_add] 
(batchId=43)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_format] 
(batchId=52)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_sub] (batchId=2)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_decode] (batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_factorial] 
(batchId=76)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_format_number] 
(batchId=8)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_from_utc_timestamp] 
(batchId=76)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_get_json_object] 
(batchId=30)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_last_day] 
(batchId=35)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_levenshtein] 
(batchId=28)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask] (batchId=68)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_first_n] 
(batchId=70)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_hash] 
(batchId=26)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_last_n] 
(batchId=34)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_show_first_n] 
(batchId=4)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_show_last_n] 
(batchId=51)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_md5] (batchId=8)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_months_between] 
(batchId=48)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_nullif] (batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_quarter] (batchId=41)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_replace] (batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_sha1] (batchId=6)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_sha2] (batchId=11)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_soundex] (batchId=34)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_substring_index] 
(batchId=1)

[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation

2016-12-20 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765854#comment-15765854
 ] 

Hive QA commented on HIVE-14165:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12844165/HIVE-14165.04.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 74 failed/errored test(s), 10825 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=234)
TestVectorizedColumnReaderBase - did not produce a TEST-*.xml file (likely 
timed out) (batchId=251)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[authorization_1_sql_std] 
(batchId=40)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cbo_const] (batchId=16)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[concat_op] (batchId=67)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[dbtxnmgr_showlocks] 
(batchId=71)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[decimal_precision2] 
(batchId=47)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[dynpart_sort_opt_bucketing]
 (batchId=78)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_04_evolved_parts] 
(batchId=29)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[float_equality] 
(batchId=24)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[interval_alt] (batchId=3)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[reset_conf] (batchId=61)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[select_dummy_source] 
(batchId=21)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[str_to_map] (batchId=58)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[timestamp_date_only] 
(batchId=27)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[timestamp_literal] 
(batchId=25)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_add_months] 
(batchId=58)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_aes_decrypt] 
(batchId=49)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_aes_encrypt] 
(batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftleft] 
(batchId=18)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftright] 
(batchId=70)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftrightunsigned]
 (batchId=27)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bround] (batchId=52)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_cbrt] (batchId=75)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_chr] (batchId=27)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_crc32] (batchId=2)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_current_database] 
(batchId=61)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_add] 
(batchId=43)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_format] 
(batchId=52)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_sub] (batchId=2)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_decode] (batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_factorial] 
(batchId=76)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_format_number] 
(batchId=8)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_from_utc_timestamp] 
(batchId=76)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_get_json_object] 
(batchId=30)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_last_day] 
(batchId=35)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_levenshtein] 
(batchId=28)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask] (batchId=68)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_first_n] 
(batchId=70)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_hash] 
(batchId=26)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_last_n] 
(batchId=34)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_show_first_n] 
(batchId=4)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_show_last_n] 
(batchId=51)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_md5] (batchId=8)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_months_between] 
(batchId=48)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_nullif] (batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_quarter] (batchId=41)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_replace] (batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_sha1] (batchId=6)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_sha2] (batchId=11)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_soundex] (batchId=34)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_substring_index] 
(batchId=1)

[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation

2016-12-20 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765737#comment-15765737
 ] 

Sahil Takiar commented on HIVE-14165:
-

Assigning to myself as [~ayousufi] is no longer working on this issue.

I played around with this patch and found a similar speedup for a simple 
{{select * from s3_partitioned_table}} query where {{s3_partitioned_table}} has 
500 partitions all stored on S3 (each partition contains a CSV file of ~80 KB 
in size). Performance improves by about 2x.

The only problem I see with this patch is that it is technically a
backwards-incompatible change. Hive allows any custom {{InputFormat}} to be
registered for a table or for a partition. Before this patch, Hive guaranteed
that the {{Path}} set in {{mapred.input.dir}} would always exist and would
always contain files of non-zero length. After this patch, the given {{Path}}
may not exist, or may be empty. This patch adds handling for
{{FileInputFormat}}s, but given that a user can register any custom
{{InputFormat}} with a table, it's possible some user queries may break.

I'm not sure how much of an issue this is; technically, the {{InputFormat}} API
makes no claim about whether a given {{Path}} should exist or should be
non-empty.

Also need to add some tests for this patch.
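
For illustration, a hedged sketch of the kind of defensive check a custom
{{InputFormat}} could add if it wants to keep the old guarantee;
{{TextInputFormat}} is only a stand-in here, and the helper is hypothetical:

{code}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

class DefensiveGetSplits {
  /** Guard against input paths that, after this patch, may no longer exist. */
  static InputSplit[] splitsIfPresent(JobConf job, int numSplits) throws IOException {
    for (Path p : FileInputFormat.getInputPaths(job)) {
      FileSystem fs = p.getFileSystem(job);
      if (!fs.exists(p)) {
        // Before the patch Hive guaranteed this path existed and was non-empty.
        return new InputSplit[0];
      }
    }
    TextInputFormat format = new TextInputFormat();
    format.configure(job);
    return format.getSplits(job, numSplits);
  }
}
{code}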

> Remove Hive file listing during split computation
> -
>
> Key: HIVE-14165
> URL: https://issues.apache.org/jira/browse/HIVE-14165
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Abdullah Yousufi
>Assignee: Sahil Takiar
> Attachments: HIVE-14165.02.patch, HIVE-14165.03.patch, 
> HIVE-14165.patch
>
>
> The Hive side listing in FetchOperator.java is unnecessary, since Hadoop's 
> FileInputFormat.java will list the files during split computation anyway to 
> determine their size. One way to remove this is to catch the 
> InvalidInputFormat exception thrown by FileInputFormat#getSplits() on the 
> Hive side instead of doing the file listing beforehand.
> For S3 select queries on partitioned tables, this results in a 2x speedup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation

2016-08-20 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429389#comment-15429389
 ] 

Hive QA commented on HIVE-14165:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12824670/HIVE-14165.03.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 61 failed/errored test(s), 10470 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_mapjoin]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[authorization_1_sql_std]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cbo_const]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[decimal_precision2]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[dynpart_sort_opt_bucketing]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_04_evolved_parts]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[float_equality]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[select_dummy_source]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[timestamp_literal]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_add_months]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_aes_decrypt]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_aes_encrypt]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftleft]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftright]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftrightunsigned]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bround]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_cbrt]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_chr]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_crc32]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_current_database]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_add]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_format]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_sub]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_decode]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_factorial]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_format_number]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_from_utc_timestamp]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_get_json_object]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_last_day]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_levenshtein]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_first_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_hash]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_last_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_show_first_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_show_last_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_md5]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_months_between]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_quarter]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_replace]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_sha1]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_sha2]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_soundex]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_substring_index]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_to_utc_timestamp]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_trunc]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_version]
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[hybridgrace_hashjoin_1]
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[hybridgrace_hashjoin_2]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_1]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_2]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[load_dyn_part1]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[load_dyn_part2]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[select_dummy_source]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[transform_ppr1]
org.apache.hive.beeline.TestBeeLineWithArgs.testConnectionWithURLParams
org.apache.hive.hcatalog.pig.TestHCatLoader.testReadMissingPartitionBasicNeg[3]
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testConnectionSchemaAPIs
org.apache.hive.jdbc.TestJdbcWithMiniLlap.testLlapInputFormatEndToEnd
org.apache.hive.jdbc.TestJdbcWithMiniLlap.testNonAsciiStrings

[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation

2016-08-19 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428725#comment-15428725
 ] 

Hive QA commented on HIVE-14165:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12824588/HIVE-14165.02.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 62 failed/errored test(s), 10441 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[authorization_1_sql_std]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cbo_const]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[decimal_precision2]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[dynpart_sort_opt_bucketing]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_04_evolved_parts]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[float_equality]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[select_dummy_source]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[timestamp_literal]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_add_months]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_aes_decrypt]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_aes_encrypt]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftleft]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftright]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftrightunsigned]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bround]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_cbrt]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_chr]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_crc32]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_current_database]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_add]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_format]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_sub]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_decode]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_factorial]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_format_number]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_from_utc_timestamp]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_get_json_object]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_last_day]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_levenshtein]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_first_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_hash]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_last_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_show_first_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_show_last_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_md5]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_months_between]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_quarter]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_replace]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_sha1]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_sha2]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_soundex]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_substring_index]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_to_utc_timestamp]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_trunc]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_version]
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[hybridgrace_hashjoin_1]
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[hybridgrace_hashjoin_2]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_1]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_2]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[load_dyn_part1]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[select_dummy_source]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[transform_ppr1]
org.apache.hive.beeline.TestBeeLineWithArgs.testConnectionWithURLParams
org.apache.hive.beeline.TestBeeLineWithArgs.testEmbeddedBeelineOutputs
org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler.org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler
org.apache.hive.hcatalog.pig.TestHCatLoader.testReadMissingPartitionBasicNeg[3]
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testConnectionSchemaAPIs
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testSelectThriftSerializeInTasks
org.apache.hive.jdbc.TestJdbcWithMiniLlap.testLlapInputFormatEndToEnd

[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation

2016-08-18 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427148#comment-15427148
 ] 

Steve Loughran commented on HIVE-14165:
---

the faster listStatus is only applicable to a recursive listing; if you are
listing one directory, it takes the same time as before

> Remove Hive file listing during split computation
> -
>
> Key: HIVE-14165
> URL: https://issues.apache.org/jira/browse/HIVE-14165
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Abdullah Yousufi
>Assignee: Abdullah Yousufi
> Attachments: HIVE-14165.patch
>
>
> The Hive side listing in FetchOperator.java is unnecessary, since Hadoop's 
> FileInputFormat.java will list the files during split computation anyway to 
> determine their size. One way to remove this is to catch the 
> InvalidInputFormat exception thrown by FileInputFormat#getSplits() on the 
> Hive side instead of doing the file listing beforehand.
> For S3 select queries on partitioned tables, this results in a 2x speedup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation

2016-08-18 Thread Abdullah Yousufi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15426846#comment-15426846
 ] 

Abdullah Yousufi commented on HIVE-14165:
-

I believe when Hive calls getSplits() it's actually using
{code}org.apache.hadoop.mapred.FileInputFormat{code}. And is the updated
listStatus faster in the non-recursive case as well? If not, I don't think it
makes sense to pass the recursive flag as true, since Hive is only interested
in the files at the top level of the path; it currently calls getSplits() for
each partition.

However, if Hive were changed to call getSplits() on the root directory in the
partitioned case, then the listStatus(recursive) would make sense. I decided
against this change because I was not sure how best to handle partition
elimination. For example, if the query selects a single partition from a table,
then doing a listStatus(recursive) on the root directory would be slower than
just doing a listStatus on that single partition.
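
For reference, a minimal sketch of the per-partition pattern described above;
the helper is hypothetical and {{TextInputFormat}} is a stand-in for whatever
{{InputFormat}} each partition uses:

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

class PerPartitionSplits {
  /** One getSplits() call per surviving partition directory. */
  static List<InputSplit> splitsPerPartition(JobConf job, List<Path> partitionDirs)
      throws IOException {
    List<InputSplit> all = new ArrayList<>();
    for (Path dir : partitionDirs) {
      JobConf partJob = new JobConf(job);
      // Only this partition's top-level files are listed and split.
      FileInputFormat.setInputPaths(partJob, dir);
      TextInputFormat format = new TextInputFormat();
      format.configure(partJob);
      for (InputSplit split : format.getSplits(partJob, 1)) {
        all.add(split);
      }
    }
    return all;
  }
}
{code}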

Also, Qubole mentions the following, which may be something to pursue in the 
future.
{code}
"we modified split computation to invoke listing at the level of the parent 
directory. This call returns all files (and their sizes) in all subdirectories 
in blocks of 1000. Some subdirectories and files may not be of interest to 
job/query e.g. partition elimination may be eliminated some of them. We take 
advantage of the fact that file listing is in lexicographic order and perform a 
modified merge join of the list of files and list of directories of interest."
{code}
When you mentioned earlier that Hadoop grabs 5000 objects at a time, does that
include files in subdirectories?

> Remove Hive file listing during split computation
> -
>
> Key: HIVE-14165
> URL: https://issues.apache.org/jira/browse/HIVE-14165
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Abdullah Yousufi
>Assignee: Abdullah Yousufi
> Attachments: HIVE-14165.patch
>
>
> The Hive side listing in FetchOperator.java is unnecessary, since Hadoop's 
> FileInputFormat.java will list the files during split computation anyway to 
> determine their size. One way to remove this is to catch the 
> InvalidInputFormat exception thrown by FileInputFormat#getSplits() on the 
> Hive side instead of doing the file listing beforehand.
> For S3 select queries on partitioned tables, this results in a 2x speedup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation

2016-08-18 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15426708#comment-15426708
 ] 

Hive QA commented on HIVE-14165:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12824278/HIVE-14165.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 63 failed/errored test(s), 10426 tests 
executed
*Failed tests:*
{noformat}
TestVectorTimestampExpressions - did not produce a TEST-*.xml file
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[authorization_1_sql_std]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cbo_const]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[decimal_precision2]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[dynpart_sort_opt_bucketing]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_04_evolved_parts]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[float_equality]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[select_dummy_source]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[timestamp_literal]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_add_months]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_aes_decrypt]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_aes_encrypt]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftleft]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftright]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftrightunsigned]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bround]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_cbrt]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_chr]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_crc32]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_current_database]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_add]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_format]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_sub]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_decode]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_factorial]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_format_number]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_from_utc_timestamp]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_get_json_object]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_last_day]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_levenshtein]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_first_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_hash]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_last_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_show_first_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_show_last_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_md5]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_months_between]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_quarter]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_replace]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_sha1]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_sha2]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_soundex]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_substring_index]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_to_utc_timestamp]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_trunc]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_version]
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[hybridgrace_hashjoin_1]
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[hybridgrace_hashjoin_2]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_1]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_2]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[load_dyn_part1]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[load_dyn_part2]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[select_dummy_source]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[transform_ppr1]
org.apache.hive.beeline.TestBeeLineWithArgs.testConnectionWithURLParams
org.apache.hive.beeline.TestBeeLineWithArgs.testEmbeddedBeelineOutputs
org.apache.hive.hcatalog.pig.TestHCatLoader.testReadMissingPartitionBasicNeg[3]
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testConnectionSchemaAPIs
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testSelectThriftSerializeInTasks
org.apache.hive.jdbc.TestJdbcWithMiniLlap.testLlapInputFormatEndToEnd

[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation

2016-08-18 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15426139#comment-15426139
 ] 

Steve Loughran commented on HIVE-14165:
---

Which {{FileInputFormat}} are you using? If it is
{{org.apache.hadoop.mapreduce.lib.input.FileInputFormat}}, we could look at
switching that to {{listStatus(recursive)}} and picking up the HADOOP-13208
speedup.
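
A hedged sketch of what that recursive listing looks like at the
{{FileSystem}} level (hypothetical helper, not the actual FileInputFormat
change); FileSystem#listFiles with {{recursive=true}} is the call that
HADOOP-13208 speeds up on S3A:

{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

class RecursiveListing {
  /** Single recursive listing of the table root, summing file sizes for splits. */
  static long totalBytes(Configuration conf, Path tableRoot) throws IOException {
    FileSystem fs = tableRoot.getFileSystem(conf);
    long totalBytes = 0;
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(tableRoot, true /* recursive */);
    while (it.hasNext()) {
      totalBytes += it.next().getLen();
    }
    return totalBytes;
  }
}
{code}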

> Remove Hive file listing during split computation
> -
>
> Key: HIVE-14165
> URL: https://issues.apache.org/jira/browse/HIVE-14165
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Abdullah Yousufi
>Assignee: Abdullah Yousufi
> Attachments: HIVE-14165.patch
>
>
> The Hive side listing in FetchOperator.java is unnecessary, since Hadoop's 
> FileInputFormat.java will list the files during split computation anyway to 
> determine their size. One way to remove this is to catch the 
> InvalidInputFormat exception thrown by FileInputFormat#getSplits() on the 
> Hive side instead of doing the file listing beforehand.
> For S3 select queries on partitioned tables, this results in a 2x speedup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)