[jira] [Commented] (HIVE-16869) Hive returns wrong result when predicates on non-existing columns are pushed down to Parquet reader

2017-06-26 Thread Yongzhi Chen (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16063242#comment-16063242 ]

Yongzhi Chen commented on HIVE-16869:
-

Returning null when any of the "or" sub-conditions returns null is much like 
turning off hive.optimize.index.filter when the filter references columns that 
do not exist in the Parquet file. It is a quick fix until the partition filter 
issue is handled by Parquet. 
The change looks good. +1

> Hive returns wrong result when predicates on non-existing columns are pushed 
> down to Parquet reader
> ---
>
> Key: HIVE-16869
> URL: https://issues.apache.org/jira/browse/HIVE-16869
> Project: Hive
> Issue Type: Bug
> Reporter: Yibing Shi
> Assignee: Yibing Shi
> Priority: Critical
> Attachments: HIVE-16869.1.patch, HIVE-16869.2.patch
>
>
> When {{hive.optimize.ppd}} and {{hive.optimize.index.filter}} are turned on, and 
> a select query has a condition on a column that doesn't exist in the Parquet file 
> (such as a partition column), Hive often returns wrong results.
> Please see below example for details:
> {noformat}
> hive> create table test_parq (a int, b int) partitioned by (p int) stored as parquet;
> OK
> Time taken: 0.292 seconds
> hive> insert overwrite table test_parq partition (p=1) values (1, 2);
> OK
> Time taken: 5.08 seconds
> hive> select * from test_parq where a=1 and p=1;
> OK
> 1 2   1
> Time taken: 0.441 seconds, Fetched: 1 row(s)
> hive> select * from test_parq where (a=1 and p=1) or (a=999 and p=999);
> OK
> 1 2   1
> Time taken: 0.197 seconds, Fetched: 1 row(s)
> hive> set hive.optimize.index.filter=true;
> hive> select * from test_parq where (a=1 and p=1) or (a=999 and p=999);
> OK
> Time taken: 0.167 seconds
> hive> select * from test_parq where (a=1 or a=999) and (a=999 or p=1);
> OK
> Time taken: 0.563 seconds
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16869) Hive returns wrong result when predicates on non-existing columns are pushed down to Parquet reader

2017-06-09 Thread Hive QA (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045300#comment-16045300 ]

Hive QA commented on HIVE-16869:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12872361/HIVE-16869.2.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 6 failed/errored test(s), 10832 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] (batchId=140)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_if_expr] (batchId=145)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] (batchId=99)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] (batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] (batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query78] (batchId=232)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5611/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5611/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5611/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 6 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12872361 - PreCommit-HIVE-Build



[jira] [Commented] (HIVE-16869) Hive returns wrong result when predicates on non-existing columns are pushed down to Parquet reader

2017-06-09 Thread Yibing Shi (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045279#comment-16045279 ]

Yibing Shi commented on HIVE-16869:
---

The idea of the patch is to change how the {{OR}} predicate is handled. 
Currently, if a child of an {{OR}} predicate translates to a null predicate, 
that child is ignored. This is not correct. A null predicate means that the 
condition is on a column that doesn't exist in the Parquet file (a partition 
column, etc.). In that case the whole {{OR}} should be treated as true (that 
is, it should itself return null) so that the record is returned for further 
checking (if this {{OR}} is at the top level) or the parent predicate can be 
evaluated correctly (if this {{OR}} is a child of another predicate).
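
To illustrate the difference, here is a minimal, self-contained sketch of the two behaviours. It is not the patch itself and does not use Hive's or Parquet's real classes: {{OrPushdownSketch}}, {{Expr}}, {{translate}} and {{keepsIssueRow}} are hypothetical names made up for this example, and the file is assumed to store only columns {{a}} and {{b}}. It applies both variants to the last query in the issue description, {{(a=1 or a=999) and (a=999 or p=1)}}, and the row {{(a=1, b=2)}}.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical, simplified model of predicate pushdown.
// These are NOT Hive's or Parquet's real classes.
class OrPushdownSketch {
    enum Kind { EQ, AND, OR }

    // Expression node: a leaf "column = value", or AND / OR over two children.
    static final class Expr {
        final Kind kind; final String column; final int value; final Expr left; final Expr right;
        Expr(Kind kind, String column, int value, Expr left, Expr right) {
            this.kind = kind; this.column = column; this.value = value;
            this.left = left; this.right = right;
        }
        static Expr eq(String column, int value) { return new Expr(Kind.EQ, column, value, null, null); }
        static Expr and(Expr l, Expr r) { return new Expr(Kind.AND, null, 0, l, r); }
        static Expr or(Expr l, Expr r) { return new Expr(Kind.OR, null, 0, l, r); }
    }

    // Returns a row filter, or null when no filter can be built
    // (null means "keep every row and let the engine check it later").
    static Predicate<Map<String, Integer>> translate(Expr e, Set<String> fileColumns, boolean patchedOr) {
        if (e.kind == Kind.EQ) {
            // A column missing from the file (e.g. a partition column) cannot be filtered on.
            if (!fileColumns.contains(e.column)) return null;
            return row -> Integer.valueOf(e.value).equals(row.get(e.column));
        }
        Predicate<Map<String, Integer>> l = translate(e.left, fileColumns, patchedOr);
        Predicate<Map<String, Integer>> r = translate(e.right, fileColumns, patchedOr);
        if (e.kind == Kind.OR && patchedOr && (l == null || r == null)) {
            // Patched behaviour: an unknown child could be true, so the whole OR is unknown.
            return null;
        }
        // AND, and the unpatched OR: ignore a null child.
        // For AND this only widens the filter (safe); for OR it narrows it (rows can be lost).
        if (l == null) return r;
        if (r == null) return l;
        return e.kind == Kind.AND ? l.and(r) : l.or(r);
    }

    // Does the pushed-down filter keep the row (a=1, b=2) for the query
    // (a=1 or a=999) and (a=999 or p=1), given a file that stores only a and b?
    static boolean keepsIssueRow(boolean patchedOr) {
        Expr query = Expr.and(
            Expr.or(Expr.eq("a", 1), Expr.eq("a", 999)),
            Expr.or(Expr.eq("a", 999), Expr.eq("p", 1)));
        Set<String> fileColumns = new HashSet<>(Arrays.asList("a", "b"));
        Map<String, Integer> row = new HashMap<>();
        row.put("a", 1);
        row.put("b", 2);
        Predicate<Map<String, Integer>> filter = translate(query, fileColumns, patchedOr);
        return filter == null || filter.test(row);
    }

    public static void main(String[] args) {
        System.out.println("unpatched OR keeps row: " + keepsIssueRow(false)); // false: row is wrongly dropped
        System.out.println("patched OR keeps row:   " + keepsIssueRow(true));  // true
    }
}
```

In this model the unpatched logic reduces {{(a=999 or p=1)}} to {{a=999}}, so the pushed filter becomes {{(a=1 or a=999) and a=999}} and rejects the only row. With the patched logic that {{OR}} yields null, the parent {{AND}} keeps only {{(a=1 or a=999)}}, and the row survives for Hive to evaluate the full condition.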



[jira] [Commented] (HIVE-16869) Hive returns wrong result when predicates on non-existing columns are pushed down to Parquet reader

2017-06-09 Thread Hive QA (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044641#comment-16044641 ]

Hive QA commented on HIVE-16869:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12872273/HIVE-16869.1.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 8 failed/errored test(s), 10825 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[insert_overwrite_local_directory_1] (batchId=237)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[query_with_semi] (batchId=80)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] (batchId=140)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_if_expr] (batchId=145)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] (batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query78] (batchId=232)
org.apache.hadoop.hive.cli.TestSparkNegativeCliDriver.org.apache.hadoop.hive.cli.TestSparkNegativeCliDriver (batchId=239)
org.apache.hadoop.hive.ql.TestTxnCommands2.testNonAcidToAcidConversion1 (batchId=268)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5602/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5602/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5602/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 8 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12872273 - PreCommit-HIVE-Build
