Re: Review Request 70474: HIVE-21407: Parquet predicate pushdown is not working correctly for char column types

2019-05-09 Thread Peter Vary via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70474/#review215157
---


Ship it!




Ship It!

- Peter Vary


On máj. 9, 2019, 7:51 de, Marta Kuczora wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70474/
> ---
> 
> (Updated máj. 9, 2019, 7:51 de)
> 
> 
> Review request for hive and Peter Vary.
> 
> 
> Bugs: HIVE-21407
> https://issues.apache.org/jira/browse/HIVE-21407
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> The idea behind the patch is that for CHAR columns extend the predicate which 
> is pushed to Parquet with an “or” clause which contains the same expression 
> with a padded and a stripped value.
> Example:
> column c is a CHAR(10) type and the search expression is c='apple'
> The predicate which is pushed to Parquet looked like c='apple ' before the 
> patch and it would look like (c='apple ' or c='apple') after the patch.
> Since the value 'apple' is stored in Parquet without padding, the predicate 
> before the patch didn’t return any rows. With the patch it will return the 
> correct row. 
> Since on predicate level, there is no distinction between CHAR or VARCHAR, 
> the predicates for VARCHARs will be changed as well, so the result set 
> returned from Parquet will be wider than before.
> Example:
> A table contains a c VARCHAR(10) column and there is a row where c='apple' 
> and there is an other row where c='apple '. If the search expression is 
> c='apple ', both rows will be returned from Parquet after the patch. But 
> since Hive is doing an additional filtering on the rows returned from 
> Parquet, it won’t be a problem, the result set returned by Hive will contain 
> only the row with the value 'apple '.
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/LeafFilterFactory.java 
> be4c0d5 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetRecordReaderWrapper.java
>  0210a0a 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/read/TestParquetFilterPredicate.java
>  d464046 
>   ql/src/test/queries/clientpositive/parquet_ppd_char.q 4230d8c 
>   ql/src/test/queries/clientpositive/parquet_ppd_char2.q PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_ppd_char2.q.out PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/70474/diff/2/
> 
> 
> Testing
> ---
> 
> Added new q test for testing the PPD for char and varchar types. Also 
> extended the unit tests for the 
> ParquetFilterPredicateConverter.toFilterPredicate method.
> 
> The TestParquetRecordReaderWrapper and the TestParquetFilterPredicate are 
> both testing the same thing, the behavior of the 
> ParquetFilterPredicateConverter.toFilterPredicate method. It doesn't make 
> sense to have tests for the same use case in different test classes, so moved 
> the test cases from the TestParquetRecordReaderWrapper to 
> TestParquetFilterPredicate.
> 
> 
> Thanks,
> 
> Marta Kuczora
> 
>



Re: Review Request 70474: HIVE-21407: Parquet predicate pushdown is not working correctly for char column types

2019-05-09 Thread Marta Kuczora via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70474/
---

(Updated May 9, 2019, 7:51 a.m.)


Review request for hive and Peter Vary.


Changes
---

Fixed the whitespace issue.


Bugs: HIVE-21407
https://issues.apache.org/jira/browse/HIVE-21407


Repository: hive-git


Description
---

The idea behind the patch is that for CHAR columns extend the predicate which 
is pushed to Parquet with an “or” clause which contains the same expression 
with a padded and a stripped value.
Example:
column c is a CHAR(10) type and the search expression is c='apple'
The predicate which is pushed to Parquet looked like c='apple ' before the 
patch and it would look like (c='apple ' or c='apple') after the patch.
Since the value 'apple' is stored in Parquet without padding, the predicate 
before the patch didn’t return any rows. With the patch it will return the 
correct row. 
Since on predicate level, there is no distinction between CHAR or VARCHAR, the 
predicates for VARCHARs will be changed as well, so the result set returned 
from Parquet will be wider than before.
Example:
A table contains a c VARCHAR(10) column and there is a row where c='apple' and 
there is an other row where c='apple '. If the search expression is c='apple ', 
both rows will be returned from Parquet after the patch. But since Hive is 
doing an additional filtering on the rows returned from Parquet, it won’t be a 
problem, the result set returned by Hive will contain only the row with the 
value 'apple '.


Diffs (updated)
-

  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/LeafFilterFactory.java 
be4c0d5 
  
ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetRecordReaderWrapper.java
 0210a0a 
  
ql/src/test/org/apache/hadoop/hive/ql/io/parquet/read/TestParquetFilterPredicate.java
 d464046 
  ql/src/test/queries/clientpositive/parquet_ppd_char.q 4230d8c 
  ql/src/test/queries/clientpositive/parquet_ppd_char2.q PRE-CREATION 
  ql/src/test/results/clientpositive/parquet_ppd_char2.q.out PRE-CREATION 


Diff: https://reviews.apache.org/r/70474/diff/2/

Changes: https://reviews.apache.org/r/70474/diff/1-2/


Testing
---

Added new q test for testing the PPD for char and varchar types. Also extended 
the unit tests for the ParquetFilterPredicateConverter.toFilterPredicate method.

The TestParquetRecordReaderWrapper and the TestParquetFilterPredicate are both 
testing the same thing, the behavior of the 
ParquetFilterPredicateConverter.toFilterPredicate method. It doesn't make sense 
to have tests for the same use case in different test classes, so moved the 
test cases from the TestParquetRecordReaderWrapper to 
TestParquetFilterPredicate.


Thanks,

Marta Kuczora



Re: Review Request 70474: HIVE-21407: Parquet predicate pushdown is not working correctly for char column types

2019-04-15 Thread Peter Vary via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70474/#review214659
---


Fix it, then Ship it!




Just one little nit.
Otherwise LGTM +1


ql/src/java/org/apache/hadoop/hive/ql/io/parquet/LeafFilterFactory.java
Lines 150 (patched)


please remove extra space


- Peter Vary


On ápr. 14, 2019, 1:03 du, Marta Kuczora wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70474/
> ---
> 
> (Updated ápr. 14, 2019, 1:03 du)
> 
> 
> Review request for hive and Peter Vary.
> 
> 
> Bugs: HIVE-21407
> https://issues.apache.org/jira/browse/HIVE-21407
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> The idea behind the patch is that for CHAR columns extend the predicate which 
> is pushed to Parquet with an “or” clause which contains the same expression 
> with a padded and a stripped value.
> Example:
> column c is a CHAR(10) type and the search expression is c='apple'
> The predicate which is pushed to Parquet looked like c='apple ' before the 
> patch and it would look like (c='apple ' or c='apple') after the patch.
> Since the value 'apple' is stored in Parquet without padding, the predicate 
> before the patch didn’t return any rows. With the patch it will return the 
> correct row. 
> Since on predicate level, there is no distinction between CHAR or VARCHAR, 
> the predicates for VARCHARs will be changed as well, so the result set 
> returned from Parquet will be wider than before.
> Example:
> A table contains a c VARCHAR(10) column and there is a row where c='apple' 
> and there is an other row where c='apple '. If the search expression is 
> c='apple ', both rows will be returned from Parquet after the patch. But 
> since Hive is doing an additional filtering on the rows returned from 
> Parquet, it won’t be a problem, the result set returned by Hive will contain 
> only the row with the value 'apple '.
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/LeafFilterFactory.java 
> be4c0d5 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetRecordReaderWrapper.java
>  0210a0a 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/read/TestParquetFilterPredicate.java
>  d464046 
>   ql/src/test/queries/clientpositive/parquet_ppd_char.q 4230d8c 
>   ql/src/test/queries/clientpositive/parquet_ppd_char2.q PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_ppd_char2.q.out PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/70474/diff/1/
> 
> 
> Testing
> ---
> 
> Added new q test for testing the PPD for char and varchar types. Also 
> extended the unit tests for the 
> ParquetFilterPredicateConverter.toFilterPredicate method.
> 
> The TestParquetRecordReaderWrapper and the TestParquetFilterPredicate are 
> both testing the same thing, the behavior of the 
> ParquetFilterPredicateConverter.toFilterPredicate method. It doesn't make 
> sense to have tests for the same use case in different test classes, so moved 
> the test cases from the TestParquetRecordReaderWrapper to 
> TestParquetFilterPredicate.
> 
> 
> Thanks,
> 
> Marta Kuczora
> 
>



Review Request 70474: HIVE-21407: Parquet predicate pushdown is not working correctly for char column types

2019-04-14 Thread Marta Kuczora via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70474/
---

Review request for hive and Peter Vary.


Bugs: HIVE-21407
https://issues.apache.org/jira/browse/HIVE-21407


Repository: hive-git


Description
---

The idea behind the patch is that for CHAR columns extend the predicate which 
is pushed to Parquet with an “or” clause which contains the same expression 
with a padded and a stripped value.
Example:
column c is a CHAR(10) type and the search expression is c='apple'
The predicate which is pushed to Parquet looked like c='apple ' before the 
patch and it would look like (c='apple ' or c='apple') after the patch.
Since the value 'apple' is stored in Parquet without padding, the predicate 
before the patch didn’t return any rows. With the patch it will return the 
correct row. 
Since on predicate level, there is no distinction between CHAR or VARCHAR, the 
predicates for VARCHARs will be changed as well, so the result set returned 
from Parquet will be wider than before.
Example:
A table contains a c VARCHAR(10) column and there is a row where c='apple' and 
there is an other row where c='apple '. If the search expression is c='apple ', 
both rows will be returned from Parquet after the patch. But since Hive is 
doing an additional filtering on the rows returned from Parquet, it won’t be a 
problem, the result set returned by Hive will contain only the row with the 
value 'apple '.


Diffs
-

  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/LeafFilterFactory.java 
be4c0d5 
  
ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetRecordReaderWrapper.java
 0210a0a 
  
ql/src/test/org/apache/hadoop/hive/ql/io/parquet/read/TestParquetFilterPredicate.java
 d464046 
  ql/src/test/queries/clientpositive/parquet_ppd_char.q 4230d8c 
  ql/src/test/queries/clientpositive/parquet_ppd_char2.q PRE-CREATION 
  ql/src/test/results/clientpositive/parquet_ppd_char2.q.out PRE-CREATION 


Diff: https://reviews.apache.org/r/70474/diff/1/


Testing
---

Added new q test for testing the PPD for char and varchar types. Also extended 
the unit tests for the ParquetFilterPredicateConverter.toFilterPredicate method.

The TestParquetRecordReaderWrapper and the TestParquetFilterPredicate are both 
testing the same thing, the behavior of the 
ParquetFilterPredicateConverter.toFilterPredicate method. It doesn't make sense 
to have tests for the same use case in different test classes, so moved the 
test cases from the TestParquetRecordReaderWrapper to 
TestParquetFilterPredicate.


Thanks,

Marta Kuczora