[jira] [Comment Edited] (HIVE-17261) Hive use deprecated ParquetInputSplit constructor which blocked parquet dictionary filter

2017-09-14 Thread Junjie Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167401#comment-16167401
 ] 

Junjie Chen edited comment on HIVE-17261 at 9/15/17 6:33 AM:
-

The following insert statement stores values in Parquet without trailing spaces: 
insert overwrite table newtypestbl select * from (select cast("apple" as 
char(10)), cast("bee" as varchar(10)), 0.22, cast("1970-02-20" as date) from 
src src1 union all select cast("hello" as char(10)), cast("world" as 
varchar(10)), 11.22, cast("1970-02-27" as date) from src src2 limit 10) 
uniontbl;

However, Hive passes the predicate {noformat}"eq(c, Binary{"apple "})"{noformat} 
to Parquet, so the records are filtered out in RecordReader#nextKeyValue().

So Hive should also strip the trailing spaces from the predicate value.
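
To illustrate the mismatch, here is a minimal sketch in plain Java. It is not Hive or Parquet code: the class and both helper methods are hypothetical, and the padded literal assumes the char(10) value is padded to its declared width before the predicate is built.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharPredicateSketch {
    // Removes trailing spaces only; leading and interior spaces are preserved.
    static String stripTrailingSpaces(String value) {
        int end = value.length();
        while (end > 0 && value.charAt(end - 1) == ' ') {
            end--;
        }
        return value.substring(0, end);
    }

    // Byte-wise equality, standing in for an eq predicate on a binary column.
    static boolean binaryEq(String stored, String literal) {
        return Arrays.equals(
            stored.getBytes(StandardCharsets.UTF_8),
            literal.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String stored = "apple";       // value as written to the file
        String padded = "apple     ";  // assumed char(10) literal, padded to width 10
        System.out.println(binaryEq(stored, padded));                      // false
        System.out.println(binaryEq(stored, stripTrailingSpaces(padded))); // true
    }
}
```

Comparing the padded literal byte for byte never matches the stored value, which is why every record is dropped until the literal is stripped the same way the writer strips it.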


was (Author: junjie):
The following insert statement stores values in Parquet without trailing spaces: 
insert overwrite table newtypestbl select * from (select cast("apple" as 
char(10)), cast("bee" as varchar(10)), 0.22, cast("1970-02-20" as date) from 
src src1 union all select cast("hello" as char(10)), cast("world" as 
varchar(10)), 11.22, cast("1970-02-27" as date) from src src2 limit 10) 
uniontbl;

However, Hive passes the predicate "eq(c, Binary{"apple"})" to Parquet, so the 
records are filtered out in RecordReader#nextKeyValue().

So Hive should also strip the trailing spaces from the predicate value.

> Hive use deprecated ParquetInputSplit constructor which blocked parquet 
> dictionary filter
> -
>
> Key: HIVE-17261
> URL: https://issues.apache.org/jira/browse/HIVE-17261
> Project: Hive
>  Issue Type: Improvement
>  Components: Database/Schema
>Affects Versions: 2.2.0
>Reporter: Junjie Chen
>Assignee: Junjie Chen
> Fix For: 3.0.0
>
> Attachments: HIVE-17261.10.patch, HIVE-17261.11.patch, 
> HIVE-17261.2.patch, HIVE-17261.3.patch, HIVE-17261.4.patch, 
> HIVE-17261.5.patch, HIVE-17261.6.patch, HIVE-17261.7.patch, 
> HIVE-17261.8.patch, HIVE-17261.diff, HIVE-17261.patch
>
>
> Hive uses the deprecated ParquetInputSplit constructor in 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L128]
> Please see the constructor definition in 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputSplit.java#L80]
> The old constructor sets row group offset values, which causes Parquet to skip 
> the dictionary filter.
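
As a rough illustration of why skipping the dictionary filter matters, the sketch below compares a statistics check with a dictionary check for an equality predicate. It is a toy model in plain Java, not parquet-mr code: ChunkSummary and both methods are hypothetical names, and it assumes every page of the column chunk is dictionary-encoded.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class DictionaryFilterSketch {
    // Hypothetical summary of one column chunk: min/max statistics plus the
    // set of distinct values held in its dictionary page.
    static final class ChunkSummary {
        final String min;
        final String max;
        final Set<String> dictionary;

        ChunkSummary(String min, String max, String... dictionaryValues) {
            this.min = min;
            this.max = max;
            this.dictionary = new HashSet<>(Arrays.asList(dictionaryValues));
        }
    }

    // Statistics can rule out eq(value) only when value lies outside [min, max].
    static boolean canDropByStatistics(ChunkSummary chunk, String value) {
        return value.compareTo(chunk.min) < 0 || value.compareTo(chunk.max) > 0;
    }

    // The dictionary can rule out eq(value) whenever value is not one of its entries.
    static boolean canDropByDictionary(ChunkSummary chunk, String value) {
        return !chunk.dictionary.contains(value);
    }

    public static void main(String[] args) {
        ChunkSummary chunk = new ChunkSummary("apple", "hello", "apple", "hello");
        System.out.println(canDropByStatistics(chunk, "bee")); // false
        System.out.println(canDropByDictionary(chunk, "bee")); // true
    }
}
```

The value "bee" sorts between the minimum and maximum, so statistics alone cannot exclude the row group, while the dictionary check can.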



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (HIVE-17261) Hive use deprecated ParquetInputSplit constructor which blocked parquet dictionary filter

2017-09-14 Thread Junjie Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167401#comment-16167401
 ] 

Junjie Chen edited comment on HIVE-17261 at 9/15/17 6:29 AM:
-

The following insert statement stores values in Parquet without trailing spaces: 
insert overwrite table newtypestbl select * from (select cast("apple" as 
char(10)), cast("bee" as varchar(10)), 0.22, cast("1970-02-20" as date) from 
src src1 union all select cast("hello" as char(10)), cast("world" as 
varchar(10)), 11.22, cast("1970-02-27" as date) from src src2 limit 10) 
uniontbl;

However, Hive passes the predicate "eq(c, Binary{"apple"})" to Parquet, so the 
records are filtered out in RecordReader#nextKeyValue().

So Hive should also strip the trailing spaces from the predicate value.


was (Author: junjie):
The following insert statement stores values in Parquet without trailing spaces: 
insert overwrite table newtypestbl select * from (select cast("apple" as 
char(10)), cast("bee" as varchar(10)), 0.22, cast("1970-02-20" as date) from 
src src1 union all select cast("hello" as char(10)), cast("world" as 
varchar(10)), 11.22, cast("1970-02-27" as date) from src src2 limit 10) 
uniontbl;

However, Hive passes the predicate "eq(c, Binary{"apple "})" to Parquet, so the 
records are filtered out in RecordReader#nextKeyValue().

So Hive should also strip the trailing spaces from the predicate value.



[jira] [Comment Edited] (HIVE-17261) Hive use deprecated ParquetInputSplit constructor which blocked parquet dictionary filter

2017-09-10 Thread Ferdinand Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160552#comment-16160552
 ] 

Ferdinand Xu edited comment on HIVE-17261 at 9/11/17 2:09 AM:
--

Thanks [~junjie] for the patch.
One comment has not been addressed:
In ParquetRecordReaderBase.java
* Please remove the @Deprecated annotation since we are not using the deprecated 
constructor in L65

A few more comments:
In ParquetRecordReaderBase.java
* Remove the unnecessary return in L131
In TestParquetRowGroupFilter.java
* Since the filter now takes effect automatically within the Parquet reader, we 
should add test cases that verify it at the reader level. The current tests 
only cover RowGroupFilter.filterRowGroups.
 
Could you create a review board request for the next review? Thank you!


was (Author: ferd):
Thanks Junjie Chen for the patch.
One comment has not been addressed:
In ParquetRecordReaderBase.java
* Please remove the @Deprecated annotation since we are not using the deprecated 
constructor in L65

A few more comments:
In ParquetRecordReaderBase.java
* Remove the unnecessary return in L131
In TestParquetRowGroupFilter.java
* Since the filter now takes effect automatically within the Parquet reader, we 
should add test cases that verify it at the reader level. The current tests 
only cover RowGroupFilter.filterRowGroups.
 
Could you create a review board request for the next review? Thank you!

> Hive use deprecated ParquetInputSplit constructor which blocked parquet 
> dictionary filter
> -
>
> Key: HIVE-17261
> URL: https://issues.apache.org/jira/browse/HIVE-17261
> Project: Hive
>  Issue Type: Improvement
>  Components: Database/Schema
>Affects Versions: 2.2.0
>Reporter: Junjie Chen
>Assignee: Junjie Chen
> Attachments: HIVE-17261.2.patch, HIVE-17261.3.patch, 
> HIVE-17261.4.patch, HIVE-17261.5.patch, HIVE-17261.6.patch, HIVE-17261.diff, 
> HIVE-17261.patch
>
>
> Hive uses the deprecated ParquetInputSplit constructor in 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L128]
> Please see the constructor definition in 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputSplit.java#L80]
> The old constructor sets row group offset values, which causes Parquet to skip 
> the dictionary filter.





[jira] [Comment Edited] (HIVE-17261) Hive use deprecated ParquetInputSplit constructor which blocked parquet dictionary filter

2017-09-10 Thread Ferdinand Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160552#comment-16160552
 ] 

Ferdinand Xu edited comment on HIVE-17261 at 9/11/17 2:09 AM:
--

Thanks [~junjie] for the patch.
One comment has not been addressed:
In ParquetRecordReaderBase.java
* Please remove the @Deprecated annotation since we are not using the deprecated 
constructor in L65

A few more comments:
In ParquetRecordReaderBase.java
* Remove the unnecessary return in L131

In TestParquetRowGroupFilter.java
* Since the filter now takes effect automatically within the Parquet reader, we 
should add test cases that verify it at the reader level. The current tests 
only cover RowGroupFilter.filterRowGroups.
 
Could you create a review board request for the next review? Thank you!


was (Author: ferd):
Thanks [~junjie] for the patch.
One comment has not been addressed:
In ParquetRecordReaderBase.java
* Please remove the @Deprecated annotation since we are not using the deprecated 
constructor in L65

A few more comments:
In ParquetRecordReaderBase.java
* Remove the unnecessary return in L131
In TestParquetRowGroupFilter.java
* Since the filter now takes effect automatically within the Parquet reader, we 
should add test cases that verify it at the reader level. The current tests 
only cover RowGroupFilter.filterRowGroups.
 
Could you create a review board request for the next review? Thank you!



[jira] [Comment Edited] (HIVE-17261) Hive use deprecated ParquetInputSplit constructor which blocked parquet dictionary filter

2017-09-06 Thread Ferdinand Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155410#comment-16155410
 ] 

Ferdinand Xu edited comment on HIVE-17261 at 9/6/17 2:13 PM:
-

Thanks [~junjie] for the patch. Some comments below:
In ParquetRecordReaderBase.java
# Please remove the @Deprecated annotation since we are not using the deprecated 
constructor in L65
# In L103 - L107, please use two-space indents.
# Please update the setFilter method since the return value is no longer needed.
# The searchArg is passed to setFilter as a final variable. Is the converted 
filter property then not passed to the Parquet reader?


was (Author: ferd):
Thanks [~junjie] for the patch. Some minor comments below:
In ParquetRecordReaderBase.java
# Please remove the @Deprecated annotation since we are not using the deprecated 
constructor in L65
# In L103 - L107, please use two-space indents.
# Please update the setFilter method since the return value is no longer needed.
# The searchArg is passed to setFilter as a final variable. Is the converted 
filter property then not passed to the Parquet reader?

> Hive use deprecated ParquetInputSplit constructor which blocked parquet 
> dictionary filter
> -
>
> Key: HIVE-17261
> URL: https://issues.apache.org/jira/browse/HIVE-17261
> Project: Hive
>  Issue Type: Improvement
>  Components: Database/Schema
>Affects Versions: 2.2.0
>Reporter: Junjie Chen
>Assignee: Junjie Chen
> Attachments: HIVE-17261.2.patch, HIVE-17261.3.patch, 
> HIVE-17261.4.patch, HIVE-17261.diff, HIVE-17261.patch
>
>
> Hive uses the deprecated ParquetInputSplit constructor in 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L128]
> Please see the constructor definition in 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputSplit.java#L80]
> The old constructor sets row group offset values, which causes Parquet to skip 
> the dictionary filter.





[jira] [Comment Edited] (HIVE-17261) Hive use deprecated ParquetInputSplit constructor which blocked parquet dictionary filter

2017-08-11 Thread Junjie Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122944#comment-16122944
 ] 

Junjie Chen edited comment on HIVE-17261 at 8/11/17 7:07 AM:
-

[~Ferd], I updated the original unit tests to apply the filter using the new APIs.


was (Author: junjie):
[~Ferd], I updated the original unit tests to apply the filter from the Parquet side.

> Hive use deprecated ParquetInputSplit constructor which blocked parquet 
> dictionary filter
> -
>
> Key: HIVE-17261
> URL: https://issues.apache.org/jira/browse/HIVE-17261
> Project: Hive
>  Issue Type: Improvement
>  Components: Database/Schema
>Affects Versions: 2.2.0
>Reporter: Junjie Chen
>Assignee: Junjie Chen
> Attachments: HIVE-17261.2.patch, HIVE-17261.3.patch, HIVE-17261.diff, 
> HIVE-17261.patch
>
>
> Hive uses the deprecated ParquetInputSplit constructor in 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L128]
> Please see the constructor definition in 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputSplit.java#L80]
> The old constructor sets row group offset values, which causes Parquet to skip 
> the dictionary filter.





[jira] [Comment Edited] (HIVE-17261) Hive use deprecated ParquetInputSplit constructor which blocked parquet dictionary filter

2017-08-09 Thread Ferdinand Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121073#comment-16121073
 ] 

Ferdinand Xu edited comment on HIVE-17261 at 8/10/17 5:56 AM:
--

Can you rename the patch to HIVE-17261.patch? I see the new API doesn't 
require filtedBlocks as a parameter. So can Parquet handle the filter on its 
side using the search argument?


was (Author: ferd):
Can you rename the patch to HIVE-17261.patch? I see the new API doesn't 
require filtedBlocks as a parameter. So can Parquet handle the search 
argument on its side?

> Hive use deprecated ParquetInputSplit constructor which blocked parquet 
> dictionary filter
> -
>
> Key: HIVE-17261
> URL: https://issues.apache.org/jira/browse/HIVE-17261
> Project: Hive
>  Issue Type: Improvement
>  Components: Database/Schema
>Affects Versions: 2.2.0
>Reporter: Junjie Chen
>Assignee: Junjie Chen
>Priority: Minor
> Attachments: HIVE-17261.diff, HIVE-17261.patch
>
>
> Hive uses the deprecated ParquetInputSplit constructor in 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L128]
> Please see the constructor definition in 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputSplit.java#L80]
> The old constructor sets row group offset values, which causes Parquet to skip 
> the dictionary filter.





[jira] [Comment Edited] (HIVE-17261) Hive use deprecated ParquetInputSplit constructor which blocked parquet dictionary filter

2017-08-09 Thread Junjie Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120996#comment-16120996
 ] 

Junjie Chen edited comment on HIVE-17261 at 8/10/17 3:42 AM:
-

This just updates one function call for Parquet, so there is no unit test.


was (Author: junjie):
--- a/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java
+++ b/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java
@@ -131,15 +131,14 @@ protected ParquetInputSplit getSplit(
         filtedBlocks = splitGroup;
       }
 
+
       split = new ParquetInputSplit(finalPath,
-          splitStart,
-          splitLength,
-          oldSplit.getLocations(),
-          filtedBlocks,
-          readContext.getRequestedSchema().toString(),
-          fileMetaData.getSchema().toString(),
-          fileMetaData.getKeyValueMetaData(),
-          readContext.getReadSupportMetadata());
+          splitStart,
+          splitStart + splitLength,
+          splitLength,
+          oldSplit.getLocations(),
+          null);
+
       return split;
     } else {
       throw new IllegalArgumentException("Unknown split type: " + oldSplit);
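
The effect of passing null as the last argument can be sketched with a toy model. This is plain Java, not parquet-mr code: RowGroup, mayMatch and selectRowGroups are hypothetical, and the byte-range test is simplified to the row group's starting offset. It assumes, as the issue describes, that the reader applies its own filters only when no row group offsets are supplied.

```java
import java.util.ArrayList;
import java.util.List;

class SplitFilterSketch {
    // Hypothetical stand-in for a row group: its starting byte offset and
    // whether the pushed-down predicate could match any row in it.
    static final class RowGroup {
        final long offset;
        final boolean mayMatch;

        RowGroup(long offset, boolean mayMatch) {
            this.offset = offset;
            this.mayMatch = mayMatch;
        }
    }

    // Models which row groups a reader scans for a split covering [start, end).
    // rowGroupOffsets == null: select by byte range, then apply the filter.
    // rowGroupOffsets != null: use the caller's list as-is, with no filtering.
    static List<Long> selectRowGroups(List<RowGroup> footer, long start, long end,
                                      long[] rowGroupOffsets) {
        List<Long> selected = new ArrayList<>();
        for (RowGroup group : footer) {
            if (rowGroupOffsets == null) {
                boolean inRange = group.offset >= start && group.offset < end;
                if (inRange && group.mayMatch) {
                    selected.add(group.offset);
                }
            } else {
                for (long offset : rowGroupOffsets) {
                    if (offset == group.offset) {
                        selected.add(group.offset);
                    }
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<RowGroup> footer = new ArrayList<>();
        footer.add(new RowGroup(4L, true));
        footer.add(new RowGroup(1000L, false));
        footer.add(new RowGroup(2000L, true));

        // New-style split: offsets left null, so the non-matching group is skipped.
        System.out.println(selectRowGroups(footer, 0L, 3000L, null)); // [4, 2000]
        // Old-style split: offsets supplied, so every listed group is scanned.
        System.out.println(selectRowGroups(footer, 0L, 3000L, new long[] {4L, 1000L, 2000L})); // [4, 1000, 2000]
    }
}
```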


> Hive use deprecated ParquetInputSplit constructor which blocked parquet 
> dictionary filter
> -
>
> Key: HIVE-17261
> URL: https://issues.apache.org/jira/browse/HIVE-17261
> Project: Hive
>  Issue Type: Improvement
>  Components: Database/Schema
>Affects Versions: 2.2.0
>Reporter: Junjie Chen
>Assignee: Junjie Chen
>Priority: Minor
> Attachments: HIVE-17261.diff
>
>
> Hive uses the deprecated ParquetInputSplit constructor in 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L128]
> Please see the constructor definition in 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputSplit.java#L80]
> The old constructor sets row group offset values, which causes Parquet to skip 
> the dictionary filter.


