[jira] [Comment Edited] (HIVE-17261) Hive use deprecated ParquetInputSplit constructor which blocked parquet dictionary filter
[ https://issues.apache.org/jira/browse/HIVE-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167401#comment-16167401 ]

Junjie Chen edited comment on HIVE-17261 at 9/15/17 6:33 AM:
-------------------------------------------------------------

The following insert statement stores values in Parquet without trailing spaces:

insert overwrite table newtypestbl
select * from (
  select cast("apple" as char(10)), cast("bee" as varchar(10)), 0.22, cast("1970-02-20" as date) from src src1
  union all
  select cast("hello" as char(10)), cast("world" as varchar(10)), 11.22, cast("1970-02-27" as date) from src src2
  limit 10) uniontbl;

However, Hive passes the predicate {noformat}"eq(c, Binary{"apple "})"{noformat} to Parquet, so the records are filtered out in RecordReader#nextKeyValue(). Hive should therefore also strip trailing spaces from the predicate value.

was (Author: junjie):
The following insert statement stores values in Parquet without trailing spaces:

insert overwrite table newtypestbl
select * from (
  select cast("apple" as char(10)), cast("bee" as varchar(10)), 0.22, cast("1970-02-20" as date) from src src1
  union all
  select cast("hello" as char(10)), cast("world" as varchar(10)), 11.22, cast("1970-02-27" as date) from src src2
  limit 10) uniontbl;

However, Hive passes the predicate "eq(c, Binary{"apple"})" to Parquet, so the records are filtered out in RecordReader#nextKeyValue(). Hive should therefore also strip trailing spaces from the predicate value.

> Hive use deprecated ParquetInputSplit constructor which blocked parquet dictionary filter
> -----------------------------------------------------------------------------------------
>
>                 Key: HIVE-17261
>                 URL: https://issues.apache.org/jira/browse/HIVE-17261
>             Project: Hive
>          Issue Type: Improvement
>          Components: Database/Schema
>    Affects Versions: 2.2.0
>            Reporter: Junjie Chen
>            Assignee: Junjie Chen
>             Fix For: 3.0.0
>
>         Attachments: HIVE-17261.10.patch, HIVE-17261.11.patch, HIVE-17261.2.patch, HIVE-17261.3.patch, HIVE-17261.4.patch, HIVE-17261.5.patch, HIVE-17261.6.patch, HIVE-17261.7.patch, HIVE-17261.8.patch, HIVE-17261.diff, HIVE-17261.patch
>
>
> Hive uses the deprecated ParquetInputSplit constructor in
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L128]
> Please see the interface definition in
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputSplit.java#L80]
> The old interface sets row group offset values, which causes Parquet to skip the dictionary filter.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
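
The mismatch described in the comment above can be illustrated with a small sketch. It assumes the CHAR(10) literal reaches the predicate padded to the declared length while the stored value has no padding; the class and method names are hypothetical and are not Hive or Parquet APIs.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical illustration only; none of this is Hive or Parquet code. */
final class CharPredicateSketch {

  /** Pads a value with spaces to a fixed CHAR(n) width (assumed predicate behavior). */
  static String padToCharLength(String value, int length) {
    StringBuilder sb = new StringBuilder(value);
    while (sb.length() < length) {
      sb.append(' ');
    }
    return sb.toString();
  }

  /** Removes trailing spaces only; leading and inner spaces are kept. */
  static String stripTrailingSpaces(String value) {
    int end = value.length();
    while (end > 0 && value.charAt(end - 1) == ' ') {
      end--;
    }
    return value.substring(0, end);
  }

  /** Byte-wise equality, standing in for an eq(column, Binary{...}) predicate. */
  static boolean binaryEq(String stored, String literal) {
    return Arrays.equals(
        stored.getBytes(StandardCharsets.UTF_8),
        literal.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    String stored = "apple";                       // value as written to the file
    String padded = padToCharLength("apple", 10);  // literal as passed in the predicate

    // The padded literal never equals the stored value, so the row is dropped.
    System.out.println(binaryEq(stored, padded));                      // false
    // Stripping trailing spaces from the literal restores the match.
    System.out.println(binaryEq(stored, stripTrailingSpaces(padded))); // true
  }
}
```

Whether the stripping belongs in the predicate conversion or elsewhere is a design decision for the patch; the sketch only shows why the byte-wise comparison fails.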
[jira] [Comment Edited] (HIVE-17261) Hive use deprecated ParquetInputSplit constructor which blocked parquet dictionary filter
[ https://issues.apache.org/jira/browse/HIVE-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167401#comment-16167401 ]

Junjie Chen edited comment on HIVE-17261 at 9/15/17 6:29 AM:
-------------------------------------------------------------

The following insert statement stores values in Parquet without trailing spaces:

insert overwrite table newtypestbl
select * from (
  select cast("apple" as char(10)), cast("bee" as varchar(10)), 0.22, cast("1970-02-20" as date) from src src1
  union all
  select cast("hello" as char(10)), cast("world" as varchar(10)), 11.22, cast("1970-02-27" as date) from src src2
  limit 10) uniontbl;

However, Hive passes the predicate "eq(c, Binary{"apple"})" to Parquet, so the records are filtered out in RecordReader#nextKeyValue(). Hive should therefore also strip trailing spaces from the predicate value.

was (Author: junjie):
The following insert statement stores values in Parquet without trailing spaces:

insert overwrite table newtypestbl
select * from (
  select cast("apple" as char(10)), cast("bee" as varchar(10)), 0.22, cast("1970-02-20" as date) from src src1
  union all
  select cast("hello" as char(10)), cast("world" as varchar(10)), 11.22, cast("1970-02-27" as date) from src src2
  limit 10) uniontbl;

However, Hive passes the predicate "eq(c, Binary{"apple "})" to Parquet, so the records are filtered out in RecordReader#nextKeyValue(). Hive should therefore also strip trailing spaces from the predicate value.
[jira] [Comment Edited] (HIVE-17261) Hive use deprecated ParquetInputSplit constructor which blocked parquet dictionary filter
[ https://issues.apache.org/jira/browse/HIVE-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160552#comment-16160552 ]

Ferdinand Xu edited comment on HIVE-17261 at 9/11/17 2:09 AM:
--------------------------------------------------------------

Thanks [~junjie] for the patch. One comment is not addressed:

In ParquetRecordReaderBase.java
* Please remove the @Deprecated annotation since we are not using the deprecated constructor in L65

A few more comments:

In ParquetRecordReaderBase.java
* Remove the unnecessary return in L131

In TestParquetRowGroupFilter.java
* Since the filter now takes effect automatically within the Parquet reader, we should add test cases that verify it at the reader level; the current tests only cover RowGroupFilter.filterRowGroups.

Could you create a review board request next time for review? Thank you!

was (Author: ferd):
Thanks Junjie Chen for the patch. One comment is not addressed:

In ParquetRecordReaderBase.java
* Please remove the @Deprecated annotation since we are not using the deprecated constructor in L65

A few more comments:

In ParquetRecordReaderBase.java
* Remove the unnecessary return in L131

In TestParquetRowGroupFilter.java
* Since the filter now takes effect automatically within the Parquet reader, we should add test cases that verify it at the reader level; the current tests only cover RowGroupFilter.filterRowGroups.

Could you create a review board request next time for review? Thank you!

> Hive use deprecated ParquetInputSplit constructor which blocked parquet dictionary filter
> -----------------------------------------------------------------------------------------
>
>                 Key: HIVE-17261
>                 URL: https://issues.apache.org/jira/browse/HIVE-17261
>             Project: Hive
>          Issue Type: Improvement
>          Components: Database/Schema
>    Affects Versions: 2.2.0
>            Reporter: Junjie Chen
>            Assignee: Junjie Chen
>         Attachments: HIVE-17261.2.patch, HIVE-17261.3.patch, HIVE-17261.4.patch, HIVE-17261.5.patch, HIVE-17261.6.patch, HIVE-17261.diff, HIVE-17261.patch
>
>
> Hive uses the deprecated ParquetInputSplit constructor in
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L128]
> Please see the interface definition in
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputSplit.java#L80]
> The old interface sets row group offset values, which causes Parquet to skip the dictionary filter.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Comment Edited] (HIVE-17261) Hive use deprecated ParquetInputSplit constructor which blocked parquet dictionary filter
[ https://issues.apache.org/jira/browse/HIVE-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160552#comment-16160552 ]

Ferdinand Xu edited comment on HIVE-17261 at 9/11/17 2:09 AM:
--------------------------------------------------------------

Thanks [~junjie] for the patch. One comment is not addressed:

In ParquetRecordReaderBase.java
* Please remove the @Deprecated annotation since we are not using the deprecated constructor in L65

A few more comments:

In ParquetRecordReaderBase.java
* Remove the unnecessary return in L131

In TestParquetRowGroupFilter.java
* Since the filter now takes effect automatically within the Parquet reader, we should add test cases that verify it at the reader level; the current tests only cover RowGroupFilter.filterRowGroups.

Could you create a review board request next time for review? Thank you!

was (Author: ferd):
Thanks [~junjie] for the patch. One comment is not addressed:

In ParquetRecordReaderBase.java
* Please remove the @Deprecated annotation since we are not using the deprecated constructor in L65

A few more comments:

In ParquetRecordReaderBase.java
* Remove the unnecessary return in L131

In TestParquetRowGroupFilter.java
* Since the filter now takes effect automatically within the Parquet reader, we should add test cases that verify it at the reader level; the current tests only cover RowGroupFilter.filterRowGroups.

Could you create a review board request next time for review? Thank you!
[jira] [Comment Edited] (HIVE-17261) Hive use deprecated ParquetInputSplit constructor which blocked parquet dictionary filter
[ https://issues.apache.org/jira/browse/HIVE-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155410#comment-16155410 ]

Ferdinand Xu edited comment on HIVE-17261 at 9/6/17 2:13 PM:
-------------------------------------------------------------

Thanks [~junjie] for the patch. Some comments below:

In ParquetRecordReaderBase.java
# Please remove the @Deprecated annotation since we are not using the deprecated constructor in L65
# In L103 - L107, use two-space indents.
# Please update the setFilter method since the return value is no longer needed.
# The searchArg is passed to setFilter as a final variable. So is the converted filter property not passed to the Parquet reader?

was (Author: ferd):
Thanks [~junjie] for the patch. Some minor comments below:

In ParquetRecordReaderBase.java
# Please remove the @Deprecated annotation since we are not using the deprecated constructor in L65
# In L103 - L107, use two-space indents.
# Please update the setFilter method since the return value is no longer needed.
# The searchArg is passed to setFilter as a final variable. So is the converted filter property not passed to the Parquet reader?

> Hive use deprecated ParquetInputSplit constructor which blocked parquet dictionary filter
> -----------------------------------------------------------------------------------------
>
>                 Key: HIVE-17261
>                 URL: https://issues.apache.org/jira/browse/HIVE-17261
>             Project: Hive
>          Issue Type: Improvement
>          Components: Database/Schema
>    Affects Versions: 2.2.0
>            Reporter: Junjie Chen
>            Assignee: Junjie Chen
>         Attachments: HIVE-17261.2.patch, HIVE-17261.3.patch, HIVE-17261.4.patch, HIVE-17261.diff, HIVE-17261.patch
>
>
> Hive uses the deprecated ParquetInputSplit constructor in
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L128]
> Please see the interface definition in
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputSplit.java#L80]
> The old interface sets row group offset values, which causes Parquet to skip the dictionary filter.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (HIVE-17261) Hive use deprecated ParquetInputSplit constructor which blocked parquet dictionary filter
[ https://issues.apache.org/jira/browse/HIVE-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122944#comment-16122944 ]

Junjie Chen edited comment on HIVE-17261 at 8/11/17 7:07 AM:
-------------------------------------------------------------

[~Ferd], I updated the original unit tests to apply the filter using the new APIs.

was (Author: junjie):
[~Ferd], I updated the original unit tests to apply the filter from the Parquet side.

> Hive use deprecated ParquetInputSplit constructor which blocked parquet dictionary filter
> -----------------------------------------------------------------------------------------
>
>                 Key: HIVE-17261
>                 URL: https://issues.apache.org/jira/browse/HIVE-17261
>             Project: Hive
>          Issue Type: Improvement
>          Components: Database/Schema
>    Affects Versions: 2.2.0
>            Reporter: Junjie Chen
>            Assignee: Junjie Chen
>         Attachments: HIVE-17261.2.patch, HIVE-17261.3.patch, HIVE-17261.diff, HIVE-17261.patch
>
>
> Hive uses the deprecated ParquetInputSplit constructor in
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L128]
> Please see the interface definition in
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputSplit.java#L80]
> The old interface sets row group offset values, which causes Parquet to skip the dictionary filter.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Comment Edited] (HIVE-17261) Hive use deprecated ParquetInputSplit constructor which blocked parquet dictionary filter
[ https://issues.apache.org/jira/browse/HIVE-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121073#comment-16121073 ]

Ferdinand Xu edited comment on HIVE-17261 at 8/10/17 5:56 AM:
--------------------------------------------------------------

Can you rename the patch to HIVE-17261.patch? I see the new API doesn't require filtedBlocks as a parameter. So can Parquet handle the filtering on its side using the search argument?

was (Author: ferd):
Can you rename the patch to HIVE-17261.patch? I see the new API doesn't require filtedBlocks as a parameter. So can Parquet handle the search argument on its side?

> Hive use deprecated ParquetInputSplit constructor which blocked parquet dictionary filter
> -----------------------------------------------------------------------------------------
>
>                 Key: HIVE-17261
>                 URL: https://issues.apache.org/jira/browse/HIVE-17261
>             Project: Hive
>          Issue Type: Improvement
>          Components: Database/Schema
>    Affects Versions: 2.2.0
>            Reporter: Junjie Chen
>            Assignee: Junjie Chen
>            Priority: Minor
>         Attachments: HIVE-17261.diff, HIVE-17261.patch
>
>
> Hive uses the deprecated ParquetInputSplit constructor in
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L128]
> Please see the interface definition in
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputSplit.java#L80]
> The old interface sets row group offset values, which causes Parquet to skip the dictionary filter.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Comment Edited] (HIVE-17261) Hive use deprecated ParquetInputSplit constructor which blocked parquet dictionary filter
[ https://issues.apache.org/jira/browse/HIVE-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120996#comment-16120996 ]

Junjie Chen edited comment on HIVE-17261 at 8/10/17 3:42 AM:
-------------------------------------------------------------

This only updates one function for Parquet, so there is no unit test.

was (Author: junjie):
--- a/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java
+++ b/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java
@@ -131,15 +131,14 @@ protected ParquetInputSplit getSplit(
         filtedBlocks = splitGroup;
       }
 
+
       split = new ParquetInputSplit(finalPath,
-        splitStart,
-        splitLength,
-        oldSplit.getLocations(),
-        filtedBlocks,
-        readContext.getRequestedSchema().toString(),
-        fileMetaData.getSchema().toString(),
-        fileMetaData.getKeyValueMetaData(),
-        readContext.getReadSupportMetadata());
+              splitStart,
+              splitStart + splitLength,
+              splitLength,
+              oldSplit.getLocations(),
+              null);
+
       return split;
     } else {
       throw new IllegalArgumentException("Unknown split type: " + oldSplit);

> Hive use deprecated ParquetInputSplit constructor which blocked parquet dictionary filter
> -----------------------------------------------------------------------------------------
>
>                 Key: HIVE-17261
>                 URL: https://issues.apache.org/jira/browse/HIVE-17261
>             Project: Hive
>          Issue Type: Improvement
>          Components: Database/Schema
>    Affects Versions: 2.2.0
>            Reporter: Junjie Chen
>            Assignee: Junjie Chen
>            Priority: Minor
>         Attachments: HIVE-17261.diff
>
>
> Hive uses the deprecated ParquetInputSplit constructor in
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L128]
> Please see the interface definition in
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputSplit.java#L80]
> The old interface sets row group offset values, which causes Parquet to skip the dictionary filter.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
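
The effect of passing null as the last constructor argument in the patch above can be sketched with a simplified model. It assumes, following the issue description, that a split carrying preselected row group offsets bypasses reader-side filtering, while a split with null offsets leaves row group selection to the reader. Every type and method below is a hypothetical stand-in, not a parquet-mr or Hive class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified, hypothetical model; these are not the parquet-mr classes. */
final class RowGroupSelectionSketch {

  /** Stand-in for a row group: its file offset and whether its dictionary could match the predicate. */
  static final class RowGroup {
    final long offset;
    final boolean dictionaryMayMatch;

    RowGroup(long offset, boolean dictionaryMayMatch) {
      this.offset = offset;
      this.dictionaryMayMatch = dictionaryMayMatch;
    }
  }

  /**
   * Returns the offsets of the row groups that would be read.
   * Non-null rowGroupOffsets: the caller's selection is used as-is, no filtering here.
   * Null rowGroupOffsets: the reader filters the footer's row groups itself.
   */
  static List<Long> selectRowGroups(List<RowGroup> footer, long[] rowGroupOffsets) {
    List<Long> selected = new ArrayList<>();
    if (rowGroupOffsets != null) {
      for (long offset : rowGroupOffsets) {
        selected.add(offset); // dictionary filter is never consulted on this path
      }
      return selected;
    }
    for (RowGroup group : footer) {
      if (group.dictionaryMayMatch) {
        selected.add(group.offset);
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    List<RowGroup> footer = Arrays.asList(
        new RowGroup(4L, true),      // dictionary may contain the searched value
        new RowGroup(1000L, false)); // dictionary rules the value out

    // Old style: offsets chosen by the caller, both row groups are read.
    System.out.println(selectRowGroups(footer, new long[] {4L, 1000L})); // [4, 1000]
    // New style: null offsets, the second row group is skipped by the reader.
    System.out.println(selectRowGroups(footer, null));                   // [4]
  }
}
```

The model leaves out statistics-based filtering and split byte ranges; it only shows why preselected offsets keep the dictionary filter from running.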