[jira] [Commented] (PARQUET-2161) Row positions are computed incorrectly when range or offset metadata filter is used
[ https://issues.apache.org/jira/browse/PARQUET-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17619007#comment-17619007 ]

ASF GitHub Bot commented on PARQUET-2161:

shangxinli commented on PR #978:
URL: https://github.com/apache/parquet-mr/pull/978#issuecomment-1281184662

@ala Thanks for pinging me! At this moment, I don't have ETA yet.

> Row positions are computed incorrectly when range or offset metadata filter is used
> ---
>
> Key: PARQUET-2161
> URL: https://issues.apache.org/jira/browse/PARQUET-2161
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.12.3
> Reporter: Ala Luszczak
> Priority: Major
>
> The row indexes introduced in PARQUET-2117 are not computed correctly when
> (1) range or offset metadata filter is applied, and
> (2) the first row group was eliminated by the filter
> For example, if a file has two row groups with 10 rows each, and we attempt
> to only read the 2nd row group, we are going to produce row indexes 0, 1, 2,
> ..., 9 instead of expected 10, 11, ..., 19.
> This happens because functions `filterFileMetaDataByStart` (used here:
> https://github.com/apache/parquet-mr/blob/e06384455567c56d5906fc3a152ab00fd8dfdf33/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1453)
> and `filterFileMetaDataByMidpoint` (used here:
> https://github.com/apache/parquet-mr/blob/e06384455567c56d5906fc3a152ab00fd8dfdf33/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1460)
> modify their input `FileMetaData`. To address the issue we need to
> `generateRowGroupOffsets` before these filters are applied.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
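The ordering problem described in the issue can be illustrated with a small, self-contained sketch. This is not the parquet-mr code: `RowGroup`, `firstRowIndexes`, and the two `offsetsComputed...` methods are hypothetical names invented for illustration, and the filter is reduced to "drop every row group before index `firstKept`". It only shows why computing the first-row index of each row group after filtering restarts the numbering at 0, while computing it before filtering preserves the positions in the file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RowIndexOffsetSketch {

    /** Hypothetical stand-in for a row group; only its row count matters here. */
    static final class RowGroup {
        final long numRows;

        RowGroup(long numRows) {
            this.numRows = numRows;
        }
    }

    /** First-row index of each group: the running sum of the preceding row counts. */
    static List<Long> firstRowIndexes(List<RowGroup> groups) {
        List<Long> offsets = new ArrayList<>();
        long next = 0;
        for (RowGroup group : groups) {
            offsets.add(next);
            next += group.numRows;
        }
        return offsets;
    }

    /** Problematic order: filter first, then compute offsets over the survivors only. */
    static List<Long> offsetsComputedAfterFilter(List<RowGroup> all, int firstKept) {
        return firstRowIndexes(all.subList(firstKept, all.size()));
    }

    /** Order proposed in the issue: compute offsets over the full list, then filter. */
    static List<Long> offsetsComputedBeforeFilter(List<RowGroup> all, int firstKept) {
        List<Long> offsets = firstRowIndexes(all);
        return new ArrayList<>(offsets.subList(firstKept, offsets.size()));
    }

    public static void main(String[] args) {
        // Two row groups with 10 rows each; only the 2nd one is read.
        List<RowGroup> file = Arrays.asList(new RowGroup(10), new RowGroup(10));
        System.out.println(offsetsComputedAfterFilter(file, 1));  // prints [0]
        System.out.println(offsetsComputedBeforeFilter(file, 1)); // prints [10]
    }
}
```

With the two-row-group file from the issue, the first variant starts the second row group at row 0 (yielding indexes 0 to 9), and the second starts it at row 10 (yielding the expected 10 to 19).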
[jira] [Commented] (PARQUET-2161) Row positions are computed incorrectly when range or offset metadata filter is used
[ https://issues.apache.org/jira/browse/PARQUET-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17618960#comment-17618960 ]

ASF GitHub Bot commented on PARQUET-2161:

ala commented on PR #978:
URL: https://github.com/apache/parquet-mr/pull/978#issuecomment-1281067651

@ggershinsky @shangxinli Hi! I just wanted to ask if 1.12.4 release might be happening soon (it seems in the previous years there usually was a release around September-October time)? We could really use the fix in Spark. Also: do I need to cherry-pick this fix, or would the next release be cut from `master`?
[jira] [Commented] (PARQUET-2161) Row positions are computed incorrectly when range or offset metadata filter is used
[ https://issues.apache.org/jira/browse/PARQUET-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571233#comment-17571233 ]

ASF GitHub Bot commented on PARQUET-2161:

ggershinsky commented on PR #978:
URL: https://github.com/apache/parquet-mr/pull/978#issuecomment-1195083014

cc @shangxinli
[jira] [Commented] (PARQUET-2161) Row positions are computed incorrectly when range or offset metadata filter is used
[ https://issues.apache.org/jira/browse/PARQUET-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568960#comment-17568960 ]

ASF GitHub Bot commented on PARQUET-2161:

ala commented on PR #978:
URL: https://github.com/apache/parquet-mr/pull/978#issuecomment-1190057981

@ggershinsky Do you know when the next release that will include the fix might happen? We are looking to unblock https://issues.apache.org/jira/browse/SPARK-39634 in Apache Spark.
[jira] [Commented] (PARQUET-2161) Row positions are computed incorrectly when range or offset metadata filter is used
[ https://issues.apache.org/jira/browse/PARQUET-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17560629#comment-17560629 ]

ASF GitHub Bot commented on PARQUET-2161:

ggershinsky commented on PR #978:
URL: https://github.com/apache/parquet-mr/pull/978#issuecomment-1170396062

Thanks @ala
[jira] [Commented] (PARQUET-2161) Row positions are computed incorrectly when range or offset metadata filter is used
[ https://issues.apache.org/jira/browse/PARQUET-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17560630#comment-17560630 ]

ASF GitHub Bot commented on PARQUET-2161:

ggershinsky merged PR #978:
URL: https://github.com/apache/parquet-mr/pull/978
[jira] [Commented] (PARQUET-2161) Row positions are computed incorrectly when range or offset metadata filter is used
[ https://issues.apache.org/jira/browse/PARQUET-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558552#comment-17558552 ]

ASF GitHub Bot commented on PARQUET-2161:

ala commented on PR #978:
URL: https://github.com/apache/parquet-mr/pull/978#issuecomment-1165708679

cc @ggershinsky
[jira] [Commented] (PARQUET-2161) Row positions are computed incorrectly when range or offset metadata filter is used
[ https://issues.apache.org/jira/browse/PARQUET-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558001#comment-17558001 ]

ASF GitHub Bot commented on PARQUET-2161:

ala commented on PR #978:
URL: https://github.com/apache/parquet-mr/pull/978#issuecomment-1164263841

cc @shangxinli This is a small follow-up bug fix for https://github.com/apache/parquet-mr/pull/945
[jira] [Commented] (PARQUET-2161) Row positions are computed incorrectly when range or offset metadata filter is used
[ https://issues.apache.org/jira/browse/PARQUET-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556329#comment-17556329 ]

ASF GitHub Bot commented on PARQUET-2161:

ala opened a new pull request, #978:
URL: https://github.com/apache/parquet-mr/pull/978

Make sure you have checked _all_ steps below.

### Jira

- [x] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
  - https://issues.apache.org/jira/browse/PARQUET-2161
  - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).

### Tests

- [x] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
  - Extends `TestParquetReader` suite.

### Commits

- [x] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
  1. Subject is separated from body by a blank line
  1. Subject is limited to 50 characters (not including Jira issue reference)
  1. Subject does not end with a period
  1. Subject uses the imperative mood ("add", not "adding")
  1. Body wraps at 72 characters
  1. Body explains "what" and "why", not "how"

### Documentation

- [x] In case of new functionality, my PR adds documentation that describes how to use it.
  - All the public functions and the classes in the PR contain Javadoc that explain what it does