[jira] [Commented] (PARQUET-2161) Row positions are computed incorrectly when range or offset metadata filter is used

2022-10-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17619007#comment-17619007
 ] 

ASF GitHub Bot commented on PARQUET-2161:
-

shangxinli commented on PR #978:
URL: https://github.com/apache/parquet-mr/pull/978#issuecomment-1281184662

   @ala Thanks for pinging me! At the moment, I don't have an ETA yet.




> Row positions are computed incorrectly when range or offset metadata filter 
> is used
> ---
>
> Key: PARQUET-2161
> URL: https://issues.apache.org/jira/browse/PARQUET-2161
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.12.3
> Reporter: Ala Luszczak
> Priority: Major
>
> The row indexes introduced in PARQUET-2117 are not computed correctly when
> (1) a range or offset metadata filter is applied, and
> (2) the first row group is eliminated by the filter.
> For example, if a file has two row groups with 10 rows each and we read 
> only the 2nd row group, we produce row indexes 0, 1, 2, ..., 9 instead of 
> the expected 10, 11, ..., 19.
> This happens because the functions `filterFileMetaDataByStart` (used here: 
> https://github.com/apache/parquet-mr/blob/e06384455567c56d5906fc3a152ab00fd8dfdf33/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1453)
> and `filterFileMetaDataByMidpoint` (used here: 
> https://github.com/apache/parquet-mr/blob/e06384455567c56d5906fc3a152ab00fd8dfdf33/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1460)
> modify their input `FileMetaData`. To address the issue, we need to call 
> `generateRowGroupOffsets` before these filters are applied.
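
The ordering problem described in the issue can be illustrated with a small, self-contained sketch. It is not the parquet-mr code: the class `RowIndexSketch`, both method names, and the `keep` flags standing in for a range or offset metadata filter are hypothetical, chosen only to reproduce the "two row groups with 10 rows each" example. The assumption is that a row group's first row index is the running sum of the row counts of all row groups before it in the file.

```java
import java.util.Arrays;

/** Illustrative sketch only; names are hypothetical and are not the parquet-mr API. */
final class RowIndexSketch {
    private RowIndexSketch() {}

    /**
     * Problematic order: filter first, then compute offsets.
     * Row groups dropped by the filter never contribute their row counts,
     * so the surviving groups are numbered as if they started the file.
     */
    static long[] offsetsAfterFiltering(long[] rowCounts, boolean[] keep) {
        long[] result = new long[rowCounts.length];
        int kept = 0;
        long next = 0;
        for (int i = 0; i < rowCounts.length; i++) {
            if (!keep[i]) {
                continue; // a dropped group does not advance the position
            }
            result[kept++] = next;
            next += rowCounts[i];
        }
        return Arrays.copyOf(result, kept);
    }

    /**
     * Order proposed in the issue: compute offsets over all row groups,
     * then apply the filter. Every group advances the position, kept or not.
     */
    static long[] offsetsBeforeFiltering(long[] rowCounts, boolean[] keep) {
        long[] result = new long[rowCounts.length];
        int kept = 0;
        long next = 0;
        for (int i = 0; i < rowCounts.length; i++) {
            if (keep[i]) {
                result[kept++] = next;
            }
            next += rowCounts[i];
        }
        return Arrays.copyOf(result, kept);
    }

    public static void main(String[] args) {
        long[] rowCounts = {10, 10};    // two row groups with 10 rows each
        boolean[] keep = {false, true}; // the filter eliminates the first row group
        System.out.println(Arrays.toString(offsetsAfterFiltering(rowCounts, keep)));  // [0]
        System.out.println(Arrays.toString(offsetsBeforeFiltering(rowCounts, keep))); // [10]
    }
}
```

With the first row group filtered out, the filter-first order starts the second row group at row index 0, while the offsets-first order starts it at 10, matching the expected indexes 10, 11, ..., 19.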



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2161) Row positions are computed incorrectly when range or offset metadata filter is used

2022-10-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17618960#comment-17618960
 ] 

ASF GitHub Bot commented on PARQUET-2161:
-

ala commented on PR #978:
URL: https://github.com/apache/parquet-mr/pull/978#issuecomment-1281067651

   @ggershinsky @shangxinli Hi! I just wanted to ask whether a 1.12.4 release 
might be happening soon (in previous years there was usually a release around 
September-October). We could really use the fix in Spark. Also, do I need to 
cherry-pick this fix, or will the next release be cut from `master`?






[jira] [Commented] (PARQUET-2161) Row positions are computed incorrectly when range or offset metadata filter is used

2022-07-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571233#comment-17571233
 ] 

ASF GitHub Bot commented on PARQUET-2161:
-

ggershinsky commented on PR #978:
URL: https://github.com/apache/parquet-mr/pull/978#issuecomment-1195083014

   cc @shangxinli 






[jira] [Commented] (PARQUET-2161) Row positions are computed incorrectly when range or offset metadata filter is used

2022-07-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568960#comment-17568960
 ] 

ASF GitHub Bot commented on PARQUET-2161:
-

ala commented on PR #978:
URL: https://github.com/apache/parquet-mr/pull/978#issuecomment-1190057981

   @ggershinsky Do you know when the next release that will include the fix 
might happen? We are looking to unblock 
https://issues.apache.org/jira/browse/SPARK-39634 in Apache Spark.






[jira] [Commented] (PARQUET-2161) Row positions are computed incorrectly when range or offset metadata filter is used

2022-06-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17560629#comment-17560629
 ] 

ASF GitHub Bot commented on PARQUET-2161:
-

ggershinsky commented on PR #978:
URL: https://github.com/apache/parquet-mr/pull/978#issuecomment-1170396062

   Thanks @ala 






[jira] [Commented] (PARQUET-2161) Row positions are computed incorrectly when range or offset metadata filter is used

2022-06-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17560630#comment-17560630
 ] 

ASF GitHub Bot commented on PARQUET-2161:
-

ggershinsky merged PR #978:
URL: https://github.com/apache/parquet-mr/pull/978






[jira] [Commented] (PARQUET-2161) Row positions are computed incorrectly when range or offset metadata filter is used

2022-06-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558552#comment-17558552
 ] 

ASF GitHub Bot commented on PARQUET-2161:
-

ala commented on PR #978:
URL: https://github.com/apache/parquet-mr/pull/978#issuecomment-1165708679

   cc @ggershinsky
   






[jira] [Commented] (PARQUET-2161) Row positions are computed incorrectly when range or offset metadata filter is used

2022-06-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558001#comment-17558001
 ] 

ASF GitHub Bot commented on PARQUET-2161:
-

ala commented on PR #978:
URL: https://github.com/apache/parquet-mr/pull/978#issuecomment-1164263841

   cc @shangxinli This is a small follow-up bug fix for 
https://github.com/apache/parquet-mr/pull/945






[jira] [Commented] (PARQUET-2161) Row positions are computed incorrectly when range or offset metadata filter is used

2022-06-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556329#comment-17556329
 ] 

ASF GitHub Bot commented on PARQUET-2161:
-

ala opened a new pull request, #978:
URL: https://github.com/apache/parquet-mr/pull/978

   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [x] My PR addresses the following
     [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and
     references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
     - https://issues.apache.org/jira/browse/PARQUET-2161
     - In case you are adding a dependency, check if the license complies with
       the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [x] My PR adds the following unit tests __OR__ does not need testing for
     this extremely good reason:
     - Extends `TestParquetReader` suite.
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines. In
     addition, my commits follow the guidelines from
     "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [x] In case of new functionality, my PR adds documentation that describes
     how to use it.
     - All the public functions and the classes in the PR contain Javadoc that
       explain what it does
   



