Re: [VOTE] Release Apache Parquet 1.12.3 RC1

2022-06-08 Thread Prakhar Jain
Also the commits referred by it
<https://github.com/apache/spark/pull/36629#issuecomment-1139217605> are
not the same as the RC candidate mentioned in the above email -
https://github.com/apache/parquet-mr/blob/parquet-1.12.3/CHANGES.md .
Correction: The commits seem to be the same.

Ignore the previous email. I was able to find more info around this here -
http://parquet.incubator.apache.org/blog/2022/05/26/1.12.3/ .



On Wed, Jun 8, 2022 at 11:39 AM Prakhar Jain wrote:

> Hi Team
>   Is *Apache Parquet 1.12.3* already out? I didn't see any
> announcement/release notes around this but saw this thread in Apache Spark
> mentioning it has already been released -
> https://github.com/apache/spark/pull/36629#issuecomment-1139211125 . Also
> the commits referred by it
> <https://github.com/apache/spark/pull/36629#issuecomment-1139217605> are
> not the same as the RC candidate mentioned in the above email -
> https://github.com/apache/parquet-mr/blob/parquet-1.12.3/CHANGES.md .
>
> The Jira tracking the release also seems to be open -
> https://issues.apache.org/jira/browse/PARQUET-2145 .
>
> Thanks
> Prakhar
>
> On Fri, May 27, 2022 at 7:29 AM Gidon Gershinsky  wrote:
>
>> Excellent news, thanks all! Thanks also Chao Sun for a big help with this
>> release!
>>
>> Cheers, Gidon
>>
>>
>> On Thu, May 26, 2022 at 9:06 AM Xinli shang wrote:
>>
>> > Thanks, Julien, Gidon, and Yuming, for verifying and voting! The vote passed!
>> > I will move forward with the next steps.
>> >
>> > On Wed, May 25, 2022 at 9:29 PM Julien Le Dem wrote:
>> >
>> > > +1
>> > > Verified signatures and tested
>> > >
>> > > On Mon, May 23, 2022 at 4:23 PM Xinli shang wrote:
>> > >
>> > > > I also vote +1.
>> > > >
>> > > > On Sun, May 22, 2022 at 5:59 PM Wang, Yuming wrote:
>> > > >
>> > > > > +1. Tested through Spark: https://github.com/apache/spark/pull/36629
>> > > > >
>> > > > > From: Gidon Gershinsky 
>> > > > > Date: Sunday, May 22, 2022 at 19:02
>> > > > > To: dev@parquet.apache.org 
>> > > > > Subject: Re: [VOTE] Release Apache Parquet 1.12.3 RC1
>> > > > >
>> > > > > +1. Downloaded, verified and tested.
>> > > > >
>> > > > > Cheers, Gidon
>> > > > >
>> > > > >
>> > > > > On Fri, May 20, 2022 at 8:49 PM Xinli shang wrote:
>> > > > >
>> > > > > > Hi everyone,
>> > > > > >
>> > > > > >
>> > > > > > I propose the following RC to be released as the official Apache Parquet
>> > > > > > 1.12.3 release.
>> > > > > >
>> > > > > >
>> > > > > > The commit id is f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b
>> > > > > >
>> > > > > > * This corresponds to the tag: apache-parquet-1.12.3-rc1
>> > > > > >
>> > > > > > * https://github.com/apache/parquet-mr/releases/tag/apache-parquet-1.12.3-rc1
>> > > > > >
>> > > > > >
>> > > > > > The release tarball, signature, and checksums are here:
>> > > > > >
>> > > > > > * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.12.3-rc1

Re: [VOTE] Release Apache Parquet 1.12.3 RC1

2022-06-08 Thread Prakhar Jain
Hi Team
  Is *Apache Parquet 1.12.3* already out? I didn't see any
announcement/release notes around this but saw this thread in Apache Spark
mentioning it has already been released -
https://github.com/apache/spark/pull/36629#issuecomment-1139211125 . Also
the commits referred by it
<https://github.com/apache/spark/pull/36629#issuecomment-1139217605> are
not the same as the RC candidate mentioned in the above email -
https://github.com/apache/parquet-mr/blob/parquet-1.12.3/CHANGES.md .

The Jira tracking the release also seems to be open -
https://issues.apache.org/jira/browse/PARQUET-2145 .

Thanks
Prakhar






On Fri, May 27, 2022 at 7:29 AM Gidon Gershinsky  wrote:

> Excellent news, thanks all! Thanks also Chao Sun for a big help with this
> release!
>
> Cheers, Gidon
>
>
> On Thu, May 26, 2022 at 9:06 AM Xinli shang wrote:
>
> > Thanks, Julien, Gidon, and Yuming, for verifying and voting! The vote passed!
> > I will move forward with the next steps.
> >
> > On Wed, May 25, 2022 at 9:29 PM Julien Le Dem wrote:
> >
> > > +1
> > > Verified signatures and tested
> > >
> > > On Mon, May 23, 2022 at 4:23 PM Xinli shang wrote:
> > >
> > > > I also vote +1.
> > > >
> > > > On Sun, May 22, 2022 at 5:59 PM Wang, Yuming wrote:
> > > >
> > > > > +1. Tested through Spark: https://github.com/apache/spark/pull/36629
> > > > >
> > > > > From: Gidon Gershinsky 
> > > > > Date: Sunday, May 22, 2022 at 19:02
> > > > > To: dev@parquet.apache.org 
> > > > > Subject: Re: [VOTE] Release Apache Parquet 1.12.3 RC1
> > > > >
> > > > > +1. Downloaded, verified and tested.
> > > > >
> > > > > Cheers, Gidon
> > > > >
> > > > >
> > > > > On Fri, May 20, 2022 at 8:49 PM Xinli shang wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > >
> > > > > > I propose the following RC to be released as the official Apache Parquet
> > > > > > 1.12.3 release.
> > > > > >
> > > > > >
> > > > > > The commit id is f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b
> > > > > >
> > > > > > * This corresponds to the tag: apache-parquet-1.12.3-rc1
> > > > > >
> > > > > > * https://github.com/apache/parquet-mr/releases/tag/apache-parquet-1.12.3-rc1
> > > > > >
> > > > > >
> > > > > > The release tarball, signature, and checksums are here:
> > > > > >
> > > > > > * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.12.3-rc1
> > > > > >
> > > > > >
> > > > > > You can find the KEYS file here:
> > > > > >
> > > > > > * https://dist.apache.org/repos/dist/release/parquet/KEYS
> > > > > >
> > > > > >
> > > > > > Binary artifacts are staged in Nexus here:
> > > > > >
> > > > > > * https://repository.apache.org/content/groups/staging/org/apache/parquet/
> > > > > >
> > > > > >
> > > > > > This release includes important changes listed
> > > > > >
> > > > >
> > > >

[jira] [Resolved] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-06-08 Thread Prakhar Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakhar Jain resolved PARQUET-2117.
---
Fix Version/s: 1.12.3 (was: 1.13.0)
   Resolution: Fixed

> Add rowPosition API in parquet record readers
> ---------------------------------------------
>
>                 Key: PARQUET-2117
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2117
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Prakhar Jain
>            Priority: Major
>             Fix For: 1.12.3
>
>
> Currently the parquet-mr RecordReader/ParquetFileReader exposes APIs to read
> a parquet file in columnar fashion or record-by-record.
> It would be great to extend them to also support a rowPosition API which can
> tell the position of the current record in the parquet file.
> The rowPosition can be used as a unique row identifier to mark a row. This
> can be useful to create an index (e.g. B+ tree) over a parquet file/parquet
> table (e.g. Spark/Hive).
> There are multiple projects in the parquet ecosystem which can benefit from
> such functionality:
>  # Apache Iceberg needs this functionality. It already has an implementation,
> as it relies on low-level parquet APIs -
> [Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171],
> [Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
>  # Apache Spark can use this functionality - SPARK-37980



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-06-08 Thread Prakhar Jain (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551740#comment-17551740
 ] 

Prakhar Jain commented on PARQUET-2117:
---

Resolving this issue as the PR is merged. [~gershinsky] Could you reassign the 
Jira to me?

> Add rowPosition API in parquet record readers
> ---------------------------------------------
>
>                 Key: PARQUET-2117
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2117
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Prakhar Jain
>            Priority: Major
>             Fix For: 1.12.3





Re: Meeting notes for Parquet sync meeting - March 1st. 2022

2022-03-10 Thread Prakhar Jain
Hi All
  Thanks for sharing the meeting notes. Do we have a tentative timeline in
mind for the new version of Parquet-MR? Also, will it be a major, minor, or
patch release?

Thanks and Regards
Prakhar Jain


On Tue, Mar 1, 2022 at 9:34 AM Xinli shang  wrote:

> 3/1/2022
>
> Attendees: Xinli Shang, Gidon Gershinsky, Vinoo Ganesh
>
>    1. The new website of Apache Parquet is to be launched
>       1. https://www.vinoo.io/
>       2. Vinoo to send out an email to dev@ for a preview
>    2. Cell level encryption
>       1. Objective/Goals need to be clear
>       2. Performance
>       3. Avoid changing specification
>    3. Data masking
>       1. The PR is to be ready to review soon.
>    4. Release new version of Parquet
>       1. Blocked on ID resolution change. Need to ping.
>
> --
> Xinli Shang
>


[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-10 Thread Prakhar Jain (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490650#comment-17490650
 ] 

Prakhar Jain commented on PARQUET-2117:
---

[~sha...@uber.com] [~gszadovszky] Could you please [review the
PR|https://github.com/apache/parquet-mr/pull/945] and provide your feedback?
Thanks!

> Add rowPosition API in parquet record readers
> ---------------------------------------------
>
>                 Key: PARQUET-2117
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2117
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Prakhar Jain
>            Priority: Major
>             Fix For: 1.13.0





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-03 Thread Prakhar Jain (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486806#comment-17486806
 ] 

Prakhar Jain commented on PARQUET-2117:
---

[~sha...@uber.com] Yes, I am working on this. I will share the PR soon.

> Add rowPosition API in parquet record readers
> ---------------------------------------------
>
>                 Key: PARQUET-2117
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2117
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Prakhar Jain
>            Priority: Major
>             Fix For: 1.13.0





[jira] [Updated] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-01 Thread Prakhar Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakhar Jain updated PARQUET-2117:
--
Description: 
Currently the parquet-mr RecordReader/ParquetFileReader exposes APIs to read
a parquet file in columnar fashion or record-by-record.

It would be great to extend them to also support a rowPosition API which can
tell the position of the current record in the parquet file.

The rowPosition can be used as a unique row identifier to mark a row. This can
be useful to create an index (e.g. B+ tree) over a parquet file/parquet table
(e.g. Spark/Hive).

There are multiple projects in the parquet ecosystem which can benefit from
such functionality:
 # Apache Iceberg needs this functionality. It already has an implementation,
as it relies on low-level parquet APIs -
[Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171],
[Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
 # Apache Spark can use this functionality - SPARK-37980

  was:
Currently the parquet-mr RecordReader/ParquetFileReader exposes APIs to read
a parquet file in columnar fashion or record-by-record.

It would be great to extend them to also support a rowPosition API which can
tell the position of the current record in the parquet file.

The rowPosition can be used as a unique row identifier to mark a row. This can
be useful to create an index (e.g. B+ tree) over a parquet file/parquet table
(e.g. Spark/Hive).

There are multiple projects in the parquet ecosystem which can benefit from
such functionality:
 # Apache Iceberg needs this functionality. It already has an implementation,
as it relies on low-level parquet APIs -
[Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171],
[Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
 # Apache Spark wants to expose this as a metadata column - SPARK-37980


> Add rowPosition API in parquet record readers
> ---------------------------------------------
>
>                 Key: PARQUET-2117
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2117
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Prakhar Jain
>            Priority: Major
>             Fix For: 1.13.0





[jira] [Created] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-01 Thread Prakhar Jain (Jira)
Prakhar Jain created PARQUET-2117:
-

             Summary: Add rowPosition API in parquet record readers
                 Key: PARQUET-2117
                 URL: https://issues.apache.org/jira/browse/PARQUET-2117
             Project: Parquet
          Issue Type: New Feature
          Components: parquet-mr
            Reporter: Prakhar Jain
             Fix For: 1.13.0


Currently the parquet-mr RecordReader/ParquetFileReader exposes APIs to read
a parquet file in columnar fashion or record-by-record.

It would be great to extend them to also support a rowPosition API which can
tell the position of the current record in the parquet file.

The rowPosition can be used as a unique row identifier to mark a row. This can
be useful to create an index (e.g. B+ tree) over a parquet file/parquet table
(e.g. Spark/Hive).

There are multiple projects in the parquet ecosystem which can benefit from
such functionality:
 # Apache Iceberg needs this functionality. It already has an implementation,
as it relies on low-level parquet APIs -
[Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171],
[Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
 # Apache Spark wants to expose this as a metadata column - SPARK-37980
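
To illustrate what "position of the current record in the parquet file" could
mean, here is a minimal Java sketch. It assumes that row groups are read in
file order and that the number of rows in each row group is known (for example
from the file footer). The class RowPositionSketch and its methods are
hypothetical names used only for illustration; this is not the parquet-mr API
proposed in this issue.

```java
/** Hypothetical helper (not part of parquet-mr): derives file-level row positions. */
final class RowPositionSketch {

    /**
     * Returns, for each row group, the file-level position of its first row,
     * given the number of rows stored in each row group in file order.
     */
    static long[] rowGroupStartPositions(long[] rowCountsPerGroup) {
        long[] starts = new long[rowCountsPerGroup.length];
        long rowsSeenSoFar = 0L;
        for (int i = 0; i < rowCountsPerGroup.length; i++) {
            starts[i] = rowsSeenSoFar;
            rowsSeenSoFar += rowCountsPerGroup[i];
        }
        return starts;
    }

    /**
     * File-level position of the row at rowIndexInGroup inside row group groupIndex:
     * the number of rows in all preceding row groups plus the index within the group.
     */
    static long fileRowPosition(long[] rowCountsPerGroup, int groupIndex, long rowIndexInGroup) {
        if (groupIndex < 0 || groupIndex >= rowCountsPerGroup.length) {
            throw new IndexOutOfBoundsException("no such row group: " + groupIndex);
        }
        if (rowIndexInGroup < 0 || rowIndexInGroup >= rowCountsPerGroup[groupIndex]) {
            throw new IndexOutOfBoundsException("no such row in group: " + rowIndexInGroup);
        }
        return rowGroupStartPositions(rowCountsPerGroup)[groupIndex] + rowIndexInGroup;
    }

    public static void main(String[] args) {
        // Made-up row counts for a file with three row groups.
        long[] rowCounts = {100L, 250L, 75L};
        // The 11th row (index 10) of the second row group comes after the
        // 100 rows of the first group.
        System.out.println(fileRowPosition(rowCounts, 1, 10L));
    }
}
```

Because such a position is unique within one file, pairing it with the file
path would give an identifier that an external index could store to point back
at a single row.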


