Re: [ANNOUNCE] New Parquet PMC Member: Antoine Pitrou

2024-07-18 Thread Xinli shang
Congratulations! Well deserved!

On Thu, Jul 18, 2024 at 6:56 AM Antoine Pitrou  wrote:

>
> Thank you very much for welcoming me. I really feel honored to be part
> of the Parquet PMC.
>
> Best
>
> Antoine.
>
>
> On Thu, 18 Jul 2024 19:46:17 +0800
> Gang Wu  wrote:
> > On behalf of the Parquet PMC, I'm pleased to announce that Antoine
> > has been invited to be a Parquet PMC member and he has accepted.
> > Welcome, and thank you for your contributions!
> >
> > Cheers,
> > Gang
> >
>
>
>
>

-- 
Xinli Shang


Congrats to Julien Le Dem for being next PMC Chair

2024-07-02 Thread Xinli shang
Hi all,

I am delighted to share some exciting news with you. Please join me in
congratulating Julien Le Dem on his return as the next PMC Chair!

Julien is not only the co-author of Apache Parquet but also has previously
served as the PMC Chair, where his leadership and contributions have been
invaluable. His expertise and dedication continue to shape our community
and drive innovation.

We look forward to the continued success and growth of Apache Parquet
under Julien's capable leadership.

Xinli Shang
ex - Apache Parquet PMC Chair


Parquet community sync meeting notes 5/28/2024

2024-05-28 Thread Xinli shang
Attendees: Julien Le Dem, Jan Finis, Vinoo, Rok Mihevc, Ed Seidl, Dewey
Dunnington, Jiashen Zhang, Marcin Krystinc

   1. We need separate working groups for the Parquet improvements
      <https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit>,
      such as metadata and encodings, to reach consensus, start working on
      them, and implement them in Java, C++, and Rust.
   2. Have a feature implementation matrix.
      1. Xinli can start a draft by collecting each feature.
      2. Vinoo can help with the website.
   3. Have a test suite instead of a feature matrix. As long as an
      implementation passes the tests, it is certified.
   4. Geometry logical type: we are open to having it. We can continue with
      the PR review.
   5. Using Jira vs. GitHub: we can continue the discussion and vote.
   6. The ‘Binary’ data type issue: better to clarify it in the spec.

-- 
Xinli Shang


Updated invitation: Parquet Sync @ Monthly from 7am to 8am on the fourth Tuesday from Tue Feb 27 to Mon May 27 (PST) (dev@parquet.apache.org)

2024-05-23 Thread Xinli shang
BEGIN:VCALENDAR
PRODID:-//Google Inc//Google Calendar 70.9054//EN
VERSION:2.0
CALSCALE:GREGORIAN
METHOD:REQUEST
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
X-LIC-LOCATION:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:19700308T02
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:19701101T02
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTART;TZID=America/Los_Angeles:20240227T07
DTEND;TZID=America/Los_Angeles:20240227T08
RRULE:FREQ=MONTHLY;UNTIL=20240528T065959Z;BYDAY=4TU
DTSTAMP:20240523T144819Z
ORGANIZER;CN=Xinli shang:mailto:sha...@uber.com
UID:6vgu231jai324kjt1041divb7b_r20240227t150...@google.com
ATTENDEE;CUTYPE=RESOURCE;ROLE=REQ-PARTICIPANT;PARTSTAT=DECLINED;RSVP=TRUE;C
 N=SEA | 1191 2nd Ave-8th-Blakely (7) [Zoom];X-NUM-GUESTS=0:mailto:uber.com_
 53454131313931326e6441766530387468426c616b656c793756432d343836313237@resour
 ce.calendar.google.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=Xinli shang;X-NUM-GUESTS=0:mailto:sha...@uber.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=matthew.m.tur...@outlook.com;X-NUM-GUESTS=0:mailto:matthew.m.turner
 @outlook.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=gabor.szadovs...@cloudera.com;X-NUM-GUESTS=0:mailto:gabor.szadovszk
 y...@cloudera.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=TENTATIVE;RSVP=TRU
 E;CN=gg5...@gmail.com;X-NUM-GUESTS=0:mailto:gg5...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=DECLINED;RSVP=TRUE
 ;CN=gwali...@gmail.com;X-NUM-GUESTS=0:mailto:gwali...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=TENTATIVE;RSVP=TRU
 E;CN=emkornfi...@gmail.com;X-NUM-GUESTS=0:mailto:emkornfi...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=arl...@pitt.edu;X-NUM-GUESTS=0:mailto:arl...@pitt.edu
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=altekruseja...@gmail.com;X-NUM-GUESTS=0:mailto:altekrusejason@gmail
 .com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=Ryan Blue;X-NUM-GUESTS=0:mailto:rb...@netflix.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=chao.apa...@gmail.com;X-NUM-GUESTS=0:mailto:chao.apa...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=yumw...@ebay.com;X-NUM-GUESTS=0:mailto:yumw...@ebay.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=fo...@driesprongen.nl;X-NUM-GUESTS=0:mailto:fo...@driesprongen.nl
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=TENTATIVE;RSVP=TRU
 E;CN=aniskodedoss...@etsy.com;X-NUM-GUESTS=0:mailto:aniskodedoss...@etsy.co
 m
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=DECLINED;RSVP=TRUE
 ;CN=i...@isolineltd.com;X-NUM-GUESTS=0:mailto:i...@isolineltd.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=brian.bow...@sas.com;X-NUM-GUESTS=0:mailto:brian.bow...@sas.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=jiashenzz...@gmail.com;X-NUM-GUESTS=0:mailto:jiashenzz...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=vinoo.gan...@gmail.com;X-NUM-GUESTS=0:mailto:vinoo.gan...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=DECLINED;RSVP=TRUE
 ;CN=hadrien.k...@sonat.no;X-NUM-GUESTS=0:mailto:hadrien.k...@sonat.no
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=py...@pinterest.com;X-NUM-GUESTS=0:mailto:py...@pinterest.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=DECLINED;RSVP=TRUE
 ;CN=jorgecarlei...@gmail.com;X-NUM-GUESTS=0:mailto:jorgecarlei...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=huaxin.ga...@gmail.com;X-NUM-GUESTS=0:mailto:huaxin.ga...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=robe...@palantir.com;X-NUM-GUESTS=0:mailto:robe...@palantir.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=DECLINED;RSVP=TRUE
 ;CN=theo...@amazon.com;X-NUM-GUESTS=0:mailto:theo...@amazon.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=shengxuan@bytedance.com;X-NUM-GUESTS=0:mailto:shengxuan.liu@bytedan
 ce.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=collimarc...@gmail.com;X-NUM-GUESTS=0:mailto:collimarc...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=yi.he.ust...@gmail.com;X-NUM-GUESTS=0:mailto:yi.he.ust...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=v...@onehouse.ai;X-NUM-GUESTS

Updated invitation: Parquet Sync @ Tue May 28, 2024 9am - 10am (PDT) (dev@parquet.apache.org)

2024-05-23 Thread Xinli shang
BEGIN:VCALENDAR
PRODID:-//Google Inc//Google Calendar 70.9054//EN
VERSION:2.0
CALSCALE:GREGORIAN
METHOD:REQUEST
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
X-LIC-LOCATION:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:19700308T02
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:19701101T02
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTART;TZID=America/Los_Angeles:20240528T09
DTEND;TZID=America/Los_Angeles:20240528T10
DTSTAMP:20240523T154428Z
ORGANIZER;CN=Xinli shang:mailto:sha...@uber.com
UID:e0nn7qc9q58dv974d5gmrql...@google.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=Xinli shang;X-NUM-GUESTS=0:mailto:sha...@uber.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=matthew.m.tur...@outlook.com;X-NUM-GUESTS=0:mailto:matthew.m.turner
 @outlook.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=gabor.szadovs...@cloudera.com;X-NUM-GUESTS=0:mailto:gabor.szadovszk
 y...@cloudera.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=gg5...@gmail.com;X-NUM-GUESTS=0:mailto:gg5...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=gwali...@gmail.com;X-NUM-GUESTS=0:mailto:gwali...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=emkornfi...@gmail.com;X-NUM-GUESTS=0:mailto:emkornfi...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=arl...@pitt.edu;X-NUM-GUESTS=0:mailto:arl...@pitt.edu
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=altekruseja...@gmail.com;X-NUM-GUESTS=0:mailto:altekrusejason@gmail
 .com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=Ryan Blue;X-NUM-GUESTS=0:mailto:rb...@netflix.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=chao.apa...@gmail.com;X-NUM-GUESTS=0:mailto:chao.apa...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=yumw...@ebay.com;X-NUM-GUESTS=0:mailto:yumw...@ebay.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=fo...@driesprongen.nl;X-NUM-GUESTS=0:mailto:fo...@driesprongen.nl
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=TENTATIVE;RSVP=TRU
 E;CN=aniskodedoss...@etsy.com;X-NUM-GUESTS=0:mailto:aniskodedoss...@etsy.co
 m
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=i...@isolineltd.com;X-NUM-GUESTS=0:mailto:i...@isolineltd.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=brian.bow...@sas.com;X-NUM-GUESTS=0:mailto:brian.bow...@sas.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=jiashenzz...@gmail.com;X-NUM-GUESTS=0:mailto:jiashenzz...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=vinoo.gan...@gmail.com;X-NUM-GUESTS=0:mailto:vinoo.gan...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=hadrien.k...@sonat.no;X-NUM-GUESTS=0:mailto:hadrien.k...@sonat.no
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=py...@pinterest.com;X-NUM-GUESTS=0:mailto:py...@pinterest.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=jorgecarlei...@gmail.com;X-NUM-GUESTS=0:mailto:jorgecarleitao@gmail
 .com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=huaxin.ga...@gmail.com;X-NUM-GUESTS=0:mailto:huaxin.ga...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=robe...@palantir.com;X-NUM-GUESTS=0:mailto:robe...@palantir.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=theo...@amazon.com;X-NUM-GUESTS=0:mailto:theo...@amazon.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=shengxuan@bytedance.com;X-NUM-GUESTS=0:mailto:shengxuan.liu@byt
 edance.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=collimarc...@gmail.com;X-NUM-GUESTS=0:mailto:collimarc...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=yi.he.ust...@gmail.com;X-NUM-GUESTS=0:mailto:yi.he.ust...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=v...@onehouse.ai;X-NUM-GUESTS=0:mailto:v...@onehouse.ai
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=gabor.szadovs...@gmail.com;X-NUM-GUESTS=0:mailto:gabor.szadovszky@g
 mail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE

Updated invitation: Parquet Sync @ Monthly from 7am to 8am on the fourth Tuesday from Tue Apr 25, 2023 to Mon May 27 (PDT) (dev@parquet.apache.org)

2024-05-23 Thread Xinli shang
BEGIN:VCALENDAR
PRODID:-//Google Inc//Google Calendar 70.9054//EN
VERSION:2.0
CALSCALE:GREGORIAN
METHOD:REQUEST
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
X-LIC-LOCATION:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:19700308T02
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:19701101T02
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTART;TZID=America/Los_Angeles:20230425T07
DTEND;TZID=America/Los_Angeles:20230425T08
RRULE:FREQ=MONTHLY;UNTIL=20230627T065959Z;BYDAY=4TU
DTSTAMP:20240523T144820Z
ORGANIZER;CN=Xinli shang:mailto:sha...@uber.com
UID:6vgu231jai324kjt1041div...@google.com
ATTENDEE;CUTYPE=RESOURCE;ROLE=REQ-PARTICIPANT;PARTSTAT=DECLINED;RSVP=TRUE;C
 N=SEA | 1191 2nd Ave-8th-Blakely (7) [Zoom];X-NUM-GUESTS=0:mailto:uber.com_
 53454131313931326e6441766530387468426c616b656c793756432d343836313237@resour
 ce.calendar.google.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=Xinli shang;X-NUM-GUESTS=0:mailto:sha...@uber.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=matthew.m.tur...@outlook.com;X-NUM-GUESTS=0:mailto:matthew.m.turner
 @outlook.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=gabor.szadovs...@cloudera.com;X-NUM-GUESTS=0:mailto:gabor.szadovszk
 y...@cloudera.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=TENTATIVE;RSVP=TRU
 E;CN=gg5...@gmail.com;X-NUM-GUESTS=0:mailto:gg5...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=gwali...@gmail.com;X-NUM-GUESTS=0:mailto:gwali...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=TENTATIVE;RSVP=TRU
 E;CN=emkornfi...@gmail.com;X-NUM-GUESTS=0:mailto:emkornfi...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=arl...@pitt.edu;X-NUM-GUESTS=0:mailto:arl...@pitt.edu
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=altekruseja...@gmail.com;X-NUM-GUESTS=0:mailto:altekrusejason@gmail
 .com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=Ryan Blue;X-NUM-GUESTS=0:mailto:rb...@netflix.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=chao.apa...@gmail.com;X-NUM-GUESTS=0:mailto:chao.apa...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=yumw...@ebay.com;X-NUM-GUESTS=0:mailto:yumw...@ebay.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=TENTATIVE;RSVP=TRU
 E;CN=fo...@driesprongen.nl;X-NUM-GUESTS=0:mailto:fo...@driesprongen.nl
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=TENTATIVE;RSVP=TRU
 E;CN=aniskodedoss...@etsy.com;X-NUM-GUESTS=0:mailto:aniskodedoss...@etsy.co
 m
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=i...@isolineltd.com;X-NUM-GUESTS=0:mailto:i...@isolineltd.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=brian.bow...@sas.com;X-NUM-GUESTS=0:mailto:brian.bow...@sas.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=jiashenzz...@gmail.com;X-NUM-GUESTS=0:mailto:jiashenzz...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=vinoo.gan...@gmail.com;X-NUM-GUESTS=0:mailto:vinoo.gan...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=DECLINED;RSVP=TRUE
 ;CN=hadrien.k...@sonat.no;X-NUM-GUESTS=0:mailto:hadrien.k...@sonat.no
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=py...@pinterest.com;X-NUM-GUESTS=0:mailto:py...@pinterest.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=DECLINED;RSVP=TRUE
 ;CN=jorgecarlei...@gmail.com;X-NUM-GUESTS=0:mailto:jorgecarlei...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=huaxin.ga...@gmail.com;X-NUM-GUESTS=0:mailto:huaxin.ga...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=dev@parquet.apache.org;X-NUM-GUESTS=0:mailto:dev@parquet.apache.org
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=robe...@palantir.com;X-NUM-GUESTS=0:mailto:robe...@palantir.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=Revin Chalil;X-NUM-GUESTS=0:mailto:revin.cha...@microsoft.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN="Hailu, Andreas";X-NUM-GUESTS=0:mailto:andreas.ha...@ny.email.gs.co
 m
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=DECLINED;RSVP=TRUE
 ;CN=theo...@amazon.com;X-NUM-GUESTS=0:mailto:theo...@amazon.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN="Gorecki, Michal&

Updated invitation: Parquet Sync @ Monthly from 9am to 10am on the fourth Tuesday (PDT) (dev@parquet.apache.org)

2024-05-23 Thread Xinli shang
BEGIN:VCALENDAR
PRODID:-//Google Inc//Google Calendar 70.9054//EN
VERSION:2.0
CALSCALE:GREGORIAN
METHOD:REQUEST
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
X-LIC-LOCATION:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:19700308T02
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:19701101T02
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTART;TZID=America/Los_Angeles:20240528T09
DTEND;TZID=America/Los_Angeles:20240528T10
RRULE:FREQ=MONTHLY;BYDAY=4TU
DTSTAMP:20240523T144817Z
ORGANIZER;CN=Xinli shang:mailto:sha...@uber.com
UID:e0nn7qc9q58dv974d5gmrql...@google.com
ATTENDEE;CUTYPE=RESOURCE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TR
 UE;CN=SEA | 1191 2nd Ave-8th-Blakely (7) [Zoom];X-NUM-GUESTS=0:mailto:uber.
 com_53454131313931326e6441766530387468426c616b656c793756432d343836313237@re
 source.calendar.google.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=Xinli shang;X-NUM-GUESTS=0:mailto:sha...@uber.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=matthew.m.tur...@outlook.com;X-NUM-GUESTS=0:mailto:matthew.m.turner
 @outlook.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=gabor.szadovs...@cloudera.com;X-NUM-GUESTS=0:mailto:gabor.szadovszk
 y...@cloudera.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=gg5...@gmail.com;X-NUM-GUESTS=0:mailto:gg5...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=gwali...@gmail.com;X-NUM-GUESTS=0:mailto:gwali...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=emkornfi...@gmail.com;X-NUM-GUESTS=0:mailto:emkornfi...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=arl...@pitt.edu;X-NUM-GUESTS=0:mailto:arl...@pitt.edu
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=altekruseja...@gmail.com;X-NUM-GUESTS=0:mailto:altekrusejason@gmail
 .com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=Ryan Blue;X-NUM-GUESTS=0:mailto:rb...@netflix.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=chao.apa...@gmail.com;X-NUM-GUESTS=0:mailto:chao.apa...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=yumw...@ebay.com;X-NUM-GUESTS=0:mailto:yumw...@ebay.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=fo...@driesprongen.nl;X-NUM-GUESTS=0:mailto:fo...@driesprongen.nl
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=aniskodedoss...@etsy.com;X-NUM-GUESTS=0:mailto:aniskodedossett@etsy
 .com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=i...@isolineltd.com;X-NUM-GUESTS=0:mailto:i...@isolineltd.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=brian.bow...@sas.com;X-NUM-GUESTS=0:mailto:brian.bow...@sas.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=jiashenzz...@gmail.com;X-NUM-GUESTS=0:mailto:jiashenzz...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=vinoo.gan...@gmail.com;X-NUM-GUESTS=0:mailto:vinoo.gan...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=hadrien.k...@sonat.no;X-NUM-GUESTS=0:mailto:hadrien.k...@sonat.no
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=py...@pinterest.com;X-NUM-GUESTS=0:mailto:py...@pinterest.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=jorgecarlei...@gmail.com;X-NUM-GUESTS=0:mailto:jorgecarleitao@gmail
 .com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=huaxin.ga...@gmail.com;X-NUM-GUESTS=0:mailto:huaxin.ga...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=robe...@palantir.com;X-NUM-GUESTS=0:mailto:robe...@palantir.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=theo...@amazon.com;X-NUM-GUESTS=0:mailto:theo...@amazon.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=shengxuan@bytedance.com;X-NUM-GUESTS=0:mailto:shengxuan.liu@byt
 edance.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=collimarc...@gmail.com;X-NUM-GUESTS=0:mailto:collimarc...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=yi.he.ust...@gmail.com;X-NUM-GUESTS=0:mailto:yi.he.ust...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=v

Canceled event: Parquet Sync @ Tue May 28, 2024 7am - 8am (PDT) (dev@parquet.apache.org)

2024-05-23 Thread Xinli shang
BEGIN:VCALENDAR
PRODID:-//Google Inc//Google Calendar 70.9054//EN
VERSION:2.0
CALSCALE:GREGORIAN
METHOD:CANCEL
BEGIN:VEVENT
DTSTART:20240528T14Z
DTEND:20240528T15Z
DTSTAMP:20240523T144819Z
ORGANIZER;CN=Xinli shang:mailto:sha...@uber.com
UID:6vgu231jai324kjt1041div...@google.com
ATTENDEE;CUTYPE=RESOURCE;ROLE=REQ-PARTICIPANT;PARTSTAT=DECLINED;CN=SEA | 11
 91 2nd Ave-8th-Blakely (7) [Zoom];X-NUM-GUESTS=0:mailto:uber.com_5345413131
 3931326e6441766530387468426c616b656c793756432d343836313237@resource.calenda
 r.google.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;CN=Xinli 
 shang;X-NUM-GUESTS=0:mailto:sha...@uber.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;CN=ma
 tthew.m.tur...@outlook.com;X-NUM-GUESTS=0:mailto:matthew.m.turner@outlook.c
 om
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;CN=ga
 bor.szadovs...@cloudera.com;X-NUM-GUESTS=0:mailto:gabor.szadovszky@cloudera
 .com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=TENTATIVE;CN=gg507
 0...@gmail.com;X-NUM-GUESTS=0:mailto:gg5...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=DECLINED;CN=gwalid
 9...@gmail.com;X-NUM-GUESTS=0:mailto:gwali...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=TENTATIVE;CN=emkor
 nfi...@gmail.com;X-NUM-GUESTS=0:mailto:emkornfi...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;CN=ar
 l...@pitt.edu;X-NUM-GUESTS=0:mailto:arl...@pitt.edu
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;CN=al
 tekruseja...@gmail.com;X-NUM-GUESTS=0:mailto:altekruseja...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;CN=Ry
 an Blue;X-NUM-GUESTS=0:mailto:rb...@netflix.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;CN=ch
 ao.apa...@gmail.com;X-NUM-GUESTS=0:mailto:chao.apa...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;CN=yumwan
 g...@ebay.com;X-NUM-GUESTS=0:mailto:yumw...@ebay.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;CN=fokko@
 driesprongen.nl;X-NUM-GUESTS=0:mailto:fo...@driesprongen.nl
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=TENTATIVE;CN=anisk
 odedoss...@etsy.com;X-NUM-GUESTS=0:mailto:aniskodedoss...@etsy.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=DECLINED;CN=ivan@i
 solineltd.com;X-NUM-GUESTS=0:mailto:i...@isolineltd.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;CN=br
 ian.bow...@sas.com;X-NUM-GUESTS=0:mailto:brian.bow...@sas.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;CN=jiashe
 nzz...@gmail.com;X-NUM-GUESTS=0:mailto:jiashenzz...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;CN=vinoo.
 gan...@gmail.com;X-NUM-GUESTS=0:mailto:vinoo.gan...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=DECLINED;CN=hadrie
 n.k...@sonat.no;X-NUM-GUESTS=0:mailto:hadrien.k...@sonat.no
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;CN=py
 a...@pinterest.com;X-NUM-GUESTS=0:mailto:py...@pinterest.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=DECLINED;CN=jorgec
 arlei...@gmail.com;X-NUM-GUESTS=0:mailto:jorgecarlei...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;CN=hu
 axin.ga...@gmail.com;X-NUM-GUESTS=0:mailto:huaxin.ga...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;CN=ro
 be...@palantir.com;X-NUM-GUESTS=0:mailto:robe...@palantir.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=DECLINED;CN=theosi
 b...@amazon.com;X-NUM-GUESTS=0:mailto:theo...@amazon.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;CN=shengx
 uan@bytedance.com;X-NUM-GUESTS=0:mailto:shengxuan@bytedance.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;CN=co
 llimarc...@gmail.com;X-NUM-GUESTS=0:mailto:collimarc...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;CN=yi.he.
 ust...@gmail.com;X-NUM-GUESTS=0:mailto:yi.he.ust...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;CN=vc@one
 house.ai;X-NUM-GUESTS=0:mailto:v...@onehouse.ai
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=TENTATIVE;CN=gabor
 .szadovs...@gmail.com;X-NUM-GUESTS=0:mailto:gabor.szadovs...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;CN=julien
 .le...@gmail.com;X-NUM-GUESTS=0:mailto:julien.le...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;CN=dev@pa
 rquet.apache.org;X-NUM-GUESTS=0:mailto:dev@parquet.apache.org
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;CN=Re
 vin Chalil;X-NUM-GUESTS=0:mailto:revin.cha...@microsoft.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;CN="H
 ailu, Andreas";X-NUM-GUESTS=0:mailto:andreas.ha...@ny.email.gs.com
ATTENDEE;CUTYPE=

Re: Interest in Parquet V3

2024-05-19 Thread Xinli shang
Sorry I am late to the party! It's great to see this discussion! Thank you
everyone for the many good points and thank you, Micah, for starting the
discussion and putting it together into a document, which is very helpful!
I agree with most of the points we discussed above, and we need to improve
Parquet and sometimes even speed up to catch up with industry changes.

With that said, we need people to work on it, as Julien mentioned. The
document
<https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit>
that Micah created covers pretty much everything we discussed here. I
encourage all of us to contribute by raising questions, providing
suggestions, adding missing functionality, etc. Once we reach a consensus
on each topic, we can create different tracks and working streams to kick
off the implementations.

I believe continuously improving Parquet would benefit the industry more
than creating a new format, which could add friction. These improvement
ideas are exciting opportunities. If you, your team members, or friends
have time and interest, please encourage them to contribute.

Our Parquet community meeting is next week, on May 28, 2024. We can have
discussions there if you can join. Currently, it is scheduled for 7:00 am
PDT, but I can change it according to the majority's availability.

On Fri, May 17, 2024 at 3:58 PM Rok Mihevc  wrote:

> Hi all,
>
> I've discussed with my colleagues and we would dedicate two engineers for
> 4-6 months on tasks related to implementing the format changes. We're
> already active in design discussions and can help with C++, Rust and C#
> implementations. I thought it'd be good to state this explicitly FWIW.
>
> Our main areas of interest are efficient reads for tables with wide schemas
> and faster random rowgroup access [1].
>
> To work around the wide-schema issue we implemented an internal
> tool [2] that stores index information in a separate file, which allows
> reading only the necessary subset of metadata. We offer this
> approach for consideration as a possible solution to the wide-schema
> problem.
>
> [1] https://github.com/apache/arrow/issues/39676
> [2] https://github.com/G-Research/PalletJack
>
> Rok
>
> On Sun, May 12, 2024 at 12:59 AM Micah Kornfield 
> wrote:
>
> > Hi Parquet Dev,
> > I wanted to start a conversation within the community about working on a
> > new revision of Parquet.  For context there have been a bunch of new
> > formats [1][2][3] that show there is decent room for improvement across
> > data encodings and how metadata is organized.
> >
> > Specifically, in a new format revision I think we should be thinking
> > about the following areas for improvements:
> > 1.  More efficient encodings that allow for data skipping and SIMD
> > optimizations.
> > 2.  More efficient metadata handling for deserialization and projection
> > to address areas when metadata deserialization time is not trivial [4].
> > 3.  Possibly thinking about different encodings instead of
> > repetition/definition for repeated and nested fields
> > 4.  Support for optimizing semi-structured data (e.g. JSON or Variant
> > type) that can shred elements into individual columns (a recent thread
> > in Iceberg mentions doing this at the metadata level [5])
> >
> > I think the goals of V3 would be to provide existing API compatibility as
> > broadly as possible (possibly with some performance loss) and expose new
> > API surface areas where appropriate to make use of new elements.  New
> > encodings could be backported so they can be made use of without metadata
> > changes.  I think unfortunately that for points 2 and 3 we would want to
> > break file level compatibility.  More thought would be needed to consider
> > whether 4 could be backported effectively.
> >
> > This is a non-trivial amount of work to get good coverage across
> > implementations, so before putting together a more formal proposal it
> > would be nice to know:
> >
> > 1.  If there is an appetite in the general community to consider these
> > changes
> > 2.  If anybody from the community is interested in collaborating on
> > proposals/implementation in this area.
> >
> > Thanks,
> > Micah
> >
> > [1] https://github.com/maxi-k/btrblocks
> > [2] https://github.com/facebookincubator/nimble
> > [3] https://blog.lancedb.com/lance-v2/
> > [4] https://github.com/apache/arrow/issues/39676
> > [5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
> >
>


-- 
Xinli Shang


[ANNOUNCE] New Parquet PMC Member: Gang Wu

2024-05-11 Thread Xinli shang
Hi all,

As a Parquet committer, Gang Wu has remained very active and instructive in
the community. The Parquet community invited him to be a PMC member, and he
accepted. It's my pleasure to announce that Gang is now officially a PMC
member of Apache Parquet.

Congratulations, Gang!

Xinli Shang, on behalf of the Apache Parquet PMC


Re: [VOTE] Release Apache Parquet 1.14.0 RC1

2024-05-07 Thread Xinli shang
+1 (binding)

Verified the key

On Tue, May 7, 2024 at 12:14 AM Gidon Gershinsky  wrote:

> +1 (binding)
>
> - ran the tests
> - ran with the Iceberg encryption code
>
> Cheers, Gidon
>
>
> On Tue, May 7, 2024 at 4:28 AM Gang Wu  wrote:
>
> > Hi,
> >
> > It has been open for more than 72 hours already. We still need 2 more
> > binding votes. Considering that there was a weekend during the voting
> > hours, let's extend it. Thanks!
> >
> > Best,
> > Gang
> >
> > On Mon, May 6, 2024 at 4:07 PM Fokko Driesprong 
> wrote:
> >
> > > Good catch Gábor!
> > >
> > > I've created PRs to fix this for future releases:
> > >
> > >- https://github.com/apache/parquet-mr/pull/1347
> > >- https://github.com/apache/parquet-mr/pull/1348
> > >
> > > Kind regards,
> > > Fokko
> > >
> > > Op ma 6 mei 2024 om 08:50 schreef Gábor Szádovszky :
> > >
> > > > Thanks Fokko, Gang for working on this.
> > > > I have some findings:
> > > > * nit correction in the original mail: tag is
> apache-parquet-1.14.0-rc1
> > > > (not apache-parquet-1.4.0-rc1)
> > > > * The CHANGES.md should have been updated with the one fix you've
> > > mentioned
> > > > (PARQUET-2465)
> > > >
> > > > Since I've never used CHANGES.md to actually check a release's content, I
> > > > don't feel this issue is crucial enough to fail this vote. I would let the
> > > > other voters decide.
> > > > +1 (binding)
> > > >
> > > > Gang Wu  ezt írta (időpont: 2024. máj. 6., H,
> 3:33):
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > Verified signature, checksum and build.
> > > > >
> > > > > Thanks Fokko for doing this! Let me take care of the rest.
> > > > >
> > > > > Best,
> > > > > Gang
> > > > >
> > > > > On Mon, May 6, 2024 at 4:36 AM Fokko Driesprong 
> > > > wrote:
> > > > >
> > > > > > Hey everyone,
> > > > > >
> > > > > > +1 (non-binding)
> > > > > >
> > > > > > - Checked against Trino and the RC1 runs cleanly
> > > > > > <https://github.com/trinodb/trino/pull/21802>
> > > > > > - Checked against Iceberg and the tests passed locally. To let
> the
> > CI
> > > > > pass
> > > > > > we must upgrade Gradle, this is because Parquet ships with a new
> > > > Jackson
> > > > > > version that contains JDK21 code, but this is an issue on the
> > Iceberg
> > > > > side
> > > > > > <
> > > https://github.com/apache/iceberg/pull/10209#issuecomment-2094939429
> > > > >.
> > > > > >
> > > > > > Kind regards,
> > > > > > Fokko
> > > > > >
> > > > > >
> > > > > > Op vr 3 mei 2024 om 17:46 schreef Fokko Driesprong <
> > fo...@apache.org
> > > >:
> > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > Since Gang is enjoying a well-deserved vacation
> > > > > > > <
> > > > >
> > https://github.com/apache/parquet-mr/pull/1342#issuecomment-2092774404
> > > > > > >,
> > > > > > > I'm jumping in for this RC. I propose the following RC to be
> > > released
> > > > > as
> > > > > > > the official Apache Parquet 1.14.0 release.
> > > > > > >
> > > > > > > The commit ID is fe9179414906cc19b550d13d2819b4e16fddf8a1
> > > > > > > * This corresponds to the tag: apache-parquet-1.4.0-rc1
> > > > > > > *
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/tree/fe9179414906cc19b550d13d2819b4e16fddf8a1
> > > > > > >
> > > > > > > The release tarball, signature, and checksums are here:
> > > > > > > *
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.14.0-rc1/
> > > > > > >
> > > > > > > You can find the KEYS file here:
> > > > > > > * https://downloads.apache.org/parquet/KEYS
> > > > > > >
> > > > > > > Binary artifacts are staged in Nexus here:
> > > > > > > *
> > > > > >
> > > >
> > https://repository.apache.org/content/groups/staging/org/apache/parquet/
> > > > > > >
> > > > > > > This release includes important changes:
> > > > > > >
> > > > > > > *
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/blob/parquet-1.14.x/CHANGES.md#version-1140
> > > > > > >
> > > > > > > Since RC0 one commit has been added:
> > > > > > > https://github.com/apache/parquet-mr/pull/1342
> > > > > > >
> > > > > > > Please download, verify, and test.
> > > > > > >
> > > > > > > Please vote in the next 72 hours.
> > > > > > >
> > > > > > > [ ] +1 Release this as Apache Parquet 1.14.0
> > > > > > > [ ] +0
> > > > > > > [ ] -1 Do not release this because...
> > > > > > >
> > > > > > > Kind regards,
> > > > > > > Fokko
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


-- 
Xinli Shang
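
Several voters in this thread report verifying the checksum of the release tarball. The checksum part of that step can be sketched as follows. This is a minimal sketch, assuming the release directory publishes a `.sha512` file whose first whitespace-separated token is the full hex digest (the `sha512sum` layout); the function names and file names are hypothetical, and signature verification against the KEYS file is not covered here.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(tarball: Path, checksum_file: Path) -> bool:
    """Compare the tarball digest with the published one.

    Assumption: the first whitespace-separated token of the checksum
    file is the complete hex digest. Other layouts need other parsing.
    """
    tokens = checksum_file.read_text().split()
    if not tokens:
        return False  # an empty checksum file can never match
    return sha512_of(tarball) == tokens[0].lower()
```

A caller would pass the downloaded tarball and its `.sha512` companion, and treat a `False` result as a reason to vote -1 or to re-download.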


Re: [VOTE] Release Apache Parquet 1.14.0 RC0

2024-04-30 Thread Xinli shang
+1 (binding)

Validated the KEY

On Tue, Apr 30, 2024 at 1:18 AM Gang Wu  wrote:

> Thank you!
>
> On Tue, Apr 30, 2024 at 4:16 PM Gábor Szádovszky  wrote:
>
> > By importing the KEYS file under [1] the check of the .asc file passed!
> > So, I went forward and updated the KEYS file under [2] with your new one.
> >
> > Giving +1 (binding) for the release
> >
> > Cheers,
> > Gabor
> >
> > Gang Wu  ezt írta (időpont: 2024. ápr. 30., K, 9:58):
> >
> > > I have appended my new key to [1]. Please verify again. However, I don't
> > > have the permission to update [2]. That may not be an issue, as I don't
> > > have the permission to upload the final tarball to the svn release repo.
> > >
> > > [1] https://dist.apache.org/repos/dist/dev/parquet/KEYS
> > > [2] https://dist.apache.org/repos/dist/release/parquet/KEYS
> > >
> > > On Tue, Apr 30, 2024 at 3:45 PM Gábor Szádovszky 
> > wrote:
> > >
> > > > Sure, please add your new public key to the referenced KEYS file then
> > we
> > > > should be good. (The previous one would still be required to check
> the
> > > > previous releases, so do not remove it.)
> > > >
> > > > Gang Wu  ezt írta (időpont: 2024. ápr. 30., K,
> > 9:27):
> > > >
> > > > > Hi Gabor,
> > > > >
> > > > > Thanks for raising the issue! My original key was accidentally deleted
> > > > > while running a shell script and cannot be recovered. I have created
> > > > > a new key and used it to sign the tarball. That's why it does not exist
> > > > > in the KEYS file. I have sent the new key to some key servers already.
> > > > > Does it make sense to add my new key to the KEYS file instead?
> > > > >
> > > > > Best,
> > > > > Gang
> > > > >
> > > > > On Tue, Apr 30, 2024 at 3:11 PM Gábor Szádovszky  >
> > > > wrote:
> > > > >
> > > > > > Hi Gang,
> > > > > >
> > > > > > Thank you for taking care of the release!
> > > > > >
> > > > > > Unfortunately, the .asc check fails for me even after importing
> the
> > > > KEYS
> > > > > > file. Could you double check if you signed it with the correct
> key?
> > > > > > No other issues were discovered, so no RC1 is required for now if
> > you
> > > > can
> > > > > > change the .asc file for the current tarball.
> > > > > >
> > > > > > Cheers,
> > > > > > Gabor
> > > > > >
> > > > > > Gang Wu  ezt írta (időpont: 2024. ápr. 30., K,
> > > > 7:45):
> > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > I propose the following RC to be released as the official
> Apache
> > > > > Parquet
> > > > > > > 1.14.0 release.
> > > > > > >
> > > > > > > The commit id is af0740229929337e1395fd24253a4ed787df2db3
> > > > > > > * This corresponds to the tag: apache-parquet-1.14.0-rc0
> > > > > > > *
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/tree/af0740229929337e1395fd24253a4ed787df2db3
> > > > > > >
> > > > > > > The release tarball, signature, and checksums are here:
> > > > > > > *
> > > > > >
> > > >
> > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.14.0-rc0
> > > > > > >
> > > > > > > You can find the KEYS file here:
> > > > > > > * https://downloads.apache.org/parquet/KEYS
> > > > > > >
> > > > > > > Binary artifacts are staged in Nexus here:
> > > > > > > *
> > > > > >
> > > >
> > https://repository.apache.org/content/groups/staging/org/apache/parquet/
> > > > > > >
> > > > > > > This release includes important changes:
> > > > > > > *
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/blob/parquet-1.14.x/CHANGES.md#version-1140
> > > > > > >
> > > > > > > Please download, verify, and test.
> > > > > > >
> > > > > > > Please vote in the next 72 hours.
> > > > > > >
> > > > > > > [ ] +1 Release this as Apache Parquet 1.14.0
> > > > > > > [ ] +0
> > > > > > > [ ] -1 Do not release this because...
> > > > > > >
> > > > > > > Best,
> > > > > > > Gang
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


-- 
Xinli Shang


Parquet Sync meeting notes - April 23 2024

2024-04-23 Thread Xinli shang
4/23/2024

Attendees: Fokko Driesprong, Vinoo Ganesh, Xinli Shang


Parquet-mr 1.14 release:

1. Fokko and Gang will discuss starting the release soon

2. There are a few breaking changes; we need to ensure backward
compatibility and do proper testing

3. Vinoo will shadow and do some testing

4. Ideas on the release of Parquet 2.0. We start collecting thoughts and
welcome everybody to share opinions.
-- 
Xinli Shang


Parquet sync meeting notes - March 26 2024

2024-03-26 Thread Xinli shang
Hi all,

These are the meeting notes of today's sync meeting.

3/26/2024

Attendees: Gábor Szádovszky, Vinoo Ganesh, Xinli Shang

   1. Parquet-mr 1.14 release - targeting mid-2024
   2. Vulnerability findings - done.
   3. Removal of Java and Scala files from the format repo - start an email
   to discuss with the community

-- 
Xinli Shang


Re: [VOTE] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64

2024-03-13 Thread Xinli shang
+1 (binding)

Sorry for being late and thanks for working on it!

Xinli Shang


On Fri, Mar 8, 2024 at 8:28 AM Micah Kornfield 
wrote:

> +1 (non-binding)
>
> On Thursday, March 7, 2024, Gang Wu  wrote:
>
> > +1 (non-binding)
> >
> > Best,
> > Gang
> >
> > On Fri, Mar 8, 2024 at 5:05 AM Edward Seidl  wrote:
> >
> > > +1 (non-binding)
> > >
> > > Thanks for your work on this!
> > > Ed
> > > 
> > > From: Antoine Pitrou 
> > > Sent: Thursday, March 7, 2024 5:15 AM
> > > To: d...@parquet.incubator.apache.org  >
> > > Subject: [VOTE] Expand BYTE_STREAM_SPLIT to support
> FIXED_LEN_BYTE_ARRAY,
> > > INT32 and INT64
> > >
> > >
> > > Hello,
> > >
> > > As discussed previously on this ML [1], I am proposing to expand
> > > the types supported by the BYTE_STREAM_SPLIT encoding. The currently
> > > supported types are FLOAT and DOUBLE. The proposal expands the
> > > supported types to INT32, INT64 and FIXED_LEN_BYTE_ARRAY.
> > >
> > > The format addition is tracked on JIRA where some measurements on
> > > sample data are also published and discussed [2].
> > >
> > > (please note that the original ML thread only discussed expanding
> > > to FIXED_LEN_BYTE_ARRAY; discussion on the JIRA issue led to the
> > > conclusion that it would also be beneficial to cover INT32 and INT64)
> > >
> > > The format additions are submitted as a PR in [3].
> > > A data file for integration testing is submitted in [4].
> > > An implementation for Parquet C++ is submitted in [5].
> > > An implementation for parquet-mr is submitted in [6].
> > >
> > > This vote will be open for at least 1 week.
> > >
> > > +1: Accept the format additions
> > > +0: ...
> > > -1: Reject the format additions because ...
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > [1] https://lists.apache.org/thread/5on7rnc141jnw2cdxtsfgm5xhhdmsb4q
> > > [2] https://issues.apache.org/jira/browse/PARQUET-2414
> > > [3] https://github.com/apache/parquet-format/pull/229
> > > [4] https://github.com/apache/parquet-testing/pull/46
> > > [5] https://github.com/apache/arrow/pull/40094
> > > [6] https://github.com/apache/parquet-mr/pull/1291
> > >
> > >
> > >
> > >
> >
>
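
For readers unfamiliar with the encoding under vote, the byte-scattering idea can be sketched for the newly proposed integer types. This is a minimal sketch, assuming values are laid out as little-endian fixed-width integers before splitting; the function names are hypothetical and the code does not come from any of the linked implementations.

```python
from __future__ import annotations

import struct

# struct format codes for little-endian signed INT32 and INT64
_FORMATS = {4: "<i", 8: "<q"}


def byte_stream_split_encode(values: list[int], width: int = 4) -> bytes:
    """Scatter byte k of every value into stream k, then concatenate the streams."""
    fmt = _FORMATS[width]
    raw = [struct.pack(fmt, value) for value in values]
    return b"".join(bytes(item[k] for item in raw) for k in range(width))


def byte_stream_split_decode(data: bytes, width: int = 4) -> list[int]:
    """Inverse of the encoder: gather byte i of each stream back into value i."""
    fmt = _FORMATS[width]
    count = len(data) // width
    streams = [data[k * count:(k + 1) * count] for k in range(width)]
    return [
        struct.unpack(fmt, bytes(stream[i] for stream in streams))[0]
        for i in range(count)
    ]
```

With small integers, the streams that hold the high-order bytes become long runs of identical bytes, which is the kind of input a general-purpose compressor tends to handle well. The measurements referenced in [2] are the place to check how much this helps on real data.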


Parquet community sync meeting - Feb 2024

2024-02-27 Thread Xinli shang
Hi all,

These are notes for today's sync meeting!

2/27/2024

Attendees: Fokko Driesprong, Vinoo Ganesh, Xinli Shang

   1. Parquet-mr 1.14 release - targeting mid-2024
   2. Vulnerability findings - the code isn’t used anymore. We will remove
   it - AI: Vinoo.
   3. Some earlier discussion about Parquet being consumed by Kafka


-- 
Xinli Shang


Updated invitation: Parquet Sync @ Monthly from 7am to 8am on the fourth Tuesday (PST) (dev@parquet.apache.org)

2024-02-20 Thread Xinli shang
[Calendar attachment (iCalendar, METHOD:REQUEST), truncated in the archive; attendee list omitted]
Event: Parquet Sync, 7am to 8am (America/Los_Angeles), starting Tue Feb 27, 2024
Recurrence: monthly, on the fourth Tuesday
Organizer: Xinli shang

Updated invitation: Parquet Sync @ Monthly from 7am to 8am on the fourth Tuesday from Tue Jul 25, 2023 to Mon Feb 26 (PDT) (dev@parquet.apache.org)

2024-02-20 Thread Xinli shang
[Calendar attachment (iCalendar, METHOD:REQUEST), truncated in the archive; attendee list omitted]
Event: Parquet Sync, 7am to 8am (America/Los_Angeles), starting Tue Jul 25, 2023
Recurrence: monthly, on the fourth Tuesday, until 2024-02-27T07:59:59Z
Organizer: Xinli shang

Re: [WIP][Proposal] PARQUET-2430: Add parquet joiner

2024-02-18 Thread Xinli shang
Hi Max,

This is a very interesting feature with a great idea!

Xinli

On Sun, Feb 18, 2024 at 1:19 PM Max Konstantinov <
konstantinov.ma...@gmail.com> wrote:

> Hi Gang,
>
>
> I actually started working on this feature by trying to extend
> ParquetRewriter with that new capability, but I quickly ran into issues:
> - Many state holder variables in ParquetRewriter will have to be duplicated
> for the left / right side of the join
> - Some methods (e.g. nullifyColumn) are very close but not the same for the
> joiner and will require more branching in the existing codebase of
> ParquetRewriter
> - Tests for the join part will have to pay special attention to the right
> side of the join, as it is new in the joiner, so the test class will grow
> quite a bit
> To summarize, in my opinion: adding ParquetJoiner to ParquetRewriter, while
> possible, will potentially make the codebase too complex and hard to reason
> about.
>
> I thought about it; maybe we can utilize the Factory Method & Builder patterns
> for this? For example we can:
> - Unify the options (RewriteOptions/JoinOptions) into a single class; if one
> of the final implementations does not support a certain feature, it should
> throw an exception during construction
> - Use the Factory pattern and pick the actual final implementation of
> the class based on the provided options
> - Have both ParquetRewriter & ParquetJoiner implement a new interface that
> has public processBlocks() & close() methods
> - Use the Builder pattern and make all methods, including constructors,
> private except those that need to be exposed to users
> By using this approach we can simplify internal implementation by dividing
> it into separate dedicated smaller sub-modules while still providing a
> single feature rich external API. Let me know what you think.
>
> Also I might be wrong but I’ve noticed a few potential issues with
> ParquetRewriter:
> - I am not sure whether this is a bug, but aren't we supposed to call
> consume() on the reader? Here is the place in ParquetJoiner
> <
> https://github.com/MaxNevermind/parquet-mr/blob/7ae35059ae9801e0ffb7f9a0dc825621dbc37ecc/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/join/ParquetJoiner.java#L345
> >
> and here is the same place in ParquetRewriter
> <
> https://github.com/apache/parquet-mr/blob/b2080aa5735a97e5896260e82bde9b2b8455432a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L771
> >.
> - newSchema() & extractField() methods seem to have a small issue with
> complex nested schemas, here is the line that is different in
> ParquetRewriter
> <
> https://github.com/apache/parquet-mr/blob/b2080aa5735a97e5896260e82bde9b2b8455432a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L815
> >
> and here is ParquetJoiner
> <
> https://github.com/MaxNevermind/parquet-mr/blob/7ae35059ae9801e0ffb7f9a0dc825621dbc37ecc/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/join/ParquetJoiner.java#L390
> >.
> ParquetRewriter’s version originally failed in my tests (it originally had
> a different schema). Right now it works; I need to go back and check the
> commit history, and I will try to reproduce the failure later.
>
> By the way, I want to start covering ParquetJoiner with extensive tests next
> week.
> Also, we built a POC for one of our projects based on the current version of
> ParquetJoiner, and it showed up to a 10x improvement in performance in
> comparison with a default join implementation using Spark.
>
>
> Max.
>
> On Sat, Feb 17, 2024 at 10:37 PM Gang Wu  wrote:
>
> > Hi Max,
> >
> > Thanks for proposing the joiner! I simply took a glimpse of the PR and
> > it looks promising to me. My general question is on the possibility of
> > consolidating the work with ParquetRewriter, which shares a lot of
> > common rewriting logic.
> >
> > Best,
> > Gang
> >
> > On Tue, Feb 13, 2024 at 9:27 AM Max Konstantinov <
> > konstantinov.ma...@gmail.com> wrote:
> >
> > > Hi Parquet dev team!
> > >
> > >
> > > I wanted to ask your opinion on the proposal I came up with.
> > > PR: https://github.com/apache/parquet-mr/pull/1273
> > > JIRA: https://issues.apache.org/jira/browse/PARQUET-2430
> > > The PR description and the JIRA ticket contain all the details, please
> > > check them out. The feature is not yet ready to merge; it is just a
> > > proposal for now. I wanted to ask the PARQUET community whether you see
> > > any obstacles to adding it. We find it very useful and plan to use it, and
> > > if the PARQUET community finds no issues with it I can add tests, javadocs
> > > and polish it so we can add this new feature to PARQUET.
> > >
> > >
> > > Max.
> > >
> >
>


-- 
Xinli Shang


Re: Fast nullify of columns?

2024-01-03 Thread Xinli shang
Hi Paul,

Sorry for the late reply! How many columns in total do you have in that
file? The rewriter generally works better if you nullify only a small
percentage of the columns while the remaining columns are unchanged. It can
copy those unchanged columns as byte buffers instead of rewriting them
field by field.

The reason we still use ColumnWriter.writeNull is to let the rewriter keep
parity with the original writer. ColumnWriter.writeNull goes through the
existing code path (generating statistics, etc.) rather than taking a lot of
shortcuts, which keeps the writing safe.

Xinli



On Thu, Dec 7, 2023 at 6:06 AM Paul Rooney  wrote:

> Thanks Gang,
>
> On Wed, 6 Dec 2023 at 05:15, Gang Wu  wrote:
>
> > Hi Paul,
> >
> > I agree there are better ways to do this, e.g. we can prepare encoded
> > definition levels and repetition levels (if they exist) and directly
> write
> > the
> > page. However, we need to take care of other rewrite configurations
> > including data page version (v1 or v2), compression, page statistics and
> > page index. By writing null records, the writer handles all the above
> > details
> > internally.
> >
> > BTW, IMO writing `empty` pages may break the specs and fail the reader.
> >
> > Best,
> > Gang
> >
> > On Mon, Dec 4, 2023 at 5:30 PM Paul Rooney  wrote:
> >
> > > Could anyone suggest a faster way to Nullify columns in a parquet file?
> > >
> > > My dataset consists of a lot of parquet files, each of them having roughly
> > > 12 million rows and 350 columns, split into 2 row groups of 10 million and
> > > 2 million rows.
> > >
> > > For each file I need to nullify 150 columns and rewrite the files.
> > >
> > > I tried using 'nullifyColumn' in
> > >
> > >
> >
> 'parquet-mr/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java'
> > > But I find it slow as for each column, it iterates on the number of
> rows
> > > and calls ColumnWriter.writeNull
> > >
> > > Would anyone have suggestions on how to avoid all the iteration?
> > >
> > >
> >
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L743C13
> > > ' for (int i = 0; i < totalChunkValues; i++) {...'
> > >
> > > Could a single call be made per column + row-group to write enough
> > > information to:
> > > A) keep the column present (in schema and as a Column chunk)
> > > B) set Column rowCount and num_nulls= totalChunkValues
> > >
> > >
> > > e.g. perhaps write a single 'empty' page which has:
> > > 1) valueCount and rowCount = totalChunkValues
> > > 2) Statistics.num_nulls set to totalChunkValues
> > >
> > > Thanks, Paul
> > >
> >
>


-- 
Xinli Shang
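
Gang's remark about preparing encoded definition levels directly can be illustrated with a small sketch. It assumes an optional, non-nested column (maximum definition level 1, so a bit width of 1) in which every value is null, so every definition level is 0, and it builds a single RLE run in the layout used by Parquet's RLE/bit-packing hybrid: a varint header holding the run length shifted left by one, followed by the repeated value. The function names are hypothetical. As noted in the thread, a real page also needs its header, statistics, page index and compression handling, none of which is covered here.

```python
from __future__ import annotations


def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def rle_run(value: int, count: int, bit_width: int) -> bytes:
    """One RLE run: varint(count << 1), then the value in ceil(bit_width / 8) bytes."""
    value_bytes = (bit_width + 7) // 8
    return encode_uleb128(count << 1) + value.to_bytes(value_bytes, "little")


def decode_rle_run(data: bytes, bit_width: int) -> tuple[int, int]:
    """Decode a single RLE run and return (value, count)."""
    header, shift, pos = 0, 0, 0
    while True:
        byte = data[pos]
        pos += 1
        header |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    if header & 1:
        raise ValueError("bit-packed run, not an RLE run")
    value_bytes = (bit_width + 7) // 8
    value = int.from_bytes(data[pos:pos + value_bytes], "little")
    return value, header >> 1
```

For the 10-million-row row group described in the question, `rle_run(0, 10_000_000, 1)` yields 5 bytes, which shows why the per-value loop is not inherently required by the format; the safety concerns raised in the replies are about everything around those bytes.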


Canceled event: Parquet Sync @ Tue Dec 26, 2023 7am - 8am (PST) (dev@parquet.apache.org)

2023-12-13 Thread Xinli shang
The Parquet Sync scheduled for Tue Dec 26, 2023, 7am - 8am (PST) has been canceled.

Re: [VOTE][RESULT][FORMAT] Add repetition, definition and variable length size metadata statistics

2023-11-17 Thread Xinli shang
Thanks Micah for leading the effort!

On Tue, Nov 14, 2023 at 12:02 PM Micah Kornfield 
wrote:

> The vote passes with:
>
> 3 +1 votes (binding)
> 4 +1 votes (non-binding)
>
> and no -1 votes
>
> Thanks everyone for your input.
>
> On Mon, Nov 13, 2023 at 11:20 PM Gidon Gershinsky 
> wrote:
>
> > +1 (binding)
> >
> > Cheers, Gidon
> >
> >
> > On Tue, Nov 14, 2023 at 5:31 AM Xinli shang 
> > wrote:
> >
> > > > Yeah, we need one more PMC member to vote. If you can help, I'd appreciate it.
> > >
> > > On Mon, Nov 13, 2023 at 6:23 AM Fokko Driesprong 
> > wrote:
> > >
> > > > +1 non-binding
> > > >
> > > > Great work Micah, I went through the PR and it looks very promising.
> > > >
> > > > Kind regards,
> > > > Fokko Driesprong
> > > >
> > > >  (Also pinged two more PMC members, hopefully they have time to jump
> in
> > > > here)
> > > >
> > > > > On Fri, Nov 10, 2023 at 19:40, Micah Kornfield <emkornfi...@gmail.com> wrote:
> > > >
> > > > > Hello, we need one more PMC member to approve this before the
> result
> > > can
> > > > > become official.  Would someone mind chiming in?
> > > > >
> > > > > Thanks,
> > > > > Micah
> > > > >
> > > > > On Wed, Nov 8, 2023 at 8:55 AM Gábor Szádovszky 
> > > > wrote:
> > > > >
> > > > > > +1 (binding)
> > > > > >
> > > > > > Cheers,
> > > > > > Gabor
> > > > > >
> > > > > > On 2023/11/07 02:46:37 Xinli shang wrote:
> > > > > > > +1 (binding)
> > > > > > >
> > > > > > > On Mon, Nov 6, 2023 at 4:56 PM Gang Wu 
> wrote:
> > > > > > >
> > > > > > > > +1 (non-binding)
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Gang
> > > > > > > >
> > > > > > > > On Tue, Nov 7, 2023 at 3:57 AM Ed Seidl 
> > > wrote:
> > > > > > > >
> > > > > > > > > +1 (non-binding)
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > Ed
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Xinli Shang
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Xinli Shang
> > >
> >
>


-- 
Xinli Shang


Re: [VOTE] Release Apache Parquet Format 2.10.0 RC0

2023-11-17 Thread Xinli shang
+1 (binding)

Verified the signature. Thanks Gang for leading the effort!

On Thu, Nov 16, 2023 at 9:41 PM wish maple  wrote:

> +1 (non-binding)
>
> Thanks Gang for the release!
>
> Best,
> Xuwei Fu
>
> On Thu, Nov 16, 2023 at 14:07, Gang Wu wrote:
>
> > Hi everyone,
> >
> > I propose the following RC to be released as the official Apache Parquet
> > Format 2.10.0 release.
> >
> > The commit id is b9c4fa81c3be13dc98760c92b037fa4dd465cef8
> > * This corresponds to the tag: apache-parquet-format-2.10.0-rc0
> > *
> >
> >
> https://github.com/apache/parquet-format/tree/b9c4fa81c3be13dc98760c92b037fa4dd465cef8
> >
> > The release tarball, signature, and checksums are here:
> > *
> >
> >
> https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-format-2.10.0-rc0
> >
> > You can find the KEYS file here:
> > * https://downloads.apache.org/parquet/KEYS
> >
> > Binary artifacts are staged in Nexus here:
> > *
> >
> >
> https://repository.apache.org/content/groups/staging/org/apache/parquet/parquet-format/2.10.0/
> >
> > This release includes important changes listed below:
> > *
> >
> >
> https://github.com/apache/parquet-format/blob/master/CHANGES.md#version-2100
> > * https://issues.apache.org/jira/projects/PARQUET/versions/12350092
> >
> > Please download, verify, and test.
> >
> > This vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Parquet Format 2.10.0
> > [ ] +0
> > [ ] -1 Do not release this because...
> >
> > Thanks,
> > Gang
> >
>


-- 
Xinli Shang


Re: [VOTE][FORMAT] Add repetition, definition and variable length size metadata statistics

2023-11-13 Thread Xinli shang
Yeah, we need one more PMC member to vote. If you can help, I'd appreciate it.

On Mon, Nov 13, 2023 at 6:23 AM Fokko Driesprong  wrote:

> +1 non-binding
>
> Great work Micah, I went through the PR and it looks very promising.
>
> Kind regards,
> Fokko Driesprong
>
>  (Also pinged two more PMC members, hopefully they have time to jump in
> here)
>
> On Fri, Nov 10, 2023 at 19:40, Micah Kornfield wrote:
>
> > Hello, we need one more PMC member to approve this before the result can
> > become official.  Would someone mind chiming in?
> >
> > Thanks,
> > Micah
> >
> > On Wed, Nov 8, 2023 at 8:55 AM Gábor Szádovszky 
> wrote:
> >
> > > +1 (binding)
> > >
> > > Cheers,
> > > Gabor
> > >
> > > On 2023/11/07 02:46:37 Xinli shang wrote:
> > > > +1 (binding)
> > > >
> > > > On Mon, Nov 6, 2023 at 4:56 PM Gang Wu  wrote:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > Best,
> > > > > Gang
> > > > >
> > > > > On Tue, Nov 7, 2023 at 3:57 AM Ed Seidl  wrote:
> > > > >
> > > > > > +1 (non-binding)
> > > > > >
> > > > > > Thanks!
> > > > > > Ed
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Xinli Shang
> > > >
> > >
> >
>


-- 
Xinli Shang


Re: [VOTE][FORMAT] Add repetition, definition and variable length size metadata statistics

2023-11-06 Thread Xinli shang
+1 (binding)

On Mon, Nov 6, 2023 at 4:56 PM Gang Wu  wrote:

> +1 (non-binding)
>
> Best,
> Gang
>
> On Tue, Nov 7, 2023 at 3:57 AM Ed Seidl  wrote:
>
> > +1 (non-binding)
> >
> > Thanks!
> > Ed
> >
>


-- 
Xinli Shang


Re: [VOTE][Format] Add Float16 type to specification

2023-10-05 Thread Xinli shang
+1

On Thu, Oct 5, 2023 at 1:32 PM Antoine Pitrou  wrote:

>
> Hello,
>
> +1 from me (non-binding).
>
> Regards
>
> Antoine.
>
>
> On Wed, 4 Oct 2023 16:14:00 -0400
> Ben Harkins 
> wrote:
>
> > Hi everyone,
> >
> > I would like to propose adding a half-precision floating point type to
> > the Parquet format specification, in accordance with the active
> > proposal here:
> >
> >
> >- https://github.com/apache/parquet-format/pull/184
> >
> > To summarize, the current proposal would introduce a Float16 logical
> > type, represented by a little-endian 2-byte FixedLenByteArray. The
> > value's encoding would adhere to the IEEE-754 standard [1].
> > Furthermore, implementations should ensure that any value comparisons
> > and ordering requirements (mainly for column statistics) emulate the
> > behavior of native (i.e. physical) floating point types.
> >
> > As for how this would look in practice, there are currently several
> > implementations of this proposal that are more or less complete:
> >
> >
> >- C++ (and Python): https://github.com/apache/arrow/pull/36073
> >- Java: https://github.com/apache/parquet-mr/pull/1142
> >- Go: https://github.com/apache/arrow/pull/37599
> >
> > Of course, we're prepared to make adjustments to the implementations as
> > needed, since the format additions will need to be approved before those
> > PRs are merged. I should also note that naming conventions haven't been
> > extensively discussed, so feel free to chime in if you have a strong
> > preference for HALF or HALF_FLOAT over FLOAT16!
> >
> >
> > This vote will be open for at least 72 hours.
> >
> > [ ] +1 Add this type to the format specification
> > [ ] +0
> > [ ] -1 Do not add this type to the format specification because...
> >
> > Thanks!
> >
> > Ben
> >
> > [1]: https://en.wikipedia.org/wiki/Half-precision_floating-point_format
> >
> >
>
>
>
>

-- 
Xinli Shang
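
To make the proposal quoted above concrete, here is a minimal sketch of the encoding it describes: an IEEE-754 half-precision value stored as 2 little-endian bytes. It uses Python's standard `struct` module, whose `e` format is binary16. The helper names are hypothetical and are not taken from any of the linked implementations, and NaN handling for statistics is deliberately left out.

```python
import struct
from typing import Iterable, Tuple


def encode_float16(x: float) -> bytes:
    """Pack a value as a 2-byte little-endian IEEE-754 binary16."""
    return struct.pack("<e", x)


def decode_float16(raw: bytes) -> float:
    """Unpack a 2-byte little-endian IEEE-754 binary16 value."""
    if len(raw) != 2:
        raise ValueError("a Float16 value must be exactly 2 bytes")
    return struct.unpack("<e", raw)[0]


def float16_min_max(values: Iterable[bytes]) -> Tuple[float, float]:
    """Compute min/max by comparing decoded numbers, not raw bytes.

    NaN values are not handled in this sketch.
    """
    decoded = [decode_float16(v) for v in values]
    return min(decoded), max(decoded)


if __name__ == "__main__":
    a = encode_float16(1.0009765625)  # bit pattern 0x3C01
    b = encode_float16(2.0)           # bit pattern 0x4000
    print(a.hex(), b.hex())
    print("bytewise a > b:", a > b)
    print("min/max:", float16_min_max([a, b]))
```

The last part shows why the proposal asks for comparisons that emulate native floating point types: the raw little-endian bytes of the two values sort in the opposite order from their numeric values.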


Re: Drop parquet-thrift

2023-10-03 Thread Xinli shang
Hi Fokko,

Thanks for looking into this! I generally agree we probably should retire
parquet-thrift. The only thing is that we need to find out what is still using
it, which is hard to do because of the large user base of parquet-mr. What
we did earlier was to mark the module as deprecated first and then officially
remove it one release later. But I don't know whether that process would block
you for too long.

Xinli

On Thu, Sep 28, 2023 at 2:20 AM Fokko Driesprong  wrote:

> Hey Gang,
>
> It is also used in some of the code:
>
>- org.apache.parquet.hadoop.thrift.AbstractThriftWriteSupport
>- org.apache.parquet.thrift.AbstractThriftWriteSupport
>- org.apache.parquet.thrift.ThriftSchemaConverter
>- org.apache.parquet.thrift.TupleToThriftWriteSupport
>
> Yesterday I tried to factor it out, but I ended up removing most of the
> codebase. I'm not aware of any alternative to Elephantbird. I tried to ping
> the original author
> <https://github.com/apache/parquet-mr/pull/1068#issuecomment-1729434254>,
> but the GitHub account seems to be abandoned.
>
> Kind regards,
> Fokko
>
> On Thu, Sep 28, 2023 at 11:13, Gang Wu wrote:
>
> > Hi Fokko,
> >
> > Is there any alternative to Elephantbird? Since it is only used in the
> > test, could we rewrite those test cases using the alternative if any?
> > The effort may be huge though.
> >
> > Best,
> > Gang
> >
> > On Thu, Sep 28, 2023 at 5:03 PM Fokko Driesprong 
> wrote:
> >
> > > Hi everyone,
> > >
> > > I was in the process of updating to the latest version of Thrift
> > > <https://github.com/apache/parquet-mr/pull/1138> (from 0.16.0 to
> > 0.19.0).
> > > Mostly because it contains CVEs and makes the release process easier
> > > because you don't have to install Thrift from source (it is just
> > available
> > > on homebrew etc).
> > >
> > > While working on this, I ran into an issue with Elephantbird, which is
> > > using a very old version of Thrift (0.7.0). Trying to bump this I
> noticed
> > > that a lot of classes that we use in the tests have
> > > <https://github.com/apache/parquet-mr/pull/1156> been made private
> > > <https://github.com/apache/parquet-mr/pull/1156>. Therefore it is hard
> > to
> > > test if we break anything.
> > >
> > > It looks like parquet-thrift is not used by anyone anymore
> > > <https://mvnrepository.com/artifact/org.apache.parquet/parquet-thrift
> >.
> > I
> > > would suggest removing the module from the repository
> > > <https://github.com/apache/parquet-mr/pull/1158> unless anyone
> objects.
> > >
> > > Kind regards, Fokko
> > >
> >
>


-- 
Xinli Shang


Re: [Request] Send automated notifications to a separate mailing-list

2023-08-27 Thread Xinli shang
comm...@parquet.apache.org should already exist, but it is not yet attached to
GitHub to receive notifications, so we only need to create issues@. The
question is how to migrate the current recipients of the commit and issue
notifications (currently sent to dev@). If we redirect all the
notifications to the two empty mailing lists without an automatic migration,
everybody will suddenly stop receiving them and will have to subscribe to the
two mailing lists manually. I haven't found a way to clone and rename a list.
Does anybody have an idea how to solve this problem?

On Mon, Aug 21, 2023 at 11:55 PM Uwe L. Korn  wrote:

> +1
>
> On Tue, Aug 22, 2023, at 5:29 AM, Gang Wu wrote:
> > +1 on this.
> >
> > We may create the following mailing lists:
> > - iss...@parquet.apache.org : notifications from JIRA issues.
> > - comm...@parquet.apache.org : notifications from Github PRs and
> comments.
> >
> > This is what the Apache ORC community currently does. Can one of the PMCs
> > do this?
> > Probably we need a formal vote before proceeding.
> > https://infra.apache.org/mailing-list-moderation.html#new-mailing-list
> >
> > Best,
> > Gang
> >
> > On Tue, Aug 22, 2023 at 8:49 AM Xinli shang 
> wrote:
> >
> >> It is a good idea. Thanks, Antoine, for the proposal.
> >>
> >> On Tue, Aug 22, 2023 at 2:03 AM Julien Le Dem
>  >> >
> >> wrote:
> >>
> >> > +1
> >> >
> >> > On Mon, Aug 21, 2023 at 10:16 AM Antoine Pitrou 
> >> > wrote:
> >> >
> >> > >
> >> > > Hello,
> >> > >
> >> > > I would like to request that automated notifications (from GitHub,
> >> > > Jira... whatever) be sent to a separate mailing-list and GMane
> mirror.
> >> > > Currently, the endless stream of automated notifications in this
> >> > > mailing-list means that discussions between humans quickly get lost
> or
> >> > > even unnoticed by other people.
> >> > >
> >> > > For the record, we did this move in Apache Arrow and never came
> back.
> >> > >
> >> > > Thanks in advance
> >> > >
> >> > > Antoine.
> >> > >
> >> > >
> >> > >
> >> >
> >>
> >>
> >> --
> >> Xinli Shang
> >>
>


-- 
Xinli Shang


Re: [Request] Send automated notifications to a separate mailing-list

2023-08-21 Thread Xinli shang
It is a good idea. Thanks, Antoine, for the proposal.

On Tue, Aug 22, 2023 at 2:03 AM Julien Le Dem 
wrote:

> +1
>
> On Mon, Aug 21, 2023 at 10:16 AM Antoine Pitrou 
> wrote:
>
> >
> > Hello,
> >
> > I would like to request that automated notifications (from GitHub,
> > Jira... whatever) be sent to a separate mailing-list and GMane mirror.
> > Currently, the endless stream of automated notifications in this
> > mailing-list means that discussions between humans quickly get lost or
> > even unnoticed by other people.
> >
> > For the record, we did this move in Apache Arrow and never came back.
> >
> > Thanks in advance
> >
> > Antoine.
> >
> >
> >
>


-- 
Xinli Shang


Parquet Sync meeting notes - July 2023

2023-07-25 Thread Xinli shang
7/25/2023

Attendees (Gidon Gershinsky, Gang Wu, Chao Sun, Xinli Shang, Jiashen Zhang)

Review data masking

   1. The current design is implemented on the reader side and is
      lightweight.
   2. When the KMS returns access denied and the session-based flag is
      enabled, a null value is returned instead of the original value.
   3. Relying on the KMS access-denied response has a few issues:
      1. It is expensive because it requires RPC calls.
      2. There are different KMS implementations, and the returned error
         might differ between them.

Add a column-wise key/nullify flag, just like we did for column encryption
<https://github.com/apache/parquet-mr/tree/master/parquet-hadoop#class-propertiesdrivencryptofactory>.
By doing this, we don’t need to contact the KMS.

-- 
Xinli Shang
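
The column-wise nullify flag from the notes above can be pictured with a small sketch. This is only an illustration in Python, not parquet-mr code: `read_column`, `nullify_flags`, `unwrap_key` and `decrypt` are hypothetical names, and the KMS and the cipher are passed in as plain callables so that the control flow is the only thing being shown.

```python
from typing import Callable, Dict, List, Optional, Sequence


def read_column(
    column: str,
    encrypted_values: Sequence[str],
    nullify_flags: Dict[str, bool],
    unwrap_key: Callable[[str], bytes],
    decrypt: Callable[[str, bytes], str],
) -> List[Optional[str]]:
    """Return the column values, or nulls if the column is flagged.

    When the per-column flag is set, the reader masks the column
    locally and never contacts the KMS, so there is no RPC cost and
    no dependence on how a particular KMS reports "access denied".
    """
    if nullify_flags.get(column, False):
        return [None] * len(encrypted_values)
    key = unwrap_key(column)  # the only place a KMS call would happen
    return [decrypt(value, key) for value in encrypted_values]


if __name__ == "__main__":
    kms_calls: List[str] = []

    def fake_unwrap(column: str) -> bytes:
        kms_calls.append(column)
        return b"demo-key"

    def fake_decrypt(value: str, key: bytes) -> str:
        return value.upper()  # stand-in for real decryption

    flags = {"ssn": True}
    print(read_column("ssn", ["a", "b"], flags, fake_unwrap, fake_decrypt))
    print(read_column("city", ["a"], flags, fake_unwrap, fake_decrypt))
    print("KMS calls:", kms_calls)
```

In this sketch the flagged column produces nulls with zero KMS calls, while the unflagged column goes through the key lookup as before.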


Re: Rewrite Parquet List columns

2023-07-23 Thread Xinli shang
Hi Rajesh,

Thanks for posting! Can you give an example of the level 2 -> level 3
conversion?

The current rewrite tool focuses on speed by skipping a lot of
operations, and its use cases include changing the compression, applying
encryption, and deleting columns. @Gang Wu consolidated the tools into a
universal rewriter, and I think once you have clarified your requirements,
you can work with him on it.

Xinli

On Sat, Jul 22, 2023 at 12:07 PM Rajesh Mahindra 
wrote:

> Hey folks,
>
> I have a bunch of Parquet files written with Level 2 list columns (among other
> columns). I was trying to extend the Parquet Rewrite tool to be able to
> read those files and rewrite only the list columns as Level 3. Which
> classes or APIs should I leverage for this purpose? Any
> pointers would be appreciated.
>
> --
> Take Care,
> Rajesh Mahindra
>


-- 
Xinli Shang


Parquet monthly sync meeting notes - June 2023

2023-06-28 Thread Xinli shang
Hi all,

These are the meeting notes for the June 2023 sync meeting.

6/28/2023

Attendees (Yi He, Xinli Shang, Jiashen Zhang)

   1. Add more data masking cases than nullify. The initial draft
      <https://docs.google.com/document/d/1JJrEOAoZDswkwTeKmFD2drZXK60ADdrkaYlMMuMwUV0/edit>
      is here. Yi will share this with the dev mailing list.
   2. Data masking Parquet-2223
      <https://github.com/apache/parquet-mr/pull/1112>
      1. Changes are needed to remove unnecessary files
      2. Start with removeColumnsInSchema() and remove those unused files.
   3. Row-level index
      1. Will see what changes are needed in Parquet
   4. Cell-level encryption
      1. To be upstreamed
-- 
Xinli Shang


Re: Bloom filters for full-text search and predicate pushdown

2023-06-15 Thread Xinli shang
Hi Marco,

This is an exciting idea! It opens up more use cases for Parquet. As an
open community, we always welcome new ideas and innovations like yours. I
encourage you to go deeper and broader with this idea and come up with a
proposal and POC. Generative AI has become a reality, so in addition to
keyword search, you can think of other things like OpenAI embeddings. Maybe
later Parquet filters can do matches based on the closeness of two
embedding vectors.

That said, other people's comments are also valid: Parquet is a
strict file format and we need standardization. So we look forward to
your proposal and POC. If you want to discuss this at this week's sync
meeting, you are more than welcome. I added you.

Xinli Shang

On Thu, Jun 15, 2023 at 4:38 AM Antoine Pitrou  wrote:

>
> Hi,
>
> This would require standardizing on a specific tokenization algorithm,
> right? I'm not sure it's a good idea to add such complexity to the
> Parquet spec (the tokenization might need to be language-specific
> and/or corpus-specific).
>
> I wonder if it would be more productive to try and find ways to build
> e.g. a Lucene index over Parquet columns (perhaps it's already
> possible?).
>
> Regards
>
> Antoine.
>
>
>
> On Wed, 7 Jun 2023 18:01:32 +0800
> Gang Wu  wrote:
> > Hi Marco,
> >
> > That sounds interesting!
> >
> > However, this requires the parquet implementation to be able to tokenize
> > both
> > strings to write and literals in the filters. The actual efficiency
> depends
> > on the
> > data distribution. I am also concerned with the possible explosion of
> > distinct
> > values introduced by splitting words, which may result in a large bloom
> > filter.
> >
> > Have you tried any PoC to get a rough estimate of benefits in your use
> case?
> >
> > Best,
> > Gang
> >
> >
> >
> > On Tue, Jun 6, 2023 at 5:06 PM Marco Colli <
> collimarco91-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> >
> > > Hello,
> > >
> > > I see that Parquet already supports Bloom filters.
> > >
> > > To my understanding, it currently uses them only on the entire value.
> > >
> > > For example, if I have a column "MovieTitle":
> > >
> > > - "The title of my movie"
> > > - "Another movie title"
> > > - "The best movie title"
> > > - ...
> > >
> > > Then the current Bloom filters can be used to find only the column
> > > chunks/pages that match an exact title. For example you can use the
> bloom
> > > filter to search for "The best movie title".
> > >
> > > It would be interesting to have *a bloom filter on the specific words*,
> > > instead of using the entire value: in this way you can search the word
> > > "best" in the "MovieTitle" column and find the titles that contain that
> > > specific word in an efficient way.
> > >
> > > It would enable a sort of full-text search of keywords inside text
> columns.
> > > It would also allow predicate pushdown for searches based on keywords.
> > >
> > > Would it make sense to have such an addition? Is there any strategy
> already
> > > used by Parquet for fast keyword searches inside text columns?
> > >
> > >
> > > Best regards,
> > > Marco Colli
> > > AbstractBrain srls
> > >
> >
>
>
>
>

-- 
Xinli Shang
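
For readers following the thread, the word-level idea can be sketched in a few lines of Python. This is a textbook Bloom filter built on SHA-256 with whitespace tokenization, chosen only to keep the example self-contained. It is not the split-block, xxHash-based filter that the Parquet format specifies, and the class and function names are hypothetical. The tokenizer is exactly the part that Antoine and Gang point out would need to be standardized.

```python
import hashlib
from typing import Iterable, Iterator, List


class BloomFilter:
    """A plain Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions by salting the hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def tokenize(text: str) -> List[str]:
    """Naive tokenizer (an assumption): lowercase, split on whitespace."""
    return text.lower().split()


def build_word_filter(values: Iterable[str]) -> BloomFilter:
    """Index every word of every value instead of the whole value."""
    bloom = BloomFilter()
    for value in values:
        for token in tokenize(value):
            bloom.add(token)
    return bloom


if __name__ == "__main__":
    titles = [
        "The title of my movie",
        "Another movie title",
        "The best movie title",
    ]
    chunk_filter = build_word_filter(titles)
    # A reader could skip this chunk only if the filter says "no".
    print(chunk_filter.might_contain("best"))
    print(chunk_filter.might_contain("documentary"))
```

A keyword predicate would tokenize the search term the same way and skip any column chunk or page whose filter reports that the word is definitely absent. Gang's concern about filter size shows up directly here: every distinct word, rather than every distinct value, consumes bits.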


Re: [ANNOUNCE] Apache Parquet release 1.13.1

2023-05-22 Thread Xinli shang
Thanks, Fokko, for taking the lead on this!

On Thu, May 18, 2023 at 2:24 PM Fokko Driesprong  wrote:

> Hi all,
>
> I'm pleased to announce the release of Parquet 1.13.1!
>
> Parquet is a general-purpose columnar file format for nested data. It
> uses space-efficient
> encodings and a compressed and splittable structure for processing
> frameworks
> like Hadoop.
>
> Changes are listed at:
> https://github.com/apache/parquet-mr/pull/1095/files
> This release can be downloaded from:
> https://parquet.apache.org/blog/2023/05/18/1.13.1/
>
> Java artifacts are available from Maven Central.
>
> Thanks to everyone for contributing and voting!
>
> Kind regards, Fokko
>


-- 
Xinli Shang


Re: [VOTE] Release Apache Parquet 1.13.1 RC0

2023-05-13 Thread Xinli shang
+1

I verified the signature and ran a sanity test.



On Fri, May 12, 2023 at 6:15 PM pk singh  wrote:

> Thanks Fokko, this is super-helpful and unblocks parquet 1.13 upgrade for
> iceberg <https://github.com/apache/iceberg/pull/7301> !
>
> +1 (non-binding) from my end as well.
>
> Regards,
> Prashant Singh
>
>
>
> On 2023/05/12 13:37:30 Fokko Driesprong wrote:
> > Hi everyone,
> >
> >
> > I propose the following RC to be released as the official Apache Parquet
> > 1.13.1 release.
> >
> >
> > The commit id is db4183109d5b734ec5930d870cdae161e408ddba
> >
> > * This corresponds to the tag: apache-parquet-1.13.1-rc0
> >
> > *
> >
> https://github.com/apache/parquet-mr/tree/db4183109d5b734ec5930d870cdae161e408ddba
> >
> >
> > The release tarball, signature, and checksums are here:
> >
> > *
> https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.13.1-rc0
> >
> >
> > You can find the KEYS file here:
> >
> > * https://downloads.apache.org/parquet/KEYS
> >
> >
> > Binary artifacts are staged in Nexus here:
> >
> > *
> https://repository.apache.org/content/groups/staging/org/apache/parquet/
> >
> >
> > This release includes important changes:
> >
> > * https://github.com/apache/parquet-mr/commits/parquet-1.13.x
> >
> >
> > Handy commands for verifying the release:
> >
> > *
> >
> https://iceberg.apache.org/how-to-release/#validating-a-source-release-candidate
> >
> > Replace Iceberg with Parquet :)
> >
> >
> > Please download, verify, and test.
> >
> >
> > Please vote in the next 72 hours.
> >
> >
> > [ ] +1 Release this as Apache Parquet 1.13.1
> >
> > [ ] +0
> >
> > [ ] -1 Do not release this because...
> >



-- 
Xinli Shang


Re: [DISCUSS] Time to release parquet format 2.10.0?

2023-05-13 Thread Xinli shang
Thanks, Gang, for taking the lead on this! I agree we should have a new
release. In addition to PARQUET-2261, there was also a discussion in Feb
with the PMC members about PARQUET-758. We may want to check with Antoine
Pitrou <https://github.com/pitrou> whether PARQUET-758 should be included as well.



On Sat, May 13, 2023 at 9:51 AM Micah Kornfield 
wrote:

> >
> >  BTW, I'd like to see the implementation from Micah to fully
> > understand the use case. If he is too busy to do that, I can do it based
> on
> > my understanding.
>
>
> I can allocate some time to try to make a PoC in C++ next month if we are
> willing to wait until then.
>
> On Fri, May 12, 2023 at 5:04 AM Gang Wu  wrote:
>
> > I think we can wait for a complete PoC implementation of PARQUET-2261
> > before release. BTW, I'd like to see the implementation from Micah to
> fully
> > understand the use case. If he is too busy to do that, I can do it based
> on
> > my understanding.
> >
> > Best,
> > Gang
> >
> > On Fri, May 12, 2023 at 4:34 PM Gábor Szádovszky 
> wrote:
> >
> > > Thanks a lot for volunteering, Gang!
> > >
> > > However it is more than 2 years indeed since the last release I think
> the
> > > actual changes since then are more important. There are lots of
> > > additions/corrections in the spec docs and the thrift file comments
> which
> > > are very important but not tightly attached to a format release. I only
> > can
> > > see PARQUET-2257 that contains an actual change in the thrift
> structure.
> > >
> > > Related to the ongoing effort of PARQUET-2261: I think, we are waiting
> > for
> > > a PoC implementation. @emkornfield: Do you plan to work on this?
> > >
> > > The question is if we think PARQUET-2257 is urgent enough to not to
> wait
> > > for PARQUET-2261 and have an additional release after the latter is
> ready
> > > or we shall wait for the PoC implementation and release format after
> it.
> > >
> > > On 2023/05/02 03:33:05 Gang Wu wrote:
> > > > Thanks Fokko!
> > > >
> > > > Let us just wait for more inputs to see if it is good to proceed.
> > > >
> > > > Best,
> > > > Gang
> > > >
> > > > On Fri, Apr 28, 2023 at 4:05 PM Fokko Driesprong 
> > > wrote:
> > > >
> > > > > Hey Gang,
> > > > >
> > > > > Great bringing this up, I think that would be a great idea!
> > > > >
> > > > > Kind regards,
> > > > > Fokko
> > > > >
> > > > > > On Thu, Apr 27, 2023 at 09:52, Gang Wu wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > The latest parquet format is v2.9.0 [1] which was released two
> > years
> > > ago.
> > > > > > Is it a good time to release the next version? If there is no
> > > objection,
> > > > > I
> > > > > > can
> > > > > > volunteer to be the release manager.
> > > > > >
> > > > > > [1]
> > https://github.com/apache/parquet-format/blob/master/CHANGES.md
> > > > > >
> > > > > > Best,
> > > > > > Gang
> > > > > >
> > > > >
> > > >
> > >
> >
>


-- 
Xinli Shang


Parquet sync meeting notes - April 2023

2023-04-28 Thread Xinli shang
Hi all,

Here are the meeting notes for today's Parquet sync meeting.


4/28/2023

Attendees (Shenxuan Liu, Fokko Driesprong, Gang Wu, Jiashen Zhang, Xinli
Shang)

   1. Post-release 1.13.0
      1. When Iceberg upgraded to 1.13.0, it turned out that the release
         had bumped the Hadoop support to Hadoop 3. We didn’t notice
         because we don’t run CI against Hadoop 2. This has been fixed in
         #2290 <https://github.com/apache/parquet-mr/pull/1083>.
      2. Some small changes (#1073
         <https://github.com/apache/parquet-mr/pull/1073> and #1074
         <https://github.com/apache/parquet-mr/pull/1074>) make it possible
         for Flink to use ParquetMR without having Hadoop on the classpath.
   2. In Velox, we store/cache files locally, and then we can see a
      bottleneck in Parquet itself.
      1. With the local file stored on SSD, reads reach 3 GB/sec, while
         decompression runs at 200 MB/sec.
      2. The current Parquet reader is designed for remote reading.
      3. There is a trans-compression
         <https://github.com/apache/parquet-mr/blob/master/parquet-cli/src/main/java/org/apache/parquet/cli/commands/TransCompressionCommand.java>
         API you can use to speed this up; it is about 20x faster.
      4. ZSTD is recommended.
   3. Data masking Parquet-2223
      <https://github.com/apache/parquet-mr/pull/1016>
      1. The code is incomplete. A column needs to be hidden in the
         schema when it is hidden, and we also need to mark it as hidden.


-- 
Xinli Shang


Re: [DISCUSS] Release of Apache Parquet 1.13.1

2023-04-25 Thread Xinli shang
Hi Fokko,

Thanks for volunteering to release 1.13.1! That would be great and I am
looking forward to you being the release manager for that.

We can have the 1.13.1 release add back support for the old Hadoop version,
but the question is whether we should release ASAP or wait for a reasonable
time window. Version 1.13.0 was just released, and I am not sure whether
more issues are coming whose fixes we could bundle into 1.13.1.
Is Iceberg urgently blocked on this?

Xinli Shang



On Tue, Apr 25, 2023 at 6:51 PM Gang Wu  wrote:

> That sounds good to me.
>
> I have just released 1.13.0, just let me know if you need anything
> on my end to make the next release.
>
> Best,
> Gang
>
> On Tue, Apr 25, 2023 at 10:31 PM Fokko Driesprong 
> wrote:
>
> > Hey Gang,
> >
> > Thanks for the quick reply. I think 2.8.x is water under the bridge, but
> I
> > can be convinced otherwise. I also spend a few cycles to see if we can
> get
> > compatibility with 2.7.3+, but it doesn't seem trivial
> > <https://github.com/apache/parquet-mr/pull/1075#issuecomment-1514518094
> >.
> > As Gabor said on the ticket, it is fine to drop support for older systems
> > from time to time. The public Hadoop 2.8
> > <https://github.com/apache/hadoop/tree/branch-2.8> doesn't seem to get
> any
> > active updates. I don't fully agree with the ticket, you can still read
> > Parquet, but using an older version of the library.
> >
> > Kind regards,
> > Fokko Driesprong
> >
> > > On Tue, Apr 25, 2023 at 16:13, Gang Wu wrote:
> >
> > > Hi Fokko,
> > >
> > > There is an issue of the 1.13.0 release:
> > > https://issues.apache.org/jira/browse/PARQUET-2276.
> > >
> > > It seems that Hadoop 2.8.x is no longer supported after 1.13.0. I have
> > seen
> > > that
> > > you have added CI checks for Hadoop 2.9.x. Not sure if this is a
> > > blocking issue.
> > >
> > > Best,
> > > Gang
> > >
> > >
> > >
> > > On Tue, Apr 25, 2023 at 3:25 PM Fokko Driesprong wrote:
> > >
> > > > Hi all,
> > > >
> > > > I would like to discuss releasing Parquet 1.13.1. For Iceberg we ran into
> > > > two things:
> > > >
> > > >    - We noticed that support for Hadoop 2 was dropped. Iceberg is still on
> > > >      2.7.3, and we're aware of the fact that it was released in August 2016.
> > > >      The PR that I've created
> > > >      <https://github.com/apache/parquet-mr/pull/1083/> bumps the lower bound
> > > >      to Hadoop 2.9.2. Which is also old, but if possible we would like to
> > > >      cater to the widest audience possible.
> > > >    - At Iceberg we also have the Apache Flink integration, and Flink is
> > > >      able to run without Hadoop. This required some minor changes (#1074
> > > >      <https://github.com/apache/parquet-mr/pull/1074>, #1073
> > > >      <https://github.com/apache/parquet-mr/pull/1073>) that already have
> > > >      been backported. It would be awesome to get these out.
> > > >
> > > > My question is, after the release of 1.13.0 are there any issues that came
> > > > up, or anything that you would like to see being released? I'm happy to
> > > > volunteer as a release manager for 1.13.1. Let us know!
> > > >
> > > > Kind regards,
> > > > Fokko
> > > >
> > >
> >
>


-- 
Xinli Shang


[jira] [Comment Edited] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-04-22 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17715304#comment-17715304
 ] 

Xinli Shang edited comment on PARQUET-2276 at 4/22/23 4:36 PM:
---

[~a2l]Did you try Hadoop 2.9.x? 

I agree with [~gszadovszky]. Let's find a way to add back support for Hadoop 2.
Parquet is widely used by many companies, and a breaking change like this has a
big impact on the industry. We should have made that clear when introducing
breaking changes like this. [~a2l] Do you think you can work on it?



was (Author: sha...@uber.com):
[~Aufderhar]Did you try Hadoop 2.9.x? 

I agree with [~gszadovszky]. Let's find a way to add back the support hadoop2. 
Parquet is widely used by so many companies and breaking change means big to 
the industry.  We should have made it clear when taking the breaking changes 
like this.  [~a2l]Do you think you can work on it? 


> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.13.0
> Reporter: Atul Mohan
> Priority: Major
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
> java.lang.NoSuchMethodError: 'boolean org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)'
>     at org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>     at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49)
>     at org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>     at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:787)
>     at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657)
>     at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162)
>     at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].
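
The failure described above happens when code compiled against a newer
library calls a method that the older runtime class does not have. A minimal
sketch of one general way to guard such a call is to probe for the method by
reflection first. This is not the parquet-mr fix: MethodProbe and its methods
are hypothetical names, and java.lang.String stands in for the stream class so
that the example runs without Hadoop on the classpath.

```java
import java.lang.reflect.Method;

/** Illustrative only: probes for an optional method before calling it. */
final class MethodProbe {

    /** Returns true if the class exposes a public method with this name and parameter types. */
    static boolean hasPublicMethod(Class<?> type, String name, Class<?>... parameterTypes) {
        try {
            type.getMethod(name, parameterTypes);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    /**
     * Calls a public method that takes one String and returns a boolean, if the
     * target has one; otherwise returns the fallback instead of failing at runtime.
     */
    static boolean callIfPresent(Object target, String name, String arg, boolean fallback) {
        try {
            Method method = target.getClass().getMethod(name, String.class);
            Object result = method.invoke(target, arg);
            return result instanceof Boolean ? (Boolean) result : fallback;
        } catch (ReflectiveOperationException e) {
            // Method missing or not callable: behave as if the capability is absent.
            return fallback;
        }
    }

    public static void main(String[] args) {
        // String has startsWith(String) but no hasCapability(String).
        System.out.println(hasPublicMethod(String.class, "startsWith", String.class));
        System.out.println(hasPublicMethod(String.class, "hasCapability", String.class));
        System.out.println(callIfPresent("parquet", "startsWith", "par", false));
        System.out.println(callIfPresent("parquet", "hasCapability", "some-capability", false));
    }
}
```

The trade-off of this approach is that the check moves from link time to run
time, so the fallback path has to be safe on its own.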



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-04-22 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17715304#comment-17715304
 ] 

Xinli Shang edited comment on PARQUET-2276 at 4/22/23 4:36 PM:
---

[~a2l] Did you try Hadoop 2.9.x? 

I agree with [~gszadovszky]. Let's find a way to add back support for Hadoop 2.
Parquet is widely used by many companies, and a breaking change like this has a
big impact on the industry. We should have made that clear when introducing
breaking changes like this. [~a2l] Do you think you can work on it?



was (Author: sha...@uber.com):
[~a2l]Did you try Hadoop 2.9.x? 

I agree with [~gszadovszky]. Let's find a way to add back the support hadoop2. 
Parquet is widely used by so many companies and breaking change means big to 
the industry.  We should have made it clear when taking the breaking changes 
like this.  [~a2l]Do you think you can work on it? 


> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.13.0
> Reporter: Atul Mohan
> Priority: Major
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
> java.lang.NoSuchMethodError: 'boolean org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)'
>     at org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>     at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49)
>     at org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>     at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:787)
>     at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657)
>     at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162)
>     at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-04-22 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17715304#comment-17715304
 ] 

Xinli Shang commented on PARQUET-2276:
--

[~Aufderhar]Did you try Hadoop 2.9.x? 

I agree with [~gszadovszky]. Let's find a way to add back support for Hadoop 2.
Parquet is widely used by many companies, and a breaking change like this has a
big impact on the industry. We should have made that clear when introducing
breaking changes like this. [~a2l] Do you think you can work on it?


> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.13.0
> Reporter: Atul Mohan
> Priority: Major
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
> java.lang.NoSuchMethodError: 'boolean org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)'
>     at org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>     at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49)
>     at org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>     at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:787)
>     at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657)
>     at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162)
>     at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] Release Apache Parquet 1.13.0 RC0

2023-04-03 Thread Xinli shang
+1

Verified checksum and signature, and ran internal tests.

Gang, thanks a lot for leading this effort!

On Mon, Apr 3, 2023 at 12:06 AM Gábor Szádovszky  wrote:

> Verified checksum and signature, diffed tarball and repo content,
> build/unit tests pass.
> +1 (binding) for releasing this content as 1.13.0
>
> NOTE: It is completely fine or even a good practice to release the first
> minor release from its separate branch (instead of master). Do not forget
> to merge back CHANGES.md and the new version numbers update
> (1.14.0-SNAPSHOT) to master, please.
>
> Thank you again, Gang for working on this release!
>
> On 2023/04/03 05:43:58 "Wang, Yuming" wrote:
> > +1. Tested this release through Apache Spark UT:
> > https://github.com/apache/spark/pull/40555
> >
> > From: Gang Wu 
> > Date: Monday, April 3, 2023 at 00:40
> > To: dev@parquet.apache.org 
> > Subject: [VOTE] Release Apache Parquet 1.13.0 RC0
> >
> > Hi everyone,
> >
> > I propose the following RC to be released as the official Apache Parquet
> > 1.13.0 release.
> >
> > The commit id is 2e369ed173f66f057c296e63c1bc31d77f294f41
> > * This corresponds to the tag: apache-parquet-1.13.0-rc0
> > * https://github.com/apache/parquet-mr/tree/2e369ed173f66f057c296e63c1bc31d77f294f41
> >
> > The release tarball, signature, and checksums are here:
> > * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.13.0-rc0
> >
> > You can find the KEYS file here:
> > * https://downloads.apache.org/parquet/KEYS
> >
> > Binary artifacts are staged in Nexus here:
> > * https://repository.apache.org/content/groups/staging/org/apache/parquet/
> >
> > This release includes important changes listed:
> > * https://github.com/apache/parquet-mr/blob/parquet-1.13.x/CHANGES.md
> >
> > Please download, verify, and test.
> >
> > Please vote in the next 72 hours.
> >
> > [ ] +1 Release this as Apache Parquet 1.13.0
> > [ ] +0
> > [ ] -1 Do not release this because...
> >
> > Best regards,
> > Gang
> >
>


-- 
Xinli Shang
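
Several votes in this thread mention verifying the checksum. As a rough
sketch of what that step amounts to, the code below computes a SHA-512 digest
and compares it with the first token of a published checksum line. The class
name ChecksumCheck, the in-memory sample bytes, and the assumed
"<hex digest> <file name>" line layout are all illustrative assumptions; in
practice the bytes would come from the downloaded tarball, and the signature
still has to be checked separately against the KEYS file.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative only: compares a computed SHA-512 digest with a published one. */
final class ChecksumCheck {

    /** Returns the SHA-512 digest of the data as lowercase hex. */
    static String sha512Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-512");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Rethrown unchecked so callers do not need a checked-exception handler.
            throw new IllegalStateException(e);
        }
    }

    /** Compares the computed digest with the first whitespace-separated token of the line. */
    static boolean matches(byte[] data, String publishedLine) {
        String expected = publishedLine.trim().split("\\s+")[0];
        return sha512Hex(data).equalsIgnoreCase(expected);
    }

    public static void main(String[] args) {
        // Stand-in for the bytes of a downloaded release artifact.
        byte[] artifact = "example artifact bytes".getBytes(StandardCharsets.UTF_8);
        String published = sha512Hex(artifact) + "  example-artifact.tar.gz";
        System.out.println(matches(artifact, published));
        System.out.println(matches("tampered".getBytes(StandardCharsets.UTF_8), published));
    }
}
```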


Re: [VOTE] Release Apache Parquet 1.12.4 RC0

2023-03-30 Thread Xinli shang
Yeah, let's extend the 72-hour time limit. I am asking other PMC members to
vote now as well.

On Wed, Mar 29, 2023 at 6:49 PM Gang Wu  wrote:

> Thank you all!
>
> I have checked the Apache Voting Processing [1] and Release Policy [2].
> Both of them say that a vote should be valid for at least 72 hours.
>
> As we need one more binding vote from PMC members to pass the vote, I think
> we may need to extend the vote to receive enough replies.
>
> Any suggestions?
>
> [1]
>
> https://www.apache.org/foundation/voting.html#expressing-votes-1-0-1-and-fractions
> [2] https://www.apache.org/legal/release-policy.html#release-approval
>
> Best,
> Gang
>
> On Thu, Mar 30, 2023 at 12:00 AM Dongjoon Hyun wrote:
>
> > Thank you all.
> >
> > Hi, Gang. Could you conclude this RC0 vote since it seems to pass 72 hours?
> >
> > Thanks,
> > Dongjoon.
> >
> > On 2023/03/29 05:54:45 Gang Wu wrote:
> > > Yes, I have updated my GPG key but have not sent it to http://pgp.mit.edu/.
> > >
> > > You may find my key from keys.openpgp.org
> > >
> > > Best,
> > > Gang
> > >
> > > On Wed, Mar 29, 2023 at 1:51 PM L. C. Hsieh wrote:
> > >
> > > > Hi Gang,
> > > >
> > > > I tried to search your public key on http://pgp.mit.edu/.
> > > > It shows a different public key:
> > > >
> > > > pub  4096R/26D4D78E 2018-04-11 Gang Wu
> > > >
> > > > Looks like it is your older public key? Wondering why your new public key
> > > > is not updated on key server.
> > > >
> > > > On 2023/03/29 02:59:47 Gang Wu wrote:
> > > > > Hi L.C.
> > > > >
> > > > > Could you please elaborate the issue with public key? How can I check that
> > > > > by myself?
> > > > >
> > > > > Thanks,
> > > > > Gang
> > > > >
> > > > > On Wed, Mar 29, 2023 at 7:48 AM L. C. Hsieh wrote:
> > > > >
> > > > > > +1 (non-binding) Verified checksum and ran the tests locally.
> > > > > >
> > > > > > Thanks Gang.
> > > > > >
> > > > > > One question is that the public key I saw on key server (pgpkeys.mit.edu)
> > > > > > is different to the one in https://downloads.apache.org/parquet/KEYS.
> > > > > >
> > > > > > On 2023/03/28 17:01:30 Chao Sun wrote:
> > > > > > > +1 (non-binding). Verified checksum & signature, and ran all the tests
> > > > > > > locally.
> > > > > > >
> > > > > > > Thanks Gang!
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Mar 28, 2023 at 9:37 AM Gidon Gershinsky <gg5...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > +1
> > > > > > > >
> > > > > > > > Verified signature and ran the tests. Thanks Gang and all contributors!
> > > > > > > >
> > > > > > > > Cheers, Gidon
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Mar 28, 2023 at 5:19 PM Xinli shang wrote:
> > > > > > > >
> > > > > > > > > +1
> > > > > > > > >
> > > > > > > > > Verified signature and ran internal tests.  Thanks Gang for leading this
> > > > > > > > > effort!
> > > > > > > > >
> > > > > > > > > On Mon, Mar 27, 2023 at 9:38 AM Dongjoon Hyun <dongj...@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > > +1
> > > > > > > > > >
> > > > > > > > > > Thank you, Gang and Yuming.
> > > > > > > > > >
> > > > > > > > > > Dongjoon.
> > > > > > > > > >
> > > > > > > > > > On 2023/03/27 05:44:14 "Wang, Yuming" wrote:
> > > > > > > > > >

Re: [VOTE] Release Apache Parquet 1.12.4 RC0

2023-03-28 Thread Xinli shang
+1

Verified signature and ran internal tests.  Thanks Gang for leading this
effort!

On Mon, Mar 27, 2023 at 9:38 AM Dongjoon Hyun  wrote:

> +1
>
> Thank you, Gang and Yuming.
>
> Dongjoon.
>
> On 2023/03/27 05:44:14 "Wang, Yuming" wrote:
> > +1. Tested this release through Spark UT:
> > https://github.com/apache/spark/pull/40555.
> >
> >
> > From: Gang Wu 
> > Date: Sunday, March 26, 2023 at 22:42
> > To: dev@parquet.apache.org 
> > Subject: [VOTE] Release Apache Parquet 1.12.4 RC0
> >
> > Hi everyone,
> >
> > I propose the following RC to be released as the official Apache Parquet
> > 1.12.4 release.
> >
> > The commit id is 22069e58494e7cb5d50e664c7ffa1cf1468404f8
> > * This corresponds to the tag: apache-parquet-1.12.4-rc0
> > * https://github.com/apache/parquet-mr/tree/22069e58494e7cb5d50e664c7ffa1cf1468404f8
> >
> > The release tarball, signature, and checksums are here:
> > * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.12.4-rc0
> >
> > You can find the KEYS file here:
> > * https://downloads.apache.org/parquet/KEYS
> >
> > Binary artifacts are staged in Nexus here:
> > * https://repository.apache.org/content/groups/staging/org/apache/parquet/
> >
> > This release includes important changes listed:
> > * https://github.com/apache/parquet-mr/blob/parquet-1.12.4/CHANGES.md
> >
> > Please download, verify, and test.
> >
> > Please vote in the next 72 hours.
> >
> > [ ] +1 Release this as Apache Parquet 1.12.4
> > [ ] +0
> > [ ] -1 Do not release this because...
> >
> > Best regards,
> > Gang
> >
>


-- 
Xinli Shang


[jira] [Commented] (PARQUET-1690) Integer Overflow of BinaryStatistics#isSmallerThan()

2023-03-17 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701789#comment-17701789
 ] 

Xinli Shang commented on PARQUET-1690:
--

That was quite a long time ago, and I don't remember the details. Yes, it
would be great to start a new PR.

> Integer Overflow of BinaryStatistics#isSmallerThan()
> 
>
> Key: PARQUET-1690
> URL: https://issues.apache.org/jira/browse/PARQUET-1690
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.11.0
> Reporter: Xinli Shang
> Assignee: Xinli Shang
> Priority: Major
> Labels: pull-request-available
>
> "(min.length() + max.length()) < size" didn't handle integer overflow 
> [https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java#L103]
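
To illustrate the overflow named in the issue description above: when two int
lengths are added, the sum is computed in 32 bits and can wrap to a negative
number before it is compared with the size. The sketch below shows that
effect and one common remedy, widening to long before adding. SizeCheck and
its parameter names are hypothetical stand-ins, not the BinaryStatistics
code.

```java
/** Illustrative only: int overflow in a "sum of lengths < size" check. */
final class SizeCheck {

    /** Naive form: the int sum can wrap around to a negative value. */
    static boolean isSmallerThanNaive(int minLength, int maxLength, long size) {
        return (minLength + maxLength) < size;
    }

    /** Widening one operand to long makes the addition happen in 64 bits. */
    static boolean isSmallerThanWidened(int minLength, int maxLength, long size) {
        return ((long) minLength + maxLength) < size;
    }

    public static void main(String[] args) {
        int big = Integer.MAX_VALUE;
        System.out.println(isSmallerThanNaive(big, big, 4096L));   // true: the sum wrapped to -2
        System.out.println(isSmallerThanWidened(big, big, 4096L)); // false
        System.out.println(isSmallerThanWidened(10, 20, 4096L));   // true
    }
}
```

Math.addExact is an alternative when an exception on overflow is preferable
to silently widening.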



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Gang Wu as new Apache Parquet committer

2023-02-27 Thread Xinli shang
The Project Management Committee (PMC) for Apache Parquet has invited Gang
Wu (gangwu) to become a committer and we are pleased to announce that he
has accepted.

Congratulations and welcome, Gang!

-- 
Xinli Shang


[jira] [Commented] (PARQUET-2233) Parquet Travis CI jobs to be turned off February 15th

2023-01-25 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17680705#comment-17680705
 ] 

Xinli Shang commented on PARQUET-2233:
--

[~Jiashen Zhang]Please have a look and we can discuss if there are still 
blocking issues. 

> Parquet Travis CI jobs to be turned off February 15th
> -
>
> Key: PARQUET-2233
> URL: https://issues.apache.org/jira/browse/PARQUET-2233
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Reporter: Xinli Shang
> Priority: Major
>
> Greetings Parquet PMC,
> Infrastructure has reached out to you regarding the Travis CI Open Source 
> policy changes, and the resulting need for Apache projects to migrate away 
> from using Travis.
> So far, we have received no response from your PMC.
> On February 15th, we will begin the final phase of this migration, turning 
> off Travis builds in order to bring our Travis usage down to 0.
> We have found the following repositories mention or make use of .travis.yml 
> files:
>  * parquet-mr.git
>  * parquet-cpp.git
> You must immediately move to migrate your builds from Travis. If you do not, 
> you will soon be unable to do builds that now rely on Travis.
> Many projects have moved to using GitHub Actions, and migrating to GHA is 
> quite straightforward. Other projects use Jenkins providing ARM support, with 
> nodes using the arm label
> If you are unsure how to proceed, I would be happy to explain your next steps.
> Please at least respond to acknowledge the need to migrate away from Travis, 
> and to tell us your current plans.
> Thank you!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2233) Parquet Travis CI jobs to be turned off February 15th

2023-01-24 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17680363#comment-17680363
 ] 

Xinli Shang commented on PARQUET-2233:
--

Were you able to log in? 

> Parquet Travis CI jobs to be turned off February 15th
> -
>
> Key: PARQUET-2233
> URL: https://issues.apache.org/jira/browse/PARQUET-2233
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Reporter: Xinli Shang
> Priority: Major
>
> Greetings Parquet PMC,
> Infrastructure has reached out to you regarding the Travis CI Open Source 
> policy changes, and the resulting need for Apache projects to migrate away 
> from using Travis.
> So far, we have received no response from your PMC.
> On February 15th, we will begin the final phase of this migration, turning 
> off Travis builds in order to bring our Travis usage down to 0.
> We have found the following repositories mention or make use of .travis.yml 
> files:
>  * parquet-mr.git
>  * parquet-cpp.git
> You must immediately move to migrate your builds from Travis. If you do not, 
> you will soon be unable to do builds that now rely on Travis.
> Many projects have moved to using GitHub Actions, and migrating to GHA is 
> quite straightforward. Other projects use Jenkins providing ARM support, with 
> nodes using the arm label
> If you are unsure how to proceed, I would be happy to explain your next steps.
> Please at least respond to acknowledge the need to migrate away from Travis, 
> and to tell us your current plans.
> Thank you!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (PARQUET-2233) Parquet Travis CI jobs to be turned off February 15th

2023-01-24 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17680345#comment-17680345
 ] 

Xinli Shang edited comment on PARQUET-2233 at 1/24/23 8:19 PM:
---

In this Issue, we are going to migrate parquet-mr.git and parquet-format.git. 

The hard deadline is 2/15/2023.

More information can be found at 
https://cwiki.apache.org/confluence/display/INFRA/Travis+Migrations. We will 
see if we can migrate to GitHub Actions. 


was (Author: sha...@uber.com):
In this Issue, we are going to migrate parquet-mr.git and parquet-format.git. 

The hard deadline is 2/15/2023

> Parquet Travis CI jobs to be turned off February 15th
> -
>
> Key: PARQUET-2233
> URL: https://issues.apache.org/jira/browse/PARQUET-2233
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Reporter: Xinli Shang
> Priority: Major
>
> Greetings Parquet PMC,
> Infrastructure has reached out to you regarding the Travis CI Open Source 
> policy changes, and the resulting need for Apache projects to migrate away 
> from using Travis.
> So far, we have received no response from your PMC.
> On February 15th, we will begin the final phase of this migration, turning 
> off Travis builds in order to bring our Travis usage down to 0.
> We have found the following repositories mention or make use of .travis.yml 
> files:
>  * parquet-mr.git
>  * parquet-cpp.git
> You must immediately move to migrate your builds from Travis. If you do not, 
> you will soon be unable to do builds that now rely on Travis.
> Many projects have moved to using GitHub Actions, and migrating to GHA is 
> quite straightforward. Other projects use Jenkins providing ARM support, with 
> nodes using the arm label
> If you are unsure how to proceed, I would be happy to explain your next steps.
> Please at least respond to acknowledge the need to migrate away from Travis, 
> and to tell us your current plans.
> Thank you!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2233) Parquet Travis CI jobs to be turned off February 15th

2023-01-24 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17680345#comment-17680345
 ] 

Xinli Shang commented on PARQUET-2233:
--

In this Issue, we are going to migrate parquet-mr.git and parquet-format.git. 

The hard deadline is 2/15/2023

> Parquet Travis CI jobs to be turned off February 15th
> -
>
> Key: PARQUET-2233
> URL: https://issues.apache.org/jira/browse/PARQUET-2233
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Reporter: Xinli Shang
> Priority: Major
>
> Greetings Parquet PMC,
> Infrastructure has reached out to you regarding the Travis CI Open Source 
> policy changes, and the resulting need for Apache projects to migrate away 
> from using Travis.
> So far, we have received no response from your PMC.
> On February 15th, we will begin the final phase of this migration, turning 
> off Travis builds in order to bring our Travis usage down to 0.
> We have found the following repositories mention or make use of .travis.yml 
> files:
>  * parquet-mr.git
>  * parquet-cpp.git
> You must immediately move to migrate your builds from Travis. If you do not, 
> you will soon be unable to do builds that now rely on Travis.
> Many projects have moved to using GitHub Actions, and migrating to GHA is 
> quite straightforward. Other projects use Jenkins providing ARM support, with 
> nodes using the arm label
> If you are unsure how to proceed, I would be happy to explain your next steps.
> Please at least respond to acknowledge the need to migrate away from Travis, 
> and to tell us your current plans.
> Thank you!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PARQUET-2233) Parquet Travis CI jobs to be turned off February 15th

2023-01-24 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-2233:


    Summary: Parquet Travis CI jobs to be turned off February 15th
        Key: PARQUET-2233
        URL: https://issues.apache.org/jira/browse/PARQUET-2233
    Project: Parquet
 Issue Type: Improvement
 Components: parquet-mr
   Reporter: Xinli Shang


Greetings Parquet PMC,

Infrastructure has reached out to you regarding the Travis CI Open Source 
policy changes, and the resulting need for Apache projects to migrate away from 
using Travis.

So far, we have received no response from your PMC.

On February 15th, we will begin the final phase of this migration, turning off 
Travis builds in order to bring our Travis usage down to 0.

We have found the following repositories mention or make use of .travis.yml 
files:

 * parquet-mr.git
 * parquet-cpp.git


You must immediately move to migrate your builds from Travis. If you do not, 
you will soon be unable to do builds that now rely on Travis.

Many projects have moved to using GitHub Actions, and migrating to GHA is quite 
straightforward. Other projects use Jenkins providing ARM support, with nodes 
using the arm label

If you are unsure how to proceed, I would be happy to explain your next steps.

Please at least respond to acknowledge the need to migrate away from Travis, 
and to tell us your current plans.

Thank you!




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Parquet sync meeting notes 1/24/2023

2023-01-24 Thread Xinli shang
Attendees: Gidon Gershinsky, Xinli Shang, Tim Miller, Vinoo

   1. Release new version
      1. ZSTD stream closure bug fixes and a few other fixes are blocking
         issues.
   2. PRs:
      1. Parquet-2069 <https://github.com/apache/parquet-mr/pull/957>: Fix
         some Avro schema issues. We have new comments. Miller will work on
         it soon.
      2. Parquet-2126 <https://github.com/apache/parquet-mr/pull/959>:
         thread-safe compressor/decompressor. We have new comments. Miller
         will work on it soon.
      3. PARQUET-2103 <https://github.com/apache/parquet-mr/pull/1019>: Bug
         fix for prettyJSON. Looks good. Will merge after addressing the
         comments.
      4. Consolidate rewriter: most of the comments are addressed. Xinli will
         have another look.


-- 
Xinli Shang


Re: Vectored IO in Parquet ( https://issues.apache.org/jira/browse/PARQUET-2171)

2022-10-08 Thread Xinli shang
Thanks, Mukund! As we discussed at the conference, this is a great feature! I
look forward to reviewing the changes!

On Tue, Sep 27, 2022 at 9:29 AM Mukund Madhav Thakur
 wrote:

> Hi Team,
> We in hadoop project recently added a new feature in Hadoop Vectored IO
> which will be released in the upcoming 3.3.5 hadoop release.
> This is a high performance scatter/gather extension of PositionedReadable
> API optimized for reading columnar data in cloud storage.
> https://issues.apache.org/jira/browse/HADOOP-18103.
> We observed really good performance improvements in hive tpch and tpcds
> benchmark for orc data stored in S3.
>
> We are now looking at Parquet integration as well.
> https://issues.apache.org/jira/browse/PARQUET-2171
> I have a draft patch which works locally through sparks file reader.
> https://github.com/apache/parquet-mr/pull/999
>
> We know Parquet likes to support builds against the older versions of
> hadoop, we are working on a solution to offer the API through a
> shim library.
> As I have never contributed to the Parquet codebase and it is totally new
> for me, I would really appreciate some help in implementing, testing and
> releasing this feature in the best possible way.
>
> I will be talking about all these in the upcoming Apache Conference NA next
> week Tuesday, October 04, 4:10 PM CDT. It would be really great to meet
> anyone who would be interested in getting involved in this.
>
>
>
> Thanks,
> Mukund
>


-- 
Xinli Shang
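
The scatter/gather read pattern described in the quoted message above can be
sketched as follows. This is only a minimal illustration in Python, not the
Hadoop Vectored IO API: FileRange, coalesce, read_ranges, and the max_gap
threshold are hypothetical names chosen for this sketch. It merges nearby
byte ranges (such as the column chunks a columnar reader asks for) into
fewer, larger reads and then slices each requested range back out.

```python
import io
from dataclasses import dataclass
from typing import BinaryIO, Dict, List, Tuple


@dataclass(frozen=True)
class FileRange:
    """A byte range to read. Hypothetical stand-in, not Hadoop's class."""
    offset: int
    length: int


def coalesce(ranges: List[FileRange], max_gap: int) -> List[Tuple[int, int, List[FileRange]]]:
    """Group ranges whose gap is <= max_gap into (start, end, members)."""
    groups: List[Tuple[int, int, List[FileRange]]] = []
    for r in sorted(ranges, key=lambda x: x.offset):
        end = r.offset + r.length
        if groups and r.offset - groups[-1][1] <= max_gap:
            start, prev_end, members = groups[-1]
            groups[-1] = (start, max(prev_end, end), members + [r])
        else:
            groups.append((r.offset, end, [r]))
    return groups


def read_ranges(f: BinaryIO, ranges: List[FileRange], max_gap: int = 16) -> Dict[FileRange, bytes]:
    """Issue one read per merged group, then slice out each requested range."""
    out: Dict[FileRange, bytes] = {}
    for start, end, members in coalesce(ranges, max_gap):
        f.seek(start)
        buf = f.read(end - start)
        for r in members:
            lo = r.offset - start
            out[r] = buf[lo:lo + r.length]
    return out


# An in-memory file stands in for remote storage in this sketch.
data = io.BytesIO(bytes(range(256)))
wanted = [FileRange(200, 8), FileRange(0, 4), FileRange(10, 4)]
chunks = read_ranges(data, wanted, max_gap=16)  # two merged reads, not three
```

The trade-off is the usual one for object stores: a larger max_gap means
fewer requests but more bytes read and discarded.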


Parquet community sync meeting notes - 9/27/2022

2022-09-27 Thread Xinli shang
9/27/2022

Attendees ( Gidon Gershinsky, Xinli Shang, Tim Miller, Jiasheng Zhang)

   1. Parquet Cell-level encryption
      1. Will open PRs after delivering it internally
   2. Parquet-2069 <https://github.com/apache/parquet-mr/pull/957>: Fix some
      Avro schema issues. In general, Avro schema is a problematic area and
      we need some risk control.
   3. Parquet-2126 <https://github.com/apache/parquet-mr/pull/959>:
      thread-safe compressor/decompressor
      1. Xinli to have another look along with other PRs
   4. Parquet-2196 - LZ4_RAW compressor
      1. In review
   5. PR-960: Has some comments
   6. Parquet-986: merged
   7. Parquet-1711: Break circular dependency, how to handle the exception
      case


-- 
Xinli Shang


[jira] [Created] (PARQUET-2183) Fix statistics issue of Column Encryptor

2022-09-02 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-2183:


 Summary: Fix statistics issue of Column Encryptor
 Key: PARQUET-2183
 URL: https://issues.apache.org/jira/browse/PARQUET-2183
 Project: Parquet
  Issue Type: Improvement
Reporter: Xinli Shang
Assignee: Xinli Shang


There is an issue where column statistics go missing if that column is 
re-encrypted. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Interest in adding the float16 logical type to the Parquet spec

2022-08-24 Thread Xinli shang
Hi Anja,

Thanks for your interest! We encourage new proposals. Go ahead and
make a proposal and the community can review it.

Xinli

On Tue, Aug 23, 2022 at 4:53 PM Anja  wrote:

> Hello!
>
> Is there interest in having the float16 logical type standardised in the
> Parquet spec? I am proposing a PR for Arrow that will write float16 to
> Parquet as FixedSizeBinary:
> https://issues.apache.org/jira/browse/ARROW-17464
> but for the sake of portability between data analysis tools, it would of
> course be a lot better to have this type standardised in the format itself.
>
> Previous requests for this have been here:
> https://issues.apache.org/jira/browse/PARQUET-1647 and here:
> https://issues.apache.org/jira/browse/PARQUET-758 .
>
> With the development of neural networks, half-precision floating points are
> becoming more popular:
> https://en.wikipedia.org/wiki/Half-precision_floating-point_format ; I do
> think that a demand exists for its support. I am new to the project, but am
> happy to contribute development time if there is support for this feature,
> and guidance.
>
> Warm regards,
>
> Anja
>


-- 
Xinli Shang
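
The idea in the quoted message of carrying float16 values in a fixed-size
binary field can be sketched in a few lines. This is a minimal illustration
under stated assumptions, not the Arrow or Parquet implementation: the
function names are made up for this sketch, and the 2-byte little-endian
layout is an assumption here rather than something the thread specifies.

```python
import struct


def float16_to_fixed_binary(value: float) -> bytes:
    """Encode a Python float as a 2-byte IEEE 754 half-precision value."""
    # "<e" is the struct format for little-endian binary16.
    return struct.pack("<e", value)


def fixed_binary_to_float16(raw: bytes) -> float:
    """Decode a 2-byte half-precision value back to a Python float."""
    if len(raw) != 2:
        raise ValueError("expected exactly 2 bytes")
    (value,) = struct.unpack("<e", raw)
    return value


encoded = float16_to_fixed_binary(1.5)      # always 2 bytes per value
decoded = fixed_binary_to_float16(encoded)  # 1.5 is exactly representable
```

Without a logical type annotation, a reader only sees 2-byte binary values
and cannot tell that they are floats, which is the portability concern the
message raises.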


Parquet Sync meeting - July 26 2022

2022-07-26 Thread Xinli shang
Attendees ( Gidon Gershinsky, Xinli Shang, Tim Miller)

   1. Release 1.12.3
      1. Post release - no issue reported.
   2. Parquet Cell-level encryption
      a. What if the user has keys for only some of the hidden columns, not
         all of them? Should we throw an exception or fill in 'null'?
      b. It needs more discussion and we can continue in the design doc.
   3. Parquet-2069 <https://github.com/apache/parquet-mr/pull/957>: Fix some
      Avro schema issues. Generally, Avro schema conversion is a problematic
      area and we need some risk control.
   4. Parquet-2126 <https://github.com/apache/parquet-mr/pull/959>:
      thread-safe compressor/decompressor. Close to merge but needs some more
      thinking on the thread exit scenario.


-- 
Xinli Shang


Re: Review of Q2 Parquet report

2022-07-05 Thread Xinli shang
Thanks Gidon for pointing it out!

On Tue, Jul 5, 2022 at 12:59 PM Gidon Gershinsky  wrote:

> nit: MR-1.12.3 released on 202*2*-05-26.
>
> Cheers, Gidon
>
>
> On Tue, Jul 5, 2022 at 6:04 PM Xinli shang 
> wrote:
>
> > Hi all,
> >
> > The report below is what I am going to submit for the past quarter. Please
> > review and comment on it. Thanks.
> >
> >
> > ## Description:
> > The mission of Parquet is the creation and maintenance of software related
> > to columnar storage format available to any project in the Apache Hadoop
> > ecosystem
> >
> > ## Issues:
> > no
> >
> > ## Membership Data:
> > Apache Parquet was founded 2015-04-21 (7 years ago)
> > There are currently 37 committers and 27 PMC members in this project.
> > The Committer-to-PMC ratio is roughly 5:4.
> >
> > Community changes, past quarter:
> > - No new PMC members. Last addition was Gidon Gershinsky on 2021-11-23.
> > - No new committers. Last addition was Gidon Gershinsky on 2021-04-05.
> >
> > ## Project Activity:
> > MR-1.12.3 was released on 2021-05-26.
> > MR-1.11.2 was released on 2021-10-06.
> > MR-1.12.2 was released on 2021-10-06.
> > MR-1.12.0 was released on 2021-03-25.
> >
> > ## Community Health:
> > dev@parquet.apache.org had a 65% decrease in traffic in the past quarter
> > (270 emails compared to 751)
> > 27 issues opened in JIRA, past quarter (no change)
> > 8 issues closed in JIRA, past quarter (-52% change)
> > 38 commits in the past quarter (18% increase)
> > 12 code contributors in the past quarter (20% increase)
> > 27 PRs opened on GitHub, past quarter (-20% change)
> > 17 PRs closed on GitHub, past quarter (-43% change)
> >
> >
> > --
> > Xinli Shang
> >
> --
>
> Cheers, Gidon
>


-- 
Xinli Shang


Review of Q2 Parquet report

2022-07-05 Thread Xinli shang
Hi all,

The report below is what I am going to submit for the past quarter. Please
review and comment on it. Thanks.


## Description:
The mission of Parquet is the creation and maintenance of software related
to columnar storage format available to any project in the Apache Hadoop
ecosystem

## Issues:
no

## Membership Data:
Apache Parquet was founded 2015-04-21 (7 years ago)
There are currently 37 committers and 27 PMC members in this project.
The Committer-to-PMC ratio is roughly 5:4.

Community changes, past quarter:
- No new PMC members. Last addition was Gidon Gershinsky on 2021-11-23.
- No new committers. Last addition was Gidon Gershinsky on 2021-04-05.

## Project Activity:
MR-1.12.3 was released on 2021-05-26.
MR-1.11.2 was released on 2021-10-06.
MR-1.12.2 was released on 2021-10-06.
MR-1.12.0 was released on 2021-03-25.

## Community Health:
dev@parquet.apache.org had a 65% decrease in traffic in the past quarter
(270 emails compared to 751)
27 issues opened in JIRA, past quarter (no change)
8 issues closed in JIRA, past quarter (-52% change)
38 commits in the past quarter (18% increase)
12 code contributors in the past quarter (20% increase)
27 PRs opened on GitHub, past quarter (-20% change)
17 PRs closed on GitHub, past quarter (-43% change)


-- 
Xinli Shang


Re: [VOTE] Release Apache Parquet 1.12.3 RC1

2022-05-26 Thread Xinli shang
Thanks to Julien, Gidon, and Yuming for verifying and voting! The vote passed!
I will move forward with the next steps.

On Wed, May 25, 2022 at 9:29 PM Julien Le Dem 
wrote:

> +1
> Verified signatures and tested
>
> On Mon, May 23, 2022 at 4:23 PM Xinli shang 
> wrote:
>
> > I also vote +1.
> >
> > On Sun, May 22, 2022 at 5:59 PM Wang, Yuming 
> > wrote:
> >
> > > +1. Tested through Spark: https://github.com/apache/spark/pull/36629
> > >
> > > From: Gidon Gershinsky 
> > > Date: Sunday, May 22, 2022 at 19:02
> > > To: dev@parquet.apache.org 
> > > Subject: Re: [VOTE] Release Apache Parquet 1.12.3 RC1
> > >
> > > +1. Downloaded, verified and tested.
> > >
> > > Cheers, Gidon
> > >
> > >
> > > On Fri, May 20, 2022 at 8:49 PM Xinli shang 
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > >
> > > > I propose the following RC to be released as the official Apache Parquet
> > > > 1.12.3 release.
> > > >
> > > >
> > > > The commit id is f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b
> > > >
> > > > * This corresponds to the tag: apache-parquet-1.12.3-rc1
> > > >
> > > > *
> > > > https://github.com/apache/parquet-mr/releases/tag/apache-parquet-1.12.3-rc1
> > > >
> > > >
> > > > The release tarball, signature, and checksums are here:
> > > >
> > > > * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.12.3-rc1
> > > >
> > > >
> > > > You can find the KEYS file here:
> > > >
> > > > * https://dist.apache.org/repos/dist/release/parquet/KEYS
> > > >
> > > >
> > > > Binary artifacts are staged in Nexus here:
> > > >
> > > > * https://repository.apache.org/content/groups/staging/org/apache/parquet/
> > > >
> > > >
> > > > This release includes important changes listed at
> > > > https://github.com/apache/parquet-mr/blob/parquet-1.12.3/CHANGES.md.
> > > >
> > > >
> > > > Please download, verify, and test.
> > > >
> > > >
> > > > Please vote in the next 72 hours.
> > > >
> > > >
> > > > [ ] +1 Release this as Apache Parquet 1.12.3
> > > >
> > > > [ ] +0
> > > >
> > > > [ ] -1 Do not release this because...
> > > >
> > > >
> > > >
> > > > 
> > > >
> > > > Xinli Shang
> > > >
> > > > PMC Chair of Apache Parquet
> > > >
> > > > TLM Uber Data Infra
> > > >
> > >
> >
> >
> > --
> > Xinli Shang
> >
>


-- 
Xinli Shang


Meeting notes for Parquet monthly sync - 5/24/2022

2022-05-24 Thread Xinli shang
Hi all,

These are the meeting notes for today's Parquet sync meeting. We just had a
short one as everybody is busy now. We are mainly focused on the release.

Attendees (Timothy Miller (theo...@amazon.com), Gidon Gershinsky,
Xinli Shang)


Release 1.12.3 - In progress, email was sent out, waiting for 1 more
binding vote +1.

-- 
Xinli Shang


Re: [VOTE] Release Apache Parquet 1.12.3 RC1

2022-05-23 Thread Xinli shang
I also vote +1.

On Sun, May 22, 2022 at 5:59 PM Wang, Yuming 
wrote:

> +1. Tested through Spark: https://github.com/apache/spark/pull/36629
>
> From: Gidon Gershinsky 
> Date: Sunday, May 22, 2022 at 19:02
> To: dev@parquet.apache.org 
> Subject: Re: [VOTE] Release Apache Parquet 1.12.3 RC1
>
> +1. Downloaded, verified and tested.
>
> Cheers, Gidon
>
>
> On Fri, May 20, 2022 at 8:49 PM Xinli shang 
> wrote:
>
> > Hi everyone,
> >
> >
> > I propose the following RC to be released as the official Apache Parquet
> >  1.12.3 release.
> >
> >
> > The commit id is f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b
> >
> > * This corresponds to the tag: apache-parquet-1.12.3-rc1
> >
> > *
> > https://github.com/apache/parquet-mr/releases/tag/apache-parquet-1.12.3-rc1
> >
> >
> > The release tarball, signature, and checksums are here:
> >
> > * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.12.3-rc1
> >
> >
> > You can find the KEYS file here:
> >
> > * https://dist.apache.org/repos/dist/release/parquet/KEYS
> >
> >
> > Binary artifacts are staged in Nexus here:
> >
> > * https://repository.apache.org/content/groups/staging/org/apache/parquet/
> >
> >
> > This release includes important changes listed at
> > https://github.com/apache/parquet-mr/blob/parquet-1.12.3/CHANGES.md.
> >
> >
> > Please download, verify, and test.
> >
> >
> > Please vote in the next 72 hours.
> >
> >
> > [ ] +1 Release this as Apache Parquet 1.12.3
> >
> > [ ] +0
> >
> > [ ] -1 Do not release this because...
> >
> >
> >
> > 
> >
> > Xinli Shang
> >
> > PMC Chair of Apache Parquet
> >
> > TLM Uber Data Infra
> >
>


-- 
Xinli Shang


[VOTE] Release Apache Parquet 1.12.3 RC1

2022-05-20 Thread Xinli shang
Hi everyone,


I propose the following RC to be released as the official Apache Parquet
 1.12.3 release.


The commit id is f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b

* This corresponds to the tag: apache-parquet-1.12.3-rc1

*
https://github.com/apache/parquet-mr/releases/tag/apache-parquet-1.12.3-rc1


The release tarball, signature, and checksums are here:

* https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.12.3-rc1


You can find the KEYS file here:

* https://dist.apache.org/repos/dist/release/parquet/KEYS


Binary artifacts are staged in Nexus here:

* https://repository.apache.org/content/groups/staging/org/apache/parquet/


This release includes important changes listed at
https://github.com/apache/parquet-mr/blob/parquet-1.12.3/CHANGES.md.


Please download, verify, and test.


Please vote in the next 72 hours.


[ ] +1 Release this as Apache Parquet 1.12.3

[ ] +0

[ ] -1 Do not release this because...





Xinli Shang

PMC Chair of Apache Parquet

TLM Uber Data Infra
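
For the "download, verify, and test" step above, the checksum part of the
verification can be sketched as below. This is a minimal sketch, not the
project's release tooling: the function names are hypothetical, and it
assumes the checksum file holds a SHA-512 hex digest as its first
whitespace-separated token (the layout written by sha512sum). Checking the
signature against the KEYS file is a separate step, typically done with GnuPG,
and is not covered here.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-512 so large tarballs fit in memory."""
    digest = hashlib.sha512()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(artifact: Path, checksum_file: Path) -> bool:
    """Compare the artifact's digest with the first token of the checksum file."""
    # Assumption: "<hex digest>  <file name>" layout; other layouts need parsing.
    expected = checksum_file.read_text().split()[0].lower()
    return sha512_of(artifact) == expected
```

A caller would pass the downloaded tarball and its checksum file from the
release candidate directory and treat a False result as a reason to vote -1.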


Re: AvroParquetWriter write to s3

2022-05-15 Thread Xinli shang
Hi Regin,

Parquet is a layer that handles the file format. If you want to inject
something into the request header, the S3 client library is probably the
place to do it.

Xinli

On Fri, May 13, 2022 at 3:21 PM Regin Quinoa  wrote:

> Hi, we are trying to use org.apache.parquet.avro.AvroParquetWriter
> <https://www.tabnine.com/code/java/packages/org.apache.parquet.avro>
> to write a parquet file to an s3 bucket. The file is successfully written
> to the s3 bucket, but we get an exception:
>
> com.amazonaws.SdkClientException: Unable to verify integrity of data
> upload.
>
> The goal is to resolve these exceptions when the s3 bucket is encrypted
> with SSE-KMS rather than SSE-S3.
>
>  It appears that the exceptions are thrown because of code blocks in the
> link below
>
>
> https://github.com/aws/aws-sdk-java/blob/fd409dee8ae23fb8953e0bb4dbde65536a7e0514/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/AmazonS3Client.java#L1876
>
> From amazon doc, the etag is not same as MD5 when s3 bucket is encrypted
> with SSE-KMS
>
>
> https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html
>
>  *The possible way is to pass MD5 in request header or set system.property
> to disable validation in
> skipMd5CheckStrategy.skipClientSideValidationPerPutResponse as indicated in
> link*
>
>
> https://github.com/aws/aws-sdk-java/blob/99fe75a823d4b02f4e90fa0dda06a1558d5617a1/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/internal/SkipMd5CheckStrategy.java#L42
>
>  The issue is that I do not find a proper way to inject such configurations
> into AvroParquetWriter. Is this possible? If yes, can you help to show how
> to do it?
>
>  Thanks
>
> Regin
>


-- 
Xinli Shang


Meeting notes for Parquet monthly sync - 4/27/2022

2022-04-27 Thread Xinli shang
4/27/2022

Attendees (Timothy Miller, Vinoo Ganesh, Satish K, Gidon Gershinsky, Xinli
Shang, Huaxin Gao)

   1. Cell-Level encryption
      1. Internal implementation and rollout
      2. Welcome new comments
   2. Release 1.12.3
      1. SNAPSHOT release - Gidon will take the lead
   3. ID resolution
      1. Huaxin will address Ryan’s comments
   4. UUID support for parquet-cli
      1. See some exceptions when running the tool. Timothy will investigate
         it.
   5. The next meeting will be at 8:30 am on Tuesday


-- 
Xinli Shang
VP Apache Parquet PMC Chair
Tech Lead Manager @ Uber Data Infra


Today's sync meeting will start ~15 minutes late

2022-04-27 Thread Xinli shang
Hi all,

Sorry for the last-minute notification. Today's sync meeting will start ~15
minutes late.

-- 
Xinli Shang


[jira] [Commented] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files

2022-04-08 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519686#comment-17519686
 ] 

Xinli Shang commented on PARQUET-1681:
--

[~theosib-amazon] It seems different. 

> Avro's isElementType() change breaks the reading of some parquet(1.8.1) files
> -
>
> Key: PARQUET-1681
> URL: https://issues.apache.org/jira/browse/PARQUET-1681
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Affects Versions: 1.10.0, 1.9.1, 1.11.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Critical
>
> When using the Avro schema below to write a parquet(1.8.1) file and then read 
> back by using parquet 1.10.1 without passing any schema, the reading throws 
> an exception "XXX is not a group". Reading through parquet 1.8.1 is fine. 
> {
>   "name": "phones",
>   "type": [
>     "null",
>     {
>       "type": "array",
>       "items": {
>         "type": "record",
>         "name": "phones_items",
>         "fields": [
>           { "name": "phone_number", "type": ["null", "string"], "default": null }
>         ]
>       }
>     }
>   ],
>   "default": null
> }
> The code to read is as below:
> val reader =
>   AvroParquetReader.builder[SomeRecordType](parquetPath).withConf(new Configuration).build()
> reader.read()
> PARQUET-651 changed the method isElementType() to rely on Avro's 
> checkReaderWriterCompatibility() to check compatibility. However, 
> checkReaderWriterCompatibility() considers the Parquet schema and the 
> Avro schema (converted from the file schema) incompatible: the name in the 
> Avro schema is ‘phones_items’, but the name in the Parquet schema is 
> ‘array’. isElementType() therefore returns false, which causes the 
> “phone_number” field in the above schema to be treated as a group type, 
> which it is not. The exception is then thrown at .asGroupType(). 
> I didn’t check whether writing via parquet 1.10.1 reproduces the same 
> problem. It could, because the translation of Avro schema to Parquet schema 
> has not changed (not verified yet). 
> I hesitate to revert PARQUET-651 because it solved several problems. I would 
> like to hear the community's thoughts on it. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
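
The name mismatch described in the issue above can be illustrated with a
deliberately simplified model. This is not Avro's
checkReaderWriterCompatibility() algorithm and not parquet-avro code: the
function records_compatible and the dict-based schemas are hypothetical, and
the only rule modelled is that two record schemas with different names are
treated as incompatible even when their fields are identical.

```python
from typing import Any, Dict


def records_compatible(reader: Dict[str, Any], writer: Dict[str, Any]) -> bool:
    """Toy rule: records match only if their names match and the reader's
    fields are all present in the writer."""
    if reader.get("type") != "record" or writer.get("type") != "record":
        return False
    if reader["name"] != writer["name"]:
        return False
    reader_fields = {f["name"] for f in reader["fields"]}
    writer_fields = {f["name"] for f in writer["fields"]}
    return reader_fields <= writer_fields


fields = [{"name": "phone_number", "type": ["null", "string"]}]
avro_side = {"type": "record", "name": "phones_items", "fields": fields}
parquet_side = {"type": "record", "name": "array", "fields": fields}

same_name = records_compatible(avro_side, avro_side)
different_name = records_compatible(avro_side, parquet_side)
```

In this toy model the second check fails purely because of the record name,
which mirrors the ‘phones_items’ versus ‘array’ situation in the report.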


ASF Board Report Draft Review

2022-03-31 Thread Xinli shang
Hi all,

The report below is the draft version of the April report for the Apache
Parquet community. Please share your thoughts and I will send them to the
ASF board after your review.

## Description:
The mission of Parquet is the creation and maintenance of software related
to columnar storage format available to any project in the Apache Hadoop
ecosystem

## Issues:
No issues found

## Membership Data:
Apache Parquet was founded 2015-04-21 (7 years ago)
There are currently 37 committers and 27 PMC members in this project.
The Committer-to-PMC ratio is roughly 5:4.

Community changes, past quarter:
- No new PMC members. Last addition was Gidon Gershinsky on 2021-11-23.
- No new committers. Last addition was Gidon Gershinsky on 2021-04-05.

## Project Activity:
Recent releases:
MR-1.11.2 was released on 2021-10-06
MR-1.12.2 was released on 2021-10-06
MR-1.12.0 was released on 2021-03-25
New website parquet.apache.org was launched in March 2022.

## Community Health:
25 issues opened in JIRA, past quarter (150% increase)
18 issues closed in JIRA, past quarter (63% increase)
30 commits in the past quarter (172% increase)
10 code contributors in the past quarter (42% increase)
32 PRs opened on GitHub, past quarter (190% increase)
29 PRs closed on GitHub, past quarter (163% increase)
dev@parquet.apache.org had a 65% decrease in traffic in the past quarter

-- 
Xinli Shang


Re: Parquet Website Launched

2022-03-25 Thread Xinli shang
Thank you so much Vinoo for working on this!

On Fri, Mar 25, 2022 at 8:09 AM Vinoo Ganesh  wrote:

> Hi All,
>   I'm excited to announce the launch of the new Parquet website -
> https://parquet.apache.org/. The new website uses Hugo
> <https://gohugo.io/> and is backed by the Docsy <https://www.docsy.dev/>
> theme.
>
> The new website simplifies both the documentation process, with support
> for creating PRs to update/modify the documentation directly from the
> website, as well as the release documentation process, where each release
> is a new blog post.
>
> Documentation for the development/release process of the website can be
> found here:
> https://github.com/apache/parquet-site/tree/production#website-development-and-deployment
> .
>
> Thanks to Xinli for his help getting this over the finish line.
>
> Please let me know if you have any feedback or feature requests.
>
> Thanks,
> Vinoo Ganesh | vinoo.gan...@gmail.com
>
> 
>


-- 
Xinli Shang


Parquet sync meeting notes 3/23/2022

2022-03-23 Thread Xinli shang
Attendees: Jorge (from Munin Data), Gidon, Huaxin, Vinoo, Xinli

   1. Cell level encryption
      1. Formal design
         <https://docs.google.com/document/d/1PUonl9i_fVlRhUmqEmWBQJ8zesX7mlvnu3ubemT11rk/edit>
         is sent out
      2. We choose the 2nd option - splitting columns because it doesn’t need
         specification change
      3. Implementation is going on
      4. Create a feature branch for review
   2. Column resolution by ID (pr
      <https://github.com/apache/parquet-mr/pull/950>)
      1. The ‘field_id’ in the schema is used.
      2. Column resolution by ID relies on unique IDs, but uniqueness might
         not be guaranteed. We might need a place to record a flag saying
         that this Parquet file is resolvable by column ID
      3. Concat tool might be a problem
      4. Need some help from Iceberg
   3. Parquet writer for Iceberg (Adding a new constructor)
      1. A diff will be sent out soon
   4. New website (link <https://parquet.staged.apache.org/>)
      1. Looks good, will make it formal


-- 
Xinli Shang
VP Apache Parquet PMC Chair, Tech Lead Manager at Uber Data Infra
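
The uniqueness concern raised under "Column resolution by ID" in the notes
above can be sketched as follows. This is a minimal illustration, not the
code in the linked PR: Column, index_by_field_id, and resolve are hypothetical
names, and a schema is modelled as a flat list of (name, field_id) pairs.

```python
from typing import Dict, List, NamedTuple


class Column(NamedTuple):
    name: str
    field_id: int


def index_by_field_id(columns: List[Column]) -> Dict[int, Column]:
    """Build a field_id -> column map, rejecting reused IDs."""
    index: Dict[int, Column] = {}
    for col in columns:
        if col.field_id in index:
            raise ValueError(
                f"duplicate field_id {col.field_id}: "
                f"{index[col.field_id].name!r} and {col.name!r}"
            )
        index[col.field_id] = col
    return index


def resolve(columns: List[Column], wanted_ids: List[int]) -> List[str]:
    """Return the file's column names for the requested IDs, skipping
    IDs that the file does not contain."""
    index = index_by_field_id(columns)
    return [index[i].name for i in wanted_ids if i in index]


# The file was written before "user_name" was renamed to "name";
# resolving by ID still finds the column.
file_schema = [Column("id", 1), Column("user_name", 2)]
resolved = resolve(file_schema, [2, 3])
```

A file produced by concatenating inputs with unrelated ID assignments is one
way such duplicates could appear, which may be why a per-file flag was
suggested.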


Look for protobuf reviewers for PR-900

2022-03-20 Thread Xinli shang
Hi all,

We have a PR <https://github.com/apache/parquet-mr/pull/900> related to
Protobuf pending review. We are looking for people who are familiar with
Protobuf to review the change. If you can help, please review. Thanks.

--
Xinli Shang


[jira] [Commented] (PARQUET-1595) Parquet proto writer de-nest Protobuf wrapper classes

2022-03-20 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509500#comment-17509500
 ] 

Xinli Shang commented on PARQUET-1595:
--

Is it a typo for Int32Value -> int64?



> Parquet proto writer de-nest Protobuf wrapper classes
> -
>
> Key: PARQUET-1595
> URL: https://issues.apache.org/jira/browse/PARQUET-1595
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Ying Xu
>Priority: Major
>
> Existing Parquet protobuf writer support preserves the structure of any 
> Protobuf Message objects.  This works well in most cases. However, when 
> dealing with [Protobuf wrapper 
> messages|https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/wrappers.proto],
>  users may prefer directly writing the de-nested value into the Parquet 
> files, for ease of querying them directly (in query engine such as 
> Hive/Presto). 
> Proposal: 
>  * Implement a control flag, e.g., enableDenestingWrappers, to control 
> whether or not to denest Protobuf wrapper classes. 
>  * When this flag is set to true, write the Protobuf wrapper classes as 
> single primitive fields, based on the type of the wrapped *value* field.
>   
> ||Protobuf Type||Parquet Type||
> |BoolValue|boolean|
> |BytesValue|binary|
> |DoubleValue|double|
> |FloatValue|float|
> |Int32Value|int64 (32-bit, signed)|
> |Int64Value|int64 (64-bit, signed)|
> |StringValue|binary (string)|
> |UInt32Value|int64 (32-bit, unsigned)|
> |UInt64Value|int64 (64-bit, unsigned)|
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
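
The de-nesting behaviour proposed in the issue above can be sketched as
below. This is a toy model and not the parquet-protobuf writer: messages are
plain dicts, a wrapper is a dict with a single "value" entry, and the names
denest, field_types, and enable_denesting_wrappers are assumptions made for
this sketch (the flag name follows the "enableDenestingWrappers" example in
the proposal).

```python
from typing import Any, Dict

# Wrapper message names taken from the table in the proposal.
WRAPPER_TYPES = {
    "BoolValue", "BytesValue", "DoubleValue", "FloatValue", "Int32Value",
    "Int64Value", "StringValue", "UInt32Value", "UInt64Value",
}


def denest(message: Dict[str, Any], field_types: Dict[str, str],
           enable_denesting_wrappers: bool) -> Dict[str, Any]:
    """Replace wrapper-typed fields with their inner value when enabled."""
    if not enable_denesting_wrappers:
        return dict(message)
    out: Dict[str, Any] = {}
    for name, value in message.items():
        if field_types.get(name) in WRAPPER_TYPES and isinstance(value, dict):
            out[name] = value.get("value")
        else:
            out[name] = value
    return out


record = {"id": 7, "nickname": {"value": "pq"}, "address": {"city": "Oslo"}}
types = {"id": "int64", "nickname": "StringValue", "address": "Address"}

nested = denest(record, types, enable_denesting_wrappers=False)
flat = denest(record, types, enable_denesting_wrappers=True)
```

With the flag on, only the wrapper-typed field collapses to a primitive;
ordinary nested messages such as "address" keep their structure.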


Please review the design of Parquet-2116: Cell Level Encryption

2022-03-12 Thread Xinli shang
Hi all,

We just drafted a formal version
<https://docs.google.com/document/d/1PUonl9i_fVlRhUmqEmWBQJ8zesX7mlvnu3ubemT11rk/edit#heading=h.kkuoyw5u0ywe>
of
the design based on the pre-design
<https://docs.google.com/document/d/1Q-d98Os_aJahUynznPrWvXwWQeN0aFDRhZj3hXt_JOM/edit#heading=h.kkuoyw5u0ywe>
discussion. Any feedback is welcome! Feel free to comment on the
document directly. Thanks.

-- 
Xinli Shang


[jira] [Updated] (PARQUET-2116) Cell Level Encryption

2022-03-12 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang updated PARQUET-2116:
-
External issue URL: 
https://docs.google.com/document/d/1PUonl9i_fVlRhUmqEmWBQJ8zesX7mlvnu3ubemT11rk/edit#heading=h.kkuoyw5u0ywe
  (was: 
https://docs.google.com/document/d/1Q-d98Os_aJahUynznPrWvXwWQeN0aFDRhZj3hXt_JOM/edit#)

> Cell Level Encryption 
> --
>
> Key: PARQUET-2116
> URL: https://issues.apache.org/jira/browse/PARQUET-2116
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>    Reporter: Xinli Shang
>    Assignee: Xinli Shang
>Priority: Major
>
> Cell level encryption can do finer-grained encryption than modular 
> encryption (Parquet-1178) or file encryption. The idea is that only some 
> values in a column are encrypted, based on a filter expression. For example, 
> for a table with columns a, b, c.x, c.y, d, we can encrypt columns a and c.x 
> in rows where d == 5 and c.y > 0.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
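
The filter-driven selection in the example above (encrypt a and c.x where
d == 5 and c.y > 0) can be sketched as follows. This is a minimal sketch of
the selection logic only, not the design in the linked document: encrypt_cells
is a hypothetical name, rows are flat dicts keyed by column path, and the
"encrypt" callable used in the demo is a visible placeholder, not real
cryptography.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def encrypt_cells(rows: List[Row], columns: List[str],
                  predicate: Callable[[Row], bool],
                  encrypt: Callable[[Any], Any]) -> List[Row]:
    """Encrypt the given columns only in rows that satisfy the predicate."""
    out: List[Row] = []
    for row in rows:
        new_row = dict(row)  # leave the caller's rows untouched
        if predicate(row):
            for col in columns:
                new_row[col] = encrypt(row[col])
        out.append(new_row)
    return out


rows = [
    {"a": 1, "b": 2, "c.x": 3, "c.y": 4, "d": 5},
    {"a": 6, "b": 7, "c.x": 8, "c.y": 9, "d": 0},
]
protected = encrypt_cells(
    rows,
    columns=["a", "c.x"],
    predicate=lambda r: r["d"] == 5 and r["c.y"] > 0,
    encrypt=lambda v: f"ENC({v})",  # placeholder marker, NOT encryption
)
```

Only the first row matches the filter, so only its a and c.x cells are
replaced; every other cell stays in plaintext.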


Re: Meeting notes for Parquet sync meeting - March 1st. 2022

2022-03-11 Thread Xinli shang
We don't have a detailed plan/timeline yet but we are aiming for the
next one or two months.

On Thu, Mar 10, 2022 at 4:36 PM Prakhar Jain 
wrote:

> Hi All
>   Thanks for sharing the meeting notes. Do we have a tentative timeline in
> mind for the new version of Parquet-MR? Also will it be a major/minor/or a
> patch release.
>
> Thanks and Regards
> Prakhar Jain
>
>
> On Tue, Mar 1, 2022 at 9:34 AM Xinli shang 
> wrote:
>
> > 3/1/2022
> >
> > Attendees: Xinli Shang, Gidon Gershinsky, Vinoo Ganesh
> >
> >    1. The new website of Apache Parquet is to be launched
> >       1. https://www.vinoo.io/
> >       2. Vinoo to send out an email to dev@ for a preview
> >    2. Cell level encryption
> >       1. Objective/Goals need to be clear
> >       2. Performance
> >       3. Avoid changing specification
> >    3. Data masking
> >       1. The PR is to be ready to review soon.
> >    4. Release new version of Parquet
> >       1. Blocked on ID resolution change. Need to ping.
> >
> > --
> > Xinli Shang
> >
>


-- 
Xinli Shang


Two blogs about Apache Parquet were just published on the Uber EngBlog site

2022-03-11 Thread Xinli shang
Hi all,

The Uber EngBlog site just published two articles about Apache Parquet: Cost
Efficiency @ Scale in Big Data File Format
<https://eng.uber.com/cost-efficiency-big-data/> and One Stone, Three
Birds: Finer-Grained Encryption @ Apache Parquet™
<https://eng.uber.com/one-stone-three-birds-finer-grained-encryption-apache-parquet/>.
Please check them out!


The first one is about how to use Parquet ZSTD, the Column Pruning (deletion)
tool, Precision Reduction, Multi-Column Ordering, and the fast translation tool
in Parquet to reduce storage space and improve cost efficiency. This project
alone saves storage at the hundred-PB level, which is equivalent to
several million dollars in savings per year.

The second one talks about using Apache Parquet's fine-grained encryption
feature to solve three challenges: encryption, access control, and data
retention! This wraps up the work we have done with the community in the
last 3 years around Parquet Modular Encryption. I would like to thank Gidon
for his continued collaboration with us!

If you have any questions about the blog, feel free to reach out!

Xinli Shang

Tech Lead Manager at Uber Data Infra

VP Apache Parquet PMC Chair


Meeting notes for Parquet sync meeting - March 1st. 2022

2022-03-01 Thread Xinli shang
3/1/2022

Attendees: Xinli Shang, Gidon Gershinsky, Vinoo Ganesh

   1. The new website of Apache Parquet is to be launched
      1. https://www.vinoo.io/
      2. Vinoo to send out an email to dev@ for a preview
   2. Cell level encryption
      1. Objective/Goals need to be clear
      2. Performance
      3. Avoid changing specification
   3. Data masking
      1. The PR is to be ready to review soon.
   4. Release new version of Parquet
      1. Blocked on ID resolution change. Need to ping.

-- 
Xinli Shang


Re: Get uncompressed size of parquet file via parquet-cli

2022-02-20 Thread Xinli shang
You seem to be right. 'uncompressedSize' is assigned a value but never printed
anywhere. Would you like to submit a fix?

On Thu, Feb 17, 2022 at 3:29 AM Deepak Gangwar  wrote:

> Hi folks,
>
> I was using parquet-tools to see the data or metadata of parquet files. I
> noticed that parquet-tools has been deprecated and removed from the latest
> branch and it is replaced by parquet-cli. Most of my use-cases are
> fulfilled by parquet-cli but there is 1 thing missing in parquet-cli. I am
> not able to find any way to get the uncompressed size of the data present.
> “parquet-tools size -u” gave the uncompressed size but there is no
> equivalent parquet-cli command and “parquet-cli meta” only prints the
> compressed size.
>
> I looked around in the codebase and noticed that uncompressedSize is
> assigned to a variable in meta command but it is not used or printed
> anywhere [1]. I think usage of the variable is missed but I am not able to
> find any open issue in jira so I might be completely wrong here. Please
> confirm whether this is actually an issue and is there any other way to get
> uncompressed size that I am missing?
>
>
> [1]
> https://github.com/apache/parquet-mr/blob/master/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ParquetMetadataCommand.java#L123
> --
> Thanks & Regards
> Deepak Gangwar
>
>

-- 
Xinli Shang
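
For readers following this thread: the number being asked for is simply the
sum of the per-column-chunk uncompressed sizes recorded in the file footer.
The sketch below illustrates that sum in isolation. ChunkSizes and SizeReport
are hypothetical names, not parquet-cli or parquet-mr classes, and the byte
counts are made up.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for one column chunk's size metadata. */
final class ChunkSizes {
    final long compressedBytes;
    final long uncompressedBytes;

    ChunkSizes(long compressedBytes, long uncompressedBytes) {
        this.compressedBytes = compressedBytes;
        this.uncompressedBytes = uncompressedBytes;
    }
}

final class SizeReport {
    /** Sums the compressed size of every column chunk. */
    static long totalCompressed(List<ChunkSizes> chunks) {
        long total = 0;
        for (ChunkSizes c : chunks) {
            total += c.compressedBytes;
        }
        return total;
    }

    /** Sums the uncompressed size of every column chunk. */
    static long totalUncompressed(List<ChunkSizes> chunks) {
        long total = 0;
        for (ChunkSizes c : chunks) {
            total += c.uncompressedBytes;
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up sizes for two column chunks.
        List<ChunkSizes> chunks =
            Arrays.asList(new ChunkSizes(120, 400), new ChunkSizes(80, 100));
        System.out.println("compressed size:   " + totalCompressed(chunks));
        System.out.println("uncompressed size: " + totalUncompressed(chunks));
    }
}
```

A fix along these lines would only need to print the second total next to
the compressed size that the meta command already reports.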


[jira] [Comment Edited] (PARQUET-2127) Security risk in latest parquet-jackson-1.12.2.jar

2022-02-17 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494321#comment-17494321
 ] 

Xinli Shang edited comment on PARQUET-2127 at 2/18/22, 2:23 AM:


Thanks for reporting [~phoebemaomao]! Will you be able to come up with the fix? 
I will be happy to review and merge.


was (Author: sha...@uber.com):
Thanks for reporting [~phoebemaomao]! Will you be able to come up with the fix? 
I will be happy to review and merge.. 

> Security risk in latest parquet-jackson-1.12.2.jar
> --
>
> Key: PARQUET-2127
> URL: https://issues.apache.org/jira/browse/PARQUET-2127
> Project: Parquet
>  Issue Type: Improvement
>Reporter: phoebe chen
>Priority: Major
>
> The embedded jackson-databind 2.11.4 has a possible DoS security risk when JDK 
> serialization is used to serialize JsonNode 
> ([https://github.com/FasterXML/jackson-databind/issues/3328]); upgrading to 
> 2.13.1 fixes this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (PARQUET-2127) Security risk in latest parquet-jackson-1.12.2.jar

2022-02-17 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494321#comment-17494321
 ] 

Xinli Shang commented on PARQUET-2127:
--

Thanks for reporting [~phoebemaomao]! Will you be able to come up with the fix? 
I will be happy to review and merge.. 

> Security risk in latest parquet-jackson-1.12.2.jar
> --
>
> Key: PARQUET-2127
> URL: https://issues.apache.org/jira/browse/PARQUET-2127
> Project: Parquet
>  Issue Type: Improvement
>Reporter: phoebe chen
>Priority: Major
>
> The embedded jackson-databind 2.11.4 has a possible DoS security risk when JDK 
> serialization is used to serialize JsonNode 
> ([https://github.com/FasterXML/jackson-databind/issues/3328]); upgrading to 
> 2.13.1 fixes this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (PARQUET-2122) Adding Bloom filter to small Parquet file bloats in size X1700

2022-02-14 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492099#comment-17492099
 ] 

Xinli Shang edited comment on PARQUET-2122 at 2/14/22, 4:56 PM:


[~junjie] Do you know why? 


was (Author: sha...@uber.com):
[~junjie]Do you know why? 

> Adding Bloom filter to small Parquet file bloats in size X1700
> --
>
> Key: PARQUET-2122
> URL: https://issues.apache.org/jira/browse/PARQUET-2122
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli, parquet-mr
>Affects Versions: 1.13.0
>Reporter: Ze'ev Maor
>Priority: Critical
> Attachments: data.csv, data_index_bloom.parquet
>
>
> Converting a small CSV file (14 rows, 1 string column) to Parquet without a bloom 
> filter yields a 600 B file; adding '.withBloomFilterEnabled(true)' to 
> ParquetWriter yields a 1049197 B file.
> It isn't clear what the extra space is used for.
> Attached are the CSV and the bloated Parquet file.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (PARQUET-2122) Adding Bloom filter to small Parquet file bloats in size X1700

2022-02-14 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492099#comment-17492099
 ] 

Xinli Shang commented on PARQUET-2122:
--

[~junjie]Do you know why? 

> Adding Bloom filter to small Parquet file bloats in size X1700
> --
>
> Key: PARQUET-2122
> URL: https://issues.apache.org/jira/browse/PARQUET-2122
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli, parquet-mr
>Affects Versions: 1.13.0
>Reporter: Ze'ev Maor
>Priority: Critical
> Attachments: data.csv, data_index_bloom.parquet
>
>
> Converting a small CSV file (14 rows, 1 string column) to Parquet without a bloom 
> filter yields a 600 B file; adding '.withBloomFilterEnabled(true)' to 
> ParquetWriter yields a 1049197 B file.
> It isn't clear what the extra space is used for.
> Attached are the CSV and the bloated Parquet file.
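
A quick arithmetic check on the two sizes reported above. The 1 MiB constant
is an assumption used only for comparison; it is not stated anywhere in this
issue. The sketch shows that the extra space is within a few bytes of 1 MiB,
which would be consistent with (but does not prove) a fixed-size bloom filter
being written regardless of the row count.

```java
final class BloomBloat {
    static final long WITHOUT_BLOOM = 600L;        // bytes, from the report
    static final long WITH_BLOOM = 1_049_197L;     // bytes, from the report
    static final long ONE_MIB = 1024L * 1024L;     // assumed reference size

    /** Bytes added by enabling the bloom filter. */
    static long extraBytes() {
        return WITH_BLOOM - WITHOUT_BLOOM;
    }

    /** Whole-number growth factor, matching the "X1700" in the title. */
    static long growthFactor() {
        return WITH_BLOOM / WITHOUT_BLOOM;
    }

    /** How far the extra space is from exactly 1 MiB. */
    static long distanceFromOneMib() {
        return extraBytes() - ONE_MIB;
    }

    public static void main(String[] args) {
        System.out.println("extra bytes:        " + extraBytes());
        System.out.println("growth factor:      " + growthFactor());
        System.out.println("extra minus 1 MiB:  " + distanceFromOneMib());
    }
}
```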



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: Parquet Column Resolution by ID

2022-02-11 Thread Xinli shang
Hi Gidon,

I just enabled the 'comment' permission for everyone. Let me know if you
still have issues with it.

Xinli

On Thu, Feb 10, 2022 at 9:45 PM Gidon Gershinsky  wrote:

> Hi Huaxin,
>
> Can you open this document for comments?
>
> Cheers, Gidon
>
>
> On Fri, Feb 11, 2022 at 6:01 AM huaxin gao  wrote:
>
> > Hi Parquet community,
> >
> > Xinli and I drafted a design doc to support ID based column resolution in
> > Parquet. Here is the link
> > <
> >
> https://docs.google.com/document/d/1hDLFIKuVhhnTNpA5bTo4nfD-MUZz8Iq4V9FXrr1WPsw/edit?usp=sharing
> > >.
> > We'd like to start a discussion on the doc and any feedback is welcome!
> >
> > Thanks,
> > Huaxin
> >
>


-- 
Xinli Shang
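
To make the idea in this thread concrete, here is a minimal sketch of what
resolving a column by field ID rather than by name means. It is an
illustration only, not the design in the linked doc; ColumnResolver and the
example schemas are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class ColumnResolver {
    /**
     * Looks up the column a reader asked for by its field ID.
     * Returns the name the column has in this particular file, or null
     * if the file has no column with that ID.
     */
    static String resolveById(Map<Integer, String> fileColumnsById, int requestedId) {
        return fileColumnsById.get(requestedId);
    }

    public static void main(String[] args) {
        // An older file written before a rename, and a newer one written after it.
        Map<Integer, String> oldFile = new HashMap<>();
        oldFile.put(1, "id");
        oldFile.put(2, "user_name");
        Map<Integer, String> newFile = new HashMap<>();
        newFile.put(1, "id");
        newFile.put(2, "username");

        // Field ID 2 finds the column in both files even though the names differ.
        System.out.println(resolveById(oldFile, 2));
        System.out.println(resolveById(newFile, 2));
    }
}
```

Name-based resolution would miss the column in one of the two files after
the rename; ID-based resolution does not depend on the name at all.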


[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-02 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485949#comment-17485949
 ] 

Xinli Shang commented on PARQUET-2117:
--

Thanks for opening this Jira! Look forward to the PR.

> Add rowPosition API in parquet record readers
> -
>
> Key: PARQUET-2117
> URL: https://issues.apache.org/jira/browse/PARQUET-2117
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Prakhar Jain
>Priority: Major
> Fix For: 1.13.0
>
>
> Currently the parquet-mr RecordReader/ParquetFileReader exposes APIs to read a 
> Parquet file in columnar fashion or record by record.
> It would be great to extend them to also support a rowPosition API that 
> reports the position of the current record in the Parquet file.
> The rowPosition can be used as a unique row identifier to mark a row. This 
> can be useful for creating an index (e.g. B+ tree) over a Parquet file or Parquet 
> table (e.g. Spark/Hive).
> There are multiple projects in the Parquet ecosystem that can benefit from 
> such functionality: 
>  # Apache Iceberg needs this functionality. It has this implementation 
> already as it relies on low level parquet APIs -  
> [Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171],
>  
> [Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
>  # Apache Spark can use this functionality - SPARK-37980
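
One way to picture the requested rowPosition is as the starting row of the
current row group plus the record's index inside that group. The sketch below
shows that calculation only; RowPositions is a hypothetical helper, not the
API proposed in this issue, and the row counts are made up.

```java
final class RowPositions {
    /** Returns the file-level index of the first row of each row group. */
    static long[] rowGroupStarts(long[] rowCounts) {
        long[] starts = new long[rowCounts.length];
        long running = 0;
        for (int i = 0; i < rowCounts.length; i++) {
            starts[i] = running;        // rows before this group
            running += rowCounts[i];
        }
        return starts;
    }

    /** Position of a record in the whole file, counting from zero. */
    static long rowPosition(long[] starts, int rowGroup, long indexInGroup) {
        return starts[rowGroup] + indexInGroup;
    }

    public static void main(String[] args) {
        long[] starts = rowGroupStarts(new long[] {100, 50, 25});
        // Fourth record (index 3) of the third row group (index 2).
        System.out.println(rowPosition(starts, 2, 3));
    }
}
```

Because the starts depend only on row counts stored in the footer, the
position stays stable even when a reader skips whole row groups.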



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: Parquet sync meeting notes - 1/26/2022

2022-01-27 Thread Xinli shang
Here
<https://docs.google.com/document/d/1Q-d98Os_aJahUynznPrWvXwWQeN0aFDRhZj3hXt_JOM>
is the link to the cell-level encryption pre-design. Feel free to leave
feedback directly in the document by adding comments.

On Wed, Jan 26, 2022 at 9:51 AM Xinli shang  wrote:

> 1/26/2022
>
> Attendees: Xinli Shang, Gidon Gershinsky, Pavi Subenderan, Jason Zhang
>
>    1. Data masking
>       1. Pavi: Will create a PR by next week
>       2. PARQUET-2062 <https://issues.apache.org/jira/browse/PARQUET-2062>
>       3. Will have a high-level design sent out soon
>    2. Cell level encryption
>       1. Xinli: Will send out the draft design soon
>       2. Key questions: Should we have the same key for all the cells in
>          the same column? It could generate millions of keys if we do it.
>       3. There are two options explored: 1) use FPE to encrypt in place,
>          2) add extra columns to utilize existing modular encryption.
>          Will have them in the design.
>    3. Release of 1.13.0
>       1. Data masking (null)
>          1. PARQUET-2062 <https://issues.apache.org/jira/browse/PARQUET-2062>
>             will be done in a few weeks.
>       2. ID resolution instead of name
>          1. PARQUET-2006 <https://issues.apache.org/jira/browse/PARQUET-2006>:
>             need to see if it needs a specification change, and the scope
>             of the change and ETA. We will then decide whether to include
>             it in 1.13.0.
>
>
>
> Xinli Shang
> Apache Parquet PMC Chair
> Tech Lead Manager at Uber Data Infra
>
>
>

-- 
Xinli Shang


[jira] [Updated] (PARQUET-2116) Cell Level Encryption

2022-01-27 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang updated PARQUET-2116:
-
External issue URL: 
https://docs.google.com/document/d/1Q-d98Os_aJahUynznPrWvXwWQeN0aFDRhZj3hXt_JOM/edit#

> Cell Level Encryption 
> --
>
> Key: PARQUET-2116
> URL: https://issues.apache.org/jira/browse/PARQUET-2116
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>    Reporter: Xinli Shang
>    Assignee: Xinli Shang
>Priority: Major
>
> Cell level encryption can do finer-grained encryption than modular 
> encryption (PARQUET-1178) or file encryption. The idea is that only some cells 
> inside a column are encrypted, based on a filter expression. For example, in a 
> table with columns a, b, c.x, c.y, and d, we can encrypt columns a and c.x 
> in the rows where d == 5 and c.y > 0.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (PARQUET-2116) Cell Level Encryption

2022-01-27 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-2116:


 Summary: Cell Level Encryption 
 Key: PARQUET-2116
 URL: https://issues.apache.org/jira/browse/PARQUET-2116
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Xinli Shang
Assignee: Xinli Shang


Cell level encryption can do finer-grained encryption than modular 
encryption (PARQUET-1178) or file encryption. The idea is that only some cells 
inside a column are encrypted, based on a filter expression. For example, in a 
table with columns a, b, c.x, c.y, and d, we can encrypt columns a and c.x 
in the rows where d == 5 and c.y > 0.
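
The example above can be sketched as a per-row decision about which cells to
encrypt. This is an illustration of the selection logic only; no encryption
is performed, and CellRow and CellSelector are hypothetical names that do not
come from the design doc or parquet-mr.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical row for a table with columns a, b, c.x, c.y, d. */
final class CellRow {
    final String a;
    final String b;
    final String cx;
    final int cy;
    final int d;

    CellRow(String a, String b, String cx, int cy, int d) {
        this.a = a;
        this.b = b;
        this.cx = cx;
        this.cy = cy;
        this.d = d;
    }
}

final class CellSelector {
    /** The filter expression from the description: d == 5 and c.y > 0. */
    static boolean matchesFilter(CellRow row) {
        return row.d == 5 && row.cy > 0;
    }

    /** Names of the columns whose cells would be encrypted for this row. */
    static List<String> cellsToEncrypt(CellRow row) {
        if (matchesFilter(row)) {
            return Arrays.asList("a", "c.x");
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        System.out.println(cellsToEncrypt(new CellRow("a1", "b1", "x1", 1, 5)));
        System.out.println(cellsToEncrypt(new CellRow("a2", "b2", "x2", 0, 5)));
    }
}
```

The point of the sketch is that the decision is made per row, so two cells of
the same column can end up in different states.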




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (PARQUET-2091) Fix release build error introduced by PARQUET-2043

2022-01-27 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang resolved PARQUET-2091.
--
Resolution: Won't Fix

> Fix release build error introduced by PARQUET-2043
> --
>
> Key: PARQUET-2091
> URL: https://issues.apache.org/jira/browse/PARQUET-2091
> Project: Parquet
>  Issue Type: Task
>Reporter: Xinli Shang
>    Assignee: Xinli Shang
>Priority: Major
>
> After PARQUET-2043, building a release such as 1.12.1 fails with a 
> 'used undeclared dependency' error. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (PARQUET-2098) Add more methods into interface of BlockCipher

2022-01-27 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483225#comment-17483225
 ] 

Xinli Shang commented on PARQUET-2098:
--

[~gershinsky] Do you have time to work on this, as we discussed, so that we can release the new 
version?

> Add more methods into interface of BlockCipher
> --
>
> Key: PARQUET-2098
> URL: https://issues.apache.org/jira/browse/PARQUET-2098
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>    Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> Currently the BlockCipher interface has methods that do not let the caller specify 
> a length/offset. In some use cases, such as Presto, the caller needs to pass in a byte 
> array in which the data to be encrypted occupies only part of the array. So 
> we need to add new methods, something like the one below for decrypt. Similar 
> methods might be needed for encrypt. 
> byte[] decrypt(byte[] ciphertext, int cipherTextOffset, int cipherTextLength, 
> byte[] aad);
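
A minimal sketch of what an offset/length decrypt could look like, using only
the JDK's AES-GCM support. This is not the parquet-mr BlockCipher
implementation: RangeCipherSketch is a hypothetical class, and the
"nonce followed by ciphertext and tag" buffer layout is an assumption made
for the example.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

final class RangeCipherSketch {
    private static final int NONCE_LEN = 12;   // bytes
    private static final int TAG_BITS = 128;
    private final SecretKeySpec key;

    RangeCipherSketch(byte[] keyBytes) {
        this.key = new SecretKeySpec(keyBytes, "AES");
    }

    /** Encrypts plaintext and returns nonce || ciphertext || tag. */
    byte[] encrypt(byte[] plaintext, byte[] aad) {
        try {
            byte[] nonce = new byte[NONCE_LEN];
            new SecureRandom().nextBytes(nonce);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, nonce));
            if (aad != null) {
                cipher.updateAAD(aad);
            }
            byte[] body = cipher.doFinal(plaintext);
            byte[] out = new byte[NONCE_LEN + body.length];
            System.arraycopy(nonce, 0, out, 0, NONCE_LEN);
            System.arraycopy(body, 0, out, NONCE_LEN, body.length);
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("encryption failed", e);
        }
    }

    /** Decrypts only the slice [offset, offset + length) of a larger buffer. */
    byte[] decrypt(byte[] buffer, int offset, int length, byte[] aad) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            // The nonce is read in place from the start of the slice.
            cipher.init(Cipher.DECRYPT_MODE, key,
                new GCMParameterSpec(TAG_BITS, buffer, offset, NONCE_LEN));
            if (aad != null) {
                cipher.updateAAD(aad);
            }
            return cipher.doFinal(buffer, offset + NONCE_LEN, length - NONCE_LEN);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("decryption failed", e);
        }
    }
}
```

The caller can hand over a shared buffer without first copying the encrypted
region into an array of its own, which is the use case described above.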



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (PARQUET-2112) Fix typo in MessageColumnIO

2022-01-27 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang resolved PARQUET-2112.
--
Resolution: Fixed

> Fix typo in MessageColumnIO
> ---
>
> Key: PARQUET-2112
> URL: https://issues.apache.org/jira/browse/PARQUET-2112
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.2
>    Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.13.0
>
>
> Typo of the variable 'BitSet vistedIndexes'. Change it to 'visitedIndexes'



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Parquet sync meeting notes - 1/26/2022

2022-01-26 Thread Xinli shang
1/26/2022

Attendees: Xinli Shang, Gidon Gershinsky, Pavi Subenderan, Jason Zhang

   1. Data masking
      1. Pavi: Will create a PR by next week
      2. PARQUET-2062 <https://issues.apache.org/jira/browse/PARQUET-2062>
      3. Will have a high-level design sent out soon
   2. Cell level encryption
      1. Xinli: Will send out the draft design soon
      2. Key questions: Should we have the same key for all the cells in
         the same column? It could generate millions of keys if we do it.
      3. There are two options explored: 1) use FPE to encrypt in place,
         2) add extra columns to utilize existing modular encryption.
         Will have them in the design.
   3. Release of 1.13.0
      1. Data masking (null)
         1. PARQUET-2062 <https://issues.apache.org/jira/browse/PARQUET-2062>
            will be done in a few weeks.
      2. ID resolution instead of name
         1. PARQUET-2006 <https://issues.apache.org/jira/browse/PARQUET-2006>:
            need to see if it needs a specification change, and the scope
            of the change and ETA. We will then decide whether to include
            it in 1.13.0.



Xinli Shang
Apache Parquet PMC Chair
Tech Lead Manager at Uber Data Infra


[jira] [Created] (PARQUET-2112) Fix typo in MessageColumnIO

2022-01-22 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-2112:


 Summary: Fix typo in MessageColumnIO
 Key: PARQUET-2112
 URL: https://issues.apache.org/jira/browse/PARQUET-2112
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.12.2
Reporter: Xinli Shang
Assignee: Xinli Shang
 Fix For: 1.13.0


Typo of the variable 'BitSet vistedIndexes'. Change it to 'visitedIndexes'



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: To be a Parquet contributor

2022-01-21 Thread Xinli shang
Welcome, Jiashen! You can subscribe to the mailing list at
https://parquet.apache.org/community.

I just added you to the meeting invitation!



On Fri, Jan 21, 2022 at 4:44 PM jiashen zhang 
wrote:

> Hi Parquet Experts,
>
> I am Jiashen Zhang (https://www.linkedin.com/in/jiashen-zhang/). I am
> really interested in Parquet and I would like to join our Parquet
> community, could you help pull me into our community, such as inviting to
> the channel or meetings etc?
>
> --
> Thanks,
> Jiashen
>


-- 
Xinli Shang


[jira] [Commented] (PARQUET-2111) Support limit push down and stop early for RecordReader

2022-01-21 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480128#comment-17480128
 ] 

Xinli Shang commented on PARQUET-2111:
--

Look forward to the PR

> Support limit push down and stop early for RecordReader
> ---
>
> Key: PARQUET-2111
> URL: https://issues.apache.org/jira/browse/PARQUET-2111
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Jackey Lee
>Priority: Major
>
> With limit push-down, the reader can stop scanning a Parquet file early and reduce network 
> and disk I/O.
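
The early-stop behaviour can be sketched with a plain iterator standing in
for a record reader. LimitedRead is a hypothetical helper, not part of
parquet-mr; it only shows that nothing past the limit is pulled from the
source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class LimitedRead {
    /** Reads at most limit records and leaves the rest of the source untouched. */
    static <T> List<T> readAtMost(Iterator<T> records, int limit) {
        List<T> out = new ArrayList<>();
        // The limit is checked first, so no extra record is fetched.
        while (out.size() < limit && records.hasNext()) {
            out.add(records.next());
        }
        return out;
    }

    public static void main(String[] args) {
        Iterator<Integer> source = Arrays.asList(1, 2, 3, 4, 5).iterator();
        System.out.println(readAtMost(source, 2));
        System.out.println("source still has records: " + source.hasNext());
    }
}
```

In a real reader the untouched remainder corresponds to pages or row groups
that never have to be fetched or decoded.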



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (PARQUET-2071) Encryption translation tool

2022-01-14 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang resolved PARQUET-2071.
--
Resolution: Fixed

> Encryption translation tool 
> 
>
> Key: PARQUET-2071
> URL: https://issues.apache.org/jira/browse/PARQUET-2071
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>    Reporter: Xinli Shang
>    Assignee: Xinli Shang
>Priority: Major
>
> To convert existing data to an encrypted state, we could develop a tool 
> like TransCompression that encrypts the data at the page level 
> without decoding it into records and rewriting them. This would speed up the process a lot. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (PARQUET-1872) Add TransCompression Feature

2022-01-14 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang resolved PARQUET-1872.
--
Resolution: Fixed

> Add TransCompression Feature 
> -
>
> Key: PARQUET-1872
> URL: https://issues.apache.org/jira/browse/PARQUET-1872
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>    Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> As ZSTD becomes more popular, there is a need to translate existing data to 
> ZSTD compression, which can achieve a higher compression ratio. It would be 
> useful to have a tool that converts a Parquet file directly by just 
> decompressing/compressing each page without decoding/encoding or assembling 
> the records, because that is much faster. Initial results show it is ~5 times 
> faster. 
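
The page-level idea can be sketched with the JDK's built-in codecs. The JDK
has no ZSTD codec, so GZIP and zlib/DEFLATE stand in for the old and new
codecs here; PageTranscoder is a hypothetical class, not the parquet-mr tool.
The page is treated as opaque bytes and is never decoded into values or
records.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

final class PageTranscoder {
    /** Copies all bytes from in to out. */
    private static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
    }

    /** Compresses raw page bytes with GZIP (stand-in for the old codec). */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(sink)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Re-compresses one GZIP page as zlib/DEFLATE (stand-in for the new codec). */
    static byte[] gzipToDeflate(byte[] gzipPage) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipPage));
             OutputStream out = new DeflaterOutputStream(sink)) {
            copy(in, out);   // bytes in, bytes out; values are never decoded
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Decompresses a zlib/DEFLATE page; used only to check the round trip. */
    static byte[] inflate(byte[] deflatedPage) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(deflatedPage))) {
            copy(in, sink);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }
}
```

Skipping the decode, record assembly, and re-encode steps is where the
speed-up described above would come from.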



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (PARQUET-2105) Refactor the test code of creating the test file

2022-01-14 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang resolved PARQUET-2105.
--
Resolution: Fixed

> Refactor the test code of creating the test file 
> -
>
> Key: PARQUET-2105
> URL: https://issues.apache.org/jira/browse/PARQUET-2105
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>    Reporter: Xinli Shang
>    Assignee: Xinli Shang
>Priority: Major
>
> In the tests, there are many places that need to create a test Parquet file 
> with different settings. Currently, each test class writes its own file-creation code. 
> It would be better to have a test file builder for that. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (PARQUET-1889) Register a MIME type for the Parquet format.

2022-01-11 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17473147#comment-17473147
 ] 

Xinli Shang commented on PARQUET-1889:
--

+1 on [~westonpace]'s point 

> Register a MIME type for the Parquet format.
> 
>
> Key: PARQUET-1889
> URL: https://issues.apache.org/jira/browse/PARQUET-1889
> Project: Parquet
>  Issue Type: Wish
>  Components: parquet-format
>Affects Versions: format-2.7.0
>Reporter: Mark Wood
>Priority: Major
>
> There is currently  no MIME type registered for Parquet.  Perhaps this is 
> intentional.
> If it is not intentional, I suggest steps be taken to register a MIME type 
> with IANA.
>  
> [https://www.iana.org/assignments/media-types/media-types.xhtml]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Review of the ASF Board Report for Parquet

2022-01-05 Thread Xinli shang
Hi all,

Here is Parquet's quarterly report to the ASF board. Please have a look
and reply with your comments before the end of this week.

## Description:
The mission of Parquet is the creation and maintenance of software related
to columnar storage format available to any project in the Apache Hadoop
ecosystem

## Issues:
no

## Membership Data:
Apache Parquet was founded 2015-04-21 (7 years ago)
There are currently 37 committers and 27 PMC members in this project.
The Committer-to-PMC ratio is roughly 5:4.

Community changes, past quarter:
- Gidon Gershinsky was added to the PMC on 2021-11-23
- No new committers. Last addition was Gidon Gershinsky on 2021-04-05.

## Project Activity:
Recent releases:
MR-1.11.2 was released on 2021-10-06.
MR-1.12.2 was released on 2021-10-06.

## Community Health:
dev@parquet.apache.org had a 65% decrease in traffic in the past quarter
9 issues opened in JIRA, past quarter (-75% change)
11 issues closed in JIRA, past quarter (-45% change)
7 commits in the past quarter (-85% change)
7 code contributors in the past quarter (-53% change)
11 PRs opened on GitHub, past quarter (-47% change)

-- 
Xinli Shang


Re: Parquet-tools Replacement

2022-01-04 Thread Xinli shang
That is correct!

On Tue, Jan 4, 2022 at 12:29 PM Vinoo Ganesh  wrote:

> Hi Xinli,
>   Great - thank you! Just to make sure, you mean this right?
> https://github.com/apache/parquet-mr/tree/master/parquet-cli (
> https://mvnrepository.com/artifact/org.apache.parquet/parquet-cli).
>
> Thanks,
> Vinoo Ganesh | vinoo.gan...@gmail.com
>
> 
>
>
> On Tue, Jan 4, 2022 at 12:49 PM Xinli shang 
> wrote:
>
> > Hi Vinoo,
> >
> > Thanks for bringing this up!  Yes, they are deprecated. The recommended
> > replacement is to use Parquet-cli. Let me know if that doesn't work for
> > you!
> >
> > Xinli
> >
> > On Tue, Dec 21, 2021 at 7:11 PM Vinoo Ganesh 
> > wrote:
> >
> > > Hi Parquet Team,
> > >   I was setting up a mirror of my homebrew setup and found this:
> > >
> > >
> >
> https://github.com/Homebrew/homebrew-core/blob/master/Formula/parquet-tools.rb#L20
> > > .
> > > It looks like parquet-tools has been marked as deprecated in this
> commit:
> > > https://github.com/Homebrew/homebrew-core/pull/73909. I just found
> this
> > > ticket: https://issues.apache.org/jira/browse/PARQUET-1666 too. Is
> > there a
> > > recommended replacement for parquet-tools? If so, could someone point
> me
> > to
> > > it? Thanks!
> > >
> > > Thanks,
> > > Vinoo Ganesh | vinoo.gan...@gmail.com
> > >
> > > 
> > >
> >
> >
> > --
> > Xinli Shang
> >
>


-- 
Xinli Shang

