Re: Interest in Parquet V3

2024-05-17 Thread Rok Mihevc
Hi all, I've discussed with my colleagues and we would dedicate two engineers for 4-6 months on tasks related to implementing the format changes. We're already active in design discussions and can help with C++, Rust and C# implementations. I thought it'd be good to state this explicitly FWIW.

Re: [DISCUSS] rename parquet-mr to parquet-java?

2024-05-17 Thread Prem Sahoo
+1, as it will be an apt name. > On May 17, 2024, at 12:32 PM, Daniel Weeks wrote: > > +1 agree, much cleaner naming > > -Dan > >> On Fri, May 17, 2024 at 8:46 AM Chao Sun wrote: >> >> +1 too. The name has been confusing for a very long time. >> >>> On Fri, May 17, 2024

Re: [DISCUSS] rename parquet-mr to parquet-java?

2024-05-17 Thread Daniel Weeks
+1 agree, much cleaner naming -Dan On Fri, May 17, 2024 at 8:46 AM Chao Sun wrote: > +1 too. The name has been confusing for a very long time. > > On Fri, May 17, 2024 at 8:40 AM Fokko Driesprong wrote: > > > +1 - I think it is much clearer to anyone. > > > > GitHub will handle all the

Re: [DISCUSS] rename parquet-mr to parquet-java?

2024-05-17 Thread Chao Sun
+1 too. The name has been confusing for a very long time. On Fri, May 17, 2024 at 8:40 AM Fokko Driesprong wrote: > +1 - I think it is much clearer to anyone. > > GitHub will handle all the redirects from the old to the new name, so no > reason from my end to not rename it :) > > Cheers, Fokko

Re: [DISCUSS] rename parquet-mr to parquet-java?

2024-05-17 Thread Fokko Driesprong
+1 - I think it is much clearer to anyone. GitHub will handle all the redirects from the old to the new name, so no reason from my end to not rename it :) Cheers, Fokko On Fri, 17 May 2024 at 17:30, Julien Le Dem wrote: > +1 > I should have named it that to start with. > > > On Fri, May 17,

Re: [C++] Parquet and Arrow overlap

2024-05-17 Thread Julien Le Dem
If we deem that it would be too hard to move it back for the moment, we need at a minimum to clarify and reduce the confusion. If practice doesn't match what the PMC voted on, we need to improve the practice. Do we have suggestions on improving that? perhaps OWNERSFILE in the parquet folder in the

Re: [DISCUSS] rename parquet-mr to parquet-java?

2024-05-17 Thread Julien Le Dem
+1 I should have named it that to start with. On Fri, May 17, 2024 at 3:27 AM Wang, Yuming wrote: > +10086 > > From: Uwe L. Korn > Date: Thursday, May 16, 2024 at 15:41 > To: dev@parquet.apache.org > Subject: Re: [DISCUSS] rename parquet-mr to parquet-java? > External Email > > very heavy +1

Re: [DISCUSS] Parquet Reference Implementation ?

2024-05-17 Thread Julien Le Dem
It's not just whether it's readable or not. It is also whether the format allows reaching the performance characteristics expected. *A* reference implementation should be developed at the same time as the format change to confirm that we reach the stated goals. This is needed whether we consider

Re: [DISCUSS] Parquet 3 metadata draft / strawman proposal

2024-05-17 Thread Julien Le Dem
This context should be added in the PR description itself. My main point is to keep the discussion connected rather than starting new threads on the mailing list or PRs on github that don't refer to the original doc they are connected to. From a design process perspective, it makes more

Re: [DISCUSS] rename parquet-mr to parquet-java?

2024-05-17 Thread Wang, Yuming
+10086 From: Uwe L. Korn Date: Thursday, May 16, 2024 at 15:41 To: dev@parquet.apache.org Subject: Re: [DISCUSS] rename parquet-mr to parquet-java? External Email very heavy +1 This would help a lot. On Thu, May 16, 2024, at 4:19 AM, Gang Wu wrote: > +1 on renaming the repo to reduce

Re: [DISCUSS] Parquet Reference Implementation ?

2024-05-17 Thread Steve Loughran
I'd argue the compatibility across implementation is "can they correctly read the data generated by the others?", so there's less of an RI than compliance testing, the way closed source stuff often works. Specification 1. Files generated by the implementation which are believed to match the

Re: [C++] Parquet and Arrow overlap

2024-05-17 Thread Uwe L. Korn
On Fri, May 17, 2024, at 10:36 AM, Antoine Pitrou wrote: > Hi Julien, > > On Thu, 16 May 2024 18:23:33 -0700 > Julien Le Dem wrote: >> >> As discussed, that code was moved in the arrow repo for convenience: >> https://lists.apache.org/thread/gkvbm6yyly1r4cg3f6xtnqkjz6ogn6o2 >> >> To take an

Re: Typical number of key-value metadata entries?

2024-05-17 Thread Antoine Pitrou
Hi Fokko, So, if I understand correctly, you have a small number of key-value metadata entries, but the values may be large? Also, you actually need those metadata values to do anything with the data (because they tell you the actual Iceberg schema), so on-demand decoding of these values would

Re: [DISCUSS] Parquet 3 metadata draft / strawman proposal

2024-05-17 Thread Antoine Pitrou
Hi Julien, Yes, I posted comments on Micah's document, and I referenced this PR in those discussions. Personally, I feel more comfortable when I have some concrete proposal to comment on, rather than abstract goals, and I figured other people might be like me. Discussing actual Thrift metadata

Re: [DISCUSS] Parquet Reference Implementation ?

2024-05-17 Thread Antoine Pitrou
+1 (non-binding :-)) on the idea of having a shortlist of "accredited" implementations. I would suggest adding a third implementation such as parquet-rs, since its authors are active here; especially as the Parquet Java and C++ teams seem to have some overlap historically, and a third

Re: [C++] Parquet and Arrow overlap

2024-05-17 Thread Antoine Pitrou
Hi Julien, On Thu, 16 May 2024 18:23:33 -0700 Julien Le Dem wrote: > > As discussed, that code was moved in the arrow repo for convenience: > https://lists.apache.org/thread/gkvbm6yyly1r4cg3f6xtnqkjz6ogn6o2 > > To take an excerpt of that original decision: > > 4) The Parquet and Arrow C++

Re: [C++] Parquet and Arrow overlap

2024-05-17 Thread Antoine Pitrou
On Fri, 17 May 2024 07:48:18 +0200 Jean-Baptiste Onofré wrote: > Hi > > Technically speaking moving back to parquet would be challenging short > term. > > In terms of governance, why not having some parquet maintainer/PMC member > invited to arrow ? It would simplify the review and governance.