Re: [Discuss] Feature addition requirements/process

2025-09-25 Thread Julien Le Dem
I have now merged this PR.
Thank you all for the feedback.
(especially Micah, Marc, Andrew, and Ryan)


On Tue, Sep 23, 2025 at 11:29 AM Julien Le Dem wrote:

> Hello,
> Micah approved the PR and I made the last tweaks based on the feedback
> (Thank you Micah and Marc).
> I am planning to merge the PR soon.
> https://github.com/apache/parquet-format/pull/513
> This is your chance to chime in. (or, you know, open a PR later if you
> want to make changes afterwards.)
> Once this is merged, I heard some people who are looking forward to
> test-driving this with proposals for new encodings.
> I am looking forward to it!
> Julien
>
> On Tue, Sep 2, 2025 at 4:54 PM Julien Le Dem  wrote:
>
>> FYI: I'm hoping to get closer to a conclusion in the meeting tomorrow.
>> If you could take a second look, I would appreciate it.
>> Thank you !
>>
>> On Fri, Aug 29, 2025 at 3:54 PM Julien Le Dem  wrote:
>>
>>> Thank you for the feedback.
>>> I have updated the PR with all the feedback and introduced language to
>>> remove gatekeeping as much as possible and encourage people to feel
>>> empowered to propose and contribute new things.
>>>
>>> https://github.com/apache/parquet-format/pull/513
>>> See in tree here:
>>> https://github.com/apache/parquet-format/tree/proposals/proposals
>>>
>>>
>>> On Mon, Aug 11, 2025 at 6:57 AM Andrew Lamb 
>>> wrote:
>>>
>>>> I think the PR[1][2] that Julien created is a pretty nice high level flow
>>>> as it:
>>>> 1. Mostly documents clearly what is already done in practice
>>>> 2. Postpones concerns and consensus about potentially overly restrictive
>>>> requirements for new features (but not trying to exhaustively specify the
>>>> criteria)
>>>> 3. Gives a location to list active proposals
>>>>
>>>> We could make progress with his PR without having to come to a consensus on
>>>> the criteria for inclusion.
>>>>
>>>> Once we had that high level flow up, we could try it out and formalize
>>>> some of the criteria that are used for changes.
>>>>
>>>> Andrew
>>>>
>>>>
>>>> [1]: https://github.com/apache/parquet-format/pull/513
>>>> [2]: https://github.com/apache/parquet-format/tree/proposals/proposals
>>>>
>>>> On Mon, Aug 11, 2025 at 12:31 AM Micah Kornfield wrote:
>>>>
>>>> > >
>>>> > > In this situation, it's great to say that we want people to run
>>>> > benchmarks
>>>> > > on some representative datasets and I agree that we probably want a
>>>> > > substantial performance improvement to justify the cost of support.
>>>> But I
>>>> > > think we need to see these things as guidelines and not require
>>>> running
>>>> > 20
>>>> > >
>>>> > > The intention at least in the doc was to require 20 plus datasets
>>>> but to
>>>> > collect at least a list of open datasets that we can narrow down.
>>>> What I
>>>> > would at least like to see is a fairly standard set of data to make
>>>> > comparisons consistent.   We also discussed this in the sync.  I
>>>> think it
>>>> > will be up to someone who has bandwidth to help at least designate a
>>>> subset
>>>> > of what we want to include.
>>>> >
>>>> > benchmarks or not considering features with 9% improvements across the
>>>> > > board.
>>>> >
>>>> > Sure, we can maybe make the language softer language on having a
>>>> target
>>>> > percentage be a target goal but there can be trade-offs.
>>>> >
>>>> > I actually think having some sort of baseline helps to function as
>>>> making
>>>> > things easier in some ways as long as other requirements are met
>>>> because it
>>>> > removes some amount of subjectivity.
>>>> >
>>>> > Cheers,
>>>> > Micah
>>>> >
>>>> >
>>>> >
>>>> > On Fri, Aug 8, 2025 at 2:40 PM Julien Le Dem wrote:
>>>> >
>>>> > > I agree that the goal is to make contributions easier and not a
>>>> daunting
>>>> > > process.
>>>> > > We could start the process by separating bigger projects that are
>>>> > impacting
>>>> > > the format in a non backward compatible way (new encodings, new
>>>> footer,
>>>> > > etc), versus things that are not as impacting (for example adding
>>>> > metadata
>>>> > > that can be ignored by older readers).
>>>> > > The goal of the "proposals" list I'm outlining above is really only
>>>> for
>>>> > > bigger projects where we need collaboration across the ecosystem
>>>> (like we
>>>> > > just did for Variant).
>>>> > > I'm taking inspiration from other projects here: Airflow Improvement
>>>> > > Proposals
>>>> > > <
>>>> > >
>>>> >
>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
>>>> > > >
>>>> > >  or Flink Improvement Proposals
>>>> > > <
>>>> > >
>>>> >
>>>> https://cwiki.apache.org/confluence/display/Flink/Flink+Improvement+Proposals
>>>> > > >
>>>> > > I think it's also useful to have a central place to find those.
>>>> > >
>>>> > > On Fri, Aug 8, 2025 at 12:11 PM Ryan Blue wrote:
>>>> > >
>>>> > > > I like many things about the write up, but I want to call out one
>>>> > > potential
>>>> > > > pitfall.
>>>> > > >
>>>> > > > I think that this is nee

Re: [Discuss] Feature addition requirements/process

2025-09-23 Thread Julien Le Dem
Hello,
Micah approved the PR and I made the last tweaks based on the feedback
(Thank you Micah and Marc).
I am planning to merge the PR soon.
https://github.com/apache/parquet-format/pull/513
This is your chance to chime in (or, you know, open a PR later if you want
to make changes afterwards).
Once this is merged, I hear some people are looking forward to
test-driving this with proposals for new encodings.
I am looking forward to it!
Julien

On Tue, Sep 2, 2025 at 4:54 PM Julien Le Dem wrote:

> FYI: I'm hoping to get closer to a conclusion in the meeting tomorrow.
> If you could take a second look, I would appreciate it.
> Thank you !
>
> On Fri, Aug 29, 2025 at 3:54 PM Julien Le Dem  wrote:
>
>> Thank you for the feedback.
>> I have updated the PR with all the feedback and introduced language to
>> remove gatekeeping as much as possible and encourage people to feel
>> empowered to propose and contribute new things.
>>
>> https://github.com/apache/parquet-format/pull/513
>> See in tree here:
>> https://github.com/apache/parquet-format/tree/proposals/proposals
>>
>>
>> On Mon, Aug 11, 2025 at 6:57 AM Andrew Lamb 
>> wrote:
>>
>>> I think the PR[1][2] that Julien created is a pretty nice high level flow
>>> as it:
>>> 1. Mostly documents clearly what is already done in practice
>>> 2. Postpones concerns and consensus about potentially overly restrictive
>>> requirements for new features (but not trying to exhaustively specify the
>>> criteria)
>>> 3. Gives a location to list active proposals
>>>
>>> We could make progress with his PR without having to come to a consensus
>>> on
>>> the criteria for inclusion.
>>>
>>> Once we had that high level flow up,  we could try it out and formalize
>>> some of the criteria that are used for changes.
>>>
>>> Andrew
>>>
>>>
>>> [1]: https://github.com/apache/parquet-format/pull/513
>>> [2]: https://github.com/apache/parquet-format/tree/proposals/proposals
>>>
>>> On Mon, Aug 11, 2025 at 12:31 AM Micah Kornfield 
>>> wrote:
>>>
>>> > >
>>> > > In this situation, it's great to say that we want people to run
>>> > benchmarks
>>> > > on some representative datasets and I agree that we probably want a
>>> > > substantial performance improvement to justify the cost of support.
>>> But I
>>> > > think we need to see these things as guidelines and not require
>>> running
>>> > 20
>>> > >
>>> > > The intention at least in the doc was to require 20 plus datasets
>>> but to
>>> > collect at least a list of open datasets that we can narrow down.
>>> What I
>>> > would at least like to see is a fairly standard set of data to make
>>> > comparisons consistent.   We also discussed this in the sync.  I think
>>> it
>>> > will be up to someone who has bandwidth to help at least designate a
>>> subset
>>> > of what we want to include.
>>> >
>>> > benchmarks or not considering features with 9% improvements across the
>>> > > board.
>>> >
>>> > Sure, we can maybe make the language softer language on having a target
>>> > percentage be a target goal but there can be trade-offs.
>>> >
>>> > I actually think having some sort of baseline helps to function as
>>> making
>>> > things easier in some ways as long as other requirements are met
>>> because it
>>> > removes some amount of subjectivity.
>>> >
>>> > Cheers,
>>> > Micah
>>> >
>>> >
>>> >
>>> > On Fri, Aug 8, 2025 at 2:40 PM Julien Le Dem 
>>> wrote:
>>> >
>>> > > I agree that the goal is to make contributions easier and not a
>>> daunting
>>> > > process.
>>> > > We could start the process by separating bigger projects that are
>>> > impacting
>>> > > the format in a non backward compatible way (new encodings, new
>>> footer,
>>> > > etc), versus things that are not as impacting (for example adding
>>> > metadata
>>> > > that can be ignored by older readers).
>>> > > The goal of the "proposals" list I'm outlining above is really only
>>> for
>>> > > bigger projects where we need collaboration across the ecosystem
>>> (like we
>>> > > just did for Variant).
>>> > > I'm taking inspiration from other projects here: Airflow Improvement
>>> > > Proposals
>>> > > <
>>> > >
>>> >
>>> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
>>> > > >
>>> > >  or Flink Improvement Proposals
>>> > > <
>>> > >
>>> >
>>> https://cwiki.apache.org/confluence/display/Flink/Flink+Improvement+Proposals
>>> > > >
>>> > > I think it's also useful to have a central place to find those.
>>> > >
>>> > > On Fri, Aug 8, 2025 at 12:11 PM Ryan Blue  wrote:
>>> > >
>>> > > > I like many things about the write up, but I want to call out one
>>> > > potential
>>> > > > pitfall.
>>> > > >
>>> > > > I think that this is needed so that we can evolve the project and
>>> so we
>>> > > > have a well-understood path for adding new encodings and
>>> improvements.
>>> > If
>>> > > > we can't add new things, then the project will become outdated and
>>> > > > irrelevant.
>>> > > >
>>> > > > I'd like to keep that goal in mind when discussing t

Re: [Discuss] Feature addition requirements/process

2025-09-02 Thread Julien Le Dem
FYI: I'm hoping to get closer to a conclusion in the meeting tomorrow.
If you could take a second look, I would appreciate it.
Thank you!

On Fri, Aug 29, 2025 at 3:54 PM Julien Le Dem wrote:

> Thank you for the feedback.
> I have updated the PR with all the feedback and introduced language to
> remove gatekeeping as much as possible and encourage people to feel
> empowered to propose and contribute new things.
>
> https://github.com/apache/parquet-format/pull/513
> See in tree here:
> https://github.com/apache/parquet-format/tree/proposals/proposals
>
>
> On Mon, Aug 11, 2025 at 6:57 AM Andrew Lamb 
> wrote:
>
>> I think the PR[1][2] that Julien created is a pretty nice high level flow
>> as it:
>> 1. Mostly documents clearly what is already done in practice
>> 2. Postpones concerns and consensus about potentially overly restrictive
>> requirements for new features (but not trying to exhaustively specify the
>> criteria)
>> 3. Gives a location to list active proposals
>>
>> We could make progress with his PR without having to come to a consensus
>> on
>> the criteria for inclusion.
>>
>> Once we had that high level flow up,  we could try it out and formalize
>> some of the criteria that are used for changes.
>>
>> Andrew
>>
>>
>> [1]: https://github.com/apache/parquet-format/pull/513
>> [2]: https://github.com/apache/parquet-format/tree/proposals/proposals
>>
>> On Mon, Aug 11, 2025 at 12:31 AM Micah Kornfield 
>> wrote:
>>
>> > >
>> > > In this situation, it's great to say that we want people to run
>> > benchmarks
>> > > on some representative datasets and I agree that we probably want a
>> > > substantial performance improvement to justify the cost of support.
>> But I
>> > > think we need to see these things as guidelines and not require
>> running
>> > 20
>> > >
>> > > The intention at least in the doc was to require 20 plus datasets but
>> to
>> > collect at least a list of open datasets that we can narrow down.  What
>> I
>> > would at least like to see is a fairly standard set of data to make
>> > comparisons consistent.   We also discussed this in the sync.  I think
>> it
>> > will be up to someone who has bandwidth to help at least designate a
>> subset
>> > of what we want to include.
>> >
>> > benchmarks or not considering features with 9% improvements across the
>> > > board.
>> >
>> > Sure, we can maybe make the language softer language on having a target
>> > percentage be a target goal but there can be trade-offs.
>> >
>> > I actually think having some sort of baseline helps to function as
>> making
>> > things easier in some ways as long as other requirements are met
>> because it
>> > removes some amount of subjectivity.
>> >
>> > Cheers,
>> > Micah
>> >
>> >
>> >
>> > On Fri, Aug 8, 2025 at 2:40 PM Julien Le Dem  wrote:
>> >
>> > > I agree that the goal is to make contributions easier and not a
>> daunting
>> > > process.
>> > > We could start the process by separating bigger projects that are
>> > impacting
>> > > the format in a non backward compatible way (new encodings, new
>> footer,
>> > > etc), versus things that are not as impacting (for example adding
>> > metadata
>> > > that can be ignored by older readers).
>> > > The goal of the "proposals" list I'm outlining above is really only
>> for
>> > > bigger projects where we need collaboration across the ecosystem
>> (like we
>> > > just did for Variant).
>> > > I'm taking inspiration from other projects here: Airflow Improvement
>> > > Proposals
>> > > <
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
>> > > >
>> > >  or Flink Improvement Proposals
>> > > <
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/Flink/Flink+Improvement+Proposals
>> > > >
>> > > I think it's also useful to have a central place to find those.
>> > >
>> > > On Fri, Aug 8, 2025 at 12:11 PM Ryan Blue  wrote:
>> > >
>> > > > I like many things about the write up, but I want to call out one
>> > > potential
>> > > > pitfall.
>> > > >
>> > > > I think that this is needed so that we can evolve the project and
>> so we
>> > > > have a well-understood path for adding new encodings and
>> improvements.
>> > If
>> > > > we can't add new things, then the project will become outdated and
>> > > > irrelevant.
>> > > >
>> > > > I'd like to keep that goal in mind when discussing the path that we
>> are
>> > > > documenting because there is a risk of having the opposite effect:
>> by
>> > > > adding so much process or so many requirements to satisfy that
>> people
>> > > > choose not to contribute or can't make it through to the end.
>> > > >
>> > > > You can see this risk at play with many ASF projects that have a
>> > > > well-defined "path to committer". Often these docs start with
>> > guidelines
>> > > > that say something like "you'll generally need to contribute for
>> about
>> > a
>> > > > year" to give context, but those things turn into rules and the
>> > community
>> > > > doesn't add anyone that has

Re: [Discuss] Feature addition requirements/process

2025-08-29 Thread Julien Le Dem
Thank you for the feedback.
I have updated the PR to address all the feedback and introduced language to
remove gatekeeping as much as possible and encourage people to feel
empowered to propose and contribute new things.

https://github.com/apache/parquet-format/pull/513
See in tree here:
https://github.com/apache/parquet-format/tree/proposals/proposals


On Mon, Aug 11, 2025 at 6:57 AM Andrew Lamb wrote:

> I think the PR[1][2] that Julien created is a pretty nice high level flow
> as it:
> 1. Mostly documents clearly what is already done in practice
> 2. Postpones concerns and consensus about potentially overly restrictive
> requirements for new features (but not trying to exhaustively specify the
> criteria)
> 3. Gives a location to list active proposals
>
> We could make progress with his PR without having to come to a consensus on
> the criteria for inclusion.
>
> Once we had that high level flow up,  we could try it out and formalize
> some of the criteria that are used for changes.
>
> Andrew
>
>
> [1]: https://github.com/apache/parquet-format/pull/513
> [2]: https://github.com/apache/parquet-format/tree/proposals/proposals
>
> On Mon, Aug 11, 2025 at 12:31 AM Micah Kornfield 
> wrote:
>
> > >
> > > In this situation, it's great to say that we want people to run
> > benchmarks
> > > on some representative datasets and I agree that we probably want a
> > > substantial performance improvement to justify the cost of support.
> But I
> > > think we need to see these things as guidelines and not require running
> > 20
> > >
> > > The intention at least in the doc was to require 20 plus datasets but
> to
> > collect at least a list of open datasets that we can narrow down.  What I
> > would at least like to see is a fairly standard set of data to make
> > comparisons consistent.   We also discussed this in the sync.  I think it
> > will be up to someone who has bandwidth to help at least designate a
> subset
> > of what we want to include.
> >
> > benchmarks or not considering features with 9% improvements across the
> > > board.
> >
> > Sure, we can maybe make the language softer language on having a target
> > percentage be a target goal but there can be trade-offs.
> >
> > I actually think having some sort of baseline helps to function as making
> > things easier in some ways as long as other requirements are met because
> it
> > removes some amount of subjectivity.
> >
> > Cheers,
> > Micah
> >
> >
> >
> > On Fri, Aug 8, 2025 at 2:40 PM Julien Le Dem  wrote:
> >
> > > I agree that the goal is to make contributions easier and not a
> daunting
> > > process.
> > > We could start the process by separating bigger projects that are
> > impacting
> > > the format in a non backward compatible way (new encodings, new footer,
> > > etc), versus things that are not as impacting (for example adding
> > metadata
> > > that can be ignored by older readers).
> > > The goal of the "proposals" list I'm outlining above is really only for
> > > bigger projects where we need collaboration across the ecosystem (like
> we
> > > just did for Variant).
> > > I'm taking inspiration from other projects here: Airflow Improvement
> > > Proposals
> > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
> > > >
> > >  or Flink Improvement Proposals
> > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/Flink/Flink+Improvement+Proposals
> > > >
> > > I think it's also useful to have a central place to find those.
> > >
> > > On Fri, Aug 8, 2025 at 12:11 PM Ryan Blue  wrote:
> > >
> > > > I like many things about the write up, but I want to call out one
> > > potential
> > > > pitfall.
> > > >
> > > > I think that this is needed so that we can evolve the project and so
> we
> > > > have a well-understood path for adding new encodings and
> improvements.
> > If
> > > > we can't add new things, then the project will become outdated and
> > > > irrelevant.
> > > >
> > > > I'd like to keep that goal in mind when discussing the path that we
> are
> > > > documenting because there is a risk of having the opposite effect: by
> > > > adding so much process or so many requirements to satisfy that people
> > > > choose not to contribute or can't make it through to the end.
> > > >
> > > > You can see this risk at play with many ASF projects that have a
> > > > well-defined "path to committer". Often these docs start with
> > guidelines
> > > > that say something like "you'll generally need to contribute for
> about
> > a
> > > > year" to give context, but those things turn into rules and the
> > community
> > > > doesn't add anyone that hasn't been around for a year.
> > > >
> > > > In this situation, it's great to say that we want people to run
> > > benchmarks
> > > > on some representative datasets and I agree that we probably want a
> > > > substantial performance improvement to justify the cost of support.
> > But I
> > > > think we need to see these things as guidelines and not require
>

Re: [Discuss] Feature addition requirements/process

2025-08-11 Thread Andrew Lamb
I think the PR[1][2] that Julien created is a pretty nice high-level flow
as it:
1. Mostly documents clearly what is already done in practice
2. Postpones concerns and consensus about potentially overly restrictive
requirements for new features (by not trying to exhaustively specify the
criteria)
3. Gives a location to list active proposals

We could make progress with his PR without having to come to a consensus on
the criteria for inclusion.

Once we have that high-level flow up, we could try it out and formalize
some of the criteria that are used for changes.

Andrew


[1]: https://github.com/apache/parquet-format/pull/513
[2]: https://github.com/apache/parquet-format/tree/proposals/proposals

On Mon, Aug 11, 2025 at 12:31 AM Micah Kornfield wrote:

> >
> > In this situation, it's great to say that we want people to run
> benchmarks
> > on some representative datasets and I agree that we probably want a
> > substantial performance improvement to justify the cost of support. But I
> > think we need to see these things as guidelines and not require running
> 20
> >
> > The intention at least in the doc was to require 20 plus datasets but to
> collect at least a list of open datasets that we can narrow down.  What I
> would at least like to see is a fairly standard set of data to make
> comparisons consistent.   We also discussed this in the sync.  I think it
> will be up to someone who has bandwidth to help at least designate a subset
> of what we want to include.
>
> benchmarks or not considering features with 9% improvements across the
> > board.
>
> Sure, we can maybe make the language softer language on having a target
> percentage be a target goal but there can be trade-offs.
>
> I actually think having some sort of baseline helps to function as making
> things easier in some ways as long as other requirements are met because it
> removes some amount of subjectivity.
>
> Cheers,
> Micah
>
>
>
> On Fri, Aug 8, 2025 at 2:40 PM Julien Le Dem  wrote:
>
> > I agree that the goal is to make contributions easier and not a daunting
> > process.
> > We could start the process by separating bigger projects that are
> impacting
> > the format in a non backward compatible way (new encodings, new footer,
> > etc), versus things that are not as impacting (for example adding
> metadata
> > that can be ignored by older readers).
> > The goal of the "proposals" list I'm outlining above is really only for
> > bigger projects where we need collaboration across the ecosystem (like we
> > just did for Variant).
> > I'm taking inspiration from other projects here: Airflow Improvement
> > Proposals
> > <
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
> > >
> >  or Flink Improvement Proposals
> > <
> >
> https://cwiki.apache.org/confluence/display/Flink/Flink+Improvement+Proposals
> > >
> > I think it's also useful to have a central place to find those.
> >
> > On Fri, Aug 8, 2025 at 12:11 PM Ryan Blue  wrote:
> >
> > > I like many things about the write up, but I want to call out one
> > potential
> > > pitfall.
> > >
> > > I think that this is needed so that we can evolve the project and so we
> > > have a well-understood path for adding new encodings and improvements.
> If
> > > we can't add new things, then the project will become outdated and
> > > irrelevant.
> > >
> > > I'd like to keep that goal in mind when discussing the path that we are
> > > documenting because there is a risk of having the opposite effect: by
> > > adding so much process or so many requirements to satisfy that people
> > > choose not to contribute or can't make it through to the end.
> > >
> > > You can see this risk at play with many ASF projects that have a
> > > well-defined "path to committer". Often these docs start with
> guidelines
> > > that say something like "you'll generally need to contribute for about
> a
> > > year" to give context, but those things turn into rules and the
> community
> > > doesn't add anyone that hasn't been around for a year.
> > >
> > > In this situation, it's great to say that we want people to run
> > benchmarks
> > > on some representative datasets and I agree that we probably want a
> > > substantial performance improvement to justify the cost of support.
> But I
> > > think we need to see these things as guidelines and not require running
> > 20
> > > benchmarks or not considering features with 9% improvements across the
> > > board.
> > >
> > > Ryan
> > >
> > > On Thu, Aug 7, 2025 at 5:10 PM Julien Le Dem 
> wrote:
> > >
> > > > I opened a Draft PR to illustrate what this could look like.
> > > > https://github.com/apache/parquet-format/pull/513
> > > > See in tree here:
> > > > https://github.com/apache/parquet-format/tree/proposals/proposals
> > > >
> > > > On Wed, Aug 6, 2025 at 3:30 PM Julien Le Dem 
> > wrote:
> > > >
> > > > > IMO, this doc is pretty close to being ready to be published. We
> can
> > > > > always improve it as we go.
> > > > >
> > > > > I t

Re: [Discuss] Feature addition requirements/process

2025-08-10 Thread Micah Kornfield
> In this situation, it's great to say that we want people to run benchmarks
> on some representative datasets and I agree that we probably want a
> substantial performance improvement to justify the cost of support. But I
> think we need to see these things as guidelines and not require running 20

The intention, at least in the doc, was not to require 20-plus datasets but
to collect at least a list of open datasets that we can narrow down. What I
would at least like to see is a fairly standard set of data to make
comparisons consistent. We also discussed this in the sync. I think it
will be up to someone who has bandwidth to help at least designate a subset
of what we want to include.

> benchmarks or not considering features with 9% improvements across the
> board.

Sure, we can maybe soften the language so that the target percentage is a
goal, while acknowledging that there can be trade-offs.

I actually think having some sort of baseline makes things easier in some
ways, as long as other requirements are met, because it removes some amount
of subjectivity.

Cheers,
Micah



On Fri, Aug 8, 2025 at 2:40 PM Julien Le Dem wrote:

> I agree that the goal is to make contributions easier and not a daunting
> process.
> We could start the process by separating bigger projects that are impacting
> the format in a non backward compatible way (new encodings, new footer,
> etc), versus things that are not as impacting (for example adding metadata
> that can be ignored by older readers).
> The goal of the "proposals" list I'm outlining above is really only for
> bigger projects where we need collaboration across the ecosystem (like we
> just did for Variant).
> I'm taking inspiration from other projects here: Airflow Improvement
> Proposals
> <
> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
> >
>  or Flink Improvement Proposals
> <
> https://cwiki.apache.org/confluence/display/Flink/Flink+Improvement+Proposals
> >
> I think it's also useful to have a central place to find those.
>
> On Fri, Aug 8, 2025 at 12:11 PM Ryan Blue  wrote:
>
> > I like many things about the write up, but I want to call out one
> potential
> > pitfall.
> >
> > I think that this is needed so that we can evolve the project and so we
> > have a well-understood path for adding new encodings and improvements. If
> > we can't add new things, then the project will become outdated and
> > irrelevant.
> >
> > I'd like to keep that goal in mind when discussing the path that we are
> > documenting because there is a risk of having the opposite effect: by
> > adding so much process or so many requirements to satisfy that people
> > choose not to contribute or can't make it through to the end.
> >
> > You can see this risk at play with many ASF projects that have a
> > well-defined "path to committer". Often these docs start with guidelines
> > that say something like "you'll generally need to contribute for about a
> > year" to give context, but those things turn into rules and the community
> > doesn't add anyone that hasn't been around for a year.
> >
> > In this situation, it's great to say that we want people to run
> benchmarks
> > on some representative datasets and I agree that we probably want a
> > substantial performance improvement to justify the cost of support. But I
> > think we need to see these things as guidelines and not require running
> 20
> > benchmarks or not considering features with 9% improvements across the
> > board.
> >
> > Ryan
> >
> > On Thu, Aug 7, 2025 at 5:10 PM Julien Le Dem  wrote:
> >
> > > I opened a Draft PR to illustrate what this could look like.
> > > https://github.com/apache/parquet-format/pull/513
> > > See in tree here:
> > > https://github.com/apache/parquet-format/tree/proposals/proposals
> > >
> > > On Wed, Aug 6, 2025 at 3:30 PM Julien Le Dem 
> wrote:
> > >
> > > > IMO, this doc is pretty close to being ready to be published. We can
> > > > always improve it as we go.
> > > >
> > > > I think that one important part of the whole process is to make it
> easy
> > > > for everyone to see what proposals are ongoing and their status and a
> > > clear
> > > > step to move from proposal/evaluation to implementation.
> > > >
> > > > Once we agree the doc is close enough, I would propose to publish it
> in
> > > > markdown on the parquet-format repo, organized as follows:
> > > > - The section "Baseline Requirements for new additions" as its own
> > page,
> > > > documenting how to approach the design of a parquet change and the
> > > > underlying constraints.
> > > > - We add a physical process to list proposals in the parquet-format
> > > github
> > > > Repo as follows.
> > > > - The steps described in the section "Incorporating
> > encoding/compression
> > > > improvements" become the process on how someone creates a proposal
> and
> > > > starts a POC.
> > > > - I would complement it by the following steps for people to publish
> > > their
> > > > pro

Re: [Discuss] Feature addition requirements/process

2025-08-08 Thread Julien Le Dem
I agree that the goal is to make contributing easier, not to create a
daunting process.
We could start by separating bigger projects that impact the format in a
non-backward-compatible way (new encodings, new footer, etc.) from changes
that are less impactful (for example, adding metadata that can be ignored
by older readers).
The goal of the "proposals" list I'm outlining above is really only for
bigger projects where we need collaboration across the ecosystem (like we
just did for Variant).
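I'm taking inspiration from other projects here: Airflow Improvement
Proposals
<https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals>
or Flink Improvement Proposals
<https://cwiki.apache.org/confluence/display/Flink/Flink+Improvement+Proposals>.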
I think it's also useful to have a central place to find those.

On Fri, Aug 8, 2025 at 12:11 PM Ryan Blue wrote:

> I like many things about the write up, but I want to call out one potential
> pitfall.
>
> I think that this is needed so that we can evolve the project and so we
> have a well-understood path for adding new encodings and improvements. If
> we can't add new things, then the project will become outdated and
> irrelevant.
>
> I'd like to keep that goal in mind when discussing the path that we are
> documenting because there is a risk of having the opposite effect: by
> adding so much process or so many requirements to satisfy that people
> choose not to contribute or can't make it through to the end.
>
> You can see this risk at play with many ASF projects that have a
> well-defined "path to committer". Often these docs start with guidelines
> that say something like "you'll generally need to contribute for about a
> year" to give context, but those things turn into rules and the community
> doesn't add anyone that hasn't been around for a year.
>
> In this situation, it's great to say that we want people to run benchmarks
> on some representative datasets and I agree that we probably want a
> substantial performance improvement to justify the cost of support. But I
> think we need to see these things as guidelines and not require running 20
> benchmarks or not considering features with 9% improvements across the
> board.
>
> Ryan
>
> On Thu, Aug 7, 2025 at 5:10 PM Julien Le Dem  wrote:
>
> > I opened a Draft PR to illustrate what this could look like.
> > https://github.com/apache/parquet-format/pull/513
> > See in tree here:
> > https://github.com/apache/parquet-format/tree/proposals/proposals
> >
> > On Wed, Aug 6, 2025 at 3:30 PM Julien Le Dem  wrote:
> >
> > > IMO, this doc is pretty close to being ready to be published. We can
> > > always improve it as we go.
> > >
> > > > I think that one important part of the whole process is to make it easy
> > > > for everyone to see what proposals are ongoing and their status and a
> > > > clear step to move from proposal/evaluation to implementation.
> > > >
> > > > Once we agree the doc is close enough, I would propose to publish it in
> > > > markdown on the parquet-format repo, organized as follows:
> > > > - The section "Baseline Requirements for new additions" as its own
> > > > page, documenting how to approach the design of a parquet change and
> > > > the underlying constraints.
> > > > - We add a physical process to list proposals in the parquet-format
> > > > github Repo as follows.
> > > > - The steps described in the section "Incorporating
> > > > encoding/compression improvements" become the process on how someone
> > > > creates a proposal and starts a POC.
> > > > - I would complement it by the following steps for people to publish
> > > > their proposals:
> > > >    - We create a folder in the parquet-format repo to hold the
> > > > proposals.
> > > >    - a Readme in the folder tracks the ongoing POCs and status.
> > > >    - Initiating a proposal starts with a github issue. We create a
> > > > template for it based on what's outlined in that section of the doc.
> > > >    - If the discussion concludes that the proposal is worth a POC,
> > > > the author opens a PR to add the proposal in markdown in the proposals
> > > > folder. It links to the Github issue where the discussion preceding the
> > > > proposal occurred. More people can contribute to the POC as needed.
> > > >    - POC and perf evaluation are implemented as part of the proposal.
> > > >    - a vote by the PMC moves the proposal to actual feature in the
> > > > format (based on the criteria outlined in this doc).
> > > >    - As part of the implementation step, we make sure we have cross
> > > > compatible implementations as we did for Variant.
> > > > - The section "Measuring improvements" becomes part of that process
> > > > section to explain how we'll decide if the addition is worth adding to
> > > > the spec for the complexity it is adding.
> > > >
> > > > If that makes sense to you all, I can draft a PR to make this proposal
> > > > a little more concrete.
> > >
> > >
> > >
> > > On Wed, Aug 6, 2025 at 11:08 AM Andrew Lamb 
> > > wrote:
> 

Re: [Discuss] Feature addition requirements/process

2025-08-08 Thread Ryan Blue
I like many things about the write-up, but I want to call out one potential
pitfall.

I think that this is needed so that we can evolve the project and so we
have a well-understood path for adding new encodings and improvements. If
we can't add new things, then the project will become outdated and
irrelevant.

I'd like to keep that goal in mind when discussing the path that we are
documenting because there is a risk of having the opposite effect: adding
so much process or so many requirements that people choose not to
contribute or can't make it through to the end.

You can see this risk at play with many ASF projects that have a
well-defined "path to committer". Often these docs start with guidelines
that say something like "you'll generally need to contribute for about a
year" to give context, but those things turn into rules and the community
doesn't add anyone that hasn't been around for a year.

In this situation, it's great to say that we want people to run benchmarks
on some representative datasets and I agree that we probably want a
substantial performance improvement to justify the cost of support. But I
think we need to treat these things as guidelines: we should not require
running 20 benchmarks, nor refuse to consider features with 9% improvements
across the board.

Ryan

On Thu, Aug 7, 2025 at 5:10 PM Julien Le Dem  wrote:

> I opened a Draft PR to illustrate what this could look like.
> https://github.com/apache/parquet-format/pull/513
> See in tree here:
> https://github.com/apache/parquet-format/tree/proposals/proposals
>
> On Wed, Aug 6, 2025 at 3:30 PM Julien Le Dem  wrote:
>
> > IMO, this doc is pretty close to being ready to be published. We can
> > always improve it as we go.
> >
> > I think that one important part of the whole process is to make it easy
> > for everyone to see what proposals are ongoing and their status and a
> > clear step to move from proposal/evaluation to implementation.
> >
> > Once we agree the doc is close enough, I would propose to publish it in
> > markdown on the parquet-format repo, organized as follows:
> > - The section "Baseline Requirements for new additions" as its own page,
> > documenting how to approach the design of a parquet change and the
> > underlying constraints.
> > - We add a physical process to list proposals in the parquet-format
> > github Repo as follows.
> > - The steps described in the section "Incorporating encoding/compression
> > improvements" become the process on how someone creates a proposal and
> > starts a POC.
> > - I would complement it by the following steps for people to publish
> > their proposals:
> >    - We create a folder in the parquet-format repo to hold the proposals.
> >    - a Readme in the folder tracks the ongoing POCs and status.
> >    - Initiating a proposal starts with a github issue. We create a
> > template for it based on what's outlined in that section of the doc.
> >    - If the discussion concludes that the proposal is worth a POC,
> > the author opens a PR to add the proposal in markdown in the proposals
> > folder. It links to the Github issue where the discussion preceding the
> > proposal occurred. More people can contribute to the POC as needed.
> >    - POC and perf evaluation are implemented as part of the proposal.
> >    - a vote by the PMC moves the proposal to actual feature in the format
> > (based on the criteria outlined in this doc).
> >    - As part of the implementation step, we make sure we have cross
> > compatible implementations as we did for Variant.
> > - The section "Measuring improvements" becomes part of that process
> > section to explain how we'll decide if the addition is worth adding to
> > the spec for the complexity it is adding.
> >
> > If that makes sense to you all, I can draft a PR to make this proposal a
> > little more concrete.
> >
> >
> >
> > On Wed, Aug 6, 2025 at 11:08 AM Andrew Lamb 
> > wrote:
> >
> >> I would like to bump this thread as it came up again on the parquet sync
> >> call today
> >>
> >> Specifically, it seems like there is increasing interest in adding new
> >> encodings to the Parquet, so getting consensus on what that process
> >> looks like and what is required is more important.
> >>
> >> If you are interested in this topic, please leave comments on the Google
> >> Doc[1] or reply to this email chain.
> >>
> >> Thank you,
> >> Andrew
> >>
> >> [1]
> >> https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0
> >>
> >> On Thu, May 29, 2025 at 2:42 AM Micah Kornfield 
> >> wrote:
> >>
> >> > I wrote up a long overdue draft
> >> > <https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0>
> >> > [1]
> >> > on how we can move forward with additional features (it provides some
> >> > proposed requirements on both consuming third-party code, as well as
> >> > some more specific guidance on new encodings, and some orthogonal work tha

Re: [Discuss] Feature addition requirements/process

2025-08-07 Thread Julien Le Dem
I opened a Draft PR to illustrate what this could look like.
https://github.com/apache/parquet-format/pull/513
See in tree here:
https://github.com/apache/parquet-format/tree/proposals/proposals

On Wed, Aug 6, 2025 at 3:30 PM Julien Le Dem  wrote:

> IMO, this doc is pretty close to being ready to be published. We can
> always improve it as we go.
>
> I think that one important part of the whole process is to make it easy
> for everyone to see what proposals are ongoing and their status and a clear
> step to move from proposal/evaluation to implementation.
>
> Once we agree the doc is close enough, I would propose to publish it in
> markdown on the parquet-format repo, organized as follows:
> - The section "Baseline Requirements for new additions" as its own page,
> documenting how to approach the design of a parquet change and the
> underlying constraints.
> - We add a physical process to list proposals in the parquet-format github
> Repo as follows.
> - The steps described in the section "Incorporating encoding/compression
> improvements" become the process on how someone creates a proposal and
> starts a POC.
> - I would complement it by the following steps for people to publish their
> proposals:
>    - We create a folder in the parquet-format repo to hold the proposals.
>    - a Readme in the folder tracks the ongoing POCs and status.
>    - Initiating a proposal starts with a github issue. We create a
> template for it based on what's outlined in that section of the doc.
>    - If the discussion concludes that the proposal is worth a POC,
> the author opens a PR to add the proposal in markdown in the proposals
> folder. It links to the Github issue where the discussion preceding the
> proposal occurred. More people can contribute to the POC as needed.
>    - POC and perf evaluation are implemented as part of the proposal.
>    - a vote by the PMC moves the proposal to actual feature in the format
> (based on the criteria outlined in this doc).
>    - As part of the implementation step, we make sure we have cross
> compatible implementations as we did for Variant.
> - The section "Measuring improvements" becomes part of that process
> section to explain how we'll decide if the addition is worth adding to the
> spec for the complexity it is adding.
>
> If that makes sense to you all, I can draft a PR to make this proposal a
> little more concrete.
>
>
>
> On Wed, Aug 6, 2025 at 11:08 AM Andrew Lamb 
> wrote:
>
>> I would like to bump this thread as it came up again on the parquet sync
>> call today
>>
>> Specifically, it seems like there is increasing interest in adding new
>> encodings to the Parquet, so getting consensus on what that process looks
>> like and what is required is more important.
>>
>> If you are interested in this topic, please leave comments on the Google
>> Doc[1] or reply to this email chain.
>>
>> Thank you,
>> Andrew
>>
>> [1]
>>
>> https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0
>>
>> On Thu, May 29, 2025 at 2:42 AM Micah Kornfield 
>> wrote:
>>
>> > I wrote up a long overdue draft
>> > <https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0>
>> > [1]
>> > on how we can move forward with additional features (it provides some
>> > proposed requirements on both consuming third-party code, as well as
>> > some more specific guidance on new encodings, and some orthogonal work
>> > that would be nice to see).
>> >
>> > The doc still lacks some details, and might be too opinionated in places
>> > but I think it serves as a good basis for conversation (and at least
>> > gets me out of the critical path for evolving Parquet).
>> >
>> > I'm very excited to start moving forward with improvements.
>> >
>> > Thanks,
>> > Micah
>> >
>> > [1]
>> > https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0
>> >
>>
>


Re: [Discuss] Feature addition requirements/process

2025-08-06 Thread Julien Le Dem
IMO, this doc is pretty close to being ready to be published. We can always
improve it as we go.

I think one important part of the whole process is making it easy for
everyone to see which proposals are ongoing and their status, with a clear
step to move from proposal/evaluation to implementation.

Once we agree the doc is close enough, I would propose to publish it in
markdown on the parquet-format repo, organized as follows:
- The section "Baseline Requirements for new additions" as its own page,
documenting how to approach the design of a parquet change and the
underlying constraints.
- We add a physical process to list proposals in the parquet-format GitHub
repo, as follows.
- The steps described in the section "Incorporating encoding/compression
improvements" become the process for how someone creates a proposal and
starts a POC.
- I would complement it with the following steps for people to publish their
proposals:
   - We create a folder in the parquet-format repo to hold the proposals.
   - A README in the folder tracks the ongoing POCs and their status.
   - Initiating a proposal starts with a GitHub issue. We create a template
for it based on what's outlined in that section of the doc.
   - If the discussion concludes that the proposal is worth a POC,
the author opens a PR to add the proposal in markdown in the proposals
folder. It links to the GitHub issue where the discussion preceding the
proposal occurred. More people can contribute to the POC as needed.
   - The POC and perf evaluation are implemented as part of the proposal.
   - A vote by the PMC moves the proposal to an actual feature in the format
(based on the criteria outlined in this doc).
   - As part of the implementation step, we make sure we have
cross-compatible implementations, as we did for Variant.
- The section "Measuring improvements" becomes part of that process section
to explain how we'll decide if the addition is worth adding to the spec for
the complexity it is adding.

If that makes sense to you all, I can draft a PR to make this proposal a
little more concrete.
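
To make the flow above a bit more tangible, here is a rough sketch of the
proposal lifecycle as a small state machine, plus the kind of status table
the README in the proposals folder could track. This is only an
illustration: the stage names, the Proposal class, the readme_table helper
and the example URL are all hypothetical and are not taken from the
parquet-format repo or the PR.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    # Stage names are illustrative, not taken from the parquet-format repo.
    ISSUE_DISCUSSION = "issue discussion"
    POC = "POC and perf evaluation"
    ACCEPTED = "accepted by PMC vote"
    IMPLEMENTED = "cross-compatible implementations"


# Each stage can only advance to the next one in the flow described above.
NEXT_STAGE = {
    Stage.ISSUE_DISCUSSION: Stage.POC,
    Stage.POC: Stage.ACCEPTED,
    Stage.ACCEPTED: Stage.IMPLEMENTED,
}


@dataclass
class Proposal:
    title: str
    issue_url: str  # the GitHub issue where the discussion started
    stage: Stage = Stage.ISSUE_DISCUSSION
    history: list = field(default_factory=list)

    def advance(self, note: str) -> None:
        """Move to the next stage, recording why (e.g. 'PMC vote passed')."""
        if self.stage not in NEXT_STAGE:
            raise ValueError(f"{self.title!r} is already at the final stage")
        self.history.append((self.stage, note))
        self.stage = NEXT_STAGE[self.stage]


def readme_table(proposals):
    """Render a status table like the one the proposals README could hold."""
    lines = ["| Proposal | Status | Issue |", "|---|---|---|"]
    for p in proposals:
        lines.append(f"| {p.title} | {p.stage.value} | {p.issue_url} |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical proposal and placeholder URL, for illustration only.
    p = Proposal("Example encoding", "https://example.invalid/issues/1")
    p.advance("discussion concluded the idea is worth a POC")
    print(readme_table([p]))
```

The point of keeping the transitions in one table is that the only way to
reach the "implemented" stage is through the POC and the PMC vote, which is
the ordering the steps above describe.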



On Wed, Aug 6, 2025 at 11:08 AM Andrew Lamb  wrote:

> I would like to bump this thread as it came up again on the parquet sync
> call today
>
> Specifically, it seems like there is increasing interest in adding new
> encodings to the Parquet, so getting consensus on what that process looks
> like and what is required is more important.
>
> If you are interested in this topic, please leave comments on the Google
> Doc[1] or reply to this email chain.
>
> Thank you,
> Andrew
>
> [1]
>
> https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0
>
> On Thu, May 29, 2025 at 2:42 AM Micah Kornfield 
> wrote:
>
> > I wrote up a long overdue draft
> > <https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0>
> > [1]
> > on how we can move forward with additional features (it provides some
> > proposed requirements on both consuming third-party code, as well as some
> > more specific guidance on new encodings, and some orthogonal work that
> > would be nice to see).
> >
> > The doc still lacks some details, and might be too opinionated in places
> > but I think it serves as a good basis for conversation (and at least gets
> > me out of the critical path for evolving Parquet).
> >
> > I'm very excited to start moving forward with improvements.
> >
> > Thanks,
> > Micah
> >
> > [1]
> > https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0
> >
>


Re: [Discuss] Feature addition requirements/process

2025-08-06 Thread Andrew Lamb
I would like to bump this thread as it came up again on the parquet sync
call today.

Specifically, it seems like there is increasing interest in adding new
encodings to Parquet, so getting consensus on what that process looks
like and what is required has become more important.

If you are interested in this topic, please leave comments on the Google
Doc[1] or reply to this email chain.

Thank you,
Andrew

[1]
https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0

On Thu, May 29, 2025 at 2:42 AM Micah Kornfield 
wrote:

> I wrote up a long overdue draft
> <https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0>
> [1]
> on how we can move forward with additional features (it provides some
> proposed requirements on both consuming third-party code, as well as some
> more specific guidance on new encodings, and some orthogonal work that
> would be nice to see).
>
> The doc still lacks some details, and might be too opinionated in places
> but I think it serves as a good basis for conversation (and at least gets
> me out of the critical path for evolving Parquet).
>
> I'm very excited to start moving forward with improvements.
>
> Thanks,
> Micah
>
> [1]
>
> https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0
>