Re: [DISCUSS] Anyone working on / want to help with new encoding proposals?

2025-10-18 Thread Micah Kornfield
>
> +1 on looking in openzl more deeply *before* we add new encodings.


I think the project's current compatibility guarantees [1] (quoted below)
are not sufficient for use in Parquet, although some of the ideas might be
interesting to look at and adopt in the meantime:

"However, we intend to maintain some stability guarantees in the face of
that evolution. In particular, payloads compressed with any release-tagged
version of the library will remain decompressible by new releases of the
library for at least the next several years. And new releases of the
library will be able to generate frames compatible with at least the
previous release."


[1] https://github.com/facebook/openzl

On Tue, Oct 14, 2025 at 6:58 AM Alkis Evlogimenos
 wrote:

> +1 on looking in openzl more deeply *before* we add new encodings.
>
> What's very attractive about openzl is that the decoder is fixed and
> advancements in encoding are backwards/forwards compatible. This means less
> changes to the format itself. The ideal end state would be to add openzl to
> parquet and encode everything as PLAIN.
>
> One thing to investigate is if we can get openzl compressed data at some
> point in the graph and then perform compressed execution on them. This
> would be perfect for dictionary encoded streams.
>
> On Tue, Oct 7, 2025 at 4:34 PM Krisztián Szűcs 
> wrote:
>
> > Hi,
> >
> > There seems to be a new (if I’m not mistaken it was published yesterday)
> > codec/compression framework called OpenZL [1][2][3]. I haven’t looked at
> > it
> > thoroughly yet, but it somewhat reminds me of BtrBlocks.
> > Even if we don’t consider more advanced features of a framework like
> this,
> > we could offload the various codec implementations to another project.
> >
> > Krisztian
> >
> > [1]: https://openzl.org/
> > [2]: https://github.com/facebook/openzl/tree/dev/src/openzl/codecs
> > [3]:
> >
> https://engineering.fb.com/2025/10/06/developer-tools/openzl-open-source-format-aware-compression-framework/
> >
> > > On 2025. Oct 1., at 20:11, Andrew Lamb  wrote:
> > >
> > > I would like to start a discussion to help organize and rally anyone
> > > interested in adding new encodings to Parquet.
> > >
> > > I am pretty sure there are many people interested in adding new
> > encodings,
> > > but there are only a few mentions on the mailing list, such as pcode
> [1]
> > > and FSST/ALP/FastLanes [2]. Prateek mentioned on the sync call today
> > > that he is working on evaluating some potential encodings and hopes to
> > have
> > > some information to share soon, and Julien mentioned he had spoken to
> > > someone else who might be doing something similar.
> > >
> > > Now that Julien has defined a process to extend the spec[3] I think the
> > > steps are much clearer.
> > >
> > > So, I would like to invite anyone interested in adding new encodings to
> > > respond and let us know if you are willing to help evaluate new
> encodings
> > > and prototype integrations into Parquet implementations?
> > >
> > > Andrew
> > >
> > >
> > > [1]: https://lists.apache.org/thread/bdmfcj4g6y1ccd3mfgrp7d43d73s6zf6
> > > [2]: https://lists.apache.org/thread/s3o9jk0hr942pv6ono4ymnvvj6pfdsdw
> > > [3]:
> > >
> https://github.com/apache/parquet-format/blob/master/proposals/README.md
> >
> >
>


RE: Re: [DISCUSS] Anyone working on / want to help with new encoding proposals?

2025-10-18 Thread [email protected]
Thank you Andrew,

I'll take a look at the C/C++ and Rust implementations.

By the way, do you think it's necessary to implement ALP directly within
Parquet to evaluate its performance? Or would it be sufficient to benchmark
it using the implementations you mentioned without integrating it into Parquet,
just to get a sense of its potential?

Naohiro

On 2025/10/03 13:21:03 Andrew Lamb wrote:
> This is super exciting, thank you Naohiro
>
> I also think ALP[1] (built on FastLanes[2]) is a great encoding to explore
>
> Getting a Java based implementation of ALP would be a great validation
> that the approach works well across platforms. There are open source
> implementations in both C/C++[3] and Rust (via vortex) [4] that we could
> use to benchmark / build prototypes
>
> Andrew
>
> [1]: https://ir.cwi.nl/pub/4/4.pdf
> [2]: https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf
> [3]:
> https://github.com/cwida/FastLanes/tree/4014a3a51083a06b6d446fb78e446494721aa12b/src/alp
> [4]:
> https://github.com/vortex-data/vortex/blob/153040140e72d9038f5c092e6c6348c28a462211/encodings/alp/src/lib.rs#L4
>
> On Fri, Oct 3, 2025 at 12:22 AM [email protected]
>  wrote:
>
> > Hi Andrew,
> >
> > I'm Naohiro, and I'm the person Julien has been in touch with. I was
> > planning to attend the sync yesterday but unfortunately missed it due to
> > the timezone difference. (I’m in Japan)
> >
> > Thanks for kicking off this discussion, I'm definitely interested in
> > contributing.
> >
> > To



Re: [DISCUSS] Anyone working on / want to help with new encoding proposals?

2025-10-18 Thread Krisztián Szűcs
Hi,

There seems to be a new (if I’m not mistaken it was published yesterday) 
codec/compression framework called OpenZL [1][2][3]. I haven’t looked at it 
thoroughly yet, but it somewhat reminds me of BtrBlocks. 
Even if we don’t consider more advanced features of a framework like this,
we could offload the various codec implementations to another project.

Krisztian

[1]: https://openzl.org/
[2]: https://github.com/facebook/openzl/tree/dev/src/openzl/codecs
[3]: 
https://engineering.fb.com/2025/10/06/developer-tools/openzl-open-source-format-aware-compression-framework/

> On 2025. Oct 1., at 20:11, Andrew Lamb  wrote:
> 
> I would like to start a discussion to help organize and rally anyone
> interested in adding new encodings to Parquet.
> 
> I am pretty sure there are many people interested in adding new encodings,
> but there are only a few mentions on the mailing list, such as pcode [1]
> and FSST/ALP/FastLanes [2]. Prateek mentioned on the sync call today
> that he is working on evaluating some potential encodings and hopes to have
> some information to share soon, and Julien mentioned he had spoken to
> someone else who might be doing something similar.
> 
> Now that Julien has defined a process to extend the spec[3] I think the
> steps are much clearer.
> 
> So, I would like to invite anyone interested in adding new encodings to
> respond and let us know if you are willing to help evaluate new encodings
> and prototype integrations into Parquet implementations?
> 
> Andrew
> 
> 
> [1]: https://lists.apache.org/thread/bdmfcj4g6y1ccd3mfgrp7d43d73s6zf6
> [2]: https://lists.apache.org/thread/s3o9jk0hr942pv6ono4ymnvvj6pfdsdw
> [3]:
> https://github.com/apache/parquet-format/blob/master/proposals/README.md



RE: [DISCUSS] Anyone working on / want to help with new encoding proposals?

2025-10-18 Thread Pierre Lacave
Hi Andrew,

I am planning to evaluate the impact of FSST and ALP for a sample of
Datadog event data.

I was thinking of hacking something with arrow-rs/parquet and Vortex crates.

Will make sure to post my findings here

Thanks


On 2025/10/01 18:11:51 Andrew Lamb wrote:
> I would like to start a discussion to help organize and rally anyone
> interested in adding new encodings to Parquet.
>
> I am pretty sure there are many people interested in adding new encodings,
> but there are only a few mentions on the mailing list, such as pcode [1]
> and FSST/ALP/FastLanes [2]. Prateek mentioned on the sync call today
> that he is working on evaluating some potential encodings and hopes to have
> some information to share soon, and Julien mentioned he had spoken to
> someone else who might be doing something similar.
>
> Now that Julien has defined a process to extend the spec[3] I think the
> steps are much clearer.
>
> So, I would like to invite anyone interested in adding new encodings to
> respond and let us know if you are willing to help evaluate new encodings
> and prototype integrations into Parquet implementations?
>
> Andrew
>
>
> [1]: https://lists.apache.org/thread/bdmfcj4g6y1ccd3mfgrp7d43d73s6zf6
> [2]: https://lists.apache.org/thread/s3o9jk0hr942pv6ono4ymnvvj6pfdsdw
> [3]:
> https://github.com/apache/parquet-format/blob/master/proposals/README.md
>


Re: Re: [DISCUSS] Anyone working on / want to help with new encoding proposals?

2025-10-18 Thread Andrew Bell
Just as a note, in many cases floating point values are better stored as
scaled integers. Often the range of floating point values used in an
application doesn't require the range offered by the exponent and mantissa
of a floating point value, and the values are better represented, at least
when stored, as integers to which a scale factor/offset can be applied.
This allows optimal integer compression schemes to be used on the data.
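
As a rough illustration of the scale factor/offset idea, here is a minimal
Python sketch. The function names, the sample readings, and the choice of
scale=0.01 and offset=20.0 are all assumptions made up for this example;
nothing here comes from a Parquet implementation.

```python
def encode_scaled(values, scale, offset=0.0):
    """Map each float v to the integer round((v - offset) / scale).

    Also returns the largest absolute round-trip error, so the caller can
    check whether the chosen scale/offset actually fits the data.
    """
    ints = [round((v - offset) / scale) for v in values]
    decoded = decode_scaled(ints, scale, offset)
    max_err = max(abs(a - b) for a, b in zip(values, decoded))
    return ints, max_err


def decode_scaled(ints, scale, offset=0.0):
    """Invert encode_scaled: integer * scale + offset."""
    return [i * scale + offset for i in ints]


# Hypothetical sensor readings with two decimal digits of precision.
readings = [20.15, 20.17, 20.16, 20.21, 20.19]
ints, max_err = encode_scaled(readings, scale=0.01, offset=20.0)

# The stored integers are small, so few bits per value are needed.
bits_needed = max(ints).bit_length()
```

With these inputs the stored integers fit in 5 bits each, compared with 64
bits for a double, which is what makes integer schemes such as bit-packing
attractive afterwards. Note that the round trip is only approximate here
(max_err is tiny but not guaranteed to be zero).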

On Tue, Oct 7, 2025 at 8:26 AM [email protected]
 wrote:

> Thank you Andrew,
>
> I'll take a look at the C/C++ and Rust implementations.
>
> By the way, do you think it's necessary to implement ALP directly within
> Parquet to evaluate its performance? Or would it be sufficient to benchmark
> it using the implementations you mentioned without integrating it into
> Parquet,
> just to get a sense of its potential?
>
> Naohiro
>
> On 2025/10/03 13:21:03 Andrew Lamb wrote:
> > This is super exciting, thank you Naohiro
> >
> > I also think ALP[1] (built on FastLanes[2]) is a great encoding to
> explore
> >
> > Getting a Java based implementation of ALP would be a great validation
> > that the approach works well across platforms. There are open source
> > implementations in both C/C++[3] and Rust (via vortex) [4] that we could
> > use to benchmark / build prototypes
> >
> > Andrew
> >
> > [1]: https://ir.cwi.nl/pub/4/4.pdf
> > [2]: https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf
> > [3]:
> >
> https://github.com/cwida/FastLanes/tree/4014a3a51083a06b6d446fb78e446494721aa12b/src/alp
> > [4]:
> >
> https://github.com/vortex-data/vortex/blob/153040140e72d9038f5c092e6c6348c28a462211/encodings/alp/src/lib.rs#L4
> >
> > On Fri, Oct 3, 2025 at 12:22 AM
> [email protected]
> >  wrote:
> >
> > > Hi Andrew,
> > >
> > > I'm Naohiro, and I'm the person Julien has been in touch with. I was
> > > planning to attend the sync yesterday but unfortunately missed it due
> to
> > > the timezone difference. (I’m in Japan)
> > >
> > > Thanks for kicking off this discussion, I'm definitely interested in
> > > contributing.
> > >
> > > To
>
>

-- 
Andrew Bell
[email protected]


Re: Re: [DISCUSS] Anyone working on / want to help with new encoding proposals?

2025-10-17 Thread Andrew Lamb
Hi Kakimura,

> By the way, do you think it's necessary to implement ALP directly within
> Parquet to evaluate its performance?

From my perspective, the algorithm's performance is well explained in the
paper[1]. I suggest there are 2 milestones:

1. Gather any additional evidence that the algorithm is worth pursuing
(e.g. perhaps apply to your data, or independently reproduce the results in
the paper)
2. Make the case / proposal to add to Parquet.

Perhaps a good first thing to try would be your datasets with the Vortex[2]
file format (which has an implementation of ALP)

When we get to step 2, I do think we'll need to integrate with two Parquet
implementations.

> Just as a note, in many cases floating point values are better stored as
> scaled integers.

Andrew, indeed you are right. In fact, the core of the ALP algorithm
transforms floating point values into scaled integers (and then applies the
techniques from FastLanes[3], which auto-vectorize well)
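
To make the float-to-scaled-integer step concrete, here is a heavily
simplified Python sketch in the spirit of ALP. It is an assumption-laden
illustration, not the algorithm from the paper: it searches a single decimal
exponent, decodes with a plain division, and keeps values that do not round
trip exactly as "exceptions". The names alp_like_encode / alp_like_decode and
the sample values are hypothetical, and the FastLanes bit-packing stage is
left out entirely.

```python
import math


def alp_like_encode(values, max_exponent=10):
    """Pick the decimal exponent e that leaves the fewest exceptions.

    A value v is encoded as d = round(v * 10**e) only if d / 10**e gives
    back exactly v; otherwise it is stored verbatim as an exception and a
    placeholder 0 is written into the integer stream.
    """
    best = None
    for e in range(max_exponent + 1):
        factor = 10 ** e
        ints, exceptions = [], {}
        for pos, v in enumerate(values):
            d = round(v * factor) if math.isfinite(v) else None
            if d is not None and d / factor == v:
                ints.append(d)
            else:
                ints.append(0)
                exceptions[pos] = v
        # Keep the smallest exponent among those with the fewest exceptions.
        if best is None or len(exceptions) < len(best[2]):
            best = (e, ints, exceptions)
        if not exceptions:
            break
    return best


def alp_like_decode(exponent, ints, exceptions):
    """Rebuild the floats, then patch in the verbatim exceptions."""
    factor = 10 ** exponent
    out = [d / factor for d in ints]
    for pos, v in exceptions.items():
        out[pos] = v
    return out


values = [1.25, 3.5, 0.1, 22.75, 1 / 3]
exponent, ints, exceptions = alp_like_encode(values)
restored = alp_like_decode(exponent, ints, exceptions)
```

For this input the sketch settles on exponent 2, stores four values as
integers, and keeps 1/3 as the only exception, so decoding is lossless. The
exact-round-trip check is the part that distinguishes this from a fixed
scale factor chosen by hand.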

Andrew

[1]: https://dl.acm.org/doi/10.1145/3626717
[2]: https://github.com/vortex-data/vortex
[3]: https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf

On Tue, Oct 7, 2025 at 8:26 AM [email protected]
 wrote:

> Thank you Andrew,
>
> I'll take a look at the C/C++ and Rust implementations.
>
> By the way, do you think it's necessary to implement ALP directly within
> Parquet to evaluate its performance? Or would it be sufficient to benchmark
> it using the implementations you mentioned without integrating it into
> Parquet,
> just to get a sense of its potential?
>
> Naohiro
>
> On 2025/10/03 13:21:03 Andrew Lamb wrote:
> > This is super exciting, thank you Naohiro
> >
> > I also think ALP[1] (built on FastLanes[2]) is a great encoding to
> explore
> >
> > Getting a Java based implementation of ALP would be a great validation
> > that the approach works well across platforms. There are open source
> > implementations in both C/C++[3] and Rust (via vortex) [4] that we could
> > use to benchmark / build prototypes
> >
> > Andrew
> >
> > [1]: https://ir.cwi.nl/pub/4/4.pdf
> > [2]: https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf
> > [3]:
> >
> https://github.com/cwida/FastLanes/tree/4014a3a51083a06b6d446fb78e446494721aa12b/src/alp
> > [4]:
> >
> https://github.com/vortex-data/vortex/blob/153040140e72d9038f5c092e6c6348c28a462211/encodings/alp/src/lib.rs#L4
> >
> > On Fri, Oct 3, 2025 at 12:22 AM
> [email protected]
> >  wrote:
> >
> > > Hi Andrew,
> > >
> > > I'm Naohiro, and I'm the person Julien has been in touch with. I was
> > > planning to attend the sync yesterday but unfortunately missed it due
> to
> > > the timezone difference. (I’m in Japan)
> > >
> > > Thanks for kicking off this discussion, I'm definitely interested in
> > > contributing.
> > >
> > > To
>
>


Re: [DISCUSS] Anyone working on / want to help with new encoding proposals?

2025-10-14 Thread Alkis Evlogimenos
+1 on looking in openzl more deeply *before* we add new encodings.

What's very attractive about openzl is that the decoder is fixed and
advancements in encoding are backwards/forwards compatible. This means less
changes to the format itself. The ideal end state would be to add openzl to
parquet and encode everything as PLAIN.

One thing to investigate is if we can get openzl compressed data at some
point in the graph and then perform compressed execution on them. This
would be perfect for dictionary encoded streams.
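
As a generic illustration of what "compressed execution" on a dictionary
encoded stream can mean, here is a small Python sketch that evaluates an
equality filter on the integer codes without materializing the original
values. It does not use or model any OpenZL API; the function names and the
sample column are assumptions for this example only.

```python
def dictionary_encode(values):
    """Return (dictionary, codes): distinct values plus one code per row."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    codes = [index[v] for v in values]
    return dictionary, codes


def filter_equals(dictionary, codes, needle):
    """Return matching row positions by comparing integer codes only.

    The needle is looked up in the dictionary once; the per-row work is a
    single integer comparison, and no row value is ever decoded.
    """
    try:
        target = dictionary.index(needle)
    except ValueError:
        return []  # value absent from the dictionary, so no row can match
    return [row for row, code in enumerate(codes) if code == target]


column = ["us", "eu", "us", "ap", "eu", "us"]
dictionary, codes = dictionary_encode(column)
matches = filter_equals(dictionary, codes, "us")
```

The open question raised above is whether a reader could stop decoding at
the point in the graph where such codes are available, rather than always
expanding back to plain values.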

On Tue, Oct 7, 2025 at 4:34 PM Krisztián Szűcs 
wrote:

> Hi,
>
> There seems to be a new (if I’m not mistaken it was published yesterday)
> codec/compression framework called OpenZL [1][2][3]. I haven’t looked at
> it
> thoroughly yet, but it somewhat reminds me of BtrBlocks.
> Even if we don’t consider more advanced features of a framework like this,
> we could offload the various codec implementations to another project.
>
> Krisztian
>
> [1]: https://openzl.org/
> [2]: https://github.com/facebook/openzl/tree/dev/src/openzl/codecs
> [3]:
> https://engineering.fb.com/2025/10/06/developer-tools/openzl-open-source-format-aware-compression-framework/
>
> > On 2025. Oct 1., at 20:11, Andrew Lamb  wrote:
> >
> > I would like to start a discussion to help organize and rally anyone
> > interested in adding new encodings to Parquet.
> >
> > I am pretty sure there are many people interested in adding new
> encodings,
> > but there are only a few mentions on the mailing list, such as pcode [1]
> > and FSST/ALP/FastLanes [2]. Prateek mentioned on the sync call today
> > that he is working on evaluating some potential encodings and hopes to
> have
> > some information to share soon, and Julien mentioned he had spoken to
> > someone else who might be doing something similar.
> >
> > Now that Julien has defined a process to extend the spec[3] I think the
> > steps are much clearer.
> >
> > So, I would like to invite anyone interested in adding new encodings to
> > respond and let us know if you are willing to help evaluate new encodings
> > and prototype integrations into Parquet implementations?
> >
> > Andrew
> >
> >
> > [1]: https://lists.apache.org/thread/bdmfcj4g6y1ccd3mfgrp7d43d73s6zf6
> > [2]: https://lists.apache.org/thread/s3o9jk0hr942pv6ono4ymnvvj6pfdsdw
> > [3]:
> > https://github.com/apache/parquet-format/blob/master/proposals/README.md
>
>


Re: [DISCUSS] Anyone working on / want to help with new encoding proposals?

2025-10-03 Thread Andrew Lamb
This is super exciting, thank you Naohiro

I also think ALP[1] (built on FastLanes[2]) is a great encoding to explore

Getting a Java based implementation of ALP would be a great validation
that the approach works well across platforms. There are open source
implementations in both C/C++[3] and Rust (via vortex) [4] that we could
use to benchmark / build prototypes

Andrew

[1]: https://ir.cwi.nl/pub/4/4.pdf
[2]: https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf
[3]:
https://github.com/cwida/FastLanes/tree/4014a3a51083a06b6d446fb78e446494721aa12b/src/alp
[4]:
https://github.com/vortex-data/vortex/blob/153040140e72d9038f5c092e6c6348c28a462211/encodings/alp/src/lib.rs#L4

On Fri, Oct 3, 2025 at 12:22 AM [email protected]
 wrote:

> Hi Andrew,
>
> I'm Naohiro, and I'm the person Julien has been in touch with. I was
> planning to attend the sync yesterday but unfortunately missed it due to
> the timezone difference. (I’m in Japan)
>
> Thanks for kicking off this discussion, I'm definitely interested in
> contributing.
>
> To start with, I'm currently working on a POC in parquet-java to evaluate
> ALP. While ALP and floating-point compression are my main focus at the
> moment, I'm also interested in exploring other encoding strategies that
> could benefit Parquet. I'm also drafting a proposal in Google Docs, and
> once it's ready, I'll share the link.
>
> I'd love to hear if others are working on similar efforts, especially
> around floating-point compression, to avoid duplication and potentially
> collaborate.
>
> On 2025/10/01 18:11:51 Andrew Lamb wrote:
> > I would like to start a discussion to help organize and rally anyone
> > interested in adding new encodings to Parquet.
> >
> > I am pretty sure there are many people interested in adding new
> encodings,
> > but there are only a few mentions on the mailing list, such as pcode [1]
> > and FSST/ALP/FastLanes [2]. Prateek mentioned on the sync call today
> > that he is working on evaluating some potential encodings and hopes to
> have
> > some information to share soon, and Julien mentioned he had spoken to
> > someone else who might be doing something similar.
> >
> > Now that Julien has defined a process to extend the spec[3] I think the
> > steps are much clearer.
> >
> > So, I would like to invite anyone interested in adding new encodings to
> > respond and let us know if you are willing to help evaluate new encodings
> > and prototype integrations into Parquet implementations?
> >
> > Andrew
> >
> >
> > [1]: https://lists.apache.org/thread/bdmfcj4g6y1ccd3mfgrp7d43d73s6zf6
> > [2]: https://lists.apache.org/thread/s3o9jk0hr942pv6ono4ymnvvj6pfdsdw
> > [3]:
> > https://github.com/apache/parquet-format/blob/master/proposals/README.md
> >
>
>


RE: [DISCUSS] Anyone working on / want to help with new encoding proposals?

2025-10-02 Thread [email protected]
Hi Andrew,

I'm Naohiro, and I'm the person Julien has been in touch with. I was planning 
to attend the sync yesterday but unfortunately missed it due to the timezone 
difference. (I’m in Japan)

Thanks for kicking off this discussion, I'm definitely interested in 
contributing.

To start with, I'm currently working on a POC in parquet-java to evaluate ALP. 
While ALP and floating-point compression are my main focus at the moment, I'm 
also interested in exploring other encoding strategies that could benefit 
Parquet. I'm also drafting a proposal in Google Docs, and once it's ready, I'll 
share the link.

I'd love to hear if others are working on similar efforts, especially around 
floating-point compression, to avoid duplication and potentially collaborate.

On 2025/10/01 18:11:51 Andrew Lamb wrote:
> I would like to start a discussion to help organize and rally anyone
> interested in adding new encodings to Parquet.
>
> I am pretty sure there are many people interested in adding new encodings,
> but there are only a few mentions on the mailing list, such as pcode [1]
> and FSST/ALP/FastLanes [2]. Prateek mentioned on the sync call today
> that he is working on evaluating some potential encodings and hopes to have
> some information to share soon, and Julien mentioned he had spoken to
> someone else who might be doing something similar.
>
> Now that Julien has defined a process to extend the spec[3] I think the
> steps are much clearer.
>
> So, I would like to invite anyone interested in adding new encodings to
> respond and let us know if you are willing to help evaluate new encodings
> and prototype integrations into Parquet implementations?
>
> Andrew
>
>
> [1]: https://lists.apache.org/thread/bdmfcj4g6y1ccd3mfgrp7d43d73s6zf6
> [2]: https://lists.apache.org/thread/s3o9jk0hr942pv6ono4ymnvvj6pfdsdw
> [3]:
> https://github.com/apache/parquet-format/blob/master/proposals/README.md
>