Re: [DISCUSS] Anyone working on / want to help with new encoding proposals?
> +1 on looking into openzl more deeply *before* we add new encodings.

I think the compatibility guarantees for the project are currently not
sufficient for use in Parquet due to lack of guaranteed compatibility [1],
though some of the ideas might be interesting to look at and adopt in the
meantime:

"However, we intend to maintain some stability guarantees in the face of
that evolution. In particular, payloads compressed with any release-tagged
version of the library will remain decompressible by new releases of the
library for at least the next several years. And new releases of the
library will be able to generate frames compatible with at least the
previous release."

[1] https://github.com/facebook/openzl

On Tue, Oct 14, 2025 at 6:58 AM Alkis Evlogimenos wrote:
> +1 on looking into openzl more deeply *before* we add new encodings.
>
> What's very attractive about openzl is that the decoder is fixed and
> advancements in encoding are backwards/forwards compatible. This means
> fewer changes to the format itself. The ideal end state would be to add
> openzl to parquet and encode everything as PLAIN.
>
> One thing to investigate is whether we can get openzl compressed data at
> some point in the graph and then perform compressed execution on it.
> This would be perfect for dictionary encoded streams.
>
> On Tue, Oct 7, 2025 at 4:34 PM Krisztián Szűcs wrote:
> > Hi,
> >
> > There seems to be a new (if I'm not mistaken it was published
> > yesterday) codec/compression framework called OpenZL [1][2][3]. I
> > haven't looked at it thoroughly yet, but it somewhat reminds me of
> > BtrBlocks. Even if we don't consider more advanced features of a
> > framework like this, we could offload the various codec
> > implementations to another project.
> >
> > Krisztian
> >
> > [1]: https://openzl.org/
> > [2]: https://github.com/facebook/openzl/tree/dev/src/openzl/codecs
> > [3]: https://engineering.fb.com/2025/10/06/developer-tools/openzl-open-source-format-aware-compression-framework/
> >
> > > On 2025. Oct 1., at 20:11, Andrew Lamb wrote:
> > >
> > > I would like to start a discussion to help organize and rally anyone
> > > interested in adding new encodings to Parquet.
> > >
> > > I am pretty sure there are many people interested in adding new
> > > encodings, but there are only a few mentions on the mailing list,
> > > such as pcode [1] and FSST/ALP/FastLanes [2]. Prateek mentioned on
> > > the sync call today that he is working on evaluating some potential
> > > encodings and hopes to have some information to share soon, and
> > > Julien mentioned he had spoken to someone else who might be doing
> > > something similar.
> > >
> > > Now that Julien has defined a process to extend the spec[3] I think
> > > the steps are much clearer.
> > >
> > > So, I would like to invite anyone interested in adding new encodings
> > > to respond and let us know if you are willing to help evaluate new
> > > encodings and prototype integrations into Parquet implementations?
> > >
> > > Andrew
> > >
> > > [1]: https://lists.apache.org/thread/bdmfcj4g6y1ccd3mfgrp7d43d73s6zf6
> > > [2]: https://lists.apache.org/thread/s3o9jk0hr942pv6ono4ymnvvj6pfdsdw
> > > [3]: https://github.com/apache/parquet-format/blob/master/proposals/README.md
RE: Re: [DISCUSS] Anyone working on / want to help with new encoding proposals?
Thank you Andrew,

I'll take a look at the C/C++ and Rust implementations.

By the way, do you think it's necessary to implement ALP directly within
Parquet to evaluate its performance? Or would it be sufficient to benchmark
it using the implementations you mentioned, without integrating it into
Parquet, just to get a sense of its potential?

Naohiro

On 2025/10/03 13:21:03 Andrew Lamb wrote:
> This is super exciting, thank you Naohiro
>
> I also think ALP[1] (built on FastLanes[2]) is a great encoding to
> explore
>
> Getting a Java based implementation of ALP would be a great validation
> that the approach works well across platforms. There are open source
> implementations in both C/C++[3] and Rust (via vortex) [4] that we could
> use to benchmark / build prototypes
>
> Andrew
>
> [1]: https://ir.cwi.nl/pub/4/4.pdf
> [2]: https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf
> [3]: https://github.com/cwida/FastLanes/tree/4014a3a51083a06b6d446fb78e446494721aa12b/src/alp
> [4]: https://github.com/vortex-data/vortex/blob/153040140e72d9038f5c092e6c6348c28a462211/encodings/alp/src/lib.rs#L4
>
> On Fri, Oct 3, 2025 at 12:22 AM [email protected] wrote:
> > Hi Andrew,
> >
> > I'm Naohiro, and I'm the person Julien has been in touch with. I was
> > planning to attend the sync yesterday but unfortunately missed it due
> > to the timezone difference. (I'm in Japan)
> >
> > Thanks for kicking off this discussion, I'm definitely interested in
> > contributing.
> >
> > To
Re: [DISCUSS] Anyone working on / want to help with new encoding proposals?
Hi,

There seems to be a new (if I'm not mistaken it was published yesterday)
codec/compression framework called OpenZL [1][2][3]. I haven't looked at it
thoroughly yet, but it somewhat reminds me of BtrBlocks.
Even if we don't consider more advanced features of a framework like this,
we could offload the various codec implementations to another project.

Krisztian

[1]: https://openzl.org/
[2]: https://github.com/facebook/openzl/tree/dev/src/openzl/codecs
[3]: https://engineering.fb.com/2025/10/06/developer-tools/openzl-open-source-format-aware-compression-framework/

> On 2025. Oct 1., at 20:11, Andrew Lamb wrote:
>
> I would like to start a discussion to help organize and rally anyone
> interested in adding new encodings to Parquet.
>
> I am pretty sure there are many people interested in adding new
> encodings, but there are only a few mentions on the mailing list, such as
> pcode [1] and FSST/ALP/FastLanes [2]. Prateek mentioned on the sync call
> today that he is working on evaluating some potential encodings and hopes
> to have some information to share soon, and Julien mentioned he had
> spoken to someone else who might be doing something similar.
>
> Now that Julien has defined a process to extend the spec[3] I think the
> steps are much clearer.
>
> So, I would like to invite anyone interested in adding new encodings to
> respond and let us know if you are willing to help evaluate new encodings
> and prototype integrations into Parquet implementations?
>
> Andrew
>
> [1]: https://lists.apache.org/thread/bdmfcj4g6y1ccd3mfgrp7d43d73s6zf6
> [2]: https://lists.apache.org/thread/s3o9jk0hr942pv6ono4ymnvvj6pfdsdw
> [3]: https://github.com/apache/parquet-format/blob/master/proposals/README.md
RE: [DISCUSS] Anyone working on / want to help with new encoding proposals?
Hi Andrew,

I am planning to evaluate the impact of FSST and ALP on a sample of Datadog
event data. I was thinking of hacking something together with the
arrow-rs/parquet and Vortex crates. I will make sure to post my findings
here.

Thanks

On 2025/10/01 18:11:51 Andrew Lamb wrote:
> I would like to start a discussion to help organize and rally anyone
> interested in adding new encodings to Parquet.
>
> I am pretty sure there are many people interested in adding new
> encodings, but there are only a few mentions on the mailing list, such as
> pcode [1] and FSST/ALP/FastLanes [2]. Prateek mentioned on the sync call
> today that he is working on evaluating some potential encodings and hopes
> to have some information to share soon, and Julien mentioned he had
> spoken to someone else who might be doing something similar.
>
> Now that Julien has defined a process to extend the spec[3] I think the
> steps are much clearer.
>
> So, I would like to invite anyone interested in adding new encodings to
> respond and let us know if you are willing to help evaluate new encodings
> and prototype integrations into Parquet implementations?
>
> Andrew
>
> [1]: https://lists.apache.org/thread/bdmfcj4g6y1ccd3mfgrp7d43d73s6zf6
> [2]: https://lists.apache.org/thread/s3o9jk0hr942pv6ono4ymnvvj6pfdsdw
> [3]: https://github.com/apache/parquet-format/blob/master/proposals/README.md
Re: Re: [DISCUSS] Anyone working on / want to help with new encoding proposals?
Just as a note, in many cases floating point values are better stored as
scaled integers. Often the range of floating point values used in an
application doesn't require what the exponent and mantissa of a floating
point value offer, and such values are better represented, at least when
stored, as integers to which a scale factor/offset can be applied. This
allows optimal integer compression schemes to be used on the data.

On Tue, Oct 7, 2025 at 8:26 AM [email protected] wrote:
> Thank you Andrew,
>
> I'll take a look at the C/C++ and Rust implementations.
>
> By the way, do you think it's necessary to implement ALP directly within
> Parquet to evaluate its performance? Or would it be sufficient to
> benchmark it using the implementations you mentioned without integrating
> it into Parquet, just to get a sense of its potential?
>
> Naohiro
>
> On 2025/10/03 13:21:03 Andrew Lamb wrote:
> > This is super exciting, thank you Naohiro
> >
> > I also think ALP[1] (built on FastLanes[2]) is a great encoding to
> > explore
> >
> > Getting a Java based implementation of ALP would be a great validation
> > that the approach works well across platforms. There are open source
> > implementations in both C/C++[3] and Rust (via vortex) [4] that we
> > could use to benchmark / build prototypes
> >
> > Andrew
> >
> > [1]: https://ir.cwi.nl/pub/4/4.pdf
> > [2]: https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf
> > [3]: https://github.com/cwida/FastLanes/tree/4014a3a51083a06b6d446fb78e446494721aa12b/src/alp
> > [4]: https://github.com/vortex-data/vortex/blob/153040140e72d9038f5c092e6c6348c28a462211/encodings/alp/src/lib.rs#L4
> >
> > On Fri, Oct 3, 2025 at 12:22 AM [email protected] wrote:
> > > Hi Andrew,
> > >
> > > I'm Naohiro, and I'm the person Julien has been in touch with. I was
> > > planning to attend the sync yesterday but unfortunately missed it
> > > due to the timezone difference. (I'm in Japan)
> > >
> > > Thanks for kicking off this discussion, I'm definitely interested in
> > > contributing.
> > >
> > > To

-- 
Andrew Bell
[email protected]
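A rough illustration of the scaled-integer idea above (the scale factor of 100 and the sample values are assumptions for illustration, not from the thread):

```python
# Sketch: storing floats as scaled integers. Lossy if a value needs more
# precision than the chosen scale provides.

SCALE = 100  # assumed: values have at most 2 decimal digits

def encode(values):
    """Map floats to integers under the fixed scale factor."""
    return [round(v * SCALE) for v in values]

def decode(ints):
    """Recover the floats by applying the inverse scale."""
    return [i / SCALE for i in ints]

prices = [19.99, 20.00, 20.01, 20.05]
encoded = encode(prices)  # [1999, 2000, 2001, 2005]: small, dense integers
                          # that delta/bit-packing schemes compress well
assert decode(encoded) == prices  # exact round-trip at this scale
```

The encoded stream is a run of nearby integers, which is exactly what integer schemes like DELTA_BINARY_PACKED are designed for.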
Re: Re: [DISCUSS] Anyone working on / want to help with new encoding proposals?
Hi Kakimura,

> By the way, do you think it's necessary to implement ALP directly within
> Parquet to evaluate its performance?

From my perspective, the algorithm's performance is well explained in the
paper[1]. I suggest there are two milestones:

1. Gather any additional evidence that the algorithm is worth pursuing
   (e.g. perhaps apply it to your data, or independently reproduce the
   results in the paper)
2. Make the case / proposal to add it to Parquet

Perhaps a good first thing to try would be your datasets with the
Vortex[2] file format (which has an implementation of ALP). When we get to
step 2, I do think we'll need to integrate with two Parquet
implementations.

> Just as a note, in many cases floating point values are better stored as
> scaled integers.

Andrew, indeed you are right. In fact the core ALP algorithm transforms
floating point values to scaled integers (and then applies the techniques
from FastLanes[3], which auto-vectorize well).

Andrew

[1]: https://dl.acm.org/doi/10.1145/3626717
[2]: https://github.com/vortex-data/vortex
[3]: https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf

On Tue, Oct 7, 2025 at 8:26 AM [email protected] wrote:
> Thank you Andrew,
>
> I'll take a look at the C/C++ and Rust implementations.
>
> By the way, do you think it's necessary to implement ALP directly within
> Parquet to evaluate its performance? Or would it be sufficient to
> benchmark it using the implementations you mentioned without integrating
> it into Parquet, just to get a sense of its potential?
>
> Naohiro
>
> On 2025/10/03 13:21:03 Andrew Lamb wrote:
> > This is super exciting, thank you Naohiro
> >
> > I also think ALP[1] (built on FastLanes[2]) is a great encoding to
> > explore
> >
> > Getting a Java based implementation of ALP would be a great validation
> > that the approach works well across platforms. There are open source
> > implementations in both C/C++[3] and Rust (via vortex) [4] that we
> > could use to benchmark / build prototypes
> >
> > Andrew
> >
> > [1]: https://ir.cwi.nl/pub/4/4.pdf
> > [2]: https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf
> > [3]: https://github.com/cwida/FastLanes/tree/4014a3a51083a06b6d446fb78e446494721aa12b/src/alp
> > [4]: https://github.com/vortex-data/vortex/blob/153040140e72d9038f5c092e6c6348c28a462211/encodings/alp/src/lib.rs#L4
> >
> > On Fri, Oct 3, 2025 at 12:22 AM [email protected] wrote:
> > > Hi Andrew,
> > >
> > > I'm Naohiro, and I'm the person Julien has been in touch with. I was
> > > planning to attend the sync yesterday but unfortunately missed it
> > > due to the timezone difference. (I'm in Japan)
> > >
> > > Thanks for kicking off this discussion, I'm definitely interested in
> > > contributing.
> > >
> > > To
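A heavily simplified sketch of the float-to-scaled-integer transform described above. This is illustrative only, not the actual ALP algorithm from the paper, which chooses exponent/factor pairs per vector by sampling and then bit-packs the integers FastLanes-style; the function names and sample values are made up:

```python
import math

# Illustrative sketch: multiply by a decimal exponent, keep values that
# round-trip exactly as integers, and store the rest verbatim as
# "exceptions" to be patched back in on decode.

def alp_encode(values, exp):
    ints, exceptions = [], {}
    for i, v in enumerate(values):
        d = round(v * 10**exp)
        if d / 10**exp == v:   # exact round-trip at this exponent
            ints.append(d)
        else:
            ints.append(0)     # placeholder, patched on decode
            exceptions[i] = v
    return ints, exceptions

def alp_decode(ints, exceptions, exp):
    out = [d / 10**exp for d in ints]
    for i, v in exceptions.items():
        out[i] = v             # restore values that did not round-trip
    return out

vals = [1.5, 2.25, 3.125, math.pi]   # pi cannot round-trip -> exception
ints, exc = alp_encode(vals, exp=3)
assert alp_decode(ints, exc, exp=3) == vals
```

The key property is losslessness: decimal-like values become compressible integers, while the occasional awkward value is carried as an exception rather than degrading the whole vector.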
Re: [DISCUSS] Anyone working on / want to help with new encoding proposals?
+1 on looking into openzl more deeply *before* we add new encodings.

What's very attractive about openzl is that the decoder is fixed and
advancements in encoding are backwards/forwards compatible. This means
fewer changes to the format itself. The ideal end state would be to add
openzl to parquet and encode everything as PLAIN.

One thing to investigate is whether we can get openzl compressed data at
some point in the graph and then perform compressed execution on it. This
would be perfect for dictionary encoded streams.

On Tue, Oct 7, 2025 at 4:34 PM Krisztián Szűcs wrote:
> Hi,
>
> There seems to be a new (if I'm not mistaken it was published yesterday)
> codec/compression framework called OpenZL [1][2][3]. I haven't looked at
> it thoroughly yet, but it somewhat reminds me of BtrBlocks.
> Even if we don't consider more advanced features of a framework like
> this, we could offload the various codec implementations to another
> project.
>
> Krisztian
>
> [1]: https://openzl.org/
> [2]: https://github.com/facebook/openzl/tree/dev/src/openzl/codecs
> [3]: https://engineering.fb.com/2025/10/06/developer-tools/openzl-open-source-format-aware-compression-framework/
>
> > On 2025. Oct 1., at 20:11, Andrew Lamb wrote:
> >
> > I would like to start a discussion to help organize and rally anyone
> > interested in adding new encodings to Parquet.
> >
> > I am pretty sure there are many people interested in adding new
> > encodings, but there are only a few mentions on the mailing list, such
> > as pcode [1] and FSST/ALP/FastLanes [2]. Prateek mentioned on the sync
> > call today that he is working on evaluating some potential encodings
> > and hopes to have some information to share soon, and Julien mentioned
> > he had spoken to someone else who might be doing something similar.
> >
> > Now that Julien has defined a process to extend the spec[3] I think
> > the steps are much clearer.
> >
> > So, I would like to invite anyone interested in adding new encodings
> > to respond and let us know if you are willing to help evaluate new
> > encodings and prototype integrations into Parquet implementations?
> >
> > Andrew
> >
> > [1]: https://lists.apache.org/thread/bdmfcj4g6y1ccd3mfgrp7d43d73s6zf6
> > [2]: https://lists.apache.org/thread/s3o9jk0hr942pv6ono4ymnvvj6pfdsdw
> > [3]: https://github.com/apache/parquet-format/blob/master/proposals/README.md
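The compressed-execution idea for dictionary encoded streams can be sketched in a few lines. This is a hypothetical illustration, not an OpenZL (or Parquet) API; the column and predicate are made up:

```python
# Sketch of compressed execution on a dictionary encoded column:
# evaluate the predicate once per distinct dictionary entry, then filter
# the small integer codes without ever materializing the row values.

dictionary = ["error", "info", "warn"]  # distinct column values
codes = [1, 1, 0, 2, 1, 0, 0]           # one small integer per row

# The predicate runs |dictionary| times instead of |rows| times
matches = {i for i, v in enumerate(dictionary) if v == "error"}
hit_rows = [row for row, c in enumerate(codes) if c in matches]
assert hit_rows == [2, 5, 6]  # rows whose value is "error"
```

The win grows with the row-count-to-cardinality ratio: the expensive comparison happens per distinct value, and the per-row work is a cheap integer set-membership test on data that never left its encoded form.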
Re: [DISCUSS] Anyone working on / want to help with new encoding proposals?
This is super exciting, thank you Naohiro

I also think ALP[1] (built on FastLanes[2]) is a great encoding to explore

Getting a Java based implementation of ALP would be a great validation
that the approach works well across platforms. There are open source
implementations in both C/C++[3] and Rust (via vortex) [4] that we could
use to benchmark / build prototypes

Andrew

[1]: https://ir.cwi.nl/pub/4/4.pdf
[2]: https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf
[3]: https://github.com/cwida/FastLanes/tree/4014a3a51083a06b6d446fb78e446494721aa12b/src/alp
[4]: https://github.com/vortex-data/vortex/blob/153040140e72d9038f5c092e6c6348c28a462211/encodings/alp/src/lib.rs#L4

On Fri, Oct 3, 2025 at 12:22 AM [email protected] wrote:
> Hi Andrew,
>
> I'm Naohiro, and I'm the person Julien has been in touch with. I was
> planning to attend the sync yesterday but unfortunately missed it due to
> the timezone difference. (I'm in Japan)
>
> Thanks for kicking off this discussion, I'm definitely interested in
> contributing.
>
> To start with, I'm currently working on a POC in parquet-java to
> evaluate ALP. While ALP and floating-point compression are my main focus
> at the moment, I'm also interested in exploring other encoding
> strategies that could benefit Parquet. I'm also drafting a proposal in
> Google Docs, and once it's ready, I'll share the link.
>
> I'd love to hear if others are working on similar efforts, especially
> around floating-point compression, to avoid duplication and potentially
> collaborate.
>
> On 2025/10/01 18:11:51 Andrew Lamb wrote:
> > I would like to start a discussion to help organize and rally anyone
> > interested in adding new encodings to Parquet.
> >
> > I am pretty sure there are many people interested in adding new
> > encodings, but there are only a few mentions on the mailing list, such
> > as pcode [1] and FSST/ALP/FastLanes [2]. Prateek mentioned on the sync
> > call today that he is working on evaluating some potential encodings
> > and hopes to have some information to share soon, and Julien mentioned
> > he had spoken to someone else who might be doing something similar.
> >
> > Now that Julien has defined a process to extend the spec[3] I think
> > the steps are much clearer.
> >
> > So, I would like to invite anyone interested in adding new encodings
> > to respond and let us know if you are willing to help evaluate new
> > encodings and prototype integrations into Parquet implementations?
> >
> > Andrew
> >
> > [1]: https://lists.apache.org/thread/bdmfcj4g6y1ccd3mfgrp7d43d73s6zf6
> > [2]: https://lists.apache.org/thread/s3o9jk0hr942pv6ono4ymnvvj6pfdsdw
> > [3]: https://github.com/apache/parquet-format/blob/master/proposals/README.md
RE: [DISCUSS] Anyone working on / want to help with new encoding proposals?
Hi Andrew,

I'm Naohiro, and I'm the person Julien has been in touch with. I was
planning to attend the sync yesterday but unfortunately missed it due to
the timezone difference. (I'm in Japan)

Thanks for kicking off this discussion, I'm definitely interested in
contributing.

To start with, I'm currently working on a POC in parquet-java to evaluate
ALP. While ALP and floating-point compression are my main focus at the
moment, I'm also interested in exploring other encoding strategies that
could benefit Parquet. I'm also drafting a proposal in Google Docs, and
once it's ready, I'll share the link.

I'd love to hear if others are working on similar efforts, especially
around floating-point compression, to avoid duplication and potentially
collaborate.

On 2025/10/01 18:11:51 Andrew Lamb wrote:
> I would like to start a discussion to help organize and rally anyone
> interested in adding new encodings to Parquet.
>
> I am pretty sure there are many people interested in adding new
> encodings, but there are only a few mentions on the mailing list, such as
> pcode [1] and FSST/ALP/FastLanes [2]. Prateek mentioned on the sync call
> today that he is working on evaluating some potential encodings and hopes
> to have some information to share soon, and Julien mentioned he had
> spoken to someone else who might be doing something similar.
>
> Now that Julien has defined a process to extend the spec[3] I think the
> steps are much clearer.
>
> So, I would like to invite anyone interested in adding new encodings to
> respond and let us know if you are willing to help evaluate new encodings
> and prototype integrations into Parquet implementations?
>
> Andrew
>
> [1]: https://lists.apache.org/thread/bdmfcj4g6y1ccd3mfgrp7d43d73s6zf6
> [2]: https://lists.apache.org/thread/s3o9jk0hr942pv6ono4ymnvvj6pfdsdw
> [3]: https://github.com/apache/parquet-format/blob/master/proposals/README.md
