Re: [Parquet] ALP Encoding for Floating point data

2026-02-17 Thread PRATEEK GAUR
Hi team,

1) Andrew

   - Thanks for working on the test files. My PR added all the test files I
   used to benchmark on the datasets; maybe we can club them together. That
   will also aid cross-language testing.
   - Kosta Tarasov working on a Rust implementation is great. Thanks!


2) Antoine

   - Thanks a lot for reporting the numbers on AMD. It looks like you are
   getting 8x the decoding performance of BSS. This is amazing!
   - Thanks for acknowledging the sampling design.
   - I agree with you on Fastlanes. In some crude experiments I didn't get
   a good perf benefit from it on Graviton3 (but maybe there was something
   wrong with my implementation).
   - Locking in the 16-bit exception encoding for the spec in this case.
   - Awesome, I think we have resolved all open questions minus the
   version byte :) (will get back on this soon).


3) Micah

   - FastLanes: the current spec does allow using FastLanes via the
   configurable enum value for the layout. We should be able to inject any
   layout into the current design.


Working on resolving all remaining open comments on the spec this week.

Best
Prateek


On Tue, Feb 10, 2026 at 3:37 AM Steve Loughran  wrote:

> On Sun, 8 Feb 2026 at 18:12, Micah Kornfield 
> wrote:
>
> >
> >
> > It looks like the actual issue described for ORC in the paper is that it
> > has multiple sub-encodings in a batch.  This is different from the design
> > proposed here where there is still fixed encoding per page in parquet.
> > Given reasonably sized pages I don't think branch misprediction should
> be a
> > big issue for new encodings.  I agree that we should be conservative in
> > general for adding new encodings.
> >
> >
> +1
>


Re: [Parquet] ALP Encoding for Floating point data

2026-02-10 Thread Steve Loughran
On Sun, 8 Feb 2026 at 18:12, Micah Kornfield  wrote:

>
>
> It looks like the actual issue described for ORC in the paper is that it
> has multiple sub-encodings in a batch.  This is different from the design
> proposed here where there is still fixed encoding per page in parquet.
> Given reasonably sized pages I don't think branch misprediction should be a
> big issue for new encodings.  I agree that we should be conservative in
> general for adding new encodings.
>
>
+1


Re: [Parquet] ALP Encoding for Floating point data

2026-02-08 Thread Micah Kornfield
>
> > Re-stating the points so that scrolling is not needed.
> > 1.  Change of integer encoding (see debate in this thread on FOR vs
> > Delta).  We also want to get fast lanes in at some point.


I'm not sure why we "want to get fast lanes in at some point". I don't
> think we want to obsess about performance here, given that current
> performance is already very good, and real-world bottlenecks will
> probably be elsewhere.


I think we can maybe debate this further, if/when someone proposes fast
lanes.  My main point here is that there are valid use-cases for making
integer encoding configurable (or at least there is uncertainty here).


This highlights that *unless you can use cpu vector opcodes, adding more
> options can hurt branch prediction and so make overall performance worse*.
> It's a good argument for simplicity in compression and encoding choices.


It looks like the actual issue described for ORC in the paper is that it
has multiple sub-encodings in a batch.  This is different from the design
proposed here where there is still fixed encoding per page in parquet.
Given reasonably sized pages I don't think branch misprediction should be a
big issue for new encodings.  I agree that we should be conservative in
general for adding new encodings.
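Micah's point about fixed per-page encodings can be illustrated with a small sketch (the `Page` model and names below are hypothetical, not from the Parquet spec): the encoding branch is taken once per page, so the per-value inner loops stay branch-free and easy to predict.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical page model: one fixed encoding per page, as in Parquet.
enum class PageEncoding { kPlain, kForBitPacked };

struct Page {
  PageEncoding encoding;
  int64_t reference;             // used only by kForBitPacked
  std::vector<int64_t> payload;  // residuals or plain values
};

// The encoding is dispatched once per page; each inner loop is then a
// tight, branch-free (and easily vectorizable) pass over the values.
std::vector<int64_t> decode_page(const Page& page) {
  std::vector<int64_t> out;
  out.reserve(page.payload.size());
  switch (page.encoding) {
    case PageEncoding::kPlain:
      out = page.payload;
      break;
    case PageEncoding::kForBitPacked:
      for (int64_t residual : page.payload)
        out.push_back(page.reference + residual);  // FOR: add back the frame
      break;
  }
  return out;
}
```

With reasonably sized pages, the single `switch` per page is amortized over thousands of values, which is why per-page misprediction cost should be negligible.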

Regards,
Micah

On Thu, Feb 5, 2026 at 6:28 AM Antoine Pitrou  wrote:

>
> hi Prateek,
>
> Le 03/02/2026 à 23:39, PRATEEK GAUR a écrit :
> > Hi Antoine and Micah,
> >
> > Apologies for getting back on this a little late.
> >
> > *Running Perf tests*
> > @Antoine Pitrou  were you able to figure out the
> steps
> > to run the tests?
>
> Yes, I finally did that, results below on an AMD Zen 2 CPU:
> https://gist.github.com/pitrou/1f4aefb7034657ce018231d87993f437
>
> > *Sampling Frequency*
> > We want to pick the right parameters to encode the values with. That is
> > what the Spec requires.
> >  From the implementation perspective you raise a good point that did
> cross my
> > mind that 'practically we don't want to sample for every page', for
> > performance
> > reasons. My thinking is each engine is free to decide this.
> > 1) Do it at page level if data is changing often
> > 2) Provide fixed presets via config
> > 3) Do it once per encoder (per column, as Micah pointed out)
> > 4) Provide a fancy config.
>
> Ok, that would sound fine to me.
>
> > Re-stating the points so that scrolling is not needed.
> > 1.  Change of integer encoding (see debate in this thread on FOR vs
> > Delta).  We also want to get fast lanes in at some point.
>
> I'm not sure why we "want to get fast lanes in at some point". I don't
> think we want to obsess about performance here, given that current
> performance is already very good, and real-world bottlenecks will
> probably be elsewhere.
>
> > For eg at this point I do see that both bitpacking of exceptions, as
> > pointed by
> > Antoine, or plain ub2 encoding should work equally well .
>
> Well, since 16 bits are actually enough for the current vector size, I'd
> say we can keep things simple.
>
> Regards
>
> Antoine.
>
>
>


Re: [Parquet] ALP Encoding for Floating point data

2026-02-05 Thread Antoine Pitrou



hi Prateek,

Le 03/02/2026 à 23:39, PRATEEK GAUR a écrit :

Hi Antoine and Micah,

Apologies for getting back on this a little late.

*Running Perf tests*
@Antoine Pitrou  were you able to figure out the steps
to run the tests?


Yes, I finally did that, results below on an AMD Zen 2 CPU:
https://gist.github.com/pitrou/1f4aefb7034657ce018231d87993f437


*Sampling Frequency*
We want to pick the right parameters to encode the values with. That is
what the Spec requires.
From the implementation perspective you raise a good point that did cross my
mind that 'practically we don't want to sample for every page', for
performance
reasons. My thinking is each engine is free to decide this.
1) Do it at page level if data is changing often
2) Provide fixed presets via config
3) Do it once per encoder (per column, as Micah pointed out)
4) Provide a fancy config.


Ok, that would sound fine to me.


Re-stating the points so that scrolling is not needed.
1.  Change of integer encoding (see debate in this thread on FOR vs
Delta).  We also want to get fast lanes in at some point.


I'm not sure why we "want to get fast lanes in at some point". I don't
think we want to obsess about performance here, given that current
performance is already very good, and real-world bottlenecks will
probably be elsewhere.



For example, at this point I do see that both bitpacking of exceptions, as
pointed out by Antoine, and plain ub2 encoding should work equally well.


Well, since 16 bits are actually enough for the current vector size, I'd 
say we can keep things simple.


Regards

Antoine.




Re: [Parquet] ALP Encoding for Floating point data

2026-02-05 Thread Andrew Lamb
I started creating some test .parquet files using the C++ implementation
[1] to help other implementations test compatibility. I currently have a
really simple file here [2] if anyone wants to play around with it.

I plan to expand it out over the next few days (e.g. include exceptions,
f32, etc)

Finally, I am pretty stoked to report that Kosta Tarasov has started
working on a Rust implementation[3]

Andrew

[1]: https://github.com/apache/arrow/pull/49154
[2]: https://github.com/user-attachments/files/25097841/single_f64_ALP.zip
[3]: https://github.com/apache/arrow-rs/issues/8748

On Tue, Feb 3, 2026 at 6:03 PM Micah Kornfield 
wrote:

> Thanks Prateek,
>
>> *Alp Version in Header*
>> Micah's point
>> *`* I'd suggest modelling fundamentally different algorithms with the
>> top level encoding enum, and have versioning/control bits where we believe
>> we will likely want to iterate`
>> Yes this is exactly what is happening here. An enum to add AlpRd(and more)
>> and version control to iterate anything fundamental (like a layout change
>> of the
>> metadata).
>
>
> I think there might still be some misalignment on which enum we are
> talking about.  I was referring to the encoding enum in parquet.thrift [1].
> Specifically, the current proposal for ALP would be ALP_PSEUDODECIMAL.  For
> AlpRd we would add ALP_RD (or something similar).  If for some reason we
> need to revise ALP_PSEUDODECIMAL, we always have ALP_PSEUDODECIMAL_2.  Given
> the other extension points, I hope we wouldn't need ALP_PSEUDODECIMAL_2 (or
> at least we get a better name) for a long time.
>
> I think the main technical trade-off here is how the thrift parsers across
> bindings handle unknown enum values (e.g. if these encodings show in the
> footer in stats, would that make the whole file unreadable, and do we care).
>
> Cheers,
> Micah
>
>
> [1]
> https://github.com/apache/parquet-format/blame/master/src/main/thrift/parquet.thrift#L630
>
>
>
>
> On Tue, Feb 3, 2026 at 2:40 PM PRATEEK GAUR  wrote:
>
>> Hi Antoine and Micah,
>>
>> Apologies for getting back on this a little late.
>>
>> *Running Perf tests*
>> @Antoine Pitrou  were you able to figure out the
>> steps
>> to run the tests?
>>
>> *Sampling Frequency*
>> We want to pick the right parameters to encode the values with. That is
>> what the Spec requires.
>> From the implementation perspective you raise a good point that did cross
>> my
>> mind that 'practically we don't want to sample for every page', for
>> performance
>> reasons. My thinking is each engine is free to decide this.
>> 1) Do it at page level if data is changing often
>> 2) Provide fixed presets via config
>> 3) Do it once per encoder (per column, as Micah pointed out)
>> 4) Provide a fancy config.
>> I agree with Micah here: '*I think we should maybe clarify that the
>> encoding algorithm in the specification is a recommendation*'.
>>
>>
>> *Number of values to pick for sampling*
>> 'why does this have to be a constant'
>> You are right, it doesn't need to be a constant, hence the spec doesn't
>> say so. Everything that is segregated out in the AlpConstants (C++ impl)
>> file can be changed by configuration.
>> (Did I get your question right @Antoine Pitrou  ?)
>>
>> *Alp Version in Header*
>> Micah's point
>> *`* I'd suggest modelling fundamentally different algorithms with the
>> top level encoding enum, and have versioning/control bits where we believe
>> we will likely want to iterate`
>> Yes this is exactly what is happening here. An enum to add AlpRd(and more)
>> and version control to iterate anything fundamental (like a layout change
>> of the
>> metadata).
>>
>> Re-stating the points so that scrolling is not needed.
>> 1.  Change of integer encoding (see debate in this thread on FOR vs
>> Delta).  We also want to get fast lanes in at some point.  I think an
>> enum inside the page for versioning makes sense, as it allows for easier
>> composability.
>> 2.  Change in structure to exceptions (e.g. G-ALP).  G-ALP comes with some
>> trade-offs, so it is not clear if it is something everyone would want to
>> enable.
>> 3.  Offset indexes to vectors
>> 4.  Different floating point encoding algorithms  (e.g. AlpRd + AlpBSS)
>>
>> For example, at this point I do see that both bitpacking of exceptions,
>> as pointed out by Antoine, and plain ub2 encoding should work equally
>> well. I was running some benchmarks here and was getting read speeds of
>> around 20 GB/s for 10-bit packed values, which is quite good enough
>> (Graviton3 processor).
>> For simplicity (and the fact that we won't get really large vectors) my
>> inclination is towards ub2 values, but I want to keep the path open to
>> possibly have bitpacking as an option as workloads evolve to a level we
>> haven't thought about yet. We can always add a new encoding, but I don't
>> see a path to having 20+ top-level encodings. Again I don't hav

Re: [Parquet] ALP Encoding for Floating point data

2026-02-03 Thread Micah Kornfield
Thanks Prateek,

> *Alp Version in Header*
> Micah's point
> *`* I'd suggest modelling fundamentally different algorithms with the
> top level encoding enum, and have versioning/control bits where we believe
> we will likely want to iterate`
> Yes this is exactly what is happening here. An enum to add AlpRd(and more)
> and version control to iterate anything fundamental (like a layout change
> of the
> metadata).


I think there might still be some misalignment on which enum we are talking
about.  I was referring to the encoding enum in parquet.thrift [1].
Specifically, the current proposal for ALP would be ALP_PSEUDODECIMAL.  For
AlpRd we would add ALP_RD (or something similar).  If for some reason we
need to revise ALP_PSEUDODECIMAL, we always have ALP_PSEUDODECIMAL_2.  Given
the other extension points, I hope we wouldn't need ALP_PSEUDODECIMAL_2 (or
at least we get a better name) for a long time.

I think the main technical trade-off here is how the thrift parsers across
bindings handle unknown enum values (e.g. if these encodings show in the
footer in stats, would that make the whole file unreadable, and do we care).
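The unknown-enum trade-off can be sketched roughly as follows (everything here is illustrative: the cutoff constant and function names are made up, and real thrift bindings differ per language): thrift delivers enum fields to readers as raw integers, so an old reader can encounter an encoding id minted after it was compiled and choose how far the failure propagates.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: decide whether an encoding id read from the footer
// is decodable by this reader, or whether only that column should fail.
enum class ReadAction { kDecode, kSkipColumn };

// Assumed cutoff: ids up to kNewestKnownEncoding are implemented here.
constexpr int32_t kNewestKnownEncoding = 14;

ReadAction classify_encoding(int32_t raw_enum_value) {
  if (raw_enum_value >= 0 && raw_enum_value <= kNewestKnownEncoding)
    return ReadAction::kDecode;
  // Degrade gracefully: an unknown id (e.g. a future ALP variant) makes
  // this column unreadable, but the rest of the file stays usable.
  return ReadAction::kSkipColumn;
}
```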

Cheers,
Micah


[1]
https://github.com/apache/parquet-format/blame/master/src/main/thrift/parquet.thrift#L630




On Tue, Feb 3, 2026 at 2:40 PM PRATEEK GAUR  wrote:

> Hi Antoine and Micah,
>
> Apologies for getting back on this a little late.
>
> *Running Perf tests*
> @Antoine Pitrou  were you able to figure out the steps
> to run the tests?
>
> *Sampling Frequency*
> We want to pick the right parameters to encode the values with. That is
> what the Spec requires.
> From the implementation perspective you raise a good point that did cross
> my
> mind that 'practically we don't want to sample for every page', for
> performance
> reasons. My thinking is each engine is free to decide this.
> 1) Do it at page level if data is changing often
> 2) Provide fixed presets via config
> 3) Do it once per encoder (per column, as Micah pointed out)
> 4) Provide a fancy config.
> I agree with Micah here: '*I think we should maybe clarify that the
> encoding algorithm in the specification is a recommendation*'.
>
>
> *Number of values to pick for sampling*
> 'why does this have to be a constant'
> You are right, it doesn't need to be a constant, hence the spec doesn't
> say so. Everything that is segregated out in the AlpConstants (C++ impl)
> file can be changed by configuration.
> (Did I get your question right @Antoine Pitrou  ?)
>
> *Alp Version in Header*
> Micah's point
> *`* I'd suggest modelling fundamentally different algorithms with the
> top level encoding enum, and have versioning/control bits where we believe
> we will likely want to iterate`
> Yes this is exactly what is happening here. An enum to add AlpRd(and more)
> and version control to iterate anything fundamental (like a layout change
> of the
> metadata).
>
> Re-stating the points so that scrolling is not needed.
> 1.  Change of integer encoding (see debate in this thread on FOR vs
> Delta).  We also want to get fast lanes in at some point.  I think an
> enum inside the page for versioning makes sense, as it allows for easier
> composability.
> 2.  Change in structure to exceptions (e.g. G-ALP).  G-ALP comes with some
> trade-offs, so it is not clear if it is something everyone would want to
> enable.
> 3.  Offset indexes to vectors
> 4.  Different floating point encoding algorithms  (e.g. AlpRd + AlpBSS)
>
> For example, at this point I do see that both bitpacking of exceptions, as
> pointed out by Antoine, and plain ub2 encoding should work equally well. I
> was running some benchmarks here and was getting read speeds of around
> 20 GB/s for 10-bit packed values, which is quite good enough (Graviton3
> processor).
> For simplicity (and the fact that we won't get really large vectors) my
> inclination is towards ub2 values, but I want to keep the path open to
> possibly have bitpacking as an option as workloads evolve to a level we
> haven't thought about yet. We can always add a new encoding, but I don't
> see a path to having 20+ top-level encodings. Again, I don't have a very
> strong bias towards keeping it or removing it, but my thinking right now
> is: let's have the flexibility and make it easier for people to evolve
> this encoding behind a version byte.
>
>
> Best
> Prateek
> PS : I probably have addressed all open threads raised by Antoine and
> Micah.
> (but I may have missed something)
>
>
>
>
>
> On Thu, Jan 29, 2026 at 10:52 PM Micah Kornfield 
> wrote:
>
> > Hi Antoine and Prateek,
> >
> > > > In Parquet C++, encoding happens at page level, and I would guess
> other
> > > > implementations do something similar. Sampling cannot reasonably be
> > done
> > > > at a higher level, that would require invasive architectural changes.
> >
> >
> > At least in C++ I believe we cache the encoder at a column level [1] (I
> > believe th

Re: [Parquet] ALP Encoding for Floating point data

2026-02-03 Thread PRATEEK GAUR
Hi Antoine and Micah,

Apologies for getting back on this a little late.

*Running Perf tests*
@Antoine Pitrou  were you able to figure out the steps
to run the tests?

*Sampling Frequency*
We want to pick the right parameters to encode the values with. That is
what the Spec requires.
From the implementation perspective you raise a good point that did cross my
mind that 'practically we don't want to sample for every page', for
performance
reasons. My thinking is each engine is free to decide this.
1) Do it at page level if data is changing often
2) Provide fixed presets via config
3) Do it once per encoder (per column, as Micah pointed out)
4) Provide a fancy config.
I agree with Micah here: '*I think we should maybe clarify that the
encoding algorithm in the specification is a recommendation*'.


*Number of values to pick for sampling*
'why does this have to be a constant'
You are right, it doesn't need to be a constant, hence the spec doesn't say
so. Everything that is segregated out in the AlpConstants (C++ impl) file
can be changed by configuration.
(Did I get your question right @Antoine Pitrou  ?)
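For readers following the thread, the parameters being sampled are the decimal exponents of ALP's pseudo-decimal scheme. A minimal sketch of the per-value test, simplified to a single exponent `e` (the real algorithm searches exponent pairs over a sample, and the function name here is made up):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <optional>

// Simplified sketch of ALP pseudo-decimal encoding with one decimal
// exponent e: a double is encodable if scaling by 10^e, rounding to an
// integer, and scaling back reproduces the value bit-exactly. Values that
// do not round-trip become exceptions and are stored verbatim.
std::optional<int64_t> alp_encode_one(double value, int e) {
  const double scale = std::pow(10.0, e);
  const double rounded = std::nearbyint(value * scale);
  if (!std::isfinite(rounded) || std::fabs(rounded) > 9.0e15)
    return std::nullopt;  // NaN/inf, or too many digits for an int64
  const auto digits = static_cast<int64_t>(rounded);
  if (static_cast<double>(digits) / scale == value)
    return digits;        // exact round-trip: encodable as an integer
  return std::nullopt;    // exception: keep the raw double instead
}
```

Sampling then amounts to trying candidate exponents on a subset of values and keeping whichever minimizes exceptions and bit width, which is why the cadence of sampling (per page, per column, preset) is an implementation choice rather than a spec requirement.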

*Alp Version in Header*
Micah's point
*`* I'd suggest modelling fundamentally different algorithms with the
top level encoding enum, and have versioning/control bits where we believe
we will likely want to iterate`
Yes this is exactly what is happening here. An enum to add AlpRd(and more)
and version control to iterate anything fundamental (like a layout change
of the
metadata).

Re-stating the points so that scrolling is not needed.
1.  Change of integer encoding (see debate in this thread on FOR vs
Delta).  We also want to get fast lanes in at some point.  I think an
enum inside the page for versioning makes sense, as it allows for easier
composability.
2.  Change in structure to exceptions (e.g. G-ALP).  G-ALP comes with some
trade-offs, so it is not clear if it is something everyone would want to
enable.
3.  Offset indexes to vectors
4.  Different floating point encoding algorithms  (e.g. AlpRd + AlpBSS)

For example, at this point I do see that both bitpacking of exceptions, as
pointed out by Antoine, and plain ub2 encoding should work equally well. I
was running some benchmarks here and was getting read speeds of around
20 GB/s for 10-bit packed values, which is quite good enough (Graviton3
processor).
For simplicity (and the fact that we won't get really large vectors) my
inclination is towards ub2 values, but I want to keep the path open to
possibly have bitpacking as an option as workloads evolve to a level we
haven't thought about yet. We can always add a new encoding, but I don't
see a path to having 20+ top-level encodings. Again, I don't have a very
strong bias towards keeping it or removing it, but my thinking right now
is: let's have the flexibility and make it easier for people to evolve
this encoding behind a version byte.


Best
Prateek
PS: I have probably addressed all open threads raised by Antoine and Micah
(but I may have missed something).





On Thu, Jan 29, 2026 at 10:52 PM Micah Kornfield 
wrote:

> Hi Antoine and Prateek,
>
> > > In Parquet C++, encoding happens at page level, and I would guess other
> > > implementations do something similar. Sampling cannot reasonably be
> done
> > > at a higher level, that would require invasive architectural changes.
>
>
> At least in C++ I believe we cache the encoder at a column level [1] (I
> believe the same is true for java). I think this implies one could sample
> for the first page more or less, and then resample on some regular cadence
> (or if compression degrades too much)?  In general, the exact approach used
> in implementations can vary here, so I think we should maybe clarify that
> the encoding algorithm in the specification is a recommendation, and we
> concentrate the discussion on implementation to the individual language
> binding PRs.
>
> 1) Addition of AlpRd (if that takes time to get in for read doubles). (This
> > is easily addable with provided AlgorithmEnum)
> > 2) Addition of AlpRd + Modified BSS (suggested by Azim) (This is easily
> > addable with provided AlgorithmEnum)
> > 3) Addition of different encoding (This is easily addable with provided
> > AlgorithmEnum).
>
>
> In short, I'd suggest modelling fundamentally different algorithms with the
> top level encoding enum, and have versioning/control bits where we believe
> we will likely want to iterate.
>
> Long version, based on my understanding of current open design points the
> following extensions have been discussed:
>
> 1.  Change of integer encoding (see debate in this thread on FOR vs
> Delta).  We also want to get fastlanes in at some point.  I think an
> enum inside the page for versioning makes sense, as it allows for easier
> composability.
> 2.  Change in structure to exceptions (e.g. G-ALP).  G-ALP comes with some
> trade-offs, so it is not clear if it is something everyone would want to
> enable.
> 3.  Offset indexes to vectors
> 4.  Different floating point encod

Re: [Parquet] ALP Encoding for Floating point data

2026-01-29 Thread Micah Kornfield
Hi Antoine and Prateek,

> > In Parquet C++, encoding happens at page level, and I would guess other
> > implementations do something similar. Sampling cannot reasonably be done
> > at a higher level, that would require invasive architectural changes.


At least in C++ I believe we cache the encoder at a column level [1] (I
believe the same is true for java). I think this implies one could sample
for the first page more or less, and then resample on some regular cadence
(or if compression degrades too much)?  In general, the exact approach used
in implementations can vary here, so I think we should maybe clarify that
the encoding algorithm in the specification is a recommendation, and we
concentrate the discussion on implementation in the individual language
binding PRs.
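Micah's suggestion (sample for the first page, then resample on a cadence) could look roughly like this; the class, the cadence, and the `AlpParams` shape are all made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Placeholder for whatever parameters sampling actually produces.
struct AlpParams { int exponent = 0; };

// Hypothetical per-column encoder state, illustrating option 3 from the
// thread: sample once for the first page, then resample every N pages
// instead of re-running parameter sampling on every page.
class AlpColumnEncoder {
 public:
  explicit AlpColumnEncoder(size_t resample_every)
      : resample_every_(resample_every),
        pages_since_sample_(resample_every) {}  // forces sampling on page 1

  // Returns true when this page triggered (re)sampling.
  bool encode_page(const std::vector<double>& values) {
    bool sampled = false;
    if (pages_since_sample_ == resample_every_) {
      params_ = sample(values);  // run ALP parameter sampling on this page
      pages_since_sample_ = 0;
      sampled = true;
    }
    ++pages_since_sample_;
    // ... encode `values` using params_ ...
    return sampled;
  }

 private:
  static AlpParams sample(const std::vector<double>&) { return AlpParams{}; }
  size_t resample_every_;
  size_t pages_since_sample_;
  AlpParams params_;
};
```

A real writer might also resample early when the exception rate of a page degrades past a threshold, per the "(or if compression degrades too much)" caveat above.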

1) Addition of AlpRd (if that takes time to get in for read doubles). (This
> is easily addable with provided AlgorithmEnum)
> 2) Addition of AlpRd + Modified BSS (suggested by Azim) (This is easily
> addable with provided AlgorithmEnum)
> 3) Addition of different encoding (This is easily addable with provided
> AlgorithmEnum).


In short, I'd suggest modelling fundamentally different algorithms with the
top level encoding enum, and have versioning/control bits where we believe
we will likely want to iterate.

Long version, based on my understanding of current open design points the
following extensions have been discussed:

1.  Change of integer encoding (see debate in this thread on FOR vs
Delta).  We also want to get fastlanes in at some point.  I think an
enum inside the page for versioning makes sense, as it allows for easier
composability.
2.  Change in structure to exceptions (e.g. G-ALP).  G-ALP comes with some
trade-offs, so it is not clear if it is something everyone would want to
enable.
3.  Offset indexes to vectors
4.  Different floating point encoding algorithms  (e.g. AlpRd + AlpBSS)

1 is almost definitely an extension point, and I think it pays to version
this within the page (if we decide to allow delta at some point, and then
do fastlanes, we start having a lot of ALP-pseudodecimal enums that might
not add a lot of value).  This would get worse if we multiply this by any
future exception layouts.  3 feels like we can just make a decision now on
whether to add them, and if not added now they can probably be added if we
get to cascaded encodings, or if really needed as a separate enum (but I'm
open to a control bit here).

For 4, these feel like they should be a top-level enum for versioning; they
are fundamentally different algorithms with different code to decode them.
The trade-off here is that the current writers need some more invasive
changes to have better fallback or choice of initial encoder (but this
needs to happen anyway).  We can easily have multiple page encodings
within a row-group from a reader's perspective.

For any other future changes that don't fall into these buckets, a new
top-level enum is always an escape hatch (and we can learn from our
mistakes). Another take is that it is just one byte per page, which would
take a while to add up even at Parquet's scale, but on balance I'd lean
towards YAGNI.

Cheers,
Micah

[1]
https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_writer.cc#L1290


On Tue, Jan 27, 2026 at 9:47 AM PRATEEK GAUR  wrote:

> Hi Antoine,
>
> Thanks. Replying inline to the discussion threads.
>
>
> > > https://github.com/apache/parquet-testing/pull/100/files.
> > > You might have to checkout that branch to be able to run the
> benchmarks.
> >
>
> Yeah, just cherry-picking the branch should work. Let me know if it ends
> up being some kind of permission issue.
>
>
> >
> > Hmm, I'll take a look at that later and come back.
> >
> > > * the encoding of integers uses a custom framing with
> frame-of-reference
> > >> encoding inside it, but Parquet implementations already implement
> > >> DELTA_BINARY_PACKED which should have similar characteristics, so why
> > >> not reuse that?
> >
> > Thanks. At worst it seems it would need one more bit per integer than
> > the proposed FOR scheme (it might need less if delta encoding allows
> > reducing entropy among the encoded integers). I'm not sure how that
> > makes it "fail".
> >
>
>
> My bad. I wanted to add another point.
> 1) So I looked at a few other examples, and in most cases FOR ended up
> using fewer bits per value. My thinking is that for 1M (1B values), it
> will add up.
> 2) Another major point was that the bit-unpacker + FOR was much faster
> than the DeltaBitPack decoder. BitUnpacker + FOR was easily SIMDable, but
> DeltaBitUnpack not so much. I vaguely remember the difference being around
> 2x. I can try and compute the numbers again. But today the entire
> bottleneck in the decoder shows up in bit-unpacking.
>
>
> > >> * there are a lot of fields in the headers that look a bit superfluous
> > >> (though of course those bits are relatively cheap); for example, why
> > >> have a format "version" 

Re: [Parquet] ALP Encoding for Floating point data

2026-01-27 Thread PRATEEK GAUR
Hi Antoine,

Thanks. Replying inline to the discussion threads.


> > https://github.com/apache/parquet-testing/pull/100/files.
> > You might have to checkout that branch to be able to run the benchmarks.
>

Yeah, just cherry-picking the branch should work. Let me know if it ends up
being some kind of permission issue.


>
> Hmm, I'll take a look at that later and come back.
>
> > * the encoding of integers uses a custom framing with frame-of-reference
> >> encoding inside it, but Parquet implementations already implement
> >> DELTA_BINARY_PACKED which should have similar characteristics, so why
> >> not reuse that?
>
> Thanks. At worst it seems it would need one more bit per integer than
> the proposed FOR scheme (it might need less if delta encoding allows
> reducing entropy among the encoded integers). I'm not sure how that
> makes it "fail".
>


My bad. I wanted to add another point.
1) So I looked at a few other examples, and in most cases FOR ended up
using fewer bits per value. My thinking is that for 1M (1B values), it
will add up.
2) Another major point was that the bit-unpacker + FOR was much faster than
the DeltaBitPack decoder. BitUnpacker + FOR was easily SIMDable, but
DeltaBitUnpack not so much. I vaguely remember the difference being around
2x. I can try and compute the numbers again. But today the entire
bottleneck in the decoder shows up in bit-unpacking.
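The decode asymmetry described above can be sketched as follows (simplified, without the actual bit-packing; function names are illustrative): FOR decode is an independent add per value, while delta decode carries a serial prefix-sum dependency between consecutive values.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Frame-of-reference decode: every value is `reference + residual`, with
// no dependency between values, so the loop vectorizes trivially.
std::vector<int64_t> for_decode(int64_t reference,
                                const std::vector<int64_t>& residuals) {
  std::vector<int64_t> out;
  out.reserve(residuals.size());
  for (int64_t r : residuals) out.push_back(reference + r);
  return out;
}

// Delta decode (as in DELTA_BINARY_PACKED, simplified): value[i] depends
// on value[i-1], i.e. a prefix sum. That serial dependency is what made
// FOR + bit-unpacking the faster path in the measurements above.
std::vector<int64_t> delta_decode(int64_t first,
                                  const std::vector<int64_t>& deltas) {
  std::vector<int64_t> out{first};
  for (int64_t d : deltas) out.push_back(out.back() + d);
  return out;
}
```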


> >> * there are a lot of fields in the headers that look a bit superfluous
> >> (though of course those bits are relatively cheap); for example, why
> >> have a format "version" while we could define a new encoding for
> >> incompatible evolutions?
> >
> > We discussed this point in the Spec
> > <
> https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit>
> > document a lot and have gravitated
> > towards a versioning scheme for easier evolution.
>

> * Meta-answer:
>
> I don't see any such discussion in the comments and the document doesn't
> state any rationale for it.
>
> As a reference, Python PEPs include a discussion of rejected
> alternatives so that people don't ask the same questions over and over
> (see https://peps.python.org/pep-0810/#alternate-implementation-ideas
> for an example).
>
> * Actual answer:
>
> My problem with this is that it will complicate understanding and
> communicating the feature set supported by each Parquet implementation.
> "ALP" will not be a single spec but an evolving one with its own version
> numbers. I'm not sure why that is better than adding a "ALP2" if we even
> want to evolve the spec.
>
> It will also complicate APIs that currently accept encoding numbers,
> such as
>
> https://github.com/apache/arrow/blob/5272a68c134deea82040f2f29bb6257ad7b52be0/cpp/src/parquet/properties.h#L221
>
> We need a clear explanation of what makes ALP so special that it *needs*
> its own versioning scheme. Otherwise we should remove it IMHO.
>

Based on the feedback, I think this is more of a clarification/API
complication discussion. I was thinking along these lines.

Things that I thought would have to change, but then we added enough
flexibility in the metadata structs to allow for these:
1) Addition of AlpRd (if that takes time to get in for read doubles). (This
is easily addable with provided AlgorithmEnum)
2) Addition of AlpRd + Modified BSS (suggested by Azim) (This is easily
addable with provided AlgorithmEnum)
3) Addition of different encoding (This is easily addable with provided
AlgorithmEnum).

Things that I think might need a version field:
1) Right now the FOR and ALP structs, which are interleaved, store a
certain set of fields. Given that there have been lots of incoming
suggestions, my thinking was that having a version will allow us to easily
change them with minor changes to the code in all writers/readers (maybe
that is a very big task).


>
> >> * the "Total Encoded Element count" duplicates information already in
> >> the page header, with risks of inconsistent values (including security
> >> risks that require specific care in implementations)
> >
> > 'num_elements' : let me re-read and get back on this.
>
> Ok, Micah's answer cleared that up.
>

Cool.


>
> >> * what happens if the number of exceptions is above 65535? their indices
> >> are coded as 16-bit uints. How about using the same encoding as for
> >> bit-packed integers (e.g. DELTA_BINARY_PACKED), which will also remove
> >> the 65535 limitation.
> >
> > So, I don't see a need for a vector larger than 65535. With vectors
> > that large, the overhead of metadata is small and you might as well
> > break it into multiple vectors. I'm gonna give it some more thought and
> > get back.
>
> Ah, yes, I think you're right. Let's forget this :-)
>


:).


>
> > Sampling process should be statistically significant. It should pick
> enough
> > values
> > and not have bias towards just the values towards the start. ALP
> algorithm
> > ensures
> > that and tries to balance between not spending enough cycles to get r

Re: [Parquet] ALP Encoding for Floating point data

2026-01-27 Thread Antoine Pitrou



Hi Prateek,

Thanks a lot for the answers.

Le 27/01/2026 à 04:14, PRATEEK GAUR a écrit :

* I cannot seem to run the C++ benchmarks because of the git submodule
configuration. It may be easier to fix but I'm looking for guidance here
:-)

```
$ LANG=C git submodule update
fatal: transport 'file' not allowed
fatal: Fetched in submodule path 'submodules/parquet-testing', but it
did not contain 66dfde8b2a569e7cbc8e998153e8dd6f2b36f940. Direct
fetching of that commit failed.
```



I think that is because the dataset branch hasn't been merged in yet.
The files are in this pull request
https://github.com/apache/parquet-testing/pull/100/files.
You might have to checkout that branch to be able to run the benchmarks.


Hmm, I'll take a look at that later and come back.


* the encoding of integers uses a custom framing with frame-of-reference

encoding inside it, but Parquet implementations already implement
DELTA_BINARY_PACKED which should have similar characteristics, so why
not reuse that?


I did look at DELTA_BINARY_PACKED. Unless I understood it wrong, it didn't
fit the needs.

My understanding of DELTA_BINARY_PACKED is this
delta[i] = value[i] - value[i-1]

Pasting an example of why it may fail.


Thanks. At worst it seems it would need one more bit per integer than 
the proposed FOR scheme (it might need less if delta encoding allows 
reducing entropy among the encoded integers). I'm not sure how that 
makes it "fail".



* there are a lot of fields in the headers that look a bit superfluous
(though of course those bits are relatively cheap); for example, why
have a format "version" while we could define a new encoding for
incompatible evolutions?


We discussed this point in the Spec document a lot and have gravitated
towards a versioning scheme for easier evolution.


* Meta-answer:

I don't see any such discussion in the comments and the document doesn't 
state any rationale for it.


As a reference, Python PEPs include a discussion of rejected 
alternatives so that people don't ask the same questions over and over 
(see https://peps.python.org/pep-0810/#alternate-implementation-ideas 
for an example).


* Actual answer:

My problem with this is that it will complicate understanding and 
communicating the feature set supported by each Parquet implementation. 
"ALP" will not be a single spec but an evolving one with its own version 
numbers. I'm not sure why that is better than adding a "ALP2" if we even 
want to evolve the spec.


It will also complicate APIs that currently accept encoding numbers, 
such as 
https://github.com/apache/arrow/blob/5272a68c134deea82040f2f29bb6257ad7b52be0/cpp/src/parquet/properties.h#L221


We need a clear explanation of what makes ALP so special that it *needs* 
its own versioning scheme. Otherwise we should remove it IMHO.



* the "Total Encoded Element count" duplicates information already in
the page header, with risks of inconsistent values (including security
risks that require specific care in implementations)


'num_elements' : let me re-read and get back on this.


Ok, Micah's answer cleared that up.


* what happens if the number of exceptions is above 65535? their indices
are coded as 16-bit uints. How about using the same encoding as for
bit-packed integers (e.g. DELTA_BINARY_PACKED), which will also remove
the 65535 limitation.


So, I don't see a need for a vector larger than 65535 values. With vectors
that large, the metadata overhead is small and you might as well break the
data into multiple vectors. I'm gonna give it some more thought and get back.


Ah, yes, I think you're right. Let's forget this :-)


The sampling process should be statistically significant. It should pick
enough values and not be biased towards values at the start. The ALP
algorithm ensures that and balances between spending too many cycles to
find the right parameters and picking incorrect parameters.

For a very large row group we can change the constant and have it select
over a larger data set.
Or one can do it at page level too. Happy to discuss more on this.


In Parquet C++, encoding happens at page level, and I would guess other 
implementations do something similar. Sampling cannot reasonably be done 
at a higher level, that would require invasive architectural changes.


But this raises another question: why does this have to be a constant? 
When encoding a page, you know the population size (i.e. the actual 
number of values that will be encoded). You don't need to estimate it 
with a constant.


Thank you

Antoine.




Re: [Parquet] ALP Encoding for Floating point data

2026-01-26 Thread Micah Kornfield
Hi Antoine,

I think I can help perhaps add some more details to Prateek's answer.

>
> * the "Total Encoded Element count" duplicates information already in
> the page header, with risks of inconsistent values (including security
> risks that require specific care in implementations)
>

'num_elements' : let me re-read and get back on this.


This is confusing, but the data page header only contains the total number
of values "including null" [1], so it effectively has the number of
repetition and definition levels, not the encoded value count.  We
expect these to be inconsistent (DataPageV2 and statistics have null counts,
but generally I don't think these are passed to decoders anyway).  The
only other way of retrieving this count is to decode all
repetition/definition levels up front to get the true count.
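To make this concrete, here is a small sketch (my own illustration, not
library code) of why the header count alone is not enough: with nulls, the
number of physically encoded values must be derived from the definition
levels.

```python
# For a simple optional (nullable) leaf column, max definition level is 1
# and a level of 0 marks a null: no physical value is encoded for it.
max_def_level = 1
def_levels = [1, 0, 1, 1, 0, 1]

# What the page header stores: total values including nulls.
num_values = len(def_levels)

# What the decoder actually needs: the count of physically encoded values.
num_encoded = sum(1 for d in def_levels if d == max_def_level)

print(num_values, num_encoded)  # prints: 6 4
```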

DELTA_BINARY_PACKED also stores the total number of values [2]

BYTE_STREAM_SPLIT actually required a spec update [3] to state no padding
is allowed because otherwise there would be no other way to get this number.

> * the C++ implementation has a `kSamplerRowgroupSize` constant, which
> worries me; row group size can vary *a lot* between workloads (from
> thousands to millions of elements typically), the sampling process
> should not depend on that.

The last time I reviewed the C++ implementation I think we were actually
recalculating these values per page.  So I think this might just be a
naming issue (as long as the implementation doesn't change).  Page sizes
could get larger than this constant, but probably not by too much?

Cheers,
Micah

[1]
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L679
[2]
https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5
[3]
https://github.com/apache/parquet-format/commit/230711fbfd8d3399cce935a4f39d1be7b6ad5ad5

On Mon, Jan 26, 2026 at 7:15 PM PRATEEK GAUR  wrote:

> Thanks Andrew for building momentum.
>
> Hi Antoine,
>
> Replies to your questions are inline.
>
> On Mon, Jan 26, 2026 at 2:45 AM Antoine Pitrou  wrote:
>
> >
> > Hey all,
> >
> > Thanks Prateek and Dhirhan for submitting this as it's clear you've been
> > putting quite a bit of work into it. IMHO, the ALP encoding looks very
> > promising as an addition to Parquet format.
> >
> > That said, I have a few technical concerns:
> >
> > * I cannot seem to run the C++ benchmarks because of the git submodule
> > configuration. It may be easier to fix but I'm looking for guidance here
> > :-)
> >
> > ```
> > $ LANG=C git submodule update
> > fatal: transport 'file' not allowed
> > fatal: Fetched in submodule path 'submodules/parquet-testing', but it
> > did not contain 66dfde8b2a569e7cbc8e998153e8dd6f2b36f940. Direct
> > fetching of that commit failed.
> > ```
> >
>
> I think that is because the dataset branch hasn't been merged in yet.
> The files are in this pull request
> https://github.com/apache/parquet-testing/pull/100/files.
> You might have to checkout that branch to be able to run the benchmarks.
>
> * the encoding of integers uses a custom framing with frame-of-reference
> > encoding inside it, but Parquet implementations already implement
> > DELTA_BINARY_PACKED which should have similar characteristics, so why
> > not reuse that?
> >
>
> I did look at DELTA_BINARY_PACKED. Unless I understood it wrong, it
> didn't
> fit the needs.
>
> My understanding of DELTA_BINARY_PACKED is this
> delta[i] = value[i] - value[i-1]
>
> Pasting an example of why it may fail.
>
> Input: [19.99, 5.49, 149.00, 0.99, 299.99] // ALP encoded prices:
> [1999, 549, 14900, 99, 29999]
> ═══
> DELTA_BINARY_PACKED (Two Levels)
> ═══
>
> STEP 1: Adjacent differences
>   first_value = 1999  (stored in header)
>
>   delta[0] = 549 - 1999 = -1450
>   delta[1] = 14900 - 549= +14351
>   delta[2] = 99 - 14900 = -14801
>   delta[3] = 29999 - 99 = +29900
>
>   Adjacent deltas = [-1450, +14351, -14801, +29900]
>
> STEP 2: Frame of reference on the deltas
>   min_delta = -14801  (stored as zigzag ULEB128)
>
>   adjusted[0] = -1450 - (-14801)  = 13351
>   adjusted[1] = +14351 - (-14801) = 29152
>   adjusted[2] = -14801 - (-14801) = 0
>   adjusted[3] = +29900 - (-14801) = 44701
>
>   Adjusted deltas = [13351, 29152, 0, 44701]
>
>   Range: 0 to 44701 → bit_width = ceil(log2(44702)) = 16 bits
>
> ═══
>   FOR (Single Level)
> ═══
>
>   min = 99  (stored as frame_of_reference)
>
>   delta[0] = 1999 - 99   = 1900
>   delta[1] = 549 - 99= 450
>   delta[2] = 14900 - 99  = 14801
>   delta[3] = 99 - 99 = 0
>   delta[4] = 29999 - 99  = 29900
>
>   Deltas = [1900, 450, 14801, 0, 29900]
>
>

Re: [Parquet] ALP Encoding for Floating point data

2026-01-26 Thread PRATEEK GAUR
Thanks Andrew for building momentum.

Hi Antoine,

Replies to your questions are inline.

On Mon, Jan 26, 2026 at 2:45 AM Antoine Pitrou  wrote:

>
> Hey all,
>
> Thanks Prateek and Dhirhan for submitting this as it's clear you've been
> putting quite a bit of work into it. IMHO, the ALP encoding looks very
> promising as an addition to Parquet format.
>
> That said, I have a few technical concerns:
>
> * I cannot seem to run the C++ benchmarks because of the git submodule
> configuration. It may be easier to fix but I'm looking for guidance here
> :-)
>
> ```
> $ LANG=C git submodule update
> fatal: transport 'file' not allowed
> fatal: Fetched in submodule path 'submodules/parquet-testing', but it
> did not contain 66dfde8b2a569e7cbc8e998153e8dd6f2b36f940. Direct
> fetching of that commit failed.
> ```
>

I think that is because the dataset branch hasn't been merged in yet.
The files are in this pull request
https://github.com/apache/parquet-testing/pull/100/files.
You might have to checkout that branch to be able to run the benchmarks.

* the encoding of integers uses a custom framing with frame-of-reference
> encoding inside it, but Parquet implementations already implement
> DELTA_BINARY_PACKED which should have similar characteristics, so why
> not reuse that?
>

I did look at DELTA_BINARY_PACKED. Unless I understood it wrong, it didn't
fit the needs.

My understanding of DELTA_BINARY_PACKED is this
delta[i] = value[i] - value[i-1]

Pasting an example of why it may fail.

Input: [19.99, 5.49, 149.00, 0.99, 299.99] // ALP encoded prices:
[1999, 549, 14900, 99, 29999]
═══
DELTA_BINARY_PACKED (Two Levels)
═══

STEP 1: Adjacent differences
  first_value = 1999  (stored in header)

  delta[0] = 549 - 1999 = -1450
  delta[1] = 14900 - 549= +14351
  delta[2] = 99 - 14900 = -14801
  delta[3] = 29999 - 99 = +29900

  Adjacent deltas = [-1450, +14351, -14801, +29900]

STEP 2: Frame of reference on the deltas
  min_delta = -14801  (stored as zigzag ULEB128)

  adjusted[0] = -1450 - (-14801)  = 13351
  adjusted[1] = +14351 - (-14801) = 29152
  adjusted[2] = -14801 - (-14801) = 0
  adjusted[3] = +29900 - (-14801) = 44701

  Adjusted deltas = [13351, 29152, 0, 44701]

  Range: 0 to 44701 → bit_width = ceil(log2(44702)) = 16 bits

═══
  FOR (Single Level)
═══

  min = 99  (stored as frame_of_reference)

  delta[0] = 1999 - 99   = 1900
  delta[1] = 549 - 99= 450
  delta[2] = 14900 - 99  = 14801
  delta[3] = 99 - 99 = 0
  delta[4] = 29999 - 99  = 29900

  Deltas = [1900, 450, 14801, 0, 29900]

  Range: 0 to 29900 → bit_width = ceil(log2(29901)) = 15 bits

I think for floating point we need the min, not adjacent subtractions,
which would also produce negative values.
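To make the comparison concrete, here is a small sketch (my own
illustration, not spec text) computing the bit widths each scheme would
need for the example values above:

```python
def bit_width(max_value: int) -> int:
    """Bits needed to represent the range [0, max_value]."""
    return max_value.bit_length()

values = [1999, 549, 14900, 99, 29999]  # ALP-encoded example prices

# DELTA_BINARY_PACKED style: adjacent differences, then subtract min delta.
deltas = [b - a for a, b in zip(values, values[1:])]
adjusted = [d - min(deltas) for d in deltas]
delta_bits = bit_width(max(adjusted))   # max adjusted delta is 44701

# FOR style: subtract the minimum value directly.
offsets = [v - min(values) for v in values]
for_bits = bit_width(max(offsets))      # max offset is 29900

print(delta_bits, for_bits)  # prints: 16 15
```

On this data the two-level delta scheme needs one more bit per packed
integer than single-level FOR, matching the worked example.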


> * there are a lot of fields in the headers that look a bit superfluous
> (though of course those bits are relatively cheap); for example, why
> have a format "version" while we could define a new encoding for
> incompatible evolutions?
>

We discussed this point in the Spec document a lot and have gravitated
towards a versioning scheme for easier evolution.


>
> * the "Total Encoded Element count" duplicates information already in
> the page header, with risks of inconsistent values (including security
> risks that require specific care in implementations)
>

'num_elements' : let me re-read and get back on this.


>
> * what happens if the number of exceptions is above 65535? their indices
> are coded as 16-bit uints. How about using the same encoding as for
> bit-packed integers (e.g. DELTA_BINARY_PACKED), which will also remove
> the 65535 limitation.
>

So, I don't see a need for a vector larger than 65535 values. With vectors
that large, the metadata overhead is small and you might as well break the
data into multiple vectors. I'm gonna give it some more thought and get back.
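For illustration, here is a minimal sketch of the exception mechanism this
limit applies to (my own simplification, with a fixed exponent; the real
algorithm searches exponent/factor pairs): values that don't round-trip
through the decimal transform are stored verbatim, with their positions as
16-bit indices within the vector.

```python
def alp_encode_vector(values, exponent=2):
    """Encode floats as scaled integers; collect non-round-tripping exceptions."""
    assert len(values) <= 0xFFFF  # exception positions must fit in a uint16
    scale = 10 ** exponent
    encoded, exc_pos, exc_val = [], [], []
    for i, v in enumerate(values):
        n = round(v * scale)
        if n / scale == v:          # exact round trip: keep the integer
            encoded.append(n)
        else:                       # store the raw value as an exception
            encoded.append(0)       # placeholder, patched back on decode
            exc_pos.append(i)
            exc_val.append(v)
    return encoded, exc_pos, exc_val

enc, pos, raw = alp_encode_vector([19.99, 5.49, 3.141592653589793, 0.99])
print(pos)  # prints: [2]  (the pi-like value is not representable as n/100)
```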


>
> * the C++ implementation has a `kSamplerRowgroupSize` constant, which
> worries me; row group size can vary *a lot* between workloads (from
> thousands to millions of elements typically), the sampling process
> should not depend on that.
>

The sampling process should be statistically significant. It should pick
enough values and not be biased towards values at the start. The ALP
algorithm ensures that and balances between spending too many cycles to
find the right parameters and picking incorrect parameters.

For a very large row group we can change the constant and have it select
over a larger data set.
Or one can do it at page level too. Happy to discuss more on this.
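One way to keep sampling independent of row-group size is sketched below
(my own illustration; the function name, vector size, and sample count are
assumptions, not the C++ implementation): take a bounded number of equally
spaced vectors from the data, so large inputs don't bias sampling toward
the start.

```python
def sample_positions(num_values: int, vector_size: int = 1024,
                     max_samples: int = 8) -> list[int]:
    """Return start offsets of sample vectors spread evenly across the data."""
    num_vectors = max(1, num_values // vector_size)
    step = max(1, num_vectors // max_samples)      # stride between samples
    return [i * vector_size for i in range(0, num_vectors, step)][:max_samples]

# Samples are spread over the whole range, not clustered at the front.
print(sample_positions(1_000_000))
```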


>
> Regards
>
> Antoine.
>
>
>
> Le 16/10/2025 à 23:47, P

Re: [Parquet] ALP Encoding for Floating point data

2026-01-26 Thread Antoine Pitrou



Hey all,

Thanks Prateek and Dhirhan for submitting this as it's clear you've been 
putting quite a bit of work into it. IMHO, the ALP encoding looks very 
promising as an addition to Parquet format.


That said, I have a few technical concerns:

* I cannot seem to run the C++ benchmarks because of the git submodule 
configuration. It may be easier to fix but I'm looking for guidance here :-)


```
$ LANG=C git submodule update
fatal: transport 'file' not allowed
fatal: Fetched in submodule path 'submodules/parquet-testing', but it 
did not contain 66dfde8b2a569e7cbc8e998153e8dd6f2b36f940. Direct 
fetching of that commit failed.

```

* the encoding of integers uses a custom framing with frame-of-reference 
encoding inside it, but Parquet implementations already implement 
DELTA_BINARY_PACKED which should have similar characteristics, so why 
not reuse that?


* there are a lot of fields in the headers that look a bit superfluous 
(though of course those bits are relatively cheap); for example, why 
have a format "version" while we could define a new encoding for 
incompatible evolutions?


* the "Total Encoded Element count" duplicates information already in 
the page header, with risks of inconsistent values (including security 
risks that require specific care in implementations)


* what happens if the number of exceptions is above 65535? their indices 
are coded as 16-bit uints. How about using the same encoding as for 
bit-packed integers (e.g. DELTA_BINARY_PACKED), which will also remove 
the 65535 limitation.


* the C++ implementation has a `kSamplerRowgroupSize` constant, which 
worries me; row group size can vary *a lot* between workloads (from 
thousands to millions of elements typically), the sampling process 
should not depend on that.


Regards

Antoine.



Le 16/10/2025 à 23:47, PRATEEK GAUR a écrit :

Hi team,

We spent some time evaluating ALP compression and decompression compared to
other encoding alternatives like CHIMP/GORILLA and compression techniques
like SNAPPY/LZ4/ZSTD. We presented these numbers to the community members
on October 15th in the biweekly parquet meeting. (I can't seem to access
the recording, so please let me know what access I need to be able to
view it.)

We did this evaluation over some datasets pointed to by the ALP paper and
some suggested by the parquet community.

The results are available in the following document:
https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg

Based on the numbers we see

-  ALP is comparable to ZSTD(level=1) in terms of compression ratio and
much better compared to other schemes. (numbers in the sheet are bytes
needed to encode each value)
- ALP does quite well in terms of decompression speed (numbers in the
sheet are bytes decompressed per second)

As next steps we will

- Get the numbers for compression on top of byte stream split.
- Evaluate the algorithm over a few more datasets.
- Have an implementation in the arrow-parquet repo.

Looking forward to feedback from the community.

Best
Prateek and Dhirhan






Re: [Parquet] ALP Encoding for Floating point data

2026-01-25 Thread Andrew Lamb
I have also taken the liberty to solicit feedback (links below for my own
personal memory) from other open source implementations listed on our
implementation status page, in case they would like to help with the
process and share their experience implementing and maintaining the current
encodings.

Andrew

https://github.com/pola-rs/polars/issues/26279
https://github.com/duckdb/duckdb/discussions/20665
https://github.com/rapidsai/cudf/issues/21173
https://github.com/apache/arrow-go/issues/646
https://github.com/hyparam/hyparquet/issues/151



On Thu, Jan 22, 2026 at 12:12 PM PRATEEK GAUR  wrote:

> Awesome,
>
> That was fast :). I'll look at it in detail and see if I can fill out on
> any missing details (if they are present).
> Thanks for taking a look at the 'cross compatibility tests'. That'll strike
> off a big item from the TODO list.
>
> Best
> Prateek
>
> On Thu, Jan 22, 2026 at 8:59 AM Julien Le Dem  wrote:
>
> > Following Micah's suggestion yesterday, I took a stab at using Claude to
> > produce a java implementation of ALP based on Prateek's spec and c++
> > implementation.
> > https://github.com/apache/parquet-java/pull/3390
> > Bear in mind that I haven't closely reviewed it yet, it is fairly
> > experimental but it seems promising.
> > I will look into running cross compatibility tests with the cpp
> > implementation.
> >
> > On Wed, Jan 21, 2026 at 2:53 PM Andrew Lamb 
> > wrote:
> >
> > > > Would this require a
> > > more fundamental change to the data layout as proposed (i.e. something
> we
> > > > can't plugin by adding a new integer encoding)?
> > >
> > > > We can plugin a new layout, it would just be an enum change which
> > > triggers
> > > new
> > > > code path. We would have have to swap out bit unpacker which I used
> > > because
> > > > it was already present in arrow code base. I agree that fastlanes
> would
> > > be
> > > > good
> > >
> > > I agree with both of your assessments that this could be added in the
> > > future with the current spec.
> > >
> > > Thanks for the clarifications
> > >
> > > On Wed, Jan 21, 2026 at 5:38 PM PRATEEK GAUR 
> wrote:
> > >
> > > > >
> > > > >
> > > > > I think we touched on this briefly in a sync but linear encoding
> was
> > > > chosen
> > > > > because we already have these routines written for
> > > DELTA_BINARY_PACKED? I
> > > > > think the current design is extensible now to support other types
> of
> > > > > integer encodings.  Or I might be misunderstanding. Would this
> > require
> > > a
> > > > > more fundamental change to the data layout as proposed (i.e.
> > something
> > > we
> > > > > can't plugin by adding a new integer encoding)?
> > > > >
> > > >
> > > > We can plugin a new layout, it would just be an enum change which
> > > triggers
> > > > new
> > > > code path. We would have have to swap out bit unpacker which I used
> > > because
> > > > it was already present in arrow code base. I agree that fastlanes
> would
> > > be
> > > > good
> > > > to have but that is also a more fundamental building block which I'm
> > > happy
> > > > to
> > > > take up outside the ALP effort and then integrate it with ALP later
> on
> > > > given ALP
> > > > allows a mechanism to deal with it with minimal changes.
> > > >
> > > > I fear with fastlanes and need to implement it it in all languages
> can
> > > > potentially
> > > > slow down the project.
> > > >
> > > >
> > > >
> > > > > If it isn't a fundamental change, unless we have a volunteer to
> > > implement
> > > > > it immediately, I think we can maybe defer this for follow-up work
> on
> > > > > integer encodings, and then add it as an option to ALP when it
> > becomes
> > > > > available. I want to be careful of moving the goal-posts here.
> > > > >
> > > >
> > > > Okay you and I are thinking along the same lines :).
> > > >
> > > >
> > > > >
> > > > > 2) The layout for exceptions, specifically making sure that the
> spec
> > > > allows
> > > > > > other potential layouts in the future to make them more GPU
> > friendly.
> > > > One
> > > > > > proposal is in the G-ALP[3] paper, but it comes with tradeoffs
> > (e.g.
> > > it
> > > > > > requires additional storage overhead).
> > > > >
> > > > >
> > > > > I think changing the exception layout would be handled by the
> version
> > > > enum
> > > > > in the current proposal?
> > > > >
> > > >
> > > > Yes, current spec allows for this.
> > > >
> > > >
> > > > >
> > > > > Cheers,
> > > > > Micah
> > > > >
> > > > >
> > > > > On Wed, Jan 21, 2026 at 1:57 PM Andrew Lamb <
> [email protected]>
> > > > > wrote:
> > > > >
> > > > > > First of all, thank you again for this spec. I would recommend
> > anyone
> > > > > else
> > > > > > curious about ALP (or wanting to read a well written technical
> > spec)
> > > to
> > > > > > read Prateek's document -- it is really nice.
> > > > > >
> > > > > > I would like to raise two more items (I am not sure the spec
> needs
> > to
> > > > be
> > > > > > changed to accommodate them, but I do think we should di

Re: [Parquet] ALP Encoding for Floating point data

2026-01-22 Thread PRATEEK GAUR
Awesome,

That was fast :). I'll look at it in detail and see if I can fill out on
any missing details (if they are present).
Thanks for taking a look at the 'cross compatibility tests'. That'll strike
off a big item from the TODO list.

Best
Prateek

On Thu, Jan 22, 2026 at 8:59 AM Julien Le Dem  wrote:

> Following Micah's suggestion yesterday, I took a stab at using Claude to
> produce a java implementation of ALP based on Prateek's spec and c++
> implementation.
> https://github.com/apache/parquet-java/pull/3390
> Bear in mind that I haven't closely reviewed it yet, it is fairly
> experimental but it seems promising.
> I will look into running cross compatibility tests with the cpp
> implementation.
>
> On Wed, Jan 21, 2026 at 2:53 PM Andrew Lamb 
> wrote:
>
> > > Would this require a
> > more fundamental change to the data layout as proposed (i.e. something we
> > > can't plugin by adding a new integer encoding)?
> >
> > > We can plugin a new layout, it would just be an enum change which
> > triggers
> > new
> > > code path. We would have have to swap out bit unpacker which I used
> > because
> > > it was already present in arrow code base. I agree that fastlanes would
> > be
> > > good
> >
> > I agree with both of your assessments that this could be added in the
> > future with the current spec.
> >
> > Thanks for the clarifications
> >
> > On Wed, Jan 21, 2026 at 5:38 PM PRATEEK GAUR  wrote:
> >
> > > >
> > > >
> > > > I think we touched on this briefly in a sync but linear encoding was
> > > chosen
> > > > because we already have these routines written for
> > DELTA_BINARY_PACKED? I
> > > > think the current design is extensible now to support other types of
> > > > integer encodings.  Or I might be misunderstanding. Would this
> require
> > a
> > > > more fundamental change to the data layout as proposed (i.e.
> something
> > we
> > > > can't plugin by adding a new integer encoding)?
> > > >
> > >
> > > We can plugin a new layout, it would just be an enum change which
> > triggers
> > > new
> > > code path. We would have have to swap out bit unpacker which I used
> > because
> > > it was already present in arrow code base. I agree that fastlanes would
> > be
> > > good
> > > to have but that is also a more fundamental building block which I'm
> > happy
> > > to
> > > take up outside the ALP effort and then integrate it with ALP later on
> > > given ALP
> > > allows a mechanism to deal with it with minimal changes.
> > >
> > > I fear with fastlanes and need to implement it it in all languages can
> > > potentially
> > > slow down the project.
> > >
> > >
> > >
> > > > If it isn't a fundamental change, unless we have a volunteer to
> > implement
> > > > it immediately, I think we can maybe defer this for follow-up work on
> > > > integer encodings, and then add it as an option to ALP when it
> becomes
> > > > available. I want to be careful of moving the goal-posts here.
> > > >
> > >
> > > Okay you and I are thinking along the same lines :).
> > >
> > >
> > > >
> > > > 2) The layout for exceptions, specifically making sure that the spec
> > > allows
> > > > > other potential layouts in the future to make them more GPU
> friendly.
> > > One
> > > > > proposal is in the G-ALP[3] paper, but it comes with tradeoffs
> (e.g.
> > it
> > > > > requires additional storage overhead).
> > > >
> > > >
> > > > I think changing the exception layout would be handled by the version
> > > enum
> > > > in the current proposal?
> > > >
> > >
> > > Yes, current spec allows for this.
> > >
> > >
> > > >
> > > > Cheers,
> > > > Micah
> > > >
> > > >
> > > > On Wed, Jan 21, 2026 at 1:57 PM Andrew Lamb 
> > > > wrote:
> > > >
> > > > > First of all, thank you again for this spec. I would recommend
> anyone
> > > > else
> > > > > curious about ALP (or wanting to read a well written technical
> spec)
> > to
> > > > > read Prateek's document -- it is really nice.
> > > > >
> > > > > I would like to raise two more items (I am not sure the spec needs
> to
> > > be
> > > > > changed to accommodate them, but I do think we should discuss
> them):
> > > > >
> > > > > 1) Interleaving the bitpacked values (this was suggested by Peter
> > > Boncz).
> > > > > Specifically, I recommend we consider the technique described in
> the
> > > > > FASTLANES paper[1] (figure 1) that interleaves bit-packed values
> in a
> > > > > pattern that enables decoding multiple values using a single
> > > > > SIMD instruction and is GPU friendly. To be clear we don't need to
> > > > > implement all of the techniques described in that paper, but I
> think
> > > the
> > > > > interleaving is worth considering. It seems like the current
> > prototype
> > > > uses
> > > > > linear bitpacking[2]
> > > > >
> > > > > 2) The layout for exceptions, specifically making sure that the
> spec
> > > > allows
> > > > > other potential layouts in the future to make them more GPU
> friendly.
> > > One
> > > > > proposal is in the G-ALP[3] paper, but it comes with tradeoffs

Re: [Parquet] ALP Encoding for Floating point data

2026-01-22 Thread Julien Le Dem
Following Micah's suggestion yesterday, I took a stab at using Claude to
produce a java implementation of ALP based on Prateek's spec and c++
implementation.
https://github.com/apache/parquet-java/pull/3390
Bear in mind that I haven't closely reviewed it yet, it is fairly
experimental but it seems promising.
I will look into running cross compatibility tests with the cpp
implementation.

On Wed, Jan 21, 2026 at 2:53 PM Andrew Lamb  wrote:

> > Would this require a
> more fundamental change to the data layout as proposed (i.e. something we
> > can't plugin by adding a new integer encoding)?
>
> > We can plugin a new layout, it would just be an enum change which
> triggers
> new
> > code path. We would have have to swap out bit unpacker which I used
> because
> > it was already present in arrow code base. I agree that fastlanes would
> be
> > good
>
> I agree with both of your assessments that this could be added in the
> future with the current spec.
>
> Thanks for the clarifications
>
> On Wed, Jan 21, 2026 at 5:38 PM PRATEEK GAUR  wrote:
>
> > >
> > >
> > > I think we touched on this briefly in a sync but linear encoding was
> > chosen
> > > because we already have these routines written for
> DELTA_BINARY_PACKED? I
> > > think the current design is extensible now to support other types of
> > > integer encodings.  Or I might be misunderstanding. Would this require
> a
> > > more fundamental change to the data layout as proposed (i.e. something
> we
> > > can't plugin by adding a new integer encoding)?
> > >
> >
> > We can plugin a new layout, it would just be an enum change which
> triggers
> > new
> > code path. We would have have to swap out bit unpacker which I used
> because
> > it was already present in arrow code base. I agree that fastlanes would
> be
> > good
> > to have but that is also a more fundamental building block which I'm
> happy
> > to
> > take up outside the ALP effort and then integrate it with ALP later on
> > given ALP
> > allows a mechanism to deal with it with minimal changes.
> >
> > I fear that fastlanes, and the need to implement it in all languages,
> > can
> > potentially slow down the project.
> >
> >
> >
> > > If it isn't a fundamental change, unless we have a volunteer to
> implement
> > > it immediately, I think we can maybe defer this for follow-up work on
> > > integer encodings, and then add it as an option to ALP when it becomes
> > > available. I want to be careful of moving the goal-posts here.
> > >
> >
> > Okay you and I are thinking along the same lines :).
> >
> >
> > >
> > > 2) The layout for exceptions, specifically making sure that the spec
> > allows
> > > > other potential layouts in the future to make them more GPU friendly.
> > One
> > > > proposal is in the G-ALP[3] paper, but it comes with tradeoffs (e.g.
> it
> > > > requires additional storage overhead).
> > >
> > >
> > > I think changing the exception layout would be handled by the version
> > enum
> > > in the current proposal?
> > >
> >
> > Yes, current spec allows for this.
> >
> >
> > >
> > > Cheers,
> > > Micah
> > >
> > >
> > > On Wed, Jan 21, 2026 at 1:57 PM Andrew Lamb 
> > > wrote:
> > >
> > > > First of all, thank you again for this spec. I would recommend anyone
> > > else
> > > > curious about ALP (or wanting to read a well written technical spec)
> to
> > > > read Prateek's document -- it is really nice.
> > > >
> > > > I would like to raise two more items (I am not sure the spec needs to
> > be
> > > > changed to accommodate them, but I do think we should discuss them):
> > > >
> > > > 1) Interleaving the bitpacked values (this was suggested by Peter
> > Boncz).
> > > > Specifically, I recommend we consider the technique described in the
> > > > FASTLANES paper[1] (figure 1) that interleaves bit-packed values in a
> > > > pattern that enables decoding multiple values using a single
> > > > SIMD instruction and is GPU friendly. To be clear we don't need to
> > > > implement all of the techniques described in that paper, but I think
> > the
> > > > interleaving is worth considering. It seems like the current
> prototype
> > > uses
> > > > linear bitpacking[2]
> > > >
> > > > 2) The layout for exceptions, specifically making sure that the spec
> > > allows
> > > > other potential layouts in the future to make them more GPU friendly.
> > One
> > > > proposal is in the G-ALP[3] paper, but it comes with tradeoffs (e.g.
> it
> > > > requires additional storage overhead).
> > > >
> > > > Andrew
> > > >
> > > >
> > > > [1]: https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf
> > > > [2]:
> > > >
> > > >
> > >
> >
> https://github.com/apache/arrow/pull/48345/changes#diff-f9ab708cab94060b4067fff0a6739e9c3751b450422115663b2bd0badfcc748bR801
> > > > [3]: https://dl.acm.org/doi/10.1145/3736227.3736242
> > > >
> > > > On Wed, Jan 14, 2026 at 3:21 PM Andrew Lamb 
> > > > wrote:
> > > >
> > > > > Here is a PR that turns Prateek's document into markdown in the
> > > > > parquet-format repo
> > > > > - https

Re: [Parquet] ALP Encoding for Floating point data

2026-01-21 Thread Andrew Lamb
> Would this require a
> more fundamental change to the data layout as proposed (i.e. something we
> can't plug in by adding a new integer encoding)?

> We can plug in a new layout; it would just be an enum change which
> triggers a new code path. We would have to swap out the bit unpacker,
> which I used because it was already present in the arrow code base. I
> agree that fastlanes would be good

I agree with both of your assessments that this could be added in the
future with the current spec.

Thanks for the clarifications

On Wed, Jan 21, 2026 at 5:38 PM PRATEEK GAUR  wrote:

> >
> >
> > I think we touched on this briefly in a sync but linear encoding was
> chosen
> > because we already have these routines written for DELTA_BINARY_PACKED? I
> > think the current design is extensible now to support other types of
> > integer encodings.  Or I might be misunderstanding. Would this require a
> > more fundamental change to the data layout as proposed (i.e. something we
> > can't plug in by adding a new integer encoding)?
> >
>
> We can plug in a new layout; it would just be an enum change which
> triggers a new code path. We would have to swap out the bit unpacker,
> which I used because it was already present in the arrow code base. I
> agree that fastlanes would be good to have, but that is also a more
> fundamental building block which I'm happy to take up outside the ALP
> effort and then integrate with ALP later on, given that ALP allows a
> mechanism to deal with it with minimal changes.
>
> I fear that fastlanes, and the need to implement it in all languages, can
> potentially slow down the project.
>
>
>
> > If it isn't a fundamental change, unless we have a volunteer to implement
> > it immediately, I think we can maybe defer this for follow-up work on
> > integer encodings, and then add it as an option to ALP when it becomes
> > available. I want to be careful of moving the goal-posts here.
> >
>
> Okay you and I are thinking along the same lines :).
>
>
> >
> > 2) The layout for exceptions, specifically making sure that the spec
> allows
> > > other potential layouts in the future to make them more GPU friendly.
> One
> > > proposal is in the G-ALP[3] paper, but it comes with tradeoffs (e.g. it
> > > requires additional storage overhead).
> >
> >
> > I think changing the exception layout would be handled by the version
> enum
> > in the current proposal?
> >
>
> Yes, current spec allows for this.
>
>
> >
> > Cheers,
> > Micah
> >
> >
> > On Wed, Jan 21, 2026 at 1:57 PM Andrew Lamb 
> > wrote:
> >
> > > First of all, thank you again for this spec. I would recommend anyone
> > else
> > > curious about ALP (or wanting to read a well written technical spec) to
> > > read Prateek's document -- it is really nice.
> > >
> > > I would like to raise two more items (I am not sure the spec needs to
> be
> > > changed to accommodate them, but I do think we should discuss them):
> > >
> > > 1) Interleaving the bitpacked values (this was suggested by Peter
> Boncz).
> > > Specifically, I recommend we consider the technique described in the
> > > FASTLANES paper[1] (figure 1) that interleaves bit-packed values in a
> > > pattern that enables decoding multiple values using a single
> > > SIMD instruction and is GPU friendly. To be clear we don't need to
> > > implement all of the techniques described in that paper, but I think
> the
> > > interleaving is worth considering. It seems like the current prototype
> > uses
> > > linear bitpacking[2]
> > >
> > > 2) The layout for exceptions, specifically making sure that the spec
> > allows
> > > other potential layouts in the future to make them more GPU friendly.
> One
> > > proposal is in the G-ALP[3] paper, but it comes with tradeoffs (e.g. it
> > > requires additional storage overhead).
> > >
> > > Andrew
> > >
> > >
> > > [1]: https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf
> > > [2]:
> > >
> > >
> >
> https://github.com/apache/arrow/pull/48345/changes#diff-f9ab708cab94060b4067fff0a6739e9c3751b450422115663b2bd0badfcc748bR801
> > > [3]: https://dl.acm.org/doi/10.1145/3736227.3736242
> > >
> > > On Wed, Jan 14, 2026 at 3:21 PM Andrew Lamb 
> > > wrote:
> > >
> > > > Here is a PR that turns Prateek's document into markdown in the
> > > > parquet-format repo
> > > > - https://github.com/apache/parquet-format/pull/548
> > > >
> > > > I am a little worried we will have two set of parallel comments (one
> in
> > > > the google doc and one in the PR)
> > > >
> > > > However, the spec is of sufficient quality (thanks, again Prateek)
> that
> > > it
> > > > would be possible for another language implementation to be
> attempted.
> > > >
> > > > Andrew
> > > >
> > > >
> > > >
> > > > On Wed, Jan 14, 2026 at 8:54 AM Andrew Lamb 
> > > > wrote:
> > > >
> > > >> I plan to help turn the document into a PR to parquet-format later
> > today
> > > >>
> > > >> And again thank you Prateek and everyone for helping make this
> happen
> > > >>
> > > >> Andrew
> > > >>
> > > >> On Wed, Jan 14, 2026

Re: [Parquet] ALP Encoding for Floating point data

2026-01-21 Thread PRATEEK GAUR
>
>
> I think we touched on this briefly in a sync but linear encoding was chosen
> because we already have these routines written for DELTA_BINARY_PACKED? I
> think the current design is extensible now to support other types of
> integer encodings.  Or I might be misunderstanding. Would this require a
> more fundamental change to the data layout as proposed (i.e. something we
> can't plug in by adding a new integer encoding)?
>

We can plug in a new layout; it would just be an enum change which triggers
a new code path. We would have to swap out the bit unpacker, which I used
because it was already present in the arrow code base. I agree that
fastlanes would be good to have, but that is also a more fundamental
building block which I'm happy to take up outside the ALP effort and then
integrate with ALP later on, given that ALP allows a mechanism to deal with
it with minimal changes.

I fear that fastlanes, and the need to implement it in all languages, can
potentially slow down the project.



> If it isn't a fundamental change, unless we have a volunteer to implement
> it immediately, I think we can maybe defer this for follow-up work on
> integer encodings, and then add it as an option to ALP when it becomes
> available. I want to be careful of moving the goal-posts here.
>

Okay you and I are thinking along the same lines :).


>
> 2) The layout for exceptions, specifically making sure that the spec allows
> > other potential layouts in the future to make them more GPU friendly. One
> > proposal is in the G-ALP[3] paper, but it comes with tradeoffs (e.g. it
> > requires additional storage overhead).
>
>
> I think changing the exception layout would be handled by the version enum
> in the current proposal?
>

Yes, current spec allows for this.


>
> Cheers,
> Micah
>
>
> On Wed, Jan 21, 2026 at 1:57 PM Andrew Lamb 
> wrote:
>
> > First of all, thank you again for this spec. I would recommend anyone
> else
> > curious about ALP (or wanting to read a well written technical spec) to
> > read Prateek's document -- it is really nice.
> >
> > I would like to raise two more items (I am not sure the spec needs to be
> > changed to accommodate them, but I do think we should discuss them):
> >
> > 1) Interleaving the bitpacked values (this was suggested by Peter Boncz).
> > Specifically, I recommend we consider the technique described in the
> > FASTLANES paper[1] (figure 1) that interleaves bit-packed values in a
> > pattern that enables decoding multiple values using a single
> > SIMD instruction and is GPU friendly. To be clear we don't need to
> > implement all of the techniques described in that paper, but I think the
> > interleaving is worth considering. It seems like the current prototype
> uses
> > linear bitpacking[2]
> >
> > 2) The layout for exceptions, specifically making sure that the spec
> allows
> > other potential layouts in the future to make them more GPU friendly. One
> > proposal is in the G-ALP[3] paper, but it comes with tradeoffs (e.g. it
> > requires additional storage overhead).
> >
> > Andrew
> >
> >
> > [1]: https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf
> > [2]:
> >
> >
> https://github.com/apache/arrow/pull/48345/changes#diff-f9ab708cab94060b4067fff0a6739e9c3751b450422115663b2bd0badfcc748bR801
> > [3]: https://dl.acm.org/doi/10.1145/3736227.3736242
> >
> > On Wed, Jan 14, 2026 at 3:21 PM Andrew Lamb 
> > wrote:
> >
> > > Here is a PR that turns Prateek's document into markdown in the
> > > parquet-format repo
> > > - https://github.com/apache/parquet-format/pull/548
> > >
> > > I am a little worried we will have two set of parallel comments (one in
> > > the google doc and one in the PR)
> > >
> > > However, the spec is of sufficient quality (thanks, again Prateek) that
> > it
> > > would be possible for another language implementation to be attempted.
> > >
> > > Andrew
> > >
> > >
> > >
> > > On Wed, Jan 14, 2026 at 8:54 AM Andrew Lamb 
> > > wrote:
> > >
> > >> I plan to help turn the document into a PR to parquet-format later
> today
> > >>
> > >> And again thank you Prateek and everyone for helping make this happen
> > >>
> > >> Andrew
> > >>
> > >> On Wed, Jan 14, 2026 at 6:34 AM Antoine Pitrou 
> > >> wrote:
> > >>
> > >>>
> > >>> Yes, I'd really rather comment on the final spec, rather than a
> Google
> > >>> doc.
> > >>>
> > >>> (also, Google Doc comments are not terrific for non-trivial
> > discussions)
> > >>>
> > >>>
> > >>> Le 14/01/2026 à 10:37, Gang Wu a écrit :
> > >>> > Is it better to create a PR against
> > >>> https://github.com/apache/parquet-format
> > >>> > so
> > >>> > it can become the single source of truth of the Parquet-ALP spec?
> > >>> >
> > >>> > On Wed, Jan 14, 2026 at 9:34 AM Julien Le Dem 
> > >>> wrote:
> > >>> >
> > >>> >> Thank you Micah for the detailed review!
> > >>> >> Who else needs to do a round of reviews on the spec before we can
> > >>> finalize
> > >>> >> it?
> > >>>
> > >>>
> > >>>
> >
>


Re: [Parquet] ALP Encoding for Floating point data

2026-01-21 Thread PRATEEK GAUR
Hi Andrew,

Thanks :).


   - Interleaved bit-packing : Yes, this has been on my mind, and thanks
   for bringing it up. It came up as part of the benchmark discussion for
   pFOR too. Thankfully, keeping these improvements in mind, we have
   designed the *ALP spec such that it allows* the current FOR-based
   integer encoding to be swapped out for FastLanes, which I think is what
   Peter was referring to.
   - Exception layout : By the way the hyperparameters are picked, the
   number of exceptions has to be low, as each exception carries slightly
   higher overhead with respect to storage and read. This means that
   reading the exceptions is *not on the performance-critical path*, so I'm
   not sure whether more complicated GPU-friendly encodings will give a
   general improvement. With that said, thankfully :), the ALP spec has
   been written with this extension in mind, and one can change the version
   to accommodate a different exception encoding.
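For readers following along, the pseudo-decimal idea with patched exceptions
can be sketched roughly like this. This is a toy illustration of the scheme
in the ALP paper, not the spec's actual layout; all names are invented, and
the real encoder picks exponents by sampling rather than using a fixed one:

```python
import math

# Toy sketch of ALP pseudo-decimal encoding (illustrative only; the real
# spec uses exponents chosen by sampling and a different exception layout).
E = 2  # decimal exponent; a real encoder picks this per vector by sampling

def alp_encode(values, e=E):
    ints, exceptions = [], []
    for pos, v in enumerate(values):
        cand = round(v * 10**e)
        if cand / 10**e == v:      # exact round-trip -> store as an integer
            ints.append(cand)
        else:                      # rare non-round-tripping value -> exception
            ints.append(0)         # placeholder, patched during decode
            exceptions.append((pos, v))
    return ints, exceptions

def alp_decode(ints, exceptions, e=E):
    out = [i / 10**e for i in ints]
    for pos, v in exceptions:      # patching happens off the hot loop, which
        out[pos] = v               # is why a low exception count matters
    return out

ints, exc = alp_encode([1.25, 3.0, math.pi])
print(ints, exc)                   # [125, 300, 0] [(2, 3.141592653589793)]
print(alp_decode(ints, exc))       # [1.25, 3.0, 3.141592653589793]
```

The sketch also shows why the exception count drives the tradeoff: each
exception costs a full verbatim value plus a position, so the encoding only
wins when almost every value round-trips through the decimal integer.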


Best
Prateek

On Wed, Jan 21, 2026 at 1:57 PM Andrew Lamb  wrote:

> First of all, thank you again for this spec. I would recommend anyone else
> curious about ALP (or wanting to read a well written technical spec) to
> read Prateek's document -- it is really nice.
>
> I would like to raise two more items (I am not sure the spec needs to be
> changed to accommodate them, but I do think we should discuss them):
>
> 1) Interleaving the bitpacked values (this was suggested by Peter Boncz).
> Specifically, I recommend we consider the technique described in the
> FASTLANES paper[1] (figure 1) that interleaves bit-packed values in a
> pattern that enables decoding multiple values using a single
> SIMD instruction and is GPU friendly. To be clear we don't need to
> implement all of the techniques described in that paper, but I think the
> interleaving is worth considering. It seems like the current prototype uses
> linear bitpacking[2]
>
> 2) The layout for exceptions, specifically making sure that the spec allows
> other potential layouts in the future to make them more GPU friendly. One
> proposal is in the G-ALP[3] paper, but it comes with tradeoffs (e.g. it
> requires additional storage overhead).
>
> Andrew
>
>
> [1]: https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf
> [2]:
>
> https://github.com/apache/arrow/pull/48345/changes#diff-f9ab708cab94060b4067fff0a6739e9c3751b450422115663b2bd0badfcc748bR801
> [3]: https://dl.acm.org/doi/10.1145/3736227.3736242
>
> On Wed, Jan 14, 2026 at 3:21 PM Andrew Lamb 
> wrote:
>
> > Here is a PR that turns Prateek's document into markdown in the
> > parquet-format repo
> > - https://github.com/apache/parquet-format/pull/548
> >
> > I am a little worried we will have two set of parallel comments (one in
> > the google doc and one in the PR)
> >
> > However, the spec is of sufficient quality (thanks, again Prateek) that
> it
> > would be possible for another language implementation to be attempted.
> >
> > Andrew
> >
> >
> >
> > On Wed, Jan 14, 2026 at 8:54 AM Andrew Lamb 
> > wrote:
> >
> >> I plan to help turn the document into a PR to parquet-format later today
> >>
> >> And again thank you Prateek and everyone for helping make this happen
> >>
> >> Andrew
> >>
> >> On Wed, Jan 14, 2026 at 6:34 AM Antoine Pitrou 
> >> wrote:
> >>
> >>>
> >>> Yes, I'd really rather comment on the final spec, rather than a Google
> >>> doc.
> >>>
> >>> (also, Google Doc comments are not terrific for non-trivial
> discussions)
> >>>
> >>>
> >>> Le 14/01/2026 à 10:37, Gang Wu a écrit :
> >>> > Is it better to create a PR against
> >>> https://github.com/apache/parquet-format
> >>> > so
> >>> > it can become the single source of truth of the Parquet-ALP spec?
> >>> >
> >>> > On Wed, Jan 14, 2026 at 9:34 AM Julien Le Dem 
> >>> wrote:
> >>> >
> >>> >> Thank you Micah for the detailed review!
> >>> >> Who else needs to do a round of reviews on the spec before we can
> >>> finalize
> >>> >> it?
> >>>
> >>>
> >>>
>


Re: [Parquet] ALP Encoding for Floating point data

2026-01-21 Thread Micah Kornfield
>
> 1) Interleaving the bitpacked values (this was suggested by Peter Boncz).
> Specifically, I recommend we consider the technique described in the
> FASTLANES paper[1] (figure 1) that interleaves bit-packed values in a
> pattern that enables decoding multiple values using a single
> SIMD instruction and is GPU friendly. To be clear we don't need to
> implement all of the techniques described in that paper, but I think the
> interleaving is worth considering. It seems like the current prototype uses
> linear bitpacking[2]


I think we touched on this briefly in a sync, but linear encoding was chosen
because we already have these routines written for DELTA_BINARY_PACKED? I
think the current design is now extensible to support other types of
integer encodings. Or I might be misunderstanding. Would this require a
more fundamental change to the data layout as proposed (i.e. something we
can't plug in by adding a new integer encoding)?

If it isn't a fundamental change, unless we have a volunteer to implement
it immediately, I think we can defer this to follow-up work on integer
encodings, and then add it as an option to ALP when it becomes available.
I want to be careful of moving the goal-posts here.
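The extensibility being discussed — a stored enum selecting the integer
encoding per page — can be pictured with a minimal dispatch sketch. All
names here are invented for illustration and do not come from the spec;
the point is only the shape of the mechanism:

```python
from enum import IntEnum

class IntEncoding(IntEnum):
    # Hypothetical enum values, not the spec's: the decoder branches once
    # per page on a stored enum, so adding a FastLanes-style layout later
    # is a new variant plus a new code path, not a format redesign.
    LINEAR_BITPACK = 0
    FASTLANES = 1

def decode_linear(payload):
    # stand-in for the existing linear bit-unpacking routine
    return list(payload)

def decode_fastlanes(payload):
    raise NotImplementedError("future interleaved layout")

def decode_ints(enc, payload):
    if enc == IntEncoding.LINEAR_BITPACK:
        return decode_linear(payload)
    if enc == IntEncoding.FASTLANES:
        return decode_fastlanes(payload)
    raise ValueError(f"unknown integer encoding: {enc}")

print(decode_ints(IntEncoding.LINEAR_BITPACK, b"\x01\x02\x03"))  # [1, 2, 3]
```

A per-page dispatch like this also keeps branch misprediction out of the
hot loop, since the branch resolves once per page rather than per value.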

2) The layout for exceptions, specifically making sure that the spec allows
> other potential layouts in the future to make them more GPU friendly. One
> proposal is in the G-ALP[3] paper, but it comes with tradeoffs (e.g. it
> requires additional storage overhead).


I think changing the exception layout would be handled by the version enum
in the current proposal?

Cheers,
Micah


On Wed, Jan 21, 2026 at 1:57 PM Andrew Lamb  wrote:

> First of all, thank you again for this spec. I would recommend anyone else
> curious about ALP (or wanting to read a well written technical spec) to
> read Prateek's document -- it is really nice.
>
> I would like to raise two more items (I am not sure the spec needs to be
> changed to accommodate them, but I do think we should discuss them):
>
> 1) Interleaving the bitpacked values (this was suggested by Peter Boncz).
> Specifically, I recommend we consider the technique described in the
> FASTLANES paper[1] (figure 1) that interleaves bit-packed values in a
> pattern that enables decoding multiple values using a single
> SIMD instruction and is GPU friendly. To be clear we don't need to
> implement all of the techniques described in that paper, but I think the
> interleaving is worth considering. It seems like the current prototype uses
> linear bitpacking[2]
>
> 2) The layout for exceptions, specifically making sure that the spec allows
> other potential layouts in the future to make them more GPU friendly. One
> proposal is in the G-ALP[3] paper, but it comes with tradeoffs (e.g. it
> requires additional storage overhead).
>
> Andrew
>
>
> [1]: https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf
> [2]:
>
> https://github.com/apache/arrow/pull/48345/changes#diff-f9ab708cab94060b4067fff0a6739e9c3751b450422115663b2bd0badfcc748bR801
> [3]: https://dl.acm.org/doi/10.1145/3736227.3736242
>
> On Wed, Jan 14, 2026 at 3:21 PM Andrew Lamb 
> wrote:
>
> > Here is a PR that turns Prateek's document into markdown in the
> > parquet-format repo
> > - https://github.com/apache/parquet-format/pull/548
> >
> > I am a little worried we will have two set of parallel comments (one in
> > the google doc and one in the PR)
> >
> > However, the spec is of sufficient quality (thanks, again Prateek) that
> it
> > would be possible for another language implementation to be attempted.
> >
> > Andrew
> >
> >
> >
> > On Wed, Jan 14, 2026 at 8:54 AM Andrew Lamb 
> > wrote:
> >
> >> I plan to help turn the document into a PR to parquet-format later today
> >>
> >> And again thank you Prateek and everyone for helping make this happen
> >>
> >> Andrew
> >>
> >> On Wed, Jan 14, 2026 at 6:34 AM Antoine Pitrou 
> >> wrote:
> >>
> >>>
> >>> Yes, I'd really rather comment on the final spec, rather than a Google
> >>> doc.
> >>>
> >>> (also, Google Doc comments are not terrific for non-trivial
> discussions)
> >>>
> >>>
> >>> Le 14/01/2026 à 10:37, Gang Wu a écrit :
> >>> > Is it better to create a PR against
> >>> https://github.com/apache/parquet-format
> >>> > so
> >>> > it can become the single source of truth of the Parquet-ALP spec?
> >>> >
> >>> > On Wed, Jan 14, 2026 at 9:34 AM Julien Le Dem 
> >>> wrote:
> >>> >
> >>> >> Thank you Micah for the detailed review!
> >>> >> Who else needs to do a round of reviews on the spec before we can
> >>> finalize
> >>> >> it?
> >>>
> >>>
> >>>
>


Re: [Parquet] ALP Encoding for Floating point data

2026-01-21 Thread Andrew Lamb
First of all, thank you again for this spec. I would recommend that anyone
else curious about ALP (or wanting to read a well-written technical spec)
read Prateek's document -- it is really nice.

I would like to raise two more items (I am not sure the spec needs to be
changed to accommodate them, but I do think we should discuss them):

1) Interleaving the bitpacked values (this was suggested by Peter Boncz).
Specifically, I recommend we consider the technique described in the
FASTLANES paper[1] (figure 1), which interleaves bit-packed values in a
pattern that enables decoding multiple values using a single
SIMD instruction and is GPU friendly. To be clear, we don't need to
implement all of the techniques described in that paper, but I think the
interleaving is worth considering. It seems like the current prototype uses
linear bitpacking[2].
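As a rough picture of what the interleaving buys, the sketch below
contrasts a linear value order with a lane-interleaved one. This is a toy
reordering, not the exact FastLanes unified transposed layout described in
the paper; it only shows the idea that values are regrouped so each SIMD
lane reads a fixed stride:

```python
# Toy contrast between linear and lane-interleaved value orders. NOT the
# exact FastLanes layout (which works on 1024-value blocks with a specific
# transposition); LANES here is just a stand-in for the SIMD width.

LANES = 4

def linear_order(n):
    # linear bitpacking stores values in index order
    return list(range(n))

def interleaved_order(n):
    # regroup so lane k holds values k, k + LANES, k + 2*LANES, ...
    return [v for lane in range(LANES) for v in range(lane, n, LANES)]

print(linear_order(8))       # [0, 1, 2, 3, 4, 5, 6, 7]
print(interleaved_order(8))  # [0, 4, 1, 5, 2, 6, 3, 7]
```

With the interleaved order, one packed machine word per lane yields LANES
decoded values from a single SIMD unpack, instead of the serial
shift-and-mask chain that a linear layout forces.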

2) The layout for exceptions, specifically making sure that the spec allows
other potential layouts in the future to make them more GPU friendly. One
proposal is in the G-ALP[3] paper, but it comes with tradeoffs (e.g. it
requires additional storage overhead).

Andrew


[1]: https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf
[2]:
https://github.com/apache/arrow/pull/48345/changes#diff-f9ab708cab94060b4067fff0a6739e9c3751b450422115663b2bd0badfcc748bR801
[3]: https://dl.acm.org/doi/10.1145/3736227.3736242

On Wed, Jan 14, 2026 at 3:21 PM Andrew Lamb  wrote:

> Here is a PR that turns Prateek's document into markdown in the
> parquet-format repo
> - https://github.com/apache/parquet-format/pull/548
>
> I am a little worried we will have two set of parallel comments (one in
> the google doc and one in the PR)
>
> However, the spec is of sufficient quality (thanks, again Prateek) that it
> would be possible for another language implementation to be attempted.
>
> Andrew
>
>
>
> On Wed, Jan 14, 2026 at 8:54 AM Andrew Lamb 
> wrote:
>
>> I plan to help turn the document into a PR to parquet-format later today
>>
>> And again thank you Prateek and everyone for helping make this happen
>>
>> Andrew
>>
>> On Wed, Jan 14, 2026 at 6:34 AM Antoine Pitrou 
>> wrote:
>>
>>>
>>> Yes, I'd really rather comment on the final spec, rather than a Google
>>> doc.
>>>
>>> (also, Google Doc comments are not terrific for non-trivial discussions)
>>>
>>>
>>> Le 14/01/2026 à 10:37, Gang Wu a écrit :
>>> > Is it better to create a PR against
>>> https://github.com/apache/parquet-format
>>> > so
>>> > it can become the single source of truth of the Parquet-ALP spec?
>>> >
>>> > On Wed, Jan 14, 2026 at 9:34 AM Julien Le Dem 
>>> wrote:
>>> >
>>> >> Thank you Micah for the detailed review!
>>> >> Who else needs to do a round of reviews on the spec before we can
>>> finalize
>>> >> it?
>>>
>>>
>>>


Re: [Parquet] ALP Encoding for Floating point data

2026-01-14 Thread Andrew Lamb
Here is a PR that turns Prateek's document into markdown in the
parquet-format repo
- https://github.com/apache/parquet-format/pull/548

I am a little worried we will have two sets of parallel comments (one in the
google doc and one in the PR)

However, the spec is of sufficient quality (thanks, again Prateek) that it
would be possible for another language implementation to be attempted.

Andrew



On Wed, Jan 14, 2026 at 8:54 AM Andrew Lamb  wrote:

> I plan to help turn the document into a PR to parquet-format later today
>
> And again thank you Prateek and everyone for helping make this happen
>
> Andrew
>
> On Wed, Jan 14, 2026 at 6:34 AM Antoine Pitrou  wrote:
>
>>
>> Yes, I'd really rather comment on the final spec, rather than a Google
>> doc.
>>
>> (also, Google Doc comments are not terrific for non-trivial discussions)
>>
>>
>> Le 14/01/2026 à 10:37, Gang Wu a écrit :
>> > Is it better to create a PR against
>> https://github.com/apache/parquet-format
>> > so
>> > it can become the single source of truth of the Parquet-ALP spec?
>> >
>> > On Wed, Jan 14, 2026 at 9:34 AM Julien Le Dem 
>> wrote:
>> >
>> >> Thank you Micah for the detailed review!
>> >> Who else needs to do a round of reviews on the spec before we can
>> finalize
>> >> it?
>>
>>
>>


Re: [Parquet] ALP Encoding for Floating point data

2026-01-14 Thread Andrew Lamb
I plan to help turn the document into a PR to parquet-format later today

And again thank you Prateek and everyone for helping make this happen

Andrew

On Wed, Jan 14, 2026 at 6:34 AM Antoine Pitrou  wrote:

>
> Yes, I'd really rather comment on the final spec, rather than a Google doc.
>
> (also, Google Doc comments are not terrific for non-trivial discussions)
>
>
> Le 14/01/2026 à 10:37, Gang Wu a écrit :
> > Is it better to create a PR against
> https://github.com/apache/parquet-format
> > so
> > it can become the single source of truth of the Parquet-ALP spec?
> >
> > On Wed, Jan 14, 2026 at 9:34 AM Julien Le Dem  wrote:
> >
> >> Thank you Micah for the detailed review!
> >> Who else needs to do a round of reviews on the spec before we can
> finalize
> >> it?
>
>
>


Re: [Parquet] ALP Encoding for Floating point data

2026-01-14 Thread Antoine Pitrou



Yes, I'd really rather comment on the final spec, rather than a Google doc.

(also, Google Doc comments are not terrific for non-trivial discussions)


Le 14/01/2026 à 10:37, Gang Wu a écrit :

Is it better to create a PR against https://github.com/apache/parquet-format
so
it can become the single source of truth of the Parquet-ALP spec?

On Wed, Jan 14, 2026 at 9:34 AM Julien Le Dem  wrote:


Thank you Micah for the detailed review!
Who else needs to do a round of reviews on the spec before we can finalize
it?





Re: [Parquet] ALP Encoding for Floating point data

2026-01-14 Thread Gang Wu
Is it better to create a PR against https://github.com/apache/parquet-format
so
it can become the single source of truth of the Parquet-ALP spec?

On Wed, Jan 14, 2026 at 9:34 AM Julien Le Dem  wrote:

> Thank you Micah for the detailed review!
> Who else needs to do a round of reviews on the spec before we can finalize
> it?
>
>
> On Tue, Jan 13, 2026 at 10:07 AM PRATEEK GAUR  wrote:
>
> > Thanks Micah for a round of feedback.
> >
> > Here is a link to the spec document :
> >
> https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit
> >
> > On Tue, Nov 25, 2025 at 8:57 AM PRATEEK GAUR  wrote:
> >
> > > On Sat, Nov 22, 2025 at 4:49 AM Steve Loughran 
> > > wrote:
> > >
> > >> First, sorry: I think I accidentally marked as done the comment in the
> > >> doc about x86 performance.
> > >>
> > >
> > > No worries, I restored the thread :).
> > >
> > > Those x86 numbers are critical, especially AVX512 in a recent intel
> part.
> > >> There's a notorious feature in the early ones where the cores would
> > reduce
> > >> frequency after you used the opcodes as a way of managing die
> > temperature (
> > >>
> >
> https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency
> > >> ); the later ones and AMD models are the ones to worry about.
> > >>
> > >
> > > We did collect performance numbers in our early prototype and they
> looked
> > > good on x86 hardware. Though I didn't check the processor family.
> > > In our arrow implementation we are also working on a comprehensive
> > > benchmarking script which will help everyone run it on different CPU
> > > families to get a good idea of performance.
> > >
> > > Best
> > > Prateek
> > >
> > >
> > >> On Sat, 22 Nov 2025 at 04:15, Prateek Gaur via dev <
> > >> [email protected]> wrote:
> > >>
> > >>> Hi team,
> > >>>
> > >>> *ALP ---> ALP PseudoDecimal*
> > >>>
> > >>> As is visible from the numbers above and as stated in the paper too
> for
> > >>> real double values, i.e the values with high precision points, it is
> > very
> > >>> difficult to get a good compression ratio.
> > >>>
> > >>> This combined with the fact that we want to keep the
> > spec/implementation
> > >>> simpler, stating Antoine directly here
> > >>>
> > >>> `*2. Do not include the ALPrd fallback which is a homegrown
> dictionary*
> > >>>
> > >>> *encoding without dictionary reuse across pages, and instead rely on
> > >>> a well-known Parquet encoding (such as BYTE_STREAM_SPLIT?)*`
> > >>>
> > >>> Also based on some discussion I had with Julien in person and the
> > >>> biweekly
> > >>> meeting with a number of you.
> > >>>
> > >>> We'll be going with ALPpd (pseudo decimal) as the first
> > >>> implementation relying on the query engine based on its own
> heuristics
> > to
> > >>> decide on the right fallback to BYTE_STREAM_SPLIT or ZSTD.
> > >>>
> > >>> Best
> > >>> Prateek
> > >>>
> > >>>
> > >>>
> > >>> On Thu, Nov 20, 2025 at 5:09 PM Prateek Gaur <
> > [email protected]
> > >>> >
> > >>> wrote:
> > >>>
> > >>> > Sheet with numbers
> > >>> > <
> > >>>
> >
> https://docs.google.com/spreadsheets/d/1NmCg0WZKeZUc6vNXXD8M3GIyNqF_H3goj6mVbT8at7A/edit?gid=1351944517#gid=1351944517
> > >>> >
> > >>> > .
> > >>> >
> > >>> > On Thu, Nov 20, 2025 at 5:09 PM PRATEEK GAUR 
> > >>> wrote:
> > >>> >
> > >>> >> Hi team,
> > >>> >>
> > >>> >> There was a request from a few folks, Antoine Pitrou and Adam
> Reeve
> > >>> if I
> > >>> >> remember correctly, to perform the experiment on some of the
> papers
> > >>> that
> > >>> >> talked about BYTE_STREAM_SPLIT for completeness.
> > >>> >> I wanted to share the numbers for the same in this sheet. At this
> > >>> point
> > >>> >> we have numbers on a wide variety of data.
> > >>> >> (Will have to share the sheet from my snowflake account as our
> > laptops
> > >>> >> have fair bit of restriction with respect to copy paste
> permissions
> > >>> :) )
> > >>> >>
> > >>> >> Best
> > >>> >> Prateek
> > >>> >>
> > >>> >> On Thu, Nov 20, 2025 at 2:25 PM PRATEEK GAUR 
> > >>> wrote:
> > >>> >>
> > >>> >>> Hi Julien,
> > >>> >>>
> > >>> >>> Yes based on
> > >>> >>>
> > >>> >>>- Numbers presented
> > >>> >>>- Discussions over the doc and
> > >>> >>>- Multiple discussions in the biweekly meeting
> > >>> >>>
> > >>> >>> We are in a stage where we agree this is the right encoding to
> add
> > >>> and
> > >>> >>> we can move to the DRAFT/POC stage from DISCUSS stage.
> > >>> >>> Will start working on the PR for the same.
> > >>> >>>
> > >>> >>> Thanks for bringing this up.
> > >>> >>> Prateek
> > >>> >>>
> > >>> >>> On Thu, Nov 20, 2025 at 8:16 AM Julien Le Dem  >
> > >>> wrote:
> > >>> >>>
> > >>>  @PRATEEK GAUR  : Would you agree that we
> are
> > >>> past
> > >>>  the DISCUSS step and into the DRAFT/POC phase according to the
> > >>> proposals
> > >>>  process <
> > >>> https://github.com/apache/parquet-format/tree/master/proposals
> > >>>  >?
> > >>>  If yes, could you open a PR on this page

Re: [Parquet] ALP Encoding for Floating point data

2026-01-13 Thread Julien Le Dem
Thank you Micah for the detailed review!
Who else needs to do a round of reviews on the spec before we can finalize
it?


On Tue, Jan 13, 2026 at 10:07 AM PRATEEK GAUR  wrote:

> Thanks Micah for a round of feedback.
>
> Here is a link to the spec document :
> https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit
>
> On Tue, Nov 25, 2025 at 8:57 AM PRATEEK GAUR  wrote:
>
> > On Sat, Nov 22, 2025 at 4:49 AM Steve Loughran 
> > wrote:
> >
> >> First, sorry: I think I accidentally marked as done the comment in the
> >> doc about x86 performance.
> >>
> >
> > No worries, I restored the thread :).
> >
> > Those x86 numbers are critical, especially AVX512 in a recent intel part.
> >> There's a notorious feature in the early ones where the cores would
> reduce
> >> frequency after you used the opcodes as a way of managing die
> temperature (
> >>
> https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency
> >> ); the later ones and AMD models are the ones to worry about.
> >>
> >
> > We did collect performance numbers in our early prototype and they looked
> > good on x86 hardware. Though I didn't check the processor family.
> > In our arrow implementation we are also working on a comprehensive
> > benchmarking script which will help everyone run it on different CPU
> > families to get a good idea of performance.
> >
> > Best
> > Prateek
> >
> >
> >> On Sat, 22 Nov 2025 at 04:15, Prateek Gaur via dev <
> >> [email protected]> wrote:
> >>
> >>> Hi team,
> >>>
> >>> *ALP ---> ALP PseudoDecimal*
> >>>
> >>> As is visible from the numbers above and as stated in the paper too for
> >>> real double values, i.e the values with high precision points, it is
> very
> >>> difficult to get a good compression ratio.
> >>>
> >>> This combined with the fact that we want to keep the
> spec/implementation
> >>> simpler, stating Antoine directly here
> >>>
> >>> `*2. Do not include the ALPrd fallback which is a homegrown dictionary*
> >>>
> >>> *encoding without dictionary reuse accross pages, and instead rely on
> >>> awell-known Parquet encoding (such as BYTE_STREAM_SPLIT?)*`
> >>>
> >>> Also based on some discussion I had with Julien in person and the
> >>> biweekly
> >>> meeting with a number of you.
> >>>
> >>> We'll be going with ALPpd (pseudo decimal) as the first
> >>> implementation relying on the query engine based on its own heuristics
> to
> >>> decide on the right fallback to BYTE_STREAM_SPLIT of ZSTD.
> >>>
> >>> Best
> >>> Prateek
> >>>
> >>>
> >>>
> >>> On Thu, Nov 20, 2025 at 5:09 PM Prateek Gaur <
> [email protected]
> >>> >
> >>> wrote:
> >>>
> >>> > Sheet with numbers
> >>> > <
> >>>
> https://docs.google.com/spreadsheets/d/1NmCg0WZKeZUc6vNXXD8M3GIyNqF_H3goj6mVbT8at7A/edit?gid=1351944517#gid=1351944517
> >>> >
> >>> > .
> >>> >
> >>> > On Thu, Nov 20, 2025 at 5:09 PM PRATEEK GAUR 
> >>> wrote:
> >>> >
> >>> >> Hi team,
> >>> >>
> >>> >> There was a request from a few folks, Antoine Pitrou and Adam Reeve
> >>> if I
> >>> >> remember correctly, to perform the experiment on some of the papers
> >>> that
> >>> >> talked about BYTE_STREAM_SPLIT for completeness.
> >>> >> I wanted to share the numbers for the same in this sheet. At this
> >>> point
> >>> >> we have numbers on a wide variety of data.
> >>> >> (Will have to share the sheet from my snowflake account as our
> laptops
> >>> >> have fair bit of restriction with respect to copy paste permissions
> >>> :) )
> >>> >>
> >>> >> Best
> >>> >> Prateek
> >>> >>
> >>> >> On Thu, Nov 20, 2025 at 2:25 PM PRATEEK GAUR 
> >>> wrote:
> >>> >>
> >>> >>> Hi Julien,
> >>> >>>
> >>> >>> Yes based on
> >>> >>>
> >>> >>>- Numbers presented
> >>> >>>- Discussions over the doc and
> >>> >>>- Multiple discussions in the biweekly meeting
> >>> >>>
> >>> >>> We are in a stage where we agree this is the right encoding to add
> >>> and
> >>> >>> we can move to the DRAFT/POC stage from DISCUSS stage.
> >>> >>> Will start working on the PR for the same.
> >>> >>>
> >>> >>> Thanks for bringing this up.
> >>> >>> Prateek
> >>> >>>
> >>> >>> On Thu, Nov 20, 2025 at 8:16 AM Julien Le Dem 
> >>> wrote:
> >>> >>>
> >>>  @PRATEEK GAUR  : Would you agree that we are
> >>> past
> >>>  the DISCUSS step and into the DRAFT/POC phase according to the
> >>> proposals
> >>>  process <
> >>> https://github.com/apache/parquet-format/tree/master/proposals
> >>>  >?
> >>>  If yes, could you open a PR on this page to add this proposal to
> the
> >>>  list?
> >>>  https://github.com/apache/parquet-format/tree/master/proposals
> >>>  Thank you!
> >>> 
> >>> 
> >>>  On Thu, Oct 30, 2025 at 2:38 PM Andrew Lamb <
> [email protected]
> >>> >
> >>>  wrote:
> >>> 
> >>>  > I have filed a ticket[1] in arrow-rs to track prototyping ALP in
> >>> the
> >>>  Rust
> >>>  > Parquet reader if anyone is interested
> >>>  >
> >>>  > Andrew
> >>>  >
> >>>  > [1]:  htt

Re: [Parquet] ALP Encoding for Floating point data

2026-01-13 Thread PRATEEK GAUR
Thanks Micah for a round of feedback.

Here is a link to the spec document :
https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit


Re: Re: [Parquet] ALP Encoding for Floating point data

2025-11-25 Thread PRATEEK GAUR
Hi Azim,

Thank you for the valuable feedback.
The rationale behind the `BYTE_STREAM_SPLIT_ZSTD` fallback looks quite solid.

Looking at the POI dataset, the compression achieved with BYTE_STREAM_SPLIT +
ZSTD is roughly 57 and 59 bits per value, so around a ~15% compression ratio.
That saving could come entirely from the first two byte streams (2/8 = 25% of
the bytes of data) if they exhibit low cardinality.

Will give this a try after the ALP pseudo-decimal implementation in Arrow.

Best
Prateek


Re: [Parquet] ALP Encoding for Floating point data

2025-11-25 Thread PRATEEK GAUR
On Sat, Nov 22, 2025 at 4:49 AM Steve Loughran  wrote:

> First, sorry: I think I accidentally marked as done the comment in the doc
> about x86 performance.
>

No worries, I restored the thread :).

Those x86 numbers are critical, especially AVX512 in a recent intel part.
> There's a notorious feature in the early ones where the cores would reduce
> frequency after you used the opcodes as a way of managing die temperature (
> https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency
> ); the later ones and AMD models are the ones to worry about.
>

We did collect performance numbers with our early prototype and they looked
good on x86 hardware, though I didn't check the processor family.
In our Arrow implementation we are also working on a comprehensive
benchmarking script, which will help everyone run it on different CPU
families and get a good picture of performance.

Best
Prateek



RE: Re: [Parquet] ALP Encoding for Floating point data

2025-11-24 Thread Azim Afroozeh
Hi everyone,

Azim here, first author of the ALP paper.

Great to see the ALP moving into Parquet. I wanted to share one
recommendation that may help when using BYTE_STREAM_SPLIT with ZSTD for
real double-precision data, based on what we learned while designing ALP_RD.

Recommendation:
When using BYTE_STREAM_SPLIT as the fallback for real double floating-point
columns, after applying BYTE_STREAM_SPLIT and obtaining the byte streams,
consider a design where only the first two byte streams are compressed with
ZSTD, rather than applying ZSTD to all byte streams.

Rationale:
During the design of ALP_RD we found that only the high-order bytes contain
meaningful, repeatable patterns. ALP_RD therefore focuses on the first 16
bits (2 bytes) and applies a specialized dictionary-style encoding that
captures redundancy in the sign, exponent, and upper mantissa bits. These
are the parts of floating-point numbers where we consistently observed
structure that is compressible.

From this experience, I would expect that applying ZSTD only to the first
two BYTE_STREAM_SPLIT streams would achieve similar (or sometimes better)
compression ratios than ALP_RD, while avoiding compression of the remaining
byte streams (the lower mantissa bytes), which are effectively high-entropy
noise. ZSTD generally cannot compress these streams, and in some cases
compressing them actually increases the encoded size. Leaving those byte
streams uncompressed also improves decompression speed.

By focusing compression only on the first two byte streams, you retain
almost all of the benefit that ALP_RD provided while keeping the fallback
much simpler and avoiding negative compression on noisy byte streams.

Technical note:
Since ZSTD is a page-level compression codec and BYTE_STREAM_SPLIT is an
encoding, this selective approach cannot be expressed with the current
layering. However, if you consider introducing a new encoding (for example,
something like BYTE_STREAM_SPLIT_ZSTD), that encoding could internally
apply ZSTD only to the first two byte streams and leave the remaining
streams uncompressed.

Happy to share more details if useful.

Best,
Azim
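The selective-compression idea above can be sketched in a few lines. This is a minimal stdlib-only illustration, not the actual Parquet layering: zlib stands in for ZSTD so the sketch has no third-party dependencies, and the function names are made up for the example.

```python
import struct
import zlib  # stand-in for ZSTD so the sketch needs no third-party deps

def split_streams(values):
    # BYTE_STREAM_SPLIT for float64: one stream per byte position.
    # Big-endian packing puts the sign/exponent/upper-mantissa bytes
    # in streams 0 and 1, matching the "first two byte streams" above.
    raw = b"".join(struct.pack(">d", v) for v in values)
    return [raw[i::8] for i in range(8)]

def encode_selective(values, n_zstd=2):
    # Compress only the first n_zstd (high-order) streams; leave the
    # high-entropy low-mantissa streams uncompressed.
    return [zlib.compress(s) if i < n_zstd else s
            for i, s in enumerate(split_streams(values))]

def decode_selective(streams, n_values, n_zstd=2):
    plain = [zlib.decompress(s) if i < n_zstd else s
             for i, s in enumerate(streams)]
    # Re-interleave: for each value, take its byte from each stream.
    raw = bytes(st[j] for j in range(n_values) for st in plain)
    return [struct.unpack(">d", raw[k * 8:(k + 1) * 8])[0]
            for k in range(n_values)]
```

On data whose values share sign and exponent, streams 0 and 1 compress down to a few bytes, while streams 2–7 are passed through untouched (and decode faster for it).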


Re: [Parquet] ALP Encoding for Floating point data

2025-11-24 Thread Antoine Pitrou


I would recommend not getting carried away with AVX512, as it's still
missing from many recent Intel CPUs. AVX2 is the current sweet spot for
SIMD on x86, IMHO.

Regards

Antoine.



Re: [Parquet] ALP Encoding for Floating point data

2025-11-22 Thread Steve Loughran
First, sorry: I think I accidentally marked as done the comment in the doc
about x86 performance.

Those x86 numbers are critical, especially AVX512 in a recent Intel part.
There's a notorious feature in the early ones where the cores would reduce
frequency after you used the opcodes as a way of managing die temperature (
https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency
); the later ones and AMD models are the ones to worry about.

FWIW in hadoop we are starting to see RISC-V PRs for CRC performance, which
boosts throughput reading data from hdfs or even locally if you haven't
turned crc checks off. I wouldn't worry about RISC-V for parquet FP *yet*,
but it's interesting to see that work appearing, especially in the context
of the EU's active development of a sovereign cloud (i.e. one the US govt
can't disable on an order from their president)
https://cordis.europa.eu/project/id/101092993



On Sat, 22 Nov 2025 at 04:15, Prateek Gaur via dev 
wrote:

> Hi team,
>
> *ALP ---> ALP PseudoDecimal*
>
> As is visible from the numbers above, and as stated in the paper too, for
> real double values (i.e. values with many digits of precision) it is very
> difficult to get a good compression ratio.
>
> This combined with the fact that we want to keep the spec/implementation
> simpler, stating Antoine directly here
>
> `*2. Do not include the ALPrd fallback which is a homegrown dictionary*
>
> *encoding without dictionary reuse across pages, and instead rely on a
> well-known Parquet encoding (such as BYTE_STREAM_SPLIT?)*`
>
> Also based on some discussion I had with Julien in person and the biweekly
> meeting with a number of you.
>
> We'll be going with ALPpd (pseudo decimal) as the first
> implementation relying on the query engine based on its own heuristics to
> decide on the right fallback to BYTE_STREAM_SPLIT or ZSTD.
>
> Best
> Prateek
>
>
>
> On Thu, Nov 20, 2025 at 5:09 PM Prateek Gaur 
> wrote:
>
> > Sheet with numbers
> > <
> https://docs.google.com/spreadsheets/d/1NmCg0WZKeZUc6vNXXD8M3GIyNqF_H3goj6mVbT8at7A/edit?gid=1351944517#gid=1351944517
> >
> > .
> >
> > On Thu, Nov 20, 2025 at 5:09 PM PRATEEK GAUR  wrote:
> >
> >> Hi team,
> >>
> >> There was a request from a few folks, Antoine Pitrou and Adam Reeve if I
> >> remember correctly, to perform the experiment on some of the papers that
> >> talked about BYTE_STREAM_SPLIT for completeness.
> >> I wanted to share the numbers for the same in this sheet. At this point
> >> we have numbers on a wide variety of data.
> >> (Will have to share the sheet from my snowflake account as our laptops
> >> have fair bit of restriction with respect to copy paste permissions :) )
> >>
> >> Best
> >> Prateek
> >>
> >> On Thu, Nov 20, 2025 at 2:25 PM PRATEEK GAUR 
> wrote:
> >>
> >>> Hi Julien,
> >>>
> >>> Yes based on
> >>>
> >>>- Numbers presented
> >>>- Discussions over the doc and
> >>>- Multiple discussions in the biweekly meeting
> >>>
> >>> We are in a stage where we agree this is the right encoding to add and
> >>> we can move to the DRAFT/POC stage from DISCUSS stage.
> >>> Will start working on the PR for the same.
> >>>
> >>> Thanks for bringing this up.
> >>> Prateek
> >>>
> >>> On Thu, Nov 20, 2025 at 8:16 AM Julien Le Dem 
> wrote:
> >>>
>  @PRATEEK GAUR  : Would you agree that we are past
>  the DISCUSS step and into the DRAFT/POC phase according to the
> proposals
>  process <
> https://github.com/apache/parquet-format/tree/master/proposals
>  >?
>  If yes, could you open a PR on this page to add this proposal to the
>  list?
>  https://github.com/apache/parquet-format/tree/master/proposals
>  Thank you!
> 
> 
>  On Thu, Oct 30, 2025 at 2:38 PM Andrew Lamb 
>  wrote:
> 
>  > I have filed a ticket[1] in arrow-rs to track prototyping ALP in the
>  Rust
>  > Parquet reader if anyone is interested
>  >
>  > Andrew
>  >
>  > [1]:  https://github.com/apache/arrow-rs/issues/8748
>  >
>  > On Wed, Oct 22, 2025 at 1:33 PM Micah Kornfield <
>  [email protected]>
>  > wrote:
>  >
>  > > >
>  > > > C++, Java and Rust support them for sure. I feel like we should
>  > > > probably default to V2 at some point.
>  > >
>  > >
>  > > I seem to recall, some of the vectorized java readers (Iceberg,
>  Spark)
>  > > might not support V2 data pages (but I might be confusing this
> with
>  > > encodings).  But this is only a vague recollection.
>  > >
>  > >
>  > >
>  > > On Wed, Oct 22, 2025 at 6:38 AM Andrew Lamb <
> [email protected]
>  >
>  > > wrote:
>  > >
>  > > > > Someone has to add V2 data pages to
>  > > > >
>  > > >
>  > > >
>  > >
>  >
> 
> https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
>  > > > > :)
>  > > >
>  > > > Your wish is my command:
>  > 

Re: [Parquet] ALP Encoding for Floating point data

2025-11-21 Thread Prateek Gaur via dev
Hi team,

*ALP ---> ALP PseudoDecimal*

As is visible from the numbers above, and as stated in the paper too, for
real double values, i.e. values with high-precision fractional parts, it is
very difficult to get a good compression ratio.

This, combined with the fact that we want to keep the spec/implementation
simpler, quoting Antoine directly here:

`*2. Do not include the ALPrd fallback, which is a homegrown dictionary
encoding without dictionary reuse across pages, and instead rely on a
well-known Parquet encoding (such as BYTE_STREAM_SPLIT?)*`

Also based on some discussion I had with Julien in person and the biweekly
meeting with a number of you:

We'll be going with ALPpd (pseudo decimal) as the first implementation,
relying on the query engine's own heuristics to decide on the right
fallback to BYTE_STREAM_SPLIT or ZSTD.

Best
Prateek
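
For anyone following along, here is a minimal sketch of the ALPpd idea for
a single value, loosely following the scheme in the paper: multiply by
10^e, divide by 10^f, round to an integer, and keep the result only if it
round-trips exactly; everything else (NaN, Inf, high-precision doubles)
becomes an exception. The function names and the fixed (e, f) pair are
illustrative only; a real encoder picks (e, f) by sampling:

```python
import math

def alp_encode(value: float, e: int, f: int):
    """Try to represent `value` as a scaled decimal integer; return (enc, ok)."""
    if not math.isfinite(value):
        return 0, False  # NaN and +/-Inf are always exceptions
    enc = round(value * 10.0**e * 10.0**-f)
    # Round-trip check: if decoding does not reproduce the value bit-exactly,
    # it must be stored via the exception mechanism instead.
    ok = enc * 10.0**f * 10.0**-e == value
    return enc, ok

def alp_decode(enc: int, e: int, f: int) -> float:
    return enc * 10.0**f * 10.0**-e

# 1.23 with (e=2, f=0) round-trips as the integer 123 ...
enc, ok = alp_encode(1.23, 2, 0)
assert ok and enc == 123 and alp_decode(enc, 2, 0) == 1.23
# ... while a high-precision value does not, and becomes an exception.
_, ok = alp_encode(3.141592653589793, 2, 0)
assert not ok
```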



On Thu, Nov 20, 2025 at 5:09 PM Prateek Gaur 
wrote:

> Sheet with numbers
> 
> .
>
> On Thu, Nov 20, 2025 at 5:09 PM PRATEEK GAUR  wrote:
>
>> Hi team,
>>
>> There was a request from a few folks, Antoine Pitrou and Adam Reeve if I
>> remember correctly, to perform the experiment on some of the papers that
>> talked about BYTE_STREAM_SPLIT for completeness.
>> I wanted to share the numbers for the same in this sheet. At this point
>> we have numbers on a wide variety of data.
>> (Will have to share the sheet from my snowflake account as our laptops
>> have fair bit of restriction with respect to copy paste permissions :) )
>>
>> Best
>> Prateek
>>
>> On Thu, Nov 20, 2025 at 2:25 PM PRATEEK GAUR  wrote:
>>
>>> Hi Julien,
>>>
>>> Yes based on
>>>
>>>- Numbers presented
>>>- Discussions over the doc and
>>>- Multiple discussions in the biweekly meeting
>>>
>>> We are in a stage where we agree this is the right encoding to add and
>>> we can move to the DRAFT/POC stage from DISCUSS stage.
>>> Will start working on the PR for the same.
>>>
>>> Thanks for bringing this up.
>>> Prateek
>>>
>>> On Thu, Nov 20, 2025 at 8:16 AM Julien Le Dem  wrote:
>>>
 @PRATEEK GAUR  : Would you agree that we are past
 the DISCUSS step and into the DRAFT/POC phase according to the proposals
 process ?
 If yes, could you open a PR on this page to add this proposal to the
 list?
 https://github.com/apache/parquet-format/tree/master/proposals
 Thank you!


 On Thu, Oct 30, 2025 at 2:38 PM Andrew Lamb 
 wrote:

 > I have filed a ticket[1] in arrow-rs to track prototyping ALP in the
 Rust
 > Parquet reader if anyone is interested
 >
 > Andrew
 >
 > [1]:  https://github.com/apache/arrow-rs/issues/8748
 >
 > On Wed, Oct 22, 2025 at 1:33 PM Micah Kornfield <
 [email protected]>
 > wrote:
 >
 > > >
 > > > C++, Java and Rust support them for sure. I feel like we should
 > > > probably default to V2 at some point.
 > >
 > >
 > > I seem to recall, some of the vectorized java readers (Iceberg,
 Spark)
 > > might not support V2 data pages (but I might be confusing this with
 > > encodings).  But this is only a vague recollection.
 > >
 > >
 > >
 > > On Wed, Oct 22, 2025 at 6:38 AM Andrew Lamb >>> >
 > > wrote:
 > >
 > > > > Someone has to add V2 data pages to
 > > > >
 > > >
 > > >
 > >
 >
 https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
 > > > > :)
 > > >
 > > > Your wish is my command:
 > https://github.com/apache/parquet-site/pull/124
 > > >
 > > > As the format grows in popularity and momentum builds to evolve,
 I feel
 > > the
 > > > content on the parquet.apache.org site could use refreshing /
 > updating.
 > > > So, while I had the site open, I made some other PRs to scratch
 various
 > > > itches
 > > >
 > > > (I am absolutely 🎣 for someone to please review 🙏):
 > > >
 > > > 1. Add Variant/Geometry/Geography types to implementation status
 > matrix:
 > > > https://github.com/apache/parquet-site/pull/123
 > > > 2. Improve introduction / overview, add more links to spec and
 > > > implementation status:
 https://github.com/apache/parquet-site/pull/125
 > > >
 > > >
 > > > Thanks,
 > > > Andrew
 > > >
 > > > On Wed, Oct 22, 2025 at 4:09 AM Antoine Pitrou <
 [email protected]>
 > > wrote:
 > > >
 > > > >
 > > > > Hi Julien, hi all,
 > > > >
 > > > > On Mon, 20 Oct 2025 15:14:58 -0700
 > > > > Julien Le Dem  wrote:
 > > > > >
 > > > > > Another question from me:
 > > > > >
 > > > > > Since the goal is to not use compression at all in this case
 (no
 > > ZSTD)
 > > > > > I'm assuming we wo

Re: [Parquet] ALP Encoding for Floating point data

2025-11-21 Thread Prateek Gaur via dev
Sheet with numbers

.

On Thu, Nov 20, 2025 at 5:09 PM PRATEEK GAUR  wrote:

> Hi team,
>
> There was a request from a few folks, Antoine Pitrou and Adam Reeve if I
> remember correctly, to perform the experiment on some of the papers that
> talked about BYTE_STREAM_SPLIT for completeness.
> I wanted to share the numbers for the same in this sheet. At this point we
> have numbers on a wide variety of data.
> (Will have to share the sheet from my snowflake account as our laptops
> have fair bit of restriction with respect to copy paste permissions :) )
>
> Best
> Prateek
>
> On Thu, Nov 20, 2025 at 2:25 PM PRATEEK GAUR  wrote:
>
>> Hi Julien,
>>
>> Yes based on
>>
>>- Numbers presented
>>- Discussions over the doc and
>>- Multiple discussions in the biweekly meeting
>>
>> We are in a stage where we agree this is the right encoding to add and we
>> can move to the DRAFT/POC stage from DISCUSS stage.
>> Will start working on the PR for the same.
>>
>> Thanks for bringing this up.
>> Prateek
>>
>> On Thu, Nov 20, 2025 at 8:16 AM Julien Le Dem  wrote:
>>
>>> @PRATEEK GAUR  : Would you agree that we are past
>>> the DISCUSS step and into the DRAFT/POC phase according to the proposals
>>> process >> >?
>>> If yes, could you open a PR on this page to add this proposal to the
>>> list?
>>> https://github.com/apache/parquet-format/tree/master/proposals
>>> Thank you!
>>>
>>>
>>> On Thu, Oct 30, 2025 at 2:38 PM Andrew Lamb 
>>> wrote:
>>>
>>> > I have filed a ticket[1] in arrow-rs to track prototyping ALP in the
>>> Rust
>>> > Parquet reader if anyone is interested
>>> >
>>> > Andrew
>>> >
>>> > [1]:  https://github.com/apache/arrow-rs/issues/8748
>>> >
>>> > On Wed, Oct 22, 2025 at 1:33 PM Micah Kornfield >> >
>>> > wrote:
>>> >
>>> > > >
>>> > > > C++, Java and Rust support them for sure. I feel like we should
>>> > > > probably default to V2 at some point.
>>> > >
>>> > >
>>> > > I seem to recall, some of the vectorized java readers (Iceberg,
>>> Spark)
>>> > > might not support V2 data pages (but I might be confusing this with
>>> > > encodings).  But this is only a vague recollection.
>>> > >
>>> > >
>>> > >
>>> > > On Wed, Oct 22, 2025 at 6:38 AM Andrew Lamb 
>>> > > wrote:
>>> > >
>>> > > > > Someone has to add V2 data pages to
>>> > > > >
>>> > > >
>>> > > >
>>> > >
>>> >
>>> https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
>>> > > > > :)
>>> > > >
>>> > > > Your wish is my command:
>>> > https://github.com/apache/parquet-site/pull/124
>>> > > >
>>> > > > As the format grows in popularity and momentum builds to evolve, I
>>> feel
>>> > > the
>>> > > > content on the parquet.apache.org site could use refreshing /
>>> > updating.
>>> > > > So, while I had the site open, I made some other PRs to scratch
>>> various
>>> > > > itches
>>> > > >
>>> > > > (I am absolutely 🎣 for someone to please review 🙏):
>>> > > >
>>> > > > 1. Add Variant/Geometry/Geography types to implementation status
>>> > matrix:
>>> > > > https://github.com/apache/parquet-site/pull/123
>>> > > > 2. Improve introduction / overview, add more links to spec and
>>> > > > implementation status:
>>> https://github.com/apache/parquet-site/pull/125
>>> > > >
>>> > > >
>>> > > > Thanks,
>>> > > > Andrew
>>> > > >
>>> > > > On Wed, Oct 22, 2025 at 4:09 AM Antoine Pitrou >> >
>>> > > wrote:
>>> > > >
>>> > > > >
>>> > > > > Hi Julien, hi all,
>>> > > > >
>>> > > > > On Mon, 20 Oct 2025 15:14:58 -0700
>>> > > > > Julien Le Dem  wrote:
>>> > > > > >
>>> > > > > > Another question from me:
>>> > > > > >
>>> > > > > > Since the goal is to not use compression at all in this case
>>> (no
>>> > > ZSTD)
>>> > > > > > I'm assuming we would be using either:
>>> > > > > > - the Data Page V1 with UNCOMPRESSED in the
>>> ColumnMetadata.column
>>> > > > > > <
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L887
>>> > > > > >
>>> > > > > > field.
>>> > > > > > - the Data Page V2 with false in the
>>> DataPageHeaderV2.is_compressed
>>> > > > > > <
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L746
>>> > > > > >
>>> > > > > > field
>>> > > > > > The second helping decide if we can selectively compress some
>>> pages
>>> > > if
>>> > > > > they
>>> > > > > > are less compressed by the
>>> > > > > > A few years ago there was a question on the support of the
>>> > > DATA_PAGE_V2
>>> > > > > and
>>> > > > > > I was curious to hear a refresh on how that's generally
>>> supported
>>> > in
>>> > > > > > Parquet implementations. The is_compressed field was exactly
>>> > intended
>>> > > > to
>>> > > > > > a

Re: [Parquet] ALP Encoding for Floating point data

2025-11-20 Thread PRATEEK GAUR
Hi team,

There was a request from a few folks, Antoine Pitrou and Adam Reeve if I
remember correctly, to rerun the experiments for completeness on some of
the papers that discussed BYTE_STREAM_SPLIT.
I wanted to share the numbers for the same in this sheet. At this point we
have numbers on a wide variety of data.
(Will have to share the sheet from my snowflake account as our laptops have
a fair bit of restriction with respect to copy/paste permissions :) )

Best
Prateek

On Thu, Nov 20, 2025 at 2:25 PM PRATEEK GAUR  wrote:

> Hi Julien,
>
> Yes based on
>
>- Numbers presented
>- Discussions over the doc and
>- Multiple discussions in the biweekly meeting
>
> We are in a stage where we agree this is the right encoding to add and we
> can move to the DRAFT/POC stage from DISCUSS stage.
> Will start working on the PR for the same.
>
> Thanks for bringing this up.
> Prateek
>
> On Thu, Nov 20, 2025 at 8:16 AM Julien Le Dem  wrote:
>
>> @PRATEEK GAUR  : Would you agree that we are past
>> the DISCUSS step and into the DRAFT/POC phase according to the proposals
>> process ?
>> If yes, could you open a PR on this page to add this proposal to the list?
>> https://github.com/apache/parquet-format/tree/master/proposals
>> Thank you!
>>
>>
>> On Thu, Oct 30, 2025 at 2:38 PM Andrew Lamb 
>> wrote:
>>
>> > I have filed a ticket[1] in arrow-rs to track prototyping ALP in the
>> Rust
>> > Parquet reader if anyone is interested
>> >
>> > Andrew
>> >
>> > [1]:  https://github.com/apache/arrow-rs/issues/8748
>> >
>> > On Wed, Oct 22, 2025 at 1:33 PM Micah Kornfield 
>> > wrote:
>> >
>> > > >
>> > > > C++, Java and Rust support them for sure. I feel like we should
>> > > > probably default to V2 at some point.
>> > >
>> > >
>> > > I seem to recall, some of the vectorized java readers (Iceberg, Spark)
>> > > might not support V2 data pages (but I might be confusing this with
>> > > encodings).  But this is only a vague recollection.
>> > >
>> > >
>> > >
>> > > On Wed, Oct 22, 2025 at 6:38 AM Andrew Lamb 
>> > > wrote:
>> > >
>> > > > > Someone has to add V2 data pages to
>> > > > >
>> > > >
>> > > >
>> > >
>> >
>> https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
>> > > > > :)
>> > > >
>> > > > Your wish is my command:
>> > https://github.com/apache/parquet-site/pull/124
>> > > >
>> > > > As the format grows in popularity and momentum builds to evolve, I
>> feel
>> > > the
>> > > > content on the parquet.apache.org site could use refreshing /
>> > updating.
>> > > > So, while I had the site open, I made some other PRs to scratch
>> various
>> > > > itches
>> > > >
>> > > > (I am absolutely 🎣 for someone to please review 🙏):
>> > > >
>> > > > 1. Add Variant/Geometry/Geography types to implementation status
>> > matrix:
>> > > > https://github.com/apache/parquet-site/pull/123
>> > > > 2. Improve introduction / overview, add more links to spec and
>> > > > implementation status:
>> https://github.com/apache/parquet-site/pull/125
>> > > >
>> > > >
>> > > > Thanks,
>> > > > Andrew
>> > > >
>> > > > On Wed, Oct 22, 2025 at 4:09 AM Antoine Pitrou 
>> > > wrote:
>> > > >
>> > > > >
>> > > > > Hi Julien, hi all,
>> > > > >
>> > > > > On Mon, 20 Oct 2025 15:14:58 -0700
>> > > > > Julien Le Dem  wrote:
>> > > > > >
>> > > > > > Another question from me:
>> > > > > >
>> > > > > > Since the goal is to not use compression at all in this case (no
>> > > ZSTD)
>> > > > > > I'm assuming we would be using either:
>> > > > > > - the Data Page V1 with UNCOMPRESSED in the
>> ColumnMetadata.column
>> > > > > > <
>> > > > >
>> > > >
>> > >
>> >
>> https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L887
>> > > > > >
>> > > > > > field.
>> > > > > > - the Data Page V2 with false in the
>> DataPageHeaderV2.is_compressed
>> > > > > > <
>> > > > >
>> > > >
>> > >
>> >
>> https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L746
>> > > > > >
>> > > > > > field
>> > > > > > The second helping decide if we can selectively compress some
>> pages
>> > > if
>> > > > > they
>> > > > > > are less compressed by the
>> > > > > > A few years ago there was a question on the support of the
>> > > DATA_PAGE_V2
>> > > > > and
>> > > > > > I was curious to hear a refresh on how that's generally
>> supported
>> > in
>> > > > > > Parquet implementations. The is_compressed field was exactly
>> > intended
>> > > > to
>> > > > > > avoid block compression when the encoding itself is good enough.
>> > > > >
>> > > > > Someone has to add V2 data pages to
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
>> > > > > :)
>> > > > >
>> > > > > C++, Java and Rust support them for sure. I feel like we should
>> > > 

Re: [Parquet] ALP Encoding for Floating point data

2025-11-20 Thread PRATEEK GAUR
Hi Julien,

Yes based on

   - Numbers presented
   - Discussions over the doc and
   - Multiple discussions in the biweekly meeting

We are at a stage where we agree this is the right encoding to add, and we
can move from the DISCUSS stage to the DRAFT/POC stage.
Will start working on the PR for the same.

Thanks for bringing this up.
Prateek

On Thu, Nov 20, 2025 at 8:16 AM Julien Le Dem  wrote:

> @PRATEEK GAUR  : Would you agree that we are past
> the DISCUSS step and into the DRAFT/POC phase according to the proposals
> process ?
> If yes, could you open a PR on this page to add this proposal to the list?
> https://github.com/apache/parquet-format/tree/master/proposals
> Thank you!
>
>
> On Thu, Oct 30, 2025 at 2:38 PM Andrew Lamb 
> wrote:
>
> > I have filed a ticket[1] in arrow-rs to track prototyping ALP in the Rust
> > Parquet reader if anyone is interested
> >
> > Andrew
> >
> > [1]:  https://github.com/apache/arrow-rs/issues/8748
> >
> > On Wed, Oct 22, 2025 at 1:33 PM Micah Kornfield 
> > wrote:
> >
> > > >
> > > > C++, Java and Rust support them for sure. I feel like we should
> > > > probably default to V2 at some point.
> > >
> > >
> > > I seem to recall, some of the vectorized java readers (Iceberg, Spark)
> > > might not support V2 data pages (but I might be confusing this with
> > > encodings).  But this is only a vague recollection.
> > >
> > >
> > >
> > > On Wed, Oct 22, 2025 at 6:38 AM Andrew Lamb 
> > > wrote:
> > >
> > > > > Someone has to add V2 data pages to
> > > > >
> > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
> > > > > :)
> > > >
> > > > Your wish is my command:
> > https://github.com/apache/parquet-site/pull/124
> > > >
> > > > As the format grows in popularity and momentum builds to evolve, I
> feel
> > > the
> > > > content on the parquet.apache.org site could use refreshing /
> > updating.
> > > > So, while I had the site open, I made some other PRs to scratch
> various
> > > > itches
> > > >
> > > > (I am absolutely 🎣 for someone to please review 🙏):
> > > >
> > > > 1. Add Variant/Geometry/Geography types to implementation status
> > matrix:
> > > > https://github.com/apache/parquet-site/pull/123
> > > > 2. Improve introduction / overview, add more links to spec and
> > > > implementation status:
> https://github.com/apache/parquet-site/pull/125
> > > >
> > > >
> > > > Thanks,
> > > > Andrew
> > > >
> > > > On Wed, Oct 22, 2025 at 4:09 AM Antoine Pitrou 
> > > wrote:
> > > >
> > > > >
> > > > > Hi Julien, hi all,
> > > > >
> > > > > On Mon, 20 Oct 2025 15:14:58 -0700
> > > > > Julien Le Dem  wrote:
> > > > > >
> > > > > > Another question from me:
> > > > > >
> > > > > > Since the goal is to not use compression at all in this case (no
> > > ZSTD)
> > > > > > I'm assuming we would be using either:
> > > > > > - the Data Page V1 with UNCOMPRESSED in the ColumnMetadata.column
> > > > > > <
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L887
> > > > > >
> > > > > > field.
> > > > > > - the Data Page V2 with false in the
> DataPageHeaderV2.is_compressed
> > > > > > <
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L746
> > > > > >
> > > > > > field
> > > > > > The second helping decide if we can selectively compress some
> pages
> > > if
> > > > > they
> > > > > > are less compressed by the
> > > > > > A few years ago there was a question on the support of the
> > > DATA_PAGE_V2
> > > > > and
> > > > > > I was curious to hear a refresh on how that's generally supported
> > in
> > > > > > Parquet implementations. The is_compressed field was exactly
> > intended
> > > > to
> > > > > > avoid block compression when the encoding itself is good enough.
> > > > >
> > > > > Someone has to add V2 data pages to
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
> > > > > :)
> > > > >
> > > > > C++, Java and Rust support them for sure. I feel like we should
> > > > > probably default to V2 at some point.
> > > > >
> > > > > Also see https://github.com/apache/parquet-java/issues/3344 for
> > Java.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > > >
> > > > > > Julien
> > > > > >
> > > > > > On Mon, Oct 20, 2025 at 11:57 AM Andrew Lamb
> > > > >  wrote:
> > > > > >
> > > > > > > Thanks again Prateek and co for pushing this along!
> > > > > > >
> > > > > > >
> > > > > > > > 1. Design and write our own Parquet-ALP spec so that
> > > > implementations
> > > > > > > > know exactly how to encode and represent data
> > > > > > >
> > > > > > > 100% agree with this (similar to what was done for
> > ParquetVariant)
> > > >

Re: [Parquet] ALP Encoding for Floating point data

2025-11-20 Thread Julien Le Dem
@PRATEEK GAUR  : Would you agree that we are past
the DISCUSS step and into the DRAFT/POC phase according to the proposals
process ?
If yes, could you open a PR on this page to add this proposal to the list?
https://github.com/apache/parquet-format/tree/master/proposals
Thank you!


On Thu, Oct 30, 2025 at 2:38 PM Andrew Lamb  wrote:

> I have filed a ticket[1] in arrow-rs to track prototyping ALP in the Rust
> Parquet reader if anyone is interested
>
> Andrew
>
> [1]:  https://github.com/apache/arrow-rs/issues/8748
>
> On Wed, Oct 22, 2025 at 1:33 PM Micah Kornfield 
> wrote:
>
> > >
> > > C++, Java and Rust support them for sure. I feel like we should
> > > probably default to V2 at some point.
> >
> >
> > I seem to recall, some of the vectorized java readers (Iceberg, Spark)
> > might not support V2 data pages (but I might be confusing this with
> > encodings).  But this is only a vague recollection.
> >
> >
> >
> > On Wed, Oct 22, 2025 at 6:38 AM Andrew Lamb 
> > wrote:
> >
> > > > Someone has to add V2 data pages to
> > > >
> > >
> > >
> >
> https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
> > > > :)
> > >
> > > Your wish is my command:
> https://github.com/apache/parquet-site/pull/124
> > >
> > > As the format grows in popularity and momentum builds to evolve, I feel
> > the
> > > content on the parquet.apache.org site could use refreshing /
> updating.
> > > So, while I had the site open, I made some other PRs to scratch various
> > > itches
> > >
> > > (I am absolutely 🎣 for someone to please review 🙏):
> > >
> > > 1. Add Variant/Geometry/Geography types to implementation status
> matrix:
> > > https://github.com/apache/parquet-site/pull/123
> > > 2. Improve introduction / overview, add more links to spec and
> > > implementation status: https://github.com/apache/parquet-site/pull/125
> > >
> > >
> > > Thanks,
> > > Andrew
> > >
> > > On Wed, Oct 22, 2025 at 4:09 AM Antoine Pitrou 
> > wrote:
> > >
> > > >
> > > > Hi Julien, hi all,
> > > >
> > > > On Mon, 20 Oct 2025 15:14:58 -0700
> > > > Julien Le Dem  wrote:
> > > > >
> > > > > Another question from me:
> > > > >
> > > > > Since the goal is to not use compression at all in this case (no
> > ZSTD)
> > > > > I'm assuming we would be using either:
> > > > > - the Data Page V1 with UNCOMPRESSED in the ColumnMetadata.column
> > > > > <
> > > >
> > >
> >
> https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L887
> > > > >
> > > > > field.
> > > > > - the Data Page V2 with false in the DataPageHeaderV2.is_compressed
> > > > > <
> > > >
> > >
> >
> https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L746
> > > > >
> > > > > field
> > > > > The second helping decide if we can selectively compress some pages
> > if
> > > > they
> > > > > are less compressed by the
> > > > > A few years ago there was a question on the support of the
> > DATA_PAGE_V2
> > > > and
> > > > > I was curious to hear a refresh on how that's generally supported
> in
> > > > > Parquet implementations. The is_compressed field was exactly
> intended
> > > to
> > > > > avoid block compression when the encoding itself is good enough.
> > > >
> > > > Someone has to add V2 data pages to
> > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
> > > > :)
> > > >
> > > > C++, Java and Rust support them for sure. I feel like we should
> > > > probably default to V2 at some point.
> > > >
> > > > Also see https://github.com/apache/parquet-java/issues/3344 for
> Java.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > >
> > > > > Julien
> > > > >
> > > > > On Mon, Oct 20, 2025 at 11:57 AM Andrew Lamb
> > > >  wrote:
> > > > >
> > > > > > Thanks again Prateek and co for pushing this along!
> > > > > >
> > > > > >
> > > > > > > 1. Design and write our own Parquet-ALP spec so that
> > > implementations
> > > > > > > know exactly how to encode and represent data
> > > > > >
> > > > > > 100% agree with this (similar to what was done for
> ParquetVariant)
> > > > > >
> > > > > > > 2. I may be missing something, but the paper doesn't seem to
> > > > mention
> > > > > > non-finite values (such as +/-Inf and NaNs).
> > > > > >
> > > > > > I think they are handled via the "Exception" mechanism. Vortex's
> > ALP
> > > > > > implementation (below) does appear to handle finite numbers[2]
> > > > > >
> > > > > > > 3. It seems there is a single implementation, which is the one
> > > > published
> > > > > > > together with the paper. It is not obvious that it will be
> > > > > > > maintained in the future, and reusing it is probably not an
> > option
> > > > for
> > > > > > > non-C++ Parquet implementations
> > > > > >
> > > > > > My understanding from the 

Re: [Parquet] ALP Encoding for Floating point data

2025-10-30 Thread Andrew Lamb
I have filed a ticket[1] in arrow-rs to track prototyping ALP in the Rust
Parquet reader if anyone is interested

Andrew

[1]:  https://github.com/apache/arrow-rs/issues/8748

On Wed, Oct 22, 2025 at 1:33 PM Micah Kornfield 
wrote:

> >
> > C++, Java and Rust support them for sure. I feel like we should
> > probably default to V2 at some point.
>
>
> I seem to recall, some of the vectorized java readers (Iceberg, Spark)
> might not support V2 data pages (but I might be confusing this with
> encodings).  But this is only a vague recollection.
>
>
>
> On Wed, Oct 22, 2025 at 6:38 AM Andrew Lamb 
> wrote:
>
> > > Someone has to add V2 data pages to
> > >
> >
> >
> https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
> > > :)
> >
> > Your wish is my command: https://github.com/apache/parquet-site/pull/124
> >
> > As the format grows in popularity and momentum builds to evolve, I feel
> the
> > content on the parquet.apache.org site could use refreshing / updating.
> > So, while I had the site open, I made some other PRs to scratch various
> > itches
> >
> > (I am absolutely 🎣 for someone to please review 🙏):
> >
> > 1. Add Variant/Geometry/Geography types to implementation status matrix:
> > https://github.com/apache/parquet-site/pull/123
> > 2. Improve introduction / overview, add more links to spec and
> > implementation status: https://github.com/apache/parquet-site/pull/125
> >
> >
> > Thanks,
> > Andrew
> >
> > On Wed, Oct 22, 2025 at 4:09 AM Antoine Pitrou 
> wrote:
> >
> > >
> > > Hi Julien, hi all,
> > >
> > > On Mon, 20 Oct 2025 15:14:58 -0700
> > > Julien Le Dem  wrote:
> > > >
> > > > Another question from me:
> > > >
> > > > Since the goal is to not use compression at all in this case (no
> ZSTD)
> > > > I'm assuming we would be using either:
> > > > - the Data Page V1 with UNCOMPRESSED in the ColumnMetadata.column
> > > > <
> > >
> >
> https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L887
> > > >
> > > > field.
> > > > - the Data Page V2 with false in the DataPageHeaderV2.is_compressed
> > > > <
> > >
> >
> https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L746
> > > >
> > > > field
> > > > The second helping decide if we can selectively compress some pages
> if
> > > they
> > > > are less compressed by the
> > > > A few years ago there was a question on the support of the
> DATA_PAGE_V2
> > > and
> > > > I was curious to hear a refresh on how that's generally supported in
> > > > Parquet implementations. The is_compressed field was exactly intended
> > to
> > > > avoid block compression when the encoding itself is good enough.
> > >
> > > Someone has to add V2 data pages to
> > >
> > >
> >
> https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
> > > :)
> > >
> > > C++, Java and Rust support them for sure. I feel like we should
> > > probably default to V2 at some point.
> > >
> > > Also see https://github.com/apache/parquet-java/issues/3344 for Java.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > >
> > > > Julien
> > > >
> > > > On Mon, Oct 20, 2025 at 11:57 AM Andrew Lamb
> > >  wrote:
> > > >
> > > > > Thanks again Prateek and co for pushing this along!
> > > > >
> > > > >
> > > > > > 1. Design and write our own Parquet-ALP spec so that
> > implementations
> > > > > > know exactly how to encode and represent data
> > > > >
> > > > > 100% agree with this (similar to what was done for ParquetVariant)
> > > > >
> > > > > > 2. I may be missing something, but the paper doesn't seem to
> > > mention
> > > > > non-finite values (such as +/-Inf and NaNs).
> > > > >
> > > > > I think they are handled via the "Exception" mechanism. Vortex's
> ALP
> > > > > implementation (below) does appear to handle finite numbers[2]
> > > > >
> > > > > > 3. It seems there is a single implementation, which is the one
> > > published
> > > > > > together with the paper. It is not obvious that it will be
> > > > > > maintained in the future, and reusing it is probably not an
> option
> > > for
> > > > > > non-C++ Parquet implementations
> > > > >
> > > > > My understanding from the call was that Prateek and team
> > re-implemented
> > > > > ALP  (did not use the implementation from CWI[3]) but that would be
> > > good to
> > > > > confirm.
> > > > >
> > > > > There is also a Rust implementation of ALP[1] that is part of the
> > > Vortex
> > > > > file format implementation. I have not reviewed it to see if it
> > > deviates
> > > > > from the algorithm presented in the paper.
> > > > >
> > > > > Andrew
> > > > >
> > > > > [1]:
> > > > >
> > > > >
> > >
> >
> https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/lib.rs
> > > > > [2]:
> > > > >
> > > > >
> > >
> >
> https://github.com/vortex-data/vortex/blob/534821969201

Re: [Parquet] ALP Encoding for Floating point data

2025-10-22 Thread Micah Kornfield
>
> C++, Java and Rust support them for sure. I feel like we should
> probably default to V2 at some point.


I seem to recall that some of the vectorized Java readers (Iceberg, Spark)
might not support V2 data pages (but I might be confusing this with
encodings). This is only a vague recollection, though.



On Wed, Oct 22, 2025 at 6:38 AM Andrew Lamb  wrote:

> > Someone has to add V2 data pages to
> >
>
> https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
> > :)
>
> Your wish is my command: https://github.com/apache/parquet-site/pull/124
>
> As the format grows in popularity and momentum builds to evolve, I feel the
> content on the parquet.apache.org site could use refreshing / updating.
> So, while I had the site open, I made some other PRs to scratch various
> itches
>
> (I am absolutely 🎣 for someone to please review 🙏):
>
> 1. Add Variant/Geometry/Geography types to implementation status matrix:
> https://github.com/apache/parquet-site/pull/123
> 2. Improve introduction / overview, add more links to spec and
> implementation status: https://github.com/apache/parquet-site/pull/125
>
>
> Thanks,
> Andrew
>
> On Wed, Oct 22, 2025 at 4:09 AM Antoine Pitrou  wrote:
>
> >
> > Hi Julien, hi all,
> >
> > On Mon, 20 Oct 2025 15:14:58 -0700
> > Julien Le Dem  wrote:
> > >
> > > Another question from me:
> > >
> > > Since the goal is to not use compression at all in this case (no ZSTD)
> > > I'm assuming we would be using either:
> > > - the Data Page V1 with UNCOMPRESSED in the ColumnMetadata.column
> > > <
> >
> https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L887
> > >
> > > field.
> > > - the Data Page V2 with false in the DataPageHeaderV2.is_compressed
> > > <
> >
> https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L746
> > >
> > > field
> > > The second option would help us selectively compress some pages if
> > > they are not compressed well enough by the encoding alone.
> > > A few years ago there was a question on the support of the DATA_PAGE_V2
> > and
> > > I was curious to hear a refresh on how that's generally supported in
> > > Parquet implementations. The is_compressed field was exactly intended
> to
> > > avoid block compression when the encoding itself is good enough.
> >
> > Someone has to add V2 data pages to
> >
> >
> https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
> > :)
> >
> > C++, Java and Rust support them for sure. I feel like we should
> > probably default to V2 at some point.
> >
> > Also see https://github.com/apache/parquet-java/issues/3344 for Java.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > >
> > > Julien
> > >
> > > On Mon, Oct 20, 2025 at 11:57 AM Andrew Lamb
> >  wrote:
> > >
> > > > Thanks again Prateek and co for pushing this along!
> > > >
> > > >
> > > > > 1. Design and write our own Parquet-ALP spec so that
> implementations
> > > > > know exactly how to encode and represent data
> > > >
> > > > 100% agree with this (similar to what was done for ParquetVariant)
> > > >
> > > > > 2. I may be missing something, but the paper doesn't seem to
> > mention
> > > > non-finite values (such as +/-Inf and NaNs).
> > > >
> > > > I think they are handled via the "Exception" mechanism. Vortex's ALP
> > > > implementation (below) does appear to handle non-finite numbers[2]
> > > >
> > > > > 3. It seems there is a single implementation, which is the one
> > published
> > > > > together with the paper. It is not obvious that it will be
> > > > > maintained in the future, and reusing it is probably not an option
> > for
> > > > > non-C++ Parquet implementations
> > > >
> > > > My understanding from the call was that Prateek and team
> re-implemented
> > > > ALP  (did not use the implementation from CWI[3]) but that would be
> > good to
> > > > confirm.
> > > >
> > > > There is also a Rust implementation of ALP[1] that is part of the
> > Vortex
> > > > file format implementation. I have not reviewed it to see if it
> > deviates
> > > > from the algorithm presented in the paper.
> > > >
> > > > Andrew
> > > >
> > > > [1]:
> > > >
> > > >
> >
> https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/lib.rs
> > > > [2]:
> > > >
> > > >
> >
> https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/alp/compress.rs#L266-L281
> > > > [3]: https://github.com/cwida/ALP
> > > >
> > > >
> > > > On Mon, Oct 20, 2025 at 4:47 AM Antoine Pitrou
> >  wrote:
> > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > Thanks for doing this and I agree the numbers look impressive.
> > > > >
> > > > > I would ask if possible for more data points:
> > > > >
> > > > > 1. More datasets: you could for example look at the datasets that
> > were
> > > > > used to originally evaluate BYTE_STREAM_SPLIT (see

Re: [Parquet] ALP Encoding for Floating point data

2025-10-22 Thread Andrew Lamb
> Someone has to add V2 data pages to
>
https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
> :)

Your wish is my command: https://github.com/apache/parquet-site/pull/124

As the format grows in popularity and momentum builds to evolve, I feel the
content on the parquet.apache.org site could use refreshing / updating.
So, while I had the site open, I made some other PRs to scratch various
itches

(I am absolutely 🎣 for someone to please review 🙏):

1. Add Variant/Geometry/Geography types to implementation status matrix:
https://github.com/apache/parquet-site/pull/123
2. Improve introduction / overview, add more links to spec and
implementation status: https://github.com/apache/parquet-site/pull/125


Thanks,
Andrew

On Wed, Oct 22, 2025 at 4:09 AM Antoine Pitrou  wrote:

>
> Hi Julien, hi all,
>
> On Mon, 20 Oct 2025 15:14:58 -0700
> Julien Le Dem  wrote:
> >
> > Another question from me:
> >
> > Since the goal is to not use compression at all in this case (no ZSTD)
> > I'm assuming we would be using either:
> > - the Data Page V1 with UNCOMPRESSED in the ColumnMetadata.column
> > <
> https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L887
> >
> > field.
> > - the Data Page V2 with false in the DataPageHeaderV2.is_compressed
> > <
> https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L746
> >
> > field
> > The second option would help us selectively compress some pages if
> > they are not compressed well enough by the encoding alone.
> > A few years ago there was a question on the support of the DATA_PAGE_V2
> and
> > I was curious to hear a refresh on how that's generally supported in
> > Parquet implementations. The is_compressed field was exactly intended to
> > avoid block compression when the encoding itself is good enough.
>
> Someone has to add V2 data pages to
>
> https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
> :)
>
> C++, Java and Rust support them for sure. I feel like we should
> probably default to V2 at some point.
>
> Also see https://github.com/apache/parquet-java/issues/3344 for Java.
>
> Regards
>
> Antoine.
>
>
> >
> > Julien
> >
> > On Mon, Oct 20, 2025 at 11:57 AM Andrew Lamb
>  wrote:
> >
> > > Thanks again Prateek and co for pushing this along!
> > >
> > >
> > > > 1. Design and write our own Parquet-ALP spec so that implementations
> > > > know exactly how to encode and represent data
> > >
> > > 100% agree with this (similar to what was done for ParquetVariant)
> > >
> > > > 2. I may be missing something, but the paper doesn't seem to
> mention
> > > non-finite values (such as +/-Inf and NaNs).
> > >
> > > I think they are handled via the "Exception" mechanism. Vortex's ALP
> > > implementation (below) does appear to handle non-finite numbers[2]
> > >
> > > > 3. It seems there is a single implementation, which is the one
> published
> > > > together with the paper. It is not obvious that it will be
> > > > maintained in the future, and reusing it is probably not an option
> for
> > > > non-C++ Parquet implementations
> > >
> > > My understanding from the call was that Prateek and team re-implemented
> > > ALP  (did not use the implementation from CWI[3]) but that would be
> good to
> > > confirm.
> > >
> > > There is also a Rust implementation of ALP[1] that is part of the
> Vortex
> > > file format implementation. I have not reviewed it to see if it
> deviates
> > > from the algorithm presented in the paper.
> > >
> > > Andrew
> > >
> > > [1]:
> > >
> > >
> https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/lib.rs
> > > [2]:
> > >
> > >
> https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/alp/compress.rs#L266-L281
> > > [3]: https://github.com/cwida/ALP
> > >
> > >
> > > On Mon, Oct 20, 2025 at 4:47 AM Antoine Pitrou
>  wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > Thanks for doing this and I agree the numbers look impressive.
> > > >
> > > > I would ask if possible for more data points:
> > > >
> > > > 1. More datasets: you could for example look at the datasets that
> were
> > > > used to originally evaluate BYTE_STREAM_SPLIT (see
> > > > https://issues.apache.org/jira/browse/PARQUET-1622 and specifically
> > > > the Google Doc linked there)
> > > >
> > > > 2. Comparison to BYTE_STREAM_SPLIT + LZ4 and BYTE_STREAM_SPLIT + ZSTD
> > > >
> > > > 3. Optionally, some perf numbers on x86 too, but I expect that ALP
> will
> > > > remain very good there as well
> > > >
> > > >
> > > > I also have the following reservations towards ALP:
> > > >
> > > > 1. There is no published official spec AFAICT, just a research paper.
> > > >
> > > > 2. I may be missing something, but the paper doesn't seem to mention
> > > > non-finite values (such as +/-Inf and NaNs).
> > >

Re: [Parquet] ALP Encoding for Floating point data

2025-10-22 Thread Antoine Pitrou


Hi Julien, hi all,

On Mon, 20 Oct 2025 15:14:58 -0700
Julien Le Dem  wrote:
> 
> Another question from me:
> 
> Since the goal is to not use compression at all in this case (no ZSTD)
> I'm assuming we would be using either:
> - the Data Page V1 with UNCOMPRESSED in the ColumnMetadata.column
> 
> field.
> - the Data Page V2 with false in the DataPageHeaderV2.is_compressed
> 
> field
> The second option would help us selectively compress some pages if they
> are not compressed well enough by the encoding alone.
> A few years ago there was a question on the support of the DATA_PAGE_V2 and
> I was curious to hear a refresh on how that's generally supported in
> Parquet implementations. The is_compressed field was exactly intended to
> avoid block compression when the encoding itself is good enough.

Someone has to add V2 data pages to
https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
:)

C++, Java and Rust support them for sure. I feel like we should
probably default to V2 at some point.

Also see https://github.com/apache/parquet-java/issues/3344 for Java.

Regards

Antoine.


> 
> Julien
> 
> On Mon, Oct 20, 2025 at 11:57 AM Andrew Lamb 
>  wrote:
> 
> > Thanks again Prateek and co for pushing this along!
> >
> >  
> > > 1. Design and write our own Parquet-ALP spec so that implementations
> > > know exactly how to encode and represent data  
> >
> > 100% agree with this (similar to what was done for ParquetVariant)
> >  
> > > 2. I may be missing something, but the paper doesn't seem to mention  
> > non-finite values (such as +/-Inf and NaNs).
> >
> > I think they are handled via the "Exception" mechanism. Vortex's ALP
> > implementation (below) does appear to handle non-finite numbers[2]
> >  
> > > 3. It seems there is a single implementation, which is the one published
> > > together with the paper. It is not obvious that it will be
> > > maintained in the future, and reusing it is probably not an option for
> > > non-C++ Parquet implementations  
> >
> > My understanding from the call was that Prateek and team re-implemented
> > ALP  (did not use the implementation from CWI[3]) but that would be good to
> > confirm.
> >
> > There is also a Rust implementation of ALP[1] that is part of the Vortex
> > file format implementation. I have not reviewed it to see if it deviates
> > from the algorithm presented in the paper.
> >
> > Andrew
> >
> > [1]:
> >
> > https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/lib.rs
> > [2]:
> >
> > https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/alp/compress.rs#L266-L281
> > [3]: https://github.com/cwida/ALP
> >
> >
> > On Mon, Oct 20, 2025 at 4:47 AM Antoine Pitrou 
> >  wrote:
> >  
> > >
> > > Hello,
> > >
> > > Thanks for doing this and I agree the numbers look impressive.
> > >
> > > I would ask if possible for more data points:
> > >
> > > 1. More datasets: you could for example look at the datasets that were
> > > used to originally evaluate BYTE_STREAM_SPLIT (see
> > > https://issues.apache.org/jira/browse/PARQUET-1622 and specifically
> > > the Google Doc linked there)
> > >
> > > 2. Comparison to BYTE_STREAM_SPLIT + LZ4 and BYTE_STREAM_SPLIT + ZSTD
> > >
> > > 3. Optionally, some perf numbers on x86 too, but I expect that ALP will
> > > remain very good there as well
> > >
> > >
> > > I also have the following reservations towards ALP:
> > >
> > > 1. There is no published official spec AFAICT, just a research paper.
> > >
> > > 2. I may be missing something, but the paper doesn't seem to mention
> > > non-finite values (such as +/-Inf and NaNs).
> > >
> > > 3. It seems there is a single implementation, which is the one published
> > > together with the paper. It is not obvious that it will be
> > > maintained in the future, and reusing it is probably not an option for
> > > non-C++ Parquet implementations
> > >
> > > 4. The encoding itself is complex, since it involves a fallback on
> > > another encoding if the primary encoding (which constitutes the real
> > > innovation) doesn't work out on a piece of data.
> > >
> > >
> > > Based on this, I would say that if we think ALP is attractive for us,
> > > we may want to incorporate our own version of ALP with the following
> > > changes:
> > >
> > > 1. Design and write our own Parquet-ALP spec so that implementations
> > > know exactly how to encode and represent data
> > >
> > > 2. Do not include the ALPrd fallback which is a homegrown dictionary
> > > encoding without dictionary reuse across pages, and instead rely on a
> > > well-known Parquet encoding (such as BYTE_STREAM_SPLIT?)
> > >
> > > 3. Replace the FOR encoding inside ALP, which aims at compressing
> > > integers efficiently, with our own DELTA_BINARY_PACKED (which has the
> > > same qualities and is already available in Parquet implementations)

Re: [Parquet] ALP Encoding for Floating point data

2025-10-21 Thread Robert Kruszewski
The ALP implementation in Vortex logically behaves the same; however, since we
support stacked encodings we don't need to handle the full compression flow, 
just the Float -> Integer transformation that ALP offers.

The paper indeed doesn't mention how special float values are handled but the 
logic exists in the published C++ implementation. Vortex leverages differences 
in cast semantics between Rust and C++ to make encoding faster 
https://spiraldb.com/post/alp-rust-is-faster-than-c. 
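[Editorial aside] To make the "Float -> Integer transformation" above concrete, here is a hedged Python sketch of the ALP scheme as the paper describes it: values are scaled by a per-vector exponent/factor pair (e, f), and anything that does not round-trip exactly (including NaN and +/-Inf) is recorded as a positional exception. Function names and the list-based layout are illustrative, not taken from any shipping implementation.

```python
# Hedged sketch of ALP's float -> integer transform; names, (e, f) handling,
# and the list layout are illustrative, not any shipping implementation.
def alp_encode(values, e, f):
    """Scale floats to integers; values that do not round-trip exactly
    (including NaN and +/-Inf) become positional exceptions."""
    digits, exceptions = [], []
    for i, v in enumerate(values):
        try:
            d = round(v * 10.0 ** e / 10.0 ** f)       # decimal-scaled integer
            ok = (d * 10.0 ** f / 10.0 ** e) == v      # exact round-trip check
        except (ValueError, OverflowError):            # round() on NaN / +-Inf
            d, ok = 0, False
        if not ok:
            exceptions.append((i, v))                  # patched back on decode
            d = 0                                      # placeholder digit
        digits.append(d)
    return digits, exceptions

def alp_decode(digits, exceptions, e, f):
    out = [d * 10.0 ** f / 10.0 ** e for d in digits]
    for i, v in exceptions:
        out[i] = v                                     # restore exceptions
    return out
```

The integer `digits` stream would then be handed to an integer scheme (FOR plus bit-packing in the paper), which is exactly the stacking point the message above describes.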

On Mon, 20 Oct 2025, at 19:56, Andrew Lamb wrote:
> Thanks again Prateek and co for pushing this along!
> 
> 
> > 1. Design and write our own Parquet-ALP spec so that implementations
> > know exactly how to encode and represent data
> 
> 100% agree with this (similar to what was done for ParquetVariant)
> 
> > 2. I may be missing something, but the paper doesn't seem to mention
> non-finite values (such as +/-Inf and NaNs).
> 
> I think they are handled via the "Exception" mechanism. Vortex's ALP
> implementation (below) does appear to handle non-finite numbers[2]
> 
> > 3. It seems there is a single implementation, which is the one published
> > together with the paper. It is not obvious that it will be
> > maintained in the future, and reusing it is probably not an option for
> > non-C++ Parquet implementations
> 
> My understanding from the call was that Prateek and team re-implemented
> ALP  (did not use the implementation from CWI[3]) but that would be good to
> confirm.
> 
> There is also a Rust implementation of ALP[1] that is part of the Vortex
> file format implementation. I have not reviewed it to see if it deviates
> from the algorithm presented in the paper.
> 
> Andrew
> 
> [1]:
> https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/lib.rs
> [2]:
> https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/alp/compress.rs#L266-L281
> [3]: https://github.com/cwida/ALP
> 
> 
> On Mon, Oct 20, 2025 at 4:47 AM Antoine Pitrou  wrote:
> 
> >
> > Hello,
> >
> > Thanks for doing this and I agree the numbers look impressive.
> >
> > I would ask if possible for more data points:
> >
> > 1. More datasets: you could for example look at the datasets that were
> > used to originally evalute BYTE_STREAM_SPLIT (see
> > https://issues.apache.org/jira/browse/PARQUET-1622 and specifically
> > the Google Doc linked there)
> >
> > 2. Comparison to BYTE_STREAM_SPLIT + LZ4 and BYTE_STREAM_SPLIT + ZSTD
> >
> > 3. Optionally, some perf numbers on x86 too, but I expect that ALP will
> > remain very good there as well
> >
> >
> > I also have the following reservations towards ALP:
> >
> > 1. There is no published official spec AFAICT, just a research paper.
> >
> > 2. I may be missing something, but the paper doesn't seem to mention
> > non-finite values (such as +/-Inf and NaNs).
> >
> > 3. It seems there is a single implementation, which is the one published
> > together with the paper. It is not obvious that it will be
> > maintained in the future, and reusing it is probably not an option for
> > non-C++ Parquet implementations
> >
> > 4. The encoding itself is complex, since it involves a fallback on
> > another encoding if the primary encoding (which constitutes the real
> > innovation) doesn't work out on a piece of data.
> >
> >
> > Based on this, I would say that if we think ALP is attractive for us,
> > we may want to incorporate our own version of ALP with the following
> > changes:
> >
> > 1. Design and write our own Parquet-ALP spec so that implementations
> > know exactly how to encode and represent data
> >
> > 2. Do not include the ALPrd fallback which is a homegrown dictionary
> > encoding without dictionary reuse across pages, and instead rely on a
> > well-known Parquet encoding (such as BYTE_STREAM_SPLIT?)
> >
> > 3. Replace the FOR encoding inside ALP, which aims at compressing
> > integers efficiently, with our own DELTA_BINARY_PACKED (which has the
> > same qualities and is already available in Parquet implementations)
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
> > On Thu, 16 Oct 2025 14:47:33 -0700
> > PRATEEK GAUR  wrote:
> > > Hi team,
> > >
> > > We spent some time evaluating ALP compression and decompression compared
> > to
> > > other encoding alternatives like CHIMP/GORILLA and compression techniques
> > > like SNAPPY/LZ4/ZSTD. We presented these numbers to the community members
> > > on October 15th in the biweekly parquet meeting. ( I can't seem to access
> > > the recording, so please let me know what access rules I need to get to
> > be
> > > able to view it )
> > >
> > > We did this evaluation over some datasets pointed to by the ALP paper and
> > > some pointed to by the parquet community.
> > >
> > > The results are available in the following document
> > > <
> > https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg/edit?tab=t.0
> > >
> > > :
> > >
> > https://docs.google.com/document/d

Re: [Parquet] ALP Encoding for Floating point data

2025-10-20 Thread PRATEEK GAUR
Thanks everyone.

I'm gonna try and answer as many questions as I can in this reply.


   - [Antoine] More datasets : Thanks we'll look into this.
   - [Antoine + Adam] ByteStreamSplit + ZSTD : Added another column to this
   sheet with this info.
   - [Antoine] Numbers on x86 : Ack, i'll run the same experiment on an
   intel machine (will try to get to it before EOW)


More queries

   - [Antoine + Andew + Julien] Official spec : Yes I'll try and start one
   and then I'm gonna request help from everyone to make it perfect :).
   - [Antoine + Andrew] Non-finite values : One tags them as exceptions.
   - [Antoine] ALPrd -> ByteStreamSplit : Actually I really like this idea.
   I did not see a major perf difference or compression ratio improvement when
   one has to use the ALPrd route. What do you think @andrew.lamb, @Julien
   Le Dem , @[email protected]
, @[email protected]
   , @adam.reeve
   - [Antoine] FOR -> DELTA_BINARY_PACKED : So DELTA is significantly
   slower than FOR based on numbers I have seen. FOR is very simple and can be
   further improved in decompression speed. I like the FOR approach.
   - [Julien] BtrBlocks : I really like the flexibility that this brings in
   but we'll have to come up with a good spec for it to be broadly applicable.
   We can discuss more.
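[Editorial aside] For readers weighing the FOR-vs-DELTA bullet above, here is a minimal frame-of-reference (FOR) sketch; real implementations bit-pack the offsets, which is elided here, and the names are illustrative, not the Parquet spec.

```python
# Minimal frame-of-reference (FOR) sketch; bit-packing of offsets is elided.
def for_encode(block):
    ref = min(block)                         # single per-block reference
    offsets = [v - ref for v in block]       # all offsets are >= 0
    width = max(offsets).bit_length()        # bits a packer would use per value
    return ref, width, offsets

def for_decode(ref, offsets):
    # One independent add per value: trivially vectorizable, unlike the
    # sequential prefix sum a delta decoder needs.
    return [ref + o for o in offsets]
```

The independent-add decode path is why FOR tends to be faster than delta coding, at the cost of being sensitive to a single outlier widening the whole block.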


Best
Prateek

On Mon, Oct 20, 2025 at 3:15 PM Julien Le Dem  wrote:

> >
> >
> > > 1. Design and write our own Parquet-ALP spec so that implementations
> > > know exactly how to encode and represent data
> >
> > 100% agree with this (similar to what was done for ParquetVariant)
> >
> Seconded!
>
> > 4. The encoding itself is complex, since it involves a fallback on
> > another encoding if the primary encoding (which constitutes the real
> > innovation) doesn't work out on a piece of data.
>
> We had a discussion on how to layer/chain/stack encodings in the call, à la
> BtrBlocks. We could have a general mechanism to clarify how to reuse an
> encoding and make it generally flexible to compose encodings in ways that
> would be a bit more flexible than the current way (instead of
> being constrained  to an enum for the encoding, we could allow a bit more
> metadata). For example it has been discussed to use FastLanes to store the
> integers produced by ALP (the current prototype uses bitpacking). I
> understand, this is what Vortex does.
>
> >
> >
> >
>
> Another question from me:
>
> Since the goal is to not use compression at all in this case (no ZSTD)
> I'm assuming we would be using either:
> - the Data Page V1 with UNCOMPRESSED in the ColumnMetadata.column
> <
> https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L887
> >
> field.
> - the Data Page V2 with false in the DataPageHeaderV2.is_compressed
> <
> https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L746
> >
> field
> The second option would help us selectively compress some pages if they
> are not compressed well enough by the encoding alone.
> A few years ago there was a question on the support of the DATA_PAGE_V2 and
> I was curious to hear a refresh on how that's generally supported in
> Parquet implementations. The is_compressed field was exactly intended to
> avoid block compression when the encoding itself is good enough.
>
> Julien
>
> On Mon, Oct 20, 2025 at 11:57 AM Andrew Lamb 
> wrote:
>
> > Thanks again Prateek and co for pushing this along!
> >
> >
> > > 1. Design and write our own Parquet-ALP spec so that implementations
> > > know exactly how to encode and represent data
> >
> > 100% agree with this (similar to what was done for ParquetVariant)
> >
> > > 2. I may be missing something, but the paper doesn't seem to mention
> > non-finite values (such as +/-Inf and NaNs).
> >
> > I think they are handled via the "Exception" mechanism. Vortex's ALP
> > implementation (below) does appear to handle non-finite numbers[2]
> >
> > > 3. It seems there is a single implementation, which is the one
> published
> > > together with the paper. It is not obvious that it will be
> > > maintained in the future, and reusing it is probably not an option for
> > > non-C++ Parquet implementations
> >
> > My understanding from the call was that Prateek and team re-implemented
> > ALP  (did not use the implementation from CWI[3]) but that would be good
> to
> > confirm.
> >
> > There is also a Rust implementation of ALP[1] that is part of the Vortex
> > file format implementation. I have not reviewed it to see if it deviates
> > from the algorithm presented in the paper.
> >
> > Andrew
> >
> > [1]:
> >
> >
> https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/lib.rs
> > [2]:
> >
> >
> https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/alp/com

Re: [Parquet] ALP Encoding for Floating point data

2025-10-20 Thread Julien Le Dem
>
>
> > 1. Design and write our own Parquet-ALP spec so that implementations
> > know exactly how to encode and represent data
>
> 100% agree with this (similar to what was done for ParquetVariant)
>
Seconded!

> 4. The encoding itself is complex, since it involves a fallback on
> another encoding if the primary encoding (which constitutes the real
> innovation) doesn't work out on a piece of data.

We had a discussion on how to layer/chain/stack encodings in the call, à la
BtrBlocks. We could have a general mechanism to clarify how to reuse an
encoding and make it generally flexible to compose encodings in ways that
would be a bit more flexible than the current way (instead of
being constrained  to an enum for the encoding, we could allow a bit more
metadata). For example it has been discussed to use FastLanes to store the
integers produced by ALP (the current prototype uses bitpacking). I
understand, this is what Vortex does.

>
>
>

Another question from me:

Since the goal is to not use compression at all in this case (no ZSTD)
I'm assuming we would be using either:
- the Data Page V1 with UNCOMPRESSED in the ColumnMetadata.column

field.
- the Data Page V2 with false in the DataPageHeaderV2.is_compressed

field
The second option would help us selectively compress some pages if they
are not compressed well enough by the encoding alone.
A few years ago there was a question on the support of the DATA_PAGE_V2 and
I was curious to hear a refresh on how that's generally supported in
Parquet implementations. The is_compressed field was exactly intended to
avoid block compression when the encoding itself is good enough.
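[Editorial aside] A sketch of how a writer might use the DataPageHeaderV2.is_compressed flag discussed above: compress the encoded page speculatively and keep the compressed bytes only when they are actually smaller. Here zlib stands in for the column's configured codec, and the 1.05 threshold is an assumption for illustration, not anything from the Parquet spec.

```python
import zlib  # stand-in codec; a real writer would use the column's codec

def maybe_compress_page(encoded: bytes, min_ratio: float = 1.05):
    """Return (page_bytes, is_compressed) for a V2 data page.

    The 1.05 minimum ratio is an illustrative threshold: only pay for
    block compression when it shrinks the encoded page appreciably.
    """
    candidate = zlib.compress(encoded)
    if len(encoded) >= min_ratio * len(candidate):
        return candidate, True
    return encoded, False   # the encoding alone was good enough
```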

Julien

On Mon, Oct 20, 2025 at 11:57 AM Andrew Lamb  wrote:

> Thanks again Prateek and co for pushing this along!
>
>
> > 1. Design and write our own Parquet-ALP spec so that implementations
> > know exactly how to encode and represent data
>
> 100% agree with this (similar to what was done for ParquetVariant)
>
> > 2. I may be missing something, but the paper doesn't seem to mention
> non-finite values (such as +/-Inf and NaNs).
>
> I think they are handled via the "Exception" mechanism. Vortex's ALP
> implementation (below) does appear to handle non-finite numbers[2]
>
> > 3. It seems there is a single implementation, which is the one published
> > together with the paper. It is not obvious that it will be
> > maintained in the future, and reusing it is probably not an option for
> > non-C++ Parquet implementations
>
> My understanding from the call was that Prateek and team re-implemented
> ALP  (did not use the implementation from CWI[3]) but that would be good to
> confirm.
>
> There is also a Rust implementation of ALP[1] that is part of the Vortex
> file format implementation. I have not reviewed it to see if it deviates
> from the algorithm presented in the paper.
>
> Andrew
>
> [1]:
>
> https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/lib.rs
> [2]:
>
> https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/alp/compress.rs#L266-L281
> [3]: https://github.com/cwida/ALP
>
>
> On Mon, Oct 20, 2025 at 4:47 AM Antoine Pitrou  wrote:
>
> >
> > Hello,
> >
> > Thanks for doing this and I agree the numbers look impressive.
> >
> > I would ask if possible for more data points:
> >
> > 1. More datasets: you could for example look at the datasets that were
> > used to originally evaluate BYTE_STREAM_SPLIT (see
> > https://issues.apache.org/jira/browse/PARQUET-1622 and specifically
> > the Google Doc linked there)
> >
> > 2. Comparison to BYTE_STREAM_SPLIT + LZ4 and BYTE_STREAM_SPLIT + ZSTD
> >
> > 3. Optionally, some perf numbers on x86 too, but I expect that ALP will
> > remain very good there as well
> >
> >
> > I also have the following reservations towards ALP:
> >
> > 1. There is no published official spec AFAICT, just a research paper.
> >
> > 2. I may be missing something, but the paper doesn't seem to mention
> > non-finite values (such as +/-Inf and NaNs).
> >
> > 3. It seems there is a single implementation, which is the one published
> > together with the paper. It is not obvious that it will be
> > maintained in the future, and reusing it is probably not an option for
> > non-C++ Parquet implementations
> >
> > 4. The encoding itself is complex, since it involves a fallback on
> > another encoding if the primary encoding (which constitutes the real
> > innovation) doesn't work out on a piece of data.
> >
> >
> > Based on this, I would say that if we think ALP is attractive for us,
> > we may want to incorporate our own version of ALP with the following
> > changes:
> >
> > 1. Design and write our own Parquet-ALP spec so that implementations
> > know exactly how to encode and represent data

Re: [Parquet] ALP Encoding for Floating point data

2025-10-20 Thread Andrew Lamb
Thanks again Prateek and co for pushing this along!


> 1. Design and write our own Parquet-ALP spec so that implementations
> know exactly how to encode and represent data

100% agree with this (similar to what was done for ParquetVariant)

> 2. I may be missing something, but the paper doesn't seem to mention
non-finite values (such as +/-Inf and NaNs).

I think they are handled via the "Exception" mechanism. Vortex's ALP
implementation (below) does appear to handle non-finite numbers[2]

> 3. It seems there is a single implementation, which is the one published
> together with the paper. It is not obvious that it will be
> maintained in the future, and reusing it is probably not an option for
> non-C++ Parquet implementations

My understanding from the call was that Prateek and team re-implemented
ALP  (did not use the implementation from CWI[3]) but that would be good to
confirm.

There is also a Rust implementation of ALP[1] that is part of the Vortex
file format implementation. I have not reviewed it to see if it deviates
from the algorithm presented in the paper.

Andrew

[1]:
https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/lib.rs
[2]:
https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/alp/compress.rs#L266-L281
[3]: https://github.com/cwida/ALP


On Mon, Oct 20, 2025 at 4:47 AM Antoine Pitrou  wrote:

>
> Hello,
>
> Thanks for doing this and I agree the numbers look impressive.
>
> I would ask if possible for more data points:
>
> 1. More datasets: you could for example look at the datasets that were
> used to originally evaluate BYTE_STREAM_SPLIT (see
> https://issues.apache.org/jira/browse/PARQUET-1622 and specifically
> the Google Doc linked there)
>
> 2. Comparison to BYTE_STREAM_SPLIT + LZ4 and BYTE_STREAM_SPLIT + ZSTD
>
> 3. Optionally, some perf numbers on x86 too, but I expect that ALP will
> remain very good there as well
>
>
> I also have the following reservations towards ALP:
>
> 1. There is no published official spec AFAICT, just a research paper.
>
> 2. I may be missing something, but the paper doesn't seem to mention
> non-finite values (such as +/-Inf and NaNs).
>
> 3. It seems there is a single implementation, which is the one published
> together with the paper. It is not obvious that it will be
> maintained in the future, and reusing it is probably not an option for
> non-C++ Parquet implementations
>
> 4. The encoding itself is complex, since it involves a fallback on
> another encoding if the primary encoding (which constitutes the real
> innovation) doesn't work out on a piece of data.
>
>
> Based on this, I would say that if we think ALP is attractive for us,
> we may want to incorporate our own version of ALP with the following
> changes:
>
> 1. Design and write our own Parquet-ALP spec so that implementations
> know exactly how to encode and represent data
>
> 2. Do not include the ALPrd fallback which is a homegrown dictionary
> encoding without dictionary reuse across pages, and instead rely on a
> well-known Parquet encoding (such as BYTE_STREAM_SPLIT?)
>
> 3. Replace the FOR encoding inside ALP, which aims at compressing
> integers efficiently, with our own DELTA_BINARY_PACKED (which has the
> same qualities and is already available in Parquet implementations)
>
> Regards
>
> Antoine.
>
>
>
> On Thu, 16 Oct 2025 14:47:33 -0700
> PRATEEK GAUR  wrote:
> > Hi team,
> >
> > We spent some time evaluating ALP compression and decompression compared
> to
> > other encoding alternatives like CHIMP/GORILLA and compression techniques
> > like SNAPPY/LZ4/ZSTD. We presented these numbers to the community members
> > on October 15th in the biweekly parquet meeting. ( I can't seem to access
> > the recording, so please let me know what access rules I need to get to
> be
> > able to view it )
> >
> > We did this evaluation over some datasets pointed to by the ALP paper
> > and some pointed to by the parquet community.
> >
> > The results are available in the following document
> > <
> https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg/edit?tab=t.0
> >
> > :
> >
> https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
> >
> > Based on the numbers we see
> >
> >-  ALP is comparable to ZSTD(level=1) in terms of compression ratio
> and
> >much better compared to other schemes. (numbers in the sheet are bytes
> >needed to encode each value )
>    - ALP doing quite well in terms of decompression speed (numbers in the
> >sheet are bytes decompressed per second)
> >
> > As next steps we will
> >
> >- Get the numbers for compression on top of byte stream split.
> >- Evaluate the algorithm over a few more datasets.
> >- Have an implementation in the arrow-parquet repo.
> >
> > Looking forward to feedback from the community.
> >
> > Best
> > Prateek and Dhirhan
> >
>
>
>
>


Re: [Parquet] ALP Encoding for Floating point data

2025-10-20 Thread Antoine Pitrou


Hello,

Thanks for doing this and I agree the numbers look impressive.

If possible, I would ask for some more data points:

1. More datasets: you could for example look at the datasets that were
used to originally evaluate BYTE_STREAM_SPLIT (see
https://issues.apache.org/jira/browse/PARQUET-1622 and specifically
the Google Doc linked there)
 
2. Comparison to BYTE_STREAM_SPLIT + LZ4 and BYTE_STREAM_SPLIT + ZSTD

3. Optionally, some perf numbers on x86 too, but I expect that ALP will
remain very good there as well
   
   
I also have the following reservations about ALP:

1. There is no published official spec AFAICT, just a research paper.

2. I may be missing something, but the paper doesn't seem to mention
non-finite values (such as +/-Inf and NaNs).

3. It seems there is a single implementation, which is the one published
together with the paper. It is not obvious that it will be
maintained in the future, and reusing it is probably not an option for
non-C++ Parquet implementations
   
4. The encoding itself is complex, since it involves a fallback on
another encoding if the primary encoding (which constitutes the real
innovation) doesn't work out on a piece of data.
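
For readers without the paper at hand, the core idea behind points 2 and 4
can be sketched in a few lines. This is a simplified illustration, not the
reference implementation: the per-vector exponent/factor search, the
vectorized round-trip check, and the bit-packing are all omitted, and the
exception handling shown here (which naturally covers +/-Inf and NaN) is one
plausible reading of the paper rather than a specified behavior:

```python
import math

# Simplified sketch of ALP's primary scheme: encode each double as
# round(v * 10^e / 10^f) and record a (position, value) exception for
# anything that does not round-trip exactly -- which is also one way
# to handle non-finite values such as +/-Inf and NaN.

def alp_encode(values, e, f):
    encoded, exceptions = [], []
    for pos, v in enumerate(values):
        scaled = v * 10.0 ** e / 10.0 ** f
        if math.isfinite(scaled):
            d = round(scaled)
            if d * 10.0 ** f / 10.0 ** e == v:  # exact round-trip?
                encoded.append(int(d))
                continue
        exceptions.append((pos, v))  # Inf, NaN, or lossy value
        encoded.append(0)            # placeholder, patched on decode
    return encoded, exceptions

def alp_decode(encoded, exceptions, e, f):
    out = [d * 10.0 ** f / 10.0 ** e for d in encoded]
    for pos, v in exceptions:        # patch exceptions back in
        out[pos] = v
    return out
```

Any spec we write would have to pin down exactly this kind of detail: how
exceptions are laid out, and what the placeholder values are.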


Based on this, I would say that if we think ALP is attractive for us,
we may want to incorporate our own version of ALP with the following
changes:

1. Design and write our own Parquet-ALP spec so that implementations
know exactly how to encode and represent data

2. Do not include the ALPrd fallback which is a homegrown dictionary
encoding without dictionary reuse across pages, and instead rely on a
well-known Parquet encoding (such as BYTE_STREAM_SPLIT?)

3. Replace the FOR encoding inside ALP, which aims at compressing
integers efficiently, with our own DELTA_BINARY_PACKED (which has the
same qualities and is already available in Parquet implementations)
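
To make that comparison concrete, here is a hedged sketch of the two integer
schemes: frame-of-reference (FOR) as used inside ALP, and plain deltas as the
starting point of DELTA_BINARY_PACKED. The zigzag encoding, miniblocks, and
actual bit-packing of the real Parquet encoding are omitted:

```python
def for_encode(ints):
    # Frame of reference: store a base (the minimum) plus small
    # non-negative offsets, each of which fits in `width` bits.
    base = min(ints)
    offsets = [v - base for v in ints]
    width = max(offsets).bit_length() or 1  # bits per packed offset
    return base, width, offsets

def delta_encode(ints):
    # Delta: store the first value plus successive differences.
    # DELTA_BINARY_PACKED then zigzags and bit-packs these deltas
    # in miniblocks.
    return ints[0], [b - a for a, b in zip(ints, ints[1:])]
```

Both reduce the magnitude of the stored integers; FOR preserves per-value
random access, while deltas tend to win on smoothly varying data.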

Regards

Antoine.



On Thu, 16 Oct 2025 14:47:33 -0700
PRATEEK GAUR  wrote:
> Hi team,
> 
> We spent some time evaluating ALP compression and decompression compared to
> other encoding alternatives like CHIMP/GORILLA and compression techniques
> like SNAPPY/LZ4/ZSTD. We presented these numbers to the community members
> on October 15th in the biweekly parquet meeting. ( I can't seem to access
> the recording, so please let me know what access rules I need to get to be
> able to view it )
> 
> We did this evaluation over some datasets pointed to by the ALP paper and
> some pointed to by the parquet community.
> 
> The results are available in the following document
> 
> :
> https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
> 
> Based on the numbers we see
> 
>-  ALP is comparable to ZSTD(level=1) in terms of compression ratio and
>much better compared to other schemes. (numbers in the sheet are bytes
>needed to encode each value )
>    - ALP doing quite well in terms of decompression speed (numbers in the
>sheet are bytes decompressed per second)
> 
> As next steps we will
> 
>- Get the numbers for compression on top of byte stream split.
>- Evaluate the algorithm over a few more datasets.
>- Have an implementation in the arrow-parquet repo.
> 
> Looking forward to feedback from the community.
> 
> Best
> Prateek and Dhirhan
> 





Re: [Parquet] ALP Encoding for Floating point data

2025-10-17 Thread PRATEEK GAUR
Thanks Adrian.

Yes, that is absolutely correct. Being able to do filter pushdowns will
really help ALP (and a few other schemes) over block compression schemes
like ZSTD. That is an added plus of ALP over ZSTD, in addition to its better
decompression speed.
And I agree that in most cases decompression speed is given more weight than
compression speed.

On Thu, Oct 16, 2025 at 5:53 PM Adrian Garcia Badaracco
 wrote:

> Thank you for sharing that. Very interesting. I do think decompression
> speed is generally more important than compression speed. Another thing to
> consider is the possibility of operating on the compressed data e.g. for
> filtering: zstd data for example has to be decompressed before any
> filtering, arithmetic, etc. can be done. I believe at least filtering could
> be done on some of these other encodings. Apologies if this was discussed
> in the meeting already.
>
> > On Oct 16, 2025, at 4:47 PM, PRATEEK GAUR  wrote:
> >
> > Hi team,
> >
> > We spent some time evaluating ALP compression and decompression compared
> to
> > other encoding alternatives like CHIMP/GORILLA and compression techniques
> > like SNAPPY/LZ4/ZSTD. We presented these numbers to the community members
> > on October 15th in the biweekly parquet meeting. ( I can't seem to access
> > the recording, so please let me know what access rules I need to get to
> be
> > able to view it )
> >
> > We did this evaluation over some datasets pointed to by the ALP paper
> > and some pointed to by the parquet community.
> >
> > The results are available in the following document
> > <
> https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg/edit?tab=t.0
> >
> > :
> >
> https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
> >
> > Based on the numbers we see
> >
> >   -  ALP is comparable to ZSTD(level=1) in terms of compression ratio and
> >   much better compared to other schemes. (numbers in the sheet are bytes
> >   needed to encode each value )
> >   - ALP doing quite well in terms of decompression speed (numbers in the
> >   sheet are bytes decompressed per second)
> >
> > As next steps we will
> >
> >   - Get the numbers for compression on top of byte stream split.
> >   - Evaluate the algorithm over a few more datasets.
> >   - Have an implementation in the arrow-parquet repo.
> >
> > Looking forward to feedback from the community.
> >
> > Best
> > Prateek and Dhirhan
>
>


Re: [Parquet] ALP Encoding for Floating point data

2025-10-17 Thread Adrian Garcia Badaracco
Thank you for sharing that. Very interesting. I do think decompression speed is
generally more important than compression speed. Another thing to consider is
the possibility of operating on the compressed data, e.g. for filtering: zstd
data has to be decompressed before any filtering, arithmetic, etc. can be
done. I believe at least filtering could be done on some of these other
encodings. Apologies if this was discussed in the meeting already.
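
As an illustration of what such pushdown could look like on an ALP-style
encoding (a hypothetical helper, not any Parquet API): because the encoding
stores decimal-scaled integers, a float predicate can be mapped into the
integer domain once and then evaluated without decoding each value. Exception
positions would still need a decoded check, which this sketch ignores:

```python
import math

def filter_gt(encoded, threshold, e, f):
    # Codes represent v ~= d * 10^f / 10^e, so "v > threshold" becomes
    # "d > floor(threshold * 10^e / 10^f)" for round-tripping values.
    # The threshold is scaled once; per-value work is an integer compare
    # instead of a float decode.
    t = math.floor(threshold * 10.0 ** e / 10.0 ** f)
    return [i for i, d in enumerate(encoded) if d > t]
```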

> On Oct 16, 2025, at 4:47 PM, PRATEEK GAUR  wrote:
> 
> Hi team,
> 
> We spent some time evaluating ALP compression and decompression compared to
> other encoding alternatives like CHIMP/GORILLA and compression techniques
> like SNAPPY/LZ4/ZSTD. We presented these numbers to the community members
> on October 15th in the biweekly parquet meeting. ( I can't seem to access
> the recording, so please let me know what access rules I need to get to be
> able to view it )
> 
> We did this evaluation over some datasets pointed to by the ALP paper and
> some pointed to by the parquet community.
> 
> The results are available in the following document
> 
> :
> https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
> 
> Based on the numbers we see
> 
>   -  ALP is comparable to ZSTD(level=1) in terms of compression ratio and
>   much better compared to other schemes. (numbers in the sheet are bytes
>   needed to encode each value )
>   - ALP doing quite well in terms of decompression speed (numbers in the
>   sheet are bytes decompressed per second)
> 
> As next steps we will
> 
>   - Get the numbers for compression on top of byte stream split.
>   - Evaluate the algorithm over a few more datasets.
>   - Have an implementation in the arrow-parquet repo.
> 
> Looking forward to feedback from the community.
> 
> Best
> Prateek and Dhirhan


