Re: Pitch for Pcodec Encoding in Parquet

2024-01-30 Thread Martin Loncaric
Gang: thanks for all the good info. I've iterated a bit on the spec document now. To answer your question, > ... it seems to have the potential to achieve good compression ratio on > integers having common suffix... Absolutely. And it's somewhat robust when a small fraction of numbers break the

Re: Pitch for Pcodec Encoding in Parquet

2024-01-15 Thread Antoine Pitrou
My personal sentiment is: not only its newness, but the fact that it is 1) highly non-trivial (it seems much more complicated than all other Parquet encodings); 2) maintained by a single person, both the spec and the implementation (please correct me if I'm wrong?); and 3) has little to no

Re: Pitch for Pcodec Encoding in Parquet

2024-01-13 Thread Micah Kornfield
> > - CMIW, it seems to have the potential to achieve good compression ratio on >integers having common suffix, e.g. decimal(10,2) values that all have > .00 This is an interesting point. It looks like the algorithm already does difference-in-difference encoding, Martin do you have a sense of
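To illustrate the common-suffix point: decimal(10,2) values are typically stored as unscaled integers, so values ending in .00 are multiples of 100. The following is a minimal sketch of why that helps, not pco's actual algorithm. The function names and the fixed `base=100` are hypothetical and chosen only for illustration.

```python
def split_by_base(values, base=100):
    """Split each integer into (quotient, remainder) with respect to base.

    If most values share the suffix ".00", the remainders are almost all
    zero and the quotients span a much smaller range than the originals.
    """
    quotients = [v // base for v in values]
    remainders = [v % base for v in values]
    return quotients, remainders


def join_by_base(quotients, remainders, base=100):
    """Inverse of split_by_base: rebuild the original integers losslessly."""
    return [q * base + r for q, r in zip(quotients, remainders)]


# Unscaled decimal(10,2) values: 123.00, 45.00, 999.00 and one outlier, 7.01.
values = [12300, 4500, 99900, 701]
quotients, remainders = split_by_base(values)
```

A single value that breaks the pattern (701 here) costs only a nonzero remainder, and the rest of the column still benefits.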

Re: Pitch for Pcodec Encoding in Parquet

2024-01-13 Thread Micah Kornfield
Hi Martin, I agree with Gang's point about tAns. I opened up an issue against the pcodec repo enumerating some items that I think could be improved for someone trying to implement this from the spec [1]. We can maybe use that as a centralized place to discuss any understanding/clarity issues (I

Re: Pitch for Pcodec Encoding in Parquet

2024-01-13 Thread Gang Wu
Hi Martin, Sorry for chiming in late. I have just read your blog post and the format specs. Below are my two cents: - The PCO spec is a good starting point with a good explanation of the format definition. For people unfamiliar with the background, it would be good to also include the

Re: Pitch for Pcodec Encoding in Parquet

2024-01-13 Thread Martin Loncaric
Micah: I've added a format doc now: https://github.com/mwlon/pcodec/blob/main/docs/format.md. Would appreciate any feedback or thoughts on it. On Thu, Jan 11, 2024 at 11:47 PM Micah Kornfield wrote: > > > > Pco could technically work as a Parquet encoding, but people are wary of > > its newness

Re: Pitch for Pcodec Encoding in Parquet

2024-01-11 Thread Micah Kornfield
> > Pco could technically work as a Parquet encoding, but people are wary of > its newness and weak FFI support. It seems there is no immediate action to > take, but would be worthwhile to consider this again further in the future. I guess I'm more optimistic on the potential gaps. I think if

Re: Pitch for Pcodec Encoding in Parquet

2024-01-11 Thread Martin Loncaric
(Oops, the repeating binary decimal is 1100... with period 4, so exactly 2 bits of entropy for the 52 mantissa bits. The argument is the same though.) On Thu, Jan 11, 2024 at 10:02 PM Martin Loncaric wrote: > To reach a conclusion on this thread, I understand the overall sentiment > as: > > Pco
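The repeating mantissa pattern can be checked directly. This is a small sketch that assumes the values under discussion are short decimal fractions such as 0.1 or 0.3; the truncated message does not say which. The helper name `mantissa_bits` is hypothetical.

```python
import struct


def mantissa_bits(x: float) -> str:
    """Return the 52 explicit mantissa bits of an IEEE 754 double as a string."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return format(bits & ((1 << 52) - 1), "052b")


# Each mantissa is some rotation of the period-4 pattern 1100, so the only
# real information is the phase (one of 4 alignments, i.e. 2 bits), apart
# from rounding in the last few bits.
for x in (0.1, 0.3, 0.7, 0.9):
    print(x, mantissa_bits(x))
```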

Re: Pitch for Pcodec Encoding in Parquet

2024-01-11 Thread Martin Loncaric
To reach a conclusion on this thread, I understand the overall sentiment as: Pco could technically work as a Parquet encoding, but people are wary of its newness and weak FFI support. It seems there is no immediate action to take, but would be worthwhile to consider this again further in the

Re: Pitch for Pcodec Encoding in Parquet

2024-01-11 Thread Martin Loncaric
> > I must admit I'm a bit surprised by these results. The first thing is > that the Pcodec results were actually obtained using dictionary > encoding. Then I don't understand what is Pcodec-encoded: the dictionary > values or the dictionary indices? No, pco cannot be dictionary encoded; it only

Re: Pitch for Pcodec Encoding in Parquet

2024-01-08 Thread Antoine Pitrou
Hello Martin, On Sat, 6 Jan 2024 17:09:07 -0500 Martin Loncaric wrote: > > > > It would be very interesting to expand the comparison against > > BYTE_STREAM_SPLIT + compression. > > Antoine: I created one now, at the bottom of the post >

Re: Pitch for Pcodec Encoding in Parquet

2024-01-06 Thread Martin Loncaric
> > It would be very interesting to expand the comparison against > BYTE_STREAM_SPLIT + compression. Antoine: I created one now, at the bottom of the post. In this case, BYTE_STREAM_SPLIT did worse. parquet-mr is currently a pure

Re: Pitch for Pcodec Encoding in Parquet

2024-01-05 Thread Will Jones
> Another thing that could help with adoption here is if pcodec had a > specification document (apologies if I missed it), that would allow others > to more easily port it. > +1 to this. The encodings right now are described by a spec, rather than some specific library. I think if we wanted

Re: Pitch for Pcodec Encoding in Parquet

2024-01-05 Thread Micah Kornfield
> > I don't believe Apache has any restriction against Rust. We are not > collectively beholden to any other organization's restrictions, are we? It is correct that Apache does not have any restrictions. The point is mostly about: 1. Even if there is no restriction, maintainers of Apache

Re: Pitch for Pcodec Encoding in Parquet

2024-01-05 Thread Jan Finis
IMHO, any implementation relying on JNI on Java is a non-starter. The Java ecosystem strongly prefers pure Java libraries over libraries with native components. parquet-mr is currently a pure Java library. Making it a mixed library with native libraries and JNI would be such a maintenance disaster,

Re: Pitch for Pcodec Encoding in Parquet

2024-01-05 Thread Martin Loncaric
I would make the comparison to byte_stream_split immediately, filtering down to only float columns, but looks like it's the one encoding not supported by arrow-rs. Seeing if I can get this merged in: https://github.com/apache/arrow-rs/pull/4183. In the meantime I'll see if I can do a

Re: Pitch for Pcodec Encoding in Parquet

2024-01-05 Thread Antoine Pitrou
Hello, It would be very interesting to expand the comparison against BYTE_STREAM_SPLIT + compression. See https://issues.apache.org/jira/browse/PARQUET-2414 for a proposal to extend the range of types supporting BYTE_STREAM_SPLIT. Regards Antoine. On Wed, 3 Jan 2024 00:10:14 -0500 Martin
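For readers unfamiliar with BYTE_STREAM_SPLIT, the idea is to regroup the bytes of fixed-width values so that byte 0 of every value comes first, then byte 1, and so on, which tends to make a general-purpose compressor more effective on floats. Below is a minimal sketch of that transform, not the Parquet implementation. The function names and the `fmt` parameter are assumptions for illustration.

```python
import struct


def byte_stream_split(values, fmt="<f"):
    """Scatter the k-th byte of every value into the k-th stream, then
    concatenate the streams."""
    width = struct.calcsize(fmt)
    raw = b"".join(struct.pack(fmt, v) for v in values)
    return b"".join(raw[i::width] for i in range(width))


def byte_stream_join(data, fmt="<f"):
    """Inverse transform: interleave the streams back into whole values."""
    width = struct.calcsize(fmt)
    n = len(data) // width
    streams = [data[i * n:(i + 1) * n] for i in range(width)]
    raw = bytes(b for group in zip(*streams) for b in group)
    return [struct.unpack_from(fmt, raw, k * width)[0] for k in range(n)]


encoded = byte_stream_split([1.0, 2.0, 3.0])
decoded = byte_stream_join(encoded)
```

The transform itself does not shrink the data; any size reduction comes from the compressor applied afterwards.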

Re: Pitch for Pcodec Encoding in Parquet

2024-01-04 Thread Micah Kornfield
> > Not every organization uses Rust, but Rust > programs are known to be much safer than C/C++. I agree, but I think this is irrelevant to the point of building from source (expanded upon below) and maintainers of parquet don't necessarily have strong influence on all toolchain decisions their

Re: Pitch for Pcodec Encoding in Parquet

2024-01-04 Thread Steve Loughran
On Wed, 3 Jan 2024 at 05:10, Martin Loncaric wrote: > I'd like to propose and get feedback on a new encoding for numerical > columns: pco. I just did a blog post demonstrating how this would perform > on various real-world datasets >

Re: Pitch for Pcodec Encoding in Parquet

2024-01-03 Thread Martin Loncaric
Hi Micah, I think point 2 is valid. Pcodec has a small but growing community of users. I'm not sure where the bar is for age or establishment, but it's reasonable if pcodec doesn't meet it yet. I think point 1 is misguided. Not every organization uses Rust, but Rust programs are known to be much

Re: Pitch for Pcodec Encoding in Parquet

2024-01-03 Thread Micah Kornfield
Hi Martin, The results are impressive. However I'll point you to a recent prior discussion on a proposed new encoding/compression technique [1]. While this seems to avoid the lossiness concerns, there are also suggested benchmarks

Re: Pitch for Pcodec Encoding in Parquet

2024-01-03 Thread Martin Loncaric
Yep. And doing some compression during encoding isn't really a new thing. For instance, on the air quality dataset, "uncompressed" Parquet gets a compression ratio of about 3.6. Existing encodings sometimes take deltas or varints to reduce integer data size. On Wed, Jan 3, 2024 at 12:14 AM wish
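As a rough illustration of how deltas plus varints shrink integer data before any compression codec runs, here is a minimal sketch. It is a generic delta + zigzag + LEB128 scheme, not the exact byte layout of any Parquet encoding, and all function names are hypothetical.

```python
def zigzag(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3.
    Assumes n fits in a signed 64-bit range."""
    return (n << 1) ^ (n >> 63)


def uleb128(n: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups, low group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more groups follow
        else:
            out.append(byte)
            return bytes(out)


def delta_varint_encode(values):
    """Store each value as the varint of its zigzagged difference from the previous one."""
    out = bytearray()
    prev = 0
    for v in values:
        out += uleb128(zigzag(v - prev))
        prev = v
    return bytes(out)


def delta_varint_decode(data):
    """Inverse of delta_varint_encode."""
    values, prev, acc, shift = [], 0, 0, 0
    for b in data:
        acc |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            prev += (acc >> 1) ^ -(acc & 1)  # undo zigzag, then undo delta
            values.append(prev)
            acc, shift = 0, 0
    return values


values = [1000, 1001, 1003, 1002, 1010]
encoded = delta_varint_encode(values)
```

Five values that would take 40 bytes as plain 64-bit integers fit in 6 bytes here, which is the kind of reduction that shows up as a ratio above 1 for "uncompressed" files.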

Re: Pitch for Pcodec Encoding in Parquet

2024-01-02 Thread wish maple
Hi Martin, Parquet has "Compression" and "Encoding" parts. So, is this new method part of integer/floating-point encoding, while also doing some compression work? Best, Xuwei Fu On Wed, Jan 3, 2024 at 13:10, Martin Loncaric wrote: > I'd like to propose and get feedback on a new encoding for numerical >

Pitch for Pcodec Encoding in Parquet

2024-01-02 Thread Martin Loncaric
I'd like to propose and get feedback on a new encoding for numerical columns: pco. I just did a blog post demonstrating how this would perform on various real-world datasets. TL;DR: pco losslessly achieves much better compression