Re: Interest in Parquet V3

2024-05-14 Thread Martin Loncaric
I think Parquet's metadata and encoding/compression setup are problematic, but I don't see a reason to make Parquet V3 if it's just going to be another BtrBlocks or Nimble look-alike. Some people in the thread have expressed the view that Parquet's metadata is fine, and that people can achieve

Re: Pitch for Pcodec Encoding in Parquet

2024-01-30 Thread Martin Loncaric
and-egg problem, but given > that Parquet files are used for long-term storage (not just transient > data), it's probably not a good idea to be an early adopter here. > > And of course, if the encoding was simpler, points 2 and 3 wouldn't > really hurt. > > This is just my o

Re: Pitch for Pcodec Encoding in Parquet

2024-01-13 Thread Martin Loncaric
> concern. > > I can't say there wouldn't be other technical blockers but at least this > would be someplace to start? > > Cheers, > Micah > > On Thu, Jan 11, 2024 at 7:21 PM Martin Loncaric > wrote: > > > (Oops, the repeating binary decimal is 1100... with peri

Re: Pitch for Pcodec Encoding in Parquet

2024-01-11 Thread Martin Loncaric
(Oops, the repeating binary decimal is 1100... with period 4, so exactly 2 bits of entropy for the 52 mantissa bits. The argument is the same though.) On Thu, Jan 11, 2024 at 10:02 PM Martin Loncaric wrote: > To reach a conclusion on this thread, I understand the overall sentiment > as:

Re: Pitch for Pcodec Encoding in Parquet

2024-01-11 Thread Martin Loncaric
in the future. On Thu, Jan 11, 2024 at 9:47 PM Martin Loncaric wrote: > I must admit I'm a bit surprised by these results. The first thing is >> that the Pcodec results were actually obtained using dictionary >> encoding. Then I don't understand what is Pcodec-encoded: the dicti

Re: Pitch for Pcodec Encoding in Parquet

2024-01-11 Thread Martin Loncaric
ties (<2 bits of entropy) for the last 6+ bytes of each number. BYTE_STREAM_SPLIT throws this away, requiring 6+ times as many bits for them. On Mon, Jan 8, 2024 at 10:44 AM Antoine Pitrou wrote: > > Hello Martin, > > On Sat, 6 Jan 2024 17:09:07 -0500 > Martin Loncaric > wr

Re: [Format] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY

2024-01-08 Thread Martin Loncaric
+1 that this is beneficial, especially for 16 bit floats On Mon, Jan 8, 2024, 11:56 Antoine Pitrou wrote: > > Hello all, > > Based on the response received, it seems this addition is > non-controversial and generally considered beneficial. > > What should be the way forward? Should I submit a

Re: Pitch for Pcodec Encoding in Parquet

2024-01-06 Thread Martin Loncaric
> maintenance, but Arrow/Parquet C++ build system is particularly complex > due > > to the wide range of systems it targets). > > 2. IMO it is important to consider downstream users in these decisions > as > > well. > > > > Another thing that could help

Re: Pitch for Pcodec Encoding in Parquet

2024-01-05 Thread Martin Loncaric
lo, > > It would be very interesting to expand the comparison against > BYTE_STREAM_SPLIT + compression. > > See https://issues.apache.org/jira/browse/PARQUET-2414 for a proposal > to extend the range of types supporting BYTE_STREAM_SPLIT. > > Regards > > Antoine. > > >

Re: Pitch for Pcodec Encoding in Parquet

2024-01-03 Thread Martin Loncaric
> [1] https://lists.apache.org/thread/z8fnoq3lm5t67rfz74fwzj5qytzyy4gv > > On Tue, Jan 2, 2024 at 9:10 PM Martin Loncaric > wrote: > > > I'd like to propose and get feedback on a new encoding for numerical > > columns: pco. I just did a blog post demonstrating how this would pe

Re: Pitch for Pcodec Encoding in Parquet

2024-01-03 Thread Martin Loncaric
14 AM wish maple wrote: > Hi Martin, > > Parquet has "Compression" and "Encoding" parts. So, this new > method is a part of integer/float-point encoding, but also doing some > compression workload? > > Best, > Xuwei Fu > > Martin Loncaric 于202

Pitch for Pcodec Encoding in Parquet

2024-01-02 Thread Martin Loncaric
I'd like to propose and get feedback on a new encoding for numerical columns: pco. I just did a blog post demonstrating how this would perform on various real-world datasets. TL;DR: pco losslessly achieves much better compression

[jira] [Updated] (PARQUET-2132) Support Quantile Compression q_compress column codec

2022-02-28 Thread Martin Loncaric (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Loncaric updated PARQUET-2132: - Description: Quantile Compression (https://github.com/mwlon/quantile-compression

[jira] [Created] (PARQUET-2132) Support Quantile Compression q_compress column codec

2022-02-28 Thread Martin Loncaric (Jira)
Martin Loncaric created PARQUET-2132: Summary: Support Quantile Compression q_compress column codec Key: PARQUET-2132 URL: https://issues.apache.org/jira/browse/PARQUET-2132 Project: Parquet

[jira] [Updated] (PARQUET-2132) Support Quantile Compression q_compress column codec

2022-02-28 Thread Martin Loncaric (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Loncaric updated PARQUET-2132: - Description: Quantile Compression (https://github.com/mwlon/quantile-compression