Floating point data compression for Apache Parquet

2019-06-12 Thread Radev, Martin
Dear all, thank you for your work on the Apache Parquet format. We are a group of students at the Technical University of Munich who would like to extend the available compression and encoding options for 32-bit and 64-bit floating point data in Apache Parquet. The current encodings and

Re: Floating point data compression for Apache Parquet

2019-07-05 Thread Radev, Martin
, Martin From: Zoltan Ivanfi Sent: Friday, July 5, 2019 4:21:39 PM To: Radev, Martin Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman Subject: Re: Floating point data compression for Apache Parquet Hi Martin, Thanks for the explanations, makes sense. Nice work

Re: Floating point data compression for Apache Parquet

2019-07-11 Thread Radev, Martin
for adding the "BYTE_STREAM_SPLIT" encoding to parquet-cpp within Apache Arrow. How should we proceed? It would be great to get feedback from other community members. Regards, Martin ____ From: Radev, Martin Sent: Tuesday, July 9, 2019 1:01:25 AM

Re: Floating point data compression for Apache Parquet

2019-07-08 Thread Radev, Martin
From: Zoltan Ivanfi Sent: Monday, July 8, 2019 5:06:30 PM To: Radev, Martin Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman Subject: Re: Floating point data compression for Apache Parquet Hi Martin, I agree that bs_zstd would be a good place to start. Regarding the choice

Re: Floating point data compression for Apache Parquet

2019-07-03 Thread Radev, Martin
es though it might be somewhat more complex and I have not really thought about this case. Regards, Martin From: Zoltan Ivanfi Sent: Wednesday, July 3, 2019 6:07:50 PM To: Parquet Dev; Radev, Martin Cc: Raoofy, Amir; Karlstetter, Roman Subject: Re: Floating point da

Re: Floating point data compression for Apache Parquet

2019-07-01 Thread Radev, Martin
n 12, 2019 at 11:10 PM Radev, Martin wrote: > > Dear all, > > thank you for your work on the Apache Parquet format. > > We are a group of students at the Technical University of Munich who would > like to extend the available compression and encoding options for 32-bit an

[VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

2019-08-27 Thread Radev, Martin
Dear all, there was some earlier discussion on adding a new encoding for better compression of FP32 and FP64 data. The pull request which extends the format is here: https://github.com/apache/parquet-format/pull/144 The change has one approval from earlier from Zoltan. The results from an

Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

2019-09-03 Thread Radev, Martin
of the PMC. On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin wrote: > Dear all, > > > there was some earlier discussion on adding a new encoding for better > compression of FP32 and FP64 data. > > > The pull request which extends the format is here: > https://github.com/apa

Re: [DISCUSS][JAVA][C++] Add a new floating-point Encoding and/or Compression algorithm to Parquet Format

2019-08-23 Thread Radev, Martin
the archives of this mailing list. Regards, Gabor On Thu, Jul 25, 2019 at 12:56 PM Radev, Martin wrote: > Dear all, > > > how should we proceed with this proposal? > > > Would somebody like to offer feedback on the new encoding, change of > specification, and patc

Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

2019-09-12 Thread Radev, Martin
> > On Tue, Sep 3, 2019, 2:17 PM Radev, Martin wrote: > > > Hello all, > > > > > > thank you Julien for the interest. > > > > > > Could other people, part of Apache Parquet, share their opinions? > > > > Do you have your own data whi

Comparing combinations of encodings and compression algorithms using Apache Parquet

2019-07-18 Thread Radev, Martin
Dear all, I am interested in comparing the available encodings and compression algorithms in Parquet using the parquet-mr project. The metrics I would like to collect are compression ratio and compression/decompression speed. Is there an available project which does something similar which I

[DISCUSS][JAVA][C++] Add a new floating-point Encoding and/or Compression algorithm to Parquet Format

2019-07-23 Thread Radev, Martin
Dear Apache Parquet Devs, I would like to make a proposal for extending the Apache Parquet specification by adding a better encoding for FP data which improves compression ratio and also to raise the question of adding a lossy compression algorithm for FP data. Contents: 1. Problem: FP data

Re: [DISCUSS][JAVA][C++] Add a new floating-point Encoding and/or Compression algorithm to Parquet Format

2019-07-25 Thread Radev, Martin
to help me drive this? Regards, Martin From: Radev, Martin Sent: Tuesday, July 23, 2019 8:22:43 PM To: dev@parquet.apache.org Cc: Zoltan Ivanfi; wesmck...@gmail.com; fo...@driesprong.frl; heue...@gmail.com; Karlstetter, Roman; Raoofy, Amir Subject: [DISCUSS][JAVA][C

Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

2019-09-19 Thread Radev, Martin
Hello Ryan, we decided that it would be beneficial to try out your proposal. I will look into it and provide measurements on the compression ratio and speed. Regards, Martin From: Ryan Blue Sent: Saturday, September 14, 2019 2:23:20 AM To: Radev, Martin Cc

Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

2019-11-02 Thread Radev, Martin
what code and what settings to use. > > On Thu, Oct 31, 2019 at 3:51 AM Radev, Martin wrote: > > > > Dear all, > > > > > > would there be any interest in reviewing the BYTE_STREAM_SPLIT encoding? > > > > Please feel free to contact me di

Re: custom CompressionCodec support

2019-10-21 Thread Radev, Martin
Hello Manik, If the compression level is really propagated to the library, what compression levels did you check? Regards, Martin From: Manik Singla Sent: Monday, October 21, 2019 10:11:36 PM To: Parquet Dev Cc: fa...@sumologic.com; Radev, Martin Subject

Re: custom CompressionCodec support

2019-10-17 Thread Radev, Martin
Hi Falak, I was one of the people who recently exposed this to Arrow but this is not part of the Parquet specification. In particular, any implementation for writing parquet files can decide whether to expose this or select a reasonable value internally. If you're using Arrow, you would

Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

2019-11-29 Thread Radev, Martin
at 11:22 PM Wes McKinney wrote: > > > +1 from me on adding the FP encoding > > > > On Sat, Nov 2, 2019 at 4:51 AM Radev, Martin wrote: > > > > > > Hello all, > > > > > > > > > thanks for the vote Ryan and to Wes for the fee

Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

2019-10-31 Thread Radev, Martin
and fp64 values. My early experiments also show that this encoding+zstd performs better on average than any of the specialized floating-point lossless compressors like fpc, spdp, zfp. Regards, Martin From: Radev, Martin Sent: Thursday, October 10, 2019 2:34:15

Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

2019-10-10 Thread Radev, Martin
file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing Regards, Martin From: Ryan Blue Sent: Thursday, September 19, 2019 7:54 PM To: Radev, Martin Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache P

Re: Provide pluggable APIs to support user customized compression codec

2020-03-04 Thread Radev, Martin
Hi Xin, thanks for the interest in extending Parquet. I suppose this is only about the Parquet Writer/Reader implementation, not about changes to the Parquet specification. I would like to know whether offloading the task of compressing/decompressing some data is really beneficial

Re: Allow users to fine-tune parquet writing

2020-02-04 Thread Radev, Martin
Dear all, in our project of using Parquet for streaming fp data with various entropy, we definitely needed to treat the columns differently. For fp data with low entropy, dictionary encoding provided good results. For fp data with entropy >15 bits per element, the newly added encoding + zstd
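
The snippets above name the BYTE_STREAM_SPLIT encoding but do not show its byte layout. The following is a minimal sketch of the idea as described in the Parquet format documentation: byte k of every value is scattered into stream k, and the streams are concatenated. The function names are hypothetical, the little-endian `struct` format strings are an assumption for illustration, and the subsequent compression step (for example zstd) is left out.

```python
import struct


def byte_stream_split_encode(values, fmt="<f"):
    """Scatter byte k of every value into stream k, then concatenate the streams.

    fmt is a struct format for one value: "<f" for FP32, "<d" for FP64 (assumed here).
    """
    width = struct.calcsize(fmt)
    raw = b"".join(struct.pack(fmt, v) for v in values)
    # raw[k::width] selects byte k of each value, i.e. one stream.
    return b"".join(raw[k::width] for k in range(width))


def byte_stream_split_decode(data, fmt="<f"):
    """Inverse transform: interleave the streams back into whole values."""
    width = struct.calcsize(fmt)
    count = len(data) // width
    out = bytearray(len(data))
    for k in range(width):
        # Stream k occupies the k-th contiguous chunk of `count` bytes.
        out[k::width] = data[k * count:(k + 1) * count]
    return [struct.unpack_from(fmt, out, i * width)[0] for i in range(count)]


values = [1.0, 2.0, 3.5]
encoded = byte_stream_split_encode(values)
decoded = byte_stream_split_decode(encoded)
```

The transform does not shrink the data by itself. It only groups bytes of the same significance together, which tends to give a general-purpose compressor longer runs of similar bytes to work with.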