Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-06 Thread Wes McKinney
On Mon, Jul 6, 2020 at 11:08 AM Antoine Pitrou wrote: > > > On 06/07/2020 at 17:57, Steve Kim wrote: > > The Parquet format specification is ambiguous about the exact details of > > LZ4 compression. However, the *de facto* reference implementation in Java > > (parquet-mr) uses the Hadoop LZ4 cod

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-06 Thread Steve Kim
> Would that keep compatibility with existing files produced by Parquet C++? Changing the lz4 implementation to be compatible with parquet-mr/hadoop would break compatibility with any existing files that were written by Parquet C++ using lz4 compression. I believe that it is not possible to reliab

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-06 Thread Antoine Pitrou
On 06/07/2020 at 17:57, Steve Kim wrote: > The Parquet format specification is ambiguous about the exact details of > LZ4 compression. However, the *de facto* reference implementation in Java > (parquet-mr) uses the Hadoop LZ4 codec. > > I think that it is important for Parquet C++ to have com

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-06 Thread Steve Kim
The Parquet format specification is ambiguous about the exact details of LZ4 compression. However, the *de facto* reference implementation in Java (parquet-mr) uses the Hadoop LZ4 codec. I think that it is important for Parquet C++ to have compatibility and feature parity with parquet-mr when poss
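
For context on the framing difference this message refers to: to my knowledge, the Hadoop LZ4 codec does not write a bare LZ4 block. It prefixes the compressed data with two big-endian 32-bit integers (decompressed size, then compressed size), and a buffer may hold several such frames back to back. The sketch below only illustrates that prefix layout; it performs no LZ4 compression, and the function names are hypothetical, not taken from parquet-mr or Parquet C++.

```python
import struct

# Assumed Hadoop-style prefix: two big-endian uint32 values,
# decompressed size followed by compressed size.
HEADER = struct.Struct(">II")


def wrap_hadoop_style(raw_block: bytes, decompressed_size: int) -> bytes:
    """Prefix an already-compressed LZ4 block with the 8-byte header."""
    return HEADER.pack(decompressed_size, len(raw_block)) + raw_block


def split_hadoop_frames(buf: bytes):
    """Split buf into (decompressed_size, raw_block) pairs.

    Returns None when buf is empty or does not parse as a whole number
    of frames, which suggests it is not Hadoop-framed.
    """
    frames = []
    pos = 0
    while pos < len(buf):
        if len(buf) - pos < HEADER.size:
            return None  # trailing bytes too short to hold a header
        decompressed_size, compressed_size = HEADER.unpack_from(buf, pos)
        pos += HEADER.size
        if compressed_size > len(buf) - pos:
            return None  # header claims more data than is present
        frames.append((decompressed_size, buf[pos:pos + compressed_size]))
        pos += compressed_size
    return frames or None
```

A check like this is only a heuristic: a bare LZ4 block could, by coincidence, begin with bytes that parse as a plausible header, so a reader built on it can still misclassify a file.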

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-04 Thread Antoine Pitrou
I don't have a sense of how conservative Parquet users generally are. Is it worth adding an LZ4_FRAMED compression option in the Parquet format, or would people just not use it? Regards Antoine. On Tue, 30 Jun 2020 14:33:17 +0200 "Uwe L. Korn" wrote: > I'm also in favor of disabling support f

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-06-30 Thread Uwe L. Korn
I'm also in favor of disabling support for now. Having to deal with broken files or the detection of various incompatible implementations in the long term will do more harm than not supporting LZ4 for a while. Snappy is generally more widely used than LZ4 in this category, as it has been available since th

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-06-29 Thread Wes McKinney
On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou wrote: > > > On 25/06/2020 at 00:02, Wes McKinney wrote: > > hi folks, > > > > (cross-posting to dev@arrow and dev@parquet since there are > > stakeholders in both places) > > > > It seems there are still problems at least with the C++ implementatio

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-06-25 Thread Antoine Pitrou
On 25/06/2020 at 00:02, Wes McKinney wrote: > hi folks, > > (cross-posting to dev@arrow and dev@parquet since there are > stakeholders in both places) > > It seems there are still problems at least with the C++ implementation > of LZ4 compression in Parquet files > > https://issues.apache.or

[DISCUSS] Ongoing LZ4 problems with Parquet files

2020-06-24 Thread Wes McKinney
hi folks, (cross-posting to dev@arrow and dev@parquet since there are stakeholders in both places) It seems there are still problems at least with the C++ implementation of LZ4 compression in Parquet files https://issues.apache.org/jira/browse/PARQUET-1241 https://issues.apache.org/jira/browse/P