Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-06 Thread Wes McKinney
On Mon, Jul 6, 2020 at 11:08 AM Antoine Pitrou  wrote:
>
>
> On 06/07/2020 at 17:57, Steve Kim wrote:
> > The Parquet format specification is ambiguous about the exact details of
> > LZ4 compression. However, the *de facto* reference implementation in Java
> > (parquet-mr) uses the Hadoop LZ4 codec.
> >
> > I think that it is important for Parquet C++ to have compatibility and
> > feature parity with parquet-mr when possible. I prefer to change the
> > LZ4 implementation in Parquet C++ to match the Hadoop LZ4 implementation
> > that is used by parquet-mr (
> > https://issues.apache.org/jira/browse/PARQUET-1878). I think that this
> > change will be quick and easy. I have an intern under my supervision who is
> > available to work on it full time, starting immediately. Please let me know
> > if we ought to proceed.
>
> Would that keep compatibility with existing files produced by Parquet C++?

Given that LZ4 has been constantly broken in C++ (first using the raw
format, then the block format, and apparently still incompatible), I
think we should recommend that, in the rare event that people have
LZ4-compressed files (probably uncommon; FWIW, Snappy is what is mostly
used), they rewrite their files with a different codec using
e.g. pyarrow 0.17.1.
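
A minimal sketch of such a rewrite, assuming the installed pyarrow version
can still read the existing LZ4 files (0.17.1 is the version suggested
above). The function name and the file paths are hypothetical; the only
library calls used are pyarrow.parquet.read_table and write_table with its
compression argument.

```python
def rewrite_with_codec(src_path, dst_path, codec="snappy"):
    """Read a Parquet file and write it back out with a different codec."""
    # Imported lazily so the helper can be defined without pyarrow installed.
    import pyarrow.parquet as pq

    # Decompression relies on the codec recorded in the file metadata, so
    # this only works if this pyarrow build can decode the existing pages.
    table = pq.read_table(src_path)

    # Re-encode every column with the replacement codec.
    pq.write_table(table, dst_path, compression=codec)
    return dst_path


# Hypothetical usage:
# rewrite_with_codec("events_lz4.parquet", "events_snappy.parquet")
```

Note that this loads the whole file into memory, so very large files may
need to be processed row group by row group instead.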

> Regards
>
> Antoine.


Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-06 Thread Steve Kim
> Would that keep compatibility with existing files produced by Parquet C++?

Changing the lz4 implementation to be compatible with parquet-mr/hadoop
would break compatibility with any existing files that were written by
Parquet C++ using lz4 compression. I believe that it is not possible to
reliably detect, from inspection of the first few bytes, which
implementation variant was used by the writer. But I could be misinformed,
as I do not have expert knowledge of LZ4 compression.
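
To illustrate why any such detection could only be a heuristic, here is a
rough sketch. It assumes (this is not stated in the thread) that the Hadoop
codec prefixes each compressed block with two big-endian 32-bit integers,
the decompressed size and the compressed size, whereas a raw LZ4 block has
no header at all. The function name is hypothetical.

```python
import struct


def looks_like_hadoop_lz4(buf: bytes) -> bool:
    """Heuristic: does buf parse as a sequence of Hadoop-style LZ4 frames?

    Assumed layout per frame (not confirmed by the thread):
      bytes 0..3  big-endian uint32, decompressed size
      bytes 4..7  big-endian uint32, compressed size
      bytes 8..   compressed payload of exactly that size
    """
    pos = 0
    frames = 0
    while pos < len(buf):
        if len(buf) - pos < 8:
            return False  # not enough room for a frame header
        _decompressed, compressed = struct.unpack_from(">II", buf, pos)
        pos += 8
        if compressed > len(buf) - pos:
            return False  # declared payload runs past the end of the buffer
        pos += compressed
        frames += 1
    return frames > 0


# A buffer with a consistent header passes the check...
framed = struct.pack(">II", 100, 5) + b"\x00" * 5
print(looks_like_hadoop_lz4(framed))  # True
# ...but nothing is decompressed here, so a raw LZ4 block whose first
# 8 bytes happen to look like a valid header would pass as well.
print(looks_like_hadoop_lz4(b"\x10abc"))  # False: too short for a header
```

A check like this can reject obviously inconsistent input, but it cannot
prove which writer produced the data, which is the concern raised above.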


Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-06 Thread Antoine Pitrou


On 06/07/2020 at 17:57, Steve Kim wrote:
> The Parquet format specification is ambiguous about the exact details of
> LZ4 compression. However, the *de facto* reference implementation in Java
> (parquet-mr) uses the Hadoop LZ4 codec.
> 
> I think that it is important for Parquet C++ to have compatibility and
> feature parity with parquet-mr when possible. I prefer to change the
> LZ4 implementation in Parquet C++ to match the Hadoop LZ4 implementation
> that is used by parquet-mr (
> https://issues.apache.org/jira/browse/PARQUET-1878). I think that this
> change will be quick and easy. I have an intern under my supervision who is
> available to work on it full time, starting immediately. Please let me know
> if we ought to proceed.

Would that keep compatibility with existing files produced by Parquet C++?

Regards

Antoine.


Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-06 Thread Steve Kim
The Parquet format specification is ambiguous about the exact details of
LZ4 compression. However, the *de facto* reference implementation in Java
(parquet-mr) uses the Hadoop LZ4 codec.

I think that it is important for Parquet C++ to have compatibility and
feature parity with parquet-mr when possible. I prefer to change the
LZ4 implementation in Parquet C++ to match the Hadoop LZ4 implementation
that is used by parquet-mr (
https://issues.apache.org/jira/browse/PARQUET-1878). I think that this
change will be quick and easy. I have an intern under my supervision who is
available to work on it full time, starting immediately. Please let me know
if we ought to proceed.

If it is not feasible to achieve compatibility in the next release, then I
am in favor of disabling lz4 support (
https://issues.apache.org/jira/browse/PARQUET-1515) until it can be fixed.

Thanks,
Steve


On Tue, 30 Jun 2020 14:33:17 +0200
"Uwe L. Korn"  wrote:
> I'm also in favor of disabling support for now. Having to deal with
> broken files or the detection of various incompatible implementations in
> the long-term will harm more than not supporting LZ4 for a while. Snappy is
> generally more used than LZ4 in this category as it has been available
> since the inception of Parquet and thus should be considered as a viable
> alternative.
>
> Cheers
> Uwe
>
> On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
> > On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou wrote:
> > >
> > >
> > > On 25/06/2020 at 00:02, Wes McKinney wrote:
> > > > hi folks,
> > > >
> > > > (cross-posting to dev@arrow and dev@parquet since there are
> > > > stakeholders in both places)
> > > >
> > > > It seems there are still problems at least with the C++ implementation
> > > > of LZ4 compression in Parquet files
> > > >
> > > > https://issues.apache.org/jira/browse/PARQUET-1241
> > > > https://issues.apache.org/jira/browse/PARQUET-1878
> > >
> > > I don't have any particular opinion on how to solve the LZ4 issue, but
> > > I'd like to mention that LZ4 and ZStandard are the two most efficient
> > > compression algorithms available, and they span different parts of the
> > > speed/compression spectrum, so it would be a pity to disable one of them.
> >
> > It's true, however I think it's worse to write LZ4-compressed files
> > that cannot be read by other Parquet implementations (if that's what's
> > happening as I understand it?). If we are indeed shipping something
> > broken then we either should fix it or disable it until it can be
> > fixed.
> >
> > > Regards
> > >
> > > Antoine.
> >
>


Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-04 Thread Antoine Pitrou


I don't have a sense of how conservative Parquet users generally are.
Is it worth adding a LZ4_FRAMED compression option in the Parquet
format, or would people just not use it?
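
For context on why a framed option would be easier to validate: the LZ4
frame format begins with a fixed 4-byte magic number (0x184D2204, stored
little-endian), so a reader can recognize it up front. A small sketch; the
function name is hypothetical, and LZ4_FRAMED itself is only a proposal
here.

```python
import struct

# Magic number that opens every LZ4 frame, per the LZ4 frame format description.
LZ4_FRAME_MAGIC = 0x184D2204


def starts_with_lz4_frame_magic(buf: bytes) -> bool:
    """Return True if buf begins with the LZ4 frame magic number."""
    if len(buf) < 4:
        return False
    (magic,) = struct.unpack_from("<I", buf, 0)
    return magic == LZ4_FRAME_MAGIC


print(starts_with_lz4_frame_magic(bytes.fromhex("04224d18") + b"rest"))  # True
print(starts_with_lz4_frame_magic(b"\x00\x00\x00\x64"))  # False
```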

Regards

Antoine.


On Tue, 30 Jun 2020 14:33:17 +0200
"Uwe L. Korn"  wrote:
> I'm also in favor of disabling support for now. Having to deal with broken 
> files or the detection of various incompatible implementations in the 
> long-term will harm more than not supporting LZ4 for a while. Snappy is 
> generally more used than LZ4 in this category as it has been available since 
> the inception of Parquet and thus should be considered as a viable 
> alternative.
> 
> Cheers
> Uwe
> 
> On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
> > On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou  wrote:  
> > >
> > >
> > > On 25/06/2020 at 00:02, Wes McKinney wrote:
> > > > hi folks,
> > > >
> > > > (cross-posting to dev@arrow and dev@parquet since there are
> > > > stakeholders in both places)
> > > >
> > > > It seems there are still problems at least with the C++ implementation
> > > > of LZ4 compression in Parquet files
> > > >
> > > > https://issues.apache.org/jira/browse/PARQUET-1241
> > > > https://issues.apache.org/jira/browse/PARQUET-1878  
> > >
> > > I don't have any particular opinion on how to solve the LZ4 issue, but
> > > I'd like to mention that LZ4 and ZStandard are the two most efficient
> > > compression algorithms available, and they span different parts of the
> > > speed/compression spectrum, so it would be a pity to disable one of them. 
> > >  
> > 
> > It's true, however I think it's worse to write LZ4-compressed files
> > that cannot be read by other Parquet implementations (if that's what's
> > happening as I understand it?). If we are indeed shipping something
> > broken then we either should fix it or disable it until it can be
> > fixed.
> >   
> > > Regards
> > >
> > > Antoine.  
> >  
> 





Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-06-30 Thread Uwe L. Korn
I'm also in favor of disabling support for now. Having to deal with broken 
files or the detection of various incompatible implementations in the long-term 
will harm more than not supporting LZ4 for a while. Snappy is generally more 
used than LZ4 in this category as it has been available since the inception of 
Parquet and thus should be considered as a viable alternative.

Cheers
Uwe

On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
> On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou  wrote:
> >
> >
> > On 25/06/2020 at 00:02, Wes McKinney wrote:
> > > hi folks,
> > >
> > > (cross-posting to dev@arrow and dev@parquet since there are
> > > stakeholders in both places)
> > >
> > > It seems there are still problems at least with the C++ implementation
> > > of LZ4 compression in Parquet files
> > >
> > > https://issues.apache.org/jira/browse/PARQUET-1241
> > > https://issues.apache.org/jira/browse/PARQUET-1878
> >
> > I don't have any particular opinion on how to solve the LZ4 issue, but
> > I'd like to mention that LZ4 and ZStandard are the two most efficient
> > compression algorithms available, and they span different parts of the
> > speed/compression spectrum, so it would be a pity to disable one of them.
> 
> It's true, however I think it's worse to write LZ4-compressed files
> that cannot be read by other Parquet implementations (if that's what's
> happening as I understand it?). If we are indeed shipping something
> broken then we either should fix it or disable it until it can be
> fixed.
> 
> > Regards
> >
> > Antoine.
>


Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-06-29 Thread Wes McKinney
On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou  wrote:
>
>
> On 25/06/2020 at 00:02, Wes McKinney wrote:
> > hi folks,
> >
> > (cross-posting to dev@arrow and dev@parquet since there are
> > stakeholders in both places)
> >
> > It seems there are still problems at least with the C++ implementation
> > of LZ4 compression in Parquet files
> >
> > https://issues.apache.org/jira/browse/PARQUET-1241
> > https://issues.apache.org/jira/browse/PARQUET-1878
>
> I don't have any particular opinion on how to solve the LZ4 issue, but
> I'd like to mention that LZ4 and ZStandard are the two most efficient
> compression algorithms available, and they span different parts of the
> speed/compression spectrum, so it would be a pity to disable one of them.

It's true, however I think it's worse to write LZ4-compressed files
that cannot be read by other Parquet implementations (if that's what's
happening as I understand it?). If we are indeed shipping something
broken then we either should fix it or disable it until it can be
fixed.

> Regards
>
> Antoine.


Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-06-25 Thread Antoine Pitrou


On 25/06/2020 at 00:02, Wes McKinney wrote:
> hi folks,
> 
> (cross-posting to dev@arrow and dev@parquet since there are
> stakeholders in both places)
> 
> It seems there are still problems at least with the C++ implementation
> of LZ4 compression in Parquet files
> 
> https://issues.apache.org/jira/browse/PARQUET-1241
> https://issues.apache.org/jira/browse/PARQUET-1878

I don't have any particular opinion on how to solve the LZ4 issue, but
I'd like to mention that LZ4 and ZStandard are the two most efficient
compression algorithms available, and they span different parts of the
speed/compression spectrum, so it would be a pity to disable one of them.

Regards

Antoine.


[DISCUSS] Ongoing LZ4 problems with Parquet files

2020-06-24 Thread Wes McKinney
hi folks,

(cross-posting to dev@arrow and dev@parquet since there are
stakeholders in both places)

It seems there are still problems at least with the C++ implementation
of LZ4 compression in Parquet files

https://issues.apache.org/jira/browse/PARQUET-1241
https://issues.apache.org/jira/browse/PARQUET-1878

If these problems cannot be resolved, I am going to recommend that we
disable use of LZ4 in the Parquet C++ library until these things can
be properly tested and validated across different implementations.
Thoughts? We're within weeks of the next Apache Arrow release so if
we're going to disable LZ4-for-Parquet it needs to happen soon.

Thanks
Wes