Re: hadoop LZ4 incompatible with open source LZ4

2018-08-08 Thread ALeX Wang
@Wes

Okay I think I figured it out why I could not read LZ4 encoded parquet file
generated by parquet-mr.

Turns out hadoop LZ4 has its own framing format.

I summarized details in the JIRA ticket you posted:
https://issues.apache.org/jira/browse/PARQUET-1241

Thanks,
Alex Wang,

On Tue, 7 Aug 2018 at 12:13, Wes McKinney  wrote:

> hi Alex,
>
> here's one thread I remember about this
>
> https://github.com/dask/fastparquet/issues/314#issuecomment-371629605
>
> and a relevant unresolved JIRA
>
> https://issues.apache.org/jira/browse/PARQUET-1241
>
> The first step to resolving this issue is to reconcile what mode of
> LZ4 the Parquet format is supposed to be using
>
> - Wes
>
>
> On Tue, Aug 7, 2018 at 2:10 PM, ALeX Wang  wrote:
> > Hi Wes,
> >
> > Just to share my understanding,
> >
> > In Arrow, my understanding is that it downloads the lz4 from
> > https://github.com/lz4/lz4 (via export
> > LZ4_STATIC_LIB=$ARROW_EP/lz4_ep-prefix/src/lz4_ep/lib/liblz4.a).  So it
> is
> > using the LZ4_FRAMED codec.  But hadoop is not using framed lz4.  So i'll
> > see if I could implement a CodecFactory handle for LZ4_FRAMED in
> parquet-mr,
> >
> > Thanks,
> >
> >
> > On Tue, 7 Aug 2018 at 08:50, Wes McKinney  wrote:
> >
> >> hi Alex,
> >>
> >> No, if you look at the implementation in
> >>
> >>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression_lz4.cc#L32
> >> it is not using the same LZ4 compression style that Hadoop is using;
> >> realistically we need to add a bunch of options to Lz4Codec to be able
> >> to select what we want (or add LZ4_FRAMED codec). I'll have to dig in
> >> my e-mail to find the prior thread
> >>
> >> - Wes
> >>
> >> On Tue, Aug 7, 2018 at 11:45 AM, ALeX Wang  wrote:
> >> > Hi Wes,
> >> >
> >> > Are you talking about this ?
> >> >
> >>
> http://mail-archives.apache.org/mod_mbox/arrow-issues/201805.mbox/%3cjira.13158639.1526013615000.61149.1526228880...@atlassian.jira%3E
> >> >
> >> > I tried to compile with the latest arrow which contain this fix and
> still
> >> > encountered the corruption error.
> >> >
> >> > Also, we tried to read the file using pyparquet, and spark, did not
> work
> >> > either,
> >> >
> >> > Thanks,
> >> > Alex Wang,
> >> >
> >> >
> >> > On Tue, 7 Aug 2018 at 08:37, Wes McKinney 
> wrote:
> >> >
> >> >> hi Alex,
> >> >>
> >> >> I think there was an e-mail thread or JIRA about this, would have to
> >> >> dig it up. LZ4 compression was originally underspecified (has that
> >> >> been fixed) and we aren't using the correct compressor/decompressor
> >> >> options in parquet-cpp at the moment. If you have time to dig in and
> >> >> fix it, it would be much appreciated. Note that the LZ4 code lives in
> >> >> Apache Arrow
> >> >>
> >> >> - Wes
> >> >>
> >> >> On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang 
> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > Would like to kindly confirm my observation,
> >> >> >
> >> >> > We use parquet-mr (java) to generate parquet file with LZ4
> >> compression.
> >> >> To
> >> >> > do this we have to compile/install hadoop native library with
> provides
> >> >> LZ4
> >> >> > codec.
> >> >> >
> >> >> > However, the generated parquet file, is not recognizable by
> >> >> parquet-cpp.  I
> >> >> > encountered following error when using the `tools/parquet_reader`
> >> binary,
> >> >> >
> >> >> > ```
> >> >> > Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
> >> >> > ```
> >> >> >
> >> >> > Further search online get me to this JIRA ticket:
> >> >> > https://issues.apache.org/jira/browse/HADOOP-12990
> >> >> >
> >> >> > So, since hadoop LZ4 is incompatible with open source, parquet-mr
> lz4
> >> is
> >> >> > not compatible with parquet-cpp?
> >> >> >
> >> >> > Thanks,
> >> >> > --
> >> >> > Alex Wang,
> >> >> > Open vSwitch developer
> >> >>
> >> >
> >> >
> >> > --
> >> > Alex Wang,
> >> > Open vSwitch developer
> >>
> >
> >
> > --
> > Alex Wang,
> > Open vSwitch developer
>


-- 
Alex Wang,
Open vSwitch developer


Re: hadoop LZ4 incompatible with open source LZ4

2018-08-08 Thread ALeX Wang
wrote:
> >> >> > Hi,
> >> >> >
> >> >> > Would like to kindly confirm my observation,
> >> >> >
> >> >> > We use parquet-mr (java) to generate parquet file with LZ4
> >> compression.
> >> >> To
> >> >> > do this we have to compile/install hadoop native library with
> provides
> >> >> LZ4
> >> >> > codec.
> >> >> >
> >> >> > However, the generated parquet file, is not recognizable by
> >> >> parquet-cpp.  I
> >> >> > encountered following error when using the `tools/parquet_reader`
> >> binary,
> >> >> >
> >> >> > ```
> >> >> > Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
> >> >> > ```
> >> >> >
> >> >> > Further search online get me to this JIRA ticket:
> >> >> > https://issues.apache.org/jira/browse/HADOOP-12990
> >> >> >
> >> >> > So, since hadoop LZ4 is incompatible with open source, parquet-mr
> lz4
> >> is
> >> >> > not compatible with parquet-cpp?
> >> >> >
> >> >> > Thanks,
> >> >> > --
> >> >> > Alex Wang,
> >> >> > Open vSwitch developer
> >> >>
> >> >
> >> >
> >> > --
> >> > Alex Wang,
> >> > Open vSwitch developer
> >>
> >
> >
> > --
> > Alex Wang,
> > Open vSwitch developer
>


-- 
Alex Wang,
Open vSwitch developer


Re: hadoop LZ4 incompatible with open source LZ4

2018-08-07 Thread Wes McKinney
hi Alex,

here's one thread I remember about this

https://github.com/dask/fastparquet/issues/314#issuecomment-371629605

and a relevant unresolved JIRA

https://issues.apache.org/jira/browse/PARQUET-1241

The first step to resolving this issue is to reconcile what mode of
LZ4 the Parquet format is supposed to be using

- Wes


On Tue, Aug 7, 2018 at 2:10 PM, ALeX Wang  wrote:
> Hi Wes,
>
> Just to share my understanding,
>
> In Arrow, my understanding is that it downloads the lz4 from
> https://github.com/lz4/lz4 (via export
> LZ4_STATIC_LIB=$ARROW_EP/lz4_ep-prefix/src/lz4_ep/lib/liblz4.a).  So it is
> using the LZ4_FRAMED codec.  But hadoop is not using framed lz4.  So i'll
> see if I could implement a CodecFactory handle for LZ4_FRAMED in parquet-mr,
>
> Thanks,
>
>
> On Tue, 7 Aug 2018 at 08:50, Wes McKinney  wrote:
>
>> hi Alex,
>>
>> No, if you look at the implementation in
>>
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression_lz4.cc#L32
>> it is not using the same LZ4 compression style that Hadoop is using;
>> realistically we need to add a bunch of options to Lz4Codec to be able
>> to select what we want (or add LZ4_FRAMED codec). I'll have to dig in
>> my e-mail to find the prior thread
>>
>> - Wes
>>
>> On Tue, Aug 7, 2018 at 11:45 AM, ALeX Wang  wrote:
>> > Hi Wes,
>> >
>> > Are you talking about this ?
>> >
>> http://mail-archives.apache.org/mod_mbox/arrow-issues/201805.mbox/%3cjira.13158639.1526013615000.61149.1526228880...@atlassian.jira%3E
>> >
>> > I tried to compile with the latest arrow which contain this fix and still
>> > encountered the corruption error.
>> >
>> > Also, we tried to read the file using pyparquet, and spark, did not work
>> > either,
>> >
>> > Thanks,
>> > Alex Wang,
>> >
>> >
>> > On Tue, 7 Aug 2018 at 08:37, Wes McKinney  wrote:
>> >
>> >> hi Alex,
>> >>
>> >> I think there was an e-mail thread or JIRA about this, would have to
>> >> dig it up. LZ4 compression was originally underspecified (has that
>> >> been fixed) and we aren't using the correct compressor/decompressor
>> >> options in parquet-cpp at the moment. If you have time to dig in and
>> >> fix it, it would be much appreciated. Note that the LZ4 code lives in
>> >> Apache Arrow
>> >>
>> >> - Wes
>> >>
>> >> On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang  wrote:
>> >> > Hi,
>> >> >
>> >> > Would like to kindly confirm my observation,
>> >> >
>> >> > We use parquet-mr (java) to generate parquet file with LZ4
>> compression.
>> >> To
>> >> > do this we have to compile/install hadoop native library with provides
>> >> LZ4
>> >> > codec.
>> >> >
>> >> > However, the generated parquet file, is not recognizable by
>> >> parquet-cpp.  I
>> >> > encountered following error when using the `tools/parquet_reader`
>> binary,
>> >> >
>> >> > ```
>> >> > Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
>> >> > ```
>> >> >
>> >> > Further search online get me to this JIRA ticket:
>> >> > https://issues.apache.org/jira/browse/HADOOP-12990
>> >> >
>> >> > So, since hadoop LZ4 is incompatible with open source, parquet-mr lz4
>> is
>> >> > not compatible with parquet-cpp?
>> >> >
>> >> > Thanks,
>> >> > --
>> >> > Alex Wang,
>> >> > Open vSwitch developer
>> >>
>> >
>> >
>> > --
>> > Alex Wang,
>> > Open vSwitch developer
>>
>
>
> --
> Alex Wang,
> Open vSwitch developer


Re: hadoop LZ4 incompatible with open source LZ4

2018-08-07 Thread ALeX Wang
Hi Wes,

Just to share my understanding,

In Arrow, my understanding is that it downloads the lz4 from
https://github.com/lz4/lz4 (via export
LZ4_STATIC_LIB=$ARROW_EP/lz4_ep-prefix/src/lz4_ep/lib/liblz4.a).  So it is
using the LZ4_FRAMED codec.  But hadoop is not using framed lz4.  So i'll
see if I could implement a CodecFactory handle for LZ4_FRAMED in parquet-mr,

Thanks,


On Tue, 7 Aug 2018 at 08:50, Wes McKinney  wrote:

> hi Alex,
>
> No, if you look at the implementation in
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression_lz4.cc#L32
> it is not using the same LZ4 compression style that Hadoop is using;
> realistically we need to add a bunch of options to Lz4Codec to be able
> to select what we want (or add LZ4_FRAMED codec). I'll have to dig in
> my e-mail to find the prior thread
>
> - Wes
>
> On Tue, Aug 7, 2018 at 11:45 AM, ALeX Wang  wrote:
> > Hi Wes,
> >
> > Are you talking about this ?
> >
> http://mail-archives.apache.org/mod_mbox/arrow-issues/201805.mbox/%3cjira.13158639.1526013615000.61149.1526228880...@atlassian.jira%3E
> >
> > I tried to compile with the latest arrow which contain this fix and still
> > encountered the corruption error.
> >
> > Also, we tried to read the file using pyparquet, and spark, did not work
> > either,
> >
> > Thanks,
> > Alex Wang,
> >
> >
> > On Tue, 7 Aug 2018 at 08:37, Wes McKinney  wrote:
> >
> >> hi Alex,
> >>
> >> I think there was an e-mail thread or JIRA about this, would have to
> >> dig it up. LZ4 compression was originally underspecified (has that
> >> been fixed) and we aren't using the correct compressor/decompressor
> >> options in parquet-cpp at the moment. If you have time to dig in and
> >> fix it, it would be much appreciated. Note that the LZ4 code lives in
> >> Apache Arrow
> >>
> >> - Wes
> >>
> >> On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang  wrote:
> >> > Hi,
> >> >
> >> > Would like to kindly confirm my observation,
> >> >
> >> > We use parquet-mr (java) to generate parquet file with LZ4
> compression.
> >> To
> >> > do this we have to compile/install hadoop native library with provides
> >> LZ4
> >> > codec.
> >> >
> >> > However, the generated parquet file, is not recognizable by
> >> parquet-cpp.  I
> >> > encountered following error when using the `tools/parquet_reader`
> binary,
> >> >
> >> > ```
> >> > Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
> >> > ```
> >> >
> >> > Further search online get me to this JIRA ticket:
> >> > https://issues.apache.org/jira/browse/HADOOP-12990
> >> >
> >> > So, since hadoop LZ4 is incompatible with open source, parquet-mr lz4
> is
> >> > not compatible with parquet-cpp?
> >> >
> >> > Thanks,
> >> > --
> >> > Alex Wang,
> >> > Open vSwitch developer
> >>
> >
> >
> > --
> > Alex Wang,
> > Open vSwitch developer
>


-- 
Alex Wang,
Open vSwitch developer


Re: hadoop LZ4 incompatible with open source LZ4

2018-08-07 Thread Wes McKinney
hi Alex,

No, if you look at the implementation in
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression_lz4.cc#L32
it is not using the same LZ4 compression style that Hadoop is using;
realistically we need to add a bunch of options to Lz4Codec to be able
to select what we want (or add LZ4_FRAMED codec). I'll have to dig in
my e-mail to find the prior thread

- Wes

On Tue, Aug 7, 2018 at 11:45 AM, ALeX Wang  wrote:
> Hi Wes,
>
> Are you talking about this ?
> http://mail-archives.apache.org/mod_mbox/arrow-issues/201805.mbox/%3cjira.13158639.1526013615000.61149.1526228880...@atlassian.jira%3E
>
> I tried to compile with the latest arrow which contain this fix and still
> encountered the corruption error.
>
> Also, we tried to read the file using pyparquet, and spark, did not work
> either,
>
> Thanks,
> Alex Wang,
>
>
> On Tue, 7 Aug 2018 at 08:37, Wes McKinney  wrote:
>
>> hi Alex,
>>
>> I think there was an e-mail thread or JIRA about this, would have to
>> dig it up. LZ4 compression was originally underspecified (has that
>> been fixed) and we aren't using the correct compressor/decompressor
>> options in parquet-cpp at the moment. If you have time to dig in and
>> fix it, it would be much appreciated. Note that the LZ4 code lives in
>> Apache Arrow
>>
>> - Wes
>>
>> On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang  wrote:
>> > Hi,
>> >
>> > Would like to kindly confirm my observation,
>> >
>> > We use parquet-mr (java) to generate parquet file with LZ4 compression.
>> To
>> > do this we have to compile/install hadoop native library with provides
>> LZ4
>> > codec.
>> >
>> > However, the generated parquet file, is not recognizable by
>> parquet-cpp.  I
>> > encountered following error when using the `tools/parquet_reader` binary,
>> >
>> > ```
>> > Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
>> > ```
>> >
>> > Further search online get me to this JIRA ticket:
>> > https://issues.apache.org/jira/browse/HADOOP-12990
>> >
>> > So, since hadoop LZ4 is incompatible with open source, parquet-mr lz4 is
>> > not compatible with parquet-cpp?
>> >
>> > Thanks,
>> > --
>> > Alex Wang,
>> > Open vSwitch developer
>>
>
>
> --
> Alex Wang,
> Open vSwitch developer


Re: hadoop LZ4 incompatible with open source LZ4

2018-08-07 Thread ALeX Wang
Hi Wes,

Are you talking about this ?
http://mail-archives.apache.org/mod_mbox/arrow-issues/201805.mbox/%3cjira.13158639.1526013615000.61149.1526228880...@atlassian.jira%3E

I tried to compile with the latest arrow which contain this fix and still
encountered the corruption error.

Also, we tried to read the file using pyparquet, and spark, did not work
either,

Thanks,
Alex Wang,


On Tue, 7 Aug 2018 at 08:37, Wes McKinney  wrote:

> hi Alex,
>
> I think there was an e-mail thread or JIRA about this, would have to
> dig it up. LZ4 compression was originally underspecified (has that
> been fixed) and we aren't using the correct compressor/decompressor
> options in parquet-cpp at the moment. If you have time to dig in and
> fix it, it would be much appreciated. Note that the LZ4 code lives in
> Apache Arrow
>
> - Wes
>
> On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang  wrote:
> > Hi,
> >
> > Would like to kindly confirm my observation,
> >
> > We use parquet-mr (java) to generate parquet file with LZ4 compression.
> To
> > do this we have to compile/install hadoop native library with provides
> LZ4
> > codec.
> >
> > However, the generated parquet file, is not recognizable by
> parquet-cpp.  I
> > encountered following error when using the `tools/parquet_reader` binary,
> >
> > ```
> > Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
> > ```
> >
> > Further search online get me to this JIRA ticket:
> > https://issues.apache.org/jira/browse/HADOOP-12990
> >
> > So, since hadoop LZ4 is incompatible with open source, parquet-mr lz4 is
> > not compatible with parquet-cpp?
> >
> > Thanks,
> > --
> > Alex Wang,
> > Open vSwitch developer
>


-- 
Alex Wang,
Open vSwitch developer


Re: hadoop LZ4 incompatible with open source LZ4

2018-08-07 Thread Wes McKinney
hi Alex,

I think there was an e-mail thread or JIRA about this, would have to
dig it up. LZ4 compression was originally underspecified (has that
been fixed) and we aren't using the correct compressor/decompressor
options in parquet-cpp at the moment. If you have time to dig in and
fix it, it would be much appreciated. Note that the LZ4 code lives in
Apache Arrow

- Wes

On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang  wrote:
> Hi,
>
> Would like to kindly confirm my observation,
>
> We use parquet-mr (java) to generate parquet file with LZ4 compression.  To
> do this we have to compile/install hadoop native library with provides LZ4
> codec.
>
> However, the generated parquet file, is not recognizable by parquet-cpp.  I
> encountered following error when using the `tools/parquet_reader` binary,
>
> ```
> Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
> ```
>
> Further search online get me to this JIRA ticket:
> https://issues.apache.org/jira/browse/HADOOP-12990
>
> So, since hadoop LZ4 is incompatible with open source, parquet-mr lz4 is
> not compatible with parquet-cpp?
>
> Thanks,
> --
> Alex Wang,
> Open vSwitch developer


hadoop LZ4 incompatible with open source LZ4

2018-08-07 Thread ALeX Wang
Hi,

Would like to kindly confirm my observation,

We use parquet-mr (java) to generate parquet file with LZ4 compression.  To
do this we have to compile/install hadoop native library with provides LZ4
codec.

However, the generated parquet file, is not recognizable by parquet-cpp.  I
encountered following error when using the `tools/parquet_reader` binary,

```
Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
```

Further search online get me to this JIRA ticket:
https://issues.apache.org/jira/browse/HADOOP-12990

So, since hadoop LZ4 is incompatible with open source, parquet-mr lz4 is
not compatible with parquet-cpp?

Thanks,
-- 
Alex Wang,
Open vSwitch developer