Re: hadoop LZ4 incompatible with open source LZ4

2018-08-08 Thread ALeX Wang
@Wes

Okay I think I figured it out why I could not read LZ4 encoded parquet file
generated by parquet-mr.

Turns out hadoop LZ4 has its own framing format.

I summarized details in the JIRA ticket you posted:
https://issues.apache.org/jira/browse/PARQUET-1241

Thanks,
Alex Wang,

On Tue, 7 Aug 2018 at 12:13, Wes McKinney  wrote:

> hi Alex,
>
> here's one thread I remember about this
>
> https://github.com/dask/fastparquet/issues/314#issuecomment-371629605
>
> and a relevant unresolved JIRA
>
> https://issues.apache.org/jira/browse/PARQUET-1241
>
> The first step to resolving this issue is to reconcile what mode of
> LZ4 the Parquet format is supposed to be using
>
> - Wes
>
>
> On Tue, Aug 7, 2018 at 2:10 PM, ALeX Wang  wrote:
> > Hi Wes,
> >
> > Just to share my understanding,
> >
> > In Arrow, my understanding is that it downloads the lz4 from
> > https://github.com/lz4/lz4 (via export
> > LZ4_STATIC_LIB=$ARROW_EP/lz4_ep-prefix/src/lz4_ep/lib/liblz4.a).  So it
> is
> > using the LZ4_FRAMED codec.  But hadoop is not using framed lz4.  So i'll
> > see if I could implement a CodecFactory handle for LZ4_FRAMED in
> parquet-mr,
> >
> > Thanks,
> >
> >
> > On Tue, 7 Aug 2018 at 08:50, Wes McKinney  wrote:
> >
> >> hi Alex,
> >>
> >> No, if you look at the implementation in
> >>
> >>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression_lz4.cc#L32
> >> it is not using the same LZ4 compression style that Hadoop is using;
> >> realistically we need to add a bunch of options to Lz4Codec to be able
> >> to select what we want (or add LZ4_FRAMED codec). I'll have to dig in
> >> my e-mail to find the prior thread
> >>
> >> - Wes
> >>
> >> On Tue, Aug 7, 2018 at 11:45 AM, ALeX Wang  wrote:
> >> > Hi Wes,
> >> >
> >> > Are you talking about this ?
> >> >
> >>
> http://mail-archives.apache.org/mod_mbox/arrow-issues/201805.mbox/%3cjira.13158639.1526013615000.61149.1526228880...@atlassian.jira%3E
> >> >
> >> > I tried to compile with the latest arrow which contain this fix and
> still
> >> > encountered the corruption error.
> >> >
> >> > Also, we tried to read the file using pyparquet, and spark, did not
> work
> >> > either,
> >> >
> >> > Thanks,
> >> > Alex Wang,
> >> >
> >> >
> >> > On Tue, 7 Aug 2018 at 08:37, Wes McKinney 
> wrote:
> >> >
> >> >> hi Alex,
> >> >>
> >> >> I think there was an e-mail thread or JIRA about this, would have to
> >> >> dig it up. LZ4 compression was originally underspecified (has that
> >> >> been fixed) and we aren't using the correct compressor/decompressor
> >> >> options in parquet-cpp at the moment. If you have time to dig in and
> >> >> fix it, it would be much appreciated. Note that the LZ4 code lives in
> >> >> Apache Arrow
> >> >>
> >> >> - Wes
> >> >>
> >> >> On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang 
> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > Would like to kindly confirm my observation,
> >> >> >
> >> >> > We use parquet-mr (java) to generate parquet file with LZ4
> >> compression.
> >> >> To
> >> >> > do this we have to compile/install hadoop native library with
> provides
> >> >> LZ4
> >> >> > codec.
> >> >> >
> >> >> > However, the generated parquet file, is not recognizable by
> >> >> parquet-cpp.  I
> >> >> > encountered following error when using the `tools/parquet_reader`
> >> binary,
> >> >> >
> >> >> > ```
> >> >> > Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
> >> >> > ```
> >> >> >
> >> >> > Further search online get me to this JIRA ticket:
> >> >> > https://issues.apache.org/jira/browse/HADOOP-12990
> >> >> >
> >> >> > So, since hadoop LZ4 is incompatible with open source, parquet-mr
> lz4
> >> is
> >> >> > not compatible with parquet-cpp?
> >> >> >
> >> >> > Thanks,
> >> >> > --
> >> >> > Alex Wang,
> >> >> > Open vSwitch developer
> >> >>
> >> >
> >> >
> >> > --
> >> > Alex Wang,
> >> > Open vSwitch developer
> >>
> >
> >
> > --
> > Alex Wang,
> > Open vSwitch developer
>


-- 
Alex Wang,
Open vSwitch developer


Re: hadoop LZ4 incompatible with open source LZ4

2018-08-08 Thread ALeX Wang
Hi Wes,

Thanks again for the pointers.  During investigation I noticed this
possible bug,

The private variables 'min_', 'max_' in 'TypedRowGroupStatistics' class is
not initialized in constructor.

And I got an abort while trying to read a column using 'parquet_reader'.
And gdb breakpoint at constructor shows that those variables are not
initialized.

```
Breakpoint 1,
parquet::TypedRowGroupStatistics
>::TypedRowGroupStatistics (
this=0xbafd48, schema=0xc17080, encoded_min="", encoded_max="",
num_values=0, null_count=39160, distinct_count=0,
 has_min_max=true, pool=0xbab8e0
)
   at
/opt/gpdbbuild/parquet-cpp/src/parquet/statistics.cc:74
74if
(!encoded_min.empty()) {
(gdb) p min_
$37 = 116
(gdb) p max_
$38 = false
```
And since both 'encoded_min' and 'encoded_max' are empty string, they are
never set...


So, if this is valid issue, i'm proposing the following fix:
```
diff --git a/src/parquet/statistics.cc b/src/parquet/statistics.cc
index ea7f783..5d61bc9 100644
--- a/src/parquet/statistics.cc
+++ b/src/parquet/statistics.cc
@@ -65,6 +65,8 @@ TypedRowGroupStatistics::TypedRowGroupStatistics(
 : pool_(pool),
   min_buffer_(AllocateBuffer(pool_, 0)),
   max_buffer_(AllocateBuffer(pool_, 0)) {
+  using T = typename DType::c_type;
+
   IncrementNumValues(num_values);
   IncrementNullCount(null_count);
   IncrementDistinctCount(distinct_count);
@@ -73,9 +75,13 @@ TypedRowGroupStatistics::TypedRowGroupStatistics(

   if (!encoded_min.empty()) {
 PlainDecode(encoded_min, _);
+  } else {
+min_ = T();
   }
   if (!encoded_max.empty()) {
 PlainDecode(encoded_max, _);
+  } else {
+max_ = T();
   }
   has_min_max_ = has_min_max;
 }
```

Thanks,

On Tue, 7 Aug 2018 at 12:13, Wes McKinney  wrote:

> hi Alex,
>
> here's one thread I remember about this
>
> https://github.com/dask/fastparquet/issues/314#issuecomment-371629605
>
> and a relevant unresolved JIRA
>
> https://issues.apache.org/jira/browse/PARQUET-1241
>
> The first step to resolving this issue is to reconcile what mode of
> LZ4 the Parquet format is supposed to be using
>
> - Wes
>
>
> On Tue, Aug 7, 2018 at 2:10 PM, ALeX Wang  wrote:
> > Hi Wes,
> >
> > Just to share my understanding,
> >
> > In Arrow, my understanding is that it downloads the lz4 from
> > https://github.com/lz4/lz4 (via export
> > LZ4_STATIC_LIB=$ARROW_EP/lz4_ep-prefix/src/lz4_ep/lib/liblz4.a).  So it
> is
> > using the LZ4_FRAMED codec.  But hadoop is not using framed lz4.  So i'll
> > see if I could implement a CodecFactory handle for LZ4_FRAMED in
> parquet-mr,
> >
> > Thanks,
> >
> >
> > On Tue, 7 Aug 2018 at 08:50, Wes McKinney  wrote:
> >
> >> hi Alex,
> >>
> >> No, if you look at the implementation in
> >>
> >>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression_lz4.cc#L32
> >> it is not using the same LZ4 compression style that Hadoop is using;
> >> realistically we need to add a bunch of options to Lz4Codec to be able
> >> to select what we want (or add LZ4_FRAMED codec). I'll have to dig in
> >> my e-mail to find the prior thread
> >>
> >> - Wes
> >>
> >> On Tue, Aug 7, 2018 at 11:45 AM, ALeX Wang  wrote:
> >> > Hi Wes,
> >> >
> >> > Are you talking about this ?
> >> >
> >>
> http://mail-archives.apache.org/mod_mbox/arrow-issues/201805.mbox/%3cjira.13158639.1526013615000.61149.1526228880...@atlassian.jira%3E
> >> >
> >> > I tried to compile with the latest arrow which contain this fix and
> still
> >> > encountered the corruption error.
> >> >
> >> > Also, we tried to read the file using pyparquet, and spark, did not
> work
> >> > either,
> >> >
> >> > Thanks,
> >> > Alex Wang,
> >> >
> >> >
> >> > On Tue, 7 Aug 2018 at 08:37, Wes McKinney 
> wrote:
> >> >
> >> >> hi Alex,
> >> >>
> >> >> I think there was an e-mail thread or JIRA about this, would have to
> >> >> dig it up. LZ4 compression was originally underspecified (has that
> >> >> been fixed) and we aren't using the correct compressor/decompressor
> >> >> options in parquet-cpp at the moment. If you have time to dig in and
> >> >> fix it, it would be much appreciated. Note that the LZ4 code lives in
> >> >> Apache Arrow
> >> >>
> >> >> - Wes
> >> >>
> >> >> On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang 
> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > Would like to kindly confirm my observation,
> >> >> >
> >> >> > We use parquet-mr (java) to generate parquet file with LZ4
> >> compression.
> >> >> To
> >> >> > do this we have to compile/install hadoop native library with
> provides
> >> >> LZ4
> >> >> > codec.
> >> >> >
> >> >> > However, the generated parquet file, is not recognizable by
> >> >> parquet-cpp.  I
> >> >> > encountered following error when using the `tools/parquet_reader`
> >> binary,
> >> >> >
> >> >> > ```
> >> >> > Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
> >> >> > ```
> >> >> >
> >> >> > Further search online get me to this JIRA ticket:
> >> >> > 

Re: hadoop LZ4 incompatible with open source LZ4

2018-08-07 Thread Wes McKinney
hi Alex,

here's one thread I remember about this

https://github.com/dask/fastparquet/issues/314#issuecomment-371629605

and a relevant unresolved JIRA

https://issues.apache.org/jira/browse/PARQUET-1241

The first step to resolving this issue is to reconcile what mode of
LZ4 the Parquet format is supposed to be using

- Wes


On Tue, Aug 7, 2018 at 2:10 PM, ALeX Wang  wrote:
> Hi Wes,
>
> Just to share my understanding,
>
> In Arrow, my understanding is that it downloads the lz4 from
> https://github.com/lz4/lz4 (via export
> LZ4_STATIC_LIB=$ARROW_EP/lz4_ep-prefix/src/lz4_ep/lib/liblz4.a).  So it is
> using the LZ4_FRAMED codec.  But hadoop is not using framed lz4.  So i'll
> see if I could implement a CodecFactory handle for LZ4_FRAMED in parquet-mr,
>
> Thanks,
>
>
> On Tue, 7 Aug 2018 at 08:50, Wes McKinney  wrote:
>
>> hi Alex,
>>
>> No, if you look at the implementation in
>>
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression_lz4.cc#L32
>> it is not using the same LZ4 compression style that Hadoop is using;
>> realistically we need to add a bunch of options to Lz4Codec to be able
>> to select what we want (or add LZ4_FRAMED codec). I'll have to dig in
>> my e-mail to find the prior thread
>>
>> - Wes
>>
>> On Tue, Aug 7, 2018 at 11:45 AM, ALeX Wang  wrote:
>> > Hi Wes,
>> >
>> > Are you talking about this ?
>> >
>> http://mail-archives.apache.org/mod_mbox/arrow-issues/201805.mbox/%3cjira.13158639.1526013615000.61149.1526228880...@atlassian.jira%3E
>> >
>> > I tried to compile with the latest arrow which contain this fix and still
>> > encountered the corruption error.
>> >
>> > Also, we tried to read the file using pyparquet, and spark, did not work
>> > either,
>> >
>> > Thanks,
>> > Alex Wang,
>> >
>> >
>> > On Tue, 7 Aug 2018 at 08:37, Wes McKinney  wrote:
>> >
>> >> hi Alex,
>> >>
>> >> I think there was an e-mail thread or JIRA about this, would have to
>> >> dig it up. LZ4 compression was originally underspecified (has that
>> >> been fixed) and we aren't using the correct compressor/decompressor
>> >> options in parquet-cpp at the moment. If you have time to dig in and
>> >> fix it, it would be much appreciated. Note that the LZ4 code lives in
>> >> Apache Arrow
>> >>
>> >> - Wes
>> >>
>> >> On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang  wrote:
>> >> > Hi,
>> >> >
>> >> > Would like to kindly confirm my observation,
>> >> >
>> >> > We use parquet-mr (java) to generate parquet file with LZ4
>> compression.
>> >> To
>> >> > do this we have to compile/install hadoop native library with provides
>> >> LZ4
>> >> > codec.
>> >> >
>> >> > However, the generated parquet file, is not recognizable by
>> >> parquet-cpp.  I
>> >> > encountered following error when using the `tools/parquet_reader`
>> binary,
>> >> >
>> >> > ```
>> >> > Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
>> >> > ```
>> >> >
>> >> > Further search online get me to this JIRA ticket:
>> >> > https://issues.apache.org/jira/browse/HADOOP-12990
>> >> >
>> >> > So, since hadoop LZ4 is incompatible with open source, parquet-mr lz4
>> is
>> >> > not compatible with parquet-cpp?
>> >> >
>> >> > Thanks,
>> >> > --
>> >> > Alex Wang,
>> >> > Open vSwitch developer
>> >>
>> >
>> >
>> > --
>> > Alex Wang,
>> > Open vSwitch developer
>>
>
>
> --
> Alex Wang,
> Open vSwitch developer


Re: hadoop LZ4 incompatible with open source LZ4

2018-08-07 Thread ALeX Wang
Hi Wes,

Just to share my understanding,

In Arrow, my understanding is that it downloads the lz4 from
https://github.com/lz4/lz4 (via export
LZ4_STATIC_LIB=$ARROW_EP/lz4_ep-prefix/src/lz4_ep/lib/liblz4.a).  So it is
using the LZ4_FRAMED codec.  But hadoop is not using framed lz4.  So i'll
see if I could implement a CodecFactory handle for LZ4_FRAMED in parquet-mr,

Thanks,


On Tue, 7 Aug 2018 at 08:50, Wes McKinney  wrote:

> hi Alex,
>
> No, if you look at the implementation in
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression_lz4.cc#L32
> it is not using the same LZ4 compression style that Hadoop is using;
> realistically we need to add a bunch of options to Lz4Codec to be able
> to select what we want (or add LZ4_FRAMED codec). I'll have to dig in
> my e-mail to find the prior thread
>
> - Wes
>
> On Tue, Aug 7, 2018 at 11:45 AM, ALeX Wang  wrote:
> > Hi Wes,
> >
> > Are you talking about this ?
> >
> http://mail-archives.apache.org/mod_mbox/arrow-issues/201805.mbox/%3cjira.13158639.1526013615000.61149.1526228880...@atlassian.jira%3E
> >
> > I tried to compile with the latest arrow which contain this fix and still
> > encountered the corruption error.
> >
> > Also, we tried to read the file using pyparquet, and spark, did not work
> > either,
> >
> > Thanks,
> > Alex Wang,
> >
> >
> > On Tue, 7 Aug 2018 at 08:37, Wes McKinney  wrote:
> >
> >> hi Alex,
> >>
> >> I think there was an e-mail thread or JIRA about this, would have to
> >> dig it up. LZ4 compression was originally underspecified (has that
> >> been fixed) and we aren't using the correct compressor/decompressor
> >> options in parquet-cpp at the moment. If you have time to dig in and
> >> fix it, it would be much appreciated. Note that the LZ4 code lives in
> >> Apache Arrow
> >>
> >> - Wes
> >>
> >> On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang  wrote:
> >> > Hi,
> >> >
> >> > Would like to kindly confirm my observation,
> >> >
> >> > We use parquet-mr (java) to generate parquet file with LZ4
> compression.
> >> To
> >> > do this we have to compile/install hadoop native library with provides
> >> LZ4
> >> > codec.
> >> >
> >> > However, the generated parquet file, is not recognizable by
> >> parquet-cpp.  I
> >> > encountered following error when using the `tools/parquet_reader`
> binary,
> >> >
> >> > ```
> >> > Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
> >> > ```
> >> >
> >> > Further search online get me to this JIRA ticket:
> >> > https://issues.apache.org/jira/browse/HADOOP-12990
> >> >
> >> > So, since hadoop LZ4 is incompatible with open source, parquet-mr lz4
> is
> >> > not compatible with parquet-cpp?
> >> >
> >> > Thanks,
> >> > --
> >> > Alex Wang,
> >> > Open vSwitch developer
> >>
> >
> >
> > --
> > Alex Wang,
> > Open vSwitch developer
>


-- 
Alex Wang,
Open vSwitch developer


Re: hadoop LZ4 incompatible with open source LZ4

2018-08-07 Thread Wes McKinney
hi Alex,

No, if you look at the implementation in
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression_lz4.cc#L32
it is not using the same LZ4 compression style that Hadoop is using;
realistically we need to add a bunch of options to Lz4Codec to be able
to select what we want (or add LZ4_FRAMED codec). I'll have to dig in
my e-mail to find the prior thread

- Wes

On Tue, Aug 7, 2018 at 11:45 AM, ALeX Wang  wrote:
> Hi Wes,
>
> Are you talking about this ?
> http://mail-archives.apache.org/mod_mbox/arrow-issues/201805.mbox/%3cjira.13158639.1526013615000.61149.1526228880...@atlassian.jira%3E
>
> I tried to compile with the latest arrow which contain this fix and still
> encountered the corruption error.
>
> Also, we tried to read the file using pyparquet, and spark, did not work
> either,
>
> Thanks,
> Alex Wang,
>
>
> On Tue, 7 Aug 2018 at 08:37, Wes McKinney  wrote:
>
>> hi Alex,
>>
>> I think there was an e-mail thread or JIRA about this, would have to
>> dig it up. LZ4 compression was originally underspecified (has that
>> been fixed) and we aren't using the correct compressor/decompressor
>> options in parquet-cpp at the moment. If you have time to dig in and
>> fix it, it would be much appreciated. Note that the LZ4 code lives in
>> Apache Arrow
>>
>> - Wes
>>
>> On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang  wrote:
>> > Hi,
>> >
>> > Would like to kindly confirm my observation,
>> >
>> > We use parquet-mr (java) to generate parquet file with LZ4 compression.
>> To
>> > do this we have to compile/install hadoop native library with provides
>> LZ4
>> > codec.
>> >
>> > However, the generated parquet file, is not recognizable by
>> parquet-cpp.  I
>> > encountered following error when using the `tools/parquet_reader` binary,
>> >
>> > ```
>> > Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
>> > ```
>> >
>> > Further search online get me to this JIRA ticket:
>> > https://issues.apache.org/jira/browse/HADOOP-12990
>> >
>> > So, since hadoop LZ4 is incompatible with open source, parquet-mr lz4 is
>> > not compatible with parquet-cpp?
>> >
>> > Thanks,
>> > --
>> > Alex Wang,
>> > Open vSwitch developer
>>
>
>
> --
> Alex Wang,
> Open vSwitch developer


Re: hadoop LZ4 incompatible with open source LZ4

2018-08-07 Thread ALeX Wang
Hi Wes,

Are you talking about this ?
http://mail-archives.apache.org/mod_mbox/arrow-issues/201805.mbox/%3cjira.13158639.1526013615000.61149.1526228880...@atlassian.jira%3E

I tried to compile with the latest arrow which contain this fix and still
encountered the corruption error.

Also, we tried to read the file using pyparquet, and spark, did not work
either,

Thanks,
Alex Wang,


On Tue, 7 Aug 2018 at 08:37, Wes McKinney  wrote:

> hi Alex,
>
> I think there was an e-mail thread or JIRA about this, would have to
> dig it up. LZ4 compression was originally underspecified (has that
> been fixed) and we aren't using the correct compressor/decompressor
> options in parquet-cpp at the moment. If you have time to dig in and
> fix it, it would be much appreciated. Note that the LZ4 code lives in
> Apache Arrow
>
> - Wes
>
> On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang  wrote:
> > Hi,
> >
> > Would like to kindly confirm my observation,
> >
> > We use parquet-mr (java) to generate parquet file with LZ4 compression.
> To
> > do this we have to compile/install hadoop native library with provides
> LZ4
> > codec.
> >
> > However, the generated parquet file, is not recognizable by
> parquet-cpp.  I
> > encountered following error when using the `tools/parquet_reader` binary,
> >
> > ```
> > Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
> > ```
> >
> > Further search online get me to this JIRA ticket:
> > https://issues.apache.org/jira/browse/HADOOP-12990
> >
> > So, since hadoop LZ4 is incompatible with open source, parquet-mr lz4 is
> > not compatible with parquet-cpp?
> >
> > Thanks,
> > --
> > Alex Wang,
> > Open vSwitch developer
>


-- 
Alex Wang,
Open vSwitch developer


Re: hadoop LZ4 incompatible with open source LZ4

2018-08-07 Thread Wes McKinney
hi Alex,

I think there was an e-mail thread or JIRA about this, would have to
dig it up. LZ4 compression was originally underspecified (has that
been fixed) and we aren't using the correct compressor/decompressor
options in parquet-cpp at the moment. If you have time to dig in and
fix it, it would be much appreciated. Note that the LZ4 code lives in
Apache Arrow

- Wes

On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang  wrote:
> Hi,
>
> Would like to kindly confirm my observation,
>
> We use parquet-mr (java) to generate parquet file with LZ4 compression.  To
> do this we have to compile/install hadoop native library with provides LZ4
> codec.
>
> However, the generated parquet file, is not recognizable by parquet-cpp.  I
> encountered following error when using the `tools/parquet_reader` binary,
>
> ```
> Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
> ```
>
> Further search online get me to this JIRA ticket:
> https://issues.apache.org/jira/browse/HADOOP-12990
>
> So, since hadoop LZ4 is incompatible with open source, parquet-mr lz4 is
> not compatible with parquet-cpp?
>
> Thanks,
> --
> Alex Wang,
> Open vSwitch developer