[GitHub] [arrow] pitrou commented on pull request #7789: PARQUET-1878: [C++] lz4 codec is not compatible with Hadoop Lz4Codec

2020-09-22 Thread GitBox
pitrou commented on pull request #7789: URL: https://github.com/apache/arrow/pull/7789#issuecomment-696630511 Need to add a test with the legacy file in https://github.com/apache/arrow-testing/pull/47 This is an automated

[GitHub] [arrow] pitrou commented on pull request #7789: PARQUET-1878: [C++] lz4 codec is not compatible with Hadoop Lz4Codec

2020-09-17 Thread GitBox
pitrou commented on pull request #7789: URL: https://github.com/apache/arrow/pull/7789#issuecomment-694320305 I'll be able to do that.

[GitHub] [arrow] pitrou commented on pull request #7789: PARQUET-1878: [C++] lz4 codec is not compatible with Hadoop Lz4Codec

2020-08-17 Thread GitBox
pitrou commented on pull request #7789: URL: https://github.com/apache/arrow/pull/7789#issuecomment-674893008 @patrickpai Do you have some time to make the desired changes here?

[GitHub] [arrow] pitrou commented on pull request #7789: PARQUET-1878: [C++] lz4 codec is not compatible with Hadoop Lz4Codec

2020-07-17 Thread GitBox
pitrou commented on pull request #7789: URL: https://github.com/apache/arrow/pull/7789#issuecomment-660342064 Well, this change is certainly not going to land in 1.0, so I think you can add the write path here if you want. Also, in addition to the Hadoop-encoded LZ4 file, it would

[GitHub] [arrow] pitrou commented on pull request #7789: PARQUET-1878: [C++] lz4 codec is not compatible with Hadoop Lz4Codec

2020-07-17 Thread GitBox
pitrou commented on pull request #7789: URL: https://github.com/apache/arrow/pull/7789#issuecomment-660293362 I don't think you need to do it in the constructor, you can simply do it when the decompression is called.

[GitHub] [arrow] pitrou commented on pull request #7789: PARQUET-1878: [C++] lz4 codec is not compatible with Hadoop Lz4Codec

2020-07-17 Thread GitBox
pitrou commented on pull request #7789: URL: https://github.com/apache/arrow/pull/7789#issuecomment-660275706 > There would be a performance cost when attempting to read data pages that were written with incompatible lz4 codec Why can't you implement the heuristic I outlined above?

[GitHub] [arrow] pitrou commented on pull request #7789: PARQUET-1878: [C++] lz4 codec is not compatible with Hadoop Lz4Codec

2020-07-17 Thread GitBox
pitrou commented on pull request #7789: URL: https://github.com/apache/arrow/pull/7789#issuecomment-660272945 Hmm, I think the guess is extremely likely to be correct. There's a tiny chance that bytes 4-7 for a non-Hadoop-compressed file would be equal to the compressed buffer size - 8.
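As a rough illustration of why that chance is tiny: assume, purely for the sake of the estimate, that bytes 4-7 of a non-Hadoop buffer behave like a uniformly random 32-bit value (real LZ4 output need not be uniform, so this is only an order-of-magnitude sketch). Only one of the 2^32 possible values equals the compressed buffer size minus 8, so the probability of a coincidental match would be:

```latex
P(\text{false match}) = \frac{1}{2^{32}} \approx 2.3 \times 10^{-10}
```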

[GitHub] [arrow] pitrou commented on pull request #7789: PARQUET-1878: [C++] lz4 codec is not compatible with Hadoop Lz4Codec

2020-07-17 Thread GitBox
pitrou commented on pull request #7789: URL: https://github.com/apache/arrow/pull/7789#issuecomment-660271155 We're not the only ones producing Parquet files.

[GitHub] [arrow] pitrou commented on pull request #7789: PARQUET-1878: [C++] lz4 codec is not compatible with Hadoop Lz4Codec

2020-07-17 Thread GitBox
pitrou commented on pull request #7789: URL: https://github.com/apache/arrow/pull/7789#issuecomment-660265813 Well, you can read the compressed size in bytes 4-7 and see if that corresponds to the actual buffer size you got. If by chance it corresponds but it is not actually
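The size check described in this comment can be sketched in C++ roughly as follows. This is a minimal illustration, not the code from the pull request: the names `LoadBigEndian32` and `LooksLikeHadoopLz4` are hypothetical, and the big-endian byte order of the size field is an assumption about the Hadoop framing that the comment itself does not state. The sketch only decides whether a buffer looks Hadoop-framed; what to do when the guess turns out wrong is left to the caller.

```cpp
#include <cstddef>
#include <cstdint>

// Read a 32-bit unsigned integer stored in big-endian order at `p`.
// Big-endian is an assumption about the Hadoop framing, not taken from the thread.
inline uint32_t LoadBigEndian32(const uint8_t* p) {
  return (static_cast<uint32_t>(p[0]) << 24) |
         (static_cast<uint32_t>(p[1]) << 16) |
         (static_cast<uint32_t>(p[2]) << 8) |
         static_cast<uint32_t>(p[3]);
}

// Hypothetical helper: returns true when bytes 4-7 of `input` hold a value
// equal to `input_len - 8`, i.e. the buffer looks like an 8-byte prefix
// followed by exactly that many compressed bytes.
inline bool LooksLikeHadoopLz4(const uint8_t* input, size_t input_len) {
  constexpr size_t kPrefixLength = 8;
  // Too short to contain the prefix at all, so it cannot be Hadoop-framed.
  if (input == nullptr || input_len < kPrefixLength) {
    return false;
  }
  const uint32_t claimed_compressed_size = LoadBigEndian32(input + 4);
  return static_cast<size_t>(claimed_compressed_size) == input_len - kPrefixLength;
}
```

A reader could call `LooksLikeHadoopLz4(buffer, size)` at decompression time and pick the decoding path based on the result, which keeps the check out of the codec constructor.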

[GitHub] [arrow] pitrou commented on pull request #7789: PARQUET-1878: [C++] lz4 codec is not compatible with Hadoop Lz4Codec

2020-07-17 Thread GitBox
pitrou commented on pull request #7789: URL: https://github.com/apache/arrow/pull/7789#issuecomment-660254916 > Should we make every Codec know its corresponding compression type enum? > Another approach would be to define another codec, LZ4_HADOOP, [...] Both approaches sound ok

[GitHub] [arrow] pitrou commented on pull request #7789: PARQUET-1878: [C++] lz4 codec is not compatible with Hadoop Lz4Codec

2020-07-17 Thread GitBox
pitrou commented on pull request #7789: URL: https://github.com/apache/arrow/pull/7789#issuecomment-659994574 We certainly don't want to do this on the Arrow side (the codecs may be used for something other than Parquet), but rather on the Parquet side.