[jira] [Commented] (PARQUET-1643) Use airlift non-native implementations for GZIP, LZ0 and LZ4 codecs
[ https://issues.apache.org/jira/browse/PARQUET-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148060#comment-17148060 ]

ASF GitHub Bot commented on PARQUET-1643:
-----------------------------------------

samarthjain commented on pull request #671:
URL: https://github.com/apache/parquet-mr/pull/671#issuecomment-651282816

@nandorKollar, @rdblue, @danielcweeks - if you have cycles, could you please take a look at this PR.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Use airlift non-native implementations for GZIP, LZ0 and LZ4 codecs
> -------------------------------------------------------------------
>
>                 Key: PARQUET-1643
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1643
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Samarth Jain
>            Assignee: Samarth Jain
>            Priority: Major
>              Labels: pull-request-available
>
> [~rdblue] pointed me to [https://github.com/airlift/aircompressor] which
> provides non-native implementations of compression codecs. It claims to be
> much faster than native wrappers that parquet uses. This Jira is to track the
> work needed for exploring using these codecs, getting benchmark results and
> making changes including not needing to pool compressors and decompressors
> anymore. Note that this doesn't include SNAPPY since Parquet already has its
> own non-hadoopy implementation for it.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (PARQUET-1643) Use airlift non-native implementations for GZIP, LZ0 and LZ4 codecs
[ https://issues.apache.org/jira/browse/PARQUET-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147587#comment-17147587 ]

ASF GitHub Bot commented on PARQUET-1643:
-----------------------------------------

samarthjain commented on pull request #671:
URL: https://github.com/apache/parquet-mr/pull/671#issuecomment-650971182

@dbtsai
> Since airlift is a pure Java implementation, what are the performance implications for zstd? I saw there is a benchmark for GZIP, but I don't see benchmarks for other codecs.

It looks like the zstd Airlift implementation doesn't implement the Hadoop APIs. It can be integrated within Parquet, but that will take some work and is definitely worthy of another PR.
[jira] [Commented] (PARQUET-1643) Use airlift non-native implementations for GZIP, LZ0 and LZ4 codecs
[ https://issues.apache.org/jira/browse/PARQUET-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147302#comment-17147302 ]

ASF GitHub Bot commented on PARQUET-1643:
-----------------------------------------

samarthjain commented on pull request #671:
URL: https://github.com/apache/parquet-mr/pull/671#issuecomment-650731696

Force pushed a new commit that makes it configurable whether to use Airlift based compressors or not. Also added tests and GZIP benchmarks for Airlift compressors. Benchmark results reveal that there are no performance improvements or regressions when using Airlift GZIP vs plain GZIP.

```
Benchmark                                                              Cnt   Score   Error
PageChecksumReadBenchmarks.read10MRowsAirliftGzipWithVerification        3   6.431  ±0.741
PageChecksumReadBenchmarks.read10MRowsAirliftGzipWithoutVerification     3   6.605  ±0.709
PageChecksumReadBenchmarks.read10MRowsGzipWithVerification               3   6.468  ±0.700
PageChecksumReadBenchmarks.read10MRowsGzipWithoutVerification            3   6.583  ±1.538
PageChecksumWriteBenchmarks.write10MRowsAirliftGzipWithChecksums         3  36.333  ±0.510
PageChecksumWriteBenchmarks.write10MRowsAirliftGzipWithoutChecksums      3  36.069  ±1.096
PageChecksumWriteBenchmarks.write10MRowsGzipWithChecksums                3  36.141  ±1.095
PageChecksumWriteBenchmarks.write10MRowsGzipWithoutChecksums             3  36.174  ±5.125
ReadBenchmarks.read1MRowsDefaultBlockAndPageSizeAirliftGZIP              3   0.898  ±1.254
ReadBenchmarks.read1MRowsDefaultBlockAndPageSizeGZIP                     3   0.891  ±1.201
```
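One way to sanity-check the "no improvements or regressions" reading of the GZIP results quoted in this comment is to test whether each Airlift score interval (score ± error) overlaps the interval of its plain GZIP counterpart. The sketch below is only an illustration: the class and method names are hypothetical and not part of parquet-mr, the overlap test is a rough heuristic rather than a formal significance test, and the write scores are read as 36.x with a run count of 3, which is an assumption about how the pasted columns were fused.

```java
// Hypothetical helper, not part of parquet-mr: compares paired benchmark
// scores by checking whether their "score ± error" intervals intersect.
final class BenchmarkOverlap {

    /** True if [a - aErr, a + aErr] and [b - bErr, b + bErr] intersect. */
    static boolean intervalsOverlap(double a, double aErr, double b, double bErr) {
        return Math.abs(a - b) <= aErr + bErr;
    }

    public static void main(String[] args) {
        String[] names = {
            "read10MRows with verification",
            "read10MRows without verification",
            "write10MRows with checksums",
            "write10MRows without checksums",
            "read1MRows default block and page size",
        };
        // Each row: {airliftScore, airliftError, gzipScore, gzipError},
        // copied from the comment's results (write scores assumed to be 36.x).
        double[][] rows = {
            {6.431, 0.741, 6.468, 0.700},
            {6.605, 0.709, 6.583, 1.538},
            {36.333, 0.510, 36.141, 1.095},
            {36.069, 1.096, 36.174, 5.125},
            {0.898, 1.254, 0.891, 1.201},
        };
        for (int i = 0; i < rows.length; i++) {
            double[] r = rows[i];
            System.out.println(names[i] + ": overlap="
                + intervalsOverlap(r[0], r[1], r[2], r[3]));
        }
    }
}
```

With only three runs per benchmark and error bars that in some rows exceed the score itself, overlapping intervals are consistent with the comment's conclusion but do not rule out a small difference in either direction.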
[jira] [Commented] (PARQUET-1643) Use airlift non-native implementations for GZIP, LZ0 and LZ4 codecs
[ https://issues.apache.org/jira/browse/PARQUET-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096844#comment-17096844 ]

ASF GitHub Bot commented on PARQUET-1643:
-----------------------------------------

dbtsai commented on pull request #671:
URL: https://github.com/apache/parquet-mr/pull/671#issuecomment-622018995

@samarthjain thanks for the work. I am looking to deploy zstd parquet into prod, but that requires a new hadoop with native library support, which is not practical in many prod use-cases. Since airlift is a pure Java implementation, what are the performance implications for zstd? I saw there is a benchmark for GZIP, but I don't see benchmarks for other codecs.

Also, have we considered using zstd-jni, a Java library that packages the native implementation of zstd for different platforms in a jar?
[jira] [Commented] (PARQUET-1643) Use airlift non-native implementations for GZIP, LZ0 and LZ4 codecs
[ https://issues.apache.org/jira/browse/PARQUET-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918907#comment-16918907 ]

ASF GitHub Bot commented on PARQUET-1643:
-----------------------------------------

samarthjain commented on pull request #671: PARQUET-1643 Use airlift codecs for LZ4, LZ0 and GZIP
URL: https://github.com/apache/parquet-mr/pull/671
[jira] [Commented] (PARQUET-1643) Use airlift non-native implementations for GZIP, LZ0 and LZ4 codecs
[ https://issues.apache.org/jira/browse/PARQUET-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918905#comment-16918905 ]

ASF GitHub Bot commented on PARQUET-1643:
-----------------------------------------

samarthjain commented on pull request #671: PARQUET-1643 Use airlift codecs for LZ4, LZ0 and GZIP
URL: https://github.com/apache/parquet-mr/pull/671
[jira] [Commented] (PARQUET-1643) Use airlift non-native implementations for GZIP, LZ0 and LZ4 codecs
[ https://issues.apache.org/jira/browse/PARQUET-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917990#comment-16917990 ]

ASF GitHub Bot commented on PARQUET-1643:
-----------------------------------------

samarthjain commented on pull request #671: PARQUET-1643 Use airlift codecs for LZ4, LZ0 and GZIP
URL: https://github.com/apache/parquet-mr/pull/671