[jira] [Commented] (PARQUET-2173) Fix parquet build against hadoop 3.3.3+

[ https://issues.apache.org/jira/browse/PARQUET-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580443#comment-17580443 ]

ASF GitHub Bot commented on PARQUET-2173:
-----------------------------------------

steveloughran commented on PR #985:
URL: https://github.com/apache/parquet-mr/pull/985#issuecomment-1217104595

I've also built against the next release of Hadoop, and against 3.4.0-SNAPSHOT. The parquet build fails there because jackson 1 is purged from the Hadoop classpath, breaking the japicmp plugin.

```
Execution default of goal com.github.siom79.japicmp:japicmp-maven-plugin:0.14.2:cmp failed: Could not load 'org.codehaus.jackson.type.TypeReference
```

> Fix parquet build against hadoop 3.3.3+
> ---------------------------------------
>
> Key: PARQUET-2173
> URL: https://issues.apache.org/jira/browse/PARQUET-2173
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cli
> Affects Versions: 1.13.0
> Reporter: Steve Loughran
> Priority: Major
>
> parquet won't build against hadoop 3.3.3+ because it swapped out log4j 1.2.17
> for reload4j, and this creates maven dependency problems in parquet-cli
> {code}
> [INFO] --- maven-dependency-plugin:3.1.1:analyze-only (default) @ parquet-cli ---
> [WARNING] Used undeclared dependencies found:
> [WARNING]    ch.qos.reload4j:reload4j:jar:1.2.22:provided
> {code}
> the hadoop-common dependencies need to exclude this jar and any changed slf4j
> ones.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
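Until the plugin's jackson 1 dependency problem is resolved, a build against such a Hadoop version can likely be unblocked by skipping the compatibility check. A hedged sketch only: it assumes japicmp-maven-plugin's standard `skip` configuration applies here; the plugin coordinates are taken from the error message above, but the profile name is invented.

```xml
<!-- Hypothetical pom.xml profile: skip the japicmp check when building
     against Hadoop versions that no longer ship jackson 1.
     Activate with: mvn install -Pskip-japicmp -Dhadoop.version=3.4.0-SNAPSHOT -->
<profile>
  <id>skip-japicmp</id>
  <build>
    <plugins>
      <plugin>
        <groupId>com.github.siom79.japicmp</groupId>
        <artifactId>japicmp-maven-plugin</artifactId>
        <configuration>
          <skip>true</skip>
        </configuration>
      </plugin>
    </plugins>
  </build>
</profile>
```

Skipping the check only sidesteps the failure for local testing; the real fix would be moving japicmp's jackson dependency off the Hadoop-provided classpath.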
[jira] [Commented] (PARQUET-2173) Fix parquet build against hadoop 3.3.3+

[ https://issues.apache.org/jira/browse/PARQUET-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580435#comment-17580435 ]

ASF GitHub Bot commented on PARQUET-2173:
-----------------------------------------

steveloughran opened a new pull request, #985:
URL: https://github.com/apache/parquet-mr/pull/985

Hadoop 3.3.3 moved to reload4j for logging, to stop shipping a version of log4j with known (albeit unused) CVEs. This bypasses the existing exclusion code used to keep Hadoop's SLF4J dependency off the classpaths and, by adding a new jar, breaks the parquet-cli build.

Make sure you have checked _all_ steps below.

### Jira
- [X] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
  - https://issues.apache.org/jira/browse/PARQUET-XXX
  - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).

### Tests
- [X] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: the testing is regression testing ("does the build work?", "does a test run complete without SLF4J warnings of duplicates?"), done manually with `-Dhadoop.version=3.3.4`

### Commits
- [X] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
  1. Subject is separated from body by a blank line
  1. Subject is limited to 50 characters (not including Jira issue reference)
  1. Subject does not end with a period
  1. Subject uses the imperative mood ("add", not "adding")
  1. Body wraps at 72 characters
  1. Body explains "what" and "why", not "how"

### Documentation
- [ ] In case of new functionality, my PR adds documentation that describes how to use it.
  - All the public functions and the classes in the PR contain Javadoc that explain what it does
[jira] [Created] (PARQUET-2173) Fix parquet build against hadoop 3.3.3+

Steve Loughran created PARQUET-2173:
------------------------------------

Summary: Fix parquet build against hadoop 3.3.3+
Key: PARQUET-2173
URL: https://issues.apache.org/jira/browse/PARQUET-2173
Project: Parquet
Issue Type: Bug
Components: parquet-cli
Affects Versions: 1.13.0
Reporter: Steve Loughran

parquet won't build against hadoop 3.3.3+ because it swapped out log4j 1.2.17 for reload4j, and this creates maven dependency problems in parquet-cli:

{code}
[INFO] --- maven-dependency-plugin:3.1.1:analyze-only (default) @ parquet-cli ---
[WARNING] Used undeclared dependencies found:
[WARNING]    ch.qos.reload4j:reload4j:jar:1.2.22:provided
{code}

The hadoop-common dependencies need to exclude this jar and any changed slf4j ones.
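The exclusion described above might look like the following in parquet-cli's pom. This is a sketch under assumptions, not the actual patch from PR #985: the reload4j coordinates come from the warning above, but treating hadoop-common as the carrier and excluding the `slf4j-reload4j` binding are guesses at what "any changed slf4j ones" covers.

```xml
<!-- Hypothetical sketch of the exclusion; the actual PR may differ. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>${hadoop.version}</version>
  <scope>provided</scope>
  <exclusions>
    <!-- reload4j replaced log4j 1.2.17 in Hadoop 3.3.3+ -->
    <exclusion>
      <groupId>ch.qos.reload4j</groupId>
      <artifactId>reload4j</artifactId>
    </exclusion>
    <!-- the SLF4J binding changed from slf4j-log4j12 to slf4j-reload4j -->
    <exclusion>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-reload4j</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Excluding the binding at the dependency that drags it in keeps Hadoop's logging backend off the classpath without pinning a particular log4j fork in parquet itself.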
[jira] [Commented] (PARQUET-2149) Implement async IO for Parquet file reader

[ https://issues.apache.org/jira/browse/PARQUET-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580267#comment-17580267 ]

ASF GitHub Bot commented on PARQUET-2149:
-----------------------------------------

ggershinsky commented on code in PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#discussion_r946662874

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:

```
@@ -126,6 +127,42 @@ public class ParquetFileReader implements Closeable {
   public static String PARQUET_READ_PARALLELISM = "parquet.metadata.read.parallelism";

+  public static int numProcessors = Runtime.getRuntime().availableProcessors();
+
+  // Thread pool to read column chunk data from disk. Applications should call setAsyncIOThreadPool
+  // to initialize this with their own implementations.
+  // Default initialization is useful only for testing
+  public static ExecutorService ioThreadPool = Executors.newCachedThreadPool(
+    r -> new Thread(r, "parquet-io"));
+
+  // Thread pool to process pages for multiple columns in parallel. Applications should call
+  // setAsyncProcessThreadPool to initialize this with their own implementations.
+  // Default initialization is useful only for testing
+  public static ExecutorService processThreadPool = Executors.newCachedThreadPool(
```

Review Comment: not sure; looks like many tests use copy/paste, rather than extension..

> Implement async IO for Parquet file reader
> ------------------------------------------
>
> Key: PARQUET-2149
> URL: https://issues.apache.org/jira/browse/PARQUET-2149
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Reporter: Parth Chandra
> Priority: Major
>
> ParquetFileReader's implementation has the following flow (simplified):
> - For every column -> read from storage in 8MB blocks -> read all uncompressed pages into an output queue
> - From output queues -> (downstream) decompression + decoding
> This flow is serialized, which means that downstream threads are blocked until the data has been read. Because a large part of the time is spent waiting for data from storage, threads are idle and CPU utilization is really low.
> There is no reason why this cannot be made asynchronous _and_ parallel. So for column _i_: read one chunk at a time until the end, from storage -> intermediate output queue -> read one uncompressed page at a time until the end -> output queue -> (downstream) decompression + decoding
> Note that this can be made completely self-contained in ParquetFileReader, and downstream implementations like Iceberg and Spark will automatically be able to take advantage without code change, as long as the ParquetFileReader APIs are not changed.
> In past work with async IO ([Drill - async page reader|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java]), I have seen 2x-3x improvement in reading speed for Parquet files.
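The diff under review initializes its default pools with named threads via `Executors.newCachedThreadPool(ThreadFactory)`. A minimal self-contained sketch of an application-supplied pool in that spirit is below; the helper `namedDaemonPool` and the thread-name prefix are illustrative inventions, not the PR's exact code, and marking the threads as daemons is one plausible design choice for a pool an application would hand to `setAsyncIOThreadPool`.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncPoolSketch {

    // Build a cached pool whose threads carry a recognizable name prefix,
    // similar to the "parquet-io" threads in the diff above.
    static ExecutorService namedDaemonPool(String prefix) {
        AtomicInteger counter = new AtomicInteger();
        return Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, prefix + "-" + counter.incrementAndGet());
            // Daemon threads will not keep the JVM alive at shutdown.
            t.setDaemon(true);
            return t;
        });
    }

    public static void main(String[] args) throws Exception {
        ExecutorService ioPool = namedDaemonPool("parquet-io");
        // The submitted task reports which thread it ran on.
        Future<String> f = ioPool.submit(() -> Thread.currentThread().getName());
        System.out.println(f.get());
        ioPool.shutdown();
    }
}
```

Named threads make thread dumps of a stalled reader much easier to attribute, which matters once IO and page-processing pools run side by side.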