[jira] [Commented] (PARQUET-2173) Fix parquet build against hadoop 3.3.3+

2022-08-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580443#comment-17580443
 ] 

ASF GitHub Bot commented on PARQUET-2173:
-

steveloughran commented on PR #985:
URL: https://github.com/apache/parquet-mr/pull/985#issuecomment-1217104595

   I've also built against the next release of Hadoop, and against 3.4.0-SNAPSHOT.
   
   The Parquet build fails there because Jackson 1 has been purged from the Hadoop
   classpath, which breaks the japicmp plugin.
   
   ```
   Execution default of goal com.github.siom79.japicmp:japicmp-maven-plugin:0.14.2:cmp failed: Could not load 'org.codehaus.jackson.type.TypeReference
   ```




> Fix parquet build against hadoop 3.3.3+
> ---
>
> Key: PARQUET-2173
> URL: https://issues.apache.org/jira/browse/PARQUET-2173
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cli
> Affects Versions: 1.13.0
> Reporter: Steve Loughran
> Priority: Major
>
> Parquet won't build against Hadoop 3.3.3+ because Hadoop swapped out log4j 1.2.17
> for reload4j, and this creates Maven dependency problems in parquet-cli:
> {code}
> [INFO] --- maven-dependency-plugin:3.1.1:analyze-only (default) @ parquet-cli ---
> [WARNING] Used undeclared dependencies found:
> [WARNING]    ch.qos.reload4j:reload4j:jar:1.2.22:provided
> {code}
> The hadoop-common dependencies need to exclude this jar and any changed slf4j
> ones.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] steveloughran commented on pull request #985: PARQUET-2173. Fix parquet build against hadoop 3.3.3+

2022-08-16 Thread GitBox


steveloughran commented on PR #985:
URL: https://github.com/apache/parquet-mr/pull/985#issuecomment-1217104595

   I've also built against the next release of Hadoop, and against 3.4.0-SNAPSHOT.
   
   The Parquet build fails there because Jackson 1 has been purged from the Hadoop
   classpath, which breaks the japicmp plugin.
   
   ```
   Execution default of goal com.github.siom79.japicmp:japicmp-maven-plugin:0.14.2:cmp failed: Could not load 'org.codehaus.jackson.type.TypeReference
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2173) Fix parquet build against hadoop 3.3.3+

2022-08-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580435#comment-17580435
 ] 

ASF GitHub Bot commented on PARQUET-2173:
-

steveloughran opened a new pull request, #985:
URL: https://github.com/apache/parquet-mr/pull/985

   
   Hadoop 3.3.3 moved to reload4j for logging to stop
   shipping a version of log4j with known (albeit unused)
   CVEs.
   
   This bypasses the existing exclusion code used to
   keep hadoop's SLF4J dependency off the classpaths,
   and by adding a new jar, breaks parquet-cli build.
   
   
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [X] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
     - https://issues.apache.org/jira/browse/PARQUET-XXX
     - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [X] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
   
   The testing is regression testing: "does the build work?" and "does a test run complete without SLF4J warnings of duplicates?". Done manually with `-Dhadoop.version=3.3.4`.
   
   ### Commits
   
   - [X] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes how to use it.
     - All the public functions and the classes in the PR contain Javadoc that explains what they do
   




> Fix parquet build against hadoop 3.3.3+
> ---
>
> Key: PARQUET-2173
> URL: https://issues.apache.org/jira/browse/PARQUET-2173
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cli
> Affects Versions: 1.13.0
> Reporter: Steve Loughran
> Priority: Major
>
> Parquet won't build against Hadoop 3.3.3+ because Hadoop swapped out log4j 1.2.17
> for reload4j, and this creates Maven dependency problems in parquet-cli:
> {code}
> [INFO] --- maven-dependency-plugin:3.1.1:analyze-only (default) @ parquet-cli ---
> [WARNING] Used undeclared dependencies found:
> [WARNING]    ch.qos.reload4j:reload4j:jar:1.2.22:provided
> {code}
> The hadoop-common dependencies need to exclude this jar and any changed slf4j
> ones.





[GitHub] [parquet-mr] steveloughran opened a new pull request, #985: PARQUET-2173. Fix parquet build against hadoop 3.3.3+

2022-08-16 Thread GitBox


steveloughran opened a new pull request, #985:
URL: https://github.com/apache/parquet-mr/pull/985

   
   Hadoop 3.3.3 moved to reload4j for logging to stop
   shipping a version of log4j with known (albeit unused)
   CVEs.
   
   This bypasses the existing exclusion code used to
   keep hadoop's SLF4J dependency off the classpaths,
   and by adding a new jar, breaks parquet-cli build.
   
   
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [X] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
     - https://issues.apache.org/jira/browse/PARQUET-XXX
     - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [X] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
   
   The testing is regression testing: "does the build work?" and "does a test run complete without SLF4J warnings of duplicates?". Done manually with `-Dhadoop.version=3.3.4`.
   
   ### Commits
   
   - [X] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes how to use it.
     - All the public functions and the classes in the PR contain Javadoc that explains what they do
   





[jira] [Created] (PARQUET-2173) Fix parquet build against hadoop 3.3.3+

2022-08-16 Thread Steve Loughran (Jira)
Steve Loughran created PARQUET-2173:
---

 Summary: Fix parquet build against hadoop 3.3.3+
 Key: PARQUET-2173
 URL: https://issues.apache.org/jira/browse/PARQUET-2173
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cli
Affects Versions: 1.13.0
Reporter: Steve Loughran


Parquet won't build against Hadoop 3.3.3+ because Hadoop swapped out log4j 1.2.17 for
reload4j, and this creates Maven dependency problems in parquet-cli:


{code}
[INFO] --- maven-dependency-plugin:3.1.1:analyze-only (default) @ parquet-cli ---
[WARNING] Used undeclared dependencies found:
[WARNING]    ch.qos.reload4j:reload4j:jar:1.2.22:provided

{code}

The hadoop-common dependencies need to exclude this jar and any changed slf4j
ones.





[jira] [Commented] (PARQUET-2149) Implement async IO for Parquet file reader

2022-08-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580267#comment-17580267
 ] 

ASF GitHub Bot commented on PARQUET-2149:
-

ggershinsky commented on code in PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#discussion_r946662874


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -126,6 +127,42 @@ public class ParquetFileReader implements Closeable {
 
   public static String PARQUET_READ_PARALLELISM = "parquet.metadata.read.parallelism";
 
+  public static int numProcessors = Runtime.getRuntime().availableProcessors();
+
+  // Thread pool to read column chunk data from disk. Applications should call setAsyncIOThreadPool
+  // to initialize this with their own implementations.
+  // Default initialization is useful only for testing
+  public static ExecutorService ioThreadPool = Executors.newCachedThreadPool(
+    r -> new Thread(r, "parquet-io"));
+
+  // Thread pool to process pages for multiple columns in parallel. Applications should call
+  // setAsyncProcessThreadPool to initialize this with their own implementations.
+  // Default initialization is useful only for testing
+  public static ExecutorService processThreadPool = Executors.newCachedThreadPool(

Review Comment:
   Not sure; it looks like many tests use copy/paste rather than extension.
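
The comments in the hunk above say that applications should supply their own pools and that the default cached pools are useful only for testing. As a rough illustration of what an application-provided pool might look like, here is a minimal sketch that builds a fixed-size pool of named daemon threads. The class name `BoundedPoolSketch` and the helpers `boundedPool` and `threadNameOf` are hypothetical and not part of the PR. The sketch deliberately does not call `setAsyncIOThreadPool`, because its signature is not shown in this thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper class, not part of parquet-mr.
public class BoundedPoolSketch {

  /** Builds a fixed-size pool whose daemon threads are named prefix-0, prefix-1, ... */
  static ExecutorService boundedPool(String prefix, int size) {
    AtomicInteger counter = new AtomicInteger();
    ThreadFactory factory = r -> {
      Thread t = new Thread(r, prefix + "-" + counter.getAndIncrement());
      t.setDaemon(true); // do not keep the JVM alive because of idle pool threads
      return t;
    };
    // Unlike a cached pool, the thread count here never exceeds `size`;
    // extra tasks wait in the unbounded queue.
    return new ThreadPoolExecutor(size, size, 60L, TimeUnit.SECONDS,
        new LinkedBlockingQueue<>(), factory);
  }

  /** Runs one task on the pool and returns the name of the thread that executed it. */
  static String threadNameOf(ExecutorService pool) {
    try {
      return pool.submit(() -> Thread.currentThread().getName()).get();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    int size = Runtime.getRuntime().availableProcessors();
    ExecutorService ioPool = boundedPool("parquet-io", size);
    System.out.println(threadNameOf(ioPool));
    ioPool.shutdown();
  }
}
```

A bounded pool caps the number of threads regardless of how many column chunks are in flight, which is one reason an application might prefer it over an unbounded cached pool.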





> Implement async IO for Parquet file reader
> --
>
> Key: PARQUET-2149
> URL: https://issues.apache.org/jira/browse/PARQUET-2149
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Reporter: Parth Chandra
> Priority: Major
>
> ParquetFileReader's implementation has the following flow (simplified):
>   - For every column -> read from storage in 8MB blocks -> read all
>     uncompressed pages into output queue
>   - From output queues -> (downstream) decompression + decoding
> This flow is serialized, which means that downstream threads are blocked
> until the data has been read. Because a large part of the time spent is
> waiting for data from storage, threads are idle and CPU utilization is really
> low.
> There is no reason why this cannot be made asynchronous _and_ parallel:
> for column _i_ -> read one chunk until end, from storage -> intermediate
> output queue -> read one uncompressed page until end -> output queue ->
> (downstream) decompression + decoding
> Note that this can be made completely self-contained in ParquetFileReader, and
> downstream implementations like Iceberg and Spark will automatically be able
> to take advantage without code changes as long as the ParquetFileReader APIs
> are not changed.
> In past work with async IO ([Drill - async page reader|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java]),
> I have seen a 2x-3x improvement in reading speed for Parquet files.
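
The per-column flow described in the issue (storage read -> intermediate queue -> downstream decoding) can be sketched roughly as follows. This is only an illustration of the producer/consumer idea under stated assumptions, not the implementation in PR #968. The class `AsyncColumnPipelineSketch`, the stand-in methods `readPage` and `decode`, and the `END` sentinel are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch, not code from parquet-mr.
public class AsyncColumnPipelineSketch {
  // Sentinel marking the end of a column's page stream (compared by reference).
  private static final byte[] END = new byte[0];

  /** Stand-in for a blocking storage read of one page. */
  static byte[] readPage(int column, int page) {
    return new byte[] {(byte) column, (byte) page};
  }

  /** Stand-in for decompression + decoding; here it just sums the bytes. */
  static int decode(byte[] page) {
    int sum = 0;
    for (byte b : page) {
      sum += b;
    }
    return sum;
  }

  /** Runs one reader and one decoder per column, connected by a queue. */
  static List<Integer> readAll(int columns, int pagesPerColumn) {
    ExecutorService ioPool =
        Executors.newFixedThreadPool(columns, r -> new Thread(r, "sketch-io"));
    ExecutorService processPool =
        Executors.newFixedThreadPool(columns, r -> new Thread(r, "sketch-process"));
    try {
      List<CompletableFuture<Integer>> results = new ArrayList<>();
      for (int c = 0; c < columns; c++) {
        final int column = c;
        BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();
        // Producer: hand over each page as soon as it is read.
        CompletableFuture.runAsync(() -> {
          for (int p = 0; p < pagesPerColumn; p++) {
            queue.add(readPage(column, p));
          }
          queue.add(END);
        }, ioPool);
        // Consumer: decode pages while later pages are still being read.
        results.add(CompletableFuture.supplyAsync(() -> {
          int total = 0;
          try {
            for (byte[] page = queue.take(); page != END; page = queue.take()) {
              total += decode(page);
            }
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
          }
          return total;
        }, processPool));
      }
      List<Integer> totals = new ArrayList<>();
      for (CompletableFuture<Integer> f : results) {
        totals.add(f.join());
      }
      return totals;
    } finally {
      ioPool.shutdown();
      processPool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(readAll(3, 4)); // prints [6, 10, 14]
  }
}
```

Because each column has its own queue, a slow read on one column does not block decoding on the others. The sketch omits error propagation from the producer, which a real reader would need.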





[GitHub] [parquet-mr] ggershinsky commented on a diff in pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-08-16 Thread GitBox


ggershinsky commented on code in PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#discussion_r946662874


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -126,6 +127,42 @@ public class ParquetFileReader implements Closeable {
 
   public static String PARQUET_READ_PARALLELISM = "parquet.metadata.read.parallelism";
 
+  public static int numProcessors = Runtime.getRuntime().availableProcessors();
+
+  // Thread pool to read column chunk data from disk. Applications should call setAsyncIOThreadPool
+  // to initialize this with their own implementations.
+  // Default initialization is useful only for testing
+  public static ExecutorService ioThreadPool = Executors.newCachedThreadPool(
+    r -> new Thread(r, "parquet-io"));
+
+  // Thread pool to process pages for multiple columns in parallel. Applications should call
+  // setAsyncProcessThreadPool to initialize this with their own implementations.
+  // Default initialization is useful only for testing
+  public static ExecutorService processThreadPool = Executors.newCachedThreadPool(

Review Comment:
   Not sure; it looks like many tests use copy/paste rather than extension.


