date:20221121

[jira] [Commented] (PARQUET-2149) Implement async IO for Parquet file reader

2022-11-21 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17636956#comment-17636956
 ] 

ASF GitHub Bot commented on PARQUET-2149:
-

parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1322881480

   Let me verify that merging the thread pools is not dangerous and post an 
update. After that if be great if you can take it.




> Implement async IO for Parquet file reader
> --
>
> Key: PARQUET-2149
> URL: https://issues.apache.org/jira/browse/PARQUET-2149
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Parth Chandra
>Priority: Major
>
> ParquetFileReader's implementation has the following flow (simplified) - 
>       - For every column -> Read from storage in 8MB blocks -> Read all 
> uncompressed pages into output queue 
>       - From output queues -> (downstream ) decompression + decoding
> This flow is serialized, which means that downstream threads are blocked 
> until the data has been read. Because a large part of the time spent is 
> waiting for data from storage, threads are idle and CPU utilization is really 
> low.
> There is no reason why this cannot be made asynchronous _and_ parallel. So 
> For Column _i_ -> reading one chunk until end, from storage -> intermediate 
> output queue -> read one uncompressed page until end -> output queue -> 
> (downstream ) decompression + decoding
> Note that this can be made completely self contained in ParquetFileReader and 
> downstream implementations like Iceberg and Spark will automatically be able 
> to take advantage without code change as long as the ParquetFileReader apis 
> are not changed. 
> In past work with async io  [Drill - async page reader 
> |https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java]
>  , I have seen 2x-3x improvement in reading speed for Parquet files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-11-21 Thread GitBox



parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1322881480

   Let me verify that merging the thread pools is not dangerous and post an 
update. After that if be great if you can take it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [parquet-mr] theosib-amazon commented on a diff in pull request #960: Performance optimization: Move all LittleEndianDataInputStream functionality into ByteBufferInputStream

2022-11-21 Thread GitBox



theosib-amazon commented on code in PR #960:
URL: https://github.com/apache/parquet-mr/pull/960#discussion_r1021873458


##
parquet-common/src/main/java/org/apache/parquet/bytes/ByteBufferInputStream.java:
##
@@ -157,4 +165,80 @@ public void reset() throws IOException {
   public boolean markSupported() {
 return delegate.markSupported();
   }
+
+  public boolean readBoolean() throws IOException {
+return readByte() != 0;
+  }
+
+  public byte readByte() throws IOException {
+return delegate.readByte();
+  }
+
+  public int readUnsignedByte() throws IOException {
+return delegate.readUnsignedByte();
+  }
+
+  public short readShort() throws IOException {

Review Comment:
   I use the new batch read methods heavily in some optimizations I made to 
Trino. As for short, I can't say I recall any uses in Trino of readShorts(). 
readShort() is used indirectly through a method that reads a variable sized 
representation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Commented] (PARQUET-2149) Implement async IO for Parquet file reader

2022-11-21 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17636711#comment-17636711
 ] 

ASF GitHub Bot commented on PARQUET-2149:
-

wgtmac commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1322172984

   > > > IMO, switching `ioThreadPool` and `processThreadPool` the reader 
instance level will make it more flexible.
   > > 
   > > 
   > > I've changed the thread pool so that it is not initialized by default 
but I left them as static members. Ideally, there should be a single IO thread 
pool that handles all the IO for a process and the size of the pool is 
determined by the bandwidthof the underlying storage system. Making them per 
instance is not an issue though. The calling code can decide to set the same 
thread pool for all instances and achieve the same result. Let me update this.
   > > Also, any changes you want to make are fine with me, and the help is 
certainly appreciated !
   > 
   > I'm thinking of merging the thread pools into a single `ioThreadPool` and 
making it settable thru `ParquetReadOptions` (like the allocator is). The work 
being done by the `processThreadPool` is rather small and maybe we can do away 
with it. Adding the pool via `ParquetReadOptions` makes it easier to use with 
`ParquetReader` (used a lot in unit tests). WDYT?
   
   Sorry for my late reply.
   
   Setting the thread pools via `ParquetReadOptions` is a good idea and that is 
exactly the way I want to do them away with static members. Merging 
`ioThreadPool` and `processThreadPool` into a single pool should work if the 
tasks in the `processThreadPool` do not wait for the return of tasks in the 
`ioThreadPool`. I will look into the detail later.
   
   BTW, I don't have the permission to directly update your PR in place as I am 
not yet a maintainer of the repo. I may need to open a new one by copying what 
you have done here and add you as a co-author. WDYT? If that sounds good to 
you, I can proceed. @parthchandra 




> Implement async IO for Parquet file reader
> --
>
> Key: PARQUET-2149
> URL: https://issues.apache.org/jira/browse/PARQUET-2149
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Parth Chandra
>Priority: Major
>
> ParquetFileReader's implementation has the following flow (simplified) - 
>       - For every column -> Read from storage in 8MB blocks -> Read all 
> uncompressed pages into output queue 
>       - From output queues -> (downstream ) decompression + decoding
> This flow is serialized, which means that downstream threads are blocked 
> until the data has been read. Because a large part of the time spent is 
> waiting for data from storage, threads are idle and CPU utilization is really 
> low.
> There is no reason why this cannot be made asynchronous _and_ parallel. So 
> For Column _i_ -> reading one chunk until end, from storage -> intermediate 
> output queue -> read one uncompressed page until end -> output queue -> 
> (downstream ) decompression + decoding
> Note that this can be made completely self contained in ParquetFileReader and 
> downstream implementations like Iceberg and Spark will automatically be able 
> to take advantage without code change as long as the ParquetFileReader apis 
> are not changed. 
> In past work with async io  [Drill - async page reader 
> |https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java]
>  , I have seen 2x-3x improvement in reading speed for Parquet files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[GitHub] [parquet-mr] wgtmac commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-11-21 Thread GitBox



wgtmac commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1322172984

   > > > IMO, switching `ioThreadPool` and `processThreadPool` the reader 
instance level will make it more flexible.
   > > 
   > > 
   > > I've changed the thread pool so that it is not initialized by default 
but I left them as static members. Ideally, there should be a single IO thread 
pool that handles all the IO for a process and the size of the pool is 
determined by the bandwidthof the underlying storage system. Making them per 
instance is not an issue though. The calling code can decide to set the same 
thread pool for all instances and achieve the same result. Let me update this.
   > > Also, any changes you want to make are fine with me, and the help is 
certainly appreciated !
   > 
   > I'm thinking of merging the thread pools into a single `ioThreadPool` and 
making it settable thru `ParquetReadOptions` (like the allocator is). The work 
being done by the `processThreadPool` is rather small and maybe we can do away 
with it. Adding the pool via `ParquetReadOptions` makes it easier to use with 
`ParquetReader` (used a lot in unit tests). WDYT?
   
   Sorry for my late reply.
   
   Setting the thread pools via `ParquetReadOptions` is a good idea and that is 
exactly the way I want to do them away with static members. Merging 
`ioThreadPool` and `processThreadPool` into a single pool should work if the 
tasks in the `processThreadPool` do not wait for the return of tasks in the 
`ioThreadPool`. I will look into the detail later.
   
   BTW, I don't have the permission to directly update your PR in place as I am 
not yet a maintainer of the repo. I may need to open a new one by copying what 
you have done here and add you as a co-author. WDYT? If that sounds good to 
you, I can proceed. @parthchandra 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Commented] (PARQUET-2149) Implement async IO for Parquet file reader

[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

[GitHub] [parquet-mr] theosib-amazon commented on a diff in pull request #960: Performance optimization: Move all LittleEndianDataInputStream functionality into ByteBufferInputStream

[jira] [Commented] (PARQUET-2149) Implement async IO for Parquet file reader

[GitHub] [parquet-mr] wgtmac commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

5 matches

Site Navigation

Mail list logo

Footer information