[GitHub] [parquet-mr] theosib-amazon commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-05-25 Thread GitBox


theosib-amazon commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1137610598

   That batch reader in Presto reminds me of some experimental changes I made 
in Trino. I modified PrimitiveColumnReader to work out how many of each data 
item it needs to read from the data source and to request all of them at once 
as an array. This doubled the performance of some TPC-DS queries, which is why 
I have array access methods planned for ParquetMR 
(https://docs.google.com/document/d/1fBGpF_LgtfaeHnPD5CFEIpA2Ga_lTITmFdFIcO9Af-g/edit?usp=sharing). 
Requesting data in bulk saves a lot of per-item function call overhead.
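
To illustrate the difference between per-item and bulk access, here is a minimal sketch using only `java.nio`. It is not the Trino or ParquetMR code: the class `BulkReadSketch` and its methods are hypothetical names, and it assumes a page of fixed-width, little-endian 32-bit integers.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class BulkReadSketch {
    // Per-item path: one method call (and its bounds check) for every value.
    public static int[] readOneByOne(ByteBuffer page, int count) {
        // duplicate() resets the byte order to big-endian, so set it again.
        ByteBuffer in = page.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            out[i] = in.getInt();
        }
        return out;
    }

    // Bulk path: work out the count up front, then copy all values in one call.
    public static int[] readBulk(ByteBuffer page, int count) {
        ByteBuffer in = page.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        int[] out = new int[count];
        in.asIntBuffer().get(out, 0, count);
        return out;
    }

    // Hypothetical helper: builds a little-endian buffer of ints for the demo.
    public static ByteBuffer encode(int... values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (int v : values) {
            buf.putInt(v);
        }
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer page = encode(7, 11, 13, 17);
        // Both paths decode the same values; only the call pattern differs.
        System.out.println(Arrays.equals(readOneByOne(page, 4), readBulk(page, 4)));
    }
}
```

Both methods return the same array; the bulk version simply replaces `count` calls with one, which is the kind of saving described above. Any actual speedup would depend on the reader and the workload.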


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] theosib-amazon commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-05-24 Thread GitBox


theosib-amazon commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1136528207

   > > Is byte (and arrays and buffers of bytes) the only datatype you support? 
My PR is optimizing code paths that pull ints, longs, and other sizes out of 
the data buffers. Are those not necessary for any of the situations where 
you're using an async buffer?
   > > The input stream API is generally unaware of the datatypes of its 
contents and so those are the only APIs I use. The other reason is that the 
ParquetFileReader returns Pages which basically contain metadata and 
ByteBuffers of _compressed_ data. The decompression and decoding into types 
comes much later in a downstream thread.
   > > For your PR, I don't think the AsyncMultibufferInputStream is ever 
going to be in play in the paths you're optimizing. But just in case it is, 
your type aware methods will work as is because AsyncMultibufferInputStream is 
derived from MultiBufferInputStream and will inherit those methods.
   
   I'm still learning Parquet's structure. It sounds to me like these buffer 
input streams are used twice: once to get the data and decompress it, and once 
more to decode it into data structures. Is that correct? If so, you're 
optimizing one layer of processing and I'm optimizing the next layer up, and 
it's something of a coincidence that we're touching some of the same classes, 
simply because code reuse has been possible here.
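
If that reading is right, the two layers could be sketched roughly as below. This is only an illustration under stated assumptions: `java.util.zip` stands in for whatever codec a real file uses, the class `TwoStageSketch` and its methods are hypothetical, and the values are assumed to be fixed-width little-endian ints.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class TwoStageSketch {
    // Layer 1: compressed page bytes in, raw bytes out. No knowledge of types.
    public static ByteBuffer decompress(byte[] compressed, int uncompressedSize) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            byte[] raw = new byte[uncompressedSize];
            int n = 0;
            while (n < raw.length && !inflater.finished()) {
                int got = inflater.inflate(raw, n, raw.length - n);
                if (got == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unsupported input
                }
                n += got;
            }
            return ByteBuffer.wrap(raw, 0, n);
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt page", e);
        } finally {
            inflater.end();
        }
    }

    // Layer 2: raw bytes in, typed values out. This is where ints, longs, etc. appear.
    public static int[] decodeInts(ByteBuffer raw) {
        ByteBuffer in = raw.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        int[] values = new int[in.remaining() / Integer.BYTES];
        in.asIntBuffer().get(values);
        return values;
    }

    // Hypothetical helper: produces a compressed "page" for the demo.
    public static byte[] compress(int... values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (int v : values) {
            buf.putInt(v);
        }
        Deflater deflater = new Deflater();
        deflater.setInput(buf.array());
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] tmp = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(tmp);
            out.write(tmp, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] page = compress(1, 2, 3);
        int[] values = decodeInts(decompress(page, 3 * Integer.BYTES));
        System.out.println(Arrays.toString(values));
    }
}
```

The point of the split is that `decompress` only ever sees bytes, while `decodeInts` is the first place a datatype matters, so the two stages can be optimized independently even if they share buffer classes.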





[GitHub] [parquet-mr] theosib-amazon commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-05-24 Thread GitBox


theosib-amazon commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1136526711

   This is interesting, because when I did profiling of Trino, I found that 
although I/O (from S3, over the network no less) was significant, even more 
time was spent in compute. Maybe you're getting improved performance because 
you're increasing *parallelism* between I/O and compute.
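
As a rough illustration of what overlapping I/O and compute can look like, the sketch below fetches the next chunk on a separate thread while the current one is being decoded. It is not the PR's implementation: `PrefetchSketch`, `fetchChunk`, and `decode` are hypothetical stand-ins for a slow read (for example from S3) and for decompression and decoding work.

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PrefetchSketch {
    // Hypothetical stand-in for a slow read, e.g. a ranged request to remote storage.
    public static byte[] fetchChunk(int index) {
        byte[] chunk = new byte[4];
        Arrays.fill(chunk, (byte) index);
        return chunk;
    }

    // Hypothetical stand-in for CPU-bound decompression and decoding.
    public static long decode(byte[] chunk) {
        long sum = 0;
        for (byte b : chunk) {
            sum += b;
        }
        return sum;
    }

    // Decodes chunk i on the calling thread while chunk i+1 is fetched on the I/O thread.
    public static long readAll(int chunkCount) {
        if (chunkCount <= 0) {
            return 0;
        }
        ExecutorService io = Executors.newSingleThreadExecutor();
        try {
            long total = 0;
            CompletableFuture<byte[]> pending =
                    CompletableFuture.supplyAsync(() -> fetchChunk(0), io);
            for (int i = 0; i < chunkCount; i++) {
                byte[] current = pending.join(); // blocks only if I/O is behind
                if (i + 1 < chunkCount) {
                    final int next = i + 1;
                    pending = CompletableFuture.supplyAsync(() -> fetchChunk(next), io);
                }
                total += decode(current); // runs while the next fetch is in flight
            }
            return total;
        } finally {
            io.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(readAll(4));
    }
}
```

With this shape, total time tends toward the larger of the I/O time and the compute time rather than their sum, which would be consistent with a speedup even when compute dominates the profile.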





[GitHub] [parquet-mr] theosib-amazon commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-05-18 Thread GitBox


theosib-amazon commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1130275378

   @parthchandra One thing that confuses me a bit is that these buffers have 
only ByteBuffer inside them. There's no actual I/O, so it's not possible to 
block. Do you have subclasses that provide some sort of access to real I/O?





[GitHub] [parquet-mr] theosib-amazon commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-05-18 Thread GitBox


theosib-amazon commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1130176799

   @parthchandra Would you mind having a look at my I/O performance 
optimization plan for ParquetMR? I think we should coordinate, since some of 
our ideas might touch the same code.
   
https://docs.google.com/document/d/1fBGpF_LgtfaeHnPD5CFEIpA2Ga_lTITmFdFIcO9Af-g/edit?usp=sharing
   

