[GitHub] [parquet-mr] steveloughran commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2023-05-17 Thread via GitHub
steveloughran commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1551181414 > FWIW the hadoop 3.3.5 vector io changes might make this PR redundant. on those stores which do it well (s3a, native filesystem); until gcs and abfs add it they'll benefit from

[GitHub] [parquet-mr] steveloughran commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2023-04-12 Thread via GitHub
steveloughran commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1505269164 @hazelnutsgz hadoop 3.3.5 supports vector IO on an s3 stream; async parallel fetch of blocks, which also works on local fs (and with gcs, abfs TODO items). we see significant

[GitHub] [parquet-mr] steveloughran commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-06-13 Thread GitBox
steveloughran commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1153924743 (I could of course add those probes into the shim class, so at least that access of internals was in one place)

[GitHub] [parquet-mr] steveloughran commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-06-13 Thread GitBox
steveloughran commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1153923501 bq. perhaps check if the ByteBufferReadable interface is implemented in the stream? The requirement for the `hasCapability("in:readbytebuffer")` to return true postdates

[GitHub] [parquet-mr] steveloughran commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-06-09 Thread GitBox
steveloughran commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1151417126 I've started work on an fs-api-shim library, with the goal of "apps compiled against hadoop 3.2.0 can get access to the 3.3 and 3.4 APIs when available either with transparent

[GitHub] [parquet-mr] steveloughran commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-05-24 Thread GitBox
steveloughran commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1136465506 > At this point the bottlenecks in parquet begin to move towards decompression and decoding but IO remains the slowest link in the chain. Latency is the killer; in an HTTP

[GitHub] [parquet-mr] steveloughran commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-05-24 Thread GitBox
steveloughran commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1135585289 > I was working with s3a > Spark 3.2.1 > Hadoop (Hadoop-aws) 3.3.2 > AWS SDK 1.11.655 thanks, that means you are current with all shipping improvements. the main

[GitHub] [parquet-mr] steveloughran commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-05-23 Thread GitBox
steveloughran commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1134843705 1. whose s3 client was used for testing here? if the s3a one, which hadoop release? 2. the azure abfs and gcs connectors do async prefetching of the next block, but are simply