steveloughran commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1551181414
> FWIW the hadoop 3.3.5 vector io changes might make this PR redundant.
on those stores which do it well (s3a, native filesystem); until gcs and
abfs add it they'll benefit from
steveloughran commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1505269164
@hazelnutsgz hadoop 3.3.5 supports vector IO on an s3 stream; async parallel
fetch of blocks, which also works on local fs (and with gcs, abfs TODO items).
we see significant
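
The "async parallel fetch of blocks" idea can be illustrated on a local file without any Hadoop dependency. The following is a minimal sketch, not the Hadoop vector IO API: it uses plain `java.nio` positional reads scheduled through `CompletableFuture`, and the `Range` type, the `readRanges` helper, and the class name are hypothetical names introduced only for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Sketch only: parallel range reads on a local file using the JDK, not Hadoop. */
public class ParallelRangeReadSketch {

    /** Hypothetical value type: a byte range (offset, length) within a file. */
    public static final class Range {
        final long offset;
        final int length;

        public Range(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Starts one asynchronous positional read per range; futures are returned in range order. */
    public static List<CompletableFuture<ByteBuffer>> readRanges(FileChannel channel, List<Range> ranges) {
        List<CompletableFuture<ByteBuffer>> results = new ArrayList<>();
        for (Range r : ranges) {
            results.add(CompletableFuture.supplyAsync(() -> readFully(channel, r)));
        }
        return results;
    }

    /** Fills a buffer of exactly r.length bytes starting at r.offset. */
    private static ByteBuffer readFully(FileChannel channel, Range r) {
        ByteBuffer buf = ByteBuffer.allocate(r.length);
        try {
            long pos = r.offset;
            while (buf.hasRemaining()) {
                // Positional reads do not move the channel's own position,
                // so several of them can run concurrently on one channel.
                int n = channel.read(buf, pos);
                if (n < 0) {
                    throw new IOException("EOF before range was filled");
                }
                pos += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        buf.flip();
        return buf;
    }

    /** Writes a small temp file, fetches two ranges in parallel, and joins the results. */
    public static String demo() {
        try {
            Path file = Files.createTempFile("ranges", ".bin");
            try {
                Files.write(file, "0123456789abcdefghij".getBytes(StandardCharsets.US_ASCII));
                try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                    List<Range> ranges = Arrays.asList(new Range(2, 3), new Range(10, 4));
                    StringBuilder sb = new StringBuilder();
                    for (CompletableFuture<ByteBuffer> f : readRanges(ch, ranges)) {
                        sb.append(StandardCharsets.US_ASCII.decode(f.join())).append('|');
                    }
                    return sb.toString();
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "234|abcd|"
    }
}
```

The caller gets one future per range and decides when to block, which is the property that lets a store overlap the latency of several fetches instead of paying for them one after another.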
steveloughran commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1153924743
(I could of course add those probes into the shim class, so at least that
access of internals was in one place)
--
This is an automated message from the Apache Git Service.
To
steveloughran commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1153923501
> perhaps check if the ByteBufferReadable interface is implemented in the
> stream?
The requirement for the `hasCapability("in:readbytebuffer")` to return true
postdates
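
The two checks discussed in this comment, the `hasCapability("in:readbytebuffer")` probe and the `ByteBufferReadable` interface test, can be combined so that a stream which does not declare the capability is still tested for the interface. The following is a minimal sketch under stated assumptions: the nested `StreamCapabilities` and `ByteBufferReadable` interfaces are simplified stand-ins for the Hadoop types of the same name so the example compiles on its own, and `supportsByteBufferRead` is a hypothetical helper, not code from the PR.

```java
import java.nio.ByteBuffer;

/** Sketch only: the nested interfaces are stand-ins for the Hadoop types, not the real ones. */
public class ReadByteBufferProbe {

    /** Stand-in for org.apache.hadoop.fs.StreamCapabilities. */
    public interface StreamCapabilities {
        boolean hasCapability(String capability);
    }

    /** Stand-in for org.apache.hadoop.fs.ByteBufferReadable. */
    public interface ByteBufferReadable {
        int read(ByteBuffer buf);
    }

    /** Capability name quoted in the comment above. */
    public static final String READBYTEBUFFER = "in:readbytebuffer";

    /**
     * Returns true if the stream declares the capability, or failing that,
     * if it implements the ByteBufferReadable interface.
     */
    public static boolean supportsByteBufferRead(Object stream) {
        if (stream instanceof StreamCapabilities
                && ((StreamCapabilities) stream).hasCapability(READBYTEBUFFER)) {
            return true;
        }
        // Fallback for streams that implement the interface without declaring the capability.
        return stream instanceof ByteBufferReadable;
    }

    public static void main(String[] args) {
        Object declares = (StreamCapabilities) cap -> READBYTEBUFFER.equals(cap);
        Object interfaceOnly = (ByteBufferReadable) buf -> 0;
        Object plain = new Object();

        System.out.println(supportsByteBufferRead(declares));      // true
        System.out.println(supportsByteBufferRead(interfaceOnly)); // true
        System.out.println(supportsByteBufferRead(plain));         // false
    }
}
```

In real Hadoop code the interface test would need to be applied to the wrapped stream (`FSDataInputStream.getWrappedStream()`), because `FSDataInputStream` itself declares `ByteBufferReadable` regardless of what it wraps.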
steveloughran commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1151417126
I've started work on a fs-api-shim library, with the goal of "apps compiled
against hadoop 3.2.0 can get access to the 3.3 and 3.4 APIs when available
either with transparent
steveloughran commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1136465506
> At this point the bottlenecks in parquet begin to move towards
decompression and decoding but IO remains the slowest link in the chain.
Latency is the killer; in an HTTP
steveloughran commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1135585289
> I was working with s3a
> Spark 3.2.1
> Hadoop (Hadoop-aws) 3.3.2
> AWS SDK 1.11.655
Thanks, that means you are current with all shipping improvements. The main
steveloughran commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1134843705
1. Whose s3 client was used for testing here? If the s3a one, which hadoop
release?
2. The azure abfs and gcs connectors do async prefetching of the next block,
but are simply