[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r417626444

## File path: cpp/src/parquet/file_reader.cc

```diff
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
     file_metadata_ = std::move(metadata);
   }

+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared<::arrow::io::internal::ReadRangeCache>(source_, options);
```

Review comment:

Agreed. Memory usage is part of why, in the original discussion, we mentioned the I/O concurrency manager component, which among other things would let an application trade off read performance against bandwidth (I brought this up again today on the mailing list).

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r413819271

## File path: cpp/src/parquet/file_reader.cc

```diff
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
     file_metadata_ = std::move(metadata);
   }

+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared<::arrow::io::internal::ReadRangeCache>(source_, options);
```

Review comment:

GitHub acting up, sorry for the delay...

> > Correct me if I'm wrong, though, but even that doesn't help much without more refactoring, since reading is organized along columns.
>
> That was my original point, which you said was irrelevant...

Apologies, I remembered you had made this point on JIRA or the mailing list and was trying to find where (I still can't find it :/). I admit my understanding of the internals is poor, and your comment helped me understand what to look for.

> I'm afraid I don't understand you. How do you "read one row group at a time" if the cache is shared at the FileReader level? Do you mean creating a new Parquet file reader for each row group?

Sorry, I was thinking of creating RowGroupReaders from a FileReader, one row group at a time: https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10FileReader8RowGroupEi

Each time this is called, it internally recreates the cache and buffers a new set of ranges. (Hence the comment above about "unsafe".) So this would let you read a row group's worth of data at a time, at the cost of more complexity for the user.

> To be clear, I don't think this concern should block the PR, but the docstring should warn about it.

I'll update it.
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r413798078

## File path: cpp/src/arrow/filesystem/s3fs_benchmark.cc

```diff
@@ -331,10 +358,64 @@ BENCHMARK_DEFINE_F(MinioFixture, ReadCoalesced500Mib)(benchmark::State& st) {
 }
 BENCHMARK_REGISTER_F(MinioFixture, ReadCoalesced500Mib)->UseRealTime();

-BENCHMARK_DEFINE_F(MinioFixture, ReadParquet250K)(benchmark::State& st) {
-  ParquetRead(st, fs_.get(), bucket_ + "/pq_c100_r250k");
-}
-BENCHMARK_REGISTER_F(MinioFixture, ReadParquet250K)->UseRealTime();
+// Helpers to generate multiple benchmarks for a given Parquet file.
+//
+// NAME: the base name of the benchmark.
+// ROWS: the number of rows in the Parquet file.
+// COLS: the number of columns in the Parquet file.
+// STRATEGY: how to read the file (ReadTable or GetRecordBatchReader)
+#define PQ_BENCHMARK_IMPL(NAME, ROWS, COLS, STRATEGY)                                \
+  BENCHMARK_DEFINE_F(MinioFixture, NAME##STRATEGY##AllNaive)(benchmark::State & st) { \
+    std::vector<int> column_indices(COLS);                                           \
+    std::iota(column_indices.begin(), column_indices.end(), 0);                      \
+    std::stringstream ss;                                                            \
+    ss << bucket_ << "/pq_c" << COLS << "_r" << ROWS << "k";                         \
+    ParquetRead(st, fs_.get(), ss.str(), column_indices, false, #STRATEGY);          \
```

Review comment:

Good idea, I've created a set of helper functions that set things up and then call the benchmark.
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r413797661

## File path: cpp/src/parquet/file_reader.cc

```diff
@@ -536,6 +577,14 @@ std::shared_ptr<RowGroupReader> ParquetFileReader::RowGroup(int i) {
   return contents_->GetRowGroup(i);
 }

+void ParquetFileReader::PreBuffer(const std::vector<int>& row_groups,
+                                  const std::vector<int>& column_indices,
+                                  const ::arrow::io::CacheOptions& options) {
+  // Access private methods here
+  SerializedFile* file = static_cast<SerializedFile*>(contents_.get());
```

Review comment:

It's not possible, but I made it a checked cast regardless.
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r413792469

## File path: cpp/src/parquet/file_reader.cc

```diff
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
     file_metadata_ = std::move(metadata);
   }

+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared<::arrow::io::internal::ReadRangeCache>(source_, options);
```

Review comment:

That's fair. In our case, even if a dataset is large, individual files are smaller, so it's fine to buffer an entire file and then discard it. But I agree that for other use cases, this is not ideal. A caller who is very concerned about memory might instead choose to explicitly read only one row group at a time. This is rather annoying, though.

We could create a separate cache per row group, but then we lose some performance, since we can no longer coalesce reads across row groups. That might be a worthwhile tradeoff for large files, however. Correct me if I'm wrong, though, but even that doesn't help much without more refactoring, since reading is organized along columns.
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r413763701

## File path: cpp/src/parquet/file_reader.cc

```diff
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
     file_metadata_ = std::move(metadata);
  }

+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared<::arrow::io::internal::ReadRangeCache>(source_, options);
```

Review comment:

No, on the contrary: there's only one instance of `Contents`, and hence a single shared cache right now (shared between all reads of the same file). However, lots of instances of `RowGroupReader::Contents` get created (one per row group per column), so it's not easy to cache each row group separately.

Perhaps I'm missing the point: what you'd like is a way to stream record batches out of a file, and have it internally clean up memory for each row group once the data has been fully read, right? (And not pre-buffer more than some fixed number of row groups ahead of the current one.)
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r413759177

## File path: cpp/src/parquet/file_reader.cc

```diff
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
     file_metadata_ = std::move(metadata);
   }

+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared<::arrow::io::internal::ReadRangeCache>(source_, options);
```

Review comment:

The other option would be to separate the I/O more fully and allow callers to pass a cache for a given read, with `PreBuffer` instead pre-warming a supplied cache for the given row groups/column indices. Then you would have full control over the cache's lifetime, at the expense of much more complicated code (both for the caller and internally).
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r413757274

## File path: cpp/src/parquet/file_reader.cc

```diff
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
     file_metadata_ = std::move(metadata);
   }

+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared<::arrow::io::internal::ReadRangeCache>(source_, options);
```

Review comment:

It depends: reading multiple row groups at once might cache chunks that span row groups (and in general you want more opportunities to coalesce reads if you are aiming for throughput). For more control over that, a consumer could make repeated calls to `FileReader->RowGroup` to get individual RowGroupReaders that would each cache only a single row group.
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r413744080

## File path: cpp/src/parquet/file_reader.cc

```diff
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
     file_metadata_ = std::move(metadata);
   }

+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared<::arrow::io::internal::ReadRangeCache>(source_, options);
```

Review comment:

We could add a method to clear/remove the cache, or the user could recreate the ParquetFileReader (as the cache is shared among all readers created from the file anyway).
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r412987199

## File path: cpp/src/parquet/properties.h

```diff
@@ -56,10 +60,32 @@ class PARQUET_EXPORT ReaderProperties {
   bool is_buffered_stream_enabled() const { return buffered_stream_enabled_; }

+  bool is_coalesced_stream_enabled() const { return coalesced_stream_enabled_; }
+
   void enable_buffered_stream() { buffered_stream_enabled_ = true; }

   void disable_buffered_stream() { buffered_stream_enabled_ = false; }

+  /// Enable read coalescing.
+  ///
+  /// When enabled, readers can optionally call
+  /// ParquetFileReader.PreBuffer to cache the necessary slices of the
+  /// file in-memory before deserialization. Arrow readers
+  /// automatically do this. This is intended to increase performance
+  /// when reading from high-latency filesystems (e.g. Amazon S3).
+  ///
+  /// When this is enabled, it is NOT SAFE to concurrently create
+  /// RecordBatchReaders from the same file.
```

Review comment:

I've updated the wording; it's not unsafe (in that it won't crash), but it'll fail if you try to read a row group/column chunk that wasn't cached. This is because there's one shared cache that is not specific to a particular reader.

Looking through the code, we could make this work better by avoiding a PreBuffer method and instead threading the cache all the way from the Arrow readers to the Parquet implementation. I think that's possible, but it would require passing the cache between several classes/methods. Do you think that's worth it?
[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group
lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r412985177

## File path: cpp/src/parquet/file_reader.h

```diff
@@ -117,6 +117,15 @@ class PARQUET_EXPORT ParquetFileReader {
   // Returns the file metadata. Only one instance is ever created
   std::shared_ptr<FileMetaData> metadata() const;

+  /// Pre-buffer the specified column indices in all row groups.
+  ///
+  /// Only has an effect if ReaderProperties.is_coalesced_stream_enabled is set;
+  /// otherwise this is a no-op. The reader internally maintains a cache which is
+  /// overwritten each time this is called. Intended to increase performance on
+  /// high-latency filesystems (e.g. Amazon S3).
+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices);
```

Review comment:

I changed this to accept the CacheOptions struct here, which gives the caller full control each time they invoke it.

## File path: cpp/src/parquet/file_reader.h

```diff
+  /// Only has an effect if ReaderProperties.is_coalesced_stream_enabled is set;
```

Review comment:

Yes, that makes more sense, combined with moving the flag to ArrowReaderProperties.