[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group

2020-04-29 Thread GitBox


lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r417626444



##
File path: cpp/src/parquet/file_reader.cc
##
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
 file_metadata_ = std::move(metadata);
   }
 
+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared<::arrow::io::internal::ReadRangeCache>(source_, options);

Review comment:
   Agreed. Memory usage is part of why, in the original discussion, we 
mentioned the I/O concurrency manager component, which among other things 
would let an application trade off read performance and bandwidth (I 
brought this up again today on the mailing list).









[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group

2020-04-23 Thread GitBox


lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r413819271



##
File path: cpp/src/parquet/file_reader.cc
##
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
 file_metadata_ = std::move(metadata);
   }
 
+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared<::arrow::io::internal::ReadRangeCache>(source_, options);

Review comment:
   GitHub acting up, sorry for the delay...
   
   > > Correct me if I'm wrong, though, but even that doesn't help much without 
more refactoring, since reading is organized along columns.
   > 
   > That was my original point, which you said was irrelevant...
   
   Apologies; I remembered you had made this point on JIRA or the mailing list 
and was trying to find where (I still can't find it :/). I admit my 
understanding of the internals is poor, and your comment helped me better 
understand what to look for.
   
   > I'm afraid I don't understand you. How do you "read one row group at a 
time" if the cache is shared at the FileReader level? Do you mean creating a 
new Parquet file reader for each row group?
   
   Sorry, I was thinking about creating RowGroupReaders from a FileReader, one 
row group at a time: 
https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10FileReader8RowGroupEi
   
   Each time this is called, it internally recreates the cache and buffers a 
new set of ranges. (Hence the comment above about "unsafe".) So this would let 
you read a row group's worth of data at a time, at the cost of more complexity 
for the user.
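   
   As a concrete sketch (my illustration, not code from this PR), the pattern 
looks like this, assuming an already-opened `parquet::arrow::FileReader`:
   
       // Read one row group at a time to bound memory usage.
       #include "arrow/status.h"
       #include "arrow/table.h"
       #include "parquet/arrow/reader.h"
       
       arrow::Status ReadByRowGroup(parquet::arrow::FileReader* reader) {
         for (int i = 0; i < reader->num_row_groups(); ++i) {
           std::shared_ptr<arrow::Table> table;
           // Each RowGroup(i) call re-buffers only that row group's byte ranges.
           ARROW_RETURN_NOT_OK(reader->RowGroup(i)->ReadTable(&table));
           // ... process `table`, then let it go out of scope ...
         }
         return arrow::Status::OK();
       }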
   
   > To be clear, I don't think this concern should block the PR, but the 
docstring should warn about it.
   
   I'll update it. 









[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group

2020-04-23 Thread GitBox


lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r413798078



##
File path: cpp/src/arrow/filesystem/s3fs_benchmark.cc
##
@@ -331,10 +358,64 @@ BENCHMARK_DEFINE_F(MinioFixture, ReadCoalesced500Mib)(benchmark::State& st) {
 }
 BENCHMARK_REGISTER_F(MinioFixture, ReadCoalesced500Mib)->UseRealTime();
 
-BENCHMARK_DEFINE_F(MinioFixture, ReadParquet250K)(benchmark::State& st) {
-  ParquetRead(st, fs_.get(), bucket_ + "/pq_c100_r250k");
-}
-BENCHMARK_REGISTER_F(MinioFixture, ReadParquet250K)->UseRealTime();
+// Helpers to generate various benchmarks for a given Parquet file.
+
+// NAME: the base name of the benchmark.
+// ROWS: the number of rows in the Parquet file.
+// COLS: the number of columns in the Parquet file.
+// STRATEGY: how to read the file (ReadTable or GetRecordBatchReader)
+#define PQ_BENCHMARK_IMPL(NAME, ROWS, COLS, STRATEGY)                                \
+  BENCHMARK_DEFINE_F(MinioFixture, NAME##STRATEGY##AllNaive)(benchmark::State& st) { \
+    std::vector<int> column_indices(COLS);                                           \
+    std::iota(column_indices.begin(), column_indices.end(), 0);                      \
+    std::stringstream ss;                                                            \
+    ss << bucket_ << "/pq_c" << COLS << "_r" << ROWS << "k";                         \
+    ParquetRead(st, fs_.get(), ss.str(), column_indices, false, #STRATEGY);          \

Review comment:
   Good idea. I've created a set of helper functions that set things up and 
then call the benchmark.
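   
   For illustration, the helper shape is roughly the following (the names here 
are my own, not necessarily what landed in the PR; `ParquetRead` is the 
existing benchmark routine in this file):
   
       // Hypothetical helper replacing the macro body: build the object path
       // and column list, then invoke the shared benchmark routine.
       void ParquetReadAll(benchmark::State& st, arrow::fs::FileSystem* fs,
                           const std::string& bucket, int rows, int cols,
                           bool pre_buffer, const std::string& strategy) {
         std::vector<int> column_indices(cols);
         std::iota(column_indices.begin(), column_indices.end(), 0);  // 0..cols-1
         std::stringstream ss;
         ss << bucket << "/pq_c" << cols << "_r" << rows << "k";
         ParquetRead(st, fs, ss.str(), column_indices, pre_buffer, strategy);
       }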









[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group

2020-04-23 Thread GitBox


lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r413797661



##
File path: cpp/src/parquet/file_reader.cc
##
@@ -536,6 +577,14 @@ std::shared_ptr<RowGroupReader> ParquetFileReader::RowGroup(int i) {
   return contents_->GetRowGroup(i);
 }
 
+void ParquetFileReader::PreBuffer(const std::vector<int>& row_groups,
+                                  const std::vector<int>& column_indices,
+                                  const ::arrow::io::CacheOptions& options) {
+  // Access private methods here
+  SerializedFile* file = static_cast<SerializedFile*>(contents_.get());

Review comment:
   It's not possible, but I made it a checked cast regardless.









[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group

2020-04-23 Thread GitBox


lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r413792469



##
File path: cpp/src/parquet/file_reader.cc
##
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
 file_metadata_ = std::move(metadata);
   }
 
+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared<::arrow::io::internal::ReadRangeCache>(source_, options);

Review comment:
   That's fair. In our case, even if a dataset is large, individual files 
are smaller, so it's fine to buffer an entire file and then discard it. But I 
agree that for other use cases, this is not ideal.
   
   A caller who is very concerned about memory might instead choose to 
explicitly read only one row group at a time to limit memory usage. This is 
rather annoying, though. 
   
   We could create a separate cache per row group, but this means we lose some 
performance, as we can't coalesce reads across row groups anymore. However, 
that might be a worthwhile tradeoff for large files. Correct me if I'm wrong, 
though, but even that doesn't help much without more refactoring, since reading 
is organized along columns.
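   
   As a usage sketch of that tradeoff (illustrative only, assuming this PR's 
`PreBuffer(row_groups, column_indices, options)` signature and a file with 
four row groups):
   
       auto options = ::arrow::io::CacheOptions::Defaults();
       // One shared cache: reads can be coalesced across row groups, but all
       // buffered data stays resident until the reader goes away.
       reader->PreBuffer({0, 1, 2, 3}, {0, 5, 9}, options);
       // Versus one cache per row group: bounded memory, less coalescing.
       for (int rg : {0, 1, 2, 3}) {
         reader->PreBuffer({rg}, {0, 5, 9}, options);  // replaces the previous cache
         // ... read row group rg ...
       }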









[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group

2020-04-23 Thread GitBox


lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r413763701



##
File path: cpp/src/parquet/file_reader.cc
##
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
 file_metadata_ = std::move(metadata);
   }
 
+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared<::arrow::io::internal::ReadRangeCache>(source_, options);

Review comment:
   No, on the contrary, there's only one instance of `Contents`, and hence 
a single shared cache right now (between all reads of the same file). However, 
lots of instances of `RowGroupReader::Contents` get created (one per row group 
per column), so it's not easy to cache each row group separately.
   
   Perhaps I'm missing the point: what you'd like is a way to stream record 
batches out of a file, and have it internally clean up memory for each row 
group once the data has been fully read, right? (And, not pre-buffer more than 
some fixed number of row groups ahead of the current one.)
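   
   That is, usage like the following sketch (`GetRecordBatchReader` is the 
existing Arrow API, though I'm assuming its Status-returning signature here; 
the incremental cleanup in the loop body is the part that doesn't exist yet):
   
       std::shared_ptr<arrow::RecordBatchReader> batches;
       ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader({0, 1}, {0, 5, 9}, &batches));
       std::shared_ptr<arrow::RecordBatch> batch;
       while (true) {
         ARROW_RETURN_NOT_OK(batches->ReadNext(&batch));
         if (!batch) break;  // end of stream
         // ... process batch; ideally buffers for a fully-consumed row group
         // would be released here, and only N row groups prefetched ahead ...
       }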









[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group

2020-04-23 Thread GitBox


lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r413759177



##
File path: cpp/src/parquet/file_reader.cc
##
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
 file_metadata_ = std::move(metadata);
   }
 
+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared<::arrow::io::internal::ReadRangeCache>(source_, options);

Review comment:
   The other option would be to more fully separate the I/O and allow 
callers to pass a cache for a given read, and have `PreBuffer` instead pre-warm 
a supplied cache for the given row groups/column indices. Then you would have 
full control over the cache's lifetime, at the expense of a lot more 
complicated code (both for the caller and internally).
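   
   Roughly, with a hypothetical API (this is not what the PR implements):
   
       // The caller owns the cache and controls its lifetime explicitly.
       auto cache = std::make_shared<::arrow::io::internal::ReadRangeCache>(source, options);
       reader->PreBuffer(cache.get(), row_groups, column_indices);  // pre-warm only
       // ... reads for these row groups consult *cache ...
       cache.reset();  // the caller decides when the buffered data is released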









[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group

2020-04-23 Thread GitBox


lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r413757274



##
File path: cpp/src/parquet/file_reader.cc
##
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
 file_metadata_ = std::move(metadata);
   }
 
+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared<::arrow::io::internal::ReadRangeCache>(source_, options);

Review comment:
   It depends: reading multiple row groups at once might cache chunks that 
span row groups (and in general you want more opportunities to coalesce reads 
if you are aiming for throughput). For more control over that, a consumer could 
make repeated calls to `FileReader->RowGroup` to get individual RowGroupReaders 
that would cache only a single RowGroup.
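   
   How aggressively reads are coalesced is governed by the `CacheOptions` knobs 
(a sketch; these values are illustrative, not the defaults):
   
       ::arrow::io::CacheOptions options = ::arrow::io::CacheOptions::Defaults();
       options.hole_size_limit = 8 * 1024;           // merge ranges separated by < 8 KiB
       options.range_size_limit = 32 * 1024 * 1024;  // cap any single coalesced read at 32 MiB
       reader->PreBuffer(row_groups, column_indices, options);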









[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group

2020-04-23 Thread GitBox


lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r413744080



##
File path: cpp/src/parquet/file_reader.cc
##
@@ -212,6 +237,21 @@ class SerializedFile : public ParquetFileReader::Contents {
 file_metadata_ = std::move(metadata);
   }
 
+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices,
+                 const ::arrow::io::CacheOptions& options) {
+    cached_source_ =
+        std::make_shared<::arrow::io::internal::ReadRangeCache>(source_, options);

Review comment:
   We could add a method to clear/remove the cache, or the user could 
recreate the ParquetFileReader (as the cache is shared among all readers 
created from the file anyway).
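   
   For example (a sketch, assuming the surrounding setup):
   
       std::unique_ptr<parquet::ParquetFileReader> reader =
           parquet::ParquetFileReader::Open(source, props);
       reader->PreBuffer(row_groups, column_indices, options);
       // ... read the pre-buffered data ...
       // Recreating the reader is currently the way to release the cache:
       reader = parquet::ParquetFileReader::Open(source, props);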









[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group

2020-04-22 Thread GitBox


lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r412987199



##
File path: cpp/src/parquet/properties.h
##
@@ -56,10 +60,32 @@ class PARQUET_EXPORT ReaderProperties {
 
   bool is_buffered_stream_enabled() const { return buffered_stream_enabled_; }
 
+  bool is_coalesced_stream_enabled() const { return coalesced_stream_enabled_; }
+
   void enable_buffered_stream() { buffered_stream_enabled_ = true; }
 
   void disable_buffered_stream() { buffered_stream_enabled_ = false; }
 
+  /// Enable read coalescing.
+  ///
+  /// When enabled, readers can optionally call
+  /// ParquetFileReader.PreBuffer to cache the necessary slices of the
+  /// file in-memory before deserialization. Arrow readers
+  /// automatically do this. This is intended to increase performance
+  /// when reading from high-latency filesystems (e.g. Amazon S3).
+  ///
+  /// When this is enabled, it is NOT SAFE to concurrently create
+  /// RecordBatchReaders from the same file.

Review comment:
   I've updated the wording - it's not unsafe (in that it won't crash), but 
reads will fail if you try to read a row group/column chunk that wasn't cached. 
This is because there's one shared cache that is not specific to a particular 
reader. Looking through the code, we could make this work better by avoiding a 
PreBuffer method and instead threading the cache all the way from the Arrow 
readers to the Parquet implementation. I think that's possible, but it would 
require passing the cache between several classes/methods - do you think that's 
worth it?









[GitHub] [arrow] lidavidm commented on a change in pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group

2020-04-22 Thread GitBox


lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r412985177



##
File path: cpp/src/parquet/file_reader.h
##
@@ -117,6 +117,15 @@ class PARQUET_EXPORT ParquetFileReader {
   // Returns the file metadata. Only one instance is ever created
  std::shared_ptr<FileMetaData> metadata() const;
 
+  /// Pre-buffer the specified column indices in all row groups.
+  ///
+  /// Only has an effect if ReaderProperties.is_coalesced_stream_enabled is set;
+  /// otherwise this is a no-op. The reader internally maintains a cache which is
+  /// overwritten each time this is called. Intended to increase performance on
+  /// high-latency filesystems (e.g. Amazon S3).
+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices);

Review comment:
   I changed this to accept the CacheOptions struct here, which gives the 
caller full control each time they invoke it.

##
File path: cpp/src/parquet/file_reader.h
##
@@ -117,6 +117,15 @@ class PARQUET_EXPORT ParquetFileReader {
   // Returns the file metadata. Only one instance is ever created
  std::shared_ptr<FileMetaData> metadata() const;
 
+  /// Pre-buffer the specified column indices in all row groups.
+  ///
+  /// Only has an effect if ReaderProperties.is_coalesced_stream_enabled is set;

Review comment:
   Yes - that makes more sense, combined with moving the flag to 
ArrowReaderProperties.




