[GitHub] [arrow-site] wesm commented on issue #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15
wesm commented on issue #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15 URL: https://github.com/apache/arrow-site/pull/19#issuecomment-527287458 In light of the mixed performance results the post might need a new title to reframe around the dictionary read improvements This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [arrow-site] wesm commented on issue #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15
wesm commented on issue #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15 URL: https://github.com/apache/arrow-site/pull/19#issuecomment-527286878 cc @hatemhelal @xhochy for any review. Note that we have dropped BinaryArray read performance in the non-dictionary case. Not sure why that is yet. I opened https://issues.apache.org/jira/browse/ARROW-6417 to investigate. ![20190903_parquet_read_perf](https://user-images.githubusercontent.com/329591/64141564-2b9a4b80-cdce-11e9-94ea-bfcc0dea0b23.png)
[GitHub] [arrow-site] wesm opened a new pull request #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15
wesm opened a new pull request #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15 URL: https://github.com/apache/arrow-site/pull/19 The dates will need to be changed for the actual publication date.
[arrow] branch master updated: ARROW-6411: [Python][Parquet] Improve performance of DictEncoder::PutIndices
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git

The following commit(s) were added to refs/heads/master by this push:
     new ab908cc  ARROW-6411: [Python][Parquet] Improve performance of DictEncoder::PutIndices
ab908cc is described below

commit ab908cc0486d7daf643d1a1418328566f24c403b
Author: Wes McKinney
AuthorDate: Mon Sep 2 21:40:35 2019 -0500

    ARROW-6411: [Python][Parquet] Improve performance of DictEncoder::PutIndices

    I don't really understand why this is faster, though.

    before

    ```
    Benchmark                                           Time         CPU          Iterations
    BM_ArrowBinaryDict/EncodeDictDirectInt8/1048576     7334087 ns   7333876 ns    98   136.354M items/s
    BM_ArrowBinaryDict/EncodeDictDirectInt16/1048576    7022430 ns   7022412 ns   100   142.401M items/s
    BM_ArrowBinaryDict/EncodeDictDirectInt32/1048576    7061033 ns   7060870 ns    99   141.626M items/s
    BM_ArrowBinaryDict/EncodeDictDirectInt64/1048576    7084581 ns   7084398 ns    97   141.155M items/s
    ```

    after

    ```
    Benchmark                                           Time         CPU          Iterations
    BM_ArrowBinaryDict/EncodeDictDirectInt8/1048576     4387151 ns   4387175 ns   156   227.937M items/s
    BM_ArrowBinaryDict/EncodeDictDirectInt16/1048576    4446167 ns   4446074 ns   159   224.918M items/s
    BM_ArrowBinaryDict/EncodeDictDirectInt32/1048576    4501028 ns   4500934 ns   156   222.176M items/s
    BM_ArrowBinaryDict/EncodeDictDirectInt64/1048576    4635792 ns   4635728 ns   150   215.716M items/s
    ```

    On an i9-9960X CPU before these changes perf reported that `__memmove_avx_unaligned_erms` was taking up a lot of time. In principle `std::vector::reserve` should be correct since memory is not initialized, but something weird seems to be going wrong. If anyone has any ideas I'm interested to learn more. In any case I'll stick with the empirical benchmark evidence on this.

    I started to refactor to use `TypedBufferBuilder` but I'm not sure about the performance of that for scalar appends vs. `std::vector` so I'll leave that for future experimentation.
    Closes #5248 from wesm/ARROW-6411 and squashes the following commits:

    b1159ec8a Add C++ benchmarks for DictEncoder::PutIndices
    da8cc9d79 Add C++ benchmarks
    5a73bf509 Add Python benchmark

    Authored-by: Wes McKinney
    Signed-off-by: Wes McKinney
---
 cpp/src/parquet/encoding.cc           | 16 ++--
 cpp/src/parquet/encoding_benchmark.cc | 37
 python/benchmarks/parquet.py          | 46 ++-
 3 files changed, 91 insertions(+), 8 deletions(-)

diff --git a/cpp/src/parquet/encoding.cc b/cpp/src/parquet/encoding.cc
index ef1dd34..e63d69f 100644
--- a/cpp/src/parquet/encoding.cc
+++ b/cpp/src/parquet/encoding.cc
@@ -361,7 +361,8 @@ class DictEncoderImpl : public EncoderImpl, virtual public DictEncoder {
     --buffer_len;
     arrow::util::RleEncoder encoder(buffer, buffer_len, bit_width());
-    for (int index : buffered_indices_) {
+
+    for (int32_t index : buffered_indices_) {
       if (!encoder.Put(index)) return -1;
     }
     encoder.Flush();
@@ -425,21 +426,22 @@ class DictEncoderImpl : public EncoderImpl, virtual public DictEncoder {
     using ArrayType = typename arrow::TypeTraits::ArrayType;
     const auto& indices = checked_cast(data);
     auto values = indices.raw_values();
-    buffered_indices_.reserve(
-        buffered_indices_.size() +
-        static_cast(indices.length() - indices.null_count()));
+
+    size_t buffer_position = buffered_indices_.size();
+    buffered_indices_.resize(
+        buffer_position + static_cast(indices.length() - indices.null_count()));
     if (indices.null_count() > 0) {
       arrow::internal::BitmapReader valid_bits_reader(indices.null_bitmap_data(), indices.offset(), indices.length());
       for (int64_t i = 0; i < indices.length(); ++i) {
         if (valid_bits_reader.IsSet()) {
-          buffered_indices_.push_back(static_cast(values[i]));
+          buffered_indices_[buffer_position++] = static_cast(values[i]);
         }
         valid_bits_reader.Next();
       }
     } else {
       for (int64_t i = 0; i < indices.length(); ++i) {
-        buffered_indices_.push_back(static_cast(values[i]));
+
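For readers following the commit above, the two append patterns it compares can be sketched in isolation. This is only an illustration under stated assumptions, not the Arrow implementation: the function names `AppendWithPushBack` and `AppendWithResize`, the helper `CountValid`, and the `std::vector<uint8_t>` validity mask (standing in for Arrow's null bitmap) are all hypothetical. One difference worth keeping in mind is that `std::vector::resize` value-initializes the new elements, while `reserve` only allocates capacity.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: number of non-null entries in a byte-per-value mask
// (an assumption; Arrow itself uses a packed validity bitmap).
inline size_t CountValid(const std::vector<uint8_t>& valid) {
  size_t n = 0;
  for (uint8_t v : valid) n += (v != 0);
  return n;
}

// "Before" pattern: reserve capacity once, then push_back each valid value.
// Assumes valid.size() == values.size().
template <typename IndexType>
void AppendWithPushBack(const std::vector<IndexType>& values,
                        const std::vector<uint8_t>& valid,
                        std::vector<int32_t>* out) {
  out->reserve(out->size() + CountValid(valid));
  for (size_t i = 0; i < values.size(); ++i) {
    if (valid[i]) out->push_back(static_cast<int32_t>(values[i]));
  }
}

// "After" pattern: resize once (new slots are zero-initialized), then write
// each valid value by position. Assumes valid.size() == values.size().
template <typename IndexType>
void AppendWithResize(const std::vector<IndexType>& values,
                      const std::vector<uint8_t>& valid,
                      std::vector<int32_t>* out) {
  size_t position = out->size();
  out->resize(position + CountValid(valid));
  for (size_t i = 0; i < values.size(); ++i) {
    if (valid[i]) (*out)[position++] = static_cast<int32_t>(values[i]);
  }
}
```

Both functions produce the same contents; which one is faster depends on the compiler and standard library, so any choice between them should rest on a benchmark, as the commit message does.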
[arrow] branch master updated (32ef12c -> 1c42e6f)
apitrou pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git.

    from 32ef12c  ARROW-6063: [FlightRPC] implement half-closed semantics for DoPut
     add 1c42e6f  ARROW-6141: [C++] Enable memory-mapping a file region

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/io/file.cc      | 53 ++
 cpp/src/arrow/io/file.h       |  5
 cpp/src/arrow/io/file_test.cc | 54 +++
 3 files changed, 103 insertions(+), 9 deletions(-)
[arrow] branch master updated (149d4cb -> 32ef12c)
apitrou pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git.

    from 149d4cb  ARROW-6383: [Java] Report outstanding child allocators on close
     add 32ef12c  ARROW-6063: [FlightRPC] implement half-closed semantics for DoPut

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/flight/client.cc              |  46 +--
 cpp/src/arrow/flight/client.h               |  11 ++
 cpp/src/arrow/flight/flight_benchmark.cc    | 185
 cpp/src/arrow/flight/flight_test.cc         |   1 +
 cpp/src/arrow/flight/internal.cc            |   4 +-
 cpp/src/arrow/flight/perf_server.cc         |  24
 python/pyarrow/_flight.pyx                  |   6 +
 python/pyarrow/includes/libarrow_flight.pxd |   1 +
 python/pyarrow/tests/test_flight.py         |  39 ++
 9 files changed, 259 insertions(+), 58 deletions(-)
[arrow] branch master updated (a985483 -> 149d4cb)
ravindra pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git.

    from a985483  ARROW-4752: [Rust] Add explicit SIMD vectorization for the divide kernel
     add 149d4cb  ARROW-6383: [Java] Report outstanding child allocators on close

No new revisions were added by this update.

Summary of changes:
 .../src/main/java/org/apache/arrow/memory/BaseAllocator.java | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)