[GitHub] [arrow-site] wesm commented on issue #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15

2019-09-02 Thread GitBox
wesm commented on issue #19: ARROW-6419: [Website] Blog post about Parquet C++ 
read performance improvements in Arrow 0.15
URL: https://github.com/apache/arrow-site/pull/19#issuecomment-527287458
 
 
   In light of the mixed performance results, the post might need a new title 
that reframes it around the dictionary read improvements.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [arrow-site] wesm commented on issue #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15

2019-09-02 Thread GitBox
wesm commented on issue #19: ARROW-6419: [Website] Blog post about Parquet C++ 
read performance improvements in Arrow 0.15
URL: https://github.com/apache/arrow-site/pull/19#issuecomment-527286878
 
 
   cc @hatemhelal @xhochy for any review. 
   
   Note that BinaryArray read performance has dropped in the non-dictionary 
case. Not sure why that is yet. I opened 
https://issues.apache.org/jira/browse/ARROW-6417 to investigate.
   
   
![20190903_parquet_read_perf](https://user-images.githubusercontent.com/329591/64141564-2b9a4b80-cdce-11e9-94ea-bfcc0dea0b23.png)
   




[GitHub] [arrow-site] wesm opened a new pull request #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15

2019-09-02 Thread GitBox
wesm opened a new pull request #19: ARROW-6419: [Website] Blog post about 
Parquet C++ read performance improvements in Arrow 0.15
URL: https://github.com/apache/arrow-site/pull/19
 
 
   The dates will need to be changed for the actual publication date.




[arrow] branch master updated: ARROW-6411: [Python][Parquet] Improve performance of DictEncoder::PutIndices

2019-09-02 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new ab908cc  ARROW-6411: [Python][Parquet] Improve performance of 
DictEncoder::PutIndices
ab908cc is described below

commit ab908cc0486d7daf643d1a1418328566f24c403b
Author: Wes McKinney 
AuthorDate: Mon Sep 2 21:40:35 2019 -0500

ARROW-6411: [Python][Parquet] Improve performance of DictEncoder::PutIndices

I don't really understand why this is faster, though.

before

```
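-------------------------------------------------------------------------------------------------------
Benchmark                                                Time           CPU Iterations
-------------------------------------------------------------------------------------------------------
BM_ArrowBinaryDict/EncodeDictDirectInt8/1048576       7334087 ns    7333876 ns         98   136.354M items/s
BM_ArrowBinaryDict/EncodeDictDirectInt16/1048576      7022430 ns    7022412 ns        100   142.401M items/s
BM_ArrowBinaryDict/EncodeDictDirectInt32/1048576      7061033 ns    7060870 ns         99   141.626M items/s
BM_ArrowBinaryDict/EncodeDictDirectInt64/1048576      7084581 ns    7084398 ns         97   141.155M items/s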
```

after

```
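-------------------------------------------------------------------------------------------------------
Benchmark                                                Time           CPU Iterations
-------------------------------------------------------------------------------------------------------
BM_ArrowBinaryDict/EncodeDictDirectInt8/1048576       4387151 ns    4387175 ns        156   227.937M items/s
BM_ArrowBinaryDict/EncodeDictDirectInt16/1048576      4446167 ns    4446074 ns        159   224.918M items/s
BM_ArrowBinaryDict/EncodeDictDirectInt32/1048576      4501028 ns    4500934 ns        156   222.176M items/s
BM_ArrowBinaryDict/EncodeDictDirectInt64/1048576      4635792 ns    4635728 ns        150   215.716M items/s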
```

On an i9-9960X CPU, perf reported before these changes that 
`__memmove_avx_unaligned_erms` was taking up a lot of time. In principle 
`std::vector::reserve` should be the better choice since it does not 
initialize memory, but something weird seems to be going wrong. If anyone has 
any ideas I'm interested to learn more. In any case I'll stick with the 
empirical benchmark evidence on this.

I started to refactor to use `TypedBufferBuilder` but I'm not sure 
about the performance of that for scalar appends vs. `std::vector` so I'll 
leave that for future experimentation.
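
For readers comparing the two append strategies discussed above, here is a minimal standalone sketch. It is not the Arrow code: the function names are hypothetical, and a plain `std::vector<bool>` validity mask stands in for the null bitmap that the real code walks with a `BitmapReader`. It only shows that both strategies produce the same output; it says nothing about which is faster on a given machine.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not Arrow code): reserve capacity up front, then
// push_back each non-null value. Every push_back updates the vector's size.
void AppendWithPushBack(const std::vector<int8_t>& values,
                        const std::vector<bool>& valid,
                        std::vector<int32_t>* out) {
  assert(values.size() == valid.size());
  const auto num_valid =
      static_cast<size_t>(std::count(valid.begin(), valid.end(), true));
  out->reserve(out->size() + num_valid);
  for (size_t i = 0; i < values.size(); ++i) {
    if (valid[i]) out->push_back(static_cast<int32_t>(values[i]));
  }
}

// Hypothetical helper (not Arrow code): resize once (value-initializing the
// new slots to zero), then overwrite each slot by index.
void AppendWithResize(const std::vector<int8_t>& values,
                      const std::vector<bool>& valid,
                      std::vector<int32_t>* out) {
  assert(values.size() == valid.size());
  const auto num_valid =
      static_cast<size_t>(std::count(valid.begin(), valid.end(), true));
  size_t position = out->size();
  out->resize(position + num_valid);
  for (size_t i = 0; i < values.size(); ++i) {
    if (valid[i]) (*out)[position++] = static_cast<int32_t>(values[i]);
  }
}
```

Both helpers skip entries whose validity flag is false and widen the remaining values to `int32_t`, so calling either one on the same inputs leaves `out` with identical contents.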

Closes #5248 from wesm/ARROW-6411 and squashes the following commits:

b1159ec8a  Add C++ benchmarks for DictEncoder::PutIndices
da8cc9d79  Add C++ benchmarks
5a73bf509  Add Python benchmark

Authored-by: Wes McKinney 
Signed-off-by: Wes McKinney 
---
 cpp/src/parquet/encoding.cc   | 16 ++--
 cpp/src/parquet/encoding_benchmark.cc | 37 
 python/benchmarks/parquet.py  | 46 ++-
 3 files changed, 91 insertions(+), 8 deletions(-)

diff --git a/cpp/src/parquet/encoding.cc b/cpp/src/parquet/encoding.cc
index ef1dd34..e63d69f 100644
--- a/cpp/src/parquet/encoding.cc
+++ b/cpp/src/parquet/encoding.cc
@@ -361,7 +361,8 @@ class DictEncoderImpl : public EncoderImpl, virtual public DictEncoder {
 --buffer_len;
 
 arrow::util::RleEncoder encoder(buffer, buffer_len, bit_width());
-for (int index : buffered_indices_) {
+
+for (int32_t index : buffered_indices_) {
   if (!encoder.Put(index)) return -1;
 }
 encoder.Flush();
@@ -425,21 +426,22 @@ class DictEncoderImpl : public EncoderImpl, virtual public DictEncoder {
 using ArrayType = typename arrow::TypeTraits::ArrayType;
 const auto& indices = checked_cast(data);
 auto values = indices.raw_values();
-buffered_indices_.reserve(
-buffered_indices_.size() +
-static_cast(indices.length() - indices.null_count()));
+
+size_t buffer_position = buffered_indices_.size();
+buffered_indices_.resize(
+buffer_position + static_cast(indices.length() - indices.null_count()));
 if (indices.null_count() > 0) {
   arrow::internal::BitmapReader valid_bits_reader(indices.null_bitmap_data(),
   indices.offset(), indices.length());
   for (int64_t i = 0; i < indices.length(); ++i) {
 if (valid_bits_reader.IsSet()) {
-  buffered_indices_.push_back(static_cast(values[i]));
+  buffered_indices_[buffer_position++] = static_cast(values[i]);
 }
 valid_bits_reader.Next();
   }
 } else {
   for (int64_t i = 0; i < indices.length(); ++i) {
-buffered_indices_.push_back(static_cast(values[i]));
+

[arrow] branch master updated (32ef12c -> 1c42e6f)

2019-09-02 Thread apitrou
This is an automated email from the ASF dual-hosted git repository.

apitrou pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 32ef12c  ARROW-6063: [FlightRPC] implement half-closed semantics for DoPut
 add 1c42e6f  ARROW-6141: [C++] Enable memory-mapping a file region

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/io/file.cc  | 53 ++
 cpp/src/arrow/io/file.h   |  5 
 cpp/src/arrow/io/file_test.cc | 54 +++
 3 files changed, 103 insertions(+), 9 deletions(-)



[arrow] branch master updated (149d4cb -> 32ef12c)

2019-09-02 Thread apitrou
This is an automated email from the ASF dual-hosted git repository.

apitrou pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 149d4cb  ARROW-6383: [Java] Report outstanding child allocators on close
 add 32ef12c  ARROW-6063: [FlightRPC] implement half-closed semantics for DoPut

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/flight/client.cc  |  46 +--
 cpp/src/arrow/flight/client.h   |  11 ++
 cpp/src/arrow/flight/flight_benchmark.cc| 185 
 cpp/src/arrow/flight/flight_test.cc |   1 +
 cpp/src/arrow/flight/internal.cc|   4 +-
 cpp/src/arrow/flight/perf_server.cc |  24 
 python/pyarrow/_flight.pyx  |   6 +
 python/pyarrow/includes/libarrow_flight.pxd |   1 +
 python/pyarrow/tests/test_flight.py |  39 ++
 9 files changed, 259 insertions(+), 58 deletions(-)



[arrow] branch master updated (a985483 -> 149d4cb)

2019-09-02 Thread ravindra
This is an automated email from the ASF dual-hosted git repository.

ravindra pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from a985483  ARROW-4752: [Rust] Add explicit SIMD vectorization for the divide kernel
 add 149d4cb  ARROW-6383: [Java] Report outstanding child allocators on close

No new revisions were added by this update.

Summary of changes:
 .../src/main/java/org/apache/arrow/memory/BaseAllocator.java | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)