[jira] [Commented] (ARROW-2330) [C++] Optimize delta buffer creation with partially finishable array builders

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433765#comment-16433765
 ] 

ASF GitHub Bot commented on ARROW-2330:
---

alendit closed pull request #1769: ARROW-2330: [C++] Optimize delta buffer 
creation with partially finishable array builders
URL: https://github.com/apache/arrow/pull/1769
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc
index fb1bebfca..7f7666bb3 100644
--- a/cpp/src/arrow/array-test.cc
+++ b/cpp/src/arrow/array-test.cc
@@ -536,6 +536,44 @@ TYPED_TEST(TestPrimitiveBuilder, SliceEquality) {
   ASSERT_TRUE(array->RangeEquals(5, 15, 0, slice));
 }
 
+TYPED_TEST(TestPrimitiveBuilder, TestPartialFinish) {
+  const int64_t size = 1000;
+  this->RandomData(size * 2);
+
+  // build an array of all values
+  std::shared_ptr all_values_array;
+  ASSERT_OK(MakeArray(this->valid_bytes_, this->draws_, size * 2, 
this->builder_.get(),
+  _values_array));
+
+  for (uint64_t idx = 0; idx < size; ++idx) {
+if (this->valid_bytes_[idx] > 0) {
+  ASSERT_OK(this->builder_->Append(this->draws_[idx]));
+} else {
+  ASSERT_OK(this->builder_->AppendNull());
+}
+  }
+
+  std::shared_ptr result1;
+  ASSERT_OK(this->builder_->Finish(false, ));
+
+  ASSERT_EQ(size, result1->length());
+  ASSERT_TRUE(all_values_array->RangeEquals(0, size, 0, result1));
+
+  for (uint64_t idx = size; idx < size * 2; ++idx) {
+if (this->valid_bytes_[idx] > 0) {
+  ASSERT_OK(this->builder_->Append(this->draws_[idx]));
+} else {
+  ASSERT_OK(this->builder_->AppendNull());
+}
+  }
+
+  std::shared_ptr result2;
+  ASSERT_OK(this->builder_->Finish(true, ));
+
+  ASSERT_EQ(size, result2->length());
+  ASSERT_TRUE(all_values_array->RangeEquals(size, size * 2, 0, result2));
+}
+
 TYPED_TEST(TestPrimitiveBuilder, TestAppendScalar) {
   DECL_T();
 
@@ -1027,6 +1065,27 @@ TEST_F(TestStringBuilder, TestZeroLength) {
   Done();
 }
 
+TEST_F(TestStringBuilder, TestPartialFinish) {
+  StringBuilder builder, builder_expected;
+  ASSERT_OK(builder.Append("foo"));
+  ASSERT_OK(builder_expected.Append("foo"));
+
+  std::shared_ptr result1, expected1;
+  ASSERT_OK(builder.Finish(false, ));
+  ASSERT_OK(builder_expected.Finish());
+  ASSERT_EQ(1, result1->length());
+  ASSERT_TRUE(result1->Equals(expected1));
+
+  ASSERT_OK(builder.Append("foo"));
+  ASSERT_OK(builder_expected.Append("foo"));
+  std::shared_ptr result2, expected2;
+  ASSERT_OK(builder.Finish(false, ));
+  ASSERT_OK(builder_expected.Finish());
+  ASSERT_EQ(1, result2->length());
+  ASSERT_EQ(1, result2->offset());
+  ASSERT_TRUE(result2->Equals(expected2));
+}
+
 // Binary container type
 // TODO(emkornfield) there should be some way to refactor these to avoid code 
duplicating
 // with String
@@ -1239,6 +1298,27 @@ TEST_F(TestBinaryBuilder, TestZeroLength) {
   Done();
 }
 
+TEST_F(TestBinaryBuilder, TestPartialFinish) {
+  BinaryBuilder builder, builder_expected;
+  ASSERT_OK(builder.Append("foo"));
+  ASSERT_OK(builder_expected.Append("foo"));
+
+  std::shared_ptr result1, expected1;
+  ASSERT_OK(builder.Finish(false, ));
+  ASSERT_OK(builder_expected.Finish());
+  ASSERT_EQ(1, result1->length());
+  ASSERT_TRUE(result1->Equals(expected1));
+
+  ASSERT_OK(builder.Append("foo"));
+  ASSERT_OK(builder_expected.Append("foo"));
+  std::shared_ptr result2, expected2;
+  ASSERT_OK(builder.Finish(false, ));
+  ASSERT_OK(builder_expected.Finish());
+  ASSERT_EQ(1, result2->length());
+  ASSERT_EQ(1, result2->offset());
+  ASSERT_TRUE(result2->Equals(expected2));
+}
+
 // --
 // Slice tests
 
@@ -1472,6 +1552,26 @@ TEST_F(TestFWBinaryArray, Slice) {
   ASSERT_TRUE(array->RangeEquals(1, 3, 0, slice));
 }
 
+TEST_F(TestFWBinaryArray, TestPartialFinish) {
+  auto type = fixed_size_binary(4);
+  FixedSizeBinaryBuilder builder(type);
+
+  ASSERT_OK(builder.Append("foo"));
+  std::shared_ptr result1;
+  ASSERT_OK(builder.Finish(false, ));
+  ASSERT_EQ(1, result1->length());
+  ASSERT_STREQ("foo", reinterpret_cast(
+  static_cast(*result1).Value(0)));
+
+  ASSERT_OK(builder.Append("bar"));
+  std::shared_ptr result2;
+  ASSERT_OK(builder.Finish());
+  ASSERT_EQ(1, result2->length());
+  ASSERT_EQ(1, result2->offset());
+  ASSERT_STREQ("bar", reinterpret_cast(
+  static_cast(*result2).Value(0)));
+}
+
 // --
 // AdaptiveInt tests
 
@@ -1603,6 +1703,31 @@ TEST_F(TestAdaptiveIntBuilder, TestAppendVector) {
   ASSERT_TRUE(expected_->Equals(result_));
 }
 

[jira] [Commented] (ARROW-2330) [C++] Optimize delta buffer creation with partially finishable array builders

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433764#comment-16433764
 ] 

ASF GitHub Bot commented on ARROW-2330:
---

alendit commented on issue #1769: ARROW-2330: [C++] Optimize delta buffer 
creation with partially finishable array builders
URL: https://github.com/apache/arrow/pull/1769#issuecomment-380419676
 
 
   Hi Uwe,
   
   I've decided to close this PR until I have a better understanding of the 
possible issues with slices.
   
   Cheers!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Optimize delta buffer creation with partially finishable array builders
> -
>
> Key: ARROW-2330
> URL: https://issues.apache.org/jira/browse/ARROW-2330
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Dimitri Vorona
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> The main aim of this change is to optimize the building of delta 
> dictionaries. In the current version delta dictionaries are built using an 
> additional "overflow" buffer which leads to complicated and potentially 
> error-prone code and subpar performance by doubling the number of lookups.
> I solve this problem by introducing the notion of partially finishable array 
> builders, i.e. builder which are able to retain the state on calling Finish. 
> The interface is based on RecordBatchBuilder::Flush, i.e. Finish is 
> overloaded with additional signature Finish(bool reset_builder, 
> std::shared_ptr* out). The resulting Arrays point to the same data 
> buffer with different offsets.
> I'm aware that the change is kind of biggish, but I'd like to discuss it 
> here. The solution makes the code more straight forward, doesn't bloat the 
> code base too much and leaves the API more or less untouched. Additionally, 
> the new way to make delta dictionaries by using a different call signature to 
> Finish feel cleaner to me.
> I'm looking forward to your critic and improvement ideas.
> The pull request is available at: https://github.com/apache/arrow/pull/1769



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2330) [C++] Optimize delta buffer creation with partially finishable array builders

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432556#comment-16432556
 ] 

ASF GitHub Bot commented on ARROW-2330:
---

alendit commented on issue #1769: ARROW-2330: [C++] Optimize delta buffer 
creation with partially finishable array builders
URL: https://github.com/apache/arrow/pull/1769#issuecomment-380165985
 
 
   Hi Uwe,
   
   I see what you mean. Now that I've looked more carefully into 
`BufferSlices`, I see how much danger they can bear. Something like 
[this](https://gist.github.com/alendit/f759bc12e03d9dd9d72cac90a6334cc5) would 
cause a read-after-free. 
   
   The problem, as I see it, is, that the SliceBuffer user has no way to ensure 
its validity. So even if we skip this PR, the problems with slices might happen 
in the future. I think some lean solution, like adding a `shared_ptr 
parent` to the slice and referencing its data instead, will increase the memory 
safety and might benefit this PR, too.
   
   Do you think we should discuss it on the mailing list?
   
   Cheers,
   Dimitri.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Optimize delta buffer creation with partially finishable array builders
> -
>
> Key: ARROW-2330
> URL: https://issues.apache.org/jira/browse/ARROW-2330
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Dimitri Vorona
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> The main aim of this change is to optimize the building of delta 
> dictionaries. In the current version delta dictionaries are built using an 
> additional "overflow" buffer which leads to complicated and potentially 
> error-prone code and subpar performance by doubling the number of lookups.
> I solve this problem by introducing the notion of partially finishable array 
> builders, i.e. builder which are able to retain the state on calling Finish. 
> The interface is based on RecordBatchBuilder::Flush, i.e. Finish is 
> overloaded with additional signature Finish(bool reset_builder, 
> std::shared_ptr* out). The resulting Arrays point to the same data 
> buffer with different offsets.
> I'm aware that the change is kind of biggish, but I'd like to discuss it 
> here. The solution makes the code more straight forward, doesn't bloat the 
> code base too much and leaves the API more or less untouched. Additionally, 
> the new way to make delta dictionaries by using a different call signature to 
> Finish feel cleaner to me.
> I'm looking forward to your critic and improvement ideas.
> The pull request is available at: https://github.com/apache/arrow/pull/1769



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2330) [C++] Optimize delta buffer creation with partially finishable array builders

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432390#comment-16432390
 ] 

ASF GitHub Bot commented on ARROW-2330:
---

xhochy commented on issue #1769: ARROW-2330: [C++] Optimize delta buffer 
creation with partially finishable array builders
URL: https://github.com/apache/arrow/pull/1769#issuecomment-380127439
 
 
   Hello Dimitri,
   
   I forgot to reply to you. This got drowned in all the other Arrow activity 
in the last days. Sorry about that.
   
   > Do you know why the RecordBatchBuilder has a partial Flush method? Is it 
because its instantion is more cumbersome compared to array builders?
   
   The `RecordBatchBuilder` is different to the other builders as it does not 
generate a Arrays but should be used for a more streaming-like output. In it, 
all builders are resetted on Flush. This is also the behaviour I would expect 
in the dictionary builder.

   > That being said, I don't think that it adds that much complexity, compared 
to the previous implementation. Most of the changes are in the testing code, 
and builder files themselves have around 30 additional LOC. At the same time, 
it makes the DictionaryBuilder quite straight forward.
   
   The added complexity is the thing that scares me most at the moment. The 
Builders are built with the intention that they complety own their buffers and 
can modify them as they like internally. You can ask them for snapshot access 
to elements to inspect them but we don't provide any guarantees that the memory 
addresses that were returned from them are still valid once you can a method on 
the builder again.
   
   As far as I understand the current implementation in some cases we return 
the current buffer pointer as new `Buffer` object in the BufferBuilder. Once we 
do expand it though later on, we may free this memory address and thus the 
`Buffer` object point to unreachable memory. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Optimize delta buffer creation with partially finishable array builders
> -
>
> Key: ARROW-2330
> URL: https://issues.apache.org/jira/browse/ARROW-2330
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Dimitri Vorona
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> The main aim of this change is to optimize the building of delta 
> dictionaries. In the current version delta dictionaries are built using an 
> additional "overflow" buffer which leads to complicated and potentially 
> error-prone code and subpar performance by doubling the number of lookups.
> I solve this problem by introducing the notion of partially finishable array 
> builders, i.e. builder which are able to retain the state on calling Finish. 
> The interface is based on RecordBatchBuilder::Flush, i.e. Finish is 
> overloaded with additional signature Finish(bool reset_builder, 
> std::shared_ptr* out). The resulting Arrays point to the same data 
> buffer with different offsets.
> I'm aware that the change is kind of biggish, but I'd like to discuss it 
> here. The solution makes the code more straight forward, doesn't bloat the 
> code base too much and leaves the API more or less untouched. Additionally, 
> the new way to make delta dictionaries by using a different call signature to 
> Finish feel cleaner to me.
> I'm looking forward to your critic and improvement ideas.
> The pull request is available at: https://github.com/apache/arrow/pull/1769



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2330) [C++] Optimize delta buffer creation with partially finishable array builders

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431822#comment-16431822
 ] 

ASF GitHub Bot commented on ARROW-2330:
---

alendit commented on issue #1769: ARROW-2330: [C++] Optimize delta buffer 
creation with partially finishable array builders
URL: https://github.com/apache/arrow/pull/1769#issuecomment-379997360
 
 
   Hi Uwe,
   
   did you come to any conclusion as to if this PR is a go? I could either 
adjust it or we can scrap it all together and then I'll look into getting delta 
dicts into `Readers` and `Writers`.
   
   Note: the test failure is unrelated to my changes, the cause was mentioned 
on the mailing list.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Optimize delta buffer creation with partially finishable array builders
> -
>
> Key: ARROW-2330
> URL: https://issues.apache.org/jira/browse/ARROW-2330
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Dimitri Vorona
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> The main aim of this change is to optimize the building of delta 
> dictionaries. In the current version delta dictionaries are built using an 
> additional "overflow" buffer which leads to complicated and potentially 
> error-prone code and subpar performance by doubling the number of lookups.
> I solve this problem by introducing the notion of partially finishable array 
> builders, i.e. builder which are able to retain the state on calling Finish. 
> The interface is based on RecordBatchBuilder::Flush, i.e. Finish is 
> overloaded with additional signature Finish(bool reset_builder, 
> std::shared_ptr* out). The resulting Arrays point to the same data 
> buffer with different offsets.
> I'm aware that the change is kind of biggish, but I'd like to discuss it 
> here. The solution makes the code more straight forward, doesn't bloat the 
> code base too much and leaves the API more or less untouched. Additionally, 
> the new way to make delta dictionaries by using a different call signature to 
> Finish feel cleaner to me.
> I'm looking forward to your critic and improvement ideas.
> The pull request is available at: https://github.com/apache/arrow/pull/1769



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2330) [C++] Optimize delta buffer creation with partially finishable array builders

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417967#comment-16417967
 ] 

ASF GitHub Bot commented on ARROW-2330:
---

alendit commented on issue #1769: ARROW-2330: [C++] Optimize delta buffer 
creation with partially finishable array builders
URL: https://github.com/apache/arrow/pull/1769#issuecomment-377005372
 
 
   Hi Uwe,
   
   i was actually wondering the same thing myself. The reason I've implemented 
partial finishers for `FixedSizeBinaryBuilder` etc is that they are used by the 
`DictionaryBuilder` internally. I've tried to come up with a use case for the 
partial finishing for other builders, but, as you said, you can almost always 
simply instantiate a new one. Do you know why the `RecordBatchBuilder` has a 
partial `Flush` method? Is it because its instantion is more cumbersome 
compared to array builders?
   
   That being said, I don't think that it adds that much complexity, compared 
to the previous implementation. Most of the changes are in the testing code, 
and builder files themselves have around 30 additional LOC. At the same time, 
it makes the `DictionaryBuilder` quite straight forward.
   
   Maybe a compromise would be to hide the partial ´Finish´ methods for other 
classes besides `DictionaryBuilder` and make them friends with it? I'm not a 
big friend of `friend`, though.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Optimize delta buffer creation with partially finishable array builders
> -
>
> Key: ARROW-2330
> URL: https://issues.apache.org/jira/browse/ARROW-2330
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Dimitri Vorona
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> The main aim of this change is to optimize the building of delta 
> dictionaries. In the current version delta dictionaries are built using an 
> additional "overflow" buffer which leads to complicated and potentially 
> error-prone code and subpar performance by doubling the number of lookups.
> I solve this problem by introducing the notion of partially finishable array 
> builders, i.e. builder which are able to retain the state on calling Finish. 
> The interface is based on RecordBatchBuilder::Flush, i.e. Finish is 
> overloaded with additional signature Finish(bool reset_builder, 
> std::shared_ptr* out). The resulting Arrays point to the same data 
> buffer with different offsets.
> I'm aware that the change is kind of biggish, but I'd like to discuss it 
> here. The solution makes the code more straight forward, doesn't bloat the 
> code base too much and leaves the API more or less untouched. Additionally, 
> the new way to make delta dictionaries by using a different call signature to 
> Finish feel cleaner to me.
> I'm looking forward to your critic and improvement ideas.
> The pull request is available at: https://github.com/apache/arrow/pull/1769



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2330) [C++] Optimize delta buffer creation with partially finishable array builders

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417644#comment-16417644
 ] 

ASF GitHub Bot commented on ARROW-2330:
---

xhochy commented on issue #1769: ARROW-2330: [C++] Optimize delta buffer 
creation with partially finishable array builders
URL: https://github.com/apache/arrow/pull/1769#issuecomment-376942167
 
 
   What is the benefit of having ArrayBuilders that partly reset them versus 
just instantiating a new ArrayBuilder? I get it for the Dictionary case where 
keep state but for FixedSizeBinary I don't see the need to add such complexity.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Optimize delta buffer creation with partially finishable array builders
> -
>
> Key: ARROW-2330
> URL: https://issues.apache.org/jira/browse/ARROW-2330
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Dimitri Vorona
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> The main aim of this change is to optimize the building of delta 
> dictionaries. In the current version delta dictionaries are built using an 
> additional "overflow" buffer which leads to complicated and potentially 
> error-prone code and subpar performance by doubling the number of lookups.
> I solve this problem by introducing the notion of partially finishable array 
> builders, i.e. builder which are able to retain the state on calling Finish. 
> The interface is based on RecordBatchBuilder::Flush, i.e. Finish is 
> overloaded with additional signature Finish(bool reset_builder, 
> std::shared_ptr* out). The resulting Arrays point to the same data 
> buffer with different offsets.
> I'm aware that the change is kind of biggish, but I'd like to discuss it 
> here. The solution makes the code more straight forward, doesn't bloat the 
> code base too much and leaves the API more or less untouched. Additionally, 
> the new way to make delta dictionaries by using a different call signature to 
> Finish feel cleaner to me.
> I'm looking forward to your critic and improvement ideas.
> The pull request is available at: https://github.com/apache/arrow/pull/1769



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)