[jira] [Updated] (ARROW-2361) [Rust] Start native Rust Implementation

2018-03-28 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2361:
--
Labels: pull-request-available  (was: )

> [Rust] Start native Rust Implementation
> ---
>
> Key: ARROW-2361
> URL: https://issues.apache.org/jira/browse/ARROW-2361
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Andy Grove
>Priority: Major
>  Labels: pull-request-available
>
> I'm creating this Jira to track work to donate a work-in-progress native 
> Rust implementation of Arrow.
> I am actively developing this and relying on it for the memory model of my 
> DataFusion project. I would like to donate the code I have now and start 
> working on it under the Apache Arrow project.
> Here is the PR: https://github.com/apache/arrow/pull/1804
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2361) [Rust] Start native Rust Implementation

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418425#comment-16418425
 ] 

ASF GitHub Bot commented on ARROW-2361:
---

wesm commented on issue #1804: ARROW-2361: [Rust] Starting point for a native 
Rust implementation of Arrow
URL: https://github.com/apache/arrow/pull/1804#issuecomment-377119109
 
 
   I'm sorta ambivalent on the package name -- I looked at crates.io and there 
are some other ASF projects with packages that just use the Foo in Apache Foo. 
If "arrow" is shorter and sweeter, that's no problem


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (ARROW-2361) [Rust] Start native Rust Implementation

2018-03-28 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2361:

Summary: [Rust] Start native Rust Implementation  (was: Native Rust 
Implementation)



[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418419#comment-16418419
 ] 

ASF GitHub Bot commented on ARROW-2351:
---

gaolizhou closed pull request #1806: ARROW-2351 [C++] 
StringBuilder::append(vector...) not impleme…
URL: https://github.com/apache/arrow/pull/1806
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc
index 2aa73a09a..308bbcd8a 100644
--- a/cpp/src/arrow/array-test.cc
+++ b/cpp/src/arrow/array-test.cc
@@ -989,6 +989,39 @@ TEST_F(TestStringBuilder, TestScalarAppend) {
   }
 }
 
+TEST_F(TestStringBuilder, TestAppendVector) {
+  vector<string> strings = {"", "bb", "a", "", "ccc"};
+  vector<uint8_t> is_null = {0, 0, 0, 1, 0};
+
+  int N = static_cast<int>(strings.size());
+  int reps = 1000;
+
+  for (int j = 0; j < reps; ++j) {
+    ASSERT_OK(builder_->Append(strings, is_null.data()));
+  }
+  Done();
+
+  ASSERT_EQ(reps * N, result_->length());
+  ASSERT_EQ(reps, result_->null_count());
+  ASSERT_EQ(reps * 6, result_->value_data()->size());
+
+  int32_t length;
+  int32_t pos = 0;
+  for (int i = 0; i < N * reps; ++i) {
+    if (is_null[i % N]) {
+      ASSERT_TRUE(result_->IsNull(i));
+    } else {
+      ASSERT_FALSE(result_->IsNull(i));
+      result_->GetValue(i, &length);
+      ASSERT_EQ(pos, result_->value_offset(i));
+      ASSERT_EQ(static_cast<int>(strings[i % N].size()), length);
+      ASSERT_EQ(strings[i % N], result_->GetString(i));
+
+      pos += length;
+    }
+  }
+}
+
 TEST_F(TestStringBuilder, TestZeroLength) {
   // All buffers are null
   Done();
diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc
index aa9f3ce42..ec486566f 100644
--- a/cpp/src/arrow/builder.cc
+++ b/cpp/src/arrow/builder.cc
@@ -16,11 +16,11 @@
 // under the License.
 
 #include "arrow/builder.h"
-
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1385,6 +1385,28 @@ const uint8_t* BinaryBuilder::GetValue(int64_t i, int32_t* out_length) const {
 
 StringBuilder::StringBuilder(MemoryPool* pool) : BinaryBuilder(utf8(), pool) {}
 
+Status StringBuilder::Append(const std::vector<std::string>& values,
+                             uint8_t* null_bytes) {
+  std::size_t total_length = std::accumulate(
+      values.begin(), values.end(), 0ULL,
+      [](uint64_t sum, const std::string& str) { return sum + str.size(); });
+  RETURN_NOT_OK(Reserve(values.size()));
+  RETURN_NOT_OK(value_data_builder_.Reserve(total_length));
+  RETURN_NOT_OK(offsets_builder_.Reserve(values.size()));
+
+  for (std::size_t i = 0; i < values.size(); ++i) {
+    RETURN_NOT_OK(AppendNextOffset());
+    if (null_bytes[i]) {
+      UnsafeAppendToBitmap(false);
+    } else {
+      RETURN_NOT_OK(value_data_builder_.Append(
+          reinterpret_cast<const uint8_t*>(values[i].data()), values[i].size()));
+      UnsafeAppendToBitmap(true);
+    }
+  }
+  return Status::OK();
+}
+
 // --
 // Fixed width binary
 


 




> [C++] StringBuilder::append(vector...) not implemented
> --
>
> Key: ARROW-2351
> URL: https://issues.apache.org/jira/browse/ARROW-2351
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.9.0
>Reporter: Rares Vernica
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> For {{StringBuilder}} an {{append(vector<string>, uint8_t*)}} function is 
> [declared|https://github.com/apache/arrow/blob/7b2c79765cf92760e1f8cca079159d9613b86412/cpp/src/arrow/builder.h#L721]
>  and 
> [documented|http://arrow.apache.org/docs/cpp/classarrow_1_1_string_builder.html#a59be34b5e11017a392b4ee019d90da3c]
>  but it does not seem to be implemented.
> {code:java}
> undefined reference to `arrow::StringBuilder::Append(std::vector std::allocator > const&, unsigned char*)'
> collect2: error: ld returned 1 exit status
> {code}
> Also worth noting is that the similar function in {{NumericBuilder}} uses 
> {{vector<bool>}} for the null values instead of {{uint8_t*}}. It might be 
> worth making them consistent.
>  
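
The bookkeeping behind a vector append on a string column (one offset per slot, one shared byte buffer, one validity flag per slot) can be sketched without Arrow at all. The following is only an illustration under stated assumptions: ToyStringColumn and its members are hypothetical names, not the Arrow API, and the null convention (a non-zero null_bytes[i] marks slot i as null) is the one used by the test in the patch for this issue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical stand-in for a variable-length string column; not the Arrow class.
struct ToyStringColumn {
  std::vector<int32_t> offsets;  // start position of each slot inside `data`
  std::string data;              // concatenated bytes of all non-null values
  std::vector<bool> valid;       // true when the slot holds a value
  int64_t null_count = 0;

  // Appends every string in `values`. A non-zero null_bytes[i] marks slot i
  // as null, in which case the bytes of values[i] are not copied.
  void Append(const std::vector<std::string>& values, const uint8_t* null_bytes) {
    assert(null_bytes != nullptr);
    // Sum the lengths up front so that each buffer grows at most once.
    const std::size_t total = std::accumulate(
        values.begin(), values.end(), std::size_t{0},
        [](std::size_t sum, const std::string& s) { return sum + s.size(); });
    data.reserve(data.size() + total);
    offsets.reserve(offsets.size() + values.size());
    valid.reserve(valid.size() + values.size());

    for (std::size_t i = 0; i < values.size(); ++i) {
      // Every slot, null or not, records where its bytes would start.
      offsets.push_back(static_cast<int32_t>(data.size()));
      if (null_bytes[i]) {
        valid.push_back(false);
        ++null_count;
      } else {
        data += values[i];
        valid.push_back(true);
      }
    }
  }
};
```

Note that an empty string and a null stay distinguishable in this layout: both contribute zero bytes, but only the null clears its validity flag.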





[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418420#comment-16418420
 ] 

ASF GitHub Bot commented on ARROW-2351:
---

gaolizhou opened a new pull request #1803: ARROW-2351 [C++] 
StringBuilder::append(vector...) not impleme…
URL: https://github.com/apache/arrow/pull/1803
 
 
   Changed the API from `Status Append(const std::vector<std::string>& values, 
uint8_t* null_bytes);` to `Status Append(const std::vector<std::string>& 
values);`. IMO, if a string is empty, then it should be null, and vice versa.
   
   **[update]** Changed the API back to the original.




[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418390#comment-16418390
 ] 

ASF GitHub Bot commented on ARROW-2351:
---

gaolizhou commented on issue #1806: ARROW-2351 [C++] 
StringBuilder::append(vector...) not impleme…
URL: https://github.com/apache/arrow/pull/1806#issuecomment-377111241
 
 
   Sorry guys, I closed the pull request by mistake and have just reopened it.




[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418386#comment-16418386
 ] 

ASF GitHub Bot commented on ARROW-2351:
---

gaolizhou opened a new pull request #1806: ARROW-2351 [C++] 
StringBuilder::append(vector...) not impleme…
URL: https://github.com/apache/arrow/pull/1806
 
 
   




[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418384#comment-16418384
 ] 

ASF GitHub Bot commented on ARROW-2351:
---

gaolizhou commented on issue #1803: ARROW-2351 [C++] 
StringBuilder::append(vector...) not impleme…
URL: https://github.com/apache/arrow/pull/1803#issuecomment-377110720
 
 
   Please help review it and leave your comments.




[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418382#comment-16418382
 ] 

ASF GitHub Bot commented on ARROW-2351:
---

gaolizhou closed pull request #1803: ARROW-2351 [C++] 
StringBuilder::append(vector...) not impleme…
URL: https://github.com/apache/arrow/pull/1803
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc
index 2aa73a09a..308bbcd8a 100644
--- a/cpp/src/arrow/array-test.cc
+++ b/cpp/src/arrow/array-test.cc
@@ -989,6 +989,39 @@ TEST_F(TestStringBuilder, TestScalarAppend) {
   }
 }
 
+TEST_F(TestStringBuilder, TestAppendVector) {
+  vector<string> strings = {"", "bb", "a", "", "ccc"};
+  vector<uint8_t> is_null = {0, 0, 0, 1, 0};
+
+  int N = static_cast<int>(strings.size());
+  int reps = 1000;
+
+  for (int j = 0; j < reps; ++j) {
+    ASSERT_OK(builder_->Append(strings, is_null.data()));
+  }
+  Done();
+
+  ASSERT_EQ(reps * N, result_->length());
+  ASSERT_EQ(reps, result_->null_count());
+  ASSERT_EQ(reps * 6, result_->value_data()->size());
+
+  int32_t length;
+  int32_t pos = 0;
+  for (int i = 0; i < N * reps; ++i) {
+    if (is_null[i % N]) {
+      ASSERT_TRUE(result_->IsNull(i));
+    } else {
+      ASSERT_FALSE(result_->IsNull(i));
+      result_->GetValue(i, &length);
+      ASSERT_EQ(pos, result_->value_offset(i));
+      ASSERT_EQ(static_cast<int>(strings[i % N].size()), length);
+      ASSERT_EQ(strings[i % N], result_->GetString(i));
+
+      pos += length;
+    }
+  }
+}
+
 TEST_F(TestStringBuilder, TestZeroLength) {
   // All buffers are null
   Done();
diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc
index aa9f3ce42..ec486566f 100644
--- a/cpp/src/arrow/builder.cc
+++ b/cpp/src/arrow/builder.cc
@@ -16,11 +16,11 @@
 // under the License.
 
 #include "arrow/builder.h"
-
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1385,6 +1385,28 @@ const uint8_t* BinaryBuilder::GetValue(int64_t i, int32_t* out_length) const {
 
 StringBuilder::StringBuilder(MemoryPool* pool) : BinaryBuilder(utf8(), pool) {}
 
+Status StringBuilder::Append(const std::vector<std::string>& values,
+                             uint8_t* null_bytes) {
+  std::size_t total_length = std::accumulate(
+      values.begin(), values.end(), 0ULL,
+      [](uint64_t sum, const std::string& str) { return sum + str.size(); });
+  RETURN_NOT_OK(Reserve(values.size()));
+  RETURN_NOT_OK(value_data_builder_.Reserve(total_length));
+  RETURN_NOT_OK(offsets_builder_.Reserve(values.size()));
+
+  for (std::size_t i = 0; i < values.size(); ++i) {
+    RETURN_NOT_OK(AppendNextOffset());
+    if (null_bytes[i]) {
+      UnsafeAppendToBitmap(false);
+    } else {
+      RETURN_NOT_OK(value_data_builder_.Append(
+          reinterpret_cast<const uint8_t*>(values[i].data()), values[i].size()));
+      UnsafeAppendToBitmap(true);
+    }
+  }
+  return Status::OK();
+}
+
 // --
 // Fixed width binary
 


 






[jira] [Commented] (ARROW-2359) Type objects produced by DataType factory are not thread safe

2018-03-28 Thread Anton Shmigirilov (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418191#comment-16418191
 ] 

Anton Shmigirilov commented on ARROW-2359:
--

It seems the original description is incorrect. There don't appear to be any 
races in the singleton itself. There is another race condition somewhere in my 
code; it's related to arrow::DataType usage, and it goes away if the singleton 
is removed using the proposed patch. Anyway, this JIRA task has the wrong 
description and can be removed or renamed if needed. Sorry for the confusion, 
guys. (But I still think that a static shared_ptr is over-engineering ;) )

> Type objects produced by DataType factory are not thread safe
> -
>
> Key: ARROW-2359
> URL: https://issues.apache.org/jira/browse/ARROW-2359
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Anton Shmigirilov
>Priority: Minor
>  Labels: pull-request-available
>
> The TYPE_FACTORY() macro that produces type shortcuts (boolean(), int32(), 
> utf8(), and so on) uses a static shared_ptr internally. Race conditions are 
> possible on the shared_ptr's reference counter.





[jira] [Commented] (ARROW-1780) JDBC Adapter for Apache Arrow

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418166#comment-16418166
 ] 

ASF GitHub Bot commented on ARROW-1780:
---

atuldambalkar commented on issue #1759: ARROW-1780 - [WIP] JDBC Adapter to 
convert Relational Data objects to Arrow Data Format Vector Objects
URL: https://github.com/apache/arrow/pull/1759#issuecomment-377045969
 
 
   Hi @laurentgo, now I do have a handful of review comments to work on. As I 
work on each one of those, some may need short discussion with you. I hope 
that's okay. Thanks.




> JDBC Adapter for Apache Arrow
> -
>
> Key: ARROW-1780
> URL: https://issues.apache.org/jira/browse/ARROW-1780
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Atul Dambalkar
>Assignee: Atul Dambalkar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> At a high level, the JDBC Adapter will allow upstream apps to query RDBMS data 
> over JDBC and get the JDBC objects converted to Arrow objects/structures. The 
> upstream utility can then work with Arrow objects/structures with the usual 
> performance benefits. The utility will be very similar to the C++ 
> implementation of "Convert a vector of row-wise data into an Arrow table" as 
> described here: 
> https://arrow.apache.org/docs/cpp/md_tutorials_row_wise_conversion.html
> The utility will read data from an RDBMS and convert it into Arrow 
> objects/structures, so from that perspective it only reads data from the 
> RDBMS. Whether the utility should also push Arrow objects to an RDBMS needs 
> to be discussed and is out of scope for this utility for now.
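
The row-wise to columnar step the description refers to can be illustrated in a few lines of plain C++. This is only a sketch of the general idea, not the adapter's design: ToyRow, ToyColumns and RowsToColumns are hypothetical names, and the three fields are invented for the example. A real adapter would read rows from a JDBC result set and fill Arrow vectors instead of std::vector.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical row, as it might come back from a result-set cursor.
struct ToyRow {
  int64_t id;
  std::string name;
  double score;
};

// Columnar counterpart: one contiguous vector per field.
struct ToyColumns {
  std::vector<int64_t> id;
  std::vector<std::string> name;
  std::vector<double> score;
};

// Walks the rows once and appends each field to its own column.
ToyColumns RowsToColumns(const std::vector<ToyRow>& rows) {
  ToyColumns cols;
  cols.id.reserve(rows.size());
  cols.name.reserve(rows.size());
  cols.score.reserve(rows.size());
  for (const ToyRow& row : rows) {
    cols.id.push_back(row.id);
    cols.name.push_back(row.name);
    cols.score.push_back(row.score);
  }
  // All columns must end up with one entry per input row.
  assert(cols.id.size() == rows.size());
  return cols;
}
```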





[jira] [Commented] (ARROW-2308) Serialized tensor data should be 64-byte aligned.

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418003#comment-16418003
 ] 

ASF GitHub Bot commented on ARROW-2308:
---

robertnishihara commented on issue #1802: ARROW-2308: [Python] Make 
deserialized numpy arrays 64-byte aligned.
URL: https://github.com/apache/arrow/pull/1802#issuecomment-377011536
 
 
   Thanks @wesm!
   
   @pcmoritz, good catch, the bug should be fixed now.




> Serialized tensor data should be 64-byte aligned.
> -
>
> Key: ARROW-2308
> URL: https://issues.apache.org/jira/browse/ARROW-2308
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Nishihara
>Priority: Major
>  Labels: pull-request-available
>
> See [https://github.com/ray-project/ray/issues/1658] for an example of this 
> issue. Non-aligned data can trigger a copy when fed into TensorFlow and 
> things like that.
> {code}
> import pyarrow as pa
> import numpy as np
> x = np.zeros(10)
> y = pa.deserialize(pa.serialize(x).to_buffer())
> x.ctypes.data % 64  # 0 (it starts out aligned)
> y.ctypes.data % 64  # 48 (it is no longer aligned)
> {code}
> It should be possible to fix this by calling something like 
> {{RETURN_NOT_OK(AlignStreamPosition(dst));}} before writing the array data. 
> Note that we already do this before writing the tensor header, but the tensor 
> header is not necessarily a multiple of 64 bytes, so the subsequent data can 
> be unaligned.
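
The padding arithmetic behind the proposed fix can be sketched as follows. This is a minimal illustration, not the Arrow implementation: AlignUp and WriteHeaderThenAlignedPayload are hypothetical helpers, and a std::string stands in for the output stream. It also only aligns the offset within the stream; the payload ends up 64-byte aligned in memory only if the buffer that holds the stream starts on a 64-byte boundary itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

constexpr int64_t kAlignment = 64;

// Rounds `position` up to the next multiple of `alignment`,
// which must be a power of two.
int64_t AlignUp(int64_t position, int64_t alignment = kAlignment) {
  assert(alignment > 0 && (alignment & (alignment - 1)) == 0);
  return (position + alignment - 1) & ~(alignment - 1);
}

// Appends `header` to the toy stream, pads with zero bytes up to the next
// 64-byte boundary, then appends `payload`.
// Returns the offset at which the payload starts.
int64_t WriteHeaderThenAlignedPayload(std::string* stream, const std::string& header,
                                      const std::string& payload) {
  *stream += header;
  const int64_t padded = AlignUp(static_cast<int64_t>(stream->size()));
  stream->append(static_cast<std::size_t>(padded) - stream->size(), '\0');
  *stream += payload;
  return padded;
}
```

With a 48-byte header, as in the `y.ctypes.data % 64` result quoted above, the helper inserts 16 zero bytes so that the payload starts at offset 64.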





[jira] [Commented] (ARROW-2359) Type objects produced by DataType factory are not thread safe

2018-03-28 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417995#comment-16417995
 ] 

Antoine Pitrou commented on ARROW-2359:
---

C++11 guarantees the initialization cannot involve any race condition (see 
Stack Overflow post above).
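
A small, self-contained illustration of that guarantee, assuming a factory shaped roughly like the type shortcuts discussed in this issue: ToyType, toy_int32 and AllThreadsSawSameInstance are hypothetical names, not Arrow code. The function-local static is initialized exactly once even under concurrent first calls, and copying the returned shared_ptr only touches the reference count, which is updated atomically.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

struct ToyType {
  int id;
};

// One shared instance, created on the first call. Since C++11 the
// initialization of a function-local static is thread-safe.
std::shared_ptr<ToyType> toy_int32() {
  static std::shared_ptr<ToyType> result = std::make_shared<ToyType>(ToyType{7});
  return result;  // copies the shared_ptr; the count update is atomic
}

// Calls the factory from several threads and reports whether every thread
// observed the same underlying object.
bool AllThreadsSawSameInstance(int num_threads, int calls_per_thread) {
  assert(num_threads > 0);
  std::vector<const ToyType*> seen(static_cast<std::size_t>(num_threads), nullptr);
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&seen, t, calls_per_thread] {
      for (int i = 0; i < calls_per_thread; ++i) {
        std::shared_ptr<ToyType> local = toy_int32();
        // Each thread writes only to its own slot, so there is no data race.
        seen[static_cast<std::size_t>(t)] = local.get();
      }
    });
  }
  for (std::thread& th : threads) th.join();
  for (const ToyType* p : seen) {
    if (p != seen[0]) return false;
  }
  return seen[0] != nullptr;
}
```

What is not covered by this guarantee is concurrent modification of the same shared_ptr object (for example, one thread reassigning `result` while another copies it), nor unsynchronized writes to the pointed-to object.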



[jira] [Commented] (ARROW-2330) [C++] Optimize delta buffer creation with partially finishable array builders

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417967#comment-16417967
 ] 

ASF GitHub Bot commented on ARROW-2330:
---

alendit commented on issue #1769: ARROW-2330: [C++] Optimize delta buffer 
creation with partially finishable array builders
URL: https://github.com/apache/arrow/pull/1769#issuecomment-377005372
 
 
   Hi Uwe,
   
   I was actually wondering the same thing myself. The reason I've implemented 
partial finishers for `FixedSizeBinaryBuilder` etc. is that they are used by the 
`DictionaryBuilder` internally. I've tried to come up with a use case for 
partial finishing in other builders, but, as you said, you can almost always 
simply instantiate a new one. Do you know why the `RecordBatchBuilder` has a 
partial `Flush` method? Is it because its instantiation is more cumbersome 
compared to array builders?
   
   That being said, I don't think that it adds that much complexity compared 
to the previous implementation. Most of the changes are in the testing code, 
and the builder files themselves have around 30 additional LOC. At the same 
time, it makes the `DictionaryBuilder` quite straightforward.
   
   Maybe a compromise would be to hide the partial `Finish` methods for other 
classes besides `DictionaryBuilder` and make them friends with it? I'm not a 
big friend of `friend`, though.




> [C++] Optimize delta buffer creation with partially finishable array builders
> -
>
> Key: ARROW-2330
> URL: https://issues.apache.org/jira/browse/ARROW-2330
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Dimitri Vorona
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> The main aim of this change is to optimize the building of delta 
> dictionaries. In the current version, delta dictionaries are built using an 
> additional "overflow" buffer, which leads to complicated and potentially 
> error-prone code and subpar performance by doubling the number of lookups.
> I solve this problem by introducing the notion of partially finishable array 
> builders, i.e. builders that are able to retain their state when Finish is 
> called. The interface is based on RecordBatchBuilder::Flush, i.e. Finish is 
> overloaded with the additional signature Finish(bool reset_builder, 
> std::shared_ptr<Array>* out). The resulting Arrays point to the same data 
> buffer with different offsets.
> I'm aware that the change is fairly large, but I'd like to discuss it 
> here. The solution makes the code more straightforward, doesn't bloat the 
> code base too much, and leaves the API more or less untouched. Additionally, 
> the new way to make delta dictionaries by using a different call signature for 
> Finish feels cleaner to me.
> I'm looking forward to your critique and improvement ideas.
> The pull request is available at: https://github.com/apache/arrow/pull/1769
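
The idea of a partially finishable builder described above can be sketched 
roughly as follows. This is only an illustration under stated assumptions: 
`ToyBuilder` and `ToyArray` are hypothetical names, not the Arrow classes, and 
the toy `Finish(bool reset_builder)` returns its result by value, whereas the 
interface in the issue returns a Status and writes to an out parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical toy array: a view (offset, length) into a shared buffer.
struct ToyArray {
  std::shared_ptr<const std::vector<int32_t>> data;
  std::size_t offset;
  std::size_t length;

  int32_t Value(std::size_t i) const { return (*data)[offset + i]; }
};

// Hypothetical toy builder that can be finished without losing its state.
class ToyBuilder {
 public:
  ToyBuilder()
      : buffer_(std::make_shared<std::vector<int32_t>>()), flushed_(0) {}

  void Append(int32_t value) { buffer_->push_back(value); }

  // Emits the values appended since the previous Finish call.
  // reset_builder == false: keep the buffer, so the next result is a "delta"
  //   that points at the same buffer with a larger offset.
  // reset_builder == true: start over with a fresh, empty buffer.
  ToyArray Finish(bool reset_builder) {
    ToyArray out{buffer_, flushed_, buffer_->size() - flushed_};
    if (reset_builder) {
      buffer_ = std::make_shared<std::vector<int32_t>>();
      flushed_ = 0;
    } else {
      flushed_ = buffer_->size();
    }
    return out;
  }

 private:
  std::shared_ptr<std::vector<int32_t>> buffer_;
  std::size_t flushed_;  // number of values already handed out
};
```

Appending 1 and 2, calling `Finish(false)`, appending 3 and calling 
`Finish(false)` again yields two arrays over the same buffer, the second one 
starting at offset 2. The toy is single-threaded only; it says nothing about 
how the pull request manages buffer memory.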





[jira] [Commented] (ARROW-2359) Type objects produced by DataType factory are not thread safe

2018-03-28 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417958#comment-16417958
 ] 

Wes McKinney commented on ARROW-2359:
-

Is it possible there is a race condition in the initialization of the global 
static variable? Seems pretty esoteric

> Type objects produced by DataType factory are not thread safe
> -
>
> Key: ARROW-2359
> URL: https://issues.apache.org/jira/browse/ARROW-2359
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Anton Shmigirilov
>Priority: Minor
>  Labels: pull-request-available
>
> The TYPE_FACTORY() macro that produces type shortcuts (boolean(), int32(), 
> utf8() and so on) uses a static shared_ptr internally. Race conditions are 
> possible on the shared_ptr's reference counter.





[jira] [Commented] (ARROW-2359) Type objects produced by DataType factory are not thread safe

2018-03-28 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417917#comment-16417917
 ] 

Phillip Cloud commented on ARROW-2359:
--

Also, http://en.cppreference.com/w/cpp/memory/shared_ptr states the following:

{code}
All member functions (including copy constructor and copy assignment) can be 
called by multiple threads on different instances of shared_ptr without 
additional synchronization even if these instances are copies and share 
ownership of the same object.
{code}

These functions return a copy of a {{shared_ptr}}.
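
The quoted guarantee can be illustrated with a small sketch. `ToyType`, 
`toy_int32` and `CopyFromThreads` are hypothetical stand-ins, not the Arrow 
TYPE_FACTORY macro, and the sketch assumes a C++11 compiler, where the 
initialization of a function-local static is itself thread safe. Each thread 
only ever touches its own copy of the pointer, so the only shared state that 
changes is the atomically updated reference count.

```cpp
#include <cassert>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical stand-in for a data type object.
struct ToyType {
  int id;
};

// Hypothetical factory in the style discussed here: one static shared_ptr,
// and every call returns a copy of it.
std::shared_ptr<ToyType> toy_int32() {
  static std::shared_ptr<ToyType> result =
      std::make_shared<ToyType>(ToyType{7});
  return result;  // copy: increments the shared reference count atomically
}

// Copies the singleton from several threads at once, then reports the
// use count observed once all threads have finished.
long CopyFromThreads(int num_threads, int copies_per_thread) {
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([copies_per_thread] {
      for (int i = 0; i < copies_per_thread; ++i) {
        // Each iteration owns a distinct shared_ptr instance.
        std::shared_ptr<ToyType> local = toy_int32();
        (void)local;
      }
    });
  }
  for (auto& th : threads) th.join();
  // One reference held by the static, one by the temporary returned here.
  return toy_int32().use_count();
}
```

After all threads have joined, every thread-local copy has been destroyed, so 
the count is back to the static instance plus the one temporary. This does not 
show whether the TSAN report in this ticket is a false positive; it only shows 
the usage pattern that the cppreference text covers.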



[jira] [Created] (ARROW-2362) [Python] Decimal conversions are slow

2018-03-28 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2362:
-

 Summary: [Python] Decimal conversions are slow
 Key: ARROW-2362
 URL: https://issues.apache.org/jira/browse/ARROW-2362
 Project: Apache Arrow
  Issue Type: Wish
  Components: Python
Affects Versions: 0.9.0
Reporter: Antoine Pitrou


See https://github.com/apache/arrow/pull/1798#issuecomment-376498987

I don't know how critical performance is here, but worth keeping a note of. The 
{{decimal}} module isn't exposing an official C API, so fixing this would be a 
bit involved. If this is important, we can also try to push for an official 
decimal C API in Python (other packages may also benefit).





[jira] [Commented] (ARROW-2359) Type objects produced by DataType factory are not thread safe

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417837#comment-16417837
 ] 

ASF GitHub Bot commented on ARROW-2359:
---

pitrou commented on issue #1800: ARROW-2359: [C++] do not use static shared_ptr 
in TYPE_FACTORY to make it thread safe
URL: https://github.com/apache/arrow/pull/1800#issuecomment-376978097
 
 
   Right now it isn't proven that there is something to fix here. The 
discussion is happening on the JIRA ticket.




[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417651#comment-16417651
 ] 

ASF GitHub Bot commented on ARROW-2351:
---

xhochy commented on issue #1803: ARROW-2351 [C++] 
StringBuilder::append(vector...) not impleme…
URL: https://github.com/apache/arrow/pull/1803#issuecomment-376943418
 
 
   > Do we want to update the API as well to Status Append(const 
std::vector<std::string>& values, vector<bool> null_bytes); to match the API 
for NumericBuilder?
   
   Usability-wise this definitely makes sense, but this can also be done in a 
follow-up PR.




> [C++] StringBuilder::append(vector...) not implemented
> --
>
> Key: ARROW-2351
> URL: https://issues.apache.org/jira/browse/ARROW-2351
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.9.0
>Reporter: Rares Vernica
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> For {{StringBuilder}} an {{append(vector<string>, uint8_t*)}} function is 
> [declared|https://github.com/apache/arrow/blob/7b2c79765cf92760e1f8cca079159d9613b86412/cpp/src/arrow/builder.h#L721]
>  and 
> [documented|http://arrow.apache.org/docs/cpp/classarrow_1_1_string_builder.html#a59be34b5e11017a392b4ee019d90da3c]
>  but it does not seem to be implemented.
> {code:java}
> undefined reference to `arrow::StringBuilder::Append(std::vector std::allocator > const&, unsigned char*)'
> collect2: error: ld returned 1 exit status
> {code}
> Also worth noting is that the similar function in {{NumericBuilder}} uses 
> {{vector<bool>}} for the null values instead of {{uint8_t*}}. It might be 
> worth making them consistent.
>  





[jira] [Commented] (ARROW-2330) [C++] Optimize delta buffer creation with partially finishable array builders

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417644#comment-16417644
 ] 

ASF GitHub Bot commented on ARROW-2330:
---

xhochy commented on issue #1769: ARROW-2330: [C++] Optimize delta buffer 
creation with partially finishable array builders
URL: https://github.com/apache/arrow/pull/1769#issuecomment-376942167
 
 
   What is the benefit of having ArrayBuilders that partly reset themselves 
versus just instantiating a new ArrayBuilder? I get it for the Dictionary case, 
where we keep state, but for FixedSizeBinary I don't see the need to add such 
complexity.




[jira] [Commented] (ARROW-2350) Shrink size of spark_integration Docker container

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417632#comment-16417632
 ] 

ASF GitHub Bot commented on ARROW-2350:
---

xhochy commented on issue #1787: ARROW-2350: Consolidated RUN step in 
spark_integration Dockerfile
URL: https://github.com/apache/arrow/pull/1787#issuecomment-376941028
 
 
   @jameslamb yes, that would be great. It looks like we could trim down all of 
them. In some cases, we install python packages from `pip install git+…`. These 
steps should maybe stay separate so that we can only delete the docker cache 
for them and have a fast rebuild.




> Shrink size of spark_integration Docker container
> -
>
> Key: ARROW-2350
> URL: https://issues.apache.org/jira/browse/ARROW-2350
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: James Lamb
>Assignee: James Lamb
>Priority: Minor
>  Labels: docker, pull-request-available, spark
> Fix For: 0.10.0
>
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> I would like to propose a few changes to the spark_integration Dockerfile:
> [https://github.com/apache/arrow/tree/master/dev/spark_integration]
> The size of the resulting image can be reduced by making the following 
> changes:
>  * consolidating all RUN commands into a single RUN layer (reducing the 
> number of layers)
>  * running {{apt-get clean}} to clear out the package cache
>  * running {{conda clean --all}} to clear out cached 
> package tarballs, abandoned package versions, and other build artifacts from 
> all the libraries that are conda installed
> I will be submitting a PR on GitHub shortly. Generating this issue first so I 
> can tag my PR to it.





[jira] [Commented] (ARROW-2359) Type objects produced by DataType factory are not thread safe

2018-03-28 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417565#comment-16417565
 ] 

Antoine Pitrou commented on ARROW-2359:
---

I don't know what the rationale is, though the aim may be to avoid lots of 
spurious allocations. [~wesmckinn]

However, the question is whether there is an actual issue with the singleton 
pattern here, and I'm not convinced there is.



[jira] [Commented] (ARROW-2359) Type objects produced by DataType factory are not thread safe

2018-03-28 Thread Anton Shmigirilov (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417556#comment-16417556
 ] 

Anton Shmigirilov commented on ARROW-2359:
--

> TYPE_FACTORY returns a _copy_ of the static shared_ptr

If so, it's unclear to me why the shared_ptr is there, or why the static object 
is there. Is the only reason to save a bit of memory? But RAII should help to 
manage memory reasonably.

Anyway, I propose considering whether such a pattern is reasonable here.

 



[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417537#comment-16417537
 ] 

ASF GitHub Bot commented on ARROW-2351:
---

rvernica commented on issue #1803: ARROW-2351 [C++] 
StringBuilder::append(vector...) not impleme…
URL: https://github.com/apache/arrow/pull/1803#issuecomment-376927452
 
 
   Do we want to update the API as well to `Status Append(const 
std::vector<std::string>& values, vector<bool> null_bytes);` to match the API 
for `NumericBuilder`?




[jira] [Commented] (ARROW-2347) [Python] Multiple warnings with -Wconversion

2018-03-28 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417526#comment-16417526
 ] 

Antoine Pitrou commented on ARROW-2347:
---

> Hm, are there other solutions to the enum signedness issue?

I'm not sure. It seems Cython is generating that code precisely to know whether 
the enum is signed:

{code:c}
static CYTHON_INLINE PyObject* __Pyx_PyInt_From_enumarrow_3a__3a_Type_3a__3a_type(enum  arrow::Type::type value) {
    const enum  arrow::Type::type neg_one = (enum  arrow::Type::type) -1, const_zero = (enum  arrow::Type::type) 0;
    const int is_unsigned = neg_one > const_zero;
    if (is_unsigned) {
        if (sizeof(enum  arrow::Type::type) < sizeof(long)) {
            return PyInt_FromLong((long) value);
        } else if (sizeof(enum  arrow::Type::type) <= sizeof(unsigned long)) {
            return PyLong_FromUnsignedLong((unsigned long) value);
#ifdef HAVE_LONG_LONG
        } else if (sizeof(enum  arrow::Type::type) <= sizeof(unsigned PY_LONG_LONG)) {
            return PyLong_FromUnsignedLongLong((unsigned PY_LONG_LONG) value);
#endif
        }
    } else {
        if (sizeof(enum  arrow::Type::type) <= sizeof(long)) {
            return PyInt_FromLong((long) value);
#ifdef HAVE_LONG_LONG
        } else if (sizeof(enum  arrow::Type::type) <= sizeof(PY_LONG_LONG)) {
            return PyLong_FromLongLong((PY_LONG_LONG) value);
#endif
        }
    }
    {
        int one = 1; int little = (int)*(unsigned char *)&one;
        unsigned char *bytes = (unsigned char *)&value;
        return _PyLong_FromByteArray(bytes, sizeof(enum  arrow::Type::type),
                                     little, !is_unsigned);
    }
}
{code}
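
For comparison, here is a rough sketch of the signedness probe used above next 
to a compile-time alternative. It is not what Cython emits and not a proposed 
patch: the enums and the helpers `IsUnsignedByCast` and `IsUnsignedByTrait` are 
hypothetical, and the trait version needs C++11, which code generated to also 
compile as C cannot rely on. Both toy enums have a fixed underlying type, so 
casting -1 to them is well defined, unlike the case the warning complains 
about.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical enums with explicitly fixed underlying types.
enum UnsignedLike : unsigned char { kFirst = 0, kSecond = 1 };
enum SignedLike : int { kNegative = -1, kPositive = 1 };

// Runtime probe in the style of the generated code: cast -1 to the enum
// type and check whether it compares greater than zero.
template <typename E>
bool IsUnsignedByCast() {
  const E neg_one = static_cast<E>(-1);
  const E zero = static_cast<E>(0);
  return neg_one > zero;
}

// Compile-time alternative: inspect the underlying type directly, with no
// out-of-range conversion involved.
template <typename E>
constexpr bool IsUnsignedByTrait() {
  return std::is_unsigned<typename std::underlying_type<E>::type>::value;
}
```

For these two enums both helpers agree: `UnsignedLike` is reported as unsigned 
and `SignedLike` as signed.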


> [Python] Multiple warnings with -Wconversion
> 
>
> Key: ARROW-2347
> URL: https://issues.apache.org/jira/browse/ARROW-2347
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Antoine Pitrou
>Priority: Minor
>
> There are multiple warnings when compiling the Cython-generated code with 
> {{-Wconversion}}:
>  {code}
> /home/antoine/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx: In function 
> 'PyObject* __pyx_pf_7pyarrow_3lib_62union(PyObject*, PyObject*, PyObject*)':
> /home/antoine/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:23850:45: 
> warning: conversion to 'std::vector::value_type {aka unsigned 
> char}' from 'int' may alter its value [-Wconversion]
>__pyx_v_type_codes.push_back(__pyx_v_i);
>  ^
> /home/antoine/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx: In function 
> 'PyObject* 
> __Pyx_PyInt_From_enumarrow_3a__3a_Type_3a__3a_type(arrow::Type::type)':
> /home/antoine/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:125758:70: 
> warning: the result of the conversion is unspecified because '-1' is outside 
> the range of type 'arrow::Type::type' [-Wconversion]
>  const enum  arrow::Type::type neg_one = (enum  arrow::Type::type) -1, 
> const_zero = (enum  arrow::Type::type) 0;
>   ^
> /home/antoine/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx: In function 
> 'PyObject* 
> __Pyx_PyInt_From_enumarrow_3a__3a_UnionMode_3a__3a_type(arrow::UnionMode::type)':
> /home/antoine/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:125789:80: 
> warning: the result of the conversion is unspecified because '-1' is outside 
> the range of type 'arrow::UnionMode::type' [-Wconversion]
>  const enum  arrow::UnionMode::type neg_one = (enum  
> arrow::UnionMode::type) -1, const_zero = (enum  arrow::UnionMode::type) 0;
>   
>   ^
> /home/antoine/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx: In function 
> 'PyObject* 
> __Pyx_PyInt_From_enumarrow_3a__3a_TimeUnit_3a__3a_type(arrow::TimeUnit::type)':
> /home/antoine/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:125820:78: 
> warning: the result of the conversion is unspecified because '-1' is outside 
> the range of type 'arrow::TimeUnit::type' [-Wconversion]
>  const enum  arrow::TimeUnit::type neg_one = (enum  
> arrow::TimeUnit::type) -1, const_zero = (enum  arrow::TimeUnit::type) 0;
>   
> ^
> {code}
> (also similar warnings for _parquet.pyx due to Parquet enumerations)





[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417497#comment-16417497
 ] 

ASF GitHub Bot commented on ARROW-2351:
---

xhochy commented on issue #1803: ARROW-2351 [C++] 
StringBuilder::append(vector...) not impleme…
URL: https://github.com/apache/arrow/pull/1803#issuecomment-376921022
 
 
   > hmmm, is it possible that the string is not empty and the null bit is true 
?
   
   No, this should not happen but this should hopefully not make a difference.
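
One way to picture why a non-empty string paired with a set null bit need not 
matter is the toy sketch below. It is an assumption about how such an append 
could behave, not the implementation in the pull request: `ToyStringColumn` and 
its members are hypothetical, and the validity flags are passed as a 
`std::vector<bool>` in the style mentioned for `NumericBuilder`. Bytes of a 
null slot are simply not copied, so a null slot and an empty string both have 
zero length and are told apart only by the validity flag.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical toy column: concatenated bytes, offsets, validity flags.
struct ToyStringColumn {
  std::string data;              // bytes of all valid strings, back to back
  std::vector<int32_t> offsets;  // slot i spans offsets[i] .. offsets[i + 1]
  std::vector<bool> is_valid;    // false marks a null slot

  ToyStringColumn() : offsets(1, 0) {}

  // Appends one slot per input value; values at null slots are ignored.
  void Append(const std::vector<std::string>& values,
              const std::vector<bool>& valid) {
    assert(values.size() == valid.size());
    for (std::size_t i = 0; i < values.size(); ++i) {
      if (valid[i]) data += values[i];
      offsets.push_back(static_cast<int32_t>(data.size()));
      is_valid.push_back(valid[i]);
    }
  }

  std::size_t length() const { return is_valid.size(); }
  bool IsNull(std::size_t i) const { return !is_valid[i]; }
  std::string Value(std::size_t i) const {
    return data.substr(offsets[i], offsets[i + 1] - offsets[i]);
  }
};
```

Appending the values "a", "" and "zzz" with flags true, true and false stores 
only the byte "a"; slot 1 is a valid empty string and slot 2 is null.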




[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417496#comment-16417496
 ] 

ASF GitHub Bot commented on ARROW-2351:
---

xhochy commented on issue #1803: ARROW-2351 [C++] 
StringBuilder::append(vector...) not impleme…
URL: https://github.com/apache/arrow/pull/1803#issuecomment-376921022
 
 
   > hmmm, is it possible that the string is not empty and the null bit is true 
?
   
   No, this should not happen. 




[GitHub] xhochy opened a new pull request #25: WIP: Fix OSX RPATHs for Boost

2018-03-28 Thread GitBox
xhochy opened a new pull request #25: WIP: Fix OSX RPATHs for Boost
URL: https://github.com/apache/arrow-dist/pull/25
 
 
   




With regards,
Apache Git Services


[jira] [Issue Comment Deleted] (ARROW-2359) Type objects produced by DataType factory are not thread safe

2018-03-28 Thread Dou Tu (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dou Tu updated ARROW-2359:
--
Comment: was deleted

(was: There is no shared data between two threads here!

So it must be thread-safe!

Should be junked.)



[jira] [Created] (ARROW-2361) Native Rust Implementation

2018-03-28 Thread Andy Grove (JIRA)
Andy Grove created ARROW-2361:
-

 Summary: Native Rust Implementation
 Key: ARROW-2361
 URL: https://issues.apache.org/jira/browse/ARROW-2361
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Andy Grove


I'm creating this Jira to track work to donate a work-in-progress native Rust 
implementation of Arrow.

I am actively developing this and relying on it for the memory model of my 
DataFusion project. I would like to donate the code I have now and start 
working on it under the Apache Arrow project.

Here is the PR: https://github.com/apache/arrow/pull/1804

 

 





[jira] [Commented] (ARROW-2359) Type objects produced by DataType factory are not thread safe

2018-03-28 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417315#comment-16417315
 ] 

Antoine Pitrou commented on ARROW-2359:
---

> But in case of TYPE_FACTORY implementation, we have not only common/shared 
> control block internals, but common (because static) shared_ptr internals 
> itself.

That doesn't sound accurate to me. TYPE_FACTORY returns a _copy_ of the static 
shared_ptr. Also, the thread stacks you posted don't show any call to 
TYPE_FACTORY, so it seems unlikely that it plays a part here.

After googling a bit, it seems TSAN sometimes produces false positives with 
gcc. That may be one of those. Perhaps you can try with clang?



[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417299#comment-16417299
 ] 

ASF GitHub Bot commented on ARROW-2351:
---

gaolizhou commented on issue #1803: ARROW-2351 [C++] 
StringBuilder::append(vector...) not impleme…
URL: https://github.com/apache/arrow/pull/1803#issuecomment-376878105
 
 
   > We distinguish between empty strings and null strings
   
   Hmm, is it possible that the string is not empty and the null bit is true?




[jira] [Commented] (ARROW-2359) Type objects produced by DataType factory are not thread safe

2018-03-28 Thread Anton Shmigirilov (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417266#comment-16417266
 ] 

Anton Shmigirilov commented on ARROW-2359:
--

Yes, I have an issue with this. It reproduces in a whole project that uses 
Arrow. Part of the project reads a RecordBatch using RecordBatchStreamReader 
and accesses the Schema (RecordBatch::schema()) and the fields' type(). This 
reading executes in concurrent threads. The executable is built with gcc's 
ThreadSanitizer, and here is part of the sanitizer's output:

Atomic write of size 4 at 0x7b10c008 by thread T5:
 #0 __tsan_atomic32_fetch_add  (libtsan.so.0+0x00064aa0)
 #1 __atomic_add /usr/include/c++/7/ext/atomicity.h:53 (exec+0x008a68c3)
 #2 __atomic_add_dispatch /usr/include/c++/7/ext/atomicity.h:96 (exec+0x008a68c3)
 #3 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_add_ref_copy() /usr/include/c++/7/bits/shared_ptr_base.h:138 (FDIOExec+0x008a68c3)
 #4 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count(std::__shared_count<(__gnu_cxx::_Lock_policy)2> const&) /usr/include/c++/7/bits/shared_ptr_base.h:691 (exec+0x008a68c3)
 #5 std::__shared_ptr::__shared_ptr(std::__shared_ptr const&) /usr/include/c++/7/bits/shared_ptr_base.h:1121 (exec+0x008a68c3)
 #6 std::shared_ptr::shared_ptr(std::shared_ptr const&) /usr/include/c++/7/bits/shared_ptr.h:119 (exec+0x008a68c3)
 #7 arrow::Field::type() const /usr/local/fdio-deps/lib/../include/arrow/type.h:244 (exec+0x008a68c3)
 #8 func_reader() (exec+0x008a68c3)

Previous write of size 8 at 0x7b10c008 by thread T4:
 #0 operator new(unsigned long)  (libtsan.so.0+0x0006f846)
 #1 __gnu_cxx::new_allocator, (__gnu_cxx::_Lock_policy)2> >::allocate(unsigned long, void const*) /usr/include/c++/7/ext/new_allocator.h:111 (libarrow.so.0+0x00167307)
 #2 func_reader() (exec+0x00897a38)

It is hard to reproduce with a simple synthetic test, but I will try.

std::shared_ptr is declared thread safe with respect to its control block, but 
as I understand it, that safety is guaranteed only when the control block's 
internals are modified by copying the shared_ptr itself (copies that share a 
common control block). In the TYPE_FACTORY implementation, however, not only 
the control block is shared: the shared_ptr object itself is shared too, 
because it is static. I guess this place is unsafe.

Another opinion on this: shared_ptr was designed to manage a complex object's 
lifecycle through copying, moving, and so on. With TYPE_FACTORY we get none of 
the advantages of shared_ptr, since we have a single instance of the object 
whose lifetime matches the lifetime of the whole process. In my opinion, this 
is not quite the correct usage of the concept.

> Type objects produced by DataType factory are not thread safe
> -
>
> Key: ARROW-2359
> URL: https://issues.apache.org/jira/browse/ARROW-2359
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Anton Shmigirilov
>Priority: Minor
>  Labels: pull-request-available
>
> The TYPE_FACTORY() macro that produces type shortcuts (boolean(), int32(), 
> utf8() and so on) uses a static shared_ptr internally. Race conditions are 
> possible on the shared_ptr's reference counter.





[jira] [Commented] (ARROW-2308) Serialized tensor data should be 64-byte aligned.

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417253#comment-16417253
 ] 

ASF GitHub Bot commented on ARROW-2308:
---

wesm commented on issue #1802: ARROW-2308: [Python] Make deserialized numpy 
arrays 64-byte aligned.
URL: https://github.com/apache/arrow/pull/1802#issuecomment-376865586
 
 
   Will review this when I can. I should also revive ARROW-1860 as there are a 
number of interrelated issues around this stuff




> Serialized tensor data should be 64-byte aligned.
> -
>
> Key: ARROW-2308
> URL: https://issues.apache.org/jira/browse/ARROW-2308
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Nishihara
>Priority: Major
>  Labels: pull-request-available
>
> See [https://github.com/ray-project/ray/issues/1658] for an example of this 
> issue. Non-aligned data can trigger a copy when fed into TensorFlow and 
> things like that.
> {code}
> import pyarrow as pa
> import numpy as np
> x = np.zeros(10)
> y = pa.deserialize(pa.serialize(x).to_buffer())
> x.ctypes.data % 64  # 0 (it starts out aligned)
> y.ctypes.data % 64  # 48 (it is no longer aligned)
> {code}
> It should be possible to fix this by calling something like 
> {{RETURN_NOT_OK(AlignStreamPosition(dst));}} before writing the array data. 
> Note that we already do this before writing the tensor header, but the tensor 
> header is not necessarily a multiple of 64 bytes, so the subsequent data can 
> be unaligned.
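
The padding step suggested in the description can be sketched in plain Python. This is only an illustration of the arithmetic, not Arrow's implementation: `align_stream_position` and `write_tensor` are hypothetical helpers written for this example, and the 80-byte header is an arbitrary size chosen because it is not a multiple of 64.

```python
import io

ALIGNMENT = 64  # bytes; the alignment requested in the issue


def align_stream_position(stream, alignment=ALIGNMENT):
    """Write zero bytes until the stream position is a multiple of `alignment`.

    Hypothetical helper; returns the new (aligned) position.
    """
    padding = -stream.tell() % alignment
    stream.write(b"\x00" * padding)
    return stream.tell()


def write_tensor(stream, header, data):
    """Write `header` then `data`, aligning before each. Returns the data offset."""
    align_stream_position(stream)       # alignment before the header
    stream.write(header)
    # The header length need not be a multiple of 64, so pad again
    # before the data. This is the extra step the issue proposes.
    data_offset = align_stream_position(stream)
    stream.write(data)
    return data_offset


sink = io.BytesIO()
sink.write(b"x" * 5)  # arbitrary earlier bytes, so the stream starts unaligned
offset = write_tensor(sink, header=b"h" * 80, data=bytes(80))
print(offset % ALIGNMENT)
```

An aligned offset within the stream only gives aligned memory if the buffer holding the stream itself starts on a 64-byte boundary; the sketch assumes that it does.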





[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417252#comment-16417252
 ] 

ASF GitHub Bot commented on ARROW-2351:
---

wesm commented on issue #1803: ARROW-2351 [C++] 
StringBuilder::append(vector...) not impleme…
URL: https://github.com/apache/arrow/pull/1803#issuecomment-376865339
 
 
   > IMO, if string is empty, then it should be null, and vice versa.
   
   We distinguish between empty strings and null strings
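
The distinction can be illustrated with a simplified model of a variable-length string column: a validity flag per slot, an offsets list, and one concatenated data buffer. This is a sketch under stated assumptions, not Arrow's builder code: `build_string_column` and `get` are hypothetical names, and a Python list of booleans stands in for the validity bitmap.

```python
def build_string_column(values):
    """Encode a list of str-or-None as (validity, offsets, data)."""
    validity = []
    offsets = [0]
    data = bytearray()
    for value in values:
        if value is None:
            validity.append(False)   # null: slot is marked invalid
        else:
            validity.append(True)    # valid, even when the string is empty
            data += value.encode("utf-8")
        offsets.append(len(data))    # a null or empty slot adds no bytes
    return validity, offsets, bytes(data)


def get(column, i):
    """Return the string at slot i, or None if the slot is null."""
    validity, offsets, data = column
    if not validity[i]:
        return None
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


column = build_string_column(["a", "", None, "bc"])
print(get(column, 1) == "", get(column, 2) is None)
```

In this model an empty string and a null both occupy zero bytes in the data buffer, so only the validity flag tells them apart.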




> [C++] StringBuilder::append(vector...) not implemented
> --
>
> Key: ARROW-2351
> URL: https://issues.apache.org/jira/browse/ARROW-2351
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.9.0
>Reporter: Rares Vernica
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> For {{StringBuilder}} an {{append(vector<string>, uint8_t*)}} function is 
> [declared|https://github.com/apache/arrow/blob/7b2c79765cf92760e1f8cca079159d9613b86412/cpp/src/arrow/builder.h#L721]
>  and 
> [documented|http://arrow.apache.org/docs/cpp/classarrow_1_1_string_builder.html#a59be34b5e11017a392b4ee019d90da3c]
>  but it does not seem to be implemented.
> {code:java}
> undefined reference to `arrow::StringBuilder::Append(std::vector std::allocator > const&, unsigned char*)'
> collect2: error: ld returned 1 exit status
> {code}
> Also worth noting is that the similar function in {{NumericBuilder}} uses 
> {{vector<bool>}} for the null values instead of {{uint8_t*}}. It might be 
> worth making them consistent.
>  





[jira] [Created] (ARROW-2360) Add set_chunksize for RecordBatchReader in arrow/record_batch.h

2018-03-28 Thread Xianjin YE (JIRA)
Xianjin YE created ARROW-2360:
-

 Summary: Add set_chunksize for RecordBatchReader in 
arrow/record_batch.h
 Key: ARROW-2360
 URL: https://issues.apache.org/jira/browse/ARROW-2360
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Xianjin YE


As discussed in [https://github.com/apache/parquet-cpp/pull/445], it may be 
better to expose a chunksize-related API in RecordBatchReader.

However, RecordBatchStreamReader doesn't conform to this requirement.
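
One possible reading of a chunksize setting is that the reader re-slices its input so that every batch it returns holds at most a fixed number of rows. The sketch below shows that idea on plain Python lists. It is an assumption about the intent, not the proposed Arrow API: `rechunk` is a hypothetical helper and lists of rows stand in for record batches.

```python
def rechunk(batches, chunksize):
    """Yield lists of at most `chunksize` rows from an iterable of row batches.

    Every yielded chunk except possibly the last has exactly `chunksize` rows.
    """
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    pending = []
    for batch in batches:
        pending.extend(batch)
        # Emit full chunks as soon as enough rows have accumulated.
        while len(pending) >= chunksize:
            yield pending[:chunksize]
            pending = pending[chunksize:]
    if pending:
        yield pending  # final, possibly shorter, chunk


chunks = list(rechunk([[1, 2, 3], [4], [5, 6, 7, 8, 9]], chunksize=4))
print(chunks)
```

A reader that hands back batches exactly as they were written has no natural place for such a setting, which may be the mismatch the last sentence refers to.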





[jira] [Comment Edited] (ARROW-2359) Type objects produced by DataType factory are not thread safe

2018-03-28 Thread Dou Tu (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417069#comment-16417069
 ] 

Dou Tu edited comment on ARROW-2359 at 3/28/18 9:21 AM:


There is no shared data between two threads here!

So it must be thread-safe!

Should be junked.


was (Author: gaolizhou):
There are no global variables here!

So it must be thread-safe!

Should be junked.



[jira] [Commented] (ARROW-2359) Type objects produced by DataType factory are not thread safe

2018-03-28 Thread Dou Tu (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417069#comment-16417069
 ] 

Dou Tu commented on ARROW-2359:
---

There are no global variables here!

So it must be thread-safe!

Should be junked.



[jira] [Commented] (ARROW-2308) Serialized tensor data should be 64-byte aligned.

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417004#comment-16417004
 ] 

ASF GitHub Bot commented on ARROW-2308:
---

robertnishihara commented on issue #1802: ARROW-2308: [Python] Make 
deserialized numpy arrays 64-byte aligned.
URL: https://github.com/apache/arrow/pull/1802#issuecomment-376796605
 
 
   Yes, that seems related, but for Tensors we want 64-byte alignment.




[jira] [Commented] (ARROW-2308) Serialized tensor data should be 64-byte aligned.

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416980#comment-16416980
 ] 

ASF GitHub Bot commented on ARROW-2308:
---

pcmoritz commented on issue #1802: ARROW-2308: [Python] Make deserialized numpy 
arrays 64-byte aligned.
URL: https://github.com/apache/arrow/pull/1802#issuecomment-376788803
 
 
   LGTM! We might also want to have a discussion about the spec and if we want 
Tensors to be aligned in general/by default. It seems important to me and maybe 
it's already implied by the sentence ```It is required to have all the 
contiguous memory buffers in an IPC payload aligned at 8-byte boundaries. In 
other words, each buffer must start at an aligned 8-byte offset.```
   
   Edit: There is a test failure in ipc-write-test we should fix before merging 
:)




[jira] [Commented] (ARROW-2308) Serialized tensor data should be 64-byte aligned.

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416957#comment-16416957
 ] 

ASF GitHub Bot commented on ARROW-2308:
---

pcmoritz commented on issue #1802: ARROW-2308: [Python] Make deserialized numpy 
arrays 64-byte aligned.
URL: https://github.com/apache/arrow/pull/1802#issuecomment-376788803
 
 
   LGTM! We might also want to have a discussion about the spec and if we want 
Tensors to be aligned in general/by default. It seems important to me and maybe 
it's already implied by the sentence ```It is required to have all the 
contiguous memory buffers in an IPC payload aligned at 8-byte boundaries. In 
other words, each buffer must start at an aligned 8-byte offset.```




[jira] [Commented] (ARROW-2308) Serialized tensor data should be 64-byte aligned.

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416956#comment-16416956
 ] 

ASF GitHub Bot commented on ARROW-2308:
---

pcmoritz commented on issue #1802: ARROW-2308: [Python] Make deserialized numpy 
arrays 64-byte aligned.
URL: https://github.com/apache/arrow/pull/1802#issuecomment-376788803
 
 
   LGTM! We might also want to have a discussion about the spec and if we want 
Tensors to be aligned in general. It seems important to me and maybe it's 
already implied by the sentence ```It is required to have all the contiguous 
memory buffers in an IPC payload aligned at 8-byte boundaries. In other words, 
each buffer must start at an aligned 8-byte offset.```

