[jira] [Updated] (ARROW-2361) [Rust] Start native Rust Implementation
[ https://issues.apache.org/jira/browse/ARROW-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2361:
--
    Labels: pull-request-available  (was: )

> [Rust] Start native Rust Implementation
> ---
>
> Key: ARROW-2361
> URL: https://issues.apache.org/jira/browse/ARROW-2361
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Rust
> Reporter: Andy Grove
> Priority: Major
> Labels: pull-request-available
>
> I'm creating this Jira to track work to donate a work-in-progress native
> Rust implementation of Arrow.
> I am actively developing this and relying on it for the memory model of my
> DataFusion project. I would like to donate the code I have now and start
> working on it under the Apache Arrow project.
> Here is the PR: https://github.com/apache/arrow/pull/1804

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (ARROW-2361) [Rust] Start native Rust Implementation
[ https://issues.apache.org/jira/browse/ARROW-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418425#comment-16418425 ]

ASF GitHub Bot commented on ARROW-2361:
---

wesm commented on issue #1804: ARROW-2361: [Rust] Starting point for a native Rust implementation of Arrow
URL: https://github.com/apache/arrow/pull/1804#issuecomment-377119109

I'm sorta ambivalent on the package name -- I looked at crates.io and there are some other ASF projects with packages that just use the Foo in Apache Foo. If "arrow" is shorter and sweeter, that's no problem

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[jira] [Updated] (ARROW-2361) [Rust] Start native Rust Implementation
[ https://issues.apache.org/jira/browse/ARROW-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2361:
--
    Summary: [Rust] Start native Rust Implementation  (was: Native Rust Implementation)
[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented
[ https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418419#comment-16418419 ]

ASF GitHub Bot commented on ARROW-2351:
---

gaolizhou closed pull request #1806: ARROW-2351 [C++] StringBuilder::append(vector...) not impleme…
URL: https://github.com/apache/arrow/pull/1806

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc
index 2aa73a09a..308bbcd8a 100644
--- a/cpp/src/arrow/array-test.cc
+++ b/cpp/src/arrow/array-test.cc
@@ -989,6 +989,39 @@ TEST_F(TestStringBuilder, TestScalarAppend) {
   }
 }
 
+TEST_F(TestStringBuilder, TestAppendVector) {
+  vector<string> strings = {"", "bb", "a", "", "ccc"};
+  vector<uint8_t> is_null = {0, 0, 0, 1, 0};
+
+  int N = static_cast<int>(strings.size());
+  int reps = 1000;
+
+  for (int j = 0; j < reps; ++j) {
+    ASSERT_OK(builder_->Append(strings, is_null.data()));
+  }
+  Done();
+
+  ASSERT_EQ(reps * N, result_->length());
+  ASSERT_EQ(reps, result_->null_count());
+  ASSERT_EQ(reps * 6, result_->value_data()->size());
+
+  int32_t length;
+  int32_t pos = 0;
+  for (int i = 0; i < N * reps; ++i) {
+    if (is_null[i % N]) {
+      ASSERT_TRUE(result_->IsNull(i));
+    } else {
+      ASSERT_FALSE(result_->IsNull(i));
+      result_->GetValue(i, &length);
+      ASSERT_EQ(pos, result_->value_offset(i));
+      ASSERT_EQ(static_cast<int>(strings[i % N].size()), length);
+      ASSERT_EQ(strings[i % N], result_->GetString(i));
+
+      pos += length;
+    }
+  }
+}
+
 TEST_F(TestStringBuilder, TestZeroLength) {
   // All buffers are null
   Done();
diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc
index aa9f3ce42..ec486566f 100644
--- a/cpp/src/arrow/builder.cc
+++ b/cpp/src/arrow/builder.cc
@@ -16,11 +16,11 @@
 // under the License.
 
 #include "arrow/builder.h"
-
 #include
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -1385,6 +1385,28 @@ const uint8_t* BinaryBuilder::GetValue(int64_t i, int32_t* out_length) const {
 
 StringBuilder::StringBuilder(MemoryPool* pool) : BinaryBuilder(utf8(), pool) {}
 
+Status StringBuilder::Append(const std::vector<std::string>& values,
+                             uint8_t* null_bytes) {
+  std::size_t total_length = std::accumulate(
+      values.begin(), values.end(), 0ULL,
+      [](uint64_t sum, const std::string& str) { return sum + str.size(); });
+  RETURN_NOT_OK(Reserve(values.size()));
+  RETURN_NOT_OK(value_data_builder_.Reserve(total_length));
+  RETURN_NOT_OK(offsets_builder_.Reserve(values.size()));
+
+  for (std::size_t i = 0; i < values.size(); ++i) {
+    RETURN_NOT_OK(AppendNextOffset());
+    if (null_bytes[i]) {
+      UnsafeAppendToBitmap(false);
+    } else {
+      RETURN_NOT_OK(value_data_builder_.Append(
+          reinterpret_cast<const uint8_t*>(values[i].data()), values[i].size()));
+      UnsafeAppendToBitmap(true);
+    }
+  }
+  return Status::OK();
+}
+
 // --
 // Fixed width binary

> [C++] StringBuilder::append(vector...) not implemented
> --
>
> Key: ARROW-2351
> URL: https://issues.apache.org/jira/browse/ARROW-2351
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 0.9.0
> Reporter: Rares Vernica
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.10.0
>
> For {{StringBuilder}} an {{append(vector, uint8_t*)}} function is
> [declared|https://github.com/apache/arrow/blob/7b2c79765cf92760e1f8cca079159d9613b86412/cpp/src/arrow/builder.h#L721]
> and
> [documented|http://arrow.apache.org/docs/cpp/classarrow_1_1_string_builder.html#a59be34b5e11017a392b4ee019d90da3c]
> but it does not seem to be implemented.
> {code:java}
> undefined reference to `arrow::StringBuilder::Append(std::vector std::allocator > const&, unsigned char*)'
> collect2: error: ld returned 1 exit status
> {code}
> Also worth noting is that the similar function in {{NumericBuilder}} uses
> {{vector}} for the null values instead of {{uint8_t*}}. It might be
> worth making them consistent.
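For readers following the patch in this message, the buffer layout it builds can be modelled in a few lines of plain Python. This is only a sketch under stated assumptions: `append_strings` is a hypothetical helper and not part of the Arrow API, and a Python list of offsets, a list of booleans, and a bytearray stand in for Arrow's offsets buffer, validity bitmap, and value data buffer. It mirrors the loop in the diff: record the current data offset for every slot, and copy bytes only for non-null entries.

```python
def append_strings(values, null_bytes, offsets=None, validity=None, data=None):
    """Append values to (offsets, validity, data); null slots contribute no bytes."""
    offsets = [] if offsets is None else offsets
    validity = [] if validity is None else validity
    data = bytearray() if data is None else data
    for value, is_null in zip(values, null_bytes):
        # Every slot, null or not, records where its data would start
        # (the role AppendNextOffset() plays in the patch).
        offsets.append(len(data))
        if is_null:
            validity.append(False)
        else:
            data.extend(value.encode("utf-8"))
            validity.append(True)
    return offsets, validity, data


# Same inputs as the TestAppendVector case in the diff.
strings = ["", "bb", "a", "", "ccc"]
is_null = [0, 0, 0, 1, 0]
offsets, validity, data = append_strings(strings, is_null)
```

With these inputs the model yields six data bytes and one null per call, which is what the test's `reps * 6` and `reps` assertions expect. Note that an empty string and a null are distinct here: both occupy zero bytes, and only the validity entry tells them apart.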
[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented
[ https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418420#comment-16418420 ]

ASF GitHub Bot commented on ARROW-2351:
---

gaolizhou opened a new pull request #1803: ARROW-2351 [C++] StringBuilder::append(vector...) not impleme…
URL: https://github.com/apache/arrow/pull/1803

Changed the API from `Status Append(const std::vector<std::string>& values, uint8_t* null_bytes);` to `Status Append(const std::vector<std::string>& values);`
IMO, if a string is empty, then it should be null, and vice versa.

**[update]** Changed the API back to the original.
[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented
[ https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418390#comment-16418390 ]

ASF GitHub Bot commented on ARROW-2351:
---

gaolizhou commented on issue #1806: ARROW-2351 [C++] StringBuilder::append(vector...) not impleme…
URL: https://github.com/apache/arrow/pull/1806#issuecomment-377111241

Sorry guys, I closed the pull request by mistake. I have just re-opened it.
[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented
[ https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418386#comment-16418386 ]

ASF GitHub Bot commented on ARROW-2351:
---

gaolizhou opened a new pull request #1806: ARROW-2351 [C++] StringBuilder::append(vector...) not impleme…
URL: https://github.com/apache/arrow/pull/1806
[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented
[ https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418384#comment-16418384 ]

ASF GitHub Bot commented on ARROW-2351:
---

gaolizhou commented on issue #1803: ARROW-2351 [C++] StringBuilder::append(vector...) not impleme…
URL: https://github.com/apache/arrow/pull/1803#issuecomment-377110720

Please help review it and leave your comments.
[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented
[ https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418382#comment-16418382 ]

ASF GitHub Bot commented on ARROW-2351:
---

gaolizhou closed pull request #1803: ARROW-2351 [C++] StringBuilder::append(vector...) not impleme…
URL: https://github.com/apache/arrow/pull/1803
[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented
[ https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418381#comment-16418381 ]

ASF GitHub Bot commented on ARROW-2351:
---

gaolizhou commented on issue #1803: ARROW-2351 [C++] StringBuilder::append(vector...) not impleme…
URL: https://github.com/apache/arrow/pull/1803#issuecomment-377110720

Please help review it and leave your comments.
[jira] [Commented] (ARROW-2359) Type objects produced by DataType factory are not thread safe
[ https://issues.apache.org/jira/browse/ARROW-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418191#comment-16418191 ]

Anton Shmigirilov commented on ARROW-2359:
--

It seems the original topic message is incorrect.

It looks like there aren't races in the singleton itself. There is another race condition somewhere in my code; it is related to arrow::DataType usages, and it is eliminated if the singleton is removed using the proposed patch.

Anyway, this JIRA task has a wrong description and can be removed or renamed, if needed.

Guys, sorry for the confusion.

(But I still think that a static shared_ptr is over-engineering ;) )

> Type objects produced by DataType factory are not thread safe
> -
>
> Key: ARROW-2359
> URL: https://issues.apache.org/jira/browse/ARROW-2359
> Project: Apache Arrow
> Issue Type: Task
> Components: C++
> Reporter: Anton Shmigirilov
> Priority: Minor
> Labels: pull-request-available
>
> TYPE_FACTORY() macro that produces type shortcuts (boolean(), int32(), utf8()
> and so on) uses static shared_ptr inside. There are race conditions possible
> against shared_ptr's reference counter.
[jira] [Commented] (ARROW-1780) JDBC Adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/ARROW-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418166#comment-16418166 ]

ASF GitHub Bot commented on ARROW-1780:
---

atuldambalkar commented on issue #1759: ARROW-1780 - [WIP] JDBC Adapter to convert Relational Data objects to Arrow Data Format Vector Objects
URL: https://github.com/apache/arrow/pull/1759#issuecomment-377045969

Hi @laurentgo, I now have a handful of review comments to work on. As I work on each of them, some may need a short discussion with you. I hope that's okay. Thanks.

> JDBC Adapter for Apache Arrow
> -
>
> Key: ARROW-1780
> URL: https://issues.apache.org/jira/browse/ARROW-1780
> Project: Apache Arrow
> Issue Type: New Feature
> Reporter: Atul Dambalkar
> Assignee: Atul Dambalkar
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.10.0
>
> At a high level the JDBC Adapter will allow upstream apps to query RDBMS data
> over JDBC and get the JDBC objects converted to Arrow objects/structures. The
> upstream utility can then work with Arrow objects/structures with the usual
> performance benefits. The utility will be very similar to the C++
> implementation of "Convert a vector of row-wise data into an Arrow table" as
> described here -
> https://arrow.apache.org/docs/cpp/md_tutorials_row_wise_conversion.html
> The utility will read data from the RDBMS and convert it into Arrow
> objects/structures, so from that perspective it only reads data from the RDBMS.
> Whether the utility can push Arrow objects to the RDBMS is something that needs
> to be discussed and will be out of scope for this utility for now.
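The row-wise to column-wise conversion the issue describes can be illustrated with a small standard-library sketch. Everything here is an assumption made for illustration: Python's `sqlite3` module stands in for a JDBC source, plain Python lists stand in for Arrow vectors, and `rows_to_columns` is a hypothetical helper, not part of the proposed adapter.

```python
import sqlite3


def rows_to_columns(cursor):
    """Pivot a DB-API cursor's row-wise result set into one list per column."""
    # cursor.description holds one tuple per result column; item 0 is the name.
    names = [column[0] for column in cursor.description]
    columns = {name: [] for name in names}
    for row in cursor:
        for name, value in zip(names, row):
            columns[name].append(value)  # SQL NULL arrives as None
    return columns


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, None), (3, "c")])
columns = rows_to_columns(conn.execute("SELECT id, name FROM t ORDER BY id"))
conn.close()
```

A real adapter would also have to map each SQL type to an Arrow type and track nulls in a validity bitmap; the sketch keeps only the pivot step. It also assumes result column names are unique.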
[jira] [Commented] (ARROW-2308) Serialized tensor data should be 64-byte aligned.
[ https://issues.apache.org/jira/browse/ARROW-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418003#comment-16418003 ]

ASF GitHub Bot commented on ARROW-2308:
---

robertnishihara commented on issue #1802: ARROW-2308: [Python] Make deserialized numpy arrays 64-byte aligned.
URL: https://github.com/apache/arrow/pull/1802#issuecomment-377011536

Thanks @wesm! @pcmoritz, good catch, the bug should be fixed now.

> Serialized tensor data should be 64-byte aligned.
> -
>
> Key: ARROW-2308
> URL: https://issues.apache.org/jira/browse/ARROW-2308
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Robert Nishihara
> Priority: Major
> Labels: pull-request-available
>
> See [https://github.com/ray-project/ray/issues/1658] for an example of this
> issue. Non-aligned data can trigger a copy when fed into TensorFlow and
> things like that.
> {code}
> import pyarrow as pa
> import numpy as np
> x = np.zeros(10)
> y = pa.deserialize(pa.serialize(x).to_buffer())
> x.ctypes.data % 64  # 0 (it starts out aligned)
> y.ctypes.data % 64  # 48 (it is no longer aligned)
> {code}
> It should be possible to fix this by calling something like
> {{RETURN_NOT_OK(AlignStreamPosition(dst));}} before writing the array data.
> Note that we already do this before writing the tensor header, but the tensor
> header is not necessarily a multiple of 64 bytes, so the subsequent data can
> be unaligned.
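The arithmetic behind the suggested fix can be sketched as follows. This is an illustration only, not the implementation of `AlignStreamPosition`: `padding_needed` and `aligned_position` are hypothetical helper names, and the stream is reduced to a byte offset. The point is that a header whose size is not a multiple of 64 leaves the write position unaligned, so padding bytes have to be written before the array data.

```python
ALIGNMENT = 64  # bytes, the alignment the issue asks for


def padding_needed(position, alignment=ALIGNMENT):
    """Number of padding bytes to write so that position becomes aligned."""
    return (-position) % alignment


def aligned_position(position, alignment=ALIGNMENT):
    """Smallest multiple of alignment that is greater than or equal to position."""
    return position + padding_needed(position, alignment)


# A write position 48 bytes past an aligned boundary (compare the
# "% 64 == 48" observation in the issue) needs 16 bytes of padding.
pad = padding_needed(48)
```

An already aligned position needs no padding, so calling the helper before every data write is harmless.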
[jira] [Commented] (ARROW-2359) Type objects produced by DataType factory are not thread safe
[ https://issues.apache.org/jira/browse/ARROW-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417995#comment-16417995 ]

Antoine Pitrou commented on ARROW-2359:
---

C++11 guarantees the initialization cannot involve any race condition (see Stack Overflow post above).
[jira] [Commented] (ARROW-2330) [C++] Optimize delta buffer creation with partially finishable array builders
[ https://issues.apache.org/jira/browse/ARROW-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417967#comment-16417967 ]

ASF GitHub Bot commented on ARROW-2330:
---

alendit commented on issue #1769: ARROW-2330: [C++] Optimize delta buffer creation with partially finishable array builders
URL: https://github.com/apache/arrow/pull/1769#issuecomment-377005372

Hi Uwe,

I was actually wondering the same thing myself. The reason I've implemented partial finishers for `FixedSizeBinaryBuilder` etc. is that they are used by the `DictionaryBuilder` internally. I've tried to come up with a use case for partial finishing in the other builders, but, as you said, you can almost always simply instantiate a new one. Do you know why the `RecordBatchBuilder` has a partial `Flush` method? Is it because its instantiation is more cumbersome compared to array builders?

That being said, I don't think that it adds that much complexity compared to the previous implementation. Most of the changes are in the testing code, and the builder files themselves have around 30 additional LOC. At the same time, it makes the `DictionaryBuilder` quite straightforward.

Maybe a compromise would be to hide the partial `Finish` methods for other classes besides `DictionaryBuilder` and make them friends with it? I'm not a big friend of `friend`, though.

> [C++] Optimize delta buffer creation with partially finishable array builders
> -
>
> Key: ARROW-2330
> URL: https://issues.apache.org/jira/browse/ARROW-2330
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Affects Versions: 0.8.0
> Reporter: Dimitri Vorona
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.10.0
>
> The main aim of this change is to optimize the building of delta
> dictionaries. In the current version delta dictionaries are built using an
> additional "overflow" buffer, which leads to complicated and potentially
> error-prone code and subpar performance by doubling the number of lookups.
> I solve this problem by introducing the notion of partially finishable array
> builders, i.e. builders which are able to retain their state on calling Finish.
> The interface is based on RecordBatchBuilder::Flush, i.e. Finish is
> overloaded with the additional signature Finish(bool reset_builder,
> std::shared_ptr<Array>* out). The resulting Arrays point to the same data
> buffer with different offsets.
> I'm aware that the change is kind of biggish, but I'd like to discuss it
> here. The solution makes the code more straightforward, doesn't bloat the
> code base too much and leaves the API more or less untouched. Additionally,
> the new way to make delta dictionaries by using a different call signature to
> Finish feels cleaner to me.
> I'm looking forward to your criticism and improvement ideas.
> The pull request is available at: https://github.com/apache/arrow/pull/1769
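The idea of a builder that keeps its state across Finish calls can be pictured with a toy model. This is a sketch of the concept only, with loudly stated assumptions: `ToyDictionaryBuilder` and its methods are hypothetical names, Python lists replace Arrow buffers, and the delta is returned as a list slice (a copy), whereas the proposal describes arrays that share one data buffer at different offsets.

```python
class ToyDictionaryBuilder:
    """Toy dictionary encoder whose finish() can retain the lookup table."""

    def __init__(self):
        self._index_of = {}  # value -> dictionary index (memo table)
        self._values = []    # dictionary values in first-seen order
        self._indices = []   # indices appended since the last finish()
        self._flushed = 0    # dictionary entries already handed out

    def append(self, value):
        index = self._index_of.get(value)
        if index is None:
            index = len(self._values)
            self._index_of[value] = index
            self._values.append(value)
        self._indices.append(index)

    def finish(self, reset_builder=True):
        """Return (indices, new dictionary entries since the last finish)."""
        delta = self._values[self._flushed:]
        indices, self._indices = self._indices, []
        if reset_builder:
            self._index_of, self._values, self._flushed = {}, [], 0
        else:
            # Keep the memo table so later batches reuse existing indices.
            self._flushed = len(self._values)
        return indices, delta


builder = ToyDictionaryBuilder()
for value in ["a", "b", "a"]:
    builder.append(value)
first = builder.finish(reset_builder=False)
for value in ["b", "c"]:
    builder.append(value)
second = builder.finish(reset_builder=False)
```

In the second batch "b" reuses index 1 from the retained table and only "c" appears in the delta, so a single lookup table serves every batch and no separate overflow structure is needed.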
[jira] [Commented] (ARROW-2359) Type objects produced by DataType factory are not thread safe
[ https://issues.apache.org/jira/browse/ARROW-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417958#comment-16417958 ]

Wes McKinney commented on ARROW-2359:
-

Is it possible there is a race condition in the initialization of the global static variable? Seems pretty esoteric
[jira] [Commented] (ARROW-2359) Type objects produced by DataType factory are not thread safe
[ https://issues.apache.org/jira/browse/ARROW-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417917#comment-16417917 ] Phillip Cloud commented on ARROW-2359: -- Also, http://en.cppreference.com/w/cpp/memory/shared_ptr states the following: {code} All member functions (including copy constructor and copy assignment) can be called by multiple threads on different instances of shared_ptr without additional synchronization even if these instances are copies and share ownership of the same object. {code} These functions return a copy of a {{shared_ptr}}. > Type objects produced by DataType factory are not thread safe > - > > Key: ARROW-2359 > URL: https://issues.apache.org/jira/browse/ARROW-2359 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Anton Shmigirilov >Priority: Minor > Labels: pull-request-available > > TYPE_FACTORY() macro that produces type shortcuts (boolean(), int32(), utf8() > and so on) uses static shared_ptr inside. There are race conditions possible > against shared_ptr's reference counter. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
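To make the point about returning copies concrete, here is a minimal sketch of the pattern being discussed. It does not use Arrow: `ToyType`, `toy_int32()` and `CopyFromThreads()` are hypothetical names invented for illustration, and the real TYPE_FACTORY macro may differ in detail. It relies on two guarantees: since C++11 a function-local static is initialized exactly once even under concurrent calls, and copying or destroying distinct shared_ptr instances that share ownership needs no extra synchronization.

```cpp
#include <cassert>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical stand-in for a data type object; not an Arrow class.
struct ToyType {
  int id;
};

// Mimics the pattern under discussion: a function-local static shared_ptr
// that is returned by value, so every caller receives its own copy.
std::shared_ptr<ToyType> toy_int32() {
  // Since C++11 this initialization runs exactly once, even when several
  // threads reach this line at the same time.
  static std::shared_ptr<ToyType> result =
      std::make_shared<ToyType>(ToyType{32});
  return result;  // the copy increments the shared reference count
}

// Copies the singleton from several threads, then reports how many owners
// remain besides the temporary used for the query (assumed helper).
long CopyFromThreads(int num_threads, int copies_per_thread) {
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([copies_per_thread] {
      for (int i = 0; i < copies_per_thread; ++i) {
        std::shared_ptr<ToyType> local = toy_int32();
        assert(local->id == 32);
      }  // local is destroyed here, decrementing the count
    });
  }
  for (auto& w : workers) w.join();
  // The temporary returned by toy_int32() is one owner; subtract it.
  return toy_int32().use_count() - 1;
}
```

After all threads have joined, `CopyFromThreads(4, 1000)` should report a single remaining owner, the static instance itself. This only demonstrates the guarantee quoted above; it does not by itself explain the sanitizer report in this thread.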
[jira] [Created] (ARROW-2362) [Python] Decimal conversions are slow
Antoine Pitrou created ARROW-2362: - Summary: [Python] Decimal conversions are slow Key: ARROW-2362 URL: https://issues.apache.org/jira/browse/ARROW-2362 Project: Apache Arrow Issue Type: Wish Components: Python Affects Versions: 0.9.0 Reporter: Antoine Pitrou See https://github.com/apache/arrow/pull/1798#issuecomment-376498987 I don't know how critical performance is here, but worth keeping a note of. The {{decimal}} module isn't exposing an official C API, so fixing this would be a bit involved. If this is important, we can also try to push for an official decimal C API in Python (other packages may also benefit). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2359) Type objects produced by DataType factory are not thread safe
[ https://issues.apache.org/jira/browse/ARROW-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417837#comment-16417837 ] ASF GitHub Bot commented on ARROW-2359: --- pitrou commented on issue #1800: ARROW-2359: [C++] do not use static shared_ptr in TYPE_FACTORY to make it thread safe URL: https://github.com/apache/arrow/pull/1800#issuecomment-376978097 Right now it isn't proven that there is something to fix here. The discussion is happening on the JIRA ticket. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Type objects produced by DataType factory are not thread safe > - > > Key: ARROW-2359 > URL: https://issues.apache.org/jira/browse/ARROW-2359 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Anton Shmigirilov >Priority: Minor > Labels: pull-request-available > > TYPE_FACTORY() macro that produces type shortcuts (boolean(), int32(), utf8() > and so on) uses static shared_ptr inside. There are race conditions possible > against shared_ptr's reference counter. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented
[ https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417651#comment-16417651 ] ASF GitHub Bot commented on ARROW-2351: --- xhochy commented on issue #1803: ARROW-2351 [C++] StringBuilder::append(vector...) not impleme… URL: https://github.com/apache/arrow/pull/1803#issuecomment-376943418 > Do we want to update the API as well to Status Append(const std::vector& values, vector null_bytes); to match the API for NumericBuilder? Usability-wise this definitely makes sense but this can also be done in a followup-PR. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] StringBuilder::append(vector...) not implemented > -- > > Key: ARROW-2351 > URL: https://issues.apache.org/jira/browse/ARROW-2351 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.9.0 >Reporter: Rares Vernica >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > For {{StringBuilder}} an {{append(vector, uint8_t*)}} function is > [declared|https://github.com/apache/arrow/blob/7b2c79765cf92760e1f8cca079159d9613b86412/cpp/src/arrow/builder.h#L721] > and > [documented|http://arrow.apache.org/docs/cpp/classarrow_1_1_string_builder.html#a59be34b5e11017a392b4ee019d90da3c] > but it does not seem to be implemented. > {code:java} > undefined reference to `arrow::StringBuilder::Append(std::vector std::allocator > const&, unsigned char*)' > collect2: error: ld returned 1 exit status > {code} > Also worth noting is that the similar function in {{NumericBuilder}} uses > {{vector}} for the null values instead of {{uint8_t*}}. It might be > worth making them consistent. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2330) [C++] Optimize delta buffer creation with partially finishable array builders
[ https://issues.apache.org/jira/browse/ARROW-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417644#comment-16417644 ] ASF GitHub Bot commented on ARROW-2330: --- xhochy commented on issue #1769: ARROW-2330: [C++] Optimize delta buffer creation with partially finishable array builders URL: https://github.com/apache/arrow/pull/1769#issuecomment-376942167 What is the benefit of having ArrayBuilders that partly reset themselves versus just instantiating a new ArrayBuilder? I get it for the Dictionary case where we keep state but for FixedSizeBinary I don't see the need to add such complexity. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Optimize delta buffer creation with partially finishable array builders > - > > Key: ARROW-2330 > URL: https://issues.apache.org/jira/browse/ARROW-2330 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Affects Versions: 0.8.0 >Reporter: Dimitri Vorona >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > The main aim of this change is to optimize the building of delta > dictionaries. In the current version delta dictionaries are built using an > additional "overflow" buffer which leads to complicated and potentially > error-prone code and subpar performance by doubling the number of lookups. > I solve this problem by introducing the notion of partially finishable array > builders, i.e. builders which are able to retain their state on calling Finish. > The interface is based on RecordBatchBuilder::Flush, i.e. Finish is > overloaded with an additional signature Finish(bool reset_builder, > std::shared_ptr* out). The resulting Arrays point to the same data > buffer with different offsets. 
> I'm aware that the change is kind of biggish, but I'd like to discuss it > here. The solution makes the code more straightforward, doesn't bloat the > code base too much and leaves the API more or less untouched. Additionally, > the new way to make delta dictionaries by using a different call signature to > Finish feels cleaner to me. > I'm looking forward to your critique and improvement ideas. > The pull request is available at: https://github.com/apache/arrow/pull/1769 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
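The idea of a builder that retains its state across Finish can be sketched roughly as follows. This is not the code from the pull request: `ToySlice` and `ToyDictionaryBuilder` are hypothetical types invented here, and `Finish` returns its result by value instead of using the `Finish(bool reset_builder, ... out)` signature described in the issue. The sketch only illustrates how successive results can share one buffer and differ in offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical view: the range [offset, offset + length) of a shared buffer.
struct ToySlice {
  std::shared_ptr<const std::vector<std::string>> buffer;
  int64_t offset;
  int64_t length;

  const std::string& Value(int64_t i) const {
    assert(i >= 0 && i < length);
    return (*buffer)[static_cast<std::size_t>(offset + i)];
  }
};

// Hypothetical dictionary builder that can be "partially finished".
class ToyDictionaryBuilder {
 public:
  // Returns the dictionary index of value, inserting it if it is new.
  int32_t Append(const std::string& value) {
    auto it = memo_.find(value);
    if (it != memo_.end()) return it->second;
    const int32_t index = static_cast<int32_t>(values_->size());
    values_->push_back(value);
    memo_.emplace(value, index);
    return index;
  }

  // Emits the entries added since the previous Finish call. With
  // reset_builder == false the memo table and buffer are kept, so the next
  // call yields a delta that starts where this one ended.
  ToySlice Finish(bool reset_builder) {
    const int64_t end = static_cast<int64_t>(values_->size());
    ToySlice out{values_, delta_start_, end - delta_start_};
    if (reset_builder) {
      values_ = std::make_shared<std::vector<std::string>>();
      memo_.clear();
      delta_start_ = 0;
    } else {
      delta_start_ = end;
    }
    return out;
  }

 private:
  std::shared_ptr<std::vector<std::string>> values_ =
      std::make_shared<std::vector<std::string>>();
  std::unordered_map<std::string, int32_t> memo_;  // value -> index
  int64_t delta_start_ = 0;  // first entry not yet emitted
};
```

In this sketch each value is looked up once in a single memo table, and a delta is just a later offset into the same buffer, which is the property the issue description attributes to the proposed design. Whether the extra builder state is worth it for non-dictionary builders is the question raised in the review comment above.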
[jira] [Commented] (ARROW-2350) Shrink size of spark_integration Docker container
[ https://issues.apache.org/jira/browse/ARROW-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417632#comment-16417632 ] ASF GitHub Bot commented on ARROW-2350: --- xhochy commented on issue #1787: ARROW-2350: Consolidated RUN step in spark_integration Dockerfile URL: https://github.com/apache/arrow/pull/1787#issuecomment-376941028 @jameslamb yes, that would be great. It looks like we could trim down all of them. In some cases, we install python packages from `pip install git+…`. These steps should maybe stay separate so that we can only delete the docker cache for them and have a fast rebuild. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Shrink size of spark_integration Docker container > - > > Key: ARROW-2350 > URL: https://issues.apache.org/jira/browse/ARROW-2350 > Project: Apache Arrow > Issue Type: Improvement >Reporter: James Lamb >Assignee: James Lamb >Priority: Minor > Labels: docker, pull-request-available, spark > Fix For: 0.10.0 > > Original Estimate: 10m > Remaining Estimate: 10m > > I would like to propose a few changes to the spark_integration Dockerfile: > [https://github.com/apache/arrow/tree/master/dev/spark_integration] > The size of the resulting image can be reduced by making the following > changes: > * consolidating all RUN commands into a single RUN layer (reducing the > number of layers) > * running {color:#14892c}apt-get clean{color} to clear out the package cache > * running {color:#14892c}conda clean --all{color} to clear out cached > package tarballs, abandoned package versions, and other build artifacts from > all the libraries that are conda installed > I will be submitting a PR on GitHub shortly. Generating this issue first so I > can tag my PR to it. 
[jira] [Commented] (ARROW-2359) Type objects produced by DataType factory are not thread safe
[ https://issues.apache.org/jira/browse/ARROW-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417565#comment-16417565 ] Antoine Pitrou commented on ARROW-2359: --- I don't know what the rationale is, though the aim may be to avoid lots of spurious allocations. [~wesmckinn] However, the question is whether there is an actual issue with the singleton pattern here, and I'm not convinced there is. > Type objects produced by DataType factory are not thread safe > - > > Key: ARROW-2359 > URL: https://issues.apache.org/jira/browse/ARROW-2359 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Anton Shmigirilov >Priority: Minor > Labels: pull-request-available > > TYPE_FACTORY() macro that produces type shortcuts (boolean(), int32(), utf8() > and so on) uses static shared_ptr inside. There are race conditions possible > against shared_ptr's reference counter. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2359) Type objects produced by DataType factory are not thread safe
[ https://issues.apache.org/jira/browse/ARROW-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417556#comment-16417556 ] Anton Shmigirilov commented on ARROW-2359: -- > TYPE_FACTORY returns a _copy_ of the static shared_ptr If so, it's unclear to me why a shared_ptr is used there, or why the object is static. Is the only reason to save a bit of memory? RAII should already help to manage memory reasonably. Anyway, I propose to consider whether such a pattern is reasonable here. > Type objects produced by DataType factory are not thread safe > - > > Key: ARROW-2359 > URL: https://issues.apache.org/jira/browse/ARROW-2359 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Anton Shmigirilov >Priority: Minor > Labels: pull-request-available > > TYPE_FACTORY() macro that produces type shortcuts (boolean(), int32(), utf8() > and so on) uses static shared_ptr inside. There are race conditions possible > against shared_ptr's reference counter. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented
[ https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417537#comment-16417537 ] ASF GitHub Bot commented on ARROW-2351: --- rvernica commented on issue #1803: ARROW-2351 [C++] StringBuilder::append(vector...) not impleme… URL: https://github.com/apache/arrow/pull/1803#issuecomment-376927452 Do we want to update the API as well to `Status Append(const std::vector& values, vector null_bytes);` to match the API for `NumericBuilder`? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] StringBuilder::append(vector...) not implemented > -- > > Key: ARROW-2351 > URL: https://issues.apache.org/jira/browse/ARROW-2351 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.9.0 >Reporter: Rares Vernica >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > For {{StringBuilder}} an {{append(vector, uint8_t*)}} function is > [declared|https://github.com/apache/arrow/blob/7b2c79765cf92760e1f8cca079159d9613b86412/cpp/src/arrow/builder.h#L721] > and > [documented|http://arrow.apache.org/docs/cpp/classarrow_1_1_string_builder.html#a59be34b5e11017a392b4ee019d90da3c] > but it does not seem to be implemented. > {code:java} > undefined reference to `arrow::StringBuilder::Append(std::vector std::allocator > const&, unsigned char*)' > collect2: error: ld returned 1 exit status > {code} > Also worth noting is that the similar function in {{NumericBuilder}} uses > {{vector}} for the null values instead of {{uint8_t*}}. It might be > worth making them consistent. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2347) [Python] Multiple warnings with -Wconversion
[ https://issues.apache.org/jira/browse/ARROW-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417526#comment-16417526 ] Antoine Pitrou commented on ARROW-2347: --- > Hm, are there other solutions to the enum signedness issue? I'm not sure. It seems Cython is generating that code precisely to know whether the enum is signed: {code:c} static CYTHON_INLINE PyObject* __Pyx_PyInt_From_enumarrow_3a__3a_Type_3a__3a_type(enum arrow::Type::type value) { const enum arrow::Type::type neg_one = (enum arrow::Type::type) -1, const_zero = (enum arrow::Type::type) 0; const int is_unsigned = neg_one > const_zero; if (is_unsigned) { if (sizeof(enum arrow::Type::type) < sizeof(long)) { return PyInt_FromLong((long) value); } else if (sizeof(enum arrow::Type::type) <= sizeof(unsigned long)) { return PyLong_FromUnsignedLong((unsigned long) value); #ifdef HAVE_LONG_LONG } else if (sizeof(enum arrow::Type::type) <= sizeof(unsigned PY_LONG_LONG)) { return PyLong_FromUnsignedLongLong((unsigned PY_LONG_LONG) value); #endif } } else { if (sizeof(enum arrow::Type::type) <= sizeof(long)) { return PyInt_FromLong((long) value); #ifdef HAVE_LONG_LONG } else if (sizeof(enum arrow::Type::type) <= sizeof(PY_LONG_LONG)) { return PyLong_FromLongLong((PY_LONG_LONG) value); #endif } } { int one = 1; int little = (int)*(unsigned char *)&one; unsigned char *bytes = (unsigned char *)&value; return _PyLong_FromByteArray(bytes, sizeof(enum arrow::Type::type), little, !is_unsigned); } } {code} > [Python] Multiple warnings with -Wconversion > > > Key: ARROW-2347 > URL: https://issues.apache.org/jira/browse/ARROW-2347 > Project: Apache Arrow > Issue Type: Bug >Reporter: Antoine Pitrou >Priority: Minor > > There are multiple warnings when compiling the Cython-generated code with > {{-Wconversion}}: > {code} > /home/antoine/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx: In function > 'PyObject* __pyx_pf_7pyarrow_3lib_62union(PyObject*, PyObject*, PyObject*)': > 
/home/antoine/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:23850:45: > warning: conversion to 'std::vector::value_type {aka unsigned > char}' from 'int' may alter its value [-Wconversion] >__pyx_v_type_codes.push_back(__pyx_v_i); > ^ > /home/antoine/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx: In function > 'PyObject* > __Pyx_PyInt_From_enumarrow_3a__3a_Type_3a__3a_type(arrow::Type::type)': > /home/antoine/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:125758:70: > warning: the result of the conversion is unspecified because '-1' is outside > the range of type 'arrow::Type::type' [-Wconversion] > const enum arrow::Type::type neg_one = (enum arrow::Type::type) -1, > const_zero = (enum arrow::Type::type) 0; > ^ > /home/antoine/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx: In function > 'PyObject* > __Pyx_PyInt_From_enumarrow_3a__3a_UnionMode_3a__3a_type(arrow::UnionMode::type)': > /home/antoine/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:125789:80: > warning: the result of the conversion is unspecified because '-1' is outside > the range of type 'arrow::UnionMode::type' [-Wconversion] > const enum arrow::UnionMode::type neg_one = (enum > arrow::UnionMode::type) -1, const_zero = (enum arrow::UnionMode::type) 0; > > ^ > /home/antoine/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx: In function > 'PyObject* > __Pyx_PyInt_From_enumarrow_3a__3a_TimeUnit_3a__3a_type(arrow::TimeUnit::type)': > /home/antoine/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:125820:78: > warning: the result of the conversion is unspecified because '-1' is outside > the range of type 'arrow::TimeUnit::type' [-Wconversion] > const enum arrow::TimeUnit::type neg_one = (enum > arrow::TimeUnit::type) -1, const_zero = (enum arrow::TimeUnit::type) 0; > > ^ > {code} > (also similar warnings for _parquet.pyx due to Parquet enumerations) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
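For comparison, hand-written C++ can query an enum's signedness at compile time without casting -1 to the enum type, which is the cast the warnings above complain about. The sketch below is only an illustration of that alternative, not something Cython is known to generate; `ToyUnsignedEnum`, `ToySignedEnum` and `EnumIsUnsigned` are hypothetical names.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical enums standing in for arrow::Type::type and friends.
enum ToyUnsignedEnum : unsigned char { kToyA, kToyB };
enum ToySignedEnum : int { kToyNeg = -1, kToyPos = 1 };

// Reports whether the enum's underlying integer type is unsigned, using
// type traits only, so no out-of-range value is ever converted.
template <typename E>
constexpr bool EnumIsUnsigned() {
  static_assert(std::is_enum<E>::value, "E must be an enumeration type");
  return std::is_unsigned<typename std::underlying_type<E>::type>::value;
}
```

Because this relies on `<type_traits>`, it is C++11-only, whereas the generated helper shown above also has to work when compiled as plain C. That is one plausible reason the generated code uses the `neg_one > const_zero` comparison instead.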
[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented
[ https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417497#comment-16417497 ] ASF GitHub Bot commented on ARROW-2351: --- xhochy commented on issue #1803: ARROW-2351 [C++] StringBuilder::append(vector...) not impleme… URL: https://github.com/apache/arrow/pull/1803#issuecomment-376921022 > hmmm, is it possible that the string is not empty and the null bit is true ? No, this should not happen but this should hopefully not make a difference. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] StringBuilder::append(vector...) not implemented > -- > > Key: ARROW-2351 > URL: https://issues.apache.org/jira/browse/ARROW-2351 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.9.0 >Reporter: Rares Vernica >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > For {{StringBuilder}} an {{append(vector, uint8_t*)}} function is > [declared|https://github.com/apache/arrow/blob/7b2c79765cf92760e1f8cca079159d9613b86412/cpp/src/arrow/builder.h#L721] > and > [documented|http://arrow.apache.org/docs/cpp/classarrow_1_1_string_builder.html#a59be34b5e11017a392b4ee019d90da3c] > but it does not seem to be implemented. > {code:java} > undefined reference to `arrow::StringBuilder::Append(std::vector std::allocator > const&, unsigned char*)' > collect2: error: ld returned 1 exit status > {code} > Also worth noting is that the similar function in {{NumericBuilder}} uses > {{vector}} for the null values instead of {{uint8_t*}}. It might be > worth making them consistent. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented
[ https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417496#comment-16417496 ] ASF GitHub Bot commented on ARROW-2351: --- xhochy commented on issue #1803: ARROW-2351 [C++] StringBuilder::append(vector...) not impleme… URL: https://github.com/apache/arrow/pull/1803#issuecomment-376921022 > hmmm, is it possible that the string is not empty and the null bit is true ? No, this should not happen. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] StringBuilder::append(vector...) not implemented > -- > > Key: ARROW-2351 > URL: https://issues.apache.org/jira/browse/ARROW-2351 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.9.0 >Reporter: Rares Vernica >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > For {{StringBuilder}} an {{append(vector, uint8_t*)}} function is > [declared|https://github.com/apache/arrow/blob/7b2c79765cf92760e1f8cca079159d9613b86412/cpp/src/arrow/builder.h#L721] > and > [documented|http://arrow.apache.org/docs/cpp/classarrow_1_1_string_builder.html#a59be34b5e11017a392b4ee019d90da3c] > but it does not seem to be implemented. > {code:java} > undefined reference to `arrow::StringBuilder::Append(std::vector std::allocator > const&, unsigned char*)' > collect2: error: ld returned 1 exit status > {code} > Also worth noting is that the similar function in {{NumericBuilder}} uses > {{vector}} for the null values instead of {{uint8_t*}}. It might be > worth making them consistent. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] xhochy opened a new pull request #25: WIP: Fix OSX RPATHs for Boost
xhochy opened a new pull request #25: WIP: Fix OSX RPATHs for Boost URL: https://github.com/apache/arrow-dist/pull/25 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Issue Comment Deleted] (ARROW-2359) Type objects produced by DataType factory are not thread safe
[ https://issues.apache.org/jira/browse/ARROW-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dou Tu updated ARROW-2359: -- Comment: was deleted (was: There is no shared data between two threads here! So it must be thread-safe! Should be junked.) > Type objects produced by DataType factory are not thread safe > - > > Key: ARROW-2359 > URL: https://issues.apache.org/jira/browse/ARROW-2359 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Anton Shmigirilov >Priority: Minor > Labels: pull-request-available > > TYPE_FACTORY() macro that produces type shortcuts (boolean(), int32(), utf8() > and so on) uses static shared_ptr inside. There are race conditions possible > against shared_ptr's reference counter. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2361) Native Rust Implementation
Andy Grove created ARROW-2361: - Summary: Native Rust Implementation Key: ARROW-2361 URL: https://issues.apache.org/jira/browse/ARROW-2361 Project: Apache Arrow Issue Type: New Feature Components: Rust Reporter: Andy Grove I'm creating this Jira to track work to donate a work-in-progress native Rust implementation of Arrow. I am actively developing this and relying on it for the memory model of my DataFusion project. I would like to donate the code I have now and start working on it under the Apache Arrow project. Here is the PR: https://github.com/apache/arrow/pull/1804 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2359) Type objects produced by DataType factory are not thread safe
[ https://issues.apache.org/jira/browse/ARROW-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417315#comment-16417315 ] Antoine Pitrou commented on ARROW-2359: --- > But in case of TYPE_FACTORY implementation, we have not only common/shared > control block internals, but common (because static) shared_ptr internals > itself. That doesn't sound accurate to me. TYPE_FACTORY returns a _copy_ of the static shared_ptr. Also, the thread stacks you posted don't show any call to TYPE_FACTORY, so it seems unlikely that it plays a part here. After googling a bit, it seems TSAN sometimes produces false positives with gcc. That may be one of those. Perhaps you can try with clang? > Type objects produced by DataType factory are not thread safe > - > > Key: ARROW-2359 > URL: https://issues.apache.org/jira/browse/ARROW-2359 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Anton Shmigirilov >Priority: Minor > Labels: pull-request-available > > TYPE_FACTORY() macro that produces type shortcuts (boolean(), int32(), utf8() > and so on) uses static shared_ptr inside. There are race conditions possible > against shared_ptr's reference counter. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented
[ https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417299#comment-16417299 ] ASF GitHub Bot commented on ARROW-2351: --- gaolizhou commented on issue #1803: ARROW-2351 [C++] StringBuilder::append(vector...) not impleme… URL: https://github.com/apache/arrow/pull/1803#issuecomment-376878105 > We distinguish between empty strings and null strings hmmm, is it possible that the string is not empty and the null bit is true ? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] StringBuilder::append(vector...) not implemented > -- > > Key: ARROW-2351 > URL: https://issues.apache.org/jira/browse/ARROW-2351 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.9.0 >Reporter: Rares Vernica >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > For {{StringBuilder}} an {{append(vector, uint8_t*)}} function is > [declared|https://github.com/apache/arrow/blob/7b2c79765cf92760e1f8cca079159d9613b86412/cpp/src/arrow/builder.h#L721] > and > [documented|http://arrow.apache.org/docs/cpp/classarrow_1_1_string_builder.html#a59be34b5e11017a392b4ee019d90da3c] > but it does not seem to be implemented. > {code:java} > undefined reference to `arrow::StringBuilder::Append(std::vector std::allocator > const&, unsigned char*)' > collect2: error: ld returned 1 exit status > {code} > Also worth noting is that the similar function in {{NumericBuilder}} uses > {{vector}} for the null values instead of {{uint8_t*}}. It might be > worth making them consistent. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2359) Type objects produced by DataType factory are not thread safe
[ https://issues.apache.org/jira/browse/ARROW-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417266#comment-16417266 ] Anton Shmigirilov commented on ARROW-2359: -- Yes, I have an issue with this. It is reproduced in a whole project that uses Arrow. Part of the project reads RecordBatch using RecordBatchStreamReader, accesses the Schema (RecordBatch::schema()) and the field's type(). Such reading executes in concurrent threads. The executable is built with gcc's ThreadSanitizer and here is part of the sanitizer's output: Atomic write of size 4 at 0x7b10c008 by thread T5: #0 __tsan_atomic32_fetch_add (libtsan.so.0+0x00064aa0) #1 __atomic_add /usr/include/c++/7/ext/atomicity.h:53 (exec+0x008a68c3) #2 __atomic_add_dispatch /usr/include/c++/7/ext/atomicity.h:96 (exec+0x008a68c3) #3 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_add_ref_copy() /usr/include/c++/7/bits/shared_ptr_base.h:138 (FDIOExec+0x008a68c3) #4 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count(std::__shared_count<(__gnu_cxx::_Lock_policy)2> const&) /usr/include/c++/7/bits/shared_ptr_base.h:691 (exec+0x008a68c3) #5 std::__shared_ptr::__shared_ptr(std::__shared_ptr const&) /usr/include/c++/7/bits/shared_ptr_base.h:1121 (exec+0x008a68c3) #6 std::shared_ptr::shared_ptr(std::shared_ptr const&) /usr/include/c++/7/bits/shared_ptr.h:119 (exec+0x008a68c3) #7 arrow::Field::type() const /usr/local/fdio-deps/lib/../include/arrow/type.h:244 (exec+0x008a68c3) #8 func_reader() (exec+0x008a68c3) Previous write of size 8 at 0x7b10c008 by thread T4: #0 operator new(unsigned long) (libtsan.so.0+0x0006f846) #1 __gnu_cxx::new_allocator, (__gnu_cxx::_Lock_policy)2> >::allocate(unsigned long, void const*) /usr/include/c++/7/ext/new_allocator.h:111 (libarrow.so.0+0x00167307) #2 func_reader() (exec+0x00897a38) It is hard to reproduce on a simple synthetic test, but I will try. 
std::shared_ptr is declared thread safe with respect to its control block, but as I understand it, safety is guaranteed only when the control block's internals are modified as a result of copying the shared_ptr itself (with the copies sharing a common control block). But in case of TYPE_FACTORY implementation, we have not only common/shared control block internals, but common (because static) shared_ptr internals itself. I guess this place is unsafe. Another opinion about this: shared_ptr was designed to maintain a complex object's lifecycle in the case of copying, moving and so on. In the case of TYPE_FACTORY we get none of the advantages of shared_ptr; we have a single instance of an object whose lifetime matches the whole process's lifetime. In my opinion, this is not quite a correct use of the concept. > Type objects produced by DataType factory are not thread safe > - > > Key: ARROW-2359 > URL: https://issues.apache.org/jira/browse/ARROW-2359 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Anton Shmigirilov >Priority: Minor > Labels: pull-request-available > > TYPE_FACTORY() macro that produces type shortcuts (boolean(), int32(), utf8() > and so on) uses static shared_ptr inside. There are race conditions possible > against shared_ptr's reference counter. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2308) Serialized tensor data should be 64-byte aligned.
[ https://issues.apache.org/jira/browse/ARROW-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417253#comment-16417253 ] ASF GitHub Bot commented on ARROW-2308: --- wesm commented on issue #1802: ARROW-2308: [Python] Make deserialized numpy arrays 64-byte aligned. URL: https://github.com/apache/arrow/pull/1802#issuecomment-376865586 Will review this when I can. I should also revive ARROW-1860 as there are a number of interrelated issues around this stuff This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Serialized tensor data should be 64-byte aligned. > - > > Key: ARROW-2308 > URL: https://issues.apache.org/jira/browse/ARROW-2308 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Robert Nishihara >Priority: Major > Labels: pull-request-available > > See [https://github.com/ray-project/ray/issues/1658] for an example of this > issue. Non-aligned data can trigger a copy when fed into TensorFlow and > things like that. > {code} > import pyarrow as pa > import numpy as np > x = np.zeros(10) > y = pa.deserialize(pa.serialize(x).to_buffer()) > x.ctypes.data % 64 # 0 (it starts out aligned) > y.ctypes.data % 64 # 48 (it is no longer aligned) > {code} > It should be possible to fix this by calling something like > {{RETURN_NOT_OK(AlignStreamPosition(dst));}} before writing the array data. > Note that we already do this before writing the tensor header, but the tensor > header is not necessarily a multiple of 64 bytes, so the subsequent data can > be unaligned. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
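The arithmetic behind the suggested fix can be sketched as follows. This is not Arrow's `AlignStreamPosition`; `PaddedPosition`, `PaddingNeeded` and `kToyAlignment` are hypothetical names, and the sketch assumes that the start of the buffer is itself 64-byte aligned, so that an aligned stream position implies an aligned address.

```cpp
#include <cassert>
#include <cstdint>

// Alignment requested in the issue; the name is an assumption of this sketch.
constexpr int64_t kToyAlignment = 64;

// Rounds position up to the next multiple of alignment, which must be a
// power of two.
int64_t PaddedPosition(int64_t position, int64_t alignment = kToyAlignment) {
  assert(position >= 0);
  assert(alignment > 0 && (alignment & (alignment - 1)) == 0);
  return (position + alignment - 1) & ~(alignment - 1);
}

// Number of padding bytes to write before the payload so that it starts on
// an aligned position.
int64_t PaddingNeeded(int64_t position, int64_t alignment = kToyAlignment) {
  return PaddedPosition(position, alignment) - position;
}
```

Under that assumption, a header that ends at position 48 (the remainder reported in the example above) would need `PaddingNeeded(48) == 16` bytes of padding before the array data.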
[jira] [Commented] (ARROW-2351) [C++] StringBuilder::append(vector...) not implemented
[ https://issues.apache.org/jira/browse/ARROW-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417252#comment-16417252 ] ASF GitHub Bot commented on ARROW-2351: --- wesm commented on issue #1803: ARROW-2351 [C++] StringBuilder::append(vector...) not impleme… URL: https://github.com/apache/arrow/pull/1803#issuecomment-376865339 > IMO, if string is empty, then it should be null, and vice versa. We distinguish between empty strings and null strings This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] StringBuilder::append(vector...) not implemented > -- > > Key: ARROW-2351 > URL: https://issues.apache.org/jira/browse/ARROW-2351 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.9.0 >Reporter: Rares Vernica >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > For {{StringBuilder}} an {{append(vector, uint8_t*)}} function is > [declared|https://github.com/apache/arrow/blob/7b2c79765cf92760e1f8cca079159d9613b86412/cpp/src/arrow/builder.h#L721] > and > [documented|http://arrow.apache.org/docs/cpp/classarrow_1_1_string_builder.html#a59be34b5e11017a392b4ee019d90da3c] > but it does not seem to be implemented. > {code:java} > undefined reference to `arrow::StringBuilder::Append(std::vector std::allocator > const&, unsigned char*)' > collect2: error: ld returned 1 exit status > {code} > Also worth noting is that the similar function in {{NumericBuilder}} uses > {{vector}} for the null values instead of {{uint8_t*}}. It might be > worth making them consistent. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
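The distinction drawn in the comment above, between an empty string and a null string, can be illustrated with a small sketch. It assumes a layout of value offsets plus a per-slot validity flag; `TinyStringBuilder` and its methods are hypothetical names for illustration only and are not the Arrow C++ `StringBuilder` API.

```python
class TinyStringBuilder:
    """Toy string column: concatenated bytes, offsets, and one validity flag per slot."""

    def __init__(self):
        self.offsets = [0]        # offsets[i]..offsets[i + 1] delimit slot i
        self.data = bytearray()   # all string bytes, back to back
        self.validity = []        # True = value present, False = null

    def append(self, value):
        """Append one value; None is recorded as null and contributes no bytes."""
        if value is None:
            self.validity.append(False)
        else:
            self.data += value.encode("utf-8")
            self.validity.append(True)
        self.offsets.append(len(self.data))

    def append_values(self, values, valid_bytes=None):
        """Bulk append; `valid_bytes` mirrors a uint8_t* argument (nonzero = valid)."""
        if valid_bytes is None:
            valid_bytes = [1] * len(values)
        for value, ok in zip(values, valid_bytes):
            self.append(value if ok else None)

    def get(self, i):
        """Return the string in slot i, or None if the slot is null."""
        if not self.validity[i]:
            return None
        return self.data[self.offsets[i]:self.offsets[i + 1]].decode("utf-8")
```

After `append_values(["a", "", "x"], [1, 1, 0])`, slot 1 reads back as `""` and slot 2 as `None`. Both occupy zero bytes in `data`, so only the validity flag tells them apart, which is why an empty string cannot simply be treated as null.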
[jira] [Created] (ARROW-2360) Add set_chunksize for RecordBatchReader in arrow/record_batch.h
Xianjin YE created ARROW-2360: - Summary: Add set_chunksize for RecordBatchReader in arrow/record_batch.h Key: ARROW-2360 URL: https://issues.apache.org/jira/browse/ARROW-2360 Project: Apache Arrow Issue Type: Improvement Reporter: Xianjin YE As discussed in [https://github.com/apache/parquet-cpp/pull/445], it may be better to expose a chunksize-related API in RecordBatchReader. However, RecordBatchStreamReader doesn't conform to this requirement. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-2359) Type objects produced by DataType factory are not thread safe
[ https://issues.apache.org/jira/browse/ARROW-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417069#comment-16417069 ] Dou Tu edited comment on ARROW-2359 at 3/28/18 9:21 AM: There is no shared data between the two threads here, so it must be thread-safe! This issue should be junked. was (Author: gaolizhou): There are no global variables here, so it must be thread-safe! This issue should be junked.
[jira] [Commented] (ARROW-2308) Serialized tensor data should be 64-byte aligned.
[ https://issues.apache.org/jira/browse/ARROW-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417004#comment-16417004 ] ASF GitHub Bot commented on ARROW-2308: --- robertnishihara commented on issue #1802: ARROW-2308: [Python] Make deserialized numpy arrays 64-byte aligned. URL: https://github.com/apache/arrow/pull/1802#issuecomment-376796605 Yes, that seems related, but for Tensors we want 64-byte alignment.
[jira] [Commented] (ARROW-2308) Serialized tensor data should be 64-byte aligned.
[ https://issues.apache.org/jira/browse/ARROW-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416980#comment-16416980 ] ASF GitHub Bot commented on ARROW-2308: --- pcmoritz commented on issue #1802: ARROW-2308: [Python] Make deserialized numpy arrays 64-byte aligned. URL: https://github.com/apache/arrow/pull/1802#issuecomment-376788803 LGTM! We might also want to discuss the spec and whether we want Tensors to be aligned in general/by default. It seems important to me, and maybe it's already implied by the sentence ```It is required to have all the contiguous memory buffers in an IPC payload aligned at 8-byte boundaries. In other words, each buffer must start at an aligned 8-byte offset.``` Edit: There is a test failure in ipc-write-test that we should fix before merging :)
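Whether the spec sentence quoted in the comment already implies 64-byte alignment can be checked with simple arithmetic. The sketch below uses hypothetical helper names (`is_aligned`, `offsets_aligned_to`) and the remainder of 48 reported in the issue's example. It suggests that an 8-byte requirement alone does not guarantee 64-byte alignment.

```python
def is_aligned(address: int, alignment: int) -> bool:
    """True if `address` is a multiple of `alignment`."""
    return address % alignment == 0


def offsets_aligned_to(alignment: int, limit: int) -> list:
    """List the 8-byte-aligned offsets below `limit` that also satisfy `alignment`."""
    return [off for off in range(0, limit, 8) if is_aligned(off, alignment)]


# The issue's example reports an address whose remainder modulo 64 is 48.
remainder = 48
satisfies_8 = is_aligned(remainder, 8)     # the 8-byte rule from the quoted sentence holds
satisfies_64 = is_aligned(remainder, 64)   # the 64-byte alignment wanted for tensors does not
```

Of the eight 8-byte-aligned offsets in a 64-byte window (`offsets_aligned_to(8, 64)`), only offset 0 is also 64-byte aligned (`offsets_aligned_to(64, 64)`), so stronger alignment for tensors would need to be stated or enforced separately.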