Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]
pitrou commented on PR #46730: URL: https://github.com/apache/arrow/pull/46730#issuecomment-3022421404 > 1- API and Handling of the Last Buffer In [this pull request](https://github.com/apache/arrow/pull/46655), I demonstrated that it’s possible to [share buffers](https://github.com/apache/arrow/blob/a5dfadba3626c082235d9ea22db6f2cb22398d9a/cpp/src/arrow/array/builder_binary.cc#L90) without copying or finalizing the last buffer. This avoids [relocating the buffer](https://github.com/apache/arrow/blob/ed13cedd8bf7ddc06db152f97e68d86c2c37e949/cpp/src/arrow/array/builder_binary.h#L563) to remove blank space, which can be a costly operation when the unused space exceeds 64 bytes. > > 2- > > > Is it a win, though? If most Parquet strings are <= 12 bytes we would pointlessly waste space and CPU time. > > In [this pull request](https://github.com/apache/arrow/pull/46229), I proposed a method that could help avoid memory bloat when buffers are shared. Additionally, in [this issue](https://github.com/apache/arrow/issues/45639), I think this metadata could help determine when CompactArray should be called. Thanks for the reminder, and sorry that this is taking a long time :) I propose that we review these PRs one by one. I've started with the `CompactArray` one and, once that is done, I would like to then move to the `AppendArraySlice` improvement. This PR here is slightly more contentious so I think we should tackle it only after the other APIs have settled semantics. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]
pitrou commented on PR #46730: URL: https://github.com/apache/arrow/pull/46730#issuecomment-3019324942 > Maybe we can open another issue/PR to discuss specifically whether/how parquet should use this api on appending a huge page and append view from it? That sounds fair to me. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]
IndifferentArea commented on PR #46730: URL: https://github.com/apache/arrow/pull/46730#issuecomment-3019299582 Besides parquet related issue, there is **NO buffer sharing** mechanism in current api. - `AppendBuffer` should only be used for large string - `AppendViewFromBuffer` - for large string sharing, sharing through this interface will save a lot of memory usage - for inlined string it costs as much as directly `Append`. These new interfaces won't introduce more cost for current interfaces from my view. I understand tradeoff on parquet a discussed above, but I believe this kind of buffer sharing interface is lost in current implementation, and there is definitely more need for this api apart from parquet. Maybe we can open another issue/PR to discuss specifically whether/how parquet should use this api on appending a huge page and append view from it? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]
mapleFU commented on PR #46730: URL: https://github.com/apache/arrow/pull/46730#issuecomment-3018294994 Looks (1) would work in buffer style api, but for parquet reader, it might append buffer one by one. The (2) is a good way for compute, but here I don't know the best way to handle this: whether to adaptive read it, or just throw it to "cast" or "compact". I prefer handling this in reader, and the later handling can "compact" the data when output or throwing to compute -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]
andishgar commented on PR #46730: URL: https://github.com/apache/arrow/pull/46730#issuecomment-3016821941 @mapleFU @pitrou I believe this pull request is related to several other PRs I've submitted. Here's a summary: 1- API and Handling of the Last Buffer In [this pull request](https://github.com/apache/arrow/pull/46655), I demonstrated that it’s possible to [share buffers](https://github.com/apache/arrow/blob/a5dfadba3626c082235d9ea22db6f2cb22398d9a/cpp/src/arrow/array/builder_binary.cc#L90) without copying or finalizing the last buffer. This avoids [relocating the buffer](https://github.com/apache/arrow/blob/ed13cedd8bf7ddc06db152f97e68d86c2c37e949/cpp/src/arrow/array/builder_binary.h#L563) to remove blank space, which can be a costly operation when the unused space exceeds 64 bytes. 2- >Is it a win, though? If most Parquet strings are <= 12 bytes we would pointlessly waste space and CPU time. In [this pull request](https://github.com/apache/arrow/pull/46229), I proposed a method that could help avoid memory bloat when buffers are shared. Additionally, in [this issue](https://github.com/apache/arrow/issues/45639), I think this metadata could help determine when CompactArray should be called. Overall, my suggestion is to either modify this pull request or create a new API to support buffer sharing. It is possible to decide whether a created array should be compacted based on some metadata, in order to avoid memory bloat. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]
mapleFU commented on code in PR #46730: URL: https://github.com/apache/arrow/pull/46730#discussion_r2159249895 ## cpp/src/arrow/array/builder_binary.h: ## @@ -645,6 +664,35 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder { UnsafeAppend(value.data(), static_cast(value.size())); } + Result AppendBuffer(const uint8_t* value, const int64_t length) { +if (ARROW_PREDICT_FALSE(length <= TypeClass::kInlineSize)) { + return Status::Invalid( + "The size of buffer to append should be larger than kInlineSize"); +} +ARROW_RETURN_NOT_OK(data_heap_builder_.FinishLastBlock()); +ARROW_ASSIGN_OR_RAISE(auto v, + data_heap_builder_.Append(value, length)); +return v.ref.buffer_index; + } + + Result AppendBuffer(const char* value, const int64_t length) { +return AppendBuffer(reinterpret_cast(value), length); + } + + Result AppendBuffer(const std::string& value) { Review Comment: Can we remove this one since std::string_view is added here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]
mapleFU commented on code in PR #46730: URL: https://github.com/apache/arrow/pull/46730#discussion_r2159268798 ## cpp/src/arrow/array/builder_binary.h: ## @@ -645,6 +664,58 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder { UnsafeAppend(value.data(), static_cast(value.size())); } + /// \brief Append a buffer of raw bytes to the internal data heap + /// + /// This method is used to add out-of-line data buffers to the builder. + /// The size of the buffer must be larger than TypeClass::kInlineSize. + /// + /// \param[in] value Pointer to the raw byte data. + /// \param[in] length The number of bytes in the buffer. + /// \return A Result containing the index of the newly appended buffer on success, + /// or a Status error if the length is too small or allocation fails. + Result AppendBuffer(const uint8_t* value, const int64_t length) { +if (ARROW_PREDICT_FALSE(length <= TypeClass::kInlineSize)) { + return Status::Invalid( + "The size of buffer to append should be larger than kInlineSize"); +} +ARROW_RETURN_NOT_OK(data_heap_builder_.FinishLastBlock()); +ARROW_ASSIGN_OR_RAISE(auto v, + data_heap_builder_.Append(value, length)); +return v.ref.buffer_index; + } + + Result AppendBuffer(const char* value, const int64_t length) { +return AppendBuffer(reinterpret_cast(value), length); + } + + Result AppendBuffer(const std::string& value) { +return AppendBuffer(value.data(), static_cast(value.size())); + } + + Result AppendBuffer(const std::string_view value) { +return AppendBuffer(value.data(), static_cast(value.size())); + } + + Status AppendViewFromBuffer(const int32_t buffer_idx, const int32_t start, + const int32_t length) { +ARROW_RETURN_NOT_OK(Reserve(1)); +UnsafeAppendToBitmap(true); +ARROW_ASSIGN_OR_RAISE(const auto v, data_heap_builder_.GetViewFromBuffer( +buffer_idx, start, length)); +data_builder_.UnsafeAppend(v); +return Status::OK(); + } + + /// \pre The caller must ensure that: + /// - `buffer_idx` is a valid index for buffer. + /// - `start` and `length` define a valid range within the specified buffer. + void UnsafeAppendViewFromBuffer(const int32_t buffer_idx, const int32_t start, + const int32_t length) { +UnsafeAppendToBitmap(true); +const auto v = data_heap_builder_.GetViewFromBuffer(buffer_idx, start, length); Review Comment: ```suggestion const auto v = data_heap_builder_.GetViewFromBuffer(buffer_idx, start, length); ``` ## cpp/src/arrow/array/builder_binary.h: ## @@ -645,6 +664,58 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder { UnsafeAppend(value.data(), static_cast(value.size())); } + /// \brief Append a buffer of raw bytes to the internal data heap + /// + /// This method is used to add out-of-line data buffers to the builder. + /// The size of the buffer must be larger than TypeClass::kInlineSize. + /// + /// \param[in] value Pointer to the raw byte data. + /// \param[in] length The number of bytes in the buffer. + /// \return A Result containing the index of the newly appended buffer on success, + /// or a Status error if the length is too small or allocation fails. + Result AppendBuffer(const uint8_t* value, const int64_t length) { +if (ARROW_PREDICT_FALSE(length <= TypeClass::kInlineSize)) { + return Status::Invalid( + "The size of buffer to append should be larger than kInlineSize"); +} +ARROW_RETURN_NOT_OK(data_heap_builder_.FinishLastBlock()); +ARROW_ASSIGN_OR_RAISE(auto v, + data_heap_builder_.Append(value, length)); +return v.ref.buffer_index; + } + + Result AppendBuffer(const char* value, const int64_t length) { +return AppendBuffer(reinterpret_cast(value), length); + } + + Result AppendBuffer(const std::string& value) { +return AppendBuffer(value.data(), static_cast(value.size())); + } + + Result AppendBuffer(const std::string_view value) { +return AppendBuffer(value.data(), static_cast(value.size())); + } + + Status AppendViewFromBuffer(const int32_t buffer_idx, const int32_t start, + const int32_t length) { +ARROW_RETURN_NOT_OK(Reserve(1)); +UnsafeAppendToBitmap(true); +ARROW_ASSIGN_OR_RAISE(const auto v, data_heap_builder_.GetViewFromBuffer( Review Comment: ```suggestion ARROW_ASSIGN_OR_RAISE(const auto v, data_heap_builder_.GetViewFromBuffer( ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]
mapleFU commented on code in PR #46730: URL: https://github.com/apache/arrow/pull/46730#discussion_r2146963301 ## cpp/src/arrow/array/builder_binary.h: ## @@ -645,6 +664,35 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder { UnsafeAppend(value.data(), static_cast(value.size())); } + Result AppendBuffer(const uint8_t* value, const int64_t length) { +if (ARROW_PREDICT_FALSE(length <= TypeClass::kInlineSize)) { + return Status::Invalid( + "The size of buffer to append should be larger than kInlineSize"); +} +ARROW_RETURN_NOT_OK(data_heap_builder_.FinishLastBlock()); +ARROW_ASSIGN_OR_RAISE(auto v, + data_heap_builder_.Append(value, length)); +return v.ref.buffer_index; + } + + Result AppendBuffer(const char* value, const int64_t length) { +return AppendBuffer(reinterpret_cast(value), length); + } + + Result AppendBuffer(const std::string& value) { Review Comment: Can `std::string_view` being used rather than `const std::string&`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]
mapleFU commented on code in PR #46730: URL: https://github.com/apache/arrow/pull/46730#discussion_r2146963301 ## cpp/src/arrow/array/builder_binary.h: ## @@ -645,6 +664,35 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder { UnsafeAppend(value.data(), static_cast(value.size())); } + Result AppendBuffer(const uint8_t* value, const int64_t length) { +if (ARROW_PREDICT_FALSE(length <= TypeClass::kInlineSize)) { + return Status::Invalid( + "The size of buffer to append should be larger than kInlineSize"); +} +ARROW_RETURN_NOT_OK(data_heap_builder_.FinishLastBlock()); +ARROW_ASSIGN_OR_RAISE(auto v, + data_heap_builder_.Append(value, length)); +return v.ref.buffer_index; + } + + Result AppendBuffer(const char* value, const int64_t length) { +return AppendBuffer(reinterpret_cast(value), length); + } + + Result AppendBuffer(const std::string& value) { Review Comment: Can `std::string_view` being used? ## cpp/src/arrow/array/builder_binary.h: ## @@ -645,6 +664,35 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder { UnsafeAppend(value.data(), static_cast(value.size())); } + Result AppendBuffer(const uint8_t* value, const int64_t length) { +if (ARROW_PREDICT_FALSE(length <= TypeClass::kInlineSize)) { + return Status::Invalid( + "The size of buffer to append should be larger than kInlineSize"); +} +ARROW_RETURN_NOT_OK(data_heap_builder_.FinishLastBlock()); +ARROW_ASSIGN_OR_RAISE(auto v, + data_heap_builder_.Append(value, length)); +return v.ref.buffer_index; + } + + Result AppendBuffer(const char* value, const int64_t length) { +return AppendBuffer(reinterpret_cast(value), length); + } + + Result AppendBuffer(const std::string& value) { +return AppendBuffer(value.data(), static_cast(value.size())); + } + + Status AppendViewFromBuffer(const int32_t buffer_idx, const int32_t start, + const int32_t length) { +ARROW_RETURN_NOT_OK(Reserve(1)); +UnsafeAppendToBitmap(true); +ARROW_ASSIGN_OR_RAISE( +auto v, data_heap_builder_.GetViewFromBuffer(buffer_idx, start, length)); Review Comment: should we also adding a unsafe Append version? ## cpp/src/arrow/array/builder_binary.h: ## @@ -645,6 +664,35 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder { UnsafeAppend(value.data(), static_cast(value.size())); } + Result AppendBuffer(const uint8_t* value, const int64_t length) { +if (ARROW_PREDICT_FALSE(length <= TypeClass::kInlineSize)) { + return Status::Invalid( + "The size of buffer to append should be larger than kInlineSize"); +} Review Comment: I'm curious that would this api be a bit strict for outside user 🤔? Should we add comment doc string for this, and add sample usage: ``` if (buffer.size() <= 12) { // Append directly for (...) {} } else { auto buffer_idx = AppendBuffer(); for (..) { AppendViewFromBuffer(buffer_idx, ...); } } -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]
IndifferentArea commented on PR #46730: URL: https://github.com/apache/arrow/pull/46730#issuecomment-2972782119 To implement interface aligned with what arrow-rs did, i have to change some behavior of `StringHeapBuild::FinishLastBlock()` and mark it as public. More specific, `FinishLastBlock()` just resize the last block before. Now `FinishLastBlock()` reset internal states, including `current_offset_`, `current_out_buffer_` and `current_remaining_bytes_`. I believe it's more safe and reasonable. The interface was private before so don't mind external usage, for current internal usage of `FinishLastBlock()`, they always reset or change all these status. If this change is unacceptable, plz let me know, i'll try to find another way. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]
IndifferentArea commented on code in PR #46730: URL: https://github.com/apache/arrow/pull/46730#discussion_r2146950173 ## cpp/src/arrow/array/array_test.cc: ## @@ -1000,6 +1000,39 @@ TEST_F(TestArray, TestBinaryViewAppendArraySlice) { AssertArraysEqual(*src, *dst); } +TEST_F(TestArray, BinaryViewFromBlock) { + BinaryViewBuilder src_builder(pool_); + ASSERT_OK(src_builder.Append("Hello")); + /// let block = builder.append_block(b"helloworldbingobongo".into()); + /// + /// builder.try_append_view(block, 0, 5).unwrap(); + /// builder.try_append_view(block, 5, 5).unwrap(); + /// builder.try_append_view(block, 10, 5).unwrap(); + /// builder.try_append_view(block, 15, 5).unwrap(); + /// builder.try_append_view(block, 0, 15).unwrap(); + /// let array = builder.finish(); + ASSERT_OK_AND_ASSIGN(const auto buffer, + src_builder.AppendBuffer("helloworldbingobongo")); + ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 0, 5)); + ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 5, 5)); + ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 10, 5)); + ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 15, 5)); + ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 0, 15)); + Review Comment: added -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]
mapleFU commented on code in PR #46730: URL: https://github.com/apache/arrow/pull/46730#discussion_r2134762348 ## cpp/src/arrow/array/array_test.cc: ## @@ -1000,6 +1000,39 @@ TEST_F(TestArray, TestBinaryViewAppendArraySlice) { AssertArraysEqual(*src, *dst); } +TEST_F(TestArray, BinaryViewFromBlock) { + BinaryViewBuilder src_builder(pool_); + ASSERT_OK(src_builder.Append("Hello")); + /// let block = builder.append_block(b"helloworldbingobongo".into()); + /// + /// builder.try_append_view(block, 0, 5).unwrap(); + /// builder.try_append_view(block, 5, 5).unwrap(); + /// builder.try_append_view(block, 10, 5).unwrap(); + /// builder.try_append_view(block, 15, 5).unwrap(); + /// builder.try_append_view(block, 0, 15).unwrap(); + /// let array = builder.finish(); + ASSERT_OK_AND_ASSIGN(const auto buffer, + src_builder.AppendBuffer("helloworldbingobongo")); + ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 0, 5)); + ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 5, 5)); + ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 10, 5)); + ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 15, 5)); + ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 0, 15)); + + ASSERT_OK_AND_ASSIGN(auto src, src_builder.Finish()); + ASSERT_OK(src->ValidateFull()); + + // Verify the content of the resulting array + ASSERT_EQ(src->length(), 6); + const auto& binary_view_array = static_cast(*src); Review Comment: Can we both test stringView and binaryView? ## cpp/src/arrow/array/array_test.cc: ## @@ -1000,6 +1000,39 @@ TEST_F(TestArray, TestBinaryViewAppendArraySlice) { AssertArraysEqual(*src, *dst); } +TEST_F(TestArray, BinaryViewFromBlock) { + BinaryViewBuilder src_builder(pool_); + ASSERT_OK(src_builder.Append("Hello")); + /// let block = builder.append_block(b"helloworldbingobongo".into()); + /// + /// builder.try_append_view(block, 0, 5).unwrap(); + /// builder.try_append_view(block, 5, 5).unwrap(); + /// builder.try_append_view(block, 10, 5).unwrap(); + /// builder.try_append_view(block, 15, 5).unwrap(); + /// builder.try_append_view(block, 0, 15).unwrap(); + /// let array = builder.finish(); + ASSERT_OK_AND_ASSIGN(const auto buffer, + src_builder.AppendBuffer("helloworldbingobongo")); + ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 0, 5)); Review Comment: Can we try append unexists `buffer_index`? ## cpp/src/arrow/array/array_test.cc: ## @@ -1000,6 +1000,39 @@ TEST_F(TestArray, TestBinaryViewAppendArraySlice) { AssertArraysEqual(*src, *dst); } +TEST_F(TestArray, BinaryViewFromBlock) { + BinaryViewBuilder src_builder(pool_); + ASSERT_OK(src_builder.Append("Hello")); + /// let block = builder.append_block(b"helloworldbingobongo".into()); + /// + /// builder.try_append_view(block, 0, 5).unwrap(); + /// builder.try_append_view(block, 5, 5).unwrap(); + /// builder.try_append_view(block, 10, 5).unwrap(); + /// builder.try_append_view(block, 15, 5).unwrap(); + /// builder.try_append_view(block, 0, 15).unwrap(); + /// let array = builder.finish(); + ASSERT_OK_AND_ASSIGN(const auto buffer, + src_builder.AppendBuffer("helloworldbingobongo")); + ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 0, 5)); + ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 5, 5)); + ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 10, 5)); + ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 15, 5)); + ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 0, 15)); + Review Comment: Can we test append multiple buffers? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]
IndifferentArea commented on PR #46730: URL: https://github.com/apache/arrow/pull/46730#issuecomment-2954043045 Not sure why these 2 ci always failed.. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]
mapleFU commented on PR #46730: URL: https://github.com/apache/arrow/pull/46730#issuecomment-2952405404 Some personal thoughts: 1. AppendBuffer ( and etc ) which returns a StringView is a bit weird, Block is not a StringView 2. Now aligned with arrow-rs also a good way -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]
IndifferentArea commented on PR #46730: URL: https://github.com/apache/arrow/pull/46730#issuecomment-2951999146 Should we rename `AppendBuffer` or redesign the interface's semantics? since we don't really append a **buffer/block**, we will directly append to the last one if remaining size is enough.. I think It may introduce confusion. Maybe aligning with arrow-rs's impl is fine.. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]
mapleFU commented on code in PR #46730: URL: https://github.com/apache/arrow/pull/46730#discussion_r2133473408 ## cpp/src/arrow/array/builder_binary.h: ## @@ -645,6 +657,28 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder { UnsafeAppend(value.data(), static_cast(value.size())); } + Result> AppendBlock(const uint8_t* value, Review Comment: The syntax is a bit weird here? Append a BinaryView and then append the sub-slice of the view? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]
IndifferentArea commented on code in PR #46730: URL: https://github.com/apache/arrow/pull/46730#discussion_r2133457449 ## cpp/src/arrow/array/builder_binary.h: ## @@ -645,6 +657,28 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder { UnsafeAppend(value.data(), static_cast(value.size())); } + Result> AppendBlock(const uint8_t* value, Review Comment: can i directly use `BinaryViewType` since it already contains these two info we need? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]
IndifferentArea commented on code in PR #46730: URL: https://github.com/apache/arrow/pull/46730#discussion_r2133457449 ## cpp/src/arrow/array/builder_binary.h: ## @@ -645,6 +657,28 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder { UnsafeAppend(value.data(), static_cast(value.size())); } + Result> AppendBlock(const uint8_t* value, Review Comment: can i directly use `BinaryViewType::c_type` since it already contains these two info we need? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]
mapleFU commented on code in PR #46730: URL: https://github.com/apache/arrow/pull/46730#discussion_r2133456883 ## cpp/src/arrow/array/builder_binary.h: ## @@ -645,6 +657,28 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder { UnsafeAppend(value.data(), static_cast(value.size())); } + Result> AppendBlock(const uint8_t* value, + const int64_t length); + + Result> AppendBlock(const char* value, + const int64_t length) { +return AppendBlock(reinterpret_cast(value), length); + } + + Result> AppendBlock(const std::string& value) { +return AppendBlock(value.data(), static_cast(value.size())); + } + + Status AppendViewFromBuffer(int32_t buffer_id, int32_t buffer_offset, int32_t start, Review Comment: Personally both is ok for me, I prefer Buffer since a variable is `buffer_index` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]
IndifferentArea commented on code in PR #46730: URL: https://github.com/apache/arrow/pull/46730#discussion_r2133442265 ## cpp/src/arrow/array/builder_binary.h: ## @@ -645,6 +657,28 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder { UnsafeAppend(value.data(), static_cast(value.size())); } + Result> AppendBlock(const uint8_t* value, + const int64_t length); + + Result> AppendBlock(const char* value, + const int64_t length) { +return AppendBlock(reinterpret_cast(value), length); + } + + Result> AppendBlock(const std::string& value) { +return AppendBlock(value.data(), static_cast(value.size())); + } + + Status AppendViewFromBuffer(int32_t buffer_id, int32_t buffer_offset, int32_t start, Review Comment: sorry for this annoying inconsistency. To be clear, is one of `block` or `buffer` prefered, or either is fine? I found `block` is used in `StringHeapBuilder` while `buffer` is used in `BinaryBuilder`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]
IndifferentArea commented on code in PR #46730: URL: https://github.com/apache/arrow/pull/46730#discussion_r2133442265 ## cpp/src/arrow/array/builder_binary.h: ## @@ -645,6 +657,28 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder { UnsafeAppend(value.data(), static_cast(value.size())); } + Result> AppendBlock(const uint8_t* value, + const int64_t length); + + Result> AppendBlock(const char* value, + const int64_t length) { +return AppendBlock(reinterpret_cast(value), length); + } + + Result> AppendBlock(const std::string& value) { +return AppendBlock(value.data(), static_cast(value.size())); + } + + Status AppendViewFromBuffer(int32_t buffer_id, int32_t buffer_offset, int32_t start, Review Comment: sorry for this annoying inconsistency. To be clear, is one of `block` or `buffer` prefered, or either is fine? I found `block` is used in `StringHeapBuilder` while `buffer` is used in `BinaryBuilder`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]
mapleFU commented on code in PR #46730: URL: https://github.com/apache/arrow/pull/46730#discussion_r2133344885 ## cpp/src/arrow/array/builder_binary.h: ## @@ -511,6 +511,18 @@ class ARROW_EXPORT StringHeapBuilder { return v; } + c_type GetViewFromBlock(int32_t block_id, int32_t block_offset, int32_t offset, + int32_t length) const { +const auto* value = blocks_.at(block_id)->data_as() + block_offset + offset; +if (length <= BinaryViewType::kInlineSize) { Review Comment: Uses `ToBinaryView`? ## cpp/src/arrow/array/builder_binary.h: ## @@ -645,6 +657,28 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder { UnsafeAppend(value.data(), static_cast(value.size())); } + Result> AppendBlock(const uint8_t* value, + const int64_t length); + + Result> AppendBlock(const char* value, + const int64_t length) { +return AppendBlock(reinterpret_cast(value), length); + } + + Result> AppendBlock(const std::string& value) { +return AppendBlock(value.data(), static_cast(value.size())); + } + + Status AppendViewFromBuffer(int32_t buffer_id, int32_t buffer_offset, int32_t start, Review Comment: naming: from buffer or from block? ## cpp/src/arrow/array/builder_binary.cc: ## @@ -100,6 +100,13 @@ void BinaryViewBuilder::Reset() { data_heap_builder_.Reset(); } +Result> BinaryViewBuilder::AppendBlock(const uint8_t* value, + const int64_t length) { + DCHECK_GT(length, TypeClass::kInlineSize); Review Comment: If length <= kInlineSize, should this return false or ok? Why just DCHECK here? ## cpp/src/arrow/array/builder_binary.h: ## @@ -645,6 +657,28 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder { UnsafeAppend(value.data(), static_cast(value.size())); } + Result> AppendBlock(const uint8_t* value, Review Comment: Can we use more specific name rather than `pair`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]
IndifferentArea commented on PR #46730: URL: https://github.com/apache/arrow/pull/46730#issuecomment-2951496617 @mapleFU is currently implemented interface expected? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org