Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]

2025-07-01 Thread via GitHub


pitrou commented on PR #46730:
URL: https://github.com/apache/arrow/pull/46730#issuecomment-3022421404

   > 1- API and Handling of the Last Buffer In [this pull 
request](https://github.com/apache/arrow/pull/46655), I demonstrated that it’s 
possible to [share 
buffers](https://github.com/apache/arrow/blob/a5dfadba3626c082235d9ea22db6f2cb22398d9a/cpp/src/arrow/array/builder_binary.cc#L90)
 without copying or finalizing the last buffer. This avoids [relocating the 
buffer](https://github.com/apache/arrow/blob/ed13cedd8bf7ddc06db152f97e68d86c2c37e949/cpp/src/arrow/array/builder_binary.h#L563)
 to remove blank space, which can be a costly operation when the unused space 
exceeds 64 bytes.
   > 
   > 2-
   > 
   > > Is it a win, though? If most Parquet strings are <= 12 bytes we would 
pointlessly waste space and CPU time.
   > 
   > In [this pull request](https://github.com/apache/arrow/pull/46229), I 
proposed a method that could help avoid memory bloat when buffers are shared. 
Additionally, in [this issue](https://github.com/apache/arrow/issues/45639), I 
think this metadata could help determine when CompactArray should be called.
   
   Thanks for the reminder, and sorry that this is taking a long time :) I 
propose that we review these PRs one by one. I've started with the 
`CompactArray` one and, once that is done, I would like to then move to the 
`AppendArraySlice` improvement.
   
   This PR here is slightly more contentious so I think we should tackle it 
only after the other APIs have settled semantics.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]

2025-06-30 Thread via GitHub


pitrou commented on PR #46730:
URL: https://github.com/apache/arrow/pull/46730#issuecomment-3019324942

   > Maybe we can open another issue/PR to discuss specifically whether/how 
parquet should use this api on appending a huge page and append view from it?
   
   That sounds fair to me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]

2025-06-30 Thread via GitHub


IndifferentArea commented on PR #46730:
URL: https://github.com/apache/arrow/pull/46730#issuecomment-3019299582

   Besides parquet related issue, there is **NO buffer sharing** mechanism in 
current api. 
   
   - `AppendBuffer` should only be used for large string
   - `AppendViewFromBuffer` 
 - for large string sharing, sharing through this interface will save a lot 
of memory usage
 - for inlined string it costs as much as directly `Append`. 
   
   These new interfaces won't introduce more cost for current interfaces from 
my view.
   I understand tradeoff on parquet a discussed above, but I believe this kind 
of buffer sharing interface is lost in current implementation, and there is 
definitely more need for this api apart from parquet. 
   
   Maybe we can open another issue/PR to discuss specifically whether/how 
parquet should use this api on appending a huge page and append view from it?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]

2025-06-30 Thread via GitHub


mapleFU commented on PR #46730:
URL: https://github.com/apache/arrow/pull/46730#issuecomment-3018294994

   Looks (1) would work in buffer style api, but for parquet reader, it might 
append buffer one by one.
   
   The (2) is a good way for compute, but here I don't know the best way to 
handle this: whether to adaptive read it, or just throw it to "cast" or 
"compact". I prefer handling this in reader, and the later handling can 
"compact" the data when output or throwing to compute


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]

2025-06-29 Thread via GitHub


andishgar commented on PR #46730:
URL: https://github.com/apache/arrow/pull/46730#issuecomment-3016821941

   @mapleFU @pitrou
   I believe this pull request is related to several other PRs I've submitted. 
Here's a summary:
   
   1- API and Handling of the Last Buffer
   In [this pull request](https://github.com/apache/arrow/pull/46655), I 
demonstrated that it’s possible to [share 
buffers](https://github.com/apache/arrow/blob/a5dfadba3626c082235d9ea22db6f2cb22398d9a/cpp/src/arrow/array/builder_binary.cc#L90)
 without copying or finalizing the last buffer. This avoids [relocating the 
buffer](https://github.com/apache/arrow/blob/ed13cedd8bf7ddc06db152f97e68d86c2c37e949/cpp/src/arrow/array/builder_binary.h#L563)
 to remove blank space, which can be a costly operation when the unused space 
exceeds 64 bytes.
   
   2-
   >Is it a win, though? If most Parquet strings are <= 12 bytes we would 
pointlessly waste space and CPU time.
   
   In [this pull request](https://github.com/apache/arrow/pull/46229), I 
proposed a method that could help avoid memory bloat when buffers are shared. 
Additionally, in [this issue](https://github.com/apache/arrow/issues/45639), I 
think this metadata could help determine when CompactArray should be called.
   
   Overall, my suggestion is to either modify this pull request or create a new 
API to support buffer sharing. It is possible to decide whether a created array 
should be compacted based on some metadata, in order to avoid memory bloat.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]

2025-06-20 Thread via GitHub


mapleFU commented on code in PR #46730:
URL: https://github.com/apache/arrow/pull/46730#discussion_r2159249895


##
cpp/src/arrow/array/builder_binary.h:
##
@@ -645,6 +664,35 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder 
{
 UnsafeAppend(value.data(), static_cast(value.size()));
   }
 
+  Result AppendBuffer(const uint8_t* value, const int64_t length) {
+if (ARROW_PREDICT_FALSE(length <= TypeClass::kInlineSize)) {
+  return Status::Invalid(
+  "The size of buffer to append should be larger than kInlineSize");
+}
+ARROW_RETURN_NOT_OK(data_heap_builder_.FinishLastBlock());
+ARROW_ASSIGN_OR_RAISE(auto v,
+  data_heap_builder_.Append(value, 
length));
+return v.ref.buffer_index;
+  }
+
+  Result AppendBuffer(const char* value, const int64_t length) {
+return AppendBuffer(reinterpret_cast(value), length);
+  }
+
+  Result AppendBuffer(const std::string& value) {

Review Comment:
   Can we remove this one since std::string_view is added here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]

2025-06-20 Thread via GitHub


mapleFU commented on code in PR #46730:
URL: https://github.com/apache/arrow/pull/46730#discussion_r2159268798


##
cpp/src/arrow/array/builder_binary.h:
##
@@ -645,6 +664,58 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder 
{
 UnsafeAppend(value.data(), static_cast(value.size()));
   }
 
+  /// \brief Append a buffer of raw bytes to the internal data heap
+  ///
+  /// This method is used to add out-of-line data buffers to the builder.
+  /// The size of the buffer must be larger than TypeClass::kInlineSize.
+  ///
+  /// \param[in] value Pointer to the raw byte data.
+  /// \param[in] length The number of bytes in the buffer.
+  /// \return A Result containing the index of the newly appended buffer on 
success,
+  /// or a Status error if the length is too small or allocation fails.
+  Result AppendBuffer(const uint8_t* value, const int64_t length) {
+if (ARROW_PREDICT_FALSE(length <= TypeClass::kInlineSize)) {
+  return Status::Invalid(
+  "The size of buffer to append should be larger than kInlineSize");
+}
+ARROW_RETURN_NOT_OK(data_heap_builder_.FinishLastBlock());
+ARROW_ASSIGN_OR_RAISE(auto v,
+  data_heap_builder_.Append(value, 
length));
+return v.ref.buffer_index;
+  }
+
+  Result AppendBuffer(const char* value, const int64_t length) {
+return AppendBuffer(reinterpret_cast(value), length);
+  }
+
+  Result AppendBuffer(const std::string& value) {
+return AppendBuffer(value.data(), static_cast(value.size()));
+  }
+
+  Result AppendBuffer(const std::string_view value) {
+return AppendBuffer(value.data(), static_cast(value.size()));
+  }
+
+  Status AppendViewFromBuffer(const int32_t buffer_idx, const int32_t start,
+  const int32_t length) {
+ARROW_RETURN_NOT_OK(Reserve(1));
+UnsafeAppendToBitmap(true);
+ARROW_ASSIGN_OR_RAISE(const auto v, 
data_heap_builder_.GetViewFromBuffer(
+buffer_idx, start, length));
+data_builder_.UnsafeAppend(v);
+return Status::OK();
+  }
+
+  /// \pre The caller must ensure that:
+  ///  - `buffer_idx` is a valid index for buffer.
+  ///  - `start` and `length` define a valid range within the specified 
buffer.
+  void UnsafeAppendViewFromBuffer(const int32_t buffer_idx, const int32_t 
start,
+  const int32_t length) {
+UnsafeAppendToBitmap(true);
+const auto v = data_heap_builder_.GetViewFromBuffer(buffer_idx, 
start, length);

Review Comment:
   ```suggestion
   const auto v = 
data_heap_builder_.GetViewFromBuffer(buffer_idx, start, length);
   ```



##
cpp/src/arrow/array/builder_binary.h:
##
@@ -645,6 +664,58 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder 
{
 UnsafeAppend(value.data(), static_cast(value.size()));
   }
 
+  /// \brief Append a buffer of raw bytes to the internal data heap
+  ///
+  /// This method is used to add out-of-line data buffers to the builder.
+  /// The size of the buffer must be larger than TypeClass::kInlineSize.
+  ///
+  /// \param[in] value Pointer to the raw byte data.
+  /// \param[in] length The number of bytes in the buffer.
+  /// \return A Result containing the index of the newly appended buffer on 
success,
+  /// or a Status error if the length is too small or allocation fails.
+  Result AppendBuffer(const uint8_t* value, const int64_t length) {
+if (ARROW_PREDICT_FALSE(length <= TypeClass::kInlineSize)) {
+  return Status::Invalid(
+  "The size of buffer to append should be larger than kInlineSize");
+}
+ARROW_RETURN_NOT_OK(data_heap_builder_.FinishLastBlock());
+ARROW_ASSIGN_OR_RAISE(auto v,
+  data_heap_builder_.Append(value, 
length));
+return v.ref.buffer_index;
+  }
+
+  Result AppendBuffer(const char* value, const int64_t length) {
+return AppendBuffer(reinterpret_cast(value), length);
+  }
+
+  Result AppendBuffer(const std::string& value) {
+return AppendBuffer(value.data(), static_cast(value.size()));
+  }
+
+  Result AppendBuffer(const std::string_view value) {
+return AppendBuffer(value.data(), static_cast(value.size()));
+  }
+
+  Status AppendViewFromBuffer(const int32_t buffer_idx, const int32_t start,
+  const int32_t length) {
+ARROW_RETURN_NOT_OK(Reserve(1));
+UnsafeAppendToBitmap(true);
+ARROW_ASSIGN_OR_RAISE(const auto v, 
data_heap_builder_.GetViewFromBuffer(

Review Comment:
   ```suggestion
   ARROW_ASSIGN_OR_RAISE(const auto v, 
data_heap_builder_.GetViewFromBuffer(
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]

2025-06-14 Thread via GitHub


mapleFU commented on code in PR #46730:
URL: https://github.com/apache/arrow/pull/46730#discussion_r2146963301


##
cpp/src/arrow/array/builder_binary.h:
##
@@ -645,6 +664,35 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder 
{
 UnsafeAppend(value.data(), static_cast(value.size()));
   }
 
+  Result AppendBuffer(const uint8_t* value, const int64_t length) {
+if (ARROW_PREDICT_FALSE(length <= TypeClass::kInlineSize)) {
+  return Status::Invalid(
+  "The size of buffer to append should be larger than kInlineSize");
+}
+ARROW_RETURN_NOT_OK(data_heap_builder_.FinishLastBlock());
+ARROW_ASSIGN_OR_RAISE(auto v,
+  data_heap_builder_.Append(value, 
length));
+return v.ref.buffer_index;
+  }
+
+  Result AppendBuffer(const char* value, const int64_t length) {
+return AppendBuffer(reinterpret_cast(value), length);
+  }
+
+  Result AppendBuffer(const std::string& value) {

Review Comment:
   Can `std::string_view` being used rather than `const std::string&`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]

2025-06-14 Thread via GitHub


mapleFU commented on code in PR #46730:
URL: https://github.com/apache/arrow/pull/46730#discussion_r2146963301


##
cpp/src/arrow/array/builder_binary.h:
##
@@ -645,6 +664,35 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder 
{
 UnsafeAppend(value.data(), static_cast(value.size()));
   }
 
+  Result AppendBuffer(const uint8_t* value, const int64_t length) {
+if (ARROW_PREDICT_FALSE(length <= TypeClass::kInlineSize)) {
+  return Status::Invalid(
+  "The size of buffer to append should be larger than kInlineSize");
+}
+ARROW_RETURN_NOT_OK(data_heap_builder_.FinishLastBlock());
+ARROW_ASSIGN_OR_RAISE(auto v,
+  data_heap_builder_.Append(value, 
length));
+return v.ref.buffer_index;
+  }
+
+  Result AppendBuffer(const char* value, const int64_t length) {
+return AppendBuffer(reinterpret_cast(value), length);
+  }
+
+  Result AppendBuffer(const std::string& value) {

Review Comment:
   Can `std::string_view` being used?



##
cpp/src/arrow/array/builder_binary.h:
##
@@ -645,6 +664,35 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder 
{
 UnsafeAppend(value.data(), static_cast(value.size()));
   }
 
+  Result AppendBuffer(const uint8_t* value, const int64_t length) {
+if (ARROW_PREDICT_FALSE(length <= TypeClass::kInlineSize)) {
+  return Status::Invalid(
+  "The size of buffer to append should be larger than kInlineSize");
+}
+ARROW_RETURN_NOT_OK(data_heap_builder_.FinishLastBlock());
+ARROW_ASSIGN_OR_RAISE(auto v,
+  data_heap_builder_.Append(value, 
length));
+return v.ref.buffer_index;
+  }
+
+  Result AppendBuffer(const char* value, const int64_t length) {
+return AppendBuffer(reinterpret_cast(value), length);
+  }
+
+  Result AppendBuffer(const std::string& value) {
+return AppendBuffer(value.data(), static_cast(value.size()));
+  }
+
+  Status AppendViewFromBuffer(const int32_t buffer_idx, const int32_t start,
+  const int32_t length) {
+ARROW_RETURN_NOT_OK(Reserve(1));
+UnsafeAppendToBitmap(true);
+ARROW_ASSIGN_OR_RAISE(
+auto v, data_heap_builder_.GetViewFromBuffer(buffer_idx, start, 
length));

Review Comment:
   should we also adding a unsafe Append version?



##
cpp/src/arrow/array/builder_binary.h:
##
@@ -645,6 +664,35 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder 
{
 UnsafeAppend(value.data(), static_cast(value.size()));
   }
 
+  Result AppendBuffer(const uint8_t* value, const int64_t length) {
+if (ARROW_PREDICT_FALSE(length <= TypeClass::kInlineSize)) {
+  return Status::Invalid(
+  "The size of buffer to append should be larger than kInlineSize");
+}

Review Comment:
   I'm curious that would this api be a bit strict for outside user 🤔? Should 
we add comment doc string for this, and add sample usage:
   
   ```
   if (buffer.size() <= 12) {
 // Append directly
 for (...) {}
   } else {
 auto buffer_idx = AppendBuffer();
 for (..) {
   AppendViewFromBuffer(buffer_idx, ...);
 }
   }
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]

2025-06-14 Thread via GitHub


IndifferentArea commented on PR #46730:
URL: https://github.com/apache/arrow/pull/46730#issuecomment-2972782119

   To implement interface aligned with what arrow-rs did, i have to change some 
behavior of `StringHeapBuild::FinishLastBlock()` and mark it as public.
   
   More specific, `FinishLastBlock()` just resize the last block before. Now 
`FinishLastBlock()` reset internal states, including  `current_offset_`, 
`current_out_buffer_` and `current_remaining_bytes_`. I believe it's more safe 
and reasonable.
   The interface was private before so don't mind external usage, for current 
internal usage of `FinishLastBlock()`, they always reset or change all these 
status.
   
   If this change is unacceptable, plz let me know, i'll try to find another 
way.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]

2025-06-14 Thread via GitHub


IndifferentArea commented on code in PR #46730:
URL: https://github.com/apache/arrow/pull/46730#discussion_r2146950173


##
cpp/src/arrow/array/array_test.cc:
##
@@ -1000,6 +1000,39 @@ TEST_F(TestArray, TestBinaryViewAppendArraySlice) {
   AssertArraysEqual(*src, *dst);
 }
 
+TEST_F(TestArray, BinaryViewFromBlock) {
+  BinaryViewBuilder src_builder(pool_);
+  ASSERT_OK(src_builder.Append("Hello"));
+  /// let block = builder.append_block(b"helloworldbingobongo".into());
+  ///
+  /// builder.try_append_view(block, 0, 5).unwrap();
+  /// builder.try_append_view(block, 5, 5).unwrap();
+  /// builder.try_append_view(block, 10, 5).unwrap();
+  /// builder.try_append_view(block, 15, 5).unwrap();
+  /// builder.try_append_view(block, 0, 15).unwrap();
+  /// let array = builder.finish();
+  ASSERT_OK_AND_ASSIGN(const auto buffer,
+   src_builder.AppendBuffer("helloworldbingobongo"));
+  ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 0, 5));
+  ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 5, 5));
+  ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 10, 5));
+  ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 15, 5));
+  ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 0, 15));
+

Review Comment:
   added



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]

2025-06-08 Thread via GitHub


mapleFU commented on code in PR #46730:
URL: https://github.com/apache/arrow/pull/46730#discussion_r2134762348


##
cpp/src/arrow/array/array_test.cc:
##
@@ -1000,6 +1000,39 @@ TEST_F(TestArray, TestBinaryViewAppendArraySlice) {
   AssertArraysEqual(*src, *dst);
 }
 
+TEST_F(TestArray, BinaryViewFromBlock) {
+  BinaryViewBuilder src_builder(pool_);
+  ASSERT_OK(src_builder.Append("Hello"));
+  /// let block = builder.append_block(b"helloworldbingobongo".into());
+  ///
+  /// builder.try_append_view(block, 0, 5).unwrap();
+  /// builder.try_append_view(block, 5, 5).unwrap();
+  /// builder.try_append_view(block, 10, 5).unwrap();
+  /// builder.try_append_view(block, 15, 5).unwrap();
+  /// builder.try_append_view(block, 0, 15).unwrap();
+  /// let array = builder.finish();
+  ASSERT_OK_AND_ASSIGN(const auto buffer,
+   src_builder.AppendBuffer("helloworldbingobongo"));
+  ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 0, 5));
+  ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 5, 5));
+  ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 10, 5));
+  ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 15, 5));
+  ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 0, 15));
+
+  ASSERT_OK_AND_ASSIGN(auto src, src_builder.Finish());
+  ASSERT_OK(src->ValidateFull());
+
+  // Verify the content of the resulting array
+  ASSERT_EQ(src->length(), 6);
+  const auto& binary_view_array = static_cast(*src);

Review Comment:
   Can we both test stringView and binaryView?



##
cpp/src/arrow/array/array_test.cc:
##
@@ -1000,6 +1000,39 @@ TEST_F(TestArray, TestBinaryViewAppendArraySlice) {
   AssertArraysEqual(*src, *dst);
 }
 
+TEST_F(TestArray, BinaryViewFromBlock) {
+  BinaryViewBuilder src_builder(pool_);
+  ASSERT_OK(src_builder.Append("Hello"));
+  /// let block = builder.append_block(b"helloworldbingobongo".into());
+  ///
+  /// builder.try_append_view(block, 0, 5).unwrap();
+  /// builder.try_append_view(block, 5, 5).unwrap();
+  /// builder.try_append_view(block, 10, 5).unwrap();
+  /// builder.try_append_view(block, 15, 5).unwrap();
+  /// builder.try_append_view(block, 0, 15).unwrap();
+  /// let array = builder.finish();
+  ASSERT_OK_AND_ASSIGN(const auto buffer,
+   src_builder.AppendBuffer("helloworldbingobongo"));
+  ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 0, 5));

Review Comment:
   Can we try append unexists `buffer_index`?



##
cpp/src/arrow/array/array_test.cc:
##
@@ -1000,6 +1000,39 @@ TEST_F(TestArray, TestBinaryViewAppendArraySlice) {
   AssertArraysEqual(*src, *dst);
 }
 
+TEST_F(TestArray, BinaryViewFromBlock) {
+  BinaryViewBuilder src_builder(pool_);
+  ASSERT_OK(src_builder.Append("Hello"));
+  /// let block = builder.append_block(b"helloworldbingobongo".into());
+  ///
+  /// builder.try_append_view(block, 0, 5).unwrap();
+  /// builder.try_append_view(block, 5, 5).unwrap();
+  /// builder.try_append_view(block, 10, 5).unwrap();
+  /// builder.try_append_view(block, 15, 5).unwrap();
+  /// builder.try_append_view(block, 0, 15).unwrap();
+  /// let array = builder.finish();
+  ASSERT_OK_AND_ASSIGN(const auto buffer,
+   src_builder.AppendBuffer("helloworldbingobongo"));
+  ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 0, 5));
+  ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 5, 5));
+  ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 10, 5));
+  ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 15, 5));
+  ASSERT_OK(src_builder.AppendViewFromBuffer(buffer, 0, 15));
+

Review Comment:
   Can we test append multiple buffers?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]

2025-06-08 Thread via GitHub


IndifferentArea commented on PR #46730:
URL: https://github.com/apache/arrow/pull/46730#issuecomment-2954043045

   Not sure why these 2 ci always failed..


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]

2025-06-07 Thread via GitHub


mapleFU commented on PR #46730:
URL: https://github.com/apache/arrow/pull/46730#issuecomment-2952405404

   Some personal thoughts:
   1. AppendBuffer ( and etc ) which returns a StringView is a bit weird, Block 
is not a StringView 
   2. Now aligned with arrow-rs also a good way


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]

2025-06-07 Thread via GitHub


IndifferentArea commented on PR #46730:
URL: https://github.com/apache/arrow/pull/46730#issuecomment-2951999146

   Should we rename `AppendBuffer` or redesign the interface's semantics? since 
we don't really append a **buffer/block**, we will directly append to the last 
one if remaining size is enough.. I think It may introduce confusion.
   
   Maybe aligning with arrow-rs's impl is fine..


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]

2025-06-06 Thread via GitHub


mapleFU commented on code in PR #46730:
URL: https://github.com/apache/arrow/pull/46730#discussion_r2133473408


##
cpp/src/arrow/array/builder_binary.h:
##
@@ -645,6 +657,28 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder 
{
 UnsafeAppend(value.data(), static_cast(value.size()));
   }
 
+  Result> AppendBlock(const uint8_t* value,

Review Comment:
   The syntax is a bit weird here? Append a BinaryView and then append the 
sub-slice of the view?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]

2025-06-06 Thread via GitHub


IndifferentArea commented on code in PR #46730:
URL: https://github.com/apache/arrow/pull/46730#discussion_r2133457449


##
cpp/src/arrow/array/builder_binary.h:
##
@@ -645,6 +657,28 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder 
{
 UnsafeAppend(value.data(), static_cast(value.size()));
   }
 
+  Result> AppendBlock(const uint8_t* value,

Review Comment:
   can i directly use `BinaryViewType` since it already contains these two info 
we need?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]

2025-06-06 Thread via GitHub


IndifferentArea commented on code in PR #46730:
URL: https://github.com/apache/arrow/pull/46730#discussion_r2133457449


##
cpp/src/arrow/array/builder_binary.h:
##
@@ -645,6 +657,28 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder 
{
 UnsafeAppend(value.data(), static_cast(value.size()));
   }
 
+  Result> AppendBlock(const uint8_t* value,

Review Comment:
   can i directly use `BinaryViewType::c_type` since it already contains these 
two info we need?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]

2025-06-06 Thread via GitHub


mapleFU commented on code in PR #46730:
URL: https://github.com/apache/arrow/pull/46730#discussion_r2133456883


##
cpp/src/arrow/array/builder_binary.h:
##
@@ -645,6 +657,28 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder 
{
 UnsafeAppend(value.data(), static_cast(value.size()));
   }
 
+  Result> AppendBlock(const uint8_t* value,
+  const int64_t length);
+
+  Result> AppendBlock(const char* value,
+  const int64_t length) {
+return AppendBlock(reinterpret_cast(value), length);
+  }
+
+  Result> AppendBlock(const std::string& value) {
+return AppendBlock(value.data(), static_cast(value.size()));
+  }
+
+  Status AppendViewFromBuffer(int32_t buffer_id, int32_t buffer_offset, 
int32_t start,

Review Comment:
   Personally both is ok for me, I prefer Buffer since a variable is 
`buffer_index`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]

2025-06-06 Thread via GitHub


IndifferentArea commented on code in PR #46730:
URL: https://github.com/apache/arrow/pull/46730#discussion_r2133442265


##
cpp/src/arrow/array/builder_binary.h:
##
@@ -645,6 +657,28 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder 
{
 UnsafeAppend(value.data(), static_cast(value.size()));
   }
 
+  Result> AppendBlock(const uint8_t* value,
+  const int64_t length);
+
+  Result> AppendBlock(const char* value,
+  const int64_t length) {
+return AppendBlock(reinterpret_cast(value), length);
+  }
+
+  Result> AppendBlock(const std::string& value) {
+return AppendBlock(value.data(), static_cast(value.size()));
+  }
+
+  Status AppendViewFromBuffer(int32_t buffer_id, int32_t buffer_offset, 
int32_t start,

Review Comment:
   sorry for this annoying inconsistency. To be clear, is one of `block` or 
`buffer` prefered, or either is fine? I found `block` is used in 
`StringHeapBuilder` while `buffer` is used in `BinaryBuilder`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]

2025-06-06 Thread via GitHub


IndifferentArea commented on code in PR #46730:
URL: https://github.com/apache/arrow/pull/46730#discussion_r2133442265


##
cpp/src/arrow/array/builder_binary.h:
##
@@ -645,6 +657,28 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder 
{
 UnsafeAppend(value.data(), static_cast(value.size()));
   }
 
+  Result> AppendBlock(const uint8_t* value,
+  const int64_t length);
+
+  Result> AppendBlock(const char* value,
+  const int64_t length) {
+return AppendBlock(reinterpret_cast(value), length);
+  }
+
+  Result> AppendBlock(const std::string& value) {
+return AppendBlock(value.data(), static_cast(value.size()));
+  }
+
+  Status AppendViewFromBuffer(int32_t buffer_id, int32_t buffer_offset, 
int32_t start,

Review Comment:
   sorry for this annoying inconsistency. To be clear, is one of `block` or 
`buffer` prefered, or either is fine? I found `block` is used in 
`StringHeapBuilder` while `buffer` is used in `BinaryBuilder`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]

2025-06-06 Thread via GitHub


mapleFU commented on code in PR #46730:
URL: https://github.com/apache/arrow/pull/46730#discussion_r2133344885


##
cpp/src/arrow/array/builder_binary.h:
##
@@ -511,6 +511,18 @@ class ARROW_EXPORT StringHeapBuilder {
 return v;
   }
 
+  c_type GetViewFromBlock(int32_t block_id, int32_t block_offset, int32_t 
offset,
+  int32_t length) const {
+const auto* value = blocks_.at(block_id)->data_as() + 
block_offset + offset;
+if (length <= BinaryViewType::kInlineSize) {

Review Comment:
   Uses `ToBinaryView`?



##
cpp/src/arrow/array/builder_binary.h:
##
@@ -645,6 +657,28 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder 
{
 UnsafeAppend(value.data(), static_cast(value.size()));
   }
 
+  Result> AppendBlock(const uint8_t* value,
+  const int64_t length);
+
+  Result> AppendBlock(const char* value,
+  const int64_t length) {
+return AppendBlock(reinterpret_cast(value), length);
+  }
+
+  Result> AppendBlock(const std::string& value) {
+return AppendBlock(value.data(), static_cast(value.size()));
+  }
+
+  Status AppendViewFromBuffer(int32_t buffer_id, int32_t buffer_offset, 
int32_t start,

Review Comment:
   naming: from buffer or from block?



##
cpp/src/arrow/array/builder_binary.cc:
##
@@ -100,6 +100,13 @@ void BinaryViewBuilder::Reset() {
   data_heap_builder_.Reset();
 }
 
+Result> BinaryViewBuilder::AppendBlock(const 
uint8_t* value,
+   const 
int64_t length) {
+  DCHECK_GT(length, TypeClass::kInlineSize);

Review Comment:
   If length <= kInlineSize, should this return false or ok? Why just DCHECK 
here?



##
cpp/src/arrow/array/builder_binary.h:
##
@@ -645,6 +657,28 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder 
{
 UnsafeAppend(value.data(), static_cast(value.size()));
   }
 
+  Result> AppendBlock(const uint8_t* value,

Review Comment:
   Can we use more specific name rather than `pair`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice [arrow]

2025-06-06 Thread via GitHub


IndifferentArea commented on PR #46730:
URL: https://github.com/apache/arrow/pull/46730#issuecomment-2951496617

   @mapleFU is currently implemented interface expected? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org