Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]

2025-06-25 Thread via GitHub


pitrou commented on PR #46789:
URL: https://github.com/apache/arrow/pull/46789#issuecomment-3003781128

   @github-actions crossbow submit -g cpp


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]

2025-06-25 Thread via GitHub


github-actions[bot] commented on PR #46789:
URL: https://github.com/apache/arrow/pull/46789#issuecomment-3003789770

   Revision: 3ea4dd3de5bd4618af53e73fb37139118101c69b
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-e6dee40784](https://github.com/ursacomputing/crossbow/branches/all?query=actions-e6dee40784)
   
   |Task|Status|
   |---|---|
   |example-cpp-minimal-build-static|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e6dee40784-github-example-cpp-minimal-build-static)](https://github.com/ursacomputing/crossbow/actions/runs/15870853412/job/44746841415)|
   |example-cpp-minimal-build-static-system-dependency|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e6dee40784-github-example-cpp-minimal-build-static-system-dependency)](https://github.com/ursacomputing/crossbow/actions/runs/15870853317/job/44746841177)|
   |example-cpp-tutorial|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e6dee40784-github-example-cpp-tutorial)](https://github.com/ursacomputing/crossbow/actions/runs/15870853495/job/44746841683)|
   |test-build-cpp-fuzz|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e6dee40784-github-test-build-cpp-fuzz)](https://github.com/ursacomputing/crossbow/actions/runs/15870853357/job/44746841232)|
   |test-conda-cpp|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e6dee40784-github-test-conda-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/15870853198/job/44746840869)|
   |test-conda-cpp-valgrind|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e6dee40784-github-test-conda-cpp-valgrind)](https://github.com/ursacomputing/crossbow/actions/runs/15870853221/job/44746840882)|
   |test-cuda-cpp-ubuntu-22.04-cuda-11.7.1|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e6dee40784-github-test-cuda-cpp-ubuntu-22.04-cuda-11.7.1)](https://github.com/ursacomputing/crossbow/actions/runs/15870853881/job/44746842867)|
   |test-debian-12-cpp-amd64|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e6dee40784-github-test-debian-12-cpp-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/15870853876/job/44746842841)|
   |test-debian-12-cpp-i386|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e6dee40784-github-test-debian-12-cpp-i386)](https://github.com/ursacomputing/crossbow/actions/runs/15870853513/job/44746841758)|
   |test-fedora-39-cpp|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e6dee40784-github-test-fedora-39-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/15870853910/job/44746843062)|
   |test-ubuntu-22.04-cpp|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e6dee40784-github-test-ubuntu-22.04-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/15870853394/job/44746841445)|
   |test-ubuntu-22.04-cpp-20|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e6dee40784-github-test-ubuntu-22.04-cpp-20)](https://github.com/ursacomputing/crossbow/actions/runs/15870853615/job/44746842025)|
   |test-ubuntu-22.04-cpp-bundled|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e6dee40784-github-test-ubuntu-22.04-cpp-bundled)](https://github.com/ursacomputing/crossbow/actions/runs/15870853639/job/44746842079)|
   |test-ubuntu-22.04-cpp-emscripten|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e6dee40784-github-test-ubuntu-22.04-cpp-emscripten)](https://github.com/ursacomputing/crossbow/actions/runs/15870853688/job/44746842252)|
   |test-ubuntu-22.04-cpp-no-threading|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e6dee40784-github-test-ubuntu-22.04-cpp-no-threading)](https://github.com/ursacomputing/crossbow/actions/runs/15870853839/job/44746842698)|
   |test-ubuntu-24.04-cpp|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e6dee40784-github-test-ubuntu-24.04-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/15870853231/job/44746840888)|
   |test-ubuntu-24.04-cpp-bundled-offline|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workfl

Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]

2025-06-24 Thread via GitHub


AntoinePrv commented on PR #46789:
URL: https://github.com/apache/arrow/pull/46789#issuecomment-3000985547

   @pitrou do we need anything else?





Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]

2025-06-23 Thread via GitHub


pitrou commented on code in PR #46789:
URL: https://github.com/apache/arrow/pull/46789#discussion_r2161928332


##
cpp/src/arrow/util/byte_stream_split_internal.h:
##
@@ -89,23 +102,72 @@ void ByteStreamSplitDecodeSimd128(const uint8_t* data, int width, int64_t num_va
 }
 for (int j = 0; j < kNumStreams; ++j) {
   xsimd::store_unaligned(
-  reinterpret_cast<int8_t*>(out + (i * kNumStreams + j) * sizeof(simd_batch)),
+  reinterpret_cast<int8_t*>(out + (i * kNumStreams + j) * simd_batch::size),
   stage[kNumStreamsLog2][j]);
 }
   }
 }
 
+template <int NBytes>

Review Comment:
   Being able to instantiate an integer type based on a templated number of bytes sounds quite general to me. It is not specific to SIMD.






Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]

2025-06-23 Thread via GitHub


AntoinePrv commented on code in PR #46789:
URL: https://github.com/apache/arrow/pull/46789#discussion_r2161920973


##
cpp/src/arrow/util/byte_stream_split_internal.h:
##
@@ -89,23 +102,72 @@ void ByteStreamSplitDecodeSimd128(const uint8_t* data, int width, int64_t num_va
 }
 for (int j = 0; j < kNumStreams; ++j) {
   xsimd::store_unaligned(
-  reinterpret_cast<int8_t*>(out + (i * kNumStreams + j) * sizeof(simd_batch)),
+  reinterpret_cast<int8_t*>(out + (i * kNumStreams + j) * simd_batch::size),
   stage[kNumStreamsLog2][j]);
 }
   }
 }
 
+template <int NBytes>

Review Comment:
   Type traits looks much more like general detection functions, while this seemed pretty strongly tied to this algorithm. Do you think it would be a better fit there?






Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]

2025-06-23 Thread via GitHub


AntoinePrv commented on code in PR #46789:
URL: https://github.com/apache/arrow/pull/46789#discussion_r2161895195


##
cpp/src/arrow/util/byte_stream_split_internal.h:
##
@@ -123,95 +185,79 @@ void ByteStreamSplitEncodeSimd128(const uint8_t* raw_values, int width,
   output_buffer_raw[j * num_values + i] = byte_in_value;
 }
   }
-  // The current shuffling algorithm diverges for float and double types but the compiler
-  // should be able to remove the branch since only one path is taken for each template
-  // instantiation.
-  // Example run for 32-bit variables:
-  // Step 0: copy from unaligned input bytes:
-  //   0: ABCD ABCD ABCD ABCD 1: ABCD ABCD ABCD ABCD ...
-  // Step 1: simd_batch::zip_lo and simd_batch::zip_hi:
-  //   0: AABB CCDD AABB CCDD 1: AABB CCDD AABB CCDD ...
-  // Step 2: apply simd_batch::zip_lo and simd_batch::zip_hi again:
-  //   0: AAAA BBBB CCCC DDDD 1: AAAA BBBB CCCC DDDD ...
-  // Step 3: simd_batch::zip_lo and simd_batch::zip_hi:
-  //   0: AAAA AAAA BBBB BBBB 1: CCCC CCCC DDDD DDDD ...
-  // Step 4: simd_batch::zip_lo and simd_batch::zip_hi:
-  //   0: AAAA AAAA AAAA AAAA 1: BBBB BBBB BBBB BBBB ...
+
+  // Number of input values we can fit in a simd register
+  constexpr int kNumValuesInBatch = simd_batch::size / kNumStreams;
+  static_assert(kNumValuesInBatch > 0);
+  // Number of bytes we'll bring together in the first byte-level part of the algorithm.
+  // Since we zip with the next batch, the number of values in a batch determines how many
+  // bytes end up together before we can use a larger type
+  constexpr int kNumBytes = 2 * kNumValuesInBatch;
+  // Number of steps in the first part of the algorithm, with byte-level zipping
+  constexpr int kNumStepsByte = ReversePow2(kNumValuesInBatch) + 1;
+  // Number of steps in the second part of the algorithm, with large data type zipping
+  constexpr int kNumStepsLarge =
+      ReversePow2(static_cast<int>(simd_batch::size) / kNumBytes);
+  // Total number of steps
+  constexpr int kNumSteps = kNumStepsByte + kNumStepsLarge;

Review Comment:
   Correct, I did not see it that way.






Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]

2025-06-23 Thread via GitHub


AntoinePrv commented on code in PR #46789:
URL: https://github.com/apache/arrow/pull/46789#discussion_r2161887714


##
cpp/src/arrow/util/byte_stream_split_internal.h:
##
@@ -89,23 +102,72 @@ void ByteStreamSplitDecodeSimd128(const uint8_t* data, int width, int64_t num_va
 }
 for (int j = 0; j < kNumStreams; ++j) {
   xsimd::store_unaligned(
-  reinterpret_cast<int8_t*>(out + (i * kNumStreams + j) * sizeof(simd_batch)),
+  reinterpret_cast<int8_t*>(out + (i * kNumStreams + j) * simd_batch::size),
   stage[kNumStreamsLog2][j]);
 }
   }
 }
 
+template <int NBytes>
+struct grouped_bytes_impl;
+
+template <>
+struct grouped_bytes_impl<1> {
+  using type = int8_t;
+};
+
+template <>
+struct grouped_bytes_impl<2> {
+  using type = int16_t;
+};
+
+template <>
+struct grouped_bytes_impl<4> {
+  using type = int32_t;
+};
+
+template <>
+struct grouped_bytes_impl<8> {
+  using type = int64_t;
+};
+
+// Map a number of bytes to a type
+template <int NBytes>
+using grouped_bytes_t = typename grouped_bytes_impl<NBytes>::type;
+
+// Like xsimd::zip_lo, but zip groups of NBytes at once
+template <int kNumBytes, typename Batch, std::size_t kBatchSize = Batch::size>
+auto zip_lo_n(Batch const& a, Batch const& b) -> Batch {
+  if constexpr (kNumBytes == kBatchSize) {
+    return a;
+  } else {
+    return xsimd::bitwise_cast<int8_t>(
+        xsimd::zip_lo(xsimd::bitwise_cast<grouped_bytes_t<kNumBytes>>(a),
+                      xsimd::bitwise_cast<grouped_bytes_t<kNumBytes>>(b)));
+  }
+}
+
+// Like xsimd::zip_hi, but zip groups of NBytes at once
+template <int kNumBytes, typename Batch, std::size_t kBatchSize = Batch::size>
+auto zip_hi_n(Batch const& a, Batch const& b) -> Batch {
+  if constexpr (kNumBytes == kBatchSize) {
+    return b;
+  } else {
+    return xsimd::bitwise_cast<int8_t>(
+        xsimd::zip_hi(xsimd::bitwise_cast<grouped_bytes_t<kNumBytes>>(a),
+                      xsimd::bitwise_cast<grouped_bytes_t<kNumBytes>>(b)));
+  }
+}
+
 template <int kNumStreams>
 void ByteStreamSplitEncodeSimd128(const uint8_t* raw_values, int width,
                                   const int64_t num_values, uint8_t* output_buffer_raw) {
   using simd_batch = xsimd::make_sized_batch_t<int8_t, 16>;
 
   assert(width == kNumStreams);
-  static_assert(kNumStreams == 4 || kNumStreams == 8, "Invalid number of streams.");
-  constexpr int kBlockSize = sizeof(simd_batch) * kNumStreams;
-
-  simd_batch stage[3][kNumStreams];
-  simd_batch final_result[kNumStreams];
+  constexpr int kBlockSize = simd_batch::size * kNumStreams;

Review Comment:
   This specific constant was already here. Here `sizeof` was replaced with `::size`.
   As for the function in general:
   - it is independent of the *number of streams* (under the condition that kNumStreams <= simd_batch::size and kNumStreams is a power of two, for which I'm adding a check.)
   - it is not far from being independent of the SIMD batch size. I had planned the final changes for another PR but I guess I could include them here if you prefer.









Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]

2025-06-23 Thread via GitHub


pitrou commented on PR #46789:
URL: https://github.com/apache/arrow/pull/46789#issuecomment-2995582815

   Weirdly, this fails compiling here: 
https://github.com/apache/arrow/actions/runs/15761197292/job/44584131273?pr=46789#step:7:1727





Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]

2025-06-23 Thread via GitHub


pitrou commented on PR #46789:
URL: https://github.com/apache/arrow/pull/46789#issuecomment-2995593279

   Ah, also, can you update the tests to exercise the 2-byte BSS SIMD 
specializations?
   
https://github.com/apache/arrow/blob/dacec3080f8a7c245dc884d278ea6280942086f0/cpp/src/arrow/util/byte_stream_split_test.cc#L139-L172
   





Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]

2025-06-23 Thread via GitHub


pitrou commented on code in PR #46789:
URL: https://github.com/apache/arrow/pull/46789#discussion_r216106


##
cpp/src/arrow/util/byte_stream_split_internal.h:
##
@@ -123,95 +185,79 @@ void ByteStreamSplitEncodeSimd128(const uint8_t* raw_values, int width,
   output_buffer_raw[j * num_values + i] = byte_in_value;
 }
   }
-  // The current shuffling algorithm diverges for float and double types but the compiler
-  // should be able to remove the branch since only one path is taken for each template
-  // instantiation.
-  // Example run for 32-bit variables:
-  // Step 0: copy from unaligned input bytes:
-  //   0: ABCD ABCD ABCD ABCD 1: ABCD ABCD ABCD ABCD ...
-  // Step 1: simd_batch::zip_lo and simd_batch::zip_hi:
-  //   0: AABB CCDD AABB CCDD 1: AABB CCDD AABB CCDD ...
-  // Step 2: apply simd_batch::zip_lo and simd_batch::zip_hi again:
-  //   0: AAAA BBBB CCCC DDDD 1: AAAA BBBB CCCC DDDD ...
-  // Step 3: simd_batch::zip_lo and simd_batch::zip_hi:
-  //   0: AAAA AAAA BBBB BBBB 1: CCCC CCCC DDDD DDDD ...
-  // Step 4: simd_batch::zip_lo and simd_batch::zip_hi:
-  //   0: AAAA AAAA AAAA AAAA 1: BBBB BBBB BBBB BBBB ...
+
+  // Number of input values we can fit in a simd register
+  constexpr int kNumValuesInBatch = simd_batch::size / kNumStreams;
+  static_assert(kNumValuesInBatch > 0);
+  // Number of bytes we'll bring together in the first byte-level part of the algorithm.
+  // Since we zip with the next batch, the number of values in a batch determines how many
+  // bytes end up together before we can use a larger type
+  constexpr int kNumBytes = 2 * kNumValuesInBatch;
+  // Number of steps in the first part of the algorithm, with byte-level zipping
+  constexpr int kNumStepsByte = ReversePow2(kNumValuesInBatch) + 1;
+  // Number of steps in the second part of the algorithm, with large data type zipping
+  constexpr int kNumStepsLarge =
+      ReversePow2(static_cast<int>(simd_batch::size) / kNumBytes);
+  // Total number of steps
+  constexpr int kNumSteps = kNumStepsByte + kNumStepsLarge;

Review Comment:
   So this is `ReversePow2(kNumValuesInBatch) + 1 + 
ReversePow2(simd_batch::size / (2 * kNumValuesInBatch))`, so in the end it's 
`ReversePow2(simd_batch::size)`, right?
   
   If so, perhaps add a `static_assert` for better understanding of the 
algorithm's complexity.



##
cpp/src/arrow/util/byte_stream_split_internal.h:
##
@@ -89,23 +102,72 @@ void ByteStreamSplitDecodeSimd128(const uint8_t* data, int width, int64_t num_va
 }
 for (int j = 0; j < kNumStreams; ++j) {
   xsimd::store_unaligned(
-  reinterpret_cast<int8_t*>(out + (i * kNumStreams + j) * sizeof(simd_batch)),
+  reinterpret_cast<int8_t*>(out + (i * kNumStreams + j) * simd_batch::size),
   stage[kNumStreamsLog2][j]);
 }
   }
 }
 
+template <int NBytes>

Review Comment:
   This could be given a more general name and go into 
`arrow/util/type_traits.h` perhaps.



##
cpp/src/arrow/util/byte_stream_split_internal.h:
##
@@ -89,23 +102,72 @@ void ByteStreamSplitDecodeSimd128(const uint8_t* data, int width, int64_t num_va
 }
 for (int j = 0; j < kNumStreams; ++j) {
   xsimd::store_unaligned(
-  reinterpret_cast<int8_t*>(out + (i * kNumStreams + j) * sizeof(simd_batch)),
+  reinterpret_cast<int8_t*>(out + (i * kNumStreams + j) * simd_batch::size),
   stage[kNumStreamsLog2][j]);
 }
   }
 }
 
+template <int NBytes>
+struct grouped_bytes_impl;
+
+template <>
+struct grouped_bytes_impl<1> {
+  using type = int8_t;
+};
+
+template <>
+struct grouped_bytes_impl<2> {
+  using type = int16_t;
+};
+
+template <>
+struct grouped_bytes_impl<4> {
+  using type = int32_t;
+};
+
+template <>
+struct grouped_bytes_impl<8> {
+  using type = int64_t;
+};
+
+// Map a number of bytes to a type
+template <int NBytes>
+using grouped_bytes_t = typename grouped_bytes_impl<NBytes>::type;
+
+// Like xsimd::zip_lo, but zip groups of NBytes at once
+template <int kNumBytes, typename Batch, std::size_t kBatchSize = Batch::size>
+auto zip_lo_n(Batch const& a, Batch const& b) -> Batch {
+  if constexpr (kNumBytes == kBatchSize) {
+    return a;
+  } else {
+    return xsimd::bitwise_cast<int8_t>(
+        xsimd::zip_lo(xsimd::bitwise_cast<grouped_bytes_t<kNumBytes>>(a),
+                      xsimd::bitwise_cast<grouped_bytes_t<kNumBytes>>(b)));
+  }
+}
+
+// Like xsimd::zip_hi, but zip groups of NBytes at once

Review Comment:
   Same here.



##
cpp/src/arrow/util/byte_stream_split_internal.h:
##
@@ -89,23 +102,72 @@ void ByteStreamSplitDecodeSimd128(const uint8_t* data, int width, int64_t num_va
 }
 for (int j = 0; j < kNumStreams; ++j) {
   xsimd::store_unaligned(
-  reinterpret_cast<int8_t*>(out + (i * kNumStreams + j) * sizeof(simd_batch)),
+  reinterpret_cast<int8_t*>(out + (i * kNumStreams + j) * simd_batch::size),
   stage[kNumStreamsLog2][j]);
 }
   }
 }
 
+template <int NBytes>
+struct grouped_bytes_impl;
+
+template <>
+struct grouped_bytes_impl<1> {
+  using type = int8_t;
+};
+
+template <>
+struct grouped_bytes_impl<2> {
+  using type = int16_t;
+};
+
+template <>
+struct grouped_

Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]

2025-06-23 Thread via GitHub


pitrou commented on PR #46789:
URL: https://github.com/apache/arrow/pull/46789#issuecomment-2995463508

   Local benchmark results on my AMD Ryzen 9 3900X CPU:
   ```
   
--
   Non-regressions: (40)
   
--
 benchmark   baseline  
contender  change % 

  counters
 BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024  4.020 GiB/sec 59.711 
GiB/sec  1385.469 {'family_index': 2, 'per_family_instance_index': 0, 
'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 1463689}
BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536  4.057 GiB/sec 53.022 
GiB/sec  1206.800  {'family_index': 2, 'per_family_instance_index': 1, 
'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 23272}
BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536  4.690 GiB/sec  7.451 
GiB/sec58.859  {'family_index': 7, 'per_family_instance_index': 1, 
'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 27316}
 BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024  4.777 GiB/sec  7.398 
GiB/sec54.878 {'family_index': 7, 'per_family_instance_index': 0, 
'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 1749779}
  BM_ByteStreamSplitEncode_Double_Generic/1024  7.294 GiB/sec  8.597 
GiB/sec17.874   {'family_index': 6, 'per_family_instance_index': 0, 
'run_name': 'BM_ByteStreamSplitEncode_Double_Generic/1024', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 654439}
  BM_ByteStreamSplitDecode_Float_Sse2/1024  7.816 GiB/sec  8.696 
GiB/sec11.247 {'family_index': 14, 'per_family_instance_index': 0, 
'run_name': 'BM_ByteStreamSplitDecode_Float_Sse2/1024', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 1478390}
 BM_ByteStreamSplitEncode_Double_Sse2/1024  7.656 GiB/sec  8.466 
GiB/sec10.575 {'family_index': 17, 'per_family_instance_index': 0, 
'run_name': 'BM_ByteStreamSplitEncode_Double_Sse2/1024', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 685857}
 BM_ByteStreamSplitDecode_Float_Sse2/65536  7.368 GiB/sec  8.091 
GiB/sec 9.813  {'family_index': 14, 'per_family_instance_index': 1, 
'run_name': 'BM_ByteStreamSplitDecode_Float_Sse2/65536', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 21078}
BM_ByteStreamSplitEncode_Double_Sse2/65536  7.303 GiB/sec  7.784 
GiB/sec 6.596 {'family_index': 17, 'per_family_instance_index': 1, 
'run_name': 'BM_ByteStreamSplitEncode_Double_Sse2/65536', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 10413}
 BM_ByteStreamSplitEncode_Double_Generic/65536  7.498 GiB/sec  7.918 
GiB/sec 5.600   {'family_index': 6, 'per_family_instance_index': 1, 
'run_name': 'BM_ByteStreamSplitEncode_Double_Generic/65536', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 10468}
BM_ByteStreamSplitDecode_Double_Sse2/65536  8.489 GiB/sec  8.720 
GiB/sec 2.720 {'family_index': 15, 'per_family_instance_index': 1, 
'run_name': 'BM_ByteStreamSplitDecode_Double_Sse2/65536', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 12255}
 BM_ByteStreamSplitDecode_Double_Sse2/1024  9.153 GiB/sec  9.301 
GiB/sec 1.619 {'family_index': 15, 'per_family_instance_index': 0, 
'run_name': 'BM_ByteStreamSplitDecode_Double_Sse2/1024', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 850920}
BM_ByteStreamSplitEncode_FLBA_Generic<16>/1024  5.002 GiB/sec  5.059 
GiB/sec 1.142 {'family_index': 9, 'per_family_instance_index': 0, 
'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<16>/1024', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 228207}
BM_ByteStreamSplitDecode_Double_Avx2/65536 13.154 GiB/sec 13.270 
GiB/sec 0.886 {'family_index': 19, 'per_family_instance_index': 1, 
'run_name': 'BM_ByteStreamSplitDecode_Double_Avx2/65536', 'repetitions': 1, 
'repetition_index': 0,

Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]

2025-06-20 Thread via GitHub


AntoinePrv commented on PR #46789:
URL: https://github.com/apache/arrow/pull/46789#issuecomment-2992138287

   Updated benchmarks:
   
   Benchmark result on my Macbook Pro M3:
   `archery benchmark diff --suite-filter=parquet-encoding 
--benchmark-filter='ByteStreamSplit'`
   
 
 ```
 
--
   Non-regressions: (44)
   
--
 benchmark   baseline  
contender  change % 

  counters
BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536  6.940 GiB/sec 98.072 
GiB/sec  1313.067  {'family_index': 2, 'per_family_instance_index': 1, 
'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 40082}
 BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024  7.017 GiB/sec 82.283 
GiB/sec  1072.596 {'family_index': 2, 'per_family_instance_index': 0, 
'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 2573189}
BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536  5.736 GiB/sec 15.241 
GiB/sec   165.694  {'family_index': 7, 'per_family_instance_index': 1, 
'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 32985}
 BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024  5.848 GiB/sec 15.384 
GiB/sec   163.085 {'family_index': 7, 'per_family_instance_index': 0, 
'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 2152488}
  BM_ByteStreamSplitEncode_Double_Generic/1024 10.565 GiB/sec 17.702 
GiB/sec67.552   {'family_index': 6, 'per_family_instance_index': 0, 
'run_name': 'BM_ByteStreamSplitEncode_Double_Generic/1024', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 962530}
 BM_ByteStreamSplitEncode_Double_Neon/4096 11.016 GiB/sec 17.869 
GiB/sec62.210 {'family_index': 17, 'per_family_instance_index': 1, 
'run_name': 'BM_ByteStreamSplitEncode_Double_Neon/4096', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 252538}
 BM_ByteStreamSplitEncode_Double_Neon/1024 10.972 GiB/sec 17.720 
GiB/sec61.501 {'family_index': 17, 'per_family_instance_index': 0, 
'run_name': 'BM_ByteStreamSplitEncode_Double_Neon/1024', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 995648}
 BM_ByteStreamSplitEncode_Float_Neon/32768  8.057 GiB/sec 12.437 
GiB/sec54.357  {'family_index': 16, 'per_family_instance_index': 2, 
'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/32768', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 44834}
  BM_ByteStreamSplitEncode_Float_Neon/1024  8.527 GiB/sec 12.769 
GiB/sec49.742 {'family_index': 16, 'per_family_instance_index': 0, 
'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/1024', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 1564274}
  BM_ByteStreamSplitEncode_Float_Neon/4096  8.595 GiB/sec 12.806 
GiB/sec48.991  {'family_index': 16, 'per_family_instance_index': 1, 
'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/4096', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 393601}
   BM_ByteStreamSplitEncode_Float_Generic/1024  8.613 GiB/sec 12.825 
GiB/sec48.904   {'family_index': 5, 'per_family_instance_index': 0, 
'run_name': 'BM_ByteStreamSplitEncode_Float_Generic/1024', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 1600278}
  BM_ByteStreamSplitEncode_Float_Generic/65536  8.368 GiB/sec 12.420 
GiB/sec48.422{'family_index': 5, 'per_family_instance_index': 1, 
'run_name': 'BM_ByteStreamSplitEncode_Float_Generic/65536', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 24025}
 BM_ByteStreamSplitEncode_Float_Neon/65536  8.435 GiB/sec 12.403 
GiB/sec47.045  {'family_index': 16, 'per_family_instance_index': 3, 
'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/65536', 'repetitions': 1, 
'repetition_index': 0, 'threads': 1, 'iterations': 22557}
 BM_ByteStreamSplitDecode_Double_Neon/4096  7.535 GiB/sec 10.042 
GiB/sec33.283  

Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]

2025-06-16 Thread via GitHub


AntoinePrv commented on PR #46789:
URL: https://github.com/apache/arrow/pull/46789#issuecomment-2975908167

   > Now that we have SIMD optimizations for this, can we make sure the 
benchmarks cover the different cases? Scalar and the various SIMD kinds (SSE, 
AVX2, Neon).
   
   I've done so with `int16_t`. Since it's gated behind SIMD macros, it should 
not be a problem.





Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]

2025-06-12 Thread via GitHub


conbench-apache-arrow[bot] commented on PR #46789:
URL: https://github.com/apache/arrow/pull/46789#issuecomment-2967647488

   Thanks for your patience. Conbench analyzed the 4 benchmarking runs that 
have been run so far on PR commit ca35643e5f217f3ad0451bcaec364fc2ddbcfd5b.
   
   There were 72 benchmark results with an error:
   
   - Pull Request Run on `arm64-t4g-2xlarge-linux` at [2025-06-12 
13:10:19Z](https://conbench.ursa.dev/compare/runs/aa62c40c5b184e07af7e606ead73c9fd...faf46eb3dfb94604922ed6f944fa4172/)
 - [`tpch` (R) with engine=arrow, format=parquet, language=R, 
memory_map=False, query_id=TPCH-07, 
scale_factor=1](https://conbench.ursa.dev/benchmark-results/0684ae46c7ca72458000453af26c325c)
 - [`tpch` (R) with engine=arrow, format=native, language=R, 
memory_map=False, query_id=TPCH-09, 
scale_factor=1](https://conbench.ursa.dev/benchmark-results/0684ae5224b6779f800043fbb638a214)
   - and 70 more (see the report linked below)
   
   There were 18 benchmark results indicating a performance regression:
   
   - Pull Request Run on `arm64-t4g-2xlarge-linux` at [2025-06-12 
13:10:19Z](https://conbench.ursa.dev/compare/runs/aa62c40c5b184e07af7e606ead73c9fd...faf46eb3dfb94604922ed6f944fa4172/)
 - [`BM_ByteStreamSplitEncode_Float_Neon` (C++) with params=32768, 
source=cpp-micro, 
suite=parquet-encoding-benchmark](https://conbench.ursa.dev/compare/benchmarks/0684a7a22f0777d7800034bb8d8ee898...0684ad1f810a787b80007c3dc41e3035)
 - [`BM_ByteStreamSplitEncode_Float_Neon` (C++) with params=1024, 
source=cpp-micro, 
suite=parquet-encoding-benchmark](https://conbench.ursa.dev/compare/benchmarks/0684a7a22e897549800066aec3f5ec50...0684ad1f808f780c80001adefe505218)
   - and 16 more (see the report linked below)
   
   The [full Conbench report](https://github.com/apache/arrow/runs/43992507984) 
has more details.





Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]

2025-06-12 Thread via GitHub


AntoinePrv commented on code in PR #46789:
URL: https://github.com/apache/arrow/pull/46789#discussion_r2142507613


##
cpp/src/arrow/util/byte_stream_split_internal.h:
##
@@ -555,7 +603,7 @@ inline void ByteStreamSplitEncode(const uint8_t* raw_values, int width,
   memcpy(out, raw_values, num_values);
   return;
 case 2:
-  return ByteStreamSplitEncodeScalar<2>(raw_values, width, num_values, out);
+  return ByteStreamSplitEncodeSimd128<2>(raw_values, width, num_values, out);

Review Comment:
   Ha yes. I need to duplicate the `Perhaps` function because the AVX2 path 
would fail on `kNumStreams==2`.






Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]

2025-06-12 Thread via GitHub


pitrou commented on PR #46789:
URL: https://github.com/apache/arrow/pull/46789#issuecomment-2965978829

   Here are the benchmark numbers on my local machine (AMD Ryzen 9 3900X, gcc 
14.3):
   ```
   
   ----------------------------------------------------------------------------------------------
   Non-regressions: (42)
   ----------------------------------------------------------------------------------------------
                                         benchmark       baseline       contender   change %   counters
     BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024   3.828 GiB/sec  60.992 GiB/sec   1493.306   {'family_index': 2, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1391248}
    BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536   3.833 GiB/sec  54.779 GiB/sec   1329.148   {'family_index': 2, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 22088}
     BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024   4.412 GiB/sec   7.365 GiB/sec     66.918   {'family_index': 7, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1693608}
    BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536   4.584 GiB/sec   7.482 GiB/sec     63.227   {'family_index': 7, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 26266}
      BM_ByteStreamSplitEncode_Double_Generic/1024   7.398 GiB/sec   8.593 GiB/sec     16.151   {'family_index': 6, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Double_Generic/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 703188}
         BM_ByteStreamSplitEncode_Double_Sse2/1024   7.449 GiB/sec   8.622 GiB/sec     15.743   {'family_index': 17, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Double_Sse2/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 730024}
         BM_ByteStreamSplitEncode_Double_Avx2/1024   7.522 GiB/sec   8.611 GiB/sec     14.487   {'family_index': 21, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Double_Avx2/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 683492}
     BM_ByteStreamSplitEncode_Double_Generic/65536   7.098 GiB/sec   7.898 GiB/sec     11.280   {'family_index': 6, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Double_Generic/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10152}
        BM_ByteStreamSplitEncode_Double_Avx2/65536   7.343 GiB/sec   7.867 GiB/sec      7.130   {'family_index': 21, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Double_Avx2/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10406}
        BM_ByteStreamSplitEncode_Double_Sse2/65536   7.451 GiB/sec   7.867 GiB/sec      5.583   {'family_index': 17, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Double_Sse2/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10614}
   BM_ByteStreamSplitDecode_FLBA_Generic<16>/65536   3.661 GiB/sec   3.858 GiB/sec      5.388   {'family_index': 4, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<16>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2655}
      BM_ByteStreamSplitEncode_Double_Scalar/65536   4.831 GiB/sec   5.087 GiB/sec      5.282   {'family_index': 13, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Double_Scalar/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7018}
          BM_ByteStreamSplitDecode_Float_Sse2/1024   7.210 GiB/sec   7.584 GiB/sec      5.176   {'family_index': 14, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Float_Sse2/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1313805}
       BM_ByteStreamSplitDecode_Double_Scalar/1024   3.866 GiB/sec   4.065 GiB/sec      5.160   {'family_index': 11, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Double_Scalar/1024', 'repetitions

Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]

2025-06-12 Thread via GitHub


ursabot commented on PR #46789:
URL: https://github.com/apache/arrow/pull/46789#issuecomment-2965714618

   Benchmark runs are scheduled for commit 
ca35643e5f217f3ad0451bcaec364fc2ddbcfd5b. Watch 
https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A 
comment will be posted here when the runs are complete.





Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]

2025-06-12 Thread via GitHub


pitrou commented on code in PR #46789:
URL: https://github.com/apache/arrow/pull/46789#discussion_r2142290799


##
cpp/src/arrow/util/byte_stream_split_internal.h:
##
@@ -123,95 +186,80 @@ void ByteStreamSplitEncodeSimd128(const uint8_t* raw_values, int width,
   output_buffer_raw[j * num_values + i] = byte_in_value;
 }
   }
-  // The current shuffling algorithm diverges for float and double types but the compiler
-  // should be able to remove the branch since only one path is taken for each template
-  // instantiation.
-  // Example run for 32-bit variables:
-  // Step 0: copy from unaligned input bytes:
-  //   0: ABCD ABCD ABCD ABCD 1: ABCD ABCD ABCD ABCD ...
-  // Step 1: simd_batch::zip_lo and simd_batch::zip_hi:
-  //   0: AABB CCDD AABB CCDD 1: AABB CCDD AABB CCDD ...
-  // Step 2: apply simd_batch::zip_lo and simd_batch::zip_hi again:
-  //   0: AAAA BBBB CCCC DDDD 1: AAAA BBBB CCCC DDDD ...
-  // Step 3: simd_batch::zip_lo and simd_batch::zip_hi:
-  //   0: AAAA AAAA BBBB BBBB 1: CCCC CCCC DDDD DDDD ...
-  // Step 4: simd_batch::zip_lo and simd_batch::zip_hi:
-  //   0: AAAA AAAA AAAA AAAA 1: BBBB BBBB BBBB BBBB ...
+
+  // Number of input values we can fit in a simd register
+  constexpr int NumValuesInBatch = sizeof(simd_batch) / kNumStreams;

Review Comment:
   Let's call this `kNumValuesInBatch`. Same for others below of course :)



##
cpp/src/arrow/util/byte_stream_split_internal.h:
##
@@ -555,7 +603,7 @@ inline void ByteStreamSplitEncode(const uint8_t* raw_values, int width,
   memcpy(out, raw_values, num_values);
   return;
 case 2:
-  return ByteStreamSplitEncodeScalar<2>(raw_values, width, num_values, out);
+  return ByteStreamSplitEncodeSimd128<2>(raw_values, width, num_values, out);

Review Comment:
   We want this to be conditional, see the `ByteStreamSplitEncodePerhapsSimd` 
macro above.



##
cpp/src/arrow/util/byte_stream_split_internal.h:
##
@@ -95,18 +109,67 @@ void ByteStreamSplitDecodeSimd128(const uint8_t* data, int width, int64_t num_va
   }
 }
 
+template <int NBytes>
+struct grouped_bytes_impl;
+
+template <>
+struct grouped_bytes_impl<1> {
+  using type = int8_t;
+};
+
+template <>
+struct grouped_bytes_impl<2> {
+  using type = int16_t;
+};
+
+template <>
+struct grouped_bytes_impl<4> {
+  using type = int32_t;
+};
+
+template <>
+struct grouped_bytes_impl<8> {
+  using type = int64_t;
+};
+
+// Map a number of bytes to a type
+template <int NBytes>
+using grouped_bytes_t = typename grouped_bytes_impl<NBytes>::type;
+
+// Like xsimd::zip_lo, but zip groups of NBytes at once
+template <int NBytes, typename T = grouped_bytes_t<NBytes>>

Review Comment:
   We preferably use `kCamelCase` for constants, including integral template parameters.
   ```suggestion
   template <int kNBytes, typename T = grouped_bytes_t<kNBytes>>
   ```






Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]

2025-06-12 Thread via GitHub


pitrou commented on PR #46789:
URL: https://github.com/apache/arrow/pull/46789#issuecomment-2966091393

   Now that we have SIMD optimizations for this, can we make sure the 
benchmarks cover the different cases? Scalar and the various SIMD kinds (SSE, 
AVX2, Neon).
   





Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]

2025-06-12 Thread via GitHub


raulcd commented on PR #46789:
URL: https://github.com/apache/arrow/pull/46789#issuecomment-2965714309

   @ursabot please benchmark





Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]

2025-06-12 Thread via GitHub


AntoinePrv commented on PR #46789:
URL: https://github.com/apache/arrow/pull/46789#issuecomment-2965645069

   `archery benchmark diff --suite-filter=parquet-encoding 
--benchmark-filter='ByteStreamSplit'`
   
Show benchmark results
 
 ```
   
   ----------------------------------------------------------------------------------------------
   Non-regressions: (44)
   ----------------------------------------------------------------------------------------------
                                        benchmark        baseline       contender   change %   counters
   BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536   6.967 GiB/sec  96.028 GiB/sec   1278.358   {'family_index': 2, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 40039}
    BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024   7.008 GiB/sec  80.835 GiB/sec   1053.458   {'family_index': 2, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2561954}
   BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536   5.514 GiB/sec  15.146 GiB/sec    174.667   {'family_index': 7, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 31055}
    BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024   5.527 GiB/sec  14.929 GiB/sec    170.123   {'family_index': 7, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2072177}
     BM_ByteStreamSplitEncode_Double_Generic/1024  10.424 GiB/sec  17.349 GiB/sec     66.429   {'family_index': 6, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Double_Generic/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 961842}
        BM_ByteStreamSplitEncode_Double_Neon/4096  10.955 GiB/sec  17.807 GiB/sec     62.539   {'family_index': 17, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Double_Neon/4096', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 252562}
        BM_ByteStreamSplitEncode_Double_Neon/1024  10.984 GiB/sec  17.760 GiB/sec     61.688   {'family_index': 17, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Double_Neon/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 996966}
        BM_ByteStreamSplitEncode_Float_Neon/32768   8.316 GiB/sec  12.530 GiB/sec     50.674   {'family_index': 16, 'per_family_instance_index': 2, 'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/32768', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 48757}
         BM_ByteStreamSplitEncode_Float_Neon/1024   8.517 GiB/sec  12.788 GiB/sec     50.146   {'family_index': 16, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1559934}
         BM_ByteStreamSplitEncode_Float_Neon/4096   8.585 GiB/sec  12.839 GiB/sec     49.545   {'family_index': 16, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/4096', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 393889}
        BM_ByteStreamSplitEncode_Float_Neon/65536   8.379 GiB/sec  12.405 GiB/sec     48.038   {'family_index': 16, 'per_family_instance_index': 3, 'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 24227}
     BM_ByteStreamSplitEncode_Float_Generic/65536   8.378 GiB/sec  12.370 GiB/sec     47.660   {'family_index': 5, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Float_Generic/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 24709}
      BM_ByteStreamSplitEncode_Float_Generic/1024   8.725 GiB/sec  12.785 GiB/sec     46.533   {'family_index': 5, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Float_Generic/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1601790}
    BM_ByteStreamSplitEncode_Double_Generic/65536   8.978 GiB/sec  12.445 GiB/sec     38.616   {'family_index': 6, 'per_family_inst

Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]

2025-06-12 Thread via GitHub


github-actions[bot] commented on PR #46789:
URL: https://github.com/apache/arrow/pull/46789#issuecomment-2965633577

   :warning: GitHub issue #46788 **has been automatically assigned in GitHub** 
to PR creator.

