Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]
pitrou commented on PR #46789: URL: https://github.com/apache/arrow/pull/46789#issuecomment-3003781128 @github-actions crossbow submit -g cpp -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]
github-actions[bot] commented on PR #46789: URL: https://github.com/apache/arrow/pull/46789#issuecomment-3003789770 Revision: 3ea4dd3de5bd4618af53e73fb37139118101c69b Submitted crossbow builds: [ursacomputing/crossbow @ actions-e6dee40784](https://github.com/ursacomputing/crossbow/branches/all?query=actions-e6dee40784) |Task|Status| ||--| |example-cpp-minimal-build-static|[](https://github.com/ursacomputing/crossbow/actions/runs/15870853412/job/44746841415)| |example-cpp-minimal-build-static-system-dependency|[](https://github.com/ursacomputing/crossbow/actions/runs/15870853317/job/44746841177)| |example-cpp-tutorial|[](https://github.com/ursacomputing/crossbow/actions/runs/15870853495/job/44746841683)| |test-build-cpp-fuzz|[](https://github.com/ursacomputing/crossbow/actions/runs/15870853357/job/44746841232)| |test-conda-cpp|[](https://github.com/ursacomputing/crossbow/actions/runs/15870853198/job/44746840869)| |test-conda-cpp-valgrind|[](https://github.com/ursacomputing/crossbow/actions/runs/15870853221/job/44746840882)| |test-cuda-cpp-ubuntu-22.04-cuda-11.7.1|[](https://github.com/ursacomputing/crossbow/actions/runs/15870853881/job/44746842867)| |test-debian-12-cpp-amd64|[](https://github.com/ursacomputing/crossbow/actions/runs/15870853876/job/44746842841)| |test-debian-12-cpp-i386|[](https://github.com/ursacomputing/crossbow/actions/runs/15870853513/job/44746841758)| |test-fedora-39-cpp|[](https://github.com/ursacomputing/crossbow/actions/runs/15870853910/job/44746843062)| |test-ubuntu-22.04-cpp|[](https://github.com/ursacomputing/crossbow/actions/runs/15870853394/job/44746841445)| |test-ubuntu-22.04-cpp-20|[](https://github.com/ursacomputing/crossbow/actions/runs/15870853615/job/44746842025)| |test-ubuntu-22.04-cpp-bundled|[](https://github.com/ursacomputing/crossbow/actions/runs/15870853639/job/44746842079)| |test-ubuntu-22.04-cpp-emscripten|[](https://github.com/ursacomputing/crossbow/actions/runs/15870853688/job/44746842252)| |test-ubuntu-22.04-cpp-no-threading|[](https://github.com/ursacomputing/crossbow/actions/runs/15870853839/job/44746842698)| |test-ubuntu-24.04-cpp|[](https://github.com/ursacomputing/crossbow/actions/runs/15870853231/job/44746840888)| |test-ubuntu-24.04-cpp-bundled-offline|[ {
xsimd::store_unaligned(
- reinterpret_cast(out + (i * kNumStreams + j) *
sizeof(simd_batch)),
+ reinterpret_cast(out + (i * kNumStreams + j) *
simd_batch::size),
stage[kNumStreamsLog2][j]);
}
}
}
+template
Review Comment:
Being able to instantiate an integer type based on templated number of bytes
sounds quite general to me. It is not specific to SIMD.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]
AntoinePrv commented on code in PR #46789:
URL: https://github.com/apache/arrow/pull/46789#discussion_r2161920973
##
cpp/src/arrow/util/byte_stream_split_internal.h:
##
@@ -89,23 +102,72 @@ void ByteStreamSplitDecodeSimd128(const uint8_t* data, int
width, int64_t num_va
}
for (int j = 0; j < kNumStreams; ++j) {
xsimd::store_unaligned(
- reinterpret_cast(out + (i * kNumStreams + j) *
sizeof(simd_batch)),
+ reinterpret_cast(out + (i * kNumStreams + j) *
simd_batch::size),
stage[kNumStreamsLog2][j]);
}
}
}
+template
Review Comment:
Type traits looks much more general detection functions, this seemed pretty
strongly tied to this algorithm. Do you think it would be a better fit there?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]
AntoinePrv commented on code in PR #46789: URL: https://github.com/apache/arrow/pull/46789#discussion_r2161895195 ## cpp/src/arrow/util/byte_stream_split_internal.h: ## @@ -123,95 +185,79 @@ void ByteStreamSplitEncodeSimd128(const uint8_t* raw_values, int width, output_buffer_raw[j * num_values + i] = byte_in_value; } } - // The current shuffling algorithm diverges for float and double types but the compiler - // should be able to remove the branch since only one path is taken for each template - // instantiation. - // Example run for 32-bit variables: - // Step 0: copy from unaligned input bytes: - // 0: ABCD ABCD ABCD ABCD 1: ABCD ABCD ABCD ABCD ... - // Step 1: simd_batch::zip_lo and simd_batch::zip_hi: - // 0: AABB CCDD AABB CCDD 1: AABB CCDD AABB CCDD ... - // Step 2: apply simd_batch::zip_lo and simd_batch::zip_hi again: - // 0: 1: ... - // Step 3: simd_batch::zip_lo and simd_batch::zip_hi: - // 0: 1: ... - // Step 4: simd_batch::zip_lo and simd_batch::zip_hi: - // 0: 1: ... + + // Number of input values we can fit in a simd register + constexpr int kNumValuesInBatch = simd_batch::size / kNumStreams; + static_assert(kNumValuesInBatch > 0); + // Number of bytes we'll bring together in the first byte-level part of the algorithm. + // Since we zip with the next batch, the number of values in a batch determines how many + // bytes end up together before we can use a larger type + constexpr int kNumBytes = 2 * kNumValuesInBatch; + // Number of steps in the first part of the algorithm with byte-level zipping + constexpr int kNumStepsByte = ReversePow2(kNumValuesInBatch) + 1; + // Number of steps in the first part of the algorithm with large data type zipping + constexpr int kNumStepsLarge = + ReversePow2(static_cast(simd_batch::size) / kNumBytes); + // Total number of steps + constexpr int kNumSteps = kNumStepsByte + kNumStepsLarge; Review Comment: Correct, I did not see it that way. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]
AntoinePrv commented on code in PR #46789:
URL: https://github.com/apache/arrow/pull/46789#discussion_r2161887714
##
cpp/src/arrow/util/byte_stream_split_internal.h:
##
@@ -89,23 +102,72 @@ void ByteStreamSplitDecodeSimd128(const uint8_t* data, int
width, int64_t num_va
}
for (int j = 0; j < kNumStreams; ++j) {
xsimd::store_unaligned(
- reinterpret_cast(out + (i * kNumStreams + j) *
sizeof(simd_batch)),
+ reinterpret_cast(out + (i * kNumStreams + j) *
simd_batch::size),
stage[kNumStreamsLog2][j]);
}
}
}
+template
+struct grouped_bytes_impl;
+
+template <>
+struct grouped_bytes_impl<1> {
+ using type = int8_t;
+};
+
+template <>
+struct grouped_bytes_impl<2> {
+ using type = int16_t;
+};
+
+template <>
+struct grouped_bytes_impl<4> {
+ using type = int32_t;
+};
+
+template <>
+struct grouped_bytes_impl<8> {
+ using type = int64_t;
+};
+
+// Map a number of bytes to a type
+template
+using grouped_bytes_t = typename grouped_bytes_impl::type;
+
+// Like xsimd::zlip_lo, but zip groups of NBytes at once
+template >
+auto zip_lo_n(Batch const& a, Batch const& b) -> Batch {
+ if constexpr (kNumBytes == kBatchSize) {
+return a;
+ } else {
+return xsimd::bitwise_cast(
+xsimd::zip_lo(xsimd::bitwise_cast>(a),
+ xsimd::bitwise_cast>(b)));
+ }
+}
+
+// Like xsimd::zlip_hi, but zip groups of NBytes at once
+template >
+auto zip_hi_n(Batch const& a, Batch const& b) -> Batch {
+ if constexpr (kNumBytes == kBatchSize) {
+return b;
+ } else {
+return xsimd::bitwise_cast(
+xsimd::zip_hi(xsimd::bitwise_cast>(a),
+ xsimd::bitwise_cast>(b)));
+ }
+}
+
template
void ByteStreamSplitEncodeSimd128(const uint8_t* raw_values, int width,
const int64_t num_values, uint8_t*
output_buffer_raw) {
using simd_batch = xsimd::make_sized_batch_t;
assert(width == kNumStreams);
- static_assert(kNumStreams == 4 || kNumStreams == 8, "Invalid number of
streams.");
- constexpr int kBlockSize = sizeof(simd_batch) * kNumStreams;
-
- simd_batch stage[3][kNumStreams];
- simd_batch final_result[kNumStreams];
+ constexpr int kBlockSize = simd_batch::size * kNumStreams;
Review Comment:
This specific constant was already here. Here `sizeof` was replaced with
`::size`.
As for the function in general:
- it is independent of the *number of streams* (under condition that
kNumStreams <= simd_batch::size and kNumStream a power of two, for which I'm
adding a check.)
- it is not far from being independent of the SIMD batch size. I had plan
the final changes for another PR but I guess I could include them here if you
prefer.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]
AntoinePrv commented on code in PR #46789:
URL: https://github.com/apache/arrow/pull/46789#discussion_r2161887714
##
cpp/src/arrow/util/byte_stream_split_internal.h:
##
@@ -89,23 +102,72 @@ void ByteStreamSplitDecodeSimd128(const uint8_t* data, int
width, int64_t num_va
}
for (int j = 0; j < kNumStreams; ++j) {
xsimd::store_unaligned(
- reinterpret_cast(out + (i * kNumStreams + j) *
sizeof(simd_batch)),
+ reinterpret_cast(out + (i * kNumStreams + j) *
simd_batch::size),
stage[kNumStreamsLog2][j]);
}
}
}
+template
+struct grouped_bytes_impl;
+
+template <>
+struct grouped_bytes_impl<1> {
+ using type = int8_t;
+};
+
+template <>
+struct grouped_bytes_impl<2> {
+ using type = int16_t;
+};
+
+template <>
+struct grouped_bytes_impl<4> {
+ using type = int32_t;
+};
+
+template <>
+struct grouped_bytes_impl<8> {
+ using type = int64_t;
+};
+
+// Map a number of bytes to a type
+template
+using grouped_bytes_t = typename grouped_bytes_impl::type;
+
+// Like xsimd::zlip_lo, but zip groups of NBytes at once
+template >
+auto zip_lo_n(Batch const& a, Batch const& b) -> Batch {
+ if constexpr (kNumBytes == kBatchSize) {
+return a;
+ } else {
+return xsimd::bitwise_cast(
+xsimd::zip_lo(xsimd::bitwise_cast>(a),
+ xsimd::bitwise_cast>(b)));
+ }
+}
+
+// Like xsimd::zlip_hi, but zip groups of NBytes at once
+template >
+auto zip_hi_n(Batch const& a, Batch const& b) -> Batch {
+ if constexpr (kNumBytes == kBatchSize) {
+return b;
+ } else {
+return xsimd::bitwise_cast(
+xsimd::zip_hi(xsimd::bitwise_cast>(a),
+ xsimd::bitwise_cast>(b)));
+ }
+}
+
template
void ByteStreamSplitEncodeSimd128(const uint8_t* raw_values, int width,
const int64_t num_values, uint8_t*
output_buffer_raw) {
using simd_batch = xsimd::make_sized_batch_t;
assert(width == kNumStreams);
- static_assert(kNumStreams == 4 || kNumStreams == 8, "Invalid number of
streams.");
- constexpr int kBlockSize = sizeof(simd_batch) * kNumStreams;
-
- simd_batch stage[3][kNumStreams];
- simd_batch final_result[kNumStreams];
+ constexpr int kBlockSize = simd_batch::size * kNumStreams;
Review Comment:
This specific constant was already here. Here `sizeof` was replaced with
`::size`.
As for the function in general:
- it is independent of the *number of streams* (under condition that
kNumStreams <= simd_batch::size and kNumStream a power of two, for which I'm
adding a check.)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]
pitrou commented on PR #46789: URL: https://github.com/apache/arrow/pull/46789#issuecomment-2995582815 Weirdly, this fails compiling here: https://github.com/apache/arrow/actions/runs/15761197292/job/44584131273?pr=46789#step:7:1727 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]
pitrou commented on PR #46789: URL: https://github.com/apache/arrow/pull/46789#issuecomment-2995593279 Ah, also, can you update the tests to exercise the 2-byte BSS SIMD specializations? https://github.com/apache/arrow/blob/dacec3080f8a7c245dc884d278ea6280942086f0/cpp/src/arrow/util/byte_stream_split_test.cc#L139-L172 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]
pitrou commented on code in PR #46789:
URL: https://github.com/apache/arrow/pull/46789#discussion_r216106
##
cpp/src/arrow/util/byte_stream_split_internal.h:
##
@@ -123,95 +185,79 @@ void ByteStreamSplitEncodeSimd128(const uint8_t*
raw_values, int width,
output_buffer_raw[j * num_values + i] = byte_in_value;
}
}
- // The current shuffling algorithm diverges for float and double types but
the compiler
- // should be able to remove the branch since only one path is taken for each
template
- // instantiation.
- // Example run for 32-bit variables:
- // Step 0: copy from unaligned input bytes:
- // 0: ABCD ABCD ABCD ABCD 1: ABCD ABCD ABCD ABCD ...
- // Step 1: simd_batch::zip_lo and simd_batch::zip_hi:
- // 0: AABB CCDD AABB CCDD 1: AABB CCDD AABB CCDD ...
- // Step 2: apply simd_batch::zip_lo and simd_batch::zip_hi again:
- // 0: 1: ...
- // Step 3: simd_batch::zip_lo and simd_batch::zip_hi:
- // 0: 1: ...
- // Step 4: simd_batch::zip_lo and simd_batch::zip_hi:
- // 0: 1: ...
+
+ // Number of input values we can fit in a simd register
+ constexpr int kNumValuesInBatch = simd_batch::size / kNumStreams;
+ static_assert(kNumValuesInBatch > 0);
+ // Number of bytes we'll bring together in the first byte-level part of the
algorithm.
+ // Since we zip with the next batch, the number of values in a batch
determines how many
+ // bytes end up together before we can use a larger type
+ constexpr int kNumBytes = 2 * kNumValuesInBatch;
+ // Number of steps in the first part of the algorithm with byte-level zipping
+ constexpr int kNumStepsByte = ReversePow2(kNumValuesInBatch) + 1;
+ // Number of steps in the first part of the algorithm with large data type
zipping
+ constexpr int kNumStepsLarge =
+ ReversePow2(static_cast(simd_batch::size) / kNumBytes);
+ // Total number of steps
+ constexpr int kNumSteps = kNumStepsByte + kNumStepsLarge;
Review Comment:
So this is `ReversePow2(kNumValuesInBatch) + 1 +
ReversePow2(simd_batch::size / (2 * kNumValuesInBatch))`, so in the end it's
`ReversePow2(simd_batch::size)`, right?
If so, perhaps add a `static_assert` for better understanding of the
algorithm's complexity.
##
cpp/src/arrow/util/byte_stream_split_internal.h:
##
@@ -89,23 +102,72 @@ void ByteStreamSplitDecodeSimd128(const uint8_t* data, int
width, int64_t num_va
}
for (int j = 0; j < kNumStreams; ++j) {
xsimd::store_unaligned(
- reinterpret_cast(out + (i * kNumStreams + j) *
sizeof(simd_batch)),
+ reinterpret_cast(out + (i * kNumStreams + j) *
simd_batch::size),
stage[kNumStreamsLog2][j]);
}
}
}
+template
Review Comment:
This could be given a more general name and go into
`arrow/util/type_traits.h` perhaps.
##
cpp/src/arrow/util/byte_stream_split_internal.h:
##
@@ -89,23 +102,72 @@ void ByteStreamSplitDecodeSimd128(const uint8_t* data, int
width, int64_t num_va
}
for (int j = 0; j < kNumStreams; ++j) {
xsimd::store_unaligned(
- reinterpret_cast(out + (i * kNumStreams + j) *
sizeof(simd_batch)),
+ reinterpret_cast(out + (i * kNumStreams + j) *
simd_batch::size),
stage[kNumStreamsLog2][j]);
}
}
}
+template
+struct grouped_bytes_impl;
+
+template <>
+struct grouped_bytes_impl<1> {
+ using type = int8_t;
+};
+
+template <>
+struct grouped_bytes_impl<2> {
+ using type = int16_t;
+};
+
+template <>
+struct grouped_bytes_impl<4> {
+ using type = int32_t;
+};
+
+template <>
+struct grouped_bytes_impl<8> {
+ using type = int64_t;
+};
+
+// Map a number of bytes to a type
+template
+using grouped_bytes_t = typename grouped_bytes_impl::type;
+
+// Like xsimd::zlip_lo, but zip groups of NBytes at once
+template >
+auto zip_lo_n(Batch const& a, Batch const& b) -> Batch {
+ if constexpr (kNumBytes == kBatchSize) {
+return a;
+ } else {
+return xsimd::bitwise_cast(
+xsimd::zip_lo(xsimd::bitwise_cast>(a),
+ xsimd::bitwise_cast>(b)));
+ }
+}
+
+// Like xsimd::zlip_hi, but zip groups of NBytes at once
Review Comment:
Same here.
##
cpp/src/arrow/util/byte_stream_split_internal.h:
##
@@ -89,23 +102,72 @@ void ByteStreamSplitDecodeSimd128(const uint8_t* data, int
width, int64_t num_va
}
for (int j = 0; j < kNumStreams; ++j) {
xsimd::store_unaligned(
- reinterpret_cast(out + (i * kNumStreams + j) *
sizeof(simd_batch)),
+ reinterpret_cast(out + (i * kNumStreams + j) *
simd_batch::size),
stage[kNumStreamsLog2][j]);
}
}
}
+template
+struct grouped_bytes_impl;
+
+template <>
+struct grouped_bytes_impl<1> {
+ using type = int8_t;
+};
+
+template <>
+struct grouped_bytes_impl<2> {
+ using type = int16_t;
+};
+
+template <>
+struct grouped_
Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]
pitrou commented on PR #46789:
URL: https://github.com/apache/arrow/pull/46789#issuecomment-2995463508
Local benchmark results on my AMD Ryzen 9 3900X CPU:
```
--
Non-regressions: (40)
--
benchmark baseline
contender change %
counters
BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024 4.020 GiB/sec 59.711
GiB/sec 1385.469 {'family_index': 2, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 1463689}
BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536 4.057 GiB/sec 53.022
GiB/sec 1206.800 {'family_index': 2, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 23272}
BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536 4.690 GiB/sec 7.451
GiB/sec58.859 {'family_index': 7, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 27316}
BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024 4.777 GiB/sec 7.398
GiB/sec54.878 {'family_index': 7, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 1749779}
BM_ByteStreamSplitEncode_Double_Generic/1024 7.294 GiB/sec 8.597
GiB/sec17.874 {'family_index': 6, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitEncode_Double_Generic/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 654439}
BM_ByteStreamSplitDecode_Float_Sse2/1024 7.816 GiB/sec 8.696
GiB/sec11.247 {'family_index': 14, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitDecode_Float_Sse2/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 1478390}
BM_ByteStreamSplitEncode_Double_Sse2/1024 7.656 GiB/sec 8.466
GiB/sec10.575 {'family_index': 17, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitEncode_Double_Sse2/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 685857}
BM_ByteStreamSplitDecode_Float_Sse2/65536 7.368 GiB/sec 8.091
GiB/sec 9.813 {'family_index': 14, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitDecode_Float_Sse2/65536', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 21078}
BM_ByteStreamSplitEncode_Double_Sse2/65536 7.303 GiB/sec 7.784
GiB/sec 6.596 {'family_index': 17, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitEncode_Double_Sse2/65536', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 10413}
BM_ByteStreamSplitEncode_Double_Generic/65536 7.498 GiB/sec 7.918
GiB/sec 5.600 {'family_index': 6, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitEncode_Double_Generic/65536', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 10468}
BM_ByteStreamSplitDecode_Double_Sse2/65536 8.489 GiB/sec 8.720
GiB/sec 2.720 {'family_index': 15, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitDecode_Double_Sse2/65536', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 12255}
BM_ByteStreamSplitDecode_Double_Sse2/1024 9.153 GiB/sec 9.301
GiB/sec 1.619 {'family_index': 15, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitDecode_Double_Sse2/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 850920}
BM_ByteStreamSplitEncode_FLBA_Generic<16>/1024 5.002 GiB/sec 5.059
GiB/sec 1.142 {'family_index': 9, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<16>/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 228207}
BM_ByteStreamSplitDecode_Double_Avx2/65536 13.154 GiB/sec 13.270
GiB/sec 0.886 {'family_index': 19, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitDecode_Double_Avx2/65536', 'repetitions': 1,
'repetition_index': 0,
Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]
AntoinePrv commented on PR #46789:
URL: https://github.com/apache/arrow/pull/46789#issuecomment-2992138287
Updated benchmarks:
Benchmark result on my Macbook Pro M3:
`archery benchmark diff --suite-filter=parquet-encoding
--benchmark-filter='ByteStreamSplit'`
Show benchmark results
```
--
Non-regressions: (44)
--
benchmark baseline
contender change %
counters
BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536 6.940 GiB/sec 98.072
GiB/sec 1313.067 {'family_index': 2, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 40082}
BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024 7.017 GiB/sec 82.283
GiB/sec 1072.596 {'family_index': 2, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 2573189}
BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536 5.736 GiB/sec 15.241
GiB/sec 165.694 {'family_index': 7, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 32985}
BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024 5.848 GiB/sec 15.384
GiB/sec 163.085 {'family_index': 7, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 2152488}
BM_ByteStreamSplitEncode_Double_Generic/1024 10.565 GiB/sec 17.702
GiB/sec67.552 {'family_index': 6, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitEncode_Double_Generic/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 962530}
BM_ByteStreamSplitEncode_Double_Neon/4096 11.016 GiB/sec 17.869
GiB/sec62.210 {'family_index': 17, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitEncode_Double_Neon/4096', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 252538}
BM_ByteStreamSplitEncode_Double_Neon/1024 10.972 GiB/sec 17.720
GiB/sec61.501 {'family_index': 17, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitEncode_Double_Neon/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 995648}
BM_ByteStreamSplitEncode_Float_Neon/32768 8.057 GiB/sec 12.437
GiB/sec54.357 {'family_index': 16, 'per_family_instance_index': 2,
'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/32768', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 44834}
BM_ByteStreamSplitEncode_Float_Neon/1024 8.527 GiB/sec 12.769
GiB/sec49.742 {'family_index': 16, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 1564274}
BM_ByteStreamSplitEncode_Float_Neon/4096 8.595 GiB/sec 12.806
GiB/sec48.991 {'family_index': 16, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/4096', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 393601}
BM_ByteStreamSplitEncode_Float_Generic/1024 8.613 GiB/sec 12.825
GiB/sec48.904 {'family_index': 5, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitEncode_Float_Generic/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 1600278}
BM_ByteStreamSplitEncode_Float_Generic/65536 8.368 GiB/sec 12.420
GiB/sec48.422{'family_index': 5, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitEncode_Float_Generic/65536', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 24025}
BM_ByteStreamSplitEncode_Float_Neon/65536 8.435 GiB/sec 12.403
GiB/sec47.045 {'family_index': 16, 'per_family_instance_index': 3,
'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/65536', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 22557}
BM_ByteStreamSplitDecode_Double_Neon/4096 7.535 GiB/sec 10.042
GiB/sec33.283
Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]
AntoinePrv commented on PR #46789: URL: https://github.com/apache/arrow/pull/46789#issuecomment-2975908167 > Now that we have SIMD optimizations for this, can we make sure the benchmarks cover the different cases? Scalar and the various SIMD kinds (SSE, AVX2, Neon). I've done so with `int16_t`. Since it's gated behind SIMD macros, it should not be a problem. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]
conbench-apache-arrow[bot] commented on PR #46789: URL: https://github.com/apache/arrow/pull/46789#issuecomment-2967647488 Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit ca35643e5f217f3ad0451bcaec364fc2ddbcfd5b. There were 72 benchmark results with an error: - Pull Request Run on `arm64-t4g-2xlarge-linux` at [2025-06-12 13:10:19Z](https://conbench.ursa.dev/compare/runs/aa62c40c5b184e07af7e606ead73c9fd...faf46eb3dfb94604922ed6f944fa4172/) - [`tpch` (R) with engine=arrow, format=parquet, language=R, memory_map=False, query_id=TPCH-07, scale_factor=1](https://conbench.ursa.dev/benchmark-results/0684ae46c7ca72458000453af26c325c) - [`tpch` (R) with engine=arrow, format=native, language=R, memory_map=False, query_id=TPCH-09, scale_factor=1](https://conbench.ursa.dev/benchmark-results/0684ae5224b6779f800043fbb638a214) - and 70 more (see the report linked below) There were 18 benchmark results indicating a performance regression: - Pull Request Run on `arm64-t4g-2xlarge-linux` at [2025-06-12 13:10:19Z](https://conbench.ursa.dev/compare/runs/aa62c40c5b184e07af7e606ead73c9fd...faf46eb3dfb94604922ed6f944fa4172/) - [`BM_ByteStreamSplitEncode_Float_Neon` (C++) with params=32768, source=cpp-micro, suite=parquet-encoding-benchmark](https://conbench.ursa.dev/compare/benchmarks/0684a7a22f0777d7800034bb8d8ee898...0684ad1f810a787b80007c3dc41e3035) - [`BM_ByteStreamSplitEncode_Float_Neon` (C++) with params=1024, source=cpp-micro, suite=parquet-encoding-benchmark](https://conbench.ursa.dev/compare/benchmarks/0684a7a22e897549800066aec3f5ec50...0684ad1f808f780c80001adefe505218) - and 16 more (see the report linked below) The [full Conbench report](https://github.com/apache/arrow/runs/43992507984) has more details. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]
AntoinePrv commented on code in PR #46789: URL: https://github.com/apache/arrow/pull/46789#discussion_r2142507613 ## cpp/src/arrow/util/byte_stream_split_internal.h: ## @@ -555,7 +603,7 @@ inline void ByteStreamSplitEncode(const uint8_t* raw_values, int width, memcpy(out, raw_values, num_values); return; case 2: - return ByteStreamSplitEncodeScalar<2>(raw_values, width, num_values, out); + return ByteStreamSplitEncodeSimd128<2>(raw_values, width, num_values, out); Review Comment: Ha yes. I need to duplicate the `Perhaps` function because the AVX2 path would fail on `kNumStreams==2`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]
pitrou commented on PR #46789:
URL: https://github.com/apache/arrow/pull/46789#issuecomment-2965978829
Here are the benchmark numbers on my local machine (AMD Ryzen 9 3900X, gcc
14.3):
```
--
Non-regressions: (42)
--
benchmark baseline
contender change %
counters
BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024 3.828 GiB/sec 60.992
GiB/sec 1493.306 {'family_index': 2, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 1391248}
BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536 3.833 GiB/sec 54.779
GiB/sec 1329.148 {'family_index': 2, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 22088}
BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024 4.412 GiB/sec 7.365
GiB/sec66.918 {'family_index': 7, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 1693608}
BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536 4.584 GiB/sec 7.482
GiB/sec63.227 {'family_index': 7, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 26266}
BM_ByteStreamSplitEncode_Double_Generic/1024 7.398 GiB/sec 8.593
GiB/sec16.151 {'family_index': 6, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitEncode_Double_Generic/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 703188}
BM_ByteStreamSplitEncode_Double_Sse2/1024 7.449 GiB/sec 8.622
GiB/sec15.743 {'family_index': 17, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitEncode_Double_Sse2/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 730024}
BM_ByteStreamSplitEncode_Double_Avx2/1024 7.522 GiB/sec 8.611
GiB/sec14.487 {'family_index': 21, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitEncode_Double_Avx2/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 683492}
BM_ByteStreamSplitEncode_Double_Generic/65536 7.098 GiB/sec 7.898
GiB/sec11.280 {'family_index': 6, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitEncode_Double_Generic/65536', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 10152}
BM_ByteStreamSplitEncode_Double_Avx2/65536 7.343 GiB/sec 7.867
GiB/sec 7.130 {'family_index': 21, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitEncode_Double_Avx2/65536', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 10406}
BM_ByteStreamSplitEncode_Double_Sse2/65536 7.451 GiB/sec 7.867
GiB/sec 5.583 {'family_index': 17, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitEncode_Double_Sse2/65536', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 10614}
BM_ByteStreamSplitDecode_FLBA_Generic<16>/65536 3.661 GiB/sec 3.858
GiB/sec 5.388 {'family_index': 4, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<16>/65536', 'repetitions':
1, 'repetition_index': 0, 'threads': 1, 'iterations': 2655}
BM_ByteStreamSplitEncode_Double_Scalar/65536 4.831 GiB/sec 5.087
GiB/sec 5.282{'family_index': 13, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitEncode_Double_Scalar/65536', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 7018}
BM_ByteStreamSplitDecode_Float_Sse2/1024 7.210 GiB/sec 7.584
GiB/sec 5.176 {'family_index': 14, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitDecode_Float_Sse2/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 1313805}
BM_ByteStreamSplitDecode_Double_Scalar/1024 3.866 GiB/sec 4.065
GiB/sec 5.160 {'family_index': 11, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitDecode_Double_Scalar/1024', 'repetitions
Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]
ursabot commented on PR #46789: URL: https://github.com/apache/arrow/pull/46789#issuecomment-2965714618 Benchmark runs are scheduled for commit ca35643e5f217f3ad0451bcaec364fc2ddbcfd5b. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]
pitrou commented on code in PR #46789:
URL: https://github.com/apache/arrow/pull/46789#discussion_r2142290799
##
cpp/src/arrow/util/byte_stream_split_internal.h:
##
@@ -123,95 +186,80 @@ void ByteStreamSplitEncodeSimd128(const uint8_t*
raw_values, int width,
output_buffer_raw[j * num_values + i] = byte_in_value;
}
}
- // The current shuffling algorithm diverges for float and double types but
the compiler
- // should be able to remove the branch since only one path is taken for each
template
- // instantiation.
- // Example run for 32-bit variables:
- // Step 0: copy from unaligned input bytes:
- // 0: ABCD ABCD ABCD ABCD 1: ABCD ABCD ABCD ABCD ...
- // Step 1: simd_batch::zip_lo and simd_batch::zip_hi:
- // 0: AABB CCDD AABB CCDD 1: AABB CCDD AABB CCDD ...
- // Step 2: apply simd_batch::zip_lo and simd_batch::zip_hi again:
- // 0: 1: ...
- // Step 3: simd_batch::zip_lo and simd_batch::zip_hi:
- // 0: 1: ...
- // Step 4: simd_batch::zip_lo and simd_batch::zip_hi:
- // 0: 1: ...
+
+ // Number of input values we can fit in a simd register
+ constexpr int NumValuesInBatch = sizeof(simd_batch) / kNumStreams;
Review Comment:
Let's call this `kNumValuesInBatch`. Same for others below of course :)
##
cpp/src/arrow/util/byte_stream_split_internal.h:
##
@@ -555,7 +603,7 @@ inline void ByteStreamSplitEncode(const uint8_t*
raw_values, int width,
memcpy(out, raw_values, num_values);
return;
case 2:
- return ByteStreamSplitEncodeScalar<2>(raw_values, width, num_values,
out);
+ return ByteStreamSplitEncodeSimd128<2>(raw_values, width, num_values,
out);
Review Comment:
We want this to be conditional, see the `ByteStreamSplitEncodePerhapsSimd`
macro above.
##
cpp/src/arrow/util/byte_stream_split_internal.h:
##
@@ -95,18 +109,67 @@ void ByteStreamSplitDecodeSimd128(const uint8_t* data, int
width, int64_t num_va
}
}
+template
+struct grouped_bytes_impl;
+
+template <>
+struct grouped_bytes_impl<1> {
+ using type = int8_t;
+};
+
+template <>
+struct grouped_bytes_impl<2> {
+ using type = int16_t;
+};
+
+template <>
+struct grouped_bytes_impl<4> {
+ using type = int32_t;
+};
+
+template <>
+struct grouped_bytes_impl<8> {
+ using type = int64_t;
+};
+
+// Map a number of bytes to a type
+template
+using grouped_bytes_t = typename grouped_bytes_impl::type;
+
+// Like xsimd::zlip_lo, but zip groups of NBytes at once
+template >
Review Comment:
We preferably use `kCamelCase` for constants, including integral template
parameters.
```suggestion
template >
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]
pitrou commented on PR #46789: URL: https://github.com/apache/arrow/pull/46789#issuecomment-2966091393 Now that we have SIMD optimizations for this, can we make sure the benchmarks cover the different cases? Scalar and the various SIMD kinds (SSE, AVX2, Neon). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]
raulcd commented on PR #46789: URL: https://github.com/apache/arrow/pull/46789#issuecomment-2965714309 @ursabot please benchmark -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]
AntoinePrv commented on PR #46789:
URL: https://github.com/apache/arrow/pull/46789#issuecomment-2965645069
`archery benchmark diff --suite-filter=parquet-encoding
--benchmark-filter='ByteStreamSplit'`
Show benchmark results
```
--
Non-regressions: (44)
--
benchmark baseline
contender change %
counters
BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536 6.967 GiB/sec 96.028
GiB/sec 1278.358 {'family_index': 2, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 40039}
BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024 7.008 GiB/sec 80.835
GiB/sec 1053.458 {'family_index': 2, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 2561954}
BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536 5.514 GiB/sec 15.146
GiB/sec 174.667 {'family_index': 7, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 31055}
BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024 5.527 GiB/sec 14.929
GiB/sec 170.123 {'family_index': 7, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 2072177}
BM_ByteStreamSplitEncode_Double_Generic/1024 10.424 GiB/sec 17.349
GiB/sec66.429 {'family_index': 6, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitEncode_Double_Generic/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 961842}
BM_ByteStreamSplitEncode_Double_Neon/4096 10.955 GiB/sec 17.807
GiB/sec62.539 {'family_index': 17, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitEncode_Double_Neon/4096', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 252562}
BM_ByteStreamSplitEncode_Double_Neon/1024 10.984 GiB/sec 17.760
GiB/sec61.688 {'family_index': 17, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitEncode_Double_Neon/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 996966}
BM_ByteStreamSplitEncode_Float_Neon/32768 8.316 GiB/sec 12.530
GiB/sec50.674 {'family_index': 16, 'per_family_instance_index': 2,
'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/32768', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 48757}
BM_ByteStreamSplitEncode_Float_Neon/1024 8.517 GiB/sec 12.788
GiB/sec50.146 {'family_index': 16, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 1559934}
BM_ByteStreamSplitEncode_Float_Neon/4096 8.585 GiB/sec 12.839
GiB/sec49.545 {'family_index': 16, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/4096', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 393889}
BM_ByteStreamSplitEncode_Float_Neon/65536 8.379 GiB/sec 12.405
GiB/sec48.038 {'family_index': 16, 'per_family_instance_index': 3,
'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/65536', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 24227}
BM_ByteStreamSplitEncode_Float_Generic/65536 8.378 GiB/sec 12.370
GiB/sec47.660{'family_index': 5, 'per_family_instance_index': 1,
'run_name': 'BM_ByteStreamSplitEncode_Float_Generic/65536', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 24709}
BM_ByteStreamSplitEncode_Float_Generic/1024 8.725 GiB/sec 12.785
GiB/sec46.533 {'family_index': 5, 'per_family_instance_index': 0,
'run_name': 'BM_ByteStreamSplitEncode_Float_Generic/1024', 'repetitions': 1,
'repetition_index': 0, 'threads': 1, 'iterations': 1601790}
BM_ByteStreamSplitEncode_Double_Generic/65536 8.978 GiB/sec 12.445
GiB/sec38.616 {'family_index': 6, 'per_family_inst
Re: [PR] GH-46788: [C++] Enable SIMD for byte stream split with 2 streams [arrow]
github-actions[bot] commented on PR #46789: URL: https://github.com/apache/arrow/pull/46789#issuecomment-2965633577 :warning: GitHub issue #46788 **has been automatically assigned in GitHub** to PR creator. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
