Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]
conbench-apache-arrow[bot] commented on PR #48309: URL: https://github.com/apache/arrow/pull/48309#issuecomment-3667979737 After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit ea6dc7beb8b68d4b1d607694f12fdfb8013442d1. There were no benchmark performance regressions. 🎉 The [full Conbench report](https://github.com/apache/arrow/runs/58384263530) has more details. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]
pitrou commented on code in PR #48309:
URL: https://github.com/apache/arrow/pull/48309#discussion_r2584276641
##
cpp/src/parquet/decoder.cc:
##
@@ -754,43 +760,47 @@ class PlainByteArrayDecoder : public
PlainDecoder {
int64_t valid_bits_offset,
typename EncodingTraits::Accumulator*
out,
int* out_values_decoded) {
-// We're going to decode up to `num_values - null_count` PLAIN values,
+// We're going to decode `num_values - null_count` PLAIN values,
// and each value has a 4-byte length header that doesn't count for the
// Arrow binary data length.
-int64_t estimated_data_length =
-std::max(0, len_ - 4 * (num_values - null_count));
+int64_t estimated_data_length = len_ - 4 * (num_values - null_count);
+if (ARROW_PREDICT_FALSE(estimated_data_length < 0)) {
+ return Status::Invalid("Invalid or truncated PLAIN-encoded BYTE_ARRAY
data");
+}
auto visit_binary_helper = [&](auto* helper) {
int values_decoded = 0;
- RETURN_NOT_OK(VisitBitRuns(
- valid_bits, valid_bits_offset, num_values,
- [&](int64_t position, int64_t run_length, bool is_valid) {
-if (is_valid) {
- for (int64_t i = 0; i < run_length; ++i) {
-if (ARROW_PREDICT_FALSE(len_ < 4)) {
- return Status::Invalid(
- "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-auto value_len = SafeLoadAs(data_);
-if (ARROW_PREDICT_FALSE(value_len < 0 || value_len > len_ -
4)) {
- return Status::Invalid(
- "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-RETURN_NOT_OK(
-helper->AppendValue(data_ + 4, value_len,
estimated_data_length));
-auto increment = value_len + 4;
-data_ += increment;
-len_ -= increment;
-estimated_data_length -= value_len;
-DCHECK_GE(estimated_data_length, 0);
- }
- values_decoded += static_cast(run_length);
- return Status::OK();
-} else {
- return helper->AppendNulls(run_length);
+ auto visit_run = [&](int64_t position, int64_t run_length, bool
is_valid) {
+if (is_valid) {
+ for (int64_t i = 0; i < run_length; ++i) {
+// We ensure `len_` is sufficient thanks to:
+// 1. the initial `estimated_data_length` check above,
+// 2. the running `value_len > estimated_data_length` check
below.
+// This precondition follows from those two checks.
+DCHECK_GE(len_, 4);
+auto value_len = SafeLoadAs(data_);
+if (ARROW_PREDICT_FALSE(value_len < 0 || value_len >
estimated_data_length)) {
Review Comment:
Added.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]
github-actions[bot] commented on PR #48309: URL: https://github.com/apache/arrow/pull/48309#issuecomment-3605685122 Revision: 65e00521312994d515d6063a71fb20dfcc291978 Submitted crossbow builds: [ursacomputing/crossbow @ actions-c0dbacc455](https://github.com/ursacomputing/crossbow/branches/all?query=actions-c0dbacc455) |Task|Status| ||--| |test-build-cpp-fuzz|[](https://github.com/ursacomputing/crossbow/actions/runs/19887580720/job/56998179241)| -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]
pitrou commented on PR #48309: URL: https://github.com/apache/arrow/pull/48309#issuecomment-3605670198 @github-actions crossbow submit *fuzz* -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]
pitrou commented on code in PR #48309:
URL: https://github.com/apache/arrow/pull/48309#discussion_r2584103553
##
cpp/src/parquet/decoder.cc:
##
@@ -147,6 +139,20 @@ struct ArrowBinaryHelper {
return Status::OK();
}
+ Status ReserveInitialChunkData(std::optional
estimated_remaining_data_length) {
+RETURN_NOT_OK(builder_->Reserve(entries_remaining_));
+if (estimated_remaining_data_length.has_value()) {
+ int64_t required_capacity =
+ std::min(*estimated_remaining_data_length, chunk_space_remaining_);
+ // Ensure we'll be allocating a non-zero-sized data buffer, to avoid
encountering
+ // a null pointer
+ constexpr int64_t kMinReserveCapacity = 64;
Review Comment:
Well, it turns out that was unnecessary, so I might have misunderstood what
I saw during the debugging session. I'm going to simplify this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]
mapleFU commented on code in PR #48309:
URL: https://github.com/apache/arrow/pull/48309#discussion_r2583512443
##
cpp/src/parquet/decoder.cc:
##
@@ -147,6 +139,20 @@ struct ArrowBinaryHelper {
return Status::OK();
}
+ Status ReserveInitialChunkData(std::optional
estimated_remaining_data_length) {
+RETURN_NOT_OK(builder_->Reserve(entries_remaining_));
+if (estimated_remaining_data_length.has_value()) {
+ int64_t required_capacity =
+ std::min(*estimated_remaining_data_length, chunk_space_remaining_);
+ // Ensure we'll be allocating a non-zero-sized data buffer, to avoid
encountering
+ // a null pointer
+ constexpr int64_t kMinReserveCapacity = 64;
Review Comment:
https://github.com/apache/arrow/blob/322516f3940120cb2eddf3b1f919b38f0c81b097/cpp/src/arrow/memory_pool_internal.h#L35
Maybe here rather than a null pointer?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]
HuaHuaY commented on code in PR #48309:
URL: https://github.com/apache/arrow/pull/48309#discussion_r2583488156
##
cpp/src/parquet/decoder.cc:
##
@@ -147,6 +139,20 @@ struct ArrowBinaryHelper {
return Status::OK();
}
+ Status ReserveInitialChunkData(std::optional
estimated_remaining_data_length) {
+RETURN_NOT_OK(builder_->Reserve(entries_remaining_));
+if (estimated_remaining_data_length.has_value()) {
+ int64_t required_capacity =
+ std::min(*estimated_remaining_data_length, chunk_space_remaining_);
+ // Ensure we'll be allocating a non-zero-sized data buffer, to avoid
encountering
+ // a null pointer
+ constexpr int64_t kMinReserveCapacity = 64;
Review Comment:
What is the null pointer here and what will happen if there is a null
pointer?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]
mapleFU commented on code in PR #48309:
URL: https://github.com/apache/arrow/pull/48309#discussion_r2583356838
##
cpp/src/parquet/decoder.cc:
##
@@ -147,6 +139,20 @@ struct ArrowBinaryHelper {
return Status::OK();
}
+ Status ReserveInitialChunkData(std::optional
estimated_remaining_data_length) {
+RETURN_NOT_OK(builder_->Reserve(entries_remaining_));
+if (estimated_remaining_data_length.has_value()) {
+ int64_t required_capacity =
+ std::min(*estimated_remaining_data_length, chunk_space_remaining_);
+ // Ensure we'll be allocating a non-zero-sized data buffer, to avoid
encountering
+ // a null pointer
+ constexpr int64_t kMinReserveCapacity = 64;
Review Comment:
64 here is just a not too small or not too large number, to avoid `builder_`
return an NULLPTR?
##
cpp/src/parquet/decoder.cc:
##
@@ -754,43 +760,47 @@ class PlainByteArrayDecoder : public
PlainDecoder {
int64_t valid_bits_offset,
typename EncodingTraits::Accumulator*
out,
int* out_values_decoded) {
-// We're going to decode up to `num_values - null_count` PLAIN values,
+// We're going to decode `num_values - null_count` PLAIN values,
// and each value has a 4-byte length header that doesn't count for the
// Arrow binary data length.
-int64_t estimated_data_length =
-std::max(0, len_ - 4 * (num_values - null_count));
+int64_t estimated_data_length = len_ - 4 * (num_values - null_count);
+if (ARROW_PREDICT_FALSE(estimated_data_length < 0)) {
+ return Status::Invalid("Invalid or truncated PLAIN-encoded BYTE_ARRAY
data");
+}
auto visit_binary_helper = [&](auto* helper) {
int values_decoded = 0;
- RETURN_NOT_OK(VisitBitRuns(
- valid_bits, valid_bits_offset, num_values,
- [&](int64_t position, int64_t run_length, bool is_valid) {
-if (is_valid) {
- for (int64_t i = 0; i < run_length; ++i) {
-if (ARROW_PREDICT_FALSE(len_ < 4)) {
- return Status::Invalid(
- "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-auto value_len = SafeLoadAs(data_);
-if (ARROW_PREDICT_FALSE(value_len < 0 || value_len > len_ -
4)) {
- return Status::Invalid(
- "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-RETURN_NOT_OK(
-helper->AppendValue(data_ + 4, value_len,
estimated_data_length));
-auto increment = value_len + 4;
-data_ += increment;
-len_ -= increment;
-estimated_data_length -= value_len;
-DCHECK_GE(estimated_data_length, 0);
- }
- values_decoded += static_cast(run_length);
- return Status::OK();
-} else {
- return helper->AppendNulls(run_length);
+ auto visit_run = [&](int64_t position, int64_t run_length, bool
is_valid) {
+if (is_valid) {
+ for (int64_t i = 0; i < run_length; ++i) {
+// We ensure `len_` is sufficient thanks to:
+// 1. the initial `estimated_data_length` check above,
+// 2. the running `value_len > estimated_data_length` check
below.
+// This precondition follows from those two checks.
+DCHECK_GE(len_, 4);
+auto value_len = SafeLoadAs(data_);
+if (ARROW_PREDICT_FALSE(value_len < 0 || value_len >
estimated_data_length)) {
Review Comment:
Would you mind add this to comment?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]
github-actions[bot] commented on PR #48309: URL: https://github.com/apache/arrow/pull/48309#issuecomment-3603224787 Revision: 14c65056c63c3958a4ef8734c6ec465b0473ee36 Submitted crossbow builds: [ursacomputing/crossbow @ actions-ff3ab6ba25](https://github.com/ursacomputing/crossbow/branches/all?query=actions-ff3ab6ba25) |Task|Status| ||--| |example-cpp-minimal-build-static|[](https://github.com/ursacomputing/crossbow/actions/runs/19867852981/job/56935380736)| |example-cpp-minimal-build-static-system-dependency|[](https://github.com/ursacomputing/crossbow/actions/runs/19867852927/job/56935380667)| |example-cpp-tutorial|[](https://github.com/ursacomputing/crossbow/actions/runs/19867853826/job/56935383572)| |test-build-cpp-fuzz|[](https://github.com/ursacomputing/crossbow/actions/runs/19867853380/job/56935382670)| |test-conda-cpp|[](https://github.com/ursacomputing/crossbow/actions/runs/19867852316/job/56935378528)| |test-conda-cpp-valgrind|[](https://github.com/ursacomputing/crossbow/actions/runs/19867853015/job/56935380895)| |test-cuda-cpp-ubuntu-22.04-cuda-11.7.1|[](https://github.com/ursacomputing/crossbow/actions/runs/19867853109/job/56935381200)| |test-cuda-cpp-ubuntu-24.04-cuda-13.0.2|[](https://github.com/ursacomputing/crossbow/actions/runs/19867852128/job/56935378003)| |test-debian-12-cpp-amd64|[](https://github.com/ursacomputing/crossbow/actions/runs/19867852319/job/56935378718)| |test-debian-12-cpp-i386|[](https://github.com/ursacomputing/crossbow/actions/runs/19867853188/job/56935381578)| |test-fedora-42-cpp|[](https://github.com/ursacomputing/crossbow/actions/runs/19867853572/job/56935382608)| |test-ubuntu-22.04-cpp|[](https://github.com/ursacomputing/crossbow/actions/runs/19867852479/job/56935379311)| |test-ubuntu-22.04-cpp-20|[](https://github.com/ursacomputing/crossbow/actions/runs/19867852846/job/56935380340)| |test-ubuntu-22.04-cpp-bundled|[](https://github.com/ursacomputing/crossbow/actions/runs/19867852351/job/56935378951)| |test-ubuntu-22.04-cpp-emscripten|[](https://github.com/ursacomputing/crossbow/actions/runs/19867852383/job/56935378838)| |test-ubuntu-22.04-cpp-no-threading|[](https://github.com/ursacomputing/crossbow/actions/runs/19867852830/job/56935380300)| |test-ubuntu-24.04-cpp|[ {
-// We're going to decode up to `num_values - null_count` PLAIN values,
+// We're going to decode `num_values - null_count` PLAIN values,
// and each value has a 4-byte length header that doesn't count for the
// Arrow binary data length.
-int64_t estimated_data_length =
-std::max(0, len_ - 4 * (num_values - null_count));
+int64_t estimated_data_length = len_ - 4 * (num_values - null_count);
+if (ARROW_PREDICT_FALSE(estimated_data_length < 0)) {
+ return Status::Invalid("Invalid or truncated PLAIN-encoded BYTE_ARRAY
data");
+}
auto visit_binary_helper = [&](auto* helper) {
int values_decoded = 0;
- RETURN_NOT_OK(VisitBitRuns(
- valid_bits, valid_bits_offset, num_values,
- [&](int64_t position, int64_t run_length, bool is_valid) {
-if (is_valid) {
- for (int64_t i = 0; i < run_length; ++i) {
-if (ARROW_PREDICT_FALSE(len_ < 4)) {
- return Status::Invalid(
- "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-auto value_len = SafeLoadAs(data_);
-if (ARROW_PREDICT_FALSE(value_len < 0 || value_len > len_ -
4)) {
- return Status::Invalid(
- "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-RETURN_NOT_OK(
-helper->AppendValue(data_ + 4, value_len,
estimated_data_length));
-auto increment = value_len + 4;
-data_ += increment;
-len_ -= increment;
-estimated_data_length -= value_len;
-DCHECK_GE(estimated_data_length, 0);
- }
- values_decoded += static_cast(run_length);
- return Status::OK();
-} else {
- return helper->AppendNulls(run_length);
+ auto visit_run = [&](int64_t position, int64_t run_length, bool
is_valid) {
+if (is_valid) {
+ for (int64_t i = 0; i < run_length; ++i) {
+// We ensure `len_` is sufficient thanks to:
+// 1. the initial `estimated_data_length` check above,
+// 2. the running `value_len > estimated_data_length` check
below.
+// This precondition follows from those two checks.
+DCHECK_GE(len_, 4);
+auto value_len = SafeLoadAs(data_);
+if (ARROW_PREDICT_FALSE(value_len < 0 || value_len >
estimated_data_length)) {
Review Comment:
This is the main logical change in this code block: `value_len > len_ - 4`
is replaced with the stricter `value_len > estimated_data_length`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]
pitrou commented on code in PR #48309:
URL: https://github.com/apache/arrow/pull/48309#discussion_r2582171832
##
cpp/src/parquet/decoder.cc:
##
@@ -754,43 +760,47 @@ class PlainByteArrayDecoder : public
PlainDecoder {
int64_t valid_bits_offset,
typename EncodingTraits::Accumulator*
out,
int* out_values_decoded) {
-// We're going to decode up to `num_values - null_count` PLAIN values,
+// We're going to decode `num_values - null_count` PLAIN values,
// and each value has a 4-byte length header that doesn't count for the
// Arrow binary data length.
-int64_t estimated_data_length =
-std::max(0, len_ - 4 * (num_values - null_count));
+int64_t estimated_data_length = len_ - 4 * (num_values - null_count);
+if (ARROW_PREDICT_FALSE(estimated_data_length < 0)) {
+ return Status::Invalid("Invalid or truncated PLAIN-encoded BYTE_ARRAY
data");
+}
auto visit_binary_helper = [&](auto* helper) {
int values_decoded = 0;
- RETURN_NOT_OK(VisitBitRuns(
- valid_bits, valid_bits_offset, num_values,
- [&](int64_t position, int64_t run_length, bool is_valid) {
-if (is_valid) {
- for (int64_t i = 0; i < run_length; ++i) {
-if (ARROW_PREDICT_FALSE(len_ < 4)) {
- return Status::Invalid(
- "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-auto value_len = SafeLoadAs(data_);
-if (ARROW_PREDICT_FALSE(value_len < 0 || value_len > len_ -
4)) {
- return Status::Invalid(
- "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-RETURN_NOT_OK(
-helper->AppendValue(data_ + 4, value_len,
estimated_data_length));
-auto increment = value_len + 4;
-data_ += increment;
-len_ -= increment;
-estimated_data_length -= value_len;
-DCHECK_GE(estimated_data_length, 0);
- }
- values_decoded += static_cast(run_length);
- return Status::OK();
-} else {
- return helper->AppendNulls(run_length);
+ auto visit_run = [&](int64_t position, int64_t run_length, bool
is_valid) {
+if (is_valid) {
+ for (int64_t i = 0; i < run_length; ++i) {
+// We ensure `len_` is sufficient thanks to:
+// 1. the initial `estimated_data_length` check above,
+// 2. the running `value_len > estimated_data_length` check
below.
+// This precondition follows from those two checks.
+DCHECK_GE(len_, 4);
+auto value_len = SafeLoadAs(data_);
+if (ARROW_PREDICT_FALSE(value_len < 0 || value_len >
estimated_data_length)) {
Review Comment:
This is the main logical change: `value_len > len_ - 4` is replaced with the
stricter `value_len > estimated_data_length`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]
pitrou commented on code in PR #48309:
URL: https://github.com/apache/arrow/pull/48309#discussion_r2582169889
##
cpp/src/parquet/decoder.cc:
##
@@ -754,43 +760,47 @@ class PlainByteArrayDecoder : public
PlainDecoder {
int64_t valid_bits_offset,
typename EncodingTraits::Accumulator*
out,
int* out_values_decoded) {
-// We're going to decode up to `num_values - null_count` PLAIN values,
+// We're going to decode `num_values - null_count` PLAIN values,
// and each value has a 4-byte length header that doesn't count for the
// Arrow binary data length.
-int64_t estimated_data_length =
-std::max(0, len_ - 4 * (num_values - null_count));
+int64_t estimated_data_length = len_ - 4 * (num_values - null_count);
+if (ARROW_PREDICT_FALSE(estimated_data_length < 0)) {
+ return Status::Invalid("Invalid or truncated PLAIN-encoded BYTE_ARRAY
data");
+}
auto visit_binary_helper = [&](auto* helper) {
int values_decoded = 0;
- RETURN_NOT_OK(VisitBitRuns(
- valid_bits, valid_bits_offset, num_values,
- [&](int64_t position, int64_t run_length, bool is_valid) {
-if (is_valid) {
- for (int64_t i = 0; i < run_length; ++i) {
-if (ARROW_PREDICT_FALSE(len_ < 4)) {
- return Status::Invalid(
- "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-auto value_len = SafeLoadAs(data_);
-if (ARROW_PREDICT_FALSE(value_len < 0 || value_len > len_ -
4)) {
- return Status::Invalid(
- "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-RETURN_NOT_OK(
-helper->AppendValue(data_ + 4, value_len,
estimated_data_length));
-auto increment = value_len + 4;
-data_ += increment;
-len_ -= increment;
-estimated_data_length -= value_len;
-DCHECK_GE(estimated_data_length, 0);
- }
- values_decoded += static_cast(run_length);
- return Status::OK();
-} else {
- return helper->AppendNulls(run_length);
+ auto visit_run = [&](int64_t position, int64_t run_length, bool
is_valid) {
Review Comment:
I pulled out the callback as a lambda because clang-format kept reindenting
it anyway :-/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]
pitrou commented on code in PR #48309:
URL: https://github.com/apache/arrow/pull/48309#discussion_r2582123299
##
cpp/src/parquet/decoder.cc:
##
@@ -763,34 +769,35 @@ class PlainByteArrayDecoder : public
PlainDecoder {
auto visit_binary_helper = [&](auto* helper) {
int values_decoded = 0;
- RETURN_NOT_OK(VisitBitRuns(
- valid_bits, valid_bits_offset, num_values,
- [&](int64_t position, int64_t run_length, bool is_valid) {
-if (is_valid) {
- for (int64_t i = 0; i < run_length; ++i) {
-if (ARROW_PREDICT_FALSE(len_ < 4)) {
- return Status::Invalid(
- "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-auto value_len = SafeLoadAs(data_);
-if (ARROW_PREDICT_FALSE(value_len < 0 || value_len > len_ -
4)) {
- return Status::Invalid(
- "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-RETURN_NOT_OK(
-helper->AppendValue(data_ + 4, value_len,
estimated_data_length));
-auto increment = value_len + 4;
-data_ += increment;
-len_ -= increment;
-estimated_data_length -= value_len;
-DCHECK_GE(estimated_data_length, 0);
- }
- values_decoded += static_cast(run_length);
- return Status::OK();
-} else {
- return helper->AppendNulls(run_length);
-}
- }));
+ RETURN_NOT_OK(
+ VisitBitRuns(valid_bits, valid_bits_offset, num_values,
+ [&](int64_t position, int64_t run_length, bool
is_valid) {
+ if (is_valid) {
+ for (int64_t i = 0; i < run_length; ++i) {
+ if (ARROW_PREDICT_FALSE(len_ < 4)) {
+ return Status::Invalid(
+ "Invalid or truncated PLAIN-encoded
BYTE_ARRAY data");
+ }
+ auto value_len = SafeLoadAs(data_);
+ if (ARROW_PREDICT_FALSE(value_len < 0 ||
+ value_len >
estimated_data_length)) {
Review Comment:
Sorry that clang-format decided to reindent this entire block of code... but
the main logical change is in these two lines: `value_len > len_ - 4` is
replaced with the stricter `value_len > estimated_data_length`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]
pitrou commented on code in PR #48309:
URL: https://github.com/apache/arrow/pull/48309#discussion_r2582123299
##
cpp/src/parquet/decoder.cc:
##
@@ -763,34 +769,35 @@ class PlainByteArrayDecoder : public
PlainDecoder {
auto visit_binary_helper = [&](auto* helper) {
int values_decoded = 0;
- RETURN_NOT_OK(VisitBitRuns(
- valid_bits, valid_bits_offset, num_values,
- [&](int64_t position, int64_t run_length, bool is_valid) {
-if (is_valid) {
- for (int64_t i = 0; i < run_length; ++i) {
-if (ARROW_PREDICT_FALSE(len_ < 4)) {
- return Status::Invalid(
- "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-auto value_len = SafeLoadAs(data_);
-if (ARROW_PREDICT_FALSE(value_len < 0 || value_len > len_ -
4)) {
- return Status::Invalid(
- "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-RETURN_NOT_OK(
-helper->AppendValue(data_ + 4, value_len,
estimated_data_length));
-auto increment = value_len + 4;
-data_ += increment;
-len_ -= increment;
-estimated_data_length -= value_len;
-DCHECK_GE(estimated_data_length, 0);
- }
- values_decoded += static_cast(run_length);
- return Status::OK();
-} else {
- return helper->AppendNulls(run_length);
-}
- }));
+ RETURN_NOT_OK(
+ VisitBitRuns(valid_bits, valid_bits_offset, num_values,
+ [&](int64_t position, int64_t run_length, bool
is_valid) {
+ if (is_valid) {
+ for (int64_t i = 0; i < run_length; ++i) {
+ if (ARROW_PREDICT_FALSE(len_ < 4)) {
+ return Status::Invalid(
+ "Invalid or truncated PLAIN-encoded
BYTE_ARRAY data");
+ }
+ auto value_len = SafeLoadAs(data_);
+ if (ARROW_PREDICT_FALSE(value_len < 0 ||
+ value_len >
estimated_data_length)) {
Review Comment:
Sorry that clang-format decided to reindent this entire block of code... but
the only logical change is in these two lines: `value_len > len_ - 4` is
replaced with the stricter `value_len > estimated_data_length`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]
github-actions[bot] commented on PR #48309: URL: https://github.com/apache/arrow/pull/48309#issuecomment-3603129689 Revision: 2c5dda8df9a3625898cdf9f1e5e032c9cb03acf7 Submitted crossbow builds: [ursacomputing/crossbow @ actions-6006ee6dbc](https://github.com/ursacomputing/crossbow/branches/all?query=actions-6006ee6dbc) |Task|Status| ||--| |example-cpp-minimal-build-static|[](https://github.com/ursacomputing/crossbow/actions/runs/19867139288/job/56932922031)| |example-cpp-minimal-build-static-system-dependency|[](https://github.com/ursacomputing/crossbow/actions/runs/19867137618/job/56932916060)| |example-cpp-tutorial|[](https://github.com/ursacomputing/crossbow/actions/runs/19867138395/job/56932918344)| |test-build-cpp-fuzz|[](https://github.com/ursacomputing/crossbow/actions/runs/19867137625/job/56932916047)| |test-conda-cpp|[](https://github.com/ursacomputing/crossbow/actions/runs/19867137859/job/56932916743)| |test-conda-cpp-valgrind|[](https://github.com/ursacomputing/crossbow/actions/runs/19867138198/job/56932917940)| |test-cuda-cpp-ubuntu-22.04-cuda-11.7.1|[](https://github.com/ursacomputing/crossbow/actions/runs/19867138950/job/56932920427)| |test-cuda-cpp-ubuntu-24.04-cuda-13.0.2|[](https://github.com/ursacomputing/crossbow/actions/runs/19867137635/job/56932916353)| |test-debian-12-cpp-amd64|[](https://github.com/ursacomputing/crossbow/actions/runs/19867138357/job/56932918777)| |test-debian-12-cpp-i386|[](https://github.com/ursacomputing/crossbow/actions/runs/19867137322/job/56932915175)| |test-fedora-42-cpp|[](https://github.com/ursacomputing/crossbow/actions/runs/19867137465/job/56932915433)| |test-ubuntu-22.04-cpp|[](https://github.com/ursacomputing/crossbow/actions/runs/19867137376/job/56932915168)| |test-ubuntu-22.04-cpp-20|[](https://github.com/ursacomputing/crossbow/actions/runs/19867138870/job/56932920092)| |test-ubuntu-22.04-cpp-bundled|[](https://github.com/ursacomputing/crossbow/actions/runs/19867138233/job/56932918015)| |test-ubuntu-22.04-cpp-emscripten|[](https://github.com/ursacomputing/crossbow/actions/runs/19867138563/job/56932919138)| |test-ubuntu-22.04-cpp-no-threading|[](https://github.com/ursacomputing/crossbow/actions/runs/19867137123/job/56932914472)| |test-ubuntu-24.04-cpp|[ {
int values_decoded = 0;
- RETURN_NOT_OK(VisitBitRuns(
- valid_bits, valid_bits_offset, num_values,
- [&](int64_t position, int64_t run_length, bool is_valid) {
-if (is_valid) {
- for (int64_t i = 0; i < run_length; ++i) {
-if (ARROW_PREDICT_FALSE(len_ < 4)) {
- return Status::Invalid(
- "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-auto value_len = SafeLoadAs(data_);
-if (ARROW_PREDICT_FALSE(value_len < 0 || value_len > len_ -
4)) {
- return Status::Invalid(
- "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-RETURN_NOT_OK(
-helper->AppendValue(data_ + 4, value_len,
estimated_data_length));
-auto increment = value_len + 4;
-data_ += increment;
-len_ -= increment;
-estimated_data_length -= value_len;
-DCHECK_GE(estimated_data_length, 0);
- }
- values_decoded += static_cast(run_length);
- return Status::OK();
-} else {
- return helper->AppendNulls(run_length);
-}
- }));
+ RETURN_NOT_OK(
+ VisitBitRuns(valid_bits, valid_bits_offset, num_values,
+ [&](int64_t position, int64_t run_length, bool
is_valid) {
+ if (is_valid) {
+ for (int64_t i = 0; i < run_length; ++i) {
+ if (ARROW_PREDICT_FALSE(len_ < 4)) {
+ return Status::Invalid(
+ "Invalid or truncated PLAIN-encoded
BYTE_ARRAY data");
+ }
+ auto value_len = SafeLoadAs(data_);
+ if (ARROW_PREDICT_FALSE(value_len < 0 ||
+ value_len >
estimated_data_length)) {
Review Comment:
Sorry that clang-format decided to reindent this entire block of code... but
the main logical change is in these two lines: `value_len > len_ - 4` is
replaced with the stricter `value_len > estimated_data_length`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]
pitrou commented on PR #48309: URL: https://github.com/apache/arrow/pull/48309#issuecomment-3603117871 @github-actions crossbow submit -g cpp -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
