Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]

2025-12-17 Thread via GitHub


conbench-apache-arrow[bot] commented on PR #48309:
URL: https://github.com/apache/arrow/pull/48309#issuecomment-3667979737

   After merging your PR, Conbench analyzed the 3 benchmarking runs that have 
been run so far on merge-commit ea6dc7beb8b68d4b1d607694f12fdfb8013442d1.
   
   There were no benchmark performance regressions. 🎉
   
   The [full Conbench report](https://github.com/apache/arrow/runs/58384263530) 
has more details.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]

2025-12-03 Thread via GitHub


pitrou commented on code in PR #48309:
URL: https://github.com/apache/arrow/pull/48309#discussion_r2584276641


##
cpp/src/parquet/decoder.cc:
##
@@ -754,43 +760,47 @@ class PlainByteArrayDecoder : public 
PlainDecoder {
   int64_t valid_bits_offset,
   typename EncodingTraits::Accumulator* 
out,
   int* out_values_decoded) {
-// We're going to decode up to `num_values - null_count` PLAIN values,
+// We're going to decode `num_values - null_count` PLAIN values,
 // and each value has a 4-byte length header that doesn't count for the
 // Arrow binary data length.
-int64_t estimated_data_length =
-std::max(0, len_ - 4 * (num_values - null_count));
+int64_t estimated_data_length = len_ - 4 * (num_values - null_count);
+if (ARROW_PREDICT_FALSE(estimated_data_length < 0)) {
+  return Status::Invalid("Invalid or truncated PLAIN-encoded BYTE_ARRAY 
data");
+}
 
 auto visit_binary_helper = [&](auto* helper) {
   int values_decoded = 0;
 
-  RETURN_NOT_OK(VisitBitRuns(
-  valid_bits, valid_bits_offset, num_values,
-  [&](int64_t position, int64_t run_length, bool is_valid) {
-if (is_valid) {
-  for (int64_t i = 0; i < run_length; ++i) {
-if (ARROW_PREDICT_FALSE(len_ < 4)) {
-  return Status::Invalid(
-  "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-auto value_len = SafeLoadAs(data_);
-if (ARROW_PREDICT_FALSE(value_len < 0 || value_len > len_ - 
4)) {
-  return Status::Invalid(
-  "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-RETURN_NOT_OK(
-helper->AppendValue(data_ + 4, value_len, 
estimated_data_length));
-auto increment = value_len + 4;
-data_ += increment;
-len_ -= increment;
-estimated_data_length -= value_len;
-DCHECK_GE(estimated_data_length, 0);
-  }
-  values_decoded += static_cast(run_length);
-  return Status::OK();
-} else {
-  return helper->AppendNulls(run_length);
+  auto visit_run = [&](int64_t position, int64_t run_length, bool 
is_valid) {
+if (is_valid) {
+  for (int64_t i = 0; i < run_length; ++i) {
+// We ensure `len_` is sufficient thanks to:
+//   1. the initial `estimated_data_length` check above,
+//   2. the running `value_len > estimated_data_length` check 
below.
+// This precondition follows from those two checks.
+DCHECK_GE(len_, 4);
+auto value_len = SafeLoadAs(data_);
+if (ARROW_PREDICT_FALSE(value_len < 0 || value_len > 
estimated_data_length)) {

Review Comment:
   Added.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]

2025-12-03 Thread via GitHub


github-actions[bot] commented on PR #48309:
URL: https://github.com/apache/arrow/pull/48309#issuecomment-3605685122

   Revision: 65e00521312994d515d6063a71fb20dfcc291978
   
   Submitted crossbow builds: [ursacomputing/crossbow @ 
actions-c0dbacc455](https://github.com/ursacomputing/crossbow/branches/all?query=actions-c0dbacc455)
   
   |Task|Status|
   ||--|
   |test-build-cpp-fuzz|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-c0dbacc455-github-test-build-cpp-fuzz)](https://github.com/ursacomputing/crossbow/actions/runs/19887580720/job/56998179241)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]

2025-12-03 Thread via GitHub


pitrou commented on PR #48309:
URL: https://github.com/apache/arrow/pull/48309#issuecomment-3605670198

   @github-actions crossbow submit *fuzz*


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]

2025-12-03 Thread via GitHub


pitrou commented on code in PR #48309:
URL: https://github.com/apache/arrow/pull/48309#discussion_r2584103553


##
cpp/src/parquet/decoder.cc:
##
@@ -147,6 +139,20 @@ struct ArrowBinaryHelper {
 return Status::OK();
   }
 
+  Status ReserveInitialChunkData(std::optional 
estimated_remaining_data_length) {
+RETURN_NOT_OK(builder_->Reserve(entries_remaining_));
+if (estimated_remaining_data_length.has_value()) {
+  int64_t required_capacity =
+  std::min(*estimated_remaining_data_length, chunk_space_remaining_);
+  // Ensure we'll be allocating a non-zero-sized data buffer, to avoid 
encountering
+  // a null pointer
+  constexpr int64_t kMinReserveCapacity = 64;

Review Comment:
   Well, it turns out that was unnecessary, so I might have misunderstood what 
I saw during the debugging session. I'm going to simplify this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]

2025-12-02 Thread via GitHub


mapleFU commented on code in PR #48309:
URL: https://github.com/apache/arrow/pull/48309#discussion_r2583512443


##
cpp/src/parquet/decoder.cc:
##
@@ -147,6 +139,20 @@ struct ArrowBinaryHelper {
 return Status::OK();
   }
 
+  Status ReserveInitialChunkData(std::optional 
estimated_remaining_data_length) {
+RETURN_NOT_OK(builder_->Reserve(entries_remaining_));
+if (estimated_remaining_data_length.has_value()) {
+  int64_t required_capacity =
+  std::min(*estimated_remaining_data_length, chunk_space_remaining_);
+  // Ensure we'll be allocating a non-zero-sized data buffer, to avoid 
encountering
+  // a null pointer
+  constexpr int64_t kMinReserveCapacity = 64;

Review Comment:
   
https://github.com/apache/arrow/blob/322516f3940120cb2eddf3b1f919b38f0c81b097/cpp/src/arrow/memory_pool_internal.h#L35
   
   Maybe here rather than a null pointer?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]

2025-12-02 Thread via GitHub


HuaHuaY commented on code in PR #48309:
URL: https://github.com/apache/arrow/pull/48309#discussion_r2583488156


##
cpp/src/parquet/decoder.cc:
##
@@ -147,6 +139,20 @@ struct ArrowBinaryHelper {
 return Status::OK();
   }
 
+  Status ReserveInitialChunkData(std::optional 
estimated_remaining_data_length) {
+RETURN_NOT_OK(builder_->Reserve(entries_remaining_));
+if (estimated_remaining_data_length.has_value()) {
+  int64_t required_capacity =
+  std::min(*estimated_remaining_data_length, chunk_space_remaining_);
+  // Ensure we'll be allocating a non-zero-sized data buffer, to avoid 
encountering
+  // a null pointer
+  constexpr int64_t kMinReserveCapacity = 64;

Review Comment:
   What is the null pointer here and what will happen if there is a null 
pointer?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]

2025-12-02 Thread via GitHub


mapleFU commented on code in PR #48309:
URL: https://github.com/apache/arrow/pull/48309#discussion_r2583356838


##
cpp/src/parquet/decoder.cc:
##
@@ -147,6 +139,20 @@ struct ArrowBinaryHelper {
 return Status::OK();
   }
 
+  Status ReserveInitialChunkData(std::optional 
estimated_remaining_data_length) {
+RETURN_NOT_OK(builder_->Reserve(entries_remaining_));
+if (estimated_remaining_data_length.has_value()) {
+  int64_t required_capacity =
+  std::min(*estimated_remaining_data_length, chunk_space_remaining_);
+  // Ensure we'll be allocating a non-zero-sized data buffer, to avoid 
encountering
+  // a null pointer
+  constexpr int64_t kMinReserveCapacity = 64;

Review Comment:
   64 here is just a not too small or not too large number, to avoid `builder_` 
return an NULLPTR?



##
cpp/src/parquet/decoder.cc:
##
@@ -754,43 +760,47 @@ class PlainByteArrayDecoder : public 
PlainDecoder {
   int64_t valid_bits_offset,
   typename EncodingTraits::Accumulator* 
out,
   int* out_values_decoded) {
-// We're going to decode up to `num_values - null_count` PLAIN values,
+// We're going to decode `num_values - null_count` PLAIN values,
 // and each value has a 4-byte length header that doesn't count for the
 // Arrow binary data length.
-int64_t estimated_data_length =
-std::max(0, len_ - 4 * (num_values - null_count));
+int64_t estimated_data_length = len_ - 4 * (num_values - null_count);
+if (ARROW_PREDICT_FALSE(estimated_data_length < 0)) {
+  return Status::Invalid("Invalid or truncated PLAIN-encoded BYTE_ARRAY 
data");
+}
 
 auto visit_binary_helper = [&](auto* helper) {
   int values_decoded = 0;
 
-  RETURN_NOT_OK(VisitBitRuns(
-  valid_bits, valid_bits_offset, num_values,
-  [&](int64_t position, int64_t run_length, bool is_valid) {
-if (is_valid) {
-  for (int64_t i = 0; i < run_length; ++i) {
-if (ARROW_PREDICT_FALSE(len_ < 4)) {
-  return Status::Invalid(
-  "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-auto value_len = SafeLoadAs(data_);
-if (ARROW_PREDICT_FALSE(value_len < 0 || value_len > len_ - 
4)) {
-  return Status::Invalid(
-  "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-RETURN_NOT_OK(
-helper->AppendValue(data_ + 4, value_len, 
estimated_data_length));
-auto increment = value_len + 4;
-data_ += increment;
-len_ -= increment;
-estimated_data_length -= value_len;
-DCHECK_GE(estimated_data_length, 0);
-  }
-  values_decoded += static_cast(run_length);
-  return Status::OK();
-} else {
-  return helper->AppendNulls(run_length);
+  auto visit_run = [&](int64_t position, int64_t run_length, bool 
is_valid) {
+if (is_valid) {
+  for (int64_t i = 0; i < run_length; ++i) {
+// We ensure `len_` is sufficient thanks to:
+//   1. the initial `estimated_data_length` check above,
+//   2. the running `value_len > estimated_data_length` check 
below.
+// This precondition follows from those two checks.
+DCHECK_GE(len_, 4);
+auto value_len = SafeLoadAs(data_);
+if (ARROW_PREDICT_FALSE(value_len < 0 || value_len > 
estimated_data_length)) {

Review Comment:
   Would you mind add this to comment?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]

2025-12-02 Thread via GitHub


github-actions[bot] commented on PR #48309:
URL: https://github.com/apache/arrow/pull/48309#issuecomment-3603224787

   Revision: 14c65056c63c3958a4ef8734c6ec465b0473ee36
   
   Submitted crossbow builds: [ursacomputing/crossbow @ 
actions-ff3ab6ba25](https://github.com/ursacomputing/crossbow/branches/all?query=actions-ff3ab6ba25)
   
   |Task|Status|
   ||--|
   |example-cpp-minimal-build-static|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-ff3ab6ba25-github-example-cpp-minimal-build-static)](https://github.com/ursacomputing/crossbow/actions/runs/19867852981/job/56935380736)|
   |example-cpp-minimal-build-static-system-dependency|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-ff3ab6ba25-github-example-cpp-minimal-build-static-system-dependency)](https://github.com/ursacomputing/crossbow/actions/runs/19867852927/job/56935380667)|
   |example-cpp-tutorial|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-ff3ab6ba25-github-example-cpp-tutorial)](https://github.com/ursacomputing/crossbow/actions/runs/19867853826/job/56935383572)|
   |test-build-cpp-fuzz|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-ff3ab6ba25-github-test-build-cpp-fuzz)](https://github.com/ursacomputing/crossbow/actions/runs/19867853380/job/56935382670)|
   |test-conda-cpp|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-ff3ab6ba25-github-test-conda-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/19867852316/job/56935378528)|
   |test-conda-cpp-valgrind|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-ff3ab6ba25-github-test-conda-cpp-valgrind)](https://github.com/ursacomputing/crossbow/actions/runs/19867853015/job/56935380895)|
   |test-cuda-cpp-ubuntu-22.04-cuda-11.7.1|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-ff3ab6ba25-github-test-cuda-cpp-ubuntu-22.04-cuda-11.7.1)](https://github.com/ursacomputing/crossbow/actions/runs/19867853109/job/56935381200)|
   |test-cuda-cpp-ubuntu-24.04-cuda-13.0.2|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-ff3ab6ba25-github-test-cuda-cpp-ubuntu-24.04-cuda-13.0.2)](https://github.com/ursacomputing/crossbow/actions/runs/19867852128/job/56935378003)|
   |test-debian-12-cpp-amd64|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-ff3ab6ba25-github-test-debian-12-cpp-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/19867852319/job/56935378718)|
   |test-debian-12-cpp-i386|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-ff3ab6ba25-github-test-debian-12-cpp-i386)](https://github.com/ursacomputing/crossbow/actions/runs/19867853188/job/56935381578)|
   |test-fedora-42-cpp|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-ff3ab6ba25-github-test-fedora-42-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/19867853572/job/56935382608)|
   |test-ubuntu-22.04-cpp|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-ff3ab6ba25-github-test-ubuntu-22.04-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/19867852479/job/56935379311)|
   |test-ubuntu-22.04-cpp-20|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-ff3ab6ba25-github-test-ubuntu-22.04-cpp-20)](https://github.com/ursacomputing/crossbow/actions/runs/19867852846/job/56935380340)|
   |test-ubuntu-22.04-cpp-bundled|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-ff3ab6ba25-github-test-ubuntu-22.04-cpp-bundled)](https://github.com/ursacomputing/crossbow/actions/runs/19867852351/job/56935378951)|
   |test-ubuntu-22.04-cpp-emscripten|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-ff3ab6ba25-github-test-ubuntu-22.04-cpp-emscripten)](https://github.com/ursacomputing/crossbow/actions/runs/19867852383/job/56935378838)|
   |test-ubuntu-22.04-cpp-no-threading|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-ff3ab6ba25-github-test-ubuntu-22.04-cpp-no-threading)](https://github.com/ursacomputing/crossbow/actions/runs/19867852830/job/56935380300)|
   |test-ubuntu-24.04-cpp|[![GitHub 
Actions](https://github.com/ursacomputing/cross

Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]

2025-12-02 Thread via GitHub


pitrou commented on PR #48309:
URL: https://github.com/apache/arrow/pull/48309#issuecomment-3603211763

   @github-actions crossbow submit -g cpp


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]

2025-12-02 Thread via GitHub


pitrou commented on code in PR #48309:
URL: https://github.com/apache/arrow/pull/48309#discussion_r2582171832


##
cpp/src/parquet/decoder.cc:
##
@@ -754,43 +760,47 @@ class PlainByteArrayDecoder : public 
PlainDecoder {
   int64_t valid_bits_offset,
   typename EncodingTraits::Accumulator* 
out,
   int* out_values_decoded) {
-// We're going to decode up to `num_values - null_count` PLAIN values,
+// We're going to decode `num_values - null_count` PLAIN values,
 // and each value has a 4-byte length header that doesn't count for the
 // Arrow binary data length.
-int64_t estimated_data_length =
-std::max(0, len_ - 4 * (num_values - null_count));
+int64_t estimated_data_length = len_ - 4 * (num_values - null_count);
+if (ARROW_PREDICT_FALSE(estimated_data_length < 0)) {
+  return Status::Invalid("Invalid or truncated PLAIN-encoded BYTE_ARRAY 
data");
+}
 
 auto visit_binary_helper = [&](auto* helper) {
   int values_decoded = 0;
 
-  RETURN_NOT_OK(VisitBitRuns(
-  valid_bits, valid_bits_offset, num_values,
-  [&](int64_t position, int64_t run_length, bool is_valid) {
-if (is_valid) {
-  for (int64_t i = 0; i < run_length; ++i) {
-if (ARROW_PREDICT_FALSE(len_ < 4)) {
-  return Status::Invalid(
-  "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-auto value_len = SafeLoadAs(data_);
-if (ARROW_PREDICT_FALSE(value_len < 0 || value_len > len_ - 
4)) {
-  return Status::Invalid(
-  "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-RETURN_NOT_OK(
-helper->AppendValue(data_ + 4, value_len, 
estimated_data_length));
-auto increment = value_len + 4;
-data_ += increment;
-len_ -= increment;
-estimated_data_length -= value_len;
-DCHECK_GE(estimated_data_length, 0);
-  }
-  values_decoded += static_cast(run_length);
-  return Status::OK();
-} else {
-  return helper->AppendNulls(run_length);
+  auto visit_run = [&](int64_t position, int64_t run_length, bool 
is_valid) {
+if (is_valid) {
+  for (int64_t i = 0; i < run_length; ++i) {
+// We ensure `len_` is sufficient thanks to:
+//   1. the initial `estimated_data_length` check above,
+//   2. the running `value_len > estimated_data_length` check 
below.
+// This precondition follows from those two checks.
+DCHECK_GE(len_, 4);
+auto value_len = SafeLoadAs(data_);
+if (ARROW_PREDICT_FALSE(value_len < 0 || value_len > 
estimated_data_length)) {

Review Comment:
   This is the main logical change in this code block: `value_len > len_ - 4` 
is replaced with the stricter `value_len > estimated_data_length`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]

2025-12-02 Thread via GitHub


pitrou commented on code in PR #48309:
URL: https://github.com/apache/arrow/pull/48309#discussion_r2582171832


##
cpp/src/parquet/decoder.cc:
##
@@ -754,43 +760,47 @@ class PlainByteArrayDecoder : public 
PlainDecoder {
   int64_t valid_bits_offset,
   typename EncodingTraits::Accumulator* 
out,
   int* out_values_decoded) {
-// We're going to decode up to `num_values - null_count` PLAIN values,
+// We're going to decode `num_values - null_count` PLAIN values,
 // and each value has a 4-byte length header that doesn't count for the
 // Arrow binary data length.
-int64_t estimated_data_length =
-std::max(0, len_ - 4 * (num_values - null_count));
+int64_t estimated_data_length = len_ - 4 * (num_values - null_count);
+if (ARROW_PREDICT_FALSE(estimated_data_length < 0)) {
+  return Status::Invalid("Invalid or truncated PLAIN-encoded BYTE_ARRAY 
data");
+}
 
 auto visit_binary_helper = [&](auto* helper) {
   int values_decoded = 0;
 
-  RETURN_NOT_OK(VisitBitRuns(
-  valid_bits, valid_bits_offset, num_values,
-  [&](int64_t position, int64_t run_length, bool is_valid) {
-if (is_valid) {
-  for (int64_t i = 0; i < run_length; ++i) {
-if (ARROW_PREDICT_FALSE(len_ < 4)) {
-  return Status::Invalid(
-  "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-auto value_len = SafeLoadAs(data_);
-if (ARROW_PREDICT_FALSE(value_len < 0 || value_len > len_ - 
4)) {
-  return Status::Invalid(
-  "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-RETURN_NOT_OK(
-helper->AppendValue(data_ + 4, value_len, 
estimated_data_length));
-auto increment = value_len + 4;
-data_ += increment;
-len_ -= increment;
-estimated_data_length -= value_len;
-DCHECK_GE(estimated_data_length, 0);
-  }
-  values_decoded += static_cast(run_length);
-  return Status::OK();
-} else {
-  return helper->AppendNulls(run_length);
+  auto visit_run = [&](int64_t position, int64_t run_length, bool 
is_valid) {
+if (is_valid) {
+  for (int64_t i = 0; i < run_length; ++i) {
+// We ensure `len_` is sufficient thanks to:
+//   1. the initial `estimated_data_length` check above,
+//   2. the running `value_len > estimated_data_length` check 
below.
+// This precondition follows from those two checks.
+DCHECK_GE(len_, 4);
+auto value_len = SafeLoadAs(data_);
+if (ARROW_PREDICT_FALSE(value_len < 0 || value_len > 
estimated_data_length)) {

Review Comment:
   This is the main logical change: `value_len > len_ - 4` is replaced with the 
stricter `value_len > estimated_data_length`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]

2025-12-02 Thread via GitHub


pitrou commented on code in PR #48309:
URL: https://github.com/apache/arrow/pull/48309#discussion_r2582169889


##
cpp/src/parquet/decoder.cc:
##
@@ -754,43 +760,47 @@ class PlainByteArrayDecoder : public 
PlainDecoder {
   int64_t valid_bits_offset,
   typename EncodingTraits::Accumulator* 
out,
   int* out_values_decoded) {
-// We're going to decode up to `num_values - null_count` PLAIN values,
+// We're going to decode `num_values - null_count` PLAIN values,
 // and each value has a 4-byte length header that doesn't count for the
 // Arrow binary data length.
-int64_t estimated_data_length =
-std::max(0, len_ - 4 * (num_values - null_count));
+int64_t estimated_data_length = len_ - 4 * (num_values - null_count);
+if (ARROW_PREDICT_FALSE(estimated_data_length < 0)) {
+  return Status::Invalid("Invalid or truncated PLAIN-encoded BYTE_ARRAY 
data");
+}
 
 auto visit_binary_helper = [&](auto* helper) {
   int values_decoded = 0;
 
-  RETURN_NOT_OK(VisitBitRuns(
-  valid_bits, valid_bits_offset, num_values,
-  [&](int64_t position, int64_t run_length, bool is_valid) {
-if (is_valid) {
-  for (int64_t i = 0; i < run_length; ++i) {
-if (ARROW_PREDICT_FALSE(len_ < 4)) {
-  return Status::Invalid(
-  "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-auto value_len = SafeLoadAs(data_);
-if (ARROW_PREDICT_FALSE(value_len < 0 || value_len > len_ - 
4)) {
-  return Status::Invalid(
-  "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-RETURN_NOT_OK(
-helper->AppendValue(data_ + 4, value_len, 
estimated_data_length));
-auto increment = value_len + 4;
-data_ += increment;
-len_ -= increment;
-estimated_data_length -= value_len;
-DCHECK_GE(estimated_data_length, 0);
-  }
-  values_decoded += static_cast(run_length);
-  return Status::OK();
-} else {
-  return helper->AppendNulls(run_length);
+  auto visit_run = [&](int64_t position, int64_t run_length, bool 
is_valid) {

Review Comment:
   I pulled out the callback as a lambda because clang-format kept reindenting 
it anyway :-/



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]

2025-12-02 Thread via GitHub


pitrou commented on code in PR #48309:
URL: https://github.com/apache/arrow/pull/48309#discussion_r2582123299


##
cpp/src/parquet/decoder.cc:
##
@@ -763,34 +769,35 @@ class PlainByteArrayDecoder : public 
PlainDecoder {
 auto visit_binary_helper = [&](auto* helper) {
   int values_decoded = 0;
 
-  RETURN_NOT_OK(VisitBitRuns(
-  valid_bits, valid_bits_offset, num_values,
-  [&](int64_t position, int64_t run_length, bool is_valid) {
-if (is_valid) {
-  for (int64_t i = 0; i < run_length; ++i) {
-if (ARROW_PREDICT_FALSE(len_ < 4)) {
-  return Status::Invalid(
-  "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-auto value_len = SafeLoadAs(data_);
-if (ARROW_PREDICT_FALSE(value_len < 0 || value_len > len_ - 
4)) {
-  return Status::Invalid(
-  "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-RETURN_NOT_OK(
-helper->AppendValue(data_ + 4, value_len, 
estimated_data_length));
-auto increment = value_len + 4;
-data_ += increment;
-len_ -= increment;
-estimated_data_length -= value_len;
-DCHECK_GE(estimated_data_length, 0);
-  }
-  values_decoded += static_cast(run_length);
-  return Status::OK();
-} else {
-  return helper->AppendNulls(run_length);
-}
-  }));
+  RETURN_NOT_OK(
+  VisitBitRuns(valid_bits, valid_bits_offset, num_values,
+   [&](int64_t position, int64_t run_length, bool 
is_valid) {
+ if (is_valid) {
+   for (int64_t i = 0; i < run_length; ++i) {
+ if (ARROW_PREDICT_FALSE(len_ < 4)) {
+   return Status::Invalid(
+   "Invalid or truncated PLAIN-encoded 
BYTE_ARRAY data");
+ }
+ auto value_len = SafeLoadAs(data_);
+ if (ARROW_PREDICT_FALSE(value_len < 0 ||
+ value_len > 
estimated_data_length)) {

Review Comment:
   Sorry that clang-format decided to reindent this entire block of code... but 
the main logical change is in these two lines: `value_len > len_ - 4` is 
replaced with the stricter `value_len > estimated_data_length`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]

2025-12-02 Thread via GitHub


pitrou commented on code in PR #48309:
URL: https://github.com/apache/arrow/pull/48309#discussion_r2582123299


##
cpp/src/parquet/decoder.cc:
##
@@ -763,34 +769,35 @@ class PlainByteArrayDecoder : public 
PlainDecoder {
 auto visit_binary_helper = [&](auto* helper) {
   int values_decoded = 0;
 
-  RETURN_NOT_OK(VisitBitRuns(
-  valid_bits, valid_bits_offset, num_values,
-  [&](int64_t position, int64_t run_length, bool is_valid) {
-if (is_valid) {
-  for (int64_t i = 0; i < run_length; ++i) {
-if (ARROW_PREDICT_FALSE(len_ < 4)) {
-  return Status::Invalid(
-  "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-auto value_len = SafeLoadAs(data_);
-if (ARROW_PREDICT_FALSE(value_len < 0 || value_len > len_ - 
4)) {
-  return Status::Invalid(
-  "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-RETURN_NOT_OK(
-helper->AppendValue(data_ + 4, value_len, 
estimated_data_length));
-auto increment = value_len + 4;
-data_ += increment;
-len_ -= increment;
-estimated_data_length -= value_len;
-DCHECK_GE(estimated_data_length, 0);
-  }
-  values_decoded += static_cast(run_length);
-  return Status::OK();
-} else {
-  return helper->AppendNulls(run_length);
-}
-  }));
+  RETURN_NOT_OK(
+  VisitBitRuns(valid_bits, valid_bits_offset, num_values,
+   [&](int64_t position, int64_t run_length, bool 
is_valid) {
+ if (is_valid) {
+   for (int64_t i = 0; i < run_length; ++i) {
+ if (ARROW_PREDICT_FALSE(len_ < 4)) {
+   return Status::Invalid(
+   "Invalid or truncated PLAIN-encoded 
BYTE_ARRAY data");
+ }
+ auto value_len = SafeLoadAs(data_);
+ if (ARROW_PREDICT_FALSE(value_len < 0 ||
+ value_len > 
estimated_data_length)) {

Review Comment:
   Sorry that clang-format decided to reindent this entire block of code... but 
the only logical change is in these two lines: `value_len > len_ - 4` is 
replaced with the stricter `value_len > estimated_data_length`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]

2025-12-02 Thread via GitHub


github-actions[bot] commented on PR #48309:
URL: https://github.com/apache/arrow/pull/48309#issuecomment-3603129689

   Revision: 2c5dda8df9a3625898cdf9f1e5e032c9cb03acf7
   
   Submitted crossbow builds: [ursacomputing/crossbow @ 
actions-6006ee6dbc](https://github.com/ursacomputing/crossbow/branches/all?query=actions-6006ee6dbc)
   
   |Task|Status|
   ||--|
   |example-cpp-minimal-build-static|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-6006ee6dbc-github-example-cpp-minimal-build-static)](https://github.com/ursacomputing/crossbow/actions/runs/19867139288/job/56932922031)|
   |example-cpp-minimal-build-static-system-dependency|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-6006ee6dbc-github-example-cpp-minimal-build-static-system-dependency)](https://github.com/ursacomputing/crossbow/actions/runs/19867137618/job/56932916060)|
   |example-cpp-tutorial|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-6006ee6dbc-github-example-cpp-tutorial)](https://github.com/ursacomputing/crossbow/actions/runs/19867138395/job/56932918344)|
   |test-build-cpp-fuzz|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-6006ee6dbc-github-test-build-cpp-fuzz)](https://github.com/ursacomputing/crossbow/actions/runs/19867137625/job/56932916047)|
   |test-conda-cpp|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-6006ee6dbc-github-test-conda-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/19867137859/job/56932916743)|
   |test-conda-cpp-valgrind|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-6006ee6dbc-github-test-conda-cpp-valgrind)](https://github.com/ursacomputing/crossbow/actions/runs/19867138198/job/56932917940)|
   |test-cuda-cpp-ubuntu-22.04-cuda-11.7.1|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-6006ee6dbc-github-test-cuda-cpp-ubuntu-22.04-cuda-11.7.1)](https://github.com/ursacomputing/crossbow/actions/runs/19867138950/job/56932920427)|
   |test-cuda-cpp-ubuntu-24.04-cuda-13.0.2|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-6006ee6dbc-github-test-cuda-cpp-ubuntu-24.04-cuda-13.0.2)](https://github.com/ursacomputing/crossbow/actions/runs/19867137635/job/56932916353)|
   |test-debian-12-cpp-amd64|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-6006ee6dbc-github-test-debian-12-cpp-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/19867138357/job/56932918777)|
   |test-debian-12-cpp-i386|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-6006ee6dbc-github-test-debian-12-cpp-i386)](https://github.com/ursacomputing/crossbow/actions/runs/19867137322/job/56932915175)|
   |test-fedora-42-cpp|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-6006ee6dbc-github-test-fedora-42-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/19867137465/job/56932915433)|
   |test-ubuntu-22.04-cpp|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-6006ee6dbc-github-test-ubuntu-22.04-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/19867137376/job/56932915168)|
   |test-ubuntu-22.04-cpp-20|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-6006ee6dbc-github-test-ubuntu-22.04-cpp-20)](https://github.com/ursacomputing/crossbow/actions/runs/19867138870/job/56932920092)|
   |test-ubuntu-22.04-cpp-bundled|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-6006ee6dbc-github-test-ubuntu-22.04-cpp-bundled)](https://github.com/ursacomputing/crossbow/actions/runs/19867138233/job/56932918015)|
   |test-ubuntu-22.04-cpp-emscripten|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-6006ee6dbc-github-test-ubuntu-22.04-cpp-emscripten)](https://github.com/ursacomputing/crossbow/actions/runs/19867138563/job/56932919138)|
   |test-ubuntu-22.04-cpp-no-threading|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-6006ee6dbc-github-test-ubuntu-22.04-cpp-no-threading)](https://github.com/ursacomputing/crossbow/actions/runs/19867137123/job/56932914472)|
   |test-ubuntu-24.04-cpp|[![GitHub 
Actions](https://github.com/ursacomputing/cross

Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]

2025-12-02 Thread via GitHub


pitrou commented on code in PR #48309:
URL: https://github.com/apache/arrow/pull/48309#discussion_r2582123299


##
cpp/src/parquet/decoder.cc:
##
@@ -763,34 +769,35 @@ class PlainByteArrayDecoder : public 
PlainDecoder {
 auto visit_binary_helper = [&](auto* helper) {
   int values_decoded = 0;
 
-  RETURN_NOT_OK(VisitBitRuns(
-  valid_bits, valid_bits_offset, num_values,
-  [&](int64_t position, int64_t run_length, bool is_valid) {
-if (is_valid) {
-  for (int64_t i = 0; i < run_length; ++i) {
-if (ARROW_PREDICT_FALSE(len_ < 4)) {
-  return Status::Invalid(
-  "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-auto value_len = SafeLoadAs(data_);
-if (ARROW_PREDICT_FALSE(value_len < 0 || value_len > len_ - 
4)) {
-  return Status::Invalid(
-  "Invalid or truncated PLAIN-encoded BYTE_ARRAY data");
-}
-RETURN_NOT_OK(
-helper->AppendValue(data_ + 4, value_len, 
estimated_data_length));
-auto increment = value_len + 4;
-data_ += increment;
-len_ -= increment;
-estimated_data_length -= value_len;
-DCHECK_GE(estimated_data_length, 0);
-  }
-  values_decoded += static_cast(run_length);
-  return Status::OK();
-} else {
-  return helper->AppendNulls(run_length);
-}
-  }));
+  RETURN_NOT_OK(
+  VisitBitRuns(valid_bits, valid_bits_offset, num_values,
+   [&](int64_t position, int64_t run_length, bool 
is_valid) {
+ if (is_valid) {
+   for (int64_t i = 0; i < run_length; ++i) {
+ if (ARROW_PREDICT_FALSE(len_ < 4)) {
+   return Status::Invalid(
+   "Invalid or truncated PLAIN-encoded 
BYTE_ARRAY data");
+ }
+ auto value_len = SafeLoadAs(data_);
+ if (ARROW_PREDICT_FALSE(value_len < 0 ||
+ value_len > 
estimated_data_length)) {

Review Comment:
   Sorry that clang-format decided to reindent this entire block of code... but 
the main logical change is in these two lines: `value_len > len_ - 4` is 
replaced with the stricter `value_len > estimated_data_length`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-48308: [C++][Parquet] Fix potential crash when reading invalid Parquet data [arrow]

2025-12-02 Thread via GitHub


pitrou commented on PR #48309:
URL: https://github.com/apache/arrow/pull/48309#issuecomment-3603117871

   @github-actions crossbow submit -g cpp


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]