Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]

2025-12-02 Thread via GitHub


jinchengchenghh merged PR #11140:
URL: https://github.com/apache/incubator-gluten/pull/11140


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]

2025-12-02 Thread via GitHub


jinchengchenghh commented on PR #11140:
URL: 
https://github.com/apache/incubator-gluten/pull/11140#issuecomment-3601260991

   Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]

2025-12-01 Thread via GitHub


jiangjiangtian commented on code in PR #11140:
URL: 
https://github.com/apache/incubator-gluten/pull/11140#discussion_r2579811137


##
cpp/velox/utils/VeloxBatchResizer.cc:
##
@@ -78,22 +80,34 @@ std::shared_ptr VeloxBatchResizer::next() {
   if (cb->numRows() < minOutputBatchSize_) {
 auto vb = VeloxColumnarBatch::from(pool_, cb);
 auto rv = vb->getRowVector();
+uint64_t numBytes = cb->numBytes();

Review Comment:
   Done. Thanks



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]

2025-12-01 Thread via GitHub


jiangjiangtian commented on code in PR #11140:
URL: 
https://github.com/apache/incubator-gluten/pull/11140#discussion_r2579811137


##
cpp/velox/utils/VeloxBatchResizer.cc:
##
@@ -78,22 +80,34 @@ std::shared_ptr VeloxBatchResizer::next() {
   if (cb->numRows() < minOutputBatchSize_) {
 auto vb = VeloxColumnarBatch::from(pool_, cb);
 auto rv = vb->getRowVector();
+uint64_t numBytes = cb->numBytes();

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]

2025-12-01 Thread via GitHub


jinchengchenghh commented on code in PR #11140:
URL: 
https://github.com/apache/incubator-gluten/pull/11140#discussion_r2576837929


##
cpp/velox/utils/VeloxBatchResizer.cc:
##
@@ -78,22 +80,34 @@ std::shared_ptr VeloxBatchResizer::next() {
   if (cb->numRows() < minOutputBatchSize_) {
 auto vb = VeloxColumnarBatch::from(pool_, cb);
 auto rv = vb->getRowVector();
+uint64_t numBytes = cb->numBytes();

Review Comment:
   Move the check with L80 `if (cb->numRows() < minOutputBatchSize_ && numBytes 
<= preferredBatchBytes_)`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]

2025-12-01 Thread via GitHub


jiangjiangtian commented on code in PR #11140:
URL: 
https://github.com/apache/incubator-gluten/pull/11140#discussion_r2576828677


##
cpp/velox/utils/VeloxBatchResizer.cc:
##
@@ -78,22 +80,41 @@ std::shared_ptr VeloxBatchResizer::next() {
   if (cb->numRows() < minOutputBatchSize_) {
 auto vb = VeloxColumnarBatch::from(pool_, cb);
 auto rv = vb->getRowVector();
+auto vector = std::static_pointer_cast(rv);
+uint64_t numBytes = cb->numBytes();
+if (numBytes > preferredBatchBytes_) {
+  // Input batch is too large. Just return it as is.
+  return cb;
+}
 auto buffer = facebook::velox::RowVector::createEmpty(rv->type(), pool_);
 buffer->append(rv.get());
 
-for (auto nextCb = in_->next(); nextCb != nullptr; nextCb = in_->next()) {
-  auto nextVb = VeloxColumnarBatch::from(pool_, nextCb);
-  auto nextRv = nextVb->getRowVector();
-  if (buffer->size() + nextRv->size() > maxOutputBatchSize_) {
+// Call reset manully to potentially release memory
+vector.reset();

Review Comment:
   I remove the four lines of code here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]

2025-12-01 Thread via GitHub


jiangjiangtian commented on code in PR #11140:
URL: 
https://github.com/apache/incubator-gluten/pull/11140#discussion_r2576826119


##
backends-velox/src/test/scala/org/apache/gluten/execution/VeloxTPCHSuite.scala:
##
@@ -73,6 +73,7 @@ abstract class VeloxTPCHSuite extends VeloxTPCHTableSupport {
   // for unexpected blank
   .replaceAll("Scan parquet ", "Scan parquet")
   // Spark QueryStageExec will take it's id as argument, replace it with X
+  .replaceAll("Arguments: [0-9]+, [0-9]+, [0-9]+", "Arguments: X, X")

Review Comment:
   Thanks



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]

2025-12-01 Thread via GitHub


jinchengchenghh commented on code in PR #11140:
URL: 
https://github.com/apache/incubator-gluten/pull/11140#discussion_r2576794160


##
backends-velox/src/test/scala/org/apache/gluten/execution/VeloxTPCHSuite.scala:
##
@@ -73,6 +73,7 @@ abstract class VeloxTPCHSuite extends VeloxTPCHTableSupport {
   // for unexpected blank
   .replaceAll("Scan parquet ", "Scan parquet")
   // Spark QueryStageExec will take it's id as argument, replace it with X
+  .replaceAll("Arguments: [0-9]+, [0-9]+, [0-9]+", "Arguments: X, X")

Review Comment:
   X, X, X



##
cpp/velox/utils/VeloxBatchResizer.cc:
##
@@ -78,22 +80,41 @@ std::shared_ptr VeloxBatchResizer::next() {
   if (cb->numRows() < minOutputBatchSize_) {
 auto vb = VeloxColumnarBatch::from(pool_, cb);
 auto rv = vb->getRowVector();
+auto vector = std::static_pointer_cast(rv);
+uint64_t numBytes = cb->numBytes();
+if (numBytes > preferredBatchBytes_) {
+  // Input batch is too large. Just return it as is.
+  return cb;
+}
 auto buffer = facebook::velox::RowVector::createEmpty(rv->type(), pool_);
 buffer->append(rv.get());
 
-for (auto nextCb = in_->next(); nextCb != nullptr; nextCb = in_->next()) {
-  auto nextVb = VeloxColumnarBatch::from(pool_, nextCb);
-  auto nextRv = nextVb->getRowVector();
-  if (buffer->size() + nextRv->size() > maxOutputBatchSize_) {
+// Call reset manully to potentially release memory
+vector.reset();

Review Comment:
   And also remove following rv batch.reset



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]

2025-12-01 Thread via GitHub


jiangjiangtian commented on code in PR #11140:
URL: 
https://github.com/apache/incubator-gluten/pull/11140#discussion_r2576619641


##
cpp/velox/utils/VeloxBatchResizer.cc:
##
@@ -78,22 +80,41 @@ std::shared_ptr VeloxBatchResizer::next() {
   if (cb->numRows() < minOutputBatchSize_) {
 auto vb = VeloxColumnarBatch::from(pool_, cb);
 auto rv = vb->getRowVector();
+auto vector = std::static_pointer_cast(rv);
+uint64_t numBytes = cb->numBytes();

Review Comment:
   I get it. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]

2025-12-01 Thread via GitHub


jinchengchenghh commented on code in PR #11140:
URL: 
https://github.com/apache/incubator-gluten/pull/11140#discussion_r2576441994


##
cpp/velox/utils/VeloxBatchResizer.cc:
##
@@ -78,22 +80,41 @@ std::shared_ptr VeloxBatchResizer::next() {
   if (cb->numRows() < minOutputBatchSize_) {
 auto vb = VeloxColumnarBatch::from(pool_, cb);
 auto rv = vb->getRowVector();
+auto vector = std::static_pointer_cast(rv);

Review Comment:
   I don't see the vector usage



##
cpp/velox/utils/VeloxBatchResizer.cc:
##
@@ -78,22 +80,41 @@ std::shared_ptr VeloxBatchResizer::next() {
   if (cb->numRows() < minOutputBatchSize_) {
 auto vb = VeloxColumnarBatch::from(pool_, cb);
 auto rv = vb->getRowVector();
+auto vector = std::static_pointer_cast(rv);
+uint64_t numBytes = cb->numBytes();
+if (numBytes > preferredBatchBytes_) {
+  // Input batch is too large. Just return it as is.
+  return cb;
+}
 auto buffer = facebook::velox::RowVector::createEmpty(rv->type(), pool_);
 buffer->append(rv.get());
 
-for (auto nextCb = in_->next(); nextCb != nullptr; nextCb = in_->next()) {
-  auto nextVb = VeloxColumnarBatch::from(pool_, nextCb);
-  auto nextRv = nextVb->getRowVector();
-  if (buffer->size() + nextRv->size() > maxOutputBatchSize_) {
+// Call reset manully to potentially release memory
+vector.reset();

Review Comment:
   L114 is enough, you reuse the variables in following for loop or release in 
the end



##
cpp/velox/utils/VeloxBatchResizer.cc:
##
@@ -78,22 +80,41 @@ std::shared_ptr VeloxBatchResizer::next() {
   if (cb->numRows() < minOutputBatchSize_) {
 auto vb = VeloxColumnarBatch::from(pool_, cb);
 auto rv = vb->getRowVector();
+auto vector = std::static_pointer_cast(rv);
+uint64_t numBytes = cb->numBytes();

Review Comment:
   Update here to only load the lazy vector, we don't need to flat dictionary 
vector   
https://github.com/apache/incubator-gluten/blob/7d491d5b373f77a3aa9a1bad14c34acb8987cfb5/cpp/velox/memory/VeloxColumnarBatch.cc#L82



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]

2025-11-27 Thread via GitHub


jinchengchenghh commented on code in PR #11140:
URL: 
https://github.com/apache/incubator-gluten/pull/11140#discussion_r2568000854


##
backends-velox/src/main/scala/org/apache/gluten/extension/AppendBatchResizeForShuffleInputAndOutput.scala:
##
@@ -40,19 +40,20 @@ case class AppendBatchResizeForShuffleInputAndOutput() 
extends Rule[SparkPlan] {
 }
 
 val range = VeloxConfig.get.veloxResizeBatchesShuffleInputOutputRange
+val memoryThreshold = VeloxConfig.get.veloxPreferredBatchBytes
 plan.transformUp {
   case shuffle: ColumnarShuffleExchangeExec
   if resizeBatchesShuffleInputEnabled &&
 shuffle.shuffleWriterType.requiresResizingShuffleInput =>
 val appendBatches =
-  VeloxResizeBatchesExec(shuffle.child, range.min, range.max)
+  VeloxResizeBatchesExec(shuffle.child, range.min, range.max, 
memoryThreshold)

Review Comment:
   Please use the original config name preferredBatchBytes instead of a new name



##
cpp/velox/utils/VeloxBatchResizer.cc:
##
@@ -78,22 +80,43 @@ std::shared_ptr VeloxBatchResizer::next() {
   if (cb->numRows() < minOutputBatchSize_) {

Review Comment:
   Please use cb->numBytes() to estimate the batch size, and update the 
numBytes(), get the numBytes does not need to flat the vector, just load the 
vector



##
backends-velox/src/test/resources/tpch-approved-plan/v1-bhj-ras/spark32/1.txt:
##
@@ -60,7 +60,7 @@ Arguments: false
 
 (7) VeloxResizeBatches
 Input [18]: [hash_partition_key#X, l_returnflag#X, l_linestatus#X, sum#X, 
isEmpty#X, sum#X, isEmpty#X, sum#X, isEmpty#X, sum#X, isEmpty#X, sum#X, 
count#X, sum#X, count#X, sum#X, count#X, count#X]
-Arguments: X, X
+Arguments: X, X, 10485760

Review Comment:
   Please change it to X, we don't need to update it if we change the default 
value in the future



##
backends-velox/src/main/scala/org/apache/gluten/config/VeloxConfig.scala:
##
@@ -680,6 +682,15 @@ object VeloxConfig extends ConfigRegistry {
   .bytesConf(ByteUnit.BYTE)
   .createWithDefaultString("10MB")
 
+  val ENABLE_LIMIT_BATCH_RESIZER = {

Review Comment:
   This can be a default behavior, please remove this config, all the operator 
should respect `preferredBatchBytes`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]

2025-11-20 Thread via GitHub


jiangjiangtian commented on PR #11140:
URL: 
https://github.com/apache/incubator-gluten/pull/11140#issuecomment-3561612538

   > Could you add a new configuration for `memoryThreshold` and allow 
disabling it? I'm concerned that estimating batch size might cause performance 
overhead.
   
   Sure. Thanks for your advice.
   Actually, I have tested 300 jobs and the overall execution time decrease by 
about 1.1%.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]

2025-11-20 Thread via GitHub


wForget commented on PR #11140:
URL: 
https://github.com/apache/incubator-gluten/pull/11140#issuecomment-3561511689

   Could you add a new configuration for `memoryThreshold` and allow disabling 
it? I'm concerned that estimating batch size might cause performance overhead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]