Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]
jinchengchenghh merged PR #11140: URL: https://github.com/apache/incubator-gluten/pull/11140 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]
jinchengchenghh commented on PR #11140: URL: https://github.com/apache/incubator-gluten/pull/11140#issuecomment-3601260991 Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]
jiangjiangtian commented on code in PR #11140:
URL:
https://github.com/apache/incubator-gluten/pull/11140#discussion_r2579811137
##
cpp/velox/utils/VeloxBatchResizer.cc:
##
@@ -78,22 +80,34 @@ std::shared_ptr VeloxBatchResizer::next() {
if (cb->numRows() < minOutputBatchSize_) {
auto vb = VeloxColumnarBatch::from(pool_, cb);
auto rv = vb->getRowVector();
+uint64_t numBytes = cb->numBytes();
Review Comment:
Done. Thanks
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]
jiangjiangtian commented on code in PR #11140:
URL:
https://github.com/apache/incubator-gluten/pull/11140#discussion_r2579811137
##
cpp/velox/utils/VeloxBatchResizer.cc:
##
@@ -78,22 +80,34 @@ std::shared_ptr VeloxBatchResizer::next() {
if (cb->numRows() < minOutputBatchSize_) {
auto vb = VeloxColumnarBatch::from(pool_, cb);
auto rv = vb->getRowVector();
+uint64_t numBytes = cb->numBytes();
Review Comment:
Done
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]
jinchengchenghh commented on code in PR #11140:
URL:
https://github.com/apache/incubator-gluten/pull/11140#discussion_r2576837929
##
cpp/velox/utils/VeloxBatchResizer.cc:
##
@@ -78,22 +80,34 @@ std::shared_ptr VeloxBatchResizer::next() {
if (cb->numRows() < minOutputBatchSize_) {
auto vb = VeloxColumnarBatch::from(pool_, cb);
auto rv = vb->getRowVector();
+uint64_t numBytes = cb->numBytes();
Review Comment:
Move the check with L80 `if (cb->numRows() < minOutputBatchSize_ && numBytes
<= preferredBatchBytes_)`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]
jiangjiangtian commented on code in PR #11140:
URL:
https://github.com/apache/incubator-gluten/pull/11140#discussion_r2576828677
##
cpp/velox/utils/VeloxBatchResizer.cc:
##
@@ -78,22 +80,41 @@ std::shared_ptr VeloxBatchResizer::next() {
if (cb->numRows() < minOutputBatchSize_) {
auto vb = VeloxColumnarBatch::from(pool_, cb);
auto rv = vb->getRowVector();
+auto vector = std::static_pointer_cast(rv);
+uint64_t numBytes = cb->numBytes();
+if (numBytes > preferredBatchBytes_) {
+ // Input batch is too large. Just return it as is.
+ return cb;
+}
auto buffer = facebook::velox::RowVector::createEmpty(rv->type(), pool_);
buffer->append(rv.get());
-for (auto nextCb = in_->next(); nextCb != nullptr; nextCb = in_->next()) {
- auto nextVb = VeloxColumnarBatch::from(pool_, nextCb);
- auto nextRv = nextVb->getRowVector();
- if (buffer->size() + nextRv->size() > maxOutputBatchSize_) {
+// Call reset manully to potentially release memory
+vector.reset();
Review Comment:
I remove the four lines of code here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]
jiangjiangtian commented on code in PR #11140:
URL:
https://github.com/apache/incubator-gluten/pull/11140#discussion_r2576826119
##
backends-velox/src/test/scala/org/apache/gluten/execution/VeloxTPCHSuite.scala:
##
@@ -73,6 +73,7 @@ abstract class VeloxTPCHSuite extends VeloxTPCHTableSupport {
// for unexpected blank
.replaceAll("Scan parquet ", "Scan parquet")
// Spark QueryStageExec will take it's id as argument, replace it with X
+ .replaceAll("Arguments: [0-9]+, [0-9]+, [0-9]+", "Arguments: X, X")
Review Comment:
Thanks
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]
jinchengchenghh commented on code in PR #11140:
URL:
https://github.com/apache/incubator-gluten/pull/11140#discussion_r2576794160
##
backends-velox/src/test/scala/org/apache/gluten/execution/VeloxTPCHSuite.scala:
##
@@ -73,6 +73,7 @@ abstract class VeloxTPCHSuite extends VeloxTPCHTableSupport {
// for unexpected blank
.replaceAll("Scan parquet ", "Scan parquet")
// Spark QueryStageExec will take it's id as argument, replace it with X
+ .replaceAll("Arguments: [0-9]+, [0-9]+, [0-9]+", "Arguments: X, X")
Review Comment:
X, X, X
##
cpp/velox/utils/VeloxBatchResizer.cc:
##
@@ -78,22 +80,41 @@ std::shared_ptr VeloxBatchResizer::next() {
if (cb->numRows() < minOutputBatchSize_) {
auto vb = VeloxColumnarBatch::from(pool_, cb);
auto rv = vb->getRowVector();
+auto vector = std::static_pointer_cast(rv);
+uint64_t numBytes = cb->numBytes();
+if (numBytes > preferredBatchBytes_) {
+ // Input batch is too large. Just return it as is.
+ return cb;
+}
auto buffer = facebook::velox::RowVector::createEmpty(rv->type(), pool_);
buffer->append(rv.get());
-for (auto nextCb = in_->next(); nextCb != nullptr; nextCb = in_->next()) {
- auto nextVb = VeloxColumnarBatch::from(pool_, nextCb);
- auto nextRv = nextVb->getRowVector();
- if (buffer->size() + nextRv->size() > maxOutputBatchSize_) {
+// Call reset manully to potentially release memory
+vector.reset();
Review Comment:
And also remove following rv batch.reset
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]
jiangjiangtian commented on code in PR #11140:
URL:
https://github.com/apache/incubator-gluten/pull/11140#discussion_r2576619641
##
cpp/velox/utils/VeloxBatchResizer.cc:
##
@@ -78,22 +80,41 @@ std::shared_ptr VeloxBatchResizer::next() {
if (cb->numRows() < minOutputBatchSize_) {
auto vb = VeloxColumnarBatch::from(pool_, cb);
auto rv = vb->getRowVector();
+auto vector = std::static_pointer_cast(rv);
+uint64_t numBytes = cb->numBytes();
Review Comment:
I get it. Thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]
jinchengchenghh commented on code in PR #11140:
URL:
https://github.com/apache/incubator-gluten/pull/11140#discussion_r2576441994
##
cpp/velox/utils/VeloxBatchResizer.cc:
##
@@ -78,22 +80,41 @@ std::shared_ptr VeloxBatchResizer::next() {
if (cb->numRows() < minOutputBatchSize_) {
auto vb = VeloxColumnarBatch::from(pool_, cb);
auto rv = vb->getRowVector();
+auto vector = std::static_pointer_cast(rv);
Review Comment:
I don't see the vector usage
##
cpp/velox/utils/VeloxBatchResizer.cc:
##
@@ -78,22 +80,41 @@ std::shared_ptr VeloxBatchResizer::next() {
if (cb->numRows() < minOutputBatchSize_) {
auto vb = VeloxColumnarBatch::from(pool_, cb);
auto rv = vb->getRowVector();
+auto vector = std::static_pointer_cast(rv);
+uint64_t numBytes = cb->numBytes();
+if (numBytes > preferredBatchBytes_) {
+ // Input batch is too large. Just return it as is.
+ return cb;
+}
auto buffer = facebook::velox::RowVector::createEmpty(rv->type(), pool_);
buffer->append(rv.get());
-for (auto nextCb = in_->next(); nextCb != nullptr; nextCb = in_->next()) {
- auto nextVb = VeloxColumnarBatch::from(pool_, nextCb);
- auto nextRv = nextVb->getRowVector();
- if (buffer->size() + nextRv->size() > maxOutputBatchSize_) {
+// Call reset manully to potentially release memory
+vector.reset();
Review Comment:
L114 is enough, you reuse the variables in following for loop or release in
the end
##
cpp/velox/utils/VeloxBatchResizer.cc:
##
@@ -78,22 +80,41 @@ std::shared_ptr VeloxBatchResizer::next() {
if (cb->numRows() < minOutputBatchSize_) {
auto vb = VeloxColumnarBatch::from(pool_, cb);
auto rv = vb->getRowVector();
+auto vector = std::static_pointer_cast(rv);
+uint64_t numBytes = cb->numBytes();
Review Comment:
Update here to only load the lazy vector, we don't need to flat dictionary
vector
https://github.com/apache/incubator-gluten/blob/7d491d5b373f77a3aa9a1bad14c34acb8987cfb5/cpp/velox/memory/VeloxColumnarBatch.cc#L82
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]
jinchengchenghh commented on code in PR #11140:
URL:
https://github.com/apache/incubator-gluten/pull/11140#discussion_r2568000854
##
backends-velox/src/main/scala/org/apache/gluten/extension/AppendBatchResizeForShuffleInputAndOutput.scala:
##
@@ -40,19 +40,20 @@ case class AppendBatchResizeForShuffleInputAndOutput()
extends Rule[SparkPlan] {
}
val range = VeloxConfig.get.veloxResizeBatchesShuffleInputOutputRange
+val memoryThreshold = VeloxConfig.get.veloxPreferredBatchBytes
plan.transformUp {
case shuffle: ColumnarShuffleExchangeExec
if resizeBatchesShuffleInputEnabled &&
shuffle.shuffleWriterType.requiresResizingShuffleInput =>
val appendBatches =
- VeloxResizeBatchesExec(shuffle.child, range.min, range.max)
+ VeloxResizeBatchesExec(shuffle.child, range.min, range.max,
memoryThreshold)
Review Comment:
Please use the original config name preferredBatchBytes instead of a new name
##
cpp/velox/utils/VeloxBatchResizer.cc:
##
@@ -78,22 +80,43 @@ std::shared_ptr VeloxBatchResizer::next() {
if (cb->numRows() < minOutputBatchSize_) {
Review Comment:
Please use cb->numBytes() to estimate the batch size, and update the
numBytes(), get the numBytes does not need to flat the vector, just load the
vector
##
backends-velox/src/test/resources/tpch-approved-plan/v1-bhj-ras/spark32/1.txt:
##
@@ -60,7 +60,7 @@ Arguments: false
(7) VeloxResizeBatches
Input [18]: [hash_partition_key#X, l_returnflag#X, l_linestatus#X, sum#X,
isEmpty#X, sum#X, isEmpty#X, sum#X, isEmpty#X, sum#X, isEmpty#X, sum#X,
count#X, sum#X, count#X, sum#X, count#X, count#X]
-Arguments: X, X
+Arguments: X, X, 10485760
Review Comment:
Please change it to X, we don't need to update it if we change the default
value in the future
##
backends-velox/src/main/scala/org/apache/gluten/config/VeloxConfig.scala:
##
@@ -680,6 +682,15 @@ object VeloxConfig extends ConfigRegistry {
.bytesConf(ByteUnit.BYTE)
.createWithDefaultString("10MB")
+ val ENABLE_LIMIT_BATCH_RESIZER = {
Review Comment:
This can be a default behavior, please remove this config, all the operator
should respect `preferredBatchBytes`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]
jiangjiangtian commented on PR #11140: URL: https://github.com/apache/incubator-gluten/pull/11140#issuecomment-3561612538 > Could you add a new configuration for `memoryThreshold` and allow disabling it? I'm concerned that estimating batch size might cause performance overhead. Sure. Thanks for your advice. Actually, I have tested 300 jobs and the overall execution time decrease by about 1.1%. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [GLUTEN-10892][VL] Use `veloxPreferredBatchBytes` to control the max size of memory of batches combined [incubator-gluten]
wForget commented on PR #11140: URL: https://github.com/apache/incubator-gluten/pull/11140#issuecomment-3561511689 Could you add a new configuration for `memoryThreshold` and allow disabling it? I'm concerned that estimating batch size might cause performance overhead. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
