Re: [PR] [GLUTEN-9475][VL] Serialize ColumnarBatch one by one when broadcasting [incubator-gluten]

2025-05-07 Thread via GitHub


xinghuayu007 commented on code in PR #9521:
URL: https://github.com/apache/incubator-gluten/pull/9521#discussion_r2078808725


##
backends-velox/src/main/scala/org/apache/spark/sql/execution/BroadcastUtils.scala:
##
@@ -172,19 +172,21 @@ object BroadcastUtils {
 if (filtered.isEmpty) {
   return ColumnarBatchSerializeResult.EMPTY
 }
-val handleArray =
-  filtered.map(b => 
ColumnarBatches.getNativeHandle(BackendsApiManager.getBackendName, b))
-val serializeResult =
-  try {
-ColumnarBatchSerializerJniWrapper
-  .create(
-Runtimes
+var rowNums = 0
+val values = filtered.map(
+  b => {
+val handle = 
ColumnarBatches.getNativeHandle(BackendsApiManager.getBackendName, b)
+rowNums += b.numRows()
+try {
+  ColumnarBatchSerializerJniWrapper
+.create(Runtimes
   .contextInstance(BackendsApiManager.getBackendName, 
"BroadcastUtils#serializeStream"))
-  .serialize(handleArray)
-  } finally {
-filtered.foreach(ColumnarBatches.release)
-  }
-serializeResult
+.serialize(handle)
+} finally {
+  ColumnarBatches.release(b)

Review Comment:
   Thanks for your explains.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [GLUTEN-9475][VL] Serialize ColumnarBatch one by one when broadcasting [incubator-gluten]

2025-05-07 Thread via GitHub


xinghuayu007 commented on PR #9521:
URL: 
https://github.com/apache/incubator-gluten/pull/9521#issuecomment-2861508117

   > > Please add some unit tests to prove this feature's correctness.
   > 
   > Do you mean I need to create a query with broadcast to verify it doesn't 
result in incorrect results? Currently, the unit test cases should already have 
many queries with broadcast.
   
   ok


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [GLUTEN-9475][VL] Serialize ColumnarBatch one by one when broadcasting [incubator-gluten]

2025-05-07 Thread via GitHub


wForget commented on code in PR #9521:
URL: https://github.com/apache/incubator-gluten/pull/9521#discussion_r2077330977


##
backends-velox/src/main/scala/org/apache/spark/sql/execution/BroadcastUtils.scala:
##
@@ -172,19 +172,21 @@ object BroadcastUtils {
 if (filtered.isEmpty) {
   return ColumnarBatchSerializeResult.EMPTY
 }
-val handleArray =
-  filtered.map(b => 
ColumnarBatches.getNativeHandle(BackendsApiManager.getBackendName, b))
-val serializeResult =
-  try {
-ColumnarBatchSerializerJniWrapper
-  .create(
-Runtimes
+var rowNums = 0
+val values = filtered.map(
+  b => {
+val handle = 
ColumnarBatches.getNativeHandle(BackendsApiManager.getBackendName, b)
+rowNums += b.numRows()
+try {
+  ColumnarBatchSerializerJniWrapper
+.create(Runtimes
   .contextInstance(BackendsApiManager.getBackendName, 
"BroadcastUtils#serializeStream"))
-  .serialize(handleArray)
-  } finally {
-filtered.foreach(ColumnarBatches.release)
-  }
-serializeResult
+.serialize(handle)
+} finally {
+  ColumnarBatches.release(b)

Review Comment:
   Yes, there are two buffers in VeloxColumnarBatchSerializer to save 
serialized data. If we serialize many columnarbatches at once, it may reserve a 
large offheap memory:
   
   
https://github.com/apache/incubator-gluten/blob/c3aacfcd2593cda65e54a18b024b68cad6d0ade1/cpp/velox/operators/serializer/VeloxColumnarBatchSerializer.cc#L64
   
   
https://github.com/apache/incubator-gluten/blob/c3aacfcd2593cda65e54a18b024b68cad6d0ade1/cpp/velox/operators/serializer/VeloxColumnarBatchSerializer.cc#L78



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [GLUTEN-9475][VL] Serialize ColumnarBatch one by one when broadcasting [incubator-gluten]

2025-05-07 Thread via GitHub


wForget commented on PR #9521:
URL: 
https://github.com/apache/incubator-gluten/pull/9521#issuecomment-2858062732

   > Please add some unit tests to prove this feature's correctness.
   
   Do you mean I need to create a query with broadcast to verify it doesn't 
result in incorrect results? Currently, the unit test cases should already have 
many queries with broadcast.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [GLUTEN-9475][VL] Serialize ColumnarBatch one by one when broadcasting [incubator-gluten]

2025-05-07 Thread via GitHub


xinghuayu007 commented on code in PR #9521:
URL: https://github.com/apache/incubator-gluten/pull/9521#discussion_r2076969635


##
backends-velox/src/main/scala/org/apache/spark/sql/execution/BroadcastUtils.scala:
##
@@ -172,19 +172,21 @@ object BroadcastUtils {
 if (filtered.isEmpty) {
   return ColumnarBatchSerializeResult.EMPTY
 }
-val handleArray =
-  filtered.map(b => 
ColumnarBatches.getNativeHandle(BackendsApiManager.getBackendName, b))
-val serializeResult =
-  try {
-ColumnarBatchSerializerJniWrapper
-  .create(
-Runtimes
+var rowNums = 0
+val values = filtered.map(
+  b => {
+val handle = 
ColumnarBatches.getNativeHandle(BackendsApiManager.getBackendName, b)
+rowNums += b.numRows()
+try {
+  ColumnarBatchSerializerJniWrapper
+.create(Runtimes
   .contextInstance(BackendsApiManager.getBackendName, 
"BroadcastUtils#serializeStream"))
-  .serialize(handleArray)
-  } finally {
-filtered.foreach(ColumnarBatches.release)
-  }
-serializeResult
+.serialize(handle)
+} finally {
+  ColumnarBatches.release(b)

Review Comment:
   So here release 'b' every loop can reduce memory consumption?



##
backends-velox/src/main/scala/org/apache/spark/sql/execution/BroadcastUtils.scala:
##
@@ -172,19 +172,21 @@ object BroadcastUtils {
 if (filtered.isEmpty) {
   return ColumnarBatchSerializeResult.EMPTY
 }
-val handleArray =
-  filtered.map(b => 
ColumnarBatches.getNativeHandle(BackendsApiManager.getBackendName, b))
-val serializeResult =
-  try {
-ColumnarBatchSerializerJniWrapper
-  .create(
-Runtimes
+var rowNums = 0
+val values = filtered.map(
+  b => {
+val handle = 
ColumnarBatches.getNativeHandle(BackendsApiManager.getBackendName, b)
+rowNums += b.numRows()
+try {
+  ColumnarBatchSerializerJniWrapper
+.create(Runtimes
   .contextInstance(BackendsApiManager.getBackendName, 
"BroadcastUtils#serializeStream"))
-  .serialize(handleArray)
-  } finally {
-filtered.foreach(ColumnarBatches.release)
-  }
-serializeResult
+.serialize(handle)
+} finally {
+  ColumnarBatches.release(b)

Review Comment:
   So release 'b' every loop can reduce memory consumption?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [GLUTEN-9475][VL] Serialize ColumnarBatch one by one when broadcasting [incubator-gluten]

2025-05-07 Thread via GitHub


xinghuayu007 commented on code in PR #9521:
URL: https://github.com/apache/incubator-gluten/pull/9521#discussion_r2076969635


##
backends-velox/src/main/scala/org/apache/spark/sql/execution/BroadcastUtils.scala:
##
@@ -172,19 +172,21 @@ object BroadcastUtils {
 if (filtered.isEmpty) {
   return ColumnarBatchSerializeResult.EMPTY
 }
-val handleArray =
-  filtered.map(b => 
ColumnarBatches.getNativeHandle(BackendsApiManager.getBackendName, b))
-val serializeResult =
-  try {
-ColumnarBatchSerializerJniWrapper
-  .create(
-Runtimes
+var rowNums = 0
+val values = filtered.map(
+  b => {
+val handle = 
ColumnarBatches.getNativeHandle(BackendsApiManager.getBackendName, b)
+rowNums += b.numRows()
+try {
+  ColumnarBatchSerializerJniWrapper
+.create(Runtimes
   .contextInstance(BackendsApiManager.getBackendName, 
"BroadcastUtils#serializeStream"))
-  .serialize(handleArray)
-  } finally {
-filtered.foreach(ColumnarBatches.release)
-  }
-serializeResult
+.serialize(handle)
+} finally {
+  ColumnarBatches.release(b)

Review Comment:
   So here release 'b' every loop can reduce memory consumption?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [GLUTEN-9475][VL] Serialize ColumnarBatch one by one when broadcasting [incubator-gluten]

2025-05-07 Thread via GitHub


xinghuayu007 commented on PR #9521:
URL: 
https://github.com/apache/incubator-gluten/pull/9521#issuecomment-2857385477

   Please add some unit tests to prove this feature's correctness.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [GLUTEN-9475][VL] Serialize ColumnarBatch one by one when broadcasting [incubator-gluten]

2025-05-06 Thread via GitHub


wForget commented on PR #9521:
URL: 
https://github.com/apache/incubator-gluten/pull/9521#issuecomment-2856955170

   @zhztheplayer @PHILO-HE  Could you please take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [GLUTEN-9475][VL] Serialize ColumnarBatch one by one when broadcasting [incubator-gluten]

2025-05-06 Thread via GitHub


wForget commented on PR #9521:
URL: 
https://github.com/apache/incubator-gluten/pull/9521#issuecomment-2856948823

   Benchmark testing did not find performance regression: 
https://github.com/wForget/benchmarks-spark-native/actions/runs/14861256989
   
   First: 
https://github.com/apache/incubator-gluten/commit/a783515e30ebfc99bb22c5222d8e2a98f92d9e63
   Second: https://github.com/wForget/gluten/tree/GLUTEN-9475
   
   
![tpcds_allqueries](https://github.com/user-attachments/assets/300e66af-1590-4cf0-b4c4-a4c4e251ba84)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [GLUTEN-9475][VL] Serialize ColumnarBatch one by one when broadcasting [incubator-gluten]

2025-05-05 Thread via GitHub


github-actions[bot] commented on PR #9521:
URL: 
https://github.com/apache/incubator-gluten/pull/9521#issuecomment-2853348948

   https://github.com/apache/incubator-gluten/issues/9475


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]