Re: [PR] [GLUTEN-9475][VL] Serialize ColumnarBatch one by one when broadcasting [incubator-gluten]
xinghuayu007 commented on code in PR #9521:
URL: https://github.com/apache/incubator-gluten/pull/9521#discussion_r2078808725
##
backends-velox/src/main/scala/org/apache/spark/sql/execution/BroadcastUtils.scala:
##
@@ -172,19 +172,21 @@ object BroadcastUtils {
if (filtered.isEmpty) {
return ColumnarBatchSerializeResult.EMPTY
}
-val handleArray =
- filtered.map(b =>
ColumnarBatches.getNativeHandle(BackendsApiManager.getBackendName, b))
-val serializeResult =
- try {
-ColumnarBatchSerializerJniWrapper
- .create(
-Runtimes
+var rowNums = 0
+val values = filtered.map(
+ b => {
+val handle =
ColumnarBatches.getNativeHandle(BackendsApiManager.getBackendName, b)
+rowNums += b.numRows()
+try {
+ ColumnarBatchSerializerJniWrapper
+.create(Runtimes
.contextInstance(BackendsApiManager.getBackendName,
"BroadcastUtils#serializeStream"))
- .serialize(handleArray)
- } finally {
-filtered.foreach(ColumnarBatches.release)
- }
-serializeResult
+.serialize(handle)
+} finally {
+ ColumnarBatches.release(b)
Review Comment:
Thanks for your explains.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] [GLUTEN-9475][VL] Serialize ColumnarBatch one by one when broadcasting [incubator-gluten]
xinghuayu007 commented on PR #9521: URL: https://github.com/apache/incubator-gluten/pull/9521#issuecomment-2861508117 > > Please add some unit tests to prove this feature's correctness. > > Do you mean I need to create a query with broadcast to verify it doesn't result in incorrect results? Currently, the unit test cases should already have many queries with broadcast. ok -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [GLUTEN-9475][VL] Serialize ColumnarBatch one by one when broadcasting [incubator-gluten]
wForget commented on code in PR #9521:
URL: https://github.com/apache/incubator-gluten/pull/9521#discussion_r2077330977
##
backends-velox/src/main/scala/org/apache/spark/sql/execution/BroadcastUtils.scala:
##
@@ -172,19 +172,21 @@ object BroadcastUtils {
if (filtered.isEmpty) {
return ColumnarBatchSerializeResult.EMPTY
}
-val handleArray =
- filtered.map(b =>
ColumnarBatches.getNativeHandle(BackendsApiManager.getBackendName, b))
-val serializeResult =
- try {
-ColumnarBatchSerializerJniWrapper
- .create(
-Runtimes
+var rowNums = 0
+val values = filtered.map(
+ b => {
+val handle =
ColumnarBatches.getNativeHandle(BackendsApiManager.getBackendName, b)
+rowNums += b.numRows()
+try {
+ ColumnarBatchSerializerJniWrapper
+.create(Runtimes
.contextInstance(BackendsApiManager.getBackendName,
"BroadcastUtils#serializeStream"))
- .serialize(handleArray)
- } finally {
-filtered.foreach(ColumnarBatches.release)
- }
-serializeResult
+.serialize(handle)
+} finally {
+ ColumnarBatches.release(b)
Review Comment:
Yes, there are two buffers in VeloxColumnarBatchSerializer to save
serialized data. If we serialize many columnarbatches at once, it may reserve a
large offheap memory:
https://github.com/apache/incubator-gluten/blob/c3aacfcd2593cda65e54a18b024b68cad6d0ade1/cpp/velox/operators/serializer/VeloxColumnarBatchSerializer.cc#L64
https://github.com/apache/incubator-gluten/blob/c3aacfcd2593cda65e54a18b024b68cad6d0ade1/cpp/velox/operators/serializer/VeloxColumnarBatchSerializer.cc#L78
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] [GLUTEN-9475][VL] Serialize ColumnarBatch one by one when broadcasting [incubator-gluten]
wForget commented on PR #9521: URL: https://github.com/apache/incubator-gluten/pull/9521#issuecomment-2858062732 > Please add some unit tests to prove this feature's correctness. Do you mean I need to create a query with broadcast to verify it doesn't result in incorrect results? Currently, the unit test cases should already have many queries with broadcast. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [GLUTEN-9475][VL] Serialize ColumnarBatch one by one when broadcasting [incubator-gluten]
xinghuayu007 commented on code in PR #9521:
URL: https://github.com/apache/incubator-gluten/pull/9521#discussion_r2076969635
##
backends-velox/src/main/scala/org/apache/spark/sql/execution/BroadcastUtils.scala:
##
@@ -172,19 +172,21 @@ object BroadcastUtils {
if (filtered.isEmpty) {
return ColumnarBatchSerializeResult.EMPTY
}
-val handleArray =
- filtered.map(b =>
ColumnarBatches.getNativeHandle(BackendsApiManager.getBackendName, b))
-val serializeResult =
- try {
-ColumnarBatchSerializerJniWrapper
- .create(
-Runtimes
+var rowNums = 0
+val values = filtered.map(
+ b => {
+val handle =
ColumnarBatches.getNativeHandle(BackendsApiManager.getBackendName, b)
+rowNums += b.numRows()
+try {
+ ColumnarBatchSerializerJniWrapper
+.create(Runtimes
.contextInstance(BackendsApiManager.getBackendName,
"BroadcastUtils#serializeStream"))
- .serialize(handleArray)
- } finally {
-filtered.foreach(ColumnarBatches.release)
- }
-serializeResult
+.serialize(handle)
+} finally {
+ ColumnarBatches.release(b)
Review Comment:
So here release 'b' every loop can reduce memory consumption?
##
backends-velox/src/main/scala/org/apache/spark/sql/execution/BroadcastUtils.scala:
##
@@ -172,19 +172,21 @@ object BroadcastUtils {
if (filtered.isEmpty) {
return ColumnarBatchSerializeResult.EMPTY
}
-val handleArray =
- filtered.map(b =>
ColumnarBatches.getNativeHandle(BackendsApiManager.getBackendName, b))
-val serializeResult =
- try {
-ColumnarBatchSerializerJniWrapper
- .create(
-Runtimes
+var rowNums = 0
+val values = filtered.map(
+ b => {
+val handle =
ColumnarBatches.getNativeHandle(BackendsApiManager.getBackendName, b)
+rowNums += b.numRows()
+try {
+ ColumnarBatchSerializerJniWrapper
+.create(Runtimes
.contextInstance(BackendsApiManager.getBackendName,
"BroadcastUtils#serializeStream"))
- .serialize(handleArray)
- } finally {
-filtered.foreach(ColumnarBatches.release)
- }
-serializeResult
+.serialize(handle)
+} finally {
+ ColumnarBatches.release(b)
Review Comment:
So release 'b' every loop can reduce memory consumption?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] [GLUTEN-9475][VL] Serialize ColumnarBatch one by one when broadcasting [incubator-gluten]
xinghuayu007 commented on code in PR #9521:
URL: https://github.com/apache/incubator-gluten/pull/9521#discussion_r2076969635
##
backends-velox/src/main/scala/org/apache/spark/sql/execution/BroadcastUtils.scala:
##
@@ -172,19 +172,21 @@ object BroadcastUtils {
if (filtered.isEmpty) {
return ColumnarBatchSerializeResult.EMPTY
}
-val handleArray =
- filtered.map(b =>
ColumnarBatches.getNativeHandle(BackendsApiManager.getBackendName, b))
-val serializeResult =
- try {
-ColumnarBatchSerializerJniWrapper
- .create(
-Runtimes
+var rowNums = 0
+val values = filtered.map(
+ b => {
+val handle =
ColumnarBatches.getNativeHandle(BackendsApiManager.getBackendName, b)
+rowNums += b.numRows()
+try {
+ ColumnarBatchSerializerJniWrapper
+.create(Runtimes
.contextInstance(BackendsApiManager.getBackendName,
"BroadcastUtils#serializeStream"))
- .serialize(handleArray)
- } finally {
-filtered.foreach(ColumnarBatches.release)
- }
-serializeResult
+.serialize(handle)
+} finally {
+ ColumnarBatches.release(b)
Review Comment:
So here release 'b' every loop can reduce memory consumption?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] [GLUTEN-9475][VL] Serialize ColumnarBatch one by one when broadcasting [incubator-gluten]
xinghuayu007 commented on PR #9521: URL: https://github.com/apache/incubator-gluten/pull/9521#issuecomment-2857385477 Please add some unit tests to prove this feature's correctness. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [GLUTEN-9475][VL] Serialize ColumnarBatch one by one when broadcasting [incubator-gluten]
wForget commented on PR #9521: URL: https://github.com/apache/incubator-gluten/pull/9521#issuecomment-2856955170 @zhztheplayer @PHILO-HE Could you please take a look? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [GLUTEN-9475][VL] Serialize ColumnarBatch one by one when broadcasting [incubator-gluten]
wForget commented on PR #9521: URL: https://github.com/apache/incubator-gluten/pull/9521#issuecomment-2856948823 Benchmark testing did not find performance regression: https://github.com/wForget/benchmarks-spark-native/actions/runs/14861256989 First: https://github.com/apache/incubator-gluten/commit/a783515e30ebfc99bb22c5222d8e2a98f92d9e63 Second: https://github.com/wForget/gluten/tree/GLUTEN-9475  -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [GLUTEN-9475][VL] Serialize ColumnarBatch one by one when broadcasting [incubator-gluten]
github-actions[bot] commented on PR #9521: URL: https://github.com/apache/incubator-gluten/pull/9521#issuecomment-2853348948 https://github.com/apache/incubator-gluten/issues/9475 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
