[GitHub] [spark] StevenChenDatabricks commented on pull request #40385: [SPARK-42753] ReusedExchange refers to non-existent nodes
StevenChenDatabricks commented on PR #40385: URL: https://github.com/apache/spark/pull/40385#issuecomment-1478591539 @cloud-fan I've addressed your comments. Thanks for the review! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] StevenChenDatabricks commented on pull request #40385: [SPARK-42753] ReusedExchange refers to non-existent nodes
StevenChenDatabricks commented on PR #40385: URL: https://github.com/apache/spark/pull/40385#issuecomment-1469194144 @cloud-fan It wouldn't because `collectOperatorsWithID` in ExplainUtils is responsible for collecting the list of nodes to print out. It uses a BitSet `collectedOperators` that's globally shared to ensure each node is "collected" and printed exactly once. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] StevenChenDatabricks commented on pull request #40385: [SPARK-42753] ReusedExchange refers to non-existent nodes
StevenChenDatabricks commented on PR #40385: URL: https://github.com/apache/spark/pull/40385#issuecomment-1468621742 @cloud-fan Thanks for the idea and response! 1. I don't think this issue doesn't affects `ReusedSubquery` because of how its processed and printed. The current algorithm finds all Subquery nodes (including `ReusedSubquery`) and for each Subquery, it traverses the Subquery subtree to generate the IDs if they are missing. Furthermore, a `ReusedSubquery` does not print the details of the Subquery it reuses whereas for ReusedExchange it does print the Exchange ID being reused. For a `ReusedSubquery`, all that is printed is this line: ```Subquery:5 Hosting operator id = 50 Hosting Expression = ReusedSubquery Subquery scalar-subquery#31, [id=#32]``` `Hosting operator ID` is the parent operator that contains the `ReusedSubquery`. The subtree of the `ReusedSubquery` is not printed anywhere and the `ReusedSubquery` node itself is not printed in the main plan tree either. Even if there are non-existing children, the issue is not surfaced in the Explain plan by default. I guess there's still a chance it might affect Spark UI whereby the IDs in the subtree of a `ReusedSubquery` are incorrect because the IDs were generated in a previous AQE iteration... I'm not sure. I think it's best to wait and see if a ticket/bug like this is ever reported. 2. My fix detects all the ReusedExchanges with non-existing children and generate IDs on them. I guess your question is what if multiple `ReusedExchange` reference the same non-existing `Exchange`? That's a good point and I need to account for that edge case in the code in case that is possible. With regards to your idea for a section of non-existing `Exchanges`: we already only print each operator exactly once in the node details section. As shown in the PR description: I currently print out the plan subtree of the Non-Existing Exchange below the `ReusedExchange` (since that subtree is not shown anywhere else) and the node details while maintaining uniqueness. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] StevenChenDatabricks commented on pull request #40385: [SPARK-42753] ReusedExchange refers to non-existent nodes
StevenChenDatabricks commented on PR #40385: URL: https://github.com/apache/spark/pull/40385#issuecomment-1467287615 @cloud-fan Yes this is purely UI and EXPLAIN issue. It does not affect query execution. I'm not sure how AQE context stageCache map would help. The issue in EXPLAIN is that the ReusedExchange.child references a Exchange node that is not referenced anywhere else in the plan tree so we need to generate IDs on the subtree rooted at ReusedExchange.child and print them out. To do this, we need a way to check whether the ReusedExchange.child is referenced anywhere else - if they are not referenced anywhere else, we need to recursively generate IDs for subtree. I keep a HashSet of nodes with IDs already generated and check ReusedExchange.child against it to see if we need to recursively generate IDs on the subtree. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org