[GitHub] [spark] cloud-fan commented on pull request #28778: [SPARK-31949][SQL] Add spark.default.parallelism in SQLConf for isolated across session
cloud-fan commented on pull request #28778: URL: https://github.com/apache/spark/pull/28778#issuecomment-645239254 So seems we just need to add a min-partition-num config for file source? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
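To make the suggestion above concrete, here is a minimal sketch of how a minimum-partition-number setting could stand in for the default parallelism when computing the file split size. It is a simplified model, not Spark's code: the function name and the `min_partition_num` parameter are hypothetical, and the 128 MB and 4 MB defaults are assumed to mirror `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`.

```python
MB = 1024 * 1024


def max_split_bytes(file_sizes, default_parallelism, min_partition_num=None,
                    max_partition_bytes=128 * MB, open_cost=4 * MB):
    """Simplified model of choosing a file split size (hypothetical helper).

    If min_partition_num is given, it replaces the cluster-wide default
    parallelism as the divisor, so the split size no longer depends on the
    total number of cores.
    """
    divisor = min_partition_num if min_partition_num is not None else default_parallelism
    # Each file is padded with an assumed fixed cost of opening it.
    total_bytes = sum(size + open_cost for size in file_sizes)
    bytes_per_slot = total_bytes // divisor
    # Clamp between the open cost and the configured maximum split size.
    return min(max_partition_bytes, max(open_cost, bytes_per_slot))


files = [1024 * MB]  # a single 1 GiB file
by_cores = max_split_bytes(files, default_parallelism=200)
by_min_num = max_split_bytes(files, default_parallelism=200, min_partition_num=8)
print(by_cores, by_min_num)
```

With the hypothetical per-session `min_partition_num`, the split size is set by the session's own target rather than by how many cores the whole cluster has.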
[GitHub] [spark] cloud-fan commented on pull request #28778: [SPARK-31949][SQL] Add spark.default.parallelism in SQLConf for isolated across session
cloud-fan commented on pull request #28778: URL: https://github.com/apache/spark/pull/28778#issuecomment-645167188 After more thought, I think the file partition split logic itself is problematic. Its target is to make the number of partitions the same as the total number of cores, which doesn't make sense, as the cluster may have only a few free cores. I think a proper way is to set an expected size for each partition, like 64 MB. This is also what we do when coalescing shuffle partitions in AQE. Can we add such a config?
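The difference between the two approaches discussed in this comment can be sketched as follows. This is a simplified model under stated assumptions, not Spark's implementation: the helper names are hypothetical, the 128 MB and 4 MB defaults are assumed to mirror `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`, and the split count ignores the later step that packs small splits into one partition.

```python
MB = 1024 * 1024


def core_based_split_bytes(file_sizes, parallelism,
                           max_partition_bytes=128 * MB, open_cost=4 * MB):
    """Split size derived from the core count (hypothetical helper)."""
    total_bytes = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total_bytes // parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))


def num_splits(file_sizes, split_bytes):
    """Number of splits if every file is cut into chunks of split_bytes."""
    # -(-a // b) is integer ceiling division.
    return sum(-(-size // split_bytes) for size in file_sizes)


files = [1024 * MB]  # a single 1 GiB file

# Core-based: the split count follows the cluster's total core count.
for cores in (8, 200):
    size = core_based_split_bytes(files, cores)
    print(cores, "cores:", num_splits(files, size), "splits")

# Size-based: a fixed expected partition size, independent of the cluster.
target = 64 * MB  # the example size from the comment
print("64 MB target:", num_splits(files, target), "splits")
```

In this model the core-based rule gives the same input very different split counts on clusters of different sizes, while the fixed target size gives a count that depends only on the data.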
[GitHub] [spark] cloud-fan commented on pull request #28778: [SPARK-31949][SQL] Add spark.default.parallelism in SQLConf for isolated across session
cloud-fan commented on pull request #28778: URL: https://github.com/apache/spark/pull/28778#issuecomment-644710864 Parallelism is a physical concept already. Can you explain more about how you are going to tune the file partition split? What problems did you hit?
[GitHub] [spark] cloud-fan commented on pull request #28778: [SPARK-31949][SQL] Add spark.default.parallelism in SQLConf for isolated across session
cloud-fan commented on pull request #28778: URL: https://github.com/apache/spark/pull/28778#issuecomment-643064091 The most confusing part is that default parallelism is more of a physical property (related to cluster resources), and it's weird to have a per-session setting for it.
[GitHub] [spark] cloud-fan commented on pull request #28778: [SPARK-31949][SQL] Add spark.default.parallelism in SQLConf for isolated across session
cloud-fan commented on pull request #28778: URL: https://github.com/apache/spark/pull/28778#issuecomment-642593763 After more thought, I'm wondering what the real use case is. The default parallelism depends on the cluster resources, and it looks weird if different sessions can have different default parallelism. Looking at the changes in this PR, I think most of them don't really need a per-session config to tune it. The only place that looks reasonable is where we split file partitions. Maybe we can just add a new config for fine-grained control of the file partition splitting?