[GitHub] [spark] cloud-fan commented on pull request #28778: [SPARK-31949][SQL] Add spark.default.parallelism in SQLConf for isolated across session

2020-06-17 Thread GitBox


cloud-fan commented on pull request #28778:
URL: https://github.com/apache/spark/pull/28778#issuecomment-645239254


   So it seems we just need to add a min-partition-num config for the file source?
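
   To make the suggestion concrete, here is a minimal sketch of how a 
min-partition-num setting could stand in for default parallelism when picking 
the split size. The function name, the `min_partition_num` parameter and the 
sample numbers are hypothetical and for illustration only; this is not the 
actual Spark code.

```python
MB = 1024 * 1024

def split_bytes_with_min_partitions(total_bytes, default_parallelism,
                                    min_partition_num=None,
                                    max_partition_bytes=128 * MB,
                                    open_cost_bytes=4 * MB):
    """Pick a split size from a suggested minimum partition count.

    `min_partition_num` is a hypothetical per-session setting; when it is
    unset, the sketch falls back to the cluster-wide default parallelism.
    """
    target_partitions = (min_partition_num
                         if min_partition_num is not None
                         else default_parallelism)
    bytes_per_partition = total_bytes // target_partitions
    # Clamp between the per-file open cost and the maximum partition size.
    return min(max_partition_bytes, max(open_cost_bytes, bytes_per_partition))

# Made-up input: 1000 MB of files on a cluster with default parallelism 200.
fallback = split_bytes_with_min_partitions(1000 * MB, 200)
tuned = split_bytes_with_min_partitions(1000 * MB, 200, min_partition_num=20)
```

   With such a setting a session can ask for fewer, larger file partitions 
without changing the default parallelism that the rest of the cluster sees.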



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #28778: [SPARK-31949][SQL] Add spark.default.parallelism in SQLConf for isolated across session

2020-06-16 Thread GitBox


cloud-fan commented on pull request #28778:
URL: https://github.com/apache/spark/pull/28778#issuecomment-645167188


   After more thought, I think the file partition split logic itself is 
problematic. Its goal is to make the number of partitions match the total 
number of cores, which doesn't make sense because the cluster may have only a 
few free cores.
   
   I think the proper way is to set an expected size for each partition, such 
as 64 MB. This is also what we do when coalescing shuffle partitions in AQE. 
Can we add such a config?
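
   The contrast can be sketched as follows. The first function is a simplified 
model of a parallelism-driven split size (the 128 MB and 4 MB defaults are 
assumed to mirror `spark.sql.files.maxPartitionBytes` and 
`spark.sql.files.openCostInBytes`); the second is a hypothetical size-driven 
alternative using the 64 MB figure mentioned above. Function names and sample 
inputs are made up for illustration and are not the actual Spark code.

```python
MB = 1024 * 1024

def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=128 * MB, open_cost_bytes=4 * MB):
    """Parallelism-driven split size: spread the input over every core."""
    total_bytes = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    return min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))

def partitions_for_target_size(file_sizes, target_partition_bytes=64 * MB):
    """Size-driven alternative: the count depends only on the data volume."""
    total_bytes = sum(file_sizes)
    # Ceiling division, with at least one partition for non-empty plans.
    return max(1, -(-total_bytes // target_partition_bytes))

files = [100 * MB] * 10  # ten 100 MB files (made-up input)

small_cluster = max_split_bytes(files, default_parallelism=8)
large_cluster = max_split_bytes(files, default_parallelism=200)
fixed_target = partitions_for_target_size(files)
```

   In this model the same input is cut into very different split sizes 
depending on the total core count, while the size-driven count stays the same 
regardless of cluster size.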






[GitHub] [spark] cloud-fan commented on pull request #28778: [SPARK-31949][SQL] Add spark.default.parallelism in SQLConf for isolated across session

2020-06-16 Thread GitBox


cloud-fan commented on pull request #28778:
URL: https://github.com/apache/spark/pull/28778#issuecomment-644710864


   Parallelism is already a physical concept. Can you explain more about how 
you are going to tune the file partition split? What problems did you hit?






[GitHub] [spark] cloud-fan commented on pull request #28778: [SPARK-31949][SQL] Add spark.default.parallelism in SQLConf for isolated across session

2020-06-11 Thread GitBox


cloud-fan commented on pull request #28778:
URL: https://github.com/apache/spark/pull/28778#issuecomment-643064091


   The most confusing part is that default parallelism is more of a physical 
property (tied to cluster resources), so it's weird to have a per-session 
setting for it.






[GitHub] [spark] cloud-fan commented on pull request #28778: [SPARK-31949][SQL] Add spark.default.parallelism in SQLConf for isolated across session

2020-06-11 Thread GitBox


cloud-fan commented on pull request #28778:
URL: https://github.com/apache/spark/pull/28778#issuecomment-642593763


   After more thought, I'm wondering what the real use case for this is.
   
   The default parallelism depends on the cluster resources, and it looks weird 
if different sessions can have different default parallelism.
   
   Looking at the changes in this PR, I think most of them don't really need a 
per-session config for tuning. The only place where it looks reasonable is 
where we split file partitions. Maybe we can just add a new config for 
fine-grained control of the file partition splitting?
   
   


