[ https://issues.apache.org/jira/browse/SPARK-24591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Maxim Gekk resolved SPARK-24591.
--------------------------------
    Resolution: Won't Fix

> Number of cores and executors in the cluster
> --------------------------------------------
>
>                 Key: SPARK-24591
>                 URL: https://issues.apache.org/jira/browse/SPARK-24591
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.3.1
>            Reporter: Maxim Gekk
>            Priority: Minor
>
> Need to add 2 new methods. The first one should return the total number of CPU cores of all executors in the cluster. The second one should give the current number of executors registered in the cluster.
> Main motivations for adding these methods:
> 1. It is best practice to manage job parallelism relative to available cores, e.g., df.repartition(5 * sc.coreCount). In particular, it is an anti-pattern to leave a bunch of cores on large clusters twiddling their thumbs doing nothing. Usually users pass predefined constants to _repartition()_ and _coalesce()_. Selection of the constant is based on the current cluster size. If the code runs on another cluster and/or on a resized cluster, they need to modify the constant each time. This happens frequently when a job that normally runs on, say, an hour of data on a small cluster needs to run on a week of data on a much larger cluster.
> 2. *spark.default.parallelism* can be used to get the total number of cores in the cluster, but it can be redefined by the user. The info can also be obtained by registering a listener, but repeating the same boilerplate everywhere looks ugly. We should follow the DRY principle.
> 3. Regarding executorsCount(), some jobs, e.g., local-node ML training, use a lot of parallelism. It is common practice to distribute such jobs so that there is one partition for each executor.
> 4. In some places users collect this info, as well as other settings info, together with job timing (at the app level) for analysis.
> E.g., you can use ML to determine the optimal cluster size given different objectives, e.g., fastest throughput vs. lowest cost per unit of processing.
> 5. The simpler argument is that basic cluster properties should be easily discoverable via APIs.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
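The sizing rules behind points 1 and 3 can be sketched as plain functions over the two values the proposed methods would return. This is a minimal illustration, not Spark API: `core_count` and `executor_count` are hypothetical arguments standing in for the proposed sc.coreCount and executorsCount(), and the factor of 5 echoes the df.repartition(5 * sc.coreCount) example from the issue.

```python
def etl_partitions(core_count: int, factor: int = 5) -> int:
    """Point 1: size repartition()/coalesce() relative to total cores
    (e.g., 5 * sc.coreCount) instead of a hard-coded constant that must
    be edited whenever the job moves to a differently sized cluster."""
    return factor * core_count


def ml_partitions(executor_count: int) -> int:
    """Point 3: local-node ML training commonly aims for exactly one
    partition per registered executor."""
    return executor_count


# On a hypothetical cluster of 20 executors with 4 cores each:
print(etl_partitions(80))  # -> 400 partitions for ETL-style jobs
print(ml_partitions(20))   # -> 20 partitions, one per executor
```

Because the functions take the counts as arguments, the same job code runs unchanged on an hour of data on a small cluster or a week of data on a much larger one, which is exactly the portability argument made in point 1.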