[ https://issues.apache.org/jira/browse/SPARK-24591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maxim Gekk resolved SPARK-24591.
--------------------------------
    Resolution: Won't Fix

> Number of cores and executors in the cluster
> --------------------------------------------
>
>                 Key: SPARK-24591
>                 URL: https://issues.apache.org/jira/browse/SPARK-24591
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.3.1
>            Reporter: Maxim Gekk
>            Priority: Minor
>
> Need to add 2 new methods. The first one should return the total number of 
> CPU cores of all executors in the cluster. The second one should return the 
> current number of executors registered in the cluster.
> The main motivations for adding these methods:
> 1. It is a best practice to manage job parallelism relative to the available 
> cores, e.g., df.repartition(5 * sc.coreCount) (a sketch using existing APIs 
> follows this list). In particular, it is an anti-pattern to leave a bunch of 
> cores on large clusters twiddling their thumbs, doing nothing. Usually users 
> pass predefined constants to _repartition()_ and _coalesce()_, and the 
> constant is selected based on the current cluster size. If the code runs on 
> another cluster and/or on a resized cluster, the constant has to be modified 
> each time. This happens frequently when a job that normally runs on, say, an 
> hour of data on a small cluster needs to run on a week of data on a much 
> larger cluster.
> 2. *spark.default.parallelism* can be used to get the total number of cores 
> in the cluster, but it can be redefined by the user. The info can also be 
> obtained by registering a listener (see the sketch after this list), but 
> repeating that boilerplate in every application violates the DRY principle.
> 3. Regarding _executorsCount()_: some jobs, e.g., local-node ML training, 
> use a lot of parallelism within each executor. It's a common practice to 
> distribute such jobs so that there is one partition per executor.
>  
> 4. In some places users collect this info, along with other settings, 
> together with job timings (at the app level) for analysis. E.g., one can use 
> ML to determine the optimal cluster size for different objectives, e.g., 
> fastest throughput vs. lowest cost per unit of processing.
> 5. The simpler argument is that basic cluster properties should be easily 
> discoverable via APIs.
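
Since the request was resolved as Won't Fix, the listener registration
mentioned in motivation 2 remains the way to track these numbers today. A
minimal sketch (the class name ClusterSizeListener is illustrative, not part
of Spark); note it only observes executors added after it is registered, so it
should be installed as early as possible in the application:

{code:scala}
import scala.collection.concurrent.TrieMap

import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

// Tracks live executors and their cores from executor add/remove events.
class ClusterSizeListener extends SparkListener {
  private val coresByExecutor = TrieMap.empty[String, Int]

  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit = {
    coresByExecutor(event.executorId) = event.executorInfo.totalCores
  }

  override def onExecutorRemoved(event: SparkListenerExecutorRemoved): Unit = {
    coresByExecutor.remove(event.executorId)
  }

  def executorCount: Int = coresByExecutor.size
  def coreCount: Int = coresByExecutor.values.sum
}
{code}

Once registered via sc.addSparkListener(listener), this also supports the
one-partition-per-executor layout from motivation 3, e.g.
df.repartition(listener.executorCount).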
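
For a point-in-time estimate without a listener, existing public APIs can
approximate both counts. A rough sketch, assuming spark.executor.cores is set
explicitly (its default meaning varies by cluster manager; on standalone it is
unset and means all available cores on a worker):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cluster-size-demo").getOrCreate()
val sc = spark.sparkContext

// getExecutorMemoryStatus is keyed by "host:port" and includes the driver,
// hence the -1; in local mode the driver is the only "executor".
val executorCount = math.max(sc.getExecutorMemoryStatus.size - 1, 1)

// Only a valid estimate when spark.executor.cores is set explicitly.
val coresPerExecutor = sc.getConf.getInt("spark.executor.cores", 1)
val coreCount = executorCount * coresPerExecutor

// sc.defaultParallelism also approximates the total cores, unless the user
// has overridden spark.default.parallelism (motivation 2).
println(s"executors=$executorCount cores=$coreCount default=${sc.defaultParallelism}")

// The repartitioning pattern from motivation 1, without the proposed API:
val repartitioned = spark.range(1000000L).toDF("id").repartition(5 * coreCount)
{code}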


