[ https://issues.apache.org/jira/browse/SPARK-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Or closed SPARK-3545.
----------------------------
    Resolution: Won't Fix

> Put HadoopRDD.getPartitions forward and put TaskScheduler.start back in
> SparkContext to reduce DAGScheduler.JobSubmitted processing time and shorten
> cluster resources occupation period
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-3545
>                 URL: https://issues.apache.org/jira/browse/SPARK-3545
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>            Reporter: YanTang Zhai
>            Priority: Minor
>
> We have two problems:
>
> (1) HadoopRDD.getPartitions is evaluated lazily, during
> DAGScheduler.JobSubmitted processing. If the input directory is large,
> getPartitions can take a long time: in our cluster it ranges from 0.029 s
> to 766.699 s. While one JobSubmitted event is being processed, all others
> must wait. We therefore want to move HadoopRDD.getPartitions forward to
> reduce DAGScheduler.JobSubmitted processing time, so that other
> JobSubmitted events do not have to wait as long. A HadoopRDD object could
> compute its partitions when it is instantiated.
>
> (2) When a SparkContext object is instantiated, the TaskScheduler is
> started and resources are allocated from the cluster, even though they may
> not be used yet, for example while DAGScheduler.JobSubmitted is still being
> processed. Those resources sit idle during that period. We therefore want
> to move TaskScheduler.start back to shorten the cluster resource occupation
> period, especially on a busy cluster. The TaskScheduler could be started
> just before the stages run.
>
> We can analyse and compare the execution times before and after this
> optimization.
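As a rough illustration of problem (1), the head-of-line blocking in a single-threaded event loop can be modelled with a toy queue. This is not Spark code, and the costs are hypothetical placeholders (milliseconds), but it shows why moving the split computation out of the JobSubmitted handler shortens the wait for subsequent jobs:

```python
def queue_waits(handler_costs_ms):
    """Wait time of each event in a single-threaded event queue."""
    waits, clock = [], 0
    for cost in handler_costs_ms:
        waits.append(clock)   # each event waits for everything queued before it
        clock += cost
    return waits

# Hypothetical costs: 10.1 s per JobSubmitted event if getPartitions runs
# inside the handler (lazy); 0.1 s if the partitions were already computed
# at RDD instantiation (eager).
lazy_waits  = queue_waits([10_100] * 3)   # three jobs, lazy getPartitions
eager_waits = queue_waits([100] * 3)      # same three jobs, eager getPartitions
print(lazy_waits)    # [0, 10100, 20200]
print(eager_waits)   # [0, 100, 200]
```

The total work is the same either way; the point is only that the split computation no longer sits on the event-loop critical path.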
> Let:
>   TaskScheduler.start execution time:                  [time1__]
>   DAGScheduler.JobSubmitted execution time (excluding
>   HadoopRDD.getPartitions and TaskScheduler.start):    [time2_]
>   HadoopRDD.getPartitions execution time:              [time3___]
>   Stages execution time:                               [time4_____]
>
> (1) The app has only one job
> (a) Execution time of the job:
>     before optimization: [time1__][time2_][time3___][time4_____]
>     after optimization:  [time3___][time2_][time1__][time4_____]
> (b) Cluster resource occupation period:
>     before optimization: [time2_][time3___][time4_____]
>     after optimization:  [time4_____]
> In summary, if the app has only one job, the total execution time is the
> same before and after the optimization, while the cluster resource
> occupation period is shorter after it.
>
> (2) The app has 4 jobs
> Before optimization:
>     job1: [time2_][time3___][time4_____]
>     job2: [time2__________][time3___][time4_____]
>     job3: [time2____][time3___][time4_____]
>     job4: [time2______________][time3___][time4_____]
> After optimization:
>     job1: [time3___][time2_][time1__][time4_____]
>     job2: [time3___][time2__________][time4_____]
>     job3: [time3___][time2_][time4_____]
>     job4: [time3___][time2__][time4_____]
> In summary, if the app has multiple jobs, the average execution time after
> the optimization is lower than before, and the cluster resource occupation
> period is also shorter.
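The single-job comparison above is plain arithmetic over the four phase durations. A minimal sketch, with entirely made-up durations in milliseconds for time1..time4:

```python
# Illustrative durations only (ms): start(), JobSubmitted remainder,
# getPartitions, stages. These are not measurements.
t1, t2, t3, t4 = 500, 100, 10_000, 30_000

# Before: TaskScheduler.start runs first, so resources are held through
# JobSubmitted processing, the split computation, and the stages.
# After: the phases are reordered and start() happens just before the stages.
before_total, before_held = t1 + t2 + t3 + t4, t2 + t3 + t4
after_total,  after_held  = t3 + t2 + t1 + t4, t4

print(before_total == after_total)   # True: same phases, just reordered
print(before_held, after_held)       # 40100 30000
```

Reordering cannot change the sum, which is why the win for a single job is only in occupation time; the latency win in the multi-job case comes from the queueing effect, not from this arithmetic.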
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org