sortByKey() runs one job to sample the data in order to determine what range of keys to put in each partition.
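To illustrate why that sampling pass is needed, here is a minimal pure-Python sketch of the idea behind range partitioning (this is an illustration of the concept, not Spark's actual RangePartitioner code; the function names are made up for the example):

```python
import random

def range_bounds(keys, num_partitions, sample_size=20):
    """Sample the keys and pick num_partitions - 1 split points.

    Without looking at (a sample of) the data, there is no way to
    choose boundaries that spread keys evenly across partitions --
    which is why sortByKey must run a job up front.
    """
    sample = sorted(random.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]

def partition_for(key, bounds):
    """Assign a key to the first range whose upper bound exceeds it."""
    for i, bound in enumerate(bounds):
        if key < bound:
            return i
    return len(bounds)

keys = list(range(100))
bounds = range_bounds(keys, num_partitions=4)
parts = [partition_for(k, bounds) for k in keys]
```

For sorted input keys, the resulting partition assignments are non-decreasing, so concatenating the sorted partitions yields a globally sorted result.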
There is a JIRA to change it to defer launching the job until the subsequent action, but it will still execute as another stage: https://issues.apache.org/jira/browse/SPARK-1021

On Wed, Apr 29, 2015 at 5:57 PM, Tom Hubregtsen <thubregt...@gmail.com> wrote:
> "I'm not sure, but I wonder if because you are using the Spark REPL that it
> may not be representing what a normal runtime execution would look like and
> is possibly eagerly running a partial DAG once you define an operation that
> would cause a shuffle.
>
> What happens if you set up your same set of commands [a-e] in a file and use
> the Spark REPL's `load` or `paste` command to load them all at once?" From
> Richard
>
> I have also packaged it in a jar file (without [e], the debug string), and
> still see the extra stage before the other two that I would expect. Even
> when I remove [d], the action, I still see stage 0 being executed (and do
> not see stages 1 and 2).
>
> Again, a shortened log of Stage 0:
> INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[4] at
> sortByKey), which has no missing parents
> INFO DAGScheduler: ResultStage 0 (sortByKey) finished in 0.192 s