Re: A chain of lazy operations starts running tasks

Reynold Xin Mon, 23 Sep 2013 19:44:23 -0700

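The reason is that sortByKey is not entirely lazy. To build its range
partitioner it has to sample the keys, and that sampling runs as a job as
soon as sortByKey is called, which forces the upstream flatMap, map and
reduceByKey stages to be computed.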
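
To illustrate why a range partitioner needs to look at the data up front,
here is a minimal sketch in plain Scala. It is a simplified illustration and
not Spark's actual RangePartitioner code. The names RangeBoundsSketch,
rangeBounds and partitionFor are hypothetical, and the sample keys are made up.

```scala
// Simplified illustration only; NOT Spark's RangePartitioner implementation.
object RangeBoundsSketch {
  // Choose (numPartitions - 1) boundary keys from a sample of the data.
  // The boundaries cannot be known without reading some keys first,
  // which is why building such a partitioner is an eager step.
  def rangeBounds(sample: Seq[String], numPartitions: Int): Vector[String] = {
    val sorted = sample.sorted.toVector
    if (sorted.isEmpty || numPartitions <= 1) Vector.empty
    else (1 until numPartitions).map { i =>
      sorted(math.min(sorted.length - 1, i * sorted.length / numPartitions))
    }.toVector
  }

  // A key belongs to the first partition whose upper bound is >= the key;
  // keys larger than every bound go to the last partition.
  def partitionFor(key: String, bounds: Vector[String]): Int = {
    val idx = bounds.indexWhere(b => key <= b)
    if (idx < 0) bounds.length else idx
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample of keys drawn from the data set.
    val sampledKeys = Seq("pear", "apple", "fig", "kiwi", "plum", "date")
    val bounds = rangeBounds(sampledKeys, 3)
    println("bounds: " + bounds)
    println("partition for banana: " + partitionFor("banana", bounds))
  }
}
```

In this sketch the bounds are computed as soon as rangeBounds is called, in
the same way that the sampling step runs when sortByKey sets up its
partitioner rather than when an action is later invoked on the result.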



--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org



On Mon, Sep 23, 2013 at 5:47 PM, Mahdi Namazifar
<[email protected]> wrote:

> Hi,
>
> I think I might be missing something, but what I observe is inconsistent
> with my understanding of transformation vs. action operations. In the
> Spark shell I do the following:
>
> val a = sc.textFile("[my file]", 1000)
> val c = a.flatMap(line => line.split("\t")).map(word => (word, 1)).
>   reduceByKey((a, b) => a + b, 100).sortByKey(false, 500)
>
> This is for experimentation purposes only: I'm running a word count on
> a file that is read from HDFS, and then I sort the result by word.
>
> My understanding from the documentation is that all of flatMap, map,
> reduceByKey, and sortByKey are transformation operations and are therefore
> lazy operations.  But when I run the second line, I see 1000
> ShuffleMapTasks, followed by 100 ResultTasks and another 100 ResultTasks
> running on the cluster which in total take about 400 seconds.  Am I missing
> something?  Could someone kindly explain what exactly happens when I
> run the second command?  I was expecting the command to only
> create an RDD and not perform any tasks.
>
> BTW, I'm using Spark 0.7.2 on a 1+4 node cluster.
>
> Thanks,
> Mahdi
>
