Thanks for your response!
On Mon, Sep 23, 2013 at 7:42 PM, Reynold Xin <[email protected]> wrote:

> The reason is sortByKey triggers a sample operation to determine the range
> partitioner.
>
> --
> Reynold Xin, AMPLab, UC Berkeley
> http://rxin.org
>
> On Mon, Sep 23, 2013 at 5:47 PM, Mahdi Namazifar <[email protected]> wrote:
>
>> Hi,
>>
>> I think I might be missing something but here is what I observe which is
>> inconsistent with my understanding of transformation vs action operations:
>> in the Spark shell I do the following
>>
>> val a = sc.textFile("[my file]", 1000)
>> val c = a.flatMap(line => line.split("\t")).map(word => (word, 1)).reduceByKey((a,b)=>a+b, 100).sortByKey(false,500)
>>
>> which is for experimentation purposes only and I'm running a word count
>> on a file that is read from HDFS and then I sort the result by the words.
>>
>> My understanding from the documentation is that all of flatMap, map,
>> reduceByKey, and sortByKey are transformation operations and are therefore
>> lazy operations. But when I run the second line, I see 1000
>> ShuffleMapTasks, followed by 100 ResultTasks and another 100 ResultTasks
>> running on the cluster which in total take about 400 seconds. Am I missing
>> something? Could someone kindly explain to me what exactly happens when I
>> run the second command, because I was expecting for the command to only
>> create an RDD and not perform any tasks.
>>
>> BTW, I'm using Spark 0.7.2 on a 1+4 node cluster.
>>
>> Thanks,
>> Mahdi
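
For anyone reading this thread later, here is a rough sketch of why a range partitioner has to look at the data before the sorted RDD can even be defined. It is plain Scala and is not Spark's actual RangePartitioner code; the names rangeBounds and partitionFor are made up for illustration. The idea is that splitting keys into ordered ranges needs split points, and choosing split points means sampling the keys, which is presumably the work that shows up as tasks when sortByKey is called.

```scala
// Illustrative sketch only: NOT Spark's RangePartitioner implementation.
// rangeBounds and partitionFor are hypothetical helper names.
import scala.util.Random

object RangeBoundsSketch {
  // Pick up to (numPartitions - 1) split points from a sample of the keys.
  // Computing these bounds requires reading data, which is why a
  // range-partitioned sort cannot be set up without touching the input.
  def rangeBounds(keys: Seq[String], numPartitions: Int,
                  sampleSize: Int, seed: Long): Vector[String] = {
    val rng = new Random(seed)
    val sample = rng.shuffle(keys).take(sampleSize).sorted.toVector
    if (sample.isEmpty || numPartitions <= 1) Vector.empty
    else (1 until numPartitions).map { i =>
      sample(math.min(sample.length - 1, i * sample.length / numPartitions))
    }.distinct.toVector
  }

  // Index of the first range whose upper bound is >= key;
  // keys above every bound go to the last partition.
  def partitionFor(key: String, bounds: Vector[String]): Int = {
    val idx = bounds.indexWhere(b => key <= b)
    if (idx == -1) bounds.length else idx
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq("pear", "apple", "fig", "kiwi", "plum", "date", "lime", "grape")
    // Sampling every key here keeps the example deterministic.
    val bounds = rangeBounds(keys, numPartitions = 3, sampleSize = keys.size, seed = 42L)
    keys.sorted.foreach(k => println(s"$k -> partition ${partitionFor(k, bounds)}"))
  }
}
```

In this sketch the bounds are computed eagerly inside rangeBounds, which mirrors the behaviour described above: the partitioner has to exist before the shuffle can be described, so the sampling pass runs right away while the sort itself stays lazy.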
