The only point where *actual* computation happens is when data is requested by the driver (e.g. via collect()) or materialized to external storage (e.g. saveAsHadoopFile). The rest of the time, operations are merely recorded. Once you actually ask for the data, the recorded operations are compiled into a DAG of stages; each stage can contain multiple tasks (for example, two filter operations can be pipelined into one stage) and is then executed. Hence the operations are all lazy by default.
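
A minimal sketch of this behavior (assuming a SparkContext named sc; the file paths are hypothetical):

    // Nothing is computed yet: each call just records a transformation
    // and returns a new RDD describing it.
    val lines    = sc.textFile("hdfs://.../input.txt")   // lazy
    val words    = lines.flatMap(_.split(" "))           // lazy
    val nonEmpty = words.filter(_.nonEmpty)              // lazy
    val distinctWords = nonEmpty.distinct()               // lazy (map -> reduceByKey -> map)

    // Only here does Spark build the DAG of stages, schedule tasks and
    // actually read/process the data, because collect() returns results
    // to the driver.
    val result = distinctWords.collect()

    // Materializing to external storage is an action too and would
    // likewise trigger execution:
    // distinctWords.saveAsTextFile("hdfs://.../output")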
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


On Tue, Mar 11, 2014 at 10:15 PM, David Thomas <dt5434...@gmail.com> wrote:

> I think you misunderstood my question - I should have stated it better.
> I'm not saying it should be applied immediately, but I'm trying to
> understand how Spark achieves this lazy computation of transformations.
> Maybe this is due to my ignorance of how Scala works, but when I see the
> code, I see that the function is applied to the elements of the RDD when
> I call distinct - or is it not applied immediately? How does the returned
> RDD 'keep track of the operation'?
>
>
> On Tue, Mar 11, 2014 at 10:06 PM, Ewen Cheslack-Postava
> <m...@ewencp.org> wrote:
>
>> You should probably be asking the opposite question: why do you think it
>> *should* be applied immediately? Since the driver program hasn't requested
>> any data back (distinct generates a new RDD, it doesn't return any data),
>> there's no need to actually compute anything yet.
>>
>> As the documentation describes, if the call returns an RDD, it's
>> transforming the data and will just keep track of the operation it
>> eventually needs to perform. Only methods that return data back to the
>> driver should trigger any computation.
>>
>> (The one known exception is sortByKey, which really should be lazy, but
>> apparently uses an RDD.count call in its implementation:
>> https://spark-project.atlassian.net/browse/SPARK-1021)
>>
>> David Thomas <dt5434...@gmail.com>
>> March 11, 2014 at 9:49 PM
>>
>> For example, is the distinct() transformation lazy?
>>
>> When I look at the Spark source code, distinct applies a map -> reduceByKey
>> -> map function to the RDD elements. Why is this lazy? Won't the function be
>> applied immediately to the elements of the RDD when I call someRDD.distinct?
>>
>>   /**
>>    * Return a new RDD containing the distinct elements in this RDD.
>>    */
>>   def distinct(numPartitions: Int): RDD[T] =
>>     map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
>>
>>   /**
>>    * Return a new RDD containing the distinct elements in this RDD.
>>    */
>>   def distinct(): RDD[T] = distinct(partitions.size)