For example, is the distinct() transformation lazy? When I look at the Spark source code, distinct applies a map -> reduceByKey -> map pipeline to the RDD. Why is this lazy? Won't these functions be applied immediately to the elements of the RDD when I call someRDD.distinct()?
/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
def distinct(numPartitions: Int): RDD[T] =
  map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)

/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
def distinct(): RDD[T] = distinct(partitions.size)
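The key point is that map and reduceByKey are themselves lazy: each call only wraps the parent RDD in a new RDD object that records the function to apply, so distinct merely builds up a chain of descriptions. Nothing touches the data until an action such as collect() forces evaluation. Here is a minimal toy sketch (ToyRDD and PairOps are made-up names, not Spark's API) that mirrors the same map -> reduceByKey -> map shape and uses a counter to prove no work happens before the action:

```scala
// Toy model of lazy RDD chaining -- NOT Spark's actual classes.
// Each "transformation" just wraps a thunk; evaluation only happens
// when the "action" collect() is called.
var applications = 0 // counts how many transformation thunks have actually run

final class ToyRDD[T](thunk: () => Seq[T]) {
  // map is lazy: it returns a new ToyRDD wrapping the parent's thunk.
  def map[U](f: T => U): ToyRDD[U] =
    new ToyRDD(() => { applications += 1; thunk().map(f) })

  // The "action": the only place the recorded pipeline is executed.
  def collect(): Seq[T] = thunk()
}

// Like Spark's PairRDDFunctions: key-value operations are added to
// pair RDDs via an implicit conversion.
implicit class PairOps[K, V](rdd: ToyRDD[(K, V)]) {
  def reduceByKey(merge: (V, V) => V): ToyRDD[(K, V)] =
    new ToyRDD(() => {
      applications += 1
      rdd.collect().groupBy(_._1).toSeq.map { case (k, pairs) =>
        (k, pairs.map(_._2).reduce(merge))
      }
    })
}

val base = new ToyRDD(() => Seq(3, 1, 3, 2, 1))

// Same shape as Spark's distinct: map -> reduceByKey -> map.
val distinctLazy = base.map(x => (x, null)).reduceByKey((x, y) => x).map(_._1)

assert(applications == 0)                  // nothing has run yet: still lazy
val result = distinctLazy.collect().sorted // the action forces the whole chain
println(result)                            // List(1, 2, 3)
```

Spark's real RDDs work the same way at a high level: calling distinct returns a new RDD describing the three-step pipeline, and the cluster only computes it when an action is invoked.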