For example, is the distinct() transformation lazy?

When I look at the Spark source code, distinct applies a map -> reduceByKey ->
map chain to the RDD elements. Why is this lazy? Won't these functions be
applied to the RDD's elements immediately when I call someRDD.distinct?

  /**
   * Return a new RDD containing the distinct elements in this RDD.
   */
  def distinct(numPartitions: Int): RDD[T] =
    map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)

  /**
   * Return a new RDD containing the distinct elements in this RDD.
   */
  def distinct(): RDD[T] = distinct(partitions.size)
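
To make the question concrete, here is a minimal sketch of what I am trying to
check (assuming a local SparkContext; the variable names and the local[*]
master are just for illustration, and in spark-shell an `sc` already exists):

  import org.apache.spark.{SparkConf, SparkContext}

  // Illustrative setup only; not needed inside spark-shell.
  val conf = new SparkConf().setAppName("distinct-laziness").setMaster("local[*]")
  val sc = new SparkContext(conf)

  val nums = sc.parallelize(Seq(1, 2, 2, 3, 3, 3))

  // distinct() is built from map -> reduceByKey -> map, which are all
  // transformations: this line only records lineage and launches no job.
  val uniques = nums.distinct()

  // collect() is an action; only here does the recorded pipeline execute.
  println(uniques.collect().mkString(", "))

  sc.stop()

If distinct were eager, I would expect the work to happen at the distinct()
call rather than at collect().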
