Hi,

I have a use case where I read data from files and need to drop a certain
number of lines (unwanted data) before I begin processing.

I implemented it as -

  /**
   * Returns an RDD with the first n elements dropped.
   */
  def drop(num: Int): RDD[T] = {
    if (num <= 0) {
      this
    } else {
      val toBeDropped = sc.makeRDD(this.take(num))
      this.subtract(toBeDropped)
    }
  }

Is the implementation okay?
If yes, does it make sense to incorporate it into the Spark code base,
since most Scala collections have similar drop functionality?

One important point to note: the returned RDD might not preserve the order
in which the original RDD was constructed. Firing a subsequent drop, or any
other order-dependent query, on the result will give unpredictable results.
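A further caveat, for what it's worth: subtract is set-like, so if any of the first n values occur again later in the RDD, those later copies would be dropped too. One order-preserving alternative (just a sketch, not tested against Spark) would be to compute per-partition counts up front and then skip elements inside mapPartitionsWithIndex. The snippet below simulates that logic in plain Scala, with partitions modeled as a Seq of Seqs; dropOrdered is a hypothetical name.

```scala
object DropSketch {
  // Simulates dropping the first `num` elements of an RDD while
  // preserving order. In real Spark, `partitions` would be the RDD's
  // partitions in index order, and this per-partition skip would run
  // inside mapPartitionsWithIndex after computing the sizes up front.
  def dropOrdered[T](partitions: Seq[Seq[T]], num: Int): Seq[Seq[T]] = {
    val sizes = partitions.map(_.size)
    // How many elements remain to be skipped when each partition starts.
    val toSkip = sizes.scanLeft(num)((remaining, size) => math.max(0, remaining - size))
    partitions.zip(toSkip).map { case (part, skip) => part.drop(skip) }
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6, 7, 8))
    // Drops 1..4 and keeps 5, 6, 7, 8 in their original order.
    println(dropOrdered(parts, 4).flatten)
  }
}
```

Because only the leading partitions are touched, duplicates elsewhere in the data survive, unlike with subtract.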
