Creating a smaller, derivative RDD from an RDD

2014-12-18 Thread bethesda
We have a very large RDD, and I need to create a new RDD whose values are
derived from each record of the original RDD, where we only retain the few new
records that meet a criterion.  I want to avoid creating a second large RDD
and then filtering it, since I believe this could tax system resources
unnecessarily (tell me if that assumption is wrong).

So for example, /and this is just an example/, say we have an RDD with the
values 1 to 1,000,000; we compute each value's MD5 hash, and we only keep the
results that start with 'A'.

What we've tried seems to work, but it looked a bit ugly and perhaps
inefficient. It's sketched in pseudocode below. Is this the best way to do
this?

Thanks

bigRdd.flatMap { i =>
  val h = md5(i)
  if (h.startsWith("A")) {
    Array(h)
  } else {
    Array[String]()
  }
}



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Creating-a-smaller-derivative-RDD-from-an-RDD-tp20769.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Creating a smaller, derivative RDD from an RDD

2014-12-18 Thread Sean Owen
I don't think you can avoid examining each element of the RDD, if
that's what you mean. Your approach is basically the best you can do
in general. You're not making a second RDD here, and even if you did
this in two steps, the second RDD is really more of a bookkeeping
construct than a second huge data structure.
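To see why the two-step version doesn't materialize a second huge structure, here's a plain-Scala analogy (not Spark itself): an Iterator stands in for the RDD's lazy pipeline, and chained map/filter steps stream one element at a time.

```scala
object PipelineSketch {
  def main(args: Array[String]): Unit = {
    var mapped = 0
    // Chaining map and filter on an Iterator is pipelined: each element
    // flows through both steps one at a time, so no intermediate
    // collection of 1,000,000 mapped values is ever materialized.
    val kept = (1 to 1000000).iterator
      .map { i => mapped += 1; i * 2 }
      .filter(_ % 100000 == 0)
      .toList
    assert(kept.size == 20)      // only the survivors are collected
    assert(mapped == 1000000)    // every element was examined exactly once
    println(s"kept ${kept.size} of $mapped elements")
  }
}
```

Spark's narrow transformations (map, filter, flatMap) pipeline in the same way within a stage, which is why the intermediate "RDD" is bookkeeping rather than data.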

You can simplify your example a bit, although I doubt it's noticeably faster:

bigRdd.flatMap { i =>
  val h = md5(i)
  if (h(0) == 'A') {
    Some(h)
  } else {
    None
  }
}

This is also fine, simpler still, and if it's slower, not by much:

bigRdd.map(md5).filter(_(0) == 'A')
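If you want to convince yourself the two shapes agree, here's a minimal, Spark-free sketch using an ordinary Seq in place of the RDD; the `md5` helper is my own assumption, built on java.security.MessageDigest:

```scala
import java.security.MessageDigest

object Md5FilterSketch {
  // Hypothetical md5 helper: hash a value's string form to uppercase hex.
  def md5(i: Int): String =
    MessageDigest.getInstance("MD5")
      .digest(i.toString.getBytes("UTF-8"))
      .map(b => "%02X".format(b & 0xff))
      .mkString

  // flatMap with Option: emit the hash only when it starts with 'A'.
  def viaFlatMap(xs: Seq[Int]): Seq[String] =
    xs.flatMap { i =>
      val h = md5(i)
      if (h(0) == 'A') Some(h) else None
    }

  // map then filter: same result in one pass each, arguably clearer.
  def viaMapFilter(xs: Seq[Int]): Seq[String] =
    xs.map(md5).filter(_(0) == 'A')

  def main(args: Array[String]): Unit = {
    val data = 1 to 1000
    assert(viaFlatMap(data) == viaMapFilter(data))
    println(s"kept ${viaFlatMap(data).size} hashes starting with 'A'")
  }
}
```

On a real RDD the same two formulations apply unchanged, since RDDs expose the same map/flatMap/filter combinators as Scala collections.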


On Thu, Dec 18, 2014 at 10:18 PM, bethesda swearinge...@mac.com wrote:
 [quoted original message trimmed]
