I don't think you can avoid examining each element of the RDD, if that's what you mean. Your approach is basically the best you can do in general. You're not materializing a second huge data set here, and even if you did this in two steps, the second RDD is really more bookkeeping than a second huge data structure.
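To make that concrete, here's a small sketch (the variable names are just illustrative) showing that even the two-step version only builds a lineage graph until an action runs:

  // Both lines are transformations: no data moves and no second data
  // set is materialized; each just records a step in the RDD lineage.
  val derived = bigRdd.map(_.toString)            // hypothetical derivation
  val kept = derived.filter(_.startsWith("1"))    // hypothetical criterion

  // Until an action like count() or collect() forces evaluation, the
  // "second RDD" is only this bookkeeping:
  println(kept.toDebugString)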
You can simplify your example a bit, although I doubt it's noticeably faster:

  bigRdd.flatMap { i =>
    val h = md5(i)
    if (h(0) == 'A') {
      Some(h)
    } else {
      None
    }
  }

This is also fine, simpler still, and if it's slower, not by much:

  bigRdd.map(md5).filter(_(0) == 'A')

On Thu, Dec 18, 2014 at 10:18 PM, bethesda <swearinge...@mac.com> wrote:
> We have a very large RDD, and I need to create a new RDD whose values are
> derived from each record of the original RDD, keeping only the few new
> records that meet a criterion. I want to avoid creating a second large RDD
> and then filtering it, since I believe this could tax system resources
> unnecessarily (tell me if that assumption is wrong).
>
> So for example, /and this is just an example/, say we have an RDD with 1
> to 1,000,000, and we iterate through each value, compute its md5 hash, and
> keep only the results that start with 'A'.
>
> What we've tried, and it seems to work but seemed a bit ugly and perhaps
> not efficient, was the following in pseudocode. *Is this the best way to
> do this?*
>
> Thanks
>
> bigRdd.flatMap { i =>
>   val h = md5(i)
>   if (h.substring(0, 1) == "A") {
>     Array(h)
>   } else {
>     Array[String]()
>   }
> }
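For completeness, a minimal self-contained sketch of the one-pass version (the md5 helper, the local master, and the app name are assumptions for illustration; the thread doesn't define them):

  import java.security.MessageDigest
  import org.apache.spark.{SparkConf, SparkContext}

  // Hypothetical md5 helper: uppercase hex digest of the number's string form.
  def md5(i: Int): String =
    MessageDigest.getInstance("MD5")
      .digest(i.toString.getBytes("UTF-8"))
      .map("%02X".format(_))
      .mkString

  val sc = new SparkContext(
    new SparkConf().setAppName("derive-small-rdd").setMaster("local[*]"))
  val bigRdd = sc.parallelize(1 to 1000000)

  // One pass: hash each element and keep it only if the digest starts
  // with 'A'. flatMap over an Option emits zero or one output per input.
  val small = bigRdd.flatMap { i =>
    val h = md5(i)
    if (h(0) == 'A') Some(h) else None
  }

  // Only this action triggers computation; each partition streams through
  // hash-and-filter and retains just the matching records.
  println(small.count())

Either way, Spark pipelines the hash and the filter within a single stage, so there is no extra pass over the data.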