The key to efficient lookups is having a partitioner in place. Without one, essentially the best you can do is scan the RDD with a filter:

    import org.apache.spark.rdd.RDD

    // True if at least one element of the RDD equals `value`.
    def contains[T](rdd: RDD[T], value: T): Boolean =
      !rdd.filter(x => x == value).isEmpty

If you are going to do this sort of operation frequently, it might pay to set things up so that each check is cheaper. Rather than dealing with an RDD[T], deal with an RDD of pairs; for instance, you could do

    val pairRdd = rdd.map(x => (x, 1))

to get an RDD[(T, Int)]. Now that these are (technically) key-value pairs, you can come up with a partitioner and apply it; something like:

    import org.apache.spark.HashPartitioner

    // Keep the existing partition count; hash each key to pick its partition.
    val numPartitions = pairRdd.partitions.length
    val partitioner = new HashPartitioner(numPartitions)
    val partitionedRdd = pairRdd.partitionBy(partitioner)

Now you can use partitionedRdd.lookup(value), which will give you back a sequence of all the values (in this case, 1's) associated with the key "value"; the item exists exactly when that sequence is non-empty.

You can use rdd.lookup on any RDD of key-value pairs. However, if you look at the source at https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala you'll see that it is very efficient when the RDD has a partitioner: it only actually looks in the one partition that must contain the given key. This can make all the difference!

-- Nick

On Thu, Dec 31, 2015 at 8:26 AM domibd wrote:

> thanks a lot.
>
> It is very interesting.
>
> Unfortunately it does not solve my very simple problem:
> efficiently find whether a value is in a huge rdd.
>
> thanks again
>
> Dominique
>
> On 31/12/2015 01:26, madaan.amanmadaan [via Apache Spark User List] wrote:
> > Hi,
> >
> > Check out https://github.com/amplab/spark-indexedrdd, might be helpful.
> >
> > Aman
> >
> > On Wed, Dec 30, 2015 at 12:13 PM, domibd [via Apache Spark User List] wrote:
> >
> >     hello,
> >
> >     how can I check the existence of an item in a very large rdd
> >     in a parallelised way such that the process stops as soon as
> >     the item is found (if it is found)?
> >
> >     thanks a lot
> >
> >     Dominique
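
P.S. To illustrate why a lookup on a hash-partitioned pair RDD can be cheap, here is a minimal Spark-free sketch of the idea. It is only a model under stated assumptions, not Spark's implementation: plain Scala collections stand in for partitions, and the names PartitionedLookupModel, partitionOf, partitionBy, lookup and contains are made up for this example. The point it tries to show is that a key's hash picks exactly one bucket, so a membership check only has to scan that bucket rather than all of them.

```scala
// Spark-free model of a hash-partitioned lookup (all names are illustrative).
object PartitionedLookupModel {
  // Map a key to a bucket index in [0, numPartitions).
  // hashCode may be negative, so shift negative remainders back into range.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Place every (key, value) pair into the bucket chosen by its key's hash.
  def partitionBy[K, V](pairs: Seq[(K, V)], numPartitions: Int): Vector[Vector[(K, V)]] = {
    val grouped = pairs.groupBy { case (k, _) => partitionOf(k, numPartitions) }
    Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty).toVector)
  }

  // Scan only the single bucket that can hold the key and return its values.
  def lookup[K, V](partitions: Vector[Vector[(K, V)]], key: K): Seq[V] = {
    val bucket = partitions(partitionOf(key, partitions.length))
    bucket.collect { case (k, v) if k == key => v }
  }

  // Membership test: the key exists iff the lookup returns anything.
  def contains[K](partitions: Vector[Vector[(K, Int)]], key: K): Boolean =
    lookup(partitions, key).nonEmpty

  def main(args: Array[String]): Unit = {
    val items = Seq("apple", "banana", "cherry", "banana")
    // Same trick as in the reply: turn each item into an (item, 1) pair.
    val partitions = partitionBy(items.map(x => (x, 1)), numPartitions = 4)
    println(lookup(partitions, "banana"))   // Vector(1, 1)
    println(contains(partitions, "durian")) // false
  }
}
```

In real Spark the partitioning step involves a shuffle, so this only pays off when the same partitioned RDD is queried many times; persisting it (for example with cache()) is probably worthwhile in that case.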
