Re: efficient checking the existence of an item in a rdd

Nick Peterson Thu, 31 Dec 2015 09:05:22 -0800

The key to efficient lookups is having a partitioner in place.

If you don't have a partitioner in place, essentially the best you can do
is:

import org.apache.spark.rdd.RDD

def contains[T](rdd: RDD[T], value: T): Boolean =
  !rdd.filter(x => x == value).isEmpty


If you are going to do this sort of operation frequently, it might pay to
make it a bit easier. Rather than dealing with an RDD[T], deal with an RDD
of pairs; for instance, you could do pairRdd = rdd.map(x => (x, 1)) to get
an RDD[(T, Int)].

Now that these are (technically) key-value pairs, you can come up with a
partitioner and apply it; something like:

import org.apache.spark.HashPartitioner

// Keep the same number of partitions as the original RDD.
val numPartitions = pairRdd.partitions.length
val partitioner = new HashPartitioner(numPartitions)
val partitionedRdd = pairRdd.partitionBy(partitioner)

Now you can use partitionedRdd.lookup(value), which will give you back a
sequence of all the values (in this case, 1's) associated with the key
"value". The item is in the RDD exactly when that sequence is non-empty.

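Putting the pieces together, here is a minimal sketch of the whole approach,
assuming a local SparkContext and made-up sample data. The object name
PartitionedContains, the helper names index and contains, and the call to
cache() are my own additions for illustration; caching is an assumption that
you will run many lookups against the same partitioned RDD.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical helper object, not part of Spark.
object PartitionedContains {
  // Key each element by itself, hash-partition the pairs, and cache the
  // result so repeated lookups can reuse the partitioned data.
  def index[T: ClassTag](rdd: RDD[T]): RDD[(T, Int)] = {
    val pairRdd = rdd.map(x => (x, 1))
    val partitioner = new HashPartitioner(pairRdd.partitions.length)
    pairRdd.partitionBy(partitioner).cache()
  }

  // lookup returns every value stored under the key; a non-empty result
  // means the item is present.
  def contains[T: ClassTag](indexed: RDD[(T, Int)], value: T): Boolean =
    indexed.lookup(value).nonEmpty

  def main(args: Array[String]): Unit = {
    // local[2] and the sample range are assumptions for a quick local run.
    val conf = new SparkConf().setAppName("contains-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val indexed = index(sc.parallelize(1 to 100000, 8))
      println(contains(indexed, 4242)) // key is in the range
      println(contains(indexed, -1))   // key is not in the range
    } finally {
      sc.stop()
    }
  }
}
```

Note that partitionBy itself is a shuffle over the whole data set, so this
only pays off when the cost is spread over many lookups.
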
You can use rdd.lookup on any RDD of key-value pairs.  However, if you look
at the source at
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
you'll see that it is far more efficient when the RDD has a known
partitioner -- it only looks in the one partition that must contain the
given key.  Without a partitioner, it falls back to filtering the entire
RDD.  This can make all the difference!
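
To illustrate why a partitioner helps, here is a plain-Scala sketch of the
idea, with no Spark involved. It is not Spark's actual code; the object and
method names are hypothetical, and ordinary in-memory Vectors stand in for
partitions. A key's hash code picks exactly one partition, so a lookup only
needs to search that one.

```scala
// Hypothetical stand-in for hash partitioning; not Spark's implementation.
object PartitionLookupSketch {
  // hashCode may be negative, so map the remainder into [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // The single partition a key can live in.
  def partitionFor(key: Any, numPartitions: Int): Int =
    if (key == null) 0 else nonNegativeMod(key.hashCode, numPartitions)

  // Distribute keys into numPartitions buckets by hash.
  def partition[K](keys: Seq[K], numPartitions: Int): Vector[Vector[K]] = {
    val grouped = keys.groupBy(k => partitionFor(k, numPartitions))
    Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty).toVector)
  }

  // Search only the bucket the key hashes to, never the others.
  def contains[K](partitions: Vector[Vector[K]], key: K): Boolean =
    partitions(partitionFor(key, partitions.length)).contains(key)
}
```

With N partitions of roughly equal size, each lookup in this sketch touches
about 1/N of the data instead of all of it.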

-- Nick

On Thu, Dec 31, 2015 at 8:26 AM domibd <[email protected]> wrote:

> thanks a lot.
>
> It is very interesting.
>
> Unfortunately, it does not solve my very simple problem:
> efficiently finding whether a value is in a huge RDD.
>
> thanks again
>
> Dominique
>
> On 31/12/2015 01:26, madaan.amanmadaan [via Apache Spark User List]
> wrote:
>
> > Hi,
> >
> > Check out https://github.com/amplab/spark-indexedrdd, might be helpful.
> >
> > Aman
> >
> > On Wed, Dec 30, 2015 at 12:13 PM, domibd [via Apache Spark User List]
> > <[hidden email]> wrote:
> >
> >     hello,
> >
> >     How can I check for the existence of an item in a very large RDD
> >     in a parallelised way, such that the process stops as soon as
> >     the item is found (if it is found)?
> >
> >     thanks a lot
> >
> >     Dominique
