At the Spark Summit, the Typesafe people had a toy implementation of a full-text inverted index that you could use as a starting point. The bare code is available on GitHub at https://github.com/deanwampler/spark-workshop/blob/eb077a734aad166235de85494def8fe3d4d2ca66/src/main/scala/spark/InvertedIndex5b.scala. You can get a nice presentation of it through Typesafe's "Activator" - see http://typesafe.com/activator/template/spark-workshop for more on that.
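For readers who don't want to open the workshop repo, here is a minimal sketch of what such an in-memory inverted index looks like. This is plain Scala, not the code from the linked repository; the names `InvertedIndex`, `add`, and `lookup` and the whitespace tokenization are hypothetical choices for illustration.

```scala
import scala.collection.mutable

// Minimal in-memory inverted index sketch (hypothetical; not the workshop code).
class InvertedIndex {
  // term -> ids of the documents containing it (assumes whitespace tokenization)
  private val postings = mutable.Map.empty[String, mutable.Set[Int]]

  def add(docId: Int, text: String): Unit =
    for (term <- text.toLowerCase.split("\\s+") if term.nonEmpty)
      postings.getOrElseUpdate(term, mutable.Set.empty) += docId

  // All documents containing the given term (empty set if the term is unknown).
  def lookup(term: String): Set[Int] =
    postings.get(term.toLowerCase).map(_.toSet).getOrElse(Set.empty)
}
```

The point is only that the index is an ordinary in-memory data structure; the thread below is about how to build one of these per Spark partition.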
Please keep the list informed on how things go - I'd like this functionality as well.

Ron

> -----Original Message-----
> From: Philip Ogren [mailto:philip.og...@oracle.com]
> Sent: Monday, August 04, 2014 11:08 AM
> To: user@spark.apache.org
> Subject: Re: creating a distributed index
>
> After playing around with mapPartitions I think this does exactly what I
> want. I can pass a function to mapPartitions that looks like this:
>
>     def f1(iter: Iterator[String]): Iterator[MyIndex] = {
>       val idx: MyIndex = new MyIndex()
>       while (iter.hasNext) {
>         val text: String = iter.next()
>         idx.add(text)
>       }
>       List(idx).iterator
>     }
>
>     val indexRDD: RDD[MyIndex] = rdd.mapPartitions(f1, preservesPartitioning = true)
>
> Now I can perform a match operation like so:
>
>     val matchRdd: RDD[Match] = indexRDD.flatMap(index => index.`match`(myquery))
>
> I'm sure it won't take much imagination to figure out how to do the
> matching in a batch way.
>
> If anyone has done anything along these lines I'd love to have some
> feedback.
>
> Thanks,
> Philip
>
>
> On 08/04/2014 09:46 AM, Philip Ogren wrote:
> > This looks like a really cool feature and it seems likely that this
> > will be extremely useful for things we are doing. However, I'm not
> > sure it is quite what I need here. With an inverted index you don't
> > actually look items up by their keys but instead try to match against
> > some input string. So, if I created an inverted index on 1M
> > strings/documents in a single JVM, I could subsequently submit a
> > string-valued query to retrieve the n-best matches. I'd like to do
> > something like build ten 1M string/document indexes across ten nodes,
> > submit a string-valued query to each of the ten indexes, and aggregate
> > the n-best matches from the ten sets of results. Would this be
> > possible with IndexedRDD or some other feature of Spark?
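The pattern Philip describes - one index per partition, query flatMapped across all indexes, results merged into a global n-best list - can be sketched with plain Scala collections standing in for RDDs. `MyIndex`, `Match`, and the term-overlap scoring below are hypothetical stand-ins; the thread never shows `MyIndex`'s internals.

```scala
// Sketch of the per-partition-index pattern, assuming a simple
// shared-term-count score. All names here are illustrative placeholders.
case class Match(docId: Int, score: Int)

class MyIndex {
  private var docs = Vector.empty[(Int, Set[String])]

  def add(docId: Int, text: String): Unit =
    docs = docs :+ (docId -> text.toLowerCase.split("\\s+").toSet)

  // Score each document by how many terms it shares with the query;
  // return up to n matches, best first.
  def matchQuery(query: String, n: Int): Seq[Match] = {
    val qTerms = query.toLowerCase.split("\\s+").toSet
    docs.map { case (id, terms) => Match(id, (terms & qTerms).size) }
      .filter(_.score > 0)
      .sortBy(m => -m.score)
      .take(n)
  }
}

// Mirrors indexRDD.flatMap(index => index.`match`(myquery)) followed by a
// global top-n merge on the driver: each "partition" index contributes its
// local n best, and we keep the overall n best.
def nBest(indexes: Seq[MyIndex], query: String, n: Int): Seq[Match] =
  indexes.flatMap(_.matchQuery(query, n)).sortBy(m => -m.score).take(n)
```

Because each local index already truncates to its n best, the driver only ever merges `numPartitions * n` candidates rather than every match.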
> >
> > Thanks,
> > Philip
> >
> >
> > On 08/01/2014 04:26 PM, Ankur Dave wrote:
> >> At 2014-08-01 14:50:22 -0600, Philip Ogren <philip.og...@oracle.com>
> >> wrote:
> >>> It seems that I could do this with mapPartitions so that each element
> >>> in a partition gets added to an index for that partition.
> >>> [...]
> >>> Would it then be possible to take a string and query each partition's
> >>> index with it? Or better yet, take a batch of strings and query each
> >>> string in the batch against each partition's index?
> >>
> >> I proposed a key-value store based on RDDs called IndexedRDD that does
> >> exactly what you described. It uses mapPartitions to construct an
> >> index within each partition, then exposes get and multiget methods to
> >> allow looking up values associated with given keys.
> >>
> >> It will hopefully make it into Spark 1.2.0. Until then you can try it
> >> out by merging in the pull request locally:
> >> https://github.com/apache/spark/pull/1297.
> >>
> >> See JIRA for details and slides on how it works:
> >> https://issues.apache.org/jira/browse/SPARK-2365.
> >>
> >> Ankur
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
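Ankur's description of IndexedRDD - a lookup index built per partition via mapPartitions, exposed through get and multiget - can be illustrated with an ordinary Scala map standing in for the per-partition structure. This is a sketch of the interface idea only, not IndexedRDD's actual implementation (see the PR and JIRA links above for that).

```scala
// Hypothetical stand-in for a per-partition key-value index, illustrating
// the get/multiget interface Ankur describes. A plain immutable Map plays
// the role of the partition-local index.
class PartitionIndex[K, V](entries: Iterable[(K, V)]) {
  private val index: Map[K, V] = entries.toMap

  // Look up a single key in this partition's index.
  def get(key: K): Option[V] = index.get(key)

  // Look up many keys at once; keys absent from the index are dropped.
  def multiget(keys: Seq[K]): Map[K, V] =
    keys.iterator.flatMap(k => index.get(k).map(k -> _)).toMap
}
```

In the real IndexedRDD, one such index lives inside each partition (built with mapPartitions, just as in Philip's `f1`), and get/multiget route each key to the partition that owns it.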