After playing around with mapPartitions, I think this does exactly what I
want. I can pass a function to mapPartitions that looks like this:
def f1(iter: Iterator[String]): Iterator[MyIndex] = {
  val idx: MyIndex = new MyIndex()
  while (iter.hasNext) {
    val text: String = iter.next()
    idx.add(text)
  }
  List(idx).iterator
}

val indexRDD: RDD[MyIndex] = rdd.mapPartitions(f1, preservesPartitioning = true)
Now I can perform a match operation like so:
val matchRdd: RDD[Match] = indexRDD.flatMap(index => index.`match`(myquery))
I'm sure it won't take much imagination to figure out how to do the
matching in a batch way.
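For the curious, here's a rough sketch of the batch version I have in mind. This assumes a `queries: Seq[String]` collection on the driver and that `MyIndex.match` returns a collection of `Match` objects, as above; none of this is tested yet:

    // Broadcast the batch of queries once, then run every query against
    // each partition-local index and tag results with the query string.
    val queries: Seq[String] = Seq("first query", "second query")
    val queriesBc = sc.broadcast(queries)

    val batchMatches: RDD[(String, Match)] =
      indexRDD.flatMap { index =>
        queriesBc.value.flatMap { q =>
          index.`match`(q).map(m => (q, m))
        }
      }

Broadcasting the query batch avoids shipping it separately with every task closure, which should matter once the batch gets large.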
If anyone has done anything along these lines I'd love to have some
feedback.
Thanks,
Philip
On 08/04/2014 09:46 AM, Philip Ogren wrote:
This looks like a really cool feature and it seems likely that this
will be extremely useful for things we are doing. However, I'm not
sure it is quite what I need here. With an inverted index you don't
actually look items up by their keys but instead try to match against
some input string. So, if I created an inverted index on 1M
strings/documents in a single JVM I could subsequently submit a
string-valued query to retrieve the n-best matches. I'd like to do
something like build ten 1M string/document indexes across ten nodes,
submit a string-valued query to each of the ten indexes and aggregate
the n-best matches from the ten sets of results. Would this be
possible with IndexedRDD or some other feature of Spark?
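To make the aggregation step concrete, here is a sketch of what I'm picturing, assuming `MyIndex.match` returns the per-partition matches and that `Match` exposes a numeric `score` field (the field name is my own placeholder):

    // Query every partition-local index, then keep only the n best
    // matches across all partitions, ranked by score.
    val n = 10
    val query = "some query string"
    val nBest: Array[Match] =
      indexRDD
        .flatMap(index => index.`match`(query))
        .top(n)(Ordering.by((m: Match) => m.score))

RDD.top pulls only the top-n candidates from each partition back to the driver, so the aggregation stays cheap even with many partitions.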
Thanks,
Philip
On 08/01/2014 04:26 PM, Ankur Dave wrote:
At 2014-08-01 14:50:22 -0600, Philip Ogren <philip.og...@oracle.com>
wrote:
It seems that I could do this with mapPartition so that each element in a
partition gets added to an index for that partition.
[...]
Would it then be possible to take a string and query each partition's index
with it? Or better yet, take a batch of strings and query each string in the
batch against each partition's index?
I proposed a key-value store based on RDDs called IndexedRDD that
does exactly what you described. It uses mapPartitions to construct
an index within each partition, then exposes get and multiget methods
to allow looking up values associated with given keys.
It will hopefully make it into Spark 1.2.0. Until then you can try it
out by merging in the pull request locally:
https://github.com/apache/spark/pull/1297.
See JIRA for details and slides on how it works:
https://issues.apache.org/jira/browse/SPARK-2365.
Ankur
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org