At 2014-08-01 14:50:22 -0600, Philip Ogren <philip.og...@oracle.com> wrote:
> It seems that I could do this with mapPartition so that each element in a
> partition gets added to an index for that partition.
> [...]
> Would it then be possible to take a string and query each partition's index
> with it? Or better yet, take a batch of strings and query each string in the
> batch against each partition's index?

I have proposed IndexedRDD, a key-value store built on RDDs, that does exactly
what you described. It uses mapPartitions to construct an index within each
partition, then exposes get and multiget methods for looking up the values
associated with given keys.
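
To make the idea concrete, here is a rough plain-Python sketch of the pattern,
with no Spark involved. It is not the IndexedRDD code or its API: the names
build_indexes, partition_of, get and multiget, the partition count, and the
hash-based routing of keys to partitions are all assumptions made for
illustration. Each "partition" is just a list, and its index is a dict.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count, chosen for the example


def partition_of(key, num_partitions=NUM_PARTITIONS):
    # Assumed routing rule: hash-partition by key, so a lookup only has to
    # touch the one partition that could hold the key.
    return hash(key) % num_partitions


def build_indexes(pairs, num_partitions=NUM_PARTITIONS):
    # Distribute the (key, value) pairs across partitions.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_of(key, num_partitions)].append((key, value))
    # Stand-in for the mapPartitions step: build one index per partition.
    return [dict(part) for part in partitions]


def multiget(indexes, keys):
    # Group the batch by partition so each index is visited once per batch.
    by_partition = defaultdict(list)
    for key in keys:
        by_partition[partition_of(key, len(indexes))].append(key)
    found = {}
    for pid, batch in by_partition.items():
        index = indexes[pid]
        for key in batch:
            if key in index:
                found[key] = index[key]
    return found  # keys that are not present are simply omitted


def get(indexes, key):
    # Single-key lookup expressed in terms of multiget; None if absent.
    return multiget(indexes, [key]).get(key)
```

For example, after idx = build_indexes([("a", 1), ("b", 2)]), calling
get(idx, "b") returns 2 and multiget(idx, ["a", "x"]) returns {"a": 1}.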

It will hopefully make it into Spark 1.2.0. Until then you can try it out by 
merging in the pull request locally: https://github.com/apache/spark/pull/1297.

See the JIRA issue for details and slides on how it works:
https://issues.apache.org/jira/browse/SPARK-2365.

Ankur
