Related question about this kind of problem: what is the best way to get the mappings for a list of keys?

Does this make sense?

val myKeys = sc.parallelize(List(("query1", None), ("query2", None)))
val resolved = myKeys.leftOuterJoin(dictionary)
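For concreteness, here is a sketch of what that yields, assuming dictionary is an RDD[(String, String)] (contents made up for illustration):

// hypothetical dictionary, just for illustration
val dictionary = sc.parallelize(List(("query1", "result1"), ("query3", "result3")))
// leftOuterJoin keeps every key from myKeys; keys absent from the
// dictionary come back as None
val resolved = myKeys.leftOuterJoin(dictionary).mapValues(_._2)
// resolved: RDD[(String, Option[String])]
// holds ("query1", Some("result1")) and ("query2", None)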

Guillaume
If you have a pair RDD (an RDD[(A, B)]), then you can use the .lookup() method on it for faster access.
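For example (a minimal sketch; the contents are made up):

val dictionary = sc.parallelize(List(("query1", 1.0), ("query2", 2.0)))
// lookup() returns every value for the key as a Seq; with a known
// partitioner it scans only the partition that can hold the key,
// otherwise it filters all partitions
val hits: Seq[Double] = dictionary.lookup("query1")   // Seq(1.0)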


Spark's strength is running computations across a large set of data.  If you're trying to do fast lookup of a few individual keys, I'd recommend something more like memcached or Elasticsearch.


On Fri, Jan 24, 2014 at 1:11 PM, Manoj Samel <[email protected]> wrote:
Yes, that works.

But then the hashmap functionality of fast key lookup is gone, and the search becomes linear, using an iterator. I'm not sure whether Spark internally adds optimizations for a Seq, but otherwise one has to assume this becomes a List/Array without the fast key lookup of a hashmap or b-tree.

Any thoughts?
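A couple of options to get fast key access back (a sketch reusing the cr_rdd from Frank's snippet below; the partitioning and collectAsMap choices here are my own suggestion, not from the thread):

import org.apache.spark.HashPartitioner

// After hash-partitioning, lookup(key) scans only the single partition
// that can hold the key instead of every partition
val partitioned = cr_rdd.partitionBy(new HashPartitioner(8)).cache()
partitioned.lookup("someKey")          // Seq[Double]; "someKey" is illustrative

// For a small RDD, collectAsMap() rebuilds an ordinary driver-side Map
// with O(1) hash lookups
val localMap = cr_rdd.collectAsMap()   // scala.collection.Map[String, Double]
localMap.get("someKey")                // Option[Double]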





On Fri, Jan 24, 2014 at 1:00 PM, Frank Austin Nothaft <[email protected]> wrote:
Manoj,

I assume you’re trying to create an RDD[(String, Double)]? Couldn’t you just do:

val cr_rdd = sc.parallelize(cr.toSeq)

The toSeq would convert the HashMap[String,Double] into a Seq[(String, Double)] before calling the parallelize function.
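Self-contained, that looks something like this (the map contents are made up):

import scala.collection.mutable.HashMap

val cr = HashMap("query1" -> 1.0, "query2" -> 2.0)
// toSeq produces a Seq[(String, Double)], which parallelize accepts
val cr_rdd = sc.parallelize(cr.toSeq)
// cr_rdd: org.apache.spark.rdd.RDD[(String, Double)]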

Regards,

Frank Austin Nothaft
[email protected]
[email protected]
202-340-0466

On Jan 24, 2014, at 12:56 PM, Manoj Samel <[email protected]> wrote:

> Is there a way to create an RDD over a hashmap?
>
> If I have a hash map and try sc.parallelize, it gives
>
> <console>:17: error: type mismatch;
>  found   : scala.collection.mutable.HashMap[String,Double]
>  required: Seq[?]
> Error occurred in an application involving default arguments.
>        val cr_rdd = sc.parallelize(cr)
>                                    ^





--
eXenSa
Guillaume PITEL, Président
+33(0)6 25 48 86 80 / +33(0)9 70 44 67 53

eXenSa S.A.S.
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
