Re: refer to dictionary

2015-03-31 Thread Peng Xia
Hi Ted,

Thanks very much. Yes, using a broadcast variable is much faster.

Best,
Peng

On Tue, Mar 31, 2015 at 8:49 AM, Ted Yu  wrote:

> You can use a broadcast variable.
>
> See also this thread:
>
> http://search-hadoop.com/m/JW1q5GX7U22/Spark+broadcast+variable&subj=How+Broadcast+variable+scale+
>
>
>
> > On Mar 31, 2015, at 4:43 AM, Peng Xia  wrote:
> >
> > Hi,
> >
> > I have an RDD (rdd1) where each line is split into an array, e.g. ["a", "b", "c"].
> > I also have a local dictionary (dict1) that stores key-value pairs: {"a": 1, "b": 2, "c": 3}.
> > I want to replace the keys in the RDD with their corresponding values in the dict:
> > rdd1.map(lambda line: [dict1[item] for item in line])
> >
> > But this task is not distributed; I believe the reason is that dict1 is a local instance.
> > Can anyone suggest how to parallelize this?
> >
> >
> > Thanks,
> > Best,
> > Peng
> >
>


Re: refer to dictionary

2015-03-31 Thread Ted Yu
You can use a broadcast variable.

See also this thread:
http://search-hadoop.com/m/JW1q5GX7U22/Spark+broadcast+variable&subj=How+Broadcast+variable+scale+
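
For illustration, here is a minimal PySpark sketch of the broadcast approach, not a definitive implementation. The helper name replace_keys, the function run_example, the local[2] master, the app name, and the sample data are all assumptions made up for this example; only rdd1 and dict1 come from the original question. A broadcast variable ships the dictionary to each executor once, instead of serializing it into the closure of every task.

```python
def replace_keys(line, lookup):
    """Replace every item in `line` with its value from the mapping `lookup`."""
    return [lookup[item] for item in line]


def run_example():
    # Imported here so that replace_keys can be used without Spark installed.
    from pyspark import SparkContext

    # "local[2]" and the app name are placeholders for this sketch.
    sc = SparkContext("local[2]", "broadcast-dict-example")
    try:
        dict1 = {"a": 1, "b": 2, "c": 3}
        # Stand-in data for the rdd1 described in the question.
        rdd1 = sc.parallelize([["a", "b", "c"], ["c", "a"]])

        # Send the dictionary to the executors once as a read-only variable.
        bc_dict = sc.broadcast(dict1)

        # Inside the task, read the dictionary through .value.
        mapped = rdd1.map(lambda line: replace_keys(line, bc_dict.value))
        return mapped.collect()
    finally:
        sc.stop()
```

Calling run_example() requires a working PySpark installation. Note that a key missing from the dictionary raises KeyError inside the task; lookup.get(item, item) is one way to leave unknown items unchanged, if that is the behaviour you want.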



> On Mar 31, 2015, at 4:43 AM, Peng Xia  wrote:
> 
> Hi,
> 
> I have an RDD (rdd1) where each line is split into an array, e.g. ["a", "b", "c"].
> I also have a local dictionary (dict1) that stores key-value pairs: {"a": 1, "b": 2, "c": 3}.
> I want to replace the keys in the RDD with their corresponding values in the dict:
> rdd1.map(lambda line: [dict1[item] for item in line])
> 
> But this task is not distributed; I believe the reason is that dict1 is a local instance.
> Can anyone suggest how to parallelize this?
> 
> 
> Thanks,
> Best,
> Peng
> 
