Thanks TD. I was looking into broadcast variables.

Right now I am running it locally...and I plan to move it to "production"
on EC2.

I fixed it with myrdd.map(lambda x: (x, mylist)).map(myfunc),
but I'm not sure that's efficient.

mylist is filled only once at the start and never changes.
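
For reference, a minimal sketch of what I think the alternative looks like: bind the list to the function instead of pairing it with every record in an extra map(). The body of myfunc, the sample values, and the `metadata` parameter name are made up for illustration, and the built-in map() stands in for myrdd.map() so the snippet runs without a cluster.

```python
from functools import partial


def myfunc(record, metadata):
    """Hypothetical stand-in for the real myfunc: flag records found in the metadata."""
    return (record, record in metadata)


# Assumed: populated once on the driver and never changed afterwards.
mylist = ["a", "b"]

# Bind the list as a keyword argument so it travels with the function.
tagged = partial(myfunc, metadata=mylist)

# Plain-Python stand-in for myrdd.map(tagged).
result = list(map(tagged, ["a", "c"]))

# Broadcast variant (needs a live SparkContext `sc`, so it is left as comments):
#   bc = sc.broadcast(mylist)
#   myrdd.map(lambda x: myfunc(x, bc.value))
```

With a broadcast, the list is read through bc.value inside the mapped function, which is what TD suggests below.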

Vadim

On Wed, Apr 22, 2015 at 1:42 PM, Tathagata Das <t...@databricks.com> wrote:

> Is mylist present on every executor? If not, then you have to pass it
> on, and broadcasts are the best way to do that. But note that once
> broadcast, it will be immutable at the executors, and if you update the list
> at the driver, you will have to broadcast it again.
>
> TD
>
> On Wed, Apr 22, 2015 at 9:28 AM, Vadim Bichutskiy <
> vadim.bichuts...@gmail.com> wrote:
>
>> I am using Spark Streaming with Python. For each RDD, I call a map, i.e.,
>> myrdd.map(myfunc); myfunc is in a separate Python module. In yet another
>> separate Python module I have a global list, i.e. mylist, that's populated
>> with metadata. I can't get myfunc to see mylist...it's always empty.
>> Alternatively, I guess I could pass mylist to map.
>>
>> Any suggestions?
>>
>> Thanks,
>> Vadim
>>
>
>
