Thanks, TD. I was looking into broadcast variables. Right now I am running it locally, and I plan to move it to "production" on EC2.
The way I fixed it is by doing myrdd.map(lambda x: (x, mylist)).map(myfunc), but I don't think it's efficient? mylist is filled only once at the start and never changes.

Vadim

On Wed, Apr 22, 2015 at 1:42 PM, Tathagata Das <t...@databricks.com> wrote:

> Is the mylist present on every executor? If not, then you have to pass it
> on. And broadcasts are the best way to pass them on. But note that once
> broadcasted it will be immutable at the executors, and if you update the list
> at the driver, you will have to broadcast it again.
>
> TD
>
> On Wed, Apr 22, 2015 at 9:28 AM, Vadim Bichutskiy <vadim.bichuts...@gmail.com> wrote:
>
>> I am using Spark Streaming with Python. For each RDD, I call a map, i.e.,
>> myrdd.map(myfunc), myfunc is in a separate Python module. In yet another
>> separate Python module I have a global list, i.e. mylist, that's populated
>> with metadata. I can't get myfunc to see mylist...it's always empty.
>> Alternatively, I guess I could pass mylist to map.
>>
>> Any suggestions?
>>
>> Thanks,
>> Vadim
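
For anyone following the thread, here is a minimal sketch of what the broadcast approach TD describes might look like, next to a closure-based variant that passes mylist to map directly. It rests on several assumptions: the two-argument signature and the body of myfunc are hypothetical (the real myfunc is not shown in this thread), tag_with_broadcast and tag_with_closure are illustrative names, sc is taken to be a pyspark SparkContext, and rdd a pyspark RDD.

```python
from functools import partial


def myfunc(record, metadata):
    """Hypothetical stand-in for the real myfunc.

    The metadata is passed in as an argument instead of being read from a
    module-level global, so the function does not depend on that global
    having been populated in the executor's Python process.
    """
    return (record, record in metadata)


def tag_with_broadcast(rdd, sc, mylist):
    """Broadcast mylist from the driver and read it inside map().

    The broadcast value is read-only on the executors; if mylist changes on
    the driver, it has to be broadcast again, as TD notes.
    """
    bc = sc.broadcast(mylist)
    return rdd.map(lambda x: myfunc(x, bc.value))


def tag_with_closure(rdd, mylist):
    """Alternative without a broadcast: bind mylist into the mapped function.

    The list then travels with the serialized function rather than being
    attached to every record.
    """
    return rdd.map(partial(myfunc, metadata=mylist))
```

With a local context, e.g. sc = SparkContext("local[2]", "sketch"), something like tag_with_broadcast(sc.parallelize(["a", "b"]), sc, ["a"]).collect() would exercise the broadcast path. Compared with map(lambda x: (x, mylist)), neither variant builds an intermediate RDD in which every record carries its own reference to the list.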