On Wed, Feb 11, 2015 at 10:47 AM, rok <rokros...@gmail.com> wrote:
> I was having trouble with memory exceptions when broadcasting a large lookup
> table, so I've resorted to processing it iteratively -- but how can I modify
> an RDD iteratively?
>
> I'm trying something like :
>
> rdd = sc.parallelize(...)
> lookup_tables = {...}
>
> for lookup_table in lookup_tables :
>     rdd = rdd.map(lambda x: func(x, lookup_table))
>
> If I leave it as is, then only the last "lookup_table" is applied instead of
> stringing together all the maps. However, if I add a .cache() to the .map then
> it seems to work fine.

This is related to how Python implements closures; you should
do it like this:

def create_func(lookup_table):
    return lambda x: func(x, lookup_table)

for lookup_table in lookup_tables:
    rdd = rdd.map(create_func(lookup_table))

A Python closure remembers the variable itself, not a copy of its value.
In the loop, `lookup_table` is always the same variable, so when the final
rdd is serialized, all the closures refer to that one `lookup_table`, which
by then points to the last value.
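
A quick illustration in plain Python, independent of Spark (not from the
original thread, just to show the late-binding behavior):

fs = []
for i in range(3):
    fs.append(lambda: i)
print([f() for f in fs])   # prints [2, 2, 2]: every closure sees the last i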

When the closure is created inside a function, Python creates a fresh local
variable for each call, so each closure captures its own value and it works.
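
An alternative workaround (a minimal sketch, not from the original code,
assuming the same `func` and `lookup_tables`) is to bind the current value
as a default argument of the lambda, since default arguments are evaluated
at definition time:

for lookup_table in lookup_tables:
    # lt=lookup_table is evaluated here, so each lambda keeps its own reference
    rdd = rdd.map(lambda x, lt=lookup_table: func(x, lt))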

> A second problem is that the runtime for each iteration roughly doubles at
> each iteration so this clearly doesn't seem to be the way to do it. What is
> the preferred way of doing such repeated modifications to an RDD and how can
> the accumulation of overhead be minimized?
>
> Thanks!
>
> Rok
