Thanks Andrew.

I tried your suggestions, and (2) works for me. (1) still doesn't work.

Chen

On Thu, Jul 9, 2015 at 4:58 PM, Andrew Or <[email protected]> wrote:

> Hi Chen,
>
> I believe the issue is that `object foo` is a member of `object testing`,
> so the only way to access `object foo` is to first pull `object testing`
> into the closure and then follow a pointer to get to `object foo`. There
> are two workarounds that I'm aware of:
>
> (1) Move `object foo` outside of `object testing`. The problem only arises
> because of the nested objects. Arguably the flatter design is also simpler
> to reason about, but that's a separate discussion.
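>
> A rough sketch of (1), with `foo` hoisted out of `testing` (same names and
> the same `sc` as in your example; just a sketch, not tested against your
> exact setup):
>
> object foo extends Serializable {
>   val v = 42
> }
>
> object testing {
>   val list = List(1, 2, 3)
>   val rdd = sc.parallelize(list)
>   def func = {
>     // `foo` is now top-level, so reaching it no longer requires an
>     // $outer pointer to `testing`
>     val after = rdd.foreachPartition { it => println(foo.v) }
>   }
> }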
>
> (2) Create a local variable for `foo.v`. If all your closure cares about
> is the integer, it makes sense to add a `val v = foo.v` inside `func` and
> use that in your closure instead. This avoids pulling any $outer pointers
> into your closure at all, since the closure then only references local
> variables.
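>
> A rough sketch of (2), using your example unchanged except for the local
> `val v` (again just a sketch):
>
> object testing {
>   object foo extends Serializable {
>     val v = 42
>   }
>   val list = List(1, 2, 3)
>   val rdd = sc.parallelize(list)
>   def func = {
>     // capture just the Int locally so the closure doesn't need `testing`
>     val v = foo.v
>     val after = rdd.foreachPartition { it =>
>       // only the local `v` is referenced, so no $outer pointer is captured
>       println(v)
>     }
>   }
> }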
>
> As others have commented, I think this is more of a Scala problem than a
> Spark one.
>
> Let me know if these work,
> -Andrew
>
> 2015-07-09 13:36 GMT-07:00 Richard Marscher <[email protected]>:
>
>> Reading that article and applying it to your observations of what happens
>> at runtime:
>>
>> shouldn't the closure require serializing testing? The foo singleton
>> object is a member of testing, and you reference that foo value in func
>> and, further, in the foreachPartition closure. So, following that article,
>> Scala will attempt to serialize the containing object/class testing to get
>> at the foo instance.
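>>
>> Roughly speaking, the anonymous function compiles to something like the
>> sketch below (a hypothetical desugaring, not the real generated class; the
>> compiler's actual field is called $outer, as in your stack trace):
>>
>> class ClosureSketch(outer: testing.type)
>>     extends (Iterator[Int] => Unit) with Serializable {
>>   // reaching foo goes through the captured reference to the enclosing
>>   // `testing` object, so serializing the closure serializes `testing` too
>>   def apply(it: Iterator[Int]): Unit = println(outer.foo.v)
>> }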
>>
>> On Thu, Jul 9, 2015 at 4:11 PM, Chen Song <[email protected]> wrote:
>>
>>> Reposting the code example:
>>>
>>> object testing extends Serializable {
>>>     object foo {
>>>       val v = 42
>>>     }
>>>     val list = List(1,2,3)
>>>     val rdd = sc.parallelize(list)
>>>     def func = {
>>>       val after = rdd.foreachPartition {
>>>         it => println(foo.v)
>>>       }
>>>     }
>>>   }
>>>
>>> On Thu, Jul 9, 2015 at 4:09 PM, Chen Song <[email protected]>
>>> wrote:
>>>
>>>> Thanks Erik. I saw that document too. That is why I am confused: as per
>>>> the article, it should be fine as long as *foo* is serializable. However,
>>>> what I have seen is that it works if *testing* is serializable, even when
>>>> *foo* is not serializable, as shown below. I don't know if there is
>>>> something specific to Spark here.
>>>>
>>>> For example, the code below works.
>>>>
>>>> object testing extends Serializable {
>>>>     object foo {
>>>>       val v = 42
>>>>     }
>>>>     val list = List(1,2,3)
>>>>     val rdd = sc.parallelize(list)
>>>>     def func = {
>>>>       val after = rdd.foreachPartition {
>>>>         it => println(foo.v)
>>>>       }
>>>>     }
>>>>   }
>>>>
>>>> On Thu, Jul 9, 2015 at 12:11 PM, Erik Erlandson <[email protected]> wrote:
>>>>
>>>>> I think you have stumbled across this idiosyncrasy:
>>>>>
>>>>>
>>>>> http://erikerlandson.github.io/blog/2015/03/31/hygienic-closures-for-scala-function-serialization/
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>> > I am not sure whether this is more of a question about Spark or just
>>>>> > Scala, but I am posting my question here.
>>>>> >
>>>>> > The code snippet below shows an example of passing a reference into a
>>>>> > closure in the rdd.foreachPartition method.
>>>>> >
>>>>> > ```
>>>>> > object testing {
>>>>> >     object foo extends Serializable {
>>>>> >       val v = 42
>>>>> >     }
>>>>> >     val list = List(1,2,3)
>>>>> >     val rdd = sc.parallelize(list)
>>>>> >     def func = {
>>>>> >       val after = rdd.foreachPartition {
>>>>> >         it => println(foo.v)
>>>>> >       }
>>>>> >     }
>>>>> >   }
>>>>> > ```
>>>>> > When running this code, I got an exception
>>>>> >
>>>>> > ```
>>>>> > Caused by: java.io.NotSerializableException:
>>>>> >   $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$
>>>>> > Serialization stack:
>>>>> > - object not serializable (class:
>>>>> >   $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$,
>>>>> >   value: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$@10b7e824)
>>>>> > - field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$$anonfun$1,
>>>>> >   name: $outer, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$)
>>>>> > - object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$$anonfun$1,
>>>>> >   <function1>)
>>>>> > ```
>>>>> >
>>>>> > It looks like Spark needs to serialize the `testing` object. Why is it
>>>>> > serializing testing even though I only pass foo (another serializable
>>>>> > object) in the closure?
>>>>> >
>>>>> > A more general question: how can I prevent Spark from serializing the
>>>>> > parent class where the RDD is defined, while still being able to pass in
>>>>> > functions defined in other classes?
>>>>> >
>>>>> > --
>>>>> > Chen Song
>>>>> >
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Chen Song
>>>>
>>>>
>>>
>>>
>>> --
>>> Chen Song
>>>
>>>
>>
>>
>> --
>> *Richard Marscher*
>> Software Engineer
>> Localytics
>> Localytics.com <http://localytics.com/> | Our Blog
>> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> |
>> Facebook <http://facebook.com/localytics> | LinkedIn
>> <http://www.linkedin.com/company/1148792?trk=tyah>
>>
>
>


-- 
Chen Song
