Thanks Andrew. I tried your suggestions: (2) works for me, but (1) still doesn't.
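For reference, a minimal sketch of what (2) looks like, based on Andrew's description in the quoted mail below. `sc` is assumed to be an existing SparkContext, as in the original example; the point is that the closure captures only a local Int, so no $outer pointer back to `testing` is needed:

```
object testing {
  object foo {
    val v = 42
  }
  val list = List(1, 2, 3)
  val rdd = sc.parallelize(list)
  def func = {
    val v = foo.v                  // local copy of the value the closure needs
    rdd.foreachPartition { it =>
      println(v)                   // references only the local val, not foo or testing
    }
  }
}
```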
Chen

On Thu, Jul 9, 2015 at 4:58 PM, Andrew Or <[email protected]> wrote:

> Hi Chen,
>
> I believe the issue is that `object foo` is a member of `object testing`,
> so the only way to access `object foo` is to first pull `object testing`
> into the closure, then access a pointer to get to `object foo`. There are
> two workarounds that I'm aware of:
>
> (1) Move `object foo` outside of `object testing`. This is only a problem
> because of the nested objects. Also, by design it's simpler to reason
> about, but that's a separate discussion.
>
> (2) Create a local variable for `foo.v`. If all your closure cares about
> is the integer, then it makes sense to add a `val v = foo.v` inside `func`
> and use this in your closure instead. This avoids pulling $outer pointers
> into your closure at all, since it only references local variables.
>
> As others have commented, I think this is more of a Scala problem than a
> Spark one.
>
> Let me know if these work,
> -Andrew
>
> 2015-07-09 13:36 GMT-07:00 Richard Marscher <[email protected]>:
>
>> Reading that article and applying it to your observations of what
>> happens at runtime: shouldn't the closure require serializing testing?
>> The foo singleton object is a member of testing, and you then use this
>> foo value in the closure func and further in the foreachPartition
>> closure. So, following that article, Scala will attempt to serialize
>> the containing object/class testing to get the foo instance.
>>
>> On Thu, Jul 9, 2015 at 4:11 PM, Chen Song <[email protected]> wrote:
>>
>>> Reposting the code example:
>>>
>>> object testing extends Serializable {
>>>   object foo {
>>>     val v = 42
>>>   }
>>>   val list = List(1, 2, 3)
>>>   val rdd = sc.parallelize(list)
>>>   def func = {
>>>     val after = rdd.foreachPartition {
>>>       it => println(foo.v)
>>>     }
>>>   }
>>> }
>>>
>>> On Thu, Jul 9, 2015 at 4:09 PM, Chen Song <[email protected]> wrote:
>>>
>>>> Thanks Erik. I saw that document too. That is why I am confused:
>>>> per the article, it should be fine as long as *foo* is serializable.
>>>> However, what I have seen is that it works if *testing* is
>>>> serializable, even if foo is not serializable, as shown below. I don't
>>>> know if there is something specific to Spark.
>>>>
>>>> For example, the code example below works.
>>>>
>>>> object testing extends Serializable {
>>>>   object foo {
>>>>     val v = 42
>>>>   }
>>>>   val list = List(1, 2, 3)
>>>>   val rdd = sc.parallelize(list)
>>>>   def func = {
>>>>     val after = rdd.foreachPartition {
>>>>       it => println(foo.v)
>>>>     }
>>>>   }
>>>> }
>>>>
>>>> On Thu, Jul 9, 2015 at 12:11 PM, Erik Erlandson <[email protected]> wrote:
>>>>
>>>>> I think you have stumbled across this idiosyncrasy:
>>>>>
>>>>> http://erikerlandson.github.io/blog/2015/03/31/hygienic-closures-for-scala-function-serialization/
>>>>>
>>>>> ----- Original Message -----
>>>>> > I am not sure whether this is more of a question about Spark or just
>>>>> > Scala, but I am posting my question here.
>>>>> >
>>>>> > The code snippet below shows an example of passing a reference to a
>>>>> > closure in the rdd.foreachPartition method.
>>>>> >
>>>>> > ```
>>>>> > object testing {
>>>>> >   object foo extends Serializable {
>>>>> >     val v = 42
>>>>> >   }
>>>>> >   val list = List(1, 2, 3)
>>>>> >   val rdd = sc.parallelize(list)
>>>>> >   def func = {
>>>>> >     val after = rdd.foreachPartition {
>>>>> >       it => println(foo.v)
>>>>> >     }
>>>>> >   }
>>>>> > }
>>>>> > ```
>>>>> >
>>>>> > When running this code, I got an exception:
>>>>> >
>>>>> > ```
>>>>> > Caused by: java.io.NotSerializableException:
>>>>> > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$
>>>>> > Serialization stack:
>>>>> > - object not serializable (class:
>>>>> >   $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$, value:
>>>>> >   $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$@10b7e824)
>>>>> > - field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$$anonfun$1,
>>>>> >   name: $outer, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$)
>>>>> > - object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$$anonfun$1,
>>>>> >   <function1>)
>>>>> > ```
>>>>> >
>>>>> > It looks like Spark needs to serialize the `testing` object. Why is it
>>>>> > serializing testing even though I only pass foo (another serializable
>>>>> > object) into the closure?
>>>>> >
>>>>> > A more general question is: how can I prevent Spark from serializing
>>>>> > the parent class where the RDD is defined, while still being able to
>>>>> > pass in functions defined in other classes?
>>>>> >
>>>>> > --
>>>>> > Chen Song
>>>>
>>>> --
>>>> Chen Song
>>>
>>> --
>>> Chen Song
>>
>> --
>> *Richard Marscher*
>> Software Engineer
>> Localytics
>
--
Chen Song
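For completeness, a sketch of workaround (1) as Andrew describes it above, with `foo` hoisted to the top level so the closure can reach it without going through `testing`. This assumes compiled application code; the $iwC$... wrapper classes in the stack trace suggest the example was run in spark-shell, where even "top-level" definitions end up nested inside REPL wrapper objects, which may be why (1) still failed there:

```
object foo extends Serializable {
  val v = 42
}

object testing {
  val list = List(1, 2, 3)
  val rdd = sc.parallelize(list)
  def func = {
    rdd.foreachPartition { it =>
      println(foo.v)               // foo is top-level, so no pointer to testing is captured
    }
  }
}
```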

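On the more general question at the end of the original mail (how to pass in functions defined elsewhere without dragging in the enclosing class): one common pattern is to keep those functions in a standalone top-level object, so the closure references only that object. A hedged sketch, with illustrative names and `sc` again assumed to be an existing SparkContext:

```
object MyFunctions {
  def printElement(i: Int): Unit = println(i)
}

object testing {
  val rdd = sc.parallelize(List(1, 2, 3))
  def func = rdd.foreach(MyFunctions.printElement _)  // references only MyFunctions, not testing
}
```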