I found a detailed explanation here:
https://www.quora.com/Apache-Spark/What-does-Closure-cleaner-func-mean-in-Spark
I'm copying it here for convenience:


When Scala constructs a closure, it determines which outer variables the 
closure will use and stores references to them in the closure object. This 
allows the closure to work properly even when it's called from a different 
scope than it was created in.
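
A quick illustration of this capture in plain Scala (my own sketch, not part of
the quoted answer):

    // A closure stores references to the outer variables it uses,
    // so it keeps working even when called from a different scope.
    object CaptureExample {
      def makeAdder(): Int => Int = {
        val offset = 10            // outer variable captured by the closure
        (x: Int) => x + offset     // the closure holds a reference to `offset`
      }

      def main(args: Array[String]): Unit = {
        val add10 = makeAdder()
        println(add10(5))          // prints 15, even though `offset` is out of scope here
      }
    }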

Scala sometimes errs on the side of capturing too many outer variables (see 
SI-1419<https://issues.scala-lang.org/browse/SI-1419>). That's harmless in most 
cases, because the extra captured variables simply don't get used (though this 
prevents them from getting GC'd). But it poses a problem for Spark, which has 
to send closures across the network so they can be run on slaves. When a 
closure contains unnecessary references, it wastes network bandwidth. More 
importantly, some of the references may point to non-serializable objects, and 
Spark will fail to serialize the closure.
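
A hypothetical sketch of how this shows up in Spark (the class and field names
are mine, not from the answer): a lambda that only needs one field still
references "this", and if the enclosing class is not serializable the job fails
with a "Task not serializable" error:

    import org.apache.spark.SparkContext

    // Hypothetical: Helper is not Serializable, and the closure passed to map
    // only needs `factor`, but `_ * factor` really means `_ * this.factor`,
    // so the whole Helper (including the SparkContext) would have to be shipped.
    class Helper(sc: SparkContext) {
      val factor = 3

      def scale(nums: Seq[Int]): Array[Int] = {
        val rdd = sc.parallelize(nums)
        rdd.map(_ * factor).collect()   // typically fails: Task not serializable
      }
    }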

To work around this bug in Scala, the 
ClosureCleaner<https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala>
 traverses the object at runtime and prunes the unnecessary references. Since 
it does this at runtime, it can be more accurate than the Scala compiler.
Spark can then safely serialize the cleaned closure.
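
The cleaner can also verify that the cleaned closure actually serializes. A
rough, self-contained stand-in for that check (my own sketch, using plain Java
serialization rather than Spark's serializer):

    import java.io.{ByteArrayOutputStream, ObjectOutputStream}

    // Rough stand-in for a serializability check: try to serialize the
    // closure and fail fast if any captured reference is not serializable.
    object SerializableCheck {
      def ensureSerializable(closure: AnyRef): Unit = {
        val out = new ObjectOutputStream(new ByteArrayOutputStream())
        out.writeObject(closure)   // throws NotSerializableException on bad captures
        out.close()
      }

      def main(args: Array[String]): Unit = {
        val factor = 3
        val ok: Int => Int = _ * factor   // captures only an Int, serializes fine
        ensureSerializable(ok)
        println("closure is serializable")
      }
    }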

From: Mayur Rustagi [mailto:mayur.rust...@gmail.com]
Sent: Tuesday, July 29, 2014 11:40 AM
To: user@spark.apache.org
Subject: Re: The function of ClosureCleaner.clean

I am not sure about the specific purpose of this function, but Spark needs to
remove elements from the closure that may be captured by default but are not
actually needed, so that it can serialize the closure and send it to executors
to operate on the RDD. For example, a function passed to an RDD's map may
reference objects inside the enclosing class; you want to send across those
objects, but not the whole parent class.
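
One way to put that into practice (a sketch with made-up names): copy the
needed field into a local val so the closure captures just that value, not the
whole enclosing object:

    import org.apache.spark.rdd.RDD

    // Hypothetical illustration: capture only the value the closure needs,
    // not the enclosing object.
    class Scaler(val factor: Int) {
      def scale(rdd: RDD[Int]): RDD[Int] = {
        // rdd.map(_ * factor) would capture `this` (the whole Scaler);
        // copying the field to a local means only an Int is captured.
        val localFactor = factor
        rdd.map(_ * localFactor)
      }
    }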


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi<https://twitter.com/mayur_rustagi>


On Mon, Jul 28, 2014 at 8:28 PM, Wang, Jensen 
<jensen.w...@sap.com<mailto:jensen.w...@sap.com>> wrote:
Hi, All
              Before sc.runJob invokes dagScheduler.runJob, the func performed
on the rdd is “cleaned” by ClosureCleaner.clean.
             Why does Spark have to do this? What is the purpose?
