Yup, local mode also catches serialization errors. The issue with local
variables in the function happens only if they're not Serializable, and even
then, Spark's closure cleaner tries to eliminate references to them in some
cases. For example, here's one thing that wouldn't work:
import java.io.File
import org.apache.spark.rdd.RDD

class C {
  val f = new File("f") // not Serializable
  val x = 1             // Serializable, but still a field of C
  // The map closure accesses this.x, so Spark has to ship the whole C object,
  // which fails because C also holds the non-Serializable File
  def doStuff(rdd: RDD[Int]) = rdd.map(_ + x)
}

new C().doStuff(rdd) // fails: serializing the closure drags in the whole C instance
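
The usual workaround is to copy the field into a local variable inside the
method, so the closure captures only that value rather than the enclosing
object. Roughly (untested sketch; names are just illustrative):

import org.apache.spark.rdd.RDD

class C2 {
  val f = new java.io.File("f") // still not Serializable, but no longer captured
  val x = 1
  def doStuff(rdd: RDD[Int]) = {
    val localX = x        // copy the field into a local val
    rdd.map(_ + localX)   // the closure captures only localX, not this
  }
}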
Matei
On Oct 21, 2013, at 1:29 PM, Josh Rosen <[email protected]> wrote:
> I think that the regular 'local' mode will work for testing serialization; it
> serializes both tasks and results in order to catch serialization errors:
>
> https://github.com/apache/incubator-spark/blob/v0.8.0-incubating/core/src/main/scala/org/apache/spark/scheduler/local/LocalScheduler.scala#L187
> https://github.com/apache/incubator-spark/blob/v0.8.0-incubating/core/src/main/scala/org/apache/spark/scheduler/local/LocalScheduler.scala#L200
>
>
> On Mon, Oct 21, 2013 at 1:06 PM, Aaron Davidson <[email protected]> wrote:
> To answer your second question first, you can use the SparkContext format
> "local-cluster[2, 1, 512]" (instead of "local[2]"), which would create a
> local test cluster with 2 workers, each with 1 core and 512 MB of memory.
> This should allow you to accurately test things like serialization.
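>
> For example, something like this (rough, untested sketch; the app name is
> arbitrary, and the bracketed numbers are workers, cores per worker, and
> memory per worker in MB):
>
>   import org.apache.spark.SparkContext
>
>   val sc = new SparkContext("local-cluster[2, 1, 512]", "serialization-test")
>   val rdd = sc.parallelize(1 to 100)
>   rdd.map(_ + 1).collect() // would fail here if the map closure weren't serializable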
>
> I don't believe that adding a function-local variable would cause the
> function to be unserializable, though. The only concern when shipping around
> functions is when they refer to variables outside the function's scope, in
> which case Spark will automatically ship those variables to all workers
> (unless you override this behavior with a broadcast or accumulator variable).
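>
> As a rough sketch of the broadcast case (reusing sc and an rdd of Ints from
> above; names are arbitrary):
>
>   val lookup = Map(1 -> "one", 2 -> "two")
>   val bcLookup = sc.broadcast(lookup)              // shipped to each worker once
>   rdd.map(i => bcLookup.value.getOrElse(i, "?"))   // workers read the broadcast value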
>
>
> On Mon, Oct 21, 2013 at 10:30 AM, Shay Seng <[email protected]> wrote:
> I'm trying to write a unit test to ensure that some functions I rely on will
> always serialize and run correctly on a cluster.
> In one of these functions I've deliberately added a "val x:Int = 1" which
> should prevent this method from being serializable, right?
>
> In the test I've done:
> sc = new SparkContext("local[2]","test")
> ...
> val pdata = sc.parallelize(data)
> val c = pdata.map().collect()
>
> The unit tests still complete with no errors; I'm guessing that's because
> Spark knows that local[2] doesn't require serialization? Is there some way I
> can force Spark to run like it would on a real cluster?
>
>
> tks
> shay
>
>