Never mind, all of this happened because somewhere in my code I had written `def` instead of `val`, which caused `collectAsMap` to be executed on every access. I'm still not sure why Spark at some point decided to create a new context, though...
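For anyone who hits the same thing, here is a minimal sketch of the pitfall, using the names from the quoted snippet below. The commented-out `def` variant is my reconstruction of the bug, not the exact original code:

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast
    import org.apache.spark.rdd.RDD

    def fun(sc: SparkContext, someRDD: RDD[String], someMap: RDD[(String, Double)]): Unit = {
      // BUG: a `def` re-evaluates its body on every reference, so each use
      // of `broadcast` below triggers a fresh collectAsMap() job plus a new
      // broadcast (~500 ms per element in my case):
      // def broadcast: Broadcast[collection.Map[String, Double]] =
      //   sc.broadcast(someMap.collectAsMap())

      // FIX: a `val` evaluates exactly once; every later reference reuses
      // the same broadcast handle.
      val broadcast: Broadcast[collection.Map[String, Double]] =
        sc.broadcast(someMap.collectAsMap())

      // take(100) returns a local Array, so this map runs on the driver;
      // with the `def` version it called collectAsMap() once per element.
      println(someRDD.take(100).map(s => broadcast.value.getOrElse(s, 0.0)).mkString("\n"))

      // This map is an RDD transformation and runs inside tasks; with the
      // `def` version, sc.broadcast() ended up being invoked from within a
      // task, which presumably is what tried to spin up a second SparkContext.
      println(someRDD.map(s => broadcast.value.getOrElse(s, 0.0)).collect().mkString("\n"))
    }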
Anyway, sorry for the disturbance.

sstraub wrote:
> Hi,
>
> I'm working on a Spark job that frequently iterates over huge RDDs and
> matches the elements against some Maps that easily fit into memory. So
> what I do is broadcast that Map and reference it from my RDD.
>
> Works like a charm, until at some point it doesn't, and I can't figure
> out why... Please have a look at this:
>
>     def fun(sc: SparkContext, someRDD: RDD[(String)], someMap: RDD[(String, Double)]) = {
>       // I want to access the Map multiple times, so I broadcast it
>       val broadcast = sc.broadcast(someMap.collectAsMap())
>       // the next line creates one job per element and executes collectAsMap() over and over again
>       println(someRDD.take(100).map(s => broadcast.value.getOrElse(s, 0.0)).toList.mkString("\n"))
>       // the next line creates a new Spark context and crashes (only one Spark context per JVM...)
>       println(someRDD.map(s => broadcast.value.getOrElse(s, 0.0)).collect().mkString("\n"))
>     }
>
> Here I'm doing just what I've described above: broadcast a Map and
> access the broadcast value while iterating over another RDD.
>
> Now when I take a subset of the RDD (`take(100)`), Spark creates one job
> per ELEMENT (that's 100 jobs) in which `collectAsMap` is called.
> Obviously, this takes quite a lot of time (~500 ms per element).
> When I actually want to map over the entire RDD, Spark tries to launch
> another Spark context and crashes the whole application:
>
>     org.apache.spark.SparkException: Job aborted due to stage failure:
>     Task 2 in stage 37.0 failed 1 times, most recent failure: Lost task 2.0
>     in stage 37.0 (TID 106, localhost): org.apache.spark.SparkException:
>     Only one SparkContext may be running in this JVM (see SPARK-2243). To
>     ignore this error, set spark.driver.allowMultipleContexts = true.
>
> I couldn't reproduce this error in a minimal working example, so there
> must be something in my pipeline that is messing things up. The error is
> 100% reproducible in my environment, and the application runs fine as
> soon as I don't access this specific Map from this specific RDD.
>
> Any idea what might cause this problem?
> Can I provide you with any other information (besides posting >500 lines
> of code)?
>
> cheers
> Sebastian