Never mind, all of this happened because somewhere in my code I had written `def` instead of `val`, which caused `collectAsMap` to be executed on every access. I'm still not sure why Spark at some point decided to create a new context, though...
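For anyone who hits the same thing, here is a minimal sketch of the pitfall, using the names from the quoted snippet below. The commented-out `def` variant is my reconstruction of the bug, not the exact original code:

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast
    import org.apache.spark.rdd.RDD

    def fun(sc: SparkContext, someRDD: RDD[String], someMap: RDD[(String, Double)]): Unit = {
      // BUG: a `def` re-evaluates its body on every reference, so each use
      // of `broadcast` below triggers a fresh collectAsMap() job plus a new
      // broadcast (~500 ms per element in my case):
      // def broadcast: Broadcast[collection.Map[String, Double]] =
      //   sc.broadcast(someMap.collectAsMap())

      // FIX: a `val` evaluates exactly once; every later reference reuses
      // the same broadcast handle.
      val broadcast: Broadcast[collection.Map[String, Double]] =
        sc.broadcast(someMap.collectAsMap())

      // take(100) returns a local Array, so this map runs on the driver;
      // with the `def` version it called collectAsMap() once per element.
      println(someRDD.take(100).map(s => broadcast.value.getOrElse(s, 0.0)).mkString("\n"))

      // This map is an RDD transformation and runs inside tasks; with the
      // `def` version, sc.broadcast() ended up being invoked from within a
      // task, which presumably is what tried to spin up a second SparkContext.
      println(someRDD.map(s => broadcast.value.getOrElse(s, 0.0)).collect().mkString("\n"))
    }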
Anyway, sorry for the disturbance.

sstraub wrote:
> Hi,
>
> I'm working on a Spark job that frequently iterates over huge RDDs and
> matches the elements against some Maps that easily fit into memory. So
> what I do is broadcast that Map and reference it from my RDD.
>
> Works like a charm, until at some point it doesn't, and I can't figure
> out why... Please have a look at this:
>
>     def fun(sc: SparkContext, someRDD: RDD[(String)], someMap: RDD[(String, Double)]) = {
>       // I want to access the Map multiple times, so I broadcast it
>       val broadcast = sc.broadcast(someMap.collectAsMap())
>       // the next line creates one job per element and executes collectAsMap() over and over again
>       println(someRDD.take(100).map(s => broadcast.value.getOrElse(s, 0.0)).toList.mkString("\n"))
>       // the next line creates a new Spark context and crashes (only one Spark context per JVM...)
>       println(someRDD.map(s => broadcast.value.getOrElse(s, 0.0)).collect().mkString("\n"))
>     }
>
> Here I'm doing just what I've described above: broadcast a Map and
> access the broadcast value while iterating over another RDD.
>
> Now when I take a subset of the RDD (`take(100)`), Spark creates one job
> per ELEMENT (that's 100 jobs) in which `collectAsMap` is called.
> Obviously, this takes quite a lot of time (~500 ms per element).
> When I actually want to map over the entire RDD, Spark tries to launch
> another Spark context and crashes the whole application:
>
>     org.apache.spark.SparkException: Job aborted due to stage failure:
>     Task 2 in stage 37.0 failed 1 times, most recent failure: Lost task 2.0
>     in stage 37.0 (TID 106, localhost): org.apache.spark.SparkException:
>     Only one SparkContext may be running in this JVM (see SPARK-2243). To
>     ignore this error, set spark.driver.allowMultipleContexts = true.
>
> I couldn't reproduce this error in a minimal working example, so there
> must be something in my pipeline that is messing things up. The error is
> 100% reproducible in my environment, and the application runs fine as
> soon as I don't access this specific Map from this specific RDD.
>
> Any idea what might cause this problem?
> Can I provide you with any other information (besides posting >500 lines
> of code)?
>
> cheers
> Sebastian