I didn't notice any other errors -- I also thought such a large broadcast was a bad idea, but I tried something similar with a much smaller dictionary and ran into the same problem. I'm not familiar enough with Spark internals to know whether the trace points to an issue with the broadcast variables or to something else entirely.
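For reference, the basic pattern is just a dictionary broadcast from the driver and looked up inside a map function. A minimal sketch of what I mean, with a toy dictionary and placeholder names rather than my actual data:

    from pyspark import SparkContext

    sc = SparkContext(appName="broadcast-dict-sketch")  # placeholder app name

    # toy dictionary standing in for the real multi-GB one
    lookup = {i: i * 2 for i in range(1000)}

    # ship the dictionary to the executors once
    lookup_bc = sc.broadcast(lookup)

    rdd = sc.parallelize(range(1000))

    # access the broadcast value inside the map function -- this is the
    # step that fails with the NullPointerException on the large dict
    result = rdd.map(lambda k: lookup_bc.value.get(k, 0)).collect()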
The driver and executors have 50 GB of RAM, so memory should be fine.

Thanks,

Rok

On Feb 11, 2015 12:19 AM, "Davies Liu" <dav...@databricks.com> wrote:

> It's brave to broadcast 8G pickled data, it will take more than 15G in
> memory for each Python worker,
> how much memory do you have in executor and driver?
> Do you see any other exceptions in driver and executors? Something
> related to serialization in JVM.
>
> On Tue, Feb 10, 2015 at 2:16 PM, Rok Roskar <rokros...@gmail.com> wrote:
> > I get this in the driver log:
>
> I think this should happen on executor, or you called first() or
> take() on the RDD?
>
> > java.lang.NullPointerException
> >         at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
> >         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> >         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> >         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> >         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
> >         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
> >         at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
> >
> > and on one of the executor's stderr:
> >
> > 15/02/10 23:10:35 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
> > org.apache.spark.api.python.PythonException: Traceback (most recent call last):
> >   File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 57, in main
> >     split_index = read_int(infile)
> >   File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 511, in read_int
> >     raise EOFError
> > EOFError
> >
> >         at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
> >         at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174)
> >         at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
> >         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> >         at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
> >         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
> >         at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
> > Caused by: java.lang.NullPointerException
> >         at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
> >         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> >         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> >         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> >         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
> >         ... 4 more
> > 15/02/10 23:10:35 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
> > org.apache.spark.api.python.PythonException: Traceback (most recent call last):
> >   File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 57, in main
> >     split_index = read_int(infile)
> >   File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 511, in read_int
> >     raise EOFError
> > EOFError
> >
> >         at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
> >         at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174)
> >         at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
> >         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> >         at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
> >         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
> >         at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
> > Caused by: java.lang.NullPointerException
> >         at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
> >         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> >         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> >         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> >         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
> >         ... 4 more
> >
> > What I find odd is that when I make the broadcast object, the logs don't
> > show any significant amount of memory being allocated in any of the block
> > managers -- but the dictionary is large, it's 8 Gb pickled on disk.
> >
> > On Feb 10, 2015, at 10:01 PM, Davies Liu <dav...@databricks.com> wrote:
> >
> >> Could you paste the NPE stack trace here? It will better to create a
> >> JIRA for it, thanks!
> >>
> >> On Tue, Feb 10, 2015 at 10:42 AM, rok <rokros...@gmail.com> wrote:
> >>> I'm trying to use a broadcasted dictionary inside a map function and am
> >>> consistently getting Java null pointer exceptions. This is inside an IPython
> >>> session connected to a standalone spark cluster. I seem to recall being able
> >>> to do this before but at the moment I am at a loss as to what to try next.
> >>> Is there a limit to the size of broadcast variables? This one is rather
> >>> large (a few Gb dict). Thanks!
> >>>
> >>> Rok
> >>>
> >>> --
> >>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-Java-null-pointer-exception-when-accessing-broadcast-variables-tp21580.html
> >>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
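As a quick check on the size figures Davies mentions above (8 GB pickled, and more than 15 GB once loaded into each Python worker), the pickled size of the dictionary can be measured on the driver before broadcasting. A rough, illustrative sketch with a stand-in dictionary rather than the real data:

    import pickle

    # stand-in dictionary; the real payload in this thread is ~8 GB pickled
    lookup = {i: str(i) * 10 for i in range(100000)}

    # PySpark pickles broadcast values before shipping them, so the pickled
    # size is a lower bound on what every Python worker has to read back;
    # the unpickled dict typically occupies considerably more memory.
    pickled_size = len(pickle.dumps(lookup, protocol=2))
    print("pickled size: %.1f MB" % (pickled_size / 1e6))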