Uh, for some reason I don't seem to automatically reply to the list any more. Here is again my message to Tom.
---------- Forwarded message ---------- Tom, On Wed, Aug 13, 2014 at 5:35 AM, Tom Vacek <minnesota...@gmail.com> wrote: > This is a back-to-basics question. How do we know when Spark will clone > an object and distribute it with task closures versus synchronize access to > it. > > For example, the old rookie mistake of random number generation: > > import scala.util.Random > val randRDD = sc.parallelize(0 until 1000).map(ii => Random.nextGaussian) > > One can check to see that each partition contains a different set of > random numbers, so the RNG obviously was not cloned, but access was > synchronized. > In this case, Random is a singleton object; Random.nextGaussian is like a static method of a Java class. The access is not synchronized (unless I misunderstand "synchronized"), but each Spark worker will use a JVM-local instance of the Random object. You don't actually close over the Random object in this case. In fact, this is one way to have node-local state (e.g., for DB connection pooling). > However: > > val myMap = collection.mutable.Map.empty[Int,Int] > sc.parallelize(0 until 100).mapPartitions(it => {it.foreach(ii => myMap(ii) = > ii); Array(myMap).iterator}).collect > > > This shows that each partition got a copy of the empty map and filled it > in with its portion of the rdd. > In this case, myMap is an instance of the Map class, so it will be serialized and shipped around. In fact, if you did `val Random = new scala.util.Random()` in your code above, then this object would also be serialized and treated just as myMap. (NB. No, it is not. Spark hangs for me when I do this and doesn't return anything...) Tobias