Yes, I looked at the log, and the serialized tasks were about 2k bytes as well. Is there anything I can do to move this along?
On Thu, Oct 24, 2013 at 2:05 PM, Josh Rosen <[email protected]> wrote:

> Maybe this is a bug in the ClosureCleaner. If you look at the
>
>> 13/10/23 14:16:39 INFO cluster.ClusterTaskSetManager: Serialized task
>> 0.0:263 as 39625334 bytes in 55 ms
>
> line in your log, this corresponds to the driver serializing a ~38
> megabyte task. This suggests that the broadcasted data is accidentally
> being included in the serialized task.
>
> When I tried this locally with val lb = sc.broadcast((1 to 5000000).toSet),
> I saw a serialized task size of 1814 bytes.
>
> On Thu, Oct 24, 2013 at 10:59 AM, Tom Vacek <[email protected]> wrote:
>
>> I've figured out what the problem is, but I don't understand why. I'm
>> hoping somebody can explain this:
>>
>> (in the spark shell)
>> val lb = sc.broadcast( (1 to 10000000).toSet)
>> val breakMe = sc.parallelize(1 to 250).mapPartitions( it => {val serializedSet = lb.value.toString; Array(0).iterator}).count  //works great
>>
>> val ll = (1 to 10000000).toSet
>> val lb = sc.broadcast(ll)
>> val breakMe = sc.parallelize(1 to 250).mapPartitions( it => {val serializedSet = lb.value.toString; Array(0).iterator}).count  //Crashes ignominiously
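
A plausible mechanism, sketched below under the assumption that the spark-shell
REPL wraps each input line in a generated object: a top-level val like ll
becomes a field of that object, and a closure that references lb can drag the
whole line object (including ll) into the serialized task. The workaround shown
here keeps the large value inside a local block so it never becomes a REPL
field; lb2 is just a name invented for this sketch, and none of this is
confirmed as the actual cause in the thread.

// Pattern 1: the Set is only reachable through the Broadcast handle,
// so the task closure stays small.
val lb = sc.broadcast((1 to 10000000).toSet)
sc.parallelize(1 to 250)
  .mapPartitions(it => Iterator(lb.value.size))
  .count()

// Pattern 2 (possible workaround when an intermediate val is needed):
// bind the large value inside a block so it stays local to the block
// instead of becoming a field that later closures can capture.
val lb2 = {
  val ll = (1 to 10000000).toSet  // local to this block only
  sc.broadcast(ll)
}
sc.parallelize(1 to 250)
  .mapPartitions(it => Iterator(lb2.value.size))
  .count()

If the REPL capture is indeed the culprit, the second form should serialize
tasks on the order of a couple of kilobytes rather than tens of megabytes.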
