Maybe this is a bug in the ClosureCleaner. If you look at the

13/10/23 14:16:39 INFO cluster.ClusterTaskSetManager: Serialized task 0.0:263 as 39625334 bytes in 55 ms

line in your log, this corresponds to the driver serializing a ~38 megabyte task. This suggests that the broadcasted data is accidentally being included in the serialized task. When I tried this locally with val lb = sc.broadcast((1 to 5000000).toSet), I saw a serialized task size of 1814 bytes.

On Thu, Oct 24, 2013 at 10:59 AM, Tom Vacek <[email protected]> wrote:

> I've figured out what the problem is, but I don't understand why. I'm
> hoping somebody can explain this:
>
> (in the spark shell)
>
> val lb = sc.broadcast( (1 to 10000000).toSet)
> val breakMe = sc.parallelize(1 to 250).mapPartitions( it => {val serializedSet = lb.value.toString; Array(0).iterator}).count  // works great
>
> val ll = (1 to 10000000).toSet
> val lb = sc.broadcast(ll)
> val breakMe = sc.parallelize(1 to 250).mapPartitions( it => {val serializedSet = lb.value.toString; Array(0).iterator}).count  // Crashes ignominiously
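
For reference, here is a minimal sketch (assuming a spark-shell with sc available; the names are illustrative) of one way to keep the large driver-side collection out of the task: build it inside a local block, so the only shell-level val is the small Broadcast handle and that is all the closure can capture.

// Build the Set inside a block so no shell-level val holds onto it;
// only the Broadcast handle lb is visible to the closure below.
val lb = {
  val ll = (1 to 10000000).toSet   // never escapes this block
  sc.broadcast(ll)
}

val counted = sc.parallelize(1 to 250).mapPartitions { it =>
  val serializedSet = lb.value.toString   // reads the data via the broadcast
  Array(0).iterator
}.count   // the serialized task should stay small, as in the working example above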
