Hi, I'm loading 1000 files using the spark-avro package:

val df = sqlContext.read.avro("/incoming/")
When I perform an action on this df, it looks like a broadcast is created and sent to the workers for each file (instead of the workers reading their data-local files):

scala> df.coalesce(4).count
15/09/21 15:11:32 INFO storage.MemoryStore: ensureFreeSpace(261920) called with curMem=0, maxMem=2223023063
15/09/21 15:11:32 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 255.8 KB, free 2.1 GB)
15/09/21 15:11:32 INFO storage.MemoryStore: ensureFreeSpace(22987) called with curMem=261920, maxMem=2223023063
15/09/21 15:11:32 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.4 KB, free 2.1 GB)
15/09/21 15:11:32 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.3.4:39736 (size: 22.4 KB, free: 2.1 GB)
....
....
....
15/09/21 15:12:45 INFO storage.MemoryStore: ensureFreeSpace(22987) called with curMem=294913622, maxMem=2223023063
15/09/21 15:12:45 INFO storage.MemoryStore: Block broadcast_1034_piece0 stored as bytes in memory (estimated size 22.4 KB, free 1838.8 MB)
15/09/21 15:12:45 INFO storage.BlockManagerInfo: Added broadcast_1034_piece0 in memory on 192.168.3.4:39736 (size: 22.4 KB, free: 2.0 GB)
15/09/21 15:12:45 INFO spark.SparkContext: Created broadcast 1034 from hadoopFile at AvroRelation.scala:121
15/09/21 15:12:46 INFO execution.Exchange: Using SparkSqlSerializer2.
15/09/21 15:12:46 INFO spark.SparkContext: Starting job: count at <console>:25

Am I understanding this wrong?

Thank you,
Daniel
