Hi Anders, I am running into the same issue as you. I am trying to read about 120,000 Avro files into a single DataFrame.
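For context, loading that many files into one DataFrame with spark-avro would look roughly like the sketch below; the path glob is illustrative, and it assumes Spark 1.5 with the com.databricks.spark.avro package on the classpath:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext

    // Illustrative path: a glob covering the ~120,000 Avro part files
    val df = sqlContext.read
      .format("com.databricks.spark.avro")
      .load("hdfs:///data/mydataset/*/part-*.avro")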
Is your patch part of a pull request against the master branch on GitHub? Thanks, Prasad.

From: Anders Arpteg
Date: Thursday, October 22, 2015 at 10:37 AM
To: Koert Kuipers
Cc: user
Subject: Re: Large number of conf broadcasts

Yes, it seems unnecessary. I actually tried patching the com.databricks.spark.avro reader to only broadcast once per dataset, instead of for every single file/partition. It seems to work just fine, there are significantly fewer broadcasts, and I am not seeing out-of-memory issues any more. Strange that more people do not react to this, since the broadcasting seems completely unnecessary...

Best,
Anders

On Thu, Oct 22, 2015 at 7:03 PM Koert Kuipers <ko...@tresata.com> wrote:

I am seeing the same thing. It has gone completely crazy creating broadcasts for the last 15 minutes or so. Killing it...

On Thu, Sep 24, 2015 at 1:24 PM, Anders Arpteg <arp...@spotify.com> wrote:

Hi,

I am running Spark 1.5.0 in yarn-client mode and am curious why so many broadcasts are created when loading datasets with a large number of partitions/files. We have datasets with thousands of partitions, i.e. HDFS files in the Avro folder, and sometimes load hundreds of these large datasets. I believe I have located the broadcast at SparkContext.scala:1006. It seems to just broadcast the Hadoop configuration, and I don't see why it should be necessary to broadcast that for EVERY file. Wouldn't it be possible to reuse the same broadcast configuration? It is hardly the case that the configuration differs between files in a single dataset. It seems to waste lots of memory and forces unnecessary persisting to disk (see the log below).

Thanks,
Anders

15/09/24 17:11:11 INFO BlockManager: Writing block broadcast_1871_piece0 to disk
15/09/24 17:11:11 INFO BlockManagerInfo: Added broadcast_1871_piece0 on disk on 10.254.35.24:49428 (size: 23.1 KB)
15/09/24 17:11:11 INFO MemoryStore: Block broadcast_4803_piece0 stored as bytes in memory (estimated size 23.1 KB, free 2.4 KB)
15/09/24 17:11:11 INFO BlockManagerInfo: Added broadcast_4803_piece0 in memory on 10.254.35.24:49428 (size: 23.1 KB, free: 464.0 MB)
15/09/24 17:11:11 INFO SpotifySparkContext: Created broadcast 4803 from hadoopFile at AvroRelation.scala:121
15/09/24 17:11:11 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block broadcast_4804 in memory.
15/09/24 17:11:11 WARN MemoryStore: Not enough space to cache broadcast_4804 in memory! (computed 496.0 B so far)
15/09/24 17:11:11 INFO MemoryStore: Memory use = 530.3 MB (blocks) + 0.0 B (scratch space shared across 0 tasks(s)) = 530.3 MB. Storage limit = 530.3 MB.
15/09/24 17:11:11 WARN MemoryStore: Persisting block broadcast_4804 to disk instead.
15/09/24 17:11:11 INFO MemoryStore: ensureFreeSpace(23703) called with curMem=556036460, maxMem=556038881
15/09/24 17:11:11 INFO MemoryStore: 1 blocks selected for dropping
15/09/24 17:11:11 INFO BlockManager: Dropping block broadcast_1872_piece0 from memory
15/09/24 17:11:11 INFO BlockManager: Writing block broadcast_1872_piece0 to disk
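The patch itself is not included in this thread, but the one-broadcast-per-dataset idea Anders describes can be sketched as follows: issue a single hadoopFile call covering all part files of a dataset (comma-separated paths), so that the broadcast at SparkContext.scala:1006 happens once per dataset instead of once per file. This is only a sketch under that assumption, not Anders' actual patch, and readAvroDataset is a hypothetical helper rather than spark-avro API:

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
    import org.apache.hadoop.io.NullWritable
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Sketch only: read a whole Avro dataset with ONE hadoopFile call, so the
    // Hadoop configuration is broadcast once per dataset rather than once per
    // part file. readAvroDataset is a hypothetical helper, not spark-avro API.
    def readAvroDataset(sc: SparkContext, paths: Seq[String]): RDD[GenericRecord] =
      sc.hadoopFile(
        paths.mkString(","),                    // all part files in a single call
        classOf[AvroInputFormat[GenericRecord]],
        classOf[AvroWrapper[GenericRecord]],
        classOf[NullWritable]
      ).map(_._1.datum())                       // unwrap the Avro records

With one call per dataset, the driver creates a single broadcast of the roughly 23 KB Hadoop configuration instead of thousands of them, which is what the MemoryStore warnings above show filling up the driver's storage memory.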