Awesome, thanks for the PR Koert! /Anders
On Thu, Dec 17, 2015 at 10:22 PM Prasad Ravilla <pras...@slalom.com> wrote:

> Thanks, Koert.
>
> Regards,
> Prasad.
>
> From: Koert Kuipers
> Date: Thursday, December 17, 2015 at 1:06 PM
> To: Prasad Ravilla
> Cc: Anders Arpteg, user
> Subject: Re: Large number of conf broadcasts
>
> https://github.com/databricks/spark-avro/pull/95
>
> On Thu, Dec 17, 2015 at 3:35 PM, Prasad Ravilla <pras...@slalom.com> wrote:
>
>> Hi Anders,
>>
>> I am running into the same issue as you. I am trying to read about 120
>> thousand Avro files into a single data frame.
>>
>> Is your patch part of a pull request against the master branch on GitHub?
>>
>> Thanks,
>> Prasad.
>>
>> From: Anders Arpteg
>> Date: Thursday, October 22, 2015 at 10:37 AM
>> To: Koert Kuipers
>> Cc: user
>> Subject: Re: Large number of conf broadcasts
>>
>> Yes, it seems unnecessary. I actually tried patching the
>> com.databricks.spark.avro reader to broadcast only once per dataset,
>> instead of once per file/partition. It seems to work just as well: there
>> are significantly fewer broadcasts, and I am no longer seeing
>> out-of-memory issues. Strange that more people have not reacted to this,
>> since the per-file broadcasting seems completely unnecessary...
>>
>> Best,
>> Anders
>>
>> On Thu, Oct 22, 2015 at 7:03 PM Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> I am seeing the same thing. It has gone completely crazy creating
>>> broadcasts for the last 15 minutes or so. Killing it...
>>>
>>> On Thu, Sep 24, 2015 at 1:24 PM, Anders Arpteg <arp...@spotify.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Running Spark 1.5.0 in yarn-client mode, and I am curious why so many
>>>> broadcasts are created when loading datasets with a large number of
>>>> partitions/files. I have datasets with thousands of partitions, i.e.
>>>> HDFS files in the Avro folder, and sometimes load hundreds of these
>>>> large datasets. I believe I have located the broadcast at line
>>>> SparkContext.scala:1006. It seems to just broadcast the Hadoop
>>>> configuration, and I don't see why it should be necessary to broadcast
>>>> that for EVERY file. Wouldn't it be possible to reuse the same broadcast
>>>> configuration? It is hardly the case that the configuration would differ
>>>> between files in a single dataset. This seems to waste lots of memory
>>>> and forces unnecessary persisting to disk (see log below).
>>>>
>>>> Thanks,
>>>> Anders
>>>>
>>>> 15/09/24 17:11:11 INFO BlockManager: Writing block
>>>> broadcast_1871_piece0 to disk
>>>> 15/09/24 17:11:11 INFO BlockManagerInfo: Added
>>>> broadcast_1871_piece0 on disk on 10.254.35.24:49428 (size: 23.1 KB)
>>>> 15/09/24 17:11:11 INFO MemoryStore: Block broadcast_4803_piece0 stored
>>>> as bytes in memory (estimated size 23.1 KB, free 2.4 KB)
>>>> 15/09/24 17:11:11 INFO BlockManagerInfo: Added broadcast_4803_piece0 in
>>>> memory on 10.254.35.24:49428 (size: 23.1 KB, free: 464.0 MB)
>>>> 15/09/24 17:11:11 INFO SpotifySparkContext: Created broadcast 4803 from
>>>> hadoopFile at AvroRelation.scala:121
>>>> 15/09/24 17:11:11 WARN MemoryStore: Failed to reserve initial memory
>>>> threshold of 1024.0 KB for computing block broadcast_4804 in memory.
>>>> 15/09/24 17:11:11 WARN MemoryStore: Not enough space to cache
>>>> broadcast_4804 in memory! (computed 496.0 B so far)
>>>> 15/09/24 17:11:11 INFO MemoryStore: Memory use = 530.3 MB (blocks) +
>>>> 0.0 B (scratch space shared across 0 tasks(s)) = 530.3 MB. Storage
>>>> limit = 530.3 MB.
>>>> 15/09/24 17:11:11 WARN MemoryStore: Persisting block broadcast_4804 to
>>>> disk instead.
>>>> 15/09/24 17:11:11 INFO MemoryStore: ensureFreeSpace(23703) called with
>>>> curMem=556036460, maxMem=556038881
>>>> 15/09/24 17:11:11 INFO MemoryStore: 1 blocks selected for dropping
>>>> 15/09/24 17:11:11 INFO BlockManager: Dropping block
>>>> broadcast_1872_piece0 from memory
>>>> 15/09/24 17:11:11 INFO BlockManager: Writing block
>>>> broadcast_1872_piece0 to disk
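For readers following the thread: the workaround Anders describes (and that the linked PR addresses in spark-avro) is to broadcast the serialized Hadoop configuration once per dataset instead of once per file, which is what `SparkContext.hadoopFile` does internally via `sc.broadcast(new SerializableWritable(conf))`. A minimal sketch of that idea is below; it is NOT the actual patch, `ConfBroadcastCache` is a hypothetical helper name, and it assumes (as the thread argues) that the Hadoop configuration is identical for every file in a dataset:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SerializableWritable, SparkContext}
import org.apache.spark.broadcast.Broadcast

// Hypothetical helper: create the Hadoop-conf broadcast lazily, once,
// and hand the same Broadcast handle to every per-file read, rather than
// letting each hadoopFile call broadcast its own copy of the conf.
object ConfBroadcastCache {
  @volatile private var cached: Broadcast[SerializableWritable[Configuration]] = _

  def getOrCreate(sc: SparkContext, conf: Configuration): Broadcast[SerializableWritable[Configuration]] = {
    if (cached == null) {
      synchronized {
        if (cached == null) {
          // One broadcast for the whole dataset; assumes conf is the
          // same across all of its files/partitions.
          cached = sc.broadcast(new SerializableWritable(conf))
        }
      }
    }
    cached
  }
}
```

Tasks would then call `ConfBroadcastCache.getOrCreate(sc, hadoopConf).value.value` to obtain the `Configuration`, so loading thousands of files produces one broadcast instead of thousands, which is what eliminates the memory pressure visible in the log above.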