Re: Large number of conf broadcasts

2015-12-18 Thread Anders Arpteg
Awesome, thanks for the PR Koert!

/Anders

On Thu, Dec 17, 2015 at 10:22 PM Prasad Ravilla  wrote:

> Thanks, Koert.
>
> Regards,
> Prasad.
>
> From: Koert Kuipers
> Date: Thursday, December 17, 2015 at 1:06 PM
> To: Prasad Ravilla
> Cc: Anders Arpteg, user
>
> Subject: Re: Large number of conf broadcasts
>
> https://github.com/databricks/spark-avro/pull/95
>
> On Thu, Dec 17, 2015 at 3:35 PM, Prasad Ravilla 
> wrote:
>
>> Hi Anders,
>>
>> I am running into the same issue. I am trying to read about 120
>> thousand Avro files into a single data frame.
>>
>> Is your patch part of a pull request against the master branch on GitHub?
>>
>> Thanks,
>> Prasad.
>>
>> From: Anders Arpteg
>> Date: Thursday, October 22, 2015 at 10:37 AM
>> To: Koert Kuipers
>> Cc: user
>> Subject: Re: Large number of conf broadcasts
>>
>> Yes, it seems unnecessary. I actually tried patching the
>> com.databricks.spark.avro reader to only broadcast once per dataset,
>> instead of once for every single file/partition. It seems to work just as
>> well, there are significantly fewer broadcasts, and I am not seeing
>> out-of-memory issues any more. Strange that more people do not react to
>> this, since the broadcasting seems completely unnecessary...
>>
>> Best,
>> Anders
>>
>> On Thu, Oct 22, 2015 at 7:03 PM Koert Kuipers  wrote:
>>
>>> I am seeing the same thing. It's gone completely crazy creating
>>> broadcasts for the last 15 mins or so. Killing it...
>>>
>>> On Thu, Sep 24, 2015 at 1:24 PM, Anders Arpteg 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Running Spark 1.5.0 in yarn-client mode, and I am curious why there are
>>>> so many broadcasts being done when loading datasets with a large number
>>>> of partitions/files. I have datasets with thousands of partitions, i.e.
>>>> HDFS files in the Avro folder, and sometimes load hundreds of these large
>>>> datasets. I believe I have located the broadcast at line
>>>> SparkContext.scala:1006. It seems to just broadcast the Hadoop
>>>> configuration, and I don't see why it should be necessary to broadcast
>>>> that for EVERY file. Wouldn't it be possible to reuse the same broadcast
>>>> configuration? It's hardly the case that the configuration would differ
>>>> between files in a single dataset. This seems to waste lots of memory
>>>> and to persist unnecessarily to disk (see the log below).
>>>>
>>>> Thanks,
>>>> Anders
>>>>
>>>> 15/09/24 17:11:11 INFO BlockManager: Writing block
>>>> broadcast_1871_piece0 to disk
>>>> 15/09/24 17:11:11 INFO BlockManagerInfo: Added
>>>> broadcast_1871_piece0 on disk on 10.254.35.24:49428
>>>> (size: 23.1 KB)
>>>> 15/09/24 17:11:11 INFO MemoryStore: Block broadcast_4803_piece0 stored
>>>> as bytes in memory (estimated size 23.1 KB, free 2.4 KB)
>>>> 15/09/24 17:11:11 INFO BlockManagerInfo: Added broadcast_4803_piece0 in
>>>> memory on 10.254.35.24:49428
>>>> (size: 23.1 KB, free: 464.0 MB)
>>>> 15/09/24 17:11:11 INFO SpotifySparkContext: Created broadcast 4803 from
>>>> hadoopFile at AvroRelation.scala:121
>>>> 15/09/24 17:11:11 WARN MemoryStore: Failed to reserve initial memory
>>>> threshold of 1024.0 KB for computing block broadcast_4804 in memory
>>>> .
>>>> 15/09/24 17:11:11 WARN MemoryStore: Not enough space to cache
>>>> broadcast_4804 in memory! (computed 496.0 B so far)
>>>> 15/09/24 17:11:11 INFO MemoryStore: Memory use = 530.3 MB (blocks) +
>>>> 0.0 B (scratch space shared across 0 tasks(s)) = 530.3 MB. Storage
>>>> limit = 530.3 MB.
>>>> 15/09/24 17:11:11 WARN MemoryStore: Persisting block broadcast_4804 to
>>>> disk instead.
>>>> 15/09/24 17:11:11 INFO MemoryStore: ensureFreeSpace(23703) called with
>>>> curMem=556036460, maxMem=556038881
>>>> 15/09/24 17:11:11 INFO MemoryStore: 1 blocks selected for dropping
>>>> 15/09/24 17:11:11 INFO BlockManager: Dropping block
>>>> broadcast_1872_piece0 from memory
>>>> 15/09/24 17:11:11 INFO BlockManager: Writing block
>>>> broadcast_1872_piece0 to disk
>>>>
>>>>
>>>
>>>
>
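For context, the behavior Anders describes in the quoted thread above can be sketched as follows. This is not Spark's actual code; `broadcast` and `loadFile` here are simplified stand-ins for `SparkContext.broadcast` and the per-file `hadoopFile` call, showing why loading N files produces N broadcasts of one identical Hadoop configuration.

```scala
// Simplified stand-in for the per-file broadcast behavior - not actual
// Spark code. Loading N files triggers N broadcasts of the same conf.
object BroadcastPerFileSketch {
  var broadcastCount = 0

  // Stand-in for SparkContext.broadcast: every call ships (and may
  // spill to disk) another copy of the same configuration.
  def broadcast(conf: Map[String, String]): Int = {
    broadcastCount += 1
    broadcastCount
  }

  // Stand-in for loading one file: the Hadoop configuration is
  // broadcast again for each file, as described in the thread.
  def loadFile(path: String, conf: Map[String, String]): Unit = {
    broadcast(conf) // same conf, new broadcast, every single file
    ()
  }

  def main(args: Array[String]): Unit = {
    val conf  = Map("fs.defaultFS" -> "hdfs://namenode:8020") // hypothetical
    val files = (1 to 1000).map(i => s"/data/part-$i.avro")   // hypothetical
    files.foreach(loadFile(_, conf))
    println(broadcastCount) // 1000 broadcasts for one identical conf
  }
}
```

With 120 thousand files, as in Prasad's case, this pattern would produce 120 thousand broadcasts of the same small configuration, which matches the MemoryStore spill warnings in the log.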


Re: Large number of conf broadcasts

2015-12-17 Thread Prasad Ravilla
Thanks, Koert.

Regards,
Prasad.





Re: Large number of conf broadcasts

2015-12-17 Thread Koert Kuipers
https://github.com/databricks/spark-avro/pull/95



Re: Large number of conf broadcasts

2015-12-17 Thread Prasad Ravilla
Hi Anders,

I am running into the same issue. I am trying to read about 120
thousand Avro files into a single data frame.

Is your patch part of a pull request against the master branch on GitHub?

Thanks,
Prasad.





Re: Large number of conf broadcasts

2015-10-26 Thread Anders Arpteg
Nice, Koert, let's hope it gets merged soon.

/Anders



Re: Large number of conf broadcasts

2015-10-23 Thread Koert Kuipers
https://github.com/databricks/spark-avro/pull/95



Re: Large number of conf broadcasts

2015-10-23 Thread Koert Kuipers
Oh, no wonder... it undoes the glob (I was reading from /some/path/*),
creates a hadoopRDD for every path, and then creates a union of them using
UnionRDD.

That's not what I want... there is no need for a union. AvroInputFormat
already has the ability to handle globs (or multiple comma-separated paths)
very efficiently. AvroRelation should just pass the paths along,
comma-separated.
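The difference Koert points out can be sketched like this (simplified stand-ins, not the actual AvroRelation or Hadoop code): expanding the glob and creating one RDD per path means one job setup, and one configuration broadcast, per file, whereas handing the input format a single comma-separated path string (as Hadoop's FileInputFormat accepts) keeps it to one.

```scala
// Simplified stand-ins for the two strategies - not actual
// Spark/spark-avro code. Each "setup" models one hadoopFile call
// (and with it, one configuration broadcast).
object GlobSketch {
  // Strategy being criticized: one RDD per expanded path, then a union.
  def unionOfPerPathRdds(paths: Seq[String]): Int =
    paths.map(_ => 1).sum // one hadoopFile call per path

  // Strategy suggested: pass all paths at once, comma-separated,
  // so there is a single hadoopFile call for the whole dataset.
  def singleCallWithCommaSeparatedPaths(paths: Seq[String]): Int = {
    val joined = paths.mkString(",")
    if (joined.isEmpty) 0 else 1
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical expansion of /some/path/* into 500 files.
    val expanded = (1 to 500).map(i => s"/some/path/part-$i.avro")
    println(unionOfPerPathRdds(expanded))                // 500 setups
    println(singleCallWithCommaSeparatedPaths(expanded)) // 1 setup
  }
}
```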






Re: Large number of conf broadcasts

2015-10-22 Thread Anders Arpteg
Yes, it seems unnecessary. I actually tried patching the
com.databricks.spark.avro reader to only broadcast once per dataset,
instead of once for every single file/partition. It seems to work just as
well, there are significantly fewer broadcasts, and I am not seeing
out-of-memory issues any more. Strange that more people do not react to
this, since the broadcasting seems completely unnecessary...

Best,
Anders
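A minimal sketch of the kind of patch described here, with hypothetical names rather than the actual spark-avro diff: create the broadcast lazily, once per dataset, and have every per-file load reuse it instead of broadcasting again.

```scala
// Hypothetical "broadcast once per dataset" sketch - not the actual
// spark-avro patch. The broadcast is created on first use and shared
// by every file in the dataset.
class DatasetReaderSketch(conf: Map[String, String]) {
  var broadcastCount = 0

  // One broadcast for the whole dataset, created lazily on first use.
  lazy val confBroadcast: Map[String, String] = {
    broadcastCount += 1 // stand-in for the single SparkContext.broadcast call
    conf
  }

  def loadFile(path: String): Unit = {
    val _ = confBroadcast // reuse the dataset-level broadcast; no new one per file
    ()
  }
}

object DatasetReaderSketch {
  def main(args: Array[String]): Unit = {
    val reader = new DatasetReaderSketch(Map("fs.defaultFS" -> "hdfs://namenode:8020"))
    (1 to 1000).foreach(i => reader.loadFile(s"/data/part-$i.avro")) // hypothetical paths
    println(reader.broadcastCount) // 1 broadcast, however many files
  }
}
```

The same idea applies whether the single broadcast lives in the relation, as sketched here, or elsewhere; the point is that its lifetime matches the dataset, not the individual file.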



Re: Large number of conf broadcasts

2015-10-22 Thread Koert Kuipers
I am seeing the same thing. It's gone completely crazy creating broadcasts
for the last 15 mins or so. Killing it...
