Re: Size exceeds Integer.MAX_VALUE exception when broadcasting large variable

2015-02-13 Thread Soila Pertet Kavulya
Thanks Sean and Imran,

I'll try splitting the broadcast variable into smaller ones.
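
Roughly what I have in mind, in case it is useful to anyone else (a
sketch only; bigMap stands in for my actual 5GB lookup table, with
String values for concreteness, and sc is the SparkContext):

val numPieces = 4  // chosen so each piece serializes to well under 2GB

def piece(k: String): Int =
  ((k.hashCode % numPieces) + numPieces) % numPieces

// Split the one big map into numPieces smaller maps by key hash.
val grouped: Map[Int, Map[String, String]] =
  bigMap.groupBy { case (k, _) => piece(k) }

// Broadcast each piece separately so none hits the 2GB byte-array limit.
val broadcasts = (0 until numPieces).map { i =>
  sc.broadcast(grouped.getOrElse(i, Map.empty[String, String]))
}

// On the executors, route each lookup to the piece that owns the key.
def lookup(key: String): Option[String] =
  broadcasts(piece(key)).value.get(key)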

I had tried a regular join but it was failing due to high garbage
collection overhead during the shuffle. One of the RDDs is very large
and has a skewed distribution where a handful of keys account for 90%
of the data. Do you have any pointers on how to handle skewed key
distributions during a join?
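
For example, would salting the hot keys be a reasonable direction? A
rough sketch of what I mean, in Scala (untested; bigRdd, smallRdd, and
hotKeys are placeholders for my actual data):

import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.2
import scala.util.Random

val R = 10                      // replication factor for the hot keys
val hotKeys = Set("k1", "k2")   // the handful of keys holding ~90% of data

// Big side: spread each hot key across R sub-keys with a random salt,
// so one hot key lands on R reduce tasks instead of one.
val saltedBig = bigRdd.map { case (k, v) =>
  val salt = if (hotKeys.contains(k)) Random.nextInt(R) else 0
  ((k, salt), v)
}

// Small side: replicate hot-key rows under every salt so the join still
// matches; then join on (key, salt) and drop the salt afterwards.
val saltedSmall = smallRdd.flatMap { case (k, v) =>
  val salts = if (hotKeys.contains(k)) 0 until R else 0 until 1
  salts.map(i => ((k, i), v))
}

val joined = saltedBig.join(saltedSmall).map { case ((k, _), vs) => (k, vs) }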

Soila

On Fri, Feb 13, 2015 at 10:49 AM, Imran Rashid  wrote:
> Unfortunately, this is a known issue:
> https://issues.apache.org/jira/browse/SPARK-1476
>
> As Sean suggested, you need to think of some other way of doing the
> same thing, even if it's just breaking your one big broadcast variable
> into a few smaller ones.
>
> On Fri, Feb 13, 2015 at 12:30 PM, Sean Owen  wrote:
>>
>> I think you've hit the nail on the head. Since the serialization
>> ultimately creates a byte array, and arrays can have at most ~2
>> billion elements in the JVM, the broadcast can be at most ~2GB.
>>
>> At that scale, you might consider whether you really have to broadcast
>> these values, or want to handle them as RDDs and join and so on.
>>
>> Or consider whether you can break it up into several broadcasts?
>>
>>
>> On Fri, Feb 13, 2015 at 6:24 PM, soila  wrote:
>> > I am trying to broadcast a large 5GB variable using Spark 1.2.0. I get
>> > the following exception when the size of the broadcast variable exceeds
>> > 2GB. Any ideas on how I can resolve this issue?
>> >
>> > java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>> >     at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:829)
>> >     at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
>> >     at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
>> >     at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:99)
>> >     at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:147)
>> >     at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:114)
>> >     at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:787)
>> >     at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
>> >     at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:992)
>> >     at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:98)
>> >     at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:84)
>> >     at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
>> >     at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
>> >     at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
>> >     at org.apache.spark.SparkContext.broadcast(SparkContext.scala:945)



