Re: Size exceeds Integer.MAX_VALUE exception when broadcasting large variable
Thanks Sean and Imran, I'll try splitting the broadcast variable into smaller ones. I had tried a regular join, but it was failing due to high garbage-collection overhead during the shuffle. One of the RDDs is very large and has a skewed distribution, where a handful of keys account for 90% of the data. Do you have any pointers on how to handle skewed key distributions during a join?

Soila

On Fri, Feb 13, 2015 at 10:49 AM, Imran Rashid wrote:
> unfortunately this is a known issue:
> https://issues.apache.org/jira/browse/SPARK-1476
>
> as Sean suggested, you need to think of some other way of doing the same
> thing, even if it's just breaking your one big broadcast var into a few
> smaller ones
> [...]
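A standard answer to the skew Soila describes is key salting: split each hot key on the large side into several sub-keys by appending a random suffix, replicate the small side's rows once per suffix, join on the salted key, then strip the suffix. Below is a minimal sketch of the idea in plain Python, not Spark; the toy data and the `NUM_SALTS` knob are made up for illustration:

```python
import random
from collections import defaultdict

NUM_SALTS = 4  # how many sub-keys to split each key into (tuning knob)

# Toy data: key "a" is heavily skewed on the large side.
large = [("a", i) for i in range(8)] + [("b", 100)]
small = [("a", "left"), ("b", "right")]

# Salt the large side: spread each key's rows across NUM_SALTS sub-keys.
salted_large = [((k, random.randrange(NUM_SALTS)), v) for k, v in large]

# Replicate the small side once per possible salt value so every
# salted key on the large side still finds its match.
salted_small = [((k, s), v) for k, v in small for s in range(NUM_SALTS)]

# Ordinary hash join on the salted keys; no single key dominates now.
table = defaultdict(list)
for sk, v in salted_small:
    table[sk].append(v)

joined = [(sk[0], lv, sv)              # strip the salt from the result
          for sk, lv in salted_large
          for sv in table.get(sk, [])]

print(sorted(joined)[:3])  # [('a', 0, 'left'), ('a', 1, 'left'), ('a', 2, 'left')]
```

In Spark terms the same trick is two `map` passes before the join: salt the keys of the skewed RDD, `flatMap` the other RDD across all salt values, join, then drop the salt.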
Re: Size exceeds Integer.MAX_VALUE exception when broadcasting large variable
unfortunately this is a known issue:
https://issues.apache.org/jira/browse/SPARK-1476

as Sean suggested, you need to think of some other way of doing the same thing, even if it's just breaking your one big broadcast var into a few smaller ones

On Fri, Feb 13, 2015 at 12:30 PM, Sean Owen wrote:
> I think you've hit the nail on the head. Since the serialization
> ultimately creates a byte array, and arrays can have at most ~2
> billion elements in the JVM, the broadcast can be at most ~2GB.
>
> At that scale, you might consider whether you really have to broadcast
> these values, or want to handle them as RDDs and join and so on.
>
> Or consider whether you can break it up into several broadcasts?
> [...]
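In the spirit of Imran's suggestion: if the 5GB payload is one big lookup structure, it can be sharded by key and each shard broadcast separately, so no single serialized blob approaches the 2GB array limit. A hedged sketch in plain Python — the shard count, the stand-in dict, and the commented-out `sc.broadcast` usage are all illustrative, not a definitive recipe:

```python
NUM_SHARDS = 8  # pick so each serialized shard stays well under 2 GB

def shard_of(key):
    # Any stable hash works; Python's hash() is consistent within one process.
    return hash(key) % NUM_SHARDS

big_lookup = {f"key{i}": i for i in range(1000)}  # stand-in for the 5 GB map

# Split the one big dict into NUM_SHARDS smaller dicts...
shards = [{} for _ in range(NUM_SHARDS)]
for k, v in big_lookup.items():
    shards[shard_of(k)][k] = v

# ...then broadcast each shard on its own, e.g.
# bc_shards = [sc.broadcast(s) for s in shards]

# Lookups route through the owning shard:
def lookup(key):
    # In Spark this would be bc_shards[shard_of(key)].value.get(key)
    return shards[shard_of(key)].get(key)

print(lookup("key42"))  # 42
```

Note that in real Spark code the salt/shard function must be deterministic across driver and executors (e.g. a hash of the key's bytes), since Python's built-in `hash` can vary between processes.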
Re: Size exceeds Integer.MAX_VALUE exception when broadcasting large variable
I think you've hit the nail on the head. Since the serialization ultimately creates a byte array, and arrays can have at most ~2 billion elements in the JVM, the broadcast can be at most ~2GB.

At that scale, you might consider whether you really have to broadcast these values, or want to handle them as RDDs and join and so on.

Or consider whether you can break it up into several broadcasts?

On Fri, Feb 13, 2015 at 6:24 PM, soila wrote:
> I am trying to broadcast a large 5GB variable using Spark 1.2.0. I get the
> following exception when the size of the broadcast variable exceeds 2GB. Any
> ideas on how I can resolve this issue?
>
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
> [...]
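The ~2GB ceiling Sean mentions falls straight out of JVM array indexing: arrays are indexed by a signed 32-bit int, so a `byte[]` tops out at `Integer.MAX_VALUE` elements, and a broadcast that serializes to a single byte array inherits that bound. A quick check of the arithmetic:

```python
INT_MAX = 2**31 - 1  # java.lang.Integer.MAX_VALUE, the max byte[] length

print(INT_MAX)                       # 2147483647 bytes
print(round(INT_MAX / 1024**3, 3))   # 2.0 -- i.e. just under 2 GiB
```

Which is why the failure appears only once the serialized broadcast crosses the 2GB mark, even though the JVM heap itself can be far larger.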
Size exceeds Integer.MAX_VALUE exception when broadcasting large variable
I am trying to broadcast a large 5GB variable using Spark 1.2.0. I get the following exception when the size of the broadcast variable exceeds 2GB. Any ideas on how I can resolve this issue?

java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:829)
        at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
        at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
        at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:99)
        at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:147)
        at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:114)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:787)
        at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
        at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:992)
        at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:98)
        at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:84)
        at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
        at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
        at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
        at org.apache.spark.SparkContext.broadcast(SparkContext.scala:945)

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Size-exceeds-Integer-MAX-VALUE-exception-when-broadcasting-large-variable-tp21648.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org