It may be this issue: https://issues.apache.org/jira/browse/SPARK-6235, which limits the size of any single block written to disk to 2GB, because Spark memory-maps the block file and the JDK cannot map more than Integer.MAX_VALUE bytes.
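For context, that ceiling is visible directly in the JDK: FileChannel.map rejects any mapping larger than Integer.MAX_VALUE bytes, which is the exact exception in your trace. A minimal standalone sketch (the file path is only a placeholder, not anything from your job):

    import java.io.RandomAccessFile
    import java.nio.channels.FileChannel

    object MapLimitSketch {
      def main(args: Array[String]): Unit = {
        // Placeholder file; any readable file works for demonstrating the limit.
        val channel = new RandomAccessFile("/tmp/placeholder.bin", "r").getChannel
        try {
          // Asking for more than Integer.MAX_VALUE bytes fails before any I/O happens,
          // which is the same limit DiskStore.getBytes runs into for an oversized block.
          channel.map(FileChannel.MapMode.READ_ONLY, 0, Int.MaxValue.toLong + 1)
        } catch {
          case e: IllegalArgumentException =>
            println(e) // java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
        } finally {
          channel.close()
        }
      }
    }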

If so, the fix is to tune for smaller tasks: increase the number of partitions, use a more space-efficient data structure inside the RDD, or give Spark more memory and cache the data in memory. Also make sure you are using Kryo serialization. A rough sketch of those settings is below.
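This is only a sketch against the Spark 1.6 Scala API; the input path and partition count are placeholders you would tune for your data, not values taken from your job.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object TuningSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("random-forest-tuning-sketch")
          // Kryo produces much smaller serialized blocks than Java serialization.
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        val sc = new SparkContext(conf)

        // Placeholder input path.
        val raw = sc.textFile("data/training.txt")

        // More partitions means smaller blocks per partition; the goal is to keep
        // every cached and shuffled block comfortably under the 2GB limit.
        val repartitioned = raw.repartition(400)

        // Serialized caching keeps the in-memory footprint smaller than holding
        // fully deserialized Java objects.
        repartitioned.persist(StorageLevel.MEMORY_ONLY_SER)

        println(s"partitions: ${repartitioned.partitions.length}")
        sc.stop()
      }
    }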

Andrew

> On Jul 23, 2016, at 9:00 PM, Ascot Moss <ascot.m...@gmail.com> wrote:
> 
> 
> Hi,
> 
> Please help!
> 
> My spark: 1.6.2
> Java: java8_u40
> 
> I am trying random forest training and I got "Size exceeds Integer.MAX_VALUE".
> 
> Any idea how to resolve it?
> 
> 
> (the log) 
> 16/07/24 07:59:49 ERROR Executor: Exception in task 0.0 in stage 7.0 (TID 25)
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>     at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
>     at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
>     at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
>     at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
>     at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
>     at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)
>     at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
>     at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
>     at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)
>     at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> 16/07/24 07:59:49 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 25, localhost): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>     at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
>     at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
>     at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
>     at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
>     at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
>     at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)
>     at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
>     at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
>     at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)
>     at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> 
> 
> Regards
> 
