The data set is the training set for the random forest (about 36,500 records). Any idea how to partition it further?
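[Editor's note: following up on the partitioning suggestion quoted below, repartitioning the training RDD before handing it to the trainer would look roughly like the sketch here. This is a minimal Scala illustration, not code from the thread; the trainingData name, the partition count, and all trainClassifier hyperparameters are assumptions.]

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

// Spread the rows over more, smaller partitions so no single cached or
// shuffled block approaches the 2 GB limit described in SPARK-6235.
def trainWithMorePartitions(trainingData: RDD[LabeledPoint]): RandomForestModel = {
  val partitioned = trainingData.repartition(200) // partition count is illustrative

  RandomForest.trainClassifier(
    partitioned,
    numClasses = 2,                           // assumes binary classification
    categoricalFeaturesInfo = Map[Int, Int](), // assumes no categorical features
    numTrees = 100,
    featureSubsetStrategy = "auto",
    impurity = "gini",
    maxDepth = 10,
    maxBins = 32,
    seed = 12345)
}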
On Sun, Jul 24, 2016 at 12:31 PM, Andrew Ehrlich <and...@aehrlich.com> wrote:

> It may be this issue: https://issues.apache.org/jira/browse/SPARK-6235,
> which limits the size of the blocks in the file being written to disk
> to 2 GB.
>
> If so, the solution is to tune for smaller tasks: try increasing the
> number of partitions, using a more space-efficient data structure
> inside the RDD, or increasing the amount of memory available to Spark
> and caching the data in memory. Make sure you are using Kryo
> serialization.
>
> Andrew
>
> On Jul 23, 2016, at 9:00 PM, Ascot Moss <ascot.m...@gmail.com> wrote:
>
> Hi,
>
> Please help!
>
> My Spark: 1.6.2
> Java: java8_u40
>
> I am trying random forest training and got "Size exceeds
> Integer.MAX_VALUE". Any idea how to resolve it?
>
> (the log)
> 16/07/24 07:59:49 ERROR Executor: Exception in task 0.0 in stage 7.0 (TID 25)
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>   at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
>   at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
>   at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
>   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
>   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
>   at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)
>   at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
>   at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
>   at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/07/24 07:59:49 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 25,
> localhost): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
> [stack trace identical to the one above]
>
> Regards
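[Editor's note: Andrew's Kryo and caching suggestions translate into configuration along these lines. Again a sketch, not code from the thread; the app name and the registered class list are assumptions, and MEMORY_ONLY_SER is one reasonable space-efficient storage level rather than the only option.]

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Switch the job to Kryo serialization and register the classes that
// dominate the RDD so Kryo can write compact class tags instead of
// full class names.
val conf = new SparkConf()
  .setAppName("RandomForestTraining") // name is illustrative
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[LabeledPoint]))

val sc = new SparkContext(conf)

// Cache the training set serialized in memory: serialized partitions are
// smaller, which also helps keep individual blocks under the 2 GB cap.
def cacheCompactly(trainingData: RDD[LabeledPoint]): RDD[LabeledPoint] =
  trainingData.persist(StorageLevel.MEMORY_ONLY_SER)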