The data set is the training set for the random forest, about 36,500 rows.
Any idea how to partition it further?
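
Would something along these lines be the right direction? A minimal
sketch: trainingData stands for my existing RDD[LabeledPoint], and the
partition count and forest parameters below are placeholder guesses.

  import org.apache.spark.mllib.tree.RandomForest

  // trainingData is the ~36,500-row RDD[LabeledPoint] described above.
  // Spread it over more, smaller partitions before training.
  val repartitioned = trainingData.repartition(64)

  // Train the forest on the repartitioned data.
  val model = RandomForest.trainClassifier(
    repartitioned,
    numClasses = 2,                        // assuming binary labels
    categoricalFeaturesInfo = Map[Int, Int](),
    numTrees = 100,
    featureSubsetStrategy = "auto",
    impurity = "gini",
    maxDepth = 10,
    maxBins = 32,
    seed = 12345)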

On Sun, Jul 24, 2016 at 12:31 PM, Andrew Ehrlich <and...@aehrlich.com>
wrote:

> It may be this issue: https://issues.apache.org/jira/browse/SPARK-6235,
> which tracks Spark's 2GB limit on individual block sizes: blocks written
> to disk are read back via memory-mapped ByteBuffers, and a mapped region
> cannot exceed Integer.MAX_VALUE bytes (about 2GB).
>
> If so, the solution is to tune for smaller tasks: try increasing the
> number of partitions, using a more space-efficient data structure inside
> the RDD, or increasing the amount of memory available to Spark and
> caching the data in memory. Also make sure you are using Kryo
> serialization. A rough sketch of these knobs follows.
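>
> A minimal sketch, assuming Spark 1.6 with the RDD-based MLlib API; the
> path, application name, and partition count are illustrative only:
>
>   import org.apache.spark.{SparkConf, SparkContext}
>   import org.apache.spark.mllib.util.MLUtils
>   import org.apache.spark.storage.StorageLevel
>
>   // Switch to Kryo serialization before the context is created.
>   val conf = new SparkConf()
>     .setAppName("RandomForestTraining")
>     .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   val sc = new SparkContext(conf)
>
>   // Load the training data (hypothetical path), then split it into
>   // more, smaller partitions so no single block nears the 2GB limit.
>   val data = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/training")
>   val repartitioned = data.repartition(200)
>
>   // Cache in serialized form; with Kryo this is much more
>   // space-efficient than caching deserialized Java objects.
>   repartitioned.persist(StorageLevel.MEMORY_ONLY_SER)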
>
> Andrew
>
> On Jul 23, 2016, at 9:00 PM, Ascot Moss <ascot.m...@gmail.com> wrote:
>
>
> Hi,
>
> Please help!
>
> My Spark: 1.6.2
> Java: 8u40
>
> I am training a random forest and got "Size exceeds
> Integer.MAX_VALUE".
>
> Any idea how to resolve it?
>
>
> (the log)
> 16/07/24 07:59:49 ERROR Executor: Exception in task 0.0 in stage 7.0 (TID 25)
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
> at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
> at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
> at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)
> at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
> at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 16/07/24 07:59:49 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 25, localhost): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
> at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
> at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
> at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)
> at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
> at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
>
> Regards
>
>
>
