Hi,

Could you try repartitioning the data with .repartition(# of cores on the machine), or, 
while reading the data, supplying a minimum number of partitions, as in
sc.textFile(path, # of cores on the machine)?

It may be that all of the data is ending up in a single block. Spark reads each block 
through a memory-mapped ByteBuffer, which is indexed by an Int, so any block larger 
than roughly 2 GB fails with the "Size exceeds Integer.MAX_VALUE" error; with a few 
billion rows in one partition that limit is easy to hit.
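For example, something along these lines in the shell (just a rough sketch with 
placeholder values; tune the partition count so no single partition gets anywhere 
near 2 GB):

  // Assumes the Spark shell, where "sc" is already defined; the path and
  // partition count below are placeholders for illustration.
  val numPartitions = 200

  // Option 1: ask for a minimum number of partitions when reading the data.
  val lines = sc.textFile("hdfs:///path/to/input", numPartitions)

  // Option 2: repartition an existing RDD before the shuffle-heavy aggregation.
  val repartitioned = lines.repartition(numPartitions)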

Best,
Burak

----- Original Message -----
From: "francisco" <ftanudj...@nextag.com>
To: u...@spark.incubator.apache.org
Sent: Wednesday, September 17, 2014 3:18:29 PM
Subject: Size exceeds Integer.MAX_VALUE in BlockFetcherIterator

Hi,

We are running an aggregation on a huge data set (a few billion rows).
While running the task we got the following error (see below). Any ideas?
We are running Spark 1.1.0 on the CDH distribution.

...
14/09/17 13:33:30 INFO Executor: Finished task 0.0 in stage 1.0 (TID 0).
2083 bytes result sent to driver
14/09/17 13:33:30 INFO CoarseGrainedExecutorBackend: Got assigned task 1
14/09/17 13:33:30 INFO Executor: Running task 0.0 in stage 2.0 (TID 1)
14/09/17 13:33:30 INFO TorrentBroadcast: Started reading broadcast variable
2
14/09/17 13:33:30 INFO MemoryStore: ensureFreeSpace(1428) called with
curMem=163719, maxMem=34451478282
14/09/17 13:33:30 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes
in memory (estimated size 1428.0 B, free 32.1 GB)
14/09/17 13:33:30 INFO BlockManagerMaster: Updated info of block
broadcast_2_piece0
14/09/17 13:33:30 INFO TorrentBroadcast: Reading broadcast variable 2 took
0.027374294 s
14/09/17 13:33:30 INFO MemoryStore: ensureFreeSpace(2336) called with
curMem=165147, maxMem=34451478282
14/09/17 13:33:30 INFO MemoryStore: Block broadcast_2 stored as values in
memory (estimated size 2.3 KB, free 32.1 GB)
14/09/17 13:33:30 INFO MapOutputTrackerWorker: Updating epoch to 1 and
clearing cache
14/09/17 13:33:30 INFO MapOutputTrackerWorker: Don't have map outputs for
shuffle 1, fetching them
14/09/17 13:33:30 INFO MapOutputTrackerWorker: Doing the fetch; tracker
actor =
Actor[akka.tcp://sparkdri...@sas-model1.pv.sv.nextag.com:56631/user/MapOutputTracker#794212052]
14/09/17 13:33:30 INFO MapOutputTrackerWorker: Got the output locations
14/09/17 13:33:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/09/17 13:33:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
Getting 1 non-empty blocks out of 1 blocks
14/09/17 13:33:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
Started 0 remote fetches in 8 ms
14/09/17 13:33:30 ERROR BlockFetcherIterator$BasicBlockFetcherIterator:
Error occurred while fetching local blocks
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
        at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:104)
        at org.apache.spark.storage.DiskStore.getValues(DiskStore.scala:120)
        at
org.apache.spark.storage.BlockManager.getLocalFromDisk(BlockManager.scala:358)
        at
org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$getLocalBlocks$1.apply(BlockFetcherIterator.scala:208)
        at
org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$getLocalBlocks$1.apply(BlockFetcherIterator.scala:205)
        at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at
org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator.getLocalBlocks(BlockFetcherIterator.scala:205)
        at
org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator.initialize(BlockFetcherIterator.scala:240)
        at
org.apache.spark.storage.BlockManager.getMultiple(BlockManager.scala:583)
        at
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:77)
        at
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:41)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
14/09/17 13:33:30 INFO CoarseGrainedExecutorBackend: Got assigned task 2
14/09/17 13:33:30 INFO Executor: Running task 0.0 in stage 1.1 (TID 2)




---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

