On the advice of some recent discussions on this list, I thought I would
try consuming gz files directly. I'm reading them, doing a preliminary
map, then repartitioning, then doing normal Spark things; roughly the
pipeline sketched below.
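
Here is roughly what the job looks like (the paths, the parse function
and the partition count are placeholders, not my real code):

    import org.apache.spark.{SparkConf, SparkContext}

    object GzIngest {
      // Stand-in for my real per-line parsing
      def parseLine(line: String): (String, Int) =
        (line.split("\t")(0), 1)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("gz-ingest"))

        // gzip isn't splittable, so each .gz file arrives as one partition
        val raw = sc.textFile("/data/input/*.gz")

        val parsed = raw.map(parseLine)       // preliminary map
        val spread = parsed.repartition(400)  // repartition for parallelism

        // ... the "normal Spark things" happen from here
        println(spread.count())

        sc.stop()
      }
    }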

As I understand it, gzip files aren't splittable because of the format, so
each one is read as a single partition; I thought that repartitioning
would be the next best thing for parallelism. I have about 200 files, some
about 1GB compressed and some expanding to over 2GB uncompressed.

I'm hitting the 2GB limit on partition size. It's been discussed on this
list (topic: "2GB limit for partitions?"; tickets SPARK-1476 and
SPARK-1391). Stack trace at the end. The failure happened about 10 hours
in (probably when the job hit its first oversized file), so I can't just
re-run it quickly.

Does anyone have any advice? Might I solve this by re-partitioning as the
first step after reading the file(s), roughly as sketched below? Or is it
effectively impossible to read a gz file that expands to over 2GB? Has
anyone run into this?
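
Concretely, what I have in mind by re-partitioning first is something like
this (same placeholder sc and parseLine as the sketch above; the partition
count is a guess), so that nothing downstream ever has to hold a whole
decompressed file as a single block:

    // Repartition immediately after reading, before the map and before
    // anything gets cached, so no single cached block should have to
    // hold an entire >2GB decompressed file.
    val raw = sc.textFile("/data/input/*.gz")
    val spread = raw.repartition(2000)   // guessing at a good number here
    val parsed = spread.map(parseLine)
    parsed.cache()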

Thanks in advance

Joe

Stack trace:

Exception in thread "main" 15/02/18 20:44:25 INFO scheduler.TaskSetManager:
Lost task 5.3 in stage 1.0 (TID 283) on executor:
java.lang.IllegalArgumentException (Size exceeds Integer.MAX_VALUE)
[duplicate 6]
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2
in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage
1.0: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:829)
        at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
        at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
        at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
        at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432)
        at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
