Following some recent discussions on this list, I thought I would try to consume gz files directly. I'm reading them, doing a preliminary map, then repartitioning, then doing normal Spark things.
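Roughly, the pipeline looks like this (a simplified sketch, not the real job: the path, the parse function and the partition count are just illustrative, and sc is the SparkContext):

    val raw = sc.textFile("s3n://my-bucket/input/*.gz")  // each .gz file comes in as a single partition
    val parsed = raw.map(line => parse(line))            // preliminary map, still one partition per file
    val spread = parsed.repartition(2000)                // spread the data out before anything heavy
    // ...normal Spark things (joins, aggregations, caching) from here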
As I understand it, gzipped files aren't splittable into partitions because of the format, so I thought that repartitioning would be the next best thing for parallelism.

I have about 200 files, some about 1GB compressed and some over 2GB uncompressed, and I'm hitting the 2GB maximum partition size that has been discussed on this list before (topic: "2GB limit for partitions?", tickets SPARK-1476 and SPARK-1391). Stack trace at the end. This happened at 10 hours in (probably when it saw its first oversized file), so I can't just re-run it quickly!

Does anyone have any advice? Might I solve this by repartitioning as the first step after reading the file(s)? Or is it effectively impossible to read a gz file that expands to over 2GB? Does anyone have any experience with this?

Thanks in advance,
Joe

Stack trace:

Exception in thread "main" 15/02/18 20:44:25 INFO scheduler.TaskSetManager: Lost task 5.3 in stage 1.0 (TID 283) on executor: java.lang.IllegalArgumentException (Size exceeds Integer.MAX_VALUE) [duplicate 6]
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:829)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
    at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
    at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432)
    at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)