gzip and zip are not splittable compression formats; bzip2 and LZO (with an index) are. Ideally, use a splittable compression format.
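For example, you can see the difference directly in the partition count Spark reports (a minimal sketch for the Scala shell; the paths are hypothetical):

    // sc is the SparkContext provided by the shell; paths are made up.
    val gz  = sc.textFile("hdfs:///data/events.log.gz")
    val bz2 = sc.textFile("hdfs:///data/events.log.bz2")

    // gzip can't be split: each .gz file becomes exactly one partition,
    // decompressed by a single task however large it inflates.
    println(gz.partitions.length)   // 1 per file

    // bzip2 is block-oriented, so the Hadoop input format can split it,
    // giving roughly one partition per HDFS block.
    println(bz2.partitions.length)  // > 1 for a multi-block file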
Repartitioning is not a great solution, since it typically means a shuffle. This is not necessarily related to how big your partitions are. The question is: when does this happen, and during what operation?

On Thu, Feb 19, 2015 at 9:35 AM, Joe Wass <jw...@crossref.org> wrote:

> On the advice of some recent discussions on this list, I thought I would
> try to consume gz files directly. I'm reading them, doing a preliminary
> map, then repartitioning, then doing normal Spark things.
>
> As I understand it, gzip files aren't readable in partitions because of
> the format, so I thought that repartitioning would be the next best thing
> for parallelism. I have about 200 files, some about 1GB compressed and
> some over 2GB uncompressed.
>
> I'm hitting the 2GB maximum partition size. It's been discussed on this
> list (topic: "2GB limit for partitions?", tickets SPARK-1476 and
> SPARK-1391). Stack trace at the end. This happened 10 hours in (probably
> when it saw its first file), so I can't just re-run it quickly!
>
> Does anyone have any advice? Might I solve this by repartitioning as the
> first step after reading the file(s)? Or is it effectively impossible to
> read a gz file that expands to over 2GB? Does anyone have any experience
> with this?
>
> Thanks in advance
>
> Joe
>
> Stack trace:
>
> Exception in thread "main" 15/02/18 20:44:25 INFO scheduler.TaskSetManager:
> Lost task 5.3 in stage 1.0 (TID 283) on executor:
> java.lang.IllegalArgumentException (Size exceeds Integer.MAX_VALUE)
> [duplicate 6]
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 2
> in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0:
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>     at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:829)
>     at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
>     at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
>     at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
>     at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432)
>     at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618)
>     at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
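That said, if the failing block is a cached partition (the trace above goes through CacheManager and DiskStore, i.e. a block being read back from the block store), repartitioning to a high count before anything materializes whole blocks should keep each one under the limit. A minimal sketch, assuming the ~200 gz files are read as one RDD; the path, the partition count, and the parse function are all hypothetical stand-ins:

    // ~200 gzipped files read as a single RDD (hypothetical path).
    // Each .gz file still arrives as exactly one partition.
    val raw = sc.textFile("hdfs:///data/*.json.gz")

    // Stand-in for the real preliminary map.
    def parse(line: String): String = line.trim

    // Spread the data over many partitions *before* caching or any other
    // step that stores whole blocks, so no single block exceeds
    // Integer.MAX_VALUE bytes. 2000 is a guess; pick a count that keeps
    // each partition well under 2GB after decompression.
    val parsed = raw.map(parse).repartition(2000).cache()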