gzip and zip are not splittable compression formats; bzip2 is, and LZO is once you build an index for it.
Ideally, use a splittable compression format.
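
You can see the difference from the shell. Roughly (paths and counts here are made up):

  val gz = sc.textFile("hdfs:///logs/big.gz", minPartitions = 16)
  println(gz.partitions.length)   // always 1: gzip can't be split

  val bz2 = sc.textFile("hdfs:///logs/big.bz2", minPartitions = 16)
  println(bz2.partitions.length)  // up to 16: bzip2 splits at block boundaries

minPartitions is only a hint; with a non-splittable codec you get one
partition per file no matter what you ask for.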

Repartitioning is not a great solution, since it typically means a full shuffle.

This is not necessarily related to how big your partitions are. The
question is: when does this happen, and in what operation?
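
If the failure is on reading back a cached block (the trace below goes
through CacheManager and DiskStore), then making sure the repartition
happens before anything is cached should keep each block well under 2GB.
Something like this, roughly (the map and the partition count are
placeholders):

  val lines = sc.textFile("hdfs:///logs/*.gz")   // one partition per gz file
  val records = lines
    .map(line => line.split('\t'))               // stand-in for your preliminary map
    .repartition(500)                            // shuffle into many smaller partitions
    .cache()                                     // cached blocks are now much smaller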

On Thu, Feb 19, 2015 at 9:35 AM, Joe Wass <jw...@crossref.org> wrote:
> On the advice of some recent discussions on this list, I thought I would try
> and consume gz files directly. I'm reading them, doing a preliminary map,
> then repartitioning, then doing normal spark things.
>
> As I understand it, gz files aren't readable in partitions because of the
> format, so I thought that repartitioning would be the next best thing for
> parallelism. I have about 200 files, some about 1GB compressed and some over
> 2GB uncompressed.
>
> I'm hitting the 2GB maximum partition size. It's been discussed on this list
> (topic: "2GB limit for partitions?", tickets SPARK-1476 and SPARK-1391).
> Stack trace at the end. This happened at 10 hours in (probably when it saw
> its first file). I can't just re-run it quickly!
>
> Does anyone have any advice? Might I solve this by re-partitioning as the
> first step after reading the file(s)? Or is it effectively impossible to
> read a gz file that expands to over 2GB? Does anyone have any experience
> with this?
>
> Thanks in advance
>
> Joe
>
> Stack trace:
>
> Exception in thread "main" 15/02/18 20:44:25 INFO scheduler.TaskSetManager:
> Lost task 5.3 in stage 1.0 (TID 283) on executor:
> java.lang.IllegalArgumentException (Size exceeds Integer.MAX_VALUE)
> [duplicate 6]
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in
> stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0:
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>         at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:829)
>         at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
>         at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
>         at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
>         at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432)
>         at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618)
>         at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
