The basic problem you are running into is that a gzipped file is not
splittable<https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-4/compression#8ca1fda1252b67145680b3a5e9d45b2a>,
so no matter what minSplits you pass, the whole file ends up in a single
partition read by a single task.
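A minimal sketch of the workaround (assuming a SparkContext `sc`; `repartition` is available in Spark 0.8.1+, and the partition count 32 is an arbitrary illustrative value):

```scala
// A .gz file always yields exactly 1 partition, because the gzip codec
// cannot be split. The single-task read itself cannot be parallelized,
// but shuffling immediately after the read spreads the downstream
// stages (groupByKey, map, saveAsTextFile) across the cluster:
val input = sc.textFile("s3n://.../input.gz")  // 1 partition, read by 1 task
  .repartition(32)                             // shuffle into 32 partitions

// If even the single-task read runs out of memory, the only real fix is
// on the storage side: split the data into many smaller .gz files (as
// you already observed, 10 files give 10 partitions), or use a
// splittable format such as bzip2 or plain uncompressed text.
```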


On Sat, Oct 12, 2013 at 4:51 PM, Grega Kešpret <[email protected]> wrote:

> Hi,
>
> I'm getting Java OOM (Heap, GC overhead exceeded), Futures timed out after
> [10000] milliseconds, removing BlockManager with no recent heartbeat etc. I
> have narrowed down the cause to be a big input file from S3. I'm trying to
> make Spark split this file to several smaller chunks, so each of these
> chunks will fit in memory, but I'm out of luck.
>
> I have tried:
> - passing minSplits parameter to something greater than 1 in sc.textFile
> - increasing parameter numPartitions to groupByKey
> - using coalesce with numPartitions greater than 1 and shuffle = true
>
> Basically my flow is like this:
> val input = sc.textFile("s3n://.../input.gz", minSplits)
> input
>   .mapPartitions(l => (key, l))
>   .groupByKey(numPartitions)
>   .map(...)
>   .saveAsTextFile
>
> If I do input.toDebugString, I always have 1 partition (even if the
> minSplits is greater than 1). It seems like Spark is trying to ingest the
> whole input at once. When I manually split the file into several smaller
> ones, I was able to progress successfully, and input.toDebugString was
> showing 10 partitions in case of 10 files.
>
> Thanks,
>
> Grega
>