The basic problem you are running into is that a gzipped file is not splittable <https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-4/compression#8ca1fda1252b67145680b3a5e9d45b2a>.
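Because the gzip codec cannot be split, Spark reads the whole file as a single partition no matter what minSplits you pass. One workaround (a sketch, assuming the standard RDD API; the partition count and the extractKey helper are hypothetical placeholders) is to repartition immediately after reading, so the expensive shuffle and grouping stages run in parallel:

```scala
// Sketch: the .gz file is still read by a single task (gzip is not
// splittable), but repartition shuffles the data out to many partitions
// before the memory-hungry groupByKey.
val input = sc.textFile("s3n://.../input.gz")   // always 1 partition for a .gz
val spread = input.repartition(100)             // hypothetical partition count

spread
  .map(line => (extractKey(line), line))        // extractKey is a placeholder
  .groupByKey()
  .saveAsTextFile("s3n://.../output")           // placeholder output path
```

Note this only helps if the single reading task can at least stream the file; if decompressing and reading alone exhausts the heap, the file has to be split before ingestion (as you did manually) or stored with a splittable codec such as bzip2 or LZO with an index.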
On Sat, Oct 12, 2013 at 4:51 PM, Grega Kešpret <[email protected]> wrote:
> Hi,
>
> I'm getting Java OOM (Heap, GC overhead exceeded), "Futures timed out
> after [10000] milliseconds", removing BlockManager with no recent
> heartbeat, etc. I have narrowed down the cause to a big input file from
> S3. I'm trying to make Spark split this file into several smaller
> chunks, so that each chunk fits in memory, but I'm out of luck.
>
> I have tried:
> - passing a minSplits parameter greater than 1 to sc.textFile
> - increasing the numPartitions parameter to groupByKey
> - using coalesce with numPartitions greater than 1 and shuffle = true
>
> Basically my flow is like this:
>
> val input = sc.textFile("s3n://.../input.gz", minSplits)
> input
>   .mapPartitions(l => (key, l))
>   .groupByKey(numPartitions)
>   .map(...)
>   .saveAsTextFile
>
> If I do input.toDebugString, I always see 1 partition (even if minSplits
> is greater than 1). It seems like Spark is trying to ingest the whole
> input at once. When I manually split the file into several smaller ones,
> I was able to make progress, and input.toDebugString showed 10
> partitions in the case of 10 files.
>
> Thanks,
>
> Grega
