Thanks a lot, I didn't know this. If I use another compression format that supports splitting (like bzip2), do I get decompression for free with sc.textFile, as with gzipped files?
Grega

--
*Grega Kešpret*
Analytics engineer
Celtra — Rich Media Mobile Advertising
celtra.com <http://www.celtra.com/> | @celtramobile <http://www.twitter.com/celtramobile>

On Sun, Oct 13, 2013 at 2:07 AM, Mark Hamstra <[email protected]> wrote:

> The basic problem that you are running into is that a gzipped file is not
> splittable <https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-4/compression#8ca1fda1252b67145680b3a5e9d45b2a>.
>
> On Sat, Oct 12, 2013 at 4:51 PM, Grega Kešpret <[email protected]> wrote:
>
>> Hi,
>>
>> I'm getting Java OOM errors (heap space, GC overhead limit exceeded),
>> "Futures timed out after [10000] milliseconds", "removing BlockManager
>> with no recent heartbeat", etc. I have narrowed the cause down to a big
>> input file from S3. I'm trying to make Spark split this file into several
>> smaller chunks, so that each chunk fits in memory, but I'm out of luck.
>>
>> I have tried:
>> - passing a minSplits parameter greater than 1 to sc.textFile
>> - increasing the numPartitions parameter to groupByKey
>> - using coalesce with numPartitions greater than 1 and shuffle = true
>>
>> Basically my flow is like this:
>>
>> val input = sc.textFile("s3n://.../input.gz", minSplits)
>> input
>>   .mapPartitions(l => (key, l))
>>   .groupByKey(numPartitions)
>>   .map(...)
>>   .saveAsTextFile
>>
>> If I do input.toDebugString, I always get 1 partition (even if
>> minSplits is greater than 1). It seems like Spark is trying to ingest
>> the whole input at once. When I manually split the file into several
>> smaller files, the job progressed successfully, and input.toDebugString
>> showed 10 partitions in the case of 10 files.
>>
>> Thanks,
>>
>> Grega
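For reference, a minimal sketch of the two workarounds discussed in this thread (untested; the paths mirror the ones above, and minSplits / numPartitions are placeholders for whatever values fit your cluster):

```scala
// Option 1: store the input with a splittable codec such as bzip2.
// Hadoop's TextInputFormat can then split the file, so the minSplits
// hint to sc.textFile is actually honored and you get >1 partition.
val bz = sc.textFile("s3n://.../input.bz2", minSplits)

// Option 2: with a non-splittable .gz input, force a shuffle right after
// reading so the downstream stages (groupByKey, map) run in parallel.
// Caveat: the single initial task still has to read and decompress the
// whole file, so this only spreads out the later stages, not the read.
val gz = sc.textFile("s3n://.../input.gz")
  .coalesce(numPartitions, shuffle = true)
```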
