When I log the calls to the combiner function and print the number of elements iterated over per call, the count is always 1 during the spill-writing phase, and the combiner is called very often. Is this normal behavior? According to what was mentioned earlier, I would expect the combiner to combine all records with the same key that are in the in-memory buffer before the spill, which should be at least a few records per key per spill in my case. This is confusing...
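For context, here is a minimal sketch of the kind of instrumented combiner I mean (the Text/IntWritable types and the summing logic are just placeholders, not my actual job):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical combiner that sums values and logs how many values each
// reduce() call iterates over. In my job the logged count is always 1
// during the spill-writing phase.
public class LoggingSumCombiner
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    int count = 0;
    for (IntWritable value : values) {
      sum += value.get();
      count++;
    }
    // Number of records combined for this key in this call.
    System.err.println("combine(" + key + "): " + count + " value(s)");
    result.set(sum);
    context.write(key, result);
  }
}
```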
2012/11/7 Sigurd Spieckermann <sigurd.spieckerm...@gmail.com>

> Hm, maybe I need some clarification on what the combiner exactly does. From what I understand from "Hadoop - The Definitive Guide", there are a few occasions when a combiner may be called before the sort-and-shuffle phase.
>
> 1) Once the in-memory buffer reaches the threshold, it will spill out to disk. "Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort. Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer." So to me, this means that the combiner at this point only operates on the data that is located in the in-memory buffer. If the buffer can hold at most n records with k distinct keys (uniformly distributed), then the combiner will reduce the records spilled to disk from n to k, i.e. by a factor of n/k. (correct?)
>
> 2) "Before the task is finished, the spill files are merged into a single partitioned and sorted output file. [...] If there are at least three spill files (set by the min.num.spills.for.combine property) then the combiner is run again before the output file is written." So the number of spill files is not affected by the use of a combiner, only their sizes are usually reduced, and only at the end of the map task are all spill files touched again, merged and combined. If I have k distinct keys per map task, then I am guaranteed to end up with k records at the very end of the map task. (correct?)
>
> Is there any other occasion when the combiner may be called? Are spill files ever touched again before the final merge?
>
> Thanks,
> Sigurd
>
>
> 2012/11/7 Sigurd Spieckermann <sigurd.spieckerm...@gmail.com>
>
>> OK, I found the answer to one of my questions just now -- the location of the spill files and their sizes. So, there is a discrepancy between what I see and what you said about the compression. The total size of all spill files of a single task matches what I estimate it to be *without* compression. It seems they aren't compressed, but that's strange because I definitely enabled compression the way I described.
>>
>>
>> 2012/11/7 Sigurd Spieckermann <sigurd.spieckerm...@gmail.com>
>>
>>> OK, just wanted to confirm. Maybe there is another problem then. I just looked at the task logs and there were ~200 spills recorded for a single task; only afterwards was there a merge phase. In my case, 200 spills amount to about 2 GB (uncompressed). One map output record easily fits into the in-memory buffer -- in fact, a few records fit into it. But Hadoop decides to write gigabytes of spill to disk, and it seems that the disk I/O and merging make everything really slow. There doesn't seem to be a max.num.spills.for.combine though. Is there any typical advice for this kind of situation? Also, is there a way to see the size of the compressed spill files to get a better idea of the file sizes I'm dealing with?
>>>
>>>
>>> 2012/11/7 Harsh J <ha...@cloudera.com>
>>>
>>>> Yes, we do compress each spill output using the same codec as specified for map (intermediate) output compression. However, the counted bytes may reflect the decompressed size of the records written, not the post-compression size.
>>>>
>>>> On Wed, Nov 7, 2012 at 6:02 PM, Sigurd Spieckermann <sigurd.spieckerm...@gmail.com> wrote:
>>>> > Hi guys,
>>>> >
>>>> > I've encountered a situation where the ratio between "Map output bytes" and "Map output materialized bytes" is quite large, and during the map phase a lot of data is spilled to disk. This is something I'll try to optimize, but I'm wondering whether the spill files are compressed at all. I set mapred.compress.map.output=true and mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec, and everything else seems to be working correctly. Does Hadoop actually compress spills, or only the final output file after the entire map task finishes?
>>>> >
>>>> > Thanks,
>>>> > Sigurd
>>>>
>>>> --
>>>> Harsh J
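For reference, a minimal driver sketch of the compression setup discussed above; the class and job names are placeholders, the property names are the old mapred.* ones used in this thread (newer releases also accept mapreduce.map.output.compress / mapreduce.map.output.compress.codec):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillCompressionDriver {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();

    // Compress intermediate map output; per the reply above, the same
    // codec is also applied to each spill file.
    conf.setBoolean("mapred.compress.map.output", true);
    conf.set("mapred.map.output.compression.codec",
             "org.apache.hadoop.io.compress.SnappyCodec");

    // Minimum number of spill files before the combiner is rerun during
    // the final merge (3 is the default quoted from the book above).
    conf.setInt("min.num.spills.for.combine", 3);

    Job job = new Job(conf, "spill-compression-example");
    // ... set mapper, combiner, reducer, input/output paths as usual ...
  }
}
```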