OK, I just found the answer to one of my questions -- the location and sizes of the spill files. But now there's a discrepancy between what I observe and what you said about compression: the total size of all spill files for a single task matches my estimate for them *without* compression. It seems they aren't compressed, which is strange because I definitely enabled compression the way I described.
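For reference, this is roughly how I'm summing them up -- a quick sketch that assumes the Hadoop 1.x spill naming convention (spill<N>.out) under the task attempt's local output directory; adjust the path to however your mapred.local.dir is laid out:

import java.io.File;

// Quick sketch: sum the on-disk sizes of a task attempt's spill files.
// Assumes Hadoop 1.x naming (spill<N>.out); pass the attempt's local
// output directory, e.g. somewhere under ${mapred.local.dir}.
public class SpillSizes {
    public static void main(String[] args) {
        File outputDir = new File(args[0]);
        File[] files = outputDir.listFiles();
        if (files == null) {
            System.err.println("Not a directory: " + outputDir);
            return;
        }
        long total = 0;
        for (File f : files) {
            if (f.getName().startsWith("spill") && f.getName().endsWith(".out")) {
                System.out.printf("%s\t%d bytes%n", f.getName(), f.length());
                total += f.length();
            }
        }
        System.out.printf("total\t%d bytes%n", total);
    }
}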
2012/11/7 Sigurd Spieckermann <sigurd.spieckerm...@gmail.com>

> OK, just wanted to confirm. Maybe there is another problem then. I just
> looked at the task logs and there were ~200 spills recorded for a single
> task; only afterwards was there a merge phase. In my case, 200 spills are
> about 2GB (uncompressed). One map output record easily fits into the
> in-memory buffer; in fact, a few records fit into it. But Hadoop decides to
> write gigabytes of spill to disk, and it seems that the disk I/O and merging
> make everything really slow. There doesn't seem to be a
> max.num.spills.for.combine though. Is there any typical advice for this
> kind of situation? Also, is there a way to see the size of the compressed
> spill files to get a better idea of the file sizes I'm dealing with?
>
>
> 2012/11/7 Harsh J <ha...@cloudera.com>
>
>> Yes, we do compress each spill output using the same codec as specified
>> for map (intermediate) output compression. However, the counted bytes
>> may be counting decompressed values of the records written, and not
>> post-compressed ones.
>>
>> On Wed, Nov 7, 2012 at 6:02 PM, Sigurd Spieckermann
>> <sigurd.spieckerm...@gmail.com> wrote:
>> > Hi guys,
>> >
>> > I've encountered a situation where the ratio between "Map output bytes"
>> > and "Map output materialized bytes" is quite large, and during the map
>> > phase data is spilled to disk quite a lot. This is something I'll try
>> > to optimize, but I'm wondering if the spill files are compressed at
>> > all. I set mapred.compress.map.output=true and
>> > mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
>> > and everything else seems to be working correctly. Does Hadoop actually
>> > compress spills, or just the final merged output after finishing the
>> > entire map task?
>> >
>> > Thanks,
>> > Sigurd
>>
>> --
>> Harsh J
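P.S. For anyone who finds this thread later, this is the kind of configuration we're talking about -- a minimal sketch using the old mapred API and Hadoop 1.x property names; the buffer values below are only illustrative, not recommendations:

import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapred.JobConf;

public class SpillTuning {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Compress intermediate (map) output; per Harsh's reply, each
        // spill is compressed with this codec too, not just the final merge.
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(SnappyCodec.class);
        // A larger in-memory sort buffer means fewer spills and less
        // merge I/O. Values here are illustrative only.
        conf.setInt("io.sort.mb", 512);                // default: 100 (MB)
        conf.setFloat("io.sort.spill.percent", 0.90f); // default: 0.80
    }
}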