Hi, has anyone tried to use one of Hadoop's CompressionCodecs for Nutch segments? It could be worth using a codec other than DefaultCodec, mainly BZip2, not only because it needs less storage, but also because copying larger data replicas around the cluster is sometimes slower and more expensive than the extra CPU time required for tighter compression. In my case, it's about uploading the segments to AWS S3, which is also sometimes slow.
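
For context, this is how I tried to request output compression via the job configuration (the property names are the standard Hadoop ones, the BZip2/BLOCK values are just what I wanted to try):

  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.BZip2Codec</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.type</name>
    <value>BLOCK</value>
  </property>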
It turned out that the value of the property mapreduce.output.fileoutputformat.compress.codec is not used when writing the segments - DefaultCodec is always used. After passing the configured codec through to the segment writers (see https://github.com/sebastian-nagel/nutch/tree/segment-compression-codec), a test on a small and maybe not representative sample showed that BZip2Codec with BLOCK compression reduces the size of the content subdir by about 30%. The other subdirs are either too small in the sample to give meaningful results or are always compressed per RECORD (parse_text). Before running a larger test I would like to hear about your experiences. Thanks, Sebastian
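
For illustration, a rough sketch of how a segment writer could pick up the configured codec instead of a hard-coded DefaultCodec (class and method names below are made up for the example and not necessarily what the branch above does):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.SequenceFile.CompressionType;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.compress.CompressionCodec;
  import org.apache.hadoop.io.compress.DefaultCodec;
  import org.apache.hadoop.util.ReflectionUtils;

  public class SegmentCompressionSketch {
    /** Open a SequenceFile writer for a segment subdir, using the codec and
     *  compression type from the job configuration instead of DefaultCodec. */
    public static SequenceFile.Writer openWriter(Configuration conf, Path part)
        throws IOException {
      // read the codec class, falling back to DefaultCodec as before
      Class<? extends CompressionCodec> codecClass = conf.getClass(
          "mapreduce.output.fileoutputformat.compress.codec",
          DefaultCodec.class, CompressionCodec.class);
      CompressionCodec codec = ReflectionUtils.newInstance(codecClass, conf);
      // compression type (RECORD is Hadoop's default, BLOCK compresses tighter)
      CompressionType type = CompressionType.valueOf(
          conf.get("mapreduce.output.fileoutputformat.compress.type", "RECORD"));
      return SequenceFile.createWriter(conf,
          SequenceFile.Writer.file(part),
          SequenceFile.Writer.keyClass(Text.class),
          SequenceFile.Writer.valueClass(Text.class),
          SequenceFile.Writer.compression(type, codec));
    }
  }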

