I am not sure, but I think you need a custom Hadoop build (one with the native libraries compiled for your platform) to support native compression. Are you using the Apache release or CDH?

On Oct 7, 2011 3:57 PM, "Paolo Castagna" <[email protected]> wrote:
> Hi,
> I am using Apache Whirr 0.6.0-incubating to create small (i.e. 10-20 nodes)
> Hadoop clusters on Amazon EC2 for testing.
>
> In my hadoop-ec2.properties I have:
>
>   whirr.hardware-id=m1.large
>   # See http://alestic.com/
>   whirr.image-id=eu-west-1/ami-8293a5f6
>   whirr.location-id=eu-west-1
>   whirr.hadoop.version=0.20.204.0
>   whirr.hadoop.tarball.url=http://archive.apache.org/dist/hadoop/core/hadoop-${whirr.hadoop.version}/hadoop-${whirr.hadoop.version}.tar.gz
>
> I am able to successfully run MapReduce jobs, but I see errors as soon as I
> try to enable compression (either for the map output or for the SequenceFile
> at the end of a job).
>
> In the task logs I see this warning, which I think is relevant:
>
>   WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load
>   native-hadoop library for your platform... using builtin-java classes
>   where applicable
>
> This is what I have in my "driver(s)", for the intermediate map output:
>
>   if ( useCompression ) {
>     configuration.setBoolean("mapred.compress.map.output", true);
>     configuration.set("mapred.output.compression.type", "BLOCK");
>     configuration.set("mapred.map.output.compression.codec",
>         "org.apache.hadoop.io.compress.GzipCodec");
>   }
>
> For the final job output:
>
>   if ( useCompression ) {
>     SequenceFileOutputFormat.setCompressOutput(job, true);
>     SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
>     SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
>   }
>
> As you can see, I am trying to use a simple GzipCodec, nothing strange.
>
> What am I doing wrong?
> What should I do in order to be able to use compression in my MapReduce jobs?
>
> Thank you in advance for your help,
> Paolo
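For what it's worth, the "builtin-java classes" that the NativeCodeLoader warning mentions are, for gzip, the ordinary java.util.zip streams from the JDK, so gzip compression itself does not require the native library. A minimal, self-contained sketch of that pure-Java gzip round trip (no Hadoop classes involved; the class name GzipRoundTrip is my own, just for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    public static void main(String[] args) throws IOException {
        byte[] input = "hello hadoop compression".getBytes("UTF-8");

        // Compress with the JDK's pure-Java gzip implementation
        // (java.util.zip), i.e. the "builtin-java" fallback path.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        GZIPOutputStream gzOut = new GZIPOutputStream(compressed);
        gzOut.write(input);
        gzOut.close();

        // Decompress again and verify the round trip.
        GZIPInputStream gzIn = new GZIPInputStream(
                new ByteArrayInputStream(compressed.toByteArray()));
        ByteArrayOutputStream restored = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = gzIn.read(buf)) != -1) {
            restored.write(buf, 0, n);
        }
        System.out.println(new String(restored.toByteArray(), "UTF-8"));
    }
}
```

So the warning by itself is not fatal for GzipCodec; the failures Paolo sees are more likely in the places where Hadoop insists on the native codepath (or in a codec that has no Java fallback), which is why the build/distribution question above matters.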
