Hi Andrei,
you are probably right: there is something you need to do when you
install a Hadoop cluster to support compression:
http://hadoop.apache.org/common/docs/current/native_libraries.html

However, it would be nice if Whirr could do that and make compression
available to developers out of the box. In the meantime, I am going to
try to use the DistributedCache to distribute the necessary shared
libraries:
http://hadoop.apache.org/common/docs/current/native_libraries.html#Native+Shared+Libraries
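If I understand that page correctly, it boils down to something like
this in the job driver (just a rough sketch: the /libraries path and
the library file name are placeholders, and the library must first be
copied into HDFS, e.g. with
"bin/hadoop fs -copyFromLocal libhadoop.so.1.0.0 /libraries/libhadoop.so.1.0.0"):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

Configuration conf = new Configuration();

// Create symlinks in each task's working directory, so that the
// cached file is visible under the '#' alias given below.
DistributedCache.createSymlink(conf);

// Ship the native library to every task node and expose it there
// as 'libhadoop.so'. The HDFS path is a placeholder.
DistributedCache.addCacheFile(
    new URI("/libraries/libhadoop.so.1.0.0#libhadoop.so"), conf);

I have not verified yet that the task JVMs pick the symlinked library
up from their working directory, so take this with a grain of salt.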
If others are using compression with clusters created via Apache Whirr,
I am interested to know what they are doing.

Thanks,
Paolo

On 7 October 2011 15:09, Andrei Savu <[email protected]> wrote:
> Ask on the Hadoop email list if the official release supports
> compression out of the box.
>
> On Oct 7, 2011 5:00 PM, "Paolo Castagna" <[email protected]> wrote:
>>
>> Hi Andrei,
>> thank you for your reply.
>>
>> As I wrote in my previous message, I have this in my
>> hadoop-ec2.properties file:
>>
>> whirr.hadoop.version=0.20.204.0
>> whirr.hadoop.tarball.url=http://archive.apache.org/dist/hadoop/core/hadoop-${whirr.hadoop.version}/hadoop-${whirr.hadoop.version}.tar.gz
>>
>> Therefore I am using the Apache Hadoop 0.20.204.0 release.
>>
>> Can anyone confirm that you cannot use compression with Hadoop
>> clusters installed via Whirr?
>>
>> Paolo
>>
>> On 7 October 2011 14:23, Andrei Savu <[email protected]> wrote:
>> > I am not sure, but I think you need a custom Hadoop build to support
>> > compression. Are you using the Apache release or CDH?
>> >
>> > On Oct 7, 2011 3:57 PM, "Paolo Castagna" <[email protected]> wrote:
>> >>
>> >> Hi,
>> >> I am using Apache Whirr 0.6.0-incubating to create small (i.e. 10-20
>> >> node) Hadoop clusters on Amazon EC2 for testing.
>> >>
>> >> In my hadoop-ec2.properties I have:
>> >>
>> >> whirr.hardware-id=m1.large
>> >> # See http://alestic.com/
>> >> whirr.image-id=eu-west-1/ami-8293a5f6
>> >> whirr.location-id=eu-west-1
>> >> whirr.hadoop.version=0.20.204.0
>> >> whirr.hadoop.tarball.url=http://archive.apache.org/dist/hadoop/core/hadoop-${whirr.hadoop.version}/hadoop-${whirr.hadoop.version}.tar.gz
>> >>
>> >> I am able to run MapReduce jobs successfully, but I see errors as
>> >> soon as I try to enable compression (either for the map output or
>> >> for the SequenceFile at the end of a job).
>> >>
>> >> In the task logs I see this warning, which I think is relevant:
>> >>
>> >> WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load
>> >> native-hadoop library for your platform... using builtin-java
>> >> classes where applicable
>> >>
>> >> This is what I have in my "driver(s)" for the intermediate map output:
>> >>
>> >> if (useCompression) {
>> >>   configuration.setBoolean("mapred.compress.map.output", true);
>> >>   configuration.set("mapred.output.compression.type", "BLOCK");
>> >>   configuration.set("mapred.map.output.compression.codec",
>> >>       "org.apache.hadoop.io.compress.GzipCodec");
>> >> }
>> >>
>> >> And for the final job output:
>> >>
>> >> if (useCompression) {
>> >>   SequenceFileOutputFormat.setCompressOutput(job, true);
>> >>   SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
>> >>   SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
>> >> }
>> >>
>> >> As you can see, I am trying to use a simple GzipCodec, nothing strange.
>> >>
>> >> What am I doing wrong?
>> >> What should I do in order to be able to use compression in my
>> >> MapReduce jobs?
>> >>
>> >> Thank you in advance for your help,
>> >> Paolo
>> >
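P.S. To check whether the native library is actually being picked up, I
am going to log something like this from the driver (a sketch;
NativeCodeLoader is the class that emits the warning quoted above, and
ZlibFactory reports whether the native zlib bindings are usable):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.zlib.ZlibFactory;
import org.apache.hadoop.util.NativeCodeLoader;

Configuration conf = new Configuration();

// True only if libhadoop.so was found on java.library.path.
System.out.println("native-hadoop loaded: "
    + NativeCodeLoader.isNativeCodeLoaded());

// True only if the native zlib bindings are usable; otherwise Hadoop
// falls back to the built-in Java implementation (the "builtin-java
// classes" mentioned in the warning).
System.out.println("native zlib loaded: "
    + ZlibFactory.isNativeZlibLoaded(conf));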
