Our operations guy handles our Hadoop configuration, and I think he has set up our Hadoop conf to compress everything. I'm trying to subvert him :-) I think the HADOOP_OPTS trick will work for me; that makes sense. Thanks!
-Luke

On 3/6/12 6:46 PM, "Sean Owen" <[email protected]> wrote:

>Eh, hmm, does this job compress by default? I don't have the code here.
>That is not generally how Hadoop works but you could make it do this. I
>don't know if there's an override.
>On Mar 7, 2012 12:40 AM, "Luke Forehand" <[email protected]> wrote:
>
>> Why should it not be compressed in the first place?
>>
>> Here is the header of one of the reducer parts that was written into
>> /mahout/kmeans/clusters-5-final
>>
>> SEQ org.apache.hadoop.io.Text+org.apache.mahout.clustering.kmeans.Cluster
>> )org.apache.hadoop.io.compress.SnappyCodec
>>
>>
>> On 3/6/12 6:33 PM, "Sean Owen" <[email protected]> wrote:
>>
>> >Ok but you're talking about reducer output not mapper. It should not be
>> >compressed in the first place.
>> >On Mar 7, 2012 12:29 AM, "Luke Forehand" <[email protected]> wrote:
>> >
>> >> I want the results of the kmeans clustering to be uncompressed or
>> >> compressed in a way that my users can natively decompress on their
>> >> machines. All our other hadoop jobs use Snappy compression when writing
>> >> output, but our users don't have Snappy and don't particularly want to
>> >> install it (especially because of problems installing on mac). I'll try
>> >> adding this param to the HADOOP_OPTS and in the longterm probably come up
>> >> with a cleaner way to do this. Thanks!
>> >>
>> >> -Luke
>> >>
>> >> On 3/6/12 6:24 PM, "Sean Owen" <[email protected]> wrote:
>> >>
>> >> >-D arguments are to the JVM so need to be set in HADOOP_OPTS (as I
>> >> >recall). Or you configure this in your Hadoop config files. It has no
>> >> >meaning to the driver script. Why do you want to disable compression
>> >> >after the mapper?
>> >> >
>> >> >On Wed, Mar 7, 2012 at 12:11 AM, Luke Forehand
>> >> ><[email protected]> wrote:
>> >> >> I tried the following and it does not work:
>> >> >>
>> >> >> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors -c
>> >> >> /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 -k 10000 -cd 0.01
>> >> >> -x 100 \
>> >> >> -Dmapreduce.map.output.compress=false
>> >> >>
>> >> >> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors -c
>> >> >> /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 -k 10000 -cd 0.01
>> >> >> -x 100 \
>> >> >> -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
>> >> >>
>> >> >>
>> >> >> And still getting the default codec being used (which is Snappy in this
>> >> >> case and I don't want the users to have to install native snappy which is
>> >> >> why I'm trying to override this param). Passing -Dkey=value on the mahout
>> >> >> command line does not seem to have any effect on the mapreduce job
>> >> >> configuration from what I can tell. Any ideas?
>> >> >>
>> >> >> -Luke
>> >> >>
>> >> >> On 3/6/12 3:48 PM, "Sean Owen" <[email protected]> wrote:
>> >> >>
>> >> >>>Mapper compression? -Dmapreduce.map.output.compress=false. I think the
>> >> >>>key was mapred.output.compress in Hadoop 0.20.0.
>> >> >>>I am not sure if there is reducer compression built-in, but, I could
>> >> >>>have missed it.
>> >> >>>
>> >> >>>On Tue, Mar 6, 2012 at 9:40 PM, Luke Forehand
>> >> >>><[email protected]> wrote:
>> >> >>>> Hello,
>> >> >>>>
>> >> >>>> Is there a way to run the mahout kmeans program from the command line,
>> >> >>>> with a parameter that will override (and disable) the reducer task
>> >> >>>> compression? I have tried several different ways of specifying -D
>> >> >>>> parameter but I can't seem to get any options to pass through to the
>> >> >>>> hadoop mapreduce configuration.
>> >> >>>>
>> >> >>>> Thanks!
>> >> >>>> Luke
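
For anyone checking whether an override actually took effect: the reducer part header quoted above names the codec class in plain text, so the first bytes of the file are enough to tell. Below is a rough sketch of that check, not a tested recipe. It assumes a shell with head -c and a grep that supports -a and -o. The function name seqfile_codec, the 512-byte window, and the part file name in the usage comment are my own assumptions, not something from the thread.

```shell
#!/bin/sh
# Hypothetical helper: report the compression codec class named near the
# start of a SequenceFile read from stdin, or "none" if no codec class
# name shows up in the first 512 bytes (assumed to cover the header).
seqfile_codec() {
    # -a treats the binary header as text; -o prints only the matched class name.
    codec=$(head -c 512 \
        | LC_ALL=C grep -a -o 'org\.apache\.hadoop\.io\.compress\.[A-Za-z0-9]*Codec' \
        | head -n 1)
    echo "${codec:-none}"
}

# Usage (the part file name below is an assumed example, adjust to your output):
#   hadoop fs -cat /mahout/kmeans/clusters-5-final/part-r-00000 | seqfile_codec
```

If this prints none for a fresh run, the output was most likely written uncompressed. If it still prints the SnappyCodec class, the setting did not reach the job configuration.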
