Our operations guy handles our Hadoop configuration, and I think he has
set up our Hadoop conf to compress everything.  I'm trying to subvert him
:-)  I think the HADOOP_OPTS trick will work for me; that makes sense.
Thanks!
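
For the archive, a minimal sketch of the HADOOP_OPTS approach discussed
below.  It is untested: it assumes the mahout driver script hands
HADOOP_OPTS to the JVM and that the job configuration then honors the
property, neither of which is confirmed in this thread.  The property name
and the kmeans arguments are copied from the earlier messages.

```shell
# Append the codec override to any existing HADOOP_OPTS instead of replacing it.
# Assumption: the job picks up -D settings passed this way (not confirmed here).
HADOOP_OPTS="${HADOOP_OPTS:-} -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"
export HADOOP_OPTS

# Then run the job as before (command taken from earlier in the thread):
# mahout kmeans -i /mahout/sparse/test1/tfidf-vectors -c /mahout/initial-clusters/test1 \
#   -o /mahout/kmeans/test1 -k 10000 -cd 0.01 -x 100
```

Checking the header of a reducer part file afterwards, as in the SEQ dump
quoted below, should show whether the codec actually changed.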

-Luke

On 3/6/12 6:46 PM, "Sean Owen" <[email protected]> wrote:

>Eh, hmm, does this job compress by default? I don't have the code here.
>That is not generally how Hadoop works but you could make it do this. I
>don't know if there's an override.
>On Mar 7, 2012 12:40 AM, "Luke Forehand" <
>[email protected]> wrote:
>
>> Why should it not be compressed in the first place?
>>
>> Here is the header of one of the reducer parts that was written into
>> /mahout/kmeans/clusters-5-final
>>
>> SEQ  
>>org.apache.hadoop.io.Text+org.apache.mahout.clustering.kmeans.Cluster
>>  )org.apache.hadoop.io.compress.SnappyCodec
>>
>>
>> On 3/6/12 6:33 PM, "Sean Owen" <[email protected]> wrote:
>>
>> >Ok but you're talking about reducer output not mapper. It should not be
>> >compressed in the first place.
>> >On Mar 7, 2012 12:29 AM, "Luke Forehand" <
>> >[email protected]> wrote:
>> >
>> >> I want the results of the kmeans clustering to be uncompressed or
>> >> compressed in a way that my users can natively decompress on their
>> >> machines.  All our other hadoop jobs use Snappy compression when
>> >> writing output, but our users don't have Snappy and don't
>> >> particularly want to install it (especially because of problems
>> >> installing on mac).  I'll try adding this param to the HADOOP_OPTS
>> >> and in the longterm probably come up with a cleaner way to do this.
>> >> Thanks!
>> >>
>> >> -Luke
>> >>
>> >> On 3/6/12 6:24 PM, "Sean Owen" <[email protected]> wrote:
>> >>
>> >> >-D arguments are to the JVM so need to be set in HADOOP_OPTS (as I
>> >> >recall). Or you configure this in your Hadoop config files.  It has
>> >> >no meaning to the driver script. Why do you want to disable
>> >> >compression after the mapper?
>> >> >
>> >> >On Wed, Mar 7, 2012 at 12:11 AM, Luke Forehand
>> >> ><[email protected]> wrote:
>> >> >> I tried the following and it does not work:
>> >> >>
>> >> >> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors -c
>> >> >> /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 -k 10000
>> >> >> -cd 0.01 -x 100 \
>> >> >> -Dmapreduce.map.output.compress=false
>> >> >>
>> >> >> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors -c
>> >> >> /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 -k 10000
>> >> >> -cd 0.01 -x 100 \
>> >> >> -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
>> >> >>
>> >> >>
>> >> >> And still getting the default codec being used (which is Snappy in
>> >> >> this case and I don't want the users to have to install native
>> >> >> snappy which is why I'm trying to override this param).  Passing
>> >> >> -Dkey=value on the mahout command line does not seem to have any
>> >> >> effect on the mapreduce job configuration from what I can tell.
>> >> >> Any ideas?
>> >> >>
>> >> >> -Luke
>> >> >>
>> >> >> On 3/6/12 3:48 PM, "Sean Owen" <[email protected]> wrote:
>> >> >>
>> >> >>>Mapper compression? -Dmapreduce.map.output.compress=false. I think
>> >> >>>the key was mapred.output.compress in Hadoop 0.20.0.
>> >> >>>I am not sure if there is reducer compression built-in, but, I
>> >> >>>could have missed it.
>> >> >>>
>> >> >>>On Tue, Mar 6, 2012 at 9:40 PM, Luke Forehand
>> >> >>><[email protected]> wrote:
>> >> >>>> Hello,
>> >> >>>>
>> >> >>>> Is there a way to run the mahout kmeans program from the command
>> >> >>>> line, with a parameter that will override (and disable) the
>> >> >>>> reducer task compression?  I have tried several different ways of
>> >> >>>> specifying -D parameter but I can't seem to get any options to
>> >> >>>> pass through to the hadoop mapreduce configuration.
>> >> >>>>
>> >> >>>> Thanks!
>> >> >>>> Luke
>> >> >>
>> >>
>> >>
>>
>>
