Re: override mapreduce compression?

Dmitriy Lyubimov Wed, 07 Mar 2012 13:02:59 -0800

Don't the Hadoop *-site.xml settings on the driver's client usually
override whatever is set on the cluster? Or do you not have the privileges
to change those either?
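
A minimal sketch of that client-side override, assuming a Hadoop 0.20/1.x-style client whose conf directory you are allowed to copy. add_override is a hypothetical helper, and the paths in the usage comments are placeholders. mapred.output.compress is the old-style key for job (reducer) output compression; newer releases name it mapreduce.output.fileoutputformat.compress. If the cluster marks the property as final, a client-side value will probably not win.

```shell
#!/bin/sh
# Sketch only: add a property override to a client-side copy of mapred-site.xml.

# add_override FILE NAME VALUE   (hypothetical helper, not part of Hadoop)
# Inserts a <property> block just before the closing </configuration> tag.
add_override() {
    file="$1"; name="$2"; value="$3"
    tmp="$file.tmp.$$"
    awk -v n="$name" -v v="$value" '
        /<\/configuration>/ {
            print "  <property>"
            print "    <name>" n "</name>"
            print "    <value>" v "</value>"
            print "  </property>"
        }
        { print }
    ' "$file" > "$tmp" && mv "$tmp" "$file"
}

# Example usage (all paths are assumptions; adjust to your client install):
#   cp -R /etc/hadoop/conf "$HOME/hadoop-conf-nocompress"
#   add_override "$HOME/hadoop-conf-nocompress/mapred-site.xml" \
#       mapred.output.compress false
#   export HADOOP_CONF_DIR="$HOME/hadoop-conf-nocompress"
```

Working on a private copy leaves the operations-managed configuration untouched, and unsetting HADOOP_CONF_DIR restores the default behaviour.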


On Tue, Mar 6, 2012 at 4:54 PM, Luke Forehand
<[email protected]> wrote:
> Our operations guy handles our Hadoop configuration, and I think he has
> set up our Hadoop conf to compress everything.  I'm trying to subvert him
> :-)  I think the HADOOP_OPTS trick will work for me; that makes
> sense.  Thanks!
>
> -Luke
>
> On 3/6/12 6:46 PM, "Sean Owen" <[email protected]> wrote:
>
>>Eh, hmm, does this job compress by default? I don't have the code here.
>>That is not generally how Hadoop works but you could make it do this. I
>>don't know if there's an override.
>>On Mar 7, 2012 12:40 AM, "Luke Forehand" <
>>[email protected]> wrote:
>>
>>> Why should it not be compressed in the first place?
>>>
>>> Here is the header of one of the reducer parts that was written into
>>> /mahout/kmeans/clusters-5-final
>>>
>>> SEQ
>>>org.apache.hadoop.io.Text+org.apache.mahout.clustering.kmeans.Cluster
>>>  )org.apache.hadoop.io.compress.SnappyCodec
>>>
>>>
>>> On 3/6/12 6:33 PM, "Sean Owen" <[email protected]> wrote:
>>>
>>> >Ok but you're talking about reducer output not mapper. It should not be
>>> >compressed in the first place.
>>> >On Mar 7, 2012 12:29 AM, "Luke Forehand" <
>>> >[email protected]> wrote:
>>> >
>>> >> I want the results of the kmeans clustering to be uncompressed or
>>> >> compressed in a way that my users can natively decompress on their
>>> >> machines.  All our other Hadoop jobs use Snappy compression when writing
>>> >> output, but our users don't have Snappy and don't particularly want to
>>> >> install it (especially because of problems installing on Mac).  I'll try
>>> >> adding this param to the HADOOP_OPTS and in the long term probably come up
>>> >> with a cleaner way to do this.  Thanks!
>>> >>
>>> >> -Luke
>>> >>
>>> >> On 3/6/12 6:24 PM, "Sean Owen" <[email protected]> wrote:
>>> >>
>>> >> >-D arguments are to the JVM so need to be set in HADOOP_OPTS (as I
>>> >> >recall). Or you configure this in your Hadoop config files.  It has no
>>> >> >meaning to the driver script. Why do you want to disable compression
>>> >> >after the mapper?
>>> >> >
>>> >> >On Wed, Mar 7, 2012 at 12:11 AM, Luke Forehand
>>> >> ><[email protected]> wrote:
>>> >> >> I tried the following and it does not work:
>>> >> >>
>>> >> >> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors \
>>> >> >>   -c /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 \
>>> >> >>   -k 10000 -cd 0.01 -x 100 \
>>> >> >>   -Dmapreduce.map.output.compress=false
>>> >> >>
>>> >> >> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors \
>>> >> >>   -c /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 \
>>> >> >>   -k 10000 -cd 0.01 -x 100 \
>>> >> >>   -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
>>> >> >>
>>> >> >>
>>> >> >> I am still getting the default codec (which is Snappy in this case;
>>> >> >> I don't want the users to have to install native Snappy, which is
>>> >> >> why I'm trying to override this param).  Passing -Dkey=value on the
>>> >> >> mahout command line does not seem to have any effect on the mapreduce
>>> >> >> job configuration from what I can tell.  Any ideas?
>>> >> >>
>>> >> >> -Luke
>>> >> >>
>>> >> >> On 3/6/12 3:48 PM, "Sean Owen" <[email protected]> wrote:
>>> >> >>
>>> >> >>>Mapper compression? -Dmapreduce.map.output.compress=false. I think the
>>> >> >>>key was mapred.compress.map.output in Hadoop 0.20.0.
>>> >> >>>I am not sure if there is reducer compression built-in, but I could
>>> >> >>>have missed it.
>>> >> >>>
>>> >> >>>On Tue, Mar 6, 2012 at 9:40 PM, Luke Forehand
>>> >> >>><[email protected]> wrote:
>>> >> >>>> Hello,
>>> >> >>>>
>>> >> >>>> Is there a way to run the mahout kmeans program from the command line,
>>> >> >>>> with a parameter that will override (and disable) the reducer task
>>> >> >>>> compression?  I have tried several different ways of specifying the -D
>>> >> >>>> parameter but I can't seem to get any options to pass through to the
>>> >> >>>> hadoop mapreduce configuration.
>>> >> >>>>
>>> >> >>>> Thanks!
>>> >> >>>> Luke
>>> >> >>
>>> >>
>>> >>
>>>
>>>
>
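
One more thing that may be worth trying for the commands quoted above, sketched here under an assumption: if the kmeans driver is launched through Hadoop's ToolRunner/GenericOptionsParser, generic options such as -Dkey=value are normally expected before the job-specific options, not after them. kmeans_cmd below is a hypothetical helper that only prints the reordered command and runs nothing. mapred.output.compress is the 0.20-era key for job output compression.

```shell
#!/bin/sh
# Sketch: print a mahout kmeans command with the generic -D options placed
# directly after the program name, ahead of the job-specific options.
# kmeans_cmd is a hypothetical helper; it builds a string and runs nothing.
kmeans_cmd() {
    cmd="mahout kmeans"
    for kv in "$@"; do
        cmd="$cmd -D$kv"    # each argument is a key=value pair
    done
    echo "$cmd -i /mahout/sparse/test1/tfidf-vectors -c /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 -k 10000 -cd 0.01 -x 100"
}

# Ask for uncompressed job output.
kmeans_cmd mapred.output.compress=false
```

To keep compression but switch to a codec users can read without native libraries, the same helper could be called with mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec. Whether either setting takes effect still depends on the cluster not marking these properties final.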
