Re: Mahout Kmeans

Paritosh Ranjan Thu, 13 Sep 2012 23:31:15 -0700

The general convention is that if the MAHOUT_LOCAL environment variable is set, Mahout runs 'pseudo-distributed' rather than against a cluster.
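
As a rough sketch of that convention (not the actual contents of the mahout script): the function name mahout_mode is made up for illustration, and treating an empty MAHOUT_LOCAL the same as an unset one is an assumption.

```shell
#!/bin/sh
# Hypothetical helper, not part of bin/mahout: reports which mode the
# MAHOUT_LOCAL convention would select.
mahout_mode() {
  # Assumption: an empty value counts the same as "not set".
  if [ -n "${MAHOUT_LOCAL:-}" ]; then
    echo "local"
  else
    echo "hadoop"
  fi
}

echo "mode: $(mahout_mode)"
```

Exporting MAHOUT_LOCAL with any non-empty value before calling the function makes it report "local"; unsetting the variable makes it report "hadoop".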


On 14-09-2012 05:11, Gustavo Enrique Salazar Torres wrote:

Hi Paritosh:


I made it work in Hadoop mode, not local mode. I don't know if that's desirable.
I also got this error: the Hadoop libraries are missing when running in local
mode and, from what I saw in the mahout script, it simply discards all
libraries when MAHOUT_LOCAL is set.
So, is the local mode used for anything? (please forgive my ignorance, I
don't know the whole project)

Gustavo

On Sat, Sep 8, 2012 at 2:35 AM, Paritosh Ranjan <[email protected]> wrote:

Can you open up a jira describing the problem and submitting the patch for
your fix?
https://issues.apache.org/jira/browse/MAHOUT


On 08-09-2012 09:40, Gustavo Enrique Salazar Torres wrote:

Never mind, I got it to work. I had to fix the script, though.

Thanks.
Gustavo

On Fri, Sep 7, 2012 at 5:54 PM, Gustavo Enrique Salazar Torres <
[email protected]> wrote:

  Hi there:

I'm trying to finish an improvement to the Kmeans algorithm, but I first
need to get it running in order to compare results.
When I run the cluster-reuters.sh script, I get this error:

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /home/gustavo/Desktop/yandex_data/hadoop-0.20.203.0/bin/hadoop and HADOOP_CONF_DIR=/home/gustavo/Desktop/yandex_data/hadoop-0.20.203.0/conf
MAHOUT-JOB: /home/gustavo/Desktop/yandex_data/mahout-distribution-0.7/mahout-examples-0.7-job.jar
12/09/07 17:47:43 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=[./reuters-kmeans-clusters], --convergenceDelta=[0.5], --distanceMeasure=[org.apache.mahout.common.distance.CosineDistanceMeasure], --endPhase=[2147483647], --input=[./reuters_out_seqdir_kmeans/tfidf-vectors], --maxIter=[10], --method=[mapreduce], --numClusters=[20], --output=[./reuters-kmeans], --overwrite=null, --startPhase=[0], --tempDir=[temp]}
12/09/07 17:47:44 INFO common.HadoopUtil: Deleting reuters-kmeans-clusters
12/09/07 17:47:44 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/09/07 17:47:44 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/09/07 17:47:44 INFO compress.CodecPool: Got brand-new compressor
12/09/07 17:47:44 INFO kmeans.RandomSeedGenerator: Wrote 20 Klusters to reuters-kmeans-clusters/part-randomSeed
12/09/07 17:47:44 INFO kmeans.KMeansDriver: Input: reuters_out_seqdir_kmeans/tfidf-vectors Clusters In: reuters-kmeans-clusters/part-randomSeed Out: reuters-kmeans Distance: org.apache.mahout.common.distance.CosineDistanceMeasure
12/09/07 17:47:44 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
12/09/07 17:47:44 INFO compress.CodecPool: Got brand-new decompressor
Exception in thread "main" java.lang.IllegalStateException: No input clusters found in reuters-kmeans-clusters/part-randomSeed. Check your -c argument.
        at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:218)

As you can see, the initial clusters are being created but, for a reason I
don't understand, they are not being found.
Below is the output of the 'cat' command on the part file containing the clusters:

$ dfs -cat reuters-kmeans-clusters/part-randomSeed
SEQ
org.apache.hadoop.io.Text5org.apache.mahout.clustering.iterator.ClusterWritable
*org.apache.hadoop.io.compress.DefaultCodec b�W3 K�E�߇H��Vgustavo

Can anyone help me please?

Thanks
Gustavo Salazar
