There can be two reasons for only one cluster being found. 1) The vectors are very close to each other, so the clusters converge into one. 2) The distance measure you are using is not appropriate for your vector values.
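To check the first reason, it can help to look at how far apart the vectors actually are. Below is a minimal sketch in plain Python, not Mahout code: the function names and the toy vectors are hypothetical, and it assumes the TF-IDF vectors have been exported as {term index: weight} dictionaries. It also reports when there are fewer than two vectors to compare, which may be worth ruling out here since the counters in the quoted log show Map input records=1.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two sparse vectors stored as {index: weight} dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def pairwise_summary(vectors):
    """Return (pair count, min, mean, max) over all pairwise Euclidean distances."""
    dists = [euclidean(a, b) for a, b in combinations(vectors, 2)]
    if not dists:
        # Fewer than two vectors: there is nothing to split into clusters.
        return 0, None, None, None
    return len(dists), min(dists), sum(dists) / len(dists), max(dists)


# Hypothetical toy vectors, chosen only so the distances are easy to verify by hand.
docs = [
    {0: 3.0},
    {1: 4.0},
    {0: 3.0, 1: 4.0},
]

count, lowest, mean, highest = pairwise_summary(docs)
print(f"pairs={count} min={lowest} mean={mean} max={highest}")
```

If the minimum and maximum distances are nearly equal, or all distances are tiny, the vectors are unlikely to separate into many clusters whatever value is passed to --numClusters. The same spread is also a rough starting point for choosing t1 and t2.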
Try the following:
1) Analyze the vectors and the distances between them. Are they good candidates to be in different clusters?
2) Use CanopyClustering first to guess the number of clusters (experiment a bit by changing the values of t1 and t2).
3) Then provide the clusters returned by CanopyClustering to KMeans.
4) Use EuclideanDistance instead of Squared...

Paritosh

________________________________________
From: Periya.Data [[email protected]]
Sent: Saturday, December 31, 2011 1:07 AM
To: [email protected]
Subject: number of clusters

Hi all,

I am a newbie to Mahout. I am running a basic k-means clustering on a sample txt file. No matter what number I give to the --numClusters parameter, I always get only one cluster (VL-0). Can someone please point out any mistake and suggest what I should do to see a decent number of clusters?

I successfully convert the txt file into a seq-file and then to vectorized format. The command I use is the following:

$MAHOUT_HOME/bin/mahout kmeans --input /input/mahout/vectorized/tfidf-vectors \
    --output $HDFS_OUTPUT_DIR/clusters \
    --clusters $HDFS_OUTPUT_DIR/initialclusters \
    --maxIter 10 \
    --numClusters 20 \
    --clustering \
    --overwrite

Here is the console output:
=====================
pd@PeriyaData:~/Mahout/examples/bin$ ./bigdata_kmeans.sh
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
MAHOUT-JOB: /home/pd/CDH3/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/12/30 15:59:23 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=/output/mahout/kmeans/initialclusters, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/input/mahout/vectorized/tfidf-vectors, --maxIter=10, --method=mapreduce, --numClusters=20, --output=/output/mahout/kmeans/clusters, --overwrite=null, --startPhase=0, --tempDir=temp}
11/12/30 15:59:23 INFO common.HadoopUtil: Deleting /output/mahout/kmeans/clusters
11/12/30 15:59:24 INFO common.HadoopUtil: Deleting /output/mahout/kmeans/initialclusters
11/12/30 15:59:25 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/12/30 15:59:25 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
11/12/30 15:59:25 INFO compress.CodecPool: Got brand-new compressor
11/12/30 15:59:25 INFO kmeans.RandomSeedGenerator: Wrote 20 vectors to /output/mahout/kmeans/initialclusters/part-randomSeed
11/12/30 15:59:25 INFO kmeans.KMeansDriver: Input: /input/mahout/vectorized/tfidf-vectors Clusters In: /output/mahout/kmeans/initialclusters/part-randomSeed Out: /output/mahout/kmeans/clusters Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
11/12/30 15:59:25 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
11/12/30 15:59:25 INFO kmeans.KMeansDriver: K-Means Iteration 1
11/12/30 15:59:25 INFO input.FileInputFormat: Total input paths to process : 1
11/12/30 15:59:26 INFO mapred.JobClient: Running job: job_201112301129_0029
11/12/30 15:59:27 INFO mapred.JobClient: map 0% reduce 0%
11/12/30 15:59:30 INFO mapred.JobClient: map 100% reduce 0%
11/12/30 15:59:39 INFO mapred.JobClient: map 100% reduce 100%
11/12/30 15:59:39 INFO mapred.JobClient: Job complete: job_201112301129_0029
11/12/30 15:59:39 INFO mapred.JobClient: Counters: 23
11/12/30 15:59:39 INFO mapred.JobClient: Job Counters
11/12/30 15:59:39 INFO mapred.JobClient: Launched reduce tasks=1
11/12/30 15:59:39 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=3074
11/12/30 15:59:39 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/12/30 15:59:39 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/12/30 15:59:39 INFO mapred.JobClient: Launched map tasks=1
11/12/30 15:59:39 INFO mapred.JobClient: Data-local map tasks=1
11/12/30 15:59:39 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8299
11/12/30 15:59:39 INFO mapred.JobClient: Clustering
11/12/30 15:59:39 INFO mapred.JobClient: Converged Clusters=1
11/12/30 15:59:39 INFO mapred.JobClient: FileSystemCounters
11/12/30 15:59:39 INFO mapred.JobClient: FILE_BYTES_READ=185593
11/12/30 15:59:39 INFO mapred.JobClient: HDFS_BYTES_READ=139801
11/12/30 15:59:39 INFO mapred.JobClient: FILE_BYTES_WRITTEN=477505
11/12/30 15:59:39 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=92991
11/12/30 15:59:39 INFO mapred.JobClient: Map-Reduce Framework
11/12/30 15:59:39 INFO mapred.JobClient: Reduce input groups=1
11/12/30 15:59:39 INFO mapred.JobClient: Combine output records=1
11/12/30 15:59:39 INFO mapred.JobClient: Map input records=1
11/12/30 15:59:39 INFO mapred.JobClient: Reduce shuffle bytes=0
11/12/30 15:59:39 INFO mapred.JobClient: Reduce output records=1
11/12/30 15:59:39 INFO mapred.JobClient: Spilled Records=2
11/12/30 15:59:39 INFO mapred.JobClient: Map output bytes=185582
11/12/30 15:59:39 INFO mapred.JobClient: Combine input records=1
11/12/30 15:59:39 INFO mapred.JobClient: Map output records=1
11/12/30 15:59:39 INFO mapred.JobClient: SPLIT_RAW_BYTES=137
11/12/30 15:59:39 INFO mapred.JobClient: Reduce input records=1
11/12/30 15:59:39 INFO kmeans.KMeansDriver: Clustering data
11/12/30 15:59:39 INFO kmeans.KMeansDriver: Running Clustering
11/12/30 15:59:39 INFO kmeans.KMeansDriver: Input: /input/mahout/vectorized/tfidf-vectors Clusters In: /output/mahout/kmeans/clusters/clusters-1-final Out: /output/mahout/kmeans/clusters/clusteredPoints Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure@14e4e31
11/12/30 15:59:39 INFO kmeans.KMeansDriver: convergence: 0.5 Input Vectors: org.apache.mahout.math.VectorWritable
11/12/30 15:59:40 INFO input.FileInputFormat: Total input paths to process : 1
11/12/30 15:59:40 INFO mapred.JobClient: Running job: job_201112301129_0030
11/12/30 15:59:41 INFO mapred.JobClient: map 0% reduce 0%
11/12/30 15:59:45 INFO mapred.JobClient: map 100% reduce 0%
11/12/30 15:59:45 INFO mapred.JobClient: Job complete: job_201112301129_0030
11/12/30 15:59:45 INFO mapred.JobClient: Counters: 13
11/12/30 15:59:45 INFO mapred.JobClient: Job Counters
11/12/30 15:59:45 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=3815
11/12/30 15:59:45 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/12/30 15:59:45 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/12/30 15:59:45 INFO mapred.JobClient: Launched map tasks=1
11/12/30 15:59:45 INFO mapred.JobClient: Data-local map tasks=1
11/12/30 15:59:45 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
11/12/30 15:59:45 INFO mapred.JobClient: FileSystemCounters
11/12/30 15:59:45 INFO mapred.JobClient: HDFS_BYTES_READ=186054
11/12/30 15:59:45 INFO mapred.JobClient: FILE_BYTES_WRITTEN=52059
11/12/30 15:59:45 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=92956
11/12/30 15:59:45 INFO mapred.JobClient: Map-Reduce Framework
11/12/30 15:59:45 INFO mapred.JobClient: Map input records=1
11/12/30 15:59:45 INFO mapred.JobClient: Spilled Records=0
11/12/30 15:59:45 INFO mapred.JobClient: Map output records=1
11/12/30 15:59:45 INFO mapred.JobClient: SPLIT_RAW_BYTES=137
11/12/30 15:59:45 INFO driver.MahoutDriver: Program took 21888 ms (Minutes: 0.3648)
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
MAHOUT-JOB: /home/pd/CDH3/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/12/30 15:59:48 INFO common.AbstractJob: Command line arguments: {--dictionary=/input/mahout/vectorized/dictionary.file-0, --dictionaryType=sequencefile, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --numWords=30, --output=/home/pd/Mahout/examples/output/clusteranalyze.txt, --outputFormat=TEXT, --pointsDir=/output/mahout/kmeans/clusters/clusteredPoints, --seqFileDir=/output/mahout/kmeans/clusters/clusters-*-final, --startPhase=0, --tempDir=temp}
*11/12/30 15:59:49 INFO clustering.ClusterDumper: Wrote 1 clusters*
11/12/30 15:59:49 INFO driver.MahoutDriver: Program took 1171 ms (Minutes: 0.01951666666666667)
pd@PeriyaData:~/Mahout/examples/bin$
pd@PeriyaData:~/Mahout/examples/bin$ hadoop fs -ls /output/mahout/kmeans/clusters
Found 2 items
drwxr-xr-x - pd supergroup 0 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusteredPoints
drwxr-xr-x - pd supergroup 0 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusters-1-final
pd@PeriyaData:~/Mahout/rabi/examples/bin$ hadoop fs -ls /output/mahout/kmeans/clusters/clusters-1-final
Found 3 items
-rw-r--r-- 1 pd supergroup 0 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusters-1-final/_SUCCESS
drwxr-xr-x - pd supergroup 0 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusters-1-final/_logs
-rw-r--r-- 1 pd supergroup 92991 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusters-1-final/part-r-00000
pd@PeriyaData:~/Mahout/examples/bin$

Thanks,
PD
