number of clusters

Periya.Data Fri, 30 Dec 2011 16:08:06 -0800

Hi all,
    I am new to Mahout. I am running a basic k-means clustering job on a
sample txt file. No matter what value I pass to the --numClusters
parameter, I always get exactly one cluster (VL-0). Can someone please point
out my mistake and suggest what I should do to get a reasonable number of
clusters?


I successfully converted the txt file into a sequence file and then into
vectorized format.

The command I use is the following:

$MAHOUT_HOME/bin/mahout kmeans \
    --input       /input/mahout/vectorized/tfidf-vectors \
    --output      $HDFS_OUTPUT_DIR/clusters \
    --clusters    $HDFS_OUTPUT_DIR/initialclusters \
    --maxIter     10 \
    --numClusters 20 \
    --clustering \
    --overwrite


Here is the console output:
=====================

pd@PeriyaData:~/Mahout/examples/bin$ ./bigdata_kmeans.sh
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
MAHOUT-JOB:
/home/pd/CDH3/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/12/30 15:59:23 INFO common.AbstractJob: Command line arguments:
{--clustering=null, --clusters=/output/mahout/kmeans/initialclusters,
--convergenceDelta=0.5,
--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
--endPhase=2147483647, --input=/input/mahout/vectorized/tfidf-vectors,
--maxIter=10, --method=mapreduce, --numClusters=20,
--output=/output/mahout/kmeans/clusters, --overwrite=null, --startPhase=0,
--tempDir=temp}
11/12/30 15:59:23 INFO common.HadoopUtil: Deleting
/output/mahout/kmeans/clusters
11/12/30 15:59:24 INFO common.HadoopUtil: Deleting
/output/mahout/kmeans/initialclusters
11/12/30 15:59:25 INFO util.NativeCodeLoader: Loaded the native-hadoop
library
11/12/30 15:59:25 INFO zlib.ZlibFactory: Successfully loaded & initialized
native-zlib library
11/12/30 15:59:25 INFO compress.CodecPool: Got brand-new compressor
11/12/30 15:59:25 INFO kmeans.RandomSeedGenerator: Wrote 20 vectors to
/output/mahout/kmeans/initialclusters/part-randomSeed
11/12/30 15:59:25 INFO kmeans.KMeansDriver: Input:
/input/mahout/vectorized/tfidf-vectors Clusters In:
/output/mahout/kmeans/initialclusters/part-randomSeed Out:
/output/mahout/kmeans/clusters Distance:
org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
11/12/30 15:59:25 INFO kmeans.KMeansDriver: convergence: 0.5 max
Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable
Input Vectors: {}
11/12/30 15:59:25 INFO kmeans.KMeansDriver: K-Means Iteration 1
11/12/30 15:59:25 INFO input.FileInputFormat: Total input paths to process
: 1
11/12/30 15:59:26 INFO mapred.JobClient: Running job: job_201112301129_0029
11/12/30 15:59:27 INFO mapred.JobClient:  map 0% reduce 0%
11/12/30 15:59:30 INFO mapred.JobClient:  map 100% reduce 0%
11/12/30 15:59:39 INFO mapred.JobClient:  map 100% reduce 100%
11/12/30 15:59:39 INFO mapred.JobClient: Job complete: job_201112301129_0029
11/12/30 15:59:39 INFO mapred.JobClient: Counters: 23
11/12/30 15:59:39 INFO mapred.JobClient:   Job Counters
11/12/30 15:59:39 INFO mapred.JobClient:     Launched reduce tasks=1
11/12/30 15:59:39 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3074
11/12/30 15:59:39 INFO mapred.JobClient:     Total time spent by all
reduces waiting after reserving slots (ms)=0
11/12/30 15:59:39 INFO mapred.JobClient:     Total time spent by all maps
waiting after reserving slots (ms)=0
11/12/30 15:59:39 INFO mapred.JobClient:     Launched map tasks=1
11/12/30 15:59:39 INFO mapred.JobClient:     Data-local map tasks=1
11/12/30 15:59:39 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8299
11/12/30 15:59:39 INFO mapred.JobClient:   Clustering
11/12/30 15:59:39 INFO mapred.JobClient:     Converged Clusters=1
11/12/30 15:59:39 INFO mapred.JobClient:   FileSystemCounters
11/12/30 15:59:39 INFO mapred.JobClient:     FILE_BYTES_READ=185593
11/12/30 15:59:39 INFO mapred.JobClient:     HDFS_BYTES_READ=139801
11/12/30 15:59:39 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=477505
11/12/30 15:59:39 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=92991
11/12/30 15:59:39 INFO mapred.JobClient:   Map-Reduce Framework
11/12/30 15:59:39 INFO mapred.JobClient:     Reduce input groups=1
11/12/30 15:59:39 INFO mapred.JobClient:     Combine output records=1
11/12/30 15:59:39 INFO mapred.JobClient:     Map input records=1
11/12/30 15:59:39 INFO mapred.JobClient:     Reduce shuffle bytes=0
11/12/30 15:59:39 INFO mapred.JobClient:     Reduce output records=1
11/12/30 15:59:39 INFO mapred.JobClient:     Spilled Records=2
11/12/30 15:59:39 INFO mapred.JobClient:     Map output bytes=185582
11/12/30 15:59:39 INFO mapred.JobClient:     Combine input records=1
11/12/30 15:59:39 INFO mapred.JobClient:     Map output records=1
11/12/30 15:59:39 INFO mapred.JobClient:     SPLIT_RAW_BYTES=137
11/12/30 15:59:39 INFO mapred.JobClient:     Reduce input records=1
11/12/30 15:59:39 INFO kmeans.KMeansDriver: Clustering data
11/12/30 15:59:39 INFO kmeans.KMeansDriver: Running Clustering
11/12/30 15:59:39 INFO kmeans.KMeansDriver: Input:
/input/mahout/vectorized/tfidf-vectors Clusters In:
/output/mahout/kmeans/clusters/clusters-1-final Out:
/output/mahout/kmeans/clusters/clusteredPoints Distance:
org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure@14e4e31
11/12/30 15:59:39 INFO kmeans.KMeansDriver: convergence: 0.5 Input Vectors:
org.apache.mahout.math.VectorWritable
11/12/30 15:59:40 INFO input.FileInputFormat: Total input paths to process
: 1
11/12/30 15:59:40 INFO mapred.JobClient: Running job: job_201112301129_0030
11/12/30 15:59:41 INFO mapred.JobClient:  map 0% reduce 0%
11/12/30 15:59:45 INFO mapred.JobClient:  map 100% reduce 0%
11/12/30 15:59:45 INFO mapred.JobClient: Job complete: job_201112301129_0030
11/12/30 15:59:45 INFO mapred.JobClient: Counters: 13
11/12/30 15:59:45 INFO mapred.JobClient:   Job Counters
11/12/30 15:59:45 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3815
11/12/30 15:59:45 INFO mapred.JobClient:     Total time spent by all
reduces waiting after reserving slots (ms)=0
11/12/30 15:59:45 INFO mapred.JobClient:     Total time spent by all maps
waiting after reserving slots (ms)=0
11/12/30 15:59:45 INFO mapred.JobClient:     Launched map tasks=1
11/12/30 15:59:45 INFO mapred.JobClient:     Data-local map tasks=1
11/12/30 15:59:45 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
11/12/30 15:59:45 INFO mapred.JobClient:   FileSystemCounters
11/12/30 15:59:45 INFO mapred.JobClient:     HDFS_BYTES_READ=186054
11/12/30 15:59:45 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=52059
11/12/30 15:59:45 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=92956
11/12/30 15:59:45 INFO mapred.JobClient:   Map-Reduce Framework
11/12/30 15:59:45 INFO mapred.JobClient:     Map input records=1
11/12/30 15:59:45 INFO mapred.JobClient:     Spilled Records=0
11/12/30 15:59:45 INFO mapred.JobClient:     Map output records=1
11/12/30 15:59:45 INFO mapred.JobClient:     SPLIT_RAW_BYTES=137
11/12/30 15:59:45 INFO driver.MahoutDriver: Program took 21888 ms (Minutes:
0.3648)
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
MAHOUT-JOB:
/home/pd/CDH3/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/12/30 15:59:48 INFO common.AbstractJob: Command line arguments:
{--dictionary=/input/mahout/vectorized/dictionary.file-0,
--dictionaryType=sequencefile,
--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
--endPhase=2147483647, --numWords=30,
--output=/home/pd/Mahout/examples/output/clusteranalyze.txt,
--outputFormat=TEXT,
--pointsDir=/output/mahout/kmeans/clusters/clusteredPoints,
--seqFileDir=/output/mahout/kmeans/clusters/clusters-*-final,
--startPhase=0, --tempDir=temp}
*11/12/30 15:59:49 INFO clustering.ClusterDumper: Wrote 1 clusters*
11/12/30 15:59:49 INFO driver.MahoutDriver: Program took 1171 ms (Minutes:
0.01951666666666667)
pd@PeriyaData:~/Mahout/examples/bin$


pd@PeriyaData:~/Mahout/examples/bin$ hadoop fs -ls
/output/mahout/kmeans/clusters
Found 2 items
drwxr-xr-x   - pd supergroup          0 2011-12-30 15:59
/output/mahout/kmeans/clusters/clusteredPoints
drwxr-xr-x   - pd supergroup          0 2011-12-30 15:59
/output/mahout/kmeans/clusters/clusters-1-final
pd@PeriyaData:~/Mahout/rabi/examples/bin$ hadoop fs -ls
/output/mahout/kmeans/clusters/clusters-1-final
Found 3 items
-rw-r--r--   1 pd supergroup          0 2011-12-30 15:59
/output/mahout/kmeans/clusters/clusters-1-final/_SUCCESS
drwxr-xr-x   - pd supergroup          0 2011-12-30 15:59
/output/mahout/kmeans/clusters/clusters-1-final/_logs
-rw-r--r--   1 pd supergroup      92991 2011-12-30 15:59
/output/mahout/kmeans/clusters/clusters-1-final/part-r-00000
pd@PeriyaData:~/Mahout/examples/bin$
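
One detail in the console output above that may be worth checking is the
"Map input records" counter, which reads 1 for both jobs. If that counter
reflects the number of input vectors, k-means would have only a single point
to work with regardless of --numClusters. Below is a minimal sketch for
pulling that counter out of saved console output. The helper name
map_input_records is hypothetical and not part of Mahout or Hadoop; it only
assumes the log line format shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Mahout or Hadoop): read job console
# output on stdin and print the value of every "Map input records"
# counter, one value per line.
map_input_records() {
  sed -n 's/.*Map input records=\([0-9][0-9]*\).*/\1/p'
}

# Two counter lines copied from the console output above.
printf '%s\n' \
  '11/12/30 15:59:39 INFO mapred.JobClient:     Map input records=1' \
  '11/12/30 15:59:45 INFO mapred.JobClient:     Map input records=1' \
  | map_input_records
```

The same function could be fed the full output of the script, for example by
piping ./bigdata_kmeans.sh 2>&1 into it.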


Thanks,
PD
