Hi all,
I am new to Mahout. I am running a basic k-means clustering job on a
sample text file. No matter what value I pass to the --numClusters
parameter, I always get only one cluster (VL-0). Can someone please point
out my mistake and suggest what I should do to get a reasonable number of
clusters?
I successfully converted the text file into a sequence file and then into
vectorized format.
The command I use is the following:
$MAHOUT_HOME/bin/mahout kmeans \
  --input /input/mahout/vectorized/tfidf-vectors \
  --output $HDFS_OUTPUT_DIR/clusters \
  --clusters $HDFS_OUTPUT_DIR/initialclusters \
  --maxIter 10 \
  --numClusters 20 \
  --clustering \
  --overwrite
Here is the console output:
=====================
pd@PeriyaData:~/Mahout/examples/bin$ ./bigdata_kmeans.sh
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
MAHOUT-JOB:
/home/pd/CDH3/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/12/30 15:59:23 INFO common.AbstractJob: Command line arguments:
{--clustering=null, --clusters=/output/mahout/kmeans/initialclusters,
--convergenceDelta=0.5,
--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
--endPhase=2147483647, --input=/input/mahout/vectorized/tfidf-vectors,
--maxIter=10, --method=mapreduce, --numClusters=20,
--output=/output/mahout/kmeans/clusters, --overwrite=null, --startPhase=0,
--tempDir=temp}
11/12/30 15:59:23 INFO common.HadoopUtil: Deleting
/output/mahout/kmeans/clusters
11/12/30 15:59:24 INFO common.HadoopUtil: Deleting
/output/mahout/kmeans/initialclusters
11/12/30 15:59:25 INFO util.NativeCodeLoader: Loaded the native-hadoop
library
11/12/30 15:59:25 INFO zlib.ZlibFactory: Successfully loaded & initialized
native-zlib library
11/12/30 15:59:25 INFO compress.CodecPool: Got brand-new compressor
11/12/30 15:59:25 INFO kmeans.RandomSeedGenerator: Wrote 20 vectors to
/output/mahout/kmeans/initialclusters/part-randomSeed
11/12/30 15:59:25 INFO kmeans.KMeansDriver: Input:
/input/mahout/vectorized/tfidf-vectors Clusters In:
/output/mahout/kmeans/initialclusters/part-randomSeed Out:
/output/mahout/kmeans/clusters Distance:
org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
11/12/30 15:59:25 INFO kmeans.KMeansDriver: convergence: 0.5 max
Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable
Input Vectors: {}
11/12/30 15:59:25 INFO kmeans.KMeansDriver: K-Means Iteration 1
11/12/30 15:59:25 INFO input.FileInputFormat: Total input paths to process
: 1
11/12/30 15:59:26 INFO mapred.JobClient: Running job: job_201112301129_0029
11/12/30 15:59:27 INFO mapred.JobClient: map 0% reduce 0%
11/12/30 15:59:30 INFO mapred.JobClient: map 100% reduce 0%
11/12/30 15:59:39 INFO mapred.JobClient: map 100% reduce 100%
11/12/30 15:59:39 INFO mapred.JobClient: Job complete: job_201112301129_0029
11/12/30 15:59:39 INFO mapred.JobClient: Counters: 23
11/12/30 15:59:39 INFO mapred.JobClient: Job Counters
11/12/30 15:59:39 INFO mapred.JobClient: Launched reduce tasks=1
11/12/30 15:59:39 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=3074
11/12/30 15:59:39 INFO mapred.JobClient: Total time spent by all
reduces waiting after reserving slots (ms)=0
11/12/30 15:59:39 INFO mapred.JobClient: Total time spent by all maps
waiting after reserving slots (ms)=0
11/12/30 15:59:39 INFO mapred.JobClient: Launched map tasks=1
11/12/30 15:59:39 INFO mapred.JobClient: Data-local map tasks=1
11/12/30 15:59:39 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8299
11/12/30 15:59:39 INFO mapred.JobClient: Clustering
11/12/30 15:59:39 INFO mapred.JobClient: Converged Clusters=1
11/12/30 15:59:39 INFO mapred.JobClient: FileSystemCounters
11/12/30 15:59:39 INFO mapred.JobClient: FILE_BYTES_READ=185593
11/12/30 15:59:39 INFO mapred.JobClient: HDFS_BYTES_READ=139801
11/12/30 15:59:39 INFO mapred.JobClient: FILE_BYTES_WRITTEN=477505
11/12/30 15:59:39 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=92991
11/12/30 15:59:39 INFO mapred.JobClient: Map-Reduce Framework
11/12/30 15:59:39 INFO mapred.JobClient: Reduce input groups=1
11/12/30 15:59:39 INFO mapred.JobClient: Combine output records=1
11/12/30 15:59:39 INFO mapred.JobClient: Map input records=1
11/12/30 15:59:39 INFO mapred.JobClient: Reduce shuffle bytes=0
11/12/30 15:59:39 INFO mapred.JobClient: Reduce output records=1
11/12/30 15:59:39 INFO mapred.JobClient: Spilled Records=2
11/12/30 15:59:39 INFO mapred.JobClient: Map output bytes=185582
11/12/30 15:59:39 INFO mapred.JobClient: Combine input records=1
11/12/30 15:59:39 INFO mapred.JobClient: Map output records=1
11/12/30 15:59:39 INFO mapred.JobClient: SPLIT_RAW_BYTES=137
11/12/30 15:59:39 INFO mapred.JobClient: Reduce input records=1
11/12/30 15:59:39 INFO kmeans.KMeansDriver: Clustering data
11/12/30 15:59:39 INFO kmeans.KMeansDriver: Running Clustering
11/12/30 15:59:39 INFO kmeans.KMeansDriver: Input:
/input/mahout/vectorized/tfidf-vectors Clusters In:
/output/mahout/kmeans/clusters/clusters-1-final Out:
/output/mahout/kmeans/clusters/clusteredPoints Distance:
org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure@14e4e31
11/12/30 15:59:39 INFO kmeans.KMeansDriver: convergence: 0.5 Input Vectors:
org.apache.mahout.math.VectorWritable
11/12/30 15:59:40 INFO input.FileInputFormat: Total input paths to process
: 1
11/12/30 15:59:40 INFO mapred.JobClient: Running job: job_201112301129_0030
11/12/30 15:59:41 INFO mapred.JobClient: map 0% reduce 0%
11/12/30 15:59:45 INFO mapred.JobClient: map 100% reduce 0%
11/12/30 15:59:45 INFO mapred.JobClient: Job complete: job_201112301129_0030
11/12/30 15:59:45 INFO mapred.JobClient: Counters: 13
11/12/30 15:59:45 INFO mapred.JobClient: Job Counters
11/12/30 15:59:45 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=3815
11/12/30 15:59:45 INFO mapred.JobClient: Total time spent by all
reduces waiting after reserving slots (ms)=0
11/12/30 15:59:45 INFO mapred.JobClient: Total time spent by all maps
waiting after reserving slots (ms)=0
11/12/30 15:59:45 INFO mapred.JobClient: Launched map tasks=1
11/12/30 15:59:45 INFO mapred.JobClient: Data-local map tasks=1
11/12/30 15:59:45 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
11/12/30 15:59:45 INFO mapred.JobClient: FileSystemCounters
11/12/30 15:59:45 INFO mapred.JobClient: HDFS_BYTES_READ=186054
11/12/30 15:59:45 INFO mapred.JobClient: FILE_BYTES_WRITTEN=52059
11/12/30 15:59:45 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=92956
11/12/30 15:59:45 INFO mapred.JobClient: Map-Reduce Framework
11/12/30 15:59:45 INFO mapred.JobClient: Map input records=1
11/12/30 15:59:45 INFO mapred.JobClient: Spilled Records=0
11/12/30 15:59:45 INFO mapred.JobClient: Map output records=1
11/12/30 15:59:45 INFO mapred.JobClient: SPLIT_RAW_BYTES=137
11/12/30 15:59:45 INFO driver.MahoutDriver: Program took 21888 ms (Minutes:
0.3648)
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
MAHOUT-JOB:
/home/pd/CDH3/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/12/30 15:59:48 INFO common.AbstractJob: Command line arguments:
{--dictionary=/input/mahout/vectorized/dictionary.file-0,
--dictionaryType=sequencefile,
--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
--endPhase=2147483647, --numWords=30,
--output=/home/pd/Mahout/examples/output/clusteranalyze.txt,
--outputFormat=TEXT,
--pointsDir=/output/mahout/kmeans/clusters/clusteredPoints,
--seqFileDir=/output/mahout/kmeans/clusters/clusters-*-final,
--startPhase=0, --tempDir=temp}
*11/12/30 15:59:49 INFO clustering.ClusterDumper: Wrote 1 clusters*
11/12/30 15:59:49 INFO driver.MahoutDriver: Program took 1171 ms (Minutes:
0.01951666666666667)
pd@PeriyaData:~/Mahout/examples/bin$
pd@PeriyaData:~/Mahout/examples/bin$ hadoop fs -ls /output/mahout/kmeans/clusters
Found 2 items
drwxr-xr-x - pd supergroup 0 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusteredPoints
drwxr-xr-x - pd supergroup 0 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusters-1-final
pd@PeriyaData:~/Mahout/rabi/examples/bin$ hadoop fs -ls /output/mahout/kmeans/clusters/clusters-1-final
Found 3 items
-rw-r--r-- 1 pd supergroup 0 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusters-1-final/_SUCCESS
drwxr-xr-x - pd supergroup 0 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusters-1-final/_logs
-rw-r--r-- 1 pd supergroup 92991 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusters-1-final/part-r-00000
pd@PeriyaData:~/Mahout/examples/bin$
Thanks,
PD