Re: number of clusters (Canopy Clustering)

Jeff Eastman Sun, 08 Jan 2012 07:48:40 -0800

I'm almost certain there is no current way to do this from the command line. You could write a small utility to do this (see CanopyClusterer.buildClustersSeq() for a simple skeleton you could use). But I would suggest trying CosineDistanceMeasure instead of Euclidean for text. If you have a small number of input files you could run the -xm sequential mode in the debugger and breakpoint or just add some printouts to CanopyClusterer.addPointToCanopies(...).
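
To illustrate why a cosine measure may suit text better than a Euclidean one, here is a minimal sketch in plain Python. It does not use Mahout; the function names, the sparse dict representation, and the three toy "documents" are assumptions made up for this example, and the zero-vector handling is a choice of this sketch rather than a statement about CosineDistanceMeasure.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two sparse vectors stored as {term: weight}."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))

def cosine_distance(a, b):
    """1 - cosine similarity; depends on direction only, not on vector length."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption of this sketch: an empty vector is maximally distant
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical term weights: long_doc has the same term proportions as
# short_doc but is ten times "longer"; other_doc shares no terms at all.
short_doc = {"hadoop": 1.0, "cluster": 2.0}
long_doc = {"hadoop": 10.0, "cluster": 20.0}
other_doc = {"recipe": 1.0, "butter": 2.0}

print("euclidean short/long :", euclidean_distance(short_doc, long_doc))
print("euclidean short/other:", euclidean_distance(short_doc, other_doc))
print("cosine    short/long :", cosine_distance(short_doc, long_doc))
print("cosine    short/other:", cosine_distance(short_doc, other_doc))
```

In this toy data the Euclidean measure puts short_doc closer to the unrelated document than to the longer document on the same topic, while the cosine measure ranks them the other way round. Cosine distances also stay between 0 and 1 for non-negative weights, which makes t1 and t2 values in that range easier to reason about.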


On 1/7/12 5:08 AM, Paritosh Ranjan wrote:

"Is there a way for me to determine the distance from command line? "

I am not aware of any. If anyone else is, then please suggest.
________________________________________
From: Periya.Data [[email protected]]
Sent: Saturday, January 07, 2012 6:31 AM
To: [email protected]
Subject: Re: number of clusters (Canopy Clustering)

I agree that if all the distances are < t2, I will get only one cluster. I
was just "hoping" that they do fall within that range and was basically
shooting in the dark when twiddling with various t1 and t2 values.

Is there an easy way to determine the distance between vectors? In the
CanopyCluster shell script, I use EuclideanDistanceMeasure. The TFIDF
vectors are in binary and I have no idea how to proceed.

Is there a way for me to determine the distance from command line? So far,
I am not using any Java program to do my experiments. As a beginner, I am
running shell scripts and learning.

$MAHOUT_HOME/bin/mahout canopy \
    -i   /input/mahout/vectorized/tfidf-vectors \
    -o   $HDFS_OUTPUT_DIR/bigdata-canopy-centroids \
    -dm  org.apache.mahout.common.distance.EuclideanDistanceMeasure \
    -t1  0.9 \
    -t2  0.2 \
    --overwrite

Thanks for your suggestions,
PD.

On Thu, Jan 5, 2012 at 8:47 PM, Paritosh Ranjan <[email protected]> wrote:

What is the distance between vectors with the Distance measure you are
using?
If all the vectors lie within the range of t2, then you will get only 1
cluster.

Write a small piece of test code which creates vectors from the data you
are using, and then find the distance between the vectors (using the same
distance measure you are using while clustering). If all distances are
within t2, then you will get only one cluster.
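
The check described above could be sketched roughly as follows. This is plain Python rather than Mahout code, and it assumes the vectors have already been read into lists of floats (reading the binary TFIDF sequence files is not shown). The helper names and sample vectors are hypothetical.

```python
import math
from itertools import combinations

def euclidean_distance(a, b):
    """Euclidean distance between two dense vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_range(vectors, distance):
    """Return (min, max) over all pairwise distances, or None if fewer than 2 vectors."""
    distances = [distance(a, b) for a, b in combinations(vectors, 2)]
    if not distances:
        return None
    return min(distances), max(distances)

def all_within_t2(vectors, distance, t2):
    """True if every pair is closer than t2, i.e. one cluster is all you can expect."""
    return all(distance(a, b) < t2 for a, b in combinations(vectors, 2))

# Hypothetical sample vectors, standing in for the real data.
vectors = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]

print("pairwise distance range:", distance_range(vectors, euclidean_distance))
print("all within t2=0.2 :", all_within_t2(vectors, euclidean_distance, 0.2))
print("all within t2=0.05:", all_within_t2(vectors, euclidean_distance, 0.05))
```

Printing the minimum and maximum pairwise distance gives a starting point for choosing t1 and t2. Note that with a single input vector there are no pairs at all, so the check is trivially true for any t2.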


On 05-01-2012 10:14, Periya.Data wrote:

Hi Paritosh,
     Thanks for your suggestions. I am currently trying to use Canopy
Clustering to guess the number of clusters. I have tried various values
(between 0 and 1) for t1 and t2 (t1 > t2). Still I get only one cluster. I
tried (0.9, 0.2), (0.05, 0.001), (0.005, 0.00001) etc. I thought if I made
t2 very close to 0, I would get a lot of clusters... but, strangely, I
am getting only one cluster for a vast set of t1/t2 values.

Is this because I am using just one text file for my analysis?

I have only one large text file and want to cluster the words and see how
they are clustered. I thought this would be a simple way to begin
exploring clustering/mahout.

Your suggestions are appreciated,
PD.

On Sat, Dec 31, 2011 at 2:48 AM, Paritosh Ranjan <[email protected]> wrote:

There can be two reasons for only one cluster being found.

1) The vectors are really close to each other and the clusters converge.
2) The distance measure you are using is not appropriate for your vector
values.

Try to
1) Analyze the vectors and the distance between them. Are they good
candidates to be inside different clusters?
2) Use CanopyClustering first to guess the number of clusters
(experiment a bit by changing the values of t1 and t2).
3) Then provide the clusters returned by CanopyClustering to KMeans.
4) Use EuclideanDistance instead of Squared...
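
To make the role of t1 and t2 concrete, here is a rough sketch of the textbook canopy procedure in plain Python. It is not Mahout's CanopyClusterer, whose details may differ; the function names and the one-dimensional sample points are assumptions for illustration only.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two dense vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_canopies(points, distance, t1, t2):
    """Textbook canopy clustering; returns a list of (center, members) pairs.

    Points closer than t1 to a center join its canopy; points closer than
    t2 are also removed from the pool, so they cannot seed a new canopy.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_available = []
        for point in remaining:
            d = distance(center, point)
            if d < t1:
                members.append(point)
            if d >= t2:
                still_available.append(point)
        remaining = still_available
        canopies.append((center, members))
    return canopies

# Hypothetical one-dimensional points forming two well separated groups.
points = [[0.0], [1.0], [10.0], [11.0]]

for t1, t2 in [(30.0, 20.0), (3.0, 2.0), (0.9, 0.2)]:
    count = len(build_canopies(points, euclidean_distance, t1, t2))
    print("t1=%s t2=%s -> %d canopies" % (t1, t2, count))

# A single input point yields exactly one canopy whatever t1 and t2 are.
print(len(build_canopies([[5.0]], euclidean_distance, 0.005, 0.00001)))
```

The number of canopies can never exceed the number of input vectors. The console output quoted further down reports "Map input records=1", which suggests the whole text file may have been vectorized as a single document; if so, one cluster would be the only possible result for any t1/t2 or --numClusters value.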

Paritosh

________________________________________
From: Periya.Data [[email protected]]
Sent: Saturday, December 31, 2011 1:07 AM
To: [email protected]
Subject: number of clusters

Hi all,
    I am a newbie to Mahout. I am running a basic k-means clustering on a
sample txt file. No matter what number I give to the --numClusters
parameter, I always get only one cluster (VL-0). Can someone please point
out any mistake and suggest what I should do to see a decent number of
clusters?

I successfully convert the txt file into seq-file and then to vectorized
format.

The command I use is the following:

$MAHOUT_HOME/bin/mahout kmeans \
    --input        /input/mahout/vectorized/tfidf-vectors \
    --output       $HDFS_OUTPUT_DIR/clusters \
    --clusters     $HDFS_OUTPUT_DIR/initialclusters \
    --maxIter      10 \
    --numClusters  20 \
    --clustering \
    --overwrite


Here is the console output:
=====================

pd@PeriyaData:~/Mahout/examples/bin$ ./bigdata_kmeans.sh
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
MAHOUT-JOB: /home/pd/CDH3/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/12/30 15:59:23 INFO common.AbstractJob: Command line arguments:
{--clustering=null, --clusters=/output/mahout/kmeans/initialclusters,
--convergenceDelta=0.5,
--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
--endPhase=2147483647, --input=/input/mahout/vectorized/tfidf-vectors,
--maxIter=10, --method=mapreduce, --numClusters=20,
--output=/output/mahout/kmeans/clusters, --overwrite=null,
--startPhase=0,
--tempDir=temp}
11/12/30 15:59:23 INFO common.HadoopUtil: Deleting /output/mahout/kmeans/clusters
11/12/30 15:59:24 INFO common.HadoopUtil: Deleting /output/mahout/kmeans/initialclusters
11/12/30 15:59:25 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/12/30 15:59:25 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
11/12/30 15:59:25 INFO compress.CodecPool: Got brand-new compressor
11/12/30 15:59:25 INFO kmeans.RandomSeedGenerator: Wrote 20 vectors to /output/mahout/kmeans/initialclusters/part-randomSeed
11/12/30 15:59:25 INFO kmeans.KMeansDriver: Input: /input/mahout/vectorized/tfidf-vectors Clusters In: /output/mahout/kmeans/initialclusters/part-randomSeed Out: /output/mahout/kmeans/clusters Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
11/12/30 15:59:25 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
11/12/30 15:59:25 INFO kmeans.KMeansDriver: K-Means Iteration 1
11/12/30 15:59:25 INFO input.FileInputFormat: Total input paths to process : 1
11/12/30 15:59:26 INFO mapred.JobClient: Running job: job_201112301129_0029
11/12/30 15:59:27 INFO mapred.JobClient:  map 0% reduce 0%
11/12/30 15:59:30 INFO mapred.JobClient:  map 100% reduce 0%
11/12/30 15:59:39 INFO mapred.JobClient:  map 100% reduce 100%
11/12/30 15:59:39 INFO mapred.JobClient: Job complete: job_201112301129_0029
11/12/30 15:59:39 INFO mapred.JobClient: Counters: 23
11/12/30 15:59:39 INFO mapred.JobClient:   Job Counters
11/12/30 15:59:39 INFO mapred.JobClient:     Launched reduce tasks=1
11/12/30 15:59:39 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3074
11/12/30 15:59:39 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
11/12/30 15:59:39 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
11/12/30 15:59:39 INFO mapred.JobClient:     Launched map tasks=1
11/12/30 15:59:39 INFO mapred.JobClient:     Data-local map tasks=1
11/12/30 15:59:39 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8299
11/12/30 15:59:39 INFO mapred.JobClient:   Clustering
11/12/30 15:59:39 INFO mapred.JobClient:     Converged Clusters=1
11/12/30 15:59:39 INFO mapred.JobClient:   FileSystemCounters
11/12/30 15:59:39 INFO mapred.JobClient:     FILE_BYTES_READ=185593
11/12/30 15:59:39 INFO mapred.JobClient:     HDFS_BYTES_READ=139801
11/12/30 15:59:39 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=477505
11/12/30 15:59:39 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=92991
11/12/30 15:59:39 INFO mapred.JobClient:   Map-Reduce Framework
11/12/30 15:59:39 INFO mapred.JobClient:     Reduce input groups=1
11/12/30 15:59:39 INFO mapred.JobClient:     Combine output records=1
11/12/30 15:59:39 INFO mapred.JobClient:     Map input records=1
11/12/30 15:59:39 INFO mapred.JobClient:     Reduce shuffle bytes=0
11/12/30 15:59:39 INFO mapred.JobClient:     Reduce output records=1
11/12/30 15:59:39 INFO mapred.JobClient:     Spilled Records=2
11/12/30 15:59:39 INFO mapred.JobClient:     Map output bytes=185582
11/12/30 15:59:39 INFO mapred.JobClient:     Combine input records=1
11/12/30 15:59:39 INFO mapred.JobClient:     Map output records=1
11/12/30 15:59:39 INFO mapred.JobClient:     SPLIT_RAW_BYTES=137
11/12/30 15:59:39 INFO mapred.JobClient:     Reduce input records=1
11/12/30 15:59:39 INFO kmeans.KMeansDriver: Clustering data
11/12/30 15:59:39 INFO kmeans.KMeansDriver: Running Clustering
11/12/30 15:59:39 INFO kmeans.KMeansDriver: Input: /input/mahout/vectorized/tfidf-vectors Clusters In: /output/mahout/kmeans/clusters/clusters-1-final Out: /output/mahout/kmeans/clusters/clusteredPoints Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure@14e4e31
11/12/30 15:59:39 INFO kmeans.KMeansDriver: convergence: 0.5 Input Vectors: org.apache.mahout.math.VectorWritable
11/12/30 15:59:40 INFO input.FileInputFormat: Total input paths to process : 1
11/12/30 15:59:40 INFO mapred.JobClient: Running job: job_201112301129_0030
11/12/30 15:59:41 INFO mapred.JobClient:  map 0% reduce 0%
11/12/30 15:59:45 INFO mapred.JobClient:  map 100% reduce 0%
11/12/30 15:59:45 INFO mapred.JobClient: Job complete: job_201112301129_0030
11/12/30 15:59:45 INFO mapred.JobClient: Counters: 13
11/12/30 15:59:45 INFO mapred.JobClient:   Job Counters
11/12/30 15:59:45 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3815
11/12/30 15:59:45 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
11/12/30 15:59:45 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
11/12/30 15:59:45 INFO mapred.JobClient:     Launched map tasks=1
11/12/30 15:59:45 INFO mapred.JobClient:     Data-local map tasks=1
11/12/30 15:59:45 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
11/12/30 15:59:45 INFO mapred.JobClient:   FileSystemCounters
11/12/30 15:59:45 INFO mapred.JobClient:     HDFS_BYTES_READ=186054
11/12/30 15:59:45 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=52059
11/12/30 15:59:45 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=92956
11/12/30 15:59:45 INFO mapred.JobClient:   Map-Reduce Framework
11/12/30 15:59:45 INFO mapred.JobClient:     Map input records=1
11/12/30 15:59:45 INFO mapred.JobClient:     Spilled Records=0
11/12/30 15:59:45 INFO mapred.JobClient:     Map output records=1
11/12/30 15:59:45 INFO mapred.JobClient:     SPLIT_RAW_BYTES=137
11/12/30 15:59:45 INFO driver.MahoutDriver: Program took 21888 ms (Minutes: 0.3648)
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
MAHOUT-JOB: /home/pd/CDH3/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/12/30 15:59:48 INFO common.AbstractJob: Command line arguments:
{--dictionary=/input/mahout/vectorized/dictionary.file-0,
--dictionaryType=sequencefile,
--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
--endPhase=2147483647, --numWords=30,
--output=/home/pd/Mahout/examples/output/clusteranalyze.txt,
--outputFormat=TEXT,
--pointsDir=/output/mahout/kmeans/clusters/clusteredPoints,
--seqFileDir=/output/mahout/kmeans/clusters/clusters-*-final,
--startPhase=0, --tempDir=temp}
*11/12/30 15:59:49 INFO clustering.ClusterDumper: Wrote 1 clusters*
11/12/30 15:59:49 INFO driver.MahoutDriver: Program took 1171 ms (Minutes: 0.01951666666666667)
pd@PeriyaData:~/Mahout/examples/bin$


pd@PeriyaData:~/Mahout/examples/bin$ hadoop fs -ls /output/mahout/kmeans/clusters
Found 2 items
drwxr-xr-x   - pd supergroup          0 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusteredPoints
drwxr-xr-x   - pd supergroup          0 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusters-1-final
pd@PeriyaData:~/Mahout/rabi/examples/bin$ hadoop fs -ls /output/mahout/kmeans/clusters/clusters-1-final
Found 3 items
-rw-r--r--   1 pd supergroup          0 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusters-1-final/_SUCCESS
drwxr-xr-x   - pd supergroup          0 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusters-1-final/_logs
-rw-r--r--   1 pd supergroup      92991 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusters-1-final/part-r-00000
pd@PeriyaData:~/Mahout/examples/bin$


Thanks,
PD
