No idea at this point. K-means is actually an unsupervised
classification algorithm, and I am certain that it does not remove any
of the offered points for training purposes. The
ClusterClassificationDriver that is invoked to classify the offered
points does accept a clusterClassificationThreshold, however, and that
can be used for outlier removal. If, for some reason, the distances to
the final clusters were less than this threshold (default == 0), then
points would be removed. But I don't see how this could occur, and you
are the only user to report such behavior.
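To make the threshold argument concrete, here is a minimal Java sketch
of that kind of check. It is not Mahout's actual code: the class name,
the keep() helper, and the idea of a single per-point "score" compared
against the threshold are all assumptions made for illustration. It
only shows why a threshold of 0 should not drop anything as long as the
compared values are non-negative.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Mahout implementation.
class ThresholdFilterSketch {

    // Assumed rule: a point is kept when its score is at least the threshold.
    static boolean shouldClassify(double score, double threshold) {
        return score >= threshold;
    }

    // Returns the scores that survive the threshold check.
    static List<Double> keep(double[] scores, double threshold) {
        List<Double> kept = new ArrayList<>();
        for (double s : scores) {
            if (shouldClassify(s, threshold)) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        double[] scores = {0.0, 0.3, 0.9};
        // With the default threshold of 0, every non-negative score is kept.
        System.out.println(keep(scores, 0.0).size());
        // A larger threshold starts discarding points as outliers.
        System.out.println(keep(scores, 0.5).size());
    }
}
```

If points really were disappearing at this stage, a non-default
threshold would be the first thing to rule out.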
Given that you have a pretty small dataset, you might try running the
k-means step in sequential mode. This would allow you to verify the
results while running in memory and also allow you to use a debugger to
investigate further. By setting a breakpoint in
ClusterClassificationDriver.shouldClassify() (you'd need to edit it a
bit first) you could determine if this was removing any of your input
points.
On 8/27/12 10:26 PM, Phoenix Bai wrote:
Hi Jeff,
first of all, thank you for your response.
But unfortunately, I don't think that is the cause. As I checked, there is
only one file, part-m-00000, under the directory clusteredPoints.
$ hadoop fs -ls /bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints
Found 1 items
-rw-r----- 3 bmz dev 24608 2012-08-27 10:27 /bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints/part-m-00000
So, what else could it be?
btw, since kmeans belongs to supervised learning, is it possible that it
takes out some data to construct a training dataset?
Just a guess, and it seems unreasonable to do that.
Thanks
On Tue, Aug 28, 2012 at 12:39 AM, Jeff Eastman
<[email protected]> wrote:
Offhand, I wonder why you are specifying only a single part-m-00000 file
in your clusterdump step? If there is more than one part file (the usual
case) then you might be missing some of the clustered points. If so, then
using the directory instead might help:
--pointsDir
/group/tbdev/zhimo.bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints
\
On 8/27/12 2:49 AM, Phoenix Bai wrote:
--pointsDir
/group/tbdev/zhimo.bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints/part-m-00000
\