No idea at this point. K-means is actually an unsupervised classification algorithm, and I am certain that it does not remove any of the offered points for training purposes. The ClusterClassificationDriver that is invoked to classify the offered points does accept a clusterClassificationThreshold, however, and that can be used for outlier removal. If, for some reason, the distances to the final clusters were less than this threshold (default == 0), then points would be removed. But I don't see how this could occur, and you are the only user to report such behavior.

Given that you have a pretty small dataset, you might try running the k-means step in sequential mode. This would allow you to verify the results while running in memory and also allow you to use a debugger to investigate further. By setting a breakpoint in ClusterClassificationDriver.shouldClassify() (you'd need to edit it a bit first) you could determine if this was removing any of your input points.
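
To make the threshold check concrete, here is a minimal sketch in plain Java. It is not the Mahout source: the class ThresholdSketch, the method countDropped, and the use of a non-negative per-cluster score are assumptions made for illustration, and the real ClusterClassificationDriver.shouldClassify() may compare different quantities. The sketch only shows why a threshold of 0 should not drop anything as long as scores never go negative.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; names and logic are assumptions, not Mahout code.
final class ThresholdSketch {

    // Keep a point only if its best per-cluster score reaches the threshold.
    static boolean shouldClassify(double[] scoresPerCluster, double threshold) {
        double best = Double.NEGATIVE_INFINITY;
        for (double score : scoresPerCluster) {
            best = Math.max(best, score);
        }
        return best >= threshold;
    }

    // Count how many points the filter above would silently leave out.
    static int countDropped(List<double[]> points, double threshold) {
        int dropped = 0;
        for (double[] scores : points) {
            if (!shouldClassify(scores, threshold)) {
                dropped++;
            }
        }
        return dropped;
    }

    public static void main(String[] args) {
        // Made-up scores for three points against two clusters.
        List<double[]> points = new ArrayList<>();
        points.add(new double[] {0.9, 0.1});
        points.add(new double[] {0.2, 0.3});
        points.add(new double[] {0.0, 0.0});

        System.out.println("dropped at threshold 0.0: " + countDropped(points, 0.0));
        System.out.println("dropped at threshold 0.5: " + countDropped(points, 0.5));
    }
}
```

Comparing the number of input vectors with the number of points that pass a check like this, while stepping through in a debugger, would show directly whether the threshold is responsible for the missing points.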


On 8/27/12 10:26 PM, Phoenix Bai wrote:
Hi Jeff,

First of all, thank you for your response.

But unfortunately, I don't think that is the cause. As I checked, there is
only one file, part-m-00000, under the clusteredPoints directory.

$ hadoop fs -ls /bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints
Found 1 items
-rw-r-----   3 bmz dev      24608 2012-08-27 10:27 /bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints/part-m-00000

So, what else could it be?
BTW, since kmeans belongs to supervised learning, is it possible that it
takes out some data to construct a training dataset?
Just a guess, and it seems unreasonable to do that.

Thanks

On Tue, Aug 28, 2012 at 12:39 AM, Jeff Eastman <[email protected]> wrote:

Offhand, I wonder why you are specifying only a single part-m-00000 file
in your clusterdump step. If there is more than one part file (the usual
case), then you might be missing some of the clustered points. If so, then
using the directory instead might help:

--pointsDir /group/tbdev/zhimo.bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints \




On 8/27/12 2:49 AM, Phoenix Bai wrote:

--pointsDir /group/tbdev/zhimo.bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints/part-m-00000 \


