Re: does seq2sparse or kmeans filter data ? I am losing data!

Phoenix Bai Wed, 29 Aug 2012 01:20:01 -0700

Hi Jeff,

I found the cause.
seq2sparse has a minSupport option that defaults to 2. It was filtering
out most of the objects, because the frequency of their feature terms is
less than 2.


Setting minSupport to 1 solved the problem.
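
To illustrate the effect, here is a minimal Python sketch. It is not Mahout
code: it assumes that minSupport is compared against a term's total count
across the corpus, and that a document left with no terms disappears from the
output. The function name and the sample video-tag documents are hypothetical.

```python
from collections import Counter


def filter_by_min_support(docs, min_support):
    """Drop terms whose corpus-wide count is below min_support.

    Returns (kept, dropped): documents that still have at least one term,
    and the ids of documents left with no terms at all.
    """
    # Count every occurrence of each term across the whole corpus.
    term_counts = Counter(term for terms in docs.values() for term in terms)
    kept, dropped = {}, []
    for doc_id, terms in docs.items():
        surviving = [t for t in terms if term_counts[t] >= min_support]
        if surviving:
            kept[doc_id] = surviving
        else:
            # Assumption: a document with no surviving terms is lost.
            dropped.append(doc_id)
    return kept, dropped


# Hypothetical video-tag documents; "v3" only has a tag that occurs once.
docs = {
    "v1": ["music", "live"],
    "v2": ["music", "dance"],
    "v3": ["cooking"],
}

kept2, dropped2 = filter_by_min_support(docs, min_support=2)
kept1, dropped1 = filter_by_min_support(docs, min_support=1)
print(sorted(kept2), dropped2)
print(sorted(kept1), dropped1)
```

With a threshold of 2, "v3" loses its only tag and drops out, while a
threshold of 1 keeps every document. On the command line this corresponds to
passing --minSupport 1 to seq2sparse.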

Thanks

On Tue, Aug 28, 2012 at 8:20 PM, Jeff Eastman <[email protected]> wrote:

> No idea at this point. K-means is actually an unsupervised classification
> algorithm and I am certain that it does not remove any of the offered
> points for training purposes. The ClusterClassificationDriver that is
> invoked to classify the offered points does accept a
> clusterClassificationThreshold, however, and that can be used for
> outlier removal. If, for some reason, the distances to the final clusters
> were less than this threshold (default == 0) then points would be removed.
> But I don't see how this could occur and you are the only user to report
> such behavior.
>
> Given that you have a pretty small dataset, you might try running the
> k-means step in sequential mode. This would allow you to verify the results
> while running in memory and also allow you to use a debugger to investigate
> further. By setting a breakpoint in
> ClusterClassificationDriver.shouldClassify()
> (you'd need to edit it a bit first) you could determine if this was
> removing any of your input points.
>
>
>
> On 8/27/12 10:26 PM, Phoenix Bai wrote:
>
>> Hi Jeff,
>>
>> first of all, thank you for your response.
>>
>> But unfortunately, I don't think that is the cause. As I checked, there is
>> only one file, part-m-00000, under the directory clusteredPoints.
>>
>> $ hadoop fs -ls /bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints
>> Found 1 items
>> -rw-r-----   3 bmz dev      24608 2012-08-27 10:27 /bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints/part-m-00000
>>
>> So, what else could it be?
>> Btw, since kmeans belongs to supervised learning, is it possible that it
>> takes out some data to construct a training dataset?
>> Just a guess, and it seems unreasonable to do that.
>>
>> Thanks
>>
>> On Tue, Aug 28, 2012 at 12:39 AM, Jeff Eastman
>> <[email protected]> wrote:
>>
>>> Offhand, I wonder why you are specifying only a single part-m-00000 file
>>> in your clusterdump step? If there is more than one part file (a usual
>>> case) then you might be missing some of the clustered points. If so, then
>>> using the directory instead might help:
>>>
>>> --pointsDir /group/tbdev/zhimo.bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints \
>>>
>>>
>>>
>>>
>>> On 8/27/12 2:49 AM, Phoenix Bai wrote:
>>>
>>>> --pointsDir /group/tbdev/zhimo.bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints/part-m-00000 \
>>>>
>>>>
>>>
>
