Re: Canopy estimator

gaurav redkar Fri, 11 May 2012 08:40:02 -0700

I have tried out a naive method to estimate the values for t1 and t2 to be
used for meanshift clustering. Any suggestions or comments about
shortcomings of my approach are welcome.


i take a sample of the dataset and compute pairwise similarity between all
the points in the sample using the same distance measure that i will use
while performing clustering(euclidean distance in my case). For simplicity
of explanation, assume i take 3 points from my dataset.
computing similarity among these points give me a similarity matrix as shown

   0 1.56 1.4 1.56 0 1.36 1.4 1.36 0

now looking at each column(or row coz the matrix is symmetric) u can see
that the largest element in the column(or row) is the distance which if
used as t1 will cover all the points from the chosen sample.(and probably
cover a  sizeable percentage of the entire dataset). that would result to
all the points mergin to 1 or few clusters

in order to choose t1:-
 i take the mean of all the elements in each column (ignoring the 0's in
the diagonal).
for the matrix shown above, we get the values
   1.48 1.46 1.38

then i take the average of these values , i.e. t1=1.44

in order to choose t2:-
i find the minimum element in each column (ignoring the 0's in the
diagonal) which will give me
1.4  ,1.36 , 1.36.
 to choose the value of t2 i intend to take mean of all the minimum
elements in each column.

then select the mean of these values , t2=1.37

Any comments on the approach

Thanks
Gaurav



On Fri, May 11, 2012 at 8:28 PM, Jeff Eastman <[email protected]>wrote:

> The reason I use T1==T2 is that T2 is the only threshold that determines
> the number of clusters. T1 affects how many adjacent points are considered
> in the centroid calculations. So you could simplify your histogram analysis
> to 2-d without affecting #clusters.
>
> Hierarchical clustering is one way to think about the clustering of
> information that we have just recently added to Mahout. Any experiences you
> can share with its application would be valuable.
>
>
> On 5/10/12 12:20 PM, Pat Ferrel wrote:
>
>> Naively I imagine giving a range, divide up into equal increments and
>> calculate all relevant cluster numbers. It would take the order of (# of
>> increments)**2  time to do but it seems to me that for a given corpus you
>> wouldn't need to do this very often (actually you only need 1/2 this data).
>> You would get a 3-d surface/histogram with magnitude = # of clusters, x and
>> y = t1 and t2. Then search this data for local maxes, mins and inflection
>> points. I'm not sure what this data would look like -- hence the "naively"
>> disclaimer at the start. It is certainly a large landscape to search by
>> hand.
>>
>> Your method only looks at the diagonal (t1==t2)and maybe that is the most
>> interesting part, in which case the calculations are much quicker.
>>
>> Ultimately I'm interested in finding a better way to do hierarchical
>> clustering. Information very often has a natural hierarchy but the usual
>> methods produce spotty results. If we had a reasonable canopy estimator we
>> could employ it at each level on the subset of the corpus being clustered.
>> Doing this by hand quickly becomes prohibitive given that the number of
>> times you have to estimate canopy values increases exponentially with each
>> level of hierarchy
>>
>> Even a mediocre estimator would likely be better that picking k out of
>> the air. And the times it would fail to produce would also tell you
>> something about your data.
>>
>> On 5/10/12 6:12 AM, Jeff Eastman wrote:
>>
>>> No, the issue was discussed but never reached critical mass. I typically
>>> do a binary search to find the best value setting T1==T2 and then tweak T1
>>> up a bit. For feeding k-means, this latter step is not so important.
>>>
>>> If you could figure out a way to automate this we would be interested.
>>> Conceptually, using the RandomSeedGenerator to sample a few vectors and
>>> comparing them with your chosen DistanceMeasure would give you a hint at
>>> the T-value to begin the search. A utility to do that would be a useful
>>> contribution.
>>>
>>> On 5/9/12 8:36 PM, Pat Ferrel wrote:
>>>
>>>> Some thoughts on 
>>>> https://issues.apache.org/**jira/browse/MAHOUT-563<https://issues.apache.org/jira/browse/MAHOUT-563>
>>>>
>>>> Did anything ever get done with this? Ted mentions limited usefulness.
>>>> This may be true but the cases he mentions as counter examples are also not
>>>> very good for using canopy ahead of kmeans, no? That info would be a useful
>>>> result. To use canopies I find myself running it over and over trying to
>>>> see some inflection in the number of clusters. Why not automate this? Even
>>>> if the data shows nothing, that is itself an answer of value and it would
>>>> save a lot of hand work to find out the same thing.
>>>>
>>>>
>>>>
>>>
>>
>>
>

Re: Canopy estimator

Reply via email to