I have tried out a naive method to estimate the values for t1 and t2 to be used for meanshift clustering. Any suggestions or comments about shortcomings of my approach are welcome.
i take a sample of the dataset and compute pairwise similarity between all the points in the sample using the same distance measure that i will use while performing clustering(euclidean distance in my case). For simplicity of explanation, assume i take 3 points from my dataset. computing similarity among these points give me a similarity matrix as shown 0 1.56 1.4 1.56 0 1.36 1.4 1.36 0 now looking at each column(or row coz the matrix is symmetric) u can see that the largest element in the column(or row) is the distance which if used as t1 will cover all the points from the chosen sample.(and probably cover a sizeable percentage of the entire dataset). that would result to all the points mergin to 1 or few clusters in order to choose t1:- i take the mean of all the elements in each column (ignoring the 0's in the diagonal). for the matrix shown above, we get the values 1.48 1.46 1.38 then i take the average of these values , i.e. t1=1.44 in order to choose t2:- i find the minimum element in each column (ignoring the 0's in the diagonal) which will give me 1.4 ,1.36 , 1.36. to choose the value of t2 i intend to take mean of all the minimum elements in each column. then select the mean of these values , t2=1.37 Any comments on the approach Thanks Gaurav On Fri, May 11, 2012 at 8:28 PM, Jeff Eastman <[email protected]>wrote: > The reason I use T1==T2 is that T2 is the only threshold that determines > the number of clusters. T1 affects how many adjacent points are considered > in the centroid calculations. So you could simplify your histogram analysis > to 2-d without affecting #clusters. > > Hierarchical clustering is one way to think about the clustering of > information that we have just recently added to Mahout. Any experiences you > can share with its application would be valuable. > > > On 5/10/12 12:20 PM, Pat Ferrel wrote: > >> Naively I imagine giving a range, divide up into equal increments and >> calculate all relevant cluster numbers. It would take the order of (# of >> increments)**2 time to do but it seems to me that for a given corpus you >> wouldn't need to do this very often (actually you only need 1/2 this data). >> You would get a 3-d surface/histogram with magnitude = # of clusters, x and >> y = t1 and t2. Then search this data for local maxes, mins and inflection >> points. I'm not sure what this data would look like -- hence the "naively" >> disclaimer at the start. It is certainly a large landscape to search by >> hand. >> >> Your method only looks at the diagonal (t1==t2)and maybe that is the most >> interesting part, in which case the calculations are much quicker. >> >> Ultimately I'm interested in finding a better way to do hierarchical >> clustering. Information very often has a natural hierarchy but the usual >> methods produce spotty results. If we had a reasonable canopy estimator we >> could employ it at each level on the subset of the corpus being clustered. >> Doing this by hand quickly becomes prohibitive given that the number of >> times you have to estimate canopy values increases exponentially with each >> level of hierarchy >> >> Even a mediocre estimator would likely be better that picking k out of >> the air. And the times it would fail to produce would also tell you >> something about your data. >> >> On 5/10/12 6:12 AM, Jeff Eastman wrote: >> >>> No, the issue was discussed but never reached critical mass. I typically >>> do a binary search to find the best value setting T1==T2 and then tweak T1 >>> up a bit. For feeding k-means, this latter step is not so important. >>> >>> If you could figure out a way to automate this we would be interested. >>> Conceptually, using the RandomSeedGenerator to sample a few vectors and >>> comparing them with your chosen DistanceMeasure would give you a hint at >>> the T-value to begin the search. A utility to do that would be a useful >>> contribution. >>> >>> On 5/9/12 8:36 PM, Pat Ferrel wrote: >>> >>>> Some thoughts on >>>> https://issues.apache.org/**jira/browse/MAHOUT-563<https://issues.apache.org/jira/browse/MAHOUT-563> >>>> >>>> Did anything ever get done with this? Ted mentions limited usefulness. >>>> This may be true but the cases he mentions as counter examples are also not >>>> very good for using canopy ahead of kmeans, no? That info would be a useful >>>> result. To use canopies I find myself running it over and over trying to >>>> see some inflection in the number of clusters. Why not automate this? Even >>>> if the data shows nothing, that is itself an answer of value and it would >>>> save a lot of hand work to find out the same thing. >>>> >>>> >>>> >>> >> >> >
