I would really appreciate it if somebody could respond. I am trying to do online clustering of feed data.
I am now able to write my custom analyzer, create TF vectors, use canopy as the seed generator, and cluster using KMeansDriver.

Question 1: I want to save the generated centroids. Is there a specific interface with which I can create backups, or do I have to read them and save them somewhere else, say a database, for further use?

Say I now have 100 articles and have grouped them into 10 clusters, against which I want to cluster the new feed. Let's say I have 10 more articles.

My first approach: I can run the same cycle again to recluster everything, but that takes time, so I do not want to do it for my online clustering.

Second approach: I want to use the saved centroids generated in the initial phase and cluster using CanopyDriver. But CanopyDriver takes vectors as input and generates centroids.

Question 2: Can we do this with CanopyDriver? I want to use the previous centroids.

If this is possible, let's say that out of my 10 new articles, 8 are grouped into one of the existing clusters but 2 are new. To achieve this I need the previous centroids. I want to cluster the 2 new ones with the usual k-means and form new clusters.

Question 3: How should I add the centroids of the newly formed clusters to the initial centroid list?

Again, I would appreciate a response. I know my questions are a bit stupid, but for a novice I guess that is expected.

Thanks,
Sharath

On Fri, Feb 4, 2011 at 9:38 AM, sharath jagannath <[email protected]> wrote:

> anybody please?
>
> Thanks,
> Sharath
>
>
> On Thu, Feb 3, 2011 at 10:39 PM, sharath jagannath <[email protected]> wrote:
>
>> I have 3 questions:
>> 1. Now that I am able to create clusters. I want to know how to find
>> intra-cluster distance between the data points say top m data points close
>> to me within my cluster.
>> 2. Say I have created initial cluster and now want to update it but do not
>> want to do it from scratch, I will use canopy to approximate the closest
>> cluster but how should I know what is the new cluster created from the data
>> points which are not part of any of the old cluster?
>> 3. Now after some time I want to recluster everything. How should I do it?
>> Where should I get the all the vectors? Should I have to recreate
>> everything?
>>
>> Thanks,
>> Sharath
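
To make the second approach above concrete, here is a rough sketch of what I have in mind. It is plain Python and not the Mahout API: the function names, the Euclidean distance, and the `threshold` value (playing a role similar to a canopy distance threshold) are all my own assumptions for illustration. New vectors are assigned to the nearest saved centroid if they are close enough; the rest are set aside, clustered separately (here simply averaged into one new centroid), and the result is appended to the saved centroid list.

```python
import math

def euclidean(a, b):
    # Assumed distance measure; a cosine distance may suit TF vectors better.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_to_saved_centroids(vectors, centroids, threshold):
    """Assign each vector to its nearest saved centroid if within threshold.

    Returns (assigned, unassigned): a dict mapping vector index to centroid
    index, and a list of indices of vectors too far from every centroid.
    """
    assigned = {}
    unassigned = []
    for i, v in enumerate(vectors):
        distances = [euclidean(v, c) for c in centroids]
        best = min(range(len(centroids)), key=distances.__getitem__)
        if distances[best] <= threshold:
            assigned[i] = best
        else:
            unassigned.append(i)
    return assigned, unassigned

def mean_vector(vectors):
    # Component-wise mean, used here as a stand-in for a real k-means run.
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical saved centroids from the initial clustering run.
centroids = [[0.0, 0.0], [10.0, 10.0]]

# Hypothetical new feed vectors: two near old clusters, two far away.
new_vectors = [[0.5, 0.2], [9.5, 10.1], [50.0, 50.0], [51.0, 49.0]]

assigned, unassigned = assign_to_saved_centroids(new_vectors, centroids, threshold=3.0)

# Cluster the leftovers on their own and add the result to the saved list.
if unassigned:
    leftovers = [new_vectors[i] for i in unassigned]
    centroids.append(mean_vector(leftovers))
```

In this toy run the first two vectors join the existing clusters and the last two produce one new centroid. In practice the leftovers would go through a normal k-means run rather than a single average, and the threshold would need tuning for the chosen distance measure.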
