Pat,

Seed selection is a big deal. See this paper for some ideas:
http://www.math.uwaterloo.ca/~cswamy/papers/kmeansfnl.pdf

On Mon, Jul 30, 2012 at 11:33 AM, Pat Ferrel <[email protected]> wrote:

> I need to create groups of items that are similar to a seed item. This
> seed item may be a synthetic vector or may be based on a real document but
> it is known before the group is created. It may also contain weighted
> features that are not terms. There are several ways to do this mentioned
> below.
>
> 1. Use Solr's MoreLikeThis and query on the seed vectors. The query
>    would be limited to term features only since the Solr index is
>    limited but I could factor in any non-term features after the query.
> 2. Modify RowSimilarity to take a list of items as input. Instead of
>    calculating similar items for all items it would calculate them only
>    for the seed vectors. This would allow for non-term features.
> 3. It seems I could also use a clustering algorithm like kmeans and
>    supply k=number of seed vectors, seeds=seed vectors, number of
>    iterations = 1. Not sure what the pros and cons are here. I expect
>    some stuff would be calculated that would not be used, like gradient
>    vectors or some such. This would be just a calculation of nearest
>    neighbors, right? Maybe only using a part of the clustering job is
>    better?
> 4. Other?
>
> #3 seems like the least work to get running but I may be missing something
> so fire away if I'm off base.
>
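
For what it's worth, here is a minimal Python sketch of what option 3 reduces to, assuming sparse vectors held as feature-to-weight dicts and cosine distance as the measure. The function names, the max_distance cutoff, and the sample data are made up for illustration and are not Mahout APIs. With a single iteration, the only part of k-means whose output you would use is the assignment step, which is a nearest-seed lookup (the centroid update that follows is the work that would be thrown away):

```python
import math
from collections import defaultdict


def cosine_distance(a, b):
    """Cosine distance between two sparse vectors stored as {feature: weight} dicts."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def assign_to_seeds(items, seeds, max_distance=None):
    """Assign each item to its single nearest seed (the k-means assignment step).

    items, seeds: {id: sparse_vector}. max_distance is a hypothetical cutoff
    that drops items too far from every seed. Returns
    {seed_id: [(item_id, distance), ...]} with members sorted by distance.
    """
    groups = defaultdict(list)
    for item_id, vec in items.items():
        best_seed, best_dist = min(
            ((seed_id, cosine_distance(vec, seed_vec))
             for seed_id, seed_vec in seeds.items()),
            key=lambda pair: pair[1],
        )
        if max_distance is None or best_dist <= max_distance:
            groups[best_seed].append((item_id, best_dist))
    for members in groups.values():
        members.sort(key=lambda m: m[1])
    return dict(groups)


# Illustrative data only; "author:smith" stands in for a weighted non-term feature.
seeds = {
    "sports": {"ball": 1.0, "team": 1.0},
    "cooking": {"recipe": 1.0, "oven": 1.0},
}
items = {
    "doc1": {"ball": 2.0, "score": 1.0},
    "doc2": {"oven": 1.0, "recipe": 3.0},
    "doc3": {"team": 1.0, "author:smith": 0.5},
}
groups = assign_to_seeds(items, seeds)
```

Note that this puts each item in exactly one group (its closest seed), whereas a per-seed similarity query along the lines of option 2 could return the same item for several seeds.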
