I need to create groups of items that are similar to a seed item. The seed may be a synthetic vector or may be based on a real document, but it is known before the group is created. It may also contain weighted features that are not terms. I see several ways to do this:

1. Use Solr's MoreLikeThis and query on the seed vectors. The query
   would be limited to term features only, since the Solr index is
   limited, but I could factor in any non-term features after the query.
2. Modify RowSimilarity to take a list of items as input. Instead of
   calculating similar items for all items, it would calculate them only
   for the seed vectors. This would allow for non-term features.
3. It seems I could also use a clustering algorithm like k-means and
   supply k = number of seed vectors, seeds = seed vectors, number of
   iterations = 1. Not sure what the pros and cons are here. I expect
   some things would be calculated that would not be used, like the
   recomputed centroids. This would be just a calculation of nearest
   neighbors, right? Maybe using only part of the clustering job is
   better?
4. Other?
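
To make option 2 concrete, here is a rough sketch in plain Python of what "similarity for the seed vectors only" would compute. This is not Mahout or RowSimilarity code; the names `cosine` and `similar_to_seeds`, the dict-based sparse vectors, and the choice of cosine similarity are all assumptions made for illustration. Non-term features are just extra keys with their own weights.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {feature: weight} dicts."""
    dot = sum(w * b[f] for f, w in a.items() if f in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def similar_to_seeds(seeds, items, top_n=10):
    """For each seed, return its top_n most similar items as (item_id, score) pairs.

    Only seed-to-item pairs are scored, never item-to-item pairs.
    """
    groups = {}
    for seed_id, seed_vec in seeds.items():
        scored = [(item_id, cosine(seed_vec, vec)) for item_id, vec in items.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        groups[seed_id] = scored[:top_n]
    return groups
```

The cost is (number of seeds) x (number of items) comparisons rather than all pairs, and an item can show up in more than one seed's group.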

Option 3 seems like the least work to get running, but I may be missing something, so fire away if I'm off base.
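
For comparison, this is roughly the only part of a one-iteration k-means run that I would actually use: the assignment step, with the seeds as the initial centroids. Again a plain Python sketch, not Mahout code; `squared_distance` and `assign_to_nearest_seed` are made-up names, and squared Euclidean distance is an assumption (the real job would use whatever distance measure it is configured with).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two sparse {feature: weight} dicts."""
    keys = set(a) | set(b)
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)


def assign_to_nearest_seed(seeds, items):
    """Assignment step of k-means: put each item in the group of its closest seed.

    The centroid update that k-means would run next is deliberately left out.
    """
    groups = {seed_id: [] for seed_id in seeds}
    for item_id, vec in items.items():
        nearest = min(seeds, key=lambda s: squared_distance(seeds[s], vec))
        groups[nearest].append(item_id)
    return groups
```

One difference from the other options: every item lands in exactly one group, so an item close to two seeds appears only under the nearer one, and even very distant items get assigned somewhere unless a distance cutoff is added.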
