I need to create groups of items that are similar to a seed item. The
seed item may be a synthetic vector or may be based on a real document,
but it is known before the group is created. It may also contain
weighted features that are not terms. I can see several ways to do
this, listed below.
1. Use Solr's MoreLikeThis and query on the seed vectors. The query
would be limited to term features, since that is all the Solr index
covers, but I could factor in any non-term features after the query.
2. Modify RowSimilarity to take a list of items as input. Instead of
calculating similar items for all items, it would calculate them only
for the seed vectors. This would allow for non-term features.
3. It seems I could also use a clustering algorithm like k-means and
supply k = number of seed vectors, seeds = seed vectors, and number of
iterations = 1. I'm not sure what the pros and cons are here. I expect
some things would be calculated that would not be used, like gradient
vectors or some such. This would be just a nearest-neighbor
calculation, right? Maybe using only part of the clustering job would
be better?
4. Other?
#3 seems like the least work to get running, but I may be missing
something, so fire away if I'm off base.
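
To make the difference between #2 and #3 concrete, here is a minimal
sketch in plain Python. It is not Mahout or Solr code. It assumes
sparse vectors stored as {feature: weight} dicts and cosine similarity
as the measure; the function names, the "term:"/"meta:" feature keys
and the sample data are all made up for illustration.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {feature: weight} dicts."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_per_seed(seeds, items, n):
    """#2 style: for each seed, the n most similar items. Groups may overlap."""
    result = {}
    for seed_id, seed_vec in seeds.items():
        scored = [(cosine(seed_vec, vec), item_id) for item_id, vec in items.items()]
        # Highest similarity first; ties broken by item id for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[seed_id] = [item_id for _, item_id in scored[:n]]
    return result


def assign_to_nearest_seed(seeds, items):
    """#3 style: one k-means assignment step with the seeds as fixed centroids.

    Every item lands in exactly one group; centroids are never recomputed.
    """
    groups = defaultdict(list)
    for item_id, vec in items.items():
        best_seed = max(seeds, key=lambda s: cosine(seeds[s], vec))
        groups[best_seed].append(item_id)
    return dict(groups)


# Hypothetical data: "term:" keys are term features, "meta:" is a non-term feature.
seeds = {
    "s1": {"term:solr": 1.0, "term:index": 0.5},
    "s2": {"term:kmeans": 1.0, "meta:year": 0.2},
}
items = {
    "d1": {"term:solr": 0.9, "term:index": 0.4},
    "d2": {"term:kmeans": 0.8, "meta:year": 0.3},
    "d3": {"term:solr": 0.2, "term:kmeans": 0.9},
}

print(top_n_per_seed(seeds, items, 2))
print(assign_to_nearest_seed(seeds, items))
```

With this sample data, d3 shows up in the top-2 list of both seeds but
is assigned only to s2 in the k-means style step. So the two are not
quite the same thing: #2 gives ranked, possibly overlapping groups of a
chosen size, while a single k-means pass partitions all items among the
seeds, however far away they are.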