RowSimilarity, Solr, or truncated clustering?

Pat Ferrel Mon, 30 Jul 2012 10:35:32 -0700

I need to create groups of items that are similar to a seed item. This seed item may be a synthetic vector or may be based on a real document, but it is known before the group is created. It may also contain weighted features that are not terms. There are several ways to do this, mentioned below.


1. Use Solr's MoreLikeThis and query on the seed vectors. The query
   would be limited to term features only since the Solr index is
   limited but I could factor in any non-term features after the query.
2. Modify RowSimilarity to take a list of items as input. Instead of
   calculating similar items for all items it would calculate them only
   for the seed vectors. This would allow for non-term features.
3. It seems I could also use a clustering algorithm like kmeans and
   supply k=number of seed vectors, seeds=seed vectors, number of
   iterations = 1. Not sure what the pros and cons are here. I expect
   some stuff would be calculated that would not be used, like gradient
   vectors or some such. This would be just a calculation of nearest
   neighbors, right? Maybe only using a part of the clustering job is
   better?
4. Other?

#3 seems like the least work to get running, but I may be missing something, so fire away if I'm off base.
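
To make the difference between #2 and #3 concrete, here is a minimal sketch in plain Python. It is not Mahout or Solr code; the function names, the sparse-dict vector layout, the cosine measure, and the toy data (including the non-term "rating" feature) are all assumptions chosen for illustration. The first function ranks every item against each seed and keeps the top n, roughly what a seed-only RowSimilarity would produce. The second does a single k-means-style assignment step with the seeds as fixed centroids.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {feature: weight}."""
    dot = sum(w * b[f] for f, w in a.items() if f in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_per_seed(seeds, items, n):
    """Option 2 style: for each seed, keep the n most similar items.

    An item may show up in the groups of several seeds.
    """
    groups = {}
    for seed_id, seed_vec in seeds.items():
        scored = [(cosine(seed_vec, vec), item_id)
                  for item_id, vec in items.items()]
        # Highest similarity first; ties broken by item id for stable output.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        groups[seed_id] = [item_id for _, item_id in scored[:n]]
    return groups


def assign_to_nearest_seed(seeds, items):
    """Option 3 style: one assignment step with seeds as fixed centroids.

    Every item lands in exactly one group, that of its most similar seed.
    """
    groups = {seed_id: [] for seed_id in seeds}
    for item_id, vec in items.items():
        best = max(seeds, key=lambda s: cosine(seeds[s], vec))
        groups[best].append(item_id)
    return groups


# Hypothetical toy data; "rating" stands in for a weighted non-term feature.
seeds = {
    "s1": {"solr": 1.0, "search": 1.0},
    "s2": {"kmeans": 1.0, "cluster": 1.0, "rating": 0.5},
}
items = {
    "a": {"solr": 1.0, "index": 1.0},
    "b": {"kmeans": 1.0, "cluster": 1.0},
    "c": {"search": 1.0, "cluster": 1.0},
}

per_seed = top_n_per_seed(seeds, items, 2)
partition = assign_to_nearest_seed(seeds, items)
```

With this data, item "c" appears in both groups under the top-n approach but only in the "s1" group under the assignment step. So the one-iteration clustering route is a nearest-seed partition of the items, not a ranked neighbor list per seed, which matters if groups are allowed to overlap or need a fixed size.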
