Re: RowSimilarity, Solr, or truncated clustering?

Ted Dunning Mon, 30 Jul 2012 13:20:16 -0700

Pat,

Seed selection is a big deal.  See this paper for some ideas:


http://www.math.uwaterloo.ca/~cswamy/papers/kmeansfnl.pdf

On Mon, Jul 30, 2012 at 11:33 AM, Pat Ferrel <[email protected]> wrote:

> I need to create groups of items that are similar to a seed item. This
> seed item may be a synthetic vector or may be based on a real document but
> it is known before the group is created. It may also contain weighted
> features that are not terms. There are several ways to do this mentioned
> below.
>
> 1. Use Solr's MoreLikeThis and query on the seed vectors. The query
>    would be limited to term features, since the Solr index only holds
>    terms, but I could factor in any non-term features after the query.
> 2. Modify RowSimilarity to take a list of items as input. Instead of
>    calculating similar items for all items it would calculate them only
>    for the seed vectors. This would allow for non-term features.
> 3. It seems I could also use a clustering algorithm like kmeans and
>    supply k=number of seed vectors, seeds=seed vectors, number of
>    iterations = 1. Not sure what the pros and cons are here. I expect
>    some stuff would be calculated that would not be used, like gradient
>    vectors or some such. This would be just a calculation of nearest
>    neighbors, right? Maybe only using a part of the clustering job is
>    better?
> 4. Other?
>
> #3 seems like the least work to get running but I may be missing something
> so fire away if I'm off base.
>
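
For what it's worth, option 3 with a single iteration and fixed seeds
amounts to assigning each item to its nearest seed. Below is a minimal
sketch of that step in plain Python. It assumes sparse vectors stored as
dicts and cosine similarity as the measure; both are illustrative
choices, and the function names (cosine, assign_to_seeds) are
hypothetical, not part of Mahout or Solr.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_to_seeds(items, seeds):
    """Assign every item to its most similar seed (hypothetical helper).

    This is the assignment half of one k-means iteration with the
    centroids fixed to the seeds; no centroid update is performed.
    Returns {seed_id: [(item_id, similarity), ...]}, best match first.
    """
    groups = {seed_id: [] for seed_id in seeds}
    for item_id, vec in items.items():
        best = max(seeds, key=lambda sid: cosine(vec, seeds[sid]))
        groups[best].append((item_id, cosine(vec, seeds[best])))
    for members in groups.values():
        members.sort(key=lambda pair: pair[1], reverse=True)
    return groups


if __name__ == "__main__":
    # Seeds may be synthetic and may carry weighted non-term features
    # ("rating" here is a made-up example of one).
    seeds = {
        "sports": {"ball": 1.0, "team": 1.0},
        "cooking": {"recipe": 1.0, "oven": 1.0, "rating": 0.5},
    }
    items = {
        "d1": {"ball": 2.0, "goal": 1.0},
        "d2": {"oven": 1.0, "recipe": 3.0},
        "d3": {"team": 1.0, "recipe": 0.1},
    }
    for seed_id, members in assign_to_seeds(items, seeds).items():
        print(seed_id, members)
```

Note that this puts each item in exactly one group. If an item should be
able to appear in the groups of several seeds, a per-seed top-N or
similarity threshold (closer to option 2) would be the more natural fit.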
