Sorry, I completely misunderstood. I guess you were talking about the weighted cosine distance.

Great, I'll try.

Thanks again for your useful suggestions
Marco

On 20 Jul 2011, at 23:38, Ted Dunning wrote:

Actually, I would suggest weighting words with something like tf-idf.

http://en.wikipedia.org/wiki/Tf%E2%80%93idf

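Using log(tf) or sqrt(tf) instead of linear tf is often good. For the idf
part, the standard log((N+1) / (df+1)) definition is usually good, where N
is the number of documents and df is the number of documents containing
the word.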
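
A minimal sketch of what this weighting could look like in plain Python, in case it helps. The function names (document_frequencies, tfidf_vector, cosine) are made up for illustration, and log(1 + tf) is just one assumed way of damping tf; sqrt(tf) would slot in the same way. This is not Mahout code.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """Count, for each word, how many documents contain it (df)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() so each doc counts a word once
    return df

def tfidf_vector(tokens, df, n_docs):
    """Sparse tf-idf vector as a dict of word -> weight.

    tf is damped with log(1 + tf) (an assumption; sqrt(tf) also works),
    and idf is log((N + 1) / (df + 1)).
    """
    tf = Counter(tokens)
    return {
        word: math.log(1 + count)
        * math.log((n_docs + 1) / (df.get(word, 0) + 1))
        for word, count in tf.items()
    }

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_u * norm_v)

# Tiny usage example with three toy documents.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]
df = document_frequencies(docs)
vecs = [tfidf_vector(d, df, len(docs)) for d in docs]
```

Note that with this idf a word that occurs in every document gets weight log((N+1)/(N+1)) = 0, so it drops out of the similarity entirely.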

On Wed, Jul 20, 2011 at 2:29 PM, Marco Turchi <[email protected]> wrote:

Wow, thanks a lot, it seems very interesting. If I understand correctly, what you suggested means weighting each word differently when I apply the cosine similarity, with each weight being the frequency of the word in the seed documents. It is not clear to me how to compute and use the anomalously common cooccurrences, but I'll investigate.

Thanks a lot
Marco



On 20 Jul 2011, at 20:36, Ted Dunning wrote:

frequency weighted cosine distance



