Sorry, I completely misunderstood. I guess you were talking about the
weighted cosine distance.
Great, I'll try.
Thanks again for your useful suggestions.
Marco
On 20 Jul 2011, at 23:38, Ted Dunning wrote:
Actually, I would suggest weighting words by something like tf-idf.
http://en.wikipedia.org/wiki/Tf%E2%80%93idf
log(tf) or sqrt(tf) is often good instead of linear tf. The standard
log((N+1) / (df+1)) definition of idf is usually good.
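
A minimal Python sketch of the weighting described above, for illustration only. The function names and the tokenized-list input format are assumptions, not part of any library. The log option uses log(1 + tf), which is one common reading of "log tf" (it keeps a term with tf = 1 from getting zero weight); plain log(tf) would be an equally valid choice.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it (df)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): each document counts a term once
    return df


def tfidf_vector(tokens, df, n_docs, tf_scale="log"):
    """Return {term: tf-idf weight} for one tokenized document.

    tf_scale is a hypothetical switch: "log", "sqrt", or "linear".
    """
    weights = {}
    for term, tf in Counter(tokens).items():
        if tf_scale == "log":
            tf_part = math.log(1 + tf)  # assumption: log(1 + tf), not log(tf)
        elif tf_scale == "sqrt":
            tf_part = math.sqrt(tf)
        else:
            tf_part = float(tf)
        # Smoothed idf as given in the message: log((N+1) / (df+1))
        idf = math.log((n_docs + 1) / (df.get(term, 0) + 1))
        weights[term] = tf_part * idf
    return weights
```

With this idf, a word that occurs in every document gets weight log((N+1)/(N+1)) = 0, so it drops out of any similarity computed from these vectors.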
On Wed, Jul 20, 2011 at 2:29 PM, Marco Turchi <[email protected]> wrote:
Whoa, thanks a lot, it seems very interesting. What you suggested means
weighting each word differently when I apply the cosine similarity.
Each weight is the frequency of the word in the seed documents. It is not
clear to me how to compute and use the anomalously common cooccurrences, but
I'll investigate.
Thanks a lot
Marco
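
For illustration, one way the per-word weighting described in this message could look in Python. This is a sketch under assumptions: the function name, the dict-based sparse vectors, and the choice to give words without an explicit weight a weight of 1.0 are all hypothetical, and the thread does not say exactly which weighted form was meant. Here each word's contribution to the dot product and to both norms is scaled by its weight.

```python
import math


def weighted_cosine(a, b, weights):
    """Weighted cosine similarity between sparse vectors a and b.

    a, b:     {term: value}, e.g. word counts
    weights:  {term: nonnegative weight}; terms not listed default to 1.0
              (an assumption made for this sketch)
    """
    def wdot(x, y):
        # Weighted inner product: sum over terms of w_t * x_t * y_t
        return sum(weights.get(t, 1.0) * v * y.get(t, 0.0) for t, v in x.items())

    norm = math.sqrt(wdot(a, a) * wdot(b, b))
    return wdot(a, b) / norm if norm > 0 else 0.0


def weighted_cosine_distance(a, b, weights):
    """Distance form: 1 - similarity."""
    return 1.0 - weighted_cosine(a, b, weights)
```

Setting a word's weight to 0 removes it from the comparison entirely, while an empty weights dict reduces this to the ordinary cosine similarity.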
On 20 Jul 2011, at 20:36, Ted Dunning wrote:
frequency weighted cosine distance