Actually, I would suggest weighting words with something like tf-idf:
http://en.wikipedia.org/wiki/Tf%E2%80%93idf

Using log(tf) or sqrt(tf) is often better than the raw (linear) tf, and the standard idf definition, log((N + 1) / (df + 1)), is usually good.
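To make that concrete, here is a rough sketch (not from the original thread; Python, with made-up helper names such as tfidf_vector) of tf-idf weighted cosine similarity, assuming documents are already tokenized, a sublinear log(1 + tf), and the idf form above:

import math
from collections import Counter

def idf(df, n_docs):
    # Smoothed idf as suggested above: log((N + 1) / (df + 1)).
    return math.log((n_docs + 1) / (df + 1))

def tfidf_vector(tokens, df_counts, n_docs):
    # Sublinear term frequency, log(1 + tf), times idf for each word.
    tf = Counter(tokens)
    return {w: math.log(1 + c) * idf(df_counts.get(w, 0), n_docs)
            for w, c in tf.items()}

def cosine(u, v):
    # Cosine similarity between two sparse word-weight dictionaries.
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy corpus; document frequencies come from whatever collection you have.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stock prices rose sharply today".split(),
]
df_counts = Counter(w for d in docs for w in set(d))
n_docs = len(docs)

# Compare a seed document's weighted vector against the others.
seed = tfidf_vector(docs[0], df_counts, n_docs)
for d in docs:
    print(cosine(seed, tfidf_vector(d, df_counts, n_docs)))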
On Wed, Jul 20, 2011 at 2:29 PM, Marco Turchi <[email protected]> wrote:

> Whoa, thanks a lot, it seems very interesting. What you suggested means to
> weight each single word differently when I apply the cosine similarity.
> Each weight is the frequency of the word in the seed documents. It is not
> clear to me how to compute and use the anomalously common cooccurrences, but
> I'll investigate.
>
> Thanks a lot
> Marco
>
>
> On 20 Jul 2011, at 20:36, Ted Dunning wrote:
>
>> frequency weighted cosine distance
>
