Actually, I would suggest weighting words by something like tf-idf.

http://en.wikipedia.org/wiki/Tf%E2%80%93idf

Using log(tf) or sqrt(tf) instead of linear tf is often better, and the
standard smoothed idf, log((N+1) / (df+1)), usually works well.
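To make this concrete, here is a minimal sketch of tf-idf weighted cosine similarity using the sublinear log(1 + tf) term weight and the smoothed idf log((N+1) / (df+1)) mentioned above. The function names and the toy corpus are illustrative, not from any particular library:

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, df, n_docs):
    """Weight each term by log(1 + tf) * log((N + 1) / (df + 1))."""
    tf = Counter(doc_tokens)
    return {t: math.log(1 + c) * math.log((n_docs + 1) / (df.get(t, 0) + 1))
            for t, c in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Tiny illustrative corpus: document frequencies counted once per document.
docs = [["cat", "sat", "mat"], ["cat", "dog"], ["dog", "barks"]]
df = Counter(t for d in docs for t in set(d))
n = len(docs)

u = tfidf_vector(docs[0], df, n)
v = tfidf_vector(docs[1], df, n)
print(cosine(u, v))
```

With these weights, terms common across the whole collection (high df) contribute little to the similarity, which is the point of the idf term.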

On Wed, Jul 20, 2011 at 2:29 PM, Marco Turchi <[email protected]> wrote:

> Whoa, thanks a lot, it seems very interesting. What you suggested means
> weighting each single word differently when I apply the cosine similarity.
> Each weight is the frequency of the word in the seed documents. It is not
> clear to me how to compute and use the anomalously common co-occurrences,
> but I'll investigate.
>
> Thanks a lot
> Marco
>
>
>
> On 20 Jul 2011, at 20:36, Ted Dunning wrote:
>
>  frequency weighted cosine distance
>>
>
>