I tried an experiment running RowSimilarity with 16 docs of short quotations on a similar subject. It looks to me as though, using Tanimoto, the largest pair-wise distance allowed for the similar docs was 0.4. Though I asked for 10 similar docs, I got anywhere from 0 to 10. I see this same effect with larger data sets but haven't seen an obvious cut-off point.

I was expecting to be able to make the decision about the cut-off distance myself. In other words, I was expecting to always get 20 similar docs when I asked for 20. It is useful to see what docs are at larger distances.
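
To make the expectation concrete, here is a rough Python sketch of the behaviour I had in mind. It is not Mahout code and says nothing about how RowSimilarity works internally; the function names, the `max_distance` parameter, and the treatment of each doc as a plain set of terms are all my own assumptions for illustration.

```python
def tanimoto_distance(a, b):
    """Tanimoto (Jaccard) distance between two collections of terms:
    1 - |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty docs are treated as identical
    return 1.0 - len(a & b) / len(union)


def most_similar(docs, query_id, n, max_distance=None):
    """Return up to n (distance, doc_id) pairs, nearest first.

    docs is a hypothetical dict of doc_id -> iterable of terms.
    With max_distance=None nothing is filtered out, so the caller
    gets n results whenever at least n other docs exist.
    """
    scored = [
        (tanimoto_distance(docs[query_id], terms), doc_id)
        for doc_id, terms in docs.items()
        if doc_id != query_id
    ]
    scored.sort()
    if max_distance is not None:
        # Optional cut-off, chosen by the caller rather than the library.
        scored = [pair for pair in scored if pair[0] <= max_distance]
    return scored[:n]
```

With something like this, asking for 20 docs returns 20 whenever the data set is large enough, and I can apply a cut-off such as 0.4 afterwards if I want one.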


How is RowSimilarity deciding when to cut off the returned docs?
