This could simply be because there are fewer similar docs than the number specified in 'maxSimilaritiesPerRow'.
consider() is only invoked if a threshold was specified.

Best,
Sebastian

On 13.05.2012 08:25, Suneel Marthi wrote:
> Pat's question was that he was seeing fewer documents than specified by
> 'maxSimilaritiesPerRow'. This could be happening due to the 'consider'
> functionality of the applied similarity measure.
>
> ________________________________
> From: Sebastian Schelter <[email protected]>
> To: [email protected]
> Sent: Sunday, May 13, 2012 2:08 AM
> Subject: Re: RowSimilarity
>
> The option 'maxSimilaritiesPerRow' determines the maximum number of
> similar docs/items/rows per row. It depends on your data whether there
> are enough similar rows per row, so you can't always get 20 similar docs.
>
> The option 'threshold' determines the minimum similarity value for a
> pair of docs (otherwise it will be dropped). This option is not
> activated by default, however.
>
> Best,
> Sebastian
>
> On 13.05.2012 01:29, Pat Ferrel wrote:
>> I tried an experiment running RowSimilarity with 16 docs of short
>> quotations on a similar subject. It looks to me that using tanimoto the
>> largest pair-wise distance allowed for the similar docs was 0.4. Though
>> I asked for 10 similar docs I got 0 to 10. I see this same effect with
>> larger data sets but haven't seen an obvious cut-off point.
>>
>> I was expecting to be able to make the decision about cut-off distance
>> myself. In other words I was expecting to always get 20 similar docs
>> when I asked for 20. It is useful to see what docs are at larger distances.
>>
>> How is RowSimilarity deciding when to cut off the returned docs?
>>
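
To illustrate how the two options discussed in this thread interact, here is a minimal Python sketch. It is not Mahout's implementation: the names `tanimoto` and `similar_rows` are hypothetical, docs are modelled as plain sets of term ids, and it assumes that a pair of rows sharing no terms never yields a similarity entry at all. Under those assumptions, a row can end up with fewer neighbours than 'maxSimilaritiesPerRow' even when no threshold is set.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two sets: |A & B| / |A | B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similar_rows(rows, max_similarities_per_row, threshold=None):
    """Hypothetical helper: keep the top-N most similar rows for each row.

    rows: dict mapping a row id to a set of term ids.
    threshold: optional minimum similarity; None means it is not applied,
    mirroring the 'threshold' option being off by default.
    """
    result = {}
    for i, row_i in rows.items():
        scored = []
        for j, row_j in rows.items():
            if i == j:
                continue
            s = tanimoto(row_i, row_j)
            # Assumption: rows with nothing in common produce no pair.
            if s == 0.0:
                continue
            # The threshold only prunes pairs when it was specified.
            if threshold is not None and s < threshold:
                continue
            scored.append((j, s))
        # Highest similarity first, then cap at the per-row maximum.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[i] = scored[:max_similarities_per_row]
    return result


rows = {
    "a": {1, 2, 3},
    "b": {2, 3, 4},
    "c": {3, 4, 5},
    "d": {9},  # shares no terms with any other row
}

no_threshold = similar_rows(rows, max_similarities_per_row=10)
with_threshold = similar_rows(rows, max_similarities_per_row=10, threshold=0.4)
```

In this toy data, row "d" gets no neighbours at all and row "a" gets two, although ten were requested. Setting the threshold to 0.4 additionally drops the weaker pair for "a".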
