Re: RowSimilarity

Sebastian Schelter Sat, 12 May 2012 23:39:05 -0700

This could simply be because there are fewer similar docs
than the number specified in 'maxSimilaritiesPerRow'.


consider() is only invoked if a threshold was specified.
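
To illustrate why a row can end up with fewer neighbours than
'maxSimilaritiesPerRow', here is a minimal Python sketch. It is not Mahout's
code: the function names, the set-based document representation, and the
sample data are all made up for illustration, and it assumes that pairs of
rows with nothing in common never produce a similarity at all. The cap only
truncates the list of candidates; it never pads it.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def top_similar_rows(rows, max_similarities_per_row, threshold=None):
    """Return, for each row, at most max_similarities_per_row neighbours.

    rows maps a row name to a set of terms (hypothetical representation).
    """
    result = {name: [] for name in rows}
    for x, y in combinations(rows, 2):
        sim = tanimoto(rows[x], rows[y])
        if sim == 0.0:
            # Assumption: rows sharing nothing never form a candidate pair.
            continue
        if threshold is not None and sim < threshold:
            # Only applied when a threshold was explicitly given.
            continue
        result[x].append((y, sim))
        result[y].append((x, sim))
    for name in result:
        # Keep the most similar rows first, then apply the cap.
        result[name].sort(key=lambda pair: pair[1], reverse=True)
        del result[name][max_similarities_per_row:]
    return result


# Hypothetical toy data: "d" shares no terms with any other doc.
docs = {
    "a": {"x", "y", "z"},
    "b": {"x", "y"},
    "c": {"z", "w"},
    "d": {"q"},
}

result = top_similar_rows(docs, max_similarities_per_row=10)
for name, neighbours in result.items():
    print(name, neighbours)
```

Even though 10 neighbours are requested, "a" gets only two and "d" gets none,
because that is all the data offers. Passing a threshold (for example
threshold=0.5) shrinks the lists further.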

Best,
Sebastian


On 13.05.2012 08:25, Suneel Marthi wrote:
> Pat's question was that he was seeing fewer documents than specified by
> 'maxSimilaritiesPerRow'. This could be happening due to the 'consider'
> functionality of the applied similarity measure.
> 
> 
> 
> ________________________________
>  From: Sebastian Schelter <[email protected]>
> To: [email protected] 
> Sent: Sunday, May 13, 2012 2:08 AM
> Subject: Re: RowSimilarity
>  
> The option 'maxSimilaritiesPerRow' determines the maximum number of
> similar docs/items/rows per row. Whether there are enough similar rows
> per row depends on your data, so you can't always get 20 similar docs.
> 
> The option 'threshold' determines the minimum similarity value for a
> pair of docs (pairs below it are dropped). This option is not
> activated by default, however.
> 
> Best,
> Sebastian
> 
> On 13.05.2012 01:29, Pat Ferrel wrote:
>> I tried an experiment running RowSimilarity with 16 docs of short
>> quotations on a similar subject. It looks to me that using Tanimoto the
>> largest pair-wise distance allowed for the similar docs was 0.4. Though
>> I asked for 10 similar docs I got 0 to 10. I see this same effect with
>> larger data sets but haven't seen an obvious cut-off point.
>>
>> I was expecting to be able to make the decision about cut-off distance
>> myself. In other words I was expecting to always get 20 similar docs
>> when I asked for 20. It is useful to see what docs are at larger distances.
>>
>> How is RowSimilarity deciding when to cut-off the returned docs?
>>
