"The cutoff is made based on lack of term cooccurrences not the distance measure."
I'd rather use the term similarity measure, not distance measure, as a lot of the measures implemented are not metrics and the term 'distance' might be misleading.

A lack of (term) cooccurrences is equivalent to a similarity of 0 by definition, therefore the "default cutoff" is also based on the similarity measure.

--sebastian

On 14.05.2012 19:30, Pat Ferrel wrote:
> Thanks, this is quite clear and reasonable. The optional
> 'threshold' is based on the distance measure.
>
> BTW I assume the 'distance' returned is expressed in the distance
> measure's units? So using cosine as a distance measure a value near 0 is
> actually quite similar because the measure is 1-(cosine of the angle
> between the vectors)?
>
> On 5/13/12 9:10 AM, Sebastian Schelter wrote:
>> Hi Pat,
>>
>> RowSimilarityJob allows the use of a lot of different similarity
>> measures (cosine, jaccard coefficient, number of cooccurrences, etc) all
>> of which compute a single number for a pair of vectors that denotes how
>> similar those are. All these measures have the characteristic that two
>> vectors that do not share at least one non-zero value in a single
>> dimension are considered not similar (have similarity 0).
>>
>> In general, an all-pairs comparison, as it is conducted by
>> RowSimilarityJob, has quadratic complexity and is therefore not scalable.
>>
>> If we have sparse data such as text or ratings however, we can exploit
>> the fact that we only need to compare pairs which share at least one
>> non-zero value in a dimension. This is the basic idea behind row
>> similarity job to avoid an all-pairs comparison.
>>
>> In some real-world usecases you will furthermore encounter a lot of
>> pairs with near-zero similarities that are of little value for you. To
>> be able to avoid computing these, RowSimilarityJob provides the option
>> to specify a minimum threshold so that it ignores pairs with a
>> similarity value below this threshold. This threshold is data-dependent
>> and you have to experimentally find it.
>>
>> --sebastian
>>
>>
>> On 13.05.2012 17:33, Pat Ferrel wrote:
>>> To paraphrase:
>>>
>>> There is some internal threshold to be considered 'similar'. This is the
>>> one supplied with the 'threshold' option mentioned below and I need to
>>> do a special build to get this option activated? I assume it is not
>>> active because it has not been tested well?
>>>
>>> So currently how is the threshold calculated? How can I determine its
>>> value? Can I vote that this be activated as an optional parameter in the
>>> future?
>>>
>>> I ask this in part because I want to use RowSimilarity in an experiment
>>> to do something like a non-partitioning hierarchical clustering where
>>> I'll need to find close centroids in clusters calculated with different
>>> levels of specificity.
>>>
>>> On 5/12/12 11:38 PM, Sebastian Schelter wrote:
>>>> This could be simply due to the fact that there are less similar docs
>>>> than the number specified in 'maxSimilaritiesPerRow'.
>>>>
>>>> consider() is only invoked if a threshold was specified.
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>>
>>>> On 13.05.2012 08:25, Suneel Marthi wrote:
>>>>> Pat's question was that he was seeing less documents than that
>>>>> specified by 'maxSimilaritiesPerRow', this could be happening due to
>>>>> the 'consider' functionality of the applied similarity measure.
>>>>>
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Sebastian Schelter<[email protected]>
>>>>> To: [email protected]
>>>>> Sent: Sunday, May 13, 2012 2:08 AM
>>>>> Subject: Re: RowSimilarity
>>>>>
>>>>> The option 'maxSimilaritiesPerRow' determines the maximum number of
>>>>> similar docs/items/rows per row. It depends on your data if there are
>>>>> enough similar rows per row, so you can't always get 20 similar docs.
>>>>>
>>>>> The option 'threshold' determines the minimum similarity value for a
>>>>> pair of docs (otherwise it will be dropped). This option is not
>>>>> activated by default however.
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>> On 13.05.2012 01:29, Pat Ferrel wrote:
>>>>>> I tried an experiment running RowSimilarity with 16 docs of short
>>>>>> quotations on a similar subject. It looks to me that using tanimoto
>>>>>> the largest pair-wise distance allowed for the similar docs was 0.4.
>>>>>> Though I asked for 10 similar docs I got 0 to 10. I see this same
>>>>>> effect with larger data sets but haven't seen an obvious cut-off point
>>>>>>
>>>>>> I was expecting to be able to make the decision about cut-off distance
>>>>>> myself. In other words I was expecting to always get 20 similar docs
>>>>>> when I asked for 20. It is useful to see what docs are at larger
>>>>>> distances.
>>>>>>
>>>>>> How is RowSimilarity deciding when to cut-off the returned docs?
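
To make the behaviour discussed in this thread concrete, here is a minimal Python sketch of the idea: only pairs of rows that share at least one non-zero dimension are ever compared, an optional threshold drops low-similarity pairs, and a per-row cap limits the output. This is not Mahout's implementation. The function `row_similarities` and its parameters `threshold` and `max_similarities_per_row` are hypothetical names chosen to mirror the options mentioned above, and cosine is just one of the possible similarity measures.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def row_similarities(rows, threshold=None, max_similarities_per_row=10):
    """Cosine similarity for sparse rows that cooccur in at least one column.

    rows: dict mapping row id -> dict of {column: value}.
    Returns a dict mapping row id -> list of (other id, similarity),
    sorted by descending similarity and capped per row.
    All names here are illustrative, not Mahout's API.
    """
    # Inverted index: column -> ids of rows with a non-zero value there.
    by_column = defaultdict(list)
    for row_id, vector in rows.items():
        for column, value in vector.items():
            if value != 0:
                by_column[column].append(row_id)

    # Accumulate dot products only for pairs that share a column.
    # Pairs without any cooccurrence are never generated (similarity 0).
    dots = defaultdict(float)
    for column, ids in by_column.items():
        for a, b in combinations(sorted(ids), 2):
            dots[(a, b)] += rows[a][column] * rows[b][column]

    norms = {i: sqrt(sum(v * v for v in vec.values())) for i, vec in rows.items()}

    result = {row_id: [] for row_id in rows}
    for (a, b), dot in dots.items():
        sim = dot / (norms[a] * norms[b])
        # The threshold is optional; without it every cooccurring pair is kept.
        if threshold is not None and sim < threshold:
            continue
        result[a].append((b, sim))
        result[b].append((a, sim))

    # Keep at most max_similarities_per_row entries, most similar first.
    for sims in result.values():
        sims.sort(key=lambda pair: pair[1], reverse=True)
        del sims[max_similarities_per_row:]
    return result


docs = {
    "a": {"mahout": 1.0, "similarity": 1.0},
    "b": {"mahout": 1.0, "threshold": 1.0},
    "c": {"hadoop": 1.0},
}
sims = row_similarities(docs, max_similarities_per_row=10)
```

In this toy input, row "c" shares no term with the other rows, so it gets an empty result list even though up to 10 similar rows were requested and no threshold was set. If a distance is wanted instead, one convention for cosine is to report 1 minus the similarity.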
