Thanks, this is quite clear and reasonable. So the cutoff is based on a
lack of term cooccurrences, not on the distance measure. The optional
'threshold' is what is based on the distance measure.
BTW I assume the 'distance' returned is expressed in the distance
measure's units? So using cosine as a distance measure, a value near 0
is actually quite similar because the measure is 1 - (cosine of the
angle between the vectors)?
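
For reference, the arithmetic in that question can be sketched as
follows. This assumes the distance is defined as 1 minus the cosine
similarity; the thread does not confirm that this is the value
RowSimilarityJob reports, and the function names are illustrative, not
Mahout APIs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an all-zero vector as not similar
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Assumed definition: 1 - cosine similarity, so near 0 means similar."""
    return 1.0 - cosine_similarity(a, b)

# Vectors pointing the same way: distance close to 0 (very similar).
same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Orthogonal vectors: distance close to 1 (nothing in common).
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
```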
On 5/13/12 9:10 AM, Sebastian Schelter wrote:
Hi Pat,
RowSimilarityJob allows the use of a lot of different similarity
measures (cosine, jaccard coefficient, number of cooccurrences, etc) all
of which compute a single number for a pair of vectors that denotes how
similar those are. All these measures have the characteristic that two
vectors that do not share at least one non-zero value in a single
dimension are considered not similar (have similarity 0).
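
A minimal sketch of that property, assuming each vector is stored as a
dict holding only its non-zero entries. These are simplified textbook
versions of the measures, not Mahout's implementations, and all names
are illustrative.

```python
import math

def cooccurrences(a, b):
    """Number of dimensions in which both vectors are non-zero."""
    return len(a.keys() & b.keys())

def jaccard(a, b):
    """Shared non-zero dimensions divided by all non-zero dimensions."""
    union = len(a.keys() | b.keys())
    return cooccurrences(a, b) / union if union else 0.0

def cosine(a, b):
    """Cosine similarity; only shared dimensions contribute to the dot product."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

doc_a = {"cat": 1.0, "dog": 2.0}
doc_b = {"fish": 3.0}              # shares no dimension with doc_a
doc_c = {"dog": 1.0, "fish": 1.0}  # shares "dog" with doc_a

# No shared dimension: every measure comes out as 0.
disjoint = (cooccurrences(doc_a, doc_b), jaccard(doc_a, doc_b), cosine(doc_a, doc_b))
# One shared dimension: every measure is positive.
overlapping = (cooccurrences(doc_a, doc_c), jaccard(doc_a, doc_c), cosine(doc_a, doc_c))
```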
In general, an all-pairs comparison of the kind RowSimilarityJob has to
produce has quadratic complexity and is therefore not scalable.
If we have sparse data such as text or ratings, however, we can exploit
the fact that we only need to compare pairs which share at least one
non-zero value in a dimension. This is the basic idea RowSimilarityJob
uses to avoid a full all-pairs comparison.
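
The idea above can be sketched on a single machine with an inverted
index from dimension to the rows that are non-zero in it. This only
illustrates the principle; it is not the MapReduce implementation in
RowSimilarityJob, and the names and data are made up for the example.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(rows):
    """Return the pairs of row ids that share at least one non-zero dimension.

    rows maps a row id to a sparse vector stored as {dimension: value}.
    """
    # Inverted index: dimension -> ids of rows with a non-zero value there.
    index = defaultdict(list)
    for row_id, vector in rows.items():
        for dim, value in vector.items():
            if value != 0:
                index[dim].append(row_id)

    # Only rows that meet in some dimension ever become a pair.
    pairs = set()
    for row_ids in index.values():
        for i, j in combinations(sorted(row_ids), 2):
            pairs.add((i, j))
    return pairs

rows = {
    0: {"a": 1.0},
    1: {"a": 2.0, "b": 1.0},
    2: {"b": 3.0},
    3: {"c": 1.0},  # shares nothing with any other row
}
pairs = candidate_pairs(rows)
```

With these four rows a full all-pairs comparison would look at 6 pairs,
while only rows 0 and 1 (via "a") and rows 1 and 2 (via "b") need a
similarity computation. Row 3 never appears in the output at all.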
In some real-world use cases you will furthermore encounter a lot of
pairs with near-zero similarities that are of little value to you. To
avoid computing these, RowSimilarityJob provides the option to specify
a minimum threshold so that it ignores pairs with a similarity value
below this threshold. This threshold is data-dependent and you have to
find it experimentally.
--sebastian
On 13.05.2012 17:33, Pat Ferrel wrote:
To paraphrase:
There is some internal threshold to be considered 'similar'. This is the
one supplied with the 'threshold' option mentioned below and I need to
do a special build to get this option activated? I assume it is not
active because it has not been tested well?
So currently how is the threshold calculated? How can I determine its
value? Can I vote that this be activated as an optional parameter in the
future?
I ask this in part because I want to use RowSimilarity in an experiment
to do something like a non-partitioning hierarchical clustering where
I'll need to find close centroids in clusters calculated with different
levels of specificity.
On 5/12/12 11:38 PM, Sebastian Schelter wrote:
This could simply be due to the fact that there are fewer similar docs
than the number specified in 'maxSimilaritiesPerRow'.
consider() is only invoked if a threshold was specified.
Best,
Sebastian
On 13.05.2012 08:25, Suneel Marthi wrote:
Pat's question was that he was seeing fewer documents than specified
by 'maxSimilaritiesPerRow'. This could be happening due to the
'consider' functionality of the applied similarity measure.
________________________________
From: Sebastian Schelter<[email protected]>
To: [email protected]
Sent: Sunday, May 13, 2012 2:08 AM
Subject: Re: RowSimilarity
The option 'maxSimilaritiesPerRow' determines the maximum number of
similar docs/items/rows per row. Whether there are enough similar rows
per row depends on your data, so you can't always get 20 similar docs.
The option 'threshold' determines the minimum similarity value for a
pair of docs (pairs below it are dropped). This option is not
activated by default, however.
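
How the two options interact can be sketched like this. It assumes the
similarities of one row to its candidate rows are already computed, and
it models "not similar" as a similarity of 0. The function and variable
names are illustrative, not Mahout's.

```python
import heapq

def top_similar(similarities, max_per_row, threshold=None):
    """Keep at most max_per_row entries, best first.

    similarities maps another row's id to its similarity with this row.
    Entries with similarity 0 are never kept; entries below the optional
    threshold are dropped as well.
    """
    kept = [
        (other, sim)
        for other, sim in similarities.items()
        if sim > 0 and (threshold is None or sim >= threshold)
    ]
    return heapq.nlargest(max_per_row, kept, key=lambda pair: pair[1])

sims = {"d1": 0.6, "d2": 0.0, "d3": 0.2, "d4": 0.05}

no_threshold = top_similar(sims, max_per_row=10)
with_threshold = top_similar(sims, max_per_row=10, threshold=0.1)
```

Asking for 10 rows here yields only 3 without a threshold and 2 with a
threshold of 0.1, because 'max_per_row' is an upper bound and not a
guarantee.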
Best,
Sebastian
On 13.05.2012 01:29, Pat Ferrel wrote:
I tried an experiment running RowSimilarity with 16 docs of short
quotations on a similar subject. It looks to me like, using Tanimoto,
the largest pairwise distance allowed for the similar docs was 0.4.
Though I asked for 10 similar docs, I got 0 to 10. I see this same
effect with larger data sets but haven't seen an obvious cut-off point.
I was expecting to be able to make the decision about the cut-off
distance myself. In other words, I was expecting to always get 20
similar docs when I asked for 20. It is useful to see what docs are at
larger distances.
How is RowSimilarity deciding when to cut off the returned docs?