Re: Lucene cosine similarity score for more like this query

2015-02-03 Thread Ali Nazemian
Dear Koji,
Thank you very much.
Do you know what the range of the score is in this new formula? What is a
reasonable threshold for considering two documents similar enough under
this formula?
Regards.

-- 
A.Nazemian


Re: Lucene cosine similarity score for more like this query

2015-02-03 Thread Koji Sekiguchi

Lucene uses the TFIDFSimilarity class to calculate the similarity.
It is based on the idea of the cosine measure, but it modifies the cosine
formula.
Please take a look at the Lucene Practical Scoring Function in the following
Javadoc:

http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
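
To make the formula on that page concrete, here is a minimal Python sketch of the Practical Scoring Function, using the tf, idf, queryNorm, coord and lengthNorm definitions documented for DefaultSimilarity. It assumes all boosts are 1 and ignores the lossy byte encoding of norms. The function names and the collection statistics are made up for illustration. It is only meant to show that nothing in the formula caps the result at 1.

```python
import math


def tf(freq):
    # DefaultSimilarity: square root of the term frequency in the document.
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    # DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def practical_score(query_terms, doc_term_freqs, doc_freqs, num_docs, doc_length):
    """Sketch of coord * queryNorm * sum(tf * idf^2 * norm), all boosts = 1."""
    idfs = {t: idf(doc_freqs[t], num_docs) for t in query_terms}
    # queryNorm = 1 / sqrt(sumOfSquaredWeights); it depends only on the query.
    query_norm = 1.0 / math.sqrt(sum(v * v for v in idfs.values()))
    matched = [t for t in query_terms if doc_term_freqs.get(t, 0) > 0]
    # coord rewards documents that match more of the query terms.
    coord = len(matched) / len(query_terms)
    # lengthNorm = 1 / sqrt(number of terms in the field).
    length_norm = 1.0 / math.sqrt(doc_length)
    total = sum(tf(doc_term_freqs[t]) * idfs[t] ** 2 * length_norm for t in matched)
    return coord * query_norm * total


# Hypothetical statistics: two rare terms that occur often in a short document.
example_score = practical_score(
    query_terms=["lucene", "solr"],
    doc_term_freqs={"lucene": 16, "solr": 16},
    doc_freqs={"lucene": 9, "solr": 9},
    num_docs=1_000_000,
    doc_length=64,
)
print(example_score)
```

Because idf enters squared and the sum is normalized by the query weights but not by a full document vector norm, rare terms with high frequency push the score well past 1.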

Koji
--
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html



RE: Lucene cosine similarity score for more like this query

2015-02-02 Thread Markus Jelsma
Hi - MoreLikeThis is not based on cosine similarity. The idea is that rare
terms (those with high IDF) are extracted from the source document and then
used to build a regular Query. That query follows the same rules as regular
queries, namely the rules of your similarity implementation, which is TFIDF
by default. So, as suggested, if you enable debugging, you can clearly see why
scores can be above 1, or even much higher if queryNorm is disabled when using
BM25 as the similarity.
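
The term-extraction step described above can be sketched roughly as follows in Python. This is not Lucene's MoreLikeThis code. It assumes terms are ranked by term frequency times IDF, that terms below a minimum term or document frequency are dropped, and that the survivors are OR-ed into an ordinary query. All function names, thresholds and statistics are illustrative.

```python
import math
from collections import Counter


def select_interesting_terms(source_tokens, doc_freqs, num_docs,
                             min_term_freq=2, min_doc_freq=5, max_query_terms=25):
    """Pick the terms of the source document with the highest tf * idf."""
    term_freqs = Counter(source_tokens)
    scored = []
    for term, freq in term_freqs.items():
        df = doc_freqs.get(term, 0)
        # Skip terms that are too rare in the document or in the index.
        if freq < min_term_freq or df < min_doc_freq:
            continue
        idf = 1.0 + math.log(num_docs / (df + 1))
        scored.append((freq * idf, term))
    scored.sort(reverse=True)
    return [term for _, term in scored[:max_query_terms]]


def build_query_string(field, terms):
    # The selected terms become a plain disjunction, scored like any query.
    return " OR ".join(f"{field}:{term}" for term in terms)


# Hypothetical source document and index statistics.
tokens = ["the"] * 5 + ["lucene"] * 3 + ["scoring"] * 2 + ["rare"]
stats = {"the": 990, "lucene": 20, "scoring": 50, "rare": 6}
terms = select_interesting_terms(tokens, stats, num_docs=1000, max_query_terms=2)
print(build_query_string("text", terms))
```

Once the query is built, the score of each hit comes from the configured similarity, which is why it behaves like any other query score rather than like a cosine value.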

If you really need cosine similarity between documents, you have to enable term
vectors for the source fields and use them to calculate the angle. The problem
is that this does not scale well: you would need to calculate angles with
virtually all other documents.
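
As a rough illustration of that approach, the sketch below computes the cosine between two term vectors represented as plain term-to-weight dictionaries, and then compares one source document against every other document by brute force. It assumes the term vectors have already been read from the index. The function names and the tiny corpus are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term vectors (dict: term -> weight)."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(source_id, term_vectors):
    """Brute force: one cosine computation per other document in the index."""
    source = term_vectors[source_id]
    scores = [
        (cosine_similarity(source, vector), doc_id)
        for doc_id, vector in term_vectors.items()
        if doc_id != source_id
    ]
    return sorted(scores, reverse=True)


# Hypothetical term vectors built from raw term counts.
corpus = {
    "doc1": Counter("solr lucene lucene scoring".split()),
    "doc2": Counter("lucene scoring scoring".split()),
    "doc3": Counter("cooking recipes".split()),
}
print(most_similar("doc1", corpus))
```

With non-negative term weights the result always lies between 0 and 1, but the loop in `most_similar` touches every document, which is the scaling problem mentioned above.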

M.
 


Re: Lucene cosine similarity score for more like this query

2015-02-02 Thread Ali Nazemian
Dear Erik,
Thank you for your response. Would you please tell me why this score could
be higher than 1, given that cosine similarity cannot be higher than 1?


Re: Lucene cosine similarity score for more like this query

2015-02-02 Thread Dikshant Shahi
Conceptually, your understanding of VSM and cosine similarity is correct.
In text analysis, the range is 0 to 1, as there is no negative similarity.

The scores from handlers that internally use Lucene's cosine similarity can
nevertheless go beyond 1. The reason is that these scores are computed for
each field and then go through further computation, for example summation or
multiplication of the per-field scores, to come up with the final score for
the document. Correct me if my understanding is wrong.

Thanks,
Dikshant





Re: Lucene cosine similarity score for more like this query

2015-02-02 Thread Erik Hatcher
The scoring is the same as Lucene.  To get deeper insight into how a score is 
computed, use Solr’s debug=true mode to see the explain details in the response.
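
For example, a request with debugging enabled could be put together as in the Python sketch below. The host, core name, document id and field names are placeholders, and the `/mlt` path assumes a MoreLikeThis handler has been registered under that name in solrconfig.xml. The code only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode


def mlt_debug_url(base_url, doc_query, fields):
    """Build a MoreLikeThis request URL that asks Solr for score explanations."""
    params = {
        "q": doc_query,              # query that selects the source document
        "mlt.fl": ",".join(fields),  # fields to take the interesting terms from
        "fl": "id,score",            # return the score next to each similar doc
        "debug": "true",             # adds the explain details to the response
        "wt": "json",
    }
    return f"{base_url}/mlt?{urlencode(params)}"


# Hypothetical core and document id.
url = mlt_debug_url("http://localhost:8983/solr/collection1", "id:123", ["title", "text"])
print(url)
```

The explain entries in the debug section of the response break each score down into its tf, idf, queryNorm and fieldNorm parts, which shows where a value such as 12 comes from.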

Erik

 On Feb 2, 2015, at 10:49 AM, Ali Nazemian alinazem...@gmail.com wrote:
 
 Hi,
 I was wondering what range of scores a more like this query in Solr can
 return. I know that Lucene uses cosine similarity in the vector space
 model to calculate the similarity between two documents. I also know
 that cosine similarity is between -1 and 1, but what I don't understand
 is why the score returned by a more like this query could be 12, for
 example. Would you please explain how the score is calculated in Solr?
 Thank you very much.
 
 Best regards.
 
 -- 
 A.Nazemian