Re: Lucene cosine similarity score for more like this query
Dear Koji,

Thank you very much. Do you know what the range of the score is in this modified formula? What would be a reasonable threshold for considering two documents similar enough under it?

Regards.
-- A.Nazemian

On Tue, Feb 3, 2015 at 1:35 PM, Koji Sekiguchi k...@r.email.ne.jp wrote:
Lucene uses the TFIDFSimilarity class to calculate the similarity. It is built on the idea of cosine measurement, but it modifies the cosine formula. Please take a look at the Lucene Practical Scoring Function in the following Javadoc: http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
Koji
Re: Lucene cosine similarity score for more like this query
Lucene uses the TFIDFSimilarity class to calculate the similarity. It is built on the idea of cosine measurement, but it modifies the cosine formula. Please take a look at the Lucene Practical Scoring Function in the following Javadoc:

http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

Koji
-- http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html

On 2015/02/03 5:39, Ali Nazemian wrote:
Dear Erik, Thank you for your response. Would you please tell me why this score could be higher than 1, while cosine similarity cannot be higher than 1?
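To make the Practical Scoring Function from that Javadoc more concrete, here is a minimal Python sketch. It assumes the DefaultSimilarity choices of Lucene 4.x (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), length norm = 1 / sqrt(numTerms), queryNorm = 1 / sqrt(sum of squared idf weights), coord = matched / total query terms) and ignores all boosts and the lossy one-byte encoding of norms. The function names and the sample statistics are made up for illustration, so the number it produces will not match a real index exactly.

```python
import math

def tf(freq):
    # DefaultSimilarity-style term frequency factor
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    # DefaultSimilarity-style inverse document frequency
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def length_norm(num_terms):
    # Shorter fields get a higher norm (boosts ignored in this sketch)
    return 1.0 / math.sqrt(num_terms)

def practical_score(query_terms, doc_term_freqs, doc_freqs, num_docs):
    """Simplified score(q, d) for a non-empty list of distinct query terms.

    doc_term_freqs: term -> frequency inside the scored document
    doc_freqs:      term -> number of documents containing the term
    """
    idfs = {t: idf(doc_freqs[t], num_docs) for t in query_terms}
    query_norm = 1.0 / math.sqrt(sum(w * w for w in idfs.values()))
    norm = length_norm(sum(doc_term_freqs.values()))
    matched = [t for t in query_terms if doc_term_freqs.get(t, 0) > 0]
    coord = len(matched) / len(query_terms)
    total = sum(tf(doc_term_freqs[t]) * idfs[t] ** 2 * norm for t in matched)
    return coord * query_norm * total

# Hypothetical statistics: two rare terms in a short document of a large index
score = practical_score(
    query_terms=["lucene", "scoring"],
    doc_term_freqs={"lucene": 3, "scoring": 2},
    doc_freqs={"lucene": 9, "scoring": 99},
    num_docs=1_000_000,
)
print(score)
```

Because idf enters squared and nothing divides by the document vector's length the way a true cosine does, nothing in the formula caps the result at 1, which is consistent with the scores around 12 mentioned in the original question.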
RE: Lucene cosine similarity score for more like this query
Hi,

MoreLikeThis is not based on cosine similarity. The idea is that rare terms (those with a high IDF) are extracted from the source document and then used to build a regular Query. That query follows the same rules as any other query, that is, the rules of your similarity implementation, which is TFIDF by default. So, as suggested, if you enable debugging you can see clearly why scores can be above 1, or even much higher if queryNorm is disabled when using BM25 as the similarity.

If you really need cosine similarity between documents, you have to enable term vectors for the source fields and use them to calculate the angle. The problem is that this does not scale well: you would need to calculate angles with virtually all other documents.

M.

-Original message-
From: Ali Nazemian alinazem...@gmail.com
Sent: Monday 2nd February 2015 21:39
To: solr-user@lucene.apache.org
Subject: Re: Lucene cosine similarity score for more like this query

Dear Erik, Thank you for your response. Would you please tell me why this score could be higher than 1, while cosine similarity cannot be higher than 1?
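The term-vector approach mentioned above can be sketched in a few lines of Python. This is only an illustration under assumptions: the term vectors are taken to be already fetched and turned into plain dicts of term -> weight (for example tf-idf), and the function names and sample data are hypothetical rather than part of any Solr or Lucene API.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term vectors (term -> weight)."""
    dot = sum(w * vec_b[t] for t, w in vec_a.items() if t in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)

def most_similar(source_vec, other_vecs):
    """Brute force: compares the source against every other document."""
    scored = [(cosine_similarity(source_vec, vec), doc_id)
              for doc_id, vec in other_vecs.items()]
    return sorted(scored, reverse=True)

# Hypothetical term vectors
source = {"solr": 2.0, "lucene": 1.0}
others = {
    "d1": {"solr": 4.0, "lucene": 2.0},  # same direction as the source
    "d2": {"hadoop": 3.0},               # no shared terms
}
ranking = most_similar(source, others)
print(ranking)
```

With non-negative weights the result always falls between 0 and 1, unlike the MoreLikeThis score. The loop in `most_similar` also shows the scaling problem: every lookup touches every other document.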
Re: Lucene cosine similarity score for more like this query
Dear Erik,

Thank you for your response. Would you please tell me why this score could be higher than 1, while cosine similarity cannot be higher than 1?

On Feb 2, 2015 7:32 PM, Erik Hatcher erik.hatc...@gmail.com wrote:
The scoring is the same as Lucene. To get deeper insight into how a score is computed, use Solr's debug=true mode to see the explain details in the response.
Erik
Re: Lucene cosine similarity score for more like this query
Conceptually, your understanding of VSM cosine similarity is correct. In text analysis the range is 0 to 1, as there is no negative similarity.

The scores from handlers that internally use Lucene's cosine-based similarity can still go beyond 1. The reason is that these scores are computed for each field and then go through further computation, for example summation or multiplication of the per-field scores, to arrive at the final score for the document.

Correct me if my understanding is wrong.

Thanks,
Dikshant

On Tue, Feb 3, 2015 at 2:53 AM, Markus Jelsma markus.jel...@openindex.io wrote:
Hi, MoreLikeThis is not based on cosine similarity. The idea is that rare terms (those with a high IDF) are extracted from the source document and then used to build a regular Query. That query follows the same rules as any other query, that is, the rules of your similarity implementation, which is TFIDF by default. So, as suggested, if you enable debugging you can see clearly why scores can be above 1, or even much higher if queryNorm is disabled when using BM25 as the similarity. If you really need cosine similarity between documents, you have to enable term vectors for the source fields and use them to calculate the angle. The problem is that this does not scale well: you would need to calculate angles with virtually all other documents.
M.
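The term-extraction step quoted above (rare, high-IDF terms are pulled from the source document and turned into an ordinary query) can be sketched roughly as follows. This is a simplified illustration, not the MoreLikeThis implementation: it ranks terms by tf * idf, and the function name, the default thresholds and the sample statistics are assumptions made for the example. Stop-word and word-length filtering are left out.

```python
import math

def interesting_terms(source_tf, doc_freqs, num_docs,
                      max_terms=25, min_tf=2, min_df=5):
    """Pick the terms of a source document that best characterise it.

    source_tf: term -> frequency in the source document
    doc_freqs: term -> number of documents in the index containing the term
    """
    scored = []
    for term, freq in source_tf.items():
        df = doc_freqs.get(term, 0)
        if freq < min_tf or df < min_df:
            continue  # too infrequent in the document or in the index
        idf = 1.0 + math.log(num_docs / (df + 1))
        scored.append((freq * idf, term))
    scored.sort(reverse=True)
    return [term for _, term in scored[:max_terms]]

# Hypothetical statistics for one source document in a 1000-document index
source_tf = {"the": 10, "lucene": 3, "rare": 1}
doc_freqs = {"the": 990, "lucene": 20, "rare": 6}
terms = interesting_terms(source_tf, doc_freqs, num_docs=1000)
print(" OR ".join(terms))  # the selected terms become clauses of a regular query
```

Each selected term becomes one clause of an ordinary query, and the matching clauses' scores are added up, which is one reason the final number is not confined to the 0 to 1 range.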
Re: Lucene cosine similarity score for more like this query
The scoring is the same as Lucene. To get deeper insight into how a score is computed, use Solr's debug=true mode to see the explain details in the response.

Erik

On Feb 2, 2015, at 10:49 AM, Ali Nazemian alinazem...@gmail.com wrote:
Hi,

I was wondering what the range of the score returned by a more like this query in Solr is. I know that Lucene uses cosine similarity in the vector space model to calculate the similarity between two documents. I also know that cosine similarity is between -1 and 1, but what I don't understand is why the score returned by a more like this query could be 12, for example. Would you please explain the calculation process in Solr?

Thank you very much.
Best regards.
-- A.Nazemian
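As a rough illustration of the debug=true suggestion, the following Python sketch builds a MoreLikeThis request URL with debugging switched on. The host, the core name `collection1`, the field names and the document id are placeholders, and it assumes a MoreLikeThis request handler is registered at `/mlt` in solrconfig.xml, which may not be the case in every installation.

```python
from urllib.parse import urlencode

def mlt_debug_url(base_url, doc_query, fields):
    """Build a MoreLikeThis request that also returns the score explanation."""
    params = {
        "q": doc_query,              # selects the source document
        "mlt.fl": ",".join(fields),  # fields to extract similar terms from
        "fl": "id,score",            # return the score with each hit
        "debug": "true",             # include the explain section
        "wt": "json",
    }
    return base_url.rstrip("/") + "/mlt?" + urlencode(params)

# Placeholder host, core, id and fields
url = mlt_debug_url("http://localhost:8983/solr/collection1",
                    "id:42", ["title", "body"])
print(url)
```

Opening the resulting URL should return the similar documents together with an explain entry per hit, which breaks each score down into its tf, idf, norm and queryNorm factors.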