Re: Judging the MoreLikeThis results for relevancy
So let me answer point by point:

1) Similarity is misleading here if you interpret it as a probabilistic measure. Given a query, there is no such thing as the "ideal document". With both TF-IDF and BM25 (which handles the problem better) you are scoring the document: the higher the score, the higher the relevance of that document for the given query. BM25 does a better job here because its relevance function hits a saturation point, so it is closer to your expectation. This blog post from Doug should help [1].

2) "if document vector A is at a distance of 5 and 10 units from document vectors B and C respectively then can't we say that B is twice as relevant to A as C is to A? Or in terms of distance, C is twice as distant to A as B is to A?"

Not in Lucene, at least not strictly. The current MLT uses TF-IDF as its scoring formula. When the score of B is double the score of C, you can say that, for Lucene, B is twice as relevant to A as C is. From a user perspective this can be different (quoting Doug: "If an article mentions “dog” six times is it twice as relevant as an article mentioning “dog” 3 times? Most users say no").

3) Under the hood, MLT builds a Lucene query and retrieves documents from the index. When building the MLT query, to keep it simple, it extracts from the seed document a subset of terms that are considered representative of it (let's call them relevant terms). This is controlled through a parameter, but usually, and by default, you collect a limited set of relevant terms (not all the terms). When retrieving similar documents, you score them using TF-IDF (and, in the future, BM25).

So, first of all, you can have documents with higher scores than the original (it doesn't make sense in a probabilistic world, but this is how Lucene works). Reversing the documents, i.e. applying MLT to document B, you could build a slightly different query. So:

given seed(a), score(b) != score(a) given seed(b)

I understand you think it doesn't make sense, but this is how Lucene works.
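
To make the asymmetry in 3) concrete, here is a minimal Python sketch. It is not Lucene's MoreLikeThis code: the toy corpus, the helper names (idf, mlt_query, score) and the simplified TF-IDF formula (no length normalization, no boosts) are all assumptions for illustration. It only shows the mechanism: each seed yields its own limited set of relevant terms, so the two directions run different queries.

```python
import math
from collections import Counter

def idf(term, docs):
    # Smoothed inverse document frequency over the toy corpus (assumed formula).
    df = sum(1 for tokens in docs.values() if term in tokens)
    return 1.0 + math.log(len(docs) / (1.0 + df))

def mlt_query(seed_tokens, docs, max_terms):
    # Keep only the seed's highest tf*idf terms: the "relevant terms".
    tf = Counter(seed_tokens)
    ranked = sorted(tf, key=lambda t: (-tf[t] * idf(t, docs), t))
    return ranked[:max_terms]

def score(query_terms, doc_tokens, docs):
    # Simplified TF-IDF score: sum of sqrt(tf) * idf^2 over the query terms.
    tf = Counter(doc_tokens)
    return sum(math.sqrt(tf[t]) * idf(t, docs) ** 2 for t in query_terms)

# Hypothetical documents, already tokenized.
docs = {
    "a": "solr solr solr search index".split(),
    "b": "solr solr search search search ranking ranking".split(),
    "c": "cooking pasta recipe".split(),
    "d": "solr solr solr solr solr index index".split(),
}

query_a = mlt_query(docs["a"], docs, max_terms=2)  # relevant terms of seed a
query_b = mlt_query(docs["b"], docs, max_terms=2)  # relevant terms of seed b

b_given_a = score(query_a, docs["b"], docs)  # score(b) given seed(a)
a_given_b = score(query_b, docs["a"], docs)  # score(a) given seed(b)
a_given_a = score(query_a, docs["a"], docs)  # the seed scored by its own query
d_given_a = score(query_a, docs["d"], docs)  # another document, same query

print(query_a, query_b)
print(b_given_a, a_given_b)
print(a_given_a, d_given_a)
```

In this toy corpus the two seeds produce different queries, the two directions give different scores, and document d scores higher than seed a does against its own query, because d repeats a's relevant terms more often.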
I also understand that users often want a percentage out of an MLT query. I will certainly work in that direction, step by step; first I need to have the MLT refactor approved and patched :)

[1] https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/

-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
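
On the saturation point from 1) and the post in [1], a small Python sketch may help. It compares only the term-frequency factors: sqrt(freq), as used by Lucene's classic TF-IDF similarity, against the textbook BM25 factor with the usual defaults k1=1.2 and b=0.75. The function names and the document lengths are assumptions for illustration, and a full score also involves IDF and other factors that are left out here.

```python
import math

def classic_tf(freq):
    # Term-frequency factor of classic TF-IDF: grows without bound.
    return math.sqrt(freq)

def bm25_tf(freq, k1=1.2, b=0.75, doc_len=100.0, avg_doc_len=100.0):
    # Textbook BM25 term-frequency factor: approaches k1 + 1 as freq grows.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return freq * (k1 + 1.0) / (freq + norm)

for freq in (1, 3, 6, 12, 100):
    print(f"freq={freq:>3}  classic={classic_tf(freq):7.3f}  bm25={bm25_tf(freq):6.3f}")
```

Going from 3 to 6 mentions of "dog" does not double either factor, and the BM25 factor flattens out much sooner, which is the saturation behaviour mentioned above.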
Re: Judging the MoreLikeThis results for relevancy
Thanks for the reply, Alessandro.

Can you please elaborate on the point "a document which has a score 50% of the original doc score, it doesn't mean it is 50% similar"? I did not understand this for two reasons:

1. In the end, we are calculating a similarity score between documents when we solve the problem of search, where the search query is also treated as a small document. Similarity inherently means how similar one thing is to another.

2. If we think about the vector representations of documents in a multidimensional space, we are basically calculating the "distance" between these documents, and we interpret that distance as "similarity". The farther apart the document vectors are in that space, the less similar the documents are to each other. How we calculate the distance is one thing (e.g. cosine distance, Euclidean distance, etc.), but once we agree upon a distance/similarity calculation method, if document vector A is at a distance of 5 and 10 units from document vectors B and C respectively, then can't we say that B is twice as relevant to A as C is to A? Or, in terms of distance, that C is twice as distant from A as B is from A?

I found this response from jlman in the following thread very similar to my solution:
http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=print_post=561671

He also warns about the scores between two documents not being bidirectional. If all else remains constant (relevancy algorithm, number of documents in the index, etc.), why is the relevancy between two documents, calculated with the approach that I mentioned, not bidirectional? That is, why is it possible that document A is more similar to B than B is similar to A? When I think in terms of a multidimensional vector space, this does not make sense at all, because the distance between A and B in that space is not going to change provided all else remains constant (relevancy algorithm, number of documents in the index, etc.).
If A is at a distance of 5 units from B, then B is also at a distance of 5 units from A, isn't it?

Thanks,
Arnold

On Thu, Feb 8, 2018 at 7:02 AM, Alessandro Benedetti wrote:

> Hi,
> I have been personally working a lot with the MoreLikeThis and I am close to
> contribute a refactor of that module ( to break up the monolithic giant
> facade class mostly) .
>
> First of all the MoreLikeThis handler will return the original document (
> not scored) + the similar documents(scored).
> The original document is not considered by the MoreLikeThis query, so it is
> not returned as part of the results of the MLT lucene query, it is just
> added to the response in the beginning.
>
> if I remember well, but I am unable to check at the moment, you should be
> able to get the original document in the response set ( with max score)
> using the More Like This query parser.
> Please double check that
>
> Generally speaking at the moment TF-IDF is used under the hood, which means
> that sometime the score is not probabilistic.
> So a document which has a score 50% of the original doc score, it doesn't
> mean it is 50% similar, but for your use case it may be a feasible
> approximation.
Re: Judging the MoreLikeThis results for relevancy
Hi,
I have personally been working a lot with MoreLikeThis, and I am close to contributing a refactor of that module (mostly to break up the monolithic giant facade class).

First of all, the MoreLikeThis handler returns the original document (not scored) plus the similar documents (scored). The original document is not considered by the MoreLikeThis query, so it is not returned as part of the results of the MLT Lucene query; it is just added at the beginning of the response.

If I remember correctly (I am unable to check at the moment), you should be able to get the original document in the result set (with the max score) using the More Like This query parser. Please double-check that.

Generally speaking, TF-IDF is currently used under the hood, which means that sometimes the score is not probabilistic. So a document whose score is 50% of the original document's score is not necessarily 50% similar, but for your use case it may be a feasible approximation.

-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
Judging the MoreLikeThis results for relevancy
Hi,
I am using the MoreLikeThis handler to get related documents for a given document. To determine whether I am getting good results, here is what I do:

The same original document should be returned as the top match. If it is not, then there is some problem with the relevancy.

Then, since the input document will be a 100% match with itself, we can use its absolute score to judge how the other documents (ranked 2nd, 3rd and so on) are doing in terms of relevancy, by comparing their scores to the score of the top result, which is the same input document.

Is this a good idea? Do you see any flaw in this logic?
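
The comparison described above can be sketched in a few lines of Python. This is only an illustration: the function name relative_scores, the document ids and the scores are hypothetical, and the seed's own score has to be obtained separately. As the replies in this thread point out, the resulting ratio is a rough approximation rather than a percentage of similarity, and it can exceed 1.0.

```python
def relative_scores(self_score, results):
    # results: (doc_id, score) pairs from an MLT request.
    # self_score: the score of the seed document against its own MLT query.
    if self_score <= 0:
        raise ValueError("self_score must be positive")
    return [(doc_id, score / self_score) for doc_id, score in results]

# Hypothetical scores, not taken from a real index.
hits = [("doc-9", 10.5), ("doc-7", 4.2), ("doc-3", 2.1)]
ratios = relative_scores(8.4, hits)

for doc_id, ratio in ratios:
    print(f"{doc_id}: {ratio:.2f} of the seed's own score")
```

Here doc-9 ends up above 1.0, which is the case of a document scoring higher than the original.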