https://issues.apache.org/jira/browse/LUCENE-6954
A first draft patch is available; I will review the tests more thoroughly in the new year!

On 29 December 2015 at 13:43, Alessandro Benedetti <abenede...@apache.org> wrote:

> Sure, I will proceed tomorrow with the Jira and the simple patch + tests.
>
> In the meantime let's try to collect some additional feedback.
>
> Cheers
>
> On 29 December 2015 at 12:43, Anshum Gupta <ans...@anshumgupta.net> wrote:
>
>> Feel free to create a JIRA and put up a patch if you can.
>>
>> On Tue, Dec 29, 2015 at 4:26 PM, Alessandro Benedetti <abenede...@apache.org> wrote:
>>
>> > Hi guys,
>> > While I was exploring the way we build the More Like This query, I
>> > discovered a part I am not convinced by:
>> >
>> > Let's see how we build the query:
>> > org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)
>> >
>> > 1) We extract the terms from the interesting fields, adding them to a map:
>> >
>> > Map<String, Int> termFreqMap = new HashMap<>();
>> >
>> > *(we lose the field -> term relation; we no longer know which field the
>> > term came from!)*
>> >
>> > org.apache.lucene.queries.mlt.MoreLikeThis#createQueue
>> >
>> > 2) We build the queue that will contain the query terms. At this point we
>> > connect these terms to a field again, but:
>> >
>> > ...
>> >     // go through all the fields and find the largest document frequency
>> >     String topField = fieldNames[0];
>> >     int docFreq = 0;
>> >     for (String fieldName : fieldNames) {
>> >       int freq = ir.docFreq(new Term(fieldName, word));
>> >       topField = (freq > docFreq) ? fieldName : topField;
>> >       docFreq = (freq > docFreq) ? freq : docFreq;
>> >     }
>> > ...
>> >
>> > We identify the topField as the field with the highest document frequency
>> > for the term t.
>> > Then we build the term query:
>> >
>> > queue.add(new ScoreTerm(word, *topField*, score, idf, docFreq, tf));
>> >
>> > In this way we lose a lot of precision.
>> > Not sure why we do that.
>> > I would prefer to keep the relation between terms and fields.
>> > The quality of the MLT query could improve a lot.
>> > If I run MLT on two fields, *description* and *facilities* for example,
>> > it is likely that I want to find documents with similar terms in the
>> > description and similar terms in the facilities, without mixing things
>> > up and losing the semantics of the terms.
>> >
>> > Let me know your opinion,
>> >
>> > Cheers
>> >
>> > --
>> > --------------------------
>> >
>> > Benedetti Alessandro
>> > Visiting card : http://about.me/alessandro_benedetti
>> >
>> > "Tyger, tyger burning bright
>> > In the forests of the night,
>> > What immortal hand or eye
>> > Could frame thy fearful symmetry?"
>> >
>> > William Blake - Songs of Experience -1794 England
>>
>> --
>> Anshum Gupta
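
To make the concern in the quoted message concrete, here is a minimal sketch in plain Java. It does not use Lucene and it is not the patch attached to the Jira issue: the class MltFieldSketch, its methods and the document frequencies are all hypothetical, and a nested map stands in for ir.docFreq(new Term(fieldName, word)). It only contrasts "pick the field with the largest document frequency" with "keep the field each term was extracted from".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone sketch; none of these names exist in Lucene.
final class MltFieldSketch {

    // Made-up index statistics: field -> (term -> document frequency).
    static Map<String, Map<String, Integer>> sampleDocFreqs() {
        Map<String, Map<String, Integer>> docFreqs = new TreeMap<>();
        docFreqs.put("description", Map.of("pool", 40, "quiet", 12));
        docFreqs.put("facilities", Map.of("pool", 25));
        return docFreqs;
    }

    // Same selection rule as the quoted loop: one field per term,
    // namely the field where the term has the largest document frequency.
    static String topField(String word, String[] fieldNames,
                           Map<String, Map<String, Integer>> docFreqs) {
        String topField = fieldNames[0];
        int docFreq = 0;
        for (String fieldName : fieldNames) {
            int freq = docFreqs
                    .getOrDefault(fieldName, Map.<String, Integer>of())
                    .getOrDefault(word, 0);
            if (freq > docFreq) {
                topField = fieldName;
                docFreq = freq;
            }
        }
        return topField;
    }

    // Field-aware alternative: emit one "field:term" pair for every term,
    // using the field the term actually came from in the source document.
    static List<String> fieldTermPairs(Map<String, List<String>> sourceDocTerms) {
        List<String> pairs = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : sourceDocTerms.entrySet()) {
            for (String term : entry.getValue()) {
                pairs.add(entry.getKey() + ":" + term);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String[] fields = {"description", "facilities"};

        // The source document mentions "pool" only in its facilities field.
        Map<String, List<String>> sourceDoc = new TreeMap<>();
        sourceDoc.put("description", List.of("quiet"));
        sourceDoc.put("facilities", List.of("pool"));

        // Largest-docFreq rule: "pool" is attached to description.
        System.out.println(topField("pool", fields, sampleDocFreqs()) + ":pool");

        // Field-aware rule: "pool" stays attached to facilities.
        System.out.println(fieldTermPairs(sourceDoc));
    }
}
```

With these made-up numbers, the largest-docFreq rule searches for "pool" in description even though the source document only mentions it under facilities, which is the loss of precision described above. The field-aware variant keeps facilities:pool.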