https://issues.apache.org/jira/browse/LUCENE-6954

A first draft patch is available; I will review the tests more thoroughly in the new year!
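
To make the idea concrete, here is a minimal sketch in plain Java. It is not the attached patch and has no Lucene dependency: the class name, method names and the in-memory maps are all hypothetical stand-ins (in particular, a plain map replaces the `ir.docFreq(...)` lookup quoted below). It contrasts the merged term map and single top-field choice described in the thread with a per-field map that keeps the field -> term relation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the behaviour discussed in this thread. All names are hypothetical.
final class FieldAwareTermsSketch {

  // Behaviour described in the thread: terms from every field are merged into
  // one map, so the field each term came from is lost.
  static Map<String, Integer> mergedTermFreqs(Map<String, List<String>> docFields) {
    Map<String, Integer> freqs = new HashMap<>();
    for (List<String> tokens : docFields.values()) {
      for (String token : tokens) {
        freqs.merge(token, 1, Integer::sum);
      }
    }
    return freqs;
  }

  // Mirrors the quoted loop: the word is attached to the single field with the
  // largest document frequency. docFreqByField stands in for the index lookup.
  static String topField(String word, List<String> fieldNames,
                         Map<String, Map<String, Integer>> docFreqByField) {
    String topField = fieldNames.get(0);
    int docFreq = 0;
    for (String fieldName : fieldNames) {
      Map<String, Integer> fieldFreqs =
          docFreqByField.getOrDefault(fieldName, Collections.emptyMap());
      int freq = fieldFreqs.getOrDefault(word, 0);
      if (freq > docFreq) {
        topField = fieldName;
        docFreq = freq;
      }
    }
    return topField;
  }

  // Assumed alternative: keep one term-frequency map per field, so each term
  // can later be queried against the field it was extracted from.
  static Map<String, Map<String, Integer>> perFieldTermFreqs(
      Map<String, List<String>> docFields) {
    Map<String, Map<String, Integer>> perField = new LinkedHashMap<>();
    for (Map.Entry<String, List<String>> entry : docFields.entrySet()) {
      Map<String, Integer> freqs =
          perField.computeIfAbsent(entry.getKey(), k -> new HashMap<>());
      for (String token : entry.getValue()) {
        freqs.merge(token, 1, Integer::sum);
      }
    }
    return perField;
  }
}
```

With a document whose description and facilities both mention "pool", the merged map only keeps a single count for "pool", and `topField` assigns it to whichever field has the higher document frequency in the index. The per-field map keeps both counts separately, which is the relation I would like to preserve.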

On 29 December 2015 at 13:43, Alessandro Benedetti <abenede...@apache.org>
wrote:

> Sure, I will proceed tomorrow with the Jira and the simple patch + tests.
>
> In the meantime let's try to collect some additional feedback.
>
> Cheers
>
> On 29 December 2015 at 12:43, Anshum Gupta <ans...@anshumgupta.net> wrote:
>
>> Feel free to create a JIRA and put up a patch if you can.
>>
>> On Tue, Dec 29, 2015 at 4:26 PM, Alessandro Benedetti <
>> abenede...@apache.org
>> > wrote:
>>
>> > Hi guys,
>> > While I was exploring the way we build the More Like This query, I
>> > found a part I am not convinced by:
>> >
>> >
>> >
>> > Let's see how we build the query :
>> > org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)
>> >
>> > 1) We extract the terms from the interesting fields, adding them to a map:
>> >
>> > Map<String, Int> termFreqMap = new HashMap<>();
>> >
>> > *( we lose the field -> term relation; we no longer know which field the
>> > term came from! )*
>> >
>> > org.apache.lucene.queries.mlt.MoreLikeThis#createQueue
>> >
>> > 2) We build the queue that will contain the query terms. At this point we
>> > connect these terms to a field again, but:
>> >
>> > ...
>> >> // go through all the fields and find the largest document frequency
>> >> String topField = fieldNames[0];
>> >> int docFreq = 0;
>> >> for (String fieldName : fieldNames) {
>> >>   int freq = ir.docFreq(new Term(fieldName, word));
>> >>   topField = (freq > docFreq) ? fieldName : topField;
>> >>   docFreq = (freq > docFreq) ? freq : docFreq;
>> >> }
>> >> ...
>> >
>> >
>> > We identify the topField as the field with the highest document frequency
>> > for the term t.
>> > Then we build the termQuery:
>> >
>> > queue.add(new ScoreTerm(word, *topField*, score, idf, docFreq, tf));
>> >
>> > This way we lose a lot of precision, and I am not sure why we do it.
>> > I would prefer to keep the relation between terms and fields;
>> > this could considerably improve the quality of the MLT query.
>> > For example, if I run MLT on 2 fields, *description* and *facilities*,
>> > it is likely I want to find documents with similar terms in the
>> > description and similar terms in the facilities, without mixing things
>> > up and losing the semantics of the terms.
>> >
>> > Let me know your opinion,
>> >
>> > Cheers
>> >
>> >
>> > --
>> > --------------------------
>> >
>> > Benedetti Alessandro
>> > Visiting card : http://about.me/alessandro_benedetti
>> >
>> > "Tyger, tyger burning bright
>> > In the forests of the night,
>> > What immortal hand or eye
>> > Could frame thy fearful symmetry?"
>> >
>> > William Blake - Songs of Experience -1794 England
>> >
>>
>>
>>
>> --
>> Anshum Gupta
>>



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
