Re: [More Like This] Query building

2016-04-12 Thread Scott Stults

Hi Alessandro,

It's not uncommon for Solr patches to remain uncommitted for months, even
years. In fact some never get merged. Don't let that discourage you!


k/r,
Scott

Re: [More Like This] Query building

2016-03-11 Thread Alessandro Benedetti

I am starting to feel that it is not that easy to contribute improvements or
small fixes to Solr (if they are not of broad interest).
I think this one could be a good improvement to the MLT, but I would love to
discuss it with a committer.
The patch is attached to the issue and has been there for months...
Any feedback would be appreciated. I want to contribute, but I need a
second opinion...

Cheers

Re: [More Like This] Query building

2016-02-11 Thread Alessandro Benedetti

Hi Guys,
Is it possible to get any feedback?
Is there any process to speed up bug resolution / discussions?
I just want to understand whether the patch is not good enough, whether I
need to improve it, or whether simply no one has taken a look...

https://issues.apache.org/jira/browse/LUCENE-6954

Cheers

-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: [More Like This] Query building

2016-01-11 Thread Alessandro Benedetti

Hi guys,
The patch seems fine to me.
I didn't spend much more time on the code, but I checked the tests and the
pre-commit checks, and they pass.
Let me know what you think,

Cheers

-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: [More Like This] Query building

2015-12-31 Thread Alessandro Benedetti

https://issues.apache.org/jira/browse/LUCENE-6954

First draft patch available. I will check the tests more thoroughly in the new year!

-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: [More Like This] Query building

2015-12-29 Thread Anshum Gupta

Feel free to create a JIRA and put up a patch if you can.

On Tue, Dec 29, 2015 at 4:26 PM, Alessandro Benedetti  wrote:

> Hi guys,
> While exploring how we build the More Like This query, I found a part
> I am not convinced by.
>
> Let's see how we build the query:
> org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)
>
> 1) We extract the terms from the fields of interest and add them to a map:
>
> Map<String, Int> termFreqMap = new HashMap<>();
>
> *(We lose the field -> term relation: we no longer know which field the
> term came from!)*
>
> org.apache.lucene.queries.mlt.MoreLikeThis#createQueue
>
> 2) We build the queue that will contain the query terms. At this point we
> connect these terms to a field again, but:
>
> ...
>> // go through all the fields and find the largest document frequency
>> String topField = fieldNames[0];
>> int docFreq = 0;
>> for (String fieldName : fieldNames) {
>>   int freq = ir.docFreq(new Term(fieldName, word));
>>   topField = (freq > docFreq) ? fieldName : topField;
>>   docFreq = (freq > docFreq) ? freq : docFreq;
>> }
>> ...
>
>
> We identify topField as the field with the highest document frequency
> for the term.
> Then we build the term query:
>
> queue.add(new ScoreTerm(word, *topField*, score, idf, docFreq, tf));
>
> This way we lose a lot of precision, and I am not sure why we do it.
> I would prefer to keep the relation between terms and fields, which
> could improve the quality of the MLT query considerably.
> For example, if I run MLT on two fields, *description* and *facilities*,
> I most likely want to find documents with similar terms in the
> description and similar terms in the facilities, without mixing things
> up and losing the semantics of the terms.
>
> Let me know your opinion,
>
> Cheers
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>



-- 
Anshum Gupta
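
To make the concern in the quoted message concrete, here is a minimal sketch in plain Java with no Lucene dependency. It is not the code from the LUCENE-6954 patch. The class name MltFieldDemo, the sample document frequencies, and the "field:term" string pairs are all assumptions made up for illustration; the nested map is only a hypothetical stand-in for ir.docFreq(new Term(fieldName, word)). The topField method mirrors the quoted loop, and perFieldTerms shows one possible way of keeping the term-to-field relation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MltFieldDemo {

    /** Hypothetical index statistics: field -> term -> document frequency. */
    static Map<String, Map<String, Integer>> sampleDocFreqs() {
        Map<String, Map<String, Integer>> df = new LinkedHashMap<>();
        df.put("description", Map.of("pool", 3, "quiet", 40));
        df.put("facilities", Map.of("pool", 120, "parking", 80));
        return df;
    }

    /** Stand-in for ir.docFreq(new Term(field, term)); unknown terms count as 0. */
    static int docFreq(Map<String, Map<String, Integer>> df, String field, String term) {
        Map<String, Integer> perTerm = df.getOrDefault(field, Map.of());
        return perTerm.getOrDefault(term, 0);
    }

    /** Mirrors the quoted loop: pick the field with the largest docFreq for the word. */
    static String topField(Map<String, Map<String, Integer>> df, String[] fieldNames, String word) {
        String topField = fieldNames[0];
        int best = 0;
        for (String fieldName : fieldNames) {
            int freq = docFreq(df, fieldName, word);
            if (freq > best) {
                topField = fieldName;
                best = freq;
            }
        }
        return topField;
    }

    /** Alternative sketch: keep each term attached to the field it was extracted from. */
    static List<String> perFieldTerms(Map<String, List<String>> termsByField) {
        List<String> pairs = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : termsByField.entrySet()) {
            for (String term : entry.getValue()) {
                pairs.add(entry.getKey() + ":" + term);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String[] fields = {"description", "facilities"};
        Map<String, Map<String, Integer>> df = sampleDocFreqs();

        // Terms of the source document, grouped by the field they came from.
        Map<String, List<String>> termsByField = new LinkedHashMap<>();
        termsByField.put("description", List.of("pool", "quiet"));
        termsByField.put("facilities", List.of("parking"));

        // "pool" came from description, but the max-docFreq rule moves it to facilities.
        System.out.println("topField for pool: " + topField(df, fields, "pool"));
        // Keeping the relation preserves description:pool.
        System.out.println("per-field terms: " + perFieldTerms(termsByField));
    }
}
```

In this made-up data the word "pool" is extracted from the description of the source document, yet the max-docFreq rule queries it against facilities, which is the mixing of fields the message objects to.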

Re: [More Like This] Query building

2015-12-29 Thread Alessandro Benedetti

Sure, I will proceed tomorrow with the Jira and the simple patch + tests.

In the meantime let's try to collect some additional feedback.

Cheers

-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
