Re: [More Like This] Query building
Hi Alessandro,

It's not uncommon for Solr patches to remain uncommitted for months, even years. In fact, some never get merged. Don't let that discourage you!

k/r,
Scott

On Fri, Mar 11, 2016 at 11:49 AM, Alessandro Benedetti <abenede...@apache.org> wrote:

> I am starting to feel that it is not that easy to contribute improvements or
> small fixes to Solr (if they are not super interesting to the masses).
> I think this one could be a good improvement to MLT, but I would love to
> discuss it with a committer.
> The patch is attached; it has been there for months.
> Any feedback would be appreciated. I want to contribute, but I need some
> second opinions.
>
> Cheers
Re: [More Like This] Query building
I am starting to feel that it is not that easy to contribute improvements or small fixes to Solr (if they are not super interesting to the masses).
I think this one could be a good improvement to MLT, but I would love to discuss it with a committer.
The patch is attached; it has been there for months.
Any feedback would be appreciated. I want to contribute, but I need some second opinions.

Cheers

On 11 February 2016 at 13:48, Alessandro Benedetti <abenede...@apache.org> wrote:

> Hi guys,
> is it possible to get any feedback?
> Is there any process to speed up bug resolution / discussions?
> I just want to understand whether the patch is not good enough, whether I
> need to improve it, or whether simply no one has taken a look.
>
> https://issues.apache.org/jira/browse/LUCENE-6954
>
> Cheers
Re: [More Like This] Query building
Hi guys,
is it possible to get any feedback?
Is there any process to speed up bug resolution / discussions?
I just want to understand whether the patch is not good enough, whether I need to improve it, or whether simply no one has taken a look.

https://issues.apache.org/jira/browse/LUCENE-6954

Cheers

On 11 January 2016 at 15:25, Alessandro Benedetti <abenede...@apache.org> wrote:

> Hi guys,
> the patch seems fine to me.
> I didn't spend much more time on the code, but I checked the tests and the
> pre-commit checks.
> Let me know,
>
> Cheers

--
Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
Re: [More Like This] Query building
Hi guys,
the patch seems fine to me.
I didn't spend much more time on the code, but I checked the tests and the pre-commit checks.
Let me know,

Cheers

On 31 December 2015 at 18:40, Alessandro Benedetti <abenede...@apache.org> wrote:

> https://issues.apache.org/jira/browse/LUCENE-6954
>
> First draft patch available; I will check the tests more thoroughly in the
> new year!

--
Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti
Re: [More Like This] Query building
https://issues.apache.org/jira/browse/LUCENE-6954

First draft patch available; I will check the tests more thoroughly in the new year!

On 29 December 2015 at 13:43, Alessandro Benedetti <abenede...@apache.org> wrote:

> Sure, I will proceed tomorrow with the Jira and the simple patch + tests.
>
> In the meantime let's try to collect some additional feedback.
>
> Cheers

--
Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti
Re: [More Like This] Query building
Feel free to create a JIRA and put up a patch if you can.

On Tue, Dec 29, 2015 at 4:26 PM, Alessandro Benedetti <abenede...@apache.org> wrote:

> Hi guys,
> While I was exploring the way we build the More Like This query, I
> discovered a part I am not convinced of.
>
> Let's see how we build the query:
>
> org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)
>
> 1) We extract the terms from the interesting fields, adding them to a map:
>
>   Map termFreqMap = new HashMap<>();
>
> *(we lose the field -> term relation; we no longer know which field the
> term came from!)*
>
> org.apache.lucene.queries.mlt.MoreLikeThis#createQueue
>
> 2) We build the queue that will contain the query terms. At this point we
> connect these terms to some field again, but:
>
>   ...
>   // go through all the fields and find the largest document frequency
>   String topField = fieldNames[0];
>   int docFreq = 0;
>   for (String fieldName : fieldNames) {
>     int freq = ir.docFreq(new Term(fieldName, word));
>     topField = (freq > docFreq) ? fieldName : topField;
>     docFreq = (freq > docFreq) ? freq : docFreq;
>   }
>   ...
>
> We identify the topField as the field with the highest document frequency
> for the term t.
> Then we build the termQuery:
>
>   queue.add(new ScoreTerm(word, *topField*, score, idf, docFreq, tf));
>
> In this way we lose a lot of precision.
> Not sure why we do that.
> I would prefer to keep the relation between terms and fields.
> The quality of the MLT query could improve a lot.
> For example, if I run MLT on 2 fields, *description* and *facilities*, it
> is likely I want to find documents with similar terms in the description
> and similar terms in the facilities, without mixing things up and losing
> the semantics of the terms.
>
> Let me know your opinion,
>
> Cheers
>
> --
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti

--
Anshum Gupta
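To make the field-selection concern above concrete, here is a minimal, self-contained sketch. It is not the Lucene code and not the attached patch: plain maps stand in for the statistics that ir.docFreq(new Term(fieldName, word)) would return, and MltFieldSketch, pickTopField, and termFreqsPerField are hypothetical names used only for illustration. The first method mirrors the quoted loop; the second shows one possible way to keep term counts separate per field.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: maps replace the index reader; all names are hypothetical. */
public class MltFieldSketch {

    /** Mirrors the quoted loop: the field with the largest document frequency wins. */
    public static String pickTopField(String[] fieldNames, String word,
                                      Map<String, Map<String, Integer>> docFreqByField) {
        String topField = fieldNames[0];
        int docFreq = 0;
        for (String fieldName : fieldNames) {
            // Stand-in for ir.docFreq(new Term(fieldName, word)).
            int freq = docFreqByField
                    .getOrDefault(fieldName, Collections.<String, Integer>emptyMap())
                    .getOrDefault(word, 0);
            if (freq > docFreq) {
                topField = fieldName;
                docFreq = freq;
            }
        }
        return topField;
    }

    /** Field-aware alternative: one term-frequency map per field instead of a shared one. */
    public static Map<String, Map<String, Integer>> termFreqsPerField(
            Map<String, String[]> tokensByField) {
        Map<String, Map<String, Integer>> perField = new LinkedHashMap<>();
        for (Map.Entry<String, String[]> entry : tokensByField.entrySet()) {
            Map<String, Integer> freqs =
                    perField.computeIfAbsent(entry.getKey(), k -> new LinkedHashMap<>());
            for (String token : entry.getValue()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return perField;
    }

    public static void main(String[] args) {
        String[] fields = {"description", "facilities"};

        // Hypothetical index statistics: field -> term -> document frequency.
        Map<String, Integer> descriptionDf = new LinkedHashMap<>();
        descriptionDf.put("pool", 120);
        Map<String, Integer> facilitiesDf = new LinkedHashMap<>();
        facilitiesDf.put("pool", 15);
        Map<String, Map<String, Integer>> docFreqByField = new LinkedHashMap<>();
        docFreqByField.put("description", descriptionDf);
        docFreqByField.put("facilities", facilitiesDf);

        // Even if "pool" came from the facilities field of the input document,
        // the max-docFreq rule assigns it to description.
        System.out.println(pickTopField(fields, "pool", docFreqByField)); // prints "description"

        Map<String, String[]> tokensByField = new LinkedHashMap<>();
        tokensByField.put("description", new String[] {"quiet", "hotel"});
        tokensByField.put("facilities", new String[] {"pool", "pool", "gym"});
        // The field -> term relation survives: "pool" stays under facilities.
        System.out.println(termFreqsPerField(tokensByField));
    }
}
```

With a per-field map like this, each query term could be issued against the field it was extracted from, which is the behaviour the post argues for. Whether the real implementation should do exactly this is the open question in the thread.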
Re: [More Like This] Query building
Sure, I will proceed tomorrow with the Jira and the simple patch + tests.

In the meantime let's try to collect some additional feedback.

Cheers

On 29 December 2015 at 12:43, Anshum Gupta wrote:

> Feel free to create a JIRA and put up a patch if you can.

--
Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti