Re: Can't get phrase field boosting to work using edismax

Jan Høydahl Wed, 06 Apr 2016 04:08:55 -0700

> Oh, hang on... If a phrase is defined as multiple tokens, and pf is used for 
> phrase  boosting, does that mean that even with a regular tokenizer the pf 
> won't work for fields that only contain one word? For example if the title of 
> one document is "John", and the user searches for 'John' (without any 
> surrounding phrase-characters), will edismax not boost this document?


Yes, phrase boost “pf” is only applied if the query contains a phrase, i.e. 
at least two tokens. Thus q=john will not trigger pf, since there is no 
phrase to boost.
My workaround, however, inserts a special token before and after both the 
indexed field and the query, so there will always be 3 or more tokens, and pf 
will kick in. You could use variations of this to have single word queries 
trigger pf boost for text in a field even if it is not an exact match.
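
To illustrate the sentinel idea, here is a minimal sketch in Python. It is a 
model only, not the code from the exactmatch repository and not how Solr's 
analysis chain is implemented. The marker `SENTINEL`, the helpers `wrap` and 
`phrase_matches`, and the whitespace tokenization are all assumptions made 
for the example.

```python
# Illustrative model only; Solr does this in its analysis chain, not in Python.
SENTINEL = "xxsentinelxx"  # hypothetical marker, assumed never to occur in real text


def wrap(text: str) -> list[str]:
    """Lowercase, split on whitespace, and surround the tokens with sentinels."""
    return [SENTINEL, *text.lower().split(), SENTINEL]


def phrase_matches(field_tokens: list[str], query_tokens: list[str]) -> bool:
    """True when query_tokens occur as one contiguous run inside field_tokens."""
    n = len(query_tokens)
    return any(
        field_tokens[i:i + n] == query_tokens
        for i in range(len(field_tokens) - n + 1)
    )


# A one-word title becomes three tokens, so a two-token minimum is satisfied.
print(len(wrap("John")))
# The wrapped query matches only when it covers the whole field.
print(phrase_matches(wrap("John"), wrap("john")))
print(phrase_matches(wrap("John Smith"), wrap("john")))
```

Because the sentinels sit at both ends, the wrapped query can only line up 
with a field whose full content equals the query, which is what makes the 
boost an exact-match boost.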

But I agree with you that it is not very obvious, and could be better 
documented.
A new edismax parameter such as “pfMinClauseSize” could perhaps also be 
useful, to force pf on single-token queries without this workaround. But 
there could be good reasons for the original design choice here that we 
don’t know about...
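
The minimum-clause rule discussed in this thread can be modelled roughly as 
below. This is a Python sketch under stated assumptions, not the 
ExtendedDismaxQParser code: `phrase_boost` is a hypothetical helper, 
whitespace splitting stands in for the field's analyzer, and 
`min_clause_size` plays the role of the suggested “pfMinClauseSize”, which 
is only a proposal here and not an existing Solr parameter.

```python
from typing import Optional


def phrase_boost(query: str, field: str, boost: float,
                 min_clause_size: int = 2) -> Optional[str]:
    """Return a boost clause, or None when the query has too few tokens."""
    tokens = query.split()  # assumption: whitespace split stands in for analysis
    if len(tokens) < min_clause_size:
        return None  # models the reported behaviour: pf is silently skipped
    phrase = " ".join(tokens)
    return f'{field}:"{phrase}"^{boost}'


# With a minimum of two, a single-word query produces no boost clause.
print(phrase_boost("john", "title", 5.0))
# Lowering the minimum lets the single-token case through.
print(phrase_boost("john", "title", 5.0, min_clause_size=1))
```

With the default minimum of two, the single-word call returns None, while 
lowering the minimum yields a clause, which matches what Jimi saw when he 
changed minClauseSize in the debugger.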

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 6. apr. 2016 kl. 11.22 skrev jimi.hulleg...@svensktnaringsliv.se:
> 
> OK, well I'm not sure I agree with you. First of all, you ask me to point my 
> "pf" towards a tokenized field, but I already do that (the fact that all text 
> is tokenized into a single token doesn't change that fact). Also, I don't 
> agree with the view that a single term phrase never is valid/reasonable. In 
> this specific case, with a KeywordTokenizer, I see it as very reasonable 
> indeed. And I would consider a "single term keyword phrase" solution more 
> logical than a workaround using special magical characters inserted in the 
> text. Just my two cents... :)
> 
> Oh, hang on... If a phrase is defined as multiple tokens, and pf is used for 
> phrase  boosting, does that mean that even with a regular tokenizer the pf 
> won't work for fields that only contain one word? For example if the title of 
> one document is "John", and the user searches for 'John' (without any 
> surrounding phrase-characters), will edismax not boost this document?
> 
> /Jimi
> 
> -----Original Message-----
> From: Jan Høydahl [mailto:jan....@cominvent.com] 
> Sent: Wednesday, April 6, 2016 10:43 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Can't get phrase field boosting to work using edismax
> 
> Hi,
> 
> Phrase match via “pf” requires the target field to contain a phrase. A phrase 
> is defined as multiple tokens. Yours does not contain a phrase since you use 
> the KeywordTokenizer, leaving only one token in the field. eDismax pf will 
> thus never kick in. Please point your “pf” towards a tokenized field.
> 
> If what you are trying to achieve is to boost only when the whole query 
> exactly matches the full content of the field, then have a look at my 
> solution here https://github.com/cominvent/exactmatch
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> 
>> 5. apr. 2016 kl. 19.10 skrev jimi.hulleg...@svensktnaringsliv.se:
>> 
>> Some more input, before I call it a day. Just for the heck of it, I tried 
>> changing minClauseSize to 0 using the Eclipse debugger, so that it didn't 
>> return null at line 1203, but instead returned the TermQuery on line 1205. 
>> Then everything worked exactly as it should. The matching document got 
>> boosted as expected. And in the explain output, this can be seen:
>> 
>> [...]
>> 11.274228 = (MATCH) weight(exactTitle:some words^100.0 in 172) 
>> [DefaultSimilarity], result of:
>> [...]
>> 
>> So. In my case, having minClauseSize=2 on line 550 (line 565 for solr 5.5.0) 
>> is the culprit. Is this a bug, or am I using the pf in the wrong way? Can 
>> someone explain why minClauseSize can't be set to 0 here? The comment simply 
>> states "we need at least two or there shouldn't be a boost", but gives no 
>> explanation of *why* at least two is needed.
>> 
>> Regards
>> /Jimi
>> 
>> -----Original Message-----
>> From: jimi.hulleg...@svensktnaringsliv.se 
>> [mailto:jimi.hulleg...@svensktnaringsliv.se]
>> Sent: Tuesday, April 5, 2016 6:51 PM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Can't get phrase field boosting to work using edismax
>> 
>> I have now used the Eclipse debugger to try to understand what is 
>> happening, and it seems like the ExtendedDismaxQParser simply ignores my pf 
>> parameter, since it doesn't interpret it as a phrase query.
>> 
>> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.6.0/
>> solr/core/src/java/org/apache/solr/search/ExtendedDismaxQParser.java
>> 
>> On line 1180 I get a query object of type TermQuery (with the term 
>> "exactTitle:some words"). And in the if statements starting at line it is 
>> quite clear that if it is not a PhraseQuery or a MultiPhraseQuery, or if the 
>> minClauseSize > 1 (and it is set to 2 on line 550) the method simply returns 
>> null (ie ignoring my pf parameter). Why is this happening?
>> 
>> I use Solr 4.6 by the way... I forgot to mention that in my original message.
>> 
>> 
>> -----Original Message-----
>> From: jimi.hulleg...@svensktnaringsliv.se 
>> [mailto:jimi.hulleg...@svensktnaringsliv.se]
>> Sent: Tuesday, April 5, 2016 5:36 PM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Can't get phrase field boosting to work using edismax
>> 
>> OK. Interesting. But... I added a solr.TrimFilterFactory at the end of my 
>> analyzer definition. Shouldn't that take care of the added space at the end? 
>> The admin analysis page indicates that it works as it should, but I still 
>> can't get edismax to boost.
>> 
>> -----Original Message-----
>> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
>> Sent: Tuesday, April 5, 2016 4:42 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Can't get phrase field boosting to work using edismax
>> 
>> It looks like the code constructing the boost phrase for pf will always add 
>> a trailing blank, which is never a problem when a normal tokenizer is used 
>> that removes white space, but the keyword tokenizer will preserve that extra 
>> space, which prevents an exact match.
>> 
>> See line 531:
>> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/
>> solr/core/src/java/org/apache/solr/search/ExtendedDismaxQParser.java
>> 
>> I'd say it's a bug, but more a narrow use case that wasn't considered or 
>> tested.
>> 
>> -- Jack Krupansky
>> 
>> On Tue, Apr 5, 2016 at 7:50 AM, <jimi.hulleg...@svensktnaringsliv.se> wrote:
>> 
>>> Hi,
>>> 
>>> I'm trying to boost documents using a phrase field boosting (ie the 
>>> pf parameter for edismax), but I can't get it to work (ie boosting 
>>> documents where the pf field match the query as a phrase).
>>> 
>>> As far as I can tell, Solr, or more specifically the edismax handler, 
>>> does *something* when I add this parameter. I know this because the 
>>> QTime increases from around 5-10 ms to around 30-40 ms, and the score 
>>> explain structure is *slightly* modified (though with the same final 
>>> score for all documents). But nowhere in the explain structure can I 
>>> see anything about the pf. And I can't understand that. Shouldn't it 
>>> be included in the explain? If not, is there any way to force it to be 
>>> included somehow?
>>> 
>>> The query looks something like this:
>>> 
>>> 
>>> ?q=some+words&rows=10&sort=score+desc&debugQuery=true&fl=objectid,exactTitle,score%2C%5Bexplain+style%3Dtext%5D&qf=title%5E2&qf=swedishText1%5E1&defType=edismax&pf=exactTitle%5E5&wt=xml&indent=true
>>> 
>>> 
>>> I have one document that has the title "some words", and when I do a 
>>> simple query filter with exactTitle:"some words" I get a match for 
>>> that document. So then I would expect that the query above would 
>>> boost this document, and include information about this in the 
>>> explain. But nothing like this happens, and I can't understand why.
>>> 
>>> The field looks like this:
>>> 
>>> <field name="exactTitle" type="keywordText" indexed="true" stored="true"
>>> required="false" multiValued="false" />
>>> 
>>> And the fieldType looks like this:
>>> 
>>> <fieldType name="keywordText" class="solr.TextField"
>>>            positionIncrementGap="100">
>>>   <analyzer>
>>>     <charFilter class="solr.HTMLStripCharFilterFactory" />
>>>     <tokenizer class="solr.KeywordTokenizerFactory" />
>>>     <filter class="solr.LowerCaseFilterFactory" />
>>>   </analyzer>
>>> </fieldType>
>>> 
>>> 
>>> I have also tried boosting this document using a boost query, ie 
>>> bq=exactTitle:"some words", and this works as expected. The document 
>>> score is boosted, and the explain states this very clearly, with this 
>>> segment:
>>> 
>>> [...]
>>> 9.870669 = (MATCH) weight(exactTitle:some words^5.0 in 12) 
>>> [DefaultSimilarity], result of:
>>> [...]
>>> 
>>> Why is this working, but q=some+words&pf=exactTitle^5 not? Shouldn't 
>>> edismax rewrite my "pf query" into something very similar to the "bq query"?
>>> 
>>> Regards
>>> /Jimi
>>> 
>
