edismax, phrase field gets ignored for keyword tokenizer

Stefan Matheis Mon, 07 Nov 2016 07:13:02 -0800

I’m guessing that i’m missing something obvious here - so feel free to
ask for more details and as well point out other directions i should
following.


the problem goes as follows: the input in one case might be a phone
number (like +49 1234 12345678), since we’re using edismax the parts
gets split on whitespaces - which is fine. bringing the same field
(based on TextField) to the party (using qf) doesn’t change a thing.

> responseHeader:
>     params:
>         q: '+49 1234 12345678'
>         defType: edismax
>         qf: person_mobile
>         pf: person_mobile^5
> debug:
>     rawquerystring: '+49 1234 12345678'
>     querystring: '+49 1234 12345678'
>     parsedquery: '(+(+DisjunctionMaxQuery((person_mobile:49)) 
>DisjunctionMaxQuery((person_mobile:1234)) 
>DisjunctionMaxQuery((person_mobile:12345678))) ())/no_coord'
>     parsedquery_toString: '+(+(person_mobile:49) (person_mobile:1234) 
>(person_mobile:12345678)) ()’

but .. as far as i was able to reduce the culprit, that only happens
when i’m using solr.KeywordTokenizerFactory . as soon as i’m changing
that to solr.StandardTokenizerFactory the phrase query appears as
expected:

> responseHeader:
>     params:
>         q: '+49 1234 12345678'
>         defType: edismax
>         qf: person_mobile
>         pf: person_mobile^5
> debug:
>     rawquerystring: '+49 1234 12345678'
>     querystring: '+49 1234 12345678'
>     parsedquery: '(+(+DisjunctionMaxQuery((person_mobile:49)) 
>DisjunctionMaxQuery((person_mobile:1234)) 
>DisjunctionMaxQuery((person_mobile:12345678))) 
>DisjunctionMaxQuery(((person_mobile:"49 1234 12345678")^5.0)))/no_coord'
>     parsedquery_toString: '+(+(person_mobile:49) (person_mobile:1234) 
>(person_mobile:12345678)) ((person_mobile:"49 1234 12345678")^5.0)’

removing the + at the beginning, doesn’t make a difference either
(just mentioning since tokee already asked this on #solr, where i’ve
brought up the question earlier)

it’s absolutely possible i’m focusing on a very wrong assumption - but
since switching the tokenizer does result in such a rather large
behaviour change, i think something is spooky here.

i’ve read older issues and posts from the list, some of them pointed
out that it might be a optimization that edismax brings to the table -
i didn’t find anything specific about that.

oh, and btw: if that would be working - my idea is to drop out
everything for a given phrase that is not a number, to match the phone
number, like this:

> <fieldType name="phone_number" class="solr.TextField">
>   <analyzer>
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="[^\d]" 
> replacement=""/>
>   </analyzer>
> </fieldType>

any thoughts? or wild guesses?

Thanks Stefan

edismax, phrase field gets ignored for keyword tokenizer

Reply via email to