Re: KeywordTokenizerFactory - trouble with "exact" matches

Aleksander Akerø Thu, 30 Jan 2014 04:30:51 -0800

Hi Srinivasa

Yes, I've come to understand that the analyzers will never "see" the
whitespace, so there is no need for a pattern replacement, as Jack points
out. The solution would be to choose which query parser to use for the
query. Jack has also pointed out that the "field" query parser should work
in this particular setting -> http://wiki.apache.org/solr/QueryParser
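
A minimal sketch of how a request using the "field" query parser might be
put together, following the local-params form from Jack's example. The
field name "keyword" and the value "FE 009" are only illustrative
assumptions, and the helper function is hypothetical:

```python
from urllib.parse import urlencode


def field_query_params(field, value):
    """Build request parameters that hand the whole value to the "field" query parser."""
    # Same local-params form as Jack's example: {!field f=myfield}Foo Bar
    return {"q": "{!field f=%s}%s" % (field, value)}


# "keyword" and "FE 009" are assumed values for illustration only.
params = field_query_params("keyword", "FE 009")
query_string = urlencode(params)  # URL-encoded form, ready to append to a select URL
```

Because the whole value goes to the field's query analyzer as one string,
the whitespace is not used as a term delimiter by the parser.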


My problem, though, was that I needed this for only one of the fields in
the schema. For all the other fields, e.g. name, description etc., I would
very much like to make use of the eDisMax functionality. And it seems that
only one query parser can be defined per query, in other words for all
fields. Jack, you may correct me if I'm wrong here :)

This particular customer wanted a wildcard search at both ends of the
phrase, and that complicated the problem. I therefore chose to strip all
whitespace from this field in SQL at index time, using the DIH, and then
to apply EdgeNGramFilterFactory on both sides of the keyword, as in the
config below. That seemed to work pretty nicely.

<!-- #### WildCard search number #### -->
<fieldType name="keyword" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25" side="front"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25" side="back"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

I also added a bit of extra weighting for the "keyword" field so that exact
matches received a higher score.
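
A sketch of how such a weighting might be expressed through the eDisMax
qf parameter. The boost values are purely hypothetical (the actual
weights are not given here), the field names are taken from the examples
earlier in this message, and the helper function is made up:

```python
def edismax_params(user_query, boosts):
    """Build eDisMax request parameters with a per-field boost in qf."""
    # qf takes space-separated entries of the form field^boost
    qf = " ".join("%s^%s" % (field, boost) for field, boost in boosts)
    return {"defType": "edismax", "q": user_query, "qf": qf}


# Hypothetical weights: the "keyword" field counts more than the others.
params = edismax_params(
    "FE 009",
    [("keyword", 5), ("name", 1), ("description", 1)],
)
```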

What this solution doesn't do is exclude values like "EE 009" when
searching for "FE 009", but they come far down the list, which is ok for
the customer, because these results are usually somewhat related and
within the same category.

*Aleksander Akerø*
Systemkonsulent
Mobil: 944 89 054
E-post: aleksan...@gurusoft.no

*Gurusoft AS*
Telefon: 92 44 09 99
Østre Kullerød
www.gurusoft.no


2014-01-30 Jack Krupansky <j...@basetechnology.com>

> The standard, keyword-oriented query parsers will all treat unquoted,
> unescaped white space as term delimiters and ignore the white space. There
> is no way to bypass that behavior. So, your regex will never even see the
> white space - unless you enclose the text and white space in quotes or use
> a backslash to quote each white space character.
>
> You can use the "field" and "term" query parsers to pass a query string as
> if it were fully enclosed in quotes, but that only handles a single term
> and does not allow for multiple terms or any query operators. For example:
>
> {!field f=myfield}Foo Bar
>
> See:
> http://wiki.apache.org/solr/QueryParser
>
> You can also pre-configure the field query parser with the defType=field
> parameter.
>
> -- Jack Krupansky
>
>
> -----Original Message----- From: Srinivasa7
> Sent: Thursday, January 30, 2014 6:37 AM
>
> To: solr-user@lucene.apache.org
> Subject: Re: KeywordTokenizerFactory - trouble with "exact" matches
>
> Hi,
>
> I have a similar kind of problem, where I want to search for words that
> contain spaces, and I want to search with all the spaces stripped.
>
> I have used the following schema for that:
>
> <fieldType name="nospaces" class="solr.TextField" autoGeneratePhraseQueries="true">
>   <analyzer type="index">
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="[^\w]+" replacement="" replace="all"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="[^\w]+" replacement="" replace="all"/>
>   </analyzer>
> </fieldType>
>
>
> And
>
>
> <field name="text_nospaces" type="nospaces" indexed="true" stored="true" omitNorms="true" />
> <copyField source="text" dest="text_nospaces" />
>
>
>
> But it is not finding the right terms. We strip the spaces and index
> lowercase values when we do that.
>
>
> Like: East Enders
>
> When I search for the text 'east end ers', it returns nothing, saying
> no document found.
>
> I realised that Solr runs the query parser before passing the query
> string to the query analyzer defined in the schema.
>
> The query parser tokenizes the query string provided in the query, so
> it sends each token separately to the query analyzer defined in the schema.
>
>
> So is there any way that I can bypass this query parser, or use a
> query parser that treats the entire string as a single phrase?
>
> At the moment I am using the dismax query parser.
>
> Any suggestion would be much appreciated.
>
> Thanks
> Srinivasa
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/
> KeywordTokenizerFactory-trouble-with-exact-matches-tp4114193p4114432.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
