Re: KeywordTokenizerFactory - trouble with "exact" matches

Jack Krupansky Thu, 30 Jan 2014 06:52:26 -0800

I vaguely recall that there was a Jira floating around for multi-word
synonyms that dealt with parsing of spaces as well. And Robert Muir has
(repeatedly) referred to this query parser feature as a "bug". Somehow,
eventually, I think it will be dealt with, but the "difficulty" remains for
now.


-- Jack Krupansky

-----Original Message----- From: Aleksander Akerø

Sent: Thursday, January 30, 2014 9:31 AM
To: solr-user@lucene.apache.org
Subject: Re: KeywordTokenizerFactory - trouble with "exact" matches

Yes, I actually noted that about the filter vs. tokenizer. It's easy to get
confused if you don't have a good understanding of the differences between
tokenizers and filters.

As for the query parser problem, there's always a workaround, but it was
nice to be made aware of. It was sort of a ghost-like problem before.
Although it would be great to have the option to "disable" the
splitting on whitespace even for DisMax, I understand that it is probably
not the most wanted feature for the next Solr release :)
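
The query parser problem discussed in this thread can be pictured with a small toy model. This is only a sketch in plain Python, not Solr code: `analyze` imitates the "nospaces" analyzer quoted further down, and the two parser functions are hypothetical stand-ins for a whitespace-splitting parser and for handing the whole string to the analyzer (which is how the "field" query parser is described below).

```python
import re

def analyze(text):
    # Toy model of the "nospaces" analyzer quoted later in the thread:
    # keyword tokenizer (one token), lowercase, strip non-word characters.
    return re.sub(r"[^\w]+", "", text.lower())

def whitespace_splitting_parser(query):
    # Models a parser that splits on unquoted whitespace first and only
    # then hands each piece to the field's query analyzer.
    return [analyze(piece) for piece in query.split()]

def whole_string_parser(query):
    # Models handing the complete query string to the analyzer in one piece.
    return [analyze(query)]

indexed_term = analyze("East Enders")  # 'eastenders'
split_terms = whitespace_splitting_parser("east end ers")
whole_terms = whole_string_parser("east end ers")

print(indexed_term in split_terms)  # False
print(indexed_term in whole_terms)  # True
```

In this model the analyzer never gets a chance to remove the spaces in the first case, because the pieces arrive already separated.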

*Aleksander Akerø*
Systemkonsulent
Mobil: 944 89 054
E-post: aleksan...@gurusoft.no

*Gurusoft AS*
Telefon: 92 44 09 99
Østre Kullerød
www.gurusoft.no

2014-01-30 Erick Erickson <erickerick...@gmail.com>:

Note, the comments about LowerCaseTokenizer were a red herring. You were
using LowerCaseFilterFactory. Note "Filter" rather than "Tokenizer". So it
would just do what you expected: lowercase the entire input. You would have
used LowerCaseTokenizerFactory in place of KeywordTokenizerFactory, not as a
Filter.

As for the rest, I expect Jack is right, it's the query parsing above
the field input.
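
To make the filter vs. tokenizer distinction concrete, here is a rough sketch in plain Python. These functions are toy stand-ins, not the Solr classes; the splitting rule for the tokenizer (break at non-letters and discard them) is modelled with a regular expression and is an approximation.

```python
import re

def keyword_tokenize(text):
    # Toy stand-in for KeywordTokenizerFactory: the whole input is one token.
    return [text]

def lowercase_filter(tokens):
    # Toy stand-in for LowerCaseFilterFactory: lowercases tokens and never
    # splits them.
    return [t.lower() for t in tokens]

def lowercase_tokenize(text):
    # Toy stand-in for LowerCaseTokenizerFactory: splits at non-letters,
    # drops them, and lowercases what is left.
    return [m.lower() for m in re.findall(r"[^\W\d_]+", text)]

print(lowercase_filter(keyword_tokenize("FE 009")))  # ['fe 009']
print(lowercase_tokenize("FE 009"))                   # ['fe']
```

With the filter the value stays a single lowercased token; with the tokenizer the space and the digits would be lost.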

Best
Erick

On Thu, Jan 30, 2014 at 6:29 AM, Aleksander Akerø
<aleksan...@gurusoft.no> wrote:
> Hi Srinivasa
>
> Yes I've come to understand that the analyzers will never "see" the
> whitespace, thus no need for pattern replacement, like Jack points out. So
> the solution would be to set which parser to use for the query. Also Jack
> has pointed out that the "field" query parser should work in this
> particular setting -> http://wiki.apache.org/solr/QueryParser
>
> My problem, though, was that I needed this for only one of the fields in
> the schema; for all the other fields, e.g. name, description etc., I would
> very much like to make use of the eDisMax functionality. And it seems that
> only one query parser can be defined per query, in other words, for all
> fields. Jack, you may correct me if I'm wrong here :)
>
> This particular customer wanted a wildcard search at both ends of the
> phrase, and that sort of complicated the problem. I therefore chose to
> replace all whitespace for this field in SQL at index time, using the DIH,
> and then to use EdgeNGramFilterFactory on both sides of the keyword, as in
> the config below. That seemed to work pretty nicely.
>
> <!-- #### WildCard search number #### -->
> <fieldType name="keyword" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
>             maxGramSize="25" side="front"/>
>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
>             maxGramSize="25" side="back"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> I also added a bit of extra weighting for the "keyword" field so that exact
> matches received a higher score.
>
> What this solution doesn't do is exclude values like "EE 009" when
> searching for "FE 009", but they are returned far down the list, which is
> OK for the customer, because these results are usually somewhat related or
> within the same category.
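
A possible way to picture what the two chained edge n-gram filters produce is the toy model below. It is plain Python, not Solr code, and it assumes that each filter works on the output of the previous one and that the whitespace was already removed at index time as described above. The function names are made up for illustration.

```python
def edge_ngrams(token, min_size=2, max_size=25, side="front"):
    # Toy model of an edge n-gram filter: prefixes ("front") or suffixes
    # ("back") of the token, with lengths from min_size to max_size.
    top = min(max_size, len(token))
    if side == "front":
        return [token[:n] for n in range(min_size, top + 1)]
    return [token[-n:] for n in range(min_size, top + 1)]

def index_terms(value):
    # Keyword tokenizer + lowercase, then front grams, then back grams of
    # each front gram (a chained filter sees the previous filter's output).
    terms = set()
    for front in edge_ngrams(value.lower(), side="front"):
        terms.update(edge_ngrams(front, side="back"))
    return terms

terms = index_terms("EE009")  # whitespace already removed at index time
print(sorted(terms))
print("009" in terms)  # True
print("fe" in terms)   # False
```

Under these assumptions the index holds every substring of length 2 or more, which gives the "wildcard at both ends" effect. It would also be consistent with "EE 009" showing up for "FE 009" if the query is split on whitespace: the piece "009" matches on its own, while "fe" does not.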
>
>
> 2014-01-30 Jack Krupansky <j...@basetechnology.com>
>
>> The standard, keyword-oriented query parsers will all treat unquoted,
>> unescaped white space as term delimiters and ignore the white space. There
>> is no way to bypass that behavior. So, your regex will never even see the
>> white space - unless you enclose the text and white space in quotes or use
>> a backslash to escape each white space character.
>>
>> You can use the "field" and "term" query parsers to pass a query string as
>> if it were fully enclosed in quotes, but that only handles a single term
>> and does not allow for multiple terms or any query operators. For example:
>>
>> {!field f=myfield}Foo Bar
>>
>> See:
>> http://wiki.apache.org/solr/QueryParser
>>

>> You can also pre-configure the field query parser with the defType=field
>> parameter.
>>
>> -- Jack Krupansky
>>
>>
>> -----Original Message----- From: Srinivasa7
>> Sent: Thursday, January 30, 2014 6:37 AM
>>
>> To: solr-user@lucene.apache.org
>> Subject: Re: KeywordTokenizerFactory - trouble with "exact" matches
>>
>> Hi,
>>
>> I have a similar kind of problem where I want to search for words with
>> spaces in them, and I want to search with all the spaces stripped.
>>
>> I have used the following schema for that:
>>
>> <fieldType name="nospaces" class="solr.TextField"
>>            autoGeneratePhraseQueries="true">
>>   <analyzer type="index">
>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.PatternReplaceFilterFactory"
>>             pattern="[^\w]+" replacement="" replace="all"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.PatternReplaceFilterFactory"
>>             pattern="[^\w]+" replacement="" replace="all"/>
>>   </analyzer>
>> </fieldType>
>>
>>
>> And
>>
>>
>> <field name="text_nospaces" type="nospaces" indexed="true" stored="true"
>>        omitNorms="true"/>
>> <copyField source="text" dest="text_nospaces"/>
>>
>>
>>
>> But it is not finding the right terms. We strip the spaces and index
>> lowercase values when we do that.
>>
>>
>> Like: East Enders
>>
>> When I search for the text 'east end ers', it does not return anything,
>> saying no document found.
>>
>> I realised that Solr uses the QueryParser before passing the query string
>> to the query analyzer defined in the schema.
>>
>> And the query parser is tokenizing the query string provided in the query,
>> so it is sending each token to the query analyzer that is defined in the
>> schema.

>>
>>

>> So is there any way that I can bypass this query parser, or use a
>> query processor that treats the entire string as a single phrase?
>>
>> At the moment I am using the dismax query processor.
>>
>> Any suggestion would be much appreciated.
>>
>> Thanks
>> Srinivasa
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/KeywordTokenizerFactory-trouble-with-exact-matches-tp4114193p4114432.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
