Re: Synonyms problem

Walter Underwood Fri, 29 Mar 2013 10:25:37 -0700

There are several problems with this config.

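Indexing uses the phonetic filter, but query does not. With inject="true", the 
index keeps the original tokens next to the phonetic codes, so exact terms can 
still match, but a query will never match on the phonetic codes. With 
inject="false", almost nothing would match. Numbers could match, if the filter 
passes them.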
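
A minimal sketch of one fix, assuming you want to keep phonetic matching: give 
the query analyzer the same phonetic filter, in the same position as in the 
index analyzer. The elided filters stay as they are in your config.

    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <!-- ... synonym, stopword, and word delimiter filters as before ... -->
        <filter class="solr.LowerCaseFilterFactory" />
        <!-- same encoder and inject setting as the index analyzer -->
        <filter class="solr.PhoneticFilterFactory"
                encoder="DoubleMetaphone" inject="true" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    </analyzer>

The other option is to remove the phonetic filter from the index analyzer, if 
you do not need sounds-like matching.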


Query time has two stopword filters with different lists. Indexing only has 
one. This isn't fatal, but it is pretty weird. Is letterstops.txt trying to do 
the same thing as the length filter? If so, use the length filter in both 
places. Or not at all. Deleting all single characters is a bad idea. You'll 
never find "Vitamin C".
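
As a sketch, assuming letterstops.txt really is just a list of single 
characters, the end of the query analyzer could use the same length filter that 
the index analyzer already has:

    <!-- replaces the StopFilterFactory that reads letterstops.txt -->
    <filter class="solr.LengthFilterFactory" min="2" max="100" />

To keep single-character tokens searchable, remove the length filter from the 
index analyzer and the letterstops.txt filter from the query analyzer instead.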

The same synonyms are used at index and query time, which is unnecessary. Only 
use synonyms at index time unless you really know what you are doing and have a 
special need.
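
Roughly, and only as a sketch with the other filters elided, that means 
keeping the synonym filter in the index analyzer and leaving it out of the 
query analyzer:

    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true" />
        <!-- ... remaining index filters ... -->
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <!-- no SynonymFilterFactory here -->
        <!-- ... remaining query filters ... -->
    </analyzer>

Any change to the index analyzer only takes effect for documents that are 
reindexed afterwards.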

wunder

On Mar 29, 2013, at 9:53 AM, Plamen Mihaylov wrote:

> Guys,
> 
> This is a commented line where expand is false. I moved the synonym filter
> after tokenizer, but the result is the same.
> 
> Actual configuration:
> 
>        <fieldType name="text" class="solr.TextField"
> positionIncrementGap="100">
>            <analyzer type="index">
>                <tokenizer class="solr.WhitespaceTokenizerFactory" />
>                <filter class="solr.SynonymFilterFactory"
> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>                <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true" />
>                <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
>                    catenateNumbers="1" catenateAll="0"
> splitOnCaseChange="1" />
>                <filter class="solr.LowerCaseFilterFactory" />
>                <filter class="solr.PhoneticFilterFactory"
> encoder="DoubleMetaphone" inject="true" />
>                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>                <filter class="solr.LengthFilterFactory" min="2" max="100"
> />
>                <!-- <filter class="solr.SnowballPorterFilterFactory"
> language="English" /> -->
>            </analyzer>
>            <analyzer type="query">
>                <tokenizer class="solr.WhitespaceTokenizerFactory" />
>                <filter class="solr.SynonymFilterFactory"
> synonyms="synonyms.txt" ignoreCase="true" expand="true" />
>                <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" />
>                <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
>                    catenateNumbers="0" catenateAll="0" />
>                <filter class="solr.LowerCaseFilterFactory" />
>                <!-- <filter class="solr.EnglishPorterFilterFactory"
> protected="protwords.txt"/> -->
>                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>                <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="letterstops.txt" enablePositionIncrements="true" />
>            </analyzer>
>        </fieldType>
> 
> 2013/3/29 Walter Underwood <wun...@wunderwood.org>
> 
>> Also, all the filters need to be after the tokenizer. There are two
>> synonym filters specified, one before the tokenizer and one after.
>> 
>> I'm surprised that works at all. Shouldn't that be a fatal error when
>> loading the config?
>> 
>> wunder
>> 
>> On Mar 29, 2013, at 9:33 AM, Thomas Krämer | ontopica wrote:
>> 
>>> Hi Plamen
>>> 
>>> You should set expand to true during
>>> 
>>> <analyzer type="index">
>>> ....
>>> <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
>>>             ignoreCase="true" expand="true"/>
>>> 
>>> 
>>> ...
>>> 
>>> Greetings,
>>> 
>>> Thomas
>>> 
>>> Am 29.03.2013 17:16, schrieb Plamen Mihaylov:
>>>> Hey guys,
>>>> 
>>>> I have the following problem - I have a website with sport players,
>>>> where I am using Solr to index their data. I have defined synonyms
>>>> like: NY, New York. When I search for New York there are 145 results
>>>> found, but when I search for NY there are 142 results found. Why is
>>>> there a difference, and how can I fix this?
>>>> 
>>>> Configuration snippets:
>>>> 
>>>> synonyms.txt
>>>> 
>>>> ...
>>>> NY, New York
>>>> ...
>>>> 
>>>> ------
>>>> schema.xml
>>>> 
>>>> ...
>>>>        <fieldType name="text" class="solr.TextField"
>>>> positionIncrementGap="100">
>>>>           <analyzer type="index">
>>>>               <filter class="solr.SynonymFilterFactory"
>>>> synonyms="synonyms.txt"
>>>>                   ignoreCase="true" expand="true"/>
>>>>               <tokenizer class="solr.WhitespaceTokenizerFactory" />
>>>>               <!-- we will only use synonyms at query time <filter
>>>> class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
>>>>                   ignoreCase="true" expand="false"/> -->
>>>> 
>>>>               <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>> words="stopwords.txt" enablePositionIncrements="true" />
>>>>               <filter class="solr.WordDelimiterFilterFactory"
>>>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>>>                   catenateNumbers="1" catenateAll="0"
>>>> splitOnCaseChange="1" />
>>>>               <filter class="solr.LowerCaseFilterFactory" />
>>>>               <filter class="solr.PhoneticFilterFactory"
>>>> encoder="DoubleMetaphone" inject="true" />
>>>>               <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>>>>               <filter class="solr.LengthFilterFactory" min="2" max="100" />
>>>>               <!-- <filter class="solr.SnowballPorterFilterFactory"
>>>> language="English" /> -->
>>>>           </analyzer>
>>>>           <analyzer type="query">
>>>>               <filter class="solr.SynonymFilterFactory"
>>>> synonyms="synonyms.txt" ignoreCase="true" expand="true" />
>>>>               <tokenizer class="solr.WhitespaceTokenizerFactory" />
>>>> 
>>>>               <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>> words="stopwords.txt" />
>>>>               <filter class="solr.WordDelimiterFilterFactory"
>>>> generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>>>                   catenateNumbers="0" catenateAll="0" />
>>>>               <filter class="solr.LowerCaseFilterFactory" />
>>>>               <!-- <filter class="solr.EnglishPorterFilterFactory"
>>>> protected="protwords.txt"/> -->
>>>>               <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>>>>               <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>> words="letterstops.txt" enablePositionIncrements="true" />
>>>>           </analyzer>
>>>>       </fieldType>
>>>> 
>>>> 
>>>> Thanks in advance.
>>>> Plamen
>>>> 
>>> 
>>> 
>>> --
>>> 
>>> ontopica GmbH
>>> Prinz-Albert-Str. 2b
>>> 53113 Bonn
>>> Germany
>>> fon: +49-228-227229-22
>>> fax: +49-228-227229-77
>>> web: http://www.ontopica.de
>>> ontopica GmbH
>>> Sitz der Gesellschaft: Bonn
>>> 
>>> Geschäftsführung: Thomas Krämer, Christoph Okpue
>>> Handelsregister: Amtsgericht Bonn, HRB 17852
>>> 
>>> 
>>
