RE: solr synonyms behaviour

Laurent Gilles Mon, 28 Jul 2008 09:04:05 -0700

Hi,

I ran into the same issue regarding multi-word synonyms.
Take a synonyms list like:


club, bar, night cabaret

Now if we have a document containing "club", the default synonym filter
behaviour with expand=true leaves us with a document in the Lucene index
containing "club|bar|night cabaret".
So if the user searches for "night", the query will look for "night" in
the index and will match our document, since it was "enriched" at index
time and now really does contain the token "night".

The only valid solution I found was to create a field type used exclusively
for synonym search, where:

@IndexTime
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="false" />
@QueryTime
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="false" />
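
Put together as a complete field type, this could look roughly like the
sketch below. The name "text_syns" and the whitespace tokenizer are
assumptions for illustration; only the two synonym filters come from the
setup described above.

```xml
<!-- Sketch only: "text_syns" is a hypothetical name and the whitespace
     tokenizer is an assumption; adapt both to your own schema. -->
<fieldType name="text_syns" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- expand="false": every entry on a synonyms line is replaced
         by the first entry on that line -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- same filter at query time, so both sides reduce to the same token -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
  </analyzer>
</fieldType>
```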

And with a customised synonyms file that looks like:

SYN_ID_1, club, bar, night cabaret

So for our document containing "club", the synonym filter at index time with
expand=false will replace every matching token/expression in the document
with SYN_ID_1.

And at query time, when a user searches for "night", nothing matches, not
even through "normal" search: "night" does not appear on its own in the
synonym definition, and every document containing "club" or "bar" has been
"enriched" with "SYN_ID_1" and NOT with "club|bar|night cabaret". The final
indexed document therefore does not contain isolated tokens from a synonym
expression that could be matched later without notice.

In order to match our document containing "club", the user HAS TO type the
entire expression "night cabaret", which the query-time filter also rewrites
to SYN_ID_1; typing only part of the expression is not enough.


Of course, as I said before, this field is used exclusively for synonym
matching, so it requires another field for normal full-text stemmed search
to add the normal results. This approach also gives us the opportunity to
set up boosting separately for full-text stemmed search vs. synonym search,
for example:

title_stem:"club"^100 OR title_syns:"club"^10

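For the two fields in that query, the schema could declare something like
the following. This is only a sketch: the source field "title" and the type
names "string", "text" and "text_syns" (a synonyms-only field type as
described earlier) are assumptions, not taken from our actual schema.

```xml
<!-- Sketch only: "title" and the type names are hypothetical. -->
<!-- Stored copy of the original value, not searched directly -->
<field name="title"      type="string"    indexed="false" stored="true"/>
<!-- Normal full-text stemmed search -->
<field name="title_stem" type="text"      indexed="true"  stored="false"/>
<!-- Synonym-only search (expand="false" at index and query time) -->
<field name="title_syns" type="text_syns" indexed="true"  stored="false"/>

<!-- Feed the same input into both searchable fields -->
<copyField source="title" dest="title_stem"/>
<copyField source="title" dest="title_syns"/>
```
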
I hope this is clear, even if I doubt it. The fact is that this approach
fixed our problem, since we didn't want synonym matching when the user types
only part of a synonym expression.

Regards,
Laurent



-----Original Message-----
From: swarag [mailto:[EMAIL PROTECTED] 
Sent: Friday, 25 July 2008 23:48
To: solr-user@lucene.apache.org
Subject: Re: solr synonyms behaviour



swarag wrote:
> 
> 
> Yonik Seeley wrote:
>> 
>> On Tue, Jul 15, 2008 at 2:27 PM, swarag <[EMAIL PROTECTED]>
>> wrote:
>>> To my understanding, this means I am using synonyms at index time and
>>> NOT
>>> query time. And yet, I am still having these problems with synonyms.
>> 
>> Can you give a specific example?  Use debugQuery=true to see what the
>> resulting query is.
>> You can also use the admin analysis page to see the output of the
>> index and query analyzers.
>> 
>> -Yonik
>> 
>> 
> 
> So it sounds like using the '=>' operator for synonyms that may or may not
> contain multiple words causes problems.  So I changed my synonyms.txt to
> the following:
> 
> club,bar,night cabaret
> 
> In schema.xml, I now have the following:
>     <fieldType name="text" class="solr.TextField"
> positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true"/>
>         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.EnglishPorterFilterFactory"
> protected="protwords.txt"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>               <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"/>
>         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.EnglishPorterFilterFactory"
> protected="protwords.txt"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>     </fieldType>
> 
> As you can see, 'night cabaret' is my only multi-word synonym term.
> Searches for 'bar' and 'club' now behave as expected.  However, if I
> search for JUST 'night' or JUST 'cabaret', it looks like it is still using
> the synonyms 'bar' and 'club', which is not what is desired.  I only want
> 'bar' and 'club' to be returned if a search for the complete 'night
> cabaret' is submitted.
> 
> Since query-time synonyms are turned "off", the resulting
> parsedquery_toString is simply "name:night", "name:cabaret", etc.
> 
> Thanks!
> 

We are still having problems. Searches for single words that are part of a
multi-word synonym seem to be affected by the synonyms, when they should
not.  Anyone else experience this?  If not, would you mind explaining your
config and the format of your synonyms.txt file?
-- 
View this message in context:
http://www.nabble.com/solr-synonyms-behaviour-tp15051211p18660135.html
Sent from the Solr - User mailing list archive at Nabble.com.
