RE: Solr search engine configuration

Markus Jelsma Tue, 13 Mar 2018 12:52:40 -0700

Inline, cheers.

-----Original message-----
> From:PeterKerk <petervdk...@hotmail.com>
> Sent: Tuesday 13th March 2018 18:53
> To: solr-user@lucene.apache.org
> Subject: RE: Solr search engine configuration
> 
> You must stay in the Javadoc section, there the examples are good, or the
> reference guide: 
> https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html
> https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#filter-descriptions
> 
> PVK COMMENT 1: 
>       This seems to be for Solr 6.5+? I'm using 4.3.1. An upgrade is not on
> the radar soon. Will using DictionaryCompoundWordTokenFilterFactory as I'm
> doing now severely degrade my result quality as opposed to
> HyphenationCompoundWordTokenFilterFactory?


Just change the version number in the URL, most filters are already quite old:
https://lucene.apache.org/core/4_3_1/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html

Dictionary vs. Hyphenation: using Dictionary won't severely degrade results, and 
it can be easier to use if you need to add words. I prefer the Hyphenator though, 
but it can bite. Stick to Dictionary, you are fine. But both (iirc) suffer from 
the same problems with overlapping words, or with subwords that do not entirely 
make up the full compound (minus genitives or plural forms). This is a real issue.

> 
> 
> Almost, zaken -> zaak is already KP output, no need to input what the
> stemmer will do for you. 
> 
> PVK COMMENT 2: 
>       How do you know zaken -> zaak is already KP output? Is there a list
> somewhere?

I know because I've seen KP's output a million times by now. You should really 
use Solr's analysis GUI, it shows what each filter emits, it is really helpful.

>       
> PVK COMMENT 3: 
> I now have:
> 
>       <fieldType name="searchtext_nl" class="solr.TextField" positionIncrementGap="100">
>         <analyzer type="index">
>           <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>           <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>                   generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>                   catenateAll="0" splitOnCaseChange="1"/>
>           <filter class="solr.LowerCaseFilterFactory"/>
>           <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
>                   dictionary="compounds_nl.txt" minWordSize="5" minSubwordSize="2"
>                   maxSubwordSize="15" onlyLongestMatch="true"/>
>           <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt"/>
>           <filter class="solr.SnowballPorterFilterFactory" language="Kp"
>                   protected="protwords_nl.txt"/>
>           <filter class="solr.ASCIIFoldingFilterFactory"/>
>         </analyzer>
>         <analyzer type="query">
>           <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>           <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>                   generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>                   catenateAll="0" splitOnCaseChange="1"/>
>           <filter class="solr.LowerCaseFilterFactory"/>
>           <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
>                   dictionary="compounds_nl.txt" minWordSize="5" minSubwordSize="2"
>                   maxSubwordSize="15" onlyLongestMatch="true"/>
>           <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt"/>
>           <filter class="solr.SnowballPorterFilterFactory" language="Kp"
>                   protected="protwords_nl.txt"/>
>           <filter class="solr.ASCIIFoldingFilterFactory"/>
>         </analyzer>
>       </fieldType>

Please increase minWordSize and minSubwordSize. There are no compounds with 
that few characters. minSubwordSize should be at least 4, or you will get a lot 
of crazy output due to the problems stated above.
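
As a rough sketch of that change, the decompounder line could look something 
like the one below. Only minSubwordSize="4" comes from the advice above; the 
minWordSize="8" value is an assumption (two subwords of 4 characters each), 
not a tested setting, so verify it against your own data in the analysis GUI.

```xml
<!-- Sketch only: minWordSize="8" is an assumed value, tune it for your corpus. -->
<filter class="solr.DictionaryCompoundWordTokenFilterFactory"
        dictionary="compounds_nl.txt"
        minWordSize="8" minSubwordSize="4" maxSubwordSize="15"
        onlyLongestMatch="true"/>
```

The same values would need to go into both the index and the query analyzer, 
followed by a reindex.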

> 
> I tested in admin UI (and yes, I restart Solr and reindex every time I make
> a change):    
>       
> http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true
> returns:
> "hi there dieren zaak something else"
> "hi there dier something else"
> 
> http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dierenzaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true
> returns
> "hi there dierenzaak something else"
> 
> So I added "dieren" to compounds_nl.txt
> 
> Now on "title_search_global:(dieren zaak)" it returns:
> <doc>
>     <str name="title">hi there dieren zaak something else</str>
>     <str name="id">115_3699638</str>
> </doc>
> <doc>
>     <str name="title">hi there dier something else</str>
>     <str name="id">115_3699637</str>
> </doc>
> <doc>
>     <str name="title">hi there dierenzaak something else</str>
>     <str name="id">115_3699639</str>
> </doc>
> 
> So it's starting to look good! :-)
> 
> What I want to know, how can I have Solr consider "dierenzaak" to be of
> higher importance than just "dier" in the above results?

Does the decompounder support emitting the compound word as well? If so, enable 
it. It should help scoring compounds higher via IDF as they are less common.

> 
> Also I'm still not 100% sure what my addition of "dieren" to
> compounds_nl.txt actually does...I assume
> DictionaryCompoundWordTokenFilterFactory just looks for that exact string
> and if it finds it, considers that a separate word? Correct?

Just check in the analysis GUI, it will answer all these questions.

> 
> Thanks again!
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>
