Re: partial search help request

Erick Erickson Wed, 05 Aug 2020 05:37:38 -0700

First of all, the mail server strips a lot of attachments, so a number of 
yours didn’t come through. Your field definitions did, but we can’t see your 
results.


KeywordTokenizerFactory is something I’d avoid at this point. It doesn’t break 
up the input at all, so input of “my dog has fleas” indexes exactly one token, 
“my dog has fleas” which is usually not what people want.
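
As a rough illustration of that point, here is a small Python sketch. The 
helper functions are hypothetical stand-ins that only imitate the behaviour 
described above; they are not Solr code. The gram sizes 2 and 25 are borrowed 
from the edge_n2_kw_text definition in the quoted message.

```python
def keyword_tokenize(text):
    """Stand-in for a keyword-style tokenizer: the whole input is one token."""
    return [text]


def whitespace_tokenize(text):
    """Stand-in for a whitespace tokenizer: split on runs of whitespace."""
    return text.split()


def edge_ngrams(token, min_gram, max_gram):
    """Prefixes of token whose length is between min_gram and max_gram."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


text = "my dog has fleas"

keyword_tokens = keyword_tokenize(text)
whitespace_tokens = whitespace_tokenize(text)

# Edge n-grams are built per token, so with a single token every indexed
# term is a prefix of the whole string.
keyword_terms = set()
for token in keyword_tokens:
    keyword_terms.update(edge_ngrams(token, 2, 25))

whitespace_terms = set()
for token in whitespace_tokens:
    whitespace_terms.update(edge_ngrams(token, 2, 25))

print(keyword_tokens)
print(whitespace_tokens)
print("dog" in keyword_terms, "dog" in whitespace_terms)
```

In this sketch “dog” only becomes a term when the input is split into words 
first; with the single keyword token, only prefixes such as “my dog” exist.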

For the other problems, I’d suggest several ways to narrow down the issue.

1> remove PorterStemFilter and see what you get. This is something of a long 
shot, but I’ve seen this cause unexpected results due to the algorithmic nature 
of the stemmer not quite matching your assumptions.

2> add &debug=query to your URL and look particularly at the “parsed query” 
section. That’ll show you exactly how the search string was transmogrified 
prior to search and often offers clues.

3> Don’t use edismax to start. What you’ve shown looks correct; this is just on 
the theory that using something simpler to start means fewer moving parts.
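
To make suggestion 1> concrete, here is a sketch of how a stemmer placed ahead 
of the edge n-gram filter could interact with partial terms. It rests on an 
assumption that is not confirmed anywhere in this thread: that the Porter 
stemmer reduces “education” to “educ” and leaves the partial strings unchanged. 
ASSUMED_STEMS and the helper names are hypothetical, and the stop filter is 
left out for brevity. The gram sizes 2 and 35 come from edge_ngram_test_5.

```python
# Hypothetical lookup standing in for the stemmer. The "education" -> "educ"
# entry is an assumption about Porter stemming, not output captured from Solr.
ASSUMED_STEMS = {"education": "educ"}


def assumed_stem(term):
    """Return the assumed stem, or the term unchanged if it is not listed."""
    return ASSUMED_STEMS.get(term, term)


def edge_ngrams(token, min_gram, max_gram):
    """Prefixes of token whose length is between min_gram and max_gram."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def index_terms(word):
    """Index chain after tokenizing: lowercase, stem, then edge n-grams."""
    return set(edge_ngrams(assumed_stem(word.lower()), 2, 35))


def query_term(word):
    """Query chain after tokenizing: lowercase and stem, no n-grams."""
    return assumed_stem(word.lower())


indexed = index_terms("Education")
queries = ["edu", "educa", "educat", "educati", "educatio", "education"]
results = {q: query_term(q) in indexed for q in queries}

for q, hit in results.items():
    print(f"{q:10} {'match' if hit else 'no match'}")
```

If the assumption holds, the n-grams are built from the stem rather than the 
full word, which would give the same hit and miss pattern as the one reported 
in the original message. Removing the stemmer and reindexing is the way to 
check.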

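Suggestions 2> and 3> can be combined into one request. The sketch below only 
builds the URL; the host and core name are copied from the original message, 
and build_debug_url is a hypothetical helper name. Leaving out defType means 
the default query parser is used, so the field is named in q directly.

```python
from urllib.parse import urlencode

# Base URL copied from the original message; adjust host and core as needed.
SELECT_URL = "http://localhost:8983/solr/events/select"


def build_debug_url(field, term):
    """Build a select URL with debug=query and no defType parameter."""
    params = {
        "q": f"{field}:{term}",  # fielded query instead of df plus edismax
        "fl": field,
        "debug": "query",
    }
    return f"{SELECT_URL}?{urlencode(params)}"


url = build_debug_url("title", "educatio")
print(url)
```

Open the printed URL in a browser and compare the parsed query for “educatio” 
with the one for “education”.
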

Also, be a little careful of WhitespaceTokenizer. It’s fine for controlled 
experiments where you’re tightly controlling the input, but going to prod has 
some issues. That tokenizer works fine; it’s just that it’ll include, say, the 
period at the end of a sentence with the last word of the sentence, so the 
last token of “my dog has fleas.” is “fleas.”, period included.
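
A quick way to see the punctuation effect outside Solr is the Python sketch 
below. The regular expression is only a rough stand-in for a tokenizer that 
drops punctuation; it is not how StandardTokenizer is implemented.

```python
import re

sentence = "My dog has fleas."

# Whitespace-only splitting followed by lowercasing.
whitespace_terms = [t.lower() for t in sentence.split()]

# Rough stand-in for a tokenizer that drops punctuation. This is NOT how
# StandardTokenizer is implemented; it only keeps runs of word characters.
word_terms = [t.lower() for t in re.findall(r"\w+", sentence)]

print(whitespace_terms)
print(word_terms)
print("fleas" in whitespace_terms, "fleas" in word_terms)
```

With whitespace splitting the last term keeps its period, so an exact lookup 
for “fleas” finds nothing in that list.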

Best,
Erick

> On Aug 5, 2020, at 8:08 AM, Philip Smith <phi...@keep.edu.hk> wrote:
> 
> Hello, 
> I've had a break-through with my partial string search problem, I don't 
> understand why though. 
> 
> I found yet another example, 
> https://medium.com/aubergine-solutions/partial-string-search-in-apache-solr-4b9200e8e6bb
> and this one uses a different tokenizer, whitespaceTokenizerFactory
> 
> <fieldType name="text_ngrm" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="50"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> 
> The analysis results look very different. It seems to be returning the 
> desired results so far. 
> 
> 
> I don't understand why the other examples that worked for other people 
> weren't working for me. Is it version 8?
> StandardTokenizerFactory didn't work and when I was trying with the 
> KeywordTokenizerFactory it wasn't even matching the full search term.
> If anyone can shed any light, then I'd be grateful.
> Thanks.
> 
> 
> On Wed, Aug 5, 2020 at 7:12 PM Philip Smith <phi...@keep.edu.hk> wrote:
> Hello,
> I'm new to Solr and to this user group. Any help with this problem would be 
> greatly appreciated. 
> 
> I'm trying to get partial keyword search results working. This seems like a 
> fairly common problem, I've found numerous google results offering solutions 
> for instance 
> https://stackoverflow.com/questions/28753671/how-to-configure-solr-to-do-partial-word-matching
> but when I attempt to implement them I'm not receiving the desired results. 
> 
> I'm running solr 8.5.2 in standalone mode, manually editing the configs. 
> 
> I have configured the title field as 
> 
> <field name="title" type="edge_ngram_test_5" indexed="true" stored="true" 
> multiValued="false"/>
> 
> I have also tried it with this parameter  omitTermFreqAndPositions="true"  
> 
> The field type definition is:
> 
>   <fieldType name="edge_ngram_test_5" class="solr.TextField" 
> omitNorms="false">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.PorterStemFilterFactory"/>
>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" 
> maxGramSize="35" />
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.PorterStemFilterFactory"/>
>   </analyzer>
> </fieldType>
> 
> I'm using edismax and searching on title.
> 
> http://localhost:8983/solr/events/select?defType=edismax&df=title&fl=title&q=educatio
> 
> when using edge_ngram_test_5
> 
> edu          correctly finds 4 results
> educa       finds 0
> educat      finds 0
> educati     finds 0
> educatio   finds 0
> education correctly finds 4.
> 
> Steps taken between changes to the schema.
> bin/solr restart
> reimport data
> core admin > reload core
> 
> In admin, I see the correct value, 
> Type edge_ngram_test_5 when I check in schema. 
> 
> In admin , when I check in analysis and search on text analyse 
> 
> 
> it appears to be breaking the word down into letters as I would guess is the 
> correct step.
> 
> These are the query results:
> 
> 
> it looks like it is applying the correct filter names and the search term 
> isn't being altered. I don't understand enough to be able to determine why 
> the query can't find the search result when it appears to have been indexed. 
> Any advice is very welcome as I've spent hours trying to get this working. 
> 
> 
> I've also tried with:
> <fieldType name="edge_n2_kw_text" class="solr.TextField" omitNorms="true" 
> positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" 
> maxGramSize="25"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> 
> <fieldType name="text_edgengram_prod" class="solr.TextField" 
> positionIncrementGap="100" >
>   <analyzer type="index" >
>     <tokenizer class="solr.KeywordTokenizerFactory"/>           
>     <filter class="solr.StopFilterFactory" ignoreCase="true" 
> words="stopwords.txt" />
>     <filter class="solr.PorterStemFilterFactory" />
>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" 
> maxGramSize="30"/> <!-- RDH - removed side="front"-->
>   </analyzer>
>   <analyzer type="query" >
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" 
> words="stopwords.txt" />
>     <filter class="solr.PorterStemFilterFactory" />          
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>   </analyzer>
> </fieldType>
> 
> 
> <fieldType name="edge_ngram_test_4" class="solr.TextField" 
> positionIncrementGap="100" >
>   <analyzer type="index" >
>     <tokenizer class="solr.KeywordTokenizerFactory"/>           
>     <filter class="solr.SnowballPorterFilterFactory" language="English" />
>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" 
> maxGramSize="25" />
>   </analyzer>
>   <analyzer type="query" >
>     <tokenizer class="solr.KeywordTokenizerFactory"/>        
>   </analyzer>
> </fieldType>
> 
> 
> Thanks in advance for any insights offered.
> Kind regards,
> Phil.
