Re: partial search help request

2020-08-05 Thread Philip Smith
Great advice Erick, kindly appreciated.

I removed PorterStemFilter as you suggested and it worked as one would
expect it to. Very useful to learn about avoiding KeywordTokenizerFactory,
the limitation of the WhitespaceTokenizer and the testing approach.
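
A rough sketch of why removing the stemmer could matter, written in Python
rather than Solr config. It rests on an assumption the archive cannot confirm,
since the XML was stripped: that the index analyzer stemmed each token before
building edge n-grams, and that the query analyzer stemmed without n-gramming.
`toy_stem` is a hypothetical stand-in, not the Porter algorithm, and a minimum
gram size of 1 is also assumed.

```python
def toy_stem(token: str) -> str:
    """Hypothetical stand-in for a stemmer: strips one known suffix."""
    for suffix in ("ation", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 35) -> set:
    """All prefixes of token whose length lies in [min_gram, max_gram]."""
    upper = min(len(token), max_gram)
    return {token[:n] for n in range(min_gram, upper + 1)}


# Assumed index-time chain: lowercase -> stem -> edge n-grams.
indexed_terms = edge_ngrams(toy_stem("education".lower()))


def matches(query: str) -> bool:
    """Assumed query-time chain: lowercase -> stem, no n-grams."""
    return toy_stem(query.lower()) in indexed_terms


for q in ("edu", "educa", "educat", "educati", "educatio", "education"):
    print(q, matches(q))
```

Under these assumptions only the prefixes of the stem are indexed, so a partial
query longer than the stem misses while the full word, which stems to the same
form, hits. That is the same pattern as the result counts reported earlier in
the thread, though the real cause may differ.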

Best,
Phil


Re: partial search help request

2020-08-05 Thread Erick Erickson
First of all, lots of attachments are stripped by the mail server, so a number 
of your attachments didn’t come through. Your field definitions did, but we 
can’t see your results.

KeywordTokenizerFactory is something I’d avoid at this point. It doesn’t break 
up the input at all, so input of “my dog has fleas” indexes exactly one token, 
“my dog has fleas” which is usually not what people want.
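
To make the difference concrete, here is a minimal Python sketch. It only
mimics the tokenizing behaviour described above and is not Solr code; both
function names are made up for illustration.

```python
def keyword_tokenize(text: str) -> list:
    """Mimics a keyword-style tokenizer: the whole input is one token."""
    return [text]


def whitespace_tokenize(text: str) -> list:
    """Mimics a whitespace tokenizer: split on runs of whitespace only."""
    return text.split()


text = "my dog has fleas"
print(keyword_tokenize(text))     # ['my dog has fleas']
print(whitespace_tokenize(text))  # ['my', 'dog', 'has', 'fleas']

# A single-word query term can only equal an indexed token in the second case.
print("dog" in keyword_tokenize(text))     # False
print("dog" in whitespace_tokenize(text))  # True
```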

For the other problems, I’d suggest several ways to narrow down the issue.

1> remove PorterStemFilter and see what you get. This is something of a long 
shot, but I’ve seen this cause unexpected results due to the algorithmic nature 
of the stemmer not quite matching your assumptions.

2> add &debug=query to your URL and look particularly at the “parsed query” 
section. That’ll show you exactly how the search string was transmogrified 
prior to search and often offers clues.

3> Don’t use edismax to start. What you’ve shown looks correct; this is just on 
the theory that using something simpler to start means fewer moving parts.
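
Suggestions 2 and 3 can be combined into one request. The Python sketch below
only builds the URL and does not send it. It assumes the parameter meant in
item 2 is Solr's `debug=query`, reuses the host and core name from the original
post, and leaves out `defType` so the default query parser is used. The helper
name and the field-qualified `q` value are illustrative choices.

```python
from urllib.parse import urlencode

# Host and core name taken from the URL in the original post.
BASE_URL = "http://localhost:8983/solr/events/select"


def debug_query_url(term: str, field: str = "title") -> str:
    """Build a select URL that asks Solr to report how the query was parsed."""
    params = {
        "q": f"{field}:{term}",  # field-qualified query, no edismax
        "debug": "query",        # adds a debug section with the parsed query
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(debug_query_url("educatio"))
```

Opening the printed URL in a browser and comparing the parsed query for a
failing term against a working one is usually enough to see whether the query
analyzer changed the term.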


Also, be a little careful of WhitespaceTokenizer. It’s fine for controlled 
experiments where you’re tightly controlling the input, but going to prod has 
some issues. That tokenizer works fine, it’s just that it’ll include, say, the 
period at the end of a sentence with the last word of the sentence…
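
A small Python illustration of the trailing-punctuation effect. The regex-based
tokenizer is only a rough stand-in for a punctuation-aware tokenizer and does
not reproduce the rules of any Solr tokenizer.

```python
import re


def whitespace_tokenize(text: str) -> list:
    """Split on whitespace only, so punctuation stays attached to words."""
    return text.split()


def word_tokenize(text: str) -> list:
    """Rough stand-in for a punctuation-aware tokenizer."""
    return re.findall(r"\w+", text)


sentence = "She works in education."
print(whitespace_tokenize(sentence))  # ['She', 'works', 'in', 'education.']
print(word_tokenize(sentence))        # ['She', 'works', 'in', 'education']

# An exact search for the bare word misses the whitespace-tokenized form.
print("education" in whitespace_tokenize(sentence))  # False
```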

Best,
Erick

> On Wed, Aug 5, 2020 at 7:12 PM Philip Smith  wrote:
> Hello,
> I'm new to Solr and to this user group. Any help with this problem would be 
> greatly appreciated. 
> 
> I'm trying to get partial keyword search results working. This seems like a 
> fairly common problem, and I've found numerous Google results offering 
> solutions, for instance 
> https://stackoverflow.com/questions/28753671/how-to-configure-solr-to-do-partial-word-matching
> but when I attempt to implement them I'm not getting the desired results. 
> 
> I'm running Solr 8.5.2 in standalone mode, manually editing the configs. 
> 
> I have configured the title field as 
> 
> [field definition partly stripped by the archive; surviving attributes: stored="true" multiValued="false"]
> 
> I have also tried it with this parameter  omitTermFreqAndPositions="true"  
> 
> The field type definition is:
> 
> [fieldType XML stripped by the archive; surviving attributes: omitNorms="false", maxGramSize="35"]
> 
> I'm using edismax and searching on title.
> 
> http://localhost:8983/solr/events/select?defType=edismax=title=title=educatio
> 
> when using edge_ngram_test_5
> 
> edu  correctly finds 4 results
> educa   finds 0
> educat  finds 0
> educati finds 0
> educatio   finds 0
> education correctly finds 4.
> 
> Steps taken between changes to the schema.
> bin/solr restart
> reimport data
> core admin > reload core
> 
> In admin, when I check the schema, I see the correct value: 
> Type edge_ngram_test_5. 
> 
> In admin, when I check in analysis and search on text analyse 
> 
> [image: image.png]
> it appears to be breaking the word down into letters as I would guess is the 
> correct step.
> 
> These are the query results:
> [image: image.png]
> 
> it looks like it is applying the correct filter names and the search term 
> isn't being altered. I don't understand enough to be able to determine why 
> the query can't find the search result when it appears to have been indexed. 
> Any advice is very welcome as I've spent hours trying to get this working. 
> 
> 
> I've also tried with:
> [fieldType XML stripped by the archive; surviving attributes: omitNorms="true", positionIncrementGap="100", maxGramSize="25"]
> 
> [fieldType XML stripped by the archive; surviving attributes: positionIncrementGap="100", words="stopwords.txt" (twice), maxGramSize="30"]
> 
> [fieldType XML stripped by the archive; surviving attributes: positionIncrementGap="100", maxGramSize="25"]
> 
> Thanks in advance for any insights offered.
> Kind regards,
> Phil.



Re: partial search help request

2020-08-05 Thread Philip Smith
Hello,
I've had a breakthrough with my partial string search problem, though I don't
understand why.

I found yet another example,
https://medium.com/aubergine-solutions/partial-string-search-in-apache-solr-4b9200e8e6bb
and this one uses a different tokenizer, WhitespaceTokenizerFactory.

[fieldType XML stripped by the archive; surviving attributes: positionIncrementGap="100", maxGramSize="50"]

The analysis results look very different. It seems to be returning the
desired results so far.
[image: image.png]
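
For readers without the stripped XML, the general idea of whitespace
tokenization followed by index-time edge n-grams can be sketched in Python. The
maximum gram size of 50 comes from the surviving attribute. The minimum gram
size of 1, the lowercasing step, the sample title, and the function names are
assumptions for illustration only.

```python
def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 50) -> set:
    """All prefixes of token whose length lies in [min_gram, max_gram]."""
    upper = min(len(token), max_gram)
    return {token[:n] for n in range(min_gram, upper + 1)}


def index_terms(title: str) -> set:
    """Assumed index chain: lowercase -> whitespace split -> edge n-grams."""
    terms = set()
    for token in title.lower().split():
        terms |= edge_ngrams(token)
    return terms


def matches(query: str, title: str) -> bool:
    """Assumed query chain: lowercase only, so the query is one exact term."""
    return query.lower() in index_terms(title)


title = "Adult Education Classes"  # hypothetical document title
for q in ("edu", "educa", "educatio", "education", "ducation"):
    print(q, matches(q, title))
```

Every prefix of each word is indexed, so any leading fragment of a word
matches, while a fragment from the middle of a word such as "ducation" does
not.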

I don't understand why the other examples that worked for other people
weren't working for me. Is it version 8?
StandardTokenizerFactory didn't work and when I was trying with
the KeywordTokenizerFactory it wasn't even matching the full search term.
If anyone can shed any light, then I'd be grateful.
Thanks.

