Re: Order of applying tokens/filter

2020-10-06 Thread Walter Underwood
Synonyms only need to be done once. Generally, expand synonyms at index time 
only.
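
A minimal sketch of why index-time expansion is enough, assuming a toy in-memory inverted index rather than anything Solr does internally. The synonym table, the function names, and the AND-style matching are all made up for illustration. Synonyms are expanded only when documents are indexed; the query is just lower-cased and split, yet it still matches.

```python
from collections import defaultdict

# Hypothetical synonym table, for illustration only.
SYNONYMS = {"couch": ["couch", "sofa"], "sofa": ["sofa", "couch"]}

def index_tokens(text):
    """Index-time analysis: split on whitespace and expand synonyms."""
    tokens = []
    for tok in text.lower().split():
        tokens.extend(SYNONYMS.get(tok, [tok]))
    return tokens

def query_tokens(text):
    """Query-time analysis: same tokenization, no synonym expansion."""
    return text.lower().split()

def build_index(docs):
    """Map each indexed token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in index_tokens(text):
            index[tok].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every query token."""
    result = None
    for tok in query_tokens(query):
        postings = index.get(tok, set())
        result = postings if result is None else result & postings
    return result or set()

docs = {1: "leather couch", 2: "garden chair"}
index = build_index(docs)
print(search(index, "sofa"))  # {1}
```

The document only says "couch", but because "sofa" was written into the index alongside it, the unexpanded query finds it.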

Also, consider the StandardTokenizer. It is a bit smarter, and that can be 
useful.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 5, 2020, at 9:08 PM, Jayadevan Maymala  
> wrote:
> 
>> 
>> ICUNormalizer2CharFilterFactory name=“nfkc_cf” (the default)
>> WhitespaceTokenizerFactory
>> SynonymGraphFilterFactory
>> FlattenGraphFilterFactory
>> KStemFilterFactory
>> RemoveDuplicatesTokenFilterFactory
>> 
> One doubt related to this. Ideally, the same sequence should be followed
> for indexing and querying, right?
> Regards,
> Jayadevan



Re: Order of applying tokens/filter

2020-10-05 Thread Jayadevan Maymala
>
> ICUNormalizer2CharFilterFactory name=“nfkc_cf” (the default)
> WhitespaceTokenizerFactory
> SynonymGraphFilterFactory
> FlattenGraphFilterFactory
> KStemFilterFactory
> RemoveDuplicatesTokenFilterFactory
>
One doubt related to this. Ideally, the same sequence should be followed
for indexing and querying, right?
Regards,
Jayadevan


Re: Order of applying tokens/filter

2020-10-05 Thread Jayadevan Maymala
> ICUNormalizer2CharFilterFactory name=“nfkc_cf” (the default)
> WhitespaceTokenizerFactory
> SynonymGraphFilterFactory
> FlattenGraphFilterFactory
> KStemFilterFactory
> RemoveDuplicatesTokenFilterFactory
>
Thanks a lot. Very useful insights.

Regards,
Jayadevan


Re: Order of applying tokens/filter

2020-10-04 Thread Walter Underwood
Several problems.

1. Do not remove stopwords. That is a 1970s-era hack for saving disk space. 
Want to search for “vitamin a”? Better not remove stopwords.
2. Synonyms go before the stemmer, especially the Porter stemmer, whose 
output isn’t English words.
3. Use KStem instead of Porter. Porter is a clever hack from 1980, but we have 
better technology now.
4. Add RemoveDuplicatesTokenFilterFactory as the last step, just in case your 
synonyms stem to the same word. It is cheap insurance.
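
To make point 1 concrete, here is a small sketch in plain Python. The stopword set and function names are hypothetical and this is not Solr's StopFilterFactory, just an illustration of what dropping stopwords does to a query like “vitamin a”.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "and"}

def tokens_with_stop_removal(text):
    """Lower-case, split on whitespace, and drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

def tokens_keep_all(text):
    """Lower-case and split on whitespace, keeping every token."""
    return text.lower().split()

print(tokens_with_stop_removal("Vitamin A"))  # ['vitamin']
print(tokens_keep_all("Vitamin A"))           # ['vitamin', 'a']
```

With stopwords removed, “vitamin a” and plain “vitamin” analyze to the same tokens, so the engine can no longer tell them apart.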

Also, I really recommend using the ICUNormalizer2CharFilterFactory with “nfkc” 
mode as the first step before the tokenizer. Otherwise, you’ll get bitten by 
some weird Unicode thing that takes days to debug. And if you are going to 
lower-case everything, let ICU do that for you with “nfkc_cf” mode.
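
A rough illustration of the kind of Unicode surprises this avoids, using only Python's standard library. This is an approximation, not what ICU or Solr runs: it stands in for “nfkc_cf” with NFKC followed by str.casefold(), whereas ICU's NFKC_Casefold also removes default-ignorable code points. The function name is made up.

```python
import unicodedata

def approx_nfkc_cf(text):
    """Approximate NFKC_Casefold: NFKC, case-fold, then NFKC again to settle."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)

samples = [
    "\ufb01lter",                # starts with the single-character "fi" ligature
    "\uff33\uff2f\uff2c\uff32",  # full-width Latin letters spelling SOLR
    "Stra\u00dfe",               # German sharp s
    "Cafe\u0301",                # "e" followed by a combining acute accent
]
for s in samples:
    print(repr(s), "->", repr(approx_nfkc_cf(s)))
```

Without this step, a ligature or full-width character in the product data would produce tokens that look right on screen but never match what users type.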

So that gives:

ICUNormalizer2CharFilterFactory name=“nfkc_cf” (the default)
WhitespaceTokenizerFactory
SynonymGraphFilterFactory
FlattenGraphFilterFactory
KStemFilterFactory
RemoveDuplicatesTokenFilterFactory
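
The ordering above can be sketched as a toy pipeline in plain Python. Everything here is an assumption for illustration: the synonym table is hypothetical, toy_stem is a crude plural stripper and not KStem, NFKC plus casefold only approximates “nfkc_cf”, and the graph flattening step is not modeled at all. The point is only to show why synonyms run on real words before stemming and why the duplicate removal at the end is useful.

```python
import unicodedata

# Hypothetical synonym table; in Solr this would come from a synonyms file.
SYNONYMS = {"sneakers": ["sneakers", "sneaker", "trainers"]}

def toy_stem(token):
    """Crude stand-in for a stemmer: strip a plural 's'. This is NOT KStem."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def analyze(text):
    """Return, for each token position, the list of terms that would be indexed."""
    # 1. Char filter: normalize and fold case before tokenizing.
    text = unicodedata.normalize("NFKC", text).casefold()
    positions = []
    # 2. Tokenizer: split on whitespace.
    for token in text.split():
        # 3. Synonyms: expand while tokens are still real words.
        expanded = SYNONYMS.get(token, [token])
        # 4. Stemmer: applied to every term at this position.
        stemmed = [toy_stem(t) for t in expanded]
        # 5. Remove duplicates at the same position, keeping first-seen order.
        positions.append(list(dict.fromkeys(stemmed)))
    return positions

print(analyze("Red SNEAKERS"))  # [['red'], ['sneaker', 'trainer']]
```

In this toy, “sneakers” and “sneaker” stem to the same term, so the last step collapses them. If the stemmer ran first, “sneakers” would already be “sneaker” and would no longer match the “sneakers” entry in the table.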

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 4, 2020, at 9:24 PM, Jayadevan Maymala  
> wrote:
> 
> Hi all,
> 
> Is this the best (performance-wise as well as efficacy) order of applying
> analyzers/filters? We have an eCom site where many products are listed,
> and users may type in search words and get relevant results.
> 
> 1) Tokenize on whitespace (WhitespaceTokenizerFactory)
> 2) Remove stopwords (StopFilterFactory)
> 3) Stem (PorterStemFilterFactory)
> 4) Convert to lowercase  (LowerCaseFilterFactory)
> 5) Add synonyms (SynonymGraphFilterFactory,FlattenGraphFilterFactory)
> 
> Any possible gotchas?
> 
> Regards,
> Jayadevan



Order of applying tokens/filter

2020-10-04 Thread Jayadevan Maymala
Hi all,

Is this the best (performance-wise as well as efficacy) order of applying
analyzers/filters? We have an eCom site where many products are listed,
and users may type in search words and get relevant results.

1) Tokenize on whitespace (WhitespaceTokenizerFactory)
2) Remove stopwords (StopFilterFactory)
3) Stem (PorterStemFilterFactory)
4) Convert to lowercase  (LowerCaseFilterFactory)
5) Add synonyms (SynonymGraphFilterFactory,FlattenGraphFilterFactory)

Any possible gotchas?

Regards,
Jayadevan