Re: Order of applying tokens/filter
Synonyms only need to be applied once. Generally, expand synonyms at index time only.

Also, consider the StandardTokenizer. It is a bit smarter than splitting on whitespace, and that can be useful.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Oct 5, 2020, at 9:08 PM, Jayadevan Maymala wrote:
>
>> ICUNormalizer2CharFilterFactory name="nfkc_cf" (the default)
>> WhitespaceTokenizerFactory
>> SynonymGraphFilterFactory
>> FlattenGraphFilterFactory
>> KStemFilterFactory
>> RemoveDuplicatesFilterFactory
>
> One doubt related to this. Ideally, the same sequence should be followed
> for indexing and querying, right?
>
> Regards,
> Jayadevan
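As a rough illustration of "expand synonyms at index time only", a field type can declare separate index and query analyzers that share every step except the synonym filters. This is only a sketch: the field type name `text_products` and the file name `synonyms.txt` are placeholders that do not come from the thread, and the ICU char filter assumes the ICU analysis module is available to Solr. The thread calls the last step RemoveDuplicatesFilterFactory; Solr registers that filter as `RemoveDuplicatesTokenFilterFactory`, which is the name used here.

```xml
<!-- Sketch only: "text_products" and "synonyms.txt" are placeholder names. -->
<fieldType name="text_products" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- No normalization form is set; the thread states nfkc_cf is the default. -->
    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
    <!-- solr.StandardTokenizerFactory could be tried here instead. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Synonyms are expanded once, at index time. -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <!-- The token graph is flattened before it is written to the index. -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same normalization, tokenizer, and stemmer, but no synonym steps. -->
    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Keeping the char filter, tokenizer, and stemmer identical on both sides means query terms are reduced to the same form as the indexed terms, while the synonym expansion happens only once.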
Re: Order of applying tokens/filter
> ICUNormalizer2CharFilterFactory name="nfkc_cf" (the default)
> WhitespaceTokenizerFactory
> SynonymGraphFilterFactory
> FlattenGraphFilterFactory
> KStemFilterFactory
> RemoveDuplicatesFilterFactory

One doubt related to this. Ideally, the same sequence should be followed for indexing and querying, right?

Regards,
Jayadevan
Re: Order of applying tokens/filter
> ICUNormalizer2CharFilterFactory name="nfkc_cf" (the default)
> WhitespaceTokenizerFactory
> SynonymGraphFilterFactory
> FlattenGraphFilterFactory
> KStemFilterFactory
> RemoveDuplicatesFilterFactory

Thanks a lot. Very useful insights.

Regards,
Jayadevan
Re: Order of applying tokens/filter
Several problems.

1. Do not remove stopwords. That is a 1970s-era hack for saving disk space. Want to search for "vitamin a"? Better not remove stopwords.
2. Synonyms go before the stemmer, especially with the Porter stemmer, whose output isn't English words.
3. Use KStem instead of Porter. Porter is a clever hack from 1980, but we have better technology now.
4. Add RemoveDuplicatesFilter as the last step, just in case your synonyms stem to the same word. It is cheap insurance.

Also, I really recommend using the ICUNormalizer2CharFilterFactory with "nfkc" mode as the first step before the tokenizer. Otherwise, you'll get bitten by some weird Unicode thing that takes days to debug. And if you are going to lower-case everything, let ICU do that for you with "nfkc_cf" mode.

So that gives:

ICUNormalizer2CharFilterFactory name="nfkc_cf" (the default)
WhitespaceTokenizerFactory
SynonymGraphFilterFactory
FlattenGraphFilterFactory
KStemFilterFactory
RemoveDuplicatesFilterFactory

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Oct 4, 2020, at 9:24 PM, Jayadevan Maymala wrote:
>
> Hi all,
>
> Is this the best (performance-wise as well as efficacy) order of applying
> analyzers/filters? We have an eCom site where the many products are listed,
> and users may type in search words and get relevant results.
>
> 1) Tokenize on whitespace (WhitespaceTokenizerFactory)
> 2) Remove stopwords (StopFilterFactory)
> 3) Stem (PorterStemFilterFactory)
> 4) Convert to lowercase (LowerCaseFilterFactory)
> 5) Add synonyms (SynonymGraphFilterFactory,FlattenGraphFilterFactory)
>
> Any possible gotchas?
>
> Regards,
> Jayadevan
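For readers who want to see the recommended order as configuration, the chain in this reply might be written as a single analyzer along these lines. It is a sketch under assumptions, not the poster's schema: `text_products_single` and `synonyms.txt` are invented placeholder names, the ICU char filter needs the ICU analysis module on Solr's classpath, and the duplicate-removal step appears under the class name Solr registers for it, `RemoveDuplicatesTokenFilterFactory`.

```xml
<!-- Sketch only: "text_products_single" and "synonyms.txt" are placeholder names. -->
<fieldType name="text_products_single" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- First step: Unicode normalization plus case folding, before tokenizing.
         No form is set because the reply states nfkc_cf is the default. -->
    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Synonyms come before the stemmer so they match real words. -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <!-- KStem in place of Porter; no stopword filter anywhere in the chain. -->
    <filter class="solr.KStemFilterFactory"/>
    <!-- Last step: drop duplicates when synonyms stem to the same word. -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Compared with the chain in the original question, this drops StopFilterFactory and LowerCaseFilterFactory, moves synonyms ahead of stemming, and swaps PorterStemFilterFactory for KStemFilterFactory.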
Order of applying tokens/filter
Hi all,

Is this the best order (for both performance and efficacy) of applying analyzers/filters? We have an eCom site where many products are listed, and users may type in search words and get relevant results.

1) Tokenize on whitespace (WhitespaceTokenizerFactory)
2) Remove stopwords (StopFilterFactory)
3) Stem (PorterStemFilterFactory)
4) Convert to lowercase (LowerCaseFilterFactory)
5) Add synonyms (SynonymGraphFilterFactory, FlattenGraphFilterFactory)

Any possible gotchas?

Regards,
Jayadevan