Ages ago at Netflix, I fixed this with a few hundred synonyms. If you are
working with
a fixed vocabulary (movie titles, product names), that can work just fine.
babysitter, baby-sitter, baby sitter
fullmetal, full-metal, full metal
manhunter, man-hunter, man hunter
spiderman, spider-man, spider man
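If you go the synonym route, a minimal sketch of what the wiring could look like — the field type name, synonym file contents, and filter placement here are illustrative, not from the original thread:

```xml
<!-- synonyms.txt (one rule per line; equivalent forms, comma-separated):
     babysitter, baby-sitter, baby sitter
     spiderman, spider-man, spider man
-->
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- SynonymGraphFilterFactory is the graph-aware replacement for the
         older SynonymFilterFactory; multi-word synonyms generally behave
         better when this is applied at query time only -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```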
Parameters, no. You could use a PatternReplaceCharFilterFactory. NOTE:
*FilterFactory implementations are _not_ what you want in this case; they are
applied to individual tokens after parsing.
*CharFilterFactory implementations are invoked on the entire input to the
field, although I can’t say for certain that even that’s
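For completeness, a charFilter runs on the raw field input before the tokenizer sees it. A sketch of the idea (the field type name and regex are illustrative assumptions, not from the thread):

```xml
<fieldType name="text_nohyphen" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- runs on the whole field input, before tokenization:
         a hyphen between word characters is removed, so
         "high-tech" reaches the tokenizer as "hightech" -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(\w)-(\w)" replacement="$1$2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

One caveat, echoing the hedge above: the analysis chain (charFilters included) runs after query parsing, so "high-tech" would collapse to one token while "high tech" stays two — whether that fully resolves the ranking discrepancy is not certain.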
Are there any good workarounds/parameters we can use to fix this so it
doesn't have to be solved client side?
On Tue, Nov 24, 2020 at 7:50 AM matthew sporleder wrote:
Is the normal/standard solution here to regex remove the '-'s and
combine them into a single token?
On Tue, Nov 24, 2020 at 8:00 AM Erick Erickson wrote:
This is a common point of confusion. There are two phases for creating a query,
query _parsing_ first, then the analysis chain for the parsed result.
So what e-dismax sees in the two cases is:
Name_enUS:“high tech” -> two tokens; since there are two of them, pf2 comes
into play.
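The pf2 behavior described here is configured through edismax request parameters. A sketch of where those live — the field names and boost values are illustrative assumptions:

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">Name_enUS</str>
    <!-- pf2 boosts documents where adjacent pairs of query words occur
         as a phrase; it only applies when the parser produces two or
         more tokens, hence the "high tech" vs "high-tech" difference -->
    <str name="pf2">Name_enUS^10</str>
  </lst>
</requestHandler>
```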
I am troubleshooting an issue with ranking for search terms that contain a
"-" vs. the same query without the dash, e.g. "high-tech" vs. "high tech".
The field I am querying uses the standard tokenizer, so I would expect the
underlying Lucene query to be the same for both.