Another thing to add to the above,
>
> IT:ibm. In this case, we would want to maintain the colon and the
> capitalization (otherwise “it” would be taken out as a stopword).
>
stopwords are a thing of the past at this point.  there is no benefit to
using them now with hardware being so cheap.

On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> If you don't want it to be touched by a tokenizer, how would the
> protection step know that the sequence of characters you want to
> protect is "IT:ibm" and not "this is an IT:ibm term I want to
> protect"?
>
> What it sounds to me is that you may want to:
> 1) copyField to a second field
> 2) Apply a much lighter (whitespace?) tokenizer to that second field
> 3) Run the results through something like KeepWordFilterFactory
> 4) Search both fields with a boost on the second, higher-signal field
>
> The other option is to run CharacterFilter,
> (PatternReplaceCharFilterFactory) which is pre-tokenizer to map known
> complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
> term365". As long as it is done on both indexing and query, they will
> still match. You may have to have a bunch of them or write some sort
> of lookup map.
>
> Regards,
>    Alex.
>
> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
> audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
> >
> > Hi All,
> >
> > This is likely a rudimentary question, but I can’t seem to find a
> straight-forward answer on forums or the documentation…is there a way to
> protect tokens from ANY analysis? I know things like the
> KeywordMarkerFilterFactory protect tokens from stemming, but we have some
> terms we don’t even want our tokenizer to touch. Mostly, these are
> IBM-specific acronyms, such as IT:ibm. In this case, we would want to
> maintain the colon and the capitalization (otherwise “it” would be taken
> out as a stopword).
> >
> > Any advice is appreciated!
> >
> > Thank you,
> > Audrey
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
>

Reply via email to