Another thing to add to the above:

> > IT:ibm. In this case, we would want to maintain the colon and the
> > capitalization (otherwise “it” would be taken out as a stopword).

Stopwords are a thing of the past at this point. There is no benefit to using them now that hardware is so cheap.

On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> If you don't want it to be touched by a tokenizer, how would the
> protection step know that the sequence of characters you want to
> protect is "IT:ibm" and not "this is an IT:ibm term I want to
> protect"?
>
> What it sounds like to me is that you may want to:
> 1) copyField to a second field
> 2) Apply a much lighter (whitespace?) tokenizer to that second field
> 3) Run the results through something like KeepWordFilterFactory
> 4) Search both fields with a boost on the second, higher-signal field
>
> The other option is to run a CharFilter
> (PatternReplaceCharFilterFactory), which is applied before the
> tokenizer, to map known complex acronyms to non-tokenizable
> substitutions, e.g. "IT:ibm" -> "term365". As long as it is done at
> both index and query time, they will still match. You may have to
> have a bunch of them or write some sort of lookup map.
>
> Regards,
>    Alex.
>
> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
> audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
> >
> > Hi All,
> >
> > This is likely a rudimentary question, but I can’t seem to find a
> > straightforward answer on forums or in the documentation… is there
> > a way to protect tokens from ANY analysis? I know things like the
> > KeywordMarkerFilterFactory protect tokens from stemming, but we have
> > some terms we don’t even want our tokenizer to touch. Mostly, these
> > are IBM-specific acronyms, such as IT:ibm. In this case, we would
> > want to maintain the colon and the capitalization (otherwise “it”
> > would be taken out as a stopword).
> >
> > Any advice is appreciated!
> >
> > Thank you,
> > Audrey
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
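
To make the two options quoted above concrete, here is a rough Python sketch of the idea only. It is not Solr code or Solr configuration. The stopword list, the lookup map, and all function names are made up for illustration, and the "standard" analyzer is just a stand-in for whatever the main text field does (lowercase, split on non-alphanumerics, drop stopwords).

```python
import re

# Hypothetical data: protected acronyms and the placeholder each one maps to.
PROTECTED = {"IT:ibm": "term365"}
# Hypothetical stopword list, only large enough for the example sentence.
STOPWORDS = {"a", "an", "i", "is", "it", "the", "this", "to"}


def standard_analyze(text):
    """Stand-in for the main field: lowercase, split on non-alphanumerics, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def keepword_analyze(text):
    """Option 1: second field, whitespace tokens only, keep only known protected terms."""
    return [t for t in text.split() if t in PROTECTED]


# Longest keys first so a longer acronym wins over a shorter one it contains.
_PROTECTED_RE = re.compile(
    "|".join(re.escape(k) for k in sorted(PROTECTED, key=len, reverse=True))
)


def charfilter_analyze(text):
    """Option 2: swap protected terms for placeholders before tokenizing."""
    mapped = _PROTECTED_RE.sub(lambda m: PROTECTED[m.group(0)], text)
    return standard_analyze(mapped)


if __name__ == "__main__":
    sentence = "this is an IT:ibm term I want to protect"
    print(standard_analyze("IT:ibm"))    # the acronym is split and "it" is dropped
    print(keepword_analyze(sentence))    # only the protected term survives
    print(charfilter_analyze(sentence))  # placeholder survives normal analysis
```

In this sketch the match is case-sensitive, so a lowercase "it:ibm" would not be protected. Either option only works if the same step runs at both index and query time, as noted in the quoted message.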