Re: Re: Protecting Tokens from Any Analysis

Alexandre Rafalovitch Wed, 09 Oct 2019 07:41:36 -0700

Stopwords (this has been discussed on the mailing list several times, as I recall):
The idea is that removing them used to be one of the tricks for keeping the
index as small as possible to allow faster search, stopwords being the most
common words.
These days, disk space is not an issue most of the time, and there have
been many optimizations that make stopword removal less relevant. Plus, like
you said, sometimes the stopword management actively gets in the way.
Here is an interesting, if old, article about it too:
https://library.stanford.edu/blogs/digital-library-blog/2011/12/stopwords-searchworks-be-or-not-be
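
To make the trade-off concrete, here is a minimal sketch using a toy
inverted index in Python. The corpus, the stopword list and the function
names are made up for illustration, and this has nothing to do with how
Lucene actually stores its index. It only shows that dropping stopwords
shrinks the postings, at the price of making some phrases unfindable.

```python
from collections import defaultdict

# Hypothetical toy corpus and stopword list, for illustration only.
DOCS = [
    "the cat sat on the mat",
    "the dog is in the house",
    "to be or not to be",
]
STOPWORDS = {"the", "on", "is", "in", "to", "be", "or", "not"}


def build_index(docs, stopwords=frozenset()):
    """Map each term to its list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            if term not in stopwords:
                index[term].append((doc_id, pos))
    return index


def posting_count(index):
    """Total number of postings, a rough stand-in for index size."""
    return sum(len(postings) for postings in index.values())


full = build_index(DOCS)
pruned = build_index(DOCS, STOPWORDS)

# Compare index sizes with and without stopword removal.
print(posting_count(full), posting_count(pruned))
# Check whether a stopword-only query term survives in each index.
print("be" in full, "be" in pruned)
```

The third document consists entirely of stopwords, so it disappears from
the pruned index altogether. That is the kind of case where stopword
management gets in the way.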


Regards,
   Alex.

On Wed, 9 Oct 2019 at 09:39, Audrey Lorberfeld -
[email protected] <[email protected]> wrote:
>
> Hey Alex,
>
> Thank you!
>
> Re: stopwords being a thing of the past due to the affordability of 
> hardware...can you expand? I'm not sure I understand.
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> [email protected]
>
>
> On 10/8/19, 1:01 PM, "David Hastings" <[email protected]> wrote:
>
>     Another thing to add to the above,
>     >
>     > IT:ibm. In this case, we would want to maintain the colon and the
>     > capitalization (otherwise “it” would be taken out as a stopword).
>     >
>     Stopwords are a thing of the past at this point. There is no benefit to
>     using them now, with hardware being so cheap.
>
>     On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <[email protected]>
>     wrote:
>
>     > If you don't want it to be touched by a tokenizer, how would the
>     > protection step know that the sequence of characters you want to
>     > protect is "IT:ibm" and not "this is an IT:ibm term I want to
>     > protect"?
>     >
>     > It sounds to me like you may want to:
>     > 1) copyField to a second field
>     > 2) Apply a much lighter (whitespace?) tokenizer to that second field
>     > 3) Run the results through something like KeepWordFilterFactory
>     > 4) Search both fields with a boost on the second, higher-signal field
>     >
>     > The other option is to run a CharFilter
>     > (PatternReplaceCharFilterFactory), which runs before the tokenizer,
>     > to map known complex acronyms to non-tokenizable substitutions, e.g.
>     > "IT:ibm" -> "term365". As long as it is done at both index and query
>     > time, they will still match. You may need a bunch of them, or some
>     > sort of lookup map.
>     >
>     > Regards,
>     >    Alex.
>     >
>     > On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
>     > [email protected] <[email protected]> wrote:
>     > >
>     > > Hi All,
>     > >
>     > > This is likely a rudimentary question, but I can’t seem to find a
>     > > straightforward answer on forums or the documentation…is there a way to
>     > > protect tokens from ANY analysis? I know things like the
>     > > KeywordMarkerFilterFactory protect tokens from stemming, but we have some
>     > > terms we don’t even want our tokenizer to touch. Mostly, these are
>     > > IBM-specific acronyms, such as IT:ibm. In this case, we would want to
>     > > maintain the colon and the capitalization (otherwise “it” would be taken
>     > > out as a stopword).
>     > >
>     > > Any advice is appreciated!
>     > >
>     > > Thank you,
>     > > Audrey
>     > >
>     > > --
>     > > Audrey Lorberfeld
>     > > Data Scientist, w3 Search
>     > > IBM
>     > > [email protected]
>     > >
>     >
>
>
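
For the pre-tokenizer substitution option quoted above, here is a rough
sketch of the idea in plain Python rather than Solr configuration. The
lookup map, the toy analysis chain and all function names are
assumptions for illustration. Only the "IT:ibm" -> "term365" pair comes
from the thread, and the second map entry is invented.

```python
import re

# Hypothetical lookup map. "IT:ibm" -> "term365" is the example from the
# thread; the second entry is invented for illustration.
PROTECTED = {
    "IT:ibm": "term365",
    "HR:ibm": "term366",
}

# Longest keys first, so a longer entry is never shadowed by a shorter one.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(PROTECTED, key=len, reverse=True))
)

# Toy stopword list, not Solr's default.
STOPWORDS = {"it", "is", "an", "the", "a"}


def char_filter(text):
    """Replace protected sequences before the tokenizer sees them."""
    return _PATTERN.sub(lambda m: PROTECTED[m.group(0)], text)


def analyze(text):
    """Toy chain: char filter, split on non-alphanumerics, lowercase, stopwords."""
    tokens = re.split(r"[^A-Za-z0-9]+", char_filter(text))
    return [t.lower() for t in tokens if t and t.lower() not in STOPWORDS]


# The same chain is applied at index time and at query time.
indexed = analyze("this is an IT:ibm term I want to protect")
query = analyze("IT:ibm")

print(indexed)
print(query)
print(all(term in indexed for term in query))
```

Without the substitution, this toy tokenizer would split "IT:ibm" on the
colon and then drop "it" as a stopword. Note that the match here is
case-sensitive and has no word-boundary check, so a real setup would
need to decide how strict the patterns should be.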
