Re: Leaving certain tokens intact during indexing and search

Erick Erickson Wed, 30 Nov 2011 06:35:00 -0800

There are about a zillion tokenizers. For what you're describing,
WhitespaceTokenizerFactory is a good candidate.
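
To illustrate why, here is a rough plain-Java sketch of the difference
between splitting on whitespace only and splitting on every non-alphanumeric
character. It does not use Lucene or Solr classes; TokenizerSketch,
whitespaceTokens and delimiterTokens are made-up names, and delimiterTokens
is only a crude stand-in for a tokenizer that treats the slash as a
delimiter, not a reimplementation of StandardTokenizerFactory.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizerSketch {

    // Split on runs of whitespace only, so "AB/1234/5678" survives as one token.
    // (Hypothetical helper: approximates the idea behind a whitespace tokenizer.)
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Split on anything that is not a letter or digit, so the slash acts
    // as a delimiter and is dropped. (Hypothetical helper, for contrast only.)
    static List<String> delimiterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String doc = "blah blahblah AB/1234/5678 blah";
        System.out.println(whitespaceTokens(doc)); // [blah, blahblah, AB/1234/5678, blah]
        System.out.println(delimiterTokens(doc));  // [blah, blahblah, AB, 1234, 5678, blah]
    }
}
```

One thing to keep in mind: splitting on whitespace alone also leaves
punctuation attached to ordinary words ("blah," stays "blah,") and does no
lowercasing, so you will probably want filters after the tokenizer, for
example a lowercase filter.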


See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
for a partial list with links to the authoritative docs.

Best
Erick

On Wed, Nov 30, 2011 at 9:23 AM, Marian Steinbach
<marian.steinb...@gmail.com> wrote:
> I have documents containing tokens of a certain format in arbitrary
> positions, like this:
>
>    ... blah blahblah AB/1234/5678 blah blah blahblah ...
>
> I would like to enable "usual" keyword searching within these documents. In
> addition, I'd also like to enable users to find "AB/1234/5678", ideally
> without a need to quote this as a phrase. And match highlighting should
> highlight this term just as other term matches would be highlighted.
>
> BTW, it's *not* necessary to find this document by searching for parts of
> that token, like "ab", "1234" or "5678".
>
> As I understand, StandardTokenizerFactory considers the slash as a word
> delimiter and thus removes it.
>
> Is there a Tokenizer available that allows me to skip tokenizing on
> slashes in this case, but only in this case? Or how could I create one
> myself? Do I extend StandardTokenizerFactory in my own Java class?
>
> Thanks!
>
> Marian
