Re: Tokenizing problem with numbers in query

Bernd Brod Tue, 05 Jan 2010 13:49:33 -0800

Hi,

On Tue, Jan 5, 2010 at 5:17 PM, Erick Erickson <erickerick...@gmail.com> wrote:


> We need to back up, this is looking like an XY problem. That is,
> you're asking for specifics when what would probably be more
> helpful is for you to describe *what* the problem you're trying
> to solve is rather than *how* to make a specific behavior
> happen. Although re-reading your original e-mail does give a
> clue <G>....
>
> If, for instance, you really really want the string indexed and searched
> literally (if, for instance, it's a part number), you want to use something
> like WhitespaceTokenizerFactory, perhaps lowercasing too, rather
> than fiddle around with KeywordTokenizerFactory. If you want some
> other behavior, please explain it in more detail <G>...
>

I am indexing files that also include traffic captures (so there can be
pretty much anything inside). When searching for a long alphanumeric string I
would have expected fewer results than when searching for a short
one. But because of all the tokenizing, it returns more (useless) results.
This is very disappointing because I could easily find these documents with
grep. What's even more disappointing: disabling the
WordDelimiterFilterFactory (for query and/or text) just results in 0
hits on my document. I'm not quite sure what to do.

Ideally I would like to be able to search for strings such as a1a1a1a1a1a1a1
without them matching a single "a" and/or "1".
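
To illustrate the effect, here is a rough Python sketch. It is only a
simulation under simplifying assumptions, not Solr's actual analysis code:
split_parts, whole_tokens and hits are hypothetical helpers, the two sample
documents are made up, and a query is assumed to match when it shares at
least one term with a document. How Solr really combines the split parts
depends on the analyzer and query parser configuration.

```python
import re

def split_parts(text):
    # Lowercase, split on whitespace, then split every token at each
    # letter/digit boundary (a crude stand-in for word-delimiter splitting).
    parts = []
    for token in text.lower().split():
        parts.extend(re.findall(r"[a-z]+|[0-9]+", token))
    return parts

def whole_tokens(text):
    # Lowercase and split on whitespace only; tokens stay intact.
    return text.lower().split()

def hits(query, docs, analyze):
    # Simplified matching: a document is a hit if it shares any term
    # with the analyzed query.
    query_terms = set(analyze(query))
    return sorted(name for name, body in docs.items()
                  if query_terms & set(analyze(body)))

# Made-up sample documents for illustration only.
docs = {
    "capture": "id a1a1a1a1a1a1a1 end",
    "noise": "a packet with 1 byte",
}

query = "a1a1a1a1a1a1a1"
split_hits = hits(query, docs, split_parts)    # query reduced to "a" and "1"
whole_hits = hits(query, docs, whole_tokens)   # query kept as one term
print(split_hits)
print(whole_hits)
```

With splitting, the long string collapses to the terms "a" and "1", so the
unrelated "noise" document is returned as well. With whitespace-only
tokenizing, only the document containing the full string is returned.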

Bernd
