On Tue, Nov 30, 2010 at 9:45 AM, Jacob Elder <jel...@locamoda.com> wrote:
> Right. CJK doesn't tend to have a lot of whitespace to begin with. In the
> past, we were using a patched version of StandardTokenizer which treated
> @twitteruser and #hashtag better, but this became a release engineering
> nightmare so we switched to Whitespace.

In this case, have you considered using a CharFilter (e.g.
MappingCharFilter) before the tokenizer?

This way you could map your special characters such as @ and # to some
other string that the tokenizer doesn't split on,
e.g. # => "HASH_".

Then your #foobar becomes HASH_foobar.
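
To illustrate the idea outside of Lucene, here is a rough Python sketch of
mapping characters before tokenizing. It is only a simulation: the regex
tokenizer is a stand-in for a tokenizer that splits on punctuation, and the
"AT_" replacement for @ is a made-up name (only # => "HASH_" is from the
suggestion above).

```python
import re

# Hypothetical mapping table in the spirit of a MappingCharFilter rule set.
# Only "#" => "HASH_" comes from the suggestion above; "AT_" is an assumption.
CHAR_MAP = {"#": "HASH_", "@": "AT_"}


def map_chars(text):
    """Rewrite the special characters before the tokenizer ever sees them."""
    for src, dst in CHAR_MAP.items():
        text = text.replace(src, dst)
    return text


def tokenize(text):
    """Stand-in tokenizer: keeps runs of word characters, drops punctuation."""
    return re.findall(r"\w+", text)


def analyze(text):
    """Character mapping first, tokenization second."""
    return tokenize(map_chars(text))


print(tokenize("see #foobar by @bob"))  # the "#" and "@" are lost
print(analyze("see #foobar by @bob"))   # the prefixes survive inside the tokens
```

Because the same mapping runs at index and query time in this sketch, a
query for #foobar and a query for foobar produce different terms.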
If you want searches of "#foobar" to only match "#foobar" and not also
"foobar" itself, and vice versa, you are done.
Maybe you want searches of #foobar to only match #foobar, but searches
of "foobar" to match both "#foobar" and "foobar".
In this case, you would probably use a WordDelimiterFilter with
preserveOriginal at index time only, followed by a StopFilter
containing HASH, so that you index both HASH_foobar and foobar.
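
A rough Python simulation of that second setup, just to show which terms end
up in the index and what matches what. None of this is Lucene code: the
helper names are made up, the split on "_" only imitates what a word
delimiter step with preserveOriginal would do, and the query side
deliberately skips the split.

```python
import re

STOP_WORDS = {"HASH"}  # the stop set containing the mapped prefix


def tokenize(text):
    """Map "#" to "HASH_" and then split on non-word characters (stand-in)."""
    return re.findall(r"\w+", text.replace("#", "HASH_"))


def split_preserving_original(token):
    """Imitate a word delimiter step: keep the original token plus its parts."""
    parts = [p for p in token.split("_") if p]
    return [token] + parts if len(parts) > 1 else [token]


def index_terms(text):
    """Index-time chain: tokenize, split with original preserved, drop HASH."""
    terms = []
    for tok in tokenize(text):
        terms.extend(t for t in split_preserving_original(tok)
                     if t not in STOP_WORDS)
    return terms


def query_terms(text):
    """Query-time chain: same mapping and tokenizer, but no splitting."""
    return tokenize(text)


def matches(query, doc):
    """True if every query term was indexed for the document."""
    indexed = set(index_terms(doc))
    return all(t in indexed for t in query_terms(query))


print(index_terms("#foobar"))
print(matches("#foobar", "foobar"), matches("foobar", "#foobar"))
```

In this sketch a query for #foobar only matches documents containing
#foobar, while a query for foobar matches both forms.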

Anyway, I think you have a lot of flexibility to reuse
StandardTokenizer and customize things like this without maintaining
your own tokenizer. This is the purpose of CharFilters.
