On Tue, Nov 30, 2010 at 9:45 AM, Jacob Elder <jel...@locamoda.com> wrote:
> Right. CJK doesn't tend to have a lot of whitespace to begin with. In the
> past, we were using a patched version of StandardTokenizer which treated
> @twitteruser and #hashtag better, but this became a release engineering
> nightmare so we switched to Whitespace.
In this case, have you considered using a CharFilter (e.g. MappingCharFilter) before the tokenizer? That way you could map your special characters, such as @ and #, to some other string that the tokenizer doesn't split on, e.g. "#" => "HASH_". Then your #foobar becomes HASH_foobar.

If you want searches for "#foobar" to match only "#foobar" and not plain "foobar", and vice versa, you are done.

Maybe instead you want searches for "#foobar" to match only "#foobar", but searches for "foobar" to match both "#foobar" and "foobar". In that case, you would probably use a WordDelimiterFilter with preserveOriginal at index time only, followed by a StopFilter containing HASH, so that you index both HASH_foobar and foobar.

Anyway, I think you have a lot of flexibility to reuse StandardTokenizer and customize things like this without maintaining your own tokenizer. This is the purpose of CharFilters.
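
To make the two variants concrete, here is a rough simulation of the idea in plain Python. It does not use Lucene at all: the regex tokenizer, the helper names, and the "AT_" mapping are stand-ins I made up for illustration, so treat it as a sketch of the token flow rather than the real analysis chain.

```python
import re

# Hypothetical mapping rules, in the spirit of a MappingCharFilter.
# "#" => "HASH_" comes from the suggestion above; "@" => "AT_" is an assumption.
CHAR_MAP = {"#": "HASH_", "@": "AT_"}

# Marker words dropped at index time (stand-in for a StopFilter).
STOP_WORDS = {"HASH", "AT"}


def char_filter(text):
    """Rewrite special characters before the tokenizer ever sees them."""
    for src, dst in CHAR_MAP.items():
        text = text.replace(src, dst)
    return text


def tokenize(text):
    """Stand-in tokenizer: keeps runs of letters, digits, and underscores."""
    return re.findall(r"\w+", text)


def split_preserving_original(token):
    """Stand-in for a word-delimiter step with preserveOriginal:
    emit the original token, plus its parts when it splits on '_'."""
    parts = [p for p in token.split("_") if p]
    return [token] + parts if len(parts) > 1 else [token]


def analyze_index(text):
    """Index-time chain: char filter, tokenizer, delimiter split, stop words."""
    out = []
    for tok in tokenize(char_filter(text)):
        out.extend(t for t in split_preserving_original(tok) if t not in STOP_WORDS)
    return out


def analyze_query(text):
    """Query-time chain: char filter and tokenizer only."""
    return tokenize(char_filter(text))


if __name__ == "__main__":
    print(analyze_index("love #foobar"))  # ['love', 'HASH_foobar', 'foobar']
    print(analyze_query("#foobar"))       # ['HASH_foobar']
    print(analyze_query("foobar"))        # ['foobar']
```

With this split between index time and query time, a query for "foobar" hits documents containing either form, while a query for "#foobar" only hits documents that actually contained the hashtag. Dropping the delimiter and stop-word steps gives the simpler first variant, where the two forms never match each other.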