On Mon, Nov 29, 2010 at 5:35 PM, Jacob Elder <jel...@locamoda.com> wrote:
> StandardTokenizer doesn't handle some of the tokens we need, like
> @twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese or
> Korean. Am I wrong about that?

It uses the unigram method for CJK ideographs (each ideograph becomes its
own token)... CJKTokenizer uses the bigram method instead (overlapping
pairs of adjacent ideographs). It's just an alternative method.

Whitespace tokenization doesn't work at all for CJK though, since the text
has no spaces between words, so give up on that!
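
To make the difference concrete, here is a toy sketch in plain Python. It
is not Lucene code and does not call any Lucene API: the function names
`unigrams` and `bigrams` and the sample string are made up for
illustration, and the input is assumed to be a single run of CJK
ideographs with no other characters mixed in.

```python
def unigrams(run):
    """Unigram method: every ideograph becomes its own token."""
    return list(run)


def bigrams(run):
    """Bigram method: overlapping pairs of adjacent ideographs.

    A run of a single ideograph is returned as one token; that is a
    choice made for this sketch, so a lone character is not dropped.
    """
    if len(run) < 2:
        return list(run)
    return [run[i:i + 2] for i in range(len(run) - 1)]


# Hypothetical sample text: four ideographs, no spaces.
text = "東京都庁"

print(unigrams(text))  # ['東', '京', '都', '庁']
print(bigrams(text))   # ['東京', '京都', '都庁']
print(text.split())    # ['東京都庁']  (whitespace splitting finds nothing to split on)
```

Unigrams match on any shared character, bigrams trade a larger index for
somewhat more precise matches, and splitting on whitespace leaves the whole
run as one token, which is why it is no use here.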
