StandardTokenizer doesn't handle some of the tokens we need, like @twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese or Korean. Am I wrong about that?
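For what it's worth, this is roughly how I've been checking what a tokenizer actually emits (just a quick sketch against Lucene 3.0.x; the class name, sample string, and version constant are only for illustration):

import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class TokenizerCheck {
    public static void main(String[] args) throws Exception {
        // Mixed input: a Twitter handle plus some CJK text.
        String text = "@twitteruser 東京 camera";
        StandardTokenizer tokenizer =
            new StandardTokenizer(Version.LUCENE_30, new StringReader(text));
        TermAttribute term = tokenizer.addAttribute(TermAttribute.class);
        while (tokenizer.incrementToken()) {
            // Print each token so you can see what happens to @handles and CJK.
            System.out.println(term.term());
        }
        tokenizer.end();
        tokenizer.close();
    }
}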
On Mon, Nov 29, 2010 at 5:31 PM, Robert Muir <rcm...@gmail.com> wrote:
> On Mon, Nov 29, 2010 at 5:30 PM, Jacob Elder <jel...@locamoda.com> wrote:
> > The problem is that the field is not guaranteed to contain just a single
> > language. I'm looking for some way to pass it first through CJK, then
> > Whitespace.
> >
> > If I'm totally off-target here, is there a recommended way of dealing with
> > mixed-language fields?
> >
>
> maybe you should consider a tokenizer like StandardTokenizer, that
> works reasonably well for most languages.

--
Jacob Elder
@jelder
(646) 535-3379