StandardTokenizer doesn't handle some of the tokens we need, like
@twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese or
Korean. Am I wrong about that?
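
A rough sketch of the behavior in question, in plain Java rather than
Lucene's analysis API. The class and method names are made up for
illustration, and treating "pass it through CJK" as "emit overlapping
bigrams for each CJK run" is an assumption, not something the thread
confirms. Text is split on whitespace, so tokens like @twitteruser
survive intact, and runs of Han, Hiragana, Katakana or Hangul inside each
chunk are broken out separately:

```java
import java.util.ArrayList;
import java.util.List;

public class MixedTokenizerSketch {

    // Assumption: "CJK" here means the Han, Hiragana, Katakana and Hangul scripts.
    static boolean isCjk(int cp) {
        Character.UnicodeScript s = Character.UnicodeScript.of(cp);
        return s == Character.UnicodeScript.HAN
            || s == Character.UnicodeScript.HIRAGANA
            || s == Character.UnicodeScript.KATAKANA
            || s == Character.UnicodeScript.HANGUL;
    }

    // Emit overlapping bigrams for a CJK run; a lone character is emitted as is.
    static void emitCjk(int[] run, int len, List<String> out) {
        if (len == 1) {
            out.add(new String(run, 0, 1));
            return;
        }
        for (int i = 0; i + 1 < len; i++) {
            out.add(new String(run, i, 2));
        }
    }

    // Split on whitespace first, then separate CJK runs from everything else.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) continue;
            int[] cps = chunk.codePoints().toArray();
            StringBuilder other = new StringBuilder(); // pending non-CJK token
            int[] run = new int[cps.length];           // pending CJK run
            int runLen = 0;
            for (int cp : cps) {
                if (isCjk(cp)) {
                    if (other.length() > 0) {
                        out.add(other.toString());
                        other.setLength(0);
                    }
                    run[runLen++] = cp;
                } else {
                    if (runLen > 0) {
                        emitCjk(run, runLen, out);
                        runLen = 0;
                    }
                    other.appendCodePoint(cp);
                }
            }
            if (other.length() > 0) out.add(other.toString());
            if (runLen > 0) emitCjk(run, runLen, out);
        }
        return out;
    }

    public static void main(String[] args) {
        // \u6771\u4EAC\u90FD is a three-character Han string.
        System.out.println(tokenize("hi @twitteruser \u6771\u4EAC\u90FD ok"));
    }
}
```

Like a whitespace tokenizer, this leaves punctuation attached to non-CJK
tokens; it is only meant to show the mixed-language splitting, not to
replace a real analyzer chain.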

On Mon, Nov 29, 2010 at 5:31 PM, Robert Muir <rcm...@gmail.com> wrote:

> On Mon, Nov 29, 2010 at 5:30 PM, Jacob Elder <jel...@locamoda.com> wrote:
> > The problem is that the field is not guaranteed to contain just a single
> > language. I'm looking for some way to pass it first through CJK, then
> > Whitespace.
> >
> > If I'm totally off-target here, is there a recommended way of dealing with
> > mixed-language fields?
> >
>
> maybe you should consider a tokenizer like StandardTokenizer, that
> works reasonably well for most languages.
>



-- 
Jacob Elder
@jelder
(646) 535-3379
