The problem is that the field is not guaranteed to contain just a single
language. I'm looking for some way to pass it first through the CJK
tokenizer, then through the whitespace tokenizer.

If I'm totally off-target here, is there a recommended way of dealing with
mixed-language fields?
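
To make the question concrete, here is a rough sketch of what I understand
the separate-fields/copyField approach to look like. The field and type
names (body_cjk, text_ws, text_cjk) are placeholders I made up, and I'm
assuming both solr.WhitespaceTokenizerFactory and solr.CJKTokenizerFactory
are available in the Solr version in use. It is untested:

```xml
<!-- Sketch for schema.xml; every field and type name here is a placeholder. -->
<types>
  <!-- Whitespace-split analysis for the Latin-script text -->
  <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <!-- CJK analysis for the Japanese text -->
  <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.CJKTokenizerFactory"/>
    </analyzer>
  </fieldType>
</types>
<fields>
  <!-- Same source text indexed twice, once per analyzer -->
  <field name="body" type="text_ws" indexed="true" stored="true"/>
  <field name="body_cjk" type="text_cjk" indexed="true" stored="false"/>
</fields>
<copyField source="body" dest="body_cjk"/>
```

If that is right, queries would presumably have to hit both fields, for
example with dismax and qf=body body_cjk.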

On Mon, Nov 29, 2010 at 5:22 PM, Markus Jelsma
<markus.jel...@openindex.io> wrote:

> You can use only one tokenizer per analyzer. You'd better use separate
> fields + fieldTypes for different languages.
>
> > I am looking for a clear example of using more than one tokenizer for a
> > single source field. My application has a single "body" field which until
> > recently was all Latin characters, but we're now encountering both English
> > and Japanese words in a single message. Obviously, we need to be using CJK
> > in addition to WhitespaceTokenizerFactory.
> >
> > I've found some references to using copyFields or NGrams but I can't quite
> > grasp what the whole solution would look like.
>



-- 
Jacob Elder
@jelder
(646) 535-3379
