The problem is that the field is not guaranteed to contain a single language. I'm looking for some way to pass it first through the CJK tokenizer, then through the whitespace tokenizer.
If I'm totally off-target here, is there a recommended way of dealing with mixed-language fields?

On Mon, Nov 29, 2010 at 5:22 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> You can use only one tokenizer per analyzer. You'd better use separate
> fields + fieldTypes for different languages.
>
> > I am looking for a clear example of using more than one tokenizer for a
> > single source field. My application has a single "body" field which until
> > recently was all Latin characters, but we're now encountering both English
> > and Japanese words in a single message. Obviously, we need to be using CJK
> > in addition to WhitespaceTokenizerFactory.
> >
> > I've found some references to using copyFields or NGrams but I can't quite
> > grasp what the whole solution would look like.

--
Jacob Elder
@jelder
(646) 535-3379
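
For reference, the separate-fields suggestion quoted above, combined with the copyField idea from the original question, might look roughly like the schema.xml fragment below. This is only a sketch under assumptions: the names text_ws, text_cjk and body_cjk are made up for illustration, the lowercase filter is an optional extra, and it assumes a Solr version that ships solr.CJKTokenizerFactory. It has not been tested against the schema discussed in this thread.

```xml
<!-- Hypothetical type names; each analyzer still has exactly one tokenizer. -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- "body" is the existing field; "body_cjk" is an assumed second field
     that receives the same raw text but is analyzed with the CJK tokenizer. -->
<field name="body"     type="text_ws"  indexed="true" stored="true"/>
<field name="body_cjk" type="text_cjk" indexed="true" stored="false"/>

<!-- copyField copies the original input before analysis, so each field
     applies its own tokenizer to the full mixed-language text. -->
<copyField source="body" dest="body_cjk"/>
```

With a layout like this, queries would need to search both fields, for example by listing body and body_cjk in the qf parameter of a dismax request. The text is not passed through one tokenizer and then the other; it is indexed twice, once per analyzer.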