Hi,

I have documents in different languages, and I want to choose the tokenizer for each document based on its language. The language is already known and is indexed in a field. When I index the document's text, I want to pick the tokenizer based on the value of that language field. I need to use a single field for the text (defining a separate text field per language is not an option).

It seems I can define a tokenizer for a field, so I guess I need to write a custom tokenizer that looks at the document's language field and delegates to the appropriate tokenizer for that language (e.g. StandardTokenizer for English, CJKTokenizer for CJK languages, etc.).

From what I have read, writing a custom tokenizer seems quite straightforward, but how would this custom tokenizer know the language of the document? Is there some way I can pass this value in to the tokenizer? Or is there some way for the tokenizer to access other fields in the document?

It would be really helpful if someone could provide an answer.
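To make the question concrete, here is a rough sketch of the dispatch I have in mind. It is plain Java and does not use the Lucene API at all: the class name LanguageDispatch, the language codes, and the two stand-in tokenizers (whitespace splitting in place of StandardTokenizer, overlapping character bigrams in place of CJKTokenizer) are all made up for illustration. The part I don't know how to do in Lucene is getting the `language` argument to the tokenizer in the first place.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names are Lucene classes.
class LanguageDispatch {

    // Stand-in for StandardTokenizer: split on runs of whitespace.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stand-in for CJKTokenizer: overlapping bigrams over the characters.
    static List<String> bigramTokens(String text) {
        String s = text.replaceAll("\\s+", "");
        List<String> tokens = new ArrayList<>();
        if (s.length() == 1) {
            tokens.add(s); // a single character has no bigram
            return tokens;
        }
        for (int i = 0; i + 1 < s.length(); i++) {
            tokens.add(s.substring(i, i + 2));
        }
        return tokens;
    }

    // The dispatch step: the language value selects the tokenizer.
    // The language codes here are assumed, not taken from any real index.
    static List<String> tokenize(String language, String text) {
        switch (language) {
            case "zh":
            case "ja":
            case "ko":
                return bigramTokens(text);
            default:
                return whitespaceTokens(text);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("en", "Hello world"));
        // The three characters of "Tokyo-to", written as Unicode escapes.
        System.out.println(tokenize("ja", "\u6771\u4EAC\u90FD"));
    }
}
```

So the selection logic itself is simple once the language is in hand; what I am missing is the hook that supplies it per document.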
Thanks,
Prabhu