Hi all,

I'm trying to index some Indian web pages that are mostly Hindi, with roughly 5% English mixed into the same page. I can't use the standard or simple analyzer for this, because they break the non-English words in the wrong places (isLetter(ch) happens to be false for some of those characters, even though they are part of a word). So I wrote/extended an analyzer that does the following:

    public class IndicAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(reader);
        //ts = new WordDelimiterFilter(ts, 1, 1, 1, 1, 0);
        ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
        ts = new LowerCaseFilter(ts);
        ts = new PorterStemFilter(ts);
        return ts;
      }
    }

With the commented-out line left as is, this works to some extent, but it fails to return results when the document contains a string like "he...@how.com" and the query is "hello". That is expected, since the chain above does no word delimiting on characters like @ , . and so on.

The problem is that when I enable the WordDelimiterFilter (the commented-out line; I took this filter from Solr), it breaks Hindi words on characters that are actually part of the word. Going through the filter's code, I found it uses Java's standard isLetter() function, which I think returns false for the Hindi characters it is splitting on. Per the javadoc, isLetter() is Unicode compliant, right? So shouldn't it know that these characters are not word delimiters? Why is it still breaking on them?

I'm stuck and don't know how to get rid of the problem. Because of it, when I search for a Hindi word, say "helo" (assume it's Hindi), the highlighter marks that word but also marks the individual letters h/e/l/o wherever it finds them, which it should not do, right?
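For what it's worth, here is a small sketch of what I suspect is happening (the Devanagari code points are my own illustrative example, not taken from the pages I'm indexing): Character.isLetter() is indeed Unicode-aware, but Devanagari vowel signs (matras) are combining marks (category Mn), not letters, so isLetter() returns false for them even though they belong to the word:

```java
// Sketch: why a filter that splits on !Character.isLetter(ch) breaks Hindi words.
public class IsLetterCheck {
    public static void main(String[] args) {
        char ka = '\u0915';      // DEVANAGARI LETTER KA -> category Lo (letter)
        char matraAA = '\u093E'; // DEVANAGARI VOWEL SIGN AA -> category Mn (combining mark)

        System.out.println(Character.isLetter(ka));      // true
        System.out.println(Character.isLetter(matraAA)); // false

        // So a delimiter test based only on isLetter() would split a word like
        // "\u0915\u093E\u092E" (kaam) right after the KA, because the matra
        // that follows it is a mark, not a letter.
        System.out.println(Character.getType(matraAA) == Character.NON_SPACING_MARK); // true
    }
}
```

If that is the cause, a delimiter check would need to treat combining marks (Character.NON_SPACING_MARK, etc.) as part of the word, not as break characters.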
I'd appreciate guidance from both the Solr and Lucene communities on fixing this. BTW, do we need to do some sort of normalization on the content before sending it to the Lucene indexer? Just a thought; I don't know what the way out is.
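On the normalization question: one thing that can matter for Indic text is Unicode canonical equivalence, since the same visible word can be encoded in more than one code-point sequence. A minimal sketch using java.text.Normalizer (available since Java 6; the Devanagari code points here are illustrative assumptions, not from my actual data):

```java
import java.text.Normalizer;

public class NormalizeSketch {
    public static void main(String[] args) {
        // The same letter can arrive as KA + NUKTA (two code points) or as the
        // precomposed character U+0958. As raw strings they compare unequal:
        String decomposed = "\u0915\u093C"; // KA + combining NUKTA
        String precomposed = "\u0958";      // DEVANAGARI LETTER QA
        System.out.println(decomposed.equals(precomposed)); // false

        // NFC maps both to the same canonical form, so normalized text
        // tokenizes and matches consistently at index and query time.
        String n1 = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        String n2 = Normalizer.normalize(precomposed, Normalizer.Form.NFC);
        System.out.println(n1.equals(n2)); // true
    }
}
```

If normalization is the way to go, presumably it should be applied identically to both the indexed content and the incoming queries.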