Have you looked at the specification to see how it's _supposed_ to work? From the javadocs: "implements Unicode text segmentation, as specified by UAX#29."
See http://unicode.org/reports/tr29/#Word_Boundaries

If you look at the spec and feel that ClassicAnalyzer incorrectly
implements the word break rules, then perhaps there's a JIRA.

Best,
Erick

On Thu, Oct 19, 2017 at 6:39 AM, Chitra <chithu.r...@gmail.com> wrote:
> Hi,
> I indexed the term 'ⒶeŘꝋꝒɫⱯŋɇ' (aeroplane), and it was indexed as
> "er l n"; some characters were trimmed during indexing.
>
> Here is my code:
>
>     protected Analyzer.TokenStreamComponents createComponents(final String fieldName, final Reader reader)
>     {
>         final ClassicTokenizer src = new ClassicTokenizer(getVersion(), reader);
>         src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
>
>         TokenStream tok = new ClassicFilter(src);
>         tok = new LowerCaseFilter(getVersion(), tok);
>         tok = new StopFilter(getVersion(), tok, stopwords);
>         tok = new ASCIIFoldingFilter(tok); // to enable accent-insensitive search
>
>         return new Analyzer.TokenStreamComponents(src, tok)
>         {
>             @Override
>             protected void setReader(final Reader reader) throws IOException
>             {
>                 src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
>                 super.setReader(reader);
>             }
>         };
>     }
>
> Am I missing anything? Is this expected behavior for my input, or is
> there some reason behind such abnormal behavior?
>
> --
> Regards,
> Chitra
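A minimal sketch of a side-by-side check, assuming a Lucene 4.x-style API
in which tokenizers still take a Reader in the constructor (matching the
code quoted above); the Version constant below is illustrative, so
substitute the one for your release. It feeds the problem term to both
ClassicTokenizer, which keeps the old pre-3.1 StandardTokenizer grammar,
and StandardTokenizer, which implements the UAX#29 word break rules that
the javadoc quote refers to:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.standard.ClassicTokenizer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class TokenizerComparison {

        // Drain a tokenizer and print each emitted term, following the
        // standard TokenStream workflow: reset, incrementToken, end, close.
        static void dumpTokens(String label, Tokenizer tok) throws Exception {
            CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
            System.out.println(label);
            tok.reset();
            while (tok.incrementToken()) {
                System.out.println("  [" + term.toString() + "]");
            }
            tok.end();
            tok.close();
        }

        public static void main(String[] args) throws Exception {
            String input = "ⒶeŘꝋꝒɫⱯŋɇ";
            // Illustrative version constant; use the one for your release.
            Version v = Version.LUCENE_4_10_4;

            // Old pre-3.1 StandardTokenizer grammar; does not follow UAX#29.
            dumpTokens("ClassicTokenizer:", new ClassicTokenizer(v, new StringReader(input)));

            // Implements the UAX#29 word break rules from the javadoc quote.
            dumpTokens("StandardTokenizer:", new StandardTokenizer(v, new StringReader(input)));
        }
    }

If StandardTokenizer keeps the run of letters intact while ClassicTokenizer
emits fragments like "eŘ", "ɫ", "ŋ", that would confirm the characters are
being dropped by the tokenizer grammar itself rather than by the downstream
filters in the chain.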