peina created LUCENE-7509:
-----------------------------

             Summary: [smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended
                 Key: LUCENE-7509
                 URL: https://issues.apache.org/jira/browse/LUCENE-7509
             Project: Lucene - Core
          Issue Type: Bug
          Components: modules/analysis
    Affects Versions: 6.2.1
         Environment: Mac OS X 10.10
            Reporter: peina

Some Chinese text is not tokenized correctly when a Chinese punctuation mark is appended to it.

For example, 碧绿的眼珠 is tokenized as 碧绿|的|眼珠, which is correct. But 碧绿的眼珠, (with a Chinese punctuation mark appended) is tokenized as 碧绿|的|眼|珠.

A similar problem occurs when numbers are appended to the text, e.g.:

    生活报8月4号 --> 生活|报|8|月|4|号
    生活报 --> 生活报

Test sample:

    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class SmartcnSample {

        public static void main(String[] args) throws IOException {
            // The default constructor loads the default stopwords.
            Analyzer analyzer = new SmartChineseAnalyzer();

            System.out.println("Sample1=======");
            printTokens(analyzer, "生活报8月4号");
            printTokens(analyzer, "生活报");

            System.out.println("Sample2=======");
            printTokens(analyzer, "碧绿的眼珠,");
            printTokens(analyzer, "碧绿的眼珠");

            analyzer.close();
        }

        // Prints the sentence, then each token the analyzer emits, one per line.
        private static void printTokens(Analyzer analyzer, String sentence) throws IOException {
            System.out.println("sentence:" + sentence);
            try (TokenStream tokens = analyzer.tokenStream("dummyfield", sentence)) {
                CharTermAttribute termAttr = tokens.addAttribute(CharTermAttribute.class);
                tokens.reset();
                while (tokens.incrementToken()) {
                    System.out.println(termAttr.toString());
                }
                tokens.end();
            }
        }
    }

Output:

    Sample1=======
    sentence:生活报8月4号
    生活
    报
    8
    月
    4
    号
    sentence:生活报
    生活报
    Sample2=======
    sentence:碧绿的眼珠,
    碧绿
    的
    眼
    珠
    sentence:碧绿的眼珠
    碧绿
    的
    眼珠
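
The expectation behind both samples could be phrased as a small check: appending a suffix should not change how the original text is segmented. The following is only a sketch of that idea. PrefixStabilityCheck, isPrefixStable and reportedOutputs are hypothetical names, not part of Lucene, and the tokenizer here simply replays the token lists quoted in this report rather than calling SmartChineseAnalyzer. It also assumes that "tokens of the text are a prefix of the tokens of text plus suffix" is the right way to state the expected behavior.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper (not a Lucene class): checks whether appending a
// suffix changes the segmentation of the original text.
final class PrefixStabilityCheck {

    // True if tokens(text) is a prefix of tokens(text + suffix).
    static boolean isPrefixStable(Function<String, List<String>> tokenizer,
                                  String text, String suffix) {
        List<String> base = tokenizer.apply(text);
        List<String> extended = tokenizer.apply(text + suffix);
        return extended.size() >= base.size()
                && extended.subList(0, base.size()).equals(base);
    }

    // Replays the token lists quoted in the report; no analyzer is invoked.
    static Map<String, List<String>> reportedOutputs() {
        Map<String, List<String>> m = new HashMap<>();
        m.put("碧绿的眼珠", Arrays.asList("碧绿", "的", "眼珠"));
        m.put("碧绿的眼珠,", Arrays.asList("碧绿", "的", "眼", "珠"));
        m.put("生活报", Arrays.asList("生活报"));
        m.put("生活报8月4号", Arrays.asList("生活", "报", "8", "月", "4", "号"));
        return m;
    }

    public static void main(String[] args) {
        Function<String, List<String>> replay = reportedOutputs()::get;
        System.out.println(isPrefixStable(replay, "碧绿的眼珠", ","));   // prints false
        System.out.println(isPrefixStable(replay, "生活报", "8月4号"));  // prints false
    }
}
```

To turn this into a real regression check, the replay function would be swapped for one that collects the tokens from analyzer.tokenStream(...), in the same way printTokens does in the test sample.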