Jim Ferenczi created LUCENE-8526:
------------------------------------

             Summary: StandardTokenizer doesn't separate Hangul characters from non-CJK chars
                 Key: LUCENE-8526
                 URL: https://issues.apache.org/jira/browse/LUCENE-8526
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Jim Ferenczi


It was first reported here: 
https://github.com/elastic/elasticsearch/issues/34285.
I don't know if it's the expected behavior, but the StandardTokenizer does not 
split words that are composed of a mix of non-CJK characters and Hangul 
syllables. For instance, "한국2018" or "한국abc" is kept as-is by this tokenizer 
and marked as an alpha-numeric group. This breaks the CJKBigram token filter, 
which will not build bigrams on such groups. The other CJK characters are 
correctly split when they are mixed with another alphabet, so I'd expect the 
same for Hangul.
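
For reference, a minimal sketch that reproduces the behavior (assuming 
lucene-core is on the classpath; the class name HangulSplitRepro is just for 
illustration):

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class HangulSplitRepro {
  public static void main(String[] args) throws Exception {
    try (StandardTokenizer tokenizer = new StandardTokenizer()) {
      // Mixed Hangul + digits vs. mixed Han (ideographic) + digits.
      tokenizer.setReader(new StringReader("한국2018 中国2018"));
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
      tokenizer.reset();
      while (tokenizer.incrementToken()) {
        // Print each emitted token together with its type.
        System.out.println(term.toString() + " " + type.type());
      }
      tokenizer.end();
    }
  }
}
{code}

Based on the description above, "한국2018" comes back as a single token typed 
<ALPHANUM>, whereas "中国2018" is split so that the digits end up in their own 
token, separate from the ideographic characters.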


