[jira] [Commented] (LUCENE-8524) Nori (Korean) analyzer tokenization issues

2018-10-26 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664887#comment-16664887
 ] 

ASF subversion and git services commented on LUCENE-8524:
----------------------------------------

Commit 403babcfd6d024affc8afad00f8fb78c07053e82 in lucene-solr's branch 
refs/heads/branch_7x from [~jim.ferenczi]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=403babc ]

LUCENE-8524: Add the Hangul Letter Araea (interpunct) as a separator in Nori's 
tokenizer.
This change also removes empty terms and trims surface forms in Nori's Korean 
dictionary.
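
As a quick way to see the behavior this commit adds, here is a minimal sketch 
(not part of the committed patch) that prints Nori's tokens for the arae-a 
example from the issue; it assumes the Lucene analysis-nori module is on the 
classpath, and the class and field names are arbitrary:

{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ko.KoreanAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NoriAraeaDemo {
  public static void main(String[] args) throws IOException {
    // The arae-a-separated list from the issue description.
    String text = "도로ㆍ지반ㆍ수자원ㆍ건설환경ㆍ건축ㆍ화재설비연구";
    try (Analyzer analyzer = new KoreanAnalyzer();
         TokenStream ts = analyzer.tokenStream("field", text)) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      // With U+318D treated as a separator, each list item should now
      // produce its own token(s) instead of one giant trailing token.
      while (ts.incrementToken()) {
        System.out.println(term.toString());
      }
      ts.end();
    }
  }
}
{code}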


> Nori (Korean) analyzer tokenization issues
> ------------------------------------------
>
> Key: LUCENE-8524
> URL: https://issues.apache.org/jira/browse/LUCENE-8524
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Trey Jones
>Priority: Major
> Attachments: LUCENE-8524.patch
>
>
> I opened this originally as an [Elastic 
> bug|https://github.com/elastic/elasticsearch/issues/34283#issuecomment-426940784],
>  but was asked to re-file it here. (Sorry for the poor formatting. 
> "pre-formatted" isn't behaving.)
> *Elastic version*
> {
>  "name" : "adOS8gy",
>  "cluster_name" : "elasticsearch",
>  "cluster_uuid" : "GVS7gpVBQDGwtHl3xnJbLw",
>  "version" : {
>  "number" : "6.4.0",
>  "build_flavor" : "default",
>  "build_type" : "deb",
>  "build_hash" : "595516e",
>  "build_date" : "2018-08-17T23:18:47.308994Z",
>  "build_snapshot" : false,
>  "lucene_version" : "7.4.0",
>  "minimum_wire_compatibility_version" : "5.6.0",
>  "minimum_index_compatibility_version" : "5.0.0"
>  },
>  "tagline" : "You Know, for Search"
> }
>  *Plugins installed:* [analysis-icu, analysis-nori]
> *JVM version:*
>  openjdk version "1.8.0_181"
>  OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1~deb9u1-b13)
>  OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
> *OS version:*
>  Linux vagrantes6 4.9.0-6-amd64 #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02) 
> x86_64 GNU/Linux
> *Description of the problem including expected versus actual behavior:*
> I've uncovered a number of oddities in tokenization in the Nori analyzer. All 
> examples are from [Korean Wikipedia|https://ko.wikipedia.org/] or [Korean 
> Wiktionary|https://ko.wiktionary.org/] (including non-CJK examples). In rough 
> order of importance:
> A. Tokens are split on different character POS types (which seem to not quite 
> line up with Unicode character blocks), which leads to weird results for 
> non-CJK tokens:
>  * εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other 
> symbol) + μί/SL(Foreign language)
>  * ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) 
> + k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + 
> ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol)
>  * Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + 
> лтичко/SL(Foreign language) + ̄/SY(Other symbol)
>  * don't is tokenized as don + t; same for don’t (with a curly apostrophe).
>  * אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
>  * Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many 
> more chances for false positives when the tokens are split up like this. In 
> particular, individual numbers and combining diacritics are indexed 
> separately (e.g., in the Cyrillic example above), which can lead to a 
> performance hit on large corpora like Wiktionary or Wikipedia.
> Workaround: use a character filter to get rid of combining diacritics before 
> Nori processes the text. This doesn't solve the Greek, Hebrew, or English 
> cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek 
> Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. 
> Combining diacritics should not trigger token splits. Non-CJK text should be 
> tokenized on spaces and punctuation, not by character type shifts. 
> Apostrophe-like characters should not trigger token splits (though I could 
> see someone disagreeing on this one).
> B. The character "arae-a" (ㆍ, U+318D) is sometimes used instead of a middle 
> dot (·, U+00B7) for 
> [lists|https://en.wikipedia.org/wiki/Korean_punctuation#Differences_from_European_punctuation].
>  When the arae-a is used, everything after the first one ends up in one giant 
> token. 도로ㆍ지반ㆍ수자원ㆍ건설환경ㆍ건축ㆍ화재설비연구 is tokenized as 도로 + ㆍ지반ㆍ수자원ㆍ건설환경ㆍ건축ㆍ화재설비연구.
>  * Note that "HANGUL *LETTER* ARAEA" (ㆍ, U+318D) is used this way, while 
> "HANGUL *JUNGSEONG* ARAEA" (ᆞ, U+119E) is used to create syllable blocks for 
> which there is no precomposed Unicode character.
> Workaround: use a character filter to convert arae-a 
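
The two character-filter workarounds described above (stripping combining 
diacritics for A, rewriting arae-a for B) can be combined in front of Nori in 
one Analyzer. A minimal sketch, using Lucene's PatternReplaceCharFilter and 
MappingCharFilter; since the workaround text is truncated here, replacing 
arae-a with a plain space is an assumption, not the reporter's exact suggestion:

{code:java}
import java.io.Reader;
import java.util.regex.Pattern;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.ko.KoreanTokenizer;
import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;

public class NoriWorkaroundAnalyzer extends Analyzer {

  // \p{Mn} matches Unicode non-spacing (combining) marks, e.g. U+0300..U+036F.
  private static final Pattern COMBINING_MARKS = Pattern.compile("\\p{Mn}");

  // Map HANGUL LETTER ARAEA (U+318D) to a plain space so it always separates
  // tokens (assumed replacement; the original workaround text is truncated).
  private static final NormalizeCharMap ARAEA_MAP;
  static {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("\u318D", " ");
    ARAEA_MAP = builder.build();
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new KoreanTokenizer();
    return new TokenStreamComponents(tokenizer);
  }

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // Char filters rewrite the text before the tokenizer ever sees it.
    reader = new PatternReplaceCharFilter(COMBINING_MARKS, "", reader);
    return new MappingCharFilter(ARAEA_MAP, reader);
  }
}
{code}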

[jira] [Commented] (LUCENE-8524) Nori (Korean) analyzer tokenization issues

2018-10-26 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664882#comment-16664882
 ] 

ASF subversion and git services commented on LUCENE-8524:
----------------------------------------

Commit 6f291d402b93ca534eccfef620fa392d0cd2b892 in lucene-solr's branch 
refs/heads/master from [~jim.ferenczi]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6f291d4 ]

LUCENE-8524: Add the Hangul Letter Araea (interpunct) as a separator in Nori's 
tokenizer.
This change also removes empty terms and trims surface forms in Nori's Korean 
dictionary.



[jira] [Commented] (LUCENE-8524) Nori (Korean) analyzer tokenization issues

2018-10-25 Thread Lucene/Solr QA (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664372#comment-16664372
 ] 

Lucene/Solr QA commented on LUCENE-8524:


| (x) -1 overall |

|| Vote || Subsystem || Runtime || Comment ||
|| || || || Prechecks ||
| +1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | compile | 0m 35s | master passed |
|| || || || Patch Compile Tests ||
| +1 | compile | 0m 29s | the patch passed |
| +1 | javac | 0m 29s | the patch passed |
| +1 | Release audit (RAT) | 0m 29s | the patch passed |
| +1 | Check forbidden APIs | 0m 29s | the patch passed |
| +1 | Validate source patterns | 0m 29s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 0m 24s | nori in the patch failed. |
| | | 3m 38s | |

|| Reason || Tests ||
| Failed junit tests | lucene.analysis.ko.dict.TestTokenInfoDictionary |

|| Subsystem || Report/Notes ||
| JIRA Issue | LUCENE-8524 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12945190/LUCENE-8524.patch |
| Optional Tests | compile, javac, unit, ratsources, checkforbiddenapis, validatesourcepatterns |
| uname | Linux lucene1-us-west 4.4.0-137-generic #163~14.04.1-Ubuntu SMP Mon Sep 24 17:14:57 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / 8d10939 |
| ant | version: Apache Ant(TM) version 1.9.3 compiled on July 24 2018 |
| Default Java | 1.8.0_172 |
| unit | https://builds.apache.org/job/PreCommit-LUCENE-Build/112/artifact/out/patch-unit-lucene_analysis_nori.txt |
| Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/112/testReport/ |
| modules | C: lucene/analysis/nori U: lucene/analysis/nori |
| Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/112/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (LUCENE-8524) Nori (Korean) analyzer tokenization issues

2018-10-24 Thread Trey Jones (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662833#comment-16662833
 ] 

Trey Jones commented on LUCENE-8524:


{quote}A can be discussed, but I think it needs a separate issue, since this is 
more a feature than a bug. This is a design choice, and I am not sure that 
splitting is really an issue here. We could add a mode that joins multiple 
alphabets together, but it's not a major concern, since these mixed terms 
should appear very rarely.
{quote}
All of the examples except the _Мoscow_ one are taken from Korean Wikipedia or 
Wiktionary, so they do occur. Out of a sample of 10,000 random Korean Wikipedia 
articles (with ~2.4M tokens), 100 Cyrillic and 126 Greek tokens were affected, 
plus an additional 2,758 ID-like tokens (e.g., _BH115E_) and 96 Phonetic 
Alphabet tokens. 769 tokens with apostrophes were affected, too; most were 
possessives with _’s_, but words like _An'gorso_, _Na’vi_, and _O'Donnell_ 
were also included. Out of 2.4M tokens these are rare, but there are still a 
lot of them, especially when you scale the 10K sample up 43x to the full 430K 
articles on Korean Wikipedia.

It definitely seems like a bug that a Greek word like _εἰμί_ gets split into 
three tokens, or _Ба̀лтичко̄_ into four. The Greek case seems to be the worst, 
since _ἰ_ is in the “Greek Extended” Unicode block while the rest of the 
letters are in the “Greek and Coptic” block, and those aren’t really different 
character sets.
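
For illustration, java.lang.Character draws the same distinction: 
Character.UnicodeBlock separates the two ranges, while Character.UnicodeScript 
reports GREEK for all four letters, which is roughly the granularity the 
suggested fix asks for. A small plain-JDK sketch:

{code:java}
public class GreekBlocksDemo {
  public static void main(String[] args) {
    // The letters of εἰμί span two Unicode *blocks* but one *script*;
    // splitting on script changes rather than block/type changes would
    // keep the word intact.
    for (int cp : "εἰμί".codePoints().toArray()) {
      System.out.printf("U+%04X  block=%s  script=%s%n",
          cp,
          Character.UnicodeBlock.of(cp),    // GREEK vs GREEK_EXTENDED
          Character.UnicodeScript.of(cp));  // GREEK for all four letters
    }
  }
}
{code}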

*Thanks for fixing B, D, and E!*


[jira] [Commented] (LUCENE-8524) Nori (Korean) analyzer tokenization issues

2018-10-23 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16660378#comment-16660378
 ] 

Jim Ferenczi commented on LUCENE-8524:
----------------------------------------

Sorry for the late reply. D and E are bugs caused by invalid rules in the 
mecab-ko-dic; I'll provide a patch shortly.

I was not aware of B, but it should be easy to add the interpunct as a separator.

A can be discussed, but I think it needs a separate issue, since this is more a 
feature than a bug. This is a design choice, and I am not sure that splitting is 
really an issue here. We could add a mode that joins multiple alphabets 
together, but it's not a major concern, since these mixed terms should appear 
very rarely.

Regarding C, IMO it's a normalization issue; if you don't want to break on this 
character, you should remove it with a char filter.


[jira] [Commented] (LUCENE-8524) Nori (Korean) analyzer tokenization issues

2018-10-04 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639118#comment-16639118
 ] 

Tomoko Uchida commented on LUCENE-8524:
----------------------------------------

I have not looked at this closely yet, so this is my intuition rather than a 
strong opinion...

About problem A:

I think this is not a problem with the tokenizer itself but with the built-in 
dictionary.
 Nori includes mecab-ko-dic ([https://bitbucket.org/eunjeon/mecab-ko-dic]) as 
its built-in dictionary; it is a derivative of MeCab IPADIC 
([https://sourceforge.net/projects/mecab/]), a widely used Japanese dictionary.
 JapaneseTokenizer (a.k.a. Kuromoji), which includes MeCab IPADIC, behaves in 
the same manner. In fact, the original MeCab does not handle such non-CJK 
tokens correctly.

I cannot say this is a fault of MeCab IPADIC; it was originally built for 
handling Japanese text, before the Unicode era.
 But we can apply patches to the dictionary when building it.
 A patch to the `seed/char.def` file, which is used for "unknown word" 
handling, should be sufficient, I think.
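
For a concrete picture, here is a hypothetical fragment of such a patch, 
assuming the stock MeCab char.def syntax (CATEGORY INVOKE GROUP LENGTH 
definitions followed by code-point-range-to-category mappings); the actual 
category names and flags used by mecab-ko-dic may differ, so this is an 
untested sketch rather than an actual patch:

{code}
# Hypothetical seed/char.def fragment.
# GROUP=1 groups a run of same-category characters into one unknown word,
# so putting both Greek blocks in one category keeps εἰμί as a single token.

GREEK          1 1 0
CYRILLIC       1 1 0

0x0370..0x03FF GREEK      # Greek and Coptic
0x1F00..0x1FFF GREEK      # Greek Extended
0x0400..0x04FF CYRILLIC   # Cyrillic
0x0500..0x052F CYRILLIC   # Cyrillic Supplement
{code}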
