[jira] [Commented] (LUCENE-8524) Nori (Korean) analyzer tokenization issues
[ https://issues.apache.org/jira/browse/LUCENE-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664887#comment-16664887 ]

ASF subversion and git services commented on LUCENE-8524:
----------------------------------------------------------

Commit 403babcfd6d024affc8afad00f8fb78c07053e82 in lucene-solr's branch refs/heads/branch_7x from [~jim.ferenczi]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=403babc ]

LUCENE-8524: Add the Hangul Letter Araea (interpunct) as a separator in Nori's tokenizer. This change also removes empty terms and trims surface forms in Nori's Korean dictionary.

> Nori (Korean) analyzer tokenization issues
> ------------------------------------------
>
>                 Key: LUCENE-8524
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8524
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Trey Jones
>            Priority: Major
>         Attachments: LUCENE-8524.patch
>
> I opened this originally as an [Elastic bug|https://github.com/elastic/elasticsearch/issues/34283#issuecomment-426940784], but was asked to re-file it here. (Sorry for the poor formatting. "pre-formatted" isn't behaving.)
>
> *Elastic version*
> {
>   "name" : "adOS8gy",
>   "cluster_name" : "elasticsearch",
>   "cluster_uuid" : "GVS7gpVBQDGwtHl3xnJbLw",
>   "version" : {
>     "number" : "6.4.0",
>     "build_flavor" : "default",
>     "build_type" : "deb",
>     "build_hash" : "595516e",
>     "build_date" : "2018-08-17T23:18:47.308994Z",
>     "build_snapshot" : false,
>     "lucene_version" : "7.4.0",
>     "minimum_wire_compatibility_version" : "5.6.0",
>     "minimum_index_compatibility_version" : "5.0.0"
>   },
>   "tagline" : "You Know, for Search"
> }
> *Plugins installed:* [analysis-icu, analysis-nori]
> *JVM version:*
> openjdk version "1.8.0_181"
> OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1~deb9u1-b13)
> OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
> *OS version:*
> Linux vagrantes6 4.9.0-6-amd64 #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02) x86_64 GNU/Linux
>
> *Description of the problem including expected versus actual behavior:*
> I've uncovered a number of oddities in tokenization in the Nori analyzer. All examples are from [Korean Wikipedia|https://ko.wikipedia.org/] or [Korean Wiktionary|https://ko.wiktionary.org/] (including non-CJK examples). In rough order of importance:
>
> A. Tokens are split on different character POS types (which seem to not quite line up with Unicode character blocks), which leads to weird results for non-CJK tokens:
> * εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other symbol) + μί/SL(Foreign language)
> * ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + k/SL(Foreign language) + ̚/SY(Other symbol)
> * Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + лтичко/SL(Foreign language) + ̄/SY(Other symbol)
> * don't is tokenized as don + t; same for don’t (with a curly apostrophe).
> * אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
> * Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many more chances for false positives when the tokens are split up like this. In particular, individual numbers and combining diacritics are indexed separately (e.g., in the Cyrillic example above), which can lead to a performance hit on large corpora like Wiktionary or Wikipedia.
> Work around: use a character filter to get rid of combining diacritics before Nori processes the text. This doesn't solve the Greek, Hebrew, or English cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. Combining diacritics should not trigger token splits. Non-CJK text should be tokenized on spaces and punctuation, not by character type shifts. Apostrophe-like characters should not trigger token splits (though I could see someone disagreeing on this one).
>
> B. The character "arae-a" (ㆍ, U+318D) is sometimes used instead of a middle dot (·, U+00B7) for [lists|https://en.wikipedia.org/wiki/Korean_punctuation#Differences_from_European_punctuation]. When the arae-a is used, everything after the first one ends up in one giant token. 도로ㆍ지반ㆍ수자원ㆍ건설환경ㆍ건축ㆍ화재설비연구 is tokenized as 도로 + ㆍ지반ㆍ수자원ㆍ건설환경ㆍ건축ㆍ화재설비연구.
> * Note that "HANGUL *LETTER* ARAEA" (ㆍ, U+318D) is used this way, while "HANGUL *JUNGSEONG* ARAEA" (ᆞ, U+119E) is used to create syllable blocks for which there is no precomposed Unicode character.
> Work around: use a character filter to convert arae-a
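Editor's note: as a rough illustration of the character-filter work-around described above (a sketch only, not part of the original report), a Lucene MappingCharFilter can rewrite HANGUL LETTER ARAEA (U+318D) to MIDDLE DOT (U+00B7) before KoreanTokenizer sees the text. The class names are real Lucene APIs; the sample string and the choice of mapping target are assumptions taken from the report.

{code:java}
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.ko.KoreanTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AraeaWorkaroundSketch {
  public static void main(String[] args) throws IOException {
    // Map HANGUL LETTER ARAEA (U+318D) to MIDDLE DOT (U+00B7) before tokenization,
    // so that list items separated by arae-a are not glued into one giant token.
    NormalizeCharMap.Builder mappings = new NormalizeCharMap.Builder();
    mappings.add("\u318D", "\u00B7");

    Reader input = new StringReader("도로\u318D지반\u318D수자원");
    Reader filtered = new MappingCharFilter(mappings.build(), input);

    try (Tokenizer tokenizer = new KoreanTokenizer()) {
      tokenizer.setReader(filtered);
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      tokenizer.reset();
      while (tokenizer.incrementToken()) {
        System.out.println(term.toString());
      }
      tokenizer.end();
    }
  }
}
{code}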
[jira] [Commented] (LUCENE-8524) Nori (Korean) analyzer tokenization issues
[ https://issues.apache.org/jira/browse/LUCENE-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664882#comment-16664882 ]

ASF subversion and git services commented on LUCENE-8524:
----------------------------------------------------------

Commit 6f291d402b93ca534eccfef620fa392d0cd2b892 in lucene-solr's branch refs/heads/master from [~jim.ferenczi]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6f291d4 ]

LUCENE-8524: Add the Hangul Letter Araea (interpunct) as a separator in Nori's tokenizer. This change also removes empty terms and trims surface forms in Nori's Korean dictionary.
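Editor's note: a quick way to observe the effect of this change (a sketch, not taken from the commit or its tests) is to run KoreanAnalyzer over a string containing arae-a and print the terms; with the interpunct treated as a separator, each list item should come out as its own token. The field name and sample string are placeholders.

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ko.KoreanAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AraeaSeparatorCheck {
  public static void main(String[] args) throws IOException {
    try (Analyzer analyzer = new KoreanAnalyzer();
         TokenStream stream = analyzer.tokenStream("field", "도로\u318D지반\u318D수자원")) {
      CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
      stream.reset();
      while (stream.incrementToken()) {
        // After this change, each segment between arae-a characters should be its own token.
        System.out.println(term.toString());
      }
      stream.end();
    }
  }
}
{code}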
[jira] [Commented] (LUCENE-8524) Nori (Korean) analyzer tokenization issues
[ https://issues.apache.org/jira/browse/LUCENE-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664372#comment-16664372 ]

Lucene/Solr QA commented on LUCENE-8524:
----------------------------------------

(x) *-1 overall*

|| Vote || Subsystem || Runtime || Comment ||
|| || || || Prechecks ||
| +1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | compile | 0m 35s | master passed |
|| || || || Patch Compile Tests ||
| +1 | compile | 0m 29s | the patch passed |
| +1 | javac | 0m 29s | the patch passed |
| +1 | Release audit (RAT) | 0m 29s | the patch passed |
| +1 | Check forbidden APIs | 0m 29s | the patch passed |
| +1 | Validate source patterns | 0m 29s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 0m 24s | nori in the patch failed. |
| | | 3m 38s | |

|| Reason || Tests ||
| Failed junit tests | lucene.analysis.ko.dict.TestTokenInfoDictionary |

|| Subsystem || Report/Notes ||
| JIRA Issue | LUCENE-8524 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12945190/LUCENE-8524.patch |
| Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns |
| uname | Linux lucene1-us-west 4.4.0-137-generic #163~14.04.1-Ubuntu SMP Mon Sep 24 17:14:57 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / 8d10939 |
| ant | version: Apache Ant(TM) version 1.9.3 compiled on July 24 2018 |
| Default Java | 1.8.0_172 |
| unit | https://builds.apache.org/job/PreCommit-LUCENE-Build/112/artifact/out/patch-unit-lucene_analysis_nori.txt |
| Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/112/testReport/ |
| modules | C: lucene/analysis/nori U: lucene/analysis/nori |
| Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/112/console |
| Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (LUCENE-8524) Nori (Korean) analyzer tokenization issues
[ https://issues.apache.org/jira/browse/LUCENE-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662833#comment-16662833 ]

Trey Jones commented on LUCENE-8524:
------------------------------------

{quote}A can be discussed but I think it needs a separate issue since this is more a feature than a bug. This is a design choice and I am not sure that splitting is really an issue here. We could add a mode that joins multiple alphabets together but it's not a major concern since these mixed terms should appear very rarely.{quote}

All of the examples except for the _Мoscow_ one are taken from Korean Wikipedia or Wiktionary, so they do occur. Out of a sample of 10,000 random Korean Wikipedia articles (with ~2.4M tokens), 100 Cyrillic and 126 Greek tokens were affected. An additional 2758 ID-like tokens (e.g., _BH115E_) were affected, as were 96 phonetic-alphabet tokens. 769 tokens with apostrophes were affected, too; most were possessives with _’s,_ but they also included words like _An'gorso, Na’vi,_ and _O'Donnell._ Out of 2.4M tokens these are rare, but there are still a lot of them—especially when you scale the 10K sample up 43x to the full 430K articles on Wikipedia.

It definitely seems like a bug that a Greek word like _εἰμί_ gets split into three tokens, or _Ба̀лтичко̄_ gets split into four. The Greek case seems to be the worst, since _ἰ_ is in the "Greek Extended" Unicode block while the rest are in the "Greek and Coptic" block, which aren't really different character sets.

*Thanks for fixing B, D, and E!*
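Editor's note: for context on how counts like these might be gathered (a sketch of one possible heuristic, not the methodology actually used in the comment above), one could flag tokens that contain combining marks or mix Unicode scripts. Method and class names below are illustrative.

{code:java}
public class MixedScriptCheck {
  /** Returns true if the token contains a combining mark or characters from more than one script. */
  static boolean looksAffected(String token) {
    Character.UnicodeScript seen = null;
    for (int i = 0; i < token.length(); ) {
      int cp = token.codePointAt(i);
      if (Character.getType(cp) == Character.NON_SPACING_MARK) {
        return true; // combining diacritic, e.g. U+0300
      }
      Character.UnicodeScript script = Character.UnicodeScript.of(cp);
      if (script != Character.UnicodeScript.COMMON && script != Character.UnicodeScript.INHERITED) {
        if (seen != null && script != seen) {
          return true; // mixes two scripts, e.g. Cyrillic М followed by Latin "oscow"
        }
        seen = script;
      }
      i += Character.charCount(cp);
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(looksAffected("Мoscow"));    // true: Cyrillic + Latin
    System.out.println(looksAffected("Ба̀лтичко̄")); // true: combining marks
    System.out.println(looksAffected("don't"));     // false under this heuristic
  }
}
{code}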
[jira] [Commented] (LUCENE-8524) Nori (Korean) analyzer tokenization issues
[ https://issues.apache.org/jira/browse/LUCENE-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660378#comment-16660378 ]

Jim Ferenczi commented on LUCENE-8524:
--------------------------------------

Sorry for the late reply. D and E are bugs due to invalid rules in the mecab-ko-dic. I'll provide a patch shortly.

I was not aware of B, but it should be easy to add the interpunct as a separator.

A can be discussed, but I think it needs a separate issue since this is more a feature than a bug. This is a design choice and I am not sure that splitting is really an issue here. We could add a mode that joins multiple alphabets together, but it's not a major concern since these mixed terms should appear very rarely.

Regarding C, IMO it's a normalization issue: if you don't want to break on this character you should remove it with a char filter.
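Editor's note: a minimal sketch of the char-filter approach Jim suggests for C, wired in the idiomatic place (Analyzer.initReader) so the character is removed before KoreanTokenizer runs. Item C is not quoted in this digest, so the character stripped here (U+200B, zero-width space) is only a placeholder; the class name is illustrative, and the Lucene APIs used are real.

{code:java}
import java.io.Reader;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ko.KoreanTokenizer;
import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;

/** Analyzer that strips an unwanted character before KoreanTokenizer sees the text. */
public class StrippingKoreanAnalyzer extends Analyzer {
  // Placeholder: U+200B stands in for whatever character item C refers to.
  private static final Pattern UNWANTED = Pattern.compile("\u200B");

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // Delete the unwanted character before tokenization.
    return new PatternReplaceCharFilter(UNWANTED, "", reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new KoreanTokenizer();
    return new TokenStreamComponents(tokenizer);
  }
}
{code}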
[jira] [Commented] (LUCENE-8524) Nori (Korean) analyzer tokenization issues
[ https://issues.apache.org/jira/browse/LUCENE-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16639118#comment-16639118 ]

Tomoko Uchida commented on LUCENE-8524:
---------------------------------------

I have not looked closely at this yet, so this is my intuition rather than a strong opinion...

About problem A: I think this is not a problem of the tokenizer itself but of the built-in dictionary. Nori includes mecab-ko-dic ([https://bitbucket.org/eunjeon/mecab-ko-dic]) as its built-in dictionary, which is a derivative of MeCab IPADIC ([https://sourceforge.net/projects/mecab/]), a widely used Japanese dictionary. JapaneseTokenizer (a.k.a. Kuromoji), which includes MeCab IPADIC, behaves in the same manner. In fact, the original MeCab does not handle such non-CJK tokens correctly. I cannot say it is a fault of MeCab IPADIC; it was originally built for handling Japanese texts, before the Unicode era. But we can apply patches to the dictionary when building it. A patch to the `seed/char.def` file, which is used for "unknown word" handling, should be sufficient, I think.
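Editor's note: a rough sketch of the kind of `char.def` patch this suggests. It assumes mecab-ko-dic's `seed/char.def` follows MeCab's usual format (character-category definitions followed by code-point-range assignments); the category name, flag values, and ranges shown are illustrative assumptions, not the file's actual contents.

{code}
# Hypothetical excerpt of seed/char.def (MeCab format):
#   CATEGORY   INVOKE  GROUP  LENGTH
GREEK          1       1      0

# Code-point ranges mapped to the GREEK category.
# Assigning the "Greek Extended" block to the same category as
# "Greek and Coptic" would keep a word like εἰμί in a single GREEK run
# instead of splitting at ἰ.
0x0370..0x03FF GREEK    # Greek and Coptic
0x1F00..0x1FFE GREEK    # Greek Extended (proposed addition)
{code}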