[jira] [Commented] (LUCENE-8524) Nori (Korean) analyzer tokenization issues
[ https://issues.apache.org/jira/browse/LUCENE-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662833#comment-16662833 ] Trey Jones commented on LUCENE-8524:

{quote}A can be discussed but I think it needs a separate issue since this is more a feature than a bug. This is a design choice and I am not sure that splitting is really an issue here. We could add a mode that join multiple alphabet together but it's not a major concern since this mixed terms should appear very rarely. {quote}

All of the examples except for the _Мoscow_ one are taken from Korean Wikipedia or Wiktionary, so they do occur. Out of a sample of 10,000 random Korean Wikipedia articles (with ~2.4M tokens), 100 Cyrillic and 126 Greek tokens were affected. An additional 2758 ID-like tokens (e.g., _BH115E_) were affected. 96 Phonetic Alphabet tokens were affected. 769 tokens with apostrophes were affected, too; most were possessives with _’s,_ but also included were words like _An'gorso, Na’vi,_ and _O'Donnell._

Out of 2.4M tokens, these are rare, but there are still a lot of them—especially when you scale up the 10K sample 43x to the full 430K articles on Wikipedia. It definitely seems like a bug that a Greek word like _εἰμί_ gets split into three tokens, or _Ба̀лтичко̄_ gets split into four. The Greek example seems to be the worst case, since _ἰ_ is in the “Greek Extended” Unicode block while the rest are in the “Greek and Coptic” block, which aren’t really different character sets.

*Thanks for fixing B, D, and E!*

> Nori (Korean) analyzer tokenization issues
> --
>
> Key: LUCENE-8524
> URL: https://issues.apache.org/jira/browse/LUCENE-8524
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Reporter: Trey Jones
> Priority: Major
> Attachments: LUCENE-8524.patch
>
>
> I opened this originally as an [Elastic bug|https://github.com/elastic/elasticsearch/issues/34283#issuecomment-426940784], but was asked to re-file it here. (Sorry for the poor formatting. "pre-formatted" isn't behaving.)
> *Elastic version*
> {
> "name" : "adOS8gy",
> "cluster_name" : "elasticsearch",
> "cluster_uuid" : "GVS7gpVBQDGwtHl3xnJbLw",
> "version" : {
> "number" : "6.4.0",
> "build_flavor" : "default",
> "build_type" : "deb",
> "build_hash" : "595516e",
> "build_date" : "2018-08-17T23:18:47.308994Z",
> "build_snapshot" : false,
> "lucene_version" : "7.4.0",
> "minimum_wire_compatibility_version" : "5.6.0",
> "minimum_index_compatibility_version" : "5.0.0"
> },
> "tagline" : "You Know, for Search"
> }
> *Plugins installed:* [analysis-icu, analysis-nori]
> *JVM version:*
> openjdk version "1.8.0_181"
> OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1~deb9u1-b13)
> OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
> *OS version:*
> Linux vagrantes6 4.9.0-6-amd64 #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02) x86_64 GNU/Linux
> *Description of the problem including expected versus actual behavior:*
> I've uncovered a number of oddities in tokenization in the Nori analyzer. All examples are from [Korean Wikipedia|https://ko.wikipedia.org/] or [Korean Wiktionary|https://ko.wiktionary.org/] (including non-CJK examples). In rough order of importance:
> A. Tokens are split on different character POS types (which seem to not quite line up with Unicode character blocks), which leads to weird results for non-CJK tokens:
> * εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other symbol) + μί/SL(Foreign language)
> * ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + k/SL(Foreign language) + ̚/SY(Other symbol)
> * Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + лтичко/SL(Foreign language) + ̄/SY(Other symbol)
> * don't is tokenized as don + t; same for don’t (with a curly apostrophe).
> * אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
> * Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many more chances for false positives when the tokens are split up like this. In particular, individual numbers and combining diacritics are indexed separately (e.g., in the Cyrillic example above), which can lead to a performance hit on large corpora like Wiktionary or Wikipedia.
> Work around: use a character filter to get rid of combining diacritics before Nori processes the text. This doesn't solve the Greek, Hebrew, or English cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek Extended", or "Latin" and "IPA Extensions"—should not trigger token splits.
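For reference, the character-filter workaround described under A can be wired up on the Lucene side as well as in Elasticsearch. Below is a minimal, untested sketch that strips Unicode combining marks before text reaches the Nori tokenizer. It assumes the Lucene 7.x APIs (KoreanTokenizer from the nori module, PatternReplaceCharFilter from analyzers-common); the class name and the regex ranges are illustrative only, not part of the attached patch.

{code:java}
import java.io.Reader;
import java.util.regex.Pattern;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ko.KoreanTokenizer;
import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;

// Sketch of the point-A workaround: remove combining diacritics before Nori tokenizes.
public class NoriNoCombiningMarksAnalyzer extends Analyzer {
  // \p{Mn} = nonspacing marks, \p{Me} = enclosing marks (combining diacritics).
  private static final Pattern COMBINING_MARKS = Pattern.compile("[\\p{Mn}\\p{Me}]");

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // Delete combining marks so e.g. Ба̀лтичко̄ becomes Балтичко before tokenization.
    return new PatternReplaceCharFilter(COMBINING_MARKS, "", reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new KoreanTokenizer();
    return new TokenStreamComponents(tokenizer);
  }
}
{code}

The equivalent Elasticsearch setup would be something like a pattern_replace char_filter in front of nori_tokenizer in a custom analyzer; as noted above, this helps the Cyrillic and IPA cases but not the Greek, Hebrew, or apostrophe splits.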
[jira] [Updated] (LUCENE-8524) Nori (Korean) analyzer tokenization issues
[ https://issues.apache.org/jira/browse/LUCENE-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Trey Jones updated LUCENE-8524:
---
Description:

I opened this originally as an [Elastic bug|https://github.com/elastic/elasticsearch/issues/34283#issuecomment-426940784], but was asked to re-file it here.

*Elastic version*
{
"name" : "adOS8gy",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "GVS7gpVBQDGwtHl3xnJbLw",
"version" : {
"number" : "6.4.0",
"build_flavor" : "default",
"build_type" : "deb",
"build_hash" : "595516e",
"build_date" : "2018-08-17T23:18:47.308994Z",
"build_snapshot" : false,
"lucene_version" : "7.4.0",
"minimum_wire_compatibility_version" : "5.6.0",
"minimum_index_compatibility_version" : "5.0.0"
},
"tagline" : "You Know, for Search"
}

*Plugins installed:* [analysis-icu, analysis-nori]

*JVM version:*
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1~deb9u1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

*OS version:*
Linux vagrantes6 4.9.0-6-amd64 #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02) x86_64 GNU/Linux

*Description of the problem including expected versus actual behavior:*

I've uncovered a number of oddities in tokenization in the Nori analyzer. All examples are from [Korean Wikipedia|https://ko.wikipedia.org/] or [Korean Wiktionary|https://ko.wiktionary.org/] (including non-CJK examples). In rough order of importance:

A. Tokens are split on different character POS types (which seem to not quite line up with Unicode character blocks), which leads to weird results for non-CJK tokens:
* εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other symbol) + μί/SL(Foreign language)
* ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + k/SL(Foreign language) + ̚/SY(Other symbol)
* Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + лтичко/SL(Foreign language) + ̄/SY(Other symbol)
* don't is tokenized as don + t; same for don’t (with a curly apostrophe).
* אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
* Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow

While it is still possible to find these words using Nori, there are many more chances for false positives when the tokens are split up like this. In particular, individual numbers and combining diacritics are indexed separately (e.g., in the Cyrillic example above), which can lead to a performance hit on large corpora like Wiktionary or Wikipedia.

Work around: use a character filter to get rid of combining diacritics before Nori processes the text. This doesn't solve the Greek, Hebrew, or English cases, though.

Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. Combining diacritics should not trigger token splits. Non-CJK text should be tokenized on spaces and punctuation, not by character type shifts. Apostrophe-like characters should not trigger token splits (though I could see someone disagreeing on this one).

B. The character "arae-a" (ㆍ, U+318D) is sometimes used instead of a middle dot (·, U+00B7) for [lists|https://en.wikipedia.org/wiki/Korean_punctuation#Differences_from_European_punctuation]. When the arae-a is used, everything after the first one ends up in one giant token. 도로ㆍ지반ㆍ수자원ㆍ건설환경ㆍ건축ㆍ화재설비연구 is tokenized as 도로 + ㆍ지반ㆍ수자원ㆍ건설환경ㆍ건축ㆍ화재설비연구.
* Note that "HANGUL *LETTER* ARAEA" (ㆍ, U+318D) is used this way, while "HANGUL *JUNGSEONG* ARAEA" (ᆞ, U+119E) is used to create syllable blocks for which there is no precomposed Unicode character.

Work around: use a character filter to convert arae-a (U+318D) to a space.

Suggested fix: split tokens on all instances of arae-a (U+318D).

C. Nori splits tokens on soft hyphens (U+00AD) and zero-width non-joiners (U+200C), splitting tokens that should not be split.
* hyphenation (with a soft hyphen in the middle) is tokenized as hyphen + ation.
* بازیهای (with a zero-width non-joiner) is tokenized as بازی + های.

Work around: use a character filter to strip soft hyphens and zero-width non-joiners before Nori.

Suggested fix: Nori should strip soft hyphens and zero-width non-joiners.

D. Analyzing 그레이맨 generates an extra empty token after it. There may be others, but this is the only one I've found.

Work around: add a min length token filter with a minimum length of 1.

E. Analyzing 튜토리얼 generates a token with an extra space at the end of it. There may be others, but this is the only one I've found. No work around needed, I guess, since this is only the internal representation of the token. I'm not sure if it has any negative effects.

*Steps to reproduce:*
1.
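The arae-a workaround from point B can be sketched with the same custom-Analyzer pattern as above, this time with a mapping char filter. This is an illustrative, untested sketch assuming Lucene's MappingCharFilter and NormalizeCharMap; the class name is made up.

{code:java}
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.ko.KoreanTokenizer;

// Sketch of the point-B workaround: map HANGUL LETTER ARAEA (U+318D) to a space
// so list items tokenize separately instead of forming one giant token.
public class NoriAraeaAnalyzer extends Analyzer {
  private static final NormalizeCharMap ARAEA_TO_SPACE;
  static {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("\u318D", " "); // only the LETTER form; leave U+119E (JUNGSEONG ARAEA) alone
    ARAEA_TO_SPACE = builder.build();
  }

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    return new MappingCharFilter(ARAEA_TO_SPACE, reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new KoreanTokenizer();
    return new TokenStreamComponents(tokenizer);
  }
}
{code}

In Elasticsearch terms this would be something like a mapping char_filter ("\u318d => \u0020") placed in front of nori_tokenizer.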
[jira] [Created] (LUCENE-8524) Nori (Korean) analyzer tokenization issues
Trey Jones created LUCENE-8524:
--
Summary: Nori (Korean) analyzer tokenization issues
Key: LUCENE-8524
URL: https://issues.apache.org/jira/browse/LUCENE-8524
Project: Lucene - Core
Issue Type: Bug
Components: modules/analysis
Reporter: Trey Jones

I opened this originally as an [Elastic bug|https://github.com/elastic/elasticsearch/issues/34283#issuecomment-426940784], but was asked to re-file it here.

*Elastic version*
{{{}}
{{ "name" : "adOS8gy",}}
{{ "cluster_name" : "elasticsearch",}}
{{ "cluster_uuid" : "GVS7gpVBQDGwtHl3xnJbLw",}}
{{ "version" : {}}
{{ "number" : "6.4.0",}}
{{ "build_flavor" : "default",}}
{{ "build_type" : "deb",}}
{{ "build_hash" : "595516e",}}
{{ "build_date" : "2018-08-17T23:18:47.308994Z",}}
{{ "build_snapshot" : false,}}
{{ "lucene_version" : "7.4.0",}}
{{ "minimum_wire_compatibility_version" : "5.6.0",}}
{{ "minimum_index_compatibility_version" : "5.0.0"}}
{{ },}}
{{ "tagline" : "You Know, for Search"}}
{{}}}

*Plugins installed:* [analysis-icu, analysis-nori]

*JVM version:*
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1~deb9u1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

*OS version:*
Linux vagrantes6 4.9.0-6-amd64 #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02) x86_64 GNU/Linux

*Description of the problem including expected versus actual behavior:*

I've uncovered a number of oddities in tokenization in the Nori analyzer. All examples are from [Korean Wikipedia|https://ko.wikipedia.org/] or [Korean Wiktionary|https://ko.wiktionary.org/] (including non-CJK examples). In rough order of importance:

A. Tokens are split on different character POS types (which seem to not quite line up with Unicode character blocks), which leads to weird results for non-CJK tokens:
* `εἰμί` is tokenized as three tokens: `ε/SL(Foreign language) + ἰ/SY(Other symbol) + μί/SL(Foreign language)`
* `ka̠k̚t͡ɕ͈a̠k̚` is tokenized as `ka/SL(Foreign language) + ̠/SY(Other symbol) + k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + k/SL(Foreign language) + ̚/SY(Other symbol)`
* `Ба̀лтичко̄` is tokenized as `ба/SL(Foreign language) + ̀/SY(Other symbol) + лтичко/SL(Foreign language) + ̄/SY(Other symbol)`
* `don't` is tokenized as `don + t`; same for `don’t` (with a curly apostrophe).
* `אוֹג׳וּ` is tokenized as `אוֹג/SY(Other symbol) + וּ/SY(Other symbol)`
* `Мoscow` (with a Cyrillic М and the rest in Latin) is tokenized as `м + oscow`

While it is still possible to find these words using Nori, there are many more chances for false positives when the tokens are split up like this. In particular, individual numbers and combining diacritics are indexed separately (e.g., in the Cyrillic example above), which can lead to a performance hit on large corpora like Wiktionary or Wikipedia.

Work around: use a character filter to get rid of combining diacritics before Nori processes the text. This doesn't solve the Greek, Hebrew, or English cases, though.

Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. Combining diacritics should not trigger token splits. Non-CJK text should be tokenized on spaces and punctuation, not by character type shifts. Apostrophe-like characters should not trigger token splits (though I could see someone disagreeing on this one).

B. The character "arae-a" (ㆍ, U+318D) is sometimes used instead of a middle dot (·, U+00B7) for [lists|https://en.wikipedia.org/wiki/Korean_punctuation#Differences_from_European_punctuation]. When the arae-a is used, everything after the first one ends up in one giant token. `도로ㆍ지반ㆍ수자원ㆍ건설환경ㆍ건축ㆍ화재설비연구` is tokenized as `도로 + ㆍ지반ㆍ수자원ㆍ건설환경ㆍ건축ㆍ화재설비연구`.
* Note that "HANGUL *LETTER* ARAEA" (ㆍ, U+318D) is used this way, while "HANGUL *JUNGSEONG* ARAEA" (ᆞ, U+119E) is used to create syllable blocks for which there is no precomposed Unicode character.

Work around: use a character filter to convert arae-a (U+318D) to a space.

Suggested fix: split tokens on all instances of arae-a (U+318D).

C. Nori splits tokens on soft hyphens (U+00AD) and zero-width non-joiners (U+200C), splitting tokens that should not be split.
* `hyphenation` (with a soft hyphen in the middle) is tokenized as `hyphen + ation`.
* `بازیهای` (with a zero-width non-joiner) is tokenized as `بازی + های`.

Work around: use a character filter to strip soft hyphens and zero-width non-joiners before Nori.

Suggested fix: Nori should strip soft hyphens and zero-width non-joiners.

D. Analyzing 그레이맨 generates an extra empty token after it. There may be others, but this is the only one I've found.

Work around: add a min length token filter with a minimum length of 1.

E. Analyzing 튜토리얼 generates a token with an extra space at the end of it.
[jira] [Updated] (LUCENE-8524) Nori (Korean) analyzer tokenization issues
[ https://issues.apache.org/jira/browse/LUCENE-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Trey Jones updated LUCENE-8524:
---
Description:

I opened this originally as an [Elastic bug|https://github.com/elastic/elasticsearch/issues/34283#issuecomment-426940784], but was asked to re-file it here. (Sorry for the poor formatting. "pre-formatted" isn't behaving.)

*Elastic version*
{
"name" : "adOS8gy",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "GVS7gpVBQDGwtHl3xnJbLw",
"version" : {
"number" : "6.4.0",
"build_flavor" : "default",
"build_type" : "deb",
"build_hash" : "595516e",
"build_date" : "2018-08-17T23:18:47.308994Z",
"build_snapshot" : false,
"lucene_version" : "7.4.0",
"minimum_wire_compatibility_version" : "5.6.0",
"minimum_index_compatibility_version" : "5.0.0"
},
"tagline" : "You Know, for Search"
}

*Plugins installed:* [analysis-icu, analysis-nori]

*JVM version:*
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1~deb9u1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

*OS version:*
Linux vagrantes6 4.9.0-6-amd64 #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02) x86_64 GNU/Linux

*Description of the problem including expected versus actual behavior:*

I've uncovered a number of oddities in tokenization in the Nori analyzer. All examples are from [Korean Wikipedia|https://ko.wikipedia.org/] or [Korean Wiktionary|https://ko.wiktionary.org/] (including non-CJK examples). In rough order of importance:

A. Tokens are split on different character POS types (which seem to not quite line up with Unicode character blocks), which leads to weird results for non-CJK tokens:
* εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other symbol) + μί/SL(Foreign language)
* ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + k/SL(Foreign language) + ̚/SY(Other symbol)
* Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + лтичко/SL(Foreign language) + ̄/SY(Other symbol)
* don't is tokenized as don + t; same for don’t (with a curly apostrophe).
* אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
* Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow

While it is still possible to find these words using Nori, there are many more chances for false positives when the tokens are split up like this. In particular, individual numbers and combining diacritics are indexed separately (e.g., in the Cyrillic example above), which can lead to a performance hit on large corpora like Wiktionary or Wikipedia.

Work around: use a character filter to get rid of combining diacritics before Nori processes the text. This doesn't solve the Greek, Hebrew, or English cases, though.

Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. Combining diacritics should not trigger token splits. Non-CJK text should be tokenized on spaces and punctuation, not by character type shifts. Apostrophe-like characters should not trigger token splits (though I could see someone disagreeing on this one).

B. The character "arae-a" (ㆍ, U+318D) is sometimes used instead of a middle dot (·, U+00B7) for [lists|https://en.wikipedia.org/wiki/Korean_punctuation#Differences_from_European_punctuation]. When the arae-a is used, everything after the first one ends up in one giant token. 도로ㆍ지반ㆍ수자원ㆍ건설환경ㆍ건축ㆍ화재설비연구 is tokenized as 도로 + ㆍ지반ㆍ수자원ㆍ건설환경ㆍ건축ㆍ화재설비연구.
* Note that "HANGUL *LETTER* ARAEA" (ㆍ, U+318D) is used this way, while "HANGUL *JUNGSEONG* ARAEA" (ᆞ, U+119E) is used to create syllable blocks for which there is no precomposed Unicode character.

Work around: use a character filter to convert arae-a (U+318D) to a space.

Suggested fix: split tokens on all instances of arae-a (U+318D).

C. Nori splits tokens on soft hyphens (U+00AD) and zero-width non-joiners (U+200C), splitting tokens that should not be split.
* hyphenation (with a soft hyphen in the middle) is tokenized as hyphen + ation.
* بازیهای (with a zero-width non-joiner) is tokenized as بازی + های.

Work around: use a character filter to strip soft hyphens and zero-width non-joiners before Nori.

Suggested fix: Nori should strip soft hyphens and zero-width non-joiners.

D. Analyzing 그레이맨 generates an extra empty token after it. There may be others, but this is the only one I've found.

Work around: add a min length token filter with a minimum length of 1.

E. Analyzing 튜토리얼 generates a token with an extra space at the end of it. There may be others, but this is the only one I've found. No work around needed, I guess, since this is only the internal representation of the token. I'm not sure if it has any negative effects.
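The point-D workaround ("add a min length token filter with a minimum length of 1") is straightforward to express in Lucene. An illustrative, untested sketch assuming the 7.x LengthFilter and KoreanTokenizer APIs; the class name is made up:

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ko.KoreanTokenizer;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;

// Sketch of the point-D workaround: drop zero-length tokens (such as the empty
// token emitted after 그레이맨) by requiring a minimum token length of 1.
public class NoriMinLengthAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new KoreanTokenizer();
    TokenStream stream = new LengthFilter(tokenizer, 1, Integer.MAX_VALUE);
    return new TokenStreamComponents(tokenizer, stream);
  }
}
{code}

In Elasticsearch this corresponds to the length token filter with min set to 1, placed after the Nori tokenizer in a custom analyzer.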
[jira] [Created] (LUCENE-8419) Return token unchanged for pathological Stempel tokens
Trey Jones created LUCENE-8419: -- Summary: Return token unchanged for pathological Stempel tokens Key: LUCENE-8419 URL: https://issues.apache.org/jira/browse/LUCENE-8419 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Reporter: Trey Jones Attachments: dotc.txt, dotdotc.txt, twoletter.txt In the aggregate, Stempel does a good job, but certain tokens get stemmed pathologically, conflating completely unrelated words in the search index. Depending on the scoring function, documents returned may have no form of the word that was in the query, only unrelated forms (see ć examples below). It's probably not possible to fix the stemmer, and it's probably not possible to catch _every_ error, but catching and ignoring certain large classes of errors would greatly improve precision, and doing it in the stemmer would prevent losses to recall that happen from cleaning up these errors outside the stemmer. An obvious example is that numbers ending in 1 have the last two digits replaced with ć. So 12341 is stemmed as 123ć. Numbers ending in 31 have the last 4 numbers removed and replaced with ć, so 12331 is stemmed as 1ć. Mixed letters and numbers are treated the same: abc123451 is stemmed as abc1234ć, abc1231 is stemmed as abcć. *Proposed solution:* any token that ends in a number should not be stemmed, it should just be returned unchanged. One letter stems from the set [a-zńć] are generally useless and often absurd. ć is the worst offender by far (it's the ending of the infinitive form of verbs). All of these tokens (found on Polish Wikipedia/Wiktionary) get stemmed to ć: * acque Adrien aguas Águas Alainem Alandh Amores Ansoe Arau asinaio aŭdas audyt Awiwie Ayres Baby badż Baina Bains Balue Baon baque Barbola Bazy Beau beim Beroe Betz Blaue blenda bleue Blizzard boor Boruca Boym Brodła Brogi Bronksie Brydż Budgie Budiafa bujny Buon Buot Button Caan Cains Canoe Canona caon Celu Charl Chloe ciag Cioma Cmdr Conseil Conso Cotton Cramp Creel Cuyk cyan czcią Czermny czto D.III Daws Daxue dazzle decy Defoe Dereń Detroit digue Dior Ditton Dojlido dosei douk DRaaS drag drau Dudacy dudas Dutton Duty Dziób eayd Edwy Edyp eiro Eltz Emain erar ESaaS faan Fetz figurar Fitz foam Frau Fugue GAAB gaan Gabirol Gaon gasue Gaup Geol GeoMIP Getz gigue Ginny Gioią Girl Goam Gołymin Gosei Götz grasso Grodnie Gula Guroo gyan HAAB Haan Heim Héroe Hitz Hoam Hohenho Hosei Huon Hutton Huub hyaina Iberii inkuby Inoue Issue ITaaS Iudas Izmaile Jaan Jaws jedyn Jews jira Josepho Jost Josue Judas Kaan Kaleido Karoo Katz Kazue Kehoe khayag kiwa Kiwu Klaas kmdr Kokei Konoe kozer kpią Kringle ksiezyce Któż Kutz L231 L331 Laan Lalli Laon Laws łebka Leroo Liban Ligue Liro Lisoli Logue Loja Londyn Lubomyr Luque Lutz Lytton łzawy Maan mains Mainy malpaco Mammal mandag MBaaS meeki Merl Metz MIDAS middag Miras mmol modą moins Monty Moryń motz mróż Mutz Müzesi MVaaS Naam nabrzeża Nadab Nadala Nalewki Nd:YAG neol News Nieszawa Nimue Nyam ÖAAB oblał oddala okala Olień opar oppi Orioł Osioł osoagi Osyki Otóż Output Oxalido pasmową Patton Pearl Peau peoplk Petz poar Pobrzeża poecie Pogue Pono posagi posł Praha Pringle probie progi Prońko Prosper prwdę Psioł Pułka Putz QDTOE Quien Qwest radża raga Rains reht Reich Retz Revue Right RITZ Roam Rogue Roque rosii RU31 Rutki Ryan SAAB saasso salue Sampaio Satz Sears Sekisho semo Setton Sgan Siloe Sitz Skopje Slot Šmarje Smrkci Soar sopo sozinho springa Steel Stip Straz Strip Suez sukuby Sumach Surgucie Sutton svasso Szosą szto Tadas Taira tęczy Teodorą teol Tisii Tisza Toluca 
Tomoe Toque TPMŻ Traiana Trask Traue Tulyag Tuque Turinga Undas Uniw usque Vague Value Venue Vidas Vogue Voor W331 Waringa weht Weich Weija Wheel widmem WKAG worku Wotton Wryk Wschowie wsiach wsiami Wybrzeża wydala Wyraz XLIII XVIII XXIII Yaski yeol YONO Yorki zakręcie Zijab zipo. Four-character tokens ending in 31 (like 2,31 9,31 1031 1131 7431 8331 a331) also all get stemmed to ć. Below are examples of other tokens (from Polish Wikipedia/Wiktionary) that get stemmed to one-letter tokens in [a-zńć]. Note that i, o, u, w, and z are stop words, and so don't show up in the list. * a: a, addo, adygea, jhwh, also * b: b, bdrm, barr, bebek, berr, bounty, bures, burr, berm, birm * c: alzira, c, carr, county, haight, hermas, kidoń, paich, pieter, połóż, radoń, soest, tatort, voight, zaba, biegną, pokaż, wskaż, zoisyt * d: award, d, dlek, deeb * e: e, eddy, eloi * f: f, farr, firm * g: g, geagea, grunty, gwdy, gyro, górą * h: h * i: inre, isro * j: j, judo * k: k, kgtj, kpzr, karr, kerr, ksok * l: l, leeb, loeb * m: m, magazyn, marr, mayor, merr, mnsi, murr, mgły, najmu * n: johnowi, n * o: obzr, offy * p: p, pace, paoli, parr, pasji, pawełek, pyro, pirsy, plmb * q: q * r: r, rite, rrek * s: s, sarr, site, sowie, szok * t:
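The proposed behavior (return digit-final tokens unchanged) could also be approximated outside the stemmer with existing filters, which may be useful while the stemmer itself is unchanged. This is an untested sketch under several assumptions: that StempelFilter skips tokens marked with KeywordAttribute, that PatternKeywordMarkerFilter is available, and that the bundled Polish table is the resource named stemmer_20000.tbl; the class name is illustrative.

{code:java}
import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.PatternKeywordMarkerFilter;
import org.apache.lucene.analysis.pl.PolishAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.stempel.StempelFilter;
import org.apache.lucene.analysis.stempel.StempelStemmer;

// Stripped-down sketch: mark digit-final tokens as keywords so the stemmer
// leaves them alone, approximating "don't stem tokens that end in a number".
public class GuardedStempelAnalyzer extends Analyzer {
  private static final Pattern ENDS_IN_DIGIT = Pattern.compile(".*\\d");

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new StandardTokenizer();
    TokenStream stream = new PatternKeywordMarkerFilter(tokenizer, ENDS_IN_DIGIT);
    try {
      // Table name assumed from the stempel module's bundled Polish stemmer table.
      StempelStemmer stemmer = new StempelStemmer(
          PolishAnalyzer.class.getResourceAsStream("stemmer_20000.tbl"));
      stream = new StempelFilter(stream, stemmer);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    return new TokenStreamComponents(tokenizer, stream);
  }
}
{code}

With something like this, 12341 and abc1231 would pass through unstemmed instead of conflating with unrelated ć stems; it does nothing for the one-letter stems listed above, which would need a separate guard on the stemmer's output.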
[jira] [Created] (LUCENE-8417) Expose Stempel stopword filter
Trey Jones created LUCENE-8417:
--
Summary: Expose Stempel stopword filter
Key: LUCENE-8417
URL: https://issues.apache.org/jira/browse/LUCENE-8417
Project: Lucene - Core
Issue Type: New Feature
Components: modules/analysis
Reporter: Trey Jones

Stempel (lucene-solr/lucene/analysis/stempel/) internally uses a stopword list. The stemmer is exposed as "polish_stem" but the stopword list is not exposed. If someone wants to unpack the Stempel analyzer to customize it, they have to go find the stopword list on their own and recreate it.
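For reference, at the Lucene level the bundled list appears to be reachable programmatically, which is what an unpacked analyzer would reuse; the gap described here is on the exposure side, where only "polish_stem" is registered. A sketch of the unpacking, assuming the Lucene 7.x PolishAnalyzer.getDefaultStopSet() and core StopFilter/LowerCaseFilter; the class name is illustrative:

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pl.PolishAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Reusing Stempel's bundled Polish stopword list when unpacking the analyzer.
public class UnpackedPolishAnalyzer extends Analyzer {
  private static final CharArraySet POLISH_STOPS = PolishAnalyzer.getDefaultStopSet();

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new StandardTokenizer();
    TokenStream stream = new LowerCaseFilter(tokenizer);
    stream = new StopFilter(stream, POLISH_STOPS);
    // ...further customization (e.g., a StempelFilter) would go here
    return new TokenStreamComponents(tokenizer, stream);
  }
}
{code}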
[jira] [Created] (LUCENE-8416) Add tokenized version of o.o. to Stempel stopwords
Trey Jones created LUCENE-8416:
--
Summary: Add tokenized version of o.o. to Stempel stopwords
Key: LUCENE-8416
URL: https://issues.apache.org/jira/browse/LUCENE-8416
Project: Lucene - Core
Issue Type: Improvement
Components: modules/analysis
Reporter: Trey Jones

The Stempel stopword list ( lucene-solr/lucene/analysis/stempel/src/resources/org/apache/lucene/analysis/pl/stopwords.txt ) contains "o.o.", which is a good stopword (it's part of the abbreviation for "limited liability company", which is "[sp. z o.o.|https://en.wiktionary.org/wiki/sp._z_o.o.]"). However, the standard tokenizer changes "o.o." to "o.o", so the stopword filter has no effect.

Add "o.o" to the stopword list. (It's probably okay to leave "o.o." in the list, though, in case a different tokenizer is used.)
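A quick way to confirm the behavior described above is to print what the standard tokenizer emits for "sp. z o.o.". A minimal sketch assuming the Lucene StandardTokenizer API; the expected output line reflects this issue's description rather than a claim verified here:

{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Prints the tokens the standard tokenizer produces for "sp. z o.o.".
public class OoTokenDemo {
  public static void main(String[] args) throws IOException {
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        return new TokenStreamComponents(new StandardTokenizer());
      }
    };
    try (TokenStream ts = analyzer.tokenStream("f", "sp. z o.o.")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term.toString()); // expected (per this issue): sp, z, o.o
      }
      ts.end();
    }
  }
}
{code}

Since the emitted form is "o.o" with no trailing period, a stopword entry for "o.o" is what the stop filter would actually match.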