[jira] [Commented] (LUCENE-8524) Nori (Korean) analyzer tokenization issues

2018-10-24 Thread Trey Jones (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662833#comment-16662833
 ] 

Trey Jones commented on LUCENE-8524:


{quote}A can be discussed but I think it needs a separate issue since this is 
more a feature than a bug. This is a design choice and I am not sure that 
splitting is really an issue here. We could add a mode that join multiple 
alphabet together but it's not a major concern since this mixed terms should 
appear very rarely.
{quote}
All of the examples except for the _Мoscow_ one are taken from Korean Wikipedia 
or Wiktionary, so they do occur. Out of a sample of 10,000 random Korean 
Wikipedia articles (with ~2.4M tokens), 100 Cyrillic and 126 Greek tokens were 
affected. An additional 2758 ID-like tokens (e.g., _BH115E_) were affected. 96 
Phonetic Alphabet tokens were affected. 769 tokens with apostrophes were 
affected, too; most were possessives with _’s,_ but also included were words 
like _An'gorso, Na’vi,_ and _O'Donnell._ Out of 2.4M tokens, these are rare, 
but there are still a lot of them—especially when you scale up the 10K sample 43x 
to the full 430K articles on Wikipedia.

It definitely seems like a bug that a Greek word like _εἰμί_ gets split into 
three tokens, or _Ба̀лтичко̄_ gets split into four. The Greek seems to be the 
worse case, since _ἰ_ is in the “Greek Extended” Unicode block while the rest 
are in the “Greek and Coptic” block, which aren’t really different character sets.
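
For what it's worth, a plain Java check (not Nori code, just an illustration) shows the block difference:

{code:java}
// Print the Unicode block of each character in εἰμί.
public class GreekBlockCheck {
  public static void main(String[] args) {
    for (char c : "εἰμί".toCharArray()) {
      // ε, μ, ί report GREEK (the "Greek and Coptic" block); ἰ reports GREEK_EXTENDED.
      System.out.println(c + " -> " + Character.UnicodeBlock.of(c));
    }
  }
}
{code}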

*Thanks for fixing B, D, and E!*

> Nori (Korean) analyzer tokenization issues
> --
>
> Key: LUCENE-8524
> URL: https://issues.apache.org/jira/browse/LUCENE-8524
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Trey Jones
>Priority: Major
> Attachments: LUCENE-8524.patch
>
>
> I opened this originally as an [Elastic 
> bug|https://github.com/elastic/elasticsearch/issues/34283#issuecomment-426940784],
>  but was asked to re-file it here. (Sorry for the poor formatting. 
> "pre-formatted" isn't behaving.)
> *Elastic version*
> {
>  "name" : "adOS8gy",
>  "cluster_name" : "elasticsearch",
>  "cluster_uuid" : "GVS7gpVBQDGwtHl3xnJbLw",
>  "version" : {
>  "number" : "6.4.0",
>  "build_flavor" : "default",
>  "build_type" : "deb",
>  "build_hash" : "595516e",
>  "build_date" : "2018-08-17T23:18:47.308994Z",
>  "build_snapshot" : false,
>  "lucene_version" : "7.4.0",
>  "minimum_wire_compatibility_version" : "5.6.0",
>  "minimum_index_compatibility_version" : "5.0.0"
>  },
>  "tagline" : "You Know, for Search"
> }
>  *Plugins installed:* [analysis-icu, analysis-nori]
> *JVM version:*
>  openjdk version "1.8.0_181"
>  OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1~deb9u1-b13)
>  OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
> *OS version:*
>  Linux vagrantes6 4.9.0-6-amd64 #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02) 
> x86_64 GNU/Linux
> *Description of the problem including expected versus actual behavior:*
> I've uncovered a number of oddities in tokenization in the Nori analyzer. All 
> examples are from [Korean Wikipedia|https://ko.wikipedia.org/] or [Korean 
> Wiktionary|https://ko.wiktionary.org/] (including non-CJK examples). In rough 
> order of importance:
> A. Tokens are split on different character POS types (which seem to not quite 
> line up with Unicode character blocks), which leads to weird results for 
> non-CJK tokens:
>  * εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other 
> symbol) + μί/SL(Foreign language)
>  * ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) 
> + k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + 
> ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol)
>  * Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + 
> лтичко/SL(Foreign language) + ̄/SY(Other symbol)
>  * don't is tokenized as don + t; same for don’t (with a curly apostrophe).
>  * אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
>  * Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many 
> more chances for false positives when the tokens are split up like this. In 
> particular, individual numbers and combining diacritics are indexed 
> separately (e.g., in the Cyrillic example above), which can lead to a 
> performance hit on large corpora like Wiktionary or Wikipedia.
> Workaround: use a character filter to get rid of combining diacritics before 
> Nori processes the text. This doesn't solve the Greek, Hebrew, or English 
> cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek 
> Extended", or "Latin" and "IPA Extensions"—should not trigger token splits.

[jira] [Updated] (LUCENE-8524) Nori (Korean) analyzer tokenization issues

2018-10-04 Thread Trey Jones (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Trey Jones updated LUCENE-8524:
---
Description: 
I opened this originally as an [Elastic 
bug|https://github.com/elastic/elasticsearch/issues/34283#issuecomment-426940784],
 but was asked to re-file it here.

*Elastic version*

{
 "name" : "adOS8gy",
 "cluster_name" : "elasticsearch",
 "cluster_uuid" : "GVS7gpVBQDGwtHl3xnJbLw",
 "version" : {
 "number" : "6.4.0",
 "build_flavor" : "default",
 "build_type" : "deb",
 "build_hash" : "595516e",
 "build_date" : "2018-08-17T23:18:47.308994Z",
 "build_snapshot" : false,
 "lucene_version" : "7.4.0",
 "minimum_wire_compatibility_version" : "5.6.0",
 "minimum_index_compatibility_version" : "5.0.0"
 },
 "tagline" : "You Know, for Search"
}


 *Plugins installed:* [analysis-icu, analysis-nori]

*JVM version:*
 openjdk version "1.8.0_181"
 OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1~deb9u1-b13)
 OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

*OS version:*
 Linux vagrantes6 4.9.0-6-amd64 #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02) 
x86_64 GNU/Linux

*Description of the problem including expected versus actual behavior:*

I've uncovered a number of oddities in tokenization in the Nori analyzer. All 
examples are from [Korean Wikipedia|https://ko.wikipedia.org/] or [Korean 
Wiktionary|https://ko.wiktionary.org/] (including non-CJK examples). In rough 
order of importance:

A. Tokens are split on different character POS types (which seem to not quite 
line up with Unicode character blocks), which leads to weird results for 
non-CJK tokens:
 * εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other 
symbol) + μί/SL(Foreign language)
 * ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + 
k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + 
͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + 
k/SL(Foreign language) + ̚/SY(Other symbol)
 * Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + 
лтичко/SL(Foreign language) + ̄/SY(Other symbol)
 * don't is tokenized as don + t; same for don’t (with a curly apostrophe).
 * אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
 * Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow

While it is still possible to find these words using Nori, there are many more 
chances for false positives when the tokens are split up like this. In 
particular, individual numbers and combining diacritics are indexed separately 
(e.g., in the Cyrillic example above), which can lead to a performance hit on 
large corpora like Wiktionary or Wikipedia.

Workaround: use a character filter to get rid of combining diacritics before 
Nori processes the text. This doesn't solve the Greek, Hebrew, or English 
cases, though.
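
A minimal sketch of this workaround at the Lucene level (the class name is hypothetical and assumes the 7.x analyzers-common and Nori APIs; in Elasticsearch the equivalent is a char filter in the analyzer settings):

{code:java}
import java.io.Reader;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ko.KoreanTokenizer;
import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;

// Hypothetical analyzer: strip combining marks (Unicode category Mn) with a
// char filter so Nori never sees them and cannot split tokens on them.
public class DiacriticStrippingNoriAnalyzer extends Analyzer {
  private static final Pattern COMBINING_MARKS = Pattern.compile("\\p{Mn}+");

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // Char filters run before the tokenizer.
    return new PatternReplaceCharFilter(COMBINING_MARKS, "", reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new KoreanTokenizer();
    return new TokenStreamComponents(tokenizer);
  }
}
{code}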

Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek 
Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. 
Combining diacritics should not trigger token splits. Non-CJK text should be 
tokenized on spaces and punctuation, not by character type shifts. 
Apostrophe-like characters should not trigger token splits (though I could see 
someone disagreeing on this one).

B. The character "arae-a" (ㆍ, U+318D) is sometimes used instead of a middle dot 
(·, U+00B7) for 
[lists|https://en.wikipedia.org/wiki/Korean_punctuation#Differences_from_European_punctuation].
 When the arae-a is used, everything after the first one ends up in one giant 
token. 도로ㆍ지반ㆍ수자원ㆍ건설환경ㆍ건축ㆍ화재설비연구 is tokenized as 도로 + ㆍ지반ㆍ수자원ㆍ건설환경ㆍ건축ㆍ화재설비연구.
 * Note that "HANGUL *LETTER* ARAEA" (ㆍ, U+318D) is used this way, while 
"HANGUL *JUNGSEONG* ARAEA" (ᆞ, U+119E) is used to create syllable blocks for 
which there is no precomposed Unicode character.

Workaround: use a character filter to convert arae-a (U+318D) to a space.
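
For illustration, the same mapping with Lucene's {{MappingCharFilter}} (the helper name is made up; wrap the reader before handing it to the tokenizer):

{code:java}
import java.io.Reader;

import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

public class AraeaMapping {
  // Replace HANGUL LETTER ARAEA (U+318D) with a space before tokenization.
  public static Reader mapAraeaToSpace(Reader in) {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("\u318D", " ");
    return new MappingCharFilter(builder.build(), in);
  }
}
{code}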

Suggested fix: split tokens on all instances of arae-a (U+318D).

C. Nori splits tokens on soft hyphens (U+00AD) and zero-width non-joiners 
(U+200C), splitting tokens that should not be split.
 * hyphen­ation (with a soft hyphen in the middle) is tokenized as hyphen + 
ation.
 * بازی‌های  (with a zero-width non-joiner) is tokenized as بازی + های.

Workaround: use a character filter to strip soft hyphens and zero-width 
non-joiners before Nori.
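
A sketch of that filter, again with {{MappingCharFilter}} and a made-up helper name (mapping both characters to the empty string simply deletes them):

{code:java}
import java.io.Reader;

import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

public class InvisibleCharStripping {
  // Delete soft hyphens (U+00AD) and zero-width non-joiners (U+200C)
  // before tokenization so they cannot split tokens.
  public static Reader stripSoftHyphenAndZwnj(Reader in) {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("\u00AD", "");
    builder.add("\u200C", "");
    return new MappingCharFilter(builder.build(), in);
  }
}
{code}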

Suggested fix: Nori should strip soft hyphens and zero-width non-joiners.

D. Analyzing 그레이맨 generates an extra empty token after it. There may be others, 
but this is the only one I've found. Workaround: add a min-length token filter 
with a minimum length of 1.
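
That workaround is roughly the following (illustrative names, using Lucene's {{LengthFilter}}):

{code:java}
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;

public class DropEmptyTokens {
  // Keep only tokens that are at least one character long, dropping the
  // stray empty token described above.
  public static TokenStream dropEmpty(TokenStream in) {
    return new LengthFilter(in, 1, Integer.MAX_VALUE);
  }
}
{code}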

E. Analyzing 튜토리얼 generates a token with an extra space at the end of it. There 
may be others, but this is the only one I've found. No workaround needed, I 
guess, since this is only the internal representation of the token. I'm not 
sure if it has any negative effects.

*Steps to reproduce:*

1. 

[jira] [Created] (LUCENE-8524) Nori (Korean) analyzer tokenization issues

2018-10-04 Thread Trey Jones (JIRA)
Trey Jones created LUCENE-8524:
--

 Summary: Nori (Korean) analyzer tokenization issues
 Key: LUCENE-8524
 URL: https://issues.apache.org/jira/browse/LUCENE-8524
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/analysis
Reporter: Trey Jones


I opened this originally as an [Elastic 
bug|https://github.com/elastic/elasticsearch/issues/34283#issuecomment-426940784],
 but was asked to re-file it here.

*Elastic version*
{{{}}
{{ "name" : "adOS8gy",}}
{{ "cluster_name" : "elasticsearch",}}
{{ "cluster_uuid" : "GVS7gpVBQDGwtHl3xnJbLw",}}
{{ "version" : {}}
{{ "number" : "6.4.0",}}
{{ "build_flavor" : "default",}}
{{ "build_type" : "deb",}}
{{ "build_hash" : "595516e",}}
{{ "build_date" : "2018-08-17T23:18:47.308994Z",}}
{{ "build_snapshot" : false,}}
{{ "lucene_version" : "7.4.0",}}
{{ "minimum_wire_compatibility_version" : "5.6.0",}}
{{ "minimum_index_compatibility_version" : "5.0.0"}}
{{ },}}
{{ "tagline" : "You Know, for Search"}}
{{}}}

*Plugins installed:* [analysis-icu, analysis-nori]

*JVM version:*
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1~deb9u1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

*OS version:*
Linux vagrantes6 4.9.0-6-amd64 #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02) 
x86_64 GNU/Linux

*Description of the problem including expected versus actual behavior:*

I've uncovered a number of oddities in tokenization in the Nori analyzer. All 
examples are from [Korean Wikipedia|https://ko.wikipedia.org/] or [Korean 
Wiktionary|https://ko.wiktionary.org/] (including non-CJK examples). In rough 
order of importance:

A. Tokens are split on different character POS types (which seem to not quite 
line up with Unicode character blocks), which leads to weird results for 
non-CJK tokens:
* `εἰμί` is tokenized as three tokens: `ε/SL(Foreign language) + ἰ/SY(Other 
symbol) + μί/SL(Foreign language)`
* `ka̠k̚t͡ɕ͈a̠k̚` is tokenized as `ka/SL(Foreign language) + ̠/SY(Other symbol) 
+ k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + 
͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + 
k/SL(Foreign language) + ̚/SY(Other symbol)`
* `Ба̀лтичко̄` is tokenized as `ба/SL(Foreign language) + ̀/SY(Other symbol) + 
лтичко/SL(Foreign language) + ̄/SY(Other symbol)`
* `don't` is tokenized as `don + t`; same for `don’t` (with a curly apostrophe).
* `אוֹג׳וּ` is tokenized as `אוֹג/SY(Other symbol) + וּ/SY(Other symbol)`
* `Мoscow` (with a Cyrillic М and the rest in Latin) is tokenized as `м + oscow`

While it is still possible to find these words using Nori, there are many more 
chances for false positives when the tokens are split up like this. In 
particular, individual numbers and combining diacritics are indexed separately 
(e.g., in the Cyrillic example above), which can lead to a performance 
hit on large corpora like Wiktionary or Wikipedia.

Workaround: use a character filter to get rid of combining diacritics before 
Nori processes the text. This doesn't solve the Greek, Hebrew, or English 
cases, though.

Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek 
Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. 
Combining diacritics should not trigger token splits. Non-CJK text should be 
tokenized on spaces and punctuation, not by character type shifts. 
Apostrophe-like characters should not trigger token splits (though I could see 
someone disagreeing on this one).

B. The character "arae-a" (ㆍ, U+318D) is sometimes used instead of a middle dot 
(·, U+00B7) for 
[lists|https://en.wikipedia.org/wiki/Korean_punctuation#Differences_from_European_punctuation].
 When the arae-a is used, everything after the first one ends up in one giant 
token. `도로ㆍ지반ㆍ수자원ㆍ건설환경ㆍ건축ㆍ화재설비연구` is tokenized as `도로 + ㆍ지반ㆍ수자원ㆍ건설환경ㆍ건축ㆍ화재설비연구`.
* Note that "HANGUL *LETTER* ARAEA" (ㆍ, U+318D) is used this way, while "HANGUL 
*JUNGSEONG* ARAEA" (ᆞ, U+119E) is used to create syllable blocks for which 
there is no precomposed Unicode character.

Workaround: use a character filter to convert arae-a (U+318D) to a space.

Suggested fix: split tokens on all instances of arae-a (U+318D).

C. Nori splits tokens on soft hyphens (U+00AD) and zero-width non-joiners 
(U+200C), splitting tokens that should not be split.
* `hyphen­ation` (with a soft hyphen in the middle) is tokenized as `hyphen + 
ation`.
* `بازی‌های ` (with a zero-width non-joiner) is tokenized as `بازی + های`.

Workaround: use a character filter to strip soft hyphens and zero-width 
non-joiners before Nori.

Suggested fix: Nori should strip soft hyphens and zero-width non-joiners.

D. Analyzing 그레이맨 generates an extra empty token after it. There may be others, 
but this is the only one I've found. Workaround: add a min-length token filter 
with a minimum length of 1.

E. Analyzing 튜토리얼 generates a token with an extra space at the end of it.

[jira] [Updated] (LUCENE-8524) Nori (Korean) analyzer tokenization issues

2018-10-04 Thread Trey Jones (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Trey Jones updated LUCENE-8524:
---
Description: 
I opened this originally as an [Elastic 
bug|https://github.com/elastic/elasticsearch/issues/34283#issuecomment-426940784],
 but was asked to re-file it here. (Sorry for the poor formatting. 
"pre-formatted" isn't behaving.)

*Elastic version*

{
 "name" : "adOS8gy",
 "cluster_name" : "elasticsearch",
 "cluster_uuid" : "GVS7gpVBQDGwtHl3xnJbLw",
 "version" : {
 "number" : "6.4.0",
 "build_flavor" : "default",
 "build_type" : "deb",
 "build_hash" : "595516e",
 "build_date" : "2018-08-17T23:18:47.308994Z",
 "build_snapshot" : false,
 "lucene_version" : "7.4.0",
 "minimum_wire_compatibility_version" : "5.6.0",
 "minimum_index_compatibility_version" : "5.0.0"
 },
 "tagline" : "You Know, for Search"
}


 *Plugins installed:* [analysis-icu, analysis-nori]

*JVM version:*
 openjdk version "1.8.0_181"
 OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1~deb9u1-b13)
 OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

*OS version:*
 Linux vagrantes6 4.9.0-6-amd64 #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02) 
x86_64 GNU/Linux

*Description of the problem including expected versus actual behavior:*

I've uncovered a number of oddities in tokenization in the Nori analyzer. All 
examples are from [Korean Wikipedia|https://ko.wikipedia.org/] or [Korean 
Wiktionary|https://ko.wiktionary.org/] (including non-CJK examples). In rough 
order of importance:

A. Tokens are split on different character POS types (which seem to not quite 
line up with Unicode character blocks), which leads to weird results for 
non-CJK tokens:
 * εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other 
symbol) + μί/SL(Foreign language)
 * ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + 
k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + 
͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + 
k/SL(Foreign language) + ̚/SY(Other symbol)
 * Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + 
лтичко/SL(Foreign language) + ̄/SY(Other symbol)
 * don't is tokenized as don + t; same for don’t (with a curly apostrophe).
 * אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
 * Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow

While it is still possible to find these words using Nori, there are many more 
chances for false positives when the tokens are split up like this. In 
particular, individual numbers and combining diacritics are indexed separately 
(e.g., in the Cyrillic example above), which can lead to a performance hit on 
large corpora like Wiktionary or Wikipedia.

Workaround: use a character filter to get rid of combining diacritics before 
Nori processes the text. This doesn't solve the Greek, Hebrew, or English 
cases, though.

Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek 
Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. 
Combining diacritics should not trigger token splits. Non-CJK text should be 
tokenized on spaces and punctuation, not by character type shifts. 
Apostrophe-like characters should not trigger token splits (though I could see 
someone disagreeing on this one).

B. The character "arae-a" (ㆍ, U+318D) is sometimes used instead of a middle dot 
(·, U+00B7) for 
[lists|https://en.wikipedia.org/wiki/Korean_punctuation#Differences_from_European_punctuation].
 When the arae-a is used, everything after the first one ends up in one giant 
token. 도로ㆍ지반ㆍ수자원ㆍ건설환경ㆍ건축ㆍ화재설비연구 is tokenized as 도로 + ㆍ지반ㆍ수자원ㆍ건설환경ㆍ건축ㆍ화재설비연구.
 * Note that "HANGUL *LETTER* ARAEA" (ㆍ, U+318D) is used this way, while 
"HANGUL *JUNGSEONG* ARAEA" (ᆞ, U+119E) is used to create syllable blocks for 
which there is no precomposed Unicode character.

Workaround: use a character filter to convert arae-a (U+318D) to a space.

Suggested fix: split tokens on all instances of arae-a (U+318D).

C. Nori splits tokens on soft hyphens (U+00AD) and zero-width non-joiners 
(U+200C), splitting tokens that should not be split.
 * hyphen­ation (with a soft hyphen in the middle) is tokenized as hyphen + 
ation.
 * بازی‌های  (with a zero-width non-joiner) is tokenized as بازی + های.

Workaround: use a character filter to strip soft hyphens and zero-width 
non-joiners before Nori.

Suggested fix: Nori should strip soft hyphens and zero-width non-joiners.

D. Analyzing 그레이맨 generates an extra empty token after it. There may be others, 
but this is the only one I've found. Workaround: add a min-length token filter 
with a minimum length of 1.

E. Analyzing 튜토리얼 generates a token with an extra space at the end of it. There 
may be others, but this is the only one I've found. No workaround needed, I 
guess, since this is only the internal representation of the token. I'm not 
sure if it has any negative effects.

[jira] [Created] (LUCENE-8419) Return token unchanged for pathological Stempel tokens

2018-07-20 Thread Trey Jones (JIRA)
Trey Jones created LUCENE-8419:
--

 Summary: Return token unchanged for pathological Stempel tokens
 Key: LUCENE-8419
 URL: https://issues.apache.org/jira/browse/LUCENE-8419
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Trey Jones
 Attachments: dotc.txt, dotdotc.txt, twoletter.txt

In the aggregate, Stempel does a good job, but certain tokens get stemmed 
pathologically, conflating completely unrelated words in the search index. 
Depending on the scoring function, documents returned may have no form of the 
word that was in the query, only unrelated forms (see ć examples below).

It's probably not possible to fix the stemmer, and it's probably not possible 
to catch _every_ error, but catching and ignoring certain large classes of 
errors would greatly improve precision, and doing it in the stemmer would 
prevent losses to recall that happen from cleaning up these errors outside the 
stemmer.

An obvious example is that numbers ending in 1 have the last two digits 
replaced with ć. So 12341 is stemmed as 123ć. Numbers ending in 31 have the 
last four digits replaced with ć, so 12331 is stemmed as 1ć. Mixed letters and 
numbers are treated the same: abc123451 is stemmed as abc1234ć, and abc1231 is 
stemmed as abcć.

*Proposed solution:* any token that ends in a number should not be stemmed; it 
should just be returned unchanged.
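
For illustration only (not the in-stemmer fix proposed here; names are made up, and it assumes the stemming filter honors {{KeywordAttribute}}), roughly the same effect can be had today by marking such tokens as keywords before stemming:

{code:java}
import java.util.regex.Pattern;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.PatternKeywordMarkerFilter;

public class ProtectNumericTokens {
  // Mark any token ending in a digit as a keyword so a downstream stemmer
  // that respects KeywordAttribute returns it unchanged.
  public static TokenStream protectTokensEndingInDigit(TokenStream in) {
    return new PatternKeywordMarkerFilter(in, Pattern.compile(".*\\d$"));
  }
}
{code}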

One-letter stems from the set [a-zńć] are generally useless and often absurd.

ć is the worst offender by far (it's the ending of the infinitive form of 
verbs). All of these tokens (found on Polish Wikipedia/Wiktionary) get stemmed 
to ć:
 * acque Adrien aguas Águas Alainem Alandh Amores Ansoe Arau asinaio aŭdas 
audyt Awiwie Ayres Baby badż Baina Bains Balue Baon baque Barbola Bazy Beau 
beim Beroe Betz Blaue blenda bleue Blizzard boor Boruca Boym Brodła Brogi 
Bronksie Brydż Budgie Budiafa bujny Buon Buot Button Caan Cains Canoe Canona 
caon Celu Charl Chloe ciag Cioma Cmdr Conseil Conso Cotton Cramp Creel Cuyk 
cyan czcią Czermny czto D.III Daws Daxue dazzle decy Defoe Dereń Detroit digue 
Dior Ditton Dojlido dosei douk DRaaS drag drau Dudacy dudas Dutton Duty Dziób 
eayd Edwy Edyp eiro Eltz Emain erar ESaaS faan Fetz figurar Fitz foam Frau 
Fugue GAAB gaan Gabirol Gaon gasue Gaup Geol GeoMIP Getz gigue Ginny Gioią Girl 
Goam Gołymin Gosei Götz grasso Grodnie Gula Guroo gyan HAAB Haan Heim Héroe 
Hitz Hoam Hohenho Hosei Huon Hutton Huub hyaina Iberii inkuby Inoue Issue ITaaS 
Iudas Izmaile Jaan Jaws jedyn Jews jira Josepho Jost Josue Judas Kaan Kaleido 
Karoo Katz Kazue Kehoe khayag kiwa Kiwu Klaas kmdr Kokei Konoe kozer kpią 
Kringle ksiezyce Któż Kutz L231 L331 Laan Lalli Laon Laws łebka Leroo Liban 
Ligue Liro Lisoli Logue Loja Londyn Lubomyr Luque Lutz Lytton łzawy Maan mains 
Mainy malpaco Mammal mandag MBaaS meeki Merl Metz MIDAS middag Miras mmol modą 
moins Monty Moryń motz mróż Mutz Müzesi MVaaS Naam nabrzeża Nadab Nadala 
Nalewki Nd:YAG neol News Nieszawa Nimue Nyam ÖAAB oblał oddala okala Olień opar 
oppi Orioł Osioł osoagi Osyki Otóż Output Oxalido pasmową Patton Pearl Peau 
peoplk Petz poar Pobrzeża poecie Pogue Pono posagi posł Praha Pringle probie 
progi Prońko Prosper prwdę Psioł Pułka Putz QDTOE Quien Qwest radża raga Rains 
reht Reich Retz Revue Right RITZ Roam Rogue Roque rosii RU31 Rutki Ryan SAAB 
saasso salue Sampaio Satz Sears Sekisho semo Setton Sgan Siloe Sitz Skopje Slot 
Šmarje Smrkci Soar sopo sozinho springa Steel Stip Straz Strip Suez sukuby 
Sumach Surgucie Sutton svasso Szosą szto Tadas Taira tęczy Teodorą teol Tisii 
Tisza Toluca Tomoe Toque TPMŻ Traiana Trask Traue Tulyag Tuque Turinga Undas 
Uniw usque Vague Value Venue Vidas Vogue Voor W331 Waringa weht Weich Weija 
Wheel widmem WKAG worku Wotton Wryk Wschowie wsiach wsiami Wybrzeża wydala 
Wyraz XLIII XVIII XXIII Yaski yeol YONO Yorki zakręcie Zijab zipo.

Four-character tokens ending in 31 (like 2,31 9,31 1031 1131 7431 8331 a331) 
also all get stemmed to ć.

Below are examples of other tokens (from Polish Wikipedia/Wiktionary) that get 
stemmed to one-letter tokens in [a-zńć]. Note that i, o, u, w, and z are stop 
words, and so don't show up in the list.
 * a: a, addo, adygea, jhwh, also
 * b: b, bdrm, barr, bebek, berr, bounty, bures, burr, berm, birm
 * c: alzira, c, carr, county, haight, hermas, kidoń, paich, pieter, połóż, 
radoń, soest, tatort, voight, zaba, biegną, pokaż, wskaż, zoisyt
 * d: award, d, dlek, deeb
 * e: e, eddy, eloi
 * f: f, farr, firm
 * g: g, geagea, grunty, gwdy, gyro, górą
 * h: h
 * i: inre, isro
 * j: j, judo
 * k: k, kgtj, kpzr, karr, kerr, ksok
 * l: l, leeb, loeb
 * m: m, magazyn, marr, mayor, merr, mnsi, murr, mgły, najmu
 * n: johnowi, n
 * o: obzr, offy
 * p: p, pace, paoli, parr, pasji, pawełek, pyro, pirsy, plmb
 * q: q
 * r: r, rite, rrek
 * s: s, sarr, site, sowie, szok
 * t: 

[jira] [Created] (LUCENE-8417) Expose Stempel stopword filter

2018-07-20 Thread Trey Jones (JIRA)
Trey Jones created LUCENE-8417:
--

 Summary: Expose Stempel stopword filter
 Key: LUCENE-8417
 URL: https://issues.apache.org/jira/browse/LUCENE-8417
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Trey Jones


Stempel (lucene-solr/lucene/analysis/stempel/) internally uses a stopword list. 
The stemmer is exposed as "polish_stem" but the stopword list is not exposed. 
If someone wants to unpack the Stempel analyzer to customize it, they have to 
go find the stopword list on their own and recreate it.





[jira] [Created] (LUCENE-8416) Add tokenized version of o.o. to Stempel stopwords

2018-07-20 Thread Trey Jones (JIRA)
Trey Jones created LUCENE-8416:
--

 Summary: Add tokenized version of o.o. to Stempel stopwords
 Key: LUCENE-8416
 URL: https://issues.apache.org/jira/browse/LUCENE-8416
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Trey Jones


The Stempel stopword list ( 
lucene-solr/lucene/analysis/stempel/src/resources/org/apache/lucene/analysis/pl/stopwords.txt
 ) contains "o.o.", which is a good stopword (it's part of the abbreviation for 
"limited liability company", which is "[sp. z 
o.o.|https://en.wiktionary.org/wiki/sp._z_o.o.]"). However, the standard 
tokenizer changes "o.o." to "o.o" so the stopword filter has no effect.

Add "o.o" to the stopword list. (It's probably okay to leave "o.o." in the 
list, though, in case a different tokenizer is used.)
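
For illustration (names are made up; assumes the Lucene 7.x analysis API), a stopword set that covers both forms:

{code:java}
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;

public class PolishStopwordExample {
  // Include both spellings so the entry matches whether or not the
  // tokenizer keeps the trailing period.
  private static final CharArraySet STOPWORDS =
      new CharArraySet(Arrays.asList("o.o.", "o.o"), true);

  public static TokenStream applyStopwords(TokenStream in) {
    return new StopFilter(in, STOPWORDS);
  }
}
{code}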


