[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-12-03 Thread Christophe Bismuth (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707497#comment-16707497 ]

Christophe Bismuth commented on LUCENE-8548:


That's great, thanks [~jim.ferenczi] for all the details (y)

> Reevaluate scripts boundary break in Nori's tokenizer
> -----------------------------------------------------
>
> Key: LUCENE-8548
> URL: https://issues.apache.org/jira/browse/LUCENE-8548
> Project: Lucene - Core
>  Issue Type: Improvement
> Reporter: Jim Ferenczi
> Priority: Minor
> Fix For: master (8.0), 7.7
>
> Attachments: LUCENE-8548.patch, screenshot-1.png, 
> testCyrillicWord.dot.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526:
> {noformat}
> Tokens are split on different character POS types (which seem to not quite 
> line up with Unicode character blocks), which leads to weird results for 
> non-CJK tokens:
> εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other 
> symbol) + μί/SL(Foreign language)
> ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + 
> ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol)
> Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + 
> лтичко/SL(Foreign language) + ̄/SY(Other symbol)
> don't is tokenized as don + t; same for don’t (with a curly apostrophe).
> אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
> Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many 
> more chances for false positives when the tokens are split up like this. In 
> particular, individual numbers and combining diacritics are indexed 
> separately (e.g., in the Cyrillic example above), which can lead to a 
> performance hit on large corpora like Wiktionary or Wikipedia.
> Work around: use a character filter to get rid of combining diacritics before 
> Nori processes the text. This doesn't solve the Greek, Hebrew, or English 
> cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek 
> Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. 
> Combining diacritics should not trigger token splits. Non-CJK text should be 
> tokenized on spaces and punctuation, not by character type shifts. 
> Apostrophe-like characters should not trigger token splits (though I could 
> see someone disagreeing on this one).{noformat}
>  






[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-12-03 Thread ASF subversion and git services (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16706938#comment-16706938 ]

ASF subversion and git services commented on LUCENE-8548:

Commit 01f9adbb3dd4971a1d0eee74a030fcd7948c35b3 in lucene-solr's branch 
refs/heads/branch_7x from [~cbismuth]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=01f9adb ]

LUCENE-8548: The KoreanTokenizer no longer splits unknown words on combining diacritics and detects script boundaries more accurately with Character#UnicodeScript#of.

Signed-off-by: Jim Ferenczi 








[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-12-03 Thread ASF subversion and git services (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16706925#comment-16706925 ]

ASF subversion and git services commented on LUCENE-8548:

Commit 643ffc6f9fb3f7368d48975d750f75f8a66783e2 in lucene-solr's branch 
refs/heads/master from [~cbismuth]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=643ffc6 ]

LUCENE-8548: The KoreanTokenizer no longer splits unknown words on combining diacritics and detects script boundaries more accurately with Character#UnicodeScript#of.

Signed-off-by: Jim Ferenczi 








[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-12-03 Thread Jim Ferenczi (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16706907#comment-16706907 ]

Jim Ferenczi commented on LUCENE-8548:

{quote}

I still have one more question: could you please explain what information is contained in the {{wordIdRef}} variable and what the {{unkDictionary.lookupWordIds(characterId, wordIdRef)}} statement does? The debugger tells me {{wordIdRef.length}} is always equal to 36 or 42, and even though 42 is a really great number, I'm a tiny bit lost in there ...

{quote}

 

The wordIdRef is an indirection that maps an id to an array of word ids. These word ids are the candidates for the current word: for known words the id points at the part of speech, cost, etc. of each variant of the word (noun, verb, ...), and for character ids it points at the cost and part of speech used for words written with that character id (mapped from the Unicode scripts). I don't know how you found 36 or 42, but unknown words should always have a single wordIdRef.
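
For orientation, a rough sketch of how that lookup feeds the lattice in the unknown-word path of {{KoreanTokenizer#parse}} (simplified; the names follow the Kuromoji-style dictionary API, but treat the exact calls as illustrative, not as the verbatim code):

{code:java}
// Sketch only. Assumes the tokenizer's fields: buffer (RollingCharBuffer),
// characterDefinition (CharacterDefinition), unkDictionary (UnknownDictionary),
// and org.apache.lucene.util.IntsRef.
char firstChar = (char) buffer.get(pos);
// characterId is the character class of the run, as defined in char.def.
int characterId = characterDefinition.getCharacterClass(firstChar);

// wordIdRef receives the word ids registered for that character class.
IntsRef wordIdRef = new IntsRef();
unkDictionary.lookupWordIds(characterId, wordIdRef);

// Each word id points at one candidate entry: part of speech, costs, ...
for (int i = 0; i < wordIdRef.length; i++) {
  int wordId = wordIdRef.ints[wordIdRef.offset + i];
  int wordCost = unkDictionary.getWordCost(wordId);
  int leftId = unkDictionary.getLeftId(wordId);
  // ... add the candidate to the Viterbi lattice with these costs
}
{code}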







[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-28 Thread Christophe Bismuth (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701958#comment-16701958 ]

Christophe Bismuth commented on LUCENE-8548:


Thanks a lot for sharing this [~jim.ferenczi], and no worries at all, as the first iteration was an interesting journey! I think taking time to read about Viterbi would help me some more; let's add it to my own todo list :D

I diffed your patch against {{master}} and debugged the new tests step by step, and I think I understand the big picture. Among other things, I had totally missed the {{if (isCommonOrInherited(scriptCode) && isCommonOrInherited(sc) == false)}} condition, which is essential.
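
For other readers of this thread, a small sketch of what that condition guards (simplified from the patch; {{scriptCode}} is the script of the run so far and {{sc}} the script of the next character):

{code:java}
// Common/Inherited characters (digits, combining diacritics, ...) belong
// to no single script, so they should never force a boundary on their own.
static boolean isCommonOrInherited(Character.UnicodeScript script) {
  return script == Character.UnicodeScript.INHERITED
      || script == Character.UnicodeScript.COMMON;
}

// Inside the unknown-word loop: if the run so far contained only
// Common/Inherited characters and we now see a concrete script,
// adopt that script instead of splitting the token.
if (isCommonOrInherited(scriptCode) && isCommonOrInherited(sc) == false) {
  scriptCode = sc;
}
{code}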

I still have one more question: could you please explain what information is contained in the {{wordIdRef}} variable and what the {{unkDictionary.lookupWordIds(characterId, wordIdRef)}} statement does? The debugger tells me {{wordIdRef.length}} is always equal to 36 or 42, and even though 42 is a really great number, I'm a tiny bit lost in there ...







[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-23 Thread Christophe Bismuth (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16697336#comment-16697336 ]

Christophe Bismuth commented on LUCENE-8548:


I've made some progress and opened PR [#505|https://github.com/apache/lucene-solr/pull/505] to share it with you. Feel free to stop me, as I don't want to waste your time.

Here is what has been done so far:

 * Break on script boundaries with the built-in JDK API,
 * Track character classes in a growing byte array,
 * I feel a tiny bit lost when it comes to extracting costs: should I call {{unkDictionary.lookupWordIds(characterId, wordIdRef)}} for each tracked character class?
 * The {{мoscow}} word is correctly parsed in the Graphviz output below ...
 * ... but the test failed on this [line|https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/lucene/test-framework/src/java/org/apache/lucene/analysis/BaseTokenStreamTestCase.java#L199] and I still have to understand why.

!screenshot-1.png!







[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-22 Thread Christophe Bismuth (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16696093#comment-16696093 ]

Christophe Bismuth commented on LUCENE-8548:


That is really nice, thank you [~jim.ferenczi] and [~rcmuir]. I should be able to start a patch, thanks again!







[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-22 Thread Robert Muir (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16696085#comment-16696085 ]

Robert Muir commented on LUCENE-8548:

Yeah, if you want to just change that logic as Jim suggests, I would look into https://docs.oracle.com/javase/7/docs/api/java/lang/Character.UnicodeScript.html#of(int), which is built into the JDK.
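
For example, a tiny standalone check (unrelated to Nori's internals) shows why this API fits the issue's examples:

{code:java}
import java.lang.Character.UnicodeScript;

public class ScriptOfDemo {
  public static void main(String[] args) {
    // Mixed-script word from the report: Cyrillic 'М' followed by Latin letters.
    "Мoscow".codePoints().forEach(cp ->
        System.out.println(new String(Character.toChars(cp))
            + " -> " + UnicodeScript.of(cp)));
    // Combining diacritics report INHERITED, so they can be attached to the
    // surrounding script instead of triggering a split.
    System.out.println(UnicodeScript.of(0x0304)); // COMBINING MACRON -> INHERITED
  }
}
{code}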








[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-22 Thread Jim Ferenczi (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16696072#comment-16696072 ]

Jim Ferenczi commented on LUCENE-8548:

Yes, we should not depend on the icu module. You need to implement something specific anyway, because the tokenizer in the Nori module uses a rolling buffer to read the input, so it has its own logic to inspect the underlying characters.

{quote}

Add a breakpoint in the {{DictionaryToken}} constructor to try to understand how and when tokens are built (I also played with the {{outputUnknownUnigrams}} parameter)

{quote}

 

Currently Nori breaks on character class. The character classes are defined in the MeCab model ([https://bitbucket.org/eunjeon/mecab-ko-dic/src/df15a487444d88565ea18f8250330276497cc9b9/seed/char.def?at=master&fileviewer=file-view-default]) and we access them through the CharacterDefinition class. You can see the logic to group unknown words here:

[https://github.com/apache/lucene-solr/blob/master/lucene/analysis/nori/src/java/org/apache/lucene/analysis/ko/KoreanTokenizer.java#L729]

So instead of splitting on character class, you need to extend the logic to break on script boundaries. If the new block contains multiple character classes (that are compatible with each other), you still need to choose one character id to extract the costs associated with that block:

[https://github.com/apache/lucene-solr/blob/master/lucene/analysis/nori/src/java/org/apache/lucene/analysis/ko/KoreanTokenizer.java#L745]

In that case, picking the first character id in the block should be enough.
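
To make the suggestion concrete, a rough sketch of how the grouping loop could be extended (field and method names follow the existing KoreanTokenizer/Kuromoji code, but this is a simplification, not the final patch):

{code:java}
// Sketch: group an unknown word until a real script boundary.
// Assumes the tokenizer's fields: buffer (RollingCharBuffer),
// characterDefinition, unkDictionary, wordIdRef, MAX_UNKNOWN_WORD_LENGTH,
// and the isCommonOrInherited helper.
char firstChar = (char) buffer.get(pos);
int characterId = characterDefinition.getCharacterClass(firstChar);
Character.UnicodeScript scriptCode = Character.UnicodeScript.of(firstChar);
int unknownWordLength = 1;

for (int posAhead = pos + 1; unknownWordLength < MAX_UNKNOWN_WORD_LENGTH; posAhead++) {
  int ch = buffer.get(posAhead);
  if (ch == -1) {
    break; // end of input
  }
  Character.UnicodeScript sc = Character.UnicodeScript.of(ch);
  // Common/Inherited characters are compatible with any script.
  boolean sameScript = scriptCode == sc
      || isCommonOrInherited(scriptCode) || isCommonOrInherited(sc);
  if (sameScript && characterDefinition.isGroup((char) ch)) {
    unknownWordLength++;
  } else {
    break; // script boundary: stop grouping here
  }
}

// Picking the first character id in the block is enough to extract costs.
unkDictionary.lookupWordIds(characterId, wordIdRef);
{code}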

 

 







[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-22 Thread Christophe Bismuth (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16696052#comment-16696052 ]

Christophe Bismuth commented on LUCENE-8548:


Great! Thank you [~rcmuir], I'll dig into this (y)







[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-22 Thread Robert Muir (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16696038#comment-16696038 ]

Robert Muir commented on LUCENE-8548:

{quote}
Try to make the Ant {{nori}} module depend on the {{icu}} module to reuse some {{ICUTokenizer}} logic (but I failed to tweak the Ant scripts)
{quote}

Let's avoid this: this logic in the ICUTokenizer should stay private. Otherwise its API (javadocs, etc.) will be confusing; it should not expose any of its internals as protected or public.

The high-level point is that if you want to break on script boundaries, Unicode specifies a way to do it: you don't have to reinvent the wheel. You can get the property data from text files (https://www.unicode.org/Public/11.0.0/ucd/Scripts.txt), or you can depend on the icu jar for access to the properties. The link to ICUTokenizer just gives an example of how you might do it, but none of it should be exposed to Nori.
 

 







[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-21 Thread Christophe Bismuth (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694827#comment-16694827 ]

Christophe Bismuth commented on LUCENE-8548:


I'm hacking around in the {{KoreanTokenizer}} class, but I'll need some mentoring to keep going.

Here is what I've done so far:
 * Implement a Cyrillic test failure (see previous comment)
 * Locate the {{KoreanAnalyzer}} and {{KoreanTokenizer}} classes
 * Locate the {{ICUTokenizer}} and its {{CompositeBreakIterator}} class 
attribute (following UAX #29: Unicode Text Segmentation)
 * Try to make the Ant {{nori}} module depend on the {{icu}} module to reuse some {{ICUTokenizer}} logic (but I failed to tweak the Ant scripts)
 * Enable verbose output (see output below)
 * Enable Graphviz output (see attached picture)
 * Debug the {{org.apache.lucene.analysis.ko.KoreanTokenizer#parse}} method step by step
 * Add a breakpoint in the {{DictionaryToken}} constructor to try to understand how and when tokens are built (I also played with the {{outputUnknownUnigrams}} parameter)

I would need some code or documentation pointers when you have time.

!testCyrillicWord.dot.png!

Tokenizer verbose output below.
{noformat}
PARSE

  extend @ pos=0 char=м hex=43c
1 arcs in
UNKNOWN word len=1 1 wordIDs
  fromIDX=0: cost=138 (prevCost=0 wordCost=795 bgCost=138 spacePenalty=0) 
leftID=1793 leftPOS=SL)
**
  + cost=933 wordID=36 leftID=1793 leastIDX=0 toPos=1 toPos.idx=0

  backtrace: endPos=1 pos=1; 1 characters; last=0 cost=933
add token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
  freeBefore pos=1
TEST-TestKoreanAnalyzer.testCyrillicWord-seed#[9AA9487A32EFEB]:incToken: 
return token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)

PARSE

  extend @ pos=1 char=o hex=6f
1 arcs in
UNKNOWN word len=6 1 wordIDs
  fromIDX=0: cost=-1030 (prevCost=0 wordCost=795 bgCost=-1030 
spacePenalty=0) leftID=1793 leftPOS=SL)
**
  + cost=-235 wordID=30 leftID=1793 leastIDX=0 toPos=7 toPos.idx=0
no arcs in; skip pos=2
no arcs in; skip pos=3
no arcs in; skip pos=4
no arcs in; skip pos=5
no arcs in; skip pos=6
  end: 1 nodes

  backtrace: endPos=7 pos=7; 6 characters; last=1 cost=-235
add token=DictionaryToken("w" pos=6 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("o" pos=5 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("c" pos=4 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("s" pos=3 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("s" pos=2 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("o" pos=1 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
  freeBefore pos=7
{noformat}


[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-20 Thread Christophe Bismuth (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16693478#comment-16693478 ]

Christophe Bismuth commented on LUCENE-8548:


I'll use the test failure below as a starting point.

{code:java}
// LUCENE-8548 - file TestKoreanAnalyzer.java
public void testCyrillicWord() throws IOException {
  final Analyzer analyzer = new KoreanAnalyzer(TestKoreanTokenizer.readDict(),
      KoreanTokenizer.DEFAULT_DECOMPOUND,
      KoreanPartOfSpeechStopFilter.DEFAULT_STOP_TAGS, false);
  // Expect the mixed-script word to come out as a single token:
  // one term, start offset 0, end offset 6, position increment 1.
  assertAnalyzesTo(analyzer, "мoscow",
      new String[]{"мoscow"},
      new int[]{0},
      new int[]{6},
      new int[]{1});
}
{code}







[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-16 Thread Christophe Bismuth (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690004#comment-16690004 ]

Christophe Bismuth commented on LUCENE-8548:


Yes, I'm interested in this issue (y) I'll start working on it and let you know.







[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-16 Thread Jim Ferenczi (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16689780#comment-16689780 ]

Jim Ferenczi commented on LUCENE-8548:

I didn't start to work on it, so feel free to start something if you are interested, [~cbismuth].







[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-15 Thread Christophe Bismuth (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16688242#comment-16688242 ]

Christophe Bismuth commented on LUCENE-8548:


Hi [~jim.ferenczi], have you started to work on a patch, or maybe I could help? I'm not a Unicode guru, but I can read the docs and learn. Feel free to let me know.







[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-10-26 Thread Jim Ferenczi (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16665235#comment-16665235 ]

Jim Ferenczi commented on LUCENE-8548:

Thanks for the pointer [~rcmuir], I'll work on a patch.







Re: [jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-10-26 Thread Michael Sokolov
I agree w/ Robert: let's not reinvent solutions that are solved elsewhere. In an ideal world, wouldn't you want to be able to delegate tokenization of Latin-script portions to StandardTokenizer? I know that's not possible today, and I wouldn't derail the work here to try to make it happen since it would be a big shift, but personally I'd like to see some more discussion about composing Tokenizers.



[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-10-26 Thread Robert Muir (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16665014#comment-16665014 ]

Robert Muir commented on LUCENE-8548:

As for the suggested fix, why reinvent the wheel? In Unicode, each character gets assigned a script integer value, but there are special values such as "Common" and "Inherited", etc.

See [https://unicode.org/reports/tr24/] or the ICUTokenizer code: [https://github.com/apache/lucene-solr/blob/master/lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/segmentation/ScriptIterator.java#L141]
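To make the special values concrete, a quick illustration with the ICU4J {{UScript}} API (assuming icu4j is on the classpath; the JDK's {{Character.UnicodeScript}} exposes the same property):

{code:java}
import com.ibm.icu.lang.UScript;

public class ScriptValuesDemo {
  public static void main(String[] args) {
    // Ordinary letters carry a concrete script value...
    System.out.println(UScript.getName(UScript.getScript('д'))); // Cyrillic
    System.out.println(UScript.getName(UScript.getScript('a'))); // Latin
    // ...while digits and combining diacritics get the special values
    // Common and Inherited, which belong to no single script.
    System.out.println(UScript.getName(UScript.getScript('7')));    // Common
    System.out.println(UScript.getName(UScript.getScript(0x0301))); // Inherited
  }
}
{code}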

 



