[jira] [Updated] (LUCENE-9754) ICU Tokenizer: letter-space-number-letter tokenized inconsistently

2021-03-04 Thread Trey Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Trey Jones updated LUCENE-9754:
---
Description: 
The tokenization of strings like _14th_ with the ICU tokenizer is affected by 
the character that comes before the preceding whitespace.

For example, _x 14th_ is tokenized as x | 14th; _ァ 14th_ is tokenized as ァ | 14 
| th.

In general, in a letter-space-number-letter sequence, if the writing system 
before the space is the same as the writing system after the number, then you 
get two tokens. If the writing systems differ, you get three tokens.
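As a point of comparison (a sketch, not Lucene code, and the class and method names below are made up for the demo): the JDK's java.text.BreakIterator, which roughly follows the default UAX #29 word-break rules, appears to keep _14th_ together regardless of the script of the letter before the space. If that holds, the extra split likely comes from the ICU tokenizer's script-run handling layered on top of the base rules, not from the word-break rules themselves.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Demo only (hypothetical helper, not part of Lucene): segment a string with
// the JDK's default word BreakIterator and return the non-whitespace segments.
public class WordBreakDemo {
    public static List<String> words(String text) {
        BreakIterator bi = BreakIterator.getWordInstance(Locale.ROOT);
        bi.setText(text);
        List<String> out = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String w = text.substring(start, end).trim();
            if (!w.isEmpty()) {
                out.add(w);  // keep only non-whitespace segments
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Same letter-space-number-letter shape as the examples above.
        System.out.println(words("x 14th"));
        System.out.println(words("ァ 14th"));
    }
}
```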

-If the conditions are just right, the chunking that the ICU tokenizer does 
(trying to split on spaces to create <4k chunks) can create an artificial 
boundary between the tokens (e.g., between _ァ_ and _14th_) and prevent the 
unexpected split of the second token (_14th_). Because chunking changes can 
ripple through a long document, editing text or the effects of a character 
filter can cause changes in tokenization thousands of lines later in a 
document.- _(This inconsistency was included as a side issue that I thought 
might add more weight to the main problem I am concerned with, but it seems to 
be more of a distraction. Chunking issues should perhaps be addressed in a 
different ticket, so I'm striking it out.)_

My guess is that some "previous character set" flag is not reset at the space, 
and numbers are not in a character set, so _t_ is compared to _ァ_ and they are 
not the same—causing a token split at the character set change—but I'm not sure.
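The guess above can be illustrated with plain Unicode character properties (a sketch of the suspected mechanism only, not the tokenizer's actual code). In Unicode terms the relevant property is the script: digits and spaces carry the neutral value COMMON rather than a concrete script, so a segmenter that remembers the last concrete script across a number would end up comparing ァ (KATAKANA) against _t_ (LATIN) and see a script change.

```java
// Sketch only: print the Unicode script of each character in "ァ 14th".
// The space and the digits report COMMON; only ァ (KATAKANA) and t/h (LATIN)
// carry concrete scripts, consistent with the guessed mechanism above.
public class ScriptDemo {
    public static void main(String[] args) {
        "ァ 14th".codePoints().forEach(cp ->
            System.out.printf("U+%04X -> %s%n", cp, Character.UnicodeScript.of(cp)));
    }
}
```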

 

  was:
The tokenization of strings like _14th_ with the ICU tokenizer is affected by 
the character that comes before the preceding whitespace.

For example, _x 14th_ is tokenized as x | 14th; _ァ 14th_ is tokenized as ァ | 14 
| th.

In general, in a letter-space-number-letter sequence, if the writing system 
before the space is the same as the writing system after the number, then you 
get two tokens. If the writing systems differ, you get three tokens.

If the conditions are just right, the chunking that the ICU tokenizer does 
(trying to split on spaces to create <4k chunks) can create an artificial 
boundary between the tokens (e.g., between _ァ_ and _14th_) and prevent the 
unexpected split of the second token (_14th_). Because chunking changes can 
ripple through a long document, editing text or the effects of a character 
filter can cause changes in tokenization thousands of lines later in a document.

My guess is that some "previous character set" flag is not reset at the space, 
and numbers are not in a character set, so _t_ is compared to _ァ_ and they are 
not the same—causing a token split at the character set change—but I'm not sure.

 


> ICU Tokenizer: letter-space-number-letter tokenized inconsistently
> --
>
> Key: LUCENE-9754
> URL: https://issues.apache.org/jira/browse/LUCENE-9754
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 7.5
> Environment: Tested most recently on Elasticsearch 6.5.4.
>Reporter: Trey Jones
>Priority: Major
> Attachments: LUCENE-9754_prototype.patch
>
>
> The tokenization of strings like _14th_ with the ICU tokenizer is affected by 
> the character that comes before the preceding whitespace.
> For example, _x 14th_ is tokenized as x | 14th; _ァ 14th_ is tokenized as ァ | 
> 14 | th.
> In general, in a letter-space-number-letter sequence, if the writing system 
> before the space is the same as the writing system after the number, then you 
> get two tokens. If the writing systems differ, you get three tokens.
> -If the conditions are just right, the chunking that the ICU tokenizer does 
> (trying to split on spaces to create <4k chunks) can create an artificial 
> boundary between the tokens (e.g., between _ァ_ and _14th_) and prevent the 
> unexpected split of the second token (_14th_). Because chunking changes can 
> ripple through a long document, editing text or the effects of a character 
> filter can cause changes in tokenization thousands of lines later in a 
> document.- _(This inconsistency was included as a side issue that I thought 
> might add more weight to the main problem I am concerned with, but it seems 
> to be more of a distraction. Chunking issues should perhaps be addressed in a 
> different ticket, so I'm striking it out.)_
> My guess is that some "previous character set" flag is not reset at the 
> space, and numbers are not in a character set, so _t_ is compared to _ァ_ and 
> they are not the same—causing a token split at the character set change—but 
> I'm not sure.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (LUCENE-9754) ICU Tokenizer: letter-space-number-letter tokenized inconsistently

2021-02-11 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-9754:

Attachment: LUCENE-9754_prototype.patch


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org