[jira] [Commented] (LUCENE-9100) JapaneseTokenizer produces inconsistent tokens
[ https://issues.apache.org/jira/browse/LUCENE-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013045#comment-17013045 ]

Michael McCandless commented on LUCENE-9100:

{quote}Maybe a solution here is to use the tokenizer with `discardPunctuation==false`, then stripping the punctuation tokens in a filter.{quote}
+1, that sounds like a possible workaround. But it's still spooky that tokens can be formed across (deleted) punctuation ...

> JapaneseTokenizer produces inconsistent tokens
> ----------------------------------------------
>
>                 Key: LUCENE-9100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9100
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 7.2
>           Reporter: Elbek Kamoliddinov
>           Priority: Major
>
> We use {{JapaneseTokenizer}} in production and are seeing some inconsistent behavior. With this text: {{"マギアリス【単版話】 4話 (Unlimited Comics)"}} I get different results if I insert a space before the {{【}} char. Here is a small code snippet demonstrating the case (note: we use our own dictionary and connection costs):
> {code:java}
> Analyzer analyzer = new Analyzer() {
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName) {
>         //Tokenizer tokenizer = new JapaneseTokenizer(newAttributeFactory(), null, true, JapaneseTokenizer.Mode.SEARCH);
>         Tokenizer tokenizer = new JapaneseTokenizer(newAttributeFactory(), dictionaries.systemDictionary,
>                 dictionaries.unknownDictionary, dictionaries.connectionCosts, null, true, JapaneseTokenizer.Mode.SEARCH);
>         return new TokenStreamComponents(tokenizer, new LowerCaseFilter(tokenizer));
>     }
> };
>
> String text1 = "マギアリス【単版話】 4話 (Unlimited Comics)";
> String text2 = "マギアリス 【単版話】 4話 (Unlimited Comics)"; // inserted space
>
> try (TokenStream tokens = analyzer.tokenStream("field", new StringReader(text1))) {
>     CharTermAttribute chars = tokens.addAttribute(CharTermAttribute.class);
>     tokens.reset();
>     while (tokens.incrementToken()) {
>         System.out.println(chars.toString());
>     }
>     tokens.end();
> } catch (IOException e) {
>     // should never happen with a StringReader
>     throw new RuntimeException(e);
> }
> {code}
> Output is:
> {code:java}
> // text1
> マギ
> アリス
> 単
> 版
> 話
> 4
> 話
> unlimited
> comics
>
> // text2
> マギア
> リス
> 単
> 版
> 話
> 4
> 話
> unlimited
> comics
> {code}
> It looks like the tokenizer doesn't treat the punctuation ({{【}} has the {{Character.START_PUNCTUATION}} type) as an indicator that there should be a token break, and somehow the {{【}} punctuation char causes a difference in the output. If I use the default {{JapaneseTokenizer}} (the commented-out constructor above) then this problem doesn't manifest, because it doesn't split {{マギアリス}} into multiple tokens and outputs it as is.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
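The workaround discussed here (tokenize with `discardPunctuation==false`, then strip the punctuation tokens in a filter) can be sketched outside of Lucene with plain JDK classes. The category check below is a simplified, hypothetical stand-in for {{JapaneseTokenizer#isPunctuation}}, not the actual Lucene implementation:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PunctuationFilterSketch {
    // Simplified, hypothetical stand-in for JapaneseTokenizer#isPunctuation:
    // a code point counts as punctuation if its Unicode general category is
    // one of the punctuation categories.
    static boolean isPunctuation(int cp) {
        switch (Character.getType(cp)) {
            case Character.START_PUNCTUATION:        // e.g. 【
            case Character.END_PUNCTUATION:          // e.g. 】
            case Character.DASH_PUNCTUATION:
            case Character.CONNECTOR_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // The "filter" step of the workaround: after tokenizing with
    // discardPunctuation == false, drop every token that consists only of
    // punctuation code points.
    static List<String> stripPunctuationTokens(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !t.codePoints().allMatch(PunctuationFilterSketch::isPunctuation))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Token stream as it might look with discardPunctuation == false:
        List<String> withPunct = Arrays.asList("マギアリス", "【", "単版話", "】");
        System.out.println(stripPunctuationTokens(withPunct)); // [マギアリス, 単版話]
    }
}
```

A real implementation would perform the same all-punctuation test inside a custom {{TokenFilter}}'s {{incrementToken()}}.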
[jira] [Commented] (LUCENE-9100) JapaneseTokenizer produces inconsistent tokens
[ https://issues.apache.org/jira/browse/LUCENE-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012178#comment-17012178 ]

Michael Sokolov commented on LUCENE-9100:

[~elbek@gmail.com] I'm curious what tokens are produced by {{日本語単話版}} (no space in between). From reading `JapaneseTokenizer` I'd expect it to be like the former case (with the brackets inserted). Maybe a solution here is to use the tokenizer with `discardPunctuation==false` and then strip the punctuation tokens in a filter.

> JapaneseTokenizer produces inconsistent tokens
> ----------------------------------------------
[jira] [Commented] (LUCENE-9100) JapaneseTokenizer produces inconsistent tokens
[ https://issues.apache.org/jira/browse/LUCENE-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010934#comment-17010934 ]

Elbek Kamoliddinov commented on LUCENE-9100:

I made some progress on this. I found a case where what follows a span of text influences how that span is tokenized. For example, this text {{日本語【単話版】}} produces the following tokens:
{code:java}
日本
日本語
語
単
話
版
{code}
But with this text {{日本語 単話版}}:
{code:java}
日本語
単
話
版
{code}
The char {{【}} influences how the first 3 chars are tokenized, even though the char itself is a punctuation char. Wouldn't it make sense to treat punctuation chars as a token breaker and cut their influence? The code I used to produce the tokens:
{code:java}
Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new JapaneseTokenizer(newAttributeFactory(), null, true, JapaneseTokenizer.Mode.SEARCH);
        return new TokenStreamComponents(tokenizer, new LowerCaseFilter(tokenizer));
    }
};

try (TokenStream tokens = analyzer.tokenStream("field", new StringReader("日本語 単話版"))) {
    CharTermAttribute chars = tokens.addAttribute(CharTermAttribute.class);
    tokens.reset();
    while (tokens.incrementToken()) {
        System.out.println(chars.toString());
    }
    tokens.end();
} catch (IOException e) {
    // should never happen with a StringReader
    throw new RuntimeException(e);
}
{code}

> JapaneseTokenizer produces inconsistent tokens
> ----------------------------------------------
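The Unicode category claim in this discussion is easy to verify with the plain JDK: the lenticular brackets really are reported as punctuation by {{Character.getType}}, while the surrounding kanji and katakana are letters. This only checks the categories; it is not the tokenizer's actual logic:

```java
public class BracketCategoryCheck {
    public static void main(String[] args) {
        // The lenticular brackets from the examples are bracket punctuation...
        System.out.println(Character.getType('【') == Character.START_PUNCTUATION); // true
        System.out.println(Character.getType('】') == Character.END_PUNCTUATION);   // true
        // ...while the kanji and katakana around them are plain letters.
        System.out.println(Character.getType('語') == Character.OTHER_LETTER);      // true
        System.out.println(Character.getType('リ') == Character.OTHER_LETTER);      // true
    }
}
```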
[jira] [Commented] (LUCENE-9100) JapaneseTokenizer produces inconsistent tokens
[ https://issues.apache.org/jira/browse/LUCENE-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001152#comment-17001152 ]

Elbek Kamoliddinov commented on LUCENE-9100:

I printed some details, but it looks like only the POS values are set:
{code:java}
text1:
Token=マギ BaseForm=null POS=JA名詞 Reading= Pronunciation= InflectionForm=null InflectionType=null
Token=アリス BaseForm=null POS=JA名詞 Reading= Pronunciation= InflectionForm=null InflectionType=null
Token=単 BaseForm=null POS=JA接頭辞 Reading=たん Pronunciation= InflectionForm=null InflectionType=null

text2:
Token=マギア BaseForm=null POS=JA名詞 Reading= Pronunciation= InflectionForm=null InflectionType=null
Token=リス BaseForm=null POS=JA名詞 Reading= Pronunciation= InflectionForm=null InflectionType=null
Token=単 BaseForm=null POS=JA接頭辞 Reading=たん Pronunciation= InflectionForm=null InflectionType=null
{code}
It looks like the readings are not defined in the dictionary? Thanks for your help!

> JapaneseTokenizer produces inconsistent tokens
> ----------------------------------------------
[jira] [Commented] (LUCENE-9100) JapaneseTokenizer produces inconsistent tokens
[ https://issues.apache.org/jira/browse/LUCENE-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000568#comment-17000568 ]

Kazuaki Hiraga commented on LUCENE-9100:

[~elbek@gmail.com] OK, now I understand why I was not able to reproduce your results. The tokenization results depend on how your custom dictionary was generated. Can you print the part-of-speech tags and other attributes along with the tokens? If the Katakana words don't have readings, they may not be in your dictionary; you can add them to the source of your custom system dictionary, or just add them to the user dictionary to see the outcome.

> JapaneseTokenizer produces inconsistent tokens
> ----------------------------------------------
[jira] [Commented] (LUCENE-9100) JapaneseTokenizer produces inconsistent tokens
[ https://issues.apache.org/jira/browse/LUCENE-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999688#comment-16999688 ]

Kazuaki Hiraga commented on LUCENE-9100:

{quote}note we use our own dictionary and connection costs{quote}
If you don't use a custom dictionary, why don't you use the standard/simple constructor, which you have commented out in your code?
{code:java}
Tokenizer tokenizer = new JapaneseTokenizer(newAttributeFactory(), null, true, JapaneseTokenizer.Mode.SEARCH);
{code}
I cannot reproduce your symptom with the standard constructor, which uses the above code.

> JapaneseTokenizer produces inconsistent tokens
> ----------------------------------------------
[jira] [Commented] (LUCENE-9100) JapaneseTokenizer produces inconsistent tokens
[ https://issues.apache.org/jira/browse/LUCENE-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999651#comment-16999651 ]

Elbek Kamoliddinov commented on LUCENE-9100:

I wonder if we could just replace all punctuation characters (i.e. anything matching {{JapaneseTokenizer#isPunctuation}}) with spaces, but there is a lot of logic in the tokenizer built around {{isPunctuation}}.

> JapaneseTokenizer produces inconsistent tokens
> ----------------------------------------------
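The idea of replacing punctuation with spaces before tokenization can be sketched as a plain pre-processing step. This is a toy illustration, not Lucene code; a real version would likely be a {{CharFilter}}, and the category list below is a deliberately narrower, hypothetical subset of what {{JapaneseTokenizer#isPunctuation}} covers:

```java
public class PunctuationToSpace {
    // Toy pre-processing step: map punctuation-category code points to spaces
    // before the text reaches the tokenizer, so punctuation can no longer
    // influence how the adjacent text is segmented.
    static String punctuationToSpace(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            int type = Character.getType(cp);
            boolean punct = type == Character.START_PUNCTUATION   // 【
                    || type == Character.END_PUNCTUATION          // 】
                    || type == Character.DASH_PUNCTUATION
                    || type == Character.OTHER_PUNCTUATION;
            sb.appendCodePoint(punct ? ' ' : cp);
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // The brackets become spaces, so the input tokenizes like the
        // space-separated variant from the examples in this issue.
        System.out.println(punctuationToSpace("日本語【単話版】"));
    }
}
```

Note that this discards the punctuation entirely, whereas the {{discardPunctuation==false}} plus filter approach keeps it visible to the tokenizer; the two sketches trade off differently against the tokenizer's existing {{isPunctuation}} logic.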