[jira] [Commented] (LUCENE-9100) JapaneseTokenizer produces inconsistent tokens
[ https://issues.apache.org/jira/browse/LUCENE-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013045#comment-17013045 ]

Michael McCandless commented on LUCENE-9100:

{quote}Maybe a solution here is to use the tokenizer with `discardPunctuation==false`, then stripping the punctuation tokens in a filter.{quote}
+1, that sounds like a possible workaround. But it's still spooky that tokens can be formed across (deleted) punctuation ...

> JapaneseTokenizer produces inconsistent tokens
> ----------------------------------------------
>
>                 Key: LUCENE-9100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9100
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 7.2
>           Reporter: Elbek Kamoliddinov
>           Priority: Major
>
> We use {{JapaneseTokenizer}} in production and are seeing some inconsistent behavior. With this text: {{"マギアリス【単版話】 4話 (Unlimited Comics)"}} I get different results if I insert a space before the {{【}} char. Here is a small code snippet demonstrating the case (note: we use our own dictionary and connection costs):
> {code:java}
> Analyzer analyzer = new Analyzer() {
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName) {
>         //Tokenizer tokenizer = new JapaneseTokenizer(newAttributeFactory(), null, true, JapaneseTokenizer.Mode.SEARCH);
>         Tokenizer tokenizer = new JapaneseTokenizer(newAttributeFactory(), dictionaries.systemDictionary,
>                 dictionaries.unknownDictionary, dictionaries.connectionCosts, null, true, JapaneseTokenizer.Mode.SEARCH);
>         return new TokenStreamComponents(tokenizer, new LowerCaseFilter(tokenizer));
>     }
> };
>
> String text1 = "マギアリス【単版話】 4話 (Unlimited Comics)";
> String text2 = "マギアリス 【単版話】 4話 (Unlimited Comics)"; // inserted space
>
> try (TokenStream tokens = analyzer.tokenStream("field", new StringReader(text1))) {
>     CharTermAttribute chars = tokens.addAttribute(CharTermAttribute.class);
>     tokens.reset();
>     while (tokens.incrementToken()) {
>         System.out.println(chars.toString());
>     }
>     tokens.end();
> } catch (IOException e) {
>     // should never happen with a StringReader
>     throw new RuntimeException(e);
> }
> {code}
> Output is:
> {code:java}
> // text1
> マギ
> アリス
> 単
> 版
> 話
> 4
> 話
> unlimited
> comics
>
> // text2
> マギア
> リス
> 単
> 版
> 話
> 4
> 話
> unlimited
> comics
> {code}
> It looks like the tokenizer doesn't treat the punctuation ({{【}} has the {{Character.START_PUNCTUATION}} type) as an indicator that there should be a token break, and somehow the {{【}} punctuation char causes a difference in the output. If I use the default {{JapaneseTokenizer}} (the commented-out constructor above) then this problem doesn't manifest, because it doesn't split {{マギアリス}} into multiple tokens and outputs it as is.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
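The workaround discussed here (tokenize with `discardPunctuation==false`, then strip the punctuation tokens in a filter) can be sketched outside of Lucene with plain JDK classes. The category check below is a simplified, hypothetical stand-in for {{JapaneseTokenizer#isPunctuation}}, not the actual Lucene implementation:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PunctuationFilterSketch {
    // Simplified, hypothetical stand-in for JapaneseTokenizer#isPunctuation:
    // a code point counts as punctuation if its Unicode general category is
    // one of the punctuation categories.
    static boolean isPunctuation(int cp) {
        switch (Character.getType(cp)) {
            case Character.START_PUNCTUATION:        // e.g. 【
            case Character.END_PUNCTUATION:          // e.g. 】
            case Character.DASH_PUNCTUATION:
            case Character.CONNECTOR_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // The "filter" step of the workaround: after tokenizing with
    // discardPunctuation == false, drop every token that consists only of
    // punctuation code points.
    static List<String> stripPunctuationTokens(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !t.codePoints().allMatch(PunctuationFilterSketch::isPunctuation))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Token stream as it might look with discardPunctuation == false:
        List<String> withPunct = Arrays.asList("マギアリス", "【", "単版話", "】");
        System.out.println(stripPunctuationTokens(withPunct)); // [マギアリス, 単版話]
    }
}
```

A real implementation would perform the same all-punctuation test inside a custom {{TokenFilter}}'s {{incrementToken()}}.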
[jira] [Commented] (LUCENE-9100) JapaneseTokenizer produces inconsistent tokens
[ https://issues.apache.org/jira/browse/LUCENE-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012178#comment-17012178 ]

Michael Sokolov commented on LUCENE-9100:

[~elbek@gmail.com] I'm curious what tokens are produced by {{日本語単話版}} (no space in between). From reading `JapaneseTokenizer` I'd expect it to be like the former case (with the brackets inserted). Maybe a solution here is to use the tokenizer with `discardPunctuation==false` and then strip the punctuation tokens in a filter.

> JapaneseTokenizer produces inconsistent tokens
> ----------------------------------------------
[jira] [Commented] (LUCENE-9100) JapaneseTokenizer produces inconsistent tokens
[ https://issues.apache.org/jira/browse/LUCENE-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010934#comment-17010934 ]

Elbek Kamoliddinov commented on LUCENE-9100:

I made some progress on this. I found a case where what follows a span of text influences how that span is tokenized. For example, this text {{日本語【単話版】}} produces the following tokens:
{code:java}
日本
日本語
語
単
話
版
{code}
But with this text {{日本語 単話版}}:
{code:java}
日本語
単
話
版
{code}
The char {{【}} influences how the first 3 chars are tokenized, even though the char itself is a punctuation char. Wouldn't it make sense to treat punctuation chars as a token breaker and cut their influence? The code I used to produce the tokens:
{code:java}
Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new JapaneseTokenizer(newAttributeFactory(), null, true, JapaneseTokenizer.Mode.SEARCH);
        return new TokenStreamComponents(tokenizer, new LowerCaseFilter(tokenizer));
    }
};

try (TokenStream tokens = analyzer.tokenStream("field", new StringReader("日本語 単話版"))) {
    CharTermAttribute chars = tokens.addAttribute(CharTermAttribute.class);
    tokens.reset();
    while (tokens.incrementToken()) {
        System.out.println(chars.toString());
    }
    tokens.end();
} catch (IOException e) {
    // should never happen with a StringReader
    throw new RuntimeException(e);
}
{code}

> JapaneseTokenizer produces inconsistent tokens
> ----------------------------------------------
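The Unicode category claim in this discussion is easy to verify with the plain JDK: the lenticular brackets really are reported as punctuation by {{Character.getType}}, while the surrounding kanji and katakana are letters. This only checks the categories; it is not the tokenizer's actual logic:

```java
public class BracketCategoryCheck {
    public static void main(String[] args) {
        // The lenticular brackets from the examples are bracket punctuation...
        System.out.println(Character.getType('【') == Character.START_PUNCTUATION); // true
        System.out.println(Character.getType('】') == Character.END_PUNCTUATION);   // true
        // ...while the kanji and katakana around them are plain letters.
        System.out.println(Character.getType('語') == Character.OTHER_LETTER);      // true
        System.out.println(Character.getType('リ') == Character.OTHER_LETTER);      // true
    }
}
```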
[jira] [Commented] (LUCENE-9100) JapaneseTokenizer produces inconsistent tokens
[ https://issues.apache.org/jira/browse/LUCENE-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001152#comment-17001152 ]

Elbek Kamoliddinov commented on LUCENE-9100:

I printed some details, but it looks like only the POS values are set:
{code:java}
text1:
Token=マギ BaseForm=null POS=JA名詞 Reading= Pronunciation= InflectionForm=null InflectionType=null
Token=アリス BaseForm=null POS=JA名詞 Reading= Pronunciation= InflectionForm=null InflectionType=null
Token=単 BaseForm=null POS=JA接頭辞 Reading=たん Pronunciation= InflectionForm=null InflectionType=null

text2:
Token=マギア BaseForm=null POS=JA名詞 Reading= Pronunciation= InflectionForm=null InflectionType=null
Token=リス BaseForm=null POS=JA名詞 Reading= Pronunciation= InflectionForm=null InflectionType=null
Token=単 BaseForm=null POS=JA接頭辞 Reading=たん Pronunciation= InflectionForm=null InflectionType=null
{code}
It looks like the readings are not defined in the dictionary? Thanks for your help!

> JapaneseTokenizer produces inconsistent tokens
> ----------------------------------------------
[jira] [Commented] (LUCENE-9100) JapaneseTokenizer produces inconsistent tokens
[ https://issues.apache.org/jira/browse/LUCENE-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000568#comment-17000568 ]

Kazuaki Hiraga commented on LUCENE-9100:

[~elbek@gmail.com] OK, now I understand why I was not able to reproduce your results. The tokenization results depend on how your custom dictionary was generated. Can you print the part-of-speech tags and other attributes along with the tokens? If the Katakana words don't have readings, they may not be in your dictionary; you can add them to the source of your custom system dictionary, or just add them to the user dictionary to see the outcome.

> JapaneseTokenizer produces inconsistent tokens
> ----------------------------------------------
[jira] [Commented] (LUCENE-9100) JapaneseTokenizer produces inconsistent tokens
[ https://issues.apache.org/jira/browse/LUCENE-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999688#comment-16999688 ]

Kazuaki Hiraga commented on LUCENE-9100:

{quote}note we use our own dictionary and connection costs{quote}
If you don't use a custom dictionary, why don't you use the standard/simple constructor, which you have commented out in your code?
{code:java}
Tokenizer tokenizer = new JapaneseTokenizer(newAttributeFactory(), null, true, JapaneseTokenizer.Mode.SEARCH);
{code}
I cannot reproduce your symptom with the standard constructor, which uses the above code.

> JapaneseTokenizer produces inconsistent tokens
> ----------------------------------------------
[jira] [Commented] (LUCENE-9100) JapaneseTokenizer produces inconsistent tokens
[ https://issues.apache.org/jira/browse/LUCENE-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999651#comment-16999651 ]

Elbek Kamoliddinov commented on LUCENE-9100:

I wonder if we could just replace all punctuation characters (i.e. anything matching {{JapaneseTokenizer#isPunctuation}}) with spaces, but there is a lot of logic in the tokenizer built around {{isPunctuation}}.

> JapaneseTokenizer produces inconsistent tokens
> ----------------------------------------------
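The idea of replacing punctuation with spaces before tokenization can be sketched as a plain pre-processing step. This is a toy illustration, not Lucene code; a real version would likely be a {{CharFilter}}, and the category list below is a deliberately narrower, hypothetical subset of what {{JapaneseTokenizer#isPunctuation}} covers:

```java
public class PunctuationToSpace {
    // Toy pre-processing step: map punctuation-category code points to spaces
    // before the text reaches the tokenizer, so punctuation can no longer
    // influence how the adjacent text is segmented.
    static String punctuationToSpace(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            int type = Character.getType(cp);
            boolean punct = type == Character.START_PUNCTUATION   // 【
                    || type == Character.END_PUNCTUATION          // 】
                    || type == Character.DASH_PUNCTUATION
                    || type == Character.OTHER_PUNCTUATION;
            sb.appendCodePoint(punct ? ' ' : cp);
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // The brackets become spaces, so the input tokenizes like the
        // space-separated variant from the examples in this issue.
        System.out.println(punctuationToSpace("日本語【単話版】"));
    }
}
```

Note that this discards the punctuation entirely, whereas the {{discardPunctuation==false}} plus filter approach keeps it visible to the tokenizer; the two sketches trade off differently against the tokenizer's existing {{isPunctuation}} logic.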