[jira] [Comment Edited] (LUCENE-9390) Kuromoji tokenizer discards tokens if they start with a punctuation character
[ https://issues.apache.org/jira/browse/LUCENE-9390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134428#comment-17134428 ]

Jun Ohtani edited comment on LUCENE-9390 at 6/12/20, 5:53 PM:
--------------------------------------------------------------

I also checked *UniDic* for punctuation characters, because I was working on [https://github.com/apache/lucene-solr/pull/935].
# Words that start with a punctuation character: 606 words, 222 of which have length > 1
# Words that consist only of punctuation characters: 111 words
# Words that contain punctuation somewhere other than the first character: 1780 words

Here is the word list:
[https://gist.github.com/johtani/3769639bc24ebeab17ddcb1be039ba94]

was (Author: jun_o):
I also checked *UniDic* for punctuation characters, because I was working on [https://github.com/apache/lucene-solr/pull/935].
# Words that start with a punctuation character: 606 words, 222 of which have length > 1
# Words that consist only of punctuation characters: 111 words
# Words that contain punctuation somewhere other than the first character: 1780 words

> Kuromoji tokenizer discards tokens if they start with a punctuation character
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-9390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9390
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>
> This issue was first raised in Elasticsearch
> [here|https://github.com/elastic/elasticsearch/issues/57614]
> The unidic dictionary that is used by the Kuromoji tokenizer contains entries
> that mix punctuation and other characters. For instance, the following entry:
> _(株),1285,1285,3690,名詞,一般,*,*,*,*,(株),カブシキガイシャ,カブシキガイシャ_
> can be found in the Noun.csv file.
> Today, tokens that start with punctuation are automatically removed by
> default (discardPunctuation is true). I think the code was written this way
> because we expect punctuation to be separated from normal tokens, but there
> are exceptions in the original dictionary. Maybe we should check the entire
> token when discarding punctuation?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
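
The issue description above contrasts two ways of deciding whether a token is punctuation: looking only at its first character, or checking the entire token. The following is a minimal Python sketch of that difference, not the Kuromoji implementation. The helper names are hypothetical, and the definition of "punctuation" (Unicode general categories starting with P or S) is an assumption that may not match the tokenizer's own rule.

```python
import unicodedata


def is_punctuation(ch: str) -> bool:
    # Assumption: punctuation means Unicode general category P* or S*.
    # The real tokenizer may use a different set of categories.
    return unicodedata.category(ch)[0] in ("P", "S")


def discard_by_first_char(token: str) -> bool:
    # Behaviour described in the issue: only the first character is examined.
    return bool(token) and is_punctuation(token[0])


def discard_by_whole_token(token: str) -> bool:
    # Behaviour suggested in the issue: every character must be punctuation.
    return bool(token) and all(is_punctuation(ch) for ch in token)


if __name__ == "__main__":
    for token in ["(株)", "、", "東京"]:
        print(token, discard_by_first_char(token), discard_by_whole_token(token))
```

With the first-character rule the dictionary entry (株) is dropped because it begins with a parenthesis; with the whole-token rule it is kept, while a lone 、 is still discarded.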
[jira] [Comment Edited] (LUCENE-9390) Kuromoji tokenizer discards tokens if they start with a punctuation character
[ https://issues.apache.org/jira/browse/LUCENE-9390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134249#comment-17134249 ]

Jun Ohtani edited comment on LUCENE-9390 at 6/12/20, 2:45 PM:
--------------------------------------------------------------

I counted three types of words in the ipadic CSV files.
# Words that start with a punctuation character: 101 words, only 4 of which have length > 1
# Words that consist only of punctuation characters: 3 words
# Words that contain punctuation somewhere other than the first character: 723 words

I counted no. 3 only out of curiosity.

Reference word list:
[https://gist.github.com/johtani/50aa2776a385c5c8dfa3a0d1e4e268cd]

The 4 words that start with punctuation are:
(社) (財) (有) (株)

The all-punctuation words are:
—— −− ──

was (Author: jun_o):
I counted three types of words in the ipadic CSV files.
# Words that start with a punctuation character: 104 words, only 7 of which have length > 1
# Words that consist only of punctuation characters: 0 words
# Words that contain punctuation somewhere other than the first character: 723 words

Word list:
[https://gist.github.com/johtani/50aa2776a385c5c8dfa3a0d1e4e268cd]

The 7 words are:
—— −− ── (社) (財) (有) (株)
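
The kind of count reported above can be sketched roughly as follows. This is not the script that produced the numbers in the comment: the exact counting rules are not stated there, so the categories below are one possible reading and the totals on real dictionaries may differ. The function names, the punctuation definition (Unicode categories P* and S*), and every sample row except the (株) entry quoted in the issue are made up for illustration.

```python
import csv
import io
import unicodedata
from collections import Counter


def is_punctuation(ch: str) -> bool:
    # Assumption: punctuation means Unicode general category P* or S*.
    return unicodedata.category(ch)[0] in ("P", "S")


def classify(surface: str) -> str:
    # Each surface form falls into exactly one category.
    flags = [is_punctuation(ch) for ch in surface]
    if not any(flags):
        return "no_punctuation"
    if all(flags):
        return "all_punctuation"
    if flags[0]:
        return "starts_with_punctuation"
    return "punctuation_after_first_char"


def count_categories(csv_text: str) -> Counter:
    # The surface form is taken from the first CSV column.
    reader = csv.reader(io.StringIO(csv_text))
    return Counter(classify(row[0]) for row in reader if row)


# First row is the entry quoted in the issue; the others are invented samples.
SAMPLE = "\n".join([
    "(株),1285,1285,3690,名詞,一般,*,*,*,*,(株),カブシキガイシャ,カブシキガイシャ",
    "\u2014\u2014,0,0,0",
    "A&B,0,0,0",
    "東京,0,0,0",
])

counts = count_categories(SAMPLE)

if __name__ == "__main__":
    for category, n in sorted(counts.items()):
        print(category, n)
```

To run this against real dictionary files, the text of each CSV file would be passed to `count_categories` and the resulting counters summed; the file encoding has to be chosen to match the dictionary distribution.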
[jira] [Comment Edited] (LUCENE-9390) Kuromoji tokenizer discards tokens if they start with a punctuation character
[ https://issues.apache.org/jira/browse/LUCENE-9390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125542#comment-17125542 ]

Tomoko Uchida edited comment on LUCENE-9390 at 6/4/20, 4:54 AM:
----------------------------------------------------------------

Personally, I usually set the "discardPunctuation" flag to false to avoid such subtle situations.

As a possible solution, instead of the "discardPunctuation" flag, could we add a token filter that discards all tokens composed only of punctuation characters after tokenization (just like a stop filter)? To me, this is a token filter's job rather than a tokenizer's...

was (Author: tomoko uchida):
Personally, I usually set the "discardPunctuation" flag to false to avoid such subtle situations.

As a possible solution, instead of the "discardPunctuation" flag, could we add a token filter to discard tokens that remove all tokens composed only of punctuation characters after tokenization (just like a stop filter)? To me, this is a token filter's job rather than a tokenizer's...
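
The token-filter idea proposed above could look roughly like the following Python sketch: the tokenizer keeps punctuation, and a separate stage afterwards drops tokens made up entirely of punctuation. This is an illustration of the idea, not Lucene code. Tokens are modelled as hypothetical (term, position increment) pairs, the increment of a dropped token is carried over to the next kept token (an assumption about what "just like a stop filter" would imply), and the sample tokens are invented rather than real Kuromoji output.

```python
import unicodedata
from typing import Iterable, Iterator, Tuple

Token = Tuple[str, int]  # (term, position increment); a simplified model


def is_punctuation(ch: str) -> bool:
    # Assumption: punctuation means Unicode general category P* or S*.
    return unicodedata.category(ch)[0] in ("P", "S")


def drop_punctuation_only(tokens: Iterable[Token]) -> Iterator[Token]:
    skipped = 0  # position increments of tokens dropped since the last kept one
    for term, pos_inc in tokens:
        if term and all(is_punctuation(ch) for ch in term):
            skipped += pos_inc
            continue
        # Carry the gap left by dropped tokens into the next kept token.
        yield term, pos_inc + skipped
        skipped = 0


# Invented token stream, as if the tokenizer had kept punctuation.
tokens = [("(株)", 1), ("、", 1), ("東京", 1), ("。", 1)]
filtered = list(drop_punctuation_only(tokens))

if __name__ == "__main__":
    print(filtered)
```

Here (株) survives because it contains a non-punctuation character, while 、 and 。 are removed; 東京 ends up with a position increment of 2, recording that one token was dropped before it.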