[jira] [Comment Edited] (LUCENE-9390) Kuromoji tokenizer discards tokens if they start with a punctuation character

2020-06-12 Thread Jun Ohtani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134428#comment-17134428
 ] 

Jun Ohtani edited comment on LUCENE-9390 at 6/12/20, 5:53 PM:
--

I also checked *UniDic* for punctuation characters, because I was working on 
[https://github.com/apache/lucene-solr/pull/935].
 # words that start with a punctuation character: 606 words (222 of them are longer than 1 character)
 # words that consist entirely of punctuation characters: 111 words
 # words that contain punctuation somewhere after the 1st character: 1780 words

Here is the word list.

[https://gist.github.com/johtani/3769639bc24ebeab17ddcb1be039ba94]
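For reference, counts like these can be reproduced with a rough scan over the dictionary CSV files. The sketch below is illustrative only, not the script actually used for the numbers above: the UTF-8 encoding, the CSV layout (surface form in the first column), the exact definition of category no. 3, and the isPunctuation helper (which approximates the character categories the Kuromoji tokenizer treats as punctuation) are all assumptions.

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class PunctuationWordCount {

  // Approximation of the per-character punctuation test used by the Kuromoji tokenizer:
  // separators, control/format characters, punctuation and symbols all count as punctuation.
  static boolean isPunctuation(char ch) {
    switch (Character.getType(ch)) {
      case Character.SPACE_SEPARATOR:
      case Character.LINE_SEPARATOR:
      case Character.PARAGRAPH_SEPARATOR:
      case Character.CONTROL:
      case Character.FORMAT:
      case Character.DASH_PUNCTUATION:
      case Character.START_PUNCTUATION:
      case Character.END_PUNCTUATION:
      case Character.CONNECTOR_PUNCTUATION:
      case Character.OTHER_PUNCTUATION:
      case Character.MATH_SYMBOL:
      case Character.CURRENCY_SYMBOL:
      case Character.MODIFIER_SYMBOL:
      case Character.OTHER_SYMBOL:
      case Character.INITIAL_QUOTE_PUNCTUATION:
      case Character.FINAL_QUOTE_PUNCTUATION:
        return true;
      default:
        return false;
    }
  }

  public static void main(String[] args) throws IOException {
    long startsWithPunct = 0, allPunct = 0, punctAfterFirst = 0;
    // Assumes a UTF-8 encoded dictionary CSV; raw MeCab CSVs may be EUC-JP instead.
    List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
    for (String line : lines) {
      String surface = line.split(",", 2)[0]; // first column is the surface form
      if (surface.isEmpty()) {
        continue;
      }
      if (isPunctuation(surface.charAt(0))) {
        startsWithPunct++; // category 1: starts with a punctuation character
      }
      if (surface.chars().allMatch(c -> isPunctuation((char) c))) {
        allPunct++; // category 2: every character is punctuation
      }
      if (surface.length() > 1
          && surface.substring(1).chars().anyMatch(c -> isPunctuation((char) c))) {
        punctAfterFirst++; // category 3 (assumed): punctuation somewhere after the 1st character
      }
    }
    System.out.println("starts with punctuation:    " + startsWithPunct);
    System.out.println("all punctuation:            " + allPunct);
    System.out.println("punctuation after 1st char: " + punctAfterFirst);
  }
}
{code}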


was (Author: jun_o):
I also checked *UniDic* for punctuation characters, because I was working on 
[https://github.com/apache/lucene-solr/pull/935].
 # words that start with a punctuation character: 606 words (222 of them are longer than 1 character)
 # words that consist entirely of punctuation characters: 111 words
 # words that contain punctuation somewhere after the 1st character: 1780 words

> Kuromoji tokenizer discards tokens if they start with a punctuation character
> -
>
> Key: LUCENE-9390
> URL: https://issues.apache.org/jira/browse/LUCENE-9390
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
>
> This issue was first raised in Elasticsearch 
> [here|https://github.com/elastic/elasticsearch/issues/57614].
> The unidic dictionary that is used by the Kuromoji tokenizer contains entries 
> that mix punctuation and other characters. For instance the following entry:
> _(株),1285,1285,3690,名詞,一般,*,*,*,*,(株),カブシキガイシャ,カブシキガイシャ_
> can be found in the Noun.csv file.
> Today, tokens that start with punctuation are automatically removed by 
> default (discardPunctuation is true). I think the code was written this way 
> because we expect punctuation to be separated from normal tokens, but there 
> are exceptions in the original dictionary. Maybe we should check the entire 
> token when discarding punctuation?
>  
>  
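The suggestion at the end of the description amounts to discarding a token only when every character in it is punctuation, instead of looking at the first character alone. A self-contained sketch of that check follows; it is hypothetical code, not the actual JapaneseTokenizer implementation, and its isPunctuation helper only approximates the tokenizer's per-character test.

{code:java}
public class PunctuationDiscardSketch {

  // Per-character punctuation test (same categories as the counting sketch above).
  static boolean isPunctuation(char ch) {
    int t = Character.getType(ch);
    return t == Character.SPACE_SEPARATOR || t == Character.LINE_SEPARATOR
        || t == Character.PARAGRAPH_SEPARATOR || t == Character.CONTROL
        || t == Character.FORMAT || t == Character.DASH_PUNCTUATION
        || t == Character.START_PUNCTUATION || t == Character.END_PUNCTUATION
        || t == Character.CONNECTOR_PUNCTUATION || t == Character.OTHER_PUNCTUATION
        || t == Character.MATH_SYMBOL || t == Character.CURRENCY_SYMBOL
        || t == Character.MODIFIER_SYMBOL || t == Character.OTHER_SYMBOL
        || t == Character.INITIAL_QUOTE_PUNCTUATION || t == Character.FINAL_QUOTE_PUNCTUATION;
  }

  // Proposed check: discard a token only when every one of its characters is punctuation.
  static boolean shouldDiscardToken(String surface, boolean discardPunctuation) {
    if (!discardPunctuation || surface.isEmpty()) {
      return false;
    }
    for (int i = 0; i < surface.length(); i++) {
      if (!isPunctuation(surface.charAt(i))) {
        return false; // mixed entry such as "(株)" is kept
      }
    }
    return true; // pure punctuation such as "──" is still dropped
  }

  public static void main(String[] args) {
    System.out.println(shouldDiscardToken("(株)", true)); // false: kept
    System.out.println(shouldDiscardToken("──", true));   // true: dropped
  }
}
{code}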



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-9390) Kuromoji tokenizer discards tokens if they start with a punctuation character

2020-06-12 Thread Jun Ohtani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134249#comment-17134249
 ] 

Jun Ohtani edited comment on LUCENE-9390 at 6/12/20, 2:45 PM:
--

I counted 3 types of words in the IPAdic CSV files.
 # words that start with a punctuation character: 101 words (only 4 of them are longer than 1 character)
 # words that consist entirely of punctuation characters: 3 words
 # words that contain punctuation somewhere after the 1st character: 723 words

I counted no. 3 only because I was curious about it.

Reference: Word list.

 [https://gist.github.com/johtani/50aa2776a385c5c8dfa3a0d1e4e268cd]

The 4 words (longer than 1 character) that start with punctuation are below:
(社)
 (財)
 (有)
 (株)

The all-punctuation words are:

——
 −−
 ──
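Dictionary entries like these are exactly what gets dropped silently today when discardPunctuation is true. A minimal demonstration with JapaneseTokenizer is sketched below; the sample text アパッチ(株) is made up for illustration, and whether (株) comes back as a single token depends on the bundled dictionary.

{code:java}
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.dict.UserDictionary;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class DiscardPunctuationDemo {

  // Print the tokens kuromoji produces for the given text.
  static void printTokens(String text, boolean discardPunctuation) throws IOException {
    JapaneseTokenizer tokenizer =
        new JapaneseTokenizer((UserDictionary) null, discardPunctuation, JapaneseTokenizer.Mode.NORMAL);
    tokenizer.setReader(new StringReader(text));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    System.out.print("discardPunctuation=" + discardPunctuation + ":");
    while (tokenizer.incrementToken()) {
      System.out.print(" [" + term + "]");
    }
    System.out.println();
    tokenizer.end();
    tokenizer.close();
  }

  public static void main(String[] args) throws IOException {
    // If the dictionary resolves (株) to the single entry shown in the issue
    // description, it is dropped when discardPunctuation is true because its
    // first character is punctuation.
    printTokens("アパッチ(株)", true);
    printTokens("アパッチ(株)", false);
  }
}
{code}

If the dictionary produces (株) as a single entry, it should appear only in the second line of output, as reported in the linked Elasticsearch issue.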
  


was (Author: jun_o):
I counted 3 types of words in the IPAdic CSV files.
 # words that start with a punctuation character: 104 words (only 7 of them are longer than 1 character)
 # words that consist entirely of punctuation characters: 0 words
 # words that contain punctuation somewhere after the 1st character: 723 words

Word list.
 [https://gist.github.com/johtani/50aa2776a385c5c8dfa3a0d1e4e268cd]



The 7 words (longer than 1 character) that start with punctuation are below:
——
−−
──
(社)
(財)
(有)
(株)
 




[jira] [Comment Edited] (LUCENE-9390) Kuromoji tokenizer discards tokens if they start with a punctuation character

2020-06-03 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125542#comment-17125542
 ] 

Tomoko Uchida edited comment on LUCENE-9390 at 6/4/20, 4:54 AM:


Personally, I usually set the "discardPunctuation" flag to false to avoid such 
subtle situations.

As a possible solution, instead of the "discardPunctuation" flag we could add a 
token filter that discards all tokens composed only of punctuation characters 
after tokenization (just like the stop filter)? To me, this is a token filter's 
job rather than a tokenizer's...
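A minimal sketch of such a filter follows. It is hypothetical code, not an existing Lucene class, and its isPunctuation test only approximates the per-character check the tokenizer uses.

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.FilteringTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * Drops tokens that consist solely of punctuation characters, leaving mixed
 * tokens such as "(株)" untouched. Hypothetical filter, not part of Lucene.
 */
public final class DropPunctuationOnlyFilter extends FilteringTokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public DropPunctuationOnlyFilter(TokenStream in) {
    super(in);
  }

  @Override
  protected boolean accept() throws IOException {
    for (int i = 0; i < termAtt.length(); i++) {
      if (!isPunctuation(termAtt.charAt(i))) {
        return true; // keep tokens containing at least one non-punctuation character
      }
    }
    return false; // every character is punctuation: drop the token
  }

  // Approximation of the tokenizer's per-character punctuation test.
  private static boolean isPunctuation(char ch) {
    int t = Character.getType(ch);
    return t == Character.DASH_PUNCTUATION || t == Character.START_PUNCTUATION
        || t == Character.END_PUNCTUATION || t == Character.CONNECTOR_PUNCTUATION
        || t == Character.OTHER_PUNCTUATION || t == Character.INITIAL_QUOTE_PUNCTUATION
        || t == Character.FINAL_QUOTE_PUNCTUATION || t == Character.MATH_SYMBOL
        || t == Character.CURRENCY_SYMBOL || t == Character.MODIFIER_SYMBOL
        || t == Character.OTHER_SYMBOL || t == Character.SPACE_SEPARATOR
        || t == Character.LINE_SEPARATOR || t == Character.PARAGRAPH_SEPARATOR
        || t == Character.CONTROL || t == Character.FORMAT;
  }
}
{code}

Chained after JapaneseTokenizer configured with discardPunctuation=false, a filter like this would drop pure punctuation tokens while keeping mixed entries such as (株).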


was (Author: tomoko uchida):
Personally, I usually set the "discardPunctuation" flag to false to avoid such 
subtle situations.

As a possible solution, instead of the "discardPunctuation" flag we could add a 
token filter that removes all tokens composed only of punctuation characters 
after tokenization (just like the stop filter)? To me, this is a token filter's 
job rather than a tokenizer's...
