[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter

2021-03-22 Thread Tomoko Uchida (Jira)


[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305997#comment-17305997 ]

Tomoko Uchida commented on LUCENE-9413:
---

FYI: I'm planning to change the JapaneseAnalyzer default behaviour to use 
CJKWidthCharFilter instead of CJKWidthFilter (on main branch only): LUCENE-9853.
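A minimal sketch of what that change amounts to (an illustrative Analyzer 
subclass, not the actual LUCENE-9853 patch; it assumes CJKWidthCharFilter from 
analyzers-common and the 8.x three-argument JapaneseTokenizer constructor):

{code}
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cjk.CJKWidthCharFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;

public class WidthNormalizingAnalyzerSketch {
  public static Analyzer create() {
    return new Analyzer() {
      @Override
      protected Reader initReader(String fieldName, Reader reader) {
        // width normalization runs before the tokenizer sees the text,
        // replacing the post-tokenization CJKWidthFilter step
        return new CJKWidthCharFilter(reader);
      }

      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer =
            new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
        return new TokenStreamComponents(tokenizer);
      }
    };
  }
}
{code}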

> Add a char filter corresponding to CJKWidthFilter
> --------------------------------------------------
>
> Key: LUCENE-9413
> URL: https://issues.apache.org/jira/browse/LUCENE-9413
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Tomoko Uchida
> Assignee: Tomoko Uchida
> Priority: Minor
> Fix For: main (9.0), 8.8
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> In association with issues in Elasticsearch 
> ([https://github.com/elastic/elasticsearch/issues/58384] and 
> [https://github.com/elastic/elasticsearch/issues/58385]), it might be useful 
> for the default Japanese analyzer.
> Although I don't think it's a bug not to normalize FULL- and HALF-width 
> characters before tokenization, the behaviour sometimes confuses beginners or 
> users who have limited knowledge of Japanese analysis (and Unicode).
> If we had a FULL/HALF-width character normalization char filter in 
> {{analyzers-common}}, we could include it in JapaneseAnalyzer (currently, 
> JapaneseAnalyzer contains CJKWidthFilter, but it is applied after tokenization, 
> so some FULL-width numbers or Latin letters are split apart by the tokenizer).
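To make the ordering problem above concrete, here is a small illustrative 
snippet; the exact token output depends on the dictionary and Lucene version, 
so the splitting noted in the comments is an assumption, not a guaranteed 
result:

{code}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class WidthSplitDemo {
  public static void main(String[] args) throws Exception {
    // JapaneseAnalyzer applies CJKWidthFilter only *after* tokenization, so a
    // full-width run such as "１２３" may already have been split by the
    // tokenizer and come out as separate "1", "2", "3" tokens instead of "123".
    try (Analyzer analyzer = new JapaneseAnalyzer();
        TokenStream ts = analyzer.tokenStream("body", "価格は１２３円")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term);
      }
      ts.end();
    }
  }
}
{code}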






[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter

2020-11-17 Thread ASF subversion and git services (Jira)


[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233486#comment-17233486 ]

ASF subversion and git services commented on LUCENE-9413:
-

Commit 26b55463ffd4afe159d4aaee9713c766938e4276 in lucene-solr's branch 
refs/heads/branch_8x from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=26b5546 ]

LUCENE-9413: Add CJKWidthCharFilter and its factory. (#2081)









[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter

2020-11-17 Thread ASF subversion and git services (Jira)


[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233487#comment-17233487 ]

ASF subversion and git services commented on LUCENE-9413:
-

Commit 4c656ed52ce6ea78e3723e0b2962cebf10d54c18 in lucene-solr's branch 
refs/heads/branch_8x from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4c656ed ]

LUCENE-9413: fix tests








[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter

2020-11-17 Thread ASF subversion and git services (Jira)


[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233441#comment-17233441 ]

ASF subversion and git services commented on LUCENE-9413:
-

Commit 8503efdcff91461114a26f6aaae180a90570da2b in lucene-solr's branch 
refs/heads/master from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=8503efd ]

LUCENE-9413: Add CJKWidthCharFilter and its factory. (#2081)









[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter

2020-11-15 Thread Tomoko Uchida (Jira)


[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232272#comment-17232272 ]

Tomoko Uchida commented on LUCENE-9413:
---

Thanks! I created the PR just yesterday. I believe I've correctly ported the 
awesome tricks from CJKWidthFilter into the char filter... I hope it's ready to 
be merged.







[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter

2020-11-14 Thread Robert Muir (Jira)


[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232135#comment-17232135 ]

Robert Muir commented on LUCENE-9413:
-

Yes, I'll help review. I must have missed the PR.







[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter

2020-11-14 Thread Tomoko Uchida (Jira)


[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232119#comment-17232119 ]

Tomoko Uchida commented on LUCENE-9413:
---

[https://github.com/apache/lucene-solr/pull/2081] adds CJKWidthCharFilter, the 
exact counterpart of CJKWidthFilter. The char filter should be useful 
especially for dictionary-based CJK analyzers, e.g. Kuromoji.
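A minimal usage sketch, wiring the new char filter in front of the Japanese 
tokenizer via CustomAnalyzer (the SPI names "cjkWidth" and "japanese" are my 
assumption of how the factories are registered):

{code}
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

public class WidthNormalizedJapanese {
  public static Analyzer build() throws IOException {
    // char filters always run before the tokenizer, so full-width runs like
    // "１２３" reach the tokenizer already normalized to "123"
    return CustomAnalyzer.builder()
        .withTokenizer("japanese")  // JapaneseTokenizerFactory
        .addCharFilter("cjkWidth")  // assumed SPI name of CJKWidthCharFilterFactory
        .build();
  }
}
{code}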

[~rcmuir] what do you think - would you take a look at this?

 







[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter

2020-06-30 Thread Tomoko Uchida (Jira)


[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17148563#comment-17148563 ]

Tomoko Uchida commented on LUCENE-9413:
---

Thanks for paying attention to this; I have reopened the issue.
{quote}I think that's an acceptable trade-off; these entries with full-width 
characters don't seem to be high quality anyway
{quote}
Yes, full-width characters often appear in proper nouns in the dictionary; I 
agree that they can safely be ignored in many search situations...







[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter

2020-06-29 Thread Jim Ferenczi (Jira)


[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17148068#comment-17148068 ]

Jim Ferenczi commented on LUCENE-9413:
--

+1, I like the idea. Currently we ask users to install the ICU normalizer, but 
it would be nice to have a simple char filter in core to apply the 
normalization. In essence, this is similar to 
https://issues.apache.org/jira/browse/LUCENE-8972 but with a more contained 
scope.
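For reference, the ICU-based workaround we point users to today looks roughly 
like this (a sketch using ICUNormalizer2CharFilter from the analyzers-icu 
module; NFKC covers the width foldings along with many other compatibility 
mappings):

{code}
import java.io.Reader;
import com.ibm.icu.text.Normalizer2;
import org.apache.lucene.analysis.icu.ICUNormalizer2CharFilter;

public class IcuWidthNormalization {
  public static Reader wrap(Reader input) {
    // NFKC normalizes full-width/half-width variants (and much more) before
    // the tokenizer ever sees the text; a width-only char filter would cover
    // just the width mappings without the extra ICU dependency.
    return new ICUNormalizer2CharFilter(input, Normalizer2.getNFKCInstance());
  }
}
{code}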

 

> The mecab-ipadic dictionary has entries which include FULL-width characters, 
> so this naive approach (FULL/HALF-width character normalization before 
> tokenizing) can break tokenization. :/

I think that's an acceptable trade-off; these entries with full-width 
characters don't seem to be high quality anyway ;).







[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter

2020-06-20 Thread Tomoko Uchida (Jira)


[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141009#comment-17141009 ]

Tomoko Uchida commented on LUCENE-9413:
---

The mecab-ipadic dictionary has entries which include FULL-width characters, 
so this naive approach (FULL/HALF-width character normalization before 
tokenizing) can break tokenization. :/

Maybe we could concatenate "unknown" word sequences that consist only of 
numbers or Latin letters after tokenization? (A rough sketch of that idea 
follows the listing below.)

{code}
$ cut -d',' -f1 mecab-ipadic-all-utf8.csv | grep 1
12月
1番
11月
1月
10月
G7プラス1
小1
高1
1つ
F1
中1
110番
G1
1
ファスニング21
G10
インパクト21
アルゴテクノス21
セルヴィ21
モクネット21
U19
どさんこワイド212
西15線北
北13線
西14線北
北14線
西10号南
南1条
東11号北
東12線北
西11号北
駒場北1条通
東1線南
第1安井牧場
西10号北
東11線北
美旗町中1番
南21線西
南17線西
西10線北
岩内町第1基線
北15線
南12線西
東13線南
西13線北
西1線北
南16線西
西10線南
西16線北
西11線北
西12号北
西11線南
東10線北
北1線
東1線北
南13号
南14線西
南1線
北11線
西12線南
西14線南
南13線西
浦臼第1
西13線南
東10号北
南19線西
北1条
南11線西
平泉外12入会
東10線南
東10号南
南18線西
南15線西
東11号南
東12号北
北10線
駒場南1条通
南1番通
南10線西
北12線
西1線南
太田1の通り
東11線南
西12線北
東12線南
大泉1区南部
M40A1
F15戦闘機
DF31
F15
G1
辞林21
R12
O157
DF41
スーパー301
GP125
北13条東
M1A2
アポロ11号
{code}
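A rough, illustrative-only sketch of that concatenation idea (a hypothetical 
ConcatAlnumRunsFilter of my own naming; it merges adjacent tokens made solely 
of ASCII digits/letters, ignores position increments, and doesn't check whether 
the tokens were actually adjacent in the original text, so treat it as a 
starting point rather than the actual proposal):

{code}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

final class ConcatAlnumRunsFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private State pending;      // token that ended a run, emitted on the next call
  private boolean exhausted;  // input returned false; don't call it again

  ConcatAlnumRunsFilter(TokenStream in) {
    super(in);
  }

  private static boolean isAsciiAlnum(CharSequence s) {
    if (s.length() == 0) return false;
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (!((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))) {
        return false;
      }
    }
    return true;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pending != null) {  // emit the token that terminated the previous run
      restoreState(pending);
      pending = null;
      return true;
    }
    if (exhausted || !input.incrementToken()) {
      exhausted = true;
      return false;
    }
    if (!isAsciiAlnum(termAtt)) {
      return true;  // pass non-alphanumeric tokens through unchanged
    }
    // start of a run: keep consuming while the following tokens are alphanumeric
    StringBuilder run = new StringBuilder(termAtt);
    int start = offsetAtt.startOffset();
    int end = offsetAtt.endOffset();
    while (input.incrementToken()) {
      if (isAsciiAlnum(termAtt)) {
        run.append(termAtt);
        end = offsetAtt.endOffset();
      } else {
        pending = captureState();  // emit this token on the next call
        break;
      }
    }
    if (pending == null) {
      exhausted = true;  // the loop ended because the input ran out
    }
    termAtt.setEmpty().append(run);
    offsetAtt.setOffset(start, end);
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending = null;
    exhausted = false;
  }
}
{code}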







[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter

2020-06-20 Thread Tomoko Uchida (Jira)


[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140976#comment-17140976 ]

Tomoko Uchida commented on LUCENE-9413:
---

I can't take time to work on this soon, but I wanted to record it as an 
issue... comments and thoughts are welcome.



