[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter
[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305997#comment-17305997 ]

Tomoko Uchida commented on LUCENE-9413:
---------------------------------------

FYI: I'm planning to change the JapaneseAnalyzer default behaviour to use CJKWidthCharFilter instead of CJKWidthFilter (on the main branch only). LUCENE-9853

> Add a char filter corresponding to CJKWidthFilter
> -------------------------------------------------
>
>                 Key: LUCENE-9413
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9413
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Tomoko Uchida
>            Assignee: Tomoko Uchida
>            Priority: Minor
>             Fix For: main (9.0), 8.8
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In association with issues in Elasticsearch
> ([https://github.com/elastic/elasticsearch/issues/58384] and
> [https://github.com/elastic/elasticsearch/issues/58385]), it might be useful
> for the Japanese default analyzer.
> Although I don't think it's a bug to not normalize FULL and HALF width
> characters before tokenization, the behaviour sometimes confuses beginners or
> users who have limited knowledge about Japanese analysis (and Unicode).
> If we have a FULL and HALF width character normalization filter in
> {{analyzers-common}}, we can include it in JapaneseAnalyzer (currently,
> JapaneseAnalyzer contains CJKWidthFilter, but it is applied after tokenization,
> so some FULL width numbers or Latin alphabet runs are split by the
> tokenizer).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
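[Editor's note] For the width normalization discussed in this thread, full-width ASCII variants and half-width katakana are the characters covered by the Unicode half-width/full-width compatibility mappings, which NFKC applies. A minimal Python sketch of that mapping follows; note that Lucene's CJKWidthFilter implements its own mapping table rather than calling an NFKC normalizer, so this illustrates only the character mapping, not the Lucene code:

```python
import unicodedata

def fold_width(text: str) -> str:
    """Illustrative width folding via NFKC: full-width ASCII variants
    become ASCII, half-width katakana become full-width katakana."""
    return unicodedata.normalize("NFKC", text)

print(fold_width("F1"))      # full-width Latin/digit -> "F1"
print(fold_width("カタカナ"))  # half-width katakana -> "カタカナ"
```

One caveat: NFKC also applies unrelated compatibility mappings (e.g. ㌔ becomes キロ), which is one reason a dedicated width filter is preferable to full NFKC for this purpose.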
[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter
[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233486#comment-17233486 ]

ASF subversion and git services commented on LUCENE-9413:
---------------------------------------------------------

Commit 26b55463ffd4afe159d4aaee9713c766938e4276 in lucene-solr's branch refs/heads/branch_8x from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=26b5546 ]

LUCENE-9413: Add CJKWidthCharFilter and its factory. (#2081)
[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter
[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233487#comment-17233487 ]

ASF subversion and git services commented on LUCENE-9413:
---------------------------------------------------------

Commit 4c656ed52ce6ea78e3723e0b2962cebf10d54c18 in lucene-solr's branch refs/heads/branch_8x from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4c656ed ]

LUCENE-9413: fix tests
[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter
[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233441#comment-17233441 ]

ASF subversion and git services commented on LUCENE-9413:
---------------------------------------------------------

Commit 8503efdcff91461114a26f6aaae180a90570da2b in lucene-solr's branch refs/heads/master from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=8503efd ]

LUCENE-9413: Add CJKWidthCharFilter and its factory. (#2081)
[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter
[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232272#comment-17232272 ]

Tomoko Uchida commented on LUCENE-9413:
---------------------------------------

Thanks! I created the PR just yesterday. I believe I've correctly ported the awesome tricks in CJKWidthFilter into the char filter... I hope it is ready to be merged.
[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter
[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232135#comment-17232135 ]

Robert Muir commented on LUCENE-9413:
-------------------------------------

Yes, I'll help review. I must have missed the PR.
[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter
[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232119#comment-17232119 ]

Tomoko Uchida commented on LUCENE-9413:
---------------------------------------

[https://github.com/apache/lucene-solr/pull/2081] adds CJKWidthCharFilter, the exact counterpart of CJKWidthFilter. The char filter would be especially useful for dictionary-based CJK analyzers, e.g. kuromoji. [~rcmuir] what do you think - would you take a look at this?
[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter
[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148563#comment-17148563 ]

Tomoko Uchida commented on LUCENE-9413:
---------------------------------------

Thanks for paying attention to this; I have reopened the issue.
{quote}I think that's an acceptable trade-off, these entries with full width characters don't seem to be high quality anyway
{quote}
Yes, full-width characters often appear in proper nouns in the dictionary; I agree that they can be safely ignored in many search situations...
[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter
[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148068#comment-17148068 ]

Jim Ferenczi commented on LUCENE-9413:
--------------------------------------

+1, I like the idea. Currently we ask users to install the ICU normalizer, but it would be nice to have a simple char filter in core that applies the normalization. In essence, this is similar to https://issues.apache.org/jira/browse/LUCENE-8972 but with a more contained scope.

> The mecab-ipadic dictionary has entries which include FULL width characters,
> so this naive approach - FULL / HALF width character normalization before
> tokenizing - can break tokenization. :/

I think that's an acceptable trade-off; these entries with full-width characters don't seem to be high quality anyway ;).
[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter
[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17141009#comment-17141009 ]

Tomoko Uchida commented on LUCENE-9413:
---------------------------------------

The mecab-ipadic dictionary has entries which include FULL width characters, so this naive approach - FULL / HALF width character normalization before tokenizing - can break tokenization. :/ Maybe we could concatenate "unknown" word sequences which consist of only numbers or Latin alphabet characters, after tokenization?
{code}
$ cut -d',' -f1 mecab-ipadic-all-utf8.csv | grep 1
12月 1番 11月 1月 10月 G7プラス1 小1 高1 1つ F1 中1 110番 G1 1 ファスニング21 G10 インパクト21 アルゴテクノス21 セルヴィ21 モクネット21 U19 どさんこワイド212 西15線北 北13線 西14線北 北14線 西10号南 南1条 東11号北 東12線北 西11号北 駒場北1条通 東1線南 第1安井牧場 西10号北 東11線北 美旗町中1番 南21線西 南17線西 西10線北 岩内町第1基線 北15線 南12線西 東13線南 西13線北 西1線北 南16線西 西10線南 西16線北 西11線北 西12号北 西11線南 東10線北 北1線 東1線北 南13号 南14線西 南1線 北11線 西12線南 西14線南 南13線西 浦臼第1 西13線南 東10号北 南19線西 北1条 南11線西 平泉外12入会 東10線南 東10号南 南18線西 南15線西 東11号南 東12号北 北10線 駒場南1条通 南1番通 南10線西 北12線 西1線南 太田1の通り 東11線南 西12線北 東12線南 大泉1区南部 M40A1 F15戦闘機 DF31 F15 G1 辞林21 R12 O157 DF41 スーパー301 GP125 北13条東 M1A2 アポロ11号
{code}
[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter
[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140976#comment-17140976 ]

Tomoko Uchida commented on LUCENE-9413:
---------------------------------------

I cannot take time to work on this soon, but I wanted to file it as an issue... comments and thoughts are welcome.