[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter
[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305997#comment-17305997 ]

Tomoko Uchida commented on LUCENE-9413:
---------------------------------------

FYI: I'm planning to change the JapaneseAnalyzer default behaviour to use CJKWidthCharFilter instead of CJKWidthFilter (on the main branch only). LUCENE-9853

> Add a char filter corresponding to CJKWidthFilter
> -------------------------------------------------
>
>                 Key: LUCENE-9413
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9413
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Tomoko Uchida
>            Assignee: Tomoko Uchida
>            Priority: Minor
>             Fix For: main (9.0), 8.8
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In association with issues in Elasticsearch
> ([https://github.com/elastic/elasticsearch/issues/58384] and
> [https://github.com/elastic/elasticsearch/issues/58385]), it might be useful
> for the Japanese default analyzer.
> Although I don't think it's a bug to not normalize FULL and HALF width
> characters before tokenization, the behaviour sometimes confuses beginners or
> users who have limited knowledge about Japanese analysis (and Unicode).
> If we have a FULL and HALF width character normalization filter in
> {{analyzers-common}}, we can include it in JapaneseAnalyzer (currently,
> JapaneseAnalyzer contains CJKWidthFilter, but it is applied after tokenization,
> so some FULL width numbers or Latin alphabet runs are split by the
> tokenizer).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
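[Editor's note] For the width normalization discussed in this thread, full-width ASCII variants and half-width katakana are the characters covered by the Unicode half-width/full-width compatibility mappings, which NFKC applies. A minimal Python sketch of that mapping follows; note that Lucene's CJKWidthFilter implements its own mapping table rather than calling an NFKC normalizer, so this illustrates only the character mapping, not the Lucene code:

```python
import unicodedata

def fold_width(text: str) -> str:
    """Illustrative width folding via NFKC: full-width ASCII variants
    become ASCII, half-width katakana become full-width katakana."""
    return unicodedata.normalize("NFKC", text)

print(fold_width("F1"))      # full-width Latin/digit -> "F1"
print(fold_width("カタカナ"))  # half-width katakana -> "カタカナ"
```

One caveat: NFKC also applies unrelated compatibility mappings (e.g. ㌔ becomes キロ), which is one reason a dedicated width filter is preferable to full NFKC for this purpose.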
[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter
[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233486#comment-17233486 ]

ASF subversion and git services commented on LUCENE-9413:
---------------------------------------------------------

Commit 26b55463ffd4afe159d4aaee9713c766938e4276 in lucene-solr's branch refs/heads/branch_8x from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=26b5546 ]

LUCENE-9413: Add CJKWidthCharFilter and its factory. (#2081)
[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter
[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233487#comment-17233487 ]

ASF subversion and git services commented on LUCENE-9413:
---------------------------------------------------------

Commit 4c656ed52ce6ea78e3723e0b2962cebf10d54c18 in lucene-solr's branch refs/heads/branch_8x from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4c656ed ]

LUCENE-9413: fix tests
[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter
[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233441#comment-17233441 ]

ASF subversion and git services commented on LUCENE-9413:
---------------------------------------------------------

Commit 8503efdcff91461114a26f6aaae180a90570da2b in lucene-solr's branch refs/heads/master from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=8503efd ]

LUCENE-9413: Add CJKWidthCharFilter and its factory. (#2081)
[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter
[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232272#comment-17232272 ]

Tomoko Uchida commented on LUCENE-9413:
---------------------------------------

Thanks! I created the PR just yesterday. I believe I've correctly ported the awesome tricks in CJKWidthFilter into the char filter... I hope it is ready to be merged.
[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter
[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232135#comment-17232135 ]

Robert Muir commented on LUCENE-9413:
-------------------------------------

Yes, I'll help review. I must have missed the PR.
[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter
[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232119#comment-17232119 ]

Tomoko Uchida commented on LUCENE-9413:
---------------------------------------

[https://github.com/apache/lucene-solr/pull/2081] adds CJKWidthCharFilter, the exact counterpart of CJKWidthFilter. The char filter would be especially useful for dictionary-based CJK analyzers, e.g. kuromoji. [~rcmuir] what do you think - would you take a look at this?
[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter
[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148563#comment-17148563 ]

Tomoko Uchida commented on LUCENE-9413:
---------------------------------------

Thanks for paying attention to this; I have reopened the issue.
{quote}I think that's an acceptable trade-off, these entries with full width characters don't seem to be high quality anyway
{quote}
Yes, full-width characters often appear in proper nouns in the dictionary; I agree that they can be safely ignored in many search situations...
[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter
[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148068#comment-17148068 ]

Jim Ferenczi commented on LUCENE-9413:
--------------------------------------

+1, I like the idea. Currently we ask users to install the ICU normalizer, but it would be nice to have a simple char filter in core that applies the normalization. In essence, this is similar to https://issues.apache.org/jira/browse/LUCENE-8972 but with a more contained scope.

> The mecab-ipadic dictionary has entries which include FULL width characters,
> so this naive approach - FULL / HALF width character normalization before
> tokenizing - can break tokenization. :/

I think that's an acceptable trade-off; these entries with full-width characters don't seem to be high quality anyway ;).
[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter
[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17141009#comment-17141009 ]

Tomoko Uchida commented on LUCENE-9413:
---------------------------------------

The mecab-ipadic dictionary has entries which include FULL width characters, so this naive approach - FULL / HALF width character normalization before tokenizing - can break tokenization. :/ Maybe we could concatenate "unknown" word sequences which consist of only numbers or Latin alphabet characters, after tokenization?
{code}
$ cut -d',' -f1 mecab-ipadic-all-utf8.csv | grep 1
12月 1番 11月 1月 10月 G7プラス1 小1 高1 1つ F1 中1 110番 G1 1 ファスニング21 G10 インパクト21 アルゴテクノス21 セルヴィ21 モクネット21 U19 どさんこワイド212 西15線北 北13線 西14線北 北14線 西10号南 南1条 東11号北 東12線北 西11号北 駒場北1条通 東1線南 第1安井牧場 西10号北 東11線北 美旗町中1番 南21線西 南17線西 西10線北 岩内町第1基線 北15線 南12線西 東13線南 西13線北 西1線北 南16線西 西10線南 西16線北 西11線北 西12号北 西11線南 東10線北 北1線 東1線北 南13号 南14線西 南1線 北11線 西12線南 西14線南 南13線西 浦臼第1 西13線南 東10号北 南19線西 北1条 南11線西 平泉外12入会 東10線南 東10号南 南18線西 南15線西 東11号南 東12号北 北10線 駒場南1条通 南1番通 南10線西 北12線 西1線南 太田1の通り 東11線南 西12線北 東12線南 大泉1区南部 M40A1 F15戦闘機 DF31 F15 G1 辞林21 R12 O157 DF41 スーパー301 GP125 北13条東 M1A2 アポロ11号
{code}
[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter
[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140976#comment-17140976 ]

Tomoko Uchida commented on LUCENE-9413:
---------------------------------------

I cannot take time to work on this soon, but I wanted to file it as an issue... comments and thoughts are welcome.