[jira] [Commented] (LUCENE-8966) KoreanTokenizer should split unknown words on digits
[ https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927933#comment-16927933 ]

Namgyu Kim commented on LUCENE-8966:

Oh, thank you for your reply, [~jim.ferenczi] :D
I checked again and it was not a bug. That result comes from the Viterbi path.
But I think it needs to be discussed, so I opened a new issue about it.
I'd appreciate it if you could check LUCENE-8977.
P.S. +1 to your patch

> KoreanTokenizer should split unknown words on digits
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Jim Ferenczi
> Priority: Minor
> Attachments: LUCENE-8966.patch, LUCENE-8966.patch
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer
> groups characters of unknown words if they belong to the same script or an
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the
> rest in Latin) but this rule doesn't work well on digits since they are
> considered common with other scripts. For instance the input "44사이즈" is kept
> as is even though "사이즈" is part of the dictionary. We should restore the
> original behavior and splits any unknown words if a digit is followed by
> another type.
> This issue was first discovered in
> [https://github.com/elastic/elasticsearch/issues/46365]

--
This message was sent by Atlassian Jira (v8.3.2#803003)
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
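The grouping rule described in the quoted issue can be illustrated with a small standalone sketch. This is not the code from LUCENE-8966.patch: the class `DigitSplitSketch`, the method `split` and the flag `breakAfterDigit` are hypothetical names, and the rule is simplified to comparing adjacent code points with JDK classes only. It shows why "44사이즈" stays in one piece when digits (script COMMON) are treated as compatible with any script, and how an extra "digit followed by another type" check splits it.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not KoreanTokenizer code.
class DigitSplitSketch {

  // COMMON and INHERITED characters are treated as compatible with any script.
  static boolean compatible(UnicodeScript a, UnicodeScript b) {
    return a == b
        || a == UnicodeScript.COMMON || b == UnicodeScript.COMMON
        || a == UnicodeScript.INHERITED || b == UnicodeScript.INHERITED;
  }

  // Splits text into runs; with breakAfterDigit, a digit followed by a non-digit ends a run.
  static List<String> split(String text, boolean breakAfterDigit) {
    List<String> runs = new ArrayList<>();
    int start = 0;
    int i = 0;
    while (i < text.length()) {
      int cp = text.codePointAt(i);
      int next = i + Character.charCount(cp);
      if (next < text.length()) {
        int following = text.codePointAt(next);
        boolean scriptBreak =
            !compatible(UnicodeScript.of(cp), UnicodeScript.of(following));
        boolean digitBreak =
            breakAfterDigit && Character.isDigit(cp) && !Character.isDigit(following);
        if (scriptBreak || digitBreak) {
          runs.add(text.substring(start, next));
          start = next;
        }
      }
      i = next;
    }
    if (start < text.length()) {
      runs.add(text.substring(start));
    }
    return runs;
  }

  public static void main(String[] args) {
    System.out.println(split("44사이즈", false)); // digits grouped with Hangul
    System.out.println(split("44사이즈", true));  // digits split from Hangul
  }
}
```

The sketch only looks at pairs of neighbouring characters, so it says nothing about dictionary lookups or the Viterbi search that the real tokenizer performs.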
[jira] [Updated] (LUCENE-8977) Handle punctuation characters in KoreanTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namgyu Kim updated LUCENE-8977:
---
Description:
As we discussed on LUCENE-8966, KoreanTokenizer always divides into one and the others now when there are continuous punctuation marks.
(사이즈 => [사이즈] [.] [...])
But KoreanTokenizer doesn't divide when first character is punctuation.
(...사이즈 => [...] [사이즈])
It looks like the result from the viterbi path, but users can think weird about the following case:
("사이즈" means "size" in Korean)
||Case #1||Case #2||
|Input : "...사이즈..."|Input : "...4..4사이즈"|
|Result : [...] [사이즈] [.] [..]|Result : [...] [4] [.] [.] [4] [사이즈]|

From what I checked, Nori has a punctuation characters(like . ,) in the dictionary but Kuromoji is not.
("サイズ" means "size" in Japanese)
||Case #1||Case #2||
|Input : "...サイズ..."|Input : "...4..4サイズ"|
|Result : [...] [サイズ] [...]|Result : [...] [4] [..] [4] [サイズ]|

There are some ways to resolve it like hard-coding for punctuation but it seems not good.
So I think we need to discuss it.

was:
As we discussed on LUCENE-8966, KoreanTokenizer always divides into one and the others now when there are continuous punctuation marks.
(사이즈 => [사이즈] [.] [...])
But KoreanTokenizer doesn't divides when first character is punctuation.
(...사이즈 => [...] [사이즈])
It looks like the result from the viterbi path, but users can think weird about the following case:
("사이즈" means "size" in Korean)
||Case #1||Case #2||
|Input : "...사이즈..."|Input : "...4..4사이즈"|
|Result : [...] [사이즈] [.] [..]|Result : [...] [4] [.] [.] [4] [사이즈]|

From what I checked, Nori has a punctuation characters(like . ,) in the dictionary but Kuromoji is not.
("サイズ" means "size" in Japanese)
||Case #1||Case #2||
|Input : "...サイズ..."|Input : "...4..4サイズ"|
|Result : [...] [サイズ] [...]|Result : [...] [4] [..] [4] [サイズ]|

There are some ways to resolve it like hard-coding for punctuation but it seems not good.
So I think we need to discuss it.

> Handle punctuation characters in KoreanTokenizer
>
> Key: LUCENE-8977
> URL: https://issues.apache.org/jira/browse/LUCENE-8977
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Namgyu Kim
> Priority: Minor
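To make the Kuromoji-style output in the tables concrete, here is a minimal sketch of one possible post-processing step that merges neighbouring punctuation-only tokens. It is not a proposed fix and not how either tokenizer works internally; `PunctuationMergeSketch`, `isPunctuation` and `merge` are hypothetical names, and the sketch assumes the tokens are adjacent in the input (it ignores offsets entirely).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Nori or Kuromoji.
class PunctuationMergeSketch {

  // True if every code point of the token is in a Unicode punctuation category.
  static boolean isPunctuation(String token) {
    if (token.isEmpty()) {
      return false;
    }
    return token.codePoints().allMatch(cp -> {
      switch (Character.getType(cp)) {
        case Character.CONNECTOR_PUNCTUATION:
        case Character.DASH_PUNCTUATION:
        case Character.START_PUNCTUATION:
        case Character.END_PUNCTUATION:
        case Character.INITIAL_QUOTE_PUNCTUATION:
        case Character.FINAL_QUOTE_PUNCTUATION:
        case Character.OTHER_PUNCTUATION:
          return true;
        default:
          return false;
      }
    });
  }

  // Concatenates each punctuation token onto a directly preceding punctuation token.
  static List<String> merge(List<String> tokens) {
    List<String> merged = new ArrayList<>();
    for (String token : tokens) {
      int last = merged.size() - 1;
      if (last >= 0 && isPunctuation(token) && isPunctuation(merged.get(last))) {
        merged.set(last, merged.get(last) + token);
      } else {
        merged.add(token);
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    // Nori result for Case #2 from the table, merged into the Kuromoji-style result.
    System.out.println(merge(List.of("...", "4", ".", ".", "4", "사이즈")));
  }
}
```

A step like this is essentially the hard-coded handling that the description is wary of, so it is shown only to pin down what the "merged" output would look like.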
[jira] [Created] (LUCENE-8977) Handle punctuation characters in KoreanTokenizer
Namgyu Kim created LUCENE-8977:
--
Summary: Handle punctuation characters in KoreanTokenizer
Key: LUCENE-8977
URL: https://issues.apache.org/jira/browse/LUCENE-8977
Project: Lucene - Core
Issue Type: Bug
Reporter: Namgyu Kim

As we discussed on LUCENE-8966, KoreanTokenizer always divides into one and the others now when there are continuous punctuation marks.
(사이즈 => [사이즈] [.] [...])
But KoreanTokenizer doesn't divides when first character is punctuation.
(...사이즈 => [...] [사이즈])
It looks like the result from the viterbi path, but users can think weird about the following case:
("사이즈" means "size" in Korean)
||Case #1||Case #2||
|Input : "...사이즈..."|Input : "...4..4사이즈"|
|Result : [...] [사이즈] [.] [..]|Result : [...] [4] [.] [.] [4] [사이즈]|

From what I checked, Nori has a punctuation characters(like . ,) in the dictionary but Kuromoji is not.
("サイズ" means "size" in Japanese)
||Case #1||Case #2||
|Input : "...サイズ..."|Input : "...4..4サイズ"|
|Result : [...] [サイズ] [...]|Result : [...] [4] [..] [4] [サイズ]|

There are some ways to resolve it like hard-coding for punctuation but it seems not good.
So I think we need to discuss it.
[jira] [Commented] (LUCENE-8966) KoreanTokenizer should split unknown words on digits
[ https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924773#comment-16924773 ]

Namgyu Kim commented on LUCENE-8966:

But there is a bug I just found :(

Input : "4..4사이즈"
Expected : [4] [..] [4] [사이즈]
Actual : [4] *[.] [.]* [4] [사이즈]

{code:java}
// This test needs to pass!
public void testDuplicatePunctuation() throws IOException {
  assertAnalyzesTo(analyzerWithPunctuation, "4..4사이즈",
      new String[]{"4", "..", "4", "사이즈"},
      new int[]{0, 1, 3, 4},  // start offsets within the 7-character input
      new int[]{1, 3, 4, 7},  // end offsets
      new int[]{1, 1, 1, 1}   // position increments
  );
}
{code}

I think we need to fix it.
If it is okay to fix it within this JIRA issue, I'll post an additional patch.
Otherwise I'll create a new one.

> KoreanTokenizer should split unknown words on digits
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Jim Ferenczi
> Priority: Minor
> Attachments: LUCENE-8966.patch, LUCENE-8966.patch
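When writing an offset-based assertion like the one in this comment, the expected offset arrays can be derived mechanically from the input string and the expected tokens. The helper below is a hypothetical sketch (`OffsetSketch.offsets` is not a Lucene test utility) and assumes each expected token appears verbatim in the input, in order.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the Lucene test framework.
class OffsetSketch {

  // Returns {startOffsets, endOffsets} for tokens that occur in order in the input.
  static int[][] offsets(String input, String[] tokens) {
    int[] starts = new int[tokens.length];
    int[] ends = new int[tokens.length];
    int pos = 0;
    for (int i = 0; i < tokens.length; i++) {
      int start = input.indexOf(tokens[i], pos);
      if (start < 0) {
        throw new IllegalArgumentException("token not found in order: " + tokens[i]);
      }
      starts[i] = start;
      ends[i] = start + tokens[i].length();
      pos = ends[i];
    }
    return new int[][] {starts, ends};
  }

  public static void main(String[] args) {
    int[][] result = offsets("4..4사이즈", new String[] {"4", "..", "4", "사이즈"});
    System.out.println(Arrays.toString(result[0])); // [0, 1, 3, 4]
    System.out.println(Arrays.toString(result[1])); // [1, 3, 4, 7]
  }
}
```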
[jira] [Commented] (LUCENE-8966) KoreanTokenizer should split unknown words on digits
[ https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924768#comment-16924768 ]

Namgyu Kim commented on LUCENE-8966:

Good job! [~jim.ferenczi] :D
This can be a serious problem for Nori users.

About punctuation: as [~jim.ferenczi] said, it can be kept by setting the discardPunctuation parameter of KoreanTokenizer to false.
You can test it with the analyzerWithPunctuation instance in TestKoreanTokenizer.

> KoreanTokenizer should split unknown words on digits
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Jim Ferenczi
> Priority: Minor
> Attachments: LUCENE-8966.patch, LUCENE-8966.patch
[jira] [Created] (LUCENE-8954) Refactor Nori(Korean) Analyzer
Namgyu Kim created LUCENE-8954:
--
Summary: Refactor Nori(Korean) Analyzer
Key: LUCENE-8954
URL: https://issues.apache.org/jira/browse/LUCENE-8954
Project: Lucene - Core
Issue Type: Improvement
Reporter: Namgyu Kim
Assignee: Namgyu Kim

There is a lot of code in the Nori analyzer that can be refactored.
(whitespace, wrong type casting, unnecessary throws, C-style arrays, ...)
I think it's good to proceed if we can.
This has nothing to do with how Nori actually works.
I'll just remove unnecessary code and simplify the rest.
[jira] [Resolved] (LUCENE-8934) Move Nori DictionaryBuilder tool from src/tools to src/
[ https://issues.apache.org/jira/browse/LUCENE-8934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namgyu Kim resolved LUCENE-8934.

Resolution: Fixed
Fix Version/s: master (9.0)
               8.x

> Move Nori DictionaryBuilder tool from src/tools to src/
>
> Key: LUCENE-8934
> URL: https://issues.apache.org/jira/browse/LUCENE-8934
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Namgyu Kim
> Assignee: Namgyu Kim
> Priority: Major
> Fix For: 8.x, master (9.0)
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> After LUCENE-8904 tests in Nori tools are not running in the normal test
> ({{ant test}}).
> As with Kuromoji(before LUCENE-8871), we need to run the {{ant test-tools}}
> to test Nori's tools.
> Like Kuromoji, we can proceed with the normality test after moving the tools
> of Nori to the main source tree.

--
This message was sent by Atlassian JIRA (v7.6.14#76016)
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-8912) Remove ICU dependency of nori tools/test-tools
[ https://issues.apache.org/jira/browse/LUCENE-8912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namgyu Kim updated LUCENE-8912:
---
Fix Version/s: master (9.0)
               8.x

> Remove ICU dependency of nori tools/test-tools
>
> Key: LUCENE-8912
> URL: https://issues.apache.org/jira/browse/LUCENE-8912
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Namgyu Kim
> Assignee: Namgyu Kim
> Priority: Major
> Fix For: 8.x, master (9.0)
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> {quote}After this job, I'll apply LUCENE-8866 and LUCENE-8871 to Nori.
> {quote}
> As mentioned in LUCENE-8904, I proceed this work from now on.
> It is what [~rcmuir] found first(LUCENE-8866) and then I just apply to Nori.
> Nori doesn't need the ICU library because it uses Normalizer2 only for NFKC
> normalization like Kuromoji.
> I think it's OK to remove the library dependency because it can be handled
> by JDK.
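The point that NFKC normalization "can be handled by JDK" can be shown with a short sketch based on `java.text.Normalizer`. This is not the change made for LUCENE-8912; the class `NfkcSketch` and its `nfkc` helper are hypothetical names used only to show the JDK call.

```java
import java.text.Normalizer;

// Hypothetical illustration of NFKC normalization with the JDK alone (no ICU).
class NfkcSketch {

  static String nfkc(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFKC);
  }

  public static void main(String[] args) {
    // Fullwidth digits are folded to ASCII digits under NFKC.
    System.out.println(nfkc("\uFF14\uFF14"));
    // The "fi" ligature (U+FB01) is expanded to two letters.
    System.out.println(nfkc("\uFB01"));
  }
}
```

One difference worth keeping in mind is that the JDK and ICU may track different Unicode versions, so results for recently added characters are not guaranteed to be identical.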
[jira] [Resolved] (LUCENE-8912) Remove ICU dependency of nori tools/test-tools
[ https://issues.apache.org/jira/browse/LUCENE-8912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namgyu Kim resolved LUCENE-8912.

Resolution: Fixed

> Remove ICU dependency of nori tools/test-tools
>
> Key: LUCENE-8912
> URL: https://issues.apache.org/jira/browse/LUCENE-8912
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Namgyu Kim
> Assignee: Namgyu Kim
> Priority: Major
> Fix For: 8.x, master (9.0)
> Time Spent: 0.5h
> Remaining Estimate: 0h
[jira] [Resolved] (LUCENE-8904) Enhance Nori DictionaryBuilder tool
[ https://issues.apache.org/jira/browse/LUCENE-8904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namgyu Kim resolved LUCENE-8904.

Resolution: Fixed

> Enhance Nori DictionaryBuilder tool
>
> Key: LUCENE-8904
> URL: https://issues.apache.org/jira/browse/LUCENE-8904
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Namgyu Kim
> Assignee: Namgyu Kim
> Priority: Major
> Fix For: 8.x, master (9.0)
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> It is the Nori version of [~sokolov]'s LUCENE-8863.
> This patch has two changes.
> 1) Improve exception handling
> 2) Enable external dictionary for testing
> Overall, it is the same as LUCENE-8863.
> But there are some differences between Nori and Kuromoji.
> These can be slightly different on the code.
> 1) CSV field size
> Nori : 12
> Kuromoji : 13
> 2) left context ID == right context ID
> Nori : can be different
> Kuromoji : always same
> 3) Dictionary Type
> Nori : just one type
> Kuromoji : IPADIC, UNIDIC
> After this job, I'll apply LUCENE-8866 and LUCENE-8871 to Nori.
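The kind of exception handling and field-count check mentioned in the description can be sketched as follows. This is not the DictionaryBuilder code: `CsvEntrySketch`, `parseEntry` and `sameContextIds` are hypothetical names, the sample entry is made up, the positions of the left and right context IDs (columns 1 and 2) are an assumption, and the naive comma split ignores quoted fields.

```java
// Hypothetical illustration only; not the Nori DictionaryBuilder.
class CsvEntrySketch {

  static final int NORI_FIELD_COUNT = 12;     // per the description above
  static final int KUROMOJI_FIELD_COUNT = 13; // per the description above

  // Splits one CSV line and fails with a descriptive message on a wrong field count.
  static String[] parseEntry(String line, int expectedFields) {
    String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
    if (fields.length != expectedFields) {
      throw new IllegalArgumentException(
          "Invalid CSV entry: expected " + expectedFields
              + " fields but found " + fields.length + ": " + line);
    }
    return fields;
  }

  // Assumes columns 1 and 2 hold the left and right context IDs.
  static boolean sameContextIds(String[] fields) {
    return Integer.parseInt(fields[1]) == Integer.parseInt(fields[2]);
  }

  public static void main(String[] args) {
    // Made-up 12-field entry with different left and right context IDs.
    String[] entry = parseEntry("사이즈,1,2,100,NNG,*,F,사이즈,*,*,*,*", NORI_FIELD_COUNT);
    System.out.println(entry.length + " fields, same context IDs: " + sameContextIds(entry));
  }
}
```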
[jira] [Updated] (LUCENE-8904) Enhance Nori DictionaryBuilder tool
[ https://issues.apache.org/jira/browse/LUCENE-8904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namgyu Kim updated LUCENE-8904:
---
Fix Version/s: master (9.0)
               8.x

> Enhance Nori DictionaryBuilder tool
>
> Key: LUCENE-8904
> URL: https://issues.apache.org/jira/browse/LUCENE-8904
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Namgyu Kim
> Assignee: Namgyu Kim
> Priority: Major
> Fix For: 8.x, master (9.0)
> Time Spent: 1h 50m
> Remaining Estimate: 0h
[jira] [Commented] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets
[ https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892876#comment-16892876 ]

Namgyu Kim commented on LUCENE-8933:

Great analysis! :D
I checked KoreanTokenizer with the following code, and it does not have this issue.
{code:java}
@Test
public void test() throws IOException {
  UserDictionary dict = UserDictionary.open(new StringReader("aaa,,,"));
  KoreanTokenizer tok = new KoreanTokenizer(KoreanTokenizer.DEFAULT_TOKEN_ATTRIBUTE_FACTORY, dict, DecompoundMode.NONE, true);
  tok.setReader(new StringReader("aaa"));
  tok.reset();
  tok.incrementToken();
}
{code}

> JapaneseTokenizer creates Token objects with corrupt offsets
>
> Key: LUCENE-8933
> URL: https://issues.apache.org/jira/browse/LUCENE-8933
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Adrien Grand
> Priority: Minor
>
> An Elasticsearch user reported the following stack trace when parsing
> synonyms. It looks like the only reason why this might occur is if the offset
> of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>     at org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
>     at org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486) ~[?:?]
>     at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
>     at org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57) ~[elasticsearch-6.6.1.jar:6.6.1]
>     at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
>     at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
>     at org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154) ~[elasticsearch-6.6.1.jar:6.6.1]
>     ... 24 more
> {noformat}
[jira] [Commented] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets
[ https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892041#comment-16892041 ]

Namgyu Kim commented on LUCENE-8933:

Elasticsearch Issue Link : [https://github.com/elastic/elasticsearch/issues/44243]

> JapaneseTokenizer creates Token objects with corrupt offsets
>
> Key: LUCENE-8933
> URL: https://issues.apache.org/jira/browse/LUCENE-8933
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Adrien Grand
> Priority: Minor
[jira] [Created] (LUCENE-8934) Move Nori DictionaryBuilder tool from src/tools to src/
Namgyu Kim created LUCENE-8934:
--
Summary: Move Nori DictionaryBuilder tool from src/tools to src/
Key: LUCENE-8934
URL: https://issues.apache.org/jira/browse/LUCENE-8934
Project: Lucene - Core
Issue Type: Improvement
Reporter: Namgyu Kim
Assignee: Namgyu Kim

After LUCENE-8904 tests in Nori tools are not running in the normal test ({{ant test}}).
As with Kuromoji(before LUCENE-8871), we need to run the {{ant test-tools}} to test Nori's tools.
Like Kuromoji, we can proceed with the normality test after moving the tools of Nori to the main source tree.
[jira] [Updated] (LUCENE-8912) Remove ICU dependency of nori tools/test-tools
[ https://issues.apache.org/jira/browse/LUCENE-8912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namgyu Kim updated LUCENE-8912:
---
Description:
{quote}After this job, I'll apply LUCENE-8866 and LUCENE-8871 to Nori.
{quote}
As mentioned in LUCENE-8904, I proceed this work from now on.
It is what [~rcmuir] found first(LUCENE-8866) and then I just apply to Nori.
Nori doesn't need the ICU library because it uses Normalizer2 only for NFKC normalization like Kuromoji.
I think it's OK to remove the library dependency because it can be handled by JDK.

was:
{quote}After this job, I'll apply LUCENE-8866 and LUCENE-8871 to Nori.
{quote}
As mentioned in LUCENE-8904, I proceed this work from now on.
It is what [~rcmuir] found first and then I just apply to Nori.
Nori doesn't need the ICU library because it uses Normalizer2 only for NFKC normalization like Kuromoji.
I think it's OK to remove the library dependency because it can be handled by JDK.

> Remove ICU dependency of nori tools/test-tools
>
> Key: LUCENE-8912
> URL: https://issues.apache.org/jira/browse/LUCENE-8912
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Namgyu Kim
> Assignee: Namgyu Kim
> Priority: Major
[jira] [Created] (LUCENE-8912) Remove ICU dependency of nori tools/test-tools
Namgyu Kim created LUCENE-8912:
--
Summary: Remove ICU dependency of nori tools/test-tools
Key: LUCENE-8912
URL: https://issues.apache.org/jira/browse/LUCENE-8912
Project: Lucene - Core
Issue Type: Improvement
Reporter: Namgyu Kim
Assignee: Namgyu Kim

{quote}After this job, I'll apply LUCENE-8866 and LUCENE-8871 to Nori.
{quote}
As mentioned in LUCENE-8904, I proceed this work from now on.
It is what [~rcmuir] found first and then I just apply to Nori.
Nori doesn't need the ICU library because it uses Normalizer2 only for NFKC normalization like Kuromoji.
I think it's OK to remove the library dependency because it can be handled by JDK.
[jira] [Assigned] (LUCENE-8904) Enhance Nori DictionaryBuilder tool
[ https://issues.apache.org/jira/browse/LUCENE-8904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namgyu Kim reassigned LUCENE-8904:
--
Assignee: Namgyu Kim

> Enhance Nori DictionaryBuilder tool
>
> Key: LUCENE-8904
> URL: https://issues.apache.org/jira/browse/LUCENE-8904
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Namgyu Kim
> Assignee: Namgyu Kim
> Priority: Major
> Time Spent: 1h 50m
> Remaining Estimate: 0h
[jira] [Commented] (LUCENE-8900) Simplify MultiSorter
[ https://issues.apache.org/jira/browse/LUCENE-8900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881302#comment-16881302 ]

Namgyu Kim commented on LUCENE-8900:

You're welcome! [~jpountz].

> Simplify MultiSorter
>
> Key: LUCENE-8900
> URL: https://issues.apache.org/jira/browse/LUCENE-8900
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Fix For: 8.2
> Attachments: LUCENE-8900.patch

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-8904) Enhance Nori DictionaryBuilder tool
Namgyu Kim created LUCENE-8904:
--
Summary: Enhance Nori DictionaryBuilder tool
Key: LUCENE-8904
URL: https://issues.apache.org/jira/browse/LUCENE-8904
Project: Lucene - Core
Issue Type: Improvement
Reporter: Namgyu Kim

It is the Nori version of [~sokolov]'s LUCENE-8863.
This patch has two changes.
1) Improve exception handling
2) Enable external dictionary for testing

Overall, it is the same as LUCENE-8863.
But there are some differences between Nori and Kuromoji.
These can be slightly different on the code.
1) CSV field size
Nori : 12
Kuromoji : 13
2) left context ID == right context ID
Nori : can be different
Kuromoji : always same
3) Dictionary Type
Nori : just one type
Kuromoji : IPADIC, UNIDIC

After this job, I'll apply LUCENE-8866 and LUCENE-8871 to Nori.
[jira] [Commented] (LUCENE-8900) Simplify MultiSorter
[ https://issues.apache.org/jira/browse/LUCENE-8900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877924#comment-16877924 ]

Namgyu Kim commented on LUCENE-8900:

Oh, about suggestion 2, I misread the code.
It can cause a ClassCastException :(
Sorry for the confusion, and thank you for taking suggestion 1.

> Simplify MultiSorter
>
> Key: LUCENE-8900
> URL: https://issues.apache.org/jira/browse/LUCENE-8900
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-8900.patch
[jira] [Commented] (LUCENE-8900) Simplify MultiSorter
[ https://issues.apache.org/jira/browse/LUCENE-8900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877180#comment-16877180 ]

Namgyu Kim commented on LUCENE-8900:

+1
This patch looks good. [~jpountz] :D
I have some opinions about your patch.

1. In lessThan method in PriorityQueue, we can reduce the computation. (very minor difference)
Before
{code:java}
public boolean lessThan(LeafAndDocID a, LeafAndDocID b) { for(int i=0;i

> Simplify MultiSorter
>
> Key: LUCENE-8900
> URL: https://issues.apache.org/jira/browse/LUCENE-8900
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-8900.patch
[jira] [Commented] (LUCENE-8870) Support numeric value in Field class
[ https://issues.apache.org/jira/browse/LUCENE-8870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16873528#comment-16873528 ]

Namgyu Kim commented on LUCENE-8870:

Thank you for your reply! [~jpountz] :D
{quote}My gut feeling is that trying to fold everything into a single field would make things more complicated rather than simpler.
{quote}
Yeah, I partly agree.
I thought it would be nice to give users more options.
But if you think it could confuse users, I think keeping the current state is better.
{quote}About the TestField class, I think its class structure needs to be changed slightly.
It is no direct connection with this issue. But I plan to modify it like TestIntPoint, TestDoublePoint, ...
Should not the test class name depend on the class name?
{quote}
By the way, what do you think about my first comment on this issue?
It may look a little ambiguous, but I'm curious about your opinion.

> Support numeric value in Field class
>
> Key: LUCENE-8870
> URL: https://issues.apache.org/jira/browse/LUCENE-8870
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Namgyu Kim
> Priority: Major
> Attachments: LUCENE-8870.patch
>
> I checked the following comment in Field class.
> {code:java}
> // TODO: allow direct construction of int, long, float, double value too..?
> {code}
> We already have some fields like IntPoint and StoredField, but I think it's
> okay.
> The test cases are set in the TestField class.
[jira] [Commented] (LUCENE-8870) Support numeric value in Field class
[ https://issues.apache.org/jira/browse/LUCENE-8870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872601#comment-16872601 ] Namgyu Kim commented on LUCENE-8870: Thank you for your reply! [~jpountz] :D When I wrote the patch, the biggest advantage I had in mind was FieldType customization for numeric types. Of course, it is not the recommended way (the Javadoc already marks it as Expert), but it lets users customize the FieldType. For example, NumericDocValuesField currently has no option for storing the value, so users need to add a separate StoredField. With this patch, the user can get the characteristics of NumericDocValuesField and StoredField in a single field. {code:java} FieldType type = new FieldType(); type.setStored(true); type.setDocValuesType(DocValuesType.NUMERIC); type.freeze(); Document doc = new Document(); Field field = new Field("number", 1234, type); doc.add(field); indexWriter.addDocument(doc); {code} After that, we can use methods such as {code:java} Sort sort = new Sort(); sort.setSort(new SortField("number", SortField.Type.INT)); {code} and {code:java} doc.get("number"); {code} on the "number" field. > Support numeric value in Field class > > > Key: LUCENE-8870 > URL: https://issues.apache.org/jira/browse/LUCENE-8870 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Namgyu Kim >Priority: Major > Attachments: LUCENE-8870.patch > > > I checked the following comment in Field class. > {code:java} > // TODO: allow direct construction of int, long, float, double value too..? > {code} > We already have some fields like IntPoint and StoredField, but I think it's > okay. > The test cases are set in the TestField class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8870) Support numeric value in Field class
[ https://issues.apache.org/jira/browse/LUCENE-8870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870613#comment-16870613 ] Namgyu Kim commented on LUCENE-8870: Thank you for your reply! [~sokolov] :D I want to make sure that I understand your opinion correctly. {quote}in the end we store the value as an Object and then later cast it. Callers that also handle values generically, as Objects then need an adapter to detect the type of a value, cast it properly, only to have Lucene throw away all the type info and do that dance all over again internally! {quote} I think this structure exists to provide users with a variety of constructors. (Reader, CharSequence, BytesRef, ...) Of course, we could accept only an Object and handle the type inside the constructor, but wouldn't that be unfriendly to API users? I'm also not sure about it because the IndexableFieldType check logic differs depending on the value type. For example: BytesRef -> there is no check logic. CharSequence -> (!IndexableFieldType#stored() && IndexableFieldType#indexOptions() == IndexOptions.NONE) should be false. Reader -> (IndexableFieldType#indexOptions() == IndexOptions.NONE || !IndexableFieldType#tokenized()) and (IndexableFieldType#stored()) should be false. In fact, I was hesitant about using the Number class when writing this patch. I think API users may prefer int, float, double, ... rather than the Number class. What do you think about this? {quote}Maybe use Objects.requireNonNull for the null checks? {quote} I followed the current code structure. Objects.requireNonNull throws a NullPointerException, whereas the current code throws an IllegalArgumentException. > Support numeric value in Field class > > > Key: LUCENE-8870 > URL: https://issues.apache.org/jira/browse/LUCENE-8870 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Namgyu Kim >Priority: Major > Attachments: LUCENE-8870.patch > > > I checked the following comment in Field class. 
> {code:java} > // TODO: allow direct construction of int, long, float, double value too..? > {code} > We already have some fields like IntPoint and StoredField, but I think it's > okay. > The test cases are set in the TestField class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
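The difference between the two null-check styles mentioned in the comment above can be seen in a small standalone sketch. This is only an illustration: `NullCheckSketch` and its methods are hypothetical names and are not taken from Lucene's Field class or from the patch.

```java
import java.util.Objects;

// Hypothetical stand-in for a constructor argument check; not Lucene code.
class NullCheckSketch {

    // Explicit check that throws IllegalArgumentException,
    // the style the comment attributes to the current code.
    static String checkExplicit(String name) {
        if (name == null) {
            throw new IllegalArgumentException("name must not be null");
        }
        return name;
    }

    // Objects.requireNonNull throws NullPointerException instead.
    static String checkRequireNonNull(String name) {
        return Objects.requireNonNull(name, "name must not be null");
    }

    // Returns the simple class name of the exception raised for a null argument.
    static String thrownBy(boolean useRequireNonNull) {
        try {
            if (useRequireNonNull) {
                checkRequireNonNull(null);
            } else {
                checkExplicit(null);
            }
            return "nothing";
        } catch (RuntimeException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(thrownBy(false)); // IllegalArgumentException
        System.out.println(thrownBy(true));  // NullPointerException
    }
}
```

Switching styles would therefore change the exception type that callers and existing tests observe, which seems to be the compatibility concern raised in the comment.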
[jira] [Commented] (LUCENE-8870) Support numeric value in Field class
[ https://issues.apache.org/jira/browse/LUCENE-8870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867819#comment-16867819 ] Namgyu Kim commented on LUCENE-8870: About the TestField class, I think its class structure needs to be changed slightly. This has no direct connection with this issue, but I plan to restructure it along the lines of TestIntPoint, TestDoublePoint, ... Shouldn't the test class name follow the name of the class under test? What do you think about it? > Support numeric value in Field class > > > Key: LUCENE-8870 > URL: https://issues.apache.org/jira/browse/LUCENE-8870 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Namgyu Kim >Priority: Major > Attachments: LUCENE-8870.patch > > > I checked the following comment in Field class. > {code:java} > // TODO: allow direct construction of int, long, float, double value too..? > {code} > We already have some fields like IntPoint and StoredField, but I think it's > okay. > The test cases are set in the TestField class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-8870) Support numeric value in Field class
[ https://issues.apache.org/jira/browse/LUCENE-8870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Namgyu Kim updated LUCENE-8870: --- Attachment: LUCENE-8870.patch > Support numeric value in Field class > > > Key: LUCENE-8870 > URL: https://issues.apache.org/jira/browse/LUCENE-8870 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Namgyu Kim >Priority: Major > Attachments: LUCENE-8870.patch > > > I checked the following comment in Field class. > {code:java} > // TODO: allow direct construction of int, long, float, double value too..? > {code} > We already have some fields like IntPoint and StoredField, but I think it's > okay. > The test cases are set in the TestField class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-8870) Support numeric value in Field class
Namgyu Kim created LUCENE-8870: -- Summary: Support numeric value in Field class Key: LUCENE-8870 URL: https://issues.apache.org/jira/browse/LUCENE-8870 Project: Lucene - Core Issue Type: New Feature Reporter: Namgyu Kim Attachments: LUCENE-8870.patch I checked the following comment in Field class. {code:java} // TODO: allow direct construction of int, long, float, double value too..? {code} We already have some fields like IntPoint and StoredField, but I think it's okay. The test cases are set in the TestField class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Namgyu Kim resolved LUCENE-8812. Resolution: Fixed > add KoreanNumberFilter to Nori(Korean) Analyzer > --- > > Key: LUCENE-8812 > URL: https://issues.apache.org/jira/browse/LUCENE-8812 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Namgyu Kim >Assignee: Namgyu Kim >Priority: Major > Fix For: master (9.0), 8.2 > > Attachments: LUCENE-8812.patch > > > This is a follow-up issue to LUCENE-8784. > The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to > regular Arabic decimal numbers in half-width characters. > Logic is similar to JapaneseNumberFilter. > It should be able to cover the following test cases. > 1) Korean Word to Number > 십만이천오백 => 102500 > 2) 1 character conversion > 일영영영 => 1000 > 3) Decimal Point Calculation > 3.2천 => 3200 > 4) Comma between three digits > 4,647.0010 => 4647.001 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
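The four test cases listed in the description can be illustrated with a small standalone sketch. It is not the KoreanNumberFilter implementation and does not use the Lucene TokenFilter API; `KoreanNumberSketch` and `normalize` are hypothetical names, and the parsing rules (a digit run followed by 십/백/천, grouped by 만/억/조) are an assumption about how such a normalization could work.

```java
import java.math.BigDecimal;

// Illustrative sketch only; this is not the KoreanNumberFilter implementation.
class KoreanNumberSketch {

    // Korean digit characters; the index of each character is its numeric value.
    private static final String KOREAN_DIGITS = "영일이삼사오육칠팔구";

    // Units that multiply the digits directly before them (0 = not such a unit).
    private static long smallUnit(char c) {
        switch (c) {
            case '십': return 10L;
            case '백': return 100L;
            case '천': return 1_000L;
            default: return 0L;
        }
    }

    // Units that multiply the whole group accumulated so far (0 = not such a unit).
    private static long largeUnit(char c) {
        switch (c) {
            case '만': return 10_000L;
            case '억': return 100_000_000L;
            case '조': return 1_000_000_000_000L;
            default: return 0L;
        }
    }

    static String normalize(String input) {
        BigDecimal total = BigDecimal.ZERO;          // finished large-unit groups
        BigDecimal group = BigDecimal.ZERO;          // value of the current group
        StringBuilder pending = new StringBuilder(); // digits not yet consumed
        for (char c : input.toCharArray()) {
            int digit = KOREAN_DIGITS.indexOf(c);
            if (digit >= 0) {
                pending.append(digit);               // Korean digit -> Arabic digit
            } else if ((c >= '0' && c <= '9') || c == '.') {
                pending.append(c);
            } else if (c == ',') {
                // thousands separator: ignored
            } else if (smallUnit(c) != 0L) {
                BigDecimal coefficient = pending.length() == 0
                        ? BigDecimal.ONE
                        : new BigDecimal(pending.toString());
                group = group.add(coefficient.multiply(BigDecimal.valueOf(smallUnit(c))));
                pending.setLength(0);
            } else if (largeUnit(c) != 0L) {
                if (pending.length() > 0) {
                    group = group.add(new BigDecimal(pending.toString()));
                    pending.setLength(0);
                }
                if (group.signum() == 0) {
                    group = BigDecimal.ONE;          // a bare unit such as "만" means 10000
                }
                total = total.add(group.multiply(BigDecimal.valueOf(largeUnit(c))));
                group = BigDecimal.ZERO;
            } else {
                throw new IllegalArgumentException("not part of a number: " + c);
            }
        }
        if (pending.length() > 0) {
            group = group.add(new BigDecimal(pending.toString()));
        }
        // Drop trailing zeros after the decimal point and avoid scientific notation.
        return total.add(group).stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("십만이천오백")); // 102500
        System.out.println(normalize("일영영영"));     // 1000
        System.out.println(normalize("3.2천"));        // 3200
        System.out.println(normalize("4,647.0010"));   // 4647.001
    }
}
```

BigDecimal is used here so that inputs such as 3.2천 and 4,647.0010 are handled without floating-point rounding; a real filter would additionally have to deal with token boundaries and offsets, which this sketch ignores.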
[jira] [Commented] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862270#comment-16862270 ] Namgyu Kim commented on LUCENE-8812: You're welcome, [~jim.ferenczi]! I checked that the Jenkins build completed successfully. I'll resolve this issue. > add KoreanNumberFilter to Nori(Korean) Analyzer > --- > > Key: LUCENE-8812 > URL: https://issues.apache.org/jira/browse/LUCENE-8812 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Namgyu Kim >Assignee: Namgyu Kim >Priority: Major > Fix For: master (9.0), 8.2 > > Attachments: LUCENE-8812.patch > > > This is a follow-up issue to LUCENE-8784. > The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to > regular Arabic decimal numbers in half-width characters. > Logic is similar to JapaneseNumberFilter. > It should be able to cover the following test cases. > 1) Korean Word to Number > 십만이천오백 => 102500 > 2) 1 character conversion > 일영영영 => 1000 > 3) Decimal Point Calculation > 3.2천 => 3200 > 4) Comma between three digits > 4,647.0010 => 4647.001 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8817) Combine Nori and Kuromoji DictionaryBuilder
[ https://issues.apache.org/jira/browse/LUCENE-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860162#comment-16860162 ] Namgyu Kim commented on LUCENE-8817: Oh. I read it wrong. Please ignore that part. Thank you for checking. [~tomoko] > Combine Nori and Kuromoji DictionaryBuilder > --- > > Key: LUCENE-8817 > URL: https://issues.apache.org/jira/browse/LUCENE-8817 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Namgyu Kim >Priority: Major > > This issue is related to LUCENE-8816. > Currently Nori and Kuromoji Analyzer use the same dictionary structure. > (MeCab) > If we make combine DictionaryBuilder, we can reduce the code size. > But this task may have a dependency on the language. > (like HEADER string in BinaryDictionary and CharacterDefinition, methods in > BinaryDictionaryWriter, ...) > On the other hand, there are many overlapped classes. > The purpose of this patch is to provide users of Nori and Kuromoji with the > same system dictionary generator. > It may take some time because there is a little workload. > The work will be based on the latest master, and if the LUCENE-8816 is > finished first, I will pull the latest code and proceed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8817) Combine Nori and Kuromoji DictionaryBuilder
[ https://issues.apache.org/jira/browse/LUCENE-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860128#comment-16860128 ] Namgyu Kim commented on LUCENE-8817: Thank you for your replies. [~tomoko] and [~cm] :D I was impressed by how deeply you have thought about this. {code:java} analysis └── ??? ├── common (module: analyzers-???-common) │ ├── build.xml │ └── src ├── kuromoji (module: analyzers-???-kuromoji) │ ├── build.xml │ └── src ├── nori (module: analyzers-???-nori) │ ├── build.xml │ └── src └── tools (module: analyzers-???-tools) ├── build.xml └── src {code} I agree with the module structure proposed by Tomoko. In my personal opinion, "analysis" is better than "analyzers". {quote}In terms of naming, what about using "statistical" instead of "mecab" for this class of analyzers? I'm thinking "Viterbi" could be good to refer to in shared tokenizer code. This said, I think it could be a good to refer to "mecab" in the dictionary compiler code, documentation, etc. to make sure users understand that we can read this model format. Any thoughts? {quote} About the name, the folder name "viterbi" looks much better than "statistical". But to be perfectly honest, I'm not sure that it's really right to use the algorithm name as the folder name. Most users probably don't know what Viterbi is. The folder name is also tied to the package name, and "org.apache.lucene.analysis.viterbi.ja" or "~.viterbi.ko" will confuse users. Alternatively, simply using "org.apache.lucene.analysis.ja" would be fine, because analysis-common already does the same. (not org.apache.lucene.common.cjk) It doesn't matter if we use it for administrative purposes, but I also want to hear some opinions from others. {quote}how about using "kuromoji" in the top level module name for both of Japanese and Korean analyzers, and changing current module names "kuromoji" and "nori" to "kuromoji-ja" and "kuromoij-ko"? {quote} I personally don't agree with using kuromoji-ko instead of nori. 
nori is already a familiar name to users. They may be confused about it. > Combine Nori and Kuromoji DictionaryBuilder > --- > > Key: LUCENE-8817 > URL: https://issues.apache.org/jira/browse/LUCENE-8817 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Namgyu Kim >Priority: Major > > This issue is related to LUCENE-8816. > Currently Nori and Kuromoji Analyzer use the same dictionary structure. > (MeCab) > If we make combine DictionaryBuilder, we can reduce the code size. > But this task may have a dependency on the language. > (like HEADER string in BinaryDictionary and CharacterDefinition, methods in > BinaryDictionaryWriter, ...) > On the other hand, there are many overlapped classes. > The purpose of this patch is to provide users of Nori and Kuromoji with the > same system dictionary generator. > It may take some time because there is a little workload. > The work will be based on the latest master, and if the LUCENE-8816 is > finished first, I will pull the latest code and proceed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859507#comment-16859507 ] Namgyu Kim commented on LUCENE-8812: I reopened this issue because an error occurred on the branch_8x branch. ([https://jenkins.thetaphi.de/job/Lucene-Solr-8.x-Windows/298/]) After the error was found, I pushed a commit to fix it. The cause was that I used the Java 9 try-with-resources style when writing the test (TestKoreanNumberFilter), and it failed on branch_8x because that branch is based on Java 8. So I disabled it, on the master branch as well, and I will rework it when version 9.0 is officially released. I also made a slight mistake: I did not mention the issue number when committing to the branch_8x branch. I wanted to change the commit message, but that was not possible because it is a protected branch (force push is blocked), so I had to revert. Both problems (the error and the wrong commit message) are my fault, and I will be more careful. I will resolve this issue when the Jenkins build completes. > add KoreanNumberFilter to Nori(Korean) Analyzer > --- > > Key: LUCENE-8812 > URL: https://issues.apache.org/jira/browse/LUCENE-8812 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Namgyu Kim >Assignee: Namgyu Kim >Priority: Major > Fix For: master (9.0), 8.2 > > Attachments: LUCENE-8812.patch > > > This is a follow-up issue to LUCENE-8784. > The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to > regular Arabic decimal numbers in half-width characters. > Logic is similar to JapaneseNumberFilter. > It should be able to cover the following test cases. 
> 1) Korean Word to Number > 십만이천오백 => 102500 > 2) 1 character conversion > 일영영영 => 1000 > 3) Decimal Point Calculation > 3.2천 => 3200 > 4) Comma between three digits > 4,647.0010 => 4647.001 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
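The Java 8 incompatibility described in the comment above can be sketched as follows. The actual test code is not shown in this thread, so this is only an assumed illustration of the language difference: Java 9 allows an existing effectively final variable to be used as a try-with-resources resource, while Java 8 requires the resource to be declared in the try header. `TryWithResourcesSketch` and `firstLine` are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Illustration of the language difference only; not the TestKoreanNumberFilter code.
class TryWithResourcesSketch {

    // Java 8 compatible form: the resource is declared inside the try header.
    static String firstLine(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Java 9+ form, shown as a comment because it does not compile on Java 8:
    //
    //   BufferedReader reader = new BufferedReader(new StringReader(text));
    //   try (reader) {
    //       return reader.readLine();
    //   }

    public static void main(String[] args) {
        System.out.println(firstLine("first\nsecond")); // first
    }
}
```

Code written in the first form compiles on both branches, which is presumably why reverting to it avoids the branch_8x failure.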
[jira] [Reopened] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Namgyu Kim reopened LUCENE-8812: > add KoreanNumberFilter to Nori(Korean) Analyzer > --- > > Key: LUCENE-8812 > URL: https://issues.apache.org/jira/browse/LUCENE-8812 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Namgyu Kim >Assignee: Namgyu Kim >Priority: Major > Fix For: master (9.0), 8.2 > > Attachments: LUCENE-8812.patch > > > This is a follow-up issue to LUCENE-8784. > The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to > regular Arabic decimal numbers in half-width characters. > Logic is similar to JapaneseNumberFilter. > It should be able to cover the following test cases. > 1) Korean Word to Number > 십만이천오백 => 102500 > 2) 1 character conversion > 일영영영 => 1000 > 3) Decimal Point Calculation > 3.2천 => 3200 > 4) Comma between three digits > 4,647.0010 => 4647.001 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Namgyu Kim resolved LUCENE-8812. Resolution: Fixed > add KoreanNumberFilter to Nori(Korean) Analyzer > --- > > Key: LUCENE-8812 > URL: https://issues.apache.org/jira/browse/LUCENE-8812 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Namgyu Kim >Assignee: Namgyu Kim >Priority: Major > Fix For: master (9.0), 8.2 > > Attachments: LUCENE-8812.patch > > > This is a follow-up issue to LUCENE-8784. > The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to > regular Arabic decimal numbers in half-width characters. > Logic is similar to JapaneseNumberFilter. > It should be able to cover the following test cases. > 1) Korean Word to Number > 십만이천오백 => 102500 > 2) 1 character conversion > 일영영영 => 1000 > 3) Decimal Point Calculation > 3.2천 => 3200 > 4) Comma between three digits > 4,647.0010 => 4647.001 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859470#comment-16859470 ] Namgyu Kim commented on LUCENE-8812: Hi [~jim.ferenczi] :D I pushed my commit to the *branch_8x* and *master* branches. I checked, and the changes appear to have been applied correctly. So I'll resolve this issue. Let me know if there are any problems. > add KoreanNumberFilter to Nori(Korean) Analyzer > --- > > Key: LUCENE-8812 > URL: https://issues.apache.org/jira/browse/LUCENE-8812 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Namgyu Kim >Assignee: Namgyu Kim >Priority: Major > Fix For: master (9.0), 8.2 > > Attachments: LUCENE-8812.patch > > > This is a follow-up issue to LUCENE-8784. > The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to > regular Arabic decimal numbers in half-width characters. > Logic is similar to JapaneseNumberFilter. > It should be able to cover the following test cases. > 1) Korean Word to Number > 십만이천오백 => 102500 > 2) 1 character conversion > 일영영영 => 1000 > 3) Decimal Point Calculation > 3.2천 => 3200 > 4) Comma between three digits > 4,647.0010 => 4647.001 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Namgyu Kim updated LUCENE-8812: --- Fix Version/s: 8.2 master (9.0) > add KoreanNumberFilter to Nori(Korean) Analyzer > --- > > Key: LUCENE-8812 > URL: https://issues.apache.org/jira/browse/LUCENE-8812 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Namgyu Kim >Assignee: Namgyu Kim >Priority: Major > Fix For: master (9.0), 8.2 > > Attachments: LUCENE-8812.patch > > > This is a follow-up issue to LUCENE-8784. > The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to > regular Arabic decimal numbers in half-width characters. > Logic is similar to JapaneseNumberFilter. > It should be able to cover the following test cases. > 1) Korean Word to Number > 십만이천오백 => 102500 > 2) 1 character conversion > 일영영영 => 1000 > 3) Decimal Point Calculation > 3.2천 => 3200 > 4) Comma between three digits > 4,647.0010 => 4647.001 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Namgyu Kim reassigned LUCENE-8812: -- Assignee: Namgyu Kim > add KoreanNumberFilter to Nori(Korean) Analyzer > --- > > Key: LUCENE-8812 > URL: https://issues.apache.org/jira/browse/LUCENE-8812 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Namgyu Kim >Assignee: Namgyu Kim >Priority: Major > Attachments: LUCENE-8812.patch > > > This is a follow-up issue to LUCENE-8784. > The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to > regular Arabic decimal numbers in half-width characters. > Logic is similar to JapaneseNumberFilter. > It should be able to cover the following test cases. > 1) Korean Word to Number > 십만이천오백 => 102500 > 2) 1 character conversion > 일영영영 => 1000 > 3) Decimal Point Calculation > 3.2천 => 3200 > 4) Comma between three digits > 4,647.0010 => 4647.001 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-8817) Combine Nori and Kuromoji DictionaryBuilder
[ https://issues.apache.org/jira/browse/LUCENE-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858895#comment-16858895 ] Namgyu Kim edited comment on LUCENE-8817 at 6/7/19 6:40 PM: I share the current status. The merging is almost over and I need some discussion. I thought several structures. 1. Save in tools of analysis-common module. It is simple, but I think MeCab is difficult to see as a feature of analysis-common. 2. Create tools folder in analysis and set mecab-tools module in there. analysis/tools ─ analysis-common-tools (to-be) └ icu-tools (to-be) └ mecab-tools └ ... The problem with this is that the number of modules increases a lot because each tool is created as a module. 3. Create a module called mecab we can create a mecab module that is the starting point for merging nori and kuromoji. If we proceed in this direction, we will only have tools in src. But this approach may not be easy to create the runnable jar. Because it will include the library. (ex: MecabAnalyzer, MecabTokenizer, ...) 4. Create a module called mecab-tools It's easy to develop, but there are other library modules in analysis. So something seems strange because it's only runnable-jar. Number 2 seems to be the best, but I'm not sure yet. I would appreciate any comments. I will go ahead if direction is set, but landing will be delayed a little. The reason is that the build system is going to change. (SOLR-13452) But if it does not matter, I will proceed. was (Author: danmuzi): I share the current status. The merge is almost over and I need some discussion. I thought several structures. 1. Save in tools of analysis-common module. It is simple, but I think MeCab is difficult to see as a feature of analysis-common. 2. Create tools folder in analysis and set mecab-tools module in there. analysis/tools ─ analysis-common-tools (to-be) └ icu-tools (to-be) └ mecab-tools └ ... 
The problem with this is that the number of modules increases a lot because each tool is created as a module. 3. Create a module called mecab we can create a mecab module that is the starting point for merging nori and kuromoji. If we proceed in this direction, we will only have tools in src. But this approach may not be easy to create the runnable jar. Because it will include the library. (ex: MecabAnalyzer, MecabTokenizer, ...) 4. Create a module called mecab-tools It's easy to develop, but there are other library modules in analysis. So something seems strange because it's only runnable-jar. Number 2 seems to be the best, but I'm not sure yet. I would appreciate any comments. I will go ahead if direction is set, but landing will be delayed a little. The reason is that the build system is going to change. (SOLR-13452) But if it does not matter, I will proceed. > Combine Nori and Kuromoji DictionaryBuilder > --- > > Key: LUCENE-8817 > URL: https://issues.apache.org/jira/browse/LUCENE-8817 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Namgyu Kim >Priority: Major > > This issue is related to LUCENE-8816. > Currently Nori and Kuromoji Analyzer use the same dictionary structure. > (MeCab) > If we make combine DictionaryBuilder, we can reduce the code size. > But this task may have a dependency on the language. > (like HEADER string in BinaryDictionary and CharacterDefinition, methods in > BinaryDictionaryWriter, ...) > On the other hand, there are many overlapped classes. > The purpose of this patch is to provide users of Nori and Kuromoji with the > same system dictionary generator. > It may take some time because there is a little workload. > The work will be based on the latest master, and if the LUCENE-8816 is > finished first, I will pull the latest code and proceed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8817) Combine Nori and Kuromoji DictionaryBuilder
[ https://issues.apache.org/jira/browse/LUCENE-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858895#comment-16858895 ] Namgyu Kim commented on LUCENE-8817: I share the current status. The merge is almost over and I need some discussion. I thought several structures. 1. Save in tools of analysis-common module. It is simple, but I think MeCab is difficult to see as a feature of analysis-common. 2. Create tools folder in analysis and set mecab-tools module in there. analysis/tools ─ analysis-common-tools (to-be) └ icu-tools (to-be) └ mecab-tools └ ... The problem with this is that the number of modules increases a lot because each tool is created as a module. 3. Create a module called mecab we can create a mecab module that is the starting point for merging nori and kuromoji. If we proceed in this direction, we will only have tools in src. But this approach may not be easy to create the runnable jar. Because it will include the library. (ex: MecabAnalyzer, MecabTokenizer, ...) 4. Create a module called mecab-tools It's easy to develop, but there are other library modules in analysis. So something seems strange because it's only runnable-jar. Number 2 seems to be the best, but I'm not sure yet. I would appreciate any comments. I will go ahead if direction is set, but landing will be delayed a little. The reason is that the build system is going to change. (SOLR-13452) But if it does not matter, I will proceed. > Combine Nori and Kuromoji DictionaryBuilder > --- > > Key: LUCENE-8817 > URL: https://issues.apache.org/jira/browse/LUCENE-8817 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Namgyu Kim >Priority: Major > > This issue is related to LUCENE-8816. > Currently Nori and Kuromoji Analyzer use the same dictionary structure. > (MeCab) > If we make combine DictionaryBuilder, we can reduce the code size. > But this task may have a dependency on the language. 
> (like HEADER string in BinaryDictionary and CharacterDefinition, methods in > BinaryDictionaryWriter, ...) > On the other hand, there are many overlapped classes. > The purpose of this patch is to provide users of Nori and Kuromoji with the > same system dictionary generator. > It may take some time because there is a little workload. > The work will be based on the latest master, and if the LUCENE-8816 is > finished first, I will pull the latest code and proceed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858839#comment-16858839 ] Namgyu Kim commented on LUCENE-8812: Thank you for your reply. [~jim.ferenczi] :D Awesome. I'll submit this patch. This is the first time I have submitted a patch manually. I will look up the procedure and proceed, but it may take a little time. > add KoreanNumberFilter to Nori(Korean) Analyzer > --- > > Key: LUCENE-8812 > URL: https://issues.apache.org/jira/browse/LUCENE-8812 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Namgyu Kim >Priority: Major > Attachments: LUCENE-8812.patch > > > This is a follow-up issue to LUCENE-8784. > The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to > regular Arabic decimal numbers in half-width characters. > Logic is similar to JapaneseNumberFilter. > It should be able to cover the following test cases. > 1) Korean Word to Number > 십만이천오백 => 102500 > 2) 1 character conversion > 일영영영 => 1000 > 3) Decimal Point Calculation > 3.2천 => 3200 > 4) Comma between three digits > 4,647.0010 => 4647.001 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary
[ https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853784#comment-16853784 ] Namgyu Kim commented on LUCENE-8816: Thanks! [~tomoko] :D > Decouple Kuromoji's morphological analyser and its dictionary > - > > Key: LUCENE-8816 > URL: https://issues.apache.org/jira/browse/LUCENE-8816 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Tomoko Uchida >Priority: Major > > I've inspired by this mail-list thread. > > [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E] > As many Japanese already know, default built-in dictionary bundled with > Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. > While it has been slowly obsoleted, well-maintained and/or extended > dictionaries risen up in recent years (e.g. > [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], > [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some > attempts/projects/efforts are made in Japan. > However current architecture - dictionary bundled jar - is essentially > incompatible with the idea "switch the system dictionary", and developers > have difficulties to do so. > Traditionally, the morphological analysis engine (viterbi logic) and the > encoded dictionary (language model) had been decoupled (like MeCab, the > origin of Kuromoji, or lucene-gosen). So actually decoupling them is a > natural idea, and I feel that it's good time to re-think the current > architecture. > Also this would be good for advanced users who have customized/re-trained > their own system dictionary. > Goals of this issue: > * Decouple JapaneseTokenizer itself and encoded system dictionary. > * Implement dynamic dictionary load mechanism. > * Provide developer-oriented dictionary build tool. 
> Non-goals: > * Provide learner or language model (it's up to users and should be outside > the scope). > I have not dove into the code yet, so have no idea about it's easy or > difficult at this moment. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-8817) Combine Nori and Kuromoji DictionaryBuilder
[ https://issues.apache.org/jira/browse/LUCENE-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Namgyu Kim updated LUCENE-8817: --- Description: This issue is related to LUCENE-8816. Currently the Nori and Kuromoji analyzers use the same dictionary structure (MeCab). If we combine their DictionaryBuilders, we can reduce the code size. However, this task may have language-specific dependencies (such as the HEADER string in BinaryDictionary and CharacterDefinition, methods in BinaryDictionaryWriter, ...). On the other hand, there are many overlapping classes. The purpose of this patch is to provide users of Nori and Kuromoji with the same system dictionary generator. It may take some time because there is some work involved. The work will be based on the latest master, and if LUCENE-8816 is finished first, I will pull the latest code and proceed. was: This issue is related to LUCENE-8816. Currently the Nori and Kuromoji analyzers use the same dictionary structure (MeCab). If we combine their DictionaryBuilders, we can reduce the code size. However, this task may have language-specific dependencies (such as the HEADER string in BinaryDictionary and CharacterDefinition, methods in BinaryDictionaryWriter, ...). On the other hand, there are many overlapping classes. The purpose of this patch is to provide users of Nori and Kuromoji with the same system dictionary generator. It may take some time because there is some work involved. The work will be based on the latest master, and if LUCENE-8816 is finished first, it will pull the latest code and proceed. > Combine Nori and Kuromoji DictionaryBuilder > --- > > Key: LUCENE-8817 > URL: https://issues.apache.org/jira/browse/LUCENE-8817 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Namgyu Kim >Priority: Major
[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary
[ https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853754#comment-16853754 ] Namgyu Kim commented on LUCENE-8816: Thanks for your reply! [~tomoko] I created the LUCENE-8817 issue and set up this issue with the "related to" link. I'll notify you if there is some progress.
[jira] [Created] (LUCENE-8817) Combine Nori and Kuromoji DictionaryBuilder
Namgyu Kim created LUCENE-8817: -- Summary: Combine Nori and Kuromoji DictionaryBuilder Key: LUCENE-8817 URL: https://issues.apache.org/jira/browse/LUCENE-8817 Project: Lucene - Core Issue Type: New Feature Reporter: Namgyu Kim This issue is related to LUCENE-8816. Currently the Nori and Kuromoji analyzers use the same dictionary structure (MeCab). If we combine their DictionaryBuilders, we can reduce the code size. However, this task may have language-specific dependencies (such as the HEADER string in BinaryDictionary and CharacterDefinition, methods in BinaryDictionaryWriter, ...). On the other hand, there are many overlapping classes. The purpose of this patch is to provide users of Nori and Kuromoji with the same system dictionary generator. It may take some time because there is some work involved. The work will be based on the latest master, and if LUCENE-8816 is finished first, it will pull the latest code and proceed.
[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary
[ https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853678#comment-16853678 ] Namgyu Kim commented on LUCENE-8816: Oh, you're right. [~tomoko] :D I'll make a new JIRA issue after cleaning up the changes. {quote}To avoid confusion, personally I'd like to proceed things in a right order - cleaning up first, then generalizing. But if you are sure that we can go in parallel, can you share your plan? {quote} Sure, that's important. I think we can proceed in parallel. There are two possible cases. 1) You finish the JapaneseTokenizer and DictionaryBuilder work first. In that case, I can pull your new code and merge it with Nori's DictionaryBuilder. 2) I finish merging DictionaryBuilder (Nori) and DictionaryBuilder (Kuromoji) first. In that case, you can pull my changes and continue. The DictionaryBuilder logic of Kuromoji does not change at all in my work. But if you think this is a little inefficient, I'll do it later. What do you think?
[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary
[ https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853207#comment-16853207 ] Namgyu Kim commented on LUCENE-8816: {quote}So I thought we can share only DictionaryBuilder between Kuromoji and Nori module, as I wrote in the previous comment. {quote} I agree with your idea. [~tomoko] I don't think it would be difficult to merge DictionaryBuilder (except BinaryDictionaryWriter). But I think the BinaryDictionaryWriter case can be solved if we separate the methods (and use DictionaryFormat). Can I try this while you are concentrating on JapaneseTokenizer?
[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary
[ https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852084#comment-16852084 ] Namgyu Kim commented on LUCENE-8816: That's a good plan, [~tomoko]! About the Tokenizer... I'm not sure it is better to combine KoreanTokenizer and JapaneseTokenizer, but if we want to, there is a way. KoreanTokenizer and JapaneseTokenizer contain duplicate code. We can create an abstract class called MecabTokenizer (both are MeCab-based) that holds the duplicated parts (like isPunctuation(), the WrappedPositionArray inner class, ...). What do you think? Anyway, combining KoreanTokenizer and JapaneseTokenizer will be a big task, and I think it's right to create a new issue for it.
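The shared base class idea in the comment above could look roughly like the following. This is only a sketch under assumptions: the names MecabTokenizerSketch, KoreanSketch and JapaneseSketch are hypothetical, the punctuation check is a simplified stand-in for the helpers the two tokenizers duplicate, and the real classes extend Lucene's Tokenizer and share far more state than shown here.

```java
/** Hypothetical base class for illustration; not part of Lucene. */
abstract class MecabTokenizerSketch {

  /** A helper that both subclasses would otherwise have to duplicate. */
  static boolean isPunctuation(char ch) {
    switch (Character.getType(ch)) {
      case Character.SPACE_SEPARATOR:
      case Character.CONTROL:
      case Character.DASH_PUNCTUATION:
      case Character.START_PUNCTUATION:
      case Character.END_PUNCTUATION:
      case Character.CONNECTOR_PUNCTUATION:
      case Character.OTHER_PUNCTUATION:
      case Character.INITIAL_QUOTE_PUNCTUATION:
      case Character.FINAL_QUOTE_PUNCTUATION:
      case Character.MATH_SYMBOL:
      case Character.CURRENCY_SYMBOL:
        return true;
      default:
        return false;
    }
  }

  /** Language-specific part that each subclass supplies. */
  abstract String dictionaryName();

  /** Shared logic written once against the abstract hook. */
  final String describe() {
    return getClass().getSimpleName() + " uses " + dictionaryName();
  }
}

/** Stand-in for the Korean (Nori) side. */
final class KoreanSketch extends MecabTokenizerSketch {
  @Override
  String dictionaryName() {
    return "mecab-ko-dic";
  }
}

/** Stand-in for the Japanese (Kuromoji) side. */
final class JapaneseSketch extends MecabTokenizerSketch {
  @Override
  String dictionaryName() {
    return "mecab-ipadic";
  }
}
```

The design choice is the usual template pattern: anything identical in both tokenizers moves up, and only the dictionary-dependent pieces stay in the subclasses.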
[jira] [Commented] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852014#comment-16852014 ] Namgyu Kim commented on LUCENE-8812: Thank you for your reply, [~jim.ferenczi] :D {quote}I wonder if it would be difficult to have a base class for the Japanese and Korean number filter since they share a large amount of code. However I think it's ok to merge this first and we can tackle the merge in a follow up, wdyt ? {quote} I think it is an awesome refactoring. If the refactoring is done, we can also share this TokenFilter in SmartChineseAnalyzer. (Chinese and Japanese use the same numeric characters) The amount of code will also be reduced. I think the NumberFilter (new abstract class) can be in the org.apache.lucene.analysis.core(analysis-common) or org.apache.lucene.analysis(lucene-core) package, what do you think? In my personal opinion, analysis-common seems to be correct, but it is a little bit ambiguous. > add KoreanNumberFilter to Nori(Korean) Analyzer > --- > > Key: LUCENE-8812 > URL: https://issues.apache.org/jira/browse/LUCENE-8812 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Namgyu Kim >Priority: Major > Attachments: LUCENE-8812.patch > > > This is a follow-up issue to LUCENE-8784. > The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to > regular Arabic decimal numbers in half-width characters. > Logic is similar to JapaneseNumberFilter. > It should be able to cover the following test cases. > 1) Korean Word to Number > 십만이천오백 => 102500 > 2) 1 character conversion > 일영영영 => 1000 > 3) Decimal Point Calculation > 3.2천 => 3200 > 4) Comma between three digits > 4,647.0010 => 4647.001 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
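To make test cases 1) and 2) from the issue description concrete, here is a minimal sketch of the conversion, assuming a hypothetical helper class KoreanNumberSketch. It is not the code from LUCENE-8812.patch, and it ignores decimals and commas. The characters are written as Unicode escapes so the snippet compiles regardless of source encoding: the digits are 영(0) 일(1) 이(2) 삼(3) 사(4) 오(5) 육(6) 칠(7) 팔(8) 구(9), the small multipliers are 십(10) 백(100) 천(1000), and the large units are 만(10^4) 억(10^8) 조(10^12).

```java
/** Hypothetical helper for illustration; not the KoreanNumberFilter from the patch. */
final class KoreanNumberSketch {

  // Digits yeong(0) il(1) i(2) sam(3) sa(4) o(5) yuk(6) chil(7) pal(8) gu(9).
  // The index of each character in this string is its numeric value.
  private static final String DIGITS =
      "\uC601\uC77C\uC774\uC0BC\uC0AC\uC624\uC721\uCE60\uD314\uAD6C";

  /** sip = 10, baek = 100, cheon = 1000; returns 0 if ch is not a small multiplier. */
  private static long smallUnit(char ch) {
    switch (ch) {
      case '\uC2ED': return 10L;
      case '\uBC31': return 100L;
      case '\uCC9C': return 1_000L;
      default: return 0L;
    }
  }

  /** man = 10^4, eok = 10^8, jo = 10^12; returns 0 if ch is not a large unit. */
  private static long largeUnit(char ch) {
    switch (ch) {
      case '\uB9CC': return 10_000L;
      case '\uC5B5': return 100_000_000L;
      case '\uC870': return 1_000_000_000_000L;
      default: return 0L;
    }
  }

  static long parse(String text) {
    long total = 0;      // sum of the groups already closed by a large unit
    long section = 0;    // value of the current group (below 10^4)
    long pending = 0;    // run of plain digits, read positionally
    boolean hasPending = false;

    for (int i = 0; i < text.length(); i++) {
      char ch = text.charAt(i);
      int digit = DIGITS.indexOf(ch);
      if (digit >= 0) {
        pending = pending * 10 + digit;
        hasPending = true;
        continue;
      }
      long small = smallUnit(ch);
      if (small > 0) {
        // A bare multiplier such as "sip" means 1 x 10.
        section += (hasPending ? pending : 1) * small;
      } else {
        long large = largeUnit(ch);
        if (large == 0) {
          throw new IllegalArgumentException("not a Korean numeral: " + ch);
        }
        long group = section + (hasPending ? pending : 0);
        total += (group == 0 ? 1 : group) * large;
        section = 0;
      }
      pending = 0;
      hasPending = false;
    }
    return total + section + pending;
  }
}
```

With this sketch, parse applied to 십만이천오백 walks through 십 (section = 10), 만 (total = 100000), 이천 (section = 2000) and 오백 (section = 2500), giving 102500; 일영영영 is read digit by digit as 1000.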
[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary
[ https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851103#comment-16851103 ] Namgyu Kim commented on LUCENE-8816: Thanks for the answers :D Oh, I think your opinion is good, [~tomoko]. It is important to ensure that users do not have to modify the source code. If we apply that approach, it seems possible to create a build tool that supports two pre-types in the CLI environment. And it's very important to handle exceptions well, as [~rcmuir] said. Otherwise, users will think that stability is poor. About version control, I totally agree with you.
[jira] [Commented] (LUCENE-8813) testIndexTooManyDocs fails
[ https://issues.apache.org/jira/browse/LUCENE-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851076#comment-16851076 ] Namgyu Kim commented on LUCENE-8813: Oh, I saw the comment a little late. I didn't find the root cause, but your analysis is awesome. And patch looks good. I'll also review the PR in github :D > testIndexTooManyDocs fails > -- > > Key: LUCENE-8813 > URL: https://issues.apache.org/jira/browse/LUCENE-8813 > Project: Lucene - Core > Issue Type: Test > Components: core/index >Reporter: Nhat Nguyen >Priority: Major > Time Spent: 1h 10m > Remaining Estimate: 0h > > testIndexTooManyDocs fails on [Elastic > CI|https://elasticsearch-ci.elastic.co/job/apache+lucene-solr+branch_8x/6402/console]. > This failure does not reproduce locally for me. > {noformat} > [junit4] Suite: org.apache.lucene.index.TestIndexTooManyDocs >[junit4] 2> KTN 23, 2019 4:09:37 PM > com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler > uncaughtException >[junit4] 2> WARNING: Uncaught exception in thread: > Thread[Thread-612,5,TGRP-TestIndexTooManyDocs] >[junit4] 2> java.lang.AssertionError: only modifications from the > current flushing queue are permitted while doing a full flush >[junit4] 2> at > __randomizedtesting.SeedInfo.seed([1F16B1DA7056AA52]:0) >[junit4] 2> at > org.apache.lucene.index.DocumentsWriter.assertTicketQueueModification(DocumentsWriter.java:683) >[junit4] 2> at > org.apache.lucene.index.DocumentsWriter.applyAllDeletes(DocumentsWriter.java:187) >[junit4] 2> at > org.apache.lucene.index.DocumentsWriter.postUpdate(DocumentsWriter.java:411) >[junit4] 2> at > org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:514) >[junit4] 2> at > org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594) >[junit4] 2> at > org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1586) >[junit4] 2> at > 
org.apache.lucene.index.TestIndexTooManyDocs.lambda$testIndexTooManyDocs$0(TestIndexTooManyDocs.java:70) >[junit4] 2> at java.base/java.lang.Thread.run(Thread.java:834) >[junit4] 2> >[junit4] 2> KTN 23, 2019 6:09:36 PM > com.carrotsearch.randomizedtesting.ThreadLeakControl$2 evaluate >[junit4] 2> WARNING: Suite execution timed out: > org.apache.lucene.index.TestIndexTooManyDocs >[junit4] 2>1) Thread[id=669, > name=SUITE-TestIndexTooManyDocs-seed#[1F16B1DA7056AA52], state=RUNNABLE, > group=TGRP-TestIndexTooManyDocs] >[junit4] 2> at > java.base/java.lang.Thread.getStackTrace(Thread.java:1606) >[junit4] 2> at > com.carrotsearch.randomizedtesting.ThreadLeakControl$4.run(ThreadLeakControl.java:696) >[junit4] 2> at > com.carrotsearch.randomizedtesting.ThreadLeakControl$4.run(ThreadLeakControl.java:693) >[junit4] 2> at > java.base/java.security.AccessController.doPrivileged(Native Method) >[junit4] 2> at > com.carrotsearch.randomizedtesting.ThreadLeakControl.getStackTrace(ThreadLeakControl.java:693) >[junit4] 2> at > com.carrotsearch.randomizedtesting.ThreadLeakControl.getThreadsWithTraces(ThreadLeakControl.java:709) >[junit4] 2> at > com.carrotsearch.randomizedtesting.ThreadLeakControl.formatThreadStacksFull(ThreadLeakControl.java:689) >[junit4] 2> at > com.carrotsearch.randomizedtesting.ThreadLeakControl.access$1000(ThreadLeakControl.java:65) >[junit4] 2> at > com.carrotsearch.randomizedtesting.ThreadLeakControl$2.evaluate(ThreadLeakControl.java:415) >[junit4] 2> at > com.carrotsearch.randomizedtesting.RandomizedRunner.runSuite(RandomizedRunner.java:708) >[junit4] 2> at > com.carrotsearch.randomizedtesting.RandomizedRunner.access$200(RandomizedRunner.java:138) >[junit4] 2> at > com.carrotsearch.randomizedtesting.RandomizedRunner$2.run(RandomizedRunner.java:629) >[junit4] 2>2) Thread[id=671, name=Thread-606, state=BLOCKED, > group=TGRP-TestIndexTooManyDocs] >[junit4] 2> at > app//org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:4945) >[junit4] 2> at > 
app//org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:293) >[junit4] 2> at > app//org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:272) >[junit4] 2> at > app//org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:262) >[junit4] 2> at > app//org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:165) >[junit4] 2> at >
[jira] [Commented] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851033#comment-16851033 ] Namgyu Kim commented on LUCENE-8812: Hi [~jim.ferenczi] :D Thank you for applying the LUCENE-8784 patch! I checked that this patch applies to the latest code without conflicts. What do you think about this patch? > add KoreanNumberFilter to Nori(Korean) Analyzer > --- > > Key: LUCENE-8812 > URL: https://issues.apache.org/jira/browse/LUCENE-8812 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Namgyu Kim >Priority: Major > Attachments: LUCENE-8812.patch > > > This is a follow-up issue to LUCENE-8784. > The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to > regular Arabic decimal numbers in half-width characters. > Logic is similar to JapaneseNumberFilter. > It should be able to cover the following test cases. > 1) Korean Word to Number > 십만이천오백 => 102500 > 2) 1 character conversion > 일영영영 => 1000 > 3) Decimal Point Calculation > 3.2천 => 3200 > 4) Comma between three digits > 4,647.0010 => 4647.001
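Test cases 3) and 4) above come down to decimal arithmetic plus dropping trailing zeros. A small sketch of that step, assuming java.math.BigDecimal is used for the arithmetic and a hypothetical class name DecimalNormalizeSketch; the unit multiplier (1000 for 천) is passed in by the caller rather than parsed from the text:

```java
import java.math.BigDecimal;

/** Hypothetical helper for illustration; not code from the patch. */
final class DecimalNormalizeSketch {

  /**
   * Removes grouping commas, applies the unit multiplier and renders the
   * result without trailing zeros or scientific notation.
   */
  static String normalize(String digits, long unit) {
    BigDecimal value = new BigDecimal(digits.replace(",", ""));
    return value
        .multiply(BigDecimal.valueOf(unit))
        .stripTrailingZeros()
        .toPlainString();
  }
}
```

Here normalize("4,647.0010", 1) yields "4647.001" and normalize("3.2", 1000) yields "3200". toPlainString matters because stripTrailingZeros alone would leave 3200 represented as 3.2E+3.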
[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary
[ https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849967#comment-16849967 ] Namgyu Kim commented on LUCENE-8816: Hi everybody, Thank you for opening the issue! [~tomoko] To be honest, when I first talked about a custom system dictionary, I did not see the big picture. Anyway, the structure I have in mind is as follows. 1. As Tomoko said, make a developer-oriented dictionary build tool. The "ant regenerate" command inside the build.xml that I checked has the following steps. 1) Compile the code (compile-tools) 2) Download the jar file (download-dict) 3) Save Noun.proper.csv diffs (patch-dict) 4) Run DictionaryBuilder and make dat files (build-dict) This is not a problem if the user builds only the system dictionary. (Of course, the classpath still has to be modified.) (ex) ant build-dict ipadic /home/my/path/customDicIn(custom-dic input) /home/my/path/customDicOutput(dat output) utf-8 false However, if the user needs to get a dictionary from the server, they have to modify the build.xml. As far as I know, the url path is hard-coded. Of course, the user can run it by modifying ivy.xml and build.xml. But in my personal opinion, users should not have to touch Lucene's internal code (not even the build script). Users may be afraid to change it or reluctant to use it (especially those who have not used Apache Ant). However, I think this may differ from person to person. 2. Version control. I actually think this is the biggest problem. As I mentioned in the email, whenever the Lucene version goes up, users have to rebuild their system dictionary unconditionally and put it in the jar. The current process is: 1) Follow the process in 1. 2) Move the user's system dictionary dat files to resources/org.apache.lucene.analysis.ja.dict 3) ant jar Because of step 3), the user always has to rebuild the kuromoji module or patch the kuromoji jar. Users may be irritated when a version upgrade contains no change to the kuromoji module. This problem can be solved easily if the system dictionary could simply be parameterized in JapaneseTokenizer. (Of course, expert javadoc is required.)
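One way to picture "parameterizing the system dictionary" is to hide the location of the encoded files behind a small interface, so the tokenizer no longer assumes they are bundled in its own jar. The sketch below is purely illustrative: DictionarySource, ClasspathDictionarySource, DirectoryDictionarySource and the file name system.dat are all hypothetical and do not correspond to any Lucene API. It uses InputStream#readAllBytes, so it assumes Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical abstraction: where the encoded dictionary files come from. */
interface DictionarySource {
  InputStream open(String fileName) throws IOException;
}

/** Bundled-jar style: files are resources located next to a class. */
final class ClasspathDictionarySource implements DictionarySource {
  private final Class<?> anchor;

  ClasspathDictionarySource(Class<?> anchor) {
    this.anchor = anchor;
  }

  @Override
  public InputStream open(String fileName) throws IOException {
    InputStream in = anchor.getResourceAsStream(fileName);
    if (in == null) {
      throw new IOException("resource not found: " + fileName);
    }
    return in;
  }
}

/** Decoupled style: files are read from a directory chosen by the user. */
final class DirectoryDictionarySource implements DictionarySource {
  private final Path dir;

  DirectoryDictionarySource(Path dir) {
    this.dir = dir;
  }

  @Override
  public InputStream open(String fileName) throws IOException {
    return Files.newInputStream(dir.resolve(fileName));
  }
}

final class DictionarySourceDemo {
  /** Writes a stand-in dat file to a temp directory and reads it back through the source. */
  static String roundTrip(String content) {
    try {
      Path dir = Files.createTempDirectory("custom-dict");
      Files.write(dir.resolve("system.dat"), content.getBytes(StandardCharsets.UTF_8));
      DictionarySource source = new DirectoryDictionarySource(dir);
      try (InputStream in = source.open("system.dat")) {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

With something of this shape, a Lucene upgrade that does not touch the dictionary format would not force users to rebuild or patch the kuromoji jar; they would only point the tokenizer at their existing directory.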
[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849000#comment-16849000 ] Namgyu Kim commented on LUCENE-8784: Oh, I checked that discardPunctuation is removed from KoreanAnalyzer. Thank you very much for applying my patch! [~jim.ferenczi] :D > Nori(Korean) tokenizer removes the decimal point. > --- > > Key: LUCENE-8784 > URL: https://issues.apache.org/jira/browse/LUCENE-8784 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Munkyu Im >Priority: Major > Fix For: master (9.0), 8.2 > > Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch, > LUCENE-8784.patch > > > This is the same issue that I mentioned to > [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367] > unlike standard analyzer, nori analyzer removes the decimal point. > nori tokenizer removes "." character by default. > In this case, it is difficult to index the keywords including the decimal > point. > It would be nice if there had the option whether add a decimal point or not. > Like Japanese tokenizer does, Nori need an option to preserve decimal point. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8813) testIndexTooManyDocs fails
[ https://issues.apache.org/jira/browse/LUCENE-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848196#comment-16848196 ]

Namgyu Kim commented on LUCENE-8813:
------------------------------------

Hi, [~dnhatn] and [~simonw].
The root cause has not been found yet, but the log analysis seems to be finished. I don't know if it will help, but I'll share it.

There are two thread types in the TestIndexTooManyDocs test:
Type1) Write to the index (IndexWriter#updateDocument()) - 3 ~ 7 threads are created (randomly)
Type2) Open (refresh) the index (DirectoryReader#openIfChanged()) - 2 threads are created

A Type2 thread can only terminate once *all Type1 threads have terminated.* The reason is its *"while (done.get() == false)"* condition: "done.set(true)" is only called after the Type1 threads have finished (indexingDone.await() followed by done.set(true)).
Unfortunately, if an exception occurs in a Type1 thread, that thread terminates *without* calling "indexingDone.countDown()".
(The Elastic CI log shows java.lang.AssertionError: "only modifications from the current flushing queue are permitted while doing a full flush" in DocumentsWriter#assertTicketQueueModification().)
*The exception does not end the program*; only the affected Type1 thread terminates, so the CountDownLatch count never reaches zero and the Type2 threads spin forever in the "while (done.get() == false)" loop.

The Elastic CI log shows a suite time limit of 7200000 msec (2 hours), as reported in "Throwable #1: java.lang.Exception: Suite timeout exceeded (>= 7200000 msec)".
That is why the log contains heartbeats from
[junit4] HEARTBEAT J1 PID(2340@localhost): 2019-05-23T*14:14:47*, stalled for 310s at: TestIndexTooManyDocs.testIndexTooManyDocs
to
[junit4] HEARTBEAT J1 PID(2340@localhost): 2019-05-23T*16:08:47*, stalled for 7150s at: TestIndexTooManyDocs.testIndexTooManyDocs.

As a result, I think the failure originates in updateDocument(), and that is where we should look for the cause.
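The hang described in this comment can be reproduced outside Lucene. The following is a minimal sketch, assuming a simplified stand-in for the test's thread layout: the class `LatchSketch`, its `run` method and the simulated `IllegalStateException` are hypothetical and are not taken from TestIndexTooManyDocs. It only illustrates why a writer thread that dies before `countDown()` keeps the refresher loop alive, and how a `finally` block would avoid that.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the test's thread structure; not the actual Lucene test code.
final class LatchSketch {

  /**
   * Starts `writers` writer threads, the first `failing` of which throw immediately,
   * plus one refresher thread that loops until `done` is set.
   * Returns true if the latch opened and the refresher ended within the timeout.
   */
  static boolean run(int writers, int failing, boolean countDownInFinally, long timeoutMillis) {
    CountDownLatch indexingDone = new CountDownLatch(writers);
    AtomicBoolean done = new AtomicBoolean(false);

    for (int i = 0; i < writers; i++) {
      final boolean fail = i < failing;
      Thread writer = new Thread(() -> {
        if (countDownInFinally) {
          try {
            if (fail) throw new IllegalStateException("simulated indexing failure");
          } finally {
            indexingDone.countDown(); // runs even when the writer throws
          }
        } else {
          if (fail) throw new IllegalStateException("simulated indexing failure");
          indexingDone.countDown(); // skipped when the writer throws
        }
      });
      writer.setDaemon(true);
      writer.setUncaughtExceptionHandler((thread, error) -> { /* keep the sketch quiet */ });
      writer.start();
    }

    Thread refresher = new Thread(() -> {
      while (done.get() == false) {
        Thread.yield(); // stands in for the reader refresh work
      }
    });
    refresher.setDaemon(true);
    refresher.start();

    try {
      boolean indexed = indexingDone.await(timeoutMillis, TimeUnit.MILLISECONDS);
      if (indexed) {
        done.set(true); // only reached when every writer counted down
      }
      refresher.join(timeoutMillis);
      return indexed && !refresher.isAlive();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      done.set(true); // always release the refresher so the sketch itself cannot hang
    }
  }

  public static void main(String[] args) {
    System.out.println("countDown in finally: " + run(4, 1, true, 2000));
    System.out.println("countDown after work: " + run(4, 1, false, 200));
  }
}
```

The sketch uses a bounded `await` so that it terminates; with an unbounded `await()`, as described in the comment, the second scenario would only end when the suite timeout fires.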
> testIndexTooManyDocs fails
> --------------------------
>
>                 Key: LUCENE-8813
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8813
>             Project: Lucene - Core
>          Issue Type: Test
>          Components: core/index
>            Reporter: Nhat Nguyen
>            Priority: Major
>
> testIndexTooManyDocs fails on [Elastic CI|https://elasticsearch-ci.elastic.co/job/apache+lucene-solr+branch_8x/6402/console].
> This failure does not reproduce locally for me.
> {noformat}
>    [junit4] Suite: org.apache.lucene.index.TestIndexTooManyDocs
>    [junit4]   2> KTN 23, 2019 4:09:37 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
>    [junit4]   2> WARNING: Uncaught exception in thread: Thread[Thread-612,5,TGRP-TestIndexTooManyDocs]
>    [junit4]   2> java.lang.AssertionError: only modifications from the current flushing queue are permitted while doing a full flush
>    [junit4]   2>    at __randomizedtesting.SeedInfo.seed([1F16B1DA7056AA52]:0)
>    [junit4]   2>    at org.apache.lucene.index.DocumentsWriter.assertTicketQueueModification(DocumentsWriter.java:683)
>    [junit4]   2>    at org.apache.lucene.index.DocumentsWriter.applyAllDeletes(DocumentsWriter.java:187)
>    [junit4]   2>    at org.apache.lucene.index.DocumentsWriter.postUpdate(DocumentsWriter.java:411)
>    [junit4]   2>    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:514)
>    [junit4]   2>    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
>    [junit4]   2>    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1586)
>    [junit4]   2>    at org.apache.lucene.index.TestIndexTooManyDocs.lambda$testIndexTooManyDocs$0(TestIndexTooManyDocs.java:70)
>    [junit4]   2>    at java.base/java.lang.Thread.run(Thread.java:834)
>    [junit4]   2>
>    [junit4]   2> KTN 23, 2019 6:09:36 PM com.carrotsearch.randomizedtesting.ThreadLeakControl$2 evaluate
>    [junit4]   2> WARNING: Suite execution timed out: org.apache.lucene.index.TestIndexTooManyDocs
>    [junit4]   2>    1) Thread[id=669, name=SUITE-TestIndexTooManyDocs-seed#[1F16B1DA7056AA52], state=RUNNABLE, group=TGRP-TestIndexTooManyDocs]
>    [junit4]   2>    at java.base/java.lang.Thread.getStackTrace(Thread.java:1606)
>    [junit4]   2>    at com.carrotsearch.randomizedtesting.ThreadLeakControl$4.run(ThreadLeakControl.java:696)
>    [junit4]   2>    at com.carrotsearch.randomizedtesting.ThreadLeakControl$4.run(ThreadLeakControl.java:693)
>    [junit4]   2>    at java.base/java.security.AccessController.doPrivileged(Native Method)
>    [junit4]   2>    at com.carrotsearch.randomizedtesting.ThreadLeakControl.getStackTrace(ThreadLeakControl.java:693)
>    [junit4]   2>    at com.carrotsearch.randomizedtesting.ThreadLeakControl.getThreadsWithTraces(ThreadLeakControl.java:709)
[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847706#comment-16847706 ] Namgyu Kim commented on LUCENE-8784: Thanks, [~jim.ferenczi]! :D If there is something wrong, I would appreciate it if you let me know. > Nori(Korean) tokenizer removes the decimal point. > --- > > Key: LUCENE-8784 > URL: https://issues.apache.org/jira/browse/LUCENE-8784 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Munkyu Im >Priority: Major > Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch, > LUCENE-8784.patch > > > This is the same issue that I mentioned to > [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367] > unlike standard analyzer, nori analyzer removes the decimal point. > nori tokenizer removes "." character by default. > In this case, it is difficult to index the keywords including the decimal > point. > It would be nice if there had the option whether add a decimal point or not. > Like Japanese tokenizer does, Nori need an option to preserve decimal point. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847659#comment-16847659 ]

Namgyu Kim commented on LUCENE-8784:
------------------------------------

It's a good idea :D
I linked LUCENE-8784 (discardPunctuation) with LUCENE-8812 (KoreanNumberFilter). (Apply LUCENE-8784 *first* and then LUCENE-8812.)
Your suggestion made this issue cleaner.
In LUCENE-8784, I did not change the existing test cases and only added new test cases for discardPunctuation. (I kept the current constructor so that the existing API is still available.)

> Nori(Korean) tokenizer removes the decimal point.
> --------------------------------------------------
>
>                 Key: LUCENE-8784
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8784
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Munkyu Im
>            Priority: Major
>         Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch
>
> This is the same issue that I mentioned in [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367].
> Unlike the standard analyzer, the nori analyzer removes the decimal point: the nori tokenizer removes the "." character by default.
> This makes it difficult to index keywords that include a decimal point.
> It would be nice to have an option that controls whether the decimal point is kept.
> Like the Japanese tokenizer, Nori needs an option to preserve the decimal point.
[jira] [Updated] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namgyu Kim updated LUCENE-8812:
-------------------------------
    Attachment: LUCENE-8812.patch

> add KoreanNumberFilter to Nori(Korean) Analyzer
> -----------------------------------------------
>
>                 Key: LUCENE-8812
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8812
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Namgyu Kim
>            Priority: Major
>         Attachments: LUCENE-8812.patch
>
> This is a follow-up issue to LUCENE-8784.
> The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to regular Arabic decimal numbers in half-width characters.
> Logic is similar to JapaneseNumberFilter.
> It should be able to cover the following test cases.
> 1) Korean Word to Number
> 십만이천오백 => 102500
> 2) 1 character conversion
> 일영영영 => 1000
> 3) Decimal Point Calculation
> 3.2천 => 3200
> 4) Comma between three digits
> 4,647.0010 => 4647.001
[jira] [Updated] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Namgyu Kim updated LUCENE-8784: --- Attachment: LUCENE-8784.patch > Nori(Korean) tokenizer removes the decimal point. > --- > > Key: LUCENE-8784 > URL: https://issues.apache.org/jira/browse/LUCENE-8784 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Munkyu Im >Priority: Major > Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch, > LUCENE-8784.patch > > > This is the same issue that I mentioned to > [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367] > unlike standard analyzer, nori analyzer removes the decimal point. > nori tokenizer removes "." character by default. > In this case, it is difficult to index the keywords including the decimal > point. > It would be nice if there had the option whether add a decimal point or not. > Like Japanese tokenizer does, Nori need an option to preserve decimal point. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer
Namgyu Kim created LUCENE-8812:
----------------------------------

             Summary: add KoreanNumberFilter to Nori(Korean) Analyzer
                 Key: LUCENE-8812
                 URL: https://issues.apache.org/jira/browse/LUCENE-8812
             Project: Lucene - Core
          Issue Type: New Feature
            Reporter: Namgyu Kim

This is a follow-up issue to LUCENE-8784.
The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to regular Arabic decimal numbers in half-width characters.
Logic is similar to JapaneseNumberFilter.
It should be able to cover the following test cases.
1) Korean Word to Number
십만이천오백 => 102500
2) 1 character conversion
일영영영 => 1000
3) Decimal Point Calculation
3.2천 => 3200
4) Comma between three digits
4,647.0010 => 4647.001
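The four test cases listed for LUCENE-8812 can be illustrated with a small stand-alone normalizer. This is only a sketch under simplifying assumptions: `KoreanNumberSketch` and `normalize` are hypothetical names, the code works on a single string rather than on a token stream, and it is not the KoreanNumberFilter from the patch. It treats Korean digit characters like Arabic digits, multiplies by 십/백/천 within a group, and closes a group on 만/억/조.

```java
import java.math.BigDecimal;

// Hypothetical helper for illustration; not the KoreanNumberFilter implementation.
final class KoreanNumberSketch {
  private static final String DIGITS = "영일이삼사오육칠팔구"; // index == digit value
  private static final String SMALL_UNITS = "십백천";           // 10, 100, 1000
  private static final String LARGE_UNITS = "만억조";           // 10^4, 10^8, 10^12

  /** Normalizes a Korean or mixed numeral string to a plain Arabic decimal string. */
  static String normalize(String input) {
    BigDecimal total = BigDecimal.ZERO;           // sum of the closed 만/억/조 groups
    BigDecimal section = BigDecimal.ZERO;         // value of the current group below 10^4
    StringBuilder pending = new StringBuilder();  // digits not yet attached to a unit

    for (int i = 0; i < input.length(); i++) {
      char c = input.charAt(i);
      int idx;
      if (c == ',') {
        continue; // thousands separators carry no value
      } else if ((c >= '0' && c <= '9') || c == '.') {
        pending.append(c);
      } else if ((idx = DIGITS.indexOf(c)) >= 0) {
        pending.append((char) ('0' + idx)); // 일영영영 is read digit by digit as 1000
      } else if ((idx = SMALL_UNITS.indexOf(c)) >= 0) {
        // A bare unit such as 십 means 1 x 10.
        BigDecimal factor = pending.length() == 0 ? BigDecimal.ONE : value(pending);
        section = section.add(factor.scaleByPowerOfTen(idx + 1));
        pending.setLength(0);
      } else if ((idx = LARGE_UNITS.indexOf(c)) >= 0) {
        BigDecimal group = section.add(value(pending));
        if (group.signum() == 0) {
          group = BigDecimal.ONE; // a bare 만 means 1 x 10^4
        }
        total = total.add(group.scaleByPowerOfTen(4 * (idx + 1)));
        section = BigDecimal.ZERO;
        pending.setLength(0);
      } else {
        throw new IllegalArgumentException("not a numeral character: " + c);
      }
    }
    return total.add(section).add(value(pending)).stripTrailingZeros().toPlainString();
  }

  private static BigDecimal value(StringBuilder digits) {
    return digits.length() == 0 ? BigDecimal.ZERO : new BigDecimal(digits.toString());
  }

  public static void main(String[] args) {
    for (String s : new String[] {"십만이천오백", "일영영영", "3.2천", "4,647.0010"}) {
      System.out.println(s + " => " + normalize(s));
    }
  }
}
```

`BigDecimal` is used so that inputs like 3.2천 and 4,647.0010 do not pick up floating-point rounding; trailing zeros are stripped at the end to match the expected 4647.001.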
[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846877#comment-16846877 ] Namgyu Kim commented on LUCENE-8784: Oh, I forgot. I also added Javadoc for discardPunctuation in your patch. (KoreanAnalyzer, KoreanTokenizerFactory) > Nori(Korean) tokenizer removes the decimal point. > --- > > Key: LUCENE-8784 > URL: https://issues.apache.org/jira/browse/LUCENE-8784 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Munkyu Im >Priority: Major > Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch > > > This is the same issue that I mentioned to > [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367] > unlike standard analyzer, nori analyzer removes the decimal point. > nori tokenizer removes "." character by default. > In this case, it is difficult to index the keywords including the decimal > point. > It would be nice if there had the option whether add a decimal point or not. > Like Japanese tokenizer does, Nori need an option to preserve decimal point. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Namgyu Kim updated LUCENE-8784: --- Attachment: LUCENE-8784.patch > Nori(Korean) tokenizer removes the decimal point. > --- > > Key: LUCENE-8784 > URL: https://issues.apache.org/jira/browse/LUCENE-8784 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Munkyu Im >Priority: Major > Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch > > > This is the same issue that I mentioned to > [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367] > unlike standard analyzer, nori analyzer removes the decimal point. > nori tokenizer removes "." character by default. > In this case, it is difficult to index the keywords including the decimal > point. > It would be nice if there had the option whether add a decimal point or not. > Like Japanese tokenizer does, Nori need an option to preserve decimal point. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846871#comment-16846871 ]

Namgyu Kim commented on LUCENE-8784:
------------------------------------

Thank you for your reply, [~jim.ferenczi]!
Your approach looks awesome.
I developed KoreanNumberFilter by referring to JapaneseNumberFilter.
Please check my patch :D
(use "git apply --whitespace=fix LUCENE-8784.patch" because of a trailing whitespace error :()
I did not set KoreanNumberFilter as a default filter in KoreanAnalyzer.
By the way, wouldn't it be better to keep the constructors that do not take the discardPunctuation parameter?
(Otherwise, existing Nori users would have to modify their code after upgrading.)

> Nori(Korean) tokenizer removes the decimal point.
> --------------------------------------------------
>
>                 Key: LUCENE-8784
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8784
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Munkyu Im
>            Priority: Major
>         Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch
>
> This is the same issue that I mentioned in [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367].
> Unlike the standard analyzer, the nori analyzer removes the decimal point: the nori tokenizer removes the "." character by default.
> This makes it difficult to index keywords that include a decimal point.
> It would be nice to have an option that controls whether the decimal point is kept.
> Like the Japanese tokenizer, Nori needs an option to preserve the decimal point.
[jira] [Comment Edited] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845933#comment-16845933 ]

Namgyu Kim edited comment on LUCENE-8784 at 5/22/19 2:40 PM:
-------------------------------------------------------------

Thank you for your reply, [~jim.ferenczi] :D
I tried to process only the "." character in the Tokenizer, because Korean is a language that can have whitespace inside a sentence, while Japanese is not.
(Character.OTHER_PUNCTUATION would match more than just the full stop character. => Right. That's a problem. I have to change that part...)
JapaneseTokenizer keeps whitespace when the discardPunctuation option is set to false.
- example -
"十万二千五 百 二千五" (means "102005 100 2005")
If we run the JapaneseTokenizer with discardPunctuation=false and JapaneseNumberFilter, we get: {"102005", " ", "100", " ", "2005"}
Of course we could handle this with a StopFilter or with internal processing in another filter, but is that okay..?
Developing a NumberFilter looks much more flexible and structurally cleaner than internal processing in the Tokenizer.
But I developed it this way because of the problems above. How can we handle those spaces?
I think there are several ways to handle this problem:
1) Remove whitespace from the punctuation list in the Tokenizer.
2) Use a TokenFilter to remove whitespace.
3) Remove whitespace in KoreanNumberFilter. (looks structurally strange...)
4) Just leave the whitespace.

> Nori(Korean) tokenizer removes the decimal point.
> --------------------------------------------------
>
>                 Key: LUCENE-8784
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8784
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Munkyu Im
>            Priority: Major
>         Attachments: LUCENE-8784.patch
>
> This is the same issue that I mentioned in [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367].
> Unlike the standard analyzer, the nori analyzer removes the decimal point: the nori tokenizer removes the "." character by default.
> This makes it difficult to index keywords that include a decimal point.
> It would be nice to have an option that controls whether the decimal point is kept.
> Like the Japanese tokenizer, Nori needs an option to preserve the decimal point.
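Option 2 from the comment (dropping whitespace tokens in a separate filter step) can be pictured with plain Java collections. The sketch below is an assumption-laden illustration: `WhitespaceTokenSketch` and `dropWhitespaceTokens` are hypothetical names, and it operates on a list of strings instead of a Lucene TokenStream, so it says nothing about how a real TokenFilter would handle offsets or position increments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical list-based stand-in for a whitespace-removing filter step.
final class WhitespaceTokenSketch {

  /** Returns the tokens in order, skipping those that consist only of whitespace. */
  static List<String> dropWhitespaceTokens(List<String> tokens) {
    List<String> kept = new ArrayList<>();
    for (String token : tokens) {
      if (!isAllWhitespace(token)) {
        kept.add(token);
      }
    }
    return kept;
  }

  /** True for non-empty tokens whose code points are all whitespace. */
  static boolean isAllWhitespace(String token) {
    return !token.isEmpty() && token.codePoints().allMatch(Character::isWhitespace);
  }

  public static void main(String[] args) {
    // Token sequence taken from the JapaneseTokenizer example in the comment.
    List<String> tokens = Arrays.asList("102005", " ", "100", " ", "2005");
    System.out.println(dropWhitespaceTokens(tokens));
  }
}
```

Keeping this step separate from number normalization matches the comment's concern that removing whitespace inside KoreanNumberFilter would look structurally strange.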
[jira] [Commented] (LUCENE-8805) Parameter changes for stringField() in StoredFieldVisitor
[ https://issues.apache.org/jira/browse/LUCENE-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845383#comment-16845383 ]

Namgyu Kim commented on LUCENE-8805:
------------------------------------

Thank you for applying my patch! [~jpountz] and [~rcmuir]

> Parameter changes for stringField() in StoredFieldVisitor
> ---------------------------------------------------------
>
>                 Key: LUCENE-8805
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8805
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Namgyu Kim
>            Priority: Major
>             Fix For: master (9.0)
>         Attachments: LUCENE-8805.patch, LUCENE-8805.patch, LUCENE-8805.patch
>
> I wrote this patch after seeing the comments left by [~mikemccand] when SortingStoredFieldsConsumer class was first created.
> {code:java}
> @Override
> public void binaryField(FieldInfo fieldInfo, byte[] value) throws IOException {
>   ...
>   // TODO: can we avoid new BR here?
>   ...
> }
> @Override
> public void stringField(FieldInfo fieldInfo, byte[] value) throws IOException {
>   ...
>   // TODO: can we avoid new String here?
>   ...
> }
> {code}
> I changed two things.
> -1) change binaryField() parameters from byte[] to BytesRef.-
> 2) change stringField() parameters from byte[] to String.
> I also changed the related contents while doing the work.
[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845015#comment-16845015 ]

Namgyu Kim commented on LUCENE-8784:
------------------------------------

Hi, [~jim.ferenczi] and [~Munkyu].
I uploaded a patch for this issue.
I only changed the Tokenizer and TokenizerFactory, and did not touch the Analyzer.
In the case of Japanese, the Analyzer cannot be customized (discardPunctuation is always true).
If necessary, we can easily add the option to the Analyzer.
However, I have a question.
The current patch passes the parameter along on every call (to the _isPunctuation_ method).
If the method were not static, we would not have to pass the parameter every time.
What do you think about making the _isPunctuation_ method non-static?

> Nori(Korean) tokenizer removes the decimal point.
> --------------------------------------------------
>
>                 Key: LUCENE-8784
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8784
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Munkyu Im
>            Priority: Major
>         Attachments: LUCENE-8784.patch
>
> This is the same issue that I mentioned in [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367].
> Unlike the standard analyzer, the nori analyzer removes the decimal point: the nori tokenizer removes the "." character by default.
> This makes it difficult to index keywords that include a decimal point.
> It would be nice to have an option that controls whether the decimal point is kept.
> Like the Japanese tokenizer, Nori needs an option to preserve the decimal point.
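The static versus non-static question raised in this comment can be shown side by side. The following is a rough sketch only: `PunctuationSketch`, the `keepDecimalPoint` flag and the reduced list of Unicode categories are all assumptions made for illustration and do not reproduce the patch or KoreanTokenizer's actual _isPunctuation_ method.

```java
// Hypothetical illustration of the two API shapes; not the code from the patch.
final class PunctuationSketch {
  private final boolean keepDecimalPoint; // assumed option name

  PunctuationSketch(boolean keepDecimalPoint) {
    this.keepDecimalPoint = keepDecimalPoint;
  }

  /** Static shape: every call site has to pass the option along. */
  static boolean isPunctuation(char ch, boolean keepDecimalPoint) {
    if (keepDecimalPoint && ch == '.') {
      return false; // treat the decimal point as regular content
    }
    switch (Character.getType(ch)) {
      case Character.SPACE_SEPARATOR:
      case Character.DASH_PUNCTUATION:
      case Character.START_PUNCTUATION:
      case Character.END_PUNCTUATION:
      case Character.CONNECTOR_PUNCTUATION:
      case Character.OTHER_PUNCTUATION:
      case Character.MATH_SYMBOL:
      case Character.CURRENCY_SYMBOL:
        return true;
      default:
        return false;
    }
  }

  /** Instance shape: the option is read from the field, so callers pass only the character. */
  boolean isPunctuation(char ch) {
    return isPunctuation(ch, keepDecimalPoint);
  }

  public static void main(String[] args) {
    System.out.println(isPunctuation('.', false));                    // static call, option passed
    System.out.println(new PunctuationSketch(true).isPunctuation('.')); // instance call, option stored
  }
}
```

The instance shape keeps call sites short at the cost of tying the check to a tokenizer instance; the static shape stays usable from other classes but repeats the argument at each call.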
[jira] [Updated] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Namgyu Kim updated LUCENE-8784: --- Attachment: LUCENE-8784.patch > Nori(Korean) tokenizer removes the decimal point. > --- > > Key: LUCENE-8784 > URL: https://issues.apache.org/jira/browse/LUCENE-8784 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Munkyu Im >Priority: Major > Attachments: LUCENE-8784.patch > > > This is the same issue that I mentioned to > [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367] > unlike standard analyzer, nori analyzer removes the decimal point. > nori tokenizer removes "." character by default. > In this case, it is difficult to index the keywords including the decimal > point. > It would be nice if there had the option whether add a decimal point or not. > Like Japanese tokenizer does, Nori need an option to preserve decimal point. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8805) Parameter changes for stringField() in StoredFieldVisitor
[ https://issues.apache.org/jira/browse/LUCENE-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844825#comment-16844825 ] Namgyu Kim commented on LUCENE-8805: Oops, I saw it wrong :( I modified it and uploaded a new patch! Thank you for checking, [~jpountz] > Parameter changes for stringField() in StoredFieldVisitor > - > > Key: LUCENE-8805 > URL: https://issues.apache.org/jira/browse/LUCENE-8805 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Namgyu Kim >Priority: Major > Attachments: LUCENE-8805.patch, LUCENE-8805.patch, LUCENE-8805.patch > > > I wrote this patch after seeing the comments left by [~mikemccand] when > SortingStoredFieldsConsumer class was first created. > {code:java} > @Override > public void binaryField(FieldInfo fieldInfo, byte[] value) throws IOException > { > ... > // TODO: can we avoid new BR here? > ... > } > @Override > public void stringField(FieldInfo fieldInfo, byte[] value) throws IOException > { > ... > // TODO: can we avoid new String here? > ... > } > {code} > I changed two things. > -1) change binaryField() parameters from byte[] to BytesRef.- > 2) change stringField() parameters from byte[] to String. > I also changed the related contents while doing the work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-8805) Parameter changes for stringField() in StoredFieldVisitor
[ https://issues.apache.org/jira/browse/LUCENE-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Namgyu Kim updated LUCENE-8805: --- Attachment: LUCENE-8805.patch > Parameter changes for stringField() in StoredFieldVisitor > - > > Key: LUCENE-8805 > URL: https://issues.apache.org/jira/browse/LUCENE-8805 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Namgyu Kim >Priority: Major > Attachments: LUCENE-8805.patch, LUCENE-8805.patch, LUCENE-8805.patch > > > I wrote this patch after seeing the comments left by [~mikemccand] when > SortingStoredFieldsConsumer class was first created. > {code:java} > @Override > public void binaryField(FieldInfo fieldInfo, byte[] value) throws IOException > { > ... > // TODO: can we avoid new BR here? > ... > } > @Override > public void stringField(FieldInfo fieldInfo, byte[] value) throws IOException > { > ... > // TODO: can we avoid new String here? > ... > } > {code} > I changed two things. > -1) change binaryField() parameters from byte[] to BytesRef.- > 2) change stringField() parameters from byte[] to String. > I also changed the related contents while doing the work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8805) Parameter changes for stringField() in StoredFieldVisitor
[ https://issues.apache.org/jira/browse/LUCENE-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844143#comment-16844143 ]
Namgyu Kim commented on LUCENE-8805:
I uploaded a new patch with only stringField modified (including null check). Thank you.
[jira] [Updated] (LUCENE-8805) Parameter changes for stringField() in StoredFieldVisitor
[ https://issues.apache.org/jira/browse/LUCENE-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Namgyu Kim updated LUCENE-8805:
Attachment: LUCENE-8805.patch
[jira] [Updated] (LUCENE-8805) Parameter changes for stringField() in StoredFieldVisitor
[ https://issues.apache.org/jira/browse/LUCENE-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Namgyu Kim updated LUCENE-8805:
Summary: Parameter changes for stringField() in StoredFieldVisitor (was: Parameter changes for binaryField() and stringField() in StoredFieldVisitor)
[jira] [Updated] (LUCENE-8805) Parameter changes for stringField() in StoredFieldVisitor
[ https://issues.apache.org/jira/browse/LUCENE-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Namgyu Kim updated LUCENE-8805:
Description:
I wrote this patch after seeing the comments left by [~mikemccand] when SortingStoredFieldsConsumer class was first created.
{code:java}
@Override
public void binaryField(FieldInfo fieldInfo, byte[] value) throws IOException {
  ...
  // TODO: can we avoid new BR here?
  ...
}
@Override
public void stringField(FieldInfo fieldInfo, byte[] value) throws IOException {
  ...
  // TODO: can we avoid new String here?
  ...
}
{code}
I changed two things.
-1) change binaryField() parameters from byte[] to BytesRef.-
2) change stringField() parameters from byte[] to String.
I also changed the related contents while doing the work.
was:
I wrote this patch after seeing the comments left by [~mikemccand] when SortingStoredFieldsConsumer class was first created.
{code:java}
@Override
public void binaryField(FieldInfo fieldInfo, byte[] value) throws IOException {
  ...
  // TODO: can we avoid new BR here?
  ...
}
@Override
public void stringField(FieldInfo fieldInfo, byte[] value) throws IOException {
  ...
  // TODO: can we avoid new String here?
  ...
}
{code}
I changed two things.
1) change binaryField() parameters from byte[] to BytesRef.
2) change stringField() parameters from byte[] to String.
I also changed the related contents while doing the work.
[jira] [Commented] (LUCENE-8805) Parameter changes for binaryField() and stringField() in StoredFieldVisitor
[ https://issues.apache.org/jira/browse/LUCENE-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844085#comment-16844085 ]
Namgyu Kim commented on LUCENE-8805:
Sure. I will work only on stringField and update the patch. I'll keep the TODO comment on binaryField.
Do we need additional comments on binaryField? I will add them if you tell me.
[jira] [Commented] (LUCENE-8805) Parameter changes for binaryField() and stringField() in StoredFieldVisitor
[ https://issues.apache.org/jira/browse/LUCENE-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844051#comment-16844051 ]
Namgyu Kim commented on LUCENE-8805:
Thank you for your reply, [~jpountz]. Yes, I agree with you :D
I think it's okay to change the parameters of stringField. However, in the case of binaryField, there may be more disadvantages than advantages.
What do you think about this, [~rcmuir]?
[jira] [Commented] (LUCENE-8805) Parameter changes for binaryField() and stringField() in StoredFieldVisitor
[ https://issues.apache.org/jira/browse/LUCENE-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843485#comment-16843485 ]
Namgyu Kim commented on LUCENE-8805:
Thank you for your reply, [~rcmuir], and I'm sorry for the late reply.
I will upload a new patch within a few days, based on your feedback. (parameter checking, creating a test case, etc.)
[jira] [Updated] (LUCENE-8805) Parameter changes for binaryField() and stringField() in StoredFieldVisitor
[ https://issues.apache.org/jira/browse/LUCENE-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Namgyu Kim updated LUCENE-8805:
Attachment: LUCENE-8805.patch
[jira] [Created] (LUCENE-8805) Parameter changes for binaryField() and stringField() in StoredFieldVisitor
Namgyu Kim created LUCENE-8805:
Summary: Parameter changes for binaryField() and stringField() in StoredFieldVisitor
Key: LUCENE-8805
URL: https://issues.apache.org/jira/browse/LUCENE-8805
Project: Lucene - Core
Issue Type: Improvement
Reporter: Namgyu Kim
I wrote this patch after seeing the comments left by [~mikemccand] when SortingStoredFieldsConsumer class was first created.
{code:java}
@Override
public void binaryField(FieldInfo fieldInfo, byte[] value) throws IOException {
  ...
  // TODO: can we avoid new BR here?
  ...
}
@Override
public void stringField(FieldInfo fieldInfo, byte[] value) throws IOException {
  ...
  // TODO: can we avoid new String here?
  ...
}
{code}
I changed two things.
1) change binaryField() parameters from byte[] to BytesRef.
2) change stringField() parameters from byte[] to String.
I also changed the related contents while doing the work.
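The shape of change 2) can be sketched without Lucene. The two interfaces below are hypothetical stand-ins, not Lucene's StoredFieldVisitor, and the class and method names are invented for illustration. The sketch only shows where the UTF-8 decoding happens under each signature; whether an allocation is actually saved depends on where the real code decodes the value.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class StringFieldSketch {

    /** Hypothetical stand-in for a visitor with the byte[]-based signature. */
    interface BytesVisitor {
        void stringField(String fieldName, byte[] value);
    }

    /** Hypothetical stand-in for a visitor with the String-based signature. */
    interface StringVisitor {
        void stringField(String fieldName, String value);
    }

    /** byte[] style: the visitor has to decode the UTF-8 bytes itself. */
    public static List<String> collectFromBytes(byte[] stored) {
        List<String> out = new ArrayList<>();
        BytesVisitor visitor =
            (name, value) -> out.add(new String(value, StandardCharsets.UTF_8));
        visitor.stringField("title", stored);
        return out;
    }

    /** String style: the producer decodes once; the visitor uses the value as-is. */
    public static List<String> collectFromString(byte[] stored) {
        List<String> out = new ArrayList<>();
        StringVisitor visitor = (name, value) -> out.add(value);
        visitor.stringField("title", new String(stored, StandardCharsets.UTF_8));
        return out;
    }

    public static void main(String[] args) {
        byte[] stored = "size".getBytes(StandardCharsets.UTF_8);
        System.out.println(collectFromBytes(stored));
        System.out.println(collectFromString(stored));
    }
}
```

Both variants yield the same values; the difference is only which side of the call owns the decoding step.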
[jira] [Commented] (LUCENE-8768) Javadoc search support
[ https://issues.apache.org/jira/browse/LUCENE-8768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822529#comment-16822529 ]
Namgyu Kim commented on LUCENE-8768:
Oh, I see that the patch was submitted! I'll close my PR on GitHub.
Thank you, [~thetaphi]. :D
> Javadoc search support
>
> Key: LUCENE-8768
> URL: https://issues.apache.org/jira/browse/LUCENE-8768
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Namgyu Kim
> Assignee: Uwe Schindler
> Priority: Major
> Fix For: master (9.0)
> Attachments: LUCENE-8768.patch, LUCENE-8768.patch, javadoc-nightly.png, new-javadoc.png
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Javadoc search is a new feature since Java 9.
> ([https://openjdk.java.net/jeps/225])
> I think there is no reason not to use it if the current Lucene Java version is 11.
> It can be a great help to developers looking at API documentation.
> (The elastic search also supports it now!
> [https://artifacts.elastic.co/javadoc/org/elasticsearch/client/elasticsearch-rest-client/7.0.0/org/elasticsearch/client/package-summary.html])
>
> ■ Before (Lucene Nightly Core Module Javadoc)
> !javadoc-nightly.png!
> ■ After
> *!new-javadoc.png!*
>
> I'll change two lines for this.
> 1) change Javadoc's noindex option from true to false.
> {code:java}
> // common-build.xml line 182
> {code}
> 2) add javadoc argument "--no-module-directories"
> {code:java}
> // common-build.xml line 2100
> overview="@{overview}"
> additionalparam="--no-module-directories" // NEW CODE
> packagenames="org.apache.lucene.*,org.apache.solr.*"
> ...
> maxmemory="${javadoc.maxmemory}">
> {code}
> Currently there is an issue like the following link in JDK 11, so we need "--no-module-directories" option.
> ([https://bugs.openjdk.java.net/browse/JDK-8215291])
>
> ■ How to test
> I did +"ant javadocs-modules"+ on lucene project and check Javadoc.
[jira] [Comment Edited] (LUCENE-8768) Javadoc search support
[ https://issues.apache.org/jira/browse/LUCENE-8768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822017#comment-16822017 ]
Namgyu Kim edited comment on LUCENE-8768 at 4/19/19 4:05 PM:
Hi, [~jpountz], [~thetaphi].
Thank you so much for your replies; I learned a lot from your comments.
I didn't know that Uwe had +already checked Javadoc search+ and that there was a GNU license issue.
I also agree with giving the user a choice rather than forcing it.
I checked your patch, and I think it's good. (thanks for adding my opinion!)
As for your idea,
*A cool thing would be to maybe install SOLR on the Lucene/Solr web page and index our Javadocs.*
=> It is a great idea that I hadn't thought of at all. I will certainly contribute when you create a JIRA issue or project :D
was (Author: danmuzi):
Hi, [~jpountz], [~thetaphi].
Thank you so much for your reply and I learned a lot from your comments.
I didn't know that Uwe +already checked Javadoc search+ and there was a GNU license issue.
I also agree to give the user a choice rather than force it.
I checked your patch, and I think it's good. (thanks for adding my opinion!)
And your idea,
*- A cool thing would be to maybe install SOLR on the Lucene/Solr web page and index our Javadocs.*
=> It is a great idea I didn't think of at all, I will certainly contribute when you make a JIRA issue or project :D
[jira] [Commented] (LUCENE-8768) Javadoc search support
[ https://issues.apache.org/jira/browse/LUCENE-8768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822017#comment-16822017 ]
Namgyu Kim commented on LUCENE-8768:
Hi, [~jpountz], [~thetaphi].
Thank you so much for your replies; I learned a lot from your comments.
I didn't know that Uwe had +already checked Javadoc search+ and that there was a GNU license issue.
I also agree with giving the user a choice rather than forcing it.
I checked your patch, and I think it's good. (thanks for adding my opinion!)
As for your idea,
*- A cool thing would be to maybe install SOLR on the Lucene/Solr web page and index our Javadocs.*
=> It is a great idea that I hadn't thought of at all. I will certainly contribute when you create a JIRA issue or project :D
[jira] [Updated] (LUCENE-8768) Javadoc search support
[ https://issues.apache.org/jira/browse/LUCENE-8768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Namgyu Kim updated LUCENE-8768:
Description:
Javadoc search is a new feature since Java 9.
([https://openjdk.java.net/jeps/225])
I think there is no reason not to use it if the current Lucene Java version is 11.
It can be a great help to developers looking at API documentation.
(The elastic search also supports it now!
[https://artifacts.elastic.co/javadoc/org/elasticsearch/client/elasticsearch-rest-client/7.0.0/org/elasticsearch/client/package-summary.html])
■ Before (Lucene Nightly Core Module Javadoc)
!javadoc-nightly.png!
■ After
*!new-javadoc.png!*
I'll change two lines for this.
1) change Javadoc's noindex option from true to false.
{code:java}
// common-build.xml line 182
{code}
2) add javadoc argument "--no-module-directories"
{code:java}
// common-build.xml line 2100
{code}
Currently there is an issue like the following link in JDK 11, so we need "--no-module-directories" option.
([https://bugs.openjdk.java.net/browse/JDK-8215291])
■ How to test
I did +"ant javadocs-modules"+ on lucene project and check Javadoc.
was:
Javadoc search is a new feature since Java 9.
([https://openjdk.java.net/jeps/225])
I think there is no reason not to use it if the current Lucene Java version is 11.
It can be a great help to developers looking at API documentation.
(The elastic search also supports it now!
[https://artifacts.elastic.co/javadoc/org/elasticsearch/client/elasticsearch-rest-client/7.0.0/org/elasticsearch/client/package-summary.html])
■ Before (Lucene Nightly Core Module Javadoc)
!javadoc-nightly.png!
■ After
*!new-javadoc.png!*
I'll change two lines for this.
1) change Javadoc's noindex option from true to false.
{code:java}
// common-build.xml line 187
{code}
2) add javadoc argument "--no-module-directories"
{code:java}
// common-build.xml line 2283
{code}
Currently there is an issue like the following link in JDK 11, so we need "--no-module-directories" option.
([https://bugs.openjdk.java.net/browse/JDK-8215291])
■ How to test
I did +"ant javadocs-modules"+ on lucene project and check Javadoc.
[jira] [Updated] (LUCENE-8768) Javadoc search support
[ https://issues.apache.org/jira/browse/LUCENE-8768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Namgyu Kim updated LUCENE-8768:
Description:
Javadoc search is a new feature since Java 9.
([https://openjdk.java.net/jeps/225])
I think there is no reason not to use it if the current Lucene Java version is 11.
It can be a great help to developers looking at API documentation.
(The elastic search also supports it now!
[https://artifacts.elastic.co/javadoc/org/elasticsearch/client/elasticsearch-rest-client/7.0.0/org/elasticsearch/client/package-summary.html])
■ Before (Lucene Nightly Core Module Javadoc)
!javadoc-nightly.png!
■ After
*!new-javadoc.png!*
I'll change two lines for this.
1) change Javadoc's noindex option from true to false.
{code:java}
// common-build.xml line 187
{code}
2) add javadoc argument "--no-module-directories"
{code:java}
// common-build.xml line 2283
{code}
Currently there is an issue like the following link in JDK 11, so we need "--no-module-directories" option.
([https://bugs.openjdk.java.net/browse/JDK-8215291])
■ How to test
I did +"ant javadocs-modules"+ on lucene project and check Javadoc.
was:
Javadoc search is a new feature since Java 9.
([https://openjdk.java.net/jeps/225])
I think there is no reason not to use it if the current Lucene Java version is 11.
It can be a great help to developers looking at API documentation.
(The elastic search also supports it now!
[https://artifacts.elastic.co/javadoc/org/elasticsearch/client/elasticsearch-rest-client/7.0.0/org/elasticsearch/client/package-summary.html])
*- Before (Lucene Nightly Core Module Javadoc) -*
!javadoc-nightly.png!
*- After -*
*!new-javadoc.png!*
I'll change two lines for this.
1) change Javadoc's noindex option from true to false.
{code:java}
// common-build.xml line 187
{code}
2) add javadoc argument "--no-module-directories"
{code:java}
// common-build.xml line 2283
{code}
Currently there is an issue like the following link in JDK 11, so we need "--no-module-directories" option.
([https://bugs.openjdk.java.net/browse/JDK-8215291])
*- How to test -*
I did +"ant javadocs-modules"+ on lucene project and check Javadoc.
[jira] [Created] (LUCENE-8768) Javadoc search support
Namgyu Kim created LUCENE-8768:
Summary: Javadoc search support
Key: LUCENE-8768
URL: https://issues.apache.org/jira/browse/LUCENE-8768
Project: Lucene - Core
Issue Type: New Feature
Reporter: Namgyu Kim
Attachments: javadoc-nightly.png, new-javadoc.png
Javadoc search is a new feature since Java 9.
([https://openjdk.java.net/jeps/225])
I think there is no reason not to use it if the current Lucene Java version is 11.
It can be a great help to developers looking at API documentation.
(The elastic search also supports it now!
[https://artifacts.elastic.co/javadoc/org/elasticsearch/client/elasticsearch-rest-client/7.0.0/org/elasticsearch/client/package-summary.html])
*- Before (Lucene Nightly Core Module Javadoc) -*
!javadoc-nightly.png!
*- After -*
*!new-javadoc.png!*
I'll change two lines for this.
1) change Javadoc's noindex option from true to false.
{code:java}
// common-build.xml line 187
{code}
2) add javadoc argument "--no-module-directories"
{code:java}
// common-build.xml line 2283
{code}
Currently there is an issue like the following link in JDK 11, so we need "--no-module-directories" option.
([https://bugs.openjdk.java.net/browse/JDK-8215291])
*- How to test -*
I did +"ant javadocs-modules"+ on lucene project and check Javadoc.
[jira] [Updated] (LUCENE-8698) Fix replaceIgnoreCase method bug in EscapeQuerySyntaxImpl
[ https://issues.apache.org/jira/browse/LUCENE-8698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Namgyu Kim updated LUCENE-8698: --- Attachment: LUCENE-8698.patch > Fix replaceIgnoreCase method bug in EscapeQuerySyntaxImpl > - > > Key: LUCENE-8698 > URL: https://issues.apache.org/jira/browse/LUCENE-8698 > Project: Lucene - Core > Issue Type: Bug > Components: core/queryparser >Reporter: Namgyu Kim >Priority: Major > Attachments: LUCENE-8698.patch > > > It is a patch of LUCENE-8572 issue from [~tonicava]. > > There is a serious bug in the replaceIgnoreCase method of the > EscapeQuerySyntaxImpl class. > This issue can affect QueryNode. (StringIndexOutOfBoundsException) > As I mentioned in comment of the issue, the String#toLowerCase() causes the > array to grow in size. > {code:java} > private static CharSequence replaceIgnoreCase(CharSequence string, > CharSequence sequence1, CharSequence escapeChar, Locale locale) { > // string = "İpone " [304, 112, 111, 110, 101, 32], size = 6 > ... > while (start < count) { > // Convert by toLowerCase as follows. > // string = "i'̇pone " [105, 775, 112, 111, 110, 101, 32], size = 7 > // firstIndex will be set 6. > if ((firstIndex = string.toString().toLowerCase(locale).indexOf(first, > start)) == -1) > break; > boolean found = true; > ... > if (found) { > // In this line, String.toString() will only have a range of 0 to 5. > // So here we get a StringIndexOutOfBoundsException. > result.append(string.toString().substring(copyStart, firstIndex)); > ... > } else { > start = firstIndex + 1; > } > } > ... > }{code} > Maintaining the overall structure and fixing bug is very simple. > If we change to the following code, the method works fine. > > {code:java} > // Line 135 ~ 136 > // BEFORE > if ((firstIndex = string.toString().toLowerCase(locale).indexOf(first, > start)) == -1) > // AFTER > if ((firstIndex = string.toString().indexOf(first, start)) == -1) > {code} > > > But I wonder if this is the best way. 
> What do you think about using String#replace() instead?
>
> {code:java}
> // SAMPLE: escapeWhiteChar (escapeChar and escapeQuoted are the same)
> // BEFORE
> private static final CharSequence escapeWhiteChar(CharSequence str, Locale locale) {
>   ...
>   for (int i = 0; i < escapableWhiteChars.length; i++) {
>     buffer = replaceIgnoreCase(buffer, escapableWhiteChars[i].toLowerCase(locale), "\\", locale);
>   }
>   ...
> }
> // AFTER
> private static final CharSequence escapeWhiteChar(CharSequence str, Locale locale) {
>   ...
>   for (int i = 0; i < escapableWhiteChars.length; i++) {
>     buffer = buffer.toString().replace(escapableWhiteChars[i], "\\" + escapableWhiteChars[i]);
>   }
>   ...
> }
> {code}
>
> For now, I have uploaded the patch that uses String#replace().
> If you give me some feedback, I will look into it :D

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-8698) Fix replaceIgnoreCase method bug in EscapeQuerySyntaxImpl
Namgyu Kim created LUCENE-8698: -- Summary: Fix replaceIgnoreCase method bug in EscapeQuerySyntaxImpl Key: LUCENE-8698 URL: https://issues.apache.org/jira/browse/LUCENE-8698 Project: Lucene - Core Issue Type: Bug Components: core/queryparser Reporter: Namgyu Kim It is a patch of LUCENE-8572 issue from [~tonicava]. There is a serious bug in the replaceIgnoreCase method of the EscapeQuerySyntaxImpl class. This issue can affect QueryNode. (StringIndexOutOfBoundsException) As I mentioned in comment of the issue, the String#toLowerCase() causes the array to grow in size. {code:java} private static CharSequence replaceIgnoreCase(CharSequence string, CharSequence sequence1, CharSequence escapeChar, Locale locale) { // string = "İpone " [304, 112, 111, 110, 101, 32], size = 6 ... while (start < count) { // Convert by toLowerCase as follows. // string = "i'̇pone " [105, 775, 112, 111, 110, 101, 32], size = 7 // firstIndex will be set 6. if ((firstIndex = string.toString().toLowerCase(locale).indexOf(first, start)) == -1) break; boolean found = true; ... if (found) { // In this line, String.toString() will only have a range of 0 to 5. // So here we get a StringIndexOutOfBoundsException. result.append(string.toString().substring(copyStart, firstIndex)); ... } else { start = firstIndex + 1; } } ... }{code} Maintaining the overall structure and fixing bug is very simple. If we change to the following code, the method works fine. {code:java} // Line 135 ~ 136 // BEFORE if ((firstIndex = string.toString().toLowerCase(locale).indexOf(first, start)) == -1) // AFTER if ((firstIndex = string.toString().indexOf(first, start)) == -1) {code} But I wonder if this is the best way. How do you think about using String#replace() instead? {code:java} // SAMPLE : escapeWhiteChar (escapeChar and escapeQuoted are same) // BEFORE private static final CharSequence escapeWhiteChar(CharSequence str, Locale locale) { ... 
  for (int i = 0; i < escapableWhiteChars.length; i++) {
    buffer = replaceIgnoreCase(buffer, escapableWhiteChars[i].toLowerCase(locale), "\\", locale);
  }
  ...
}
// AFTER
private static final CharSequence escapeWhiteChar(CharSequence str, Locale locale) {
  ...
  for (int i = 0; i < escapableWhiteChars.length; i++) {
    buffer = buffer.toString().replace(escapableWhiteChars[i], "\\" + escapableWhiteChars[i]);
  }
  ...
}
{code}
For now, I have uploaded the patch that uses String#replace().
If you give me some feedback, I will look into it :D
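The String#replace() idea from the message above can be tried in isolation. The following is only a minimal sketch, not the Lucene source: the class name `EscapeSketch` and the contents of `ESCAPABLE_WHITE_CHARS` are assumptions made for illustration, and the real `escapableWhiteChars` array in EscapeQuerySyntaxImpl may contain different entries.

```java
public class EscapeSketch {
  // Hypothetical stand-in for escapableWhiteChars; the real array may differ.
  private static final String[] ESCAPABLE_WHITE_CHARS = {" ", "\t", "\n", "\r"};

  // Prefixes every listed whitespace character with a backslash.
  // String#replace searches the original string directly, so no index
  // computed on a lower-cased copy is ever applied to the original.
  public static String escapeWhiteChar(String str) {
    String buffer = str;
    for (String ws : ESCAPABLE_WHITE_CHARS) {
      buffer = buffer.replace(ws, "\\" + ws);
    }
    return buffer;
  }

  public static void main(String[] args) {
    // "\u0130" is the capital dotted I from the issue's example input.
    System.out.println(escapeWhiteChar("\u0130pone "));
  }
}
```

Because whitespace has no case, dropping the lower-casing step should not change which characters get escaped in this sketch.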
[jira] [Comment Edited] (LUCENE-8572) StringIndexOutOfBoundsException in parser/EscapeQuerySyntaxImpl.java
[ https://issues.apache.org/jira/browse/LUCENE-8572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705249#comment-16705249 ] Namgyu Kim edited comment on LUCENE-8572 at 11/30/18 8:37 PM: -- Hi, [~romseygeek], [~thetaphi]. I checked the issue and found that it could be a logical problem. First, I think it's not a Locale problem, but a replace algorithm(replaceIgnoreCase) itself. When you see the escapeWhiteChar(), it calls the replaceIgnoreCase() internally. (escapeTerm() -> escapeWhiteChar() -> replaceIgnoreCase()) {code:java} private static CharSequence replaceIgnoreCase(CharSequence string, CharSequence sequence1, CharSequence escapeChar, Locale locale) { // string = "İpone " [304, 112, 111, 110, 101, 32], size = 6 ... while (start < count) { // Convert by toLowerCase as follows. // string = "i'̇pone " [105, 775, 112, 111, 110, 101, 32], size = 7 // firstIndex will be set 6. if ((firstIndex = string.toString().toLowerCase(locale).indexOf(first, start)) == -1) break; boolean found = true; ... if (found) { // In this line, String.toString() will only have a range of 0 to 5. // So here we get a StringIndexOutOfBoundsException. result.append(string.toString().substring(copyStart, firstIndex)); ... } else { start = firstIndex + 1; } } ... } {code} Solving this may not be a big problem. But what do you think about using {code:java} public static final CharSequence escapeWhiteChar(CharSequence str, Locale locale) { ... for (int i = 0; i < escapableWhiteChars.length; i++) { // Use String's replace method. buffer = buffer.toString().replace(escapableWhiteChars[i], "\\"); } return buffer; } {code} instead of {code:java} public static final CharSequence escapeWhiteChar(CharSequence str, Locale locale) { ... for (int i = 0; i < escapableWhiteChars.length; i++) { // Stay current method. buffer = replaceIgnoreCase(buffer, escapableWhiteChars[i].toLowerCase(locale), "\\", locale); } return buffer; } {code} in the escapeWhiteChar method? 
> StringIndexOutOfBoundsException in parser/EscapeQuerySyntaxImpl.java > > > Key: LUCENE-8572 > URL: https://issues.apache.org/jira/browse/LUCENE-8572 > Project: Lucene - Core > Issue Type: Bug > Components: core/queryparser >Affects Versions: 6.3 >Reporter: Octavian Mocanu >Priority: Major > > With "lucene-queryparser-6.3.0", specifically in > "org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java" > > when escaping strings containing extended unicode chars, and with a locale > distinct from that of the character set the string uses, the process fails, > with a
[jira] [Commented] (LUCENE-8572) StringIndexOutOfBoundsException in parser/EscapeQuerySyntaxImpl.java
[ https://issues.apache.org/jira/browse/LUCENE-8572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705249#comment-16705249 ] Namgyu Kim commented on LUCENE-8572: Hi, [~romseygeek], [~thetaphi]. I checked the issue and it could be a logical problem. First, I think it's not a Locale problem, but a replace algorithm(replaceIgnoreCase) itself. When you see the escapeWhiteChar(), it calls the replaceIgnoreCase() internally. (escapeTerm() -> escapeWhiteChar() -> replaceIgnoreCase()) {code:java} private static CharSequence replaceIgnoreCase(CharSequence string, CharSequence sequence1, CharSequence escapeChar, Locale locale) { // string = "İpone " [304, 112, 111, 110, 101, 32], size = 6 ... while (start < count) { // Convert by toLowerCase as follows. // string = "i'̇pone " [105, 775, 112, 111, 110, 101, 32], size = 7 // firstIndex will be set 6. if ((firstIndex = string.toString().toLowerCase(locale).indexOf(first, start)) == -1) break; boolean found = true; ... if (found) { // In this line, String.toString() will only have a range of 0 to 5. // So here we get a StringIndexOutOfBoundsException. result.append(string.toString().substring(copyStart, firstIndex)); ... } else { start = firstIndex + 1; } } ... } {code} Solving this may not be a big problem. But what do you think about using {code:java} public static final CharSequence escapeWhiteChar(CharSequence str, Locale locale) { ... for (int i = 0; i < escapableWhiteChars.length; i++) { // Use String's replace method. buffer = buffer.toString().replace(escapableWhiteChars[i], "\\"); } return buffer; } {code} instead of {code:java} public static final CharSequence escapeWhiteChar(CharSequence str, Locale locale) { ... for (int i = 0; i < escapableWhiteChars.length; i++) { // Stay current method. buffer = replaceIgnoreCase(buffer, escapableWhiteChars[i].toLowerCase(locale), "\\", locale); } return buffer; } {code} in the escapeWhiteChar method? 
> StringIndexOutOfBoundsException in parser/EscapeQuerySyntaxImpl.java > > > Key: LUCENE-8572 > URL: https://issues.apache.org/jira/browse/LUCENE-8572 > Project: Lucene - Core > Issue Type: Bug > Components: core/queryparser >Affects Versions: 6.3 >Reporter: Octavian Mocanu >Priority: Major > > With "lucene-queryparser-6.3.0", specifically in > "org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java" > > when escaping strings containing extended unicode chars, and with a locale > distinct from that of the character set the string uses, the process fails, > with a "java.lang.StringIndexOutOfBoundsException". > > The reason is that the comparison is done by previously converting all of the > characters of the string to lower case chars, and by doing this, the original > string size isn't anymore the same, but less, as of the transformed one, so > that executing > > org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java:89 > fails with a java.lang.StringIndexOutOfBoundsException. > I wonder whether the transformation to lower case is really needed when > treating the escape chars, since by avoiding it, the error may be avoided. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
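The length change described in this thread can be reproduced with the JDK alone. This is a small sketch, assuming `Locale.ROOT` and a made-up input ("İİ " rather than the "İpone " from the comment) so that the index drift is large enough to overflow; the class and method names are hypothetical and nothing here is taken from EscapeQuerySyntaxImpl.

```java
import java.util.Locale;

public class LowerCaseDriftSketch {
  // Index of `needle` as seen in the lower-cased copy of `s`.
  public static int indexInLowered(String s, String needle, Locale locale) {
    return s.toLowerCase(locale).indexOf(needle);
  }

  // True if using `end` as a substring bound on `s` throws.
  public static boolean sliceFails(String s, int end) {
    try {
      s.substring(0, end);
      return false;
    } catch (StringIndexOutOfBoundsException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    String original = "\u0130\u0130 "; // two capital dotted I plus a space: 3 chars
    // Outside Turkic locales each U+0130 lower-cases to 'i' + U+0307, so the copy has 5 chars.
    String lowered = original.toLowerCase(Locale.ROOT);
    int idx = indexInLowered(original, " ", Locale.ROOT); // 4, an index into the lowered copy
    System.out.println(original.length() + " -> " + lowered.length() + ", index " + idx);
    System.out.println("substring on original fails: " + sliceFails(original, idx));
  }
}
```

The point of the sketch is only that an index computed on the lower-cased copy is not a valid index into the original string once the two lengths differ.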
[jira] [Updated] (LUCENE-8582) Set parent class of DutchAnalyzer to StopwordAnalyzerBase
[ https://issues.apache.org/jira/browse/LUCENE-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Namgyu Kim updated LUCENE-8582: --- Attachment: LUCENE-8582.patch > Set parent class of DutchAnalyzer to StopwordAnalyzerBase > - > > Key: LUCENE-8582 > URL: https://issues.apache.org/jira/browse/LUCENE-8582 > Project: Lucene - Core > Issue Type: Task > Components: modules/analysis >Reporter: Namgyu Kim >Priority: Major > Attachments: LUCENE-8582.patch > > > Currently the parent class of DutchAnalyzer is *Analyzer*. > And I saw the comment > {code:java} > // TODO: extend StopwordAnalyzerBase > {code} > in DutchAnalyzer. > > So I changed the code as follows. > {code:java} > public final class DutchAnalyzer extends StopwordAnalyzerBase { > ... > > // This instance is no longer necessary. > // private final CharArraySet stoptable; > > public DutchAnalyzer(CharArraySet stopwords, CharArraySet > stemExclusionTable, Ch1arArrayMap stemOverrideDict) { > super(stopwords); // Use StopwordAnalyzerBase's constructor to set > stopwords. > ... > } > ... > @Override > protected TokenStreamComponents createComponents(String fieldName) { > ... > result = new StopFilter(result, stopwords); // Use StopwordAnalyzerBase's > instance > ... > } > ... > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-8582) Set parent class of DutchAnalyzer to StopwordAnalyzerBase
[ https://issues.apache.org/jira/browse/LUCENE-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Namgyu Kim updated LUCENE-8582: --- Description: Currently the parent class of DutchAnalyzer is *Analyzer*. And I saw the comment {code:java} // TODO: extend StopwordAnalyzerBase {code} in DutchAnalyzer. So I changed the code as follows. {code:java} public final class DutchAnalyzer extends StopwordAnalyzerBase { ... // This instance is no longer necessary. // private final CharArraySet stoptable; public DutchAnalyzer(CharArraySet stopwords, CharArraySet stemExclusionTable, CharArrayMap stemOverrideDict) { super(stopwords); // Use StopwordAnalyzerBase's constructor to set stopwords. ... } ... @Override protected TokenStreamComponents createComponents(String fieldName) { ... result = new StopFilter(result, stopwords); // Use StopwordAnalyzerBase's instance ... } ... } {code} was: Currently the parent class of DutchAnalyzer is *Analyzer*. And I saw the comment {code:java} // TODO: extend StopwordAnalyzerBase {code} in DutchAnalyzer. So I changed the code as follows. {code:java} public final class DutchAnalyzer extends StopwordAnalyzerBase { ... // This instance is no longer necessary. // private final CharArraySet stoptable; public DutchAnalyzer(CharArraySet stopwords, CharArraySet stemExclusionTable, Ch1arArrayMap stemOverrideDict) { super(stopwords); // Use StopwordAnalyzerBase's constructor to set stopwords. ... } ... @Override protected TokenStreamComponents createComponents(String fieldName) { ... result = new StopFilter(result, stopwords); // Use StopwordAnalyzerBase's instance ... } ... } {code} > Set parent class of DutchAnalyzer to StopwordAnalyzerBase > - > > Key: LUCENE-8582 > URL: https://issues.apache.org/jira/browse/LUCENE-8582 > Project: Lucene - Core > Issue Type: Task > Components: modules/analysis >Reporter: Namgyu Kim >Priority: Major > Attachments: LUCENE-8582.patch > > > Currently the parent class of DutchAnalyzer is *Analyzer*. 
> And I saw the comment > {code:java} > // TODO: extend StopwordAnalyzerBase > {code} > in DutchAnalyzer. > > So I changed the code as follows. > {code:java} > public final class DutchAnalyzer extends StopwordAnalyzerBase { > ... > > // This instance is no longer necessary. > // private final CharArraySet stoptable; > > public DutchAnalyzer(CharArraySet stopwords, CharArraySet > stemExclusionTable, CharArrayMap stemOverrideDict) { > super(stopwords); // Use StopwordAnalyzerBase's constructor to set > stopwords. > ... > } > ... > @Override > protected TokenStreamComponents createComponents(String fieldName) { > ... > result = new StopFilter(result, stopwords); // Use StopwordAnalyzerBase's > instance > ... > } > ... > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-8582) Set parent class of DutchAnalyzer to StopwordAnalyzerBase
Namgyu Kim created LUCENE-8582: -- Summary: Set parent class of DutchAnalyzer to StopwordAnalyzerBase Key: LUCENE-8582 URL: https://issues.apache.org/jira/browse/LUCENE-8582 Project: Lucene - Core Issue Type: Task Components: modules/analysis Reporter: Namgyu Kim Currently the parent class of DutchAnalyzer is *Analyzer*. And I saw the comment {code:java} // TODO: extend StopwordAnalyzerBase {code} in DutchAnalyzer. So I changed the code as follows. {code:java} public final class DutchAnalyzer extends StopwordAnalyzerBase { ... // This instance is no longer necessary. // private final CharArraySet stoptable; public DutchAnalyzer(CharArraySet stopwords, CharArraySet stemExclusionTable, Ch1arArrayMap stemOverrideDict) { super(stopwords); // Use StopwordAnalyzerBase's constructor to set stopwords. ... } ... @Override protected TokenStreamComponents createComponents(String fieldName) { ... result = new StopFilter(result, stopwords); // Use StopwordAnalyzerBase's instance ... } ... } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
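The shape of this refactoring (the stopword set moves into the base class and is handed over through the super constructor) can be illustrated without Lucene on the classpath. Everything below is a hypothetical stand-in: `StopwordBase` and `DutchSketch` only mimic the pattern and are not the StopwordAnalyzerBase or DutchAnalyzer APIs, and the whitespace tokenization is an assumption chosen to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopwordBaseSketch {
  // Hypothetical stand-in for a stopword-aware base class: it owns the set.
  public abstract static class StopwordBase {
    protected final Set<String> stopwords;

    protected StopwordBase(Set<String> stopwords) {
      this.stopwords = Collections.unmodifiableSet(new HashSet<>(stopwords));
    }

    public abstract List<String> analyze(String text);
  }

  // The subclass no longer keeps its own "stoptable" field.
  public static final class DutchSketch extends StopwordBase {
    public DutchSketch(Set<String> stopwords) {
      super(stopwords); // the base class stores the stopwords
    }

    @Override
    public List<String> analyze(String text) {
      List<String> out = new ArrayList<>();
      for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
        if (!token.isEmpty() && !stopwords.contains(token)) {
          out.add(token); // filtering reads the inherited field
        }
      }
      return out;
    }
  }
}
```

A usage example would be `new StopwordBaseSketch.DutchSketch(stopwordSet).analyze("De kat en het huis")`, where `stopwordSet` is whatever set the caller supplies.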
[jira] [Commented] (LUCENE-8575) Improve toString() in SegmentInfo
[ https://issues.apache.org/jira/browse/LUCENE-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705082#comment-16705082 ] Namgyu Kim commented on LUCENE-8575: Thanks for applying my code, [~jpountz] :D > Improve toString() in SegmentInfo > - > > Key: LUCENE-8575 > URL: https://issues.apache.org/jira/browse/LUCENE-8575 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Namgyu Kim >Priority: Major > Fix For: master (8.0), 7.7 > > Attachments: LUCENE-8575.patch, LUCENE-8575.patch > > > I saw the following code in SegmentInfo class. > {code:java} > // TODO: we could append toString of attributes() here? > {code} > Of course, we can. > > So I wrote a code for that part. > {code:java} > public String toString(int delCount) { > StringBuilder s = new StringBuilder(); > s.append(name).append('(').append(version == null ? "?" : > version).append(')').append(':'); > char cfs = getUseCompoundFile() ? 'c' : 'C'; > s.append(cfs); > s.append(maxDoc); > if (delCount != 0) { > s.append('/').append(delCount); > } > if (indexSort != null) { > s.append(":[indexSort="); > s.append(indexSort); > s.append(']'); > } > // New Code > if (!diagnostics.isEmpty()) { > s.append(":[diagnostics="); > for (Map.Entry entry : diagnostics.entrySet()) > > s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,"); > s.setLength(s.length() - 1); > s.append(']'); > } > // New Code > if (!attributes.isEmpty()) { > s.append(":[attributes="); > for (Map.Entry entry : attributes.entrySet()) > > s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,"); > s.setLength(s.length() - 1); > s.append(']'); > } > return s.toString(); > } > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-8575) Improve toString() in SegmentInfo
[ https://issues.apache.org/jira/browse/LUCENE-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701982#comment-16701982 ] Namgyu Kim edited comment on LUCENE-8575 at 11/28/18 3:24 PM: -- Thank you for your reply, [~jpountz] :D I uploaded a new patch that reflected your opinion. before: TEST(8.0.0):C1:[indexSort=]:[diagnostics=,]:[attributes=,] after : TEST(8.0.0):C1:[indexSort=]:[diagnostics=\{key1=value1, key2=value2}]:[attributes=\{key1=value1, key2=value2}] was (Author: danmuzi): Thank you for your reply, [~jpountz] :D I uploaded a new patch that reflected your opinion. before: TEST(8.0.0):C1:[indexSort=]:[diagnostics=,]:[attributes=,] after : TEST(8.0.0):C1:[indexSort=]:[diagnostics=\{key1=value1, key2=value2}]:[attributes=\{key1=value1, key2=value2}] > Improve toString() in SegmentInfo > - > > Key: LUCENE-8575 > URL: https://issues.apache.org/jira/browse/LUCENE-8575 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Namgyu Kim >Priority: Major > Attachments: LUCENE-8575.patch, LUCENE-8575.patch > > > I saw the following code in SegmentInfo class. > {code:java} > // TODO: we could append toString of attributes() here? > {code} > Of course, we can. > > So I wrote a code for that part. > {code:java} > public String toString(int delCount) { > StringBuilder s = new StringBuilder(); > s.append(name).append('(').append(version == null ? "?" : > version).append(')').append(':'); > char cfs = getUseCompoundFile() ? 
'c' : 'C'; > s.append(cfs); > s.append(maxDoc); > if (delCount != 0) { > s.append('/').append(delCount); > } > if (indexSort != null) { > s.append(":[indexSort="); > s.append(indexSort); > s.append(']'); > } > // New Code > if (!diagnostics.isEmpty()) { > s.append(":[diagnostics="); > for (Map.Entry entry : diagnostics.entrySet()) > > s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,"); > s.setLength(s.length() - 1); > s.append(']'); > } > // New Code > if (!attributes.isEmpty()) { > s.append(":[attributes="); > for (Map.Entry entry : attributes.entrySet()) > > s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,"); > s.setLength(s.length() - 1); > s.append(']'); > } > return s.toString(); > } > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8575) Improve toString() in SegmentInfo
[ https://issues.apache.org/jira/browse/LUCENE-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701982#comment-16701982 ] Namgyu Kim commented on LUCENE-8575: Thank you for your reply, [~jpountz] :D I uploaded a new patch that reflected your opinion. before: TEST(8.0.0):C1:[indexSort=]:[diagnostics=,]:[attributes=,] after : TEST(8.0.0):C1:[indexSort=]:[diagnostics=\{key1=value1, key2=value2}]:[attributes=\{key1=value1, key2=value2}] > Improve toString() in SegmentInfo > - > > Key: LUCENE-8575 > URL: https://issues.apache.org/jira/browse/LUCENE-8575 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Namgyu Kim >Priority: Major > Attachments: LUCENE-8575.patch, LUCENE-8575.patch > > > I saw the following code in SegmentInfo class. > {code:java} > // TODO: we could append toString of attributes() here? > {code} > Of course, we can. > > So I wrote a code for that part. > {code:java} > public String toString(int delCount) { > StringBuilder s = new StringBuilder(); > s.append(name).append('(').append(version == null ? "?" : > version).append(')').append(':'); > char cfs = getUseCompoundFile() ? 
'c' : 'C'; > s.append(cfs); > s.append(maxDoc); > if (delCount != 0) { > s.append('/').append(delCount); > } > if (indexSort != null) { > s.append(":[indexSort="); > s.append(indexSort); > s.append(']'); > } > // New Code > if (!diagnostics.isEmpty()) { > s.append(":[diagnostics="); > for (Map.Entry entry : diagnostics.entrySet()) > > s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,"); > s.setLength(s.length() - 1); > s.append(']'); > } > // New Code > if (!attributes.isEmpty()) { > s.append(":[attributes="); > for (Map.Entry entry : attributes.entrySet()) > > s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,"); > s.setLength(s.length() - 1); > s.append(']'); > } > return s.toString(); > } > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
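The two output styles discussed in this comment can be compared side by side. This is a sketch under assumptions: `manual` mirrors the `<key,value>,` loop from the quoted description, while `viaMapToString` simply relies on `java.util.Map#toString`, whose format matches the "after" line above; whether the committed patch calls `toString()` directly is not confirmed by this thread. The class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SegmentInfoToStringSketch {
  // Style of the first patch: <key,value> pairs separated by commas.
  public static String manual(Map<String, String> map) {
    StringBuilder s = new StringBuilder();
    for (Map.Entry<String, String> entry : map.entrySet()) {
      s.append('<').append(entry.getKey()).append(',').append(entry.getValue()).append(">,");
    }
    if (s.length() > 0) {
      s.setLength(s.length() - 1); // drop the trailing comma
    }
    return s.toString();
  }

  // Style of the later patch: the JDK's own map formatting.
  public static String viaMapToString(Map<String, String> map) {
    return map.toString();
  }

  public static void main(String[] args) {
    Map<String, String> diagnostics = new LinkedHashMap<>();
    diagnostics.put("key1", "value1");
    diagnostics.put("key2", "value2");
    System.out.println("[diagnostics=" + manual(diagnostics) + "]");
    System.out.println("[diagnostics=" + viaMapToString(diagnostics) + "]");
  }
}
```

A LinkedHashMap is used here only so that the iteration order is predictable in the printed result.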
[jira] [Updated] (LUCENE-8575) Improve toString() in SegmentInfo
[ https://issues.apache.org/jira/browse/LUCENE-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Namgyu Kim updated LUCENE-8575: --- Attachment: LUCENE-8575.patch > Improve toString() in SegmentInfo > - > > Key: LUCENE-8575 > URL: https://issues.apache.org/jira/browse/LUCENE-8575 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Namgyu Kim >Priority: Major > Attachments: LUCENE-8575.patch, LUCENE-8575.patch > > > I saw the following code in SegmentInfo class. > {code:java} > // TODO: we could append toString of attributes() here? > {code} > Of course, we can. > > So I wrote a code for that part. > {code:java} > public String toString(int delCount) { > StringBuilder s = new StringBuilder(); > s.append(name).append('(').append(version == null ? "?" : > version).append(')').append(':'); > char cfs = getUseCompoundFile() ? 'c' : 'C'; > s.append(cfs); > s.append(maxDoc); > if (delCount != 0) { > s.append('/').append(delCount); > } > if (indexSort != null) { > s.append(":[indexSort="); > s.append(indexSort); > s.append(']'); > } > // New Code > if (!diagnostics.isEmpty()) { > s.append(":[diagnostics="); > for (Map.Entry entry : diagnostics.entrySet()) > > s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,"); > s.setLength(s.length() - 1); > s.append(']'); > } > // New Code > if (!attributes.isEmpty()) { > s.append(":[attributes="); > for (Map.Entry entry : attributes.entrySet()) > > s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,"); > s.setLength(s.length() - 1); > s.append(']'); > } > return s.toString(); > } > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-8575) Improve toString() in SegmentInfo
[ https://issues.apache.org/jira/browse/LUCENE-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Namgyu Kim updated LUCENE-8575: --- Attachment: LUCENE-8575.patch > Improve toString() in SegmentInfo > - > > Key: LUCENE-8575 > URL: https://issues.apache.org/jira/browse/LUCENE-8575 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Namgyu Kim >Priority: Major > Attachments: LUCENE-8575.patch > > > I saw the following code in SegmentInfo class. > {code:java} > // TODO: we could append toString of attributes() here? > {code} > Of course, we can. > > So I wrote a code for that part. > {code:java} > public String toString(int delCount) { > StringBuilder s = new StringBuilder(); > s.append(name).append('(').append(version == null ? "?" : > version).append(')').append(':'); > char cfs = getUseCompoundFile() ? 'c' : 'C'; > s.append(cfs); > s.append(maxDoc); > if (delCount != 0) { > s.append('/').append(delCount); > } > if (indexSort != null) { > s.append(":[indexSort="); > s.append(indexSort); > s.append(']'); > } > // New Code > if (!diagnostics.isEmpty()) { > s.append(":[diagnostics="); > for (Map.Entry entry : diagnostics.entrySet()) > > s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,"); > s.setLength(s.length() - 1); > s.append(']'); > } > // New Code > if (!attributes.isEmpty()) { > s.append(":[attributes="); > for (Map.Entry entry : attributes.entrySet()) > > s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,"); > s.setLength(s.length() - 1); > s.append(']'); > } > return s.toString(); > } > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-8575) Improve toString() in SegmentInfo
Namgyu Kim created LUCENE-8575: -- Summary: Improve toString() in SegmentInfo Key: LUCENE-8575 URL: https://issues.apache.org/jira/browse/LUCENE-8575 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: Namgyu Kim I saw the following code in SegmentInfo class. {code:java} // TODO: we could append toString of attributes() here? {code} Of course, we can. So I wrote a code for that part. {code:java} public String toString(int delCount) { StringBuilder s = new StringBuilder(); s.append(name).append('(').append(version == null ? "?" : version).append(')').append(':'); char cfs = getUseCompoundFile() ? 'c' : 'C'; s.append(cfs); s.append(maxDoc); if (delCount != 0) { s.append('/').append(delCount); } if (indexSort != null) { s.append(":[indexSort="); s.append(indexSort); s.append(']'); } // New Code if (!diagnostics.isEmpty()) { s.append(":[diagnostics="); for (Map.Entry entry : diagnostics.entrySet()) s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,"); s.setLength(s.length() - 1); s.append(']'); } // New Code if (!attributes.isEmpty()) { s.append(":[attributes="); for (Map.Entry entry : attributes.entrySet()) s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,"); s.setLength(s.length() - 1); s.append(']'); } return s.toString(); } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8553) New KoreanDecomposeFilter for KoreanAnalyzer(Nori)
[ https://issues.apache.org/jira/browse/LUCENE-8553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673285#comment-16673285 ]

Namgyu Kim commented on LUCENE-8553:

Thank you for your comments :D [~rcmuir], [~thetaphi].
Yes, both of you are right. I know that it is possible to do "Hangul-Jamo" separation by using ICU.
However, I am not sure whether the *"Hangul" -> "Choseong"* conversion or the *"dual chars" (like "ㄲ", "ㅆ", "ㅢ", ...)* conversion can be performed with that library.
These conversions are also important features of this TokenFilter, and I have used a HashMap or a separate array to reduce their time complexity.
That's why I didn't use the ICU library.

> New KoreanDecomposeFilter for KoreanAnalyzer(Nori)
>
> Key: LUCENE-8553
> URL: https://issues.apache.org/jira/browse/LUCENE-8553
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Namgyu Kim
> Priority: Major
> Attachments: LUCENE-8553.patch
>
> This is a patch for KoreanDecomposeFilter.
> This filter can be used to decompose Hangul.
> (ex) 한글 -> ㅎㄱ or ㅎㅏㄴㄱㅡㄹ)
> Hangul input is very unique.
> If you want to type apple in English,
> you can type it in the order a -> p -> p -> l -> e.
> However, if you want to input "Hangul" in Hangul,
> you have to type it in the order of ㅎ -> ㅏ -> ㄴ -> ㄱ -> ㅡ -> ㄹ.
> (Because of the keyboard shape)
> This means that spell check with existing full Hangul can be less accurate.
>
> The structure of Hangul consists of elements such as *"Choseong"*, *"Jungseong"*, and *"Jongseong"*.
> These three elements are called *"Jamo"*.
> If you have the Korean word "된장찌개" (that means Soybean Paste Stew)
> *"Choseong"* means "ㄷ, ㅈ, ㅉ, ㄱ",
> *"Jungseong"* means "ㅚ, ㅏ, ㅣ, ㅐ",
> *"Jongseong"* means "ㄴ, ㅇ".
> The reason for Jamo separation is explained above.
> (spell check)
> Also, the reason we need a "Choseong Filter" is that many Koreans use *"Choseong Search"* (especially in mobile environments).
> If you want to search for "된장찌개", you need 10 keystrokes, which is quite a lot.
> For that reason, I think it would be useful to provide a filter that allows searching by "ㄷㅈㅉㄱ".
> Hangul also has *dual chars*, such as
> "ㄲ, ㄸ, ㅃ, ㅆ, ㅉ, ㅚ (ㅗ + ㅣ), ㅢ (ㅡ + ㅣ), ...".
> For such reasons,
> KoreanDecompose offers *5 options*,
> ex) *된장찌개* => [된장], [찌개]
> *1) ORIGIN*
> [된장], [찌개]
> *2) SINGLECHOSEONG*
> [ㄷㅈ], [ㅉㄱ]
> *3) DUALCHOSEONG*
> [ㄷㅈ], [ㅈㅈㄱ]
> *4) SINGLEJAMO*
> [ㄷㅚㄴㅈㅏㅇ], [ㅉㅣㄱㅐ]
> *5) DUALJAMO*
> [ㄷㅗㅣㄴㅈㅏㅇ], [ㅈㅈㅣㄱㅐ]
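The Choseong and Jamo outputs described in this issue can be approximated with plain Unicode arithmetic on precomposed Hangul syllables (U+AC00 to U+D7A3). The sketch below is not the patch's implementation, which according to the comment uses a HashMap or array tables; `JamoSketch`, `choseong` and `jamo` are hypothetical names, and only the behavior of the SINGLECHOSEONG and SINGLEJAMO options is imitated (no splitting of dual chars).

```java
public class JamoSketch {
  private static final int HANGUL_BASE = 0xAC00; // first precomposed syllable
  private static final int HANGUL_LAST = 0xD7A3; // last precomposed syllable
  private static final int JUNG_COUNT = 21;
  private static final int JONG_COUNT = 28;

  // Compatibility Jamo in syllable-composition order.
  private static final String CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ";
  private static final String JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ";
  // Index 0 means "no final consonant"; the space is only a placeholder.
  private static final String JONGSEONG = " ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ";

  private static boolean isSyllable(char c) {
    return c >= HANGUL_BASE && c <= HANGUL_LAST;
  }

  // Keeps only the initial consonant of each syllable; other chars pass through.
  public static String choseong(String text) {
    StringBuilder out = new StringBuilder();
    for (char c : text.toCharArray()) {
      if (isSyllable(c)) {
        out.append(CHOSEONG.charAt((c - HANGUL_BASE) / (JUNG_COUNT * JONG_COUNT)));
      } else {
        out.append(c);
      }
    }
    return out.toString();
  }

  // Splits each syllable into initial, medial and (if present) final Jamo.
  public static String jamo(String text) {
    StringBuilder out = new StringBuilder();
    for (char c : text.toCharArray()) {
      if (!isSyllable(c)) {
        out.append(c);
        continue;
      }
      int offset = c - HANGUL_BASE;
      int jong = offset % JONG_COUNT;
      out.append(CHOSEONG.charAt(offset / (JUNG_COUNT * JONG_COUNT)));
      out.append(JUNGSEONG.charAt((offset / JONG_COUNT) % JUNG_COUNT));
      if (jong != 0) {
        out.append(JONGSEONG.charAt(jong));
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(choseong("된장찌개"));
    System.out.println(jamo("한글"));
  }
}
```

A table-driven version like the one the patch describes would additionally be needed for the DUALCHOSEONG and DUALJAMO options, since splitting "ㅉ" into "ㅈㅈ" or "ㅚ" into "ㅗㅣ" is not expressible through the syllable arithmetic alone.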
[jira] [Updated] (LUCENE-8553) New KoreanDecomposeFilter for KoreanAnalyzer(Nori)
[ https://issues.apache.org/jira/browse/LUCENE-8553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namgyu Kim updated LUCENE-8553:
---
    Attachment: LUCENE-8553.patch

> New KoreanDecomposeFilter for KoreanAnalyzer(Nori)
> --------------------------------------------------
>
> Key: LUCENE-8553
> URL: https://issues.apache.org/jira/browse/LUCENE-8553
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Namgyu Kim
> Priority: Major
> Attachments: LUCENE-8553.patch