[jira] [Commented] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-11 Thread Namgyu Kim (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927933#comment-16927933
 ] 

Namgyu Kim commented on LUCENE-8966:


Oh, thank you for your reply, [~jim.ferenczi] :D

I checked again and it was not a bug.
 That result comes from the Viterbi path.

But I think it needs to be discussed, so I opened a new issue about it.

I'd appreciate it if you could take a look at LUCENE-8977.

P.S. +1 to your patch

> KoreanTokenizer should split unknown words on digits
> 
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8966.patch, LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split unknown words if a digit is followed by another 
> character type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]
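The splitting rule described above can be illustrated with a standalone sketch. The class and method names below are hypothetical; the actual LUCENE-8966 patch changes the tokenizer's unknown-word grouping internally, this only demonstrates the intended rule.

```java
import java.util.ArrayList;
import java.util.List;

public class DigitSplitSketch {
    // Cut an unknown-word run whenever a digit is followed by a character of
    // another type (or vice versa), restoring the pre-LUCENE-8548 behavior.
    static List<String> split(String unknownWord) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < unknownWord.length(); i++) {
            boolean prevDigit = Character.isDigit(unknownWord.charAt(i - 1));
            boolean curDigit = Character.isDigit(unknownWord.charAt(i));
            if (prevDigit != curDigit) { // type boundary involving a digit
                tokens.add(unknownWord.substring(start, i));
                start = i;
            }
        }
        tokens.add(unknownWord.substring(start));
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("44사이즈")); // prints [44, 사이즈]
    }
}
```

With this rule, "44사이즈" splits so that "사이즈" can be matched against the dictionary, while mixed-script words without digits (e.g. Мoscow) stay grouped.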



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8977) Handle punctuation characters in KoreanTokenizer

2019-09-11 Thread Namgyu Kim (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8977:
---
Description: 
As we discussed on LUCENE-8966, KoreanTokenizer now always splits a run of 
consecutive punctuation marks into the first mark and the rest.
 (사이즈 => [사이즈] [.] [...])
 But KoreanTokenizer does not split when the first character is punctuation.
 (...사이즈 => [...] [사이즈])

This looks like a result of the Viterbi path, but users may find the following 
cases strange:
 ("사이즈" means "size" in Korean)
||Case #1||Case #2||
|Input : "...사이즈..."|Input : "...4..4사이즈"|
|Result : [...] [사이즈] [.] [..]|Result : [...] [4] [.] [.] [4] [사이즈]|

From what I checked, Nori has punctuation characters (like "." and ",") in its 
dictionary but Kuromoji does not.
 ("サイズ" means "size" in Japanese)
||Case #1||Case #2||
|Input : "...サイズ..."|Input : "...4..4サイズ"|
|Result : [...] [サイズ] [...]|Result : [...] [4] [..] [4] [サイズ]|

There are some ways to resolve this, such as hard-coding the punctuation 
handling, but that does not seem good, so I think we need to discuss it.
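For illustration, the consistent behavior the cases above suggest users expect (a run of consecutive punctuation becoming a single token) can be sketched as follows. The class and helper names are hypothetical; this is not Nori's actual code.

```java
import java.util.ArrayList;
import java.util.List;

public class PunctuationGroupingSketch {
    static boolean isPunct(char c) {
        int t = Character.getType(c);
        return t == Character.OTHER_PUNCTUATION || t == Character.DASH_PUNCTUATION;
    }

    // Group the input into maximal runs of punctuation / non-punctuation,
    // so "..." stays one token wherever it appears.
    static List<String> group(String input) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= input.length(); i++) {
            if (i == input.length()
                    || isPunct(input.charAt(i)) != isPunct(input.charAt(i - 1))) {
                tokens.add(input.substring(start, i));
                start = i;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(group("...사이즈...")); // prints [..., 사이즈, ...]
    }
}
```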

  was:
As we discussed on LUCENE-8966, KoreanTokenizer always divides into one and the 
others now when there are continuous punctuation marks.
 (사이즈 => [사이즈] [.] [...])
 But KoreanTokenizer doesn't divides when first character is punctuation.
 (...사이즈 => [...] [사이즈])

It looks like the result from the viterbi path, but users can think weird about 
the following case:
 ("사이즈" means "size" in Korean)
||Case #1||Case #2||
|Input : "...사이즈..."|Input : "...4..4사이즈"|
|Result : [...] [사이즈] [.] [..]|Result : [...] [4] [.] [.] [4] [사이즈]|

From what I checked, Nori has a punctuation characters(like . ,) in the 
dictionary but Kuromoji is not.
 ("サイズ" means "size" in Japanese)
||Case #1||Case #2||
|Input : "...サイズ..."|Input : "...4..4サイズ"|
|Result : [...] [サイズ] [...]|Result : [...] [4] [..] [4] [サイズ]|

There are some ways to resolve it like hard-coding for punctuation but it seems 
not good.
 So I think we need to discuss it.


> Handle punctuation characters in KoreanTokenizer
> 
>
> Key: LUCENE-8977
> URL: https://issues.apache.org/jira/browse/LUCENE-8977
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Namgyu Kim
>Priority: Minor
>
> As we discussed on LUCENE-8966, KoreanTokenizer now always splits a run of 
> consecutive punctuation marks into the first mark and the rest.
>  (사이즈 => [사이즈] [.] [...])
>  But KoreanTokenizer does not split when the first character is punctuation.
>  (...사이즈 => [...] [사이즈])
> This looks like a result of the Viterbi path, but users may find the 
> following cases strange:
>  ("사이즈" means "size" in Korean)
> ||Case #1||Case #2||
> |Input : "...사이즈..."|Input : "...4..4사이즈"|
> |Result : [...] [사이즈] [.] [..]|Result : [...] [4] [.] [.] [4] [사이즈]|
> From what I checked, Nori has punctuation characters (like "." and ",") in 
> the dictionary but Kuromoji does not.
>  ("サイズ" means "size" in Japanese)
> ||Case #1||Case #2||
> |Input : "...サイズ..."|Input : "...4..4サイズ"|
> |Result : [...] [サイズ] [...]|Result : [...] [4] [..] [4] [サイズ]|
> There are some ways to resolve this, such as hard-coding the punctuation 
> handling, but that does not seem good, so I think we need to discuss it.






[jira] [Created] (LUCENE-8977) Handle punctuation characters in KoreanTokenizer

2019-09-11 Thread Namgyu Kim (Jira)
Namgyu Kim created LUCENE-8977:
--

 Summary: Handle punctuation characters in KoreanTokenizer
 Key: LUCENE-8977
 URL: https://issues.apache.org/jira/browse/LUCENE-8977
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Namgyu Kim


As we discussed on LUCENE-8966, KoreanTokenizer now always splits a run of 
consecutive punctuation marks into the first mark and the rest.
 (사이즈 => [사이즈] [.] [...])
 But KoreanTokenizer does not split when the first character is punctuation.
 (...사이즈 => [...] [사이즈])

This looks like a result of the Viterbi path, but users may find the following 
cases strange:
 ("사이즈" means "size" in Korean)
||Case #1||Case #2||
|Input : "...사이즈..."|Input : "...4..4사이즈"|
|Result : [...] [사이즈] [.] [..]|Result : [...] [4] [.] [.] [4] [사이즈]|

From what I checked, Nori has punctuation characters (like "." and ",") in its 
dictionary but Kuromoji does not.
 ("サイズ" means "size" in Japanese)
||Case #1||Case #2||
|Input : "...サイズ..."|Input : "...4..4サイズ"|
|Result : [...] [サイズ] [...]|Result : [...] [4] [..] [4] [サイズ]|

There are some ways to resolve this, such as hard-coding the punctuation 
handling, but that does not seem good, so I think we need to discuss it.






[jira] [Commented] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-07 Thread Namgyu Kim (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924773#comment-16924773
 ] 

Namgyu Kim commented on LUCENE-8966:


But there is a bug I just found :(

Input : "4..4사이즈"
Expected : [4] [..] [4] [사이즈]
Actual : [4] *[.] [.]* [4] [사이즈]

{code:java}
// This test should pass once the bug is fixed.
public void testDuplicatePunctuation() throws IOException {
  assertAnalyzesTo(analyzerWithPunctuation, "4..4사이즈",
      new String[]{"4", "..", "4", "사이즈"},
      new int[]{0, 1, 3, 4},   // start offsets
      new int[]{1, 3, 4, 7},   // end offsets
      new int[]{1, 1, 1, 1}    // position increments
  );
}
{code}
 
I think we need to fix it.
If it is okay to fix it within this JIRA issue, I'll post an additional patch.
Otherwise I'll create a new one.

> KoreanTokenizer should split unknown words on digits
> 
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8966.patch, LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split unknown words if a digit is followed by another 
> character type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]






[jira] [Commented] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-07 Thread Namgyu Kim (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924768#comment-16924768
 ] 

Namgyu Kim commented on LUCENE-8966:


Good job, [~jim.ferenczi]! :D

This can be serious for Nori users.

About punctuation: as [~jim.ferenczi] said, it can be kept by setting the 
discardPunctuation parameter of KoreanTokenizer to false.

You can test it with the analyzerWithPunctuation instance in 
TestKoreanTokenizer.

> KoreanTokenizer should split unknown words on digits
> 
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8966.patch, LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split unknown words if a digit is followed by another 
> character type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]






[jira] [Created] (LUCENE-8954) Refactor Nori(Korean) Analyzer

2019-08-20 Thread Namgyu Kim (Jira)
Namgyu Kim created LUCENE-8954:
--

 Summary: Refactor Nori(Korean) Analyzer
 Key: LUCENE-8954
 URL: https://issues.apache.org/jira/browse/LUCENE-8954
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Namgyu Kim
Assignee: Namgyu Kim


There is a lot of code in the Nori analyzer that can be refactored
(whitespace, incorrect type casts, unnecessary throws clauses, C-style arrays, ...).

I think it's good to proceed if we can.

It does not change the actual behavior of Nori.

I'll just remove unnecessary code and keep the code simple.






[jira] [Resolved] (LUCENE-8934) Move Nori DictionaryBuilder tool from src/tools to src/

2019-08-08 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim resolved LUCENE-8934.

   Resolution: Fixed
Fix Version/s: master (9.0)
   8.x

> Move Nori DictionaryBuilder tool from src/tools to src/
> ---
>
> Key: LUCENE-8934
> URL: https://issues.apache.org/jira/browse/LUCENE-8934
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Namgyu Kim
>Assignee: Namgyu Kim
>Priority: Major
> Fix For: 8.x, master (9.0)
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> After LUCENE-8904, the tests for Nori's tools do not run as part of the 
> normal test run ({{ant test}}).
> As with Kuromoji (before LUCENE-8871), we need to run {{ant test-tools}} to 
> test Nori's tools.
> Like Kuromoji, these tests can run as part of the normal test run once 
> Nori's tools are moved to the main source tree.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8912) Remove ICU dependency of nori tools/test-tools

2019-08-08 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8912:
---
Fix Version/s: master (9.0)
   8.x

> Remove ICU dependency of nori tools/test-tools
> --
>
> Key: LUCENE-8912
> URL: https://issues.apache.org/jira/browse/LUCENE-8912
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Namgyu Kim
>Assignee: Namgyu Kim
>Priority: Major
> Fix For: 8.x, master (9.0)
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {quote}After this job, I'll apply LUCENE-8866 and LUCENE-8871 to Nori.
> {quote}
> As mentioned in LUCENE-8904, I am proceeding with this work now.
> [~rcmuir] found this first (LUCENE-8866); I am just applying it to Nori.
> Nori doesn't need the ICU library because, like Kuromoji, it uses 
> Normalizer2 only for NFKC normalization.
> I think it's OK to remove the library dependency because the JDK can handle 
> it.
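For reference, the JDK-only NFKC normalization mentioned above is available through java.text.Normalizer; a minimal sketch (the sample string is illustrative):

```java
import java.text.Normalizer;

public class NfkcSketch {
    public static void main(String[] args) {
        // Half-width katakana "size"; NFKC folds it to full-width and
        // composes the voiced sound mark, with no ICU dependency needed.
        String halfWidth = "ｻｲｽﾞ";
        String nfkc = Normalizer.normalize(halfWidth, Normalizer.Form.NFKC);
        System.out.println(nfkc); // prints サイズ
    }
}
```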






[jira] [Resolved] (LUCENE-8912) Remove ICU dependency of nori tools/test-tools

2019-08-08 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim resolved LUCENE-8912.

Resolution: Fixed

> Remove ICU dependency of nori tools/test-tools
> --
>
> Key: LUCENE-8912
> URL: https://issues.apache.org/jira/browse/LUCENE-8912
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Namgyu Kim
>Assignee: Namgyu Kim
>Priority: Major
> Fix For: 8.x, master (9.0)
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {quote}After this job, I'll apply LUCENE-8866 and LUCENE-8871 to Nori.
> {quote}
> As mentioned in LUCENE-8904, I am proceeding with this work now.
> [~rcmuir] found this first (LUCENE-8866); I am just applying it to Nori.
> Nori doesn't need the ICU library because, like Kuromoji, it uses 
> Normalizer2 only for NFKC normalization.
> I think it's OK to remove the library dependency because the JDK can handle 
> it.






[jira] [Resolved] (LUCENE-8904) Enhance Nori DictionaryBuilder tool

2019-08-08 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim resolved LUCENE-8904.

Resolution: Fixed

> Enhance Nori DictionaryBuilder tool
> ---
>
> Key: LUCENE-8904
> URL: https://issues.apache.org/jira/browse/LUCENE-8904
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Namgyu Kim
>Assignee: Namgyu Kim
>Priority: Major
> Fix For: 8.x, master (9.0)
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> This is the Nori version of [~sokolov]'s LUCENE-8863.
>  This patch has two changes:
>  1) Improve exception handling
>  2) Enable an external dictionary for testing
> Overall it is the same as LUCENE-8863, but there are some differences 
> between Nori and Kuromoji that show up slightly in the code:
> 1) CSV field size: Nori: 12, Kuromoji: 13
> 2) Left context ID == right context ID: Nori: can differ, Kuromoji: always 
> the same
> 3) Dictionary type: Nori: just one type, Kuromoji: IPADIC and UNIDIC
> After this job, I'll apply LUCENE-8866 and LUCENE-8871 to Nori.






[jira] [Updated] (LUCENE-8904) Enhance Nori DictionaryBuilder tool

2019-08-08 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8904:
---
Fix Version/s: master (9.0)
   8.x

> Enhance Nori DictionaryBuilder tool
> ---
>
> Key: LUCENE-8904
> URL: https://issues.apache.org/jira/browse/LUCENE-8904
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Namgyu Kim
>Assignee: Namgyu Kim
>Priority: Major
> Fix For: 8.x, master (9.0)
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> This is the Nori version of [~sokolov]'s LUCENE-8863.
>  This patch has two changes:
>  1) Improve exception handling
>  2) Enable an external dictionary for testing
> Overall it is the same as LUCENE-8863, but there are some differences 
> between Nori and Kuromoji that show up slightly in the code:
> 1) CSV field size: Nori: 12, Kuromoji: 13
> 2) Left context ID == right context ID: Nori: can differ, Kuromoji: always 
> the same
> 3) Dictionary type: Nori: just one type, Kuromoji: IPADIC and UNIDIC
> After this job, I'll apply LUCENE-8866 and LUCENE-8871 to Nori.






[jira] [Commented] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

2019-07-25 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892876#comment-16892876
 ] 

Namgyu Kim commented on LUCENE-8933:


Great analysis! :D
 I checked KoreanTokenizer, and it has no issue with this code:
{code:java}
@Test
public void test() throws IOException {
  UserDictionary dict = UserDictionary.open(new StringReader("aaa,,,"));
  KoreanTokenizer tok = new 
KoreanTokenizer(KoreanTokenizer.DEFAULT_TOKEN_ATTRIBUTE_FACTORY, dict, 
DecompoundMode.NONE, true);
  tok.setReader(new StringReader("aaa"));
  tok.reset();
  tok.incrementToken();
}
{code}

> JapaneseTokenizer creates Token objects with corrupt offsets
> 
>
> Key: LUCENE-8933
> URL: https://issues.apache.org/jira/browse/LUCENE-8933
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>
> An Elasticsearch user reported the following stack trace when parsing 
> synonyms. It looks like the only reason why this might occur is if the offset 
> of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>  
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at 
> org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44)
>  ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - 
> nknize - 2018-12-07 14:44:20]
> at 
> org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486)
>  ~[?:?]
> at 
> org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> ... 24 more
> {noformat}






[jira] [Commented] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

2019-07-24 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892041#comment-16892041
 ] 

Namgyu Kim commented on LUCENE-8933:


Elasticsearch Issue Link : 
[https://github.com/elastic/elasticsearch/issues/44243]

> JapaneseTokenizer creates Token objects with corrupt offsets
> 
>
> Key: LUCENE-8933
> URL: https://issues.apache.org/jira/browse/LUCENE-8933
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>
> An Elasticsearch user reported the following stack trace when parsing 
> synonyms. It looks like the only reason why this might occur is if the offset 
> of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>  
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at 
> org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44)
>  ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - 
> nknize - 2018-12-07 14:44:20]
> at 
> org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486)
>  ~[?:?]
> at 
> org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> ... 24 more
> {noformat}






[jira] [Created] (LUCENE-8934) Move Nori DictionaryBuilder tool from src/tools to src/

2019-07-24 Thread Namgyu Kim (JIRA)
Namgyu Kim created LUCENE-8934:
--

 Summary: Move Nori DictionaryBuilder tool from src/tools to src/
 Key: LUCENE-8934
 URL: https://issues.apache.org/jira/browse/LUCENE-8934
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Namgyu Kim
Assignee: Namgyu Kim


After LUCENE-8904, the tests for Nori's tools do not run as part of the normal 
test run ({{ant test}}).

As with Kuromoji (before LUCENE-8871), we need to run {{ant test-tools}} to 
test Nori's tools.

Like Kuromoji, these tests can run as part of the normal test run once Nori's 
tools are moved to the main source tree.






[jira] [Updated] (LUCENE-8912) Remove ICU dependency of nori tools/test-tools

2019-07-11 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8912:
---
Description: 
{quote}After this job, I'll apply LUCENE-8866 and LUCENE-8871 to Nori.
{quote}
As mentioned in LUCENE-8904, I am proceeding with this work now.

[~rcmuir] found this first (LUCENE-8866); I am just applying it to Nori.

Nori doesn't need the ICU library because, like Kuromoji, it uses Normalizer2 
only for NFKC normalization.
 I think it's OK to remove the library dependency because the JDK can handle it.

  was:
{quote}After this job, I'll apply LUCENE-8866 and LUCENE-8871 to Nori.
{quote}
As mentioned in LUCENE-8904, I proceed this work from now on.

It is what [~rcmuir] found first and then I just apply to Nori.

Nori doesn't need the ICU library because it uses Normalizer2 only for NFKC 
normalization like Kuromoji.
 I think it's OK to remove the library dependency because it can be handled by 
JDK.


> Remove ICU dependency of nori tools/test-tools
> --
>
> Key: LUCENE-8912
> URL: https://issues.apache.org/jira/browse/LUCENE-8912
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Namgyu Kim
>Assignee: Namgyu Kim
>Priority: Major
>
> {quote}After this job, I'll apply LUCENE-8866 and LUCENE-8871 to Nori.
> {quote}
> As mentioned in LUCENE-8904, I am proceeding with this work now.
> [~rcmuir] found this first (LUCENE-8866); I am just applying it to Nori.
> Nori doesn't need the ICU library because, like Kuromoji, it uses 
> Normalizer2 only for NFKC normalization.
> I think it's OK to remove the library dependency because the JDK can handle 
> it.






[jira] [Created] (LUCENE-8912) Remove ICU dependency of nori tools/test-tools

2019-07-11 Thread Namgyu Kim (JIRA)
Namgyu Kim created LUCENE-8912:
--

 Summary: Remove ICU dependency of nori tools/test-tools
 Key: LUCENE-8912
 URL: https://issues.apache.org/jira/browse/LUCENE-8912
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Namgyu Kim
Assignee: Namgyu Kim


{quote}After this job, I'll apply LUCENE-8866 and LUCENE-8871 to Nori.
{quote}
As mentioned in LUCENE-8904, I am proceeding with this work now.

[~rcmuir] found this first; I am just applying it to Nori.

Nori doesn't need the ICU library because, like Kuromoji, it uses Normalizer2 
only for NFKC normalization.
 I think it's OK to remove the library dependency because the JDK can handle it.






[jira] [Assigned] (LUCENE-8904) Enhance Nori DictionaryBuilder tool

2019-07-11 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim reassigned LUCENE-8904:
--

Assignee: Namgyu Kim

> Enhance Nori DictionaryBuilder tool
> ---
>
> Key: LUCENE-8904
> URL: https://issues.apache.org/jira/browse/LUCENE-8904
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Namgyu Kim
>Assignee: Namgyu Kim
>Priority: Major
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> This is the Nori version of [~sokolov]'s LUCENE-8863.
>  This patch has two changes:
>  1) Improve exception handling
>  2) Enable an external dictionary for testing
> Overall it is the same as LUCENE-8863, but there are some differences 
> between Nori and Kuromoji that show up slightly in the code:
> 1) CSV field size: Nori: 12, Kuromoji: 13
> 2) Left context ID == right context ID: Nori: can differ, Kuromoji: always 
> the same
> 3) Dictionary type: Nori: just one type, Kuromoji: IPADIC and UNIDIC
> After this job, I'll apply LUCENE-8866 and LUCENE-8871 to Nori.






[jira] [Commented] (LUCENE-8900) Simplify MultiSorter

2019-07-09 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881302#comment-16881302
 ] 

Namgyu Kim commented on LUCENE-8900:


You're welcome! [~jpountz].

> Simplify MultiSorter
> 
>
> Key: LUCENE-8900
> URL: https://issues.apache.org/jira/browse/LUCENE-8900
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.2
>
> Attachments: LUCENE-8900.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8904) Enhance Nori DictionaryBuilder tool

2019-07-07 Thread Namgyu Kim (JIRA)
Namgyu Kim created LUCENE-8904:
--

 Summary: Enhance Nori DictionaryBuilder tool
 Key: LUCENE-8904
 URL: https://issues.apache.org/jira/browse/LUCENE-8904
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Namgyu Kim


This is the Nori version of [~sokolov]'s LUCENE-8863.
 This patch has two changes:
 1) Improve exception handling
 2) Enable an external dictionary for testing

Overall it is the same as LUCENE-8863, but there are some differences between 
Nori and Kuromoji that show up slightly in the code:
1) CSV field size: Nori: 12, Kuromoji: 13
2) Left context ID == right context ID: Nori: can differ, Kuromoji: always the same
3) Dictionary type: Nori: just one type, Kuromoji: IPADIC and UNIDIC

After this job, I'll apply LUCENE-8866 and LUCENE-8871 to Nori.






[jira] [Commented] (LUCENE-8900) Simplify MultiSorter

2019-07-03 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877924#comment-16877924
 ] 

Namgyu Kim commented on LUCENE-8900:


Oh, about suggestion 2, I misread it.
It can cause a ClassCastException :(

Sorry for the confusion, and thank you for taking suggestion 1.

> Simplify MultiSorter
> 
>
> Key: LUCENE-8900
> URL: https://issues.apache.org/jira/browse/LUCENE-8900
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8900.patch
>
>







[jira] [Commented] (LUCENE-8900) Simplify MultiSorter

2019-07-02 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877180#comment-16877180
 ] 

Namgyu Kim commented on LUCENE-8900:


+1

This patch looks good, [~jpountz] :D
 I have some suggestions about your patch.

1. In the lessThan method of the PriorityQueue, we can reduce the computation 
(a very minor difference).
 Before:
{code:java}
public boolean lessThan(LeafAndDocID a, LeafAndDocID b) {
  for(int i=0;i ...
{code}

> Simplify MultiSorter
> 
>
> Key: LUCENE-8900
> URL: https://issues.apache.org/jira/browse/LUCENE-8900
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8900.patch
>
>







[jira] [Commented] (LUCENE-8870) Support numeric value in Field class

2019-06-26 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16873528#comment-16873528
 ] 

Namgyu Kim commented on LUCENE-8870:


Thank you for your reply! [~jpountz] :D
{quote}My gut feeling is that trying to fold everything into a single field 
would make things more complicated rather than simpler.
{quote}
Yeah, I agree with parts of that.
 I thought it would be nice to give users more options.
 But if you think it could confuse users, I think keeping the current behavior is 
better.
{quote}About the TestField class,
 I think its class structure needs to be changed slightly.

It has no direct connection with this issue.
 But I plan to modify it like TestIntPoint, TestDoublePoint, ...
 Shouldn't the test class name depend on the class name?
{quote}
By the way, what do you think about my first comment in this issue?
 It may be a little ambiguous, but I'm curious about your opinion.

> Support numeric value in Field class
> 
>
> Key: LUCENE-8870
> URL: https://issues.apache.org/jira/browse/LUCENE-8870
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8870.patch
>
>
> I checked the following comment in Field class.
> {code:java}
> // TODO: allow direct construction of int, long, float, double value too..?
> {code}
> We already have some fields like IntPoint and StoredField, but I think it's 
> okay.
> The test cases are set in the TestField class.






[jira] [Commented] (LUCENE-8870) Support numeric value in Field class

2019-06-25 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872601#comment-16872601
 ] 

Namgyu Kim commented on LUCENE-8870:


Thank you for your reply! [~jpountz] :D

When I wrote the patch, the biggest advantage in my mind was the FieldType 
conversion for numeric types.
 Of course, it is not the recommended way (it is already marked Expert in the 
Javadoc), but it gives users FieldType customization.

ex)
 Currently, NumericDocValuesField does not support the stored option.
 So users need to add a separate StoredField.
 With this patch, the user can get the characteristics of 
NumericDocValuesField and StoredField in a single field.
{code:java}
FieldType type = new FieldType();
type.setStored(true);
type.setDocValuesType(DocValuesType.NUMERIC);
type.freeze();

Document doc = new Document();
Field field = new Field("number", 1234, type);
doc.add(field);
indexWriter.addDocument(doc);
{code}
After that, we can use methods such as
{code:java}
Sort sort = new Sort();
sort.setSort(new SortField("number", SortField.Type.INT));
{code}
and
{code:java}
doc.get("number");
{code}
in the "number" field.

> Support numeric value in Field class
> 
>
> Key: LUCENE-8870
> URL: https://issues.apache.org/jira/browse/LUCENE-8870
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8870.patch
>
>
> I checked the following comment in Field class.
> {code:java}
> // TODO: allow direct construction of int, long, float, double value too..?
> {code}
> We already have some fields like IntPoint and StoredField, but I think it's 
> okay.
> The test cases are set in the TestField class.






[jira] [Commented] (LUCENE-8870) Support numeric value in Field class

2019-06-23 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870613#comment-16870613
 ] 

Namgyu Kim commented on LUCENE-8870:


Thank you for your reply! [~sokolov] :D

I want to make sure that I understand your opinion well.
{quote}in the end we store the value as an Object and then later cast it. 
Callers that also handle values generically, as Objects then need an adapter to 
detect the type of a value, cast it properly, only to have Lucene throw away 
all the type info and do that dance all over again internally!
{quote}
I think this is a structure for providing various constructors to users. 
(Reader, CharSequence, BytesRef, ...)
 Of course, we could provide it only in Object form and handle it in the 
constructor.
 But wouldn't that be unfriendly to API users?
 And I'm not sure about it, because the IndexableFieldType check logic differs 
depending on the value type.
 ex)
 BytesRef -> there is no check logic.
 CharSequence -> (!IndexableFieldType#stored() && 
IndexableFieldType#indexOptions() == IndexOptions.NONE) should be false.
 Reader -> (IndexableFieldType#indexOptions() == IndexOptions.NONE || 
!IndexableFieldType#tokenized()) and (IndexableFieldType#stored()) should be 
false.
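
The three per-type checks listed above can be sketched in plain Java. This is only a schematic, not Lucene's actual code: the method name `validate` and the plain booleans standing in for IndexableFieldType#stored(), indexOptions() != NONE, and tokenized() are hypothetical.

```java
import java.io.Reader;

public class Main {
    // Schematic of the per-type check logic described above; names are illustrative.
    public static void validate(Object value, boolean stored, boolean indexed, boolean tokenized) {
        if (value instanceof byte[]) {
            // BytesRef-like value: there is no check logic.
        } else if (value instanceof CharSequence) {
            // (!stored() && indexOptions() == NONE) should be false.
            if (!stored && !indexed) {
                throw new IllegalArgumentException("a CharSequence value must be stored or indexed");
            }
        } else if (value instanceof Reader) {
            // (indexOptions() == NONE || !tokenized()) and stored() should both be false.
            if (!indexed || !tokenized) {
                throw new IllegalArgumentException("a Reader value must be indexed and tokenized");
            }
            if (stored) {
                throw new IllegalArgumentException("a Reader value cannot be stored");
            }
        }
    }

    public static void main(String[] args) {
        validate("text", true, false, false);       // stored CharSequence: accepted
        try {
            validate("text", false, false, false);  // neither stored nor indexed: rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Running main exercises both the accepted and the rejected CharSequence cases.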

In fact, I was worried about using the Number class when writing this patch.
 I think API users may prefer int, float, double, ... over the Number 
class.
 What do you think about this?
{quote}Maybe use Objects.requireNonNull for the null checks?
{quote}
I followed the current code structure here.
 That method throws a NullPointerException, while the current structure throws an 
IllegalArgumentException.
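
For reference, a minimal plain-JDK sketch of the difference between the two styles (method names are illustrative only):

```java
import java.util.Objects;

public class Main {
    // Objects.requireNonNull rejects null with a NullPointerException...
    public static String checkWithRequireNonNull(String name) {
        return Objects.requireNonNull(name, "name must not be null");
    }

    // ...while an explicit check, as in the current code structure, throws
    // IllegalArgumentException instead.
    public static String checkWithIllegalArgument(String name) {
        if (name == null) {
            throw new IllegalArgumentException("name must not be null");
        }
        return name;
    }

    public static void main(String[] args) {
        try {
            checkWithRequireNonNull(null);
        } catch (NullPointerException e) {
            System.out.println("NullPointerException: " + e.getMessage());
        }
        try {
            checkWithIllegalArgument(null);
        } catch (IllegalArgumentException e) {
            System.out.println("IllegalArgumentException: " + e.getMessage());
        }
    }
}
```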

> Support numeric value in Field class
> 
>
> Key: LUCENE-8870
> URL: https://issues.apache.org/jira/browse/LUCENE-8870
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8870.patch
>
>
> I checked the following comment in Field class.
> {code:java}
> // TODO: allow direct construction of int, long, float, double value too..?
> {code}
> We already have some fields like IntPoint and StoredField, but I think it's 
> okay.
> The test cases are set in the TestField class.






[jira] [Commented] (LUCENE-8870) Support numeric value in Field class

2019-06-19 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867819#comment-16867819
 ] 

Namgyu Kim commented on LUCENE-8870:


About the TestField class,
I think its class structure needs to be changed slightly.

It has no direct connection with this issue.
But I plan to modify it like TestIntPoint, TestDoublePoint, ...
Shouldn't the test class name depend on the class name?

What do you think about it?

> Support numeric value in Field class
> 
>
> Key: LUCENE-8870
> URL: https://issues.apache.org/jira/browse/LUCENE-8870
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8870.patch
>
>
> I checked the following comment in Field class.
> {code:java}
> // TODO: allow direct construction of int, long, float, double value too..?
> {code}
> We already have some fields like IntPoint and StoredField, but I think it's 
> okay.
> The test cases are set in the TestField class.






[jira] [Updated] (LUCENE-8870) Support numeric value in Field class

2019-06-19 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8870:
---
Attachment: LUCENE-8870.patch

> Support numeric value in Field class
> 
>
> Key: LUCENE-8870
> URL: https://issues.apache.org/jira/browse/LUCENE-8870
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8870.patch
>
>
> I checked the following comment in Field class.
> {code:java}
> // TODO: allow direct construction of int, long, float, double value too..?
> {code}
> We already have some fields like IntPoint and StoredField, but I think it's 
> okay.
> The test cases are set in the TestField class.






[jira] [Created] (LUCENE-8870) Support numeric value in Field class

2019-06-19 Thread Namgyu Kim (JIRA)
Namgyu Kim created LUCENE-8870:
--

 Summary: Support numeric value in Field class
 Key: LUCENE-8870
 URL: https://issues.apache.org/jira/browse/LUCENE-8870
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Namgyu Kim
 Attachments: LUCENE-8870.patch

I checked the following comment in Field class.
{code:java}
// TODO: allow direct construction of int, long, float, double value too..?
{code}

We already have some fields like IntPoint and StoredField, but I think it's 
okay.

The test cases are set in the TestField class.






[jira] [Resolved] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer

2019-06-12 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim resolved LUCENE-8812.

Resolution: Fixed

> add KoreanNumberFilter to Nori(Korean) Analyzer
> ---
>
> Key: LUCENE-8812
> URL: https://issues.apache.org/jira/browse/LUCENE-8812
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Assignee: Namgyu Kim
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: LUCENE-8812.patch
>
>
> This is a follow-up issue to LUCENE-8784.
> The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to 
> regular Arabic decimal numbers in half-width characters.
> Logic is similar to JapaneseNumberFilter.
> It should be able to cover the following test cases.
> 1) Korean Word to Number
> 십만이천오백 => 102500
> 2) 1 character conversion
> 일영영영 => 1000
> 3) Decimal Point Calculation
> 3.2천 => 3200
> 4) Comma between three digits
> 4,647.0010 => 4647.001
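
The word-to-number cases above can be illustrated with a simplified, dictionary-only sketch in plain Java. This is not the actual KoreanNumberFilter (which is a TokenFilter and also handles decimal points and commas); it only shows the positional vs. unit-based reading of the first two test cases, and the class and method names are hypothetical.

```java
import java.util.Map;

public class Main {
    // Hangul digits and units (a subset of what the filter understands).
    static final Map<Character, Long> DIGITS = Map.of(
        '영', 0L, '일', 1L, '이', 2L, '삼', 3L, '사', 4L,
        '오', 5L, '육', 6L, '칠', 7L, '팔', 8L, '구', 9L);
    static final Map<Character, Long> SMALL_UNITS = Map.of('십', 10L, '백', 100L, '천', 1000L);
    static final Map<Character, Long> LARGE_UNITS = Map.of('만', 10_000L, '억', 100_000_000L);

    public static long normalize(String s) {
        boolean hasUnit = s.chars().anyMatch(
            c -> SMALL_UNITS.containsKey((char) c) || LARGE_UNITS.containsKey((char) c));
        if (!hasUnit) {
            // "1 character conversion": 일영영영 is read positionally as 1000.
            long n = 0;
            for (char c : s.toCharArray()) n = n * 10 + DIGITS.get(c);
            return n;
        }
        long total = 0, section = 0, digit = 0;  // section accumulates values below 만/억
        for (char c : s.toCharArray()) {
            if (DIGITS.containsKey(c)) {
                digit = DIGITS.get(c);
            } else if (SMALL_UNITS.containsKey(c)) {
                section += (digit == 0 ? 1 : digit) * SMALL_UNITS.get(c);  // bare 십 means 1 * 10
                digit = 0;
            } else {
                long sec = section + digit;
                total += (sec == 0 ? 1 : sec) * LARGE_UNITS.get(c);  // close out the 만/억 section
                section = 0;
                digit = 0;
            }
        }
        return total + section + digit;
    }

    public static void main(String[] args) {
        System.out.println(normalize("십만이천오백"));  // 102500
        System.out.println(normalize("일영영영"));      // 1000
    }
}
```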






[jira] [Commented] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer

2019-06-12 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862270#comment-16862270
 ] 

Namgyu Kim commented on LUCENE-8812:


You're welcome. [~jim.ferenczi]!
I checked that the build completed fine on Jenkins.
I'll resolve this issue.

> add KoreanNumberFilter to Nori(Korean) Analyzer
> ---
>
> Key: LUCENE-8812
> URL: https://issues.apache.org/jira/browse/LUCENE-8812
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Assignee: Namgyu Kim
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: LUCENE-8812.patch
>
>
> This is a follow-up issue to LUCENE-8784.
> The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to 
> regular Arabic decimal numbers in half-width characters.
> Logic is similar to JapaneseNumberFilter.
> It should be able to cover the following test cases.
> 1) Korean Word to Number
> 십만이천오백 => 102500
> 2) 1 character conversion
> 일영영영 => 1000
> 3) Decimal Point Calculation
> 3.2천 => 3200
> 4) Comma between three digits
> 4,647.0010 => 4647.001






[jira] [Commented] (LUCENE-8817) Combine Nori and Kuromoji DictionaryBuilder

2019-06-10 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860162#comment-16860162
 ] 

Namgyu Kim commented on LUCENE-8817:


Oh. I read it wrong. Please ignore that part.
Thank you for checking. [~tomoko]

> Combine Nori and Kuromoji DictionaryBuilder
> ---
>
> Key: LUCENE-8817
> URL: https://issues.apache.org/jira/browse/LUCENE-8817
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Priority: Major
>
> This issue is related to LUCENE-8816.
> Currently Nori and Kuromoji Analyzer use the same dictionary structure. 
> (MeCab)
>  If we make combine DictionaryBuilder, we can reduce the code size.
>  But this task may have a dependency on the language.
>  (like HEADER string in BinaryDictionary and CharacterDefinition, methods in 
> BinaryDictionaryWriter, ...)
>  On the other hand, there are many overlapped classes.
> The purpose of this patch is to provide users of Nori and Kuromoji with the 
> same system dictionary generator.
> It may take some time because there is a little workload.
>  The work will be based on the latest master, and if the LUCENE-8816 is 
> finished first, I will pull the latest code and proceed.






[jira] [Commented] (LUCENE-8817) Combine Nori and Kuromoji DictionaryBuilder

2019-06-10 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860128#comment-16860128
 ] 

Namgyu Kim commented on LUCENE-8817:


Thank you for your replies. [~tomoko] and [~cm] :D

I was impressed by how deeply you have thought about this.
{code:java}
analysis
└── ???
 ├── common (module: analyzers-???-common)
 │   ├── build.xml
 │   └── src
 ├── kuromoji (module: analyzers-???-kuromoji)
 │   ├── build.xml
 │   └── src
 ├── nori (module: analyzers-???-nori)
 │   ├── build.xml
 │   └── src
 └── tools  (module: analyzers-???-tools)
 ├── build.xml
 └── src
{code}
I agree with the module structure proposed by Tomoko.
 In my personal opinion, "analysis" is better than "analyzers".
{quote}In terms of naming, what about using "statistical" instead of "mecab" 
for this class of analyzers?
 I'm thinking "Viterbi" could be good to refer to in shared tokenizer code.
 This said, I think it could be a good to refer to "mecab" in the dictionary 
compiler code, documentation, etc. to make sure users understand that we can 
read this model format.
 Any thoughts?
{quote}
About the name, the folder name "viterbi" looks much better than "statistical".
 But to be perfectly honest, I'm not sure it's really right to use the 
algorithm name as the folder name.
 Most users probably don't know what Viterbi is.
 It also affects the package name, and 
"org.apache.lucene.analysis.viterbi.ja" or "~.viterbi.ko" would confuse users.
 Or we could just use "org.apache.lucene.analysis.ja"; that could be fine, 
since analysis-common already does something similar.
 (its package is not org.apache.lucene.common.cjk)
 It doesn't matter if we use it for administrative purposes, but I would also 
like to hear some opinions from others.
{quote}how about using "kuromoji" in the top level module name for both of 
Japanese and Korean analyzers, and changing current module names "kuromoji" and 
"nori" to "kuromoji-ja" and "kuromoij-ko"?
{quote}
I personally don't agree with using kuromoji-ko instead of nori.
nori is already a familiar name to users,
and they may be confused by the change.

> Combine Nori and Kuromoji DictionaryBuilder
> ---
>
> Key: LUCENE-8817
> URL: https://issues.apache.org/jira/browse/LUCENE-8817
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Priority: Major
>
> This issue is related to LUCENE-8816.
> Currently Nori and Kuromoji Analyzer use the same dictionary structure. 
> (MeCab)
>  If we make combine DictionaryBuilder, we can reduce the code size.
>  But this task may have a dependency on the language.
>  (like HEADER string in BinaryDictionary and CharacterDefinition, methods in 
> BinaryDictionaryWriter, ...)
>  On the other hand, there are many overlapped classes.
> The purpose of this patch is to provide users of Nori and Kuromoji with the 
> same system dictionary generator.
> It may take some time because there is a little workload.
>  The work will be based on the latest master, and if the LUCENE-8816 is 
> finished first, I will pull the latest code and proceed.






[jira] [Commented] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer

2019-06-09 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859507#comment-16859507
 ] 

Namgyu Kim commented on LUCENE-8812:


I re-opened this issue because an error occurred on the branch_8x branch.
 ([https://jenkins.thetaphi.de/job/Lucene-Solr-8.x-Windows/298/])
 After the error was found, I pushed a commit to fix it.
 The cause of the error was that when I wrote the test (TestKoreanNumberFilter),
 I used the Java 9 try-with-resources style, and the error occurred on 
branch_8x because it is based on Java 8.
 So I disabled it, including on the master branch, and I will rework it when 
version 9.0 is officially released.
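
A minimal plain-JDK illustration of the incompatibility: Java 9+ allows an existing effectively-final variable directly in the resource clause, while Java 8 requires the resource to be declared inside the try itself. The method names here are illustrative, not from the actual test.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class Main {
    // Java 9+ style: an existing effectively-final variable used as the resource.
    // The "try (reader)" line is a compile error on Java 8.
    public static char readJava9Style(String s) {
        StringReader reader = new StringReader(s);
        try (reader) {
            return (char) reader.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Java 8 compatible style: the resource is declared in the try clause itself.
    public static char readJava8Style(String s) {
        try (StringReader reader = new StringReader(s)) {
            return (char) reader.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readJava9Style("nori"));  // n
        System.out.println(readJava8Style("nori"));  // n
    }
}
```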

And I made a slight mistake.
 I did not mention the issue number when committing to the branch_8x branch.
 I wanted to change the commit message, but it could not be changed because it 
is a protected branch. (force push is blocked)
 So I was forced to revert.

Anyway, both problems (the error and the wrong commit message) are my fault, and I 
will be more careful.

I will resolve this issue when the Jenkins build is completed.

> add KoreanNumberFilter to Nori(Korean) Analyzer
> ---
>
> Key: LUCENE-8812
> URL: https://issues.apache.org/jira/browse/LUCENE-8812
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Assignee: Namgyu Kim
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: LUCENE-8812.patch
>
>
> This is a follow-up issue to LUCENE-8784.
> The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to 
> regular Arabic decimal numbers in half-width characters.
> Logic is similar to JapaneseNumberFilter.
> It should be able to cover the following test cases.
> 1) Korean Word to Number
> 십만이천오백 => 102500
> 2) 1 character conversion
> 일영영영 => 1000
> 3) Decimal Point Calculation
> 3.2천 => 3200
> 4) Comma between three digits
> 4,647.0010 => 4647.001






[jira] [Reopened] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer

2019-06-09 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim reopened LUCENE-8812:


> add KoreanNumberFilter to Nori(Korean) Analyzer
> ---
>
> Key: LUCENE-8812
> URL: https://issues.apache.org/jira/browse/LUCENE-8812
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Assignee: Namgyu Kim
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: LUCENE-8812.patch
>
>
> This is a follow-up issue to LUCENE-8784.
> The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to 
> regular Arabic decimal numbers in half-width characters.
> Logic is similar to JapaneseNumberFilter.
> It should be able to cover the following test cases.
> 1) Korean Word to Number
> 십만이천오백 => 102500
> 2) 1 character conversion
> 일영영영 => 1000
> 3) Decimal Point Calculation
> 3.2천 => 3200
> 4) Comma between three digits
> 4,647.0010 => 4647.001






[jira] [Resolved] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer

2019-06-09 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim resolved LUCENE-8812.

Resolution: Fixed

> add KoreanNumberFilter to Nori(Korean) Analyzer
> ---
>
> Key: LUCENE-8812
> URL: https://issues.apache.org/jira/browse/LUCENE-8812
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Assignee: Namgyu Kim
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: LUCENE-8812.patch
>
>
> This is a follow-up issue to LUCENE-8784.
> The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to 
> regular Arabic decimal numbers in half-width characters.
> Logic is similar to JapaneseNumberFilter.
> It should be able to cover the following test cases.
> 1) Korean Word to Number
> 십만이천오백 => 102500
> 2) 1 character conversion
> 일영영영 => 1000
> 3) Decimal Point Calculation
> 3.2천 => 3200
> 4) Comma between three digits
> 4,647.0010 => 4647.001






[jira] [Commented] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer

2019-06-09 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859470#comment-16859470
 ] 

Namgyu Kim commented on LUCENE-8812:


Hi [~jim.ferenczi] :D

I pushed my commit to the *branch_8x* and *master* branches.
 I checked, and the changes seem to be reflected fine.
 So I'll resolve this issue.
Let me know if there are any problems.

> add KoreanNumberFilter to Nori(Korean) Analyzer
> ---
>
> Key: LUCENE-8812
> URL: https://issues.apache.org/jira/browse/LUCENE-8812
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Assignee: Namgyu Kim
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: LUCENE-8812.patch
>
>
> This is a follow-up issue to LUCENE-8784.
> The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to 
> regular Arabic decimal numbers in half-width characters.
> Logic is similar to JapaneseNumberFilter.
> It should be able to cover the following test cases.
> 1) Korean Word to Number
> 십만이천오백 => 102500
> 2) 1 character conversion
> 일영영영 => 1000
> 3) Decimal Point Calculation
> 3.2천 => 3200
> 4) Comma between three digits
> 4,647.0010 => 4647.001






[jira] [Updated] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer

2019-06-09 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8812:
---
Fix Version/s: 8.2
   master (9.0)

> add KoreanNumberFilter to Nori(Korean) Analyzer
> ---
>
> Key: LUCENE-8812
> URL: https://issues.apache.org/jira/browse/LUCENE-8812
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Assignee: Namgyu Kim
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: LUCENE-8812.patch
>
>
> This is a follow-up issue to LUCENE-8784.
> The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to 
> regular Arabic decimal numbers in half-width characters.
> Logic is similar to JapaneseNumberFilter.
> It should be able to cover the following test cases.
> 1) Korean Word to Number
> 십만이천오백 => 102500
> 2) 1 character conversion
> 일영영영 => 1000
> 3) Decimal Point Calculation
> 3.2천 => 3200
> 4) Comma between three digits
> 4,647.0010 => 4647.001






[jira] [Assigned] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer

2019-06-08 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim reassigned LUCENE-8812:
--

Assignee: Namgyu Kim

> add KoreanNumberFilter to Nori(Korean) Analyzer
> ---
>
> Key: LUCENE-8812
> URL: https://issues.apache.org/jira/browse/LUCENE-8812
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Assignee: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8812.patch
>
>
> This is a follow-up issue to LUCENE-8784.
> The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to 
> regular Arabic decimal numbers in half-width characters.
> Logic is similar to JapaneseNumberFilter.
> It should be able to cover the following test cases.
> 1) Korean Word to Number
> 십만이천오백 => 102500
> 2) 1 character conversion
> 일영영영 => 1000
> 3) Decimal Point Calculation
> 3.2천 => 3200
> 4) Comma between three digits
> 4,647.0010 => 4647.001






[jira] [Comment Edited] (LUCENE-8817) Combine Nori and Kuromoji DictionaryBuilder

2019-06-07 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858895#comment-16858895
 ] 

Namgyu Kim edited comment on LUCENE-8817 at 6/7/19 6:40 PM:


Let me share the current status.
 The merging is almost over, and I need some discussion.

 

I considered several structures.

1. Put it under tools in the analysis-common module.
 It is simple, but I think MeCab is hard to see as a feature of 
analysis-common.

2. Create tools folder in analysis and set mecab-tools module in there.
 analysis/tools ─ analysis-common-tools (to-be)
                       └ icu-tools (to-be)
                       └ mecab-tools
                       └ ...
 The problem with this is that the number of modules increases a lot because 
each tool is created as a module.

3. Create a module called mecab
 We can create a mecab module that is the starting point for merging nori and 
kuromoji.
 If we proceed in this direction, we will only have tools in src.

But with this approach it may not be easy to create the runnable jar,
 because it will include the library classes.
 (ex: MecabAnalyzer, MecabTokenizer, ...)

4. Create a module called mecab-tools
 It's easy to develop, but the other modules in analysis are library modules,
 so it seems strange for this to be the only runnable-jar module.

 

Number 2 seems to be the best, but I'm not sure yet.
 I would appreciate any comments.

 

I will go ahead if direction is set, but landing will be delayed a little.
 The reason is that the build system is going to change. (SOLR-13452)
 But if it does not matter, I will proceed.


was (Author: danmuzi):
I share the current status.
 The merge is almost over and I need some discussion.

 

I thought several structures.

1. Save in tools of analysis-common module.
 It is simple, but I think MeCab is difficult to see as a feature of 
analysis-common.

2. Create tools folder in analysis and set mecab-tools module in there.
 analysis/tools ─ analysis-common-tools (to-be)
                      └ icu-tools (to-be)
                      └ mecab-tools
                      └ ...
 The problem with this is that the number of modules increases a lot because 
each tool is created as a module.

3. Create a module called mecab
 we can create a mecab module that is the starting point for merging nori and 
kuromoji.
 If we proceed in this direction, we will only have tools in src.

But this approach may not be easy to create the runnable jar.
 Because it will include the library.
 (ex: MecabAnalyzer, MecabTokenizer, ...)

4. Create a module called mecab-tools
 It's easy to develop, but there are other library modules in analysis.
 So something seems strange because it's only runnable-jar.

 

Number 2 seems to be the best, but I'm not sure yet.
 I would appreciate any comments.

 

I will go ahead if direction is set, but landing will be delayed a little.
 The reason is that the build system is going to change. (SOLR-13452)
 But if it does not matter, I will proceed.

> Combine Nori and Kuromoji DictionaryBuilder
> ---
>
> Key: LUCENE-8817
> URL: https://issues.apache.org/jira/browse/LUCENE-8817
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Priority: Major
>
> This issue is related to LUCENE-8816.
> Currently Nori and Kuromoji Analyzer use the same dictionary structure. 
> (MeCab)
>  If we make combine DictionaryBuilder, we can reduce the code size.
>  But this task may have a dependency on the language.
>  (like HEADER string in BinaryDictionary and CharacterDefinition, methods in 
> BinaryDictionaryWriter, ...)
>  On the other hand, there are many overlapped classes.
> The purpose of this patch is to provide users of Nori and Kuromoji with the 
> same system dictionary generator.
> It may take some time because there is a little workload.
>  The work will be based on the latest master, and if the LUCENE-8816 is 
> finished first, I will pull the latest code and proceed.






[jira] [Commented] (LUCENE-8817) Combine Nori and Kuromoji DictionaryBuilder

2019-06-07 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858895#comment-16858895
 ] 

Namgyu Kim commented on LUCENE-8817:


Let me share the current status.
 The merge is almost over, and I need some discussion.

 

I considered several structures.

1. Put it under tools in the analysis-common module.
 It is simple, but I think MeCab is hard to see as a feature of 
analysis-common.

2. Create tools folder in analysis and set mecab-tools module in there.
 analysis/tools ─ analysis-common-tools (to-be)
                      └ icu-tools (to-be)
                      └ mecab-tools
                      └ ...
 The problem with this is that the number of modules increases a lot because 
each tool is created as a module.

3. Create a module called mecab
 We can create a mecab module that is the starting point for merging nori and 
kuromoji.
 If we proceed in this direction, we will only have tools in src.

But with this approach it may not be easy to create the runnable jar,
 because it will include the library classes.
 (ex: MecabAnalyzer, MecabTokenizer, ...)

4. Create a module called mecab-tools
 It's easy to develop, but the other modules in analysis are library modules,
 so it seems strange for this to be the only runnable-jar module.

 

Number 2 seems to be the best, but I'm not sure yet.
 I would appreciate any comments.

 

I will go ahead if direction is set, but landing will be delayed a little.
 The reason is that the build system is going to change. (SOLR-13452)
 But if it does not matter, I will proceed.

> Combine Nori and Kuromoji DictionaryBuilder
> ---
>
> Key: LUCENE-8817
> URL: https://issues.apache.org/jira/browse/LUCENE-8817
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Priority: Major
>
> This issue is related to LUCENE-8816.
> Currently, Nori and Kuromoji Analyzer use the same dictionary structure 
> (MeCab).
>  If we combine the DictionaryBuilders, we can reduce the code size.
>  But this task may have language-specific dependencies (like the HEADER 
> string in BinaryDictionary and CharacterDefinition, and methods in 
> BinaryDictionaryWriter).
>  On the other hand, there are many overlapping classes.
> The purpose of this patch is to provide users of Nori and Kuromoji with the 
> same system dictionary generator.
> It may take some time because there is a certain amount of work involved.
>  The work will be based on the latest master, and if LUCENE-8816 is 
> finished first, I will pull the latest code and proceed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer

2019-06-07 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858839#comment-16858839
 ] 

Namgyu Kim commented on LUCENE-8812:


Thank you for your reply. [~jim.ferenczi] :D

Awesome. I'll submit this patch.

This is the first time I have submitted a patch manually.
I will look up the process and proceed, but it may take a little time.

> add KoreanNumberFilter to Nori(Korean) Analyzer
> ---
>
> Key: LUCENE-8812
> URL: https://issues.apache.org/jira/browse/LUCENE-8812
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8812.patch
>
>
> This is a follow-up issue to LUCENE-8784.
> The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to 
> regular Arabic decimal numbers in half-width characters.
> The logic is similar to JapaneseNumberFilter.
> It should be able to cover the following test cases:
> 1) Korean word to number
> 십만이천오백 => 102500
> 2) Digit-by-digit conversion
> 일영영영 => 1000
> 3) Decimal point calculation
> 3.2천 => 3200
> 4) Commas between groups of three digits
> 4,647.0010 => 4647.001
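To illustrate the integer cases above, here is a small standalone sketch of the normalization idea. This is a hypothetical illustration, not the actual KoreanNumberFilter, which operates on a token stream and also handles decimal points, grouping commas, and mixed Arabic digits; the class and method names are made up for this example.

```java
import java.util.Map;

// Hypothetical sketch: turn a Korean numeral string into its decimal value.
class KoreanNumberSketch {

    private static final Map<Character, Long> DIGITS = Map.of(
            '영', 0L, '일', 1L, '이', 2L, '삼', 3L, '사', 4L,
            '오', 5L, '육', 6L, '칠', 7L, '팔', 8L, '구', 9L);
    private static final Map<Character, Long> SMALL_UNITS = Map.of(
            '십', 10L, '백', 100L, '천', 1000L);
    private static final Map<Character, Long> BIG_UNITS = Map.of(
            '만', 10_000L, '억', 100_000_000L);

    static long normalize(String s) {
        long total = 0;   // completed 만/억 sections
        long section = 0; // value accumulated below the next big unit
        long current = 0; // digits read since the last unit character
        for (char c : s.toCharArray()) {
            if (DIGITS.containsKey(c)) {
                // positional digits, so 일영영영 reads as 1-0-0-0 -> 1000
                current = current * 10 + DIGITS.get(c);
            } else if (SMALL_UNITS.containsKey(c)) {
                // a bare unit counts as one: 십만 -> 10 * 10000
                section += (current == 0 ? 1 : current) * SMALL_UNITS.get(c);
                current = 0;
            } else if (BIG_UNITS.containsKey(c)) {
                section += current;
                total += (section == 0 ? 1 : section) * BIG_UNITS.get(c);
                section = 0;
                current = 0;
            }
        }
        return total + section + current;
    }

    public static void main(String[] args) {
        System.out.println(normalize("십만이천오백")); // 102500
        System.out.println(normalize("일영영영"));     // 1000
    }
}
```

Test case 3 (3.2천) would additionally require parsing Arabic digits and a decimal point, which this sketch deliberately leaves out.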






[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-06-01 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853784#comment-16853784
 ] 

Namgyu Kim commented on LUCENE-8816:


Thanks! [~tomoko] :D

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I was inspired by this mailing-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese users already know, the default built-in dictionary bundled 
> with Kuromoji (MeCab IPADIC) is rather old and has not been maintained for 
> many years. While it has slowly become obsolete, well-maintained and/or 
> extended dictionaries have emerged in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]), and several attempts/projects/efforts 
> have been made in Japan to use them with Kuromoji.
> However, the current architecture - a dictionary bundled into the jar - is 
> essentially incompatible with the idea of "switching the system dictionary", 
> and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) have been decoupled (as in MeCab, the 
> origin of Kuromoji, or lucene-gosen). So decoupling them is a natural idea, 
> and I feel it is a good time to re-think the current architecture.
> This would also be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and the encoded system dictionary.
>  * Implement a dynamic dictionary load mechanism.
>  * Provide a developer-oriented dictionary build tool.
> Non-goals:
>  * Provide a learner or language model (that is up to users and should be 
> outside the scope).
> I have not dived into the code yet, so I have no idea yet how easy or 
> difficult it will be.






[jira] [Updated] (LUCENE-8817) Combine Nori and Kuromoji DictionaryBuilder

2019-06-01 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8817:
---
Description: 
This issue is related to LUCENE-8816.

Currently, Nori and Kuromoji Analyzer use the same dictionary structure 
(MeCab).
 If we combine the DictionaryBuilders, we can reduce the code size.
 But this task may have language-specific dependencies (like the HEADER string 
in BinaryDictionary and CharacterDefinition, and methods in 
BinaryDictionaryWriter).
 On the other hand, there are many overlapping classes.

The purpose of this patch is to provide users of Nori and Kuromoji with the 
same system dictionary generator.

It may take some time because there is a certain amount of work involved.
 The work will be based on the latest master, and if LUCENE-8816 is finished 
first, I will pull the latest code and proceed.

  was:
This issue is related to LUCENE-8816.

Currently Nori and Kuromoji Analyzer use the same dictionary structure. (MeCab)
If we make combine DictionaryBuilder, we can reduce the code size.
But this task may have a dependency on the language.
(like HEADER string in BinaryDictionary and CharacterDefinition, methods in 
BinaryDictionaryWriter, ...)
On the other hand, there are many overlapped classes.

The purpose of this patch is to provide users of Nori and Kuromoji with the 
same system dictionary generator.

It may take some time because there is a little workload.
The work will be based on the latest master, and if the LUCENE-8816 is finished 
first, it will pull the latest code and proceed.


> Combine Nori and Kuromoji DictionaryBuilder
> ---
>
> Key: LUCENE-8817
> URL: https://issues.apache.org/jira/browse/LUCENE-8817
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Priority: Major
>
> This issue is related to LUCENE-8816.
> Currently, Nori and Kuromoji Analyzer use the same dictionary structure 
> (MeCab).
>  If we combine the DictionaryBuilders, we can reduce the code size.
>  But this task may have language-specific dependencies (like the HEADER 
> string in BinaryDictionary and CharacterDefinition, and methods in 
> BinaryDictionaryWriter).
>  On the other hand, there are many overlapping classes.
> The purpose of this patch is to provide users of Nori and Kuromoji with the 
> same system dictionary generator.
> It may take some time because there is a certain amount of work involved.
>  The work will be based on the latest master, and if LUCENE-8816 is 
> finished first, I will pull the latest code and proceed.






[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-06-01 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853754#comment-16853754
 ] 

Namgyu Kim commented on LUCENE-8816:


Thanks for your reply! [~tomoko]


 I created the LUCENE-8817 issue and linked it to this one with a "related 
to" link.
 I'll let you know when there is progress.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I was inspired by this mailing-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese users already know, the default built-in dictionary bundled 
> with Kuromoji (MeCab IPADIC) is rather old and has not been maintained for 
> many years. While it has slowly become obsolete, well-maintained and/or 
> extended dictionaries have emerged in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]), and several attempts/projects/efforts 
> have been made in Japan to use them with Kuromoji.
> However, the current architecture - a dictionary bundled into the jar - is 
> essentially incompatible with the idea of "switching the system dictionary", 
> and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) have been decoupled (as in MeCab, the 
> origin of Kuromoji, or lucene-gosen). So decoupling them is a natural idea, 
> and I feel it is a good time to re-think the current architecture.
> This would also be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and the encoded system dictionary.
>  * Implement a dynamic dictionary load mechanism.
>  * Provide a developer-oriented dictionary build tool.
> Non-goals:
>  * Provide a learner or language model (that is up to users and should be 
> outside the scope).
> I have not dived into the code yet, so I have no idea yet how easy or 
> difficult it will be.






[jira] [Created] (LUCENE-8817) Combine Nori and Kuromoji DictionaryBuilder

2019-06-01 Thread Namgyu Kim (JIRA)
Namgyu Kim created LUCENE-8817:
--

 Summary: Combine Nori and Kuromoji DictionaryBuilder
 Key: LUCENE-8817
 URL: https://issues.apache.org/jira/browse/LUCENE-8817
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Namgyu Kim


This issue is related to LUCENE-8816.

Currently, Nori and Kuromoji Analyzer use the same dictionary structure 
(MeCab).
If we combine the DictionaryBuilders, we can reduce the code size.
But this task may have language-specific dependencies (like the HEADER string 
in BinaryDictionary and CharacterDefinition, and methods in 
BinaryDictionaryWriter).
On the other hand, there are many overlapping classes.

The purpose of this patch is to provide users of Nori and Kuromoji with the 
same system dictionary generator.

It may take some time because there is a certain amount of work involved.
The work will be based on the latest master, and if LUCENE-8816 is finished 
first, I will pull the latest code and proceed.






[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-06-01 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853678#comment-16853678
 ] 

Namgyu Kim commented on LUCENE-8816:


Oh, you're right. [~tomoko] :D
 I'll create a new JIRA issue after cleaning up the changes.
{quote}To avoid confusion, personally I'd like to proceed things in a right 
order - cleaning up first, then generalizing. But if you are sure that we can 
go in parallel, can you share your plan?
{quote}
Sure, that's an important point.
 I think we can proceed in parallel.

There are two possible cases.
 1) You finish the JapaneseTokenizer and DictionaryBuilder work first.
 In that case, I can pull your new code and merge it with nori's 
DictionaryBuilder.

2) I finish merging DictionaryBuilder(nori) and DictionaryBuilder(kuromoji) 
first.
 In that case, you can pull and continue.
 The DictionaryBuilder logic of kuromoji does not change at all in my work.

But if you think this is inefficient, I'll do it later.
 What do you think?

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I was inspired by this mailing-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese users already know, the default built-in dictionary bundled 
> with Kuromoji (MeCab IPADIC) is rather old and has not been maintained for 
> many years. While it has slowly become obsolete, well-maintained and/or 
> extended dictionaries have emerged in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]), and several attempts/projects/efforts 
> have been made in Japan to use them with Kuromoji.
> However, the current architecture - a dictionary bundled into the jar - is 
> essentially incompatible with the idea of "switching the system dictionary", 
> and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) have been decoupled (as in MeCab, the 
> origin of Kuromoji, or lucene-gosen). So decoupling them is a natural idea, 
> and I feel it is a good time to re-think the current architecture.
> This would also be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and the encoded system dictionary.
>  * Implement a dynamic dictionary load mechanism.
>  * Provide a developer-oriented dictionary build tool.
> Non-goals:
>  * Provide a learner or language model (that is up to users and should be 
> outside the scope).
> I have not dived into the code yet, so I have no idea yet how easy or 
> difficult it will be.






[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-31 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853207#comment-16853207
 ] 

Namgyu Kim commented on LUCENE-8816:


{quote}So I thought we can share only DictionaryBuilder between Kuromoji and 
Nori module, as I wrote in the previous comment.
{quote}
I agree with your idea. [~tomoko]

I don't think it would be difficult to merge the DictionaryBuilders (except 
for BinaryDictionaryWriter).
 But I think the BinaryDictionaryWriter case can be solved if we split up its 
methods (and use DictionaryFormat).
 Can I work on this while you concentrate on JapaneseTokenizer?

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I was inspired by this mailing-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese users already know, the default built-in dictionary bundled 
> with Kuromoji (MeCab IPADIC) is rather old and has not been maintained for 
> many years. While it has slowly become obsolete, well-maintained and/or 
> extended dictionaries have emerged in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]), and several attempts/projects/efforts 
> have been made in Japan to use them with Kuromoji.
> However, the current architecture - a dictionary bundled into the jar - is 
> essentially incompatible with the idea of "switching the system dictionary", 
> and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) have been decoupled (as in MeCab, the 
> origin of Kuromoji, or lucene-gosen). So decoupling them is a natural idea, 
> and I feel it is a good time to re-think the current architecture.
> This would also be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and the encoded system dictionary.
>  * Implement a dynamic dictionary load mechanism.
>  * Provide a developer-oriented dictionary build tool.
> Non-goals:
>  * Provide a learner or language model (that is up to users and should be 
> outside the scope).
> I have not dived into the code yet, so I have no idea yet how easy or 
> difficult it will be.






[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-30 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852084#comment-16852084
 ] 

Namgyu Kim commented on LUCENE-8816:


That's a good plan, [~tomoko]!

About Tokenizer...
I'm not sure whether it is better to combine KoreanTokenizer and JapaneseTokenizer.

If we want to, there is a way.
KoreanTokenizer and JapaneseTokenizer share a fair amount of duplicated code.
We could create an abstract class called MecabTokenizer (both are MeCab 
based).
MecabTokenizer would hold the duplicated pieces (like isPunctuation(), the 
WrappedPositionArray inner class, ...).
What do you think about it?

Anyway, combining KoreanTokenizer and JapaneseTokenizer will be a big task and 
I think it's right to create a new issue.
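To make the idea concrete, here is a minimal sketch of such a base class hosting a shared isPunctuation() helper. The class name comes from the discussion above, but the body is illustrative only: the Unicode category list shows the kind of check both tokenizers perform and is not copied from either one.

```java
// Hypothetical sketch of the proposed shared base class. Only the duplicated
// helper is shown; the real tokenizers keep much more state (viterbi lattice,
// WrappedPositionArray, ...).
abstract class MecabTokenizerSketch {

    /** Returns true if ch should be treated as punctuation when tokenizing. */
    static boolean isPunctuation(char ch) {
        switch (Character.getType(ch)) {
            case Character.SPACE_SEPARATOR:
            case Character.LINE_SEPARATOR:
            case Character.PARAGRAPH_SEPARATOR:
            case Character.CONTROL:
            case Character.FORMAT:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.CONNECTOR_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
            case Character.MATH_SYMBOL:
            case Character.CURRENCY_SYMBOL:
            case Character.MODIFIER_SYMBOL:
            case Character.OTHER_SYMBOL:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isPunctuation('!')); // true
        System.out.println(isPunctuation('a')); // false
    }
}
```

Each concrete tokenizer would then extend this class instead of carrying its own copy of the helper.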

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I was inspired by this mailing-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese users already know, the default built-in dictionary bundled 
> with Kuromoji (MeCab IPADIC) is rather old and has not been maintained for 
> many years. While it has slowly become obsolete, well-maintained and/or 
> extended dictionaries have emerged in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]), and several attempts/projects/efforts 
> have been made in Japan to use them with Kuromoji.
> However, the current architecture - a dictionary bundled into the jar - is 
> essentially incompatible with the idea of "switching the system dictionary", 
> and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) have been decoupled (as in MeCab, the 
> origin of Kuromoji, or lucene-gosen). So decoupling them is a natural idea, 
> and I feel it is a good time to re-think the current architecture.
> This would also be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and the encoded system dictionary.
>  * Implement a dynamic dictionary load mechanism.
>  * Provide a developer-oriented dictionary build tool.
> Non-goals:
>  * Provide a learner or language model (that is up to users and should be 
> outside the scope).
> I have not dived into the code yet, so I have no idea yet how easy or 
> difficult it will be.






[jira] [Commented] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer

2019-05-30 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852014#comment-16852014
 ] 

Namgyu Kim commented on LUCENE-8812:


Thank you for your reply, [~jim.ferenczi] :D
{quote}I wonder if it would be difficult to have a base class for the Japanese 
and Korean number filter since they share a large amount of code. However I 
think it's ok to merge this first and we can tackle the merge in a follow up, 
wdyt ?
{quote}
I think that would be an excellent refactoring.
 Once it is done, we could also share this TokenFilter in SmartChineseAnalyzer 
(Chinese and Japanese use the same numeric characters), and the amount of code 
would be reduced as well.

I think the NumberFilter (the new abstract class) could live in the 
org.apache.lucene.analysis.core (analysis-common) or 
org.apache.lucene.analysis (lucene-core) package; what do you think?
 In my personal opinion, analysis-common seems correct, but it is a bit 
ambiguous.

> add KoreanNumberFilter to Nori(Korean) Analyzer
> ---
>
> Key: LUCENE-8812
> URL: https://issues.apache.org/jira/browse/LUCENE-8812
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8812.patch
>
>
> This is a follow-up issue to LUCENE-8784.
> The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to 
> regular Arabic decimal numbers in half-width characters.
> The logic is similar to JapaneseNumberFilter.
> It should be able to cover the following test cases:
> 1) Korean word to number
> 십만이천오백 => 102500
> 2) Digit-by-digit conversion
> 일영영영 => 1000
> 3) Decimal point calculation
> 3.2천 => 3200
> 4) Commas between groups of three digits
> 4,647.0010 => 4647.001






[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-29 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851103#comment-16851103
 ] 

Namgyu Kim commented on LUCENE-8816:


Thanks for the answers :D

Oh, I think your suggestion is good, [~tomoko].
It is important to ensure that users do not have to modify the source code.
With that approach, it seems we could create a build tool that supports the 
two pre-defined dictionary types in the CLI environment.
And as [~rcmuir] said, it's very important to handle exceptions well;
otherwise, users will think the stability is poor.

About version control, I totally agree with you.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I was inspired by this mailing-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese users already know, the default built-in dictionary bundled 
> with Kuromoji (MeCab IPADIC) is rather old and has not been maintained for 
> many years. While it has slowly become obsolete, well-maintained and/or 
> extended dictionaries have emerged in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]), and several attempts/projects/efforts 
> have been made in Japan to use them with Kuromoji.
> However, the current architecture - a dictionary bundled into the jar - is 
> essentially incompatible with the idea of "switching the system dictionary", 
> and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) have been decoupled (as in MeCab, the 
> origin of Kuromoji, or lucene-gosen). So decoupling them is a natural idea, 
> and I feel it is a good time to re-think the current architecture.
> This would also be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and the encoded system dictionary.
>  * Implement a dynamic dictionary load mechanism.
>  * Provide a developer-oriented dictionary build tool.
> Non-goals:
>  * Provide a learner or language model (that is up to users and should be 
> outside the scope).
> I have not dived into the code yet, so I have no idea yet how easy or 
> difficult it will be.






[jira] [Commented] (LUCENE-8813) testIndexTooManyDocs fails

2019-05-29 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851076#comment-16851076
 ] 

Namgyu Kim commented on LUCENE-8813:


Oh, I saw the comment a little late.
I hadn't found the root cause myself, but your analysis is great.
The patch looks good too.
I'll also review the PR on GitHub :D

> testIndexTooManyDocs fails
> --
>
> Key: LUCENE-8813
> URL: https://issues.apache.org/jira/browse/LUCENE-8813
> Project: Lucene - Core
>  Issue Type: Test
>  Components: core/index
>Reporter: Nhat Nguyen
>Priority: Major
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> testIndexTooManyDocs fails on [Elastic 
> CI|https://elasticsearch-ci.elastic.co/job/apache+lucene-solr+branch_8x/6402/console].
>  This failure does not reproduce locally for me.
> {noformat}
> [junit4] Suite: org.apache.lucene.index.TestIndexTooManyDocs
>[junit4]   2> KTN 23, 2019 4:09:37 PM 
> com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler
>  uncaughtException
>[junit4]   2> WARNING: Uncaught exception in thread: 
> Thread[Thread-612,5,TGRP-TestIndexTooManyDocs]
>[junit4]   2> java.lang.AssertionError: only modifications from the 
> current flushing queue are permitted while doing a full flush
>[junit4]   2> at 
> __randomizedtesting.SeedInfo.seed([1F16B1DA7056AA52]:0)
>[junit4]   2> at 
> org.apache.lucene.index.DocumentsWriter.assertTicketQueueModification(DocumentsWriter.java:683)
>[junit4]   2> at 
> org.apache.lucene.index.DocumentsWriter.applyAllDeletes(DocumentsWriter.java:187)
>[junit4]   2> at 
> org.apache.lucene.index.DocumentsWriter.postUpdate(DocumentsWriter.java:411)
>[junit4]   2> at 
> org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:514)
>[junit4]   2> at 
> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
>[junit4]   2> at 
> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1586)
>[junit4]   2> at 
> org.apache.lucene.index.TestIndexTooManyDocs.lambda$testIndexTooManyDocs$0(TestIndexTooManyDocs.java:70)
>[junit4]   2> at java.base/java.lang.Thread.run(Thread.java:834)
>[junit4]   2> 
>[junit4]   2> KTN 23, 2019 6:09:36 PM 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$2 evaluate
>[junit4]   2> WARNING: Suite execution timed out: 
> org.apache.lucene.index.TestIndexTooManyDocs
>[junit4]   2>1) Thread[id=669, 
> name=SUITE-TestIndexTooManyDocs-seed#[1F16B1DA7056AA52], state=RUNNABLE, 
> group=TGRP-TestIndexTooManyDocs]
>[junit4]   2> at 
> java.base/java.lang.Thread.getStackTrace(Thread.java:1606)
>[junit4]   2> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$4.run(ThreadLeakControl.java:696)
>[junit4]   2> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$4.run(ThreadLeakControl.java:693)
>[junit4]   2> at 
> java.base/java.security.AccessController.doPrivileged(Native Method)
>[junit4]   2> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl.getStackTrace(ThreadLeakControl.java:693)
>[junit4]   2> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl.getThreadsWithTraces(ThreadLeakControl.java:709)
>[junit4]   2> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl.formatThreadStacksFull(ThreadLeakControl.java:689)
>[junit4]   2> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl.access$1000(ThreadLeakControl.java:65)
>[junit4]   2> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$2.evaluate(ThreadLeakControl.java:415)
>[junit4]   2> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.runSuite(RandomizedRunner.java:708)
>[junit4]   2> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.access$200(RandomizedRunner.java:138)
>[junit4]   2> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$2.run(RandomizedRunner.java:629)
>[junit4]   2>2) Thread[id=671, name=Thread-606, state=BLOCKED, 
> group=TGRP-TestIndexTooManyDocs]
>[junit4]   2> at 
> app//org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:4945)
>[junit4]   2> at 
> app//org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:293)
>[junit4]   2> at 
> app//org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:272)
>[junit4]   2> at 
> app//org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:262)
>[junit4]   2> at 
> app//org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:165)
>[junit4]   2> at 
> 

[jira] [Commented] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer

2019-05-29 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851033#comment-16851033
 ] 

Namgyu Kim commented on LUCENE-8812:


Hi [~jim.ferenczi] :D
Thank you for applying the LUCENE-8784 patch!
I verified that this patch applies cleanly to the latest code. (no conflicts)

What do you think about this patch?

> add KoreanNumberFilter to Nori(Korean) Analyzer
> ---
>
> Key: LUCENE-8812
> URL: https://issues.apache.org/jira/browse/LUCENE-8812
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8812.patch
>
>
> This is a follow-up issue to LUCENE-8784.
> The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to 
> regular Arabic decimal numbers in half-width characters.
> The logic is similar to JapaneseNumberFilter.
> It should be able to cover the following test cases:
> 1) Korean word to number
> 십만이천오백 => 102500
> 2) Digit-by-digit conversion
> 일영영영 => 1000
> 3) Decimal point calculation
> 3.2천 => 3200
> 4) Commas between groups of three digits
> 4,647.0010 => 4647.001






[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-28 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849967#comment-16849967
 ] 

Namgyu Kim commented on LUCENE-8816:


Hi everybody,
 Thank you for opening the issue! [~tomoko]

To be honest, when I first talked about a custom system dictionary, I did not 
see the big picture.

Anyway, the structure I have in mind is as follows.

 

1. As Tomoko said, make a developer-oriented dictionary build tool.
 The "ant regenerate" target inside the build.xml that I checked has the 
following steps:
  1) Compile the code (compile-tools)
  2) Download the jar file (download-dict)
  3) Save Noun.proper.csv diffs (patch-dict)
  4) Run DictionaryBuilder and make dat files (build-dict)

It works even if the user builds only the system dictionary. (Of course, there 
is still the problem of modifying the classpath.)
 (ex) ant build-dict ipadic /home/my/path/customDicIn(custom-dic input) 
/home/my/path/customDicOutput(dat output) utf-8 false

However, if the user needs to fetch a dictionary from a server, they have to 
modify build.xml.
 As far as I know, the URL path is hard-coded.
 Of course, the user can make it work by modifying ivy.xml and build.xml.
 But my personal opinion is that users should not have to touch Lucene's 
internal code. (even the build script)
 They may be afraid to change it or reluctant to use it. (Especially those who 
have not used Apache Ant)
 However, I think this may differ from person to person.

 

2. Version Control
 I actually think this is the biggest problem.
 As I mentioned in the email,
 whenever the Lucene version goes up, users have to rebuild their system 
dictionary unconditionally and put it in the jar.
 The current process is:
  1) The same process as in 1.
  2) Move the user's system dictionary dat files to 
resources/org.apache.lucene.analysis.ja.dict
  3) ant jar

Because of step 3), the user always has to rebuild the kuromoji module or patch 
the kuromoji jar.
 Users can feel irritated when a new version contains no kuromoji module 
changes at all.

This problem could be solved easily if the system dictionary could simply be 
passed as a parameter to JapaneseTokenizer.
 (Of course, expert javadoc would be required.)
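
To make the idea concrete, the decoupling could look roughly like the sketch below. All names here are hypothetical illustrations, not the actual Kuromoji API; the point is only that the dictionary source becomes an injected parameter instead of a resource bundled in the jar:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch (illustrative names only): the tokenizer would receive a
// DictionaryProvider instead of always loading dat files from its own jar.
public class DictionarySketch {

    interface DictionaryProvider {
        InputStream open(String resourceName) throws IOException;
    }

    // Today's behavior: bundled dictionary resources loaded from the classpath.
    static class ClasspathDictionary implements DictionaryProvider {
        @Override
        public InputStream open(String resourceName) throws IOException {
            InputStream in = DictionarySketch.class.getResourceAsStream(resourceName);
            if (in == null) {
                throw new IOException("bundled resource not found: " + resourceName);
            }
            return in;
        }
    }

    // Proposed behavior: a user-built dictionary directory that can be rebuilt
    // independently of Lucene releases, so no "ant jar" on every upgrade.
    static class ExternalDictionary implements DictionaryProvider {
        private final Path dir;

        ExternalDictionary(Path dir) {
            this.dir = dir;
        }

        @Override
        public InputStream open(String resourceName) throws IOException {
            return Files.newInputStream(dir.resolve(resourceName));
        }
    }

    public static void main(String[] args) throws IOException {
        // Demonstrate the external path: write a fake dat file and read it back.
        Path tmp = Files.createTempDirectory("dict-sketch");
        Files.write(tmp.resolve("sys.dat"), new byte[] {42});
        try (InputStream in = new ExternalDictionary(tmp).open("sys.dat")) {
            System.out.println(in.read()); // prints 42
        }
    }
}
```

A JapaneseTokenizer constructor could then accept such a provider, defaulting to the classpath variant to keep backward compatibility.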

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.






[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-27 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849000#comment-16849000
 ] 

Namgyu Kim commented on LUCENE-8784:


Oh, I checked that discardPunctuation is removed from KoreanAnalyzer.

Thank you very much for applying my patch! [~jim.ferenczi] :D

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch, 
> LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  






[jira] [Commented] (LUCENE-8813) testIndexTooManyDocs fails

2019-05-25 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848196#comment-16848196
 ] 

Namgyu Kim commented on LUCENE-8813:


Hi, [~dnhatn] and [~simonw].

 

I have not found the root cause yet, but the log analysis seems to be finished.
 I don't know if it will help, but I'll share it.

 

There are two thread types in the TestIndexTooManyDocs test.

Type1) Write to the index (IndexWriter#updateDocument()) - 3 to 7 threads 
created (randomly)
 Type2) Open (refresh) the index (DirectoryReader#openIfChanged()) - 2 threads 
created

 

The Type2 threads can only terminate after *all Type1 threads have terminated.*
 The reason is the *"while (done.get() == false)"* condition.
 "done.set(true)" only runs once the Type1 threads are finished.
 (indexingDone.await() followed by done.set(true))

 

And, unfortunately, if an exception occurs in a Type1 thread, the thread 
terminates *without* calling "indexingDone.countDown()".
 (The Elasticsearch log shows java.lang.AssertionError: "only modifications 
from the current flushing queue are permitted while doing a full flush" in 
DocumentsWriter#assertTicketQueueModification())

*The exception doesn't end the program*; only that Type1 thread is terminated. 
(the CountDownLatch count remains)
 So the test goes into an infinite loop. (because of the "while (done.get() == 
false)" loop)

 

The Elasticsearch log shows that there is a suite time limit of 2 hours.
 (Mentioned in "Throwable #1: java.lang.Exception: Suite timeout exceeded (>= 
720 msec)")
 That's why the log contains heartbeats from
 [junit4] HEARTBEAT J1 PID(2340@localhost): 2019-05-23T*14:14:47*, stalled for 
310s at: TestIndexTooManyDocs.testIndexTooManyDocs
 to
 [junit4] HEARTBEAT J1 PID(2340@localhost): 2019-05-23T*16:08:47*, stalled for 
7150s at: TestIndexTooManyDocs.testIndexTooManyDocs.

 

As a result, I think it is an updateDocument() problem and we should look for 
the cause.
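
For reference, the hang mechanism described above (a writer thread dying before counting down the latch, so the reader loop never sees done == true) can be sketched with plain JDK primitives. This is an illustrative reconstruction with made-up names, not the actual test code; it shows why counting down in a finally block avoids the infinite loop:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

public class LatchLeakDemo {

    static String run() throws InterruptedException {
        int writers = 3;
        CountDownLatch indexingDone = new CountDownLatch(writers);
        AtomicBoolean done = new AtomicBoolean(false);

        // "Type2" thread: spins until done is set, like the openIfChanged() loop.
        Thread reader = new Thread(() -> {
            while (done.get() == false) {
                Thread.yield();
            }
        });
        reader.start();

        // "Type1" threads: one of them fails, simulating the AssertionError.
        for (int i = 0; i < writers; i++) {
            final boolean failing = (i == 0);
            new Thread(() -> {
                try {
                    if (failing) {
                        throw new AssertionError("simulated full-flush assertion");
                    }
                } catch (Throwable t) {
                    // An uncaught exception only kills this thread, not the suite.
                } finally {
                    // Without this finally, the failing thread never counts down,
                    // await() below blocks forever, done stays false, and the
                    // reader spins until the 2-hour suite timeout.
                    indexingDone.countDown();
                }
            }).start();
        }

        indexingDone.await();
        done.set(true);
        reader.join();
        return "all threads finished";
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run());
    }
}
```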

> testIndexTooManyDocs fails
> --
>
> Key: LUCENE-8813
> URL: https://issues.apache.org/jira/browse/LUCENE-8813
> Project: Lucene - Core
>  Issue Type: Test
>  Components: core/index
>Reporter: Nhat Nguyen
>Priority: Major
>
> testIndexTooManyDocs fails on [Elastic 
> CI|https://elasticsearch-ci.elastic.co/job/apache+lucene-solr+branch_8x/6402/console].
>  This failure does not reproduce locally for me.
> {noformat}
> [junit4] Suite: org.apache.lucene.index.TestIndexTooManyDocs
>[junit4]   2> KTN 23, 2019 4:09:37 PM 
> com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler
>  uncaughtException
>[junit4]   2> WARNING: Uncaught exception in thread: 
> Thread[Thread-612,5,TGRP-TestIndexTooManyDocs]
>[junit4]   2> java.lang.AssertionError: only modifications from the 
> current flushing queue are permitted while doing a full flush
>[junit4]   2> at 
> __randomizedtesting.SeedInfo.seed([1F16B1DA7056AA52]:0)
>[junit4]   2> at 
> org.apache.lucene.index.DocumentsWriter.assertTicketQueueModification(DocumentsWriter.java:683)
>[junit4]   2> at 
> org.apache.lucene.index.DocumentsWriter.applyAllDeletes(DocumentsWriter.java:187)
>[junit4]   2> at 
> org.apache.lucene.index.DocumentsWriter.postUpdate(DocumentsWriter.java:411)
>[junit4]   2> at 
> org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:514)
>[junit4]   2> at 
> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
>[junit4]   2> at 
> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1586)
>[junit4]   2> at 
> org.apache.lucene.index.TestIndexTooManyDocs.lambda$testIndexTooManyDocs$0(TestIndexTooManyDocs.java:70)
>[junit4]   2> at java.base/java.lang.Thread.run(Thread.java:834)
>[junit4]   2> 
>[junit4]   2> KTN 23, 2019 6:09:36 PM 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$2 evaluate
>[junit4]   2> WARNING: Suite execution timed out: 
> org.apache.lucene.index.TestIndexTooManyDocs
>[junit4]   2>1) Thread[id=669, 
> name=SUITE-TestIndexTooManyDocs-seed#[1F16B1DA7056AA52], state=RUNNABLE, 
> group=TGRP-TestIndexTooManyDocs]
>[junit4]   2> at 
> java.base/java.lang.Thread.getStackTrace(Thread.java:1606)
>[junit4]   2> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$4.run(ThreadLeakControl.java:696)
>[junit4]   2> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$4.run(ThreadLeakControl.java:693)
>[junit4]   2> at 
> java.base/java.security.AccessController.doPrivileged(Native Method)
>[junit4]   2> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl.getStackTrace(ThreadLeakControl.java:693)
>[junit4]   2> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl.getThreadsWithTraces(ThreadLeakControl.java:709)

[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-24 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847706#comment-16847706
 ] 

Namgyu Kim commented on LUCENE-8784:


Thanks, [~jim.ferenczi]! :D
If there is something wrong, I would appreciate it if you let me know.

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch, 
> LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  






[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-24 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847659#comment-16847659
 ] 

Namgyu Kim commented on LUCENE-8784:


It's a good idea :D

I linked LUCENE-8784(discardPunctuation) with LUCENE-8812(KoreanNumberFilter).
(Apply LUCENE-8784 *first* and then LUCENE-8812)
Your suggestion made this issue cleaner.

In LUCENE-8784,
I did not change the existing test cases and just added new ones for 
discardPunctuation.
(I kept the current constructor to preserve the existing API.)

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch, 
> LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  






[jira] [Updated] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer

2019-05-24 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8812:
---
Attachment: LUCENE-8812.patch

> add KoreanNumberFilter to Nori(Korean) Analyzer
> ---
>
> Key: LUCENE-8812
> URL: https://issues.apache.org/jira/browse/LUCENE-8812
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8812.patch
>
>
> This is a follow-up issue to LUCENE-8784.
> The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to 
> regular Arabic decimal numbers in half-width characters.
> Logic is similar to JapaneseNumberFilter.
> It should be able to cover the following test cases.
> 1) Korean Word to Number
> 십만이천오백 => 102500
> 2) 1 character conversion
> 일영영영 => 1000
> 3) Decimal Point Calculation
> 3.2천 => 3200
> 4) Comma between three digits
> 4,647.0010 => 4647.001






[jira] [Updated] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-24 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8784:
---
Attachment: LUCENE-8784.patch

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch, 
> LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  






[jira] [Created] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer

2019-05-24 Thread Namgyu Kim (JIRA)
Namgyu Kim created LUCENE-8812:
--

 Summary: add KoreanNumberFilter to Nori(Korean) Analyzer
 Key: LUCENE-8812
 URL: https://issues.apache.org/jira/browse/LUCENE-8812
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Namgyu Kim


This is a follow-up issue to LUCENE-8784.

The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to 
regular Arabic decimal numbers in half-width characters.

Logic is similar to JapaneseNumberFilter.
It should be able to cover the following test cases.

1) Korean Word to Number
십만이천오백 => 102500

2) 1 character conversion
일영영영 => 1000

3) Decimal Point Calculation
3.2천 => 3200

4) Comma between three digits
4,647.0010 => 4647.001
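
For illustration only, the core of cases 1) and 2) can be sketched as a stand-alone parser. The class and method names below are made up for this sketch; the real KoreanNumberFilter works on token streams and also handles the decimal-point and comma cases:

```java
import java.util.Map;

public class KoreanNumberSketch {
    // Hangul digits and multipliers covered by this sketch; the filter covers more.
    private static final Map<Character, Integer> DIGITS = Map.of(
        '영', 0, '일', 1, '이', 2, '삼', 3, '사', 4,
        '오', 5, '육', 6, '칠', 7, '팔', 8, '구', 9);
    private static final Map<Character, Long> SMALL =
        Map.of('십', 10L, '백', 100L, '천', 1000L);
    private static final Map<Character, Long> BIG =
        Map.of('만', 10_000L, '억', 100_000_000L);

    public static long parse(String s) {
        long total = 0, section = 0, current = 0;
        for (char c : s.toCharArray()) {
            if (DIGITS.containsKey(c)) {
                current = current * 10 + DIGITS.get(c); // "일영영영" => 1000
            } else if (SMALL.containsKey(c)) {
                // "십" on its own means 10, i.e. an implicit leading 1.
                section += (current == 0 ? 1 : current) * SMALL.get(c);
                current = 0;
            } else if (BIG.containsKey(c)) {
                section = (section + current) * BIG.get(c); // "십만" => 10 * 10000
                total += section;
                section = 0;
                current = 0;
            } else {
                throw new IllegalArgumentException("unsupported character: " + c);
            }
        }
        return total + section + current;
    }

    public static void main(String[] args) {
        System.out.println(parse("십만이천오백")); // 102500
        System.out.println(parse("일영영영"));     // 1000
    }
}
```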






[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-23 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846877#comment-16846877
 ] 

Namgyu Kim commented on LUCENE-8784:


Oh, I forgot.
I also added Javadoc for discardPunctuation in your patch. (KoreanAnalyzer, 
KoreanTokenizerFactory)

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  






[jira] [Updated] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-23 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8784:
---
Attachment: LUCENE-8784.patch

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  






[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-23 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846871#comment-16846871
 ] 

Namgyu Kim commented on LUCENE-8784:


Thank you for your reply, [~jim.ferenczi]!

Your approach looks awesome.
I developed KoreanNumberFilter by referring to JapaneseNumberFilter.
Please check my patch :D
(use "git apply --whitespace=fix LUCENE-8784.patch" because of trailing 
whitespace error :()

I did not set KoreanNumberFilter as the default filter in KoreanAnalyzer.
By the way, wouldn't it be better to keep the constructors that do not take 
the discardPunctuation parameter?
(Otherwise existing Nori users will have to modify their code after upgrading.)


>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  






[jira] [Comment Edited] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-22 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845933#comment-16845933
 ] 

Namgyu Kim edited comment on LUCENE-8784 at 5/22/19 2:40 PM:
-

Thank you for your reply, [~jim.ferenczi] :D

 

I tried to process only the "." character in the Tokenizer,
 because Korean is a language that can have whitespace inside a sentence, while 
Japanese does not.
 (Character.OTHER_PUNCTUATION would match more than just the full stop 
character. => Right, that's a problem. I have to change that part...)

 

JapaneseTokenizer keeps whitespace when using the discardPunctuation option.
- example -
 "十万二千五 百 二千五" (means "102005 100 2005")
 If we run the JapaneseTokenizer with discardPunctuation=false and 
JapaneseNumberFilter, we get: {"102005", " ", "100", " ", "2005"}

 

Of course we could handle it with a StopFilter or internal processing in 
another Filter, but is that okay..?

 

Developing a NumberFilter looks much more flexible and structurally cleaner 
than internal processing in the Tokenizer.
 But since I developed it this way because of the problems above, how can we 
handle those spaces?

 

I think there are several ways to handle this problem:
 1) Remove whitespace from Punctuation list in Tokenizer.
 2) Use a TokenFilter to remove whitespace.
 3) Remove whitespace from KoreanNumberFilter. (looks structurally strange...)
 4) Just leave whitespace


was (Author: danmuzi):
Thank you for your reply, [~jim.ferenczi] :D

 

I tried to process only "." character in Tokenizer.
 Because Korean is a language that can have a whitespace in sentence, but 
Japanese is not.
 (Character.OTHER_PUNCTUATION would match more than just the full stop 
character. => Right. That's a problem. I have to change that part...)

 

JapaneseTokenizer keeps whitespace when using the discardPunctuation option.
 (example : "十万二千五 百 二千五" (means "102005 100 2005")
 If we run the JapaneseTokenizer with discardPunctuation=false and 
JapaneseNumberFilter, we get:
{"102005", " ", "100", " ", "2005"})

 

Of course we can do it with StopFilter or internal processing in other Filter, 
but is it okay..?

 

Developing a NumberFilter looks much more flexible and structurally beautiful 
rather than internal processing in Tokenizer.
 But I have developed like this because of the above problems, how can we 
handle those spaces?

 

I think there are several ways to handle this problem:
 1) Remove whitespace from Punctuation list in Tokenizer.
 2) Use a TokenFilter to remove whitespace.
 3) Remove whitespace from KoreanNumberFilter. (looks structurally strange...)
 4) Just leave whitespace

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  






[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-22 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845933#comment-16845933
 ] 

Namgyu Kim commented on LUCENE-8784:


Thank you for your reply, [~jim.ferenczi] :D

 

I tried to process only the "." character in the Tokenizer,
 because Korean is a language that can have whitespace inside a sentence, while 
Japanese does not.
 (Character.OTHER_PUNCTUATION would match more than just the full stop 
character. => Right, that's a problem. I have to change that part...)

 

JapaneseTokenizer keeps whitespace when using the discardPunctuation option.
 (example : "十万二千五 百 二千五" (means "102005 100 2005")
 If we run the JapaneseTokenizer with discardPunctuation=false and 
JapaneseNumberFilter, we get:
{"102005", " ", "100", " ", "2005"})

 

Of course we could handle it with a StopFilter or internal processing in 
another Filter, but is that okay..?

 

Developing a NumberFilter looks much more flexible and structurally cleaner 
than internal processing in the Tokenizer.
 But since I developed it this way because of the problems above, how can we 
handle those spaces?

 

I think there are several ways to handle this problem:
 1) Remove whitespace from Punctuation list in Tokenizer.
 2) Use a TokenFilter to remove whitespace.
 3) Remove whitespace from KoreanNumberFilter. (looks structurally strange...)
 4) Just leave whitespace
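
As a side note on the Character.OTHER_PUNCTUATION remark above, that category really does match far more than the full stop, which can be checked directly against the JDK (illustrative snippet, not part of the patch):

```java
public class PunctuationTypeCheck {

    // True when a character's Unicode general category is Po (OTHER_PUNCTUATION).
    public static boolean isOtherPunctuation(char c) {
        return Character.getType(c) == Character.OTHER_PUNCTUATION;
    }

    public static void main(String[] args) {
        // '.' is OTHER_PUNCTUATION, but so are ',', '!', '%', and many others,
        // so filtering on this category alone removes more than decimal points.
        System.out.println(isOtherPunctuation('.')); // true
        System.out.println(isOtherPunctuation(',')); // true
        System.out.println(isOtherPunctuation('!')); // true
        // Whitespace falls into a separate category (SPACE_SEPARATOR).
        System.out.println(isOtherPunctuation(' ')); // false
    }
}
```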

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  






[jira] [Comment Edited] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-22 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845933#comment-16845933
 ] 

Namgyu Kim edited comment on LUCENE-8784 at 5/22/19 2:38 PM:
-

Thank you for your reply, [~jim.ferenczi] :D

 

I tried to process only the "." character in the Tokenizer,
 because Korean is a language that can have whitespace inside a sentence, while 
Japanese does not.
 (Character.OTHER_PUNCTUATION would match more than just the full stop 
character. => Right, that's a problem. I have to change that part...)

 

JapaneseTokenizer keeps whitespace when using the discardPunctuation option.
 (example : "十万二千五 百 二千五" (means "102005 100 2005")
 If we run the JapaneseTokenizer with discardPunctuation=false and 
JapaneseNumberFilter, we get:
{"102005", " ", "100", " ", "2005"})

 

Of course we could handle it with a StopFilter or internal processing in 
another Filter, but is that okay..?

 

Developing a NumberFilter looks much more flexible and structurally cleaner 
than internal processing in the Tokenizer.
 But since I developed it this way because of the problems above, how can we 
handle those spaces?

 

I think there are several ways to handle this problem:
 1) Remove whitespace from Punctuation list in Tokenizer.
 2) Use a TokenFilter to remove whitespace.
 3) Remove whitespace from KoreanNumberFilter. (looks structurally strange...)
 4) Just leave whitespace


was (Author: danmuzi):
Thank you for your reply, [~jim.ferenczi] :D

 

I tried to process only "." character in Tokenizer.
 Because Korean is a language that can have a whitespace in sentence, but 
Japanese is not.
 (Character.OTHER_PUNCTUATION would match more than just the full stop 
character. => Right. That's a problem. I have to change that part...)

 

JapaneseTokenizer keeps whitespace when using the discardPunctuation option.
 (example : "十万二千五 百 二千五" (means "102005 100 2005")
 If we run the JapaneseTokenizer with discardPunctuation=false and 
JapaneseNumberFilter, we get:
{"102005", " ", "100", " ", "2005"})

 

Of course we can do it with StopFilter or internal processing in other Filter, 
but is it okay..?

 

Developing a NumberFilter looks much more flexible and structurally beautiful 
rather than internal processing in Tokenizer.
 But I have developed like this because of the above problems, how can we 
handle those spaces?

 

I think there are several ways to handle this problem:
 1) Remove whitespace from Punctuation list in Tokenizer.
 2) Use a TokenFilter to remove whitespace.
 3) Remove whitespace from KoreanNumberFilter. (looks structurally strange...)
 4) Just leave whitespace

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  






[jira] [Commented] (LUCENE-8805) Parameter changes for stringField() in StoredFieldVisitor

2019-05-21 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845383#comment-16845383
 ] 

Namgyu Kim commented on LUCENE-8805:


Thank you for applying my patch! [~jpountz] and [~rcmuir] 

> Parameter changes for stringField() in StoredFieldVisitor
> -
>
> Key: LUCENE-8805
> URL: https://issues.apache.org/jira/browse/LUCENE-8805
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Namgyu Kim
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8805.patch, LUCENE-8805.patch, LUCENE-8805.patch
>
>
> I wrote this patch after seeing the comments left by [~mikemccand] when 
> SortingStoredFieldsConsumer class was first created.
> {code:java}
> @Override
> public void binaryField(FieldInfo fieldInfo, byte[] value) throws IOException 
> {
>   ...
>   // TODO: can we avoid new BR here?
>   ...
> }
> @Override
> public void stringField(FieldInfo fieldInfo, byte[] value) throws IOException 
> {
>   ...
>   // TODO: can we avoid new String here?
>   ...
> }
> {code}
> I changed two things.
>  -1) change binaryField() parameters from byte[] to BytesRef.-
>  2) change stringField() parameters from byte[] to String.
> I also changed the related contents while doing the work.
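The API change under discussion can be illustrated with a simplified visitor. The sketch below uses stand-in interfaces rather than Lucene's actual StoredFieldVisitor; the point is only the byte[]-to-String parameter change, which moves the decoding out of every visitor implementation.

```java
import java.nio.charset.StandardCharsets;

// Simplified stand-in for the API change discussed above; this is NOT
// Lucene's actual StoredFieldVisitor, just an illustration of moving the
// UTF-8 decoding out of every visitor implementation.
public class VisitorSketch {

    // Before the change: visitors received raw UTF-8 bytes.
    interface OldVisitor {
        void stringField(String fieldName, byte[] value);
    }

    // After the change: the caller decodes once and passes a String.
    interface NewVisitor {
        void stringField(String fieldName, String value);
    }

    // The decoding step that previously had to be repeated inside every
    // OldVisitor implementation (the "new String" from the TODO comments).
    static String decodeOnce(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = "stored value".getBytes(StandardCharsets.UTF_8);

        OldVisitor oldStyle = (name, value) ->
                System.out.println(name + "=" + decodeOnce(value)); // decode per visitor
        NewVisitor newStyle = (name, value) ->
                System.out.println(name + "=" + value);             // already decoded

        oldStyle.stringField("title", raw);
        newStyle.stringField("title", decodeOnce(raw)); // decoded once, up front
    }
}
```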






[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-21 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845015#comment-16845015
 ] 

Namgyu Kim commented on LUCENE-8784:


Hi. [~jim.ferenczi] and [~Munkyu].
I uploaded a patch for this issue.

I only worked on the Tokenizer and TokenizerFactory, not on the Analyzer.
In the Japanese case this cannot be customized (discardPunctuation is always 
true).
If necessary, we can easily add the option to the Analyzer as well.

However, I have a question.
The current patch keeps passing the flag through as a parameter (in the 
_isPunctuation_ method).
If the method were not static, we would not have to pass the parameter every 
time.

What do you think about making _isPunctuation_ non-static?
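For readers following the thread, the punctuation check in question is essentially a Unicode general-category test. Here is a minimal, self-contained sketch; it is not the actual Nori implementation, and the class name is hypothetical.

```java
// Illustrative sketch of a punctuation test like the one discussed above.
// This is NOT the actual Nori code; the class name is hypothetical.
public class PunctuationSketch {

    // Returns true for characters treated as punctuation: Unicode separators,
    // controls, punctuation categories, and symbol categories.
    static boolean isPunctuation(char ch) {
        switch (Character.getType(ch)) {
            case Character.SPACE_SEPARATOR:
            case Character.LINE_SEPARATOR:
            case Character.PARAGRAPH_SEPARATOR:
            case Character.CONTROL:
            case Character.FORMAT:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.CONNECTOR_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
            case Character.MATH_SYMBOL:
            case Character.CURRENCY_SYMBOL:
            case Character.MODIFIER_SYMBOL:
            case Character.OTHER_SYMBOL:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isPunctuation('.'));  // true: "." is OTHER_PUNCTUATION
        System.out.println(isPunctuation('5'));  // false: digits are not punctuation
        System.out.println(isPunctuation('사')); // false: Hangul syllable
    }
}
```

A static method like this forces callers to thread any configuration (such as a discard flag) through as extra parameters, which is the design question raised above.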

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch
>
>
> This is the same issue that I mentioned in 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> Unlike the standard analyzer, the Nori analyzer removes the decimal point;
> the Nori tokenizer removes the "." character by default.
> This makes it difficult to index keywords that include a decimal point.
> It would be nice to have an option to preserve the decimal point, as the
> Japanese tokenizer does.
>  






[jira] [Updated] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-21 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8784:
---
Attachment: LUCENE-8784.patch

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch
>
>
> This is the same issue that I mentioned in 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> Unlike the standard analyzer, the Nori analyzer removes the decimal point;
> the Nori tokenizer removes the "." character by default.
> This makes it difficult to index keywords that include a decimal point.
> It would be nice to have an option to preserve the decimal point, as the
> Japanese tokenizer does.
>  






[jira] [Commented] (LUCENE-8805) Parameter changes for stringField() in StoredFieldVisitor

2019-05-21 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844825#comment-16844825
 ] 

Namgyu Kim commented on LUCENE-8805:


Oops, I misread that :(
I modified it and uploaded a new patch!
Thank you for checking, [~jpountz]

> Parameter changes for stringField() in StoredFieldVisitor
> -
>
> Key: LUCENE-8805
> URL: https://issues.apache.org/jira/browse/LUCENE-8805
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8805.patch, LUCENE-8805.patch, LUCENE-8805.patch
>
>
> I wrote this patch after seeing the comments left by [~mikemccand] when 
> SortingStoredFieldsConsumer class was first created.
> {code:java}
> @Override
> public void binaryField(FieldInfo fieldInfo, byte[] value) throws IOException 
> {
>   ...
>   // TODO: can we avoid new BR here?
>   ...
> }
> @Override
> public void stringField(FieldInfo fieldInfo, byte[] value) throws IOException 
> {
>   ...
>   // TODO: can we avoid new String here?
>   ...
> }
> {code}
> I changed two things.
>  -1) change binaryField() parameters from byte[] to BytesRef.-
>  2) change stringField() parameters from byte[] to String.
> I also changed the related contents while doing the work.






[jira] [Updated] (LUCENE-8805) Parameter changes for stringField() in StoredFieldVisitor

2019-05-21 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8805:
---
Attachment: LUCENE-8805.patch

> Parameter changes for stringField() in StoredFieldVisitor
> -
>
> Key: LUCENE-8805
> URL: https://issues.apache.org/jira/browse/LUCENE-8805
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8805.patch, LUCENE-8805.patch, LUCENE-8805.patch
>
>
> I wrote this patch after seeing the comments left by [~mikemccand] when 
> SortingStoredFieldsConsumer class was first created.
> {code:java}
> @Override
> public void binaryField(FieldInfo fieldInfo, byte[] value) throws IOException 
> {
>   ...
>   // TODO: can we avoid new BR here?
>   ...
> }
> @Override
> public void stringField(FieldInfo fieldInfo, byte[] value) throws IOException 
> {
>   ...
>   // TODO: can we avoid new String here?
>   ...
> }
> {code}
> I changed two things.
>  -1) change binaryField() parameters from byte[] to BytesRef.-
>  2) change stringField() parameters from byte[] to String.
> I also changed the related contents while doing the work.






[jira] [Commented] (LUCENE-8805) Parameter changes for stringField() in StoredFieldVisitor

2019-05-20 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844143#comment-16844143
 ] 

Namgyu Kim commented on LUCENE-8805:


I uploaded a new patch with only stringField modified (including null check).
Thank you.

> Parameter changes for stringField() in StoredFieldVisitor
> -
>
> Key: LUCENE-8805
> URL: https://issues.apache.org/jira/browse/LUCENE-8805
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8805.patch, LUCENE-8805.patch
>
>
> I wrote this patch after seeing the comments left by [~mikemccand] when 
> SortingStoredFieldsConsumer class was first created.
> {code:java}
> @Override
> public void binaryField(FieldInfo fieldInfo, byte[] value) throws IOException 
> {
>   ...
>   // TODO: can we avoid new BR here?
>   ...
> }
> @Override
> public void stringField(FieldInfo fieldInfo, byte[] value) throws IOException 
> {
>   ...
>   // TODO: can we avoid new String here?
>   ...
> }
> {code}
> I changed two things.
>  -1) change binaryField() parameters from byte[] to BytesRef.-
>  2) change stringField() parameters from byte[] to String.
> I also changed the related contents while doing the work.






[jira] [Updated] (LUCENE-8805) Parameter changes for stringField() in StoredFieldVisitor

2019-05-20 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8805:
---
Attachment: LUCENE-8805.patch

> Parameter changes for stringField() in StoredFieldVisitor
> -
>
> Key: LUCENE-8805
> URL: https://issues.apache.org/jira/browse/LUCENE-8805
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8805.patch, LUCENE-8805.patch
>
>
> I wrote this patch after seeing the comments left by [~mikemccand] when 
> SortingStoredFieldsConsumer class was first created.
> {code:java}
> @Override
> public void binaryField(FieldInfo fieldInfo, byte[] value) throws IOException 
> {
>   ...
>   // TODO: can we avoid new BR here?
>   ...
> }
> @Override
> public void stringField(FieldInfo fieldInfo, byte[] value) throws IOException 
> {
>   ...
>   // TODO: can we avoid new String here?
>   ...
> }
> {code}
> I changed two things.
>  -1) change binaryField() parameters from byte[] to BytesRef.-
>  2) change stringField() parameters from byte[] to String.
> I also changed the related contents while doing the work.






[jira] [Updated] (LUCENE-8805) Parameter changes for stringField() in StoredFieldVisitor

2019-05-20 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8805:
---
Summary: Parameter changes for stringField() in StoredFieldVisitor  (was: 
Parameter changes for binaryField() and stringField() in StoredFieldVisitor)

> Parameter changes for stringField() in StoredFieldVisitor
> -
>
> Key: LUCENE-8805
> URL: https://issues.apache.org/jira/browse/LUCENE-8805
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8805.patch
>
>
> I wrote this patch after seeing the comments left by [~mikemccand] when 
> SortingStoredFieldsConsumer class was first created.
> {code:java}
> @Override
> public void binaryField(FieldInfo fieldInfo, byte[] value) throws IOException 
> {
>   ...
>   // TODO: can we avoid new BR here?
>   ...
> }
> @Override
> public void stringField(FieldInfo fieldInfo, byte[] value) throws IOException 
> {
>   ...
>   // TODO: can we avoid new String here?
>   ...
> }
> {code}
> I changed two things.
> 1) change binaryField() parameters from byte[] to BytesRef.
> 2) change stringField() parameters from byte[] to String.
> I also changed the related contents while doing the work.






[jira] [Updated] (LUCENE-8805) Parameter changes for stringField() in StoredFieldVisitor

2019-05-20 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8805:
---
Description: 
I wrote this patch after seeing the comments left by [~mikemccand] when 
SortingStoredFieldsConsumer class was first created.
{code:java}
@Override
public void binaryField(FieldInfo fieldInfo, byte[] value) throws IOException {
  ...
  // TODO: can we avoid new BR here?
  ...
}
@Override
public void stringField(FieldInfo fieldInfo, byte[] value) throws IOException {
  ...
  // TODO: can we avoid new String here?
  ...
}
{code}
I changed two things.
 -1) change binaryField() parameters from byte[] to BytesRef.-
 2) change stringField() parameters from byte[] to String.

I also changed the related contents while doing the work.

  was:
I wrote this patch after seeing the comments left by [~mikemccand] when 
SortingStoredFieldsConsumer class was first created.

{code:java}
@Override
public void binaryField(FieldInfo fieldInfo, byte[] value) throws IOException {
  ...
  // TODO: can we avoid new BR here?
  ...
}
@Override
public void stringField(FieldInfo fieldInfo, byte[] value) throws IOException {
  ...
  // TODO: can we avoid new String here?
  ...
}
{code}
I changed two things.
1) change binaryField() parameters from byte[] to BytesRef.
2) change stringField() parameters from byte[] to String.

I also changed the related contents while doing the work.


> Parameter changes for stringField() in StoredFieldVisitor
> -
>
> Key: LUCENE-8805
> URL: https://issues.apache.org/jira/browse/LUCENE-8805
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8805.patch
>
>
> I wrote this patch after seeing the comments left by [~mikemccand] when 
> SortingStoredFieldsConsumer class was first created.
> {code:java}
> @Override
> public void binaryField(FieldInfo fieldInfo, byte[] value) throws IOException 
> {
>   ...
>   // TODO: can we avoid new BR here?
>   ...
> }
> @Override
> public void stringField(FieldInfo fieldInfo, byte[] value) throws IOException 
> {
>   ...
>   // TODO: can we avoid new String here?
>   ...
> }
> {code}
> I changed two things.
>  -1) change binaryField() parameters from byte[] to BytesRef.-
>  2) change stringField() parameters from byte[] to String.
> I also changed the related contents while doing the work.






[jira] [Commented] (LUCENE-8805) Parameter changes for binaryField() and stringField() in StoredFieldVisitor

2019-05-20 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844085#comment-16844085
 ] 

Namgyu Kim commented on LUCENE-8805:


Sure. I will work on stringField only and update the patch.
 I'll keep the TODO comment on binaryField.
 Do we need additional comments on binaryField?
 If so, let me know and I'll incorporate them.

> Parameter changes for binaryField() and stringField() in StoredFieldVisitor
> ---
>
> Key: LUCENE-8805
> URL: https://issues.apache.org/jira/browse/LUCENE-8805
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8805.patch
>
>
> I wrote this patch after seeing the comments left by [~mikemccand] when 
> SortingStoredFieldsConsumer class was first created.
> {code:java}
> @Override
> public void binaryField(FieldInfo fieldInfo, byte[] value) throws IOException 
> {
>   ...
>   // TODO: can we avoid new BR here?
>   ...
> }
> @Override
> public void stringField(FieldInfo fieldInfo, byte[] value) throws IOException 
> {
>   ...
>   // TODO: can we avoid new String here?
>   ...
> }
> {code}
> I changed two things.
> 1) change binaryField() parameters from byte[] to BytesRef.
> 2) change stringField() parameters from byte[] to String.
> I also changed the related contents while doing the work.






[jira] [Commented] (LUCENE-8805) Parameter changes for binaryField() and stringField() in StoredFieldVisitor

2019-05-20 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844051#comment-16844051
 ] 

Namgyu Kim commented on LUCENE-8805:


Thank you for your reply, [~jpountz].

Yes. I agree with you :D
I think it's okay to change the parameters of stringField.
However, in the case of binaryField, the disadvantages may outweigh the 
advantages.
What do you think about this, [~rcmuir]?

> Parameter changes for binaryField() and stringField() in StoredFieldVisitor
> ---
>
> Key: LUCENE-8805
> URL: https://issues.apache.org/jira/browse/LUCENE-8805
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8805.patch
>
>
> I wrote this patch after seeing the comments left by [~mikemccand] when 
> SortingStoredFieldsConsumer class was first created.
> {code:java}
> @Override
> public void binaryField(FieldInfo fieldInfo, byte[] value) throws IOException 
> {
>   ...
>   // TODO: can we avoid new BR here?
>   ...
> }
> @Override
> public void stringField(FieldInfo fieldInfo, byte[] value) throws IOException 
> {
>   ...
>   // TODO: can we avoid new String here?
>   ...
> }
> {code}
> I changed two things.
> 1) change binaryField() parameters from byte[] to BytesRef.
> 2) change stringField() parameters from byte[] to String.
> I also changed the related contents while doing the work.






[jira] [Commented] (LUCENE-8805) Parameter changes for binaryField() and stringField() in StoredFieldVisitor

2019-05-19 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843485#comment-16843485
 ] 

Namgyu Kim commented on LUCENE-8805:


Thank you for your reply, and I'm sorry for the late reply. [~rcmuir]
I will upload a new patch within a few days, based on your feedback
(parameter checking, creating a test case, etc.).

> Parameter changes for binaryField() and stringField() in StoredFieldVisitor
> ---
>
> Key: LUCENE-8805
> URL: https://issues.apache.org/jira/browse/LUCENE-8805
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8805.patch
>
>
> I wrote this patch after seeing the comments left by [~mikemccand] when 
> SortingStoredFieldsConsumer class was first created.
> {code:java}
> @Override
> public void binaryField(FieldInfo fieldInfo, byte[] value) throws IOException 
> {
>   ...
>   // TODO: can we avoid new BR here?
>   ...
> }
> @Override
> public void stringField(FieldInfo fieldInfo, byte[] value) throws IOException 
> {
>   ...
>   // TODO: can we avoid new String here?
>   ...
> }
> {code}
> I changed two things.
> 1) change binaryField() parameters from byte[] to BytesRef.
> 2) change stringField() parameters from byte[] to String.
> I also changed the related contents while doing the work.






[jira] [Updated] (LUCENE-8805) Parameter changes for binaryField() and stringField() in StoredFieldVisitor

2019-05-18 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8805:
---
Attachment: LUCENE-8805.patch

> Parameter changes for binaryField() and stringField() in StoredFieldVisitor
> ---
>
> Key: LUCENE-8805
> URL: https://issues.apache.org/jira/browse/LUCENE-8805
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8805.patch
>
>
> I wrote this patch after seeing the comments left by [~mikemccand] when 
> SortingStoredFieldsConsumer class was first created.
> {code:java}
> @Override
> public void binaryField(FieldInfo fieldInfo, byte[] value) throws IOException 
> {
>   ...
>   // TODO: can we avoid new BR here?
>   ...
> }
> @Override
> public void stringField(FieldInfo fieldInfo, byte[] value) throws IOException 
> {
>   ...
>   // TODO: can we avoid new String here?
>   ...
> }
> {code}
> I changed two things.
> 1) change binaryField() parameters from byte[] to BytesRef.
> 2) change stringField() parameters from byte[] to String.
> I also changed the related contents while doing the work.






[jira] [Created] (LUCENE-8805) Parameter changes for binaryField() and stringField() in StoredFieldVisitor

2019-05-18 Thread Namgyu Kim (JIRA)
Namgyu Kim created LUCENE-8805:
--

 Summary: Parameter changes for binaryField() and stringField() in 
StoredFieldVisitor
 Key: LUCENE-8805
 URL: https://issues.apache.org/jira/browse/LUCENE-8805
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Namgyu Kim


I wrote this patch after seeing the comments left by [~mikemccand] when 
SortingStoredFieldsConsumer class was first created.

{code:java}
@Override
public void binaryField(FieldInfo fieldInfo, byte[] value) throws IOException {
  ...
  // TODO: can we avoid new BR here?
  ...
}
@Override
public void stringField(FieldInfo fieldInfo, byte[] value) throws IOException {
  ...
  // TODO: can we avoid new String here?
  ...
}
{code}
I changed two things.
1) change binaryField() parameters from byte[] to BytesRef.
2) change stringField() parameters from byte[] to String.

I also changed the related contents while doing the work.






[jira] [Commented] (LUCENE-8768) Javadoc search support

2019-04-20 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822529#comment-16822529
 ] 

Namgyu Kim commented on LUCENE-8768:


Oh, I see that the patch was committed!
I'll close my PR on GitHub.
Thank you, [~thetaphi]. :D

> Javadoc search support
> --
>
> Key: LUCENE-8768
> URL: https://issues.apache.org/jira/browse/LUCENE-8768
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Assignee: Uwe Schindler
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8768.patch, LUCENE-8768.patch, 
> javadoc-nightly.png, new-javadoc.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Javadoc search is a new feature since Java 9.
>  ([https://openjdk.java.net/jeps/225])
> I think there is no reason not to use it if the current Lucene Java version 
> is 11.
> It can be a great help to developers looking at API documentation.
> (The elastic search also supports it now!
>  
> [https://artifacts.elastic.co/javadoc/org/elasticsearch/client/elasticsearch-rest-client/7.0.0/org/elasticsearch/client/package-summary.html])
>  
> ■ Before (Lucene Nightly Core Module Javadoc)
> !javadoc-nightly.png!
> ■ After 
> *!new-javadoc.png!*
>  
> I'll change two lines for this.
> 1) change Javadoc's noindex option from true to false.
> {code:java}
> // common-build.xml line 182
> {code}
> 2) add javadoc argument "--no-module-directories"
> {code:java}
> // common-build.xml line 2100
>  overview="@{overview}"
> additionalparam="--no-module-directories" // NEW CODE
> packagenames="org.apache.lucene.*,org.apache.solr.*"
> ...
> maxmemory="${javadoc.maxmemory}">
> {code}
> Currently there is an issue like the following link in JDK 11, so we need 
> "--no-module-directories" option.
>  ([https://bugs.openjdk.java.net/browse/JDK-8215291])
>  
> ■ How to test
> I did +"ant javadocs-modules"+ on lucene project and check Javadoc.
>  
>  






[jira] [Comment Edited] (LUCENE-8768) Javadoc search support

2019-04-19 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822017#comment-16822017
 ] 

Namgyu Kim edited comment on LUCENE-8768 at 4/19/19 4:05 PM:
-

Hi, [~jpountz], [~thetaphi].
 Thank you so much for your reply and I learned a lot from your comments.

 

I didn't know that Uwe +already checked Javadoc search+ and there was a GNU 
license issue.
 I also agree to give the user a choice rather than force it.
 I checked your patch, and I think it's good. (thanks for adding my opinion!)

 

And your idea,

*A cool thing would be to maybe install SOLR on the Lucene/Solr web page and 
index our Javadocs.*
   => It is a great idea I didn't think of at all, I will certainly contribute 
when you make a JIRA issue or project :D

 


was (Author: danmuzi):
Hi, [~jpountz], [~thetaphi].
Thank you so much for your reply and I learned a lot from your comments.

 

I didn't know that Uwe +already checked Javadoc search+ and there was a GNU 
license issue.
I also agree to give the user a choice rather than force it.
I checked your patch, and I think it's good. (thanks for adding my opinion!)

 

And your idea,
*- A cool thing would be to maybe install SOLR on the Lucene/Solr web page and 
index our Javadocs.*
  => It is a great idea I didn't think of at all, I will certainly contribute 
when you make a JIRA issue or project :D

 

> Javadoc search support
> --
>
> Key: LUCENE-8768
> URL: https://issues.apache.org/jira/browse/LUCENE-8768
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Assignee: Uwe Schindler
>Priority: Major
> Attachments: LUCENE-8768.patch, LUCENE-8768.patch, 
> javadoc-nightly.png, new-javadoc.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Javadoc search is a new feature since Java 9.
>  ([https://openjdk.java.net/jeps/225])
> I think there is no reason not to use it if the current Lucene Java version 
> is 11.
> It can be a great help to developers looking at API documentation.
> (The elastic search also supports it now!
>  
> [https://artifacts.elastic.co/javadoc/org/elasticsearch/client/elasticsearch-rest-client/7.0.0/org/elasticsearch/client/package-summary.html])
>  
> ■ Before (Lucene Nightly Core Module Javadoc)
> !javadoc-nightly.png!
> ■ After 
> *!new-javadoc.png!*
>  
> I'll change two lines for this.
> 1) change Javadoc's noindex option from true to false.
> {code:java}
> // common-build.xml line 182
> {code}
> 2) add javadoc argument "--no-module-directories"
> {code:java}
> // common-build.xml line 2100
>  overview="@{overview}"
> additionalparam="--no-module-directories" // NEW CODE
> packagenames="org.apache.lucene.*,org.apache.solr.*"
> ...
> maxmemory="${javadoc.maxmemory}">
> {code}
> Currently there is an issue like the following link in JDK 11, so we need 
> "--no-module-directories" option.
>  ([https://bugs.openjdk.java.net/browse/JDK-8215291])
>  
> ■ How to test
> I did +"ant javadocs-modules"+ on lucene project and check Javadoc.
>  
>  






[jira] [Commented] (LUCENE-8768) Javadoc search support

2019-04-19 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822017#comment-16822017
 ] 

Namgyu Kim commented on LUCENE-8768:


Hi, [~jpountz], [~thetaphi].
Thank you so much for your reply and I learned a lot from your comments.

 

I didn't know that Uwe +already checked Javadoc search+ and there was a GNU 
license issue.
I also agree to give the user a choice rather than force it.
I checked your patch, and I think it's good. (thanks for adding my opinion!)

 

And your idea,
*- A cool thing would be to maybe install SOLR on the Lucene/Solr web page and 
index our Javadocs.*
  => It is a great idea I didn't think of at all, I will certainly contribute 
when you make a JIRA issue or project :D

 

> Javadoc search support
> --
>
> Key: LUCENE-8768
> URL: https://issues.apache.org/jira/browse/LUCENE-8768
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Assignee: Uwe Schindler
>Priority: Major
> Attachments: LUCENE-8768.patch, LUCENE-8768.patch, 
> javadoc-nightly.png, new-javadoc.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Javadoc search is a new feature since Java 9.
>  ([https://openjdk.java.net/jeps/225])
> I think there is no reason not to use it if the current Lucene Java version 
> is 11.
> It can be a great help to developers looking at API documentation.
> (The elastic search also supports it now!
>  
> [https://artifacts.elastic.co/javadoc/org/elasticsearch/client/elasticsearch-rest-client/7.0.0/org/elasticsearch/client/package-summary.html])
>  
> ■ Before (Lucene Nightly Core Module Javadoc)
> !javadoc-nightly.png!
> ■ After 
> *!new-javadoc.png!*
>  
> I'll change two lines for this.
> 1) change Javadoc's noindex option from true to false.
> {code:java}
> // common-build.xml line 182
> {code}
> 2) add javadoc argument "--no-module-directories"
> {code:java}
> // common-build.xml line 2100
>  overview="@{overview}"
> additionalparam="--no-module-directories" // NEW CODE
> packagenames="org.apache.lucene.*,org.apache.solr.*"
> ...
> maxmemory="${javadoc.maxmemory}">
> {code}
> Currently there is an issue like the following link in JDK 11, so we need 
> "--no-module-directories" option.
>  ([https://bugs.openjdk.java.net/browse/JDK-8215291])
>  
> ■ How to test
> I did +"ant javadocs-modules"+ on lucene project and check Javadoc.
>  
>  






[jira] [Updated] (LUCENE-8768) Javadoc search support

2019-04-17 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8768:
---
Description: 
Javadoc search is a new feature since Java 9.
 ([https://openjdk.java.net/jeps/225])

I think there is no reason not to use it, given that the current Lucene Java 
version is 11.

It can be a great help to developers looking at API documentation.

(Elasticsearch also supports it now!
 
[https://artifacts.elastic.co/javadoc/org/elasticsearch/client/elasticsearch-rest-client/7.0.0/org/elasticsearch/client/package-summary.html])

 

■ Before (Lucene Nightly Core Module Javadoc)

!javadoc-nightly.png!

■ After 

*!new-javadoc.png!*

 

I'll change two lines for this.

1) change Javadoc's noindex option from true to false.
{code:java}
// common-build.xml line 182
{code}
2) add javadoc argument "--no-module-directories"
{code:java}
// common-build.xml line 2100
overview="@{overview}"
additionalparam="--no-module-directories" // NEW CODE
packagenames="org.apache.lucene.*,org.apache.solr.*"
...
maxmemory="${javadoc.maxmemory}">
{code}
Currently there is an issue like the following link in JDK 11, so we need 
"--no-module-directories" option.
 ([https://bugs.openjdk.java.net/browse/JDK-8215291])

 

■ How to test

I did +"ant javadocs-modules"+ on lucene project and check Javadoc.

 

 

  was:
Javadoc search is a new feature since Java 9.
 ([https://openjdk.java.net/jeps/225])

I think there is no reason not to use it if the current Lucene Java version is 
11.

It can be a great help to developers looking at API documentation.

(The elastic search also supports it now!
 
[https://artifacts.elastic.co/javadoc/org/elasticsearch/client/elasticsearch-rest-client/7.0.0/org/elasticsearch/client/package-summary.html])

 

■ Before (Lucene Nightly Core Module Javadoc)

!javadoc-nightly.png!

■ After 

*!new-javadoc.png!*

 

I'll change two lines for this.

1) change Javadoc's noindex option from true to false.
{code:java}
// common-build.xml line 187
{code}
2) add javadoc argument "--no-module-directories"
{code:java}
// common-build.xml line 2283
<javadoc overview="@{overview}"
 additionalparam="--no-module-directories" // NEW CODE
 packagenames="org.apache.lucene.*,org.apache.solr.*"
 ...
 maxmemory="${javadoc.maxmemory}">
{code}
Currently there is an issue in JDK 11 (see the link below), so we need the
"--no-module-directories" option.
 ([https://bugs.openjdk.java.net/browse/JDK-8215291])

 

■ How to test

I ran +"ant javadocs-modules"+ on the Lucene project and checked the Javadoc.

 

 


> Javadoc search support
> --
>
> Key: LUCENE-8768
> URL: https://issues.apache.org/jira/browse/LUCENE-8768
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: javadoc-nightly.png, new-javadoc.png
>
>
> Javadoc search is a new feature since Java 9.
>  ([https://openjdk.java.net/jeps/225])
> I think there is no reason not to use it if the current Lucene Java version 
> is 11.
> It can be a great help to developers looking at API documentation.
> (Elasticsearch also supports it now!
>  
> [https://artifacts.elastic.co/javadoc/org/elasticsearch/client/elasticsearch-rest-client/7.0.0/org/elasticsearch/client/package-summary.html])
>  
> ■ Before (Lucene Nightly Core Module Javadoc)
> !javadoc-nightly.png!
> ■ After 
> *!new-javadoc.png!*
>  
> I'll change two lines for this.
> 1) change Javadoc's noindex option from true to false.
> {code:java}
> // common-build.xml line 182
> {code}
> 2) add javadoc argument "--no-module-directories"
> {code:java}
> // common-build.xml line 2100
> <javadoc overview="@{overview}"
> additionalparam="--no-module-directories" // NEW CODE
> packagenames="org.apache.lucene.*,org.apache.solr.*"
> ...
> maxmemory="${javadoc.maxmemory}">
> {code}
> Currently there is an issue in JDK 11 (see the link below), so we need the
> "--no-module-directories" option.
>  ([https://bugs.openjdk.java.net/browse/JDK-8215291])
>  
> ■ How to test
> I ran +"ant javadocs-modules"+ on the Lucene project and checked the Javadoc.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8768) Javadoc search support

2019-04-17 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8768:
---
Description: 
Javadoc search is a new feature since Java 9.
 ([https://openjdk.java.net/jeps/225])

I think there is no reason not to use it if the current Lucene Java version is 
11.

It can be a great help to developers looking at API documentation.

(Elasticsearch also supports it now!
 
[https://artifacts.elastic.co/javadoc/org/elasticsearch/client/elasticsearch-rest-client/7.0.0/org/elasticsearch/client/package-summary.html])

 

■ Before (Lucene Nightly Core Module Javadoc)

!javadoc-nightly.png!

■ After 

*!new-javadoc.png!*

 

I'll change two lines for this.

1) change Javadoc's noindex option from true to false.
{code:java}
// common-build.xml line 187
{code}
2) add javadoc argument "--no-module-directories"
{code:java}
// common-build.xml line 2283
<javadoc overview="@{overview}"
 additionalparam="--no-module-directories" // NEW CODE
 packagenames="org.apache.lucene.*,org.apache.solr.*"
 ...
 maxmemory="${javadoc.maxmemory}">
{code}
Currently there is an issue in JDK 11 (see the link below), so we need the
"--no-module-directories" option.
 ([https://bugs.openjdk.java.net/browse/JDK-8215291])

 

■ How to test

I ran +"ant javadocs-modules"+ on the Lucene project and checked the Javadoc.

 

 

  was:
Javadoc search is a new feature since Java 9.
([https://openjdk.java.net/jeps/225])

I think there is no reason not to use it if the current Lucene Java version is 
11.

It can be a great help to developers looking at API documentation.

(Elasticsearch also supports it now!
[https://artifacts.elastic.co/javadoc/org/elasticsearch/client/elasticsearch-rest-client/7.0.0/org/elasticsearch/client/package-summary.html])

 

*- Before (Lucene Nightly Core Module Javadoc) -*

!javadoc-nightly.png!

*- After -*

*!new-javadoc.png!*

 

I'll change two lines for this.

1) change Javadoc's noindex option from true to false.

 
{code:java}
// common-build.xml line 187
{code}
 

2) add javadoc argument "--no-module-directories"

 
{code:java}
// common-build.xml line 2283
<javadoc overview="@{overview}"
 additionalparam="--no-module-directories" // NEW CODE
 packagenames="org.apache.lucene.*,org.apache.solr.*"
 ...
 maxmemory="${javadoc.maxmemory}">
{code}
Currently there is an issue in JDK 11 (see the link below), so we need the
"--no-module-directories" option.
([https://bugs.openjdk.java.net/browse/JDK-8215291])

 

*- How to test -*

I ran +"ant javadocs-modules"+ on the Lucene project and checked the Javadoc.

 

 


> Javadoc search support
> --
>
> Key: LUCENE-8768
> URL: https://issues.apache.org/jira/browse/LUCENE-8768
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: javadoc-nightly.png, new-javadoc.png
>
>
> Javadoc search is a new feature since Java 9.
>  ([https://openjdk.java.net/jeps/225])
> I think there is no reason not to use it if the current Lucene Java version 
> is 11.
> It can be a great help to developers looking at API documentation.
> (Elasticsearch also supports it now!
>  
> [https://artifacts.elastic.co/javadoc/org/elasticsearch/client/elasticsearch-rest-client/7.0.0/org/elasticsearch/client/package-summary.html])
>  
> ■ Before (Lucene Nightly Core Module Javadoc)
> !javadoc-nightly.png!
> ■ After 
> *!new-javadoc.png!*
>  
> I'll change two lines for this.
> 1) change Javadoc's noindex option from true to false.
> {code:java}
> // common-build.xml line 187
> {code}
> 2) add javadoc argument "--no-module-directories"
> {code:java}
> // common-build.xml line 2283
> <javadoc overview="@{overview}"
> additionalparam="--no-module-directories" // NEW CODE
> packagenames="org.apache.lucene.*,org.apache.solr.*"
> ...
> maxmemory="${javadoc.maxmemory}">
> {code}
> Currently there is an issue in JDK 11 (see the link below), so we need the
> "--no-module-directories" option.
>  ([https://bugs.openjdk.java.net/browse/JDK-8215291])
>  
> ■ How to test
> I ran +"ant javadocs-modules"+ on the Lucene project and checked the Javadoc.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8768) Javadoc search support

2019-04-17 Thread Namgyu Kim (JIRA)
Namgyu Kim created LUCENE-8768:
--

 Summary: Javadoc search support
 Key: LUCENE-8768
 URL: https://issues.apache.org/jira/browse/LUCENE-8768
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Namgyu Kim
 Attachments: javadoc-nightly.png, new-javadoc.png

Javadoc search is a new feature since Java 9.
([https://openjdk.java.net/jeps/225])

I think there is no reason not to use it if the current Lucene Java version is 
11.

It can be a great help to developers looking at API documentation.

(Elasticsearch also supports it now!
[https://artifacts.elastic.co/javadoc/org/elasticsearch/client/elasticsearch-rest-client/7.0.0/org/elasticsearch/client/package-summary.html])

 

*- Before (Lucene Nightly Core Module Javadoc) -*

!javadoc-nightly.png!

*- After -*

*!new-javadoc.png!*

 

I'll change two lines for this.

1) change Javadoc's noindex option from true to false.

 
{code:java}
// common-build.xml line 187
{code}
 

2) add javadoc argument "--no-module-directories"

 
{code:java}
// common-build.xml line 2283
<javadoc overview="@{overview}"
 additionalparam="--no-module-directories" // NEW CODE
 packagenames="org.apache.lucene.*,org.apache.solr.*"
 ...
 maxmemory="${javadoc.maxmemory}">
{code}
Currently there is an issue in JDK 11 (see the link below), so we need the
"--no-module-directories" option.
([https://bugs.openjdk.java.net/browse/JDK-8215291])

 

*- How to test -*

I ran +"ant javadocs-modules"+ on the Lucene project and checked the Javadoc.

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8698) Fix replaceIgnoreCase method bug in EscapeQuerySyntaxImpl

2019-02-18 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8698:
---
Attachment: LUCENE-8698.patch

> Fix replaceIgnoreCase method bug in EscapeQuerySyntaxImpl
> -
>
> Key: LUCENE-8698
> URL: https://issues.apache.org/jira/browse/LUCENE-8698
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/queryparser
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8698.patch
>
>
> This is a patch for the LUCENE-8572 issue reported by [~tonicava].
>  
> There is a serious bug in the replaceIgnoreCase method of the 
> EscapeQuerySyntaxImpl class.
> This issue can affect QueryNode. (StringIndexOutOfBoundsException)
> As I mentioned in a comment on the issue, String#toLowerCase() can cause the
> string to grow in size.
> {code:java}
> private static CharSequence replaceIgnoreCase(CharSequence string,
> CharSequence sequence1, CharSequence escapeChar, Locale locale) {
>   // string = "İpone " [304, 112, 111, 110, 101, 32],  size = 6
>   ...
>   while (start < count) {
> // Convert by toLowerCase as follows.
> // string = "i̇pone " [105, 775, 112, 111, 110, 101, 32], size = 7
> // firstIndex will be set 6.
> if ((firstIndex = string.toString().toLowerCase(locale).indexOf(first,
> start)) == -1)
>   break;
> boolean found = true;
> ...
> if (found) {
>   // In this line, String.toString() will only have a range of 0 to 5.
>   // So here we get a StringIndexOutOfBoundsException.
>   result.append(string.toString().substring(copyStart, firstIndex));
>   ...
> } else {
>   start = firstIndex + 1;
> }
>   }
>   ...
> }{code}
> Maintaining the overall structure while fixing the bug is very simple.
> If we change to the following code, the method works fine.
>  
> {code:java}
> // Line 135 ~ 136
> // BEFORE
> if ((firstIndex = string.toString().toLowerCase(locale).indexOf(first, 
> start)) == -1)
> // AFTER
> if ((firstIndex = string.toString().indexOf(first, start)) == -1)
> {code}
>  
>  
> But I wonder if this is the best way.
> What do you think about using String#replace() instead?
>  
> {code:java}
> // SAMPLE : escapeWhiteChar (escapeChar and escapeQuoted are same)
> // BEFORE
> private static final CharSequence escapeWhiteChar(CharSequence str,
> Locale locale) {
>   ...
>   for (int i = 0; i < escapableWhiteChars.length; i++) {
> buffer = replaceIgnoreCase(buffer, 
> escapableWhiteChars[i].toLowerCase(locale),
> "\\", locale);
>   }
>   ...
> }
> // AFTER
> private static final CharSequence escapeWhiteChar(CharSequence str,
> Locale locale) {
>   ...
>   for (int i = 0; i < escapableWhiteChars.length; i++) {
> buffer = buffer.toString().replace(escapableWhiteChars[i], "\\" + 
> escapableWhiteChars[i]);
>   }
>   ...
> }
> {code}
>  
> First, I have uploaded a patch that uses String#replace().
> If you give me some feedback, I will check it :D
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8698) Fix replaceIgnoreCase method bug in EscapeQuerySyntaxImpl

2019-02-18 Thread Namgyu Kim (JIRA)
Namgyu Kim created LUCENE-8698:
--

 Summary: Fix replaceIgnoreCase method bug in EscapeQuerySyntaxImpl
 Key: LUCENE-8698
 URL: https://issues.apache.org/jira/browse/LUCENE-8698
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/queryparser
Reporter: Namgyu Kim


This is a patch for the LUCENE-8572 issue reported by [~tonicava].

 

There is a serious bug in the replaceIgnoreCase method of the 
EscapeQuerySyntaxImpl class.

This issue can affect QueryNode. (StringIndexOutOfBoundsException)

As I mentioned in a comment on the issue, String#toLowerCase() can cause the
string to grow in size.
{code:java}
private static CharSequence replaceIgnoreCase(CharSequence string,
CharSequence sequence1, CharSequence escapeChar, Locale locale) {
  // string = "İpone " [304, 112, 111, 110, 101, 32],  size = 6
  ...
  while (start < count) {
// Convert by toLowerCase as follows.
// string = "i̇pone " [105, 775, 112, 111, 110, 101, 32], size = 7
// firstIndex will be set 6.
if ((firstIndex = string.toString().toLowerCase(locale).indexOf(first,
start)) == -1)
  break;
boolean found = true;
...
if (found) {
  // In this line, String.toString() will only have a range of 0 to 5.
  // So here we get a StringIndexOutOfBoundsException.
  result.append(string.toString().substring(copyStart, firstIndex));
  ...
} else {
  start = firstIndex + 1;
}
  }
  ...
}{code}
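The length change described in the comments above can be reproduced in isolation. Below is a minimal, self-contained sketch (class and variable names are mine, not part of the patch):

```java
import java.util.Locale;

public class LowercaseLengthDemo {
    public static void main(String[] args) {
        // U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE ("İ"),
        // the first character of the "İpone " example above.
        String s = "\u0130pone ";
        // Outside the Turkish locale, U+0130 lowercases to "i" followed by
        // COMBINING DOT ABOVE (U+0307), so the result gains one char.
        String lower = s.toLowerCase(Locale.ENGLISH);
        System.out.println(s.length());     // prints 6
        System.out.println(lower.length()); // prints 7
        // An index computed against `lower` (e.g. 6) can therefore be out
        // of bounds for `s`, which is exactly the reported exception.
    }
}
```

This is why indexOf must run against the same string that is later substring'd.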
Maintaining the overall structure while fixing the bug is very simple.

If we change to the following code, the method works fine.

 
{code:java}
// Line 135 ~ 136
// BEFORE
if ((firstIndex = string.toString().toLowerCase(locale).indexOf(first, start)) 
== -1)

// AFTER
if ((firstIndex = string.toString().indexOf(first, start)) == -1)
{code}
 

 

But I wonder if this is the best way.

What do you think about using String#replace() instead?

 
{code:java}
// SAMPLE : escapeWhiteChar (escapeChar and escapeQuoted are same)
// BEFORE
private static final CharSequence escapeWhiteChar(CharSequence str,
Locale locale) {
  ...
  for (int i = 0; i < escapableWhiteChars.length; i++) {
buffer = replaceIgnoreCase(buffer, 
escapableWhiteChars[i].toLowerCase(locale),
"\\", locale);
  }
  ...
}

// AFTER
private static final CharSequence escapeWhiteChar(CharSequence str,
Locale locale) {
  ...
  for (int i = 0; i < escapableWhiteChars.length; i++) {
buffer = buffer.toString().replace(escapableWhiteChars[i], "\\" + 
escapableWhiteChars[i]);
  }
  ...
}
{code}
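As a rough standalone sketch of what the String#replace() variant does (the character set here is a hypothetical subset; the real list of escapable whitespace chars lives in EscapeQuerySyntaxImpl):

```java
public class EscapeDemo {
    public static void main(String[] args) {
        // Hypothetical subset of escapable whitespace characters.
        String[] escapableWhiteChars = {" ", "\t", "\n"};
        String buffer = "foo bar\tbaz";
        for (String ws : escapableWhiteChars) {
            // Prepend a backslash to each occurrence. String#replace is
            // locale-independent, so match positions never drift the way
            // they do when indexOf runs against a lowercased copy.
            buffer = buffer.replace(ws, "\\" + ws);
        }
        System.out.println(buffer); // foo\ bar\<tab>baz
    }
}
```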
 

First, I have uploaded a patch that uses String#replace().
If you give me some feedback, I will check it :D

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-8572) StringIndexOutOfBoundsException in parser/EscapeQuerySyntaxImpl.java

2018-11-30 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705249#comment-16705249
 ] 

Namgyu Kim edited comment on LUCENE-8572 at 11/30/18 8:37 PM:
--

Hi, [~romseygeek], [~thetaphi].

I checked the issue and found that it could be a logical problem.

First, I think it's not a Locale problem, but a problem in the replace
algorithm (replaceIgnoreCase) itself.

If you look at escapeWhiteChar(), you can see that it calls replaceIgnoreCase() internally.
 (escapeTerm() -> escapeWhiteChar() -> replaceIgnoreCase())

 
{code:java}
private static CharSequence replaceIgnoreCase(CharSequence string,
CharSequence sequence1, CharSequence escapeChar, Locale locale) {
  // string = "İpone " [304, 112, 111, 110, 101, 32],  size = 6
  ...
  while (start < count) {
// Convert by toLowerCase as follows.
// string = "i̇pone " [105, 775, 112, 111, 110, 101, 32], size = 7
// firstIndex will be set 6.
if ((firstIndex = string.toString().toLowerCase(locale).indexOf(first,
start)) == -1)
  break;
boolean found = true;
...
if (found) {
  // In this line, String.toString() will only have a range of 0 to 5.
  // So here we get a StringIndexOutOfBoundsException.
  result.append(string.toString().substring(copyStart, firstIndex));
  ...
} else {
  start = firstIndex + 1;
}
  }
  ...
}
{code}
 

Solving this may not be a big problem.

But what do you think about using
{code:java}
public static final CharSequence escapeWhiteChar(CharSequence str,
  Locale locale) {
...

for (int i = 0; i < escapableWhiteChars.length; i++) {
  // Use String's replace method.
  buffer = buffer.toString().replace(escapableWhiteChars[i], "\\" + escapableWhiteChars[i]);
}
return buffer;
  }
{code}
instead of
{code:java}
public static final CharSequence escapeWhiteChar(CharSequence str,
  Locale locale) {
...

for (int i = 0; i < escapableWhiteChars.length; i++) {
  // Stay current method.
  buffer = replaceIgnoreCase(buffer, 
escapableWhiteChars[i].toLowerCase(locale), "\\", locale);
}
return buffer;
  }
{code}
in the escapeWhiteChar method?

 


was (Author: danmuzi):
Hi, [~romseygeek], [~thetaphi].

I checked the issue and it could be a logical problem.

First, I think it's not a Locale problem, but a problem in the replace
algorithm (replaceIgnoreCase) itself.

If you look at escapeWhiteChar(), you can see that it calls replaceIgnoreCase() internally.
(escapeTerm() -> escapeWhiteChar() -> replaceIgnoreCase())

 
{code:java}
private static CharSequence replaceIgnoreCase(CharSequence string,
CharSequence sequence1, CharSequence escapeChar, Locale locale) {
  // string = "İpone " [304, 112, 111, 110, 101, 32],  size = 6
  ...
  while (start < count) {
// Convert by toLowerCase as follows.
// string = "i̇pone " [105, 775, 112, 111, 110, 101, 32], size = 7
// firstIndex will be set 6.
if ((firstIndex = string.toString().toLowerCase(locale).indexOf(first,
start)) == -1)
  break;
boolean found = true;
...
if (found) {
  // In this line, String.toString() will only have a range of 0 to 5.
  // So here we get a StringIndexOutOfBoundsException.
  result.append(string.toString().substring(copyStart, firstIndex));
  ...
} else {
  start = firstIndex + 1;
}
  }
  ...
}
{code}
 

Solving this may not be a big problem.


But what do you think about using
{code:java}
public static final CharSequence escapeWhiteChar(CharSequence str,
  Locale locale) {
...

for (int i = 0; i < escapableWhiteChars.length; i++) {
  // Use String's replace method.
  buffer = buffer.toString().replace(escapableWhiteChars[i], "\\" + escapableWhiteChars[i]);
}
return buffer;
  }
{code}
instead of
{code:java}
public static final CharSequence escapeWhiteChar(CharSequence str,
  Locale locale) {
...

for (int i = 0; i < escapableWhiteChars.length; i++) {
  // Stay current method.
  buffer = replaceIgnoreCase(buffer, 
escapableWhiteChars[i].toLowerCase(locale), "\\", locale);
}
return buffer;
  }
{code}
in the escapeWhiteChar method?

 

> StringIndexOutOfBoundsException in parser/EscapeQuerySyntaxImpl.java
> 
>
> Key: LUCENE-8572
> URL: https://issues.apache.org/jira/browse/LUCENE-8572
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/queryparser
>Affects Versions: 6.3
>Reporter: Octavian Mocanu
>Priority: Major
>
> With "lucene-queryparser-6.3.0", specifically in
> "org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java"
>  
> when escaping strings containing extended unicode chars, and with a locale 
> distinct from that of the character set the string uses, the process fails, 
> with a "java.lang.StringIndexOutOfBoundsException".

[jira] [Commented] (LUCENE-8572) StringIndexOutOfBoundsException in parser/EscapeQuerySyntaxImpl.java

2018-11-30 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705249#comment-16705249
 ] 

Namgyu Kim commented on LUCENE-8572:


Hi, [~romseygeek], [~thetaphi].

I checked the issue and it could be a logical problem.

First, I think it's not a Locale problem, but a problem in the replace
algorithm (replaceIgnoreCase) itself.

If you look at escapeWhiteChar(), you can see that it calls replaceIgnoreCase() internally.
(escapeTerm() -> escapeWhiteChar() -> replaceIgnoreCase())

 
{code:java}
private static CharSequence replaceIgnoreCase(CharSequence string,
CharSequence sequence1, CharSequence escapeChar, Locale locale) {
  // string = "İpone " [304, 112, 111, 110, 101, 32],  size = 6
  ...
  while (start < count) {
// Convert by toLowerCase as follows.
// string = "i̇pone " [105, 775, 112, 111, 110, 101, 32], size = 7
// firstIndex will be set 6.
if ((firstIndex = string.toString().toLowerCase(locale).indexOf(first,
start)) == -1)
  break;
boolean found = true;
...
if (found) {
  // In this line, String.toString() will only have a range of 0 to 5.
  // So here we get a StringIndexOutOfBoundsException.
  result.append(string.toString().substring(copyStart, firstIndex));
  ...
} else {
  start = firstIndex + 1;
}
  }
  ...
}
{code}
 

Solving this may not be a big problem.


But what do you think about using
{code:java}
public static final CharSequence escapeWhiteChar(CharSequence str,
  Locale locale) {
...

for (int i = 0; i < escapableWhiteChars.length; i++) {
  // Use String's replace method.
  buffer = buffer.toString().replace(escapableWhiteChars[i], "\\" + escapableWhiteChars[i]);
}
return buffer;
  }
{code}
instead of
{code:java}
public static final CharSequence escapeWhiteChar(CharSequence str,
  Locale locale) {
...

for (int i = 0; i < escapableWhiteChars.length; i++) {
  // Stay current method.
  buffer = replaceIgnoreCase(buffer, 
escapableWhiteChars[i].toLowerCase(locale), "\\", locale);
}
return buffer;
  }
{code}
in the escapeWhiteChar method?

 

> StringIndexOutOfBoundsException in parser/EscapeQuerySyntaxImpl.java
> 
>
> Key: LUCENE-8572
> URL: https://issues.apache.org/jira/browse/LUCENE-8572
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/queryparser
>Affects Versions: 6.3
>Reporter: Octavian Mocanu
>Priority: Major
>
> With "lucene-queryparser-6.3.0", specifically in
> "org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java"
>  
> when escaping strings containing extended unicode chars, and with a locale 
> distinct from that of the character set the string uses, the process fails, 
> with a "java.lang.StringIndexOutOfBoundsException".
>  
> The reason is that the comparison is done by first converting all of the
> characters of the string to lower case; after this, the original string is
> shorter than the transformed one, so
> that executing
>  
> org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java:89
> fails with a java.lang.StringIndexOutOfBoundsException.
> I wonder whether the transformation to lower case is really needed when
> handling the escape chars, since avoiding it would also avoid the error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8582) Set parent class of DutchAnalyzer to StopwordAnalyzerBase

2018-11-30 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8582:
---
Attachment: LUCENE-8582.patch

> Set parent class of DutchAnalyzer to StopwordAnalyzerBase
> -
>
> Key: LUCENE-8582
> URL: https://issues.apache.org/jira/browse/LUCENE-8582
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/analysis
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8582.patch
>
>
> Currently the parent class of DutchAnalyzer is *Analyzer*.
> And I saw the comment
> {code:java}
> // TODO: extend StopwordAnalyzerBase
> {code}
> in DutchAnalyzer.
>  
> So I changed the code as follows.
> {code:java}
> public final class DutchAnalyzer extends StopwordAnalyzerBase {
>   ...
>   
>   // This instance is no longer necessary.
>   // private final CharArraySet stoptable;
>   
>   public DutchAnalyzer(CharArraySet stopwords, CharArraySet 
> stemExclusionTable, CharArrayMap stemOverrideDict) {
> super(stopwords); // Use StopwordAnalyzerBase's constructor to set 
> stopwords.
> ...
>   }
>   ...
>   @Override
>   protected TokenStreamComponents createComponents(String fieldName) {
> ...
> result = new StopFilter(result, stopwords); // Use StopwordAnalyzerBase's 
> instance
> ...
>   }
>   ...
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8582) Set parent class of DutchAnalyzer to StopwordAnalyzerBase

2018-11-30 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8582:
---
Description: 
Currently the parent class of DutchAnalyzer is *Analyzer*.

And I saw the comment
{code:java}
// TODO: extend StopwordAnalyzerBase
{code}
in DutchAnalyzer.

 

So I changed the code as follows.
{code:java}
public final class DutchAnalyzer extends StopwordAnalyzerBase {
  ...
  
  // This instance is no longer necessary.
  // private final CharArraySet stoptable;
  
  public DutchAnalyzer(CharArraySet stopwords, CharArraySet stemExclusionTable, 
CharArrayMap stemOverrideDict) {
super(stopwords); // Use StopwordAnalyzerBase's constructor to set 
stopwords.
...
  }
  ...
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
...
result = new StopFilter(result, stopwords); // Use StopwordAnalyzerBase's 
instance
...
  }
  ...
}

{code}

  was:
Currently the parent class of DutchAnalyzer is *Analyzer*.


And I saw the comment
{code:java}
// TODO: extend StopwordAnalyzerBase
{code}
in DutchAnalyzer.

 

So I changed the code as follows.
{code:java}
public final class DutchAnalyzer extends StopwordAnalyzerBase {
  ...
  
  // This instance is no longer necessary.
  // private final CharArraySet stoptable;
  
  public DutchAnalyzer(CharArraySet stopwords, CharArraySet stemExclusionTable, 
CharArrayMap stemOverrideDict) {
super(stopwords); // Use StopwordAnalyzerBase's constructor to set 
stopwords.
...
  }
  ...
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
...
result = new StopFilter(result, stopwords); // Use StopwordAnalyzerBase's 
instance
...
  }
  ...
}

{code}


> Set parent class of DutchAnalyzer to StopwordAnalyzerBase
> -
>
> Key: LUCENE-8582
> URL: https://issues.apache.org/jira/browse/LUCENE-8582
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/analysis
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8582.patch
>
>
> Currently the parent class of DutchAnalyzer is *Analyzer*.
> And I saw the comment
> {code:java}
> // TODO: extend StopwordAnalyzerBase
> {code}
> in DutchAnalyzer.
>  
> So I changed the code as follows.
> {code:java}
> public final class DutchAnalyzer extends StopwordAnalyzerBase {
>   ...
>   
>   // This instance is no longer necessary.
>   // private final CharArraySet stoptable;
>   
>   public DutchAnalyzer(CharArraySet stopwords, CharArraySet 
> stemExclusionTable, CharArrayMap stemOverrideDict) {
> super(stopwords); // Use StopwordAnalyzerBase's constructor to set 
> stopwords.
> ...
>   }
>   ...
>   @Override
>   protected TokenStreamComponents createComponents(String fieldName) {
> ...
> result = new StopFilter(result, stopwords); // Use StopwordAnalyzerBase's 
> instance
> ...
>   }
>   ...
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8582) Set parent class of DutchAnalyzer to StopwordAnalyzerBase

2018-11-30 Thread Namgyu Kim (JIRA)
Namgyu Kim created LUCENE-8582:
--

 Summary: Set parent class of DutchAnalyzer to StopwordAnalyzerBase
 Key: LUCENE-8582
 URL: https://issues.apache.org/jira/browse/LUCENE-8582
 Project: Lucene - Core
  Issue Type: Task
  Components: modules/analysis
Reporter: Namgyu Kim


Currently the parent class of DutchAnalyzer is *Analyzer*.


And I saw the comment
{code:java}
// TODO: extend StopwordAnalyzerBase
{code}
in DutchAnalyzer.

 

So I changed the code as follows.
{code:java}
public final class DutchAnalyzer extends StopwordAnalyzerBase {
  ...
  
  // This instance is no longer necessary.
  // private final CharArraySet stoptable;
  
  public DutchAnalyzer(CharArraySet stopwords, CharArraySet stemExclusionTable, 
CharArrayMap stemOverrideDict) {
super(stopwords); // Use StopwordAnalyzerBase's constructor to set 
stopwords.
...
  }
  ...
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
...
result = new StopFilter(result, stopwords); // Use StopwordAnalyzerBase's 
instance
...
  }
  ...
}

{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8575) Improve toString() in SegmentInfo

2018-11-30 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705082#comment-16705082
 ] 

Namgyu Kim commented on LUCENE-8575:


Thanks for applying my code, [~jpountz] :D

> Improve toString() in SegmentInfo
> -
>
> Key: LUCENE-8575
> URL: https://issues.apache.org/jira/browse/LUCENE-8575
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Namgyu Kim
>Priority: Major
> Fix For: master (8.0), 7.7
>
> Attachments: LUCENE-8575.patch, LUCENE-8575.patch
>
>
> I saw the following comment in the SegmentInfo class.
> {code:java}
> // TODO: we could append toString of attributes() here?
> {code}
> Of course, we can.
>  
> So I wrote a code for that part.
> {code:java}
> public String toString(int delCount) {
>   StringBuilder s = new StringBuilder();
>   s.append(name).append('(').append(version == null ? "?" : 
> version).append(')').append(':');
>   char cfs = getUseCompoundFile() ? 'c' : 'C';
>   s.append(cfs);
>   s.append(maxDoc);
>   if (delCount != 0) {
> s.append('/').append(delCount);
>   }
>   if (indexSort != null) {
> s.append(":[indexSort=");
> s.append(indexSort);
> s.append(']');
>   }
>   // New Code
>   if (!diagnostics.isEmpty()) {
> s.append(":[diagnostics=");
> for (Map.Entry<String, String> entry : diagnostics.entrySet())
>   
> s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,");
> s.setLength(s.length() - 1);
> s.append(']');
>   }
>   // New Code
>   if (!attributes.isEmpty()) {
> s.append(":[attributes=");
> for (Map.Entry<String, String> entry : attributes.entrySet())
>   
> s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,");
> s.setLength(s.length() - 1);
> s.append(']');
>   }
>   return s.toString();
> }
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-8575) Improve toString() in SegmentInfo

2018-11-28 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701982#comment-16701982
 ] 

Namgyu Kim edited comment on LUCENE-8575 at 11/28/18 3:24 PM:
--

Thank you for your reply, [~jpountz] :D

I uploaded a new patch that reflects your feedback.

 

before:
 
TEST(8.0.0):C1:[indexSort=]:[diagnostics=<key1,value1>,<key2,value2>]:[attributes=<key1,value1>,<key2,value2>]

after :
 TEST(8.0.0):C1:[indexSort=]:[diagnostics=\{key1=value1, 
key2=value2}]:[attributes=\{key1=value1, key2=value2}]


was (Author: danmuzi):
Thank you for your reply, [~jpountz] :D

I uploaded a new patch that reflects your feedback.

 

before:
TEST(8.0.0):C1:[indexSort=]:[diagnostics=<key1,value1>,<key2,value2>]:[attributes=<key1,value1>,<key2,value2>]

after :
TEST(8.0.0):C1:[indexSort=]:[diagnostics=\{key1=value1, 
key2=value2}]:[attributes=\{key1=value1, key2=value2}]

> Improve toString() in SegmentInfo
> -
>
> Key: LUCENE-8575
> URL: https://issues.apache.org/jira/browse/LUCENE-8575
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8575.patch, LUCENE-8575.patch
>
>
> I saw the following comment in the SegmentInfo class.
> {code:java}
> // TODO: we could append toString of attributes() here?
> {code}
> Of course, we can.
>  
> So I wrote a code for that part.
> {code:java}
> public String toString(int delCount) {
>   StringBuilder s = new StringBuilder();
>   s.append(name).append('(').append(version == null ? "?" : 
> version).append(')').append(':');
>   char cfs = getUseCompoundFile() ? 'c' : 'C';
>   s.append(cfs);
>   s.append(maxDoc);
>   if (delCount != 0) {
> s.append('/').append(delCount);
>   }
>   if (indexSort != null) {
> s.append(":[indexSort=");
> s.append(indexSort);
> s.append(']');
>   }
>   // New Code
>   if (!diagnostics.isEmpty()) {
> s.append(":[diagnostics=");
> for (Map.Entry<String, String> entry : diagnostics.entrySet())
>   
> s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,");
> s.setLength(s.length() - 1);
> s.append(']');
>   }
>   // New Code
>   if (!attributes.isEmpty()) {
> s.append(":[attributes=");
> for (Map.Entry entry : attributes.entrySet())
>   
> s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,");
> s.setLength(s.length() - 1);
> s.append(']');
>   }
>   return s.toString();
> }
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8575) Improve toString() in SegmentInfo

2018-11-28 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701982#comment-16701982
 ] 

Namgyu Kim commented on LUCENE-8575:


Thank you for your reply, [~jpountz] :D

I uploaded a new patch that reflected your opinion.

 

before:
TEST(8.0.0):C1:[indexSort=]:[diagnostics=<key1,value1>,<key2,value2>]:[attributes=<key1,value1>,<key2,value2>]

after:
TEST(8.0.0):C1:[indexSort=]:[diagnostics=\{key1=value1, key2=value2}]:[attributes=\{key1=value1, key2=value2}]

> Improve toString() in SegmentInfo
> -
>
> Key: LUCENE-8575
> URL: https://issues.apache.org/jira/browse/LUCENE-8575
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8575.patch, LUCENE-8575.patch
>
>
> I saw the following code in SegmentInfo class.
> {code:java}
> // TODO: we could append toString of attributes() here?
> {code}
> Of course, we can.
>  
> So I wrote a code for that part.
> {code:java}
> public String toString(int delCount) {
>   StringBuilder s = new StringBuilder();
>   s.append(name).append('(').append(version == null ? "?" : 
> version).append(')').append(':');
>   char cfs = getUseCompoundFile() ? 'c' : 'C';
>   s.append(cfs);
>   s.append(maxDoc);
>   if (delCount != 0) {
> s.append('/').append(delCount);
>   }
>   if (indexSort != null) {
> s.append(":[indexSort=");
> s.append(indexSort);
> s.append(']');
>   }
>   // New Code
>   if (!diagnostics.isEmpty()) {
> s.append(":[diagnostics=");
> for (Map.Entry entry : diagnostics.entrySet())
>   
> s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,");
> s.setLength(s.length() - 1);
> s.append(']');
>   }
>   // New Code
>   if (!attributes.isEmpty()) {
> s.append(":[attributes=");
> for (Map.Entry entry : attributes.entrySet())
>   
> s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,");
> s.setLength(s.length() - 1);
> s.append(']');
>   }
>   return s.toString();
> }
> {code}
>  






[jira] [Updated] (LUCENE-8575) Improve toString() in SegmentInfo

2018-11-28 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8575:
---
Attachment: LUCENE-8575.patch

> Improve toString() in SegmentInfo
> -
>
> Key: LUCENE-8575
> URL: https://issues.apache.org/jira/browse/LUCENE-8575
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8575.patch, LUCENE-8575.patch
>
>
> I saw the following code in SegmentInfo class.
> {code:java}
> // TODO: we could append toString of attributes() here?
> {code}
> Of course, we can.
>  
> So I wrote a code for that part.
> {code:java}
> public String toString(int delCount) {
>   StringBuilder s = new StringBuilder();
>   s.append(name).append('(').append(version == null ? "?" : 
> version).append(')').append(':');
>   char cfs = getUseCompoundFile() ? 'c' : 'C';
>   s.append(cfs);
>   s.append(maxDoc);
>   if (delCount != 0) {
> s.append('/').append(delCount);
>   }
>   if (indexSort != null) {
> s.append(":[indexSort=");
> s.append(indexSort);
> s.append(']');
>   }
>   // New Code
>   if (!diagnostics.isEmpty()) {
> s.append(":[diagnostics=");
> for (Map.Entry entry : diagnostics.entrySet())
>   
> s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,");
> s.setLength(s.length() - 1);
> s.append(']');
>   }
>   // New Code
>   if (!attributes.isEmpty()) {
> s.append(":[attributes=");
> for (Map.Entry entry : attributes.entrySet())
>   
> s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,");
> s.setLength(s.length() - 1);
> s.append(']');
>   }
>   return s.toString();
> }
> {code}
>  






[jira] [Updated] (LUCENE-8575) Improve toString() in SegmentInfo

2018-11-27 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8575:
---
Attachment: LUCENE-8575.patch

> Improve toString() in SegmentInfo
> -
>
> Key: LUCENE-8575
> URL: https://issues.apache.org/jira/browse/LUCENE-8575
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8575.patch
>
>
> I saw the following code in SegmentInfo class.
> {code:java}
> // TODO: we could append toString of attributes() here?
> {code}
> Of course, we can.
>  
> So I wrote a code for that part.
> {code:java}
> public String toString(int delCount) {
>   StringBuilder s = new StringBuilder();
>   s.append(name).append('(').append(version == null ? "?" : 
> version).append(')').append(':');
>   char cfs = getUseCompoundFile() ? 'c' : 'C';
>   s.append(cfs);
>   s.append(maxDoc);
>   if (delCount != 0) {
> s.append('/').append(delCount);
>   }
>   if (indexSort != null) {
> s.append(":[indexSort=");
> s.append(indexSort);
> s.append(']');
>   }
>   // New Code
>   if (!diagnostics.isEmpty()) {
> s.append(":[diagnostics=");
> for (Map.Entry entry : diagnostics.entrySet())
>   
> s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,");
> s.setLength(s.length() - 1);
> s.append(']');
>   }
>   // New Code
>   if (!attributes.isEmpty()) {
> s.append(":[attributes=");
> for (Map.Entry entry : attributes.entrySet())
>   
> s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,");
> s.setLength(s.length() - 1);
> s.append(']');
>   }
>   return s.toString();
> }
> {code}
>  






[jira] [Created] (LUCENE-8575) Improve toString() in SegmentInfo

2018-11-27 Thread Namgyu Kim (JIRA)
Namgyu Kim created LUCENE-8575:
--

 Summary: Improve toString() in SegmentInfo
 Key: LUCENE-8575
 URL: https://issues.apache.org/jira/browse/LUCENE-8575
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Reporter: Namgyu Kim


I saw the following code in the SegmentInfo class.
{code:java}
// TODO: we could append toString of attributes() here?
{code}
Of course, we can.

 

So I wrote code for that part.
{code:java}
public String toString(int delCount) {
  StringBuilder s = new StringBuilder();
  s.append(name).append('(').append(version == null ? "?" : version).append(')').append(':');
  char cfs = getUseCompoundFile() ? 'c' : 'C';
  s.append(cfs);

  s.append(maxDoc);

  if (delCount != 0) {
    s.append('/').append(delCount);
  }

  if (indexSort != null) {
    s.append(":[indexSort=");
    s.append(indexSort);
    s.append(']');
  }

  // New code: render the diagnostics map as <key,value> pairs
  if (!diagnostics.isEmpty()) {
    s.append(":[diagnostics=");
    for (Map.Entry<String, String> entry : diagnostics.entrySet())
      s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,");
    s.setLength(s.length() - 1); // drop the trailing comma
    s.append(']');
  }

  // New code: render the attributes map as <key,value> pairs
  if (!attributes.isEmpty()) {
    s.append(":[attributes=");
    for (Map.Entry<String, String> entry : attributes.entrySet())
      s.append("<").append(entry.getKey()).append(",").append(entry.getValue()).append(">,");
    s.setLength(s.length() - 1); // drop the trailing comma
    s.append(']');
  }

  return s.toString();
}
{code}
 






[jira] [Commented] (LUCENE-8553) New KoreanDecomposeFilter for KoreanAnalyzer(Nori)

2018-11-02 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673285#comment-16673285
 ] 

Namgyu Kim commented on LUCENE-8553:


Thank you for your comments :D [~rcmuir], [~thetaphi].

Yes, both of you are right.

I know that it is possible to do "Hangul-Jamo" separation by using ICU.

However, I am not sure whether the *"Hangul" -> "Choseong"* conversion or the 
*"Dual chars"* (like "ㄲ", "ㅆ", "ㅢ", ...) conversion can be performed with that 
library.

These conversions are also important features of this TokenFilter, and I used a 
HashMap and a separate array to reduce the time complexity.

That's why I didn't use the ICU library.
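As a reference for the "Hangul" -> "Choseong" conversion discussed above, a minimal standalone sketch (hypothetical class name, not the attached patch) can use the arithmetic layout of the precomposed Hangul syllable block (U+AC00..U+D7A3) instead of a lookup table:

```java
public class ChoseongExtractor {

    // The 19 modern initial consonants, in Unicode composition order.
    private static final char[] CHOSEONG = {
        'ㄱ', 'ㄲ', 'ㄴ', 'ㄷ', 'ㄸ', 'ㄹ', 'ㅁ', 'ㅂ', 'ㅃ', 'ㅅ',
        'ㅆ', 'ㅇ', 'ㅈ', 'ㅉ', 'ㅊ', 'ㅋ', 'ㅌ', 'ㅍ', 'ㅎ'
    };

    // Each Choseong spans 21 Jungseong * 28 Jongseong = 588 precomposed syllables.
    public static String toChoseong(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c >= 0xAC00 && c <= 0xD7A3) {
                sb.append(CHOSEONG[(c - 0xAC00) / 588]);
            } else {
                sb.append(c); // non-Hangul characters pass through unchanged
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toChoseong("된장찌개")); // ㄷㅈㅉㄱ
    }
}
```

This covers only the "Choseong Search" case; the dual-char expansion (ㅉ -> ㅈㅈ, ㅚ -> ㅗㅣ, ...) would still need the kind of table the comment above describes.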

> New KoreanDecomposeFilter for KoreanAnalyzer(Nori)
> --
>
> Key: LUCENE-8553
> URL: https://issues.apache.org/jira/browse/LUCENE-8553
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8553.patch
>
>
> This is a patch for KoreanDecomposeFilter.
> This filter can be used to decompose Hangul.
> (ex: 한글 -> ㅎㄱ or ㅎㅏㄴㄱㅡㄹ)
> Hangul input is quite unusual.
> If you want to type "apple" in English,
>    you can type it in the order {color:#FF}a -> p -> p -> l -> e{color}.
> However, if you want to input "한글" ("Hangul") in Hangul,
>    you have to type it in the order {color:#FF}ㅎ -> ㅏ -> ㄴ -> ㄱ -> ㅡ -> ㄹ{color}
>    (because of the keyboard layout).
> This means that spell checking against fully composed Hangul can be less accurate.
>  
> The structure of Hangul consists of elements called *"Choseong"*, 
> *"Jungseong"*, and *"Jongseong"*.
> These three kinds of elements are called *"Jamo"*.
> For the Korean word "된장찌개" (which means Soybean Paste Stew),
> the *"Choseong"* are {color:#FF}"ㄷ, ㅈ, ㅉ, ㄱ"{color},
> the *"Jungseong"* are {color:#FF}"ㅚ, ㅏ, ㅣ, ㅐ"{color},
> and the *"Jongseong"* are {color:#FF}"ㄴ, ㅇ"{color}.
> The reason for Jamo separation is explained above (spell checking).
> Also, the reason we need a "Choseong Filter" is that many Koreans use 
> *"Choseong Search"* (especially in mobile environments).
> If you want to search for "된장찌개", you need 10 keystrokes, which is quite a lot.
> For that reason, I think it would be useful to provide a filter that makes it 
> searchable as "ㄷㅈㅉㄱ".
> Hangul also has *dual chars*, such as
> "ㄲ, ㄸ, ㅃ, ㅆ, ㅉ, ㅚ (ㅗ + ㅣ), ㅢ (ㅡ + ㅣ), ...".
> For such reasons,
> KoreanDecomposeFilter offers *5 options*,
> ex) *된장찌개* => [된장], [찌개]
> *1) ORIGIN*
> [된장], [찌개]
> *2) SINGLECHOSEONG*
> [ㄷㅈ], [ㅉㄱ] 
> *3) DUALCHOSEONG*
> [ㄷㅈ], [ㅈㅈㄱ] 
> *4) SINGLEJAMO*
> [ㄷㅚㄴㅈㅏㅇ], [ㅉㅣㄱㅐ] 
> *5) DUALJAMO*
> [ㄷㅗㅣㄴㅈㅏㅇ], [ㅈㅈㅣㄱㅐ] 
>  
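The *SINGLEJAMO* option in the quoted description can be sketched with the same Unicode syllable arithmetic (a hypothetical standalone class, not the attached patch): every precomposed syllable in U+AC00..U+D7A3 encodes its Choseong, Jungseong, and Jongseong indices.

```java
public class JamoDecomposer {

    private static final char[] CHO = {
        'ㄱ', 'ㄲ', 'ㄴ', 'ㄷ', 'ㄸ', 'ㄹ', 'ㅁ', 'ㅂ', 'ㅃ', 'ㅅ',
        'ㅆ', 'ㅇ', 'ㅈ', 'ㅉ', 'ㅊ', 'ㅋ', 'ㅌ', 'ㅍ', 'ㅎ'
    };
    private static final char[] JUNG = {
        'ㅏ', 'ㅐ', 'ㅑ', 'ㅒ', 'ㅓ', 'ㅔ', 'ㅕ', 'ㅖ', 'ㅗ', 'ㅘ', 'ㅙ',
        'ㅚ', 'ㅛ', 'ㅜ', 'ㅝ', 'ㅞ', 'ㅟ', 'ㅠ', 'ㅡ', 'ㅢ', 'ㅣ'
    };
    // Index 0 means "no Jongseong"; compatibility Jamo for the 27 finals.
    private static final char[] JONG = {
        0, 'ㄱ', 'ㄲ', 'ㄳ', 'ㄴ', 'ㄵ', 'ㄶ', 'ㄷ', 'ㄹ', 'ㄺ', 'ㄻ', 'ㄼ', 'ㄽ', 'ㄾ',
        'ㄿ', 'ㅀ', 'ㅁ', 'ㅂ', 'ㅄ', 'ㅅ', 'ㅆ', 'ㅇ', 'ㅈ', 'ㅊ', 'ㅋ', 'ㅌ', 'ㅍ', 'ㅎ'
    };

    // Decompose precomposed syllables into single Jamo; pass other chars through.
    public static String decompose(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c >= 0xAC00 && c <= 0xD7A3) {
                int offset = c - 0xAC00;
                sb.append(CHO[offset / (21 * 28)]);         // 588 code points per Choseong
                sb.append(JUNG[(offset % (21 * 28)) / 28]); // 28 code points per Jungseong
                int jong = offset % 28;
                if (jong != 0) sb.append(JONG[jong]);       // optional final consonant
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decompose("된장")); // ㄷㅚㄴㅈㅏㅇ
        System.out.println(decompose("찌개")); // ㅉㅣㄱㅐ
    }
}
```

The *DUALJAMO* variant would additionally expand compound Jamo (ㅚ -> ㅗㅣ, ㅉ -> ㅈㅈ, ...) via a second mapping table.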






[jira] [Updated] (LUCENE-8553) New KoreanDecomposeFilter for KoreanAnalyzer(Nori)

2018-11-01 Thread Namgyu Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim updated LUCENE-8553:
---
Attachment: LUCENE-8553.patch

> New KoreanDecomposeFilter for KoreanAnalyzer(Nori)
> --
>
> Key: LUCENE-8553
> URL: https://issues.apache.org/jira/browse/LUCENE-8553
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8553.patch
>
>
> This is a patch for KoreanDecomposeFilter.
> This filter can be used to decompose Hangul.
> (ex: 한글 -> ㅎㄱ or ㅎㅏㄴㄱㅡㄹ)
> Hangul input is quite unusual.
> If you want to type "apple" in English,
>    you can type it in the order {color:#FF}a -> p -> p -> l -> e{color}.
> However, if you want to input "한글" ("Hangul") in Hangul,
>    you have to type it in the order {color:#FF}ㅎ -> ㅏ -> ㄴ -> ㄱ -> ㅡ -> ㄹ{color}
>    (because of the keyboard layout).
> This means that spell checking against fully composed Hangul can be less accurate.
>  
> The structure of Hangul consists of elements called *"Choseong"*, 
> *"Jungseong"*, and *"Jongseong"*.
> These three kinds of elements are called *"Jamo"*.
> For the Korean word "된장찌개" (which means Soybean Paste Stew),
> the *"Choseong"* are {color:#FF}"ㄷ, ㅈ, ㅉ, ㄱ"{color},
> the *"Jungseong"* are {color:#FF}"ㅚ, ㅏ, ㅣ, ㅐ"{color},
> and the *"Jongseong"* are {color:#FF}"ㄴ, ㅇ"{color}.
> The reason for Jamo separation is explained above (spell checking).
> Also, the reason we need a "Choseong Filter" is that many Koreans use 
> *"Choseong Search"* (especially in mobile environments).
> If you want to search for "된장찌개", you need 10 keystrokes, which is quite a lot.
> For that reason, I think it would be useful to provide a filter that makes it 
> searchable as "ㄷㅈㅉㄱ".
> Hangul also has *dual chars*, such as
> "ㄲ, ㄸ, ㅃ, ㅆ, ㅉ, ㅚ (ㅗ + ㅣ), ㅢ (ㅡ + ㅣ), ...".
> For such reasons,
> KoreanDecomposeFilter offers *5 options*,
> ex) *된장찌개* => [된장], [찌개]
> *1) ORIGIN*
> [된장], [찌개]
> *2) SINGLECHOSEONG*
> [ㄷㅈ], [ㅉㄱ] 
> *3) DUALCHOSEONG*
> [ㄷㅈ], [ㅈㅈㄱ] 
> *4) SINGLEJAMO*
> [ㄷㅚㄴㅈㅏㅇ], [ㅉㅣㄱㅐ] 
> *5) DUALJAMO*
> [ㄷㅗㅣㄴㅈㅏㅇ], [ㅈㅈㅣㄱㅐ] 
>  





