Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2024-03-18 Thread via GitHub


benwtrent merged PR #12915:
URL: https://github.com/apache/lucene/pull/12915


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2024-03-18 Thread via GitHub


daixque commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1528440247


##
lucene/CHANGES.txt:
##
@@ -174,12 +174,14 @@ API Changes
 
 New Features
 -
-
 * GITHUB#12679: Add support for similarity-based vector searches using [Byte|Float]VectorSimilarityQuery. Uses a new
   VectorSimilarityCollector to find all vectors scoring above a `resultSimilarity` while traversing the HNSW graph till
   better-scoring nodes are available, or the best candidate is below a score of `traversalSimilarity` in the lowest
   level. (Aditya Prakash, Kaival Parikh)
 
+* GITHUB#12915: Add new token filters for Japanese sutegana (捨て仮名). This introduces JapaneseHiraganaUppercaseFilter
+  and JapaneseKatakanaUppercaseFilter. (Dai Sugimori)
+

Review Comment:
   Thanks @benwtrent, this is done.



Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2024-03-18 Thread via GitHub


benwtrent commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1528339145


##
lucene/CHANGES.txt:
##
@@ -174,12 +174,14 @@ API Changes
 
 New Features
 -
-
 * GITHUB#12679: Add support for similarity-based vector searches using [Byte|Float]VectorSimilarityQuery. Uses a new
   VectorSimilarityCollector to find all vectors scoring above a `resultSimilarity` while traversing the HNSW graph till
   better-scoring nodes are available, or the best candidate is below a score of `traversalSimilarity` in the lowest
   level. (Aditya Prakash, Kaival Parikh)
 
+* GITHUB#12915: Add new token filters for Japanese sutegana (捨て仮名). This introduces JapaneseHiraganaUppercaseFilter
+  and JapaneseKatakanaUppercaseFilter. (Dai Sugimori)
+

Review Comment:
   We have since released 9.10. Could you move your entry to the 9.11 section and remove it from 9.10?



Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2024-01-23 Thread via GitHub


github-actions[bot] commented on PR #12915:
URL: https://github.com/apache/lucene/pull/12915#issuecomment-1907129523

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2024-01-09 Thread via GitHub


dungba88 commented on PR #12915:
URL: https://github.com/apache/lucene/pull/12915#issuecomment-1882788058

   I think it's good to go, but I don't have merge permission. Mike should be able to help you; otherwise, you can try notifying the dev mailing list, as suggested by the bot.


Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2024-01-09 Thread via GitHub


daixque commented on PR #12915:
URL: https://github.com/apache/lucene/pull/12915#issuecomment-1882633384

   @mikemccand @dungba88 A gentle ping. Is there anything left for me to do on this PR? If not, could you merge it?


Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2024-01-08 Thread via GitHub


github-actions[bot] commented on PR #12915:
URL: https://github.com/apache/lucene/pull/12915#issuecomment-1880898815

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-18 Thread via GitHub


daixque commented on PR #12915:
URL: https://github.com/apache/lucene/pull/12915#issuecomment-1860673133

   > Looks great @daixque -- would you like to add a `lucene/CHANGES.txt` entry describing this awesome new capability? Be sure to put it under the `9.10.0` section since we can backport this change (it is not a 10.0.0-only feature).
   
   @mikemccand This is done. Thanks!


Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-18 Thread via GitHub


mikemccand commented on PR #12915:
URL: https://github.com/apache/lucene/pull/12915#issuecomment-1860587667

   Looks great @daixque -- would you like to add a `lucene/CHANGES.txt` entry describing this awesome new capability? Be sure to put it under the `9.10.0` section since we can backport this change (it is not a 10.0.0-only feature).


Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-15 Thread via GitHub


daixque commented on PR #12915:
URL: https://github.com/apache/lucene/pull/12915#issuecomment-1858684072

   I refactored the Katakana filter to apply the same kind of enhancement.


Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-14 Thread via GitHub


daixque commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1427647228


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java:
##
@@ -60,15 +60,13 @@ public JapaneseHiraganaUppercaseFilter(TokenStream input) {
   @Override
   public boolean incrementToken() throws IOException {
     if (input.incrementToken()) {
-      String term = termAttr.toString();
-      char[] src = term.toCharArray();
-      char[] result = new char[src.length];
-      for (int i = 0; i < src.length; i++) {
-        Character c = s2l.get(src[i]);
+      char[] result = new char[termAttr.length()];

Review Comment:
   @mikemccand @dungba88 Thanks for the suggestion. I have applied it, so please take a look.



Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-14 Thread via GitHub


mikemccand commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1426601556


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java:
##
@@ -60,15 +60,13 @@ public JapaneseHiraganaUppercaseFilter(TokenStream input) {
   @Override
   public boolean incrementToken() throws IOException {
     if (input.incrementToken()) {
-      String term = termAttr.toString();
-      char[] src = term.toCharArray();
-      char[] result = new char[src.length];
-      for (int i = 0; i < src.length; i++) {
-        Character c = s2l.get(src[i]);
+      char[] result = new char[termAttr.length()];

Review Comment:
   I think this could instead be something like:
   
   ```
   char[] termBuffer = termAttr.buffer();
   int termLength = termAttr.length();
   for(int i=0; i
   ```

Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-13 Thread via GitHub


dungba88 commented on PR #12915:
URL: https://github.com/apache/lucene/pull/12915#issuecomment-1853744401

   Besides the optimization of manipulating the internal `char[]` directly, I think this is good to go.


Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-12 Thread via GitHub


mikemccand commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1424520399


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java:
##
@@ -0,0 +1,65 @@
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+    // supported characters are:
+    // ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ
+    s2l =
+        Map.ofEntries(
+            Map.entry('ぁ', 'あ'),
+            Map.entry('ぃ', 'い'),
+            Map.entry('ぅ', 'う'),
+            Map.entry('ぇ', 'え'),
+            Map.entry('ぉ', 'お'),
+            Map.entry('っ', 'つ'),
+            Map.entry('ゃ', 'や'),
+            Map.entry('ゅ', 'ゆ'),
+            Map.entry('ょ', 'よ'),
+            Map.entry('ゎ', 'わ'),
+            Map.entry('ゕ', 'か'),
+            Map.entry('ゖ', 'け'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseHiraganaUppercaseFilter(TokenStream input) {
+    super(input);
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+    if (input.incrementToken()) {
+      String term = termAttr.toString();
+      char[] src = term.toCharArray();
+      char[] result = new char[src.length];
+      for (int i = 0; i < src.length; i++) {
+        Character c = s2l.get(src[i]);
+        if (c != null) {
+          result[i] = c;
+        } else {
+          result[i] = src[i];
+        }
+      }
+      String resultTerm = String.copyValueOf(result);
+      termAttr.setEmpty().append(resultTerm);

Review Comment:
   > This would eliminate all of the copying. I don't know if we are supposed to do that (though the API allows it). Maybe @mikemccand has some thoughts here.
   
   This is indeed the intended usage for high performance: directly alter the underlying `char[]` buffer, ask the term attribute to grow if needed, and set the length when you are done.
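
   A minimal sketch of the in-place approach, using only the JDK so that it runs without Lucene. The `buffer` and `length` parameters stand in for what `termAttr.buffer()` and `termAttr.length()` would return inside `incrementToken()`. The class name `InPlaceSuteganaSketch`, the method `normalizeInPlace`, and the table name `SMALL_TO_LARGE` are hypothetical, and only three of the twelve hiragana mappings are included. The buffer may be longer than the term, so the loop is bounded by `length` rather than `buffer.length`.

   ```java
   import java.util.Map;

   /** Hypothetical, JDK-only sketch; not the code merged in the PR. */
   final class InPlaceSuteganaSketch {
     // Subset of the small-to-normal hiragana table, written as escapes:
     // small tsu (U+3063) -> tsu, small ya (U+3083) -> ya, small yo (U+3087) -> yo.
     private static final Map<Character, Character> SMALL_TO_LARGE =
         Map.of('\u3063', '\u3064', '\u3083', '\u3084', '\u3087', '\u3088');

     /** Rewrites the first {@code length} chars of {@code buffer} in place. */
     static void normalizeInPlace(char[] buffer, int length) {
       for (int i = 0; i < length; i++) {
         // getOrDefault keeps unmapped chars and avoids unboxing a null.
         buffer[i] = SMALL_TO_LARGE.getOrDefault(buffer[i], buffer[i]);
       }
     }
   }
   ```

   Because every mapping here is one char to one char, the term length does not change and no call to grow or resize the buffer is needed.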



Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-12 Thread via GitHub


dungba88 commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423583689


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java:
##
@@ -0,0 +1,65 @@
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+    // supported characters are:
+    // ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ
+    s2l =
+        Map.ofEntries(
+            Map.entry('ぁ', 'あ'),
+            Map.entry('ぃ', 'い'),
+            Map.entry('ぅ', 'う'),
+            Map.entry('ぇ', 'え'),
+            Map.entry('ぉ', 'お'),
+            Map.entry('っ', 'つ'),
+            Map.entry('ゃ', 'や'),
+            Map.entry('ゅ', 'ゆ'),
+            Map.entry('ょ', 'よ'),
+            Map.entry('ゎ', 'わ'),
+            Map.entry('ゕ', 'か'),
+            Map.entry('ゖ', 'け'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseHiraganaUppercaseFilter(TokenStream input) {
+    super(input);
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+    if (input.incrementToken()) {
+      String term = termAttr.toString();
+      char[] src = term.toCharArray();
+      char[] result = new char[src.length];
+      for (int i = 0; i < src.length; i++) {
+        Character c = s2l.get(src[i]);
+        if (c != null) {
+          result[i] = c;
+        } else {
+          result[i] = src[i];
+        }
+      }
+      String resultTerm = String.copyValueOf(result);
+      termAttr.setEmpty().append(resultTerm);

Review Comment:
   It seems you can modify the `CharTermAttribute` directly by accessing `buffer()`, which returns the internal buffer.
   
   ```
   char[] buffer = termAttr.buffer();
   buffer[i] = LETTER_MAPPINGS.getOrDefault(buffer[i], buffer[i]);
   ```
   
   This would eliminate all of the copying. I don't know if we are supposed to do that (though the API allows it). Maybe @mikemccand has some thoughts here.



Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-12 Thread via GitHub


dungba88 commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423585285


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseKatakanaUppercaseFilter.java:
##
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in katakana into normal letters. For
+ * instance, "ストップウォッチ" will be translated to "ストツプウオツチ".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseKatakanaUppercaseFilter extends TokenFilter {

Review Comment:
   You are right, maybe we can consolidate them with a base class as a 
follow-up. This LGTM.



Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-11 Thread via GitHub


dungba88 commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423575442


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java:
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+    // supported characters are:
+    // ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ
+    s2l =
+        Map.ofEntries(
+            Map.entry('ぁ', 'あ'),
+            Map.entry('ぃ', 'い'),
+            Map.entry('ぅ', 'う'),
+            Map.entry('ぇ', 'え'),
+            Map.entry('ぉ', 'お'),
+            Map.entry('っ', 'つ'),
+            Map.entry('ゃ', 'や'),
+            Map.entry('ゅ', 'ゆ'),
+            Map.entry('ょ', 'よ'),
+            Map.entry('ゎ', 'わ'),
+            Map.entry('ゕ', 'か'),
+            Map.entry('ゖ', 'け'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseHiraganaUppercaseFilter(TokenStream input) {
+    super(input);
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+    if (input.incrementToken()) {
+      String term = termAttr.toString();
+      char[] src = term.toCharArray();
+      char[] result = new char[src.length];
+      for (int i = 0; i < src.length; i++) {
+        Character c = s2l.get(src[i]);
+        if (c != null) {
+          result[i] = c;

Review Comment:
   I see, that makes sense. Thank you



Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-11 Thread via GitHub


daixque commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423470044


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java:
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+    // supported characters are:
+    // ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ
+    s2l =
+        Map.ofEntries(
+            Map.entry('ぁ', 'あ'),
+            Map.entry('ぃ', 'い'),
+            Map.entry('ぅ', 'う'),
+            Map.entry('ぇ', 'え'),
+            Map.entry('ぉ', 'お'),
+            Map.entry('っ', 'つ'),
+            Map.entry('ゃ', 'や'),
+            Map.entry('ゅ', 'ゆ'),
+            Map.entry('ょ', 'よ'),
+            Map.entry('ゎ', 'わ'),
+            Map.entry('ゕ', 'か'),
+            Map.entry('ゖ', 'け'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseHiraganaUppercaseFilter(TokenStream input) {
+    super(input);
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+    if (input.incrementToken()) {
+      String term = termAttr.toString();
+      char[] src = term.toCharArray();
+      char[] result = new char[src.length];
+      for (int i = 0; i < src.length; i++) {
+        Character c = s2l.get(src[i]);
+        if (c != null) {
+          result[i] = c;

Review Comment:
   > It seems all small characters are just 1 position ahead of the normal 
characters
   
   It's not correct. See `ゕ` for example.
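
   The counterexample can be checked from the code points alone. The sketch below uses only the JDK and a hypothetical class name, `KanaOffsetCheck`. It tests whether each small hiragana in the filter's table sits exactly one code point before its normal form. Ten of the twelve pairs do, but `ゕ` (U+3095) and `ゖ` (U+3096) sit near the end of the block, far from `か` (U+304B) and `け` (U+3051), so an explicit table is still needed.

   ```java
   import java.util.ArrayList;
   import java.util.List;

   /** Hypothetical check; the pairs mirror the s2l table quoted in the diff. */
   final class KanaOffsetCheck {
     // {small, normal} pairs as Unicode code points.
     static final int[][] PAIRS = {
       {0x3041, 0x3042}, // a
       {0x3043, 0x3044}, // i
       {0x3045, 0x3046}, // u
       {0x3047, 0x3048}, // e
       {0x3049, 0x304A}, // o
       {0x3063, 0x3064}, // tsu
       {0x3083, 0x3084}, // ya
       {0x3085, 0x3086}, // yu
       {0x3087, 0x3088}, // yo
       {0x308E, 0x308F}, // wa
       {0x3095, 0x304B}, // ka
       {0x3096, 0x3051}, // ke
     };

     /** Returns the small-kana code points whose normal form is not small + 1. */
     static List<Integer> exceptions() {
       List<Integer> out = new ArrayList<>();
       for (int[] pair : PAIRS) {
         if (pair[0] + 1 != pair[1]) {
           out.add(pair[0]);
         }
       }
       return out;
     }
   }
   ```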



Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-11 Thread via GitHub


daixque commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423469570


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseKatakanaUppercaseFilter.java:
##
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in katakana into normal letters. For
+ * instance, "ストップウォッチ" will be translated to "ストツプウオツチ".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseKatakanaUppercaseFilter extends TokenFilter {

Review Comment:
   @dungba88 What should the constructor look like?
   
   Like this?
   ```
   public JapaneseKanaUppercaseFilter(TokenStream input, boolean hiragana, boolean katakana)
   ```
   
   Note that Katakana has an exceptional character, `ㇷ゚`, so the logic is slightly different from hiragana.
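
   To illustrate why `ㇷ゚` needs separate handling: it is written as two UTF-16 chars, `ㇷ` (U+31F7) followed by the combining mark U+309A, so mapping it to a single normal letter shortens the term. The sketch below is JDK-only and hypothetical (the names `KatakanaPSketch` and `normalize` are not from the PR). It assumes the pair should become `プ` (U+30D7) and a lone `ㇷ` should become `フ` (U+30D5), and it covers only `ッ` and `ォ` from the rest of the table. It returns the new length, which a real filter would presumably pass to `termAttr.setLength(...)`.

   ```java
   /** Hypothetical, JDK-only sketch of the two-char special case. */
   final class KatakanaPSketch {
     /** Normalizes the first {@code length} chars in place; returns the new length. */
     static int normalize(char[] buffer, int length) {
       int out = 0; // write position; never ahead of the read position
       for (int in = 0; in < length; in++) {
         char c = buffer[in];
         if (c == '\u31F7') { // small fu
           if (in + 1 < length && buffer[in + 1] == '\u309A') {
             buffer[out++] = '\u30D7'; // two chars collapse into pu (assumed target)
             in++; // skip the combining mark
           } else {
             buffer[out++] = '\u30D5'; // lone small fu -> fu (assumed target)
           }
         } else if (c == '\u30C3') { // small tsu -> tsu
           buffer[out++] = '\u30C4';
         } else if (c == '\u30A9') { // small o -> o
           buffer[out++] = '\u30AA';
         } else {
           buffer[out++] = c; // everything else is copied unchanged
         }
       }
       return out;
     }
   }
   ```

   Since the output is never longer than the input, the write position can safely trail the read position in the same array, and no second buffer is allocated.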



Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-11 Thread via GitHub


daixque commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423469570


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseKatakanaUppercaseFilter.java:
##
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in katakana into 
normal letters. For
+ * instance, "ストップウォッチ" will be translated to "ストツプウオツチ".
+ *
+ * This filter is useful if you want to search against old style Japanese 
text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseKatakanaUppercaseFilter extends TokenFilter {

Review Comment:
   @dungba88  How should the constructor look like?
   
   Like this?
   ```
   public JapaneseKanaUppercaseFilter(TokenStream input, bool hiragana, bool 
katakana)
   ```
   
   Note that Katakana has an exceptional character `ㇷ゚`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-11 Thread via GitHub


daixque commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423482123


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java:
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+    // supported characters are:
+    // ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ
+    s2l =
+        Map.ofEntries(
+            Map.entry('ぁ', 'あ'),
+            Map.entry('ぃ', 'い'),
+            Map.entry('ぅ', 'う'),
+            Map.entry('ぇ', 'え'),
+            Map.entry('ぉ', 'お'),
+            Map.entry('っ', 'つ'),
+            Map.entry('ゃ', 'や'),
+            Map.entry('ゅ', 'ゆ'),
+            Map.entry('ょ', 'よ'),
+            Map.entry('ゎ', 'わ'),
+            Map.entry('ゕ', 'か'),
+            Map.entry('ゖ', 'け'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseHiraganaUppercaseFilter(TokenStream input) {
+    super(input);
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+    if (input.incrementToken()) {
+      String term = termAttr.toString();
+      char[] src = term.toCharArray();

Review Comment:
   Thanks, let me do that.



##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java:
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;

Review Comment:
   Thanks, let me do that.






Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-11 Thread via GitHub


daixque commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423470044


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java:
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+    // supported characters are:
+    // ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ
+    s2l =
+        Map.ofEntries(
+            Map.entry('ぁ', 'あ'),
+            Map.entry('ぃ', 'い'),
+            Map.entry('ぅ', 'う'),
+            Map.entry('ぇ', 'え'),
+            Map.entry('ぉ', 'お'),
+            Map.entry('っ', 'つ'),
+            Map.entry('ゃ', 'や'),
+            Map.entry('ゅ', 'ゆ'),
+            Map.entry('ょ', 'よ'),
+            Map.entry('ゎ', 'わ'),
+            Map.entry('ゕ', 'か'),
+            Map.entry('ゖ', 'け'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseHiraganaUppercaseFilter(TokenStream input) {
+    super(input);
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+    if (input.incrementToken()) {
+      String term = termAttr.toString();
+      char[] src = term.toCharArray();
+      char[] result = new char[src.length];
+      for (int i = 0; i < src.length; i++) {
+        Character c = s2l.get(src[i]);
+        if (c != null) {
+          result[i] = c;

Review Comment:
   No. See `ゕ` for example.
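
To make the objection concrete, the sketch below compares UTF-16 code units in plain Java. It relies only on the Unicode code points of the letters involved; the class and method names are hypothetical. Most small hiragana do sit one code point before their normal form, but the small "ka" (U+3095) and small "ke" (U+3096) are placed at the end of the block, far from "ka" (U+304B) and "ke" (U+3051).

```java
// Hypothetical demo class for checking the "small + 1 == normal" assumption.
final class KanaOffsetDemo {
  // True when the normal letter is exactly one code unit after the small one.
  static boolean isPlusOne(char small, char normal) {
    return small + 1 == normal;
  }

  public static void main(String[] args) {
    // Small a (U+3041) vs. a (U+3042): the offset rule holds.
    System.out.println(isPlusOne('\u3041', '\u3042'));
    // Small tsu (U+3063) vs. tsu (U+3064): the offset rule holds.
    System.out.println(isPlusOne('\u3063', '\u3064'));
    // Small ka (U+3095) vs. ka (U+304B): the offset rule fails.
    System.out.println(isPlusOne('\u3095', '\u304B'));
    // Small ke (U+3096) vs. ke (U+3051): the offset rule fails.
    System.out.println(isPlusOne('\u3096', '\u3051'));
  }
}
```

An explicit lookup table avoids depending on how the block happens to be laid out.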



##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java:
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+    // supported characters are:
+    // ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ
+    s2l =
+        Map.ofEntries(
+            Map.entry('ぁ', 'あ'),
+            Map.entry('ぃ', 'い'),
+            Map.entry('ぅ', 'う'),
+            Map.entry('ぇ', 'え'),
+            Map.entry('ぉ', 'お'),
+            Map.entry('っ', 'つ'),
+            Map.entry('ゃ', 'や'),
+            Map.entry('ゅ', 'ゆ'),
+            Map.entry('ょ', 'よ'),
+            Map.entry('ゎ', 'わ'),
+            Map.entry('ゕ', 'か'),
+  

Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-11 Thread via GitHub


dungba88 commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423402789


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java:
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;

Review Comment:
   Also, `s2l` is a bit cryptic; maybe we could use `LETTER_MAPPINGS` or something similar.






Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-11 Thread via GitHub


dungba88 commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423384044


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseKatakanaUppercaseFilter.java:
##
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in katakana into normal letters. For
+ * instance, "ストップウォッチ" will be translated to "ストツプウオツチ".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseKatakanaUppercaseFilter extends TokenFilter {

Review Comment:
   This seems to be mostly the same as the other filter, so maybe we can combine them?
   
   E.g., you could pass the mapping as a constructor parameter and provide two constant mappings.
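
One possible shape for this suggestion, sketched in plain Java without the Lucene classes so it stays self-contained. `KanaUppercaser`, `HIRAGANA`, `KATAKANA`, and `normalize` are hypothetical names, and the tables are deliberately abbreviated; this is not the code that was merged.

```java
import java.util.Map;

// Hypothetical stand-in for a combined filter; it does not extend Lucene's TokenFilter.
final class KanaUppercaser {
  // Abbreviated tables for illustration only; the filters under review list many more letters.
  static final Map<Character, Character> HIRAGANA =
      Map.of('\u3063', '\u3064', '\u3087', '\u3088'); // small tsu -> tsu, small yo -> yo
  static final Map<Character, Character> KATAKANA =
      Map.of('\u30C3', '\u30C4', '\u30A9', '\u30AA'); // small tsu -> tsu, small o -> o

  private final Map<Character, Character> mapping;

  // The caller picks one of the constant tables (or supplies its own).
  KanaUppercaser(Map<Character, Character> mapping) {
    this.mapping = mapping;
  }

  // Returns a copy of the term with every mapped small letter replaced.
  String normalize(CharSequence term) {
    StringBuilder out = new StringBuilder(term.length());
    for (int i = 0; i < term.length(); i++) {
      char c = term.charAt(i);
      char mapped = mapping.getOrDefault(c, c);
      out.append(mapped);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(new KanaUppercaser(HIRAGANA).normalize("\u3061\u3087\u3063\u3068"));
    System.out.println(new KanaUppercaser(KATAKANA).normalize("\u30A6\u30A9\u30C3\u30C1"));
  }
}
```

A per-character table like this would still not cover the two-code-unit katakana letter mentioned elsewhere in the thread, which is one argument for keeping the katakana handling separate.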






Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-11 Thread via GitHub


dungba88 commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423383461


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseKatakanaUppercaseFilter.java:
##
@@ -0,0 +1,83 @@
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in katakana into normal letters. For
+ * instance, "ストップウォッチ" will be translated to "ストツプウオツチ".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseKatakanaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+    // supported characters are:
+    // ァ ィ ゥ ェ ォ ヵ ㇰ ヶ ㇱ ㇲ ッ ㇳ ㇴ ㇵ ㇶ ㇷ ㇷ゚ ㇸ ㇹ ㇺ ャ ュ ョ ㇻ ㇼ ㇽ ㇾ ㇿ ヮ
+    s2l =
+        Map.ofEntries(
+            Map.entry('ァ', 'ア'),
+            Map.entry('ィ', 'イ'),
+            Map.entry('ゥ', 'ウ'),
+            Map.entry('ェ', 'エ'),
+            Map.entry('ォ', 'オ'),
+            Map.entry('ヵ', 'カ'),
+            Map.entry('ㇰ', 'ク'),
+            Map.entry('ヶ', 'ケ'),
+            Map.entry('ㇱ', 'シ'),
+            Map.entry('ㇲ', 'ス'),
+            Map.entry('ッ', 'ツ'),
+            Map.entry('ㇳ', 'ト'),
+            Map.entry('ㇴ', 'ヌ'),
+            Map.entry('ㇵ', 'ハ'),
+            Map.entry('ㇶ', 'ヒ'),
+            Map.entry('ㇷ', 'フ'),
+            Map.entry('ㇸ', 'ヘ'),
+            Map.entry('ㇹ', 'ホ'),
+            Map.entry('ㇺ', 'ム'),
+            Map.entry('ャ', 'ヤ'),
+            Map.entry('ュ', 'ユ'),
+            Map.entry('ョ', 'ヨ'),
+            Map.entry('ㇻ', 'ラ'),
+            Map.entry('ㇼ', 'リ'),
+            Map.entry('ㇽ', 'ル'),
+            Map.entry('ㇾ', 'レ'),
+            Map.entry('ㇿ', 'ロ'),
+            Map.entry('ヮ', 'ワ'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseKatakanaUppercaseFilter(TokenStream input) {
+    super(input);
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+    if (input.incrementToken()) {
+      String term = termAttr.toString();
+      // Small letter "ㇷ゚" is not single character, so it should be converted to "プ" as String
+      term = term.replace("ㇷ゚", "プ");
+      char[] src = term.toCharArray();

Review Comment:
   `buffer()` returns the internal `char[]` of the CharTermAttribute, which might 
be longer than the actual term. You need to use the term's `length()` as 
well.
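
A minimal sketch of the point about buffer length, in plain Java. The `char[]` and `length` arguments stand in for what `CharTermAttribute.buffer()` and `CharTermAttribute.length()` return; the class, the method, and the one-entry table are hypothetical and only illustrate why the loop must stop at the term length rather than at the array length.

```java
import java.util.Map;

// Hypothetical helper; the array stands in for a term attribute's internal buffer.
final class InPlaceKanaDemo {
  // One-entry table for illustration: small tsu (U+3063) -> tsu (U+3064).
  static final Map<Character, Character> SMALL_TO_NORMAL = Map.of('\u3063', '\u3064');

  // Rewrites only the first `length` chars; anything beyond that is stale buffer content.
  static void normalize(char[] buffer, int length) {
    for (int i = 0; i < length; i++) {
      Character mapped = SMALL_TO_NORMAL.get(buffer[i]);
      if (mapped != null) {
        buffer[i] = mapped;
      }
    }
  }

  public static void main(String[] args) {
    // Capacity 4, but the current term is only the first 2 chars.
    char[] buffer = {'\u3063', '\u3066', '\u3063', '\u3063'};
    normalize(buffer, 2);
    System.out.println(new String(buffer, 0, 2));
  }
}
```

Editing the buffer in place this way also avoids the intermediate `String` and `char[]` copies made by `toString()` and `toCharArray()`.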






Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-11 Thread via GitHub


dungba88 commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423382747


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java:
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+    // supported characters are:
+    // ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ
+    s2l =
+        Map.ofEntries(
+            Map.entry('ぁ', 'あ'),
+            Map.entry('ぃ', 'い'),
+            Map.entry('ぅ', 'う'),
+            Map.entry('ぇ', 'え'),
+            Map.entry('ぉ', 'お'),
+            Map.entry('っ', 'つ'),
+            Map.entry('ゃ', 'や'),
+            Map.entry('ゅ', 'ゆ'),
+            Map.entry('ょ', 'よ'),
+            Map.entry('ゎ', 'わ'),
+            Map.entry('ゕ', 'か'),
+            Map.entry('ゖ', 'け'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseHiraganaUppercaseFilter(TokenStream input) {
+    super(input);
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+    if (input.incrementToken()) {
+      String term = termAttr.toString();
+      char[] src = term.toCharArray();
+      char[] result = new char[src.length];
+      for (int i = 0; i < src.length; i++) {
+        Character c = s2l.get(src[i]);
+        if (c != null) {
+          result[i] = c;

Review Comment:
   It seems each small character sits one code point before its normal 
counterpart, so you could use `result[i] = (char) (src[i] + 1);` and a Set 
instead of a Map: https://en.wikipedia.org/wiki/Hiragana_(Unicode_block)






Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-11 Thread via GitHub


dungba88 commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423381320


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java:
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;

Review Comment:
   I think the field name should be all-uppercase, since it's a constant?






Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-11 Thread via GitHub


dungba88 commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423380431


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java:
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+    // supported characters are:
+    // ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ
+    s2l =
+        Map.ofEntries(
+            Map.entry('ぁ', 'あ'),
+            Map.entry('ぃ', 'い'),
+            Map.entry('ぅ', 'う'),
+            Map.entry('ぇ', 'え'),
+            Map.entry('ぉ', 'お'),
+            Map.entry('っ', 'つ'),
+            Map.entry('ゃ', 'や'),
+            Map.entry('ゅ', 'ゆ'),
+            Map.entry('ょ', 'よ'),
+            Map.entry('ゎ', 'わ'),
+            Map.entry('ゕ', 'か'),
+            Map.entry('ゖ', 'け'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseHiraganaUppercaseFilter(TokenStream input) {
+    super(input);
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+    if (input.incrementToken()) {
+      String term = termAttr.toString();
+      char[] src = term.toCharArray();

Review Comment:
   I think you can iterate through the term attribute directly. `toString()` and 
`toCharArray()` each copy the characters, so this might be inefficient.
   
   ```
   for (int i = 0; i < termAttr.length(); i++) {
     char c = termAttr.charAt(i);
     // look up c and replace it here
   }
   ```






Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-11 Thread via GitHub


dungba88 commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423380431


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java:
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+// supported characters are:
+// ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ
+s2l =
+Map.ofEntries(
+Map.entry('ぁ', 'あ'),
+Map.entry('ぃ', 'い'),
+Map.entry('ぅ', 'う'),
+Map.entry('ぇ', 'え'),
+Map.entry('ぉ', 'お'),
+Map.entry('っ', 'つ'),
+Map.entry('ゃ', 'や'),
+Map.entry('ゅ', 'ゆ'),
+Map.entry('ょ', 'よ'),
+Map.entry('ゎ', 'わ'),
+Map.entry('ゕ', 'か'),
+Map.entry('ゖ', 'け'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseHiraganaUppercaseFilter(TokenStream input) {
+super(input);
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+if (input.incrementToken()) {
+  String term = termAttr.toString();
+  char[] src = term.toCharArray();

Review Comment:
   I think you can iterate over the term attribute directly. `toString()` and 
`toCharArray()` each copy the term's characters, so they may be inefficient.






Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-11 Thread via GitHub


daixque commented on PR #12915:
URL: https://github.com/apache/lucene/pull/12915#issuecomment-1851177452

   Hi @mikemccand and @kojisekig, thank you for your reviews.
   I updated the code in line with the comments and added entries to module-info 
and the resources so that `gradle check` passes.





Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-11 Thread via GitHub


daixque commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423277326


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseKatakanaUppercaseFilter.java:
##
@@ -0,0 +1,83 @@
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in katakana into normal letters. For
+ * instance, "ストップウォッチ" will be translated to "ストツプウオツチ".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseKatakanaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+// supported characters are:
+// ァ ィ ゥ ェ ォ ヵ ㇰ ヶ ㇱ ㇲ ッ ㇳ ㇴ ㇵ ㇶ ㇷ ㇷ゚ ㇸ ㇹ ㇺ ャ ュ ョ ㇻ ㇼ ㇽ ㇾ ㇿ ヮ
+s2l =
+Map.ofEntries(
+Map.entry('ァ', 'ア'),
+Map.entry('ィ', 'イ'),
+Map.entry('ゥ', 'ウ'),
+Map.entry('ェ', 'エ'),
+Map.entry('ォ', 'オ'),
+Map.entry('ヵ', 'カ'),
+Map.entry('ㇰ', 'ク'),
+Map.entry('ヶ', 'ケ'),
+Map.entry('ㇱ', 'シ'),
+Map.entry('ㇲ', 'ス'),
+Map.entry('ッ', 'ツ'),
+Map.entry('ㇳ', 'ト'),
+Map.entry('ㇴ', 'ヌ'),
+Map.entry('ㇵ', 'ハ'),
+Map.entry('ㇶ', 'ヒ'),
+Map.entry('ㇷ', 'フ'),
+Map.entry('ㇸ', 'ヘ'),
+Map.entry('ㇹ', 'ホ'),
+Map.entry('ㇺ', 'ム'),
+Map.entry('ャ', 'ヤ'),
+Map.entry('ュ', 'ユ'),
+Map.entry('ョ', 'ヨ'),
+Map.entry('ㇻ', 'ラ'),
+Map.entry('ㇼ', 'リ'),
+Map.entry('ㇽ', 'ル'),
+Map.entry('ㇾ', 'レ'),
+Map.entry('ㇿ', 'ロ'),
+Map.entry('ヮ', 'ワ'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseKatakanaUppercaseFilter(TokenStream input) {
+super(input);
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+if (input.incrementToken()) {
+  String term = termAttr.toString();
+  // Small letter "ㇷ゚" is not single character, so it should be converted to "プ" as String
+  term = term.replace("ㇷ゚", "プ");
+  char[] src = term.toCharArray();

Review Comment:
   Thanks, but that would change the length of the result character array and 
break the tests, so let me keep the current implementation.
   
   Here is an example of the test result:
   ```
   term 0 expected:<ちよつと[]> but was:<ちよつと[sTerm��]>
   Expected :ちよつと
   Actual   :ちよつとsTerm��
   ```
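
   One plausible reading of this failure (an assumption; the thread does not show 
the code that produced it) is that the loop ran over the attribute's whole 
backing array rather than its first `length()` chars. `CharTermAttribute.buffer()` 
may return an array longer than the current term, and the tail can hold leftovers 
from earlier tokens. The sketch below imitates that with a plain `char[]`; 
`StaleBufferDemo` and its contents are hypothetical.

```java
final class StaleBufferDemo {
  // Builds a String from the whole array, stale tail included (the buggy reading).
  static String wholeArray(char[] buffer) {
    return new String(buffer);
  }

  // Builds a String from the valid prefix only.
  static String validPrefix(char[] buffer, int length) {
    return new String(buffer, 0, length);
  }

  public static void main(String[] args) {
    // A reused buffer: a 4-char term followed by leftovers from a longer, earlier term.
    char[] buffer = "ちよつとsTerm".toCharArray();
    int length = 4;
    System.out.println(wholeArray(buffer));
    System.out.println(validPrefix(buffer, length));
  }
}
```

   If that reading is right, bounding the loop by the term length rather than by 
the array length would avoid the trailing characters.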



##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseKatakanaUppercaseFilter.java:
##
@@ -0,0 +1,83 @@
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in katakana into normal letters. For
+ * instance, "ストップウォッチ" will be translated to "ストツプウオツチ".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseKatakanaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+// supported characters are:
+// ァ ィ ゥ ェ ォ ヵ ㇰ ヶ ㇱ ㇲ ッ ㇳ ㇴ ㇵ ㇶ ㇷ ㇷ゚ ㇸ ㇹ ㇺ ャ ュ ョ ㇻ ㇼ ㇽ ㇾ ㇿ ヮ
+s2l =
+Map.ofEntries(
+Map.entry('ァ', 'ア'),
+Map.entry('ィ', 'イ'),
+Map.entry('ゥ', 'ウ'),
+Map.entry('ェ', 'エ'),
+Map.entry('ォ', 'オ'),
+Map.entry('ヵ', 'カ'),
+Map.entry('ㇰ', 'ク'),
+Map.entry('ヶ', 'ケ'),
+Map.entry('ㇱ', 'シ'),
+Map.entry('ㇲ', 'ス'),
+Map.entry('ッ', 'ツ'),
+Map.entry('ㇳ', 'ト'),
+Map.entry('ㇴ', 'ヌ'),
+Map.entry('ㇵ', 'ハ'),
+Map.entry('ㇶ', 'ヒ'),
+Map.entry('ㇷ', 'フ'),
+Map.entry('ㇸ', 'ヘ'),
+Map.entry('ㇹ', 'ホ'),
+Map.entry('ㇺ', 'ム'),
+Map.entry('ャ', 'ヤ'),
+Map.entry('ュ', 'ユ'),
+Map.entry('ョ', 'ヨ'),
+Map.entry('ㇻ', 'ラ'),
+Map.entry('ㇼ', 'リ'),
+Map.entry('ㇽ', 'ル'),
+Map.entry('ㇾ', 'レ'),
+Map.entry('ㇿ', 'ロ'),
+Map.entry('ヮ', 'ワ'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseKatakanaUppercaseFilter(TokenStream input) {
+super(input);
+  }
+
+  @Override
+  public boolean 

Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-11 Thread via GitHub


daixque commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423277099


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java:
##
@@ -0,0 +1,65 @@
+package org.apache.lucene.analysis.ja;

Review Comment:
   I'm happy to do so, thanks!






Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-11 Thread via GitHub


daixque commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423277455


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java:
##
@@ -0,0 +1,65 @@
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+// supported characters are:
+// ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ
+s2l =
+Map.ofEntries(
+Map.entry('ぁ', 'あ'),
+Map.entry('ぃ', 'い'),
+Map.entry('ぅ', 'う'),
+Map.entry('ぇ', 'え'),
+Map.entry('ぉ', 'お'),
+Map.entry('っ', 'つ'),
+Map.entry('ゃ', 'や'),
+Map.entry('ゅ', 'ゆ'),
+Map.entry('ょ', 'よ'),
+Map.entry('ゎ', 'わ'),
+Map.entry('ゕ', 'か'),
+Map.entry('ゖ', 'け'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseHiraganaUppercaseFilter(TokenStream input) {
+super(input);
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+if (input.incrementToken()) {
+  String term = termAttr.toString();
+  char[] src = term.toCharArray();
+  char[] result = new char[src.length];
+  for (int i = 0; i < src.length; i++) {
+Character c = s2l.get(src[i]);
+if (c != null) {
+  result[i] = c;
+} else {
+  result[i] = src[i];
+}
+  }
+  String resultTerm = String.copyValueOf(result);
+  termAttr.setEmpty().append(resultTerm);

Review Comment:
   I couldn't find an `append` overload that accepts `char[]` (there is one that 
takes a `CharSequence` instead).
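
   For context, `CharTermAttribute` does expose `copyBuffer(char[] buffer, int 
offset, int length)`, which may be what the reviewer had in mind; that is a 
guess, not something stated in the thread. Separately, a `char[]` can be handed 
to any `append(CharSequence)` without creating a `String` by wrapping it in a 
`java.nio.CharBuffer`. The sketch below shows only that second idea, with a 
`StringBuilder` as a stand-in for the term attribute; `AppendCharsDemo` and 
`appendChars` are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.CharBuffer;

final class AppendCharsDemo {
  // Appends chars[0..length) to any Appendable through a CharSequence view,
  // without creating an intermediate String.
  static void appendChars(Appendable target, char[] chars, int length) {
    try {
      target.append(CharBuffer.wrap(chars, 0, length));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    StringBuilder termStandIn = new StringBuilder(); // stand-in for the term attribute
    char[] result = {'ち', 'よ', 'つ', 'と'};
    appendChars(termStandIn, result, result.length);
    System.out.println(termStandIn);
  }
}
```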






Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-11 Thread via GitHub


kojisekig commented on PR #12915:
URL: https://github.com/apache/lucene/pull/12915#issuecomment-1851116639

   From a Japanese speaker's perspective, the need for this sounds reasonable. 
Thank you for the contribution!





Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-11 Thread via GitHub


mikemccand commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1422804214


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java:
##
@@ -0,0 +1,65 @@
+package org.apache.lucene.analysis.ja;

Review Comment:
   Could you please add the standard Apache copyright header, if that's OK with 
you?  Thanks!  I think this will also make the GitHub actions checks 
(`./gradlew check`) happy.



##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java:
##
@@ -0,0 +1,65 @@
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+// supported characters are:
+// ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ
+s2l =
+Map.ofEntries(
+Map.entry('ぁ', 'あ'),
+Map.entry('ぃ', 'い'),
+Map.entry('ぅ', 'う'),
+Map.entry('ぇ', 'え'),
+Map.entry('ぉ', 'お'),
+Map.entry('っ', 'つ'),
+Map.entry('ゃ', 'や'),
+Map.entry('ゅ', 'ゆ'),
+Map.entry('ょ', 'よ'),
+Map.entry('ゎ', 'わ'),
+Map.entry('ゕ', 'か'),
+Map.entry('ゖ', 'け'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseHiraganaUppercaseFilter(TokenStream input) {
+super(input);
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+if (input.incrementToken()) {
+  String term = termAttr.toString();
+  char[] src = term.toCharArray();
+  char[] result = new char[src.length];
+  for (int i = 0; i < src.length; i++) {
+Character c = s2l.get(src[i]);
+if (c != null) {
+  result[i] = c;
+} else {
+  result[i] = src[i];
+}
+  }
+  String resultTerm = String.copyValueOf(result);
+  termAttr.setEmpty().append(resultTerm);

Review Comment:
   You can avoid creating a `String` here by appending the `char[] result` instead.



##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseKatakanaUppercaseFilter.java:
##
@@ -0,0 +1,83 @@
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in katakana into normal letters. For
+ * instance, "ストップウォッチ" will be translated to "ストツプウオツチ".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseKatakanaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+// supported characters are:
+// ァ ィ ゥ ェ ォ ヵ ㇰ ヶ ㇱ ㇲ ッ ㇳ ㇴ ㇵ ㇶ ㇷ ㇷ゚ ㇸ ㇹ ㇺ ャ ュ ョ ㇻ ㇼ ㇽ ㇾ ㇿ ヮ
+s2l =
+Map.ofEntries(
+Map.entry('ァ', 'ア'),
+Map.entry('ィ', 'イ'),
+Map.entry('ゥ', 'ウ'),
+Map.entry('ェ', 'エ'),
+Map.entry('ォ', 'オ'),
+Map.entry('ヵ', 'カ'),
+Map.entry('ㇰ', 'ク'),
+Map.entry('ヶ', 'ケ'),
+Map.entry('ㇱ', 'シ'),
+Map.entry('ㇲ', 'ス'),
+Map.entry('ッ', 'ツ'),
+Map.entry('ㇳ', 'ト'),
+Map.entry('ㇴ', 'ヌ'),
+Map.entry('ㇵ', 'ハ'),
+Map.entry('ㇶ', 'ヒ'),
+Map.entry('ㇷ', 'フ'),
+Map.entry('ㇸ', 'ヘ'),
+Map.entry('ㇹ', 'ホ'),
+Map.entry('ㇺ', 'ム'),
+Map.entry('ャ', 'ヤ'),
+Map.entry('ュ', 'ユ'),
+Map.entry('ョ', 'ヨ'),
+Map.entry('ㇻ', 'ラ'),
+Map.entry('ㇼ', 'リ'),
+Map.entry('ㇽ', 'ル'),
+Map.entry('ㇾ', 'レ'),
+Map.entry('ㇿ', 'ロ'),
+Map.entry('ヮ', 'ワ'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseKatakanaUppercaseFilter(TokenStream input) {
+super(input);
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+if (input.incrementToken()) {
+  String term = termAttr.toString();
+  // Small letter "ㇷ゚" is not single character, so it should be converted 
to "プ"