Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
benwtrent merged PR #12915: URL: https://github.com/apache/lucene/pull/12915 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
daixque commented on code in PR #12915: URL: https://github.com/apache/lucene/pull/12915#discussion_r1528440247 ## lucene/CHANGES.txt: ## @@ -174,12 +174,14 @@ API Changes New Features - - * GITHUB#12679: Add support for similarity-based vector searches using [Byte|Float]VectorSimilarityQuery. Uses a new VectorSimilarityCollector to find all vectors scoring above a `resultSimilarity` while traversing the HNSW graph till better-scoring nodes are available, or the best candidate is below a score of `traversalSimilarity` in the lowest level. (Aditya Prakash, Kaival Parikh) +* GITHUB#12915: Add new token filters for Japanese sutegana (捨て仮名). This introduces JapaneseHiraganaUppercaseFilter + and JapaneseKatakanaUppercaseFilter. (Dai Sugimori) + Review Comment: Thanks @benwtrent, this is done.
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
benwtrent commented on code in PR #12915: URL: https://github.com/apache/lucene/pull/12915#discussion_r1528339145 ## lucene/CHANGES.txt: ## @@ -174,12 +174,14 @@ API Changes New Features - - * GITHUB#12679: Add support for similarity-based vector searches using [Byte|Float]VectorSimilarityQuery. Uses a new VectorSimilarityCollector to find all vectors scoring above a `resultSimilarity` while traversing the HNSW graph till better-scoring nodes are available, or the best candidate is below a score of `traversalSimilarity` in the lowest level. (Aditya Prakash, Kaival Parikh) +* GITHUB#12915: Add new token filters for Japanese sutegana (捨て仮名). This introduces JapaneseHiraganaUppercaseFilter + and JapaneseKatakanaUppercaseFilter. (Dai Sugimori) + Review Comment: We have since released 9.10. Could you add your changes to 9.11 and remove from 9.10?
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
github-actions[bot] commented on PR #12915: URL: https://github.com/apache/lucene/pull/12915#issuecomment-1907129523 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
dungba88 commented on PR #12915: URL: https://github.com/apache/lucene/pull/12915#issuecomment-1882788058 I think it's good to go, but I don't have merge permission. Mike should be able to help you; otherwise you can try notifying the dev mailing list, as suggested by the bot.
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
daixque commented on PR #12915: URL: https://github.com/apache/lucene/pull/12915#issuecomment-1882633384 @mikemccand @dungba88 Pinging on this. Is there anything left for me to do on this PR? If not, could you merge it?
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
github-actions[bot] commented on PR #12915: URL: https://github.com/apache/lucene/pull/12915#issuecomment-1880898815 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
daixque commented on PR #12915: URL: https://github.com/apache/lucene/pull/12915#issuecomment-1860673133 > Looks great @daixque -- would you like to add a `lucene/CHANGES.txt` entry describing this awesome new capability? Be sure to put it under the `9.10.0` section since we can backport this change (it is not a 10.0.0-only feature). @mikemccand This is done. Thanks!
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
mikemccand commented on PR #12915: URL: https://github.com/apache/lucene/pull/12915#issuecomment-1860587667 Looks great @daixque -- would you like to add a `lucene/CHANGES.txt` entry describing this awesome new capability? Be sure to put it under the `9.10.0` section since we can backport this change (it is not a 10.0.0-only feature).
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
daixque commented on PR #12915: URL: https://github.com/apache/lucene/pull/12915#issuecomment-1858684072 I refactored the Katakana filter to apply the same kind of enhancement as well.
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
daixque commented on code in PR #12915: URL: https://github.com/apache/lucene/pull/12915#discussion_r1427647228 ## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java: ## @@ -60,15 +60,13 @@ public JapaneseHiraganaUppercaseFilter(TokenStream input) { @Override public boolean incrementToken() throws IOException { if (input.incrementToken()) { - String term = termAttr.toString(); - char[] src = term.toCharArray(); - char[] result = new char[src.length]; - for (int i = 0; i < src.length; i++) { -Character c = s2l.get(src[i]); + char[] result = new char[termAttr.length()]; Review Comment: @mikemccand @dungba88 Thanks for the suggestion. I've applied it; please take another look.
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
mikemccand commented on code in PR #12915: URL: https://github.com/apache/lucene/pull/12915#discussion_r1426601556 ## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java: ## @@ -60,15 +60,13 @@ public JapaneseHiraganaUppercaseFilter(TokenStream input) { @Override public boolean incrementToken() throws IOException { if (input.incrementToken()) { - String term = termAttr.toString(); - char[] src = term.toCharArray(); - char[] result = new char[src.length]; - for (int i = 0; i < src.length; i++) { -Character c = s2l.get(src[i]); + char[] result = new char[termAttr.length()]; Review Comment: I think this could instead be something like: ``` char[] termBuffer = termAttr.buffer(); int termLength = termAttr.length(); for(int i=0; i
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
dungba88 commented on PR #12915: URL: https://github.com/apache/lucene/pull/12915#issuecomment-1853744401 Besides the optimization of manipulating the internal `char[]` directly, I think this is good to go.
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
mikemccand commented on code in PR #12915: URL: https://github.com/apache/lucene/pull/12915#discussion_r1424520399 ## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java: ## @@ -0,0 +1,65 @@ +package org.apache.lucene.analysis.ja; + +import java.io.IOException; +import java.util.Map; +import org.apache.lucene.analysis.TokenFilter; +import org.apache.lucene.analysis.TokenStream; +import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; + +/** + * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For + * instance, "ちょっとまって" will be translated to "ちよつとまつて". + * + * This filter is useful if you want to search against old style Japanese text such as patents, + * legal, contract policies, etc. + */ +public final class JapaneseHiraganaUppercaseFilter extends TokenFilter { + private static final Map s2l; + + static { +// supported characters are: +// ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ +s2l = +Map.ofEntries( +Map.entry('ぁ', 'あ'), +Map.entry('ぃ', 'い'), +Map.entry('ぅ', 'う'), +Map.entry('ぇ', 'え'), +Map.entry('ぉ', 'お'), +Map.entry('っ', 'つ'), +Map.entry('ゃ', 'や'), +Map.entry('ゅ', 'ゆ'), +Map.entry('ょ', 'よ'), +Map.entry('ゎ', 'わ'), +Map.entry('ゕ', 'か'), +Map.entry('ゖ', 'け')); + } + + private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class); + + public JapaneseHiraganaUppercaseFilter(TokenStream input) { +super(input); + } + + @Override + public boolean incrementToken() throws IOException { +if (input.incrementToken()) { + String term = termAttr.toString(); + char[] src = term.toCharArray(); + char[] result = new char[src.length]; + for (int i = 0; i < src.length; i++) { +Character c = s2l.get(src[i]); +if (c != null) { + result[i] = c; +} else { + result[i] = src[i]; +} + } + String resultTerm = String.copyValueOf(result); + termAttr.setEmpty().append(resultTerm); Review Comment: > This will eliminate all of the byte copy. 
I don't know if we are supposed to do that (but the API allows it). Maybe @mikemccand could share some thoughts here. This is indeed the intended usage for high performance -- directly alter that underlying `char[]` buffer, asking the term attribute to grow if needed, and setting the length when you are done.
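The in-place approach described in this comment can be sketched without Lucene on the classpath. This is a minimal illustration under stated assumptions, not the PR's code: the class `InPlaceKanaSketch`, the method `normalizeInPlace`, and the map name `SMALL_TO_LARGE` are hypothetical, and a plain `char[]` plus a length stand in for what `termAttr.buffer()` and `termAttr.length()` would supply. The mapping is the twelve hiragana pairs from the `s2l` map quoted in this thread (ぁ→あ through ゖ→け), written as Unicode escapes so the file compiles under any source encoding.

```java
import java.util.Map;

class InPlaceKanaSketch {
  // Small-to-full-size hiragana pairs, same set as the s2l map in the thread.
  private static final Map<Character, Character> SMALL_TO_LARGE =
      Map.ofEntries(
          Map.entry('\u3041', '\u3042'), // small a  -> a
          Map.entry('\u3043', '\u3044'), // small i  -> i
          Map.entry('\u3045', '\u3046'), // small u  -> u
          Map.entry('\u3047', '\u3048'), // small e  -> e
          Map.entry('\u3049', '\u304A'), // small o  -> o
          Map.entry('\u3063', '\u3064'), // small tu -> tu
          Map.entry('\u3083', '\u3084'), // small ya -> ya
          Map.entry('\u3085', '\u3086'), // small yu -> yu
          Map.entry('\u3087', '\u3088'), // small yo -> yo
          Map.entry('\u308E', '\u308F'), // small wa -> wa
          Map.entry('\u3095', '\u304B'), // small ka -> ka
          Map.entry('\u3096', '\u3051')); // small ke -> ke

  /** Rewrites buffer[0..length) in place; chars at or beyond length are not touched. */
  static void normalizeInPlace(char[] buffer, int length) {
    for (int i = 0; i < length; i++) {
      Character mapped = SMALL_TO_LARGE.get(buffer[i]);
      if (mapped != null) {
        buffer[i] = mapped;
      }
    }
  }

  public static void main(String[] args) {
    // The backing array may be longer than the term, so the length is passed separately.
    char[] buffer = new char[16];
    String term = "\u3061\u3087\u3063\u3068\u307E\u3063\u3066"; // the Javadoc example term
    term.getChars(0, term.length(), buffer, 0);
    normalizeInPlace(buffer, term.length());
    System.out.println(new String(buffer, 0, term.length()));
  }
}
```

Every hiragana replacement here is a single `char`, so the term length does not change and nothing has to grow or be resized in this case. No intermediate `String` or second array is allocated per token.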
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
dungba88 commented on code in PR #12915: URL: https://github.com/apache/lucene/pull/12915#discussion_r1423583689 ## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java: ## @@ -0,0 +1,65 @@ +package org.apache.lucene.analysis.ja; + +import java.io.IOException; +import java.util.Map; +import org.apache.lucene.analysis.TokenFilter; +import org.apache.lucene.analysis.TokenStream; +import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; + +/** + * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For + * instance, "ちょっとまって" will be translated to "ちよつとまつて". + * + * This filter is useful if you want to search against old style Japanese text such as patents, + * legal, contract policies, etc. + */ +public final class JapaneseHiraganaUppercaseFilter extends TokenFilter { + private static final Map s2l; + + static { +// supported characters are: +// ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ +s2l = +Map.ofEntries( +Map.entry('ぁ', 'あ'), +Map.entry('ぃ', 'い'), +Map.entry('ぅ', 'う'), +Map.entry('ぇ', 'え'), +Map.entry('ぉ', 'お'), +Map.entry('っ', 'つ'), +Map.entry('ゃ', 'や'), +Map.entry('ゅ', 'ゆ'), +Map.entry('ょ', 'よ'), +Map.entry('ゎ', 'わ'), +Map.entry('ゕ', 'か'), +Map.entry('ゖ', 'け')); + } + + private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class); + + public JapaneseHiraganaUppercaseFilter(TokenStream input) { +super(input); + } + + @Override + public boolean incrementToken() throws IOException { +if (input.incrementToken()) { + String term = termAttr.toString(); + char[] src = term.toCharArray(); + char[] result = new char[src.length]; + for (int i = 0; i < src.length; i++) { +Character c = s2l.get(src[i]); +if (c != null) { + result[i] = c; +} else { + result[i] = src[i]; +} + } + String resultTerm = String.copyValueOf(result); + termAttr.setEmpty().append(resultTerm); Review Comment: It seems you can modify the `CharTermAttribute` directly by accessing `buffer()`, which 
will return the internal buffer. ``` char[] buffer = termAttr.buffer(); buffer[i] = LETTER_MAPPINGS.getOrDefault(buffer[i], buffer[i]); ``` This will eliminate all of the byte copy. I don't know if we are supposed to do that (but the API allows it). Maybe @mikemccand could share some thoughts here.
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
dungba88 commented on code in PR #12915: URL: https://github.com/apache/lucene/pull/12915#discussion_r1423585285 ## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseKatakanaUppercaseFilter.java: ## @@ -0,0 +1,99 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.analysis.ja; + +import java.io.IOException; +import java.util.Map; +import org.apache.lucene.analysis.TokenFilter; +import org.apache.lucene.analysis.TokenStream; +import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; + +/** + * A {@link TokenFilter} that normalizes small letters (捨て仮名) in katakana into normal letters. For + * instance, "ストップウォッチ" will be translated to "ストツプウオツチ". + * + * This filter is useful if you want to search against old style Japanese text such as patents, + * legal, contract policies, etc. + */ +public final class JapaneseKatakanaUppercaseFilter extends TokenFilter { Review Comment: You are right, maybe we can consolidate them with a base class as a follow-up. This LGTM.
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
dungba88 commented on code in PR #12915: URL: https://github.com/apache/lucene/pull/12915#discussion_r1423575442 ## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java: ## @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.analysis.ja; + +import java.io.IOException; +import java.util.Map; +import org.apache.lucene.analysis.TokenFilter; +import org.apache.lucene.analysis.TokenStream; +import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; + +/** + * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For + * instance, "ちょっとまって" will be translated to "ちよつとまつて". + * + * This filter is useful if you want to search against old style Japanese text such as patents, + * legal, contract policies, etc. 
+ */ +public final class JapaneseHiraganaUppercaseFilter extends TokenFilter { + private static final Map s2l; + + static { +// supported characters are: +// ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ +s2l = +Map.ofEntries( +Map.entry('ぁ', 'あ'), +Map.entry('ぃ', 'い'), +Map.entry('ぅ', 'う'), +Map.entry('ぇ', 'え'), +Map.entry('ぉ', 'お'), +Map.entry('っ', 'つ'), +Map.entry('ゃ', 'や'), +Map.entry('ゅ', 'ゆ'), +Map.entry('ょ', 'よ'), +Map.entry('ゎ', 'わ'), +Map.entry('ゕ', 'か'), +Map.entry('ゖ', 'け')); + } + + private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class); + + public JapaneseHiraganaUppercaseFilter(TokenStream input) { +super(input); + } + + @Override + public boolean incrementToken() throws IOException { +if (input.incrementToken()) { + String term = termAttr.toString(); + char[] src = term.toCharArray(); + char[] result = new char[src.length]; + for (int i = 0; i < src.length; i++) { +Character c = s2l.get(src[i]); +if (c != null) { + result[i] = c; Review Comment: I see, that makes sense. Thank you
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
daixque commented on code in PR #12915: URL: https://github.com/apache/lucene/pull/12915#discussion_r1423470044 ## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java: ## @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.analysis.ja; + +import java.io.IOException; +import java.util.Map; +import org.apache.lucene.analysis.TokenFilter; +import org.apache.lucene.analysis.TokenStream; +import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; + +/** + * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For + * instance, "ちょっとまって" will be translated to "ちよつとまつて". + * + * This filter is useful if you want to search against old style Japanese text such as patents, + * legal, contract policies, etc. 
+ */ +public final class JapaneseHiraganaUppercaseFilter extends TokenFilter { + private static final Map s2l; + + static { +// supported characters are: +// ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ +s2l = +Map.ofEntries( +Map.entry('ぁ', 'あ'), +Map.entry('ぃ', 'い'), +Map.entry('ぅ', 'う'), +Map.entry('ぇ', 'え'), +Map.entry('ぉ', 'お'), +Map.entry('っ', 'つ'), +Map.entry('ゃ', 'や'), +Map.entry('ゅ', 'ゆ'), +Map.entry('ょ', 'よ'), +Map.entry('ゎ', 'わ'), +Map.entry('ゕ', 'か'), +Map.entry('ゖ', 'け')); + } + + private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class); + + public JapaneseHiraganaUppercaseFilter(TokenStream input) { +super(input); + } + + @Override + public boolean incrementToken() throws IOException { +if (input.incrementToken()) { + String term = termAttr.toString(); + char[] src = term.toCharArray(); + char[] result = new char[src.length]; + for (int i = 0; i < src.length; i++) { +Character c = s2l.get(src[i]); +if (c != null) { + result[i] = c; Review Comment: > It seems all small characters are just 1 position ahead of the normal characters That's not correct. See `ゕ` for example.
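The point about `ゕ` can be checked by printing the code point distance for each pair in the mapping table. The sketch below is illustrative only; the class name `SuteganaOffsetSketch` and its helper methods are hypothetical, and the pairs are the twelve from the `s2l` map quoted above, written as Unicode escapes. Ten of the pairs sit exactly one code point apart, but ゕ (U+3095) and ゖ (U+3096) are located after the main hiragana run, far from か (U+304B) and け (U+3051).

```java
class SuteganaOffsetSketch {
  // {small, full-size} hiragana pairs from the thread's mapping table.
  static final char[][] PAIRS = {
    {'\u3041', '\u3042'}, // small a  -> a
    {'\u3043', '\u3044'}, // small i  -> i
    {'\u3045', '\u3046'}, // small u  -> u
    {'\u3047', '\u3048'}, // small e  -> e
    {'\u3049', '\u304A'}, // small o  -> o
    {'\u3063', '\u3064'}, // small tu -> tu
    {'\u3083', '\u3084'}, // small ya -> ya
    {'\u3085', '\u3086'}, // small yu -> yu
    {'\u3087', '\u3088'}, // small yo -> yo
    {'\u308E', '\u308F'}, // small wa -> wa
    {'\u3095', '\u304B'}, // small ka -> ka
    {'\u3096', '\u3051'}, // small ke -> ke
  };

  /** Distance from the small form to the full-size form, in code points. */
  static int offset(char small, char full) {
    return full - small;
  }

  /** True only if a plain "add one" rule would reproduce every pair. */
  static boolean allOffsetsAreOne() {
    for (char[] pair : PAIRS) {
      if (offset(pair[0], pair[1]) != 1) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    for (char[] pair : PAIRS) {
      System.out.printf(
          "U+%04X -> U+%04X (offset %d)%n",
          (int) pair[0], (int) pair[1], offset(pair[0], pair[1]));
    }
    System.out.println("all offsets are +1: " + allOffsetsAreOne());
  }
}
```

Since the last two pairs break the pattern, an arithmetic shortcut would need special cases anyway, which is one reason an explicit lookup table is a reasonable choice here.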
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
daixque commented on code in PR #12915: URL: https://github.com/apache/lucene/pull/12915#discussion_r1423469570 ## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseKatakanaUppercaseFilter.java: ## @@ -0,0 +1,99 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.analysis.ja; + +import java.io.IOException; +import java.util.Map; +import org.apache.lucene.analysis.TokenFilter; +import org.apache.lucene.analysis.TokenStream; +import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; + +/** + * A {@link TokenFilter} that normalizes small letters (捨て仮名) in katakana into normal letters. For + * instance, "ストップウォッチ" will be translated to "ストツプウオツチ". + * + * This filter is useful if you want to search against old style Japanese text such as patents, + * legal, contract policies, etc. + */ +public final class JapaneseKatakanaUppercaseFilter extends TokenFilter { Review Comment: @dungba88 What should the constructor look like? Something like this? ``` public JapaneseKanaUppercaseFilter(TokenStream input, boolean hiragana, boolean katakana) ``` Note that Katakana has an exceptional character `ㇷ゚`, so the logic is slightly different from hiragana.
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
daixque commented on code in PR #12915: URL: https://github.com/apache/lucene/pull/12915#discussion_r1423469570 ## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseKatakanaUppercaseFilter.java: ## @@ -0,0 +1,99 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.analysis.ja; + +import java.io.IOException; +import java.util.Map; +import org.apache.lucene.analysis.TokenFilter; +import org.apache.lucene.analysis.TokenStream; +import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; + +/** + * A {@link TokenFilter} that normalizes small letters (捨て仮名) in katakana into normal letters. For + * instance, "ストップウォッチ" will be translated to "ストツプウオツチ". + * + * This filter is useful if you want to search against old style Japanese text such as patents, + * legal, contract policies, etc. + */ +public final class JapaneseKatakanaUppercaseFilter extends TokenFilter { Review Comment: @dungba88 How should the constructor look like? Like this? ``` public JapaneseKanaUppercaseFilter(TokenStream input, bool hiragana, bool katakana) ``` Note that Katakana has an exceptional character `ㇷ゚` -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
daixque commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423469570

## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseKatakanaUppercaseFilter.java: ##
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in katakana into normal letters. For
+ * instance, "ストップウォッチ" will be translated to "ストツプウオツチ".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseKatakanaUppercaseFilter extends TokenFilter {

Review Comment:
@dungba88 What should the constructor look like? Like this?

```
public JapaneseKanaUppercaseFilter(TokenStream input, boolean hiragana, boolean katakana)
```

Note that katakana has an exceptional character, `ㇷ゚`.
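To make the exceptional character concrete: `ㇷ゚` is written as two UTF-16 code units (small katakana FU, U+31F7, followed by the combining semi-voiced sound mark, U+309A), so it cannot be a key in a `Map<Character, Character>`. The following is only a sketch in plain Java; the class name `SmallPuDemo` and the method `normalizeSmallPu` are hypothetical, and the replacement target is assumed to be the precomposed プ (U+30D7).

```java
final class SmallPuDemo {
  // Small katakana FU (U+31F7) plus the combining semi-voiced sound mark (U+309A).
  static final String SMALL_PU = "\u31F7\u309A";
  // Assumed replacement: the precomposed full-size katakana PU (U+30D7).
  static final String LARGE_PU = "\u30D7";

  /** Replaces every two-unit small PU sequence with the full-size letter. */
  static String normalizeSmallPu(String term) {
    return term.replace(SMALL_PU, LARGE_PU);
  }

  public static void main(String[] args) {
    // The sequence occupies two chars, so a per-char lookup would never match it.
    System.out.println(SMALL_PU.length());
    System.out.println(normalizeSmallPu("\u30A2" + SMALL_PU));
  }
}
```

Because the sequence is longer than its replacement, any combined filter would probably need a string-level step for this one case in addition to the per-character table.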
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
daixque commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423482123

## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java: ##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+    // supported characters are:
+    // ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ
+    s2l =
+        Map.ofEntries(
+            Map.entry('ぁ', 'あ'),
+            Map.entry('ぃ', 'い'),
+            Map.entry('ぅ', 'う'),
+            Map.entry('ぇ', 'え'),
+            Map.entry('ぉ', 'お'),
+            Map.entry('っ', 'つ'),
+            Map.entry('ゃ', 'や'),
+            Map.entry('ゅ', 'ゆ'),
+            Map.entry('ょ', 'よ'),
+            Map.entry('ゎ', 'わ'),
+            Map.entry('ゕ', 'か'),
+            Map.entry('ゖ', 'け'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseHiraganaUppercaseFilter(TokenStream input) {
+    super(input);
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+    if (input.incrementToken()) {
+      String term = termAttr.toString();
+      char[] src = term.toCharArray();

Review Comment:
Thanks, let me do that.

## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java: ##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;

Review Comment:
Thanks, let me do that.
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
daixque commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423470044

## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java: ##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+    // supported characters are:
+    // ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ
+    s2l =
+        Map.ofEntries(
+            Map.entry('ぁ', 'あ'),
+            Map.entry('ぃ', 'い'),
+            Map.entry('ぅ', 'う'),
+            Map.entry('ぇ', 'え'),
+            Map.entry('ぉ', 'お'),
+            Map.entry('っ', 'つ'),
+            Map.entry('ゃ', 'や'),
+            Map.entry('ゅ', 'ゆ'),
+            Map.entry('ょ', 'よ'),
+            Map.entry('ゎ', 'わ'),
+            Map.entry('ゕ', 'か'),
+            Map.entry('ゖ', 'け'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseHiraganaUppercaseFilter(TokenStream input) {
+    super(input);
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+    if (input.incrementToken()) {
+      String term = termAttr.toString();
+      char[] src = term.toCharArray();
+      char[] result = new char[src.length];
+      for (int i = 0; i < src.length; i++) {
+        Character c = s2l.get(src[i]);
+        if (c != null) {
+          result[i] = c;

Review Comment:
No. See `ゕ` for example.

## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java: ##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+    // supported characters are:
+    // ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ
+    s2l =
+        Map.ofEntries(
+            Map.entry('ぁ', 'あ'),
+            Map.entry('ぃ', 'い'),
+            Map.entry('ぅ', 'う'),
+            Map.entry('ぇ', 'え'),
+            Map.entry('ぉ', 'お'),
+            Map.entry('っ', 'つ'),
+            Map.entry('ゃ', 'や'),
+            Map.entry('ゅ', 'ゆ'),
+            Map.entry('ょ', 'よ'),
+            Map.entry('ゎ', 'わ'),
+            Map.entry('ゕ', 'か'),
+
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
dungba88 commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423402789

## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java: ##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;

Review Comment:
Also, `s2l` is a bit cryptic; maybe we could use `LETTER_MAPPINGS` or something similar.
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
dungba88 commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423384044

## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseKatakanaUppercaseFilter.java: ##
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in katakana into normal letters. For
+ * instance, "ストップウォッチ" will be translated to "ストツプウオツチ".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseKatakanaUppercaseFilter extends TokenFilter {

Review Comment:
This seems to be mostly the same as the other filter, so maybe we can combine them? E.g., you could pass the mapping as a constructor parameter and provide two constant mappings.
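The suggestion above (one class, with the mapping passed to the constructor and two constant mappings provided) might look roughly like the following. This is a plain-Java sketch rather than a Lucene `TokenFilter`, so that it runs without the Lucene dependency; the class name `KanaUppercaseSketch`, the constants `HIRAGANA` and `KATAKANA`, and the method `normalize` are all hypothetical, and both tables are abbreviated to three entries each.

```java
import java.util.Map;

final class KanaUppercaseSketch {
  // Abbreviated hiragana table: small A, small TSU, small YO mapped to full size.
  static final Map<Character, Character> HIRAGANA =
      Map.of('\u3041', '\u3042', '\u3063', '\u3064', '\u3087', '\u3088');

  // Abbreviated katakana table: small A, small TSU, small O mapped to full size.
  static final Map<Character, Character> KATAKANA =
      Map.of('\u30A1', '\u30A2', '\u30C3', '\u30C4', '\u30A9', '\u30AA');

  private final Map<Character, Character> mapping;

  KanaUppercaseSketch(Map<Character, Character> mapping) {
    this.mapping = mapping;
  }

  /** Returns a copy of the term with every mapped small letter replaced. */
  String normalize(CharSequence term) {
    StringBuilder out = new StringBuilder(term.length());
    for (int i = 0; i < term.length(); i++) {
      char c = term.charAt(i);
      // Characters without an entry pass through unchanged.
      char mapped = mapping.getOrDefault(c, c);
      out.append(mapped);
    }
    return out.toString();
  }
}
```

A caller would pick the table at construction time, e.g. `new KanaUppercaseSketch(KanaUppercaseSketch.KATAKANA)`. Note that a per-character table alone does not cover the two-unit `ㇷ゚` case in katakana.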
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
dungba88 commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423383461

## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseKatakanaUppercaseFilter.java: ##
@@ -0,0 +1,83 @@
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in katakana into normal letters. For
+ * instance, "ストップウォッチ" will be translated to "ストツプウオツチ".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseKatakanaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+    // supported characters are:
+    // ァ ィ ゥ ェ ォ ヵ ㇰ ヶ ㇱ ㇲ ッ ㇳ ㇴ ㇵ ㇶ ㇷ ㇷ゚ ㇸ ㇹ ㇺ ャ ュ ョ ㇻ ㇼ ㇽ ㇾ ㇿ ヮ
+    s2l =
+        Map.ofEntries(
+            Map.entry('ァ', 'ア'),
+            Map.entry('ィ', 'イ'),
+            Map.entry('ゥ', 'ウ'),
+            Map.entry('ェ', 'エ'),
+            Map.entry('ォ', 'オ'),
+            Map.entry('ヵ', 'カ'),
+            Map.entry('ㇰ', 'ク'),
+            Map.entry('ヶ', 'ケ'),
+            Map.entry('ㇱ', 'シ'),
+            Map.entry('ㇲ', 'ス'),
+            Map.entry('ッ', 'ツ'),
+            Map.entry('ㇳ', 'ト'),
+            Map.entry('ㇴ', 'ヌ'),
+            Map.entry('ㇵ', 'ハ'),
+            Map.entry('ㇶ', 'ヒ'),
+            Map.entry('ㇷ', 'フ'),
+            Map.entry('ㇸ', 'ヘ'),
+            Map.entry('ㇹ', 'ホ'),
+            Map.entry('ㇺ', 'ム'),
+            Map.entry('ャ', 'ヤ'),
+            Map.entry('ュ', 'ユ'),
+            Map.entry('ョ', 'ヨ'),
+            Map.entry('ㇻ', 'ラ'),
+            Map.entry('ㇼ', 'リ'),
+            Map.entry('ㇽ', 'ル'),
+            Map.entry('ㇾ', 'レ'),
+            Map.entry('ㇿ', 'ロ'),
+            Map.entry('ヮ', 'ワ'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseKatakanaUppercaseFilter(TokenStream input) {
+    super(input);
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+    if (input.incrementToken()) {
+      String term = termAttr.toString();
+      // Small letter "ㇷ゚" is not single character, so it should be converted to "プ" as String
+      term = term.replace("ㇷ゚", "プ");
+      char[] src = term.toCharArray();

Review Comment:
`buffer()` returns the internal `char[]` of the CharTermAttribute, which might hold more chars than the actual term length. You need to use the term attribute's `length()` as well.
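The point about the buffer being longer than the term can be shown without Lucene. In the sketch below, a plain `char[]` plus an `int` length stand in for what `CharTermAttribute.buffer()` and `length()` return; the class `TermBufferSketch` and the method `uppercaseInPlace` are hypothetical, and only one mapping (small TSU to full-size TSU) is handled to keep it short.

```java
final class TermBufferSketch {
  /**
   * Rewrites small hiragana TSU (U+3063) to full-size TSU (U+3064) in place,
   * touching only the first {@code length} chars of {@code buffer}.
   */
  static void uppercaseInPlace(char[] buffer, int length) {
    // Loop to length, not buffer.length: the tail may hold stale chars
    // left over from a previous, longer token.
    for (int i = 0; i < length; i++) {
      if (buffer[i] == '\u3063') {
        buffer[i] = '\u3064';
      }
    }
  }

  public static void main(String[] args) {
    // Valid term is the first two chars; index 2 is leftover data.
    char[] buffer = {'\u3063', '\u3066', '\u3063', 0};
    uppercaseInPlace(buffer, 2);
    System.out.println(new String(buffer, 0, 2));
  }
}
```

Iterating to `buffer.length` instead would also rewrite the stale small TSU at index 2, which is harmless here but wasted work, and it would be a real bug for any logic that counts or reports matches.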
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
dungba88 commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423382747

## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java: ##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+    // supported characters are:
+    // ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ
+    s2l =
+        Map.ofEntries(
+            Map.entry('ぁ', 'あ'),
+            Map.entry('ぃ', 'い'),
+            Map.entry('ぅ', 'う'),
+            Map.entry('ぇ', 'え'),
+            Map.entry('ぉ', 'お'),
+            Map.entry('っ', 'つ'),
+            Map.entry('ゃ', 'や'),
+            Map.entry('ゅ', 'ゆ'),
+            Map.entry('ょ', 'よ'),
+            Map.entry('ゎ', 'わ'),
+            Map.entry('ゕ', 'か'),
+            Map.entry('ゖ', 'け'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseHiraganaUppercaseFilter(TokenStream input) {
+    super(input);
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+    if (input.incrementToken()) {
+      String term = termAttr.toString();
+      char[] src = term.toCharArray();
+      char[] result = new char[src.length];
+      for (int i = 0; i < src.length; i++) {
+        Character c = s2l.get(src[i]);
+        if (c != null) {
+          result[i] = c;

Review Comment:
It seems each small character sits exactly one code point before its normal counterpart, so you could use `result[i] = (char) (src[i] + 1);` and a Set instead of a Map: https://en.wikipedia.org/wiki/Hiragana_(Unicode_block)
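Whether the "+1" shortcut holds can be checked directly against the code points. The sketch below is plain Java with a hypothetical helper `plusOneHolds`; it compares a few of the pairs from the mapping table and shows why the reply elsewhere in this thread points at `ゕ`.

```java
final class PlusOneCheck {
  /** True if the full-size letter is exactly one code point after the small one. */
  static boolean plusOneHolds(char small, char large) {
    return small + 1 == large;
  }

  public static void main(String[] args) {
    // Small A (U+3041) and A (U+3042): adjacent.
    System.out.println(plusOneHolds('\u3041', '\u3042'));
    // Small TSU (U+3063) and TSU (U+3064): adjacent.
    System.out.println(plusOneHolds('\u3063', '\u3064'));
    // Small KA (U+3095) and KA (U+304B): far apart, so the shortcut fails.
    System.out.println(plusOneHolds('\u3095', '\u304B'));
    // Small KE (U+3096) and KE (U+3051): also far apart.
    System.out.println(plusOneHolds('\u3096', '\u3051'));
  }
}
```

Small KA and small KE sit at the end of the Hiragana block rather than next to their full-size forms, so an explicit table (or at least special cases for those two) is still needed.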
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
dungba88 commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423381320

## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java: ##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;

Review Comment:
I think the field name should be all-uppercase, as it's a constant?
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
dungba88 commented on code in PR #12915:
URL: https://github.com/apache/lucene/pull/12915#discussion_r1423380431

## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java: ##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+    // supported characters are:
+    // ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ
+    s2l =
+        Map.ofEntries(
+            Map.entry('ぁ', 'あ'),
+            Map.entry('ぃ', 'い'),
+            Map.entry('ぅ', 'う'),
+            Map.entry('ぇ', 'え'),
+            Map.entry('ぉ', 'お'),
+            Map.entry('っ', 'つ'),
+            Map.entry('ゃ', 'や'),
+            Map.entry('ゅ', 'ゆ'),
+            Map.entry('ょ', 'よ'),
+            Map.entry('ゎ', 'わ'),
+            Map.entry('ゕ', 'か'),
+            Map.entry('ゖ', 'け'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseHiraganaUppercaseFilter(TokenStream input) {
+    super(input);
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+    if (input.incrementToken()) {
+      String term = termAttr.toString();
+      char[] src = term.toCharArray();

Review Comment:
I think you can iterate through the term attribute directly. Both `toString()` and `toCharArray()` copy the term's characters, so they might be inefficient.

```
for (int i = 0; i < termAttr.length(); i++) {
  char c = termAttr.charAt(i);
  // look up c in the mapping and replace it here
}
```
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
daixque commented on PR #12915:
URL: https://github.com/apache/lucene/pull/12915#issuecomment-1851177452

Hi @mikemccand and @kojisekig, thank you for your reviews. I updated the code according to your comments and added lines to module-info and resources to make `gradle check` green.
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
daixque commented on code in PR #12915: URL: https://github.com/apache/lucene/pull/12915#discussion_r1423277326

## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseKatakanaUppercaseFilter.java: ##
@@ -0,0 +1,83 @@
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in katakana into normal letters. For
+ * instance, "ストップウォッチ" will be translated to "ストツプウオツチ".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseKatakanaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+    // supported characters are:
+    // ァ ィ ゥ ェ ォ ヵ ㇰ ヶ ㇱ ㇲ ッ ㇳ ㇴ ㇵ ㇶ ㇷ ㇷ゚ ㇸ ㇹ ㇺ ャ ュ ョ ㇻ ㇼ ㇽ ㇾ ㇿ ヮ
+    s2l =
+        Map.ofEntries(
+            Map.entry('ァ', 'ア'),
+            Map.entry('ィ', 'イ'),
+            Map.entry('ゥ', 'ウ'),
+            Map.entry('ェ', 'エ'),
+            Map.entry('ォ', 'オ'),
+            Map.entry('ヵ', 'カ'),
+            Map.entry('ㇰ', 'ク'),
+            Map.entry('ヶ', 'ケ'),
+            Map.entry('ㇱ', 'シ'),
+            Map.entry('ㇲ', 'ス'),
+            Map.entry('ッ', 'ツ'),
+            Map.entry('ㇳ', 'ト'),
+            Map.entry('ㇴ', 'ヌ'),
+            Map.entry('ㇵ', 'ハ'),
+            Map.entry('ㇶ', 'ヒ'),
+            Map.entry('ㇷ', 'フ'),
+            Map.entry('ㇸ', 'ヘ'),
+            Map.entry('ㇹ', 'ホ'),
+            Map.entry('ㇺ', 'ム'),
+            Map.entry('ャ', 'ヤ'),
+            Map.entry('ュ', 'ユ'),
+            Map.entry('ョ', 'ヨ'),
+            Map.entry('ㇻ', 'ラ'),
+            Map.entry('ㇼ', 'リ'),
+            Map.entry('ㇽ', 'ル'),
+            Map.entry('ㇾ', 'レ'),
+            Map.entry('ㇿ', 'ロ'),
+            Map.entry('ヮ', 'ワ'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseKatakanaUppercaseFilter(TokenStream input) {
+    super(input);
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+    if (input.incrementToken()) {
+      String term = termAttr.toString();
+      // Small letter "ㇷ゚" is not single character, so it should be converted to "プ" as String
+      term = term.replace("ㇷ゚", "プ");
+      char[] src = term.toCharArray();

Review Comment:
Thanks, but that affects the length of the resulting character array and breaks the tests, so let me keep the current implementation. Here is an example of the test result:

```
term 0 expected:<ちよつと[]> but was:<ちよつと[sTerm��]>
Expected :ちよつと
Actual :ちよつとsTerm��
```

## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseKatakanaUppercaseFilter.java: ##
@@ -0,0 +1,83 @@
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in katakana into normal letters. For
+ * instance, "ストップウォッチ" will be translated to "ストツプウオツチ".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseKatakanaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+    // supported characters are:
+    // ァ ィ ゥ ェ ォ ヵ ㇰ ヶ ㇱ ㇲ ッ ㇳ ㇴ ㇵ ㇶ ㇷ ㇷ゚ ㇸ ㇹ ㇺ ャ ュ ョ ㇻ ㇼ ㇽ ㇾ ㇿ ヮ
+    s2l =
+        Map.ofEntries(
+            Map.entry('ァ', 'ア'),
+            Map.entry('ィ', 'イ'),
+            Map.entry('ゥ', 'ウ'),
+            Map.entry('ェ', 'エ'),
+            Map.entry('ォ', 'オ'),
+            Map.entry('ヵ', 'カ'),
+            Map.entry('ㇰ', 'ク'),
+            Map.entry('ヶ', 'ケ'),
+            Map.entry('ㇱ', 'シ'),
+            Map.entry('ㇲ', 'ス'),
+            Map.entry('ッ', 'ツ'),
+            Map.entry('ㇳ', 'ト'),
+            Map.entry('ㇴ', 'ヌ'),
+            Map.entry('ㇵ', 'ハ'),
+            Map.entry('ㇶ', 'ヒ'),
+            Map.entry('ㇷ', 'フ'),
+            Map.entry('ㇸ', 'ヘ'),
+            Map.entry('ㇹ', 'ホ'),
+            Map.entry('ㇺ', 'ム'),
+            Map.entry('ャ', 'ヤ'),
+            Map.entry('ュ', 'ユ'),
+            Map.entry('ョ', 'ヨ'),
+            Map.entry('ㇻ', 'ラ'),
+            Map.entry('ㇼ', 'リ'),
+            Map.entry('ㇽ', 'ル'),
+            Map.entry('ㇾ', 'レ'),
+            Map.entry('ㇿ', 'ロ'),
+            Map.entry('ヮ', 'ワ'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseKatakanaUppercaseFilter(TokenStream input) {
+    super(input);
+  }
+
+  @Override
+  public boolean
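The trailing characters in the failure output look like leftovers from a reused buffer. One plausible cause, and this is an assumption since the attempted change is not shown in the thread, is reading the whole backing array instead of only the first `length()` characters. The sketch below reproduces that kind of symptom with a plain `char[]`; `StaleBufferDemo` and `fill` are hypothetical names.

```java
/** Stand-in sketch: shows why a reused buffer must be read up to its logical length. */
final class StaleBufferDemo {
  /** Copies {@code term} to the start of {@code buffer} and returns the logical length. */
  static int fill(char[] buffer, String term) {
    term.getChars(0, term.length(), buffer, 0);
    return term.length();
  }

  public static void main(String[] args) {
    char[] buffer = new char[10];
    fill(buffer, "longerTerm");            // an earlier token fills all 10 slots
    int length = fill(buffer, "ちよつと"); // the next token overwrites only the first 4

    String wholeArray = new String(buffer);            // includes leftover chars
    String logicalTerm = new String(buffer, 0, length); // only the current term

    System.out.println(wholeArray);  // prints ちよつとerTerm
    System.out.println(logicalTerm); // prints ちよつと
  }
}
```

If that is indeed what happened, bounding every read and write by the attribute's `length()` rather than by `buffer().length` should leave the term length unchanged for one-to-one replacements.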
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
daixque commented on code in PR #12915: URL: https://github.com/apache/lucene/pull/12915#discussion_r1423277099

## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java: ##
@@ -0,0 +1,65 @@
+package org.apache.lucene.analysis.ja;

Review Comment:
I'm happy to do so, thanks!
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
daixque commented on code in PR #12915: URL: https://github.com/apache/lucene/pull/12915#discussion_r1423277455

## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java: ##
@@ -0,0 +1,65 @@
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+    // supported characters are:
+    // ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ
+    s2l =
+        Map.ofEntries(
+            Map.entry('ぁ', 'あ'),
+            Map.entry('ぃ', 'い'),
+            Map.entry('ぅ', 'う'),
+            Map.entry('ぇ', 'え'),
+            Map.entry('ぉ', 'お'),
+            Map.entry('っ', 'つ'),
+            Map.entry('ゃ', 'や'),
+            Map.entry('ゅ', 'ゆ'),
+            Map.entry('ょ', 'よ'),
+            Map.entry('ゎ', 'わ'),
+            Map.entry('ゕ', 'か'),
+            Map.entry('ゖ', 'け'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseHiraganaUppercaseFilter(TokenStream input) {
+    super(input);
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+    if (input.incrementToken()) {
+      String term = termAttr.toString();
+      char[] src = term.toCharArray();
+      char[] result = new char[src.length];
+      for (int i = 0; i < src.length; i++) {
+        Character c = s2l.get(src[i]);
+        if (c != null) {
+          result[i] = c;
+        } else {
+          result[i] = src[i];
+        }
+      }
+      String resultTerm = String.copyValueOf(result);
+      termAttr.setEmpty().append(resultTerm);

Review Comment:
I couldn't find an `append` overload that accepts a `char[]` (there is one that takes a `CharSequence` instead).
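For what it's worth, a `char[]` can be handed to a `CharSequence`-accepting `append` without building a `String`, by wrapping it with `java.nio.CharBuffer.wrap`. The sketch below uses a `StringBuilder` behind the `Appendable` interface as a stand-in for the term attribute (an assumption made so that it runs without Lucene); `CharArrayAsCharSequence` and `appendChars` are hypothetical names. `CharTermAttribute` also has `copyBuffer(char[], int, int)`, which replaces the term contents from an array directly.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.CharBuffer;

/** Stand-in sketch: passes a char[] to an append(CharSequence) overload without a String. */
final class CharArrayAsCharSequence {
  /** Appends the first {@code length} chars of {@code chars} to {@code out}. */
  static void appendChars(Appendable out, char[] chars, int length) {
    // CharBuffer implements CharSequence and wraps the array without copying it.
    CharSequence view = CharBuffer.wrap(chars, 0, length);
    try {
      out.append(view);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    StringBuilder out = new StringBuilder();
    // Only the first 4 chars are appended; the trailing 'x' is ignored.
    appendChars(out, new char[] {'ち', 'よ', 'つ', 'と', 'x'}, 4);
    System.out.println(out); // prints ちよつと
  }
}
```

The wrapper still allocates a small `CharBuffer` object per call, so whether it beats `String.copyValueOf` in practice would need measuring.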
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
kojisekig commented on PR #12915: URL: https://github.com/apache/lucene/pull/12915#issuecomment-1851116639 From a Japanese perspective, the necessity sounds reasonable. Thank you for the contribution!
Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]
mikemccand commented on code in PR #12915: URL: https://github.com/apache/lucene/pull/12915#discussion_r1422804214

## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java: ##
@@ -0,0 +1,65 @@
+package org.apache.lucene.analysis.ja;

Review Comment:
Could you please add the standard Apache copyright header, if that's OK with you? Thanks! I think this will also make the GitHub actions checks (`./gradlew check`) happy.

## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java: ##
@@ -0,0 +1,65 @@
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in hiragana into normal letters. For
+ * instance, "ちょっとまって" will be translated to "ちよつとまつて".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+    // supported characters are:
+    // ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ
+    s2l =
+        Map.ofEntries(
+            Map.entry('ぁ', 'あ'),
+            Map.entry('ぃ', 'い'),
+            Map.entry('ぅ', 'う'),
+            Map.entry('ぇ', 'え'),
+            Map.entry('ぉ', 'お'),
+            Map.entry('っ', 'つ'),
+            Map.entry('ゃ', 'や'),
+            Map.entry('ゅ', 'ゆ'),
+            Map.entry('ょ', 'よ'),
+            Map.entry('ゎ', 'わ'),
+            Map.entry('ゕ', 'か'),
+            Map.entry('ゖ', 'け'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseHiraganaUppercaseFilter(TokenStream input) {
+    super(input);
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+    if (input.incrementToken()) {
+      String term = termAttr.toString();
+      char[] src = term.toCharArray();
+      char[] result = new char[src.length];
+      for (int i = 0; i < src.length; i++) {
+        Character c = s2l.get(src[i]);
+        if (c != null) {
+          result[i] = c;
+        } else {
+          result[i] = src[i];
+        }
+      }
+      String resultTerm = String.copyValueOf(result);
+      termAttr.setEmpty().append(resultTerm);

Review Comment:
You can avoid making `String` here by appending the `char[] result` instead.

## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseKatakanaUppercaseFilter.java: ##
@@ -0,0 +1,83 @@
+package org.apache.lucene.analysis.ja;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+
+/**
+ * A {@link TokenFilter} that normalizes small letters (捨て仮名) in katakana into normal letters. For
+ * instance, "ストップウォッチ" will be translated to "ストツプウオツチ".
+ *
+ * This filter is useful if you want to search against old style Japanese text such as patents,
+ * legal, contract policies, etc.
+ */
+public final class JapaneseKatakanaUppercaseFilter extends TokenFilter {
+  private static final Map<Character, Character> s2l;
+
+  static {
+    // supported characters are:
+    // ァ ィ ゥ ェ ォ ヵ ㇰ ヶ ㇱ ㇲ ッ ㇳ ㇴ ㇵ ㇶ ㇷ ㇷ゚ ㇸ ㇹ ㇺ ャ ュ ョ ㇻ ㇼ ㇽ ㇾ ㇿ ヮ
+    s2l =
+        Map.ofEntries(
+            Map.entry('ァ', 'ア'),
+            Map.entry('ィ', 'イ'),
+            Map.entry('ゥ', 'ウ'),
+            Map.entry('ェ', 'エ'),
+            Map.entry('ォ', 'オ'),
+            Map.entry('ヵ', 'カ'),
+            Map.entry('ㇰ', 'ク'),
+            Map.entry('ヶ', 'ケ'),
+            Map.entry('ㇱ', 'シ'),
+            Map.entry('ㇲ', 'ス'),
+            Map.entry('ッ', 'ツ'),
+            Map.entry('ㇳ', 'ト'),
+            Map.entry('ㇴ', 'ヌ'),
+            Map.entry('ㇵ', 'ハ'),
+            Map.entry('ㇶ', 'ヒ'),
+            Map.entry('ㇷ', 'フ'),
+            Map.entry('ㇸ', 'ヘ'),
+            Map.entry('ㇹ', 'ホ'),
+            Map.entry('ㇺ', 'ム'),
+            Map.entry('ャ', 'ヤ'),
+            Map.entry('ュ', 'ユ'),
+            Map.entry('ョ', 'ヨ'),
+            Map.entry('ㇻ', 'ラ'),
+            Map.entry('ㇼ', 'リ'),
+            Map.entry('ㇽ', 'ル'),
+            Map.entry('ㇾ', 'レ'),
+            Map.entry('ㇿ', 'ロ'),
+            Map.entry('ヮ', 'ワ'));
+  }
+
+  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
+
+  public JapaneseKatakanaUppercaseFilter(TokenStream input) {
+    super(input);
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+    if (input.incrementToken()) {
+      String term = termAttr.toString();
+      // Small letter "ㇷ゚" is not single character, so it should be converted to "プ"