Namgyu Kim created LUCENE-8553:
----------------------------------

             Summary: New KoreanDecomposeFilter for KoreanAnalyzer(Nori)
                 Key: LUCENE-8553
                 URL: https://issues.apache.org/jira/browse/LUCENE-8553
             Project: Lucene - Core
          Issue Type: New Feature
          Components: modules/analysis
            Reporter: Namgyu Kim

This is a patch for KoreanDecomposeFilter, a filter that decomposes Hangul syllables into their component letters (e.g. 한글 -> ㅎㄱ or ㅎㅏㄴㄱㅡㄹ).

Hangul input works differently from English input. To type "apple" in English, you type {color:#FF0000}a -> p -> p -> l -> e{color}. To type "한글" (Hangul) in Korean, however, you type {color:#FF0000}ㅎ -> ㅏ -> ㄴ -> ㄱ -> ㅡ -> ㄹ{color}, because the Korean keyboard has keys for individual letters rather than for whole syllables. This means that spell checking against fully composed Hangul syllables can be less accurate.

A Hangul syllable consists of up to three elements: *"Choseong"* (initial consonant), *"Jungseong"* (medial vowel), and *"Jongseong"* (final consonant). These elements are collectively called *"Jamo"*.

For the Korean word "된장찌개" (soybean paste stew):
*"Choseong"* is {color:#FF0000}"ㄷ, ㅈ, ㅉ, ㄱ"{color},
*"Jungseong"* is {color:#FF0000}"ㅚ, ㅏ, ㅣ, ㅐ"{color},
*"Jongseong"* is {color:#FF0000}"ㄴ, ㅇ"{color}.

Jamo separation is needed for the spell-check reason explained above. A "Choseong filter" is needed because many Koreans use *"Choseong Search"*, especially on mobile devices. Searching for "된장찌개" requires typing 10 jamo, which is quite a lot, so I think it would be useful to provide a filter that lets the same word be found by typing only "ㄷㅈㅉㄱ".

Hangul also has *dual chars* that are built from two simpler jamo, such as "ㄲ, ㄸ, ㅃ, ㅆ, ㅉ, ㅚ (ㅗ + ㅣ), ㅢ (ㅡ + ㅣ), ...".

For these reasons, KoreanDecomposeFilter offers *5 options*:

ex) *된장찌개* => [된장], [찌개]

*1) ORIGIN*
[된장], [찌개]

*2) SINGLECHOSEONG*
[ㄷㅈ], [ㅉㄱ]

*3) DUALCHOSEONG*
[ㄷㅈ], [ㅈㅈㄱ]

*4) SINGLEJAMO*
[ㄷㅚㄴㅈㅏㅇ], [ㅉㅣㄱㅐ]

*5) DUALJAMO*
[ㄷㅗㅣㄴㅈㅏㅇ], [ㅈㅈㅣㄱㅐ]
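
To make the five options concrete, here is a minimal standalone sketch of the decomposition arithmetic. It is not the code from the patch and does not use any Lucene API. The class name `HangulDecomposeSketch`, the `decompose` method, and the `Mode` enum are hypothetical names chosen for illustration. The sketch relies on the standard Unicode layout of precomposed Hangul syllables (U+AC00 to U+D7A3, 19 initials x 21 medials x 28 finals). It also assumes that non-Hangul characters pass through unchanged and that compound final consonants such as ㄳ are left unsplit; only the pairs ㅉ, ㅚ, and ㅢ in the split table are confirmed by the examples in this issue, the rest are an assumption.

```java
import java.util.Map;

// Hypothetical illustration only; not the KoreanDecomposeFilter from the patch.
class HangulDecomposeSketch {
    enum Mode { ORIGIN, SINGLECHOSEONG, DUALCHOSEONG, SINGLEJAMO, DUALJAMO }

    // Compatibility jamo in the order used by the Unicode syllable formula.
    private static final String CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ";
    private static final String JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ";
    // Index 0 means "no final consonant"; the placeholder space is never emitted.
    private static final String JONGSEONG = " ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ";

    // Assumed split table for the DUAL* modes (double consonants, compound vowels).
    private static final Map<Character, String> DUAL = Map.ofEntries(
        Map.entry('ㄲ', "ㄱㄱ"), Map.entry('ㄸ', "ㄷㄷ"), Map.entry('ㅃ', "ㅂㅂ"),
        Map.entry('ㅆ', "ㅅㅅ"), Map.entry('ㅉ', "ㅈㅈ"),
        Map.entry('ㅘ', "ㅗㅏ"), Map.entry('ㅙ', "ㅗㅐ"), Map.entry('ㅚ', "ㅗㅣ"),
        Map.entry('ㅝ', "ㅜㅓ"), Map.entry('ㅞ', "ㅜㅔ"), Map.entry('ㅟ', "ㅜㅣ"),
        Map.entry('ㅢ', "ㅡㅣ"));

    static String decompose(String term, Mode mode) {
        if (mode == Mode.ORIGIN) {
            return term;
        }
        boolean choseongOnly = mode == Mode.SINGLECHOSEONG || mode == Mode.DUALCHOSEONG;
        boolean split = mode == Mode.DUALCHOSEONG || mode == Mode.DUALJAMO;
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c < 0xAC00 || c > 0xD7A3) {
                out.append(c); // assumption: non-Hangul passes through unchanged
                continue;
            }
            int index = c - 0xAC00;
            append(out, CHOSEONG.charAt(index / (21 * 28)), split);
            if (choseongOnly) {
                continue;
            }
            append(out, JUNGSEONG.charAt((index % (21 * 28)) / 28), split);
            int jong = index % 28;
            if (jong != 0) {
                append(out, JONGSEONG.charAt(jong), split);
            }
        }
        return out.toString();
    }

    // Appends one jamo, expanding it into its parts when splitting is enabled.
    private static void append(StringBuilder out, char jamo, boolean split) {
        String parts = split ? DUAL.get(jamo) : null;
        if (parts != null) {
            out.append(parts);
        } else {
            out.append(jamo);
        }
    }
}
```

Under these assumptions, calling `HangulDecomposeSketch.decompose("찌개", HangulDecomposeSketch.Mode.DUALCHOSEONG)` yields ㅈㅈㄱ, and `decompose("된장", Mode.DUALJAMO)` yields ㄷㅗㅣㄴㅈㅏㅇ, matching the examples listed in the issue. A real filter would apply the same per-term logic to the token stream produced by the tokenizer.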