Namgyu Kim created LUCENE-8553:
----------------------------------

             Summary: New KoreanDecomposeFilter for KoreanAnalyzer(Nori)
                 Key: LUCENE-8553
                 URL: https://issues.apache.org/jira/browse/LUCENE-8553
             Project: Lucene - Core
          Issue Type: New Feature
          Components: modules/analysis
            Reporter: Namgyu Kim


This is a patch that adds KoreanDecomposeFilter.

This filter decomposes Hangul syllables into their component letters.
(ex: 한글 -> ㅎㄱ or ㅎㅏㄴㄱㅡㄹ)

Hangul is typed differently from alphabetic scripts such as English.

If you want to type "apple" in English,
   you press the keys in the order {color:#FF0000}a -> p -> p -> l -> e{color}.

However, if you want to type "한글" (Hangul) in Korean,
   you have to press the keys in the order {color:#FF0000}ㅎ -> ㅏ -> ㄴ -> ㄱ -> ㅡ -> ㄹ{color}
   (because the keyboard has one key per letter, not one per syllable).

This means that a spell checker that compares whole Hangul syllables can be less accurate than one that compares the letters actually typed.
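
One way to see the spell-check problem is to compare edit distances at the syllable level and at the letter level. The sketch below is only an illustration, not code from the patch: the class and method names are hypothetical, and the typo (dropping one ㅇ keystroke while typing "있어", which the input method then composes as "이써") is an assumed example. The decomposition relies on the standard Unicode layout of precomposed Hangul syllables starting at U+AC00.

```java
// Illustrative sketch only; names are hypothetical and not taken from the patch.
class JamoEditDistanceSketch {
    // Hangul Compatibility Jamo, listed in Unicode composition order.
    static final String CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ";
    static final String JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ";
    static final String JONGSEONG = "ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ";

    // Expand each precomposed syllable (U+AC00..U+D7A3) into its letters;
    // every other character is copied unchanged.
    static String toJamo(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c < 0xAC00 || c > 0xD7A3) {
                sb.append(c);
                continue;
            }
            int s = c - 0xAC00;
            sb.append(CHOSEONG.charAt(s / 588));          // 588 = 21 medials * 28 final slots
            sb.append(JUNGSEONG.charAt((s % 588) / 28));
            int t = s % 28;                                // 0 means "no final consonant"
            if (t != 0) {
                sb.append(JONGSEONG.charAt(t - 1));
            }
        }
        return sb.toString();
    }

    // Plain Levenshtein distance over Java chars, keeping two rows of the table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String intended = "있어";
        String typed = "이써"; // assumed typo: one ㅇ keystroke was dropped
        System.out.println("syllables: " + levenshtein(intended, typed));
        System.out.println("jamo:      " + levenshtein(toJamo(intended), toJamo(typed)));
    }
}
```

At the syllable level both characters differ, so the two words look completely unrelated (distance 2 out of 2), while at the letter level they are a single deletion apart (distance 1 out of 5).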

 

A Hangul syllable consists of up to three elements: *"Choseong"* (the initial consonant), 
*"Jungseong"* (the medial vowel), and *"Jongseong"* (the optional final consonant).

These elements are collectively called *"Jamo"*.

Take the Korean word "된장찌개" (soybean paste stew):
its *"Choseong"* are {color:#FF0000}"ㄷ, ㅈ, ㅉ, ㄱ"{color},
its *"Jungseong"* are {color:#FF0000}"ㅚ, ㅏ, ㅣ, ㅐ"{color},
its *"Jongseong"* are {color:#FF0000}"ㄴ, ㅇ"{color}.
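
The three roles can be recovered from a precomposed syllable with integer arithmetic, because Unicode encodes each syllable as 0xAC00 + (choseong * 21 + jungseong) * 28 + jongseong. The following is a minimal sketch of that arithmetic, assuming precomposed input; the class and method names are hypothetical and it is not the implementation in the patch.

```java
// Illustrative sketch only; names are hypothetical and not taken from the patch.
class JamoSplitSketch {
    // Hangul Compatibility Jamo, listed in Unicode composition order.
    static final String CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ";                   // 19 initials
    static final String JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ";               // 21 medials
    static final String JONGSEONG = "ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ";   // 27 finals

    /**
     * Collects the Choseong, Jungseong and Jongseong of every syllable in the word.
     * Returns a 3-element array: {choseong, jungseong, jongseong}.
     * Characters outside U+AC00..U+D7A3 are skipped.
     */
    static String[] split(String word) {
        StringBuilder cho = new StringBuilder();
        StringBuilder jung = new StringBuilder();
        StringBuilder jong = new StringBuilder();
        for (char c : word.toCharArray()) {
            if (c < 0xAC00 || c > 0xD7A3) {
                continue;
            }
            int s = c - 0xAC00;
            cho.append(CHOSEONG.charAt(s / (21 * 28)));
            jung.append(JUNGSEONG.charAt((s % (21 * 28)) / 28));
            int t = s % 28; // 0 means the syllable has no final consonant
            if (t != 0) {
                jong.append(JONGSEONG.charAt(t - 1));
            }
        }
        return new String[] { cho.toString(), jung.toString(), jong.toString() };
    }

    public static void main(String[] args) {
        String[] parts = split("된장찌개");
        System.out.println("Choseong:  " + parts[0]);
        System.out.println("Jungseong: " + parts[1]);
        System.out.println("Jongseong: " + parts[2]);
    }
}
```

For "된장찌개" this yields the same three groups listed above; "찌" and "개" contribute nothing to the Jongseong group because they have no final consonant.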

Jamo separation is useful for spell checking, as explained above.

A "Choseong Filter" is also needed because many Koreans use 
*"Choseong Search"* (especially on mobile devices).
Typing "된장찌개" in full takes 10 keystrokes, which is quite a lot.
For that reason, I think it would be useful to provide a filter that lets the word be 
found by typing just "ㄷㅈㅉㄱ".
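
As a rough illustration of Choseong Search, the sketch below reduces each term to a key made of its initial consonants and matches a query by prefix. It is an assumption about how such a lookup could work in isolation, not the patch's code and not a Lucene API; the class name, method names, and the sample menu are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names are hypothetical and not taken from the patch.
class ChoseongSearchSketch {
    // Initial consonants in Unicode composition order (Hangul Compatibility Jamo).
    static final String CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ";

    // Replace every precomposed syllable by its initial consonant;
    // other characters (for example Jamo typed directly) are kept as they are.
    static String choseongKey(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c >= 0xAC00 && c <= 0xD7A3) {
                sb.append(CHOSEONG.charAt((c - 0xAC00) / 588)); // 588 = 21 * 28
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Return the terms whose Choseong key starts with the query.
    static List<String> search(List<String> terms, String query) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (choseongKey(term).startsWith(query)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> menu = Arrays.asList("된장찌개", "김치찌개", "된장국");
        System.out.println(choseongKey("된장찌개")); // key built from the four initials
        System.out.println(search(menu, "ㄷㅈ"));
    }
}
```

In an index, the key would typically be produced once at analysis time (which is what a token filter does) rather than recomputed for every query as in this sketch.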

Hangul also has *dual chars* (double consonants and compound vowels), such as
"ㄲ, ㄸ, ㅃ, ㅆ, ㅉ, ㅚ (ㅗ + ㅣ), ㅢ (ㅡ + ㅣ), ...".

For these reasons,
KoreanDecomposeFilter offers *5 options*.

ex) *된장찌개* => [된장], [찌개]

*1) ORIGIN*
[된장], [찌개]

*2) SINGLECHOSEONG*
[ㄷㅈ], [ㅉㄱ] 

*3) DUALCHOSEONG*
[ㄷㅈ], [ㅈㅈㄱ] 

*4) SINGLEJAMO*
[ㄷㅚㄴㅈㅏㅇ], [ㅉㅣㄱㅐ] 

*5) DUALJAMO*
[ㄷㅗㅣㄴㅈㅏㅇ], [ㅈㅈㅣㄱㅐ] 
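
The five options can be read as two independent choices: whether to keep only the Choseong or all Jamo, and whether dual chars stay single or are split. The sketch below reproduces the outputs listed above under that reading. It is not the patch's implementation: the class, the `decompose` method, and the split table are hypothetical, and splitting compound final consonants (for example ㄳ into ㄱ + ㅅ) is an assumption that the examples in this issue do not cover.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; not the code from the LUCENE-8553 patch.
class DecomposeModeSketch {
    enum Mode { ORIGIN, SINGLECHOSEONG, DUALCHOSEONG, SINGLEJAMO, DUALJAMO }

    // Hangul Compatibility Jamo, listed in Unicode composition order.
    static final String CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ";
    static final String JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ";
    static final String JONGSEONG = "ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ";

    // Assumed split rules: first char is the dual char, the rest is its expansion.
    static final Map<Character, String> SPLIT = new HashMap<>();
    static {
        String[] rules = {
            "ㄲㄱㄱ", "ㄸㄷㄷ", "ㅃㅂㅂ", "ㅆㅅㅅ", "ㅉㅈㅈ",                       // double consonants
            "ㅘㅗㅏ", "ㅙㅗㅐ", "ㅚㅗㅣ", "ㅝㅜㅓ", "ㅞㅜㅔ", "ㅟㅜㅣ", "ㅢㅡㅣ",   // compound vowels
            "ㄳㄱㅅ", "ㄵㄴㅈ", "ㄶㄴㅎ", "ㄺㄹㄱ", "ㄻㄹㅁ", "ㄼㄹㅂ",
            "ㄽㄹㅅ", "ㄾㄹㅌ", "ㄿㄹㅍ", "ㅀㄹㅎ", "ㅄㅂㅅ"                        // compound finals (assumption)
        };
        for (String r : rules) {
            SPLIT.put(r.charAt(0), r.substring(1));
        }
    }

    static String decompose(String term, Mode mode) {
        if (mode == Mode.ORIGIN) {
            return term;
        }
        boolean choseongOnly = mode == Mode.SINGLECHOSEONG || mode == Mode.DUALCHOSEONG;
        boolean splitDual = mode == Mode.DUALCHOSEONG || mode == Mode.DUALJAMO;
        StringBuilder out = new StringBuilder();
        for (char c : term.toCharArray()) {
            if (c < 0xAC00 || c > 0xD7A3) {
                out.append(c); // non-syllable characters are copied unchanged
                continue;
            }
            int s = c - 0xAC00;
            StringBuilder jamo = new StringBuilder();
            jamo.append(CHOSEONG.charAt(s / 588));
            if (!choseongOnly) {
                jamo.append(JUNGSEONG.charAt((s % 588) / 28));
                int t = s % 28;
                if (t != 0) {
                    jamo.append(JONGSEONG.charAt(t - 1));
                }
            }
            for (int i = 0; i < jamo.length(); i++) {
                char j = jamo.charAt(i);
                out.append(splitDual && SPLIT.containsKey(j) ? SPLIT.get(j) : String.valueOf(j));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (Mode m : Mode.values()) {
            System.out.println(m + ": [" + decompose("된장", m) + "], [" + decompose("찌개", m) + "]");
        }
    }
}
```

A real filter would apply the same transformation to each token's term text inside the analysis chain; the sketch works on plain strings to stay independent of any Lucene API.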

 


