[jira] [Created] (LUCENE-8231) Godori, a Korean analyzer based on mecab-ko-dic

Jim Ferenczi (JIRA) Wed, 28 Mar 2018 14:22:40 -0700

Jim Ferenczi created LUCENE-8231:
------------------------------------

             Summary: Godori, a Korean analyzer based on mecab-ko-dic
                 Key: LUCENE-8231
                 URL: https://issues.apache.org/jira/browse/LUCENE-8231
             Project: Lucene - Core
          Issue Type: New Feature
            Reporter: Jim Ferenczi



There is a dictionary for Korean, similar to IPADIC, called mecab-ko-dic. It is
available under an Apache license here:
https://bitbucket.org/eunjeon/mecab-ko-dic

This dictionary was built with MeCab and defines a feature format adapted for
the Korean language.
Since the Kuromoji tokenizer uses the same format for the morphological
analysis (left cost + right cost + word cost), I tried to adapt the module to
handle Korean with mecab-ko-dic. I started with a POC that copies the
Kuromoji module and adapts it for mecab-ko-dic.
I used the same classes to build and read the dictionary, but I had to make some
modifications to handle the differences from IPADIC and Japanese.
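
As a rough illustration of the cost model mentioned above: in MeCab-style
dictionaries each entry carries a left context id, a right context id and a
word cost, and a connection matrix gives the cost of joining two adjacent
entries. The tokenizer then looks for the cheapest path through the lattice of
candidate entries. The sketch below is a toy version of that search, not code
from the patch; the dictionary entries, ids, costs and the -1 sentence-boundary
id are all made up for the example.

```python
from collections import defaultdict

# Hypothetical toy dictionary: surface -> list of (left_id, right_id, word_cost).
# Real mecab-ko-dic entries carry many more features; these values are invented.
DICTIONARY = {
    "a": [(0, 0, 5)],
    "b": [(1, 1, 5)],
    "ab": [(2, 2, 7)],
}

# Hypothetical connection costs keyed by (right_id of previous, left_id of next).
# Id -1 is used here as a stand-in for the sentence boundary; missing pairs cost 0.
CONNECTION = defaultdict(int, {(0, 1): 4, (-1, 2): 1})


def best_path(text):
    """Return (total_cost, tokens) for the cheapest segmentation of text."""
    # best[i] maps a right_id to the cheapest (cost, tokens) path ending at offset i.
    best = [dict() for _ in range(len(text) + 1)]
    best[0][-1] = (0, [])
    for start in range(len(text)):
        if not best[start]:
            continue  # no path reaches this offset
        for end in range(start + 1, len(text) + 1):
            surface = text[start:end]
            for left_id, right_id, word_cost in DICTIONARY.get(surface, []):
                for prev_right, (prev_cost, prev_tokens) in best[start].items():
                    cost = prev_cost + CONNECTION[(prev_right, left_id)] + word_cost
                    current = best[end].get(right_id)
                    if current is None or cost < current[0]:
                        best[end][right_id] = (cost, prev_tokens + [surface])
    if not best[len(text)]:
        raise ValueError("no segmentation found for %r" % text)
    # Add the connection to the end-of-sentence marker and keep the cheapest path.
    finals = (
        (cost + CONNECTION[(right_id, -1)], tokens)
        for right_id, (cost, tokens) in best[len(text)].items()
    )
    return min(finals, key=lambda item: item[0])
```

With these invented costs, `best_path("ab")` prefers the single entry "ab"
(cost 8) over "a" followed by "b" (cost 14).
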
The resulting binary dictionary takes 28MB on disk. It is bigger than IPADIC,
mainly because the source is bigger and contains a lot of compound and
inflected terms, each of which defines a group of terms and the segmentation
that can be applied.
I attached the patch that contains this new Korean module, called godori (a
popular card game in Korea). It is an adaptation of the Kuromoji module, so
currently the two modules don't share any code. I wanted to validate the
approach first and check the relevancy of the results. I don't speak Korean, so
I used the relevancy tests that were added for another Korean tokenizer
(https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output
against mecab-ko, which is the official fork of MeCab for use with mecab-ko-dic.
I had to simplify the JapaneseTokenizer: my version removes the nBest output
and the decomposition of overly long tokens. I also modified the handling of
whitespace since it is important in Korean. Whitespace that appears before a
term is attached to that term, and this information is used to compute a
penalty based on the part of speech of the token. The penalty cost is a feature
added to mecab-ko to handle morphemes that should not appear after a morpheme
and is described on the mecab-ko page:
https://bitbucket.org/eunjeon/mecab-ko
Ignoring whitespace is also more in line with the official MeCab library, which
attaches the whitespace to the term that follows.
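
The whitespace handling described above can be sketched roughly as follows.
This is an illustration only, not the patch's implementation: the function
names, the penalty table, the penalty values and the choice of tags (JKS and
EF, used here as examples of a particle and an ending) are assumptions.

```python
# Hypothetical table: part-of-speech tag -> extra cost when the candidate
# token is preceded by whitespace. Tags and values are illustrative only.
LEFT_SPACE_PENALTY = {"JKS": 3000, "EF": 3000}


def split_with_leading_spaces(text):
    """Return (leading_space_count, chunk) pairs.

    Whitespace is not emitted on its own; it is counted and attached to the
    chunk that follows it. Trailing whitespace is dropped.
    """
    result = []
    spaces = 0
    chunk = ""
    for ch in text:
        if ch.isspace():
            if chunk:
                result.append((spaces, chunk))
                chunk = ""
                spaces = 0
            spaces += 1
        else:
            chunk += ch
    if chunk:
        result.append((spaces, chunk))
    return result


def penalized_cost(word_cost, pos_tag, leading_spaces):
    """Add the POS-specific penalty only when whitespace precedes the token."""
    if leading_spaces > 0:
        return word_cost + LEFT_SPACE_PENALTY.get(pos_tag, 0)
    return word_cost
```

In this sketch a candidate tagged JKS becomes much more expensive right after
a space, which steers the best-path search away from segmentations that start
a word with such a morpheme.
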
I also added a decompounder filter that expands the compounds and inflected
forms defined in the dictionary, and a part-of-speech filter, similar to the
Japanese one, that removes the morphemes that are not useful for relevance
(suffix, prefix, interjection, ...). These filters don't play well with the
tokenizer if it can output multiple paths (nBest output for instance), so for
simplicity I removed this ability and the Korean tokenizer only outputs the
best path.
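
A minimal sketch of the two filters on a single best path, assuming a
hypothetical Token type with an optional list of parts taken from the
dictionary entry. The tag names, the stop-tag set and the decision to drop the
original compound after expansion are assumptions made for the example, not a
description of the patch.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    surface: str
    pos: str
    # For compounds and inflected forms: the (surface, pos) parts that the
    # dictionary entry defines. Empty for simple tokens.
    parts: list = field(default_factory=list)


# Hypothetical set of tags considered not useful for relevance.
STOP_TAGS = {"JKS", "EF", "IC"}


def decompound(tokens):
    """Replace each token that has parts by its parts (assumed behavior)."""
    for token in tokens:
        if token.parts:
            for surface, pos in token.parts:
                yield Token(surface, pos)
        else:
            yield token


def pos_filter(tokens, stop_tags=STOP_TAGS):
    """Drop tokens whose part of speech is in stop_tags."""
    return (token for token in tokens if token.pos not in stop_tags)


def analyze(tokens):
    """Chain the filters the way a token-filter pipeline would."""
    return [token.surface for token in pos_filter(decompound(tokens))]
```

For example, `analyze([Token("AB", "COMPOUND", [("A", "NNG"), ("B", "NNG")]),
Token("x", "JKS")])` yields the surfaces "A" and "B". Both filters assume a
linear token stream, which is why a tokenizer that emits several alternative
paths would complicate them.
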
I compared the results with mecab-ko to confirm that the analyzer is working
and ran the relevancy test defined in HantecRel.java, which is included in the
patch (written by Robert for another Korean analyzer). Here are the results:

||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
|Standard|35s|131MB|.007|.1044|.1053|
|CJK|36s|164MB|.1418|.1924|.1916|
|Korean|212s|90MB|.1628|.2094|.2078|
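
For readers unfamiliar with the MAP columns, the sketch below shows the usual
definition of mean average precision. It is a generic illustration with
hypothetical function names and may differ in details from what the relevancy
test in the patch computes.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k over the ranks k where a relevant doc appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs):
    """runs is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

A query that ranks its two relevant documents first and third gets an average
precision of (1/1 + 2/3) / 2; MAP is the mean of that value over all queries.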

I find the results very promising, so I plan to continue working on this
project. I have started to extract the parts of the code that could be shared
with the Kuromoji module, but I wanted to share the status and this POC first
to confirm that this approach is viable. Using the same model as the Japanese
analyzer has multiple advantages: we don't have a Korean analyzer at the
moment ;), the resulting dictionary is small compared to other libraries that
use mecab-ko-dic (the FST takes only 5.4MB), and the Tokenizer prunes the
lattice on the fly to select the best path efficiently.
The dictionary can be built directly from the godori module with the following
command:
ant regenerate
You need to create the resource directory first, since the dictionary is not
included in the patch:
mkdir lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict
I've also added some minimal tests in the module to play with the analysis.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
