Kazuaki Hiraga created LUCENE-4056:
--------------------------------------

             Summary: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
                 Key: LUCENE-4056
                 URL: https://issues.apache.org/jira/browse/LUCENE-4056
             Project: Lucene - Java
          Issue Type: Improvement
          Components: modules/analysis
    Affects Versions: 3.6
         Environment: Solr 3.6
                      UniDic 1.3.12 for MeCab (unidic-mecab1312src.tar.gz)
            Reporter: Kazuaki Hiraga

I tried to build a UniDic dictionary to use with Kuromoji on Solr 3.6. I think UniDic is a better dictionary than the IPA dictionary, so Kuromoji for Lucene/Solr should support the UniDic dictionary as standalone Kuromoji does.

The following is my procedure: I modified build.xml under the lucene/contrib/analyzers/kuromoji directory and ran 'ant build-dict', which failed with the error below.

build-dict:
     [java] dictionary builder
     [java]
     [java] dictionary format: UNIDIC
     [java] input directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/build/contrib/analyzers/kuromoji/unidic-mecab1312src
     [java] output directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/contrib/analyzers/kuromoji/src/resources
     [java] input encoding: utf-8
     [java] normalize entries: false
     [java]
     [java] building tokeninfo dict...
     [java]   parse...
     [java]   sort...
     [java] Exception in thread "main" java.lang.AssertionError
     [java]   encode...
     [java]     at org.apache.lucene.analysis.ja.util.BinaryDictionaryWriter.put(BinaryDictionaryWriter.java:113)
     [java]     at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:141)
     [java]     at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
     [java]     at org.apache.lucene.analysis.ja.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
     [java]     at org.apache.lucene.analysis.ja.util.DictionaryBuilder.main(DictionaryBuilder.java:82)

And the diff of build.xml:

===================================================================
--- build.xml   (revision 1338023)
+++ build.xml   (working copy)
@@ -28,19 +28,31 @@
   <property name="maven.dist.dir" location="../../../dist/maven" />
   <!-- default configuration: uses mecab-ipadic -->
+  <!--
   <property name="ipadic.version" value="mecab-ipadic-2.7.0-20070801" />
   <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
   <property name="dict.url" value="http://mecab.googlecode.com/files/${dict.src.file}"/>
+  -->
   <!-- alternative configuration: uses mecab-naist-jdic
   <property name="ipadic.version" value="mecab-naist-jdic-0.6.3b-20111013" />
   <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
   <property name="dict.url" value="http://sourceforge.jp/frs/redir.php?m=iij&f=/naist-jdic/53500/${dict.src.file}"/>
   -->
-
+
+  <!-- alternative configuration: uses UniDic -->
+  <property name="ipadic.version" value="unidic-mecab1312src" />
+  <property name="dict.src.file" value="unidic-mecab1312src.tar.gz" />
+  <property name="dict.loc.dir" value="/home/kazu/Work/src/nlp/unidic/_archive"/>
+  <property name="dict.src.dir" value="${build.dir}/${ipadic.version}" />
+  <!--
   <property name="dict.encoding" value="euc-jp"/>
   <property name="dict.format" value="ipadic"/>
+  -->
+  <property name="dict.encoding" value="utf-8"/>
+  <property name="dict.format" value="unidic"/>
+  <property name="dict.normalize" value="false"/>
   <property name="dict.target.dir" location="./src/resources"/>
@@ -58,7 +70,8 @@
   <target name="compile-core" depends="jar-analyzers-common, common.compile-core" />
   <target name="download-dict" unless="dict.available">
-    <get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/>
+    <!-- get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/ -->
+    <copy file="${dict.loc.dir}/${dict.src.file}" tofile="${build.dir}/${dict.src.file}"/>
     <gunzip src="${build.dir}/${dict.src.file}"/>
     <untar src="${build.dir}/${ipadic.version}.tar" dest="${build.dir}"/>
   </target>
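
The stack trace shows an AssertionError raised in BinaryDictionaryWriter.put, but the report does not say which condition failed. As a rough way to narrow it down, the sketch below scans MeCab-format CSV entries (surface, left context ID, right context ID, cost, then features) and flags entries whose context IDs look unusual. The class name ContextIdCheck, the two checks (left ID equal to right ID, ID below a limit) and the idLimit parameter are assumptions made for illustration; they are not taken from the Lucene source and may not match the assertion that actually fired.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; not part of Lucene or Kuromoji.
public class ContextIdCheck {

    /**
     * Splits one CSV line on commas, treating commas inside double quotes
     * as data (a surface form can itself be a comma). Quote characters
     * are dropped from the returned fields.
     */
    public static List<String> splitCsv(String line) {
        List<String> fields = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    /**
     * Returns null when the entry passes both assumed checks, otherwise a
     * short description of the first problem found.
     */
    public static String describeProblem(String line, int idLimit) {
        List<String> f = splitCsv(line);
        if (f.size() < 4) {
            return "fewer than 4 columns";
        }
        int leftId;
        int rightId;
        try {
            leftId = Integer.parseInt(f.get(1));
            rightId = Integer.parseInt(f.get(2));
        } catch (NumberFormatException e) {
            return "non-numeric context ID";
        }
        if (leftId != rightId) {
            return "leftId " + leftId + " != rightId " + rightId;
        }
        if (leftId >= idLimit) {
            return "context ID " + leftId + " >= limit " + idLimit;
        }
        return null;
    }

    /** Usage: java ContextIdCheck <idLimit> <file.csv>... (files read as UTF-8). */
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java ContextIdCheck <idLimit> <file.csv>...");
            return;
        }
        int idLimit = Integer.parseInt(args[0]);
        for (int i = 1; i < args.length; i++) {
            try (BufferedReader reader =
                     Files.newBufferedReader(Paths.get(args[i]), StandardCharsets.UTF_8)) {
                String line;
                int lineNo = 0;
                while ((line = reader.readLine()) != null) {
                    lineNo++;
                    String problem = describeProblem(line, idLimit);
                    if (problem != null) {
                        System.out.println(args[i] + ":" + lineNo + ": " + problem);
                    }
                }
            }
        }
    }
}
```

Running it over the CSV files in the unpacked unidic-mecab1312src directory with a candidate limit would list any entries that break either assumed condition; an empty result would suggest the failing assertion checks something else.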