[jira] [Updated] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-4956: Attachment: eval.patch I did a very quick and dirty evaluation of various analyzers (short queries only) with the HANTEC-2 test collection (http://ir.kaist.ac.kr/anthology/2000.10-%EA%B9%80%EC%A7%80%EC%98%81.pdf) I compared 4 different analyzers for index time, size, and mean average precision for the L2 relevance set: * StandardAnalyzer (whitespace on hangul / unigrams on hanja) * CJKAnalyzer (bigram technique) * KoreanAnalyzer * MecabAnalyzer via JNI (https://github.com/bibreen/mecab-ko-lucene-analyzer) For each one, I used 3 different ranking strategies: DefaultSimilarity, BM25Similarity, and DFR GL2, no parameter tuning of any sort. ||Analyzer||Index Time||Index Size||MAP(TFIDF)||MAP(BM25)||MAP(GL2)|| |Standard|31s|128MB|.0959|.1018|.1028| |CJK|30s|162MB|.1746|.1894|.1910| |Korean|195s|125MB|.2055|.2096|.2058| |Mecab|138s|147MB|.1877|.1960|.1928| Note that on the first try, I was unable to actually index the entire collection with KoreanAnalyzer, so I had to hack the filter to prevent this: {noformat} xception in thread main java.lang.StringIndexOutOfBoundsException: String index out of range: 4 at java.lang.String.substring(String.java:1907) at org.apache.lucene.analysis.ko.KoreanFilter.analysisChinese(KoreanFilter.java:405) at org.apache.lucene.analysis.ko.KoreanFilter.incrementToken(KoreanFilter.java:147) at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54) at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:54) at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174) {noformat} See the patch for more information (you can also download the data from http://www.kristalinfo.com/TestCollections/ and set some constants and run it yourself). Don't read too far into it, this was really quick and dirty and might somehow be biased. For example, there are several charset issues in the test collection... But it looks like the analyzer here is effective. the korean analyzer that has a korean morphological analyzer and dictionaries - Key: LUCENE-4956 URL: https://issues.apache.org/jira/browse/LUCENE-4956 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Affects Versions: 4.2 Reporter: SooMyung Lee Assignee: Christian Moen Labels: newbie Attachments: eval.patch, kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, LUCENE-4956.patch Korean language has specific characteristic. When developing search service with lucene solr in korean, there are some problems in searching and indexing. The korean analyer solved the problems with a korean morphological anlyzer. It consists of a korean morphological analyzer, dictionaries, a korean tokenizer and a korean filter. The korean anlyzer is made for lucene and solr. If you develop a search service with lucene in korean, It is the best idea to choose the korean analyzer. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SooMyung Lee updated LUCENE-4956: - Attachment: lucene-4956.patch Hi Christian, I have sync up and I made some modification. I'm attaching the patch. the korean analyzer that has a korean morphological analyzer and dictionaries - Key: LUCENE-4956 URL: https://issues.apache.org/jira/browse/LUCENE-4956 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Affects Versions: 4.2 Reporter: SooMyung Lee Assignee: Christian Moen Labels: newbie Attachments: kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, LUCENE-4956.patch Korean language has specific characteristic. When developing search service with lucene solr in korean, there are some problems in searching and indexing. The korean analyer solved the problems with a korean morphological anlyzer. It consists of a korean morphological analyzer, dictionaries, a korean tokenizer and a korean filter. The korean anlyzer is made for lucene and solr. If you develop a search service with lucene in korean, It is the best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Moen updated LUCENE-4956: --- Attachment: LUCENE-4956.patch the korean analyzer that has a korean morphological analyzer and dictionaries - Key: LUCENE-4956 URL: https://issues.apache.org/jira/browse/LUCENE-4956 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Affects Versions: 4.2 Reporter: SooMyung Lee Assignee: Christian Moen Labels: newbie Attachments: kr.analyzer.4x.tar, lucene4956.patch, LUCENE-4956.patch Korean language has specific characteristic. When developing search service with lucene solr in korean, there are some problems in searching and indexing. The korean analyer solved the problems with a korean morphological anlyzer. It consists of a korean morphological analyzer, dictionaries, a korean tokenizer and a korean filter. The korean anlyzer is made for lucene and solr. If you develop a search service with lucene in korean, It is the best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SooMyung Lee updated LUCENE-4956: - Attachment: lucene4956.patch I have changed the source code relating to keyword extraction. I have removed the properties relating to keyword extraction and changed the keyword extraction logic. I've also added a test case that describe how for the korean analyzer work. the korean analyzer that has a korean morphological analyzer and dictionaries - Key: LUCENE-4956 URL: https://issues.apache.org/jira/browse/LUCENE-4956 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Affects Versions: 4.2 Reporter: SooMyung Lee Assignee: Christian Moen Labels: newbie Attachments: kr.analyzer.4x.tar, lucene4956.patch Korean language has specific characteristic. When developing search service with lucene solr in korean, there are some problems in searching and indexing. The korean analyer solved the problems with a korean morphological anlyzer. It consists of a korean morphological analyzer, dictionaries, a korean tokenizer and a korean filter. The korean anlyzer is made for lucene and solr. If you develop a search service with lucene in korean, It is the best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SooMyung Lee updated LUCENE-4956: - Description: Korean language has specific characteristic. When developing search service with lucene solr in korean, there are some problems in searching and indexing. The korean analyer solved the problems with a korean morphological anlyzer. It consists of a korean morphological analyzer, dictionaries, a korean tokenizer and a korean filter. The korean anlyzer is made for lucene and solr. If you develop a search service with lucene in korean, It is the best idea to choose the korean analyzer. (was: Korean language has specific characteristic. When developing search service with lucene solr in korean, there are some problems in searching and indexing. The korean analyer solved the problems with a korean morphological anlyzer. It consists of a korean morphological analyzer, dictionaries, a korean tokenizer and a korean filter. The korean anlyzer is made for lucene and solr. If you develop a search service with lucene in korean, It is best idea to choose the korean analyzer.) the korean analyzer that has a korean morphological analyzer and dictionaries - Key: LUCENE-4956 URL: https://issues.apache.org/jira/browse/LUCENE-4956 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Affects Versions: 4.2 Reporter: SooMyung Lee Labels: newbie Korean language has specific characteristic. When developing search service with lucene solr in korean, there are some problems in searching and indexing. The korean analyer solved the problems with a korean morphological anlyzer. It consists of a korean morphological analyzer, dictionaries, a korean tokenizer and a korean filter. The korean anlyzer is made for lucene and solr. If you develop a search service with lucene in korean, It is the best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SooMyung Lee updated LUCENE-4956: - Attachment: kr.analyzer.4x.tar 8edffacb15b3964f25054c82c0d4ea92 the korean analyzer that has a korean morphological analyzer and dictionaries - Key: LUCENE-4956 URL: https://issues.apache.org/jira/browse/LUCENE-4956 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Affects Versions: 4.2 Reporter: SooMyung Lee Labels: newbie Attachments: kr.analyzer.4x.tar Korean language has specific characteristic. When developing search service with lucene solr in korean, there are some problems in searching and indexing. The korean analyer solved the problems with a korean morphological anlyzer. It consists of a korean morphological analyzer, dictionaries, a korean tokenizer and a korean filter. The korean anlyzer is made for lucene and solr. If you develop a search service with lucene in korean, It is the best idea to choose the korean analyzer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org