[jira] [Updated] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-10-17 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-4956:


Attachment: eval.patch

I did a very quick and dirty evaluation of various analyzers (short queries 
only) with the HANTEC-2 test collection 
(http://ir.kaist.ac.kr/anthology/2000.10-%EA%B9%80%EC%A7%80%EC%98%81.pdf)

I compared 4 different analyzers for index time, size, and mean average 
precision for the L2 relevance set: 
* StandardAnalyzer (whitespace on hangul / unigrams on hanja)
* CJKAnalyzer (bigram technique)
* KoreanAnalyzer
* MecabAnalyzer via JNI (https://github.com/bibreen/mecab-ko-lucene-analyzer)

For each one, I used 3 different ranking strategies: DefaultSimilarity, 
BM25Similarity, and DFR GL2, no parameter tuning of any sort.

||Analyzer||Index Time||Index Size||MAP(TFIDF)||MAP(BM25)||MAP(GL2)||
|Standard|31s|128MB|.0959|.1018|.1028|
|CJK|30s|162MB|.1746|.1894|.1910|
|Korean|195s|125MB|.2055|.2096|.2058|
|Mecab|138s|147MB|.1877|.1960|.1928|

Note that on the first try, I was unable to actually index the entire 
collection with KoreanAnalyzer, so I had to hack the filter to prevent this:
{noformat}
xception in thread main java.lang.StringIndexOutOfBoundsException: String 
index out of range: 4
at java.lang.String.substring(String.java:1907)
at 
org.apache.lucene.analysis.ko.KoreanFilter.analysisChinese(KoreanFilter.java:405)
at 
org.apache.lucene.analysis.ko.KoreanFilter.incrementToken(KoreanFilter.java:147)
at 
org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
at 
org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:54)
at 
org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
{noformat}

See the patch for more information (you can also download the data from 
http://www.kristalinfo.com/TestCollections/ and set some constants and run it 
yourself).

Don't read too far into it, this was really quick and dirty and might somehow 
be biased. For example, there are several charset issues in the test 
collection... But it looks like the analyzer here is effective.


 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: eval.patch, kr.analyzer.4x.tar, lucene-4956.patch, 
 lucene4956.patch, LUCENE-4956.patch


 Korean language has specific characteristic. When developing search service 
 with lucene  solr in korean, there are some problems in searching and 
 indexing. The korean analyer solved the problems with a korean morphological 
 anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
 korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
 and solr. If you develop a search service with lucene in korean, It is the 
 best idea to choose the korean analyzer.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-09-10 Thread SooMyung Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SooMyung Lee updated LUCENE-4956:
-

Attachment: lucene-4956.patch

Hi Christian,

I have sync up and I made some modification.
I'm attaching the patch.

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, 
 LUCENE-4956.patch


 Korean language has specific characteristic. When developing search service 
 with lucene  solr in korean, there are some problems in searching and 
 indexing. The korean analyer solved the problems with a korean morphological 
 anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
 korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
 and solr. If you develop a search service with lucene in korean, It is the 
 best idea to choose the korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-08-14 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-4956:
---

Attachment: LUCENE-4956.patch

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar, lucene4956.patch, LUCENE-4956.patch


 Korean language has specific characteristic. When developing search service 
 with lucene  solr in korean, there are some problems in searching and 
 indexing. The korean analyer solved the problems with a korean morphological 
 anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
 korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
 and solr. If you develop a search service with lucene in korean, It is the 
 best idea to choose the korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-18 Thread SooMyung Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SooMyung Lee updated LUCENE-4956:
-

Attachment: lucene4956.patch

I have changed the source code relating to keyword extraction. I have removed 
the properties relating to keyword extraction and changed the keyword 
extraction logic. I've also added a test case that describe how for the korean 
analyzer work.

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar, lucene4956.patch


 Korean language has specific characteristic. When developing search service 
 with lucene  solr in korean, there are some problems in searching and 
 indexing. The korean analyer solved the problems with a korean morphological 
 anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
 korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
 and solr. If you develop a search service with lucene in korean, It is the 
 best idea to choose the korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-04-24 Thread SooMyung Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SooMyung Lee updated LUCENE-4956:
-

Description: Korean language has specific characteristic. When developing 
search service with lucene  solr in korean, there are some problems in 
searching and indexing. The korean analyer solved the problems with a korean 
morphological anlyzer. It consists of a korean morphological analyzer, 
dictionaries, a korean tokenizer and a korean filter. The korean anlyzer is 
made for lucene and solr. If you develop a search service with lucene in 
korean, It is the best idea to choose the korean analyzer.  (was: Korean 
language has specific characteristic. When developing search service with 
lucene  solr in korean, there are some problems in searching and indexing. The 
korean analyer solved the problems with a korean morphological anlyzer. It 
consists of a korean morphological analyzer, dictionaries, a korean tokenizer 
and a korean filter. The korean anlyzer is made for lucene and solr. If you 
develop a search service with lucene in korean, It is best idea to choose the 
korean analyzer.)

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
  Labels: newbie

 Korean language has specific characteristic. When developing search service 
 with lucene  solr in korean, there are some problems in searching and 
 indexing. The korean analyer solved the problems with a korean morphological 
 anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
 korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
 and solr. If you develop a search service with lucene in korean, It is the 
 best idea to choose the korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-04-24 Thread SooMyung Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SooMyung Lee updated LUCENE-4956:
-

Attachment: kr.analyzer.4x.tar

8edffacb15b3964f25054c82c0d4ea92

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
  Labels: newbie
 Attachments: kr.analyzer.4x.tar


 Korean language has specific characteristic. When developing search service 
 with lucene  solr in korean, there are some problems in searching and 
 indexing. The korean analyer solved the problems with a korean morphological 
 anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
 korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
 and solr. If you develop a search service with lucene in korean, It is the 
 best idea to choose the korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org