RE: Build failed in Hudson: Lucene-trunk #828

2009-05-15 Thread Uwe Schindler
This error is in TestStressIndexing2.testRandomIWReader. I checked locally;
the problem is not reproducible with the same random seed (so calling
r.setSeed() with the seed from the error log does not reproduce the bug). This
test seems to fail very often; maybe there is a real multi-thread
synchronization bug. It seems that the real-time reader returned from the
writer contains more document IDs than the conventional reader opened from the
directory.
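
For reference, a minimal sketch of seed-based replay (hypothetical code, not
the actual test): replay is only deterministic when every random decision
flows from the seeded Random, so an unseeded source of nondeterminism like
thread scheduling would explain why setSeed() fails to reproduce the failure.

{code}
import java.util.Random;

public class SeedReplay {
  public static void main(String[] args) {
    // Log the seed on every run so a failure can be replayed later.
    long seed = args.length > 0 ? Long.parseLong(args[0]) : new Random().nextLong();
    System.out.println("random seed: " + seed);
    Random r = new Random(seed);

    // Deterministic replay holds only if ALL randomness comes from r;
    // thread scheduling is an extra, unseeded source of nondeterminism,
    // which would explain why setSeed() does not reproduce this failure.
    for (int i = 0; i < 3; i++) {
      System.out.println(r.nextInt(100));
    }
  }
}
{code}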

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Apache Hudson Server [mailto:hud...@hudson.zones.apache.org]
> Sent: Friday, May 15, 2009 5:15 AM
> To: java-dev@lucene.apache.org
> Subject: Build failed in Hudson: Lucene-trunk #828
> 
> See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/828/changes
> 
> Changes:
> 
> [yonik] LUCENE-1596: MultiTermDocs speedup when set with
> MultiTermDocs.seek(MultiTermEnum)
> 
> [mikemccand] LUCENE-1629: set javadocs encoding to UTF-8
> 
> [uschindler] set eol to native
> 
> [mikemccand] LUCENE-1629: move CHANGES entry to contrib; add
> TestArabicAnalyzer
> 
> [mikemccand] LUCENE-1629: adding new contrib analyzer SmartChineseAnalyzer
> 
> [markrmiller] pendingOutput is a bit generic for a field in a large class
> - changed to pendingSegnOutput
> 
> --
> [...truncated 19281 lines...]
> [junit] Testsuite: org.apache.lucene.search.TestDateFilter
> [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.97 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestDateSort
> [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.045 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestDisjunctionMaxQuery
> [junit] Tests run: 10, Failures: 0, Errors: 0, Time elapsed: 1.619 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestDocBoost
> [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.903 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestDocIdSet
> [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.385 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestExplanations
> [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.836 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestExtendedFieldCache
> [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.801 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestFieldCacheRangeFilter
> [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 1.505 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestFieldCacheTermsFilter
> [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.992 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestFilteredQuery
> [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 1.404 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestFilteredSearch
> [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.936 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestFuzzyQuery
> [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 1.064 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestMatchAllDocsQuery
> [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.992 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestMultiPhraseQuery
> [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 1.045 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestMultiSearcher
> [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 1.226 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestMultiSearcherRanking
> [junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 1.413 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestMultiTermConstantScore
> [junit] Tests run: 12, Failures: 0, Errors: 0, Time elapsed: 6.815 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestMultiThreadTermVectors
> [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.738 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestNot
> [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.989 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestParallelMultiSearcher
> [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 1.273 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestPhrasePrefixQuery
> [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.962 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestPhraseQuery
> [junit] Tests run: 15, Failures: 0, Errors: 0, Time elapsed: 1.9 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestPositionIncrement

[jira] Reopened: (LUCENE-1596) optimize MultiTermEnum/MultiTermDocs

2009-05-15 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reopened LUCENE-1596:



I'm seeing this new AIOOBE when tracking down the intermittent failure
in TestStressIndexing2:

{code}
1) testRandomIWReader(org.apache.lucene.index.TestStressIndexing2)
java.lang.ArrayIndexOutOfBoundsException: 6
at 
org.apache.lucene.index.MultiSegmentReader$MultiTermDocs.next(MultiSegmentReader.java:672)
at 
org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:292)
at 
org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:250)
at 
org.apache.lucene.index.TestStressIndexing2.testRandomIWReader(TestStressIndexing2.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at junit.framework.TestCase.runTest(TestCase.java:168)
at org.apache.lucene.util.LuceneTestCase.runTest(LuceneTestCase.java:88)
at junit.framework.TestCase.runBare(TestCase.java:134)
at junit.framework.TestResult$1.protect(TestResult.java:110)
at junit.framework.TestResult.runProtected(TestResult.java:128)
at junit.framework.TestResult.run(TestResult.java:113)
at junit.framework.TestCase.run(TestCase.java:124)
at junit.framework.TestSuite.runTest(TestSuite.java:232)
at junit.framework.TestSuite.run(TestSuite.java:227)
at 
org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81)
at 
org.junit.internal.runners.CompositeRunner.runChildren(CompositeRunner.java:33)
at 
org.junit.internal.runners.CompositeRunner.run(CompositeRunner.java:28)
at org.junit.runner.JUnitCore.run(JUnitCore.java:130)
at org.junit.runner.JUnitCore.run(JUnitCore.java:109)
at org.junit.runner.JUnitCore.run(JUnitCore.java:100)
at org.junit.runner.JUnitCore.runMain(JUnitCore.java:81)
at org.junit.runner.JUnitCore.main(JUnitCore.java:44)
{code}

I think it's because this optimization isn't admissible in the case
where one calls MultiTermDocs.seek with a MultiTermEnum derived from a
different MultiSegmentReader.  I.e., I think there needs to be another
check that verifies that the MultiTermEnum passed to MultiTermDocs.seek
in fact shares the same MultiSegmentReader.

Before this optimization, this was OK (only the term was used from the
MultiTermEnum).


> optimize MultiTermEnum/MultiTermDocs
> 
>
> Key: LUCENE-1596
> URL: https://issues.apache.org/jira/browse/LUCENE-1596
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
> Fix For: 2.9
>
> Attachments: LUCENE-1596.patch
>
>
> Optimize MultiTermEnum and MultiTermDocs to avoid seeks on TermDocs that 
> don't match the term.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-15 Thread Mingfai Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709796#action_12709796
 ] 

Mingfai Ma commented on LUCENE-1629:


hi Xiaoping,

I'm interested in getting the Chinese analyzer to work for Traditional Chinese
(UTF-8/Big5).  Just wondering, does your coredict.mem come from ICTCLAS
(http://ictclas.org/Down_share.html)? If yes, is it 2009 or 2008?

ICTCLAS has a Traditional Chinese edition for its 2008 release, but the
distribution is not in .dct format. I wonder if there is a simple specification
for .dct, so I could find a way to convert ICTCLAS's lexical dictionary to
the .dct format to work with your library?

> contrib intelligent Analyzer for Chinese
> 
>
> Key: LUCENE-1629
> URL: https://issues.apache.org/jira/browse/LUCENE-1629
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.4.1
> Environment: for java 1.5 or higher, lucene 2.4.1
>Reporter: Xiaoping Gao
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: analysis-data.zip, bigramdict.mem, 
> build-resources-with-folder.patch, build-resources.patch, 
> build-resources.patch, coredict.mem, LUCENE-1629-encoding-fix.patch, 
> LUCENE-1629-java1.4.patch
>
>
> I wrote an Analyzer for Apache Lucene for analyzing sentences in the Chinese 
> language. It's called "imdict-chinese-analyzer"; the project on Google Code 
> is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人" (I am Chinese) should be tokenized as "我" (I), "是" (am), 
> "中国人" (Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence 
> properly, or there will be misunderstandings everywhere in the index 
> constructed by Lucene, and the accuracy of the search engine will be affected 
> seriously!
> Although there are two analyzer packages in the Apache repository which can 
> handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or 
> every two adjoining characters as a single word. This is obviously not true 
> in reality; this strategy will also increase the index size and hurt 
> performance badly.
> The algorithm of imdict-chinese-analyzer is based on the Hidden Markov Model 
> (HMM), so it can tokenize Chinese sentences in a really intelligent way. 
> Tokenization accuracy of this model is above 90% according to the paper 
> "HHMM-based Chinese Lexical Analyzer ICTCLAS", while other analyzers' is about 
> 60%.
> As imdict-chinese-analyzer is really fast and intelligent, I want to 
> contribute it to the Apache Lucene repository.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-15 Thread Mingfai Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709796#action_12709796
 ] 

Mingfai Ma edited comment on LUCENE-1629 at 5/15/09 3:23 AM:
-

hi Xiaoping,

I'm interested in getting the Chinese analyzer to work for Traditional Chinese
(UTF-8/Big5).  Just wondering, does your coredict.dct come from ICTCLAS
(http://ictclas.org/Down_share.html)? If yes, is it 2009 or 2008?

ICTCLAS has a Traditional Chinese edition for its 2008 release, but the
distribution is not in .dct format. I wonder if there is a simple specification
for .dct, so I could find a way to convert ICTCLAS's lexical dictionary to
the .dct format to work with your library?

  was (Author: mingfai):
hi Xiaoping,

I'm interested in getting the Chinese analyzer to work for Traditional Chinese
(UTF-8/Big5).  Just wondering, does your coredict.mem come from ICTCLAS
(http://ictclas.org/Down_share.html)? If yes, is it 2009 or 2008?

ICTCLAS has a Traditional Chinese edition for its 2008 release, but the
distribution is not in .dct format. I wonder if there is a simple specification
for .dct, so I could find a way to convert ICTCLAS's lexical dictionary to
the .dct format to work with your library?
  
> contrib intelligent Analyzer for Chinese
> 
>
> Key: LUCENE-1629
> URL: https://issues.apache.org/jira/browse/LUCENE-1629
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.4.1
> Environment: for java 1.5 or higher, lucene 2.4.1
>Reporter: Xiaoping Gao
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: analysis-data.zip, bigramdict.mem, 
> build-resources-with-folder.patch, build-resources.patch, 
> build-resources.patch, coredict.mem, LUCENE-1629-encoding-fix.patch, 
> LUCENE-1629-java1.4.patch
>
>
> I wrote an Analyzer for Apache Lucene for analyzing sentences in the Chinese 
> language. It's called "imdict-chinese-analyzer"; the project on Google Code 
> is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人" (I am Chinese) should be tokenized as "我" (I), "是" (am), 
> "中国人" (Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence 
> properly, or there will be misunderstandings everywhere in the index 
> constructed by Lucene, and the accuracy of the search engine will be affected 
> seriously!
> Although there are two analyzer packages in the Apache repository which can 
> handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or 
> every two adjoining characters as a single word. This is obviously not true 
> in reality; this strategy will also increase the index size and hurt 
> performance badly.
> The algorithm of imdict-chinese-analyzer is based on the Hidden Markov Model 
> (HMM), so it can tokenize Chinese sentences in a really intelligent way. 
> Tokenization accuracy of this model is above 90% according to the paper 
> "HHMM-based Chinese Lexical Analyzer ICTCLAS", while other analyzers' is about 
> 60%.
> As imdict-chinese-analyzer is really fast and intelligent, I want to 
> contribute it to the Apache Lucene repository.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1596) optimize MultiTermEnum/MultiTermDocs

2009-05-15 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709854#action_12709854
 ] 

Yonik Seeley commented on LUCENE-1596:
--

Gah... I forgot it was permissible (or at least not disallowed) to pass an Enum 
not derived from the same reader.
I'll fix.

> optimize MultiTermEnum/MultiTermDocs
> 
>
> Key: LUCENE-1596
> URL: https://issues.apache.org/jira/browse/LUCENE-1596
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
> Fix For: 2.9
>
> Attachments: LUCENE-1596.patch
>
>
> Optimize MultiTermEnum and MultiTermDocs to avoid seeks on TermDocs that 
> don't match the term.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-15 Thread Xiaoping Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709867#action_12709867
 ] 

Xiaoping Gao commented on LUCENE-1629:
--

Hello Mingfai!

coredict.mem is converted from coredict.dct, which comes from ICTCLAS 1.0,
neither 2008 nor 2009.
The author authorized me to release just the lexical dictionary from
ICTCLAS 1.0 under APLv2, but he didn't authorize the dictionaries of
ICTCLAS 2008~2009.
As far as I know, coredict.dct contains just GB2312 characters, so it cannot
support Big5.

I think we should find a proper Big5 dictionary first; then I will help
you convert it to a .dct file.


On May 15, 2009 6:20pm, "Mingfai Ma (JIRA)"  wrote:

> contrib intelligent Analyzer for Chinese
> 
>
> Key: LUCENE-1629
> URL: https://issues.apache.org/jira/browse/LUCENE-1629
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.4.1
> Environment: for java 1.5 or higher, lucene 2.4.1
>Reporter: Xiaoping Gao
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: analysis-data.zip, bigramdict.mem, 
> build-resources-with-folder.patch, build-resources.patch, 
> build-resources.patch, coredict.mem, LUCENE-1629-encoding-fix.patch, 
> LUCENE-1629-java1.4.patch
>
>
> I wrote an Analyzer for Apache Lucene for analyzing sentences in the Chinese 
> language. It's called "imdict-chinese-analyzer"; the project on Google Code 
> is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人" (I am Chinese) should be tokenized as "我" (I), "是" (am), 
> "中国人" (Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence 
> properly, or there will be misunderstandings everywhere in the index 
> constructed by Lucene, and the accuracy of the search engine will be affected 
> seriously!
> Although there are two analyzer packages in the Apache repository which can 
> handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or 
> every two adjoining characters as a single word. This is obviously not true 
> in reality; this strategy will also increase the index size and hurt 
> performance badly.
> The algorithm of imdict-chinese-analyzer is based on the Hidden Markov Model 
> (HMM), so it can tokenize Chinese sentences in a really intelligent way. 
> Tokenization accuracy of this model is above 90% according to the paper 
> "HHMM-based Chinese Lexical Analyzer ICTCLAS", while other analyzers' is about 
> 60%.
> As imdict-chinese-analyzer is really fast and intelligent, I want to 
> contribute it to the Apache Lucene repository.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709880#action_12709880
 ] 

Robert Muir commented on LUCENE-1629:
-

If you acquire the Big5 resources, do you think it would be possible to create
a single dictionary that works with both Simplified & Traditional?

(i.e. merge the Big5 resources with the GB resources)

The reason I say this is that the existing Chinese analyzers, although they
tokenize in a less intelligent way, are agnostic to Simplified/Traditional
issues...


> contrib intelligent Analyzer for Chinese
> 
>
> Key: LUCENE-1629
> URL: https://issues.apache.org/jira/browse/LUCENE-1629
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.4.1
> Environment: for java 1.5 or higher, lucene 2.4.1
>Reporter: Xiaoping Gao
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: analysis-data.zip, bigramdict.mem, 
> build-resources-with-folder.patch, build-resources.patch, 
> build-resources.patch, coredict.mem, LUCENE-1629-encoding-fix.patch, 
> LUCENE-1629-java1.4.patch
>
>
> I wrote an Analyzer for Apache Lucene for analyzing sentences in the Chinese 
> language. It's called "imdict-chinese-analyzer"; the project on Google Code 
> is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人" (I am Chinese) should be tokenized as "我" (I), "是" (am), 
> "中国人" (Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence 
> properly, or there will be misunderstandings everywhere in the index 
> constructed by Lucene, and the accuracy of the search engine will be affected 
> seriously!
> Although there are two analyzer packages in the Apache repository which can 
> handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or 
> every two adjoining characters as a single word. This is obviously not true 
> in reality; this strategy will also increase the index size and hurt 
> performance badly.
> The algorithm of imdict-chinese-analyzer is based on the Hidden Markov Model 
> (HMM), so it can tokenize Chinese sentences in a really intelligent way. 
> Tokenization accuracy of this model is above 90% according to the paper 
> "HHMM-based Chinese Lexical Analyzer ICTCLAS", while other analyzers' is about 
> 60%.
> As imdict-chinese-analyzer is really fast and intelligent, I want to 
> contribute it to the Apache Lucene repository.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709885#action_12709885
 ] 

Robert Muir commented on LUCENE-1629:
-

Another potential issue with Big5 I want to point out: many of the Big5
character sets, such as HKSCS, have characters that are mapped into regions of
Unicode outside of the BMP.

Just glancing at the code, some things will need to be modified for this to
work correctly with surrogate pairs; various functions that take char will need
to take a codepoint (int), etc.
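
To illustrate the char-vs-codepoint distinction (a generic Java sketch, not
the analyzer's code):

{code}
public class SurrogateDemo {
  public static void main(String[] args) {
    // Generic sketch, not the analyzer's code: a supplementary character
    // (outside the BMP) occupies two Java chars -- a surrogate pair.
    String s = "\uD85A\uDD1A";  // one codepoint, two chars

    // char-based loop: sees two surrogate halves, neither a real character.
    for (int i = 0; i < s.length(); i++) {
      System.out.println("char unit: " + (int) s.charAt(i));
    }

    // codepoint-based loop: sees the single actual character.
    for (int i = 0; i < s.length(); ) {
      int cp = s.codePointAt(i);
      System.out.println("codepoint: " + cp);
      i += Character.charCount(cp);
    }
  }
}
{code}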


> contrib intelligent Analyzer for Chinese
> 
>
> Key: LUCENE-1629
> URL: https://issues.apache.org/jira/browse/LUCENE-1629
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.4.1
> Environment: for java 1.5 or higher, lucene 2.4.1
>Reporter: Xiaoping Gao
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: analysis-data.zip, bigramdict.mem, 
> build-resources-with-folder.patch, build-resources.patch, 
> build-resources.patch, coredict.mem, LUCENE-1629-encoding-fix.patch, 
> LUCENE-1629-java1.4.patch
>
>
> I wrote an Analyzer for Apache Lucene for analyzing sentences in the Chinese 
> language. It's called "imdict-chinese-analyzer"; the project on Google Code 
> is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人" (I am Chinese) should be tokenized as "我" (I), "是" (am), 
> "中国人" (Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence 
> properly, or there will be misunderstandings everywhere in the index 
> constructed by Lucene, and the accuracy of the search engine will be affected 
> seriously!
> Although there are two analyzer packages in the Apache repository which can 
> handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or 
> every two adjoining characters as a single word. This is obviously not true 
> in reality; this strategy will also increase the index size and hurt 
> performance badly.
> The algorithm of imdict-chinese-analyzer is based on the Hidden Markov Model 
> (HMM), so it can tokenize Chinese sentences in a really intelligent way. 
> Tokenization accuracy of this model is above 90% according to the paper 
> "HHMM-based Chinese Lexical Analyzer ICTCLAS", while other analyzers' is about 
> 60%.
> As imdict-chinese-analyzer is really fast and intelligent, I want to 
> contribute it to the Apache Lucene repository.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1596) optimize MultiTermEnum/MultiTermDocs

2009-05-15 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley resolved LUCENE-1596.
--

Resolution: Fixed

I just committed the fix (since trunk was broken) and a test that failed w/o
the fix, but if anyone has a better idea of how to handle/fix this, we can
certainly still discuss. I just did the obvious: store the multi-reader in the
TermEnum and TermDocs instances and compare them in TermDocs.seek(TermEnum).
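
In outline, the guard could look something like this (a sketch with assumed
names such as topReader and seekUsingEnumState, not the committed patch):

{code}
// Sketch with assumed names (topReader, seekUsingEnumState), not the
// committed patch: only trust the enum's per-segment state when both
// objects came from the same top-level reader.
public void seek(TermEnum termEnum) throws IOException {
  if (termEnum instanceof MultiTermEnum
      && ((MultiTermEnum) termEnum).topReader == this.topReader) {
    // Same reader: reuse per-segment positions to skip sub-readers
    // that cannot contain the term.
    seekUsingEnumState((MultiTermEnum) termEnum);
  } else {
    // Enum from a different reader: only its term is meaningful here.
    seek(termEnum.term());
  }
}
{code}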

> optimize MultiTermEnum/MultiTermDocs
> 
>
> Key: LUCENE-1596
> URL: https://issues.apache.org/jira/browse/LUCENE-1596
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
> Fix For: 2.9
>
> Attachments: LUCENE-1596.patch
>
>
> Optimize MultiTermEnum and MultiTermDocs to avoid seeks on TermDocs that 
> don't match the term.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Build failed in Hudson: Lucene-trunk #828

2009-05-15 Thread Michael McCandless
OK, this failure was in fact caused by LUCENE-1596 (because the test
uses a MultiTermDocs/Enum to step through the docs, comparing them
across the two readers).

But, in digging into this one, I found & fixed a separate thread
hazard -- I'll open an issue shortly.

Otherwise, I only know of one other intermittent failure for
TestStressIndexing2.testRandomIWReader, which is that it sometimes fails to
close all the files it had opened... I haven't gotten to the bottom of that
one yet.

Mike

On Fri, May 15, 2009 at 3:32 AM, Uwe Schindler  wrote:
> This error is in TestStressIndexing2.testRandomIWReader. I checked locally;
> the problem is not reproducible with the same random seed (so calling
> r.setSeed() with the seed from the error log does not reproduce the bug). This
> test seems to fail very often; maybe there is a real multi-thread
> synchronization bug. It seems that the real-time reader returned from the
> writer contains more document IDs than the conventional reader opened from the
> directory.
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>> -Original Message-
>> From: Apache Hudson Server [mailto:hud...@hudson.zones.apache.org]
>> Sent: Friday, May 15, 2009 5:15 AM
>> To: java-dev@lucene.apache.org
>> Subject: Build failed in Hudson: Lucene-trunk #828
>>
>> See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/828/changes
>>
>> Changes:
>>
>> [yonik] LUCENE-1596: MultiTermDocs speedup when set with
>> MultiTermDocs.seek(MultiTermEnum)
>>
>> [mikemccand] LUCENE-1629: set javadocs encoding to UTF-8
>>
>> [uschindler] set eol to native
>>
>> [mikemccand] LUCENE-1629: move CHANGES entry to contrib; add
>> TestArabicAnalyzer
>>
>> [mikemccand] LUCENE-1629: adding new contrib analyzer SmartChineseAnalyzer
>>
>> [markrmiller] pendingOutput is a bit generic for a field in a large class
>> - changed to pendingSegnOutput
>>
>> --
>> [...truncated 19281 lines...]
>>     [junit] Testsuite: org.apache.lucene.search.TestDateFilter
>>     [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.97 sec
>>     [junit]
>>     [junit] Testsuite: org.apache.lucene.search.TestDateSort
>>     [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.045 sec
>>     [junit]
>>     [junit] Testsuite: org.apache.lucene.search.TestDisjunctionMaxQuery
>>     [junit] Tests run: 10, Failures: 0, Errors: 0, Time elapsed: 1.619 sec
>>     [junit]
>>     [junit] Testsuite: org.apache.lucene.search.TestDocBoost
>>     [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.903 sec
>>     [junit]
>>     [junit] Testsuite: org.apache.lucene.search.TestDocIdSet
>>     [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.385 sec
>>     [junit]
>>     [junit] Testsuite: org.apache.lucene.search.TestExplanations
>>     [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.836 sec
>>     [junit]
>>     [junit] Testsuite: org.apache.lucene.search.TestExtendedFieldCache
>>     [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.801 sec
>>     [junit]
>>     [junit] Testsuite: org.apache.lucene.search.TestFieldCacheRangeFilter
>>     [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 1.505 sec
>>     [junit]
>>     [junit] Testsuite: org.apache.lucene.search.TestFieldCacheTermsFilter
>>     [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.992 sec
>>     [junit]
>>     [junit] Testsuite: org.apache.lucene.search.TestFilteredQuery
>>     [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 1.404 sec
>>     [junit]
>>     [junit] Testsuite: org.apache.lucene.search.TestFilteredSearch
>>     [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.936 sec
>>     [junit]
>>     [junit] Testsuite: org.apache.lucene.search.TestFuzzyQuery
>>     [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 1.064 sec
>>     [junit]
>>     [junit] Testsuite: org.apache.lucene.search.TestMatchAllDocsQuery
>>     [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.992 sec
>>     [junit]
>>     [junit] Testsuite: org.apache.lucene.search.TestMultiPhraseQuery
>>     [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 1.045 sec
>>     [junit]
>>     [junit] Testsuite: org.apache.lucene.search.TestMultiSearcher
>>     [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 1.226 sec
>>     [junit]
>>     [junit] Testsuite: org.apache.lucene.search.TestMultiSearcherRanking
>>     [junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 1.413 sec
>>     [junit]
>>     [junit] Testsuite: org.apache.lucene.search.TestMultiTermConstantScore
>>     [junit] Tests run: 12, Failures: 0, Errors: 0, Time elapsed: 6.815 sec
>>     [junit]
>>     [junit] Testsuite: org.apache.lucene.search.TestMultiThreadTermVectors
>>     [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.738 sec
>>     [junit]
>>     [junit] Testsuite: org.apache.lucene.search.TestNot
>>     

[jira] Created: (LUCENE-1638) Thread safety issue can cause index corruption when autoCommit=true and multiple threads are committing

2009-05-15 Thread Michael McCandless (JIRA)
Thread safety issue can cause index corruption when autoCommit=true and 
multiple threads are committing
---

 Key: LUCENE-1638
 URL: https://issues.apache.org/jira/browse/LUCENE-1638
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 2.9


This is only present in 2.9 trunk, but has been there since
LUCENE-1516 was committed I believe.

It's rare to hit: it only happens if multiple calls to commit() are in
flight (from different threads) and where at least one of those calls
is due to a merge calling commit (because autoCommit is true).

When it strikes, it leaves the index corrupt because it incorrectly
removes an active segment.  It causes exceptions like this:
{code}
java.io.FileNotFoundException: _1e.fnm
at 
org.apache.lucene.store.MockRAMDirectory.openInput(MockRAMDirectory.java:246)
at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:67)
at 
org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:536)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:468)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:414)
at 
org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:641)
at 
org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:627)
at 
org.apache.lucene.index.DocumentsWriter.applyDeletes(DocumentsWriter.java:923)
at 
org.apache.lucene.index.IndexWriter.applyDeletes(IndexWriter.java:4987)
at 
org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:4165)
at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:4025)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4016)
at 
org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:2077)
at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2040)
at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2004)
at 
org.apache.lucene.index.TestStressIndexing2.indexRandom(TestStressIndexing2.java:210)
at 
org.apache.lucene.index.TestStressIndexing2.testMultiConfig(TestStressIndexing2.java:104)
{code}

It's caused by failing to increment changeCount inside the same
synchronized block where segmentInfos was changed, in commitMerge.
The fix is simple -- I plan to commit shortly.
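
In simplified form, the invariant looks like this (hypothetical names, not the
real IndexWriter code):

{code}
// Simplified sketch (hypothetical names, not the real IndexWriter):
// the change counter must be bumped under the same lock that mutates
// the segment list, so no thread can observe the new list paired with
// a stale counter and wrongly treat live segment files as deletable.
class SegmentState {
  private final java.util.List<String> segmentInfos = new java.util.ArrayList<String>();
  private long changeCount;

  synchronized void commitMerge(String merged, java.util.List<String> mergedAway) {
    segmentInfos.removeAll(mergedAway);
    segmentInfos.add(merged);
    changeCount++;  // correct: inside the same synchronized block
  }

  // Broken variant: incrementing changeCount after (outside) the block
  // opens a window where a concurrent commit sees the modified list
  // with an unchanged changeCount.

  synchronized long snapshot(java.util.List<String> out) {
    out.addAll(segmentInfos);
    return changeCount;  // list and counter read consistently
  }
}
{code}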


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1638) Thread safety issue can cause index corruption when autoCommit=true and multiple threads are committing

2009-05-15 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1638.


Resolution: Fixed

> Thread safety issue can cause index corruption when autoCommit=true and 
> multiple threads are committing
> ---
>
> Key: LUCENE-1638
> URL: https://issues.apache.org/jira/browse/LUCENE-1638
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.9
>
>
> This is only present in 2.9 trunk, but has been there since
> LUCENE-1516 was committed I believe.
> It's rare to hit: it only happens if multiple calls to commit() are in
> flight (from different threads) and where at least one of those calls
> is due to a merge calling commit (because autoCommit is true).
> When it strikes, it leaves the index corrupt because it incorrectly
> removes an active segment.  It causes exceptions like this:
> {code}
> java.io.FileNotFoundException: _1e.fnm
>   at 
> org.apache.lucene.store.MockRAMDirectory.openInput(MockRAMDirectory.java:246)
>   at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:67)
>   at 
> org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:536)
>   at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:468)
>   at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:414)
>   at 
> org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:641)
>   at 
> org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:627)
>   at 
> org.apache.lucene.index.DocumentsWriter.applyDeletes(DocumentsWriter.java:923)
>   at 
> org.apache.lucene.index.IndexWriter.applyDeletes(IndexWriter.java:4987)
>   at 
> org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:4165)
>   at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:4025)
>   at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4016)
>   at 
> org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:2077)
>   at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2040)
>   at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2004)
>   at 
> org.apache.lucene.index.TestStressIndexing2.indexRandom(TestStressIndexing2.java:210)
>   at 
> org.apache.lucene.index.TestStressIndexing2.testMultiConfig(TestStressIndexing2.java:104)
> {code}
> It's caused by failing to increment changeCount inside the same
> synchronized block where segmentInfos was changed, in commitMerge.
> The fix is simple -- I plan to commit shortly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Build failed in Hudson: Lucene-trunk #828

2009-05-15 Thread Yonik Seeley
On Fri, May 15, 2009 at 1:06 PM, Michael McCandless
 wrote:
> Otherwise, I only know of one other intermittent failure for
> TestStressIndexing2.testRandomIWReader, which is that it sometimes fails to
> close all the files it had opened... I haven't gotten to the bottom of that
> one yet.

Is there a JIRA issue open for this yet?

-Yonik

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Build failed in Hudson: Lucene-trunk #828

2009-05-15 Thread Michael McCandless
On Fri, May 15, 2009 at 2:00 PM, Yonik Seeley
 wrote:
> On Fri, May 15, 2009 at 1:06 PM, Michael McCandless
>  wrote:
>> Otherwise, I only know of one other intermittent failure for
>> TestStressIndexing2.testRandomIWReader, which is that it sometimes fails to
>> close all the files it had opened... I haven't gotten to the bottom of that
>> one yet.
>
> Is there a JIRA issue open for this yet?

No, not yet.  I'll open one.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1639) intermittent failure in TestIndexWriter.testRandomIWReader

2009-05-15 Thread Michael McCandless (JIRA)
intermittent failure in TestIndexWriter.testRandomIWReader
---

 Key: LUCENE-1639
 URL: https://issues.apache.org/jira/browse/LUCENE-1639
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.9
Reporter: Michael McCandless
Priority: Minor
 Fix For: 2.9


Rarely, this test (which was added with LUCENE-1516) fails in
MockRAMDirectory.close because some files were not closed, e.g.:
{code}
   [junit] NOTE: random seed of testcase 'testRandomIWReader' was: 
-5001333286299627079
   [junit] -  ---
   [junit] Testcase: 
testRandomIWReader(org.apache.lucene.index.TestStressIndexing2):Caused 
an ERROR
   [junit] MockRAMDirectory: cannot close: there are still open files: 
{_cq.tvx=3, _cq.fdx=3, _cq.tvf=3, _cq.tvd=3, _cq.fdt=3}
   [junit] java.lang.RuntimeException: MockRAMDirectory: cannot close: there 
are still open files: {_cq.tvx=3, _cq.fdx=3, _cq.tvf=3, _cq.tvd=3, _cq.fdt=3}
   [junit] at 
org.apache.lucene.store.MockRAMDirectory.close(MockRAMDirectory.java:292)
   [junit] at 
org.apache.lucene.index.TestStressIndexing2.testRandomIWReader(TestStressIndexing2.java:66)
   [junit] at 
org.apache.lucene.util.LuceneTestCase.runTest(LuceneTestCase.java:88)
{code}
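
For context, the bookkeeping behind that error message works roughly like this
(a minimal sketch with assumed names, not MockRAMDirectory itself):

{code}
import java.util.HashMap;
import java.util.Map;

// Minimal sketch (assumed names, not MockRAMDirectory itself): count
// opens per file and refuse to close the directory while any file is
// still open -- the {_cq.tvx=3, ...} map above is exactly such counts.
class OpenFileTracker {
  private final Map<String, Integer> openFiles = new HashMap<String, Integer>();

  synchronized void onOpen(String name) {
    Integer n = openFiles.get(name);
    openFiles.put(name, n == null ? 1 : n + 1);
  }

  synchronized void onClose(String name) {
    Integer n = openFiles.get(name);
    if (n == null) throw new IllegalStateException("close without open: " + name);
    if (n == 1) openFiles.remove(name);
    else openFiles.put(name, n - 1);
  }

  synchronized void assertAllClosed() {
    if (!openFiles.isEmpty())
      throw new RuntimeException("cannot close: there are still open files: " + openFiles);
  }
}
{code}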

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-15 Thread Mingfai Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709974#action_12709974
 ] 

Mingfai Ma commented on LUCENE-1629:


Could we use CC-CEDICT's dictionary instead? It is under the Creative Commons
Attribution-Share Alike 3.0 license:

http://www.mdbg.net/chindict/chindict.php?page=cc-cedict

> contrib intelligent Analyzer for Chinese
> 
>
> Key: LUCENE-1629
> URL: https://issues.apache.org/jira/browse/LUCENE-1629
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.4.1
> Environment: for java 1.5 or higher, lucene 2.4.1
>Reporter: Xiaoping Gao
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: analysis-data.zip, bigramdict.mem, 
> build-resources-with-folder.patch, build-resources.patch, 
> build-resources.patch, coredict.mem, LUCENE-1629-encoding-fix.patch, 
> LUCENE-1629-java1.4.patch
>
>
> I wrote an Analyzer for Apache Lucene for analyzing sentences in the Chinese 
> language. It's called "imdict-chinese-analyzer"; the project on Google Code 
> is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人" (I am Chinese) should be tokenized as "我" (I), "是" (am), 
> "中国人" (Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence 
> properly, or there will be misunderstandings everywhere in the index 
> constructed by Lucene, and the accuracy of the search engine will be affected 
> seriously!
> Although there are two analyzer packages in the Apache repository which can 
> handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or 
> every two adjoining characters as a single word. This is obviously not true 
> in reality; this strategy will also increase the index size and hurt 
> performance badly.
> The algorithm of imdict-chinese-analyzer is based on the Hidden Markov Model 
> (HMM), so it can tokenize Chinese sentences in a really intelligent way. 
> Tokenization accuracy of this model is above 90% according to the paper 
> "HHMM-based Chinese Lexical Analyzer ICTCLAS", while other analyzers' is about 
> 60%.
> As imdict-chinese-analyzer is really fast and intelligent, I want to 
> contribute it to the Apache Lucene repository.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-15 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710035#action_12710035
 ] 

Koji Sekiguchi commented on LUCENE-1629:


Just an FYI: there has been work on mapping between simplified and
traditional Chinese characters in Solr 1.4 (but you need to define the mapping
rules in mapping.txt).
See SOLR-822 and the attached JPG for a Chinese mapping sample.
I opened LUCENE-1466 for Lucene. :)
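
The underlying idea, in generic form (illustrative code only, not the
SOLR-822/LUCENE-1466 API):

{code}
import java.util.HashMap;
import java.util.Map;

// Generic illustration (not the SOLR-822/LUCENE-1466 API): normalize
// Traditional characters to Simplified before tokenization, using
// rules like those one would define in mapping.txt.
class TradToSimpMapper {
  private final Map<Character, Character> rules = new HashMap<Character, Character>();

  TradToSimpMapper() {
    rules.put('國', '国');  // sample rules; a real table has thousands
    rules.put('語', '语');
  }

  String normalize(String in) {
    StringBuilder out = new StringBuilder(in.length());
    for (int i = 0; i < in.length(); i++) {
      Character mapped = rules.get(in.charAt(i));
      out.append(mapped != null ? mapped.charValue() : in.charAt(i));
    }
    return out.toString();
  }
}
{code}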

> contrib intelligent Analyzer for Chinese
> 
>
> Key: LUCENE-1629
> URL: https://issues.apache.org/jira/browse/LUCENE-1629
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.4.1
> Environment: for java 1.5 or higher, lucene 2.4.1
>Reporter: Xiaoping Gao
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: analysis-data.zip, bigramdict.mem, 
> build-resources-with-folder.patch, build-resources.patch, 
> build-resources.patch, coredict.mem, LUCENE-1629-encoding-fix.patch, 
> LUCENE-1629-java1.4.patch
>
>
> I wrote an Analyzer for Apache Lucene for analyzing sentences in the Chinese 
> language. It's called "imdict-chinese-analyzer"; the project on Google Code 
> is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人" (I am Chinese) should be tokenized as "我" (I), "是" (am), 
> "中国人" (Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence 
> properly, or there will be misunderstandings everywhere in the index 
> constructed by Lucene, and the accuracy of the search engine will be affected 
> seriously!
> Although there are two analyzer packages in the Apache repository which can 
> handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or 
> every two adjoining characters as a single word. This is obviously not true 
> in reality; this strategy will also increase the index size and hurt 
> performance badly.
> The algorithm of imdict-chinese-analyzer is based on the Hidden Markov Model 
> (HMM), so it can tokenize Chinese sentences in a really intelligent way. 
> Tokenization accuracy of this model is above 90% according to the paper 
> "HHMM-based Chinese Lexical Analyzer ICTCLAS", while other analyzers' is about 
> 60%.
> As imdict-chinese-analyzer is really fast and intelligent, I want to 
> contribute it to the Apache Lucene repository.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: svn commit: r774718 [2/3] - in /lucene/java/trunk: ./ contrib/analyzers/src/java/org/apache/lucene/analysis/ar/ contrib/analyzers/src/java/org/apache/lucene/analysis/cn/ contrib/analyzers/src/java

2009-05-15 Thread Otis Gospodnetic

Hi,

Is that "Copyright 2009 www.imdict.net" allowed to be in there?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: "mikemcc...@apache.org" 
> To: java-comm...@lucene.apache.org
> Sent: Thursday, May 14, 2009 6:09:24 AM
> Subject: svn commit: r774718 [2/3] - in /lucene/java/trunk: ./ 
> contrib/analyzers/src/java/org/apache/lucene/analysis/ar/ 
> contrib/analyzers/src/java/org/apache/lucene/analysis/cn/ 
> contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/ 
> contrib/analyzers/...
> 
> Added: 
> lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/CopyOfBigramDictionary.java
> URL: 
> http://svn.apache.org/viewvc/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/CopyOfBigramDictionary.java?rev=774718&view=auto
> ==
> --- 
> lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/CopyOfBigramDictionary.java
>  
> (added)
> +++ 
> lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/CopyOfBigramDictionary.java
>  
> Thu May 14 10:09:22 2009
> @@ -0,0 +1,302 @@
> +/**
> + * Copyright 2009 www.imdict.net
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Hudson build is back to normal: Lucene-trunk #829

2009-05-15 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/829/changes



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710070#action_12710070
 ] 

Robert Muir commented on LUCENE-1629:
-

Koji, have you considered using ICU transforms for this behavior?
Not only is the rule-based language very nice (you can define variables, use 
context, etc.), but many transformations such as "Traditional-Simplified" are 
already defined.

http://userguide.icu-project.org/transforms/general
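
For example (needs the ICU4J jar on the classpath; "Traditional-Simplified" is
one of ICU's built-in transliterator IDs):

{code}
import com.ibm.icu.text.Transliterator;

// Small demo of the ICU transform mentioned above; requires ICU4J.
public class IcuTransformDemo {
  public static void main(String[] args) {
    Transliterator t = Transliterator.getInstance("Traditional-Simplified");
    System.out.println(t.transliterate("我是中國人"));  // prints 我是中国人
  }
}
{code}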


> contrib intelligent Analyzer for Chinese
> 
>
> Key: LUCENE-1629
> URL: https://issues.apache.org/jira/browse/LUCENE-1629
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.4.1
> Environment: for java 1.5 or higher, lucene 2.4.1
>Reporter: Xiaoping Gao
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: analysis-data.zip, bigramdict.mem, 
> build-resources-with-folder.patch, build-resources.patch, 
> build-resources.patch, coredict.mem, LUCENE-1629-encoding-fix.patch, 
> LUCENE-1629-java1.4.patch
>
>
> I wrote an Analyzer for Apache Lucene for analyzing sentences in the Chinese 
> language. It's called "imdict-chinese-analyzer"; the project on Google Code 
> is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人" (I am Chinese) should be tokenized as "我" (I), "是" (am), 
> "中国人" (Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence 
> properly, or there will be misunderstandings everywhere in the index 
> constructed by Lucene, and the accuracy of the search engine will be affected 
> seriously!
> Although there are two analyzer packages in the Apache repository which can 
> handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or 
> every two adjoining characters as a single word. This is obviously not true 
> in reality; this strategy will also increase the index size and hurt 
> performance badly.
> The algorithm of imdict-chinese-analyzer is based on the Hidden Markov Model 
> (HMM), so it can tokenize Chinese sentences in a really intelligent way. 
> Tokenization accuracy of this model is above 90% according to the paper 
> "HHMM-based Chinese Lexical Analyzer ICTCLAS", while other analyzers' is about 
> 60%.
> As imdict-chinese-analyzer is really fast and intelligent, I want to 
> contribute it to the Apache Lucene repository.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org