RE: Build failed in Hudson: Lucene-trunk #828
This error is in TestStressIndexing2.testRandomIWReader. I checked it out locally; the problem is not reproducible with the same random seed (so calling r.setSeed() with the seed from the error log does not reproduce the bug). This test seems to fail very often; maybe there is a real multi-thread synchronization bug. It seems that the real-time reader returned from the writer contains more document IDs than the conventional reader opened from the directory.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Apache Hudson Server [mailto:hud...@hudson.zones.apache.org]
> Sent: Friday, May 15, 2009 5:15 AM
> To: java-dev@lucene.apache.org
> Subject: Build failed in Hudson: Lucene-trunk #828
>
> See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/828/changes
>
> Changes:
>
> [yonik] LUCENE-1596: MultiTermDocs speedup when set with MultiTermDocs.seek(MultiTermEnum)
>
> [mikemccand] LUCENE-1629: set javadocs encoding to UTF-8
>
> [uschindler] set eol to native
>
> [mikemccand] LUCENE-1629: move CHANGES entry to contrib; add TestArabicAnalyzer
>
> [mikemccand] LUCENE-1629: adding new contrib analyzer SmartChineseAnalyzer
>
> [markrmiller] pendingOutput is a bit generic for a field in a large class - changed to pendingSegnOutput
>
> --
> [...truncated 19281 lines...]
> [junit] Testsuite: org.apache.lucene.search.TestDateFilter
> [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.97 sec
> [...remaining per-suite junit output truncated; all quoted suites passed...]
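For readers unfamiliar with how these randomized tests are replayed: the test logs its seed (see the "NOTE: random seed of testcase ... was: ..." line quoted later in this thread), and reproduction normally means feeding that value back into the test's Random. A minimal sketch of the idea; the seed below is the one from the LUCENE-1639 log later in this digest, used purely as an example:

{code}
import java.util.Random;

public class ReplaySeed {
  public static void main(String[] args) {
    // Seed copied from a failure log line such as:
    //   NOTE: random seed of testcase 'testRandomIWReader' was: -5001333286299627079
    Random r = new Random();
    r.setSeed(-5001333286299627079L);

    // Drive the randomized test logic from r: the doc/term choices replay
    // exactly, but thread scheduling does not, which is why a genuine
    // concurrency bug (as Uwe suspects) can still fail to reproduce.
    System.out.println("first value: " + r.nextInt());
  }
}
{code}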
[jira] Reopened: (LUCENE-1596) optimize MultiTermEnum/MultiTermDocs
[ https://issues.apache.org/jira/browse/LUCENE-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reopened LUCENE-1596:
----------------------------------------

I'm seeing this new AIOOBE while tracking down the intermittent failure in TestStressIndexing2:

{code}
1) testRandomIWReader(org.apache.lucene.index.TestStressIndexing2)
java.lang.ArrayIndexOutOfBoundsException: 6
	at org.apache.lucene.index.MultiSegmentReader$MultiTermDocs.next(MultiSegmentReader.java:672)
	at org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:292)
	at org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:250)
	at org.apache.lucene.index.TestStressIndexing2.testRandomIWReader(TestStressIndexing2.java:67)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:585)
	at junit.framework.TestCase.runTest(TestCase.java:168)
	at org.apache.lucene.util.LuceneTestCase.runTest(LuceneTestCase.java:88)
	at junit.framework.TestCase.runBare(TestCase.java:134)
	at junit.framework.TestResult$1.protect(TestResult.java:110)
	at junit.framework.TestResult.runProtected(TestResult.java:128)
	at junit.framework.TestResult.run(TestResult.java:113)
	at junit.framework.TestCase.run(TestCase.java:124)
	at junit.framework.TestSuite.runTest(TestSuite.java:232)
	at junit.framework.TestSuite.run(TestSuite.java:227)
	at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81)
	at org.junit.internal.runners.CompositeRunner.runChildren(CompositeRunner.java:33)
	at org.junit.internal.runners.CompositeRunner.run(CompositeRunner.java:28)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:130)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:109)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:100)
	at org.junit.runner.JUnitCore.runMain(JUnitCore.java:81)
	at org.junit.runner.JUnitCore.main(JUnitCore.java:44)
{code}

I think it's because this optimization isn't admissible when one calls MultiTermDocs.seek with a MultiTermEnum derived from a different MultiSegmentReader. I.e., I think there needs to be another check that verifies the MultiTermEnum passed to MultiTermDocs.seek in fact shares the same MultiSegmentReader. Before this optimization, this was OK (only the term was used from the MultiTermEnum).

> optimize MultiTermEnum/MultiTermDocs
> ------------------------------------
>
>                 Key: LUCENE-1596
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1596
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Yonik Seeley
>            Assignee: Yonik Seeley
>             Fix For: 2.9
>
>         Attachments: LUCENE-1596.patch
>
>
> Optimize MultiTermEnum and MultiTermDocs to avoid seeks on TermDocs that don't match the term.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
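To make the inadmissible pattern concrete, here is a hedged sketch against the 2.x-era public API (the field names and index setup are invented; a real reproduction would also need each index to span multiple segments so the Multi* classes are actually in play):

{code}
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class ForeignEnumSeek {
  static Directory makeIndex() throws Exception {
    Directory dir = new RAMDirectory();
    IndexWriter w = new IndexWriter(dir, new WhitespaceAnalyzer(),
        true, IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.add(new Field("f", "foo", Field.Store.NO, Field.Index.ANALYZED));
    w.addDocument(doc);
    w.close();
    return dir;
  }

  public static void main(String[] args) throws Exception {
    IndexReader r1 = IndexReader.open(makeIndex());
    IndexReader r2 = IndexReader.open(makeIndex()); // a *different* reader
    TermEnum te = r1.terms(new Term("f", "foo"));   // enum derived from r1
    TermDocs td = r2.termDocs();                    // docs derived from r2
    td.seek(te); // historically legal: only te.term() was consulted; the
                 // LUCENE-1596 optimization also reused r1's per-segment
                 // state here, which is what triggers the AIOOBE above
    while (td.next()) {
      System.out.println("doc=" + td.doc());
    }
  }
}
{code}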
[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709796#action_12709796 ]

Mingfai Ma commented on LUCENE-1629:
------------------------------------

Hi Xiaoping, I'm interested in getting the Chinese analyzer to work for Traditional Chinese (UTF-8/Big5). I just wonder whether your coredict.mem comes from ICTCLAS (http://ictclas.org/Down_share.html)? If yes, is it the 2009 or the 2008 release? ICTCLAS has a Traditional Chinese edition of its 2008 release, but that distribution is not in .dct format. Is there a simple specification of the .dct format, so I could find a way to convert ICTCLAS's lexical dictionary to .dct and use it with your library?

> contrib intelligent Analyzer for Chinese
> ----------------------------------------
>
>                 Key: LUCENE-1629
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1629
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.4.1
>         Environment: for java 1.5 or higher, lucene 2.4.1
>            Reporter: Xiaoping Gao
>            Assignee: Michael McCandless
>             Fix For: 2.9
>
>         Attachments: analysis-data.zip, bigramdict.mem, build-resources-with-folder.patch, build-resources.patch, build-resources.patch, coredict.mem, LUCENE-1629-encoding-fix.patch, LUCENE-1629-java1.4.patch
>
>
> I wrote an Analyzer for Apache Lucene for analyzing sentences in the Chinese language. It's called "imdict-chinese-analyzer"; the project on Google Code is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人" (I am Chinese) should be tokenized as "我" (I) "是" (am) "中国人" (Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence properly, or there will be misunderstandings everywhere in the index constructed by Lucene, and the accuracy of the search engine will be affected seriously!
> Although there are two analyzer packages in the Apache repository that can handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or every two adjoining characters as a single word. This is obviously not true in reality; this strategy also increases the index size and hurts performance badly.
> The algorithm of imdict-chinese-analyzer is based on a Hidden Markov Model (HMM), so it can tokenize Chinese sentences in a really intelligent way. Tokenization accuracy of this model is above 90% according to the paper "HHMM-based Chinese Lexical Analyzer ICTCLAS", while other analyzers' is about 60%.
> As imdict-chinese-analyzer is really fast and intelligent, I want to contribute it to the Apache Lucene repository.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709796#action_12709796 ]

Mingfai Ma edited comment on LUCENE-1629 at 5/15/09 3:23 AM:
-------------------------------------------------------------

Hi Xiaoping, I'm interested in getting the Chinese analyzer to work for Traditional Chinese (UTF-8/Big5). I just wonder whether your coredict.dct comes from ICTCLAS (http://ictclas.org/Down_share.html)? If yes, is it the 2009 or the 2008 release? ICTCLAS has a Traditional Chinese edition of its 2008 release, but that distribution is not in .dct format. Is there a simple specification of the .dct format, so I could find a way to convert ICTCLAS's lexical dictionary to .dct and use it with your library?

was (Author: mingfai): [the same comment, but asking about coredict.mem rather than coredict.dct]

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1596) optimize MultiTermEnum/MultiTermDocs
[ https://issues.apache.org/jira/browse/LUCENE-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709854#action_12709854 ]

Yonik Seeley commented on LUCENE-1596:
--------------------------------------

Gah... I forgot it was permissible (or at least not disallowed) to pass an enum not derived from the same reader. I'll fix.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709867#action_12709867 ]

Xiaoping Gao commented on LUCENE-1629:
--------------------------------------

Hello Mingfai! coredict.mem is converted from coredict.dct, which comes from ICTCLAS 1.0, neither 2008 nor 2009. The author authorized me to release only the lexical dictionary from ICTCLAS 1.0 under APLv2; he didn't authorize the dictionaries of ICTCLAS 2008/2009.
As far as I know, coredict.dct contains only GB2312 characters, so it cannot support Big5. I think we should find a proper Big5 dictionary first; then I will help you convert it to a .dct file.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709880#action_12709880 ]

Robert Muir commented on LUCENE-1629:
-------------------------------------

If you acquire the Big5 resources, do you think it would be possible to create a single dictionary that works with both Simplified and Traditional (i.e. merge the Big5 resources with the GB resources)?
The reason I say this is that the existing Chinese analyzers, although they tokenize in a less intelligent way, are agnostic to Simplified/Traditional issues...

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709885#action_12709885 ]

Robert Muir commented on LUCENE-1629:
-------------------------------------

Another potential issue with Big5 I want to point out is that many of the Big5 character sets, such as HKSCS, have characters that are mapped into regions of Unicode outside the BMP.
Just glancing at the code, some things will need to be modified for this to work correctly with surrogate pairs: various functions that take char will need to take a code point (int), etc.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
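For readers who have not hit this before, a small self-contained illustration of the char-versus-code-point distinction using plain JDK APIs (this is background, not code from the analyzer):

{code}
public class CodePoints {
  public static void main(String[] args) {
    // U+2A6A5 is a CJK Extension B ideograph outside the BMP; in a Java
    // String it occupies two chars (a surrogate pair).
    String s = new StringBuilder().appendCodePoint(0x2A6A5).toString();
    System.out.println(s.length());                      // 2 -- char count, misleading
    System.out.println(s.codePointCount(0, s.length())); // 1 -- actual characters

    // Iterating by code point instead of by char:
    for (int i = 0; i < s.length(); ) {
      int cp = s.codePointAt(i);
      System.out.printf("U+%X%n", cp);
      i += Character.charCount(cp); // advances by 2 for supplementary chars
    }
  }
}
{code}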
[jira] Resolved: (LUCENE-1596) optimize MultiTermEnum/MultiTermDocs
[ https://issues.apache.org/jira/browse/LUCENE-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley resolved LUCENE-1596.
----------------------------------

    Resolution: Fixed

I just committed the fix (since trunk was broken) and a test that failed without the fix, but if anyone has a better idea of how to handle/fix it, we can certainly still discuss. I just did the obvious: store the multi-reader in the TermEnum and TermDocs instances and compare them in TermDocs.seek(TermEnum).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
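A sketch of the shape of that check; the class and member names below are assumptions for illustration, not the committed code:

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Hypothetical outline of the fix: both the multi-enum and the multi-docs
// remember the top-level reader that created them, and the optimized seek
// path is taken only when those readers are the same instance.
class MultiTermDocsSketch {
  final IndexReader topReader; // the multi-reader this TermDocs came from

  MultiTermDocsSketch(IndexReader topReader) { this.topReader = topReader; }

  void seek(MultiTermEnumSketch termEnum) throws IOException {
    if (termEnum.topReader == topReader) {
      // Same multi-reader: reusing the enum's per-segment state is safe.
      seekUsingEnumState(termEnum);
    } else {
      // Foreign enum: fall back to the pre-optimization behavior,
      // which consulted only the term itself.
      seekByTerm(termEnum.term());
    }
  }

  void seekUsingEnumState(MultiTermEnumSketch e) { /* elided */ }
  void seekByTerm(Term t) { /* elided */ }
}

class MultiTermEnumSketch {
  IndexReader topReader;
  Term term() { return null; /* elided */ }
}
{code}

The key design point is falling back to the term-only path rather than throwing, which preserves the old contract for enums from foreign readers.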
Re: Build failed in Hudson: Lucene-trunk #828
OK, this failure was in fact caused by LUCENE-1596 (because the test uses a MultiTermDocs/Enum to step through the docs, comparing them across the two readers).

But in digging into this one, I found & fixed a separate thread hazard -- I'll open an issue shortly.

Otherwise, I only know of one other intermittent failure for TestStressIndexing2.testRandomIWReader, which is that sometimes it fails to close all the files it had opened... haven't gotten to the bottom of that one yet.

Mike

On Fri, May 15, 2009 at 3:32 AM, Uwe Schindler wrote:
> This error is in TestStressIndexing2.testRandomIWReader. I checked it out
> locally; the problem is not reproducible with the same random seed (so
> calling r.setSeed() with the seed from the error log does not reproduce the
> bug). This test seems to fail very often; maybe there is a real multi-thread
> synchronization bug. It seems that the real-time reader returned from the
> writer contains more document IDs than the conventional reader opened from
> the directory.
>
> [...quoted Hudson build log truncated; see the original report above...]
[jira] Created: (LUCENE-1638) Thread safety issue can cause index corruption when autoCommit=true and multiple threads are committing
Thread safety issue can cause index corruption when autoCommit=true and multiple threads are committing
--------------------------------------------------------------------------------------------------------

                 Key: LUCENE-1638
                 URL: https://issues.apache.org/jira/browse/LUCENE-1638
             Project: Lucene - Java
          Issue Type: Bug
          Components: Index
    Affects Versions: 2.9
            Reporter: Michael McCandless
            Assignee: Michael McCandless
             Fix For: 2.9


This is only present in 2.9 trunk, but has been there since LUCENE-1516 was committed, I believe.

It's rare to hit: it only happens if multiple calls to commit() are in flight (from different threads) and at least one of those calls is due to a merge calling commit (because autoCommit is true).

When it strikes, it leaves the index corrupt because it incorrectly removes an active segment. It causes exceptions like this:

{code}
java.io.FileNotFoundException: _1e.fnm
	at org.apache.lucene.store.MockRAMDirectory.openInput(MockRAMDirectory.java:246)
	at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:67)
	at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:536)
	at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:468)
	at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:414)
	at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:641)
	at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:627)
	at org.apache.lucene.index.DocumentsWriter.applyDeletes(DocumentsWriter.java:923)
	at org.apache.lucene.index.IndexWriter.applyDeletes(IndexWriter.java:4987)
	at org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:4165)
	at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:4025)
	at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4016)
	at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:2077)
	at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2040)
	at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2004)
	at org.apache.lucene.index.TestStressIndexing2.indexRandom(TestStressIndexing2.java:210)
	at org.apache.lucene.index.TestStressIndexing2.testMultiConfig(TestStressIndexing2.java:104)
{code}

It's caused by failing to increment changeCount inside the same synchronized block where segmentInfos was changed, in commitMerge. The fix is simple -- I plan to commit shortly.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
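To make the race concrete, here is a generic sketch of the pattern being described. The names mirror the description above, but this is illustrative code showing the shape of the bug and fix, not Lucene's actual commitMerge:

{code}
import java.util.ArrayList;
import java.util.List;

// Writers mutate segmentInfos and bump changeCount so committers can tell
// "something changed since my last commit". If the bump happens outside the
// synchronized block that mutates the list, a concurrent committer can see
// the new segmentInfos paired with a stale changeCount and act on an
// inconsistent view -- analogous to dropping a still-active segment.
class CommitSketch {
  private final List<String> segmentInfos = new ArrayList<String>();
  private long changeCount;

  // Broken shape: mutation and version bump in separate critical sections.
  void commitMergeBroken(String mergedSegment) {
    synchronized (this) {
      segmentInfos.add(mergedSegment);
    }
    // window: another thread can commit here, observing new infos + old count
    synchronized (this) {
      changeCount++;
    }
  }

  // Fixed shape: changeCount is incremented in the same synchronized block
  // that changes segmentInfos, so the two are always observed together.
  synchronized void commitMergeFixed(String mergedSegment) {
    segmentInfos.add(mergedSegment);
    changeCount++;
  }
}
{code}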
[jira] Resolved: (LUCENE-1638) Thread safety issue can cause index corruption when autoCommit=true and multiple threads are committing
[ https://issues.apache.org/jira/browse/LUCENE-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1638.
----------------------------------------

    Resolution: Fixed

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Build failed in Hudson: Lucene-trunk #828
On Fri, May 15, 2009 at 1:06 PM, Michael McCandless wrote:
> Otherwise, I only know of one other intermittent failure for
> TestStressIndexing2.testRandomIWReader, which is that sometimes it fails to
> close all the files it had opened... haven't gotten to the bottom of that
> one yet.

Is there a JIRA issue open for this yet?

-Yonik

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Build failed in Hudson: Lucene-trunk #828
On Fri, May 15, 2009 at 2:00 PM, Yonik Seeley wrote:
> On Fri, May 15, 2009 at 1:06 PM, Michael McCandless wrote:
>> Otherwise, I only know of one other intermittent failure for
>> TestStressIndexing2.testRandomIWReader, which is that sometimes it fails to
>> close all the files it had opened... haven't gotten to the bottom of that
>> one yet.
>
> Is there a JIRA issue open for this yet?

No, not yet. I'll open one.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1639) intermittent failure in TestIndexWriter.testRandomIWReader
intermittent failure in TestIndexWriter.testRandomIWReader
-----------------------------------------------------------

                 Key: LUCENE-1639
                 URL: https://issues.apache.org/jira/browse/LUCENE-1639
             Project: Lucene - Java
          Issue Type: Bug
    Affects Versions: 2.9
            Reporter: Michael McCandless
            Priority: Minor
             Fix For: 2.9


Rarely, this test (which was added with LUCENE-1516) fails in MockRAMDirectory.close because some files were not closed, e.g.:

{code}
[junit] NOTE: random seed of testcase 'testRandomIWReader' was: -5001333286299627079
[junit] ------------------------------------------------------
[junit] Testcase: testRandomIWReader(org.apache.lucene.index.TestStressIndexing2): Caused an ERROR
[junit] MockRAMDirectory: cannot close: there are still open files: {_cq.tvx=3, _cq.fdx=3, _cq.tvf=3, _cq.tvd=3, _cq.fdt=3}
[junit] java.lang.RuntimeException: MockRAMDirectory: cannot close: there are still open files: {_cq.tvx=3, _cq.fdx=3, _cq.tvf=3, _cq.tvd=3, _cq.fdt=3}
[junit] 	at org.apache.lucene.store.MockRAMDirectory.close(MockRAMDirectory.java:292)
[junit] 	at org.apache.lucene.index.TestStressIndexing2.testRandomIWReader(TestStressIndexing2.java:66)
[junit] 	at org.apache.lucene.util.LuceneTestCase.runTest(LuceneTestCase.java:88)
{code}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
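As background, a generic sketch of the kind of leak detection MockRAMDirectory performs: it ref-counts files on open and refuses to close while any remain open (hence the name-to-refcount map in the message above). The names below are illustrative, not Lucene's actual implementation:

{code}
import java.util.HashMap;
import java.util.Map;

// A directory wrapper that ref-counts open files so a test can fail
// loudly when something forgets to close an input or output.
class LeakTrackingDir {
  private final Map<String, Integer> openFiles = new HashMap<String, Integer>();

  synchronized void onOpen(String name) {
    Integer n = openFiles.get(name);
    openFiles.put(name, n == null ? 1 : n + 1);
  }

  synchronized void onClose(String name) {
    Integer n = openFiles.get(name);
    if (n == null) throw new IllegalStateException("close without open: " + name);
    if (n == 1) openFiles.remove(name);
    else openFiles.put(name, n - 1);
  }

  // Mirrors the failure in the report: the map still holds name -> refcount
  // entries for every file that was opened but never closed.
  synchronized void close() {
    if (!openFiles.isEmpty()) {
      throw new RuntimeException(
          "cannot close: there are still open files: " + openFiles);
    }
  }
}
{code}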
[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709974#action_12709974 ]

Mingfai Ma commented on LUCENE-1629:
------------------------------------

Could we use CC-CEDICT's dictionary instead? It is under the Creative Commons Attribution-Share Alike 3.0 license:
http://www.mdbg.net/chindict/chindict.php?page=cc-cedict

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710035#action_12710035 ]

Koji Sekiguchi commented on LUCENE-1629:
----------------------------------------

Just an FYI: there has been work on mapping between Simplified and Traditional Chinese characters in Solr 1.4 (but you need to define the mapping rules in mapping.txt). See SOLR-822 and the attached JPG for a Chinese mapping sample. I opened LUCENE-1466 for Lucene. :)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
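For context on what those rules look like: Solr's mapping files use a quoted "source" => "target" syntax with # comments. The entries below are made-up examples of folding Traditional forms to Simplified, not taken from SOLR-822's attachment:

{code}
# hypothetical mapping.txt entries: fold Traditional Chinese characters
# to their Simplified forms before tokenization
"國" => "国"
"學" => "学"
"體" => "体"
{code}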
Re: svn commit: r774718 [2/3] - in /lucene/java/trunk: ./ contrib/analyzers/src/java/org/apache/lucene/analysis/ar/ contrib/analyzers/src/java/org/apache/lucene/analysis/cn/ contrib/analyzers/src/java
Hi,

Is that "Copyright 2009 www.imdict.net" allowed to be in there?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: "mikemcc...@apache.org"
> To: java-comm...@lucene.apache.org
> Sent: Thursday, May 14, 2009 6:09:24 AM
> Subject: svn commit: r774718 [2/3] - in /lucene/java/trunk: ./ contrib/analyzers/src/java/org/apache/lucene/analysis/ar/ contrib/analyzers/src/java/org/apache/lucene/analysis/cn/ contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/ contrib/analyzers/...
>
> Added: lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/CopyOfBigramDictionary.java
> URL: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/CopyOfBigramDictionary.java?rev=774718&view=auto
> ==============================================================
> --- lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/CopyOfBigramDictionary.java (added)
> +++ lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/CopyOfBigramDictionary.java Thu May 14 10:09:22 2009
> @@ -0,0 +1,302 @@
> +/**
> + * Copyright 2009 www.imdict.net
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
Hudson build is back to normal: Lucene-trunk #829
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/829/changes - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710070#action_12710070 ]

Robert Muir commented on LUCENE-1629:
-------------------------------------

Koji, have you considered using ICU transforms for this behavior? Not only is the rule-based language very nice (you can define variables, use context, etc.), but many transformations, such as "Traditional-Simplified", are already defined.
http://userguide.icu-project.org/transforms/general

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
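As a rough illustration of what that suggestion looks like with the ICU4J library on the classpath ("Traditional-Simplified" is one of ICU's built-in transform IDs; the sample string reuses the sentence from the issue description):

{code}
import com.ibm.icu.text.Transliterator;

public class TradToSimp {
  public static void main(String[] args) {
    // Look up ICU's built-in Traditional-to-Simplified Chinese transform.
    Transliterator t = Transliterator.getInstance("Traditional-Simplified");

    String traditional = "我是中國人";             // Traditional form
    String simplified = t.transliterate(traditional);
    System.out.println(simplified);               // expected: 我是中国人
  }
}
{code}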