chinese stopwords

2010-04-09 Thread John Wang
Hi: I am using the SmartChineseAnalyzer class and it is great! Was wondering if we should have a set of Chinese stopwords. The default set contains only punctuation. Thanks -John

Re: Controlling the maximum size of a segment during indexing

2010-04-09 Thread Lance Norskog
I should mention - I tried it with: config.setRAMBufferSizeMB(1.0); and should have posted that version. It still comes up with one 5mb CFS segment file. On Fri, Apr 9, 2010 at 2:55 PM, Lance Norskog wrote: > If the IndexWriterConfig.ram buffer size and the mergeMB size on the > policy object a

Re: Controlling the maximum size of a segment during indexing

2010-04-09 Thread Lance Norskog
If the IndexWriterConfig.ram buffer size and the mergeMB size on the policy object are both 1mb, then can there be a segment larger than 2mb? Or 3mb? Or 10mb? Is there any way to (totally utterly completely absolutely 100%) cap the size of a segment merge? If so, it appears to be an algebraic equa

Re: TestCodecs running time

2010-04-09 Thread Lance Norskog
I have found it useful to keep two lists of tests: the slow tests and the fast tests. Maybe the TestSuite feature would work for this purpose? An @SlowTest annotation would be even better. JUnit might have a tool to do this filtering. On Fri, Apr 9, 2010 at 2:49 AM, Michael McCandless wrote: > I

[jira] Updated: (LUCENE-2323) reorganize contrib modules

2010-04-09 Thread Robert Muir (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2323: Attachment: LUCENE-2323_wikipedia.patch now that flex is merged, it's a good time to continue doing

[jira] Commented: (LUCENE-2372) Replace deprecated TermAttribute by new CharTermAttribute

2010-04-09 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855498#action_12855498 ] Uwe Schindler commented on LUCENE-2372: --- One more: PerFieldAnalyzerWrapper :( - Sorr

[jira] Commented: (LUCENE-2376) java.lang.OutOfMemoryError:Java heap space

2010-04-09 Thread Shivender Devarakonda (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855496#action_12855496 ] Shivender Devarakonda commented on LUCENE-2376: --- I have a question on this,

[jira] Commented: (LUCENE-2372) Replace deprecated TermAttribute by new CharTermAttribute

2010-04-09 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855493#action_12855493 ] Uwe Schindler commented on LUCENE-2372: --- Did it already for StandardAna (see patch).

[jira] Commented: (LUCENE-2372) Replace deprecated TermAttribute by new CharTermAttribute

2010-04-09 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855492#action_12855492 ] Michael McCandless commented on LUCENE-2372: +1 to making KeywordAnalyzer fina

[jira] Commented: (LUCENE-2372) Replace deprecated TermAttribute by new CharTermAttribute

2010-04-09 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855489#action_12855489 ] Mark Miller commented on LUCENE-2372: - bq.If I make it final and +1 - lets just remem

[jira] Updated: (LUCENE-2372) Replace deprecated TermAttribute by new CharTermAttribute

2010-04-09 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2372: -- Attachment: LUCENE-2372.patch Small updates. Just one question: The only non-final Analyzer l

[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-09 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855470#action_12855470 ] Michael McCandless commented on LUCENE-2386: I think oal.index is good. > Ind

[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-09 Thread Shai Erera (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855457#action_12855457 ] Shai Erera commented on LUCENE-2386: Ok sounds good. Is there a preferred package for

[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-09 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855421#action_12855421 ] Michael McCandless commented on LUCENE-2386: Patch looks good! Hmm... maybe w

[jira] Updated: (LUCENE-2372) Replace deprecated TermAttribute by new CharTermAttribute

2010-04-09 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2372: -- Attachment: LUCENE-2372.patch Patch that removes deprecated usage of TermAttribute from Lucene

[jira] Resolved: (LUCENE-2388) the unversioned site points to a dead trunk

2010-04-09 Thread Robert Muir (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-2388. - Resolution: Fixed Fix Version/s: 3.1 both patches are committed... if you find any outdat

[jira] Updated: (LUCENE-2388) the unversioned site points to a dead trunk

2010-04-09 Thread Robert Muir (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2388: Attachment: LUCENE-2388_solr.patch attached is a patch to fix the references on the solr site. >

[jira] Updated: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-09 Thread Shai Erera (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2386: --- Attachment: LUCENE-2386.patch Patch fixes all tests as well as changes to IndexWriter, IndexFileDele

[jira] Resolved: (LUCENE-2387) IndexWriter retains references to Readers used in Fields (memory leak)

2010-04-09 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-2387. Resolution: Fixed Fix Version/s: 3.1 > IndexWriter retains references to Re

[jira] Commented: (LUCENE-1879) Parallel incremental indexing

2010-04-09 Thread Shai Erera (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855379#action_12855379 ] Shai Erera commented on LUCENE-1879: I have found such version ... and it fails too :)

[jira] Commented: (LUCENE-1879) Parallel incremental indexing

2010-04-09 Thread Michael Busch (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855377#action_12855377 ] Michael Busch commented on LUCENE-1879: --- {quote} I'll start by describing the limita

Re: Controlling the maximum size of a segment during indexing

2010-04-09 Thread Mark Miller
Setting maxMergeMB does not limit the size of segments you will see - it simply limits what segments will be merged - segments over maxMergeMB will not be merged with other segments - you can still buffer up a ton of docs in RAM and flush a segment larger than maxMergeMB, or merge n segments sm
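The distinction Mark describes can be illustrated with a toy model. This is only a sketch and not Lucene code: the class MaxMergeSketch and its two helper methods are hypothetical names invented for illustration, the segment sizes are made up, and treating a segment exactly at the threshold as still eligible is an assumption. It shows that a size threshold applied to merge inputs leaves a large flushed segment in place and does not bound the size of the merged output.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only; hypothetical names, not part of Lucene. */
public class MaxMergeSketch {

    /**
     * Returns the segments a size-capped policy would consider for merging.
     * Segments over maxMergeMB are skipped (boundary handling is an assumption).
     */
    static List<Double> eligibleForMerge(List<Double> segmentSizesMB, double maxMergeMB) {
        List<Double> eligible = new ArrayList<>();
        for (double size : segmentSizesMB) {
            if (size <= maxMergeMB) {
                eligible.add(size);
            }
        }
        return eligible;
    }

    /** Approximates the merged segment size as the sum of its inputs. */
    static double mergedSizeMB(List<Double> segmentSizesMB) {
        double total = 0.0;
        for (double size : segmentSizesMB) {
            total += size;
        }
        return total;
    }

    public static void main(String[] args) {
        double maxMergeMB = 1.0;
        // The 5.0 MB entry stands for a segment flushed from a large RAM buffer.
        List<Double> segments = List.of(0.5, 0.75, 0.75, 1.0, 5.0);

        List<Double> eligible = eligibleForMerge(segments, maxMergeMB);
        double merged = mergedSizeMB(eligible);

        System.out.println("Eligible inputs: " + eligible);
        System.out.println("Merged size (MB): " + merged);
        System.out.println("Merged output exceeds maxMergeMB: " + (merged > maxMergeMB));
    }
}
```

In this model the 5.0 MB segment is never merged but still exists, and the merge of the four small segments produces a result well above the threshold, which is why the setting by itself cannot act as a hard cap on segment size.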

[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-09 Thread Shai Erera (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855369#action_12855369 ] Shai Erera commented on LUCENE-2386: I already did that ... just didn't post back. Cre

[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-09 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855364#action_12855364 ] Michael McCandless commented on LUCENE-2386: How about we subclass FNFE? Eg "

[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-09 Thread Earwin Burrfoot (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855360#action_12855360 ] Earwin Burrfoot commented on LUCENE-2386: - I'm at a loss for words. No, seriously, b

[jira] Updated: (LUCENE-2388) the unversioned site points to a dead trunk

2010-04-09 Thread Robert Muir (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2388: Attachment: LUCENE-2388.patch attached is a patch for lucene. if no one objects, i'd like to comm

[jira] Created: (LUCENE-2388) the unversioned site points to a dead trunk

2010-04-09 Thread Robert Muir (JIRA)
the unversioned site points to a dead trunk --- Key: LUCENE-2388 URL: https://issues.apache.org/jira/browse/LUCENE-2388 Project: Lucene - Java Issue Type: Bug Components: Website

[jira] Commented: (LUCENE-2364) Add support for terms in BytesRef format to Term, TermQuery, TermRangeQuery & Co.

2010-04-09 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855358#action_12855358 ] Uwe Schindler commented on LUCENE-2364: --- +1 Term is still used at a lot of places i

[jira] Resolved: (LUCENE-2302) Replacement for TermAttribute+Impl with extended capabilities (byte[] support, CharSequence, Appendable)

2010-04-09 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-2302. --- Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [New]) Committed revis

[jira] Updated: (LUCENE-2302) Replacement for TermAttribute+Impl with extended capabilities (byte[] support, CharSequence, Appendable)

2010-04-09 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2302: -- Attachment: LUCENE-2302-toString.patch Patch that fixes the toString() problems in Token and a

[jira] Commented: (LUCENE-2387) IndexWriter retains references to Readers used in Fields (memory leak)

2010-04-09 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855347#action_12855347 ] Michael McCandless commented on LUCENE-2387: I agree, Uwe -- I'll fold that in

[jira] Commented: (LUCENE-2387) IndexWriter retains references to Readers used in Fields (memory leak)

2010-04-09 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855345#action_12855345 ] Uwe Schindler commented on LUCENE-2387: --- As Tokenizers are reused, the analyzer hold

[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-09 Thread Shai Erera (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855344#action_12855344 ] Shai Erera commented on LUCENE-2386: Ok I've added the following to DirReader: {code}

[jira] Commented: (LUCENE-2364) Add support for terms in BytesRef format to Term, TermQuery, TermRangeQuery & Co.

2010-04-09 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855343#action_12855343 ] Michael McCandless commented on LUCENE-2364: Maybe we should simply deprecate

Re: IndexWriter memory leak?

2010-04-09 Thread Michael McCandless
I agree IW should not hold refs to the Field instances from the last doc indexed... I put a patch on LUCENE-2387 to null the reference as we go. Can you confirm this lets GC reclaim? Mike On Fri, Apr 9, 2010 at 12:54 AM, Ruben Laguna wrote: > But the Readers I'm talking about are not held by th

[jira] Updated: (LUCENE-2387) IndexWriter retains references to Readers used in Fields (memory leak)

2010-04-09 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2387: --- Attachment: LUCENE-2387.patch Attached patch nulls out the Fieldable reference. > I

[jira] Assigned: (LUCENE-2387) IndexWriter retains references to Readers used in Fields (memory leak)

2010-04-09 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-2387: -- Assignee: Michael McCandless > IndexWriter retains references to Readers used

[jira] Commented: (LUCENE-2376) java.lang.OutOfMemoryError:Java heap space

2010-04-09 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855336#action_12855336 ] Michael McCandless commented on LUCENE-2376: Hmm indeed you have a great many

[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-09 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855333#action_12855333 ] Michael McCandless commented on LUCENE-2386: bq. This is a behavioral bw break

Re: TestCodecs running time

2010-04-09 Thread Michael McCandless
It's also slow because it repeats all the tests for each of the core codecs (standard, sep, pulsing, intblock). I think it's fine to reduce the number of iterations -- just make sure there's no seed to newRandom() so the distributed testing is "effective". Mike On Fri, Apr 9, 2010 at 12:43 AM,

[jira] Updated: (LUCENE-2372) Replace deprecated TermAttribute by new CharTermAttribute

2010-04-09 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2372: -- Attachment: LUCENE-2372.patch Here is a first patch for the core tokenstreams. Tests not yet chan

[jira] Commented: (LUCENE-2376) java.lang.OutOfMemoryError:Java heap space

2010-04-09 Thread Shivender Devarakonda (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855306#action_12855306 ] Shivender Devarakonda commented on LUCENE-2376: --- Please find the attached Ch