[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet
[ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837736#action_12837736 ] Michael McCandless commented on LUCENE-2279:

Should we deprecate (and eventually remove) Analyzer.tokenStream? Maybe we should absorb ReusableAnalyzerBase back into Analyzer? Or maybe now is an opportune time to create a separate standalone analyzers package (a subproject under the Lucene TLP)? We've broached this idea in the past, and I think it's compelling. I think Lucene/Solr/Nutch need to eventually get to this point (where they share analyzers from a single source), so maybe now is the time. It'd be a single place where we would pull in all of Lucene's core/contrib analyzers, plus Solr's analyzers, plus the new analyzers Robert keeps making ;) Robert's efforts to upgrade Solr's analyzers to 3.0 (currently a big patch waiting on SOLR-1657), plus his various other pending analyzer bug fixes, could be done in this new analyzers package. And we could immediately fix problems we have with the current analyzers API (like this reusableTokenStream/tokenStream ambiguity).

eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet
- Key: LUCENE-2279 URL: https://issues.apache.org/jira/browse/LUCENE-2279 Project: Lucene - Java Issue Type: Improvement Reporter: thushara wijeratna Priority: Minor

Passing a Set<String> to a StopFilter instead of a CharArraySet results in a very slow filter. This is because for each document, Analyzer.tokenStream() is called, which ends up constructing the StopFilter (if one is used).
And if a regular Set<String> is used in the StopFilter, all the elements of the set are copied to a CharArraySet, as we can see in its ctor:

{code}
public StopFilter(boolean enablePositionIncrements, TokenStream input, Set stopWords, boolean ignoreCase) {
  super(input);
  if (stopWords instanceof CharArraySet) {
    this.stopWords = (CharArraySet) stopWords;
  } else {
    this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
    this.stopWords.addAll(stopWords);
  }
  this.enablePositionIncrements = enablePositionIncrements;
  init();
}
{code}

I feel we should make the StopFilter signature specific, as in specifying CharArraySet vs. Set, and there should be a JavaDoc warning on using the other variants of StopFilter, as they all result in a copy for each invocation of Analyzer.tokenStream().

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
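The cost being described is easy to see in isolation. Below is a minimal, self-contained Java sketch of why the instanceof branch in that ctor makes the argument type matter; the class and counter names are illustrative stand-ins, not Lucene's actual code, and `CharSet` merely plays the role of CharArraySet.

```java
import java.util.HashSet;
import java.util.Set;

public class StopSetDemo {
    // Stand-in for Lucene's CharArraySet: a concrete type the filter can
    // adopt without copying.
    static class CharSet extends HashSet<String> {}

    static int copies = 0;

    // Mirrors the ctor logic quoted above: copy only when handed a plain Set.
    static Set<String> adopt(Set<String> stopWords) {
        if (stopWords instanceof CharSet) {
            return stopWords;            // zero cost: reuse the caller's set
        }
        CharSet s = new CharSet();       // per-call copy, O(|stopWords|)
        s.addAll(stopWords);
        copies++;
        return s;
    }

    public static void main(String[] args) {
        Set<String> plain = new HashSet<>(Set.of("a", "the"));
        CharSet fast = new CharSet();
        fast.addAll(plain);
        for (int doc = 0; doc < 1000; doc++) { // one tokenStream() per document
            adopt(plain);                      // copies every single time
            adopt(fast);                       // never copies
        }
        System.out.println(copies);            // 1000
    }
}
```

With a real stop set of hundreds of words and millions of documents, the plain-Set path turns a constant-time check into a per-document rebuild of the whole set, which is the pathological slowdown the reporter measured.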
[jira] Assigned: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-2283: -- Assignee: Michael McCandless

Possible Memory Leak in StoredFieldsWriter
- Key: LUCENE-2283 URL: https://issues.apache.org/jira/browse/LUCENE-2283 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.4.1 Reporter: Tim Smith Assignee: Michael McCandless

StoredFieldsWriter creates a pool of PerDoc instances. This pool will grow but is never reclaimed by any mechanism. Furthermore, each PerDoc instance contains a RAMFile, and this RAMFile will also never be truncated (it will only ever grow), as far as I can tell. When feeding documents with a large number of stored fields (or one large dominating stored field), this can result in memory being consumed in the RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very large, even if large documents are rare. It seems like there should be some attempt to reclaim memory from the PerDoc[] instance pool (or otherwise limit the size of the RAMFiles that are cached).
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837740#action_12837740 ] Michael McCandless commented on LUCENE-2283:

TermVectorsTermsWriter has the same issue. You're right: with irregularly sized documents coming through, you can end up with PerDoc instances that waste space, because the RAMFile has buffers allocated from past huge docs that the latest tiny docs don't use.

Note that the number of outstanding PerDoc instances is a function of how out of order the docs are being indexed, because the PerDoc holds any state only until that doc can be written to the store files (stored fields, term vectors). It's transient. E.g., with a single thread there will only be one PerDoc -- it's written immediately. With 2 threads, if you have a massive doc (which thread 1 gets stuck indexing) and then zillions of tiny docs (which thread 2 burns through while thread 1 is busy), then you can get a large number of PerDocs created, waiting for their turn because thread 1 hasn't finished yet.

But this process won't use unbounded RAM -- the RAM used by the RAMFiles is accounted for, and once it gets too high (10% of the RAM buffer size), we forcefully idle the incoming threads until the out-of-orderness is resolved. E.g., in this case thread 2 will stall until thread 1 has finished its doc. That byte accounting does account for the allocated-but-not-used byte[1024] buffers inside RAMFile (we use RAMFile.sizeInBytes()).

So... this is not really a memory leak. But it is a potential starvation issue, in that if your PerDoc instances all grow large RAMFiles over time (as each has had to service a very large document), then the amount of concurrency that DW allows can become pinched, especially if these docs are large relative to your RAM buffer size. Are you hitting this issue? I.e., seeing poor concurrency during indexing despite using many threads, because DW is forcefully idling the threads?
It should only happen if you sometimes index docs that are larger than ramBufferSize/10/numberOfIndexingThreads. I'll work out a fix. I think we should fix RAMFile.reset to trim its buffers using ArrayUtil.getShrinkSize.
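The proposed fix (trimming RAMFile's buffers on reset) can be sketched in a few self-contained lines. This is a model, not Lucene's actual RAMFile or ArrayUtil code; in particular the "keep half" shrink target below is an illustrative assumption standing in for whatever policy ArrayUtil.getShrinkSize applies.

```java
import java.util.ArrayList;
import java.util.List;

public class RAMFileShrinkDemo {
    // Stand-in for RAMFile: a list of fixed-size byte blocks that, in the
    // buggy version, only ever grows.
    static final int BLOCK = 1024;
    final List<byte[]> blocks = new ArrayList<>();

    void write(int numBlocks) {
        while (blocks.size() < numBlocks) blocks.add(new byte[BLOCK]);
    }

    long sizeInBytes() { return (long) blocks.size() * BLOCK; }

    // The fix: on reset, trim toward a shrink target instead of retaining
    // every block a past huge document forced us to allocate.
    void reset() {
        int target = Math.max(1, blocks.size() / 2); // assumed shrink policy
        while (blocks.size() > target) blocks.remove(blocks.size() - 1);
    }

    public static void main(String[] args) {
        RAMFileShrinkDemo file = new RAMFileShrinkDemo();
        file.write(1000);                        // one huge document: ~1 MB held
        file.reset();                            // trim instead of keeping it forever
        file.write(1);                           // tiny documents from now on
        System.out.println(file.sizeInBytes());  // 512000
    }
}
```

Shrinking gradually rather than freeing everything on reset keeps the common case (similar-sized docs) allocation-free while still letting a pooled PerDoc recover after one outlier document.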
[jira] Updated: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2283: --- Fix Version/s: 3.1
[jira] Assigned: (LUCENE-2282) Expose IndexFileNames as public, and make use of its methods in the code
[ https://issues.apache.org/jira/browse/LUCENE-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-2282: -- Assignee: Michael McCandless

Expose IndexFileNames as public, and make use of its methods in the code
- Key: LUCENE-2282 URL: https://issues.apache.org/jira/browse/LUCENE-2282 Project: Lucene - Java Issue Type: Improvement Reporter: Shai Erera Assignee: Michael McCandless Fix For: 3.1 Attachments: LUCENE-2282.patch, LUCENE-2282.patch, LUCENE-2282.patch

IndexFileNames is useful for applications that extend Lucene, and in particular those that extend Directory or IndexWriter. It provides useful constants and methods to query whether a certain file is a core Lucene file or not. In addition, IndexFileNames should be used by Lucene's code to generate segment file names, or to query whether a certain file matches a certain extension. I'll post the patch shortly.
[jira] Commented: (LUCENE-2280) IndexWriter.optimize() throws NullPointerException
[ https://issues.apache.org/jira/browse/LUCENE-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837744#action_12837744 ] Ritesh Nigam commented on LUCENE-2280: --

bq. Are you sure you're using a stock version 2.3.2 of Lucene?
Yes, I checked the manifest of the jar.

bq. I ask because... the line numbers in SegmentMerger (specifically 566) don't correlate to 2.3.2. The other line numbers do match. It's odd. But looking at the code I don't see how either of the arrays being passed to System.arraycopy can be null. Can you turn on IndexWriter's infoStream and capture/post the output?
I have turned on the infoStream for IndexWriter; it will take some time to get the result. Once I get the result I will post it.

bq. It's also strange that this leads to index corruption; it shouldn't (the merge should just fail, and the index should be untouched). Can you run CheckIndex on the index and post what corruption it uncovers?
By index corruption I mean that the main index file is getting deleted and search is not returning the expected results. Since no index file exists after the NullPointerException, I cannot run CheckIndex.

bq. Does this happen in a Sun JRE?
I have not yet tested the same scenario on a Sun JRE.
IndexWriter.optimize() throws NullPointerException
- Key: LUCENE-2280 URL: https://issues.apache.org/jira/browse/LUCENE-2280 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.3.2 Environment: Win 2003, lucene version 2.3.2, IBM JRE 1.6 Reporter: Ritesh Nigam

I am using the Lucene 2.3.2 search APIs for my application. I am indexing a 45GB database, which creates an approx. 200MB index file. After finishing the indexing, while running optimize() I can see a NullPointerException thrown in my log, and the index file is getting corrupted. The log says:

Caused by: java.lang.NullPointerException
at org.apache.lucene.store.BufferedIndexOutput.writeBytes(BufferedIndexOutput.java:49)
at org.apache.lucene.store.IndexOutput.writeBytes(IndexOutput.java:40)
at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:566)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:135)
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3273)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2968)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)

and this is happening quite frequently, although I am not able to reproduce it on demand. I saw a logged issue that is somewhat related to mine (http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200809.mbox/%3c6e4a40db-5efc-42da-a857-d59f4ec34...@mikemccandless.com%3e), but the only difference here is that I am not using Store.Compress for my fields; I am using Store.NO instead. Please note that I am using an IBM JRE for my application. Is this an issue with Lucene? If yes, in which version is it fixed?
[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet
[ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837752#action_12837752 ] Simon Willnauer commented on LUCENE-2279: -

bq. Should we deprecate (eventually, remove) Analyzer.tokenStream?
I would totally agree with that, but I guess we cannot remove this method until Lucene 4.0, which will be, hmm, in 2020 :) - just joking

bq. Maybe we should absorb ReusableAnalyzerBase back into Analyzer?
That would be the logical consequence, but the problem with ReusableAnalyzerBase is that it will break backwards compat if moved to Analyzer. It assumes both #reusableTokenStream and #tokenStream to be final and introduces a new factory method. Yet, as an analyzer developer you really want to use the new ReusableAnalyzerBase in favor of Analyzer in 99% of the cases: it requires you to write half the code and gives you reusability of the TokenStream.

bq. I think Lucene/Solr/Nutch need to eventually get to this point
Huge +1 from my side. This could also unify the factory pattern Solr uses to build TokenStreams. I would stop right here and ask to discuss it on the dev list. Thoughts, Mike?!
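The reuse design being discussed (a final reuse entry point in the base class plus a single factory method for subclasses) can be sketched without any Lucene dependency. The names below are illustrative, not ReusableAnalyzerBase's exact API; the point is only the shape: subclasses write one creation method, and the base class caches and reuses the stream per thread.

```java
public class ReuseDemo {
    // Stand-in for a TokenStream: reset() is called once per reuse.
    static class Stream { int resets = 0; void reset() { resets++; } }

    abstract static class Base {
        private final ThreadLocal<Stream> cached = new ThreadLocal<>();

        // The one method subclasses implement; invoked only on a cache miss.
        abstract Stream createComponents();

        // Final in the real design, so subclasses cannot accidentally
        // reintroduce a create-per-call path.
        final Stream reusableStream() {
            Stream s = cached.get();
            if (s == null) { s = createComponents(); cached.set(s); }
            s.reset();
            return s;
        }
    }

    public static void main(String[] args) {
        Base a = new Base() { Stream createComponents() { return new Stream(); } };
        Stream s1 = a.reusableStream();
        Stream s2 = a.reusableStream();
        System.out.println(s1 == s2);   // true: one allocation per thread, then reuse
    }
}
```

This is also why making the two legacy methods final matters: once a subclass can override tokenStream, every call site has to assume the expensive path is possible, which defeats the cache above.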
[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet
[ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837759#action_12837759 ] Robert Muir commented on LUCENE-2279: -

bq. Yet, as an analyzer developer you really want to use the new ReusableAnalyzerBase in favor of Analyzer in 99% of the cases and it will require you writing half of the code plus gives you reusability of the tokenStream
And the 1% of extremely advanced cases that can't reuse can just use TokenStreams directly when indexing; e.g., the Analyzer class could be reusable by definition. We shouldn't let these obscure cases slow down everyone else.

bq. It assumes both #reusableTokenStream and #tokenStream to be final
In my opinion all the core analyzers (you already fixed contrib) should be final. This is another trap: if you subclass one of these analyzers and implement tokenStream, it's immediately slow due to the backwards-compat code.

bq. I think Lucene/Solr/Nutch need to eventually get to this point
If this is what we should do to remove the code duplication, then I am all for it. I still don't quite understand how it gives us more freedom to break/change the APIs; I mean, however we label this stuff, a break is a break to the user at the end of the day.
[jira] Updated: (LUCENE-2111) Wrapup flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2111: Attachment: LUCENE-2111_experimental.patch

Attached is a patch that changes various exposed APIs to use @lucene.experimental. I didn't mess with IndexFileNames, as there is an open issue about it right now.

Wrapup flexible indexing
- Key: LUCENE-2111 URL: https://issues.apache.org/jira/browse/LUCENE-2111 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: Flex Branch Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.1 Attachments: LUCENE-2111-EmptyTermsEnum.patch, LUCENE-2111-EmptyTermsEnum.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111_bytesRef.patch, LUCENE-2111_experimental.patch, LUCENE-2111_fuzzy.patch

Spinoff from LUCENE-1458. The flex branch is in fairly good shape -- all tests pass, initial search performance testing looks good, and it survived several visits from the Unicode policeman ;) But it still has a number of nocommits, could use some more scrutiny (especially on the emulate-old-API-on-flex-index and vice-versa code paths), and still needs some more performance testing. I'll do these under this issue, and we should open separate issues for other self-contained fixes. The end is in sight!
[jira] Assigned: (LUCENE-2272) PayloadNearQuery has hardwired explanation for 'AveragePayloadFunction'
[ https://issues.apache.org/jira/browse/LUCENE-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned LUCENE-2272: --- Assignee: Grant Ingersoll

PayloadNearQuery has hardwired explanation for 'AveragePayloadFunction'
- Key: LUCENE-2272 URL: https://issues.apache.org/jira/browse/LUCENE-2272 Project: Lucene - Java Issue Type: Bug Components: Search Reporter: Peter Keegan Assignee: Grant Ingersoll Attachments: payloadfunctin-patch.txt

The 'explain' method in PayloadNearSpanScorer assumes the AveragePayloadFunction was used. This patch adds the 'explain' method to the 'PayloadFunction' interface, where the Scorer can call it. Added unit tests for 'explain' and for {Min,Max}PayloadFunction.
[jira] Commented: (LUCENE-2272) PayloadNearQuery has hardwired explanation for 'AveragePayloadFunction'
[ https://issues.apache.org/jira/browse/LUCENE-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837788#action_12837788 ] Grant Ingersoll commented on LUCENE-2272: -

Peter, a couple of comments:
* The base explain method can't be abstract. Something like:
{code}
public Explanation explain(int docId) {
  Explanation result = new Explanation();
  result.setDescription("Unimpl Payload Function Explain");
  result.setValue(1);
  return result;
}
{code}
should do the trick.
* The changes don't seem thread-safe any more, since there are now member variables. It may still be all right, but have you looked at this aspect?
[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet
[ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837792#action_12837792 ] Michael McCandless commented on LUCENE-2279:

bq. I would stop right here and ask to discuss it on the dev list, thoughts mike?!
Agreed... I'll start a thread.

{quote}
bq. Maybe we should absorb ReusableAnalyzerBase back into Analyzer?
That would be the logical consequence but the problem with ReusableAnalyzerBase is that it will break bw compat if moved to Analyzer.
{quote}
Right, this is why I was thinking that if we make a new analyzers package, it's a chance to break/improve things. We'd have a single abstract base class that only exposes the reuse API.

bq. in my opinion all the core analyzers (you already fixed contrib) should be final.
I agree, and we should consistently take this approach with the new analyzers package...

bq. i still don't quite understand how it gives us more freedom to break/change the APIs, i mean however we label this stuff, a break is a break to the user at the end of the day.
Because it'd be an entirely new package, we can create a new base Analyzer class (in that package) that breaks/fixes things when compared to Lucene's Analyzer class. We'd eventually deprecate the analyzers/tokenizers/token filters in Lucene/Solr/Nutch in favor of this new package, and users can switch over on their own schedule.
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837793#action_12837793 ] Tim Smith commented on LUCENE-2283: ---

I came across this issue looking for a reported memory leak during indexing. A YourKit snapshot showed that the PerDocs for an IndexWriter were using ~40MB of memory (at which point I came across this potentially unbounded memory use in StoredFieldsWriter). This snapshot seems more or less at a stable point (memory grows but then returns to a normal state); however, I have reports that eventually the memory is completely exhausted, resulting in out-of-memory errors. I so far have not found any other major culprit in the Lucene indexing code. This index receives a routine mix of very large and very small documents (which would explain this situation). The VM and system have a more than ample amount of memory given the buffer size and what should be normal indexing RAM requirements.

Also, a major difference between this leak not occurring and it showing up is that previously the IndexWriter was closed when performing commits; now the IndexWriter remains open (just calling IndexWriter.commit()). So, if any memory is leaking during indexing, it is no longer being reclaimed during commit. As a side note, closing the IndexWriter at commit time would sometimes fail, resulting in some subsequent updates failing because the index writer was locked and couldn't be reopened until the old index writer was garbage collected, so I don't want to go back to that for commits.

It's possible there is a leak somewhere else (I currently do not have a snapshot from right before the out-of-memory issues occur, so for now the only thing that stands out is the PerDoc memory use).

As far as a fix goes, wouldn't it be better to have the RAMFiles used for stored fields pull and return byte buffers from the byte block pool on the DocumentsWriter?
This would allow the memory to be reclaimed based on the index writer's buffer size (otherwise there is no configurable way to tune this memory use).
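Tim's alternative (RAMFiles drawing their blocks from a shared, accounted pool instead of each PerDoc hoarding its own) can be sketched in isolation. This is a hypothetical model, not DocumentsWriter's actual byte block pool; the class and field names are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SharedBlockPool {
    static final int BLOCK = 1024;
    final Deque<byte[]> free = new ArrayDeque<>();
    long bytesInUse = 0;

    // Hand out a recycled block when one is available. Either way the
    // shared accounting (bytesInUse) sees the allocation, so a single
    // flush/trim policy can act on total stored-fields memory.
    byte[] getBlock() {
        bytesInUse += BLOCK;
        byte[] b = free.poll();
        return b != null ? b : new byte[BLOCK];
    }

    // Returning blocks makes a huge document's memory reusable by the next
    // document, instead of being stranded inside one pooled PerDoc.
    void recycle(byte[] b) {
        bytesInUse -= BLOCK;
        free.push(b);
    }

    public static void main(String[] args) {
        SharedBlockPool pool = new SharedBlockPool();
        byte[] big = pool.getBlock();
        pool.recycle(big);
        byte[] next = pool.getBlock();   // reuses the recycled block
        System.out.println(next == big); // true
        System.out.println(pool.bytesInUse); // 1024
    }
}
```

The design point is the one the comment makes: with per-PerDoc buffers there is no knob that bounds them, whereas a shared pool inherits whatever limit the writer's RAM buffer configuration already imposes.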
[jira] Commented: (LUCENE-2111) Wrapup flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837795#action_12837795 ] Robert Muir commented on LUCENE-2111: - These tags were added in revision 915791.
[jira] Commented: (LUCENE-2280) IndexWriter.optimize() throws NullPointerException
[ https://issues.apache.org/jira/browse/LUCENE-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837799#action_12837799 ] Ritesh Nigam commented on LUCENE-2280: -- Attaching the lucene.jar which I am using for my application.
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2280) IndexWriter.optimize() throws NullPointerException
[ https://issues.apache.org/jira/browse/LUCENE-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ritesh Nigam updated LUCENE-2280: - Attachment: lucene.jar (the lucene.jar my application is using)
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837811#action_12837811 ] Michael McCandless commented on LUCENE-2283:

bq. a yourkit snapshot showed that the PerDocs for an IndexWriter were using ~40M of memory

What was IW's ramBufferSizeMB when you saw this?

bq. however i have reports that eventually the memory is completely exhausted resulting in out of memory errors.

Hmm, that makes me nervous, because I think in this case the use should be bounded.

bq. As a side note, closing the index writer at commit time would sometimes fail, resulting in some following updates to fail because the index writer was locked and couldn't be reopened until the old index writer was garbage collected, so i don't want to go back to this for commits.

That doesn't sound good! Can you post some details on this (eg an exception)? But, anyway, keeping the same IW open and just calling commit is (should be) fine.

bq. As far as a fix goes, wouldn't it be better to have the RAMFile's used for stored fields pull and return byte buffers from the byte block pool on the DocumentsWriter?

Yes, that's a great solution -- a single pool. But that's a somewhat bigger change. I guess we can pass a byte[] allocator to RAMFile. It'd have to be a new pool, too (DW's byte blocks are 32KB, not the 1KB that RAMFile uses).

Possible Memory Leak in StoredFieldsWriter
Key: LUCENE-2283
URL: https://issues.apache.org/jira/browse/LUCENE-2283
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless
Fix For: 3.1

StoredFieldsWriter creates a pool of PerDoc instances. This pool will grow but is never reclaimed by any mechanism. Furthermore, each PerDoc instance contains a RAMFile, and this RAMFile is also never truncated (it will only ever grow), as far as I can tell. When feeding documents with a large number of stored fields (or one large, dominating stored field), this can result in memory being consumed in the RAMFile but never reclaimed. Eventually each pooled PerDoc could grow very large, even if large documents are rare. It seems like there should be some attempt to reclaim memory from the PerDoc[] instance pool (or to otherwise limit the size of the RAMFiles that are cached), etc.
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837821#action_12837821 ] Tim Smith commented on LUCENE-2283: ---

ramBufferSizeMB is 64MB. Here's the yourkit breakdown per class:

* DocumentsWriter - 256 MB
** TermsHash - 38.7 MB
** StoredFieldsWriter - 37.5 MB
** DocumentsWriterThreadState - 36.2 MB
** DocumentsWriterThreadState - 34.6 MB
** DocumentsWriterThreadState - 33.8 MB
** DocumentsWriterThreadState - 27.5 MB
** DocumentsWriterThreadState - 13.4 MB

I'm starting to dig into the ThreadStates now to see if anything stands out here.

bq. Hmm, that makes me nervous, because I think in this case the use should be bounded.

I should be getting a new profile dump at crash time soon, so hopefully that will make things clearer.

bq. That doesn't sound good! Can you post some details on this (eg an exception)?

If I recall correctly, the exception was caused by an out-of-disk-space situation. Obviously not much can be done about that other than adding more disk space; the situation would recover, but docs would be lost in the interim.

bq. But, anyway, keeping the same IW open and just calling commit is (should be) fine.

Yeah, this should be the way to go, especially as it results in the pooled buffers not needing to be reallocated/reclaimed/etc. However, right now this is the only change I can currently think of that could result in memory issues.

bq. Yes, that's a great solution - a single pool. But that's a somewhat bigger change.

Seems like this would be the best approach, as it makes the memory bounded by the configuration of the engine, giving better reuse of byte blocks and a better ability to reclaim memory (in DocumentsWriter.balanceRAM()).
MatchAllDocsQueryNode toString() creates invalid XML-Tag
Hi, I am just getting my feet wet with the query parser in contrib/queryparser. This new API is really a huge improvement. I am using it to convert Solr-style input into a custom XML-based format we use to query third-party search engines. I encountered the following: MatchAllDocsQueryNode returns <matchAllDocs field='*' term='*'> in its toString() method. Is this on purpose? Is it meant to be closed elsewhere? If not, I'll happily open a JIRA issue and provide a patch for it. Thanks, frank
[jira] Updated: (LUCENE-2111) Wrapup flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2111: --- Attachment: LUCENE-2111.patch

Attached patch, fixing some more nocommits, and renaming BytesRef.toString -> BytesRef.utf8ToString.

Wrapup flexible indexing
Key: LUCENE-2111
URL: https://issues.apache.org/jira/browse/LUCENE-2111
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Affects Versions: Flex Branch
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.1
Attachments: LUCENE-2111-EmptyTermsEnum.patch, LUCENE-2111-EmptyTermsEnum.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111_bytesRef.patch, LUCENE-2111_experimental.patch, LUCENE-2111_fuzzy.patch

Spinoff from LUCENE-1458. The flex branch is in fairly good shape -- all tests pass, initial search performance testing looks good, and it survived several visits from the Unicode policeman ;) But it still has a number of nocommits, could use some more scrutiny (especially on the "emulate old API on flex index" and vice-versa code paths), and still needs some more performance testing. I'll do these under this issue, and we should open separate issues for other self-contained fixes. The end is in sight!
Re: MatchAllDocsQueryNode toString() creates invalid XML-Tag
This sounds like a bug -- can you open an issue? Thanks! Mike

On Wed, Feb 24, 2010 at 10:04 AM, Frank Wesemann f.wesem...@fotofinder.net wrote: ...
[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set&lt;String&gt; instead of CharArraySet
[ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837859#action_12837859 ] Michael McCandless commented on LUCENE-2279:

{quote} bq. I would stop right here and ask to discuss it on the dev list, thoughts mike?!

Agreed... I'll start a thread. {quote}

OK I just started a thread on general@

eliminate pathological performance on StopFilter when using a Set&lt;String&gt; instead of CharArraySet
Key: LUCENE-2279
URL: https://issues.apache.org/jira/browse/LUCENE-2279
Project: Lucene - Java
Issue Type: Improvement
Reporter: thushara wijeratna
Priority: Minor

Passing a Set&lt;String&gt; to a StopFilter instead of a CharArraySet results in a very slow filter. This is because Analyzer.tokenStream() is called for each document, which ends up constructing the StopFilter (if one is used). And if a regular Set&lt;String&gt; is used in the StopFilter, all the elements of the set are copied to a CharArraySet, as we can see in its ctor:

public StopFilter(boolean enablePositionIncrements, TokenStream input, Set stopWords, boolean ignoreCase) {
  super(input);
  if (stopWords instanceof CharArraySet) {
    this.stopWords = (CharArraySet)stopWords;
  } else {
    this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
    this.stopWords.addAll(stopWords);
  }
  this.enablePositionIncrements = enablePositionIncrements;
  init();
}

I feel we should make the StopFilter signature specific, as in specifying CharArraySet vs Set, and there should be a JavaDoc warning on using the other variants of StopFilter, as they all result in a copy for each invocation of Analyzer.tokenStream().
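The per-invocation copy is easy to see in a toy model of the ctor above. This is a self-contained sketch: CharStopSet stands in for Lucene's CharArraySet and the names are hypothetical, but the branch structure mirrors the quoted constructor.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the StopFilter ctor quoted above: passing a plain Set
// triggers a full copy on every construction, while passing the
// already-optimized set type reuses it with no copy.
public class StopFilterCopyDemo {
    // Stand-in for CharArraySet (hypothetical, for illustration only).
    public static class CharStopSet extends HashSet<String> {
        public CharStopSet(Set<String> words) {
            super(words);   // copy every element, like CharArraySet.addAll
            copies++;
        }
    }

    // Counts how many full copies have been made.
    public static int copies = 0;

    // Mirrors the instanceof check in the StopFilter ctor.
    public static CharStopSet toStopSet(Set<String> stopWords) {
        if (stopWords instanceof CharStopSet) {
            return (CharStopSet) stopWords; // reused, no copy
        }
        return new CharStopSet(stopWords);  // copied on every call
    }
}
```

Building the stop set as the optimized type once, up front, and handing that same instance to every filter construction is exactly the usage the proposed JavaDoc warning would steer people toward.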
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837865#action_12837865 ] Michael McCandless commented on LUCENE-2283:

{quote} ramBufferSizeMB is 64MB Here's the yourkit breakdown per class: {quote}

Hmm -- spooky. With the ram buffer @ 64MB, DocumentsWriter is using 256MB!? Something is clearly amiss. The 40 MB used by StoredFieldsWriter's PerDocs still leaves 152 MB unaccounted for... hmm.

bq. If i recall correctly, I think the exception was caused by an out of disk space situation (which would recover)

Oh OK. Though... closing the IW vs calling IW.commit should be no different in that regard. Both should have the same transient disk space usage. It's odd you'd see out-of-disk for .close but not also for .commit.

bq. Seems like this would be the best approach as it makes the memory bounded by the configuration of the engine, giving better reuse of byte blocks and better ability to reclaim memory (in DocumentsWriter.balanceRAM())

I agree. I'll mull over how to do it... unless you're planning on consing up a patch ;) How many threads do you pass through IW? Are the threads forever (from a static pool), or do they come and go? I'd like to try to simulate your usage (huge docs + tiny docs) in my dev area to see if I can provoke the same behavior.
[jira] Updated: (LUCENE-2111) Wrapup flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2111: Attachment: LUCENE-2111_toString.patch

Here are a few more toString -> utf8ToString renames. Will look at the backwards tests now.
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837875#action_12837875 ] Tim Smith commented on LUCENE-2283: ---

bq. I agree. I'll mull over how to do it... unless you're planning on consing up a patch

I'd love to, but don't have the free cycles at the moment :(

bq. How many threads do you pass through IW?

I honestly don't 100% know about the origin of the threads I'm given. In general, they should be from a static pool, but may be dynamically allocated if the static pool runs out. One thought I had recently was to control this more tightly by having a limited number of static threads call the IndexWriter methods, in case that was the issue (but that would be a pretty big change).
[jira] Commented: (LUCENE-2282) Expose IndexFileNames as public, and make use of its methods in the code
[ https://issues.apache.org/jira/browse/LUCENE-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837879#action_12837879 ] Michael McCandless commented on LUCENE-2282:

Patch looks good Shai! But I don't think we should back-port to 3.0.2 -- it's non-trivial enough that there is some risk? As the API is now marked @lucene.internal, and it'll only be very expert usage, I'm not as concerned as Marvin is about the risks of even exposing this. Also, even with flex, a good number of Lucene's index files are not under codec control (the codec only touches the postings files -- .tis, .tii, .frq, .prx for the standard codec). But I do agree it's not ideal that the knowledge of file extensions is split across this class and the codec. The IndexFileNameFilter in flex now takes a Codec as input to make up for that... but IndexFileNames just has a NOTE at the top stating the limitation.

Expose IndexFileNames as public, and make use of its methods in the code
Key: LUCENE-2282
URL: https://issues.apache.org/jira/browse/LUCENE-2282
Project: Lucene - Java
Issue Type: Improvement
Reporter: Shai Erera
Assignee: Michael McCandless
Fix For: 3.1
Attachments: LUCENE-2282.patch, LUCENE-2282.patch, LUCENE-2282.patch

IndexFileNames is useful for applications that extend Lucene, and in particular those that extend Directory or IndexWriter. It provides useful constants and methods to query whether a certain file is a core Lucene file or not. In addition, IndexFileNames should be used by Lucene's code to generate segment file names, or to query whether a certain file matches a certain extension. I'll post the patch shortly.
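The kind of extension query IndexFileNames offers can be sketched in a few lines of plain Java. The method names and extension list below are illustrative (the codec-controlled extensions are the ones Mike lists above for the standard codec); this is not the actual Lucene API.

```java
// Illustrative sketch of filename/extension queries in the spirit of
// IndexFileNames. Names here are hypothetical, not Lucene's real API.
public class IndexFileNameUtil {
    // Per the comment above, only the postings files are under codec
    // control for the standard codec: .tis, .tii, .frq, .prx.
    private static final String[] CODEC_EXTENSIONS = { "tis", "tii", "frq", "prx" };

    // True if the file name carries the given extension.
    public static boolean matchesExtension(String filename, String ext) {
        return filename.endsWith("." + ext);
    }

    // True if the file belongs to the set the codec controls.
    public static boolean isCodecControlled(String filename) {
        for (String ext : CODEC_EXTENSIONS) {
            if (matchesExtension(filename, ext)) return true;
        }
        return false;
    }

    // Compose a per-segment file name from a segment name and extension.
    public static String segmentFileName(String segment, String ext) {
        return segment + "." + ext;
    }
}
```

Centralizing these checks in one public (if @lucene.internal) class is what keeps extensions of Directory or IndexWriter from hard-coding file-name knowledge themselves.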
[jira] Commented: (LUCENE-2282) Expose IndexFileNames as public, and make use of its methods in the code
[ https://issues.apache.org/jira/browse/LUCENE-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837880#action_12837880 ] Shai Erera commented on LUCENE-2282:

bq. But I don't think we should back port to 3.0.2

Ok, I can live w/ 3.1, as long as it's not released at the end of 2010. I can for now put that part of my code in o.a.l.index, until 3.1 is out. As I wrote in the TestFileSwitchDirectory comment, this IMO has to go in, because otherwise it would (potentially) make the code of users of FSD fragile. Thanks for looking at this!
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837881#action_12837881 ] Tim Smith commented on LUCENE-2283: ---

The latest profile dump has pointed out a non-Lucene issue as causing some of the memory growth, so feel free to drop the priority. However, it still seems like using the byte pool for the stored fields would be good overall.
[jira] Commented: (LUCENE-2282) Expose IndexFileNames as public, and make use of its methods in the code
[ https://issues.apache.org/jira/browse/LUCENE-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837883#action_12837883 ] Uwe Schindler commented on LUCENE-2282: ---

bq. But I don't think we should back port to 3.0.2 - it's non-trivial enough that there is some risk?

Please no backport to 3.0.2, it's an API change. And we are not sure there will ever be a 3.0.2. BTW: Version 3.0.1 comes out on Friday at the latest and will appear on the mirrors soon!
[jira] Commented: (LUCENE-2280) IndexWriter.optimize() throws NullPointerException
[ https://issues.apache.org/jira/browse/LUCENE-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837892#action_12837892 ] Michael McCandless commented on LUCENE-2280:

Indeed that JAR is identical to 2.3.2. Weird. Not sure why the line number doesn't line up. Irks me.

bq. Here index corruption I mean that the main index file is getting deleted and search is not returning expected result. Hence there is no index file exists after the NullPointerException, I cannot run CheckIndex.

That's even stranger -- nothing should get deleted because a merge fails. Is it possible your app has an exception handler doing this? Or maybe this is a brand-new index, and it doesn't get properly closed (ie, no commit) when this exception is hit? If not... can you provide more details? An exception like this should have no impact on the original index. Please post the infoStream output when you get it, and report back whether this happens on Sun's JVM. But I still can't see how either of the arrays could be null here... this is a weird one. Are you using the latest updates to the IBM 1.6 JRE?
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837896#action_12837896 ] Michael McCandless commented on LUCENE-2283:

Yeah, it would be good to make the pool shared... It still bugs me that yourkit is claiming DW was using 256 MB when you've got a 64 MB ram buffer -- that's spooky.
[jira] Created: (LUCENE-2284) MatchAllDocsQueryNode toString() creates invalid XML-Tag
MatchAllDocsQueryNode toString() creates invalid XML-Tag
Key: LUCENE-2284
URL: https://issues.apache.org/jira/browse/LUCENE-2284
Project: Lucene - Java
Issue Type: Bug
Components: contrib/*
Environment: all
Reporter: Frank Wesemann

MatchAllDocsQueryNode.toString() returns <matchAllDocs field='*' term='*'>, which is invalid XML; it should read <matchAllDocs field='*' term='*' />.
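The fix amounts to emitting a self-closed element. A minimal sketch of such a toString, hypothetical and not the actual contrib/queryparser source:

```java
// Minimal sketch of the fix: emit a self-closed element so the output
// is well-formed XML. This mirrors the intent of the report above, not
// the real MatchAllDocsQueryNode source.
public class MatchAllDocsNodeDemo {
    public static String toQueryString() {
        // was: "<matchAllDocs field='*' term='*'>"   (unclosed start tag)
        return "<matchAllDocs field='*' term='*' />"; // self-closing, valid
    }
}
```

An element with no children must either be self-closed or paired with an explicit end tag; the unclosed form leaves any XML consumer of the dumped query tree with an unbalanced document.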
[jira] Updated: (LUCENE-2284) MatchAllDocsQueryNode toString() creates invalid XML-Tag
[ https://issues.apache.org/jira/browse/LUCENE-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Wesemann updated LUCENE-2284: --- Attachment: LUCENE-2284.patch This patch returns a valid XML element. MatchAllDocsQueryNode toString() creates invalid XML-Tag Key: LUCENE-2284 URL: https://issues.apache.org/jira/browse/LUCENE-2284 Project: Lucene - Java Issue Type: Bug Components: contrib/* Environment: all Reporter: Frank Wesemann Attachments: LUCENE-2284.patch MatchAllDocsQueryNode.toString() returns <matchAllDocs field='*' term='*'>, which is invalid XML; it should read <matchAllDocs field='*' term='*' />. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: MatchAllDocsQueryNode toString() creates invalid XML-Tag
Michael McCandless wrote: This sounds like a bug -- can you open an issue? Thanks! Created (LUCENE-2284) and added a patch. -- With kind regards, Frank Wesemann Fotofinder GmbH USt-IdNr. DE812854514 Software Entwicklung Web: http://www.fotofinder.com/ Potsdamer Str. 96 Tel: +49 30 25 79 28 90 10785 Berlin Fax: +49 30 25 79 28 999 Sitz: Berlin Amtsgericht Berlin Charlottenburg (HRB 73099) Geschäftsführer: Ali Paczensky
[jira] Assigned: (LUCENE-2284) MatchAllDocsQueryNode toString() creates invalid XML-Tag
[ https://issues.apache.org/jira/browse/LUCENE-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned LUCENE-2284: --- Assignee: Robert Muir MatchAllDocsQueryNode toString() creates invalid XML-Tag Key: LUCENE-2284 URL: https://issues.apache.org/jira/browse/LUCENE-2284 Project: Lucene - Java Issue Type: Bug Components: contrib/* Environment: all Reporter: Frank Wesemann Assignee: Robert Muir Attachments: LUCENE-2284.patch MatchAllDocsQueryNode.toString() returns <matchAllDocs field='*' term='*'>, which is invalid XML; it should read <matchAllDocs field='*' term='*' />. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (LUCENE-2284) MatchAllDocsQueryNode toString() creates invalid XML-Tag
[ https://issues.apache.org/jira/browse/LUCENE-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2284: Fix Version/s: 3.1 MatchAllDocsQueryNode toString() creates invalid XML-Tag Key: LUCENE-2284 URL: https://issues.apache.org/jira/browse/LUCENE-2284 Project: Lucene - Java Issue Type: Bug Components: contrib/* Environment: all Reporter: Frank Wesemann Assignee: Robert Muir Fix For: 3.1 Attachments: LUCENE-2284.patch MatchAllDocsQueryNode.toString() returns <matchAllDocs field='*' term='*'>, which is invalid XML; it should read <matchAllDocs field='*' term='*' />. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-2284) MatchAllDocsQueryNode toString() creates invalid XML-Tag
[ https://issues.apache.org/jira/browse/LUCENE-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837909#action_12837909 ] Robert Muir commented on LUCENE-2284: - Looks like it would be good to fix, as all the other query nodes return valid XML. Will commit in a day or 2 if no one objects. Thanks for reporting this, Frank. MatchAllDocsQueryNode toString() creates invalid XML-Tag Key: LUCENE-2284 URL: https://issues.apache.org/jira/browse/LUCENE-2284 Project: Lucene - Java Issue Type: Bug Components: contrib/* Environment: all Reporter: Frank Wesemann Assignee: Robert Muir Fix For: 3.1 Attachments: LUCENE-2284.patch MatchAllDocsQueryNode.toString() returns <matchAllDocs field='*' term='*'>, which is invalid XML; it should read <matchAllDocs field='*' term='*' />. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837919#action_12837919 ] Tim Smith commented on LUCENE-2283: --- Another note is that this was on a 64-bit VM. I've noticed that all the memsize calculations assume 4-byte pointers, so perhaps that can lead to more memory being used than would otherwise be expected (although 256 MB is still well over the 2X memory use that would potentially be expected in that case). Possible Memory Leak in StoredFieldsWriter -- Key: LUCENE-2283 URL: https://issues.apache.org/jira/browse/LUCENE-2283 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.4.1 Reporter: Tim Smith Assignee: Michael McCandless Fix For: 3.1 StoredFieldsWriter creates a pool of PerDoc instances. This pool will grow but never be reclaimed by any mechanism. Furthermore, each PerDoc instance contains a RAMFile; this RAMFile will also never be truncated (and will only ever grow), as far as I can tell. When feeding documents with a large number of stored fields (or one large dominating stored field), this can result in memory being consumed in the RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very large, even if large documents are rare. Seems like there should be some attempt to reclaim memory from the PerDoc[] instance pool (or otherwise limit the size of the RAMFiles that are cached), etc. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput
[ https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch updated LUCENE-2126: -- Attachment: lucene-2126.patch Updated patch to trunk. I'll have to make a change to the backwards-tests too, because moving the copyBytes() method from IndexOutput to DataOutput and changing its parameter from IndexInput to DataInput breaks drop-in compatibility. Split up IndexInput and IndexOutput into DataInput and DataOutput - Key: LUCENE-2126 URL: https://issues.apache.org/jira/browse/LUCENE-2126 Project: Lucene - Java Issue Type: Improvement Affects Versions: Flex Branch Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: Flex Branch Attachments: lucene-2126.patch, lucene-2126.patch I'd like to introduce the two new classes DataInput and DataOutput that contain all methods from IndexInput and IndexOutput that actually decode or encode data, such as readByte()/writeByte(), readVInt()/writeVInt(). Methods like getFilePointer(), seek(), close(), etc., which are not related to data encoding, but to files as input/output source stay in IndexInput/IndexOutput. This patch also changes ByteSliceReader/ByteSliceWriter to extend DataInput/DataOutput. Previously ByteSliceReader implemented the methods that stay in IndexInput by throwing RuntimeExceptions. See also LUCENE-2125. All tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
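The split described above moves pure encode/decode methods (writeByte()/readByte(), writeVInt()/readVInt(), copyBytes()) onto DataOutput/DataInput, while file-level concerns (getFilePointer(), seek(), close()) stay on IndexOutput/IndexInput. As a rough sketch of the kind of encoding logic that lives in the new base classes, here is the standard variable-length int scheme over plain java.io streams (this is an illustration, not a copy of Lucene's source):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Sketch of variable-length int coding: 7 payload bits per byte,
// high bit set on every byte except the last. Small values take one byte.
public class VIntCoding {
    public static void writeVInt(OutputStream out, int i) throws IOException {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80); // low 7 bits plus continuation flag
            i >>>= 7;
        }
        out.write(i); // final byte, high bit clear
    }

    public static int readVInt(InputStream in) throws IOException {
        int b = in.read();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.read();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Round-trips a value through an in-memory stream.
    public static int roundTrip(int i) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            writeVInt(out, i);
            return readVInt(new ByteArrayInputStream(out.toByteArray()));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(300)); // prints 300
    }
}
```

Because none of this touches a file pointer, it works over any byte source, which is exactly why ByteSliceReader/ByteSliceWriter can extend DataInput/DataOutput instead of throwing RuntimeExceptions for the file-oriented methods.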
[jira] Commented: (LUCENE-2282) Expose IndexFileNames as public, and make use of its methods in the code
[ https://issues.apache.org/jira/browse/LUCENE-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837988#action_12837988 ] Marvin Humphrey commented on LUCENE-2282: - {quote} As the API is now marked @lucene.internal, and it'll only be very expert usage, I'm not as concerned as Marvin is about the risks of even exposing this. {quote} Um, the only possible concerns I could have had were regarding public exposure of this API. If it's marked as internal, it's an implementation detail. Whether or not the dot is included in internal-use-only constant strings isn't something I'm going to waste a lot of time thinking about. ;) So now, not only do I really, really not care whether this goes in, I have no qualms about it either. Having users like Shai who are willing to recompile and regenerate to take advantage of experimental features is a big boon, as it allows us to test-drive features before declaring them stable. Designing optimal APIs without usability testing is difficult to impossible. Expose IndexFileNames as public, and make use of its methods in the code Key: LUCENE-2282 URL: https://issues.apache.org/jira/browse/LUCENE-2282 Project: Lucene - Java Issue Type: Improvement Reporter: Shai Erera Assignee: Michael McCandless Fix For: 3.1 Attachments: LUCENE-2282.patch, LUCENE-2282.patch, LUCENE-2282.patch IndexFileNames is useful for applications that extend Lucene, and in particular those that extend Directory or IndexWriter. It provides useful constants and methods to query whether a certain file is a core Lucene file or not. In addition, IndexFileNames should be used by Lucene's code to generate segment file names, or query whether a certain file matches a certain extension. I'll post the patch shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838017#action_12838017 ] Tim Smith commented on LUCENE-2283: --- i'm working up a patch for the shared byteblock pool for stored field buffers (found a few cycles) Possible Memory Leak in StoredFieldsWriter -- Key: LUCENE-2283 URL: https://issues.apache.org/jira/browse/LUCENE-2283 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.4.1 Reporter: Tim Smith Assignee: Michael McCandless Fix For: 3.1 StoredFieldsWriter creates a pool of PerDoc instances this pool will grow but never be reclaimed by any mechanism furthermore, each PerDoc instance contains a RAMFile. this RAMFile will also never be truncated (and will only ever grow) (as far as i can tell) When feeding documents with large number of stored fields (or one large dominating stored field) this can result in memory being consumed in the RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very large, even if large documents are rare. Seems like there should be some attempt to reclaim memory from the PerDoc[] instance pool (or otherwise limit the size of RAMFiles that are cached) etc -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
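The remedy discussed in this issue, reclaiming memory from the pool rather than letting every pooled buffer grow without bound, can be illustrated with a small sketch. This is a hypothetical pool, not the actual StoredFieldsWriter/PerDoc/RAMFile code: oversized buffers are simply dropped on release instead of being re-pooled, so one huge document cannot pin memory forever.

```java
import java.util.ArrayDeque;

// Hypothetical illustration of the proposed fix: a buffer pool that
// refuses to cache buffers above a size threshold, so rare huge
// documents do not permanently inflate the pool's footprint.
public class TrimmingBufferPool {
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();
    private final int maxPooledSize;

    public TrimmingBufferPool(int maxPooledSize) {
        this.maxPooledSize = maxPooledSize;
    }

    // Reuse a pooled buffer when one is large enough, else allocate fresh.
    public byte[] acquire(int size) {
        for (byte[] buf : free) {
            if (buf.length >= size) {
                free.remove(buf);
                return buf;
            }
        }
        return new byte[size];
    }

    // Re-pool small buffers; drop oversized ones for the GC to reclaim.
    public void release(byte[] buf) {
        if (buf.length <= maxPooledSize) {
            free.addLast(buf);
        }
    }

    public int pooledCount() {
        return free.size();
    }
}
```

A shared pool (as Michael suggests above) would apply the same idea across all per-thread writers instead of per instance; the key property either way is that release() bounds what the pool retains.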
[jira] Commented: (LUCENE-2167) StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters
[ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838060#action_12838060 ] Shyamal Prasad commented on LUCENE-2167: {quote} I don't think it really has to be, i actually am of the opinion StandardTokenizer should follow unicode standard tokenization. then we can throw subjective decisions away, and stick with a standard. {quote} Yep, I see I am going for the wrong ambition level and only tweaking the existing grammar. I'll take a crack at understanding unicode standard tokenization, as you'd suggested originally, and try and produce something as soon as I get a chance. I see your point. Cheers! Shyamal StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters Key: LUCENE-2167 URL: https://issues.apache.org/jira/browse/LUCENE-2167 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.4.1, 2.9, 2.9.1, 3.0 Reporter: Shyamal Prasad Priority: Minor Attachments: LUCENE-2167.patch, LUCENE-2167.patch Original Estimate: 0.5h Remaining Estimate: 0.5h The Javadoc for StandardTokenizer states: {quote} Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token. Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split. {quote} This is not accurate. The actual JFlex implementation treats hyphens interchangeably with punctuation. So, for example video,mp4,test results in a *single* token and not three tokens as the documentation would suggest. Additionally, the documentation suggests that video-mp4-test-again would become a single token, but in reality it results in two tokens: video-mp4-test and again. IMHO the parser implementation is fine as is since it is hard to keep everyone happy, but it is probably worth cleaning up the documentation string. 
The patch included here updates the documentation string and adds a few test cases to confirm the cases described above. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2167) StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters
[ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838068#action_12838068 ] Robert Muir commented on LUCENE-2167: - bq. I'll take a crack at understanding unicode standard tokenization, as you'd suggested originally, and try and produce something as soon as I get a chance. I would love it if you could produce a grammar that implemented UAX#29! If so, in my opinion it should become the StandardAnalyzer for the next Lucene version. If I thought I could do it correctly, I would have already done it, as the support for the Unicode properties needed to do this is now in the trunk of JFlex! Here are some references that might help: The standard itself: http://unicode.org/reports/tr29/ particularly the Testing portion: http://unicode.org/reports/tr41/tr41-5.html#Tests29 Unicode provides a WordBreakTest.txt file that we could use from JUnit to help verify correctness: http://www.unicode.org/Public/UNIDATA/auxiliary/WordBreakTest.txt I'll warn you I think it might be hard, but perhaps it's not that bad. In particular, the standard is defined in terms of chained rules, and JFlex doesn't support rule chaining, but I am not convinced we need rule chaining to implement WordBreak (maybe LineBreak, but maybe WordBreak can be done easily without it?). Steven Rowe is the expert on this stuff; maybe he has some ideas. StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters Key: LUCENE-2167 URL: https://issues.apache.org/jira/browse/LUCENE-2167 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.4.1, 2.9, 2.9.1, 3.0 Reporter: Shyamal Prasad Priority: Minor Attachments: LUCENE-2167.patch, LUCENE-2167.patch Original Estimate: 0.5h Remaining Estimate: 0.5h The Javadoc for StandardTokenizer states: {quote} Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token. 
Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split. {quote} This is not accurate. The actual JFlex implementation treats hyphens interchangeably with punctuation. So, for example video,mp4,test results in a *single* token and not three tokens as the documentation would suggest. Additionally, the documentation suggests that video-mp4-test-again would become a single token, but in reality it results in two tokens: video-mp4-test and again. IMHO the parser implementation is fine as is since it is hard to keep everyone happy, but it is probably worth cleaning up the documentation string. The patch included here updates the documentation string and adds a few test cases to confirm the cases described above. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2167) StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters
[ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838073#action_12838073 ] Robert Muir commented on LUCENE-2167: - btw, here is some statement that seems to confirm my suspicions, from the standard: In section 6.3, there is an example of the grapheme cluster boundaries converted into a simple regex (the kind we could do easily in jflex now that it has the properties available). They make this statement: Such a regular expression can also be turned into a fast, deterministic finite-state machine. Similar regular expressions are possible for Word boundaries. Line and Sentence boundaries are more complicated, and more difficult to represent with regular expressions. StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters Key: LUCENE-2167 URL: https://issues.apache.org/jira/browse/LUCENE-2167 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.4.1, 2.9, 2.9.1, 3.0 Reporter: Shyamal Prasad Priority: Minor Attachments: LUCENE-2167.patch, LUCENE-2167.patch Original Estimate: 0.5h Remaining Estimate: 0.5h The Javadoc for StandardTokenizer states: {quote} Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token. Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split. {quote} This is not accurate. The actual JFlex implementation treats hyphens interchangeably with punctuation. So, for example video,mp4,test results in a *single* token and not three tokens as the documentation would suggest. Additionally, the documentation suggests that video-mp4-test-again would become a single token, but in reality it results in two tokens: video-mp4-test and again. 
IMHO the parser implementation is fine as is since it is hard to keep everyone happy, but it is probably worth cleaning up the documentation string. The patch included here updates the documentation string and adds a few test cases to confirm the cases described above. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
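The UAX#29 word-break behavior under discussion is already observable in the JDK: java.text.BreakIterator's word instance follows the Unicode word boundary rules, so it gives a quick preview of what a UAX#29-based StandardTokenizer would do with the examples from this issue. A small sketch (not Lucene code; the token filtering by letter-or-digit is my own simplification):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Uses the JDK's UAX#29-based word BreakIterator to show how a
// Unicode-standard tokenizer segments the strings discussed above.
public class WordBreakDemo {
    public static List<String> tokens(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            // Keep only pieces containing a letter or digit (skip punctuation).
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("video,mp4,test"));
        System.out.println(tokens("video-mp4-test-again"));
    }
}
```

Under UAX#29 a comma or hyphen between letters is a break opportunity, so "video,mp4,test" segments into three words ("mp4" stays whole because letters and digits join), unlike the single token the current JFlex grammar produces.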
[jira] Issue Comment Edited: (LUCENE-2167) StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters
[ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838081#action_12838081 ] Steven Rowe edited comment on LUCENE-2167 at 2/24/10 11:27 PM: --- I wrote word break rules grammar specifications for JFlex 1.5.0-SNAPSHOT and both Unicode versions 5.1 and 5.2 - you can see the files here: http://jflex.svn.sourceforge.net/viewvc/jflex/trunk/testsuite/testcases/src/test/cases/unicode-word-break/ The files are {{UnicodeWordBreakRules_5_\*.\*}} - these are written to: parse the Unicode test files; run the generated scanner against each composed test string; output the break opportunities/prohibitions in the same format as the test files; and then finally compare the output against the test file itself, looking for a match. (These tests currently pass.) The .flex files would need to be significantly changed to be used as a StandardTokenizer replacement, but you can get an idea from them how to implement the Unicode word break rules in (as yet unreleased version 1.5.0) JFlex syntax. was (Author: steve_rowe): I wrote word break rules grammar specifications for JFlex 1.5.0-SNAPSHOT and both Unicode versions 5.1 and 5.2 - you can see the files here: http://jflex.svn.sourceforge.net/viewvc/jflex/trunk/testsuite/testcases/src/test/cases/unicode-word-break/ The files are UnicodeWordBreakRules_5_*.* - these are written to: parse the Unicode test files; run the generated scanner against each composed test string; output the break opportunities/prohibitions in the same format as the test files; and then finally compare the output against the test file itself, looking for a match. (These tests currently pass.) The .flex files would need to be significantly changed to be used as a StandardTokenizer replacement, but you can get an idea from them how to implement the Unicode word break rules in (as yet unreleased version 1.5.0) JFlex syntax. 
StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters Key: LUCENE-2167 URL: https://issues.apache.org/jira/browse/LUCENE-2167 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.4.1, 2.9, 2.9.1, 3.0 Reporter: Shyamal Prasad Priority: Minor Attachments: LUCENE-2167.patch, LUCENE-2167.patch Original Estimate: 0.5h Remaining Estimate: 0.5h The Javadoc for StandardTokenizer states: {quote} Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token. Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split. {quote} This is not accurate. The actual JFlex implementation treats hyphens interchangeably with punctuation. So, for example video,mp4,test results in a *single* token and not three tokens as the documentation would suggest. Additionally, the documentation suggests that video-mp4-test-again would become a single token, but in reality it results in two tokens: video-mp4-test and again. IMHO the parser implementation is fine as is since it is hard to keep everyone happy, but it is probably worth cleaning up the documentation string. The patch included here updates the documentation string and adds a few test cases to confirm the cases described above. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2167) StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters
[ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838081#action_12838081 ] Steven Rowe commented on LUCENE-2167: - I wrote word break rules grammar specifications for JFlex 1.5.0-SNAPSHOT and both Unicode versions 5.1 and 5.2 - you can see the files here: http://jflex.svn.sourceforge.net/viewvc/jflex/trunk/testsuite/testcases/src/test/cases/unicode-word-break/ The files are UnicodeWordBreakRules_5_*.* - these are written to: parse the Unicode test files; run the generated scanner against each composed test string; output the break opportunities/prohibitions in the same format as the test files; and then finally compare the output against the test file itself, looking for a match. (These tests currently pass.) The .flex files would need to be significantly changed to be used as a StandardTokenizer replacement, but you can get an idea from them how to implement the Unicode word break rules in (as yet unreleased version 1.5.0) JFlex syntax. StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters Key: LUCENE-2167 URL: https://issues.apache.org/jira/browse/LUCENE-2167 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.4.1, 2.9, 2.9.1, 3.0 Reporter: Shyamal Prasad Priority: Minor Attachments: LUCENE-2167.patch, LUCENE-2167.patch Original Estimate: 0.5h Remaining Estimate: 0.5h The Javadoc for StandardTokenizer states: {quote} Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token. Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split. {quote} This is not accurate. The actual JFlex implementation treats hyphens interchangeably with punctuation. So, for example video,mp4,test results in a *single* token and not three tokens as the documentation would suggest. 
Additionally, the documentation suggests that video-mp4-test-again would become a single token, but in reality it results in two tokens: video-mp4-test and again. IMHO the parser implementation is fine as is since it is hard to keep everyone happy, but it is probably worth cleaning up the documentation string. The patch included here updates the documentation string and adds a few test cases to confirm the cases described above. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2167) StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters
[ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838094#action_12838094 ] Robert Muir commented on LUCENE-2167: - Steven, thanks for providing the link. I guess this is the point where I also say, I think it would be really nice for StandardTokenizer to adhere straight to the standard as much as we can with jflex (I realize in 1.5, we won't have 0x support). Then its name would actually make sense. In my opinion, such a transition would involve something like renaming the old StandardTokenizer to EuropeanTokenizer, as its javadoc claims: {code} This should be a good tokenizer for most European-language documents {code} The new StandardTokenizer could then say {code} This should be a good tokenizer for most languages. {code} All the english/euro-centric stuff like the acronym/company/apostrophe stuff could stay with that EuropeanTokenizer or whatever its called, and it could be used by the european analyzers. but if we implement the Unicode rules, I think we should drop all this english/euro-centric stuff for StandardTokenizer. Otherwise it should be called *StandardishTokenizer*. we can obviously preserve the backwards compat with Version, as Uwe has created a way to use a different grammar for a different Version. I expect some -1 to this, waiting comments :) StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters Key: LUCENE-2167 URL: https://issues.apache.org/jira/browse/LUCENE-2167 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.4.1, 2.9, 2.9.1, 3.0 Reporter: Shyamal Prasad Priority: Minor Attachments: LUCENE-2167.patch, LUCENE-2167.patch Original Estimate: 0.5h Remaining Estimate: 0.5h The Javadoc for StandardTokenizer states: {quote} Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token. 
Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split. {quote} This is not accurate. The actual JFlex implementation treats hyphens interchangeably with punctuation. So, for example video,mp4,test results in a *single* token and not three tokens as the documentation would suggest. Additionally, the documentation suggests that video-mp4-test-again would become a single token, but in reality it results in two tokens: video-mp4-test and again. IMHO the parser implementation is fine as is since it is hard to keep everyone happy, but it is probably worth cleaning up the documentation string. The patch included here updates the documentation string and adds a few test cases to confirm the cases described above. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838101#action_12838101 ] Robert Muir commented on LUCENE-2074: - Uwe, given Steven's comment above, I think we should move forward with this issue and flex 1.5? Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer --- Key: LUCENE-2074 URL: https://issues.apache.org/jira/browse/LUCENE-2074 Project: Lucene - Java Issue Type: Bug Affects Versions: 3.0 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 3.1 Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch The current trunk version of StandardTokenizerImpl was generated by Java 1.4 (according to the warning). In Java 3.0 we switch to Java 1.5, so we should regenerate the file. After regeneration the Tokenizer behaves different for some characters. Because of that we should only use the new TokenizerImpl when Version.LUCENE_30 or LUCENE_31 is used as matchVersion. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-2278) FastVectorHighlighter: highlighted term is out of alignment in multi-valued NOT_ANALYZED field
[ https://issues.apache.org/jira/browse/LUCENE-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi resolved LUCENE-2278. Resolution: Fixed Committed revision 916090. FastVectorHighlighter: highlighted term is out of alignment in multi-valued NOT_ANALYZED field -- Key: LUCENE-2278 URL: https://issues.apache.org/jira/browse/LUCENE-2278 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.9, 2.9.1, 3.0 Reporter: Koji Sekiguchi Priority: Minor Fix For: 3.1 Attachments: LUCENE-2278.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2285) Code cleanup from all sorts of (trivial) warnings
Code cleanup from all sorts of (trivial) warnings
Key: LUCENE-2285
URL: https://issues.apache.org/jira/browse/LUCENE-2285
Project: Lucene - Java
Issue Type: Improvement
Reporter: Shai Erera
Priority: Minor
Fix For: 3.1

I would like to do some code cleanup and remove all sorts of trivial warnings, like unnecessary casts, problems w/ javadocs, unused variables, redundant null checks, unnecessary semicolons, etc. These are all very trivial and should not pose any problem. I'll create another issue for getting rid of deprecated code usage, like LuceneTestCase and all sorts of deprecated constructors. That's also trivial because it only affects Lucene code, but it's a different type of change. Another issue I'd like to create is about introducing more generics in the code where they're missing today - not changing existing API. There are many places in the code like that. So, with your permission, I'll start with the trivial ones first, and then move on to the others.
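The kinds of trivial warnings listed above can be illustrated with a small before/after sketch (hypothetical code, not taken from the Lucene source):

```java
// Hypothetical example of the trivial warnings the issue targets.
public class TrivialWarnings {
    // Before cleanup (illustrative):
    //   int unused = 0;                      // unused variable warning
    //   if (s != null && s != null) { ... }  // redundant null check
    //   return (int) s.length();             // unnecessary cast: length() already returns int
    public static int length(String s) {
        // After cleanup: no dead code, no redundant checks, no cast.
        return s == null ? 0 : s.length();
    }

    public static void main(String[] args) {
        System.out.println(length("lucene")); // 6
    }
}
```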
Adding .classpath.tmpl
Hi. I always find it annoying when I check out the code to a new project in Eclipse that I need to add everything I care about to the classpath, along with the dependent libraries. On another project I'm involved with, we did that process once - adding all the source code and libraries to the classpath - and created a .classpath.tmpl. Now when people check out the code, they can copy the content of that file into their .classpath file, and setting up the project drops from a couple of minutes to a few seconds. I don't want to check in .classpath itself because not everyone wants all the code in their classpath. I attached such a file to this mail. Note that the only dependency which will break on other machines is the ant.jar dependency, which on my Windows machine is located under c:\ant. That jar is required to compile contrib/ant from Eclipse. Not sure how to resolve that, except removing that line from the file and documenting separately that that's what you need to do if you want to add contrib/ant. The file is sorted by name, putting the core stuff at the top, so it's easy for people to selectively add the interesting packages. I don't know if an issue is required; if so, I can create one and move the discussion there. Shai

lucene.classpath.tmpl Description: Binary data
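For readers unfamiliar with the file, a .classpath.tmpl is just an Eclipse .classpath XML file kept under a template name so it isn't picked up automatically. A minimal illustrative fragment (the paths are made up for this sketch; the actual attachment's entries will differ) might look like:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative fragment only; real entries depend on the checkout layout. -->
<classpath>
    <classpathentry kind="src" path="src/java"/>
    <classpathentry kind="src" path="src/test"/>
    <classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER"/>
    <!-- Machine-specific entry: the ant.jar location varies (e.g. c:\ant on Windows). -->
    <classpathentry kind="lib" path="/usr/share/ant/lib/ant.jar"/>
    <classpathentry kind="output" path="build/eclipse"/>
</classpath>
```

Copying such a template to .classpath and adjusting the machine-specific lib entries is what reduces setup from minutes to seconds.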
[jira] Commented: (LUCENE-2285) Code cleanup from all sorts of (trivial) warnings
[ https://issues.apache.org/jira/browse/LUCENE-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838190#action_12838190 ] Shai Erera commented on LUCENE-2285:

Can someone please clarify these for me:

|| Description || Class || Line ||
| Unsupported @SuppressWarnings(SerializableHasSerializationMethods) | TestCustomScoreQuery.java | 87 |
| Unsupported @SuppressWarnings(SerializableHasSerializationMethods) | TestCustomScoreQuery.java | 123 |
| Unsupported @SuppressWarnings(UseOfSystemOutOrSystemErr) | TestFieldScoreQuery.java | 42 |
| Unsupported @SuppressWarnings(UseOfSystemOutOrSystemErr) | TestOrdValues.java | 37 |

Are these meant to be there and Eclipse just doesn't recognize them for some reason, or are they a mistake?
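For context: the string values accepted by @SuppressWarnings are compiler-specific. javac and Eclipse recognize values such as "unchecked" and "serial", while names like SerializableHasSerializationMethods and UseOfSystemOutOrSystemErr are IntelliJ IDEA inspection IDs, which would explain why Eclipse flags them as unsupported. A small sketch with a standard value (hypothetical code, not from the Lucene tests):

```java
import java.util.List;

// Hypothetical example of a @SuppressWarnings value that javac/Eclipse understand.
public class SuppressExamples {
    // "unchecked" is a standard value; IDE-specific inspection names are not.
    @SuppressWarnings("unchecked")
    public static List<String> castRaw(Object o) {
        return (List<String>) o; // unchecked cast, warning suppressed
    }

    public static void main(String[] args) {
        List<String> in = new java.util.ArrayList<>();
        in.add("ok");
        System.out.println(castRaw((Object) in).get(0));
    }
}
```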
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838216#action_12838216 ] Uwe Schindler commented on LUCENE-2074:

I will update the patch (using TEST_VERSION and so on) later, and then we can proceed.
[jira] Commented: (LUCENE-2285) Code cleanup from all sorts of (trivial) warnings
[ https://issues.apache.org/jira/browse/LUCENE-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838225#action_12838225 ] Shai Erera commented on LUCENE-2285:

bq. .. not willing to add these stupid @Test everywhere

I don't share that feeling. I think it's a strong capability - you can write a method which doesn't need to start with testXYZ just to be run by JUnit (though I do both for clarity). I think moving to JUnit 4 only simplifies things, as it allows testing classes w/o the need to extend TestCase. But I'm not going to argue about it here; I'd like to keep this issue contained and short. So I won't touch the LuceneTestCase deprecation, as it's still controversial judging by what you say. I'll remove those SuppressWarnings then? About generics: there are the internal parts of the code, like uses of List, ArrayList, etc. Scanning quickly through the list, it looks like most of the Lucene-related warnings are about referencing them, so it should also be easy to fix. I'll take a look at the code style settings (http://wiki.apache.org/lucene-java/HowToContribute?action=AttachFiledo=viewtarget=Eclipse-Lucene-Codestyle.xml?), but I'm talking about compiler warnings.
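The generics cleanup discussed in this thread - replacing raw types with parameterized ones so the compiler checks element types and casts disappear - can be sketched as follows (hypothetical code, not from the Lucene source):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical before/after for the "introduce more generics" cleanup.
public class GenericsCleanup {
    // Before (raw types, produces unchecked warnings and forces casts):
    //   Map fields = new HashMap();
    //   List values = (List) fields.get("title");
    //   String first = (String) values.get(0);
    // After (parameterized; no casts, compiler-checked):
    public static Map<String, List<String>> fields = new HashMap<>();

    public static void add(String field, String value) {
        fields.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
    }

    public static void main(String[] args) {
        add("title", "Lucene in Action");
        System.out.println(fields.get("title").get(0));
    }
}
```

Note that this kind of change stays internal: callers see the same method names, which is why it can be done without changing the existing API.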