[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set&lt;String&gt; instead of CharArraySet

2010-02-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837736#action_12837736
 ] 

Michael McCandless commented on LUCENE-2279:


Should we deprecate (eventually, remove) Analyzer.tokenStream?

Maybe we should absorb ReusableAnalyzerBase back into Analyzer?

Or maybe now is an opportune time to create a separate standalone
analyzers package (subproject under the Lucene tlp)?  We've broached
this idea in the past, and I think it's compelling.  I think
Lucene/Solr/Nutch need to eventually get to this point (where they
share analyzers from a single source), so maybe now is the time.

It'd be a single place where we would pull in all of Lucene's
core/contrib, plus Solr's analyzers, plus new analyzers Robert keeps
making ;) Robert's efforts to upgrade Solr's analyzers to 3.0
(currently a big patch waiting on SOLR-1657), plus his various other
pending analyzer bug fixes, could be done in this new analyzers
package.  And we could immediately fix problems we have with the
current analyzers API (like this reusable/tokenStream ambiguity).


 eliminate pathological performance on StopFilter when using a Set&lt;String&gt; 
 instead of CharArraySet
 -

 Key: LUCENE-2279
 URL: https://issues.apache.org/jira/browse/LUCENE-2279
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: thushara wijeratna
Priority: Minor

 Passing a Set&lt;String&gt; to a StopFilter instead of a CharArraySet results in a 
 very slow filter.
 This is because for each document, Analyzer.tokenStream() is called, which 
 ends up calling the StopFilter (if used). And if a regular Set&lt;String&gt; is 
 used in the StopFilter, all the elements of the set are copied to a 
 CharArraySet, as we can see in its constructor:
 public StopFilter(boolean enablePositionIncrements, TokenStream input,
                   Set stopWords, boolean ignoreCase)
 {
   super(input);
   if (stopWords instanceof CharArraySet) {
     this.stopWords = (CharArraySet) stopWords;
   } else {
     this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
     this.stopWords.addAll(stopWords);
   }
   this.enablePositionIncrements = enablePositionIncrements;
   init();
 }
 I feel we should make the StopFilter signature specific, as in specifying 
 CharArraySet vs Set, and there should be a JavaDoc warning on using the other 
 variants of StopFilter, as they all result in a copy for each invocation 
 of Analyzer.tokenStream().
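
 Until then, callers can avoid the copy entirely by building the CharArraySet 
 once and passing it to every StopFilter they construct. A minimal sketch 
 against the 2.9/3.0-era API quoted above (the stop-word list is illustrative):
{code}
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;

// Build the CharArraySet once, up front; StopFilter's instanceof check then
// adopts it directly instead of copying a plain Set<String> on every
// Analyzer.tokenStream() call.
public final class StopWords {
  public static final CharArraySet STOP_WORDS;
  static {
    CharArraySet set = new CharArraySet(16, /* ignoreCase = */ true);
    set.addAll(Arrays.asList("a", "an", "and", "of", "the"));
    STOP_WORDS = set;
  }

  private StopWords() {}
}
{code}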

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-02-24 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-2283:
--

Assignee: Michael McCandless

 Possible Memory Leak in StoredFieldsWriter
 --

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless

 StoredFieldsWriter creates a pool of PerDoc instances.
 This pool will grow but is never reclaimed by any mechanism.
 Furthermore, each PerDoc instance contains a RAMFile.
 This RAMFile will also never be truncated (and will only ever grow), as far 
 as I can tell.
 When feeding documents with a large number of stored fields (or one large, 
 dominating stored field), this can result in memory being consumed by the 
 RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
 large, even if large documents are rare.
 It seems like there should be some attempt to reclaim memory from the PerDoc[] 
 instance pool (or to otherwise limit the size of the RAMFiles that are cached).
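
 In miniature, the pattern described above and one possible mitigation look 
 like this (a hedged sketch, not Lucene's actual StoredFieldsWriter code; the 
 cap and trim threshold are arbitrary):
{code}
import java.util.ArrayDeque;
import java.util.Deque;

// A pool that returns buffers for reuse but refuses to retain oversized or
// surplus ones, so one huge document cannot permanently inflate the pool.
final class DocBufferPool {
  private static final int MAX_POOLED = 8;               // arbitrary cap
  private static final int MAX_RETAINED_BYTES = 1 << 16; // arbitrary trim threshold

  private final Deque<byte[]> free = new ArrayDeque<byte[]>();

  byte[] obtain(int minSize) {
    byte[] buffer = free.poll();
    return (buffer != null && buffer.length >= minSize) ? buffer : new byte[minSize];
  }

  void release(byte[] buffer) {
    // Drop oversized or surplus buffers so the pool cannot grow without bound.
    if (buffer.length <= MAX_RETAINED_BYTES && free.size() < MAX_POOLED) {
      free.offer(buffer);
    }
  }
}
{code}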

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-02-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837740#action_12837740
 ] 

Michael McCandless commented on LUCENE-2283:



TermVectorsTermsWriter has the same issue.

You're right: with irregularly sized documents coming through, you can
end up with PerDoc instances that waste space, because the RAMFile has
buffers allocated from past huge docs that the latest tiny docs don't
use.

Note that the number of outstanding PerDoc instances is a function of
how out-of-order the docs are being indexed, because a PerDoc
holds a doc's state only until that doc can be written to the store files
(stored fields, term vectors).  It's transient.

EG with a single thread, there will only be one PerDoc -- it's written
immediately.  With 2 threads, if you have a massive doc (which thread
1 gets stuck indexing) and then zillions of tiny docs (which thread 2
burns through, while thread 1 is busy), then you can get a large
number of PerDocs created, waiting for their turn because thread 1
hasn't finished yet.

But this process won't use unbounded RAM -- the RAM used by the
RAMFiles is accounted for, and once it gets too high (10% of the RAM
buffer size), we forcefully idle the incoming threads until the
out-of-orderness is resolved.  EG in this case, thread 2 will stall until
thread 1 has finished its doc.  That byte accounting does account for
the allocated but not used byte[1024] inside RAMFile (we use
RAMFile.sizeInBytes()).

So... this is not really a memory leak.  But it is a potential
starvation issue, in that if your PerDoc instances all grow to large
RAMFiles over time (as each has had to service a very large document),
then it can mean the amount of concurrency that DW allows will become
pinched.  Especially if these docs are large relative to your
ram buffer size.

Are you hitting this issue?  Ie seeing poor concurrency during
indexing despite using many threads, because DW is forcefully idling
the threads?  It should only happen if you sometimes index docs
that are larger than ramBufferSize/10/numberOfIndexingThreads.
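(For example, with a 64 MB RAM buffer and 4 indexing threads, that
threshold is 64/10/4 = 1.6 MB per document.)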

I'll work out a fix.  I think we should fix RAMFile.reset to trim its
buffers using ArrayUtil.getShrinkSize.
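
A hedged sketch of that direction, using a simplified stand-in for RAMFile
(the real class's fields differ, and the 50% slack below is only a
placeholder for whatever policy ArrayUtil.getShrinkSize implements):
{code}
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for RAMFile: a list of fixed-size buffers that, today,
// only ever grows.  reset() trims the tail once the allocated capacity is
// far beyond what the last document actually used.
final class TrimmableRAMFile {
  private static final int BUFFER_SIZE = 1024;
  private final List<byte[]> buffers = new ArrayList<byte[]>();
  private long length; // bytes written since the last reset

  byte[] addBuffer() {
    byte[] buffer = new byte[BUFFER_SIZE];
    buffers.add(buffer);
    return buffer;
  }

  void setLength(long length) { // called by the writer after each doc
    this.length = length;
  }

  void reset() {
    int used = (int) ((length + BUFFER_SIZE - 1) / BUFFER_SIZE);
    // Keep some slack for reuse, but free buffers left over from a past huge
    // document so one rare large doc cannot pin memory forever.
    int keep = Math.max(used, buffers.size() / 2);
    while (buffers.size() > keep) {
      buffers.remove(buffers.size() - 1);
    }
    length = 0;
  }
}
{code}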


 Possible Memory Leak in StoredFieldsWriter
 --

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless

 StoredFieldsWriter creates a pool of PerDoc instances.
 This pool will grow but is never reclaimed by any mechanism.
 Furthermore, each PerDoc instance contains a RAMFile.
 This RAMFile will also never be truncated (and will only ever grow), as far 
 as I can tell.
 When feeding documents with a large number of stored fields (or one large, 
 dominating stored field), this can result in memory being consumed by the 
 RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
 large, even if large documents are rare.
 It seems like there should be some attempt to reclaim memory from the PerDoc[] 
 instance pool (or to otherwise limit the size of the RAMFiles that are cached).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-02-24 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2283:
---

Fix Version/s: 3.1

 Possible Memory Leak in StoredFieldsWriter
 --

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless
 Fix For: 3.1


 StoredFieldsWriter creates a pool of PerDoc instances.
 This pool will grow but is never reclaimed by any mechanism.
 Furthermore, each PerDoc instance contains a RAMFile.
 This RAMFile will also never be truncated (and will only ever grow), as far 
 as I can tell.
 When feeding documents with a large number of stored fields (or one large, 
 dominating stored field), this can result in memory being consumed by the 
 RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
 large, even if large documents are rare.
 It seems like there should be some attempt to reclaim memory from the PerDoc[] 
 instance pool (or to otherwise limit the size of the RAMFiles that are cached).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-2282) Expose IndexFileNames as public, and make use of its methods in the code

2010-02-24 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-2282:
--

Assignee: Michael McCandless

 Expose IndexFileNames as public, and make use of its methods in the code
 

 Key: LUCENE-2282
 URL: https://issues.apache.org/jira/browse/LUCENE-2282
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2282.patch, LUCENE-2282.patch, LUCENE-2282.patch


 IndexFileNames is useful for applications that extend Lucene, and in 
 particular those that extend Directory or IndexWriter. It provides useful 
 constants and methods to query whether a certain file is a core Lucene file 
 or not. In addition, IndexFileNames should be used by Lucene's code to 
 generate segment file names, or to query whether a certain file matches a 
 certain extension.
 I'll post the patch shortly.
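
 A hedged sketch of the intended usage once the class is public (the method 
 names follow the patch's direction and may differ in the final API):
{code}
import org.apache.lucene.index.IndexFileNames;

public class IndexFileCheck {
  public static void main(String[] args) {
    // Compose a per-segment file name instead of ad-hoc string concatenation.
    String name = IndexFileNames.segmentFileName("_0", "fdt"); // "_0.fdt"
    // Ask whether a directory entry carries a known extension.
    boolean isStoredFields = IndexFileNames.matchesExtension(name, "fdt");
    System.out.println(name + " stored fields file? " + isStoredFields);
  }
}
{code}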

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2280) IndexWriter.optimize() throws NullPointerException

2010-02-24 Thread Ritesh Nigam (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837744#action_12837744
 ] 

Ritesh Nigam commented on LUCENE-2280:
--

Are you sure you're using a stock version 2.3.2 of Lucene?
- Yes, I checked the manifest of the jar.

I ask because... the line numbers in SegmentMerger (specifically 566) don't 
correlate to 2.3.2. The other line numbers do match. It's odd.

But looking at the code I don't see how either of the arrays being passed to 
System.arraycopy can be null.

Can you turn on IndexWriter's infoStream and capture & post the output?
- I have turned on the infoStream for IndexWriter; it will take some time to 
get the result. Once I have it, I will post it.

It's also strange that this leads to index corruption; it shouldn't (the merge 
should just fail, and the index should be untouched). Can you run CheckIndex on 
the index and post what corruption it uncovers.
- By index corruption I mean that the main index file is getting deleted and 
search is not returning the expected results. Since no index file exists 
after the NullPointerException, I cannot run CheckIndex.

Does this happen in a Sun JRE?
- I have not yet tested the same scenario on a Sun JRE.

 IndexWriter.optimize() throws NullPointerException
 --

 Key: LUCENE-2280
 URL: https://issues.apache.org/jira/browse/LUCENE-2280
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.3.2
 Environment: Win 2003, lucene version 2.3.2, IBM JRE 1.6
Reporter: Ritesh Nigam

 I am using the Lucene 2.3.2 search APIs for my application. I am indexing a 
 45GB database, which creates an approx. 200MB index file. After finishing the 
 indexing, while running optimize() I can see a NullPointerException thrown in 
 my log, and the index file is getting corrupted. The log says:
 
 Caused by: 
 java.lang.NullPointerException
   at 
 org.apache.lucene.store.BufferedIndexOutput.writeBytes(BufferedIndexOutput.java:49)
   at org.apache.lucene.store.IndexOutput.writeBytes(IndexOutput.java:40)
   at 
 org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:566)
   at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:135)
   at 
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3273)
   at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2968)
   at 
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)
 
 This is happening quite frequently, although I am not able to reproduce it on 
 demand. I saw an issue logged which is somewhat related to mine 
 (http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200809.mbox/%3c6e4a40db-5efc-42da-a857-d59f4ec34...@mikemccandless.com%3e),
  but the only difference here is that I am not using Store.Compress for my 
 fields; I am using Store.NO instead. Please note that I am using the IBM JRE 
 for my application.
 Is this an issue with Lucene? If yes, in which version is it fixed?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set&lt;String&gt; instead of CharArraySet

2010-02-24 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837752#action_12837752
 ] 

Simon Willnauer commented on LUCENE-2279:
-

bq. Should we deprecate (eventually, remove) Analyzer.tokenStream? 
I would totally agree with that, but I guess we cannot remove this method 
until Lucene 4.0, which will be, hmm, in 2020 :) - just joking

bq. Maybe we should absorb ReusableAnalyzerBase back into Analyzer?
That would be the logical consequence, but the problem with 
ReusableAnalyzerBase is that it will break backwards compatibility if moved to 
Analyzer. It assumes both #reusableTokenStream and #tokenStream to be final 
and introduces a new factory method. Yet, as an analyzer developer you really 
want to use the new ReusableAnalyzerBase in favor of Analyzer in 99% of the 
cases: it requires you to write half the code, plus it gives you reusability 
of the tokenStream.
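
For readers who haven't seen it, the shape of that pattern is roughly as 
follows - a hedged sketch with illustrative names, not the exact 
ReusableAnalyzerBase API: the public entry point is final, reuse goes through 
a per-thread cache, and subclasses only implement a factory method.
{code}
import java.io.Reader;

abstract class ReusingAnalyzerSketch {

  /** Minimal stand-in for a reusable TokenStream. */
  interface ReusableStream {
    void reset(Reader input);
  }

  // One cached stream per thread, mirroring how Analyzer caches streams.
  private final ThreadLocal<ReusableStream> cached = new ThreadLocal<ReusableStream>();

  /** The only method subclasses implement: build the stream once. */
  protected abstract ReusableStream createComponents(String fieldName, Reader reader);

  /** Final, so subclasses cannot accidentally defeat reuse. */
  public final ReusableStream reusableTokenStream(String fieldName, Reader reader) {
    ReusableStream stream = cached.get();
    if (stream == null) {
      stream = createComponents(fieldName, reader); // built once per thread
      cached.set(stream);
    } else {
      stream.reset(reader); // reuse: just repoint the cached stream at new input
    }
    return stream;
  }
}
{code}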

bq. I think Lucene/Solr/Nutch need to eventually get to this point
Huge +1 from my side. This could also unify the factory pattern Solr uses to 
build TokenStreams. I would stop right here and ask to discuss it on the dev 
list - thoughts, Mike?!



 eliminate pathological performance on StopFilter when using a Set&lt;String&gt; 
 instead of CharArraySet
 -

 Key: LUCENE-2279
 URL: https://issues.apache.org/jira/browse/LUCENE-2279
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: thushara wijeratna
Priority: Minor

 Passing a Set&lt;String&gt; to a StopFilter instead of a CharArraySet results in a 
 very slow filter.
 This is because for each document, Analyzer.tokenStream() is called, which 
 ends up calling the StopFilter (if used). And if a regular Set&lt;String&gt; is 
 used in the StopFilter, all the elements of the set are copied to a 
 CharArraySet, as we can see in its constructor:
 public StopFilter(boolean enablePositionIncrements, TokenStream input,
                   Set stopWords, boolean ignoreCase)
 {
   super(input);
   if (stopWords instanceof CharArraySet) {
     this.stopWords = (CharArraySet) stopWords;
   } else {
     this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
     this.stopWords.addAll(stopWords);
   }
   this.enablePositionIncrements = enablePositionIncrements;
   init();
 }
 I feel we should make the StopFilter signature specific, as in specifying 
 CharArraySet vs Set, and there should be a JavaDoc warning on using the other 
 variants of StopFilter, as they all result in a copy for each invocation 
 of Analyzer.tokenStream().

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set&lt;String&gt; instead of CharArraySet

2010-02-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837759#action_12837759
 ] 

Robert Muir commented on LUCENE-2279:
-

bq. Yet, as an analyzer developer you really want to use the new 
ReusableAnalyzerBase in favor of Analyzer in 99% of the cases: it requires you 
to write half the code, plus it gives you reusability of the tokenStream

And the 1% of extremely advanced cases that can't reuse can just use 
TokenStreams directly when indexing, e.g. the Analyzer class could be reusable 
by definition. We shouldn't let these obscure cases slow down everyone else.

bq. It assumes both #reusableTokenStream and #tokenStream to be final

In my opinion all the core analyzers (you already fixed contrib) should be 
final. This is another trap: if you subclass one of these analyzers and 
implement 'tokenStream', it's immediately slow due to the backwards code.

bq. I think Lucene/Solr/Nutch need to eventually get to this point

If this is what we should do to remove the code duplication, then I am all for 
it. I still don't quite understand how it gives us more freedom to break/change 
the APIs; I mean, however we label this stuff, a break is a break to the user 
at the end of the day.

 eliminate pathological performance on StopFilter when using a Set&lt;String&gt; 
 instead of CharArraySet
 -

 Key: LUCENE-2279
 URL: https://issues.apache.org/jira/browse/LUCENE-2279
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: thushara wijeratna
Priority: Minor

 Passing a Set&lt;String&gt; to a StopFilter instead of a CharArraySet results in a 
 very slow filter.
 This is because for each document, Analyzer.tokenStream() is called, which 
 ends up calling the StopFilter (if used). And if a regular Set&lt;String&gt; is 
 used in the StopFilter, all the elements of the set are copied to a 
 CharArraySet, as we can see in its constructor:
 public StopFilter(boolean enablePositionIncrements, TokenStream input,
                   Set stopWords, boolean ignoreCase)
 {
   super(input);
   if (stopWords instanceof CharArraySet) {
     this.stopWords = (CharArraySet) stopWords;
   } else {
     this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
     this.stopWords.addAll(stopWords);
   }
   this.enablePositionIncrements = enablePositionIncrements;
   init();
 }
 I feel we should make the StopFilter signature specific, as in specifying 
 CharArraySet vs Set, and there should be a JavaDoc warning on using the other 
 variants of StopFilter, as they all result in a copy for each invocation 
 of Analyzer.tokenStream().

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2111) Wrapup flexible indexing

2010-02-24 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2111:


Attachment: LUCENE-2111_experimental.patch

Attached is a patch that changes various exposed APIs to use 
@lucene.experimental.

I didn't mess with IndexFileNames, as there is an open issue about it right now.

 Wrapup flexible indexing
 

 Key: LUCENE-2111
 URL: https://issues.apache.org/jira/browse/LUCENE-2111
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Flex Branch
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2111-EmptyTermsEnum.patch, 
 LUCENE-2111-EmptyTermsEnum.patch, LUCENE-2111.patch, LUCENE-2111.patch, 
 LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, 
 LUCENE-2111.patch, LUCENE-2111_bytesRef.patch, 
 LUCENE-2111_experimental.patch, LUCENE-2111_fuzzy.patch


 Spinoff from LUCENE-1458.
 The flex branch is in fairly good shape -- all tests pass, initial search 
 performance testing looks good, it survived several visits from the Unicode 
 policeman ;)
 But it still has a number of nocommits, could use some more scrutiny, 
 especially on the emulate-old-API-on-flex-index (and vice versa) code paths, 
 and still needs some more performance testing.  I'll do these under this 
 issue, and we should open separate issues for other self-contained fixes.
 The end is in sight!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-2272) PayloadNearQuery has hardwired explanation for 'AveragePayloadFunction'

2010-02-24 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned LUCENE-2272:
---

Assignee: Grant Ingersoll

 PayloadNearQuery has hardwired explanation for 'AveragePayloadFunction'
 ---

 Key: LUCENE-2272
 URL: https://issues.apache.org/jira/browse/LUCENE-2272
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Reporter: Peter Keegan
Assignee: Grant Ingersoll
 Attachments: payloadfunctin-patch.txt


 The 'explain' method in PayloadNearSpanScorer assumes the 
 AveragePayloadFunction was used. This patch adds the 'explain' method to the 
 'PayloadFunction' interface, where the Scorer can call it. Added unit tests 
 for 'explain' and for {Min,Max}PayloadFunction.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2272) PayloadNearQuery has hardwired explanation for 'AveragePayloadFunction'

2010-02-24 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837788#action_12837788
 ] 

Grant Ingersoll commented on LUCENE-2272:
-

Peter,

Couple of comments:
*  The base explain method can't be abstract.  Something like:
{code}
public Explanation explain(int docId) {
  Explanation result = new Explanation();
  result.setDescription("Unimpl Payload Function Explain");
  result.setValue(1);
  return result;
}
{code}
should do the trick
* The changes don't seem thread-safe anymore, since there are now member 
variables.  It may still be all right, but have you looked at this aspect?

 PayloadNearQuery has hardwired explanation for 'AveragePayloadFunction'
 ---

 Key: LUCENE-2272
 URL: https://issues.apache.org/jira/browse/LUCENE-2272
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Reporter: Peter Keegan
Assignee: Grant Ingersoll
 Attachments: payloadfunctin-patch.txt


 The 'explain' method in PayloadNearSpanScorer assumes the 
 AveragePayloadFunction was used. This patch adds the 'explain' method to the 
 'PayloadFunction' interface, where the Scorer can call it. Added unit tests 
 for 'explain' and for {Min,Max}PayloadFunction.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set&lt;String&gt; instead of CharArraySet

2010-02-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837792#action_12837792
 ] 

Michael McCandless commented on LUCENE-2279:


bq. I would stop right here and ask to discuss it on the dev list - thoughts, 
Mike?!

Agreed... I'll start a thread.

{quote}
bq. Maybe we should absorb ReusableAnalyzerBase back into Analyzer?

That would be the logical consequence, but the problem with 
ReusableAnalyzerBase is that it will break backwards compatibility if moved to 
Analyzer.
{quote}

Right, this is why I was thinking if we make a new analyzers package, it's a 
chance to break/improve things.  We'd have a single abstract base class that 
only exposes reuse API.

bq. In my opinion all the core analyzers (you already fixed contrib) should be 
final. 

I agree, and we should consistently take this approach w/ the new analyzers 
package...

bq. I still don't quite understand how it gives us more freedom to break/change 
the APIs; I mean, however we label this stuff, a break is a break to the user 
at the end of the day.

Because it'd be an entirely new package, so we can create a new base Analyzer 
class (in that package) that breaks/fixes things when compared to Lucene's 
Analyzer class.

We'd eventually deprecate the analyzers/tokenizers/token filters in 
Lucene/Solr/Nutch in favor of this new package, and users can switch over on 
their own schedule.


 eliminate pathological performance on StopFilter when using a Set&lt;String&gt; 
 instead of CharArraySet
 -

 Key: LUCENE-2279
 URL: https://issues.apache.org/jira/browse/LUCENE-2279
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: thushara wijeratna
Priority: Minor

 Passing a Set&lt;String&gt; to a StopFilter instead of a CharArraySet results in a 
 very slow filter.
 This is because for each document, Analyzer.tokenStream() is called, which 
 ends up calling the StopFilter (if used). And if a regular Set&lt;String&gt; is 
 used in the StopFilter, all the elements of the set are copied to a 
 CharArraySet, as we can see in its constructor:
 public StopFilter(boolean enablePositionIncrements, TokenStream input,
                   Set stopWords, boolean ignoreCase)
 {
   super(input);
   if (stopWords instanceof CharArraySet) {
     this.stopWords = (CharArraySet) stopWords;
   } else {
     this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
     this.stopWords.addAll(stopWords);
   }
   this.enablePositionIncrements = enablePositionIncrements;
   init();
 }
 I feel we should make the StopFilter signature specific, as in specifying 
 CharArraySet vs Set, and there should be a JavaDoc warning on using the other 
 variants of StopFilter, as they all result in a copy for each invocation 
 of Analyzer.tokenStream().

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-02-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837793#action_12837793
 ] 

Tim Smith commented on LUCENE-2283:
---

I came across this issue looking for a reported memory leak during indexing.

A YourKit snapshot showed that the PerDocs for an IndexWriter were using ~40MB 
of memory (at which point I came across this potentially unbounded memory use 
in StoredFieldsWriter).
This snapshot seems to be at a more or less stable point (memory grows but 
then returns to a normal state); however, I have reports that eventually the 
memory is completely exhausted, resulting in out-of-memory errors.

So far I have not found any other major culprit in the Lucene indexing code.

This index receives a routine mix of very large and very small documents 
(which would explain this situation).
The VM and system have a more than ample amount of memory given the buffer 
size and what should be normal indexing RAM requirements.

Also, a major difference between this leak not occurring and it showing up is 
that previously the IndexWriter was closed when performing commits, whereas 
now the IndexWriter remains open (just calling IndexWriter.commit()). So, if 
any memory is leaking during indexing, it is no longer being reclaimed during 
commit. As a side note, closing the index writer at commit time would 
sometimes fail, resulting in some subsequent updates failing because the index 
writer was locked and couldn't be reopened until the old index writer was 
garbage collected, so I don't want to go back to closing for commits.

It's possible there is a leak somewhere else (I currently do not have a 
snapshot from right before the out-of-memory issues occur, so for now the only 
thing that stands out is the PerDoc memory use).

As far as a fix goes, wouldn't it be better to have the RAMFiles used for 
stored fields pull and return byte buffers from the byte block pool on the 
DocumentsWriter? This would allow the memory to be reclaimed based on the 
index writer's buffer size (otherwise there is no configurable way to tune 
this memory use).



 Possible Memory Leak in StoredFieldsWriter
 --

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless
 Fix For: 3.1


 StoredFieldsWriter creates a pool of PerDoc instances.
 This pool will grow but is never reclaimed by any mechanism.
 Furthermore, each PerDoc instance contains a RAMFile.
 This RAMFile will also never be truncated (and will only ever grow), as far 
 as I can tell.
 When feeding documents with a large number of stored fields (or one large, 
 dominating stored field), this can result in memory being consumed by the 
 RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
 large, even if large documents are rare.
 It seems like there should be some attempt to reclaim memory from the PerDoc[] 
 instance pool (or to otherwise limit the size of the RAMFiles that are cached).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2111) Wrapup flexible indexing

2010-02-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837795#action_12837795
 ] 

Robert Muir commented on LUCENE-2111:
-

These tags were added in revision 915791.

 Wrapup flexible indexing
 

 Key: LUCENE-2111
 URL: https://issues.apache.org/jira/browse/LUCENE-2111
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Flex Branch
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2111-EmptyTermsEnum.patch, 
 LUCENE-2111-EmptyTermsEnum.patch, LUCENE-2111.patch, LUCENE-2111.patch, 
 LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, 
 LUCENE-2111.patch, LUCENE-2111_bytesRef.patch, 
 LUCENE-2111_experimental.patch, LUCENE-2111_fuzzy.patch


 Spinoff from LUCENE-1458.
 The flex branch is in fairly good shape -- all tests pass, initial search 
 performance testing looks good, it survived several visits from the Unicode 
 policeman ;)
 But it still has a number of nocommits, could use some more scrutiny, 
 especially on the emulate-old-API-on-flex-index (and vice versa) code paths, 
 and still needs some more performance testing.  I'll do these under this 
 issue, and we should open separate issues for other self-contained fixes.
 The end is in sight!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2280) IndexWriter.optimize() throws NullPointerException

2010-02-24 Thread Ritesh Nigam (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837799#action_12837799
 ] 

Ritesh Nigam commented on LUCENE-2280:
--

Attaching the lucene.jar which I am using for my application.

 IndexWriter.optimize() throws NullPointerException
 --

 Key: LUCENE-2280
 URL: https://issues.apache.org/jira/browse/LUCENE-2280
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.3.2
 Environment: Win 2003, lucene version 2.3.2, IBM JRE 1.6
Reporter: Ritesh Nigam

 I am using the Lucene 2.3.2 search APIs for my application. I am indexing a 
 45GB database, which creates an approx. 200MB index file. After finishing the 
 indexing, while running optimize() I can see a NullPointerException thrown in 
 my log, and the index file is getting corrupted. The log says:
 
 Caused by: 
 java.lang.NullPointerException
   at 
 org.apache.lucene.store.BufferedIndexOutput.writeBytes(BufferedIndexOutput.java:49)
   at org.apache.lucene.store.IndexOutput.writeBytes(IndexOutput.java:40)
   at 
 org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:566)
   at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:135)
   at 
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3273)
   at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2968)
   at 
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)
 
 This is happening quite frequently, although I am not able to reproduce it on 
 demand. I saw an issue logged which is somewhat related to mine 
 (http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200809.mbox/%3c6e4a40db-5efc-42da-a857-d59f4ec34...@mikemccandless.com%3e),
  but the only difference here is that I am not using Store.Compress for my 
 fields; I am using Store.NO instead. Please note that I am using the IBM JRE 
 for my application.
 Is this an issue with Lucene? If yes, in which version is it fixed?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2280) IndexWriter.optimize() throws NullPointerException

2010-02-24 Thread Ritesh Nigam (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ritesh Nigam updated LUCENE-2280:
-

Attachment: lucene.jar

The lucene.jar my application is using.

 IndexWriter.optimize() throws NullPointerException
 --

 Key: LUCENE-2280
 URL: https://issues.apache.org/jira/browse/LUCENE-2280
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.3.2
 Environment: Win 2003, lucene version 2.3.2, IBM JRE 1.6
Reporter: Ritesh Nigam
 Attachments: lucene.jar


 I am using the Lucene 2.3.2 search APIs for my application. I am indexing a 
 45GB database, which creates an approx. 200MB index file. After finishing the 
 indexing, while running optimize() I can see a NullPointerException thrown in 
 my log, and the index file is getting corrupted. The log says:
 
 Caused by: 
 java.lang.NullPointerException
   at 
 org.apache.lucene.store.BufferedIndexOutput.writeBytes(BufferedIndexOutput.java:49)
   at org.apache.lucene.store.IndexOutput.writeBytes(IndexOutput.java:40)
   at 
 org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:566)
   at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:135)
   at 
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3273)
   at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2968)
   at 
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)
 
 This is happening quite frequently, although I am not able to reproduce it on 
 demand. I saw an issue logged which is somewhat related to mine 
 (http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200809.mbox/%3c6e4a40db-5efc-42da-a857-d59f4ec34...@mikemccandless.com%3e),
  but the only difference here is that I am not using Store.Compress for my 
 fields; I am using Store.NO instead. Please note that I am using the IBM JRE 
 for my application.
 Is this an issue with Lucene? If yes, in which version is it fixed?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-02-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837811#action_12837811
 ] 

Michael McCandless commented on LUCENE-2283:


bq. A YourKit snapshot showed that the PerDocs for an IndexWriter were using 
~40MB of memory

What was IW's ramBufferSizeMB when you saw this?

bq. however, I have reports that eventually the memory is completely 
exhausted, resulting in out-of-memory errors.

Hmm, that makes me nervous, because I think in this case the use should be 
bounded.

bq. As a side note, closing the index writer at commit time would sometimes 
fail, resulting in some subsequent updates failing because the index writer 
was locked and couldn't be reopened until the old index writer was garbage 
collected, so I don't want to go back to closing for commits.

That doesn't sound good!  Can you post some details on this (eg an exception)?

But, anyway, keeping the same IW open and just calling commit is (should be) 
fine.

bq. As far as a fix goes, wouldn't it be better to have the RAMFiles used for 
stored fields pull and return byte buffers from the byte block pool on the 
DocumentsWriter?

Yes, that's a great solution -- a single pool.  But that's a somewhat bigger 
change.  I guess we can pass a byte[] allocator to RAMFile.  It'd have to be a 
new pool, too (DW's byte blocks are 32KB, not the 1KB that RAMFile uses).
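
A hedged sketch of what such an allocator seam might look like (hypothetical 
names, not a committed Lucene API):
{code}
import java.util.ArrayDeque;
import java.util.Deque;

// RAMFile would ask this for its 1KB buffers, letting DocumentsWriter
// account for, reuse, and reclaim them centrally -- a pool separate from its
// existing 32KB byte blocks.
interface ByteBlockAllocator {
  byte[] allocate();
  void recycle(byte[] block);
}

final class PooledByteBlockAllocator implements ByteBlockAllocator {
  private static final int BLOCK_SIZE = 1024; // RAMFile-sized, per the comment above
  private final Deque<byte[]> pool = new ArrayDeque<byte[]>();

  public byte[] allocate() {
    byte[] block = pool.poll();
    return block != null ? block : new byte[BLOCK_SIZE];
  }

  public void recycle(byte[] block) {
    pool.offer(block); // a real pool would also cap retained memory here
  }
}
{code}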

 Possible Memory Leak in StoredFieldsWriter
 --

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless
 Fix For: 3.1


 StoredFieldsWriter creates a pool of PerDoc instances.
 This pool will grow but is never reclaimed by any mechanism.
 Furthermore, each PerDoc instance contains a RAMFile.
 This RAMFile will also never be truncated (and will only ever grow), as far 
 as I can tell.
 When feeding documents with a large number of stored fields (or one large, 
 dominating stored field), this can result in memory being consumed by the 
 RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
 large, even if large documents are rare.
 It seems like there should be some attempt to reclaim memory from the PerDoc[] 
 instance pool (or to otherwise limit the size of the RAMFiles that are cached).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-02-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837821#action_12837821
 ] 

Tim Smith commented on LUCENE-2283:
---

ramBufferSizeMB is 64MB

Here's the yourkit breakdown per class:
* DocumentsWriter - 256 MB
** TermsHash - 38.7 MB
** StoredFieldsWriter - 37.5 MB
** DocumentsWriterThreadState - 36.2 MB
** DocumentsWriterThreadState - 34.6 MB
** DocumentsWriterThreadState - 33.8 MB
** DocumentsWriterThreadState - 27.5 MB
** DocumentsWriterThreadState - 13.4 MB

I'm starting to dig into the ThreadStates now to see if anything stands out here

bq. Hmm, that makes me nervous, because I think in this case the use should be 
bounded.

I should be getting a new profile dump at crash time soon, so hopefully that 
will make things clearer

bq. That doesn't sound good! Can you post some details on this (eg an 
exception)?

If I recall correctly, I think the exception was caused by an 
out-of-disk-space situation (which would recover).
Obviously not much can be done about this other than adding more disk space; 
the situation would recover, but docs would be lost in the interim.

bq. But, anyway, keeping the same IW open and just calling commit is (should 
be) fine.

Yeah, this should be the way to go, especially as it results in the pooled 
buffers not needing to be reallocated/reclaimed/etc. However, right now this 
is the only change I can think of that could result in memory issues.

bq. Yes, that's a great solution - a single pool. But that's a somewhat bigger 
change. 

Seems like this would be the best approach, as it makes the memory bounded by 
the configuration of the engine, giving better reuse of byte blocks and a 
better ability to reclaim memory (in DocumentsWriter.balanceRAM()).




 Possible Memory Leak in StoredFieldsWriter
 --

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless
 Fix For: 3.1


 StoredFieldsWriter creates a pool of PerDoc instances.
 This pool will grow but is never reclaimed by any mechanism.
 Furthermore, each PerDoc instance contains a RAMFile.
 This RAMFile will also never be truncated (and will only ever grow), as far 
 as I can tell.
 When feeding documents with a large number of stored fields (or one large, 
 dominating stored field), this can result in memory being consumed by the 
 RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
 large, even if large documents are rare.
 It seems like there should be some attempt to reclaim memory from the PerDoc[] 
 instance pool (or to otherwise limit the size of the RAMFiles that are cached).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



MatchAllDocsQueryNode toString() creates invalid XML-Tag

2010-02-24 Thread Frank Wesemann

Hi,
I am just getting my feet wet with the queryParser in contrib/queryparser.
This new API is really a huge improvement.
I am using it to convert Solr-style input into a custom XML-based format 
we use to query third-party search engines.


I encountered the following:
The MatchAllDocsQueryNode returns in its toString() method
&lt;matchAllDocs field='*' term='*'&gt;.
Is this on purpose? Is it meant to be closed elsewhere?
If not, I'll happily open a JIRA issue and provide a patch for it.
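
For illustration, the implied fix is a one-liner (a hypothetical sketch; the 
real class in contrib/queryparser may differ):
{code}
// Emit a self-closing tag so the toString() output is well-formed XML.
public String toString() {
  return "<matchAllDocs field='*' term='*'/>";
}
{code}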

Thanks

frank



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2111) Wrapup flexible indexing

2010-02-24 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2111:
---

Attachment: LUCENE-2111.patch

Attached patch, fixing some more nocommits, and renaming BytesRef.toString -> 
BytesRef.utf8ToString.

 Wrapup flexible indexing
 

 Key: LUCENE-2111
 URL: https://issues.apache.org/jira/browse/LUCENE-2111
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Flex Branch
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2111-EmptyTermsEnum.patch, 
 LUCENE-2111-EmptyTermsEnum.patch, LUCENE-2111.patch, LUCENE-2111.patch, 
 LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, 
 LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111_bytesRef.patch, 
 LUCENE-2111_experimental.patch, LUCENE-2111_fuzzy.patch


 Spinoff from LUCENE-1458.
 The flex branch is in fairly good shape -- all tests pass, initial search 
 performance testing looks good, it survived several visits from the Unicode 
 policeman ;)
 But it still has a number of nocommits, could use some more scrutiny, 
 especially on the emulate-old-API-on-flex-index (and vice versa) code paths, 
 and still needs some more performance testing.  I'll do these under this 
 issue, and we should open separate issues for other self-contained fixes.
 The end is in sight!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: MatchAllDocsQueryNode toString() creates invalid XML-Tag

2010-02-24 Thread Michael McCandless
This sounds like a bug -- can you open an issue?  Thanks!

Mike

On Wed, Feb 24, 2010 at 10:04 AM, Frank Wesemann
f.wesem...@fotofinder.net wrote:
 Hi,
 I am just getting my feet wet with the queryParser in contrib/queryparser.
 This new API is really a huge improvement.
 I am using it to convert Solr-style input into a custom XML-based format we
 use to query third-party search engines.

 I encountered the following:
 The MatchAllDocsQueryNode returns in its toString() method
 &lt;matchAllDocs field='*' term='*'&gt;.
 Is this on purpose? Is it meant to be closed elsewhere?
 If not, I'll happily open a JIRA issue and provide a patch for it.

 Thanks

 frank



 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set&lt;String&gt; instead of CharArraySet

2010-02-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837859#action_12837859
 ] 

Michael McCandless commented on LUCENE-2279:


{quote}
bq. I would stop right here and ask to discuss it on the dev list - thoughts, 
Mike?!

Agreed... I'll start a thread.
{quote}

OK I just started a thread on general@

 eliminate pathological performance on StopFilter when using a Set&lt;String&gt; 
 instead of CharArraySet
 -

 Key: LUCENE-2279
 URL: https://issues.apache.org/jira/browse/LUCENE-2279
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: thushara wijeratna
Priority: Minor

 Passing a Set&lt;String&gt; to a StopFilter instead of a CharArraySet results in a 
 very slow filter.
 This is because for each document, Analyzer.tokenStream() is called, which 
 ends up calling the StopFilter (if used). And if a regular Set&lt;String&gt; is 
 used in the StopFilter, all the elements of the set are copied to a 
 CharArraySet, as we can see in its constructor:
 public StopFilter(boolean enablePositionIncrements, TokenStream input,
                   Set stopWords, boolean ignoreCase)
 {
   super(input);
   if (stopWords instanceof CharArraySet) {
     this.stopWords = (CharArraySet) stopWords;
   } else {
     this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
     this.stopWords.addAll(stopWords);
   }
   this.enablePositionIncrements = enablePositionIncrements;
   init();
 }
 I feel we should make the StopFilter signature specific, as in specifying 
 CharArraySet vs Set, and there should be a JavaDoc warning on using the other 
 variants of StopFilter, as they all result in a copy for each invocation 
 of Analyzer.tokenStream().

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-02-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837865#action_12837865
 ] 

Michael McCandless commented on LUCENE-2283:


{quote}
ramBufferSizeMB is 64MB

Here's the yourkit breakdown per class:
{quote}

Hmm -- spooky.  With ram buffer @ 64MB, DocumentsWriter is using 256MB!?  
Something is clearly amiss.  40 MB used by StoredFieldsWriter's PerDoc still 
leaves 152 MB unaccounted for... hmm.

bq. If I recall correctly, I think the exception was caused by an 
out-of-disk-space situation (which would recover)

Oh OK.  Though... closing the IW vs calling IW.commit should be no different 
in that regard.  Both should have the same transient disk space usage.  It's 
odd you'd see out-of-disk for .close but not also for .commit.

bq. Seems like this would be the best approach as it makes the memory bounded 
by the configuration of the engine, giving better reuse of byte blocks and 
better ability to reclaim memory (in DocumentsWriter.balanceRAM())

I agree.  I'll mull over how to do it... unless you're planning on consing up a 
patch ;)

How many threads do you pass through IW?  Are the threads forever from a static 
pool, or do they come and go?  I'd like to try to simulate your usage (huge 
docs & tiny docs) in my dev area to see if I can provoke the same behavior.

 Possible Memory Leak in StoredFieldsWriter
 --

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless
 Fix For: 3.1


 StoredFieldsWriter creates a pool of PerDoc instances.
 This pool will grow but is never reclaimed by any mechanism.
 Furthermore, each PerDoc instance contains a RAMFile.
 This RAMFile will also never be truncated (and will only ever grow), as far 
 as I can tell.
 When feeding documents with a large number of stored fields (or one large, 
 dominating stored field), this can result in memory being consumed by the 
 RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
 large, even if large documents are rare.
 It seems like there should be some attempt to reclaim memory from the PerDoc[] 
 instance pool (or to otherwise limit the size of the RAMFiles that are cached).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2111) Wrapup flexible indexing

2010-02-24 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2111:


Attachment: LUCENE-2111_toString.patch

Here are a few more toString -> utf8ToString renames.
Will look at the backwards tests now.

 Wrapup flexible indexing
 

 Key: LUCENE-2111
 URL: https://issues.apache.org/jira/browse/LUCENE-2111
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Flex Branch
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2111-EmptyTermsEnum.patch, 
 LUCENE-2111-EmptyTermsEnum.patch, LUCENE-2111.patch, LUCENE-2111.patch, 
 LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, 
 LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111_bytesRef.patch, 
 LUCENE-2111_experimental.patch, LUCENE-2111_fuzzy.patch, 
 LUCENE-2111_toString.patch


 Spinoff from LUCENE-1458.
 The flex branch is in fairly good shape -- all tests pass, initial search 
 performance testing looks good, it survived several visits from the Unicode 
 policeman ;)
 But it still has a number of nocommits, could use some more scrutiny, 
 especially on the emulate-old-API-on-flex-index (and vice versa) code paths, 
 and still needs some more performance testing.  I'll do these under this 
 issue, and we should open separate issues for other self-contained fixes.
 The end is in sight!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-02-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837875#action_12837875
 ] 

Tim Smith commented on LUCENE-2283:
---

bq. I agree. I'll mull over how to do it... unless you're planning on consing 
up a patch 

I'd love to, but don't have the free cycles at the moment :(

bq. How many threads do you pass through IW?

I honestly don't 100% know about the origin of the threads I'm given.
In general, they should be from a static pool, but they may be dynamically 
allocated if the static pool runs out.

One thought I had recently was to control this more tightly by having a 
limited number of static threads that call the IndexWriter methods, in case 
that was the issue (but that would be a pretty big change).

 Possible Memory Leak in StoredFieldsWriter
 --

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless
 Fix For: 3.1


 StoredFieldsWriter creates a pool of PerDoc instances.
 This pool will grow but is never reclaimed by any mechanism.
 Furthermore, each PerDoc instance contains a RAMFile.
 This RAMFile will also never be truncated (and will only ever grow), as far 
 as I can tell.
 When feeding documents with a large number of stored fields (or one large, 
 dominating stored field), this can result in memory being consumed by the 
 RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
 large, even if large documents are rare.
 It seems like there should be some attempt to reclaim memory from the PerDoc[] 
 instance pool (or to otherwise limit the size of the RAMFiles that are cached).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2282) Expose IndexFileNames as public, and make use of its methods in the code

2010-02-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837879#action_12837879
 ] 

Michael McCandless commented on LUCENE-2282:


Patch looks good Shai!

But I don't think we should backport to 3.0.2 -- it's non-trivial
enough that there is some risk?

As the API is now marked @lucene.internal, and it'll only be for very
expert usage, I'm not as concerned as Marvin is about the risks of
even exposing this.  Also, even with flex, a good number of Lucene's
index files are not under codec control (codec only touches postings
files -- .tis, .tii, .frq, .prx for the standard codec).  But I do
agree it's not ideal that the knowledge of file extensions is split
across this class and the codec.  The IndexFileNameFilter in flex now
takes a Codec as input, to make up for that... but IndexFileNames just
has a NOTE at the top stating the limitation.
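
As a usage illustration only (assuming the patch exposes helpers along these 
lines -- the exact method names are in the attached patch): an extending 
Directory could route files by extension without hard-coding the extension 
strings.

{code}
import org.apache.lucene.index.IndexFileNames;

// Hedged sketch: decide how to treat a file based on its Lucene extension,
// using the constants/helpers assumed from the patch under discussion.
public class ExtensionCheck {
  public static boolean isCompoundFile(String fileName) {
    return IndexFileNames.matchesExtension(fileName,
        IndexFileNames.COMPOUND_FILE_EXTENSION);
  }
}
{code}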


 Expose IndexFileNames as public, and make use of its methods in the code
 

 Key: LUCENE-2282
 URL: https://issues.apache.org/jira/browse/LUCENE-2282
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2282.patch, LUCENE-2282.patch, LUCENE-2282.patch


 IndexFileNames is useful for applications that extend Lucene, and in 
 particular those who extend Directory or IndexWriter. It provides useful 
 constants and methods to query whether a certain file is a core Lucene file 
 or not. In addition, IndexFileNames should be used by Lucene's code to 
 generate segment file names, or query whether a certain file matches a 
 certain extension.
 I'll post the patch shortly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2282) Expose IndexFileNames as public, and make use of its methods in the code

2010-02-24 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837880#action_12837880
 ] 

Shai Erera commented on LUCENE-2282:


bq. But I don't think we should back port to 3.0.2

Ok, I can live w/ 3.1, as long as it's not released at the end of 2010. I can 
for now put that part of my code in o.a.l.index, until 3.1 is out.

As I wrote in the TestFileSwitchDirectory comment, this IMO has to go in, 
because otherwise it would (potentially) make the code of FileSwitchDirectory 
users fragile.

Thanks for looking at this !

 Expose IndexFileNames as public, and make use of its methods in the code
 

 Key: LUCENE-2282
 URL: https://issues.apache.org/jira/browse/LUCENE-2282
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2282.patch, LUCENE-2282.patch, LUCENE-2282.patch


 IndexFileNames is useful for applications that extend Lucene, and in 
 particular those who extend Directory or IndexWriter. It provides useful 
 constants and methods to query whether a certain file is a core Lucene file 
 or not. In addition, IndexFileNames should be used by Lucene's code to 
 generate segment file names, or query whether a certain file matches a 
 certain extension.
 I'll post the patch shortly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-02-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837881#action_12837881
 ] 

Tim Smith commented on LUCENE-2283:
---

The latest profile dump has pointed out a non-Lucene issue as causing some 
memory growth, so feel free to drop the priority.

However, it still seems like using the byte pool for the stored fields would be 
good overall.
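
To make the idea concrete, a hedged sketch (all names hypothetical, not 
Lucene's internal API): pooled per-document buffers are reused, but any buffer 
that grew past a cap is dropped instead of returned to the pool, so one huge 
document cannot pin memory forever.

{code}
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration of capping pooled per-doc buffers.
public class CappedBufferPool {
  private static final int MAX_POOLED_BYTES = 1 << 16; // 64 KB cap, arbitrary
  private final Deque<byte[]> pool = new ArrayDeque<byte[]>();

  public synchronized byte[] acquire() {
    byte[] buf = pool.poll();
    return buf != null ? buf : new byte[1024];
  }

  public synchronized void release(byte[] buf) {
    // Oversized buffers are discarded rather than pooled, so rare huge
    // documents don't permanently inflate the pool.
    if (buf.length <= MAX_POOLED_BYTES) {
      pool.offer(buf);
    }
  }
}
{code}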

 Possible Memory Leak in StoredFieldsWriter
 --

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless
 Fix For: 3.1


 StoredFieldsWriter creates a pool of PerDoc instances.
 This pool will grow but never be reclaimed by any mechanism.
 Furthermore, each PerDoc instance contains a RAMFile.
 This RAMFile will also never be truncated (and will only ever grow) (as far 
 as I can tell).
 When feeding documents with a large number of stored fields (or one large 
 dominating stored field), this can result in memory being consumed in the 
 RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
 large, even if large documents are rare.
 Seems like there should be some attempt to reclaim memory from the PerDoc[] 
 instance pool (or otherwise limit the size of RAMFiles that are cached), etc.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2282) Expose IndexFileNames as public, and make use of its methods in the code

2010-02-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837883#action_12837883
 ] 

Uwe Schindler commented on LUCENE-2282:
---

bq. But I don't think we should back port to 3.0.2 - it's non-trivial enough 
that there is some risk?

Please no backport to 3.0.2 -- it's an API change. And we are not sure there 
will ever be a 3.0.2. BTW: Version 3.0.1 comes out on Friday at the latest, and 
will appear on the mirrors soon!

 Expose IndexFileNames as public, and make use of its methods in the code
 

 Key: LUCENE-2282
 URL: https://issues.apache.org/jira/browse/LUCENE-2282
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2282.patch, LUCENE-2282.patch, LUCENE-2282.patch


 IndexFileNames is useful for applications that extend Lucene, and in 
 particular those who extend Directory or IndexWriter. It provides useful 
 constants and methods to query whether a certain file is a core Lucene file 
 or not. In addition, IndexFileNames should be used by Lucene's code to 
 generate segment file names, or query whether a certain file matches a 
 certain extension.
 I'll post the patch shortly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2280) IndexWriter.optimize() throws NullPointerException

2010-02-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837892#action_12837892
 ] 

Michael McCandless commented on LUCENE-2280:


Indeed that JAR is identical to 2.3.2.  Weird.  Not sure why the line number 
doesn't line up.  Irks me.

bq. By index corruption I mean that the main index file is getting deleted 
and search is not returning the expected results. Since no index file exists 
after the NullPointerException, I cannot run CheckIndex.

That's even stranger -- nothing should get deleted because a merge fails.  Is 
it possible your app has an exception handler doing this?  Or maybe this is a 
brand new index, and it doesn't get properly closed (ie, no commit) when this 
exception is hit?  If not... can you provide more details?  An exception like 
this should have no impact on the original index.

Please post the infoStream output when you get it, and report back whether this 
happens on Sun's JVM.  But I still can't see how either of the arrays could be 
null here... this is a weird one.

Are you using the latest updates to the IBM 1.6 JRE?
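
For anyone following along, roughly how to capture that output on the 2.3.x 
line (a sketch; paths and analyzer choice hypothetical):

{code}
import java.io.PrintStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Hedged sketch (2.3.x-era API): log every merge so the failing one is visible.
public class InfoStreamRepro {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.getDirectory(args[0]); // 2.3-style factory
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
    writer.setInfoStream(new PrintStream("iw-info.log"));
    writer.optimize(); // reproduce; the log captures the merge details
    writer.close();
    // Afterwards, if the index survives, check it from the command line:
    //   java -cp lucene-core-2.3.2.jar org.apache.lucene.index.CheckIndex <indexDir>
  }
}
{code}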

 IndexWriter.optimize() throws NullPointerException
 --

 Key: LUCENE-2280
 URL: https://issues.apache.org/jira/browse/LUCENE-2280
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.3.2
 Environment: Win 2003, lucene version 2.3.2, IBM JRE 1.6
Reporter: Ritesh Nigam
 Attachments: lucene.jar


 I am using the Lucene 2.3.2 search APIs for my application. I am indexing a 
 45GB database, which creates an approx. 200MB index file. After finishing the 
 indexing, and while running optimize(), I can see a NullPointerException 
 thrown in my log, and the index file is getting corrupted. The log says:
 
 Caused by: 
 java.lang.NullPointerException
   at 
 org.apache.lucene.store.BufferedIndexOutput.writeBytes(BufferedIndexOutput.java:49)
   at org.apache.lucene.store.IndexOutput.writeBytes(IndexOutput.java:40)
   at 
 org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:566)
   at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:135)
   at 
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3273)
   at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2968)
   at 
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)
 
 This is happening quite frequently, although I am not able to reproduce it 
 on demand. I saw an issue logged that is somewhat related to mine 
 (http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200809.mbox/%3c6e4a40db-5efc-42da-a857-d59f4ec34...@mikemccandless.com%3e),
  but the only difference here is that I am not using Store.COMPRESS for my 
 fields; I am using Store.NO instead. Please note that I am using the IBM JRE 
 for my application.
 Is this an issue with Lucene? If yes, in which version is it fixed?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-02-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837896#action_12837896
 ] 

Michael McCandless commented on LUCENE-2283:


Yeah it would be good to make the pool shared...

It still bugs me that YourKit is claiming DW was using 256 MB when you've got 
a 64 MB RAM buffer -- that's spooky.

 Possible Memory Leak in StoredFieldsWriter
 --

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless
 Fix For: 3.1


 StoredFieldsWriter creates a pool of PerDoc instances.
 This pool will grow but never be reclaimed by any mechanism.
 Furthermore, each PerDoc instance contains a RAMFile.
 This RAMFile will also never be truncated (and will only ever grow) (as far 
 as I can tell).
 When feeding documents with a large number of stored fields (or one large 
 dominating stored field), this can result in memory being consumed in the 
 RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
 large, even if large documents are rare.
 Seems like there should be some attempt to reclaim memory from the PerDoc[] 
 instance pool (or otherwise limit the size of RAMFiles that are cached), etc.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2284) MatchAllDocsQueryNode toString() creates invalid XML-Tag

2010-02-24 Thread Frank Wesemann (JIRA)
MatchAllDocsQueryNode toString() creates invalid XML-Tag


 Key: LUCENE-2284
 URL: https://issues.apache.org/jira/browse/LUCENE-2284
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
 Environment: all
Reporter: Frank Wesemann


MatchAllDocsQueryNode.toString() returns <matchAllDocs field='*' term='*'>, 
which is invalid XML; it should read <matchAllDocs field='*' term='*'/>.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2284) MatchAllDocsQueryNode toString() creates invalid XML-Tag

2010-02-24 Thread Frank Wesemann (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Wesemann updated LUCENE-2284:
---

Attachment: LUCENE-2284.patch

This patch returns a valid XML element.
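
For readers without the attachment, the shape of the fix is presumably a 
one-liner along these lines (a sketch, not the attached patch itself):

{code}
// Hypothetical sketch: emit a well-formed, self-closing element.
public String toString() {
  return "<matchAllDocs field='*' term='*'/>";
}
{code}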

 MatchAllDocsQueryNode toString() creates invalid XML-Tag
 

 Key: LUCENE-2284
 URL: https://issues.apache.org/jira/browse/LUCENE-2284
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
 Environment: all
Reporter: Frank Wesemann
 Attachments: LUCENE-2284.patch


 MatchAllDocsQueryNode.toString() returns <matchAllDocs field='*' term='*'>, 
 which is invalid XML; it should read <matchAllDocs field='*' term='*'/>.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: MatchAllDocsQueryNode toString() creates invalid XML-Tag

2010-02-24 Thread Frank Wesemann

Michael McCandless schrieb:

This sounds like a bug -- can you open an issue?  Thanks!

  


Created: (LUCENE-2284) and added a Patch

--
with kind regards,

Frank Wesemann
Fotofinder GmbH                    USt-IdNr. DE812854514
Software Development               Web: http://www.fotofinder.com/
Potsdamer Str. 96                  Tel: +49 30 25 79 28 90
10785 Berlin                       Fax: +49 30 25 79 28 999

Registered office: Berlin
Amtsgericht Berlin Charlottenburg (HRB 73099)
Managing Director: Ali Paczensky




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-2284) MatchAllDocsQueryNode toString() creates invalid XML-Tag

2010-02-24 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir reassigned LUCENE-2284:
---

Assignee: Robert Muir

 MatchAllDocsQueryNode toString() creates invalid XML-Tag
 

 Key: LUCENE-2284
 URL: https://issues.apache.org/jira/browse/LUCENE-2284
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
 Environment: all
Reporter: Frank Wesemann
Assignee: Robert Muir
 Attachments: LUCENE-2284.patch


 MatchAllDocsQueryNode.toString() returns <matchAllDocs field='*' term='*'>, 
 which is invalid XML; it should read <matchAllDocs field='*' term='*'/>.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2284) MatchAllDocsQueryNode toString() creates invalid XML-Tag

2010-02-24 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2284:


Fix Version/s: 3.1

 MatchAllDocsQueryNode toString() creates invalid XML-Tag
 

 Key: LUCENE-2284
 URL: https://issues.apache.org/jira/browse/LUCENE-2284
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
 Environment: all
Reporter: Frank Wesemann
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: LUCENE-2284.patch


 MatchAllDocsQueryNode.toString() returns <matchAllDocs field='*' term='*'>, 
 which is invalid XML; it should read <matchAllDocs field='*' term='*'/>.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2284) MatchAllDocsQueryNode toString() creates invalid XML-Tag

2010-02-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837909#action_12837909
 ] 

Robert Muir commented on LUCENE-2284:
-

Looks like it would be good to fix, as all the other query nodes return valid 
XML.

Will commit in a day or two if no one objects.

Thanks for reporting this, Frank.

 MatchAllDocsQueryNode toString() creates invalid XML-Tag
 

 Key: LUCENE-2284
 URL: https://issues.apache.org/jira/browse/LUCENE-2284
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
 Environment: all
Reporter: Frank Wesemann
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: LUCENE-2284.patch


 MatchAllDocsQueryNode.toString() returns <matchAllDocs field='*' term='*'>, 
 which is invalid XML; it should read <matchAllDocs field='*' term='*'/>.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-02-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837919#action_12837919
 ] 

Tim Smith commented on LUCENE-2283:
---

Another note: this was on a 64-bit VM.

I've noticed that all the memsize calculations assume 4-byte pointers, so 
perhaps that can lead to more memory being used than would otherwise be 
expected (although 256 MB is still well over the 2X memory use that would 
potentially be expected in that case)
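
Rough arithmetic behind that 2X remark (my figures, illustrative only):

{code}
// Illustrative arithmetic only (not a measurement): on a 64-bit JVM without
// compressed oops, references are 8 bytes where the accounting assumes 4.
public class PointerMath {
  public static void main(String[] args) {
    long assumedBuffer = 64L << 20;     // 64 MB under 4-byte-ref accounting
    long worstCase = assumedBuffer * 2; // 8-byte refs at most ~double it
    System.out.println(worstCase >> 20); // 128 (MB) -- half of the 256 MB seen
  }
}
{code}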



 Possible Memory Leak in StoredFieldsWriter
 --

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless
 Fix For: 3.1


 StoredFieldsWriter creates a pool of PerDoc instances.
 This pool will grow but never be reclaimed by any mechanism.
 Furthermore, each PerDoc instance contains a RAMFile.
 This RAMFile will also never be truncated (and will only ever grow) (as far 
 as I can tell).
 When feeding documents with a large number of stored fields (or one large 
 dominating stored field), this can result in memory being consumed in the 
 RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
 large, even if large documents are rare.
 Seems like there should be some attempt to reclaim memory from the PerDoc[] 
 instance pool (or otherwise limit the size of RAMFiles that are cached), etc.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput

2010-02-24 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-2126:
--

Attachment: lucene-2126.patch

Updated patch to trunk.

I'll have to make a change to the backwards-tests too, because moving the 
copyBytes() method from IndexOutput to DataOutput and changing its parameter 
from IndexInput to DataInput breaks drop-in compatibility. 
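
Roughly, the shape after the split looks like this (a simplified sketch based 
on the issue description, not the patch itself; the real copyBytes is 
buffered):

{code}
import java.io.IOException;

// Sketch of the split: encode/decode primitives live in Data{Input,Output};
// file positioning and lifecycle stay in Index{Input,Output}.
abstract class DataInput {
  public abstract byte readByte() throws IOException;
}

abstract class DataOutput {
  public abstract void writeByte(byte b) throws IOException;
  // copyBytes now takes a DataInput, not an IndexInput -- the change that
  // breaks drop-in compatibility for the backwards tests.
  public void copyBytes(DataInput input, long numBytes) throws IOException {
    for (long i = 0; i < numBytes; i++) {
      writeByte(input.readByte());
    }
  }
}

abstract class IndexOutput extends DataOutput {
  public abstract long getFilePointer();
  public abstract void seek(long pos) throws IOException;
  public abstract void close() throws IOException;
}
{code}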


 Split up IndexInput and IndexOutput into DataInput and DataOutput
 -

 Key: LUCENE-2126
 URL: https://issues.apache.org/jira/browse/LUCENE-2126
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: Flex Branch
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Flex Branch

 Attachments: lucene-2126.patch, lucene-2126.patch


 I'd like to introduce the two new classes DataInput and DataOutput
 that contain all methods from IndexInput and IndexOutput that actually
 decode or encode data, such as readByte()/writeByte(),
 readVInt()/writeVInt().
 Methods like getFilePointer(), seek(), close(), etc., which are not
 related to data encoding, but to files as input/output source stay in
 IndexInput/IndexOutput.
 This patch also changes ByteSliceReader/ByteSliceWriter to extend
 DataInput/DataOutput. Previously ByteSliceReader implemented the
 methods that stay in IndexInput by throwing RuntimeExceptions.
 See also LUCENE-2125.
 All tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2282) Expose IndexFileNames as public, and make use of its methods in the code

2010-02-24 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837988#action_12837988
 ] 

Marvin Humphrey commented on LUCENE-2282:
-

 As the API is now marked @lucene.internal, and it'll only be very
 expert usage, I'm not as concerned as Marvin is about the risks of
 even exposing this. 

Um, the only possible concerns I could have had were regarding public exposure
of this API.  If it's marked as internal, it's an implementation detail.
Whether or not the dot is included in internal-use-only constant strings isn't
something I'm going to waste a lot of time thinking about. ;)

So now, not only do I really, really not care whether this goes in, I have no
qualms about it either.

Having users like Shai who are willing to recompile and regenerate to take
advantage of experimental features is a big boon, as it allows us to test
drive features before declaring them stable.  Designing optimal APIs without
usability testing is difficult to impossible.

 Expose IndexFileNames as public, and make use of its methods in the code
 

 Key: LUCENE-2282
 URL: https://issues.apache.org/jira/browse/LUCENE-2282
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2282.patch, LUCENE-2282.patch, LUCENE-2282.patch


 IndexFileNames is useful for applications that extend Lucene, and in 
 particular those who extend Directory or IndexWriter. It provides useful 
 constants and methods to query whether a certain file is a core Lucene file 
 or not. In addition, IndexFileNames should be used by Lucene's code to 
 generate segment file names, or query whether a certain file matches a 
 certain extension.
 I'll post the patch shortly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-02-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838017#action_12838017
 ] 

Tim Smith commented on LUCENE-2283:
---

I'm working up a patch for the shared ByteBlockPool for stored field buffers 
(found a few cycles)


 Possible Memory Leak in StoredFieldsWriter
 --

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless
 Fix For: 3.1


 StoredFieldsWriter creates a pool of PerDoc instances.
 This pool will grow but never be reclaimed by any mechanism.
 Furthermore, each PerDoc instance contains a RAMFile.
 This RAMFile will also never be truncated (and will only ever grow) (as far 
 as I can tell).
 When feeding documents with a large number of stored fields (or one large 
 dominating stored field), this can result in memory being consumed in the 
 RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
 large, even if large documents are rare.
 Seems like there should be some attempt to reclaim memory from the PerDoc[] 
 instance pool (or otherwise limit the size of RAMFiles that are cached), etc.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2167) StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters

2010-02-24 Thread Shyamal Prasad (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838060#action_12838060
 ] 

Shyamal Prasad commented on LUCENE-2167:


{quote}
I don't think it really has to be, i actually am of the opinion 
StandardTokenizer should follow unicode standard tokenization. then we can 
throw subjective decisions away, and stick with a standard.
{quote}

Yep, I see I was aiming at the wrong ambition level by only tweaking the 
existing grammar. I'll take a crack at understanding Unicode standard 
tokenization, as you'd suggested originally, and try to produce something as 
soon as I get a chance. I see your point.

Cheers!
Shyamal

 StandardTokenizer Javadoc does not correctly describe tokenization around 
 punctuation characters
 

 Key: LUCENE-2167
 URL: https://issues.apache.org/jira/browse/LUCENE-2167
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1, 2.9, 2.9.1, 3.0
Reporter: Shyamal Prasad
Priority: Minor
 Attachments: LUCENE-2167.patch, LUCENE-2167.patch

   Original Estimate: 0.5h
  Remaining Estimate: 0.5h

 The Javadoc for StandardTokenizer states:
 {quote}
 Splits words at punctuation characters, removing punctuation. 
 However, a dot that's not followed by whitespace is considered part of a 
 token.
 Splits words at hyphens, unless there's a number in the token, in which case 
 the whole 
 token is interpreted as a product number and is not split.
 {quote}
 This is not accurate. The actual JFlex implementation treats hyphens 
 interchangeably with punctuation. So, for example, "video,mp4,test" results 
 in a *single* token and not three tokens as the documentation would suggest.
 Additionally, the documentation suggests that "video-mp4-test-again" would 
 become a single token, but in reality it results in two tokens: 
 "video-mp4-test" and "again".
 IMHO the parser implementation is fine as is since it is hard to keep 
 everyone happy, but it is probably
 worth cleaning up the documentation string. 
 The patch included here updates the documentation string and adds a few test 
 cases to confirm the cases described above.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2167) StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters

2010-02-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838068#action_12838068
 ] 

Robert Muir commented on LUCENE-2167:
-

bq. I'll take a crack at understanding unicode standard tokenization, as you'd 
suggested originally, and try and produce something as soon as I get a chance.

I would love it if you could produce a grammar that implemented UAX#29!

If so, in my opinion it should become the StandardAnalyzer for the next Lucene 
version. If I thought I could do it correctly, I would have already done it, as 
the support for the Unicode properties needed to do this is now in the trunk of 
JFlex!

here are some references that might help: 
The standard itself: http://unicode.org/reports/tr29/

particularly the Testing portion: 
http://unicode.org/reports/tr41/tr41-5.html#Tests29

Unicode provides a WordBreakTest.txt file, that we could use from Junit, to 
help verify correctness: 
http://www.unicode.org/Public/UNIDATA/auxiliary/WordBreakTest.txt

I'll warn you I think it might be hard, but perhaps it's not that bad. In 
particular, the standard is defined in terms of chained rules, and JFlex 
doesn't support rule chaining, but I am not convinced we need rule chaining to 
implement WordBreak (maybe LineBreak, but maybe WordBreak can be done easily 
without it?) 

Steven Rowe is the expert on this stuff, maybe he has some ideas.
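
As a concrete starting point, parsing that test file is mechanical: each line 
interleaves break marks with hex code points -- ÷ where a boundary must occur, 
× where it must not. A hedged sketch of the parsing step only (the tokenizer 
comparison is left as a comment):

{code}
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Hedged sketch: read WordBreakTest.txt (UTF-8), build each test string and
// the offsets where UAX#29 requires a word boundary.
public class WordBreakTestParser {
  public static void main(String[] args) throws Exception {
    BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(args[0]), "UTF-8"));
    String line;
    while ((line = in.readLine()) != null) {
      int hash = line.indexOf('#');
      if (hash >= 0) line = line.substring(0, hash); // strip trailing comment
      line = line.trim();
      if (line.length() == 0) continue;
      StringBuilder text = new StringBuilder();
      List<Integer> boundaries = new ArrayList<Integer>();
      for (String tok : line.split("\\s+")) {
        if (tok.equals("\u00F7")) {         // ÷ : boundary required here
          boundaries.add(text.length());
        } else if (!tok.equals("\u00D7")) { // × : no boundary; else a hex cp
          text.appendCodePoint(Integer.parseInt(tok, 16));
        }
      }
      // ... run the tokenizer over text and compare against boundaries ...
    }
    in.close();
  }
}
{code}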

 StandardTokenizer Javadoc does not correctly describe tokenization around 
 punctuation characters
 

 Key: LUCENE-2167
 URL: https://issues.apache.org/jira/browse/LUCENE-2167
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1, 2.9, 2.9.1, 3.0
Reporter: Shyamal Prasad
Priority: Minor
 Attachments: LUCENE-2167.patch, LUCENE-2167.patch

   Original Estimate: 0.5h
  Remaining Estimate: 0.5h

 The Javadoc for StandardTokenizer states:
 {quote}
 Splits words at punctuation characters, removing punctuation. 
 However, a dot that's not followed by whitespace is considered part of a 
 token.
 Splits words at hyphens, unless there's a number in the token, in which case 
 the whole 
 token is interpreted as a product number and is not split.
 {quote}
 This is not accurate. The actual JFlex implementation treats hyphens 
 interchangeably with punctuation. So, for example, "video,mp4,test" results 
 in a *single* token and not three tokens as the documentation would suggest.
 Additionally, the documentation suggests that "video-mp4-test-again" would 
 become a single token, but in reality it results in two tokens: 
 "video-mp4-test" and "again".
 IMHO the parser implementation is fine as is since it is hard to keep 
 everyone happy, but it is probably
 worth cleaning up the documentation string. 
 The patch included here updates the documentation string and adds a few test 
 cases to confirm the cases described above.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2167) StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters

2010-02-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838073#action_12838073
 ] 

Robert Muir commented on LUCENE-2167:
-

BTW, here is a statement from the standard that seems to confirm my 
suspicions:

In section 6.3, there is an example of the grapheme cluster boundaries 
converted into a simple regex (the kind we could do easily in JFlex now that it 
has the properties available).

They make this statement: "Such a regular expression can also be turned into a 
fast, deterministic finite-state machine. Similar regular expressions are 
possible for Word boundaries. Line and Sentence boundaries are more 
complicated, and more difficult to represent with regular expressions."

 StandardTokenizer Javadoc does not correctly describe tokenization around 
 punctuation characters
 

 Key: LUCENE-2167
 URL: https://issues.apache.org/jira/browse/LUCENE-2167
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1, 2.9, 2.9.1, 3.0
Reporter: Shyamal Prasad
Priority: Minor
 Attachments: LUCENE-2167.patch, LUCENE-2167.patch

   Original Estimate: 0.5h
  Remaining Estimate: 0.5h

 The Javadoc for StandardTokenizer states:
 {quote}
 Splits words at punctuation characters, removing punctuation. 
 However, a dot that's not followed by whitespace is considered part of a 
 token.
 Splits words at hyphens, unless there's a number in the token, in which case 
 the whole 
 token is interpreted as a product number and is not split.
 {quote}
 This is not accurate. The actual JFlex implementation treats hyphens 
 interchangeably with punctuation. So, for example, "video,mp4,test" results 
 in a *single* token and not three tokens as the documentation would suggest.
 Additionally, the documentation suggests that "video-mp4-test-again" would 
 become a single token, but in reality it results in two tokens: 
 "video-mp4-test" and "again".
 IMHO the parser implementation is fine as is since it is hard to keep 
 everyone happy, but it is probably
 worth cleaning up the documentation string. 
 The patch included here updates the documentation string and adds a few test 
 cases to confirm the cases described above.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2167) StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters

2010-02-24 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838081#action_12838081
 ] 

Steven Rowe edited comment on LUCENE-2167 at 2/24/10 11:27 PM:
---

I wrote word break rules grammar specifications for JFlex 1.5.0-SNAPSHOT and 
both Unicode versions 5.1 and 5.2 - you can see the files here:

http://jflex.svn.sourceforge.net/viewvc/jflex/trunk/testsuite/testcases/src/test/cases/unicode-word-break/

The files are {{UnicodeWordBreakRules_5_\*.\*}} - these are written to: parse 
the Unicode test files; run the generated scanner against each composed test 
string; output the break opportunities/prohibitions in the same format as the 
test files; and then finally compare the output against the test file itself, 
looking for a match.  (These tests currently pass.)

The .flex files would need to be significantly changed to be used as a 
StandardTokenizer replacement, but you can get an idea from them how to 
implement the Unicode word break rules in (as yet unreleased version 1.5.0) 
JFlex syntax.

 StandardTokenizer Javadoc does not correctly describe tokenization around 
 punctuation characters
 

 Key: LUCENE-2167
 URL: https://issues.apache.org/jira/browse/LUCENE-2167
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1, 2.9, 2.9.1, 3.0
Reporter: Shyamal Prasad
Priority: Minor
 Attachments: LUCENE-2167.patch, LUCENE-2167.patch

   Original Estimate: 0.5h
  Remaining Estimate: 0.5h

 The Javadoc for StandardTokenizer states:
 {quote}
 Splits words at punctuation characters, removing punctuation. 
 However, a dot that's not followed by whitespace is considered part of a 
 token.
 Splits words at hyphens, unless there's a number in the token, in which case 
 the whole 
 token is interpreted as a product number and is not split.
 {quote}
 This is not accurate. The actual JFlex implementation treats hyphens 
 interchangeably with punctuation. So, for example, "video,mp4,test" results 
 in a *single* token and not three tokens as the documentation would suggest.
 Additionally, the documentation suggests that "video-mp4-test-again" would 
 become a single token, but in reality it results in two tokens: 
 "video-mp4-test" and "again".
 IMHO the parser implementation is fine as is since it is hard to keep 
 everyone happy, but it is probably
 worth cleaning up the documentation string. 
 The patch included here updates the documentation string and adds a few test 
 cases to confirm the cases described above.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2167) StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters

2010-02-24 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838081#action_12838081
 ] 

Steven Rowe commented on LUCENE-2167:
-

I wrote word break rules grammar specifications for JFlex 1.5.0-SNAPSHOT and 
both Unicode versions 5.1 and 5.2 - you can see the files here:

http://jflex.svn.sourceforge.net/viewvc/jflex/trunk/testsuite/testcases/src/test/cases/unicode-word-break/

The files are UnicodeWordBreakRules_5_*.* - these are written to: parse the 
Unicode test files; run the generated scanner against each composed test 
string; output the break opportunities/prohibitions in the same format as the 
test files; and then finally compare the output against the test file itself, 
looking for a match.  (These tests currently pass.)

The .flex files would need to be significantly changed to be used as a 
StandardTokenizer replacement, but you can get an idea from them how to 
implement the Unicode word break rules in (as yet unreleased version 1.5.0) 
JFlex syntax.

 StandardTokenizer Javadoc does not correctly describe tokenization around 
 punctuation characters
 

 Key: LUCENE-2167
 URL: https://issues.apache.org/jira/browse/LUCENE-2167
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1, 2.9, 2.9.1, 3.0
Reporter: Shyamal Prasad
Priority: Minor
 Attachments: LUCENE-2167.patch, LUCENE-2167.patch

   Original Estimate: 0.5h
  Remaining Estimate: 0.5h

 The Javadoc for StandardTokenizer states:
 {quote}
 Splits words at punctuation characters, removing punctuation. 
 However, a dot that's not followed by whitespace is considered part of a 
 token.
 Splits words at hyphens, unless there's a number in the token, in which case 
 the whole 
 token is interpreted as a product number and is not split.
 {quote}
 This is not accurate. The actual JFlex implementation treats hyphens 
 interchangeably with punctuation. So, for example, "video,mp4,test" results 
 in a *single* token and not three tokens as the documentation would suggest.
 Additionally, the documentation suggests that "video-mp4-test-again" would 
 become a single token, but in reality it results in two tokens: 
 "video-mp4-test" and "again".
 IMHO the parser implementation is fine as is since it is hard to keep 
 everyone happy, but it is probably
 worth cleaning up the documentation string. 
 The patch included here updates the documentation string and adds a few test 
 cases to confirm the cases described above.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2167) StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters

2010-02-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838094#action_12838094
 ] 

Robert Muir commented on LUCENE-2167:
-

Steven, thanks for providing the link.

I guess this is the point where I also say: I think it would be really nice 
for StandardTokenizer to adhere straight to the standard as much as we can 
with JFlex (I realize in 1.5 we won't have > 0xFFFF support). Then its name 
would actually make sense.

In my opinion, such a transition would involve something like renaming the old 
StandardTokenizer to EuropeanTokenizer, as its javadoc claims:
{code}
This should be a good tokenizer for most European-language documents
{code}

The new StandardTokenizer could then say
{code}
This should be a good tokenizer for most languages.
{code}

All the English/Euro-centric stuff, like the acronym/company/apostrophe 
handling, could stay with that EuropeanTokenizer or whatever it's called, and 
it could be used by the European analyzers.

But if we implement the Unicode rules, I think we should drop all this 
English/Euro-centric stuff from StandardTokenizer. Otherwise it should be 
called *StandardishTokenizer*.

We can obviously preserve backwards compat with Version, as Uwe has created a 
way to use a different grammar for a different Version.

I expect some -1s to this; awaiting comments :)
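
The Version-based switch would make the cutover look roughly like this (a 
sketch; the scanner class names are hypothetical, only the Version check is 
the existing API):

{code}
import java.io.Reader;
import org.apache.lucene.util.Version;

// Hedged sketch: pick the grammar per matchVersion, preserving old behavior.
public class VersionedTokenizerFactory {
  interface Scanner { /* token scanner abstraction for the sketch */ }

  static Scanner newScanner(Version matchVersion, Reader input) {
    if (matchVersion.onOrAfter(Version.LUCENE_31)) {
      return newUnicodeRulesScanner(input);  // UAX#29 grammar (hypothetical)
    }
    return newClassicScanner(input);         // pre-3.1 behavior (hypothetical)
  }

  static Scanner newUnicodeRulesScanner(Reader input) { return null; }
  static Scanner newClassicScanner(Reader input) { return null; }
}
{code}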

 StandardTokenizer Javadoc does not correctly describe tokenization around 
 punctuation characters
 

 Key: LUCENE-2167
 URL: https://issues.apache.org/jira/browse/LUCENE-2167
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1, 2.9, 2.9.1, 3.0
Reporter: Shyamal Prasad
Priority: Minor
 Attachments: LUCENE-2167.patch, LUCENE-2167.patch

   Original Estimate: 0.5h
  Remaining Estimate: 0.5h

 The Javadoc for StandardTokenizer states:
 {quote}
 Splits words at punctuation characters, removing punctuation. 
 However, a dot that's not followed by whitespace is considered part of a 
 token.
 Splits words at hyphens, unless there's a number in the token, in which case 
 the whole 
 token is interpreted as a product number and is not split.
 {quote}
 This is not accurate. The actual JFlex implementation treats hyphens 
 interchangeably with
 punctuation. So, for example video,mp4,test results in a *single* token and 
 not three tokens
 as the documentation would suggest.
 Additionally, the documentation suggests that video-mp4-test-again would 
 become a single
 token, but in reality it results in two tokens: video-mp4-test and again.
 IMHO the parser implementation is fine as is since it is hard to keep 
 everyone happy, but it is probably
 worth cleaning up the documentation string. 
 The patch included here updates the documentation string and adds a few test 
 cases to confirm the cases described above.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2010-02-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838101#action_12838101
 ] 

Robert Muir commented on LUCENE-2074:
-

Uwe, given Steven's comment above, I think we should move forward with this 
issue and JFlex 1.5?

 Use a separate JFlex generated Unicode 4 by Java 5 compatible 
 StandardTokenizer
 ---

 Key: LUCENE-2074
 URL: https://issues.apache.org/jira/browse/LUCENE-2074
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 3.0
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
 LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch


 The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
 (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
 regenerate the file.
 After regeneration the Tokenizer behaves differently for some characters. 
 Because of that, we should only use the new TokenizerImpl when 
 Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-2278) FastVectorHighlighter: highlighted term is out of alignment in multi-valued NOT_ANALYZED field

2010-02-24 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved LUCENE-2278.


Resolution: Fixed

Committed revision 916090.

 FastVectorHighlighter: highlighted term is out of alignment in multi-valued 
 NOT_ANALYZED field
 --

 Key: LUCENE-2278
 URL: https://issues.apache.org/jira/browse/LUCENE-2278
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.9, 2.9.1, 3.0
Reporter: Koji Sekiguchi
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2278.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2285) Code cleanup from all sorts of (trivial) warnings

2010-02-24 Thread Shai Erera (JIRA)
Code cleanup from all sorts of (trivial) warnings
-

 Key: LUCENE-2285
 URL: https://issues.apache.org/jira/browse/LUCENE-2285
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Shai Erera
Priority: Minor
 Fix For: 3.1


I would like to do some code cleanup and remove all sorts of trivial warnings, 
like unnecessary casts, problems w/ javadocs, unused variables, redundant null 
checks, unnecessary semicolon etc. These are all very trivial and should not 
pose any problem.

I'll create another issue for getting rid of deprecated code usage, like 
LuceneTestCase and all sorts of deprecated constructors. That's also trivial 
because it only affects Lucene code, but it's a different type of change.

Another issue I'd like to create is about introducing more generics in the 
code, where it's missing today - not changing existing API. There are many 
places in the code like that.

So, with your permission, I'll start with the trivial ones first, and then move 
on to the others.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Adding .classpath.tmpl

2010-02-24 Thread Shai Erera
Hi

I always find it annoying that when I check out the code into a new project in
Eclipse, I need to put everything I care about on the classpath and add the
dependent libraries. On another project I'm involved with, we did that process
once -- adding all the source code and the libraries to the classpath -- and
created a .classpath.tmpl. Now when people check out the code, they can copy
the content of that file into their .classpath file, and setting up the
project goes from a couple of minutes down to a few seconds.

I don't want to check-in .classpath because not everyone wants all the code
in their classpath.

I attached such a file to this mail. Note that the only dependency that will
break on other machines is the ant.jar dependency, which on my Windows machine
is located under c:\ant. That jar is required to compile contrib/ant from
Eclipse. Not sure how to resolve that, except by removing that line from the
file and documenting separately that that's what you need to do if you want
to add contrib/ant ...

The file is sorted by name, putting the core stuff at the top - so it's easy
for people to selectively add the interesting packages.
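
To give a flavor, here's roughly what such a template looks like (entries 
trimmed, jar versions and paths hypothetical):

{code}
<?xml version="1.0" encoding="UTF-8"?>
<classpath>
  <classpathentry kind="src" path="src/java"/>
  <classpathentry kind="src" path="src/test"/>
  <classpathentry kind="src" path="contrib/analyzers/common/src/java"/>
  <classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER"/>
  <classpathentry kind="lib" path="lib/junit-4.7.jar"/>
  <!-- machine-specific; see the note about contrib/ant above -->
  <classpathentry kind="lib" path="c:/ant/lib/ant.jar"/>
  <classpathentry kind="output" path="bin"/>
</classpath>
{code}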

I don't know if an issue is required; if so, I can create one and move the
discussion there.

Shai


lucene.classpath.tmpl
Description: Binary data

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2285) Code cleanup from all sorts of (trivial) warnings

2010-02-24 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838190#action_12838190
 ] 

Shai Erera commented on LUCENE-2285:


Can someone please clarify these for me:
|| Description || Class || Line ||
| Unsupported @SuppressWarnings("SerializableHasSerializationMethods") | TestCustomScoreQuery.java | 87 |
| Unsupported @SuppressWarnings("SerializableHasSerializationMethods") | TestCustomScoreQuery.java | 123 |
| Unsupported @SuppressWarnings("UseOfSystemOutOrSystemErr") | TestFieldScoreQuery.java | 42 |
| Unsupported @SuppressWarnings("UseOfSystemOutOrSystemErr") | TestOrdValues.java | 37 |

Are these meant to be there and Eclipse just doesn't recognize them for some 
reason, or are they a mistake?

 Code cleanup from all sorts of (trivial) warnings
 -

 Key: LUCENE-2285
 URL: https://issues.apache.org/jira/browse/LUCENE-2285
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Shai Erera
Priority: Minor
 Fix For: 3.1


 I would like to do some code cleanup and remove all sorts of trivial 
 warnings, like unnecessary casts, problems w/ javadocs, unused variables, 
 redundant null checks, unnecessary semicolons, etc. These are all very trivial 
 and should not pose any problem.
 I'll create another issue for getting rid of deprecated code usage, like 
 LuceneTestCase and all sorts of deprecated constructors. That's also trivial 
 because it only affects Lucene code, but it's a different type of change.
 Another issue I'd like to create is about introducing more generics in the 
 code, where it's missing today - not changing existing API. There are many 
 places in the code like that.
 So, with your permission, I'll start with the trivial ones first, and then 
 move on to the others.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2010-02-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838216#action_12838216
 ] 

Uwe Schindler commented on LUCENE-2074:
---

I will update the patch (using TEST_VERSION and so on) later and then we can 
proceed.

 Use a separate JFlex generated Unicode 4 by Java 5 compatible 
 StandardTokenizer
 ---

 Key: LUCENE-2074
 URL: https://issues.apache.org/jira/browse/LUCENE-2074
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 3.0
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
 LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch


 The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
 (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
 regenerate the file.
 After regeneration the Tokenizer behaves differently for some characters. 
 Because of that, we should only use the new TokenizerImpl when 
 Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2285) Code cleanup from all sorts of (trivial) warnings

2010-02-24 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838225#action_12838225
 ] 

Shai Erera commented on LUCENE-2285:


bq. .. not willing to add these stupid @Test everywhere

I don't share the same feeling ... I think it's a strong capability -- writing 
a method that doesn't need to start w/ testXYZ just to be run by JUnit (though 
I do both for clarity). I think moving to JUnit 4 only simplifies things, as it 
allows testing classes w/o the need to extend TestCase. But I'm not going to 
argue about it here; I'd like to keep this issue contained, and short. So I 
won't touch the LuceneTestCase deprecation, as it's still controversial judging 
by what you say. 
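
(For concreteness, the capability in question -- a minimal sketch:)

{code}
import static org.junit.Assert.assertEquals;
import org.junit.Test;

// JUnit 4: any @Test-annotated method runs; no testXYZ naming and no
// TestCase subclass required.
public class SimpleJUnit4Example {
  @Test
  public void absoluteValue() {
    assertEquals(5, Math.abs(-5));
  }
}
{code}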

I'll remove those SuppressWarnings then?

About generics: these are the internal parts of the code, like using raw List, 
ArrayList, etc. Scanning quickly through the list, it looks like most of the 
Lucene-related warnings are about referencing them ... so it should also be 
easy to fix.
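
(The kind of mechanical change meant here, for example:)

{code}
import java.util.ArrayList;
import java.util.List;

public class GenericsCleanupExample {
  // Before: raw type -- compiles, but warns at every use and needs casts.
  List rawNames = new ArrayList();
  // After: same runtime behavior, warning-free and self-documenting.
  List<String> names = new ArrayList<String>();
}
{code}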

I'll take a look at the code style settings 
(http://wiki.apache.org/lucene-java/HowToContribute?action=AttachFiledo=viewtarget=Eclipse-Lucene-Codestyle.xml?),
 but I'm talking about compiler warnings.

 Code cleanup from all sorts of (trivial) warnings
 -

 Key: LUCENE-2285
 URL: https://issues.apache.org/jira/browse/LUCENE-2285
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Shai Erera
Priority: Minor
 Fix For: 3.1


 I would like to do some code cleanup and remove all sorts of trivial 
 warnings, like unnecessary casts, problems w/ javadocs, unused variables, 
 redundant null checks, unnecessary semicolons, etc. These are all very trivial 
 and should not pose any problem.
 I'll create another issue for getting rid of deprecated code usage, like 
 LuceneTestCase and all sorts of deprecated constructors. That's also trivial 
 because it only affects Lucene code, but it's a different type of change.
 Another issue I'd like to create is about introducing more generics in the 
 code, where it's missing today - not changing existing API. There are many 
 places in the code like that.
 So, with your permission, I'll start with the trivial ones first, and then 
 move on to the others.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org