[jira] Commented: (LUCENE-500) Lucene 2.0 requirements - Remove all deprecated code

2006-03-03 Thread Grant Ingersoll (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-500?page=comments#action_12368704 ]

Grant Ingersoll commented on LUCENE-500:


Does that mean, then, that the usages of DateField in the QueryParser and the
DateFilter need to be preserved as well?  Doesn't that mean that anyone using
the QueryParser to do range queries will need their index to be created using
DateField instead of DateTools?

Is there a way we can make DateTools backward compatible with DateField when
appropriate, so that we could still remove DateField?
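For illustration, here is a minimal sketch (assuming the Lucene 1.9
org.apache.lucene.document APIs) of why the two encodings cannot match each
other's terms:

    import java.util.Date;

    import org.apache.lucene.document.DateField;   // deprecated in 1.9
    import org.apache.lucene.document.DateTools;

    public class DateEncodings {
      public static void main(String[] args) {
        Date d = new Date();
        // DateField encodes the raw millisecond value as a radix-36 string.
        String oldTerm = DateField.dateToString(d);
        // DateTools encodes a timestamp string at a chosen resolution.
        String newTerm = DateTools.dateToString(d, DateTools.Resolution.MILLISECOND);
        // The two formats never collide, so a range query built with one
        // encoding cannot match terms indexed with the other.
        System.out.println(oldTerm + "  vs.  " + newTerm);
      }
    }

Any backward-compatibility shim would presumably have to detect which encoding
an index uses and translate range endpoints accordingly.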

> Lucene 2.0 requirements - Remove all deprecated code
> ----------------------------------------------------
>
>          Key: LUCENE-500
>          URL: http://issues.apache.org/jira/browse/LUCENE-500
>      Project: Lucene - Java
>         Type: Task
>     Versions: 1.9
>     Reporter: Grant Ingersoll
>  Attachments: deprecation.txt
>
> Per the move to Lucene 2.0 from 1.9, remove all deprecated code and update
> documentation, etc.
> Patch to follow shortly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-502) TermScorer caches values unnecessarily

2006-03-03 Thread Steven Tamm (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-502?page=comments#action_12368768 ]

Steven Tamm commented on LUCENE-502:


If you're using a WildcardTermEnum, this optimization saves a ton.  We usually
do wildcard searches that retrieve 50-5000 terms.  Since each of these terms
gets its own new TermScorer, removing the caching saves a lot.  For a query
with 1800 terms, it saves about 800K per query, and it's also about 15%
faster.

Don't double buffer.

> TermScorer caches values unnecessarily
> --------------------------------------
>
>          Key: LUCENE-502
>          URL: http://issues.apache.org/jira/browse/LUCENE-502
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Search
>     Versions: 1.9
>     Reporter: Steven Tamm
>  Attachments: TermScorer.patch
>
> TermScorer aggressively caches the doc and freq of 32 documents at a time
> for each term scored.  When querying for a lot of terms, this creates a lot
> of unnecessary garbage.  The SegmentTermDocs from which it retrieves its
> information doesn't have any optimizations for bulk loading, so the caching
> is unnecessary.
> In addition, it has a SCORE_CACHE that's of limited benefit.  It caches the
> result of a sqrt that should be placed in DefaultSimilarity, and if you're
> only scoring a few documents that contain those terms, there's no need to
> precalculate the sqrt, especially on modern VMs.
> Enclosed is a patch that replaces TermScorer with a version that does not
> cache the docs or freqs.  In the case of a lot of queries, that saves 196
> bytes/term, the unnecessary disk IO, and extra sqrts, all of which add up.
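For reference, the caching being described looks roughly like this in the 1.9
TermScorer (a paraphrased sketch from the released source, not an exact
excerpt):

    // Per-instance buffers, refilled 32 postings at a time:
    private final int[] docs = new int[32];    // buffered doc numbers
    private final int[] freqs = new int[32];   // buffered term frequencies

    // Per-instance score cache: tf(f) * weight for every freq below 32.
    private static final int SCORE_CACHE_SIZE = 32;
    private final float[] scoreCache = new float[SCORE_CACHE_SIZE];

    // In the constructor:
    for (int f = 0; f < SCORE_CACHE_SIZE; f++)
      scoreCache[f] = getSimilarity().tf(f) * weightValue;

    // In next(), when the buffer runs dry:
    pointerMax = termDocs.read(docs, freqs);   // bulk-read up to 32 postings

Every TermScorer instance allocates all of this up front, even when it ends up
scoring only a document or two; that per-instance allocation is the garbage
the patch removes.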




[jira] Commented: (LUCENE-502) TermScorer caches values unnecessarily

2006-03-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-502?page=comments#action_12368770 ]

Doug Cutting commented on LUCENE-502:

It is not clear to me that your uses are typical uses.  These optimizations
were added because they made big improvements; they were not premature.  In
some cases JVMs may have evolved so that some of them are no longer required,
but some of them may still make significant improvements for lots of users.
We really need a benchmark suite to better understand the effects of things
like this...





[jira] Commented: (LUCENE-505) MultiReader.norm() takes up too much memory: norms byte[] should be made into an Object

2006-03-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-505?page=comments#action_12368771 ]

Doug Cutting commented on LUCENE-505:

It is not clear to me that your uses are typical uses.  These optimizations
were added because they made big improvements; they were not premature.  In
some cases JVMs may have evolved so that some of them are no longer required,
but some of them may still make significant improvements for lots of users.

I'd like to see some benchmarks from other applications before we commit big 
changes to such inner loops.

> MultiReader.norm() takes up too much memory: norms byte[] should be made
> into an Object
> -------------------------------------------------------------------------
>
>          Key: LUCENE-505
>          URL: http://issues.apache.org/jira/browse/LUCENE-505
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Index
>     Versions: 2.0
>  Environment: Patch is against Lucene 1.9 trunk (as of Mar 1 06)
>     Reporter: Steven Tamm
>  Attachments: LazyNorms.patch, NormFactors.patch, NormFactors.patch
>
> MultiReader.norms() is very inefficient: it has to construct a byte array
> that's as long as all the documents in every segment.  This doubles the
> memory requirement for scoring MultiReaders vs. SegmentReaders.  Although
> this is cached, it's still a baseline of memory that is unnecessary.
> The problem is that the normalization factors are passed around as a
> byte[].  If they were instead wrapped in an Object, you could perform a
> whole host of optimizations:
> a.  When reading, you wouldn't have to construct a "fakeNorms" array of all
> 1.0fs.  You could instead return a singleton object that would just return
> 1.0f.
> b.  MultiReader could use an object that delegates to the NormFactors of
> the subreaders.
> c.  You could write an implementation that uses mmap to access the norm
> factors.  Or, if the index isn't long-lived, you could use an
> implementation that reads directly from the disk.
> The patch provided here replaces the use of byte[] with a new abstract
> class called NormFactors, which has two methods on it:
>   public abstract byte getByte(int doc) throws IOException;  // returns norms[doc]
>   public float getFactor(int doc) throws IOException;  // calls Similarity.decodeNorm(getByte(doc))
> There are four implementations of this abstract class:
> 1.  NormFactors.EmptyNormFactors - replaces the fakeNorms with a singleton
> that only returns 1.0.
> 2.  NormFactors.ByteNormFactors - converts a byte[] to a NormFactors for
> backwards compatibility in constructors.
> 3.  MultiNormFactors - multiplexes the NormFactors in MultiReader to
> prevent the need to construct the gigantic norms array.
> 4.  SegmentReader.Norm - the same class as before, but now extends
> NormFactors to provide the same access.
> In addition, many of the Query and Scorer classes were changed to pass
> around NormFactors instead of byte[], and to call getFactor() instead of
> using the byte[].  I have kept IndexReader.norms(String) around for
> backwards compatibility, but marked it as deprecated.  I believe that the
> use of ByteNormFactors in IndexReader.getNormFactors() will keep backward
> compatibility with other IndexReader implementations, but I don't know how
> to test that.
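A compact sketch of the abstraction described above (the class and method
names come from the patch description; the bodies are illustrative guesses,
not the patch itself):

    import java.io.IOException;

    import org.apache.lucene.search.Similarity;

    public abstract class NormFactors {
      // Returns the raw norm byte for a document (replaces norms[doc]).
      public abstract byte getByte(int doc) throws IOException;

      // Decodes the byte into the factor used for scoring.
      public float getFactor(int doc) throws IOException {
        return Similarity.decodeNorm(getByte(doc));
      }

      // 1. Stands in for the all-1.0f "fakeNorms" array.
      public static class EmptyNormFactors extends NormFactors {
        public byte getByte(int doc) {
          return Similarity.encodeNorm(1.0f);
        }
      }

      // 2. Wraps a plain byte[] for backwards compatibility.
      public static class ByteNormFactors extends NormFactors {
        private final byte[] norms;
        public ByteNormFactors(byte[] norms) { this.norms = norms; }
        public byte getByte(int doc) { return norms[doc]; }
      }
    }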




Lucene 1.9.1 release available

2006-03-03 Thread Doug Cutting

Release 1.9.1 of Lucene is now available from:

http://www.apache.org/dyn/closer.cgi/lucene/java/

This fixes a serious bug in 1.9-final.  It is strongly recommended that 
all 1.9-final users upgrade to 1.9.1.  For details see:


http://svn.apache.org/repos/asf/lucene/java/tags/lucene_1_9_1/CHANGES.txt

Doug





[jira] Commented: (LUCENE-502) TermScorer caches values unnecessarily

2006-03-03 Thread Steven Tamm (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-502?page=comments#action_12368775 ]

Steven Tamm commented on LUCENE-502:


The main point is this: when you are using TermScorer to score one document,
it is doing a lot of extra work.  It's reading 31 extra documents from the
disk and calculating the weight factors for 31 extra documents.  The question
is how the caching helps when you have multiple documents.  My analysis is
that (with a modern VM) it helps you only if the docFreq of a term is 16-31
and you are using a ConjunctiveScorer (i.e. not wildcard searches).  I would
imagine that is not an uncommon use case.  But anyone using wildcard searches
will see an *immediate* benefit from installing this patch.

So I'm going to analyze this from the "amount of work to do" perspective.

TermScorer.next():  If you are calling TermScorer.next() there is no real
difference.  SegmentTermDocs.read(int[], int[]) is no different from calling
SegmentTermDocs.next() 32 times.  The patch switches TermScorer.next() to
always call next() on the underlying SegmentTermDocs.  The only cost I'm
removing is the caching, and I'm not adding any new ones.  Therefore there's
no change, with the exception of adding the cache for use in skipTo().

TermScorer.skipTo():  The only case where my patch is worse is if the
frequency of the term is greater than the skip interval (i.e. >= 16 documents
per term).  In this case, if you are retrieving more than 16 documents (but
fewer than 32), you can avoid accessing the skipStream entirely.  If you are
retrieving more than 32 documents, then you need to access the skipStream
anyway, and since both of the underlying IndexInputs are cached,
repositioning the freqStream is only pointer manipulation.

TermScorer.score():  "In some cases JVMs may have evolved so that some of
them are no longer required."  I can imagine that the scoreCache made a lot
of sense in JDK 1.1, when the cost of Math.sqrt was high.  However, if the
TermScorer is only going to be used for a single document, the cache is
obviously wasted work.  Like I said before, caching DefaultSimilarity.tf(int)
inside DefaultSimilarity would end up inlined by the HotSpot compiler, and
Math.sqrt compiles down to a processor instruction, so it's not a big deal.

I want other people to test this and tell me about any problems with it.
Whether or not you accept the patches is less important to me than making
them available to other people who have similar performance problems.
Perhaps I should have created a parallel structure to TermScorer that you can
use when you have a low hit/term ratio.




[jira] Commented: (LUCENE-502) TermScorer caches values unnecessarily

2006-03-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-502?page=comments#action_12368782 ]

Doug Cutting commented on LUCENE-502:

> The question is how does the caching help when you have multiple documents.  
> My analysis is that (with a modern VM) it helps you only if the docFreq of a 
> term is 16-31 and you are using a ConjunctiveScorer (i.e. not Wildcard 
> searches).

The conjunctive scorer does not call score(HitCollector,int); that method is
only called in a few cases anymore.  It can help a lot with a single-term
query for a very common term, or for disjunctive queries involving very
common terms, although BooleanScorer2 no longer uses it in this case.  That's
too bad: if all clauses of a query are optional, the old BooleanScorer was
faster.  But it didn't always return documents in order...  So it may indeed
be time to retire this method.

> SegmentTermDocs.read(int[], int[]) is no different from calling
> SegmentTermDocs.next() 32 times.

If that were the case, then the termDocs(int[], int[]) method would never
have been added!  Benchmarking showed this to be much faster.  There's also
optimized C++ code that implements this method in src/gcj.  In C++, with a
memory-mapped index, the I/O completely inlines.  When I last benchmarked
this in GCJ, it was twice as fast as anything HotSpot could do.

But without score(HitCollector,int), TermDocs.read(int[], int[]) will never
be called.  Sigh.

As for the scoreCache, this is certainly useful for terms that occur in 
thousands of documents, and useless for terms that occur only once.  Perhaps we 
should have two TermScorer implementations, one for common terms and one for 
rare terms, and have TermWeight select which to use.




[jira] Commented: (LUCENE-502) TermScorer caches values unnecessarily

2006-03-03 Thread Steven Tamm (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-502?page=comments#action_12368784 ]

Steven Tamm commented on LUCENE-502:


> The conjunctive scorer does not call score(HitCollector,int); that method
> is only called in a few cases anymore.

However, in your comments on LUCENE-505 you said this: "For example, in
TermScorer.score(HitCollector, int), Lucene's innermost loop, you change two
array accesses into a call to an interface.  That could make a substantial
difference."  Which is true?  Or, as seems likely, was TermScorer optimized
for a case that is no longer valid (i.e. ConjunctiveScorer)?

> If that were the case, then the termDocs(int[], int[]) method would never
> have been added!

This hasn't been true for at least 3 years.  Inlining by hand is no longer
necessary with HotSpot (I don't know about gcj).  Run a benchmark on JDK 1.5
to prove this to yourself.
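For example, a crude harness along these lines would settle it (hypothetical
code; reader and term are whatever index and term you have at hand):

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    public class ReadVsNext {
      // Times TermDocs.read(int[], int[]) against a plain next() loop.
      static void compare(IndexReader reader, Term term) throws IOException {
        int[] docs = new int[32], freqs = new int[32];

        TermDocs td = reader.termDocs(term);
        long t0 = System.currentTimeMillis();
        while (td.read(docs, freqs) > 0) { /* consume the buffered pairs */ }
        long bulk = System.currentTimeMillis() - t0;
        td.close();

        td = reader.termDocs(term);
        t0 = System.currentTimeMillis();
        while (td.next()) { td.doc(); td.freq(); }
        long oneByOne = System.currentTimeMillis() - t0;
        td.close();

        System.out.println(term + ": bulk=" + bulk + "ms, next()=" + oneByOne + "ms");
      }
    }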

In short, we should have two TermScorer implementations: one for a low number
of documents per term, and one for a high number.





[jira] Commented: (LUCENE-502) TermScorer caches values unnecessarily

2006-03-03 Thread paul.elschot (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-502?page=comments#action_12368792 ]

paul.elschot commented on LUCENE-502:

>> The question is how does the caching help when you have multiple documents. 
>> My analysis is that (with a modern VM) it helps you only if the docFreq of a 
>> term is 16-31 and you are using a ConjunctiveScorer (i.e. not Wildcard 
>> searches). 
 
> The conjunctive scorer does not call score(HitCollector,int). This is only 
> called in a few cases anymore. It can help a lot with a single-term query for 
> a very common term, or for disjunctive queries involving very common terms, 
> although BooleanScorer2 no longer uses it in this case. That's too bad. If 
> all clauses to a query are optional, then the old BooleanScorer was faster. 
> But it didn't always return documents in order... So it indeed may be time to 
> retire this method. 

With BooleanScorer2 it is quite possible to use different versions of
DisjunctionScorer: one for the query top level that does not need skipTo(),
and one for lower levels that allows skipTo().  The top-level one can be
implemented just like the "old" BooleanScorer.

IIRC the methods to implement such different behaviour are already in place
(for scoring a range of documents); it only needs to be implemented for
DisjunctionScorer, and the top-level BooleanScorer2 should then use it when
appropriate.

Regards,
Paul Elschot





[jira] Commented: (LUCENE-502) TermScorer caches values unnecessarily

2006-03-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-502?page=comments#action_12368797 ]

Doug Cutting commented on LUCENE-502:

> Which is true?  Or, as seems likely, was TermScorer optimized for a case
> that is no longer valid (i.e. ConjunctiveScorer)?

No, it was optimized for BooleanScorer's *disjunctive* scoring algorithm,
which is no longer used by default, but is faster than BooleanScorer2's
disjunctive scoring algorithm.  This applies to a very common type of query:
classic vector-space searches.  So this optimization may not be leveraged
much in the current codebase, but that does not mean it is no longer valid.
It may, however, slow other sorts of searches, like your wildcards.  The
challenge is not just figuring out how to make your application as fast as
possible, but how to do so without making others' and future applications
slower.

> In short, we should have two TermScorer implementations. One for low 
> documents/term, and one for high documents/term.

Yes, I think that would be useful.  Classically, total query processing time
is dominated by common terms, so that's an important case to optimize.  But
it seems that with wildcard queries over smaller collections these
optimizations become costly.  So two implementations seem like they would
make everyone happy.
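To make that concrete, a hypothetical sketch of the selection inside
TermWeight.scorer() (BufferedTermScorer, SimpleTermScorer and the cutoff of 32
are invented for illustration - nothing like this is in the attached patch):

    public Scorer scorer(IndexReader reader) throws IOException {
      TermDocs termDocs = reader.termDocs(term);
      if (termDocs == null)
        return null;
      byte[] norms = reader.norms(term.field());
      // Assumed cutoff: buffering and the score cache only pay off once a
      // term matches at least a full buffer's worth of documents.
      if (reader.docFreq(term) >= 32)
        return new BufferedTermScorer(this, termDocs, similarity, norms);
      return new SimpleTermScorer(this, termDocs, similarity, norms);
    }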




ArrayIndexOutOfBoundsException in org.apache.lucene.search.BooleanScorer2$Coordinator.coordFactor

2006-03-03 Thread Robin H. Johnson
I've been developing a search application and finally rolled it out to
production testing yesterday.  After a few million hits on our 5 search
nodes, I've found a glitch in Lucene :-).

Unfortunately I don't have the queries that triggered this - it occurred a
total of 11 times over the first 3 million hits according to my logging.

Backtrace:
java.lang.ArrayIndexOutOfBoundsException
at 
org.apache.lucene.search.BooleanScorer2$Coordinator.coordFactor(BooleanScorer2.java:54)
at 
org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:328)
at 
org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:291)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:132)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:110)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
at org.apache.lucene.search.Hits.(Hits.java:52)
at org.apache.lucene.search.Searcher.search(Searcher.java:62)
at 
com.isohunt.isosearch.frontend.SearchHandlerFactory.search(SearchHandlerFactory.java:113)
at 
com.isohunt.isosearch.frontend.SearchHandler.process(SearchHandler.java:503)
at 
com.isohunt.isosearch.reactor.ReactorHandler.processAndHandOff(ReactorHandler.java:118)
at 
com.isohunt.isosearch.reactor.ReactorHandler.access$200(ReactorHandler.java:19)
at 
com.isohunt.isosearch.reactor.ReactorHandler$Processer.run(ReactorHandler.java:125)
at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown 
Source)
at java.lang.Thread.run(Thread.java:816)

The Hits instance is local to each thread and is not statically shared in
any way.  The Searcher is shared between threads - but this should be safe
according to the Lucene documentation.
The com.isohunt.isosearch.reactor package is my pluggable code based on
Doug Lea's Reactor design.

JDK is IBM 1.4.2SR4 on AMD64 hardware.

-- 
Robin Hugh Johnson
E-Mail : [EMAIL PROTECTED]
Home Page  : http://www.orbis-terrarum.net/?l=people.robbat2
ICQ#   : 30269588 or 41961639
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED  F38E B27B 944E 3488 4E85




Re: ArrayIndexOutOfBoundsException in org.apache.lucene.search.BooleanScorer2$Coordinator.coordFactor

2006-03-03 Thread Robin H. Johnson
On Fri, Mar 03, 2006 at 03:28:22PM -0800, Robin H. Johnson wrote:
> I've been developing a search application, and finally rolled it to
> production-testing yesterday, after a few million hits on our 5 search
> nodes, I've found a glitch in Lucene :-).
I left out the Lucene version: it's 1.9, with the LUCENE-511 fix that I
integrated myself before 1.9.1 was released.

-- 
Robin Hugh Johnson
E-Mail : [EMAIL PROTECTED]
Home Page  : http://www.orbis-terrarum.net/?l=people.robbat2
ICQ#   : 30269588 or 41961639
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED  F38E B27B 944E 3488 4E85




Re: Lucene 1.9.1 release available

2006-03-03 Thread Shay Banon
And I was hoping to get my name in Lucene CHANGES.txt. You know,  
something to show my children ;-)


On 3 Mar 2006, at 18:26, Doug Cutting wrote:


Release 1.9.1 of Lucene is now available from:

http://www.apache.org/dyn/closer.cgi/lucene/java/

This fixes a serious bug in 1.9-final.  It is strongly recommended  
that all 1.9-final users upgrade to 1.9.1.  For details see:


http://svn.apache.org/repos/asf/lucene/java/tags/lucene_1_9_1/CHANGES.txt


Doug





Online javadocs: 1.9-rc1

2006-03-03 Thread Chris Hostetter

Someone with the necessary permissions to update the javadocs on the
website might want to do so; they currently say "Lucene 1.9-rc1 API", which
might confuse people (even if the API is exactly the same as 1.9.1).

http://lucene.apache.org/java/docs/api/



-Hoss





Re: Online javadocs: 1.9-rc1

2006-03-03 Thread Doug Cutting

I just updated this.  Thanks for catching it.

Doug

Chris Hostetter wrote:

Someone with the necessary permissions to update the javadocs on the
website might want to do so; they currently say "Lucene 1.9-rc1 API", which
might confuse people (even if the API is exactly the same as 1.9.1).

http://lucene.apache.org/java/docs/api/



-Hoss





Re: Lucene 1.9.1 release available

2006-03-03 Thread Doug Cutting

Shay Banon wrote:
And I was hoping to get my name in Lucene CHANGES.txt. You know,  
something to show my children ;-)


Sorry, I was working quickly.  I just added you!

Doug




FilteredQuery within BooleanQuery issue

2006-03-03 Thread Erik Hatcher
I've run into what I feel is an issue with FilteredQuery.  The best  
description is an example.  First I've indexed three documents:


  public void setUp() throws IOException {
    RAMDirectory directory = new RAMDirectory();
    IndexWriter writer = new IndexWriter(directory,
        new WhitespaceAnalyzer(), true);

    Document doc = new Document();
    doc.add(new Field("field", "zero", Field.Store.YES,
        Field.Index.TOKENIZED));
    writer.addDocument(doc);

    doc = new Document();
    doc.add(new Field("field", "one", Field.Store.YES,
        Field.Index.TOKENIZED));
    writer.addDocument(doc);

    doc = new Document();
    doc.add(new Field("field", "two", Field.Store.YES,
        Field.Index.TOKENIZED));
    writer.addDocument(doc);
    writer.close();

    searcher = new IndexSearcher(directory);
  }

Now for a mock filter to keep things simple:

public class DummyFilter extends Filter {
  private int doc;

  public DummyFilter(int doc) {
    this.doc = doc;
  }

  public BitSet bits(IndexReader reader) throws IOException {
    BitSet bits = new BitSet(reader.maxDoc());
    bits.set(doc);
    return bits;
  }
}

And finally a test case that fails:

  public void testBoolean() throws Exception {
    BooleanQuery bq = new BooleanQuery();
    Query query = new FilteredQuery(new MatchAllDocsQuery(),
        new DummyFilter(0));
    bq.add(query, BooleanClause.Occur.MUST);
    query = new FilteredQuery(new MatchAllDocsQuery(),
        new DummyFilter(1));
    bq.add(query, BooleanClause.Occur.MUST);
    Hits hits = searcher.search(bq);
    assertEquals(0, hits.length());  // fails: hits.length() == 2
  }

I expect no documents to match this BooleanQuery, yet two documents match
(ids 0 and 1).  Am I right in thinking that no documents should match, since
each required clause selects a different document and there is no
intersection?  If so, what's the flaw in FilteredQuery that causes this?  If
I'm wrong in my assertion, how so?


For comparison, a ChainedFilter does do what I expect:

  public void testChainedFilter() throws Exception {
    ChainedFilter filter = new ChainedFilter(
        new Filter[] {new DummyFilter(0), new DummyFilter(1)},
        ChainedFilter.AND);
    Hits hits = searcher.search(new MatchAllDocsQuery(), filter);
    assertEquals(0, hits.length());  // passes
  }

Thanks,
Erik





Re: FilteredQuery within BooleanQuery issue

2006-03-03 Thread Yonik Seeley
This is the first time I've looked at FilteredQuery, but the scorer is
indeed flawed IMO.  next() and skipTo() simply iterate over the documents
that match the query, and score() merely returns 0 if the current document
doesn't match the filter:
  public boolean next() throws IOException { return scorer.next(); }

  public boolean skipTo(int i) throws IOException {
    return scorer.skipTo(i);
  }

  // if the document has been filtered out, set score to 0.0
  public float score() throws IOException {
    return (bitset.get(scorer.doc())) ? scorer.score() : 0.0f;
  }

The higher-level search functions happen to return the correct results
because they filter out any documents with a score <= 0.  But as a clause
inside a BooleanQuery, the zero-scored documents still count as matches -
which is why your test sees two hits.
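One possible shape of a fix (a hypothetical sketch, not the actual LUCENE-330
patch) is to advance past filtered-out documents instead of zero-scoring
them:

  public boolean next() throws IOException {
    while (scorer.next()) {
      if (bitset.get(scorer.doc()))
        return true;       // only stop on documents the filter accepts
    }
    return false;
  }

  public boolean skipTo(int i) throws IOException {
    if (!scorer.skipTo(i))
      return false;
    while (!bitset.get(scorer.doc())) {
      if (!scorer.next())
        return false;      // ran off the end without a filter match
    }
    return true;
  }

  // score() then no longer needs the 0.0f hack.
  public float score() throws IOException {
    return scorer.score();
  }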

Check out LUCENE-330 for possible fixes.  (sorry, firefox is refusing
to paste the URL for me again...)

-Yonik


On 3/3/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> I've run into what I feel is an issue with FilteredQuery.  The best
> description is an example.
[snip]
