[jira] Commented: (LUCENE-500) Lucene 2.0 requirements - Remove all deprecated code
[ http://issues.apache.org/jira/browse/LUCENE-500?page=comments#action_12368704 ]

Grant Ingersoll commented on LUCENE-500:

Does that mean, then, that the usages of it in the QueryParser and the DateFilter need to be preserved as well? Doesn't that mean that anyone using the QueryParser to do range queries will need their index to have been created using DateField instead of DateTools? Is there a way we can make DateTools backward compatible with DateField when appropriate, so that we could still remove DateField?

> Lucene 2.0 requirements - Remove all deprecated code
> ----------------------------------------------------
>
>          Key: LUCENE-500
>          URL: http://issues.apache.org/jira/browse/LUCENE-500
>      Project: Lucene - Java
>         Type: Task
>     Versions: 1.9
>     Reporter: Grant Ingersoll
>  Attachments: deprecation.txt
>
> Per the move to Lucene 2.0 from 1.9, remove all deprecated code and update
> documentation, etc.
> Patch to follow shortly.
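One illustrative possibility for such backward compatibility (not from the issue or any attached patch; the class name is hypothetical): a small shim that accepts both encodings, trying the DateTools form first and falling back to DateField's legacy base-36 encoding.

    import java.text.ParseException;
    import org.apache.lucene.document.DateField;
    import org.apache.lucene.document.DateTools;

    /**
     * Hypothetical compatibility shim (a sketch, not proposed code):
     * resolves an indexed date string to a timestamp, accepting both the
     * DateTools encoding and the legacy DateField encoding.
     */
    public class DateStringConverter {
        public static long stringToTime(String s) {
            try {
                // New-style DateTools encoding, e.g. "20060303" or "20060303152822000"
                return DateTools.stringToTime(s);
            } catch (ParseException e) {
                // Fall back to the legacy base-36 DateField encoding
                return DateField.stringToTime(s);
            }
        }
    }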
[jira] Commented: (LUCENE-502) TermScorer caches values unnecessarily
[ http://issues.apache.org/jira/browse/LUCENE-502?page=comments#action_12368768 ]

Steven Tamm commented on LUCENE-502:

If you're using a WildcardTermEnum, this optimization saves a ton. We usually do wildcard searches that retrieve 50-5000 terms. Since each one of these corresponds to a new TermScorer, removing the caching saves a lot. For a query that has 1800 terms, it saves 800K/query, and it's also about 15% faster. Don't double buffer.

> TermScorer caches values unnecessarily
> ---------------------------------------
>
>          Key: LUCENE-502
>          URL: http://issues.apache.org/jira/browse/LUCENE-502
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Search
>     Versions: 1.9
>     Reporter: Steven Tamm
>  Attachments: TermScorer.patch
>
> TermScorer aggressively caches the doc and freq of 32 documents at a time for
> each term scored. When querying for a lot of terms, this causes a lot of
> unnecessary garbage to be created. The SegmentTermDocs from which it
> retrieves its information doesn't have any optimizations for bulk loading,
> so the caching is unnecessary.
> In addition, it has a SCORE_CACHE of limited benefit. It caches
> the result of a sqrt that should be placed in DefaultSimilarity, and if
> you're only scoring a few documents that contain those terms, there's no need
> to precalculate the sqrt, especially on modern VMs.
> Enclosed is a patch that replaces TermScorer with a version that does not
> cache the docs or freqs. For queries with many terms, that saves 196
> bytes/term, plus the unnecessary disk I/O and extra sqrts, which adds up.
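A back-of-envelope check (an estimate, not a measurement): each TermScorer carries an int[32] doc buffer, an int[32] freq buffer, and a float[32] score cache, about 3 x 128 = 384 bytes of array payload plus object and array headers, so on the order of 450 bytes per scorer. At 1800 scorers per query, 1800 x ~450 bytes comes to roughly 800 KB, consistent with the 800K/query figure above.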
[jira] Commented: (LUCENE-502) TermScorer caches values unnecessarily
[ http://issues.apache.org/jira/browse/LUCENE-502?page=comments#action_12368770 ]

Doug Cutting commented on LUCENE-502:

It is not clear to me that your uses are typical uses. These optimizations were added because they made big improvements; they were not premature. In some cases JVMs may have evolved so that some of them are no longer required, but some of them may still make significant improvements for lots of users. We really need a benchmark suite to better understand the effects of things like this...
[jira] Commented: (LUCENE-505) MultiReader.norm() takes up too much memory: norms byte[] should be made into an Object
[ http://issues.apache.org/jira/browse/LUCENE-505?page=comments#action_12368771 ]

Doug Cutting commented on LUCENE-505:

It is not clear to me that your uses are typical uses. These optimizations were added because they made big improvements; they were not premature. In some cases JVMs may have evolved so that some of them are no longer required, but some of them may still make significant improvements for lots of users. I'd like to see some benchmarks from other applications before we commit big changes to such inner loops.

> MultiReader.norm() takes up too much memory: norms byte[] should be made into
> an Object
> ------------------------------------------------------------------------------
>
>          Key: LUCENE-505
>          URL: http://issues.apache.org/jira/browse/LUCENE-505
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Index
>     Versions: 2.0
>  Environment: Patch is against Lucene 1.9 trunk (as of Mar 1 06)
>     Reporter: Steven Tamm
>  Attachments: LazyNorms.patch, NormFactors.patch, NormFactors.patch
>
> MultiReader.norms() is very inefficient: it has to construct a byte array
> that's as long as all the documents in every segment. This doubles the
> memory requirement for scoring MultiReaders vs. SegmentReaders. Although
> this is cached, it's still a baseline of memory that is unnecessary.
> The problem is that the normalization factors are passed around as a byte[].
> If it were instead replaced with an Object, you could perform a whole host of
> optimizations:
> a. When reading, you wouldn't have to construct a "fakeNorms" array of all
>    1.0fs. You could instead return a singleton object that would just return
>    1.0f.
> b. MultiReader could use an object that delegates to the NormFactors of the
>    subreaders.
> c. You could write an implementation that uses mmap to access the norm
>    factors. Or, if the index isn't long-lived, you could use an implementation
>    that reads directly from the disk.
> The patch provided here replaces the use of byte[] with a new abstract class
> called NormFactors. NormFactors has two methods on it:
>   public abstract byte getByte(int doc) throws IOException; // returns byte[doc]
>   public float getFactor(int doc) throws IOException;       // calls Similarity.decodeNorm(getByte(doc))
> There are four implementations of this abstract class:
> 1. NormFactors.EmptyNormFactors - replaces the fakeNorms with a singleton
>    that only returns 1.0.
> 2. NormFactors.ByteNormFactors - converts a byte[] to a NormFactors for
>    backwards compatibility in constructors.
> 3. MultiNormFactors - multiplexes the NormFactors in MultiReader to prevent
>    the need to construct the gigantic norms array.
> 4. SegmentReader.Norm - same class, but now extends NormFactors to provide
>    the same access.
> In addition, many of the Query and Scorer classes were changed to pass around
> NormFactors instead of byte[], and to call getFactor() instead of using the
> byte[]. I have kept IndexReader.norms(String) for backwards compatibility,
> but marked it as deprecated. I believe that the use of ByteNormFactors in
> IndexReader.getNormFactors() will keep backward compatibility with other
> IndexReader implementations, but I don't know how to test that.
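For readers following along, here is a compact sketch of the abstraction as described above, reconstructed from this summary rather than taken from NormFactors.patch (the EMPTY and ByteNormFactors bodies are illustrative):

    import java.io.IOException;
    import org.apache.lucene.search.Similarity;

    /**
     * Reconstructed sketch of the NormFactors abstraction described above;
     * illustrative only, not the code from NormFactors.patch.
     */
    public abstract class NormFactors {

        /** Returns the encoded norm byte for a document (was norms[doc]). */
        public abstract byte getByte(int doc) throws IOException;

        /** Decodes the norm byte into a float factor. */
        public float getFactor(int doc) throws IOException {
            return Similarity.decodeNorm(getByte(doc));
        }

        /** Replaces the all-1.0f "fakeNorms" array with a singleton. */
        public static final NormFactors EMPTY = new NormFactors() {
            public byte getByte(int doc) {
                return Similarity.encodeNorm(1.0f);
            }
        };

        /** Wraps a plain byte[] for backwards compatibility in constructors. */
        public static class ByteNormFactors extends NormFactors {
            private final byte[] norms;
            public ByteNormFactors(byte[] norms) { this.norms = norms; }
            public byte getByte(int doc) { return norms[doc]; }
        }
    }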
Lucene 1.9.1 release available
Release 1.9.1 of Lucene is now available from:

  http://www.apache.org/dyn/closer.cgi/lucene/java/

This fixes a serious bug in 1.9-final. It is strongly recommended that all 1.9-final users upgrade to 1.9.1. For details see:

  http://svn.apache.org/repos/asf/lucene/java/tags/lucene_1_9_1/CHANGES.txt

Doug
[jira] Commented: (LUCENE-502) TermScorer caches values unnecessarily
[ http://issues.apache.org/jira/browse/LUCENE-502?page=comments#action_12368775 ]

Steven Tamm commented on LUCENE-502:

The main point is this: when TermScorer is used to score one document, it does a lot of extra work. It reads 31 extra documents from the disk and calculates the weight factors for 31 extra documents.

The question is how the caching helps when you have multiple documents. My analysis is that (with a modern VM) it helps only if the docFreq of a term is 16-31 and you are using a ConjunctionScorer (i.e. not wildcard searches). I would imagine this is a use case that is not uncommon. Anyone using wildcard searches will see an *immediate* benefit from installing this patch. So I'm going to analyze this from the "amount of work to do" perspective.

TermScorer.next(): If you are calling TermScorer.next(), there is no real difference. SegmentTermDocs.read(int[], int[]) is no different from calling SegmentTermDocs.next() 32 times. The patch switches TermScorer.next() to always call next() on the underlying SegmentTermDocs. The only cost I'm removing is the caching, and I'm not adding any new ones; therefore there's no change, with the exception of the cache used by skipTo().

TermScorer.skipTo(): The only case where my patch is worse is if the frequency of the term is greater than the skip interval (i.e. >= 16 documents per term). In that case, if you are retrieving more than 16 documents (but fewer than 32), you can avoid accessing the skipStream entirely. If you are retrieving more than 32 documents, then you need to access the skipStream anyway, and since both of the underlying IndexInputs are cached, repositioning the freqStream is only pointer manipulation.

TermScorer.score(): "In some cases JVM's may have evolved so that some of them are no longer required." I can imagine that the scoreCache made a lot of sense in JDK 1.1, when the cost of Math.sqrt was high. However, if the TermScorer is only going to be used for a single document, it is obviously wasted work. Like I said before, caching DefaultSimilarity.tf(int) inside DefaultSimilarity would end up inlined by the HotSpot compiler, and Math.sqrt compiles down to a processor instruction, so it's not a big deal.

I want other people to test this and tell me any problems with it. Whether or not you accept the patches is less important to me than providing them to other people who have similar performance problems. Perhaps I should have created a parallel structure to TermScorer that you can use when you have a low hit/term ratio.
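To make the proposal concrete, here is a rough reconstruction of the non-caching approach the patch describes (illustrative only, not the contents of TermScorer.patch; the class and field names are invented):

    import java.io.IOException;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.search.Similarity;

    /**
     * Illustrative reconstruction of the non-caching approach described in
     * the patch (not the actual TermScorer.patch): next() simply delegates
     * to the underlying TermDocs instead of bulk-reading 32 docs/freqs.
     */
    class UncachedTermScorerSketch {
        private final TermDocs termDocs;
        private final Similarity similarity;
        private final float weightValue;
        private final byte[] norms;

        UncachedTermScorerSketch(TermDocs termDocs, Similarity similarity,
                                 float weightValue, byte[] norms) {
            this.termDocs = termDocs;
            this.similarity = similarity;
            this.weightValue = weightValue;
            this.norms = norms;
        }

        boolean next() throws IOException {
            return termDocs.next();        // no 32-entry read-ahead buffer
        }

        float score() throws IOException {
            // tf() computed directly; no precomputed SCORE_CACHE of sqrt values
            float raw = similarity.tf(termDocs.freq()) * weightValue;
            return raw * Similarity.decodeNorm(norms[termDocs.doc()]);
        }
    }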
[jira] Commented: (LUCENE-502) TermScorer caches values unnecessarily
[ http://issues.apache.org/jira/browse/LUCENE-502?page=comments#action_12368782 ]

Doug Cutting commented on LUCENE-502:

> The question is how does the caching help when you have multiple documents.
> My analysis is that (with a modern VM) it helps you only if the docFreq of a
> term is 16-31 and you are using a ConjunctionScorer (i.e. not wildcard
> searches).

The conjunction scorer does not call score(HitCollector, int). That method is only called in a few cases anymore. It can help a lot with a single-term query for a very common term, or for disjunctive queries involving very common terms, although BooleanScorer2 no longer uses it in this case. That's too bad: if all clauses of a query are optional, the old BooleanScorer was faster, but it didn't always return documents in order... So it may indeed be time to retire this method.

> SegmentTermDocs.read(int[], int[]) is no different from calling
> SegmentTermDocs.next() 32 times.

If that were the case, then the read(int[], int[]) method would never have been added! Benchmarking showed it to be much faster. There's also optimized C++ code that implements this method in src/gcj. In C++, with a memory-mapped index, the I/O completely inlines. When I last benchmarked this in GCJ, it was twice as fast as anything HotSpot could do. But without score(HitCollector, int), TermDocs.read(int[], int[]) will never be called. Sigh.

As for the scoreCache, it is certainly useful for terms that occur in thousands of documents, and useless for terms that occur only once. Perhaps we should have two TermScorer implementations, one for common terms and one for rare terms, and have TermWeight select which to use.
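For reference, the bulk-read pattern under discussion looks roughly like this (a simplified illustration of the score(HitCollector, int) inner loop, not Lucene's actual code; the score computation is elided):

    import java.io.IOException;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.search.HitCollector;

    /** Simplified illustration of the bulk-read inner loop (not Lucene's code). */
    class BulkReadLoop {
        static void scoreAll(TermDocs termDocs, HitCollector hc) throws IOException {
            int[] docs = new int[32];   // the 32-entry read-ahead buffers at issue
            int[] freqs = new int[32];
            int n;
            // read() fills both arrays and returns how many entries were filled
            while ((n = termDocs.read(docs, freqs)) > 0) {
                for (int i = 0; i < n; i++) {
                    // real code computes a score from freqs[i], weight, and norms;
                    // here the raw freq stands in for the score
                    hc.collect(docs[i], freqs[i]);
                }
            }
        }
    }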
[jira] Commented: (LUCENE-502) TermScorer caches values unnecessarily
[ http://issues.apache.org/jira/browse/LUCENE-502?page=comments#action_12368784 ]

Steven Tamm commented on LUCENE-502:

> The conjunction scorer does not call score(HitCollector, int). That method is
> only called in a few cases anymore.

However, in your comments on LUCENE-505 you said this: "For example, in TermScorer.score(HitCollector, int), Lucene's innermost loop, you change two array accesses into a call to an interface. That could make a substantial difference." Which is true? Or, as it seems likely, was TermScorer optimized for a case that is no longer valid (i.e. ConjunctionScorer)?

> If that were the case, then the read(int[], int[]) method would never have
> been added!

This hasn't been true for at least 3 years. Inlining by hand is no longer necessary with HotSpot (I don't know about GCJ). Run a benchmark on JDK 1.5 to prove this to yourself.

In short, we should have two TermScorer implementations: one for low documents/term, and one for high documents/term.
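To sketch what that split might look like (entirely hypothetical; no such code exists in Lucene, and the class name and threshold are invented), the weight could consult docFreq before constructing a scorer:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    /**
     * Hypothetical selection logic (invented for illustration): TermWeight
     * could consult docFreq to decide whether the buffered (common-term) or
     * unbuffered (rare-term) scorer pays off.
     */
    class TermScorerChoice {
        /** Assumed threshold, chosen here to match the 32-entry buffer. */
        private static final int BUFFER_SIZE = 32;

        /** Returns true if the bulk-buffering scorer is likely to win. */
        static boolean useBufferedScorer(IndexReader reader, Term term) throws IOException {
            return reader.docFreq(term) >= BUFFER_SIZE;
        }
    }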
[jira] Commented: (LUCENE-502) TermScorer caches values unnecessarily
[ http://issues.apache.org/jira/browse/LUCENE-502?page=comments#action_12368792 ]

paul.elschot commented on LUCENE-502:

>> The question is how does the caching help when you have multiple documents.
>> My analysis is that (with a modern VM) it helps you only if the docFreq of a
>> term is 16-31 and you are using a ConjunctionScorer (i.e. not wildcard
>> searches).

> The conjunction scorer does not call score(HitCollector, int). That method is
> only called in a few cases anymore. It can help a lot with a single-term
> query for a very common term, or for disjunctive queries involving very
> common terms, although BooleanScorer2 no longer uses it in this case. That's
> too bad: if all clauses of a query are optional, the old BooleanScorer was
> faster, but it didn't always return documents in order... So it may indeed be
> time to retire this method.

With BooleanScorer2 it is quite possible to use different versions of DisjunctionScorer: one for the query top level that does not need skipTo(), and one for lower levels that allows skipTo(). The top-level one can be implemented just like the "old" BooleanScorer. IIRC the methods to implement such different behaviour (for scoring a range of documents) are already in place; it only needs to be implemented for DisjunctionScorer, and the top-level BooleanScorer2 should then use it when appropriate.

Regards,
Paul Elschot
[jira] Commented: (LUCENE-502) TermScorer caches values unnecessarily
[ http://issues.apache.org/jira/browse/LUCENE-502?page=comments#action_12368797 ]

Doug Cutting commented on LUCENE-502:

> Which is true? Or, as it seems likely, was TermScorer optimized for a case
> that is no longer valid (i.e. ConjunctionScorer)?

No, it was optimized for BooleanScorer's *disjunctive* scoring algorithm, which is no longer used by default, but is faster than BooleanScorer2's disjunctive scoring algorithm. This applies to a very common type of query: classic vector-space searches. So this optimization may not be leveraged much in the current codebase, but that does not mean it is no longer valid. It may, however, slow other sorts of searches, like your wildcards. The challenge is not just to figure out how to make your application as fast as possible, but how to do so without making others' and future applications slower.

> In short, we should have two TermScorer implementations. One for low
> documents/term, and one for high documents/term.

Yes, I think that would be useful. Classically, total query processing time is dominated by common terms, so that's an important case to optimize. But it seems that with wildcard queries over smaller collections these optimizations become costly. So two implementations seems like it would make everyone happy.
ArrayIndexOutOfBoundsException in org.apache.lucene.search.BooleanScorer2$Coordinator.coordFactor
I've been developing a search application, and finally rolled it to production testing yesterday. After a few million hits on our 5 search nodes, I've found a glitch in Lucene :-). Unfortunately I don't have the queries that triggered this - it occurred a total of 11 times over the first 3 million hits according to my logging.

Backtrace:
java.lang.ArrayIndexOutOfBoundsException
        at org.apache.lucene.search.BooleanScorer2$Coordinator.coordFactor(BooleanScorer2.java:54)
        at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:328)
        at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:291)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:132)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:110)
        at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
        at org.apache.lucene.search.Hits.<init>(Hits.java:52)
        at org.apache.lucene.search.Searcher.search(Searcher.java:62)
        at com.isohunt.isosearch.frontend.SearchHandlerFactory.search(SearchHandlerFactory.java:113)
        at com.isohunt.isosearch.frontend.SearchHandler.process(SearchHandler.java:503)
        at com.isohunt.isosearch.reactor.ReactorHandler.processAndHandOff(ReactorHandler.java:118)
        at com.isohunt.isosearch.reactor.ReactorHandler.access$200(ReactorHandler.java:19)
        at com.isohunt.isosearch.reactor.ReactorHandler$Processer.run(ReactorHandler.java:125)
        at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Thread.java:816)

The Hits instance is local to each thread, and not statically shared in any way. The Searcher is shared between threads - but this should be safe according to the Lucene documentation. The com.isohunt.isosearch.reactor package is my pluggable code based on Doug Lea's Reactor design. JDK is IBM 1.4.2SR4 on AMD64 hardware.

--
Robin Hugh Johnson
E-Mail    : [EMAIL PROTECTED]
Home Page : http://www.orbis-terrarum.net/?l=people.robbat2
ICQ#      : 30269588 or 41961639
GnuPG FP  : 11AC BA4F 4778 E3F6 E4ED F38E B27B 944E 3488 4E85
Re: ArrayIndexOutOfBoundsException in org.apache.lucene.search.BooleanScorer2$Coordinator.coordFactor
On Fri, Mar 03, 2006 at 03:28:22PM -0800, Robin H. Johnson wrote:
> I've been developing a search application, and finally rolled it to
> production testing yesterday. After a few million hits on our 5 search
> nodes, I've found a glitch in Lucene :-).

I left out the Lucene version. It's 1.9 with the LUCENE-511 fix, which I integrated myself before 1.9.1 was released.

--
Robin Hugh Johnson
E-Mail    : [EMAIL PROTECTED]
Home Page : http://www.orbis-terrarum.net/?l=people.robbat2
ICQ#      : 30269588 or 41961639
GnuPG FP  : 11AC BA4F 4778 E3F6 E4ED F38E B27B 944E 3488 4E85
Re: Lucene 1.9.1 release available
And I was hoping to get my name in Lucene CHANGES.txt. You know, something to show my children ;-)

On 3 Mar 2006, at 18:26, Doug Cutting wrote:
> Release 1.9.1 of Lucene is now available from:
>
>   http://www.apache.org/dyn/closer.cgi/lucene/java/
>
> This fixes a serious bug in 1.9-final. It is strongly recommended that
> all 1.9-final users upgrade to 1.9.1. For details see:
>
>   http://svn.apache.org/repos/asf/lucene/java/tags/lucene_1_9_1/CHANGES.txt
>
> Doug
Online javadocs: 1.9-rc1
Someone with the necessary permissions to update the javadocs on the website might want to do so. They currently say "Lucene 1.9-rc1 API", which might confuse people (even if the API is exactly the same as 1.9.1).

http://lucene.apache.org/java/docs/api/

-Hoss
Re: Online javadocs: 1.9-rc1
I just updated this. Thanks for catching it.

Doug

Chris Hostetter wrote:
> Someone with the necessary permissions to update the javadocs on the
> website might want to do so. They currently say "Lucene 1.9-rc1 API",
> which might confuse people (even if the API is exactly the same as 1.9.1).
>
> http://lucene.apache.org/java/docs/api/
>
> -Hoss
Re: Lucene 1.9.1 release available
Shay Banon wrote:
> And I was hoping to get my name in Lucene CHANGES.txt. You know,
> something to show my children ;-)

Sorry, I was working quickly. I just added you!

Doug
FilteredQuery within BooleanQuery issue
I've run into what I feel is an issue with FilteredQuery. The best description is an example. First I've indexed three documents:

  public void setUp() throws IOException {
    RAMDirectory directory = new RAMDirectory();
    IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), true);

    Document doc = new Document();
    doc.add(new Field("field", "zero", Field.Store.YES, Field.Index.TOKENIZED));
    writer.addDocument(doc);

    doc = new Document();
    doc.add(new Field("field", "one", Field.Store.YES, Field.Index.TOKENIZED));
    writer.addDocument(doc);

    doc = new Document();
    doc.add(new Field("field", "two", Field.Store.YES, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    writer.close();  // close once, after all three documents are added

    searcher = new IndexSearcher(directory);
  }

Now for a mock filter to keep things simple:

  public class DummyFilter extends Filter {
    private int doc;

    public DummyFilter(int doc) {
      this.doc = doc;
    }

    // allow exactly one document through
    public BitSet bits(IndexReader reader) throws IOException {
      BitSet bits = new BitSet(reader.maxDoc());
      bits.set(doc);
      return bits;
    }
  }

And finally a test case that fails:

  public void testBoolean() throws Exception {
    BooleanQuery bq = new BooleanQuery();
    Query query = new FilteredQuery(new MatchAllDocsQuery(), new DummyFilter(0));
    bq.add(query, BooleanClause.Occur.MUST);
    query = new FilteredQuery(new MatchAllDocsQuery(), new DummyFilter(1));
    bq.add(query, BooleanClause.Occur.MUST);
    Hits hits = searcher.search(bq);
    assertEquals(0, hits.length());  // fails: hits.length() == 2
  }

I expect no documents should match this BooleanQuery, yet two documents match (ids 0 and 1). Am I right in thinking that no documents should match, since each required clause selects a different document so there is no intersection? If so, what's the flaw in FilteredQuery that causes this? If I'm wrong in my assertion, how so?

For comparison, a ChainedFilter does do what I expect:

  public void testChainedFilter() throws Exception {
    ChainedFilter filter = new ChainedFilter(
        new Filter[] {new DummyFilter(0), new DummyFilter(1)},
        ChainedFilter.AND);
    Hits hits = searcher.search(new MatchAllDocsQuery(), filter);
    assertEquals(0, hits.length());  // passes
  }

Thanks,
Erik
Re: FilteredQuery within BooleanQuery issue
This is the first time I've looked at FilteredQuery, but the scorer is indeed flawed IMO. next() and skipTo() simply iterate over the documents that match the query, and just modify the score to return 0 if the document doesn't match the filter:

  public boolean next() throws IOException {
    return scorer.next();
  }

  public boolean skipTo(int i) throws IOException {
    return scorer.skipTo(i);
  }

  // if the document has been filtered out, set score to 0.0
  public float score() throws IOException {
    return (bitset.get(scorer.doc())) ? scorer.score() : 0.0f;
  }

The higher-level search functions return the correct results only because they filter out any documents with a score <= 0. Check out LUCENE-330 for possible fixes. (Sorry, Firefox is refusing to paste the URL for me again...)

-Yonik

On 3/3/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> I've run into what I feel is an issue with FilteredQuery. The best
> description is an example.
> [...]
> I expect no documents should match this BooleanQuery, yet two
> documents match (ids 0 and 1). Am I right in thinking that no
> documents should match, since each required clause selects a different
> document so there is no intersection? If so, what's the flaw in
> FilteredQuery that causes this? If I'm wrong in my assertion, how so?
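For the record, one possible shape of such a fix (a sketch under the assumption that the scorer should advance past filtered-out documents; this is not the LUCENE-330 patch, and the class name is invented):

  import java.io.IOException;
  import java.util.BitSet;

  // A minimal sketch of the fix direction discussed above: advance past
  // documents the filter rejects so that next()/skipTo() only ever land
  // on documents that pass the filter.
  class FilteredScorerSketch {
      private final org.apache.lucene.search.Scorer scorer; // wrapped query scorer
      private final BitSet bits;                             // filter bits

      FilteredScorerSketch(org.apache.lucene.search.Scorer scorer, BitSet bits) {
          this.scorer = scorer;
          this.bits = bits;
      }

      public boolean next() throws IOException {
          while (scorer.next()) {
              if (bits.get(scorer.doc())) return true; // stop on a permitted doc
          }
          return false;
      }

      public boolean skipTo(int target) throws IOException {
          if (!scorer.skipTo(target)) return false;
          // keep advancing until the underlying doc also passes the filter
          return bits.get(scorer.doc()) || next();
      }

      public float score() throws IOException {
          return scorer.score(); // no need to zero out scores any more
      }
  }

With a scorer like this, a FilteredQuery clause inside a BooleanQuery would only ever report documents that pass its filter, so the two MUST clauses in the test above would have an empty intersection, as expected.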