Re: QueryFilter vs CachingWrapperFilter vs RangeQuery

2004-12-07 Thread Otis Gospodnetic
If you run the same query again, the IndexSearcher will go all the way
to the index again - no caching.  Some caching will be done by your
file system, possibly, but that's it.  Lucene is fast, so don't
optimize early.

Otis
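
Should profiling later show that repeated identical queries are a real cost, an application-level cache along the lines Ben describes could look roughly like the sketch below. It is only an illustration: QueryResultCache, its search function, and invalidate() are hypothetical names and not part of Lucene, and the search function stands in for whatever actually runs the query against the index.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical application-level cache; Lucene itself is not involved here.
// The search function stands in for whatever runs the real query.
class QueryResultCache {
    private final Function<String, List<String>> search;
    private final Map<String, List<String>> results = new ConcurrentHashMap<>();

    QueryResultCache(Function<String, List<String>> search) {
        this.search = search;
    }

    // Run the search only the first time a given query string is seen.
    List<String> get(String query) {
        return results.computeIfAbsent(query, search);
    }

    // Call when content is published or removed so stale results are dropped.
    void invalidate() {
        results.clear();
    }

    public static void main(String[] args) {
        int[] searches = {0};
        QueryResultCache cache = new QueryResultCache(q -> {
            searches[0]++;
            return Collections.singletonList("doc-for-" + q);
        });
        cache.get("document");
        cache.get("document"); // served from the cache
        cache.invalidate();    // e.g. after a publish event
        cache.get("document"); // searched again
        System.out.println("searches run: " + searches[0]);
    }
}
```

Clearing the whole map on every publish is the bluntest possible invalidation; it is chosen here only because the thread asks for all caches to be rebuilt when content changes.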


--- Ben Rooney <[EMAIL PROTECTED]> wrote:

> thanks chris,
> 
> you are correct that i'm not sure if i need the caching ability.  it is
> more to understand right now so that if we do need to implement it, i am
> able to.
> 
> the reason for the caching is that we will have listing pages for
> certain content types.  for example a listing page of articles.  this
> listing will be generated against lucene engine using a basic query.
> the page will also have the ability to filter the articles based on date
> range as one example.  so caching those results could be beneficial.
> 
> however, we will also potentially want to cache the basic query so that
> subsequent queries will hit a cache.  when new content is published or
> content is removed from the site, the caches will need to be invalidated
> so new results are created.
> 
> for the basic query, is there any caching mechanism built into the
> IndexSearcher or do we need to build our own caching mechanism?
> 
> thanks
> ben
> 
> On Tue, 2004-07-12 at 12:29 -0800, Chris Hostetter wrote:
> 
> > : > executes the search, i would keep a static reference to IndexSearcher
> > : > and then when i want to invalidate the cache, set it to null or create
> > 
> > : design of your system.  But, yes, you do need to keep a reference to it
> > : for the cache to work properly.  If you use a new IndexSearcher
> > : instance (I'm simplifying here, you could have an IndexReader instance
> > : yourself too, but I'm ignoring that possibility) then the filtering
> > : process occurs for each search rather than using the cache.
> > 
> > Assuming you have a finite number of Filters, and assuming those Filters
> > are expensive enough to be worth it...
> > 
> > Another approach you can take to "share" the cache among multiple
> > IndexReaders is to explicitly call the bits method on your filter(s) once,
> > and then cache the resulting BitSet anywhere you want (e.g. serialize it to
> > disk if you so choose), and then implement a "BitsFilter" class that you
> > can construct directly from a BitSet regardless of the IndexReader.  The
> > downside of this approach is that it will *ONLY* work if you are certain
> > that the index is never being modified.  If any documents get added, or
> > the index gets re-optimized, you must regenerate all of the BitSets.
> > 
> > (That's why the CachingWrapperFilter's cache is keyed off of the
> > IndexReader ... as long as you're re-using the same IndexReader, it knows
> > that the cached BitSet must still be valid, because an IndexReader
> > always sees the same index as when it was opened, even if another
> > thread/process modifies it.)
> > 
> > 
> > class BitsFilter extends Filter {
> >    BitSet bits;
> >    public BitsFilter(BitSet bits) {
> >       this.bits = bits;
> >    }
> >    public BitSet bits(IndexReader r) {
> >       // clone() returns Object, so cast the copy back to BitSet
> >       return (BitSet) bits.clone();
> >    }
> > }
> > 
> > 
> > 
> > 
> > -Hoss
> > 
> > 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: QueryFilter vs CachingWrapperFilter vs RangeQuery

2004-12-07 Thread Ben Rooney
thanks chris,

you are correct that i'm not sure if i need the caching ability.  it is
more to understand right now so that if we do need to implement it, i am
able to.

the reason for the caching is that we will have listing pages for
certain content types.  for example a listing page of articles.  this
listing will be generated against lucene engine using a basic query.
the page will also have the ability to filter the articles based on date
range as one example.  so caching those results could be beneficial.

however, we will also potentially want to cache the basic query so that
subsequent queries will hit a cache.  when new content is published or
content is removed from the site, the caches will need to be invalidated
so new results are created.

for the basic query, is there any caching mechanism built into the
IndexSearcher or do we need to build our own caching mechanism?

thanks
ben

On Tue, 2004-07-12 at 12:29 -0800, Chris Hostetter wrote:

> : > executes the search, i would keep a static reference to IndexSearcher
> : > and then when i want to invalidate the cache, set it to null or create
> 
> : design of your system.  But, yes, you do need to keep a reference to it
> : for the cache to work properly.  If you use a new IndexSearcher
> : instance (I'm simplifying here, you could have an IndexReader instance
> : yourself too, but I'm ignoring that possibility) then the filtering
> : process occurs for each search rather than using the cache.
> 
> Assuming you have a finite number of Filters, and assuming those Filters
> are expensive enough to be worth it...
> 
> Another approach you can take to "share" the cache among multiple
> IndexReaders is to explicitly call the bits method on your filter(s) once,
> and then cache the resulting BitSet anywhere you want (e.g. serialize it to
> disk if you so choose), and then implement a "BitsFilter" class that you
> can construct directly from a BitSet regardless of the IndexReader.  The
> downside of this approach is that it will *ONLY* work if you are certain
> that the index is never being modified.  If any documents get added, or
> the index gets re-optimized, you must regenerate all of the BitSets.
> 
> (That's why the CachingWrapperFilter's cache is keyed off of the
> IndexReader ... as long as you're re-using the same IndexReader, it knows
> that the cached BitSet must still be valid, because an IndexReader
> always sees the same index as when it was opened, even if another
> thread/process modifies it.)
> 
> 
> class BitsFilter extends Filter {
>    BitSet bits;
>    public BitsFilter(BitSet bits) {
>       this.bits = bits;
>    }
>    public BitSet bits(IndexReader r) {
>       // clone() returns Object, so cast the copy back to BitSet
>       return (BitSet) bits.clone();
>    }
> }
> 
> 
> 
> 
> -Hoss
> 
> 
> 


Re: QueryFilter vs CachingWrapperFilter vs RangeQuery

2004-12-07 Thread Ben Rooney
erik, thanks for the reply

i get the filter now and understand how the caching works.  however the
caching is only on the filtering level which means i can cache results
that are filtered.  but if i do a basic search against the index and
want to cache that, do i need to create my own caching mechanism or does
the IndexSearcher cache the results already?  if it caches them already,
then to clear the cache, is it again removing any references to the
IndexSearcher instance?

thanks again,
ben


On Tue, 2004-07-12 at 15:18 -0500, Erik Hatcher wrote:

> On Dec 7, 2004, at 3:06 PM, Ben Rooney wrote:
> > i'm trying to understand the difference/effects between QueryFilter vs
> > CachingWrapperFilter and when you would use one vs the other and how
> > they work exactly.
> 
> QueryFilter caches the results (bit set of documents) of a query by 
> IndexReader.
> 
> CachingWrapperFilter does not actually do any filtering of its own, but 
> merely wraps the results of another non-caching filter, such as 
> DateFilter.  CachingWrapperFilter was added to disconnect caching from 
> filtering.  QueryFilter is the exception as it came first and already 
> does caching.  If you're using QueryFilter, there is no need to concern 
> yourself with CachingWrapperFilter.
> 
> > also, when exactly will the cache be cleared.  looking at the source
> > code, it appears when the IndexReader is released it would be cleared.
> > does this mean i should keep a reference to the IndexSearcher until i
> > want the results to be cleared?  for example, in a class file that
> > executes the search, i would keep a static reference to IndexSearcher
> > and then when i want to invalidate the cache, set it to null or create
> > a new instance of it?
> 
> How you keep a reference to the IndexSearcher instance is up to the 
> design of your system.  But, yes, you do need to keep a reference to it 
> for the cache to work properly.  If you use a new IndexSearcher 
> instance (I'm simplifying here, you could have an IndexReader instance 
> yourself too, but I'm ignoring that possibility) then the filtering 
> process occurs for each search rather than using the cache.
> 
>   Erik
> 
> 
> 


Re: QueryFilter vs CachingWrapperFilter vs RangeQuery

2004-12-07 Thread Chris Hostetter
: > executes the search, i would keep a static reference to IndexSearcher
: > and then when i want to invalidate the cache, set it to null or create

: design of your system.  But, yes, you do need to keep a reference to it
: for the cache to work properly.  If you use a new IndexSearcher
: instance (I'm simplifying here, you could have an IndexReader instance
: yourself too, but I'm ignoring that possibility) then the filtering
: process occurs for each search rather than using the cache.

Assuming you have a finite number of Filters, and assuming those Filters
are expensive enough to be worth it...

Another approach you can take to "share" the cache among multiple
IndexReaders is to explicitly call the bits method on your filter(s) once,
and then cache the resulting BitSet anywhere you want (e.g. serialize it to
disk if you so choose), and then implement a "BitsFilter" class that you
can construct directly from a BitSet regardless of the IndexReader.  The
downside of this approach is that it will *ONLY* work if you are certain
that the index is never being modified.  If any documents get added, or
the index gets re-optimized, you must regenerate all of the BitSets.

(That's why the CachingWrapperFilter's cache is keyed off of the
IndexReader ... as long as you're re-using the same IndexReader, it knows
that the cached BitSet must still be valid, because an IndexReader
always sees the same index as when it was opened, even if another
thread/process modifies it.)


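import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Filter;

// Filter backed by a precomputed BitSet; the IndexReader argument is ignored.
class BitsFilter extends Filter {
   private final BitSet bits;
   public BitsFilter(BitSet bits) {
      this.bits = bits;
   }
   public BitSet bits(IndexReader r) {
      // Return a copy so callers cannot modify the cached set.
      return (BitSet) bits.clone();
   }
}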
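
The "serialize it to disk" step mentioned above relies only on java.util.BitSet being Serializable, so it can be sketched without Lucene at all. The BitSetStore helper below is a hypothetical name used for illustration; it round-trips through a byte array, and swapping in file streams would be the obvious variation. The same caveat applies as in the text: a restored BitSet is only meaningful while the index it was computed from is unchanged.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.BitSet;

// Hypothetical helper, not part of Lucene: round-trips a filter's BitSet
// through Java serialization so it can be stored outside the JVM.
class BitSetStore {

    // Serialize the BitSet to a byte array (a file stream would work the same way).
    static byte[] save(BitSet bits) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(bits);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Restore a BitSet previously written by save().
    static BitSet load(byte[] data) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (BitSet) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        BitSet matches = new BitSet(8);
        matches.set(1);
        matches.set(5);
        BitSet restored = load(save(matches));
        System.out.println("round trip preserved bits: " + restored.equals(matches));
    }
}
```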




-Hoss





Re: QueryFilter vs CachingWrapperFilter vs RangeQuery

2004-12-07 Thread Erik Hatcher
On Dec 7, 2004, at 3:06 PM, Ben Rooney wrote:
> i'm trying to understand the difference/effects between QueryFilter vs
> CachingWrapperFilter and when you would use one vs the other and how
> they work exactly.

QueryFilter caches the results (bit set of documents) of a query by
IndexReader.

CachingWrapperFilter does not actually do any filtering of its own, but
merely wraps the results of another non-caching filter, such as
DateFilter.  CachingWrapperFilter was added to disconnect caching from
filtering.  QueryFilter is the exception as it came first and already
does caching.  If you're using QueryFilter, there is no need to concern
yourself with CachingWrapperFilter.

> also, when exactly will the cache be cleared.  looking at the source
> code, it appears when the IndexReader is released it would be cleared.
> does this mean i should keep a reference to the IndexSearcher until i
> want the results to be cleared?  for example, in a class file that
> executes the search, i would keep a static reference to IndexSearcher
> and then when i want to invalidate the cache, set it to null or create a
> new instance of it?

How you keep a reference to the IndexSearcher instance is up to the 
design of your system.  But, yes, you do need to keep a reference to it 
for the cache to work properly.  If you use a new IndexSearcher 
instance (I'm simplifying here, you could have an IndexReader instance 
yourself too, but I'm ignoring that possibility) then the filtering 
process occurs for each search rather than using the cache.
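
The effect of reusing versus replacing the reader can be modelled without Lucene. The sketch below is a simplified stand-in, not Lucene's actual code: ReaderKeyedCache is a hypothetical class, its type parameter R plays the role of IndexReader, and the wrapped function plays the role of an expensive non-caching filter. It assumes the cache holds its reader keys weakly, which matches Ben's reading that entries go away once the reader is released.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

// Simplified stand-in for the caching idea, not Lucene's actual classes.
// R plays the role of IndexReader; the wrapped function plays the role of
// an expensive, non-caching filter.
class ReaderKeyedCache<R> {
    private final Function<R, BitSet> expensiveFilter;
    // Weak keys: an entry becomes collectable once nothing else references the reader.
    private final Map<R, BitSet> cache = new WeakHashMap<>();
    int computations = 0; // exposed only so the demo can count cache misses

    ReaderKeyedCache(Function<R, BitSet> expensiveFilter) {
        this.expensiveFilter = expensiveFilter;
    }

    synchronized BitSet bits(R reader) {
        BitSet cached = cache.get(reader);
        if (cached == null) {
            computations++;
            cached = expensiveFilter.apply(reader);
            cache.put(reader, cached);
        }
        return cached;
    }

    public static void main(String[] args) {
        ReaderKeyedCache<Object> filter = new ReaderKeyedCache<>(reader -> {
            BitSet bits = new BitSet();
            bits.set(0);
            return bits;
        });
        Object readerA = new Object();
        filter.bits(readerA);
        filter.bits(readerA);      // same reader: served from the cache
        filter.bits(new Object()); // new reader: computed again
        System.out.println("computations = " + filter.computations);
    }
}
```

Holding on to one reader (or the searcher that owns it) keeps the cached entry in use; dropping every reference to it is what lets the entry be discarded.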

Erik


QueryFilter vs CachingWrapperFilter vs RangeQuery

2004-12-07 Thread Ben Rooney
hello, hope someone can help explain things to me. 

i've been searching for some time and i have not been able to find
anything to answer my questions.

i'm trying to understand the difference/effects between QueryFilter vs
CachingWrapperFilter and when you would use one vs the other and how
they work exactly.  

also, when exactly will the cache be cleared.  looking at the source
code, it appears when the IndexReader is released it would be cleared.
does this mean i should keep a reference to the IndexSearcher until i
want the results to be cleared?  for example, in a class file that
executes the search, i would keep a static reference to IndexSearcher
and then when i want to invalidate the cache, set it to null or create a
new instance of it?

on top of this, using the RangeQuery object in a search does not seem to
be prudent as the time is almost 4 times that of using a filter.  i
can basically understand this, since when doing a query lucene needs to
score all the documents that match, whereas a filter ignores scoring.

to test them out, i created an index against a 2 document repository
where the files in the repository are simply properties files.  in the
properties files, i set the publishDate property so that all documents
are of year 2004.

my test runs 4 queries.  the first test is a basic one that returns all
documents in the index that contains the word 'document'.  the second
test adds the query from the first test to a BooleanQuery along with a
RangeQuery for the year 2004.  the third test uses the query from the
first test along with QueryFilter constructed using the RangeQuery.  the
final test is the same as the third query but the QueryFilter is wrapped
in a CachingWrapperFilter class.  each test runs a search against the
index 100 times with the same configuration.

the output from my test is as follows:


2004-12-07 20:30:03,888 DEBUG (SearchManager.java:main:138) - 2 total matching documents
2004-12-07 20:30:04,602 INFO  (SearchManager.java:main:141) - query 1 - all docs - total time (ms): 768
2004-12-07 20:30:04,653 DEBUG (SearchManager.java:main:146) - 2 total matching documents
2004-12-07 20:30:06,598 INFO  (SearchManager.java:main:149) - query 2 - 2004 range query - no cache - total time (ms): 1996
2004-12-07 20:30:06,614 DEBUG (SearchManager.java:main:155) - 2 total matching documents
2004-12-07 20:30:07,223 INFO  (SearchManager.java:main:158) - query 3 - 2004 docs filter - no cache - total time (ms): 623
2004-12-07 20:30:07,230 DEBUG (SearchManager.java:main:164) - 2 total matching documents
2004-12-07 20:30:07,838 INFO  (SearchManager.java:main:167) - query 4 - 2004 docs filter - cached - total time (ms): 613


as can be seen, there is not much difference between the third and fourth
queries and hence my confusion with the two types of filters.  looking
at the source code, there is not much difference between them either.

the following is the test source code:


package com.blastradius.search;

import java.io.File;
import java.util.Date;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.RangeQuery;
import org.apache.lucene.search.Searcher;

import com.blastradius.search.parsers.PropertiesParser;

/**
 * 
 * @author brooney
 */
public class SearchManager {

    public final static String INDEX_DIR = "index";
    public final static String ROOT_DIR = "webroot";

    public final static File rootDir = new File(SearchManager.ROOT_DIR);
    private final static Log logger = LogFactory.getLog(SearchManager.class);

    public static void main(String[] args) {

        Date start = null;
        Date end = null;
        Hits hits = null;

        try {
            Searcher searcher = new IndexSearcher(SearchManager.INDEX_DIR);
            Analyzer analyzer = new StandardAnalyzer();

            Query query = QueryParser.parse("document", "contents", analyzer);
            Query rangeQuery = new R