date:20020327

Database integration best practices ...

2002-03-27 Thread Peter Sojan



Hi!

Like many others, I want to use Lucene as a front end for searching
content that is buried in a relational database. As far as I
can see this should be no problem: I would build one document per
row in each table. Since many of you have already taken such
an approach, I would appreciate any suggestions on the following
issues:

- Consistency 
  What is the best way to maintain consistency between the database
  and the Lucene index? I can think of two solutions: 

  - update the index on every insert 
  - ignore the index at insert time and do a full reindex periodically
    (e.g. nightly)


- Transactional issues 
  What is the best way to make a database insert + index insert 
  atomic?


- Content Separation 
  My content in the database is spread across multiple tables. 
  But there are clusters of related tables. For example I have 
  3 tables describing authors of papers. My solution would be a
  separate index for each of those clusters. When the user does
  a search every index must be searched separately of course ...

  Is maintaining a separate index for every topic a good idea?


One might ask why I don't search against the database directly. Well,
I would have to build a search interface (think of boolean queries) 
on my own, which is definitely something I do not have time for. 
Additionally, my database (PostgreSQL) doesn't support full-text 
searches (yet).

Any additional input from your experience is very welcome!

Thx in advance,
Peter






  
 

--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]

Re: Database integration best practices ...

2002-03-27 Thread Peter Sojan



I forgot one thing to ask:

Search results should be anchored to a unique id which maps to a 
serial in the database. If my search returns multiple such
ids, what is the best way to transform them into a row-fetching 
SQL statement? I think I would end up with something like:

SELECT * FROM atable WHERE id = 12 AND id = 23 AND id = 34 AND 

... and so on.

For this purpose it would be nice to limit the Lucene search results, so 
that the SQL statement can be limited. Any better ideas?

Thx,
Peter 



RE : Database integration best practices ...

2002-03-27 Thread Elie FRANCIS


Just a little idea: replace AND with OR in your SELECT statement.
I used to store some fields in the Lucene index in order to show them on the
result page. Otherwise, I use:

Select * From atable Where id in (12, 23, 34, ...)

Elie
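
A minimal sketch of the IN (...) approach, assuming the ids have already
been collected from the search hits. IdQuery, selectByIds and maxIds are
hypothetical names introduced for illustration; they are not part of
Lucene or JDBC. The method only builds the SQL text with one placeholder
per id and caps the number of ids, so the statement stays bounded.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Lucene or JDBC.
final class IdQuery {
    /** Builds "SELECT * FROM table WHERE id IN (?, ?, ...)" for at most maxIds ids. */
    static String selectByIds(String table, List<Integer> ids, int maxIds) {
        // The table name is concatenated, so it must be a trusted constant.
        int n = Math.min(ids.size(), maxIds);
        if (n == 0) {
            // An empty IN () list is not valid SQL, so match nothing instead.
            return "SELECT * FROM " + table + " WHERE 1 = 0";
        }
        String placeholders = String.join(", ", Collections.nCopies(n, "?"));
        return "SELECT * FROM " + table + " WHERE id IN (" + placeholders + ")";
    }

    public static void main(String[] args) {
        System.out.println(selectByIds("atable", Arrays.asList(12, 23, 34), 100));
    }
}
```

The id values themselves would then be bound to the placeholders, for
example with PreparedStatement.setInt, rather than pasted into the string.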



Re: Database integration best practices ...

2002-03-27 Thread Peter Sojan


On Wed, Mar 27, 2002 at 10:53:30AM +, geoff webb wrote:

 Either:
 
SELECT * FROM atable WHERE id IN ( 12, 23, 34 ... )
 
 OR
 
SELECT * FROM atable WHERE id = 12 OR id = 23 OR id = 34 OR
 

Of course it has to be OR'ed. Must have been a Freudian typo :)

 Flow of retrieving an entry would be:
 
   search index
  - present results (from index)
 - select desired result (from database)
 

This should be the right way to go. I just don't want to let my index grow
that much, but as you mention going directly into the database for displaying 
results would cause prohibitive bottlenecks in the backend ...

Thx
Peter



StopFilter-troubles

2002-03-27 Thread P . Witte


Dear Lucene users,
does anyone have an answer to the following question?
If I add a StopFilter to my Analyzer, the stopwords I gave it are left
out of the query. So far, so good. But when my query is like this one:
(field1 : x) AND (field2 : stopword) AND (field1 : y)
the StopFilter does its work, but the resulting query is a big mess:
(field1 : x) AND ( ) AND (field1 : y), and because of that the
search results are no good. I hoped it would search for
(field1 : x) AND (field1 : y).
I think the StopFilter does a poor job here. Is anyone familiar with this
problem, and does anyone have an answer for me?
Puk Witte.



Re: StopFilter-troubles

2002-03-27 Thread Otis Gospodnetic




I tried something like this on one Lucene index:
description:travel AND description:a

The results were the same as this query:
description:travel

This seems right to me.

Otis




Retrieve all documents from Index, How to?

2002-03-27 Thread Tihon One


Hi all,

I indexed all records in the database.  One of the fields in my index stores 
the primary key from the database (each key is in a different document). Is 
there a way to retrieve all documents from the index?  I need to validate 
that all records are indexed.

I tried searching for * but it returns an empty result.

I'm using StandardAnalyzer with Field.Keyword for the primary key field and 
Field.Text for everything else.

Thanks for your help.

TihonOne


RE: StopFilter-troubles

2002-03-27 Thread P . Witte


Dear all, especially Otis Gospodnetic (thanks for your answer),
without the ( )'s the StopFilter does a good job indeed, but if I put them
around parts of the query, then the search result is wrong. 
For example:
(field1 : x) AND (field2 : stopword) AND (field1 : y)
So I'm afraid my problem is not solved yet. But maybe someone can try it
with the ( )'s in their own tool and tell me whether they get the same problem.
Then I will know whether I made a mistake. 

Puk Witte











RE: Lucene with Number+Text

2002-03-27 Thread Aruna Raghavan


Hi,
I am indexing it as a Text field. Searches for 05qzFebqz01 and 05q* do not work. I am
using a StandardAnalyzer. A search for 05* works.
Searches on another word, cq6r, work fine.
Any idea why this is happening?
Thanks!
Aruna.

-Original Message-
From: Ian Lea [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 25, 2002 3:56 PM
To: Lucene Users List
Subject: Re: Lucene with Number+Text


Good thinking.  In my test, using a Text field, searches
for 1727a and 1727* both return a hit, but if I switch to
Keyword they don't.


--
Ian.

 [EMAIL PROTECTED] (Shannon Booher) wrote 

 I think I have seen a similar problem.
 
 Are you guys using Keyword or Text fields?

--
Searchable personal storage and archiving from http://www.digimem.net/



RE: StopFilter-troubles

2002-03-27 Thread Otis Gospodnetic


I don't know enough about the query parser to be able to answer that
question, but why do you really need those parentheses?
It would also be great if you could submit this as a bug at
http://jakarta.apache.org/lucene/

Thanks,
Otis



Re: Retrieve all documents from Index, How to?

2002-03-27 Thread William W



Hi One,
Try something like this:

doc.add(Field.Text("type", "product"));

for all records.

Then search for type:product
It will return all the records.
William.
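
Once every document can be retrieved this way, the validation step comes
down to comparing two key lists. A minimal sketch, assuming the primary
keys from the database and the key values read back from the index have
already been loaded into collections; IndexCheck and missingFromIndex are
hypothetical names, not Lucene API.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene.
final class IndexCheck {
    /** Returns the database keys that have no matching document in the index. */
    static Set<String> missingFromIndex(Collection<String> dbKeys,
                                        Collection<String> indexedKeys) {
        Set<String> missing = new TreeSet<>(dbKeys); // sorted copy of the database keys
        missing.removeAll(indexedKeys);              // drop every key the index already has
        return missing;
    }

    public static void main(String[] args) {
        System.out.println(missingFromIndex(
                Arrays.asList("1", "2", "3"), Arrays.asList("1", "3")));
    }
}
```

An empty result means every database row has a document in the index.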



RE: Retrieve all documents from Index, How to?

2002-03-27 Thread P . Witte


Dear Otis Gospodnetic,
these parentheses seem (and are) rather unnecessary, but users of my program
fill in text fields and boolean radio buttons, and the program then
builds a query out of them. My program has a lot of fields (about twenty), so a
query will often get rather complicated. I thought about it
and made a query-maker tool that also takes care of the correct use of
parentheses. As a result there are sometimes parentheses that are not
useful, but are a side effect of this tool. I thought this would not lead to
any problems; unfortunately I had not thought about my StopFilter. 
But the problem also occurs with a more common query: 
(field :  AND field : y) OR (field : stopword AND field : stopword)
In this case I get a nullpointer-exception.
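
One possible workaround on the query-maker side, sketched under the
assumption that the tool knows the same stopword list as the analyzer: drop
stopword-only clauses before the query string is built, so no empty ( )
group ever reaches the query parser. ClauseBuilder and andQuery are
hypothetical names for illustration; this is not how Lucene itself handles
the case.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical query-maker helper, not part of Lucene.
final class ClauseBuilder {
    /** Joins "(field:term)" clauses with AND, skipping terms that are stopwords. */
    static String andQuery(String[][] fieldTermPairs, Set<String> stopWords) {
        List<String> clauses = new ArrayList<>();
        for (String[] pair : fieldTermPairs) {
            String field = pair[0];
            String term = pair[1];
            if (stopWords.contains(term.toLowerCase())) {
                continue; // the analyzer would remove this term, so emit no clause for it
            }
            clauses.add("(" + field + ":" + term + ")");
        }
        return String.join(" AND ", clauses);
    }

    public static void main(String[] args) {
        String[][] pairs = { {"field1", "x"}, {"field2", "the"}, {"field1", "y"} };
        System.out.println(andQuery(pairs, Collections.singleton("the")));
    }
}
```

If every clause is dropped the result is an empty string, which the caller
would have to treat as "no query" rather than hand to the parser.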

I am afraid I deleted your mail by accident. Could you please send me again
the address you gave me for reporting the bug?
Thanks, 
Puk Witte

PS If there is someone else who could help me, please react!








RE: Term

2002-03-27 Thread Aruna Raghavan

Hi All,
I just tried this again and it seems to work fine. Not sure what I did wrong
the first time.  Just a follow-up.

-Original Message-
From: Aruna Raghavan [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 27, 2002 12:45 PM
To: Lucene Users List
Subject: Term

Hi,
While adding documents using something like the following-
document.add(Field.Text("object number", m_strObjectNumber));
I used the string "object number" as the field name, as you can see. I cannot
find the values for "object number" when I do a search. I am using a StandardAnalyzer.
Any idea why this is happening?
Thanks,
Aruna.


Re: Term

2002-03-27 Thread ykingma


Aruna,

 Hi,
 While adding documents using something like the following-
 document.add(Field.Text("object number", m_strObjectNumber));
 I used the string "object number" as the field name, as you can see. I cannot
 find the values for "object number" when I do a search. I am using a
 StandardAnalyzer. Any idea why this is happening?

You would need to pose a query like this

object number:54321

However, the query parser reads this as a query looking
for the term 'object' in the default field and looking
for the term '54321' in the field named 'number'.

There are three workarounds:
- change your field name to e.g. objectnumber, and query by:
  objectnumber:54321
- use 'object number' as the default field for searching.
- construct the query in code, without going through the query parser.

I think the best solution would be to change the fieldname
into something shorter like 'onr' which allows for easy querying.
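
A minimal sketch of the first workaround, assuming the same helper is
applied both when the field is added to the document and when the query
string is written. FieldNames and queryFriendly are hypothetical names for
illustration, not Lucene API.

```java
// Hypothetical helper, not part of Lucene.
final class FieldNames {
    /** Removes all whitespace so the name can be written as name:value in a query. */
    static String queryFriendly(String fieldName) {
        return fieldName.trim().replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        System.out.println(queryFriendly("object number") + ":54321");
    }
}
```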


Regards,
Ype




RE: Term

2002-03-27 Thread Aruna Raghavan

Ype,
Thanks for the response. I think the reason my search worked was because
"object number" got indexed as "object" and the searcher searched for
"object" as well.


Lexical Error??? Help Please!

2002-03-27 Thread Alan Weissman


I just started having trouble with Lucene.  I'm getting this Lexical error
almost out of nowhere.  What does this mean?  All I can understand from it
is that there is an * that is causing a problem, but there are no *'s in
the text being searched!

Thanks,
Alan

[Default] Searching holdings against newly inserted research
[Default] java.rmi.ServerError: Transaction rolled back; nested exception is:
org.apache.lucene.queryParser.TokenMgrError: Lexical error at line 1, column 30.  Encountered: "*" (42), after : ""
[Default] org.apache.lucene.queryParser.TokenMgrError: Lexical error at line 1, column 30.  Encountered: "*" (42), after : ""



Field.Text arguments

2002-03-27 Thread Robert A. Decker


I'm confused about using Fields.

Here's the two methods that are confusing me:
public static final Field Text(String name, Reader value)
public static final Field Text(String name, String value)

The difference is that one takes a reader and the other a string.

I have a field that will have pretty large contents after running through
my analyzer (1500 to 6000 characters).

When I use the second of the two methods above my string is not run
through the analyzer, but is stored in the index.

When I use the first method, by passing in a StringReader based on the
String, I don't get anything indexed at all (and therefore it's difficult
to know if it was analyzed).


Is there some other Field type that I should be using for text that I want
analyzed and indexed, and that can be fairly long?


Here's a rough outline of how I'm doing things. FragmentAnalyzer is my own
custom class that seems to normally work:

Document document = new Document();
Reader reader = new StringReader(text);
document.add(Field.Text("contents", reader));
...
FragmentAnalyzer analyzer = new FragmentAnalyzer();
IndexWriter writer = new IndexWriter(pathToIndex, analyzer, isCreateNewIndex);
writer.addDocument(document);
writer.close();


rob



Re: Field.Text arguments

2002-03-27 Thread Joe Hajek


Hi,

That's interesting. If you use Field.Text(String name, Reader value), the field should be 
indexed but not stored. Strange, I had no problems, but I didn't use a StringReader, 
just file readers.

Try creating your own customized field, passing a string that is not stored. I don't 
remember the documentation exactly, but this should be possible by passing the right 
parameters to the Field constructor.

regards joe



Re: Field.Text arguments

2002-03-27 Thread Robert A. Decker


I think I may be confused on the terminology. What is meant by 'not
stored'? The comments on the method that takes a Reader as an argument
states that it 'is tokenized and indexed, but not stored in the index
verbatim'.

I took this to mean that it stores the version of the text after it is run
through the analyzer, which is exactly what I want.

Now that I've looked at the index files closer, I'm starting to think that
perhaps the text may be being stored. It's hard to tell though.

I want to be able to get at the contents of the stored field, and can do
so easily when I use the method that takes a String as an argument.

Here's how I'm trying to get the field back:
Field contentsField = doc.getField("contents");

I get null back when I used the Reader-as-argument Field method, but get
the correct, but unanalyzed, text back when I use the String-as-argument
Field method.

thanks,
rob


Re: Chainable Filter contribution

2002-03-27 Thread Kelvin Tan


Dan,

Totally my bad. I had since changed it but hadn't posted it to the list coz
I didn't think anyone found it useful.

Here's the correct version. I haven't really documented since it's pretty
straightforward. Just holler if you need any help.

Regards,
Kelvin
- Original Message -
From: Armbrust, Daniel C. [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, March 28, 2002 5:17 AM
Subject: Chainable Filter contribution


 I found this in the mailing list, and I do need something like this, as I
 need to apply more than one filter at a time.  I'm fairly new to Lucene,
 however, and my knowledge of BitSets is very limited.

 My question, if you would be so kind as to donate a minute of time to me,
 is how does this combine the filters?  From my naive look through it, it
 seems that all filter results would get discarded except for the last
 filter that was applied.


 Thanks,

 Dan



 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.search.Filter;

 import java.io.IOException;
 import java.util.BitSet;

 /**
  * <p>
  * A ChainableFilter allows multiple filters to be chained
  * such that the result is the intersection of all the
  * filters.
  * </p>
  * <p>
  * Order in which filters are called depends on
  * the position of the filter in the chain. It's probably
  * more efficient to place the most restrictive filters
  * /least computationally-intensive filters first.
  * </p>
  *
  * @author <a href="mailto:[EMAIL PROTECTED]">Kelvin Tan</a>
  */
 public class ChainableFilter extends Filter
 {
     /** The filter chain */
     private Filter[] chain = null;

     /**
      * Creates a new ChainableFilter.
      *
      * @param chain The chain of filters.
      */
     public ChainableFilter(Filter[] chain)
     {
         this.chain = chain;
     }

     public BitSet bits(IndexReader reader) throws IOException
     {
         BitSet result = null;
         for (int i = 0; i < chain.length; i++)
         {
             result = chain[i].bits(reader);
         }
         return result;
     }
 }





ChainableFilter.java
Description: Binary data
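
The attached file is not reproduced in this archive, so the corrected
version is not shown here. As a rough illustration only of what "the
intersection of all the filters" could look like, here is a minimal sketch
using nothing but java.util.BitSet. BitSetChain and intersect are
hypothetical names; this is not Kelvin's code and does not use the Lucene
Filter class.

```java
import java.util.BitSet;

// Hypothetical stand-in for the filter chain; works on plain BitSets only.
final class BitSetChain {
    /** Returns the AND of all bit sets in the chain; empty if the chain is empty. */
    static BitSet intersect(BitSet[] chain) {
        BitSet result = null;
        for (int i = 0; i < chain.length; i++) {
            if (result == null) {
                // Copy the first set so the caller's BitSet is not modified.
                result = (BitSet) chain[i].clone();
            } else {
                // Keep only the bits that are also set in this filter's result.
                result.and(chain[i]);
            }
        }
        return result == null ? new BitSet() : result;
    }

    public static void main(String[] args) {
        BitSet a = new BitSet();
        a.set(1); a.set(2); a.set(3);
        BitSet b = new BitSet();
        b.set(2); b.set(3); b.set(4);
        System.out.println(intersect(new BitSet[] { a, b }));
    }
}
```

The difference from the loop quoted above is that each filter's bits are
combined with the running result via and() instead of replacing it.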


Re: Question on the FAQ list with filters

2002-03-27 Thread Steven J. Owens


On Wed, Mar 27, 2002 at 03:52:21PM -0600, Armbrust, Daniel C. wrote:
 From the FAQ:
 16. What is filtering and how is it performed ?
 * Search Query - in this approach, provide your custom filter object
 when you call the search() method. This filter will be called exactly once
 to evaluate every document that resulted in a non-zero score.
 * Selective Collection - in this approach you perform the regular search and
 when you get back the hit list, collect only those hits that match your
 filtering criteria. In this approach, your filter is called only for hits
 that are returned by the search method, which may be only a subset of the
 non-zero matches (useful when evaluating your search filter is expensive). 
 
 ***
 
 I don't see why the second way is useful.  Yes, your filter is called only
 for hits that got returned by the search method, but aren't those the same
 hits that the search() method would run through the filter?  Maybe I'm just
 not reading it closely enough.
 
 Is my assumption correct that it is faster to provide a filter to the
 search() method than to do a selective collection?

 It Depends.  That's more or less the point of the FAQ answer,
though it could be more clearly expressed.  The gist of the FAQ seems
to be that you can either do the filtering BEFORE you do the search,
or AFTER you do the search.

 Obviously the question is, which is more expensive, filtering out
inappropriate documents, or searching for the possible hits?  If
filtering is cheaper, you do the filtering first, then do the search.
If filtering is expensive, you do the search first, then do the
filtering.  You should also factor in which is more restrictive - will
either the filter or the search drop out a large number of the
documents?  If you can arrange it so one is both cheaper and drops out
the majority of the documents, you win.

 In either case, you implement some sort of object which you can
hand an org.apache.lucene.index.TermDocs and get back a yes or no as to
whether it's a valid possible search result.

 From looking at the source for:

 org.apache.lucene.search.Filter,
 org.apache.lucene.search.DateFilter, and
 org.apache.lucene.search.IndexSearcher, 

 ...it appears that you instantiate your Filter subclass, then for
filtering BEFORE the search, you pass YourFilter an IndexReader and
get back a BitSet.  Or more to the point, when you invoke
IndexSearcher.search(), you pass it YourFilter and a HitCollector,
and IndexSearcher.search() gets the BitSet from YourFilter.  

 A BitSet, from the JDK API, is a vector of bit values (i.e. 1 or
0, corresponding to the java boolean values true and false).

 It appears, from looking at the source, that each bit in the
BitSet corresponds to the document at the same sequential
location in the index.  IndexSearcher.search() has an inner
class (this is a bit ambiguous and it's been a year since I've looked
at inner classes, so I'm going to just handwave and move along :-)
with a collect() method that loops through the termDocs, skipping the
ones for which BitSet.get() returns false.
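
 The collect-and-skip idea can be sketched in plain Java, assuming
document numbers are simple ints and the filter's result is already
available as a BitSet. PreFilterDemo and collect are hypothetical names;
this is an illustration of the loop described above, not the actual
IndexSearcher code.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration; not the real IndexSearcher inner class.
final class PreFilterDemo {
    /** Keeps only the matching documents whose bit is set in the filter's BitSet. */
    static List<Integer> collect(int[] matchingDocs, BitSet allowed) {
        List<Integer> kept = new ArrayList<>();
        for (int doc : matchingDocs) {
            if (allowed.get(doc)) { // bit at position doc says whether doc passes the filter
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        BitSet allowed = new BitSet();
        allowed.set(2); allowed.set(5); allowed.set(7);
        System.out.println(collect(new int[] { 0, 2, 5 }, allowed));
    }
}
```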

 I'm not sure exactly how you would use an
org.apache.lucene.search.Filter to do the filtering AFTER, but
presumably that would involve just handing it the TermDocs in
question, or maybe IndexReader and Hits both implement a common
interface... uhm, no, that's not it.  Well, I guess you use your own
class for the filter.  That's what I ended up doing anyway, in my
ignorance of the Filter abstract class.  I ended up doing my filtering
AFTER, btw, because it involved some expensive lookups in other
documents.
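
 Filtering AFTER the search with your own class might look roughly like
the sketch below, assuming the hits are available as document numbers in
ranked order and the expensive check is expressed as a predicate.
PostFilterDemo, filterHits and maxResults are hypothetical names, not
Lucene API. Stopping once enough results are kept means the expensive
check runs on as few hits as possible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical post-search filter; not part of Lucene.
final class PostFilterDemo {
    /** Applies an expensive check to hits in order, stopping after maxResults are kept. */
    static List<Integer> filterHits(int[] hitDocs, IntPredicate expensiveCheck, int maxResults) {
        List<Integer> kept = new ArrayList<>();
        for (int doc : hitDocs) {
            if (kept.size() >= maxResults) {
                break; // enough results, skip the remaining expensive checks
            }
            if (expensiveCheck.test(doc)) {
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filterHits(new int[] { 4, 7, 8, 10 }, doc -> doc % 2 == 0, 2));
    }
}
```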

 There's actually a third option, figure out a way to implement
your filter as an additional boolean phrase on your search.  However,
that may or may not be feasible, or the Lucene Filter mechanism may
not have been intended to address such cases.  

 To be honest, the design of the Filter seems less
well-thought-out than the rest of Lucene, like it's an afterthought.
I really oughta join the developers list, I guess, so I can put my
money where my mouth is, and submit changes to clarify the docs, etc,
when I go roaming through the source.

Steven J. Owens
[EMAIL PROTECTED]


Re: Chainable Filter contribution

2002-03-27 Thread Kelvin Tan


Stephan,

I honestly don't know. There's going to be a /contrib section set up soon
though, so I think it might go in there at least.

Does it matter? :)

Regards,
Kelvin
- Original Message -
From: Strittmatter Stephan (external)
[EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Thursday, March 28, 2002 2:54 PM
Subject: RE: Chainable Filter contribution


 Hi Kelvin,

 I did something similar, only doing XOR for my chains.
 But now your improved filter is better than my own.
 I think I will replace my own with yours.
 Will it be part of Lucene in the future?

 Regards,
 Stephan

