Re: MultiFieldQueryParser seems broken... Fix attached.

2004-10-04 Thread Bill Janssen
Doug Cutting writes:
> >>http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116
> > 
> > 
> > Yes, the approach there is similar.  I attempted to complete the
> > solution and provide a working replacement for MultiFieldQueryParser.
> 
> But, inspired by that message, couldn't MultiFieldQueryParser just be a 
> subclass of QueryParser that overrides getFieldQuery()?

This wouldn't catch PrefixQueries or RangeQueries, etc., would it?  If
QueryParser.TermQuery() wasn't final, you could just override it (or
fix it to do the right thing).
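
To illustrate the point about prefix and range queries, here is a minimal sketch that does not use Lucene at all. MiniParser, MultiFieldMiniParser, termQuery, prefixQuery and parseToken are hypothetical names invented for this illustration, and queries are modelled as plain strings. It only shows the general pattern: if a parser has one factory method per query type, a multi-field subclass has to override each of them, otherwise the types it does not handle are never expanded.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a parser with one factory method per query type. */
class MiniParser {
    /** Builds a plain term query such as title:foo. */
    protected String termQuery(String field, String text) {
        return field + ":" + text;
    }

    /** Builds a prefix query such as title:foo*. */
    protected String prefixQuery(String field, String prefix) {
        return field + ":" + prefix + "*";
    }

    /** Dispatches one field-less token to the matching factory method. */
    public String parseToken(String token) {
        if (token.endsWith("*")) {
            return prefixQuery(null, token.substring(0, token.length() - 1));
        }
        return termQuery(null, token);
    }
}

/** Expands field-less queries over several default fields. */
class MultiFieldMiniParser extends MiniParser {
    private final String[] fields;
    private final boolean overridePrefix; // false mimics "only one method overridden"

    MultiFieldMiniParser(String[] fields, boolean overridePrefix) {
        this.fields = fields;
        this.overridePrefix = overridePrefix;
    }

    @Override
    protected String termQuery(String field, String text) {
        if (field != null) {
            return super.termQuery(field, text);
        }
        List<String> parts = new ArrayList<>();
        for (String f : fields) {
            parts.add(super.termQuery(f, text));
        }
        return "(" + String.join(" ", parts) + ")";
    }

    @Override
    protected String prefixQuery(String field, String prefix) {
        if (field != null || !overridePrefix) {
            // Falls through to the base class, so a field-less prefix is not expanded.
            return super.prefixQuery(field, prefix);
        }
        List<String> parts = new ArrayList<>();
        for (String f : fields) {
            parts.add(super.prefixQuery(f, prefix));
        }
        return "(" + String.join(" ", parts) + ")";
    }
}

class FactoryOverrideDemo {
    public static void main(String[] args) {
        String[] fields = { "title", "author" };
        MiniParser partial = new MultiFieldMiniParser(fields, false);
        MiniParser full = new MultiFieldMiniParser(fields, true);
        System.out.println(partial.parseToken("lucene")); // expanded over both fields
        System.out.println(partial.parseToken("luc*"));   // not expanded, field stays null
        System.out.println(full.parseToken("luc*"));      // expanded over both fields
    }
}
```

In this toy model the "partial" parser expands plain terms but leaves the prefix query on a null field, which is the kind of gap being asked about here.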

By the way, I've found a bug in my implementation of
MultiFieldQueryParser.  Single-word queries weren't being expanded
properly.  I've fixed that, and placed a revised copy of the code at
ftp://ftp.parc.xerox.com/pub/transient/janssen/SearchTest.java.  See
my original post at
http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=9757
for instructions on how to use it.  Or just read the SearchTest.java code.

Bill

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-10 Thread Doug Cutting
Daniel Naber wrote:
> On Thursday 09 September 2004 18:52, Doug Cutting wrote:
>
> > I have not been
> > able to construct a two-word query that returns a page without both
> > words in either the content, the title, the url or in a single anchor.
> > Can you?
>
> Like this one?
>
> konvens leitseite
>
> Leitseite is only in the title of the first match (www.gldv.org), konvens
> is only in the body.

Good job finding that!  I guess I should fix Nutch's BasicQueryFilter.

Thanks,
Doug


Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-10 Thread sergiu gordea

> I reckon there has been a discussion (and solution :-) on how to achieve the
> functionality you've been
> after:
>
> http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116
>
> I'm not sure if this would be the same though.
>
> Best regards,
> René

Hi all,

I took the code indicated by René, but I've seen that it does not completely
fit my requirements, because my application should let users mark their
queries as fuzzy queries. So I modified the code as follows and added a test
main method.

Hope it helps someone.


package org.apache.lucene;

/* @(#) CWK 1.5 10.09.2004
 *
 * Copyright 2003-2005 ConfigWorks Informationssysteme & Consulting GmbH
 * Universitätsstr. 94/7 9020 Klagenfurt Austria
 * www.configworks.com
 * All rights reserved.
 */
import java.util.Vector;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.Query;

/**
 * This class is a patch for MultiFieldQueryParser.
 * Its behaviour can be tested by running the main method.
 *
 * Now:
 *
 *   String[] fields = new String[] { "title", "abstract", "content" };
 *   QueryParser parser = new CustomQueryParser(fields, new SimpleAnalyzer());
 *   parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
 *   Query query = parser.parse("foo -bar (baz OR title:bla)");
 *   System.out.println("? " + query);
 *
 * Produces:
 *
 *   ? +(title:foo abstract:foo content:foo) -(title:bar abstract:bar
 *   content:bar) +((title:baz abstract:baz content:baz) title:bla)
 *
 * Perfect!
 *
 * @author sergiu
 * @version 1.0
 * @since CWK 1.5
 */
public class CustomQueryParser extends QueryParser {

  private String[] fields;
  private boolean fuzzySearch = false;

  public CustomQueryParser(String[] fields, Analyzer analyzer) {
    super(null, analyzer);
    this.fields = fields;
  }

  public CustomQueryParser(String[] fields, Analyzer analyzer,
                           int defaultOperator) {
    super(null, analyzer);
    this.fields = fields;
    setOperator(defaultOperator);
  }

  // A term without an explicit field is expanded into one optional clause
  // per default field; a term with an explicit field stays on that field.
  protected Query getFieldQuery(String field, Analyzer analyzer,
                                String queryText)
      throws ParseException {
    if (field != null) {
      return buildQuery(field, analyzer, queryText);
    }

    Vector clauses = new Vector();
    for (int i = 0; i < fields.length; i++) {
      Query q = buildQuery(fields[i], analyzer, queryText);
      // The superclass may return null when the analyzer drops every token
      // (for example a stop word), so such clauses are skipped.
      if (q != null) {
        clauses.add(new BooleanClause(q, false, false));
      }
    }
    return clauses.isEmpty() ? null : getBooleanQuery(clauses);
  }

  // Builds either a fuzzy query or a plain field query for a single field.
  private Query buildQuery(String field, Analyzer analyzer, String queryText)
      throws ParseException {
    if (isFuzzySearch()) {
      return super.getFuzzyQuery(field, queryText);
    }
    return super.getFieldQuery(field, analyzer, queryText);
  }

  public boolean isFuzzySearch() {
    return fuzzySearch;
  }

  public void setFuzzySearch(boolean fuzzySearch) {
    this.fuzzySearch = fuzzySearch;
  }

  public static void main(String[] args) throws Exception {
    String[] fields = new String[] { "title", "abstract", "content" };
    CustomQueryParser parser =
        new CustomQueryParser(fields, new StandardAnalyzer());
    parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
    parser.setFuzzySearch(true);

    String queryString = "foo -bar (baz OR title:bla)";
    System.out.println(queryString);
    Query query = parser.parse(queryString);
    System.out.println("? " + query);
  }
}


Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-09 Thread Bill Janssen
> But, inspired by that message, couldn't MultiFieldQueryParser just be a 
> subclass of QueryParser that overrides getFieldQuery()?

I wasn't sure that everything "went through" getFieldQuery().  If so,
yes, that should work.  In either case, I don't even think a subclass
is necessary.  Just have a different constructor for QueryParser that
takes multiple default field names, and just add the behavior to
QueryParser, keyed off that characteristic (more than one default
field name).

Bill




Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-09 Thread Bill Janssen
> Is it a problem if users search for "coffee OR tea" when
> MultiFieldQueryParser is modified as Bill suggested and the
> default operator is set to AND?
> 

Here's what you get (which is correct):

% java -classpath /usr/local/lib/lucene-1.4.1.jar:. \
   -DSearchTest.QueryDefaultOperator=AND \
   -DSearchTest.QueryParser=new SearchTest 'coffee OR tea'
query is (title:coffee authors:coffee contents:coffee) (title:tea authors:tea 
contents:tea)
%

Bill




Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-09 Thread Daniel Naber
On Thursday 09 September 2004 18:52, Doug Cutting wrote:

> I have not been
> able to construct a two-word query that returns a page without both
> words in either the content, the title, the url or in a single anchor.
> Can you?

Like this one?

konvens leitseite 

Leitseite is only in the title of the first match (www.gldv.org), konvens 
is only in the body.

-- 
http://www.danielnaber.de




Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-09 Thread Doug Cutting
Bill Janssen wrote:
> I'd think that if a user specified a query "cutting lucene", with an
> implicit AND and the default fields "title" and "author", they'd
> expect to see a match in which both "cutting" and "lucene" appears.  That is,
>
> (title:cutting OR author:cutting) AND (title:lucene OR author:lucene)

Your proposal is certainly an improvement.

It's interesting to note that in Nutch I implemented something 
different.  There, a search for "cutting lucene" expands to something like:

 (+url:cutting^4.0 +url:lucene^4.0 +url:"cutting lucene"~2147483647^4.0)
 (+anchor:cutting^2.0 +anchor:lucene^2.0 +anchor:"cutting lucene"~4^2.0)
 (+content:cutting +content:lucene +content:"cutting lucene"~2147483647)
So a page with "cutting" in the body and "lucene" in anchor text won't 
match: the body, anchor or url must contain all query terms.  A single 
authority (content, url or anchor) must vouch for all attributes.

Note that Nutch also boosts matches where the terms are close together. 
 Using "~2147483647" permits them to be anywhere in the document, but 
boosts more when they're closer and in-order.  (The "~4" in anchor 
matches is to prohibit matches across different anchors.  Each anchor is 
separated by a Token.positionIncrement() of 4.)

But perhaps this is not a feature.  Perhaps Nutch should instead expand 
this to:

 +(url:cutting^4.0 anchor:cutting^2.0 content:cutting)
 +(url:lucene^4.0 anchor:lucene^2.0 content:lucene)
 url:"cutting lucene"~2147483647^4.0
 anchor:"cutting lucene"~4^2.0
 content:"cutting lucene"~2147483647
That would, e.g., permit a match with only "lucene" in an anchor and 
"cutting" in the content, which the earlier formulation would not.

Can anyone tell whether Google has this requirement?  I have not been 
able to construct a two-word query that returns a page without both 
words in either the content, the title, the url or in a single anchor. 
Can you?

If you're interested, the Nutch query expansion code in question is:
http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/query-basic/src/java/net/nutch/searcher/basic/BasicQueryFilter.java?view=markup
To play with it you can download Nutch and use the command:
  bin/nutch net.nutch.searcher.Query
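
For readers without a Nutch checkout, the two expansions discussed above can be reproduced as plain strings. This is only a sketch: FieldSpec and ExpansionSketch are hypothetical names, the boosts and slop values are copied from the examples in this message, and none of it is the actual BasicQueryFilter code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical description of one searched field: name, boost and phrase slop. */
class FieldSpec {
    final String name;
    final float boost;
    final int slop;

    FieldSpec(String name, float boost, int slop) {
        this.name = name;
        this.boost = boost;
        this.slop = slop;
    }

    /** A boost of 1.0 is left out, as in the printed queries above. */
    String boostSuffix() {
        return boost == 1.0f ? "" : "^" + boost;
    }
}

class ExpansionSketch {
    // Values taken from the example expansions in this message.
    static final FieldSpec[] FIELDS = {
        new FieldSpec("url", 4.0f, Integer.MAX_VALUE),
        new FieldSpec("anchor", 2.0f, 4),
        new FieldSpec("content", 1.0f, Integer.MAX_VALUE)
    };

    static String phrase(FieldSpec f, String[] terms) {
        return f.name + ":\"" + String.join(" ", terms) + "\"~" + f.slop + f.boostSuffix();
    }

    /** First formulation: a single field must contain every term. */
    static List<String> perField(String[] terms) {
        List<String> lines = new ArrayList<>();
        for (FieldSpec f : FIELDS) {
            StringBuilder sb = new StringBuilder("(");
            for (String t : terms) {
                sb.append('+').append(f.name).append(':').append(t)
                  .append(f.boostSuffix()).append(' ');
            }
            sb.append('+').append(phrase(f, terms)).append(')');
            lines.add(sb.toString());
        }
        return lines;
    }

    /** Second formulation: each term may match in any field; phrases only boost. */
    static List<String> perTerm(String[] terms) {
        List<String> lines = new ArrayList<>();
        for (String t : terms) {
            List<String> alternatives = new ArrayList<>();
            for (FieldSpec f : FIELDS) {
                alternatives.add(f.name + ":" + t + f.boostSuffix());
            }
            lines.add("+(" + String.join(" ", alternatives) + ")");
        }
        for (FieldSpec f : FIELDS) {
            lines.add(phrase(f, terms));
        }
        return lines;
    }

    public static void main(String[] args) {
        String[] terms = { "cutting", "lucene" };
        for (String line : perField(terms)) System.out.println(line);
        System.out.println();
        for (String line : perTerm(terms)) System.out.println(line);
    }
}
```

The only structural difference between the two methods is which loop is on the outside: fields in the first formulation, terms in the second.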

>> http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116
>
> Yes, the approach there is similar.  I attempted to complete the
> solution and provide a working replacement for MultiFieldQueryParser.

But, inspired by that message, couldn't MultiFieldQueryParser just be a
subclass of QueryParser that overrides getFieldQuery()?

Cheers,
Doug


Re: Handling user queries (Was: Re: MultiFieldQueryParser seems broken... Fix attached.)

2004-09-09 Thread sergiu gordea
René Hackl wrote:

> > Is it a problem if users search for "coffee OR tea" when
> > MultiFieldQueryParser is modified as Bill suggested and the
> > default operator is set to AND?
>
> No. There's not a problem with the proposed correction to MFQP. MFQP should
> work the way Bill suggested.
>
> My babbling about coffee or tea was more aimed at Bill's referring to "darn
> users started demanding". So this is a totally different
> matter. In my experience, many users fall into everyday language traps, like
> in: "What do you want to drink, coffee or tea?" The answer normally isn't
> 'yes' to both, is it?

This problem may be solved if the users know the meaning of the
following signs:
- + "" * ~
This will improve the results more than our parsing does ...

> I have an app where in some cases I make subqueries for an initial
> user-stated query. The aim is to come up with pointers to partial matching
> docs. The background is, one ill-advised NOT can ruin a query. But this has
> nothing to do with MFQP. Just random thoughts about making users happy even
> when they are new to formulating queries :-)
>
> Cheers,
> René
 






Handling user queries (Was: Re: MultiFieldQueryParser seems broken... Fix attached.)

2004-09-09 Thread "René Hackl"
> Is it a problem if users search for "coffee OR tea" when
> MultiFieldQueryParser is modified as Bill suggested and the
> default operator is set to AND?

No. There's not a problem with the proposed correction to MFQP. MFQP should
work the way Bill suggested.

My babbling about coffee or tea was more aimed at Bill's referring to "darn
users started demanding". So this is a totally different
matter. In my experience, many users fall into everyday language traps, like
in: "What do you want to drink, coffee or tea?" The answer normally isn't
'yes' to both, is it?

I have an app where in some cases I make subqueries for an initial
user-stated query. The aim is to come up with pointers to partial matching
docs. The background is, one ill-advised NOT can ruin a query. But this has
nothing to do with MFQP. Just random thoughts about making users happy even
when they are new to formulating queries :-)

Cheers,
René




Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-09 Thread sergiu gordea
René Hackl wrote:

> Bill,
>
> Thank you for clarifying on that issue. I missed the...
>
> > (title:cutting OR author:cutting) AND (title:lucene OR author:lucene)
>    ...
> > (title:cutting OR title:lucene) AND (author:cutting OR author:lucene)
> >
> > Note that this would match even if only "lucene" occurred in the
>
> ... "only lucene"/"only cutting" match.
>
> > I'd think that if a user specified a query "cutting lucene", with an
> > implicit AND and the default fields "title" and "author", they'd
> > expect to see a match in which both "cutting" and "lucene" appears.
>
> Hopefully they'd expect that. Sometimes users assume that e.g. "coffee OR
> tea" would provide matches with either term, but not both. But this is
> already "user-attune your application" territory. Your proposal makes
> perfect sense, of course.
>
> René

Is it a problem if users search for "coffee OR tea" when
MultiFieldQueryParser is modified as Bill suggested and the
default operator is set to AND?

I don't think so ... I think that the resulting Query should be:

(title:cutting OR author:cutting) OR (title:lucene OR author:lucene)

And I think that the results will be correct.
Am I wrong?

I don't know exactly what will happen with more complex queries that use
grouping, exact matches and the NOT operator, like:

 (alcohol NOT tea) OR ("black tea" AND brandy)

What will happen if you send this to a MultiFieldQueryParser that searches
an index with the fields "drink" and "juices"?

Maybe this kind of search construction should be part of the JUnit tests,
if it is not already there.

Thanks,
Sergiu
 





Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-09 Thread "René Hackl"
Bill,

Thank you for clarifying on that issue. I missed the...

> (title:cutting OR author:cutting) AND (title:lucene OR author:lucene)
   ...
> (title:cutting OR title:lucene) AND (author:cutting OR author:lucene)
> 
> Note that this would match even if only "lucene" occurred in the

... "only lucene"/"only cutting" match. 

> I'd think that if a user specified a query "cutting lucene", with an
> implicit AND and the default fields "title" and "author", they'd
> expect to see a match in which both "cutting" and "lucene" appears. 

Hopefully they'd expect that. Sometimes users assume that e.g. "coffee OR
tea" would provide matches with either term, but not both. But this is
already "user-attune your application" territory. Your proposal makes
perfect sense, of course.

René





Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-08 Thread sergiu gordea
Hi Bill,

I think more people are waiting for this patch to MultiFieldQueryParser.
It would be nice if it were included in the next release candidate.

All the best,
Sergiu


Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-08 Thread Bill Janssen
René,

Thanks for your note.

I'd think that if a user specified a query "cutting lucene", with an
implicit AND and the default fields "title" and "author", they'd
expect to see a match in which both "cutting" and "lucene" appears.  That is,

(title:cutting OR author:cutting) AND (title:lucene OR author:lucene)

Instead, what they'd get using the current (broken) strategy of outer
combination used by the current MultiFieldQueryParser, would be

(title:cutting OR title:lucene) AND (author:cutting OR author:lucene)

Note that this would match even if only "lucene" occurred in the
document, as long as it occurred both in the title field and in the
author field.  Or, for that matter, it would also match "Cutting on
Cutting", by Doug Cutting :-).

> http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116

Yes, the approach there is similar.  I attempted to complete the
solution and provide a working replacement for MultiFieldQueryParser.

Bill




Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-08 Thread sergiu gordea
The class is at the end of the message.
But I think that a better solution is the one suggested by René:

http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116

Wermus Fernando wrote:
> Bill,
> I don't receive any .java. Could you send it again?
> Thanks.
>
> -----Original Message-----
> From: Bill Janssen [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, September 07, 2004 10:06 p.m.
> To: Lucene Users List
> CC: Ali Rouhi
> Subject: MultiFieldQueryParser seems broken... Fix attached.

the code for SearchTest:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.RangeQuery;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.MultiField

RE: MultiFieldQueryParser seems broken... Fix attached.

2004-09-08 Thread Wermus Fernando
Bill,
I don't receive any .java. Could you send it again?

Thanks.

-----Original Message-----
From: Bill Janssen [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 07, 2004 10:06 p.m.
To: Lucene Users List
CC: Ali Rouhi
Subject: MultiFieldQueryParser seems broken... Fix attached.


Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-08 Thread "René Hackl"
Hi Bill,

> But even if it didn't, the second
> problem is that the query formed would be
>
>    +(title:cutting title:lucene) +(author:cutting author:lucene)
>
> That is, if the word "Lucene" was in both the author field and the
> title field, the match would fit.  This clearly isn't what the
> searcher intended.

As far as my understanding of the query syntax goes, this would be interpreted
as (A OR B) AND (C OR D) which would produce the same set as
(A OR C) AND (B OR D) == +(title:cutting author:cutting) +(title:lucene
author:lucene). But it would only be true for this special case with 2 terms
and 2 fields.
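
The equivalence suggested above can be checked by brute force. The sketch below uses no Lucene, and its class and method names are made up for illustration. It treats each of the four clauses as a boolean ("does this field contain this word?") and walks through all sixteen combinations. Under that simplified model the two forms are not equivalent: they disagree on four combinations, including a document that has "lucene" in both title and author but "cutting" nowhere.

```java
import java.util.ArrayList;
import java.util.List;

class EquivalenceCheck {
    /** (title:cutting OR title:lucene) AND (author:cutting OR author:lucene) */
    static boolean outer(boolean titleCutting, boolean titleLucene,
                         boolean authorCutting, boolean authorLucene) {
        return (titleCutting || titleLucene) && (authorCutting || authorLucene);
    }

    /** (title:cutting OR author:cutting) AND (title:lucene OR author:lucene) */
    static boolean inner(boolean titleCutting, boolean titleLucene,
                         boolean authorCutting, boolean authorLucene) {
        return (titleCutting || authorCutting) && (titleLucene || authorLucene);
    }

    /** Lists every combination of the four clauses on which the two forms differ. */
    static List<String> disagreements() {
        List<String> result = new ArrayList<>();
        for (int bits = 0; bits < 16; bits++) {
            boolean tc = (bits & 8) != 0;
            boolean tl = (bits & 4) != 0;
            boolean ac = (bits & 2) != 0;
            boolean al = (bits & 1) != 0;
            boolean o = outer(tc, tl, ac, al);
            boolean i = inner(tc, tl, ac, al);
            if (o != i) {
                result.add("title:cutting=" + tc + " title:lucene=" + tl
                        + " author:cutting=" + ac + " author:lucene=" + al
                        + " -> outer=" + o + ", inner=" + i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : disagreements()) {
            System.out.println(line);
        }
    }
}
```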

I reckon there has been a discussion (and solution :-) on how to achieve the
functionality you've been
after:

http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116

I'm not sure if this would be the same though.

Best regards,
René




MultiFieldQueryParser seems broken... Fix attached.

2004-09-07 Thread Bill Janssen
Hi!

I'm using Lucene for an application which has lots of fields/document,
in which the users can specify in their config files what fields they
wish to be included by default in a search.  I'd been happily using
MultiFieldQueryParser to do the searches, but the darn users started
demanding more Google-like searches; that is, they want the search
terms to be implicitly AND-ed instead of implicitly OR-ed.  No
problem, thinks I, I'll just set the "operator".

Only to find this has no effect on MultiFieldQueryParser.

Once I looked at the code, I find that MultiFieldQueryParser combines
the clauses at the wrong level -- it combines them at the outermost
level instead of the innermost level.  This means that if you have two
fields, "author" and "title", and the search string "cutting lucene",
you'll get the final query

   (title:cutting title:lucene) (author:cutting author:lucene)

If the search operator is "OR", this isn't a problem.  But if it is "AND",
you have two problems.  The first is that MultiFieldQueryParser seems
to ignore the operator entirely.  But even if it didn't, the second
problem is that the query formed would be

   +(title:cutting title:lucene) +(author:cutting author:lucene)

That is, if the word "Lucene" was in both the author field and the
title field, the match would fit.  This clearly isn't what the
searcher intended.

You can re-write MultiFieldQueryParser, as I've done in the example
code which I append here.  This little program allows you to run
either my parser (-DSearchTest.QueryParser=new) or the old parser
(-DSearchTest.QueryParser=old).  It allows you to use either OR
(-DSearchTest.QueryDefaultOperator=or) or AND
(-DSearchTest.QueryDefaultOperator=and) as the operator.  And it
allows you to pick your favorite set of default search terms
(-DSearchTest.QueryDefaultFields=author:title:body, for example).  It
takes one argument, a query string, and outputs the re-written query
after running it through the query parser.  So to evaluate the above
query:

% java -classpath /import/lucene/lucene-1.4.1.jar:. \
   -DSearchTest.QueryDefaultFields="title:author" \
   -DSearchTest.QueryDefaultOperator=AND \
   -DSearchTest.QueryParser=old \
   SearchTest "cutting lucene"
query is (title:cutting title:lucene) (author:cutting author:lucene)
%

The class NewMultiFieldQueryParser does the combination at the inner
level, using an override of "addClause", instead of the outer level.
Note that it can't cover all cases (notably PhrasePrefixQuery, because
that class has no access methods which allow one to introspect over
it, and SpanQueries, because I don't understand them well enough :-).
I post it here in advance of filing a formal bug report for early
feedback.  But it will show up in a bug report in the near future.

Running the above query with the new parser gives:

% java -classpath /import/lucene/lucene-1.4.1.jar:. \
   -DSearchTest.QueryDefaultFields="title:author" \
   -DSearchTest.QueryDefaultOperator=AND \
   -DSearchTest.QueryParser=new \
   SearchTest "cutting lucene"
query is +(title:cutting author:cutting) +(title:lucene author:lucene)
%

which I claim is what the user is expecting.
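
The difference between combining at the outer level and at the inner level can be sketched without Lucene by building the query strings directly. CombinationSketch and its methods are hypothetical names used only for this illustration; this is not the SearchTest code, and the "outer" variant with AND is the hypothetical form an operator-aware old parser would produce, as described above.

```java
import java.util.ArrayList;
import java.util.List;

class CombinationSketch {
    /** Wraps clauses in parentheses, with a leading '+' when the group is required. */
    static String group(List<String> clauses, boolean required) {
        return (required ? "+" : "") + "(" + String.join(" ", clauses) + ")";
    }

    /** Outer-level combination: one group per field, each holding every term. */
    static String outer(String[] fields, String[] terms, boolean and) {
        List<String> groups = new ArrayList<>();
        for (String field : fields) {
            List<String> clauses = new ArrayList<>();
            for (String term : terms) {
                clauses.add(field + ":" + term);
            }
            groups.add(group(clauses, and));
        }
        return String.join(" ", groups);
    }

    /** Inner-level combination: one group per term, each holding every field. */
    static String inner(String[] fields, String[] terms, boolean and) {
        List<String> groups = new ArrayList<>();
        for (String term : terms) {
            List<String> clauses = new ArrayList<>();
            for (String field : fields) {
                clauses.add(field + ":" + term);
            }
            groups.add(group(clauses, and));
        }
        return String.join(" ", groups);
    }

    public static void main(String[] args) {
        String[] fields = { "title", "author" };
        String[] terms = { "cutting", "lucene" };
        System.out.println(outer(fields, terms, true));
        System.out.println(inner(fields, terms, true));
    }
}
```

With AND, only the inner form requires every term to appear somewhere; the outer form requires every field to contain at least one of the terms.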

In addition, the new class uses an API more similar to QueryParser, so
that the user has less to learn when using it.  The code in it could
probably just be folded into QueryParser, in fact.

Bill


the code for SearchTest:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.RangeQuery;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.FastCharStream;
import org.apache.lucene.queryParser.TokenMgrError;
import org.apache.lucene.queryParser.ParseException;

import java.io.File;
import java.io.StringReader;
import java.util.Date;
import java.util.HashMap;
import java.util.Iterator;
import java.util.StringTokenizer;

class SearchTest {

static class NewMultiFieldQueryParser extends QueryParser {

static private final String DEFAULT_FIELD = "%%";

private String[] fields = null;

public NewMultiFieldQueryParser (String[] f, Analyzer a) {
super(DEFAULT_FIELD, a);
fields =