RE: GoogleQueryParser

2002-09-12 Thread John Cwikla



So here is the problem, AND default operator:

a b OR c == a AND b OR c == c OR (a AND b) == c OR (+a +b) != c b +a != c +b +a

However with default OR operator:

a AND b c == a AND b OR c == c OR (a AND b) == c OR (+a +b) == c (+a +b)

Since AND and OR do not actually mean required or optional in a strict
boolean sense, I claim you cannot correctly use the query parser with
a default AND operator and get results that would be expected.

I haven't looked further into the QueryParser yet, but in the last case
with the AND operator, if at some point the internal query switched to OR,
then the last item would be correct if it had parentheses, like c (+a +b)

Or is it early and I'm missing something?

cwikla



-----Original Message-----
From: Halácsy Péter [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 11, 2002 11:58 PM
To: Lucene Users List
Subject: RE: GoogleQueryParser




-----Original Message-----
From: Philip Chan [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 11, 2002 11:04 PM
To: Lucene Users List
Subject: RE: GoogleQueryParser


> I think there's a bug. If I set the default operator to OR and run
>
> java org.apache.lucene.queryParser.QueryParser "a AND b OR c"
>
> it gives me the result +a +b c.
> If I set the default operator to AND and run it with the query
> a b OR c, it gives me +a b c, which is different.

To be exact, it's not a bug, it's a feature ;) The structured query
language of Lucene (and Google and others) is not a strict boolean language.
For example, I think the Lucene QueryParser does not support parentheses,
as in: a AND (b OR c)

Instead of strict boolean logic, it supports constraints on query terms: a
query term is either required, optional, or prohibited. If you write a + sign
before the term, it is required. If you write a - sign, it is prohibited.
The question is: is a term required or optional if you do not specify
anything?

DEFAULT_OPERATOR_OR (default QueryParser):
A B C -- all three terms are optional

DEFAULT_OPERATOR_AND (Google style):
A B C -- +A +B +C, all three terms are required

Because a b OR c is not a strict boolean query, the query parser can
choose how to translate it. +a +b c is not a good choice, since its result
does not equal that of the input query c OR a b
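
The required/optional/prohibited model described above can be sketched in plain Java. This is only an illustration, not Lucene code: Clause, ClauseDemo, flatten, render and matches are hypothetical names, the splitter handles only bare and +/- prefixed terms (no AND/OR keywords), and the matching rule (when nothing is required, at least one optional term must be present) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** A term with its constraint (hypothetical model, not a Lucene class). */
final class Clause {
    enum Occur { REQUIRED, OPTIONAL, PROHIBITED }

    final String term;
    final Occur occur;

    Clause(String term, Occur occur) {
        this.term = term;
        this.occur = occur;
    }
}

final class ClauseDemo {
    /** Bare terms take the default constraint; a leading + or - overrides it. */
    static List<Clause> flatten(String query, boolean defaultAnd) {
        List<Clause> clauses = new ArrayList<>();
        for (String tok : query.trim().split("\\s+")) {
            if (tok.isEmpty()) continue;
            if (tok.startsWith("+") && tok.length() > 1) {
                clauses.add(new Clause(tok.substring(1), Clause.Occur.REQUIRED));
            } else if (tok.startsWith("-") && tok.length() > 1) {
                clauses.add(new Clause(tok.substring(1), Clause.Occur.PROHIBITED));
            } else {
                Clause.Occur dflt = defaultAnd ? Clause.Occur.REQUIRED : Clause.Occur.OPTIONAL;
                clauses.add(new Clause(tok, dflt));
            }
        }
        return clauses;
    }

    /** Renders clauses in the +/- notation used in this thread. */
    static String render(List<Clause> clauses) {
        StringBuilder sb = new StringBuilder();
        for (Clause c : clauses) {
            if (sb.length() > 0) sb.append(' ');
            if (c.occur == Clause.Occur.REQUIRED) sb.append('+');
            if (c.occur == Clause.Occur.PROHIBITED) sb.append('-');
            sb.append(c.term);
        }
        return sb.toString();
    }

    /**
     * Assumed rule: every required term present, no prohibited term present,
     * and, if nothing is required, at least one optional term present.
     */
    static boolean matches(List<Clause> clauses, Set<String> docTerms) {
        boolean anyRequired = false;
        boolean anyOptionalHit = false;
        for (Clause c : clauses) {
            boolean present = docTerms.contains(c.term);
            switch (c.occur) {
                case REQUIRED:
                    anyRequired = true;
                    if (!present) return false;
                    break;
                case PROHIBITED:
                    if (present) return false;
                    break;
                default:
                    if (present) anyOptionalHit = true;
            }
        }
        return anyRequired || anyOptionalHit;
    }
}
```

With these definitions, flatten("A B C", false) renders as A B C and flatten("A B C", true) renders as +A +B +C; a document containing only A matches the first list but not the second.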

peter







-----Original Message-----
From: Halácsy Péter [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 11, 2002 4:49 AM
To: Lucene Users List; Clemens Marschner
Subject: RE: GoogleQueryParser




-----Original Message-----
From: Eric Jain [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 11, 2002 1:44 PM
To: Clemens Marschner
Cc: Lucene Users List
Subject: Re: GoogleQueryParser


>> queryParser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
>
> Thanks, that would be exactly what I need. Must be a new
> method, not yet in the public release?

Check out the new QueryParser from CVS.

peter

--
To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
For additional commands, e-mail:
mailto:[EMAIL PROTECTED]





RE: GoogleQueryParser

2002-09-12 Thread John Cwikla


Actually, to expand a little after some more digging: it
appears that the AND/OR terms are being flattened into a list of
required (+), prohibited (-), or optional query terms that are used to
add/remove results one after the other.

In this sense, I think the AND, OR and NOT operators seem to have
been an afterthought to the QueryParser, since they cannot give the
same results as +, - and optional.  AND, OR and NOT should use intersections
and unions, while + and - do strict adds or rejections.

I guess I was expecting a syntax tree, but it looks like the terms are
just flattened.
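
To make the difference concrete, here is a small sketch comparing a syntax-tree reading of a AND b OR c with the flattened list +a +b c. It is plain Java rather than Lucene code; TreeVsFlat, tree and flat are made-up names, and the tree reading assumes AND binds tighter than OR.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

final class TreeVsFlat {
    /** Strict Boolean reading, assuming AND binds tighter: (a AND b) OR c. */
    static boolean tree(Set<String> doc) {
        return (doc.contains("a") && doc.contains("b")) || doc.contains("c");
    }

    /** Flattened reading "+a +b c": a and b are required, c is optional. */
    static boolean flat(Set<String> doc) {
        return doc.contains("a") && doc.contains("b");
    }

    public static void main(String[] args) {
        String[][] docs = { {"a", "b"}, {"c"}, {"a", "c"}, {"a", "b", "c"} };
        for (String[] terms : docs) {
            Set<String> doc = new TreeSet<>(Arrays.asList(terms));
            System.out.println(doc + " tree=" + tree(doc) + " flat=" + flat(doc));
        }
    }
}
```

A document containing only c satisfies the tree reading but is rejected by the flattened one, which is the kind of mismatch described above.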

cwikla





RE: GoogleQueryParser

2002-09-12 Thread Joshua O'Madadhain

This issue confused me somewhat when I first encountered it from the other
direction (looking at the inputs to, and behavior of, BooleanQuery).  

However, ultimately--as I understand it--what's going on here is that
while the AND, OR syntax suggests Boolean queries, the indexing and
searching engines use the vector model (which ranks documents by their
score on the query) rather than the Boolean model (which returns all
documents which satisfy the query).  It may be that the Boolean syntax is
more misleading than useful in some cases.

Part of the problem, too (as Peter pointed out), may be that the example
queries under discussion are not well-formed Boolean queries, so it's not
necessarily clear what their behavior *should* be.

Let's consider the example:

a b OR c

If I assume a default of AND, a AND (b OR c) seems plausible; so
does (a AND b) OR c; or perhaps (a AND c) OR (b AND c).

I believe that the vector model interpretation of this query may be this
(with AND default):

+a +b c

which means 

The more of the search terms 'a', 'b', 'c' that a document has, the
higher its score will be, all other things being equal; however, if a
document doesn't have 'a', or doesn't have 'b', give it score 0
regardless of any other factors.

This is probably most similar to (a AND c) OR (b AND c).

The Boolean interpretation of (a AND c) OR (b AND c) would be

Return all documents [in any order] that have *at least one of the
following* term sets: {'a', 'c'}; {'b', 'c'}.

which, if you think about it, gives you the same document set as the
vector model interpretation.
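
As a rough illustration of the scoring rule described above for +a +b c, here is a toy sketch in plain Java. It is not Lucene's scoring: ToyScore and score are hypothetical names, and counting matched terms is a deliberate simplification of a real weighted score.

```java
import java.util.Set;

final class ToyScore {
    /**
     * Toy score for "+a +b c": the number of query terms the document
     * contains, or 0 if either required term (a or b) is missing.
     */
    static int score(Set<String> doc) {
        if (!doc.contains("a") || !doc.contains("b")) {
            return 0; // a required term is missing
        }
        int s = 2; // both required terms are present
        if (doc.contains("c")) {
            s++; // the optional term only raises the score
        }
        return s;
    }
}
```

Under this toy rule a document with a, b and c outranks one with only a and b, while a document with a and c but no b scores 0.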


Personally, I avoid using the QueryParser entirely and just do my own
parsing and query construction.  Part of the reason for this is that my
code is doing term expansion and reweighting, but part of it is just that
I feel that I get more power and flexibility--and less opportunity for
ambiguity such as this--by doing my own parsing.  Your mileage may vary.  
:)

Regards,

Joshua O'Madadhain

 [EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.


RE: GoogleQueryParser

2002-09-12 Thread Philip Chan

I mentioned that it was a bug because it is not consistent with how
QueryParser handled queries in 1.2.
In 1.2, a AND b OR c means +a +b c, and c OR a AND b means c +a +b.

In this case, however, searching for a b OR c is not the same as searching
for a AND b OR c, even if I call setOperator(DEFAULT_OPERATOR_AND) first.
One would expect them to mean the same thing, because whenever an
operator is not specified, it should default to AND.

Basically, output#1 and output#2 from the code below are different, while I
expect them to be the same:

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

QueryParser qp = new QueryParser(field, new SimpleAnalyzer());
// Explicit operators, default operator left at OR
Query q = qp.parse("a AND b OR c");
System.out.println(q.toString(field)); // output#1

// Same query with the AND left implicit, after switching the default operator
qp.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
q = qp.parse("a b OR c");
System.out.println(q.toString(field)); // output#2



Philip





Re: GoogleQueryParser

2002-09-11 Thread gcasper


> - Treat a-b as "a-b" rather than "a -b".


I came across the same issue. It is quite essential for some European sites
(as you surely know :-)

I'm not very familiar with JavaCC, but I changed QueryParser.jj in the
following way:

I changed
| <MINUS: "-" >
to
| <MINUS: " -" >

and removed "-" from the lists in
| <#_ESCAPED_CHAR:
and
| <#_TERM_START_CHAR:

This actually changes the behaviour to that of Google, and I haven't
experienced any negative side effects (yet).
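
The intended effect can be illustrated with a tiny hand-written splitter. This is plain Java, not the JavaCC grammar itself; HyphenTokens and tokenize are hypothetical names, and the rule that a minus counts as the prohibit operator only at the start of a whitespace-separated token is an assumption about the desired behaviour.

```java
import java.util.ArrayList;
import java.util.List;

final class HyphenTokens {
    /**
     * Splits a query on whitespace. A '-' is treated as the prohibit
     * operator only when it starts a token; inside a token it stays
     * part of the term.
     */
    static List<String> tokenize(String query) {
        List<String> out = new ArrayList<>();
        for (String tok : query.trim().split("\\s+")) {
            if (tok.isEmpty()) continue;
            if (tok.length() > 1 && tok.charAt(0) == '-') {
                out.add("MINUS");
                out.add("TERM(" + tok.substring(1) + ")");
            } else {
                out.add("TERM(" + tok + ")");
            }
        }
        return out;
    }
}
```

Here tokenize("a-b") yields [TERM(a-b)], while tokenize("a -b") yields [TERM(a), MINUS, TERM(b)].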

HTH
Guido






Re: GoogleQueryParser

2002-09-11 Thread Clemens Marschner

> - Default AND rather than OR.

As for this part, it can be accomplished with:

QueryParser queryParser = new QueryParser(defaultField, new MyAnalyzer());
queryParser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);

> - Treat a-b as "a-b" rather than "a -b".

That would be interesting for me, too.

Clemens







Re: GoogleQueryParser

2002-09-11 Thread Eric Jain

> This actually changes the behaviour to that of google and I didn't
> experience any negative side effects (yet).

Thanks. I hope there will eventually be some standard way to accomplish
this...


--
Eric Jain






RE: GoogleQueryParser

2002-09-11 Thread Halácsy Péter



-----Original Message-----
From: Eric Jain [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 11, 2002 1:44 PM
To: Clemens Marschner
Cc: Lucene Users List
Subject: Re: GoogleQueryParser


>> queryParser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
>
> Thanks, that would be exactly what I need. Must be a new
> method, not yet in the public release?

Check out the new QueryParser from CVS.

peter





RE: GoogleQueryParser

2002-09-11 Thread Philip Chan

I think there's a bug. If I set the default operator to OR and run

java org.apache.lucene.queryParser.QueryParser "a AND b OR c"

it gives me the result +a +b c.
If I set the default operator to AND and run it with the query a b OR c,
it gives me +a b c, which is different.
