RE: GoogleQueryParser
So here is the problem. With the AND default operator:

    a b OR c
    == a AND b OR c
    == c OR (a AND b)
    == c OR (+a +b)
    != c b +a
    != c +b +a

However, with the default OR operator:

    a AND b c
    == a AND b OR c
    == c OR (a AND b)
    == c OR (+a +b)
    == c (+a +b)

Since AND and OR do not actually mean required or optional in a strict boolean sense, I claim you cannot correctly use the query parser with a default AND operator and get the results that would be expected.

I haven't looked further into the QueryParser yet, but in the last case with the AND operator, if at some point the internal query switched to OR, then the last item would be correct if it had parentheses, like c (+a +b).

Or is it early and I'm missing something?

cwikla

-Original Message-
From: Halácsy Péter [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 11, 2002 11:58 PM
To: Lucene Users List
Subject: RE: GoogleQueryParser

-Original Message-
From: Philip Chan [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 11, 2002 11:04 PM
To: Lucene Users List
Subject: RE: GoogleQueryParser

I think there's a bug. If I set the default operator to be OR, when I run

    java org.apache.lucene.queryParser.QueryParser a AND b OR c

it will give me the result of +a +b c. If I set the default operator to be AND, and run it with the term a b OR c, it will give me +a b c, which is different.

To be exact, it's not a bug, it's a feature ;)

Well, the structured query language of Lucene (and Google and others) is not a strict boolean language. For example, I think the QueryParser of Lucene does not support parentheses: a AND (b OR C)

Instead of strict boolean logic, it supports constraints on query terms: a query term is either required, optional, or prohibited. If you write a + sign before the term, it will be required. If you write -, it will be prohibited. The question is: is a term required or optional if you do not specify anything?

    DEFAULT_OPERATOR_OR (default QueryParser): A B C -- all three terms are optional
    DEFAULT_OPERATOR_AND (Google style): A B C -- +A +B +C, all three terms are required

Because the a b OR c query is not a strict boolean query, the query parser can choose how to translate it. +a +b c is not too good, since it doesn't equal the result of the input query c OR a b.

peter

-Original Message-
From: Halácsy Péter [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 11, 2002 4:49 AM
To: Lucene Users List; Clemens Marschner
Subject: RE: GoogleQueryParser

-Original Message-
From: Eric Jain [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 11, 2002 1:44 PM
To: Clemens Marschner
Cc: Lucene Users List
Subject: Re: GoogleQueryParser

    queryParser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);

Thanks, that would be exactly what I need. Must be a new method, not yet in the public release?

Check out the new QueryParser from CVS.

peter

--
To unsubscribe, e-mail: mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]
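The required/optional/prohibited constraints Peter describes above, and the mismatch cwikla points out, can be tried on a single document with a small sketch. This is a toy model, not Lucene code: the class and method names are made up for illustration, and the rule that a query with no required terms needs at least one optional hit is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Toy model of required/optional/prohibited term constraints; not Lucene code. */
final class TermConstraintSketch {

    /**
     * Matches a flat query such as "+a b -c" against a document's term set.
     * A leading '+' marks a required term, a leading '-' a prohibited term,
     * and an unmarked term is optional.
     */
    static boolean matchesFlat(String flatQuery, Set<String> docTerms) {
        boolean hasRequired = false;
        boolean optionalHit = false;
        for (String token : flatQuery.trim().split("\\s+")) {
            if (token.startsWith("+")) {
                hasRequired = true;
                if (!docTerms.contains(token.substring(1))) {
                    return false; // a required term is missing
                }
            } else if (token.startsWith("-")) {
                if (docTerms.contains(token.substring(1))) {
                    return false; // a prohibited term is present
                }
            } else if (docTerms.contains(token)) {
                optionalHit = true;
            }
        }
        // Assumption: with no required terms, at least one optional term must hit.
        return hasRequired || optionalHit;
    }

    /** Strict boolean reading of "(a AND b) OR c". */
    static boolean matchesStrict(Set<String> docTerms) {
        return (docTerms.contains("a") && docTerms.contains("b")) || docTerms.contains("c");
    }

    /** Builds a document as a set of terms. */
    static Set<String> doc(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }

    public static void main(String[] args) {
        Set<String> onlyC = doc("c");
        System.out.println("strict (a AND b) OR c: " + matchesStrict(onlyC));
        System.out.println("flat +a b c:           " + matchesFlat("+a b c", onlyC));
        System.out.println("flat +a +b c:          " + matchesFlat("+a +b c", onlyC));
    }
}
```

For a document that contains only c, the strict reading matches while both flat forms reject it, which is the gap between "c OR (+a +b)" and "c +b +a" in the message above.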
RE: GoogleQueryParser
Actually, to expand a little more after a little more digging: it appears that the AND/OR terms are being flattened into a list of +, -, or optional query terms that are used to add or remove results one after the other.

In this sense, I think the AND, OR and NOT operators seem to have been an afterthought in the QueryParser, since they cannot give the same results as +, - and optional. AND, OR and NOT should use intersections and unions, while + and - do strict adds or rejections.

I guess I was expecting a syntax tree, but it looks like it is just a flattening of terms.

cwikla

-Original Message-
From: John Cwikla [mailto:[EMAIL PROTECTED]]
Sent: Thursday, September 12, 2002 11:53 AM
To: 'Lucene Users List'
Subject: RE: GoogleQueryParser

So here is the problem. With the AND default operator:

    a b OR c
    == a AND b OR c
    == c OR (a AND b)
    == c OR (+a +b)
    != c b +a
    != c +b +a

However, with the default OR operator:

    a AND b c
    == a AND b OR c
    == c OR (a AND b)
    == c OR (+a +b)
    == c (+a +b)

Since AND and OR do not actually mean required or optional in a strict boolean sense, I claim you cannot correctly use the query parser with a default AND operator and get the results that would be expected.

I haven't looked further into the QueryParser yet, but in the last case with the AND operator, if at some point the internal query switched to OR, then the last item would be correct if it had parentheses, like c (+a +b).

Or is it early and I'm missing something?

cwikla
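The flattening described above can be imitated with a short sketch. It is a toy model written only to reproduce the outputs reported in this thread (+a +b c with the OR default, +a b c with the AND default); it is not the QueryParser source, it handles only bare terms, AND, and OR, and the names FlatteningSketch and flatten are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy flattening of AND/OR queries into +term / term clauses; not the QueryParser source. */
final class FlatteningSketch {

    /** One flattened clause: a term and whether it is required. */
    static final class Clause {
        final String term;
        boolean required;

        Clause(String term, boolean required) {
            this.term = term;
            this.required = required;
        }
    }

    /**
     * Flattens a parenthesis-free query made of bare terms, AND, and OR.
     * defaultAnd selects the (assumed) treatment of terms with no explicit operator.
     */
    static String flatten(String query, boolean defaultAnd) {
        List<Clause> clauses = new ArrayList<>();
        String conj = null; // operator seen since the previous term, if any
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank input yields no clauses
            }
            if (token.equals("AND") || token.equals("OR")) {
                conj = token;
                continue;
            }
            if (!clauses.isEmpty()) {
                Clause previous = clauses.get(clauses.size() - 1);
                if ("AND".equals(conj)) {
                    previous.required = true;  // AND also pins the term before it
                } else if (defaultAnd && "OR".equals(conj)) {
                    previous.required = false; // OR relaxes the term before it
                }
            }
            boolean required = defaultAnd ? !"OR".equals(conj) : "AND".equals(conj);
            clauses.add(new Clause(token, required));
            conj = null;
        }
        StringBuilder out = new StringBuilder();
        for (Clause c : clauses) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(c.required ? "+" : "").append(c.term);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(flatten("a AND b OR c", false)); // +a +b c
        System.out.println(flatten("a b OR c", true));      // +a b c
    }
}
```

Because each operator only adjusts the flags of its two neighbouring terms, no grouping survives, which is why a syntax tree such as c OR (a AND b) cannot come out of this kind of pass.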
RE: GoogleQueryParser
This issue confused me somewhat when I first encountered it from the other direction (looking at the inputs to, and behavior of, BooleanQuery). However, ultimately--as I understand it--what's going on here is that while the AND/OR syntax suggests Boolean queries, the indexing and searching engines use the vector model (which ranks documents by their score on the query) rather than the Boolean model (which returns all documents that satisfy the query). It may be that the Boolean syntax is more misleading than useful in some cases.

Part of the problem, too (as Peter pointed out), may be that the example queries under discussion are not well-formed Boolean queries, so it's not necessarily clear what their behavior *should* be. Let's consider the example:

    a b OR c

If I assume a default of AND, a AND (b OR c) seems plausible; so does (a AND b) OR c; or perhaps (a AND c) OR (b AND c).

I believe that the vector model interpretation of this query may be this (with AND default):

    +a +b c

which means

    The more of the search terms 'a', 'b', 'c' that a document has, the higher its score will be, all other things being equal; however, if a document doesn't have 'a', or doesn't have 'b', give it score 0 regardless of any other factors.

This is probably most similar to (a AND c) OR (b AND c). The Boolean interpretation of (a AND c) OR (b AND c) would be

    Return all documents [in any order] that have *at least one of the following* term sets: {'a', 'c'}; {'b', 'c'}.

which, if you think about it, gives you the same document set as the vector model interpretation.

Personally, I avoid using the QueryParser entirely and just do my own parsing and query construction. Part of the reason for this is that my code is doing term expansion and reweighting, but part of it is just that I feel that I get more power and flexibility--and less opportunity for ambiguity such as this--by doing my own parsing. Your mileage may vary. :)

Regards,

Joshua O'Madadhain

[EMAIL PROTECTED]
Per Obscurius...www.ics.uci.edu/~jmadden
Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.

On Thu, 12 Sep 2002, John Cwikla wrote:

Actually, to expand a little more after a little more digging: it appears that the AND/OR terms are being flattened into a list of +, -, or optional query terms that are used to add or remove results one after the other.

In this sense, I think the AND, OR and NOT operators seem to have been an afterthought in the QueryParser, since they cannot give the same results as +, - and optional. AND, OR and NOT should use intersections and unions, while + and - do strict adds or rejections.

I guess I was expecting a syntax tree, but it looks like it is just a flattening of terms.

cwikla
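The scoring rule quoted above for +a +b c (more matching terms means a higher score, and a missing required term means score 0) can be written down as a small sketch. It is deliberately crude: it counts matching terms, whereas a real engine weights them, and the class and method names are invented for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy scoring for a query such as +a +b c; a term count, not a real similarity measure. */
final class VectorScoreSketch {

    /** Returns 0 if any required term is missing, else the number of query terms present. */
    static int score(List<String> required, List<String> optional, Set<String> docTerms) {
        int score = 0;
        for (String term : required) {
            if (!docTerms.contains(term)) {
                return 0; // missing a required term: score 0 regardless of other factors
            }
            score++;
        }
        for (String term : optional) {
            if (docTerms.contains(term)) {
                score++; // each additional matching term raises the score
            }
        }
        return score;
    }

    /** Builds a document as a set of terms. */
    static Set<String> doc(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }

    public static void main(String[] args) {
        List<String> required = Arrays.asList("a", "b"); // +a +b
        List<String> optional = Arrays.asList("c");      // c
        System.out.println("{a,b,c}: " + score(required, optional, doc("a", "b", "c")));
        System.out.println("{a,b}:   " + score(required, optional, doc("a", "b")));
        System.out.println("{a,c}:   " + score(required, optional, doc("a", "c")));
    }
}
```

Sorting documents by this value in descending order gives the ranked list that the vector model returns in place of an unordered Boolean result set.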
RE: GoogleQueryParser
I called it a bug because it is not consistent with how QueryParser handled queries in 1.2. In 1.2, a AND b OR c means +a +b c, and c OR a AND b means c +a +b. In this case, however, searching for a b OR c is not the same as searching for a AND b OR c, even if I call setOperator(DEFAULT_OPERATOR_AND) first. One would expect them to mean the same thing, because whenever an operator is not specified, it should default to AND.

Basically, output#1 and output#2 from the code below are different, while I expect them to be the same:

    QueryParser qp = new QueryParser(field, new org.apache.lucene.analysis.SimpleAnalyzer());

    // Parsed with the parser's initial default operator (OR)
    Query q = qp.parse("a AND b OR c");
    System.out.println(q.toString(field)); // output#1

    // Switch the default operator to AND and parse the implicit-AND form
    qp.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
    q = qp.parse("a b OR c");
    System.out.println(q.toString(field)); // output#2

Philip
Re: GoogleQueryParser
- Treat a-b as a-b rather than a -b.

I came across the same thing. Quite an essential issue for some European sites (as you surely know :-)

I'm not very familiar with JavaCC, but I changed QueryParser.jj in the following way: I changed | MINUS: - to | MINUS: - and removed - from the list of | #_ESCAPED_CHAR: and | #_TERM_START_CHAR:

This actually changes the behaviour to that of Google, and I didn't experience any negative side effects (yet).

HTH
Guido
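The behaviour being asked for (a-b stays one term, a -b prohibits b) can be illustrated without touching the grammar. The sketch below is a whitespace tokenizer and only an assumption about the intended result, not a rendering of the QueryParser.jj change described above; HyphenSketch and tokenize are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy tokenizer: '-' prohibits a term only at the start of a token; not the JavaCC grammar. */
final class HyphenSketch {

    /** Splits on whitespace and labels each token as a plain or prohibited term. */
    static List<String> tokenize(String query) {
        List<String> out = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank input yields no tokens
            }
            if (token.length() > 1 && token.startsWith("-")) {
                out.add("NOT(" + token.substring(1) + ")"); // leading '-': prohibited term
            } else {
                out.add("TERM(" + token + ")"); // an inner '-' stays part of the term
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a-b"));  // [TERM(a-b)]
        System.out.println(tokenize("a -b")); // [TERM(a), NOT(b)]
    }
}
```

Whether the hyphenated term then survives as a single token still depends on the analyzer in use, which this sketch does not model.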
Re: GoogleQueryParser
- Default AND rather than OR.

As for this part: this can be accomplished with

    queryParser = new QueryParser(defaultField, new MyAnalyzer());
    queryParser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);

- Treat a-b as a-b rather than a -b.

That would be interesting for me, too.

Clemens
Re: GoogleQueryParser
This actually changes the behaviour to that of google and I didn't experience any negative side effects (yet).

Thanks. I hope there will eventually be some standard way to accomplish this...

--
Eric Jain
RE: GoogleQueryParser
-Original Message-
From: Eric Jain [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 11, 2002 1:44 PM
To: Clemens Marschner
Cc: Lucene Users List
Subject: Re: GoogleQueryParser

    queryParser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);

Thanks, that would be exactly what I need. Must be a new method, not yet in the public release?

Check out the new QueryParser from CVS.

peter
RE: GoogleQueryParser
I think there's a bug. If I set the default operator to be OR, when I run

    java org.apache.lucene.queryParser.QueryParser a AND b OR c

it will give me the result of +a +b c. If I set the default operator to be AND, and run it with the term a b OR c, it will give me +a b c, which is different.