Re: Getting a Better Understanding of Lucene's Search Operators
Lucene uses a scoring system that behaves similarly to a boolean system. Each piece of the query contributes to the score for each document. If a document scores 0, it is not returned in the results.

> To search for documents that must contain apples and may contain oranges use the query: +apples oranges

This query will score any document without apples as 0. If a document contains apples it will get a positive score, and if it also happens to contain oranges it will score higher. The absence of oranges will not force a 0 score, but their presence will boost the score.

> Clearly this is _not_ the same as 'apples OR oranges', which would

That would be: apples oranges

In this case, the absence of either term will not force a 0 score, but if neither term appears the score will be 0. A document containing both terms scores higher than one containing just one.

> Conversely, the prohibit operator (-) is called out from the NOT operator: To search for documents that contain apples but not oranges use the query: apples -oranges
> I do not understand why this isn't simply equivalent to: apples AND NOT oranges

It is equivalent. The prohibit operator forces a score of 0 on any document that contains the term. Finding apples might put a positive score on a document, but finding oranges then sets the score to 0 no matter what score the other terms generated. That is why it cannot be used as a unary NOT: -oranges on its own would score every document as 0 and nothing would be returned. If you combine the special MatchAllDocsQuery with -oranges you get the effect of a unary NOT. The match-all query scores each document positively, and the - then zeroes out every document that contains the prohibited term.

> ...if it is, why all the big fuss about calling it prohibit and not just another alias for NOT?
> ...if it isn't, then what's the difference in behavior?

It's kind of like an ANDNOT in boolean terms...

> The fact that the documentation calls out these operators separately, gives them their own unique names, and describes them in different terms is enough to make me think something very important or very subtle is going on.

The subtle part is that a scoring system is being used that operates in something of a boolean fashion, but with subtle differences.

- Mark
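
The scoring behaviour described above can be sketched with a toy model. This is not Lucene code: the `score` function is a hypothetical stand-in that awards one point per matching term, whereas real Lucene scores are relevance-weighted floats. It only illustrates how required (+), optional (no prefix), and prohibited (-) clauses interact.

```python
def score(query, doc_terms):
    """Toy score: 1 point per matching non-prohibited term, 0 if a rule is violated.

    Assumption: one point per term. Real Lucene scoring is weighted, not a count.
    """
    total = 0
    for token in query.split():
        occur = token[0] if token[0] in "+-" else ""
        term = token[len(occur):]
        present = term in doc_terms
        if occur == "+" and not present:
            return 0  # a required term is missing
        if occur == "-" and present:
            return 0  # a prohibited term is present
        if occur != "-" and present:
            total += 1  # required and optional matches both add to the score
    return total


# Print the toy score of each query against four small documents.
for query in ("+apples oranges", "apples oranges", "apples -oranges"):
    for doc in (set(), {"apples"}, {"oranges"}, {"apples", "oranges"}):
        print(f"{query!r:20} {sorted(doc)!s:24} -> {score(query, doc)}")
```

In this model a document is "returned" exactly when its score is greater than 0, which is the rule stated at the top of the message.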
Re: Getting a Better Understanding of Lucene's Search Operators
On 1/10/07, Mark Miller [EMAIL PROTECTED] wrote:
> The subtle part is that a scoring system is being used that operates in something of a boolean fashion, but that has subtle difference.

Mark, -thank you-. This explains it beautifully.

So, if I understand you right, a simple query of NOT ORANGES gets me every document that does not contain the word oranges, while a separate query with -ORANGES added will force the score to zero for all documents in which oranges does not appear. One's a selector, the other is a filter.

The + operator, in turn, simply affects the score (which is used for ranking). Anything with a non-zero score is returned, but the better the score, the more prominent it is in the ordered result list.

Do I have a correct and complete understanding of the two operators?

-wls
Re: Getting a Better Understanding of Lucene's Search Operators
Walt Stoneburner wrote:
> Do I have correct and complete understanding of the two operators?

Not entirely complete :) There is more information in the October 2006 thread "QueryParser is Badly Broken": http://www.gossamer-threads.com/lists/lucene/java-user/40945
Re: Getting a Better Understanding of Lucene's Search Operators
Based on responses from Steven Rowe [EMAIL PROTECTED] and Mark Miller [EMAIL PROTECTED]:
> Lucene uses a scoring system that behaves similarly to a boolean system.
> ... more information in the October 2006 thread QueryParser is Badly Broken http://www.gossamer-threads.com/lists/lucene/java-user/40945

This is now my generalized understanding of the parser's operators. Am I closer?

Document            A OR B   A B   +A B   A   NOT B   -B   A AND NOT B   A -B
No Matching Terms   0        0     0      0   1       0    0             0
A                   1        1     1      1   1       0    1             1
B                   1        1     0      0   0       0    0             *0*
A B                 1        2     2      1   0       0    0             1* 0*

Non-zero results are returned to the user.

Walt Stoneburner [EMAIL PROTECTED] 10-Jan-2007 v1.0

-wls
Re: Getting a Better Understanding of Lucene's Search Operators
> So, if I understand you right, a simple query of NOT ORANGES gets me every document that does not contain the word oranges, while a separate query with -ORANGES added will force the score to zero for all documents in which oranges does not appear. One's a selector, the other is a filter.

Not quite. NOT oranges on its own is not possible. Neither is -oranges. Both produce a query that forces a document's score to 0 if oranges is found, but every document starts at 0 anyway, so none will be returned. You need some other piece of the query to generate a positive score for the NOT or - to take effect; otherwise every document scores 0 and nothing is returned.

- mark
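
A small sketch of why a purely negative query returns nothing, and how a match-all clause changes that. Everything here is a toy model rather than Lucene API: `MATCH_ALL`, `score`, `search`, and the sample documents are hypothetical names made up for illustration.

```python
MATCH_ALL = "*"  # hypothetical token standing in for a match-all clause


def score(query, doc_terms):
    """Toy score: 1 point per matching non-prohibited clause, 0 if a rule is violated."""
    total = 0
    for token in query.split():
        occur = token[0] if token[0] in "+-" else ""
        term = token[len(occur):]
        hit = term == MATCH_ALL or term in doc_terms
        if occur == "+" and not hit:
            return 0  # required clause did not match
        if occur == "-" and hit:
            return 0  # prohibited clause matched
        if occur != "-" and hit:
            total += 1
    return total


def search(query, docs):
    """Return the names of documents with a non-zero toy score, sorted by name."""
    return sorted(name for name, terms in docs.items() if score(query, terms) > 0)


docs = {
    "d1": {"apples"},
    "d2": {"oranges"},
    "d3": {"apples", "oranges"},
    "d4": {"pears"},
}

print(search("-oranges", docs))       # nothing generates a positive score
print(search("* -oranges", docs))     # match-all gives every doc a score first
print(search("apples -oranges", docs))
```

With only `-oranges`, no clause ever adds to the score, so every document stays at 0. Adding the match-all clause gives each document a positive score, and the prohibited clause then removes the ones containing oranges.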
Re: Getting a Better Understanding of Lucene's Search Operators
: This is now my generalized understanding of the parser's operators. Am I
: closer?

I'm guessing there is supposed to be some sort of table structure to the mail you sent ... it doesn't work in plain text mail readers so I'm not sure what you were trying to say.

In a nutshell...

1) Lucene's QueryParser class does not parse boolean expressions -- it might look like it, but it does not.

2) Lucene's BooleanQuery class does not model Boolean Queries ... it models aggregate queries.

3) the most native way to represent the options available in a Lucene BooleanQuery as a string is with the +/- prefixes, where...
   +foo ... means foo is a required clause and docs must match it
   -foo ... means foo is a prohibited clause and docs must not match it
   foo  ... means foo is an optional clause and docs that match it will get score benefits for doing so.

4) in an attempt to make things easier for people who have simple needs, QueryParser fakes that it parses boolean expressions by interpreting A AND B as +A +B; A OR B as A B; and NOT A as -A

5) if you change the default operator on QueryParser to be AND then things get more complicated, mainly because then QueryParser treats A B the same as +A +B

6) you should avoid thinking in terms of AND, OR, and NOT ... think in terms of OPTIONAL, REQUIRED, and PROHIBITED ... your life will be much easier: documentation will make more sense, conversations on the email list will be more synergistastic, wine will be sweeter, and food will taste better.

-Hoss
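
The rewriting rules in points 4 and 5 can be sketched as a small translator. This is a simplified illustration, not QueryParser's actual implementation: `to_prefix` is a hypothetical helper that handles only flat expressions made of bare terms and the keywords AND, OR, and NOT, and it does not try to reproduce every quirk of the real parser.

```python
def to_prefix(expr, default_and=False):
    """Rewrite a flat AND/OR/NOT expression into +/- prefix form (toy version).

    Assumption: default_and mimics switching the default operator to AND.
    """
    clauses = []    # each entry is [occur, term], occur being "+", "-" or ""
    conj = None     # conjunction seen since the previous term, if any
    negate = False  # True when NOT precedes the next term
    for tok in expr.split():
        if tok in ("AND", "OR"):
            conj = tok
            continue
        if tok == "NOT":
            negate = True
            continue
        if negate:
            occur = "-"
        elif conj == "AND" or (conj is None and default_and):
            occur = "+"
        else:
            occur = ""
        if clauses and clauses[-1][0] != "-":
            if conj == "AND":
                clauses[-1][0] = "+"  # AND also makes its left operand required
            elif conj == "OR" and default_and:
                clauses[-1][0] = ""   # explicit OR relaxes its left operand
        clauses.append([occur, tok])
        conj, negate = None, False
    return " ".join(occur + term for occur, term in clauses)


print(to_prefix("A AND B"))
print(to_prefix("A OR B"))
print(to_prefix("NOT A"))
print(to_prefix("A B", default_and=True))
```

Seeing the prefix form makes point 6 concrete: every clause ends up as required, optional, or prohibited, whatever keywords were typed.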
Re: Getting a Better Understanding of Lucene's Search Operators
On 1/10/07, Chris Hostetter [EMAIL PROTECTED] wrote:
> I'm guessing there is supposed to be some sort of table structure to the mail you sent ... it doesn't work in plain text mail readers so I'm not sure what you were trying to say.

My bad, I was using GMail, and it was trying to produce a very simple logic table.

> 6) you should avoid thinking in terms of AND, OR, and NOT ... think in terms of OPTIONAL, REQUIRED, and PROHIBITED ...

Excellent! You've provided a wonderful bit of insight. This makes things much easier to understand.

I should assume, though, that parentheses work as expected? So where I was doing things like ( A OR B ) AND ( C OR D ), that means that +(A B) +(C D) is actually happening?

-wls
Re: Getting a Better Understanding of Lucene's Search Operators
: I should assume, though, that parenthesis work as expected? So where I was
: doing things like:
: ( A OR B ) AND ( C OR D ), that means that +(A B) +(C D) is actually
: happening?

yes ... anywhere I used a simple example like A or foo, it could be replaced with a parenthetical expression whose body is treated as a query expression on its own.

-Hoss
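
To illustrate how a parenthesized group behaves as a clause in its own right, here is a toy recursive model. It is not Lucene code: `tokenize`, `parse`, and `score` are hypothetical helpers, terms are matched case-sensitively, and each matching term is assumed to contribute one point.

```python
import re


def tokenize(query):
    """Split into tokens such as '+(', '(', ')', '+A', '-B', 'C'."""
    return re.findall(r"[+-]?\(|\)|[+-]?\w+", query)


def parse(tokens):
    """Build a nested list of (occur, body); body is a term or a sub-list of clauses."""
    clauses = []
    while tokens:
        tok = tokens.pop(0)
        if tok == ")":
            break  # end of the current group
        occur = tok[0] if tok[0] in "+-" else ""
        body = tok[len(occur):]
        if body == "(":
            clauses.append((occur, parse(tokens)))  # a group is scored as a sub-query
        else:
            clauses.append((occur, body))
    return clauses


def score(clauses, doc_terms):
    """Toy score: groups are scored recursively, then treated like any other clause."""
    total = 0
    for occur, body in clauses:
        s = score(body, doc_terms) if isinstance(body, list) else int(body in doc_terms)
        if occur == "+" and s == 0:
            return 0  # required clause (term or group) did not match
        if occur == "-" and s > 0:
            return 0  # prohibited clause matched
        if occur != "-":
            total += s
    return total


query = parse(tokenize("+(A B) +(C D)"))
for doc in ({"A", "C"}, {"A", "B"}, {"A", "B", "C"}, {"D"}):
    print(sorted(doc), "->", score(query, doc))
```

Here each group must produce a non-zero score for the document to be returned, which is the sense in which ( A OR B ) AND ( C OR D ) corresponds to +(A B) +(C D).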
Re: Getting a Better Understanding of Lucene's Search Operators
To answer questions like "what really happens in terms of a Lucene query", I've been helped greatly by two things: query.toString() and Luke. Of the two, Luke (google lucene luke) is quickest. It will show you what Lucene query is produced by various query strings, etc.

Sorry if you already know about them...

Best
Erick