Re: Getting a Better Understanding of Lucene's Search Operators

2007-01-10 Thread Mark Miller
Lucene uses a scoring system that behaves similarly to a boolean system. 
Each piece of the query contributes to the score for each document...if 
a document scores 0, it is not returned in the results.



> To search for documents that must contain apples and may contain
> oranges use the query:  +apples oranges

This query will score any document without apples as a 0. If a doc
contains apples it will get a positive score, and if the document also
happens to have oranges, it will score higher. An absence of oranges will
not force a 0 score for a document, but the presence of it will boost
the score.


> Clearly this is _not_ the same as 'apples OR oranges', which would

That would be:  apples oranges
In this case, the absence of one term will not force a 0 score, but if
neither term appears the score will be 0. Both terms appearing would score
higher than just one.

> Conversely, the prohibit operator (-) is called out from the NOT
> operator:
>
>  To search for documents that contain apples but not oranges use
>  the query:  apples -oranges
>
> I do not understand why this isn't simply equivalent to:  apples AND
> NOT oranges

This is equivalent. The prohibit operator will force a score of 0 on any
doc that contains the term. Finding apples might put a positive score on
a doc, but then finding oranges will set the score to 0 no matter what
score the other terms generated. That is why this cannot be used as a
unary not...-oranges would score every doc as a 0 and none would be returned.
If you used the special MatchAllDocsQuery and put it with -oranges you
would have the effect of a unary not. MatchAllDocsQuery would score each
doc positively, and then - would 0 out all docs that had the - term.
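
The behaviour described above can be sketched as a toy model. This is not Lucene code and not Lucene's real scoring formula: it assumes every matching clause simply adds 1 to the score, and the names `parse`, `toy_score` and the `*` match-all token are made up for illustration only.

```python
# Toy model of required (+), optional (no prefix) and prohibited (-) clauses.
# Assumption: each matching clause is worth 1 point; real Lucene scores differ.
MATCH_ALL = "*"  # hypothetical stand-in for a match-all clause


def parse(query):
    """Split a flat query such as '+apples oranges' into (occur, term) pairs."""
    clauses = []
    for token in query.split():
        if token[0] in "+-":
            clauses.append((token[0], token[1:]))
        else:
            clauses.append(("", token))
    return clauses


def toy_score(query, doc_terms):
    """Return 0 if the doc is excluded, else the number of matching clauses."""
    total = 0
    for occur, term in parse(query):
        matched = term == MATCH_ALL or term in doc_terms
        if occur == "-":
            if matched:
                return 0  # a prohibited term zeroes the doc outright
        elif matched:
            total += 1  # required and optional matches both add to the score
        elif occur == "+":
            return 0  # a missing required term zeroes the doc
    return total


docs = {
    "d1": {"apples"},
    "d2": {"oranges"},
    "d3": {"apples", "oranges"},
    "d4": {"pears"},
}
for q in ["+apples oranges", "apples oranges", "apples -oranges",
          "-oranges", "* -oranges"]:
    scores = {name: toy_score(q, terms) for name, terms in docs.items()}
    hits = {name: s for name, s in scores.items() if s > 0}
    print(q, "->", hits)
```

In this model `-oranges` on its own returns nothing, because no clause ever produces a positive score, while `* -oranges` returns d1 and d4.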


> ...if it is, why all the big fuss about calling it prohibit and not
> just another alias for NOT?
> ...if it isn't, then what's the difference in behavior?

It's kind of like an ANDNOT in boolean terms...



> The fact that the documentation calls out these operators separately,
> gives them their own unique names, and describes them in different
> terms is enough to make me think something very important or very
> subtle is going on.

The subtle part is that a scoring system is being used that operates in
something of a boolean fashion, but that has subtle differences.


- Mark

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Getting a Better Understanding of Lucene's Search Operators

2007-01-10 Thread Walt Stoneburner

On 1/10/07, Mark Miller [EMAIL PROTECTED] wrote:


> The subtle part is that a scoring system is being used that operates in
> something of a boolean fashion, but that has subtle differences.



Mark, -thank you-.  This explains it beautifully.

So, if I understand you right, a simple query of NOT ORANGES gets me every
document that does not contain the word oranges, while a separate query with
-ORANGES added will force the score to zero for all documents in which
oranges does not appear.  One's a selector, the other is a filter.

The + operator, in turn, simply affects the score (which is used for
ranking).  Anything with a non-zero score is returned, but the better the
score, the more prominent it is in the ordered result list.

Do I have correct and complete understanding of the two operators?

-wls


Re: Getting a Better Understanding of Lucene's Search Operators

2007-01-10 Thread Steven Rowe
Walt Stoneburner wrote:
> Do I have correct and complete understanding of the two operators?

Not entirely complete :) - more information in the October 2006 thread
"QueryParser is Badly Broken":

http://www.gossamer-threads.com/lists/lucene/java-user/40945





Re: Getting a Better Understanding of Lucene's Search Operators

2007-01-10 Thread Walt Stoneburner

Based on responses from Steven Rowe [EMAIL PROTECTED] and Mark Miller 
[EMAIL PROTECTED]:


> Lucene uses a scoring system that behaves similarly to a boolean system.

> ... more information in the October 2006 thread
> "QueryParser is Badly Broken":
> http://www.gossamer-threads.com/lists/lucene/java-user/40945




This is now my generalized understanding of the parser's operators.  Am I
closer?


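Document            | A OR B | A B | +A B | A | NOT B | -B | A AND NOT B | A -B
--------------------+--------+-----+------+---+-------+----+-------------+-------
No Matching Terms   | 0      | 0   | 0    | 0 | 1     | 0  | 0           | 0
A                   | 1      | 1   | 1    | 1 | 1     | 0  | 1           | 1
B                   | 1      | 1   | 0    | 0 | 0     | 0  | 0           | *0*
A B                 | 1      | 2   | 2    | 1 | 0     | 0  | 0           | 1* 0*

Non-zero results are returned to the user.

*Walt Stoneburner [EMAIL PROTECTED]   10-Jan-2007   v1.0*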

-wls


Re: Getting a Better Understanding of Lucene's Search Operators

2007-01-10 Thread Mark Miller


> So, if I understand you right, a simple query of NOT ORANGES gets me every
> document that does not contain the word oranges, while a separate query
> with -ORANGES added will force the score to zero for all documents in which
> oranges does not appear.  One's a selector, the other is a filter.

Not quite. NOT oranges is not possible. Neither is -oranges. Both will make
a query that scores each doc as 0 if oranges is found...but every doc
will start with 0 also...none will be returned. You need a piece of the query to
generate a positive score for the NOT or - to take effect -- otherwise every
document scores 0 and is not returned.

- mark


Re: Getting a Better Understanding of Lucene's Search Operators

2007-01-10 Thread Chris Hostetter

: This is now my generalized understanding of the parser's operators.  Am I
: closer?

I'm guessing there is supposed to be some sort of table structure to the
mail you sent ... it doesn't work in plain text mail readers so i'm not
sure what you were trying to say.

In a nutshell...

 1) Lucene's QueryParser class does not parse boolean expressions -- it
    might look like it, but it does not.
 2) Lucene's BooleanQuery class does not model Boolean Queries ... it
    models aggregate queries.
 3) the most native way to represent the options available in a Lucene
    BooleanQuery as a string is with the +/- prefixes, where...
      +foo ... means foo is a required clause and docs must match it
      -foo ... means foo is a prohibited clause and docs must not match it
       foo ... means foo is an optional clause and docs that match it will
               get score benefits for doing so.
 4) in an attempt to make things easier for people who have
    simple needs, QueryParser fakes that it parses boolean expressions
    by interpreting "A AND B" as "+A +B"; "A OR B" as "A B"; and "NOT A"
    as "-A"
 5) if you change the default operator on QueryParser to be AND then
    things get more complicated, mainly because then QueryParser treats
    "A B" the same as "+A +B"
 6) you should avoid thinking in terms of AND, OR, and NOT ... think in
    terms of OPTIONAL, REQUIRED, and PROHIBITED ... your life will be much
    easier: documentation will make more sense, conversations on the email
    list will be more synergistastic, wine will be sweeter, and food will
    taste better.
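
The keyword-to-prefix mapping in points 4 and 5 can be sketched roughly as follows. This is a toy translation for flat queries only, not QueryParser itself: the function name `to_prefix` and its `default_and` flag are hypothetical, and how `OR` interacts with an AND default is an assumption about the parser's behaviour rather than something stated in this thread.

```python
def to_prefix(query, default_and=False):
    """Rewrite a flat 'A AND B' style query into +/- prefix form (toy model)."""
    clauses = []    # each entry is [occur, term]; occur is "+", "-" or ""
    conj = None     # "AND" / "OR" seen since the previous term
    negate = False  # True when NOT precedes the next term
    for token in query.split():
        if token in ("AND", "OR"):
            conj = token
            continue
        if token == "NOT":
            negate = True
            continue
        # The conjunction also affects the clause to its left,
        # unless that clause is already prohibited.
        if clauses and clauses[-1][0] != "-":
            if conj == "AND":
                clauses[-1][0] = "+"
            elif conj == "OR" and default_and:
                clauses[-1][0] = ""  # assumption: OR demotes it to optional
        if negate:
            occur = "-"
        elif conj == "AND" or (conj is None and default_and):
            occur = "+"
        else:
            occur = ""
        clauses.append([occur, token])
        conj, negate = None, False
    return " ".join(occur + term for occur, term in clauses)


for q in ["A AND B", "A OR B", "NOT A", "A AND NOT B"]:
    print(q, "=>", to_prefix(q))
print("A B (default AND) =>", to_prefix("A B", default_and=True))
```

Queries that mix AND and OR without parentheses are exactly where the real parser surprises people, so the output of this sketch for such inputs should not be trusted; checking what the parser actually produced is the safer route.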




-Hoss





Re: Getting a Better Understanding of Lucene's Search Operators

2007-01-10 Thread Walt Stoneburner

On 1/10/07, Chris Hostetter [EMAIL PROTECTED] wrote:


> I'm guessing there is supposed to be some sort of table structure to the
> mail you sent ... it doesn't work in plain text mail readers so i'm not
> sure what you were trying to say.



My bad, I was using GMail, and I was trying to produce a very simple logic
table.


> 6) you should avoid thinking in terms of AND, OR, and NOT ... think in
> terms of OPTIONAL, REQUIRED, and PROHIBITED ...



Excellent!  You've provided a wonderful bit of insight.  This makes things
much easier to understand.

I should assume, though, that parentheses work as expected?  So where I was
doing things like:
( A OR B ) AND ( C OR D ), that means that +(A B) +(C D) is actually
happening?

-wls


Re: Getting a Better Understanding of Lucene's Search Operators

2007-01-10 Thread Chris Hostetter

: I should assume, though, that parentheses work as expected?  So where I was
: doing things like:
: ( A OR B ) AND ( C OR D ), that means that +(A B) +(C D) is actually
: happening?

yes ... anywhere i used a simple example like "A" or "foo" could be
replaced with a parenthetical expression whose body is treated as a query
expression on its own.
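
A rough way to picture this nesting: each clause holds either a term or a whole sub-query, and a sub-query counts as "matching" when its own score is positive. The sketch below is a toy model under that assumption (one point per matching term), with hypothetical names `score_clauses` and `render`; it is not Lucene's API or scoring.

```python
def score_clauses(clauses, doc_terms):
    """Score a list of (occur, sub) pairs; sub is a term or a nested list."""
    total = 0
    for occur, sub in clauses:
        if isinstance(sub, str):
            sub_score = 1 if sub in doc_terms else 0
        else:
            sub_score = score_clauses(sub, doc_terms)  # nested expression
        if occur == "-":
            if sub_score > 0:
                return 0  # prohibited clause matched
        elif sub_score > 0:
            total += sub_score
        elif occur == "+":
            return 0  # required clause did not match
    return total


def render(clauses):
    """Print the clause tree back in +/- prefix notation."""
    parts = []
    for occur, sub in clauses:
        body = sub if isinstance(sub, str) else "(" + render(sub) + ")"
        parts.append(occur + body)
    return " ".join(parts)


# ( A OR B ) AND ( C OR D ) written as two required groups of optional terms
query = [("+", [("", "A"), ("", "B")]), ("+", [("", "C"), ("", "D")])]
print(render(query))
for doc in [{"A", "C"}, {"A", "B"}, {"B", "D"}, {"A", "B", "C", "D"}]:
    print(sorted(doc), "->", score_clauses(query, doc))
```

A document containing only A and B scores 0 here because the second required group has no match, which is the behaviour the parenthesised AND is expected to give.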


-Hoss





Re: Getting a Better Understanding of Lucene's Search Operators

2007-01-10 Thread Erick Erickson

To answer questions like "what really happens in terms of a Lucene query",
I've been helped greatly by two things...

query.toString();

and Luke.

Of the two, Luke (google "lucene luke") is quickest. It will show you what
Lucene query is produced by various query strings, etc.

Sorry if you already know about them...

Best
Erick
