You are right, Lucene will return results based on my scoring function implementation (Similarity class <http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/Similarity.html>):

score(q,d) = coord(q,d) · queryNorm(q) · ∑_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

(The factors are defined on that page under the anchors #formula_coord, #formula_queryNorm, #formula_tf, #formula_idf, #formula_termBoost and #formula_norm.)

It can be seen that whenever tf(t in d) = 0 for the query's only term, the whole score will be 0, so, as you say, C will never be returned.

My issue is when the query has multiple terms (my example was too simple!), and some are 'mandatory' while others are not. In that case I should make a query that uses the + operator <http://lucene.apache.org/java/2_9_1/queryparsersyntax.html#+> (e.g. q=+k1).
I'm unsure I'll get the syntax right, but let's say k1 is mandatory and k2 and k3 are optional, then q=k2 k3 +k1.
I see that queries made through solrj are received with + in place of the " " (default to OR), so q=k2+k3++k1.

On Tue, Jun 7, 2011 at 5:23 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:

> Um, normally that would never happen, because, well, like you say, the
> inverted index doesn't have docC for term K1, because doc C didn't include
> term K1.
>
> If you search on q=K1, then how/why would docC ever be in your result set?
> Are you seeing it in your result set? The question then would be _why_,
> what weird thing is going on to make that happen, that's not expected.
>
> The result set _starts_ from only the documents that actually include the
> term.
> Boosting/relevancy ranking only affects what order these documents
> appear in, but there's no reason documentC should be in the result set at
> all in your case of q=k1, where docC is not indexed under k1.
>
>
> On 6/7/2011 2:35 AM, Gabriele Kahlout wrote:
>
>> Sorry for being unclear and thank you for answering.
>> Consider the following documents A(k0,k1,k2), B(k1,k2,k3), and
>> C(k0,k2,k3),
>> where A, B, C are document identifiers and the ks in brackets with each
>> are the terms each contains.
>> So the Solr inverted index should be something like:
>>
>> k0 --> A | C
>> k1 --> A | B
>> k2 --> A | B | C
>> k3 --> B | C
>>
>> Now let q=k1, how do I make sure C doesn't appear as a result since it
>> doesn't contain any occurrence of k1?
>>
>

--
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).
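
For reference, here is a rough sketch of the mandatory/optional behaviour discussed in this thread, run against the toy index A, B, C from the original question. It is a plain-Python model, not Lucene or Solr code: the names `INDEX`, `parse` and `search` are hypothetical, and ranking by the number of matched terms is only a stand-in for the real Similarity score. The last line shows how standard URL form encoding (an assumption about what goes over the wire, not something confirmed for solrj in this thread) would render q=k2 k3 +k1.

```python
from urllib.parse import urlencode

# Toy inverted index from the thread: term -> set of document ids.
INDEX = {
    "k0": {"A", "C"},
    "k1": {"A", "B"},
    "k2": {"A", "B", "C"},
    "k3": {"B", "C"},
}


def parse(query):
    """Split a query such as 'k2 k3 +k1' into (mandatory, optional) terms."""
    mandatory, optional = [], []
    for token in query.split():
        if token.startswith("+"):
            mandatory.append(token[1:])
        else:
            optional.append(token)
    return mandatory, optional


def search(query, index=INDEX):
    """Return matching doc ids, best first.

    Documents must contain every mandatory term. Optional terms only
    influence the ordering, unless there are no mandatory terms, in which
    case a document must contain at least one optional term (OR semantics).
    """
    mandatory, optional = parse(query)
    if mandatory:
        candidates = set.intersection(*(index.get(t, set()) for t in mandatory))
    else:
        candidates = set().union(*(index.get(t, set()) for t in optional))

    def matched(doc):
        # Stand-in for a real score: how many query terms the doc contains.
        return sum(doc in index.get(t, set()) for t in mandatory + optional)

    return sorted(candidates, key=lambda d: (-matched(d), d))


print(search("k1"))                       # ['A', 'B']
print(search("k2 k3 +k1"))                # ['B', 'A']
print(urlencode({"q": "k2 k3 +k1"}))      # q=k2+k3+%2Bk1
```

In this model C is never returned for q=k1 or for q=k2 k3 +k1, because the candidate set starts from the documents listed under k1. Note that under standard form encoding a space becomes + while a literal + becomes %2B, so the encoded form may differ from q=k2+k3++k1.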