Re: How do I make sure the resulting documents contain the query terms?
Sorry for being unclear, and thank you for answering.

Consider the following documents: A(k0,k1,k2), B(k1,k2,k3), and C(k0,k2,k3), where A, B, C are document identifiers and the ks in brackets are the terms each document contains. So the Solr inverted index should be something like:

k0 -- A | C
k1 -- A | B
k2 -- A | B | C
k3 -- B | C

Now let q=k1. How do I make sure C doesn't appear as a result, since it doesn't contain any occurrence of k1?

On Tue, Jun 7, 2011 at 12:21 AM, Erick Erickson erickerick...@gmail.com wrote:
> I'm having a hard time understanding what you're driving at, can you
> provide some examples? This *looks* like filter queries, but I think
> you already know about those...
>
> Best
> Erick

--
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains [LON] or the addressee acknowledges the receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with X.
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).
Re: How do I make sure the resulting documents contain the query terms?
> k0 -- A | C
> k1 -- A | B
> k2 -- A | B | C
> k3 -- B | C
> Now let q=k1, how do I make sure C doesn't appear as a result since it
> doesn't contain any occurrence of k1?

Do we need to bother with that? That's what Lucene does already :)

--
View this message in context: http://lucene.472066.n3.nabble.com/How-do-I-make-sure-the-resulting-documents-contain-the-query-terms-tp3031637p3033451.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: How do I make sure the resulting documents contain the query terms?
On Tue, Jun 7, 2011 at 8:43 AM, pravesh suyalprav...@yahoo.com wrote:
> Do we need to bother with that? That's what Lucene does already :)

Lucene/Solr doesn't do that. It ranks documents based on a scoring function, and as far as I can tell it lacks a way of specifying that a particular term must appear (the closest thing I know of is boosting the term).

The solution would be a way to tell Solr/Lucene which documents/indices to query, i.e. to query only the union/intersection of the documents in which k1,...,kn appear, instead of querying all indexed documents and applying the ranking function (which gives weight to documents that contain k1,...,kn).

--
Regards,
K. Gabriele
Re: How do I make sure the resulting documents contain the query terms?
Gabriele,

Lucene uses a combination of boolean retrieval and the vector space model (VSM) for its IR. A straightforward query for a keyword will only match docs with that keyword.

Things quickly get subtle and complex the more sugar you add (more complicated queries across fields, more complex analysis chains), but I think the short answer to your question is: C will not be returned, and it will not be scored either.

lee c

On 7 June 2011 08:30, Gabriele Kahlout gabri...@mysimpatico.com wrote:
> Lucene/Solr doesn't do that. It ranks documents based on a scoring
> function, and as far as I can tell it lacks a way of specifying that a
> particular term must appear (the closest thing I know of is boosting
> the term).
>
> The solution would be a way to tell Solr/Lucene which documents/indices
> to query, i.e. to query only the union/intersection of the documents in
> which k1,...,kn appear, instead of querying all indexed documents and
> applying the ranking function (which gives weight to documents that
> contain k1,...,kn).
Re: How do I make sure the resulting documents contain the query terms?
Um, normally that would never happen because, as you say, the inverted index doesn't list doc C under term k1, since doc C doesn't contain k1. If you search on q=k1, how or why would doc C ever be in your result set?

Are you seeing it in your result set? If so, the question is _why_: something weird would have to be going on to make that happen, because it's not expected.

The result set _starts_ from only the documents that actually include the term. Boosting/relevancy ranking only affects the order in which those documents appear; there's no reason doc C should be in the result set at all for q=k1, where doc C is not indexed under k1.

On 6/7/2011 2:35 AM, Gabriele Kahlout wrote:
> Sorry for being unclear, and thank you for answering.
>
> Consider the following documents: A(k0,k1,k2), B(k1,k2,k3), and
> C(k0,k2,k3), where A, B, C are document identifiers and the ks in
> brackets are the terms each document contains. So the Solr inverted
> index should be something like:
>
> k0 -- A | C
> k1 -- A | B
> k2 -- A | B | C
> k3 -- B | C
>
> Now let q=k1. How do I make sure C doesn't appear as a result, since it
> doesn't contain any occurrence of k1?
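
As a minimal sketch of the point made in this reply (not Lucene's actual implementation), the toy index from the question can be modelled as a plain dictionary. The names `build_index` and `search` are made up for illustration. A single-term lookup starts from that term's posting list, so C is never even a candidate for q=k1, whatever the ranking does afterwards.

```python
from collections import defaultdict

# Toy documents from the question: document id -> terms it contains.
DOCS = {
    "A": ["k0", "k1", "k2"],
    "B": ["k1", "k2", "k3"],
    "C": ["k0", "k2", "k3"],
}


def build_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, term):
    """Candidates for a single-term query are exactly that term's postings."""
    return index.get(term, [])


index = build_index(DOCS)
print(search(index, "k1"))  # ['A', 'B'], C is absent
```

Any scoring or boosting would then be applied to this candidate list only, which is why it can reorder A and B but cannot bring C in.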
Re: How do I make sure the resulting documents contain the query terms?
You are right, Lucene will return results based on my scoring function implementation (Similarity class, http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/Similarity.html):

score(q,d) = coord(q,d) · queryNorm(q) · ∑_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

It can be seen that, for a single-term query, whenever tf(t in d) = 0 the whole score is 0, so as you say C will never be returned.

My issue is when the query has multiple terms (my example was too simple!), and some are 'mandatory' while others are not. In that case I should make a query that uses the + operator (http://lucene.apache.org/java/2_9_1/queryparsersyntax.html#+), e.g. q=+k1. I'm not sure I have the syntax right, but let's say k1 is mandatory and k2 and k3 are optional; then q=k2 k3 +k1.

I see that queries made through solrj are received with + in place of the space (the default operator being OR), so q=k2+k3++k1.

On Tue, Jun 7, 2011 at 5:23 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
> Um, normally that would never happen because, as you say, the inverted
> index doesn't list doc C under term k1, since doc C doesn't contain k1.
> If you search on q=k1, how or why would doc C ever be in your result
> set?
>
> Are you seeing it in your result set? If so, the question is _why_:
> something weird would have to be going on to make that happen, because
> it's not expected.
>
> The result set _starts_ from only the documents that actually include
> the term. Boosting/relevancy ranking only affects the order in which
> those documents appear; there's no reason doc C should be in the result
> set at all for q=k1, where doc C is not indexed under k1.

--
Regards,
K. Gabriele
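
A rough sketch of what q=k2 k3 +k1 means with the default OR operator, under the simplifying assumption that a document's score is just the number of query terms it matches (Lucene's real score is the Similarity formula quoted in this message). `boolean_search` and its parameters are hypothetical names, not Lucene or Solr API. Required terms decide which documents are in the result set; optional terms only influence the ranking.

```python
# Toy index from the thread: term -> set of document ids.
INDEX = {
    "k0": {"A", "C"},
    "k1": {"A", "B"},
    "k2": {"A", "B", "C"},
    "k3": {"B", "C"},
}


def boolean_search(index, required=(), optional=()):
    """Return (doc_id, score) pairs, best first.

    Assumption for illustration only: score = number of matching query terms.
    """
    if required:
        # Every required (+) term must be present: intersect the postings.
        candidates = set.intersection(*(index.get(t, set()) for t in required))
    else:
        # Without required terms, at least one optional term must match.
        candidates = set().union(*(index.get(t, set()) for t in optional))
    terms = list(required) + list(optional)
    scored = [
        (doc, sum(doc in index.get(t, set()) for t in terms))
        for doc in candidates
    ]
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# q = k2 k3 +k1  (k1 required, k2 and k3 optional)
print(boolean_search(INDEX, required=["k1"], optional=["k2", "k3"]))
# [('B', 3), ('A', 2)]
```

C matches both optional terms, yet it is excluded because it lacks the required k1. B outranks A only because it also matches k3.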
Re: How do I make sure the resulting documents contain the query terms?
Okay, if you're using a custom similarity, I'm not sure what's going on; I'm not familiar with that. But ordinarily you are right: you would require k1 with +k1.

What you say about the + being lost suggests something is going wrong. Either you are not sending your query to Solr properly escaped, or there's a bug in your custom similarity or query parser, or (not too likely) there's a bug in Solr.

My experience is with the standard query parser, the standard similarity class, and contacting Solr via HTTP. (Are you using SolrJ or HTTP?) In that case, you are responsible for URI-encoding the q parameter when you send it to Solr. So if you want to send a query like k2 k3 +k1, you need to URI-escape it first, and this is what you'd send:

q=k2+k3+%2Bk1

or, escaping spaces as %20 instead, which is actually more 'correct' with current standards:

q=k2%20k3%20%2Bk1

The important thing is that + escapes as %2B. You need to escape it before sending it to Solr via an HTTP URI query string or HTTP form post data. Yes, if you send a raw +, Solr will understand that as representing a space, not an actual +. This is because the + character is not 'safe'; it needs to be escaped.

The programming language of your choice probably already has a library function for URI-escaping values.

On 6/7/2011 11:36 AM, Gabriele Kahlout wrote:
> You are right, Lucene will return results based on my scoring function
> implementation (Similarity class,
> http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/Similarity.html):
>
> score(q,d) = coord(q,d) · queryNorm(q) · ∑_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )
>
> It can be seen that whenever tf(t in d) = 0 the whole score will be 0,
> so as you say C will never be returned.
>
> My issue is when the query has multiple terms (my example was too
> simple!), and some are 'mandatory' while others are not. In that case I
> should make a query that uses the + operator
> (http://lucene.apache.org/java/2_9_1/queryparsersyntax.html#+), e.g.
> q=+k1. I'm not sure I have the syntax right, but let's say k1 is
> mandatory and k2 and k3 are optional; then q=k2 k3 +k1.
>
> I see that queries made through solrj are received with + in place of
> the space (the default operator being OR), so q=k2+k3++k1.
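
To illustrate the escaping described in this reply, here is a small sketch using Python's standard `urllib.parse` module; the choice of Python is an assumption, since the thread does not say which client language is in use. It shows both encodings of k2 k3 +k1 mentioned in the reply, and what form-style decoding makes of a query string in which the + was sent raw.

```python
from urllib.parse import quote, quote_plus, unquote_plus

query = "k2 k3 +k1"  # k1 required, k2 and k3 optional

# Percent-encoding: spaces become %20 and the literal + becomes %2B.
encoded_percent = quote(query, safe="")
print(encoded_percent)  # k2%20k3%20%2Bk1

# Form-style encoding: spaces become + and the literal + becomes %2B.
encoded_form = quote_plus(query)
print(encoded_form)  # k2+k3+%2Bk1

# Form-style decoding of the unescaped string turns every + into a space,
# so the required-term operator is lost.
decoded_raw = unquote_plus("k2+k3++k1")
print(repr(decoded_raw))  # 'k2 k3  k1'
```

Both encoded forms decode back to the original query, whereas the unescaped string decodes to three plain optional terms.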
How do I make sure the resulting documents contain the query terms?
Hello,

I've seen that through boosting it's possible to influence the scoring function, but what I would like is more of a boolean property. In a way, I want to search only the documents indexed under a given keyword (or the intersection/union for several keywords) rather than the whole set. Is this supported in any way?

--
Regards,
K. Gabriele
Re: How do I make sure the resulting documents contain the query terms?
I'm having a hard time understanding what you're driving at, can you provide some examples? This *looks* like filter queries, but I think you already know about those...

Best
Erick

On Mon, Jun 6, 2011 at 4:00 PM, Gabriele Kahlout gabri...@mysimpatico.com wrote:
> Hello,
>
> I've seen that through boosting it's possible to influence the scoring
> function, but what I would like is sort of a boolean property. In some
> way it's to search only the indexed documents by that keyword (or the
> intersection/union) rather than the whole set. Is this supported in any
> way?