Re: How do I make sure the resulting documents contain the query terms?

2011-06-07 Thread Gabriele Kahlout
Sorry for being unclear, and thank you for answering.
Consider the following documents A(k0,k1,k2), B(k1,k2,k3), and C(k0,k2,k3),
where A, B, C are document identifiers and the ks in brackets after each are
the terms that document contains.
So the Solr inverted index should be something like:

k0 -- A | C
k1 -- A | B
k2 -- A | B | C
k3 -- B | C

Now let q=k1, how do I make sure C doesn't appear as a result, since it
doesn't contain any occurrence of k1?

-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains [LON] or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with X.
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).


Re: How do I make sure the resulting documents contain the query terms?

2011-06-07 Thread pravesh
k0 -- A | C
k1 -- A | B
k2 -- A | B | C
k3 -- B | C 
Now let q=k1, how do I make sure C doesn't appear as a result since it
doesn't contain any occurrence of k1?

Do we need to bother doing that? That's what Lucene already does :)

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-do-I-make-sure-the-resulting-documents-contain-the-query-terms-tp3031637p3033451.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How do I make sure the resulting documents contain the query terms?

2011-06-07 Thread Gabriele Kahlout
On Tue, Jun 7, 2011 at 8:43 AM, pravesh suyalprav...@yahoo.com wrote:

 k0 -- A | C
 k1 -- A | B
 k2 -- A | B | C
 k3 -- B | C
 Now let q=k1, how do I make sure C doesn't appear as a result since it
 doesn't contain any occurrence of k1?
 Do we need to bother doing that? That's what Lucene already does :)

Lucene/Solr doesn't do that: it ranks documents based on a scoring
function, and with that it lacks the possibility of specifying that a
particular term must appear (the closest way I know of is boosting it).

The solution would be a way to tell Solr/Lucene which documents/indices to
query, i.e. query only the union/intersection of the documents in which
k1,...,kn appear, instead of querying all indexed documents and applying the
ranking function (which will give weight to documents that contain
k1,...,kn).





Re: How do I make sure the resulting documents contain the query terms?

2011-06-07 Thread lee carroll
Gabriele,
Lucene uses a combination of the Boolean model and the vector space model
(VSM) for its IR.

A straightforward query for a keyword will only match docs with that keyword.

Things quickly get subtle and complex the more sugar you add (more
complicated queries across fields, more complex analysis chains), but I
think the short answer to your question is that C will not be returned, and
it will not be scored either.

lee c




Re: How do I make sure the resulting documents contain the query terms?

2011-06-07 Thread Jonathan Rochkind
Um, normally that would never happen, because, well, like you say, the 
inverted index doesn't have docC for term K1, because doc C didn't 
include term K1.


If you search on q=K1, then how/why would docC ever be in your result 
set?  Are you seeing it in your result set? The question then would be 
_why_, what weird thing is going on to make that happen,  that's not 
expected.


The result set _starts_ from only the documents that actually include
the term.  Boosting/relevancy ranking only affects the order in which these
documents appear, but there's no reason document C should be in the
result set at all in your case of q=k1, where doc C is not indexed under k1.
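
A minimal sketch of that lookup, assuming a toy in-memory index built from the three example documents in this thread. The names `DOCS`, `build_inverted_index` and `search_term` are hypothetical and only illustrate the idea; this is not how Lucene stores or reads its postings.

```python
from collections import defaultdict

# Toy corpus from this thread: document id -> terms it contains.
DOCS = {
    "A": ["k0", "k1", "k2"],
    "B": ["k1", "k2", "k3"],
    "C": ["k0", "k2", "k3"],
}


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def search_term(index, term):
    """Return the candidate documents for a single-term query, sorted by id."""
    # Only the posting list of the query term is consulted, so a document
    # that lacks the term never becomes a candidate in the first place.
    return sorted(index.get(term, set()))


INDEX = build_inverted_index(DOCS)
print(search_term(INDEX, "k1"))  # ['A', 'B']
```

In this sketch document C is never considered for q=k1, so ranking has nothing to say about it.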




Re: How do I make sure the resulting documents contain the query terms?

2011-06-07 Thread Gabriele Kahlout
You are right, Lucene returns results based on my scoring function
implementation (the Similarity class,
http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/Similarity.html):

$$\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \left( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \right)$$

It can be seen that for a single-term query such as q=k1, whenever
tf(t in d) = 0 the whole score will be 0, so as you say C will never be
returned.
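
To make the formula concrete, here is a simplified sketch over the three example documents. It assumes coord, queryNorm, the term boost and norm are all 1, and it uses tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)), which is how I understand the default Similarity in Lucene 3.x to define them; the function names are hypothetical and a custom Similarity may well differ.

```python
import math

# Toy corpus from this thread: document id -> terms it contains.
DOCS = {
    "A": ["k0", "k1", "k2"],
    "B": ["k1", "k2", "k3"],
    "C": ["k0", "k2", "k3"],
}


def tf(freq):
    """Term-frequency factor (assumed default: square root of the raw count)."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency (assumed default: 1 + ln(N / (df + 1)))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def simplified_score(query_terms, doc_id, docs=DOCS):
    """Sum tf * idf^2 over the query terms; all other factors are taken as 1."""
    num_docs = len(docs)
    total = 0.0
    for term in query_terms:
        freq = docs[doc_id].count(term)
        doc_freq = sum(1 for terms in docs.values() if term in terms)
        total += tf(freq) * idf(doc_freq, num_docs) ** 2
    return total


print(simplified_score(["k1"], "A"))        # 1.0
print(simplified_score(["k1"], "C"))        # 0.0
print(simplified_score(["k1", "k2"], "C"))  # positive: k2 still contributes
```

Note the last line: with more than one query term, a zero tf only zeroes that term's summand, not the whole score.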

My issue is when the query has multiple terms (my example was too simple!),
and some are 'mandatory' while others are not. In that case I should make a
query that uses the + operator
(http://lucene.apache.org/java/2_9_1/queryparsersyntax.html#+), e.g. q=+k1.
I'm unsure I'll get the syntax right, but let's say k1 is mandatory and
k2 and k3 are optional, then q=k2 k3 +k1. I see that queries made through
SolrJ are received with + in place of the space (which defaults to OR), so
q=k2+k3++k1.
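
A rough sketch of what I expect q=k2 k3 +k1 to match, assuming the default OR operator and ignoring scoring and prohibited (-) terms. `POSTINGS` and `match` are hypothetical names for this toy model, not Lucene or Solr APIs.

```python
# Toy postings from this thread: term -> documents containing it.
POSTINGS = {
    "k0": {"A", "C"},
    "k1": {"A", "B"},
    "k2": {"A", "B", "C"},
    "k3": {"B", "C"},
}


def match(required, optional, postings=POSTINGS):
    """Return ids of documents matching required (+) and optional terms.

    Simplified model: with at least one required term, a document must
    contain all of them, and optional terms do not change membership
    (they would only influence ranking). With no required terms, a
    document must contain at least one optional term.
    """
    if required:
        sets = [postings.get(term, set()) for term in required]
        return set.intersection(*sets)
    result = set()
    for term in optional:
        result |= postings.get(term, set())
    return result


# q = k2 k3 +k1: only documents containing k1 qualify.
print(sorted(match(required=["k1"], optional=["k2", "k3"])))  # ['A', 'B']
# q = k2 k3: any document containing k2 or k3 qualifies.
print(sorted(match(required=[], optional=["k2", "k3"])))      # ['A', 'B', 'C']
```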





Re: How do I make sure the resulting documents contain the query terms?

2011-06-07 Thread Jonathan Rochkind
Okay, if you're using a custom similarity, I'm not sure what's going on;
I'm not familiar with that.


But ordinarily, you are right, you would require k1 with +k1.

What you say about the + being lost suggests something is going wrong. 
Either you are not sending your query to Solr properly escaped, or 
there's a bug in your custom similarity or query parser, or (not too 
likely) there's a bug in Solr.


My experience is using the standard query parser, standard similarity 
class, and contacting Solr via HTTP. (are you using SolrJ or HTTP?).  In 
that case, when you send the q to Solr, you are responsible for 
URI-encoding it when you send it.  So if you want to send a query like 
k2 k3 +k1, you need to URI-escape it first, and this is what you'd send:


q=k2+k3+%2Bk1

or, escaping spaces as %20 instead, which is actually more 'correct' 
with current standards:


q=k2%20k3%20%2Bk1

The important thing is that + escapes as %2B.  You need to escape it
before sending it to Solr via an HTTP URI query string or HTTP form post
data. Yes, if you send a raw +, Solr will understand that as
representing a space, not an actual +.  This is because the +
character is not 'safe' in a query string; it needs to be escaped.  The
programming language of your choice probably already has a library
function for URI-escaping values.
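
As one example of such a library function, here is a small sketch using Python's standard `urllib.parse`. The variable names are made up for illustration, and the last step only mimics form-style decoding of the query string; it is not Solr's own code.

```python
from urllib.parse import quote, quote_plus, unquote_plus

raw_query = "k2 k3 +k1"

# quote_plus encodes spaces as '+' and a literal '+' as '%2B'.
encoded_plus = quote_plus(raw_query)
print("q=" + encoded_plus)  # q=k2+k3+%2Bk1

# quote with safe="" encodes spaces as '%20' and a literal '+' as '%2B'.
encoded_pct = quote(raw_query, safe="")
print("q=" + encoded_pct)   # q=k2%20k3%20%2Bk1

# Form-style decoding of an unescaped query: every raw '+' becomes a space,
# so the "required" marker in front of k1 is lost.
decoded_raw = unquote_plus("k2+k3++k1")
print(repr(decoded_raw))    # 'k2 k3  k1'

# Decoding the properly escaped form gives back the intended query.
print(unquote_plus(encoded_plus))  # k2 k3 +k1
```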







How do I make sure the resulting documents contain the query terms?

2011-06-06 Thread Gabriele Kahlout
Hello,

I've seen that through boosting it's possible to influence the scoring
function, but what I would like is a sort of boolean property. In a way,
I want to search only the documents indexed under a given keyword (or the
intersection/union for several keywords) rather than the whole set.
Is this supported in any way?




Re: How do I make sure the resulting documents contain the query terms?

2011-06-06 Thread Erick Erickson
I'm having a hard time understanding what you're driving at. Can
you provide some examples? This *looks* like filter queries,
but I think you already know about those...

Best
Erick
