Re: Exact match not the first result returned

2011-07-29 Thread Brian Lamb
I implemented both solutions Hoss suggested and was able to achieve the
desired results. I would like to go with

 defType=dismax  qf=myname  pf=myname_str^100  q=Frank

but that doesn't seem to work if I have a query like myname:Frank
otherfield:something. So I think I will go with

q=+myname:Frank myname_str:Frank^100

Thanks for the help everyone!

Brian Lamb

On Wed, Jul 27, 2011 at 10:55 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:


 : With your solution, RECORD 1 does appear at the top but I think that's just
 : blind luck more than anything else because RECORD 3 shows as having the same
 : score. So what more can I do to push RECORD 1 up to the top? Ideally, I'd
 : like all three records returned with RECORD 1 being the first listing.

 with omitNorms RECORD1 and RECORD3 have the same score because only the
 tf() matters, and both docs contain the term Fred exactly twice.

 the reason RECORD1 isn't scoring higher even though it (as you put it)
 matches 'Fred' exactly is that, from a term perspective, RECORD1
 doesn't actually match myname:Fred exactly, because there are in fact
 other terms in that field because it's multivalued.

 one way to indicate that you *only* want documents where entire field
 values match your input (ie: RECORD1 but no other records) would be to
 use a StrField instead of a TextField, or an analyzer that doesn't split up
 tokens (ie: something using KeywordTokenizer).  that way a query on
 myname:Frank would not match a document where you had indexed the value
 "Frank Stalone" but a query for myname:"Frank Stalone" would.

 in your case, you don't want *only* the exact field value matches, but you
 want them boosted, so you could do something like copyField myname into
 myname_str and then do...

  q=+myname:Frank myname_str:Frank^100

 ...in which case a match on myname is required, but a match on
 myname_str will greatly increase the score.

 dismax (and edismax) are really designed for situations like this...

  defType=dismax  qf=myname  pf=myname_str^100  q=Frank



 -Hoss



Re: Exact match not the first result returned

2011-07-28 Thread Brian Lamb
That's a clever idea. I'll put something together and see how it turns out.
Thanks for the tip.




Re: Exact match not the first result returned

2011-07-28 Thread Jonathan Rochkind
Keep in mind that if you use a field type whose indexed values can contain
spaces (eg StrField, or one built on KeywordTokenizer), then if you're using
dismax or lucene query parsers, the only way to find matches in this field on
queries that include spaces will be to do explicit phrase searches with
double quotes.


These fields will, however, work fine with pf in dismax/edismax as per 
Hoss's example.


But yeah, I do what Hoss recommends -- I've got a KeywordTokenizer copy
of my searchable field. I use a pf on that field with a very high boost
to try and boost truly complete matches, that is, ones that match the
entirety of the value.  It's not exactly 'exact': I still do some
normalization, including flattening unicode to ascii, and normalizing one
or more space-or-punctuation characters to exactly one space using a char
regex filter.


It seems to pretty much work -- this is just one of various relevancy 
tweaks I've got going on, to the extent that my relevancy has become 
pretty complicated and hard to predict and doesn't always do what I'd 
expect/intend, but this particular aspect seems to mostly pretty much work.
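
A minimal sketch of the kind of normalization described above, written in Python rather than as a Solr analyzer chain. The function names are hypothetical, and the lowercasing step is an assumption that the message does not mention; it only illustrates what "flatten to ASCII, collapse space-or-punctuation runs to one space, then compare whole values" could look like.

```python
import re
import unicodedata


def normalize_whole_value(text):
    """Normalize a field value or query for whole-value comparison."""
    # Decompose accented characters, then drop anything outside ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse each run of whitespace and/or punctuation into one space.
    collapsed = re.sub(r"[\W_]+", " ", ascii_text)
    # Assumption: case-insensitive matching is wanted.
    return collapsed.strip().lower()


def is_complete_match(query, value):
    """True when the query matches the entire value after normalization."""
    return normalize_whole_value(query) == normalize_whole_value(value)


if __name__ == "__main__":
    print(is_complete_match("fred", "Fred"))
    print(is_complete_match("fred", "Fred (the coolest guy in town)"))
```

In Solr itself the same effect would come from the analyzer configured on the copied field, not from application code.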





Re: Exact match not the first result returned

2011-07-27 Thread Brian Lamb
Thanks Emmanuel for that explanation. I implemented your solution but I'm
not quite there yet. Suppose I also have a record:

RECORD 3
<arr name="myname">
  <str>Fred G. Anderson</str>
  <str>Fred Anderson</str>
</arr>

With your solution, RECORD 1 does appear at the top but I think that's just
blind luck more than anything else because RECORD 3 shows as having the same
score. So what more can I do to push RECORD 1 up to the top? Ideally, I'd
like all three records returned with RECORD 1 being the first listing.

Thanks,

Brian Lamb

On Tue, Jul 26, 2011 at 6:03 PM, Emmanuel Espina
espinaemman...@gmail.com wrote:

 That is caused by the size of the documents. The principle is pretty
 intuitive: if one of your documents is the entire three volumes of The Lord
 of the Rings, and you search for "tree", I know that The Lord of the Rings
 will be in the results, and I haven't memorized the entire text of that
 book :p
 It is a matter of probability: in a big (big!) text any word has a greater
 chance of being found than in a short letter. So when both match, one can
 infer that the letter is more relevant than the big text. That is the
 principle applied here, and Lucene does that when building the ranking.
 The first document is bigger than the second one (remember that all the
 values of a multivalued field are merged into one field in the index, so
 you cannot tell one value apart from another). In the first one you have
 [Fred, coolest, guy, town] and in the second [Fred, Anderson], so the
 second document is scored as more relevant than the first one.

 To avoid this length normalization you can set omitNorms to true, and that
 should make the first document more relevant because Fred appears twice
 (not because Fred appears alone in a value).

 Regards
 Emmanuel

 2011/7/26 Brian Lamb brian.l...@journalexperts.com

  Hi all,
 
  I am a little confused as to why the scoring is working the way it is:
 
  I have a field defined as:
 
  <field name="myname" type="text" indexed="true" stored="true"
         required="false" multiValued="true" />
 
  And I have several documents where that value is:
 
  RECORD 1
  <arr name="myname">
   <str>Fred</str>
   <str>Fred (the coolest guy in town)</str>
  </arr>
 
  OR
 
  RECORD 2
  <arr name="myname">
   <str>Fred Anderson</str>
  </arr>
 
  When I do a search for
  http://localhost:8983/solr/search/?q=myname:Fred I get RECORD 2
  returned before RECORD 1.
 
  RECORD 2
  5.282213 = (MATCH) fieldWeight(myname:Fred in 256575), product of:
   1.0 = tf(termFreq(myname:Fred)=1)
   8.451541 = idf(docFreq=7306, maxDocs=12586425)
   0.625 = fieldNorm(field=myname, doc=256575)
 
  RECORD 1
  4.482106 = (MATCH) fieldWeight(myname:Fred in 215), product of:
   1.4142135 = tf(termFreq(myname:Fred)=2)
   8.451541 = idf(docFreq=7306, maxDocs=12586425)
   0.375 = fieldNorm(field=myname, doc=215)
 
  So the difference is fieldNorm obviously but I think that's only part
  of the story. Why is RECORD 2 returned with a higher score than RECORD
  1 even though RECORD 1 matches Fred exactly? And how should I do
  this differently so that I am getting the results I am expecting?
 
  Thanks,
 
  Brian Lamb
 



Re: Exact match not the first result returned

2011-07-27 Thread Chris Hostetter

: With your solution, RECORD 1 does appear at the top but I think that's just
: blind luck more than anything else because RECORD 3 shows as having the same
: score. So what more can I do to push RECORD 1 up to the top? Ideally, I'd
: like all three records returned with RECORD 1 being the first listing.

with omitNorms RECORD1 and RECORD3 have the same score because only the
tf() matters, and both docs contain the term Fred exactly twice.

the reason RECORD1 isn't scoring higher even though it (as you put it)
matches 'Fred' exactly is that, from a term perspective, RECORD1
doesn't actually match myname:Fred exactly, because there are in fact
other terms in that field because it's multivalued.

one way to indicate that you *only* want documents where entire field
values match your input (ie: RECORD1 but no other records) would be to
use a StrField instead of a TextField, or an analyzer that doesn't split up
tokens (ie: something using KeywordTokenizer).  that way a query on
myname:Frank would not match a document where you had indexed the value
"Frank Stalone" but a query for myname:"Frank Stalone" would.

in your case, you don't want *only* the exact field value matches, but you 
want them boosted, so you could do something like copyField myname into 
myname_str and then do...

  q=+myname:Frank myname_str:Frank^100

...in which case a match on myname is required, but a match on 
myname_str will greatly increase the score.

dismax (and edismax) are really designed for situations like this...

  defType=dismax  qf=myname  pf=myname_str^100  q=Frank



-Hoss
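
For illustration, here is a small Python sketch that builds request URLs for the two query styles above. The base URL is the one quoted elsewhere in this thread; the helper names are hypothetical, and the sketch assumes a single-word search term (a multi-word value would need phrase quoting, which is not handled here). It only assembles the parameters and does not contact Solr.

```python
from urllib.parse import urlencode

# Base URL as it appears in the original question; adjust for your setup.
SOLR_SEARCH = "http://localhost:8983/solr/search/"


def lucene_params(term):
    """Required match on myname, optional boosted match on myname_str."""
    return {"q": f"+myname:{term} myname_str:{term}^100"}


def dismax_params(term):
    """Same idea with dismax: qf must match, pf boosts the exact-value copy."""
    return {
        "defType": "dismax",
        "qf": "myname",
        "pf": "myname_str^100",
        "q": term,
    }


def build_url(params):
    # urlencode percent-escapes '+', ':' and '^' so they reach Solr intact.
    return SOLR_SEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_url(lucene_params("Frank")))
    print(build_url(dismax_params("Frank")))
```

Both variants assume the schema copies myname into a myname_str field, as suggested above.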


Exact match not the first result returned

2011-07-26 Thread Brian Lamb
Hi all,

I am a little confused as to why the scoring is working the way it is:

I have a field defined as:

<field name="myname" type="text" indexed="true" stored="true"
       required="false" multiValued="true" />

And I have several documents where that value is:

RECORD 1
<arr name="myname">
  <str>Fred</str>
  <str>Fred (the coolest guy in town)</str>
</arr>

OR

RECORD 2
<arr name="myname">
  <str>Fred Anderson</str>
</arr>

When I do a search for
http://localhost:8983/solr/search/?q=myname:Fred I get RECORD 2
returned before RECORD 1.

RECORD 2
5.282213 = (MATCH) fieldWeight(myname:Fred in 256575), product of:
  1.0 = tf(termFreq(myname:Fred)=1)
  8.451541 = idf(docFreq=7306, maxDocs=12586425)
  0.625 = fieldNorm(field=myname, doc=256575)

RECORD 1
4.482106 = (MATCH) fieldWeight(myname:Fred in 215), product of:
  1.4142135 = tf(termFreq(myname:Fred)=2)
  8.451541 = idf(docFreq=7306, maxDocs=12586425)
  0.375 = fieldNorm(field=myname, doc=215)

So the difference is fieldNorm obviously but I think that's only part
of the story. Why is RECORD 2 returned with a higher score than RECORD
1 even though RECORD 1 matches Fred exactly? And how should I do
this differently so that I am getting the results I am expecting?

Thanks,

Brian Lamb
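
The two explain outputs above can be checked by hand: each score is the product of the three factors listed, with tf being the square root of the term frequency. The Python sketch below redoes that arithmetic with the numbers from the explain output. The function name is made up for illustration, and the last two lines assume that a field without norms contributes a fieldNorm of 1.0.

```python
import math

IDF = 8.451541  # idf(docFreq=7306, maxDocs=12586425), copied from the explain output


def field_weight(term_freq, idf, field_norm):
    """fieldWeight = tf * idf * fieldNorm, with tf = sqrt(termFreq)."""
    return math.sqrt(term_freq) * idf * field_norm


record2 = field_weight(1, IDF, 0.625)  # "Fred Anderson"
record1 = field_weight(2, IDF, 0.375)  # "Fred" + "Fred (the coolest guy in town)"

# Assumption: with norms omitted the fieldNorm factor is effectively 1.0,
# so only tf separates the two documents.
record2_no_norms = field_weight(1, IDF, 1.0)
record1_no_norms = field_weight(2, IDF, 1.0)

if __name__ == "__main__":
    print(f"with norms:    RECORD 2 = {record2:.6f}, RECORD 1 = {record1:.6f}")
    print(f"without norms: RECORD 2 = {record2_no_norms:.6f}, RECORD 1 = {record1_no_norms:.6f}")
```

With the norms in place the smaller fieldNorm of RECORD 1 outweighs its higher tf, which is why RECORD 2 comes first.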


Re: Exact match not the first result returned

2011-07-26 Thread Emmanuel Espina
That is caused by the size of the documents. The principle is pretty
intuitive: if one of your documents is the entire three volumes of The Lord
of the Rings, and you search for "tree", I know that The Lord of the Rings
will be in the results, and I haven't memorized the entire text of that book
:p
It is a matter of probability: in a big (big!) text any word has a greater
chance of being found than in a short letter. So when both match, one can
infer that the letter is more relevant than the big text. That is the
principle applied here, and Lucene does that when building the ranking.
The first document is bigger than the second one (remember that all the
values of a multivalued field are merged into one field in the index, so
you cannot tell one value apart from another). In the first one you have
[Fred, coolest, guy, town] and in the second [Fred, Anderson], so the
second document is scored as more relevant than the first one.

To avoid this length normalization you can set omitNorms to true, and that
should make the first document more relevant because Fred appears twice
(not because Fred appears alone in a value).

Regards
Emmanuel
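
As a rough sketch of the length normalization described above: the classic Lucene similarity uses 1/sqrt(number of terms in the field), computed over all values of a multivalued field together. The tokenizer below is a crude stand-in for the real "text" analyzer, and keeping stopwords such as "the" and "in" is an assumption, so the token counts are illustrative only. The stored norm is also encoded lossily in a single byte, which is why the explain output shows rounded values such as 0.625 and 0.375 instead of the raw results computed here.

```python
import math
import re


def tokens(values):
    """Crude stand-in for the analyzer: lowercase, split on non-word characters.

    Assumption: no stopword removal or stemming; the real field type may differ.
    All values of the multivalued field end up in one token list.
    """
    return [tok for value in values for tok in re.findall(r"\w+", value.lower())]


def length_norm(values):
    """Shorter fields get a larger norm: 1 / sqrt(number of terms)."""
    return 1.0 / math.sqrt(len(tokens(values)))


record1 = ["Fred", "Fred (the coolest guy in town)"]
record2 = ["Fred Anderson"]

if __name__ == "__main__":
    for name, values in (("RECORD 1", record1), ("RECORD 2", record2)):
        print(name, len(tokens(values)), "terms, raw norm", round(length_norm(values), 3))
```

Setting omitNorms to true removes this factor from the score, leaving term frequency to decide between the two records.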
