Re: Exact match not the first result returned
I implemented both solutions Hoss suggested and was able to achieve the desired results. I would like to go with

    defType=dismax qf=myname pf=myname_str^100 q=Frank

but that doesn't seem to work if I have a query like myname:Frank otherfield:something. So I think I will go with

    q=+myname:Frank myname_str:Frank^100

Thanks for the help everyone!

Brian Lamb

In reply to Chris Hostetter (hossman_luc...@fucit.org), Wed, Jul 27, 2011 at 10:55 PM.
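For anyone scripting this, the two request styles discussed in this thread could be assembled roughly as follows. This is only a sketch: the function names are made up for illustration, the base URL simply reuses the localhost address from the original question, and terms are passed through without any query-syntax escaping. Only the parameter names and values (q, defType, qf, pf, the ^100 boost) come from the thread.

```python
from urllib.parse import urlencode

# Hypothetical base URL, copied from the address used in the original question.
BASE_URL = "http://localhost:8983/solr/search/"


def lucene_boost_url(field, exact_field, term, boost=100, extra_clauses=()):
    """Explicit form: a required match on `field`, plus an optional,
    heavily boosted match on the untokenized copy `exact_field`."""
    clauses = [f"+{field}:{term}", f"{exact_field}:{term}^{boost}"]
    clauses.extend(extra_clauses)  # e.g. "otherfield:something"
    return BASE_URL + "?" + urlencode({"q": " ".join(clauses)})


def dismax_boost_url(field, exact_field, term, boost=100):
    """dismax form: search `field` via qf, boost whole-value matches via pf."""
    params = {
        "defType": "dismax",
        "qf": field,
        "pf": f"{exact_field}^{boost}",
        "q": term,
    }
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(lucene_boost_url("myname", "myname_str", "Frank",
                           extra_clauses=["otherfield:something"]))
    print(dismax_boost_url("myname", "myname_str", "Frank"))
```

The explicit form is the one that leaves room for additional fielded clauses, which is the reason given above for preferring it.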
Re: Exact match not the first result returned
That's a clever idea. I'll put something together and see how it turns out. Thanks for the tip.

In reply to Chris Hostetter (hossman_luc...@fucit.org), Wed, Jul 27, 2011 at 10:55 PM.
Re: Exact match not the first result returned
Keep in mind that if you use a field type that keeps spaces inside a single token (e.g. StrField, or a field using KeywordTokenizer), then with the dismax or lucene query parsers the only way to find matches in this field on queries that include spaces is to do explicit phrase searches with double quotes. These fields will, however, work fine with pf in dismax/edismax, as per Hoss's example.

But yeah, I do what Hoss recommends: I've got a KeywordTokenizer copy of my searchable field, and I use a pf on that field with a very high boost to try to boost truly complete matches, ones that match the entirety of the value. It's not exactly 'exact'; I still do some normalization, including flattening Unicode to ASCII and normalizing one or more space-or-punctuation characters to exactly one space using a char regex filter.

It seems to pretty much work. This is just one of various relevancy tweaks I've got going on, to the extent that my relevancy has become pretty complicated and hard to predict and doesn't always do what I'd expect or intend, but this particular aspect seems to mostly work.

In reply to Chris Hostetter, 7/27/2011 10:55 PM.
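To make the normalization described above concrete, here is a rough Python approximation of what such an analysis chain might do to a value before it is indexed as a single token. It is a sketch under assumptions, not the actual Solr configuration: the function name is invented, the exact regex is a guess at "space-or-punctuation", and the final lowercasing step is an extra assumption that the message does not mention.

```python
import re
import unicodedata


def normalize_exact(value):
    """Approximate a 'near exact' normalization for a whole field value."""
    # Flatten Unicode to ASCII: decompose accented characters, then drop
    # anything that has no ASCII representation (e.g. combining accents).
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse each run of whitespace and/or punctuation into one space.
    collapsed = re.sub(r"[\W_]+", " ", ascii_text)
    # Assumption (not stated in the thread): trim and lowercase as well.
    return collapsed.strip().lower()


if __name__ == "__main__":
    print(normalize_exact("Fred (the coolest guy in town)"))
    print(normalize_exact("Frédéric,  Chopin!"))
```

Applying the same normalization at index time and query time is what lets a phrase boost on the single-token field still fire when the user's input differs only in accents, spacing, or punctuation.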
Re: Exact match not the first result returned
Thanks Emmanuel for that explanation. I implemented your solution but I'm not quite there yet. Suppose I also have a record:

RECORD 3
<arr name="myname">
  <str>Fred G. Anderson</str>
  <str>Fred Anderson</str>
</arr>

With your solution, RECORD 1 does appear at the top, but I think that's just blind luck more than anything else, because RECORD 3 shows as having the same score. So what more can I do to push RECORD 1 up to the top? Ideally, I'd like all three records returned with RECORD 1 being the first listing.

Thanks,
Brian Lamb

In reply to Emmanuel Espina (espinaemman...@gmail.com), Tue, Jul 26, 2011 at 6:03 PM.
Re: Exact match not the first result returned
: With your solution, RECORD 1 does appear at the top but I think that's just
: blind luck more than anything else because RECORD 3 shows as having the same
: score. So what more can I do to push RECORD 1 up to the top. Ideally, I'd
: like all three records returned with RECORD 1 being the first listing.

with omitNorms, RECORD1 and RECORD3 have the same score because only the tf() matters, and both docs contain the term "fred" exactly twice.

the reason RECORD1 isn't scoring higher even though it contains (as you put it) a value matching 'Fred' exactly is that from a term perspective, RECORD1 doesn't actually match myname:Fred exactly, because there are in fact other terms in that field, since it's multivalued.

one way to indicate that you *only* want documents where entire field values match your input (ie: RECORD1 but no other records) would be to use a StrField instead of a TextField, or an analyzer that doesn't split up tokens (ie: something using KeywordTokenizer). that way a query on myname:Frank would not match a document where you had indexed the value "Frank Stalone", but a query for myname:"Frank Stalone" would.

in your case, you don't want *only* the exact field value matches, but you want them boosted, so you could do something like copyField myname into myname_str and then do...

    q=+myname:Frank myname_str:Frank^100

...in which case a match on myname is required, but a match on myname_str will greatly increase the score.

dismax (and edismax) are really designed for situations like this...

    defType=dismax qf=myname pf=myname_str^100 q=Frank

-Hoss
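The effect of the required clause plus the boosted whole-value clause can be illustrated with a toy model over the three records from this thread. To be clear, this is not Lucene's scoring: the score here is just a term count plus a flat bonus, the tokenizer is a simple regex, and every function name is invented for the example. It only shows why a whole-value match separates RECORD 1 from RECORD 3 when the term counts tie.

```python
import re

# Field values of the three records discussed in the thread.
RECORDS = {
    "RECORD 1": ["Fred", "Fred (the coolest guy in town)"],
    "RECORD 2": ["Fred Anderson"],
    "RECORD 3": ["Fred G. Anderson", "Fred Anderson"],
}


def tokens(values):
    """Crude stand-in for a text analyzer: lowercase word tokens of all values."""
    return [t.lower() for v in values for t in re.findall(r"\w+", v)]


def toy_score(values, term, boost=100):
    """Mimic '+myname:term myname_str:term^boost' with a made-up score.

    Returns None when the required tokenized clause does not match.
    """
    term_count = tokens(values).count(term.lower())
    if term_count == 0:
        return None  # required clause failed, document is not returned
    # The untokenized copy only matches when an entire value equals the term.
    exact = any(v == term for v in values)
    return term_count + (boost if exact else 0)


def rank(term):
    """Return the names of matching records, best toy score first."""
    scored = {name: toy_score(vals, term) for name, vals in RECORDS.items()}
    matched = {n: s for n, s in scored.items() if s is not None}
    return sorted(matched, key=lambda n: matched[n], reverse=True)


if __name__ == "__main__":
    print(rank("Fred"))
```

All three records are still returned because the boosted clause is optional; it only changes the ordering.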
Exact match not the first result returned
Hi all,

I am a little confused as to why the scoring is working the way it is. I have a field defined as:

<field name="myname" type="text" indexed="true" stored="true" required="false" multiValued="true" />

And I have several documents where that value is:

RECORD 1
<arr name="myname">
  <str>Fred</str>
  <str>Fred (the coolest guy in town)</str>
</arr>

OR

RECORD 2
<arr name="myname">
  <str>Fred Anderson</str>
</arr>

When I do a search for

http://localhost:8983/solr/search/?q=myname:Fred

I get RECORD 2 returned before RECORD 1.

RECORD 2
5.282213 = (MATCH) fieldWeight(myname:Fred in 256575), product of:
  1.0 = tf(termFreq(myname:Fred)=1)
  8.451541 = idf(docFreq=7306, maxDocs=12586425)
  0.625 = fieldNorm(field=myname, doc=256575)

RECORD 1
4.482106 = (MATCH) fieldWeight(myname:Fred in 215), product of:
  1.4142135 = tf(termFreq(myname:Fred)=2)
  8.451541 = idf(docFreq=7306, maxDocs=12586425)
  0.375 = fieldNorm(field=myname, doc=215)

So the difference is obviously fieldNorm, but I think that's only part of the story. Why is RECORD 2 returned with a higher score than RECORD 1 even though RECORD 1 matches Fred exactly? And how should I do this differently so that I get the results I am expecting?

Thanks,
Brian Lamb
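The two explain outputs above are plain products of three factors, which can be checked by hand. The snippet below redoes that arithmetic using only the numbers printed in the explain output; the function name is made up, and taking tf as the square root of the term frequency is inferred from the 1.0 and 1.4142135 values shown.

```python
import math


def field_weight(term_freq, idf, field_norm):
    """fieldWeight as printed by explain: tf * idf * fieldNorm."""
    tf = math.sqrt(term_freq)  # matches tf=1.0 for freq 1, 1.4142135 for freq 2
    return tf * idf * field_norm


IDF = 8.451541  # idf(docFreq=7306, maxDocs=12586425), copied from the output

record2 = field_weight(1, IDF, 0.625)
record1 = field_weight(2, IDF, 0.375)

if __name__ == "__main__":
    print(f"RECORD 2: {record2:.6f}")
    print(f"RECORD 1: {record1:.6f}")
    print("RECORD 2 ranks first:", record2 > record1)
```

RECORD 1's higher tf (about 1.41 versus 1.0) is not enough to offset its lower fieldNorm (0.375 versus 0.625), so RECORD 2 comes out ahead.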
Re: Exact match not the first result returned
That is caused by the size of the documents. The principle is pretty intuitive: if one of your documents is the entire three volumes of The Lord of the Rings and you search for "tree", I know that The Lord of the Rings will be in the results, and I haven't memorized the entire text of that book :p

It is a matter of probability: in a big (big!) text, any word has a greater chance of being found than in a short letter. So if the word is found in both, one can infer that the letter is more relevant than the big text. That is the principle applied here, and Lucene does that when building the ranking.

The first document is bigger than the second one (remember that all the values of a multivalued field are merged into one field in the index, so you cannot tell one value apart from another). In the first one you have [Fred, Fred, coolest, guy, town] and in the second [Fred, Anderson], so the second document is more relevant than the first one.

To avoid all this, you can set omitNorms to true, and that should make the first document more relevant because Fred appears twice (not because Fred appears alone in a value).

Regards,
Emmanuel

In reply to Brian Lamb (brian.l...@journalexperts.com), 2011/7/26.
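The length effect described above can be sketched numerically. The formula below, 1/sqrt(number of terms), is my assumption about the classic Lucene default length normalization and is not stated in this thread; a real index also stores the norm in a lossy compact form, so the values will not match the 0.625 and 0.375 shown in the explain output exactly. The term counts (five for RECORD 1, counting Fred twice plus coolest, guy, town; two for RECORD 2) are likewise illustrative.

```python
import math


def length_norm(num_terms):
    """Assumed default length normalization: shorter fields score higher."""
    return 1.0 / math.sqrt(num_terms)


def score(term_freq, idf, num_terms, omit_norms=False):
    """tf * idf * norm, where the norm is fixed at 1.0 when norms are omitted."""
    norm = 1.0 if omit_norms else length_norm(num_terms)
    return math.sqrt(term_freq) * idf * norm


IDF = 8.451541  # copied from the explain output in the original question

if __name__ == "__main__":
    for omit in (False, True):
        r1 = score(2, IDF, num_terms=5, omit_norms=omit)  # RECORD 1
        r2 = score(1, IDF, num_terms=2, omit_norms=omit)  # RECORD 2
        first = "RECORD 1" if r1 > r2 else "RECORD 2"
        print(f"omitNorms={omit}: RECORD 1={r1:.3f} RECORD 2={r2:.3f} -> {first} first")
```

With norms enabled the shorter RECORD 2 wins; with norms omitted only the term frequency differs, so RECORD 1 moves ahead.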