mapping and tuning payloads in Solr 8
Hi all,

In our Solr 6 setup we use string payloads to boost certain tokens (URIs). These strings are mapped to floats via a schema parameter "PayloadMapping", which is read by our custom WKSimilarity class (extending TFIDFSimilarity). The similarity parameter values in the schema are (the last one being the PayloadMapping):

0.4 0.4 0.5 0 0.0 10.0 3.0 1.0 isAbout=15.0,coversFiscalPeriod=10.0,type=5.0,hasTheme=5.0,subject=4.0,mentions=2.0,creator=2.0

The reason for this indirection is convenience: by storing payload strings instead of floats, we can change and tune the boosts simply by updating the schema, without having to change the content set. Inside WKSimilarity each payload string is mapped to its corresponding boost value, and the final boost is applied via the scorePayload method (where we can tune the boost curve via some additional schema parameters). This works well in Solr 6.

The problem: we are about to migrate to Solr 8, and after LUCENE-8014 it is no longer possible to override the scorePayload method in WKSimilarity (it has been removed from TFIDFSimilarity).

I wonder what alternatives there are for mapping string payloads to floats and using them in a tunable formula for boosting.

Thanks,
Tom Burgmans
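To make the indirection concrete, here is a minimal Python sketch of the lookup step only: parsing a PayloadMapping string into a table and resolving a payload string to its boost. The function names and the default of 1.0 for unknown payloads are assumptions for illustration; this is not the WKSimilarity code and says nothing about how Solr 8 should be configured.

```python
def parse_payload_mapping(spec):
    """Turn 'name=float,name=float,...' into a dict of boosts."""
    mapping = {}
    for pair in spec.split(","):
        name, _, value = pair.partition("=")
        mapping[name.strip()] = float(value)
    return mapping


def payload_boost(payload, mapping, default=1.0):
    """Look up the boost for a payload string (default is an assumption)."""
    return mapping.get(payload, default)


# Mapping string copied from the message above.
MAPPING = parse_payload_mapping(
    "isAbout=15.0,coversFiscalPeriod=10.0,type=5.0,hasTheme=5.0,"
    "subject=4.0,mentions=2.0,creator=2.0"
)
```

Because the content only stores the names, changing `isAbout=15.0` to another value changes the boost without reindexing, which is the convenience described in the message.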
RE: Multiplicative Boosts broken since 7.3 (LUCENE-8099)
I'd like to bump this issue, since it is a showstopper for our upgrade from Solr 6. In https://issues.apache.org/jira/browse/SOLR-13126 I described a couple more use cases in which this bug appears. We see scores in the EXPLAIN that differ from the actual scores, and our analysis is that the EXPLAIN is in fact the correct one. It happens when a multiplicative boost is used (via the "boost" parameter) in combination with some function queries, like "query" and "field".

One example (tested on Solr 7.5.0), when running:

http://localhost:8983/solr/test/select?defType=edismax&fl=id,score,[explain style=text]&q=*:*&boost=sum(field(price),4)

then the expectation is that a document that doesn't have the price field gets a score of 4. The result however is:

{
  "id": "docid123576",
  "score": 1.0,
  "[explain]": "4.0 = product of:\n 1.0 = boost\n 4.0 = product of:\n 1.0 = *:*\n4.0 = sum(float(price)=0.0,const(4))\n"
}

EXPLAIN and score are not consistent.

Best regards
Tom

-Original Message-
From: Tobias Ibounig [mailto:t.ibou...@netconomy.net]
Sent: Tuesday 22 January 2019 10:14
To: solr-user@lucene.apache.org
Subject: Multiplicative Boosts broken since 7.3 (LUCENE-8099)

Hello,

As described in https://issues.apache.org/jira/browse/SOLR-13126, multiplicative boosts (in certain conditions) seem to be broken since 7.3. The error seems to have been introduced in https://issues.apache.org/jira/browse/LUCENE-8099.

Reverting the Solr parts to the now deprecated BoostingQuery fixes the issue. The filed issue contains a test case and a patch with the revert (for testing purposes, not really a clean fix). We sadly couldn't find the actual cause, which seems to lie in the use of "FunctionScoreQuery" for boosting. We were able to patch our 7.5 installation with the patch. As others might be affected as well, we hope this can be helpful in resolving this bug.

To all Solr/Lucene developers: thank you for your work. Looking through the code base gave me a new appreciation of it.

Best Regards,
Tobias

PS: This issue was already posted by a colleague, "Inconsistent debugQuery score with multiplicative boost", but I wanted to create a new post with a clearer title.
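The mismatch reported in this thread can be checked mechanically. The sketch below, in Python, takes a result document shaped like the one quoted above and compares its score with the top-level value of its explain text. The helper names and the tolerance are assumptions for illustration; the document itself is copied from the message.

```python
import re

# Leading number of an explain string, e.g. "4.0 = product of:"
_TOP_VALUE = re.compile(r"\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?) =")


def explain_top_value(explain_text):
    """Return the score that the explain output claims for the document."""
    match = _TOP_VALUE.match(explain_text)
    if match is None:
        raise ValueError("explain text does not start with '<number> ='")
    return float(match.group(1))


def is_consistent(doc, tolerance=1e-6):
    """True when the returned score agrees with the explain output."""
    return abs(doc["score"] - explain_top_value(doc["[explain]"])) <= tolerance


doc = {
    "id": "docid123576",
    "score": 1.0,
    "[explain]": "4.0 = product of:\n 1.0 = boost\n 4.0 = product of:\n"
                 " 1.0 = *:*\n4.0 = sum(float(price)=0.0,const(4))\n",
}
```

For this document the explain arithmetic is 1.0 * (1.0 * (0.0 + 4)) = 4.0, while the returned score is 1.0, so `is_consistent(doc)` is false.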
Change in EXPLAIN info since Solr 5
Hi group,

While exploring Solr 5.4.0, I noticed a subtle difference in the EXPLAIN debug information, compared to the version we currently use (4.10.1).

Solr 4.10.1:

2.0739748 = (MATCH) max plus 1.0 times others of:
  2.0739748 = (MATCH) weight(text:test in 30) [DefaultSimilarity], result of:
    2.0739748 = score(doc=30,freq=3.0), product of:
      0.3556181 = queryWeight, product of:
        3.3671236 = idf(docFreq=17, maxDocs=192)
        0.105614804 = queryNorm
      5.832029 = fieldWeight in 30, product of:
        1.7320508 = tf(freq=3.0), with freq of:
          3.0 = termFreq=3.0
        3.3671236 = idf(docFreq=17, maxDocs=192)
        1.0 = fieldNorm(doc=30)

Solr 5.4.0:

2.0739748 = max plus 1.0 times others of:
  2.0739748 = weight(text:test in 30) [ClassicSimilarity], result of:
    2.0739748 = score(doc=30,freq=3.0), product of:
      0.3556181 = queryWeight, product of:
        3.3671236 = idf(docFreq=17, maxDocs=192)
        0.105614804 = queryNorm
      5.832029 = fieldWeight in 30, product of:
        1.7320508 = tf(freq=3.0), with freq of:
          3.0 = termFreq=3.0
        3.3671236 = idf(docFreq=17, maxDocs=192)
        1.0 = fieldNorm(doc=30)

The difference is the removal of (MATCH) in some of the EXPLAIN lines. That is causing issues for us, since we have developed an EXPLAIN parser that relies on the presence of (MATCH) in the EXPLAIN.

Does anyone have a suggestion for how to put (MATCH) back into the explain info (for example, which file should we patch)?

Thanks,
Tom
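An alternative to patching Solr would be to make the parser indifferent to the marker. The Python sketch below is only an illustration of that idea, assuming each explain line has the shape "<value> = [(MATCH) ]<description>" as in the two outputs quoted above; the names are hypothetical and this is not the parser mentioned in the message.

```python
import re

# One explain line: optional indentation, a number, " = ",
# an optional "(MATCH) " marker, then the description.
EXPLAIN_LINE = re.compile(
    r"^\s*(?P<value>[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?) = "
    r"(?:\(MATCH\) )?(?P<description>.*)$"
)


def parse_explain_line(line):
    """Return (value, description), or None if the line has another shape."""
    match = EXPLAIN_LINE.match(line)
    if match is None:
        return None
    return float(match.group("value")), match.group("description")


old_style = parse_explain_line("2.0739748 = (MATCH) max plus 1.0 times others of:")
new_style = parse_explain_line("2.0739748 = max plus 1.0 times others of:")
```

With the marker made optional, the 4.10.1 and 5.4.0 lines parse to the same value and description.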
Score results by only the highest scoring term
Hi All,

I wonder if it's in some way possible to search for multiple terms like:

(term A OR term B OR term C OR term D)

such that, when a document contains 2 or more of these terms, only the highest scoring term contributes to the final relevancy score; any lower scoring terms should be discarded from the scoring algorithm.

Ideally I'd like an operator like ANY:

(term A ANY term B ANY term C ANY term D)

whose purpose is: return documents, sorted by the score of the highest scoring term.

Any thoughts about how to achieve this?

Tom Burgmans
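The requested scoring rule can be written down in a few lines of Python. This is a sketch of the arithmetic only, with hypothetical per-term scores; it is the same "max plus tie times others" combination that Lucene's DisjunctionMaxQuery uses, and a tie breaker of 0 gives the proposed ANY behaviour. Whether and how that query type can be produced from a given query parser is not covered here.

```python
def any_score(term_scores, tie=0.0):
    """Score of the best matching term, plus tie * the sum of the others.

    tie=0.0 keeps only the highest scoring term (the proposed ANY);
    tie=1.0 degenerates into a plain sum (ordinary OR-style scoring).
    """
    if not term_scores:
        return 0.0
    best = max(term_scores)
    return best + tie * (sum(term_scores) - best)


# Hypothetical scores of terms A, B and C in one document.
scores = [0.2, 1.5, 0.7]
highest_only = any_score(scores)
summed = any_score(scores, tie=1.0)
```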
incomplete proximity boost for fielded searches
Consider query:

http://10.208.152.231:8080/solr/wkustaldocsphc_A/search?q=title:(Michigan Corporate Income Tax)&debugQuery=true&pf=title&ps=255&defType=edismax

The intention is to perform a search in field title and to apply a proximity boost within a window of 255 words. If I look at the debug information, I see:

<str name="parsedquery">BoostedQuery(boost(+((title:michigan title:corporate title:income title:tax)~4) (title:"corporate income tax"~255)~1.0))</str>

Note that the first search term (michigan) is missing in the proximity boost clause. I can't believe this is intended behavior. Why is edismax splitting (title:Michigan) and (Corporate Income Tax) while determining what to use for the proximity boost?

Thanks,
Tom
strange edismax parsing when searching in multiple fields (#TB)
Hi group,

Background: I have a collection containing English and French documents. I made sure to index the English content in field body (fieldType=text_en) and the French content in field body_fr (fieldType=text_fr). The user could be either English or French, so the goal is to execute the queries against both fields simultaneously without knowing the query language upfront. The query is analyzed differently for each field. For both fields a stopFilter is configured, each with its own list of stopwords (different per language).

The issue: When I search for 'a result' (without single quotes) in field body and body_fr at the same time, then "a" is considered a stopword in English and removed for field body, but not in French, so both terms are still searched inside body_fr. What happens is that the query is parsed (edismax) into this construction:

((body_fr:a)~1.0 (body:result | body_fr:result)~1.0)

This query returns only French documents, although there are many English documents in the index that contain the term 'result' as well. How can that happen? I think it is related to the way my query is parsed: there seems to be an AND-relationship between (body_fr:a) and (body:result | body_fr:result). There is no English document that has (body_fr:a), so that's why they don't show up.

To me, a much more logical parsed query would be:

((body:result)~1.0 | (body_fr:a body_fr:result)~1.0)

How should I interpret this? Is it a bug in edismax? Is it intended, and if yes: why?

Thanks for any hint,
Tom

This email and any attachments may contain confidential or privileged information and is intended for the addressee only. If you are not the intended recipient, please immediately notify us by email or telephone and delete the original email and attachments without using, disseminating or reproducing its contents to anyone other than the intended recipient.
Wolters Kluwer shall not be liable for the incorrect or incomplete transmission of of this email or any attachments, nor for unauthorized use by its employees. Wolters Kluwer nv has its registered address in Alphen aan den Rijn, The Netherlands, and is registered with the Trade Registry of the Dutch Chamber of Commerce under number 33202517.
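The behaviour described in this message can be reproduced with a toy model in Python. The sketch assumes what the message itself suggests: each query term becomes one clause that lists the fields in which the term survives analysis, and every clause must match. The stopword lists, documents and function names are hypothetical, and this is not edismax's actual code.

```python
# Hypothetical per-field stopword lists (different per language).
STOPWORDS = {"body": {"a", "the"}, "body_fr": {"le", "la"}}


def build_clauses(query, fields):
    """One clause per term; each clause lists the (field, term) alternatives."""
    clauses = []
    for term in query.lower().split():
        alternatives = [(f, term) for f in fields if term not in STOPWORDS[f]]
        if alternatives:  # a term removed in every field yields no clause
            clauses.append(alternatives)
    return clauses


def matches(doc, clauses):
    """Every clause is required; within a clause any alternative suffices."""
    return all(
        any(term in doc.get(field, set()) for field, term in clause)
        for clause in clauses
    )


clauses = build_clauses("a result", ["body", "body_fr"])
english_doc = {"body": {"good", "result"}}
french_doc = {"body_fr": {"a", "result"}}
```

Here `clauses` has the same shape as the parsed query in the message: a first clause with only body_fr:a, and a second with body:result or body_fr:result. The English document fails the first clause, so only the French one matches.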
RE: [SPAM] Re: strange edismax parsing when searching in multiple fields (#TB)
The main reason for using stopwords is to speed up query performance, since we see that a large part of the query time is consumed by highlighting stopwords. Also, when reading the fully highlighted document, we think a document is more readable when only meaningful words are highlighted. For searching, I would in fact like to keep stopwords...

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Wednesday 13 March 2013 04:43
To: solr-user@lucene.apache.org
Subject: [SPAM] Re: strange edismax parsing when searching in multiple fields (#TB)
Importance: Low

Or don't use stopwords. I haven't used stopwords for, oh, a dozen years or so. Removing stopwords was a hack developed for 16-bit computers and 40 megabyte disks. We don't need to do that any more.

wunder

On Mar 13, 2013, at 8:28 AM, Ahmet Arslan wrote:

I would merge stop_en.txt and stop_fr.txt. Use the same set of stop words for all fields that you search on. You might find this useful: http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/
Search in String and Text_en fields simultaneously with edismax
I have a field valueadd of type String and a field body of type text_en (with tokenization and linguistic processing). When I search with edismax against field valueadd like this:

q=valueadd:(test . test2)

I see that the parsed query is

(valueadd:test valueadd:. valueadd:test2)~3

Why not (valueadd:test . test2)? It looks like the query is tokenized, while field type String doesn't have a tokenizer configured.

I know I could construct my query as:

q=valueadd:"test . test2"

in which case the phrase is searched as a whole against valueadd. But why doesn't that happen without quotes?

The reason I ask: for a simultaneous search in multiple fields I'd like to include field valueadd in the qf parameter, which then contains String and text_en fields, like:

qf=valueadd body

How can I search both fields simultaneously without duplicating search terms, while the query is (whitespace) tokenized for body but searched as a phrase for valueadd?

Thanks,
Tom Burgmans
RE: Search in String and Text_en fields simultaneously with edismax
Ah OK. I didn't have a good view of query parsing vs. query generation. Thanks for clearing this up.

So does it mean that searching in a tokenized and a non-tokenized field simultaneously is not possible when I want
- the expression parsed as a phrase for the non-tokenized field
- the expression parsed as multiple tokens for the tokenized field
?

If possible, I'd like to avoid writing my own query parser.

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Thursday 28 February 2013 05:05
To: solr-user@lucene.apache.org
Subject: Re: Search in String and Text_en fields simultaneously with edismax

Query text is always tokenized (more properly, parsed), unless the text is enclosed in quotes or spaces are escaped with backslash. Try:

q=valueadd:"test . test2"

or

q=valueadd:test\ .\ test2

Parentheses simply provide grouping, either to control boolean operator evaluation order or to apply a field name to a sequence of query tokens (as you have written).

The analyzer or field type is only consulted when the query is generated, not while it is being parsed. The same identical parsing rules apply to both tokenized and non-tokenized fields. What a field type's analyzer does with its value is irrelevant to query parsing.

-- Jack Krupansky
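The parsing rule explained in this thread (whitespace separates query tokens unless the text is quoted or the spaces are backslash-escaped) has a close analogue in shell-style word splitting. The Python sketch below uses the standard library's shlex purely as an analogy to show the three cases; it is not Solr's query parser and ignores operators, field prefixes and parentheses.

```python
import shlex


def split_query_text(text):
    """Split on whitespace, honouring double quotes and backslash escapes."""
    return shlex.split(text)


unquoted = split_query_text("test . test2")
quoted = split_query_text('"test . test2"')
escaped = split_query_text(r"test\ .\ test2")
```

The unquoted form yields three tokens, matching the three valueadd clauses in the parsed query, while the quoted and escaped forms each yield a single value that could be looked up as a whole in a String field.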
How do I let the FVH highlight individual terms instead of the complete phrase?
Hi group,

I'm trying to highlight my complete(!) XML document, which is indexed for that purpose in a special field called wkxmlsource. I configured the wkxmlsource field like this:

<field indexed="true" multiValued="false" name="wkxmlsource" omitNorms="true" stored="true" termPositions="true" termOffsets="true" termVectors="true" type="text_xml"/>

The text_xml fieldtype is almost equal to the text_en fieldtype, but with

<charFilter class="solr.HTMLStripCharFilterFactory"/>

as the first class in the index analyzer. That prevents highlighting inside XML tags.

First I tried the simple highlighter and that almost worked: I get my document back with my search terms and phrases highlighted, and each individual term gets its own highlight tags. But the problem is that the complete value of field wkxmlsource is not returned; the bottom part is cut off, no matter how big I set hl.fragsize.

So my next try was to use the FVH (hl.useFastVectorHighlighter=true) instead. That helped: it now returns the complete value of wkxmlsource with all my search terms/phrases highlighted. But in case of a phrase search, it no longer highlights each individual term; it only puts highlight tags around the complete phrase. That could possibly lead to malformed XML. An example:

Search for phrase: "across the country Santa Fe"

It highlights like this in the document:

<para align="left">...spread <em>across the country.</para><para align="left">Santa Fe</em> Pacific... </para>

How can I let the FVH highlight individual terms instead of the complete phrase? Ideally I'd like to have something like:

<para align="left">...spread <em>across</em> <em>the</em> <em>country</em>.</para><para align="left"><em>Santa</em> <em>Fe</em> Pacific... </para>

which is still valid XML.

My boundary scanner is configured like this:

<boundaryScanner name="breakIterator" class="solr.highlight.BreakIteratorBoundaryScanner">
  <lst name="defaults">
    <str name="hl.bs.type">WORD</str>
    <str name="hl.bs.language">en</str>
    <str name="hl.bs.country">US</str>
  </lst>
</boundaryScanner>

Thanks,
Tom

--
Tom Burgmans
Search Specialist
Tel: +31 (0)17 246 66 33
Mobile: +31 (0)6 306 821 78
Platform Technologies
Global Platform Organization
Zuidpoolsingel 2
2408 ZE, Alphen aan den Rijn
The Netherlands
tom.burgm...@wolterskluwer.com
www.wolterskluwer.com
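The desired output can be illustrated with a small post-processing sketch in Python. This is not an FVH setting and not a claim about what the highlighter can be configured to do; it only shows the target behaviour under the assumption that the matched terms are known: split the document into tags and text, and wrap each term inside the text parts only, so a highlight tag never spans an element boundary. The function name is hypothetical.

```python
import re

TAG = re.compile(r"(<[^>]+>)")  # capturing group: split() keeps the tags


def highlight_terms(xml_text, terms, pre="<em>", post="</em>"):
    """Wrap every whole-word occurrence of a term, skipping markup."""
    term_pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    parts = TAG.split(xml_text)
    out = []
    for index, part in enumerate(parts):
        if index % 2 == 1:  # odd positions are the tags captured by TAG
            out.append(part)
        else:               # even positions are plain text between tags
            out.append(term_pattern.sub(lambda m: pre + m.group(0) + post, part))
    return "".join(out)


source = ('<para align="left">...spread across the country.</para>'
          '<para align="left">Santa Fe Pacific... </para>')
result = highlight_terms(source, ["across", "the", "country", "Santa", "Fe"])
```

Because every `<em>` opens and closes within one text segment, the result stays well-formed even when the phrase crosses a paragraph boundary.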
edismax: implicit AND changes into implicit OR
Hi all,

I wonder if this is a bug or expected behavior: I have some documents indexed; 3 of them contain Thomas and 4 of them contain Michael, but none of them contains both.

A search for

http://localhost:8983/solr/collection1/browse?defType=edismax&q=(Thomas+Michael)

returns 0 results, as expected, since there is an implicit AND between the two terms and there is no document that matches both. But a search for

http://localhost:8983/solr/collection1/browse?defType=edismax&q=(Thomas+Michael)+OR+xxxmatchesnothingxxx

returns 7 results. For some reason the implicit AND turns into an implicit OR when an explicit OR is added to the query expression. The parsedquery information confirms this behavior.

Why is edismax doing this? Tested on a Solr 4.0.0 instance.

Thanks,
Tom
RE: edismax: implicit AND changes into implicit OR
I have set <solrQueryParser defaultOperator="AND"/> in the schema (and restarted Solr), and tested again with

http://localhost:8983/solr/collection1/browse?defType=edismax&q=(Thomas+Michael)+OR+xxxmatchesnothingxxx&q.op=AND

Note the extra parameter. It still returns the 7 documents that match (Thomas OR Michael), not (Thomas AND Michael).

The only way to enforce an implicit AND is by changing the query into

http://localhost:8983/solr/collection1/browse?defType=edismax&q=(%2BThomas+%2BMichael)+OR+%2Bxxxmatchesnothingxxx

But then the AND isn't implicit anymore, and I don't like to prefix all my search terms with a +.

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Wednesday 12 December 2012 05:46
To: solr-user@lucene.apache.org
Subject: Re: edismax: implicit AND changes into implicit OR

I'll give you my best guess, nothing to back this up but instinct. The following statements (especially the second one) may be wrong:

When you do not include any boolean operators, edismax is using its mm parameter, which defaults to 100%, meaning that all search terms must match (equivalent to a default operator of AND).

When you DO include a boolean operator, mm goes out the window and edismax reverts to using the default operator for Solr, your schema, or the request handler, which, unless you have changed it, is OR.

Thanks,
Shawn
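The guess quoted above can be made concrete with a toy model of "minimum should match" in Python. The sketch assumes that guess is right (the poster himself flags it as possibly wrong): all terms are optional clauses, and a document matches when at least mm of them occur. The document set mirrors the counts in the thread; names are hypothetical and this is not edismax code.

```python
def search(docs, terms, mm):
    """Return ids of docs containing at least `mm` of the optional terms."""
    return [
        doc_id
        for doc_id, words in docs.items()
        if sum(term in words for term in terms) >= mm
    ]


# 3 documents contain thomas, 4 contain michael, none contains both.
docs = {f"t{i}": {"thomas"} for i in range(3)}
docs.update({f"m{i}": {"michael"} for i in range(4)})

# mm=100% of two terms: both must match, so nothing is returned.
all_required = search(docs, ["thomas", "michael"], mm=2)

# Default operator OR (at least one clause): all seven documents match.
any_required = search(docs, ["thomas", "michael", "xxxmatchesnothingxxx"], mm=1)
```

Under this model the jump from 0 to 7 results is exactly what a silent switch from "all clauses" to "at least one clause" would produce.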
RE: edismax: implicit AND changes into implicit OR
Yes, /browse returns Velocity output, but I mostly add wt=xml to the query. And yes, I looked at the parsedquery feedback that debugQuery=true provides. That basically confirms my idea that the implicit AND is indeed switched to an implicit OR when an explicit OR is present somewhere else in the query. Even the default operator set to AND seems to be overruled.

Thanks, I'll think about submitting a Jira.

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Wednesday 12 December 2012 06:43
To: solr-user@lucene.apache.org
Subject: Re: edismax: implicit AND changes into implicit OR

It smells like a bug to me, so you should probably file an issue in Jira. I will admit that this is getting somewhat outside my experience level.

I noticed the /browse there ... is this just what you have named your handler, or is this connected with the Velocity stuff? Have you tried adding debugQuery=true to your URL and seeing what your different queries actually parse to? It may also be a good idea to add echoParams=all so you can see all parameters that are going into the request.

Thanks,
Shawn
RE: Can a field with defined synonym be searched without the synonym?
In our case it's the opposite. For our clients it is very important that every synonym gets an equal chance in the relevancy calculation. The fact that "nol" scores higher than "net operating loss", simply because its document frequency is lower, is unacceptable and a reason to look for ways to remove the IDF from the score calculation. But that is in fact something I don't like to do, since IDF is such an elementary part of the algorithm (and very useful for non-synonym searches).

Pre-processing synonyms to apply 'reverse weighting' is also a strategy to consider, but I agree with Walter that this is very error-prone; things could easily get out of sync. Moreover, none of our Dev, QA, STG and PRD environments contains exactly the same content, so it would require a differently tuned synonym dictionary for each of them...meh...

In our previous search engine (FAST ESP) we basically switched off IDF, but I am still hoping that there is a more sophisticated solution with Solr.

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Thursday 13 December 2012 02:30
To: solr-user@lucene.apache.org
Subject: Re: Can a field with defined synonym be searched without the synonym?

All of the applications I've seen with user control over synonym expansion were recall-oriented. The "give me all matches for X" kind of problem. So ranking is not as important.

wunder

On Dec 12, 2012, at 5:23 PM, Roman Chyla wrote:

Well, this IDF problem has more sides. So, let's say your synonym file contains multi-token synonyms (it does, right? or perhaps you don't need it? well, some people do):

TV, TV set, TV foo, television

If you use the default synonym expansion, then when you index 'television' you have also increased the frequency of 'set' and 'foo'. So the IDF of 'TV' is the same as that of 'television', but the IDF of 'foo' and 'set' has changed (their frequency increased, their IDF decreased): TVs have in fact made the 'foo' term very frequent and undesirable.

So, you might be sure that the IDF of 'TV' and 'television' are the same, but you are not aware that it has 'screwed' other (desirable) terms, so it really depends. And I wouldn't argue these cases are esoteric.

And finally: there are use cases out there where people NEED to switch off synonym expansion at will (find only those documents that contain the word 'TV' and not that bloody 'foo'). This cannot be done if the index contains all synonym terms (unless you have a way to mark the original and the synonym in the index).

roman

On Wed, Dec 12, 2012 at 12:50 PM, Walter Underwood wun...@wunderwood.org wrote:

Query parsers cannot fix the IDF problem or make query-time synonyms faster. Query synonym expansion makes more search terms. More search terms are more work at query time.

The IDF problem is real; I've run up against it. The rarest variant of the synonym has the highest score. This is probably the opposite of what you want. For me, it was TV and television. Documents with TV had higher scores than those with television.

wunder

On Dec 12, 2012, at 9:45 AM, Roman Chyla wrote:

@wunder It is a misconception (well, supported by that wiki description) that the query-time synonym filter has these problems. It is actually the default parser that is causing these problems. Look at this if you still think that index-time synonyms are a cure for all: https://issues.apache.org/jira/browse/LUCENE-4499

@joe If you can use the flexible query parser (as linked in by @Swati), then all you need to do is to define a different field with a different tokenizer chain and then swap the field names before the analyzer processes the document (and then rewrite the field name back; for example, we have fields called author and author_nosyn).

roman

On Wed, Dec 12, 2012 at 12:38 PM, Walter Underwood wun...@wunderwood.org wrote:

Query-time synonyms have known problems. They are slower, cause incorrect IDF, and don't work for phrase synonyms. Apply synonyms at index time and you will have none of those problems. See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

wunder

On Dec 12, 2012, at 9:34 AM, Swati Swoboda wrote:

Query-time analyzers are still applied, even if you include a string in quotes. Would you expect "foo" to not match "Foo" just because it's enclosed in quotes? Also look at this, someone who had similar requirements: http://lucene.472066.n3.nabble.com/Synonym-Filter-disable-at-query-time-td2919876.html

-Original Message-
From: joe.cohe...@gmail.com [mailto:joe.cohe...@gmail.com]
Sent: Wednesday, December 12, 2012 12:09 PM
To: solr-user@lucene.apache.org
Subject: Re: Can a field with defined synonym be searched without the synonym?

I'm applying only query-time synonyms, so I have the original values stored and indexed. I would've expected that if I search a string with quotations, I'll get the exact match, without applying a
RE: score calculation
I am also busy getting this clear. Here are my notes so far (partly copied, partly written myself):

queryWeight = the impact of the query against the field
    implementation: boost(query)*idf*queryNorm

    boost(query) = boost of the field at query-time
        Implication: hits in fields with a higher boost get a higher score
        Rationale: a term in field A could be more relevant than the same term in field B

    idf = inverse document frequency = measure of how rare the term is across the index for this field
        implementation: log(numDocs/(docFreq+1))+1
        Implication: the more documents a term occurs in, the lower its score
        Rationale: common terms are less important than uncommon ones

        numDocs = the total number of documents in the index, not including those that are marked as deleted but have not yet been purged. This is a constant (the same value for all documents in the index).
        docFreq = the number of documents in the index that contain the term in this field. This is a constant (the same value for all documents in the index containing this field).

    queryNorm = normalization factor so that queries can be compared
        implementation: 1/sqrt(sumOfSquaredWeights)
        Implication: doesn't impact the relevancy of this result
        Rationale: queryNorm is not related to the relevance of the document, but rather tries to make scores between different queries comparable. This value is equal for all results of the query.

fieldWeight = the score of a term matching the field
    implementation: tf*idf*fieldNorm

    tf = term frequency in a field = measure of how often a term appears in the field
        implementation: sqrt(freq)
        Implication: the more frequently a term occurs in a field, the greater its score
        Rationale: fields that contain more occurrences of a term are generally more relevant

        freq = termFreq = number of times the term occurs in the field for this document

    fieldNorm = impact of a hit in this field
        implementation: lengthNorm*boost(index)

        lengthNorm = measure of the importance of a term according to the total number of terms in the field
            implementation: 1/sqrt(numTerms)
            Implication: a term matched in a field with fewer terms gets a higher score
            Rationale: a term in a field with fewer terms is more important than one in a field with more terms

            numTerms = number of terms in the field

        boost(index) = boost of the field at index-time
            Implication: hits in fields with a higher boost get a higher score
            Rationale: a term in field A could be more relevant than the same term in field B

maxDocs = the number of documents in the index, including those that are marked as deleted but have not yet been purged. This is a constant (the same value for all documents in the index).
    Implication: (probably) doesn't play a role in the scoring calculation

coord = fraction of the terms in the query that were found in the document (omitted if equal to 1)
    implementation: overlap/maxOverlap
    Implication: of the terms in the query, a document that contains more terms will have a higher score
    Rationale: documents that match the most optional terms score highest

    overlap = the number of query terms matched in the document
    maxOverlap = the total number of terms in the query

FunctionQuery = could be any kind of custom ranking function, whose outcome is added to, or multiplied with, the default rank score
    Implication: various

Look at the EXPLAIN information to see how the final score is calculated.
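
To make the notes above concrete, here is a minimal Python sketch of the per-term calculation. It is only an illustration under stated assumptions: the function names are made up for this example (they are not Lucene or Solr API), the logarithm is the natural log, the lossy encoding that Lucene applies to the stored fieldNorm is ignored, and for a single-term query sumOfSquaredWeights is taken to be (idf*boost(query))^2, so that queryWeight comes out as 1. The term score is taken as queryWeight*fieldWeight, which is how the two parts are combined in the EXPLAIN output.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """Inverse document frequency: log(numDocs/(docFreq+1)) + 1 (natural log)."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def tf(freq: float) -> float:
    """Term frequency factor: sqrt(freq)."""
    return math.sqrt(freq)


def length_norm(num_terms: int) -> float:
    """Length normalization: 1/sqrt(numTerms)."""
    return 1.0 / math.sqrt(num_terms)


def query_norm(sum_of_squared_weights: float) -> float:
    """Query normalization: 1/sqrt(sumOfSquaredWeights)."""
    return 1.0 / math.sqrt(sum_of_squared_weights)


def term_score(freq, num_terms, num_docs, doc_freq,
               query_boost=1.0, index_boost=1.0,
               sum_of_squared_weights=None):
    """Hypothetical helper: score of one term in one field of one document.

    Assumption: if sum_of_squared_weights is not given, the query is
    treated as a single-term query, so it equals (idf * query_boost)^2.
    The lossy encoding of the stored fieldNorm is ignored here.
    """
    term_idf = idf(num_docs, doc_freq)
    if sum_of_squared_weights is None:
        sum_of_squared_weights = (term_idf * query_boost) ** 2

    # queryWeight = boost(query) * idf * queryNorm
    query_weight = query_boost * term_idf * query_norm(sum_of_squared_weights)

    # fieldNorm = lengthNorm * boost(index)
    field_norm = length_norm(num_terms) * index_boost

    # fieldWeight = tf * idf * fieldNorm
    field_weight = tf(freq) * term_idf * field_norm

    return query_weight * field_weight


if __name__ == "__main__":
    # Same tf and field length, different document frequency:
    # the rarer term gets the higher score.
    rare = term_score(freq=4, num_terms=16, num_docs=1000, doc_freq=9)
    common = term_score(freq=4, num_terms=16, num_docs=1000, doc_freq=99)
    print("rare term score:  ", rare)
    print("common term score:", common)
```

With everything else equal, the term with the lower docFreq ends up with the higher score. For a multi-term query the per-term scores would be summed and multiplied by coord; that part is left out of the sketch.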
Tom

-Original Message-
From: Sangeetha [mailto:sangeetha...@gmail.com]
Sent: Thursday 13 December 2012 08:33
To: solr-user@lucene.apache.org
Subject: score calculation

I want to know how the score is calculated. What are fieldWeight, fieldNorm, queryWeight and queryNorm? And what is the formula to get the final score using fieldWeight, fieldNorm, queryWeight, queryNorm, idf and tf? Can anyone explain or provide some links?

Thanks,
Sangeetha

--
View this message in context: http://lucene.472066.n3.nabble.com/score-calculation-tp4026669.html