mapping and tuning payloads in Solr 8

2020-02-12 Thread Burgmans, Tom
Hi all,

In our Solr 6 setup we use string payloads to boost certain tokens (URIs). 
These strings are mapped to floats via a schema parameter "PayloadMapping", 
which can be read out in our custom WKSimilarity class (extending 
TFIDFSimilarity).

(Similarity parameters from the schema; only the values survive, in order:)
0.4, 0.4, 0.5, 0, 0.0, 10.0, 3.0, 1.0
PayloadMapping: 
isAbout=15.0,coversFiscalPeriod=10.0,type=5.0,hasTheme=5.0,subject=4.0,mentions=2.0,creator=2.0


The reason for this indirection is convenience: by storing payload strings 
instead of floats we can change and tune the boosts easily by updating the 
schema, without having to change the content set.
Inside WKSimilarity each payload string is mapped to its corresponding boost 
value and the final boost is applied via the scorePayload method (where we 
could tune the boost curve via some additional schema parameters). This works 
well in Solr 6.
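
The indirection described above can be pictured with a minimal sketch. It is 
not the WKSimilarity code: the function names and the power curve used as the 
"tunable formula" are hypothetical, only the mapping string comes from the 
schema.

```python
# Hypothetical sketch: function names and the power curve are illustrative
# assumptions, not the actual WKSimilarity implementation.

PAYLOAD_MAPPING = (
    "isAbout=15.0,coversFiscalPeriod=10.0,type=5.0,hasTheme=5.0,"
    "subject=4.0,mentions=2.0,creator=2.0"
)


def parse_payload_mapping(spec):
    """Turn 'isAbout=15.0,type=5.0' into {'isAbout': 15.0, 'type': 5.0}."""
    mapping = {}
    for pair in spec.split(","):
        name, _, value = pair.strip().partition("=")
        mapping[name] = float(value)
    return mapping


def payload_boost(payload, mapping, default=1.0, exponent=1.0):
    """Look up the payload string, then apply a tunable curve (here a power)."""
    return mapping.get(payload, default) ** exponent


mapping = parse_payload_mapping(PAYLOAD_MAPPING)
print(payload_boost("isAbout", mapping))                 # boost taken as-is
print(payload_boost("subject", mapping, exponent=0.5))   # flattened curve
print(payload_boost("unknownPayload", mapping))          # falls back to default
```

Changing the mapping string or the exponent changes the boosts without 
touching the indexed payload strings, which is the property the setup relies 
on.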

The problem: we are about to migrate to Solr 8, and after LUCENE-8014 it is no 
longer possible to override the scorePayload method in WKSimilarity (it has 
been removed from TFIDFSimilarity). I wonder what alternatives there are for 
mapping string payloads to floats and using them in a tunable boost formula.

Thanks,
Tom Burgmans


RE: Multiplicative Boosts broken since 7.3 (LUCENE-8099)

2019-02-13 Thread Burgmans, Tom
I'd like to bump this issue up, since it is a showstopper for us to upgrade 
from Solr 6. In https://issues.apache.org/jira/browse/SOLR-13126 I described a 
couple more use cases in which this bug appears. We see different scores in 
the EXPLAIN compared to the actual scores, and our analysis is that the EXPLAIN 
is in fact correct. It happens when a multiplicative boost is used (via the 
"boost" parameter) in combination with some function queries, like "query" and 
"field". 

One example (tested on Solr 7.5.0), when running: 

http://localhost:8983/solr/test/select?defType=edismax&fl=id,score,[explain style=text]&q=*:*&boost=sum(field(price),4)

then the expectation is that a document that doesn't have the price field gets 
a score of 4. The result however is: 

{
"id": "docid123576",
"score": 1.0,
"[explain]": "4.0 = product of:\n  1.0 = boost\n  4.0 = product of:\n
1.0 = *:*\n4.0 = sum(float(price)=0.0,const(4))\n"
}

EXPLAIN and score are not consistent.
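
A small check for this kind of mismatch could look like the sketch below. It 
only compares the returned score with the first number in the explain text; 
the helper names are made up for illustration and the document is the one 
shown above.

```python
import re

# Hypothetical helpers for spotting score/EXPLAIN mismatches in a response doc.

NUMBER_AT_START = re.compile(r"\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*=")


def explain_top_value(explain_text):
    """Return the number before the first '=' of an explain tree."""
    match = NUMBER_AT_START.match(explain_text)
    if match is None:
        raise ValueError("explain text does not start with '<number> ='")
    return float(match.group(1))


def is_consistent(doc, tolerance=1e-6):
    """True when the document score equals the top-level explain value."""
    return abs(doc["score"] - explain_top_value(doc["[explain]"])) <= tolerance


doc = {
    "id": "docid123576",
    "score": 1.0,
    "[explain]": "4.0 = product of:\n  1.0 = boost\n  4.0 = product of:\n"
                 "1.0 = *:*\n4.0 = sum(float(price)=0.0,const(4))\n",
}
print(explain_top_value(doc["[explain]"]), doc["score"], is_consistent(doc))
```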

Best regards Tom


-Original Message-
From: Tobias Ibounig [mailto:t.ibou...@netconomy.net] 
Sent: Tuesday 22 January 2019 10:14
To: solr-user@lucene.apache.org
Subject: Multiplicative Boosts broken since 7.3 (LUCENE-8099)

Hello,

As described in https://issues.apache.org/jira/browse/SOLR-13126, 
multiplicative boosts (in certain conditions) seem to be broken since 7.3.
The error seems to have been introduced in 
https://issues.apache.org/jira/browse/LUCENE-8099. Reverting the SOLR parts to 
the now deprecated BoostingQuery fixes the issue.
The filed issue contains a test case and a patch with the revert (for testing 
purposes, not really a clean fix).
We sadly couldn't find the actual cause, which seems to lie with the use of 
"FunctionScoreQuery" for boosting.

We were able to patch our 7.5 installation with the patch. As others might be 
affected as well, we hope this can be helpful in resolving this bug.

To all SOLR/Lucene developers, thank you for your work. Looking through the 
code base gave me a new appreciation of your work.

Best Regards,
Tobias

PS: This issue was already posted by a colleague, "Inconsistent debugQuery 
score with multiplicative boost", but I wanted to create a new post with a 
clearer title.



Change in EXPLAIN info since Solr 5

2016-02-04 Thread Burgmans, Tom
Hi group, 

While exploring Solr 5.4.0, I noticed a subtle difference in the EXPLAIN debug 
information, compared to the version we currently use (4.10.1).

Solr 4.10.1:

2.0739748 = (MATCH) max plus 1.0 times others of:
  2.0739748 = (MATCH) weight(text:test in 30) [DefaultSimilarity], result of:
    2.0739748 = score(doc=30,freq=3.0), product of:
      0.3556181 = queryWeight, product of:
        3.3671236 = idf(docFreq=17, maxDocs=192)
        0.105614804 = queryNorm
      5.832029 = fieldWeight in 30, product of:
        1.7320508 = tf(freq=3.0), with freq of:
          3.0 = termFreq=3.0
        3.3671236 = idf(docFreq=17, maxDocs=192)
        1.0 = fieldNorm(doc=30)

Solr 5.4.0:

2.0739748 = max plus 1.0 times others of:
  2.0739748 = weight(text:test in 30) [ClassicSimilarity], result of:
    2.0739748 = score(doc=30,freq=3.0), product of:
      0.3556181 = queryWeight, product of:
        3.3671236 = idf(docFreq=17, maxDocs=192)
        0.105614804 = queryNorm
      5.832029 = fieldWeight in 30, product of:
        1.7320508 = tf(freq=3.0), with freq of:
          3.0 = termFreq=3.0
        3.3671236 = idf(docFreq=17, maxDocs=192)
        1.0 = fieldNorm(doc=30)

The difference is the removal of (MATCH) in some of the EXPLAIN lines. That is 
causing issues for us since we have developed an EXPLAIN parser that leans on 
the presence of (MATCH) in the EXPLAIN.
Does anyone have a suggestion how to insert back (MATCH) in the explain info 
(like which file should we patch)?
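
As an alternative to patching Solr, the parser on the client side could treat 
the marker as optional. The sketch below is an assumption about what such a 
parser might look like (the regular expression and function name are 
hypothetical); it accepts lines from both versions shown above.

```python
import re

# Hypothetical line parser: "(MATCH) " is optional, so both the 4.10.1 and the
# 5.4.0 style of EXPLAIN line are accepted.
EXPLAIN_LINE = re.compile(
    r"^(?P<indent>\s*)"
    r"(?P<value>[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?) = "
    r"(?:\(MATCH\) )?"
    r"(?P<description>.*)$"
)


def parse_explain_line(line):
    """Return (indent width, value, description) for one EXPLAIN line."""
    match = EXPLAIN_LINE.match(line)
    if match is None:
        raise ValueError("not an EXPLAIN line: %r" % line)
    return (
        len(match.group("indent")),
        float(match.group("value")),
        match.group("description"),
    )


old_style = "  2.0739748 = (MATCH) weight(text:test in 30) [DefaultSimilarity], result of:"
new_style = "  2.0739748 = weight(text:test in 30) [ClassicSimilarity], result of:"
print(parse_explain_line(old_style))
print(parse_explain_line(new_style))
```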

Thanks, Tom


Score results by only the highest scoring term

2015-02-03 Thread Burgmans, Tom
Hi All,

I wonder if it's in some way possible to search for multiple terms like:

(term A OR term B OR term C OR term D)

and in case a document contains 2 or more of these terms: only the highest 
scoring term should contribute to the final relevancy score; possibly lower 
scoring terms should be discarded from the scoring algorithm.

Ideally I'd like an operator like ANY:

(term A ANY term B ANY term C ANY term D)

that has the purpose: return documents, sorted by the score of the highest 
scoring term.

Any thoughts about how to achieve this?
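
The requested scoring can be written down as a tiny sketch. It uses the 
"max plus tie times others" combination that Lucene documents for 
DisjunctionMaxQuery; with a tie of 0 only the best term counts. The function 
name and the example scores are made up, and this says nothing about which 
query syntax would produce such a query in Solr.

```python
def max_plus_tie(term_scores, tie=0.0):
    """Best term score plus `tie` times the sum of the remaining scores."""
    if not term_scores:
        return 0.0
    best = max(term_scores)
    return best + tie * (sum(term_scores) - best)


scores_for_doc = [0.5, 1.5, 1.0]                 # made-up per-term scores
print(max_plus_tie(scores_for_doc))              # only the highest term counts
print(max_plus_tie(scores_for_doc, tie=1.0))     # plain sum of all terms
```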

_
Tom Burgmans



incomplete proximity boost for fielded searches

2014-08-28 Thread Burgmans, Tom
Consider query:
http://10.208.152.231:8080/solr/wkustaldocsphc_A/search?q=title:(Michigan Corporate Income Tax)&debugQuery=true&pf=title&ps=255&defType=edismax

The intention is to perform a search in field title and to apply a proximity 
boost within a window of 255 words. If I look at the debug information, I see:

<str name="parsedquery">
BoostedQuery(boost(+((title:michigan title:corporate title:income title:tax)~4) 
(title:"corporate income tax"~255)~1.0))
</str>

Note that the first search term (michigan) is missing in the proximity boost 
clause. I can't believe this is intended behavior. 

Why is edismax splitting (title:Michigan) and (Corporate Income Tax) while 
determining what to use for the proximity boost?

Thanks, Tom


strange edismax parsing when searching in multiple fields (#TB)

2013-03-13 Thread Burgmans, Tom
Hi group,

Background:
I have a collection containing English and French documents. I made sure to 
index the English content in field body (fieldType=text_en) and the French 
content in field body_fr (fieldType=text_fr).

The user could be either English or French, so the goal is to execute the 
queries against both fields simultaneously without knowing the query language 
upfront. The query is analyzed differently for each field. For both fields a 
stopFilter is configured, each with its own list of stopwords (different per 
language).

The issue:
When I search for 'a result' (without single quotes) in field body and 
body_fr at the same time, then a is considered a stopword in English and 
removed for field body, but not in French so both terms are still searched 
inside body_fr. What happens is that the query is parsed (edismax) into this 
construction:

((body_fr:a)~1.0 (body:result | body_fr:result)~1.0)

This query returns only French documents, although there are many English 
documents in the index that contain the term 'result' as well. How can that 
happen? I think it is related to the way my query is parsed: there seems to be 
an AND-relationship between (body_fr:a) and (body:result | body_fr:result). 
There is no English document that has (body_fr:a), so that's why they don't 
show up. For me a much more logical parsed query would be:

((body:result)~1.0 | (body_fr:a body_fr:result)~1.0)

How should I interpret this? Is it a bug in edismax? Is it intended and if yes: 
why?
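
The suspected mechanism can be modelled with a toy sketch. It is not edismax 
code: the stopword lists, function names and sample documents are invented, 
and it simply assumes one clause per query term with every clause required.

```python
# Toy model of the behaviour described above; not the edismax implementation.
STOPWORDS = {"body": {"a", "the"}, "body_fr": {"le", "la"}}  # illustrative lists


def analyze(field, term):
    """Return the lowercased term, or None if the field drops it as a stopword."""
    lowered = term.lower()
    return None if lowered in STOPWORDS[field] else lowered


def build_clauses(query, fields):
    """One clause per query term, holding the per-field variants that survive."""
    clauses = []
    for term in query.split():
        alternatives = []
        for field in fields:
            analyzed = analyze(field, term)
            if analyzed is not None:
                alternatives.append((field, analyzed))
        if alternatives:
            clauses.append(alternatives)
    return clauses


def matches(doc, clauses):
    """Every clause is required; a clause matches if any alternative matches."""
    return all(
        any(term in doc.get(field, "").lower().split() for field, term in clause)
        for clause in clauses
    )


clauses = build_clauses("a result", ["body", "body_fr"])
english_doc = {"body": "the result was good"}
french_doc = {"body_fr": "il y a un result"}
print(clauses)
print(matches(english_doc, clauses), matches(french_doc, clauses))
```

In this model the first clause only has a body_fr variant, so a document 
without body_fr content can never satisfy it, which mirrors the observation 
that only French documents come back.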

Thanks for any hint,
Tom

This email and any attachments may contain confidential or privileged 
information
and is intended for the addressee only. If you are not the intended recipient, 
please
immediately notify us by email or telephone and delete the original email and 
attachments
without using, disseminating or reproducing its contents to anyone other than 
the intended
recipient. Wolters Kluwer shall not be liable for the incorrect or incomplete 
transmission of
of this email or any attachments, nor for unauthorized use by its employees.

Wolters Kluwer nv has its registered address in Alphen aan den Rijn, The 
Netherlands, and is registered
with the Trade Registry of the Dutch Chamber of Commerce under number 33202517.


RE: [SPAM] Re: strange edismax parsing when searching in multiple fields (#TB)

2013-03-13 Thread Burgmans, Tom
The main reason for using stopwords is to speed up query performance, since we 
see that a huge part of the time is consumed by highlighting stopwords. Also, 
when reading the full highlighted document, we think a document is more 
readable when only meaningful words are highlighted.

For searching, in fact, I'd like to keep stopwords...


-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Wednesday 13 March 2013 04:43
To: solr-user@lucene.apache.org
Subject: [SPAM] Re: strange edismax parsing when searching in multiple fields 
(#TB)
Importance: Low

Or don't use stopwords. I haven't used stopwords for, oh, a dozen years or so.

Removing stopwords was a hack developed for 16-bit computers and 40 megabyte 
disks. We don't need to do that any more.

wunder

On Mar 13, 2013, at 8:28 AM, Ahmet Arslan wrote:

 I would merge stop_en.txt and stop_fr.txt. Use same set of stop words for all 
 fields that you search on.

 You might find this useful : 
 http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/

 --- On Wed, 3/13/13, Burgmans, Tom tom.burgm...@wolterskluwer.com wrote:

 From: Burgmans, Tom tom.burgm...@wolterskluwer.com
 Subject: strange edismax parsing when searching in multiple fields (#TB)
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Date: Wednesday, March 13, 2013, 5:22 PM
 Hi group,

 Background:
 I have a collection containing English and French documents.
 I made sure to index the English content in field body
 (fieldType=text_en) and the French content in field
 body_fr (fieldType=text_fr).

 The user could be either English or French so the goal is to
 execute the queries against both fields simultaneously
 without knowing the query language upfront. The query is
 analyzed differently for each field. For both fields a
 stopFilter is configured with each its own list of stopwords
 (different per language).

 The issue:
 When I search for 'a result' (without single quotes) in
 field body and body_fr at the same time, then a is
 considered a stopword in English and removed for field
 body, but not in French so both terms are still searched
 inside body_fr. What happens is that the query is parsed
 (edismax) into this construction:

 ((body_fr:a)~1.0 (body:result | body_fr:result)~1.0)

 This query returns only French documents, although there are
 many English documents in the index that contain the term
 'result' as well. How can that happen? I think it is related
 to the way my query is parsed: there seems to be an
 AND-relationship between (body_fr:a) and (body:result |
 body_fr:result). There is no English document that has
 (body_fr:a), so that's why they don't show up. For me a much
 more logical parsed query would be:

 ((body:result)~1.0 | (body_fr:a body_fr:result)~1.0)

 How should I interpret this? Is it a bug in edismax? Is it
 intended and if yes: why?

 Thanks for any hint,
 Tom


--
Walter Underwood
wun...@wunderwood.org


Search in String and Text_en fields simultaneously with edismax

2013-02-28 Thread Burgmans, Tom
I have a field valueadd of type String and field body of type text_en (with 
tokenization and linguistic processing).

When I search with edismax against field valueadd like this:
q=valueadd:(test . test2)
I see that the parsed query is
(valueadd:test valueadd:. valueadd:test2)~3

Why not (valueadd:test . test2) ? It looks like the query is tokenized while 
field type String doesn't have a tokenizer configured.

I know I could construct my query as:
q=valueadd:"test . test2"
in which case the phrase is searched as a whole against valueadd. But why 
doesn't that happen without quotes?


The reason I ask:
For a simultaneous search in multiple fields I'd like to include field valueadd 
in the qf parameter, which then contains both String and text_en fields, like:
qf=valueadd body

How can I search both fields simultaneously without duplicating search terms, 
while the query is (whitespace) tokenized for body but searched as a phrase 
for valueadd?

Thanks,
Tom Burgmans


RE: Search in String and Text_en fields simultaneously with edismax

2013-02-28 Thread Burgmans, Tom
Ah OK. I didn't have a good view of query parsing vs query generation. Thanks 
for clearing this up.

So it means that searching in a tokenized and non-tokenized field 
simultaneously is not possible when I want
- the expression parsed as phrase for the non-tokenized field
- the expression parsed as multiple tokens for the tokenized field
?

If possible, I'd like to avoid writing my own query parser.



-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Thursday 28 February 2013 05:05
To: solr-user@lucene.apache.org
Subject: Re: Search in String and Text_en fields simultaneously with edismax

Query text is always tokenized (more properly, parsed), unless the text
is enclosed in quotes or spaces are escaped with backslash. Try:

q=valueadd:"test . test2"

or

q=valueadd:test\ .\ test2

Parentheses simply provide grouping, either to control boolean operator
evaluation order or to apply a field name to a sequence of query tokens (as
you have written.)

The analyzer or field type is only consulted when the query is generated,
not while it is being parsed. The same identical parsing rules apply to both
tokenized and non-tokenized fields. What a field type's analyzer does with
its value is irrelevant to query parsing.
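
The parsing rule described above (split on whitespace unless the text is 
quoted or the space is escaped) can be sketched as follows. This is a toy 
splitter written for illustration, not the Solr query parser, and the function 
name is hypothetical; note that it never looks at a field type.

```python
def split_query_text(text):
    """Split on whitespace, keeping quoted or backslash-escaped text together."""
    tokens = []
    current = []
    in_quotes = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # An escaped character is taken literally, including a space.
            current.append(text[i + 1])
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes
        elif ch.isspace() and not in_quotes:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
        i += 1
    if current:
        tokens.append("".join(current))
    return tokens


print(split_query_text('test . test2'))        # three separate tokens
print(split_query_text('"test . test2"'))      # one token: quoted
print(split_query_text('test\\ .\\ test2'))    # one token: escaped spaces
```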

-- Jack Krupansky

-Original Message-
From: Burgmans, Tom
Sent: Thursday, February 28, 2013 10:48 AM
To: solr-user@lucene.apache.org
Subject: Search in String and Text_en fields simultaneously with edismax

I have a field valueadd of type String and field body of type text_en
(with tokenization and linguistic processing).

When I search with edismax against field valueadd like this:
q=valueadd:(test . test2)
I see that the parsed query is
(valueadd:test valueadd:. valueadd:test2)~3

Why not (valueadd:test . test2) ? It looks like the query is tokenized while
field type String doesn't have a tokenizer configured.

I know I could construct my query as:
q=valueadd:"test . test2"
in which case the phrase is searched as a whole against valueadd. But why
doesn't that happen without quotes?


The reason I ask:
For a simultaneous search in multiple fields I'd like to include field
valueadd in the qf parameter, which then contains both String and text_en
fields, like:
qf=valueadd body

How can I search both fields simultaneously without duplicating search
terms, while the query is (whitespace) tokenized for body but searched as a
phrase for valueadd?

Thanks,
Tom Burgmans


How do I let the FVH highlight individual terms instead of the complete phrase?

2012-12-21 Thread Burgmans, Tom
Hi group,

I'm trying to highlight my complete(!) XML document, which is indexed for that 
purpose in a special field called wkxmlsource. I configured the wkxmlsource 
field like

<field indexed="true" multiValued="false" name="wkxmlsource" omitNorms="true" 
stored="true" termPositions="true" termOffsets="true" termVectors="true" 
type="text_xml"/>

The text_xml field type is almost identical to the text_en field type, but with 
<charFilter class="solr.HTMLStripCharFilterFactory"/> as the first element in 
the index analyzer. That prevents highlighting inside XML tags.

First I tried the simple highlighter and that almost worked: I get my document 
back with my search terms and phrases highlighted, and each individual term 
gets its own highlight tags. But the problem is that the value of field 
wkxmlsource is not returned completely; the bottom part is cut off, no matter 
how big I set hl.fragsize.

So my next try was to use the FVH (hl.useFastVectorHighlighter=true) instead. 
That helped: it now returns the complete value of wkxmlsource with all my 
search terms/phrases highlighted. But in case of a phrase search, it no longer 
highlights each individual term; it only puts highlight tags around the 
complete phrase. That could possibly lead to malformed XML. An example:

When searching for the phrase "across the country Santa Fe", it highlights 
like this in the document:

<para align="left">...spread <em>across the country.</para><para 
align="left">Santa Fe</em> Pacific... </para>

How can I let the FVH highlight individual terms instead of the complete 
phrase? Ideally I'd like to have something like:

<para align="left">...spread <em>across</em> <em>the</em> 
<em>country</em>.</para><para align="left"><em>Santa</em> <em>Fe</em> 
Pacific... </para>

which is still valid XML.
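
Failing a highlighter setting for this, one possible workaround is to 
post-process the highlighted text. The sketch below is such a post-processing 
step written for illustration (the function name is hypothetical and it is not 
an FVH feature): it drops the phrase-wide tags and wraps every word that was 
inside them individually.

```python
import re

# Hypothetical post-processing step; assumes highlight tags are exactly
# "<em>" and "</em>" and that other markup is ordinary "<...>" tags.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def rewrap_per_word(fragment, open_tag="<em>", close_tag="</em>"):
    """Replace phrase-wide highlight tags by one highlight per word."""
    output = []
    inside = False
    for token in TOKEN.findall(fragment):
        if token == open_tag:
            inside = True
        elif token == close_tag:
            inside = False
        elif token.startswith("<") or not inside:
            output.append(token)  # other tags and unhighlighted text stay as-is
        else:
            output.append(
                re.sub(r"\w+", lambda m: open_tag + m.group(0) + close_tag, token)
            )
    return "".join(output)


fvh_output = ('<para align="left">...spread <em>across the country.</para>'
              '<para align="left">Santa Fe</em> Pacific... </para>')
print(rewrap_per_word(fvh_output))
```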

My boundaryscanner is configured like:

<boundaryScanner name="breakIterator" class="solr.highlight.BreakIteratorBoundaryScanner">
  <lst name="defaults">
    <str name="hl.bs.type">WORD</str>
    <str name="hl.bs.language">en</str>
    <str name="hl.bs.country">US</str>
  </lst>
</boundaryScanner>


Thanks, Tom
--
Tom Burgmans

Search Specialist


Tel:  +31 (0)17 246 66 33
Mobile: +31 (0)6 306 821 78

Platform Technologies
Global Platform Organization

Zuidpoolsingel 2
2408 ZE, Alphen aan den Rijn The Netherlands

tom.burgm...@wolterskluwer.com


www.wolterskluwer.com


edismax: implicit AND changes into implicit OR

2012-12-12 Thread Burgmans, Tom
Hi all,

I wonder if this is a bug or expected behavior:

I have some documents indexed; 3 of them contain Thomas and 4 of them contain 
Michael, but none of them contains both. A search for
http://localhost:8983/solr/collection1/browse?defType=edismax&q=(Thomas+Michael)
returns 0 results as expected since there is an implicit AND between the two 
terms and there is no document that matches both. But a search for
http://localhost:8983/solr/collection1/browse?defType=edismax&q=(Thomas+Michael)+OR+xxxmatchesnothingxxx
returns 7 results. For some reason the implicit AND turns into an implicit OR 
when an explicit OR is added to the query expression. The parsedquery 
information confirms this behavior.

Why is edismax doing this?

Tested on a Solr 4.0.0 instance.

Thanks, Tom


RE: edismax: implicit AND changes into implicit OR

2012-12-12 Thread Burgmans, Tom
I have set <solrQueryParser defaultOperator="AND"/> in the schema (and 
restarted Solr), and tested again with

http://localhost:8983/solr/collection1/browse?defType=edismax&q=(Thomas+Michael)+OR+xxxmatchesnothingxxx&q.op=AND

Note the extra parameter. It still returns the 7 documents that match (Thomas 
OR Michael), not (Thomas AND Michael).

The only way to enforce an implicit AND is by changing the query into

http://localhost:8983/solr/collection1/browse?defType=edismax&q=(%2BThomas+%2BMichael)+OR+%2Bxxxmatchesnothingxxx

But then the AND isn't implicit anymore, and I don't like to prefix all my 
search terms with a +.


-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Wednesday 12 December 2012 05:46
To: solr-user@lucene.apache.org
Subject: Re: edismax: implicit AND changes into implicit OR

On 12/12/2012 5:51 AM, Burgmans, Tom wrote:
 I have some documents indexed; 3 of them contain Thomas and 4 of
 them contain Michael, but none of them contains both. A search for

 http://localhost:8983/solr/collection1/browse?defType=edismax&q=(Thomas+Michael)

 returns 0 results as expected since there is an implicit AND between
 the two terms and there is no document that matches both. But a search
 for

 http://localhost:8983/solr/collection1/browse?defType=edismax&q=(Thomas+Michael)+OR+xxxmatchesnothingxxx

 returns 7 results. For some reason the implicit AND turns into an
 implicit OR when an explicit OR is added to the query expression.
 The parsedquery information confirms this behavior.



I'll give you my best guess, nothing to back this up but instinct. The
following statements (especially the second one) may be wrong:

When you do not include any boolean operators, edismax is using its mm
parameter, which defaults to 100%, meaning that all search terms must
match (equivalent to a default operator of AND).

When you DO include a boolean operator, mm goes out the window and
edismax reverts to using the default operator for solr, your schema, or
the request handler, which unless you have changed it, is OR.
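
The guess above can be turned into a toy model to see that it would reproduce 
the reported counts. This only models the hypothesis; it is not how edismax is 
implemented, and the document set and function name are invented.

```python
# Toy model of the hypothesis only: all terms required when the query has no
# explicit operator, any term sufficient once an explicit operator appears.
DOCS = [{"Thomas"}] * 3 + [{"Michael"}] * 4   # 3 + 4 documents, none with both


def count_matches(docs, terms, has_explicit_operator):
    """Count documents under the hypothesised switch between AND and OR."""
    if has_explicit_operator:
        return sum(1 for doc in docs if any(term in doc for term in terms))
    return sum(1 for doc in docs if all(term in doc for term in terms))


print(count_matches(DOCS, ["Thomas", "Michael"], has_explicit_operator=False))
print(count_matches(DOCS, ["Thomas", "Michael", "xxxmatchesnothingxxx"],
                    has_explicit_operator=True))
```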

Thanks,
Shawn


RE: edismax: implicit AND changes into implicit OR

2012-12-12 Thread Burgmans, Tom
Yes /browse returns velocity stuff, but I mostly add wt=xml in the query. And 
yes, I looked at the parsedquery feedback that debugQuery=true provides. That 
basically confirms my idea that the implicit AND is indeed switched to an 
implicit OR in case an explicit OR is somewhere else present in the query. Even 
the default operator set to AND seems to be overruled.

Thanks, I'll think about submitting a Jira.


-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Wednesday 12 December 2012 06:43
To: solr-user@lucene.apache.org
Subject: Re: edismax: implicit AND changes into implicit OR

On 12/12/2012 10:27 AM, Burgmans, Tom wrote:
 I have set <solrQueryParser defaultOperator="AND"/> in the schema (and 
 restarted Solr), and tested again with

 http://localhost:8983/solr/collection1/browse?defType=edismax&q=(Thomas+Michael)+OR+xxxmatchesnothingxxx&q.op=AND

 note the extra parameter. Still it returns the 7 documents that match 
 (Thomas OR Michael), but not (Thomas AND Michael).

 The only way to enforce an implicit AND is by changing the query into

 http://localhost:8983/solr/collection1/browse?defType=edismax&q=(%2BThomas+%2BMichael)+OR+%2Bxxxmatchesnothingxxx

 But then the AND isn't implicit anymore...and I don't like to prefix all my 
 search terms with a +.

It smells like a bug to me, so you should probably file an issue in
Jira.  I will admit that this is getting somewhat outside my experience
level.

I noticed the /browse there ... is this just what you have named your
handler, or is this connected with the Velocity stuff?

Have you tried adding debugQuery=true to your URL and seeing what your
different queries actually parse to?  It may also be a good idea to add
echoParams=all so you can see all parameters that are going into the
request.

Thanks,
Shawn


RE: Can a field with defined synonym be searched without the synonym?

2012-12-12 Thread Burgmans, Tom
In our case it's the opposite. For our clients it is very important that every 
synonym gets an equal chance in the relevancy calculation. The fact that "nol" 
scores higher than "net operating loss", simply because its document frequency 
is lower, is unacceptable and a reason to look for ways to remove the IDF from 
the score calculation. But that is something I'd rather not do, since IDF is 
such an elementary part of the algorithm (and very useful for non-synonym 
searches).
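
The effect can be illustrated with a short calculation. The sketch assumes the 
classic TF-IDF formula idf = 1 + ln(maxDocs / (docFreq + 1)), which reproduces 
the idf(docFreq=17, maxDocs=192) value in the EXPLAIN output quoted earlier in 
this archive; the document frequencies for the two synonym variants are made 
up.

```python
import math


def classic_idf(doc_freq, max_docs):
    """Classic TF-IDF inverse document frequency (assumed formula, see above)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# Made-up frequencies: the abbreviation is rarer than the spelled-out form.
idf_abbreviation = classic_idf(doc_freq=5, max_docs=1000)
idf_full_form = classic_idf(doc_freq=50, max_docs=1000)

print(round(classic_idf(17, 192), 7))      # value from the EXPLAIN example
print(idf_abbreviation > idf_full_form)    # the rarer variant gets the higher idf
```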

Pre-processing synonyms to apply 'reverse weighting' is also a strategy to 
consider, but I agree with Walter that this is very error-prone; things could 
easily get out of sync. Moreover, none of our Dev, QA, STG and PRD environments 
contains exactly the same content, so it would require a differently tuned 
synonym dictionary for each of them...meh...

In our previous search engine (FAST ESP) we basically switched off IDF, but I 
am still a bit hoping that there is a more sophisticated solution with Solr.


-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Thursday 13 December 2012 02:30
To: solr-user@lucene.apache.org
Subject: Re: Can a field with defined synonym be searched without the synonym?

All of the applications I've seen with user control over synonym expansion 
were recall-oriented: the "give me all matches for X" kind of problem. So 
ranking is not as important.

wunder

On Dec 12, 2012, at 5:23 PM, Roman Chyla wrote:

 Well, this IDF problem has more sides. So, let's say your synonym file
 contains multi-token synonyms (it does, right? or perhaps you don't need
 it? well, some people do)

 TV, TV set, TV foo, television

 if you use the default synonym expansion, then when you index 'television'
 you also increase the frequency of 'set' and 'foo'. So the IDF of 'TV' is
 the same as that of 'television', but the IDF of 'foo' and 'set' has changed
 (their frequency increased, their IDF decreased) -- TVs have in fact made
 the 'foo' term very frequent and undesirable

 So, you might be sure that IDF of 'TV' and 'television' are the same, but
 you are not aware it has 'screwed' other (desirable) terms - so it really
 depends. And I wouldn't argue these cases are esoteric.

 And finally: there are use cases out there, where people NEED to switch off
 synonym expansion at will (find only these documents, that contain the word
 'TV' and not that bloody 'foo'). This cannot be done if the index contains
 all synonym terms (unless you have a way to mark the original and the
 synonym in the index).

 roman


 On Wed, Dec 12, 2012 at 12:50 PM, Walter Underwood wun...@wunderwood.org wrote:

 Query parsers cannot fix the IDF problem or make query-time synonyms
 faster. Query synonym expansion makes more search terms. More search terms
 are more work at query time.

 The IDF problem is real; I've run up against it. The rarest variant of
 the synonym has the highest score. This is probably the opposite of what you
 want. For me, it was TV and television. Documents with TV had higher
 scores than those with television.

 wunder

 On Dec 12, 2012, at 9:45 AM, Roman Chyla wrote:

 @wunder
 It is a misconception (well, supported by that wiki description) that the
 query time synonym filter has these problems. It is actually the default
 parser that is causing these problems. Look at this if you still think
 that index time synonyms are a cure for all:
 https://issues.apache.org/jira/browse/LUCENE-4499

 @joe
 If you can use the flexible query parser (as linked in by @Swati) then all
 you need to do is to define a different field with a different tokenizer
 chain and then swap the field names before the analyzer processes the
 document (and then rewrite the field name back - for example, we have
 fields called author and author_nosyn)
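The field swap described above can be illustrated outside Solr with a minimal Python sketch. It is only an illustration of the idea, not the flexible query parser itself: the function names and the simple field:value clause syntax are assumptions, and only the author / author_nosyn naming comes from the message.

```python
# Hypothetical helpers: route a query clause to a synonym-free copy of a field.
# Assumes every synonym-expanded field "X" has a sibling "X_nosyn" whose
# analyzer chain has no synonym filter (as with author / author_nosyn).

def rewrite_field(field: str, expand_synonyms: bool, suffix: str = "_nosyn") -> str:
    """Return the field to search: the original, or its no-synonym sibling."""
    return field if expand_synonyms else field + suffix

def rewrite_clause(clause: str, expand_synonyms: bool) -> str:
    """Rewrite a simple 'field:value' clause; clauses without a field are kept."""
    field, sep, value = clause.partition(":")
    if not sep:
        return clause
    return f"{rewrite_field(field, expand_synonyms)}:{value}"

print(rewrite_clause("author:einstein", expand_synonyms=True))
print(rewrite_clause("author:einstein", expand_synonyms=False))
```

The cost of this approach is a second indexed copy of each field; the benefit is that synonym expansion can be switched per query without reindexing.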

 roman

 On Wed, Dec 12, 2012 at 12:38 PM, Walter Underwood wun...@wunderwood.org wrote:

 Query time synonyms have known problems. They are slower, cause incorrect
 IDF, and don't work for phrase synonyms.

 Apply synonyms at index time and you will have none of those problems.

 See:

 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

 wunder

 On Dec 12, 2012, at 9:34 AM, Swati Swoboda wrote:

 Query-time analyzers are still applied, even if you include a string in
 quotes. Would you expect foo to not match Foo just because it's
 enclosed in quotes?

 Also look at this, someone who had similar requirements:


 http://lucene.472066.n3.nabble.com/Synonym-Filter-disable-at-query-time-td2919876.html


 -Original Message-
 From: joe.cohe...@gmail.com [mailto:joe.cohe...@gmail.com]
 Sent: Wednesday, December 12, 2012 12:09 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Can a field with defined synonym be searched without the
 synonym?


 I'm applying only query-time synonyms, so I have the original values
 stored and indexed.
 I would've expected that if I search a string with quotations, I'll get
 the exact match, without applying a 

RE: score calculation

2012-12-12 Thread Burgmans, Tom
I am also busy getting this clear. Here are my notes so far (partly copied, 
partly written myself):



queryWeight = the impact of the query against the field
implementation: boost(query)*idf*queryNorm


boost(query) = boost of the field at query-time
Implication: hits in fields with higher boost get a higher score
Rationale: a term in field A could be more relevant than the same term 
in field B


idf = inverse document frequency = measure of how rare the term is across 
the index for this field
implementation: log(numDocs/(docFreq+1))+1
Implication: the more documents a term occurs in, the lower its score
Rationale: common terms are less important than uncommon ones
numDocs = the total number of documents in the index, not including those 
that are marked as deleted but have not yet been purged. This is a constant 
(the same value for all documents in the index).
docFreq = the number of documents in the index which contain the term in 
this field. This is a constant (the same value for all documents in the index 
containing this field)
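
To make the idf formula concrete, here is a small Python sketch. It assumes the natural logarithm (as in Lucene's classic similarity); the index size and document frequencies are made-up numbers for illustration only.

```python
import math

def idf(num_docs: int, doc_freq: int) -> float:
    """idf as given in the notes: log(numDocs/(docFreq+1)) + 1 (natural log)."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0

# Hypothetical index of one million documents.
NUM_DOCS = 1_000_000

# A rare term, a moderately common term and a very common term.
for doc_freq in (9, 999, 99_999):
    print(f"docFreq={doc_freq:>6}  idf={idf(NUM_DOCS, doc_freq):.3f}")
```

The rarer the term, the larger the idf, which is exactly why a rare synonym variant outscores a common one.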


queryNorm = normalization factor so that queries can be compared
implementation: 1/sqrt(sumOfSquaredWeights)
Implication: doesn't impact the relevancy of this result
Rationale: queryNorm is not related to the relevance of the document, 
but rather tries to make scores between different queries comparable. This 
value is equal for all results of the query
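
A rough Python sketch of queryNorm and queryWeight for a two-term query follows. It assumes that each term contributes (boost*idf)^2 to sumOfSquaredWeights, as in Lucene's classic TF-IDF similarity; the terms, boosts and document frequencies are hypothetical.

```python
import math

def idf(num_docs, doc_freq):
    """Classic idf: log(numDocs/(docFreq+1)) + 1."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0

def query_norm(raw_term_weights):
    """1/sqrt(sumOfSquaredWeights); each raw weight is boost * idf of one term."""
    sum_of_squared_weights = sum(w * w for w in raw_term_weights)
    return 1.0 / math.sqrt(sum_of_squared_weights)

def query_weight(boost, term_idf, norm):
    """queryWeight = boost(query) * idf * queryNorm."""
    return boost * term_idf * norm

# Hypothetical query: term -> (query-time boost, docFreq), index of 10,000 docs.
NUM_DOCS = 10_000
terms = {"solr": (1.0, 50), "payload": (2.0, 5)}

raw = {t: b * idf(NUM_DOCS, df) for t, (b, df) in terms.items()}
norm = query_norm(raw.values())
weights = {t: query_weight(b, idf(NUM_DOCS, df), norm) for t, (b, df) in terms.items()}

for term, weight in weights.items():
    print(f"{term}: queryWeight={weight:.4f}")
```

Because queryNorm depends only on the query, it scales every document's score by the same factor and leaves the ranking unchanged.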


fieldWeight = the score of a term matching the field
implementation: tf*idf*fieldNorm


tf = term frequency in a field = measure of how often a term appears in the 
field
implementation: sqrt(freq)
Implication: the more frequently a term occurs in a field, the greater 
its score
Rationale: fields which contain more occurrences of a term are generally more 
relevant
freq = termFreq = number of times the term occurs in the field for this 
document
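
The square root dampens the effect of repetition, as this short Python sketch shows (the frequencies are arbitrary examples):

```python
import math

def tf(freq: int) -> float:
    """tf as given in the notes: sqrt(freq)."""
    return math.sqrt(freq)

# Quadrupling the number of occurrences only doubles the tf factor.
for freq in (1, 4, 16):
    print(f"freq={freq:>2}  tf={tf(freq):.1f}")
```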


fieldNorm = impact of a hit in this field
implementation: lengthNorm*boost(index)
lengthNorm = measure of the importance of a term according to the total 
number of terms in the field
implementation: 1/sqrt(numTerms)
Implication: a term matched in a field with fewer terms has a higher 
score
Rationale: a term in a field with fewer terms is more important than one 
in a field with more
numTerms = number of terms in a field
boost(index) = boost of the field at index-time
Implication: hits in fields with higher boost get a higher score
Rationale: a term in field A could be more relevant than the same term 
in field B
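
Putting tf, idf and fieldNorm together gives fieldWeight. The Python sketch below uses hypothetical input values and ignores the fact that Lucene stores the norm in a lossy single-byte encoding, so fieldNorm values in real EXPLAIN output are rounded.

```python
import math

def length_norm(num_terms: int) -> float:
    """lengthNorm = 1/sqrt(numTerms)."""
    return 1.0 / math.sqrt(num_terms)

def field_norm(num_terms: int, index_boost: float = 1.0) -> float:
    """fieldNorm = lengthNorm * boost(index)."""
    return length_norm(num_terms) * index_boost

def field_weight(freq: int, term_idf: float, num_terms: int,
                 index_boost: float = 1.0) -> float:
    """fieldWeight = tf * idf * fieldNorm, with tf = sqrt(freq)."""
    return math.sqrt(freq) * term_idf * field_norm(num_terms, index_boost)

# The same single hit counts for more in a short field than in a long one.
print(field_weight(freq=1, term_idf=3.0, num_terms=4))    # short title field
print(field_weight(freq=1, term_idf=3.0, num_terms=400))  # long body field
```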


maxDocs = the number of documents in the index, including those that are 
marked as deleted but have not yet been purged. This is a constant (the same 
value for all documents in the index)
Implication: (probably) doesn't play a role in the scoring calculation


coord = fraction of the terms in the query that were found in the document 
(omitted if equal to 1)
implementation: overlap/maxOverlap
Implication: of the terms in the query, a document that contains more 
terms will have a higher score
Rationale: documents that match the most optional terms score highest
overlap = the number of query terms matched in the document
maxOverlap = the total number of terms in the query
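
A minimal Python sketch of coord, using a made-up three-term query and treating a document simply as a set of terms:

```python
def coord(overlap: int, max_overlap: int) -> float:
    """coord = overlap / maxOverlap."""
    return overlap / max_overlap

def coord_for(query_terms, doc_terms) -> float:
    """Count how many query terms occur in the document, then apply coord."""
    overlap = sum(1 for term in query_terms if term in doc_terms)
    return coord(overlap, len(query_terms))

query = ["net", "operating", "loss"]               # hypothetical query
print(coord_for(query, {"net", "operating", "loss", "carryback"}))
print(coord_for(query, {"net", "income"}))
```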


FunctionQuery = could be any kind of custom ranking function, whose outcome 
is added to, or multiplied with, the default rank score.
Implication: various


Look at the EXPLAIN information to see how the final score is calculated.
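
As a rough sketch of how the pieces combine, the Python below follows the classic Lucene TF-IDF scoring function: per matching term, queryWeight is multiplied by fieldWeight, the products are summed, and the sum is multiplied by coord. It assumes a single field, a non-empty query and no function queries; all names and numbers are hypothetical, and real scores will differ slightly because of the lossy norm encoding.

```python
import math

def idf(num_docs, doc_freq):
    """Classic idf: log(numDocs/(docFreq+1)) + 1."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0

def score(query_boosts, doc_freqs, num_docs, term_freqs, num_terms, index_boost=1.0):
    """coord * sum over matching terms of (queryWeight * fieldWeight)."""
    idfs = {t: idf(num_docs, doc_freqs[t]) for t in query_boosts}
    sum_sq = sum((query_boosts[t] * idfs[t]) ** 2 for t in query_boosts)
    query_norm = 1.0 / math.sqrt(sum_sq)
    field_norm = index_boost / math.sqrt(num_terms)

    total, overlap = 0.0, 0
    for term, boost in query_boosts.items():
        freq = term_freqs.get(term, 0)
        if freq == 0:
            continue  # term not in this document: no contribution
        overlap += 1
        query_weight = boost * idfs[term] * query_norm
        field_weight = math.sqrt(freq) * idfs[term] * field_norm
        total += query_weight * field_weight
    return (overlap / len(query_boosts)) * total

# Hypothetical two-term query against an index of 10,000 documents.
query = {"net": 1.0, "loss": 1.0}
doc_freqs = {"net": 500, "loss": 200}
both = score(query, doc_freqs, 10_000, {"net": 2, "loss": 1}, num_terms=16)
one = score(query, doc_freqs, 10_000, {"net": 2}, num_terms=16)
print(f"matches both terms: {both:.4f}")
print(f"matches one term:   {one:.4f}")
```

Note that idf enters twice per term, once through queryWeight and once through fieldWeight, which is why document frequency has such a strong effect on the ranking of synonyms.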

Tom


-Original Message-
From: Sangeetha [mailto:sangeetha...@gmail.com]
Sent: Thursday 13 December 2012 08:33
To: solr-user@lucene.apache.org
Subject: score calculation


I want to know how the score is calculated.

What are fieldWeight, fieldNorm, queryWeight and queryNorm? And what is the
formula to get the final score using fieldWeight, fieldNorm, queryWeight,
queryNorm, idf and tf?

Can anyone explain or provide some links?

Thanks,
Sangeetha



--
View this message in context: 
http://lucene.472066.n3.nabble.com/score-calculation-tp4026669.html
Sent from the Solr - User mailing list archive at Nabble.com.


Wolters Kluwer nv has its registered address in Alphen aan den Rijn, The 
Netherlands, and is registered
with the Trade Registry of the Dutch Chamber of Commerce under number