Re: strange edismax parsing when searching in multiple fields (#TB)

Walter Underwood Wed, 13 Mar 2013 08:43:22 -0700

Or don't use stopwords. I haven't used stopwords for, oh, a dozen years or so.


Removing stopwords was a hack developed for 16-bit computers and 40 megabyte 
disks. We don't need to do that any more.

wunder

On Mar 13, 2013, at 8:28 AM, Ahmet Arslan wrote:

> I would merge stop_en.txt and stop_fr.txt. Use same set of stop words for all 
> fields that you search on.
> 
> You might find this useful : 
> http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/
> 
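
For readers who want to try the merge suggested above, here is a minimal sketch. It assumes both files use the plain one-word-per-line format with "#" comment lines; the output name stop_merged.txt and the helper names are hypothetical, not anything from Solr itself.

```python
from pathlib import Path


def read_stopwords(path):
    """Read one stopword per line, skipping blank lines and '#' comments."""
    words = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        word = line.strip()
        if word and not word.startswith("#"):
            words.add(word)
    return words


def merge_stopwords(paths, out_path):
    """Write the sorted union of all stopword files to out_path and return it."""
    merged = set()
    for path in paths:
        merged |= read_stopwords(path)
    Path(out_path).write_text("\n".join(sorted(merged)) + "\n", encoding="utf-8")
    return merged


# Example call (file names as in the thread, output name is an assumption):
# merge_stopwords(["stop_en.txt", "stop_fr.txt"], "stop_merged.txt")
```

Both field types would then point their stop filter at the merged file, so a term is either dropped from every field or kept in every field. The trade-off is that words such as "a" are also dropped from the French field.
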
> --- On Wed, 3/13/13, Burgmans, Tom <tom.burgm...@wolterskluwer.com> wrote:
> 
>> From: Burgmans, Tom <tom.burgm...@wolterskluwer.com>
>> Subject: strange edismax parsing when searching in multiple fields (#TB)
>> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
>> Date: Wednesday, March 13, 2013, 5:22 PM
>> Hi group,
>> 
>> Background:
>> I have a collection containing English and French documents.
>> I made sure to index the English content in field "body"
>> (fieldType=text_en) and the French content in field
>> "body_fr" (fieldType=text_fr).
>> 
>> The user could be either English or French, so the goal is to
>> execute the queries against both fields simultaneously
>> without knowing the query language upfront. The query is
>> analyzed differently for each field. Both fields have a
>> stopFilter configured, each with its own list of stopwords
>> (different per language).
>> 
>> The issue:
>> When I search for 'a result' (without single quotes) in
>> field "body" and "body_fr" at the same time, then "a" is
>> considered a stopword in English and removed for field
>> "body", but not in French so both terms are still searched
>> inside "body_fr". What happens is that the query is parsed
>> (edismax) into this construction:
>> 
>> ((body_fr:a)~1.0 (body:result | body_fr:result)~1.0)
>> 
>> This query returns only French documents, although there are
>> many English documents in the index that contain the term
>> 'result' as well. How can that happen? I think it is related
>> to the way my query is parsed: there seems to be an
>> AND-relationship between (body_fr:a) and (body:result |
>> body_fr:result). There is no English document that has
>> (body_fr:a), so that's why they don't show up. To me, a much
>> more logical parsed query would be:
>> 
>> ((body:result)~1.0 | (body_fr:a body_fr:result)~1.0)
>> 
>> How should I interpret this? Is it a bug in edismax? Is it
>> intended and if yes: why?
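
The behaviour described above can be reproduced with a toy model. This is a sketch, not Solr code: it assumes the parser builds one clause per query term, that each clause lists the fields where the term survives analysis, and that every clause must match (as the results in the post suggest). The stopword sets and sample documents are made up for illustration.

```python
# Toy model of per-term, cross-field query building; not Solr's implementation.
STOPWORDS = {
    "body": {"a", "the", "of"},      # assumed English stopwords
    "body_fr": {"le", "la", "de"},   # assumed French stopwords
}


def analyze(field, term):
    """Return the lowercased term, or None if the field's stop filter drops it."""
    token = term.lower()
    return None if token in STOPWORDS[field] else token


def build_clauses(query, fields):
    """Build one clause per query term, listing (field, token) pairs that survive."""
    clauses = []
    for term in query.split():
        alternatives = []
        for field in fields:
            token = analyze(field, term)
            if token is not None:
                alternatives.append((field, token))
        if alternatives:  # a term removed from every field yields no clause
            clauses.append(alternatives)
    return clauses


def matches(doc, clauses):
    """A document matches only if every clause matches in at least one field."""
    return all(
        any(token in doc.get(field, "").lower().split() for field, token in clause)
        for clause in clauses
    )


clauses = build_clauses("a result", ["body", "body_fr"])
english_doc = {"body": "final result"}       # hypothetical English document
french_doc = {"body_fr": "il y a un result"}  # hypothetical French document
```

Here `clauses` has the same shape as the parsed query in the post: the first clause contains only `body_fr:a`, because "a" is a stopword for "body". Under the all-clauses-required assumption, `matches(english_doc, clauses)` is false and `matches(french_doc, clauses)` is true, since an English document has nothing in "body_fr" to satisfy the first clause.
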
>> 
>> Thanks for any hint,
>> Tom
>> 

--
Walter Underwood
wun...@wunderwood.org
