RE: [SPAM] Re: strange edismax parsing when searching in multiple fields (#TB)

Burgmans, Tom Wed, 13 Mar 2013 08:56:19 -0700

The main reason we use stopwords is to speed up query performance, since we 
see that a large share of the query time is spent highlighting stopwords. 
Also, when reading the full highlighted document, we think the document is 
more readable when only meaningful words are highlighted.


For searching, I would in fact prefer to keep stopwords...


-----Original Message-----
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Wednesday 13 March 2013 04:43
To: solr-user@lucene.apache.org
Subject: [SPAM] Re: strange edismax parsing when searching in multiple fields 
(#TB)
Importance: Low

Or don't use stopwords. I haven't used stopwords for, oh, a dozen years or so.

Removing stopwords was a hack developed for 16-bit computers and 40 megabyte 
disks. We don't need to do that any more.

wunder

On Mar 13, 2013, at 8:28 AM, Ahmet Arslan wrote:

> I would merge stop_en.txt and stop_fr.txt. Use same set of stop words for all 
> fields that you search on.
>
> You might find this useful : 
> http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/
>
> --- On Wed, 3/13/13, Burgmans, Tom <tom.burgm...@wolterskluwer.com> wrote:
>
>> From: Burgmans, Tom <tom.burgm...@wolterskluwer.com>
>> Subject: strange edismax parsing when searching in multiple fields (#TB)
>> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
>> Date: Wednesday, March 13, 2013, 5:22 PM
>> Hi group,
>>
>> Background:
>> I have a collection containing English and French documents.
>> I made sure to index the English content in field "body"
>> (fieldType=text_en) and the French content in field
>> "body_fr" (fieldType=text_fr).
>>
>> The user could be either English or French so the goal is to
>> execute the queries against both fields simultaneously
>> without knowing the query language upfront. The query is
>> analyzed differently for each field. For both fields a
>> stopFilter is configured, each with its own list of stopwords
>> (different per language).
>>
>> The issue:
>> When I search for 'a result' (without single quotes) in
>> field "body" and "body_fr" at the same time, then "a" is
>> considered a stopword in English and removed for field
>> "body", but not in French so both terms are still searched
>> inside "body_fr". What happens is that the query is parsed
>> (edismax) into this construction:
>>
>> ((body_fr:a)~1.0 (body:result | body_fr:result)~1.0)
>>
>> This query returns only French documents, although there are
>> many English documents in the index that contain the term
>> 'result' as well. How can that happen? I think it is related
>> to the way my query is parsed: there seems to be an
>> AND-relationship between (body_fr:a) and (body:result |
>> body_fr:result). There is no English document that has
>> (body_fr:a), so that's why they don't show up. For me a much
>> more logical parsed query would be:
>>
>> ((body:result)~1.0 | (body_fr:a body_fr:result)~1.0)
>>
>> How should I interpret this? Is it a bug in edismax? Is it
>> intended and if yes: why?
>>
>> Thanks for any hint,
>> Tom
>>
>> This email and any attachments may contain confidential or
>> privileged information
>> and is intended for the addressee only. If you are not the
>> intended recipient, please
>> immediately notify us by email or telephone and delete the
>> original email and attachments
>> without using, disseminating or reproducing its contents to
>> anyone other than the intended
>> recipient. Wolters Kluwer shall not be liable for the
>> incorrect or incomplete transmission
>> of this email or any attachments, nor for unauthorized use
>> by its employees.
>>
>> Wolters Kluwer nv has its registered address in Alphen aan
>> den Rijn, The Netherlands, and is registered
>> with the Trade Registry of the Dutch Chamber of Commerce
>> under number 33202517.
>>
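
The parsing behaviour described in the quoted question can be reproduced
with a small toy model. This is only a sketch in Python, not Solr code: the
stopword sets, the two sample documents and the helper names
(build_clauses, matches, search) are illustrative assumptions, and it
assumes every clause is required (for example mm=100% or a default operator
of AND), which the thread suggests but does not state. It shows how one
clause is built per query term across all fields, why a term that is a
stopword in only one field leaves a clause that English documents can never
satisfy, and how using the same stopword set for every field avoids that.

```python
# Toy model of per-term clause building in a dismax-style parser.
# Field names come from the thread; everything else is an assumption.

STOPWORDS = {
    "body": {"a", "the", "of"},      # assumed English stop list
    "body_fr": {"le", "la", "de"},   # assumed French stop list, "a" is kept
}

# Same stop list for every searched field, as suggested in the thread.
MERGED = {field: set().union(*STOPWORDS.values()) for field in STOPWORDS}


def build_clauses(query, stopwords):
    """Build one clause per query term.

    Each clause lists the (field, term) pairs that survive that field's
    stop filter. A term removed in every field produces no clause at all.
    """
    clauses = []
    for term in query.lower().split():
        alternatives = [
            (field, term)
            for field, stops in stopwords.items()
            if term not in stops
        ]
        if alternatives:
            clauses.append(alternatives)
    return clauses


def matches(doc, clauses):
    """Require every clause; a clause matches if any alternative matches."""
    return all(
        any(term in doc.get(field, set()) for field, term in clause)
        for clause in clauses
    )


# Hypothetical index: one English and one French document (token sets).
DOCS = {
    "en-1": {"body": {"good", "result"}},
    "fr-1": {"body_fr": {"il", "a", "un", "result"}},
}


def search(query, stopwords):
    """Return the sorted ids of documents matching all clauses."""
    clauses = build_clauses(query, stopwords)
    return sorted(d for d, doc in DOCS.items() if matches(doc, clauses))


if __name__ == "__main__":
    print(build_clauses("a result", STOPWORDS))
    print(search("a result", STOPWORDS))  # per-language stop lists
    print(search("a result", MERGED))     # one shared stop list
```

With the per-language lists, the first clause contains only body_fr:a, so
the English document is excluded even though it contains "result". With the
merged list, "a" is dropped in both fields and the query reduces to a single
clause on "result" that either field can satisfy.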

--
Walter Underwood
wun...@wunderwood.org




