RE: [SPAM] Re: strange edismax parsing when searching in multiple fields (#TB)

Ahmet Arslan Wed, 13 Mar 2013 12:04:38 -0700

Hi Tom,

I don't use stop word removal either. I pass only the "meaningful words"
to the hl.q parameter:
http://wiki.apache.org/solr/HighlightingParameters#hl.q
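
As a rough illustration of that idea, here is a minimal sketch that builds request parameters where q keeps the full user query and hl.q carries only the non-stopword terms. The stop list, the helper name build_params and the /select path are assumptions made for the example, not something taken from a real setup. How hl.q is parsed (and against which field) depends on the Solr version and configuration, so please check that before relying on it.

```python
from urllib.parse import urlencode

# Hypothetical stop list for illustration; a real one would be loaded
# from files such as stop_en.txt and stop_fr.txt.
STOPWORDS = {"a", "the", "of", "le", "la", "de"}


def build_params(user_query, fields=("body", "body_fr")):
    """Search on the full query, but ask the highlighter to use only
    the terms that are not in the stop list."""
    meaningful = [t for t in user_query.split() if t.lower() not in STOPWORDS]
    params = {
        "q": user_query,            # stopwords are kept for searching
        "defType": "edismax",
        "qf": " ".join(fields),
        "hl": "true",
        "hl.fl": ",".join(fields),
    }
    if meaningful:
        # hl.q overrides the query used for highlighting only.
        params["hl.q"] = " ".join(meaningful)
    return params


if __name__ == "__main__":
    # Assumed request path; adjust to the actual core or collection.
    print("/select?" + urlencode(build_params("a result")))
```

If the query consists only of stopwords, the sketch leaves hl.q unset, so highlighting falls back to q.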



--- On Wed, 3/13/13, Burgmans, Tom <tom.burgm...@wolterskluwer.com> wrote:

> From: Burgmans, Tom <tom.burgm...@wolterskluwer.com>
> Subject: RE: [SPAM]  Re: strange edismax parsing when searching in multiple 
> fields (#TB)
> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> Date: Wednesday, March 13, 2013, 5:55 PM
> The main reason for using stopwords is to speed up query performance,
> since we see that a huge part of it is consumed by highlighting
> stopwords. Also, when reading the full highlighted document, we think
> the document is more readable when only meaningful words are
> highlighted.
> 
> For searching, in fact, I would like to keep stopwords...
> 
> 
> -----Original Message-----
> From: Walter Underwood [mailto:wun...@wunderwood.org]
> Sent: Wednesday 13 March 2013 04:43
> To: solr-user@lucene.apache.org
> Subject: [SPAM] Re: strange edismax parsing when searching
> in multiple fields (#TB)
> Importance: Low
> 
> Or don't use stopwords. I haven't used stopwords for, oh, a
> dozen years or so.
> 
> Removing stopwords was a hack developed for 16-bit computers
> and 40 megabyte disks. We don't need to do that any more.
> 
> wunder
> 
> On Mar 13, 2013, at 8:28 AM, Ahmet Arslan wrote:
> 
> > I would merge stop_en.txt and stop_fr.txt. Use the same set of
> > stop words for all fields that you search on.
> >
> > You might find this useful : 
> > http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/
> >
> > --- On Wed, 3/13/13, Burgmans, Tom <tom.burgm...@wolterskluwer.com> wrote:
> >
> >> From: Burgmans, Tom <tom.burgm...@wolterskluwer.com>
> >> Subject: strange edismax parsing when searching in multiple fields (#TB)
> >> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> >> Date: Wednesday, March 13, 2013, 5:22 PM
> >> Hi group,
> >>
> >> Background:
> >> I have a collection containing English and French documents.
> >> I made sure to index the English content in field "body"
> >> (fieldType=text_en) and the French content in field "body_fr"
> >> (fieldType=text_fr).
> >>
> >> The user could be either English or French, so the goal is to
> >> execute the queries against both fields simultaneously without
> >> knowing the query language up front. The query is analyzed
> >> differently for each field. Both fields have a stopFilter
> >> configured, each with its own list of stopwords (different per
> >> language).
> >>
> >> The issue:
> >> When I search for 'a result' (without single quotes) in fields
> >> "body" and "body_fr" at the same time, "a" is considered a
> >> stopword in English and removed for field "body", but not in
> >> French, so both terms are still searched inside "body_fr". The
> >> query is then parsed (edismax) into this construction:
> >>
> >> ((body_fr:a)~1.0 (body:result | body_fr:result)~1.0)
> >>
> >> This query returns only French documents, although there are
> >> many English documents in the index that contain the term
> >> 'result' as well. How can that happen? I think it is related to
> >> the way my query is parsed: there seems to be an AND-relationship
> >> between (body_fr:a) and (body:result | body_fr:result). No English
> >> document matches (body_fr:a), so that's why they don't show up.
> >> To me, a much more logical parsed query would be:
> >>
> >> ((body:result)~1.0 | (body_fr:a body_fr:result)~1.0)
> >>
> >> How should I interpret this? Is it a bug in edismax? Or is it
> >> intended, and if so, why?
> >>
> >> Thanks for any hint,
> >> Tom
> >>
> >> This email and any attachments may contain confidential or
> >> privileged information and is intended for the addressee only.
> >> If you are not the intended recipient, please immediately notify
> >> us by email or telephone and delete the original email and
> >> attachments without using, disseminating or reproducing its
> >> contents to anyone other than the intended recipient. Wolters
> >> Kluwer shall not be liable for the incorrect or incomplete
> >> transmission of this email or any attachments, nor for
> >> unauthorized use by its employees.
> >>
> >> Wolters Kluwer nv has its registered address in Alphen aan den
> >> Rijn, The Netherlands, and is registered with the Trade Registry
> >> of the Dutch Chamber of Commerce under number 33202517.
> >>
> 
> --
> Walter Underwood
> wun...@wunderwood.org
> 
> 
> 
> 
>
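
For reference, the parse described in the original message can be reproduced with a small toy model. This is only a sketch of the general idea (one clause per query term, each clause a disjunction over the fields in which the term survives analysis), not Solr's actual code. The stop lists are made up, and the matching step assumes every clause is required (for example mm=100% or q.op=AND); the thread does not say which setting is in effect.

```python
# Hypothetical per-field stop lists; only "a" matters for this example.
STOPWORDS = {
    "body": {"a", "the", "of"},     # stands in for text_en
    "body_fr": {"le", "la", "de"},  # stands in for text_fr
}


def analyze(field, term):
    """Toy analyzer: lowercase, then drop the field's stopwords."""
    token = term.lower()
    return None if token in STOPWORDS[field] else token


def parse(query, fields=("body", "body_fr")):
    """One clause per query term; each clause lists the (field, token)
    alternatives for the fields where the term survived analysis."""
    clauses = []
    for term in query.split():
        alternatives = []
        for field in fields:
            token = analyze(field, term)
            if token is not None:
                alternatives.append((field, token))
        if alternatives:
            clauses.append(alternatives)
    return clauses


def render(clauses):
    """Print the clauses in the notation used in the original message."""
    parts = []
    for alternatives in clauses:
        inner = " | ".join(f"{field}:{token}" for field, token in alternatives)
        parts.append(f"({inner})~1.0")
    return "(" + " ".join(parts) + ")"


def matches(doc, clauses):
    """Assumption: all clauses are required; any alternative satisfies one."""
    return all(
        any(token in doc.get(field, set()) for field, token in alternatives)
        for alternatives in clauses
    )


if __name__ == "__main__":
    clauses = parse("a result")
    print(render(clauses))
    english_doc = {"body": {"good", "result"}}
    french_doc = {"body_fr": {"a", "result"}}
    print(matches(english_doc, clauses), matches(french_doc, clauses))
```

In this model the first clause has only the body_fr alternative, because "a" was dropped for "body". An English document can never satisfy it, which fits the symptom of only French documents being returned. With one shared stop list for both fields, "a" would be dropped everywhere and that clause would disappear.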
