Yeah, the Ultraseek highlighter did not highlight standalone stopwords. It did highlight stopwords in phrases. That is the "vitamin a" test.
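For anyone who wants that rule spelled out, here is a minimal Python sketch. It is not Ultraseek's code: the stopword list, the bracket markup, and the function names are all made up for illustration. Quoted phrases are highlighted whole, stopwords included; standalone query terms are highlighted only when they are not stopwords.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "and"}

def parse_query(query):
    """Split a query into quoted phrases (lists of words) and standalone terms."""
    phrases = [p.lower().split() for p in re.findall(r'"([^"]+)"', query)]
    rest = re.sub(r'"[^"]+"', " ", query)
    terms = [t.lower() for t in rest.split()]
    return phrases, terms

def highlight(text, query):
    """Mark matches with [ ]: phrases whole, standalone terms only if not stopwords."""
    phrases, terms = parse_query(query)
    tokens = text.split()
    # Compare on lowercased tokens with punctuation stripped.
    lowered = [re.sub(r"\W", "", t).lower() for t in tokens]
    marked = [False] * len(tokens)

    # Phrase matches: every word in the phrase is marked, stopwords included.
    for phrase in phrases:
        n = len(phrase)
        for i in range(len(tokens) - n + 1):
            if lowered[i:i + n] == phrase:
                for j in range(i, i + n):
                    marked[j] = True

    # Standalone terms: stopwords are skipped.
    keep = {t for t in terms if t not in STOPWORDS}
    for i, tok in enumerate(lowered):
        if tok in keep:
            marked[i] = True

    return " ".join(f"[{t}]" if m else t for t, m in zip(tokens, marked))

print(highlight("Take a vitamin A supplement", '"vitamin a"'))
print(highlight("Take a vitamin A supplement", "vitamin a"))
```

With the phrase query, the "A" after "vitamin" is marked and the earlier standalone "a" is not. With the unquoted query, only "vitamin" is marked.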
wunder

On Mar 13, 2013, at 8:55 AM, Burgmans, Tom wrote:

> The main reason of using stopwords is to speed up query performance, since we
> see that a huge part is consumed by highlighting stopwords. Also when reading
> the full highlighted document, we think that it makes a document better
> readable when only meaningful words are highlighted.
>
> For searching in fact I like to keep stopwords...
>
>
> -----Original Message-----
> From: Walter Underwood [mailto:wun...@wunderwood.org]
> Sent: Wednesday 13 March 2013 04:43
> To: solr-user@lucene.apache.org
> Subject: [SPAM] Re: strange edismax parsing when searching in multiple fields
> (#TB)
> Importance: Low
>
> Or don't use stopwords. I haven't used stopwords for, oh, a dozen years or so.
>
> Removing stopwords was a hack developed for 16-bit computers and 40 megabyte
> disks. We don't need to do that any more.
>
> wunder
>
> On Mar 13, 2013, at 8:28 AM, Ahmet Arslan wrote:
>
>> I would merge stop_en.txt and stop_fr.txt. Use same set of stop words for
>> all fields that you search on.
>>
>> You might find this useful :
>> http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/
>>
>> --- On Wed, 3/13/13, Burgmans, Tom <tom.burgm...@wolterskluwer.com> wrote:
>>
>>> From: Burgmans, Tom <tom.burgm...@wolterskluwer.com>
>>> Subject: strange edismax parsing when searching in multiple fields (#TB)
>>> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
>>> Date: Wednesday, March 13, 2013, 5:22 PM
>>> Hi group,
>>>
>>> Background:
>>> I have a collection containing English and French documents.
>>> I made sure to index the English content in field "body"
>>> (fieldType=text_en) and the French content in field
>>> "body_fr" (fieldType=text_fr).
>>>
>>> The user could be either English or French so the goal is to
>>> execute the queries against both fields simultaneously
>>> without knowing the query language upfront. The query is
>>> analyzed differently for each field.
>>> For both fields a stopFilter is configured with each its own list of
>>> stopwords (different per language).
>>>
>>> The issue:
>>> When I search for 'a result' (without single quotes) in
>>> field "body" and "body_fr" at the same time, then "a" is
>>> considered a stopword in English and removed for field
>>> "body", but not in French so both terms are still searched
>>> inside "body_fr". What happens is that the query is parsed
>>> (edismax) into this construction:
>>>
>>> ((body_fr:a)~1.0 (body:result | body_fr:result)~1.0)
>>>
>>> This query returns only French documents, although there are
>>> many English documents in the index that contain the term
>>> 'result' as well. How can that happen? I think it is related
>>> to the way my query is parsed: there seems to be an
>>> AND-relationship between (body_fr:a) and (body:result |
>>> body_fr:result). There is no English document that has
>>> (body_fr:a), so that's why they don't show up. For me a much
>>> more logical parsed query would be:
>>>
>>> ((body:result)~1.0 | (body_fr:a body_fr:result)~1.0)
>>>
>>> How should I interpret this? Is it a bug in edismax? Is it
>>> intended and if yes: why?
>>>
>>> Thanks for any hint,
>>> Tom
>>>
>>> This email and any attachments may contain confidential or
>>> privileged information
>>> and is intended for the addressee only. If you are not the
>>> intended recipient, please
>>> immediately notify us by email or telephone and delete the
>>> original email and attachments
>>> without using, disseminating or reproducing its contents to
>>> anyone other than the intended
>>> recipient. Wolters Kluwer shall not be liable for the
>>> incorrect or incomplete transmission of
>>> this email or any attachments, nor for unauthorized use
>>> by its employees.
>>>
>>> Wolters Kluwer nv has its registered address in Alphen aan
>>> den Rijn, The Netherlands, and is registered
>>> with the Trade Registry of the Dutch Chamber of Commerce
>>> under number 33202517.
>>>
>
> --
> Walter Underwood
> wun...@wunderwood.org

--
Walter Underwood
wun...@wunderwood.org
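
A toy model of the parsed query quoted in Tom's original message, as a rough Python sketch rather than Solr's actual edismax code. It assumes one clause per query term, each clause listing the fields in which that term survives the field's stop filter, and it assumes every clause is required (the AND-like behaviour Tom describes). The stopword lists, documents, and function names are hypothetical.

```python
# Hypothetical per-field stopword lists; the real stop_en.txt and
# stop_fr.txt are not shown in the thread.
FIELD_STOPWORDS = {
    "body": {"a", "the", "of"},
    "body_fr": {"le", "la", "de"},
}

def build_clauses(query, field_stopwords=FIELD_STOPWORDS):
    """One clause per query term: the (field, term) pairs that survive
    each field's stop filter. Terms removed from every field are dropped."""
    clauses = []
    for term in query.lower().split():
        alternatives = [
            (field, term)
            for field, stops in field_stopwords.items()
            if term not in stops
        ]
        if alternatives:
            clauses.append(alternatives)
    return clauses

def matches(doc, clauses, min_should_match):
    """A clause matches if any of its alternatives matches the document;
    the document matches if at least min_should_match clauses match."""
    hits = sum(
        any(term in doc.get(field, "").lower().split() for field, term in clause)
        for clause in clauses
    )
    return hits >= min_should_match

english_doc = {"body": "this is a result"}

# Separate stopword lists: "a" survives only in body_fr, so it becomes
# its own required clause that no English document can satisfy.
clauses = build_clauses("a result")
print(clauses)
print(matches(english_doc, clauses, len(clauses)))

# Same stopword set on every searched field: "a" is removed everywhere,
# so its clause disappears and only "result" is required.
merged = set().union(*FIELD_STOPWORDS.values())
merged_map = {field: merged for field in FIELD_STOPWORDS}
merged_clauses = build_clauses("a result", merged_map)
print(merged_clauses)
print(matches(english_doc, merged_clauses, len(merged_clauses)))
```

Under these assumptions the first check fails for the English document and the second succeeds, which is consistent with Ahmet's suggestion to use the same stopword set for all searched fields.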