Actually, I think I found the issue. Some of the PDFs weren't OCR'ed very well, and the word "examined" was read as "~8 mined".

Vincent Vu Nguyen
Division of Science Quality and Translation
Office of the Associate Director for Science
Centers for Disease Control and Prevention (CDC)
404-498-6154
Century Bldg 2400
Atlanta, GA 30329

-----Original Message-----
From: Nguyen, Vincent (CDC/OSELS/PHITPO) (CTR)
Sent: Wednesday, September 15, 2010 12:35 PM
To: solr-user@lucene.apache.org; yo...@lucidimagination.com
Subject: RE: Solr returning irrelevant results

Sorry about that, I made it uppercase to emphasize it. The word was just "examined"

-----Original Message-----
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Wednesday, September 15, 2010 11:40 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr returning irrelevant results

On Wed, Sep 15, 2010 at 11:29 AM, Nguyen, Vincent (CDC/OSELS/PHITPO) (CTR) <v...@cdc.gov> wrote:
> I was running a query on the word "mining" and got results from
> documents that have nothing to do with mining. I got results with a
> score of 0.2997284 and less. It looks like Solr was querying the
> dsm.fulltext field for "mine" as well, which is ok except there were no
> "mine" words in the document. However, I did find words like
> "exaMINEd".

Was the "MINE" in "exaMINEd" actually uppercase, or did you do that for emphasis?
If it was actually uppercased, one could argue it is a relevant document since someone was trying to get "MINE" to stand out for some reason.

Anyway, if you don't want that behavior then turn off splitting on case change.
splitOnCaseChange="0" in WordDelimiterFilterFactory
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8
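
For anyone finding this thread later, here is a rough Python sketch of why both cases discussed above can produce a "mined" token. It is only an illustration, not Solr's actual code: split_tokens is a made-up helper, and it models just two behaviours (non-alphanumeric characters act as delimiters, and an optional break where a lowercase letter is followed by an uppercase one). Stemming "mined" and "mining" to "mine" is assumed to happen in a later filter and is not modelled here.

```python
import re


def split_tokens(text, split_on_case_change=True):
    """Simplified stand-in for word-delimiter style splitting (hypothetical helper)."""
    if split_on_case_change:
        # Insert a break wherever a lowercase letter is followed by an uppercase one.
        text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    # Anything that is not a letter or digit acts as a delimiter.
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


print(split_tokens("~8 mined"))                              # ['8', 'mined']
print(split_tokens("exaMINEd"))                              # ['exa', 'mined']
print(split_tokens("exaMINEd", split_on_case_change=False))  # ['examined']
print(split_tokens("examined"))                              # ['examined']
```

In this sketch the badly OCR'ed text "~8 mined" yields "mined" no matter how the case-change option is set, so turning that option off would not by itself remove those matches.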