Right, it's fixed only in the "new trunk": http://svn.apache.org/repos/asf/lucene/dev/trunk/
Nothing has been changed with regard to the Solr 1.5 branch yet.

On Thu, Apr 8, 2010 at 10:01 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> You're right, it sure looks related. But according to that JIRA, it's fixed
> in trunk and I'm pretty sure I have a very recent version that I built from
> code I updated within the last few days.
>
> I'll update tonight and double check. If it's still a problem I'll see if
> I can write a test case illustrating the behavior and maybe poke around
> to see if it's an easy fix.
>
> Thanks
> Erick
>
> On Thu, Apr 8, 2010 at 8:16 AM, Robert Muir <rcm...@gmail.com> wrote:
>
> > Erick, this sounds like https://issues.apache.org/jira/browse/SOLR-1852
> >
> > On Wed, Apr 7, 2010 at 10:04 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> > > Well, for a quick trial using trunk, I had to remove the
> > > UnicodeNormalizationFilterFactory, is that yours?
> > >
> > > But with that removed, I get the results you do, ASSUMING that you've set
> > > your default operator to AND in schema.xml...
> > >
> > > Believe it or not, it all changes and all your queries return a hit if you
> > > do one of two things (I did this in both index and query when testing
> > > 'cause I'm lazy):
> > > 1> move the inclusion of the StopFilterFactory after WordDelimiterFilterFactory
> > > or
> > > 2> for StopFilterFactory, set enablePositionIncrements="false"
> > >
> > > I think either of these might work in your situation...
> > >
> > > On doing some more investigation, it appears that if a hyphenated word is
> > > immediately after a stopword AND the above is true (StopFilterFactory included
> > > before WordDelimiterFilterFactory and enablePositionIncrements="true"), then the
> > > search fails. I indexed this title:
> > >
> > > Love-customs in eighteenth-century Spain for nineteenth-century
> > >
> > > Searching in solr/admin/form.jsp for:
> > > title:(nineteenth-century)
> > >
> > > fails.
> > > But if I remove the "for" from the title, the above query works.
> > > Searching for
> > > title:(love-customs)
> > > always works.
> > >
> > > Finally, (and it's *really* time to go to sleep now), just setting
> > > enablePositionIncrements="false" in the "index" portion of the schema also
> > > causes things to work.
> > >
> > > Developer folks:
> > > I didn't see anything in a quick look in SOLR or Lucene JIRAs, should I
> > > refine this a bit (really, sleepy time is near) and add a JIRA?
> > >
> > > Best
> > > Erick
> > >
> > > On Wed, Apr 7, 2010 at 10:29 AM, Demian Katz <demian.k...@villanova.edu> wrote:
> > >
> > > > Hello. It has been a few weeks, and I haven't gotten any responses.
> > > > Perhaps my question is too complicated -- maybe a better approach is to try
> > > > to gain enough knowledge to answer it myself. My gut feeling is still that
> > > > it's something to do with the way term positions are getting handled by the
> > > > WordDelimiterFilterFactory, but I don't have a good understanding of how
> > > > term positions are calculated or factored into searching. Can anyone
> > > > recommend some good reading to familiarize myself with these concepts in
> > > > better detail?
> > > >
> > > > thanks,
> > > > Demian
> > > >
> > > > From: Demian Katz
> > > > Sent: Tuesday, March 16, 2010 9:47 AM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: solr.WordDelimiterFilterFactory problem with hyphenated terms?
> > > >
> > > > This is my first post on this list -- apologies if this has been discussed
> > > > before; I didn't come upon anything exactly equivalent in searching the
> > > > archives via Google.
> > > >
> > > > I'm using Solr 1.4 as part of the VuFind application, and I just noticed
> > > > that searches for hyphenated terms are failing in strange ways.
> > > > I strongly
> > > > suspect it has something to do with the solr.WordDelimiterFilterFactory
> > > > filter, but I'm not exactly sure what.
> > > >
> > > > The problem is that I have a record with the title "Love customs in
> > > > eighteenth-century Spain." Depending on how I search for this, I get
> > > > successes or failures in a seemingly unpredictable pattern.
> > > >
> > > > Demonstration queries below were tested using the direct Solr
> > > > administration tool, just to eliminate any VuFind-related factors from the
> > > > equation while debugging.
> > > >
> > > > Queries that work:
> > > >     title:(Love customs in eighteenth century Spain)
> > > >         // no hyphen, no phrases
> > > >     title:("Love customs in eighteenth-century Spain")
> > > >         // phrase search on whole title, with hyphen
> > > >
> > > > Queries that fail:
> > > >     title:(Love customs in eighteenth-century Spain)
> > > >         // hyphen, no phrases
> > > >     title:("Love customs in eighteenth century Spain")
> > > >         // phrase search on whole title, without hyphen
> > > >     title:(Love customs in "eighteenth-century" Spain)
> > > >         // hyphenated word as phrase
> > > >     title:(Love customs in "eighteenth century" Spain)
> > > >         // hyphenated word as phrase, hyphen removed
> > > >
> > > > Here is VuFind's text field type definition:
> > > >
> > > > <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> > > >   <analyzer type="index">
> > > >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > > >             words="stopwords.txt" enablePositionIncrements="true"/>
> > > >     <filter class="solr.WordDelimiterFilterFactory"
> > > >             generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > > >             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > > >     <filter class="solr.LowerCaseFilterFactory"/>
> > > >     <filter class="solr.SnowballPorterFilterFactory" language="English"
> > > >             protected="protwords.txt"/>
> > > >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > > >     <filter class="schema.UnicodeNormalizationFilterFactory"
> > > >             version="icu4j" composed="false" remove_diacritics="true"
> > > >             remove_modifiers="true" fold="true"/>
> > > >     <filter class="solr.ISOLatin1AccentFilterFactory"/>
> > > >   </analyzer>
> > > >   <analyzer type="query">
> > > >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> > > >             ignoreCase="true" expand="true"/>
> > > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > > >             words="stopwords.txt" enablePositionIncrements="true"/>
> > > >     <filter class="solr.WordDelimiterFilterFactory"
> > > >             generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > > >             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > > >     <filter class="solr.LowerCaseFilterFactory"/>
> > > >     <filter class="solr.SnowballPorterFilterFactory" language="English"
> > > >             protected="protwords.txt"/>
> > > >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > > >     <filter class="schema.UnicodeNormalizationFilterFactory"
> > > >             version="icu4j" composed="false" remove_diacritics="true"
> > > >             remove_modifiers="true" fold="true"/>
> > > >     <filter class="solr.ISOLatin1AccentFilterFactory"/>
> > > >   </analyzer>
> > > > </fieldType>
> > > >
> > > > I did notice that the "text" field type in VuFind's schema has
> > > > "catenateWords" and "catenateNumbers" turned on in both the index and query
> > > > analyzer chains. It is my understanding that these options should be
> > > > disabled for the query chain and only enabled for the index chain. However,
> > > > this may be a red herring -- I have already tried changing this setting, but
> > > > it didn't change the success/failure pattern described above.
> > > > I have also
> > > > played with the preserveOriginal setting without apparent effect.
> > > >
> > > > From playing with the Field Analysis tool, I notice that there is a gap in
> > > > the term position sequence after analysis... but I'm not sure if this is
> > > > significant.
> > > >
> > > > Has anybody else run into this sort of problem? Any ideas on a fix?
> > > >
> > > > thanks,
> > > > Demian
> >
> > --
> > Robert Muir
> > rcm...@gmail.com

--
Robert Muir
rcm...@gmail.com
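
For anyone reading this thread later: the second workaround Erick describes (turning off position increments on the stop filter) might look roughly like the fragment below. This is only a sketch built from the index analyzer Demian posted. The single change is the enablePositionIncrements value; the closing comment stands in for the rest of the chain, which is assumed to stay exactly as quoted in the thread.

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Workaround 2 from Erick's message: enablePositionIncrements
       changed from "true" to "false"; nothing else is altered. -->
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt" enablePositionIncrements="false"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="1"
          catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <!-- remaining filters unchanged from the original index chain -->
</analyzer>
```

Because this changes index-time analysis, existing documents would need to be reindexed before the effect shows up in searches. The trade-off is that removed stopwords no longer leave a gap in the term positions, so phrase queries will treat the words on either side of a stopword as adjacent.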