Sorry I was backwards with my response, but the behavior is definitely correct here.
On Mon, Apr 12, 2010 at 9:46 AM, Demian Katz <demian.k...@villanova.edu> wrote:

> I don't think the behavior is correct. The first example, with just one
> gap, does NOT match. The second example, with an extra second gap, DOES
> match. It seems that the term collapsing ("eighteenth-century" -->
> "eighteenthcentury") somehow throws off the position sequence, forcing you
> to add an extra gap in order to get a match.

Sorry I was backwards with my response, but yes: eighteenth-century (with the options you specified for WordDelimiterFilter: generateWordParts) will analyze as eighteenth, followed by century, followed by an injected (positionIncrement=0) eighteenthcentury.

If you want eighteenth-century to need only a single gap, you can turn off generateWordParts and catenate instead. Of course, then you won't match "eighteenth century" with a space. At the end of the day, your problem is just that the phrase has a different number of tokens in the query than it does in the document.

> It's good to know that slop is an option to work around this problem... but
> it still seems to me that something isn't working the way it is supposed to
> in this particular case.
>
> - Demian
>
> > -----Original Message-----
> > From: Robert Muir [mailto:rcm...@gmail.com]
> > Sent: Friday, April 09, 2010 12:05 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: solr.WordDelimiterFilterFactory problem with hyphenated terms?
> >
> > but this behavior is correct, as you have position increments enabled.
> > if you want the second query (which has 2 gaps) to match, you need to either
> > use slop, or disable these increments altogether.
> >
> > On Fri, Apr 9, 2010 at 11:44 AM, Demian Katz
> > <demian.k...@villanova.edu> wrote:
> >
> > > I've given it a try, and it definitely seems to have improved the
> > > situation. However, there is still one weird case that's clearly related to
> > > term positions.
> > > If I do this search, it fails:
> > >
> > > title:"love customs in eighteenthcentury spain"
> > >
> > > ...but if I do this search, it succeeds:
> > >
> > > title:"love customs in in eighteenthcentury spain"
> > >
> > > (note the duplicate "in").
> > >
> > > - Demian
> > >
> > > > -----Original Message-----
> > > > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > > > Sent: Thursday, April 08, 2010 11:20 AM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: solr.WordDelimiterFilterFactory problem with hyphenated terms?
> > > >
> > > > I'm not all that familiar with the underlying issues, but of the two I'd
> > > > pick moving the WordDelimiterFactory rather than setting increments = "false".
> > > >
> > > > But that's at least partly a guess....
> > > >
> > > > Best
> > > > Erick
> > > >
> > > > On Thu, Apr 8, 2010 at 11:00 AM, Demian Katz
> > > > <demian.k...@villanova.edu> wrote:
> > > >
> > > > > Thanks for looking into this -- I appreciate the help (and feel a little
> > > > > better that there seems to be a bug at work here and not just my total
> > > > > incomprehension).
> > > > >
> > > > > Sorry for any confusion over the UnicodeNormalizationFactory -- that's
> > > > > actually a plug-in from the SolrMarc project
> > > > > (http://code.google.com/p/solrmarc/) that slipped into my example. Also,
> > > > > as you guessed, my default operator is indeed set to "AND."
> > > > >
> > > > > It sounds to me that, of your two proposed work-arounds, moving the
> > > > > StopFilterFactory after WordDelimiterFactory is the least disruptive.
> > > > > I'm guessing that disabling position increments across the board might have
> > > > > implications for other types of phrase searches, while filtering stopwords
> > > > > later in the chain should be more functionally equivalent, if slightly less
> > > > > efficient (potentially more terms to examine). Would you agree with this
> > > > > assessment? If not, what possible negative side effects am I forgetting
> > > > > about?
> > > > >
> > > > > thanks,
> > > > > Demian
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > > > > > Sent: Wednesday, April 07, 2010 10:04 PM
> > > > > > To: solr-user@lucene.apache.org
> > > > > > Subject: Re: solr.WordDelimiterFilterFactory problem with hyphenated terms?
> > > > > >
> > > > > > Well, for a quick trial using trunk, I had to remove the
> > > > > > UnicodeNormalizationFactory, is that yours?
> > > > > >
> > > > > > But with that removed, I get the results you do, ASSUMING that you've set
> > > > > > your default operator to AND in schema.xml...
> > > > > >
> > > > > > Believe it or not, it all changes and all your queries return a hit if you
> > > > > > do one of two things (I did this in both index and query when testing 'cause
> > > > > > I'm lazy):
> > > > > > 1> move the inclusion of the StopFilterFactory after WordDelimiterFactory
> > > > > > or
> > > > > > 2> for StopFilterFactory, set enablePositionIncrements="false"
> > > > > >
> > > > > > I think either of these might work in your situation.......
> > > > > >
> > > > > > On doing some more investigation, it appears that if a hyphenated word is
> > > > > > immediately after a stopword AND the above is true (stop factory included
> > > > > > before WordDelimiterFactory and enablePositionIncrements="true"), then the
> > > > > > search fails. I indexed this title:
> > > > > >
> > > > > > Love-customs in eighteenth-century Spain for nineteenth-century
> > > > > >
> > > > > > Searching in solr/admin/form.jsp for:
> > > > > > title:(nineteenth-century)
> > > > > >
> > > > > > fails. But if I remove the "for" from the title, the above query works.
> > > > > > Searching for
> > > > > > title:(love-customs)
> > > > > > always works.
> > > > > >
> > > > > > Finally, (and it's *really* time to go to sleep now), just setting
> > > > > > enablePositionIncrements="false" in the "index" portion of the schema also
> > > > > > causes things to work.
> > > > > >
> > > > > > Developer folks:
> > > > > > I didn't see anything in a quick look in SOLR or Lucene JIRAs, should I
> > > > > > refine this a bit (really, sleepy time is near) and add a JIRA?
> > > > > >
> > > > > > Best
> > > > > > Erick
> > > > > >
> > > > > > On Wed, Apr 7, 2010 at 10:29 AM, Demian Katz
> > > > > > <demian.k...@villanova.edu> wrote:
> > > > > >
> > > > > > > Hello. It has been a few weeks, and I haven't gotten any responses.
> > > > > > > Perhaps my question is too complicated -- maybe a better approach is to try
> > > > > > > to gain enough knowledge to answer it myself. My gut feeling is still that
> > > > > > > it's something to do with the way term positions are getting handled by the
> > > > > > > WordDelimiterFilterFactory, but I don't have a good understanding of how
> > > > > > > term positions are calculated or factored into searching.
> > > > > > > Can anyone recommend some good reading to familiarize myself with these
> > > > > > > concepts in better detail?
> > > > > > >
> > > > > > > thanks,
> > > > > > > Demian
> > > > > > >
> > > > > > > From: Demian Katz
> > > > > > > Sent: Tuesday, March 16, 2010 9:47 AM
> > > > > > > To: solr-user@lucene.apache.org
> > > > > > > Subject: solr.WordDelimiterFilterFactory problem with hyphenated terms?
> > > > > > >
> > > > > > > This is my first post on this list -- apologies if this has been discussed
> > > > > > > before; I didn't come upon anything exactly equivalent in searching the
> > > > > > > archives via Google.
> > > > > > >
> > > > > > > I'm using Solr 1.4 as part of the VuFind application, and I just noticed
> > > > > > > that searches for hyphenated terms are failing in strange ways. I strongly
> > > > > > > suspect it has something to do with the solr.WordDelimiterFilterFactory
> > > > > > > filter, but I'm not exactly sure what.
> > > > > > >
> > > > > > > The problem is that I have a record with the title "Love customs in
> > > > > > > eighteenth-century Spain." Depending on how I search for this, I get
> > > > > > > successes or failures in a seemingly unpredictable pattern.
> > > > > > >
> > > > > > > Demonstration queries below were tested using the direct Solr
> > > > > > > administration tool, just to eliminate any VuFind-related factors from the
> > > > > > > equation while debugging.
> > > > > > >
> > > > > > > Queries that work:
> > > > > > >   title:(Love customs in eighteenth century Spain)
> > > > > > >     // no hyphen, no phrases
> > > > > > >   title:("Love customs in eighteenth-century Spain")
> > > > > > >     // phrase search on whole title, with hyphen
> > > > > > >
> > > > > > > Queries that fail:
> > > > > > >   title:(Love customs in eighteenth-century Spain)
> > > > > > >     // hyphen, no phrases
> > > > > > >   title:("Love customs in eighteenth century Spain")
> > > > > > >     // phrase search on whole title, without hyphen
> > > > > > >   title:(Love customs in "eighteenth-century" Spain)
> > > > > > >     // hyphenated word as phrase
> > > > > > >   title:(Love customs in "eighteenth century" Spain)
> > > > > > >     // hyphenated word as phrase, hyphen removed
> > > > > > >
> > > > > > > Here is VuFind's text field type definition:
> > > > > > >
> > > > > > > <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> > > > > > >   <analyzer type="index">
> > > > > > >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > > > > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > > > > > >             words="stopwords.txt" enablePositionIncrements="true"/>
> > > > > > >     <filter class="solr.WordDelimiterFilterFactory"
> > > > > > >             generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > > > > > >             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > > > > > >     <filter class="solr.LowerCaseFilterFactory"/>
> > > > > > >     <filter class="solr.SnowballPorterFilterFactory" language="English"
> > > > > > >             protected="protwords.txt"/>
> > > > > > >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > > > > > >     <filter class="schema.UnicodeNormalizationFilterFactory"
> > > > > > >             version="icu4j" composed="false" remove_diacritics="true"
> > > > > > >             remove_modifiers="true" fold="true"/>
> > > > > > >     <filter class="solr.ISOLatin1AccentFilterFactory"/>
> > > > > > >   </analyzer>
> > > > > > >   <analyzer type="query">
> > > > > > >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > > > > >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> > > > > > >             ignoreCase="true" expand="true"/>
> > > > > > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > > > > > >             words="stopwords.txt" enablePositionIncrements="true"/>
> > > > > > >     <filter class="solr.WordDelimiterFilterFactory"
> > > > > > >             generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > > > > > >             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > > > > > >     <filter class="solr.LowerCaseFilterFactory"/>
> > > > > > >     <filter class="solr.SnowballPorterFilterFactory" language="English"
> > > > > > >             protected="protwords.txt"/>
> > > > > > >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > > > > > >     <filter class="schema.UnicodeNormalizationFilterFactory"
> > > > > > >             version="icu4j" composed="false" remove_diacritics="true"
> > > > > > >             remove_modifiers="true" fold="true"/>
> > > > > > >     <filter class="solr.ISOLatin1AccentFilterFactory"/>
> > > > > > >   </analyzer>
> > > > > > > </fieldType>
> > > > > > >
> > > > > > > I did notice that the "text" field type in VuFind's schema has
> > > > > > > "catenateWords" and "catenateNumbers" turned on in both the index and query
> > > > > > > analyzer chains. It is my understanding that these options should be
> > > > > > > disabled for the query chain and only enabled for the index chain. However,
> > > > > > > this may be a red herring -- I have already tried changing this setting, but
> > > > > > > it didn't change the success/failure pattern described above. I have also
> > > > > > > played with the preserveOriginal setting without apparent effect.
> > > > > > >
> > > > > > > From playing with the Field Analysis tool, I notice that there is a gap in
> > > > > > > the term position sequence after analysis... but I'm not sure if this is
> > > > > > > significant.
> > > > > > >
> > > > > > > Has anybody else run into this sort of problem? Any ideas on a fix?
> > > > > > >
> > > > > > > thanks,
> > > > > > > Demian
> >
> > --
> > Robert Muir
> > rcm...@gmail.com

--
Robert Muir
rcm...@gmail.com
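
One way to see why the one-gap phrase misses and the two-gap phrase hits is to write the positions out by hand. The Python sketch below is not Solr code. `analyze` and `phrase_matches` are hypothetical stand-ins, and the sketch assumes that "in" is the only stopword, that a removed stopword leaves a position gap (enablePositionIncrements="true"), and that a hyphenated word yields its parts at successive positions with the catenated form stacked on the last part (positionIncrement=0), as described in the reply above. Stemming, synonyms and the other filters are ignored, and it does not try to reproduce the failures from the original March 16 report.

```python
STOPWORDS = {"in"}  # assumption: the only stopword that matters here


def analyze(text):
    """Return (term, position) pairs for a simplified analysis chain."""
    tokens = []
    position = 0
    for word in text.lower().split():
        position += 1
        if word in STOPWORDS:
            # The stopword is dropped, but its position is kept as a gap.
            continue
        parts = word.split("-")
        if len(parts) == 1:
            tokens.append((word, position))
            continue
        # Word parts occupy successive positions...
        for offset, part in enumerate(parts):
            tokens.append((part, position + offset))
        position += len(parts) - 1
        # ...and the catenated form is stacked on the last part.
        tokens.append(("".join(parts), position))
    return tokens


def phrase_matches(doc_text, query_text):
    """Exact phrase match (slop 0): relative positions must line up."""
    doc = set(analyze(doc_text))
    query = analyze(query_text)
    if not query:
        return False
    first_term, first_pos = query[0]
    for term, pos in doc:
        if term != first_term:
            continue
        shift = pos - first_pos
        if all((t, p + shift) in doc for t, p in query):
            return True
    return False


title = "Love customs in eighteenth-century Spain"
print(analyze(title))
print(phrase_matches(title, "love customs in eighteenthcentury spain"))     # False
print(phrase_matches(title, "love customs in in eighteenthcentury spain"))  # True
```

Under these assumptions the document holds eighteenthcentury at position 5 (stacked on century), while the one-gap query puts it at position 4. The second "in" adds the extra gap that lines the positions up again, which is the same effect a phrase slop would have.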