RE: solr.WordDelimiterFilterFactory problem with hyphenated terms?

Demian Katz Thu, 08 Apr 2010 08:04:56 -0700

Thanks for looking into this -- I appreciate the help (and feel a little better 
that there seems to be a bug at work here and not just my total 
incomprehension).


Sorry for any confusion over the UnicodeNormalizationFactory -- that's actually 
a plug-in from the SolrMarc project (http://code.google.com/p/solrmarc/) that 
slipped into my example.  Also, as you guessed, my default operator is indeed 
set to "AND."

It sounds to me like, of your two proposed work-arounds, moving the 
StopFilterFactory after WordDelimiterFilterFactory is the less disruptive.  I'm 
guessing that disabling position increments across the board might have 
implications for other types of phrase searches, while filtering stopwords 
later in the chain should be closer to functionally equivalent, if slightly less 
efficient (potentially more terms to examine).  Would you agree with this 
assessment?  If not, what possible negative side effects am I forgetting about?
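
For reference, a minimal sketch of that first work-around, assuming the index 
analyzer from the original message with only the order of the two filters 
changed.  The SolrMarc UnicodeNormalizationFilterFactory plug-in is left out 
for brevity, and this is an illustration rather than a tested configuration:

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- word delimiting now runs before stopword removal -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="1"
          catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <!-- moved: this filter originally sat directly after the tokenizer -->
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt" enablePositionIncrements="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="English"
          protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  <filter class="solr.ISOLatin1AccentFilterFactory"/>
</analyzer>
```

The same reordering would presumably apply to the query analyzer, since the 
test described below changed both chains.  Any change to the index analyzer 
only takes effect for documents that are re-indexed afterwards.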

thanks,
Demian

> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Wednesday, April 07, 2010 10:04 PM
> To: solr-user@lucene.apache.org
> Subject: Re: solr.WordDelimiterFilterFactory problem with hyphenated terms?
> 
> Well, for a quick trial using trunk, I had to remove the
> UnicodeNormalizationFactory, is that yours?
> 
> But with that removed, I get the results you do, ASSUMING that you've set
> your default operator to AND in schema.xml...
> 
> Believe it or not, it all changes and all your queries return a hit if you
> do one of two things (I did this in both index and query when testing
> 'cause I'm lazy):
> 1> move the inclusion of the StopFilterFactory after WordDelimiterFactory
> or
> 2> for StopFilterFactory, set enablePositionIncrements="false"
> 
> I think either of these might work in your situation.......
> 
> On doing some more investigation, it appears that if a hyphenated word is
> immediately after a stopword AND the above is true (stop factory included
> before WordDelimiterFactory and enablePositionIncrements="true"), then the
> search fails. I indexed this title:
> 
> Love-customs in eighteenth-century Spain for nineteenth-century
> 
> Searching in solr/admin/form.jsp for:
> title:(nineteenth-century)
> 
> fails. But if I remove the "for" from the title, the above query works.
> Searching for
> title:(love-customs)
> always works.
> 
> Finally, (and it's *really* time to go to sleep now), just setting
> enablePositionIncrements="false" in the "index" portion of the schema also
> causes things to work.
> 
> Developer folks:
> I didn't see anything in a quick look in SOLR or Lucene JIRAs, should I
> refine this a bit (really, sleepy time is near) and add a JIRA?
> 
> Best
> Erick
> 
> On Wed, Apr 7, 2010 at 10:29 AM, Demian Katz
> <demian.k...@villanova.edu> wrote:
> 
> > Hello.  It has been a few weeks, and I haven't gotten any responses.
> > Perhaps my question is too complicated -- maybe a better approach is to try
> > to gain enough knowledge to answer it myself.  My gut feeling is still that
> > it's something to do with the way term positions are getting handled by the
> > WordDelimiterFilterFactory, but I don't have a good understanding of how
> > term positions are calculated or factored into searching.  Can anyone
> > recommend some good reading to familiarize myself with these concepts in
> > better detail?
> >
> > thanks,
> > Demian
> >
> > From: Demian Katz
> > Sent: Tuesday, March 16, 2010 9:47 AM
> > To: solr-user@lucene.apache.org
> > Subject: solr.WordDelimiterFilterFactory problem with hyphenated terms?
> >
> > This is my first post on this list -- apologies if this has been discussed
> > before; I didn't come upon anything exactly equivalent in searching the
> > archives via Google.
> >
> > I'm using Solr 1.4 as part of the VuFind application, and I just noticed
> > that searches for hyphenated terms are failing in strange ways.  I strongly
> > suspect it has something to do with the solr.WordDelimiterFilterFactory
> > filter, but I'm not exactly sure what.
> >
> > The problem is that I have a record with the title "Love customs in
> > eighteenth-century Spain."  Depending on how I search for this, I get
> > successes or failures in a seemingly unpredictable pattern.
> >
> > Demonstration queries below were tested using the direct Solr
> > administration tool, just to eliminate any VuFind-related factors from the
> > equation while debugging.
> >
> > Queries that work:
> > title:(Love customs in eighteenth century Spain)
> >                     // no hyphen, no phrases
> > title:("Love customs in eighteenth-century Spain")
> >                  // phrase search on whole title, with hyphen
> >
> > Queries that fail:
> > title:(Love customs in eighteenth-century Spain)
> >                    // hyphen, no phrases
> > title:("Love customs in eighteenth century Spain")
> >                   // phrase search on whole title, without hyphen
> > title:(Love customs in "eighteenth-century" Spain)
> >                  // hyphenated word as phrase
> > title:(Love customs in "eighteenth century" Spain)
> >                   // hyphenated word as phrase, hyphen removed
> >
> > Here is VuFind's text field type definition:
> >
> >    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >      <analyzer type="index">
> >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >        <filter class="solr.StopFilterFactory" ignoreCase="true"
> >                words="stopwords.txt" enablePositionIncrements="true"/>
> >        <filter class="solr.WordDelimiterFilterFactory"
> >                generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >        <filter class="solr.LowerCaseFilterFactory"/>
> >        <filter class="solr.SnowballPorterFilterFactory" language="English"
> >                protected="protwords.txt"/>
> >        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >        <filter class="schema.UnicodeNormalizationFilterFactory"
> >                version="icu4j" composed="false" remove_diacritics="true"
> >                remove_modifiers="true" fold="true"/>
> >        <filter class="solr.ISOLatin1AccentFilterFactory"/>
> >      </analyzer>
> >      <analyzer type="query">
> >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >                ignoreCase="true" expand="true"/>
> >        <filter class="solr.StopFilterFactory" ignoreCase="true"
> >                words="stopwords.txt" enablePositionIncrements="true"/>
> >        <filter class="solr.WordDelimiterFilterFactory"
> >                generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >        <filter class="solr.LowerCaseFilterFactory"/>
> >        <filter class="solr.SnowballPorterFilterFactory" language="English"
> >                protected="protwords.txt"/>
> >        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >        <filter class="schema.UnicodeNormalizationFilterFactory"
> >                version="icu4j" composed="false" remove_diacritics="true"
> >                remove_modifiers="true" fold="true"/>
> >        <filter class="solr.ISOLatin1AccentFilterFactory"/>
> >      </analyzer>
> >    </fieldType>
> >
> > I did notice that the "text" field type in VuFind's schema has
> > "catenateWords" and "catenateNumbers" turned on in both the index and query
> > analyzer chains.  It is my understanding that these options should be
> > disabled for the query chain and only enabled for the index chain.  However,
> > this may be a red herring -- I have already tried changing this setting, but
> > it didn't change the success/failure pattern described above.  I have also
> > played with the preserveOriginal setting without apparent effect.
> >
> > From playing with the Field Analysis tool, I notice that there is a gap in
> > the term position sequence after analysis...  but I'm not sure if this is
> > significant.
> >
> > Has anybody else run into this sort of problem?  Any ideas on a fix?
> >
> > thanks,
> > Demian
> >
> >
