Re: solr.WordDelimiterFilterFactory problem with hyphenated terms?

Robert Muir Thu, 08 Apr 2010 05:17:01 -0700

Erick, this sounds like https://issues.apache.org/jira/browse/SOLR-1852


On Wed, Apr 7, 2010 at 10:04 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Well, for a quick trial using trunk, I had to remove the
> UnicodeNormalizationFilterFactory, is that yours?
>
> But with that removed, I get the results you do, ASSUMING that you've set
> your default operator to AND in schema.xml...
>
> Believe it or not, it all changes and all your queries return a hit if you
> do one of two things (I did this in both index and query when testing
> 'cause I'm lazy):
> 1> move the inclusion of the StopFilterFactory after WordDelimiterFilterFactory
> or
> 2> for StopFilterFactory, set enablePositionIncrements="false"
>
> I think either of these might work in your situation.
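>
> As a rough sketch of option 1, assuming the rest of the chain stays exactly
> as in the schema quoted further down, the start of the index analyzer might
> be reordered like this (the same swap would apply to the query analyzer):
>
> ```xml
> <analyzer type="index">
>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>   <!-- WordDelimiterFilterFactory now runs before stopword removal -->
>   <filter class="solr.WordDelimiterFilterFactory"
>           generateWordParts="1" generateNumberParts="1" catenateWords="1"
>           catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>   <filter class="solr.StopFilterFactory" ignoreCase="true"
>           words="stopwords.txt" enablePositionIncrements="true"/>
>   <!-- LowerCaseFilterFactory and the remaining filters follow unchanged -->
> </analyzer>
> ```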
>
> On doing some more investigation, it appears that if a hyphenated word is
> immediately after a stopword AND the above is true (StopFilterFactory included
> before WordDelimiterFilterFactory and enablePositionIncrements="true"), then
> the search fails. I indexed this title:
>
> Love-customs in eighteenth-century Spain for nineteenth-century
>
> Searching in solr/admin/form.jsp for:
> title:(nineteenth-century)
>
> fails. But if I remove the "for" from the title, the above query works.
> Searching for
> title:(love-customs)
> always works.
>
> Finally, (and it's *really* time to go to sleep now), just setting
> enablePositionIncrements="false" in the "index" portion of the schema also
> causes things to work.
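>
> A minimal sketch of that last variant, assuming only the stop filter line in
> the index analyzer changes and the query analyzer keeps
> enablePositionIncrements="true":
>
> ```xml
> <!-- inside <analyzer type="index"> only -->
> <filter class="solr.StopFilterFactory" ignoreCase="true"
>         words="stopwords.txt" enablePositionIncrements="false"/>
> ```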
>
> Developer folks:
> I didn't see anything in a quick look in SOLR or Lucene JIRAs, should I
> refine this a bit (really, sleepy time is near) and add a JIRA?
>
> Best
> Erick
>
> On Wed, Apr 7, 2010 at 10:29 AM, Demian Katz <demian.k...@villanova.edu> wrote:
>
> > Hello.  It has been a few weeks, and I haven't gotten any responses.
> > Perhaps my question is too complicated -- maybe a better approach is to try
> > to gain enough knowledge to answer it myself.  My gut feeling is still that
> > it's something to do with the way term positions are getting handled by the
> > WordDelimiterFilterFactory, but I don't have a good understanding of how
> > term positions are calculated or factored into searching.  Can anyone
> > recommend some good reading to familiarize myself with these concepts in
> > better detail?
> >
> > thanks,
> > Demian
> >
> > From: Demian Katz
> > Sent: Tuesday, March 16, 2010 9:47 AM
> > To: solr-user@lucene.apache.org
> > Subject: solr.WordDelimiterFilterFactory problem with hyphenated terms?
> >
> > This is my first post on this list -- apologies if this has been discussed
> > before; I didn't come upon anything exactly equivalent in searching the
> > archives via Google.
> >
> > I'm using Solr 1.4 as part of the VuFind application, and I just noticed
> > that searches for hyphenated terms are failing in strange ways.  I strongly
> > suspect it has something to do with the solr.WordDelimiterFilterFactory
> > filter, but I'm not exactly sure what.
> >
> > The problem is that I have a record with the title "Love customs in
> > eighteenth-century Spain."  Depending on how I search for this, I get
> > successes or failures in a seemingly unpredictable pattern.
> >
> > Demonstration queries below were tested using the direct Solr
> > administration tool, just to eliminate any VuFind-related factors from the
> > equation while debugging.
> >
> > Queries that work:
> > title:(Love customs in eighteenth century Spain)
> >     // no hyphen, no phrases
> > title:("Love customs in eighteenth-century Spain")
> >     // phrase search on whole title, with hyphen
> >
> > Queries that fail:
> > title:(Love customs in eighteenth-century Spain)
> >     // hyphen, no phrases
> > title:("Love customs in eighteenth century Spain")
> >     // phrase search on whole title, without hyphen
> > title:(Love customs in "eighteenth-century" Spain)
> >     // hyphenated word as phrase
> > title:(Love customs in "eighteenth century" Spain)
> >     // hyphenated word as phrase, hyphen removed
> >
> > Here is VuFind's text field type definition:
> >
> >    <fieldType name="text" class="solr.TextField"
> >               positionIncrementGap="100">
> >      <analyzer type="index">
> >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >        <filter class="solr.StopFilterFactory" ignoreCase="true"
> >                words="stopwords.txt" enablePositionIncrements="true"/>
> >        <filter class="solr.WordDelimiterFilterFactory"
> >                generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >        <filter class="solr.LowerCaseFilterFactory"/>
> >        <filter class="solr.SnowballPorterFilterFactory" language="English"
> >                protected="protwords.txt"/>
> >        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >        <filter class="schema.UnicodeNormalizationFilterFactory"
> >                version="icu4j" composed="false" remove_diacritics="true"
> >                remove_modifiers="true" fold="true"/>
> >        <filter class="solr.ISOLatin1AccentFilterFactory"/>
> >      </analyzer>
> >      <analyzer type="query">
> >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >                ignoreCase="true" expand="true"/>
> >        <filter class="solr.StopFilterFactory" ignoreCase="true"
> >                words="stopwords.txt" enablePositionIncrements="true"/>
> >        <filter class="solr.WordDelimiterFilterFactory"
> >                generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >        <filter class="solr.LowerCaseFilterFactory"/>
> >        <filter class="solr.SnowballPorterFilterFactory" language="English"
> >                protected="protwords.txt"/>
> >        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >        <filter class="schema.UnicodeNormalizationFilterFactory"
> >                version="icu4j" composed="false" remove_diacritics="true"
> >                remove_modifiers="true" fold="true"/>
> >        <filter class="solr.ISOLatin1AccentFilterFactory"/>
> >      </analyzer>
> >    </fieldType>
> >
> > I did notice that the "text" field type in VuFind's schema has
> > "catenateWords" and "catenateNumbers" turned on in both the index and query
> > analyzer chains.  It is my understanding that these options should be
> > disabled for the query chain and only enabled for the index chain.  However,
> > this may be a red herring -- I have already tried changing this setting, but
> > it didn't change the success/failure pattern described above.  I have also
> > played with the preserveOriginal setting without apparent effect.
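> >
> > For reference, a sketch of what disabling those options on the query side
> > could look like. Only the two catenate attributes differ from the schema
> > above, and as noted, this change alone did not alter the results:
> >
> > ```xml
> > <!-- inside <analyzer type="query"> only -->
> > <filter class="solr.WordDelimiterFilterFactory"
> >         generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >         catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> > ```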
> >
> > From playing with the Field Analysis tool, I notice that there is a gap in
> > the term position sequence after analysis...  but I'm not sure if this is
> > significant.
> >
> > Has anybody else run into this sort of problem?  Any ideas on a fix?
> >
> > thanks,
> > Demian
> >
> >
>



-- 
Robert Muir
rcm...@gmail.com
