Re: solr.WordDelimiterFilterFactory problem with hyphenated terms?

Erick Erickson Thu, 08 Apr 2010 07:02:10 -0700

You're right, it sure looks related. But according to that JIRA, it's fixed
in trunk, and I'm pretty sure I have a very recent version that I built from
code I updated within the last few days.


I'll update tonight and double check. If it's still a problem I'll see if
I can write a test case illustrating the behavior and maybe poke around
to see if it's an easy fix.

Thanks
Erick

On Thu, Apr 8, 2010 at 8:16 AM, Robert Muir <rcm...@gmail.com> wrote:

> Erick, this sounds like https://issues.apache.org/jira/browse/SOLR-1852
>
> On Wed, Apr 7, 2010 at 10:04 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > Well, for a quick trial using trunk, I had to remove the
> > UnicodeNormalizationFilterFactory; is that yours?
> >
> > But with that removed, I get the results you do, ASSUMING that you've set
> > your default operator to AND in schema.xml...
> >
> > Believe it or not, it all changes and all your queries return a hit if you
> > do one of two things (I did this in both index and query when testing
> > 'cause I'm lazy):
> > 1> move the inclusion of the StopFilterFactory after
> >    WordDelimiterFilterFactory, or
> > 2> for StopFilterFactory, set enablePositionIncrements="false"
> >
> > I think either of these might work in your situation.
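> >
> > As a rough sketch of the two options above, the index analyzer might look
> > like this in each case. Only the two filters in question are shown, with
> > attribute values copied from the schema quoted further down; neither
> > variant has been checked beyond the quick trial described here.
> >
> > ```xml
> > <!-- Option 1: StopFilterFactory moved after WordDelimiterFilterFactory -->
> > <analyzer type="index">
> >   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >   <filter class="solr.WordDelimiterFilterFactory"
> >           generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >           catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >   <filter class="solr.StopFilterFactory" ignoreCase="true"
> >           words="stopwords.txt" enablePositionIncrements="true"/>
> >   <!-- remaining filters unchanged -->
> > </analyzer>
> >
> > <!-- Option 2: original filter order, position increments disabled -->
> > <analyzer type="index">
> >   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >   <filter class="solr.StopFilterFactory" ignoreCase="true"
> >           words="stopwords.txt" enablePositionIncrements="false"/>
> >   <filter class="solr.WordDelimiterFilterFactory"
> >           generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >           catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >   <!-- remaining filters unchanged -->
> > </analyzer>
> > ```
> >
> > The same change would apply to the type="query" analyzer if you want both
> > chains to match, which is how I tested it.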
> >
> > On doing some more investigation, it appears that if a hyphenated word
> > comes immediately after a stopword AND the above is true (StopFilterFactory
> > included before WordDelimiterFilterFactory and
> > enablePositionIncrements="true"), then the search fails. I indexed this
> > title:
> >
> > Love-customs in eighteenth-century Spain for nineteenth-century
> >
> > Searching in solr/admin/form.jsp for:
> > title:(nineteenth-century)
> >
> > fails. But if I remove the "for" from the title, the above query works.
> > Searching for
> > title:(love-customs)
> > always works.
> >
> > Finally (and it's *really* time to go to sleep now), just setting
> > enablePositionIncrements="false" in the "index" portion of the schema also
> > causes things to work.
> >
> > Developer folks:
> > I didn't see anything in a quick look in SOLR or Lucene JIRAs, should I
> > refine this a bit (really, sleepy time is near) and add a JIRA?
> >
> > Best
> > Erick
> >
> > On Wed, Apr 7, 2010 at 10:29 AM, Demian Katz <demian.k...@villanova.edu> wrote:
> >
> > > Hello.  It has been a few weeks, and I haven't gotten any responses.
> > > Perhaps my question is too complicated -- maybe a better approach is to try
> > > to gain enough knowledge to answer it myself.  My gut feeling is still that
> > > it's something to do with the way term positions are getting handled by the
> > > WordDelimiterFilterFactory, but I don't have a good understanding of how
> > > term positions are calculated or factored into searching.  Can anyone
> > > recommend some good reading to familiarize myself with these concepts in
> > > better detail?
> > >
> > > thanks,
> > > Demian
> > >
> > > From: Demian Katz
> > > Sent: Tuesday, March 16, 2010 9:47 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: solr.WordDelimiterFilterFactory problem with hyphenated terms?
> > >
> > > This is my first post on this list -- apologies if this has been discussed
> > > before; I didn't come upon anything exactly equivalent in searching the
> > > archives via Google.
> > >
> > > I'm using Solr 1.4 as part of the VuFind application, and I just noticed
> > > that searches for hyphenated terms are failing in strange ways.  I strongly
> > > suspect it has something to do with the solr.WordDelimiterFilterFactory
> > > filter, but I'm not exactly sure what.
> > >
> > > The problem is that I have a record with the title "Love customs in
> > > eighteenth-century Spain."  Depending on how I search for this, I get
> > > successes or failures in a seemingly unpredictable pattern.
> > >
> > > Demonstration queries below were tested using the direct Solr
> > > administration tool, just to eliminate any VuFind-related factors from the
> > > equation while debugging.
> > >
> > > Queries that work:
> > > title:(Love customs in eighteenth century Spain)
> > >                     // no hyphen, no phrases
> > > title:("Love customs in eighteenth-century Spain")
> > >                  // phrase search on whole title, with hyphen
> > >
> > > Queries that fail:
> > > title:(Love customs in eighteenth-century Spain)
> > >                    // hyphen, no phrases
> > > title:("Love customs in eighteenth century Spain")
> > >                   // phrase search on whole title, without hyphen
> > > title:(Love customs in "eighteenth-century" Spain)
> > >                  // hyphenated word as phrase
> > > title:(Love customs in "eighteenth century" Spain)
> > >                   // hyphenated word as phrase, hyphen removed
> > >
> > > Here is VuFind's text field type definition:
> > >
> > >    <fieldType name="text" class="solr.TextField"
> > >               positionIncrementGap="100">
> > >      <analyzer type="index">
> > >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >        <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >                words="stopwords.txt" enablePositionIncrements="true"/>
> > >        <filter class="solr.WordDelimiterFilterFactory"
> > >                generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > >                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > >        <filter class="solr.LowerCaseFilterFactory"/>
> > >        <filter class="solr.SnowballPorterFilterFactory" language="English"
> > >                protected="protwords.txt"/>
> > >        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > >        <filter class="schema.UnicodeNormalizationFilterFactory"
> > >                version="icu4j" composed="false" remove_diacritics="true"
> > >                remove_modifiers="true" fold="true"/>
> > >        <filter class="solr.ISOLatin1AccentFilterFactory"/>
> > >      </analyzer>
> > >      <analyzer type="query">
> > >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> > >                ignoreCase="true" expand="true"/>
> > >        <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >                words="stopwords.txt" enablePositionIncrements="true"/>
> > >        <filter class="solr.WordDelimiterFilterFactory"
> > >                generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > >                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > >        <filter class="solr.LowerCaseFilterFactory"/>
> > >        <filter class="solr.SnowballPorterFilterFactory" language="English"
> > >                protected="protwords.txt"/>
> > >        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > >        <filter class="schema.UnicodeNormalizationFilterFactory"
> > >                version="icu4j" composed="false" remove_diacritics="true"
> > >                remove_modifiers="true" fold="true"/>
> > >        <filter class="solr.ISOLatin1AccentFilterFactory"/>
> > >      </analyzer>
> > >    </fieldType>
> > >
> > > I did notice that the "text" field type in VuFind's schema has
> > > "catenateWords" and "catenateNumbers" turned on in both the index and query
> > > analyzer chains.  It is my understanding that these options should be
> > > disabled for the query chain and only enabled for the index chain.  However,
> > > this may be a red herring -- I have already tried changing this setting, but
> > > it didn't change the success/failure pattern described above.  I have also
> > > played with the preserveOriginal setting without apparent effect.
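> > >
> > > As a sketch, disabling catenation in the query chain would mean changing
> > > only the WordDelimiterFilterFactory line of the type="query" analyzer to
> > > something like the following. The "0" values are an assumption about what
> > > "disabled" means here; the other attributes are copied unchanged from the
> > > definition above.
> > >
> > > ```xml
> > > <!-- query analyzer only: catenateWords and catenateNumbers switched off -->
> > > <filter class="solr.WordDelimiterFilterFactory"
> > >         generateWordParts="1" generateNumberParts="1" catenateWords="0"
> > >         catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> > > ```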
> > >
> > > From playing with the Field Analysis tool, I notice that there is a gap in
> > > the term position sequence after analysis...  but I'm not sure if this is
> > > significant.
> > >
> > > Has anybody else run into this sort of problem?  Any ideas on a fix?
> > >
> > > thanks,
> > > Demian
> > >
> > >
> >
>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>
