RE: solr.WordDelimiterFilterFactory problem with hyphenated terms?

Demian Katz Mon, 12 Apr 2010 06:50:08 -0700

I don't think the behavior is correct.  The first example, with just one gap, 
does NOT match.  The second example, with an extra second gap, DOES match.  It 
seems that the term collapsing ("eighteenth-century" --> "eighteenthcentury") 
somehow throws off the position sequence, forcing you to add an extra gap in 
order to get a match.  It's good to know that slop is an option to work around 
this problem... but it still seems to me that something isn't working the way 
it is supposed to in this particular case.
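
To make the position arithmetic concrete, here is a minimal Python sketch of the situation as I understand it. It is a toy model, not Solr or Lucene code: the helper names (analyze, phrase_matches), the one-word stopword list, and the indexed positions are all assumptions made for illustration. The indexed positions are just one set of values that would be consistent with the two results described above; they have not been confirmed against the real index, and slop is modeled in a simplified way.

```python
from itertools import product

STOPWORDS = frozenset({"in"})  # assumption: stands in for stopwords.txt


def analyze(text, stopwords=STOPWORDS):
    """Return (term, position) pairs, leaving a position gap for each removed
    stopword, the way a stop filter with position increments enabled would."""
    return [
        (token, position)
        for position, token in enumerate(text.lower().split(), start=1)
        if token not in stopwords
    ]


def phrase_matches(indexed, query, slop=0):
    """Toy phrase match over term positions.

    `indexed` maps each term to its list of positions in the document.
    Each query term contributes candidate offsets (indexed position minus
    query position). The phrase matches when some choice of offsets has a
    spread (max - min) no larger than `slop`; slop=0 means exact alignment.
    """
    if not query:
        return False
    offsets_per_term = []
    for term, query_position in query:
        positions = indexed.get(term)
        if not positions:
            return False  # a required term is missing from the document
        offsets_per_term.append([p - query_position for p in positions])
    return any(
        max(offsets) - min(offsets) <= slop
        for offsets in product(*offsets_per_term)
    )


# Hypothetical indexed positions: one arrangement consistent with the
# observed results, with two empty positions before "eighteenthcentury".
INDEXED = {"love": [1], "customs": [2], "eighteenthcentury": [5], "spain": [6]}

one_gap = analyze("love customs in eighteenthcentury spain")
two_gaps = analyze("love customs in in eighteenthcentury spain")

print(phrase_matches(INDEXED, one_gap))          # False: query has one gap
print(phrase_matches(INDEXED, two_gaps))         # True: query has two gaps
print(phrase_matches(INDEXED, one_gap, slop=1))  # True: slop absorbs the gap
```

In Lucene query syntax, slop is written by appending ~N to the phrase, for example title:"love customs in eighteenthcentury spain"~1.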


- Demian

> -----Original Message-----
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Friday, April 09, 2010 12:05 PM
> To: solr-user@lucene.apache.org
> Subject: Re: solr.WordDelimiterFilterFactory problem with hyphenated
> terms?
> 
> but this behavior is correct, as you have position increments enabled.
> if you want the second query (which has 2 gaps) to match, you need to
> either
> use slop, or disable these increments altogether.
> 
> On Fri, Apr 9, 2010 at 11:44 AM, Demian Katz
> <demian.k...@villanova.edu>wrote:
> 
> > > I've given it a try, and it definitely seems to have improved the
> > > situation.  However, there is still one weird case that's clearly
> > > related to term positions.  If I do this search, it fails:
> >
> > title:"love customs in eighteenthcentury spain"
> >
> > ...but if I do this search, it succeeds:
> >
> > title:"love customs in in eighteenthcentury spain"
> >
> > (note the duplicate "in").
> >
> > - Demian
> >
> > > -----Original Message-----
> > > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > > Sent: Thursday, April 08, 2010 11:20 AM
> > > To: solr-user@lucene.apache.org
> > > > Subject: Re: solr.WordDelimiterFilterFactory problem with hyphenated
> > > > terms?
> > > >
> > > > I'm not all that familiar with the underlying issues, but of the two
> > > > I'd pick moving the WordDelimiterFactory rather than setting
> > > > increments = "false".
> > >
> > > But that's at least partly a guess....
> > >
> > > Best
> > > Erick
> > >
> > > On Thu, Apr 8, 2010 at 11:00 AM, Demian Katz
> > > <demian.k...@villanova.edu>wrote:
> > >
> > > > > Thanks for looking into this -- I appreciate the help (and feel a
> > > > > little better that there seems to be a bug at work here and not just
> > > > > my total incomprehension).
> > > > >
> > > > > Sorry for any confusion over the UnicodeNormalizationFactory --
> > > > > that's actually a plug-in from the SolrMarc project
> > > > > (http://code.google.com/p/solrmarc/) that slipped into my example.
> > > > > Also, as you guessed, my default operator is indeed set to "AND."
> > > > >
> > > > > It sounds to me that, of your two proposed work-arounds, moving the
> > > > > StopFilterFactory after WordDelimiterFactory is the less disruptive.
> > > > > I'm guessing that disabling position increments across the board
> > > > > might have implications for other types of phrase searches, while
> > > > > filtering stopwords later in the chain should be more functionally
> > > > > equivalent, if slightly less efficient (potentially more terms to
> > > > > examine).  Would you agree with this assessment?  If not, what
> > > > > possible negative side effects am I forgetting about?
> > > >
> > > > thanks,
> > > > Demian
> > > >
> > > > > -----Original Message-----
> > > > > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > > > > Sent: Wednesday, April 07, 2010 10:04 PM
> > > > > To: solr-user@lucene.apache.org
> > > > > > Subject: Re: solr.WordDelimiterFilterFactory problem with
> > > > > > hyphenated terms?
> > > > > >
> > > > > > Well, for a quick trial using trunk, I had to remove the
> > > > > > UnicodeNormalizationFactory, is that yours?
> > > > > >
> > > > > > But with that removed, I get the results you do, ASSUMING that
> > > > > > you've set your default operator to AND in schema.xml...
> > > > > >
> > > > > > Believe it or not, it all changes and all your queries return a hit
> > > > > > if you do one of two things (I did this in both index and query
> > > > > > when testing 'cause I'm lazy):
> > > > > > 1> move the inclusion of the StopFilterFactory after
> > > > > > WordDelimiterFactory
> > > > > > or
> > > > > > 2> for StopFilterFactory, set enablePositionIncrements="false"
> > > > > >
> > > > > > I think either of these might work in your situation.......
> > > > >
> > > > > > On doing some more investigation, it appears that if a hyphenated
> > > > > > word is immediately after a stopword AND the above is true (stop
> > > > > > factory included before WordDelimiterFactory and
> > > > > > enablePositionIncrements="true"), then the search fails. I indexed
> > > > > > this title:
> > > > >
> > > > > Love-customs in eighteenth-century Spain for nineteenth-century
> > > > >
> > > > > Searching in solr/admin/form.jsp for:
> > > > > title:(nineteenth-century)
> > > > >
> > > > > > fails. But if I remove the "for" from the title, the above query
> > > > > > works. Searching for
> > > > > > title:(love-customs)
> > > > > > always works.
> > > > > >
> > > > > > Finally, (and it's *really* time to go to sleep now), just setting
> > > > > > enablePositionIncrements="false" in the "index" portion of the
> > > > > > schema also causes things to work.
> > > > > >
> > > > > > Developer folks:
> > > > > > I didn't see anything in a quick look in SOLR or Lucene JIRAs,
> > > > > > should I refine this a bit (really, sleepy time is near) and add a
> > > > > > JIRA?
> > > > >
> > > > > Best
> > > > > Erick
> > > > >
> > > > > On Wed, Apr 7, 2010 at 10:29 AM, Demian Katz
> > > > > <demian.k...@villanova.edu>wrote:
> > > > >
> > > > > > > Hello.  It has been a few weeks, and I haven't gotten any
> > > > > > > responses.  Perhaps my question is too complicated -- maybe a
> > > > > > > better approach is to try to gain enough knowledge to answer it
> > > > > > > myself.  My gut feeling is still that it's something to do with
> > > > > > > the way term positions are getting handled by the
> > > > > > > WordDelimiterFilterFactory, but I don't have a good
> > > > > > > understanding of how term positions are calculated or factored
> > > > > > > into searching.  Can anyone recommend some good reading to
> > > > > > > familiarize myself with these concepts in better detail?
> > > > > >
> > > > > > thanks,
> > > > > > Demian
> > > > > >
> > > > > > From: Demian Katz
> > > > > > Sent: Tuesday, March 16, 2010 9:47 AM
> > > > > > To: solr-user@lucene.apache.org
> > > > > > > Subject: solr.WordDelimiterFilterFactory problem with hyphenated
> > > > > > > terms?
> > > > > > >
> > > > > > > This is my first post on this list -- apologies if this has been
> > > > > > > discussed before; I didn't come upon anything exactly equivalent
> > > > > > > in searching the archives via Google.
> > > > > > >
> > > > > > > I'm using Solr 1.4 as part of the VuFind application, and I just
> > > > > > > noticed that searches for hyphenated terms are failing in strange
> > > > > > > ways.  I strongly suspect it has something to do with the
> > > > > > > solr.WordDelimiterFilterFactory filter, but I'm not exactly sure
> > > > > > > what.
> > > > > > >
> > > > > > > The problem is that I have a record with the title "Love customs
> > > > > > > in eighteenth-century Spain."  Depending on how I search for this,
> > > > > > > I get successes or failures in a seemingly unpredictable pattern.
> > > > > > >
> > > > > > > Demonstration queries below were tested using the direct Solr
> > > > > > > administration tool, just to eliminate any VuFind-related factors
> > > > > > > from the equation while debugging.
> > > > > >
> > > > > > > Queries that work:
> > > > > > > title:(Love customs in eighteenth century Spain)
> > > > > > >     // no hyphen, no phrases
> > > > > > > title:("Love customs in eighteenth-century Spain")
> > > > > > >     // phrase search on whole title, with hyphen
> > > > > > >
> > > > > > > Queries that fail:
> > > > > > > title:(Love customs in eighteenth-century Spain)
> > > > > > >     // hyphen, no phrases
> > > > > > > title:("Love customs in eighteenth century Spain")
> > > > > > >     // phrase search on whole title, without hyphen
> > > > > > > title:(Love customs in "eighteenth-century" Spain)
> > > > > > >     // hyphenated word as phrase
> > > > > > > title:(Love customs in "eighteenth century" Spain)
> > > > > > >     // hyphenated word as phrase, hyphen removed
> > > > > >
> > > > > > Here is VuFind's text field type definition:
> > > > > >
> > > > > > >    <fieldType name="text" class="solr.TextField"
> > > > > > >               positionIncrementGap="100">
> > > > > > >      <analyzer type="index">
> > > > > > >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > > > > >        <filter class="solr.StopFilterFactory" ignoreCase="true"
> > > > > > >                words="stopwords.txt" enablePositionIncrements="true"/>
> > > > > > >        <filter class="solr.WordDelimiterFilterFactory"
> > > > > > >                generateWordParts="1" generateNumberParts="1"
> > > > > > >                catenateWords="1" catenateNumbers="1" catenateAll="0"
> > > > > > >                splitOnCaseChange="1"/>
> > > > > > >        <filter class="solr.LowerCaseFilterFactory"/>
> > > > > > >        <filter class="solr.SnowballPorterFilterFactory"
> > > > > > >                language="English" protected="protwords.txt"/>
> > > > > > >        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > > > > > >        <filter class="schema.UnicodeNormalizationFilterFactory"
> > > > > > >                version="icu4j" composed="false"
> > > > > > >                remove_diacritics="true" remove_modifiers="true"
> > > > > > >                fold="true"/>
> > > > > > >        <filter class="solr.ISOLatin1AccentFilterFactory"/>
> > > > > > >      </analyzer>
> > > > > > >      <analyzer type="query">
> > > > > > >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > > > > >        <filter class="solr.SynonymFilterFactory"
> > > > > > >                synonyms="synonyms.txt" ignoreCase="true"
> > > > > > >                expand="true"/>
> > > > > > >        <filter class="solr.StopFilterFactory" ignoreCase="true"
> > > > > > >                words="stopwords.txt" enablePositionIncrements="true"/>
> > > > > > >        <filter class="solr.WordDelimiterFilterFactory"
> > > > > > >                generateWordParts="1" generateNumberParts="1"
> > > > > > >                catenateWords="1" catenateNumbers="1" catenateAll="0"
> > > > > > >                splitOnCaseChange="1"/>
> > > > > > >        <filter class="solr.LowerCaseFilterFactory"/>
> > > > > > >        <filter class="solr.SnowballPorterFilterFactory"
> > > > > > >                language="English" protected="protwords.txt"/>
> > > > > > >        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > > > > > >        <filter class="schema.UnicodeNormalizationFilterFactory"
> > > > > > >                version="icu4j" composed="false"
> > > > > > >                remove_diacritics="true" remove_modifiers="true"
> > > > > > >                fold="true"/>
> > > > > > >        <filter class="solr.ISOLatin1AccentFilterFactory"/>
> > > > > > >      </analyzer>
> > > > > > >    </fieldType>
> > > > > >
> > > > > > > I did notice that the "text" field type in VuFind's schema has
> > > > > > > "catenateWords" and "catenateNumbers" turned on in both the index
> > > > > > > and query analyzer chains.  It is my understanding that these
> > > > > > > options should be disabled for the query chain and only enabled
> > > > > > > for the index chain.  However, this may be a red herring -- I have
> > > > > > > already tried changing this setting, but it didn't change the
> > > > > > > success/failure pattern described above.  I have also played with
> > > > > > > the preserveOriginal setting without apparent effect.
> > > > > > >
> > > > > > > From playing with the Field Analysis tool, I notice that there is
> > > > > > > a gap in the term position sequence after analysis...  but I'm not
> > > > > > > sure if this is significant.
> > > > > > >
> > > > > > > Has anybody else run into this sort of problem?  Any ideas on a
> > > > > > > fix?
> > > > > >
> > > > > > thanks,
> > > > > > Demian
> > > > > >
> > > > > >
> > > >
> >
> 
> 
> 
> --
> Robert Muir
> rcm...@gmail.com
