Re: solr.WordDelimiterFilterFactory problem with hyphenated terms?

Robert Muir Mon, 12 Apr 2010 07:32:03 -0700

Sorry I was backwards with my response, but the behavior is definitely
correct here.


On Mon, Apr 12, 2010 at 9:46 AM, Demian Katz <demian.k...@villanova.edu> wrote:

> I don't think the behavior is correct.  The first example, with just one
> gap, does NOT match.  The second example, with an extra second gap, DOES
> match.  It seems that the term collapsing ("eighteenth-century" -->
> "eighteenthcentury") somehow throws off the position sequence, forcing you
> to add an extra gap in order to get a match.


Sorry, I had my response backwards. But yes: with the options you specified
for WordDelimiterFilter (generateWordParts), eighteenth-century is analyzed
as eighteenth, followed by century, followed by an injected
(positionIncrement=0) eighteenthcentury.
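
To make the positions concrete, here is a rough Python sketch of that
analysis. It is only a toy model under stated assumptions, not Lucene or Solr
code: the analyze and phrase_matches helpers and the two-word stopword set are
made up for illustration, lowercasing is the only normalization (no stemming
or synonyms), and query tokens that share a position are treated as
alternatives. It consumes one position for each removed stopword and puts the
concatenated token at the same position as the last word part.

```python
# Toy model of term positions; NOT Lucene/Solr code.
STOPWORDS = {"in", "for"}  # assumption: tiny stand-in for stopwords.txt
TITLE = "Love customs in eighteenth-century Spain"


def analyze(text):
    """Return (term, position) pairs: stop filter first, then hyphen splitting."""
    tokens = []
    position = 0
    for raw in text.lower().split():
        if raw in STOPWORDS:
            position += 1  # stopword is removed, but its position is still consumed
            continue
        parts = [p for p in raw.split("-") if p]
        for part in parts:
            position += 1
            tokens.append((part, position))
        if len(parts) > 1:
            # The concatenated form is injected with increment 0, so it
            # shares the position of the last word part.
            tokens.append(("".join(parts), position))
    return tokens


def phrase_matches(doc_text, query_text):
    """Exact phrase match (slop 0): query terms must line up at one fixed offset."""
    doc = set(analyze(doc_text))
    by_position = {}
    for term, pos in analyze(query_text):
        # Assumption: terms at the same query position are alternatives.
        by_position.setdefault(pos, set()).add(term)
    if not by_position:
        return False
    first = min(by_position)
    offsets = {pos - first for _, pos in doc}
    return any(
        all(any((term, qpos + off) in doc for term in terms)
            for qpos, terms in by_position.items())
        for off in offsets
    )


print(analyze(TITLE))
print(phrase_matches(TITLE, "love customs in eighteenthcentury spain"))     # False
print(phrase_matches(TITLE, "love customs in in eighteenthcentury spain"))  # True
```

In this model the document has eighteenthcentury at position 5, while the
query with a single "in" expects it at position 4. The second "in" shifts the
query token to position 5, which is why that phrase lines up.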

If you want eighteenth-century to be only a single token, you can turn off
generateWordParts and concatenate instead. Of course, then you won't match
"eighteenth century" with a space. But at the end of the day, your problem is
just that the phrase has a different number of tokens in the query than it
does in the document.
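
As a rough illustration of that trade-off, the Python sketch below models a
concatenate-only chain. Again this is a toy under stated assumptions, not the
real filter: analyze_catenate_only, phrase_matches and the one-word stopword
set are hypothetical names, and the only processing is lowercasing, stopword
removal (keeping the position gap) and joining the hyphenated parts into one
token.

```python
# Toy model of a concatenate-only chain; NOT Lucene/Solr code.
STOPWORDS = {"in"}  # assumption: tiny stand-in for stopwords.txt
TITLE = "Love customs in eighteenth-century Spain"


def analyze_catenate_only(text):
    """Return (term, position) pairs with hyphenated words joined into one token."""
    tokens = []
    for position, raw in enumerate(text.lower().split(), start=1):
        if raw in STOPWORDS:
            continue  # removed, but its position has already been consumed
        term = raw.replace("-", "")
        if term:
            tokens.append((term, position))
    return tokens


def phrase_matches(doc_text, query_text):
    """Exact phrase match (slop 0) on the toy positions."""
    doc = set(analyze_catenate_only(doc_text))
    query = analyze_catenate_only(query_text)
    if not query:
        return False
    first_term, first_pos = query[0]
    starts = [pos for term, pos in doc if term == first_term]
    return any(
        all((term, pos - first_pos + start) in doc for term, pos in query)
        for start in starts
    )


joined = "love customs in eighteenthcentury spain"
spaced = "love customs in eighteenth century spain"
print(phrase_matches(TITLE, joined))  # True
print(phrase_matches(TITLE, spaced))  # False
# Token counts differ: 4 for the document, 5 for the spaced query.
print(len(analyze_catenate_only(TITLE)), len(analyze_catenate_only(spaced)))
```

The spaced query produces one more token than the document does, so no amount
of shifting makes the two sequences line up without slop.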

> It's good to know that slop is an option to work around this problem... but
> it still seems to me that something isn't working the way it is supposed to
> in this particular case.
>
> - Demian
>
> > -----Original Message-----
> > From: Robert Muir [mailto:rcm...@gmail.com]
> > Sent: Friday, April 09, 2010 12:05 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: solr.WordDelimiterFilterFactory problem with hyphenated terms?
> >
> > but this behavior is correct, as you have position increments enabled.
> > if you want the second query (which has 2 gaps) to match, you need to
> > either use slop, or disable these increments altogether.
> >
> > On Fri, Apr 9, 2010 at 11:44 AM, Demian Katz <demian.k...@villanova.edu> wrote:
> >
> > > I've given it a try, and it definitely seems to have improved the
> > > situation.  However, there is still one weird case that's clearly
> > > related to term positions.  If I do this search, it fails:
> > >
> > > title:"love customs in eighteenthcentury spain"
> > >
> > > ...but if I do this search, it succeeds:
> > >
> > > title:"love customs in in eighteenthcentury spain"
> > >
> > > (note the duplicate "in").
> > >
> > > - Demian
> > >
> > > > -----Original Message-----
> > > > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > > > Sent: Thursday, April 08, 2010 11:20 AM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: solr.WordDelimiterFilterFactory problem with hyphenated terms?
> > > >
> > > > I'm not all that familiar with the underlying issues, but of the two
> > > > I'd pick moving the WordDelimiterFactory rather than setting
> > > > increments = "false".
> > > >
> > > > But that's at least partly a guess....
> > > >
> > > > Best
> > > > Erick
> > > >
> > > > On Thu, Apr 8, 2010 at 11:00 AM, Demian Katz <demian.k...@villanova.edu> wrote:
> > > >
> > > > > Thanks for looking into this -- I appreciate the help (and feel a
> > > > > little better that there seems to be a bug at work here and not just
> > > > > my total incomprehension).
> > > > >
> > > > > Sorry for any confusion over the UnicodeNormalizationFactory --
> > > > > that's actually a plug-in from the SolrMarc project
> > > > > (http://code.google.com/p/solrmarc/) that slipped into my example.
> > > > > Also, as you guessed, my default operator is indeed set to "AND."
> > > > >
> > > > > It sounds to me that, of your two proposed work-arounds, moving the
> > > > > StopFilterFactory after WordDelimiterFactory is the least disruptive.
> > > > > I'm guessing that disabling position increments across the board
> > > > > might have implications for other types of phrase searches, while
> > > > > filtering stopwords later in the chain should be more functionally
> > > > > equivalent, if slightly less efficient (potentially more terms to
> > > > > examine).  Would you agree with this assessment?  If not, what
> > > > > possible negative side effects am I forgetting about?
> > > > >
> > > > > thanks,
> > > > > Demian
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > > > > > Sent: Wednesday, April 07, 2010 10:04 PM
> > > > > > To: solr-user@lucene.apache.org
> > > > > > Subject: Re: solr.WordDelimiterFilterFactory problem with hyphenated terms?
> > > > > >
> > > > > > Well, for a quick trial using trunk, I had to remove the
> > > > > > UnicodeNormalizationFactory, is that yours?
> > > > > >
> > > > > > But with that removed, I get the results you do, ASSUMING that
> > > > > > you've set your default operator to AND in schema.xml...
> > > > > >
> > > > > > Believe it or not, it all changes and all your queries return a hit
> > > > > > if you do one of two things (I did this in both index and query
> > > > > > when testing 'cause I'm lazy):
> > > > > > 1> move the inclusion of the StopFilterFactory after
> > > > > >    WordDelimiterFactory
> > > > > > or
> > > > > > 2> for StopFilterFactory, set enablePositionIncrements="false"
> > > > > >
> > > > > > I think either of these might work in your situation.......
> > > > > >
> > > > > > On doing some more investigation, it appears that if a hyphenated
> > > > > > word is immediately after a stopword AND the above is true (stop
> > > > > > factory included before WordDelimiterFactory and
> > > > > > enablePositionIncrements="true"), then the search fails. I indexed
> > > > > > this title:
> > > > > >
> > > > > > Love-customs in eighteenth-century Spain for nineteenth-century
> > > > > >
> > > > > > Searching in solr/admin/form.jsp for:
> > > > > > title:(nineteenth-century)
> > > > > >
> > > > > > fails. But if I remove the "for" from the title, the above query
> > > > > > works. Searching for
> > > > > > title:(love-customs)
> > > > > > always works.
> > > > > >
> > > > > > Finally, (and it's *really* time to go to sleep now), just setting
> > > > > > enablePositionIncrements="false" in the "index" portion of the
> > > > > > schema also causes things to work.
> > > > > >
> > > > > > Developer folks:
> > > > > > I didn't see anything in a quick look in SOLR or Lucene JIRAs,
> > > > > > should I refine this a bit (really, sleepy time is near) and add a
> > > > > > JIRA?
> > > > > >
> > > > > > Best
> > > > > > Erick
> > > > > >
> > > > > > On Wed, Apr 7, 2010 at 10:29 AM, Demian Katz <demian.k...@villanova.edu> wrote:
> > > > > >
> > > > > > > Hello.  It has been a few weeks, and I haven't gotten any
> > > > > > > responses.  Perhaps my question is too complicated -- maybe a
> > > > > > > better approach is to try to gain enough knowledge to answer it
> > > > > > > myself.  My gut feeling is still that it's something to do with
> > > > > > > the way term positions are getting handled by the
> > > > > > > WordDelimiterFilterFactory, but I don't have a good understanding
> > > > > > > of how term positions are calculated or factored into searching.
> > > > > > > Can anyone recommend some good reading to familiarize myself with
> > > > > > > these concepts in better detail?
> > > > > > >
> > > > > > > thanks,
> > > > > > > Demian
> > > > > > >
> > > > > > > From: Demian Katz
> > > > > > > Sent: Tuesday, March 16, 2010 9:47 AM
> > > > > > > To: solr-user@lucene.apache.org
> > > > > > > Subject: solr.WordDelimiterFilterFactory problem with hyphenated terms?
> > > > > > >
> > > > > > > This is my first post on this list -- apologies if this has been
> > > > > > > discussed before; I didn't come upon anything exactly equivalent
> > > > > > > in searching the archives via Google.
> > > > > > >
> > > > > > > I'm using Solr 1.4 as part of the VuFind application, and I just
> > > > > > > noticed that searches for hyphenated terms are failing in strange
> > > > > > > ways.  I strongly suspect it has something to do with the
> > > > > > > solr.WordDelimiterFilterFactory filter, but I'm not exactly sure
> > > > > > > what.
> > > > > > >
> > > > > > > The problem is that I have a record with the title "Love customs
> > > > > > > in eighteenth-century Spain."  Depending on how I search for this,
> > > > > > > I get successes or failures in a seemingly unpredictable pattern.
> > > > > > >
> > > > > > > Demonstration queries below were tested using the direct Solr
> > > > > > > administration tool, just to eliminate any VuFind-related factors
> > > > > > > from the equation while debugging.
> > > > > > >
> > > > > > > Queries that work:
> > > > > > > title:(Love customs in eighteenth century Spain)
> > > > > > >     // no hyphen, no phrases
> > > > > > > title:("Love customs in eighteenth-century Spain")
> > > > > > >     // phrase search on whole title, with hyphen
> > > > > > >
> > > > > > > Queries that fail:
> > > > > > > title:(Love customs in eighteenth-century Spain)
> > > > > > >     // hyphen, no phrases
> > > > > > > title:("Love customs in eighteenth century Spain")
> > > > > > >     // phrase search on whole title, without hyphen
> > > > > > > title:(Love customs in "eighteenth-century" Spain)
> > > > > > >     // hyphenated word as phrase
> > > > > > > title:(Love customs in "eighteenth century" Spain)
> > > > > > >     // hyphenated word as phrase, hyphen removed
> > > > > > >
> > > > > > > Here is VuFind's text field type definition:
> > > > > > >
> > > > > > >    <fieldType name="text" class="solr.TextField"
> > > > > > >               positionIncrementGap="100">
> > > > > > >      <analyzer type="index">
> > > > > > >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > > > > >        <filter class="solr.StopFilterFactory" ignoreCase="true"
> > > > > > >                words="stopwords.txt" enablePositionIncrements="true"/>
> > > > > > >        <filter class="solr.WordDelimiterFilterFactory"
> > > > > > >                generateWordParts="1" generateNumberParts="1"
> > > > > > >                catenateWords="1" catenateNumbers="1" catenateAll="0"
> > > > > > >                splitOnCaseChange="1"/>
> > > > > > >        <filter class="solr.LowerCaseFilterFactory"/>
> > > > > > >        <filter class="solr.SnowballPorterFilterFactory"
> > > > > > >                language="English" protected="protwords.txt"/>
> > > > > > >        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > > > > > >        <filter class="schema.UnicodeNormalizationFilterFactory"
> > > > > > >                version="icu4j" composed="false"
> > > > > > >                remove_diacritics="true" remove_modifiers="true"
> > > > > > >                fold="true"/>
> > > > > > >        <filter class="solr.ISOLatin1AccentFilterFactory"/>
> > > > > > >      </analyzer>
> > > > > > >      <analyzer type="query">
> > > > > > >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > > > > >        <filter class="solr.SynonymFilterFactory"
> > > > > > >                synonyms="synonyms.txt" ignoreCase="true"
> > > > > > >                expand="true"/>
> > > > > > >        <filter class="solr.StopFilterFactory" ignoreCase="true"
> > > > > > >                words="stopwords.txt" enablePositionIncrements="true"/>
> > > > > > >        <filter class="solr.WordDelimiterFilterFactory"
> > > > > > >                generateWordParts="1" generateNumberParts="1"
> > > > > > >                catenateWords="1" catenateNumbers="1" catenateAll="0"
> > > > > > >                splitOnCaseChange="1"/>
> > > > > > >        <filter class="solr.LowerCaseFilterFactory"/>
> > > > > > >        <filter class="solr.SnowballPorterFilterFactory"
> > > > > > >                language="English" protected="protwords.txt"/>
> > > > > > >        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > > > > > >        <filter class="schema.UnicodeNormalizationFilterFactory"
> > > > > > >                version="icu4j" composed="false"
> > > > > > >                remove_diacritics="true" remove_modifiers="true"
> > > > > > >                fold="true"/>
> > > > > > >        <filter class="solr.ISOLatin1AccentFilterFactory"/>
> > > > > > >      </analyzer>
> > > > > > >    </fieldType>
> > > > > > >
> > > > > > > I did notice that the "text" field type in VuFind's schema has
> > > > > > > "catenateWords" and "catenateNumbers" turned on in both the index
> > > > > > > and query analyzer chains.  It is my understanding that these
> > > > > > > options should be disabled for the query chain and only enabled
> > > > > > > for the index chain.  However, this may be a red herring -- I have
> > > > > > > already tried changing this setting, but it didn't change the
> > > > > > > success/failure pattern described above.  I have also played with
> > > > > > > the preserveOriginal setting without apparent effect.
> > > > > > >
> > > > > > > From playing with the Field Analysis tool, I notice that there is
> > > > > > > a gap in the term position sequence after analysis...  but I'm not
> > > > > > > sure if this is significant.
> > > > > > >
> > > > > > > Has anybody else run into this sort of problem?  Any ideas on a
> > > > > > > fix?
> > > > > > >
> > > > > > > thanks,
> > > > > > > Demian
> > > > > > >
> > > > > > >
> > > > >
> > >
> >
> >
> >
> > --
> > Robert Muir
> > rcm...@gmail.com
>



-- 
Robert Muir
rcm...@gmail.com
