Re: Phrase matching on a text field

Phil Chadwick Thu, 07 May 2009 22:30:14 -0700

Hi,

I have tracked this problem to:


  https://issues.apache.org/jira/browse/SOLR-879

In short, Solr 1.3 has bugs affecting phrase queries on text fields
in both:

  - src/java/org/apache/solr/search/SolrQueryParser.java
  - example/solr/conf/schema.xml

It is fixed in 1.4.
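
For anyone following along, here is a toy Python model of the position mismatch, as I understand it. It is not Solr or Lucene code: the function names and the one-word stop list are made up for illustration, and stemming is left out. It only shows why a phrase query fails when the index keeps a gap for a removed stop word but the query closes that gap, which is what the parsedquery "futur direct integr catchment" suggests is happening.

```python
# Toy model of stop-word removal and phrase matching; NOT Solr/Lucene code.
STOPWORDS = {"for"}  # assumed one-word stand-in for stopwords.txt


def analyze(text, keep_gaps):
    """Return (term, position) pairs after dropping stop words.

    keep_gaps=True imitates enablePositionIncrements="true": a removed
    stop word still consumes a position. keep_gaps=False closes the gap.
    """
    tokens = []
    pos = -1
    for word in text.lower().split():
        if word in STOPWORDS:
            if keep_gaps:
                pos += 1  # leave a hole where the stop word was
            continue
        pos += 1
        tokens.append((word, pos))
    return tokens


def phrase_matches(index_tokens, query_tokens):
    """True if the query terms occur in the index at the same relative offsets."""
    if not query_tokens:
        return False
    indexed = set(index_tokens)
    first_term, first_pos = query_tokens[0]
    for term, pos in index_tokens:
        if term != first_term:
            continue
        shift = pos - first_pos
        if all((t, p + shift) in indexed for t, p in query_tokens):
            return True
    return False


title = "FUTURE DIRECTIONS FOR INTEGRATED CATCHMENT"
index_tokens = analyze(title, keep_gaps=True)  # index side keeps the gap

gap_closed = analyze("directions for integrated", keep_gaps=False)
gap_kept = analyze("directions for integrated", keep_gaps=True)

print(phrase_matches(index_tokens, gap_closed))  # False
print(phrase_matches(index_tokens, gap_kept))    # True
```

In this model "future directions" and "integrated catchment" match either way because no stop word falls inside the phrase, while "directions integrated" never matches because the index has a hole between the two terms.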

Thank you Yonik Seeley for the original diagnosis and fix.

Cheers,


-- 
Phil

It may be that your sole purpose in life is simply to serve as a
warning to others.



Phil Chadwick wrote:

> Hi Jay
> 
> Thank you for your response.
> 
> The data relating to the string (s_title) defines *exactly* what was
> fed into the SOLR indexing.  The string is not otherwise relevant to
> the question.
> 
> The essence of my question is why can the indexed text (t_title) not
> be phrase matched by the query on the text when the word "for" is
> present in the query.
> 
> The following work (and I would expect them to work):
> 
>     q=s_title:"FUTURE DIRECTIONS FOR INTEGRATED CATCHMENT"
>     q=t_title:"future directions"
>     q=t_title:"integrated catchment"
> 
> The following does not work (and I would expect it to work):
> 
>     q=t_title:"directions for integrated"
> 
> The following does not work (not sure if I expect it to work or not):
> 
>     q=t_title:"directions integrated"
> 
> My reading is that if the "FOR" is removed in the text indexing, it
> should also be removed for the text query!
> 
> I also added 'enablePositionIncrements="true"' to the text query analyzer
> to make it the same as the text index analyzer:
> 
>     <filter class="solr.StopFilterFactory"
>       ignoreCase="true"
>       words="stopwords.txt"
>       enablePositionIncrements="true"/>
> 
> There was no change in the outcome.
> 
> The definitions for text and string were exactly as in the SOLR 1.3
> example schema.
> 
> The section of that schema for "text" is shown below.
> 
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> 
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory"
>       ignoreCase="true"
>       words="stopwords.txt"
>       enablePositionIncrements="true"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>       generateWordParts="1"
>       generateNumberParts="1"
>       catenateWords="1"
>       catenateNumbers="1"
>       catenateAll="0"
>       splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishPorterFilterFactory"
>       protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> 
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory"
>       synonyms="synonyms.txt"
>       ignoreCase="true"
>       expand="true"/>
>     <!-- enablePositionIncrements="true" is not set on this filter -->
>     <filter class="solr.StopFilterFactory"
>       ignoreCase="true"
>       words="stopwords.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>       generateWordParts="1"
>       generateNumberParts="1"
>       catenateWords="0"
>       catenateNumbers="0"
>       catenateAll="0"
>       splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishPorterFilterFactory"
>       protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> 
> </fieldType>
> 
> 
> Cheers,
> 
> 
> -- 
> Phil
> 
> The art of being wise is the art of knowing what to overlook.
>       -- William James
> 
> 
> 
> Jay Hill wrote:
> >
> > The string fieldtype is not being tokenized, while the text fieldtype is
> > tokenized. So the stop word "for" is being removed by a stop word filter,
> > which doesn't happen with the string fieldtype (no tokenizing).
> > 
> > Have a look at the schema.xml in the example dir and look at the default
> > configuration for both the text and string fieldtypes. The string
> > fieldtype is not analyzed, whereas the text fieldtype has a number of
> > different filters that take action.
> 
> > On Wed, May 6, 2009 at 11:09 PM, Phil Chadwick
> > <p.chadw...@internode.on.net> wrote:
> > 
> > > Hi,
> > >
> > > I'm trying to figure out why phrase matching on a text field only works
> > > some of the time.
> > >
> > > I have a SOLR index containing a document titled "FUTURE DIRECTIONS FOR
> > > INTEGRATED CATCHMENT".  The "FOR" seems to be causing a problem...
> > >
> > > The title field is indexed as both s_title and t_title (string and text,
> > > as defined in the demo schema), thus:
> > >
> > >    <field name="title" type="string" indexed="false" stored="false"
> > >        multiValued="false" />
> > >    <field name="s_title" type="string" indexed="true" stored="true"
> > >        multiValued="false" />
> > >    <field name="t_title" type="text" indexed="true" stored="false"
> > >        multiValued="false" />
> > >    <copyField source="title" dest="s_title" />
> > >    <copyField source="title" dest="t_title" />
> > >
> > > I can match the document with an exact query on the string:
> > >
> > >    q=s_title:"FUTURE DIRECTIONS FOR INTEGRATED CATCHMENT"
> > >
> > > I can match the document with this phrase query on the text:
> > >
> > >    q=t_title:"future directions"
> > >
> > > which uses the parsedquery shown by "&debugQuery=true":
> > >
> > >    <str name="rawquerystring">t_title:"future directions"</str>
> > >    <str name="querystring">t_title:"future directions"</str>
> > >    <str name="parsedquery">PhraseQuery(t_title:"futur direct")</str>
> > >    <str name="parsedquery_toString">t_title:"futur direct"</str>
> > >
> > > Similarly, I can match the document with this query:
> > >
> > >    q=t_title:"integrated catchment"
> > >
> > > which uses the parsedquery shown by "&debugQuery=true":
> > >
> > >    <str name="rawquerystring">t_title:"integrated catchment"</str>
> > >    <str name="querystring">t_title:"integrated catchment"</str>
> > >    <str name="parsedquery">PhraseQuery(t_title:"integr catchment")</str>
> > >    <str name="parsedquery_toString">t_title:"integr catchment"</str>
> > >
> > > But I can not match the document with the query:
> > >
> > >    q=t_title:"future directions for integrated catchment"
> > >
> > > which uses the phrase query shown by "&debugQuery=true":
> > >
> > >    <str name="rawquerystring">
> > >        t_title:"future directions for integrated catchment"</str>
> > >    <str name="querystring">
> > >        t_title:"future directions for integrated catchment"</str>
> > >    <str name="parsedquery">
> > >        PhraseQuery(t_title:"futur direct integr catchment")</str>
> > >    <str name="parsedquery_toString">
> > >        t_title:"futur direct integr catchment"</str>
> > >
> > > Any wisdom gratefully accepted.
> > >
> > > Cheers,
> > >
> > >
> > > --
> > > Phil
> > >
> > > 640K ought to be enough for anybody.
> > >        -- Bill Gates, in 1981
