Re: Phrase matching on a text field
The string fieldtype is not tokenized, while the text fieldtype is. So the stop word "for" is being removed by a stop word filter, which doesn't happen with the string fieldtype (no tokenizing). Have a look at the schema.xml in the example dir and compare the default configuration of the text and string fieldtypes. The string fieldtype is not analyzed, whereas the text fieldtype has a number of different filters that take action.

-Jay

On Wed, May 6, 2009 at 11:09 PM, Phil Chadwick <p.chadw...@internode.on.net> wrote:

Hi,

I'm trying to figure out why phrase matching on a text field only works some of the time.

I have a SOLR index containing a document titled "FUTURE DIRECTIONS FOR INTEGRATED CATCHMENT". The "FOR" seems to be causing a problem...

The title field is indexed as both s_title and t_title (string and text, as defined in the demo schema), thus:

    <field name="title" type="string" indexed="false" stored="false" multiValued="false" />
    <field name="s_title" type="string" indexed="true" stored="true" multiValued="false" />
    <field name="t_title" type="text" indexed="true" stored="false" multiValued="false" />
    <copyField source="title" dest="s_title" />
    <copyField source="title" dest="t_title" />

I can match the document with an exact query on the string:

    q=s_title:"FUTURE DIRECTIONS FOR INTEGRATED CATCHMENT"

I can match the document with this phrase query on the text:

    q=t_title:"future directions"

which uses the parsedquery shown by debugQuery=true:

    <str name="rawquerystring">t_title:"future directions"</str>
    <str name="querystring">t_title:"future directions"</str>
    <str name="parsedquery">PhraseQuery(t_title:"futur direct")</str>
    <str name="parsedquery_toString">t_title:"futur direct"</str>

Similarly, I can match the document with this query:

    q=t_title:"integrated catchment"

which uses the parsedquery shown by debugQuery=true:

    <str name="rawquerystring">t_title:"integrated catchment"</str>
    <str name="querystring">t_title:"integrated catchment"</str>
    <str name="parsedquery">PhraseQuery(t_title:"integr catchment")</str>
    <str name="parsedquery_toString">t_title:"integr catchment"</str>

But I cannot match the document with the query:

    q=t_title:"future directions for integrated catchment"

which uses the phrase query shown by debugQuery=true:

    <str name="rawquerystring">t_title:"future directions for integrated catchment"</str>
    <str name="querystring">t_title:"future directions for integrated catchment"</str>
    <str name="parsedquery">PhraseQuery(t_title:"futur direct integr catchment")</str>
    <str name="parsedquery_toString">t_title:"futur direct integr catchment"</str>

Any wisdom gratefully accepted.

Cheers,

-- Phil

640K ought to be enough for anybody.
    -- Bill Gates, in 1981
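The difference Jay describes between the two field types can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Solr's analysis code: the function names and the one-word stop list are hypothetical, and the stemming done by the real text fieldtype (which yields "futur", "direct", "integr") is left out.

```python
# Simplified model of the two field types; not Solr's actual analysis chain.
STOPWORDS = {"for"}  # hypothetical stop list, standing in for stopwords.txt


def analyze_string(value):
    """A string field is not analyzed: the whole value is one exact term."""
    return [value]


def analyze_text(value):
    """A text field is tokenized, lowercased, and stripped of stop words.

    Stemming is deliberately omitted in this sketch.
    """
    return [word for word in value.lower().split() if word not in STOPWORDS]


title = "FUTURE DIRECTIONS FOR INTEGRATED CATCHMENT"
print(analyze_string(title))  # one term, matched only by the exact value
print(analyze_text(title))    # four terms; "for" is gone
```

This is why the exact query on s_title matches, and why "for" never appears in the parsedquery for t_title.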
Re: Phrase matching on a text field
Hi Jay

Thank you for your response.

The data relating to the string (s_title) defines *exactly* what was fed into the SOLR indexing. The string is not otherwise relevant to the question.

The essence of my question is why the indexed text (t_title) cannot be phrase matched by the query on the text when the word "for" is present in the query.

The following work (and I would expect them to work):

    q=s_title:"FUTURE DIRECTIONS FOR INTEGRATED CATCHMENT"
    q=t_title:"future directions"
    q=t_title:"integrated catchment"

The following does not work (and I would expect it to work):

    q=t_title:"directions for integrated"

The following does not work (not sure if I expect it to work or not):

    q=t_title:"directions integrated"

My reading is that if the "FOR" is removed in the text indexing, it should also be removed for the text query!

I also added 'enablePositionIncrements="true"' to the text query analyzer to make it the same as the text index analyzer:

    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>

There was no change in the outcome.

The definitions for text and string were exactly as in the SOLR 1.3 example schema. The section of that schema for text is shown below.

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/> <!-- enablePositionIncrements="true" -->
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

Cheers,

-- Phil

The art of being wise is the art of knowing what to overlook.
    -- William James
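The symptoms in this message are consistent with a mismatch in token positions: the index analyzer leaves a gap where "for" was removed (enablePositionIncrements="true"), while the phrase query is built from consecutive positions. The Python sketch below models that idea. It is an illustration under stated assumptions, not Solr or Lucene code: `analyze`, `phrase_matches`, the `keep_gaps` flag, and the stop list are all hypothetical names, and stemming is omitted.

```python
# Toy model of stop word position gaps and exact phrase matching.
STOPWORDS = {"for"}  # hypothetical stop list, standing in for stopwords.txt


def analyze(text, keep_gaps):
    """Return (term, position) pairs after lowercasing and stop word removal.

    With keep_gaps=True a removed stop word still consumes a position,
    which is the effect attributed here to enablePositionIncrements="true".
    """
    tokens = []
    position = -1
    for word in text.lower().split():
        if word in STOPWORDS:
            if keep_gaps:
                position += 1  # leave a hole where the stop word was
            continue
        position += 1
        tokens.append((word, position))
    return tokens


def phrase_matches(indexed, query):
    """Exact phrase match (no slop): relative positions must line up."""
    if not query:
        return False
    positions = {}
    for term, pos in indexed:
        positions.setdefault(term, set()).add(pos)
    first_term, first_pos = query[0]
    for start in positions.get(first_term, ()):
        offset = start - first_pos
        if all(pos + offset in positions.get(term, ()) for term, pos in query):
            return True
    return False


title = "FUTURE DIRECTIONS FOR INTEGRATED CATCHMENT"
indexed = analyze(title, keep_gaps=True)  # index side keeps the gap

for phrase in ("future directions",
               "directions for integrated",
               "directions integrated"):
    without_gaps = phrase_matches(indexed, analyze(phrase, keep_gaps=False))
    with_gaps = phrase_matches(indexed, analyze(phrase, keep_gaps=True))
    print(f"{phrase!r}: query without gaps={without_gaps}, with gaps={with_gaps}")
```

In this model "directions for integrated" matches only when the query side keeps the gap as well, and "directions integrated" does not match as an exact phrase either way, because the indexed terms are two positions apart.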
Re: Phrase matching on a text field
Hi,

I have tracked this problem to:

    https://issues.apache.org/jira/browse/SOLR-879

Executive summary is that there are errors that relate to text fields in both:

- src/java/org/apache/solr/search/SolrQueryParser.java
- example/solr/conf/schema.xml

It is fixed in 1.4.

Thank you Yonik Seeley for the original diagnosis and fix.

Cheers,

-- Phil

It may be that your sole purpose in life is simply to serve as a warning to others.
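To check the behaviour before and after an upgrade, the failing phrase query can be re-issued with debugQuery=true and the parsedquery compared. The Python sketch below only builds the request URL. The host, port, and /solr/select path are assumptions about a default local install, and `debug_query_url` is a hypothetical helper, not part of any Solr client.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr select handler; adjust for your install.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def debug_query_url(field, phrase, base_url=SOLR_SELECT_URL):
    """Build a URL for a quoted phrase query with debugQuery=true."""
    params = {"q": f'{field}:"{phrase}"', "debugQuery": "true"}
    return f"{base_url}?{urlencode(params)}"


print(debug_query_url("t_title", "future directions for integrated catchment"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, returns the same rawquerystring and parsedquery entries quoted earlier in the thread.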