Re: Phrase matching on a text field

2009-05-07 Thread Jay Hill
The string fieldtype is not tokenized, while the text fieldtype is
tokenized. So the stop word "for" is being removed by a stop word filter
in the text field, which doesn't happen with the string field type (no
tokenizing).

Have a look at the schema.xml in the example dir and look at the default
configuration for both the text and string fieldtypes. The string
fieldtype is not analyzed, whereas the text fieldtype has a number of
different filters that take action.
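
As a rough illustration of that difference, here is a toy Python model (not Solr code). The function names and the tiny stop-word set are made up for the sketch, and stemming and the other filters are left out:

```python
# Toy model of the two field types; this is not Solr code.
STOP_WORDS = {"for", "a", "the"}  # assumed subset of stopwords.txt


def index_as_string(value):
    # string type: the whole value is kept as a single, untouched term
    return [value]


def index_as_text(value):
    # text type (simplified): split on whitespace, lowercase, drop stop words
    tokens = [tok.lower() for tok in value.split()]
    return [tok for tok in tokens if tok not in STOP_WORDS]


title = "FUTURE DIRECTIONS FOR INTEGRATED CATCHMENT"
print(index_as_string(title))  # one term, "FOR" still inside it
print(index_as_text(title))    # four terms, "for" gone
```

The string field can only be matched by the complete original value, while the text field is matched term by term after analysis.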

-Jay


On Wed, May 6, 2009 at 11:09 PM, Phil Chadwick
p.chadw...@internode.on.net wrote:

 Hi,

 I'm trying to figure out why phrase matching on a text field only works
 some of the time.

 I have a SOLR index containing a document titled "FUTURE DIRECTIONS FOR
 INTEGRATED CATCHMENT".  The "FOR" seems to be causing a problem...

 The title field is indexed as both s_title and t_title (string and text,
 as defined in the demo schema), thus:

<field name="title" type="string" indexed="false" stored="false"
       multiValued="false" />
<field name="s_title" type="string" indexed="true" stored="true"
       multiValued="false" />
<field name="t_title" type="text" indexed="true" stored="false"
       multiValued="false" />
<copyField source="title" dest="s_title" />
<copyField source="title" dest="t_title" />

 I can match the document with an exact query on the string:

q=s_title:"FUTURE DIRECTIONS FOR INTEGRATED CATCHMENT"

 I can match the document with this phrase query on the text:

q=t_title:"future directions"

 which uses the parsedquery shown by debugQuery=true:

<str name="rawquerystring">t_title:"future directions"</str>
<str name="querystring">t_title:"future directions"</str>
<str name="parsedquery">PhraseQuery(t_title:"futur direct")</str>
<str name="parsedquery_toString">t_title:"futur direct"</str>

 Similarly, I can match the document with this query:

q=t_title:"integrated catchment"

 which uses the parsedquery shown by debugQuery=true:

<str name="rawquerystring">t_title:"integrated catchment"</str>
<str name="querystring">t_title:"integrated catchment"</str>
<str name="parsedquery">PhraseQuery(t_title:"integr catchment")</str>
<str name="parsedquery_toString">t_title:"integr catchment"</str>

 But I cannot match the document with the query:

q=t_title:"future directions for integrated catchment"

 which uses the phrase query shown by debugQuery=true:

<str name="rawquerystring">
t_title:"future directions for integrated catchment"</str>
<str name="querystring">
t_title:"future directions for integrated catchment"</str>
<str name="parsedquery">
PhraseQuery(t_title:"futur direct integr catchment")</str>
<str name="parsedquery_toString">
t_title:"futur direct integr catchment"</str>

 Any wisdom gratefully accepted.

 Cheers,


 --
 Phil

 640K ought to be enough for anybody.
-- Bill Gates, in 1981



Re: Phrase matching on a text field

2009-05-07 Thread Phil Chadwick
Hi Jay

Thank you for your response.

The data relating to the string (s_title) defines *exactly* what was
fed into the SOLR indexing.  The string is not otherwise relevant to
the question.

The essence of my question is why the indexed text (t_title) cannot
be phrase matched by the query on the text when the word "for" is
present in the query.

The following work (and I would expect them to work):

q=s_title:"FUTURE DIRECTIONS FOR INTEGRATED CATCHMENT"
q=t_title:"future directions"
q=t_title:"integrated catchment"

The following does not work (and I would expect it to work):

q=t_title:"directions for integrated"

The following does not work (not sure if I expect it to work or not):

q=t_title:"directions integrated"

My reading is that if the FOR is removed in the text indexing, it
should also be removed for the text query!

I also added enablePositionIncrements="true" to the text query analyzer
to make it the same as the text index analyzer:

<filter class="solr.StopFilterFactory"
  ignoreCase="true"
  words="stopwords.txt"
  enablePositionIncrements="true"/>

There was no change in the outcome.

The definitions for text and string were exactly as in the SOLR 1.3
example schema.  The section of that schema for text is shown below.

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">

  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
      ignoreCase="true"
      words="stopwords.txt"
      enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
      generateWordParts="1"
      generateNumberParts="1"
      catenateWords="1"
      catenateNumbers="1"
      catenateAll="0"
      splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
      protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>

  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
      synonyms="synonyms.txt"
      ignoreCase="true"
      expand="true"/>
    <filter class="solr.StopFilterFactory"
      ignoreCase="true"
      words="stopwords.txt"
      />
    <!-- enablePositionIncrements="true" -->
    <filter class="solr.WordDelimiterFilterFactory"
      generateWordParts="1"
      generateNumberParts="1"
      catenateWords="0"
      catenateNumbers="0"
      catenateAll="0"
      splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
      protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>

</fieldType>


Cheers,


-- 
Phil

The art of being wise is the art of knowing what to overlook.
-- William James




Re: Phrase matching on a text field

2009-05-07 Thread Phil Chadwick
Hi,

I have tracked this problem to:

  https://issues.apache.org/jira/browse/SOLR-879

The executive summary is that there are errors relating to
text fields in both:

  - src/java/org/apache/solr/search/SolrQueryParser.java
  - example/solr/conf/schema.xml

It is fixed in 1.4.

Thank you Yonik Seeley for the original diagnosis and fix.
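
For anyone hitting the same thing, here is a toy Python model of how a position mismatch can break a phrase match. It is only a sketch: it assumes the problem described in SOLR-879 comes down to the index side leaving a position gap where the stop word was while the query side does not, and the names analyze, phrase_matches and keep_gaps are invented for the illustration, not Solr APIs. Stemming is ignored.

```python
STOP_WORDS = {"for"}  # assumed; the real list lives in stopwords.txt


def analyze(text, keep_gaps):
    """Return (term, position) pairs after stop-word removal.

    keep_gaps=True mimics enablePositionIncrements="true": a removed stop
    word still consumes a position. keep_gaps=False closes the gap.
    """
    out, pos = [], 0
    for tok in text.lower().split():
        if tok in STOP_WORDS:
            if keep_gaps:
                pos += 1
            continue
        out.append((tok, pos))
        pos += 1
    return out


def phrase_matches(indexed, query):
    """Exact phrase match (no slop): the query's relative offsets must
    all be present in the indexed positions."""
    if not query:
        return False
    index_set = set(indexed)
    first_term, first_pos = query[0]
    for term, start in indexed:
        if term != first_term:
            continue
        if all((t, start + (p - first_pos)) in index_set for t, p in query):
            return True
    return False


# Index side keeps the gap, as in the index analyzer of the example schema.
doc = analyze("FUTURE DIRECTIONS FOR INTEGRATED CATCHMENT", keep_gaps=True)
q = "directions for integrated"
print(phrase_matches(doc, analyze(q, keep_gaps=False)))  # False
print(phrase_matches(doc, analyze(q, keep_gaps=True)))   # True
```

With the gap kept on both sides, "directions for integrated" lines up with the indexed positions again, while "directions integrated" still does not match in either mode.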

Cheers,


-- 
Phil

It may be that your sole purpose in life is simply to serve as a
warning to others.


