Re: Not finding part of fulltext field when word ends in dot
In reply to Jack Krupansky (j...@basetechnology.com), 2014-01-30:

That was a complicated answer, but ultimately the right one. Thank you very much.
Re: Not finding part of fulltext field when word ends in dot
In reply to Thomas Michael Engelke, sent Thursday, January 30, 2014 2:16 AM to solr-user@lucene.apache.org:

The word delimiter filter will turn 26KA into two tokens, as if you had written 26 KA without the quotes. The autoGeneratePhraseQueries option will cause the multiple terms to be treated as if they actually were enclosed within quotes; otherwise they will be treated as separate and unquoted terms. If you do enclose 26KA in quotes in your query, then autoGeneratePhraseQueries is not relevant.

Ah... maybe the problem is that you have preserveOriginal=true in your query analyzer. Do you have your default query operator set to AND? If so, it would treat 26KA as 26 AND KA AND 26KA, which requires 26KA (without the trailing dot) to be in the index.

It seems counter-intuitive, but the attributes of the index and query word delimiter filters need to be slightly asymmetric.

-- Jack Krupansky
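One way to read Jack's remark about slightly asymmetric word delimiter settings is sketched below. It assumes the only change is switching preserveOriginal off on the query side; all other attributes are copied from the fieldType posted in this thread. Whether this alone resolves the problem on 3.6.0.1 is an assumption that should be checked on the analysis page.

```xml
<!-- Index analyzer, as posted: preserveOriginal="1" keeps "26KA." next to the stripped "26KA". -->
<filter class="solr.WordDelimiterFilterFactory"
        catenateWords="1" catenateNumbers="1"
        generateWordParts="1" generateNumberParts="1"
        splitOnCaseChange="1" splitOnNumerics="0"
        catenateAll="0" preserveOriginal="1"/>

<!-- Query analyzer, hypothetical change: preserveOriginal="0" (it is "1" in the posted config),
     so the query does not additionally require the unsplit original term. -->
<filter class="solr.WordDelimiterFilterFactory"
        catenateWords="0" catenateNumbers="0"
        generateWordParts="1" generateNumberParts="1"
        splitOnCaseChange="1" splitOnNumerics="0"
        catenateAll="0" preserveOriginal="0"/>
```

Each filter element belongs inside its respective analyzer element (type=index or type=query). A change to the query analyzer normally takes effect after a core reload, while a change to the index analyzer requires reindexing the affected documents.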
Not finding part of fulltext field when word ends in dot
Hello everybody,

we have a legacy Solr installation in version 3.6.0.1. One of the indices defines a field named content as a fulltext field where a product description will reside. One of the records indexed contains the following data (excerpt):

z. B. in der Serie 26KA.

I had the problem that searching for the value 26KA didn't find anything. Using the analysis page of the administrative interface, with the full text as the indexed value and 26KA as the query string, I can see how the text is transformed by the configured filter factories. The WordDelimiterFilterFactory transforms 26KA. into 26KA, which is displayed like this (excerpt):

73   74   75     76
in   der  Serie  26KA.
                 26KA

It seems that it stripped the dot from 26KA. With the option to highlight matches enabled, an analysis search for 26KA shows that the lower of the two entries matches (after reaching the LowerCaseFilterFactory). However, querying the index using the query interface doesn't show any matches. I discovered that adding an asterisk to the search seems to work, as does adding the dot.

I am puzzled by this, as I thought that the second added entry was the word actually indexed. I've tried looking up the documentation of the administrative interface, but it only covers the latest version, where the display is different and (at least in the sample) doesn't show such duplication.

Can anybody shed some light on this?
Re: Not finding part of fulltext field when word ends in dot
In reply to Thomas Michael Engelke, sent Wednesday, January 29, 2014 10:45 AM to solr-user@lucene.apache.org:

What field type and analyzer/tokenizer are you using?

-- Jack Krupansky
Re: Not finding part of fulltext field when word ends in dot
In reply to Jack Krupansky (j...@basetechnology.com), 2014-01-29:

The fieldType definition is a tad on the longer side:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" catenateAll="0" preserveOriginal="1" splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="german/synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="german/german-common-nouns.txt" minWordSize="5" minSubwordSize="4" maxSubwordSize="15" onlyLongestMatch="true"/>
    <filter class="solr.StopFilterFactory" words="german/stopwords.txt" ignoreCase="true" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="german/protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateWords="0" catenateNumbers="0" generateWordParts="1" splitOnCaseChange="1" generateNumberParts="1" catenateAll="0" preserveOriginal="1" splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="german/stopwords.txt" ignoreCase="true" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="german/protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Thank you for taking a look.
Re: Not finding part of fulltext field when word ends in dot
In reply to Thomas Michael Engelke, sent Wednesday, January 29, 2014 11:16 AM to solr-user@lucene.apache.org:

You might want to add autoGeneratePhraseQueries=true to your field type, but I don't think that would cause a break when going from 3.6 to 4.x. The default for that attribute changed in Solr 3.5.

What release was your data indexed using? There may have been some subtle word delimiter filter changes between 3.x and 4.x.

Read:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201202.mbox/%3CC0551C512C863540BC59694A118452AA0764A434@ITS-EMBX-03.adsroot.itcs.umich.edu%3E
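For reference, the attribute Jack mentions is set on the fieldType element itself, not on a filter. The following is a sketch only, assuming the text type posted in this thread; the analyzers are omitted here and would stay as posted.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <!-- index and query analyzers go here, unchanged from the posted definition -->
</fieldType>
```

As discussed elsewhere in the thread, the setting only matters when a single unquoted query term is analyzed into more than one token.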
Re: Not finding part of fulltext field when word ends in dot
In reply to Jack Krupansky (j...@basetechnology.com), 2014-01-29:

I'm not sure I got my problem across. If I understand the snippet of documentation right, autoGeneratePhraseQueries only affects queries that result in multiple tokens, which mine does not. The version is also 3.6.0.1, and we're not planning on upgrading to any 4.x version.