Re: Not finding part of fulltext field when word ends in dot

2014-02-03 Thread Thomas Michael Engelke
That was a complicated answer, but ultimately the right one. Thank you very
much.



Re: Not finding part of fulltext field when word ends in dot

2014-01-30 Thread Jack Krupansky
The word delimiter filter will turn "26KA" into two tokens, as if you had
written "26 KA" without the quotes. The autoGeneratePhraseQueries option
will cause the multiple terms to be treated as if they actually were
enclosed within quotes; otherwise they will be treated as separate,
unquoted terms. If you do enclose "26KA" in quotes in your query, then
autoGeneratePhraseQueries is not relevant.

Ah... maybe the problem is that you have preserveOriginal=true in your
query analyzer. Do you have your default query operator set to AND? If so,
it would treat "26KA" as "26 AND KA AND 26KA", which requires
"26KA" (without the trailing dot) to be in the index.
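
For reference, in Solr 3.x the default operator is typically set in schema.xml. The fragment below is only a sketch of what such a setting looks like; it is not taken from the poster's schema, of which the thread shows nothing beyond the fieldType.

```xml
<!-- schema.xml (illustrative, not from the thread): terms written
     without an explicit operator are combined with AND -->
<solrQueryParser defaultOperator="AND"/>
```

The same behaviour can be requested for a single query with the q.op=AND request parameter.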


It seems counter-intuitive, but the attributes of the index and query word 
delimiter filters need to be slightly asymmetric.
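
One possible reading of this advice, sketched under the assumption that only preserveOriginal is changed on the query side: the index analyzer keeps preserveOriginal="1", while the query analyzer turns it off. The thread does not record the configuration that was finally used, so the fragment below is illustrative only.

```xml
<!-- Query-side analyzer only; the index-side analyzer stays as posted.
     Assumption: preserveOriginal is the single attribute that differs. -->
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory"
          catenateWords="0"
          catenateNumbers="0"
          generateWordParts="1"
          splitOnCaseChange="1"
          generateNumberParts="1"
          catenateAll="0"
          preserveOriginal="0"
          splitOnNumerics="0"/>
  <!-- remaining query-side filters unchanged -->
</analyzer>
```

Because this changes only query-time analysis, it should not by itself require reindexing; the Analysis page of the admin interface can be used to compare the index and query token streams afterwards.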


-- Jack Krupansky


Not finding part of fulltext field when word ends in dot

2014-01-29 Thread Thomas Michael Engelke
Hello everybody,

We have a legacy Solr installation at version 3.6.0.1. One of the indices
defines a field named "content" as a fulltext field where a product
description will reside. One of the records indexed contains the following
data (excerpt):

z. B. in der Serie 26KA.

I had the problem that searching for the value "26KA" didn't find anything.
Using the analyzer of the administrative interface, with the full text as
the index value and "26KA" as the query string, I can see how the search
string is transformed by the filter factories in use. The
WordDelimiterFilterFactory transforms "26KA." into "26KA", which is
displayed like this (excerpt):

73   74   75     76
in   der  Serie  26KA.
                 26KA

It seems that it stripped the dot from "26KA.". Using the option to
highlight matches, an analysis search for "26KA" shows that the lower of the
two entries matches (after reaching the LowerCaseFilterFactory). However,
querying the index using the query interface doesn't show any matches.

I discovered that adding an asterisk to the search seems to work, as does
adding the dot. I am puzzled by this, as I thought that the second, added
entry was the word actually indexed. I've tried looking up the documentation
of the administrative interface, but it only covers the latest version,
where the display is different and (at least in the sample) doesn't show
such duplication.

Can anybody shed some light on this?


Re: Not finding part of fulltext field when word ends in dot

2014-01-29 Thread Jack Krupansky

What field type and analyzer/tokenizer are you using?

-- Jack Krupansky



Re: Not finding part of fulltext field when word ends in dot

2014-01-29 Thread Thomas Michael Engelke
The fieldType definition is a tad on the longer side:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            catenateWords="1"
            catenateNumbers="1"
            generateNumberParts="1"
            splitOnCaseChange="1"
            generateWordParts="1"
            catenateAll="0"
            preserveOriginal="1"
            splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="german/synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="german/german-common-nouns.txt"
            minWordSize="5"
            minSubwordSize="4"
            maxSubwordSize="15"
            onlyLongestMatch="true"/>
    <filter class="solr.StopFilterFactory"
            words="german/stopwords.txt" ignoreCase="true"
            enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"
            protected="german/protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            catenateWords="0"
            catenateNumbers="0"
            generateWordParts="1"
            splitOnCaseChange="1"
            generateNumberParts="1"
            catenateAll="0"
            preserveOriginal="1"
            splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            words="german/stopwords.txt" ignoreCase="true"
            enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"
            protected="german/protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>


Thank you for taking a look.





Re: Not finding part of fulltext field when word ends in dot

2014-01-29 Thread Jack Krupansky
You might want to add autoGeneratePhraseQueries=true to your field type, 
but I don't think that would cause a break when going from 3.6 to 4.x. The 
default for that attribute changed in Solr 3.5. What release was your data 
indexed using? There may have been some subtle word delimiter filter changes 
between 3.x and 4.x.


Read:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201202.mbox/%3CC0551C512C863540BC59694A118452AA0764A434@ITS-EMBX-03.adsroot.itcs.umich.edu%3E
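
As a sketch of the suggestion above, autoGeneratePhraseQueries is set as an attribute on the fieldType element itself. The name and other attributes are copied from the field type posted in this thread; whether the option helps in this particular case is not established here.

```xml
<!-- Only the autoGeneratePhraseQueries attribute is new; the analyzers
     are elided and would stay as posted in the thread. -->
<fieldType name="text" class="solr.TextField"
           positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <!-- index and query analyzers as defined in the thread -->
</fieldType>
```

The attribute only matters when query-time analysis produces more than one token from a single unquoted term.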



Re: Not finding part of fulltext field when word ends in dot

2014-01-29 Thread Thomas Michael Engelke
I'm not sure I got my problem across. If I understand the snippet of
documentation correctly, autoGeneratePhraseQueries only affects queries that
result in multiple tokens, which mine does not. Also, the version is
3.6.0.1, and we're not planning on upgrading to any 4.x version.

