Hi everyone,
If I define a field type as
<fieldType name="subword" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="1"/>
    <tokenizer class="org.apache.solr.analysis.EdgeNGramTokenizerFactory"
               minGramSize="2" maxGramSize="15"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0"/>
    <tokenizer class="org.apache.solr.analysis.EdgeNGramTokenizerFactory"
               minGramSize="2" maxGramSize="15"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
I would expect the following to happen when pushing data into it:
- stop words are removed by StopFilterFactory
- the content is broken into several 'words' as per WordDelimiterFilterFactory
- the result of all this is passed to the EdgeNGram (or NGram) tokenizer
So, when indexing 'The Veronicas', only 'Veronicas' would reach the n-gram
tokenizer.
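To make that expectation concrete, here is a rough sketch in Python. It is not Solr code: the stop word set, the whitespace split (a crude stand-in for the word-splitting step) and all function names are made up for illustration only.

```python
# Rough simulation of the order I EXPECTED. This is not Solr code.
# STOPWORDS and every helper name here are hypothetical.
STOPWORDS = {"the", "a", "an"}  # stand-in for stopwords.txt


def remove_stopwords(words):
    """Drop stop words, ignoring case (like ignoreCase="true")."""
    return [w for w in words if w.lower() not in STOPWORDS]


def edge_ngrams(text, min_gram=2, max_gram=15):
    """Front-anchored n-grams of a single string."""
    upper = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, upper + 1)]


def expected_pipeline(text):
    # 1. split into words (whitespace only, a simplification)
    words = text.split()
    # 2. stop words are removed BEFORE any n-gramming
    words = remove_stopwords(words)
    # 3. each surviving word is edge-n-grammed, then lowercased
    grams = []
    for word in words:
        grams.extend(g.lower() for g in edge_ngrams(word))
    return grams


print(expected_pipeline("The Veronicas"))
# ['ve', 'ver', 'vero', 'veron', 'veroni', 'veronic', 'veronica', 'veronicas']
```

In this ordering no gram starting with 'the' ever gets produced, which is the behaviour I was hoping for.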
What I find instead is that the n-gram tokenizer kicks in first and the filters
run afterwards, which makes the filtering a rather moot exercise. I've confirmed
the order of the steps in analysis.jsp:
Index Analyzer
org.apache.solr.analysis.NGramTokenizerFactory {maxGramSize=15, minGramSize=2}
[..]
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
[..]
org.apache.solr.analysis.LowerCaseFilterFactory {}
[...]
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
[...]
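A rough Python sketch of what that observed order implies for 'The Veronicas' follows. Again, this is not Solr code: it assumes edge n-grams as in the field definition, the stop word set and function names are invented, and the duplicate removal is a simplification of whatever RemoveDuplicatesTokenFilterFactory really does.

```python
# Rough simulation of the order I OBSERVED. This is not Solr code.
# STOPWORDS and every helper name here are hypothetical.
STOPWORDS = {"the", "a", "an"}  # stand-in for stopwords.txt


def edge_ngrams(text, min_gram=2, max_gram=15):
    """Front-anchored n-grams of a single string."""
    upper = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, upper + 1)]


def observed_pipeline(text):
    # 1. the tokenizer sees the raw input, stop word included
    grams = edge_ngrams(text)
    # 2. the stop filter can only drop grams that ARE a stop word
    grams = [g for g in grams if g.lower() not in STOPWORDS]
    # 3. lowercase, then drop repeats while keeping order (simplified)
    seen, out = set(), []
    for gram in (g.lower() for g in grams):
        if gram not in seen:
            seen.add(gram)
            out.append(gram)
    return out


result = observed_pipeline("The Veronicas")
print(len(result))  # 11
print(result[:3])   # ['th', 'the ', 'the v']
```

Only the single gram 'The' is dropped here. Every other gram still begins with the stop word, so the stop filter achieves next to nothing.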
What am I doing wrong, or misunderstanding?
Thanks!
B
_________________________
{Beto|Norberto|Numard} Meijome