Hello
I can't get HTMLStripStandardTokenizerFactory to remove the content of the
style tag, as the documentation says it should.
A search for 'mso' returns a document in which the search term appears only in
the style tag (it's a Word document saved as HTML). Here is the highlight
returned by Solr (by the way, the wrong word is highlighted):
"vetica; \n\tpanose-1:2 11 5 4 2 2 2 2 2
4;&<em>#13</em>;\n\tmso-font-charset:0;&<em>#13</em>;\n\tmso-generic-font-family:swiss;&<em>#13</em>"
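To make the expected behaviour concrete, here is a minimal sketch in Python (not Solr code). The class `StyleDroppingStripper`, the function `strip_html`, and the sample HTML are hypothetical names and data made up for illustration; they only show what I understand "removing the content of the style tag" to mean, not how Solr implements it.

```python
from html.parser import HTMLParser


class StyleDroppingStripper(HTMLParser):
    """Hypothetical helper (not part of Solr): keeps text, drops <style>/<script> bodies."""

    SKIPPED = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []      # text fragments that survive stripping
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside <style> and <script>.
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of `html` with whitespace collapsed."""
    parser = StyleDroppingStripper()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Made-up sample resembling a Word document saved as HTML.
sample = (
    "<html><head><style>\n"
    "p.MsoNormal {mso-font-charset:0; mso-generic-font-family:swiss;}\n"
    "</style></head><body><p>Patent application text</p></body></html>"
)
text = strip_html(sample)
print(text)
```

With stripping like this, none of the 'mso' declarations would reach the index, so a search for 'mso' should not match the document.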
I am using Solr 1.3. Here is how I configured the tokenizer in schema.xml:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Am I doing something wrong?
Thank you,
Andréas Kündig