problem with HTMLStripStandardTokenizerFactory

Kundig, Andreas Fri, 25 Sep 2009 02:34:49 -0700

Hello

I can't get HTMLStripStandardTokenizerFactory to remove the content of the 
style tag, as the documentation says it should.
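
To make the expectation concrete, here is a minimal sketch in Python (not Solr 
code, and not how the tokenizer is implemented) of the behaviour I would expect: 
markup is dropped and everything inside style tags is discarded. The class and 
function names and the sample input are made up for illustration, and skipping 
script tags as well is my own assumption.

```python
from html.parser import HTMLParser


class StyleStrippingParser(HTMLParser):
    """Collects text content, skipping everything inside <style> and <script>."""

    # Tags whose content should never reach the index (assumed set).
    SKIPPED_TAGS = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self._chunks = []      # text fragments that should be kept

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the kept fragments and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML string, without style/script content."""
    parser = StyleStrippingParser()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample resembling a Word document saved as HTML.
sample = (
    "<html><head><style>p.MsoNormal {mso-font-charset:0;}</style></head>"
    "<body><p>Annual report</p></body></html>"
)
print(strip_html(sample))
```

With output like this, a search for 'mso' should not match the sample, because 
the term occurs only inside the style tag.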


A search for 'mso' returns a document in which the search term appears only in the 
style tag (it is a Word document saved as HTML). Here is the highlight returned 
by Solr (incidentally, the wrong word is highlighted):

"vetica;&#13;\n\tpanose-1:2 11 5 4 2 2 2 2 2 
4;&<em>#13</em>;\n\tmso-font-charset:0;&<em>#13</em>;\n\tmso-generic-font-family:swiss;&<em>#13</em>"

I am using Solr 1.3. Here is how I configured the tokenizer in schema.xml:

    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="1" catenateNumbers="1"
                catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="0" catenateNumbers="0"
                catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

Am I doing something wrong?

Thank you,
Andréas Kündig

