I've seen that issue too and read the comments on the list, yet I've never had trouble with the order, so I don't know what's going on. Check this analyzer; I've moved the charFilter to the bottom:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
          catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
  <filter class="solr.StopFilterFactory" ignoreCase="false" words="stopwords.txt"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt" language="Dutch"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
</analyzer>

The analysis chain still does its job as I expect for the input:

<span>bla bla</span>

Index Analyzer

org.apache.solr.analysis.HTMLStripCharFilterFactory {luceneMatchVersion=LUCENE_34}
  text         bla bla

org.apache.solr.analysis.WhitespaceTokenizerFactory {luceneMatchVersion=LUCENE_34}
  position     1      2
  term text    bla    bla
  startOffset  6      10
  endOffset    9      13

org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1, generateNumberParts=1, catenateWords=1, luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0, catenateNumbers=1}
  position     1      2
  term text    bla    bla
  startOffset  6      10
  endOffset    9      13
  type         word   word

org.apache.solr.analysis.LowerCaseFilterFactory {luceneMatchVersion=LUCENE_34}
  position     1      2
  term text    bla    bla
  startOffset  6      10
  endOffset    9      13
  type         word   word

org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=true, ignoreCase=false, luceneMatchVersion=LUCENE_34}
  position     1      2
  term text    bla    bla
  type         word   word
  startOffset  6      10
  endOffset    9      13

org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=false, luceneMatchVersion=LUCENE_34}
  position     1      2
  term text    bla    bla
  type         word   word
  startOffset  6      10
  endOffset    9      13

org.apache.solr.analysis.ASCIIFoldingFilterFactory {luceneMatchVersion=LUCENE_34}
  position     1      2
  term text    bla    bla
  type         word   word
  startOffset  6      10
  endOffset    9      13

org.apache.solr.analysis.SnowballPorterFilterFactory {protected=protwords.txt, language=Dutch, luceneMatchVersion=LUCENE_34}
  position     1      2
  term text    bla    bla
  keyword      false  false
  type         word   word
  startOffset  6      10
  endOffset    9      13

org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {luceneMatchVersion=LUCENE_34}
  position     1      2
  term text    bla    bla
  keyword      false  false
  type         word   word
  startOffset  6      10
  endOffset    9      13

On Monday 25 July 2011 18:07:29 Mike Sokolov wrote:
> Hmm - I'm not sure about that; see
> https://issues.apache.org/jira/browse/SOLR-2119
>
> On 07/25/2011 12:01 PM, Markus Jelsma wrote:
> > charFilters are executed first regardless of their position in the
> > analyzer.
> >
> > On Monday 25 July 2011 17:53:59 Mike Sokolov wrote:
> >> I think you need to list the charfilter earlier in the analysis chain;
> >> before the tokenizer. Probably Solr should tell you this...
> >>
> >> -Mike
> >>
> >> On 07/25/2011 09:03 AM, Merlin Morgenstern wrote:
> >>> sounds logical.
> >>> I just changed it to the following, restarted and reindexed with commit:
> >>>
> >>> <fieldType name="text" class="solr.TextField"
> >>>            positionIncrementGap="100" autoGeneratePhraseQueries="true">
> >>>   <analyzer type="index">
> >>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >>>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>     <filter class="solr.KeywordMarkerFilterFactory"/>
> >>>     <filter class="solr.PorterStemFilterFactory"/>
> >>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >>>   </analyzer>
> >>>   <analyzer type="query">
> >>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>             generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >>>             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>     <filter class="solr.KeywordMarkerFilterFactory"/>
> >>>     <filter class="solr.PorterStemFilterFactory"/>
> >>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >>>   </analyzer>
> >>> </fieldType>
> >>>
> >>> Unfortunately that did not fix the error. There are still <h3> tags
> >>> inside the data. Although I believe there are fewer than before, I
> >>> cannot prove that. Fact is, there are still html tags inside the data.
> >>>
> >>> Any other ideas what the problem could be?
> >>>
> >>> 2011/7/25 Markus Jelsma <markus.jel...@openindex.io>
> >>>
> >>>> You've three analyzer elements, I wonder what that would do. You need
> >>>> to add the char filter to the index-time analyzer.
> >>>>
> >>>> On Monday 25 July 2011 13:09:14 Merlin Morgenstern wrote:
> >>>>> Hi there,
> >>>>>
> >>>>> I am trying to strip html tags from the data before adding the
> >>>>> documents to the index. To do that I altered schema.xml like this:
> >>>>>
> >>>>> <fieldType name="text" class="solr.TextField"
> >>>>>            positionIncrementGap="100" autoGeneratePhraseQueries="true">
> >>>>>   <analyzer type="index">
> >>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>>>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >>>>>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>     <filter class="solr.KeywordMarkerFilterFactory"/>
> >>>>>     <filter class="solr.PorterStemFilterFactory"/>
> >>>>>   </analyzer>
> >>>>>   <analyzer type="query">
> >>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>>>             generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >>>>>             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>     <filter class="solr.KeywordMarkerFilterFactory"/>
> >>>>>     <filter class="solr.PorterStemFilterFactory"/>
> >>>>>   </analyzer>
> >>>>>   <analyzer>
> >>>>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>>   </analyzer>
> >>>>> </fieldType>
> >>>>>
> >>>>> <fields>
> >>>>>   <field name="text" type="text" indexed="true" stored="true"
> >>>>>          required="false"/>
> >>>>> </fields>
> >>>>>
> >>>>> Unfortunately this does not work, the html tags like <h3> are still
> >>>>> present after restarting and reindexing. I also tried
> >>>>> htmlstriptransformer, but this did not work either.
> >>>>>
> >>>>> Has anybody an idea how to get this done? Thank you in advance for
> >>>>> any hint.
> >>>>>
> >>>>> Merlin
> >>>>
> >>>> --
> >>>> Markus Jelsma - CTO - Openindex
> >>>> http://www.linkedin.com/in/markus17
> >>>> 050-8536620 / 06-50258350

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
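
For anyone reading this thread later, here is a minimal sketch of Merlin's index-time analyzer with the charFilter listed first, which is the order Mike suggested. It is only an illustration, not a confirmed fix: it assumes the same factory classes already quoted in this thread, and the thread itself leaves open whether the position matters (see the SOLR-2119 link above).

```xml
<!-- Sketch only: index-time analyzer with the charFilter declared before the tokenizer. -->
<analyzer type="index">
  <!-- Strips HTML markup from the raw character stream before tokenizing. -->
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="1"
          catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
```

Declaring the elements in the order they run (charFilter, tokenizer, filters) makes the schema read the same way the analysis page reports it, so the configuration and the observed chain are easier to compare.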