Unfortunately I still can't get it running. The configuration I am using is the following:

<analyzer type="index">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
I also tried this one:

<types>
  <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StandardFilterFactory"/>
    </analyzer>
  </fieldType>
</types>

<field name="text" type="text" indexed="true" stored="true" required="false"/>

Neither of those worked. I restarted Solr after the schema update and reindexed the data. No change: the HTML tags are still in there.

Any other ideas? Maybe this is a bug in Solr? I am using Solr 3.3.0 on SUSE Linux.

Thank you for any help on this.

2011/7/25 Mike Sokolov <soko...@ifactory.com>

> Hmm that looks like it's working fine. I stand corrected.
>
> On 07/25/2011 12:24 PM, Markus Jelsma wrote:
>
>> I've seen that issue too and read comments on the list, yet I've never had
>> trouble with the order; I don't know what's going on. Check this analyzer,
>> I've moved the charFilter to the bottom:
>>
>> <analyzer type="index">
>>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>   <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>   <filter class="solr.LowerCaseFilterFactory"/>
>>   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
>>   <filter class="solr.StopFilterFactory" ignoreCase="false" words="stopwords.txt"/>
>>   <filter class="solr.ASCIIFoldingFilterFactory"/>
>>   <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt" language="Dutch"/>
>>   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>   <charFilter class="solr.HTMLStripCharFilterFactory"/>
>> </analyzer>
>>
>> The analysis chain still does its job as I expect for the input:
>> <span>bla bla</span>
>>
>> Index Analyzer
>> org.apache.solr.analysis.HTMLStripCharFilterFactory {luceneMatchVersion=LUCENE_34}
>> text bla bla
>> org.apache.solr.analysis.WhitespaceTokenizerFactory {luceneMatchVersion=LUCENE_34}
>> position 1 2
>> term text bla bla
>> startOffset 6 10
>> endOffset 9 13
>> org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1, generateNumberParts=1, catenateWords=1, luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0, catenateNumbers=1}
>> position 1 2
>> term text bla bla
>> startOffset 6 10
>> endOffset 9 13
>> type word word
>> org.apache.solr.analysis.LowerCaseFilterFactory {luceneMatchVersion=LUCENE_34}
>> position 1 2
>> term text bla bla
>> startOffset 6 10
>> endOffset 9 13
>> type word word
>> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=true, ignoreCase=false, luceneMatchVersion=LUCENE_34}
>> position 1 2
>> term text bla bla
>> type word word
>> startOffset 6 10
>> endOffset 9 13
>> org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=false, luceneMatchVersion=LUCENE_34}
>> position 1 2
>> term text bla bla
>> type word word
>> startOffset 6 10
>> endOffset 9 13
>> org.apache.solr.analysis.ASCIIFoldingFilterFactory {luceneMatchVersion=LUCENE_34}
>> position 1 2
>> term text bla bla
>> type word word
>> startOffset 6 10
>> endOffset 9 13
>> org.apache.solr.analysis.SnowballPorterFilterFactory {protected=protwords.txt, language=Dutch, luceneMatchVersion=LUCENE_34}
>> position 1 2
>> term text bla bla
>> keyword false false
>> type word word
>> startOffset 6 10
>> endOffset 9 13
>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {luceneMatchVersion=LUCENE_34}
>> position 1 2
>> term text bla bla
>> keyword false false
>> type word word
>> startOffset 6 10
>> endOffset 9 13
>>
>> On Monday 25 July 2011 18:07:29 Mike Sokolov wrote:
>>
>>> Hmm - I'm not sure about that; see
>>> https://issues.apache.org/jira/browse/SOLR-2119
>>>
>>> On 07/25/2011 12:01 PM, Markus Jelsma wrote:
>>>
>>>> charFilters are executed first regardless of their position in the
>>>> analyzer.
>>>>
>>>> On Monday 25 July 2011 17:53:59 Mike Sokolov wrote:
>>>>
>>>>> I think you need to list the charFilter earlier in the analysis chain,
>>>>> before the tokenizer. Probably Solr should tell you this...
>>>>>
>>>>> -Mike
>>>>>
>>>>> On 07/25/2011 09:03 AM, Merlin Morgenstern wrote:
>>>>>
>>>>>> Sounds logical. I just changed it to the following, restarted and
>>>>>> reindexed with commit:
>>>>>>
>>>>>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
>>>>>>   <analyzer type="index">
>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>     <filter class="solr.KeywordMarkerFilterFactory"/>
>>>>>>     <filter class="solr.PorterStemFilterFactory"/>
>>>>>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>>>>>   </analyzer>
>>>>>>   <analyzer type="query">
>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>     <filter class="solr.KeywordMarkerFilterFactory"/>
>>>>>>     <filter class="solr.PorterStemFilterFactory"/>
>>>>>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>>>>>   </analyzer>
>>>>>> </fieldType>
>>>>>>
>>>>>> Unfortunately that did not fix the error. There are still <h3> tags
>>>>>> inside the data. I believe there are fewer than before, but I
>>>>>> cannot prove that. The fact is, there are still HTML tags inside the
>>>>>> data.
>>>>>>
>>>>>> Any other ideas what the problem could be?
>>>>>>
>>>>>> 2011/7/25 Markus Jelsma <markus.jel...@openindex.io>
>>>>>>
>>>>>>> You've three analyzer elements, I wonder what that would do. You need
>>>>>>> to add the char filter to the index-time analyzer.
>>>>>>>
>>>>>>> On Monday 25 July 2011 13:09:14 Merlin Morgenstern wrote:
>>>>>>>
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> I am trying to strip HTML tags from the data before adding the
>>>>>>>> documents to the index. To do that I altered schema.xml like this:
>>>>>>>>
>>>>>>>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
>>>>>>>>   <analyzer type="index">
>>>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>     <filter class="solr.KeywordMarkerFilterFactory"/>
>>>>>>>>     <filter class="solr.PorterStemFilterFactory"/>
>>>>>>>>   </analyzer>
>>>>>>>>   <analyzer type="query">
>>>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>>       generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>>>>>>>       catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>     <filter class="solr.KeywordMarkerFilterFactory"/>
>>>>>>>>     <filter class="solr.PorterStemFilterFactory"/>
>>>>>>>>   </analyzer>
>>>>>>>>   <analyzer>
>>>>>>>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>>>   </analyzer>
>>>>>>>> </fieldType>
>>>>>>>>
>>>>>>>> <fields>
>>>>>>>>   <field name="text" type="text" indexed="true" stored="true" required="false"/>
>>>>>>>> </fields>
>>>>>>>>
>>>>>>>> Unfortunately this does not work; the HTML tags like <h3> are still
>>>>>>>> present after restarting and reindexing. I also tried
>>>>>>>> HTMLStripTransformer, but this did not work either.
>>>>>>>>
>>>>>>>> Does anybody have an idea how to get this done? Thank you in advance
>>>>>>>> for any hint.
>>>>>>>>
>>>>>>>> Merlin
>>>>>>>
>>>>>>> --
>>>>>>> Markus Jelsma - CTO - Openindex
>>>>>>> http://www.linkedin.com/in/markus17
>>>>>>> 050-8536620 / 06-50258350
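
A note on the question in this thread, offered as a possible explanation rather than a confirmed diagnosis: Solr's analysis chain (charFilters, tokenizer, filters) only shapes the terms written to the index, while a field declared with stored="true" returns its value exactly as it was submitted. That would be consistent with the analysis output quoted above, where the charFilter does strip the <span> tags, even though tags stay visible in the returned data. If that is what is happening here, one option is to strip the markup before the documents are sent to Solr. The sketch below uses only the Python standard library; strip_html and prepare_doc are hypothetical helper names, and the field name "text" is taken from the schema in this thread.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and drops the tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Join chunks with a space so "<h3>a</h3><p>b</p>" becomes "a b",
        # then collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def strip_html(value):
    """Return value with markup removed (script/style bodies are kept as text)."""
    parser = _TextExtractor()
    parser.feed(value)
    parser.close()
    return parser.text()


def prepare_doc(doc, html_fields=("text",)):
    """Return a copy of doc with the given (assumed) HTML fields stripped."""
    cleaned = dict(doc)
    for name in html_fields:
        if isinstance(cleaned.get(name), str):
            cleaned[name] = strip_html(cleaned[name])
    return cleaned


if __name__ == "__main__":
    print(prepare_doc({"id": "1", "text": "<h3>Title</h3><span>bla bla</span>"}))
```

The cleaned dictionary can then be posted with whatever indexing client is already in use. With this approach the stored value itself no longer contains tags, and the HTMLStripCharFilterFactory in the schema becomes optional.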