You can use <charFilter class="solr.HTMLStripCharFilterFactory"/>, as in this example. Check the docs for your specific Solr version, because the HTML-strip configuration syntax changed somewhere between 1.4 and 3.x.
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

2011/8/11 Merlin Morgenstern <merlin.morgenst...@googlemail.com>

> I am sorry, but I do not really understand the difference between the
> indexed and the returned result set.
>
> I look at the "returned" dataset via this command:
> solr/select/?q=id:533563&terms=true
>
> which gives me HTML tags like these: </b><br />
>
> I also tried to turn on TermsComponent, but it did not change anything:
> solr/select/?q=id:533563&terms=true
>
> The schema browser does not show any HTML tags inside the text field, just
> the indexed words of the one dataset.
>
> Is there a way to strip the HTML tags completely and not index them? If not,
> how do I retrieve the results without HTML tags?
>
> Thank you for your help.
>
> 2011/8/9 Erick Erickson <erickerick...@gmail.com>
>
> > OK, what does "not working" mean? You never answered Markus' question:
> >
> > "Are you looking at the returned result set or what you've actually
> > indexed? Analyzers are not run on the stored data, only on indexed data."
> >
> > If "not working" means that your returned results contain the markup, then
> > you're confusing indexing and storing. All the analysis chains operate
> > on data sent into the indexing process. But the verbatim data is *stored*
> > prior to (or separate from) indexing.
> >
> > So my assumption is that you see data returned in the document with
> > markup, which is just as it should be, and there's no problem at all. And
> > your actual indexed terms (try looking at the data with TermsComponent, or
> > the admin/schema browser) will NOT have any markup.
> >
> > Perhaps you can back up a bit and describe what's failing vs. what you
> > expect.
> >
> > Best
> > Erick
> >
> > On Mon, Aug 8, 2011 at 6:50 AM, Merlin Morgenstern
> > <merlin.morgenst...@googlemail.com> wrote:
> > > Unfortunately I still can't get it running. The code I am using is the
> > > following:
> > >
> > > <analyzer type="index">
> > >   <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > >   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >   <filter class="solr.WordDelimiterFilterFactory"
> > >           generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > >           catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > >   <filter class="solr.LowerCaseFilterFactory"/>
> > >   <filter class="solr.KeywordMarkerFilterFactory"/>
> > >   <filter class="solr.PorterStemFilterFactory"/>
> > > </analyzer>
> > > <analyzer type="query">
> > >   <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > >   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >   <filter class="solr.WordDelimiterFilterFactory"
> > >           generateWordParts="1" generateNumberParts="1" catenateWords="0"
> > >           catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> > >   <filter class="solr.LowerCaseFilterFactory"/>
> > >   <filter class="solr.KeywordMarkerFilterFactory"/>
> > >   <filter class="solr.PorterStemFilterFactory"/>
> > > </analyzer>
> > >
> > > I also tried this one:
> > >
> > > <types>
> > >   <fieldType name="text" class="solr.TextField"
> > >              positionIncrementGap="100" autoGeneratePhraseQueries="true">
> > >     <analyzer>
> > >       <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > >       <tokenizer class="solr.StandardTokenizerFactory"/>
> > >       <filter class="solr.StandardFilterFactory"/>
> > >     </analyzer>
> > >   </fieldType>
> > > </types>
> > > <field name="text" type="text" indexed="true" stored="true"
> > >        required="false"/>
> > >
> > > Neither of those worked. I restarted Solr after the schema update and
> > > reindexed the data. No change, the HTML tags are still in there.
> > >
> > > Any other ideas? Maybe this is a bug in Solr?
> > > I am using Solr 3.3.0 on SUSE Linux.
> > >
> > > Thank you for any help on this.
> > >
> > > 2011/7/25 Mike Sokolov <soko...@ifactory.com>
> > >
> > >> Hmm that looks like it's working fine. I stand corrected.
> > >>
> > >> On 07/25/2011 12:24 PM, Markus Jelsma wrote:
> > >>
> > >>> I've seen that issue too and read comments on the list, yet I've never
> > >>> had trouble with the order; don't know what's going on. Check this
> > >>> analyzer, I've moved the charFilter to the bottom:
> > >>>
> > >>> <analyzer type="index">
> > >>>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>>   <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> > >>>           generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> > >>>           catenateAll="0" splitOnCaseChange="1"/>
> > >>>   <filter class="solr.LowerCaseFilterFactory"/>
> > >>>   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> > >>>           ignoreCase="false" expand="true"/>
> > >>>   <filter class="solr.StopFilterFactory" ignoreCase="false"
> > >>>           words="stopwords.txt"/>
> > >>>   <filter class="solr.ASCIIFoldingFilterFactory"/>
> > >>>   <filter class="solr.SnowballPorterFilterFactory"
> > >>>           protected="protwords.txt" language="Dutch"/>
> > >>>   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > >>>   <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > >>> </analyzer>
> > >>>
> > >>> The analysis chain still does its job as I expect for the input:
> > >>> <span>bla bla</span>
> > >>>
> > >>> Index Analyzer
> > >>> org.apache.solr.analysis.HTMLStripCharFilterFactory {luceneMatchVersion=LUCENE_34}
> > >>>   text         bla bla
> > >>> org.apache.solr.analysis.WhitespaceTokenizerFactory {luceneMatchVersion=LUCENE_34}
> > >>>   position     1      2
> > >>>   term text    bla    bla
> > >>>   startOffset  6      10
> > >>>   endOffset    9      13
> > >>> org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1, generateNumberParts=1, catenateWords=1, luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0, catenateNumbers=1}
> > >>>   position     1      2
> > >>>   term text    bla    bla
> > >>>   startOffset  6      10
> > >>>   endOffset    9      13
> > >>>   type         word   word
> > >>> org.apache.solr.analysis.LowerCaseFilterFactory {luceneMatchVersion=LUCENE_34}
> > >>>   position     1      2
> > >>>   term text    bla    bla
> > >>>   startOffset  6      10
> > >>>   endOffset    9      13
> > >>>   type         word   word
> > >>> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=true, ignoreCase=false, luceneMatchVersion=LUCENE_34}
> > >>>   position     1      2
> > >>>   term text    bla    bla
> > >>>   type         word   word
> > >>>   startOffset  6      10
> > >>>   endOffset    9      13
> > >>> org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=false, luceneMatchVersion=LUCENE_34}
> > >>>   position     1      2
> > >>>   term text    bla    bla
> > >>>   type         word   word
> > >>>   startOffset  6      10
> > >>>   endOffset    9      13
> > >>> org.apache.solr.analysis.ASCIIFoldingFilterFactory {luceneMatchVersion=LUCENE_34}
> > >>>   position     1      2
> > >>>   term text    bla    bla
> > >>>   type         word   word
> > >>>   startOffset  6      10
> > >>>   endOffset    9      13
> > >>> org.apache.solr.analysis.SnowballPorterFilterFactory {protected=protwords.txt, language=Dutch, luceneMatchVersion=LUCENE_34}
> > >>>   position     1      2
> > >>>   term text    bla    bla
> > >>>   keyword      false  false
> > >>>   type         word   word
> > >>>   startOffset  6      10
> > >>>   endOffset    9      13
> > >>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {luceneMatchVersion=LUCENE_34}
> > >>>   position     1      2
> > >>>   term text    bla    bla
> > >>>   keyword      false  false
> > >>>   type         word   word
> > >>>   startOffset  6      10
> > >>>   endOffset    9      13
> > >>>
> > >>> On Monday 25 July 2011 18:07:29 Mike Sokolov wrote:
> > >>>
> > >>>> Hmm - I'm not sure about that; see
> > >>>> https://issues.apache.org/jira/browse/SOLR-2119
> > >>>>
> > >>>> On 07/25/2011 12:01 PM, Markus Jelsma wrote:
> > >>>>
> > >>>>> charFilters are executed first regardless of their position in the
> > >>>>> analyzer.
> > >>>>>
> > >>>>> On Monday 25 July 2011 17:53:59 Mike Sokolov wrote:
> > >>>>>
> > >>>>>> I think you need to list the charFilter earlier in the analysis
> > >>>>>> chain, before the tokenizer. Probably Solr should tell you this...
> > >>>>>>
> > >>>>>> -Mike
> > >>>>>>
> > >>>>>> On 07/25/2011 09:03 AM, Merlin Morgenstern wrote:
> > >>>>>>
> > >>>>>>> Sounds logical. I just changed it to the following, restarted and
> > >>>>>>> reindexed with commit:
> > >>>>>>>
> > >>>>>>> <fieldType name="text" class="solr.TextField"
> > >>>>>>>            positionIncrementGap="100" autoGeneratePhraseQueries="true">
> > >>>>>>>   <analyzer type="index">
> > >>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>>>>>>     <filter class="solr.WordDelimiterFilterFactory"
> > >>>>>>>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > >>>>>>>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > >>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> > >>>>>>>     <filter class="solr.KeywordMarkerFilterFactory"/>
> > >>>>>>>     <filter class="solr.PorterStemFilterFactory"/>
> > >>>>>>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > >>>>>>>   </analyzer>
> > >>>>>>>   <analyzer type="query">
> > >>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>>>>>>     <filter class="solr.WordDelimiterFilterFactory"
> > >>>>>>>             generateWordParts="1" generateNumberParts="1" catenateWords="0"
> > >>>>>>>             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> > >>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> > >>>>>>>     <filter class="solr.KeywordMarkerFilterFactory"/>
> > >>>>>>>     <filter class="solr.PorterStemFilterFactory"/>
> > >>>>>>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > >>>>>>>   </analyzer>
> > >>>>>>> </fieldType>
> > >>>>>>>
> > >>>>>>> Unfortunately that did not fix the error. There are still <h3> tags
> > >>>>>>> inside the data. I believe there are fewer than before, but I cannot
> > >>>>>>> prove that. Fact is, there are still HTML tags inside the data.
> > >>>>>>>
> > >>>>>>> Any other ideas what the problem could be?
> > >>>>>>>
> > >>>>>>> 2011/7/25 Markus Jelsma <markus.jel...@openindex.io>
> > >>>>>>>
> > >>>>>>>> You've three analyzer elements, I wonder what that would do. You
> > >>>>>>>> need to add the char filter to the index-time analyzer.
> > >>>>>>>>
> > >>>>>>>> On Monday 25 July 2011 13:09:14 Merlin Morgenstern wrote:
> > >>>>>>>>
> > >>>>>>>>> Hi there,
> > >>>>>>>>>
> > >>>>>>>>> I am trying to strip HTML tags from the data before adding the
> > >>>>>>>>> documents to the index.
> > >>>>>>>>> To do that I altered schema.xml like this:
> > >>>>>>>>>
> > >>>>>>>>> <fieldType name="text" class="solr.TextField"
> > >>>>>>>>>            positionIncrementGap="100" autoGeneratePhraseQueries="true">
> > >>>>>>>>>   <analyzer type="index">
> > >>>>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>>>>>>>>     <filter class="solr.WordDelimiterFilterFactory"
> > >>>>>>>>>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > >>>>>>>>>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > >>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> > >>>>>>>>>     <filter class="solr.KeywordMarkerFilterFactory"/>
> > >>>>>>>>>     <filter class="solr.PorterStemFilterFactory"/>
> > >>>>>>>>>   </analyzer>
> > >>>>>>>>>   <analyzer type="query">
> > >>>>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>>>>>>>>     <filter class="solr.WordDelimiterFilterFactory"
> > >>>>>>>>>             generateWordParts="1" generateNumberParts="1" catenateWords="0"
> > >>>>>>>>>             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> > >>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> > >>>>>>>>>     <filter class="solr.KeywordMarkerFilterFactory"/>
> > >>>>>>>>>     <filter class="solr.PorterStemFilterFactory"/>
> > >>>>>>>>>   </analyzer>
> > >>>>>>>>>   <analyzer>
> > >>>>>>>>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > >>>>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>>>>>>>>   </analyzer>
> > >>>>>>>>> </fieldType>
> > >>>>>>>>>
> > >>>>>>>>> <fields>
> > >>>>>>>>>   <field name="text" type="text" indexed="true" stored="true"
> > >>>>>>>>>          required="false"/>
> > >>>>>>>>> </fields>
> > >>>>>>>>>
> > >>>>>>>>> Unfortunately this does not work; HTML tags like <h3> are still
> > >>>>>>>>> present after restarting and reindexing. I also tried
> > >>>>>>>>> HTMLStripTransformer, but this did not work either.
> > >>>>>>>>>
> > >>>>>>>>> Does anybody have an idea how to get this done? Thank you in
> > >>>>>>>>> advance for any hint.
> > >>>>>>>>>
> > >>>>>>>>> Merlin
> > >>>>>>>>
> > >>>>>>>> --
> > >>>>>>>> Markus Jelsma - CTO - Openindex
> > >>>>>>>> http://www.linkedin.com/in/markus17
> > >>>>>>>> 050-8536620 / 06-50258350

--
*Alexei Martchenko* | *CEO* | Superdownloads
ale...@superdownloads.com.br | ale...@martchenko.com.br | (11) 5083.1018/5080.3535/5080.3533
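
P.S. A footnote on the stored-versus-indexed point from Erick's reply: a charFilter only changes the indexed terms, so the stored field value returned in search results keeps its markup. If the documents are loaded through the DataImportHandler, the HTMLStripTransformer that Merlin mentions can strip the markup before the value reaches Solr at all. Below is a minimal sketch of how that is usually wired up. It assumes a JDBC source; the driver, URL, credentials, table, and column names are hypothetical placeholders and are not taken from anyone's setup in this thread.

```xml
<!-- data-config.xml for the DataImportHandler.
     Everything source-specific here (driver, url, user, password,
     table and column names) is a made-up placeholder. -->
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/exampledb"
              user="solr" password="secret"/>
  <document>
    <!-- The transformer must be declared on the entity ... -->
    <entity name="item"
            transformer="HTMLStripTransformer"
            query="SELECT id, text FROM item">
      <field column="id" name="id"/>
      <!-- ... and switched on per field with stripHTML="true".
           Markup is removed at import time, before Solr stores
           or analyzes the value. -->
      <field column="text" name="text" stripHTML="true"/>
    </entity>
  </document>
</dataConfig>
```

With this wiring the text column should arrive without tags in both the stored value and the indexed terms, but existing documents need a full re-import before the change shows up. If you index by posting XML or through a client library instead, then on Solr 3.x, as far as I know, the simplest route is to remove the markup on the client side before sending the document.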