Re: strip html from data

Alexei Martchenko Thu, 11 Aug 2011 11:45:11 -0700

You can use <charFilter class="solr.HTMLStripCharFilterFactory"/>, as in the
example below. Check the docs for your specific Solr version, because the
HTML-strip syntax is not the same in 1.4 and 3.x.


<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- charFilters run on the raw text before the tokenizer -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
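
Keep in mind what Erick says further down the thread: the charFilter only
affects the terms that get indexed. With stored="true", Solr returns the field
value exactly as it was sent, markup included. One possible layout, as an
untested sketch rather than a definitive fix: index the HTML field without
storing it, and return a second stored field that the client fills with
already-stripped text. The field name text_display is made up for illustration,
and the sketch assumes the stock "string" field type from the example schema.

```xml
<!-- Searchable field: HTML is stripped by the analyzer at index time.
     Not stored, so it is never returned in results. -->
<field name="text" type="text" indexed="true" stored="false" required="false"/>

<!-- Hypothetical display field: stored verbatim and returned in results.
     The indexing client must send already-stripped text here, because
     analysis does not touch stored values. -->
<field name="text_display" type="string" indexed="false" stored="true" required="false"/>
```

With this layout the markup has to be removed on the client side before the
document is posted, since nothing in the analysis chain rewrites stored data.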

2011/8/11 Merlin Morgenstern <merlin.morgenst...@googlemail.com>

> I am sorry, but I do not really understand the difference between the
> indexed data and the returned result set.
>
> I look at the "returned" dataset via this command:
> solr/select/?q=id:533563&terms=true
>
> which gives me html tags like these: </b><br />
>
> I also tried to turn on TermsComponent, but it did not change anything:
> solr/select/?q=id:533563&terms=true
>
> The schema browser does not show any html tags inside the text field, just
> indexed words of the one dataset.
>
> Is there a way to strip the html tags completely and not index them? If not,
> how do I retrieve the results without html tags?
>
> Thank you for your help.
>
>
>
> 2011/8/9 Erick Erickson <erickerick...@gmail.com>
>
> > OK, what does "not working" mean? You never answered Markus' question:
> >
> > "Are you looking at the returned result set or what you've actually
> > indexed?
> > Analyzers are not run on the stored data, only on indexed data."
> >
> > If "not working" means that your returned results contain the markup, then
> > you're confusing indexing and storing. All the analysis chains operate
> > on data sent into the indexing process. But the verbatim data is *stored*
> > prior to (or separate from) indexing.
> >
> > So my assumption is that you see data returned in the document with
> > markup, which is just as it should be, and there's no problem at all. And
> > your actual indexed terms (try looking at the data with TermsComponent, or
> > admin/schema browser) will NOT have any markup.
> >
> > Perhaps you can back up a bit and describe what's failing vs. what you
> > expect.
> >
> > Best
> > Erick
> >
> > On Mon, Aug 8, 2011 at 6:50 AM, Merlin Morgenstern
> > <merlin.morgenst...@googlemail.com> wrote:
> > > Unfortunately I still can't get it running. The code I am using is the
> > > following:
> > >                <analyzer type="index">
> > >                    <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > >                    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >                    <filter class="solr.WordDelimiterFilterFactory"
> > >                        generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > >                        catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > >                    <filter class="solr.LowerCaseFilterFactory"/>
> > >                    <filter class="solr.KeywordMarkerFilterFactory"/>
> > >                    <filter class="solr.PorterStemFilterFactory"/>
> > >                </analyzer>
> > >                <analyzer type="query">
> > >                    <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > >                    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >                    <filter class="solr.WordDelimiterFilterFactory"
> > >                        generateWordParts="1" generateNumberParts="1" catenateWords="0"
> > >                        catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> > >                    <filter class="solr.LowerCaseFilterFactory"/>
> > >                    <filter class="solr.KeywordMarkerFilterFactory"/>
> > >                    <filter class="solr.PorterStemFilterFactory"/>
> > >                </analyzer>
> > >
> > > I also tried this one:
> > >
> > >    <types>
> > >        <fieldType name="text" class="solr.TextField"
> > >            positionIncrementGap="100" autoGeneratePhraseQueries="true">
> > >            <analyzer>
> > >                <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > >                <tokenizer class="solr.StandardTokenizerFactory"/>
> > >                <filter class="solr.StandardFilterFactory"/>
> > >            </analyzer>
> > >        </fieldType>
> > >    </types>
> > >    <field name="text" type="text" indexed="true" stored="true"
> > >        required="false"/>
> > >
> > > None of those worked. I restarted Solr after the schema update and
> > > reindexed the data. No change, the html tags are still in there.
> > >
> > > Any other ideas? Maybe this is a bug in Solr? I am using Solr 3.3.0 on
> > > SUSE Linux.
> > >
> > > Thank you for any help on this.
> > >
> > >
> > >
> > > 2011/7/25 Mike Sokolov <soko...@ifactory.com>
> > >
> > >> Hmm that looks like it's working fine.  I stand corrected.
> > >>
> > >>
> > >>
> > >> On 07/25/2011 12:24 PM, Markus Jelsma wrote:
> > >>
> > >>> I've seen that issue too and read comments on the list, yet I've never
> > >>> had trouble with the order; I don't know what's going on. Check this
> > >>> analyzer, I've moved the charFilter to the bottom:
> > >>>
> > >>> <analyzer type="index">
> > >>>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>>   <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> > >>>       generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> > >>>       catenateAll="0" splitOnCaseChange="1"/>
> > >>>   <filter class="solr.LowerCaseFilterFactory"/>
> > >>>   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> > >>>       ignoreCase="false" expand="true"/>
> > >>>   <filter class="solr.StopFilterFactory" ignoreCase="false"
> > >>>       words="stopwords.txt"/>
> > >>>   <filter class="solr.ASCIIFoldingFilterFactory"/>
> > >>>   <filter class="solr.SnowballPorterFilterFactory"
> > >>>       protected="protwords.txt" language="Dutch"/>
> > >>>   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > >>>   <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > >>> </analyzer>
> > >>>
> > >>> The analysis chain still does its job as I expect for the input:
> > >>> <span>bla bla</span>
> > >>>
> > >>> Index Analyzer
> > >>> org.apache.solr.analysis.HTMLStripCharFilterFactory
> > >>> {luceneMatchVersion=LUCENE_34}
> > >>> text    bla bla
> > >>> org.apache.solr.analysis.WhitespaceTokenizerFactory
> > >>> {luceneMatchVersion=LUCENE_34}
> > >>> position        1       2
> > >>> term text       bla     bla
> > >>> startOffset     6       10
> > >>> endOffset       9       13
> > >>> org.apache.solr.analysis.WordDelimiterFilterFactory
> > >>> {splitOnCaseChange=1,
> > >>> generateNumberParts=1, catenateWords=1, luceneMatchVersion=LUCENE_34,
> > >>> generateWordParts=1, catenateAll=0, catenateNumbers=1}
> > >>> position        1       2
> > >>> term text       bla     bla
> > >>> startOffset     6       10
> > >>> endOffset       9       13
> > >>> type    word    word
> > >>> org.apache.solr.analysis.LowerCaseFilterFactory
> > >>> {luceneMatchVersion=LUCENE_34}
> > >>> position        1       2
> > >>> term text       bla     bla
> > >>> startOffset     6       10
> > >>> endOffset       9       13
> > >>> type    word    word
> > >>> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
> > >>> expand=true, ignoreCase=false, luceneMatchVersion=LUCENE_34}
> > >>> position        1       2
> > >>> term text       bla     bla
> > >>> type    word    word
> > >>> startOffset     6       10
> > >>> endOffset       9       13
> > >>> org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
> > >>> ignoreCase=false, luceneMatchVersion=LUCENE_34}
> > >>> position        1       2
> > >>> term text       bla     bla
> > >>> type    word    word
> > >>> startOffset     6       10
> > >>> endOffset       9       13
> > >>> org.apache.solr.analysis.ASCIIFoldingFilterFactory
> > >>> {luceneMatchVersion=LUCENE_34}
> > >>> position        1       2
> > >>> term text       bla     bla
> > >>> type    word    word
> > >>> startOffset     6       10
> > >>> endOffset       9       13
> > >>> org.apache.solr.analysis.SnowballPorterFilterFactory
> > >>> {protected=protwords.txt,
> > >>> language=Dutch, luceneMatchVersion=LUCENE_34}
> > >>> position        1       2
> > >>> term text       bla     bla
> > >>> keyword         false   false
> > >>> type    word    word
> > >>> startOffset     6       10
> > >>> endOffset       9       13
> > >>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
> > >>> {luceneMatchVersion=LUCENE_34}
> > >>> position        1       2
> > >>> term text       bla     bla
> > >>> keyword         false   false
> > >>> type    word    word
> > >>> startOffset     6       10
> > >>> endOffset       9       13
> > >>>
> > >>>
> > >>> On Monday 25 July 2011 18:07:29 Mike Sokolov wrote:
> > >>>
> > >>>
> > >>>> Hmm - I'm not sure about that; see
> > >>>> https://issues.apache.org/jira/browse/SOLR-2119
> > >>>>
> > >>>> On 07/25/2011 12:01 PM, Markus Jelsma wrote:
> > >>>>
> > >>>>
> > >>>>> charFilters are executed first regardless of their position in the
> > >>>>> analyzer.
> > >>>>>
> > >>>>> On Monday 25 July 2011 17:53:59 Mike Sokolov wrote:
> > >>>>>
> > >>>>>
> > >>>>>> I think you need to list the charFilter earlier in the analysis
> > >>>>>> chain, before the tokenizer. Probably Solr should tell you this...
> > >>>>>>
> > >>>>>> -Mike
> > >>>>>>
> > >>>>>> On 07/25/2011 09:03 AM, Merlin Morgenstern wrote:
> > >>>>>>
> > >>>>>>
> > >>>>>>> Sounds logical. I just changed it to the following, restarted and
> > >>>>>>> reindexed with commit:
> > >>>>>>>
> > >>>>>>>   <fieldType name="text" class="solr.TextField"
> > >>>>>>>       positionIncrementGap="100" autoGeneratePhraseQueries="true">
> > >>>>>>>     <analyzer type="index">
> > >>>>>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>>>>>>       <filter class="solr.WordDelimiterFilterFactory"
> > >>>>>>>           generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > >>>>>>>           catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > >>>>>>>       <filter class="solr.LowerCaseFilterFactory"/>
> > >>>>>>>       <filter class="solr.KeywordMarkerFilterFactory"/>
> > >>>>>>>       <filter class="solr.PorterStemFilterFactory"/>
> > >>>>>>>       <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > >>>>>>>     </analyzer>
> > >>>>>>>     <analyzer type="query">
> > >>>>>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>>>>>>       <filter class="solr.WordDelimiterFilterFactory"
> > >>>>>>>           generateWordParts="1" generateNumberParts="1" catenateWords="0"
> > >>>>>>>           catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> > >>>>>>>       <filter class="solr.LowerCaseFilterFactory"/>
> > >>>>>>>       <filter class="solr.KeywordMarkerFilterFactory"/>
> > >>>>>>>       <filter class="solr.PorterStemFilterFactory"/>
> > >>>>>>>       <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > >>>>>>>     </analyzer>
> > >>>>>>>   </fieldType>
> > >>>>>>>
> > >>>>>>> Unfortunately that did not fix the error. There are still <h3> tags
> > >>>>>>> inside the data. Although I believe there are fewer than before, I
> > >>>>>>> cannot prove that. Fact is, there are still html tags inside the
> > >>>>>>> data.
> > >>>>>>>
> > >>>>>>> Any other ideas what the problem could be?
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> 2011/7/25 Markus Jelsma <markus.jel...@openindex.io>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> You've three analyzer elements, I wonder what that would do. You
> > >>>>>>>> need to add the char filter to the index-time analyzer.
> > >>>>>>>>
> > >>>>>>>> On Monday 25 July 2011 13:09:14 Merlin Morgenstern wrote:
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>> Hi there,
> > >>>>>>>>>
> > >>>>>>>>> I am trying to strip html tags from the data before adding the
> > >>>>>>>>> documents to the index. To do that I altered schema.xml like this:
> > >>>>>>>>>   <fieldType name="text" class="solr.TextField"
> > >>>>>>>>>       positionIncrementGap="100" autoGeneratePhraseQueries="true">
> > >>>>>>>>>     <analyzer type="index">
> > >>>>>>>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>>>>>>>>       <filter class="solr.WordDelimiterFilterFactory"
> > >>>>>>>>>           generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > >>>>>>>>>           catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > >>>>>>>>>       <filter class="solr.LowerCaseFilterFactory"/>
> > >>>>>>>>>       <filter class="solr.KeywordMarkerFilterFactory"/>
> > >>>>>>>>>       <filter class="solr.PorterStemFilterFactory"/>
> > >>>>>>>>>     </analyzer>
> > >>>>>>>>>     <analyzer type="query">
> > >>>>>>>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>>>>>>>>       <filter class="solr.WordDelimiterFilterFactory"
> > >>>>>>>>>           generateWordParts="1" generateNumberParts="1" catenateWords="0"
> > >>>>>>>>>           catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> > >>>>>>>>>       <filter class="solr.LowerCaseFilterFactory"/>
> > >>>>>>>>>       <filter class="solr.KeywordMarkerFilterFactory"/>
> > >>>>>>>>>       <filter class="solr.PorterStemFilterFactory"/>
> > >>>>>>>>>     </analyzer>
> > >>>>>>>>>     <analyzer>
> > >>>>>>>>>       <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > >>>>>>>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>>>>>>>>     </analyzer>
> > >>>>>>>>>   </fieldType>
> > >>>>>>>>>
> > >>>>>>>>>   <fields>
> > >>>>>>>>>     <field name="text" type="text" indexed="true" stored="true"
> > >>>>>>>>>         required="false"/>
> > >>>>>>>>>   </fields>
> > >>>>>>>>>
> > >>>>>>>>> Unfortunately this does not work, the html tags like <h3> are
> > >>>>>>>>> still present after restarting and reindexing. I also tried
> > >>>>>>>>> HTMLStripTransformer, but this did not work either.
> > >>>>>>>>>
> > >>>>>>>>> Does anybody have an idea how to get this done? Thank you in
> > >>>>>>>>> advance for any hint.
> > >>>>>>>>>
> > >>>>>>>>> Merlin
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>> --
> > >>>>>>>> Markus Jelsma - CTO - Openindex
> > >>>>>>>> http://www.linkedin.com/in/markus17
> > >>>>>>>> 050-8536620 / 06-50258350
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>
> > >>
> > >
> >
>
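
About the TermsComponent attempt quoted above: adding terms=true to /select
only has an effect if the terms component is registered for that handler, and
the component also needs terms.fl to know which field to list. A sketch of a
dedicated handler for solrconfig.xml, modelled on the example configuration
shipped with Solr (treat the details as assumptions and compare them with the
solrconfig.xml from your own version; /terms is just the conventional handler
name):

```xml
<!-- Registers the TermsComponent, which lists terms as they exist in the index -->
<searchComponent name="terms" class="solr.TermsComponent"/>

<!-- Dedicated handler that runs only the terms component -->
<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <!-- Turns the component on for every request to this handler -->
    <bool name="terms">true</bool>
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>
```

A request such as solr/terms?terms.fl=text&terms.limit=20 should then list
indexed terms of the text field. If the charFilter is doing its job, no tag
names or angle brackets show up there, even though the stored value returned
by /select still contains the markup.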



-- 

*Alexei Martchenko* | *CEO* | Superdownloads
ale...@superdownloads.com.br | ale...@martchenko.com.br | (11)
5083.1018/5080.3535/5080.3533
