Re: strip html from data

Merlin Morgenstern Thu, 11 Aug 2011 01:20:50 -0700

I am sorry, but I do not really understand the difference between the indexed
data and the returned result set.


I look at the "returned" dataset via this command:
solr/select/?q=id:533563&terms=true

which gives me HTML tags like these: </b><br />

I also tried to turn on TermsComponent, but it did not change anything:
solr/select/?q=id:533563&terms=true

The schema browser does not show any HTML tags inside the text field, just
the indexed words of that one dataset.

Is there a way to strip the HTML tags completely and not index them? If not,
how do I retrieve the results without HTML tags?
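
For context, the fallback I am considering is to strip the markup on the client before the document is sent to Solr, so that the stored value is already plain text. The following is only a rough sketch using Python's standard library; the names strip_html and _TextExtractor are my own and not part of Solr, and I have not tested it against my real data:

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into plain text.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Return the fragment with tags dropped and whitespace collapsed.

    Every tag is treated as a word boundary, so "foo<br />bar" becomes
    "foo bar". The trade-off is that inline markup inside a word
    ("b<i>o</i>ld") would also be split.
    """
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

The idea would be to call strip_html() on the text field of each document before posting it, which sidesteps the question of whether analyzers touch stored data.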

Thank you for your help.



2011/8/9 Erick Erickson <erickerick...@gmail.com>

> OK, what does "not working" mean? You never answered Markus' question:
>
> "Are you looking at the returned result set or what you've actually
> indexed?
> Analyzers are not run on the stored data, only on indexed data."
>
> If "not working" means that your returned results contain the markup, then
> you're confusing indexing and storing. All the analysis chains operate
> on data sent into the indexing process. But the verbatim data is *stored*
> prior to (or separate from) indexing.
>
> So my assumption is that you see data returned in the document with
> markup, which is just as it should be, and there's no problem at all. And your
> actual indexed terms (try looking at the data with TermsComponent, or
> admin/schema browser) will NOT have any markup.
>
> Perhaps you can back up a bit and describe what's failing vs. what you
> expect.
>
> Best
> Erick
>
> On Mon, Aug 8, 2011 at 6:50 AM, Merlin Morgenstern
> <merlin.morgenst...@googlemail.com> wrote:
> > Unfortunately I still can't get it running. The code I am using is the
> > following:
> >                <analyzer type="index">
> >                    <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >                    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >                    <filter class="solr.WordDelimiterFilterFactory"
> > generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >                    <filter class="solr.LowerCaseFilterFactory"/>
> >                    <filter class="solr.KeywordMarkerFilterFactory"/>
> >                    <filter class="solr.PorterStemFilterFactory"/>
> >                </analyzer>
> >                <analyzer type="query">
> >                    <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >                    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >                    <filter class="solr.WordDelimiterFilterFactory"
> > generateWordParts="1" generateNumberParts="1" catenateWords="0"
> > catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >                    <filter class="solr.LowerCaseFilterFactory"/>
> >                    <filter class="solr.KeywordMarkerFilterFactory"/>
> >                    <filter class="solr.PorterStemFilterFactory"/>
> >                </analyzer>
> >
> > I also tried this one:
> >
> >    <types>
> >         <fieldType name="text" class="solr.TextField"
> > positionIncrementGap="100" autoGeneratePhraseQueries="true">
> >               <analyzer>
> >                <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >                  <tokenizer class="solr.StandardTokenizerFactory"/>
> >                  <filter class="solr.StandardFilterFactory"/>
> >            </analyzer>
> >         </fieldType>
> >    </types>
> >      <field name="text" type="text" indexed="true" stored="true"
> > required="false"/>
> >
> > None of those worked. I restarted Solr after the schema update and reindexed
> > the data. No change, the HTML tags are still in there.
> >
> > Any other ideas? Maybe this is a bug in Solr? I am using Solr 3.3.0 on SUSE
> > Linux.
> >
> > Thank you for any help on this.
> >
> >
> >
> > 2011/7/25 Mike Sokolov <soko...@ifactory.com>
> >
> >> Hmm that looks like it's working fine.  I stand corrected.
> >>
> >>
> >>
> >> On 07/25/2011 12:24 PM, Markus Jelsma wrote:
> >>
> >>> I've seen that issue too and read comments on the list, yet I've never had
> >>> trouble with the order; I don't know what's going on. Check this analyzer,
> >>> I've moved the charFilter to the bottom:
> >>>
> >>> <analyzer type="index">
> >>> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> >>> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> >>> catenateAll="0" splitOnCaseChange="1"/>
> >>> <filter class="solr.LowerCaseFilterFactory"/>
> >>> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >>> ignoreCase="false" expand="true"/>
> >>> <filter class="solr.StopFilterFactory" ignoreCase="false"
> >>> words="stopwords.txt"/>
> >>> <filter class="solr.ASCIIFoldingFilterFactory"/>
> >>> <filter class="solr.SnowballPorterFilterFactory"
> >>> protected="protwords.txt" language="Dutch"/>
> >>> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>> <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >>> </analyzer>
> >>>
> >>> The analysis chain still does its job as I expect for the input:
> >>> <span>bla bla</span>
> >>>
> >>> Index Analyzer
> >>> org.apache.solr.analysis.HTMLStripCharFilterFactory
> >>> {luceneMatchVersion=LUCENE_34}
> >>> text    bla bla
> >>> org.apache.solr.analysis.WhitespaceTokenizerFactory
> >>> {luceneMatchVersion=LUCENE_34}
> >>> position        1       2
> >>> term text       bla     bla
> >>> startOffset     6       10
> >>> endOffset       9       13
> >>> org.apache.solr.analysis.WordDelimiterFilterFactory
> >>> {splitOnCaseChange=1,
> >>> generateNumberParts=1, catenateWords=1, luceneMatchVersion=LUCENE_34,
> >>> generateWordParts=1, catenateAll=0, catenateNumbers=1}
> >>> position        1       2
> >>> term text       bla     bla
> >>> startOffset     6       10
> >>> endOffset       9       13
> >>> type    word    word
> >>> org.apache.solr.analysis.LowerCaseFilterFactory
> >>> {luceneMatchVersion=LUCENE_34}
> >>> position        1       2
> >>> term text       bla     bla
> >>> startOffset     6       10
> >>> endOffset       9       13
> >>> type    word    word
> >>> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
> >>> expand=true, ignoreCase=false, luceneMatchVersion=LUCENE_34}
> >>> position        1       2
> >>> term text       bla     bla
> >>> type    word    word
> >>> startOffset     6       10
> >>> endOffset       9       13
> >>> org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
> >>> ignoreCase=false, luceneMatchVersion=LUCENE_34}
> >>> position        1       2
> >>> term text       bla     bla
> >>> type    word    word
> >>> startOffset     6       10
> >>> endOffset       9       13
> >>> org.apache.solr.analysis.ASCIIFoldingFilterFactory
> >>> {luceneMatchVersion=LUCENE_34}
> >>> position        1       2
> >>> term text       bla     bla
> >>> type    word    word
> >>> startOffset     6       10
> >>> endOffset       9       13
> >>> org.apache.solr.analysis.SnowballPorterFilterFactory
> >>> {protected=protwords.txt,
> >>> language=Dutch, luceneMatchVersion=LUCENE_34}
> >>> position        1       2
> >>> term text       bla     bla
> >>> keyword         false   false
> >>> type    word    word
> >>> startOffset     6       10
> >>> endOffset       9       13
> >>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
> >>> {luceneMatchVersion=LUCENE_34}
> >>> position        1       2
> >>> term text       bla     bla
> >>> keyword         false   false
> >>> type    word    word
> >>> startOffset     6       10
> >>> endOffset       9       13
> >>>
> >>>
> >>> On Monday 25 July 2011 18:07:29 Mike Sokolov wrote:
> >>>
> >>>
> >>>> Hmm - I'm not sure about that; see
> >>>> https://issues.apache.org/jira/browse/SOLR-2119
> >>>>
> >>>> On 07/25/2011 12:01 PM, Markus Jelsma wrote:
> >>>>
> >>>>
> >>>>> charFilters are executed first regardless of their position in the
> >>>>> analyzer.
> >>>>>
> >>>>> On Monday 25 July 2011 17:53:59 Mike Sokolov wrote:
> >>>>>
> >>>>>
> >>>>>> I think you need to list the charFilter earlier in the analysis chain,
> >>>>>> before the tokenizer.  Probably Solr should tell you this...
> >>>>>>
> >>>>>> -Mike
> >>>>>>
> >>>>>> On 07/25/2011 09:03 AM, Merlin Morgenstern wrote:
> >>>>>>
> >>>>>>
> >>>>>>> Sounds logical. I just changed it to the following, restarted and
> >>>>>>> reindexed with commit:
> >>>>>>>
> >>>>>>>   <fieldType name="text" class="solr.TextField"
> >>>>>>>       positionIncrementGap="100" autoGeneratePhraseQueries="true">
> >>>>>>>     <analyzer type="index">
> >>>>>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>>>>       <filter class="solr.WordDelimiterFilterFactory"
> >>>>>>>           generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >>>>>>>           catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >>>>>>>       <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>       <filter class="solr.KeywordMarkerFilterFactory"/>
> >>>>>>>       <filter class="solr.PorterStemFilterFactory"/>
> >>>>>>>       <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >>>>>>>     </analyzer>
> >>>>>>>     <analyzer type="query">
> >>>>>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>>>>       <filter class="solr.WordDelimiterFilterFactory"
> >>>>>>>           generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >>>>>>>           catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >>>>>>>       <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>       <filter class="solr.KeywordMarkerFilterFactory"/>
> >>>>>>>       <filter class="solr.PorterStemFilterFactory"/>
> >>>>>>>       <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >>>>>>>     </analyzer>
> >>>>>>>   </fieldType>
> >>>>>>>
> >>>>>>> Unfortunately that did not fix the error. There are still <h3> tags
> >>>>>>> inside the data. I believe there are fewer than before, but I
> >>>>>>> cannot prove that. The fact is, there are still HTML tags inside the
> >>>>>>> data.
> >>>>>>>
> >>>>>>> Any other ideas what the problem could be?
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> 2011/7/25 Markus Jelsma <markus.jel...@openindex.io>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> You have three analyzer elements; I wonder what that would do. You need
> >>>>>>>> to add the char filter to the index-time analyzer.
> >>>>>>>>
> >>>>>>>> On Monday 25 July 2011 13:09:14 Merlin Morgenstern wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Hi there,
> >>>>>>>>>
> >>>>>>>>> I am trying to strip HTML tags from the data before adding the
> >>>>>>>>> documents to the index. To do that I altered schema.xml like this:
> >>>>>>>>>   <fieldType name="text" class="solr.TextField"
> >>>>>>>>>       positionIncrementGap="100" autoGeneratePhraseQueries="true">
> >>>>>>>>>     <analyzer type="index">
> >>>>>>>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>>>>>>       <filter class="solr.WordDelimiterFilterFactory"
> >>>>>>>>>           generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >>>>>>>>>           catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >>>>>>>>>       <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>       <filter class="solr.KeywordMarkerFilterFactory"/>
> >>>>>>>>>       <filter class="solr.PorterStemFilterFactory"/>
> >>>>>>>>>     </analyzer>
> >>>>>>>>>     <analyzer type="query">
> >>>>>>>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>>>>>>       <filter class="solr.WordDelimiterFilterFactory"
> >>>>>>>>>           generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >>>>>>>>>           catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >>>>>>>>>       <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>       <filter class="solr.KeywordMarkerFilterFactory"/>
> >>>>>>>>>       <filter class="solr.PorterStemFilterFactory"/>
> >>>>>>>>>     </analyzer>
> >>>>>>>>>     <analyzer>
> >>>>>>>>>       <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >>>>>>>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>>>>>>     </analyzer>
> >>>>>>>>>   </fieldType>
> >>>>>>>>>
> >>>>>>>>>   <fields>
> >>>>>>>>>     <field name="text" type="text" indexed="true" stored="true"
> >>>>>>>>>         required="false"/>
> >>>>>>>>>   </fields>
> >>>>>>>>>
> >>>>>>>>> Unfortunately this does not work; HTML tags like <h3> are still
> >>>>>>>>> present after restarting and reindexing. I also tried
> >>>>>>>>> HTMLStripTransformer, but this did not work either.
> >>>>>>>>>
> >>>>>>>>> Does anybody have an idea how to get this done? Thank you in advance for
> >>>>>>>>> any hint.
> >>>>>>>>>
> >>>>>>>>> Merlin
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>> --
> >>>>>>>> Markus Jelsma - CTO - Openindex
> >>>>>>>> http://www.linkedin.com/in/markus17
> >>>>>>>> 050-8536620 / 06-50258350
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>
> >>
> >
>
