Re: strip html from data

Erick Erickson Tue, 09 Aug 2011 05:03:19 -0700

OK, what does "not working" mean? You never answered Markus' question:


"Are you looking at the returned result set or what you've actually indexed?
Analyzers are not run on the stored data, only on indexed data."

If "not working" means that your returned results contain the markup, then
you're confusing indexing and storing. All the analysis chains operate
on data sent into the indexing process. But the verbatim data is *stored*
prior to (or separate from) indexing.
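
To make the distinction concrete, here is a minimal toy model in Python. It is not Solr code: `ToyIndex`, `analyze` and the regex-based tag stripping are hypothetical stand-ins for the real analysis chain, assumed only for illustration. The point it sketches is that the stored value keeps the markup while the indexed terms do not.

```python
import re

# Hypothetical, simplified stand-in for an HTML-stripping char filter.
TAG_RE = re.compile(r"<[^>]+>")


def analyze(text):
    """Toy index-time chain: strip tags, split on whitespace, lowercase."""
    stripped = TAG_RE.sub(" ", text)
    return [token.lower() for token in stripped.split()]


class ToyIndex:
    """Keeps the verbatim stored value separate from the analyzed terms."""

    def __init__(self):
        self.stored = {}  # doc id -> field value exactly as it was sent
        self.terms = {}   # indexed term -> set of doc ids

    def add(self, doc_id, text):
        # The stored copy never passes through the analysis chain.
        self.stored[doc_id] = text
        # Only the indexed terms are produced by analysis.
        for term in analyze(text):
            self.terms.setdefault(term, set()).add(doc_id)


index = ToyIndex()
index.add(1, "<h3>Bla Bla</h3>")
print(index.stored[1])      # what a search result would return
print(sorted(index.terms))  # what a search actually matches against
```

In this sketch the stored field still contains the `<h3>` tags, while the only indexed term is `bla`.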

So my assumption is that you see data returned in the document with
markup, which is just as it should be, and there's no problem at all. And your
actual indexed terms (try looking at the data with TermsComponent, or
admin/schema browser) will NOT have any markup.
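
As a rough sketch of how one might look at the indexed terms, the Python snippet below only builds a TermsComponent request URL; it does not send it. The host, port and the `/terms` handler path are assumptions (they match the stock example configuration, but your solrconfig.xml may differ), and `terms_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def terms_url(base_url, field, limit=20):
    """Build a TermsComponent URL listing indexed terms for one field.

    Assumes a request handler registered at /terms; adjust the path if
    your solrconfig.xml registers it elsewhere.
    """
    params = {
        "terms": "true",       # enable the TermsComponent
        "terms.fl": field,     # field whose indexed terms to list
        "terms.limit": limit,  # maximum number of terms to return
        "wt": "json",
    }
    return base_url.rstrip("/") + "/terms?" + urlencode(params)


# Assumed local Solr instance; replace with your own base URL.
print(terms_url("http://localhost:8983/solr", "text"))
```

If the terms returned for the field contain no tag fragments such as `h3`, the char filter is doing its job at index time.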

Perhaps you can back up a bit and describe what's failing vs. what you
expect.

Best
Erick

On Mon, Aug 8, 2011 at 6:50 AM, Merlin Morgenstern
<merlin.morgenst...@googlemail.com> wrote:
> Unfortunately I still can't get it running. The code I am using is the
> following:
>                <analyzer type="index">
>                    <charFilter class="solr.HTMLStripCharFilterFactory"/>
>                    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                    <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>                    <filter class="solr.LowerCaseFilterFactory"/>
>                    <filter class="solr.KeywordMarkerFilterFactory"/>
>                    <filter class="solr.PorterStemFilterFactory"/>
>                </analyzer>
>                <analyzer type="query">
>                    <charFilter class="solr.HTMLStripCharFilterFactory"/>
>                    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                    <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>                    <filter class="solr.LowerCaseFilterFactory"/>
>                    <filter class="solr.KeywordMarkerFilterFactory"/>
>                    <filter class="solr.PorterStemFilterFactory"/>
>                </analyzer>
>
> I also tried this one:
>
>    <types>
>         <fieldType name="text" class="solr.TextField"
> positionIncrementGap="100" autoGeneratePhraseQueries="true">
>               <analyzer>
>                <charFilter class="solr.HTMLStripCharFilterFactory"/>
>                  <tokenizer class="solr.StandardTokenizerFactory"/>
>                  <filter class="solr.StandardFilterFactory"/>
>            </analyzer>
>         </fieldType>
>    </types>
>      <field name="text" type="text" indexed="true" stored="true"
> required="false"/>
>
> None of those worked. I restarted Solr after the schema update and reindexed
> the data. No change, the HTML tags are still in there.
>
> Any other ideas? Maybe this is a bug in solr? I am using solr 3.3.0 on suse
> linux.
>
> Thank you for any help on this.
>
>
>
> 2011/7/25 Mike Sokolov <soko...@ifactory.com>
>
>> Hmm that looks like it's working fine.  I stand corrected.
>>
>>
>>
>> On 07/25/2011 12:24 PM, Markus Jelsma wrote:
>>
>>> I've seen that issue too and read comments on the list, yet I've never had
>>> trouble with the order; I don't know what's going on. Check this analyzer,
>>> I've moved the charFilter to the bottom:
>>>
>>> <analyzer type="index">
>>> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>>> catenateAll="0" splitOnCaseChange="1"/>
>>> <filter class="solr.LowerCaseFilterFactory"/>
>>> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>> ignoreCase="false" expand="true"/>
>>> <filter class="solr.StopFilterFactory" ignoreCase="false"
>>> words="stopwords.txt"/>
>>> <filter class="solr.ASCIIFoldingFilterFactory"/>
>>> <filter class="solr.SnowballPorterFilterFactory"
>>> protected="protwords.txt" language="Dutch"/>
>>> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>> <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>> </analyzer>
>>>
>>> The analysis chain still does its job as i expect for the input:
>>> <span>bla bla</span>
>>>
>>> Index Analyzer
>>> org.apache.solr.analysis.HTMLStripCharFilterFactory
>>> {luceneMatchVersion=LUCENE_34}
>>> text    bla bla
>>> org.apache.solr.analysis.WhitespaceTokenizerFactory
>>> {luceneMatchVersion=LUCENE_34}
>>> position        1       2
>>> term text       bla     bla
>>> startOffset     6       10
>>> endOffset       9       13
>>> org.apache.solr.analysis.WordDelimiterFilterFactory
>>> {splitOnCaseChange=1,
>>> generateNumberParts=1, catenateWords=1, luceneMatchVersion=LUCENE_34,
>>> generateWordParts=1, catenateAll=0, catenateNumbers=1}
>>> position        1       2
>>> term text       bla     bla
>>> startOffset     6       10
>>> endOffset       9       13
>>> type    word    word
>>> org.apache.solr.analysis.LowerCaseFilterFactory
>>> {luceneMatchVersion=LUCENE_34}
>>> position        1       2
>>> term text       bla     bla
>>> startOffset     6       10
>>> endOffset       9       13
>>> type    word    word
>>> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
>>> expand=true, ignoreCase=false, luceneMatchVersion=LUCENE_34}
>>> position        1       2
>>> term text       bla     bla
>>> type    word    word
>>> startOffset     6       10
>>> endOffset       9       13
>>> org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
>>> ignoreCase=false, luceneMatchVersion=LUCENE_34}
>>> position        1       2
>>> term text       bla     bla
>>> type    word    word
>>> startOffset     6       10
>>> endOffset       9       13
>>> org.apache.solr.analysis.ASCIIFoldingFilterFactory
>>> {luceneMatchVersion=LUCENE_34}
>>> position        1       2
>>> term text       bla     bla
>>> type    word    word
>>> startOffset     6       10
>>> endOffset       9       13
>>> org.apache.solr.analysis.SnowballPorterFilterFactory
>>> {protected=protwords.txt,
>>> language=Dutch, luceneMatchVersion=LUCENE_34}
>>> position        1       2
>>> term text       bla     bla
>>> keyword         false   false
>>> type    word    word
>>> startOffset     6       10
>>> endOffset       9       13
>>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
>>> {luceneMatchVersion=LUCENE_34}
>>> position        1       2
>>> term text       bla     bla
>>> keyword         false   false
>>> type    word    word
>>> startOffset     6       10
>>> endOffset       9       13
>>>
>>>
>>> On Monday 25 July 2011 18:07:29 Mike Sokolov wrote:
>>>
>>>
>>>> Hmm - I'm not sure about that; see
>>>> https://issues.apache.org/jira/browse/SOLR-2119
>>>>
>>>> On 07/25/2011 12:01 PM, Markus Jelsma wrote:
>>>>
>>>>
>>>>> charFilters are executed first regardless of their position in the
>>>>> analyzer.
>>>>>
>>>>> On Monday 25 July 2011 17:53:59 Mike Sokolov wrote:
>>>>>
>>>>>
>>>>>> I think you need to list the charFilter earlier in the analysis chain,
>>>>>> before the tokenizer.  Probably Solr should tell you this...
>>>>>>
>>>>>> -Mike
>>>>>>
>>>>>> On 07/25/2011 09:03 AM, Merlin Morgenstern wrote:
>>>>>>
>>>>>>
>>>>>>> sounds logical. I just changed it to the following, restarted and
>>>>>>> reindexed with commit:
>>>>>>>
>>>>>>>     <fieldType name="text" class="solr.TextField"
>>>>>>>         positionIncrementGap="100" autoGeneratePhraseQueries="true">
>>>>>>>       <analyzer type="index">
>>>>>>>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>>         <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>>>>>>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>>>>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>         <filter class="solr.KeywordMarkerFilterFactory"/>
>>>>>>>         <filter class="solr.PorterStemFilterFactory"/>
>>>>>>>         <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>>>>>>       </analyzer>
>>>>>>>       <analyzer type="query">
>>>>>>>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>>         <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>             generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>>>>>>             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>>>>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>         <filter class="solr.KeywordMarkerFilterFactory"/>
>>>>>>>         <filter class="solr.PorterStemFilterFactory"/>
>>>>>>>         <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>>>>>>       </analyzer>
>>>>>>>     </fieldType>
>>>>>>>
>>>>>>> Unfortunately that did not fix the error. There are still <h3> tags
>>>>>>> inside the data. I believe there are fewer than before, but I
>>>>>>> cannot prove that. The fact is, there are still HTML tags inside the
>>>>>>> data.
>>>>>>>
>>>>>>> Any other ideas what the problem could be?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2011/7/25 Markus Jelsma <markus.jel...@openindex.io>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> You've three analyzer elements, I wonder what that would do. You need
>>>>>>>> to add the char filter to the index-time analyzer.
>>>>>>>>
>>>>>>>> On Monday 25 July 2011 13:09:14 Merlin Morgenstern wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hi there,
>>>>>>>>>
>>>>>>>>> I am trying to strip HTML tags from the data before adding the
>>>>>>>>> documents to the index. To do that I altered schema.xml like this:
>>>>>>>>>
>>>>>>>>>     <fieldType name="text" class="solr.TextField"
>>>>>>>>>         positionIncrementGap="100" autoGeneratePhraseQueries="true">
>>>>>>>>>       <analyzer type="index">
>>>>>>>>>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>>>>         <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>>>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>>>>>>>>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>>>>>>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>         <filter class="solr.KeywordMarkerFilterFactory"/>
>>>>>>>>>         <filter class="solr.PorterStemFilterFactory"/>
>>>>>>>>>       </analyzer>
>>>>>>>>>       <analyzer type="query">
>>>>>>>>>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>>>>         <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>>>             generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>>>>>>>>             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>>>>>>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>         <filter class="solr.KeywordMarkerFilterFactory"/>
>>>>>>>>>         <filter class="solr.PorterStemFilterFactory"/>
>>>>>>>>>       </analyzer>
>>>>>>>>>       <analyzer>
>>>>>>>>>         <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>>>>>>>>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>>>>       </analyzer>
>>>>>>>>>     </fieldType>
>>>>>>>>>
>>>>>>>>>     <fields>
>>>>>>>>>       <field name="text" type="text" indexed="true" stored="true"
>>>>>>>>>           required="false"/>
>>>>>>>>>     </fields>
>>>>>>>>>
>>>>>>>>> Unfortunately this does not work; the HTML tags like <h3> are still
>>>>>>>>> present after restarting and reindexing. I also tried
>>>>>>>>> HTMLStripTransformer, but this did not work either.
>>>>>>>>>
>>>>>>>>> Has anybody an idea how to get this done? Thank you in advance for
>>>>>>>>> any hint.
>>>>>>>>>
>>>>>>>>> Merlin
>>>>>>>>>
>>>>>>>>>
>>>>>>>> --
>>>>>>>> Markus Jelsma - CTO - Openindex
>>>>>>>> http://www.linkedin.com/in/markus17
>>>>>>>> 050-8536620 / 06-50258350
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>
>>
>
