I am sorry, but I do not really understand the difference between the indexed and the returned result set.
I am looking at the "returned" dataset via this request:
solr/select/?q=id:533563&terms=true
which gives me HTML tags like these: </b><br />

I also tried to turn on TermsComponent, but it did not change anything:
solr/select/?q=id:533563&terms=true

The schema browser does not show any HTML tags inside the text field, just the
indexed words of that one dataset.

Is there a way to strip the HTML tags completely and not index them? If not,
how do I retrieve the results without HTML tags?

Thank you for your help.

2011/8/9 Erick Erickson <erickerick...@gmail.com>

> OK, what does "not working" mean? You never answered Markus' question:
>
> "Are you looking at the returned result set or what you've actually indexed?
> Analyzers are not run on the stored data, only on indexed data."
>
> If "not working" means that your returned results contain the markup, then
> you're confusing indexing and storing. All the analysis chains operate
> on data sent into the indexing process. But the verbatim data is *stored*
> prior to (or separate from) indexing.
>
> So my assumption is that you see data returned in the document with
> markup, which is just as it should be, and there's no problem at all. And
> your actual indexed terms (try looking at the data with TermsComponent, or
> admin/schema browser) will NOT have any markup.
>
> Perhaps you can back up a bit and describe what's failing vs. what you
> expect.
>
> Best
> Erick
>
> On Mon, Aug 8, 2011 at 6:50 AM, Merlin Morgenstern
> <merlin.morgenst...@googlemail.com> wrote:
> > Unfortunately I still can't get it running. The code I am using is the
> > following:
> >
> > <analyzer type="index">
> >   <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >   <filter class="solr.WordDelimiterFilterFactory"
> >           generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >           catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >   <filter class="solr.LowerCaseFilterFactory"/>
> >   <filter class="solr.KeywordMarkerFilterFactory"/>
> >   <filter class="solr.PorterStemFilterFactory"/>
> > </analyzer>
> > <analyzer type="query">
> >   <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >   <filter class="solr.WordDelimiterFilterFactory"
> >           generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >           catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >   <filter class="solr.LowerCaseFilterFactory"/>
> >   <filter class="solr.KeywordMarkerFilterFactory"/>
> >   <filter class="solr.PorterStemFilterFactory"/>
> > </analyzer>
> >
> > I also tried this one:
> >
> > <types>
> >   <fieldType name="text" class="solr.TextField"
> >              positionIncrementGap="100" autoGeneratePhraseQueries="true">
> >     <analyzer>
> >       <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >       <tokenizer class="solr.StandardTokenizerFactory"/>
> >       <filter class="solr.StandardFilterFactory"/>
> >     </analyzer>
> >   </fieldType>
> > </types>
> > <field name="text" type="text" indexed="true" stored="true"
> >        required="false"/>
> >
> > None of those worked. I restarted Solr after the schema update and
> > reindexed the data. No change, the HTML tags are still in there.
> >
> > Any other ideas? Maybe this is a bug in Solr? I am using Solr 3.3.0 on
> > SUSE Linux.
> >
> > Thank you for any help on this.
> >
> > 2011/7/25 Mike Sokolov <soko...@ifactory.com>
> >
> >> Hmm that looks like it's working fine. I stand corrected.
> >>
> >> On 07/25/2011 12:24 PM, Markus Jelsma wrote:
> >>
> >>> I've seen that issue too and read comments on the list, yet I've never
> >>> had trouble with the order; I don't know what's going on. Check this
> >>> analyzer, I've moved the charFilter to the bottom:
> >>>
> >>> <analyzer type="index">
> >>>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>   <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> >>>           generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> >>>           catenateAll="0" splitOnCaseChange="1"/>
> >>>   <filter class="solr.LowerCaseFilterFactory"/>
> >>>   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >>>           ignoreCase="false" expand="true"/>
> >>>   <filter class="solr.StopFilterFactory" ignoreCase="false"
> >>>           words="stopwords.txt"/>
> >>>   <filter class="solr.ASCIIFoldingFilterFactory"/>
> >>>   <filter class="solr.SnowballPorterFilterFactory"
> >>>           protected="protwords.txt" language="Dutch"/>
> >>>   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>>   <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >>> </analyzer>
> >>>
> >>> The analysis chain still does its job as I expect for the input:
> >>> <span>bla bla</span>
> >>>
> >>> Index Analyzer
> >>>
> >>> org.apache.solr.analysis.HTMLStripCharFilterFactory
> >>> {luceneMatchVersion=LUCENE_34}
> >>> text         bla bla
> >>>
> >>> org.apache.solr.analysis.WhitespaceTokenizerFactory
> >>> {luceneMatchVersion=LUCENE_34}
> >>> position     1     2
> >>> term text    bla   bla
> >>> startOffset  6     10
> >>> endOffset    9     13
> >>>
> >>> org.apache.solr.analysis.WordDelimiterFilterFactory
> >>> {splitOnCaseChange=1, generateNumberParts=1, catenateWords=1,
> >>> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> >>> catenateNumbers=1}
> >>> position     1     2
> >>> term text    bla   bla
> >>> startOffset  6     10
> >>> endOffset    9     13
> >>> type         word  word
> >>>
> >>> org.apache.solr.analysis.LowerCaseFilterFactory
> >>> {luceneMatchVersion=LUCENE_34}
> >>> position     1     2
> >>> term text    bla   bla
> >>> startOffset  6     10
> >>> endOffset    9     13
> >>> type         word  word
> >>>
> >>> org.apache.solr.analysis.SynonymFilterFactory
> >>> {synonyms=synonyms.txt, expand=true, ignoreCase=false,
> >>> luceneMatchVersion=LUCENE_34}
> >>> position     1     2
> >>> term text    bla   bla
> >>> type         word  word
> >>> startOffset  6     10
> >>> endOffset    9     13
> >>>
> >>> org.apache.solr.analysis.StopFilterFactory
> >>> {words=stopwords.txt, ignoreCase=false, luceneMatchVersion=LUCENE_34}
> >>> position     1     2
> >>> term text    bla   bla
> >>> type         word  word
> >>> startOffset  6     10
> >>> endOffset    9     13
> >>>
> >>> org.apache.solr.analysis.ASCIIFoldingFilterFactory
> >>> {luceneMatchVersion=LUCENE_34}
> >>> position     1     2
> >>> term text    bla   bla
> >>> type         word  word
> >>> startOffset  6     10
> >>> endOffset    9     13
> >>>
> >>> org.apache.solr.analysis.SnowballPorterFilterFactory
> >>> {protected=protwords.txt, language=Dutch, luceneMatchVersion=LUCENE_34}
> >>> position     1     2
> >>> term text    bla   bla
> >>> keyword      false false
> >>> type         word  word
> >>> startOffset  6     10
> >>> endOffset    9     13
> >>>
> >>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
> >>> {luceneMatchVersion=LUCENE_34}
> >>> position     1     2
> >>> term text    bla   bla
> >>> keyword      false false
> >>> type         word  word
> >>> startOffset  6     10
> >>> endOffset    9     13
> >>>
> >>> On Monday 25 July 2011 18:07:29 Mike Sokolov wrote:
> >>>
> >>>> Hmm - I'm not sure about that; see
> >>>> https://issues.apache.org/jira/browse/SOLR-2119
> >>>>
> >>>> On 07/25/2011 12:01 PM, Markus Jelsma wrote:
> >>>>
> >>>>> charFilters are executed first regardless of their position in the
> >>>>> analyzer.
> >>>>>
> >>>>> On Monday 25 July 2011 17:53:59 Mike Sokolov wrote:
> >>>>>
> >>>>>> I think you need to list the charFilter earlier in the analysis
> >>>>>> chain, before the tokenizer. Probably Solr should tell you this...
> >>>>>>
> >>>>>> -Mike
> >>>>>>
> >>>>>> On 07/25/2011 09:03 AM, Merlin Morgenstern wrote:
> >>>>>>
> >>>>>>> Sounds logical. I just changed it to the following, restarted and
> >>>>>>> reindexed with commit:
> >>>>>>>
> >>>>>>> <fieldType name="text" class="solr.TextField"
> >>>>>>>            positionIncrementGap="100" autoGeneratePhraseQueries="true">
> >>>>>>>   <analyzer type="index">
> >>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>>>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>>>>>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >>>>>>>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>     <filter class="solr.KeywordMarkerFilterFactory"/>
> >>>>>>>     <filter class="solr.PorterStemFilterFactory"/>
> >>>>>>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >>>>>>>   </analyzer>
> >>>>>>>   <analyzer type="query">
> >>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>>>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>>>>>             generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >>>>>>>             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>     <filter class="solr.KeywordMarkerFilterFactory"/>
> >>>>>>>     <filter class="solr.PorterStemFilterFactory"/>
> >>>>>>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >>>>>>>   </analyzer>
> >>>>>>> </fieldType>
> >>>>>>>
> >>>>>>> Unfortunately that did not fix the error. There are still <h3> tags
> >>>>>>> inside the data. I believe there are fewer than before, but I
> >>>>>>> cannot prove that. Fact is, there are still HTML tags inside the
> >>>>>>> data.
> >>>>>>>
> >>>>>>> Any other ideas what the problem could be?
> >>>>>>>
> >>>>>>> 2011/7/25 Markus Jelsma <markus.jel...@openindex.io>
> >>>>>>>
> >>>>>>>> You've three analyzer elements, I wonder what that would do. You
> >>>>>>>> need to add the char filter to the index-time analyzer.
> >>>>>>>>
> >>>>>>>> On Monday 25 July 2011 13:09:14 Merlin Morgenstern wrote:
> >>>>>>>>
> >>>>>>>>> Hi there,
> >>>>>>>>>
> >>>>>>>>> I am trying to strip HTML tags from the data before adding the
> >>>>>>>>> documents to the index. To do that I altered schema.xml like this:
> >>>>>>>>>
> >>>>>>>>> <fieldType name="text" class="solr.TextField"
> >>>>>>>>>            positionIncrementGap="100" autoGeneratePhraseQueries="true">
> >>>>>>>>>   <analyzer type="index">
> >>>>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>>>>>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>>>>>>>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >>>>>>>>>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>     <filter class="solr.KeywordMarkerFilterFactory"/>
> >>>>>>>>>     <filter class="solr.PorterStemFilterFactory"/>
> >>>>>>>>>   </analyzer>
> >>>>>>>>>   <analyzer type="query">
> >>>>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>>>>>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>>>>>>>             generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >>>>>>>>>             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>     <filter class="solr.KeywordMarkerFilterFactory"/>
> >>>>>>>>>     <filter class="solr.PorterStemFilterFactory"/>
> >>>>>>>>>   </analyzer>
> >>>>>>>>>   <analyzer>
> >>>>>>>>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >>>>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>>>>>>   </analyzer>
> >>>>>>>>> </fieldType>
> >>>>>>>>>
> >>>>>>>>> <fields>
> >>>>>>>>>   <field name="text" type="text" indexed="true" stored="true"
> >>>>>>>>>          required="false"/>
> >>>>>>>>> </fields>
> >>>>>>>>>
> >>>>>>>>> Unfortunately this does not work; HTML tags like <h3> are still
> >>>>>>>>> present after restarting and reindexing. I also tried
> >>>>>>>>> HTMLStripTransformer, but this did not work either.
> >>>>>>>>>
> >>>>>>>>> Does anybody have an idea how to get this done? Thank you in
> >>>>>>>>> advance for any hint.
> >>>>>>>>>
> >>>>>>>>> Merlin
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Markus Jelsma - CTO - Openindex
> >>>>>>>> http://www.linkedin.com/in/markus17
> >>>>>>>> 050-8536620 / 06-50258350
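
On the question at the top of the thread (getting values back without markup, not just markup-free terms): analyzers, including HTMLStripCharFilterFactory, only affect the indexed terms, so the stored value keeps whatever was sent in. The markup therefore has to be removed before the document reaches Solr. Below is a minimal sketch of one way to do that, assuming the data is loaded through the DataImportHandler. The data source, table, and column names are hypothetical and not taken from the thread; only the HTMLStripTransformer name and its stripHTML attribute come from the DIH itself.

```xml
<!-- data-config.xml for the DataImportHandler.
     ASSUMPTION: the JDBC URL, credentials, table "articles" and column "body"
     are placeholders for illustration only. -->
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/exampledb"
              user="solr" password="secret"/>
  <document>
    <!-- The transformer runs on each row before it is handed to Solr,
         so both the stored value and the indexed terms come out markup-free. -->
    <entity name="article" transformer="HTMLStripTransformer"
            query="SELECT id, body FROM articles">
      <field column="id" name="id"/>
      <!-- The transformer only touches columns flagged with stripHTML="true". -->
      <field column="body" name="text" stripHTML="true"/>
    </entity>
  </document>
</dataConfig>
```

If the transformer is declared on the entity but the column lacks stripHTML="true", the value passes through unchanged, which may be worth checking if an earlier attempt appeared to have no effect. When documents are posted by a client instead of the DIH, the equivalent would be to strip the markup in the client before sending.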