Hi! I'm facing a similar problem. Some HTML docs are indexed correctly and others are simply rejected, even though I encoded all the problematic HTML tags as Thorsten suggested.
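For reference, the kind of encoding I mean can be sketched as follows. This is only an illustration in Python: `encode_field_value` and `make_add_doc` are made-up helper names, the sample HTML is invented, and the field name "storyFullText" is borrowed from the quoted thread.

```python
from xml.sax.saxutils import escape

# Hypothetical helper: escape &, <, > and double quotes so that raw HTML
# can be placed inside an XML element without breaking the document.
def encode_field_value(html):
    return escape(html, {'"': "&quot;"})

# Hypothetical helper: build a minimal <add><doc> payload with one field.
# Real documents would carry every field required by the schema.
def make_add_doc(field_name, html):
    return '<add><doc><field name="%s">%s</field></doc></add>' % (
        escape(field_name, {'"': "&quot;"}),
        encode_field_value(html),
    )

snippet = '<div class="paragraph">Tom & Jerry<br/></div>'
print(encode_field_value(snippet))
print(make_add_doc("storyFullText", snippet))
```

An XML parser reading the payload gets the original HTML string back as the text of the field, which is the point of encoding instead of embedding the tags directly.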
In the following example, "my_doc.xml" is a valid XML file that complies with the fields of my Solr schema:

$ java -jar post.jar ./my_doc.xml
SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file solrdoc
SimplePostTool: FATAL: Connection error (is Solr running at http://localhost:8983/solr/update ?): java.io.IOException: Server returned HTTP response code: 500 for URL: http://localhost:8983/solr/update

Is there any way to make Solr more verbose than that? Do I need to go into the Java code to understand what happens? I'm looking for a simple solution.

Thanks in advance,
cheers
Y.

----Original message----
>From: "[EMAIL PROTECTED]"
>Subject: Re: Problem with html code inside xml
>Date: Tue, 2 Oct 2007 16:15:26 +0200
>To: solr-user@lucene.apache.org
>
>Thanks
>
>I use this solution:
>
>put <![CDATA[ Here my html code ]]> in the xml to be indexed and
>it works, nothing to change in the xsl.
>
>In the schema I use this fieldType
>
><fieldType name="html" class="solr.TextField" positionIncrementGap="100">
>  <analyzer>
>    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>    <filter class="solr.LowerCaseFilterFactory"/>
>    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>    <filter class="solr.ISOLatin1AccentFilterFactory"/>
>    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>  </analyzer>
></fieldType>
>
>----------
>Now question:
>I created a field to index only the text for this html code.
>
>I created a field type:
>
><fieldType name="htmlTxt" class="solr.TextField" positionIncrementGap="100">
>  <analyzer>
>    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
>    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>    <filter class="solr.LowerCaseFilterFactory"/>
>    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>    <filter class="solr.ISOLatin1AccentFilterFactory"/>
>    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>  </analyzer>
></fieldType>
>
>Everything works (the div and p tags are removed), but some
><strong>nnn</strong> or <br/> tags are still in the text after
>indexing.
>
>If you've got any idea how to solve this problem, it would be great.
>
>Thanks
>
>S. Christin
>
>-------------
>
>On 25 Sept. 07 at 13:14, Thorsten Scherler wrote:
>
>> On Tue, 2007-09-25 at 12:06 +0100, Jérôme Etévé wrote:
>>> If I understand, you want to keep the raw html code in solr like that
>>> (in your posting xml file):
>>>
>>> <field name="storyFullText">
>>>   <html></html>
>>> </field>
>>>
>>> I think you should encode your content to protect these xml entities:
>>> < -> &lt;
>>> > -> &gt;
>>> " -> &quot;
>>> & -> &amp;
>>>
>>> If you use perl, have a look at HTML::Entities.
>>
>> AFAIR you cannot use tags, they always are getting transformed to
>> entities. The solution is to have a xsl transformation after the
>> response that transforms the entities back to tags.
>>
>> Have a look at the thread
>> http://marc.info/?t=116775837900001&r=1&w=2
>> and especially at
>> http://marc.info/?l=solr-user&m=116782664828926&w=2
>>
>> HTH
>>
>> salu2
>>
>>>
>>> On 9/25/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>>>> Hello,
>>>>
>>>> I've got a problem with HTML code that is embedded in an XML file:
>>>>
>>>> Sample source:
>>>>
>>>> <content>
>>>>   <stories>
>>>>     <div class="storyTitle">
>>>>       Les débats
>>>>     </div>
>>>>     <div class="storyIntroductionText">
>>>>       Le premier tour des élections fédérales
>>>>       se déroulera le 21
>>>>       octobre prochain. D'ici là, La 1ère vous propose plusieurs
>>>>       rendez-vous, dont plusieurs grands débats à l'enseigne de Forums.
>>>>     </div>
>>>>     <div class="paragraph">
>>>>       <div class="paragraphTitle"/>
>>>>       <div class="paragraphText">
>>>>         my para textehere
>>>>         <br/>
>>>>         <br/>
>>>>         Vous trouverez sur cette page
>>>>         toutes les dates et les heures de
>>>>         ces différents rendez-vous ainsi que le nom et les partis des
>>>>         débatteurs. De plus, vous pourrez également écouter ou
>>>>         réécouter
>>>>         l'ensemble de ces émissions.
>>>>       </div>
>>>>     </div>
>>>> ....
>>>> ---------
>>>> When I make a query on Solr, I get something like this in the
>>>> source code of the XML result:
>>>>
>>>> <td xmlns="http://www.w3.org/1999/xhtml">
>>>> <span class="markup">&lt;</span>
>>>> <span class="start-tag">div</span>
>>>> <span class="attribute-name">class</span>
>>>> <span class="markup">=</span>
>>>> <span class="attribute-value">"paragraph"</span>
>>>> <span class="markup">&gt;</span><div class="expander-content">
>>>> <div class="indent"><span class="markup">&lt;</span>
>>>> <span class="start-tag">div</span>
>>>> <span class="attribute-name">class</span>
>>>> <span class="markup">=</span>
>>>> <span class="attribute-value">"paragraphTitle"</span>
>>>> <span class="markup">/&gt;</span></div><table><tr>
>>>> <td class="expander">−<div class="spacer"/>
>>>> </td><td><span class="markup">&lt;</span>
>>>> ...
>>>>
>>>> This is not exactly what I want. I want to keep the HTML tags,
>>>> that's all, without the formatting.
>>>>
>>>> So the br tags and a tags are well formed in the XML and JSON
>>>> results, but the div tags are not kept.
>>>> ---------
>>>> In the schema.xml I've got this for the html content
>>>>
>>>> <fieldType name="html" class="solr.TextField" />
>>>>
>>>> <field name="storyFullText" type="html" indexed="true" stored="true" multiValued="true"/>
>>>>
>>>> ---------
>>>>
>>>> Any help would be appreciated.
>>>>
>>>> Thanks in advance.
>>>>
>>>> S. Christin
>>>>
>>>
>>
>> --
>> Thorsten Scherler
>> thorsten.at.apache.org
>> Open Source Java consulting, training and
>> solutions
>>
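
PS: Coming back to the HTTP 500 at the top of this message. Judging by its output, post.jar only reports the status code, so one thing that might help is posting the file with a client that also prints the response body. Below is a minimal sketch in Python rather than Java. `post_xml` is a made-up helper name, and it is only an assumption that the server puts a useful explanation in the body of the error response.

```python
import urllib.error
import urllib.request

# Hypothetical helper: POST an XML payload and return (status, body),
# including the body of an error response such as a 500.
def post_xml(url, xml_bytes):
    request = urllib.request.Request(
        url,
        data=xml_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as error:
        # HTTPError is also a response object: its body can be read here
        # instead of being dropped along with the exception.
        return error.code, error.read().decode("utf-8", "replace")
```

Calling `post_xml("http://localhost:8983/solr/update", open("my_doc.xml", "rb").read())` would then return the 500 together with whatever body the server sent back. If that body turns out to be empty, the console or log of the servlet container running Solr is the other place I would look.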