FYI: There's a new patch specifically for dealing with XML tags and entities that handles the CDATA case...
https://issues.apache.org/jira/browse/SOLR-2597

: Date: Fri, 27 May 2011 17:01:26 +0800
: From: Ellery Leung <elleryle...@be-o.com>
: Reply-To: solr-user@lucene.apache.org, elleryle...@be-o.com
: To: solr-user@lucene.apache.org
: Subject: RE: HTMLStripTransformer will remove the content in XML??
:
: Got it. Actually I use solr.MappingCharFilterFactory to replace the <![CDATA[ and ]]> to empty first, and use HTMLStripCharFilterFactory to get "hello" and "solr".
:
: For future reference, here is part of schema.xml
:
: <fieldType name="textMaxWord" class="solr.TextField" >
:   <analyzer type="index">
:     <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
:     <charFilter class="solr.HTMLStripCharFilterFactory" />
:     ...
:
: In mappings.txt (2 lines)
:
: "<![CDATA[" => ""
:
: "]]>" => ""
:
: Restart Solr
:
: It works.
:
: Thank you
:
: -----Original Message-----
: From: bryan rasmussen [mailto:rasmussen.br...@gmail.com]
: Sent: May 27, 2011 4:20 PM
: To: solr-user@lucene.apache.org; elleryle...@be-o.com
: Subject: Re: HTMLStripTransformer will remove the content in XML??
:
: I would expect that it doesn't understand CDATA and thinks of
: everything between < and > as a 'tag'.
:
: Best Regards,
: Bryan Rasmussen
:
: On Fri, May 27, 2011 at 9:41 AM, Ellery Leung <elleryle...@be-o.com> wrote:
: > I have an XML string like this:
: >
: > <?xml version="1.0"
: > encoding="UTF-8"?><language><intl><![CDATA[hello]]></intl><loc><![CDATA[solr
: > ]]></loc></language>
: >
: > By using HTMLStripTransformer, I expect to get 'hello,solr'.
: >
: > But actual this transformer will remove ALL THE TEXT INSIDE!
: >
: > Did I do something silly, or is it a bug?
: >
: > Thank you
: >


-Hoss
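
For anyone reading this in the archive, here is a rough Python sketch of the two-step workaround quoted above. It is only an analogy and not how Solr's char filters are implemented. The names CDATA_MAPPINGS, apply_mappings, strip_markup and extract_text are made up for illustration, and the regex treats everything between < and > as a tag, which is the behaviour Bryan guessed at.

```python
import re

# Hypothetical stand-in for the two lines in mappings.txt.
CDATA_MAPPINGS = {"<![CDATA[": "", "]]>": ""}


def apply_mappings(text, mappings=CDATA_MAPPINGS):
    """Replace each mapped string, loosely mimicking a mapping char filter."""
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text


def strip_markup(text):
    """Naive stripper: anything between '<' and the next '>' counts as a tag."""
    return re.sub(r"<[^>]*>", " ", text)


def extract_text(xml):
    """Drop the CDATA markers first, then strip the remaining tags."""
    return strip_markup(apply_mappings(xml)).split()


sample = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<language><intl><![CDATA[hello]]></intl>"
    "<loc><![CDATA[solr ]]></loc></language>"
)

# Stripping alone swallows each CDATA section as if it were one big tag.
print(strip_markup(sample).split())
# Removing the CDATA markers first leaves the text for the stripper to keep.
print(extract_text(sample))
```

With the sample string from the original question, the first print shows an empty list and the second shows the two words, which matches the order of the charFilter entries in the quoted schema.xml.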