Got it. I actually use solr.MappingCharFilterFactory first to replace <![CDATA[ and ]]> with empty strings, and then solr.HTMLStripCharFilterFactory to get "hello" and "solr".
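
To illustrate why the order matters, here is a rough Python sketch of the two steps. It is only a toy model and not Solr's actual code: the function names are made up for this example, and the regex stripper simply treats everything between < and > as a tag, which is the behaviour guessed at in this thread.

```python
import re

# Hypothetical stand-in for the two mapping rules: each CDATA marker maps to "".
CDATA_MAPPINGS = {"<![CDATA[": "", "]]>": ""}


def apply_mappings(text, mappings=CDATA_MAPPINGS):
    """Replace each mapped substring (toy model of a mapping char filter)."""
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text


def strip_tags(text):
    """Crude tag stripper: replace anything between < and > with a space."""
    return re.sub(r"<[^>]*>", " ", text)


def extract_words(xml):
    """Remove the CDATA markers first, then strip the remaining tags."""
    return strip_tags(apply_mappings(xml)).split()


sample = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<language><intl><![CDATA[hello]]></intl>"
    "<loc><![CDATA[solr]]></loc></language>"
)

# Stripping alone swallows each CDATA section as if it were one big tag.
print(strip_tags(sample).split())
# Removing the markers first leaves the text for the stripper to keep.
print(extract_words(sample))
```

In this toy model, stripping alone leaves no words at all, because each CDATA section starts with < and ends with >, while removing the markers first leaves "hello" and "solr".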
For future reference, here is part of schema.xml:

<fieldType name="textMaxWord" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    ...

In mappings.txt (2 lines):

"<![CDATA[" => ""
"]]>" => ""

Restart Solr, and it works.

Thank you

-----Original Message-----
From: bryan rasmussen [mailto:rasmussen.br...@gmail.com]
Sent: May 27, 2011 4:20 PM
To: solr-user@lucene.apache.org; elleryle...@be-o.com
Subject: Re: HTMLStripTransformer will remove the content in XML??

I would expect that it doesn't understand CDATA and thinks of everything
between < and > as a 'tag'.

Best Regards,
Bryan Rasmussen

On Fri, May 27, 2011 at 9:41 AM, Ellery Leung <elleryle...@be-o.com> wrote:
> I have an XML string like this:
>
> <?xml version="1.0"
> encoding="UTF-8"?><language><intl><![CDATA[hello]]></intl><loc><![CDATA[solr
> ]]></loc></language>
>
> By using HTMLStripTransformer, I expect to get 'hello,solr'.
>
> But actually this transformer will remove ALL THE TEXT INSIDE!
>
> Did I do something silly, or is it a bug?
>
> Thank you