Got it. I now use solr.MappingCharFilterFactory to replace <![CDATA[ and
]]> with empty strings first, and then HTMLStripCharFilterFactory to get
"hello" and "solr".

For future reference, here is the relevant part of schema.xml:

<fieldType name="textMaxWord" class="solr.TextField">
        <analyzer type="index">
                <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
                <charFilter class="solr.HTMLStripCharFilterFactory"/>
...

In mappings.txt (2 lines):

"<![CDATA[" => ""
"]]>" => ""

Then restart Solr.

It works.
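
To show why the order of the two char filters matters, here is a rough
Python sketch. It is not Solr code: the regex is only a naive stand-in for
a stripper that treats everything between < and > as a tag, and the names
strip_tags and apply_mappings are made up for this illustration.

```python
import re

# Naive stand-in for an HTML stripper (an assumption, not Solr's real
# implementation): anything between '<' and the next '>' counts as a tag.
TAG = re.compile(r"<[^>]*>")

# The same two rules as in mappings.txt above.
MAPPINGS = {"<![CDATA[": "", "]]>": ""}


def apply_mappings(text):
    """Replace each mapped literal string, like the mapping char filter step."""
    for source, target in MAPPINGS.items():
        text = text.replace(source, target)
    return text


def strip_tags(text):
    """Replace every '<...>' run with a space."""
    return TAG.sub(" ", text)


xml = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<language><intl><![CDATA[hello]]></intl>"
    "<loc><![CDATA[solr]]></loc></language>"
)

# Stripping alone: each CDATA section starts with '<' and ends with '>',
# so it is removed together with its text.
print(strip_tags(xml).split())                  # -> []

# Removing the CDATA markers first leaves the text between ordinary tags.
print(strip_tags(apply_mappings(xml)).split())  # -> ['hello', 'solr']
```

In this sketch the first call loses all the text and the second one keeps
both words, which matches what I saw before and after adding the mapping
char filter.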

Thank you

-----Original Message-----
From: bryan rasmussen [mailto:rasmussen.br...@gmail.com] 
Sent: May 27, 2011 4:20 PM
To: solr-user@lucene.apache.org; elleryle...@be-o.com
Subject: Re: HTMLStripTransformer will remove the content in XML??

I would expect that it doesn't understand CDATA and thinks of
everything between < and > as a 'tag'.

Best Regards,
Bryan Rasmussen

On Fri, May 27, 2011 at 9:41 AM, Ellery Leung <elleryle...@be-o.com> wrote:
> I have an XML string like this:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <language><intl><![CDATA[hello]]></intl><loc><![CDATA[solr]]></loc></language>
>
> By using HTMLStripTransformer, I expect to get 'hello,solr'.
>
> But actually this transformer removes ALL THE TEXT INSIDE!
>
> Did I do something silly, or is it a bug?
>
> Thank you
