RE: HTMLStripTransformer will remove the content in XML??

Chris Hostetter Thu, 16 Jun 2011 12:07:27 -0700

FYI: There's a new patch specifically for dealing with XML tags and entities 
that handles the CDATA case...


https://issues.apache.org/jira/browse/SOLR-2597
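
For anyone skimming the thread quoted below, here is a rough sketch of the 
behaviour Bryan describes and of Ellery's two-step workaround. It is plain 
Python, not Solr code: the regex and the function names (naive_strip, 
apply_mappings) are assumptions made up for illustration. They stand in for 
"a stripper that treats everything between < and > as a tag" and for the two 
rules in mappings.txt. It does not reproduce the real 
HTMLStripCharFilterFactory or the SOLR-2597 patch.

```python
import re

# Hypothetical stand-in for a stripper that treats everything between
# '<' and '>' as a tag. This is NOT Solr's actual implementation.
NAIVE_TAG = re.compile(r"<[^>]*>")

# The same two rules as the mappings.txt quoted in the thread.
CDATA_MAPPINGS = {"<![CDATA[": "", "]]>": ""}


def naive_strip(text: str) -> str:
    """Replace every '<...>' run with a space."""
    return NAIVE_TAG.sub(" ", text)


def apply_mappings(text: str) -> str:
    """Remove the CDATA markers, leaving their content in place."""
    for source, target in CDATA_MAPPINGS.items():
        text = text.replace(source, target)
    return text


doc = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<language><intl><![CDATA[hello]]></intl>"
    "<loc><![CDATA[solr]]></loc></language>"
)

# Stripping directly: '<![CDATA[hello]]>' looks like one big tag, so the
# text inside it is swallowed along with the markup.
print(naive_strip(doc).split())                  # []

# Removing the CDATA markers first leaves ordinary element content.
print(naive_strip(apply_mappings(doc)).split())  # ['hello', 'solr']
```

The order matters in the same way it does in the charFilter chain: the 
mapping step has to run before the stripping step, otherwise there is nothing 
left for it to rescue.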

: Date: Fri, 27 May 2011 17:01:26 +0800
: From: Ellery Leung <elleryle...@be-o.com>
: Reply-To: solr-user@lucene.apache.org, elleryle...@be-o.com
: To: solr-user@lucene.apache.org
: Subject: RE: HTMLStripTransformer will remove the content in XML??
: 
: Got it.  Actually, I use solr.MappingCharFilterFactory to replace 
: <![CDATA[ and ]]> with empty strings first, and then use 
: HTMLStripCharFilterFactory to get "hello" and "solr".
: 
: For future reference, here is part of schema.xml
: 
: <fieldType name="textMaxWord" class="solr.TextField" >
:       <analyzer type="index">
:               <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
:               <charFilter class="solr.HTMLStripCharFilterFactory" />
: ...
: 
: In mappings.txt (2 lines)
: 
: "<![CDATA[" => ""
: 
: "]]>" => ""
: 
: Restart Solr
: 
: It works.
: 
: Thank you
: 
: -----Original Message-----
: From: bryan rasmussen [mailto:rasmussen.br...@gmail.com] 
: Sent: May 27, 2011 4:20 PM
: To: solr-user@lucene.apache.org; elleryle...@be-o.com
: Subject: Re: HTMLStripTransformer will remove the content in XML??
: 
: I would expect that it doesn't understand CDATA and thinks of
: everything between < and > as a 'tag'.
: 
: Best Regards,
: Bryan Rasmussen
: 
: On Fri, May 27, 2011 at 9:41 AM, Ellery Leung <elleryle...@be-o.com> wrote:
: > I have an XML string like this:
: >
: >
: >
: > <?xml version="1.0"
: > encoding="UTF-8"?><language><intl><![CDATA[hello]]></intl><loc><![CDATA[solr
: > ]]></loc></language>
: >
: >
: >
: > By using HTMLStripTransformer, I expect to get 'hello,solr'.
: >
: >
: >
: > But actually this transformer removes ALL THE TEXT INSIDE!
: >
: >
: >
: > Did I do something silly, or is it a bug?
: >
: >
: >
: > Thank you
: >
: >
: 
: 

-Hoss
