RE: HTMLStripTransformer will remove the content in XML??

2011-06-16 Thread Chris Hostetter

FYI: There's a new patch specifically for dealing with XML tags and entities 
that handles the CDATA case...

https://issues.apache.org/jira/browse/SOLR-2597

: Date: Fri, 27 May 2011 17:01:26 +0800
: From: Ellery Leung elleryle...@be-o.com
: Reply-To: solr-user@lucene.apache.org, elleryle...@be-o.com
: To: solr-user@lucene.apache.org
: Subject: RE: HTMLStripTransformer will remove the content in XML??
: 
: Got it.  Actually I use solr.MappingCharFilterFactory to replace the 
: <![CDATA[ and ]]> with an empty string first, and then use 
: HTMLStripCharFilterFactory to get hello and solr.
: 
: For future reference, here is part of schema.xml
: 
: <fieldType name="textMaxWord" class="solr.TextField">
:   <analyzer type="index">
:     <charFilter class="solr.MappingCharFilterFactory"
:                 mapping="mappings.txt"/>
:     <charFilter class="solr.HTMLStripCharFilterFactory"/>
: ...
: 
: In mappings.txt (2 lines)
: 
: "<![CDATA[" => ""
: 
: "]]>" => ""
: 
: Restart Solr
: 
: It works.
: 
: Thank you
: 
: -----Original Message-----
: From: bryan rasmussen [mailto:rasmussen.br...@gmail.com] 
: Sent: May 27, 2011 4:20 PM
: To: solr-user@lucene.apache.org; elleryle...@be-o.com
: Subject: Re: HTMLStripTransformer will remove the content in XML??
: 
: I would expect that it doesn't understand CDATA and thinks of
: everything between < and > as a 'tag'.
: 
: Best Regards,
: Bryan Rasmussen
: 
: On Fri, May 27, 2011 at 9:41 AM, Ellery Leung elleryle...@be-o.com wrote:
: > I have an XML string like this:
: >
: > <?xml version="1.0" encoding="UTF-8"?>
: > <language><intl><![CDATA[hello]]></intl><loc><![CDATA[solr]]></loc></language>
: >
: > By using HTMLStripTransformer, I expect to get 'hello,solr'.
: >
: > But actually this transformer will remove ALL THE TEXT INSIDE!
: >
: > Did I do something silly, or is it a bug?
: >
: > Thank you

-Hoss

HTMLStripTransformer will remove the content in XML??

2011-05-27 Thread Ellery Leung
I have an XML string like this:

<?xml version="1.0" encoding="UTF-8"?>
<language><intl><![CDATA[hello]]></intl><loc><![CDATA[solr]]></loc></language>

By using HTMLStripTransformer, I expect to get 'hello,solr'.

But actually this transformer will remove ALL THE TEXT INSIDE!

Did I do something silly, or is it a bug?

Thank you



Re: HTMLStripTransformer will remove the content in XML??

2011-05-27 Thread bryan rasmussen
I would expect that it doesn't understand CDATA and thinks of
everything between < and > as a 'tag'.

Best Regards,
Bryan Rasmussen

On Fri, May 27, 2011 at 9:41 AM, Ellery Leung elleryle...@be-o.com wrote:
> I have an XML string like this:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <language><intl><![CDATA[hello]]></intl><loc><![CDATA[solr]]></loc></language>
>
> By using HTMLStripTransformer, I expect to get 'hello,solr'.
>
> But actually this transformer will remove ALL THE TEXT INSIDE!
>
> Did I do something silly, or is it a bug?
>
> Thank you
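
To illustrate the guess in the reply above, here is a minimal Python sketch of a stripper that does not understand CDATA. It is not the actual HTMLStripTransformer code; the regex and the names naive_strip and SAMPLE are assumptions made up for this illustration. It only shows that if everything from "<" to the next ">" counts as a tag, the CDATA payload is dropped along with the markup.

```python
import re

# Hypothetical stand-in for a stripper that is unaware of CDATA:
# anything from "<" up to the next ">" is treated as a tag and removed.
NAIVE_TAG = re.compile(r"<[^>]*>")


def naive_strip(text: str) -> str:
    """Remove every '<...>' span, with no special handling for CDATA."""
    return NAIVE_TAG.sub("", text)


# The XML string from the original question.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<language><intl><![CDATA[hello]]></intl>"
    "<loc><![CDATA[solr]]></loc></language>"
)

# "<![CDATA[hello]]>" starts with "<" and ends with ">", so the whole
# section, including "hello", matches as a single "tag".
print(repr(naive_strip(SAMPLE)))  # prints ''
```

Under that assumption, nothing survives: every character of the sample sits inside some "<...>" span.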




RE: HTMLStripTransformer will remove the content in XML??

2011-05-27 Thread Ellery Leung
Got it.  Actually I use solr.MappingCharFilterFactory to replace the <![CDATA[ 
and ]]> with an empty string first, and then use HTMLStripCharFilterFactory to 
get hello and solr.

For future reference, here is part of schema.xml

<fieldType name="textMaxWord" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mappings.txt"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
...

In mappings.txt (2 lines)

"<![CDATA[" => ""

"]]>" => ""

Restart Solr

It works.

Thank you

-----Original Message-----
From: bryan rasmussen [mailto:rasmussen.br...@gmail.com] 
Sent: May 27, 2011 4:20 PM
To: solr-user@lucene.apache.org; elleryle...@be-o.com
Subject: Re: HTMLStripTransformer will remove the content in XML??

I would expect that it doesn't understand CDATA and thinks of
everything between < and > as a 'tag'.

Best Regards,
Bryan Rasmussen

On Fri, May 27, 2011 at 9:41 AM, Ellery Leung elleryle...@be-o.com wrote:
> I have an XML string like this:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <language><intl><![CDATA[hello]]></intl><loc><![CDATA[solr]]></loc></language>
>
> By using HTMLStripTransformer, I expect to get 'hello,solr'.
>
> But actually this transformer will remove ALL THE TEXT INSIDE!
>
> Did I do something silly, or is it a bug?
>
> Thank you
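
The workaround in this message (map the CDATA delimiters to nothing, then strip tags) can be sketched outside Solr in a few lines of Python. This is only a rough model under stated assumptions: MAPPINGS mirrors the two mappings.txt rules, the regex is a simplified stand-in for HTMLStripCharFilterFactory, and the names apply_mappings, strip_tags and analyze are hypothetical. Solr's real char filters and tokenizer behave differently in detail.

```python
import re

# Assumed equivalent of the two mappings.txt rules: (source, target) pairs.
MAPPINGS = [("<![CDATA[", ""), ("]]>", "")]

# Simplified stand-in for a tag stripper: drop every "<...>" span.
TAG = re.compile(r"<[^>]*>")


def apply_mappings(text: str) -> str:
    """Step 1: remove the CDATA delimiters, keeping their payload."""
    for source, target in MAPPINGS:
        text = text.replace(source, target)
    return text


def strip_tags(text: str) -> str:
    """Step 2: replace each tag with a space so neighbouring words stay apart."""
    return TAG.sub(" ", text)


def analyze(text: str) -> list[str]:
    """Run both steps in order, then split on whitespace."""
    return strip_tags(apply_mappings(text)).split()


SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<language><intl><![CDATA[hello]]></intl>"
    "<loc><![CDATA[solr]]></loc></language>"
)

print(analyze(SAMPLE))  # prints ['hello', 'solr']
```

The order matters: once the delimiters are gone, "hello" and "solr" are ordinary text between tags, so the stripping step leaves them in place.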