RE: HTMLStripTransformer will remove the content in XML??
FYI: There's a new patch specifically for dealing with XML tags and entities that handles the CDATA case:

https://issues.apache.org/jira/browse/SOLR-2597

: Date: Fri, 27 May 2011 17:01:26 +0800
: From: Ellery Leung elleryle...@be-o.com
: Reply-To: solr-user@lucene.apache.org, elleryle...@be-o.com
: To: solr-user@lucene.apache.org
: Subject: RE: HTMLStripTransformer will remove the content in XML??
:
: Got it. Actually, I use solr.MappingCharFilterFactory to replace
: <![CDATA[ and ]]> with empty strings first, and then use
: HTMLStripCharFilterFactory to get hello and solr.
:
: For future reference, here is part of schema.xml:
:
:   <fieldType name="textMaxWord" class="solr.TextField">
:     <analyzer type="index">
:       <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
:       <charFilter class="solr.HTMLStripCharFilterFactory" />
:       ...
:
: In mappings.txt (2 lines):
:
:   "<![CDATA[" => ""
:   "]]>" => ""
:
: Restart Solr.
:
: It works.
:
: Thank you

-Hoss
Re: HTMLStripTransformer will remove the content in XML??
I would expect that it doesn't understand CDATA and thinks of everything between < and > as a 'tag'.

Best Regards,
Bryan Rasmussen

On Fri, May 27, 2011 at 9:41 AM, Ellery Leung elleryle...@be-o.com wrote:
> I have an XML string like this:
>
> <?xml version="1.0" encoding="UTF-8"?><language><intl><![CDATA[hello]]></intl><loc><![CDATA[solr ]]></loc></language>
>
> By using HTMLStripTransformer, I expect to get 'hello,solr'.
>
> But actually this transformer removes ALL THE TEXT INSIDE!
>
> Did I do something silly, or is it a bug?
>
> Thank you
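To illustrate the guess above, here is a minimal Python sketch of a stripper that treats everything between < and > as a tag. This is not Solr's HTMLStripTransformer code; the regex and the name naive_strip are assumptions made up for the illustration. It only shows why such a rule would swallow CDATA content: the first > after <![CDATA[ is the one that closes the section, so the text inside counts as part of the "tag".

```python
import re

# Hypothetical stand-in for a stripper with no CDATA awareness:
# it deletes every run from "<" up to the next ">".
NAIVE_TAG = re.compile(r"<[^>]*>")


def naive_strip(text: str) -> str:
    """Remove anything that looks like a tag under the naive rule."""
    return NAIVE_TAG.sub("", text)


# The XML string from the original question.
doc = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<language><intl><![CDATA[hello]]></intl>"
    "<loc><![CDATA[solr ]]></loc></language>"
)

# Each CDATA section matches as one "tag", so nothing is left over.
print(repr(naive_strip(doc)))  # prints ''
```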
RE: HTMLStripTransformer will remove the content in XML??
Got it. Actually, I use solr.MappingCharFilterFactory to replace <![CDATA[ and ]]> with empty strings first, and then use HTMLStripCharFilterFactory to get hello and solr.

For future reference, here is part of schema.xml:

  <fieldType name="textMaxWord" class="solr.TextField">
    <analyzer type="index">
      <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
      <charFilter class="solr.HTMLStripCharFilterFactory" />
      ...

In mappings.txt (2 lines):

  "<![CDATA[" => ""
  "]]>" => ""

Restart Solr.

It works.

Thank you

-----Original Message-----
From: bryan rasmussen [mailto:rasmussen.br...@gmail.com]
Sent: 27 May 2011, 4:20 PM
To: solr-user@lucene.apache.org; elleryle...@be-o.com
Subject: Re: HTMLStripTransformer will remove the content in XML??

I would expect that it doesn't understand CDATA and thinks of everything between < and > as a 'tag'.

Best Regards,
Bryan Rasmussen
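The order of the two char filters is what makes this work: the CDATA markers are removed before the tag stripper sees the text. A rough Python sketch of that two-step idea follows. It does not call Solr or Lucene; the regex, the function names, and the whitespace split at the end are assumptions standing in for the real char filters and tokenizer.

```python
import re

# The two rules from mappings.txt, as source -> replacement.
MAPPINGS = {"<![CDATA[": "", "]]>": ""}

# Simplified stand-in for a tag stripper (assumption: a real HTML
# stripper is more involved than this regex).
TAG = re.compile(r"<[^>]*>")


def apply_mappings(text: str) -> str:
    """Step 1: drop the CDATA open and close markers."""
    for source, target in MAPPINGS.items():
        text = text.replace(source, target)
    return text


def strip_tags(text: str) -> str:
    """Step 2: replace each remaining tag with a space."""
    return TAG.sub(" ", text)


def index_tokens(text: str) -> list:
    """Run both steps in order, then split on whitespace."""
    return strip_tags(apply_mappings(text)).split()


doc = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<language><intl><![CDATA[hello]]></intl>"
    "<loc><![CDATA[solr ]]></loc></language>"
)

print(index_tokens(doc))  # prints ['hello', 'solr']
```

Running the tag stripper first would not give the same result in this sketch, because each whole CDATA section would already have been removed as a single tag.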