There is a nice HTML stripper built into Solr: "solr.HTMLStripStandardTokenizerFactory".
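
A minimal sketch of how that tokenizer might be wired into schema.xml, in case it helps. The field type name "text_html" and the lowercase filter are assumptions for illustration only, not something from this thread:

```xml
<!-- Hypothetical field type; only the tokenizer class comes from the message above -->
<fieldType name="text_html" class="solr.TextField">
  <analyzer>
    <!-- Strips HTML markup, then tokenizes like the standard tokenizer -->
    <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that analysis only affects the indexed terms. The stored value of the field would still contain the original HTML, so if the tags must be removed from what is returned in search results, stripping at import time is still needed.
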
-----Original Message-----
From: Ahmed Hammad [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 05, 2008 10:43 AM
To: solr-user@lucene.apache.org
Subject: Re: Regex Transformer Error

Hi,

It works with the attribute regex="&lt;(.|\n)*?&gt;"

Sorry for the disturbance.

Regards,
ahmd

On Wed, Nov 5, 2008 at 8:18 PM, Ahmed Hammad <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I am using the Solr 1.3 data import handler. One of my table fields
> contains HTML tags, and I want to strip them from the field text. So
> obviously I need the Regex Transformer.
>
> I added the transformer="RegexTransformer" attribute to my entity and a
> new field with:
>
> <field sourceColName="content" column="content" regex="English"
> replaceWith="XXXXX"/>
>
> Everything works fine. The text is replaced without any problem. The
> problem happens with my regular expression to strip HTML tags. So I
> use regex="<(.|\n)*?>". Of course the characters '<' and '>' are not
> allowed in XML. I tried the following regex="&lt;(.|\n)*?&gt;" and
> regex="&#x3C;(.|\n)*?&#x3E;" but I get the following error:
>
> The value of attribute "regex" associated with an element type "field"
> must not contain the '<' character. at
> com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source) ...
>
> The full stack trace follows:
>
> *FATAL: Could not create importer. DataImporter config invalid
> org.apache.solr.common.SolrException: FATAL: Could not create importer. DataImporter config invalid
>   at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:114)
>   at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:206)
>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
>   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
>   at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>   at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>   at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>   at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>   at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
>   at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>   at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>   at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
>   at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:857)
>   at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:565)
>   at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1509)
>   at java.lang.Thread.run(Unknown Source)
> Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Exception occurred while initializing context Processing Document #
>   at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:176)
>   at org.apache.solr.handler.dataimport.DataImporter.<init>(DataImporter.java:93)
>   at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:106)
>   ... 17 more
> Caused by: org.xml.sax.SAXParseException: The value of attribute "regex" associated with an element type "field" must not contain the '<' character.
>   at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
>   at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>   at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:166)
>   ... 19 more *
>
> *description* *The server encountered an internal error (FATAL: Could
> not create importer. DataImporter config invalid
> org.apache.solr.common.SolrException: FATAL: Could not create importer. DataImporter config invalid
>   at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:114)
>   at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:206)
>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
>   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
>   at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>   at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>   at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>   at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>   at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
>   at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>   at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>   at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
>   at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:857)
>   at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:565)
>   at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1509)
>   at java.lang.Thread.run(Unknown Source)
> Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Exception occurred while initializing context Processing Document #
>   at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:176)
>   at org.apache.solr.handler.dataimport.DataImporter.<init>(DataImporter.java:93)
>   at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:106)
>   ... 17 more
> Caused by: org.xml.sax.SAXParseException: The value of attribute "regex" associated with an element type "field" must not contain the '<' character.
>   at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
>   at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>   at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:166)
>   ... 19 more ) that prevented it from fulfilling this request.*
>
> I appreciate your help.
>
> Regards,
> ahmd
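
For reference, a rough sketch of a data-config.xml fragment along the lines discussed in the quoted message. It assumes the fix was to write the angle brackets as XML entities, since a literal '<' inside an attribute value is what the SAXParseException complains about. The entity name, the query, and the single-space replaceWith are made up for illustration, and the dataSource definition is left out:

```xml
<dataConfig>
  <!-- dataSource element omitted; it depends on your database -->
  <document>
    <!-- "item" and the SQL query are hypothetical placeholders -->
    <entity name="item" transformer="RegexTransformer"
            query="SELECT id, content FROM item">
      <!-- &lt; and &gt; are decoded by the XML parser, so the
           transformer receives the pattern <(.|\n)*?> -->
      <field column="content" sourceColName="content"
             regex="&lt;(.|\n)*?&gt;" replaceWith=" "/>
    </entity>
  </document>
</dataConfig>
```

Replacing each tag with a space rather than nothing is just one choice; it keeps words on either side of a tag from running together.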