On Tue, Nov 18, 2008 at 2:49 AM, Ahmed Hammad <[EMAIL PROTECTED]> wrote:
> Hi All,
>
> Although HTMLStripStandardTokenizerFactory removes HTML tags, they are
> still stored in the index and have to be removed at search time. In my
> case the HTML tags are not needed at all, so I created an
> HTMLStripTransformer for the DIH to remove the HTML tags and save space in
> the index. I used the HTML parser included with Lucene
> (org.apache.lucene.demo.html). It performs well and worked for me when I
> was using Lucene, before moving to Solr.
>
> What do you think? Is it worth contributing?

Yes. You can contribute this new transformer as an enhancement.

>
> My best wishes,
>
> Regards,
> Ahmed
>
> On Thu, Nov 6, 2008 at 2:39 AM, Norskog, Lance <[EMAIL PROTECTED]> wrote:
>
>> There is a nice HTML stripper inside Solr.
>> "solr.HTMLStripStandardTokenizerFactory"
>>
>> -----Original Message-----
>> From: Ahmed Hammad [mailto:[EMAIL PROTECTED]
>> Sent: Wednesday, November 05, 2008 10:43 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Regex Transformer Error
>>
>> Hi,
>>
>> It works with the attribute regex="&lt;(.|\n)*?&gt;"
>>
>> Sorry for the disturbance.
>>
>> Regards,
>>
>> ahmd
>>
>>
>> On Wed, Nov 5, 2008 at 8:18 PM, Ahmed Hammad <[EMAIL PROTECTED]> wrote:
>>
>> > Hi,
>> >
>> > I am using the Solr 1.3 data import handler. One of my table fields
>> > contains HTML tags, and I want to strip them from the field text. So
>> > obviously I need the Regex Transformer.
>> >
>> > I added the transformer="RegexTransformer" attribute to my entity and a
>> > new field with:
>> >
>> > <field sourceColName="content" column="content" regex="English"
>> > replaceWith="XXXXX"/>
>> >
>> > Everything works fine. The text is replaced without any problem. The
>> > problem happened with my regular expression to strip HTML tags. So I
>> > use regex="<(.|\n)*?>". Of course the characters '<' and '>' are not
>> > allowed in XML.
>> > I tried the following regex="&lt;(.|\n)*?&gt;" and
>> > regex="&#x3C;(.|\n)*?&#x3E;" but I get the following error:
>> >
>> > The value of attribute "regex" associated with an element type "field"
>> > must not contain the '<' character.
>> >     at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source) ...
>> >
>> > The full stack trace is following:
>> >
>> > *FATAL: Could not create importer. DataImporter config invalid
>> > org.apache.solr.common.SolrException: FATAL: Could not create importer. DataImporter config invalid
>> >     at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:114)
>> >     at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:206)
>> >     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>> >     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
>> >     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
>> >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
>> >     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>> >     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>> >     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>> >     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>> >     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
>> >     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>> >     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>> >     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
>> >     at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:857)
>> >     at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:565)
>> >     at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1509)
>> >     at java.lang.Thread.run(Unknown Source)
>> > Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Exception occurred while initializing context Processing Document #
>> >     at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:176)
>> >     at org.apache.solr.handler.dataimport.DataImporter.<init>(DataImporter.java:93)
>> >     at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:106)
>> >     ... 17 more
>> > Caused by: org.xml.sax.SAXParseException: The value of attribute "regex" associated with an element type "field" must not contain the '<' character.
>> >     at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
>> >     at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>> >     at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:166)
>> >     ... 19 more *
>> >
>> > *description* *The server encountered an internal error (FATAL: Could
>> > not create importer. DataImporter config invalid
>> > org.apache.solr.common.SolrException: FATAL: Could not create importer. DataImporter config invalid
>> >     at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:114)
>> >     at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:206)
>> >     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>> >     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
>> >     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
>> >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
>> >     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>> >     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>> >     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>> >     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>> >     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
>> >     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>> >     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>> >     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
>> >     at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:857)
>> >     at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:565)
>> >     at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1509)
>> >     at java.lang.Thread.run(Unknown Source)
>> > Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Exception occurred while initializing context Processing Document #
>> >     at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:176)
>> >     at org.apache.solr.handler.dataimport.DataImporter.<init>(DataImporter.java:93)
>> >     at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:106)
>> >     ... 17 more
>> > Caused by: org.xml.sax.SAXParseException: The value of attribute "regex" associated with an element type "field" must not contain the '<' character.
>> >     at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
>> >     at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>> >     at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:166)
>> >     ... 19 more ) that prevented it from fulfilling this
>> > request.*
>> >
>> > I appreciate your help.
>> >
>> > Regards,
>> > ahmd
>> >
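For anyone hitting the same SAXParseException: the XML parser rejects a literal '<' inside an attribute value, so it has to be written as &lt; in the config file ('>' may be escaped as &gt; too, although XML does not require that). The parser hands the unescaped pattern to the application. The following is only a minimal sketch using the JDK's own XML parser and java.util.regex, not DIH itself; the class name, the config fragment, and the sample HTML are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class RegexAttributeDemo {

    // Hypothetical config fragment: '<' is written as &lt; so the XML stays well-formed.
    public static final String CONFIG =
        "<field column=\"content\" regex=\"&lt;(.|\\n)*?&gt;\" replaceWith=\"\"/>";

    /** Parses the fragment and returns the regex attribute as the application sees it. */
    public static String readRegexAttribute(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getAttribute("regex");
        } catch (Exception e) {
            throw new IllegalStateException("could not parse config fragment", e);
        }
    }

    /** Removes everything matched by the pattern, e.g. HTML tags. */
    public static String stripTags(String regex, String html) {
        return Pattern.compile(regex).matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String regex = readRegexAttribute(CONFIG);
        System.out.println(regex);                                        // <(.|\n)*?>
        System.out.println(stripTags(regex, "<p>Hello <b>Solr</b></p>")); // Hello Solr
    }
}
```

The pattern is reluctant (*?), so each match stops at the first '>' and the text between tags is kept.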
--
--Noble Paul
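
A rough sketch of what a tag-stripping row transformer could look like, for readers following the HTMLStripTransformer idea. This is not Ahmed's implementation: his used the Lucene demo HTML parser, while this sketch swaps in the simple regex from the thread, which will not handle comments, scripts, or entities. The class name and the fixed "content" column are assumptions for illustration. As far as I understand the DIH documentation, a custom transformer either extends org.apache.solr.handler.dataimport.Transformer or just exposes a transformRow(Map<String, Object>) method; the plain-method form is used here so the example compiles without Solr on the classpath.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class HtmlStripTransformerSketch {

    // Same pattern as in the thread: reluctant match from '<' to the next '>'.
    private static final Pattern TAG = Pattern.compile("<(.|\\n)*?>");

    // Hypothetical column name, taken from the example config in the thread.
    private static final String COLUMN = "content";

    /** Strips tags from the "content" column, leaving other columns untouched. */
    public Object transformRow(Map<String, Object> row) {
        Object value = row.get(COLUMN);
        if (value instanceof String) {
            row.put(COLUMN, TAG.matcher((String) value).replaceAll(""));
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("id", 7);
        row.put("content", "<div>Plain <i>text</i></div>");
        new HtmlStripTransformerSketch().transformRow(row);
        System.out.println(row.get("content")); // Plain text
    }
}
```

Stripping at import time means the stored value is already plain text, which is the space saving Ahmed describes.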